Detect JavaScript within a page loaded in PhantomJS - javascript

I'm using PhantomJS as a crawler; if there is no JS in a page I can assume that it's completely loaded when onLoadFinished fires, but if there is JS in a page, I need to wait a bit to give the scripts a chance to do stuff. This is my current stab at detecting JS:
var pageHasJS = page.evaluate(function () {
    return (document.getElementsByTagName("script").length > 0 ||
            document.evaluate("count(//@*[starts-with(name(), 'on')])",
                              document, null, XPathResult.NUMBER_TYPE,
                              null).numberValue > 0);
});
This looks for <script> tags and for elements with an onsomething attribute.
Q1: Is there any other HTML construct that can sneak JS into a page? javascript: URLs do not count, because nothing will ever get clicked.
Q2: Is there a better way to do the second test? I believe it is not possible to do that with querySelector, hence resorting to XPath, but maybe there is some other feature that would accomplish the same task.
Q3: The crawler does not interact with the page once it is loaded. The onload event is the only legacy event attribute that I know of that fires in the absence of user interaction. Are there any others? In other words, would it be safe to replace the second test with document.evaluate("count(//@onload)", ...) or maybe even !!document.body.getAttribute("onload")?

Instead of checking for script tags and waiting a fixed amount of time, you can intercept the actual HTTP requests (take a look at onResourceRequested / onResourceReceived) and take the screenshot after all resources have been loaded. Take a look at ajax-render.
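As a rough illustration of the resource-tracking idea above (this is not the ajax-render code), the sketch below counts in-flight requests and calls back once nothing has been pending for a quiet period. The names createIdleTracker, quietMs and timers are made up for this example, and the PhantomJS wiring in the trailing comment assumes a page object created with require('webpage').create().

```javascript
// Tracks in-flight requests and reports when the page has gone quiet.
// onIdle  : called once no request has been pending for quietMs milliseconds
// timers  : optional { set, clear } pair, injectable so the logic can be tested
function createIdleTracker(onIdle, quietMs, timers) {
  timers = timers || {
    set: function (fn, ms) { return setTimeout(fn, ms); },
    clear: function (id) { clearTimeout(id); }
  };
  var pending = {};   // request id -> true while the request is in flight
  var count = 0;      // number of requests currently in flight
  var timer = null;   // handle of the scheduled onIdle call, if any

  function reschedule() {
    if (timer !== null) {
      timers.clear(timer);
      timer = null;
    }
    // Only start the quiet-period countdown when nothing is outstanding.
    if (count === 0) {
      timer = timers.set(onIdle, quietMs);
    }
  }

  return {
    requested: function (id) {
      if (!pending[id]) { pending[id] = true; count++; }
      reschedule();
    },
    finished: function (id) {
      if (pending[id]) { delete pending[id]; count--; }
      reschedule();
    }
  };
}

// Possible PhantomJS wiring (assumption: `page` is a webpage instance):
// var tracker = createIdleTracker(function () { /* page is quiet, scrape it */ }, 500);
// page.onResourceRequested = function (req) { tracker.requested(req.id); };
// page.onResourceReceived = function (res) {
//   if (res.stage === 'end') tracker.finished(res.id);
// };
// page.onResourceError = function (err) { tracker.finished(err.id); };
```

The quiet period is a judgment call: scripts that start new requests from timers can still slip in after it expires, so this reduces the guesswork rather than removing it.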

Related

JQuery: Detect when element and all children are loaded

I'd like to have code executed when a particular element and all of its children are in the DOM. I know how to poll for the desired element's existence, or even better, to use a MutationObserver, but the desired element is itself rather large and I can't be sure that all of its children are fully loaded simply based on it existing.
I could wait for ready, which is called when the whole DOM is loaded, but the page usually takes a rather long time to load. In the interests of speed I'd like to know without necessarily waiting for $(document).ready().
I did find the on function, I love the fact that it will be called for elements which don't even exist yet:
$(document).on('SomeEvent', '#desiredElem', handler);
...however I don't know of an event which is fired for an html element being fully in the DOM.
My script is being injected into the browser, and I know from logging that it's running a long time before $(document).ready() or DOMContentLoaded. Basically I'd like to take advantage of that. I can't add <script> tags to the HTML, unfortunately.
On a side note, an event for an object existing would be interesting. It would save me from having to use MutationObserver.
Since you've said that what you're trying to discern is when an element and all of its children that are present in the source HTML of the page are loaded, there are a couple things you can do:
You can test for the presence of any known element after the last child. Since HTML elements are loaded serially in order, if the element after the last child is present, then everything before it must already be in the DOM because page HTML is inserted only in the order it appears in the page HTML source.
If you put a <script> tag after the relevant HTML in the page, then all HTML before that <script> tag will already be in the DOM when that script tag runs.
You can either poll or use a mutation observer, but chances are this won't really help you because no scripts run while the DOM parser is in the middle of inserting a bunch of HTML into the page. So, your scripts or mutation events would only run after the whole page DOM was inserted anyway or when the page pauses in order to load some other inline resource such as a <script> tag.
You can fall back to the DOMContentLoaded event, which will tell you when the whole DOM is loaded.
For more concrete help, we need to understand much more about your specific situation including an example of the HTML you're trying to watch for and a full understanding of exactly what constraints you have (what you can modify in the page source and where in the page you can insert or run scripts).
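To make the first suggestion more concrete, here is a minimal sketch. It generalizes "an element after the last child" to "anything after the element or after one of its ancestors", which is my own assumption about how to apply the idea when you don't know what follows the element. The function name isParsedPast is made up, and the check ignores edge cases where the parser relocates nodes (for example misnested table content).

```javascript
// Returns true when the parser has certainly moved past `el`, which means
// all of el's children from the page source are already in the DOM.
// el  : the element of interest (anything with parentNode / nextSibling)
// doc : the owning document (anything with a readyState string)
function isParsedPast(el, doc) {
  // Once the document has left the "loading" state, parsing is finished.
  if (doc.readyState !== 'loading') {
    return true;
  }
  // While parsing, a node only gets a next sibling after it has been closed.
  // The same holds for every ancestor: if an ancestor has been closed,
  // everything inside it, including el, is complete.
  for (var node = el; node; node = node.parentNode) {
    if (node.nextSibling) {
      return true;
    }
  }
  return false;
}

// Possible use from a poll or a MutationObserver callback:
// var el = document.querySelector('#desiredElem');
// if (el && isParsedPast(el, document)) { /* safe to process el */ }
```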
You need to set up a timer and keep polling the DOM to check whether the element exists yet:
function checkIfLoaded( callBack, elementSelector )
{
    setTimeout( function(){
        if ( $( elementSelector ).length == 0 )
        {
            //continue checking
            checkIfLoaded( callBack, elementSelector );
        }
        else
        {
            callBack();
        }
    }, 1000 );
}
checkIfLoaded( function(){ alert( "div loaded now" ) }, "#divId" );

Audio duration NaN on certain page request action

I have been trying to create my custom media player using HTML5 and Jquery.
I have followed different approaches and ran into some trouble based on my way of refreshing the page.
First Case
$(document).ready(function(){
duration = Math.ceil($('audio')[0].duration);
$('#duration').html(duration);
});
In this case, the duration returns NaN when I redirect the page to the same URL by pressing the ENTER key in the address bar. However, it works completely fine when I refresh using the reload button or by pressing the F5 button.
Second Case
I read in some answers that reading the duration after the loadedmetadata event might help. So I tried the following:
$(document).ready(function(){
$('audio').on('loadedmetadata', function(){
duration = Math.ceil($('audio')[0].duration);
$('#duration').html(duration);
});
});
Surprisingly, in this case, the inverse of the first case happened. The duration gets displayed completely fine in the case of a redirect, i.e., pressing ENTER while in the address bar. However, in the case of refreshing using the F5 button or the reload button, the duration doesn't get displayed at all, not even NaN which led me to believe that the code doesn't get executed at all.
Further reading suggested this might be a bug within the webkit browsers but I couldn't find anything conclusive or helpful.
What could be the cause behind this peculiar behavior?
It'd be great if you could explain it along with the solution to this problem.
Edit:
I am mainly looking for an explanation behind this difference in behavior. I would like to understand the mechanism behind rendering a page in the case of redirect and refresh.
It sounds like the problem is that the event handler is set too late, i.e. the audio file has loaded its metadata before the document is even ready.
Try setting the event handler as soon as possible by removing the $(document).ready call:
$('audio').on('loadedmetadata', function(){
duration = Math.ceil($('audio')[0].duration);
$('#duration').html(duration);
});
Note that this requires that the <script> tag be after the <audio> tag in the document.
Alternatively, you can tweak your logic slightly, so that the code that updates the duration always runs (but fails gracefully if it gets a NaN):
function updateDuration() {
var duration = Math.ceil($('audio')[0].duration);
if (duration)
$('#duration').html(duration);
}
$(document).ready(function(){
$('audio').on('loadedmetadata', updateDuration);
updateDuration();
});
Lovely code examples and stuff from people - but the explanation is actually very simple.
If the file is already in the cache then the loadedmetadata event will not fire (nor will a number of other events - basically because they've already fired by the time you attach your listeners) and the duration will be set. If it's not in the cache then the duration will be NaN, and the event will fire.
The solution is sort of simple.
function runWhenLoaded() { /* read duration etc, this = audio element */ }
if (!audio.readyState) { // or $audio[0].readyState
audio.addEventListener("loadedmetadata", runWhenLoaded);
// or $audio.on("loadedmetadata", runWhenLoaded);
} else {
runWhenLoaded.call(audio);
// or runWhenLoaded.call($audio[0]);
}
I've included the jQuery alternatives in the code comments.
According to the W3C spec, it is standard behavior for duration to return NaN while the media's metadata has not been loaded yet.
So I suggest using the durationchange event:
$('audio').on('durationchange', function(){
var duration = $('audio')[0].duration;
if(!isNaN(duration)) {
$('#duration').html(Math.ceil(duration));
}
});
NOTE: This code (and yours too) will not work correctly if you have more than one audio element on the page. The reason is that you are listening for events from all audio elements on the page, and each element fires its own event:
$('audio').on('durationchange', function(){...});
OR
You can try:
<script>
function durationchange() {
var duration = $('audio')[0].duration;
if(!isNaN(duration)) {
$('#duration').html(Math.ceil(duration));
}
}
</script>
<audio ondurationchange="durationchange()">
<source src="test.mp3" type="audio/mpeg">
</audio>
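On the point about several audio elements: one possible way around it is to read the duration from the element that fired the event instead of always from $('audio')[0]. The sketch below is only an illustration; readableDuration is a made-up helper name, and the data-duration-target attribute in the commented wiring is a hypothetical convention for pointing each audio element at its own display element.

```javascript
// Returns the duration rounded up to whole seconds, or null while it is
// unknown (NaN before metadata is loaded, Infinity for unbounded streams).
function readableDuration(audio) {
  var d = audio.duration;
  return (typeof d === 'number' && isFinite(d)) ? Math.ceil(d) : null;
}

// Possible jQuery wiring (assumption: each <audio> carries a hypothetical
// data-duration-target attribute holding a selector such as "#duration"):
// $('audio').on('durationchange', function () {
//   var secs = readableDuration(this);   // `this` is the element that fired
//   if (secs !== null) {
//     $($(this).data('durationTarget')).text(secs);
//   }
// });
```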
Note that behavior differs from one browser to another. Chrome loads resources in different ways. When a resource is not in the cache, Chrome fetches either the complete file (for JS or CSS, for example) or only part of the file (for an MP3, for example). That partial file contains the metadata that lets the browser determine the duration and other data, such as the time it will take to download the whole file at the current rate, which in turn triggers events such as canplay or canplaythrough. If you look at the network usage in your dev console, you'll see that the HTTP status code is either 200 (successful load) or 206 (partial load, for an MP3 for example).
When you hit refresh, resources are checked to see whether they have changed. The HTTP status will then be 304, meaning the file hasn't been modified. If it hasn't changed and is still in the browser cache, it won't be downloaded again. The answer to whether it has changed comes from the server providing the file.
When you simply press Enter in the address bar, the file is taken straight from the cache without being validated online, so it's much faster.
So depending on how you load or refresh your page (a simple Enter, a refresh, or a complete refresh without cache), there will be big differences in the moment at which you get the metadata from your MP3. Between taking the metadata directly from the cache and making a request to a server, the difference can be a few hundred milliseconds, which is enough to change what data is available at a given moment.
That being said, listening for loadedmetadata should give consistent results. This event is triggered when the data carrying the duration information is loaded, so the way the page is loaded shouldn't matter as long as the listener is attached properly. At this point you have to consider possible interference from other elements. What you should do is follow your audio through its various events to see exactly where it is at different moments. So in your document ready you could add listeners for different events and see where the problem occurs. Like this:
$('audio')[0].addEventListener('loadstart', check_event)
$('audio')[0].addEventListener('loadeddata', check_event)
$('audio')[0].addEventListener('loadedmetadata', check_event)//at this point you should be able to call duration
$('audio')[0].addEventListener('canplay', check_event) //and so on
function check_event(e) {
console.log(e.target, e.type)
}
You'll see that depending on the way you refresh, these events can come at different moments, maybe explaining inconsistencies in your outputs.

Chrome Extension Javascript/jQuery load page and check if changed

I'm looking for a way to load a website and then check after 1 min or so whether the content has changed, if not, repeat. This is because the website I'm trying to get content from contains javascript for loading the div I need. I thought of using some kind of iFrame, but I have no idea where to start and Google isn't helping me.
Edit
This is the code I'm running with atm and scrapUrl is a defined url so don't worry about it:
var iframe = document.body.appendChild(document.createElement('iframe'));
iframe.src = scrapUrl;
$(iframe).ready(function() {
$(iframe).load(function() {
alert('loaded');
alert($(iframe).contents().find('div#description').html());
});
});
It outputs "loaded" and after that "undefined"
So you're doing a lazy load of content in a div, and you want to know when that div has loaded? Depending how you're doing it, you'd be better to set a flag and react to the AJAX "load" event associated with that lazy load.
If you must do it the way you suggest, try this:
Create an interval (setInterval) that checks the load status or the contents of the div. If the check is false, do nothing; if it is true, call clearInterval.
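The interval approach above could look roughly like this. It is a sketch only: watchForChange is a made-up name, and the commented usage assumes your script is actually allowed to read the iframe's document (which is not the case for an ordinary cross-origin iframe).

```javascript
// Polls read() every intervalMs milliseconds and calls onChange(now, before)
// the first time the value differs from the one seen at start-up.
// Returns a function that cancels the polling.
function watchForChange(read, onChange, intervalMs) {
  var before = read();
  var id = setInterval(function () {
    var now = read();
    if (now !== before) {
      clearInterval(id);        // stop polling once a change is seen
      onChange(now, before);
    }
    // otherwise do nothing and wait for the next tick
  }, intervalMs);
  return function stop() {
    clearInterval(id);
  };
}

// Possible usage with the iframe from the question (assumption: readable iframe):
// watchForChange(
//   function () { return $(iframe).contents().find('div#description').html(); },
//   function (html) { alert(html); },
//   1000
// );
```

Because the first read typically returns undefined while the div does not exist yet, the callback fires as soon as the page's own JavaScript has inserted it.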

Stop jQuery evaluating scripts that have already been executed

I have a history API script that loads new page content without the need for a page refresh. I have run into a problem with inline scripts: they are evaluated by jQuery even if they have already been evaluated before. So, for example, if someone revisits a page with an inline script, that script will be executed each time they revisit. This causes problems because, say, if a DOM element is added in a script, then that element will be added several times if they have visited that page several times.
For reasons I won't go into I cannot put these inline scripts into an external file and load them that way.
Here's the code that deals with the scripts:
dom.filter('script').each(function(){//function to allow inline javascript, has to be after page fadeIn incase scripts reference page DOM
$.globalEval(this.text || this.textContent || this.innerHTML || '');
var script_src = ($(this).attr('src'));
if (script_src === 'AJAX/request_feed.js' || script_src === 'js/profile.js'){
$(window).unbind('scroll');
$.getScript(script_src);
}
});
Should you require any more parts of the whole history script, just ask. I don't think they're required though.
Note: The if clause is there for a scroll loader I have. I have two scroll loaders, so the scroll event needs to be unbound and rebound each time. No need to worry about that though.
You could store your executed scripts in a cookie, then check the cookie before executing. One way or another, you'll need some way of keeping track of what has been executed and what hasn't been.
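If a cookie feels heavy, the same bookkeeping can be kept in memory, since the page itself is never reloaded by the history script. The sketch below is one way to do that under that assumption; createScriptRunner and runOnce are made-up names, and keying on the script text is my own choice (two different pages with identical inline scripts would count as the same script).

```javascript
// Wraps an evaluator so that each distinct piece of script text runs once.
// evaluate : function that executes a string of code (e.g. via $.globalEval)
function createScriptRunner(evaluate) {
  // Object.create(null) avoids clashes with inherited keys like "constructor".
  var seen = Object.create(null);
  return function runOnce(code) {
    if (!code || seen[code]) {
      return false;           // empty, or already executed earlier
    }
    seen[code] = true;        // remember it before running
    evaluate(code);
    return true;
  };
}

// Possible wiring into the loop from the question:
// var runOnce = createScriptRunner(function (code) { $.globalEval(code); });
// dom.filter('script').each(function () {
//   runOnce(this.text || this.textContent || this.innerHTML || '');
// });
```

The registry is lost on a real page reload, which is usually what you want, because a reload also throws away whatever the scripts had added to the DOM.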
Alternate Suggestion
You could tweak your scripts to be self regulating:
myscript.js:
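(function () {
    // window.scriptRegistry is a hypothetical global that survives the content swaps
    var registry = window.scriptRegistry = window.scriptRegistry || {};
    if (registry.myscript) return; // this script has already been launched
    registry.myscript = true; // Register this script as launched
    alert('Do stuff..');
})();
Note: The above is only a sketch; scriptRegistry is an illustrative name, and any object that outlives the content swap will do.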

onHashChange running onLoad... awkward

So I'd like my page to load content if a window's hash has changed.
Using Mootools, this is pretty easy:
$extend(Element.NativeEvents, {
hashchange: 1
});
and then:
window.addEvent('hashchange', function() {});
However, the hashchange event is firing when the page is being loaded, even though the specification requires it not to fire until the page load is complete!
Unless I am loading the page for the first time, with no hash, then all works as expected.
I think the problem here is the fact that the browser considers the page load "complete", and then runs the rest of the JavaScript, which includes hash detection to load the requisite page.
For example, if I typed in http://foo.bar/, all would work fine. However, http://foo.bar/#test would, ideally, load the initial page, detect the hash, and load the "test" content.
Unfortunately, the browser loads the initial page, considers it "domready", and THEN loads the "test" content, which would then fire onHashChange. Oops?
This causes an infinite loop, unless I specifically ask the browser NOT to update the hash if an onHashChange event is firing. That's easy:
var noHashChange;
noHashChange = true;
var hashes = window.location.hash.substr(1).split("/"); // Deciphers the hash, in this case, hashes[0] is "test"
selectContent(hashes[0]); // Here, selectContent would read noHashChange, and wouldn't update the hash
noHashChange = false;
So now, updating the hash AFTER the page has loaded will work properly. Except it still goes nuts on an initial page load and fetches the content about 3 or 4 times, because it keeps detecting the hash has changed. Messy.
I think it may have something to do with how I am setting the hash, but I can't think of a better way to do so except:
window.location.hash = foobar;
... inside of a function that is run whenever new content is selected.
Therein lies the problem, yes? The page is loaded, THEN the content is loaded (if there is content)...
I hope I've been coherent...
Perhaps you could check the hash first to eliminate the recursion:
if(window.location.hash != foobar){ window.location.hash = foobar;}
Why is the onHashChange handler changing the hash anyway? If there's some default that it's selecting first before loading the content, then perhaps that could go in a separate function.
(I say this because it looks like you've some sort of directory structure-esque convention to your location.hash'es, perhaps you're selecting a specific leaf of a tree when the root is selected or something?)
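One detail worth watching with the guard suggested above: window.location.hash includes the leading "#", so a direct comparison with a bare value like "test" never matches. A small sketch of a guard that normalizes both sides follows; setHashIfDifferent is a made-up name, and it takes the location object as a parameter purely so it can be tried outside a browser.

```javascript
// Sets loc.hash to value only when that would actually change it.
// Returns true if the hash was written, false if it was already equal.
function setHashIfDifferent(loc, value) {
  var current = String(loc.hash).replace(/^#/, '');
  var wanted = String(value).replace(/^#/, '');
  if (current === wanted) {
    return false;   // same hash: skip the write, so no hashchange fires
  }
  loc.hash = wanted;
  return true;
}

// In the page: setHashIfDifferent(window.location, foobar);
```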
You could implement an observer for the hash object that triggers a function whenever the hash changes. It has nothing to do with the actual loading of the page.
The best way to do this is via Object.prototype.watch (note that this method is non-standard and was only ever available in Firefox).
See other pages on the same topic: On - window.location.hash - Change?
Have a look at MooTools History; it implements onhashchange if the new HTML5 history API isn't available, so there's no need to reinvent the wheel :)
