PhantomJS not mimicking browser behavior when looking at YouTube videos - javascript

I posted this question to the PhantomJS mailing list a week ago, but have gotten no response. Hoping for better luck here...
I've been trying to use PhantomJS to scrape information from YouTube, but haven't been able to get it working.
Consider a YouTube video embedded into a web page via an iframe element. If you load the URL referenced by the src attribute directly into a browser, you get a full-page version of the video, where the video is encapsulated in an embed element. The embed element is not present in the initial page content; rather, some script tags on the page cause some Javascript to be evaluated which eventually adds the embed element to the DOM. I want to be able to access this embed element when it appears, but it never appears when I load the page in PhantomJS.
Here's the code I'm using:
var page = require("webpage").create();
page.settings.userAgent = "Mozilla/5.0 (X11; rv:24.0) Gecko/20130909 Firefox/24.0";
page.open("https://www.youtube.com/embed/dQw4w9WgXcQ", function (status) {
    if (status !== "success") {
        console.log("Failed to load page");
        phantom.exit();
    } else {
        setTimeout(function () {
            var size = page.evaluate(function () {
                return document.getElementsByTagName("EMBED").length;
            });
            console.log(size);
            phantom.exit();
        }, 15000);
    }
});
I only ever see "0" printed to the console, no matter how long I set the timeout. If I look for "DIV" elements I get "3", and if I look for "SCRIPT" elements I get "5", so the code seems to be sound. I just never find any "EMBED" tags, even though if I load the URL above in my browser I do find one soon after page-load.
Does anyone have any idea what the problem might be? Thanks in advance for any help.

Patrick's answer got me on the right track, but the full story is as follows.
YouTube's JavaScript probes the browser's capabilities before deciding whether to create some kind of video element. After trawling through the minified code, I was eventually able to fool YouTube into thinking PhantomJS supported HTML5 video by wrapping document.createElement in the page's onInitialized callback.
page.onInitialized = function () {
    page.evaluate(function () {
        var create = document.createElement;
        document.createElement = function (tag) {
            var elem = create.call(document, tag);
            if (tag === "video") {
                elem.canPlayType = function () { return "probably"; };
            }
            return elem;
        };
    });
};
However, this was a misstep; to get the <embed> tag I was originally after, I needed to make YouTube's code think PhantomJS supports Flash, not HTML5 video. That's also doable:
page.onInitialized = function () {
    page.evaluate(function () {
        window.navigator = {
            plugins: { "Shockwave Flash": { description: "Shockwave Flash 11.2 e202" } },
            mimeTypes: { "application/x-shockwave-flash": { enabledPlugin: true } }
        };
    });
};
So that's how it's done.
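To see why replacing window.navigator has this effect, here is a minimal sketch of the kind of plugin probe a page script might run. The function name detectFlash and its logic are assumptions made for illustration; YouTube's real (minified) detection code is not reproduced here and may check other things as well.

```javascript
// Hypothetical capability probe, similar in spirit to what a page script
// might do. This is NOT YouTube's actual code.
function detectFlash(nav) {
    var plugin = nav.plugins && nav.plugins["Shockwave Flash"];
    var mime = nav.mimeTypes && nav.mimeTypes["application/x-shockwave-flash"];
    return Boolean(plugin && plugin.description && mime && mime.enabledPlugin);
}

// The fake navigator object from the answer above.
var fakeNavigator = {
    plugins: { "Shockwave Flash": { description: "Shockwave Flash 11.2 e202" } },
    mimeTypes: { "application/x-shockwave-flash": { enabledPlugin: true } }
};

// A navigator with no Flash plugin registered.
var bareNavigator = { plugins: {}, mimeTypes: {} };

console.log(detectFlash(fakeNavigator)); // true
console.log(detectFlash(bareNavigator)); // false
```

A probe like this only inspects plain properties, which is why a hand-built object is enough to satisfy it.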

PhantomJS does not support Flash or the HTML5 video element.

As an option, try building PhantomJS with video/audio support yourself.
Original answer link: https://github.com/ariya/phantomjs/issues/10839#issuecomment-331457673

Related

How do I make a webpage think its images are done loading?

To give you some background, many (if not all) websites load their images one by one, so if there are a lot of images and/or you have a slow computer, most of the images won't show up. This is avoidable for the most part; however, if you're running a script to extract image URLs, you don't need to see the image, you just want its URL. My question is as follows:
Is it possible to trick a webpage into thinking an image is done loading so that it will start loading the next one?
Typically, a browser will not wait for one image to be downloaded before requesting the next one. It requests all images concurrently, as soon as it has the src of each image.
Are you sure that the images are really waiting for the previous image to download, or are they waiting for a specific time interval?
If you are sure that loading depends on the previous image being downloaded, you can route all your requests through a proxy server or firewall and configure it to return an empty file with HTTP status 200 whenever an image is requested from that site.
That way the browser (or actually the website code) will assume that it has downloaded the image successfully.
how do I do that? – Jack Kasbrack
That's actually a very open ended / opinion based question. It will also depend on your OS, browser, system permissions etc. Assuming you are using Windows and have sufficient permissions, you can try using Fiddler. It has an AutoResponder functionality that you can use.
(I've no affiliation with Fiddler / Telerik as such. I'm suggesting it only as an example and because I've used it in the past and know that it can be used for the aforementioned purpose. There will be many more products that provide similar functionality and you should use the product of your choice.)
Use a plugin called Lazy Load. It loads the rest of the web page first and defers each image, loading it only when the user scrolls to it.
To extract all image URLs to a text file maybe you could use something like this,
If you execute this script inside any website it will list the URLs of the images
document.querySelectorAll('*[src]').forEach((item) => {
const isImage = item.src.match(/(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|jpeg|gif|png|svg)/g);
if (isImage) console.log(item.src);
});
You could also use the same idea to read Style from elements and get images from background url or something, like that:
document.querySelectorAll('*').forEach((item) => {
const computedItem = getComputedStyle(item);
Object.keys(computedItem).forEach((attr) => {
const style = computedItem[attr];
const image = style.match(/(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|jpeg|gif|png|svg)/g);
if (image) console.log(image[0]);
});
});
So, at the end of the day you could do some function like that, which will return an array of all images on the site
function getImageURLS() {
let images = [];
document.querySelectorAll('*').forEach((item) => {
const computedItem = getComputedStyle(item);
Object.keys(computedItem).forEach((attr) => {
const style = computedItem[attr];
const image = style.match(/(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|jpeg|gif|png|svg)/g);
if (image) images.push(image[0]);
});
});
document.querySelectorAll('*[src]').forEach((item) => {
const isImage = item.src.match(/(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|jpeg|gif|png|svg)/g);
if (isImage) images.push(item.src);
});
return images;
}
It can probably be optimized but, well you get the idea..
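As one possible tidy-up, the matching step can be pulled out into a pure function that also removes duplicates. This is only a sketch: the name collectImageUrls is made up for illustration, and the pattern is a simplified variant of the one above that assumes URLs contain no whitespace, quotes, or parentheses.

```javascript
// Matches http(s) URLs that contain a common image extension.
// Simplified variant of the pattern above (assumption: URLs contain
// no whitespace, quotes, or parentheses).
var IMAGE_URL = /https?:\/\/[^\s"'()]+\.(?:jpg|jpeg|gif|png|svg)/gi;

// Return the unique image URLs found in an array of strings, for example
// `src` values or computed `background-image` values.
function collectImageUrls(values) {
    var found = new Set();
    values.forEach(function (value) {
        var matches = String(value).match(IMAGE_URL) || [];
        matches.forEach(function (url) {
            found.add(url);
        });
    });
    return Array.from(found);
}

var sample = [
    "https://example.com/a.png",
    'url("https://example.com/bg.jpg")',
    "https://example.com/script.js",
    "https://example.com/a.png"
];
console.log(collectImageUrls(sample));
```

In a page, the input array could be filled from the same querySelectorAll loops shown above, which keeps the DOM traversal and the URL matching separate.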
If you just want to extract the images once, you can use a tool such as:
1) Chrome Extension
2) Software
3) Online website
If you want to run it multiple times, you can probably use the above code (https://stackoverflow.com/a/53245330/4674358) wrapped in an if condition:
if (document.readyState === "complete") {
    extractURL();
} else {
    // Add load or DOMContentLoaded event listeners here, for example:
    window.addEventListener("load", function () {
        extractURL();
    }, false);
    // or
    /*document.addEventListener("DOMContentLoaded", function () {
        extractURL();
    }, false);*/
}
function extractURL() {
    // code mentioned above
}
You want the "DOMContentLoaded" event. It fires as soon as the document has been fully parsed, but before images and other subresources have finished loading.
let addIfImage = (list, image) => image.src.match(/(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|jpeg|gif|png|svg)/g) ?
[image.src, ...list] :
list;
let getSrcFromTags= (tag = 'img') => Array.from(document.getElementsByTagName(tag))
.reduce(addIfImage, []);
if (document.readyState === "loading") {
document.addEventListener("DOMContentLoaded", doSomething);
} else { // `DOMContentLoaded` already fired
doSomething();
}
I am using this, works as expected:
var imageLoading = function(n) {
var image = document.images[n];
var downloadingImage = new Image();
downloadingImage.onload = function(){
image.src = this.src;
console.log('Image ' + n + ' loaded');
if (document.images[++n]) {
imageLoading(n);
}
};
downloadingImage.src = image.getAttribute("data-src");
}
document.addEventListener("DOMContentLoaded", function(event) {
setTimeout(function() {
imageLoading(0);
}, 0);
});
And change every src attribute of image element to data-src

Javascript get updated link in real time

I'm new to JavaScript and I've created a fairly successful Chrome extension for Dubtrack. For quite a while I've been trying to figure out how to make my injected script run in real time and grab the latest YouTube music video URL. Any help would be much appreciated. My extension is very basic and not for profit; I just made it to play around with JavaScript and jQuery.
Here's the section of code that I'd like to have function in real time.
$('#grab').click(function() {
function getId(url) {
var regExp = /^.*(youtu.be\/|v\/|u\/\w\/|embed\/|watch\?v=|\&v=)([^#\&\?]*).*/;
var match = url.match(regExp);
if (match && match[2].length == 11) {
return match[2];
} else {
return 'error';
}
}
src = $('iframe').attr('src');
setInterval(function() {
src = $('iframe').attr('src');
}, 10000);
window.open('http://youtubeinmp3.com/fetch/?video=http://www.youtube.com/watch?v=' + getId(src), '_blank');
});
Relevant links
GitHub
Chrome Extension
Thank you for taking the time to read my question.
You're bad at explaining (and might want to edit the question to reflect what you want), but basically the problem is this:
You have a YouTube embed in the page, with a particular video ID in src.
When the video changes, that happens without updating the src (by using YT embed API).
Therefore, if you try to grab just the src, it's not the latest video but the first you loaded.
As an extension, I see two ways of trying to solve it:
You could try to initialize the YT API yourself to get a player reference. I don't know if it will break the code of Dubtrack.
You could inject a script in the iframe as well that would somehow extract the video being played in a way other than relying on src.
It's an open problem how to solve it, and the fact that you're basically providing "just" a bookmarklet may be an obstacle.
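Whichever way the current video URL ends up being obtained, the ID-extraction step from the question can be sketched without the long regular expression by using the standard URL class. The function name getYouTubeId is hypothetical, only three URL shapes are handled, and the input is assumed to be an absolute URL (a protocol-relative src such as //www.youtube.com/... would need a base URL added first).

```javascript
// Extract an 11-character video ID from a few common YouTube URL shapes.
// Returns null when nothing that looks like an ID is found.
function getYouTubeId(url) {
    var parsed;
    try {
        parsed = new URL(url);
    } catch (e) {
        return null; // not an absolute URL
    }
    var candidate;
    if (parsed.hostname === "youtu.be") {
        candidate = parsed.pathname.slice(1);
    } else if (parsed.pathname.indexOf("/embed/") === 0) {
        candidate = parsed.pathname.slice("/embed/".length);
    } else {
        candidate = parsed.searchParams.get("v");
    }
    return candidate && /^[\w-]{11}$/.test(candidate) ? candidate : null;
}

console.log(getYouTubeId("https://www.youtube.com/embed/dQw4w9WgXcQ?autoplay=1"));
console.log(getYouTubeId("https://www.youtube.com/watch?v=dQw4w9WgXcQ&t=1"));
console.log(getYouTubeId("not a url"));
```

Returning null instead of the string 'error' makes it harder to accidentally build a download link from a failed match.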

Intercept new downloads in Firefox Addon SDK

I have written a simple download manager for Windows and I would like to create an addon for Firefox that when enabled intercepts new downloads in Firefox and sends them to the download manager.
I have already done this for Google Chrome using:
chrome.downloads.onCreated.addListener(function(details) {
    // stop the download
    chrome.downloads.cancel(details.id, null);
});
The question is how can I achieve something similar using the Firefox add-on SDK.
I see there is a way of intercepting page loads to view the content / headers which might be helpful but then I won't know if the request will turn into a download or not.
Firefox add-on SDK: Get http response headers
I could perhaps look for a content type that is not text/html or check for a content disposition header but that could cause problems if I don't correctly handle all cases.
Is there no way of accessing the download manager using the JS SDK or some way of knowing when a download has been started / being started and stop it?
The http-on-examine-response observer that the linked question discusses is the wrong way to go. It concerns all requests not just downloads.
Instead use the Downloads.jsm to observe new downloads, then cancel them, and so on.
To load Downloads.jsm in the SDK use:
const {Cu} = require("chrome");
Cu.import("resource://gre/modules/Downloads.jsm");
Cu.import("resource://gre/modules/Task.jsm");
Then you can add your listener.
let view = {
onDownloadAdded: function(download) {
console.log("Added", download);
},
onDownloadChanged: function(download) {
console.log("Changed", download);
},
onDownloadRemoved: function(download) {
console.log("Removed", download);
}
};
Task.spawn(function* () {
    try {
        let list = yield Downloads.getList(Downloads.ALL);
        yield list.addView(view);
    } catch (ex) {
        console.error(ex);
    }
});
The linked MDN docs have more information and samples.
Since your add-on is a restartless SDK add-on, you'll need to remove the listener again using .removeView on unload, or else there will be a memory leak.
Here's the JSM way.
Components.utils.import("resource://gre/modules/Downloads.jsm");
Components.utils.import("resource://gre/modules/Task.jsm");
Components.utils.import("resource://gre/modules/FileUtils.jsm");
var view = {
    onDownloadChanged: function (download) {
        console.log(download, 'Changed');
        if (download.succeeded) {
            // use the download passed to the callback, not `this`
            var file = new FileUtils.File(download.target.path);
            console.log('file', file);
        }
    }
};
var list;
Task.spawn(function* () {
    list = yield Downloads.getList(Downloads.ALL);
    list.addView(view);
}).then(null, Components.utils.reportError);
Remember to call removeView to stop listening. You can do this anywhere, such as in a shutdown function; it doesn't have to be inside that Task.spawn, which is why list must be a global variable.
list.removeView(view); //to stop listening
Here's the old way, which seems to still work. Although I thought they said they're going to take out the old downloadManager:
var observerService = Components.classes["@mozilla.org/download-manager;1"].getService(Components.interfaces.nsIDownloadManager);
observerService.addListener({
    onDownloadStateChange: function (state, dl) {
        console.log('dl=', dl);
        console.log('state=', state);
        console.log('targetFile', dl.targetFile);
        if (state == 7 && dl.targetFile.leafName.substr(-4) == ".txt") {
            // the user just (successfully) downloaded a .txt file
        }
    }
});
Here's a MozillaZine thread with some more on this: http://forums.mozillazine.org/viewtopic.php?f=19&t=2792021

Detect when an iframe is loaded

I'm using an <iframe> (I know, I know, ...) in my app (single-page application with ExtJS 4.2) to do file downloads because they contain lots of data and can take a while to generate the Excel file (we're talking anything from 20 seconds to 20 minutes depending on the parameters).
The current state of things is : when the user clicks the download button, he is "redirected" by Javascript (window.location.href = xxx) to the page doing the export, but since it's done in PHP, and no headers are sent, the browser continuously loads the page, until the file is downloaded. But it's not very user-friendly, because nothing shows him whether it's still loading, done (except the file download), or failed (which causes the page to actually redirect, potentially making him lose the work he was doing).
So I created a small non-modal window docked in the bottom right corner that contains the iframe as well as a small message to reassure the user. What I need is to be able to detect when it's loaded and to differentiate between 2 cases:
No data : OK => Close window
Text data : Error message => Display message to user + Close window
But I tried all 4 events (W3Schools doc) and none is ever fired. I could at least understand that if it's not HTML data returned, it may not be able to fire the event, but even if I force an error to return text data, it's not fired.
If anyone knows of a solution for this, or an alternative system that may fit here, I'm all ears! Thanks!
EDIT : Added iframe code. The idea is to get a better way to close it than a setTimeout.
var url = 'http://mywebsite.com/my_export_route';
var ifr = $('<iframe class="dl-frame" src="'+url+'" width="0" height="0" frameborder="0"></iframe>');
ifr.appendTo($('body'));
setTimeout(function() {
$('.dl-frame').remove();
}, 3000);
I wonder if it would require significant changes in both the frontend and backend code, but have you considered using AJAX? The workflow would be something like this: the user sends an AJAX request to start generating the file, the frontend repeatedly polls the server for its status, and when it's done, a download link is shown to the user. I believe that workflow would be more straightforward.
Well, you could also try this trick. In the parent window, create a callback function for the iframe's complete loading, myOnLoadCallback, then call it from the iframe with parent.myOnLoadCallback(). But you would still have to use setTimeout to handle server errors and connection timeouts.
And one last thing: how did you try to catch the iframe's events? Maybe it's something browser-related. Have you tried setting event callbacks directly in HTML attributes? Like
<iframe onload="done()" onerror="fail()"></iframe>
That's bad practice, I know, but sometimes a job needs to be done fast, eh?
UPDATE
Well, I'm afraid you have to spend a long and painful day with a JS debugger. The load event should work. I still have some suggestions, though:
1) Try to set the event listener before setting the element's src. Maybe the load event fires so fast that it slips in between creating the element and attaching the event callback.
2) At the same time, check whether your server code plays nicely with iframes. I have made a simple test which attempts to download a PDF from Dropbox; try replacing my URL with your backend route.
<script src="https://code.jquery.com/jquery-1.11.1.min.js"></script>
<iframe id="book"></iframe>
<button id="go">Request downloads!</button>
<script>
var bookUrl = 'https://www.dropbox.com/s/j4o7tw09lwncqa6/thinkpython.pdf';
$('#book').on('load', function(){
console.log('WOOT!', arguments);
});
$('#go').on('click', function(){
$('#book').attr('src', bookUrl);
});
</script>
UPDATE 2
3) Also, look at the Network tab of your browser's debugger, what happens when you set src to the iframe, it should show request and server's response with headers.
I've tried with jQuery and it worked just fine as you can see in this post.
I made a working example here.
It's basically this:
<iframe src="http://www.example.com" id="myFrame"></iframe>
And the code:
function test() {
alert('iframe loaded');
}
$('#myFrame').load(test);
Tested on IE11.
I guess I'll give a hackier alternative to the more proper approaches the others have posted. If you have control over the PHP download script, perhaps you can simply have it output JavaScript when the download is complete, or redirect to an HTML page that runs JavaScript. That JavaScript can then try to call something in the parent frame. What will work depends on whether your app runs on the same domain or not.
Same domain
Frames on the same domain can reference each other's JavaScript objects directly. So in your single-page application you can have something like
window.downloadHasFinished=function(str){ //Global pollution. More unique name?
//code to be run when download has finished
}
And for your download php script, you can have it output this html+javascript when it's done
<script>
if(parent && parent.downloadHasFinished)
parent.downloadHasFinished("if you want to pass a data. maybe export url?")
</script>
Demo jsfiddle (Must run in fullscreen as the frames have different domain)
Parent jsfiddle
Child jsfiddle
Different Domains
For different domains, We can use postMessage. So in your single page application it will be something like
$(window).on("message",function(e){
var e=e.originalEvent
if(e.origin=="http://downloadphp.anotherdomain.com"){ //for security
var message=e.data //data passed if any
//code to be run when download has finished
}
});
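and in your PHP download script you can have it output this HTML + JavaScript
<script>
// The second argument is the target origin, i.e. the origin of the parent
// page that should receive the message (placeholder value, replace with yours).
parent.postMessage("if you want to pass data",
"http://your-app.example.com");
</script>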
Parent Demo
Child jsfiddle
Conclusion
Honestly, if the other answers work, you should probably use those. I just thought this was an interesting alternative so I posted it up.
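For the cross-domain variant, the origin check is the part that is easiest to get wrong, so here is a small sketch that isolates it in a function that can be exercised without a browser. The names makeDownloadListener, trustedOrigin and onFinished are made up for this illustration, and the events below are simulated plain objects rather than real MessageEvent instances.

```javascript
// Build a "message" handler that only reacts to one trusted origin.
// `trustedOrigin` and `onFinished` are hypothetical names for this sketch.
function makeDownloadListener(trustedOrigin, onFinished) {
    return function (event) {
        if (event.origin !== trustedOrigin) {
            return false; // ignore messages from anywhere else
        }
        onFinished(event.data);
        return true;
    };
}

// Simulated events, so the sketch can run outside a browser.
var received = [];
var listener = makeDownloadListener(
    "http://downloadphp.anotherdomain.com",
    function (data) { received.push(data); }
);
listener({ origin: "http://downloadphp.anotherdomain.com", data: "export done" });
listener({ origin: "http://evil.example.com", data: "spoofed" });
console.log(received);
```

In a real page the handler would be registered with window.addEventListener("message", listener), and the trusted origin would be the origin of the page running inside the iframe.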
You can use the following script. It comes from a project of mine.
$("#reportContent").html("<iframe id='reportFrame' sandbox='allow-same-origin allow-scripts' width='100%' height='300' scrolling='yes' onload='onReportFrameLoad();'\></iframe>");
Maybe you should use
$($('.dl-frame')[0].contentWindow.document).ready(function () {...})
Try this (pattern)
$(function () {
var session = function (url, filename) {
// `url` : URL of resource
// `filename` : `filename` for resource (optional)
var iframe = $("<iframe>", {
"class": "dl-frame",
"width": "150px",
"height": "150px",
"target": "_top"
})
// `iframe` `load` `event`
.one("load", function (e) {
$(e.target)
.contents()
.find("html")
.html("<html><body><div>"
+ $(e.target)[0].nodeName
+ " loaded" + "</div><br /></body></html>");
alert($(e.target)[0].nodeName
+ " loaded" + "\nClick link to download file");
return false
});
var _session = $.when($(iframe).appendTo("body"));
_session.then(function (data) {
var link = $("<a>", {
"id": "file",
"target": "_top",
"tabindex": "1",
"href": url,
"download": url,
"html": "Click to start {filename} download"
});
$(data)
.contents()
.find("body")
.append($(link))
.addBack()
.find("#file")
.attr("download", function (_, o) {
return (filename || o)
})
.html(function (_, o) {
return o.replace(/{filename}/,
(filename || $(this).attr("download")))
})
});
_session.always(function (data) {
$(data)
.contents()
.find("a#file")
.focus()
// start 6 second `download` `session`,
// on `link` `click`
.one("click", function (e) {
var timer = 6;
var t = setInterval(function () {
$(data)
.contents()
.find("div")
// `session` notifications
.html("Download session started at "
+ new Date() + "\n" + --timer);
}, 1000);
setTimeout(function () {
clearInterval(t);
$(data).replaceWith("<span class=session-notification>"
+ "Download session complete at\n"
+ new Date()
+ "</span><br class=session-notification />"
+ "<a class=session-restart href=#>"
+ "Restart download session</a>");
if ($("body *").is(".session-restart")) {
// start new `session`,
// on `.session-restart` `click`
$(".session-restart")
.on("click", function () {
$(".session-restart, .session-notification")
.remove()
// restart `session` (optional),
// or, other `session` `complete` `callback`
&& session(url, filename ? filename : null)
})
};
}, 6000);
});
});
};
// usage
session("http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf", "ECMA_JS.pdf")
});
jsfiddle http://jsfiddle.net/guest271314/frc82/
Regarding your comment about finding a better way to close it than setTimeout: you could use jQuery's fadeOut (or any of the other transitions) and remove the element in the 'complete' callback. Below is an example you can paste straight into a fiddle; it only needs a reference to jQuery.
I also wrapped it inside a listener for the 'load' event, so the fade does not start until the iframe has loaded, as the question originally asked.
// plugin your URL here
var url = 'http://jquery.com';
// create the iFrame, set attrs, and append to body
var ifr = $("<iframe>")
.attr({
"src": url,
"width": 300,
"height": 100,
"frameborder": 0
})
.addClass("dl-frame")
.appendTo($('body'))
;
// log to show its part of DOM
console.log($(".dl-frame").length + " items found");
// create listener for load
ifr.one('load', function() {
console.log('iframe is loaded');
// call $ fadeOut to fade the iframe
ifr.fadeOut(3000, function() {
// remove iframe when fadeout is complete
ifr.remove();
// log after, should no longer exist in DOM
console.log($(".dl-frame").length + " items found");
});
});
If you are doing a file download from an iframe, the load event won't fire :) I was doing this a week ago. The only solution I found for this problem is to call a download proxy script with a token and have it return that token through a cookie when the file is loaded. Meanwhile, you need a setInterval on the page which watches for that specific cookie.
// Just to clarify
var token = String(new Date().getTime()); // ticks, used as the cookie name
$('<iframe>', {src: "yourproxy?file=somefile.file&token=" + token}).appendTo('body');
var timer = setInterval(function () {
    // $.cookie and $.removeCookie come from the jquery-cookie plugin
    var cookie = $.cookie(token);
    if (typeof cookie != "undefined") {
        // File has been downloaded
        $.removeCookie(token);
        clearInterval(timer);
    }
}, 400);
In your proxy script, add the cookie with its name set to the string sent by the token URL parameter.
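If you would rather not depend on a cookie plugin, the check itself is small. The sketch below parses a document.cookie-style string; the name readCookie is made up for illustration, and it assumes cookie names and values are not URL-encoded.

```javascript
// Return the value of cookie `name` from a `document.cookie`-style string
// ("a=1; b=2"), or undefined when the cookie is absent.
// Assumption: names and values are not URL-encoded.
function readCookie(cookieString, name) {
    var pairs = cookieString ? cookieString.split("; ") : [];
    for (var i = 0; i < pairs.length; i++) {
        var eq = pairs[i].indexOf("=");
        var key = eq === -1 ? pairs[i] : pairs[i].slice(0, eq);
        if (key === name) {
            return eq === -1 ? "" : pairs[i].slice(eq + 1);
        }
    }
    return undefined;
}

// Simulated cookie string; in a page this would be document.cookie.
var cookies = "session=abc; 1700000000000=done";
console.log(readCookie(cookies, "1700000000000"));
console.log(readCookie(cookies, "missing"));
```

Inside the interval callback, readCookie(document.cookie, token) would then take the place of the plugin call.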
If you control the server script that generates the Excel file (or whatever you are sending to the iframe), why not use a UID flag stored in the session with an initial value of 0? When the iframe is created and the server script is called, set the flag to 1, and when the script has finished (the iframe will be loaded), set it to 2.
Then you only need a timer and a periodic AJAX call to the server to check the UID flag: if it's 0, the process hasn't started; if it's 1, the file is being created; and if it's 2, the process has finished.
What do you think? If you need more information about this approach just ask.
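A rough sketch of how the client side could interpret that flag is shown below. The function name interpretFlag and the message texts are assumptions for illustration, and the server replies are simulated with an array so the example runs on its own.

```javascript
// Map the flag values described above to a message and a decision on
// whether the page should keep polling.
function interpretFlag(flag) {
    switch (flag) {
        case 0: return { message: "Export has not started yet", keepPolling: true };
        case 1: return { message: "Export is being generated", keepPolling: true };
        case 2: return { message: "Export finished", keepPolling: false };
        default: return { message: "Unknown export state", keepPolling: false };
    }
}

// Simulated sequence of server replies; a real page would obtain each
// value from a periodic AJAX call instead.
var replies = [0, 1, 1, 2];
var log = [];
for (var i = 0; i < replies.length; i++) {
    var step = interpretFlag(replies[i]);
    log.push(step.message);
    if (!step.keepPolling) {
        break;
    }
}
console.log(log.join("\n"));
```

Treating an unexpected value as a reason to stop polling keeps a server-side error from leaving the timer running forever.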
What you are saying could be done for images and other media formats using $(iframe).load(function() {...});
For PDF files or other rich media, you can use the following Library:
http://johnculviner.com/jquery-file-download-plugin-for-ajax-like-feature-rich-file-downloads/
Note: You will need jQuery UI
You can use this library. The code snippet for your purpose would be something like:
window.onload = function () {
rajax_obj = new Rajax('',
{
action : 'http://mywebsite.com/my_export_route',
onComplete : function(response) {
//This will only called if you have returned any response
// instead of file from your export script
// In your case 2
// Text data : Error message => Display message to user
}
});
}
Then you can call rajax_obj.post() on your download link click.
Download
NB: You should add some headers to your PHP script so that it forces a file download
header('Content-Disposition: attachment; filename="'.$file.'"');
header('Content-Transfer-Encoding: binary');
There are two solutions that I can think of. Either you have PHP post its progress to a MySQL table, from which the frontend pulls information using AJAX calls to check on the progress of the generation. Using some kind of unique key that is generated when the page is accessed would be ideal for multiple people generating Excel files at the same time.
Another solution would be to use Node.js: have PHP post the progress of the Excel file to a Node.js service using cURL or a socket. Then, whenever Node.js receives an update from PHP, it simply writes the progress of the Excel file to the right socket. This will cut off some browser support, though, unless you use external libraries that bring WebSocket support to pretty much all browsers and versions.
Hope this answer helped. I had the same issue last year and ended up doing AJAX polling, with PHP posting its progress on the fly.
Try this:
Note: You should be on the same domain.
var url = 'http://mywebsite.com/my_export_route',
iFrameElem = $('body')
.append('<iframe class="dl-frame" src="' + url + '" width="0" height="0" frameborder="0"></iframe>')
.find('.dl-frame').get(0),
iDoc = iFrameElem.contentDocument || iFrameElem.contentWindow.document;
$(iDoc).ready(function (event) {
console.log('iframe ready!');
// do stuff here
});

Navigating / scraping hashbang links with javascript (phantomjs)

I'm trying to download the HTML of a website that is almost entirely generated by JavaScript. So, I need to simulate browser access and have been playing around with PhantomJS. Problem is, the site uses hashbang URLs and I can't seem to get PhantomJS to process the hashbang -- it just keeps calling up the homepage.
The site is http://www.regulations.gov. The default takes you to #!home. I've tried using the following code (from here) to try and process different hashbangs.
if (phantom.state.length === 0) {
if (phantom.args.length === 0) {
console.log('Usage: loadreg_1.js <some hash>');
phantom.exit();
}
var address = 'http://www.regulations.gov/';
console.log(address);
phantom.state = Date.now().toString();
phantom.open(address);
} else {
var hash = phantom.args[0];
document.location = hash;
console.log(document.location.hash);
var elapsed = Date.now() - new Date().setTime(phantom.state);
if (phantom.loadStatus === 'success') {
if (!first_time) {
var first_time = true;
if (!document.addEventListener) {
console.log('Not SUPPORTED!');
}
phantom.render('result.png');
var markup = document.documentElement.innerHTML;
console.log(markup);
phantom.exit();
}
} else {
console.log('FAIL to load the address');
phantom.exit();
}
}
This code produces the correct hashbang (for instance, I can set the hash to '#!contactus') but it doesn't dynamically generate any different HTML, just the default page. It does, however, correctly output that hash when I call document.location.hash.
I've also tried to set the initial address to the hashbang, but then the script just hangs and doesn't do anything. For example, if I set the url to http://www.regulations.gov/#!searchResults;rpp=10;po=0 the script just hangs after printing the address to the terminal and nothing ever happens.
The issue here is that the content of the page loads asynchronously, but you're expecting it to be available as soon as the page is loaded.
In order to scrape a page that loads content asynchronously, you need to wait to scrape until the content you're interested in has been loaded. Depending on the page, there might be different ways of checking, but the easiest is just to check at regular intervals for something you expect to see, until you find it.
The trick here is figuring out what to look for - you need something that won't be present on the page until your desired content has been loaded. In this case, the easiest option I found for top-level pages is to manually input the H1 tags you expect to see on each page, keying them to the hash:
var titleMap = {
'#!contactUs': 'Contact Us',
'#!aboutUs': 'About Us'
// etc for the other pages
};
Then in your success block, you can set a recurring timeout to look for the title you want in an h1 tag. When it shows up, you know you can render the page:
if (phantom.loadStatus === 'success') {
    // check every 300 milliseconds
    var timeoutId = window.setInterval(function () {
        // check for title element you expect to see
        var h1s = document.querySelectorAll('h1');
        if (h1s.length === 0) {
            console.log('No H1 tags found.');
            return;
        }
        // h1s is a node list, not an array, hence the
        // weird syntax here
        var found = Array.prototype.some.call(h1s, function (h1) {
            return h1.textContent.trim() === titleMap[hash];
        });
        if (found) {
            // we found it!
            console.log('Found H1: ' + titleMap[hash]);
            phantom.render('result.png');
            console.log("Rendered image.");
            // stop the cycle
            window.clearInterval(timeoutId);
            phantom.exit();
        } else {
            console.log('Found H1 tags, but not ' + titleMap[hash]);
        }
    }, 300);
}
The above code works for me. But it won't work if you need to scrape search results - you'll need to figure out an identifying element or bit of text that you can look for without having to know the title ahead of time.
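One thing the loop above lacks is an upper bound: if the expected element never shows up, the script polls forever. A minimal sketch of a bounded poller follows; the name makePoller and the three result strings are made up for illustration, and the page is simulated by a counter so the example runs on its own.

```javascript
// Create a poller: each call to tick() runs `check` once and reports
// "found", "pending", or "timeout" (after `maxAttempts` failed checks).
function makePoller(check, maxAttempts) {
    var attempts = 0;
    return function tick() {
        if (check()) {
            return "found";
        }
        attempts += 1;
        return attempts >= maxAttempts ? "timeout" : "pending";
    };
}

// Simulated page: the content "appears" on the third check.
var checks = 0;
var tick = makePoller(function () {
    checks += 1;
    return checks >= 3;
}, 10);
var results = [tick(), tick(), tick()];
console.log(results); // [ 'pending', 'pending', 'found' ]
```

In a scraping script, tick would typically be called from setInterval with check wrapping the DOM lookup; on "found" the page is rendered, and on "timeout" the script can log a failure and exit instead of hanging.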
Edit: Also, it looks like the newest version of PhantomJS now triggers an onResourceReceived event when it gets new data. I haven't looked into this, but you might be able to bind a listener to this event to achieve the same effect.
