PhantomJS page.injectJs doesn't work - javascript

I'm currently trying to write the page source of a URL to a text file. Everything works well, but I additionally want to inject a JavaScript file. The problem is that the file does not get included properly: it only works for the last pages that are loaded, while the others are incomplete.
//phantomjs C:\PhantomJS\Script\test1.js
var fs = require('fs');
var numeroEpisode = 0;
var maxEpisode = 10;
var fichierLien = fs.read('C:\\PhantomJS\\Fichier\\lien.txt');
var ListeLien = fichierLien.split(/[\n]/);
var page = require('webpage').create();

function GetPage()
{
    if (numeroEpisode > maxEpisode)
    {
        phantom.exit();
    }
    page.open(ListeLien[numeroEpisode], function(status)
    {
        if (status !== 'success')
        {
            console.log('Impossible de charger la page.');
        }
        else
        {
            console.log('URL: ' + ListeLien[numeroEpisode] + '');
            page.injectJs('http://mylink.com', function() { });
            var path = 'C:\\PhantomJS\\Fichier\\episode_' + numeroEpisode + '.html';
            fs.write(path, page.content, 'w');
            setTimeout(GetPage, 15000); // run again in 15 seconds
            numeroEpisode++;
        }
    });
}
GetPage();

Don't mix up page.injectJs() and page.includeJs().
injectJs(filename): Loads a local JavaScript file into the page and evaluates it synchronously.
includeJs(url, callback): Loads a remote JavaScript file from the specified URL and evaluates it. Since it has to request a remote resource, this is done asynchronously. The passed callback is called as soon as the operation has finished. If you don't use the callback, your code will most likely run before the remote JavaScript has been included. Use that callback:
page.includeJs('http://mylink.com', function() {
    var path = 'C:\\PhantomJS\\Fichier\\episode_' + numeroEpisode + '.html';
    fs.write(path, page.content, 'w');
    numeroEpisode++;
    setTimeout(GetPage, 15000); // run again in 15 seconds
});
Since the JavaScript that you load changes something on the page, you probably need to load it after all of the page's own scripts have run. If this is a JavaScript-heavy page, then you need to wait a little. You can wait a static amount of time:
setTimeout(function() {
    page.includeJs('http://mylink.com', function() {
        //...
    });
}, 5000); // 5 seconds
or use waitFor to wait until an element appears that indicates the page has completely loaded. This can be very tricky sometimes.
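A rough sketch of such a helper, following the common waitFor polling pattern and reusing page, fs, numeroEpisode and GetPage from the script above; the '#player' selector is purely a hypothetical marker for "page is ready":

function waitFor(testFx, onReady, timeOutMillis) {
    var maxTimeOutMillis = timeOutMillis || 10000,
        start = new Date().getTime(),
        interval = setInterval(function() {
            // testFx is run inside the page via evaluate()
            if (page.evaluate(testFx)) {
                clearInterval(interval);
                onReady();
            } else if (new Date().getTime() - start > maxTimeOutMillis) {
                console.log('waitFor timed out');
                clearInterval(interval);
                phantom.exit(1);
            }
        }, 250);
}

waitFor(function() {
    return !!document.querySelector('#player'); // hypothetical "ready" marker
}, function() {
    page.includeJs('http://mylink.com', function() {
        fs.write('C:\\PhantomJS\\Fichier\\episode_' + numeroEpisode + '.html', page.content, 'w');
        numeroEpisode++;
        setTimeout(GetPage, 15000);
    });
});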
If you still want to use injectJs() instead of includeJs() (for example because of its synchronous nature), then you need to download the external JavaScript file to your machine and then you can use injectJs().
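For example, a minimal sketch assuming the remote script has already been saved locally (the path is hypothetical):

// injectJs() takes a local file path, runs synchronously and returns a boolean;
// unlike includeJs() it does not take a callback.
if (page.injectJs('C:\\PhantomJS\\Script\\mylib.js')) { // hypothetical local copy
    fs.write('C:\\PhantomJS\\Fichier\\episode_' + numeroEpisode + '.html', page.content, 'w');
} else {
    console.log('Failed to inject the local script');
}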

Related

Why PhantomJS not scraping the page it is redirected to?

I am scraping http://www.asx.com.au/asx/markets/optionPrices.do?by=underlyingCode&underlyingCode=XJO
It shows a blank white page at first; on that page there is some obfuscated JS code.
That code sends a POST request automatically and then loads the actual page.
I have this code to follow the redirected page, but it's not working.
var page;
var myurl = "http://www.asx.com.au/asx/markets/optionPrices.do?by=underlyingCode&underlyingCode=XJO";

var renderPage = function (url) {
    page = require('webpage').create();

    page.onNavigationRequested = function (url, type, willNavigate, main) {
        if (main && url != myurl) {
            myurl = url;
            console.log("redirect caught")
            // GUILTY CODE
            renderPage(url);
        }
    };

    page.open(url, function (status) {
        if (status === "success") {
            console.log("success")
            page.render('yourscreenshot.png');
            phantom.exit(0);
        } else {
            console.log("failed")
            phantom.exit(1);
        }
    });
}

renderPage(myurl);
It only outputs
success
redirect caught
See my code: why is the GUILTY CODE part not being executed? Why is renderPage(url) not being called after "redirect caught"?
From my understanding, PhantomJS doesn't really handle redirects well. That may be your issue. You may want to test this in a different way, or use another browser to perform these tests to confirm. Check out this GitHub issue to see what I mean: https://github.com/ariya/phantomjs/issues/10389.
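That said, one workaround that is sometimes used (this is a sketch based on an assumption, not taken from the linked issue) is to keep a single page object and render only after no further load has happened for a few seconds, instead of recreating the page inside onNavigationRequested:

var page = require('webpage').create();
var myurl = "http://www.asx.com.au/asx/markets/optionPrices.do?by=underlyingCode&underlyingCode=XJO";
var renderTimer;

page.onLoadFinished = function (status) {
    if (status !== 'success') {
        console.log("failed");
        phantom.exit(1);
    }
    // every load (including the one triggered by the in-page JS) restarts the timer;
    // once nothing has loaded for 3 seconds, assume we are on the final page
    if (renderTimer) clearTimeout(renderTimer);
    renderTimer = setTimeout(function () {
        page.render('yourscreenshot.png');
        phantom.exit(0);
    }, 3000); // 3 seconds is a guess, adjust as needed
};

page.open(myurl);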

How to follow a document.location.reload in PhantomJS?

I've loaded a page in PhantomJS (using it from NodeJS) and there's a JS function on that page, doRedirect(), which contains
...
document.cookie = "key=" + assignedKey
document.location.reload(true)
I run doRedirect() from PhantomJS like this
page.evaluate(function() {
    return doRedirect()
}).then(function(result) {
    // result is null here
})
I'd like PhantomJS to follow the document.location.reload(true) and return the contents of that new page. How can this be done?
document.location.reload() doesn't navigate anywhere; it reloads the page. It's like clicking the refresh button in your browser. It all happens in the front end, not on the server, where a 3xx redirect would happen.
Simply call that function, wait for PhantomJS to finish loading the page, then ask it for the contents.
You can wait for PhantomJS to finish loading by using the page.onLoadFinished event. Additionally, you might need to use setTimeout() after load to wait some additional amount of time for page content to load asynchronously.
var webPage = require('webpage');
var page = webPage.create();

page.open(url, function () { // url: the page that contains doRedirect()
    // once the first load is done, watch for the reload triggered by doRedirect()
    page.onLoadFinished = function (status) {
        // page has loaded, but wait extra time for async content
        setTimeout(function () {
            console.log(page.content); // do your work here
            phantom.exit();
        }, 2000); // milliseconds, 2 seconds
    };
    page.evaluate(function () { doRedirect(); });
});

Grab JavaScript console output with PhantomJS and evaluate it

I'm trying to parse the status page of my router to get the number of wlan devices. The page uses some JavaScript to get the status, so I tried to use PhantomJS, but had no luck.
This is the html source of the status page (status.html and status.js): http://pastebin.com/dmvptBqv
The developer tools of my browser show me this output on the console (anonymized):
([
{"vartype":"value","varid":"device_name","varvalue":"Speedport W 921V"},
{"vartype":"value","varid":"factorydefault","varvalue":"1"},
{"vartype":"value","varid":"rebooting","varvalue":"0"},
{"vartype":"value","varid":"router_state","varvalue":"OK"},
{"vartype":"value","varid":"bngscrat","varvalue":"0"},
{"vartype":"value","varid":"acsreach","varvalue":"0"},
Full reference
How can I get this evaluated output out of PhantomJS? Maybe it is very simple and I just missed the part in the documentation.
I think that I have to use the evaluate function, but I have no idea which function of the document object returns the complete evaluation.
var webPage = require('webpage');
var page = webPage.create();

page.open('blubb', function (status) {
    var js = page.evaluate(function() {
        return document.???;
    });
    console.log(js);
    phantom.exit();
});
The main problem that you have is to get the console messages from the page into a single structure that you can do further processing on. This is easily done with the following code, which waits indefinitely until the first console message appears and stops waiting as soon as no further messages have appeared for 1 second.
var logs = [];
var timeoutID;

page.onConsoleMessage = function(msg) {
    if (timeoutID) clearTimeout(timeoutID);
    logs.push(msg); // possibly also further processing
    timeoutID = setTimeout(function() {
        page.onConsoleMessage = function(msg) {
            console.log("CONSOLE: " + msg);
        };
        // TODO: further processing
        console.log(JSON.stringify(logs, undefined, 4));
        phantom.exit();
    }, 1000);
};

page.open(url); // wait indefinitely
If each msg is valid JSON, then you can parse it immediately to get JavaScript objects. Change
logs.push(msg);
to
logs.push(JSON.parse(msg));
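To sketch how you could then pick a single value out of the parsed objects (the varid 'wlan_devices' is a made-up name; use whatever your router's output actually contains):

// logs[0] is assumed to hold the parsed array of {vartype, varid, varvalue} objects
var entries = logs[0] || [];
var wlan = entries.filter(function (item) {
    return item.varid === 'wlan_devices'; // hypothetical varid
})[0];
if (wlan) {
    console.log('WLAN devices: ' + wlan.varvalue);
}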

Load a javascript/ajax call on click using phantomjs

I am trying to build a web scraper with which I can download the HTML source after information is received from an ajax call on click.
Simply speaking, I initially download the webpage, and then, on clicking the next button, the page is loaded with a new set of images via an ajax call; I need to capture the HTML source after clicking next.
The source of the next-page link looks something like this
Next Page
And on the same page is the JavaScript function nextpage() which handles the ajax call.
Is there a way to do this using PhantomJS? I am very new to PhantomJS, so let me know if anything is not clear.
Currently I am only able to load the contents of the original webpage.
var page = require('webpage').create();

page.open('somewebpage', function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var p = page.evaluate(function () {
            return document.getElementsByTagName('html')[0].innerHTML
        });
        console.log(p);
    }
    phantom.exit();
});
Thanks
Try:
var content = page.evaluate(function() {
    return (new XMLSerializer()).serializeToString(document);
});
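If you also need the HTML after clicking the next link, a rough sketch could combine that with calling the page's own nextpage() function and a fixed wait for the ajax response; the 3-second delay is an assumption, adjust it for the site:

var page = require('webpage').create();

page.open('somewebpage', function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
        phantom.exit(1);
        return;
    }
    // trigger the in-page handler that fires the ajax call
    page.evaluate(function () {
        nextpage();
    });
    // give the ajax response some time to arrive, then serialize the DOM
    setTimeout(function () {
        var content = page.evaluate(function () {
            return (new XMLSerializer()).serializeToString(document);
        });
        console.log(content);
        phantom.exit();
    }, 3000);
});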

Navigating / scraping hashbang links with javascript (phantomjs)

I'm trying to download the HTML of a website that is almost entirely generated by JavaScript. So, I need to simulate browser access and have been playing around with PhantomJS. Problem is, the site uses hashbang URLs and I can't seem to get PhantomJS to process the hashbang -- it just keeps calling up the homepage.
The site is http://www.regulations.gov. The default takes you to #!home. I've tried using the following code (from here) to try and process different hashbangs.
if (phantom.state.length === 0) {
    if (phantom.args.length === 0) {
        console.log('Usage: loadreg_1.js <some hash>');
        phantom.exit();
    }
    var address = 'http://www.regulations.gov/';
    console.log(address);
    phantom.state = Date.now().toString();
    phantom.open(address);
} else {
    var hash = phantom.args[0];
    document.location = hash;
    console.log(document.location.hash);
    var elapsed = Date.now() - new Date().setTime(phantom.state);
    if (phantom.loadStatus === 'success') {
        if (!first_time) {
            var first_time = true;
            if (!document.addEventListener) {
                console.log('Not SUPPORTED!');
            }
            phantom.render('result.png');
            var markup = document.documentElement.innerHTML;
            console.log(markup);
            phantom.exit();
        }
    } else {
        console.log('FAIL to load the address');
        phantom.exit();
    }
}
This code produces the correct hashbang (for instance, I can set the hash to '#!contactus'), but it doesn't dynamically generate any different HTML, just the default page. It does, however, correctly output that hash when I call document.location.hash.
I've also tried setting the initial address to the hashbang, but then the script just hangs and doesn't do anything. For example, if I set the URL to http://www.regulations.gov/#!searchResults;rpp=10;po=0, the script just hangs after printing the address to the terminal and nothing ever happens.
The issue here is that the content of the page loads asynchronously, but you're expecting it to be available as soon as the page is loaded.
In order to scrape a page that loads content asynchronously, you need to wait to scrape until the content you're interested in has been loaded. Depending on the page, there might be different ways of checking, but the easiest is just to check at regular intervals for something you expect to see, until you find it.
The trick here is figuring out what to look for - you need something that won't be present on the page until your desired content has been loaded. In this case, the easiest option I found for top-level pages is to manually input the H1 tags you expect to see on each page, keying them to the hash:
var titleMap = {
    '#!contactUs': 'Contact Us',
    '#!aboutUs': 'About Us'
    // etc. for the other pages
};
Then in your success block, you can set a recurring timeout to look for the title you want in an h1 tag. When it shows up, you know you can render the page:
if (phantom.loadStatus === 'success') {
    // set a recurring timeout for 300 milliseconds
    var timeoutId = window.setInterval(function () {
        // check for the title element you expect to see
        var h1s = document.querySelectorAll('h1');
        if (h1s.length) {
            // h1s is a node list, not an array, hence the
            // weird syntax here
            Array.prototype.forEach.call(h1s, function(h1) {
                if (h1.textContent.trim() === titleMap[hash]) {
                    // we found it!
                    console.log('Found H1: ' + h1.textContent.trim());
                    phantom.render('result.png');
                    console.log("Rendered image.");
                    // stop the cycle
                    window.clearInterval(timeoutId);
                    phantom.exit();
                }
            });
            console.log('Found H1 tags, but not ' + titleMap[hash]);
        } else {
            console.log('No H1 tags found.');
        }
    }, 300);
}
The above code works for me. But it won't work if you need to scrape search results - you'll need to figure out an identifying element or bit of text that you can look for without having to know the title ahead of time.
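For example, a hypothetical check for the results list (the selector below is an assumption; inspect the actual page to find a real one):

// inside the setInterval callback, instead of comparing H1 text:
if (document.querySelectorAll('.search-result-item').length > 0) { // hypothetical selector
    phantom.render('results.png');
    window.clearInterval(timeoutId);
    phantom.exit();
}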
Edit: Also, it looks like the newest version of PhantomJS now triggers an onResourceReceived event when it gets new data. I haven't looked into this, but you might be able to bind a listener to this event to achieve the same effect.
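A minimal sketch of that idea using the page API (untested against this particular site):

var page = require('webpage').create();

// fires for every resource; response.stage is 'start' or 'end'
page.onResourceReceived = function (response) {
    if (response.stage === 'end') {
        console.log('Received: ' + response.url);
    }
};

page.open('http://www.regulations.gov/#!contactUs');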
