PhantomJS.org takes too long to respond to my PhantomJS script

I have a simple script that I got from the book "Getting Started with PhantomJS".
var system = require('system');
var url = system.args[1];
var page = require('webpage').create();
page.open(url, function(status) {
    if (status === "success") {
        console.log("Page is loaded.");
        phantom.exit(0);
    }
});
When I run a command like "phantomjs chapter2.js http://www.google.com", I get the correct response, "Page is loaded." The same happens with facebook.com.
The odd part is that the book told me to run "phantomjs chapter2.js http://www.phantomjs.org", but that command hangs for about a minute, then the script stops and returns to the command prompt without printing anything.
Is it a problem on my end with my internet connection?

The www subdomain of phantomjs.org is a dead end: http://isup.me/www.phantomjs.org
Remove the subdomain and it'll work as expected: http://isup.me/phantomjs.org

The page http://www.phantomjs.org does not resolve.
http://isup.me/www.phantomjs.org shows a dead end.
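The script in the question only handles the "success" status, so a failed load prints nothing at all. Here is a minimal sketch of a variant that also reports failures. The reportFor helper and the 10-second resourceTimeout value are assumptions made for illustration and are not part of the book's script. The PhantomJS wiring is guarded so that the helper can also be exercised outside PhantomJS.

```javascript
// Map the status string passed to page.open's callback to a message and an exit code.
// reportFor is a hypothetical helper, not part of the PhantomJS API.
function reportFor(status, url) {
    if (status === 'success') {
        return { message: 'Page is loaded.', exitCode: 0 };
    }
    return { message: 'Failed to load ' + url, exitCode: 1 };
}

// The wiring below only runs under PhantomJS, where the global `phantom` exists.
if (typeof phantom !== 'undefined') {
    var system = require('system');
    var url = system.args[1];
    var page = require('webpage').create();

    // Give up on any single resource after 10 s (an assumed value, tune as needed).
    page.settings.resourceTimeout = 10000;

    page.open(url, function (status) {
        var report = reportFor(status, url);
        console.log(report.message);
        // Exit on failure too, so the script never ends silently.
        phantom.exit(report.exitCode);
    });
}
```

Run it the same way as before, for example "phantomjs chapter2.js http://phantomjs.org".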

Related

Why is PhantomJS not scraping the page it is redirected to?

I am scraping http://www.asx.com.au/asx/markets/optionPrices.do?by=underlyingCode&underlyingCode=XJO
It shows a blank white page at first; that page contains some obfuscated JS code.
That code automatically sends a POST request and then loads the actual page.
I have this code to follow the redirect, but it's not working.
var page;
var myurl = "http://www.asx.com.au/asx/markets/optionPrices.do?by=underlyingCode&underlyingCode=XJO";
var renderPage = function (url) {
    page = require('webpage').create();
    page.onNavigationRequested = function (url, type, willNavigate, main) {
        if (main && url != myurl) {
            myurl = url;
            console.log("redirect caught")
            // GUILTY CODE
            renderPage(url);
        }
    };
    page.open(url, function (status) {
        if (status === "success") {
            console.log("success")
            page.render('yourscreenshot.png');
            phantom.exit(0);
        } else {
            console.log("failed")
            phantom.exit(1);
        }
    });
}
renderPage(myurl);
It only outputs
success
redirect caught
Why is the GUILTY CODE part not being executed? Why is renderPage(url) not called after "redirect caught" is printed?
From my understanding, PhantomJS doesn't handle redirects well, and that may be your issue. You may want to test this a different way, or use another browser to confirm. See this GitHub issue for what I mean: https://github.com/ariya/phantomjs/issues/10389.
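One possible reading of the output in the question is that the first load's callback calls phantom.exit(0) before the second renderPage call has a chance to finish. Here is a sketch of an alternative, assuming the redirect changes the main-frame URL (as the "redirect caught" line suggests): keep a single page, note when a main-frame navigation to a different URL is requested, and render only when a load finishes after that. makeRedirectTracker and the 30-second fallback are hypothetical and untested against the site in question.

```javascript
// Track whether the main frame has been asked to navigate away from startUrl.
// makeRedirectTracker is a hypothetical helper, not part of the PhantomJS API.
function makeRedirectTracker(startUrl) {
    var redirectSeen = false;
    return {
        // Same signature as page.onNavigationRequested.
        onNavigation: function (url, type, willNavigate, main) {
            if (main && url !== startUrl) {
                redirectSeen = true;
            }
        },
        // Render only for a successful load that follows a redirect.
        shouldRender: function (status) {
            return redirectSeen && status === 'success';
        }
    };
}

// The wiring below only runs under PhantomJS.
if (typeof phantom !== 'undefined') {
    var startUrl = 'http://www.asx.com.au/asx/markets/optionPrices.do?by=underlyingCode&underlyingCode=XJO';
    var page = require('webpage').create();
    var tracker = makeRedirectTracker(startUrl);

    page.onNavigationRequested = tracker.onNavigation;
    page.onLoadFinished = function (status) {
        // The first (blank) page also fires this event; it is skipped
        // because no redirect has been seen yet.
        if (tracker.shouldRender(status)) {
            page.render('yourscreenshot.png');
            phantom.exit(0);
        }
    };
    page.open(startUrl);

    // Fallback so the script cannot hang forever (30 s is an assumed value).
    setTimeout(function () {
        console.log('no redirect finished loading in time');
        phantom.exit(1);
    }, 30000);
}
```

The difference from the original is that nothing exits or renders on the first "success", so the page the obfuscated script navigates to can load in the same page object.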

Make NodeJS/JSDom wait for full rendering before scraping

I'm trying to scrape data from a website that I need to log into. Unfortunately, I'm getting different results using JSDom/NodeJS than I would with a web browser such as Firefox. In particular, I'm not getting the login form with the username, password, and submit button.
I understand that much of JavaScript is asynchronous. However, I thought the "done" callback of JSDom waits for the page to be fully rendered. What I'd like to do is simulate an HTTPS GET and wait until document.ready has completed.
var jsdom = require("jsdom");
var jsdom_global = require("jsdom-global");
var fs = require("fs");
var jquery = fs.readFileSync("./jquery-3.1.1.min.js", "utf-8");
jsdom.env({
    url: "https://wemc.smarthub.coop/Login.html#login:",
    src: [jquery],
    done: function (err, window) {
        var $ = window.$;
        if ($("button#LoginSubmitButton").length) {
            console.log('Click button found');
        } else {
            console.log('Click button not found');
        }
        // The following text boxes are not coming back:
        // $("input#LoginUsernameTextBox")
        // $("input#LoginPasswordTextBox")
        // If I enable the line below, I see a lot less than I would if I
        // do a view source in any reasonable browser.
        //console.log($("body").html());
    }
});
This usually happens because JSDOM doesn't execute the page's JavaScript when it loads the page. In that case, the only elements returned are the ones in the server-rendered HTML.
You could try a headless browser such as PhantomJS and see how that works for you. There's a section about the distinction between the two at the bottom of the JSDOM GitHub page.
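To illustrate the headless-browser suggestion, here is a rough PhantomJS sketch that loads the login page and polls until the submit button from the question exists in the DOM. The pollDecision helper, the 10-second timeout, and the 250 ms interval are assumptions for illustration; whether the button ever appears depends on the site.

```javascript
// Decide what a polling loop should do next.
// pollDecision is a hypothetical helper, not part of the PhantomJS API.
function pollDecision(found, elapsedMs, timeoutMs) {
    if (found) {
        return 'ready';
    }
    return elapsedMs >= timeoutMs ? 'timeout' : 'retry';
}

// The wiring below only runs under PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('https://wemc.smarthub.coop/Login.html#login:', function (status) {
        if (status !== 'success') {
            console.log('Failed to load the page');
            phantom.exit(1);
            return;
        }
        var started = Date.now();
        var timer = setInterval(function () {
            // Runs inside the page, after its own scripts have had time to run.
            var found = page.evaluate(function () {
                return document.querySelector('button#LoginSubmitButton') !== null;
            });
            var decision = pollDecision(found, Date.now() - started, 10000);
            if (decision === 'retry') {
                return;
            }
            clearInterval(timer);
            console.log(decision === 'ready' ? 'Click button found' : 'Click button not found');
            phantom.exit(decision === 'ready' ? 0 : 1);
        }, 250);
    });
}
```

Polling for a specific element is used here instead of waiting for a single "done" event, because a page that builds its form with JavaScript may finish loading before the form exists.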

CasperJS isn't doing the same request as normal browser

I have written a CasperJS script that logs in to a site, opens a specific page, and gets its content.
I'm opening the page I want to scrape with casper.thenOpen, and it shows different content in CasperJS than in my browser. I have set the user agent to match my browser's, and it had no effect. Here are the request and response headers from CasperJS:
http://screenshot.sh/m7QeJGUbqK5Kg
And here from my browser:
http://screenshot.sh/n8j3Gf9QkuXxm
I don't know why the results differ between the browser and CasperJS. I also don't know why there is no cookie in the CasperJS request headers, but I'm sure it's logged in, because earlier in the script I can see content that is only available to logged-in users.
Thanks in advance for any help.
Code:
casper.thenOpen('website.com', function() {
    this.echo("start");
    this.echo(this.fetchText('html'), "INFO");
    casper.then(function () {
        var json;
        var start;
        var end;
        function checkReload()
        {
            json = JSON.parse(this.fetchText('html'));
            if(json.msg.indexOf('desiredtext') === 0) {
                this.echo("good", 'PARAMETER');
                return;
            }
            else
            {
                this.echo('bad news: ' + json.msg, 'COMMENT');
            }
            casper.thenOpen('website.com');
            this.echo(this.fetchText('html'));
            this.wait(1, checkReload);
        }
        this.then(checkReload);
    });
});
But the problem isn't in the code. (I also tried sending the browser's request headers from CasperJS, but that didn't work.)

How to Handle redirects in Node.JS with HorsemanJs and PhantomJS

I've recently started using horseman.js to scrape a page with Node. I can't figure out exactly how it works, and I can't find good examples on the internet.
My main goal is to log in to a platform and extract some data. I've managed to do this with PhantomJS, but now I want to learn how to do it with horseman.js.
My code should open the login page, fill in the login and password inputs, and click the "login" button. Pretty easy so far. However, after the click, the site makes 2 redirects before loading the actual page where I want to work.
My problem is that I don't know how to make my code wait for that page.
With PhantomJS I had a workaround based on the page URL. The following code shows how I managed it with PhantomJS, and it works just fine:
var page = require('webpage').create();
var urlHome = 'http://akna.com.br/site/montatela.php?t=acesse&header=n&footer=n';
var fillLoginInfo = function(){
    $('#cmpLogin').val('mylogin');
    $('#cmpSenha').val('mypassword');
    $('.btn.btn-default').click();
};
page.onLoadFinished = function(){
    var url = page.url;
    console.log("Page Loaded: " + url);
    if(url == urlHome){
        page.evaluate(fillLoginInfo);
        return;
    }
    // After the redirects the URL has a "sid" parameter; I wait for that to appear when the page loads.
    else if(url.indexOf("sid=") >0){
        // Keep struggling with more code!
        return;
    }
}
page.open(urlHome);
However, I can't find a way to handle the redirects with horseman.js.
Here is what I've been trying with horseman.js, without any success:
var Horseman = require("node-horseman");
var horseman = new Horseman();
var urlHome = 'http://akna.com.br/site/montatela.php?t=acesse&header=n&footer=n';
var fillLoginInfo = function(){
    $('#cmpLogin').val('myemail');
    $('#cmpSenha').val('mypassword');
    $('.btn.btn-default').click();
}
var okStatus = function(){
    return horseman.status();
}
horseman
    .open(urlHome)
    .type('input[name="cmpLogin"]','myemail')
    .type('input[name="cmpSenha"]','mypassword')
    .click('.btn-success')
    .waitFor(okStatus, 200)
    .screenshot('image.png')
    .close();
How do I handle the redirects?
I'm currently solving the same problem, and my best solution so far is to use the waitForSelector method to target something on the final page.
E.g.
horseman
    .open(urlHome)
    .type('input[name="cmpLogin"]','myemail')
    .type('input[name="cmpSenha"]','mypassword')
    .click('.btn-success')
    .waitForSelector("#loginComplete")
    .screenshot('image.png')
    .close();
Of course, you have to know the page you're waiting for to do this.
If you know there are two redirects, you can call .waitForNextPage() twice. A naive approach, if you don't know how many redirects to expect, would be to chain these calls until a timeout is reached (I don't recommend this, as it will be slow).
A perhaps cleverer way is to use .on() events to capture redirects, such as .on('navigationRequested') or .on('urlChanged').
Although it doesn't answer your question directly, this link may help: https://github.com/ariya/phantomjs/issues/11507
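The two ideas above (calling .waitForNextPage() twice and listening for 'urlChanged') can be combined roughly as follows. This is only a sketch: it assumes a Horseman version with a chainable API, exactly two redirects as described in the question, and the "sid=" marker from the asker's PhantomJS workaround. loginAndCapture and hasSessionId are hypothetical names, and the Horseman constructor is passed in as a parameter so the chain can be tried against a stub.

```javascript
var urlHome = 'http://akna.com.br/site/montatela.php?t=acesse&header=n&footer=n';

// True once the URL carries the "sid" parameter that marks the final page.
function hasSessionId(url) {
    return url.indexOf('sid=') > 0;
}

// Horseman is the constructor exported by the node-horseman package.
function loginAndCapture(Horseman) {
    var horseman = new Horseman();
    return horseman
        // Log every URL change so each redirect is visible.
        .on('urlChanged', function (targetUrl) {
            var suffix = hasSessionId(targetUrl) ? ' (has sid)' : '';
            console.log('URL changed: ' + targetUrl + suffix);
        })
        .open(urlHome)
        .type('input[name="cmpLogin"]', 'myemail')
        .type('input[name="cmpSenha"]', 'mypassword')
        .click('.btn-success')
        // One wait per expected redirect (two, per the question).
        .waitForNextPage()
        .waitForNextPage()
        .screenshot('image.png')
        .close();
}

// Usage (requires the node-horseman package to be installed):
// loginAndCapture(require('node-horseman'));
```

If the number of redirects is not fixed, waiting on a selector from the final page, as in the earlier example, is probably the more robust choice.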

Unable to load page resources with PhantomJS

I'm using PhantomJS to get page content for given URL.
The problem is that on some pages PhantomJS cannot load some resources (JS, CSS, ...), and the error I'm getting is:
error code 5, Operation canceled
A web page on which I can reproduce this problem is www.lifehacker.com
The resources I cannot get are:
http://x.kinja-static.com/assets/stylesheets/tiger-4ee27d6612a71ee3c68440f8e9c0025c.css
http://c.amazon-adsystem.com/aax2/amzn_ads.js
and some others too...
The command I'm running is:
phantomjs --debug=true --cookies-file=cookies.txt --ignore-ssl-errors=true --ssl-protocol=tlsv1 fetchpage.js http://www.lifehacker.com
and even if I remove options like cookies-file, ignore-ssl-errors, ssl-protocol the result is still the same.
The fetchpage.js script is:
var webPage = require('webpage');
var system = require('system');
var page = webPage.create();
if (system.args.length === 1) {
    console.log('Usage: fetchpage.js <some URL>');
    phantom.exit(1);
}
var url = system.args[1];
page.open(url, function (status) {
    console.log("STATUS: " + status);
    if (status !== 'success') {
        console.log(
            "Error opening url \"" + page.reason_url
            + "\": " + page.reason
            + "\": " + page
        );
        phantom.exit(1);
    } else {
        var content = page.content;
        console.log(content);
        // Exit with 0 on success (the original exited with 1 here).
        phantom.exit(0);
    }
});
If I open the same page in Chrome, it loads just fine. If I copy the resource URLs that PhantomJS cannot load and paste them into Chrome, they also load just fine.
I have searched for similar problems, but I only found suggestions about setting a timeout, which did not work for me.
I have tried the same thing with PhantomJS v1.9.0, 1.9.8, and 2.0.1-development.
What's even more interesting is that the script sometimes gets a full response for all resources, so I suspect caching, but I couldn't force the server to bypass its cache. I tried sending custom headers through PhantomJS like this:
...
var page = webPage.create();
page.customHeaders = {
    "Cache-Control": "no-cache",
    "Pragma": "no-cache"
};
page.open(url, function (status) {
...
but nothing changed.
I am running out of ideas..
For coders who come across this page while looking for a solution to resources not loading completely in PhantomJS: I had a project where the script would stall on a few resources. It was 50/50 whether it would finish or not.
After some digging I found the following page:
https://github.com/ariya/phantomjs/issues/10652
The solution suggested there, setting a timeout for resources, worked for me:
page.settings.resourceTimeout = 10000;
I am not sure this fully applies to the question above, but at least the information is easier to find now, and it may be part of a solution for some.
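To see which resources are affected once the timeout is set, the setting can be paired with the onResourceTimeout and onResourceError callbacks. Here is a minimal sketch; describeResourceFailure is a hypothetical helper, and the 10-second value is simply the one used above.

```javascript
// Build a one-line description of a failed or timed-out resource.
// describeResourceFailure is a hypothetical helper, not part of the PhantomJS API.
function describeResourceFailure(kind, info) {
    return kind + ' #' + info.id + ' ' + info.url + ': ' +
        info.errorString + ' (code ' + info.errorCode + ')';
}

// The wiring below only runs under PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var url = require('system').args[1];

    // Abort any single resource that takes longer than 10 s.
    page.settings.resourceTimeout = 10000;

    // Called for each resource that hit the timeout.
    page.onResourceTimeout = function (request) {
        console.log(describeResourceFailure('timeout', request));
    };

    // Called for each resource that failed to load for any reason.
    page.onResourceError = function (resourceError) {
        console.log(describeResourceFailure('error', resourceError));
    };

    page.open(url, function (status) {
        console.log('STATUS: ' + status);
        phantom.exit(status === 'success' ? 0 : 1);
    });
}
```

With these handlers in place, a failure such as "error code 5, Operation canceled" is logged together with the URL of the resource it belongs to.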
