Unable to load page resources with PhantomJS

I'm using PhantomJS to get the page content for a given URL.
The problem is that on some pages PhantomJS cannot load some resources (JS, CSS, ...), and the error I'm getting is:
error code 5, Operation canceled
A web page on which I can reproduce this problem is www.lifehacker.com
The resources I cannot get are:
http://x.kinja-static.com/assets/stylesheets/tiger-4ee27d6612a71ee3c68440f8e9c0025c.css
http://c.amazon-adsystem.com/aax2/amzn_ads.js
and some others too...
The command I'm running is:
phantomjs --debug=true --cookies-file=cookies.txt --ignore-ssl-errors=true --ssl-protocol=tlsv1 fetchpage.js http://www.lifehacker.com
Even if I remove options such as cookies-file, ignore-ssl-errors and ssl-protocol, the result is the same.
The fetchpage.js script is:
var webPage = require('webpage');
var system = require('system');
var page = webPage.create();
if (system.args.length === 1) {
    console.log('Usage: fetchpage.js <some URL>');
    phantom.exit(1);
}
var url = system.args[1];
page.open(url, function (status) {
    console.log("STATUS: " + status);
    if (status !== 'success') {
        console.log(
            "Error opening url \"" + page.reason_url
            + "\": " + page.reason
            + "\": " + page
        );
        phantom.exit(1);
    } else {
        var content = page.content;
        console.log(content);
        phantom.exit(0); // exit code 0 on success
    }
});
If I open the same page in Chrome, it loads just fine. If I paste the resource URLs that PhantomJS cannot load into Chrome, they also load just fine.
I have searched for similar problems, but I only found suggestions about setting a timeout, which did not work for me.
I have tried the same thing with PhantomJS v1.9.0, 1.9.8 and 2.0.1-development.
What's even more interesting, sometimes the PhantomJS script manages to get a full response for all resources, so I suspect caching, but I couldn't force the server to bypass the cache. I have tried sending custom headers through PhantomJS like this:
...
var page = webPage.create();
page.customHeaders = {
"Cache-Control":"no-cache",
"Pragma":"no-cache"
};
page.open(url, function (status) {
...
but nothing changed.
I am running out of ideas..

For coders who come across this page while looking for a solution to resources not completely loading in PhantomJS: I had a project where the script would stall on a few resources. It was 50/50 whether it would finish or not.
After some digging I found the following page:
https://github.com/ariya/phantomjs/issues/10652
The solution suggested there, setting a timeout for resources, worked for me:
page.settings.resourceTimeout = 10000;
I am not sure this fully answers the question above, but at least the information is easier to find now, and it may be part of a solution for some.
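The resource timeout above can be paired with PhantomJS's onResourceTimeout and onResourceError callbacks to see which resources are being dropped. This is a minimal sketch, not a confirmed fix for the question: the helper name watchResources and the injected log function are hypothetical, and the block at the bottom only runs under PhantomJS.

```javascript
// Hypothetical helper: sets a per-resource timeout and records failures.
// `page` is a PhantomJS WebPage object; `log` is any function taking a string.
function watchResources(page, timeoutMs, log) {
  // Give up on any single resource after timeoutMs milliseconds.
  page.settings.resourceTimeout = timeoutMs;

  // Called by PhantomJS when a resource exceeds resourceTimeout.
  page.onResourceTimeout = function (request) {
    log('TIMEOUT ' + request.url);
  };

  // Called by PhantomJS when a resource fails to load
  // (for example "Operation canceled").
  page.onResourceError = function (resourceError) {
    log('ERROR ' + resourceError.errorCode + ' ' +
        resourceError.errorString + ' ' + resourceError.url);
  };

  return page;
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
  var system = require('system');
  var page = watchResources(require('webpage').create(), 10000, function (line) {
    console.log(line);
  });
  page.open(system.args[1], function (status) {
    console.log('STATUS: ' + status);
    phantom.exit(status === 'success' ? 0 : 1);
  });
}
```

Run it the same way as fetchpage.js, passing the URL as the first argument. The log lines show whether a given resource timed out or failed with an error.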

Related

Screenshot of Trading View Chart using PhantomJS

I'm trying to render a screenshot of a Trading View chart widget on my server, similar to the following :
https://jsfiddle.net/exatjd8w/
I'm not that familiar with PhantomJS, but I have tried several ways to take a screenshot of the chart once it's loaded. My latest attempt uses the following code:
var page = require('webpage').create();
page.open('https://mywebsite.com/chart',
    function(status) {
        console.log("Status: " + status);
        if (status === "success") {
            page.includeJs('https://ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js', function() {
                (page.evaluate(function() {
                    // jQuery is loaded, now manipulate the DOM
                    console.log('Status: jQuery is loaded, now manipulate the DOM');
                    var date = new Date(),
                        time = date.getTime()
                    $('#main-widget-frame-container iframe').on('load', function() {
                        console.log('iframe loaded, now take snapshot')
                        page.render('screenshot_' + time + '.png')
                    })
                }))
            })
        }
    }
);
Unfortunately, I still can't get it right: the above code runs forever without producing a result.
Any ideas or suggestions?
Thank you in advance.
PhantomJS is no longer maintained and has a couple of issues that might cause this.
I'd recommend switching to Puppeteer, which has a really nice API, uses headless Google Chrome under the hood, and is actively maintained by the Chrome team:
https://github.com/GoogleChrome/puppeteer
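A minimal sketch of what the Puppeteer version might look like, assuming Puppeteer is installed (npm install puppeteer) and a Node.js version with async/await. The function names captureChart and screenshotName are hypothetical; the selector and file naming are taken from the question. Waiting for the iframe element does not guarantee the chart inside it has finished drawing, so an extra wait may still be needed.

```javascript
// Builds the output file name used in the question: screenshot_<timestamp>.png
function screenshotName(time) {
  return 'screenshot_' + time + '.png';
}

// `puppeteer` is the module returned by require('puppeteer') (assumed installed).
async function captureChart(puppeteer, url, frameSelector) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until network activity has mostly settled.
    await page.goto(url, { waitUntil: 'networkidle2' });
    // Wait for the widget's iframe element to exist in the DOM.
    await page.waitForSelector(frameSelector);
    const file = screenshotName(Date.now());
    await page.screenshot({ path: file });
    return file;
  } finally {
    // Always close the browser, even if a step above throws.
    await browser.close();
  }
}

// Runs only when executed directly with a URL argument, e.g.:
//   node chart.js https://mywebsite.com/chart
if (typeof module !== 'undefined' && require.main === module && process.argv[2]) {
  captureChart(require('puppeteer'), process.argv[2],
               '#main-widget-frame-container iframe')
    .then(function (file) { console.log('Saved ' + file); })
    .catch(function (err) { console.error(err); process.exitCode = 1; });
}
```

Unlike the PhantomJS attempt, the screenshot call here runs in the Node.js script rather than inside the page, so it has access to the page object.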

Missing data in PhantomJS screenshots

I'm working on a script that takes a screenshot of a website every day. I have already done this for other sites and it worked correctly, but this time I have a problem: my PhantomJS script captures almost all the data on the website, but not all of it (in fact, it misses the part that matters most for my case).
Until now I was using this simple, adapted script:
var page = require('webpage').create();
page.open('http://www.website.com', function() {
    setTimeout(function() {
        page.render('render.png');
        phantom.exit();
    }, 200);
});
But when I run the same script for this site, some data is lost. It takes the screenshot but misses the prices.
[Image: screenshot of the site taken with PhantomJS]
After exploring a bit, I saw that if I capture the DOM (for example with the PHP Simple HTML DOM parser) I can get most of the data, but not the prices.
$html = file_get_html('https://www.falabella.com.ar/falabella-ar/category/cat10178/TV-LED-y-Smart-TV');
$prods = $html->find('div[class=fb-pod-group__item]');
foreach ($prods as $prod) {
// For example i can get the title
$title = $prod->find('h4[class=fb-responsive-hdng-5 fb-pod__product-title]',0)->plaintext;
// But not the price
$price = $prod->find('h4[class=fb-price]',0)->plaintext;
}
Exploring the console log, I found the JavaScript objects that hold these values. If I return the object fbra_browseProductListConfig.state.searchItemList.resultList[0].prices[0].originalPrice
I see the price of the first product, and so on:
[Image: console log of the site]
I can also get it with a PhantomJS script like this:
var page = require("webpage").create();
page.open("https://www.falabella.com.ar/falabella-ar/category/cat10122/Cafeteras-express", function(status) {
var price = page.evaluate(function() {
return fbra_browseProductListConfig.state.searchItemList.resultList[0].prices[0].originalPrice;
});
console.log("The price is " + price);
phantom.exit();
});
In other posts I read about changing the timeout intervals, but that is not working for me (I tried all the scripts shared in the post in question). The problem is not that the website fails to load fully; it seems that this data (the prices) is not written to the DOM. I even downloaded the full site from the terminal with wget and the prices are not there.
Edited
When I execute the script I get the following errors:
./phantomjs fala.js
ReferenceError: Can't find variable: Set
https://www.falabella.com.ar/static/assets/scripts/react/vendor.js?vid=111111111:22
https://www.falabella.com.ar/static/assets/scripts/react/vendor.js?vid=111111111:1 in t
https://www.falabella.com.ar/static/assets/scripts/react/vendor.js?vid=111111111:22
https://www.falabella.com.ar/static/assets/scripts/react/vendor.js?vid=111111111:1 in t
https://www.falabella.com.ar/static/assets/scripts/react/vendor.js?vid=111111111:22
https://www.falabella.com.ar/static/assets/scripts/react/vendor.js?vid=111111111:1 in t
https://www.falabella.com.ar/static/assets/scripts/react/vendor.js?vid=111111111:1
TypeError: undefined is not an object (evaluating 't.componentDomId')
https://www.falabella.com.ar/static/assets/scripts/react/productListApp.js?vid=111111111:3
https://www.falabella.com.ar/static/assets/scripts/react/vendor.js?vid=111111111:22
Maybe the problem is there, because the script "productListApp.js" is what renders the prices?
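The "Can't find variable: Set" error above suggests the page's scripts rely on the ES6 Set object, which this PhantomJS build apparently does not provide. One way to check is to log page errors through page.onError and test for Set inside the page. This is a minimal sketch for diagnosis only; the helper name formatPageError is hypothetical and the URL is the one from the question.

```javascript
// Hypothetical helper: formats the arguments PhantomJS passes to page.onError.
// `trace` is an array of { file, line, function } entries.
function formatPageError(msg, trace) {
  var lines = ['PAGE ERROR: ' + msg];
  (trace || []).forEach(function (t) {
    lines.push('  at ' + t.file + ':' + t.line +
               (t.function ? ' in ' + t.function : ''));
  });
  return lines.join('\n');
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  page.onError = function (msg, trace) {
    console.log(formatPageError(msg, trace));
  };
  page.open('https://www.falabella.com.ar/falabella-ar/category/cat10122/Cafeteras-express', function (status) {
    // Runs inside the page: reports whether the engine provides ES6 Set.
    var hasSet = page.evaluate(function () {
      return typeof Set !== 'undefined';
    });
    console.log('Set available: ' + hasSet);
    phantom.exit(status === 'success' ? 0 : 1);
  });
}
```

If Set is reported as unavailable, the rendering scripts are failing before they can write the prices to the DOM, which would match the missing prices in the screenshot.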

phantomjs.org takes too long to respond to my PhantomJS script?

I have a simple script that I got from the book "Getting Started with PhantomJS".
var system = require('system');
var url = system.args[1];
var page = require('webpage').create();
page.open(url, function(status) {
    if ( status === "success" ) {
        console.log("Page is loaded.");
        phantom.exit(0);
    }
});
When I run a command like "phantomjs chapter2.js http://www.google.com" I get the correct response of "Page is loaded". Same with facebook.com
It's funny because the book told me to run "phantomjs chapter2.js http://www.phantomjs.org", but all it does is hang on me for a minute before the script stops and goes back to the command prompt without printing anything back.
Is it a problem on my end with my internet connection?
The www subdomain of phantomjs.org is a dead end: http://isup.me/www.phantomjs.org
Remove the subdomain and it'll work as expected: http://isup.me/phantomjs.org
The page http://www.phantomjs.org does not resolve.
http://isup.me/www.phantomjs.org shows a dead end.
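The script in the question only handles the "success" status, so a failed load prints nothing. A minimal sketch that also reports failure and exits with a non-zero code; the helper name describeStatus is hypothetical.

```javascript
// Hypothetical helper: turns the status passed to the page.open callback
// into a message and a process exit code.
function describeStatus(url, status) {
  if (status === 'success') {
    return { message: 'Page is loaded.', exitCode: 0 };
  }
  return { message: 'Failed to load ' + url, exitCode: 1 };
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
  var system = require('system');
  var url = system.args[1];
  var page = require('webpage').create();
  page.open(url, function (status) {
    var result = describeStatus(url, status);
    console.log(result.message);
    phantom.exit(result.exitCode);
  });
}
```

With this version, a host that does not resolve produces a "Failed to load" line instead of silence.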

Is it possible to control Firefox's DNS requests in an addon?

I was wondering if it was possible to intercept and control/redirect DNS requests made by Firefox?
The intention is to set an independent DNS server in Firefox (not the system's DNS server)
No, not really. The DNS resolver is made available via the nsIDNSService interface. That interface is not fully scriptable, so you cannot just replace the built-in implementation with your own Javascript implementation.
But could you perhaps just override the DNS server?
The built-in implementation goes from nsDNSService to nsHostResolver to PR_GetAddrByName (nspr) and ends up in getaddrinfo/gethostbyname. That uses whatever the system (or the library implementing it) has configured.
Any other alternatives?
Not really. You could install a proxy and let it resolve domain names (this requires some kind of proxy server, of course). But that is very much a hack and nothing I'd recommend (and what if the user already has a real, non-resolving proxy configured? That would need to be handled as well).
You can detect the "problem loading page" and then probably use the redirectTo method on it.
Basically, they all load an about:neterror URL with a bunch of info after it, e.g.:
about:neterror?e=dnsNotFound&u=http%3A//www.cu.reporterror%28%27afew/&c=UTF-8&d=Firefox%20can%27t%20find%20the%20server%20at%20www.cu.reporterror%28%27afew.
about:neterror?e=malformedURI&u=about%3Abalk&c=&d=The%20URL%20is%20not%20valid%20and%20cannot%
But this info is held in the document URI, so that is what you have to read. Here's example code that will detect problem-loading pages:
var listenToPageLoad_IfProblemLoadingPage = function(event) {
    var win = event.originalTarget.defaultView;
    // This is bad practice: it returns the documentURI of the currently focused tab.
    // It should get the linkedBrowser for the tab via the event instead, something like
    // event.originalTarget.linkedBrowser.webNavigation.document.documentURI
    // (I didn't test this linkedBrowser theory, but it has to be something like that).
    var docuri = window.gBrowser.webNavigation.document.documentURI;
    // Appending '' converts it to a string so string functions such as location.indexOf can be used.
    var location = win.location + '';
    if (win.frameElement) {
        // Frame within a tab was loaded. win should be the top window of
        // the frameset. If you don't want to do anything when frames/iframes
        // are loaded in this web page, uncomment the following line:
        // return;
        // Find the root document:
        //win = win.top;
        if (docuri.indexOf('about:neterror') == 0) {
            Components.utils.reportError('IN FRAME - PROBLEM LOADING PAGE LOADED docuri = "' + docuri + '"');
        }
    } else {
        if (docuri.indexOf('about:neterror') == 0) {
            Components.utils.reportError('IN TAB - PROBLEM LOADING PAGE LOADED docuri = "' + docuri + '"');
        }
    }
}
window.gBrowser.addEventListener('DOMContentLoaded', listenToPageLoad_IfProblemLoadingPage, true);

PhantomJS cannot open a page

I'm trying to get an HTML version of my URL, because my Backbone app builds the page with a lot of JavaScript, but these lines only work sometimes. Sometimes it works and the page content is loaded, but sometimes PhantomJS gets stuck and cannot open the page. In fact, it doesn't log anything.
I've played with timeouts but got nowhere. Any help? The behaviour appears to be non-deterministic. Thanks in advance!
var page = require('webpage').create();
page.open('myurl', function(status) {
    if (status !== 'success') {
        console.log('FAIL to load the address');
        phantom.exit(1);
    } else {
        console.log( "Successful page open!" );
        console.log(page.content);
        phantom.exit(0);
    }
});
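When page.open never calls back, nothing is logged and the script never exits. One way to make the hang visible is a watchdog timer that gives up after a fixed time, plus a resource error handler. This is a minimal sketch, not a fix for the underlying cause; the helper name startWatchdog, the 30-second limit and the exit code 2 are assumptions.

```javascript
// Hypothetical helper: calls onExpire unless cancel() is called
// within `ms` milliseconds.
function startWatchdog(ms, onExpire) {
  var timer = setTimeout(onExpire, ms);
  return {
    cancel: function () { clearTimeout(timer); }
  };
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();

  // Exit with a distinct code if the page.open callback never fires.
  var watchdog = startWatchdog(30000, function () {
    console.log('Gave up waiting for page.open');
    phantom.exit(2);
  });

  // Log resources that fail, to help find what the page is stuck on.
  page.onResourceError = function (resourceError) {
    console.log('Resource error: ' + resourceError.url + ' ' +
                resourceError.errorString);
  };

  page.open('myurl', function (status) {
    watchdog.cancel();
    if (status !== 'success') {
      console.log('FAIL to load the address');
      phantom.exit(1);
    } else {
      console.log(page.content);
      phantom.exit(0);
    }
  });
}
```

The distinct exit code lets a calling script tell a hang apart from a normal load failure and retry if appropriate.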
