In PhantomJS, why does jQuery call cause script to hang? - javascript

I'm using PhantomJS 2.0.0 on Mac OS X Yosemite:
$ phantomjs --version
2.0.0
My script, shown below, appears to hang at the line where $('h1').size() is called:
system = require('system');
function usage() {
    console.log("usage: phantomjs " + system.args[0] + " <url>");
    phantom.exit(1);
}
console.log("system.args.length=" + system.args.length);
if (system.args.length != 2) {
    console.log("usage bad....");
    usage();
} else {
    var url = system.args[1];
    var page = require('webpage').create();
    console.log("Opening page: " + url);
    page.open(url, function (status) {
        if (status !== "success") {
            console.log("Unable to access network");
        } else {
            console.log("Setting timeout...");
            window.setTimeout(function() {
                page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js", function() {
                    console.log("Searching for Seawolf Calendar...");
                    console.log("h1.size=" + $('h1').size());
                    console.log("Exiting with status 0...");
                    phantom.exit(0);
                });
            }, 5000);
        }
    });
}
The script is invoked from the command-line like this, for example:
phantomjs download-framed-content.js "http://www.sonoma.edu/calendar/groups/clubs.html"
with output like this:
system.args.length=2
Opening page: http://www.sonoma.edu/calendar/groups/clubs.html
Setting timeout...
Searching for Seawolf Calendar...
[Hung ...]
Why is the jQuery call hanging the script?

PhantomJS 2.0.0 doesn't show any errors in this situation (this is a known bug).
The error would be that $ is not a function. jQuery is loaded into the page, so you can use it in the page context (inside page.evaluate()), but it is not available outside of that context, in the PhantomJS script itself.
You can only access the DOM/page context through page.evaluate():
console.log("h1.size=" + page.evaluate(function(){
    return $('h1').size();
}));
Note that you cannot use any outside variables inside of the page.evaluate(), because it is sandboxed. The documentation says:
Note: The arguments and the return value to the evaluate function must be a simple primitive object. The rule of thumb: if it can be serialized via JSON, then it is fine.
Closures, functions, DOM nodes, etc. will not work!
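The sandboxing described above can be pictured with a small simulation that runs under plain Node.js. This is only a sketch of the idea and not PhantomJS's actual implementation: simulateEvaluate is a hypothetical helper that turns the function into source text, passes arguments and the return value through JSON, and rebuilds the function in a place where its original closure no longer exists.

```javascript
// Hypothetical stand-in for page.evaluate(); NOT the real PhantomJS code.
// It mimics the two observable rules: the function travels as source text,
// and arguments / return values must survive a JSON round trip.
function simulateEvaluate(fn) {
    var args = Array.prototype.slice.call(arguments, 1);
    var source = fn.toString();              // closures are lost here
    var jsonArgs = JSON.stringify(args);     // only JSON-safe data crosses over
    // Rebuild the function from text, as if inside another context.
    var rebuilt = new Function('return (' + source + ');')();
    var result = rebuilt.apply(null, JSON.parse(jsonArgs));
    // The return value is serialized on the way back as well.
    return result === undefined ? undefined : JSON.parse(JSON.stringify(result));
}

function demo() {
    var outer = 'h1';   // local variable, invisible to the rebuilt function

    // Works: the selector is passed in explicitly as an argument.
    var ok = simulateEvaluate(function (selector) {
        return { selector: selector, length: selector.length };
    }, outer);

    // Fails: the function body reaches for `outer` through a closure.
    var error = null;
    try {
        simulateEvaluate(function () { return outer; });
    } catch (e) {
        error = e.name;
    }
    return { ok: ok, error: error };
}

var demoResult = demo();
console.log(demoResult);
```

In the simulation the second call fails because outer only exists in the enclosing scope, which is the same reason outside variables have to be handed to page.evaluate() as extra arguments.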

Related

PhantomJS - Trying to find out if an element is empty or not

I'm trying to write a simple PhantomJS script where I find an element by ID and determine if it is empty or not. I've tried a few suggested things such as .childNodes.length, .textContent, etc.
These either result in a null error:
TypeError: null is not an object (evaluating 'document.getElementById('idname').childNodes')
Or phantom just crashes and refuses to check the links at all; this usually happens if I run my script twice in a row without much of a pause. Sometimes it just sits and does nothing.
I've written other scrapers that effectively used getElementById in this way, and they were successful, although there I was just checking if the element existed by checking if it was !== null. Checking manually, this element does exist in all the pages I'm checking, it's just that it sometimes has content and sometimes doesn't (it's a div). Anyway, here is my code:
var fs = require('fs')
var urls = fs.read('urls.txt').split('\n');
var page;
page = require('webpage').create();
console.log('The default user agent is ' + page.settings.userAgent);
page.settings.userAgent = 'SpecialAgent';
function check_link(url){
page = require('webpage').create();
page.open(url, function(status){
if (status !== 'success') {
console.log('Unable to access network');
} else {
var error = page.evaluate(function() {
return document.getElementById('error-message');
});
console.log(error.childNodes.length);
fs.write('results.csv', error.childNodes.length + ', ' + url + '\n', 'a');
page.release();
setTimeout(next_link, 1000);
}
});
}
function next_link(){
var url = urls.shift();
console.log(url);
if(!urls){
phantom.exit(0);
} else{
check_link(url);
}
}
next_link();
PhantomJS provides access to the sandboxed page context (DOM context) through page.evaluate() with the following note:
Note: The arguments and the return value to the evaluate function must be a simple primitive object. The rule of thumb: if it can be serialized via JSON, then it is fine.
Closures, functions, DOM nodes, etc. will not work!
So you cannot pass the DOM node out of the page context, but you can do everything you want with it in the page context and then pass out the result.
var errors = page.evaluate(function() {
    var e = document.getElementById('error-message');
    return (e && e.childNodes) ? e.childNodes.length : -1;
});
console.log(errors);

How to set custom header and footer content by using phantom package in nodeJS?

I am generating PDF reports using the phantom package in Node.js. I've found that the phantom.callback method can be used to set the header and footer. When the callback returns simple text it works fine, but when I use a more complicated function that relies on closures and the Jade engine to generate HTML, the phantom output reports that the jade variable is not defined. I think this happens because the callback runs in the context of the child phantom process, so none of the variables defined in my code are available inside it. How can I solve this issue? Or is there a better PhantomJS wrapper for this?
I use this package: "phantom": "0.7.x"
//there I define all variables (jade, fs, etc., so I am sure that they are correct)
function generatePage(_page, reportConfig, phantom, html, report) {
_page.set('viewportSize', reportConfig.viewportSize);
var config = _.extend(reportConfig.paperSize, {
header: {
height: "2cm",
//contents: phantom.callback(headerCallback)
contents: phantom.callback(function (pageNum, numPages) {
var fn = jade.compile(headerTemplate); //Jade is undefined in phantom stdout
var templateData = _.extend({ currentPage: pageNum, numberPages: numPages }, report);
var generatedHtml = fn(templateData);
return "<h1>HEADER</h1><br />" /*+ generatedHtml*/;
})
}
, footer: {
height: "1cm",
contents: phantom.callback(function (pageNum, numPages) {
return "<p>Page " + pageNum + " of " + numPages + "</p>"; //WORKS fine
})
}
}
);
_page.set('paperSize', config);
_page.setContent(html);
return _page;
}
phantom.callback doesn't work this way: the function you pass as the callback is stringified and recompiled in the phantom context, where your dependencies are unknown.
This is a late answer, but I didn't find anything like this elsewhere, so maybe it will help others.
You have to generate the HTML outside the page context, right after you define jade and the other dependencies, and then compile the result into a function that you send as the callback:
//there I define all variables (jade, fs, etc., so I am sure that they are correct)
var fn = jade.compile(headerTemplate); //Jade is undefined in phantom context
var templateData = report;
var generatedHtml = fn(templateData);
//You can use placeholders in your template for pageNum and numPages, such as #pageNum# and #numPages#, and replace them after compilation.
//Here you compile the result into a function source string which you will send
//as the callback (trim the HTML and remove all line breaks to avoid compilation errors)
var headerCallbak = 'function(pageNum, numPages) { var x = \'' + generatedHtml.trim().replace(/(\r\n|\n|\r)/gm, "") + '\'; return x.replace("#pageNum#", pageNum).replace("#numPages#", numPages);}';
function generatePage(_page, reportConfig, phantom, html, report) {
_page.set('viewportSize', reportConfig.viewportSize);
var config = _.extend(reportConfig.paperSize, {
header: {
height: "2cm",
contents: phantom.callback(headerCallbak)
}
, footer: {
height: "1cm",
contents: phantom.callback(function (pageNum, numPages) {
return "<p>Page " + pageNum + " of " + numPages + "</p>"; //WORKS fine
})
}
}
);
_page.set('paperSize', config);
_page.setContent(html);
return _page;
}
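One fragile point in the string-building approach is quoting: a single quote or backslash inside the generated HTML would break the function source. A possible way around this, sketched here for plain Node.js, is to embed the HTML with JSON.stringify, which yields a valid JavaScript string literal. buildHeaderCallbackSource is a hypothetical helper, the #pageNum# and #numPages# placeholders are assumed from the comment in the answer, and whether phantom.callback accepts the resulting string depends on the wrapper version.

```javascript
// Hypothetical helper: turns already-rendered HTML into the source text of a
// header callback. Nothing here calls the phantom package itself.
function buildHeaderCallbackSource(generatedHtml) {
    // JSON.stringify escapes quotes, backslashes and line breaks, so the HTML
    // becomes a safe JavaScript string literal.
    var literal = JSON.stringify(generatedHtml)
        .replace(/\u2028/g, '\\u2028')   // raw line separators are not allowed
        .replace(/\u2029/g, '\\u2029');  // inside ES5 string literals
    return 'function (pageNum, numPages) {' +
        ' var html = ' + literal + ';' +
        ' return html.split("#pageNum#").join(pageNum)' +
        '.split("#numPages#").join(numPages);' +
        ' }';
}

// Stand-in for the output of the compiled Jade template (assumed content).
var renderedHeader = "<h1 class='title'>Report</h1>\n<p>Page #pageNum# of #numPages#</p>";
var headerSource = buildHeaderCallbackSource(renderedHeader);

// Rebuild the function from its source to check that it is valid JavaScript.
var headerFn = new Function('return (' + headerSource + ');')();
console.log(headerFn(2, 5));
```

Using split and join replaces every occurrence of a placeholder, whereas a string passed to replace only substitutes the first one.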

Using phantom.js to scrape data

Following on from this question, I am trying to scrape data using phantomjs, modifying a script from here:
My goal is to integrate a working function (see 2nd code snippet) into the script below in 1st code snippet. I have tried doing this but keep getting errors. Is there a way I can actually do the integration?
(Note: I am using PhantomJS, i.e. a headless web browser, because the site is an Angular app whose initial HTML doesn't contain any of the data I am looking for. So I need to load the page in memory, wait for Angular to do its thing (a set delay of some sort), and then scrape the rendered DOM.)
The errors (and output) I get when I execute my script (phantomjs scraping.js) are as follows:
console> SPR-ERROR: 103 - Invalid published date console> v6
ReferenceError: Can't find variable: angular
http://stage.inc.com/js/Inc5000ListApp.js?UPDATE1:2
http://www.inc.com/inc5000/index.html:2485
console> SPR-ERROR:103 - Invalid published date (date)
====================================================
Step "0"
====================================================
console>Reached scrapeData
console>
It seems to be connecting to the desired site. How do I modify the script below to fit the extraction code at the bottom of this question:
var page = new WebPage(),
url = 'http://www.inc.com/inc5000/index.html',
stepIndex = 0;
/**
* From PhantomJS documentation:
* This callback is invoked when there is a JavaScript console. The callback may accept up to three arguments:
* the string for the message, the line number, and the source identifier.
*/
page.onConsoleMessage = function (msg, line, source) {
console.log('console> ' + msg);
};
/**
* From PhantomJS documentation:
* This callback is invoked when there is a JavaScript alert. The only argument passed to the callback is the string for the message.
*/
page.onAlert = function (msg) {
console.log('alert!!> ' + msg);
};
// Callback is executed each time a page is loaded...
page.open(url, function (status) {
if (status === 'success') {
// State is initially empty. State is persisted between page loads and can be used for identifying which page we're on.
console.log('============================================');
console.log('Step "' + stepIndex + '"');
console.log('============================================');
// Inject jQuery for scraping (you need to save jquery-1.6.1.min.js in the same folder as this file)
page.injectJs('jquery-1.6.1.min.js');
// Our "event loop"
if(!phantom.state){
//initialize();
scrapeData();
} else {
phantom.state();
}
// Save screenshot for debugging purposes
page.render("step" + stepIndex++ + ".png");
}
});
function scrapeData(){
page.evaluate(function() {
console.log('Reached scrapeData');
var DATA = [];
$('tr.ng-scope').each(function(){
var $tds = $(this).find('td');
DATA.push({
rank: $tds.eq(0).text(),
company: $tds.eq(1).text(),
growth: $tds.eq(2).text(),
revenue: $tds.eq(3).text(),
industry: $tds.eq(4).text()
});
});
console.log(DATA);
});
phantom.state = parseResults;
// scraping code here
}
// Step 1
function initialize() {
page.evaluate(function() {
console.log('Searching...');
});
// Phantom state doesn't change between page reloads
// We use the state to store the search result handler, ie. the next step
phantom.state = parseResults;
}
// Step 2
function parseResults() {
page.evaluate(function() {
$('#search-result a').each(function(index, link) {
console.log($(link).attr('href'));
})
console.log('Parsed results');
});
// If there was a 3rd step we could point to another function
// but we would have to reload the page for the callback to be called again
phantom.exit();
}
I know the code below works in the console, but how can I integrate it with the script above to successfully scrape data from multiple pages on the site:
request('http://www.inc.com/inc5000/index.html', function (error, response, html) {
if(error || response.statusCode != 200) return;
var $ = cheerio.load(html);
var DATA = [];
$('tr.ng-scope').each(function(){
var $tds = $(this).find('td');
DATA.push({
rank: $tds.eq(0).text(),
company: $tds.eq(1).text(),
growth: $tds.eq(2).text(),
revenue: $tds.eq(3).text(),
industry: $tds.eq(4).text()
});
});
console.log(DATA);
});

Using Multiple page.open in Single Script

My goal is to execute PhantomJS by using:
// adding $op and $er for debugging purposes
exec('phantomjs script.js', $op, $er);
print_r($op);
echo $er;
And then inside script.js, I plan to use multiple page.open() to capture screenshots of different pages such as:
var url = 'some dynamic url goes here';
page = require('webpage').create();
page.open(url, function (status) {
console.log('opening page 1');
page.render('./slide1.png');
});
page = require('webpage').create();
page.open(url, function (status) {
console.log('opening page 2');
page.render('./slide2.png');
});
page = require('webpage').create();
page.open(url, function (status) {
console.log('opening page 3');
page.render('./slide3.png');
phantom.exit(); //<-- Exiting phantomJS only after opening all 3 pages
});
On running exec, I get the following output on page:
Array ( [0] => opening page 3 ) 0
As a result I only get the screenshot of the 3rd page. I'm not sure why PhantomJS is skipping the first and second blocks of code (evident from the missing console.log() messages that were supposed to be output from 1st and 2nd block) and only executing the third block of code.
The problem is that the second page.open is being invoked before the first one finishes, which can cause multiple problems. You want logic roughly like the following (assuming the filenames are given as command line arguments):
function handle_page(file){
    page.open(file,function(){
        ...
        page.evaluate(function(){
            ...do stuff...
        });
        page.render(...);
        setTimeout(next_page,100);
    });
}
function next_page(){
    var file=args.shift();
    if(!file){
        phantom.exit(0);
        return; // do not fall through to handle_page(undefined)
    }
    handle_page(file);
}
next_page();
Right, it's recursive. This ensures that the processing of the function passed to page.open finishes, with a little 100ms grace period, before you go to the next file.
By the way, you don't need to keep repeating
page = require('webpage').create();
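The chaining idea can be tried out without PhantomJS. The sketch below runs under plain Node.js and uses a hypothetical fake page object (createFakePage, finishLoad and isLoading are invented for the illustration) that refuses a second open() while the first is still loading; processInSequence only opens the next URL from inside the previous callback.

```javascript
// Hypothetical fake page standing in for a PhantomJS page object.
// open() stores the callback; finishLoad() pretends the load completed.
function createFakePage(log) {
    var pending = null;
    return {
        open: function (url, callback) {
            if (pending) { throw new Error('open() called while still loading'); }
            log.push('open ' + url);
            pending = callback;
        },
        finishLoad: function () {
            var callback = pending;
            pending = null;
            callback('success');
        },
        isLoading: function () { return pending !== null; }
    };
}

// Opens the URLs one after another; the next open() happens only inside the
// callback of the previous one.
function processInSequence(page, urls, log, onDone) {
    var queue = urls.slice();
    function next() {
        var url = queue.shift();
        if (!url) { onDone(); return; }
        page.open(url, function (status) {
            log.push('rendered ' + url + ' (' + status + ')');
            next();
        });
    }
    next();
}

var sequenceLog = [];
var fakePage = createFakePage(sequenceLog);
processInSequence(fakePage, ['a.html', 'b.html', 'c.html'], sequenceLog, function () {
    sequenceLog.push('done');
});
// Drive the fake "network" until nothing is loading any more.
while (fakePage.isLoading()) { fakePage.finishLoad(); }
console.log(sequenceLog.join('\n'));
```

Calling fakePage.open() three times in a row, as in the question, would throw on the second call in this model, which makes the overlap visible instead of silent.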
I've tried the accepted answer's suggestions, but they don't work (at least not for v2.1.1).
To be accurate, the accepted answer worked some of the time, but I still experienced sporadic failed page.open() calls, about 90% of the time on specific data sets.
The simplest answer I found is to instantiate a new page module for each URL.
// first page
var urlA = "http://first/url"
var pageA = require('webpage').create()
pageA.open(urlA, function(status){
    if (status === 'success'){
        setTimeout(openPageB, 100) // open second page call
    } else{
        phantom.exit(1)
    }
})
// second page
var urlB = "http://second/url"
var pageB = require('webpage').create()
function openPageB(){
    pageB.open(urlB, function(){
        // ...
        // ...
    })
}
The page module API documentation says the following about the close method:
close() {void}
Close the page and releases the memory heap associated with it. Do not use the page instance after calling this.
Due to some technical limitations, the web page object might not be completely garbage collected. This is often encountered when the same object is used over and over again. Calling this function may stop the increasing heap allocation.
Basically after I tested the close() method I decided using the same web page instance for different open() calls is too unreliable and it needed to be said.
You can use recursion:
var page = require('webpage').create();
// the urls to navigate to
var urls = [
'http://phantomjs.org/',
'https://twitter.com/sidanmor',
'https://github.com/sidanmor'
];
var i = 0;
// the recursion function
var genericCallback = function () {
return function (status) {
console.log("URL: " + urls[i]);
console.log("Status: " + status);
// exit if there was a problem with the navigation
if (!status || status === 'fail') phantom.exit();
i++;
if (status === "success") {
//-- YOUR STUFF HERE ----------------------
// do your stuff here... I'm taking a picture of the page
page.render('example' + i + '.png');
//-----------------------------------------
if (i < urls.length) {
// navigate to the next url and the callback is this function (recursion)
page.open(urls[i], genericCallback());
} else {
// try navigate to the next url (it is undefined because it is the last element) so the callback is exit
page.open(urls[i], function () {
phantom.exit();
});
}
}
};
};
// start from the first url
page.open(urls[i], genericCallback());
Using Queued Processes, sample:
var page = require('webpage').create();
// Queue Class Helper
var Queue = function() {
this._tasks = [];
};
Queue.prototype.add = function(fn, scope) {
this._tasks.push({fn: fn,scope: scope});
return this;
};
Queue.prototype.process = function() {
var proxy, self = this;
var task = this._tasks.shift();
if(!task) {return;}
proxy = {end: function() {self.process();}};
task.fn.call(task.scope, proxy);
return this;
};
Queue.prototype.clear = function() {
this._tasks = []; return this;
};
// Init pages .....
var q = new Queue();
q.add(function(proxy) {
page.open(url1, function() {
// page.evaluate
proxy.end();
});
});
q.add(function(proxy) {
page.open(url2, function() {
// page.evaluate
proxy.end();
});
});
q.add(function(proxy) {
page.open(urln, function() {
// page.evaluate
proxy.end();
});
});
// .....
q.add(function(proxy) {
phantom.exit()
proxy.end();
});
q.process();
I hope this is useful, regards.

Hot Code Push NodeJS

I've been trying to figure out "Hot Code Push" on Node.js. Basically, my main file (the one that runs when you type node app.js) consists of some settings, configurations, and initializations. In that file I have a file watcher, using chokidar. When a file has been added, I simply require it. If a file has been changed or updated, I delete its cache entry (delete require.cache[path]) and then re-require it. None of these modules export anything; they just work with the single global Storm object.
Storm.watch = function() {
var chokidar, directories, self = this;
chokidar = require('chokidar');
directories = ['server/', 'app/server', 'app/server/config', 'public'];
clientPath = new RegExp(_.regexpEscape(path.join('app', 'client')));
watcher = chokidar.watch(directories, {
ignored: function(_path) {
if (_path.match(/\./)) {
!_path.match(/\.(js|coffee|iced|styl)$/);
} else {
!_path.match(/(app|config|public)/);
}
},
persistent: true
});
watcher.on('add', function(_path){
self.fileCreated(path.resolve(Storm.root, _path));
//Storm.logger.log(Storm.cliColor.green("File Added: ", _path));
//_console.info("File Updated");
console.log(Storm.css.compile(' {name}: {file}', "" +
"name" +
"{" +
"color: white;" +
"font-weight:bold;" +
"}" +
"hr {" +
"background: grey" +
"}")({name: "File Added", file: _path.replace(Storm.root, ""), hr: "=================================================="}));
});
watcher.on('change', function(_path){
_path = path.resolve(Storm.root, _path);
if (fs.existsSync(_path)) {
if (_path.match(/\.styl$/)) {
self.clientFileUpdated(_path);
} else {
self.fileUpdated(_path);
}
} else {
self.fileDeleted(_path);
}
//Storm.logger.log(Storm.cliColor.green("File Changed: ", _path));
console.log(Storm.css.compile(' {name}: {file}', "" +
"name" +
"{" +
"color: yellow;" +
"font-weight:bold;" +
"}" +
"hr {" +
"background: grey" +
"}")({name: "File Changed", file: _path.replace(Storm.root, ""), hr: "=================================================="}));
});
watcher.on('unlink', function(_path){
self.fileDeleted(path.resolve(Storm.root, _path));
//Storm.logger.log(Storm.cliColor.green("File Deleted: ", _path));
console.log(Storm.css.compile(' {name}: {file}', "" +
"name" +
"{" +
"color: red;" +
"font-weight:bold;" +
"}" +
"hr {" +
"background: grey" +
"}")({name: "File Deleted", file: _path.replace(Storm.root, ""), hr: "=================================================="}));
});
watcher.on('error', function(error){
console.log(error);
});
};
Storm.watch.prototype.fileCreated = function(_path) {
if (_path.match('views')) {
return;
}
try {
require.resolve(_path);
} catch (error) {
require(_path);
}
};
Storm.watch.prototype.fileDeleted = function(_path) {
delete require.cache[require.resolve(_path)];
};
Storm.watch.prototype.fileUpdated = function(_path) {
var self = this;
pattern = function(string) {
return new RegExp(_.regexpEscape(string));
};
if (_path.match(pattern(path.join('app', 'templates')))) {
Storm.View.cache = {};
} else if (_path.match(pattern(path.join('app', 'helpers')))) {
self.reloadPath(path, function(){
self.reloadPaths(path.join(Storm.root, 'app', 'controllers'));
});
} else if (_path.match(pattern(path.join('config', 'assets.coffee')))) {
self.reloadPath(_path, function(error, config) {
//Storm.config.assets = config || {};
});
} else if (_path.match(/app\/server\/(models|controllers)\/.+\.(?:coffee|js|iced)/)) {
var isController, directory, klassName, klass;
self.reloadPath(_path, function(error, config) {
if (error) {
throw new Error(error);
}
});
Storm.serverRefresh();
isController = RegExp.$1 == 'controllers';
directory = 'app/' + RegExp.$1;
klassName = _path.split('/');
klassName = klassName[klassName.length - 1];
klassName = klassName.split('.');
klassName.pop();
klassName = klassName.join('.');
klassName = _.camelize(klassName);
if (!klass) {
require(_path);
} else {
console.log(_path);
self.reloadPath(_path)
}
} else if (_path.match(/config\/routes\.(?:coffee|js|iced)/)) {
self.reloadPath(_path);
} else {
this.reloadPath(_path);
}
};
Storm.watch.prototype.reloadPath = function(_path, cb) {
_path = require.resolve(path.resolve(Storm.root, path.relative(Storm.root, _path)));
delete require.cache[_path];
delete require.cache[path.resolve(path.join(Storm.root, "server", "application", "server.js"))];
//console.log(require.cache[path.resolve(path.join(Storm.root, "server", "application", "server.js"))]);
require("./server.js");
Storm.App.use(Storm.router);
process.nextTick(function(){
Storm.serverRefresh();
var result = require(_path);
if (cb) {
cb(null, result);
}
});
};
Storm.watch.prototype.reloadPaths = function(directory, cb) {
};
Some of the code is incomplete / not used as I'm trying a lot of different methods.
What's Working:
Code like the following works perfectly:
function run() {
console.log(123);
}
But any asynchronous code fails to update.
Problem = Asynchronous Code
app.get('/', function(req, res){
// code here..
});
If I then update the file when the nodejs process is running, nothing happens, though it goes through the file watcher and the cache is deleted, then re-established. Another instance where it doesn't work is:
// middleware.js
function hello(req, res, next) {
// code here...
}
// another file:
app.use(hello);
As app.use would still be using the old version of that method.
Question:
How could I fix the problem? Is there something I'm missing?
Please don't throw suggestions to use 3rd party modules like forever. I'm trying to incorporate the functionality within the single instance.
EDIT:
After studying Meteor's codebase (there are surprisingly few resources on "Hot Code Push" in Node.js or the browser) and tinkering with my own implementation, I've made a working solution: https://github.com/TheHydroImpulse/Refresh.js. This is still at an early stage of development, but it seems solid right now. I'll be implementing a browser solution too, just for the sake of completeness.
Deleting require's cache doesn't actually "unload" your old code, nor does it undo what that code did.
Take for example the following function:
var callbacks=[];
registerCallback = function(cb) {
callbacks.push(cb);
};
Now let's say you have a module that calls that global function.
registerCallback(function() { console.log('foo'); });
After your app starts up, callbacks will have one item. Now we'll modify the module.
registerCallback(function() { console.log('bar'); });
Your 'hot patching' code runs, deletes the require.cached version and re-loads the module.
What you must realize is that callbacks now has two items: a reference to the function that logs foo (which was added on app startup) and a reference to the function that logs bar (which was just added).
Even though you deleted the cached reference to the module's exports, you can't actually delete the module. As far as the JavaScript runtime is concerned, you simply removed one reference out of many. Any other part of your application can still be hanging on to a reference to something in the old module.
This is exactly what is happening with your HTTP app. When the app first starts up, your modules attach anonymous callbacks to routes. When you modify those modules, they attach a new callback to the same routes; the old callbacks are not deleted. I'm guessing that you're using Express, and it calls route handlers in the order they were added. Thus, the new callback never gets a chance to run.
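The effect can be reproduced with a toy router that runs under plain Node.js. This is a deliberately simplified model and not Express: createToyRouter and loadModule are hypothetical names, and the only behavior borrowed from the explanation above is that handlers are tried in the order they were registered.

```javascript
// Toy router, NOT Express: handlers are kept in registration order and the
// first one matching the path handles the request.
function createToyRouter() {
    var routes = [];
    return {
        get: function (path, handler) {
            routes.push({ path: path, handler: handler });
        },
        dispatch: function (path) {
            for (var i = 0; i < routes.length; i++) {
                if (routes[i].path === path) { return routes[i].handler(); }
            }
            return null;
        },
        count: function () { return routes.length; }
    };
}

var toyApp = createToyRouter();

// Simulates loading a module whose top-level code registers a route.
function loadModule(version) {
    toyApp.get('/', function () { return 'response from ' + version; });
}

loadModule('v1');   // app startup
loadModule('v2');   // "hot reload": the module code runs a second time

var handlerCount = toyApp.count();       // both handlers are still registered
var servedBy = toyApp.dispatch('/');     // the first registered handler answers
console.log(handlerCount, servedBy);
```

Clearing require.cache has no counterpart in this model at all, which is the point: the router holds its own reference to the old handler regardless of what the module cache contains.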
To be honest, I wouldn't use this approach to reloading your app on modification. Most people write app initialization code under the assumption of a clean environment; you're violating that assumption by running initialization code in a dirty environment – that is, one which is already up and running.
Trying to clean up the environment to allow your initialization code to run is almost certainly more trouble than it's worth. I'd simply restart the entire app when your underlying files have changed.
Meteor solves this problem by allowing modules to "register" themselves as part of the hot code push process.
They implement this in their reload package:
https://github.com/meteor/meteor/blob/master/packages/reload/reload.js#L105-L109
I've seen that Meteor.reload API used in some plugins on GitHub, but they also use it in the session package:
https://github.com/meteor/meteor/blob/master/packages/session/session.js#L103-L115
if (Meteor._reload) {
Meteor._reload.onMigrate('session', function () {
return [true, {keys: Session.keys}];
});
(function () {
var migrationData = Meteor._reload.migrationData('session');
if (migrationData && migrationData.keys) {
Session.keys = migrationData.keys;
}
})();
}
So basically, when the page/window loads, Meteor runs a "migration", and it's up to the package to define the data/methods/etc. that get recomputed when a hot code push is made.
It's also being used by their livedata package (search reload).
Between refreshes they're saving the "state" using window.sessionStorage.
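The register-and-migrate idea can be sketched in a few lines of plain Node.js. The names onMigrate and migrationData mirror the excerpt above, but createReloader and prepareReload are hypothetical and this is not Meteor's implementation; a plain object stands in for window.sessionStorage.

```javascript
// Toy model of the register-and-migrate pattern; NOT Meteor's code.
// `storage` is a plain object standing in for window.sessionStorage.
function createReloader(storage) {
    var providers = {};
    var saved = storage.migration ? JSON.parse(storage.migration) : {};
    return {
        // A package registers a function that exports its state.
        onMigrate: function (name, fn) { providers[name] = fn; },
        // A package asks for whatever it exported before the reload.
        migrationData: function (name) { return saved[name]; },
        // Called just before the reload: collect state from every provider.
        // Simplification: providers that report "not ready" are just skipped.
        prepareReload: function () {
            var data = {};
            Object.keys(providers).forEach(function (name) {
                var result = providers[name]();   // [ready, state]
                if (result[0]) { data[name] = result[1]; }
            });
            storage.migration = JSON.stringify(data);
        }
    };
}

var fakeSessionStorage = {};

// First "page load": a package registers how to export its state.
var reloaderBefore = createReloader(fakeSessionStorage);
var sessionKeys = { user: 'alice' };
reloaderBefore.onMigrate('session', function () {
    return [true, { keys: sessionKeys }];
});
reloaderBefore.prepareReload();

// Second "page load" after the code push: the package restores its state.
var reloaderAfter = createReloader(fakeSessionStorage);
var restored = reloaderAfter.migrationData('session');
console.log(restored.keys.user);
```

Because the state passes through JSON on its way into storage, only serializable data survives the reload in this model; functions and live references have to be rebuilt by the freshly loaded code.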
