CasperJS 1.0.4 - error in this.getElementsInfo() - javascript

I am new to CasperJS. I have installed CasperJS 1.0.4 and PhantomJS 1.8.2 on Windows 8.
My objective is to scrape some data from the web. I want to open this webpage and fetch the list of towns in Vermont. I replicated the code given by Victor W Yee. When I run the code, it opens the desired page and I take a snapshot of it as verification, but when I try to fetch data from the table I get an error on this line:
var town_names_info = this.getElementsInfo(town_selector);
Error says:
TypeError: 'undefined' is not a function (evaluating 'this.getElementsInfo(town_selector)')
F:/Trial Codes/intro to casper_JS/Vermont/vermont.js:21
F:/Trial Codes/intro to casper_JS/Vermont:1335 in runStep
F:/Trial Codes/intro to casper_JS/Vermont:332 in checkStep
Any suggestions?
My whole code is:
var utils = require('utils');
var casper = require('casper').create({
    verbose: false,
    logLevel: 'debug'
});
var url = 'http://en.wikipedia.org/wiki/List_of_towns_in_Vermont';
var town_selector;

casper.start(url, function() {
    this.capture("result1.png");
    this.echo("* " + this.getTitle() + " *");
});

casper.then(function() {
    // Get info on all elements matching this CSS selector
    town_selector = 'table[id="sortable wikitable"] tbody tr td:nth-of-type(2)';
    var town_names_info = this.getElementsInfo(town_selector); // an array of object literals

    // Pull out the town name text and push into the town_names array
    var town_names = [];
    for (var i = 0; i < town_names_info.length; i++) {
        town_names.push(town_names_info[i].text);
    }

    // Dump the town_names array to screen
    utils.dump(town_names);
});

casper.run(function() {
    this.exit();
});

getElementsInfo() was added in CasperJS 1.1 (note the green version note in the documentation), so it simply doesn't exist in 1.0.4. You can use 1.1.0-beta3, because this "beta" version is actually stable. While you're at it, also update to a more recent PhantomJS such as 1.9.7 (1.9.8 has some known problems with CasperJS).
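If upgrading is not an option, a minimal sketch of the same extraction that works on CasperJS 1.0.x is to collect the cell text with casper.evaluate() and document.querySelectorAll() instead of getElementsInfo(). This reuses the selector from the question, which you may still need to adjust for the actual Wikipedia markup:

casper.then(function() {
    // 1.0.x-compatible alternative: run the extraction in the page context
    var town_names = this.evaluate(function() {
        var cells = document.querySelectorAll('table[id="sortable wikitable"] tbody tr td:nth-of-type(2)');
        return Array.prototype.map.call(cells, function(td) {
            return td.textContent.replace(/^\s+|\s+$/g, '');
        });
    });
    utils.dump(town_names);
});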

Related

Web scraping with R and phantomjs using a small js script returning error

I need to get the content from this page containing some scripts:
https://grouper.swissdrg.org/swissdrg/single?version=7.3&pc=1337_70_0_0_M_11_00_15_0_2018/08/07_2018/08/22_C18.4_C07_-_45.81.11$$&provider=acute&locale=de
For other pages containing JS it works fine, but not for the one I need.
phantomjs.exe is in the root directory and is successfully invoked by a system call (Win7 64-bit):
system("phantomjs WebScrapeV1.js")
The JavaScript file WebScrapeV1.js is as follows:
var url = 'https://grouper.swissdrg.org/swissdrg/single?version=7.3&pc=1337_70_0_0_M_11_00_15_0_2018/08/07_2018/08/22_C18.4_C07_-_45.81.11$$&provider=acute&locale=de';
var page = new WebPage();
var fs = require('fs');

page.open(url, function (status) {
    just_wait();
});

function just_wait() {
    setTimeout(function() {
        fs.write('WebScrapeV1.html', page.content, 'w');
        phantom.exit();
    }, 2500);
}
This is the error I get:
Error: [mobx.array] Index out of bounds, function (t) {return{key:t.version,text:t["name_"+e.root.navigation.lang],value:t.version}} is larger than 30
https://grouper.swissdrg.org/packs/App-3dd15966701d9f6fd4db.js:1 in br
Unhandled promise rejection TypeError: undefined is not a constructor (evaluating 'n.push(this.pdx)')
A longer timeout may be what you need. I had to use 3600 ms to get all the content (that site was extremely slow for me). Here's a way you can modify the timeout in the event of errors without having to hand-edit a phantomjs script.
First, we'll make a function to wrap up all the complexity:
#' Read contents from a URL with phantomjs
#'
#' @param url the URL to scrape
#' @param timeout how long to wait, default is `2500` (ms)
#' @param .verbose if `TRUE` (the default), display the generated
#'   scraping script and any `stdout` output from phantomjs
read_phantom <- function(url, timeout=2500, .verbose = TRUE) {

  suppressPackageStartupMessages({
    require("glue", character.only = TRUE, quiet = TRUE)
    require("crayon", character.only = TRUE, quiet = TRUE)
  })

  phantom_template <- "
var url = {url};
var page = new WebPage()
var fs = require('fs');

page.open(url, function (status) {{
  just_wait();
}});

function just_wait() {{
  setTimeout(function() {{
    fs.write({output_file}, page.content, 'w');
    phantom.exit();
  }}, {timeout});
}}
"

  url <- shQuote(url)

  phantom_bin <- Sys.which("phantomjs")

  tf_in <- tempfile(fileext = ".js")
  on.exit(unlink(tf_in), add = TRUE)

  tf_out <- tempfile(fileext = ".html")
  on.exit(unlink(tf_out), add = TRUE)

  output_file <- shQuote(tf_out)

  phantom_script <- glue(phantom_template)

  if (.verbose) {
    cat(
      crayon::white("Using the following generated scraping script:\n"),
      crayon::green(phantom_script), "\n", sep = ""
    )
  }

  writeLines(phantom_script, tf_in)

  system2(
    command = phantom_bin,
    args = tf_in,
    stdout = if (.verbose) "" else NULL
  )

  paste0(readLines(tf_out, warn = FALSE), collapse = "\n")

}
Now, we'll use your URL with a longer timeout:
read_phantom(
url = "https://grouper.swissdrg.org/swissdrg/single?version=7.3&pc=1337_70_0_0_M_11_00_15_0_2018/08/07_2018/08/22_C18.4_C07_-_45.81.11$$&provider=acute&locale=de",
timeout = 3600
) -> doc
substr(doc, 1, 100)
## [1] "<html><head>\n<script src=\"https://js-agent.newrelic.com/nr-1071.min.js\"></script><script type=\" text"
nchar(doc)
## [1] 26858
Note that phantomjs is considered a legacy tool, as the main developers have moved on since headless Chrome came on the scene. Unfortunately, there's no way to set a timeout for headless Chrome in its simple command-line interface, so you're kind of stuck with phantomjs for now.
I'd suggest trying splashr, but you're on Windows and splashr requires Docker; alternatively, decapitated has an orchestration counterpart, gepetto, but that requires Node.js. Either of those combos seems to be painful for many folks to get working on that legacy operating system.

Lighthouse's "mobile-friendly" returning false for all sites I test

I'm using the CLI version of Google's Lighthouse performance-testing tool to measure certain attributes of a large list of websites. I'm passing the results as JSON to STDOUT and then on to a Node script that plucks the values I'm interested in out to a CSV file.
One of the measures I'm collecting is audits.mobile-friendly.rawValue, which I was expecting to be a flag for passing Google's mobile-friendly test. So the assumption is that the value would be true for a mobile-optimized site. I collected this value for ~2,000 websites, and all came back false.
Here's an example call that I am making to the command line:
lighthouse http://nytimes.com --disable-device-emulation --disable-network-throttling --chrome-flags="--headless" --output "json" --quiet --output-path "stdout" | node lighthouse_parser.js >> speed_log.csv
and here's the output of that command:
"data_url","data_score","data_total_byte_weight","data_consistently_interactive_time","data_first_interactive_time","data_is_on_https","data_redirects_http","data_mobile_friendly","timestamp"
"https://www.nytimes.com/",18.181818181818183,4211752,,18609.982,false,true,false,"2018-04-02T17:16:39-04:00"
Here's the code for my lighthouse_parser.js:
var moment = require('moment');
var getStdin = require('get-stdin');
var json2csv = require('json2csv');

var timestamp = moment().format();

getStdin().then(str => {
    try {
        process_files(str);
    } catch (error) {
        console.error(error);
    }
});

function process_files(this_file) {
    var obj = JSON.parse(this_file);

    var data_url = obj.url;
    var data_score = obj.score;
    var data_total_byte_weight = obj.audits['total-byte-weight'].rawValue;
    var data_consistently_interactive_time = obj.audits['consistently-interactive'].rawValue;
    var data_first_interactive_time = obj.audits['first-interactive'].rawValue;
    var data_is_on_https = obj.audits['is-on-https'].rawValue;
    var data_redirects_http = obj.audits['redirects-http'].rawValue;
    var data_mobile_friendly = obj.audits['mobile-friendly'].rawValue;

    var the_result = {
        "data_url": data_url,
        "data_score": data_score,
        "data_total_byte_weight": data_total_byte_weight,
        "data_consistently_interactive_time": data_consistently_interactive_time,
        "data_first_interactive_time": data_first_interactive_time,
        "data_is_on_https": data_is_on_https,
        "data_redirects_http": data_redirects_http,
        "data_mobile_friendly": data_mobile_friendly,
        "timestamp": timestamp,
    };

    var return_this = json2csv({
        data: the_result,
        header: false
    });

    console.log(return_this);
}
I haven't been able to get one true value for audits.mobile-friendly.rawValue on ANY site.
Any thoughts on what I'm doing wrong?
The mobile-friendly audit result you're looking at is essentially a placeholder audit that tells you to run the page through Google's separate Mobile-Friendly Test. So, indeed, its value will never change. ;)
The viewport, content-width and (to some degree) font-size audits can be used to build a definition of mobile friendliness that is comparable with what the dedicated Mobile-Friendly Test returns.
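As a hedged sketch of that idea (not an official Lighthouse feature), the parser above could derive its own flag from those audits. The audit names assume the same Lighthouse version and JSON shape the question's script already reads:

function is_mobile_friendly(obj) {
    // Combine the audits named above into a single boolean.
    var viewport = obj.audits['viewport'];
    var content_width = obj.audits['content-width'];
    var font_size = obj.audits['font-size']; // may be absent depending on config
    return Boolean(
        viewport && viewport.rawValue &&
        content_width && content_width.rawValue &&
        (!font_size || font_size.rawValue)
    );
}

You would then set var data_mobile_friendly = is_mobile_friendly(obj); in process_files() instead of reading the placeholder audit.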

CasperJS crashes - can they be avoided with refresh or reinstance?

Hope you're having an awesome day.
I'm running a CasperJS scrape across around 100,000 links over the course of a few days (continuously).
Every 500 links or so, CasperJS crashes randomly. When reloaded and started from the last link, however, it continues for another 500.
I was wondering if someone knows an effective way to refresh, or close and re-instance, CasperJS to avoid this burnout? I was thinking of an exit() paired with a wait, but I'm very keen on thoughts!
The script is similar to:
var casper = require('casper').create({
    verbose: true,
    logLevel: 'error',
    pageSettings: {
        loadImages: false,
        loadPlugins: true,
        userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11'
    },
    clientScripts: ['vendor/jquery.min.js', 'vendor/lodash.js'],
    viewportSize: {
        width: 1600,
        height: 1000
    }
});

var linkArray = [ /* Includes 100,000+ links */ ];

function inspectUrl(url) {
    casper.thenOpen(url, function() {
        title = this.getPageTitle();
        bodyText = this.fetchText('body');
        // Includes a bunch of other tasks to do.
    });
}

casper.start('https://www.google.com.au', function() {
    console.log('Booting up CasperJS...');
});

casper.then(function() {
    for (var i = 0; i < linkArray.length; i++) {
        inspectUrl(linkArray[i]);
    }
});

casper.run();
There is a known PhantomJS memory problem. You should develop a "runner" that runs your CasperJS script with some 400 links, collects the result, then runs another instance of the script with another portion of links, and so on.
Maybe you can make some CasperJS instances run in parallel, if you need speed.
You can develop such a runner with PhantomJS, using the spawn function.
The function is described briefly in the PhantomJS docs: http://phantomjs.org/api/child_process/
UPDATE:
Below is a working example of such a runner. The example is very simple, just to demonstrate how one could spawn CasperJS instances and collect their results. In particular, there is no error handling in the example at all. The example has been tested with PhantomJS 2.1.1.
The runner uses Q promises, so first you have to create a package.json file with the following content:
{
  "dependencies": {
    "q": "1.4.1"
  }
}
and run installer:
npm install
Then you have to create runner.js:
var Q = require('q');
var childProcess = require('child_process');

var parserTasks = [
    'http://phantomjs.org/',
    'http://casperjs.org/',
    'https://jquery.com/'
];

run(parserTasks).then(function(result) {
    console.log('Tasks result: ' + JSON.stringify(result));
    phantom.exit();
});

function run(tasks) {
    if (tasks.length) {
        var task = tasks.pop();
        return runTask(task).then(function(result) {
            console.log('result: ' + result);
            return run(tasks).then(function(results) {
                return [result].concat(results);
            });
        });
    } else {
        return Q([]);
    }
}

function runTask(task) {
    var defer = Q.defer();
    var spawn = childProcess.spawn;
    var result = '';

    var child = spawn('casperjs', ['parser.js', task]);
    console.log("spawn run: " + task);

    child.stdout.on("data", function(data) {
        result += data;
    });

    child.on("exit", function() {
        defer.resolve(result);
    });

    return defer.promise;
}
and parser.js
var casper = require('casper').create();
var url = casper.cli.args[0];
var result;
casper.start();
casper.thenOpen(url, function() {
    result = this.getTitle();
});

casper.run(function() {
    this.echo(result).exit();
});
You can execute the runner the following way, assuming the phantomjs executable is somewhere on your PATH:
phantomjs runner.js
The output should be the following:
spawn run: https://jquery.com/
result: jQuery
spawn run: http://casperjs.org/
result: CasperJS, a navigation scripting and testing utility for PhantomJS and SlimerJS
spawn run: http://phantomjs.org/
result: PhantomJS | PhantomJS
Tasks result: ["jQuery\n","CasperJS, a navigation scripting and testing utility for PhantomJS and SlimerJS\n","PhantomJS | PhantomJS\n"]
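The answer above also mentions running CasperJS instances in parallel for speed. A minimal sketch of that variation, reusing the same runTask() and parserTasks and assuming the task list is small enough to spawn at once, replaces the sequential run() call with:

function runParallel(tasks) {
    // Spawn all tasks at once and wait for every promise to resolve.
    return Q.all(tasks.map(runTask));
}

runParallel(parserTasks).then(function(results) {
    console.log('Tasks result: ' + JSON.stringify(results));
    phantom.exit();
});

With 100,000 links you would still want to feed the runner in batches rather than spawning everything simultaneously.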

MongoDB JavaScript Yield large set of results

I'm trying to query a large set of results from a MongoDB database via Python. I do this through JavaScript because I want to get something like the grandchildren in a tree-like structure. My code looks like the following:
col = db.getCollection(...)
var res = new Array();
col.find( { "type" : ["example"] } ).forEach(
    function(entry)
    {
        v1 = col.find( {"_id" : entry["..."]} )
        ... (walk through the structure) ...
        vn = ...
        res.push([v1["_id"], vn["data"]]);
    }
);
return res;
Now I'm having the problem that the resulting array becomes very (too) large and the memory gets exceeded. Is there a way to yield the results instead of pushing them into an array?
Alright, I think I know what you mean. I created a structure like the following:
var bulksize = 1000;
var col = db.getCollection("..");
var queryRes = col.find( { ... } )

process = function(entity) { ... }

nextEntries = function()
{
    var res = new Array();
    for(var i = 0; i < bulksize; i++)
    {
        if(hasNext())
            res.push(process(queryRes.next()));
        else
            break;
    }
    return res;
}

hasNext = function()
{
    return queryRes.hasNext();
}
The script separates the results into bulks of 1000 entries. From the Python side I eval the script above and then do the following:
while database.eval('hasNext()'):
    print "test"
    for res in database.eval('return nextEntries()'):
        doSth(res)
The interesting thing is that the console always says:
test
test
test
test
test
test
Then I get the error:
pymongo.errors.OperationFailure: command SON([('$eval', Code('return nextEntries()', {})), ('args', ())]) failed: invoke failed: JS Error: ReferenceError: nextEntries is not defined nofile_a:0
This means that the first calls of nextEntries() work, but then the function is no longer there. Could it be that MongoDB does something like clearing its JavaScript cache? The problem does not depend on the bulk size (tested with 10, 100, 1000, 10000, always with the same result).
Alright, I found a line in the source code of MongoDB that clears all JavaScript functions that have been used more than 10 times. So unless changes to the database server are an option, it is necessary to query the database multiple times and send bulks to the client by selecting batches of items with the help of the skip() and limit() functions. This works surprisingly fast. Thanks for your help.
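For reference, a minimal sketch of that skip()/limit() paging, written as shell-style JavaScript like the snippets above (the collection name, query, and process() are placeholders from the question; pymongo cursors expose skip() and limit() under the same names):

var bulksize = 1000;
var col = db.getCollection("..");
var skipped = 0;
while (true) {
    // Fetch one batch per round trip instead of holding everything in one array.
    var bulk = col.find( { ... } ).skip(skipped).limit(bulksize).toArray();
    if (bulk.length === 0)
        break;
    for (var i = 0; i < bulk.length; i++)
        process(bulk[i]);
    skipped += bulksize;
}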

PhantomJS page fetching with nested loop to get new pages

I want to fetch a list online from a certain URL that is in JSON format and then use the DATA_ID from each item in that list to call a new URL. I'm new to PhantomJS and I can't figure out why nested loops inside page.open() act all weird. Also, the way phantom.exit() has to be used seems really odd for what I want to achieve.
Here's my code:
console.log('Loading recipes');
console.log('===============================================================');

var page = require('webpage').create();
var url = 'http://www.hiddenurl.com/recipes/all';

page.open(url, function (status) {
    //Page is loaded!
    var js = page.evaluate(function () {
        return document.getElementsByTagName('pre')[0];
    });

    var recipes = JSON.parse(js.innerHTML).results;
    //console.log(recipes[0].name.replace('[s]', ''));

    for (i = 0; i < recipes.length; i++) {
        console.log(recipes[i].name.replace('[s]', ''));

        var craft_page = require('webpage').create();
        var craft_url = 'http://www.hiddenurl.com/recipe/' + recipes[i].data_id;

        craft_page.open(craft_url, function (craft_status) {
            //Page is loaded!
            var craft_js = craft_page.evaluate(function () {
                return document.getElementsByTagName('body')[0];
            });

            var craftp = craft_js.innerHTML;
            console.log('test');
        });

        if (i == 5) {
            console.log('===============================================================');
            phantom.exit();
            //break;
        }
    }
});
The thing that happens here is that this line:
console.log(recipes[i].name.replace('[s]', ''));
..prints the following:
===============================================================
Item from DATA_ID 1
Item from DATA_ID 2
Item from DATA_ID 3
Item from DATA_ID 4
Item from DATA_ID 5
..then it just prints the next:
===============================================================
..followed by:
'test'
'test'
'test'
'test'
'test'
Why is this not happening serially? The data from the inner page.open() requests gets heaped up and dumped at the end, even after phantom.exit() should already have been called.
Also, when I loop freely over a normal data set I get this error:
QEventDispatcherUNIXPrivate(): Unable to create thread pipe: Too many open files
2013-01-31T15:35:18 [FATAL] QEventDispatcherUNIXPrivate(): Can not continue without a thread pipe
Abort trap: 6
Is there any way I can set GLOBAL_PARAMETERS or direct the process in some way so that I can handle hundreds of page requests?
Thanks in advance!
I've made a workaround with Python by calling PhantomJS separately through the shell, like this:
import os
import json
cmd = "./phantomjs fetch.js"
fin,fout = os.popen4(cmd)
result = fout.read()
recipes = json.loads(result)
print recipes['count']
Not the actual solution for the PhantomJS issue, but it's a working solution and has fewer problems with memory and code structure.
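If you do want to stay inside PhantomJS, a minimal sketch of a serial approach (an assumption, not the original poster's code) is to work through the IDs as a queue and only open the next detail page after the previous callback has finished, closing each page so open file handles don't pile up; the URL pattern mirrors the one in the question:

function fetchNext(queue, done) {
    if (queue.length === 0) {
        return done();
    }
    var id = queue.shift();
    var detailPage = require('webpage').create();
    detailPage.open('http://www.hiddenurl.com/recipe/' + id, function (status) {
        var body = detailPage.evaluate(function () {
            return document.getElementsByTagName('body')[0].innerHTML;
        });
        console.log('Fetched ' + id + ' (' + status + '), length: ' + body.length);
        detailPage.close(); // release the page so open files don't accumulate
        fetchNext(queue, done);
    });
}

// Usage, once `recipes` has been parsed as in the question:
// fetchNext(recipes.map(function (r) { return r.data_id; }), phantom.exit);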
