I wrote a PhantomJS program (source below) to send some requests and measure the time they take, in order to compare the performance of 2.0.0 and 1.9.8. I get the links from the sitemap.xml files of the sites I hardcode in the "links" array.
The script starts with a few URLs in the "links" array. The function gatherLinks() gathers more URLs from the sitemap.xml of those sites. Once the "links" array has enough URLs (decided by the variable "limit"), the function request() is called for each URL in "links" to send a request to the server and fetch the response. The time taken for each response is reported, and the total time taken by the program is reported when the program ends.
When run using PhantomJS 2.0.0, after some 65 requests the program (the page.open() call in the request() function) starts outputting the following:
select: Invalid argument
select: Invalid argument
select: Invalid argument
select: Invalid argument
select: Invalid argument
...
When run using PhantomJS 1.9.8, it crashes after about 200 requests with the following error.
"PhantomJS has crashed. Please read the crash reporting guide at https://github.com/ariya/phantomjs/wiki/Crash-Reporting and file a bug report at https://github.com/ariya/phantomjs/issues/new with the crash dump file attached: /tmp/2A011800-3367-4B4A-A945-3B532B4D9B0F.dmp"
I tried to send the crash report, but their guide is not very useful for me.
It's not the URLs I'm using; I have tried other URLs with the same results.
Is there something wrong with my program? I am using OS X.
var system = require('system');
var fs = require('fs');
var links = [];
links = [
    "http://somesite.com",
    "http://someothersite.com",
    // ...
];
var index = 0, fail = 0, limit = 300;
finalTime = Date.now();

var gatherLinks = function(link){
    var page = require('webpage').create();
    link = link + "/sitemap.xml";
    console.log("Fetching links from " + link);
    page.open(link, function(status){
        if(status != "success"){
            console.log("Sitemap Request FAILED, status: " + status);
            fail++;
            return;
        }
        var content = page.content;
        parser = new DOMParser();
        xmlDoc = parser.parseFromString(content, 'text/xml');
        var loc = xmlDoc.getElementsByTagName('loc');
        for(var i = 0; i < loc.length; i++){
            if(links.length < limit){
                links[links.length] = loc[i].textContent;
            } else{
                console.log(links.length + " Links prepared. Starting requests.\n");
                index = 0;
                request();
                return;
            }
        }
        if(index >= links.length){
            index = 0;
            console.log(links.length + " Links prepared\n\n");
            request();
        }
        gatherLinks(links[index++]);
    });
};

var request = function(){
    t = Date.now();
    var page = require('webpage').create();
    page.open(links[index], function(status) {
        console.log('Loading link #' + (index + 1) + ': ' + links[index]);
        console.log("Time taken: " + (Date.now() - t) + " msecs");
        if(status != "success"){
            console.log("Request FAILED, status: " + status);
            fail++;
        }
        if(index >= links.length-1){
            console.log("\n\nAll links done, final time taken: " + (Date.now() - finalTime) + " msecs");
            console.log("Requests sent: " + links.length + ", Failures: " + fail);
            console.log("Success ratio: " + ((links.length - fail)/links.length)*100 + "%");
            phantom.exit();
        }
        index++;
        request();
    });
}

gatherLinks(links[0]);
After playing around with the program, I couldn't find any particular pattern to the problems I mention above. With 2.0.0, I only once succeeded in sending 300 requests without an error. I have tried many different combinations of URLs; the program usually fails somewhere between requests 50 and 80. I keep a log of the URLs that failed, and all of them run fine when I send a single request from another PhantomJS program. 1.9.8 is much more stable, and the crash I mention above is not very frequent, but again I couldn't find any pattern to it; it still crashes once in a while.
There are lots of problems with your code. The main one is probably that you're creating a new page for every single request and never closing it afterwards. I think you're running out of memory.
I don't see a reason to create a new page for every request, so you can easily reuse a single page for all requests. Simply move the line var page = require('webpage').create(); to the global scope out of gatherLinks() and request(). If you don't want to do that, then you can call page.close() after you're done with it, but keep the asynchronous nature of PhantomJS in mind.
If the reason to use multiple page objects was to prevent cache re-use for later requests, then I have to tell you that this doesn't solve that problem. page objects in a single PhantomJS process can be regarded as tabs or windows and they share cookies and cache. If you want to isolate every request, then you will need to run every request in its own process for example through the use of the Child Process Module.
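To illustrate the single-page idea, here is a minimal sketch (using the variable names from the question) of request() with one page object created in the global scope and reused for every request; it is only meant to show the shape of the change, not to be a drop-in replacement for the whole script:
// Create the page once, in the global scope, and reuse it for every request.
var page = require('webpage').create();

var request = function(){
    var t = Date.now();
    page.open(links[index], function(status) {
        console.log('Loading link #' + (index + 1) + ': ' + links[index]);
        console.log("Time taken: " + (Date.now() - t) + " msecs");
        if(status != "success"){
            fail++;
        }
        if(index >= links.length - 1){
            // all links done
            phantom.exit();
            return;
        }
        index++;
        request();
    });
};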
There is another problem with your code. You probably wanted to write the following in gatherLinks():
if(index >= links.length){
    index = 0;
    console.log(links.length + " Links prepared\n\n");
    request();
    return; // ##### THIS #####
}
gatherLinks(links[index++]);
Related
I am making a GET request to the API using XMLHttpRequest. The API takes a "uuid" value (denoted by the variable "a" here) as a parameter, uses that value for processing, and is supposed to return some information that I am trying to print to the console.
However, the problem I am running into is that whenever the API successfully receives the uuid value, it returns a message whose status is Initiated. I want to wait until the status is Success (this usually takes a little longer, about 3-4 seconds more).
The original code is shown below:
function getrequestresults(status, response) {
    let parse = JSON.parse(response);
    let a = parse.uuid;
    let newrequest = new XMLHttpRequest();
    newrequest.open('GET', "http://localhost:5000/api/v1/dags/results" + "?" + "uuid" + "=" + a, true);
    newrequest.onload = function() {
        //console.log(newrequest.status);
        console.log(newrequest.response);
    };
    newrequest.send();
}
One of the attempts I made at fixing this was as follows:
function getrequestresults(status, response) {
    let parse = JSON.parse(response);
    let a = parse.uuid;
    let newrequest = new XMLHttpRequest();
    newrequest.open('GET', "http://localhost:5000/api/v1/dags/results" + "?" + "uuid" + "=" + a, true);
    newrequest.onload = function() {
        while (newrequest.response.status != "Success") {
            // the following line was just to see what's going on; the idea was to keep looping till the desired status is reached
            console.log(newrequest.response);
        }
        //console.log(newrequest.status);
        console.log(newrequest.response);
    };
    newrequest.send();
}
However, this did not work, since the "onload" handler only runs once, whereas I need the request to be repeated until the response's status is Success.
I am quite new to sending XMLHttpRequests, and some help would be appreciated. Thanks.
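A common way to handle this kind of "wait until the status changes" situation is to re-issue the whole request on a timer instead of looping inside onload (the while loop blocks the thread, and the response of a single request never changes after it has loaded). Below is a minimal sketch of that idea, assuming the endpoint and the shape of the JSON body are as in the question; the function name and delay are made up for illustration:
function pollForSuccess(uuid, delayMs) {
    var request = new XMLHttpRequest();
    request.open('GET', "http://localhost:5000/api/v1/dags/results" + "?uuid=" + uuid, true);
    request.onload = function() {
        // Assumes the body is JSON text with a top-level "status" field.
        var body = JSON.parse(request.response);
        if (body.status === "Success") {
            console.log(body);   // done, use the result here
        } else {
            // Not ready yet (e.g. "Initiated"); try again after a short delay.
            setTimeout(function() { pollForSuccess(uuid, delayMs); }, delayMs);
        }
    };
    request.send();
}

// e.g. called with the uuid from the first response:
// pollForSuccess(parse.uuid, 1000);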
I have the task of going through around 200k links and checking the status code of their responses. Anything other than 2xx means a problem, which means that link has to be checked manually (and added to a DB later).
The links come from a DB and are both http and https; some of them are not valid (e.g. ww.fyxs.d). I get them as JSON, something like this:
{
    "id": xxx,
    "url": xxxx
}
I went with a really simple solution which unfortunately doesn't work.
I take the links from a JSON file and then, starting from the back, send an http/https.get request, wait for the response, check and process the status code, and move on to the next link after removing the previous one from the list to preserve memory. The problem is that I keep getting 4xx almost all the time, yet if I do a GET from a REST client I get a 200 OK.
I don't know if it's possible, but I only need the correct status code; I'm not interested in the body, hence the HEAD method. I also tried method: 'GET' (still wrong status codes) and http/https.request (I don't even get a response).
Here is my code:
var https = require('https');
var http = require('http');
var urlMod = require('url');
var links = require('./links_to_check.json').links_to_check;
var callsRemaining = links.length;
var current = links.length - 1;

startQueue();

function startQueue(){
    getCode(links[current].url);
    current--;
}

function getCode(url){
    var urlObj = urlMod.parse(url);
    var options = {
        method: 'HEAD',
        hostName: urlObj.host,
        path: urlObj.path
    };
    var httpsIndex = url.indexOf('https');
    if(httpsIndex > -1 && httpsIndex < 5){
        https.get(options, function(response){
            proccessResponse(response.statusCode);
        }).on('error', (e) => {
            startQueue();
        });
    } else {
        if(url.indexOf('http:') < 0) return;
        http.get(options, function(response){
            proccessResponse(response.statusCode);
        }).on('error', (e) => {
            startQueue();
        });
    }
}

function proccessResponse(responseCode){
    console.log("response => " + responseCode);
    if(responseCode != 200){
        errorCount++;
    }
    ResponseReady();
}

function ResponseReady(){
    --callsRemaining;
    if(callsRemaining <= 0){
        //Proccess error when done
    }
    links.pop();
    startQueue();
}
I would really appreciate some help. When we succeed I will publish it as a module, so if someone needs to check a set of links they can just use it :)
After we solve this I was thinking of using async.map, splitting the links into chunks, and running the analysis in parallel so it's faster. The current process, written in shell, takes around 36 hours.
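For reference, here is a minimal, self-contained sketch of a single HEAD status check with Node's http/https modules. Note that the request options key is hostname (all lowercase), so a key named hostName is ignored, and url.parse() already gives you hostname and path. This only illustrates the request itself, not the queue or parallelism, and the helper name is made up:
var http = require('http');
var https = require('https');
var urlMod = require('url');

// checkStatus: issue one HEAD request and report the status code (or an error).
function checkStatus(url, done) {
    var parsed = urlMod.parse(url);
    var lib = parsed.protocol === 'https:' ? https : http;
    var req = lib.request({
        method: 'HEAD',
        hostname: parsed.hostname,
        path: parsed.path || '/'
    }, function (res) {
        res.resume();          // discard any body
        done(null, res.statusCode);
    });
    req.on('error', function (e) { done(e); });
    req.end();                 // http.request() must be ended explicitly
}

// usage:
checkStatus('http://example.com/', function (err, code) {
    console.log(err ? err.message : 'status ' + code);
});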
Say my company serves a large log file (4+ GB), where the most recent logs are at the top. I want to build a webpage to search that file for a keyword "Mike". Bandwidth is not a restriction, but this webpage can only be static files (i.e. no server-side functionality).
Example log file:
Joe completed Task 1234 on 2013-10-10
Joe completed Task 1235 on 2013-10-11
Mike completed Task 1236 on 2013-10-11
Joe completed Task 1237 on 2013-10-13
...
Obviously, I can't put the entire file into memory in the browser, so I'm trying to find a way to request the file, search the data as it gets downloaded, then throw away non-relevant data to save memory. I am using the xhr.onprogress event to get the partially downloaded log file via xhr.responseText and search that, but I can't reset the responseText after I'm done reading it.
Here's my algorithm so far:
var xhr = new XMLHttpRequest();
xhr.onprogress = function(e){
    var cur_len = xhr.responseText.length;
    var found_mike = xhr.responseText.indexOf("Mike") != -1 ? true : false;
    xhr.responseText = ""; //clear responseText to save memory
    console.log("%d - %s - %d", cur_len, found_mike, xhr.responseText.length);
};
xhr.open("get", "mylogfile.txt", true);
xhr.send();
I would expect the console to say something like 234343 - false - 0, but instead I get 234343 - false - 234343, and the browser runs out of memory (since responseText isn't being cleared).
Is there a way I can discard the responseText so that the browser can download and process a file without holding the entire file in memory?
EDIT: Also, if responseText is read-only, why doesn't it throw an error/warning?
I asked a friend, and he had a great answer: Range headers (stackoverflow question, jsfiddle).
var chunk_size = 100000; //100kb chunks
var regexp = /Mike/g;
var mikes = [];

function next_chunk(pos, file_len){
    if(pos > file_len){
        return;
    }
    var chunk_end = pos + chunk_size < file_len ? pos + chunk_size : file_len;
    var xhr = new XMLHttpRequest();
    xhr.onreadystatechange = function(){
        if(xhr.readyState == 4 && xhr.status == 206){
            //push mikes to result
            while ((match = regexp.exec(xhr.responseText)) != null) {
                mikes.push(pos + match.index);
            }
            //request next chunk
            file_len = parseInt(xhr.getResponseHeader("Content-Range").split("/")[1]);
            next_chunk(chunk_end + 1, file_len);
        }
    };
    xhr.open("get", "mylogfile.txt", true);
    xhr.setRequestHeader("Range", "bytes=" + pos + "-" + chunk_end);
    xhr.send();
}

next_chunk(0, chunk_size);
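As a side note, browsers that support the Fetch API and readable streams can express the same "search while downloading, keep almost nothing in memory" idea without Range requests. This is a different technique from the Range-header approach above, sketched here only for comparison; the function name is made up, and a small overlap is kept so a keyword split across chunks isn't missed:
function streamSearch(url, keyword) {
    return fetch(url).then(function (response) {
        var reader = response.body.getReader();
        var decoder = new TextDecoder();
        var carry = "";   // tail of the previous chunk, shorter than the keyword
        var hits = 0;

        function pump() {
            return reader.read().then(function (result) {
                if (result.done) { return hits; }
                var text = carry + decoder.decode(result.value, { stream: true });
                hits += text.split(keyword).length - 1;  // count occurrences in this window
                carry = text.slice(-(keyword.length - 1));
                return pump();
            });
        }
        return pump();
    });
}

// usage:
streamSearch("mylogfile.txt", "Mike").then(function (count) {
    console.log("found " + count + " occurrences");
});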
So I'm trying to write a hook into ESPN fantasy football's HTML lite draft page to cross-reference player ranking lists (from a CSV file) and eliminate already-drafted players from the available pool. I've done this by hand in the past, but with a 16-team draft, by the late rounds it's nearly impossible to keep up, since by then no one really knows who the players are.
I'm very much a Javascript and PhantomJS newbie, so please don't laugh.
At this point, I can see the page.onResourceReceived metadata in my console as the AJAX polls the PhantomJS instance, but I can't figure out how to access the data actually being received by the "browser". According to Chrome's inspector (the "Preview" tab under the Network tab), either a time-sync signal or the data for the player who was just drafted is being sent to the browser in JSON format.
Long story short, how do I get the actual JSON data when I receive the page.onResourceReceived metadata?
(P.S. I know I commented out phantom.exit(); that's to keep the script from terminating after the redirect and onLoad is complete--I need to keep it running to listen for the draft updates)
var draft = 'http://games.espn.go.com/ffl/htmldraft?leagueId=1246633&teamId=8&fromTeamId=8';
var draftURL = encodeURIComponent(draft);
var page = require('webpage').create(),
    server = 'https://r.espn.go.com/espn/memberservices/pc/login',
    data = 'SUBMIT=1&failedLocation=&aff_code=espn_fantgames&appRedirect=' + draftURL + '&cookieDomain=.go.com&multipleDomains=true&username=[redacted]&password=[redacted]&submit=Sign+In';

page.onResourceReceived = function (response) {
    console.log('Response (#' + response.id + ', stage "' + response.stage + '"): ' + JSON.stringify(response));
};

page.open(server, 'post', data, function (status) {
    if (status !== 'success') {
        console.log('Unable to post!');
    } else {
        page.render('example.png');
        //console.log(page.content)
    }
    //phantom.exit();
});
The following version of your script will just grab and return the entire contents of the URL you are accessing. I don't think you are really going to get useful JSON data, just an HTML page, unless I'm missing something. In my tests, all I get is HTML:
var draft = 'http://games.espn.go.com/ffl/htmldraft?leagueId=1246633&teamId=8&fromTeamId=8';
var draftURL = encodeURIComponent(draft);
var page = require('webpage').create(),
    server = 'https://r.espn.go.com/espn/memberservices/pc/login',
    data = 'SUBMIT=1&failedLocation=&aff_code=espn_fantgames&appRedirect=' + draftURL + '&cookieDomain=.go.com&multipleDomains=true&username=[redacted]&password=[redacted]&submit=Sign+In';

page.open(server, 'post', data, function (status) {
    if (status == 'success') {
        var delay, checker = (function() {
            var html = page.evaluate(function () {
                var body = document.getElementsByTagName('body')[0];
                return document.getElementsByTagName('html')[0].outerHTML;
            });
            if (html) {
                clearInterval(delay); // delay holds an interval id, so clear it with clearInterval
                console.log(html);
                phantom.exit();
            }
        });
        delay = setInterval(checker, 100);
    } else {
        phantom.exit();
    }
});
Currently, PhantomJS doesn't include the response body in onResourceReceived events.
You could instead use SlimerJS, which mirrors the PhantomJS API but does allow you to access response.body (which should have the JSON data). Example here:
http://darrendev.blogspot.jp/2013/11/saving-downloaded-files-in-slimerjs-and.html
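A minimal sketch of what that looks like in SlimerJS, based on that article; the content-type regexp is an assumption about what the draft endpoint returns, and the login POST from the question would still have to happen first:
var page = require('webpage').create();

// SlimerJS-specific: capture the body of responses whose Content-Type
// matches one of these regexps, so it shows up as response.body below.
page.captureContent = [ /application\/json/ ];

page.onResourceReceived = function (response) {
    if (response.stage === 'end' && response.body) {
        console.log('Captured body: ' + response.body);
    }
};

page.open('http://games.espn.go.com/ffl/htmldraft?leagueId=1246633&teamId=8&fromTeamId=8');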
Alternatively, you could write a chrome extension and create a content script that grabs the data.
I have a large, complex web app with thousands of lines of JavaScript. There is a small set of intermittent JavaScript bugs that are reported by users.
I think these are epiphenomena of race conditions: something has not initialised correctly, and the JavaScript crashes, causing 'downstream' JS not to run.
Is there any way to get JavaScript execution crashes logged back server-side?
All the JS logging libraries like Blackbird and Log4JavaScript are client-side only.
I have written a remote error logging function using window.onerror as suggested by #pimvdb
Err = {};
Err.Remoterr = {};
Err.Remoterr.onerror = function (msg, errorfileurl, lineno) {
    var jsonstring, response, pageurl, cookies;
    // Get some user input
    response = prompt("There has been an error. " +
        "It has been logged and will be investigated.",
        "Put in comments (and e-mail or phone number for" +
        " response.)");
    // Get some context of where and how the error occurred
    // to make debugging easier
    pageurl = window.location.href;
    cookies = document.cookie;
    // Make the JSON message we are going to post
    // Could use JSON.stringify() here if you are sure that
    // JSON will have run when the error occurs
    // http://www.JSON.org/js.html
    jsonstring = "{\"set\": {\"jserr\": " +
        "{\"msg\": \"" + msg + "\", " +
        "\"errorfileurl\": \"" + errorfileurl + "\", " +
        "\"pageurl\": \"" + pageurl + "\", " +
        "\"cookies\": \"" + cookies + "\", " +
        "\"lineno\": \"" + lineno + "\", " +
        "\"response\": \"" + response + "\"}}}";
    // Use the jQuery cross-browser post
    // http://api.jquery.com/jQuery.post/
    // This assumes that no errors happen before jQuery has initialised
    $.post("?jserr", jsonstring, null, "json");
    // I don't want the page to 'pretend' to work,
    // so I am going to return 'false' here.
    // Returning 'true' would clear the error in the browser.
    return false;
};
window.onerror = Err.Remoterr.onerror;
I deploy this between the head and body tags of the webpage.
You will want to change the JSON and the URL that you post it to depending on how you are going to log the data server side.
Take a look at https://log4sure.com (disclosure: I created it). Check it out and decide for yourself: it allows you to log errors/events and also lets you create your own custom log table. It also lets you monitor your logs in real time. And the best part: it's free.
You can also use Bower to install it: bower install log4sure.
The setup code is really easy too:
// setup
var _logServer;
(function() {
    var ls = document.createElement('script');
    ls.type = 'text/javascript';
    ls.async = true;
    ls.src = 'https://log4sure.com/ScriptsExt/log4sure.min.js';
    var s = document.getElementsByTagName('script')[0];
    s.parentNode.insertBefore(ls, s);
    ls.onload = function() {
        // use your token here.
        _logServer = new LogServer("use-your-token-here");
    };
})();

// example for logging text
_logServer.logText("your log message goes here.")
// example for logging an error
divide = function(numerator, divisor) {
    try {
        // throw if either argument is not a usable number
        if (isNaN(parseFloat(numerator)) || isNaN(parseFloat(divisor))) {
            throw new TypeError("Invalid input", "myfile.js", 12, {
                numerator: numerator,
                divisor: divisor
            });
        } else {
            if (divisor == 0) {
                throw new RangeError("Divide by 0", "myfile.js", 15, {
                    numerator: numerator,
                    divisor: divisor
                });
            }
        }
    } catch (e) {
        _logServer.logError(e.name, e.message, e.stack);
    }
}
// another use of logError in window.onerror
// be careful with window.onerror: you might be overwriting someone else's window.onerror functionality,
// and someone else can overwrite yours.
window.onerror = function(msg, url, line, column, err) {
    // may want to check if url belongs to your javascript file
    var data = {
        url: url,
        line: line,
        column: column
    };
    _logServer.logError(err.name, err.message, err.stack, data);
};
// example for custom logs
var foo = "some variable value";
var bar = "another variable value";
var flag = "false";
var temp = "yet another variable value";
_logServer.log(foo, bar, flag, temp);