Running JS in a killable 'thread'; detecting and canceling long-running processes - javascript

Summary: How can I execute a JavaScript function, but then terminate (kill) it if it does not finish within a timeframe (e.g. 2 seconds)?
Details
I'm writing a web application for interactively writing and testing PEG grammars. Unfortunately, the JavaScript library I'm using for PEG parsing has a 'bug' where certain poorly-written or unfinished grammars cause infinite execution (a runaway script that some browsers don't even detect). You can be happily typing along, working on your grammar, when suddenly the browser locks up and you lose all your hard work.
Right now my code is (very simplified):
grammarTextarea.onchange = generateParserAndThenParseInputAndThenUpdateThePage;
I'd like to change it to something like:
grammarTextarea.onchange = function(){
  var result = runTimeLimited( generateParserAndThenParseInput, 2000 );
  if (result) updateThePage();
};
I've considered using an iframe or other tab/window to execute the content, but even this messy solution is not guaranteed to work in the latest versions of major browsers. However, I'm happy to accept a solution that works only in latest versions of Safari, Chrome, and Firefox.

Web workers provide this capability (as long as the long-running function does not need access to the window, the document, or closures), albeit in a somewhat cumbersome manner. Here's the solution I ended up with:
main.js
var worker, activeMsgs, userTypingTimeout, deathRowTimer;
killWorker(); // Also creates the first one

grammarTextarea.onchange = grammarTextarea.oninput = function(){
  // Wait until the user has not typed for 500ms before parsing
  clearTimeout(userTypingTimeout);
  userTypingTimeout = setTimeout(askWorkerToParse,500);
};

function askWorkerToParse(){
  worker.postMessage({action:'parseInput',input:grammarTextarea.value});
  activeMsgs++;                                // Another message is in flight
  clearTimeout(deathRowTimer);                 // Restart the timer
  deathRowTimer = setTimeout(killWorker,2000); // It must finish quickly
}

function killWorker(){
  if (worker) worker.terminate();   // This kills the thread
  worker = new Worker('worker.js'); // Create a new worker thread
  activeMsgs = 0;                   // No messages are pending on this new one
  worker.addEventListener('message',handleWorkerResponse,false);
}

function handleWorkerResponse(evt){
  // If this was the last message in flight, it responded in time: it gets to live.
  if (--activeMsgs==0) clearTimeout(deathRowTimer);
  // **Process the evt.data.results from the worker**
}
worker.js
importScripts('utils.js'); // Each worker is a blank slate; must load libs
self.addEventListener('message',function(evt){
  var data = evt.data;
  switch(data.action){
    case 'parseInput':
      // Actually do the work (which sometimes goes bad and locks up)
      var parseResults = parse(data.input);
      // Send the results back to the main thread.
      self.postMessage({kind:'parse-results',results:parseResults});
      break;
  }
},false);

Related

What's the correct way to handle removing a potentially busy file in NodeJS?

I have a NodeJS server managing some files. It's going to watch for a known filename from an external process and, once received, read it and then delete it. However, the read/delete is sometimes attempted before the file has been "unlocked" from previous use, so the attempt will occasionally fail. What I'd like to do is retry that file as soon as possible, either the moment it's freed or continuously at a fast pace.
I'd rather avoid a long sleep where possible, because this needs to be handled ASAP and every second counts.
fs.watchFile(input_json_file, {interval: 10}, function(current_stats, previous_stats) {
  var json_data = "";
  try {
    var file_cont = fs.readFileSync(input_json_file); // < TODO: async this
    json_data = JSON.parse(file_cont.toString());
    fs.unlinkSync(input_json_file); // sync, so an EBUSY error lands in this catch
  } catch (error) {
    console.log("The JSON in the file could not be parsed. File will continue to be watched.");
    console.log(error);
    return;
  }
  // Else, this has loaded properly.
  fs.unwatchFile(input_json_file);
  // ... do other things with the file's data.
});
// set a timeout for the file watching, just in case
setTimeout(fs.unwatchFile, CLEANUP_TIMEOUT, input_json_file);
I expect "EBUSY: resource busy or locked" to turn up occasionally, but fs.watchFile isn't always called when the file is unlocked.
I thought of creating a function and then calling it with a delay of 1-10ms, where it could call itself if that fails too, but that feels like a fast route to a... cough stack overflow.
I'd also like to steer clear of synchronous methods so that this scales nicely, but being relatively new to NodeJS all the callbacks are starting to turn into a maze.
This may be overkill for this use case, but you can create your own fs with full control; other programs would then write their data directly to your program. Search for the keywords fuse and fuse-binding.
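As a side note on the retry idea from the question: scheduling the next attempt with setTimeout does not grow the call stack, because each attempt starts from a fresh event-loop turn. A minimal sketch, assuming a retry-on-EBUSY policy (unlinkWithRetry and its parameters are illustrative, not from the original post):

var fs = require('fs');

// Retry fs.unlink while the file reports EBUSY, up to retriesLeft attempts.
function unlinkWithRetry(path, retriesLeft, delayMs, done) {
  fs.unlink(path, function (err) {
    if (!err) return done(null); // deleted successfully
    if (err.code !== 'EBUSY' || retriesLeft <= 0) return done(err);
    // File is still locked: try again shortly. setTimeout schedules the next
    // attempt on the event loop, so this is not stack-growing recursion.
    setTimeout(function () {
      unlinkWithRetry(path, retriesLeft - 1, delayMs, done);
    }, delayMs);
  });
}

unlinkWithRetry('input.json', 100, 10, function (err) {
  if (err) console.log('Gave up deleting the file:', err);
});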

The "await" property of "async" function sleeps after an instance - Javascript

I am working on a scraper. I am using PhantomJS along with NodeJS. PhantomJS loads the page with an async function, like: var status = await page.open(url). Sometimes, because of a slow connection, the page takes longer to load, and after a while the page status is simply never returned, so there is no way to check whether it loaded or not. page.open() just sleeps, returns nothing at all, and all the remaining execution is left waiting.
So, my basic question is: is there any way to time out or otherwise keep this page.open(url) from blocking forever, since the execution of the rest of the code waits until the page is loaded?
My Code is
const phantom = require('phantom');

ph_instance = await phantom.create();
ph_page = await ph_instance.createPage();
var status = await ph_page.open("https://www.cscscholarship.org/");
if (status == 'success') {
  console.log("Page is loaded successfully!");
  // do more stuff
}
From your comment, it seems like it might be timing out (because of slow internet sometimes)... you can validate this by adding the onResourceTimeout method to your code (link: http://phantomjs.org/api/webpage/handler/on-resource-timeout.html)
It would look something like this:
ph_page.onResourceTimeout = (request) => {
  console.log('Timeout caught: ' + JSON.stringify(request));
};
And if that ends up being true, you can increase the default resource timeout settings (link: http://phantomjs.org/api/webpage/property/settings.html) like this:
ph_page.settings.resourceTimeout = 60000; // 60 seconds
Edit: I know the question is about Phantom, but I also want to mention another framework I've used for scraping projects before, called Puppeteer (link: https://pptr.dev/). I personally found that its APIs are easier to understand and code against, and it's a currently maintained project, unlike PhantomJS, which is not maintained anymore (its last release was two years ago).
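For comparison, a minimal Puppeteer sketch of the same check with an explicit navigation timeout (illustrative only, not from the original answer; the URL is reused from the question):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  try {
    // goto rejects if the page does not finish loading within the timeout
    await page.goto('https://www.cscscholarship.org/', { timeout: 60000 });
    console.log('Page is loaded successfully!');
    // do more stuff
  } catch (err) {
    console.log('Navigation failed or timed out: ' + err.message);
  } finally {
    await browser.close();
  }
})();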

NodeJS - Memory/CPU management with MongoJS Stream

I'm parsing a fairly large dataset from MongoDB (of about 40,000 documents, each with a decent amount of data inside).
The stream is being accessed like so:
var cursor = db.domains.find({ html: { $exists: true } });
cursor.on('data', function(rec) {
  i++;
  var url = rec.domain;
  var $ = cheerio.load(rec.html);
  checkList($, rec, url, i);
  // This "checkList" function parses HTML data with Cheerio to find
  // different elements on the page. Lots of if/else statements.
});
cursor.on('end', function(){
  console.log("Streamed all objects!");
});
Each record gets parsed with Cheerio (the record contains HTML data from a page scraped earlier) and then I process the Cheerio data to look for various selectors, then saved back to MongoDB.
For the first ~2,000 objects the data is parsed quite quickly (in ~30 seconds). After that it becomes far slower, around 50 records being parsed per second.
Looking at my MacBook Air's Activity Monitor, I see that it's not using a crazy amount of memory (226.5 MB of 8 GB RAM), but it is using a whole lot of CPU (io.js is taking up 99% of my CPU).
Is this a possible memory leak? The checkList function isn't particularly intensive (or at least, as far as I can tell: there are quite a few nested if/else statements but not much else).
Am I meant to be clearing my variables after they're used, like setting $ = '' or similar? Is there any other reason why Node would be using so much CPU?
You basically need to "pause" the stream, or otherwise "throttle" it, so that it does not execute on every data item received straight away. The code in the "data" event handler does not wait for completion before the next event fires, unless you stop the events from emitting.
var cursor = db.domains.find({ html: { $exists: true } });
cursor.on('data', function(rec) {
  cursor.pause(); // stop processing new events
  i++;
  var url = rec.domain;
  var $ = cheerio.load(rec.html);
  checkList($, rec, url, i);
  // if checkList() is synchronous then resume here
  cursor.resume(); // start events again
});
cursor.on('end', function(){
  console.log("Streamed all objects!");
});
If checkList() contains async methods, then pass in the cursor
checkList($, rec, url, i, cursor);
And process the "resume" inside:
function checkList(data, rec, url, i, cursor) {
  somethingAsync(args, function(err, result) {
    // We're done
    cursor.resume(); // start events again
  });
}
The "pause" stops the events emitting from the stream until the "resume" is called. This means your operations don't "stack up" in memory and wait for each to complete.
You probably want more advanced flow control for some parallel processing, but this is basically how you do it with streams.
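As a rough sketch of that kind of flow control (illustrative only: checkListAsync stands in for an async version of checkList, and the concurrency limit is arbitrary), keep a counter of in-flight operations, pause the cursor when the limit is reached, and resume as each one finishes:

var MAX_IN_FLIGHT = 5; // assumed limit on concurrent checkList operations
var inFlight = 0;

var cursor = db.domains.find({ html: { $exists: true } });
cursor.on('data', function (rec) {
  inFlight++;
  if (inFlight >= MAX_IN_FLIGHT) cursor.pause(); // apply backpressure
  checkListAsync(rec, function () {
    inFlight--;
    cursor.resume(); // a no-op when the stream is already flowing
  });
});
cursor.on('end', function () {
  console.log("Streamed all objects!");
});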

HTML5 Webworker Startup Synchronization Guarantees

I have a bit of javascript I want to run in a webworker, and I am having a hard time understanding the correct approach to getting them to work in lock-step. I invoke the WebWorker from the main script as in the following simplified script:
// main.js
werker = new Worker("webWorkerScaffold.js");

// #1
werker.onmessage = function(msgObj){
  console.log("Worker Reply");
  console.log(msgObj);
  doSomethingWithMsg(msgObj);
};

werker.onerror = function(err){
  console.log("Worker Error:");
  console.log(err);
};

werker.postMessage({msg: "begin"}); // an object, to match the worker's msgObj.data.msg check
Then the complementary worker script looks like the following:
// webWorkerScaffold.js
var doWorkerStuffs = function(msg){}; // Omitted

// #2
onmessage = function (msgObj){
  // Messages in will always be JSON
  if (msgObj.data.msg === "begin")
    doWorkerStuffs();
};
This code (the actual version) works as expected, but I am having a difficult time confirming it will always perform correctly. Consider the following:
1. The "new Worker()" call is made, spawning a new thread.
2. The spawned thread is slow to load (let's say it hangs at "// #2").
3. The parent thread does "werker.postMessage..." with no recipient.
4. ... ?
The same applies in the reverse direction, where I might change the worker script to signal outward once it is set up internally; under that scenario the main thread could hang at "// #1" and miss the incoming message, as it doesn't have its comms up yet.
Is there some way to guarantee that these scripts move forward in a lock-step way?
What I am really looking for is a zmq-like REP/REQ semantic, where one or the other blocks (or calls back) when 1:1 transactions can take place.
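One common pattern, sketched here under the assumption that an explicit handshake is acceptable (none of this is from the original post): the worker announces readiness as the last step of its setup, and the main thread queues outbound messages until that announcement arrives. Browsers do buffer messages posted to a dedicated worker before its handler is installed, but the handshake makes the ordering explicit and gives a REQ/REP-like pairing point:

// main.js (sketch)
var werker = new Worker("webWorkerScaffold.js");
var workerReady = false;
var outbox = []; // messages held until the worker says it is ready

werker.onmessage = function (msgObj) {
  if (msgObj.data.msg === "ready") {
    workerReady = true;
    while (outbox.length) werker.postMessage(outbox.shift());
    return;
  }
  doSomethingWithMsg(msgObj);
};

function sendToWorker(payload) {
  if (workerReady) werker.postMessage(payload);
  else outbox.push(payload);
}

sendToWorker({msg: "begin"});

// webWorkerScaffold.js (sketch)
var doWorkerStuffs = function(msg){}; // Omitted
onmessage = function (msgObj) {
  if (msgObj.data.msg === "begin") doWorkerStuffs();
};
postMessage({msg: "ready"}); // handlers are installed; announce readiness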

Background tasks in JavaScript that do not interfere with UI interaction

I am working on a mobile App implemented in JavaScript that has to do lots of background work -- mainly fetching data from a server (via JSONP) and writing it into a database using local storage. In the foreground, the user may navigate through the locally stored data and thus expects a smoothly reacting App.
This interaction (fetching data from the server via JSONP and, after doing some minor work with it, storing it in the local database) is done asynchronously, but because of the DOM manipulation necessary for JSONP and the database interaction, I am not able to do this work in WebWorkers. Thus I run into the problem of blocking the JavaScript event queue with lots of "background processing requests", and the App reacts very slowly (or even locks up completely).
Here is a small sketch of the way I am doing the background stuff:
var pendingTasksOnNewObjects = new Object();

// add callbacks to queue
function addTaskToQueue(id, callback) {
  if (!pendingTasksOnNewObjects[id]) {
    pendingTasksOnNewObjects[id] = new Array();
  }
  pendingTasksOnNewObjects[id].push(callback);
}

// example for these pending tasks:
var myExampleTask = function () {
  this.run = function(jsonObject) {
    // do interesting stuff here as soon as a specific new object arrives,
    // e.g. update some elements in the DOM to reflect the newly arrived object
  };
};
addTaskToQueue("id12345", new myExampleTask()); // queue an instance: runTasks() calls .run() on it

// method to fetch documents from the server via JSONP
function getDocuments(idsAsCommaSeparatedString) {
  var elem;
  elem = document.createElement("script");
  elem.setAttribute("type", "text/javascript");
  elem.setAttribute("src", "http://serverurl/servlet/myServlet/documents/"+idsAsCommaSeparatedString);
  document.head.appendChild(elem);
}

// "callback" method that is used in the JSONP result to process the data
function locallyStoreDocuments(jsonArray) {
  var i;
  for (i=0; i<jsonArray.length; i++) {
    var obj = jsonArray[i];
    storeObjectInDatabase(obj);
    runTasks(obj.docID, obj);
  }
  return true;
}

// process tasks when a new object arrives
function runTasks(id, jsonObject) {
  if (pendingTasksOnNewObjects[id]) {
    while (pendingTasksOnNewObjects[id][0]) {
      pendingTasksOnNewObjects[id][0].run(jsonObject);
      pendingTasksOnNewObjects[id].shift();
    }
  }
  return true;
}
I have already looked around and read some material on the event processing mechanisms in JavaScript (e.g. John Resig's description of working with timers), but what I really want to do is the following: check if there is something to do in the event queue. If so, wait for 100ms and check again. If nothing is currently waiting in the queue, process a small amount of data and check again.
As far as I read, this is not possible, since JavaScript does not have direct access to the event queue.
So, are there any design ideas on how to optimize my code so that it gets closer to the goal of not interfering with UI events, giving the user a smooth App interaction?
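One widely used workaround, since the queue itself cannot be inspected, is to slice the background work into small chunks and yield back to the event loop between chunks via setTimeout, so pending UI events get a chance to run in between. A minimal sketch (processInChunks and its parameters are illustrative, not part of the original code):

// Process `items` in small chunks, yielding to the UI between chunks.
function processInChunks(items, chunkSize, processOne, onDone) {
  var index = 0;
  function doChunk() {
    var end = Math.min(index + chunkSize, items.length);
    for (; index < end; index++) {
      processOne(items[index]);
    }
    if (index < items.length) {
      // Yield: queued UI events run before the next chunk is scheduled.
      setTimeout(doChunk, 0);
    } else if (onDone) {
      onDone();
    }
  }
  doChunk();
}

// Usage inside the JSONP callback, instead of the tight for-loop:
// processInChunks(jsonArray, 10, function (obj) {
//   storeObjectInDatabase(obj);
//   runTasks(obj.docID, obj);
// });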
Web Workers seem to be available in iOS 5, http://caniuse.com/webworkers
