HTML5 Webworker Startup Synchronization Guarantees - javascript

I have a bit of JavaScript I want to run in a web worker, and I am having a hard time understanding the correct approach to getting them to work in lock-step. I invoke the worker from the main script as in the following simplified script:
// main.js
var werker = new Worker("webWorkerScaffold.js");
// #1
werker.onmessage = function(msgObj){
    console.log("Worker Reply:");
    console.log(msgObj);
    doSomethingWithMsg(msgObj);
};
werker.onerror = function(err){
    console.log("Worker Error:");
    console.log(err);
};
werker.postMessage({msg: "begin"});
Then the complementary worker script looks like the following:
// webWorkerScaffold.js
var doWorkerStuffs = function(msg){}; // Omitted
// #2
onmessage = function (msgObj){
    // Messages in will always be JSON
    if (msgObj.data.msg === "begin")
        doWorkerStuffs();
};
This code (the actual version) works as expected, but I am having a difficult time confirming it will always perform correctly. Consider the following:
The "new Worker()" call is made, spawning a new thread.
The spawned thread is slow to load (let's say it hangs at "// #2").
The parent thread does "werker.postMessage..." with no recipient
... ?
The same applies in the reverse direction: if I changed the worker script to send a message outward once it is set up internally, the main thread could still be at "// #1" and miss the incoming message because it doesn't have its handlers up yet.
Is there some way to guarantee that these scripts move forward in a lock-step way?
What I am really looking for is a zmq-like REP/REQ semantic, where one or the other blocks (or calls back) when 1:1 transactions can take place.
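For what it's worth, one pattern that approximates that REQ/REP lock-step without leaning on the browser's internal message queueing is a ready handshake: the worker announces itself once its handlers are installed, and the main thread queues outbound messages until that announcement arrives. A minimal sketch (sendToWorker, pendingMessages, and the "ready" message are illustrative names, not part of the Worker API):
// main.js (handshake sketch)
var werker = new Worker("webWorkerScaffold.js");
var workerReady = false;
var pendingMessages = [];
werker.onmessage = function(msgObj){
    if (msgObj.data.msg === "ready") { // the worker's handlers are up
        workerReady = true;
        pendingMessages.forEach(function(m){ werker.postMessage(m); });
        pendingMessages = [];
        return;
    }
    doSomethingWithMsg(msgObj);
};
function sendToWorker(m){
    if (workerReady) werker.postMessage(m);
    else pendingMessages.push(m); // hold until the worker says hello
}
sendToWorker({msg: "begin"});

// webWorkerScaffold.js (handshake sketch)
onmessage = function(msgObj){
    if (msgObj.data.msg === "begin")
        doWorkerStuffs();
};
postMessage({msg: "ready"}); // announce readiness as the last step of setup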

Related

What's the correct way to handle removing a potentially busy file in NodeJS?

I have a NodeJS server managing some files. It watches for a known filename from an external process and, once the file appears, reads it and then deletes it. However, sometimes the read/delete is attempted before the file has been "unlocked" from its previous use, so the operation will occasionally fail. What I'd like to do is retry the file as soon as possible: either the moment it becomes free, or by polling at a fast pace.
I'd rather avoid a long sleep where possible, because this needs to be handled ASAP and every second counts.
fs.watchFile(input_json_file, {interval: 10}, function(current_stats, previous_stats) {
    var json_data = "";
    try {
        var file_cont = fs.readFileSync(input_json_file); // < TODO: async this
        json_data = JSON.parse(file_cont.toString());
        fs.unlinkSync(input_json_file);
    } catch (error) {
        console.log("The JSON in the file could not be parsed. File will continue to be watched.");
        console.log(error);
        return;
    }
    // Else, this has loaded properly.
    fs.unwatchFile(input_json_file);
    // ... do other things with the file's data.
});
// set a timeout for the file watching, just in case
setTimeout(fs.unwatchFile, CLEANUP_TIMEOUT, input_json_file);
I expect "EBUSY: resource busy or locked" to turn up occasionally, but fs.watchFile isn't always called when the file is unlocked.
I thought of creating a function and then calling it with a delay of 1-10ms, where it could call itself if that fails too, but that feels like a fast route to a... cough stack overflow.
I'd also like to steer clear of synchronous methods so that this scales nicely, but being relatively new to NodeJS, all the callbacks are starting to turn into a maze.
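For reference, a timer-based retry along those lines wouldn't actually overflow the stack: setTimeout schedules each attempt from the event loop with a fresh call stack. A minimal async sketch, assuming EBUSY is the error code surfaced while the file is locked (tryConsume and MAX_ATTEMPTS are illustrative names):
var fs = require('fs');
var MAX_ATTEMPTS = 500; // ~5 seconds at 10ms per attempt
function tryConsume(file, attempt){
    fs.readFile(file, function(err, buf){
        if (err) {
            if (err.code === "EBUSY" && attempt < MAX_ATTEMPTS) {
                // no recursion on the stack; each attempt starts fresh
                setTimeout(tryConsume, 10, file, attempt + 1);
            } else {
                console.log("Giving up on " + file, err);
            }
            return;
        }
        var json_data;
        try {
            json_data = JSON.parse(buf.toString());
        } catch (parseError) {
            // possibly a partial write; retry shortly
            setTimeout(tryConsume, 10, file, attempt + 1);
            return;
        }
        fs.unlink(file, function(){}); // best-effort delete
        // ... do other things with json_data
    });
}
tryConsume(input_json_file, 0);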
It may be overkill for this case, but you can create your own filesystem with full control; other programs would then write data directly to your program. Search for the keywords fuse and fuse-binding.

The "await" property of "async" function sleeps after an instance - Javascript

I am working on a scraper using PhantomJS along with NodeJS. PhantomJS loads the page with an async function, like: var status = await page.open(url). Sometimes, because of a slow connection, the page takes too long to load and the page status is never returned, so there is no way to check whether it loaded or not. The page.open() call just sleeps, never returns anything, and all subsequent execution waits.
So, my basic question is: is there any way to keep this page.open(url) alive, since the execution of the rest of the code waits until the page is loaded?
My code is:
const phantom = require('phantom');

// wrapped in an async IIFE so the awaits are legal at this level
(async () => {
    const ph_instance = await phantom.create();
    const ph_page = await ph_instance.createPage();
    const status = await ph_page.open("https://www.cscscholarship.org/");
    if (status === 'success') {
        console.log("Page is loaded successfully!");
        // do more stuff
    }
})();
From your comment, it seems like it might be timing out (because of the slow internet connection sometimes)... you can validate this by adding the onResourceTimeout handler to your code (link: http://phantomjs.org/api/webpage/handler/on-resource-timeout.html)
It would look something like this:
ph_page.onResourceTimeout = (request) => { // onResourceTimeout is a page-level handler
    console.log('Timeout caught: ' + JSON.stringify(request));
};
And if that ends up being true, you can increase the default resource timeout settings (link: http://phantomjs.org/api/webpage/property/settings.html) like this:
ph_page.settings.resourceTimeout = 60000; // 60 seconds
Edit: I know the question is about Phantom, but I also wanted to mention another framework I've used for scraping projects, called Puppeteer (link: https://pptr.dev/). I personally found its APIs easier to understand and code against, and it's a currently maintained project, unlike PhantomJS, which is not maintained anymore (its last release was two years ago).
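For comparison, here is a minimal Puppeteer sketch of the same fetch with an explicit navigation timeout (the URL and 60-second limit mirror the question; page.goto rejects with a TimeoutError when the limit is exceeded, so the failure is catchable instead of hanging):
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    try {
        await page.goto('https://www.cscscholarship.org/', { timeout: 60000 });
        console.log('Page is loaded successfully!');
        // do more stuff
    } catch (err) {
        console.log('Navigation failed or timed out: ' + err.message);
    } finally {
        await browser.close();
    }
})();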

NodeJS - Memory/CPU management with MongoJS Stream

I'm parsing a fairly large dataset from MongoDB (of about 40,000 documents, each with a decent amount of data inside).
The stream is being accessed like so:
var cursor = db.domains.find({ html: { $exists: true } });

cursor.on('data', function(rec) {
    i++;
    var url = rec.domain;
    var $ = cheerio.load(rec.html);
    checkList($, rec, url, i);
    // This "checkList" function parses the HTML data with Cheerio to find
    // different elements on the page. Lots of if/else statements.
});

cursor.on('end', function(){
    console.log("Streamed all objects!");
});
Each record gets parsed with Cheerio (the record contains HTML data from a page scraped earlier), then I process the Cheerio data to look for various selectors, and the results are saved back to MongoDB.
For the first ~2,000 objects the data is parsed quite quickly (in ~30 seconds). After that it becomes far slower, around 50 records being parsed per second.
Looking in my MacBook Air's Activity Monitor, I see that it's not using a crazy amount of memory (226.5 MB of 8 GB RAM), but it is using a whole lot of CPU (io.js is taking up 99% of my CPU).
Is this a possible memory leak? The checkList function isn't particularly intensive (or at least, as far as I can tell - there are quite a few nested if/else statements, but not much else).
Am I meant to be clearing my variables after they're used, like setting $ = '' or similar? Is there any other reason why Node would be using so much CPU?
You basically need to "pause" the stream, or otherwise "throttle" it from executing on every data item received straight away. The code in the "data" event does not wait for completion before the next event is fired, unless you stop the events from emitting.
var cursor = db.domains.find({ html: { $exists: true } });

cursor.on('data', function(rec) {
    cursor.pause(); // stop processing new events
    i++;
    var url = rec.domain;
    var $ = cheerio.load(rec.html);
    checkList($, rec, url, i);
    // if checkList() is synchronous then resume here
    cursor.resume(); // start events again
});

cursor.on('end', function(){
    console.log("Streamed all objects!");
});
If checkList() contains async methods, then pass the cursor in:
checkList($, rec, url, i, cursor);
And process the "resume" inside:
function checkList(data, rec, url, i, cursor) {
    somethingAsync(args, function(err, result) {
        // We're done
        cursor.resume(); // start events again
    });
}
The "pause" stops the events emitting from the stream until the "resume" is called. This means your operations don't "stack up" in memory and wait for each to complete.
You probably want more advanced flow control for some parallel processing, but this is basically how you do it with streams.
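As a sketch of that more advanced flow control, you can keep a small fixed number of records in flight, pausing at the limit and resuming as each completes (checkListAsync is a hypothetical callback-taking variant of checkList; the limit of 5 is arbitrary):
var MAX_IN_FLIGHT = 5;
var inFlight = 0;

cursor.on('data', function(rec){
    inFlight++;
    if (inFlight >= MAX_IN_FLIGHT) cursor.pause(); // apply back-pressure

    checkListAsync(rec, function(){
        inFlight--;
        cursor.resume(); // no-op if the stream is already flowing
    });
});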

Node.js Asynchronous File I/O

I'm new to Node.js and recently learned about the fs module. I'm a little confused about asynchronous vs. synchronous file i/o.
Consider the following test:
var fs = require('fs');
var txtfile = 'async.txt';
var buffer1 = Buffer(1024); // deprecated constructor; allocates uninitialized memory
var buffer2 = '1234567890';

fs.appendFile(txtfile, buffer1, function(err) {
    if (err) throw err;
    console.log('appended buffer1');
});

fs.appendFile(txtfile, buffer2, function(err) {
    if (err) throw err;
    console.log('appended buffer2');
});
About half the time when I run this, it prints appended buffer2 before appended buffer1. But when I open the text file, the data always appears to be in the right order - a bunch of garbage from Buffer(1024) followed by 1234567890. I would have expected the reverse or a jumbled mess.
What's going on here? Am I doing something wrong? Is there some kind of lower-level i/o queue that maintains order?
I've seen some talk about filesystem i/o differences with Node; I'm on a Mac if that makes any difference.
From my understanding, although the code is asynchronous, at the OS level the file I/O operations on the SAME file are not. That means only one file I/O operation is processed at a time for a single file.
While the 1st append is occurring, the file is locked. Although the 2nd append has been processed, its file I/O part is put in a queue by the OS and reports no error status. My guess is the OS does some checks to make sure the write operation will succeed, such as: the file exists, it is writable, and there is enough disk space. If all those conditions are met, the OS returns to the application with no error status and finishes the write later, when possible. Since the buffer of the 2nd append is much smaller, its processing (not the writing-to-file part) might finish before the 1st append has finished writing to the file. You therefore saw the 2nd console.log() first.
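If the order matters to your application, you don't have to rely on that OS-level queueing at all: chain the appends so the second starts only after the first completes. A minimal sketch reusing the question's variables:
fs.appendFile(txtfile, buffer1, function(err) {
    if (err) throw err;
    console.log('appended buffer1');
    // only now start the second append, guaranteeing the on-disk order
    fs.appendFile(txtfile, buffer2, function(err) {
        if (err) throw err;
        console.log('appended buffer2');
    });
});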

Running JS in a killable 'thread'; detecting and canceling long-running processes

Summary: How can I execute a JavaScript function, but then "terminate" (kill) it if it does not finish within a timeframe (e.g. 2 seconds)?
Details
I'm writing a web application for interactively writing and testing PEG grammars. Unfortunately, the JavaScript library I'm using for parsing using a PEG has a 'bug' where certain poorly-written or unfinished grammars cause infinite execution (not even detected by some browsers). You can be happily typing along, working on your grammar, when suddenly the browser locks up and you lose all your hard work.
Right now my code is (very simplified):
grammarTextarea.onchange = generateParserAndThenParseInputAndThenUpdateThePage;
I'd like to change it to something like:
grammarTextarea.onchange = function(){
    var result = runTimeLimited( generateParserAndThenParseInput, 2000 );
    if (result) updateThePage();
};
I've considered using an iframe or other tab/window to execute the content, but even this messy solution is not guaranteed to work in the latest versions of major browsers. However, I'm happy to accept a solution that works only in latest versions of Safari, Chrome, and Firefox.
Web workers provide this capability—as long as the long-running function does not require access to the window or document or closures—albeit in a somewhat-cumbersome manner. Here's the solution I ended up with:
main.js
var worker, activeMsgs, userTypingTimeout, deathRowTimer;
killWorker(); // Also creates the first one

grammarTextarea.onchange = grammarTextarea.oninput = function(){
    // Wait until the user has not typed for 500ms before parsing
    clearTimeout(userTypingTimeout);
    userTypingTimeout = setTimeout(askWorkerToParse, 500);
};

function askWorkerToParse(){
    worker.postMessage({action:'parseInput', input:grammarTextarea.value});
    activeMsgs++;                                 // Another message is in flight
    clearTimeout(deathRowTimer);                  // Restart the timer
    deathRowTimer = setTimeout(killWorker, 2000); // It must finish quickly
}

function killWorker(){
    if (worker) worker.terminate();   // This kills the thread
    worker = new Worker('worker.js'); // Create a new worker thread
    activeMsgs = 0;                   // No messages are pending on this new one
    worker.addEventListener('message', handleWorkerResponse, false);
}

function handleWorkerResponse(evt){
    // If this is the last message, it responded in time: it gets to live.
    if (--activeMsgs == 0) clearTimeout(deathRowTimer);
    // **Process the evt.data.results from the worker**
}
worker.js
importScripts('utils.js'); // Each worker is a blank slate; must load libs

self.addEventListener('message', function(evt){
    var data = evt.data;
    switch(data.action){
        case 'parseInput':
            // Actually do the work (which sometimes goes bad and locks up)
            var parseResults = parse(data.input);
            // Send the results back to the main thread.
            self.postMessage({kind:'parse-results', results:parseResults});
            break;
    }
}, false);
