Javascript (node.js) capped number of child processes

Hopefully I can describe what I'm looking for clearly enough. I'm working with Node and Python.
I'm trying to run a number of child processes (.py scripts, using child_process.exec()) in parallel, but no more than a specified number at a time (say, 2). I receive an unknown number of requests in batches (say this batch has 3 requests). I'd like to stop spawning processes until one of the current ones finishes.
for (var i = 0; i < requests.length; i++) {
    // code that would ideally block execution for a moment
    while (active_pids.length == max_threads) {
        console.log("Waiting for more threads...");
        sleep(100);
        continue;
    }
    // code that needs to run if threads are available
    active_pids.push(i);
    cp.exec('python python-test.py ' + requests[i], function(err, stdout) {
        console.log("Data processed for: " + stdout);
        active_pids.shift();
        if (err != null) {
            console.log(err);
        }
    });
}
I know that while loop doesn't work, it was the first attempt.
I'm guessing there's a way to do this with
setTimeout(function someSpawningFunction() {
    if (active_pids.length == max_threads) {
        return;
    } else {
        // spawn process?
    }
}, 100);
But I can't quite wrap my head around it.
Or maybe
waitpid(-1)
Inserted in the for loop above in an if statement in place of the while loop? However I can't get the waitpid() module to install at the moment.
And yes, I understand that blocking execution is considered very bad in JS, but in my case, I need it to happen. I'd rather avoid external cluster manager-type libraries if possible.
Thanks for any help.
EDIT/Partial Solution
An ugly hack would be to use the answer from this SO question (execSync()). But that would block the loop until the LAST child finished. That's my plan so far, but not ideal.

async.timesLimit from the async library is the perfect tool to use here. It allows you to asynchronously run a function n times, but run a maximum of k of those function calls in parallel at any given time.
async.timesLimit(requests.length, max_threads, function(i, next) {
    cp.exec('python python-test.py ' + requests[i], function(err, stdout) {
        console.log("Data processed for: " + stdout);
        if (err != null) {
            console.log(err);
        }
        // this task is resolved
        next(null, stdout);
    });
}, function(err, stdoutArray) {
    // this runs after all processes have run; what's next?
});
Or, if you want errors to be fatal and stop the loop, call next(err, stdout).

You can maintain a queue of external processes waiting to run and a counter for how many are currently running. The queue can simply be an array of objects, one per process, with properties holding the data you need to know which process to run.
Whenever you get a new request to run an external process, you add it to the queue and then start external processes, incrementing your counter each time you start one, until the counter hits your maximum.
Then, while monitoring those external processes, whenever one finishes you decrement the counter, and if the queue of waiting tasks is not empty, you start up another one and increment the counter again.
The async library has this type of functionality built in (running a specific number of operations at a time), though it isn't very difficult to implement yourself with a queue and a counter. The key is that you just have to hook into the completion event for your external process so you can maintain the counter and start any new tasks that are waiting.
There is no reason to use synchronous or serial execution, or to block, in order to achieve your goal here.
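For illustration, here is a minimal sketch of that queue-and-counter approach, reusing the python-test.py command line from the question; the cap of 2 and the enqueue/runNext names are just example choices:

const cp = require('child_process');

const MAX_CONCURRENT = 2;   // example cap from the question
let running = 0;            // child processes currently active
const queue = [];           // requests waiting for a free slot

// Called for every incoming request, whenever it arrives.
function enqueue(request) {
    queue.push(request);
    runNext();
}

// Start queued work while we are under the cap.
function runNext() {
    while (running < MAX_CONCURRENT && queue.length > 0) {
        const request = queue.shift();
        running++;
        cp.exec('python python-test.py ' + request, (err, stdout) => {
            if (err) console.log(err);
            else console.log('Data processed for: ' + stdout);
            running--;   // free the slot...
            runNext();   // ...and pull the next waiting request, if any
        });
    }
}

// e.g. when a batch arrives:
// requests.forEach(enqueue);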

Related

Does the nodejs (libuv) event loop execute all the callbacks in one phase (queue) before moving to the next or run in a round robin fashion?

I am studying the event loop provided by libuv in Node. I came across a blog post by Deepal Jayasekara and also saw the explanations of Bert Belder and Daniel Khan on YouTube.
There is one point that I am not clear on: as per my understanding, the event loop processes all the items of one phase before moving on to another. If that is the case, I should be able to block the event loop if the timers (setTimeout) phase constantly has callbacks being added to it.
However, when I tried to replicate that- it doesn't happen. The following is the code:
var http = require('http');

http.createServer(function (req, res) {
    res.writeHead(200, {'Content-Type': 'text/plain'});
    res.write('Hello World!');
    console.log("Response sent");
    res.end();
}).listen(8081);

setInterval(() => {
    console.log("Entering for loop");
    // Long running loop that allows more callbacks to get added to the
    // setTimeout phase before this callback's processing completes
    for (let i = 0; i < 7777777777; i++);
    console.log("Exiting for loop");
}, 0);
The event loop seems to run in a round-robin fashion. It first executes the callbacks that were added before I sent a request to the server, then processes the request, and then continues with the callbacks. It feels like a single queue is running.
From the little that I understood, there isn't a single queue and all the expired timer callbacks should get executed first before moving to the next phase. Hence the above snippet should not be able to return the Hello World response.
What could be the possible explanation for this?
Thanks.
If you look in libuv itself, you find that the operative part of running timers in the event loop is the function uv__run_timers().
void uv__run_timers(uv_loop_t* loop) {
  struct heap_node* heap_node;
  uv_timer_t* handle;

  for (;;) {
    heap_node = heap_min(timer_heap(loop));
    if (heap_node == NULL)
      break;

    handle = container_of(heap_node, uv_timer_t, heap_node);
    if (handle->timeout > loop->time)
      break;

    uv_timer_stop(handle);
    uv_timer_again(handle);
    handle->timer_cb(handle);
  }
}
The way it works is the event loop sets a time mark at the current time and then it processes all timers that are due by that time one after another without updating the loop time. So, this will fire all the timers that are already past their time, but won't fire any new timers that come due while it's processing the ones that were already due.
This leads to a bit fairer scheduling as it runs all timers that are due, then goes and runs the rest of the types of events in the event loop, then comes back to do any more timers that are due again. This will NOT process any timers that are not due at the start of this event loop cycle, but come due while it's processing other timers. Thus, you see the behavior you asked about.
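You can observe this from JavaScript with a small sketch (my own illustration, not from the original answer): a timer scheduled while the timers phase is already running does not fire in the same pass, but a setImmediate callback (check phase) still runs in the current loop iteration.

// Expected order: "timer A", then "immediate ...", then "timer B ..."
setTimeout(() => {
    console.log('timer A');

    const start = Date.now();
    while (Date.now() - start < 50) {} // busy-wait so plenty of wall time passes

    // Scheduled during the timers phase: waits for the next loop iteration.
    setTimeout(() => console.log('timer B - next loop iteration'), 0);

    // Check phase of the *current* iteration, so it runs before timer B.
    setImmediate(() => console.log('immediate - this iteration, check phase'));
}, 0);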
The above function is called from the main part of the event loop with this code:
int uv_run(uv_loop_t *loop, uv_run_mode mode) {
  DWORD timeout;
  int r;
  int ran_pending;

  r = uv__loop_alive(loop);
  if (!r)
    uv_update_time(loop);

  while (r != 0 && loop->stop_flag == 0) {
    uv_update_time(loop);     /* <== establish loop time */
    uv__run_timers(loop);     /* <== process only timers due by that loop time */

    ran_pending = uv_process_reqs(loop);
    uv_idle_invoke(loop);
    uv_prepare_invoke(loop);

    /* .... more code here */
  }
Note the call to uv_update_time(loop) right before calling uv__run_timers(). That sets the loop time that uv__run_timers() compares each timer against. Here's the code for uv_update_time():
void uv_update_time(uv_loop_t* loop) {
  uint64_t new_time = uv__hrtime(1000);
  assert(new_time >= loop->time);
  loop->time = new_time;
}
From the docs:
when the event loop enters a given phase, it will perform any operations specific to that phase, then execute callbacks in that phase's queue until the queue has been exhausted or the maximum number of callbacks has executed. When the queue has been exhausted or the callback limit is reached, the event loop will move to the next phase, and so on.
Also from the docs,
When delay is larger than 2147483647 or less than 1, the delay will be set to 1
Now, when you run your snippet, the following things happen:
1. Script execution begins and callbacks are registered for their respective phases. Also, as the docs suggest, the setInterval delay of 0 is implicitly converted to 1 ms.
2. After 1 ms, your setInterval callback is executed; it blocks the event loop until all iterations of the for loop are completed. Meanwhile, the event loop cannot pick up the incoming request, at least until that callback returns.
3. Once all iterations are completed, and since the next interval only fires after another 1 ms, the poll phase gets a chance to execute your HTTP request callback, if any.
4. Back to step 2.

fs.readFile is very slow, am I making too many requests?

node.js beginner here:
A node.js application scrapes an array of links (linkArray) from a list of ~30 urls.
Each domain/url has a corresponding (name).json file that is used to check whether the scraped links are new or not.
All pages are fetched, links are scraped into arrays, and then passed to:
function checkLinks(linkArray, name){
    console.log(name, "checkLinks");
    fs.readFile('json/' + name + '.json', 'utf8', function readFileCallback(err, data){
        if(err && err.errno != -4058) throw err;
        if(err && err.errno == -4058){
            console.log(name + '.json', " is NEW .json");
            compareAndAdd(linkArray, {linkArray: []}.linkArray, name);
        }
        else{
            // file EXISTS
            compareAndAdd(linkArray, JSON.parse(data).linkArray, name);
        }
    });
}
compareAndAdd() reads:
function compareAndAdd(arrNew, arrOld, name){
    console.log(name, "compareAndAdd()");
    if(!arrOld) var arrOld = [];
    if(!arrNew) var arrNew = [];

    // compare and remove dups
    function hasDup(value) {
        for (var i = 0; i < arrOld.length; i++)
            if(value.href == arrOld[i].href)
                if(value.text.length <= arrOld[i].text.length) return false;
        arrOld.push(value);
        return true;
    }

    var rArr = arrNew.filter(hasDup);

    // update existing array;
    if(rArr.length > 0){
        fs.writeFile('json/' + name + '.json', JSON.stringify({linkArray: arrOld}), function (err) {
            if (err) return console.log(err);
            console.log(" " + name + '.json UPDATED');
        });
    }
    else console.log(" " + name, "no changes, nothing to update");
    return;
}
checkLinks() is where the program hangs; it's unbelievably slow. I understand that fs.readFile is being hit multiple times a second, but IMO fewer than 30 hits seems pretty trivial, assuming this is a function meant to serve data to (potentially) millions of users. Am I expecting too much from fs.readFile, or (more likely) is there another component (like writeFile, or something else entirely) that's locking everything up?
supplemental:
Using write/readFileSync creates a lot of problems: this program is inherently async because it begins with requests to external websites with widely varied response times, and reads/writes would frequently collide. The functions above ensure that writing to a given file only happens after it's been read (though it is very slow).
Also, this program does not exit on its own, and I do not know why.
edit
I've reworked the program to read first and then write synchronously last, and the process is down to ~12 seconds. Apparently fs.readFile was getting hung when called multiple times. I don't understand when/how to use asynchronous fs if multiple calls hang the function.
All async fs operations are executed inside the libuv thread pool, which has a default size of 4 (can be changed by setting the UV_THREADPOOL_SIZE environment variable to something different). If all threads in the thread pool are busy, any fs operations will be queued up.
I should also point out that fs is not the only module that uses the thread pool, dns.lookup() (the default hostname resolution method used internally by node), async zlib methods, crypto.randomBytes(), and a couple of other things IIRC also use the libuv thread pool. This is just something to keep in mind.
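As a hedged illustration of how the pool size can be changed (the value 8 and the script name are arbitrary, and the shell form is the more reliable one):

// Option 1 (most reliable): set the variable when launching the process, e.g.
//   UV_THREADPOOL_SIZE=8 node app.js

// Option 2: set it at the very top of the entry script, before anything
// touches the thread pool (fs, dns.lookup, async zlib, crypto, ...).
process.env.UV_THREADPOOL_SIZE = 8;

const fs = require('fs');
// ... rest of the program; up to 8 fs operations can now run in the
// thread pool at once instead of the default 4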
If you read many files (checkLinks) in a loop, ALL the fs.readFile calls will be issued first, and only AFTER that will the callbacks be processed (they are processed only once the main call stack is empty). This can lead to a significant startup delay, but don't worry about that.
You mention that the program never exits. So make a counter, count the calls to checkLinks, and decrement the counter each time a callback fires. Inside the callback, check the counter against 0 and then do your finalizing logic (I suspect this could be a response to the HTTP request).
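A minimal sketch of that counter idea (the done callback and the call site are hypothetical, just to show the shape; the writeFile calls inside compareAndAdd would need the same treatment if you also have to wait for them):

const fs = require('fs');

let pending = 0; // how many checkLinks calls have not finished yet

function checkLinks(linkArray, name, done) {
    pending++;
    fs.readFile('json/' + name + '.json', 'utf8', function (err, data) {
        // ... the existing compare/add logic goes here ...
        pending--;
        if (pending === 0) done(); // the last outstanding read has returned
    });
}

// hypothetical call site: run the finalizing step once everything is processed
// checkLinks(links, 'somefeed', () => console.log('all done, safe to exit'));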
Actually, it doesn't matter much whether you use the async or the sync version; they will take roughly the same amount of time.

Node.js: How to prevent two callbacks from running simultaneously?

I'm a bit new to Node.js. I've run into a problem where I want to prevent a callback from running while it is already being executed. For example:
items.forEach(function(item) {
    doLongTask(item, function handler(result) {
        // If items.length > 1, this will get executed multiple times.
    });
});
How do I make the other invocations of handler wait for the first one to finish before going ahead? I'm thinking something along the lines of a queue, but I'm a newbie to Node.js so I'm not exactly sure what to do. Ideas?
There are already libraries which take care of that, the most used being async.
You will be interested in the async.eachSeries() function.
As for an actual example...
const async = require('async')

async.eachSeries(
    items,
    (item, next) => {
        // Do stuff with item, and when you are done, call next
        // ...
        next()
    },
    err => {
        // either there was an error in one of the handlers and
        // execution was stopped, or all items have been processed
    }
)
As for how the library does this, you are better off having a look at the source code.
It should be noted that this only makes sense if your item handler performs an asynchronous operation, like interfacing with the filesystem or the internet, etc. There is no operation in Node.js that would cause a piece of JS code to be executed in parallel with another piece of JS code within the same process. So, if all you do is some calculations, you don't need to worry about this at all.
How to prevent two callbacks from running simultaneously?
They won't run simultaneously unless they're asynchronous, because Node runs JavaScript on a single thread. Asynchronous operations can overlap, but the JavaScript thread will only ever be doing one thing at a time.
So presumably doLongTask is asynchronous. You can't use forEach for what you'd like to do, but it's still not hard: You just keep track of where you are in the list, and wait to start processing the next until the previous one completes:
var n = 0;
processItem();

function processItem() {
    if (n < items.length) {
        doLongTask(items[n], function handler(result) {
            ++n;
            processItem();
        });
    }
}

Using Javascript node.js how do I parallel process For loops?

I only started to learn javascript 2 days ago so I'm pretty new. I've written code which is optimal but takes 20 minutes to run. I was wondering if there's a simple way to parallel process with for loops e.g.
for (x = 0; x < 5; x++) {
    // processor 1 do ...
}
for (x = 5; x < 10; x++) {
    // processor 2 do ...
}
Since the OP wants to process the loop in parallel, the async.each() function from the async library is the ideal way to go.
I've had faster execution times using async.each compared to forEach in nodejs.
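For example, a minimal sketch using async.each (items and the work inside processItem are placeholders):

const async = require('async');

// Hypothetical per-item worker: do the (asynchronous) work, then call back.
function processItem(item, callback) {
    setImmediate(() => {
        // ... actual work for `item` goes here ...
        callback(null); // pass an Error instead to abort the whole run
    });
}

async.each(items, processItem, (err) => {
    if (err) return console.error(err);
    console.log('all items processed');
});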
Web workers can run your code in parallel, but without sharing memory/variables etc. Basically you pass input parameters to the worker, it does the work, and gives you back the result.
http://www.html5rocks.com/en/tutorials/workers/basics/
You can find nodejs implementations of this, example
https://www.npmjs.com/package/webworker-threads
OR, depending on how your code is written, if you're waiting on a lot of asynchronous functions, you can always rewrite your code to run faster (e.g. using event queues instead of for loops; just beware of dependencies, order of execution, etc.).
To run code in parallel or to make requests in parallel, you can use Promise.all or Promise.allSettled.
Make all the queries in parallel (asynchronously), resulting in each query firing at the same time.
let promise1 = new Promise((resolve) => setTimeout(() => resolve('any-value'), 3000));

let responses = await Promise.all([promise1, promise2, promise3, ...]);
for (let response of responses) {
    // format responses
    // respond to client
}
For more examples check out this article
You might want to take a look at the async.js project, especially the parallel function.
Important quote about it :
parallel is about kicking-off I/O tasks in parallel, not about parallel execution of code. If your tasks do not use any timers or perform any I/O, they will actually be executed in series. Any synchronous setup sections for each task will happen one after the other. JavaScript remains single-threaded.
Example :
async.parallel([
    function(callback){
        setTimeout(function(){
            callback(null, 'one');
        }, 200);
    },
    function(callback){
        setTimeout(function(){
            callback(null, 'two');
        }, 100);
    }
],
// optional callback
function(err, results){
    // the results array will equal ['one','two'] even though
    // the second function had a shorter timeout.
});

Node.js sync vs. async

I'm currently learning node.js and I see two examples of the same program, one sync and one async.
I do understand the concept of a callback, but I'm trying to understand the benefit of the second (async) example, as it seems that the two of them are doing the exact same thing despite this difference...
Can you please detail the reason why the second example would be better?
I'd be happy to get an even wider explanation that would help me understand the concept.
Thank you!!
1st example:
var fs = require('fs');

function calculateByteSize() {
    var totalBytes = 0,
        i,
        filenames,
        stats;

    filenames = fs.readdirSync(".");
    for (i = 0; i < filenames.length; i++) {
        stats = fs.statSync("./" + filenames[i]);
        totalBytes += stats.size;
    }
    console.log(totalBytes);
}

calculateByteSize();
2nd example:
var fs = require('fs');

var count = 0,
    totalBytes = 0;

function calculateByteSize() {
    fs.readdir(".", function (err, filenames) {
        var i;
        count = filenames.length;
        for (i = 0; i < filenames.length; i++) {
            fs.stat("./" + filenames[i], function (err, stats) {
                totalBytes += stats.size;
                count--;
                if (count === 0) {
                    console.log(totalBytes);
                }
            });
        }
    });
}

calculateByteSize();
Your first example is all blocking I/O. In other words, you would need to wait until the readdir operation is complete before looping through each file. Then you would need to block (wait) for each individual sync stat operation to run before moving on to the next file. No code could run after calculateByteSize() call until all operations are completed.
The async (second) example, on the other hand, is all non-blocking and uses the callback pattern. Here, execution returns to just after the calculateByteSize() call as soon as fs.readdir is called (but before the callback is run). Once the readdir task is complete, it performs a callback to your anonymous function. There it loops through each of the files and again makes non-blocking calls to fs.stat.
The second is more advantageous. If you can pretend that calls to readdir or stat can range from 250ms to 750ms to complete (this is probably not the case), you would be waiting for serial calls to your sync operations. However, the async operations would not cause you to wait between each call. In other words, looping over the readdir files, you would need to wait for each stat operation to complete if you were doing it synchronously. If you were to do it asynchronously, you would not have to wait to call each fs.stat call.
In your first example, the node.js process, which is single-threaded, is blocked for the entire duration of your readdirSync and can't do anything else except wait for the result to be returned. In the second example, the process can handle other tasks and the event loop will return it to the continuation of the callback when the result is available. So you can handle a much, much higher total throughput by using asynchronous code -- the time spent waiting for the readdir in the first example is probably thousands of times as long as the time actually spent executing your code, so you're wasting 99.9% or more of your CPU time.
In your example the benefit of async programming is indeed not much visible. But suppose that your program needs to do other things as well. Remember that your JavaScript code is running in a single thread, so when you chose the synchronous implementation the program can't do anything else but waiting for the IO operation to finish. When you use async programming, your program can do other important tasks while the IO operation runs in the background (outside the JavaScript thread)
Can you please detail the reason why the second example would be better? I'd be happy to get an even wider explanation that would help me understand the concept.
It's all about concurrency for network servers (thus the name "node"). If this were in a build script, the first, synchronous example would be "better" in that it is more straightforward. And given a single disk, there might not be much actual benefit to making it asynchronous.
However, in a network service, the first, synchronous version would block the entire process and defeat node's main design principle. Performance would be slow as number of concurrent clients increased. However the second, asynchronous example, would perform relatively well as while waiting for the relatively-slow filesystem to come back with results, it could handle all the relatively-fast CPU operations concurrently. The async version should basically be able to saturate your filesystem and however much your filesystem can deliver, node will be able to get it out to clients at that rate.
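To make that concrete, here is a minimal sketch (my own illustration, not from the original answer; the port and directory are arbitrary) of the same trade-off inside a request handler:

const http = require('http');
const fs = require('fs');

http.createServer((req, res) => {
    // Synchronous version: while readdirSync runs, every other connection
    // to this server has to wait.
    // const names = fs.readdirSync('.');
    // res.end(names.join('\n'));

    // Asynchronous version: the process keeps accepting and serving other
    // requests while the filesystem work is in flight.
    fs.readdir('.', (err, names) => {
        if (err) {
            res.statusCode = 500;
            return res.end('error');
        }
        res.end(names.join('\n'));
    });
}).listen(8080);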
Lots of good answers here, but be sure to also read the docs:
The synchronous versions will block the entire process until they complete--halting all connections.
There is a good overview of sync vs async in the documentation: http://nodejs.org/api/fs.html#fs_file_system
