Using JavaScript (Node.js), how do I parallel process for loops?

I only started to learn JavaScript 2 days ago, so I'm pretty new. I've written code that works, but it takes 20 minutes to run. I was wondering if there's a simple way to parallel process with for loops, e.g.
for (x = 0; x < 5; x++) {
    // processor 1 do ...
}
for (x = 5; x < 10; x++) {
    // processor 2 do ...
}

Since the OP wants to process the loop in parallel, the async.each() function from the async library is a good fit.
I've had faster execution times using async.each than with a plain forEach in Node.js.
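For illustration, here is a minimal sketch of that approach; items stands in for the array being looped over, and processItem(item, callback) is a hypothetical function doing one iteration's work asynchronously:
const async = require("async");

// async.each starts the iterator for every item without waiting for the
// previous one to finish; the final callback fires once all are done.
async.each(items, function (item, callback) {
    processItem(item, function (err) {
        callback(err); // signal this item's completion (or failure)
    });
}, function (err) {
    if (err) return console.error("a task failed:", err);
    console.log("all items processed");
});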

Web Workers can run your code in parallel, but without sharing memory/variables etc. - basically you pass input parameters to the worker, it does the work and gives you back the result.
http://www.html5rocks.com/en/tutorials/workers/basics/
You can find Node.js implementations of this, for example:
https://www.npmjs.com/package/webworker-threads
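If you are on a recent Node.js version, the built-in worker_threads module offers the same pattern without an external package. A minimal sketch splitting the OP's loop across two workers (worker.js and the per-iteration work are placeholders for your own code):
// main.js
const { Worker } = require("worker_threads");

function runChunk(start, end) {
    return new Promise((resolve, reject) => {
        // each Worker runs worker.js on its own thread with its own chunk
        const worker = new Worker("./worker.js", { workerData: { start, end } });
        worker.on("message", resolve); // the worker posts back its result
        worker.on("error", reject);
    });
}

Promise.all([runChunk(0, 5), runChunk(5, 10)])
    .then(results => console.log("results from both workers:", results));

// worker.js
const { parentPort, workerData } = require("worker_threads");
let result = 0;
for (let x = workerData.start; x < workerData.end; x++) {
    result += x; // stand-in for the real per-iteration work
}
parentPort.postMessage(result);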
OR, depending on how your code is written, if you're waiting on a lot of asynchronous functions, you can always rewrite your code to run faster (e.g. using event queues instead of for loops - just beware of dependencies, order of execution, etc.)

To run code in parallel or make requests in parallel, you can use Promise.all or Promise.allSettled.
Make all the queries in parallel (asynchronously), so that each query fires at the same time.
let promise1 = new Promise((resolve) => setTimeout(() => resolve('any-value'), 3000));
let responses = await Promise.all([promise1, promise2, promise3, ...]);
for (let response of responses) {
    // format responses
    // respond to client
}
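If some of the operations may fail and you still want the results of the rest, Promise.allSettled (mentioned above) reports every outcome instead of rejecting on the first error. A minimal sketch:
let results = await Promise.allSettled([promise1, promise2, promise3]);
for (let result of results) {
    // each entry is { status: 'fulfilled', value } or { status: 'rejected', reason }
    if (result.status === 'fulfilled') {
        // use result.value
    } else {
        // inspect result.reason
    }
}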

You might want to take a look at the async.js project, especially the parallel function.
An important quote about it:
parallel is about kicking-off I/O tasks in parallel, not about parallel execution of code. If your tasks do not use any timers or perform any I/O, they will actually be executed in series. Any synchronous setup sections for each task will happen one after the other. JavaScript remains single-threaded.
Example:
async.parallel([
    function(callback){
        setTimeout(function(){
            callback(null, 'one');
        }, 200);
    },
    function(callback){
        setTimeout(function(){
            callback(null, 'two');
        }, 100);
    }
],
// optional callback
function(err, results){
    // the results array will equal ['one','two'] even though
    // the second function had a shorter timeout.
});

Related

Callback not getting executed when I expect it

I have the following code that uses fetch. From what I understand, the callback function will not be invoked until the promise is fulfilled. Because of that, I was expecting the callback functions to be executed in the middle of processing other things (such as the for loop). However, it is not doing what I expect. My code is as follows:
console.log("Before fetch")
fetch('https://example.com/data')
.then(function(response){
console.log("In first then")
return response.json()
})
.then(function(json){
console.log("In second then")
console.log(json)
})
.catch(function(error){
console.log("An error has occured")
console.log(error)
})
console.log("After fetch")
for(let i = 0; i < 1000000; i++){
if (i % 10000 == 0)
console.log(i)
}
console.log("The End")
Rather than the callback being immediately run when the promise is fulfilled, it seems to wait until all the rest of my code is processed before the callback function is activated. Why is this?
The output of my code looks like this:
Before fetch
After fetch
0
10000
.
.
.
970000
980000
990000
The End
In first then
In second then
However, I was expecting the last two lines to appear somewhere prior to this point. What is going on here and how can I change my code so that it reflects when the promise is actually fulfilled?
The key here is that the for loop you're running afterwards is a long, synchronous block of code. That is the reason why synchronous APIs are deprecated / not recommended in JavaScript, as they block all asynchronous callbacks from executing until completion. JavaScript is not multithreaded, and it does not have concepts like interrupts in C, so if the thread is executing a large loop, nothing else will have the chance to run until that loop is finished.
In Node.js, the child_process API lets you run code in separate processes, and the Web Worker API in browsers allows concurrent scripts to run in parallel; both use serialized, event-based messaging to communicate between threads. Aside from that, everything in the above paragraph applies universally to JavaScript.
In general, a possible solution to breaking up long synchronous processes like the one you have there is batching. Using promises, you could rewrite the for loop like this:
(async () => {
    for (let i = 0; i < 100000; i++) {
        if (i % 10000 == 0) {
            console.log(i);
            // release control for minimum of 4 ms
            await new Promise(resolve => { setTimeout(resolve, 0); });
        }
    }
})().then(() => {
    console.log("The End");
});
setTimeout(() => { console.log('Can interrupt loop'); }, 1);
Reason for 4ms minimum: https://developer.mozilla.org/en-US/docs/Web/API/WindowOrWorkerGlobalScope/setTimeout#Reasons_for_delays_longer_than_specified
No matter how fast your promise is fulfilled, its callbacks are queued on the event loop and will only be invoked after all synchronous work has finished. In this example the synchronous work is the for loop. You can even try setTimeout with 0 ms; it will also run after the loop. Remember that JS is single-threaded and doesn't run your code in parallel.
Reference:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/EventLoop
There's no guarantee exactly when the callback will execute relative to other queued work, but it will never interrupt synchronous code that is already running: the engine finishes the current run-to-completion block (here, the loop) before dispatching any callbacks. There's no way to force a specific interleaving either, unless you chain the operations as callbacks.

Node.js: How to prevent two callbacks from running simultaneously?

I'm a bit new to Node.js. I've run into a problem where I want to prevent a callback from running while it is already being executed. For example:
items.forEach(function(item) {
    doLongTask(item, function handler(result) {
        // If items.length > 1, this will get executed multiple times.
    });
});
How do I make the other invocations of handler wait for the first one to finish before going ahead? I'm thinking something along the lines of a queue, but I'm a newbie to Node.js so I'm not exactly sure what to do. Ideas?
There are already libraries which take care of that, the most used being async.
You will be interested in the async.eachSeries() function.
As for an actual example...
const async = require('async')

async.eachSeries(
    items,
    (item, next) => {
        // Do stuff with item, and when you are done, call next
        // ...
        next()
    },
    err => {
        // either there was an error in one of the handlers and
        // execution was stopped, or all items have been processed
    }
)
As for how the library does this, you are better off having a look at the source code.
It should be noted that this only matters if your item handler performs an asynchronous operation, like interfacing with the filesystem or the network. No operation in Node.js causes a piece of JS code to execute in parallel with other JS code within the same process. So, if all you do is some calculations, you don't need to worry about this at all.
How to prevent two callbacks from running simultaneously?
They can't literally run simultaneously, because Node runs JavaScript on a single thread. Asynchronous operations can overlap, but the JavaScript thread will only ever be doing one thing at a time.
So presumably doLongTask is asynchronous. You can't use forEach for what you'd like to do, but it's still not hard: You just keep track of where you are in the list, and wait to start processing the next until the previous one completes:
var n = 0;
processItem();

function processItem() {
    if (n < items.length) {
        doLongTask(items[n], function handler(result) {
            ++n;
            processItem();
        });
    }
}
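In modern Node.js the same serialization falls out naturally from async/await. A sketch, assuming doLongTask can be wrapped so it returns a promise (doLongTaskAsync is a made-up name):
// hypothetical promise-returning wrapper around doLongTask
function doLongTaskAsync(item) {
    return new Promise(resolve => doLongTask(item, resolve));
}

async function processAll(items) {
    for (const item of items) {
        // await ensures the next handler doesn't start until this one finishes
        const result = await doLongTaskAsync(item);
        // ... use result ...
    }
}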

Node.js sync vs. async

I'm currently learning Node.js and I came across two examples of the same program, one synchronous and one asynchronous.
I understand the concept of a callback, but I'm trying to understand the benefit of the second (async) example, as the two seem to do exactly the same thing apart from this difference...
Can you please explain why the second example would be better? I'd be happy to get a wider explanation that would help me understand the concept.
Thank you!!
1st example:
var fs = require('fs');

function calculateByteSize() {
    var totalBytes = 0,
        i,
        filenames,
        stats;

    filenames = fs.readdirSync(".");
    for (i = 0; i < filenames.length; i++) {
        stats = fs.statSync("./" + filenames[i]);
        totalBytes += stats.size;
    }
    console.log(totalBytes);
}

calculateByteSize();
2nd example:
var fs = require('fs');

var count = 0,
    totalBytes = 0;

function calculateByteSize() {
    fs.readdir(".", function (err, filenames) {
        var i;
        count = filenames.length;
        for (i = 0; i < filenames.length; i++) {
            fs.stat("./" + filenames[i], function (err, stats) {
                totalBytes += stats.size;
                count--;
                if (count === 0) {
                    console.log(totalBytes);
                }
            });
        }
    });
}

calculateByteSize();
Your first example is all blocking I/O. In other words, you have to wait until the readdir operation is complete before looping through each file, and then you have to block (wait) for each individual sync stat operation before moving on to the next file. No code can run after the calculateByteSize() call until all operations are completed.
The async (second) example, on the other hand, is all non-blocking, using the callback pattern. Here, execution returns to just after the calculateByteSize() call as soon as fs.readdir is called (but before the callback is run). Once the readdir task is complete, it performs a callback to your anonymous function. There it loops through each of the files and again makes non-blocking calls to fs.stat.
The second is more advantageous. Suppose, for the sake of argument, that each readdir or stat call takes 250 ms to 750 ms to complete. With the synchronous version you wait out those calls one after another; with the async version you don't wait between calls. Looping over the readdir results, synchronous code must wait for each stat operation to complete before issuing the next, whereas asynchronous code can issue all the fs.stat calls without waiting.
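For comparison, here is a sketch of the same computation in a modern style, assuming a Node.js version that ships the fs/promises API; all the stat calls are issued at once and awaited together:
const fs = require('fs/promises');

async function calculateByteSize() {
    const filenames = await fs.readdir(".");
    // fire all stat calls concurrently; the event loop stays free while we wait
    const stats = await Promise.all(filenames.map(name => fs.stat("./" + name)));
    const totalBytes = stats.reduce((sum, s) => sum + s.size, 0);
    console.log(totalBytes);
}

calculateByteSize();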
In your first example, the node.js process, which is single-threaded, blocks for the entire duration of your readdirSync and can't do anything except wait for the result to be returned. In the second example, the process can handle other tasks, and the event loop will return it to the continuation of the callback when the result is available. So you can handle a much, much higher total throughput by using asynchronous code - the time spent waiting for the readdir in the first example is probably thousands of times as long as the time actually spent executing your code, so you're wasting 99.9% or more of your CPU time.
In your example the benefit of async programming is indeed not very visible. But suppose that your program needs to do other things as well. Remember that your JavaScript code runs in a single thread, so when you choose the synchronous implementation, the program can't do anything but wait for the IO operation to finish. With async programming, your program can do other important tasks while the IO operation runs in the background (outside the JavaScript thread).
Can you please explain why the second example would be better? I'd be happy to get a wider explanation that would help me understand the concept.
It's all about concurrency for network servers (thus the name "node"). If this were a build script, the first, synchronous example would be "better" in that it is more straightforward. And given a single disk, there might not be much actual benefit to making it asynchronous.
However, in a network service, the first, synchronous version would block the entire process and defeat node's main design principle; performance would degrade as the number of concurrent clients increased. The second, asynchronous example would perform relatively well: while waiting for the relatively slow filesystem to come back with results, it can handle all the relatively fast CPU operations concurrently. The async version should basically be able to saturate your filesystem, and however much your filesystem can deliver, node will be able to get it out to clients at that rate.
Lots of good answers here, but be sure to also read the docs:
The synchronous versions will block the entire process until they complete--halting all connections.
There is a good overview of sync vs async in the documentation: http://nodejs.org/api/fs.html#fs_file_system

How exactly does the nodejs promise library work for multiple promises?

Recently I made a web scraper in Node.js using promises. I created a Promise for each URL I wanted to scrape and then used the all method:
var fetchUrlArray = [];
for (...) {
    var mPromise = new Promise(function(resolve, reject) {
        (http.get(...))()
    });
    fetchUrlArray.push(mPromise);
}
Promise.all(fetchUrlArray).then(...)
There were thousands of URLs, but only a few of them timed out. I got the impression that it was handling 5 promises in parallel at a time.
My question is: how exactly does Promise.all() work? Does it:
call each promise one by one, switching to the next one only once the previous one is resolved,
process the promises in batches of a few from the array, or
fire all promises at once?
What is the best way to solve this problem in Node.js? Because as it stands, I can solve it way faster in Java/C#.
What you pass Promise.all() is an array of promises. It knows absolutely nothing about what is behind those promises. All it knows is that those promises will get resolved or rejected sometime in the future and it will create a new master promise that follows the sum of all the promises you passed it. This is one of the nice things about promises. They are an abstraction that lets you coordinate any type of action (usually asynchronous) without regard for what type of action it is. As such, promises have literally nothing to do with the actual action. All they do is monitor the completion or error of the action and report that back to those agents following the promise. Other code actually runs the action.
In your particular case, you are immediately calling http.get() in a tight loop and your code (nothing to do with promises) is launching a zillion http.get() operations at once. Those will get fired as fast as the underlying transport can do them (likely subject to connection limits).
If you want them to be launched serially or in batches of say 10 at a time, then you have to code it that way yourself. Promises have nothing to do with that.
You could use promises to help you code them to launch serially or in batches, but either way it would take extra code of your own to make that happen.
The Async library is specifically built for running things in parallel, but with a maximum number in flight at any given time, because this is a common scheme: you either have connection limits on your end or you don't want to overwhelm the receiving server. You may be interested in async.parallelLimit, which lets you run a number of async operations in parallel with a cap on how many are in flight at once; see the sketch below.
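A minimal sketch of that approach; urls and the callback-style fetchUrl(url, callback) are hypothetical stand-ins for the scraper's own list and request function:
const async = require("async");

// Wrap each URL in a task function that async can schedule.
const tasks = urls.map(url => callback => fetchUrl(url, callback));

// Run the tasks with at most 5 in flight at any given time.
async.parallelLimit(tasks, 5, (err, results) => {
    if (err) return console.error("a request failed:", err);
    console.log("fetched", results.length, "urls");
});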
I would do it like this
Personally, I'm not a big fan of Promises. I think the API is extremely verbose and the resulting code is very hard to read. The method defined below results in very flat code and it's much easier to immediately understand what's going on. At least imo.
Here's a little thing I created for an answer to this question
// void asyncForEach(Array arr, Function iterator, Function callback)
//   * iterator(item, done) - done can be called with an err to shortcut to callback
//   * callback(err) - err is set if an iterator passed one
function asyncForEach(arr, iterator, callback) {
    // create a cloned queue of arr
    var queue = arr.slice(0);
    // create a recursive iterator
    function next(err) {
        // if there's an error, bubble up to callback
        if (err) return callback(err);
        // if the queue is empty, call the callback with no error
        if (queue.length === 0) return callback(null);
        // call the iterator with the next task
        // we pass `next` here so the task can let us know when to move on
        iterator(queue.shift(), next);
    }
    // start the loop
    next();
}
You can use it like this:
var http = require("http"); // required for http.get below

var urls = [
    "http://example.com/cat",
    "http://example.com/hat",
    "http://example.com/wat"
];

function eachUrl(url, done) {
    http.get(url, function(res) {
        // do something with res
        done();
    }).on("error", function(err) {
        done(err);
    });
}

function urlsDone(err) {
    if (err) throw err;
    console.log("done getting all urls");
}

asyncForEach(urls, eachUrl, urlsDone);
Benefits of this
no external dependencies or beta apis
reusable on any array you want to perform async tasks on
non-blocking, just as you've come to expect with node
could be easily adapted for parallel processing
by writing your own utility, you better understand how this kind of thing works
If you just want to grab a module to help you, look into async and the async.eachSeries method.
First, a clarification: a promise represents the future result of a computation, nothing else. It does not represent the task or computation itself, which means it cannot be "called" or "fired".
Your script does create all those thousands of promises immediately, and each of those creations does call http.get immediately. I would suspect that the http library (or something it depends on) has a connection pool with a limit of how many requests to make in parallel, and defers the rest implicitly.
Promise.all does not do any "processing" - it's not responsible for starting the tasks and resolving the passed promises. It only listens to them and checks whether they all are ready, and returns a promise for that eventual result.
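A small demonstration of that last point (a sketch with arbitrary timings): the work starts when the promise is constructed, before Promise.all is ever called.
// the executor runs immediately at construction time...
const p = new Promise(resolve => {
    console.log("work started");        // logs right away
    setTimeout(() => resolve(42), 100); // ...and finishes later
});

// ...Promise.all merely waits for promises that are already running
Promise.all([p]).then(([value]) => console.log("done:", value));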

In node.js, how to use child_process.exec so all can happen asynchronously?

I have a server built on node.js. Below is one of the request handler functions:
var exec = require("child_process").exec;

function doIt(response) {
    //some trivial and fast code - can be ignored

    exec(
        "sleep 10", //run OS' sleep command, sleep for 10 seconds
        //sleeping(10), //commented out. run a local function, defined below.
        function(error, stdout, stderr) {
            response.writeHead(200, {"Content-Type": "text/plain"});
            response.write(stdout);
            response.end();
        });

    //some trivial and fast code - can be ignored
}
Meanwhile, in the same module file there is a local function "sleeping" defined, which as its name indicates will sleep for 10 seconds.
function sleeping(sec) {
    var begin = new Date().getTime();
    while (new Date().getTime() < begin + sec * 1000); //just loop till time is up
}
Here are three questions:
As we know, node.js is single-process, asynchronous, and event-driven. Is it true that ALL functions with a callback argument are asynchronous? For example, if I have a function my_func(callback_func) which takes another function as an argument, are there any restrictions on callback_func, or anything else required, to make my_func asynchronous?
So at least child_process.exec is asynchronous, with an anonymous callback function as its argument. Here I pass "sleep 10" as the first argument, to call the OS's sleep command and wait for 10 seconds. It won't block the whole Node process, i.e. any request sent to another request handler won't be blocked for 10 seconds by the "doIt" handler. However, if another request is sent to the server immediately and is handled by the same "doIt" handler, will it have to wait until the previous "doIt" request ends?
If I use the sleeping(10) function call (commented out) in place of "sleep 10", I found that it does block other requests until 10 seconds have passed. Could anyone explain the difference?
Thanks a bunch!
-- update per request --
A comment says this question seems to be a duplicate of another one (How to promisify Node's child_process.exec and child_process.execFile functions with Bluebird?) that was asked one year after this one. Well, these are quite different - this one asks about asynchronous behavior in general, with a specific buggy case, while that one asks about the Promise object per se. Both the intent and the use cases differ.
(If by any chance these are similar, shouldn't the newer one be marked as a duplicate of the older one?)
First, you can promisify child_process.exec:
const util = require('util');
const exec = util.promisify(require('child_process').exec);

async function lsExample() {
    const { stdout, stderr } = await exec('ls');
    if (stderr) {
        // handle error
        console.log('stderr:', stderr);
    }
    console.log('stdout:', stdout);
}

lsExample();
As an async function, lsExample returns a promise.
Run all promises in parallel with Promise.all([]).
Promise.all([lsExample(), otherFunctionExample()]);
If you need to wait on the promises to finish in parallel, await them.
await Promise.all([aPromise(), bPromise()]);
If you need the values from those promises:
const [a, b] = await Promise.all([aPromise(), bPromise()]);
1) No. For example, .forEach is synchronous:
var lst = [1, 2, 3];
console.log("start");
lst.forEach(function(el) {
    console.log(el);
});
console.log("end");
Whether a function is asynchronous or not depends purely on its implementation - there are no restrictions, and you can't know a priori (you have to test it, know how it is implemented, or read and trust the documentation). What's more, depending on its arguments a function can behave either asynchronously or synchronously, or both, as the sketch below shows.
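For illustration, a sketch of such a mixed function; readCached is a hypothetical caching reader (this dual behavior is widely considered a bad practice, but it shows that the caller can't tell synchronous from asynchronous by the signature alone):
var fs = require("fs");
var cache = {};

function readCached(path, callback) {
    if (cache[path]) {
        callback(null, cache[path]); // cache hit: callback runs synchronously
    } else {
        fs.readFile(path, function(err, data) {
            if (!err) cache[path] = data;
            callback(err, data);     // cache miss: callback runs asynchronously
        });
    }
}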
2) No. Each request will spawn a separate "sleep" process.
That's because your sleeping function is not really a sleep at all: it busy-waits in a loop checking the clock, using 100% of the CPU. Since node.js is single-threaded, it blocks the entire server - because it is synchronous. This is wrong; don't do this. Use setTimeout instead, as sketched below.
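A minimal sketch of the non-blocking equivalent; the name sleepThen is made up for illustration:
// schedules the continuation instead of spinning the CPU; other requests
// keep being served while the timer is pending
function sleepThen(sec, callback) {
    setTimeout(callback, sec * 1000);
}

// usage:
sleepThen(10, function() {
    console.log("10 seconds later, without blocking the event loop");
});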
