Node.js sync vs. async - javascript

I'm currently learning Node.js, and I've seen two versions of the same program, one synchronous and one asynchronous.
I understand the concept of a callback, but I'm trying to understand the benefit of the second (async) example, as the two of them seem to do exactly the same thing despite this difference...
Can you please detail the reason why the second example would be better?
I'll be happy to get an even wider explanation that would help me understand the concept..
Thank you!!
1st example:
var fs = require('fs');

function calculateByteSize() {
    var totalBytes = 0,
        i,
        filenames,
        stats;

    filenames = fs.readdirSync(".");
    for (i = 0; i < filenames.length; i++) {
        stats = fs.statSync("./" + filenames[i]);
        totalBytes += stats.size;
    }
    console.log(totalBytes);
}

calculateByteSize();
2nd example:
var fs = require('fs');

var count = 0,
    totalBytes = 0;

function calculateByteSize() {
    fs.readdir(".", function (err, filenames) {
        var i;
        count = filenames.length;
        for (i = 0; i < filenames.length; i++) {
            fs.stat("./" + filenames[i], function (err, stats) {
                totalBytes += stats.size;
                count--;
                if (count === 0) {
                    console.log(totalBytes);
                }
            });
        }
    });
}

calculateByteSize();

Your first example is all blocking I/O. In other words, you need to wait until the readdir operation completes before looping through each file. Then you need to block (wait) for each individual sync stat operation to run before moving on to the next file. No code can run after the calculateByteSize() call until all of these operations are complete.
The async (second) example, on the other hand, is all non-blocking, using the callback pattern. Here, execution returns to just after the calculateByteSize() call as soon as fs.readdir is called (but before the callback is run). Once the readdir task is complete, it performs a callback to your anonymous function, which loops through the files and again makes non-blocking calls to fs.stat.
The second is more advantageous. Suppose calls to readdir or stat take anywhere from 250 ms to 750 ms to complete (probably an overestimate): with the sync operations you wait through those calls serially, because looping over the readdir results, you must wait for each stat operation to complete before issuing the next one. With the async operations you don't wait between calls; each fs.stat can be issued without waiting for the previous one to finish.
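For illustration, here is a minimal sketch of the same computation using the promise-based API (fs.promises, available in modern Node): all the stat calls are started up front and awaited together, so their waits overlap instead of accumulating serially.

const fsp = require('fs').promises;

async function calculateByteSize() {
    const filenames = await fsp.readdir(".");
    // start every stat immediately; Promise.all waits for them together
    const stats = await Promise.all(filenames.map(f => fsp.stat("./" + f)));
    console.log(stats.reduce((total, s) => total + s.size, 0));
}

calculateByteSize();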

In your first example, the node.js process, which is single-threaded, blocks for the entire duration of your readdirSync and can't do anything else except wait for the result to be returned. In the second example, the process can handle other tasks, and the event loop will return it to the continuation of the callback when the result is available. So you can handle a much, much higher total throughput by using asynchronous code: the time spent waiting for the readdir in the first example is probably thousands of times as long as the time actually spent executing your code, so you're wasting 99.9% or more of your CPU time.

In your example the benefit of async programming is indeed not very visible. But suppose that your program needs to do other things as well. Remember that your JavaScript code runs in a single thread, so when you choose the synchronous implementation the program can't do anything else but wait for the IO operation to finish. When you use async programming, your program can do other important tasks while the IO operation runs in the background (outside the JavaScript thread).
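A minimal sketch of that non-blocking behavior (assuming the current directory is readable): the line after the async call runs immediately, long before the I/O result arrives.

const fs = require('fs');

fs.readdir('.', function (err, filenames) {
    // runs later, in a future turn of the event loop
    console.log('directory listing arrived:', filenames.length, 'entries');
});

console.log('this line runs immediately, while the I/O is still in flight');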

Can you please detail the reason why would the second example be better? I'll be happy to get an ever wider explanation that would help me understand the concept..
It's all about concurrency for network servers (thus the name "node"). If this were in a build script, the first, synchronous example would be "better" in that it is more straightforward. And given a single disk, there might not be much actual benefit to making it asynchronous.
However, in a network service, the first, synchronous version would block the entire process and defeat node's main design principle. Performance would slow down as the number of concurrent clients increased. The second, asynchronous example, however, would perform relatively well: while waiting for the relatively slow filesystem to come back with results, it could handle all the relatively fast CPU operations concurrently. The async version should basically be able to saturate your filesystem, and whatever your filesystem can deliver, node will be able to get out to clients at that rate.

Lots of good answers here, but be sure to also read the docs:
The synchronous versions will block the entire process until they complete--halting all connections.
There is a good overview of sync vs async in the documentation: http://nodejs.org/api/fs.html#fs_file_system

Related

Using "await" inside non-async function

I have an async function that is run by a setInterval somewhere in my code. This function updates some cache at regular intervals.
I also have a different, synchronous function which needs to retrieve values - preferably from the cache, yet if it's a cache miss, then from the data origins
(I realize making IO operations in a synchronous manner is ill-advised, but let's assume this is required in this case).
My problem is I'd like the synchronous function to be able to wait for a value from the async one, but it's not possible to use the await keyword inside a non-async function:
function syncFunc(key) {
    if (!(key in cache)) {
        await updateCacheForKey([key]);
    }
}

async function updateCacheForKey(keys) {
    // updates cache for given keys
    ...
}
Now, this can be easily circumvented by extracting the logic inside updateCacheForKey into a new synchronous function, and calling this new function from both existing functions.
My question is why absolutely prevent this use case in the first place? My only guess is that it has to do with "idiot-proofing", since in most cases, waiting on an async function from a synchronous one is wrong. But am I wrong to think it has its valid use cases at times?
(I think this is possible in C# as well by using Task.Wait, though I might be confusing things here).
My problem is I'd like the synchronous function to be able to wait for a value from the async one...
They can't, because:
JavaScript works on the basis of a "job queue" processed by a thread, where jobs have run-to-completion semantics, and
JavaScript doesn't really have asynchronous functions — even async functions are, under the covers, synchronous functions that return promises (details below)
The job queue (event loop) is conceptually quite simple: When something needs to be done (the initial execution of a script, an event handler callback, etc.), that work is put in the job queue. The thread servicing that job queue picks up the next pending job, runs it to completion, and then goes back for the next one. (It's more complicated than that, of course, but that's sufficient for our purposes.) So when a function gets called, it's called as part of the processing of a job, and jobs are always processed to completion before the next job can run.
Running to completion means that if the job called a function, that function has to return before the job is done. Jobs don't get suspended in the middle while the thread runs off to do something else. This makes code dramatically simpler to write correctly and reason about than if jobs could get suspended in the middle while something else happens. (Again it's more complicated than that, but again that's sufficient for our purposes here.)
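A tiny demonstration of run-to-completion: a callback queued with setTimeout(..., 0) cannot interleave with the job that is currently running.

setTimeout(() => console.log('next job'), 0); // queued, but must wait

for (let i = 0; i < 3; i++) {
    console.log('current job', i);
}
// Output: "current job" 0, 1, 2, then "next job" -- the queued job only
// runs after the current job has completed.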
So far so good. What's this about not really having asynchronous functions?!
Although we talk about "synchronous" vs. "asynchronous" functions, and even have an async keyword we can apply to functions, a function call is always synchronous in JavaScript. An async function is a function that synchronously returns a promise that the function's logic fulfills or rejects later, queuing callbacks the environment will call later.
Let's assume updateCacheForKey looks something like this:
async function updateCacheForKey(key) {
    const value = await fetch(/*...*/);
    cache[key] = value;
    return value;
}
What that's really doing, under the covers, is (very roughly, not literally) this:
function updateCacheForKey(key) {
    return fetch(/*...*/).then(result => {
        const value = result;
        cache[key] = value;
        return value;
    });
}
(I go into more detail on this in Chapter 9 of my recent book, JavaScript: The New Toys.)
It asks the browser to start the process of fetching the data, and registers a callback with it (via then) for the browser to call when the data comes back, and then it exits, returning the promise from then. The data isn't fetched yet, but updateCacheForKey is done. It has returned. It did its work synchronously.
Later, when the fetch completes, the browser queues a job to call that promise callback; when that job is picked up from the queue, the callback gets called, and its return value is used to resolve the promise then returned.
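You can see this synchronous return directly (a minimal sketch):

async function f() {
    return 42;
}

const p = f();                     // f has already returned, synchronously
console.log(p instanceof Promise); // true -- the value isn't here yet
p.then(v => console.log(v));       // 42, delivered in a later job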
My question is why absolutely prevent this use case in the first place?
Let's see what that would look like:
The thread picks up a job and that job involves calling syncFunc, which calls updateCacheForKey. updateCacheForKey asks the browser to fetch the resource and returns its promise. Through the magic of this non-async await, we synchronously wait for that promise to be resolved, holding up the job.
At some point, the browser's network code finishes retrieving the resource and queues a job to call the promise callback we registered in updateCacheForKey.
Nothing happens, ever again. :-)
...because jobs have run-to-completion semantics, and the thread isn't allowed to pick up the next job until it completes the previous one. The thread isn't allowed to suspend the job that called syncFunc in the middle so it can go process the job that would resolve the promise.
That seems arbitrary, but again, the reason for it is that it makes it dramatically easier to write correct code and reason about what the code is doing.
But it does mean that a "synchronous" function can't wait for an "asynchronous" function to complete.
There's a lot of hand-waving of details and such above. If you want to get into the nitty-gritty of it, you can delve into the spec. Pack lots of provisions and warm clothes, you'll be some time. :-)
Jobs and Job Queues
Execution Contexts
Realms and Agents
You can call an async function from within a non-async function via an Immediately Invoked Function Expression (IIFE):
(async () => await updateCacheForKey([key]))();
And as applied to your example:
function syncFunc(key) {
    if (!(key in cache)) {
        (async () => await updateCacheForKey([key]))();
    }
}

async function updateCacheForKey(keys) {
    // updates cache for given keys
    ...
}
This shows how a function can be both sync and async, and how the Immediately Invoked Function Expression idiom is only immediate if the path through the function being called does synchronous things.
function test() {
    console.log('Test before');
    (async () => await print(0.3))();
    console.log('Test between');
    (async () => await print(0.7))();
    console.log('Test after');
}

async function print(v) {
    if (v < 0.5) await sleep(5000);
    else console.log('No sleep');
    console.log(`Printing ${v}`);
}

function sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
}
test();
(Based off of Ayyappa's code in a comment to another answer.)
The console.log looks like this:
16:53:00.804 Test before
16:53:00.804 Test between
16:53:00.804 No sleep
16:53:00.805 Printing 0.7
16:53:00.805 Test after
16:53:05.805 Printing 0.3
If you change the 0.7 to 0.4 everything runs async:
17:05:14.185 Test before
17:05:14.186 Test between
17:05:14.186 Test after
17:05:19.186 Printing 0.3
17:05:19.187 Printing 0.4
And if you change both numbers to be over 0.5, everything runs sync, and no promises get created at all:
17:06:56.504 Test before
17:06:56.504 No sleep
17:06:56.505 Printing 0.6
17:06:56.505 Test between
17:06:56.505 No sleep
17:06:56.505 Printing 0.7
17:06:56.505 Test after
This does suggest an answer to the original question, though. You could have a function like this (disclaimer: untested Node.js code):
const fs = require('fs');
const fsPromises = fs.promises;

const cache = {};

async function getData(key, forceSync) {
    if (cache.hasOwnProperty(key)) return cache[key]; // runs sync
    if (forceSync) { // runs sync
        const value = fs.readFileSync(`${key}.txt`);
        cache[key] = value;
        return value;
    }
    // if we reach here, the code will run async
    const value = await fsPromises.readFile(`${key}.txt`);
    cache[key] = value;
    return value;
}
Now, this can be easily circumvented by extracting the logic inside updateCacheForKey into a new synchronous function, and calling this new function from both existing functions.
T.J. Crowder explains the semantics of async functions in JavaScript perfectly. But in my opinion the paragraph above deserves more discussion. Depending on what updateCacheForKey does, it may not be possible to extract its logic into a synchronous function because, in JavaScript, some things can only be done asynchronously. For example there is no way to perform a network request and wait for its response synchronously. If updateCacheForKey relies on a server response, it can't be turned into a synchronous function.
It was true even before the advent of asynchronous functions and promises: XMLHttpRequest, for instance, gets a callback and calls it when the response is ready. There's no way of obtaining a response synchronously. Promises are just an abstraction layer on callbacks and asynchronous functions are just an abstraction layer on promises.
Now this could have been done differently. And it is in some environments:
In PHP, pretty much everything is synchronous. You send a request with curl and your script blocks until it gets a response.
Node.js has synchronous versions of its file system calls (readFileSync, writeFileSync etc.) which block until the operation completes.
Even plain old browser JavaScript has alert and friends (confirm, prompt) which block until the user dismisses the modal dialog.
This demonstrates that the designers of the JavaScript language could have opted for synchronous versions of XMLHttpRequest, fetch etc. Why didn't they?
[W]hy absolutely prevent this use case in the first place?
This is a design decision.
alert, for instance, prevents the user from interacting with the rest of the page because JavaScript is single threaded and the one and only thread of execution is blocked until the alert call completes. Therefore there's no way to execute event handlers, which means no way to become interactive. If there was a syncFetch function, it would block the user from doing anything until the network request completes, which can potentially take minutes, even hours or days.
This is clearly against the nature of the interactive environment we call the "web". alert was a mistake in retrospect and it should not be used except under very few circumstances.
The only alternative would be to allow multithreading in JavaScript which is notoriously difficult to write correct programs with. Are you having trouble wrapping your head around asynchronous functions? Try semaphores!
It is possible to add a good old .then() to the async function and it will work.
You should consider, though, changing your current regular function to an async one instead, and doing the same all the way up the call stack, until you reach a point where the returned promise is not needed, i.e. there is no work to be done with the value returned from the async function. In that case it actually CAN be called from a synchronous one.
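A rough sketch of the .then() approach applied to the question's code (keeping in mind that syncFunc still returns before the cache is updated):

function syncFunc(key) {
    if (!(key in cache)) {
        // fire off the update; the result is handled in a callback later
        updateCacheForKey([key]).then(() => {
            console.log(`cache updated for ${key}`);
        });
    }
    // execution continues here immediately; the cache is not yet updated
}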

fs.readFile is very slow, am I making too many requests?

node.js beginner here:
A Node.js application scrapes an array of links (linkArray) from a list of ~30 URLs.
Each domain/url has a corresponding (name).json file that is used to check whether the scraped links are new or not.
All pages are fetched, links are scraped into arrays, and then passed to:
function checkLinks(linkArray, name) {
    console.log(name, "checkLinks");
    fs.readFile('json/' + name + '.json', 'utf8', function readFileCallback(err, data) {
        if (err && err.errno != -4058) throw err;
        if (err && err.errno == -4058) {
            console.log(name + '.json', " is NEW .json");
            compareAndAdd(linkArray, {linkArray: []}.linkArray, name);
        } else {
            // file EXISTS
            compareAndAdd(linkArray, JSON.parse(data).linkArray, name);
        }
    });
}
compareAndAdd() reads:
function compareAndAdd(arrNew, arrOld, name) {
    console.log(name, "compareAndAdd()");
    if (!arrOld) arrOld = [];
    if (!arrNew) arrNew = [];
    // compare and remove dups
    function hasDup(value) {
        for (var i = 0; i < arrOld.length; i++)
            if (value.href == arrOld[i].href)
                if (value.text.length <= arrOld[i].text.length) return false;
        arrOld.push(value);
        return true;
    }
    var rArr = arrNew.filter(hasDup);
    // update existing array
    if (rArr.length > 0) {
        fs.writeFile('json/' + name + '.json', JSON.stringify({linkArray: arrOld}), function (err) {
            if (err) return console.log(err);
            console.log(" " + name + '.json UPDATED');
        });
    }
    else console.log(" " + name, "no changes, nothing to update");
    return;
}
checkLinks() is where the program hangs; it's unbelievably slow. I understand that fs.readFile is being hit multiple times a second, but fewer than 30 hits seems pretty trivial to me, given that this is the kind of function meant to serve data to (potentially) millions of users. Am I expecting too much from fs.readFile, or (more likely) is there another component (like writeFile, or something else entirely) that's locking everything up?
supplemental:
using write/readFileSync creates a lot of problems: this program is inherently async because it begins with requests to external websites with widely varying response times, and reads/writes would frequently collide. The functions above ensure that writing to a given file only happens after it's been read (though it is very slow).
Also, this program does not exit on its own, and I do not know why.
edit
I've reworked the program to read first and then write synchronously at the end, and the process is down to ~12 seconds. Apparently fs.readFile was hanging when called multiple times. I don't understand when/how to use asynchronous fs if multiple calls hang the function.
All async fs operations are executed inside the libuv thread pool, which has a default size of 4 (can be changed by setting the UV_THREADPOOL_SIZE environment variable to something different). If all threads in the thread pool are busy, any fs operations will be queued up.
I should also point out that fs is not the only module that uses the thread pool, dns.lookup() (the default hostname resolution method used internally by node), async zlib methods, crypto.randomBytes(), and a couple of other things IIRC also use the libuv thread pool. This is just something to keep in mind.
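As a sketch (the size 8 is arbitrary, not a recommendation), the pool can be enlarged by setting the environment variable before any thread-pool work is queued:

// Must run at the very top of the entry script, before any async fs,
// dns.lookup, zlib, or crypto.randomBytes work has been queued.
// Equivalently, set it in the shell: UV_THREADPOOL_SIZE=8 node app.js
process.env.UV_THREADPOOL_SIZE = 8;

const fs = require('fs');
// up to 8 fs operations can now run on worker threads concurrently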
If you read many files (checkLinks) in a loop, first ALL of the fs.readFile calls will be made, and only AFTER that will the callbacks be processed (callbacks run only once the main call stack is empty). This leads to a significant startup delay, but don't worry about that.
You point out that the program never exits. So make a counter, count the calls to checkLinks, and decrease the counter when each callback is called. Inside the callback, check the counter against 0 and then run the finalizing logic (I suspect this could be the response to the HTTP request).
Actually, it doesn't matter much whether you use the async or sync version; they will take roughly the same amount of time.
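A minimal sketch of that counter idea, layered onto the question's checkLinks (the names urls and finalize are placeholders):

var pending = urls.length; // one pending entry per checkLinks call

function checkLinks(linkArray, name) {
    fs.readFile('json/' + name + '.json', 'utf8', function (err, data) {
        // ... existing compare-and-add logic ...
        pending--;
        if (pending === 0) {
            finalize(); // e.g. respond to the HTTP request, or exit
        }
    });
}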

Node.js: How to prevent two callbacks from running simultaneously?

I'm a bit new to Node.js. I've run into a problem where I want to prevent a callback from running while it is already being executed. For example:
items.forEach(function(item) {
    doLongTask(item, function handler(result) {
        // If items.length > 1, this will get executed multiple times.
    });
});
How do I make the other invocations of handler wait for the first one to finish before going ahead? I'm thinking something along the lines of a queue, but I'm a newbie to Node.js so I'm not exactly sure what to do. Ideas?
There are already libraries which take care of that, the most used being async.
You will be interested in the async.eachSeries() function.
As for an actual example...
const async = require('async')

async.eachSeries(
    items,
    (item, next) => {
        // Do stuff with item, and when you are done, call next
        // ...
        next()
    },
    err => {
        // either there was an error in one of the handlers and
        // execution was stopped, or all items have been processed
    }
)
As for how the library does this, you are better off having a look at the source code.
It should be noted that this only ever makes sense if your item handler performs an asynchronous operation, like interfacing with the filesystem or with the internet. No operation in Node.js causes one piece of JS code to execute in parallel with another within the same process. So if all you do is some calculations, you don't need to worry about this at all.
How to prevent two callbacks from running simultaneously?
They won't run simultaneously unless they're asynchronous, because Node runs JavaScript on a single thread. Asynchronous operations can overlap, but the JavaScript thread will only ever be doing one thing at a time.
So presumably doLongTask is asynchronous. You can't use forEach for what you'd like to do, but it's still not hard: You just keep track of where you are in the list, and wait to start processing the next until the previous one completes:
var n = 0;
processItem();

function processItem() {
    if (n < items.length) {
        doLongTask(items[n], function handler(result) {
            ++n;
            processItem();
        });
    }
}

Using Javascript node.js how do I parallel process For loops?

I only started to learn javascript 2 days ago so I'm pretty new. I've written code which is optimal but takes 20 minutes to run. I was wondering if there's a simple way to parallel process with for loops e.g.
for (x = 0; x < 5; x++) {
    // processor 1 do ...
}
for (x = 5; x < 10; x++) {
    // processor 2 do ...
}
Since the OP wants to process the loop in parallel, the async.each() function from the async library is the ideal way to go.
I've had faster execution times using async.each compared to forEach in nodejs.
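A minimal sketch of async.each (assuming each item involves some asynchronous I/O; fs.stat is just a stand-in here):

const async = require('async');
const fs = require('fs');

async.each(items, (item, next) => {
    fs.stat(item, (err, stats) => {   // any async operation per item
        if (err) return next(err);
        // ... use stats ...
        next();
    });
}, err => {
    if (err) console.error(err);
    else console.log('all items processed, with their I/O overlapped');
});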
Web workers can run your code in parallel, but without sharing memory/variables etc. Basically, you pass input parameters to the worker; it does the work and gives you back the result.
http://www.html5rocks.com/en/tutorials/workers/basics/
You can find Node.js implementations of this, for example:
https://www.npmjs.com/package/webworker-threads
OR, depending on how your code is written, if you're waiting on a lot of asynchronous functions, you can always rewrite your code to run faster (e.g. using event queues instead of for loops; just beware of dependencies, order of execution, etc.).
To run code in parallel or make requests in parallel, you can use Promise.all or Promise.allSettled.
Make all the queries in parallel (asynchronously), resulting in every query firing at the same time.
let promise1 = new Promise((resolve) => setTimeout(() => resolve('any-value'), 3000));

let responses = await Promise.all([promise1, promise2, promise3, ...]);
for (let response of responses) {
    // format responses
    // respond to client
}
For more examples check out this article
You might want to take a look at the async.js project, especially the parallel function.
Important quote about it:
parallel is about kicking-off I/O tasks in parallel, not about parallel execution of code. If your tasks do not use any timers or perform any I/O, they will actually be executed in series. Any synchronous setup sections for each task will happen one after the other. JavaScript remains single-threaded.
Example :
async.parallel([
    function(callback) {
        setTimeout(function() {
            callback(null, 'one');
        }, 200);
    },
    function(callback) {
        setTimeout(function() {
            callback(null, 'two');
        }, 100);
    }
],
// optional callback
function(err, results) {
    // the results array will equal ['one','two'] even though
    // the second function had a shorter timeout.
});

In node.js, how to use child_process.exec so all can happen asynchronously?

I have a server built on node.js. Below is one of the request handler functions:
var exec = require("child_process").exec;

function doIt(response) {
    // some trivial and fast code - can be ignored

    exec(
        "sleep 10", // run OS' sleep command, sleep for 10 seconds
        //sleeping(10), // commented out: run a local function, defined below
        function(error, stdout, stderr) {
            response.writeHead(200, {"Content-Type": "text/plain"});
            response.write(stdout);
            response.end();
        });

    // some trivial and fast code - can be ignored
}
Meanwhile, in the same module file there is a local function "sleeping" defined, which as its name indicates will sleep for 10 seconds.
function sleeping(sec) {
    var begin = new Date().getTime();
    while (new Date().getTime() < begin + sec * 1000); // just loop till time is up
}
Here come three questions --
As we know, node.js is single-threaded, asynchronous, and event-driven. Is it true that ALL functions with a callback argument are asynchronous? For example, if I have a function my_func(callback_func), which takes another function as an argument, are there any restrictions on callback_func, or anything else, needed to make my_func asynchronous?
So at least child_process.exec is asynchronous, with an anonymous callback function as an argument. Here I pass "sleep 10" as the first argument, to call the OS's sleep command and wait for 10 seconds. It won't block the whole node process; i.e. any other request sent to another request handler won't be blocked for up to 10 seconds by the "doIt" handler. However, if another request is immediately sent to the server and handled by the same "doIt" handler, will it have to wait until the previous "doIt" request ends?
If I use the sleeping(10) function call (commented out) to replace the "sleep 10", I find that it does block other requests until 10 seconds later. Could anyone explain the difference?
Thanks a bunch!
-- update per request --
One comment says this question seems to be a duplicate of another one (How to promisify Node's child_process.exec and child_process.execFile functions with Bluebird?) that was asked one year after this one. Well, the two are quite different: this one asks about asynchrony in general, with a specific buggy case, while that one asks about the Promise object per se. Both the intent and the use cases differ.
(If by any chance these are similar, shouldn't the newer one be marked as a duplicate of the older one?)
First, you can promisify child_process.exec:
const util = require('util');
const exec = util.promisify(require('child_process').exec);

async function lsExample() {
    const { stdout, stderr } = await exec('ls');
    if (stderr) {
        // handle error
        console.log('stderr:', stderr);
    }
    console.log('stdout:', stdout);
}

lsExample();
As an async function, lsExample returns a promise.
Run all promises in parallel with Promise.all([]).
Promise.all([lsExample(), otherFunctionExample()]);
If you need to wait on the promises to finish in parallel, await them.
await Promise.all([aPromise(), bPromise()]);
If you need the values from those promises
const [a, b] = await Promise.all([aPromise(), bPromise()]);
1) No. For example .forEach is synchronous:
var lst = [1, 2, 3];
console.log("start");
lst.forEach(function(el) {
    console.log(el);
});
console.log("end");
Whether a function is asynchronous or not depends purely on its implementation; there are no restrictions. You can't know it a priori (you have to test it, know how it is implemented, or read and trust the documentation). What's more, depending on its arguments, a function can be either asynchronous or synchronous, or both.
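For example, this (hypothetical) function is synchronous on a cache hit and asynchronous on a miss, which is exactly why you can't tell from the call site alone:

const fs = require('fs');
const cache = {};

function maybeAsync(path, cb) {
    if (path in cache) {
        cb(null, cache[path]);              // synchronous path
    } else {
        fs.readFile(path, (err, data) => {  // asynchronous path
            cache[path] = data;
            cb(err, data);
        });
    }
}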
2) No. Each request will spawn a separate "sleep" process.
3) That's because your sleeping function is a total mess; it is not a sleep at all. What it does is busy-wait in a loop, checking the date (and thus using 100% of the CPU). Since node.js is single-threaded, it blocks the entire server, because it is synchronous. This is wrong; don't do this. Use setTimeout instead.
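A non-blocking replacement (a sketch using setTimeout wrapped in a promise; in plain callback style you would simply pass the continuation to setTimeout directly):

// schedules the continuation instead of spinning the CPU
function sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
}

async function doIt(response) {
    await sleep(10000); // other requests are served during these 10 seconds
    response.writeHead(200, {"Content-Type": "text/plain"});
    response.end("done");
}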
