Most NodeJS programmers know why it's bad to block NodeJS's single-threaded event loop. For example, when reading a file, we know it's better to use fs.readFile instead of fs.readFileSync. My question is: how does one go about creating an asyn API in the first place if the underlying task to be performed by the API is inherently synchronous? Another words, how do I go about creating an API such as fs.readFileSync, and not use the event-loop thread to perform the underlying task? Do I have to go outside of NodeJS & Javascript to do this?
how does one go about creating an asyn API in the first place if the
underlying task to be performed by the API is inherently synchronous?
In node.js, there are the following choices for creating your own asynchronous operation:
Base it on existing async operations (such as fs.readFile()). If your main operation itself is synchronous and can only be done synchronously in node.js, then this option would obviously not solve your problem.
Break your code into small chunks using setTimeout(), setImmediate() or nextTick() so that other things can interleave with your operation and you won't block other things from sharing the CPU. This doesn't prevent CPU usage, but it does allow sharing of the CPU with other operations. Here's an example of iterating over an array in chunks that doesn't block other operations from sharing the CPU: Best way to iterate over an array without blocking the UI.
Move your synchronous operation into another process (either another node.js process or any other process) that can do its thing and then communicate back asynchronously when done using any appropriate inter-process communication mechanism (TCP, stdio, etc...).
Write a node.js plug-in where you have the ability to use native OS threading or native IO events and you can then create a new operation in node.js that has an async interface.
Do I have to go outside of NodeJS & Javascript to do this?
Items #1, #2, #3 above can all be done from Javascript. Item #4 involves create a native code plug-in (where you can have access to native threading).
Another words, how do I go about creating an API such as
fs.readFileSync
To truly create an API like fs.readFileSync() from scratch, you would have to use either option #3 (do it synchronously, but in another process and then communicate back the result asynchronously) or option #4 (access OS services at a lower level than is built into node.js via a native code plug-in).
Related
How do the NodeJS built in functions achieve their asynchronicity?
Am I able to write my own custom asynchronous functions that execute outside of the main thread? Or do I have to leverage the built in functions?
Just a side note, true asynchronous doesn't really mean anything. But we can assume you mean parallelism?.
Now depending on what your doing, you might find there is little to no benefit in using threads in node. Take for example: nodes file system, as long as you don't use the sync versions, it's going to automatically run multiple requests in parallel, because node is just going to pass these requests to worker threads.
It's the reason when people say Node is single threaded, it's actually incorrect, it's just the JS engine that is. You can even prove this by looking at the number of threads a nodeJs process takes using your process monitor of choice.
So then you might ask, so why do we have worker threads in node?. Well the V8 JS engine that node uses is pretty fast these days, so lets say you wanted to calculate PI to a million digits using JS, you could do this in the main thread without blocking. But it would be a shame not to use those extra CPU cores that modern PC's have and keep the main thread doing other things while PI is been calculated inside another thread.
So what about File IO in node, would this benefit been in a worker thread?.. Well this depends on what you do with the result of the file-io, if you was just reading and then writing blocks of data, then no there would be no benefit, but if say you was reading a file and then doing some heavy calculations on these files with Javascript (eg. some custom image compression etc), then again a worker thread would help.
So in a nutshell, worker threads are great when you need to use Javascript for some heavy calculations, using them for just simple IO may in fact slow things down, due to IPC overheads.
You don't mention in your question what your trying to run in parallel, so it's hard to say if doing so would be of benefit.
Javascript is mono-thread, if you want to create 'thread' you can use https://nodejs.org/api/worker_threads.html.
But you may have heard about async function and promises in javascript, async function return a promise by default and promise are NOT thread. You can create async function like this :
async function toto() {
return 0;
}
toto().then((d) => console.log(d));
console.log('hello');
Here you will display hello then 0
but remember that even the .then() will be executed after it's a promise so that not running in parallel, it will just be executed later.
I'm experimenting with laying out data in indexedDB object stores and using Promise.all to extract the data, build HTML, and add it to different sections of a tabbed display within a single-page web application.
If Promise.all is used to extract the data from different object stores or different key ranges of the same store and then build HTML fragments and insert them before resolving each individual promise, is the browser really performing these steps concurrently such that the process will complete quicker?
To be more specific, there is one promise for the extraction of data for each tab in the display. If the transaction completes successfully then the data is passed to a function that builds and inserts the HTML and, when it returns, the promise resolves. The promises for each tab are grouped in the Promise.all and, when that resolves, the program navigates to the first tab and displays it.
Is this really working on multiple things at the same time or is it just providing a timing method to show the first tab only after it is known that data and HTML for all tabs have been successfully gathered and built?
Is it accurate that this is quicker than chaining the promises to run one after the other with then statements but not faster than just kicking off each promise individually and placing a then on the first tab's promise to display it?
I know JS is a single-thread language but the documentation for some of these methods reads that "in a separate thread..."; so, I don't quite understand if the browser is performing things at the same time or not.
Thank you.
First off, let's just stipulate that there's no use of webWorker threads so we're just talking about regular Javascript in a web page.
In that case, there's only ONE thread of Javascript running at any given time and the whole system is event driven (more on that later).
So, when running actual Javascript code things can only run one at a time. But, as soon as you make a function call that has a native-code backed non-blocking, asynchronous implementation, then the Javascript calls out to the native code, the asynchronous operation is initiated and then it immediately returns control back to the Javascript interpreter so more Javascript can run. In this manner, you can start multiple asynchronous operations at once that will proceed independent of the Javascript interpreter.
How much independent they are and how parallel they are depends upon the implementation of those asynchronous operations. If they are all accessing the same database, there may or may not be some real parallelism between database operations available - it all depends upon the database implementation.
Now, let's assume you did start multiple database operations and you are tracking when they a re all done with Promise.all(). As each one finishes, it will insert an event in the JS event queue. When the JS interpreter has nothing else to do, it will pull the next event from the event queue and run the callback associated with the completion of that asynchronous operation. When using Promise.all(), that callback will (among other things), mark that single operation as done and store its result for later access. When, all the operations you are tracking with Promise.all() have completed in this manner, then the promise that Promise.all() returns will resolve and you can get access to the array of results.
Now to your specific questions...
If Promise.all is used to extract the data from different object stores or different key ranges of the same store and then build HTML fragments and insert them before resolving each individual promise, is the browser really performing these steps concurrently such that the process will complete quicker?
If each asynchronous operation is capable of efficient running in parallel with others (at the native code level), then "yes" the total time to completion will be faster if the operations are run in parallel, than if they are sequenced.
Is this really working on multiple things at the same time or is it just providing a timing method to show the first tab only after it is known that data and HTML for all tabs have been successfully gathered and built?
That depends upon the specific asynchronous operations and if multiple of them can be run in parallel. Usually, they can, but not always.
Is it accurate that this is quicker than chaining the promises to run one after the other with then statements but not faster than just kicking off each promise individually and placing a then on the first tab's promise to display it?
Usually, yes.
I know JS is a single-thread language but the documentation for some of these methods reads that "in a separate thread..."; so, I don't quite understand if the browser is performing things at the same time or not.
The browser runs your Javascript (the executing of actual Javascript instructions) as a single thread (ignoring webWorkers here) with no parallelism. But, when you call an asynchronous operation, that operation has an implementation in native code and that native code behind it is free to use threads or other asynchronous, non-blocking OS APIs/tools to do its job. That may allow parallelism between separate asynchronous calls, thus allowing multiple calls to proceed in parallel.
The IndexedDB interface contains many asynchronous interfaces which means it is event driven and must be running asynchronously with some native code behind it. You would have to test a given implementation of that database in a given browser to see how well it parallelizes multiple requests in-flight at the same time. Each browser implementation could have different characteristics.
It's a very general question, but I don't quite understand. When would I prefer one over the other? I don't seem to understand what situations might arise, which would clearly favour one over the other. Are there strong reasons to avoid x / use x?
When would I prefer one over the other?
In a server intended to scale and serve the needs of many users, you would only use synchronous I/O during server initialization. In fact, require() itself uses synchronous I/O. In all other parts of your server that handle incoming requests once the server is already up and running, you would only use asynchronous I/O.
There are other uses for node.js besides creating a server. For example, suppose you want to create a script that will parse through a giant file and look for certain words or phrases. And, this script is designed to run by itself to process one file and it has no persistent server functionality and it has no particular reason to do I/O from multiple sources at once. In that case, it's perfectly OK to use synchronous I/O. For example, I created a node.js script that helps me age backup files (removing backup files that meet some particular age criteria) and my computer automatically runs that script once a day. There was no reason to use asynchronous I/O for that type of use so I used synchronous I/O and it made the code simpler to write.
I don't seem to understand what situations might arise, which would clearly favour one over the other. Are there strong reasons to avoid x / use x?
Avoid ever using synchronous I/O in the request handlers of a server. Because of the single threaded nature of Javascript in node.js, using synchronous I/O blocks the node.js Javascript thread so it can only do one thing at a time (which is death for a multi-user server) whereas asynchronous I/O does not block the node.js Javascript thread (allowing it to potentially serve the needs of many users).
In non-multi-user situations (code that is only doing one thing for one user), synchronous I/O may be favored because writing the code is easier and there may be no advantages to using asynchronous I/O.
I thought of an electron application with nodejs, which is simply reading a file and did not understand what difference that would make really, if my software really just has to wait for that file to load anyways.
If this is a single user application and there's nothing else for your application to be doing while waiting for the file to be read into memory (no sockets to be responding to, no screen updates, no other requests to be working on, no other file operations to be running in parallel), then there is no advantage to using asynchronous I/O so synchronous I/O will be just fine and likely a bit simpler to code.
When would I prefer one over the other?
Use the non-Sync versions (the async ones) unless there's literally nothing else you need your program to do while the I/O is pending, in which case the Sync ones are fine; see below for details...
Are there strong reasons to avoid x / use x?
Yes. NodeJS runs your JavaScript code on a single thread. If you use the Sync version of an I/O function, that thread is blocked waiting on I/O and can't do anything else. If you use the async version, the I/O can continue in the background while the JavaScript thread gets on with other work; the I/O completion will be queued as a job for the JavaScript thread to come back to later.
If you're running a foreground Node app that doesn't need to do anything else while the I/O is pending, you're probably fine using Sync calls. But if you're using Node for processing multiple things at once (like web requests), best to use the async versions.
In a comment you added under the question you've said:
I thought of an electron application with nodejs, which is simply reading a file and did not understand what difference that would make really, if my software really just has to wait for that file to load anyways.
I have virtually no knowledge of Electron, but I note that it uses a "main" process to manage windows and then a "rendering" process per window (link). That being the case, using Sync functions will block the relevant process, which may affect application or window responsiveness. But I don't have any deep knowledge of Electron (more's the pity).
Until somewhat recently, using async functions meant using lots of callback-heavy code which was hard to compose:
// (Obviously this is just an example, you wouldn't actually read and write a file this way, you'd use streaming...)
fs.readFile("file1.txt", function(err, data) {
if (err) {
// Do something about the error...
} else {
fs.writeFile("file2.txt", data, function(err) {
if (err) {
// Do something about the error...
} else {
// All good
});
}
});
Then promises came along and if you used a promisified* version of the operation (shown here with pseudonyms like fs.promisifiedXYZ), it still involved callbacks, but they were more composable:
// (See earlier caveat, just an example)
fs.promisifiedReadFile("file1.txt")
.then(function(data) {
return fs.promisifiedWriteFile("file2.txt", data);
})
.then(function() {
// All good
})
.catch(function(err) {
// Do something about the error...
});
Now, in recent versions of Node, you can use the ES2017+ async/await syntax to write synchronous-looking code that is, in fact, asynchronous:
// (See earlier caveat, just an example)
(async () => {
try {
const data = await fs.promisifiedReadFile("file1.txt");
fs.promisifiedWriteFile("file2.txt", data);
// All good
} catch (err) {
// Do something about the error...
}
})();
Node's API predates promises and has its own conventions. There are various libraries out there to help you "promisify" a Node-style callback API so that it uses promises instead. One is promisify but there are others.
So I have an understanding of how Node.js works: it has a single listener thread that receives an event and then delegates it to a worker pool. The worker thread notifies the listener once it completes the work, and the listener then returns the response to the caller.
My question is this: if I stand up an HTTP server in Node.js and call sleep on one of my routed path events (such as "/test/sleep"), the whole system comes to a halt. Even the single listener thread. But my understanding was that this code is happening on the worker pool.
Now, by contrast, when I use Mongoose to talk to MongoDB, DB reads are an expensive I/O operation. Node seems to be able to delegate the work to a thread and receive the callback when it completes; the time taken to load from the DB does not seem to block the system.
How does Node.js decide to use a thread pool thread vs the listener thread? Why can't I write event code that sleeps and only blocks a thread pool thread?
Your understanding of how node works isn't correct... but it's a common misconception, because the reality of the situation is actually fairly complex, and typically boiled down to pithy little phrases like "node is single threaded" that over-simplify things.
For the moment, we'll ignore explicit multi-processing/multi-threading through cluster and webworker-threads, and just talk about typical non-threaded node.
Node runs in a single event loop. It's single threaded, and you only ever get that one thread. All of the javascript you write executes in this loop, and if a blocking operation happens in that code, then it will block the entire loop and nothing else will happen until it finishes. This is the typically single threaded nature of node that you hear so much about. But, it's not the whole picture.
Certain functions and modules, usually written in C/C++, support asynchronous I/O. When you call these functions and methods, they internally manage passing the call on to a worker thread. For instance, when you use the fs module to request a file, the fs module passes that call on to a worker thread, and that worker waits for its response, which it then presents back to the event loop that has been churning on without it in the meantime. All of this is abstracted away from you, the node developer, and some of it is abstracted away from the module developers through the use of libuv.
As pointed out by Denis Dollfus in the comments (from this answer to a similar question), the strategy used by libuv to achieve asynchronous I/O is not always a thread pool, specifically in the case of the http module a different strategy appears to be used at this time. For our purposes here it's mainly important to note how the asynchronous context is achieved (by using libuv) and that the thread pool maintained by libuv is one of multiple strategies offered by that library to achieve asynchronicity.
On a mostly related tangent, there is a much deeper analysis of how node achieves asynchronicity, and some related potential problems and how to deal with them, in this excellent article. Most of it expands on what I've written above, but additionally it points out:
Any external module that you include in your project that makes use of native C++ and libuv is likely to use the thread pool (think: database access)
libuv has a default thread pool size of 4, and uses a queue to manage access to the thread pool - the upshot is that if you have 5 long-running DB queries all going at the same time, one of them (and any other asynchronous action that relies on the thread pool) will be waiting for those queries to finish before they even get started
You can mitigate this by increasing the size of the thread pool through the UV_THREADPOOL_SIZE environment variable, so long as you do it before the thread pool is required and created: process.env.UV_THREADPOOL_SIZE = 10;
If you want traditional multi-processing or multi-threading in node, you can get it through the built in cluster module or various other modules such as the aforementioned webworker-threads, or you can fake it by implementing some way of chunking up your work and manually using setTimeout or setImmediate or process.nextTick to pause your work and continue it in a later loop to let other processes complete (but that's not recommended).
Please note, if you're writing long running/blocking code in javascript, you're probably making a mistake. Other languages will perform much more efficiently.
So I have an understanding of how Node.js works: it has a single listener thread that receives an event and then delegates it to a worker pool. The worker thread notifies the listener once it completes the work, and the listener then returns the response to the caller.
This is not really accurate. Node.js has only a single "worker" thread that does javascript execution. There are threads within node that handle IO processing, but to think of them as "workers" is a misconception. There are really just IO handling and a few other details of node's internal implementation, but as a programmer you cannot influence their behavior other than a few misc parameters such as MAX_LISTENERS.
My question is this: if I stand up an HTTP server in Node.js and call sleep on one of my routed path events (such as "/test/sleep"), the whole system comes to a halt. Even the single listener thread. But my understanding was that this code is happening on the worker pool.
There is no sleep mechanism in JavaScript. We could discuss this more concretely if you posted a code snippet of what you think "sleep" means. There's no such function to call to simulate something like time.sleep(30) in python, for example. There's setTimeout but that is fundamentally NOT sleep. setTimeout and setInterval explicitly release, not block, the event loop so other bits of code can execute on the main execution thread. The only thing you can do is busy loop the CPU with in-memory computation, which will indeed starve the main execution thread and render your program unresponsive.
How does Node.js decide to use a thread pool thread vs the listener thread? Why can't I write event code that sleeps and only blocks a thread pool thread?
Network IO is always asynchronous. End of story. Disk IO has both synchronous and asynchronous APIs, so there is no "decision". node.js will behave according to the API core functions you call sync vs normal async. For example: fs.readFile vs fs.readFileSync. For child processes, there are also separate child_process.exec and child_process.execSync APIs.
Rule of thumb is always use the asynchronous APIs. The valid reasons to use the sync APIs are for initialization code in a network service before it is listening for connections or in simple scripts that do not accept network requests for build tools and that kind of thing.
Thread pool how when and who used:
First off when we use/install Node on a computer, it starts a process among other processes which is called node process in the computer, and it keeps running until you kill it. And this running process is our so-called single thread.
So the mechanism of single thread it makes easy to block a node application but this is one of the unique features that Node.js brings to the table. So, again if you run your node application, it will run in just a single thread. No matter if you have 1 or million users accessing your application at the same time.
So let's understand exactly what happens in the single thread of nodejs when you start your node application. At first the program is initialized, then all the top-level code is executed, which means all the codes that are not inside any callback function (remember all codes inside all callback functions will be executed under event loop).
After that, all the modules code executed then register all the callback, finally, event loop started for your application.
So as we discuss before all the callback functions and codes inside those functions will execute under event loop. In the event loop, loads are distributed in different phases. Anyway, I'm not going to discuss about event loop here.
Well for the sack of better understanding of Thread pool I a requesting you to imagine that in the event loop, codes inside of one callback function execute after completing execution of codes inside another callback function, now if there are some tasks are actually too heavy. They would then block our nodejs single thread. And so, that's where the thread pool comes in, which is just like the event loop, is provided to Node.js by the libuv library.
So the thread pool is not a part of nodejs itself, it's provided by libuv to offload heavy duties to libuv, and libuv will execute those codes in its own threads and after execution libuv will return the results to the event in the event loop.
Thread pool gives us four additional threads, those are completely separate from the main single thread. And we can actually configure it up to 128 threads.
So all these threads together formed a thread pool. and the event loop can then automatically offload heavy tasks to the thread pool.
The fun part is all this happens automatically behind the scenes. It's not us developers who decide what goes to the thread pool and what doesn't.
There are many tasks goes to the thread pool, such as
-> All operations dealing with files
->Everyting is related to cryptography, like caching passwords.
->All compression stuff
->DNS lookups
This misunderstanding is merely the difference between pre-emptive multi-tasking and cooperative multitasking...
The sleep turns off the entire carnival because there is really one line to all the rides, and you closed the gate. Think of it as "a JS interpreter and some other things" and ignore the threads...for you, there is only one thread, ...
...so don't block it.
Will Node.js get blocked when processing large file uploads?
Since Node.js only has one thread, is that true that when doing large file uploads all other requests will get blocked?
If so, how should I handle file uploads in nodejs?
All the I/O operations is handled by Node.js is using multiple threads internally; it's the programming interface to that I/O functionality that's single threaded, event-based, and asynchronous.
So the big upload of your example is performed by a separate thread that's managed by Node.js, and when that thread completes its work, your callback is put onto the event loop queue.
When you do CPU intensive task it blocks. Let's say we have a task compute() which needs to run almost continuously, and does some CPU intensive calculations.
Answer to the main question "How should I handle file uploads in nodejs?" Check in your code (or the library) where you save file on the server, is it dependent on writefile() or writeFileSync()?If it is using writefile() then its asynchronous; But if it is writeFileSync() its is synchronous version.
Updates: In response to a comment:
"the answer "No, it won't block" is correct but explanation is
completely wrong. JS is in one thread AND I/O is in one (same)
thread. Event loop / asynchronous processing / callbacks make this possible. No multiple threads required. " - by andrey-sidorov
There is no async API for file operations so Node.js uses a thread pool for that. You can see it in the code of libuv. You can look at the source for fs.readFile in lib/fs.js, you’ll see binding.read. Whenever you see binding in Node’s core modules you’re looking at a portal into the land of C++. This binding is made available using NODE_SET_METHOD(target, "read", Read). If you know any C, you might think this is a macro – it was originally, but it’s now a function.
Going back to ASYNC_CALL in Read, one of the arguments is read: the syscall read. But wait, doesn't this function block?
Yes, but that’s not the end of the story. An Introduction to libuv denotes the following:
"The libuv filesystem operations are different from socket operations. Socket operations use the non-blocking operations provided by the operating system. Filesystem operations use blocking functions internally, but invoke these functions in a thread pool and notify watchers registered with the event loop when application interaction is required."
Summary:
Node API method writeFile() is asynchronous, but that doesn’t necessarily mean it’s non-blocking underneath. As the libuv book points out, socket (network) code is non-blocking, but filesystems are more complicated. Some things are event-based (kqueue), others use thread pool (as in this case).
Consider going through C code on which Node.js is developed, for more information:
Unix fs.c
Windows fs.c
That depends on the functions used for accomplishing that task. If you are using asynchronous functions, then Node.js will not block. But there are also synchronous functions, for example fs.readFileSync (FileSystem Doc), that will block execution.
Just take care and choose asynchronous functions. This way Node.js will keep running while slow tasks/waits are completed by external libraries. Once those tasks are completed, the Event Loop will take care of the result and execute your callbacks.
You can read more about the Event Loop here: Understanding the Node.js event loop
That is the exact reason node.js being asynchronous.
Most (all?) functions in node.js involving Input/Output operations (where bottleneck is some other device than CPU or RAM) the operation happens on a sepparate thread, letting your node.js server do some other code while waiting.