Managing multiple long-running tasks concurrently in JS (Node.js)

Golang developer here, trying to learn JS (Node.js).
I'm used to working with goroutines in Go, which for the sake of simplicity let's assume are just threads (actually they're not exactly threads, more like Green Threads, but bear with me!).
Imagine now that I want to create some kind of service that can run some endlessTask which, for example, could be a function that receives data from a websocket and keeps an internal state updated, which can be queried later on. Now, I want to be able to serve multiple users at the same time, and each of them can also stop their specific ongoing task at some point. In Go, I could just spawn a goroutine for my endlessTask and store some kind of session in the request dispatcher to keep track of which user each task belongs to.
How can I implement something like this in JS? I looked through Node.js API documentation and I found some interesting things:
Cluster: doesn't seem to be exactly what I'm looking for
Child processes: could work, but I'd be spawning 1 process per client/user and the overhead would be huge I think
Worker threads: that's more like it, but the documentation states that they "are useful for performing CPU-intensive JavaScript operations" and "Node.js built-in asynchronous I/O operations are more efficient than Workers can be"
I'm not sure how I could handle this scenario without multi-threading or multi-processing. Would the worker threads solution be viable in this case?
Any input or suggestion would be appreciated. Thanks!

Imagine now that I want to create some kind of service that can run some endlessTask which, for example, could be a function that receives data from a websocket and keeps an internal state updated
So, rather than threads, you need to be thinking in terms of events and event handlers since that's the core of the nodejs architecture, particularly for I/O. So, if you want to be able to read incoming webSocket data and update some internal state when it arrives, all you do is set up an event handler for the incoming webSocket data. That event handler will then get called any time there's data waiting to be read and the interpreter is back to the event loop.
You don't have to create any thread structure for that or any type of loop or anything like that. Just add the right event handler and let it call you when there's incoming data available.
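A minimal sketch of that, assuming the popular ws package (updateInternalState is just an illustrative stand-in, not part of any API):

const { WebSocketServer } = require('ws');

// Stand-in for whatever state bookkeeping you need.
function updateInternalState(data) {
  console.log('received', data.toString());
}

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (ws) => {
  // The event loop calls this handler whenever data arrives;
  // there is no blocking read loop anywhere.
  ws.on('message', updateInternalState);
});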
Now, I want to be able to serve multiple users at the same time and each of them can also stop their specific ongoing task at some point.
Just add an event listener to each webSocket and your nodejs server will easily serve multiple users. When the user disconnects their webSocket, the listener automatically goes away with it. There's nothing else to do or cleanup in that regard unless you want to update the internal state, in which case you can also listen for the disconnect event.
In Go, I could just spawn a goroutine for my endlessTask, store some kind of session in the request dispatcher to keep track to which user each task belongs.
I don't know goroutines, but there are lots of options for storing the user state. If it's just info that you need to be able to get to when you already have the webSocket and don't need it to persist beyond that, then you can just add the state directly to the webSocket object. That object will be available any time you get a webSocket event, so you can always update it there when incoming data arrives. You can also put the state in other places (a database, or a Map object that's indexed by socket, by username, or by whatever you need to look it up by) - it really depends what exactly the state is.
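For example, a sketch of the Map approach (again assuming the ws package; the state shape here is made up for illustration):

const { WebSocketServer } = require('ws');

const wss = new WebSocketServer({ port: 8080 });

// State indexed by socket, so other code can look it up too.
const stateBySocket = new Map();

wss.on('connection', (ws) => {
  stateBySocket.set(ws, { messageCount: 0, lastSeen: Date.now() });

  ws.on('message', () => {
    const state = stateBySocket.get(ws);
    state.messageCount++;
    state.lastSeen = Date.now();
  });

  // Clean up when the user disconnects so the Map doesn't grow forever.
  ws.on('close', () => stateBySocket.delete(ws));
});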
I'm not sure how I could handle this scenario without multi-threading or multi-processing. Would the worker threads solution be viable in this case?
What you have described doesn't sound like anything that would require clustering, child processes or worker threads unless something you're doing with the data is CPU intensive. Just using event listeners for incoming data on each webSocket will let nodejs' very efficient and asynchronous I/O handling kick into gear. This is one of the things it is best at.
Keep in mind that I/O in nodejs may be a little inside-out from what you're used to. You don't create a blocking read loop waiting for incoming data on the webSocket. Instead, you just set up an event listener for incoming data and it will call you when incoming data is available.
The time you would involve clustering, child processes or Worker Threads is when you have more CPU processing in your Javascript to process the incoming data than a single core can handle. I would only go there if/when you've proven you have a scalability issue with the CPU usage in your nodejs server. Then, you'd want to pursue an architecture that adds just a few other processes or threads to share the load (not one per connection). If you have specific CPU-heavy operations (custom encryption or compression are classic examples), then it may help to create a few other processes or Worker Threads that just handle a work queue for the CPU-heavy work. Or, if it's just about increasing the overall CPU cycles available to process incoming data, then you would probably go to clustering and let each incoming webSocket get assigned to a cluster process, still using the same event handling logic previously described, but now the webSockets are split across several processes so you have more CPU to throw at them.
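As a rough sketch of that work-queue idea with Node's built-in worker_threads (the file names and message shape are invented for illustration):

// main.js - keep one long-lived worker and feed it CPU-heavy jobs
const { Worker } = require('worker_threads');

const worker = new Worker('./crunch-worker.js');
worker.on('message', (result) => console.log('job done:', result));

function submitJob(payload) {
  worker.postMessage(payload); // jobs queue up and run off the main thread
}

// crunch-worker.js - processes one job at a time
const { parentPort } = require('worker_threads');

parentPort.on('message', (payload) => {
  const result = heavyComputation(payload); // stand-in for the CPU-bound work
  parentPort.postMessage(result);
});

function heavyComputation(payload) {
  return payload; // e.g. custom encryption or compression would go here
}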

Related

Socket.io Performance optimisation 15'000 users

I have a chat application with huge chatrooms (15'000 user connected to one single room).
Only few have the right to write, so theoretically there should not be a huge load.
I have noticed that there is a performance issue: when only one message is sent, the server CPU load spikes to 30%-50% and the message gets delivered slowly (maybe 1 second later, or worse if you write multiple messages).
I have analysed the performance with clinic-flame. I see that this code is the problem:
socket.to("room1").emit(/* ... */); which will trigger send() in engine.io and clearBuffer in the ws library.
Does someone know if I am doing something wrong and how to optimize the performance?
You can load balance socket.io servers.
I haven't done this yet, but there is information about doing it here: https://socket.io/docs/v4/using-multiple-nodes
If you use node.js as the http server: https://socket.io/docs/v4/using-multiple-nodes#using-nodejs-cluster
Alternatively, you can consider using redis for that. There is a package that allows distributing traffic across a cluster of processes or servers:
https://github.com/socketio/socket.io-redis-adapter
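For reference, wiring the adapter into a socket.io v4 server looks roughly like this (see the linked docs for the authoritative version):

const { Server } = require('socket.io');
const { createClient } = require('redis');
const { createAdapter } = require('@socket.io/redis-adapter');

const pubClient = createClient({ url: 'redis://localhost:6379' });
const subClient = pubClient.duplicate();

Promise.all([pubClient.connect(), subClient.connect()]).then(() => {
  const io = new Server(3000);
  // Emits are relayed through redis, so each process only has to
  // serialize frames for the sockets it owns.
  io.adapter(createAdapter(pubClient, subClient));
});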
You do not have to manage load balancing for up to 15k-20k users if you follow the instructions below.
1. In socket code, do not use async/await.
2. In socket code, do not run any DB query.
3. Use the socket to take data from someone and deliver it to others; if you want to store that data, do it through APIs before or after you send it to the socket.
It's all about how you use it. Queries and awaits will slow your response under load, and there is always a way to do things differently.
The above 3 suggestions boosted my socket performance; I hope they do the same for you.

What are some good use cases for Server Sent Events

I discovered SSE (Server-Sent Events) pretty late, but I can't seem to figure out the use cases where it would be more efficient than using setInterval() and ajax.
I guess that if we had to update the data multiple times per second, then having one single connection would produce less overhead. But apart from this case, when would one really choose SSE?
I was thinking of this scenario:
A new user comment from the website is added in the database
Server periodically queries DB for changes. If it finds new comment, send notification to client with SSE
Also, this SSE question came into my mind after having to do a simple "live" website change (when someone posts a comment, notify everybody who is on the site). Is there really another way of doing this without periodically querying the database?
Nowadays web technologies are used to implement all sorts of applications, including those which need to fetch constant updates from the server.
As an example, imagine having a graph in your web page which displays real-time data. Your page must refresh the graph any time there is new data to display.
Before Server-Sent Events, the only way to obtain new data from the server was to perform a new request every time.
Polling
As you pointed out in the question, one way to look for updates is to use setInterval() and an ajax request. With this technique, our client performs a request once every X seconds, whether or not there is new data. This technique is known as polling.
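For instance, a polling client might look like this (the endpoint name is made up):

// Ask the server every 5 seconds, whether anything changed or not.
setInterval(async () => {
  const res = await fetch('/api/comments/latest'); // hypothetical endpoint
  const comments = await res.json();
  renderComments(comments); // stand-in for your UI update
}, 5000);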
Events
Server-Sent Events, on the contrary, are asynchronous. The server itself notifies the client when there is new data available.
In the scenario of your example, you would implement SSE in such a way that the server sends an event immediately after adding the new comment, rather than by polling the DB.
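A minimal sketch of that push model, assuming an Express server (the route and payload are illustrative):

// server.js - keep each response open and push events into it
const express = require('express');
const app = express();
const clients = new Set();

app.get('/events', (req, res) => {
  res.set({
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
  });
  res.flushHeaders();
  clients.add(res);
  req.on('close', () => clients.delete(res));
});

// Call this right after inserting the comment - no DB polling involved.
function broadcastComment(comment) {
  for (const res of clients) {
    res.write(`data: ${JSON.stringify(comment)}\n\n`);
  }
}

app.listen(3000);

// browser side
const source = new EventSource('/events');
source.onmessage = (e) => showComment(JSON.parse(e.data)); // showComment is a stand-in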
Comparison
Now the question may be when is it advisable to use polling vs SSE. Aside from compatibility issues (not all browsers support SSE, although there are some polyfills which essentially emulate SSE via polling), you should focus on the frequency and regularity of the updates.
If you are uncertain about the frequency of the updates (how often new data should be available), SSE may be the solution because they avoid all the extra requests that polling would perform.
However, it is wrong to say that SSE produces less overhead than polling in general. That is because SSE requires an open TCP connection to work. This essentially means that some resources on the server (e.g. a worker and a network socket) are allocated to one client until the connection is over. With polling, instead, the connection may be reset after the request is answered.
Therefore, I would not recommend using SSE if the average number of connected clients is high, because this could create significant overhead on the server.
In general, I advise using SSE only if your application requires real-time updates. As a real-life example, I developed data acquisition software in the past and had to provide a web interface for it. In this case, a lot of graphs were updated every time a new data point was collected. That was a good fit for SSE because the number of connected clients was low (essentially, only one), the user interface had to update in real time, and the server was not flooded with requests as it would have been with polling.
Many applications do not require real time updates, and thus it is perfectly acceptable to display the updates with some delay. In this case, polling with a long interval may be viable.

Is it conflicting for multiple users on one backend server's websockets?

I'm planning on building some backend logic on a server for personal use. It's connected to a websocket from another server, and I've set up code to handle data from that socket. I'm still fairly new to using websockets, so the whole concept is still a little foreign to me.
If I allowed more users to use that backend and the websocket has specific logic running wouldn't it be conflicted by multiple users? Or would each user have their own instance of the script running?
Does it make any sense of what I'm trying to ask?
If I allowed more users to use that backend and the websocket has specific logic running wouldn't it be conflicted by multiple users? Or would each user have their own instance of the script running?
In node.js, there is only one copy of the script running (unless you use something like clustering to run a copy of the script for each core, which it does not sound like you are asking about). So, if you have multiple webSocket connections to the same server, they will all be running in the same server code with the same variables, etc... This is how node.js works. One running Javascript engine and one code base serves many connections.
node.js is an event-driven system so it will serve an incoming event from one webSocket, then return control back to the Javascript system and serve the next event in the event queue and so on. Whenever a request handler calls some asynchronous operation and waits for a response, that is an opportunity for another event to be pulled from the incoming event queue and another request handler can run. In this way, multiple requests handlers can be interleaved with all making progress toward completion, even though there is only one single thread of Javascript running.
What this architecture generally means is that you never want to put request-specific state in the global or module scope because those scopes are shared by all request handlers. Instead, the state should be in the request-specific scope or in a session that is bound to that particular user.
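To make the pitfall concrete (a sketch with the ws package; the names are made up):

const { WebSocketServer } = require('ws');
const wss = new WebSocketServer({ port: 8080 });

// BAD: module scope is shared by every connection, so one socket's
// handler can clobber another's data.
let currentUser = null;

// BETTER: keep the state on the connection itself.
wss.on('connection', (ws) => {
  const session = { user: null, startedAt: Date.now() }; // per-connection

  ws.on('message', (data) => {
    session.user = data.toString(); // only this socket's session changes
  });
});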
Is it conflicting for multiple users on one backend server's websockets?
No, it will not conflict if you write your server code properly. Yes, it will conflict if you write your server code wrongly.

Alternative to "Notification URL" to Handle Long Running API Process in Node

I am building an API that will take a long time to return data, up to 60 seconds while a conversion takes place. While running, I would like to keep the users informed of any errors and notify them which process in the conversion stage we are at.
This is pretty easy on the client since I can simply send a WebSocket event, but for a public API, that's not very practical.
I know I can request a notification URL and send updates to the given URL, but it seems cumbersome and potentially resource heavy. Is there another more efficient means to send progress notifications?
Ideally, the user consuming the API would be able to set up something like:
.on("error", function(err) {
//handle error
});
or something to that effect.
You're not really clear on who the consumers of your API are, what kinds of clients they're using, or what the workflow will look like. So there's a lot of different answers depending on what you're looking for and what resources you have available.
A non-exhaustive list:
REST endpoint polling
Understood that you aren't a fan, but this remains one of the best ways to do it for a wide range of clients, and is one of only two ways (that I know of) to do it for purely browser-based clients. Performance-wise, it's not awful if you set up your caching strategy appropriately and set throttle limits on your clients (which you should be doing anyway). I disagree that it's a PITA for clients to consume, but that's opinion and you obviously feel differently. One way to mitigate that PITA is to offer an SDK that handles that mechanism for consumers.
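A sketch of what that could look like in Express (the job store, routes and timings are all invented):

const express = require('express');
const crypto = require('crypto');
const app = express();

const jobs = new Map(); // jobId -> { status, progress, error }

app.post('/conversions', (req, res) => {
  const id = crypto.randomUUID();
  jobs.set(id, { status: 'running', progress: 0, error: null });
  startConversion(id); // kicks off the long-running work elsewhere
  res.status(202).json({ id, statusUrl: `/conversions/${id}` });
});

// Clients poll this, subject to the throttle limits mentioned above.
app.get('/conversions/:id', (req, res) => {
  const job = jobs.get(req.params.id);
  if (!job) return res.sendStatus(404);
  res.json(job);
});

// Stand-in for the real conversion pipeline.
function startConversion(id) {
  setTimeout(() => Object.assign(jobs.get(id), { status: 'done', progress: 100 }), 60000);
}

app.listen(3000);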
Web Sockets
I get that you might be dealing with clients who aren't starting off in the web, but if a client can make a RESTful request, you could set the server to do the web socket upgrade if the client advertises interest in establishing same. I'm not a fan of this option as it feels more complex to me (more moving parts), but it's an option if you like web sockets and all/most of your clients will be web socket capable. Or you could just have the REST response be the URL to the web socket you're opening for that client.
Web Hooks
If your clients are likely to be other machines (esp. servers), then a web hook is a very good approach, especially if the event you want to raise can happen more than once and at unpredictable intervals. In this scheme, the client makes a REST request to you, part of the data they send you includes a URL that you will POST data to (in a format you specify in your API) when the event occurs. Obviously, they either have to leave that URL open to your POST or else you can agree upon some kind of credentialing that your server will respect.
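Roughly (the request shape is invented, and this assumes Node 18+'s global fetch):

// The client registers a callback URL when it submits the job, e.g.
// POST /conversions  { "callbackUrl": "https://client.example.com/hooks/convert" }

// The API then POSTs progress and error events to that URL as they occur.
async function notify(callbackUrl, event) {
  await fetch(callbackUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(event), // e.g. { jobId, stage: 'converting', error: null }
  });
}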
TCP Socket
Similar to the Web Socket option, in that you'd probably have a REST request hit your endpoint, and then respond with the socket connection information/URI to a custom TCP socket. This is a bit nonstandard, but can be very useful and efficient in the right use cases. I haven't used it in a while so they may have changed it, but this is how Heroku's API used to handle streaming logs.
Pub/Sub or Message Queue or similar
Redis can do this, as can many others. In this scenario you're making a more generic solution where there might be more than one event channel clients can subscribe to, and so on. I dislike exposing Redis directly for security reasons, which means you'll still need to figure out how to handle the comms between Redis and the client (see above), but using it under the hood will at least buy you some of the conceptual logic of handling publishers and subscribers and so on; useful if you have more than one event as I said. This is a more heavyweight solution than the above, though, and will increase your sysadmin overhead by some amount (depending on your high availability needs, etc)

Moving node.js server javascript processing to the client

I'd like some opinions on the practical implications of moving processing that would traditionally be done on the server to be handled instead by the client in a node.js web app.
Example case study:
The user uploads a CSV file containing a year's worth of their bank statement entries. We want to parse the file, categorise each entry and calculate cumulative values for each category, so that we can store the newly categorised statement in a db and display spending analysis to the user.
The entries are categorised by matching strings in the descriptions. There are many categories and many entries and it takes a fair amount of time to process.
In our node.js server, we can happily free up the event loop whilst waiting for network responses and so on, but if there is any data crunching or similar processing, the server will be blocked from responding to requests, and this seems unavoidable.
Traditionally, the CSV file would be passed to the server, the server would process, save in db, and send back the output of the processing.
It seems to make sense in our single threaded node.js server that this processing is handled by the browser, and the output displayed and sent to server to be stored. Of course the client will have to wait while this is done, but their processing will not be preventing the server from responding to requests from other clients.
I'm interested to see if anyone has had experience building apps using this model.
So, the question is.. are there any issues in getting browsers rather than the server to handle, wherever possible, any processing that will block the event loop? Is this a good/sensible/viable approach to node.js application development?
I don't think trusting client processed data is a good idea.
Instead you should look into creating a work queue that a separate process listens on, separating the CPU-intensive tasks from your node.js process handling HTTP requests.
My proposed data flow would be (sketched in code after the list):
1. HTTP upload request
2. App server (save the raw file somewhere the worker process can access)
3. Notification to the 'csv' work queue
4. Worker processes the uploaded csv file
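A sketch of that flow using a worker thread as the separate process (file names, routes and the CSV handling are all illustrative; a real deployment might use redis or a proper job queue instead):

// server.js - the HTTP process never crunches the CSV itself
const express = require('express');
const fs = require('fs');
const { Worker } = require('worker_threads');

const worker = new Worker('./csv-worker.js'); // long-lived worker
worker.on('message', ({ file, summary }) => {
  // step 4 finished: save `summary` in the db, notify the user, etc.
});

const app = express();
app.post('/statements', express.raw({ type: 'text/csv', limit: '10mb' }), (req, res) => {
  const file = `/tmp/${Date.now()}.csv`;
  fs.writeFileSync(file, req.body); // step 2: save where the worker can access it
  worker.postMessage({ file });     // step 3: notify the 'csv' work queue
  res.status(202).json({ status: 'processing' });
});
app.listen(3000);

// csv-worker.js - step 4: the CPU-heavy categorisation happens here
const { parentPort } = require('worker_threads');

parentPort.on('message', ({ file }) => {
  const rows = require('fs').readFileSync(file, 'utf8').split('\n');
  parentPort.postMessage({ file, summary: { rows: rows.length } });
});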
Although perfectly possible, simply shifting the processing to the client machine does not solve the basic problem.
Now the client's event loop is blocked, preventing the user from interacting with the browser. Browsers tend to detect this problem and stop execution of the page's script altogether. Something your users will certainly hate.
There is no way around either delegating or splitting up the work-load.
Using a second process (for example a 2nd node instance) for doing the number crunching server-side has the added benefit of allowing the operating system to use a 2nd CPU core. Ideally you run as many Node instances as you have CPU cores in the server and balance your work-load between them. Have a look at the diode module for some inspiration on how to implement multi-process communication in node.
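For the clustering variant, Node's built-in cluster module handles the balancing (a minimal sketch; cluster.isPrimary is Node 16+, older versions call it isMaster):

const cluster = require('cluster');
const os = require('os');

if (cluster.isPrimary) {
  // One worker per core; incoming connections get distributed among them.
  for (let i = 0; i < os.cpus().length; i++) cluster.fork();
} else {
  require('./server'); // each worker runs the same single-threaded server code
}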
