Synchronization in JavaScript & Socket.io

I have a Node.js server running the following code, which uses socket.io library :
var Common = require('./common');
var _ = require('lodash');

var store = (function() {
    var userMapByUserId = {};
    var tokenSet = {};

    var generateToken = function(userId) {
        if (_.has(userMapByUserId, userId)) {
            var token = '';
            do {
                token = Common.randomGenerator();
            } while (_.has(tokenSet, token));
            tokenSet[token] = true;
            if (userMapByUserId[userId].tokens === undefined) {
                userMapByUserId[userId].tokens = {};
            }
            userMapByUserId[userId].tokens[token] = true;
        }
    };

    var deleteToken = function(userId, tokenValue) {
        if (_.has(userMapByUserId, userId)) {
            if (userMapByUserId[userId].tokens !== undefined) {
                userMapByUserId[userId].tokens = _.omit(userMapByUserId[userId].tokens, tokenValue);
            }
            if (_.has(tokenSet, tokenValue)) {
                tokenSet = _.omit(tokenSet, tokenValue);
            }
        }
    };

    var getUserList = function() {
        return userMapByUserId;
    };

    return {
        generateToken: generateToken,
        deleteToken: deleteToken,
        getUserList: getUserList
    };
}());

module.exports = function(socket, io) {
    socket.on('generateToken', function(ownerUser) {
        store.generateToken(ownerUser.userId);
        io.sockets.emit('userList', {
            userMapByUserId: store.getUserList()
        });
    });

    socket.on('deleteToken', function(token) {
        store.deleteToken(token.userId, token.tokenValue);
        io.sockets.emit('userList', {
            userMapByUserId: store.getUserList()
        });
    });
};
So, basically we can have multiple clients sending requests to this server to add/remove tokens. Do I need to worry about synchronization and race conditions?

I don't see any concurrency issues in the code you've shown.
JavaScript in node.js is single-threaded and works via an event queue. One thread of execution runs until it completes, and then the JS engine fetches the next event waiting to run and runs that thread of execution until it completes. As such, there is no pre-emptive concurrency in JavaScript like you might have to worry about in a language or environment that uses threads.
There are still some places in node.js where you can get opportunities for concurrency issues. This can happen if you are using asynchronous operations in your code (like async IO to a socket or disk). In that case, your thread of execution runs to completion, but your async operation is still running and has not finished yet and has not called its callback yet. At this point, some other event can get processed and other code can run. So, if your async callback refers to state that could get changed by other event handlers, then you could have concurrency issues.
In the two socket event handlers you showed in your question, I don't see any asynchronous callbacks within the code running in the event handlers. If that's the case, then there are no opportunities for concurrency there. Even if there were concurrency opportunities, as long as the separate pieces of code weren't using/modifying the same underlying data, you would still be OK.
To give you an example of something that can cause a concurrency issue, I had a node.js app that was running on a Raspberry Pi. Every 10 seconds it would read a couple of digital temperature sensors and would record the data in an in-memory data structure. Every few hours, it would write the data out to disk. The process of writing the data out to disk was all done with a series of async disk IO operations. Each async disk IO operation has a callback and the code continued only within that callback (technically I was using promises to manage the callbacks, but it's the same concept). This created a concurrency issue because if the timer that records new temperatures fired while I was in the middle of one of these async disk operations, then the data structure could get changed in the middle of writing it to disk, which could cause me problems.
It wasn't that the async operation would get interrupted in the middle of its disk write (node.js is not pre-emptive in that way), but because the whole operation of writing to disk consisted of many separate async file writes, other events could get processed between the separate async file writes. If those events could mess with your data, then it could create a problem.
My solution was to set a flag when I started writing the data to disk (and clear the flag when done); if something needed to be added to the data structure while that flag was set, the new data would go into a queue and be processed later, so the core data structure was not changed while I was writing it to disk. I was able to implement this entirely within a few methods used for modifying the data, so the calling code didn't even need to know anything different was going on. A sketch of the idea follows below.
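A minimal sketch of that flag-plus-queue idea (the names here, including writeAllChunks, are hypothetical stand-ins, not the actual code from my app):

// Hypothetical sketch of the flag + deferred-queue approach described above.
const readings = [];        // the core in-memory data structure
let writingToDisk = false;  // set while the multi-step async write is running
let pending = [];           // readings that arrive mid-write are parked here

function addReading(reading) {
    if (writingToDisk) {
        pending.push(reading);       // don't touch the core structure mid-write
    } else {
        readings.push(reading);
    }
}

async function flushToDisk() {
    writingToDisk = true;
    try {
        // writeAllChunks stands in for the series of separate async file writes
        await writeAllChunks(readings);
    } finally {
        writingToDisk = false;
        readings.push(...pending);   // apply whatever was deferred
        pending = [];
    }
}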
Though this answer was written for Ajax in a browser, the event queue concept is the same in node.js so you may find it helpful. Note that node.js is called "event IO for V8 Javascript" for a reason (because it runs via an event queue).

Related

avoiding simultaneous execution of shell command with node.js & shelljs

Using nodejs 8.12 on Gnu/Linux CentOS 7. Using the built-in web server, require('https'), for a simple application.
I understand that nodejs is single threaded (single process) and there is no actual parallel execution of code. Based on my understanding, I think the http/https server will process one http request, run the handler through all synchronous statements, and set up asynchronous statements to be executed later, before it returns to process a subsequent request. However, with the http/https libraries, you have asynchronous code that is used to assemble the body of the request. So, we already have one callback which is executed when the body is ready ('end' event). This fact makes me think it might be possible to be in the middle of processing two or more requests simultaneously.
As part of handling the requests, I need to execute a string of shell commands and I use the shelljs.exec library to do that. It runs synchronously, waiting until complete before returning. So, example code would look like:
const shelljs_exec = require('shelljs.exec');

function process() {
    // bunch of shell commands in string
    var command_str = 'command1; command2; command3';
    var exec_results = shelljs_exec(command_str);
    console.log('just executed shelljs_exec command');
    var proc_results = process_results(exec_results);
    console.log(proc_results);
    // and return the response...
}
So node.js runs the shelljs_exec() and waits for completion. While it's waiting, can another request be worked on, such that there is a risk, however slight, of two or more shelljs.exec invocations running at the same time? Since that could be a problem, I need to ensure only one shelljs.exec statement can be in progress at a given time.
If that is not a correct understanding, then I was thinking I need to do something with mutex locks. Like this:
const shelljs_exec = require('shelljs.exec');
const locks = require('locks');

// Need this in global scope - so we are all dealing with the same one.
var mutex = locks.createMutex();

function await_lock(shell_commands) {
    var commands = shell_commands;

    return new Promise(getting_lock => {
        mutex.lock(got_lock_and_execute);
    });

    function got_lock_and_execute() {
        var exec_results = shelljs_exec(commands);
        console.log('just executed shelljs_exec command');
        mutex.unlock();
        return exec_results;
    }
}
async function process() {
    // bunch of shell commands in string
    var command_str = 'command1; command2; command3';
    exec_results = await await_lock(command_str);
    var proc_results = process_results(exec_results);
    console.log(proc_results);
}
If shelljs_exec is synchronous, there is no need for the lock at all.
If it is not synchronous and it takes a callback, wrap it in a Promise constructor so that it can be awaited. I would also suggest properly wrapping the mutex.lock in a promise that gets resolved when the lock is acquired. The try/finally is needed to ensure that the mutex is unlocked if shelljs_exec throws an exception.
async function await_lock(shell_commands) {
    await new Promise(function(resolve, reject) {
        mutex.lock(resolve);
    });
    try {
        let exec_results = await shelljs_exec(shell_commands);
        return exec_results;
    } finally {
        mutex.unlock();
    }
}
Untested. But it looks like it should work.
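As an aside, if the exec function does take a Node-style callback, the Promise wrapping mentioned above could look roughly like this sketch (some_callback_exec is a hypothetical callback-style variant, not a real shelljs API):

// Sketch: wrap a hypothetical callback-style exec so it can be awaited.
function execAsync(commands) {
    return new Promise(function(resolve, reject) {
        some_callback_exec(commands, function(err, results) {
            if (err) reject(err);
            else resolve(results);
        });
    });
}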

Keep order in nodejs command line script

There is something I want to code in nodejs, but I don't have any idea of how to implement it. I've been reading and searching a lot, and still have no idea of what would be the correct way to do it.
The problem is the following:
Read lines from stdin
For each line, launch an http request
There must be a limit on simultaneous http requests
Write the line read, plus some data obtained from the http request, to stdout
Lines must be written in order
You can not read "all" the file and then split lines: you must process one line at a time, remember it's stdin. You don't know when the input will end.
Does anybody have some clues of how to approach this problem? I do not have any idea of how to proceed.
You could do something like this:
const http = require('http');
const Promise = require('bluebird');

let arrayOfRequests = [];

process.stdin.on('data', (data) => {
    // get data from the stdin
    // then add the request to the array
    arrayOfRequests.push(http.get({}));
});

process.stdin.on('end', () => {
    Promise.all(arrayOfRequests)
        // use Promise.all to bundle all of the requests
        // then use the spread operator so you can use all of the requests in order
        .spread((request1, request2, request3) => {
            // do stuff
        });
});
FYI, the snippet won't work.
So what you are doing is using the process.stdin that is built into Node.js. Then you are bundling all of the requests. Whenever the user cancels out of the program, your requests will be made. Since the calls will be async, you have them in an array, then run Promise.all and use the bluebird .spread operator to deconstruct the Promise.all result and get the values.
So far, I've got this solution for the producer-consumer problem in nodejs, where the producer doesn't produce more data until there is space available in the queue.
This is queue's code, based on block-queue: https://gist.github.com/AlvaroMaceda/8a933a317fed3fd4691fd5eda3e60c5e
To use the blocking queue, you create it with 3 parameters:
Number of tasks running concurrently
"Push" function. It will be called with the queue as parameter when more data is needed. The task will be added with an identifier.
"Task" function. It will be called with the identifier created by the "Push" function.
The queue will call "push" only when more data is needed. For example, if there are five tasks running and it was created with a maximum of 5 concurrent tasks, "push" won't be called until one of these tasks ends.
This is an example of how to use it:
"use strict";
const queue = require('./block-queue');
const CONCURRENCY = 5;
const WORKS_TO_LAUNCH = 10;
const TIMEOUT = 200;
let i = 1;
let q = queue(CONCURRENCY, doSomethingAsync, putMoreData );
function putMoreData(queue) {
if (++i <= WORKS_TO_LAUNCH) {
console.log('Pushing to queue');
queue.push(i);
}
}
function doSomethingAsync(task, done) {
setTimeout(function() {
console.log('done ' + task);
done();
}, 1000 + task * TIMEOUT);
}
q.push(i);
I'm not marking this as solved because I don't know whether there is a simpler approach and I want to work out the complete solution; I also don't know if I'll hit issues when combining this with streams.
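For comparison, here is a rough sketch of the same idea without the block-queue helper: read lines with readline, cap the number of in-flight requests by pausing stdin, and print results strictly in input order by buffering them until their turn comes. fetchData is a made-up stand-in for the real HTTP call, and the results array grows with the input, so treat this as a sketch rather than a finished solution.

'use strict';
const readline = require('readline');

const CONCURRENCY = 5;
let inFlight = 0;
let nextToPrint = 0;
const results = [];   // one slot per input line, in arrival order

// Hypothetical stand-in for the real per-line HTTP request.
function fetchData(line) {
    return new Promise(resolve =>
        setTimeout(() => resolve(line.length), Math.random() * 200));
}

const rl = readline.createInterface({ input: process.stdin });

rl.on('line', line => {
    const slot = results.push({ done: false, output: null }) - 1;
    inFlight++;
    if (inFlight >= CONCURRENCY) rl.pause();   // back-pressure: stop reading stdin

    fetchData(line).then(data => {
        results[slot] = { done: true, output: line + ' ' + data };
        inFlight--;
        if (inFlight < CONCURRENCY) rl.resume();
        // flush everything that is ready, in input order
        while (nextToPrint < results.length && results[nextToPrint].done) {
            console.log(results[nextToPrint].output);
            nextToPrint++;
        }
    });
});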

fs.readFile is very slow, am I making too many request?

node.js beginner here:
A node.js application scrapes an array of links (linkArray) from a list of ~30 urls.
Each domain/url has a corresponding (name).json file that is used to check whether the scraped links are new or not.
All pages are fetched, links are scraped into arrays, and then passed to:
function checkLinks(linkArray, name) {
    console.log(name, "checkLinks");
    fs.readFile('json/' + name + '.json', 'utf8', function readFileCallback(err, data) {
        if (err && err.errno != -4058) throw err;
        if (err && err.errno == -4058) {
            console.log(name + '.json', " is NEW .json");
            compareAndAdd(linkArray, {linkArray: []}.linkArray, name);
        }
        else {
            //file EXISTS
            compareAndAdd(linkArray, JSON.parse(data).linkArray, name);
        }
    });
}
compareAndAdd() reads:
function compareAndAdd(arrNew, arrOld, name) {
    console.log(name, "compareAndAdd()");
    if (!arrOld) var arrOld = [];
    if (!arrNew) var arrNew = [];

    //compare and remove dups
    function hasDup(value) {
        for (var i = 0; i < arrOld.length; i++)
            if (value.href == arrOld[i].href)
                if (value.text.length <= arrOld[i].text.length) return false;
        arrOld.push(value);
        return true;
    }
    var rArr = arrNew.filter(hasDup);

    //update existing array;
    if (rArr.length > 0) {
        fs.writeFile('json/' + name + '.json', JSON.stringify({linkArray: arrOld}), function (err) {
            if (err) return console.log(err);
            console.log(" " + name + '.json UPDATED');
        });
    }
    else console.log(" " + name, "no changes, nothing to update");
    return;
}
checkLinks() is where the program hangs; it's unbelievably slow. I understand that fs.readFile is being hit multiple times a second, but in my opinion fewer than 30 hits seems pretty trivial, assuming this is a function meant to be used to serve data to (potentially) millions of users. Am I expecting too much from fs.readFile, or (more likely) is there another component (like writeFile, or something else entirely) that's locking everything up?
supplemental:
using write/readFileSync creates a lot of problems: this program is inherently async because it begins with requests to external websites with widely varying response times, and reads/writes would frequently collide. The functions above ensure that writing for a given file only happens after it has been read (though it is very slow).
Also, this program does not exit on its own, and I do not know why.
edit
I've reworked the program to read first, then write synchronously last, and the process is down to ~12 seconds. Apparently fs.readFile was getting hung up when called multiple times. I don't understand when/how to use asynchronous fs if multiple calls hang the function.
All async fs operations are executed inside the libuv thread pool, which has a default size of 4 (can be changed by setting the UV_THREADPOOL_SIZE environment variable to something different). If all threads in the thread pool are busy, any fs operations will be queued up.
I should also point out that fs is not the only module that uses the thread pool, dns.lookup() (the default hostname resolution method used internally by node), async zlib methods, crypto.randomBytes(), and a couple of other things IIRC also use the libuv thread pool. This is just something to keep in mind.
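If the thread pool turns out to be the bottleneck, its size can be raised; setting the variable in the shell is the safe route, and setting it at the very top of the entry script generally also works as long as nothing has touched the pool yet (the value here is just an example):

// In the shell:  UV_THREADPOOL_SIZE=16 node app.js
// Or, at the very top of the entry script, before any fs/dns/crypto work:
process.env.UV_THREADPOOL_SIZE = '16';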
If you read many files (checkLinks) in a loop, first ALL of the fs.readFile functions will be called, and only AFTER that will the callbacks be processed (callbacks are processed only once the main function stack is empty). This can lead to a significant starting delay. But don't worry about that.
You mention that the program never exits. So, make a counter, count the calls to checkLinks, and decrease the counter after each callback function is called. Inside the callback, check the counter against 0 and then run the finalizing logic (I suspect this could be a response to the http request), roughly as sketched below.
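A rough sketch of that counter idea, where finalize() is a hypothetical placeholder for whatever should run once everything has finished (sending the response, process.exit(), and so on):

const fs = require('fs');

let pendingChecks = 0;   // how many checkLinks callbacks are still outstanding

function checkLinks(linkArray, name) {
    pendingChecks++;
    fs.readFile('json/' + name + '.json', 'utf8', function (err, data) {
        // ... existing compareAndAdd logic goes here ...
        pendingChecks--;
        if (pendingChecks === 0) {
            finalize();   // hypothetical: run once all reads have completed
        }
    });
}

If compareAndAdd also writes asynchronously, the decrement would need to move into the write callback instead, so the counter only reaches 0 after the last write completes.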
Actually, it doesn't matter much whether you use the async version or the sync one; they will take roughly the same time here.

Can I miss web socket events if I reassign ws.onmessage?

I am receiving data from a rest api via the function callRestApi, below. However, I receive updates to the rest data via a websocket, and I want to ensure that I don't miss anything. Therefore I start buffering websocket events prior to calling the rest endpoint. Having received the rest response, I dispatch the buffered events and start dispatching anything new that is subsequently received, updating my copy of the data. But have I implemented this correctly? In particular, is there a risk that I could miss an event in the function startDispatchingEvents, where I assign ws.onmessage a new value? I am using a redux dispatcher.
export const startBufferingEvents = () => {
    eventBuffer = [];
    ws.onmessage = msg => {
        eventBuffer.push(msg);
    };
};

export const startDispatchingEvents = (dispatcher) => {
    eventBuffer.forEach(evt => dispatcher(evt));
    ws.onmessage = evt => dispatcher(evt);
};

startEventBuffering()
    .then(() => callRestApi())
    .then(restResponse => dispatch(action(restResponse)))
    .then(() => startDispatchingEvents((evt) => eventDispatcher(dispatch, evt)));
};
It shouldn't be a problem. All the code that is executed in JavaScript (assuming you're running this in a browser) is single-threaded. That means you cannot run into race conditions or other concurrency issues of that kind. When the callback that replaces the value of onmessage is executing, anything else that wants to use that handler, say an incoming websocket message, will wait (i.e. sit in the event queue) until the callback's execution finishes.
This might not be particularly related to the question, but if you are interested in learning a bit more about the event handling system of JS and the internals of the engine, I found this talk quite good: https://youtu.be/8aGhZQkoFbQ

Node.js sync vs. async

I'm currently learning node.js and I see two examples of the same program, one sync and one async.
I do understand the concept of a callback, but I'm trying to understand the benefit of the second (async) example, as it seems that the two of them are doing the exact same thing despite this difference...
Can you please detail the reason why the second example would be better?
I'd be happy to get an even wider explanation that would help me understand the concept.
Thank you!!
1st example:
var fs = require('fs');

function calculateByteSize() {
    var totalBytes = 0,
        i,
        filenames,
        stats;

    filenames = fs.readdirSync(".");
    for (i = 0; i < filenames.length; i++) {
        stats = fs.statSync("./" + filenames[i]);
        totalBytes += stats.size;
    }
    console.log(totalBytes);
}

calculateByteSize();
2nd example:
var fs = require('fs');

var count = 0,
    totalBytes = 0;

function calculateByteSize() {
    fs.readdir(".", function (err, filenames) {
        var i;
        count = filenames.length;
        for (i = 0; i < filenames.length; i++) {
            fs.stat("./" + filenames[i], function (err, stats) {
                totalBytes += stats.size;
                count--;
                if (count === 0) {
                    console.log(totalBytes);
                }
            });
        }
    });
}

calculateByteSize();
Your first example is all blocking I/O. In other words, you would need to wait until the readdir operation is complete before looping through each file. Then you would need to block (wait) for each individual sync stat operation to run before moving on to the next file. No code could run after the calculateByteSize() call until all operations were completed.
The async (second) example, on the other hand, is all non-blocking, using the callback pattern. Here, execution returns to just after the calculateByteSize() call as soon as fs.readdir is called (but before the callback is run). Once the readdir task is complete, it performs a callback to your anonymous function. Here it loops through each of the files and again makes non-blocking calls to fs.stat.
The second is more advantageous. If you pretend that calls to readdir or stat take anywhere from 250ms to 750ms to complete (this is probably not the case), you would be waiting on serial calls to your sync operations, whereas the async operations would not make you wait between each call. In other words, looping over the readdir files, you would need to wait for each stat operation to complete if you were doing it synchronously. If you did it asynchronously, you would not have to wait before making each fs.stat call.
In your first example, the node.js process, which is single-threaded, is blocked for the entire duration of your readdirSync, and can't do anything else except wait for the result to be returned. In the second example, the process can handle other tasks and the event loop will return it to the continuation of the callback when the result is available. So you can handle a much, much higher total throughput by using asynchronous code; the time spent waiting for the readdir in the first example is probably thousands of times as long as the time actually spent executing your code, so you're wasting 99.9% or more of your CPU time.
In your example the benefit of async programming is indeed not very visible. But suppose that your program needs to do other things as well. Remember that your JavaScript code runs in a single thread, so when you choose the synchronous implementation the program can't do anything else but wait for the IO operation to finish. When you use async programming, your program can do other important tasks while the IO operation runs in the background (outside the JavaScript thread).
Can you please detail the reason why would the second example be better? I'll be happy to get an ever wider explanation that would help me understand the concept..
It's all about concurrency for network servers (thus the name "node"). If this were in a build script, the first, synchronous example would be "better" in that it is more straightforward. And given a single disk, there might not be much actual benefit to making it asynchronous.
However, in a network service, the first, synchronous version would block the entire process and defeat node's main design principle. Performance would be slow as the number of concurrent clients increased. The second, asynchronous example, however, would perform relatively well: while waiting for the relatively slow filesystem to come back with results, it could handle all the relatively fast CPU operations concurrently. The async version should basically be able to saturate your filesystem, and however much your filesystem can deliver, node will be able to get it out to clients at that rate.
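On newer Node versions (10+), the same non-blocking behaviour can also be written with fs.promises and async/await, which reads almost like the synchronous example while still letting the process serve other clients while it waits. A sketch:

const fs = require('fs').promises;

async function calculateByteSize() {
    const filenames = await fs.readdir('.');
    // start all stat calls, then wait for them together rather than one by one
    const stats = await Promise.all(filenames.map(f => fs.stat('./' + f)));
    const totalBytes = stats.reduce((sum, s) => sum + s.size, 0);
    console.log(totalBytes);
}

calculateByteSize();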
Lots of good answers here, but be sure to also read the docs:
The synchronous versions will block the entire process until they complete--halting all connections.
There is a good overview of sync vs async in the documentation: http://nodejs.org/api/fs.html#fs_file_system
