Perhaps the underlying issue is how the node-kafka module I am using has implemented things, but perhaps not, so here we go...
Using the node-kafka library, I am facing an issue with subscribing to consumer.on('message') events. The library uses the standard events module, so I think this question might be generic enough.
My actual code structure is large and complicated, so here is a pseudo-example of the basic layout to highlight my problem. (Note: This code snippet is untested so I might have errors here, but the syntax is not in question here anyway)
var messageCount = 0;
var queryCount = 0;

// Getting messages via some event emitter
consumer.on('message', function(message) {
    messageCount++;
    console.log('Message #' + messageCount);

    // Making a database call for each message
    mysql.query('SELECT "test" AS testQuery', function(err, rows, fields) {
        queryCount++;
        console.log('Query #' + queryCount);
    });
});
What I am seeing here is when I start my server, there are 100,000 or so backlogged messages that kafka will want to give me and it does so through the event emitter. So I start to get messages. To get and log all the messages takes about 15 seconds.
This is what I would expect to see for an output assuming the mysql query is reasonably fast:
Message #1
Message #2
Message #3
...
Message #500
Query #1
Message #501
Message #502
Query #2
... and so on in some intermingled fashion
I would expect this because my first mysql result should be ready very quickly and I would expect the result(s) to take their turn in the event loop to have the response processed. What I am actually getting is:
Message #1
Message #2
...
Message #100000
Query #1
Query #2
...
Query #100000
I am getting every single message before a mysql response is able to be processed. So my question is, why? Why am I not able to get a single database result until all the message events are complete?
Another note: I set a breakpoint at .emit('message') in node-kafka and at mysql.query() in my code, and I hit them alternately. So it appears that the 100,000 emits are not all stacking up before my event subscriber runs. So there went my first hypothesis on the problem.
Ideas and knowledge would be very appreciated :)
The node-kafka driver uses quite a liberal buffer size (1M), which means that it will fetch as many messages from Kafka as will fit in the buffer. If the server is backlogged, and depending on the message size, this may mean (tens of) thousands of messages coming in with one request.
Because EventEmitter is synchronous (it doesn't use the Node event loop), this means that the driver will emit (tens of) thousands of events to its listeners, and since it's synchronous, it won't yield to the Node event loop until all messages have been delivered.
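The ordering can be reproduced without Kafka or MySQL. The following is a minimal sketch using Node's built-in events module; setImmediate is only an assumed stand-in for an asynchronous database call, and the names consumer, log and logAfterBatch are made up for the illustration.

```javascript
const { EventEmitter } = require('events');

const consumer = new EventEmitter();
const log = [];

consumer.on('message', (n) => {
  log.push('Message #' + n);
  // Stand-in for an asynchronous database call: this callback can only run
  // after the current synchronous code has returned to the event loop.
  setImmediate(() => log.push('Query #' + n));
});

// Models the driver delivering a whole fetched batch in one synchronous loop.
for (let n = 1; n <= 3; n++) {
  consumer.emit('message', n);
}

// Snapshot taken before any asynchronous callback has had a chance to run.
const logAfterBatch = log.slice();

// Print the full log once the queued callbacks have run.
setImmediate(() => console.log(log.join('\n')));
```

At the point where logAfterBatch is taken, it holds only the three "Message" entries; the "Query" entries are appended later, once the loop has finished and control has returned to the event loop.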
I don't think you can work around the flood of event deliveries, but I also don't think the event delivery itself is the problem. The more likely problem is starting an asynchronous operation (in this case a MySQL query) for each event, which may flood the database with queries.
A possible workaround would be to use a queue instead of performing the queries directly from the event handlers. For instance, with async.queue you can limit the number of concurrent (asynchronous) tasks. The "worker" part of the queue would perform the MySQL query, and in the event handlers you'd merely push the message onto the queue.
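To illustrate the idea, here is a rough sketch of a queue with limited concurrency. It is a hand-rolled stand-in written for this example, similar in spirit to async.queue but not its implementation; createQueue, fakeQuery and stats are hypothetical names, and fakeQuery merely simulates a query that completes on a later tick.

```javascript
// Hypothetical helper: a tiny work queue that runs at most `concurrency`
// workers at a time. Tasks beyond that limit wait in `pending`.
function createQueue(worker, concurrency) {
  const pending = [];
  let running = 0;

  function next() {
    while (running < concurrency && pending.length > 0) {
      const task = pending.shift();
      running++;
      worker(task, () => {
        running--;
        next(); // a slot is free again, start the next waiting task
      });
    }
  }

  return {
    push(task) {
      pending.push(task);
      next();
    },
    stats() {
      return { running, pending: pending.length };
    },
  };
}

// Stand-in for mysql.query(): completes asynchronously on a later tick.
function fakeQuery(message, done) {
  setImmediate(done);
}

const queue = createQueue(fakeQuery, 2);

// The event handler would only enqueue; it never starts a query directly.
for (let n = 1; n <= 5; n++) {
  queue.push({ id: n });
}

// Right after the burst, only two "queries" are in flight; three are waiting.
const statsAfterBurst = queue.stats();
```

With this shape, a burst of 100,000 messages still arrives in one go, but the database only ever sees as many simultaneous queries as the concurrency limit allows.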
I am new to Node JS and am trying to understand the concurrent / asynchronous execution models of Node.
So far, I do understand that whenever an asynchronous task is encountered in Node, that task runs in the background (e.g., an asynchronous setTimeout function will start timing) and control is then sent back to other tasks that are on the call stack. Once the timer times out, the callback that was passed to the asynchronous task is pushed onto the callback queue, and once the call stack is empty, that callback gets executed. I took the help of this visualization to understand the sequence of task execution. So far so good.
Q1. Now, I am not able to wrap my head around the paradigm of event listeners and event emitters and would appreciate it if someone could explain how event emitters and listeners fall into the picture of call stack, event loops and callback queues.
Q2. I have the following code that reads data from the serial port of a raspberry pi.
const SerialPort = require('serialport');

const port = new SerialPort('/dev/ttyUSB0', {baudRate: 9600}, (err) => {
    if (err) {
        console.log("Port Open Error: ", err);
    }
});

port.on('data', (data) => {
    console.log(data.toString());
});
As can be seen from the example, to read data from the serial port, an 'event-listener' has been employed. From what I understand, whenever data comes to the port, a 'data' event is emitted which is 'responded to' or rather listened to by the listener, which just prints the data onto the console.
When I run the above program, it runs continuously, with no break, printing the data onto the console whenever data arrives at the serial port. There are no continuously running while loops scanning the serial port, as would be expected in a synchronous program. So my question is, why is this program running continuously? It is obvious that the event emitter is running continuously, generating an event whenever data comes, and the event listener is also running continuously, printing the data whenever a 'data' event is emitted. But WHERE are these things actually running, that too, continuously? How are these things fitting into the whole picture of the call/execution stack, the event loop and the callback queue?
Thanks
Q1. Now, I am not able to wrap my head around the paradigm of event listeners and event emitters and would appreciate it if someone could explain how event emitters and listeners fall into the picture of call stack, event loops and callback queues.
Event emitters on their own have nothing to do with the event loop. Event listeners are called synchronously whenever someone emits an event. When some code calls someEmitter.emit(...), all listeners for that event are called synchronously, one after another, before .emit() returns. These are just plain old function calls. You can look in the EventEmitter code yourself to see a for loop that calls each listener associated with a given event in turn.
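As a rough illustration of that loop, here is a simplified emitter. It is a toy model written for this explanation, not Node's actual EventEmitter source (which also handles once(), 'error' events, listener removal and more); TinyEmitter and calls are hypothetical names.

```javascript
// Simplified model of an emitter, for illustration only.
class TinyEmitter {
  constructor() {
    this.listeners = new Map(); // event name -> array of listener functions
  }

  on(eventName, listener) {
    if (!this.listeners.has(eventName)) {
      this.listeners.set(eventName, []);
    }
    this.listeners.get(eventName).push(listener);
    return this;
  }

  emit(eventName, ...args) {
    const registered = this.listeners.get(eventName) || [];
    // Plain function calls, one after another, on the caller's own stack.
    for (const listener of registered) {
      listener(...args);
    }
    return registered.length > 0;
  }
}

const emitter = new TinyEmitter();
const calls = [];
emitter.on('data', (chunk) => calls.push('first:' + chunk));
emitter.on('data', (chunk) => calls.push('second:' + chunk));

calls.push('before emit');
emitter.emit('data', 'abc'); // both listeners run before this line returns
calls.push('after emit');
```

Nothing here touches the event loop: by the time emit() returns, both listeners have already run, in registration order.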
Q2. I have the following code that reads data from the serial port of a raspberry pi.
The data event in your code is an asynchronous event. That means that it will be triggered one or more times at an unknown time in the future. Some lower level code will be registered for some sort of I/O event. If that code is native code, then it will insert a callback into the node.js event queue. When node.js is done running other code, it will grab the next event from the event queue. When it gets to the event associated with data being available on the serial port, it will call port.emit(...) and that will synchronously trigger each of the listeners for the data event to be called.
When I run the above program, it runs continuously, with no break, printing the data onto the console whenever data arrives at the serial port. There are no continuously running while loops scanning the serial port, as would be expected in a synchronous program. So my question is, why is this program running continuously?
This is the event-driven nature of node.js in a nutshell. You register an interest in certain events. Lower level code sees that incoming data has arrived and triggers those events, thus calling your listeners.
This is how the Javascript interpreter manages the event loop: run the current piece of Javascript until it's done, then check whether there are any more events in the event queue. If so, grab the next event and run it. If not, wait until there is an event in the event queue and then run it.
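The run-to-completion behaviour can be modelled in a few lines. This is only a toy model under simplifying assumptions: the real loop is implemented in libuv, has several phases, and sleeps while waiting for the operating system to report new events rather than exiting when the queue is empty. The names eventQueue, enqueue, runLoop and trace are made up for the sketch.

```javascript
// Toy model of an event queue; not how node.js is actually implemented.
const eventQueue = [];
const trace = [];

function enqueue(callback) {
  eventQueue.push(callback);
}

function runLoop() {
  while (eventQueue.length > 0) {
    const callback = eventQueue.shift();
    callback(); // runs to completion; nothing else can interrupt it
  }
}

enqueue(() => {
  trace.push('start A');
  // Work queued while A is running has to wait until A (and B) are done.
  enqueue(() => trace.push('C (queued by A)'));
  trace.push('end A');
});
enqueue(() => trace.push('B'));

runLoop();
```

The resulting trace is start A, end A, B, C (queued by A): each callback finishes before the next one is taken from the queue.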
It is obvious that the event emitter is running continuously, generating an event whenever data comes, and the event listener is also running continuously, printing the data whenever a 'data' event is emitted. But WHERE are these things actually running, that too, continuously?
The event emitter itself is not running continuously. It's just a notification scheme (essentially a publish/subscribe model) where one party can register an interest in certain events with .on() and another party can trigger certain events with .emit(). It allows very loose coupling through a generic interface. Nothing is running continuously in the emitter system. Someone triggers an event with .emit() and the emitter looks in its data structures to see who has registered an interest in that event and calls them. It knows nothing about the event or the data itself or how it was triggered. The emitter's job is just to deliver notifications to those who expressed an interest.
We've described so far how the Javascript side of things works. It runs the event loop as described above. At a lower level, there is serial port code that interfaces directly with the serial port and this is likely some native code. If the OS supports a native asynchronous interface for the serial port, then the native code would use that and tell the OS to call it when there's data waiting on the serial port. If there is not a native asynchronous interface for the serial port data in the OS, then there's probably a native thread in the native code that interfaces with the serial port that handles getting data from the port, either polling for it or using some other mechanism built into the hardware to tell you when data is available. The exact details of how that works would be built into the serial port module you're using.
How are these things fitting into the whole picture of the call/execution stack, the event loop and the callback queue?
The call/execution stack comes into play the moment an event in the Javascript event queue is found by the interpreter and it starts to execute it. Executing that event will always start with a Javascript callback. The interpreter will call that callback (putting a return address on the call/execution stack). That callback will run until it returns. When it returns, the call/execution stack will be empty. The interpreter will then check to see if there's another event waiting in the event queue. If so, it will run that one.
FYI, if you want to examine the code for the serial port module it appears you are using, it's all there on Github. It does appear to have a number of native code files. You can see a file called poller.cpp here and it appears to do cooperative polling using the node.js add-on programming interface offered by libuv. For example, it creates a uv_poll_t which is a poll handle described here. Here's an excerpt from that doc:
Poll handles are used to watch file descriptors for readability, writability and disconnection similar to the purpose of poll(2).
The purpose of poll handles is to enable integrating external libraries that rely on the event loop to signal it about the socket status changes, like c-ares or libssh2. Using uv_poll_t for any other purpose is not recommended; uv_tcp_t, uv_udp_t, etc. provide an implementation that is faster and more scalable than what can be achieved with uv_poll_t, especially on Windows.
It is possible that poll handles occasionally signal that a file descriptor is readable or writable even when it isn’t. The user should therefore always be prepared to handle EAGAIN or equivalent when it attempts to read from or write to the fd.
MongoDB operations are getting starved in a RabbitMQ consumer.
rabbitConn.createChannel(function(err, channel) {
    channel.consume(q.queue, async function(msg) {
        // The consumer listens to messages on queue A, say, based on a binding key.
        await Conversations.findOneAndUpdate(
            {'_id': 'someID'},
            {'$push': {'messages': {'body': 'message body'}}}, function(error, count) {
                // Passing a callback so that the query is executed immediately, as mentioned in the
                // mongoose document http://mongoosejs.com/docs/api.html#model_Model.findOneAndUpdate
            });
    });
});
The problem is that if a large number of messages are being read, the Mongo operations are starved and only execute once the queue has no more messages. So if there are 1000 messages in the queue, all 1000 messages are read first, and only then is the Mongo operation called.
Would running the workers in a different Node.js process work?
Ans: Tried this, decoupling the workers from the main thread. It does not help.
I have also written a load balancer with 10 workers, but that does not seem to help. Is the event loop not prioritizing the Mongo operations?
Ans: Does not help either. The 10 workers read from the queue and only execute findOneAndUpdate once there is nothing more to read from the queue.
Any help would be appreciated.
Thank you
Based on the description of the problem, I think you have a case of no message queuing happening. This can happen when you have a bunch of messages sitting in the queue, then subscribe a consumer with auto-ack set to true and no prefetch count. This answer describes in a bit more detail what happens in this case.
If I had to guess, I'd say the javascript code is spending all of its allocated cycles downloading messages from the broker rather than processing them into Mongo. Adding a prefetch count, while simultaneously disabling auto-ack may solve your issue.
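A sketch of what that could look like, assuming an amqplib channel like the one in the question. prefetch(), consume() with the noAck option, ack() and nack() are amqplib channel methods; consumeWithPrefetch and handleMessage are hypothetical names, and handleMessage is assumed to return a promise (for example, one wrapping the Mongo update).

```javascript
// Sketch: consume with a prefetch limit and manual acknowledgements.
// `channel` is assumed to be an amqplib channel; `handleMessage` is a
// hypothetical function that processes one message and returns a promise.
function consumeWithPrefetch(channel, queueName, handleMessage, prefetchCount) {
  // Allow at most `prefetchCount` unacknowledged messages at a time.
  channel.prefetch(prefetchCount);

  channel.consume(queueName, (msg) => {
    if (msg === null) {
      return; // the consumer was cancelled by the broker
    }
    handleMessage(msg).then(
      () => channel.ack(msg),  // done: the broker may deliver another message
      () => channel.nack(msg)  // failed: by default the message is requeued
    );
  }, { noAck: false }); // noAck: false turns auto-ack off
}
```

It would be called along the lines of consumeWithPrefetch(channel, q.queue, saveToMongo, 10), where saveToMongo is a hypothetical function and 10 is an arbitrary starting value to tune. Because the broker stops sending once 10 messages are unacknowledged, the Mongo callbacks get a chance to run between deliveries.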
I have a web worker in which I'm trying to run as little asynchronous code as possible.
I would like to have a while loop in my web worker while still allowing messages to be processed. Is there a way to manually update the event system in the browser? Or at least update the web worker's messages?
There appears to be something like this in Node.js (process._tickDomainCallback()) but so far I haven't found anything for web.
Using a setTimeout is not an option. I would like either a solution or a definitive answer that this is simply not possible.
// worker.js
self.onmessage = function(e) {
    console.log("Receive Message");
};

while (true) {
    UpdateMessages(); // Receive and handle incoming messages
    // Do other stuff
}
Not sure what your context is here. I had an issue where the browser-side Web Audio API event loop was getting interrupted at inopportune moments by WebSocket traffic coming in from my nodejs server, so I added a WebWorker middle layer to free the browser event loop from ever being interrupted.
The WebWorker handled all network traffic and then populated a circular queue with data. When the browser event loop deemed itself available, it plucked data from this shared queue (using a Transferable Object buffer), and so it was never interrupted, since it was the one initiating calls to the WebWorker.
I feel your pain; however, this approach kept the event loop happy. Since a WebWorker is effectively on its own thread, and from the browser side is doing async work, there is no need to minimize async code.
I'm using the dead letter exchange feature of RabbitMQ to perform scheduled RPC calls, but after a message is dead-lettered, the replyTo property that was set on the original message is dropped. Is there any way to declare the replyTo property so that it is retained on the dead-lettered message?
I'm doing this with amqplib in node.js by the way.
Unfortunately, RabbitMQ will only preserve the properties that are listed on the Dead Letter Exchange page:
queue - the name of the queue the message was in before it was dead-lettered,
reason - reason for DLX being used,
time - the date and time the message was dead lettered as a 64-bit AMQP format timestamp,
exchange - the exchange the message was published to,
routing-keys - the routing keys the message was published with,
count - how many times this message was dead-lettered in this queue for this reason, and
original-expiration - the original expiration property of the message.
There are 2 ways to solve the problem you're seeing, I think.
1) Put the reply-to in your own header or property field, and read it from there / replace it when it's not in the usual spot
2) Don't use the reply-to field. Instead, use a well-known exchange for the reply at a later point in time.
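For option 1, a small sketch with amqplib-style message options follows. The header name x-original-reply-to and the helpers withReplyTo and resolveReplyTo are made up for this example; replyTo and headers are regular amqplib publish options, and msg.properties is where amqplib exposes them on a received message.

```javascript
// Hypothetical header name; any application-level name would do.
const REPLY_TO_HEADER = 'x-original-reply-to';

// Build publish options that carry the reply queue both in the standard
// property and in a custom header.
function withReplyTo(replyQueue, options = {}) {
  return {
    ...options,
    replyTo: replyQueue,
    headers: { ...(options.headers || {}), [REPLY_TO_HEADER]: replyQueue },
  };
}

// On the consuming side, prefer the standard property and fall back to the
// custom header when the property is missing.
function resolveReplyTo(msg) {
  const { replyTo, headers } = msg.properties;
  return replyTo || (headers && headers[REPLY_TO_HEADER]) || null;
}
```

A publisher would then call something like channel.sendToQueue(delayQueue, content, withReplyTo(replyQueue, { expiration: '5000' })), where delayQueue and replyQueue are placeholders, and the consumer behind the dead letter exchange would use resolveReplyTo(msg) to decide where to send the response.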
Using the reply-to field typically implies a request/response or RPC scenario. These scenarios usually need a response fairly quickly. If a response does not come quickly, the system can usually move forward without it - even if it's just a message to the user saying "X is not available right now".
You say you're using a DLX to do scheduled RPC calls... delayed messages are a common use case for a DLX - nothing wrong with that. But delaying the RPC response can run into some significant challenges beyond what you're already seeing.
For example, what happens when your system has a hiccup and the code that made the original request is no longer there to listen for the response? The answer to this depends on whether or not you really need the response to be handled. If you do need it to be handled - the system will run into serious trouble if it isn't - then RPC can be dangerous.
Instead of relying on RPC and implying a temporal need for a given response, it's often better to use two-way messaging via separate queues. I've written about this in both my managing long running processes post and in my RabbitMQ Patterns email course / ebook.
The gist of it is that you can avoid the need for a reply-to queue by having the original message publisher also be a subscriber with a queue for the responses.
An example from the long running process post:
var DrinkRequestSender = new Sender(/* ... details ... */);
var DrinkRequestReceiver = new Receiver(/* ... details ... */);

var DrinkStation = {
    make: function(drink){
        DrinkRequestReceiver.receive((response) => {
            var drinkResponse = response.body;
            this.trigger("drinkup", drinkResponse);
        });

        var drinkData = drink.toJSON();
        DrinkRequestSender.send(drinkData);
    }
};
In this example, the code is sending out a "request" and later receiving a "response" - but not using a standard RPC setup. It is using a dedicated queue for the response, with the code on the other end sending the reply back via an exchange that routes to that queue.
This allows you to better handle failure scenarios, very long running processes and more.
This style of 2-way messaging does add some additional challenges, though. For one, you'll have to build in the ability to reconstruct the object that made the original request.
You can find this detailed in the long running process post, and there's a bit more info in RMQ Patterns, as well (along with a lot of other patterns).
Hope that helps!
I have developed a javascript chat (php on the backend) using:
1) long-polling to get new messages for the receiver
2) sessionStorage to store the counter of messages
3) setInterval to read new messages and if sessionStorageCounter < setIntervalCounter then the last message is shown to receiver.
4) javascript to create, update and write the chat dialogues
The module is working fine, but when users chat quickly the receiver's front end shows the same message two or three times (the counter does not fail, and the query does not produce duplicate inserts).
The code seems to be correct (that's why I don't provide the code), so the interval delay might be the reason (though reducing the interval delay changes nothing).
Do you think that the above scheme is bad practice, and which scheme do you think would eliminate the errors?
My approach, if solving it myself (as opposed to using an existing library that already handles this) would be:
Have the server assign a unique ID (GUID) to each message as it arrives.
On the clients, store the ID of the most recently received message.
When polling for new messages, do so with the ID of the last message successfully received. Server then responds by finding that message in its own queue and replaying all of the subsequent messages.
To guard against 'dropped' messages, each message can also carry the ID of the immediately-previous message (allowing the client to do consistency-checking).
If repolling does cause duplicates to be delivered from server to client, the presence of unique IDs on each message makes eliminating them trivial. Think of the server-side message queue as an event stream, with each client tracking their last-read position. The client makes no guesses about the appropriate order of messages, how many there are, etc - because its state consists entirely of 'what have I seen', there are few opportunities to get out of sync.
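The scheme above can be sketched as follows. This is an in-memory illustration only, with no HTTP or long-polling layer; MessageLog, ChatClient, since and receive are hypothetical names, and crypto.randomUUID() assumes a reasonably recent Node.js version.

```javascript
const crypto = require('crypto');

// Server side: an append-only log in which every message gets a unique ID
// and remembers the ID of the message before it.
class MessageLog {
  constructor() {
    this.messages = [];
  }

  append(text) {
    const previous = this.messages[this.messages.length - 1];
    const message = {
      id: crypto.randomUUID(),
      previousId: previous ? previous.id : null,
      text,
    };
    this.messages.push(message);
    return message;
  }

  // Everything after `lastSeenId`; the whole log if the ID is null or unknown.
  since(lastSeenId) {
    const index = this.messages.findIndex((m) => m.id === lastSeenId);
    return this.messages.slice(index + 1);
  }
}

// Client side: its whole state is "what have I seen".
class ChatClient {
  constructor() {
    this.lastSeenId = null;
    this.seen = new Set();
    this.shown = [];
  }

  receive(batch) {
    for (const message of batch) {
      if (this.seen.has(message.id)) {
        continue; // duplicate delivery, e.g. from overlapping polls
      }
      this.seen.add(message.id);
      this.shown.push(message.text);
      this.lastSeenId = message.id;
    }
  }
}

const serverLog = new MessageLog();
const client = new ChatClient();
serverLog.append('hi');
serverLog.append('there');

// Two overlapping polls return the same batch; the second one is ignored.
client.receive(serverLog.since(null));
client.receive(serverLog.since(null));
```

After both polls, client.shown contains each message once, and the next poll would pass client.lastSeenId so that the server replays only newer messages.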
Since it's real-time chat, the setInterval interval is probably small enough that the client asks the server for new messages two or three times simultaneously. Make sure that the server handler is synchronized and that it ignores duplicate queries from the same user.