Wrong message order with kafka-node - javascript

I am using the kafka-node Node.js library. I have a problem with message order when consuming a topic with 250k messages (which were loaded into Kafka in batches of 2000 messages) from a fresh start (no offsets in ZooKeeper). The consumer often does not process messages from offset 0; rather it starts at 4000 or 8000, or so. It also repeatedly processes a block of 1000 messages and then jumps to a later or earlier N*1000 offset. I tried changing maxTickMessages to 800 and it processed blocks of 800 messages, but it still jumped to N*1000 offsets. I could not find the missing 200 offsets in the debug log. Changing maxTickMessages or maxNumSegments to a very large number did not help.
I was printing the current message offset directly in the Kafka binary protocol decoder, which should eliminate some potential async effects. Please see the offset log and the test code used, kafka-order-test.js. I think there is a problem in the Kafka binary protocol parsing, but I was not able to find it.
Kafka itself should not be the problem, as I dumped the topic with kafkacat and it maintained the correct offset and message order. I also monitored the Node.js-Kafka network traffic with Wireshark, and the messages were shown in the correct order.

This problem was caused by asynchronous nested MessageSet decompression, which resulted in out-of-order message consumption. Kafka returns messages in a MessageSet which contains nested compressed MessageSets of 2000 messages (in my testing). Unfortunately the decompression was asynchronous without any synchronisation, so messages were processed out of order in batches of at most 2000 (depending on maxTickMessages). My fix applies synchronous decompression.
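To illustrate the failure mode, here is a minimal sketch (not kafka-node's actual code) of why asynchronous decompression reorders messages and why a synchronous variant does not: with zlib.gunzip the callbacks for each compressed set can complete in any order, whereas zlib.gunzipSync emits the sets in the order they arrived.
var zlib = require('zlib');

// Out of order: the async callbacks may finish in any order, so set N+1
// can be emitted before set N.
function decodeAsync(compressedSets, emit) {
  compressedSets.forEach(function (set) {
    zlib.gunzip(set, function (err, messages) {
      if (!err) emit(messages);
    });
  });
}

// In order: synchronous decompression preserves the original MessageSet order.
function decodeSync(compressedSets, emit) {
  compressedSets.forEach(function (set) {
    emit(zlib.gunzipSync(set));
  });
}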

Related

Sending lots of small packets out of Node.js UDP doesn't send all of them

I have a project that I'm working on that needs to send a small 9 byte packet to 7000 different hosts outside the local network, after which it waits for their replies back on the same port and processes the responses.
The problem I'm having is Node.js dgram (udp4) doesn't seem to be sending all packets out. I'm not rate limiting in any way so there might be an issue there.
I'm looping, creating the packets, then firing them straight out using .send(). With Wireshark open I can see that out of the 7000 being "sent" only ~1300 of them appear to be hitting the wire and leaving.
The script itself reports all packets as sent with no errors, but Wireshark shows a different result, and the hosts at the other end reflect what Wireshark says: they don't receive the packets. I'm using the following to send and verify; packet is a Buffer.
udpServer.send(packet, 0, packet.length, port, address, function(error) {
  if (!error) {
    successes++;
    console.log(successes + "/" + total);
  } else {
    console.log(error);
  }
});
Any ideas on what I am doing wrong here, or what's been overlooked?
There are many junctures where your packet could be dropped:
dropped when sending to the kernel. Are you on Linux? Try using strace to find the system call (probably send, sendmsg, or sendto) return value. If the system call is returning an error, I'd expect that to be reported in "error" by Node, though.
dropped in the kernel tx queue. On Linux you can check, e.g., /sys/class/net/eth0/statistics/ and see if any drop counters are incrementing (if they are, the pacing sketch after this list may help).
dropped in hardware tx queue. If using an Intel NIC you can run ethtool -S eth0 to see if any drop counters are incremented.
dropped in intermediate hardware (eg switches/routers). This is trickier to see, as it's vendor dependent and maybe invisible. You can eliminate this by hooking your machines up directly to each other.
dropped in hardware rx queue. On the remote end, check all same stats to see if an overflow is occurring there.
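If the drops turn out to be in the local tx queues, a common mitigation is to pace the sends instead of firing all 7000 in one tight loop. A minimal sketch, reusing udpServer and packet from the question; hosts is assumed to be an array of {address, port} targets, and the batch size and delay are arbitrary values to tune, not recommendations:
// Hypothetical pacing: send in small batches with a pause between them,
// giving the kernel and NIC tx queues time to drain.
function sendPaced(targets, batchSize, delayMs) {
  var i = 0;
  (function sendBatch() {
    targets.slice(i, i + batchSize).forEach(function (host) {
      udpServer.send(packet, 0, packet.length, host.port, host.address, function (error) {
        if (error) console.log(error);
      });
    });
    i += batchSize;
    if (i < targets.length) setTimeout(sendBatch, delayMs);
  })();
}

sendPaced(hosts, 100, 50); // e.g. 100 packets every 50 ms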

NodeJS: Do I need to end HTTP requests to save memory/CPU?

I've written a program in Node and Express, using Request to connect to an API and download a bunch of data (think 3,000 API requests), all within the usage limits of the API, mind you.
When running this in a Docker container, I'm getting a lot of getaddrinfo ENOTFOUND errors, and I'm wondering if this is a resourcing issue. My requests are like so:
request.get(url, function(err, resp, body){
  // do stuff with the body here,
  // like create an object and handball to a worker function
});
For the first few hundred requests this always works fine, but then I get lots and lots of either ENOTFOUND or timeout errors, and I think the issue might be the way my code is dealing with all these requests.
I've batched them in a queue with timeouts so the requests happen relatively slowly; it helps a little bit but doesn't solve the problem completely.
Do I need to destroy the body/response objects to free up memory or something?
I've encountered similar issues with an API I was using, and it ended up being what some here suggested - rate limits. Some APIs don't return readable errors on rate limits, as they provide a certain amount of resources per client, and when you've used it all up they can't even send you a bad response.
This happened even though I stayed within the published rate limits per day; it turned out they had an unwritten limit per minute (or, more likely, they were just unable to process so many requests).
I tracked it down by mocking that API with my own code, placed in the network so as to maximize the similarities; since my mocked code didn't do anything, I never got any errors from the Node.js server.
Then I put in some delays and timeouts where they were needed.
I suggest the same to you. Remember that having a per-hour limit doesn't mean they don't also have a separate per-second or per-minute limit.
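A minimal sketch of that kind of pacing with the request module; apiUrls, the 200 ms delay, and the one-request-at-a-time approach are assumptions, to be tuned to whatever the API actually tolerates:
var request = require('request');

// Hypothetical sequential runner: one request at a time, with a pause between
// them, so bursts never exceed an unwritten per-second or per-minute limit.
function fetchAll(urls, delayMs, done) {
  var results = [];
  (function next(i) {
    if (i >= urls.length) return done(null, results);
    request.get(urls[i], function (err, resp, body) {
      if (err) return done(err);
      results.push(body); // or hand the body off to a worker function here
      setTimeout(function () { next(i + 1); }, delayMs);
    });
  })(0);
}

fetchAll(apiUrls, 200, function (err, bodies) {
  if (err) console.log(err);
});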

Any way to prevent the RabbitMQ dead letter queue from dropping properties?

I'm using the dead letter exchange feature in RabbitMQ to perform scheduled RPC calls, but after a message is dead-lettered it drops the reply-to property that was set on the original message. Is there any way to declare the reply-to property so that it will be retained in the "dead" queue?
I'm doing this with amqplib in node.js by the way.
Unfortunately, RabbitMQ will only preserve the properties that are listed on the Dead Letter Exchange page:
queue - the name of the queue the message was in before it was dead-lettered,
reason - the reason the message was dead-lettered,
time - the date and time the message was dead-lettered, as a 64-bit AMQP format timestamp,
exchange - the exchange the message was published to,
routing-keys - the routing keys the message was published with,
count - how many times this message was dead-lettered in this queue for this reason, and
original-expiration - the original expiration property of the message.
There are 2 ways to solve the problem you're seeing, I think.
1) Put the reply-to in your own header or property field, and read it from there / replace it when it's not in the usual spot (a sketch of this follows below)
2) Don't use the reply-to field. Instead, use a well-known exchange for the reply at a later point in time.
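For option 1, a minimal amqplib sketch; the queue names, the x-reply-to header name, and the 5-second expiration are placeholder assumptions, and the point is simply that headers you set yourself travel with the message into the dead-letter queue:
var amqp = require('amqplib');

amqp.connect('amqp://localhost').then(function (conn) {
  return conn.createChannel();
}).then(function (ch) {
  // Publisher side: stash the reply queue name in a header we control.
  ch.sendToQueue('delayed-rpc', Buffer.from(JSON.stringify({ some: 'payload' })), {
    expiration: '5000',                      // let the message dead-letter after 5 seconds
    headers: { 'x-reply-to': 'rpc.replies' } // survives dead-lettering
  });

  // Consumer side, on the dead-lettered queue: read the header back out.
  return ch.consume('delayed-rpc.dead', function (msg) {
    var replyTo = msg.properties.headers['x-reply-to'];
    ch.sendToQueue(replyTo, Buffer.from('response'));
    ch.ack(msg);
  });
});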
Using the reply-to field typically implies a request/response or RPC scenario. These scenarios usually need a response fairly quickly. If a response does not come quickly, the system can usually move forward without it - even if it's just a message to the user saying "X is not available right now".
You say you're using a DLX to do scheduled RPC calls... delayed messages are a common use case for a DLX - nothing wrong with that. But delaying the RPC response can run into some significant challenges beyond what you're already seeing.
For example, what happens when your system has a hiccup and the code that made the original request is no longer there to listen for the response? The answer to this depends on whether or not you really need the response to be handled. If you do need it to be handled - the system will run into serious trouble if it isn't - then RPC can be dangerous.
Instead of relying on RPC and implying a temporal need for a given response, it's often better to use two-way messaging via separate queues. I've written about this in both my managing long running processes post and in my RabbitMQ Patterns email course / ebook.
The gist of it is that you can avoid the need for a reply-to queue by having the original message publisher also be a subscriber with a queue for the responses.
An example from the long running process post:
var DrinkRequestSender = new Sender(/* ... details ... */);
var DrinkRequestReceiver = new Receiver(/* ... details ... */);

var DrinkStation = {
  make: function(drink){
    DrinkRequestReceiver.receive((response) => {
      var drinkResponse = response.body;
      this.trigger("drinkup", drinkResponse);
    });

    var drinkData = drink.toJSON();
    DrinkRequestSender.send(drinkData);
  }
};
In this example, the code is sending out a "request" and later receiving a "response" - but not using a standard RPC setup. It is using a dedicated queue for the response, with the code on the other end sending the reply back via an exchange that routes to that queue.
This allows you to better handle failure scenarios, very long running processes and more.
This style of 2-way messaging does add some additional challenges, though. For one, you'll have to build in the ability to reconstruct the object that made the original request.
You can find this detailed in the long running process post, and there's a bit more info in RMQ Patterns, as well (along with a lot of other patterns).
Hope that helps!

How to limit the amount of data being sent by the client through websocket?

I am using the ws module and I'd like to limit the amount of data being sent by the client over websocket to 1Mb. This will prevent a malicious user from sending huge amounts of data (in terms of GB) causing the server to run out of memory, which would cause denial of service errors for every normal user.
For example, Express allows you to specify the max size of a POST request body like so:
bodyParser.json({limit:'1Mb'})
How do I do something similar with the ws module? I tried:
var ws = require('ws').Server
var wsserver = new ws({port:8080, limit:'1Mb'})
But this of course doesn't work. I want the transmission of data to be interrupted (after 1Mb is exceeded) and the websocket connection to be closed. How can I do that?
There must be a way to limit the frames of data coming from the client...
That ability does not (currently) exist in that library.
Poking around their source code, it appears that the place to start would be the processPacket() method in https://github.com/websockets/ws/blob/master/lib/Receiver.js.
Once you have the packet header available, you can see the size of the message being sent. If it's above a certain threshold, there should be a way to close the connection before all of the bytes are even hitting your network.
Of course, the nice thing to do would be to fork their repository, issue a feature request, add in a configuration option that defaults to not taking any action if it's not set (don't break backwards compatibility), and submit a pull request.
If they like it, they'll merge. If not, you'll still be able to merge their future versions into your own repo and stay up to date without having to re-do your work each time they submit a new release.
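In the meantime, a possible application-level stopgap - with the big caveat that it does not stop ws from buffering the oversized frame first, so it only limits what your handlers process, not the memory used to receive it - is to check the payload size in the message handler and close the connection with the 1009 "message too big" close code. A sketch, with the 1 MB threshold as an assumption:
var WebSocketServer = require('ws').Server;
var MAX_BYTES = 1024 * 1024; // assumed 1 MB limit

var wsserver = new WebSocketServer({ port: 8080 });

wsserver.on('connection', function (socket) {
  socket.on('message', function (data) {
    if (data.length > MAX_BYTES) {
      socket.close(1009, 'Message too big'); // 1009 = "message too big" status code
      return;
    }
    // handle data normally...
  });
});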

Acknowledging multiple messages with Rabbit.js for RabbitMQ

I'm using RabbitMQ as a buffer for a large number of messages that need to be saved to a database. The messages come in, then on the other end, a script pulls a number of them and writes them, as a batch, to the database. This app is written in NodeJS (using Rabbit.js).
While I haven't found a way to wait and consume a set number of messages at once (say, 100), I am using a worker queue to receive messages in Node and then write them after a time period or a maximum number of messages has been reached.
If the app dies, however, or otherwise fails, I need the messages to be re-released onto the queue.
Therefore I can use the ack() function in Rabbit.js on the queue, but that merely acknowledges the most recent message, instead of letting me select a number of messages to acknowledge, and I'm reluctant to call ack() 100 times just to get to the right number of acknowledgements.
Is there a way to acknowledge receipt of X messages using Rabbit.js (or some other Node.js library that will work with RabbitMQ)?
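A minimal sketch of that batch-then-acknowledge idea using amqplib (one of the "other Node.js libraries" the question leaves open); the queue name, the batch size of 100, and the saveBatch function returning a promise are placeholder assumptions. amqplib's channel.ack(msg, true) acknowledges every outstanding message up to and including msg, so one call covers the whole batch, and anything unacknowledged when the app dies is re-queued:
var amqp = require('amqplib');

amqp.connect('amqp://localhost').then(function (conn) {
  return conn.createChannel();
}).then(function (ch) {
  var batch = [];
  ch.prefetch(100); // hold at most one batch's worth of unacked messages

  return ch.consume('incoming-messages', function (msg) {
    batch.push(msg);
    if (batch.length >= 100) {
      var toWrite = batch;
      batch = [];
      saveBatch(toWrite).then(function () {
        // A single ack with allUpTo = true acknowledges everything up to and
        // including the last message in the batch.
        ch.ack(toWrite[toWrite.length - 1], true);
      });
    }
  }, { noAck: false });
});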
