Meteor Shutdown/Restart Interrupting Running Server-Side Functions - javascript

I am building out a data analysis tool in Meteor and am running into issues with the Meteor server shutdown and restart. Server-side, I am pinging several external APIs on a setInterval, trimming the responses down to just new data that I haven't already captured, running a batch of calculations on that new data, and finally storing the computed results in Mongo. For every chunk of new data that I receive from the external APIs, there are about 15 different functions/computations that I need to run, and each of the 15 outputs are being stored in Mongo inside separate documents. Client-side, users can subscribe to any one out of the 15 documents, enabling them to view the data in the manner in which they please.
new data is captured from the API as {A} and {A} is stored in Mongo
|
begin chain
|
function1 -> transforms {A} into {B} and stores {B} in Mongo
|
function2 -> transforms {A} into {C} and stores {C} in Mongo
|
...
|
function15 -> transforms {A} into {P} and stores {P} in Mongo
|
end chain
The problem is that when I shut down Meteor or deploy new code to the server (which automatically restarts it), the loop that iterates through these functions gets interrupted. Say functions 1-7 run successfully and then Meteor restarts, so functions 8-15 never run (or, even worse, function 8 is interrupted while 9-15 never run). This leaves my documents out of sync with the {A} data that was stored before the loop began.
How can this risk be mitigated? Is it possible to tell Meteor to shut down gracefully / wait until this process has completed? Thanks!

I've never heard of a graceful shutdown, and even if there were one, I'm not sure I'd trust it...
What I'd do is assign a complete flag to {A} that is set when function15 finishes. If the functions are async (or just expensive, but I doubt that's your case...), then create a record in each of {B}-{P} when {A} is created, carrying the _id of {A} and a complete flag.
Either way, run the functions against a query of batches where complete is false.
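A minimal sketch of that idea on the server, assuming a RawData collection standing in for {A} and your existing function1 through function15 (each transform should upsert by the source _id so that re-running a partially processed batch is harmless):
const RawData = new Mongo.Collection('rawData'); // assumption: where {A} lives

const transforms = [
  function1, // {A} -> {B}
  // ...
  function15 // {A} -> {P}
];

function processPendingBatches() {
  // only pick up batches that never finished the full chain
  RawData.find({ complete: false }).forEach((doc) => {
    transforms.forEach((fn) => fn(doc)); // each fn stores its own output
    RawData.update(doc._id, { $set: { complete: true } });
  });
}

// run on the same interval that ingests new API data; a batch interrupted by a
// restart is simply picked up again on the next pass
Meteor.setInterval(processPendingBatches, 60 * 1000);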

There is currently no official way to do a graceful shutdown, so you would have to come up with some other way of making sure your data isn't stored in an inconsistent state.
The easiest way that comes to mind would be to disable Meteor's automatic restarts with meteor --once.
Then add a shutdown mode to your application.
In shutdown mode your application should not pick up new tasks, only finish what it is currently working on. This is easy to do if you use meteor-synced-cron, which has a stop method that doesn't kill currently running functions.
Make sure that a finishing task never leaves a document in an inconsistent state.
You can have multiple tasks working on a document; just make sure that when task1 finishes, the document is always in a state where task2 can pick it up.
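A rough sketch of that shutdown mode using percolate:synced-cron, with the job body as a placeholder for your processing chain and an assumed 30-second grace window:
SyncedCron.add({
  name: 'Process new API data',
  schedule(parser) { return parser.text('every 1 minute'); },
  job() {
    processPendingBatches(); // your existing chain of functions 1-15
  }
});

Meteor.startup(() => SyncedCron.start());

// On SIGTERM, stop scheduling new jobs; anything already running is allowed to
// finish within the (assumed) 30-second window before the process exits.
process.on('SIGTERM', Meteor.bindEnvironment(() => {
  SyncedCron.stop();
  setTimeout(() => process.exit(0), 30 * 1000);
}));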

Related

Handle Multiple Concurrent Requests for Express Server on Same Endpoint API

This question might be a duplicate, but I am still not getting the answer. I am fairly new to Node.js, so I might need some help. Many have said that Node.js is perfectly free to run incoming requests asynchronously, but the code below shows that if multiple requests hit the same endpoint, say /test3, the callback function will:
Print "test3"
Call setTimeout() to prevent blocking of event loop
Wait for 5 seconds and send a response of "test3" to the client
My question: if client 1 and client 2 call the /test3 endpoint at the same time (assuming client 1 hits the endpoint first), does client 2 have to wait for client 1 to finish before entering the event loop?
Can anybody tell me if it is possible for multiple clients to call a single endpoint and have the requests run concurrently rather than sequentially, something like a one-thread-per-connection analogy?
Of course, if I were to call another endpoint such as /test1 or /test2 while the code is still executing for /test3, I would still get a response from /test2 ("test2") immediately.
app.get("/test1", (req, res) => {
console.log("test1");
setTimeout(() => res.send("test1"), 5000);
});
app.get("/test2", async (req, res, next) => {
console.log("test2");
res.send("test2");
});
app.get("/test3", (req, res) => {
console.log("test3");
setTimeout(() => res.send("test3"), 5000);
});
For those who have visited: it has got nothing to do with blocking of the event loop.
I have found something interesting. The answer to the question can be found here.
When I was using Chrome, the requests kept getting blocked after the first request. However, with Safari, I was able to hit the endpoint concurrently. For more details, see the link below.
GET requests from Chrome browser are blocking the API to receive further requests in NODEJS
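If you want to rule the browser out entirely, a quick sketch that fires two requests in parallel from a separate Node script (assuming the app above listens on http://localhost:3000 and Node 18+ for the global fetch):
const url = 'http://localhost:3000/test3';

console.time('both requests');
Promise.all([fetch(url), fetch(url)])
  .then((responses) => Promise.all(responses.map((r) => r.text())))
  .then((bodies) => {
    console.log(bodies);              // [ 'test3', 'test3' ]
    console.timeEnd('both requests'); // roughly 5s total, not 10s
  });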
Run your application in cluster mode. Look up PM2.
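For illustration, a minimal sketch of the built-in cluster module doing this by hand (PM2's cluster mode, e.g. pm2 start app.js -i max, achieves the same without the boilerplate); the port and route mirror the question's example:
const cluster = require('cluster');
const os = require('os');
const express = require('express');

if (cluster.isMaster) {
  // one worker per CPU core; replace any worker that dies
  os.cpus().forEach(() => cluster.fork());
  cluster.on('exit', () => cluster.fork());
} else {
  const app = express();
  app.get('/test3', (req, res) => {
    console.log(`test3 handled by worker ${process.pid}`);
    setTimeout(() => res.send('test3'), 5000);
  });
  app.listen(3000);
}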
This question needs more details to be answered properly and is arguably opinion-based, but since it rests on a strawman I will answer it anyway.
First of all we need to define "run concurrently". It is ambiguous: if we take the literal meaning, in strict theory nothing runs concurrently.
CPUs can only carry out one instruction at a time.
The speed at which the CPU can carry out instructions is called the clock speed, and it is controlled by a clock. With every tick of the clock, the CPU fetches and executes one instruction. Clock speed is measured in cycles per second (1 cycle per second = 1 hertz), so a CPU with a clock speed of 2 gigahertz (GHz) carries out 2000 million (two billion) cycles per second.
CPUs running multiple tasks "concurrently":
Yes, you're right that nowadays computers and even cell phones come with multiple cores, which means the number of tasks running at literally the same time depends on the number of cores. But if you ask an engineer like me, they'll tell you it is very rare to find a server with more than one core: why spend 500 USD on a multi-core server when you can spawn a whole bunch of nano instances (or whatever option is available in the free trial) with Kubernetes?
Another thing: why configure Node to be in charge of the routing? Let Apache and/or nginx worry about that.
As you mentioned, there is this thing called the event loop, which is a fancy name for a FIFO queue data structure.
So in other words: no, neither Node.js nor any other programming language out there will run your requests truly in parallel on a single core.
But it definitely depends on your infrastructure.

How to stop nodejs from killing a child process on out of memory?

I am working in Node.js with large files. I am invoking a child process running a Python script to work on these large files. The problem is that on large files the memory runs out and Node.js aborts the process by killing it. I know about --max-old-space-size, but I don't just want to limit the size; I want Node.js to keep working on the process even if it runs out of memory.
For example, suppose Node.js completes a task in 1 minute when allowed 2 GB of RAM. If I limit the memory to 1 GB, it should finish the task in 2 minutes rather than throwing an error. It should use a queue or something else.
EDIT:
This is the command I am executing as a child process using spawn.
const childProcess = require("child_process");

let extExecCommand = "cat file.json | python sample-script.py > output-file";
let extensionScript = childProcess.spawn(extExecCommand, { shell: true });
When file.json is very large, Node aborts and kills the child process. I don't want that to happen; I want Node.js to keep executing the work and maintain some type of data structure.
"I want node.js to keep working on the process if it runs out of memory?"
I'm afraid it's not possible to avoid the heap out-of-memory error.
But you could do two things.
First, delegate the work to a separate process managed by PM2 with the --max-memory-restart option, and run it via an ecosystem file, e.g. pm2 start test.yaml:
apps:
  - script: ./index.js
    name: 'PYTHON'
    instances: 1
    out_file: "/dev/null"
    autorestart: true
    max_memory_restart: '100M'
    log_date_format: 'DD/MM HH:mm:ss.SSS'
The second option is to use a queue/job manager such as Bree, Bull, or Agenda: put your resource-heavy tasks there, split jobs into parts (if possible), and manage them that way. Node.js can also do this via worker_threads.
But I suspect something is off in your project architecture; you should not hold that amount of data in memory. You could access large files or data via streams, from a database, or from other storage.
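For example, a hedged sketch of replacing the shell pipeline with streams, so Node itself never buffers file.json (whether this helps depends on where the memory is actually being consumed, i.e. Node or the Python script):
const fs = require('fs');
const childProcess = require('child_process');

// equivalent of: cat file.json | python sample-script.py > output-file
const py = childProcess.spawn('python', ['sample-script.py']);
fs.createReadStream('file.json').pipe(py.stdin);
py.stdout.pipe(fs.createWriteStream('output-file'));

py.on('close', (code) => console.log(`python exited with code ${code}`));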

Running tasks in parallel with reliability in cloud functions

I'm streaming and processing tweets in Firebase Cloud Functions using the Twitter API.
In my stream I am tracking various keywords and users of Twitter, so the influx of tweets is very high and a new tweet is delivered even before I have processed the previous one, which leads to lapses where a new tweet sometimes does not get processed.
This is how my stream looks:
...
const stream = twitter.stream('statuses/filter', {track: [various, keywords, ..., ...], follow: [userId1, userId2, userId3, userId3, ..., ...]});
stream.on('tweet', (tweet) => {
  // This takes time because there are multiple network requests involved and also
  // sometimes recursively running functions depending on the tweet's properties.
  processTweet(tweet);
});
...
processTweet(tweet) essentially compiles threads from Twitter, which takes time depending on the length of the thread, sometimes a few seconds. I have optimised processTweet(tweet) as much as possible to compile the threads reliably.
I want to run processTweet(tweet) in parallel and queue the tweets that arrive while processing is underway, so that it runs reliably as the Twitter docs specify:
Ensure that your client is reading the stream fast enough. Typically you should not do any real processing work as you read the stream. Read the stream and hand the activity to another thread/process/data store to do your processing asynchronously.
Help would be very much appreciated.
This Twitter streaming API will not work with Cloud Functions.
Cloud Functions code can only be invoked in response to incoming events, and the code may run for at most 9 minutes (60 seconds by default). After that, the function code is forced to shut down. With Cloud Functions, there is no way to continually process a stream of data coming from an API.
In order to use this API, you will need to use some other compute product that allows you to run code indefinitely on a dedicated server instance, such as App Engine or Compute Engine.
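If you do move this to a long-lived instance, a rough sketch of the pattern the Twitter docs describe: hand the tweet off immediately and drain a queue with a bounded number of workers. The in-memory queue and MAX_WORKERS value are assumptions, and anything queued is lost if the process dies, so a durable queue (e.g. Pub/Sub or Redis) is safer.
const queue = [];
const MAX_WORKERS = 5; // arbitrary concurrency cap
let active = 0;

stream.on('tweet', (tweet) => {
  queue.push(tweet); // never do the heavy work inside the stream handler
  drain();
});

function drain() {
  while (active < MAX_WORKERS && queue.length > 0) {
    const tweet = queue.shift();
    active++;
    processTweet(tweet) // assumed to return a Promise
      .catch((err) => console.error('processTweet failed', err))
      .finally(() => {
        active--;
        drain();
      });
  }
}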

Node.js API to spawn off a call to another API

I created a Node.js API.
When this API gets called, I return to the caller fairly quickly, which is good.
But now I also want the API to call or launch a different API or function or something that will go off and run on its own. Kind of like calling a child process with child.unref(). In fact, I would use child.spawn(), but I don't see how to have spawn() call another API. Maybe that alone is my answer?
Of this other process, I don't care if it crashes or finishes without error.
So it doesn't need to be attached to anything, but if it does remain attached to the Node.js console, that's icing on the cake.
I'm still thinking about how to detect, and what to do, if the spawned process somehow ends up running for a really long time, but I'm not ready to cross that bridge yet.
Your thoughts on what I might be able to do?
I guess I could child.spawn('node', [somescript])
What do you think?
I would have to explore if my cloud host will permit this too.
You need to specify exactly what the other spawned thing is supposed to do. If it is calling an HTTP API, with Node.js you should not launch a new process to do that. Node is built to run HTTP requests asynchronously.
The normal pattern, if you really need some work to happen in a different process, is to use something like a message queue, the cluster module, or some other messaging/queue mechanism between processes that a worker monitors; the worker is usually set up to handle a particular task or set of tasks. It is pretty unusual to spawn another process after receiving an HTTP request, since launching new processes is heavyweight and can exhaust your server's resources if you aren't careful, and thanks to Node's async capabilities it usually isn't necessary, especially for work that is mainly IO.
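For purely IO-bound work, a hedged sketch of keeping it in the same process: respond immediately and let the un-awaited async call finish on its own. someSlowApiCall is a placeholder for whatever HTTP call you need to make.
const express = require('express');
const app = express();
app.use(express.json());

app.put('/test', (req, res) => {
  res.sendStatus(200); // reply to the caller right away

  // fire-and-forget: not awaited, but errors are caught so nothing goes unhandled
  someSlowApiCall(req.body.u)
    .catch((err) => console.error('background call failed', err));
});

app.listen(3000);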
This is from a test API I built some time ago. Note I'm even passing a value into the script as a parameter.
router.put('/test', function (req, res, next) {
    var u = req.body.u;
    var cp = require('child_process');
    // note: the spawn option is `detached` (not `detach`), and ignoring stdio lets
    // the child run independently of the parent; spawn passes args verbatim, so no
    // extra quoting is needed around u
    var c = cp.spawn('node', ['yourtest.js', u], { detached: true, stdio: 'ignore' });
    c.unref();
    res.sendStatus(200);
});
The yourtest.js script can be just about anything you want it to be, but I found I learned more by first treating the script as a standalone Node.js console app. FIRST get your yourtest.js script to run without error by running/testing it manually from the command line (node yourtest.js yourparametervalue), THEN integrate it into the child.spawn() call.
var u = process.argv[2];
console.log('f2u', u);
function f1() {
    console.log('f1-hello');
}

function f2() {
    console.log('f2-hello');
}

// Wait 3 seconds before executing f2(); this is just for troubleshooting. You can
// watch node.exe open and then close in Task Manager if it runs long enough.
setTimeout(f2, 3000);
f1();

Mongo script runs quickly locally but slow if I run it against a remote instance?

I have a mongo script that I am using to perform some data cleanup after a database migration.
When I run this script locally, it finishes in about 5 minutes. When I run the script from my local machine against a remote instance, it takes forever (I usually kill it after about two hours). These databases are essentially identical; the indexes are all the same, maybe a few records in one place that aren't in the other.
I am executing the script like so:
Locally-
mongo localDatabase script.js
Against remote instance-
mongo removeServer/remoteDatabase -u user -p password script.js
I had assumed that since I was passing the script to the remote instance it would be executed entirely on the remote machine with no data having to be transported back and forth between the remote machine and my local machine (and hence there would be little difference in performance).
Is this assumption correct? Any idea why I am seeing the huge performance difference between local/remote? Suggestions on how to fix?
Yes, you can use Bulk operations. All operations in MongoDB are designed around a single collection, but there is nothing wrong with looping over one collection and inserting or updating another collection.
In fact, in the MongoDB 2.6 shell it is the best way to do it, and the collection methods themselves try to use the "Bulk" methods under the hood, even though they actually only do single updates/inserts per operation. That is why you would see a different response in the shell.
Note that your server needs to be a MongoDB 2.6 or greater instance as well, which is why the collection methods in the shell do some detection in case you are connecting to an older server.
But basically your process is:
var bulk = db.targetcollection.initializeOrderedBulkOp();
var counter = 0;

db.sourcecollection.find().forEach(function(doc) {
    bulk.find({ "_id": doc._id }).updateOne(
        // update operations here
    );
    counter++;

    if ( counter % 1000 == 0 ) {
        bulk.execute();
        bulk = db.targetcollection.initializeOrderedBulkOp();
    }
});

if ( counter % 1000 != 0 )
    bulk.execute();
The Bulk API will keep all the operations you send to it "queued" until execute is called, which sends the operations to the server. The API only actually sends them in batches of 1000 entries at a time; a little extra care is taken here to manually limit this with the modulo in order to avoid using up additional memory.
You can tune that amount to your needs, but remember that the request basically translates to a single BSON document, which has a hard limit of 16MB.
See the full manual page for all options, including upserts, multi-updates, inserts and removes, or even un-ordered operations where the order, or failure of an individual item, is not important.
Also note that with un-ordered operations the write result will return any error items in a list, as well as response fields such as the list of upserted _ids where those apply.
Combined with keeping your shell instance as close as possible to the server, the reduced "back and forth" traffic will speed things up. As I said, the shell is using these anyway, so you might as well leverage these to your advantage.
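If you want to see what each batch actually did, a small hedged addition to the loop above: capture the return value of execute() and print it (in the 2.6+ shell it is a BulkWriteResult with counts, upserted _ids, and any write errors).
var result = bulk.execute();
printjson(result); // nMatched, nModified, nUpserted, upserted _ids, write errors if any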
