mongoDB insert and process.nextTick - javascript

I have a list of 50k entries that I am entering into my db.
var tickets = [new Ticket(), new Ticket(), ...]; // 50k of them

tickets.forEach(function (t, ind) {
  console.log(ind + 1 + '/' + tickets.length);
  Ticket.findOneAndUpdate({id: t.id}, t, {upsert: true}, function (err, doc) {
    if (err) {
      console.log(err);
    } else {
      console.log('inserted');
    }
  });
});
Instead of the expected interleaving of
1 / 50000
inserted
2 / 50000
inserted
I am getting all of the indices followed by all of the inserted confirmations
1 / 50000
2 / 50000
...
50000 / 50000
inserted
inserted
...
inserted
I think something is happening with process.nextTick. There is a significant slowdown after a few thousand records.
Does anyone know how to get the efficient interleaving?

Instead of the expected interleaving
That would be the expected behavior only for synchronous I/O.
Remember that these operations are all asynchronous, which is a key idea of node.js. What the code does is this:
for each item in the list,
  'start a function' // <-- this will immediately look at the next item
    output a number (happens immediately)
    do some long-running operation over the network with connection pooling
    and batching. When done,
      call a callback that says 'inserted'
Now the code launches a ton of those functions, which in turn send requests to the database. All of that happens long before the first request has even reached the database. It is likely that the OS doesn't even bother to send the first TCP packets before you're at, say, ticket 5 or 10.
To answer the question from your comment: no, the requests will be sent out relatively soon (that is up to the OS), but the results won't reach your single-threaded JavaScript code until your loop has finished queuing up the 50k entries. This is because the forEach is your currently running piece of code, and all events that come in while it's running will be processed only after it has finished - you'll observe the same thing if you use setTimeout(function() { console.log("inserted... not") }, 0) instead of the actual DB call, because setTimeout is also an async event.
To make your code fully async, your data source should be some kind of (async) iterator that provides data, instead of a huge array of items.
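As an illustration of that idea (this sketch is mine, not part of the original answer), assuming a Mongoose-style Ticket model whose queries can be awaited and a hypothetical async generator ticketSource() standing in for your real data source, awaiting each upsert produces exactly the interleaving you expected:
// Hypothetical async data source - replace with a cursor, stream or file reader.
async function* ticketSource() {
  for (let i = 0; i < 50000; i++) {
    yield new Ticket({ id: i });
  }
}

async function importTickets() {
  let count = 0;
  for await (const t of ticketSource()) {
    // Awaiting each upsert keeps only one request in flight at a time.
    await Ticket.findOneAndUpdate({ id: t.id }, t, { upsert: true });
    console.log(++count + ' inserted');
  }
}

importTickets().catch(console.error);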

You are running into the wonders of node's asynchronicity. It's sending the upsert requests off into the ether, then continuing on to the next record without waiting for a response. Does it matter? The counter is just an informational message and is not in sync with the upserts. If you need to make sure they are done in order, you might want to use the Async library to flip through your array.
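For example, here is a minimal sketch using the async library's eachOfLimit (the concurrency value of 10 is illustrative, not from the original answer):
var async = require('async');

// Start at most 10 upserts at a time, walking the array in order.
async.eachOfLimit(tickets, 10, function (t, ind, done) {
  console.log((ind + 1) + '/' + tickets.length);
  Ticket.findOneAndUpdate({ id: t.id }, t, { upsert: true }, done);
}, function (err) {
  if (err) {
    console.log(err);
  } else {
    console.log('all tickets upserted');
  }
});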

Related

Bulk Upsert Javascript stored procedure always exceeds execution cap of 5 seconds and results in a timeout

I'm currently running a script using the Python SDK which programmatically bulk upserts 1.5 million documents into a collection in Azure Cosmos DB. I've been using the bulk import sproc from the samples provided in the GitHub repo: https://github.com/Azure/azure-cosmosdb-js-server/tree/master/samples/stored-procedures, the only change being that I've swapped collection.createDocument for collection.upsertDocument. I'll include my sproc in full below.
The stored procedure does run successfully: it upserts documents consistently and relatively quickly. However, this is only the case up until around 30% progress, when this error is thrown:
CosmosHttpResponseError: (RequestTimeout) Message: {"Errors":["The requested operation exceeded maximum alloted time. Learn more: https://aka.ms/cosmosdb-tsg-service-request-timeout"]}
ActivityId: 9f2357c6-918c-4b67-ba20-569034bfde6f, Request URI: /apps/4a997bdb-7123-485a-9808-f952db2b7e52/services/a7c137c6-96b8-4b53-a20c-b9577981b353/partitions/305a8287-11d1-43f8-be1f-983bd4c4a63d/replicas/132488328092882514p/, RequestStats:
RequestStartTime: 2020-11-03T23:43:59.9158203Z, RequestEndTime: 2020-11-03T23:44:05.3858559Z, Number of regions attempted:1
ResponseTime: 2020-11-03T23:44:05.3858559Z, StoreResult: StorePhysicalAddress: rntbd://cdb-ms-prod-centralus1-fd22.documents.azure.com:14354/apps/4a997bdb-7123-485a-9808-f952db2b7e52/services/a7c137c6-96b8-4b53-a20c-b9577981b353/partitions/305a8287-11d1-43f8-be1f-983bd4c4a63d/replicas/132488328092882514p/, LSN: -1, GlobalCommittedLsn: -1, PartitionKeyRangeId: , IsValid: False, StatusCode: 408, SubStatusCode: 0, RequestCharge: 0, ItemLSN: -1, SessionToken: , UsingLocalLSN: False, TransportException: null, ResourceType: StoredProcedure, OperationType: ExecuteJavaScript, SDK: Microsoft.Azure.Documents.Common/2.11.0
Is there a way to add some retry logic or to extend the timeout period for bulk upserts? I believe the section of code in the sproc below if (!isAccepted) getContext().getResponse().setBody(count); is supposed to help with this scenario but it doesn't seem to work in my case.
Bulk upsert stored procedure in Javascript:
function bulkUpsert(docs) {
  var collection = getContext().getCollection();
  var collectionLink = collection.getSelfLink();

  // The count of imported docs, also used as current doc index.
  var count = 0;

  // Validate input.
  if (!docs) throw new Error("The array is undefined or null.");

  var docsLength = docs.length;
  if (docsLength == 0) {
    getContext().getResponse().setBody(0);
    return;
  }

  // Call the CRUD API to create a document.
  tryCreate(docs[count], callback);

  // Note that there are 2 exit conditions:
  // 1) The upsertDocument request was not accepted.
  //    In this case the callback will not be called, we just call setBody and we are done.
  // 2) The callback was called docs.length times.
  //    In this case all documents were created and we don't need to call tryCreate anymore.
  //    Just call setBody and we are done.
  function tryCreate(doc, callback) {
    var isAccepted = collection.upsertDocument(collectionLink, doc, callback);

    // If the request was accepted, callback will be called.
    // Otherwise report current count back to the client,
    // which will call the script again with the remaining set of docs.
    // This condition will happen when this stored procedure has been running too long
    // and is about to get cancelled by the server. This will allow the calling client
    // to resume this batch from the point we got to before isAccepted was set to false.
    if (!isAccepted) {
      getContext().getResponse().setBody(count);
    }
  }

  // This is called when collection.upsertDocument is done and the document has been persisted.
  function callback(err, doc, options) {
    if (err) throw err;

    // One more document has been inserted, increment the count.
    count++;

    if (count >= docsLength) {
      // If we have created all documents, we are done. Just set the response.
      getContext().getResponse().setBody(count);
    } else {
      // Create next document.
      tryCreate(docs[count], callback);
    }
  }
}
I think the problem may lie in the stored procedure rather than the Python script; if that isn't the case, though, I can provide my Python script as well. Any help on this would be massively appreciated; it's been a head scratcher for me for days now!
Extra Info:
Throughput = 10,000, partition upsert size ~ 1.9MB consistently.
If anyone else has this problem, the workaround I've used is to temporarily increase the throughput from 10,000 to 100,000 while the bulk upsert operation is underway. The error doesn't occur if you use that bulk upsert stored procedure in conjunction with a sufficiently high throughput. I think the timeouts started happening once the bulk upsert operation had upserted around 30% of the 1.5 million records, likely because the throughput wasn't divided sufficiently between partitions and it was causing a bottleneck. I may have to assign a greater throughput to my container again once it is used in practice, or maybe I'll be able to reduce it to save costs. Either way, the code to do this is quite simple, just the method below:
new_throughput = 10000
container.replace_throughput(new_throughput)
Stored procedures have a bounded execution time of 5 seconds. However, you can write your stored procedure to handle bounded execution by checking a boolean return value, and then use the count of items inserted in each invocation of the stored procedure to track and resume progress across batches. There is an example here.
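For illustration only (this is not from the answer above), here is a rough sketch of that client-side resume loop in JavaScript. executeSproc is a hypothetical stand-in for whichever SDK call executes the bulkUpsert stored procedure and returns the count it wrote via setBody(); here it simply simulates accepting at most 1,000 docs per invocation:
// Hypothetical stand-in for the real SDK call that executes the bulkUpsert
// sproc and resolves with the count it reported via setBody(). The simulation
// pretends the server accepts at most 1,000 docs before bailing out.
async function executeSproc(docs) {
  return Math.min(docs.length, 1000);
}

async function bulkUpsertWithResume(docs) {
  let done = 0;
  while (done < docs.length) {
    // Each invocation reports how many docs it upserted before the server
    // stopped accepting requests; resume from that point with the remainder.
    const count = await executeSproc(docs.slice(done));
    if (count === 0) throw new Error('stored procedure made no progress');
    done += count;
    console.log(done + '/' + docs.length + ' upserted');
  }
}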

JavaScript Why is some code getting executed before the rest?

I've mostly learned coding with OOP languages like Java.
I have a personal project where I want to import a bunch of plaintext into a MongoDB database. I thought I'd try to expand my horizons and do this using Node.js-powered JavaScript.
I got the code working fine but I'm trying to figure out why it is executing the way it is.
The output from the console is:
1. done reading file
2. closing db
3. record inserted (n times)
var fs = require('fs'),
    readline = require('readline'),
    instream = fs.createReadStream(config.file),
    outstream = new (require('stream'))(),
    rl = readline.createInterface(instream, outstream);

rl.on('line', function (line) {
  var split = line.split(" ");
  _user = "#" + split[0];
  _text = "'" + split[1] + "'";
  _addedBy = config._addedBy;
  _dateAdded = new Date().toISOString();

  quoteObj = { user: _user, text: _text, addedby: _addedBy, dateadded: _dateAdded };

  db.collection("quotes").insertOne(quoteObj, function (err, res) {
    if (err) throw err;
    console.log("record inserted.");
  });
});

rl.on('close', function (line) {
  console.log('done reading file.');
  console.log('closing db.');
  db.close();
});
(full code is here: https://github.com/HansHovanitz/Import-Stuff/blob/master/importStuff.js)
When I run it I get the messages 'done reading file' and 'closing db', and then all of the 'record inserted' messages. Why is that happening? Is it because of the delay in inserting a record into the db? The fact that I see 'closing db' first makes me think the db is getting closed, so how are the records still being inserted?
Just curious to know why the program is executing in this order for my own peace of mind. Thanks for any insight!
In short, it's because of the asynchronous nature of the I/O operations in the functions used, which is quite common for Node.js.
Here's what happens. First, the script reads all the lines of the file, and for each line initiates a db.insertOne() operation, supplying a callback for each of them. Note that the callback will be called when the corresponding operation is finished, not in the middle of this process.
Eventually the script reaches the end of the input file, logs two messages, then invokes db.close(). Note that even though the 'insert' callbacks (that log the 'inserted' message) have not been called yet, the database interface has already received all the 'insert' commands.
Now the tricky part: whether or not the DB interface manages to store all the records (in other words, whether or not it waits until all the insert operations are completed before closing the connection) depends on both the DB interface and its speed. If the write operations are fast enough (faster than reading the file lines), you'll probably end up with all the records inserted; if not, you may miss some of them. That's why the safest bet is to close the database connection not on the file close (when the reading is complete), but in the insert callbacks (when the writing is complete):
let linesCount = 0;
let eofReached = false;

rl.on('line', function (line) {
  ++linesCount;

  // parsing skipped for brevity

  db.collection("quotes").insertOne(quoteObj, function (err, res) {
    --linesCount;
    if (linesCount === 0 && eofReached) {
      db.close();
      console.log('database close');
    }
    // the rest skipped
  });
});

rl.on('close', function () {
  console.log('reading complete');
  eofReached = true;
});
This question describes a similar problem, and several different approaches to solving it.
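As a side note (this is my addition, not part of the answer above), with a driver version whose insertOne returns a promise when no callback is passed, the same bookkeeping can be expressed by collecting the promises and closing the connection once they have all settled. A minimal sketch:
const inserts = [];

rl.on('line', function (line) {
  // parsing skipped for brevity
  inserts.push(db.collection("quotes").insertOne(quoteObj));
});

rl.on('close', function () {
  console.log('reading complete');
  // Wait for every pending insert before closing the connection.
  Promise.all(inserts)
    .then(function () { console.log(inserts.length + ' records inserted'); })
    .catch(function (err) { console.error(err); })
    .then(function () { db.close(); });
});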
Welcome to the world of asynchronicity. Inserting into the DB happens asynchronously. This means that the rest of your (synchronous) code will execute completely before this task is complete. Consider the simplest asynchronous JS function setTimeout. It takes two arguments, a function and a time (in ms) after which to execute the function. In the example below "hello!" will log before "set timeout executed" is logged, even though the time is set to 0. Crazy right? That's because setTimeout is asynchronous.
This is one of the fundamental concepts of JS and it's going to come up all the time, so watch out!
setTimeout(() => {
  console.log("set timeout executed")
}, 0)

console.log("hello!")
When you call db.collection("quotes").insertOne you're actually creating an asynchronous request to the database. A good way to tell whether a call will be asynchronous is to check whether one (or more) of its parameters is a callback.
So the order you're running it is actually expected:
1. You instantiate rl
2. You bind your event handlers to rl
3. Your stream starts processing & calling your 'line' handler
4. Your 'line' handler opens asynchronous requests
5. Your stream ends and rl closes
...
4.5. Your asynchronous requests return and execute their callbacks
I labelled the callback execution as 4.5 because technically your requests can return at anytime after step 4.
I hope this is a useful explanation; most modern JavaScript relies heavily on asynchronous events, and it can be a little tricky to figure out how to work with them!
You're on the right track. The key is that the database calls are asynchronous. As the file is being read, it starts a bunch of async calls to the database. Since they are asynchronous, the program doesn't wait for them to complete at the time they are called. The file then closes. As the async calls complete, your callbacks run and the console.logs execute.
Your code reads lines and immediately after that makes a call to the db; both are asynchronous processes. When the last line is read, the last request to the db is made, and it takes some time for this request to be processed and the callback of insertOne to be executed. Meanwhile, rl has done its job and triggers the close event.

Delay function execution (API call) to execute n times in time period

I'm trying to write back-end functionality that handles requests to a particular API, but this API has some restrictive quotas, especially for requests/sec. I want to create an API abstraction layer that is capable of delaying function execution if there are too many requests/s, so it works like this:
A new request arrives (to put it simply: a library method is invoked)
Check if this request could be executed right now, according to the given limit (requests/s)
If it can't be executed, delay its execution till the next available moment
If at this time a new request arrives, delay its execution further or put it on some execution queue
I don't have any constraints in terms of waiting queue length. Requests are function calls with Node.js callbacks as the last param for responding with data.
I thought of adding a delay to each request, equal to the smallest possible slot between requests (expressed as minimal milliseconds/request), but that can be a bit inefficient (it always delays functions before sending the response).
Do you know any library or simple solution that could provide me with such functionality?
Save the last request's timestamp.
Whenever you have a new incoming request, check whether the minimum interval has elapsed since then; if not, put the function in a queue and then schedule a job (unless one was already scheduled):
// lastTime is the previous request's timestamp in milliseconds,
// minInterval is the minimum spacing between requests in milliseconds.
setTimeout(
  processItemFromQueue,
  lastTime + minInterval - Date.now()
)
processItemFromQueue takes a job from the front of the queue (shift) then reschedules itself unless the queue is empty.
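A minimal self-contained sketch of that idea (the helper names and the example limit of 5 requests per second are mine, not from the answer):
// Simple rate limiter: runs queued functions no closer together than minInterval ms.
function createRateLimiter(requestsPerSecond) {
  const minInterval = 1000 / requestsPerSecond;
  const queue = [];
  let lastTime = 0;      // timestamp (ms) of the last executed request
  let scheduled = false; // whether a drain is already scheduled

  function processItemFromQueue() {
    scheduled = false;
    const job = queue.shift();
    if (!job) return;
    lastTime = Date.now();
    job();
    schedule(); // reschedule for the next queued item, if any
  }

  function schedule() {
    if (scheduled || queue.length === 0) return;
    scheduled = true;
    const delay = Math.max(0, lastTime + minInterval - Date.now());
    setTimeout(processItemFromQueue, delay);
  }

  return function rateLimit(fn) {
    queue.push(fn);
    schedule();
  };
}

// Usage: at most 5 calls per second reach the API.
const limit = createRateLimiter(5);
for (let i = 0; i < 20; i++) {
  limit(() => console.log('API call', i, 'at', new Date().toISOString()));
}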
The definitive answer to this problem (and the best one) came from the API documentation itself. We have been using it for a couple of months, and it has solved my problem perfectly.
In such cases, instead of writing complicated queue code, the best way is to leverage JavaScript's ability to handle asynchronous code and either write a simple backoff yourself or use one of the many great libraries for it.
So, if you stumble upon any API limits (e.g. quotas, 5xx responses, etc.), you should use backoff to retry the query recursively with an increasing delay (more about backoff can be found here: https://en.wikipedia.org/wiki/Exponential_backoff). And if, after a given number of attempts, you still fail, gracefully return an error about the API being unavailable.
Example use below (taken from https://www.npmjs.com/package/backoff):
var backoff = require('backoff');

// 'get' is whatever function performs the HTTP request,
// e.g. request.get from the 'request' module.
var call = backoff.call(get, 'https://someaddress', function (err, res) {
  console.log('Num retries: ' + call.getNumRetries());
  if (err) {
    // Put your error handling code here.
    // Called ONLY IF backoff fails to help
    console.log('Error: ' + err.message);
  } else {
    // Put your success code here
    console.log('Status: ' + res.statusCode);
  }
});

/*
 * When to retry. Here - 503 error code returned from the API
 */
call.retryIf(function (err) { return err.status == 503; });

/*
 * This lib offers two strategies - Exponential and Fibonacci.
 * I'd suggest using the first one in most of the cases.
 */
call.setStrategy(new backoff.ExponentialStrategy());

/*
 * Info how many times backoff should try to post the request
 * before failing permanently.
 */
call.failAfter(10);

// Triggers backoff to execute the given function.
call.start();
There are many backoff libraries for Node.js, leveraging either promise-style, callback-style or even event-style backoff handling (the example above being the second of these). They're really easy to use if you understand the backoff algorithm itself. And since the backoff parameters can be stored in config, they can be adjusted to achieve better results if backoff is failing too often.
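Continuing the example above, here is a small sketch of driving those parameters from a config object (the config shape is my own assumption; initialDelay and maxDelay are options documented for the backoff package's ExponentialStrategy):
// Hypothetical config, e.g. loaded from a JSON file or environment variables.
var retryConfig = {
  initialDelay: 100, // ms before the first retry
  maxDelay: 10000,   // upper bound on the delay between retries
  maxRetries: 10
};

call.setStrategy(new backoff.ExponentialStrategy({
  initialDelay: retryConfig.initialDelay,
  maxDelay: retryConfig.maxDelay
}));
call.failAfter(retryConfig.maxRetries);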

Node js for loop timer

I'm trying to test which database, used from Node.js, has the fastest time for certain tasks.
I'm working with a few DBs here, and every time I try to time a for loop, the timer ends quickly, but the task isn't actually finished. Here is the code:
console.time("Postgre_write");

for (var i = 0; i < limit; i++) {
  var worker = {
    id: "worker" + i.toString(),
    ps: Math.random(),
    com_id: Math.random(),
    status: (i % 3).toString()
  };

  client.query(
    "INSERT INTO test(worker, password, com_id, status) values($1, $2, $3, $4)",
    [worker.id, worker.ps, worker.com_id, worker.status]
  );
}

console.timeEnd("Postgre_write");
When I tried a few hundred rows I thought the results were accurate. But when I test higher numbers like 100,000, the console outputs about 500 ms, yet when I check the pgAdmin app, the inserts are still processing.
Obviously I got something wrong here, but I don't know what. I rely heavily on the timer data and I need to find a way to time these operations correctly. Help?
Node.js is asynchronous. This means that your code will not necessarily execute in the order that you have written it.
Take a look at this article http://www.i-programmer.info/programming/theory/6040-what-is-asynchronous-programming.html which offers a pretty decent explanation of asynchronous programming.
In your code, your query is sent to the database and begins to execute. Since Node is asynchronous, it does not wait for that line to finish executing, but instead goes to the next line (which stops your timer).
When your query is finished running, a callback will occur and notify Node that the query has finished. If you specify a callback function it will be run at that point. Think of event-driven programming, with the query's completion being an event.
To solve your problem, stop the timer in the callback function of the query.
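A rough sketch of that fix (my addition, reusing client and limit from the question's snippet, and assuming client.query accepts a Node-style callback as its last argument, as node-postgres does): count the completed inserts and stop the timer only when the last one returns.
console.time("Postgre_write");

var pending = limit;

for (var i = 0; i < limit; i++) {
  var worker = {
    id: "worker" + i.toString(),
    ps: Math.random(),
    com_id: Math.random(),
    status: (i % 3).toString()
  };

  client.query(
    "INSERT INTO test(worker, password, com_id, status) values($1, $2, $3, $4)",
    [worker.id, worker.ps, worker.com_id, worker.status],
    function (err) {
      if (err) console.error(err);
      // Stop the timer only when the last insert has actually completed.
      if (--pending === 0) {
        console.timeEnd("Postgre_write");
      }
    }
  );
}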

Database Backed Work Queue

My situation ...
I have a set of workers that are scheduled to run periodically, each at different intervals, and would like to find a good implementation to manage their execution.
Example: Let's say I have a worker that goes to the store and buys me milk once a week. I would like to store this job and its configuration in a MySQL table. But it seems like a really bad idea to poll the table (every second?) and see which jobs are ready to be put into the execution pipeline.
All of my workers are written in javascript, so I'm using node.js for execution and beanstalkd as a pipeline.
If new jobs (i.e. scheduling a worker to run at a given time) are being created asynchronously, and I need to store the job result and configuration persistently, how do I avoid polling a table?
Thanks!
I agree that it seems inelegant, but given the way that computers work something *somewhere* is going to have to do polling of some kind in order to figure out which jobs to execute when. So, let's go over some of your options:
1. Poll the database table. This isn't a bad idea at all - it's probably the simplest option if you're storing the jobs in MySQL anyway. A rate of one query per second is nothing - give it a try and you'll notice that your system doesn't even feel it.
Some ideas to help you scale this to possibly hundreds of queries per second, or just to keep system resource requirements down:
- Create a second table, 'job_pending', where you put the jobs that need to be executed within the next X seconds/minutes/hours.
- Run queries on your big table of all jobs only once in a longer while, then populate the small table which you query every shorter while.
- Remove jobs that were executed from the small table in order to keep it small.
- Use an index on your 'execute_time' (or whatever you call it) column.
2. If you have to scale even further, keep the main jobs table in the database, and use the second, smaller table I suggest, but put that table in RAM: either as a memory table in the DB engine, or in a queue of some kind in your program. Query the queue at extremely short intervals if you have to - it'll take some extreme use cases to cause any performance issues here.
The main issue with this option is that you'll have to keep track of jobs that were in memory but didn't execute, e.g. due to a system crash - more coding for you...
3. Create a thread for each of a bunch of jobs (say, all jobs that need to execute in the next minute), and call thread.sleep(millis_until_execution_time) (or whatever, I'm not that familiar with node.js).
This option has the same problem as no. 2 - you have to keep track of job execution for crash recovery. It's also the most wasteful imo - every sleeping job thread still takes system resources.
There may be additional options of course - I hope that others answer with more ideas.
Just realize that polling the DB every second isn't a bad idea at all. It's the most straightforward way imo (remember KISS), and at this rate you shouldn't have performance issues so avoid premature optimizations.
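To make the first option concrete (this sketch is mine, not the answerer's; it assumes the 'mysql' client package, a job_pending table with id and execute_time columns, and a hypothetical enqueueForExecution() hand-off to the pipeline):
var mysql = require('mysql'); // assuming the 'mysql' client package

var db = mysql.createConnection({
  host: 'localhost',
  user: 'worker',
  password: 'secret',
  database: 'jobs'
});

// Hypothetical hand-off to the execution pipeline (e.g. a beanstalkd put).
function enqueueForExecution(job) {
  console.log('dispatching job', job.id);
}

// Poll the small pending-jobs table once per second and dispatch due jobs.
setInterval(function () {
  db.query(
    "SELECT * FROM job_pending WHERE execute_time <= NOW()",
    function (err, jobs) {
      if (err) return console.error(err);
      jobs.forEach(function (job) {
        enqueueForExecution(job);
        // Remove dispatched jobs to keep the polling table small.
        db.query("DELETE FROM job_pending WHERE id = ?", [job.id]);
      });
    }
  );
}, 1000);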
Why not have a Job object in node.js that's saved to the database?
var Job = {
  id: long,
  task: String,
  configuration: JSON,
  dueDate: Date,
  finished: bit
};
I would suggest you only store the id in RAM and leave all the other Job data in the database. When your timeout function finally runs it only needs to know the .id to get the other data.
var job = createJob(...); // create from async data somewhere.
job.save();               // save the job.

var id = job.id;          // only store the id in RAM

// Ask for the job to be run in the future.
// (Note: setTimeout takes the callback first and the delay in ms second.)
setTimeout(function () {
  // load the job when you want to run it
  db.load(id, function (job) {
    // run it.
    run(job);
    // mark as finished
    job.finished = true;
    // save your finished = true state
    job.save();
  });
}, job.dueDate - Date.now());

// remove job from RAM now.
job = null;
If the server ever crashes, all you have to do is query for all jobs that have finished = false, load them into RAM, and start the setTimeouts again.
If anything goes wrong, you should be able to restart cleanly like so:
db.find("job", { finished: false }, function (jobs) {
  // each() here stands for any iteration helper, e.g. jobs.forEach.
  each(jobs, function (job) {
    var id = job.id;
    setTimeout(function () {
      // load the job when you want to run it
      db.load(id, function (job) {
        // run it.
        run(job);
        // mark as finished
        job.finished = true;
        // save your finished = true state
        job.save();
      });
    }, job.dueDate - Date.now());
    job = null;
  });
});
