How to save 1 million records to MongoDB asynchronously? - javascript

I want to save 1 million records to MongoDB using JavaScript, like this:
for (var i = 0; i < 1000000; i++) {
    model = buildModel(i);
    db.save(model, function(err, done) {
        console.log('cool');
    });
}
I tried it; it saved ~160 records, then hung for 2 minutes, then exited. Why?

It blew up because you are not waiting for each asynchronous call to complete before moving on to the next iteration, which means you are building a "stack" of unresolved operations until this causes a problem. What is the name of this site again? Get the picture?
So this is not the best way to proceed with "bulk" insertions. Fortunately the underlying MongoDB driver has already thought about this: there is in fact a "Bulk API" available to make this a whole lot better. This works even if you have pulled in the native driver as your db object, but I prefer using the .collection accessor from the model, together with the "async" module, to keep everything clear:
var async = require('async');

var bulk = Model.collection.initializeOrderedBulkOp();
var counter = 0;

async.whilst(
    // Iterator condition: keep going until 1,000,000 documents have been queued
    function() { return counter < 1000000; },
    // Do this in the iterator
    function(callback) {
        counter++;
        var model = buildModel(counter);
        bulk.insert(model);
        if ( counter % 1000 == 0 ) {
            // Flush the current batch of 1000, then start a fresh one
            bulk.execute(function(err, result) {
                bulk = Model.collection.initializeOrderedBulkOp();
                callback(err);
            });
        } else {
            callback();
        }
    },
    // When all is done
    function(err) {
        if ( counter % 1000 != 0 ) {
            // Flush whatever is left over
            bulk.execute(function(err, result) {
                console.log( "inserted some more" );
                console.log( "I'm finished now" );
            });
        } else {
            console.log( "I'm finished now" );
        }
    }
);
The difference here is not only waiting for the "asynchronous" callback on completion rather than just building up a stack, but also employing the "Bulk Operations API" to cut down on the asynchronous write calls by submitting everything in batches of 1000 entries.
This not only avoids "building up a stack" of function executions like your own example code, but also performs efficient "wire" transactions by not sending everything in individual statements, instead breaking it up into manageable "batches" for the server to commit.

You should probably use something like Async's eachLimit:
var async = require('async');

// Create an array of numbers 0-999999
var models = new Array(1000000);
for (var i = models.length - 1; i >= 0; i--)
    models[i] = i;

// Iterate over the array, performing a MongoDB save operation for each item
// while never running more than 20 saves in parallel at the same time
async.eachLimit(models, 20, function iterator(model, next) {
    // Build a model and save it to the DB, call next when finished
    db.save(buildModel(model), next);
}, function done(err) {
    if (err) { // An error occurred while trying to save a model to the DB
        console.error(err);
    } else { // All 1,000,000 models have been saved to the DB
        console.log('Successfully saved ' + models.length + ' models to MongoDB.');
    }
});

Related

How to manage nearly 5000 records in a for loop in a Node.js server?

I created a server using Node.js. There's a MySQL database with nearly 5000 users in it. I have to read the MySQL database, update it, and write a log to a MongoDB database. I implemented code for this.
https://gist.github.com/chanakaDe/aa9d6a511070c3c78ba3ebc018306ad8
Here's the problem: in this code, at line 50, I added this value.
userArray[i].ID
This is a user ID from the for loop, and I need to update the MySQL table with that ID. All that code is inside the for loop block, but I am getting this error.
TypeError: Cannot read property 'ID' of undefined
So I assigned those values to variables at the top; see lines 38 and 39.
var selectedUserID = userArray[i].ID;
var selectedUserTelephone = userArray[i].telephone;
When I use it like this, there's no error, but the user ID is not updating: the most recent 2 values have the same user ID.
What is the solution for this?
This is a general JavaScript issue related to the concepts of scope and hoisting of variables during asynchronous operations.
var a = 0;

function doThingWithA () {
    console.log(a);
}

for (var i = 0; i < 1000; i++) {
    a++;
    setTimeout(function () {
        doThingWithA();
    }, 10);
}
In this example "a" will always log with a value of 1000. The reason is that the setTimeout (which mimics the slow db operation) takes time, and during that time (before the log happens) "a" is incremented to 1000, since the for loop does not wait for setTimeout to complete.
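For illustration only, a minimal sketch of the classic closure fix for this, capturing the current value at each iteration with an immediately-invoked function:
for (var i = 0; i < 1000; i++) {
    a++;
    (function (capturedA) {            // capture the value "a" has right now
        setTimeout(function () {
            console.log(capturedA);    // logs the per-iteration value, not 1000
        }, 10);
    })(a);
}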
The best solution is to use the "async" module.
pool.getConnection(function (err, connection) {
    connection.query(query, function (err, users) {
        async.eachSeries(users, function (user, next) {
            async.parallel([
                function updateUserStatus (cb) { /* your current code */ },
                function updateUserAccount (cb) { /* current code for this */ }
            ], next);
        }, function (err) { console.log('finished for all users!'); });
    });
});
You could also use promises. This is a typical async issue in Node.js: from reading your code it appears you expect each operation to run in series, whereas in Node each input/output operation (e.g. a db call) is triggered and your code continues to run, as shown in my for loop example above.
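As a rough sketch only, the same one-user-at-a-time pattern could be written with plain promises, assuming the users array from the query above and hypothetical updateUserStatus()/updateUserAccount() helpers that return promises:
users.reduce(function (chain, user) {
    // Wait for the previous user to finish before starting the next one
    return chain.then(function () {
        return Promise.all([
            updateUserStatus(user),   // assumed to return a promise
            updateUserAccount(user)   // assumed to return a promise
        ]);
    });
}, Promise.resolve()).then(function () {
    console.log('finished for all users!');
});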
There is a cool library called es6-promise-pool:
https://www.npmjs.com/package/es6-promise-pool
It has a concurrency option. For example:
var PromisePool = require('es6-promise-pool');

var count = 0;

// Producer: returns the next promise to run, or null when there is no more work.
// delayValue() here stands for any function that returns a promise (e.g. a db save).
var promiseProducer = function () {
    if (count < 5) {
        count++;
        return delayValue(count, 1000);
    } else {
        return null;
    }
};

// Run at most 3 of the produced promises concurrently
var pool = new PromisePool(promiseProducer, 3);

pool.start()
    .then(function () {
        console.log('Complete');
    });

Correct way to insert many records into MongoDB with Node.js

I was wondering what the correct way is to do bulk inserts into MongoDB (although it could be any other database) with Node.js.
I have written the following code as an example, although I believe it is flawed, as db.close() may be run before all the asynchronous collection.insert calls have completed.
MongoClient.connect('mongodb://127.0.0.1:27017/test', function (err, db) {
    var i, collection;
    if (err) {
        throw err;
    }
    collection = db.collection('entries');
    for (i = 0; i < entries.length; i++) {
        collection.insert(entries[i].entry);
    }
    db.close();
});
If your MongoDB server is 2.6 or newer, it would be better to take advantage of the write commands Bulk API, which allows the execution of bulk insert operations. These are simply abstractions on top of the server that make it easy to build bulk operations and thus get performance gains when working with large collections.
Sending the bulk insert operations in batches results in less traffic to the server and thus performs efficient wire transactions by not sending everything in individual statements, but rather breaking it up into manageable chunks for the server to commit. There is also less time spent waiting for responses in callbacks with this approach.
These bulk operations come mainly in two flavours:
Ordered bulk operations. These execute all the operations in order and error out on the first write error.
Unordered bulk operations. These execute all the operations in parallel and aggregate all the errors. Unordered bulk operations do not guarantee order of execution.
Note that for servers older than 2.6 the API will down-convert the operations. However, it's not possible to down-convert 100%, so there may be some edge cases where it cannot correctly report the right numbers.
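As a quick illustration of the two flavours (a sketch only, separate from the batching code below):
// Ordered: operations run in insertion order and stop at the first error
var orderedBulk = db.collection('entries').initializeOrderedBulkOp();

// Unordered: the server may reorder operations and reports all errors at the end
var unorderedBulk = db.collection('entries').initializeUnorderedBulkOp();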
In your case, you could implement the Bulk API insert operation in batches of 1000 like this:
For MongoDB 3.2+ using bulkWrite
var MongoClient = require('mongodb').MongoClient;
var url = 'mongodb://localhost:27017/test';
var entries = [ ... ] // a huge array containing the entry objects

var createNewEntries = function(db, entries, callback) {
    // Get the collection and build up the list of bulk operations
    var collection = db.collection('entries'),
        bulkUpdateOps = [],
        bulkWritePromises = [];

    entries.forEach(function(doc) {
        bulkUpdateOps.push({ "insertOne": { "document": doc } });
        if (bulkUpdateOps.length === 1000) {
            // Send the current batch of 1000 and start a new one
            bulkWritePromises.push(collection.bulkWrite(bulkUpdateOps));
            bulkUpdateOps = [];
        }
    });
    if (bulkUpdateOps.length > 0) {
        // Send whatever is left over
        bulkWritePromises.push(collection.bulkWrite(bulkUpdateOps));
    }

    // Only call back once every batch has been acknowledged
    Promise.all(bulkWritePromises)
        .then(function(results) { callback(null, results); })
        .catch(callback);
};
For MongoDB <3.2
var MongoClient = require('mongodb').MongoClient;
var url = 'mongodb://localhost:27017/test';
var entries = [ ... ] // a huge array containing the entry objects

var createNewEntries = function(db, entries, callback) {
    // Get the collection and bulk api artefacts
    var collection = db.collection('entries'),
        bulk = collection.initializeOrderedBulkOp(), // Initialize the Ordered Batch
        counter = 0,
        pending = 0;

    // Call back only when every submitted batch has returned
    var batchDone = function(err) {
        pending--;
        if (pending === 0) callback(err);
    };

    // Triggered for each entry in the array
    entries.forEach(function(obj) {
        bulk.insert(obj);
        counter++;
        if (counter % 1000 == 0) {
            pending++;
            // Execute the current batch of 1000 ...
            bulk.execute(batchDone);
            // ... and immediately re-initialise the batch for the next entries
            bulk = collection.initializeOrderedBulkOp();
        }
    });

    if (counter % 1000 != 0) {
        // Execute whatever is left over
        pending++;
        bulk.execute(batchDone);
    } else if (pending === 0) {
        // Nothing was queued at all
        callback();
    }
};
Call the createNewEntries() function.
MongoClient.connect(url, function(err, db) {
    createNewEntries(db, entries, function() {
        db.close();
    });
});
You can use insertMany. It accepts an array of objects. Check the API.
New in version 3.2.
The db.collection.bulkWrite() method provides the ability to perform bulk insert, update, and remove operations. MongoDB also supports bulk insert through db.collection.insertMany().
bulkWrite supports only insertOne, updateOne, updateMany, replaceOne, deleteOne, and deleteMany operations.
In your case, to insert the data with a single line of code, you can use insertMany.
MongoClient.connect('mongodb://127.0.0.1:27017/test', function (err, db) {
    var collection;
    if (err) {
        throw err;
    }
    collection = db.collection('entries');
    // Close the connection only after the inserts have been acknowledged
    collection.insertMany(entries, function (err, result) {
        if (err) {
            throw err;
        }
        db.close();
    });
});
var MongoClient = require('mongodb').MongoClient;
var url = 'mongodb://localhost:27017/test';

var data1 = {
    name: 'Data1',
    work: 'student',
    No: 4355453,
    Date_of_birth: new Date(1996, 10, 17)
};
var data2 = {
    name: 'Data2',
    work: 'student',
    No: 4355453,
    Date_of_birth: new Date(1996, 10, 17)
};

MongoClient.connect(url, function(err, db) {
    if (err != null) {
        return console.log(err.message);
    }
    // insertOne
    db.collection("App").insertOne(data1, function (err, data) {
        if (err != null) {
            return console.log(err);
        }
        console.log(data.ops[0]);
    });
    // insertMany (options are passed as an object)
    var Data = [data1, data2];
    db.collection("App").insertMany(Data, { forceServerObjectId: true }, function (err, data) {
        if (err != null) {
            return console.log(err);
        }
        console.log(data.ops);
        // Close only once the inserts have returned
        db.close();
    });
});

How can I handle large objects and avoid an out of memory error?

I'm trying to handle a large array, but the object is sometimes so large that it breaks the server with:
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
I can reproduce this locally by running node --max-old-space-size=100 and then starting my process. Without setting this option, it works fine.
Here's the section of the code where it's breaking
function setManagers(json) {
    return Q.all(json.map(function(obj) {
        return getUserByUsername(obj.username)
            .then(function(user) {
                console.log('got here'); // never hits
                ...

function getUserByUsername(username) {
    console.log('getting user'); // hits
    return UserModel.findOneQ({ username: username })
        .then(function(user) {
            console.log('search finished'); // never hits
            if (user) {
                console.log('got user');
                return Q(user);
            } else {
                console.log('no user');
                return Q.reject({ message: 'user ' + username + ' could not be found locally' });
            }
        });
}
I think this problem is due to it being a huge array, with a length of over 5000.
If I were to remove the option to limit my memory, or use a smaller data set, it'd work.
Is there any easy way to avoid this error?
The problem is that you're doing too many database calls simultaneously.
I would change your logic to send the database all the usernames at once (for example, joined into a single query) and get all the users needed in one round trip. Then, I would loop through those users and do whatever you're currently doing with the users you receive from the database.
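A minimal sketch of that one-round-trip idea, assuming the Mongoose-style UserModel from the question and using an $in filter (my reading of the suggestion, not necessarily the exact query the answer had in mind):
// Collect every username, then fetch all matching users with one query
var usernames = json.map(function (obj) { return obj.username; });

UserModel.find({ username: { $in: usernames } }, function (err, users) {
    if (err) { return console.error(err); }
    // users arrives in a single round trip; process each one here
    users.forEach(function (user) {
        // ... whatever setManagers() was doing per user ...
    });
});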
Another solution, which is less intrusive to your current logic, is to perform the database operations one at a time, using, for example, a promise-based while loop:
function pwhile(condition, body) {
    var done = Q.defer();
    function loop() {
        if (!condition())
            return done.resolve();
        Q.when(body(), loop, done.reject);
    }
    Q.nextTick(loop);
    return done.promise;
}

function setManagers(json) {
    var i = 0;
    return pwhile(function() { return i < json.length; }, function() {
        var p = getUserByUsername(json[i].username);
        i++;
        return p;
    });
}

How to duplicate a MongoDB collection with the Node.js driver?

Is there any way to duplicate a collection through the Node.js MongoDB driver?
i.e. collection.copyTo("duplicate_collection");
You can eval copyTo() server-side, though it will block the entire mongod process and won't create indexes on the new collection.
var copyTo = "function() { db['source'].copyTo('target') };";

db.eval(copyTo, [], function(err, result) {
    console.log(err);
});
Also note the field type warning.
"When using db.collection.copyTo() check field types to ensure that the operation does not remove type information from documents during the translation from BSON to JSON. Consider using cloneCollection() to maintain type fidelity."
Try to avoid .eval() if this is something you want to do regularly on a production system. It's fast, but there are problems.
A better approach would be to use the "Bulk" operations API, with a little help from the "async" library:
db.collection("target",function(err,target) {
var batch = target.initializeOrderedBulkOp();
counter = 0;
var cursor = db.collection("source").find();
var current = null;
async.whilst(
function() {
cursor.nextObject(function(err,doc) {
if (err) throw err;
// .nextObject() returns null when the cursor is depleted
if ( doc != null ) {
current = doc;
return true;
} else {
return false;
}
})
},
function(callback) {
batch.insert(current);
counter++;
if ( counter % 1000 == 0 ) {
batch.execute(function(err,result) {
if (err) throw err;
var batch = target.initializeOrderedBulkOp();
callback();
});
}
},
function(err) {
if (err) throw err;
if ( counter % 1000 != 0 )
batch.execute(function(err,result) {
if (err) throw err;
// job done
});
}
);
});
It's fast, though not as fast as .eval(), but it does not block either the application or the server.
Batch operations will generally take as many operations as you throw at them, but using a modulo as a limiter allows a little more control and essentially avoids loading an unreasonable number of documents into memory at a time. Keep in mind that, whatever the case, the batch size that is sent cannot exceed 16MB between executions.
Another option to duplicate a collection would be to use the aggregate method on a collection with the $out stage. Here is an example inside of an async function:
const client = await MongoClient.connect("mongodb://alt_dev:aaaaa:27018/meteor");
const db = client.db('meteor');
const planPrice = db.collection('plan_price');
// planPriceUpdateCollection holds the name of the target collection
const planPriceCopy = planPrice.aggregate([{$match: {}}, {$out: planPriceUpdateCollection}]);
await planPriceCopy.toArray();
This will create a copy of the original collection with all of its content.

node.js nested callback, get final results array

I am doing a for loop to find results from MongoDB and concatenate them onto an array, but I am not getting the final results array when the loop is finished. I am new to Node.js, and I think it doesn't work like an Objective-C callback.
app.get('/users/self/feed', function(req, res) {
    var query = Bill.find({ user: req.param('userId') });
    query.sort('-createdAt');
    query.exec(function(err, bill) {
        if (bill) {
            var arr = bill;
            Following.findOne({ user: req.param('userId') }, function(err, follow) {
                if (follow) {
                    var follows = follow.following; // this is an array of user ids
                    for (var i = 0; i < follows.length; i++) {
                        var followId = follows[i];
                        Bill.find({ user: followId }, function(err, result) {
                            arr = arr.concat(result);
                            // res.send(200, arr); // this is working.
                        });
                    }
                } else {
                    res.send(400, err);
                }
            });
            res.send(200, arr); // if put here, I am not getting the final results
        } else {
            res.send(400, err);
        }
    });
});
While I'm not entirely familiar with MongoDB, a quick reading of their documentation shows that they provide an asynchronous Node.js interface.
That said, both the findOne and find operations start, but don't necessarily complete, by the time you reach res.send(200, arr), meaning arr will still be empty.
Instead, you should send your response back once all the asynchronous calls complete, meaning you could do something like this:
var billsToFind = follows.length;
for (var i = 0; i < follows.length; i++) {
    var followId = follows[i];
    Bill.find({ user: followId }, function(err, result) {
        arr = arr.concat(result);
        billsToFind -= 1;
        if (billsToFind === 0) {
            res.send(200, arr);
        }
    });
}
The approach uses a counter for all of the inner async calls (I'm ignoring the findOne because we're already inside its callback). As each Bill.find call completes, it decrements the counter; once it reaches 0, all callbacks have fired (this works since Bill.find is called for every item in the follows array) and the response is sent back with the full array.
That's true. The code inside your for loop is effectively executed in parallel (and with the same value of i, I think). If you add a console.log inside and after your for loop, you will find the one outside is printed before the one inside.
You can wrap the code inside your for loop into an array of functions and execute them using the async module (https://www.npmjs.org/package/async) in parallel or in series, then retrieve the final result from async.parallel's or async.series's last parameter, as sketched below.
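A minimal sketch of that approach, assuming the same Bill model, follows array, arr, and res from the question:
var async = require('async');

// Wrap each lookup in a function that async can call with its own callback
var tasks = follows.map(function (followId) {
    return function (done) {
        Bill.find({ user: followId }, done);
    };
});

async.parallel(tasks, function (err, results) {
    if (err) return res.send(400, err);
    // results is an array of result arrays, one per followed user
    results.forEach(function (bills) {
        arr = arr.concat(bills);
    });
    res.send(200, arr);
});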
