MongoDB reorders objects on insert - javascript

In my Node.js app I'm using MongoDB, and I have an issue when inserting something into the database. When I add a new record to a collection, the objects get reordered. Does anyone know why this is happening? I'm doing the insert like this:
collection.insert(object, {safe: true}, function (err, result) {
    if (err) {
        res.send({'error': 'An error has occurred'});
    } else {
        // no error - the insert succeeded
    }
});
In fact, any operation on the collection changes the order of the objects. Does anyone know why this is happening?

MongoDB documents have padding space that is used for updates. If you make small changes to a document like adding/updating a small field there is a good chance that the size of the updated document will increase but will still fit into allocated space because it can use that padding. If the updated document does not fit in that space Mongo will move it to a new place on disk. Thus your documents might move a lot in the beginning until Mongo learns how much padding you would usually need to prevent such moves. You can also set higher padding to avoid documents being moved in the first place.
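For instance, on the old MMAPv1 storage engine you could influence padding with commands like the following (a hedged sketch; "logs" is an illustrative collection name, and these knobs only apply to MMAPv1-era MongoDB):
// Allocate record space in powers of two so small growth stays in place
db.runCommand({ collMod: "logs", usePowerOf2Sizes: true });
// Or rewrite the collection with explicit extra padding
db.runCommand({ compact: "logs", paddingFactor: 1.1 });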
In either case you can't really rely on the insertion order to get a sorted list of documents. If you want guaranteed order you need to sort. In your case you can sort by _id, because it embeds a timestamp and is roughly monotonically increasing:
// In insertion order, unless an `_id` value was generated externally
// (in your code, not auto-assigned by Mongo) and a doc with such an ID
// was inserted much later.
// Sharding might also introduce tiny ordering mismatches.
db.collection.find().sort( { '_id': 1 } );
// most recently inserted items first
db.collection.find().sort( { '_id': -1 } );
If you use a capped collection the order of inserts is always preserved, i.e. Mongo will never move documents. In that case you can use natural order sorting:
db.collection.find().sort( { $natural: 1 } )
which is equivalent to sorting by _id as shown above.
Do not use natural order sorting with non-capped (regular) collections, because it is not reliable in the presence of updates.
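If insertion order really matters, a minimal mongo shell sketch of a capped collection (the size and max values are illustrative):
// Capped collections never move or grow documents, so $natural order is insertion order
db.createCollection("events", { capped: true, size: 1024 * 1024, max: 5000 });
db.events.find().sort({ $natural: 1 });  // oldest first
db.events.find().sort({ $natural: -1 }); // newest first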

Why can't I get multiple data from MongoDB? Where I can insert on the same collection [duplicate]

We are troubled by occasionally occurring "cursor not found" exceptions for some Morphia queries' asList, and I've found a hint on SO that this might be quite memory-consumptive.
Now I'd like to know a bit more about the background: can somebody explain (in English) what a cursor (in MongoDB) actually is? Why can it be kept open or not be found?
The documentation defines a cursor as:
A pointer to the result set of a query. Clients can iterate through a cursor to retrieve results. By default, cursors timeout after 10 minutes of inactivity
But this is not very telling. Maybe it could be helpful to define a batch for query results, because the documentation also states:
The MongoDB server returns the query results in batches. Batch size will not exceed the maximum BSON document size. For most queries, the first batch returns 101 documents or just enough documents to exceed 1 megabyte. Subsequent batch size is 4 megabytes. [...] For queries that include a sort operation without an index, the server must load all the documents in memory to perform the sort before returning any results.
Note: in the queries in question we don't use sort statements at all, nor limit and offset.
Here's a comparison between toArray() and cursors after a find() in the Node.js MongoDB driver. Common code:
var MongoClient = require('mongodb').MongoClient,
    assert = require('assert');

MongoClient.connect('mongodb://localhost:27017/crunchbase', function (err, db) {
    assert.equal(err, null);
    console.log('Successfully connected to MongoDB.');

    const query = { category_code: "biotech" };

    // toArray() vs. cursor code goes here
});
Here's the toArray() code that goes in the section above.
db.collection('companies').find(query).toArray(function (err, docs) {
    assert.equal(err, null);
    assert.notEqual(docs.length, 0);

    docs.forEach(doc => {
        console.log(`${doc.name} is a ${doc.category_code} company.`);
    });

    db.close();
});
Per the documentation, "The caller is responsible for making sure that there is enough memory to store the results."
Here's the cursor-based approach, using the cursor.forEach() method:
const cursor = db.collection('companies').find(query);

cursor.forEach(
    function (doc) {
        console.log(`${doc.name} is a ${doc.category_code} company.`);
    },
    function (err) {
        assert.equal(err, null);
        return db.close();
    }
);
With the forEach() approach, instead of fetching all the data into memory, we're streaming it to our application. find() returns a cursor immediately; it doesn't actually make a request to the database until we try to use some of the documents it will provide. The point of the cursor is to describe our query. The second parameter to cursor.forEach shows what to do when an error occurs.
In the toArray() version of the code, it was the toArray() call that forced the database call: it meant we needed ALL the documents and wanted them in an array.
Note that MongoDB returns data in batches: as the application iterates the cursor, the driver issues successive requests to MongoDB for the next batch.
forEach scales better than toArray because we can process documents as they come in, all the way to the end. Contrast that with toArray, where we wait for ALL the documents to be retrieved and the entire array to be built; we then get no advantage from the fact that the driver and the database system work together to batch results to the application. Batching is meant to provide efficiency in terms of memory overhead and execution time. Take advantage of it in your application if you can, for instance by setting the batch size explicitly as in the sketch below.
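A hedged sketch reusing the companies query from above: batchSize() only hints how many documents the driver fetches per round trip, it does not change the results.
// Ask the driver to fetch documents 100 at a time while forEach() consumes them
const cursor = db.collection('companies').find(query).batchSize(100);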
I am by no means a MongoDB expert, but I just want to add some observations from working on a medium-sized Mongo system for the last year. Also thanks to #xameeramir for the excellent walkthrough of how cursors work in general.
The causes of a "cursor lost" exception may be several. One that I have noticed is explained in this answer.
The cursor lives server-side. It is not distributed over a replica set but exists on the instance that is primary at the time of creation. This means that if another instance takes over as primary, the cursor will be lost to the client. If the old primary is still up and around it may still be there, but of no use. I guess it is garbage-collected after a while. So if your Mongo replica set is unstable, or you have a shaky network in front of it, you are out of luck when doing any long-running queries.
If the full content of what the cursor wants to return does not fit in memory on the server the query may be very slow. RAM on your servers needs to be larger than the largest query you run.
All this can partly be avoided by better design. For a use case with large, long-running queries you may be better off with several smaller database collections instead of a big one.
The collection's find method returns a cursor - this points to the set of documents (called the result set) that match the query filter. The result set is the actual documents returned by the query, but it lives on the database server.
To the client program, for example the mongo shell, you get a cursor. You can think of the cursor as an API or a program to work with the result set. The cursor has many methods which can be run to perform actions on the result set. Some of the methods affect the result set data and some provide the status or info about the result set.
As the cursor maintains information about the result set, some information can change as you use the result set data by applying other cursor methods. You use these methods and information to suit your application, i.e., how and what you want to do with the queried data.
Working on the result set using the cursor and some of its commonly used methods and features from mongo shell:
The count() method returns the number of documents in the result set, as matched by the query. It is constant at any point in the life of the cursor, and this information remains the same even after the cursor is closed or exhausted.
As you read documents from the result set, the result set gets exhausted. Once completely exhausted you cannot read any more. The hasNext() tells if there are any documents available to be read - returns a boolean true or false. The next() returns a document if available (you first check with hasNext, and then do a next). These two methods are commonly used to iterate over the result set data. Another iteration method is the forEach().
The data is retrieved from the server in batches, which have a default size. With the first batch you read the documents, and when all its documents are read, the following next() call retrieves the next batch, and so on, until all documents are read from the result set. This batch size can be configured, and you can also get its status.
If you apply the toArray() method on the cursor, then all the remaining documents in the result set are loaded into the memory of your client computer and are available as a JavaScript array, and the result set is exhausted. A subsequent hasNext() will return false, and next() will throw an error (once you exhaust the cursor, you cannot read data from it). Loading the entire result set into your client's memory can be memory-consuming for large result sets.
The itcount() returns the count of remaining documents in the result set and exhausts the cursor.
There are cursor methods like isClosed(), isExhausted(), size() which give status information about the cursor and its underlying result set as you work with your data.
Those are the basic features of cursor and result set. There are many cursor methods, and you can try and see how they work and get a better understanding.
Reference: the mongo shell's cursor methods, and cursor behavior with the aggregate method (the collection's aggregate method also returns a cursor).
Example usage in mongo shell:
Assume the test collection has 200 documents (run the commands in the same sequence).
var cur = db.test.find( { } ).limit(25) creates a result set with 25 documents only.
But, cur.count() will show 200, which is the actual count of documents matched by the query's filter.
cur.hasNext() will return true.
cur.next() will return a document.
cur.itcount() will return 24 (and exhausts the cursor).
cur.itcount() again will return 0.
cur.count() will still show 200.
This error also comes up when you have a large data set and are doing batch processing on it, where each batch takes so long that the total time exceeds the default cursor lifetime.
Then you need to change that default, to tell Mongo that this cursor should not expire until processing is done.
Check the No TimeOut documentation.
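A hedged mongo shell sketch of that option ("test" and the per-document work are illustrative):
// Opt this cursor out of the 10-minute idle timeout during a long batch job
var cur = db.test.find({}).noCursorTimeout();
while (cur.hasNext()) {
    processDocumentSomehow(cur.next()); // hypothetical slow per-document work
}
cur.close(); // close explicitly, since the server no longer times it out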
A cursor is an object returned by calling db.collection.find() and which enables iterating through documents (NoSQL equivalent of a SQL "row") of a MongoDB collection (NoSQL equivalent of "table").
In case your cluster is stable and no members were down or changing state, the most probable reason for not finding the cursor is this:
The default idle cursor timeout is 10 minutes, but in versions >= 3.6 the cursor is also associated with a session, which has a default session timeout of 30 minutes. So even if you set the cursor to not expire with the noCursorTimeout() option, you are still limited by the 30-minute session timeout. To avoid your cursor being killed by the session timeout, you will need to periodically check in your code and execute the refreshSessions command:
db.adminCommand({"refreshSessions" : [sessionId]})
to extend the session by another 30 minutes, so your cursor is not killed while you do something with the data before fetching the next batch.
Check the docs here for details on how to do it:
https://docs.mongodb.com/manual/reference/method/cursor.noCursorTimeout/
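A hedged mongo shell sketch of that idea (collection name and refresh interval are illustrative):
var session = db.getMongo().startSession();
var cur = session.getDatabase("test").test.find({}).noCursorTimeout();
var lastRefresh = Date.now();
while (cur.hasNext()) {
    var doc = cur.next();
    // ... slow per-document processing here ...
    if (Date.now() - lastRefresh > 5 * 60 * 1000) { // refresh well before the 30min limit
        db.adminCommand({ refreshSessions: [session.getSessionId().id] });
        lastRefresh = Date.now();
    }
}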

storing data as object vs array in MongoDb for write performance

Should I store objects in an array or inside an object, with top importance given to write speed?
I'm trying to decide whether data should be stored as an array of objects, or using nested objects inside a mongodb document.
In this particular case, I'm keeping track of a set of continually updated files that I add to and update; the file name acts as a key, along with the number of lines processed within the file.
The document looks something like this:
{
    t_id: 1220,
    some-other-info: {}, // there's other info here not updated frequently
    files: {
        log1-txt: { filename: "log1.txt", numlines: 233, filesize: 19928 },
        log2-txt: { filename: "log2.txt", numlines: 2, filesize: 843 }
    }
}
or this
{
    t_id: 1220,
    some-other-info: {},
    files: [
        { filename: "log1.txt", numlines: 233, filesize: 19928 },
        { filename: "log2.txt", numlines: 2, filesize: 843 }
    ]
}
I am making the assumption that when handling a document, especially when it comes to updates, it is easier to deal with objects, because the location of the object can be determined by its name; unlike an array, where I have to look through each object's values until I find the match.
Because the object key will have periods, I will need to convert (or drop) the periods to create a valid key (fi.le.log to filelog or fi-le-log).
I'm not worried about the files' possible duplicate names emerging (such as fi.le.log and fi-le.log) so I would prefer to use Objects, because the number of files is relatively small, but the updates are frequent.
Or would it be better to handle this data in a separate collection for best write performance...
{
    "_id": ObjectId('56d9f1202d777d9806000003'),
    "t_id": "1220",
    "filename": "log1.txt",
    "filesize": 1843,
    "numlines": 554
},
{
    "_id": ObjectId('56d9f1392d777d9806000004'),
    "t_id": "1220",
    "filename": "log2.txt",
    "filesize": 5231,
    "numlines": 3027
}
From what I understand you are talking about write speed, without any read consideration. So we have to think about how you will insert/update your document.
We have to compare (assuming you know the _id of the document you are updating; replace {key} with the key name, in your example log1-txt or log2-txt):
db.Col.update({ _id: '' }, { $set: { 'files.{key}': object }})
vs
db.Col.update({ _id: '', 'files.filename': '{key}'}, { $set: { 'files.$': object }})
The second one means that MongoDB has to scan the array, find the matching index and update it. The first one means MongoDB just updates the specified field.
The worst part:
The second command will not work if the matching filename is not present in the array! So you have to execute it, check if nMatched is 0, and create the element if so. That's really bad for write speed (see here: MongoDB: upsert sub-document).
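A hedged mongo shell sketch of that two-step array fallback (the _id and file values are illustrative):
// Try the positional update first...
var res = db.Col.update(
    { _id: 1220, 'files.filename': 'log1.txt' },
    { $set: { 'files.$': { filename: 'log1.txt', numlines: 240, filesize: 20111 } } }
);
// ...and $push the element only if nothing matched
if (res.nMatched === 0) {
    db.Col.update(
        { _id: 1220 },
        { $push: { files: { filename: 'log1.txt', numlines: 240, filesize: 20111 } } }
    );
}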
If you will never (or almost never) use read queries or the aggregation framework on this collection, go for the first one; it will be faster. If you want to aggregate, unwind, or do some analytics on the files you parsed to get statistics about file sizes and line counts, consider the second one; you will avoid some headaches.
Pure write speed will be better with the first solution.

mongodb find in order of last record to first

My project inserts a large number of historical data documents into a history collection. Since they are never modified, the insertion order is correct (no updating goes on), but it is backwards for retrieval.
I'm using something like this to retrieve the data in pages of 20 records.
var page = 2;
hStats_collection.find({ name: name }, { _id: 0, name: 0 }).limit(20).skip(page * 20).toArray(function (err, doc) {
    // ...
});
After reading my eyes dry, $sort is no good, and it appears the best way to reverse the order that the above code retrieves is to add an index on a timestamp element in the document. I already have a useful field called started (seconds since epoch) and need to create an index for it.
http://docs.mongodb.org/manual/tutorial/create-an-index/
The docs say I can do something like:
hStats_collection.createIndex( { started: -1 } )
How do I change my .find() code above to find by name and use the started index, so the results come back newest to oldest (as opposed to the natural find() order of oldest to newest)?
hStats_collection.find({ name: name }, { _id: 0, name: 0 })
    .sort({ started: -1 }).skip(page * 20).limit(20).toArray(function (err, docs) { /* ... */ });
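Make sure the descending index exists first; a hedged Node.js driver sketch (createIndex is idempotent, so it is safe to call at startup):
hStats_collection.createIndex({ started: -1 }, function (err, indexName) {
    // the sorted, paged query above can now use the index
});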

Range query for MongoDB pagination

I want to implement pagination on top of a MongoDB. For my range query, I thought about using ObjectIDs:
db.tweets.find({ _id: { $lt: maxID } }, { limit: 50 })
However, according to the docs, the structure of the ObjectID means that "ObjectId values do not represent a strict insertion order":
The relationship between the order of ObjectId values and generation time is not strict within a single second. If multiple systems, or multiple processes or threads on a single system generate values, within a single second; ObjectId values do not represent a strict insertion order. Clock skew between clients can also result in non-strict ordering even for values, because client drivers generate ObjectId values, not the mongod process.
I then thought about querying with a timestamp:
db.tweets.find({ created: { $lt: maxDate } }, { limit: 50 })
However, there is no guarantee the date will be unique — it's quite likely that two documents could be created within the same second. This means documents could be missed when paging.
Is there any sort of ranged query that would provide me with more stability?
It is perfectly fine to use ObjectId() though your syntax for pagination is wrong. You want:
db.tweets.find().limit(50).sort({"_id":-1});
This says you want tweets sorted by _id value in descending order, and you want the most recent 50. Your problem is the fact that pagination is tricky when the current result set is changing - so rather than using skip for the next page, you want to make note of the smallest _id in the result set (the 50th most recent _id value) and then get the next page with:
db.tweets.find( {_id : { "$lt" : <50th _id> } } ).limit(50).sort({"_id":-1});
This will give you the next "most recent" tweets, without new incoming tweets messing up your pagination back through time.
There is absolutely no need to worry about whether _id value is strictly corresponding to insertion order - it will be 99.999% close enough, and no one actually cares on the sub-second level which tweet came first - you might even notice Twitter frequently displays tweets out of order, it's just not that critical.
If it is critical, then you would have to use the same technique but with "tweet date" where that date would have to be a timestamp, rather than just a date.
Wouldn't a tweet's "actual" timestamp (i.e. the time tweeted, and the criterion you want it sorted by) be different from a tweet's "insertion" timestamp (i.e. the time it was added to the local collection)? This depends on your application, of course, but it's a likely scenario that tweet inserts could be batched or otherwise end up being inserted in the "wrong" order. So, unless you work at Twitter (and have access to collections inserted in the correct order), you wouldn't be able to rely just on $natural or ObjectID for sorting logic.
Mongo docs suggest skip and limit for paging:
db.tweets.find({ created: { $lt: maxID } })
    .sort({ created: -1, username: 1 })
    .skip(50).limit(50); // second page
There is, however, a performance concern when using skip:
The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to get the offset or skip position before beginning to return result. As offset increases, cursor.skip() will become slower and more CPU intensive.
This happens because skip does not fit into the MapReduce model and is not an operation that would scale well, you have to wait for a sorted collection to become available before it can be "sliced". Now limit(n) sounds like an equally poor method as it applies a similar constraint "from the other end"; however with sorting applied, the engine is able to somewhat optimize the process by only keeping in memory n elements per shard as it traverses the collection.
An alternative is to use range based paging. After retrieving the first page of tweets, you know what the created value is for the last tweet, so all you have to do is substitute the original maxID with this new value:
db.tweets.find({ created: { $lt: lastTweetOnCurrentPageCreated } })
    .sort({ created: -1, username: 1 })
    .limit(50); // next page
Performing a find condition like this can be easily parallelized. But how do you deal with pages other than the next one? You don't know the begin date for page number 5, 10, 20, or even the previous page! #SergioTulentsev suggests creative chaining of methods, but I would advocate pre-calculating first-last ranges of the aggregate field in a separate pages collection; these could be re-calculated on update. Furthermore, if you're not happy with DateTime (note the performance remarks) or are concerned about duplicate values, you should consider compound indexes on timestamp + account tie (since a user can't tweet twice at the same time), or even an artificial aggregate of the two:
db.pages.find({ pagenum: 3 })
> { pagenum: 3, begin: "01-01-2014#BillGates", end: "03-01-2014#big_ben_clock" }

db.tweets.find({ _sortdate: { $lt: "03-01-2014#big_ben_clock", $gt: "01-01-2014#BillGates" } })
    .sort({ _sortdate: -1 })
    .limit(50) // third page
Using an aggregate field for sorting will work "on the fold" (although perhaps there are more kosher ways to deal with the condition). This could be set up as a unique index with values corrected at insert time, with a single tweet document looking like
{
    _id: ...,
    created: ...,  // to be used in markup
    user: ...,     // also to be used in markup
    _sortdate: "01-01-2014#BillGates" // sorting only, use date AND time
}
The following approach will work even if multiple documents are inserted/updated at the same millisecond, even from multiple clients (which generate the ObjectIds). For simplicity, in the following queries I am projecting _id and lastModifiedDate.
First page: fetch the results sorted by lastModifiedDate (descending) and ObjectId (ascending).
db.product.find({},{"_id":1,"lastModifiedDate":1}).sort({"lastModifiedDate":-1, "_id":1}).limit(2)
Note down the ObjectId and lastModifiedDate of the last record fetched in this page (loid, lmd).
For the second page, include a query condition to search for (lastModifiedDate = lmd AND _id > loid) OR (lastModifiedDate < lmd):
db.product.find(
    { $or: [
        { "lastModifiedDate": { $lt: lmd } },
        { $and: [ { "lastModifiedDate": lmd }, { "_id": { $gt: loid } } ] }
    ]},
    { "_id": 1, "lastModifiedDate": 1 }
).sort({ "lastModifiedDate": -1, "_id": 1 }).limit(2)
Repeat the same for subsequent pages.
ObjectIds should be good enough for pagination if you limit your queries to the previous second (or don't care about the subsecond possibility of weirdness). If that is not good enough for your needs then you will need to implement an ID generation system that works like an auto-increment.
Update:
To query the previous second of ObjectIds you will need to construct an ObjectID manually.
See the specification of ObjectId http://docs.mongodb.org/manual/reference/object-id/
Try using this expression to do it from a mongos.
{
    _id: {
        $lt: ObjectId(Math.floor((new Date).getTime() / 1000 - 1).toString(16) + "ffffffffffffffff")
    }
}
The 'f's at the end are there to max out the possible random bits that are not associated with a timestamp, since you are doing a less-than query.
I recommend doing the actual ObjectId creation on your application server rather than on the mongos, since this type of calculation can slow you down if you have many users.
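A hedged Node.js driver sketch of building that boundary ObjectId on the application server; ObjectId.createFromTime() zero-fills the non-timestamp bytes, so a $lt query excludes everything from that second onward:
const { ObjectId } = require('mongodb'); // exported as ObjectID in older drivers
// Boundary one second in the past, computed client-side rather than on the mongos
const oneSecondAgo = Math.floor(Date.now() / 1000) - 1;
const boundary = ObjectId.createFromTime(oneSecondAgo);
// db.collection('tweets').find({ _id: { $lt: boundary } }) ...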
I have built pagination using the MongoDB _id this way.
// import ObjectId from mongodb
let sortOrder = -1;
let query = [];

if (prev) {
    sortOrder = 1;
    query.push({ title: 'findTitle', _id: { $gt: ObjectId('_idValue') } });
}

if (next) {
    sortOrder = -1;
    query.push({ title: 'findTitle', _id: { $lt: ObjectId('_idValue') } });
}

db.collection.find(query).limit(10).sort({ _id: sortOrder })

Javascript function taking too long to complete?

Below is a snippet of code that I am having trouble with. The purpose is to check for duplicate entries in the database and call back with "h" as a boolean, true or false. For testing purposes I am returning a true boolean for "h", but by the time the alert(duplicate_count); line gets executed, duplicate_count is still 0, even though the alert for a +1 gets executed.
To me it seems like the function updateUserFields takes longer to execute, so it doesn't finish before we get to the alert.
Any ideas or suggestions? Thanks!
var duplicate_count = 0;

for (var i = 0; i < skill_id.length; i++) {
    function updateUserFields(h) {
        if (h) {
            duplicate_count++;
            alert("count +1");
        } else {
            alert("none found");
        }
    }

    var g = new cfc_mentoring_find_mentor();
    g.setCallbackHandler(updateUserFields);
    g.is_relationship_duplicate(resource_id, mentee_id, section_id[i], skill_id[i], active_ind, table);
}

alert(duplicate_count);
alert(duplicate_count);
There is no reason whatsoever to use client-side JavaScript/jQuery to remove duplicates from your database. Security concerns aside (and there are a lot of those), there is a much easier way to make sure the entries in your database are unique: use SQL.
SQL is capable of expressing the requirement that there be no duplicates in a table column, and the database engine will enforce that for you, never letting you insert a duplicate entry in the first place. The syntax varies very slightly by database engine, but whenever you create the table you can specify that a column must be unique.
Let's use SQLite as our example database engine. The relevant part of your problem is right now probably expressed with tables something like this:
CREATE TABLE Person(
    id INTEGER PRIMARY KEY ASC
    -- Other fields here
);

CREATE TABLE MentorRelationship(
    id INTEGER PRIMARY KEY ASC,
    mentorID INTEGER,
    menteeID INTEGER,
    FOREIGN KEY (mentorID) REFERENCES Person(id),
    FOREIGN KEY (menteeID) REFERENCES Person(id)
);
However, you can enforce uniqueness, i.e. require that any (mentorID, menteeID) pair is unique, by making the pair (mentorID, menteeID) the primary key. This works because you are only allowed one copy of each primary key. The MentorRelationship table then becomes:
CREATE TABLE MentorRelationship(
    mentorID INTEGER,
    menteeID INTEGER,
    PRIMARY KEY (mentorID, menteeID),
    FOREIGN KEY (mentorID) REFERENCES Person(id),
    FOREIGN KEY (menteeID) REFERENCES Person(id)
);
EDIT: As per the comment, alerting the user to duplicates but not actually removing them
This is still much better with SQL than with JavaScript. When you do this in JavaScript, you read one database row at a time, send it over the network, wait for it to come to your page, process it, throw it away, and then request the next one. With SQL, all the hard work is done by the database engine, and you don't lose time by transferring unnecessary data over the network. Using the first set of table definitions above, you could write
SELECT mentorID, menteeID
FROM MentorRelationship
GROUP BY mentorID, menteeID
HAVING COUNT(*) > 1;
which will return all the (mentorID, menteeID) pairs that occur more than once.
Once you have a query like this working on the server (and are also pulling out all the information you want to show to the user, which is presumably more than just a pair of IDs), you need to send this over the network to the user's web browser. Essentially, on the server side you map a URL to return this information in some convenient form (JSON, XML, etc.), and on the client side you read this information by contacting that URL with an AJAX call (see jQuery's website for some code examples), and then display that information to the user. No need to write in JavaScript what a database engine will execute orders of magnitude faster.
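For instance, a minimal client-side sketch, assuming a hypothetical /duplicates endpoint on your server that returns the pairs from the query above as JSON:
// Fetch the duplicate (mentorID, menteeID) pairs and tell the user about them
$.getJSON('/duplicates', function (pairs) {
    if (pairs.length > 0) {
        alert('Found ' + pairs.length + ' duplicate mentor/mentee pairs.');
    }
});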
EDIT 2: As per the second comment, checking whether an item is already in the database
Almost everything I said in the first edit applies, except for two changes: the schema and the query. The schema should become the second of the two schemas I posted, since you don't want the database engine to allow duplicates. Also, the query should be simply
SELECT COUNT(*) > 0
FROM MentorRelationship
WHERE mentorID = #mentorID AND menteeID = #menteeID;
where #mentorID and #menteeID are the items that the user selected, and are inserted into the query by a query builder library and not by string concatenation. Then, the server will get a true value if the item is already in the database, and a false value otherwise. The server can send that back to the client via AJAX as before, and the client (that's your JavaScript page) can alert the user if the item is already in the database.
