I'm looking into ways to improve the upsert performance of my MongoDB application. In my test program I have a 'user' collection which has an 'id' (type: Number) and a 'name' (type: String) property. There is a unique index on 'id'.
The Problem:
When performing a bulk write (ordered: false), it seems that updateOne or replaceOne with upsert enabled is about 6 to 8 times slower than insertOne.
My Index:
await getDb().collection('user').createIndex({
  id: 1
}, {
  unique: true,
  name: "id_index"
});
Example replaceOne (takes 8.8 seconds for 100,000 users):
operations.push({
  replaceOne: {
    filter: { id: 1 },
    replacement: { id: 1, name: "user 1" },
    upsert: true
  }
});
Example updateOne (takes 8.4 seconds for 100,000 users):
operations.push({
  updateOne: {
    filter: { id: 1 },
    update: { $set: { name: "user 1" } },
    upsert: true
  }
});
Example insertOne (takes 1.3 seconds for 100,000 users):
operations.push({
  insertOne: {
    document: { id: 1, name: "user 1" }
  }
});
NOTE - each time I performed these tests, the collection was emptied and the index was recreated.
Is that to be expected?
Is there anything else I can do to improve upsert performance? I have modified writeConcern on bulkWrite with little to no impact.
I was 'following' this question to see what might come of it. Seeing no answers over the week, I'll offer my own.
Without further information, I think the evidence you provided is itself reasonably strong grounds to say that the answer is 'yes, this is expected'. While we don't know details such as how many updates versus inserts were performed by your test or what version of the database you are using, there doesn't seem to be anything blatantly wrong with the comparison. The upsert documentation suggests that the database first checks for the existence of documents that would be updated by the command before performing the insert. This is further suggested by the following text a little bit lower on the same page (emphasis added):
If all update() operations finish the query phase before any client successfully inserts data, and there is no unique index on the name field, each update() operation may result in an insert, creating multiple documents with name: Andy.
Based on all of this, I think it is perfectly reasonable to surmise that the update portion of an upsert operation has a noticeable overhead on the operation that is absent for direct insert operations.
Given this information, I think it raises a few additional questions which are important to consider:
What was your goal in knowing this information? Just to make sure you had configured things optimally, or were you not currently achieving some performance targets?
Depending on a variety of factors, perhaps an alternative approach would be to just attempt the insert (or update) and deal with the exceptions separately afterwards (see the sketch after these questions)?
Perhaps out of curiosity, what's the purpose of having a separate unique index on id when there is already one present for the _id field? Certainly each new index (unique or not) adds some overhead, so perhaps it would be best to just repurpose the required _id field and index for your particular needs?
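On that "just attempt the insert" idea, here is a minimal sketch (assuming the Node.js driver; loadUsers() is a hypothetical source of the test data, and the exact shape of the bulk-write error object varies across driver versions):
const users = loadUsers(); // hypothetical: the 100,000 test users

const operations = users.map(u => ({
  insertOne: { document: { id: u.id, name: u.name } }
}));

try {
  await getDb().collection('user').bulkWrite(operations, { ordered: false });
} catch (err) {
  // With ordered: false every insert is still attempted; duplicate-key
  // failures (code 11000) are reported in writeErrors along with the
  // index of the operation that failed.
  const dupes = (err.writeErrors || [])
    .filter(e => e.code === 11000)
    .map(e => operations[e.index].insertOne.document);
  if (dupes.length > 0) {
    await getDb().collection('user').bulkWrite(
      dupes.map(d => ({
        updateOne: { filter: { id: d.id }, update: { $set: { name: d.name } } }
      })),
      { ordered: false }
    );
  }
}
This way the common case (mostly new documents) pays the insert price, and only the conflicting subset pays the update price.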
How would I go about filtering a set of records based on their child records?
Let's say I have a collection Item that has a field referencing another collection Bag, called bagId. I'd like to find all Items where a field on the referenced Bag matches some clause.
I.e. something like db.Items.find({ "where bag.type: 'Paper'" }) (pseudocode). How would I go about doing this in MongoDB? I understand I'd have to join on Bags and then match where Item.bagId == Bag._id.
I used Studio 3T to convert a SQL GROUP BY to a Mongo aggregate. I'm just wondering if there's any de facto way to do this.
Should I perform a data migration to simply include Bag.type on every Item document? (I don't want to get into the habit of continuously making schema changes every time I want to sort/filter Items by Bag fields.)
Use something like https://github.com/meteorhacks/meteor-aggregate (No luck with that syntax yet)
Grapher (https://github.com/cult-of-coders/grapher): I played around with this briefly and while it's cool, I'm not sure it'll actually solve my problem. I can use it to add Bag.type to every Item returned, but I don't see how that could help me filter every Item by Bag.type.
Is this just one of the tradeoffs of using a NoSQL DBMS? Which of the options above is recommended, or are there any other ideas?
Thanks
You could use the $in functionality of MongoDB. It would look something like this:
const bagsIds = Bags.find({type: 'paper'}, {fields: {"_id": 1}}).map(function(bag) { return bag._id; });
const items = Items.find( { bagId: { $in: bagsIds } } ).fetch();
It would take some testing to see whether the reactivity of this solution still behaves the way you expect, and whether it remains suitable for larger collections, before choosing it over your first option of performing the migration.
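If you can run the query on the server, another option (assuming MongoDB 3.2+, and collections literally named Items and Bags with Items.bagId referencing Bags._id) is an aggregation with $lookup, which does the join and the filter in one round trip:
db.Items.aggregate([
  { $lookup: {
      from: "Bags",            // join each Item to its Bag
      localField: "bagId",
      foreignField: "_id",
      as: "bag"
  } },
  { $unwind: "$bag" },          // bagId references a single Bag
  { $match: { "bag.type": "Paper" } }
]);
Note that stock Meteor collections don't expose aggregate on the client, which is part of why packages like meteor-aggregate exist; the pipeline above is what you'd run server-side or in the shell.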
I’ve found multiple posts and guides praising the ability to do joins in Mongoose and MongoDB using the populate() method.
This makes me confused. If you want to do joins, shouldn’t you use an SQL database? Shouldn’t joins in MongoDB be a last resort?
Each object that uses populate() seems to require a second query to fetch that data. So if you fetch 100 items in a query, you'd need another 100 queries to fetch the referenced data. It sounds like storing the data as nested schemas is a much better idea where possible.
Am I wrong? Is populate() actually a great method that makes sense? Or am I right that it's a last-resort option that should be avoided where possible?
populate() doesn't send a find request for every child document per parent document.
It sends a single find with all child ObjectIds (of all parents!) in the filter.
example (mongoose.set('debug', true) console output):
Mongoose: parent.find({}, { fields: {} }) // was called with populate()
Mongoose: child.find({ _id: { '$in': [ ObjectId(A), ObjectId(B), ...] }})
Mongoose then "joins" parents to children in Node.
So essentially only one round trip was added. To avoid even that where possible, I've denormalized some of my schemas for common use cases.
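A minimal sketch of that behaviour (hypothetical Parent/Child schemas):
const mongoose = require('mongoose');

const childSchema = new mongoose.Schema({ name: String });
const parentSchema = new mongoose.Schema({
  name: String,
  children: [{ type: mongoose.Schema.Types.ObjectId, ref: 'Child' }]
});

const Child = mongoose.model('Child', childSchema);
const Parent = mongoose.model('Parent', parentSchema);

// Inside an async context: one query for the parents, then ONE batched
// $in query for every referenced child, no matter how many parents matched.
const parents = await Parent.find({}).populate('children');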
I would like to order a collection based on several keys. E.g.:
db.collection.find().sort( { age: -1, score: 1 } );
This is all fine and dandy. However, I would like to guarantee that age comes first, and score comes next.
In JavaScript, the order in which an object's keys are listed is not guaranteed.
The explanation of sort in the official MongoDB documentation is not exactly clear.
So... is the way I showed actually reliable in terms of order in which the arguments are taken?
Merc.
The syntax you use only works if your JavaScript implementation preserves the order of keys in objects. ES2015 and later engines guarantee insertion order for non-integer string keys, but older implementations made no such promise, so it's up to you to be sure.
For the official Node.js driver, there is an alternate form that works with an array:
db.collection.find().sort([['age', -1], ['score', 1]]);
Drivers for other languages might have something similar.
The aggregation documentation actually provides a little more information on the subject, where it states:
db.users.aggregate(
  { $sort : { age : -1, posts: 1 } }
);
This operation sorts the documents in the users collection, in descending order according to the age field and then in ascending order according to the value in the posts field
Which would indicate that the .sort() method functions in the same way.
Here's an example of what I need in sql:
SELECT name FROM employ WHERE name LIKE '%bro%'
How do I create a view like that in CouchDB?
The simple answer is that CouchDB views aren't ideal for this.
The more complicated answer is that this type of query tends to be very inefficient in typical SQL engines too, and so if you grant that there will be tradeoffs with any solution then CouchDB actually has the benefit of letting you choose your tradeoff.
1. The SQL Ways
When you do SELECT ... WHERE name LIKE %bro%, all the SQL engines I'm familiar with must do what's called a "full table scan". This means the server reads every row in the relevant table, and brute force scans the field to see if it matches.
You can do this in CouchDB 2.x with a Mango query using the $regex operator. The query would look something like this for the basic case:
{"selector":{
"name": {
"$regex": "bro"
}
}}
There do not appear to be any options exposed for case-sensitivity, etc. but you could extend it to match only at the beginning/end or more complicated patterns. If you can also restrict your query via some other (indexable) field operator, that would likely help performance. As the documentation warns:
Regular expressions do not work with indexes, so they should not be used to filter large data sets. […]
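For instance, if your documents carried an indexed type field (hypothetical here), adding it to the selector narrows the candidate set before the regex is applied:
{
  "selector": {
    "type": "employee",
    "name": { "$regex": "bro" }
  }
}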
You can do a full scan in CouchDB 1.x too, using a temporary view:
POST /some_database/_temp_view
{"map": "function (doc) { if (doc.name && doc.name.indexOf('bro') !== -1) emit(null); }"}
This will look through every single document in the database and give you a list of matching documents. You can tweak the map function to also match on a document type, or to emit with a certain key for ordering — emit(doc.timestamp) — or some data value useful to your purpose — emit(null, doc.name).
2. The "tons of disk space available" way
Depending on your source data size you could create an index that emits every possible "interior string" as its permanent (on-disk) view key. That is to say for a name like "Dobros" you would emit("dobros"); emit("obros"); emit("bros"); emit("ros"); emit("os"); emit("s");. Then for a term like '%bro%' you could query your view with startkey="bro"&endkey="bro\uFFFF" to get all occurrences of the lookup term. Your index will be approximately the size of your text content squared, but if you need to do an arbitrary "find in string" faster than the full DB scan above and have the space this might work. You'd be better served by a data structure designed for substring searching though.
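A sketch of such a suffix-emitting map function (assuming the searchable text lives in doc.name):
function (doc) {
  if (!doc.name) return;
  var s = doc.name.toLowerCase();
  // Emit every suffix; a prefix query on this index is then a substring search.
  for (var i = 0; i < s.length; i++) {
    emit(s.slice(i), null);
  }
}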
Which brings us to...
3. The Full Text Search way
You could use a CouchDB plugin (couchdb-lucene now via Dreyfus/Clouseau for 2.x, ElasticSearch, SQLite's FTS) to generate an auxiliary text-oriented index into your documents.
Note that most full text search indexes don't naturally support arbitrary wildcard prefixes either, likely for similar reasons of space efficiency as we saw above. Usually full text search doesn't imply "brute force binary search", but "word search". YMMV though, take a look around at the options available in your full text engine.
If you don't really need to find "bro" anywhere in a field, you can implement a basic "find a word starting with X" search with regular CouchDB views by just splitting on various locale-specific word separators and emitting these "words" as your view keys. This will be more efficient than the above, scaling proportionally to the amount of data indexed.
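A sketch of that kind of word-splitting map function (again assuming the text is in doc.name; adjust the separator pattern for your locale):
function (doc) {
  if (!doc.name) return;
  // Split into words and emit each one as a view key.
  doc.name.toLowerCase().split(/[\s,.;:\-_]+/).forEach(function (word) {
    if (word) emit(word, null);
  });
}
Querying it with startkey="foo"&endkey="foo\uFFFF" then returns every document containing a word that starts with "foo".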
Unfortunately, searches using LIKE '%...%' aren't really how CouchDB views work, but you can accomplish a great deal of search capability by installing couchdb-lucene, a full-text search engine that creates indexes on your database with which you can do more sophisticated searches.
The typical way to "search" a database for a given key, without any 3rd party tools, is to create a view that emits the value you are looking for as the key. In your example:
function (doc) {
  emit(doc.name, doc);
}
This outputs a list of all the names in your database.
Now, you would "search" based on the first letters of your key. For example, if you are searching for names that start with "bro".
/db/_design/test/_view/names?startkey="bro"&endkey="brp"
Notice I took the search parameter and "incremented" its last letter. Again, if you want to perform searches rather than aggregate statistics, you should use a full-text search engine like Lucene (see above).
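If you'd rather compute that endkey than eyeball it, a tiny helper does the "increment" (a sketch; assumes a non-empty prefix whose last character isn't the maximum code point):
function endkeyFor(prefix) {
  // "bro" -> "brp": bump the last character to the next code point
  var last = prefix.charCodeAt(prefix.length - 1);
  return prefix.slice(0, -1) + String.fromCharCode(last + 1);
}
// endkeyFor("bro") === "brp"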
You can use regular expressions. As per this table, you can write something like this to return any id that contains "sms":
{
  "selector": {
    "_id": {
      "$regex": "sms"
    }
  }
}
Basic regexes you can use with that include:
"sms$", roughly LIKE "%sms"
"^sms", roughly LIKE "sms%"
You can read more on regular expressions here
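Putting one of those anchored forms into a complete query, the equivalent of LIKE "sms%" would be:
{
  "selector": {
    "_id": {
      "$regex": "^sms"
    }
  }
}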
I found a simple view for my problem...
{"getavailableproduct": {
"map": "function(doc) { var prefix = doc['productid'].match(/[A-Za-z0-9]+/g); if(prefix) for(var pre in prefix) { emit(prefix[pre],null); } }"
}
}
With this view code I can split a key sentence into keywords, and then call
?key="[search_keyword]"
But I need more complex code, because as written I can only find the exact word I type (e.g. eat, food, etc.).
If I type an incomplete word (e.g. ea from eat, or foo from food), this code does not work.
I know it is an old question, but: what about using a "list" function? You can have all your normal views, and then add a "list" function to the design document to process the view's results:
{
  "_id": "_design/...",
  "views": {
    "employees": "..."
  },
  "lists": {
    "by_name": "..."
  }
}
And the function attached to "by_name" should be something like:
function (head, req) {
  provides('json', function () {
    var filtered = [];
    var row;
    while (row = getRow()) {
      // We can retrieve all row information from the view
      var key = row.key;
      var value = row.value;
      var doc = req.query.include_docs ? row.doc : {};
      if (value.name.indexOf(req.query.name) == 0) {
        if (req.query.include_docs) {
          filtered.push({ key: key, value: value, doc: doc });
        } else {
          filtered.push({ key: key, value: value });
        }
      }
    }
    return toJSON({ total_rows: filtered.length, rows: filtered });
  });
}
You can, of course, use regular expressions too. It's not a perfect solution, but it works for me.
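You would then query the list function through the view it wraps, passing your filter as a query parameter, something like this (substituting your design document's name for the elided one above):
GET /db/_design/.../_list/by_name/employees?name=bro&include_docs=true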
You could emit your documents like normal: emit(doc.name, null); (I would throw a toLowerCase() on that name to remove case sensitivity),
and then query the view with a slew of keys to see if something "like" the query shows up.
keys = differentVersions("bro"); // returns ["bro", "br", "bo", "ro", "cro", "dro", ..., "zro"]
$.couch("db").view("employeesByName", { keys: keys, success: dealWithIt } )
Some considerations
Obviously that array can get really big really fast depending on what differentVersions returns. You might hit a post data limit at some point or conceivably get slow lookups.
The results are only as good as differentVersions is at giving you guesses for what the person meant to spell. Obviously this function can be as simple or complex as you like. In this example I tried two strategies, a) removed a letter and pushed that, and b) replaced the letter at position n with all other letters. So if someone had been looking for "bro" but typed in "gro" or "bri" or even "bgro", differentVersions would have permuted that to "bro" at some point.
While not ideal, it's still pretty fast since a look up in Couch's b-trees is fast.
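For reference, a sketch of a differentVersions along the lines described in the considerations above (hypothetical; tune the strategies to your tolerance for typos):
function differentVersions(term) {
  var letters = "abcdefghijklmnopqrstuvwxyz";
  var variants = [term];
  for (var i = 0; i < term.length; i++) {
    // a) drop the letter at position i: "bro" -> "ro", "bo", "br"
    variants.push(term.slice(0, i) + term.slice(i + 1));
    // b) replace the letter at position i with every other letter:
    //    "bro" -> "aro", "cro", ..., "zro", "bao", ...
    for (var j = 0; j < letters.length; j++) {
      if (letters[j] !== term[i]) {
        variants.push(term.slice(0, i) + letters[j] + term.slice(i + 1));
      }
    }
  }
  return variants;
}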
Why can't we just use indexOf() in a view?