mongodb - can't understand why/how to use map-reduce - javascript

I'm trying out map-reduce to understand when it can be helpful.
So I have a collection named "actions" with 100k docs like this:
{
    "profile_id": 1111,
    "action_id": 2222
}
Now I'm working through map-reduce examples. I want to get a list of all users and the total number of actions each one has. Is this possible? My code:
db.fbooklikes.mapReduce(
    function(){
        emit(this.profile_id, this.action_id);
    },
    function(keyProfile, valueAction){
        return Array.sum(valueAction);
    },
    {
        out: "example"
    }
)
This is not working. The result is:
"counts" : {
"input" : 100000,
"emit" : 100000,
"reduce" : 1146,
"output" : 13
},
"ok" : 1,
"_o" : {
"result" : "map_reduce_example",
"timeMillis" : 2539,
"counts" : {
"input" : 100000,
"emit" : 100000,
"reduce" : 1146,
"output" : 13
},
"ok" : 1
},
Is what I'm trying to do possible with map-reduce?

Well, yes, you can use it, but the more refined response is that there are likely better tools for doing what you want.
MapReduce is handy for some tasks, but usually best suited when something else does not apply. The inclusion of mapReduce in MongoDB pre-dates the introduction of the aggregation framework, which is generally what you should be using when you can:
db.fbooklikes.aggregate([
    { "$group": {
        "_id": "$profile_id",
        "count": { "$sum": 1 }
    }}
])
This simply returns the counts for all documents in the collection, grouped by each value of "profile_id".
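If you also want the list ordered by most actions first (my addition, not part of the original answer), append a $sort stage:
db.fbooklikes.aggregate([
    { "$group": {
        "_id": "$profile_id",
        "count": { "$sum": 1 }
    }},
    { "$sort": { "count": -1 } }
])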
MapReduce requires JavaScript evaluation and therefore runs much slower than the native code functions implemented by the aggregation framework. Sometimes you have to use it, but in simple cases it is best not to, and there are some quirks that you need to understand:
db.fbooklikes.mapReduce(
    function(){
        emit(this.profile_id, 1);
    },
    function(key, values){
        return Array.sum(values);
    },
    {
        out: { "inline": 1 }
    }
)
The biggest thing people miss with mapReduce is the fact that the reducer is almost never called just once per emitted key. In fact it will process output in "chunks", thus "reducing" down part of that output and placing it back to be "reduced" again against other output until there is only a single value for that key.
For this reason it is important to emit the same type of data from the reduce function as is sent from the map function. It's a sticky point that can lead to weird results when you don't understand that part of the process. It is in fact the underlying way that mapReduce can deal with a large number of results for a single key and reduce them.
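To make that concrete, here is a minimal sketch of my own (not from the original answer): if the map emits an object, the reduce must return an object of the same shape, because its return value can be fed back into a later reduce pass as one of the "values":
db.fbooklikes.mapReduce(
    function () {
        // emit the same shape the reducer returns
        emit(this.profile_id, { count: 1 });
    },
    function (key, values) {
        var total = 0;
        values.forEach(function (v) {
            // v is either a mapped value or the result of an earlier reduce pass
            total += v.count;
        });
        return { count: total };
    },
    { out: { "inline": 1 } }
)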
But generally speaking, you should be using the aggregation framework where possible; where a problem requires some special calculation that would not be possible there, or some complex document traversal that you need to inspect with JavaScript, that is where you use mapReduce.

You don't want to sum the action ids; you want to count them. So you want something like the following:
var map = function () {
    emit(this.profile_id, { action_ids: [this.action_id], count: 1 });
};

var reduce = function (profile_id, values) {
    var value = { action_ids: [], count: 0 };
    for (var i = 0; i < values.length; i++) {
        value.count += values[i].count;
        value.action_ids.push.apply(value.action_ids, values[i].action_ids);
    }
    return value;
};

db.fbooklikes.mapReduce(map, reduce, { out: "example" });
This will give you an array of action ids and a count for each profile id. The count could be obtained by accessing the length of the action_ids array, but I thought I would keep it separate to make the example clearer.
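For reference (my note, not part of the original answer), with out: "example" the results land in the "example" collection under a "value" key, so you would read them back with something like:
db.example.find()
// each output document has the shape:
// { "_id" : <profile_id>, "value" : { "action_ids" : [ ... ], "count" : <n> } }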

Related

Search through a big collection of objects

I have a really big collection of objects that I want to search through.
The array has > 60,000 items, and search performance can be really slow from time to time.
One object in that array looks like this:
{
    "title": "title",
    "company": "abc company",
    "rating": 13, // internal rating based on comments and interaction
    ...
}
I want to search the title and the company info and order the results by the rating of the items.
This is what my search currently looks like:
onSearchInput(searchTerm) {
    (<any>window).clearTimeout(this.searchInputTimeout);
    this.searchInputTimeout = window.setTimeout(() => {
        this.searchForFood(searchTerm);
    }, 500);
}

searchForFood(searchTerm) {
    if (searchTerm.length > 1) {
        this.searchResults = [];
        this.foodList.map(item => {
            searchTerm.split(' ').map(searchTermPart => {
                if (item.title.toLowerCase().includes(searchTermPart.toLowerCase())
                    || item.company.toLowerCase().includes(searchTermPart.toLowerCase())) {
                    this.searchResults.push(item);
                }
            });
        });
        this.searchResults = this.searchResults.sort(function(a, b) {
            return a.rating - b.rating;
        }).reverse();
    } else {
        this.searchResults = [];
    }
}
Question: Is there any way to improve the search, logic- and performance-wise?
A bunch of hints:
It's a bit excessive to put searching through 60,000 items on the front-end. Is there any way you can perform part of the search on the back-end? If you really must do it on the front-end, consider searching in chunks of e.g. 10,000 and then using setImmediate() to perform the next part of the search, so the user's browser won't completely freeze during processing time.
Do the splitting and lowercasing of the search term outside of the loop.
map() the way you're using it is odd, as you don't use the return value. Better to use forEach(). Better still, use filter() to get the items that match.
When iterating over the search terms, use some() (as pointed out in the comments), as it gives you the chance to return early.
sort() mutates the original array, so you don't need to re-assign it.
sort() followed by reverse() is usually a smell. Instead, swap the sides of your comparison to b - a.
At this scale, it may make sense to run performance tests with includes(), indexOf(), a roll-your-own for loop, and match() (though I can almost guarantee match() will be slower). A sketch applying these hints follows below.
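Putting those hints together, here is a rough sketch (my own, untested; it assumes the same item shape as in the question):
function searchForFood(foodList, searchTerm) {
    if (searchTerm.length <= 1) return [];
    // split and lowercase the search term once, outside the loop
    var terms = searchTerm.toLowerCase().split(' ');
    return foodList
        .filter(function (item) {
            var title = item.title.toLowerCase();
            var company = item.company.toLowerCase();
            // some() stops at the first matching term
            return terms.some(function (term) {
                return title.includes(term) || company.includes(term);
            });
        })
        // b - a sorts descending, so no reverse() is needed
        .sort(function (a, b) { return b.rating - a.rating; });
}
As a bonus, filter() also avoids the original bug where an item matching several search terms was pushed into the results more than once.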
Alex's suggestions are good. My only addition: if you can afford to pre-process the data during idle time (preferably without holding up first render or interaction), you could load the data into a modified prefix trie. That would let you search for items in O(k) time, where k is the length of the search term. Right now you are searching in O(kn) time, because you look at every item and then do an includes() which itself takes k time (it's actually a little worse because of the toLowerCase() calls, but I don't want to get into the weeds of it).
If you aren't familiar with what a trie is, hopefully the code below gives you the idea, or you can search for information with your search engine of choice. It's basically a mapping of the characters in a string into nested hash maps.
Here's some sample code of how you might construct the trie:
function makeTries(data){
    let companyTrie = {};
    let titleTrie = {};
    data.forEach(item => {
        addToTrie(companyTrie, item.company, item, 0);
        addToTrie(titleTrie, item.title, item, 0);
    });
    return {
        companyTrie,
        titleTrie
    };
}

function addToTrie(trie, str, item, i){
    // every node keeps a list of all items that share the prefix so far,
    // so a lookup can return matches without walking the subtree
    trie.data = trie.data || [];
    trie.data.push(item);
    if (i >= str.length)
        return;
    if (!trie[str[i]]) {
        trie[str[i]] = {};
    }
    addToTrie(trie[str[i]], str, item, ++i);
}

function searchTrie(trie, term){
    if (trie == undefined)
        return [];
    if (term == "")
        return trie.data;
    // consume one character per level of the trie
    return searchTrie(trie[term[0]], term.substring(1));
}
var testData = [
    {
        company: "abc",
        title: "def",
        rank: 5
    }, {
        company: "abd",
        title: "deg",
        rank: 5
    }, {
        company: "afg",
        title: "efg",
        rank: 5
    }, {
        company: "afgh",
        title: "efh",
        rank: 5
    }
];
const tries = makeTries(testData);
console.log(searchTrie(tries.companyTrie, "afg"));
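One caveat (my addition, not in the original answer): the question's search is case-insensitive, so to match that behaviour you would lowercase both the keys going into the trie and the search term, e.g.:
addToTrie(companyTrie, item.company.toLowerCase(), item, 0);
// ... and at query time:
searchTrie(tries.companyTrie, term.toLowerCase());
You would then still sort the returned trie.data by rating, as in the original code.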

Use Lodash to find the indexOf a JSON array inside of an [] array

I have an array that looks something like this.
Users : {
    0 : { BidderBadge: "somestuff", Bidders: 6 },
    1 : { BidderBadge: "somemorestuff", Bidders: 7 }
}
I want to search the array using lodash to find a value inside of each of the user objects.
Specifically, I want to use values from another similar array of objects to find the value.
var bidArray = [];
_.each(this.vue.AllUsers, function(user) {
    _.each(this.vue.Bids, function(bid) {
        if (user.BidderBadge == bid.Badge) {
            bidArray.push(user);
        }
    });
});
This is what I have and it works, but I want to do it using only one loop instead of two. I want to use something like _.indexOf. Is that possible?
If you want to avoid nesting, you just have to modify Azamantes' solution a bit:
var bidders = this.vue.Bids.reduce(function(acc, bid) {
    acc[bid.Badge] = true;
    return acc;
}, {});
var bidArray = this.vue.AllUsers.filter(function(user) {
    return !!bidders[user.BidderBadge];
});
It is difficult to give an accurate answer with an example that doesn't coincide with the input you provide.
Anyway, supposing your data structures are more or less like these, you could solve the problem with lodash's _.intersectionWith.
Intersect both arrays using a comparator that checks the correct object properties. Also take into account that users must go first in the intersection, since you're interested in its values.
function comparator(user, bid) {
    return user.BidderBadge === bid.Badge;
}

console.log(_.intersectionWith(users, bids, comparator));
Here's the fiddle.
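For completeness, a hypothetical pair of inputs (shapes taken from the question) that makes the snippet runnable:
var users = [
    { BidderBadge: "somestuff", Bidders: 6 },
    { BidderBadge: "somemorestuff", Bidders: 7 }
];
var bids = [
    { Badge: "somestuff" }
];
// logs only the first user, since only its badge appears in bids
console.log(_.intersectionWith(users, bids, comparator));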

How to update a field by multiplying a field value in MongoDB

I'm trying to update a field with a value in euros, but something's going wrong.
db.orders.update({
    "_id": ObjectId("56892f6065380a21019dc810")
}, {
    $set: {
        "wartoscEUR": {
            $multiply: ["wartoscPLN", 4]
        }
    }
})
I got an error:
The dollar ($) prefixed field '$multiply' in 'wartoscEUR.$multiply' is not valid for storage.
wartoscPLN and wartoscEUR are number fields, and I'd like to calculate wartoscEUR by multiplying wartoscPLN by 4.
Sorry, maybe this is really easy, but I'm just getting started with NoSQL.
The $multiply operator is only used in aggregation. The operator you are looking for is $mul, and it has a different syntax:
db.orders.update(
    { "_id": ObjectId("56892f6065380a21019dc810") },
    { $mul: { wartoscPLN: 4 } }
);
It is not necessary to combine it with $set, as it implies those semantics implicitly.
The $multiply operator can only be used in the aggregation framework, not in an update operation. You could use the aggregation framework in your case to create a new result set that has the new field wartoscEUR, created with a $project pipeline stage.
You can then loop through the result set (using the forEach() method of the cursor returned from the .aggregate() method) and update your collection with the new value. Since you cannot access the value of one field from another directly in an update statement, the Bulk API update operations come in handy here:
The Bulk update will at least allow many operations to be sent in a single request with a singular response.
The above operation can be depicted with this implementation:
var bulk = db.orders.initializeOrderedBulkOp(),
    counter = 0;

db.orders.aggregate([
    {
        "$project": {
            "wartoscEUR": { "$multiply": ["$wartoscPLN", 4] }
        }
    }
]).forEach(function (doc) {
    bulk.find({ "_id": doc._id }).updateOne({
        "$set": { "wartoscEUR": doc.wartoscEUR }
    });
    counter++;

    if (counter % 1000 == 0) {
        bulk.execute();
        bulk = db.orders.initializeOrderedBulkOp();
    }
});

if (counter % 1000 != 0) {
    bulk.execute();
}
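As a side note beyond the original answers: on MongoDB 4.2+ an update can take an aggregation pipeline, which can reference another field's value and therefore do this in a single statement:
db.orders.updateMany(
    { },
    [ { "$set": { "wartoscEUR": { "$multiply": ["$wartoscPLN", 4] } } } ]
)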

How can I fake a multi-item $pop in MongoDB?

Quick question from someone new to Mongo.
I have a collection of documents that (simplified) look like this:
{"_id":<objectID>, "name":"fakeName", "seeds":[1231,2341,0842,1341,3451, ...]}
What I really need is a $pop that pops 2 or 3 items off my list of seeds, but $pop currently only works on one item, so I'm trying to find another way to accomplish the same thing.
The first thing I looked at was doing $push/$each/$slice with an empty "each", like:
update: { $push: { order: { $each: [ ], $slice: ?}}}
The problem here is that I don't know exactly how long I want my new slice to be (I want it to be "current size - number of seeds I popped"). If the $slice modifier worked like the $slice projection, this would be easy; I could just do $slice: [ #of seeds, ], but it doesn't, so that doesn't work.
The next thing I looked at was getting the size of the array and using that as an input to $slice, like:
update: { $push: { seeds: { $each: [ ], $slice: {$subtract: [{$size:"$seeds"}, <number of seeds to pop>]}}}}
But Mongo tells me "value for $slice must be numeric value and not an Object", so apparently the result of $subtract is an Object, not a number.
Then I tried to see if I could "remove" items from the array based on an empty query with a $limit, but apparently limit gets applied later in the pipeline, so I couldn't manage to make that work.
Any other suggestions, or am I out of luck and need to go back to the drawing board?
Thanks so much for help/input.
MongoDB does not presently have any method of referencing the existing values of fields in a single update statement. The only exceptions are operators such as $inc and $mul, which act on the present value and alter it according to a set rule.
This is in part due to the compatibility of the "phrasing" of operations to act over multiple documents, whether that is the case or not. But what you are asking for is some "variable" operation that allows the "length" of an array to be tested and used as an "input parameter" to another method. This is not supported.
So the best you can do is read the document content and test the length of the array in code, then perform the $slice update as you first surmised. Alternatively, you could use the aggregation framework to work out the "possible lengths" of arrays (assuming a lot of duplication) and then issue "multi" updates for those documents that match the conditions, assuming of course that you want to do this over more than a single document.
First form:
var bulk = db.collection.initializeOrderedBulkOp();
var count = 0;

db.collection.find().forEach(function(doc) {
    if (doc.order.length > 2) {
        bulk.find({ "_id": doc._id })
            .updateOne({
                "$push": {
                    "order": { "$each": [], "$slice": doc.order.length - 2 }
                }
            });
        count++;
    }

    if ((count % 1000) == 0 && (count > 1)) {
        bulk.execute();
        bulk = db.collection.initializeOrderedBulkOp();
    }
});

if (count % 1000 != 0)
    bulk.execute();
Second form:
var bulk = db.collection.initializeOrderedBulkOp();
var count = 0;

db.collection.aggregate([
    { "$group": { "_id": { "$size": "$order" } }},
    { "$match": { "_id": { "$gt": 2 } }}
]).forEach(function(doc) {
    bulk.find({ "order": { "$size": doc._id } })
        .update({
            "$push": {
                "order": { "$each": [], "$slice": doc._id - 2 }
            }
        });
    count++;

    if (count % 1000 == 0) {
        bulk.execute();
        bulk = db.collection.initializeOrderedBulkOp();
    }
});

if (count % 1000 != 0)
    bulk.execute();
Noting that in both cases there is some logic to consider the length of the arrays, so as not to "empty" them or produce an undesired $slice operation.
Another possible alternative is to use the projection form of $slice in the query to get the last n elements, and then $pull the matching elements from the array. Of course the identifiers used for such an operation would have to be unique, but it is a valid approach where uniqueness is assured.
So whatever your case, you cannot do this in a single update statement without some prior knowledge of the current state of the document to be modified. The listings above give you ways of "emulating" this, albeit not in a single statement.
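A quick sketch of the $slice projection + $pull approach just mentioned (my own; someId is a placeholder, and this is only safe when the seed values are unique, as noted above):
// read back the last 2 elements with the $slice projection
var doc = db.collection.findOne(
    { "_id": someId },
    { "seeds": { "$slice": -2 } }
);

// then $pull exactly those values from the array
db.collection.update(
    { "_id": someId },
    { "$pull": { "seeds": { "$in": doc.seeds } } }
);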

MongoDB $inc a NaN value

I have to deal with inconsistent documents in a MongoDB collection where a field might be numeric or might have a NaN value. I need to update it with $inc, but it looks like if the field has a NaN value, $inc has no effect. What options are available for an atomic document update?
Well, this seems to lead to two logical conclusions. The first: if there are NaN values present in a field, how do you identify them? Consider the following sample; let's call the collection "nantest":
{ "_id" : ObjectId("54055993b145d1c015a1ad41"), "n" : NaN }
{ "_id" : ObjectId("540559e8b145d1c015a1ad42"), "n" : Infinity }
{ "_id" : ObjectId("54055b59b145d1c015a1ad43"), "n" : 1 }
{ "_id" : ObjectId("54055ea1b145d1c015a1ad44"), "n" : -Infinity }
So both NaN and Infinity or -Infinity are representative of "non-numbers" that have somehow crept into your data. The best way to find the documents where that field is set this way is to use the $where operator for a JavaScript-evaluated query condition. Not efficient, but it's what you have:
db.nantest.find({
    "$where": "return isNaN(this.n) || Math.abs(this.n) == Infinity"
})
So that gives a way of finding the data that is the problem. From here you could jump through hoops and decide that wherever this is encountered you would just reset it to 0 before incrementing, essentially issuing two update statements, where the first would not match a document if the value were already correct:
db.nantest.update(
    { "$where": "return isNaN(this.n) || Math.abs(this.n) == Infinity" },
    { "$set": { "n": 0 } }
);

db.nantest.update(
    { },
    { "$inc": { "n": 1 } }
);
But really, when you look at that, why would you want to patch your code to cater for this when you can just patch the data? So the logical conclusion is to simply update all the NaN and possibly Infinity values to a standard reset number in one statement:
db.nantest.update(
    { "$where": "return isNaN(this.n) || Math.abs(this.n) == Infinity" },
    { "$set": { "n": 0 } },
    { "multi": true }
);
Run one statement and then you don't have to change your code; simply process increments as you normally would.
If your trouble is knowing which fields have NaN values present in order to issue the updates that fix them, then consider something along the lines of this mapReduce process to inspect the fields:
db.nantest.mapReduce(
    function () {
        var doc = this;
        delete doc._id;
        Object.keys(doc).forEach(function(key) {
            if (isNaN(doc[key]) || Math.abs(doc[key]) == Infinity)
                emit(key, 1);
        });
    },
    function (key, values) {
        return Array.sum(values);
    },
    { "out": { "inline": 1 } }
)
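The inline output is simply a list of { "_id": <fieldName>, "value": <count> } documents (my note); for the sample collection above it would report { "_id" : "n", "value" : 3 }, one entry per affected field.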
You might need to add some complexity to this for more deeply nested documents, but it tells you which fields can possibly contain the errant values, so you can construct update statements to fix them.
It would seem that rather than bending your code to suit this, you "should be" doing the following:
Find the source that is causing the numbers to appear and fix that.
Identify the field or fields that contain these values.
Run a one-off update statement to fix the data all at once.
That is minimal messing with your code, and it fixes both the "source" of the problem and the "result" of the data corruption it introduced.
