MongoDB mapReduce method unexpected results - javascript

I have 100 documents in my MongoDB collection. Each of them may be a duplicate of other document(s) under different conditions, such as firstName & lastName, email, and mobile phone.
I am trying to mapReduce these 100 documents into key-value pairs, like grouping.
Everything works fine until the 101st duplicate record is added to the DB.
The output of the mapReduce result for the other documents that are duplicates of the 101st record is corrupted.
For example:
I am working on firstName & lastName now.
When the DB contains 100 documents, I get a result containing
{
    _id: {
        firstName: "foo",
        lastName: "bar"
    },
    value: {
        count: 20,
        duplicate: [{
            id: ObjectId("/*an object id*/"),
            fullName: "foo bar",
            DOB: ISODate("2000-01-01T00:00:00.000Z")
        },{
            id: ObjectId("/*another object id*/"),
            fullName: "foo bar",
            DOB: ISODate("2000-01-02T00:00:00.000Z")
        },...]
    }
}
It is exactly what I want, but...
when the DB contains more than 100 possibly duplicated documents, the result becomes like this.
Let's say the 101st document is
{
    firstName: "foo",
    lastName: "bar",
    email: "foo@bar.com",
    mobile: "019894793"
}
When the DB contains 101 documents:
{
    _id: {
        firstName: "foo",
        lastName: "bar"
    },
    value: {
        count: 21,
        duplicate: [{
            id: undefined,
            fullName: undefined,
            DOB: undefined
        },{
            id: ObjectId("/*another object id*/"),
            fullName: "foo bar",
            DOB: ISODate("2000-01-02T00:00:00.000Z")
        }]
    }
}
When the DB contains 102 documents:
{
    _id: {
        firstName: "foo",
        lastName: "bar"
    },
    value: {
        count: 22,
        duplicate: [{
            id: undefined,
            fullName: undefined,
            DOB: undefined
        },{
            id: undefined,
            fullName: undefined,
            DOB: undefined
        }]
    }
}
I found another topic on Stack Overflow with a similar issue to mine, but the answer does not work for me:
MapReduce results seem limited to 100?
Any ideas?
Edit:
Original source code:
var map = function () {
    var value = {
        count: 1,
        userId: this._id
    };
    emit({lastName: this.lastName, firstName: this.firstName}, value);
};

var reduce = function (key, values) {
    var reducedObj = {
        count: 0,
        userIds: []
    };
    values.forEach(function (value) {
        reducedObj.count += value.count;
        reducedObj.userIds.push(value.userId);
    });
    return reducedObj;
};
Source code now:
var map = function () {
    var value = {
        count: 1,
        users: [this]
    };
    emit({lastName: this.lastName, firstName: this.firstName}, value);
};

var reduce = function (key, values) {
    var reducedObj = {
        count: 0,
        users: []
    };
    values.forEach(function (value) {
        reducedObj.count += value.count;
        reducedObj.users = reducedObj.users.concat(values.users); // or using the forEach method
        // value.users.forEach(function (user) {
        //     reducedObj.users.push(user);
        // });
    });
    return reducedObj;
};
I don't understand why it would fail, as I was also pushing a value (userId) to reducedObj.userIds.
Is there a problem with the value that I emit in the map function?

Explaining the problem
This is a common mapReduce trap, but clearly part of the problem you have here is that the questions you are finding don't have answers that explain this clearly or even properly. So an answer is justified here.
The point that is often missed, or at least misunderstood, is this part of the documentation:
MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for that key will become one of the input values to the next reduce function invocation for that key.
And adding to that just a little later down the page:
the type of the return object must be identical to the type of the value emitted by the map function.
What this means in the context of your question is that at a certain point there are "too many" duplicate key values being passed in for a reduce stage to act on them in one single pass, as it can do for a smaller number of documents. By design, the reduce method is called multiple times, often taking the "output" from data that is already reduced as part of its "input" for yet another pass.
This is how mapReduce is designed to handle very large datasets, by processing everything in "chunks" until it finally "reduces" down to a single grouped result per key. This is also why the second statement matters: what comes out of emit and what comes out of reduce need to be structured exactly the same, so that the reduce code can handle either one as input.
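To make that concrete, here is a rough illustrative sketch (not code from the question or the answer) of what a later reduce pass sees with the original userId/userIds code: the previously reduced value has no userId field, so undefined gets pushed.

// Hypothetical illustration of a re-reduce pass with the ORIGINAL code.
// A first pass might reduce many emitted values into this shape:
var previouslyReduced = { count: 100, userIds: [ /* ...ids from pass one... */ ] };

// A later pass then receives that object alongside freshly emitted values:
var values = [
    previouslyReduced,                    // has "userIds", but no "userId"
    { count: 1, userId: "someObjectId" }  // freshly emitted value
];

var reducedObj = { count: 0, userIds: [] };
values.forEach(function (value) {
    reducedObj.count += value.count;
    reducedObj.userIds.push(value.userId); // pushes undefined for previouslyReduced
});
// reducedObj.userIds is now [undefined, "someObjectId"] -- the corruption seen in the question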
Solving the problem
You correct this by fixing up both how you emit the data in the map function and how you return and process it in the reduce function:
db.collection.mapReduce(
    function() {
        emit(
            { "firstName": this.firstName, "lastName": this.lastName },
            { "count": 1, "duplicate": [this] }    // Note [this]
        )
    },
    function(key,values) {
        var reduced = { "count": 0, "duplicate": [] };
        values.forEach(function(value) {
            reduced.count += value.count;
            value.duplicate.forEach(function(duplicate) {
                reduced.duplicate.push(duplicate);
            });
        });
        return reduced;
    },
    {
        "out": { "inline": 1 }
    }
)
The key points can be seen in both the content of the emit and the first line of the reduce function. Essentially these present a structure that is the same. In the case of the emit it does not matter that the array being produced only has a single element, but you send it that way anyhow. Side by side:
{ "count": 1, "duplicate": [this] } // Note [this]
// Same as
var reduced = { "count": 0, "duplicate": [] };
That also means that the remainder of the reduce function will always assume that the "duplicate" content is in fact an array, because that is how it came as original input and is also how it will be returned:
values.forEach(function(value) {
    reduced.count += value.count;
    value.duplicate.forEach(function(duplicate) {
        reduced.duplicate.push(duplicate);
    });
});
return reduced;
Alternate Solution
The other reason for an answer is that, considering the output you are expecting, this would in fact be much better suited to the aggregation framework. It's going to do this a lot faster than mapReduce can, and is far simpler to code up:
db.collection.aggregate([
    { "$group": {
        "_id": { "firstName": "$firstName", "lastName": "$lastName" },
        "duplicate": { "$push": "$$ROOT" },
        "count": { "$sum": 1 }
    }},
    { "$match": { "count": { "$gt": 1 } }}
])
That's all it is. You can write out to a collection by adding an $out stage where required. But basically with either mapReduce or aggregate, you are still placing the same 16MB restriction on the document size by adding your "duplicate" items into an array.
Also note that you can do something here that mapReduce cannot, and simply "omit" from the results any items that are not in fact a "duplicate". The mapReduce method cannot do this without first producing output to a collection and then "filtering" the results in a separate query.
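As a small sketch of that $out variant (the target collection name "duplicates" is an assumption for illustration, not from the original answer):

// Same pipeline as above, persisting the grouped duplicates to a collection.
db.collection.aggregate([
    { "$group": {
        "_id": { "firstName": "$firstName", "lastName": "$lastName" },
        "duplicate": { "$push": "$$ROOT" },
        "count": { "$sum": 1 }
    }},
    { "$match": { "count": { "$gt": 1 } }},
    { "$out": "duplicates" }   // $out must be the final stage of the pipeline
])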
The core documentation itself notes:
NOTE
For most aggregation operations, the Aggregation Pipeline provides better performance and more coherent interface. However, map-reduce operations provide some flexibility that is not presently available in the aggregation pipeline.
So it's really a case of weighing up which is better suited to the problem at hand.

Related

dynamical key & mongodb unclarities

let time = "12:00";

"10:00": {
    name: "john",
    status: "registered"
},
"11:00": {
    name: "jane",
    status: "pending"
},
"12:00": {
    name: "joe",
    status: "denied"
}
How do I find() the data in MongoDB by its dynamically changing key variable (time in this case)?
Also, how do I access the name key afterwards like in plain JavaScript (someVar[time].name)? With the results of find().toArray I can't really do that. Using storedJson[2].name after storedJson = JSON.parse(JSON.stringify(result)), where result is the callback result of dbo.collection(someCollection).find({}, { projection: { _id: 0 } }).toArray((err, result) => ...), seems a bit off and not how this should be done. (I'm also curious whether it is unwise to save data that way, and would therefore be open to a suggested alternative.)
EDIT:
time: [time],
content: {
    name: "john",
    status: "pending"
}
This also isn't really an option, since I'm struggling to grasp .toArray, which forces me to use someVar[0].content.name instead of someVar[time].content.name.
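A minimal sketch of the pattern being asked about (the collection name and document shape are assumptions taken from the edit above, purely for illustration): query by the dynamic value with ordinary bracket/variable syntax, then re-key the toArray result so it can be addressed as someVar[time].

// Sketch only: assumes documents shaped like { time: "12:00", content: { name, status } }.
const time = "12:00";

dbo.collection("someCollection")
    .find({ time: time }, { projection: { _id: 0 } })   // the dynamic value is just a variable here
    .toArray((err, result) => {
        if (err) throw err;

        // Re-key the array by its "time" field so lookups read like plain JS objects.
        const byTime = {};
        result.forEach(doc => { byTime[doc.time] = doc.content; });

        console.log(byTime[time].name); // e.g. "joe"
    });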

Javascript - Access nested elements of JSON object

So I have a series of 4 JSON objects with nested data inside each of them. Each of these objects is stored in an array called classes. Here is an example of how one of the class objects is formatted:
let class_A = {
    professor: "Joey Smith",
    numberStudents: 25,
    courseCode: "COMS 2360",
    seating: {
        "FirstRow": {
            0: {
                firstName: "Sarah",
                collegeMajor: "English",
            },
            1: {
                firstName: "Bob",
                collegeMajor: "Computer Engineering",
            },
            2: {
                firstName: "Dylan",
                collegeMajor: "Mathematics",
            }
        },
        "SecondRow": {
            3: {
                firstName: "Molly",
                collegeMajor: "Music"
            }
        }
    }
};
I'm struggling to figure out how to access the very last fields within each class object (firstName and collegeMajor). The furthest I was able to get was the indexes beneath each row number.
let classes = [class_A, class_B, class_C, class_D];
let classesAvailable = document.getElementById('classes');
let class = classes[classesAvailable.value];

for (rowNum in class.seating) {
    for (index in class.seating[rowNum]) {
        console.log(index);
        //console.log(class.seating[rowNum[index]].firstName);
    }
}
So in this example, console.log(index) prints out:
0
1
2
3
but I'm unable to print the first name and college major of each student in each row. I was trying to follow a similar logic and do console.log(class.seating[rowNum[index]].firstName) but I get the error:
Cannot read properties of undefined (reading 'firstName')
Was wondering if anyone knows what's wrong with my logic here?
console.log(class.seating[rowNum][index])
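A fuller sketch of that fix (note: class is a reserved word in JavaScript, so the variable is renamed here as an assumption; the rest follows the question's own structure):

// Index with two separate lookups, [rowNum][index], not rowNum[index].
// "selectedClass" is a hypothetical rename of the question's "class" variable,
// since "class" is reserved and would throw a SyntaxError.
let classes = [class_A, class_B, class_C, class_D];
let classesAvailable = document.getElementById('classes');
let selectedClass = classes[classesAvailable.value];

for (const rowNum in selectedClass.seating) {
    for (const index in selectedClass.seating[rowNum]) {
        const student = selectedClass.seating[rowNum][index];
        console.log(student.firstName, student.collegeMajor);
    }
}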

How to change the location of an object key value pair in JavaScript

I've seen similar questions to this one but in different languages and I am struggling to create a JavaScript equivalent.
I am receiving an object and through a function I want to change the location of one (or more) of the properties. For example,
With the original object of
{
    individual: [
        {
            dob: '2017-01-01',
            isAuthorized: true,
        },
    ],
    business: [
        {
            taxId: '123',
        },
    ],
    product: {
        code: '123',
    },
}
I would like to change the location of isAuthorized to be in the first object inside of the business array instead of individual.
Like so
{
    individual: [
        {
            dob: '2017-01-01',
        },
    ],
    business: [
        {
            taxId: '123',
            isAuthorized: true,
        },
    ],
    product: {
        code: '123',
    },
}
So far I was trying to create an object that would contain the key name and location to change it to, e.g.
{
    isAuthorized: obj.business[0]
}
And then loop over the original object as well as the object with the location values and then set the location of that key value pair.
Basically, in this function I want to see that if the original object contains a certain value (in this case isAuthorized) that it will take that key value pair and move it to the desired location.
What you want can easily be achieved by using lodash. Here's a working snippet of how to restructure the object based on a defined structure map. Extend this example to match what you want.
The example does a deep clone; if you are fine with modifying the original object, skip that step to avoid the overhead.
// input data
const data = {
    individual: [
        {
            dob: '2017-01-01',
            isAuthorized: true,
        },
    ],
    business: [
        {
            taxId: '123',
        },
    ],
    product: {
        code: '123',
    },
};

// the structure change map
const keyMap = {
    'individual[0].isAuthorized': 'business[0].isAuthorized'
};

function parseData(data, keyMap) {
    const newData = _.cloneDeep(data);
    for (let [source, dest] of Object.entries(keyMap)) {
        _.set(newData, dest, _.get(newData, source));
        _.unset(newData, source);
    }
    return newData;
}

console.log(parseData(data, keyMap));
<script src="https://cdnjs.cloudflare.com/ajax/libs/lodash.js/4.17.15/lodash.min.js"></script>
Note: lodash's set considers any numeric value as an array index, so if you are using a numeric object key then use lodash's setWith instead. I recommend reading the examples in the docs for a better understanding.
https://lodash.com/docs/4.17.15#set
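For instance, a small sketch based on the lodash docs (not part of the original answer) showing the difference:

// Plain _.set treats the numeric path segment as an array index...
_.set({}, '[0][1]', 'a');                 // => { '0': [ <empty>, 'a' ] }

// ...while _.setWith with an Object customizer keeps it as a plain object key.
_.setWith({}, '[0][1]', 'a', Object);     // => { '0': { '1': 'a' } }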

Can I index an object in a mongo query?

With a compound index like the following
db.data.ensureIndex({ userId: 1, myObject: 1 })
Will the following find use this index?
db.data.find({
    userId: 1,
    myObject: {
        a: 'test',
        b: 'test2'
    }
})
I know that in JavaScript objects don't preserve their order, so is this the same in Mongo?
So what will happen if I have documents with objects in different orders, like this?
{
    _id: ObjectId,
    userId: 1,
    myObject: {
        b: 'test',
        a: 'test2'
    }
},
{
    _id: ObjectId,
    userId: 1,
    myObject: {
        b: 'test',
        a: 'test2'
    }
},
{
    _id: ObjectId,
    userId: 2,
    myObject: {
        b: 'test',
        a: 'test2'
    }
}
Does the order of the properties matter when indexing an object like above?
EDIT:
In the documentation, http://docs.mongodb.org/manual/core/index-single/ it says "When performing equality matches on subdocuments, field order matters and the subdocuments must match exactly."
And in their example, the following would work
db.factories.find( { metro: { city: "New York", state: "NY" } } )
but if the metro fields were the other way around it won't work:
db.factories.find( { metro: { state: "NY", city: "New York" } } )
So how do you preserve the property order in JavaScript/Node when the language itself doesn't support it?
Actually, JavaScript or JSON notation objects do preserve their order. While some top level properties might shuffle around if MongoDB has to move them as a document grows, generally speaking this "sub-level" document should not change from the order in which it was originally inserted.
That said, some languages try to order their "Hash", "Map", "Dict" keys by default, and therefore it becomes important to find an "ordered" or "indexed" form when dealing with these, but that is another issue.
To your question of an "index" in MongoDB affecting this order, no it does not. In fact the index you have added, though it does not error as such, does not really have any value, as an Object or {} essentially does not have a value for an index to act on.
What you can do is this:
db.data.ensureIndex({ "userId": 1, "myObject.a": 1 })
Which keeps an ascending index on the "a" property of your object. That is useful if this is regularly viewed in queries.
You can also have problems with positions if you attempt to query like this:
db.data.find({ "myObject": { a: "test2", b: "test" } })
Where the actual stored keys are not in order. But this is typically fixed by the "dot notation" form:
db.data.find({ "myObject.a": "test2", "myObject.b": "test" })
Which will not care about the actual order of keys.
But the index you defined does not keep the "keys" in insertion order; that is entirely language specific.
The order matters for indexes on sub-documents. To quote the documentation:
"This query returns the above document. When performing equality matches on subdocuments, field order matters and the subdocuments must match exactly. For example, the following query does not match the above document."
Consider this document:
{
    _id: ObjectId(...),
    metro: {
        city: "New York",
        state: "NY"
    },
    name: "Giant Factory"
}
and this index:
db.factories.ensureIndex( { metro: 1 } )
this query matches:
db.factories.find( { metro: { city: "New York", state: "NY" } } )
and this one does not:
db.factories.find( { metro: { state: "NY", city: "New York" } } )
See http://docs.mongodb.org/manual/core/index-single/#indexes-on-subdocuments
and http://docs.mongodb.org/manual/reference/method/db.collection.find/#query-subdocuments

Get document's placement in collection based on sort order

I'm new to MongoDB (+Mongoose). I have a collection of highscores with documents that look like this:
{id: 123, user: 'User14', score: 101}
{id: 231, user: 'User10', score: 400}
{id: 412, user: 'User90', score: 244}
{id: 111, user: 'User12', score: 310}
{id: 221, user: 'User88', score: 900}
{id: 521, user: 'User13', score: 103}
+ thousands more...
now I'm getting the top 5 players like so:
highscores
.find()
.sort({'score': -1})
.limit(5)
.exec(function(err, users) { ...code... });
which is great, but I would also like to make a query like "What placement does user12 have on the highscore list?"
Is that possible to achieve with a query somehow?
If you don't have to get the placement in real time, Neil Lunn's answer is perfect. But if your app constantly inserts new data into this collection, you can't get the placement for the new data that way.
Here is another solution:
Firstly, add an index on the score field in this collection. Then use the query db.highscores.count({ score: { $gt: <user's score> } }). It counts the documents whose score is greater than the target; that count (plus one) is the placement.
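A minimal sketch of that count-based approach (the Mongoose-style calls and variable names are assumptions for illustration, matching the model used in the question):

// Sketch only: rank = 1 + number of players with a strictly higher score.
// Assumes an index on the score field, as recommended above.
highscores.findOne({ user: 'User12' }, function (err, player) {
    if (err || !player) return console.error(err || 'user not found');
    highscores.count({ score: { $gt: player.score } }, function (err, higher) {
        if (err) return console.error(err);
        console.log('User12 placement:', higher + 1);
    });
});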
It is possible to do this with mapReduce, but it does require that you have an index on the sorted field. So first, if you have not already done so:
db.highscores.ensureIndex({ "score": -1 })
Then you can do this:
db.highscores.mapReduce(
    function() {
        emit( null, this.user );
    },
    function(key,values) {
        return values.indexOf("User12") + 1;
    },
    {
        "sort": { "score": -1 },
        "out": { "inline": 1 }
    }
)
Or vary that to return whatever information you need other than simply the "ranking" position. But since that is basically putting everything into one large array that has already been sorted by score, it probably will not give the best performance for any reasonable size of data.
A better solution would be to maintain a separate "rankings" collection, which you can again update periodically with mapReduce, even though it does not do any reducing:
db.highscores.mapReduce(
    function() {
        ranking++;
        emit( ranking, this );
    },
    function() {},
    {
        "sort": { "score": -1 },
        "scope": { "ranking": 0 },
        "out": {
            "replace": "rankings"
        }
    }
)
Then you can query this collection in order to get your results:
db.rankings.find({ "value.user": "User12" })
So that would contain the ranking as "emitted" in the _id field of the "rankings" collection.
