{
    "_id" : ObjectId("59786a62a96166007d7e364dsadasfafsdfsdgdfgfd"),
    "someotherdata" : {
        "place1" : "lwekjfrhweriufesdfwergfwr",
        "place2" : "sgfertgryrctshyctrhysdthc ",
        "place3" : "sdfsdgfrdgfvk",
        "place4" : "asdfkjaseeeeeeeeeeeeeeeeefjnhwklegvds."
    }
}
I have thousands of these in my collection. I need to look through all the someotherdata and do the following:
Check whether each field is present (in some records I have place1 and not place4)
Find the longest record (in terms of string length)
The output must look something like this (showing the character count for the longest):
{
    place1: 123,
    place2: 12,
    place3: 17,
    place4: 445
}
I'm using MongoDB 3.2.9, so I don't have access to the newer aggregation operators. But I do have the MongoDB shell.
EDIT: To be clear, I want the longest throughout the whole collection. So there might be 1000 documents, but only one result with the longest length for each field across the whole collection.
Use .mapReduce() for this to reduce down to the largest values for each key:
db.collection.mapReduce(
function() {
emit(null,
Object.keys(this.someotherdata).map(k => ({ [k]: this.someotherdata[k].length }))
.reduce((acc,curr) => Object.assign(acc,curr),{})
);
},
function(key,values) {
var result = {};
values.forEach(value => {
Object.keys(value).forEach(k => {
if (!result.hasOwnProperty(k))
result[k] = 0;
if ( value[k] > result[k] )
result[k] = value[k];
});
});
return result;
},
{
"out": { "inline": 1 },
"query": { "someotherdata": { "$exists": true } }
}
)
Which basically emits the "length" of each key present in the sub-document path for each document, and then in the "reduce" stage only the largest "length" for each key is actually returned.
Note that in mapReduce you need to put out the same structure you put in, since the way it deals with a large number of documents is by "reducing" in gradual batches. Which is why we emit in numeric form, just like the "reduce" function does.
This gives the following output for the document shown in the question. Of course it becomes the "max" over all documents in the collection when you have more.
{
    "_id" : null,
    "value" : {
        "place1" : 25.0,
        "place2" : 26.0,
        "place3" : 13.0,
        "place4" : 38.0
    }
}
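If you just want the plain object of lengths in the shell, the inline result can be unwrapped from the results array that mapReduce returns (a small usage sketch; mapper and reducer here simply stand in for the same map and reduce functions shown above):

var res = db.collection.mapReduce(mapper, reducer, {
    "out": { "inline": 1 },
    "query": { "someotherdata": { "$exists": true } }
});

// with a null emit key there is only one entry in res.results
printjson(res.results[0].value);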
For those interested, the context of the question is in fact that the features of MongoDB 3.4 were not available to them. But to do the same thing using .aggregate() where the features are available:
db.collection.aggregate([
{ "$match": { "someotherdata": { "$exists": true } } },
{ "$project": {
"_id": 0,
"someotherdata": {
"$map": {
"input": { "$objectToArray": "$someotherdata" },
"as": "s",
"in": { "k": "$$s.k", "v": { "$strLenCP": "$$s.v" } }
}
}
}},
{ "$unwind": "$someotherdata" },
{ "$group": {
"_id": "$someotherdata.k",
"v": { "$max": "$someotherdata.v" }
}},
{ "$sort": { "_id": 1 } },
{ "$group": {
"_id": null,
"data": {
"$push": { "k": "$_id", "v": "$v" }
}
}},
{ "$replaceRoot": {
"newRoot": {
"$arrayToObject": "$data"
}
}}
])
With the same output:
{
    "place1" : 25,
    "place2" : 26,
    "place3" : 13,
    "place4" : 38
}
Use cursor.forEach() to iterate through the collection.
Keep track of the longest placeN values (starting from -1 and updating whenever a greater length is found), then print the values with print() or printjson().
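A minimal sketch of that approach in the shell (a hedged illustration; it assumes the someotherdata layout from the question and that the values are strings):

var longest = {};

db.collection.find({ "someotherdata": { "$exists": true } }).forEach(function(doc) {
    Object.keys(doc.someotherdata).forEach(function(k) {
        // start at -1 so any real string length replaces it
        if ( !longest.hasOwnProperty(k) )
            longest[k] = -1;
        var len = doc.someotherdata[k].length;
        if ( len > longest[k] )
            longest[k] = len;
    });
});

printjson(longest);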
I am having some problems altering the schema I am using for a time-series database I have constructed using MongoDB. Currently, I have records like the one shown below:
{
    "_id" : 20,
    "name" : "Bob",
    "location" : "London",
    "01/01/1993" : {
        "height" : "110cm",
        "weight" : "60kg"
    },
    "02/01/1993" : {
        "height" : "112cm",
        "weight" : "61kg"
    }
}
I wish to use the aggregation framework to create several records for each "person", one for each "time-value" subdocument in the original record:
{
    "_id" : 20,
    "name" : "Bob",
    "date" : "01/01/1993",
    "location" : "London",
    "height" : "110cm",
    "weight" : "60kg"
},
{
    "_id" : 20,
    "name" : "Bob",
    "date" : "02/01/1993",
    "location" : "London",
    "height" : "112cm",
    "weight" : "61kg"
}
The new schema should be much more efficient when adding a large number of time-series values to each record, and I shouldn't run into the maximum document size error!
Any help on how to do this using the MongoDB aggregation pipeline would be greatly appreciated!
Whilst there are functions in modern releases of the Aggregation Framework that can allow you to do this sort of thing, mileage may vary as to whether it is actually the best solution for this.
In essence you can create an array of entries from the document keys, excluding the other top-level keys, which are then carried over into each new document. That array can then be processed with $unwind and the whole result reshaped into new documents:
db.getCollection('input').aggregate([
{ "$project": {
"name": 1,
"location": 1,
"data": {
"$filter": {
"input": { "$objectToArray": "$$ROOT" },
"as": "d",
"cond": {
"$not": { "$in": [ "$$d.k", ["_id","name","location"] ] }
}
}
}
}},
{ "$unwind": "$data" },
{ "$replaceRoot": {
"newRoot": {
"$arrayToObject": {
"$concatArrays": [
[{ "k": "id", "v": "$_id" },
{ "k": "name", "v": "$name" },
{ "k": "location", "v": "$location" },
{ "k": "date", "v": "$data.k" }],
{ "$objectToArray": "$data.v" }
]
}
}
}},
{ "$out": "output" }
])
Or alternatively, do all the reshaping in the initial $project within the array elements produced:
db.getCollection('input').aggregate([
{ "$project": {
"_id": 0,
"data": {
"$map": {
"input": {
"$filter": {
"input": { "$objectToArray": "$$ROOT" },
"as": "d",
"cond": {
"$not": { "$in": [ "$$d.k", ["_id", "name", "location"] ] }
}
}
},
"as": "d",
"in": {
"$arrayToObject": {
"$concatArrays": [
{ "$filter": {
"input": { "$objectToArray": "$$ROOT" },
"as": "r",
"cond": { "$in": [ "$$r.k", ["_id", "name", "location"] ] }
}},
[{ "k": "date", "v": "$$d.k" }],
{ "$objectToArray": "$$d.v" }
]
}
}
}
}
}},
{ "$unwind": "$data" },
{ "$replaceRoot": { "newRoot": "$data" } },
{ "$out": "output" }
])
So you use $objectToArray and $filter in order to make an array from the keys which actually contain the data points for each date.
After $unwind we basically apply $arrayToObject on a set of named keys in the "array format" in order to construct the newRoot for $replaceRoot and then write to the new collection, as one new document for each data key using $out.
That may only get you part of the way though, as you really should change the "date" data to a BSON Date. It takes a lot less storage space and is easier to query as well.
var updates = [];
db.getCollection('output').find().forEach( d => {
updates.push({
"updateOne": {
"filter": { "_id": d._id },
"update": {
"$set": {
"date": new Date(
Date.UTC.apply(null,
d.date.split('/')
.reverse().map((e,i) => (i == 1) ? parseInt(e)-1: parseInt(e) )
)
)
}
}
}
});
if ( updates.length >= 500 ) {
db.getCollection('output').bulkWrite(updates);
updates = [];
}
})
if ( updates.length != 0 ) {
db.getCollection('output').bulkWrite(updates);
updates = [];
}
Of course, if your MongoDB server lacks those aggregation features, then you are better off just writing the output to a new collection by iterating over the source collection in the first place:
var output = [];
db.getCollection('input').find().forEach( d => {
output = [
...output,
...Object.keys(d)
.filter(k => ['_id','name','location'].indexOf(k) === -1)
.map(k => Object.assign(
{
id: d._id,
name: d.name,
location: d.location,
date: new Date(
Date.UTC.apply(null,
k.split('/')
.reverse().map((e,i) => (i == 1) ? parseInt(e)-1: parseInt(e) )
)
)
},
d[k]
))
];
if ( output.length >= 500 ) {
db.getCollection('output').insertMany(output);
output = [];
}
})
if ( output.length != 0 ) {
db.getCollection('output').insertMany(output);
output = [];
}
In either of those cases we want to apply Date.UTC to the reversed string elements from the existing "string" based date, and get a value that can be cast into a BSON Date.
The aggregation framework itself does not allow casting of types, so the only solution for that part ( and it is a necessary part ) is to actually loop and update, but using the bulk forms at least makes the loop and update efficient.
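As a standalone illustration of that conversion (a hypothetical snippet, using one of the date keys from the sample data):

// "02/01/1993" is day/month/year, so reversed it becomes [ "1993", "01", "02" ]
var parts = "02/01/1993".split('/').reverse()
    .map(function(e,i) { return ( i == 1 ) ? parseInt(e) - 1 : parseInt(e); });

// Date.UTC(1993, 0, 2) -- the month argument is zero-based, hence the -1 above
var asDate = new Date(Date.UTC.apply(null, parts));   // 1993-01-02T00:00:00Z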
Either case gives you the same end output:
/* 1 */
{
    "_id" : ObjectId("599275b1e38f41729f1d64fe"),
    "id" : 20.0,
    "name" : "Bob",
    "location" : "London",
    "date" : ISODate("1993-01-01T00:00:00.000Z"),
    "height" : "110cm",
    "weight" : "60kg"
}

/* 2 */
{
    "_id" : ObjectId("599275b1e38f41729f1d64ff"),
    "id" : 20.0,
    "name" : "Bob",
    "location" : "London",
    "date" : ISODate("1993-01-02T00:00:00.000Z"),
    "height" : "112cm",
    "weight" : "61kg"
}
An example document from a collection:
{ "teamAlpha": { }, "teamBeta": { }, "leader_name": "leader" }
For such a document, I would like to remove all fields that start with "team", so the expected result is
{leader_name: "leader"}
I am currently using a function:
db.teamList.find().forEach(
function(document) {
for(var k in document) {
if (k.startsWith('team')) {
delete document[k];
}
}
db.teamList.save(document);
}
);
I am wondering if there is a better approach for this problem.
It would be "better" to instead determine all the possible keys beforehand and then issue a single "multi" update to remove all the keys. Depending on the available MongoDB version there would be different approaches.
MongoDB 3.4: $objectToArray
let fields = db.teamList.aggregate([
{ "$project": {
"_id": 0,
"fields": {
"$map": {
"input": {
"$filter": {
"input": { "$objectToArray": "$$ROOT" },
"as": "d",
"cond": { "$eq": [{ "$substrCP": [ "$$d.k", 0, 4 ] }, "team" ] }
}
},
"as": "f",
"in": "$$f.k"
}
}
}},
{ "$unwind": "$fields" },
{ "$group": { "_id": "$fields" } }
])
.map( d => ({ [d._id]: "" }))
.reduce((acc,curr) => Object.assign(acc,curr),{})
db.teamList.updateMany({},{ "$unset": fields });
The .aggregate() statement turns the fields in the document into an array via $objectToArray and then applies $filter to only return those where the first four letters of the "key" match the string "team". This is then processed with $unwind and $group to make a "unique list" of the matching fields.
The subsequent instructions merely process that list returned in the cursor into a single object like:
{
"teamBeta" : "",
"teamAlpha" : ""
}
Which is then passed to $unset to remove those fields from all documents.
Earlier Versions: mapReduce
var fields = db.teamList.mapReduce(
function() {
Object.keys(this).filter( k => /^team/.test(k) )
.forEach( k => emit(k,1) );
},
function() {},
{ "out": { "inline": 1 } }
)
.results.map( d => ({ [d._id]: "" }))
.reduce((acc,curr) => Object.assign(acc,curr),{})
db.teamList.update({},{ "$unset": fields },{ "multi": true });
Same basic thing; the only difference demonstrated is that where .updateMany() does not exist as a method, we simply call .update() with the "multi" parameter so it applies to all matched documents, which is all the newer API call actually does.
Beyond those Options
It certainly is not wise to iterate all documents simply to remove fields, and therefore either of the above would be the "preferred" approach. The only possible failing is that constructing the "distinct list" of keys actually exceeds the 16MB BSON limit. That is pretty extreme, but depending on the actual data it is possible.
Therefore there are essentially "two extensions" that naturally apply to the techniques:
Use the "cursor" with .aggregate()
var fields = [];
db.teamList.aggregate([
{ "$project": {
"_id": 0,
"fields": {
"$map": {
"input": {
"$filter": {
"input": { "$objectToArray": "$$ROOT" },
"as": "d",
"cond": { "$eq": [{ "$substrCP": [ "$$d.k", 0, 4 ] }, "team" ] }
}
},
"as": "f",
"in": "$$f.k"
}
}
}},
{ "$unwind": "$fields" },
{ "$group": { "_id": "$fields" } }
]).forEach( d => {
fields.push(d._id);
if ( fields.length >= 2000 ) {
db.teamList.updateMany({},
{ "$unset":
fields.reduce((acc,curr) => Object.assign(acc,{ [curr]: "" }),{})
}
);
fields = [];
}
});
if ( fields.length > 0 ) {
db.teamList.updateMany({},
{ "$unset":
fields.reduce((acc,curr) => Object.assign(acc,{ [curr]: "" }),{})
}
);
}
Where this would essentially "batch" the number of fields as processed on the "cursor" into lots of 2000, which "should" stay well under the 16MB BSON limit as a request.
Use a temporary collection with mapReduce()
db.teamList.mapReduce(
function() {
Object.keys(this).filter( k => /^team/.test(k) )
.forEach( k => emit(k,1) );
},
function() {},
{ "out": { "replace": "tempoutput" } }
);
var fields = [];
db.tempoutput.find({},{ "_id": 1 }).forEach(d => {
fields.push(d._id);
if ( fields.length >= 2000 ) {
db.teamList.update({},
{ "$unset":
fields.reduce((acc,curr) => Object.assign(acc,{ [curr]: "" }),{})
},
{ "multi": true }
);
fields = [];
}
});
if ( fields.length > 0 ) {
db.teamList.update({},
{ "$unset":
fields.reduce((acc,curr) => Object.assign(acc,{ [curr]: "" }),{})
},
{ "multi": true }
);
}
Where it is again essentially the same process, except as mapReduce cannot output to a "cursor", you need to output to a temporary collection consisting of only the "distinct field names" and then iterate the cursor from that collection in order to process in the same "batch" manner.
Just like the initial approaches, these are much more performant options than iterating the whole collection and making adjustments to each document individually. They generally should not be necessary, since the likelihood of any "distinct list" actually causing a single update request to exceed 16MB is indeed extreme. But this would again be the "preferred" way to handle such an extreme case.
General
Of course if you simply know all the field names and do not need to work them out by examining the collection, then simply write the statement with the known names:
db.teamList.update({},{ "$unset": { "teamBeta": "", "teamAlpha": "" } },{ "multi": true })
Which is perfectly valid, because all that the other statements are doing is working out what those names should be for you.
The structure of the objects stored in MongoDB is the following:
obj = {_id: "55c898787c2ab821e23e4661", ingredients: [{name: "ingredient1", value: "70.2"}, {name: "ingredient2", value: "34"}, {name: "ingredient3", value: "15.2"}, ...]}
What I would like to do is retrieve all documents whose value for a specific ingredient is greater than an arbitrary number.
To be more specific, suppose we want to retrieve all the documents which contain an ingredient with name "ingredient1" whose value is greater than 50.
Trying the following, I couldn't retrieve the desired results:
var collection = db.get('docs');
var queryTest = collection.find({$where: 'this.ingredients.name == "ingredient1" && parseFloat(this.ingredients.value) > 50'}, function(e, docs) {
console.log(docs);
});
Does anyone know what is the correct query to condition upon specific array element names and values?
Thanks!
You really don't need the JavaScript evaluation of $where here, just use basic query operators with an $elemMatch query for the array. While it is true that the "value" elements here are in fact strings, this is not really the point ( as I explain at the end of this ). The main point is to get it right the first time:
collection.find(
{
"ingredients": {
"$elemMatch": {
"name": "ingredient1",
"value": { "$gt": 50 }
}
}
},
{ "ingredients.$": 1 }
)
The $ in the second part is the positional operator, which projects only the matched element of the array from the query conditions.
This is also considerably faster than the JavaScript evaluation, both because the evaluation code does not need to be compiled and native coded operators are used, and because an "index" can be used on the "name" and even "value" elements of the array to aid in filtering the matches.
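For example, a compound multikey index on those embedded array fields lets that $elemMatch query use an index instead of scanning every document (a hedged sketch; it assumes the underlying collection is named docs, as in db.get('docs') from the question):

// Multikey compound index on the array's embedded fields
db.docs.createIndex({ "ingredients.name": 1, "ingredients.value": 1 })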
If you expect more than one match in the array, then the .aggregate() command is the best option. With modern MongoDB versions this is quite simple:
collection.aggregate([
{ "$match": {
"ingredients": {
"$elemMatch": {
"name": "ingredient1",
"value": { "$gt": 50 }
}
}
}},
{ "$redact": {
"$cond": {
"if": {
"$and": [
{ "$eq": [ { "$ifNull": [ "$name", "ingredient1" ] }, "ingredient1" ] },
{ "$gt": [ { "$ifNull": [ "$value", 60 ] }, 50 ] }
]
},
"then": "$$DESCEND",
"else": "$$PRUNE"
}
}}
])
And even simpler in forthcoming releases, which introduce the $filter operator:
collection.aggregate([
{ "$match": {
"ingredients": {
"$elemMatch": {
"name": "ingredient1",
"value": { "$gt": 50 }
}
}
}},
{ "$project": {
"ingredients": {
"$filter": {
"input": "$ingredients",
"as": "ingredient",
"cond": {
"$and": [
{ "$eq": [ "$$ingredient.name", "ingredient1" ] },
{ "$gt": [ "$$ingredient.value", 50 ] }
]
}
}
}
}}
])
Where in both cases you are effectively "filtering" the array elements that do not match the conditions after the initial document match.
Also, since your "values" are actually "strings" right now, you reaally should change this to be numeric. Here is a basic process:
var bulk = collection.initializeOrderedBulkOp(),
count = 0;
collection.find().forEach(function(doc) {
doc.ingredients.forEach(function(ingredient,idx) {
var update = { "$set": {} };
update["$set"]["ingredients." + idx + ".value"] = parseFloat(ingredients.value);
bulk.find({ "_id": doc._id }).updateOne(update);
count++;
if ( count % 1000 == 0 ) {
bulk.execute();
bulk = collection.initializeOrderedBulkOp();
}
})
});
if ( count % 1000 != 0 )
bulk.execute();
And that will fix the data so the query forms here work.
This is much better than processing with the JavaScript $where, which needs to evaluate every document in the collection without the benefit of an index to filter. The correct form of that would be:
collection.find(function() {
return this.ingredients.some(function(ingredient) {
return (
( ingredient.name === "ingredient1" ) &&
( parseFloat(ingredient.value) > 50 )
);
});
})
And that can also not "project" the matched value(s) in the results as the other forms can.
Try using $elemMatch:
var queryTest = collection.find(
{ ingredients: { $elemMatch: { name: "ingredient1", value: { $gte: 50 } } } }
);
How can I convert the following SQL query to MongoDB using mapReduce?
SELECT mobile, SUM( amount ),count(mobile) as noOfTimesRecharges
FROM recharge
WHERE recharge_date between '2015-02-26' AND '2015-03-27'
GROUP BY mobile
having noOfTimesRecharges > 0 and noOfTimesRecharges < 5
I have tried
db.users.mapReduce(
function(){
emit(this.mobile,this.amount);
},
function(k,v){
return Array.sum(v)
},
{
query:{
recharge_date:{$gte:ISODate("2014-06-17"),$lte:ISODate("2014-06-20")}
},
out:"one_month_data"
}).find();
which gives me the sum, but not the count.
So you probably really want the aggregation framework in this case. It runs in native code operations and is much faster than what can be achieved from the JavaScript evaluation of mapReduce.
db.users.aggregate([
{ "$match": {
"recharge_date": {
"$gte": ISODate("2014-06-17"),
"$lte": ISODate("2014-06-20")
}
}},
{ "$group": {
"_id": "$mobile",
"amount": { "$sum": "$amount" },
"count": { "$sum": 1 }
}},
{ "$match": {
"count": { "$gt": 1, "lt": 5 }
}}
{ "$out": "newCollection" }
],
{ "allowDiskUse": true }
)
It's a lot more efficient and very simple to code.
Also check out the SQL to Aggregation Mapping Chart for common examples.
If you really do need mapReduce ( and you likely do not ) then the correct approach is:
db.users.mapReduce(
function() {
emit( this.mobile, { "amount": this.amount, "count": 1 } );
},
function(key,values) {
var doc = { "amount": 0, "count": 0 };
values.forEach(function(value) {
doc.amount += value.amount;
doc.count += value.count;
});
return doc;
},
{
"out": { "replace": "newCollection" },
"query": {
"recharge_date": {
"$gte": ISODate("2014-06-17"),
"$lte": ISODate("2014-06-20")
}
}
}
)
But you don't get the same ability to "limit" the results as you do with the aggregation pipeline, without additional processing on the collection of results.
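If you do stick with mapReduce, the "having" part of the SQL then has to be applied afterwards against the output collection, since the reduced values are stored under the value key (a hedged sketch using the newCollection output named above):

// Equivalent of HAVING noOfTimesRecharges > 0 AND noOfTimesRecharges < 5
db.newCollection.find({ "value.count": { "$gt": 0, "$lt": 5 } })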
My data looks like this:
{
    "foo_list": [
        {
            "id": "98aa4987-d812-4aba-ac20-92d1079f87b2",
            "name": "Foo 1",
            "slug": "foo-1"
        },
        {
            "id": "98aa4987-d812-4aba-ac20-92d1079f87b2",
            "name": "Foo 1",
            "slug": "foo-1"
        },
        {
            "id": "157569ec-abab-4bfb-b732-55e9c8f4a57d",
            "name": "Foo 3",
            "slug": "foo-3"
        }
    ]
}
Where foo_list is a field in a model called Bar. Notice that the first and second objects in the array are complete duplicates.
Aside from the obvious solution of switching to PostgreSQL, what MongoDB query can I run to remove duplicate entries from foo_list?
Similar answers that do not quite cut it:
https://stackoverflow.com/a/16907596/432
https://stackoverflow.com/a/18804460/432
Those answers cover the case where the array contains bare strings. However, in my situation the array is filled with objects.
I hope it is clear that I am not interested in merely querying the database; I want the duplicates to be gone from the database forever.
Purely from an aggregation framework point of view there are a few approaches to this.
You can either just apply $setUnion in modern releases:
db.collection.aggregate([
{ "$project": {
"foo_list": { "$setUnion": [ "$foo_list", "$foo_list" ] }
}}
])
Or more traditionally with $unwind and $addToSet:
db.collection.aggregate([
{ "$unwind": "$foo_list" },
{ "$group": {
"_id": "$_id",
"foo_list": { "$addToSet": "$foo_list" }
}}
])
Or if you were just interested in the duplicates only then by general grouping:
db.collection.aggregate([
{ "$unwind": "$foo_list" },
{ "$group": {
"_id": {
"_id": "$_id",
"foo_list": "$foo_list"
},
"count": { "$sum": 1 }
}},
{ "$match": { "count": { "$ne": 1 } } },
{ "$group": {
"_id": "$_id._id",
"foo_list": { "$push": "$_id.foo_list" }
}}
])
The last form could be useful to you if you actually want to "remove" the duplicates from your data with another update statement as it identifies the elements which are duplicates.
So in that last form the returned result from your sample data identifies the duplicate:
{
    "_id" : ObjectId("53f5f7314ffa9b02cf01c076"),
    "foo_list" : [
        {
            "id" : "98aa4987-d812-4aba-ac20-92d1079f87b2",
            "name" : "Foo 1",
            "slug" : "foo-1"
        }
    ]
}
Where results are returned from your collection, one per document that contains duplicate entries in the array, along with which entries are duplicated. This is the information you need for the update, and you loop the results in order to build the update statements that remove the duplicates.
This is actually done with two update statements per document, as a simple $pull operation would remove "both" items, which is not what you want:
var cursor = db.collection.aggregate([
{ "$unwind": "$foo_list" },
{ "$group": {
"_id": {
"_id": "$_id",
"foo_list": "$foo_list"
},
"count": { "$sum": 1 }
}},
{ "$match": { "count": { "$ne": 1 } } },
{ "$group": {
"_id": "$_id._id",
"foo_list": { "$push": "$_id.foo_list" }
}}
])
var batch = db.collection.initializeOrderedBulkOp();
var count = 0;
cursor.forEach(function(doc) {
doc.foo_list.forEach(function(dup) {
batch.find({ "_id": doc._id, "foo_list": { "$elemMatch": dup } }).updateOne({
"$unset": { "foo_list.$": "" }
});
batch.find({ "_id": doc._id }).updateOne({
"$pull": { "foo_list": null }
});
});
count++;
if ( count % 500 == 0 ) {
batch.execute();
batch = db.collection.initializeOrderedBulkOp();
}
});
if ( count % 500 != 0 ) {
batch.execute();
}
That's the modern MongoDB 2.6 and above way to do it, with a cursor result from aggregation and Bulk operations for updates. But the principles remain the same:
Identify the duplicates in documents
Loop the results to issue the updates to the affected documents
Use $unset with the positional $ operator to set the "first" matched array element to null
Use $pull to remove the null entry from the array
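For the single sample document, those two update steps boil down to something like this (a hedged illustration using the duplicate entry identified above; the Bulk API loop simply issues these in batches):

// 1. Blank out the *first* matching duplicate via the positional operator
db.collection.update(
    { "_id": ObjectId("53f5f7314ffa9b02cf01c076"),
      "foo_list": { "$elemMatch": {
          "id": "98aa4987-d812-4aba-ac20-92d1079f87b2",
          "name": "Foo 1",
          "slug": "foo-1"
      }}},
    { "$unset": { "foo_list.$": "" } }
)

// 2. Remove the null placeholder left behind
db.collection.update(
    { "_id": ObjectId("53f5f7314ffa9b02cf01c076") },
    { "$pull": { "foo_list": null } }
)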
So after processing the above operations your sample now looks like this:
{
    "_id" : ObjectId("53f5f7314ffa9b02cf01c076"),
    "foo_list" : [
        {
            "id" : "98aa4987-d812-4aba-ac20-92d1079f87b2",
            "name" : "Foo 1",
            "slug" : "foo-1"
        },
        {
            "id" : "157569ec-abab-4bfb-b732-55e9c8f4a57d",
            "name" : "Foo 3",
            "slug" : "foo-3"
        }
    ]
}
The duplicate is removed, with one copy of the "duplicated" item still intact. That is how you identify and remove the duplicate data from your collection.
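As a side note, if these arrays are maintained by application writes, then pushing new entries with $addToSet rather than $push avoids re-introducing exact duplicates in the first place (a hedged suggestion, not part of the cleanup itself):

// $addToSet only appends the sub-document if an identical one is not already present
db.collection.update(
    { "_id": ObjectId("53f5f7314ffa9b02cf01c076") },
    { "$addToSet": {
        "foo_list": {
            "id": "157569ec-abab-4bfb-b732-55e9c8f4a57d",
            "name": "Foo 3",
            "slug": "foo-3"
        }
    }}
)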