I have a dataset of records stored in MongoDB and I have been trying to extract a complex set of data from the records.
Sample records are as follows:
{
bookId : '135wfkjdbv',
type : 'a',
store : 'crossword',
shelf : 'A1'
}
{
bookId : '13erjfn',
type : 'b',
store : 'crossword',
shelf : 'A2'
}
I have been trying to extract data such that for each bookId, I get a count (of records) for each shelf per store name that holds the book identified by bookId, where the type of the book is 'a'.
I understand that an aggregation query allows a pipeline with grouping, matching and so on, but I have not been able to reach a solution.
The desired output is of the form:
{
bookId : '135wfkjdbv',
stores : [
{
name : 'crossword',
shelves : [
{
name : 'A1',
count : 12
},
]
},
{
name : 'granth',
shelves : [
{
name : 'C2',
count : 12
},
{
name : 'C4',
count : 12
},
]
}
]
}
The process isn't really that difficult when you look at it. The aggregation "pipeline" is exactly that, where each "stage" feeds its result into the next for processing. Just like a unix "pipe":
ps -ef | grep mongo | tee out.txt
So it's just adding stages, and in fact three $group stages where the first does the basic aggregation and the remaining two simply "roll up" the arrays required in the output.
db.collection.aggregate([
{ "$group": {
"_id": {
"bookId": "$bookId",
"store": "$store",
"shelf": "$shelf"
},
"count": { "$sum": 1 }
}},
{ "$group": {
"_id": {
"bookId": "$_id.bookId",
"store": "$_id.store"
},
"shelves": {
"$push": {
"name": "$_id.shelf",
"count": "$count"
}
}
}},
{ "$group": {
"_id": "$_id.bookId",
"stores": {
"$push": {
"name": "$_id.store",
"shelves": "$shelves"
}
}
}}
])
You could possibly $project at the end to change the _id to bookId, but you should already know that is what it is, and you should get used to treating _id as the primary key. There is a cost to such operations, so it is a habit best avoided; learn to do things correctly from the start.
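If you do want those extras, here is a sketch of the two optional stages: a leading $match to restrict the pipeline to type 'a' books as your question asks, and the trailing $project if you really do want a bookId field in the output:
// At the front of the pipeline: only consider books of type 'a'
{ "$match": { "type": "a" } },
// At the very end: rename _id back to bookId (at the extra cost noted above)
{ "$project": {
    "_id": 0,
    "bookId": "$_id",
    "stores": 1
}}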
So all that really happens here is that all the fields that make up the grouping detail become the primary key of $group, with the other field being produced as count to count the shelves within that grouping. Think of the SQL equivalent:
GROUP BY bookId, store, shelf
All the remaining stages do is transpose each grouping level into array entries, first the shelves within each store and then the stores within each bookId. Each time, the primary grouping key loses a field as its content moves into the array being produced.
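To make the "roll up" concrete, here is roughly what the documents look like after the first and second $group stages (a sketch, with the count value taken from the desired output above):
// After the first $group: one document per bookId/store/shelf combination
{ "_id": { "bookId": "135wfkjdbv", "store": "crossword", "shelf": "A1" }, "count": 12 }
// After the second $group: shelves rolled up into an array per bookId/store
{
    "_id": { "bookId": "135wfkjdbv", "store": "crossword" },
    "shelves": [ { "name": "A1", "count": 12 } ]
}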
When you start thinking in terms of "pipeline" processing, it becomes clear: you construct one form, take that output and move it on to the next form, and so on. This is basically how you fold the results into the two arrays.
I am working on a MERN project. I have created a collection in MongoDB holding documents of different types. Is it an accepted practice to have documents with different structures in a single collection? Secondly, I need to fetch only a single document from the collection using the key name. My documents are:
[{
"_id": {
"$oid": "6333f72822dc0acc4bea17bd"
},
"designation": [
{
"name": "Chairman",
"level": 17
},
{
"name": "Director",
"level": 13
},
{
"name": "Secretary ",
"level": 13
},
{
"name": "Account Officer",
"level": 9
},
{
"name": "Data Entry Operator-GR B",
"level": 5
}
]
},
{
"_id": {
"$oid": "6334313b22dc0acc4bea17c2"
},
"storeRole": ["manager", "approver", "accepter", "firstsignatory"]
},
{
"_id": {
"$oid": "63369d2083a7cc2e818990dd"
},
"designationSuffix": ["I","II", "III"]
}]
How do I get any of the three documents if I only know the key name, i.e. (designation, storeRole, designationSuffix)? I don't want to use the ID value.
Welcome to SO.
First, yes it is an accepted practice and indeed, a powerful feature of MongoDB to have different shapes of data in a single collection.
There are two important things to remember when querying for data:
Matching on fields that don't even exist in a document is OK; the document will simply be skipped. This permits you, for example, to query for storeRole and ignore the other documents with designation, etc. -- unless of course you wish to look for those too using an $or expression. (A sketch follows these two points.)
Matching (using $match) for elements in an array will return the whole array, not just the elements that match.
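For instance, if all you know is the key name, a query on field existence is enough to pull back the right document (a minimal sketch; swap in designation or designationSuffix as needed):
// Return the single document that carries the "storeRole" key
db.foo.findOne({ "storeRole": { "$exists": true } })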
To illustrate the second point, let's expand your input data slightly:
{"designation": [
{"name": "Chairman","level": 17},
{"name": "Director", "level": 13}
]
},
{"designation": [
{"name": "Secretary","level": 13}
]
},
We will use dot notation to reach into the structures in the designation array to find those docs where at least one of the name fields is Chairman:
db.foo.aggregate([
{$match: {"designation.name": "Chairman"}}
]);
{
"_id" : 0,
"designation" : [
{
"name" : "Chairman",
"level" : 17
},
{
"name" : "Director",
"level" : 13
}
]
}
The query eliminated the document with name = Secretary as expected but properly returned the whole document (and the whole array) where name = Chairman. Very often the goal is to fetch only the matching items in the array; this is accomplished with the $filter operator:
db.foo.aggregate([
{$match: {"designation.name": "Chairman"}},
{$project: {
// Assigning the output of $filter to the same name as input:
designation: {$filter: {
input: "$designation",
as: "zz",
cond: {$eq: ['$$zz.name','Chairman']}
}}
}}
]);
{
"_id" : 0,
"designation" : [
{
"name" : "Chairman",
"level" : 17
}
]
}
An alternative approach, which is useful when query conditions yield null or empty arrays instead of eliminating the document altogether, is to $filter first, then match only on results where the array is non-empty (size greater than 0). We must use the $ifNull operator to protect $size from being passed a null by turning it into an empty (but not null) array:
db.foo.aggregate([
{$project: {
// Assigning the output of $filter to the same name as input:
designation: {$filter: {
input: "$designation",
as: "zz",
cond: {$eq: ['$$zz.name','Chairman']}
}}
}},
{$match: {$expr: {$gt:[{$size: {$ifNull:["$designation",[] ]}}, 0]}} }
]);
Try commenting out the $match to see what $filter returns when a document has the target array field but no matches vs. when the document does not have the field.
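For instance, with the $match removed you would see output roughly like this (a sketch, using numeric _id values as in the earlier output): the document that has the field but no matching element keeps an empty array, while a document that never had the field comes back with null, which is exactly why the $ifNull guard is needed:
// Had "designation", but no element survived the $filter
{ "_id" : 1, "designation" : [ ] }
// Never had "designation" at all (e.g. the storeRole document)
{ "_id" : 2, "designation" : null }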
I have a log collection in MongoDB that has a structure that looks like this:
{
url : "http://example.com",
query : "name=blah,;another_param=bleh",
count : 5
}
where the "query" field is the query parameters in the requested url.
I want to compute a total of count grouped by the query parameter "name". For example, for this collection:
[{
url : "http://example.com",
query : "name=blah,;another_param=bleh",
count : 3
},
{
url : "http://example.com",
query : "name=blah,;another_param=xyz",
count : 4
},
{
url : "http://example.com",
query : "name=another_name,;another_param=bleh",
count : 3
}]
I need this output:
[{
key : "blah",
count : 7
},
{
key : "another_name",
count : 3
}]
It doesn't look like I can do this string manipulation using the aggregation framework. I can do this via map-reduce, but can a map-reduce operation be part of the aggregation pipeline?
The aggregation framework does not have the string manipulation operators necessary to dissect the string content and break this up into the key/value pairs you need for this operation. The only string manipulation currently available is $substr, which is not going to help unless you are dealing with fixed length data.
So the only server-side way to do this at present is with mapReduce, since you can just use the JavaScript functions available to do the right manipulation. Something like this:
For the mapper:
function() {
var obj = {};
this.query.split(/,;/).forEach(function(item) {
var temp = item.split(/=/);
obj[temp[0]] = temp[1];
});
if (obj.hasOwnProperty('name'))
emit(obj.name,this.count);
}
And the reducer:
function(key,values) {
return Array.sum( values );
}
Which is the basic structure of the JavaScript functions required to split out the "name" parameters and use them as the "keys" for aggregation, or general counting of the "key" occurrences.
And to answer the other part of your question: no, a mapReduce operation cannot be part of the aggregation pipeline. The aggregation framework cannot execute any JavaScript itself, as it just runs native code operators over the data.
It would be a good idea though to look at changing how your data is stored, so that the elements are broken down into an "object" representation rather than a string when the documents are inserted into MongoDB. This allows native query forms that don't rely on JavaScript execution to manipulate the data:
[{
"url": "http://example.com",
"query": {
"name": "blah",
"another_param": "bleh"
},
"count": 3
},
{
"url": "http://example.com",
"query": {
"name": "blah",
"another_param": "xyz"
},
"count": 4
},
{
"url": "http://example.com",
"query": {
"name": "another_name",
"another_param": "bleh"
},
"count": 3
}]
This makes a $group pipeline stage quite simple as the data is now organized in a form that can be natively processed:
{ "$match": { "query.name": { "$exists": true } },
{ "$group": {
"_id": "$query.name",
"count": { "$sum": "$count" }
}}
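Run against the restructured sample, those two stages produce the totals you were after, keyed on _id rather than key:
{ "_id" : "blah", "count" : 7 }
{ "_id" : "another_name", "count" : 3 }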
So use mapReduce for now, but ultimately consider changing your recording of the data to split the "tokens" from the query string and represent this as structured data, optionally keeping the original string in another field.
The aggregation framework will process this much faster than mapReduce can, so this would be the better ongoing option.
Consider this example collection:
{
    "_id": 0,
    "firstname": "Tom",
    "children": {
        "childA": {
            "toys": {
                "toy 1": "batman",
                "toy 2": "car",
                "toy 3": "train"
            },
            "movies": {
                "movie 1": "Ironman",
                "movie 2": "Deathwish"
            }
        },
        "childB": {
            "toys": {
                "toy 1": "doll",
                "toy 2": "bike",
                "toy 3": "xbox"
            },
            "movies": {
                "movie 1": "Frozen",
                "movie 2": "Barbie"
            }
        }
    }
}
Now I would like to retrieve ONLY the movies from a particular document.
I have tried something like this:
movies = users.find_one({'_id': 0}, {'_id': 0, 'children.ChildA.movies': 1})
However, I get the whole field structure from 'children' down to 'movies' and it's content. How do I just do a query and retrieve only the content of 'movies'?
To be specific I want to end up with this:
{
'movie 1': "Frozen"
'movie 2': "Barbie"
}
The problem here is your current data structure is not really great for querying. This is mostly because you are using "keys" to actually represent "data points", and while it might initially seem to be a logical idea it is actually a very bad practice.
So rather than do something like assign "childA" and "childB" as keys of an object or "sub-document", you are better off assigning these as "values" to a generic key name in a structure like this:
{
"_id:"0,
"firstname":"Tom",
"children" : [
{
"name": "childA",
"toys": [
"batman",
"car",
"train"
],
"movies": [
"Ironman"
"Deathwish"
]
},
{
"name": "childB",
"toys": [
"doll",
"bike",
"xbox",
],
"movies": [
"Frozen",
"Barbie"
]
}
]
}
It is not perfect, as there are nested arrays, which can be a potential problem, but there are workarounds for that as well (more on this later). The main point here is that this is a lot better than defining the data in "keys". The main problem with "keys" that are not consistently named is that MongoDB generally does not allow any way to "wildcard" these names, so you are stuck with naming an "absolute path" in order to access elements, as in:
children -> childA -> toys
children -> childB -> toys
And that, in a nutshell, is bad compared to this:
"children.toys"
From the sample prepared above, I would say that is a far better approach to organizing your data.
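For example (a sketch against the restructured sample), a single dot-notation path now matches the document no matter which child owns the toy:
db.collection.find({ "children.toys": "batman" })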
Even so, just getting back something such as a "unique list of movies" is out of scope for standard .find() type queries in MongoDB. This actually requires something more like "document manipulation" and is well supported in the aggregation framework for MongoDB. That has extensive capabilities for manipulation that are not present in the query methods, and as a per-document response with the above structure you can do this:
db.collection.aggregate([
# De-normalize the array content first
{ "$unwind": "$children" },
# De-normalize the content from the inner array as well
{ "$unwind": "$children.movies" },
# Group back, well optionally, but just the "movies" per document
{ "$group": {
"_id": "$_id",
"movies": { "$addToSet": "$children.movies" }
}}
])
So now the "list" response in the document only contains the "unique" movies, which corresponds more to what you are asking. Alternately you could just $push instead and make a "non-unique" list. But stupidly that is actually the same as this:
db.collection.find({},{ "_id": False, "children.movies": True })
As a "collection wide" concept, then you could simplify this a lot by simply using the .distinct() method. Which basically forms a list of "distinct" keys based on the input you provide. This playes with arrays really well:
db.collection.distinct("children.toys")
And that is essentially a collection-wide analysis of all the "distinct" occurrences for each "toys" value in the collection, returned as a simple "array".
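Against the restructured sample above, that returns a plain array of the unique values, something like:
[ "batman", "car", "train", "doll", "bike", "xbox" ]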
But as for your existing structure, it deserves a solution to explain, though you really must understand that the solution is horrible. The problem here is that the "native" and optimized methods available to general queries and aggregation are not available at all, and the only option is JavaScript-based processing. Which, even though a little better through "v8" engine integration, is still a complete slouch when compared side by side with native code methods.
So from the "original" form that you have, ( JavaScript form, functions have to be so easy to translate") :
db.collection.mapReduce(
    // Mapper: walk the keyed children structure and emit every toy per _id
    function() {
        var id = this._id;
        var children = this.children;
        Object.keys(children).forEach(function(childKey) {
            Object.keys(children[childKey]["toys"]).forEach(function(toyKey) {
                emit(
                    id, { "toys": [ children[childKey]["toys"][toyKey] ] }
                );
            });
        });
    },
    // Reducer: merge the emitted values into a single unique list per _id
    function(key,values) {
        var output = { "toys": [] };
        values.forEach(function(value) {
            value.toys.forEach(function(toy) {
                if ( output.toys.indexOf( toy ) == -1 )
                    output.toys.push( toy );
            });
        });
        return output;
    },
    {
        "out": { "inline": 1 }
    }
)
So JavaScript evaluation is the "horrible" approach as this is much slower in execution, and you see the "traversing" code that needs to be implemented. Bad news for performance, so don't do it. Change the structure instead.
As a final part, you could model this differently to avoid the "nested array" concept. And understand that the only real problem with a "nested array" is that "updating" a nested element is really impossible without reading in the whole document and modifying it.
So $push and $pull methods work fine. But using a "positional" $ operator just does not work as the "outer" array index is always the "first" matched element. So if this really was a problem for you then you could do something like this, for example:
{
"_id:"0,
"firstname":"Tom",
"childtoys" : [
{
"name": "childA",
"toy": "batman"
},
{
"name": "childA",
"toy": "car"
},
{
"name": "childA",
"toy": "train"
},
{
"name": "childB",
"toy": "doll"
},
{
"name": "childB",
"toy": "bike"
},
{
"name": "childB",
"toy": "xbox"
}
],
"childMovies": [
{
"name": "childA"
"movie": "Ironman"
},
{
"name": "childA",
"movie": "Deathwish"
},
{
"name": "childB",
"movie": "Frozen"
},
{
"name": "childB",
"movie": "Barbie"
}
]
}
That would be one way to avoid the problem with nested updates if you did indeed need to "update" items on a regular basis rather than just $push and $pull items to the "toys" and "movies" arrays.
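For example, a targeted update on one child's toy becomes possible with $elemMatch and the positional $ operator (a sketch against the flattened structure above; the "truck" value is just an illustration):
db.collection.update(
    { "_id": 0, "childtoys": { "$elemMatch": { "name": "childA", "toy": "car" } } },
    { "$set": { "childtoys.$.toy": "truck" } }
)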
But the overall message here is to design your data around the access patterns you actually use. MongoDB generally does not like things behind a "strict path" in terms of being able to query or otherwise flexibly issue updates.
Projections in MongoDB make use of '1' and '0', not 'True'/'False'.
Moreover, ensure that the fields are specified in the right case (uppercase/lowercase).
The query should be as below:
db.users.findOne({'_id': 0}, {'_id': 0, 'children.childA.movies': 1})
Which will result in :
{
"children" : {
"childA" : {
"movies" : {
"movie 1" : "Ironman",
"movie 2" : "Deathwish"
}
}
}
}
I have documents like this:
{
"_id" : ObjectId("53bcedc39c837bba3e1bf1c2"),
id : "abc1",
someArray: [ 1 , 10 , 11]
}
{
"_id" : ObjectId("53bcedc39c837bba3e1bf1c4"),
id : "abc1",
someArray: [ 1 , 10]
}
... other similar documents with different Ids
I would like to go through the whole collection and delete the document where someArray is smallest, grouped by id. So in this example, I group by abc1 (and I get 2 documents) and then the 2nd document would be the one to delete because it has least count in someArray.
There isn't a $count accumulator so I don't see how I can use $group.
Additionally, there will be 1000s of Ids that have duplicates like this, so if there is such a thing as a bulk check/delete that would be good (possibly a stupid question, sorry, Mongo is all new to me!)
Removing "duplicates" is a process here and there is no simple way to both "identify" the dupliciates and "remove" them as a single statement. Another particular here is that query forms cannot "typically" determine the size of an array, and certainly cannot sort by that where it is not already present in the document.
All cases basically come down to:
Identifying the list of documents that are "duplicates", and then ideally singling out the particular document you want to delete, or more to the point the document you "don't" want to delete from the possible duplicates.
Processing that list to actually perform the deletes.
With that in mind you hopefully have a modern MongoDB of 2.6 version or greater where you can obtain a cursor from the aggregate method. You also want the Bulk Operations API available in these versions for optimal speed:
var bulk = db.collection.initializeOrderedBulkOp();
var counter = 0;
db.collection.aggregate([
{ "$project": {
"id": 1,
"size": { "$size": "$someArray" }
}},
{ "$sort": { "id": 1, "size": -1 } },
{ "$group": {
"_id": "$id",
"docId": { "$first": "$_id" }
}}
]).forEach(function(doc) {
bulk.find({ "id": doc._id, "_id": { "$ne": doc.docId }).remove();
counter++;
// Send to server once every 1000 statements only
if ( counter % 1000 == 0 ) {
bulk.execute();
bulk = db.collection.initializeOrderedBulkOp(); // need to reset
}
});
// Clean up results that did not round to 1000
if ( counter % 1000 != 0 )
bulk.execute();
You can still do much the same thing with older versions of MongoDB, but the result from .aggregate() must be under 16MB which is the BSON limit. That still should be a lot, but with older versions you could also output to a collection with mapReduce.
But with the general aggregation response you get an array of results, and you also don't have the $size operator as a convenience for finding the length of the array. So it takes a little more work:
var result = db.collection.aggregate([
{ "$unwind": "$someArray" },
{ "$group": {
"_id": "$id",
"id": { "$first": "$id" },
"size": { "$sum": 1 }
}},
{ "$sort": { "id": 1, "size": -1 } },
{ "$group": {
"_id": "$id",
"docId": { "$first": "$_id" }
}}
]);
result.result.forEach(function(doc) {
db.collection.remove({ "id": doc._id, "_id": { "$ne": doc.docId } });
});
So there is no cursor for large results and no bulk operations, which means every single "remove" needs to be sent to the server individually.
Also, in MongoDB there are no "sub-queries", nor, when there are more than two duplicates, a way to single out the documents you don't want to remove within the same statement. But this is the general way to do it.
Just as a note, if the "size" of arrays is something important to you for a purpose such as "sorting", then your best approach is to maintain that "size" as another property of your document, making those operations easier without needing to "calculate" it as is done here.
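A sketch of what that could look like, where someArraySize is just an illustrative field name kept in step with every $push:
db.collection.update(
    { "_id": ObjectId("53bcedc39c837bba3e1bf1c2") },
    // push a new element and keep the stored size in sync
    { "$push": { "someArray": 12 }, "$inc": { "someArraySize": 1 } }
)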
I know this question has been asked a lot of times, but I'm kinda new to mongo and mongoose as well and I couldn't figure it out!
My problem:
I have a schema which looks like this:
var rankingSchema = new Schema({
userId : { type : Schema.Types.ObjectId, ref:'User' },
pontos : {type: Number, default:0},
placarExato : {type: Number, default:0},
golVencedor : {type: Number, default:0},
golPerdedor : {type: Number, default:0},
diferencaVencPerd : {type: Number, default:0},
empateNaoExato : {type: Number, default:0},
timeVencedor : {type: Number, default:0},
resumo : [{
partida : { type : Schema.Types.ObjectId, ref:'Partida' },
palpite : [Number],
quesito : String
}]
});
Which would return a document like this:
{
"_id" : ObjectId("539d0756f0ccd69ac5dd61fa"),
"diferencaVencPerd" : 0,
"empateNaoExato" : 0,
"golPerdedor" : 0,
"golVencedor" : 1,
"placarExato" : 2,
"pontos" : 78,
"resumo" : [
{
"partida" : ObjectId("5387d991d69197902ae27586"),
"_id" : ObjectId("539d07eb06b1e60000c19c18"),
"palpite" : [
2,
0
]
},
{
"partida" : ObjectId("5387da7b27f54fb425502918"),
"quesito" : "golsVencedor",
"_id" : ObjectId("539d07eb06b1e60000c19c1a"),
"palpite" : [
3,
0
]
},
{
"partida" : ObjectId("5387dc012752ff402a0a7882"),
"quesito" : "timeVencedor",
"_id" : ObjectId("539d07eb06b1e60000c19c1c"),
"palpite" : [
2,
1
]
},
{
"partida" : ObjectId("5387dc112752ff402a0a7883"),
"_id" : ObjectId("539d07eb06b1e60000c19c1e"),
"palpite" : [
1,
1
]
},
{
"partida" : ObjectId("53880ea52752ff402a0a7886"),
"quesito" : "placarExato",
"_id" : ObjectId("539d07eb06b1e60000c19c20"),
"palpite" : [
1,
2
]
},
{
"partida" : ObjectId("53880eae2752ff402a0a7887"),
"quesito" : "placarExato",
"_id" : ObjectId("539d0aa82fb219000054c84f"),
"palpite" : [
2,
1
]
}
],
"timeVencedor" : 1,
"userId" : ObjectId("539b2f2930de100000d7356c")
}
My question is, first: how can I filter the resumo nested documents by quesito? Is it possible to paginate this result, since this array is going to grow? And last question, is this a good approach for this case?
Thank you guys!
As noted, your schema implies that you actually have embedded data even though you are storing an external reference. So it is not clear if you are doing both embedding and referencing or simply embedding by itself.
The big caveat here is the difference between matching a "document" and actually filtering the contents of an array. Since you seem to be talking about "paging" your array results, the large focus here is on doing that, but still making mention of the warnings.
Multiple "filtered" matches in an array requires the aggregation framework. You can generally "project" the single match of an array element, but this is needed where you expect more than one:
Ranking.aggregate(
[
// This match finds "documents" that "contain" the match
{ "$match": { "resumo.quesito": "value" } },
// Unwind de-normalizes arrays as documents
{ "$unwind": "$resumo" },
// This match actually filters those document matches
{ "$match": { "resumo.quesito": "value" } },
// Skip and limit for paging, which really only makes sense on single
// document matches
{ "$skip": 0 },
{ "$limit": 2 },
// Return as an array in the original document if you really want
{ "$group": {
"_id": "$_id",
"otherField": { "$first": "$otherField" },
"resumo": { "$push": "$resumo" }
}}
],
function(err,results) {
}
)
Or the MongoDB 2.6 way of "filtering" inside a $project using the $map operator. You still need to $unwind in order to "page" array positions, but there is possibly less processing as the array is "filtered" first:
Ranking.aggregate(
[
// This match finds "documents" that "contain" the match
{ "$match": { "resumo.quesito": "value" } },
// Filter with $map
{ "$project": {
"otherField": 1,
"resumo": {
"$setDifference": [
{
"$map": {
"input": "$resumo",
"as": "el",
"in": { "$eq": ["$$el.questio", "value" ] }
}
},
[false]
]
}
}},
// Unwind de-normalizes arrays as documents
{ "$unwind": "$resumo" },
// Skip and limit for paging, which really only makes sense on single
// document matches
{ "$skip": 0 },
{ "$limit": 2 },
// Return as an array in the original document if you really want
{ "$group": {
"_id": "$_id",
"otherField": { "$first": "$otherField" },
"resumo": { "$push": "$resumo" }
}}
],
function(err,results) {
}
)
The inner usage of $skip and $limit here really only makes sense when you are processing a single document and just "filtering" and "paging" the array. It is possible to do this with multiple documents, but is very involved as there is no way to just "slice" the array. Which brings us to the next point.
Really with embedded arrays, for paging that does not require any filtering you just use the $slice operator, which was designed for this purpose:
Ranking.find({},{ "resumo": { "$slice": [0,2] } },function(err,docs) {
});
Your alternate though is to simply reference the documents in the external collection and then pass the arguments to mongoose .populate() to filter and "page" the results. The change in the schema itself would just be:
"resumo": [{ "type": "Schema.Types.ObjectId", "ref": "Partida" }]
With the external referenced collection now holding the object detail rather than embedding directly in the array. The use of .populate() with filtering and paging is:
Ranking.find().populate({
"path": "resumo",
"match": { "questio": "value" },
"options": { "skip": 0, "limit": 2 }
}).exec(function(err,docs) {
docs = docs.filter(function(doc) {
return doc.resumo.length;
});
});
Of course the possible problem there is that you can no longer actually query for the documents that contain the "embedded" information as it is now in another collection. This results in pulling in all documents, though possibly by some other query condition, but then manually testing them to see if they were "populated" by the filtered query that was sent to retrieve those items.
So it really does depend on what you are doing and what your approach is. If you regularly intend to "search" on inner arrays then embedding will generally suit you better. Also, if you are really only interested in "paging" then the $slice operator works well for this purpose with embedded documents. But beware of growing embedded arrays too large.
Using a referenced schema with mongoose helps with some size concerns, and there is methodology in place to assist with "paging" results and filtering them as well. The drawback is that you can no longer query "inside" those elements from the parent itself. So parent selection by the inner elements is not well suited here. Also keep in mind that while not all of the data is embedded, there is still the reference to the _id value of the external document. So you can still end up with large arrays, which may not be desirable.
For anything large, consider that you will likely be doing the work yourself, and working backwards from the "child" items to then match the parent(s).
I am not sure that you can filter sub-documents directly with mongoose. However you can get the parent document with Model.find({'resumo.quesito': 'THEVALUE'}) (you should also add an index on it).
Then, when you have the parent, you can get the child by comparing the quesito field in application code.
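A minimal sketch of that approach, using the field names from the schema above ('placarExato' is only an example value):
Ranking.find({ 'resumo.quesito': 'placarExato' }, function (err, docs) {
    if (err) throw err;
    docs.forEach(function (doc) {
        // keep only the matching entries from the embedded array
        var matched = doc.resumo.filter(function (item) {
            return item.quesito === 'placarExato';
        });
        console.log(matched);
    });
});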
Additional documentation can be found here: http://mongoosejs.com/docs/subdocs.html