I have a log collection in MongoDB that has a structure that looks like this:
{
url : "http://example.com",
query : "name=blah,;another_param=bleh",
count : 5
}
where the "query" field is the query parameters in the requested url.
I want to compute a total of count grouped by the query parameter "name". For example, for this collection:
[{
url : "http://example.com",
query : "name=blah,;another_param=bleh",
count : 3
},
{
url : "http://example.com",
query : "name=blah,;another_param=xyz",
count : 4
},
{
url : "http://example.com",
query : "name=another_name,;another_param=bleh",
count : 3
}]
I need this output:
[{
key : "blah",
count : 7
},
{
key : "another_name",
count : 3
}]
It doesn't look like I can do this string manipulation using the aggregation framework. I can do this via map-reduce, but can a map-reduce operation be part of the aggregation pipeline?
The aggregation framework does not have the string manipulation operators necessary to dissect the string content and break this up into the key/value pairs you need for this operation. The only string manipulation currently available is $substr, which is not going to help unless you are dealing with fixed length data.
So the only server side way to do this at present is with mapReduce, since you can just use the JavaScript functions available to do the right manipulation. Something like this:
For the mapper:
function() {
var obj = {};
this.query.split(/,;/).forEach(function(item) {
var temp = item.split(/=/);
obj[temp[0]] = temp[1];
});
if (obj.hasOwnProperty('name'))
emit(obj.name,this.count);
}
And the reducer:
function(key,values) {
return Array.sum( values );
}
Which is the basic structure of the JavaScript functions required to split out the "name" parameters and use them as the "keys" for aggregation, or general counting of the "key" occurrences.
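To check the logic without a server, the same mapper and reducer can be exercised in plain JavaScript. This is only a local sketch: `mapDoc`, `reduceValues` and `runLocally` are hypothetical names, the mapper returns pairs instead of calling `emit()`, and `Array.sum` (a mongo shell helper) is swapped for `reduce()`.

```javascript
// Local stand-in for the mapReduce run; nothing here talks to MongoDB.
const docs = [
  { url: "http://example.com", query: "name=blah,;another_param=bleh", count: 3 },
  { url: "http://example.com", query: "name=blah,;another_param=xyz", count: 4 },
  { url: "http://example.com", query: "name=another_name,;another_param=bleh", count: 3 }
];

// Same logic as the mapper, but returning [key, value] pairs instead of emit().
function mapDoc(doc) {
  const obj = {};
  doc.query.split(/,;/).forEach(function (item) {
    const temp = item.split(/=/);
    obj[temp[0]] = temp[1];
  });
  return obj.hasOwnProperty("name") ? [[obj.name, doc.count]] : [];
}

// Same logic as the reducer, without the shell-only Array.sum helper.
function reduceValues(key, values) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

// Group the emitted values by key, then reduce each group.
function runLocally(input) {
  const groups = new Map();
  input.forEach(function (doc) {
    mapDoc(doc).forEach(function (pair) {
      if (!groups.has(pair[0])) groups.set(pair[0], []);
      groups.get(pair[0]).push(pair[1]);
    });
  });
  return Array.from(groups, function (entry) {
    return { key: entry[0], count: reduceValues(entry[0], entry[1]) };
  });
}

const totals = runLocally(docs);
console.log(totals);
```

With the three sample documents from the question this yields a total of 7 for "blah" and 3 for "another_name".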
So the aggregation framework cannot execute any JavaScript itself, as it just runs native code operators over the data.
It would be a good idea though to look at changing how your data is stored, so that the elements are broken down into an "object" representation rather than a string when the documents are inserted into MongoDB. This allows native query forms that don't rely on JavaScript execution to manipulate the data:
[{
"url": "http://example.com",
"query": {
"name": "blah",
"another_param": "bleh"
},
"count": 3
},
{
"url": "http://example.com",
"query": {
"name": "blah",
"another_param": "xyz"
},
"count": 4
},
{
"url": "http://example.com",
"query": {
"name": "another_name",
"another_param": "bleh"
},
"count": 3
}]
This makes a $group pipeline stage quite simple as the data is now organized in a form that can be natively processed:
{ "$match": { "query.name": { "$exists": true } } },
{ "$group": {
"_id": "$query.name",
"count": { "$sum": "$count" }
}}
So use mapReduce for now, but ultimately consider changing your recording of the data to split the "tokens" from the query string and represent this as structured data, optionally keeping the original string in another field.
The aggregation framework will process this much faster than mapReduce can, so this would be the better ongoing option.
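As a rough sketch of that restructuring step, the conversion could happen in application code just before the insert. The helper name `toStructured` and the `rawQuery` field are assumptions made for illustration; the `,;` separator is taken from the sample data in the question.

```javascript
// Hypothetical helper: split the raw query string into an object before inserting.
function toStructured(doc) {
  const query = {};
  doc.query.split(",;").forEach(function (item) {
    const idx = item.indexOf("=");
    // Skip tokens without a key; keep everything after the first "=" as the value.
    if (idx > 0) query[item.slice(0, idx)] = item.slice(idx + 1);
  });
  // rawQuery (an assumed field name) optionally keeps the original string.
  return { url: doc.url, query: query, rawQuery: doc.query, count: doc.count };
}

const structured = toStructured({
  url: "http://example.com",
  query: "name=blah,;another_param=bleh",
  count: 3
});
console.log(structured);
```

Documents written in this shape can then be grouped on "$query.name" directly, as in the pipeline above.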
Say I have the following two collections, sites and webpages. I'm trying to understand how to create an aggregation that'll allow me to combine values of a document from the sites collection and use that to lookup a value from the webpages collection. In addition, I need to prepend the combined values with a string.
// sites collection
[
{ "_id" : 3, "host" : "www.example-foo.com", "path": "/bar", "hasVisited": false },
]
// webpages collection
[
{ "_id" : 5, "url" : "https://www.example-foo.com/bar" },
{ "_id" : 8, "url" : "https://www.fizz.com/buzz" },
]
Without an aggregation I would do something like the following.
const site = await db.sites.findOne({ hasVisited: { $eq: false } });
const pages = await db.webpages.find({
url: `https://${site.host}${site.path}`, // <--- how to construct this in a lookup aggregation? string + value + value
});
// pages = [{ "_id" : 5, "url" : "https://www.example-foo.com/bar" }]
This is a translation of your code, folding the two find queries into one aggregation using $lookup.
Query
the first findOne becomes the $match and the $limit 1
$set url builds the string with $concat
the second find becomes the $lookup (using the 1 site from the stages above)
*if you want to do it for more than 1 site, remove the $limit and project more fields, so you know which site each set of pages belongs to
db.sites.aggregate([
{
"$match": {
"hasVisited": {
"$eq": false
}
}
},
{
"$limit": 1
},
{
"$set": {
"url": {
"$concat": [
"https://",
"$host",
"$path"
]
}
}
},
{
"$lookup": {
"from": "webpages",
"localField": "url",
"foreignField": "url",
"as": "pages"
}
},
{
"$project": {
"_id": 0,
"pages": 1
}
}
])
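For readers who want to see what each stage contributes, the same steps can be mimicked over in-memory arrays. This is only a sketch in plain JavaScript, not driver code; `lookupPages` is a hypothetical name and the data is the sample from the question.

```javascript
const sites = [
  { _id: 3, host: "www.example-foo.com", path: "/bar", hasVisited: false }
];
const webpages = [
  { _id: 5, url: "https://www.example-foo.com/bar" },
  { _id: 8, url: "https://www.fizz.com/buzz" }
];

function lookupPages(siteDocs, pageDocs) {
  return siteDocs
    .filter(s => s.hasVisited === false)                                      // $match
    .slice(0, 1)                                                              // $limit 1
    .map(s => Object.assign({}, s, { url: "https://" + s.host + s.path }))    // $set with $concat
    .map(s => ({ pages: pageDocs.filter(p => p.url === s.url) }));            // $lookup, then $project
}

const result = lookupPages(sites, webpages);
console.log(JSON.stringify(result));
```

The key point is that the concatenated field has to exist on the document before the $lookup stage, because localField can only name a field, not an expression.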
I have a dataset of records stored in MongoDB and I have been trying to extract a complex set of data from the records.
Sample records are as follows :-
{
bookId : '135wfkjdbv',
type : 'a',
store : 'crossword',
shelf : 'A1'
}
{
bookId : '13erjfn',
type : 'b',
store : 'crossword',
shelf : 'A2'
}
I have been trying to extract data such that for each bookId, I get a count (of records) for each shelf per store name that holds the book identified by bookId, where the type of the book is 'a'.
I understand that the aggregation query allows a pipeline that allows grouping, matching etc, but I have not been able to reach a solution.
The desired output is of the form :-
{
bookId : '135wfkjdbv',
stores : [
{
name : 'crossword',
shelves : [
{
name : 'A1',
count : 12
},
]
},
{
name : 'granth',
shelves : [
{
name : 'C2',
count : 12
},
{
name : 'C4',
count : 12
},
]
}
]
}
The process isn't really that difficult when you look at it. The aggregation "pipeline" is exactly that, where each "stage" feeds a result into the next for processing. Just like a Unix "pipe":
ps -ef | grep mongo | tee out.txt
So it's just adding stages, and in fact three $group stages where the first does the basic aggregation and the remaining two simply "roll up" the arrays required in the output.
db.collection.aggregate([
{ "$group": {
"_id": {
"bookId": "$bookId",
"store": "$store",
"shelf": "$shelf"
},
"count": { "$sum": 1 }
}},
{ "$group": {
"_id": {
"bookId": "$_id.bookId",
"store": "$_id.store"
},
"shelves": {
"$push": {
"name": "$_id.shelf",
"count": "$count"
}
}
}},
{ "$group": {
"_id": "$_id.bookId",
"stores": {
"$push": {
"name": "$_id.store",
"shelves": "$shelves"
}
}
}}
])
You could possibly $project at the end to change the _id to bookId, but you should already know that is what it is, and should get used to treating _id as a primary key. There is a cost to such operations, so it is a habit best avoided; better to learn to do things correctly from the start.
So all that really happens here is that all the fields that make up the grouping detail become the primary key of the first $group, with the other field being produced as count, to count the records within that grouping. Think of the SQL equivalent:
GROUP BY bookId, store, shelf
All each remaining stage does is transpose a grouping level into array entries, first the shelf within the store and then the store within the bookId. Each time, the fields in the primary grouping key are reduced by the content going into the produced array.
When you start thinking in terms of "pipeline" processing, it becomes clear. You construct one form, then take that output and move it to the next form, and so on. This is basically how you fold the results into two levels of arrays.
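The roll-up can be traced by hand in plain JavaScript. The sketch below is an approximation rather than what the server does internally: `rollUp` is a hypothetical name, the four records are made up for illustration, and the filter on type 'a' corresponds to a $match on type that the question asks for but the pipeline above does not include.

```javascript
// Made-up sample records in the shape shown in the question.
const records = [
  { bookId: "135wfkjdbv", type: "a", store: "crossword", shelf: "A1" },
  { bookId: "135wfkjdbv", type: "a", store: "crossword", shelf: "A1" },
  { bookId: "135wfkjdbv", type: "a", store: "granth", shelf: "C2" },
  { bookId: "13erjfn", type: "b", store: "crossword", shelf: "A2" }
];

function rollUp(input) {
  // Stage 1: count per (bookId, store, shelf), like the first $group.
  const counts = new Map();
  input.filter(r => r.type === "a").forEach(r => {
    const key = JSON.stringify([r.bookId, r.store, r.shelf]);
    counts.set(key, (counts.get(key) || 0) + 1);
  });

  // Stage 2: push shelves into an array per (bookId, store).
  const perStore = new Map();
  counts.forEach((count, key) => {
    const [bookId, store, shelf] = JSON.parse(key);
    const storeKey = JSON.stringify([bookId, store]);
    if (!perStore.has(storeKey)) perStore.set(storeKey, []);
    perStore.get(storeKey).push({ name: shelf, count: count });
  });

  // Stage 3: push stores into an array per bookId.
  const perBook = new Map();
  perStore.forEach((shelves, key) => {
    const [bookId, store] = JSON.parse(key);
    if (!perBook.has(bookId)) perBook.set(bookId, []);
    perBook.get(bookId).push({ name: store, shelves: shelves });
  });

  return Array.from(perBook, ([bookId, stores]) => ({ _id: bookId, stores: stores }));
}

const rolled = rollUp(records);
console.log(JSON.stringify(rolled, null, 2));
```

Each stage drops one field from the grouping key and moves it into an array, which mirrors how the three $group stages shrink the _id.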
Consider this example collection:
{
"_id": 0,
"firstname":"Tom",
"children" : {
"childA":{
"toys":{
'toy 1':'batman',
'toy 2':'car',
'toy 3':'train'
},
"movies": {
'movie 1': "Ironman",
'movie 2': "Deathwish"
}
},
"childB":{
"toys":{
'toy 1':'doll',
'toy 2':'bike',
'toy 3':'xbox'
},
"movies": {
'movie 1': "Frozen",
'movie 2': "Barbie"
}
}
}
}
Now I would like to retrieve ONLY the movies from a particular document.
I have tried something like this:
movies = users.find_one({'_id': 0}, {'_id': 0, 'children.ChildA.movies': 1})
However, I get the whole field structure from 'children' down to 'movies' and its content. How do I just do a query and retrieve only the content of 'movies'?
To be specific I want to end up with this:
{
'movie 1': "Frozen",
'movie 2': "Barbie"
}
The problem here is your current data structure is not really great for querying. This is mostly because you are using "keys" to actually represent "data points", and while it might initially seem to be a logical idea it is actually a very bad practice.
So rather than assigning "childA" and "childB" as keys of an object or "sub-document", you are better off assigning these as "values" to a generic key name in a structure like this:
{
"_id": 0,
"firstname":"Tom",
"children" : [
{
"name": "childA",
"toys": [
"batman",
"car",
"train"
],
"movies": [
"Ironman",
"Deathwish"
]
},
{
"name": "childB",
"toys": [
"doll",
"bike",
"xbox"
],
"movies": [
"Frozen",
"Barbie"
]
}
]
}
This is not perfect, since there are nested arrays, which can be a potential problem (there are workarounds for that, covered later). The main point is that this is a lot better than defining the data in "keys". The main problem with "keys" that are not consistently named is that MongoDB does not generally offer any way to "wildcard" these names, so you are stuck with naming an "absolute path" in order to access elements, as in:
children -> childA -> toys
children -> childB -> toys
And that, in a nutshell, is bad. Compare it to this:
"children.toys"
Based on the sample prepared above, I would say that is a much better approach to organizing your data.
Even so, just getting back something such as a "unique list of movies" is out of scope for standard .find() type queries in MongoDB. This requires more in the way of "document manipulation", which is well supported in the aggregation framework for MongoDB. It has extensive capabilities for manipulation that are not present in the query methods, and as a per document response with the above structure you can do this:
db.collection.aggregate([
# De-normalize the array content first
{ "$unwind": "$children" },
# De-normalize the content from the inner array as well
{ "$unwind": "$children.movies" },
# Group back, well optionally, but just the "movies" per document
{ "$group": {
"_id": "$_id",
"movies": { "$addToSet": "$children.movies" }
}}
])
So now the "list" response in the document only contains the "unique" movies, which corresponds more to what you are asking. Alternately you could just $push instead and make a "non-unique" list. But, rather pointlessly, that returns essentially the same content as this:
db.collection.find({},{ "_id": False, "children.movies": True })
As a "collection wide" concept, you could simplify this a lot by using the .distinct() method, which basically forms a list of "distinct" values for the key you provide. This plays really well with arrays:
db.collection.distinct("children.toys")
And that is essentially a collection wide analysis of all the "distinct" occurrences of each "toys" value in the collection, returned as a simple "array".
As for your existing structure, it deserves a solution by way of explanation, but you really must understand that the solution is horrible. The problem is that the "native" and optimized methods available to general queries and aggregation are not available at all, and the only option is JavaScript based processing. Even though that is a little better through "v8" engine integration, it is still very slow when compared side by side with native code methods.
So from the "original" form that you have (shown in JavaScript form, as the functions should be easy enough to translate):
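To make the difference between the per-document and the collection-wide result concrete, here is a small in-memory sketch in plain JavaScript. The function names are hypothetical and the single document is the restructured sample from above; note that MongoDB does not guarantee the element order that $addToSet produces, whereas a JavaScript Set keeps insertion order.

```javascript
const people = [{
  _id: 0,
  firstname: "Tom",
  children: [
    { name: "childA", toys: ["batman", "car", "train"], movies: ["Ironman", "Deathwish"] },
    { name: "childB", toys: ["doll", "bike", "xbox"], movies: ["Frozen", "Barbie"] }
  ]
}];

// Per-document unique list, similar to $unwind + $unwind + $group with $addToSet.
function uniqueMoviesPerDoc(docs) {
  return docs.map(doc => ({
    _id: doc._id,
    movies: Array.from(new Set(doc.children.flatMap(c => c.movies)))
  }));
}

// Collection-wide unique list, similar to distinct("children.toys").
function distinctToys(docs) {
  return Array.from(new Set(docs.flatMap(doc => doc.children.flatMap(c => c.toys))));
}

const moviesPerDoc = uniqueMoviesPerDoc(people);
const allToys = distinctToys(people);
console.log(moviesPerDoc, allToys);
```

The first function returns one entry per document, the second a single flat array for the whole collection.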
db.collection.mapReduce(
// Mapper: emit one single-element "toys" list per toy, keyed by document _id
function() {
var id = this._id,
children = this.children;
Object.keys(children).forEach(function(child) {
var toys = children[child]["toys"] || {};
Object.keys(toys).forEach(function(toy) {
emit(
id, { "toys": [toys[toy]] }
);
});
});
},
// Reducer: merge the emitted lists, keeping each toy only once
function(key,values) {
var output = { "toys": [] };
values.forEach(function(value) {
value.toys.forEach(function(toy) {
if ( output.toys.indexOf( toy ) == -1 )
output.toys.push( toy );
});
});
return output;
},
{
"out": { "inline": 1 }
}
)
So JavaScript evaluation is the "horrible" approach as this is much slower in execution, and you see the "traversing" code that needs to be implemented. Bad news for performance, so don't do it. Change the structure instead.
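If you do decide to change the structure, a one-off conversion could look roughly like the following. It is a sketch only: `restructure` is a hypothetical helper, and it assumes every child has optional "toys" and "movies" objects as in the sample document.

```javascript
// Hypothetical migration helper: key-named children become an array of named entries.
function restructure(doc) {
  return {
    _id: doc._id,
    firstname: doc.firstname,
    children: Object.keys(doc.children).map(name => ({
      name: name,
      // Object.values drops the 'toy 1' / 'movie 1' style keys and keeps the values.
      toys: Object.values(doc.children[name].toys || {}),
      movies: Object.values(doc.children[name].movies || {})
    }))
  };
}

const original = {
  _id: 0,
  firstname: "Tom",
  children: {
    childA: {
      toys: { "toy 1": "batman", "toy 2": "car", "toy 3": "train" },
      movies: { "movie 1": "Ironman", "movie 2": "Deathwish" }
    },
    childB: {
      toys: { "toy 1": "doll", "toy 2": "bike", "toy 3": "xbox" },
      movies: { "movie 1": "Frozen", "movie 2": "Barbie" }
    }
  }
};

const converted = restructure(original);
console.log(JSON.stringify(converted, null, 2));
```

The converted document can then be queried with paths such as "children.toys" instead of one absolute path per child.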
As a final part, you could model this differently to avoid the "nested array" concept. And understand that the only real problem with a "nested array" is that "updating" a nested element is really impossible without reading in the whole document and modifying it.
So $push and $pull methods work fine. But using a "positional" $ operator just does not work as the "outer" array index is always the "first" matched element. So if this really was a problem for you then you could do something like this, for example:
{
"_id": 0,
"firstname":"Tom",
"childtoys" : [
{
"name": "childA",
"toy": "batman"
},
{
"name": "childA",
"toy": "car"
},
{
"name": "childA",
"toy": "train"
},
{
"name": "childB",
"toy": "doll"
},
{
"name": "childB",
"toy": "bike"
},
{
"name": "childB",
"toy": "xbox"
}
],
"childMovies": [
{
"name": "childA",
"movie": "Ironman"
},
{
"name": "childA",
"movie": "Deathwish"
},
{
"name": "childB",
"movie": "Frozen"
},
{
"name": "childB",
"movie": "Barbie"
}
]
}
That would be one way to avoid the problem with nested updates if you did indeed need to "update" items on a regular basis rather than just $push and $pull items to the "toys" and "movies" arrays.
But the overall message here is to design your data around the access patterns you actually use. MongoDB does generally not like things with a "strict path" in the terms of being able to query or otherwise flexibly issue updates.
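Note that 'True' and 'False' are Python literals: they work from a Python driver, but in the mongo shell projections use 1 and 0 (or lowercase true and false).
Moreover, ensure that field names are written in the correct case (uppercase/lowercase): the document has 'childA', not 'ChildA'.
In the shell the query should be as below: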
db.users.findOne({'_id': 0}, {'_id': 0, 'children.childA.movies': 1})
Which will result in :
{
"children" : {
"childA" : {
"movies" : {
"movie 1" : "Ironman",
"movie 2" : "Deathwish"
}
}
}
}
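Since the projection still returns the enclosing structure, getting at just the movies means walking down the result on the client side. A minimal sketch, assuming a hypothetical helper named `getPath` and the projected result shown above:

```javascript
// Hypothetical helper: follow a dotted path into a plain object.
function getPath(doc, path) {
  return path.split(".").reduce(
    (node, key) => (node == null ? undefined : node[key]),
    doc
  );
}

const projected = {
  children: { childA: { movies: { "movie 1": "Ironman", "movie 2": "Deathwish" } } }
};

const movies = getPath(projected, "children.childA.movies");
const wrongCase = getPath(projected, "children.ChildA.movies"); // keys are case sensitive
console.log(movies, wrongCase);
```

The second lookup comes back undefined, which is the same case-sensitivity issue that affects the projection itself.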
I know this question has been asked a lot of times but I'm kinda new to mongo and mongoose as well and I couldn't figure it out !
My problem:
I have a schema which looks like this:
var rankingSchema = new Schema({
userId : { type : Schema.Types.ObjectId, ref:'User' },
pontos : {type: Number, default:0},
placarExato : {type: Number, default:0},
golVencedor : {type: Number, default:0},
golPerdedor : {type: Number, default:0},
diferencaVencPerd : {type: Number, default:0},
empateNaoExato : {type: Number, default:0},
timeVencedor : {type: Number, default:0},
resumo : [{
partida : { type : Schema.Types.ObjectId, ref:'Partida' },
palpite : [Number],
quesito : String
}]
});
Which would return a document like this:
{
"_id" : ObjectId("539d0756f0ccd69ac5dd61fa"),
"diferencaVencPerd" : 0,
"empateNaoExato" : 0,
"golPerdedor" : 0,
"golVencedor" : 1,
"placarExato" : 2,
"pontos" : 78,
"resumo" : [
{
"partida" : ObjectId("5387d991d69197902ae27586"),
"_id" : ObjectId("539d07eb06b1e60000c19c18"),
"palpite" : [
2,
0
]
},
{
"partida" : ObjectId("5387da7b27f54fb425502918"),
"quesito" : "golsVencedor",
"_id" : ObjectId("539d07eb06b1e60000c19c1a"),
"palpite" : [
3,
0
]
},
{
"partida" : ObjectId("5387dc012752ff402a0a7882"),
"quesito" : "timeVencedor",
"_id" : ObjectId("539d07eb06b1e60000c19c1c"),
"palpite" : [
2,
1
]
},
{
"partida" : ObjectId("5387dc112752ff402a0a7883"),
"_id" : ObjectId("539d07eb06b1e60000c19c1e"),
"palpite" : [
1,
1
]
},
{
"partida" : ObjectId("53880ea52752ff402a0a7886"),
"quesito" : "placarExato",
"_id" : ObjectId("539d07eb06b1e60000c19c20"),
"palpite" : [
1,
2
]
},
{
"partida" : ObjectId("53880eae2752ff402a0a7887"),
"quesito" : "placarExato",
"_id" : ObjectId("539d0aa82fb219000054c84f"),
"palpite" : [
2,
1
]
}
],
"timeVencedor" : 1,
"userId" : ObjectId("539b2f2930de100000d7356c")
}
My question is, first: How can I filter the resumo nested document by quesito ? Is it possible to paginate this result, since this array is going to increase. And last question, is this a nice approach to this case ?
Thank you guys !
As noted, your schema implies that you actually have embedded data even though you are storing an external reference. So it is not clear if you are doing both embedding and referencing or simply embedding by itself.
The big caveat here is the difference between matching a "document" and actually filtering the contents of an array. Since you seem to be talking about "paging" your array results, the large focus here is on doing that, but still making mention of the warnings.
Multiple "filtered" matches in an array require the aggregation framework. You can generally "project" a single matching array element with a normal query, but aggregation is needed where you expect more than one:
Ranking.aggregate(
[
// This match finds "documents" that "contain" the match
{ "$match": { "resumo.quesito": "value" } },
// Unwind de-normalizes arrays as documents
{ "$unwind": "$resumo" },
// This match actually filters those document matches
{ "$match": { "resumo.quesito": "value" } },
// Skip and limit for paging, which really only makes sense on single
// document matches
{ "$skip": 0 },
{ "$limit": 2 },
// Return as an array in the original document if you really want
{ "$group": {
"_id": "$_id",
"otherField": { "$first": "$otherField" },
"resumo": { "$push": "$resumo" }
}}
],
function(err,results) {
}
)
Or the MongoDB 2.6 way, by "filtering" inside a $project using the $map operator. You still need to $unwind in order to "page" array positions, but there is possibly less processing as the array is "filtered" first:
Ranking.aggregate(
[
// This match finds "documents" that "contain" the match
{ "$match": { "resumo.quesito": "value" } },
// Filter with $map
{ "$project": {
"otherField": 1,
"resumo": {
"$setDifference": [
{
"$map": {
"input": "$resumo",
"as": "el",
"in": { "$cond": [ { "$eq": [ "$$el.quesito", "value" ] }, "$$el", false ] }
}
},
[false]
]
}
}},
// Unwind de-normalizes arrays as documents
{ "$unwind": "$resumo" },
// Skip and limit for paging, which really only makes sense on single
// document matches
{ "$skip": 0 },
{ "$limit": 2 },
// Return as an array in the original document if you really want
{ "$group": {
"_id": "$_id",
"otherField": { "$first": "$otherField" },
"resumo": { "$push": "$resumo" }
}}
],
function(err,results) {
}
)
The inner usage of $skip and $limit here really only makes sense when you are processing a single document and just "filtering" and "paging" the array. It is possible to do this with multiple documents, but is very involved as there is no way to just "slice" the array. Which brings us to the next point.
Really with embedded arrays, for paging that does not require any filtering you just use the $slice operator, which was designed for this purpose:
Ranking.find({},{ "resumo": { "$slice": [0,2] } },function(err,docs) {
});
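Whichever server-side form you pick, the intended result is "filter, then take a window". The sketch below shows that in plain JavaScript over an in-memory array; `pageArray` is a hypothetical helper and the data is a trimmed copy of the resumo array from the question.

```javascript
// Hypothetical helper: filter first, then apply skip/limit to what is left.
function pageArray(items, predicate, skip, limit) {
  return items.filter(predicate).slice(skip, skip + limit);
}

const resumo = [
  { palpite: [2, 0] },
  { quesito: "golsVencedor", palpite: [3, 0] },
  { quesito: "timeVencedor", palpite: [2, 1] },
  { palpite: [1, 1] },
  { quesito: "placarExato", palpite: [1, 2] },
  { quesito: "placarExato", palpite: [2, 1] }
];

// Unfiltered paging, comparable to { "$slice": [0, 2] }.
const firstPage = pageArray(resumo, () => true, 0, 2);

// Filtered paging, comparable to the $unwind / $match / $skip / $limit pipeline.
const exactScores = pageArray(resumo, el => el.quesito === "placarExato", 0, 2);

console.log(firstPage, exactScores);
```

Note that $slice takes [skip, limit], while Array.prototype.slice takes start and end positions, hence the `skip + limit`.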
Your alternative, though, is to simply reference the documents in the external collection and then pass the arguments to mongoose .populate() to filter and "page" the results. The change in the schema itself would just be:
"resumo": [{ "type": Schema.Types.ObjectId, "ref": "Partida" }]
With the external referenced collection now holding the object detail rather than embedding directly in the array. The use of .populate() with filtering and paging is:
Ranking.find().populate({
"path": "resumo",
"match": { "quesito": "value" },
"options": { "skip": 0, "limit": 2 }
}).exec(function(err,docs) {
// Keep only the parents that had at least one populated match
docs = docs.filter(function(doc) {
return doc.resumo.length;
});
});
Of course the possible problem there is that you can no longer actually query for the documents that contain the "embedded" information as it is now in another collection. This results in pulling in all documents, though possibly by some other query condition, but then manually testing them to see if they were "populated" by the filtered query that was sent to retrieve those items.
So it really does depend on what you are doing and what your approach is. If you regularly intend to "search" on inner arrays then embedding will generally suit you better. Also, if you are really only interested in "paging" then the $slice operator works well for this purpose with embedded documents. But beware of growing embedded arrays too large.
Using a referenced schema with mongoose helps with some size concerns, and there is methodology in place to assist with "paging" results and filtering them as well. The drawback is that you can no longer query "inside" those elements from the parent itself. So parent selection by the inner elements is not well suited here. Also keep in mind that while not all of the data is embedded, there is still the reference to the _id value of the external document. So you can still end up with large arrays, which may not be desirable.
For anything large, consider that you will likely be doing the work yourself, and working backwards from the "child" items to then match the parent(s).
I am not sure that you can filter sub-documents directly with mongoose. However, you can get the parent document with Model.find({'resumo.quesito': 'THEVALUE'}) (you should also add an index on it).
Then, once you have the parent, you can get the child by comparing the quesito.
Additional documentation can be found here: http://mongoosejs.com/docs/subdocs.html
I have a huge data set (in millions) in the following format :
{
"userid" : "codejammer",
"data" : [
{"type" : "number", "value" : "23748"},
{"type" : "message","value" : "one"}
]
}
I want to get count of message with value one for userid - codejammer
The following is the mapreduce function I am using :
Map :
var map = function(){
emit(this.data[0].value,1);
}
Reduce
var reduce = function(key,values){
return Array.sum(values);
}
Options
var options = {
"query":{"userid" : "codejammer",
"data.type" : "message"},
"out" : "aggregrated"
}
The mapreduce function executes successfully with the following output:
{
"_id" : 23748,
"value" : 1
}
But, I am expecting the following output :
{
"_id" : one,
"value" : 1
}
The query filter in options, is sending the entire array to map function even though I specifically ask for data.type : "message"
Is there any way to use projection operator in query filter to get only the required item in array ?
Thank you very much for your help.
You actually would be better off doing this with aggregate. There is no need for mapReduce in this case and the aggregation framework runs as native code and will be much faster than running through the JavaScript interpreter:
db.collection.aggregate([
// Still makes sense to match the documents to reduce the set
{ "$match": {
"userid": "codejammer",
"data": { "$elemMatch": {
"type": "message", "value": "one"
}}
}},
// Unwind to de-normalize the array content
{ "$unwind": "$data" },
// Filter the content of the array
{ "$match": {
"data.type": "message",
"data.value": "one"
}},
// Count all the matching entries
{ "$group": {
"_id": null,
"count": { "$sum": 1 }
}}
])
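The effect of each stage can be checked against a few in-memory documents. This is a plain JavaScript sketch, not driver code: `countMessages` is a hypothetical name, and the second and third documents are made up to show an array with more than one message and a different user.

```javascript
const users = [
  { userid: "codejammer", data: [
    { type: "number", value: "23748" },
    { type: "message", value: "one" }
  ] },
  // Made-up documents for illustration only.
  { userid: "codejammer", data: [
    { type: "message", value: "one" },
    { type: "message", value: "two" }
  ] },
  { userid: "someone_else", data: [
    { type: "message", value: "one" }
  ] }
];

function countMessages(docs, userid, value) {
  return docs
    .filter(d => d.userid === userid)                        // first $match (document level)
    .flatMap(d => d.data)                                    // $unwind
    .filter(e => e.type === "message" && e.value === value)  // second $match (element level)
    .length;                                                 // $group with $sum: 1
}

const messageCount = countMessages(users, "codejammer", "one");
console.log(messageCount);
```

The first filter only narrows the set of documents; it is the second one, applied after the unwind, that discards the non-matching array elements.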
Of course if you actually did only ever have one "message" inside your "data" array this becomes very simple:
db.collection.aggregate([
// Match the documents you want
{ "$match": {
"userid": "codejammer",
"data": { "$elemMatch": {
"type": "message", "value": "one"
}}
}},
// Simply count the documents
{ "$group": {
"_id": null,
"count": { "$sum": 1 }
}}
])
But of course that is actually no different to this:
db.collection.find({
"userid": "codejammer",
"data": { "$elemMatch": {
"type": "message", "value": "one"
}}
}).count()
So while there is a way to do this with mapReduce, the other ways shown are much better, especially in the newly released 2.6 version and upwards, where the aggregation pipeline can make use of disk storage (via the allowDiskUse option) to handle very large collections.
But to get the count using mapReduce you were basically going about it the wrong way. The projection will not work as an input, so you need to take the element out of the results. I'm still going to consider that there could possibly be more than one matching value in your array even if that was not the case:
db.collection.mapReduce(
function() {
var userid = this.userid;
this.data.forEach(function(doc) {
if ( doc.type == condition.type && doc.value == condition.value )
emit( userid, 1 );
});
},
function(key,values) {
return Array.sum( values );
},
{
"query": {
"userid": "codejammer",
"data": { "$elemMatch": {
"type": "message", "value": "one"
}}
},
"scope": {
"condition": {"type" : "message", "value" : "one"}
},
"out": { "inline": 1 }
}
)
So in much the same way this "emits" a value for the common key when a document matching your criteria is found inside the data array. So you cannot project just the matching element, you get all of them and you filter in this way.
Since you are only expecting one result there is no point in actually outputting to a collection, so just return it inline.
But basically, use the aggregation method if you have to do this.
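One detail worth keeping in mind with any mapReduce reducer: MongoDB may call the reduce function more than once for the same key, feeding earlier results back in as values. The sketch below is a simplified model of that behaviour in plain JavaScript (the batching in `reduceInBatches` is an assumption made for illustration, not how the server actually schedules calls), showing why a summing reducer stays correct while a length-based one does not.

```javascript
const sumReducer = (key, values) => values.reduce((a, b) => a + b, 0);
const lengthReducer = (key, values) => values.length;

// Reduce in batches, then reduce the partial results again ("re-reduce").
function reduceInBatches(reducer, key, values, batchSize) {
  const partials = [];
  for (let i = 0; i < values.length; i += batchSize) {
    partials.push(reducer(key, values.slice(i, i + batchSize)));
  }
  return partials.length > 1 ? reducer(key, partials) : partials[0];
}

const emitted = [1, 1, 1, 1, 1]; // five emit(userid, 1) calls for the same key

const summed = reduceInBatches(sumReducer, "codejammer", emitted, 2);
const counted = reduceInBatches(lengthReducer, "codejammer", emitted, 2);
console.log(summed, counted);
```

With five emitted values and batches of two, the summing reducer still arrives at 5, while the length-based one reports 3, the number of partial results rather than the number of matches.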