MongoDB aggregation vs client-side processing - JavaScript

I have a blogs collection with roughly the following schema:
{
    title: { name: "My First Blog Post",
             postDate: "01-28-11" },
    content: "Here is my super long post ...",
    comments: [ { text: "This post sucks!"
                , name: "seanhess"
                , postDate: "01-28-14" }
              , { text: "I know! I wish it were longer"
                , name: "bob"
                , postDate: "01-28-11" }
              ]
}
I mainly want to run three queries:
Give me all the comments made only by bob
Find all the comments made on the same day the post was written, i.e. comments.postDate = title.postDate
Find all the comments made by bob on the same day the post was written
My questions are as follows:
These three are going to be very frequent queries, so is it a good idea to use the aggregation framework?
For the third query, I could simply run db.blogs.find({"comments.name": "bob"}, {"comments.name": 1, "comments.postDate": 1, "title.postDate": 1}) and then do some client-side post-processing to loop through the returned results. Is that a good idea? I'd like to note that this might return several thousand documents.
I will be happy if you can propose some ways to build the third query.

It would probably be best practice here to break up your multiple questions into several questions, if only because the answer to one question might have led you to understand the others.
I am also not very keen on answering anything where there is no example of what you have tried. But with that said, and "shooting myself in the foot", the questions are reasonable from a design approach, so I will answer.
Point 1 : Comments by "bob"
Standard $unwind and filter the results. Use $match first so you don't process unneeded documents.
db.collection.aggregate([
// Match to "narrow down" the documents.
{ "$match": { "comments.name": "bob" }},
// Unwind the array
{ "$unwind": "$comments" },
// Match and "filter" just the "bob" comments
{ "$match": { "comments.name": "bob" }},
// Possibly wind back the array
{ "$group": {
"_id": "$_id",
"title": { "$first": "$title" },
"content": { "$first": "$content" },
"comments": { "$push": "$comments" }
}}
])
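As an aside, on MongoDB 3.2 or later the same "comments by bob" filter can be done without the $unwind/$group round trip by using $filter; this is a minimal sketch of that alternative, not what was available when this answer was written:
db.collection.aggregate([
    // Match to "narrow down" the documents
    { "$match": { "comments.name": "bob" }},
    // Keep only the array members whose name is "bob"
    { "$project": {
        "title": 1,
        "content": 1,
        "comments": {
            "$filter": {
                "input": "$comments",
                "as": "comment",
                "cond": { "$eq": [ "$$comment.name", "bob" ] }
            }
        }
    }}
])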
Point 2: All comments on the same day
db.collection.aggregate([
// Try and match posts within a date or range
// { "$match": { "title.postDate": Date( /* something */ ) }},
// Unwind the array
{ "$unwind": "$comments" },
// Aha! Project out the same day. Not the time-stamp.
{ "$project": {
"title": 1,
"content": 1,
"comments": 1,
"same": { "$eq": [
{
"year" : { "$year": "$title.postDate" },
"month" : { "$month": "$title.postDate" },
"day": { "$dayOfMonth": "$title.postDate" }
},
{
"year" : { "$year": "$comments.postDate" },
"month" : { "$month": "$comments.postDate" },
"day": { "$dayOfMonth": "$comments.postDate" }
}
]}
}},
// Match the things on the "same" field
{ "$match": { "same": true } },
// Possibly wind back the array
{ "$group": {
"_id": "$_id",
"title": { "$first": "$title" },
"content": { "$first": "$content" },
"comments": { "$push": "$comments" }
}}
])
Point 3: "bob" on the same date
db.collection.aggregate([
// Try and match posts within a date or range
// { "$match": { "title.postDate": Date( /* something */ ) }},
// Unwind the array
{ "$unwind": "$comments" },
// Aha! Project out the same day. Not the time-stamp.
{ "$project": {
"title": 1,
"content": 1,
"comments": 1,
"same": { "$eq": [
{
"year" : { "$year": "$title.postDate" },
"month" : { "$month": "$title.postDate" },
"day": { "$dayOfMonth": "$title.postDate" }
},
{
"year" : { "$year": "$comments.postDate" },
"month" : { "$month": "$comments.postDate" },
"day": { "$dayOfMonth": "$comments.postDate" }
}
]}
}},
// Match the things on the "same" field
{ "$match": { "same": true, "comments.name": "bob" } },
// Possibly wind back the array
{ "$group": {
"_id": "$_id",
"title": { "$first": "$title" },
"content": { "$first": "$content" },
"comments": { "$push": "$comments" }
}}
])
Results
Honestly, and especially if you are using some indexing to feed the initial $match stages of these operations, it should be very clear that this will "run rings" around trying to iterate this in code.
At the very least this reduces the records returned "over the wire", so there is less network traffic. And of course there is less (or nothing) to post-process once the query results have been received.
As a general convention, database server hardware tends to be rated an order of magnitude higher in performance than "application server" hardware. So again, the general rule is that anything executed on the server will run faster.
Is aggregation the right thing: "Yes", and by a long, long way. As of MongoDB 2.6 you even get a cursor back from aggregate.
How can you do the queries you want: shown to be pretty simple. And in real-world code we never "hard code" this; we build it dynamically. So adding conditions and attributes should be as simple as all your normal data manipulation code.
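As a minimal sketch of that dynamic construction (the helper name and its parameters are my own, not part of the question):
// Hypothetical builder for the "comments on the post date" pipeline,
// optionally filtered by author name. Dates are assumed to be real Date values.
function buildCommentPipeline(name) {
    var pipeline = [
        { "$unwind": "$comments" },
        { "$project": {
            "title": 1,
            "content": 1,
            "comments": 1,
            "same": { "$eq": [
                { "year": { "$year": "$title.postDate" },
                  "month": { "$month": "$title.postDate" },
                  "day": { "$dayOfMonth": "$title.postDate" } },
                { "year": { "$year": "$comments.postDate" },
                  "month": { "$month": "$comments.postDate" },
                  "day": { "$dayOfMonth": "$comments.postDate" } }
            ]}
        }}
    ];
    var match = { "same": true };
    if (name != null) {
        // Narrow down documents early, and also filter the unwound comments
        pipeline.unshift({ "$match": { "comments.name": name } });
        match["comments.name"] = name;
    }
    pipeline.push({ "$match": match });
    return pipeline;
}
db.collection.aggregate(buildCommentPipeline("bob"));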
So I would not normally answer this style of question. But say thank you! Please?

$addFields not accepting arrays in mongodb

I am stuck on a problem where I have a field that is sometimes a string and sometimes an array, so how can I handle that in the $addFields stage?
I am sharing my mongo query code:
db.ledger_scheme_logs.aggregate([
{
$match:{
"type":{ $in: ["add","edit"]},
}
},
{
"$addFields": {
"trail_beginning": {
$substr: [ "$metadata.schemes._trail", 0, 36 ]
}
}
},
{
$group: {
"_id": {
"trail_beginning":"$trail_beginning"
},
"count": { $sum: 1 },
"items": { $push: "$$ROOT" },
}
},
{
"$sort": {
count: -1
}
}
])
In this query, for "$metadata.schemes._trail", the schemes field is an array of objects in some documents, and because of that I am getting the mongo error "message" : "can't convert from BSON type array to String". How can I solve this type of problem? Any help with an example would be appreciated.
Thanks in advance!
The bigger and trickier question here is about what behavior you would like the system to have rather than how to actually make the database do it. There's a closely related topic around (consistent) schema design that naturally follows.
To directly answer your question, you can use the $cond operator to conditionally calculate the new trail_beginning field based on the data type of the source document currently being processed. An example would be something like:
{
"$addFields": {
"trail_beginning": {
"$cond": {
"if": {
$eq: [
{
$type: "$metadata.schemes"
},
"array"
]
},
"then": {
"$map": {
"input": "$metadata.schemes._trail",
"in": {
$substr: [
"$$this",
0,
3
]
}
}
},
"else": {
$substr: [
"$metadata.schemes._trail",
0,
3
]
}
}
}
}
}
Using two sample documents with different schemas yields the following as demonstrated in this playground example:
[
{
"_id": 1,
"metadata": {
"schemes": {
"_trail": "ABCDEFG"
}
},
"trail_beginning": "ABC"
},
{
"_id": 2,
"metadata": {
"schemes": [
{
"_trail": "HIJKLMN"
},
{
"_trail": "OPQRSTU"
}
]
},
"trail_beginning": [
"HIJ",
"OPQ"
]
}
]
Taking a glance at the rest of your pipeline though, I suspect (but can't say for sure) that this isn't actually what you want to do. This is because the subsequent $group will use the entire array of values to do the grouping, but I'm (again) guessing that you want to group based on individual values.
If my assumptions are correct, then logically what you really want to do is $unwind the array first before you do the substring transformation. This will correct the subsequent grouping logic and, as a side effect, it will also eliminate your problem of having different possible input types during the $addFields stage. Your full pipeline would look something like this:
db.ledger_scheme_logs.aggregate([
{
$match:{
"type":{ $in: ["add","edit"]},
}
},
{
$unwind: "$metadata.schemes"
},
{
"$addFields": {
"trail_beginning": {
$substr: [ "$metadata.schemes._trail", 0, 36 ]
}
}
},
{
$group: {
"_id": {
"trail_beginning":"$trail_beginning"
},
"count": { $sum: 1 },
"items": { $push: "$$ROOT" },
}
},
{
"$sort": {
count: -1
}
}
])
Playground demonstration (using a shorter substring) here.
This works because $unwind will treat non-array field paths as a single element array. However, having a discrepancy in the schema may frequently result in you having to put in special conditional logic to account for the difference in various places in the application. Consider simplifying development by making the schema consistent (converting the non-arrays to arrays with single values).
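A minimal sketch of that normalization as a one-off migration, assuming MongoDB 4.2+ pipeline updates and the collection name from the question:
// Wrap every non-array "metadata.schemes" value in a single-element array
// so all documents share one schema. Test against a copy of the data first.
db.ledger_scheme_logs.updateMany(
    { "$expr": { "$ne": [ { "$type": "$metadata.schemes" }, "array" ] } },
    [ { "$set": { "metadata.schemes": [ "$metadata.schemes" ] } } ]
)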

mongoose find from nested array of objects

Hey, I am quite new to mongoose and can't get my head around search.
models
User->resumes[]->employments[]
UserSchema
{
resumes: [ResumeSchema],
...
}
ResumeSchema
{
employments: [EmploymentSchema],
...
}
EmploymentSchema
{
jobTitle: {
type: String,
required: [true, "Job title is required."]
},
...
}
Background
The user has to enter a job title and needs suggestions drawn from the job titles in the employments of the already existing resumes.
I have tried the following code.
let q = req.query.q; // Software
User.find({ "resumes.employments.jobTitle": new RegExp(req.query.q, 'ig') }, {
"resumes.employments.$": 1
}, (err, docs) => {
res.json(docs);
})
Output
[
{
_id: '...',
resumes:[
{
employments: [
{
jobTitle: 'Software Developer',
...
},
...
]
},
...
]
},
...
]
Expected OutPut
["Software Developer", "Software Engineer", "Software Manager"]
Problem
1) The data returned is too much, as I only need jobTitle.
2) All employments are being returned, whereas the query matched only one of them.
3) Is there a better way to do it? Via an index or via $search? I did not find much information in the mongoose documentation on creating a search index (and I also don't really know how to create a compound index to make it work).
I know there might be a lot of answers, but none of them helped or I was not able to make them work. I am really new to mongodb; I have been working with relational databases via SQL or through ORMs, so my mongodb concepts and knowledge are limited.
So please let me know if there is a better solution, or something to make the current one work.
You can use one of the aggregation queries below to get this result:
[
{
"jobTitle": [
"Software Engineer",
"Software Manager",
"Software Developer"
]
}
]
The first query:
First use $unwind twice to deconstruct the arrays and get to the values.
Then $match to filter the values you want using $regex.
Then $group to collect all the values together (using _id: null and $addToSet to avoid adding duplicates).
And finally $project to show only the field you want.
User.aggregate([
    { "$unwind": "$resumes" },
    { "$unwind": "$resumes.employments" },
    { "$match": {
        "resumes.employments.jobTitle": {
            "$regex": "software",
            "$options": "i"
        }
    }},
    { "$group": {
        "_id": null,
        "jobTitle": {
            "$addToSet": "$resumes.employments.jobTitle"
        }
    }},
    { "$project": {
        "_id": 0
    }}
])
Example here
Another option is to use $filter within a $project stage:
It is similar to the previous query, but uses $filter over the employments array instead of the second $unwind/$match.
User.aggregate([
    { "$unwind": "$resumes" },
    { "$project": {
        "jobs": {
            "$filter": {
                "input": "$resumes.employments",
                "as": "e",
                "cond": {
                    "$regexMatch": {
                        "input": "$$e.jobTitle",
                        "regex": "Software",
                        "options": "i"
                    }
                }
            }
        }
    }},
    { "$unwind": "$jobs" },
    { "$group": {
        "_id": null,
        "jobTitle": {
            "$addToSet": "$jobs.jobTitle"
        }
    }},
    { "$project": {
        "_id": 0
    }}
])
Example here
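Either way, application code still needs to unpack the single grouped document; a small sketch of that (the variable names are my own):
// The pipeline returns at most one document shaped like { jobTitle: [ ... ] }
const docs = await User.aggregate([ /* stages as above */ ]);
const suggestions = docs.length ? docs[0].jobTitle : [];
res.json(suggestions); // e.g. ["Software Developer", "Software Engineer", "Software Manager"]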

How to determine which query line has no match in mongodb [duplicate]

I have got my data in the following format:
{
    "_id" : ObjectId("534fd4662d22a05415000000"),
    "product_id" : "50862224",
    "ean" : "8808992479390",
    "brand" : "LG",
    "model" : "37LH3000",
    "features" : [
        {
            "key" : "Screen Format",
            "value" : "16:9"
        },
        {
            "key" : "DVD Player / Recorder",
            "value" : "No"
        },
        {
            "key" : "Weight in kg",
            "value" : "12.6"
        }
        ... so on
    ]
}
I need to compare the features of one product with others and divide the results into separate categories (100% match, 50-99% match) based on the percentage of features that match.
My initial thought was to prepare a dynamic query with an $or condition for each feature and do the percentage calculation in PHP, but that means mongodb will return even those products which have only 1 matching feature. And I think nearly all products in a category might have some feature in common, so I fear I might be processing a lot of products in PHP.
I have two questions basically.
Is there any alternate way?
And is the data structure I am using good enough to support the functionality I am looking for, or should I consider changing it?
Well, your solution really should be MongoDB specific, otherwise you will end up doing your calculations and possible matching on the client side, and that is not going to be good for performance.
So of course what you really want is a way to have that processing happen on the server side:
db.products.aggregate([
// Match the documents that meet your conditions
{ "$match": {
"$or": [
{
"features": {
"$elemMatch": {
"key": "Screen Format",
"value": "16:9"
}
}
},
{
"features": {
"$elemMatch": {
"key" : "Weight in kg",
"value" : { "$gt": "5", "$lt": "8" }
}
}
},
]
}},
// Keep the document and a copy of the features array
{ "$project": {
"_id": {
"_id": "$_id",
"product_id": "$product_id",
"ean": "$ean",
"brand": "$brand",
"model": "$model",
"features": "$features"
},
"features": 1
}},
// Unwind the array
{ "$unwind": "$features" },
// Find the actual elements that match the conditions
{ "$match": {
"$or": [
{
"features.key": "Screen Format",
"features.value": "16:9"
},
{
"features.key" : "Weight in kg",
"features.value" : { "$gt": "5", "$lt": "8" }
},
]
}},
// Count those matched elements
{ "$group": {
"_id": "$_id",
"count": { "$sum": 1 }
}},
// Restore the document and divide the matched elements by the
// number of elements in the "or" condition
{ "$project": {
"_id": "$_id._id",
"product_id": "$_id.product_id",
"ean": "$_id.ean",
"brand": "$_id.brand",
"model": "$_id.model",
"features": "$_id.features",
"matched": { "$divide": [ "$count", 2 ] }
}},
// Sort by the matched percentage
{ "$sort": { "matched": -1 } }
])
So as you know the "length" of the $or condition being applied, you simply need to find out how many of the elements in the "features" array match those conditions. That is what the second $match in the pipeline is for.
Once you have that count, you simply divide by the number of conditions that were passed in as your $or. The beauty here is that now you can do something useful with this, like sorting by that relevance and then even "paging" the results server side.
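Since you would be generating this dynamically anyway, here is a hedged sketch of building both $or lists and the divisor from one feature list (the wanted variable and its shape are my assumptions):
// Build the two $or forms and the divisor from a single definition,
// so the "$divide" denominator always matches the number of conditions.
var wanted = [
    { key: "Screen Format", value: "16:9" },
    { key: "Weight in kg", value: { "$gt": "5", "$lt": "8" } }
];
var elemConds = wanted.map(function (f) {
    return { "features": { "$elemMatch": { "key": f.key, "value": f.value } } };
});
var unwoundConds = wanted.map(function (f) {
    return { "features.key": f.key, "features.value": f.value };
});
// Use { "$or": elemConds } in the first $match, { "$or": unwoundConds } in
// the second, and { "$divide": [ "$count", wanted.length ] } in the $project.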
Of course if you want some additional "categorization" of this, all you would need to do is add another $project stage to the end of the pipeline:
{ "$project": {
"product_id": 1
"ean": 1
"brand": 1
"model": 1,
"features": 1,
"matched": 1,
"category": { "$cond": [
{ "$eq": [ "$matched", 1 ] },
"100",
{ "$cond": [
{ "$gte": [ "$matched", .7 ] },
"70-99",
{ "$cond": [
"$gte": [ "$matched", .4 ] },
"40-69",
"under 40"
]}
]}
]}
}}
Or something similar. But the $cond operator can help you here.
The architecture should be fine as you have it as you can have a compound index on the "key" and "value" for the entries in your features array and this should scale well for queries.
Of course if you actually need something more than that, such as faceted searching and results, you can look at solutions like Solr or elastic search. But the full implementation of that would be a bit lengthy for here.
I'm assuming that you'd like to compare the rest of the collection to a given product, which is a textbook example of aggregation:
lookingat = db.products.findOne({product_id:'50862224'})
matches = db.products.aggregate([
{ $unwind: '$features' },
{ $match: { features: { $in: lookingat.features }}},
{ $group: { _id: '$product_id', matchedfeatures: { $sum:1 }}},
{ $sort: { matchedfeatures: -1 }},
{ $limit: 5 },
{ $project: { _id:0, product_id: '$_id',
pctmatch: { $multiply: [ '$matchedfeatures',
100/lookingat.features.length ]}
}}
])
Walking through this briefly from the perspective of a product in the collection that has 6 features, and comparing it to the target product ('lookingat') which has 4 features, 3 of which match:
$unwind turns 1 document with 6 features into 6 otherwise-identical documents with 1 feature each
$match looks for that feature in the target's feature array (be aware that two documents are "equal" only if they have the same field names and values, in the same order), discards the 3 that don't match, and passes along the 3 that do
$group consumes those 3 matching documents and produces a new one that tells you there were 3 documents that matched that product_id
$sort and $limit give you the most relevant results and leave behind all those 1-feature matches you were concerned about
$project lets you rename the _id from the $group step back to product_id and also turn the number of matching features into a percentage (we avoided a $divide operation by recognizing that 2 of the 3 terms in our calculation are constants and can be divided in JS)
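To illustrate the field-order caveat from the $match step (this tiny example is mine, not from the original answer):
// Whole-document equality is order-sensitive: only the first $match below
// finds an element stored as { key: "Screen Format", value: "16:9" }.
db.products.aggregate([
    { "$unwind": "$features" },
    { "$match": { "features": { "key": "Screen Format", "value": "16:9" } } }  // matches
    // { "$match": { "features": { "value": "16:9", "key": "Screen Format" } } }  // would not
])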

Group by Multiple Fields In Meteor

What I am trying to achieve is, within a given date range, to group Users first by time and then by userId.
I tried the below query to group by multiple fields:
ReactiveAggregate(this, Questionaire,
[
{
"$match": {
"time": {$gte: fromDate, $lte: toDate},
"userId": {'$regex' : regex}
}
},
{
$group : {
"_id": {
"userId": "$userId",
"date": { $dateToString: { format: "%Y-%m-%d", date: "$time" } }
},
"total": { "$sum": 1 }
}
}
], { clientCollection: "Questionaire" }
);
But when I execute it on the server side, it shows me the below error:
Exception from sub Questionaire id kndfrx9EuZ5EejKmE
Error: Meteor does not currently support objects other than ObjectID as ids
The message actually says it all, since the "compound" _id value that you are generating via the $group is not actually supported in the clientCollection output which will be published.
The simple solution of course is to not use the resulting _id value from $group as the "final" _id value in the generated output. So just as the example on the project README demonstrates, simply add a $project that removes the _id and renames the present "compound grouping key" as a different property name:
ReactiveAggregate(this, Questionaire,
[
{
"$match": {
"time": {$gte: fromDate, $lte: toDate},
"userId": {'$regex' : regex}
}
},
{
$group : {
"_id": {
"userId": "$userId",
"date": { $dateToString: { format: "%Y-%m-%d", date: "$time" } }
},
"total": { "$sum": 1 }
}
},
// Add the reshaping to the end of the pipeline
{
"$project": {
"_id": 0, // remove the _id, this will be automatically filled
"userDate": "$_id", // the renamed compound key
"total": 1
}
}
], { clientCollection: "Questionaire" }
);
The field order will be different because MongoDB keeps the existing fields (i.e. "total" in this example) and then adds any new fields to the document. You can counter that by using different field names in the $group and $project stages rather than the 1 inclusive syntax.
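For example, a small sketch of that rename (the "count" name here is my own choice):
// Rewriting "total" under a new name makes it a computed field rather than
// a kept one, which is the suggested way around the ordering behavior above.
{
    "$project": {
        "_id": 0,
        "userDate": "$_id",
        "count": "$total"
    }
}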
Without such a plugin, this sort of reshaping is something that has been regularly done in the past, by again renaming the output _id and supplying a new _id value compatible with what meteor client collections expect to be present in this property.
On closer inspection of how the code is implemented, it is probably best to actually supply an _id value in the results because the plugin actually makes no effort to create an _id value.
So simply extracting one of the existing document _id values in the grouping should be sufficient. So I would add a $max to do this, and then replace the _id in the $project:
ReactiveAggregate(this, Questionaire,
[
{
"$match": {
"time": {$gte: fromDate, $lte: toDate},
"userId": {'$regex' : regex}
}
},
{
$group : {
"_id": {
"userId": "$userId",
"date": { $dateToString: { format: "%Y-%m-%d", date: "$time" } }
},
"maxId": { "$max": "$_id" },
"total": { "$sum": 1 }
}
},
// Add the reshaping to the end of the pipeline
{
"$project": {
"_id": "$maxId", // replaced _id
"userDate": "$_id", // the renamed compound key
"total": 1
}
}
], { clientCollection: "Questionaire" }
);
This could be easily patched in the plugin by replacing the lines
if (!sub._ids[doc._id]) {
sub.added(options.clientCollection, doc._id, doc);
} else {
sub.changed(options.clientCollection, doc._id, doc);
}
With using Random.id() when the document(s) output from the pipeline did not already have an _id value present:
if (!sub._ids[doc._id]) {
sub.added(options.clientCollection, doc._id || Random.id(), doc);
} else {
sub.changed(options.clientCollection, doc._id || Random.id(), doc);
}
But that might be a note to the author to consider updating the package.

Return limited number of records of a certain type, but unlimited number of other records?

I have a query where I need to return 10 of "Type A" records, while returning all other records. How can I accomplish this?
Update: Admittedly, I could do this with two queries, but I wanted to avoid that, if possible, thinking it would be less overhead, and possibly more performant. My query already is an aggregation query that takes both kinds of records into account, I just need to limit the number of the one type of record in the results.
Update: the following is an example query that highlights the problem:
db.books.aggregate([
{$geoNear: {near: [-118.09771, 33.89244], distanceField: "distance", spherical: true}},
{$match: {"type": "Fiction"}},
{$project: {
'title': 1,
'author': 1,
'type': 1,
'typeSortOrder':
{$add: [
{$cond: [{$eq: ['$type', "Fiction"]}, 1, 0]},
{$cond: [{$eq: ['$type', "Science"]}, 0, 0]},
{$cond: [{$eq: ['$type', "Horror"]}, 3, 0]}
]},
}},
{$sort: {'typeSortOrder': 1}},
{$limit: 10}
])
db.books.aggregate([
{$geoNear: {near: [-118.09771, 33.89244], distanceField: "distance", spherical: true}},
{$match: {"type": "Horror"}},
{$project: {
'title': 1,
'author': 1,
'type': 1,
'typeSortOrder':
{$add: [
{$cond: [{$eq: ['$type', "Fiction"]}, 1, 0]},
{$cond: [{$eq: ['$type', "Science"]}, 0, 0]},
{$cond: [{$eq: ['$type', "Horror"]}, 3, 0]}
]},
}},
{$sort: {'typeSortOrder': 1}},
{$limit: 10}
])
db.books.aggregate([
{$geoNear: {near: [-118.09771, 33.89244], distanceField: "distance", spherical: true}},
{$match: {"type": "Science"}},
{$project: {
'title': 1,
'author': 1,
'type': 1,
'typeSortOrder':
{$add: [
{$cond: [{$eq: ['$type', "Fiction"]}, 1, 0]},
{$cond: [{$eq: ['$type', "Science"]}, 0, 0]},
{$cond: [{$eq: ['$type', "Horror"]}, 3, 0]}
]},
}},
{$sort: {'typeSortOrder': 1}},
{$limit: 10}
])
I would like to have all these records returned in one query, but limit the type to at most 10 of any category.
I realize that the typeSortOrder doesn't need to be conditional when the queries are broken out like this, I had it there for when the queries were one query, originally (which is where I would like to get back to).
I don't think this is presently (2.6) possible to do with one aggregation pipeline. It's difficult to give a precise argument as to why not, but basically the aggregation pipeline performs transformations of streams of documents, one document at a time. There's no awareness within the pipeline of the state of the stream itself, which is what you'd need to determine that you've hit the limit for A's, B's, etc and need to drop further documents of the same type. $group does bring multiple documents together and allows their field values in aggregate to affect the resulting group document ($sum, $avg, etc.). Maybe this makes some sense, but it's necessarily not rigorous because there are simple operations you could add to make it possible to limit based on the types, e.g., adding a $push x accumulator to $group that only pushes the value if the array being pushed to has fewer than x elements.
Even if I did have a way to do it, I'd recommend just doing two aggregations. Keep it simple.
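A minimal sketch of that "one capped pipeline per type" approach in the shell (the type list and limit come from the question; the merging loop is my own):
// Run one $geoNear + $match + $limit pipeline per type and concatenate.
var types = ["Fiction", "Science", "Horror"];
var results = [];
types.forEach(function (type) {
    results = results.concat(
        db.books.aggregate([
            { "$geoNear": { "near": [-118.09771, 33.89244], "distanceField": "distance", "spherical": true } },
            { "$match": { "type": type } },
            { "$limit": 10 }
        ]).toArray()
    );
});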
Problem
The results here are not impossible but are also possibly impractical. The general notes have been made that you cannot "slice" an array or otherwise "limit" the amount of results pushed onto one. And the method for doing this per "type" is essentially to use arrays.
The "impractical" part is usually about the number of results, where too large a result set is going to blow up the BSON document limit when "grouping". But, I'm going to consider this with some other recommendations on your "geo search" along with the ultimate goal to return 10 results of each "type" at most.
Principle
To first consider and understand the problem, let's look at a simplified "set" of data and the pipeline code necessary to return the "top 2 results" from each type:
{ "title": "Title 1", "author": "Author 1", "type": "Fiction", "distance": 1 },
{ "title": "Title 2", "author": "Author 2", "type": "Fiction", "distance": 2 },
{ "title": "Title 3", "author": "Author 3", "type": "Fiction", "distance": 3 },
{ "title": "Title 4", "author": "Author 4", "type": "Science", "distance": 1 },
{ "title": "Title 5", "author": "Author 5", "type": "Science", "distance": 2 },
{ "title": "Title 6", "author": "Author 6", "type": "Science", "distance": 3 },
{ "title": "Title 7", "author": "Author 7", "type": "Horror", "distance": 1 }
That's a simplified view of the data and somewhat representative of the state of documents after an initial query. Now comes the trick of how to use the aggregation pipeline to get the "nearest" two results for each "type":
db.books.aggregate([
{ "$sort": { "type": 1, "distance": 1 } },
{ "$group": {
"_id": "$type",
"1": {
"$first": {
"_id": "$_id",
"title": "$title",
"author": "$author",
"distance": "$distance"
}
},
"books": {
"$push": {
"_id": "$_id",
"title": "$title",
"author": "$author",
"distance": "$distance"
}
}
}},
{ "$project": {
"1": 1,
"books": {
"$cond": [
{ "$eq": [ { "$size": "$books" }, 1 ] },
{ "$literal": [false] },
"$books"
]
}
}},
{ "$unwind": "$books" },
{ "$project": {
"1": 1,
"books": 1,
"seen": { "$eq": [ "$1", "$books" ] }
}},
{ "$sort": { "_id": 1, "seen": 1 } },
{ "$group": {
"_id": "$_id",
"1": { "$first": "$1" },
"2": { "$first": "$books" },
"books": {
"$push": {
"$cond": [ { "$not": "$seen" }, "$books", false ]
}
}
}},
{ "$project": {
"1": 1,
"2": 2,
"pos": { "$literal": [1,2] }
}},
{ "$unwind": "$pos" },
{ "$group": {
"_id": "$_id",
"books": {
"$push": {
"$cond": [
{ "$eq": [ "$pos", 1 ] },
"$1",
{ "$cond": [
{ "$eq": [ "$pos", 2 ] },
"$2",
false
]}
]
}
}
}},
{ "$unwind": "$books" },
{ "$match": { "books": { "$ne": false } } },
{ "$project": {
"_id": "$books._id",
"title": "$books.title",
"author": "$books.author",
"type": "$_id",
"distance": "$books.distance",
"sortOrder": {
"$add": [
{ "$cond": [ { "$eq": [ "$_id", "Fiction" ] }, 1, 0 ] },
{ "$cond": [ { "$eq": [ "$_id", "Science" ] }, 0, 0 ] },
{ "$cond": [ { "$eq": [ "$_id", "Horror" ] }, 3, 0 ] }
]
}
}},
{ "$sort": { "sortOrder": 1 } }
])
Of course that is just two results, but it outlines the process for getting n results, which naturally is done in generated pipeline code. Before moving onto the code the process deserves a walk through.
After any query, the first thing to do here is $sort the results, and this you want to basically do by both the "grouping key" which is the "type" and by the "distance" so that the "nearest" items are on top.
The reason for this is shown in the $group stages that will repeat. What is done is essentially "popping" the $first result off of each grouping "stack". So other documents are not lost; they are placed in an array using $push.
Just to be safe, the next stage is really only required after the "first step", but could optionally be added for similar filtering in the repetition. The main check here is that the resulting "array" is larger than just one item. Where it is not, the contents are replaced with a single value of false. The reason for which is about to become evident.
After this "first step" the real repetition cycle beings, where that array is then "de-normalized" with $unwind and then a $project made in order to "match" the document that has been last "seen".
As only one of the documents will match this condition the results are again "sorted" in order to float the "unseen" documents to the top, while of course maintaining the grouping order. The next thing is similar to the first $group step, but where any kept positions are maintained and the "first unseen" document is "popped off the stack" again.
The document that was "seen" is then pushed back to the array not as itself but as a value of false. This is not going to match the kept value and this is generally the way to handle this without being "destructive" to the array contents where you don't want the operations to fail should there not be enough matches to cover the n results required.
Cleaning up when complete, the next "projection" adds an array to the final documents, now grouped by "type", representing each position in the n results required. When this array is unwound, the documents can again be grouped back together, but now all in a single array that possibly contains several false values yet is n elements long.
Finally unwind the array again, use $match to filter out the false values, and project to the required document form.
Practicality
The problem as stated earlier is with the number of results being filtered as there is a real limit on the number of results that can be pushed into an array. That is mostly the BSON limit, but you also don't really want 1000's of items even if that is still under the limit.
The trick here is keeping the initial "match" small enough that the "slicing operations" becomes practical. There are some things with the $geoNear pipeline process that can make this a possibility.
The obvious one is limit. By default this is 100, but you clearly want to have something in the range of:
(the number of categories you can possibly match) X ( required matches )
But if this is essentially a number not in the 1000's then there is already some help here.
The others are maxDistance and minDistance, where essentially you put upper and lower bounds on how "far out" to search. The max bound is the general limiter while the min bound is useful when "paging", which is the next helper.
When "upwardly paging", you can use the query argument in order to exclude the _id values of documents "already seen" using the $nin query. In much the same way, the minDistance can be populated with the "last seen" largest distance, or at least the smallest largest distance by "type". This allows some concept of filtering out things that have already been "seen" and getting another page.
Really a topic in itself, but those are the general things to look for in reducing that initial match in order to make the process practical.
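A hedged sketch of what such a "next page" initial stage might look like with the 2.6-era $geoNear options discussed here (seenIds and lastDistance are assumed to come from the previous page's results):
// Exclude already-seen _id values and start from the smallest "largest
// distance" per type; "limit" padded to categories x required matches.
{ "$geoNear": {
    "near": [-118.09771, 33.89244],
    "distanceField": "distance",
    "spherical": true,
    "limit": 300,
    "minDistance": lastDistance,
    "query": { "_id": { "$nin": seenIds } }
}}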
Implementing
The general problem of returning "10 results at most, per type" is clearly going to want some code in order to generate the pipeline stages. No-one wants to type that out, and practically speaking you will probably want to change that number at some point.
So now to the code that can generate the monster pipeline. All the code is in JavaScript, but it is easy to translate in principle:
var coords = [-118.09771, 33.89244];
var key = "$type";
var val = {
"_id": "$_id",
"title": "$title",
"author": "$author",
"distance": "$distance"
};
var maxLen = 10;
var stack = [];
var pipe = [];
var fproj = { "$project": { "pos": { "$literal": [] } } };
pipe.push({ "$geoNear": {
"near": coords,
"distanceField": "distance",
"spherical": true
}});
pipe.push({ "$sort": {
"type": 1, "distance": 1
}});
for ( var x = 1; x <= maxLen; x++ ) {
fproj["$project"][""+x] = 1;
fproj["$project"]["pos"]["$literal"].push( x );
var rec = {
"$cond": [ { "$eq": [ "$pos", x ] }, "$"+x ]
};
if ( stack.length == 0 ) {
rec["$cond"].push( false );
} else {
var lval = stack.pop();
rec["$cond"].push( lval );
}
stack.push( rec );
if ( x == 1) {
pipe.push({ "$group": {
"_id": key,
"1": { "$first": val },
"books": { "$push": val }
}});
pipe.push({ "$project": {
"1": 1,
"books": {
"$cond": [
{ "$eq": [ { "$size": "$books" }, 1 ] },
{ "$literal": [false] },
"$books"
]
}
}});
} else {
pipe.push({ "$unwind": "$books" });
var proj = {
"$project": {
"books": 1
}
};
proj["$project"]["seen"] = { "$eq": [ "$"+(x-1), "$books" ] };
var grp = {
"$group": {
"_id": "$_id",
"books": {
"$push": {
"$cond": [ { "$not": "$seen" }, "$books", false ]
}
}
}
};
for ( n=x; n >= 1; n-- ) {
if ( n != x )
proj["$project"][""+n] = 1;
grp["$group"][""+n] = ( n == x ) ? { "$first": "$books" } : { "$first": "$"+n };
}
pipe.push( proj );
pipe.push({ "$sort": { "_id": 1, "seen": 1 } });
pipe.push(grp);
}
}
pipe.push(fproj);
pipe.push({ "$unwind": "$pos" });
pipe.push({
"$group": {
"_id": "$_id",
"msgs": { "$push": stack[0] }
}
});
pipe.push({ "$unwind": "$books" });
pipe.push({ "$match": { "books": { "$ne": false } }});
pipe.push({
"$project": {
"_id": "$books._id",
"title": "$books.title",
"author": "$books.author",
"type": "$_id",
"distance": "$books",
"sortOrder": {
"$add": [
{ "$cond": [ { "$eq": [ "$_id", "Fiction" ] }, 1, 0 ] },
{ "$cond": [ { "$eq": [ "$_id", "Science" ] }, 0, 0 ] },
{ "$cond": [ { "$eq": [ "$_id", "Horror" ] }, 3, 0 ] },
]
}
}
});
pipe.push({ "$sort": { "sortOrder": 1, "distance": 1 } });
Alternate
Of course the end result here and the general problem with all above is that you really only want the "top 10" of each "type" to return. The aggregation pipeline will do it, but at the cost of keeping more than 10 and then "popping off the stack" until 10 is reached.
An alternate approach is to "brute force" this with mapReduce and "globally scoped" variables. Not as nice since the results are all in arrays, but it may be a practical approach:
db.collection.mapReduce(
function () {
if ( !stash.hasOwnProperty(this.type) ) {
stash[this.type] = [];
}
if ( stash[this.type].length < maxLen ) {
stash[this.type].push({
"title": this.title,
"author": this.author,
"type": this.type,
"distance": this.distance
});
emit( this.type, 1 );
}
},
function(key,values) {
return 1; // really just want to keep the keys
},
{
"query": {
"location": {
"$nearSphere": [-118.09771, 33.89244]
}
},
"scope": { "stash": {}, "maxLen": 10 },
"finalize": function(key,value) {
return { "msgs": stash[key] };
},
"out": { "inline": 1 }
}
)
This is a real cheat which just uses the "global scope" to keep a single object whose keys are the grouping keys. The results are pushed onto an array in that global object until the maximum length is reached. Results are already sorted by nearest, so the mapper just gives up doing anything with the current document after 10 per key are reached.
The reducer won't be called since only 1 document per key is emitted. The finalize then just "pulls" the value from the global and returns it in the result.
Simple, but of course you don't have all the $geoNear options if you really need them, and this form has a hard limit of 100 documents as the output from the initial query.
This is a classic case for a subquery/join, which is not supported by MongoDB. All joins and subquery-like operations need to be implemented in the application logic. So multiple queries are your best bet. Performance of the multiple-query approach should be good if you have an index on type.
Alternatively you can write a single aggregation query minus the type-matching and limit clauses and then process the stream in your application logic to limit documents per type.
This approach will perform poorly for large result sets because documents may be returned in random order, so your limiting logic will need to traverse the entire result set.
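A minimal sketch of that client-side limiting (the per-type cap of 10 is from the question; the rest is my own):
// Cap documents per type while iterating the aggregation cursor.
var counts = {};
var out = [];
db.books.aggregate([ /* single pipeline minus type-matching and $limit */ ]).forEach(function (doc) {
    counts[doc.type] = (counts[doc.type] || 0) + 1;
    if (counts[doc.type] <= 10) out.push(doc);
});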
I guess you can use cursor.limit() on a cursor to specify the maximum number of documents the cursor will return. limit() is analogous to the LIMIT statement in a SQL database.
You must apply limit() to the cursor before retrieving any documents from the database.
The limit function in the cursors can be used for limiting the number of records in the find.
I guess this example should help:
var myCursor = db.bios.find().limit( 5 );
