How to determine which query line has no match in mongodb [duplicate] - javascript

I have got my data in the following format:
{
    "_id" : ObjectId("534fd4662d22a05415000000"),
    "product_id" : "50862224",
    "ean" : "8808992479390",
    "brand" : "LG",
    "model" : "37LH3000",
    "features" : [
        {
            "key" : "Screen Format",
            "value" : "16:9"
        },
        {
            "key" : "DVD Player / Recorder",
            "value" : "No"
        },
        {
            "key" : "Weight in kg",
            "value" : "12.6"
        }
        ... and so on
    ]
}
I need to compare the features of one product with others and divide the results into separate categories (100% match, 50-99% match) based on the percentage of features that match.
My initial thought was to prepare a dynamic query with an $or condition for each feature and do the percentage calculation in PHP, but that means MongoDB would return even those products which have only 1 matching feature. And I think nearly all products of a category might have some feature in common, so I fear I might be working on a lot of products in PHP.
I have two questions, basically:
Is there any alternate way?
And is the data structure I am using good enough to support the functionality I am looking for, or should I consider changing it?

Well, your solution really should be MongoDB specific, otherwise you will end up doing your calculations and possible matching on the client side, and that is not going to be good for performance.
So of course what you really want is a way to have that processing happen on the server side:
db.products.aggregate([
// Match the documents that meet your conditions
{ "$match": {
"$or": [
{
"features": {
"$elemMatch": {
"key": "Screen Format",
"value": "16:9"
}
}
},
{
"features": {
"$elemMatch": {
"key" : "Weight in kg",
"value" : { "$gt": "5", "$lt": "8" }
}
}
},
]
}},
// Keep the document and a copy of the features array
{ "$project": {
"_id": {
"_id": "$_id",
"product_id": "$product_id",
"ean": "$ean",
"brand": "$brand",
"model": "$model",
"features": "$features"
},
"features": 1
}},
// Unwind the array
{ "$unwind": "$features" },
// Find the actual elements that match the conditions
{ "$match": {
"$or": [
{
"features.key": "Screen Format",
"features.value": "16:9"
},
{
"features.key" : "Weight in kg",
"features.value" : { "$gt": "5", "$lt": "8" }
},
]
}},
// Count those matched elements
{ "$group": {
"_id": "$_id",
"count": { "$sum": 1 }
}},
// Restore the document and divide the matched elements by the
// number of elements in the "or" condition
{ "$project": {
"_id": "$_id._id",
"product_id": "$_id.product_id",
"ean": "$_id.ean",
"brand": "$_id.brand",
"model": "$_id.model",
"features": "$_id.features",
"matched": { "$divide": [ "$count", 2 ] }
}},
// Sort by the matched percentage
{ "$sort": { "matched": -1 } }
])
So as you know the "length" of the $or condition being applied, you simply need to find out how many of the elements in the "features" array match those conditions. That is what the second $match in the pipeline is all about.
Once you have that count, you simply divide by the number of conditions that were passed in as your $or. The beauty here is that now you can do something useful with this, like sorting by that relevance and then even "paging" the results server side.
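As a rough sketch of that dynamic construction (shell JavaScript; the conditions list here is an illustrative assumption standing in for whatever your application passes):
var conditions = [
    { "key": "Screen Format", "value": "16:9" },
    { "key": "Weight in kg", "value": { "$gt": "5", "$lt": "8" } }
];

// $or for the first $match, testing whole array elements
var orElem = conditions.map(function(c) {
    return { "features": { "$elemMatch": c } };
});

// $or for the second $match after $unwind, using dot notation
var orDot = conditions.map(function(c) {
    return { "features.key": c.key, "features.value": c.value };
});

// The divisor for the "matched" percentage is just the condition count,
// used as { "$divide": [ "$count", conditions.length ] }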
Of course if you want some additional "categorization" of this, all you would need to do is add another $project stage to the end of the pipeline:
{ "$project": {
"product_id": 1
"ean": 1
"brand": 1
"model": 1,
"features": 1,
"matched": 1,
"category": { "$cond": [
{ "$eq": [ "$matched", 1 ] },
"100",
{ "$cond": [
{ "$gte": [ "$matched", .7 ] },
"70-99",
{ "$cond": [
"$gte": [ "$matched", .4 ] },
"40-69",
"under 40"
]}
]}
]}
}}
Or something similar. The point is that the $cond operator can help you here.
The architecture should be fine as you have it, since you can have a compound index on the "key" and "value" of the entries in your features array, and this should scale well for queries.
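For reference, creating that multikey compound index is a one-liner (createIndex on current servers; older shells used ensureIndex):
db.products.createIndex({ "features.key": 1, "features.value": 1 })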
Of course if you actually need something more than that, such as faceted searching and results, you can look at solutions like Solr or Elasticsearch. But the full implementation of that would be a bit lengthy for here.

I'm assuming that you'd like to compare the rest of the collection to a given product, which is a textbook example of aggregation:
var lookingat = db.products.findOne({ product_id: '50862224' })
var matches = db.products.aggregate([
{ $unwind: '$features' },
{ $match: { features: { $in: lookingat.features }}},
{ $group: { _id: '$product_id', matchedfeatures: { $sum:1 }}},
{ $sort: { matchedfeatures: -1 }},
{ $limit: 5 },
{ $project: { _id:0, product_id: '$_id',
pctmatch: { $multiply: [ '$matchedfeatures',
100/lookingat.features.length ]}
}}
])
Walking through this briefly from the perspective of a product in the collection that has 6 features, and comparing it to the target product ('lookingat') which has 4 features, 3 of which match:
$unwind turns 1 document with 6 features into 6 otherwise-identical documents with 1 feature each
$match looks for that feature in the target's feature array (be aware that two documents are "equal" only if they have the same field names and values, in the same order), discards the 3 that don't match, and passes along the 3 that do
$group consumes those 3 matching documents and produces a new one that tells you there were 3 documents that matched that product_id
$sort and $limit give you the most relevant results and leave behind all those 1-feature matches you were concerned about
$project lets you rename the _id from the $group step back to product_id and also turn the number of matching features into a percentage (we avoided a $divide operation by recognizing that 2 of the 3 terms in our calculation are constants and can be divided in JS)
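If you then want the buckets from the question (100% match, 50-99% match), a small client-side pass over these few results is enough. A sketch, assuming the cursor form of aggregate output (older shells returned a { result: [...] } document instead):
matches.forEach(function(doc) {
    doc.category = (doc.pctmatch === 100) ? "100" :
                   (doc.pctmatch >= 50)   ? "50-99" :
                                            "under 50";
    printjson(doc);
});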

Related

Query for only a single document based on key name in MongoDB

I am working on a MERN project. I have created a collection in MongoDB containing different types of documents. Is it an accepted practice to have differently structured documents in a single collection? Secondly, I need to fetch only a single document from the collection using the key name. My documents are:
[{
"_id": {
"$oid": "6333f72822dc0acc4bea17bd"
},
"designation": [
{
"name": "Chairman",
"level": 17
},
{
"name": "Director",
"level": 13
},
{
"name": "Secretary ",
"level": 13
},
{
"name": "Account Officer",
"level": 9
},
{
"name": "Data Entry Operator-GR B",
"level": 5
}
]
},
{
"_id": {
"$oid": "6334313b22dc0acc4bea17c2"
},
"storeRole": ["manager", "approver", "accepter", "firstsignatory"]
},
{
"_id": {
"$oid": "63369d2083a7cc2e818990dd"
},
"designationSuffix": ["I","II", "III"]
}]
How do I get any of the three documents if I only know the key name, i.e. designation, storeRole, or designationSuffix? I don't want to use the ID value.
Welcome to SO.
First, yes it is an accepted practice and indeed, a powerful feature of MongoDB to have different shapes of data in a single collection.
There are two important things to remember when querying for data:
Matching on fields that don't even exist in a document is OK; the document will simply be skipped. This permits you, for example, to query for storeRole and ignore the other documents with designation, etc. -- unless of course you wish to look for those too using an $or expression.
Matching (using $match) for elements in an array will return the whole array, not just the elements that match.
To illustrate this point, let's expand your input data slightly:
{"designation": [
{"name": "Chairman","level": 17},
{"name": "Director", "level": 13}
]
},
{"designation": [
{"name": "Secretary","level": 13}
]
},
We will use dot notation to reach into the structures in the designation array to find those docs where at least one of the name fields is Chairman:
db.foo.aggregate([
{$match: {"designation.name": "Chairman"}}
]);
{
"_id" : 0,
"designation" : [
{
"name" : "Chairman",
"level" : 17
},
{
"name" : "Director",
"level" : 13
}
]
}
The query eliminated the document with name = Secretary as expected but properly returned the whole document (and the whole array) where name = Chairman. Very often the goal is to fetch only the matching items in the array; this is accomplished with the $filter operator:
db.foo.aggregate([
{$match: {"designation.name": "Chairman"}},
{$project: {
// Assigning the output of $filter to the same name as input:
designation: {$filter: {
input: "$designation",
as: "zz",
cond: {$eq: ['$$zz.name','Chairman']}
}}
}}
]);
{
"_id" : 0,
"designation" : [
{
"name" : "Chairman",
"level" : 17
}
]
}
An alternative approach, which is useful when query conditions would yield null or empty arrays instead of eliminating the document altogether, is to $filter first, then match only on results where the filtered array has at least one element. We must use the $ifNull operator to protect $size from being passed a null by turning it into an empty (but not null) array:
db.foo.aggregate([
{$project: {
// Assigning the output of $filter to the same name as input:
designation: {$filter: {
input: "$designation",
as: "zz",
cond: {$eq: ['$$zz.name','Chairman']}
}}
}},
{$match: {$expr: {$gt:[{$size: {$ifNull:["$designation",[] ]}}, 0]}} }
]);
Try commenting out the $match to see what $filter returns when a document has the target array field but no matches vs. when the document does not have the field.

mongo aggregate based on conditions to filter the document for versioning

I am working on versioning. We have documents based on uuids and jobUuids; the jobUuid documents are the ones associated with the currently working user. I have some aggregate queries on these collections which I need to update based on the job UUIDs.
The results fetched by the aggregate query should be such that:
if the current user's jobUuid document does not exist, then the master document with jobUuid: "default" is returned (the document without any job-specific jobUuid);
if a document for the job uuid exists, then only that document is returned.
I have a $match used to get these documents based on certain conditions; from those documents I need to filter the documents based on the above conditions. An example is shown below.
The data looks like this:
[
{
"uuid": "5cdb5a10-4f9b-4886-98c1-31d9889dd943",
"name": "adam",
"jobUuid": "default",
},
{
"uuid": "5cdb5a10-4f9b-4886-98c1-31d9889dd943",
"jobUuid": "d275781f-ed7f-4ce4-8f7e-a82e0e9c8f12",
"name": "adam"
},
{
"uuid": "b745baff-312b-4d53-9438-ae28358539dc",
"name": "eve",
"jobUuid": "default",
},
{
"uuid": "b745baff-312b-4d53-9438-ae28358539dc",
"jobUuid": "d275781f-ed7f-4ce4-8f7e-a82e0e9c8f12",
"name": "eve"
},
{
"uuid": "26cba689-7eb6-4a9e-a04e-24ede0309e50",
"name": "john",
"jobUuid": "default",
}
]
Results for "jobUuid": "d275781f-ed7f-4ce4-8f7e-a82e0e9c8f12" should be:
[
{
"uuid": "5cdb5a10-4f9b-4886-98c1-31d9889dd943",
"jobUuid": "d275781f-ed7f-4ce4-8f7e-a82e0e9c8f12",
"name": "adam"
},
{
"uuid": "b745baff-312b-4d53-9438-ae28358539dc",
"jobUuid": "d275781f-ed7f-4ce4-8f7e-a82e0e9c8f12",
"name": "eve"
},
{
"uuid": "26cba689-7eb6-4a9e-a04e-24ede0309e50",
"name": "john",
"jobUuid": "default",
}
]
Based on the conditions mentioned above, is it possible to filter the document within the aggregate query to extract the document of a specific job uuid?
Edit 1: I found the following solution, which is working fine, but I want a better solution that eliminates all those nested stages.
Edit 2: Updated the data with actual UUIDs, and I included only the name as another field; we have n number of fields which are not relevant to include here but are needed at the end (mentioning this for those who want to use projection over all the fields).
Update based on comment:
but the UUIDs are alphanumeric strings, as shown above, does it have an effect on these sorting, and since we are not using conditions to get the results, I am worried it will cause issues.
You could use an additional field to make the sort order follow the order of the values in the $in expression. Make sure you provide the values with "default" as the last value.
[
{"$match":{"jobUuid":{"$in":["d275781f-ed7f-4ce4-8f7e-a82e0e9c8f12","default"]}}},
{"$addFields":{ "order":{"$indexOfArray":[["d275781f-ed7f-4ce4-8f7e-a82e0e9c8f12","default"], "$jobUuid"]}}},
{"$sort":{"uuid":1, "order":1}},
{
"$group": {
"_id": "$uuid",
"doc":{"$first":"$$ROOT"}
}
},
{"$project":{"doc.order":0}},
{"$replaceRoot":{"newRoot":"$doc"}}
]
example here - https://mongoplayground.net/p/wXiE9i18qxf
Original
You could use the query below. It will pick the non-default document for a uuid if one exists, or else pick the default as the only document.
[
{"$match":{"jobUuid":{"$in":[1,"default"]}}},
{"$sort":{"uuid":1, "jobUuid":1}},
{
"$group": {
"_id": "$uuid",
"doc":{"$first":"$$ROOT"}
}
},
{"$replaceRoot":{"newRoot":"$doc"}}
]
example here - https://mongoplayground.net/p/KrL-1s8WCpw
Here is what I would do:
match stage with $in rather than an $or (for readability)
group stage with _id on $uuid, just as you did, but instead of pushing all the data into an array, be more selective. _id is already storing $uuid, so there is no reason to capture it again. name must always be the same for each $uuid, so take only the first instance. Based on the match, there are only two possibilities for jobUuid; this assumes it will be either "default" or something else, and that there can be more than one occurrence of the non-"default" jobUuid. Use "$addToSet" instead of pushing to an array in case there are multiple occurrences of the same jobUuid for a user; also, before adding to the set, use a conditional to only add non-"default" jobUuids, with $$REMOVE to avoid inserting a null when the jobUuid is "default".
Finally, "$project" to clean things up. If element 0 of the jobUuids array does not exist (is null), there is no other possibility for this user than for the jobUuid to be "default", so use "$ifNull" to test and set "default" as appropriate. There could be more than 1 jobUuid here, depending if that is allowed in your db/application, up to you to decide how to handle that (take the highest, take the lowest, etc).
Tested at: https://mongoplayground.net/p/e76cVJf0F3o
[{
"$match": {
"jobUuid": {
"$in": [
"1",
"default"
]
}
}
},
{
"$group": {
"_id": "$uuid",
"name": {
"$first": "$name"
},
"jobUuids": {
"$addToSet": {
"$cond": {
"if": {
"$ne": [
"$jobUuid",
"default"
]
},
"then": "$jobUuid",
"else": "$$REMOVE"
}
}
}
}
},
{
"$project": {
"_id": 0,
"uuid": "$_id",
"name": 1,
"jobUuid": {
"$ifNull": [{
"$arrayElemAt": [
"$jobUuids",
0
]
},
"default"
]
}
}
}]
I was able to solve this problem with the following aggregate query.
We first extract the results matching only the jobUuid provided by the user, or the "default", in the $match stage.
Then the results are grouped based on the uuid using a $group stage, and we count the grouped documents as well.
Using the conditions in $replaceRoot, we first check the size of each group:
If the group contains 2 or more documents, we select the document that matches the provided jobUuid.
If it contains 1 or fewer, we check whether it matches the default jobUuid and return it.
The query is below:
[
{
$match: {
$or: [{ jobUuid:1 },{ jobUuid: 'default'}]
}
},
{
$group: {
_id: '$uuid',
count: {
$sum: 1
},
docs: {
$push: '$$ROOT'
}
}
},
{
$replaceRoot: {
newRoot: {
$cond: {
if: {
$gte: [
'$count',
2
]
},
then: {
$arrayElemAt: [
{
$filter: {
input: '$docs',
as: 'item',
cond: {
$ne: [
'$$item.jobUuid',
'default'
]
}
}
},
0
]
},
else: {
$arrayElemAt: [
{
$filter: {
input: '$docs',
as: 'item',
cond: {
$eq: [
'$$item.jobUuid',
'default'
]
}
}
},
0
]
}
}
}
}
}
]

Ordering by count of filtered subdocument array elements

I currently have a MongoDB collection that looks like so:
{
    "_id": ObjectId,
    "user_id": Number,
    "updates": [
        {
            "_id": ObjectId,
            "mode": Number,
            "score": Number
        },
        {
            "_id": ObjectId,
            "mode": Number,
            "score": Number
        },
        {
            "_id": ObjectId,
            "mode": Number,
            "score": Number
        }
    ]
}
I am looking to find a way to find the users with the largest number of updates per mode. For instance, if I specify mode 0, I want it to load the users in order of greatest number of updates with mode: 0.
Is this possible in MongoDB? It does not need to be a fast algorithm, as it will be cached for quite a while, and it will run asynchronously.
The fastest way would be to store a count for each "mode" within the document as another field, then you could just sort on that:
var update = {
"$push": { "updates": updateDoc },
};
var countDoc = {};
countDoc["counts." + updateDoc.mode] = 1;
update["$inc"] = countDoc;
Model.update(
{ "_id": id },
update,
function(err,numAffected) {
}
);
Which would use $inc to increment a "counts" field for each "mode" value as a key for each "mode" pushed to the "updates" array. All the calculation happens on update, so it's fast and so is the query that can be applied with a sort on that value:
Model.find({ "updates.mode": 0 }).sort({ "counts.0": -1 }).exec(function(err,users) {
});
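If this becomes a hot query, a compound index can support both the match and the sort (a sketch; the underlying collection name for Model is assumed):
db.users.createIndex({ "updates.mode": 1, "counts.0": -1 })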
If you don't want to or cannot store such a field then the other option is to calculate at query time with .aggregate():
Model.aggregate(
[
{ "$match": { "updates.mode": 0 } },
{ "$project": {
"user_id": 1,
"updates": 1,
"count": {
"$size": {
"$setDifference": [
{ "$map": {
"input": "$updates",
"as": "el",
"in": {
"$cond": [
{ "$eq": [ "$$el.mode", 0 ] },
"$$el",
false
]
}
}},
[false]
]
}
}
}},
{ "$sort": { "count": -1 } }
],
function(err,results) {
}
);
Which isn't bad, since filtering the array and getting the $size is fairly efficient, but it's not as fast as just using a stored value.
The $map operator allows inline processing of the array elements, each of which is tested by $cond to see whether it returns a match or false. Then $setDifference removes any false values. This is a much better way to filter array content than using $unwind, which can slow things down significantly and should not be used unless your intent is to aggregate array content across documents.
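As an aside, if you are on MongoDB 3.2 or later, the same "count the matching elements" projection can be written more directly with the $filter operator (a sketch, not part of the original answer):
{ "$project": {
    "user_id": 1,
    "updates": 1,
    "count": {
        "$size": {
            "$filter": {
                "input": "$updates",
                "as": "el",
                "cond": { "$eq": [ "$$el.mode", 0 ] }
            }
        }
    }
}}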
But the better approach is to store the value for the count instead, since this does not require runtime calculation and can even use an index
I think this is a duplicate of this question:
Mongo find query for longest arrays inside object
The accepted answer seems to be doing exactly what you ask for.
db.collection.aggregate( [
{ $unwind : "$l" },
{ $group : { _id : "$_id", len : { $sum : 1 } } },
{ $sort : { len : -1 } },
{ $limit : 25 }
] )
just replace "$l" with "$updates".
[edit:] And you probably do not want the result limited to 25, so you should also get rid of the { $limit : 25 } stage.

Return limited number of records of a certain type, but unlimited number of other records?

I have a query where I need to return 10 of "Type A" records, while returning all other records. How can I accomplish this?
Update: Admittedly, I could do this with two queries, but I wanted to avoid that, if possible, thinking it would be less overhead, and possibly more performant. My query already is an aggregation query that takes both kinds of records into account, I just need to limit the number of the one type of record in the results.
Update: the following is an example query that highlights the problem:
db.books.aggregate([
{$geoNear: {near: [-118.09771, 33.89244], distanceField: "distance", spherical: true}},
{$match: {"type": "Fiction"}},
{$project: {
'title': 1,
'author': 1,
'type': 1,
'typeSortOrder':
{$add: [
{$cond: [{$eq: ['$type', "Fiction"]}, 1, 0]},
{$cond: [{$eq: ['$type', "Science"]}, 0, 0]},
{$cond: [{$eq: ['$type', "Horror"]}, 3, 0]}
]},
}},
{$sort: {'typeSortOrder': 1}},
{$limit: 10}
])
db.books.aggregate([
{$geoNear: {near: [-118.09771, 33.89244], distanceField: "distance", spherical: true}},
{$match: {"type": "Horror"}},
{$project: {
'title': 1,
'author': 1,
'type': 1,
'typeSortOrder':
{$add: [
{$cond: [{$eq: ['$type', "Fiction"]}, 1, 0]},
{$cond: [{$eq: ['$type', "Science"]}, 0, 0]},
{$cond: [{$eq: ['$type', "Horror"]}, 3, 0]}
]},
}},
{$sort: {'typeSortOrder': 1}},
{$limit: 10}
])
db.books.aggregate([
{$geoNear: {near: [-118.09771, 33.89244], distanceField: "distance", spherical: true}},
{$match: {"type": "Science"}},
{$project: {
'title': 1,
'author': 1,
'type': 1,
'typeSortOrder':
{$add: [
{$cond: [{$eq: ['$type', "Fiction"]}, 1, 0]},
{$cond: [{$eq: ['$type', "Science"]}, 0, 0]},
{$cond: [{$eq: ['$type', "Horror"]}, 3, 0]}
]},
}},
{$sort: {'typeSortOrder': 1}},
{$limit: 10}
])
I would like to have all these records returned in one query, but limit the type to at most 10 of any category.
I realize that the typeSortOrder doesn't need to be conditional when the queries are broken out like this, I had it there for when the queries were one query, originally (which is where I would like to get back to).
I don't think this is presently (2.6) possible to do with one aggregation pipeline. It's difficult to give a precise argument as to why not, but basically the aggregation pipeline performs transformations of streams of documents, one document at a time. There's no awareness within the pipeline of the state of the stream itself, which is what you'd need to determine that you've hit the limit for A's, B's, etc and need to drop further documents of the same type. $group does bring multiple documents together and allows their field values in aggregate to affect the resulting group document ($sum, $avg, etc.). Maybe this makes some sense, but it's necessarily not rigorous because there are simple operations you could add to make it possible to limit based on the types, e.g., adding a $push x accumulator to $group that only pushes the value if the array being pushed to has fewer than x elements.
Even if I did have a way to do it, I'd recommend just doing two aggregations. Keep it simple.
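A sketch of that two-aggregation approach for the books example in the question (prepend your $geoNear stage to each pipeline as needed; "Fiction" stands in for "Type A"):
var fiction = db.books.aggregate([
    { "$match": { "type": "Fiction" } },
    { "$limit": 10 }
]);

var everythingElse = db.books.aggregate([
    { "$match": { "type": { "$ne": "Fiction" } } }
]);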
Problem
The results here are not impossible, but possibly impractical. The general point has been made that you cannot "slice" an array or otherwise "limit" the number of results pushed onto one, and the method for doing this per "type" essentially requires arrays.
The "impractical" part is usually about the number of results, where too large a result set is going to blow up the BSON document limit when "grouping". But, I'm going to consider this with some other recommendations on your "geo search" along with the ultimate goal to return 10 results of each "type" at most.
Principle
To first consider and understand the problem, let's look at a simplified "set" of data and the pipeline code necessary to return the "top 2 results" from each type:
{ "title": "Title 1", "author": "Author 1", "type": "Fiction", "distance": 1 },
{ "title": "Title 2", "author": "Author 2", "type": "Fiction", "distance": 2 },
{ "title": "Title 3", "author": "Author 3", "type": "Fiction", "distance": 3 },
{ "title": "Title 4", "author": "Author 4", "type": "Science", "distance": 1 },
{ "title": "Title 5", "author": "Author 5", "type": "Science", "distance": 2 },
{ "title": "Title 6", "author": "Author 6", "type": "Science", "distance": 3 },
{ "title": "Title 7", "author": "Author 7", "type": "Horror", "distance": 1 }
That's a simplified view of the data and somewhat representative of the state of documents after an initial query. Now comes the trick of how to use the aggregation pipeline to get the "nearest" two results for each "type":
db.books.aggregate([
{ "$sort": { "type": 1, "distance": 1 } },
{ "$group": {
"_id": "$type",
"1": {
"$first": {
"_id": "$_id",
"title": "$title",
"author": "$author",
"distance": "$distance"
}
},
"books": {
"$push": {
"_id": "$_id",
"title": "$title",
"author": "$author",
"distance": "$distance"
}
}
}},
{ "$project": {
"1": 1,
"books": {
"$cond": [
{ "$eq": [ { "$size": "$books" }, 1 ] },
{ "$literal": [false] },
"$books"
]
}
}},
{ "$unwind": "$books" },
{ "$project": {
"1": 1,
"books": 1,
"seen": { "$eq": [ "$1", "$books" ] }
}},
{ "$sort": { "_id": 1, "seen": 1 } },
{ "$group": {
"_id": "$_id",
"1": { "$first": "$1" },
"2": { "$first": "$books" },
"books": {
"$push": {
"$cond": [ { "$not": "$seen" }, "$books", false ]
}
}
}},
{ "$project": {
"1": 1,
"2": 2,
"pos": { "$literal": [1,2] }
}},
{ "$unwind": "$pos" },
{ "$group": {
"_id": "$_id",
"books": {
"$push": {
"$cond": [
{ "$eq": [ "$pos", 1 ] },
"$1",
{ "$cond": [
{ "$eq": [ "$pos", 2 ] },
"$2",
false
]}
]
}
}
}},
{ "$unwind": "$books" },
{ "$match": { "books": { "$ne": false } } },
{ "$project": {
"_id": "$books._id",
"title": "$books.title",
"author": "$books.author",
"type": "$_id",
"distance": "$books.distance",
"sortOrder": {
"$add": [
{ "$cond": [ { "$eq": [ "$_id", "Fiction" ] }, 1, 0 ] },
{ "$cond": [ { "$eq": [ "$_id", "Science" ] }, 0, 0 ] },
{ "$cond": [ { "$eq": [ "$_id", "Horror" ] }, 3, 0 ] }
]
}
}},
{ "$sort": { "sortOrder": 1 } }
])
Of course that is just two results, but it outlines the process for getting n results, which naturally is done in generated pipeline code. Before moving onto the code the process deserves a walk through.
After any query, the first thing to do here is $sort the results, and this you want to basically do by both the "grouping key" which is the "type" and by the "distance" so that the "nearest" items are on top.
The reason for this is shown in the $group stages that will repeat. What is done is essentially "popping" the $first result off of each grouping "stack". So that other documents are not lost, they are placed in an array using $push.
Just to be safe, the next stage is really only required after the "first step", but could optionally be added for similar filtering in the repetition. The main check here is that the resulting "array" is larger than just one item. Where it is not, the contents are replaced with a single value of false. The reason for which is about to become evident.
After this "first step" the real repetition cycle beings, where that array is then "de-normalized" with $unwind and then a $project made in order to "match" the document that has been last "seen".
As only one of the documents will match this condition the results are again "sorted" in order to float the "unseen" documents to the top, while of course maintaining the grouping order. The next thing is similar to the first $group step, but where any kept positions are maintained and the "first unseen" document is "popped off the stack" again.
The document that was "seen" is then pushed back to the array not as itself but as a value of false. This is not going to match the kept value and this is generally the way to handle this without being "destructive" to the array contents where you don't want the operations to fail should there not be enough matches to cover the n results required.
Cleaning up when complete, the next "projection" adds an array to the final documents now grouped by "type" representing each position in the n results required. When this array is unwound, the documents can again be grouped back together, but now all in a single array
that possibly contains several false values but is n elements long.
Finally unwind the array again, use $match to filter out the false values, and project to the required document form.
Practicality
The problem as stated earlier is with the number of results being filtered as there is a real limit on the number of results that can be pushed into an array. That is mostly the BSON limit, but you also don't really want 1000's of items even if that is still under the limit.
The trick here is keeping the initial "match" small enough that the "slicing operations" becomes practical. There are some things with the $geoNear pipeline process that can make this a possibility.
The obvious one is limit. By default this is 100, but you clearly want to have something in the range of:
(the number of categories you can possibly match) X ( required matches )
As long as this number is essentially not in the 1000's, there is already some help here.
The others are maxDistance and minDistance, where essentially you put upper and lower bounds on how "far out" to search. The max bound is the general limiter while the min bound is useful when "paging", which is the next helper.
When "upwardly paging", you can use the query argument in order to exclude the _id values of documents "already seen" using the $nin query. In much the same way, the minDistance can be populated with the "last seen" largest distance, or at least the smallest largest distance by "type". This allows some concept of filtering out things that have already been "seen" and getting another page.
Really a topic in itself, but those are the general things to look for in reducing that initial match in order to make the process practical.
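To make that concrete, here is a sketch of a "next page" $geoNear stage using those options (seenIds and lastDistance are assumed to be tracked by your application between pages):
{ "$geoNear": {
    "near": [-118.09771, 33.89244],
    "distanceField": "distance",
    "spherical": true,
    "limit": 300,                              // e.g. categories x required matches
    "minDistance": lastDistance,               // smallest "largest distance" by type so far
    "query": { "_id": { "$nin": seenIds } }    // exclude documents already seen
}}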
Implementing
The general problem of returning "10 results at most, per type" is clearly going to want some code in order to generate the pipeline stages. No-one wants to type that out, and practically speaking you will probably want to change that number at some point.
So now to the code that can generate the monster pipeline. All code is in JavaScript, but easy to translate in principle:
var coords = [-118.09771, 33.89244];
var key = "$type";
var val = {
"_id": "$_id",
"title": "$title",
"author": "$author",
"distance": "$distance"
};
var maxLen = 10;
var stack = [];
var pipe = [];
var fproj = { "$project": { "pos": { "$literal": [] } } };
pipe.push({ "$geoNear": {
"near": coords,
"distanceField": "distance",
"spherical": true
}});
pipe.push({ "$sort": {
"type": 1, "distance": 1
}});
for ( var x = 1; x <= maxLen; x++ ) {
fproj["$project"][""+x] = 1;
fproj["$project"]["pos"]["$literal"].push( x );
var rec = {
"$cond": [ { "$eq": [ "$pos", x ] }, "$"+x ]
};
if ( stack.length == 0 ) {
rec["$cond"].push( false );
} else {
var lval = stack.pop();
rec["$cond"].push( lval );
}
stack.push( rec );
if ( x == 1) {
pipe.push({ "$group": {
"_id": key,
"1": { "$first": val },
"books": { "$push": val }
}});
pipe.push({ "$project": {
"1": 1,
"books": {
"$cond": [
{ "$eq": [ { "$size": "$books" }, 1 ] },
{ "$literal": [false] },
"$books"
]
}
}});
} else {
pipe.push({ "$unwind": "$books" });
var proj = {
"$project": {
"books": 1
}
};
proj["$project"]["seen"] = { "$eq": [ "$"+(x-1), "$books" ] };
var grp = {
"$group": {
"_id": "$_id",
"books": {
"$push": {
"$cond": [ { "$not": "$seen" }, "$books", false ]
}
}
}
};
for ( var n = x; n >= 1; n-- ) {
if ( n != x )
proj["$project"][""+n] = 1;
grp["$group"][""+n] = ( n == x ) ? { "$first": "$books" } : { "$first": "$"+n };
}
pipe.push( proj );
pipe.push({ "$sort": { "_id": 1, "seen": 1 } });
pipe.push(grp);
}
}
pipe.push(fproj);
pipe.push({ "$unwind": "$pos" });
pipe.push({
"$group": {
"_id": "$_id",
"msgs": { "$push": stack[0] }
}
});
pipe.push({ "$unwind": "$books" });
pipe.push({ "$match": { "books": { "$ne": false } }});
pipe.push({
"$project": {
"_id": "$books._id",
"title": "$books.title",
"author": "$books.author",
"type": "$_id",
"distance": "$books",
"sortOrder": {
"$add": [
{ "$cond": [ { "$eq": [ "$_id", "Fiction" ] }, 1, 0 ] },
{ "$cond": [ { "$eq": [ "$_id", "Science" ] }, 0, 0 ] },
{ "$cond": [ { "$eq": [ "$_id", "Horror" ] }, 3, 0 ] },
]
}
}
});
pipe.push({ "$sort": { "sortOrder": 1, "distance": 1 } });
Alternate
Of course the end result here and the general problem with all above is that you really only want the "top 10" of each "type" to return. The aggregation pipeline will do it, but at the cost of keeping more than 10 and then "popping off the stack" until 10 is reached.
An alternate approach is to "brute force" this with mapReduce and "globally scoped" variables. Not as nice, since the results are all in arrays, but it may be a practical approach:
db.collection.mapReduce(
function () {
if ( !stash.hasOwnProperty(this.type) ) {
stash[this.type] = [];
}
if ( stash[this.type].length < maxLen ) {
stash[this.type].push({
"title": this.title,
"author": this.author,
"type": this.type,
"distance": this.distance
});
emit( this.type, 1 );
}
},
function(key,values) {
return 1; // really just want to keep the keys
},
{
"query": {
"location": {
"$nearSphere": [-118.09771, 33.89244]
}
},
"scope": { "stash": {}, "maxLen": 10 },
"finalize": function(key,value) {
return { "msgs": stash[key] };
},
"out": { "inline": 1 }
}
)
This is a real cheat which just uses the "global scope" to keep a single object whose keys are the grouping keys. The results are pushed onto an array in that global object until the maximum length is reached. Results are already sorted by nearest, so the mapper just gives up doing anything with the current document after the 10 are reached per key.
The reducer won't be called since only 1 document per key is emitted. The finalize then just "pulls" the value from the global scope and returns it in the result.
Simple, but of course you don't have all the $geoNear options if you really need them, and this form has a hard limit of 100 documents as the output from the initial query.
This is a classic case for subquery/join which is not supported by MongoDB. All joins and subquery-like operations need to be implemented in the application logic. So multiple queries is your best bet. Performance of the multiple query approach should be good if you have an index on type.
Alternatively you can write a single aggregation query minus the type-matching and limit clauses and then process the stream in your application logic to limit documents per type.
This approach will be low on performance for large result sets because documents may be returned in random order. Your limiting logic will then need to traverse to the entire result set.
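For completeness, a sketch of that application-side limiting (shell JavaScript; the "TypeA" value and the cap of 10 are stand-ins from the question):
var count = 0, results = [];
db.books.find({ /* your conditions */ }).forEach(function(doc) {
    if ( doc.type === "TypeA" && ++count > 10 )
        return;                 // already have 10 of "Type A"; skip
    results.push(doc);          // every other type is unlimited
});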
I guess you can use cursor.limit() on a cursor to specify the maximum number of documents the cursor will return. limit() is analogous to the LIMIT statement in a SQL database.
You must apply limit() to the cursor before retrieving any documents from the database.
The limit function on a cursor can be used to limit the number of records returned by find.
I guess this example should help:
var myCursor = db.bios.find().limit( 5 );

Mongodb aggregation vs client side processing

I have a blogs collection which has almost the following schema:
{
    title: { name: "My First Blog Post",
             postDate: "01-28-11" },
    content: "Here is my super long post ...",
    comments: [ { text: "This post sucks!"
                , name: "seanhess"
                , postDate: "01-28-14" }
              , { text: "I know! I wish it were longer"
                , name: "bob"
                , postDate: "01-28-11" }
              ]
}
I mainly want to run three queries:
Give me all the comments made by only bob
Find all the comments made on the same day the post was written, i.e. comments.postDate = title.postDate.
Find all the comments made by bob on the same day the post is written
My questions are as follows:
These three are going to be really frequent queries, so is it a good idea to use the aggregation framework?
For the third query, I can simply make a query like db.blogs.find({"comments.name": "bob"}, {"comments.name": 1, "comments.postDate": 1, "title.postDate": 1}) and then do client-side post-processing to loop through the returned results. Is that a good idea? I'd like to note that this might return several thousand documents.
I would be happy if you could propose some ways to construct the third query.
It probably is best practice here to break up your multiple questions into several questions, if only because the answer to one question might have led you to understand the others.
I am also not very keen on answering anything where there is no example shown of what you have tried to do. But with that said, and "shooting myself in the foot", the questions are reasonable from a design approach, so I will answer.
Point 1 : Comments by "bob"
Standard $unwind and filter the results. Use $match first so you don't process unneeded documents.
db.collection.aggregate([
// Match to "narrow down" the documents.
{ "$match": { "comments.name": "bob" }},
// Unwind the array
{ "$unwind": "$comments" },
// Match and "filter" just the "bob" comments
{ "$match": { "comments.name": "bob" }},
// Possibly wind back the array
{ "$group": {
"_id": "$_id",
"title": { "$first": "$title" },
"content": { "$first": "$content" },
"comments": { "$push": "$comments" }
}}
])
Point 2: All comments on the same day
db.collection.aggregate([
// Try and match posts within a date or range
// { "$match": { "title.postDate": Date( /* something */ ) }},
// Unwind the array
{ "$unwind": "$comments" },
// Aha! Project out the same day. Not the time-stamp.
{ "$project": {
"title": 1,
"content": 1,
"comments": 1,
"same": { "$eq": [
{
"year" : { "$year": "$title.postDate" },
"month" : { "$month": "$title.postDate" },
"day": { "$dayOfMonth": "$title.postDate" }
},
{
"year" : { "$year": "$comments.postDate" },
"month" : { "$month": "$comments.postDate" },
"day": { "$dayOfMonth": "$comments.postDate" }
}
]}
}},
// Match the things on the "same
{ "$match": { "same": true } },
// Possibly wind back the array
{ "$group": {
"_id": "$_id",
"title": { "$first": "$title" },
"content": { "$first": "$content" },
"comments": { "$push": "$comments" }
}}
])
Point 3: "bob" on the same date
db.collection.aggregate([
// Try and match posts within a date or range
// { "$match": { "title.postDate": Date( /* something */ ) }},
// Unwind the array
{ "$unwind": "$comments" },
// Aha! Project out the same day. Not the time-stamp.
{ "$project": {
"title": 1,
"content": 1,
"comments": 1,
"same": { "$eq": [
{
"year" : { "$year": "$title.postDate" },
"month" : { "$month": "$title.postDate" },
"day": { "$dayOfMonth": "$title.postDate" }
},
{
"year" : { "$year": "$comments.postDate" },
"month" : { "$month": "$comments.postDate" },
"day": { "$dayOfMonth": "$comments.postDate" }
}
]}
}},
// Match the things on the "same" field
{ "$match": { "same": true, "comments.name": "bob" } },
// Possibly wind back the array
{ "$group": {
"_id": "$_id",
"title": { "$first": "$title" },
"content": { "$first": "$content" },
"comments": { "$push": "$comments" }
}}
])
Results
Honestly, and especially if you are using some indexing to feed the initial $match stages of these operations, it should be very clear that this will "run rings" around trying to iterate this in code.
At the very least this reduces the returned records "over the wire", so there is less network traffic. And of course there is less (or nothing) to post process once the query results have been received.
As a general convention, database server hardware tends to be an order of magnitude higher rated in performance than "application server" hardware. So again the general condition is that anything executed on the server will run faster.
Is aggregation the right thing: "Yes", and by a long, long way. You even get a cursor for the results (as of MongoDB 2.6).
How can you do the queries you want: shown above to be pretty simple. And in real-world code we never "hard code" this; we build it dynamically, so adding conditions and attributes should be as simple as the rest of your normal data manipulation code.
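For instance, the "bob on the same day" pipeline above might be assembled like this (a sketch; buildCommentPipeline is a made-up helper name):
function buildCommentPipeline(name) {
    var pipeline = [
        { "$unwind": "$comments" }
        // ... followed by the $project that computes the "same" field, as above ...
    ];
    var match = { "same": true };
    if (name) match["comments.name"] = name;   // only filter by commenter when given
    pipeline.push({ "$match": match });
    return pipeline;
}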
So I would not normally answer this style of question. But say thank you, please?
