I need to find all datasets in my MongoDB with an expired date value. Expired means that the timestamp of the last array element, plus a defined interval (which is determined by a category), is older than the current timestamp.
Every dataset has a field like this:
{
"field" : [
{
"category" : 1,
"date" : ISODate("2019-03-01T12:00:00.464Z")
},
{
"category" : 1,
"date" : ISODate("2019-03-01T14:52:50.464Z")
}
]
}
The category defines a time interval. For example 'category 1' stands for 90 minutes, 'category 2' for 120 minutes.
Now I need to get every dataset with an expired date value, which means the last array element has a date older than 90 minutes before the current timestamp.
Something like
Content.find({ 'field.$.date': { $gt: new Date() } })
But with that attempt I have two problems:
How do I query for the last array element?
How to implement the category time interval in the query?
Let's break down the problem into parts.
Query the "last" ( most recent ) array element
Part 1: Logical and Fast
A quick perusal of MongoDB query operators related to arrays should tell you that you can in fact always query an array element based on the index position. This is very simple to do for the "first" array element since that position is always 0:
{ "field.0.date": { "$lt": new Date("2019-03-01T10:30:00.464Z") } }
Logically the "last" position would be -1, but you cannot actually use that value in notation of this form with MongoDB as it would be considered invalid.
However what you can do here instead is add new items to the array in a way so that rather than appending to the end of the array, you actually prepend to the beginning of the array. This means your array content is essentially "reversed" and it's then easy to access as shown above. This is what the $position modifier to $push does for you:
collection.updateOne(
{ "_id": documentId },
{
"$push": {
"field": {
"$each": [{ "category": 1, "date": new Date("2019-03-02") }],
"$position": 0
}
}
}
)
So that means newly added items go to the beginning rather than the end. That may be practical but it does mean you would need to re-order all your existing array items.
In the case where the "date" is static and basically never changes once you write the array item ( i.e. you never update the date for a matched array item ) then you can actually re-order the existing array, sorting on that "date" property, in a single update statement using the $sort modifier:
collection.updateMany(
{},
{ "$push": { "field": { "$each": [], "$sort": { "date": -1 } } } }
)
Whilst it might feel "odd" to use $push when you are not actually adding anything to the array, this is where the $sort modifier lives. The empty "$each": [] argument essentially means "add nothing", yet the $sort applies to all current members of the array.
This could optionally be done much like the earlier example with $position, in which case the $sort would be applied on every write, as sketched below. However as long as the "date" reflects the "timestamp when added" ( as I suspect it does ) then it's probably more efficient to use the "$position": 0 approach instead of sorting every time something changes. It depends on your actual implementation and how you otherwise work with the data.
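For illustration, a minimal sketch of that per-write variant, where the new item shown is hypothetical:
collection.updateOne(
  { "_id": documentId },
  {
    "$push": {
      "field": {
        // re-sorts the whole array on every write
        "$each": [{ "category": 2, "date": new Date() }],
        "$sort": { "date": -1 }
      }
    }
  }
)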
Part 2: Brute force, and slow
If however for whatever reason you really don't believe that being able to "reverse" the content of the array is a practical solution, then the only other available thing is to effectively "calculate" the "last" array element by projecting this value from a supported operator.
The only practical way to do that is typically with the Aggregation Framework and specifically the $arrayElemAt operator:
collection.aggregate([
{ "$addFields": {
"lastDate": { "$arrayElemAt": [ "$field.date", -1 ] }
}}
])
Basically that is just going to look at the supplied array content ( in this case just the "date" property values for each element ) and then extract the value at the given index position. This operator allows the -1 index notation, meaning the "last" element in the array.
Clearly this is not ideal, as the extraction is decoupled from the actual expression needed to query or filter the values. That comes in the next part, but you need to realize this iterates through your whole collection before we can even look at comparing the values to see which documents you want to keep.
Sample Date by "Category"
Part 1: Fast query logic
Following on from the above the next criteria is based on the "category" field value, with the next main issues being
90 minutes adjust for value 1
120 minutes adjust for value 2
By the same logic just learned you should conclude that "calculating" as you process data is "bad news" for performance. So the trick to apply here is basically including the logic in the query expression to use different supplied "date" values depending on what the "category" value being matched in the document is.
The most simple application of this is with an $or expression:
var currentDateTime = new Date();
var ninetyMinsBefore = new Date(currentDateTime.valueOf() - (1000 * 60 * 90));
var oneTwentyMinsBefore = new Date(currentDateTime.valueOf() - (1000 * 60 * 120));
collection.find({
"$or": [
{
"field.0.category": 1,
"field.0.date": { "$lt": ninetyMinsBefore }
},
{
"field.0.category": 2,
"field.0.date": { "$lt": oneTwentyMinsBefore }
}
]
})
Note here that instead of adjusting each stored "date" by the variable interval and comparing that to the current date, you instead calculate the cutoff dates back from the current date and then conditionally apply the right one depending on the matched "category" value.
This is the fast and efficient way, since you were able to re-order the array items as described above, and we can then apply the conditions to see whether that "first" element meets them.
Part 2: Slower forced calculation
collection.aggregate([
{ "$addFields": {
"lastDate": {
"$arrayElemAt": [ "$field.date", -1 ]
},
"lastCategory": {
"$arrayElemAt": [ "$field.category", -1 ]
}
}},
{ "$match": {
"$or": [
{ "lastCategory": 1, "lastDate": { "$lt": ninetyMinsBefore } },
{ "lastCategory": 2, "lastDate": { "$lt": oneTwentyMinsBefore } }
]
}}
])
Same basic premise: even though you already needed to project values from the "last" array elements, there is still no need to adjust the stored "date" values with math, which would just complicate things further.
The $addFields projection over every document is the main cost here, and the $match at the bottom compounds it because it can only run after those values are computed and cannot use an index.
You could optionally use $expr with modern MongoDB releases, but it's basically the same thing:
collection.find({
"$expr": {
"$or": [
{
"$and": [
{ "$eq": [ { "$arrayElemAt": [ "$field.category", -1 ] }, 1 ] },
{ "$lt": [ { "$arrayElemAt": [ "$field.date", -1 ] }, ninetyMinsBefore ] }
]
},
{
"$and": [
{ "$eq": [ { "$arrayElemAt": [ "$field.category", -1 ] }, 2 ] },
{ "$lt": [ { "$arrayElemAt": [ "$field.date", -1 ] }, oneTwentyMinsBefore ] }
]
}
]
}
})
Worth noting the special "aggregation" forms of $or and $and since everything within $expr is an aggregation expression that needs to resolve to a Boolean value of true/false.
Either way it's all just the same trade-off: the initial "query only" examples are natively processed and can indeed use an index to speed up matching and results. None of these "aggregation expressions" can do that, and thus they run considerably slower.
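As a hedged sketch, a compound index supporting that fast form ( assuming the "$position": 0 write pattern above, so position 0 is always the most recent element ) might look like:
// Assumption: items are prepended, so index the 0 position properties
collection.createIndex({ "field.0.category": 1, "field.0.date": 1 })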
NOTE: If you are storing "date" with the purpose of meaning "expired" as the ones you want to select then it is "less than" the current date ( minus the interval ) rather than "greater than" as you presented in your question.
This means you take the current time and subtract the interval ( instead of adding to the stored time ), so that the computed cutoff is the "greater" value in the selection, and anything before it has therefore "expired".
N.B Normally when you query for array elements with documents matching multiple properties you would use the $elemMatch operator in order for those multiple conditions to apply to that specific array element.
The only reason that does not apply here is because of the use of the explicit numeric index for the 0 position on each property. This means that rather than applying over the entire array ( like "field.date" would ), the conditions apply specifically to the 0 position only.
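For contrast, a sketch of the $elemMatch form, which would match the conditions against any array element rather than just position 0:
// Hedged sketch: matches documents where ANY element satisfies both
// conditions, not specifically the first/most recent one
collection.find({
  "field": {
    "$elemMatch": {
      "category": 1,
      "date": { "$lt": ninetyMinsBefore }
    }
  }
})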
I have a dataset of records stored in MongoDB and I have been trying to extract a complex set of data from the records.
Sample records are as follows:
{
bookId : '135wfkjdbv',
type : 'a',
store : 'crossword',
shelf : 'A1'
}
{
bookId : '13erjfn',
type : 'b',
store : 'crossword',
shelf : 'A2'
}
I have been trying to extract data such that for each bookId, I get a count (of records) for each shelf per store name that holds the book identified by bookId, where the type of the book is 'a'.
I understand that the aggregation query allows a pipeline that allows grouping, matching etc, but I have not been able to reach a solution.
The desired output is of the form:
{
bookId : '135wfkjdbv',
stores : [
{
name : 'crossword',
shelves : [
{
name : 'A1',
count : 12
},
]
},
{
name : 'granth',
shelves : [
{
name : 'C2',
count : 12
},
{
name : 'C4',
count : 12
},
]
}
]
}
The process isn't really that difficult when you look at it. The aggregation "pipeline" is exactly that, where each "stage" feeds a result into the next for processing. Just like a unix "pipe":
ps -ef | grep mongo | tee out.txt
So it's just adding stages, and in fact three $group stages where the first does the basic aggregation and the remaining two simply "roll up" the arrays required in the output.
db.collection.aggregate([
{ "$group": {
"_id": {
"bookId": "$bookId",
"store": "$store",
"shelf": "$shelf"
},
"count": { "$sum": 1 }
}},
{ "$group": {
"_id": {
"bookId": "$_id.bookId",
"store": "$_id.store"
},
"shelves": {
"$push": {
"name": "$_id.shelf",
"count": "$count"
}
}
}},
{ "$group": {
"_id": "$_id.bookId",
"stores": {
"$push": {
"name": "$_id.store",
"shelves": "$shelves"
}
}
}}
])
You could possibly $project at the end to change the _id to bookId, but you should already know that is what it is, and get used to treating _id as a primary key. There is a cost to such operations, so it is a habit you should not get into; learn to do things correctly from the start.
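If you did want that cosmetic rename despite the cost, a minimal sketch of the optional final stage would be:
// Hedged sketch: purely cosmetic rename of _id, at extra cost
{ "$project": {
  "_id": 0,
  "bookId": "$_id",
  "stores": 1
}}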
So all that really happens here is that all the fields which make up the grouping detail become the primary key of $group, with the other field being produced as count, counting the shelves within that grouping. Think of the SQL equivalent:
GROUP BY bookId, store, shelf
All each subsequent stage does is transpose each grouping level into array entries, first the shelf within the store and then the store within the bookId. Each time, the fields in the primary grouping key are reduced down by the content going into the produced array.
When you start thinking in terms of "pipeline" processing, then it becomes clear: you construct one form, take that output and move it to the next form, and so on. This is basically how you fold the results into the two arrays.
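To make that "fold" concrete, after the second $group ( and before the final stage ) each document would look roughly like this, with hypothetical values:
{
  "_id" : { "bookId" : "135wfkjdbv", "store" : "crossword" },
  "shelves" : [
    { "name" : "A1", "count" : 12 }
  ]
}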
I have a log collection in MongoDB that has a structure that looks like this:
{
url : "http://example.com",
query : "name=blah,;another_param=bleh",
count : 5
}
where the "query" field is the query parameters in the requested url.
I want to compute a total of count grouped by the query parameter "name". For example, for this collection:
[{
url : "http://example.com",
query : "name=blah,;another_param=bleh",
count : 3
},
{
url : "http://example.com",
query : "name=blah,;another_param=xyz",
count : 4
},
{
url : "http://example.com",
query : "name=another_name,;another_param=bleh",
count : 3
}]
I need this output:
[{
key : "blah",
count : 7
},
{
key : "another_name",
count : 3
}]
It doesn't look like I can do this string manipulation using the aggregation framework. I can do this via map-reduce, but can a map-reduce operation be part of the aggregation pipeline?
The aggregation framework does not have the string manipulation operators necessary to dissect the string content and break this up into the key/value pairs you need for this operation. The only string manipulation currently available is $substr, which is not going to help unless you are dealing with fixed length data.
So the only server side way to do this at present is with mapReduce, since you can just use the JavaScript functions available to do the right manipulation. Something like this:
For the mapper:
function() {
var obj = {};
this.query.split(/,;/).forEach(function(item) {
var temp = item.split(/=/);
obj[temp[0]] = temp[1];
});
if (obj.hasOwnProperty('name'))
emit(obj.name,this.count);
}
And the reducer:
function(key,values) {
return Array.sum( values );
}
Which is the basic structure of the JavaScript functions required to split out the "name" parameters and use them as the "keys" for aggregation, or general counting of the "key" occurrences.
So the aggregation framework cannot execute any JavaScript itself, as it just runs native code operators over the data.
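For completeness, a hedged sketch of invoking those functions with mapReduce, assuming the mapper and reducer above are assigned to variables and the collection is named "logs":
// Hedged sketch: the collection name "logs" is an assumption
var mapper = function() { /* mapper function from above */ };
var reducer = function(key, values) { return Array.sum(values); };
db.logs.mapReduce(mapper, reducer, { "out": { "inline": 1 } })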
It would be a good idea though to look at changing how your data is stored, so that the elements are broken down into an "object" representation rather than a string when the documents are inserted into MongoDB. This allows native query forms that don't rely on JavaScript execution to manipulate the data:
[{
"url": "http://example.com",
"query": {
"name": "blah",
"another_param": "bleh"
},
"count": 3
},
{
"url": "http://example.com",
"query": {
"name": "blah",
"another_param": "xyz"
},
"count": 4
},
{
"url": "http://example.com",
"query": {
"name": "another_name",
"another_param": "bleh"
},
"count": 3
}]
This makes a $group pipeline stage quite simple as the data is now organized in a form that can be natively processed:
{ "$match": { "query.name": { "$exists": true } },
{ "$group": {
"_id": "$query.name",
"count": { "$sum": "$count" }
}}
So use mapReduce for now, but ultimately consider changing your recording of the data to split the "tokens" from the query string and represent this as structured data, optionally keeping the original string in another field.
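A minimal sketch of that insert-time parsing, assuming the same ",;" delimited format shown in the question and a hypothetical "logs" collection:
// Hedged sketch: split the query string into structured data at insert
// time, keeping the original string in another field
var raw = "name=blah,;another_param=bleh";
var parsed = {};
raw.split(/,;/).forEach(function(item) {
  var temp = item.split(/=/);
  parsed[temp[0]] = temp[1];
});
db.logs.insertOne({
  "url": "http://example.com",
  "query": parsed,       // structured form for native querying
  "queryString": raw,    // optional: retain the original string
  "count": 3
});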
The aggregation framework will process this much faster than mapReduce can, so this would be the better ongoing option.
I have documents like this:
{
"_id" : ObjectId("53bcedc39c837bba3e1bf1c2"),
id : "abc1",
someArray: [ 1 , 10 , 11]
}
{
"_id" : ObjectId("53bcedc39c837bba3e1bf1c4"),
id : "abc1",
someArray: [ 1 , 10]
}
... other similar documents with different Ids
I would like to go through the whole collection and delete the document where someArray is smallest, grouped by id. So in this example, I group by abc1 (and I get 2 documents), and then the 2nd document would be the one to delete because it has the smallest count in someArray.
There isn't a $count accumulator so I don't see how I can use $group.
Additionally, there will be 1000s of Ids that have duplicates like this, so if there is such a thing as a bulk check/delete that would be good (possibly a stupid question, sorry, Mongo is all new to me!)
Removing "duplicates" is a process here and there is no simple way to both "identify" the dupliciates and "remove" them as a single statement. Another particular here is that query forms cannot "typically" determine the size of an array, and certainly cannot sort by that where it is not already present in the document.
All cases basically come down to
Identifying the list of documents that are "duplicates", and then ideally pinpointing the particular document you want to delete, or more to the point the document you "don't" want to delete from the possible duplicates.
Processing that list to actually perform the deletes.
With that in mind you hopefully have a modern MongoDB of 2.6 version or greater where you can obtain a cursor from the aggregate method. You also want the Bulk Operations API available in these versions for optimal speed:
var bulk = db.collection.initializeOrderedBulkOp();
var counter = 0;
db.collection.aggregate([
{ "$project": {
"id": 1,
"size": { "$size": "$someArray" }
}},
{ "$sort": { "id": 1, "size": -1 } },
{ "$group": {
"_id": "$id",
"docId": { "$first": "$_id" }
}}
]).forEach(function(doc) {
bulk.find({ "id": doc._id, "_id": { "$ne": doc.docId }).remove();
counter++;
// Send to server once every 1000 statements only
if ( counter % 1000 == 0 ) {
bulk.execute();
bulk = db.collection.initializeOrderedBulkOp(); // need to reset
}
});
// Clean up results that did not round to 1000
if ( counter % 1000 != 0 )
bulk.execute();
You can still do much the same thing with older versions of MongoDB, but the result from .aggregate() must be under 16MB which is the BSON limit. That still should be a lot, but with older versions you could also output to a collection with mapReduce.
But for the general aggregation response, you get an array of results, and you also don't have the other convenience methods for finding the size of the array. So a little more work:
var result = db.collection.aggregate([
{ "$unwind": "$someArray" },
{ "$group": {
"_id": "$id",
"id": { "$first": "$id" },
"size": { "$sum": 1 }
}},
{ "$sort": { "id": 1, "size": -1 } },
{ "$group": {
"_id": "$id",
"docId": { "$first": "$_id" }
}}
]);
result.result.forEach(function(doc) {
db.collection.remove({ "id": doc._id, "_id": { "$ne": doc.docId } });
});
So there is no cursor for large results and no bulk operations, meaning every single "remove" needs to be sent to the server individually.
So in MongoDB there are no "sub-queries", and even where there are more than two duplicates there is no single statement to single out the document you don't want to remove from the others. But this is the general way to do it.
Just as a note, if the "size" of arrays is something important to you for a purpose such as "sorting", then your best approach is to maintain that "size" as another property of your document, so it makes those operations easier without needing to "calculate" it as is done here.
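A minimal sketch of maintaining such a counter in step with the array, where the field name "someArraySize" is a hypothetical choice:
// Hedged sketch: increment a stored length counter on every push
db.collection.update(
  { "_id": documentId },
  {
    "$push": { "someArray": 11 },
    "$inc": { "someArraySize": 1 }
  }
)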
I know this question has been asked a lot of times, but I'm kinda new to Mongo and Mongoose as well and I couldn't figure it out!
My problem:
I have a schema which looks like this:
var rankingSchema = new Schema({
userId : { type : Schema.Types.ObjectId, ref:'User' },
pontos : {type: Number, default:0},
placarExato : {type: Number, default:0},
golVencedor : {type: Number, default:0},
golPerdedor : {type: Number, default:0},
diferencaVencPerd : {type: Number, default:0},
empateNaoExato : {type: Number, default:0},
timeVencedor : {type: Number, default:0},
resumo : [{
partida : { type : Schema.Types.ObjectId, ref:'Partida' },
palpite : [Number],
quesito : String
}]
});
Which would return a document like this:
{
"_id" : ObjectId("539d0756f0ccd69ac5dd61fa"),
"diferencaVencPerd" : 0,
"empateNaoExato" : 0,
"golPerdedor" : 0,
"golVencedor" : 1,
"placarExato" : 2,
"pontos" : 78,
"resumo" : [
{
"partida" : ObjectId("5387d991d69197902ae27586"),
"_id" : ObjectId("539d07eb06b1e60000c19c18"),
"palpite" : [
2,
0
]
},
{
"partida" : ObjectId("5387da7b27f54fb425502918"),
"quesito" : "golsVencedor",
"_id" : ObjectId("539d07eb06b1e60000c19c1a"),
"palpite" : [
3,
0
]
},
{
"partida" : ObjectId("5387dc012752ff402a0a7882"),
"quesito" : "timeVencedor",
"_id" : ObjectId("539d07eb06b1e60000c19c1c"),
"palpite" : [
2,
1
]
},
{
"partida" : ObjectId("5387dc112752ff402a0a7883"),
"_id" : ObjectId("539d07eb06b1e60000c19c1e"),
"palpite" : [
1,
1
]
},
{
"partida" : ObjectId("53880ea52752ff402a0a7886"),
"quesito" : "placarExato",
"_id" : ObjectId("539d07eb06b1e60000c19c20"),
"palpite" : [
1,
2
]
},
{
"partida" : ObjectId("53880eae2752ff402a0a7887"),
"quesito" : "placarExato",
"_id" : ObjectId("539d0aa82fb219000054c84f"),
"palpite" : [
2,
1
]
}
],
"timeVencedor" : 1,
"userId" : ObjectId("539b2f2930de100000d7356c")
}
My question is, first: how can I filter the resumo nested documents by quesito? Is it possible to paginate this result, since this array is going to keep growing? And last question: is this a nice approach for this case?
Thank you guys!
As noted, your schema implies that you actually have embedded data even though you are storing an external reference. So it is not clear if you are doing both embedding and referencing or simply embedding by itself.
The big caveat here is the difference between matching a "document" and actually filtering the contents of an array. Since you seem to be talking about "paging" your array results, the large focus here is on doing that, but still making mention of the warnings.
Multiple "filtered" matches in an array requires the aggregation framework. You can generally "project" the single match of an array element, but this is needed where you expect more than one:
Ranking.aggregate(
[
// This match finds "documents" that "contain" the match
{ "$match": { "resumo.quesito": "value" } },
// Unwind de-normalizes arrays as documents
{ "$unwind": "$resumo" },
// This match actually filters those document matches
{ "$match": { "resumo.quesito": "value" } },
// Skip and limit for paging, which really only makes sense on single
// document matches
{ "$skip": 0 },
{ "$limit": 2 },
// Return as an array in the original document if you really want
{ "$group": {
"_id": "$_id",
"otherField": { "$first": "$otherField" },
"resumo": { "$push": "$resumo" }
}}
],
function(err,results) {
}
)
Or the MongoDB 2.6 way by "filtering" inside a $project using the $map operator. But still you need to $unwind in order to "page" array positions, but there is possibly less processing as the array is "filtered" first:
Ranking.aggregate(
[
// This match finds "documents" that "contain" the match
{ "$match": { "resumo.quesito": "value" } },
// Filter with $map
{ "$project": {
"otherField": 1,
"resumo": {
"$setDifference": [
{
"$map": {
"input": "$resumo",
"as": "el",
"in": { "$eq": ["$$el.questio", "value" ] }
}
},
[false]
]
}
}},
// Unwind de-normalizes arrays as documents
{ "$unwind": "$resumo" },
// Skip and limit for paging, which really only makes sense on single
// document matches
{ "$skip": 0 },
{ "$limit": 2 },
// Return as an array in the original document if you really want
{ "$group": {
"_id": "$_id",
"otherField": { "$first": "$otherField" },
"resumo": { "$push": "$resumo" }
}}
],
function(err,results) {
}
)
The inner usage of $skip and $limit here really only makes sense when you are processing a single document and just "filtering" and "paging" the array. It is possible to do this with multiple documents, but it is very involved as there is no way to just "slice" the array. Which brings us to the next point.
Really with embedded arrays, for paging that does not require any filtering you just use the $slice operator, which was designed for this purpose:
Ranking.find({},{ "resumo": { "$slice": [0,2] } },function(err,docs) {
});
Your alternative though is to simply reference the documents in the external collection and then pass arguments to mongoose .populate() to filter and "page" the results. The change in the schema itself would just be:
"resumo": [{ "type": "Schema.Types.ObjectId", "ref": "Partida" }]
With the external referenced collection now holding the object detail rather than embedding directly in the array. The use of .populate() with filtering and paging is:
Ranking.find().populate({
"path": "resumo",
"match": { "questio": "value" },
"options": { "skip": 0, "limit": 2 }
}).exec(function(err,docs) {
docs = docs.filter(function(doc) {
return doc.resumo.length;
});
});
Of course the possible problem there is that you can no longer actually query for the documents that contain the "embedded" information as it is now in another collection. This results in pulling in all documents, though possibly by some other query condition, but then manually testing them to see if they were "populated" by the filtered query that was sent to retrieve those items.
So it really does depend on what you are doing and what your approach is. If you regularly intend to "search" on inner arrays, then embedding will generally suit you better. Also, if you are really only interested in "paging", then the $slice operator works well for this purpose with embedded documents. But beware of growing embedded arrays too large.
Using a referenced schema with mongoose helps with some size concerns, and there is methodology in place to assist with "paging" results and filtering them as well. The drawback is that you can no longer query "inside" those elements from the parent itself. So parent selection by the inner elements is not well suited here. Also keep in mind that while not all of the data is embedded, there is still the reference to the _id value of the external document. So you can still end up with large arrays, which may not be desirable.
For anything large, consider that you will likely be doing the work yourself, and working backwards from the "child" items to then match the parent(s).
I am not sure that you can filter sub-documents directly with mongoose. However you can get the parent document with Model.find({'resumo.quesito': 'THEVALUE'}) (you should also add an index on it).
Then when you have the parent you can get the children by comparing the quesito client-side, as sketched below.
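A minimal sketch of that client-side filtering, assuming the model and field names from the question:
// Hedged sketch: fetch matching parents, then filter the embedded
// array in JavaScript
Ranking.find({ 'resumo.quesito': 'THEVALUE' }, function(err, docs) {
  docs.forEach(function(doc) {
    var matched = doc.resumo.filter(function(r) {
      return r.quesito === 'THEVALUE';
    });
    // work with "matched" here
  });
});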
Additional documentation can be found here: http://mongoosejs.com/docs/subdocs.html
I am wondering if there is a way to get a MongoDB document's age in hours. Here's what I have so far, but obviously since I'm using a date object it is not calculating hours and not giving an age, just the date it was created, so it is not giving the desired result. In fact, the $divide operator does not even allow for date objects. Any help is appreciated. Just as an FYI, the $views variable is a NumberInt32 type and the $created_at variable is a timestamp, like: 2014-05-20T00:01:08.629Z.
db.projects.aggregate({
$project: {
"result": {
$divide: ['$views', '$created_at']
}
}
})
If you're wondering, this code is to help sort popular posts, but of course, it's only part of it. Thanks for any help!
Presuming that $views and $created_at are fields in your document containing a number of views and the created timestamp as follows:
{
"_id" : ObjectId("537abe5e8da9877dbb0ef604"),
"views" : 5,
"created_at" : ISODate("2014-05-20T00:00:00Z")
}
Then just a little date math getting the difference from the current time should do:
db.projects.aggregate([
{ "$project": {
"score": { "$divide": [
{ "$divide": [
{ "$subtract": [
new Date(),
"$created_at"
]},
1000*60*60
]},
"$views"
]},
"created_at": 1,
"views": 1
}}
])
So you are basically getting the difference between the current time and the created_at value, which comes out as a number of milliseconds. Dividing that by the number of milliseconds in an hour, then dividing by your views, gives your "score" result for sorting.
When you do math operations with two date objects, the result is returned as a number, so further operations with just numbers will work from then on.
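Since the stated goal is sorting popular posts, a hedged sketch of appending a sort on that computed score ( lower meaning fewer hours of age per view under this formula, so ascending order puts the most popular first ):
db.projects.aggregate([
  { "$project": {
    "score": { "$divide": [
      { "$divide": [
        { "$subtract": [ new Date(), "$created_at" ] },
        1000*60*60
      ]},
      "$views"
    ]},
    "created_at": 1,
    "views": 1
  }},
  // ascending: smallest hours-per-view (most popular) first
  { "$sort": { "score": 1 } }
])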