My data looks like this:
{
"foo_list": [
{
"id": "98aa4987-d812-4aba-ac20-92d1079f87b2",
"name": "Foo 1",
"slug": "foo-1"
},
{
"id": "98aa4987-d812-4aba-ac20-92d1079f87b2",
"name": "Foo 1",
"slug": "foo-1"
},
{
"id": "157569ec-abab-4bfb-b732-55e9c8f4a57d",
"name": "Foo 3",
"slug": "foo-3"
}
]
}
Where foo_list is a field in a model called Bar. Notice that the first and second objects in the array are complete duplicates.
Aside from the obvious solution of switching to PostgreSQL, what MongoDB query can I run to remove duplicate entries from foo_list?
Similar answers that do not quite cut it:
https://stackoverflow.com/a/16907596/432
https://stackoverflow.com/a/18804460/432
These answers would work if the array held bare strings. However, in my situation the array is filled with objects.
I hope it is clear that I am not interested in merely querying the database; I want the duplicates to be gone from the database forever.
Purely from an aggregation framework point of view there are a few approaches to this.
You can either just apply $setUnion in modern releases:
db.collection.aggregate([
{ "$project": {
"foo_list": { "$setUnion": [ "$foo_list", "$foo_list" ] }
}}
])
Or more traditionally with $unwind and $addToSet:
db.collection.aggregate([
{ "$unwind": "$foo_list" },
{ "$group": {
"_id": "$_id",
"foo_list": { "$addToSet": "$foo_list" }
}}
])
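Both pipelines above treat each array element as a whole value, so two objects count as duplicates only when every field matches. As a rough in-memory sketch of that idea (plain JavaScript, not a MongoDB call; `dedupeObjects` is a hypothetical helper, and `JSON.stringify` is used here as an assumed stand-in for document equality):

```javascript
// In-memory sketch only: mimics whole-object de-duplication on a plain array.
// dedupeObjects is a hypothetical helper, not part of any MongoDB API.
function dedupeObjects(list) {
  const seen = new Set();
  const result = [];
  for (const item of list) {
    // JSON.stringify is sensitive to key order, so objects with the same
    // fields in a different order are treated as different values here.
    const key = JSON.stringify(item);
    if (!seen.has(key)) {
      seen.add(key);
      result.push(item);
    }
  }
  return result;
}

const fooList = [
  { id: "98aa4987-d812-4aba-ac20-92d1079f87b2", name: "Foo 1", slug: "foo-1" },
  { id: "98aa4987-d812-4aba-ac20-92d1079f87b2", name: "Foo 1", slug: "foo-1" },
  { id: "157569ec-abab-4bfb-b732-55e9c8f4a57d", name: "Foo 3", slug: "foo-3" }
];

console.log(dedupeObjects(fooList).length); // 2
```

Unlike this sketch, which keeps first-seen order, set operators such as $setUnion do not promise any particular order for the resulting array.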
Or if you were just interested in the duplicates only then by general grouping:
db.collection.aggregate([
{ "$unwind": "$foo_list" },
{ "$group": {
"_id": {
"_id": "$_id",
"foo_list": "$foo_list"
},
"count": { "$sum": 1 }
}},
{ "$match": { "count": { "$ne": 1 } } },
{ "$group": {
"_id": "$_id._id",
"foo_list": { "$push": "$_id.foo_list" }
}}
])
The last form could be useful to you if you actually want to "remove" the duplicates from your data with another update statement as it identifies the elements which are duplicates.
So in that last form the returned result from your sample data identifies the duplicate:
{
"_id" : ObjectId("53f5f7314ffa9b02cf01c076"),
"foo_list" : [
{
"id" : "98aa4987-d812-4aba-ac20-92d1079f87b2",
"name" : "Foo 1",
"slug" : "foo-1"
}
]
}
One result is returned per document that contains duplicate entries in the array, listing which entries are duplicated. This is the information you need for the update: loop over the results and use each one to build the update statements that remove the duplicates.
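The grouping logic of that pipeline can be sketched on a single in-memory array: count each distinct element, then keep the ones whose count is not 1. This is only an illustration in plain JavaScript; `findDuplicates` is a hypothetical name, and comparing via `JSON.stringify` is an assumption standing in for the $group key comparison.

```javascript
// In-memory sketch of "$unwind, $group with a count, $match count != 1".
// findDuplicates is a hypothetical helper, not a MongoDB API.
function findDuplicates(list) {
  const counts = new Map();
  for (const item of list) {
    const key = JSON.stringify(item); // assumed stand-in for the $group key
    const entry = counts.get(key) || { item: item, count: 0 };
    entry.count += 1;
    counts.set(key, entry);
  }
  // Keep one representative of every element seen more than once.
  return Array.from(counts.values())
    .filter(entry => entry.count !== 1)
    .map(entry => entry.item);
}

const sampleList = [
  { id: "98aa4987-d812-4aba-ac20-92d1079f87b2", name: "Foo 1", slug: "foo-1" },
  { id: "98aa4987-d812-4aba-ac20-92d1079f87b2", name: "Foo 1", slug: "foo-1" },
  { id: "157569ec-abab-4bfb-b732-55e9c8f4a57d", name: "Foo 3", slug: "foo-3" }
];

console.log(findDuplicates(sampleList).map(d => d.name)); // [ 'Foo 1' ]
```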
This is actually done with two update statements per document, as a simple $pull operation would remove "both" items, which is not what you want:
var cursor = db.collection.aggregate([
{ "$unwind": "$foo_list" },
{ "$group": {
"_id": {
"_id": "$_id",
"foo_list": "$foo_list"
},
"count": { "$sum": 1 }
}},
{ "$match": { "count": { "$ne": 1 } } },
{ "$group": {
"_id": "$_id._id",
"foo_list": { "$push": "$_id.foo_list" }
}}
])
var batch = db.collection.initializeOrderedBulkOp();
var count = 0;
cursor.forEach(function(doc) {
doc.foo_list.forEach(function(dup) {
batch.find({ "_id": doc._id, "foo_list": { "$elemMatch": dup } }).updateOne({
"$unset": { "foo_list.$": "" }
});
batch.find({ "_id": doc._id }).updateOne({
"$pull": { "foo_list": null }
});
});
count++;
if ( count % 500 == 0 ) {
batch.execute();
batch = db.collection.initializeOrderedBulkOp();
}
});
if ( count % 500 != 0 ) {
batch.execute();
}
That's the modern MongoDB 2.6 and above way to do it, with a cursor result from aggregation and Bulk operations for updates. But the principles remain the same:
Identify the duplicates in documents
Loop the results to issue the updates to the affected documents
Use $unset with the positional $ operator to set the "first" matched array element to null
Use $pull to remove the null entry from the array
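The last two steps can be mimicked on a plain array to see why one copy survives. This is a sketch under the assumption that the positional update only touches the first matching element; `removeOneCopy` is a hypothetical helper, not a MongoDB operation.

```javascript
// In-memory sketch of the "$unset with positional $, then $pull null" pair.
// removeOneCopy is a hypothetical helper, not part of any MongoDB API.
function removeOneCopy(list, dup) {
  const key = JSON.stringify(dup);
  const copy = list.slice();
  // Step 1: like $unset on "foo_list.$", null out only the FIRST match.
  const index = copy.findIndex(item => item !== null && JSON.stringify(item) === key);
  if (index !== -1) {
    copy[index] = null;
  }
  // Step 2: like $pull with null, drop the entries that are now null.
  return copy.filter(item => item !== null);
}

const before = [
  { id: "98aa4987-d812-4aba-ac20-92d1079f87b2", name: "Foo 1", slug: "foo-1" },
  { id: "98aa4987-d812-4aba-ac20-92d1079f87b2", name: "Foo 1", slug: "foo-1" },
  { id: "157569ec-abab-4bfb-b732-55e9c8f4a57d", name: "Foo 3", slug: "foo-3" }
];

const after = removeOneCopy(before, before[0]);
console.log(after.map(item => item.name)); // [ 'Foo 1', 'Foo 3' ]
```

In this sketch, an element that occurs three times would still have two copies left after one pass, so such data would need the pass repeated.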
So after processing the above operations your sample now looks like this:
{
"_id" : ObjectId("53f5f7314ffa9b02cf01c076"),
"foo_list" : [
{
"id" : "98aa4987-d812-4aba-ac20-92d1079f87b2",
"name" : "Foo 1",
"slug" : "foo-1"
},
{
"id" : "157569ec-abab-4bfb-b732-55e9c8f4a57d",
"name" : "Foo 3",
"slug" : "foo-3"
}
]
}
The duplicate is removed with the "duplicated" item still intact. That is how you process to identify and remove the duplicate data from your collection.
I'm using Mongoose in a Node.js backend and I need to update a subset of elements of an array within a document based on a condition. I used to perform the operations using save(), like this:
const channel = await Channel.findById(id);
channel.messages.forEach((i) =>
i._id.toString() === messageId && i.views < channel.counter
? i.views++
: null
);
await channel.save();
I'd like to change this code by using findByIdAndUpdate since it is only an increment and for my use case, there isn't the need of retrieving the document. Any suggestion on how I can perform the operation?
Of course, channel.messages is the array under discussion. views and counter are both of type Number.
EDIT - Example document:
{
"_id": {
"$oid": "61546b9c86a9fc19ac643924"
},
"counter": 0,
"name": "#TEST",
"messages": [{
"views": 0,
"_id": {
"$oid": "61546bc386a9fc19ac64392e"
},
"body": "test",
"sentDate": {
"$date": "2021-09-29T13:36:03.092Z"
}
}, {
"views": 0,
"_id": {
"$oid": "61546dc086a9fc19ac643934"
},
"body": "test",
"sentDate": {
"$date": "2021-09-29T13:44:32.382Z"
}
}],
"date": {
"$date": "2021-09-29T13:35:33.011Z"
},
"__v": 2
}
You can use the updateOne method if you don't need the document returned in the result:
Match on both the document id and the messageId
Use an $expr condition: $filter iterates over the messages array and returns the elements whose _id equals messageId and whose views is less than counter, and $ne checks that this result is not empty
Use $inc with the positional $ operator to increment views by 1 when the query matches
messageId = mongoose.Types.ObjectId(messageId);
await Channel.updateOne(
{
_id: id,
"messages._id": messageId,
$expr: {
$ne: [
{
$filter: {
input: "$messages",
cond: {
$and: [
{ $eq: ["$$this._id", messageId] },
{ $lt: ["$$this.views", "$counter"] }
]
}
}
},
[]
]
}
},
{ $inc: { "messages.$.views": 1 } }
)
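To make the condition easier to reason about, here is the same rule applied to an in-memory object: increment views only when the message exists and its views is below the channel's counter. This is plain JavaScript for illustration; `incrementViews` is a hypothetical helper, ids are simplified to strings, and the counter value of 1 is an assumption chosen so that one increment is allowed.

```javascript
// In-memory sketch of the update's rule, not a Mongoose call.
// incrementViews is a hypothetical helper; ids are plain strings here.
function incrementViews(channel, messageId) {
  // Same test as the $filter condition: matching _id AND views < counter.
  const message = channel.messages.find(
    m => m._id === messageId && m.views < channel.counter
  );
  if (!message) {
    return false; // nothing matched, so nothing is modified
  }
  message.views += 1; // same effect as $inc on "messages.$.views"
  return true;
}

const channel = {
  counter: 1, // assumed value for the demonstration
  messages: [
    { _id: "61546bc386a9fc19ac64392e", views: 0 },
    { _id: "61546dc086a9fc19ac643934", views: 0 }
  ]
};

console.log(incrementViews(channel, "61546bc386a9fc19ac64392e")); // true
console.log(incrementViews(channel, "61546bc386a9fc19ac64392e")); // false
```

The second call returns false because views has reached counter, which is the case the $lt comparison is there to block.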
{
"_id" : ObjectId("59786a62a96166007d7e364dsadasfafsdfsdgdfgfd"),
"someotherdata" : {
"place1" : "lwekjfrhweriufesdfwergfwr",
"place2" : "sgfertgryrctshyctrhysdthc ",
"place3" : "sdfsdgfrdgfvk",
"place4" : "asdfkjaseeeeeeeeeeeeeeeeefjnhwklegvds."
}
}
I have thousands of these in my collection. I need to look through all the someotherdata and do the following
Check to see if it is present (in some records i have place1 and not place4)
Find the longest record (in terms of string length)
The output must look something like this (showing the count of characters for the longest)
{
place1: 123,
place2: 12,
place3: 17,
place4: 445
}
I'm using MongoDB 3.2.9, so I don't have access to the new aggregation functions. But I do have the MongoDB shell.
EDIT: To be clear I want the longest throughout the whole collection. So there might be 1000 documents but only one result with the longest length for each field throughout the whole collection.
Use .mapReduce() for this to reduce down to the largest values for each key:
db.collection.mapReduce(
function() {
emit(null,
Object.keys(this.someotherdata).map(k => ({ [k]: this.someotherdata[k].length }))
.reduce((acc,curr) => Object.assign(acc,curr),{})
);
},
function(key,values) {
var result = {};
values.forEach(value => {
Object.keys(value).forEach(k => {
if (!result.hasOwnProperty(k))
result[k] = 0;
if ( value[k] > result[k] )
result[k] = value[k];
});
});
return result;
},
{
"out": { "inline": 1 },
"query": { "someotherdata": { "$exists": true } }
}
)
Which basically emits the "length" of each key present in the sub-document path for each document, and then in "reduction", only the largest "length" for each key is actually returned.
Note that in mapReduce the reduce function must return the same structure that the map function emits, because a large number of documents is handled by "reducing" in gradual batches, with earlier reduce output fed back in as input. That is why we emit in numeric form, just like the "reduce" function returns.
Gives this output on your document shown in the question. Of course it's the "max" on all documents in the collection when you have more.
{
"_id" : null,
"value" : {
"place1" : 25.0,
"place2" : 26.0,
"place3" : 13.0,
"place4" : 38.0
}
}
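The point about keeping the emitted and reduced shapes identical can be checked outside the database. The sketch below uses plain JavaScript stand-ins for the two functions (`mapLengths` and `reduceMax` are hypothetical names, and the three documents are made-up test data) and shows that reducing everything at once gives the same answer as reducing in two batches and then reducing the partial results.

```javascript
// Plain JavaScript stand-ins for the map and reduce steps (hypothetical names).
function mapLengths(doc) {
  const out = {};
  for (const k of Object.keys(doc.someotherdata)) {
    out[k] = doc.someotherdata[k].length;
  }
  return out;
}

function reduceMax(values) {
  const result = {};
  for (const value of values) {
    for (const k of Object.keys(value)) {
      if (!(k in result) || value[k] > result[k]) {
        result[k] = value[k];
      }
    }
  }
  return result;
}

// Made-up documents, only for the demonstration.
const docs = [
  { someotherdata: { place1: "abc", place2: "de" } },
  { someotherdata: { place1: "a", place4: "wxyz" } },
  { someotherdata: { place2: "fghij" } }
];

const mapped = docs.map(mapLengths);
const allAtOnce = reduceMax(mapped);
// Reduce two batches separately, then reduce the partial results again.
const inBatches = reduceMax([
  reduceMax(mapped.slice(0, 2)),
  reduceMax(mapped.slice(2))
]);

console.log(allAtOnce); // { place1: 3, place2: 5, place4: 4 }
console.log(JSON.stringify(allAtOnce) === JSON.stringify(inBatches)); // true
```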
For the interested, the context of the question is in fact that features of MongoDB 3.4 were not available to them. But to do the same thing using .aggregate() where the features are available:
db.collection.aggregate([
{ "$match": { "someotherdata": { "$exists": true } } },
{ "$project": {
"_id": 0,
"someotherdata": {
"$map": {
"input": { "$objectToArray": "$someotherdata" },
"as": "s",
"in": { "k": "$$s.k", "v": { "$strLenCP": "$$s.v" } }
}
}
}},
{ "$unwind": "$someotherdata" },
{ "$group": {
"_id": "$someotherdata.k",
"v": { "$max": "$someotherdata.v" }
}},
{ "$sort": { "_id": 1 } },
{ "$group": {
"_id": null,
"data": {
"$push": { "k": "$_id", "v": "$v" }
}
}},
{ "$replaceRoot": {
"newRoot": {
"$arrayToObject": "$data"
}
}}
])
With the same output:
{
"place1" : 25,
"place2" : 26,
"place3" : 13,
"place4" : 38
}
Use cursor.forEach to iterate through the collection.
Keep track of the longest placen values (starting from -1, updating when greater found). Print out values with print() or printjson()
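A minimal sketch of that loop, assuming the field layout from the question. `longestLengths` is a hypothetical helper that accepts anything with a forEach method, so it can be tried on a plain array before being pointed at a cursor.

```javascript
// Sketch of the forEach approach; longestLengths is a hypothetical helper.
// It accepts any object with forEach, such as an array or a shell cursor.
function longestLengths(cursor) {
  var longest = {};
  cursor.forEach(function (doc) {
    var data = doc.someotherdata;
    if (!data) {
      return; // skip documents without the sub-document
    }
    Object.keys(data).forEach(function (k) {
      // Start from -1 for keys not seen yet, as suggested above.
      var current = Object.prototype.hasOwnProperty.call(longest, k) ? longest[k] : -1;
      if (typeof data[k] === "string" && data[k].length > current) {
        longest[k] = data[k].length;
      }
    });
  });
  return longest;
}

// Made-up documents, only for the demonstration.
var sampleDocs = [
  { someotherdata: { place1: "abcd", place2: "xy" } },
  { someotherdata: { place1: "ab", place3: "hello" } },
  { unrelated: true }
];

console.log(longestLengths(sampleDocs)); // { place1: 4, place2: 2, place3: 5 }
```

In the shell the same function would presumably be called as `printjson(longestLengths(db.collection.find()))`, which pulls every document to the client.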
I have the following schema:
{ "_id": {
"$oid": "58c0204d9f10810115f13e5d"
},"OrgName": "A",
"modules": [
{
"name": "test",
"fullName": "john smith",
"_id": {
"$oid": "58c0204d9f10810115f13e5e"
},
"TimeSavedPlanning": 520,
"TimeSavedWorking": 1000,
"costSaved": 0
},
{
"name": "test1",
"fullName": "john smith",
"_id": {
"$oid": "58c020f85437c22215be92cc"
},
"TimeSavedPlanning": 0,
"TimeSavedWorking": 1000,
"costSaved": 500
}
]
}
I want to aggregate the data within the "modules" array for all documents where OrgName = A and outputs the following totals.
TimeSavedPlanning = 520 (because 520 + 0 = 520)
TimeSavedWorking = 2000 (because 1000 + 1000 = 2000)
costSaved = 500 (because 0 + 500)
Just supply each field for the $group accumulators. And use the "double barreled" $sum to "sum" both from arrays, and from documents:
Model.aggregate([
{ "$match": { "OrgName": "A" } },
{ "$group": {
"_id": null,
"TimeSavedPlanning": { "$sum": { "$sum": "$modules.TimeSavedPlanning" } },
"TimeSavedWorking": { "$sum": { "$sum": "$modules.TimeSavedWorking" } },
"costSaved": { "$sum": { "$sum": "$modules.costSaved" } }
}}
])
You have been allowed to use $sum like that since MongoDB 3.2. Since that release it has "two" functions:
Takes an "array" of values and "sums" them together.
Acts as an "accumulator" within $group to "sum" values provided from documents.
So here you use "both" functions by "reducing" the arrays down to numeric values per document, and then "accumulating" via the $group.
Of course the $match does the "selection" right at the beginning of the pipeline. That is where it belongs, since it determines which documents are processed, and as the "first" stage it can also use an "index".
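The two roles of $sum can be mirrored in plain JavaScript: an inner sum over each document's modules array, and an outer running total across documents. This is only an in-memory illustration; `sumArray` and `groupTotals` are hypothetical names, and the data is the sample document from the question reduced to the relevant fields.

```javascript
// In-memory illustration of the "double barreled" $sum (hypothetical helpers).
function sumArray(values) {
  // Inner $sum: reduce one document's array to a single number.
  return values.reduce((total, v) => total + v, 0);
}

function groupTotals(docs, fields) {
  const totals = {};
  for (const field of fields) {
    totals[field] = 0;
  }
  for (const doc of docs) {
    for (const field of fields) {
      // Outer $sum: accumulate the per-document number across documents.
      totals[field] += sumArray(doc.modules.map(m => m[field]));
    }
  }
  return totals;
}

const orgDocs = [
  {
    OrgName: "A",
    modules: [
      { name: "test", TimeSavedPlanning: 520, TimeSavedWorking: 1000, costSaved: 0 },
      { name: "test1", TimeSavedPlanning: 0, TimeSavedWorking: 1000, costSaved: 500 }
    ]
  }
];

console.log(groupTotals(orgDocs, ["TimeSavedPlanning", "TimeSavedWorking", "costSaved"]));
// { TimeSavedPlanning: 520, TimeSavedWorking: 2000, costSaved: 500 }
```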
What I'm trying to achieve with a find query is to only include "someArray" elements whose inner array is not empty. For example, the JSON below:
{
"document": "some document",
"someArray": [
{
"innerArray": [
"not empty"
]
},
{
"innerArray": [
[] //empty
]
}
]
}
Would return this:
{
"document": "some document",
"someArray": [
{
"innerArray": [
"not empty"
]
}
]
}
I'm using the following find:
Visit.find({'someArray.innerArray.0': {$exists: true}}, function(err, data){});
However, this returns all data.
Have also tried:
Visit.find({}, {'someArray.innerArray': {$gt: 0}}, function(err, data) {});
But this returns nothing
Any ideas on how to approach this?
Cheers
The general case here to check for a non-empty array is to check to see if the "first" element actually exists. For single matches you can project with the positional $ operator:
Visit.find(
{ "someArray.innerArray.0": { "$exists": true } },
{ "document": 1,"someArray.$": 1},
function(err,data) {
}
);
If you need more than a single match or have arrays nested more deeply than this, then the aggregation framework is what you need to handle the harder projection and/or "filter" the array results for more than one match:
Visit.aggregate(
[
// Match documents that "contain" the match
{ "$match": {
"someArray.innerArray.0": { "$exists": true }
}},
// Unwind the array documents
{ "$unwind": "$someArray" },
// Match the array documents
{ "$match": {
"someArray.innerArray.0": { "$exists": true }
}},
// Group back to form
{ "$group": {
"_id": "$_id",
"document": { "$first": "$document" },
"someArray": { "$push": "$someArray" }
}}
],function(err,data) {
}
)
Worth noting here that you are calling this "empty" but in fact it is not, as it actually contains another empty array. You probably don't want to do that with real data, but if you have then you would need to filter like this:
Visit.aggregate(
[
{ "$match": {
"someArray": { "$elemMatch": { "innerArray.0": { "$ne": [] } } }
}},
{ "$unwind": "$someArray" },
{ "$match": {
"someArray.innerArray.0": { "$ne": [] }
}},
{ "$group": {
"_id": "$_id",
"document": { "$first": "$document" },
"someArray": { "$push": "$someArray" }
}}
],function(err,data) {
}
);
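The difference between the two conditions shows up clearly on the sample document when the filtering is done in memory. The sketch below is plain JavaScript, not a Mongoose query; `filterSomeArray` and both predicates are hypothetical helpers that approximate the two $match conditions.

```javascript
// In-memory approximation of the two filters (hypothetical helpers).
const visit = {
  document: "some document",
  someArray: [
    { innerArray: ["not empty"] },
    { innerArray: [[]] } // contains one element, which is itself an empty array
  ]
};

// Approximates { "innerArray.0": { "$exists": true } }.
const hasFirstElement = el => el.innerArray.length > 0;

// Approximates { "innerArray.0": { "$ne": [] } }.
const firstIsNotEmptyArray = el =>
  !(Array.isArray(el.innerArray[0]) && el.innerArray[0].length === 0);

function filterSomeArray(doc, keep) {
  // Return a copy of the document with only the kept array elements.
  return Object.assign({}, doc, { someArray: doc.someArray.filter(keep) });
}

console.log(filterSomeArray(visit, hasFirstElement).someArray.length); // 2
console.log(filterSomeArray(visit, firstIsNotEmptyArray).someArray.length); // 1
```

The first predicate keeps both elements because the nested empty array still counts as a first element, which is why the "exists" test alone returns all of the sample data.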
Can someone please help me update a collection based on another? I have a pickups collection like so.
{
"_id": {
"$oid": "53a46be700b94521574b6f75"
},
"created": {
"$date": 1403236800000
},
"receivers": [
{
"model": "somemodel1",
"serial": "someserial1",
"access": "someaccess1"
},
{
"model": "somemodel2",
"serial": "someserial2",
"access": "someaccess2"
},
{
"model": "somemodel3",
"serial": "someserial3",
"access": "someaccess3"
}
],
"__v": 0
}
I would like to iterate through the receivers array and search each access in another collection and if found add the activity it was found in.
Here is the workorders collection I want to search in.
{
"_id": {
"$oid": "53af72481b2aeade0b46d025"
},
"activityNumber": "someactivity",
"date": "06/28/2014",
"lines": [
{
"Line #": "1",
"Access Card #": "someaccess1"
},
{
"Line #": "2",
"Access Card #": "someaccess2"
},
{
"Line #": "3",
"Access Card #": "someaccess3"
}
]
}
And this is what I would like to end up with.
{
"_id": {
"$oid": "53a46be700b94521574b6f75"
},
"created": {
"$date": 1403236800000
},
"receivers": [
{
"model": "somemodel1",
"serial": "someserial1",
"access": "someaccess1",
"activityNumber": "someactivity"
},
{
"model": "somemodel2",
"serial": "someserial2",
"access": "someaccess2",
"activityNumber": "someactivity"
},
{
"model": "somemodel3",
"serial": "someserial3",
"access": "someaccess3",
"activityNumber": "someactivity"
}
],
"__v": 0
}
I have created an array containing all the access from pickups.
var prodValues = db.pickups.aggregate([
{ "$unwind":"$receivers" },
{ "$group": {
"_id": null,
"products": { "$addToSet": "$receivers.access"}
}}
])
I can easily iterate through the array and search the workorders collection and return the activity these are used in. But I'm not sure how to perform a find and update the pickups collection when found.
db.workorders.find({ "lines.Access Card #": { "$in": prodValues.result[0].products }},{activityNumber:1})
Thank you for your help.
Really I would loop this in the completely opposite order as that should be more efficient:
var result = db.workorders.aggregate([
{ "$project": {
"activityNumber": 1,
"access": "$lines.Access Card #",
}}
]).result;
result.forEach(function(res) {
res.access.forEach(function(acc) {
db.pickups.update(
{ "receivers.access": acc },
{ "$set": { "receivers.$.activityNumber": res.activityNumber } }
);
});
});
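What that loop does to the data can be traced in memory. This is a sketch in plain JavaScript with a shortened version of the sample documents; `applyActivity` is a hypothetical helper, and it assumes, like an update without the multi option, that only the first matching pickup is touched for each access value.

```javascript
// In-memory trace of the loop above (hypothetical helper, not a MongoDB call).
function applyActivity(pickups, workorders) {
  workorders.forEach(function (workorder) {
    const accessValues = workorder.lines.map(line => line["Access Card #"]);
    accessValues.forEach(function (acc) {
      // Like the query { "receivers.access": acc }: first matching document.
      const pickup = pickups.find(p => p.receivers.some(r => r.access === acc));
      if (!pickup) {
        return;
      }
      // Like "receivers.$": the first array element that matched the query.
      const receiver = pickup.receivers.find(r => r.access === acc);
      receiver.activityNumber = workorder.activityNumber;
    });
  });
  return pickups;
}

// Shortened sample data for the demonstration.
const pickups = [
  {
    receivers: [
      { model: "somemodel1", serial: "someserial1", access: "someaccess1" },
      { model: "somemodel2", serial: "someserial2", access: "someaccess2" }
    ]
  }
];
const workorders = [
  {
    activityNumber: "someactivity",
    lines: [
      { "Line #": "1", "Access Card #": "someaccess1" },
      { "Line #": "2", "Access Card #": "someaccess2" }
    ]
  }
];

applyActivity(pickups, workorders);
console.log(pickups[0].receivers.map(r => r.activityNumber));
// [ 'someactivity', 'someactivity' ]
```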
With MongoDB 2.6 you can clean this up a bit with a cursor on the aggregate output and the use of the bulk operations API:
var batch = db.pickups.initializeOrderedBulkOp();
var counter = 0;
db.workorders.aggregate([
{ "$project": {
"activityNumber": 1,
"access": "$lines.Access Card #",
}}
]).forEach(function(res) {
res.access.forEach(function(acc) {
batch.find({ "receivers.access": acc }).updateOne(
{ "$set": { "receivers.$.activityNumber": res.activityNumber } }
);
counter++; // count every queued update
});
// Send the queued updates once at least 500 have accumulated
if ( counter >= 500 ) {
batch.execute();
batch = db.pickups.initializeOrderedBulkOp();
counter = 0;
}
});
// Send whatever is left over
if ( counter > 0 )
batch.execute();
Either way, you are basically matching the document and the position of the array on the values of "access" returned from the first aggregation query, and in the current line. This allows the update of the related information at the specified position.
The MongoDB 2.6 improvement is that you are not pulling all the results out of the "workorders" collection into memory as an array, so only one document at a time is pulled in from the cursor results.
The Bulk operations API stores the "updates" in manageable blocks that should fall under the 16MB BSON limit, and then sends them in those blocks instead of as individual updates. The driver implementation should handle most of this, but there is some "self management" added in just to be safe.
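The "self management" part is just chunking: queue operations, flush every N, then flush the remainder. A small sketch of that pattern in plain JavaScript, with a callback standing in for batch.execute(); `runInBatches` is a hypothetical helper, and the figure of 1200 operations is made up for the demonstration.

```javascript
// Chunking sketch; runInBatches is a hypothetical helper and the execute
// callback stands in for sending one bulk batch to the server.
function runInBatches(operations, batchSize, execute) {
  let batch = [];
  let executed = 0;
  for (const op of operations) {
    batch.push(op);
    if (batch.length === batchSize) {
      execute(batch); // flush a full batch
      executed += 1;
      batch = [];     // start a fresh batch
    }
  }
  if (batch.length > 0) {
    execute(batch);   // flush the remainder, never an empty batch
    executed += 1;
  }
  return executed;
}

const batchSizes = [];
const operations = Array.from({ length: 1200 }, (_, i) => ({ id: i }));
runInBatches(operations, 500, batch => batchSizes.push(batch.length));

console.log(batchSizes); // [ 500, 500, 200 ]
```

Skipping the final flush when nothing is queued matters for the same reason as the `counter > 0` check above: there is no point in sending an empty batch.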