I'm new to MongoDB and Mongoose and I'm trying to use them to save stock ticks for day-trading analysis, so I came up with this schema:
symbolSchema = Schema({
    name: String,
    code: String
});
quoteSchema = Schema({
    date: { type: Date, default: Date.now },
    open: Number,
    high: Number,
    low: Number,
    close: Number,
    volume: Number
});
intradayQuotesSchema = Schema({
    id_symbol: { type: Schema.Types.ObjectId, ref: "symbol" },
    day: Date,
    quotes: [quoteSchema]
});
From my feed I receive information like this every minute:
date | symbol | open | high | low | close | volume
2015-03-09 13:23:00|AAPL|127,14|127,17|127,12|127,15|19734
I have to:
Find the ObjectId of the symbol (AAPL).
Discover if the intradayQuote document of this symbol already exists (symbol and date combination)
Discover if the minute OHLCV data of this symbol exists on the quotes array (because it could be repeated)
Update or create the document and update or create the quotes inside the array
I'm able to accomplish this task without verifying if the quote already exists, but this method can create repeated entries inside the quotes array:
symbol.find({ "code": mySymbol }, function(err, stock) {
    intradayQuote.findOneAndUpdate(
        { id_symbol: stock[0]._id, day: myDay },
        { $push: { quotes: myQuotes } },
        { upsert: true },
        myCallback
    );
});
I already tried:
$addToSet instead of $push, but unfortunately this doesn't seem to work with arrays of documents
{ id_symbol:stock[0]._id, day: myDay, 'quotes["date"]': myDate } on the conditions of findOneAndUpdate; but unfortunately, if Mongo doesn't find it, it creates a new document for the minute instead of appending to the quotes array.
Is there a way to get this working without using one more query (I'm already using 2)? Should I rethink my Schema to facilitate this job? Any help will be appreciated. Thanks!
Simply put, an $addToSet operator cannot work for you because your data is not a true "set", which by definition is a collection of "completely distinct" objects.
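For illustration, a minimal shell sketch (using a hypothetical "scratch" collection) of why that matters: $addToSet compares the entire sub-document, so two quotes for the same minute that differ in any field are both kept.
db.scratch.insert({ _id: 1, quotes: [] })
db.scratch.update({ _id: 1 },
    { "$addToSet": { "quotes": { "date": new Date("2015-03-09T13:23:00Z"), "close": 127.15 } } })
db.scratch.update({ _id: 1 },
    { "$addToSet": { "quotes": { "date": new Date("2015-03-09T13:23:00Z"), "close": 127.16 } } })
// db.scratch.findOne({ _id: 1 }).quotes now holds two entries for the same minute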
The other piece of logical sense here is that you would be working on the data as it arrives, either as a single object or a feed. I'll presume it's a feed of many items in some form and that you can use some sort of stream processor to arrive at this structure per document received:
{
    "date": new Date("2015-03-09T13:23:00.000Z"),
    "symbol": "AAPL",
    "open": 127.14,
    "high": 127.17,
    "low": 127.12,
    "close": 127.15,
    "volume": 19734
}
Convert to a standard decimal format as well as a UTC date, since any locale formatting really should be the domain of your application once the data is retrieved from the datastore.
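As a rough sketch (the parseTick helper and the field order are assumptions based on the sample line above), that conversion could look like this:
// Hypothetical helper: turn one pipe-delimited tick line into the structure above
function parseTick(line) {
    var parts = line.split("|");
    var toNumber = function(s) { return parseFloat(s.replace(",", ".")); };
    return {
        date: new Date(parts[0].replace(" ", "T") + "Z"), // treat the feed time as UTC
        symbol: parts[1],
        open: toNumber(parts[2]),
        high: toNumber(parts[3]),
        low: toNumber(parts[4]),
        close: toNumber(parts[5]),
        volume: parseInt(parts[6], 10)
    };
}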
I would also at least flatten out your "intraDayQuoteSchema" a little by removing the reference to the other collection and just putting the data in there. You would still need a lookup on insertion, but the overhead of the additional populate on read would seem to be more costly than the storage overhead:
intradayQuotesSchema = Schema({
symbol:{
name: String,
code: String
},
day:Date,
quotes:[quotesSchema]
});
It depends on your usage patterns, but it's likely to be more effective that way.
The rest really comes down to what is acceptable to you in terms of update handling. One approach is to chain a series of conditional updates in your stream processor:
stream.on("data", function(data) {
    var symbolCode = data.symbol,
        myDay = new Date(
            data.date.valueOf() -
            ( data.date.valueOf() % (1000 * 60 * 60 * 24) ));
    delete data.symbol;

    symbol.findOne({ "code": symbolCode }, function(err,stock) {
        intraDayQuote.findOneAndUpdate(
            { "symbol.code": symbolCode, "day": myDay },
            { "$setOnInsert": {
                "symbol.name": stock.name,
                "quotes": [data]
            }},
            { "upsert": true },
            function(err,doc) {
                intraDayQuote.findOneAndUpdate(
                    {
                        "symbol.code": symbolCode,
                        "day": myDay,
                        "quotes.date": data.date
                    },
                    { "$set": { "quotes.$": data } },
                    function(err,doc) {
                        intraDayQuote.findOneAndUpdate(
                            {
                                "symbol.code": symbolCode,
                                "day": myDay,
                                "quotes.date": { "$ne": data.date }
                            },
                            { "$push": { "quotes": data } },
                            function(err,doc) {
                            }
                        );
                    }
                );
            }
        );
    });
});
If you don't actually need the modified document in the response, then you would get some benefit from implementing the Bulk Operations API here and sending all of the updates within a single database request:
stream.on("data", function(data) {
    var symbolCode = data.symbol,
        myDay = new Date(
            data.date.valueOf() -
            ( data.date.valueOf() % (1000 * 60 * 60 * 24) ));
    delete data.symbol;

    symbol.findOne({ "code": symbolCode }, function(err,stock) {
        var bulk = intraDayQuote.collection.initializeOrderedBulkOp();

        bulk.find({ "symbol.code": symbolCode, "day": myDay })
            .upsert().updateOne({
                "$setOnInsert": {
                    "symbol.name": stock.name,
                    "quotes": [data]
                }
            });

        bulk.find({
            "symbol.code": symbolCode,
            "day": myDay,
            "quotes.date": data.date
        }).updateOne({
            "$set": { "quotes.$": data }
        });

        bulk.find({
            "symbol.code": symbolCode,
            "day": myDay,
            "quotes.date": { "$ne": data.date }
        }).updateOne({
            "$push": { "quotes": data }
        });

        bulk.execute(function(err,result) {
            // maybe do something with the response
        });
    });
});
The point is that only one of the statements there will actually modify data, and since this is all sent in the same request there is less back and forth between the application and server.
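If you do want some feedback, the execute() callback receives a result object reporting what happened; a minimal sketch, assuming the driver's usual BulkWriteResult counters:
bulk.execute(function(err, result) {
    if (err) throw err; // or handle
    // only one of the three statements should have modified or upserted anything
    console.log(
        "matched: %d, modified: %d, upserted: %d",
        result.nMatched, result.nModified, result.nUpserted
    );
});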
The alternate case is that it might just be simpler to have the actual quote data referenced in another collection. This then just becomes a simple matter of processing upserts:
intradayQuotesSchema = Schema({
symbol:{
name: String,
code: String
},
day:Date,
quotes:[{ type: Schema.Types.ObjectId, ref: "quote" }]
});
// and in the stream processor
stream.on("data", function(data) {
    var symbolCode = data.symbol,
        myDay = new Date(
            data.date.valueOf() -
            ( data.date.valueOf() % (1000 * 60 * 60 * 24) ));
    delete data.symbol;

    symbol.findOne({ "code": symbolCode }, function(err,stock) {
        quote.update(
            { "date": data.date },
            { "$setOnInsert": data },
            { "upsert": true },
            function(err,num,raw) {
                if ( !raw.updatedExisting ) {
                    intraDayQuote.update(
                        { "symbol.code": symbolCode, "day": myDay },
                        {
                            "$setOnInsert": {
                                "symbol.name": stock.name
                            },
                            "$addToSet": { "quotes": data }
                        },
                        { "upsert": true },
                        function(err,num,raw) {
                        }
                    );
                }
            }
        );
    });
});
It really comes down to how important it is to you to have the data for quotes nested within the "day" document. The main distinction is whether you want to query those documents based on the data in some of those "quote" fields, or otherwise live with the overhead of using .populate() to pull in the "quotes" from the other collection.
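For reference, a minimal sketch of the read side with the referenced variant of the schema, where the quotes array holds ObjectIds and therefore needs .populate():
intraDayQuote
    .findOne({ "symbol.code": "AAPL", "day": myDay })
    .populate("quotes")
    .exec(function(err, doc) {
        // doc.quotes is now an array of full quote documents
    });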
Of course, if referenced and the quote data is important to your query filtering, then you can always query that collection for the matching _id values and use an $in query on the "day" documents to match only the days that contain those "quote" documents.
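A rough sketch of that approach, with the filter value purely illustrative:
// find the _id values of quotes that match some condition
quote.find({ "close": { "$gte": 127 } }, { "_id": 1 }, function(err, quotes) {
    var ids = quotes.map(function(q) { return q._id; });
    // then match only the days referencing at least one of those quotes
    intraDayQuote.find({ "quotes": { "$in": ids } }, function(err, days) {
        // days containing matching quotes
    });
});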
Which path you take is a big decision, and what matters most is how your application uses the data. Hopefully this guides you on the general concepts behind doing what you want to achieve.
P.S. Unless you are "sure" that your source data is always a date rounded to an exact "minute", you probably want to employ the same kind of date rounding math as was used to get the discrete "day" as well.
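For example, a minimal sketch of rounding an incoming timestamp down to the exact minute, mirroring the "day" rounding used above:
data.date = new Date(
    data.date.valueOf() - ( data.date.valueOf() % (1000 * 60) ));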
There are three items in the database:
[
{
"year": 2013,
"info": {
"genres": ["Action", "Biography"]
}
},
{
"year": 2013,
"info": {
"genres": ["Crime", "Drama", "Thriller"]
}
},
{
"year": 2013,
"info": {
"genres": ["Action", "Adventure", "Sci-Fi", "Thriller"]
}
}
]
With the year attribute being the table's primary key, I can go ahead and use a FilterExpression to match the exact list value ["Action", "Biography"]:
var params = {
TableName : TABLE_NAME,
KeyConditionExpression: "#yr = :yyyy",
FilterExpression: "info.genres = :genres",
ExpressionAttributeNames:{
"#yr": "year"
},
ExpressionAttributeValues: {
":yyyy": 2013,
":genres": ["Action", "Biography"]
}
};
var AWS = require("aws-sdk");
var docClient = new AWS.DynamoDB.DocumentClient();
let promise = docClient.query(params).promise();
promise.then(res => {
console.log("res:", res);
})
Instead of matching an entire list ["Action", "Biography"], I would rather make a query that returns only those table items that contain the string "Biography" in the list stored in the item's info.genres field. I wonder if this is possible using the DynamoDB query API?
Edited later.
The working solution (thanks to Balu) is to use the contains comparison operator in the filter expression:
var params = {
    TableName: TABLE_NAME,
    Limit: 20,
    KeyConditionExpression: "id = :yyyy",
    FilterExpression: "contains(info.genres, :qqqq)",
    ExpressionAttributeValues: {
        ":qqqq": "Biography",
        ":yyyy": 2013,
    },
}
let promise = docClient.query(params).promise();
promise.then(res => {
console.log("res:", res);
})
We can use contains in Filter expressions instead of =.
So, "info.genres = :genres" can be changed to contains(info.genres , :gnOne)
AWS is still going to query on the partition key and read up to 1 MB of data in a single query before applying the filter, so we are charged the same RCUs with or without the filter expression. The amount of data returned to the client is limited, though, so it is still useful.
const dynamodb = new AWS.DynamoDB();
dynamodb.query(
{
TableName: "my-test-table",
Limit: 20,
KeyConditionExpression: "id = :yyyy",
FilterExpression: `contains(info.genres , :gnOne)`,
ExpressionAttributeValues: {
":gnOne": { S: "Biography" },
":yyyy": { S: "2020" },
},
},
function (err, data) {
if (err) console.error(err);
else console.log("dynamodb query succeeded:", JSON.stringify(data, null, 2));
}
);
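Since the filter is applied after each page (up to 1 MB) is read, a paginated loop following LastEvaluatedKey may be needed to cover the whole partition; a rough sketch:
function queryAll(params, items, callback) {
    dynamodb.query(params, function (err, data) {
        if (err) return callback(err);
        items = items.concat(data.Items);
        if (data.LastEvaluatedKey) {
            // keep reading pages until the whole partition has been evaluated
            params.ExclusiveStartKey = data.LastEvaluatedKey;
            return queryAll(params, items, callback);
        }
        callback(null, items);
    });
}
// usage: queryAll(params, [], function (err, items) { ... });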
Short answer, no. DynamoDB stores key/value pairs, so the attribute you want to query on should be a top-level attribute.
Long answer, yes, but it requires a scan. Honestly, I don't see much difference between query and scan as far as RCU consumption is concerned. You can use the Limit parameter to cap the RCUs used in a single network call.
If we are good so far, you can use document paths in your filter expression to achieve what you're trying to do. See this Stack Overflow post and this GitHub example.
However, note that this is a Scan operation, not a query, and it might turn out to be very expensive, as it will not use any indexes and will iterate over every item in your table.
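For completeness, a rough sketch of that scan using a document path in the filter expression (reusing the question's TABLE_NAME and docClient):
var scanParams = {
    TableName: TABLE_NAME,
    FilterExpression: "contains(info.genres, :g)",
    ExpressionAttributeValues: { ":g": "Biography" }
};
docClient.scan(scanParams).promise().then(res => console.log(res.Items));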
It would be best to pull these attributes out into the top-level document, and query accordingly with a secondary index.
In MongoDB, is it possible to update the value of a field using the value from another field? The equivalent SQL would be something like:
UPDATE Person SET Name = FirstName + ' ' + LastName
And the MongoDB pseudo-code would be:
db.person.update( {}, { $set : { name : firstName + ' ' + lastName } } );
The best way to do this is with version 4.2+, which allows using an aggregation pipeline in the update document with the updateOne, updateMany, or update (deprecated in most if not all language drivers) collection methods.
MongoDB 4.2+
Version 4.2 also introduced the $set pipeline stage operator, which is an alias for $addFields. I will use $set here as it maps to what we are trying to achieve.
db.collection.<update method>(
{},
[
{"$set": {"name": { "$concat": ["$firstName", " ", "$lastName"]}}}
]
)
Note that the square brackets in the second argument to the method specify an aggregation pipeline instead of a plain update document; in a plain document, $concat and "$firstName" are not interpreted as aggregation expressions, so it would not work correctly.
MongoDB 3.4+
In 3.4+, you can use $addFields and the $out aggregation pipeline operators.
db.collection.aggregate(
[
{ "$addFields": {
"name": { "$concat": [ "$firstName", " ", "$lastName" ] }
}},
{ "$out": <output collection name> }
]
)
Note that this does not update your collection but instead replaces the existing collection or creates a new one. Also, for update operations that require "typecasting", you will need client-side processing, and depending on the operation, you may need to use the find() method instead of the .aggregate() method.
MongoDB 3.2 and 3.0
The way we do this is by $projecting our documents and using the $concat string aggregation operator to return the concatenated string.
You then iterate the cursor and use the $set update operator to add the new field to your documents using bulk operations for maximum efficiency.
Aggregation query:
var cursor = db.collection.aggregate([
{ "$project": {
"name": { "$concat": [ "$firstName", " ", "$lastName" ] }
}}
])
MongoDB 3.2 or newer
You need to use the bulkWrite method.
var requests = [];
cursor.forEach(document => {
requests.push( {
'updateOne': {
'filter': { '_id': document._id },
'update': { '$set': { 'name': document.name } }
}
});
if (requests.length === 500) {
//Execute per 500 operations and re-init
db.collection.bulkWrite(requests);
requests = [];
}
});
if(requests.length > 0) {
db.collection.bulkWrite(requests);
}
MongoDB 2.6 and 3.0
From this version, you need to use the now deprecated Bulk API and its associated methods.
var bulk = db.collection.initializeUnorderedBulkOp();
var count = 0;
cursor.snapshot().forEach(function(document) {
bulk.find({ '_id': document._id }).updateOne( {
'$set': { 'name': document.name }
});
count++;
if(count%500 === 0) {
// Execute per 500 operations and re-init
bulk.execute();
bulk = db.collection.initializeUnorderedBulkOp();
}
})
// clean up queues
if(count > 0) {
bulk.execute();
}
MongoDB 2.4
cursor["result"].forEach(function(document) {
db.collection.update(
{ "_id": document._id },
{ "$set": { "name": document.name } }
);
})
You should iterate through. For your specific case:
db.person.find().snapshot().forEach(
function (elem) {
db.person.update(
{
_id: elem._id
},
{
$set: {
name: elem.firstname + ' ' + elem.lastname
}
}
);
}
);
Apparently there is a way to do this efficiently since MongoDB 3.4, see styvane's answer.
Obsolete answer below
You cannot refer to the document itself in an update (yet). You'll need to iterate through the documents and update each document using a function. See this answer for an example, or this one for server-side eval().
For a database with high activity, you may run into issues where your updates affect actively changing records, and for this reason I recommend using snapshot().
db.person.find().snapshot().forEach( function (hombre) {
hombre.name = hombre.firstName + ' ' + hombre.lastName;
db.person.save(hombre);
});
http://docs.mongodb.org/manual/reference/method/cursor.snapshot/
Starting with Mongo 4.2, db.collection.update() can accept an aggregation pipeline, finally allowing the update/creation of a field based on another field:
// { firstName: "Hello", lastName: "World" }
db.collection.updateMany(
{},
[{ $set: { name: { $concat: [ "$firstName", " ", "$lastName" ] } } }]
)
// { "firstName" : "Hello", "lastName" : "World", "name" : "Hello World" }
The first part {} is the match query, filtering which documents to update (in our case all documents).
The second part [{ $set: { name: { ... } } }] is the update aggregation pipeline (note the square brackets signifying the use of an aggregation pipeline). $set is a new aggregation operator and an alias of $addFields.
Regarding this answer, the snapshot function is deprecated in version 3.6, according to this update. So, on version 3.6 and above, it is possible to perform the operation this way:
db.person.find().forEach(
function (elem) {
db.person.update(
{
_id: elem._id
},
{
$set: {
name: elem.firstname + ' ' + elem.lastname
}
}
);
}
);
I tried the above solution but I found it unsuitable for large amounts of data. I then discovered the stream feature:
MongoClient.connect("...", function(err, db){
var c = db.collection('yourCollection');
var s = c.find({/* your query */}).stream();
s.on('data', function(doc){
c.update({_id: doc._id}, {$set: {name: doc.firstName + ' ' + doc.lastName}}, function(err, result) { /* result == true? */ });
});
s.on('end', function(){
// stream can end before all your updates do if you have a lot
})
})
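One way to deal with that, as a rough sketch building on the snippet above, is to track in-flight updates so you know when everything has really finished:
var pending = 0, ended = false;
s.on('data', function(doc){
    pending++;
    c.update({_id: doc._id},
             {$set: {name: doc.firstName + ' ' + doc.lastName}},
             function(err, result){
        pending--;
        if (ended && pending === 0) { /* all updates have finished */ }
    });
});
s.on('end', function(){ ended = true; });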
The update() method takes an aggregation pipeline as a parameter, like:
db.collection_name.update(
{
// Query
},
[
// Aggregation pipeline
{ "$set": { "id": "$_id" } }
],
{
// Options
"multi": true // false when a single doc has to be updated
}
)
The field can be set or unset with existing values using the aggregation pipeline.
Note: use $ with the field name to reference the field that has to be read.
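For example, a rough sketch combining both in one pipeline (the fullName and tempField names are purely illustrative):
db.collection_name.update(
    {},
    [
        { "$set": { "fullName": { "$concat": [ "$firstName", " ", "$lastName" ] } } },
        { "$unset": [ "tempField" ] }
    ],
    { "multi": true }
)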
Here's what we came up with for copying one field to another for ~150_000 records. It took about 6 minutes, but is still significantly less resource intensive than it would have been to instantiate and iterate over the same number of ruby objects.
js_query = %({
  $or : [
    { 'settings.mobile_notifications' : { $exists : false } },
    { 'settings.mobile_admin_notifications' : { $exists : false } }
  ]
})
js_for_each = %(function(user) {
if (!user.settings.hasOwnProperty('mobile_notifications')) {
user.settings.mobile_notifications = user.settings.email_notifications;
}
if (!user.settings.hasOwnProperty('mobile_admin_notifications')) {
user.settings.mobile_admin_notifications = user.settings.email_admin_notifications;
}
db.users.save(user);
})
js = "db.users.find(#{js_query}).forEach(#{js_for_each});"
Mongoid::Sessions.default.command('$eval' => js)
With MongoDB version 4.2+, updates are more flexible, as it allows the use of an aggregation pipeline in update, updateOne and updateMany. You can now transform your documents using the aggregation operators and then update without the need to explicitly state the $set command (instead we use $replaceRoot: {newRoot: "$$ROOT"}).
Here we use the aggregation query to extract the timestamp from MongoDB's ObjectID "_id" field and update the documents (I am not an expert in SQL, but I think SQL does not provide any auto-generated ObjectID that has a timestamp in it; you would have to create that date yourself).
var collection = "person"
agg_query = [
{
"$addFields" : {
"_last_updated" : {
"$toDate" : "$_id"
}
}
},
{
$replaceRoot: {
newRoot: "$$ROOT"
}
}
]
db.getCollection(collection).updateMany({}, agg_query, {upsert: true})
(I would have posted this as a comment, but couldn't)
For anyone who lands here trying to update one field using another in the document with the C# driver...
I could not figure out how to use any of the UpdateXXX methods and their associated overloads since they take an UpdateDefinition as an argument.
// we want to set Prop1 to Prop2
class Foo { public string Prop1 { get; set; } public string Prop2 { get; set;} }
void Test()
{
var update = new UpdateDefinitionBuilder<Foo>();
update.Set(x => x.Prop1, <new value; no way to get a hold of the object that I can find>)
}
As a workaround, I found that you can use the RunCommand method on an IMongoDatabase (https://docs.mongodb.com/manual/reference/command/update/#dbcmd.update).
var command = new BsonDocument
{
{ "update", "CollectionToUpdate" },
{ "updates", new BsonArray
{
new BsonDocument
{
// Any filter; here the check is if Prop1 does not exist
{ "q", new BsonDocument{ ["Prop1"] = new BsonDocument("$exists", false) }},
// set it to the value of Prop2
{ "u", new BsonArray { new BsonDocument { ["$set"] = new BsonDocument("Prop1", "$Prop2") }}},
{ "multi", true }
}
}
}
};
database.RunCommand<BsonDocument>(command);
MongoDB 4.2+ Golang
result, err := collection.UpdateMany(ctx, bson.M{},
    mongo.Pipeline{
        bson.D{{"$set",
            bson.M{"name": bson.M{"$concat": []string{"$lastName", " ", "$firstName"}}},
        }},
    },
)
I converted this date format
"date" : "2016-02-22 13:52:23"
using this code
db.gmastats.find({date: {$not: {$type: 9}}}).forEach(function(doc) {doc.date = new Date(doc.date); db.gmastats.save(doc);})
And Mongo gave me this:
ISODate("-292275055-05-16T16:47:03.192Z")
What do I do to get this into something that is a real date?
The date string is not valid, but with a small conversion it is. Basically you just need to add a "T" between the "date" part and the "time" part, and then new Date() will work it out just fine.
db.gmastats.find({ "date": { "$type": 2 } }).forEach(function(doc) {
doc.date = new Date(doc.date.split(" ").join("T"));
db.gmastats.save(doc);
})
Also note that it's not safe to just look for things that are not a BSON Date. You should be looking only at "string" data when you are acting on a "string".
But really you "should" be doing this with "Bulk" operations:
var ops = [];
db.gmastats.find({ "date": { "$type": 2 } }).forEach(function(doc) {
ops.push({ "updateOne": {
"filter": { "_id": doc._id },
"update": { "$set": { "date": new Date(doc.date.split(" ").join("T")) } }
}});
if ( ops.length == 1000 ) {
db.gmastats.bulkWrite(ops);
ops = []
}
})
if ( ops.length > 0 ) {
db.gmastats.bulkWrite(ops);
}
Which will go much faster and also safely "only" updates the "date" property of the data without affecting other concurrent write operations.
I am trying a sample that uses $addToSet to update an array inside a collection. The new elements are being added, but not as intended. According to $addToSet, a new element is added only if it is not already in the list.
Issue:
It simply adds whatever element is given.
Here is my code sample:
Schema(mongo_database.js):
var category = new Schema({
Category_Name: { type: String, required: true},
//SubCategories: [{}]
Category_Type: { type: String},
Sub_Categories: [{Sub_Category_Name: String, UpdatedOn: { type:Date, default:Date.now} }],
CreatedDate: { type:Date, default: Date.now},
UpdatedOn: {type: Date, default: Date.now}
});
service.js
exports.addCategory = function (req, res){
//console.log(req.body);
var category_name = req.body.category_name;
var parent_category_id = req.body.parent_categoryId;
console.log(parent_category_id);
var cats = JSON.parse('{ "Sub_Category_Name":"'+category_name+'"}');
//console.log(cats);
var update = db.category.update(
{
_id: parent_category_id
},
{
$addToSet: { Sub_Categories: cats}
},
{
upsert:true
}
);
update.exec(function(err, updation){
})
}
Can someone help me to figure this out?
Many thanks.
As mentioned already, $addToSet does not work this way as the elements in the array or "set" are meant to truly represent a "set" where each element is totally unique. Additionally, the operation methods such as .update() do not take the mongoose schema default or validation rules into account.
However operations such as .update() are a lot more effective than "finding" the document, then manipulating and using .save() for the changes in your client code. They also avoid concurrency problems where other processes or event operations could have modified the document after it was retrieved.
To do what you want requires making "multiple" update statements to the server. It's a "fallback" logic situation where, when one operation does not update the document, you fall back to the next:
models/category.js:
var mongoose = require('mongoose'),
Schema = mongoose.Schema;
var category = new Schema({
Category_Name: { type: String, required: true},
Category_Type: { type: String},
Sub_Categories: [{Sub_Category_Name: String, UpdatedOn: { type:Date, default:Date.now} }],
CreatedDate: { type:Date, default: Date.now},
UpdatedOn: {type: Date, default: Date.now}
});
exports.Category = mongoose.model( "Category", category );
in your code:
var Category = require('models/category').Category;
exports.addCategory = function(req,res) {
    var category_name = req.body.category_name;
    var parent_category_id = req.body.parent_categoryId;

    Category.update(
        {
            "_id": parent_category_id,
            "Sub_Categories.Sub_Category_Name": category_name
        },
        {
            "$set": { "Sub_Categories.$.UpdatedOn": new Date() }
        },
        function(err,numAffected) {
            if (err) throw err; // or handle

            if ( numAffected == 0 )
                Category.update(
                    {
                        "_id": parent_category_id,
                        "Sub_Categories.Sub_Category_Name": { "$ne": category_name }
                    },
                    {
                        "$push": {
                            "Sub_Categories": {
                                "Sub_Category_Name": category_name,
                                "UpdatedOn": new Date()
                            }
                        }
                    },
                    function(err,numAffected) {
                        if (err) throw err; // or handle

                        if ( numAffected == 0 )
                            Category.update(
                                {
                                    "_id": parent_category_id
                                },
                                {
                                    "$push": {
                                        "Sub_Categories": {
                                            "Sub_Category_Name": category_name,
                                            "UpdatedOn": new Date()
                                        }
                                    }
                                },
                                { "upsert": true },
                                function(err,numAffected) {
                                    if (err) throw err;
                                }
                            );
                    }
                );
        }
    );
};
Essentially a possible three operations are tried:
Try to match a document where the category name exists and change the "UpdatedOn" value for the matched array element.
If that did not update. Find a document matching the parentId but where the category name is not present in the array and push a new element.
If that did not update. Perform an operation trying to match the parentId and just push the array element with the upsert set as true. Since both previous updates failed, this is basically an insert.
You can clean that up either by using something like async.waterfall to pass down the numAffected value and avoid the indentation creep, or, my personal preference, by not bothering to check the affected value and just passing all statements at once to the server via the Bulk Operations API.
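A rough sketch of the former, assuming the async package, might look like this:
var async = require('async');

async.waterfall([
    function(callback) {
        Category.update(
            { "_id": parent_category_id,
              "Sub_Categories.Sub_Category_Name": category_name },
            { "$set": { "Sub_Categories.$.UpdatedOn": new Date() } },
            function(err, numAffected) { callback(err, numAffected); }
        );
    },
    function(numAffected, callback) {
        if (numAffected > 0) return callback(null, numAffected);
        Category.update(
            { "_id": parent_category_id,
              "Sub_Categories.Sub_Category_Name": { "$ne": category_name } },
            { "$push": { "Sub_Categories": {
                "Sub_Category_Name": category_name, "UpdatedOn": new Date() } } },
            function(err, numAffected) { callback(err, numAffected); }
        );
    },
    function(numAffected, callback) {
        if (numAffected > 0) return callback(null, numAffected);
        Category.update(
            { "_id": parent_category_id },
            { "$push": { "Sub_Categories": {
                "Sub_Category_Name": category_name, "UpdatedOn": new Date() } } },
            { "upsert": true },
            function(err, numAffected) { callback(err, numAffected); }
        );
    }
], function(err, result) {
    if (err) throw err; // or handle
});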
The latter can be accessed from a mongoose model like so:
var ObjectId = mongoose.mongo.ObjectID,
    Category = require('models/category').Category;

exports.addCategory = function(req,res) {
    var category_name = req.body.category_name;
    var parent_category_id = req.body.parent_categoryId;

    var bulk = Category.collection.initializeOrderedBulkOp();

    // Reversed insert
    bulk.find({ "_id": { "$ne": new ObjectId( parent_category_id ) } })
        .upsert().updateOne({
            "$setOnInsert": { "_id": new ObjectId( parent_category_id ) },
            "$push": {
                "Sub_Categories": {
                    "Sub_Category_Name": category_name,
                    "UpdatedOn": new Date()
                }
            }
        });

    // In place
    bulk.find({
        "_id": new ObjectId( parent_category_id ),
        "Sub_Categories.Sub_Category_Name": category_name
    }).updateOne({
        "$set": { "Sub_Categories.$.UpdatedOn": new Date() }
    });

    // Push where not matched
    bulk.find({
        "_id": new ObjectId( parent_category_id ),
        "Sub_Categories.Sub_Category_Name": { "$ne": category_name }
    }).updateOne({
        "$push": {
            "Sub_Categories": {
                "Sub_Category_Name": category_name,
                "UpdatedOn": new Date()
            }
        }
    });

    // Send to server
    bulk.execute(function(err,response) {
        if (err) throw err; // or handle
        console.log( JSON.stringify( response, undefined, 4 ) );
    });
};
Note the reversed logic where the "upsert" occurs first; of course, if that succeeded then only the "second" statement would apply, but under the Bulk API this would not affect the document. You will get a WriteResult object with the basic information similar to this (in abridged form):
{ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 }
Or on the "upsert":
{
"nMatched" : 1,
"nUpserted" : 1,
"nModified" : 0,
"_id" : ObjectId("54af8fe7628bee196ce97ce0")
}
Also note the need to include the ObjectId function from the base mongo driver since this is the "raw" method from the base driver and it does not "autocast" based on schema like the mongoose methods do.
Additionally, be very careful with this, because it is a base driver method and does not share the mongoose logic, so if there is no connection established to the database already, then calling the .collection accessor will not return a Collection object and the subsequent method calls will fail. Mongoose itself does a "lazy" instantiation of the database connection, and its method calls are "queued" until the connection is available. Not so with the basic driver methods.
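A small sketch of one way to guard against that (assuming mongoose is required in this module) is to wait for the connection's "open" event before touching the raw collection:
mongoose.connection.on("open", function() {
    // safe to use the raw driver collection from here on
    var bulk = Category.collection.initializeOrderedBulkOp();
    // ... queue the operations and execute as above
});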
So it can be done, it's just that you need to handle the logic for such array handling yourself as there is no native operator to do that. But it's still pretty simple and quite efficient if you take the proper care.
First of all: I'm using Mongo 2.6 and Mongoose 3.8.8
I have the following schema:
var Link = new Schema({
title: { type: String, trim: true },
owner: { id: { type: Schema.ObjectId }, name: { type: String } },
url: { type: String, default: '', trim: true},
stars: { users: [ { name: { type: String }, _id: {type: Schema.ObjectId} }] },
createdAt: { type: Date, default: Date.now }
});
And my collection already has 500k documents.
What I need is to sort the documents using a custom strategy. My initial solution was to use the aggregation framework.
var today = new Date();
//fx = (TodayDay * TodayYear) - ( DocumentCreatedDay * DocumentCreatedYear)
var relevance = { $subtract: [
{ $multiply: [ { $dayOfYear: today }, { $year: today } ] },
{ $multiply: [ { $dayOfYear: '$createdAt' }, { $year: '$createdAt' } ] }
]}
var projection = {
_id: 1,
url: 1,
title: 1,
createdAt: 1,
thumbnail: 1,
stars: { $size: '$stars.users'},
ranking: { $multiply: [ relevance, { $size: '$stars.users' } ] }
}
var sort = {
$sort: { ranking: 1, stars: 1 }
}
var page = 1;
var limit = { $limit: 40 }
var skip = { $skip: ( 40 * (page - 1) ) }
var project = { $project: projection }
Link.aggregate([project, sort, limit, skip]).exec(resultCallback);
It works nicely up to about 100k documents, but after that the query gets slower and slower.
How could I accomplish that?
Should I redesign?
Am I using the projection incorrectly?
Thanks for your time!
You can do all of this as you update, and then you can actually index on the ranking and use range queries in order to implement your paging. That is much better than the use of $skip and $limit, which in any form is bad news for large data. You should be able to find many sources confirming that skip and limit are a poor practice for paging.
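As a rough sketch of range-based paging on a stored "score" field (lastSeenScore is illustrative and assumes a descending index on score):
Link.find({ "score": { "$lt": lastSeenScore } })
    .sort({ "score": -1 })
    .limit(40)
    .exec(function(err, links) {
        // remember the last score on this page to request the next one
    });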
The only catch here is that, since you cannot use an .update() type of statement to refer to the existing value of another field, you have to be careful with concurrency issues on updates. This requires "rolling in" some custom lock handling, which you can do with the .findOneAndUpdate() method:
Link.findOneAndUpdate(
    { "_id": docId, "locked": false },
    { "locked": true },
    function(err,doc) {
        if ( doc ) {            // we acquired the lock
            // then update your document
            // I would just use the epoch date difference per day
            var relevance = (
                ( Date.now() - ( Date.now() % (1000 * 60 * 60 * 24) ) )
                - ( doc.createdAt.valueOf() - ( doc.createdAt.valueOf() % (1000 * 60 * 60 * 24) ) )
            );

            var update = { "$set": { "locked": false } };

            if ( actionAdd ) {
                update["$push"] = { "stars.users": star };
                update["$set"]["score"] = relevance * ( doc.stars.users.length + 1 );
            } else {
                update["$pull"] = { "stars.users": star };
                update["$set"]["score"] = relevance * ( doc.stars.users.length - 1 );
            }

            // Then update and release the lock
            Link.findOneAndUpdate(
                { "_id": doc._id, "locked": true },
                update,
                function(err,newDoc) {
                    // possibly check that the new "locked" is false, but really
                    // that should be okay
                }
            );
        } else {
            // some mechanism to retry "n" times at interval
            // or report that you cannot update
        }
    }
)
The idea there is that you can only grab a document with a "locked" status equal to false in order to actually update, and the first "update" operation just sets that value to true so that no other operation could update the document until this completes.
As per the code comments, you probably want to have a few tries at doing this rather than just failing the update as there could be another operation adding or subtracting from the array.
Then, depending on the "mode" of your current update, whether you are adding to the array or taking an item off, you simply alter the update statement to perform the appropriate operation and set the appropriate "score" value in your document.
The update will then of course set the "locked" status to false and it makes sense to check that the current status is not true though it really should be okay at this point. But this gives you some room on being able to raise exceptions.
That manages the general update situation, but you still have a problem with sorting out your "ranking" order here, as skip and limit are still not what you want for performance. That is probably best handled by a periodic update of yet another field which you can use for a definitive "range" query, but you probably only really want to be concerned with the most "relevant" score range in a set range of pages, rather than updating the whole collection.
The update needs to be periodic as you will have concurrency problems if you try to change the "ranking" order of multiple documents in individual updates. So you need to make sure this process does not overlap with another such update.
As a final note, consider your "score" calculation, as what you really want is the newest and "most starred" content at the top. The current calculation has some flaws there, such as entries on the same day with 0 "stars", but I'll leave that to you to work out.
This is essentially what you need to do for your solution. Trying to do this dynamically on a large collection using the aggregation framework is not going to produce favorable performance for your application experience. So there are few pointers here to things you can do to more efficiently maintain the order of your results.