First of all: I'm using Mongo 2.6 and Mongoose 3.8.8
I have the following Schema:
var Link = new Schema({
title: { type: String, trim: true },
owner: { id: { type: Schema.ObjectId }, name: { type: String } },
url: { type: String, default: '', trim: true},
stars: { users: [ { name: { type: String }, _id: {type: Schema.ObjectId} }] },
createdAt: { type: Date, default: Date.now }
});
And my collection already has 500k documents.
What I need is to sort the documents using a custom strategy. My initial solution was to use the aggregation framework.
var today = new Date();
//fx = (TodayDay * TodayYear) - ( DocumentCreatedDay * DocumentCreatedYear)
var relevance = { $subtract: [
{ $multiply: [ { $dayOfYear: today }, { $year: today } ] },
{ $multiply: [ { $dayOfYear: '$createdAt' }, { $year: '$createdAt' } ] }
]}
var projection = {
_id: 1,
url: 1,
title: 1,
createdAt: 1,
thumbnail: 1,
stars: { $size: '$stars.users' },
ranking: { $multiply: [ relevance, { $size: '$stars.users' } ] }
}
var sort = {
$sort: { ranking: 1, stars: 1 }
}
var page = 1;
var limit = { $limit: 40 }
var skip = { $skip: ( 40 * (page - 1) ) }
var project = { $project: projection }
Link.aggregate([project, sort, skip, limit]).exec(resultCallback);
It works nicely up to about 100k documents, but after that the query gets slower and slower.
How could I accomplish this ?
A redesign ?
Am I using the projection incorrectly ?
Thanks for your time !
You can do all of this as you update, and then you can actually index on "ranking" and use range queries to implement your paging. That is much better than $skip and $limit, which in any form is bad news for large data. You should be able to find many sources confirming that skip and limit are poor practice for paging.
The only catch here is that since you cannot use an .update() type of statement to refer to the existing value of another field, you have to be careful with concurrency issues on updates. This requires "rolling in" some custom lock handling, which you can do with the .findOneAndUpdate() method:
Link.findOneAndUpdate(
    { "_id": docId, "locked": false },
    { "$set": { "locked": true } },
    function(err,doc) {
        // doc is null when no unlocked document matched
        if ( doc && doc.locked ) {
            // then update your document
            // I would just use the epoch date difference per day
            var relevance = (
                ( Date.now() - ( Date.now() % (1000 * 60 * 60 * 24) ) )
                - ( doc.createdAt.valueOf() - ( doc.createdAt.valueOf() % (1000 * 60 * 60 * 24) ) )
            );
            var update = { "$set": { "locked": false } };
            if ( actionAdd ) {
                update["$push"] = { "stars.users": star };
                update["$set"]["score"] = relevance * ( doc.stars.users.length + 1 );
            } else {
                update["$pull"] = { "stars.users": star };
                update["$set"]["score"] = relevance * ( doc.stars.users.length - 1 );
            }
            // Then update and release the lock
            Link.findOneAndUpdate(
                { "_id": doc._id },
                update,
                function(err,newDoc) {
                    // possibly check that the new "locked" is false, but really
                    // that should be okay
                }
            );
        } else {
            // some mechanism to retry "n" times at interval
            // or report that you cannot update
        }
    }
)
The idea there is that you can only grab a document with a "locked" status equal to false in order to actually update, and the first "update" operation just sets that value to true so that no other operation could update the document until this completes.
As per the code comments, you probably want to have a few tries at doing this rather than just failing the update, as there could be another operation adding to or subtracting from the array.
Then depending on the "mode" of your current update if you are either adding to the array or taking an item off of there you simply alter the update statement to be issued to do either operation and set the appropriate "score" value in your document.
The update will then of course set the "locked" status back to false, and it makes sense to check that the current status is not still true, though it really should be okay at this point. This gives you some room to raise exceptions.
That manages the general update situation, but you still have a problem with sorting out your "ranking" order here, as skip and limit are still not what you want for performance. That is probably best handled by a periodic update of yet another field which you can use for a definitive "range" query, but you probably only really want to be concerned with the most "relevant" score range in a set range of pages, rather than updating the whole collection.
The update needs to be periodic as you will have concurrency problems if you try to change the "ranking" order of multiple documents in individual updates. So you need to make sure this process does not overlap with another such update.
As a final note, reconsider your "score" calculation, as what you really want is the newest and "most starred" content at the top. The current calculation has some flaws, such as a relevance of 0 for documents created the same day and a score of 0 for documents with no "stars", but I'll leave that to you to work out.
This is essentially what you need to do for your solution. Trying to do this dynamically on a large collection using the aggregation framework is not going to produce favorable performance for your application experience. So these are a few pointers to things you can do to more efficiently maintain the order of your results.
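For illustration only (none of the names below come from the question): once a "score" field is maintained as above and indexed, a "forward paging" query can use a range condition instead of $skip, remembering boundary values from the previous page. Ties on "score" need a tie-breaker, so the _id values already seen at that score are excluded:
var pageSize = 40;
// lastScore and seenIds are carried over from the last document(s) of the previous page
Link.find({
    "score": { "$lte": lastScore },
    "_id": { "$nin": seenIds }
})
.sort({ "score": -1 })
.limit(pageSize)
.exec(function(err, links) {
    // store the score of the final document and the _id values
    // sharing that score, ready for the next page request
});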
I have opened a related issue on GitHub, but maybe someone here will be able to help quicker.
Summary:
ValidationException: Query key condition not supported
I need to find records from the last (amount) seconds at a given location.
Pretty simple, but already related to other issues:
One and another one
WORKS:
Activity.query('locationId').eq(locationId).exec();
DOES NOT WORK:
Activity.query('locationId').eq(locationId).where('createdAt').ge(date).exec();
Code sample:
Schema
const Activity = (dynamoose: typeof dynamooseType) => dynamoose.model<ActivityType, {}>('Activity',
new Schema({
id: {
type: String,
default: () => {
return uuid();
},
hashKey: true,
},
userId: {
type: String,
},
locationId: {
type: String,
rangeKey: true,
index: {
global: true,
},
},
createdAt: { type: Number, rangeKey: true, required: true, default: Date.now },
action: {
type: Number,
},
}, {
expires: 60 * 60 * 24 * 30 * 3, // activity logs to expire after 3 months
}));
Code which executes the function
The funny part is that I found this proposed as a workaround to use until they merge the PR giving the ability to specify timestamps as keys, but unfortunately it does not work.
async getActivitiesInLastSeconds(locationId: string, timeoutInSeconds: number) {
const Activity = schema.Activity(this.dynamoose);
const date = moment().subtract(timeoutInSeconds, 'seconds').valueOf();
return await Activity.query('locationId').eq(locationId)
.where('createdAt').ge(date).exec();
}
I suspect createdAt is not a range key of your table / index. You need to either do .filter('createdAt').ge(date) or modify your table / index schema.
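For example, the filter variant looks like this (a sketch; .filter() is applied after the key condition, so DynamoDB still reads the non-matching items before discarding them):
Activity.query('locationId').eq(locationId)
    .filter('createdAt').ge(date)
    .exec();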
I'm pretty sure the problem is that by specifying rangeKey: true on the createdAt property you are telling it to be used as the table's own range key (I don't think that is the correct term), so that range key gets linked to the id property rather than to your global index.
I believe the easiest solution would be to change your locationId index to be something like the following:
index: {
global: true,
rangeKey: 'createdAt',
},
That way you are being very explicit about which index you want to set createdAt as the rangeKey for.
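Put together, a sketch of the two affected properties (note that createdAt no longer declares rangeKey: true for the table itself):
locationId: {
    type: String,
    index: {
        global: true,
        rangeKey: 'createdAt',
    },
},
createdAt: { type: Number, required: true, default: Date.now },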
After making that change please remember to sync your changes with either your local DynamoDB server or the actual DynamoDB service, so that the changes get populated both in your code and on the database system.
Hopefully this helps! If it doesn't fix your problem please feel free to comment and I'll help you further.
Sorry if this is pretty basic, but I'm a mongodb newbie and haven't been able to find an answer to this:
Let's say I'm doing the following:
db.collection("bugs").updateOne({ _id: searchId }, { $set: { "fixed": true }}
How can I set "fixed" to the opposite of whatever its last value was, without any additional queries? Something like { $set: { "fixed": !fixed } }
It is not really possible to achieve this in MongoDB in just one operation while sticking to the idea of storing boolean values (it might be possible in future versions of MongoDB). But there is a workaround: store bits (0 or 1) to represent true and false instead of booleans, and perform a bitwise xor operation on them in MongoDB as follows:
db.collection("bugs").updateOne(
{
_id: searchId
},
{
$bit : {
fixed: {
xor: NumberInt(1)
}
}
}
)
Please note that you also have to store 0 as NumberInt(0) to represent false and 1 as NumberInt(1) to represent true in the fixed property, as MongoDB by default treats all numbers as floating-point values.
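(For completeness: MongoDB 4.2 and later did add a way to toggle a real boolean in a single operation by passing an aggregation pipeline as the update document. A sketch, applicable only if upgrading is an option:)
db.collection("bugs").updateOne(
    { _id: searchId },
    [ { $set: { fixed: { $not: "$fixed" } } } ] // pipeline-style update, MongoDB 4.2+
)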
This is not possible in MongoDB. You have to retrieve the doc from the db and then update it:
var doc = db.collection("bugs").findOne({ _id: searchId });
db.collection("bugs").updateOne({ _id: searchId }, { $set: { "fixed": !doc.fixed } }
Yes, it's possible to do that with MongoDB, of course!! ... just use find's forEach feature like this:
db.collection("bugs").find({ _id: searchId }).forEach(function(bugDoc) {
db.collection("bugs").updateOne({ _id: searchId }, { $set: { "fixed": !bugDoc.fixed }});
});
NOTE: bugDoc contains all the fields of the original document, and you can make all the calculations and changes you want within this two-step update.
I have a .tsv file with some order information. After processing it in my script I got this:
[{"order":"5974842dfb458819244adbf7","name":"Сергей Климов","email":"wordkontent#gmail.com"},
{"order":"5974842dfb458819244adbf8","name":"Сушков А.В.","email":"mail#wwwcenter.ru"},
{"order":"5974842dfb458819244adbf9","name":"Виталий","email":"wawe2012#mail.ru"},
... and so on
I have a schema in Mongoose:
var ClientSchema = mongoose.Schema({
name:{
type: String
},
email:{
type: String,
unique : true,
required: true,
index: true
},
forums:{
type: String
},
other:{
type: String
},
status:{
type: Number,
default: 3
},
subscribed:{
type: Boolean,
default: true
},
clienturl:{
type: String
},
orders:{
type: [String]
}
});
clienturl is an 8-character password generated by a function.
module.exports.arrayClientSave = function(clientsArray,callback){
let newClientsArray = clientsArray
.map(function(x) {
var randomstring = Math.random().toString(36).slice(-8);
x.clienturl = randomstring;
return x;
});
console.log(newClientsArray);
Client.update( ??? , callback );
}
But I don't understand how to make the update. If the email already exists, push to the orders array without rewriting all the other fields; but if the email does not exist, save a new user with the clienturl and so on. Thanks!
Probably the best way to handle this is via .bulkWrite(), which is a MongoDB method for sending "multiple operations" in a "single" request with a "single" response. This removes the need to control an async request and response for each "looped" item.
module.exports.arrayClientSave = function(clientsArray,callback){
let newClientsArray = clientsArray
.map(x => {
var randomstring = Math.random().toString(36).slice(-8);
x.clienturl = randomstring;
return x;
});
console.log(newClientsArray);
let ops = newClientsArray.map( x => (
{ "updateOne": {
"filter": { "email": x.email },
"update": {
"$addToSet": { "orders": x.order },
"$setOnInsert": {
"name": x.name,
"clientUrl": x.clienturl
}
},
"upsert": true
}}
));
Client.bulkWrite(ops,callback);
};
The main idea there being that you use the "upsert" functionality of MongoDB to drive the "creation or update" functionality. Where the $addToSet only appends the "orders" property information to the array where not already present, and the $setOnInsert actually only takes effect when the action is actually an "upsert" and not applied when the action matches an existing document.
Also, by applying this within .bulkWrite(), this becomes a "single async call" when talking to a MongoDB server that supports it, which is any version greater than or equal to MongoDB 2.6.
However the main point of the specific .bulkWrite() API, is that the API itself will "detect" if the server connected to actually supports "Bulk" operations. When it does not, this "downgrades" to individual "async" calls instead of one batch. But this is controlled by the "driver", and it will still interact with your code as if it were actually one request and response.
This means all the difficulty of dealing with the "async loop" is actually handled in the driver software itself. Being either negated by the supported method, or "emulated" in a way that makes it simple for your code to just use.
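If you want to confirm what the batch actually did, the BulkWriteResult passed to the callback summarizes it; a sketch:
Client.bulkWrite(ops, function(err, result) {
    if (err) return callback(err);
    // counts reported by the BulkWriteResult
    console.log("matched: %d, modified: %d, upserted: %d",
        result.matchedCount, result.modifiedCount, result.upsertedCount);
    callback(null, result);
});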
I've stumbled upon some very strange behavior in MongoDB. For my test case, I have a MongoDB collection with 9 documents. All documents have the exact same structure, including the fields expired_at: Date and location: [lng, lat].
I now need to find all documents that are not expired yet and are within a bounding box; I show the matched documents on a map. For this I set up the following queries:
var qExpiry = {"expired_at": { $gt : new Date() } };
var qLocation = { "location" : { $geoWithin : { $box : [ [ 123.8766, 8.3269 ] , [ 122.8122, 8.24974 ] ] } } };
var qFull = { $and: [ qExpiry, qLocation ] };
Since the expiry dates are all in the future, and when I set the bounding box large enough, the following queries give me all 9 documents as expected:
db.docs.find(qExpiry);
db.docs.find(qLocation);
db.docs.find(qFull);
db.docs.find(qExpiry).sort({"created_at" : -1});
db.docs.find(qLocation).sort({"created_at" : -1});
Now here's the deal: The following query returns 0 documents:
db.docs.find(qFull).sort({"created_at" : -1});
Just adding sort to the AND query ruins the result (please note that I want to sort since I also have a limit in order to avoid cluttering the map on larger scales). Sorting by other fields yields the same empty result. What's going on here?
(Actually it's even stranger: when I zoom into my map, I sometimes get results for qFull, even with sorting. One could argue that qLocation is faulty, but when I only use qLocation the results are always correct. And qExpiry is always true for all documents anyway.)
You may want to try running the same query using the aggregation framework's $match and $sort pipeline stages:
db.docs.aggregate([
{ "$match": qFull },
{ "$sort": { "created_at": -1 } }
]);
or implicitly using $and by specifying a comma-separated list of expressions as in
db.docs.aggregate([
{
"$match": {
"expired_at": { "$gt" : new Date() },
"location" : {
"$geoWithin" : {
"$box" : [
[ 123.8766, 8.3269 ],
[ 122.8122, 8.24974 ]
]
}
}
}
},
{ "$sort": { "created_at": -1 } }
]);
Not really sure why that fails with find()
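One way to dig further is to compare the winning query plans with and without the sort, since adding a sort can steer the planner onto a different index; for example (explain verbosity modes as of MongoDB 3.0):
// Compare the plans chosen with and without the sort
db.docs.find(qFull).explain("executionStats");
db.docs.find(qFull).sort({ "created_at": -1 }).explain("executionStats");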
chridam's suggestion to use the aggregation framework of MongoDB proved to be the way to go. My working query now looks like this:
db.docs.aggregate(
[
{ $match : { $and : [qExpiry, qLocation]} },
{ $sort: {"created_at": -1} }.
{ $limit: 50 }.
]
);
Nevertheless, if anyone can point out why my first approach did not work, that would be very useful. Simply adding sort() to a non-empty query shouldn't suddenly return 0 documents. Just to add, since I kept trying for a bit: .sort({}) returns all documents but is not very useful, while everything else failed, including .sort({'_id': 1}).
I'm new to MongoDB and Mongoose and I'm trying to use it to save stock ticks for daytrading analysis. So I imagined this Schema:
symbolSchema = Schema({
name:String,
code:String
});
quoteSchema = Schema({
date:{type:Date, default: Date.now},
open:Number,
high:Number,
low:Number,
close:Number,
volume:Number
});
intradayQuotesSchema = Schema({
id_symbol:{type:Schema.Types.ObjectId, ref:"symbol"},
day:Date,
quotes:[quoteSchema]
});
From my link I receive information like this every minute:
date | symbol | open | high | low | close | volume
2015-03-09 13:23:00|AAPL|127,14|127,17|127,12|127,15|19734
I have to:
Find the ObjectId of the symbol (AAPL).
Discover if the intradayQuote document of this symbol already exists (symbol and date combination)
Discover if the minute OHLCV data of this symbol exists on the quotes array (because it could be repeated)
Update or create the document and update or create the quotes inside the array
I'm able to accomplish this task without verifying whether the quotes already exist, but this method can create repeated entries inside the quotes array:
symbol.find({"code":mySymbol}, function(err, stock) {
intradayQuote.findOneAndUpdate({
{ id_symbol:stock[0]._id, day: myDay },
{ $push: { quotes: myQuotes } },
{ upsert: true },
myCallback
});
});
I already tried:
$addToSet instead of $push, but unfortunately this doesn't seem to work with arrays of documents
{ id_symbol: stock[0]._id, day: myDay, 'quotes.date': myDate } in the conditions of findOneAndUpdate; but unfortunately if mongo doesn't find it, it creates a new document for the minute instead of appending to the quotes array.
Is there a way to get this working without using one more query (I'm already using 2)? Should I rethink my Schema to facilitate this job? Any help will be appreciated. Thanks!
Basically put, an $addToSet operator cannot work for you here because a "set" is by definition a collection of "completely distinct" objects, and your data is not a true set in that sense.
The other piece of logical sense here is that you would be working on the data as it arrives, either as a single object or a feed. I'll presume it's a feed of many items in some form and that you can use some sort of stream processor to arrive at this structure per document received:
{
"date": new Date("2015-03-09 13:23:00.000Z"),
"symbol": "AAPL",
"open": 127.14
"high": 127.17,
"low": 127.12
"close": 127.15,
"volume": 19734
}
Converting to a standard decimal format as well as a UTC date, since any locale settings really should be the domain of your application once data is retrieved from the datastore, of course.
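Purely as an illustration, a feed line in the format shown in the question could be normalized into that shape along these lines (assuming pipe delimiters and comma decimals as above):
function parseTick(line) {
    var parts = line.split("|");
    var num = function(s) { return parseFloat(s.replace(",", ".")); };
    return {
        "date": new Date(parts[0].replace(" ", "T") + "Z"), // treat the timestamp as UTC
        "symbol": parts[1],
        "open": num(parts[2]),
        "high": num(parts[3]),
        "low": num(parts[4]),
        "close": num(parts[5]),
        "volume": parseInt(parts[6], 10)
    };
}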
I would also at least flatten out your intradayQuotesSchema a little by removing the reference to the other collection and just putting the data in there. You would still need a lookup on insertion, but the overhead of the additional populate on read would seem to be more costly than the storage overhead:
intradayQuotesSchema = Schema({
symbol:{
name: String,
code: String
},
day:Date,
quotes:[quoteSchema]
});
It depends on your usage patterns, but it's likely to be more effective that way.
The rest really comes down to what is acceptable to your processing flow. The idea is to cascade through three updates as each item arrives: create the "day" document if it does not yet exist, then either overwrite the quote already recorded for that minute or append a new one:
stream.on("data", function(data) {
    var symbolCode = data.symbol,
        myDay = new Date(
            data.date.valueOf() -
            ( data.date.valueOf() % (1000 * 60 * 60 * 24) ));
    delete data.symbol;

    symbol.findOne({ "code": symbolCode }, function(err,stock) {
        intraDayQuote.findOneAndUpdate(
            { "symbol.code": symbolCode, "day": myDay },
            { "$setOnInsert": {
                "symbol.name": stock.name,
                "quotes": [data]
            }},
            { "upsert": true },
            function(err,doc) {
                intraDayQuote.findOneAndUpdate(
                    {
                        "symbol.code": symbolCode,
                        "day": myDay,
                        "quotes.date": data.date
                    },
                    { "$set": { "quotes.$": data } },
                    function(err,doc) {
                        intraDayQuote.findOneAndUpdate(
                            {
                                "symbol.code": symbolCode,
                                "day": myDay,
                                "quotes.date": { "$ne": data.date }
                            },
                            { "$push": { "quotes": data } },
                            function(err,doc) {
                                // all three cases are now handled
                            }
                        );
                    }
                );
            }
        );
    });
});
If you don't actually need the modified document in the response then you would get some benefit by implementing the Bulk Operations API here and sending all updates in this package within a single database request:
stream.on("data",function(data) {
var symbol = data.symbol,
myDay = new Date(
data.date.valueOf() -
( data.date.valueOf() % 1000 * 60 * 60 * 24 ));
delete data.symbol;
symbol.findOne({ "code": symbol },function(err,stock) {
var bulk = intraDayQuote.collection.initializeOrderedBulkOp();
bulk.find({ "symbol.code": symbol , "day": myDay })
.upsert().updateOne({
"$setOnInsert": {
"symbol.name": stock.name
"quotes": [data]
}
});
bulk.find({
"symbol.code": symbol,
"day": myDay,
"quotes.date": data.date
}).updateOne({
"$set": { "quotes.$": data }
});
bulk.find({
"symbol.code": symbol,
"day": myDay,
"quotes.date": { "$ne": data.date }
}).updateOne({
"$push": { "quotes": data }
});
bulk.execute(function(err,result) {
// maybe do something with the response
});
});
});
The point is that only one of the statements there will actually modify data, and since this is all sent in the same request there is less back and forth between the application and server.
The alternate case is that it might just be simpler to have the actual data referenced in another collection. This then just becomes a simple matter of processing upserts:
intradayQuotesSchema = Schema({
symbol:{
name: String,
code: String
},
day:Date,
quotes:[{ type: Schema.Types.ObjectId, ref: "quote" }]
});
// and in the stream processor
stream.on("data",function(data) {
var symbol = data.symbol,
myDay = new Date(
data.date.valueOf() -
( data.date.valueOf() % 1000 * 60 * 60 * 24 ));
delete data.symbol;
symbol.findOne({ "code": symbol },function(err,stock) {
quote.update(
{ "date": data.date },
{ "$setOnInsert": data },
{ "upsert": true },
function(err,num,raw) {
if ( !raw.updatedExisting ) {
intraDayQuote.update(
{ "symbol.code": symbol , "day": myDay },
{
"$setOnInsert": {
"symbol.name": stock.name
},
"$addToSet": { "quotes": data }
},
{ "upsert": true },
function(err,num,raw) {
}
);
}
}
);
});
});
It really comes down to how important it is to you to have the data for quotes nested within the "day" document. The main distinction is whether you want to query those documents based on some of those "quote" fields, or otherwise live with the overhead of using .populate() to pull in the "quotes" from the other collection.
Of course if referenced and the quote data is important to your query filtering, then you can always just query that collection for the _id values that match and use an $in query on the "day" documents to only match days that contain those matched "quote" documents.
Which path you take is a big decision, since what matters most is how your application uses the data. Hopefully this should guide you on the general concepts behind doing what you want to achieve.
P.S. Unless you are "sure" that your source data is always a date rounded to an exact "minute", you probably want to employ the same kind of date-rounding math as used to get the discrete "day" as well.
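A sketch of that rounding, mirroring the "day" math above:
// Round a date down to the start of its minute
function roundToMinute(date) {
    return new Date( date.valueOf() - ( date.valueOf() % (1000 * 60) ) );
}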