I was wondering how to implement lazy loading/more data on scroll with mongoose. I want to load 10 posts at a time, but I'm not sure how to best load the next 10 elements in a query.
I currently have:
var q = Post.find().sort("rating").limit(10);
to load the 10 posts with the highest "rating". How do I go about doing this for the next 10 posts?
The general concept of "paging" is to use .skip(), which "skips" over the results that have already been retrieved, so you can do this:
var q = Post.find().sort( "rating" ).skip(10).limit(10);
But as you can imagine, this slows down considerably once you get a few "pages" in, because the server still has to walk past every skipped result. So you really want something smarter. Essentially this is a "range query" where you grab results higher (or lower, if descending) than the last set of results retrieved. So given a last value of 5, for "greater than" you would do:
var q = Post.find({ "rating": { "$gt": 5 } }).sort( "rating" ).limit(10);
This looks okay, but there is still a problem. What if the next "page" also contained results with a rating of 5? This query would skip over those and never display them.
The smart thing to do is "keep" all of the _id values from the document since they are unique keys. Basically apply the same sort of thing, except this time you make sure you are not including the results from the previous page in your new one. The $nin operator helps here:
var q = Post.find({ "rating": { "$gte": 5 }, "_id": { "$nin": seenIds } })
.sort( "rating" ).limit(10);
Whether the seenIds is just the last page of results or some more depends on the "density" of the value you are sorting on, and of course you need to "keep" these in a session variable or something.
But try to adapt this, as range queries are usually your best performance result.
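The difference between the plain range query and the $gte plus $nin variant can be checked without a database. The following is a minimal in-memory sketch, not Mongoose code: nextPage is a hypothetical helper that mimics the filter, sort and limit on a plain array, and the sample posts are made up.

```javascript
// Hypothetical in-memory stand-in for:
//   Post.find({ rating: { $gte: lastRating }, _id: { $nin: seenIds } })
//       .sort("rating").limit(pageSize)
function nextPage(posts, lastRating, seenIds, pageSize) {
  const seen = new Set(seenIds);
  return posts
    .filter((p) => p.rating >= lastRating && !seen.has(p._id))
    .sort((a, b) => a.rating - b.rating)
    .slice(0, pageSize);
}

// Made-up data: three posts share the rating 5.
const posts = [
  { _id: "a", rating: 3 },
  { _id: "b", rating: 5 },
  { _id: "c", rating: 5 },
  { _id: "d", rating: 5 },
  { _id: "e", rating: 7 },
];

// Page through the whole set two posts at a time.
const pages = [];
let lastRating = -Infinity;
let seenIds = [];
for (;;) {
  const page = nextPage(posts, lastRating, seenIds, 2);
  if (page.length === 0) break;
  pages.push(page.map((p) => p._id));
  lastRating = page[page.length - 1].rating;
  // Keeping every id seen so far is the simplest (not the cheapest) choice.
  seenIds = seenIds.concat(page.map((p) => p._id));
}
console.log(pages);
```

With a strict "greater than" on the rating, posts "c" and "d" would never be returned after the first page; here every post shows up exactly once.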
How would I go about filtering a set of records based on their child records.
Let's say I have a collection Item that has a field to another collection Bag called bagId. I'd like to find all Items where a field on Bags matches some clause.
I.e. db.Items.find( { "where bag.type:'Paper' " }) . How would I go about doing this in MongoDB. I understand I'd have to join on Bags and then link where Item.bagId == Bag._id
I used Studio3T to convert a SQL GROUP BY to a Mongo aggregate. I'm just wondering if there's any de facto way to do this.
Should I perform a data migration to simply include Bag.type on every Item document? (I don't want to get into the habit of making schema changes every time I want to sort or filter Items by Bag fields.)
Use something like https://github.com/meteorhacks/meteor-aggregate (No luck with that syntax yet)
Grapher https://github.com/cult-of-coders/grapher I played around with this briefly and while it's cool I'm not sure if it'll actually solve my problem. I can use it to add Bag.type to every Item returned, but I don't see how that could help me filter every item by Bag.type.
Is this just one of the tradeoffs of using a NoSQL dbms? What option above is recommended or are there any other ideas?
Thanks
You could use the $in functionality of MongoDB. It would look something like this:
const bagsIds = Bags.find({type: 'paper'}, {fields: {"_id": 1}}).map(function(bag) { return bag._id; });
const items = Items.find( { bagId: { $in: bagsIds } } ).fetch();
You would need to test whether this solution stays reactive in the way you expect and whether it still performs acceptably on larger collections; if not, your first option (performing the migration) may be the better choice.
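If running an aggregation is an option, the join can also be pushed to the server with $lookup. The sketch below only builds the pipeline. The collection name "bags" and the helper name itemsByBagTypePipeline are assumptions, and in Meteor the pipeline would have to go through something like Items.rawCollection().aggregate(...), which is not reactive.

```javascript
// Builds an aggregation pipeline that keeps only Items whose Bag has the
// given type. "bags" is an assumed collection name; adjust to your schema.
function itemsByBagTypePipeline(bagType) {
  return [
    // Join each item with the bag whose _id equals the item's bagId.
    { $lookup: { from: "bags", localField: "bagId", foreignField: "_id", as: "bag" } },
    // "bag" is an array; matching on "bag.type" keeps items with a matching bag.
    { $match: { "bag.type": bagType } },
    // Drop the joined array again so the result looks like a plain Item.
    { $project: { bag: 0 } },
  ];
}

const pipeline = itemsByBagTypePipeline("Paper");
console.log(JSON.stringify(pipeline, null, 2));
```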
I am not lost when dealing with databases but also not an expert.
I want to implement infinite scroll on my website, which means data needs to be in order, either by date_created or id descending. My initial thought was to use LIMIT and OFFSET in a query like this (using SQLAlchemy):
session.query(Posts).filter(Posts.owner_id == _userid_).filter(Posts.id < post_id).order_by(desc(Posts.id)).limit(5).all()
which translates to something like this:
SELECT * from posts WHERE owner_id = _userid_ AND id < _post_id_ ORDER BY id DESC LIMIT 10 OFFSET _somevalue_;
and in my js:
var minimum_post_id = 0;
var posts_list = [];
var post_ids = [];
function infinite_load(_userid_, _post_id_) {
fetch('/users/' + _userid_ + '/posts/' + _post_id_)
.then(r => r.json())
.then(data => {
console.log(data);
data.posts.forEach(post => { posts_list.push(post); post_ids.push(post.id) });
minimum_post_id = Math.min(...post_ids);
})
}
infinite_load(1, minimum_post_id) // random user id
However, I was researching whether this is efficient and came across this: https://www.eversql.com/faster-pagination-in-mysql-why-order-by-with-limit-and-offset-is-slow/
Basically it says that LIMIT with OFFSET is bad because the database still has to count through all of the offset records, only to throw them away.
So my question is: is my implementation inadequate? How do I efficiently query a database sequentially?
Pagination -- done correctly -- has a few more barbs than a simple "What id range did we show last page? Add 10 to limit and offset." Some quick questions to whet your appetite, then a suggestion:
While a user is looking at items positioned 11 through 20, a record is inserted at position 15. What is returned to the user upon clicking the 'Next' pagination button?
Conversely, while a user is looking at records positioned 101 through 110, 10 arbitrary records below position 100 are removed. What does the user get after a 'Next' pagination click? Or a 'Previous' pagination click?
Depending on your data model, schema, and UI requirements, these can be simple or really difficult to answer.
Now, to why LIMIT/OFFSET is the wrong way to do it ... It's not, actually, provided you have a small enough dataset -- and that can be plenty large for most sites. In other words, pick what works for your setup.
Meanwhile, for the pedagogically minded under the "really large" data set assumption: it's the OFFSET that is the killer part of that query (as it requires the results to be tallied, sorted, counted, then skipped before the LIMIT can kick in). So, how can we remove the OFFSET? Incorporate it into the WHERE clause of your query.
Your query orders by ID, then offsets by some number. Remove the offset, by ensuring that the ID is greater (or less) than what the current screen shows for the user:
SELECT * FROM posts
WHERE
owner_id = _userid_
AND id < _last_displayed_id
ORDER BY id DESC
LIMIT 10;
Similarly, if you're ordering by time, then, make your pagination button (or scroll handler) request new records after/before the last item already presented to the user.
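The first question above (a record appearing while the user is reading) can be played through on a plain array. This is an in-memory sketch with made-up ids, not database code: offsetPage mimics ORDER BY id DESC LIMIT n OFFSET m and keysetPage mimics WHERE id < last ORDER BY id DESC LIMIT n. Both helper names are hypothetical.

```javascript
// ORDER BY id DESC LIMIT limit OFFSET offset
function offsetPage(ids, limit, offset) {
  return [...ids].sort((a, b) => b - a).slice(offset, offset + limit);
}

// WHERE id < lastId ORDER BY id DESC LIMIT limit
function keysetPage(ids, limit, lastId) {
  return ids.filter((id) => id < lastId).sort((a, b) => b - a).slice(0, limit);
}

const ids = [1, 2, 3, 4, 5, 6];

// Both strategies return the same first page.
const firstPage = offsetPage(ids, 3, 0); // [6, 5, 4]

// A new post arrives before the user scrolls further.
ids.push(7);

const secondByOffset = offsetPage(ids, 3, 3);
const secondByKeyset = keysetPage(ids, 3, firstPage[firstPage.length - 1]);

console.log(secondByOffset); // [4, 3, 2]: post 4 is shown twice
console.log(secondByKeyset); // [3, 2, 1]: continues where page one ended
```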
My project inserts a large number of historical data documents into a history collection. Since they are never modified, the order is correct (as no updating goes on), but it is backwards for retrieval.
I'm using something like this to retrieve the data in pages of 20 records.
var page = 2;
hStats_collection.find({name:name},{_id:0, name:0}).limit(20).skip(page*20).toArray( function(err, doc) {
});
After reading my eyes dry, $sort is no good, and it appears the best way to reverse the order in which the above code retrieves documents is to add an index on a timestamp element in the document. I already have a useful field called started (seconds since epoch) and need to create an index for it.
http://docs.mongodb.org/manual/tutorial/create-an-index/
The docs say I can do something like:
hStats_collection.createIndex( { started: -1 } )
How do I change my .find() code above to find name and use the started index so the results come back newest to oldest (as opposed to the natural find() order of oldest to newest)?
hStats_collection.find({name:name},{_id:0, name:0}).sort({ started:-1 }).limit(20)
I am one of many SQL users who probably have a hard time transitioning to the NoSQL world, and I have a scenario, where I have tonnes of entries in my database, but I would only like to get the most recent ones, which is easy, but then it should all be for the same user. I'm sure it's simple, but after loads of trial and error without a good solution, I'm asking you for help!
So, my keys look like this.. (because I'm thinking that's the way to go!)
emit([doc.eventTime, doc.userId], doc);
My question then is, how would I go about only getting the 10 last results from CouchDB? For that one specific user. The reason why I include the time as key, is because I think that's the simplest way to sort the results descending, as I want the ten last actions, for example.
If I had to do it in SQL I'd do this, to give you an exact example.
SELECT * FROM table WHERE userId = ID ORDER BY eventTime DESC LIMIT 10
I hope someone out there can help :-)
Change your key to:
emit([doc.userId, doc.eventTime], null);
Query with:
view?descending=true&startkey=[<user id>,{}]&endkey=[<user id>]&limit=10
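The keys in that query must be valid JSON and should be URL-encoded. Below is a small sketch of building the request path, assuming a hypothetical design document "events" with a view "by_user_time" that uses the map function above; latestEventsPath is likewise a made-up helper name.

```javascript
// Builds the view path for "the newest `limit` rows of one user".
// With descending=true the range runs from high to low, so startkey is the
// upper bound [userId, {}] (in CouchDB's collation an object sorts after
// any string or number) and endkey is the lower bound [userId].
function latestEventsPath(db, userId, limit) {
  const params = new URLSearchParams({
    descending: "true",
    startkey: JSON.stringify([userId, {}]),
    endkey: JSON.stringify([userId]),
    limit: String(limit),
  });
  // Design document and view names are assumptions for this sketch.
  return `/${db}/_design/events/_view/by_user_time?${params}`;
}

const path = latestEventsPath("mydb", "123", 10);
console.log(path);
```

Because the view emits null as its value, add include_docs=true to the query if the full documents are needed.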
So add something like this to a view...
"test": {
"map": "function(doc) { key = doc.userId; value = {'time': doc.eventTime, 'userid': doc.userId}; emit(key, value)}"
}
And then call the view (assuming userId = "123"):
http://192.168.xxx.xxx:5984/dbname/_design/docname/_view/test?key="123"&limit=10
You will need to add some logic to the map to get the most recent, as I don't believe order is preserved in any manner.
I want to implement pagination on top of a MongoDB. For my range query, I thought about using ObjectIDs:
db.tweets.find({ _id: { $lt: maxID } }, { limit: 50 })
However, according to the docs, the structure of the ObjectID means that "ObjectId values do not represent a strict insertion order":
The relationship between the order of ObjectId values and generation time is not strict within a single second. If multiple systems, or multiple processes or threads on a single system generate values, within a single second; ObjectId values do not represent a strict insertion order. Clock skew between clients can also result in non-strict ordering even for values, because client drivers generate ObjectId values, not the mongod process.
I then thought about querying with a timestamp:
db.tweets.find({ created: { $lt: maxDate } }, { limit: 50 })
However, there is no guarantee the date will be unique — it's quite likely that two documents could be created within the same second. This means documents could be missed when paging.
Is there any sort of ranged query that would provide me with more stability?
It is perfectly fine to use ObjectId() though your syntax for pagination is wrong. You want:
db.tweets.find().limit(50).sort({"_id":-1});
This says you want tweets sorted by _id value in descending order and you want the most recent 50. Your problem is the fact that pagination is tricky when the current result set is changing, so rather than using skip for the next page, you want to make note of the smallest _id in the result set (the 50th most recent _id value) and then get the next page with:
db.tweets.find( {_id : { "$lt" : <50th _id> } } ).limit(50).sort({"_id":-1});
This will give you the next "most recent" tweets, without new incoming tweets messing up your pagination back through time.
There is absolutely no need to worry about whether _id value is strictly corresponding to insertion order - it will be 99.999% close enough, and no one actually cares on the sub-second level which tweet came first - you might even notice Twitter frequently displays tweets out of order, it's just not that critical.
If it is critical, then you would have to use the same technique but with "tweet date" where that date would have to be a timestamp, rather than just a date.
Wouldn't a tweet's "actual" timestamp (i.e. time tweeted and the criteria you want it sorted by) be different from a tweet's "insertion" timestamp (i.e. time added to local collection)? This depends on your application, of course, but it's a likely scenario that tweet inserts could be batched or otherwise end up being inserted in the "wrong" order. So, unless you work at Twitter (and have access to collections inserted in correct order), you wouldn't be able to rely just on $natural or ObjectID for sorting logic.
Mongo docs suggest skip and limit for paging:
db.tweets.find({created: {$lt: maxID}}).
sort({created: -1, username: 1}).
skip(50).limit(50); //second page
There is, however, a performance concern when using skip:
The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to get the offset or skip position before beginning to return result. As offset increases, cursor.skip() will become slower and more CPU intensive.
This happens because skip does not fit into the MapReduce model and is not an operation that would scale well, you have to wait for a sorted collection to become available before it can be "sliced". Now limit(n) sounds like an equally poor method as it applies a similar constraint "from the other end"; however with sorting applied, the engine is able to somewhat optimize the process by only keeping in memory n elements per shard as it traverses the collection.
An alternative is to use range based paging. After retrieving the first page of tweets, you know what the created value is for the last tweet, so all you have to do is substitute the original maxID with this new value:
db.tweets.find({created: {$lt: lastTweetOnCurrentPageCreated}}).
sort({created: -1, username: 1}).
limit(50); //next page
Performing a find condition like this can be easily parallelized. But how to deal with pages other than the next one? You don't know the begin date for pages number 5, 10, 20, or even the previous page! @SergioTulentsev suggests creative chaining of methods, but I would advocate pre-calculating first-last ranges of the aggregate field in a separate pages collection; these could be re-calculated on update. Furthermore, if you're not happy with DateTime (note the performance remarks) or are concerned about duplicate values, you should consider compound indexes on timestamp + account tie (since a user can't tweet twice at the same time), or even an artificial aggregate of the two:
db.pages.
find({pagenum: 3})
> {pagenum: 3, begin: "01-01-2014#BillGates", end: "03-01-2014#big_ben_clock"}
db.tweets.
find({_sortdate: {$lt: "03-01-2014#big_ben_clock", $gt: "01-01-2014#BillGates"}}).
sort({_sortdate: -1}).
limit(50) //third page
Using an aggregate field for sorting will work "on the fold" (although perhaps there are more kosher ways to deal with the condition). This could be set up as a unique index with values corrected at insert time, with a single tweet document looking like
{
_id: ...,
created: ..., //to be used in markup
user: ..., //also to be used in markup
_sortdate: "01-01-2014#BillGates" //sorting only, use date AND time
}
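One way to build such a combined sort key is sketched below. It is an assumption of this sketch, not part of the answer above, that the timestamp is written in ISO 8601 (UTC) so that plain string comparison agrees with chronological order; makeSortKey is a hypothetical helper and the dates are made up.

```javascript
// Combines creation time and username into one string that sorts
// chronologically and is unique as long as a user cannot post twice
// in the same millisecond.
function makeSortKey(created, username) {
  // toISOString() yields e.g. "2014-01-03T10:15:00.000Z" (fixed width, UTC).
  return `${created.toISOString()}#${username}`;
}

const keys = [
  makeSortKey(new Date(Date.UTC(2014, 0, 3, 10, 15)), "big_ben_clock"),
  makeSortKey(new Date(Date.UTC(2014, 0, 1, 9, 0)), "BillGates"),
  makeSortKey(new Date(Date.UTC(2014, 0, 1, 9, 0)), "AdaLovelace"),
];

// Newest first; ties on the timestamp are broken by username.
keys.sort().reverse();
console.log(keys);
```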
The following approach will work even if multiple documents are inserted or updated in the same millisecond, even from multiple clients (each of which generates its own ObjectIds). For simplicity, the following queries project only _id and lastModifiedDate.
First page: fetch the results sorted by lastModifiedDate (descending), then ObjectId (ascending).
db.product.find({},{"_id":1,"lastModifiedDate":1}).sort({"lastModifiedDate":-1, "_id":1}).limit(2)
Note down the ObjectId and lastModifiedDate of the last record fetched in this page. (loid, lmd)
For the second page, add a query condition that matches (lastModifiedDate = lmd AND _id > loid) OR (lastModifiedDate < lmd):
db.product.find({$or:[{"lastModifiedDate":{$lt:lmd}},{$and:[{"lastModifiedDate":lmd},{"_id":{$gt:loid}}]}]},{"_id":1,"lastModifiedDate":1}).sort({"lastModifiedDate":-1, "_id":1}).limit(2)
Repeat the same for subsequent pages.
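The second-page condition can be wrapped in a small helper that turns the last record of a page into the filter for the next one. This is a sketch under the sort order used above (lastModifiedDate descending, _id ascending); nextPageFilter is a hypothetical name, and the values passed in are placeholders (a real _id would be an ObjectId rather than a string).

```javascript
// Returns the filter for the page that follows the record (lmd, loid),
// given .sort({ lastModifiedDate: -1, _id: 1 }).
function nextPageFilter(lmd, loid) {
  return {
    $or: [
      // Strictly older documents come later in the descending sort.
      { lastModifiedDate: { $lt: lmd } },
      // Same timestamp: continue with the larger _id values.
      { lastModifiedDate: lmd, _id: { $gt: loid } },
    ],
  };
}

// Placeholder values; in practice they come from the last document
// of the previous page.
const filter = nextPageFilter(new Date("2020-01-01T00:00:00Z"), "5e0be100aaaaaaaaaaaaaaaa");
console.log(JSON.stringify(filter));
// Usage (mongo shell):
//   db.product.find(filter).sort({ lastModifiedDate: -1, _id: 1 }).limit(2)
```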
ObjectIds should be good enough for pagination if you limit your queries to the previous second (or don't care about the subsecond possibility of weirdness). If that is not good enough for your needs then you will need to implement an ID generation system that works like an auto-increment.
Update:
To query the previous second of ObjectIds you will need to construct an ObjectID manually.
See the specification of ObjectId http://docs.mongodb.org/manual/reference/object-id/
Try using this expression to do it from a mongos.
{ _id :
{
$lt : ObjectId(Math.floor((new Date).getTime()/1000 - 1).toString(16)+"ffffffffffffffff")
}
}
The f's at the end max out the remaining bits that are not part of the timestamp, since you are doing a less-than query.
I recommend doing the actual ObjectId creation on your application server rather than on the mongos, since this type of calculation can slow you down if you have many users.
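Done on the application server, the computation might look like the following sketch. It only produces the 24-character hex string; wrapping it in the driver's ObjectId constructor is left as a comment so the snippet runs without dependencies. objectIdHexBefore is a hypothetical helper name.

```javascript
// Largest possible ObjectId hex string whose timestamp is one second
// before `now`: 8 hex digits of seconds since the epoch + 16 "f" digits.
function objectIdHexBefore(now) {
  const seconds = Math.floor(now.getTime() / 1000) - 1;
  return seconds.toString(16).padStart(8, "0") + "f".repeat(16);
}

const boundary = objectIdHexBefore(new Date());
console.log(boundary);
// Usage with the Node.js driver (assumed to be installed):
//   const { ObjectId } = require("mongodb");
//   collection.find({ _id: { $lt: new ObjectId(boundary) } })
```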
I have built pagination using the MongoDB _id this way.
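// import ObjectId from mongodb
let sortOrder = -1;
// find() expects a filter object, not an array
let query = {};
if (prev) {
  // previous page: ids greater than the first one shown, read in ascending order
  sortOrder = 1;
  query = {title: 'findTitle', _id: {$gt: ObjectId('_idValue')}};
}
if (next) {
  // next page: ids smaller than the last one shown, read in descending order
  sortOrder = -1;
  query = {title: 'findTitle', _id: {$lt: ObjectId('_idValue')}};
}
db.collection.find(query).limit(10).sort({_id: sortOrder})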