Is this an optimal structure for querying MongoDB? - javascript

I am trying to find which approach is more scalable.
I have a user who has requested a seat in a carpool trip, and the user needs to be able to see all trips that apply to them. My models look like this:
var UserSchema = new mongoose.Schema({
id: String,
name: String,
trips: [String] // An array of strings, which holds the id of trips
});
var TripSchema = new mongoose.Schema({
id: String,
description: String,
passengers: [String] // An array of strings, which holds the id of users
});
So when the user goes to see all trips that apply to them, my backend will search through all the trips in the Mongo database.
I am deciding between 2 approaches:
Search through all trips and return the trips where the user's id is in the passengers array
Search through all trips and return the trips with an id matching an id in the user's trips array.
I believe approach #2 is better because it does not have to search deeper in the Trip model. I am just seeking confirmation and wondering if there is anything else I should consider.
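In code, the two approaches would look roughly like this (a sketch, assuming User and Trip models compiled from the schemas above):
// Approach 1: a single query that scans the passengers array
Trip.find({ passengers: userId }, function (err, trips) { /* ... */ });

// Approach 2: load the user, then match trips by id with $in
User.findOne({ id: userId }, function (err, user) {
  Trip.find({ id: { $in: user.trips } }, function (err, trips) { /* ... */ });
});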

If you don't do big data, I would simply say that it does not matter - both are good enough, but if you really have millions of queries on millions of users and trips...
For option 1 you only have one query, but you would have to make sure that your passengers field is indexed for this to be efficient. Maintaining another index impacts your write performance.
For option 2 you always have to do two queries: first query for the user object in the user collection, then do an $in-style query to load the trip items that match any of those tripIds from user.trips. You will query on the _id field, which is always indexed. Of course, when you always load your user anyway, there is only one query that really counts.
You would also have to consider whether write or read performance matters more. Your model is pretty inefficient for write because for every new trip you need to update two collections (the trip and the user). So currently you double your writes and usually writes are more expensive than reads.
And finally: easy and maintainable code is usually more important than a bit of performance. Just use the mongoose populate feature, and all of this is done automatically for you. Don't store the references as Strings but as type ObjectId, and use the ref keyword in your model.
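A minimal sketch of that populate-based model (same collections as in the question, with the String ids replaced by ObjectId references):
var UserSchema = new mongoose.Schema({
  name: String,
  trips: [{ type: mongoose.Schema.Types.ObjectId, ref: 'Trip' }]
});
var TripSchema = new mongoose.Schema({
  description: String,
  passengers: [{ type: mongoose.Schema.Types.ObjectId, ref: 'User' }]
});
var User = mongoose.model('User', UserSchema);
var Trip = mongoose.model('Trip', TripSchema);

// One populate call loads the user and all referenced trips:
User.findById(userId).populate('trips').exec(function (err, user) {
  // user.trips is now an array of full Trip documents
});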

Related

Querying for object key in Firestore

I currently have a few issues with my Firestore querying technique. As per this stackoverflow post I made recently, Querying with two array with firestore security rules
The answer proposed adding the "ids" into an object, with the key as the id and the value simply being "true". I have completed this, and now my structure looks like so:
This leaves me with this query:
db.collection('Depots')
.where(`products.${productId}`, '==', true)
.where(`users.${userId}`, '==', true)
.where('created', '>', 1585998560500)
.orderBy('created', 'asc')
.get();
This query leaves me with throwing an error, asking to create an index:
The query requires an index. You can create it here: ...
However, this tries to index the specific object key, i.e. QXooVYGBIFWKo6C, so products.QXooVYGBIFWKo6C. That is certainly not what I want, as this query changes and can have an infinite number of possibilities, which means I would have to create another index for each key entry in order to query it.
Is there any way to solve this issue? I am assuming it needs to index this query due to the different operators used in the query, so I was wondering if there were any workarounds to this issue.
Thank you very much in advance.
What you have here is a map field, for which indexes should usually be created automatically.
That indeed means that you'll have as many indexes as you have products, which means:
You are limited in how many products you can have, as there is a maximum of 40,000 index entries per document.
You pay more per document, as you pay for the storage of each index.
If these are not what you want, you'll have to switch back to your original model, with the query limitations you had there. There doesn't seem to be a solution that fits both of your requirements.
After our discussion in chat, this is the starting point I would suggest. Who knows what the end architecture will look like, but I think it will be this, or very close to it. You say that a user can exist in multiple depots at the same time, and multiple depots can contain the same products, also at the same time. You also said that a depot can never have more than 40 users at a given time, so an array of 40 users would certainly not encroach on Firestore's document limit of 1,048,576 bytes.
[collection]
<documentId>
- field: value
[depots]
<UUID>
- depotId: string "depot456"
- productCount: num 5,000
<UUID>
- depotId: string "depot789"
- productCount: num 4,500
[products]
<UUID>
- productId: string "lotion123"
- depotId: string "depot456"
- users: [string] ["user10", "user27", "user33"]
<UUID>
- productId: string "lotion123"
- depotId: string "depot789"
- users: [string] ["user10", "user17", "user50"]
[users]
<userId>
- depots: [string] ["depot456", "depot999"]
<userId>
- depots: [string] ["depot333", "depot999"]
In NoSQL, storage is cheap and computation isn't, so denormalize your data as much as you need to make your queries possible and efficient (fast and cheap).
To find all depots in a single query where user10 and lotion123 are both present, query the products collection where productId equals x and users array-contains y, and collect the depotId values from those results. If you want to preserve the array-contains operation for something else, you'd have to denormalize your data further (replace the array with a single user per document). Or you could split this query into two separate queries.
With this model, when a user leaves a depot, get all products where users array-contains that user and remove that userId from the array. And when a user joins a depot, get all products where depotId equals x and append that userId to the array.
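A sketch of that lookup with the Firebase JS SDK, using the collection and field names from the model above:
// Find every depot where user10 can see lotion123: query products,
// then collect the depotId from each matching document.
db.collection('products')
  .where('productId', '==', 'lotion123')
  .where('users', 'array-contains', 'user10')
  .get()
  .then((snap) => {
    const depotIds = snap.docs.map((doc) => doc.get('depotId'));
  });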
Watch this video, and others by Rick, to get a solid handle on NoSQL: https://www.youtube.com/watch?v=HaEPXoXVf2k
@danwillm If you are not sure about the number of users and products, then your DB structure seems unfit for this situation, because there are size and length limitations on Firestore documents.
You should rather create a separate collection for products and users i.e normalize your data and have a reference for the user in the product collection.
User :
{
userId: documentId,
name: John,
...otherInfo
}
Product :
{
productId: documentId,
createdBy: userId,
createdOn:date,
productName:"exa",
...otherInfo
}
This way the size of each document stays limited; try to avoid using maps/arrays in Firestore if you are not sure about their size.
Also, in this case the number of queries increases, but you don't need many indexes.
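For example, finding a user's products under this normalized model becomes a plain reference query (a sketch; the Product collection name and createdBy field follow the structure above):
// Look up products by their reference to the user instead of reading
// a large map field on a single document.
db.collection('Product')
  .where('createdBy', '==', userId)
  .get()
  .then((snap) => {
    const products = snap.docs.map((doc) => doc.data());
  });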

MongoDB and Mongoose: Nested Array of Document Reference IDs

I have been diving into a study of MongoDB and came across a particularly interesting pattern in which to store relationships between documents. This pattern involves the parent document containing an array of ids referencing the child document as follows:
//Parent Schema
export interface Post extends mongoose.Document {
content: string;
dateCreated: string;
comments: Comment[];
}
let postSchema = new mongoose.Schema({
content: {
type: String,
required: true
},
dateCreated: {
type: String,
required: true
},
comments: [{ type: mongoose.Schema.Types.ObjectId, ref: 'Comment' }] //nested array of child reference ids
});
And the child being referenced:
//Child Schema
export interface Comment extends mongoose.Document {
content: string;
dateCreated: string;
}
let commentSchema = new mongoose.Schema({
content: {
type: String,
required: true
},
dateCreated: {
type: String,
required: true
}
});
This all seems fine and dandy until I go to send a request from the front end to create a new comment. The request has to contain the Post _id (to update the post) and the new Comment, which are both common to a request one would send when using a normal relational database. The issue appears when it comes time to write the new Comment to the database. Instead of one db write, like you would do in a normal relational database, I have to do 2 writes AND 1 read. The first write to insert the new Comment and retrieve the _id. Then a read to retrieve the Post by the Post _id sent with the request so I can push the new Comment _id to the nested reference array. Finally, a last write to update the Post back into the database.
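A sketch of that flow with Mongoose (model names from the schemas above); note that an atomic $push can at least fold the read and the second write into one update command:
// Write #1: insert the comment and get its _id back
Comment.create({ content, dateCreated })
  .then((comment) => Post.findByIdAndUpdate(postId, {
    // Write #2: atomically push the new _id onto the post's reference array
    $push: { comments: comment._id }
  }));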
This seems extremely inefficient. My question is two-fold:
Is there a better/more efficient way to handle this relationship pattern (parent containing an array of child reference ids)?
If not, what would be the benefit of using this pattern as opposed to A) storing the parent _id in a property on the child similar to a traditional foreign key, or B) taking advantage of MongoDB documents and storing an array of the Comments as opposed to an array of reference ids to the Comments.
Thanks in advance for your insight!
Regarding your first question:
You specifically ask for a better way to work with child-ids that are stored in the parent. I'm pretty sure that there is no better way to deal with this, if it has to be this pattern.
But this problem also exist in relational databases. If you want to save your post in a relational database (using that pattern), you also have to first create the comment, get its ID and then update the post. Granted, you can send all these tasks in a single request, which is probably more efficient than using mongoose, but the type of work that needs to be done is the same.
Regarding your second question:
The benefit over variant A is that you can, for example, get the post and instantly know how many comments it has, without asking MongoDB to go through probably hundreds of documents.
The benefit over variant B is that you can store more references to comments in a single document (a single post) than whole comments, because of MongoDB's 16MB document size limit.
The downside, however, is the one you mentioned: it's inefficient to maintain that structure. I take it that this is only an example to showcase the scenario, so here is what I would do:
I would decide on a case-by-case basis what to use.
If the document will be read a lot and not written to much, AND it is unlikely to grow larger than 16MB: embed the sub-document. This way you can get all the data in a single query.
If you need to reference the document from multiple other documents AND your data really must be consistent, then you have no choice but to reference it.
If you need to reference the document from multiple other documents BUT data consistency is not that super important AND the restrictions from the first bullet point apply, then embed the sub-documents, and write code to keep your data consistent.
If you need to reference the document from multiple other documents, and they are written to a lot, but not read that often, you're probably better off referencing them, as this is easier to code, because you don't need to write code to sync duplicate data.
In this specific case (post/comment), referencing the parent from the child (letting the child know the parent's _id) is probably a good idea, because it's easier to maintain than the other way around, and the document might grow larger than 16MB if the comments were embedded directly. If I knew for sure that the document would NOT grow larger than 16MB, embedding them would be better, because it's faster to query the data that way.
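A minimal sketch of that child-references-parent variant, reusing the schemas from the question:
// The Comment carries its parent Post's _id, so creating a comment
// is a single write, and loading a post's comments is a single query.
let commentSchema = new mongoose.Schema({
  content: { type: String, required: true },
  dateCreated: { type: String, required: true },
  post: { type: mongoose.Schema.Types.ObjectId, ref: 'Post' }
});

// Comment.create({ content, dateCreated, post: postId });
// Comment.find({ post: postId });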

Parse - How do I query a Class and include another that points to it?

I have two classes - _User and Car. A _User will have a low/limited number of Cars that they own. Each Car has only ONE owner and thus an "owner" column that is a pointer to the _User. When I go to the user's page, I want to see their _User info and all of their Cars. I would like to make one call, in Cloud Code if necessary.
Here is where I get confused. There are 3 ways I could do this -
In _User have a relationship column called "cars" that points to each individual Car. If so, how come I can't use the "include(cars)" function on a relation to include the Cars' data in my query?!!
_User.cars = relationship, Car.owner = _User(pointer)
Query the _User, and then query all Cars with (owner == _User.objectId) separately. This is two queries though.
_User.cars = null, Car.owner = _User(pointer)
In _User have an array of pointers column called "cars". Manually inject pointers to cars upon car creation. When querying the user I would use "include(cars)".
_User.cars = [Car(pointer)], Car.owner = _User(pointer)
What is your recommended way to do this and why? Which one is the fastest? The documentation just leaves me further confused.
I recommend the 3rd option, and yes, you can ask to include an array. You don't even need to "manually inject" the pointers; you just need to add the objects into the array and they'll automatically be converted into pointers.
You've got the right ideas. Just to clarify them a bit:
A relation. User can have a relation column called cars. To get from user to car, there's a user query and then a second query like user.relation("cars").query, on which you would .find().
What you might call a belongs_to pointer in Car. To get from user to car you'd have a query to get your user and you create a carQuery like carQuery.equalTo("user", user)
An array of pointers. For small-sized collections, this is superior to the relation, because you can aggressively load cars when querying user by saying include("cars") on a user query. Not sure if there's a second query under the covers - probably not if Parse (Mongo) is storing these as embedded.
But I wouldn't get too tied up over one or two queries. Using the promise forms of find() will keep your code nice and tidy. There probably is a small speed advantage to the array technique, which is good while the collection size is small (<100 is my rule of thumb).
It's easy to google (or I'll add here if you have a specific question) code examples for maintaining the relations and for getting from user->car or from car->user for each approach.
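For instance, the array-of-pointers read might look like this with the Parse JS SDK (a sketch; userId is assumed to be the _User's objectId):
// Query the user and eagerly fetch the cars array of pointers.
const query = new Parse.Query(Parse.User);
query.include("cars");
query.get(userId).then((user) => {
  const cars = user.get("cars"); // fully fetched Car objects
});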

Range query for MongoDB pagination

I want to implement pagination on top of a MongoDB. For my range query, I thought about using ObjectIDs:
db.tweets.find({ _id: { $lt: maxID } }, { limit: 50 })
However, according to the docs, the structure of the ObjectID means that "ObjectId values do not represent a strict insertion order":
The relationship between the order of ObjectId values and generation time is not strict within a single second. If multiple systems, or multiple processes or threads on a single system generate values, within a single second; ObjectId values do not represent a strict insertion order. Clock skew between clients can also result in non-strict ordering even for values, because client drivers generate ObjectId values, not the mongod process.
I then thought about querying with a timestamp:
db.tweets.find({ created: { $lt: maxDate } }, { limit: 50 })
However, there is no guarantee the date will be unique — it's quite likely that two documents could be created within the same second. This means documents could be missed when paging.
Is there any sort of ranged query that would provide me with more stability?
It is perfectly fine to use ObjectId() though your syntax for pagination is wrong. You want:
db.tweets.find().limit(50).sort({"_id":-1});
This says you want tweets sorted by _id value in descending order and you want the most recent 50. Your problem is the fact that pagination is tricky when the current result set is changing - so rather than using skip for the next page, you want to make note of the smallest _id in the result set (the 50th most recent _id value) and then get the next page with:
db.tweets.find( {_id : { "$lt" : <50th _id> } } ).limit(50).sort({"_id":-1});
This will give you the next "most recent" tweets, without new incoming tweets messing up your pagination back through time.
There is absolutely no need to worry about whether the _id value strictly corresponds to insertion order - it will be 99.999% close enough, and no one actually cares at the sub-second level which tweet came first - you might even notice Twitter frequently displays tweets out of order; it's just not that critical.
If it is critical, then you would have to use the same technique but with "tweet date" where that date would have to be a timestamp, rather than just a date.
Wouldn't a tweet "actual" timestamp (i.e. time tweeted and the criteria you want it sorted by) be different from a tweet "insertion" timestamp (i.e. time added to local collection). This depends on your application, of course, but it's a likely scenario that tweet inserts could be batched or otherwise end up being inserted in the "wrong" order. So, unless you work at Twitter (and have access to collections inserted in correct order), you wouldn't be able to rely just on $natural or ObjectID for sorting logic.
Mongo docs suggest skip and limit for paging:
db.tweets.find({created: {$lt: maxDate}}).
sort({created: -1, username: 1}).
skip(50).limit(50); //second page
There is, however, a performance concern when using skip:
The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to get the offset or skip position before beginning to return results. As offset increases, cursor.skip() will become slower and more CPU intensive.
This happens because skip does not fit into the MapReduce model and is not an operation that would scale well, you have to wait for a sorted collection to become available before it can be "sliced". Now limit(n) sounds like an equally poor method as it applies a similar constraint "from the other end"; however with sorting applied, the engine is able to somewhat optimize the process by only keeping in memory n elements per shard as it traverses the collection.
An alternative is to use range based paging. After retrieving the first page of tweets, you know what the created value is for the last tweet, so all you have to do is substitute the original maxID with this new value:
db.tweets.find({created: {$lt: lastTweetOnCurrentPageCreated}}).
sort({created: -1, username: 1}).
limit(50); //next page
Performing a find condition like this can be easily parallelized. But how to deal with pages other than the next one? You don't know the begin date for pages number 5, 10, 20, or even the previous page! @SergioTulentsev suggests creative chaining of methods, but I would advocate pre-calculating first-last ranges of the aggregate field in a separate pages collection; these could be re-calculated on update. Furthermore, if you're not happy with DateTime (note the performance remarks) or are concerned about duplicate values, you should consider compound indexes on timestamp + account tiebreaker (since a user can't tweet twice at the same time), or even an artificial aggregate of the two:
db.pages.
find({pagenum: 3})
> {pagenum:3; begin:"01-01-2014#BillGates"; end:"03-01-2014#big_ben_clock"}
db.tweets.
find({_sortdate: {$lt: "03-01-2014#big_ben_clock", $gt: "01-01-2014#BillGates"}}).
sort({_sortdate: -1}).
limit(50) //third page
Using an aggregate field for sorting will work "on the fold" (although perhaps there are more kosher ways to deal with the condition). This could be set up as a unique index with values corrected at insert time, with a single tweet document looking like
{
_id: ...,
created: ..., //to be used in markup
user: ..., //also to be used in markup
_sortdate: "01-01-2014#BillGates" //sorting only, use date AND time
}
The following approach will work even if there are multiple documents inserted/updated at the same millisecond, even from multiple clients (each of which generates ObjectIds). For simplicity, in the following queries I am projecting _id and lastModifiedDate.
First page: fetch the result sorted by lastModifiedDate (descending) and ObjectId (ascending).
db.product.find({},{"_id":1,"lastModifiedDate":1}).sort({"lastModifiedDate":-1, "_id":1}).limit(2)
Note down the ObjectId and lastModifiedDate of the last record fetched in this page. (loid, lmd)
For the second page, include a query condition to search for (lastModifiedDate = lmd AND _id > loid) OR (lastModifiedDate < lmd):
db.product.find({$or:[{"lastModifiedDate":{$lt:lmd}},{$and:[{"lastModifiedDate":lmd},{"_id":{$gt:loid}}]}]},{"_id":1,"lastModifiedDate":1}).sort({"lastModifiedDate":-1, "_id":1}).limit(2)
Repeat the same for subsequent pages.
ObjectIds should be good enough for pagination if you limit your queries to the previous second (or don't care about the subsecond possibility of weirdness). If that is not good enough for your needs then you will need to implement an ID generation system that works like an auto-increment.
Update:
To query the previous second of ObjectIds you will need to construct an ObjectID manually.
See the specification of ObjectId http://docs.mongodb.org/manual/reference/object-id/
Try using this expression to do it from a mongos.
{ _id :
{
$lt : ObjectId(Math.floor((new Date).getTime()/1000 - 1).toString(16)+"ffffffffffffffff")
}
}
The 'f's at the end are there to max out the possible random bits that are not associated with a timestamp, since you are doing a less-than query.
I recommend doing the actual ObjectId creation on your application server rather than on the mongos, since this type of calculation can slow you down if you have many users.
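If you truly need strict insertion order, a common pattern for the auto-increment-style ID generation mentioned above is a counters collection updated atomically with findAndModify (a shell-style sketch; the collection and field names are illustrative):
// Atomically increment and return the next sequence number.
function getNextSequence(name) {
  var ret = db.counters.findAndModify({
    query: { _id: name },
    update: { $inc: { seq: 1 } },
    new: true,
    upsert: true
  });
  return ret.seq;
}

// db.tweets.insert({ seq: getNextSequence("tweetid"), ... })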
I have built pagination using the MongoDB _id this way:
// import { ObjectId } from 'mongodb'
let sortOrder = -1;
let query = {};
if (prev) {
  sortOrder = 1;
  query = { title: 'findTitle', _id: { $gt: ObjectId('_idValue') } };
}
if (next) {
  sortOrder = -1;
  query = { title: 'findTitle', _id: { $lt: ObjectId('_idValue') } };
}
// find() takes a filter document, so query must be an object, not an array
db.collection.find(query).limit(10).sort({ _id: sortOrder });

Mongoose - recursive query (merge from multiple results)

I have the following generic schema to represent different types of information.
var Record = new Schema (
{
type: {type: String}, // any string (foo, bar, foobar)
value: {type: String}, // any string value
o_id: {type:String}
}
);
Some of the records based on this schema have:
type="car"
value="ferrari" or
value="ford"
Other records have type "topspeed" with value "210", but related records always share an o_id (e.g. "this ferrari has this topspeed"). So if "ferrari has top speed 300", then both records have the same o_id.
How can I make a query to find "ferrari with topspeed 300" when I don't know the o_id?
The only solution I found is to select the "ferrari" cars first, and then use the o_ids of all the "ferrari" records to find the matching topspeed.
In pseudocode:
Record.find({type:"car", value:"ferrari"}, function(err, docs)
{
var condition = [];// create array of all found o_id;
Record.find({type:"topspeed", value:"300"}...
}
I know that some merging or joining might not be possible, but what about chaining these conditions to avoid the recursion?
EDIT:
Better example:
Let's imagine I have an HTML document that contains DIV elements with a certain id (o_id).
Now each div element can contain different types of microdata items (Car, Animal...).
Each microdata item has different properties ("topspeed", "numberOfLegs"...) based on the type (a Car has a topspeed, an Animal a numberOfLegs).
Each property has some value (310 kph, 4 legs).
Now I'm saving these microdata items to the database, but in a general way, agnostic of the type and values they contain, since the user can define custom schemas (from Car, to Animal, to pretty much anything). For that I defined the Record schema: type consists of "itemtype_propertyname" and value is the value of the property.
I would eventually like to query "Give me o_id(s) of all DIV elements that contain item Ferrari and item Dog" at the same time.
The reason for this general approach is to allow anyone to define a custom schema and a corresponding parser that stores the values.
But I will have only one search engine to find all the different schema and value combinations, and it will treat all possible schemas as a single definition.
I think it'd be far better to combine all records that share an o_id into a single record. E.g.:
{
_id: ObjectId(...),
car: "ferarri",
topspeed: 300
}
Then you won't have this problem, and your schema will be more efficient both in speed and storage size. This is how MongoDB is intended to be used -- heterogenous data can be stored in a single collection, because MongoDB is schemaless. If you continue with your current design, then no, there's no way to avoid multiple round-trips to the database.
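With the combined document, the lookup from the question collapses to a single query (a sketch, assuming a Record model over this combined shape):
// One query instead of two round-trips:
Record.find({ car: "ferrari", topspeed: 300 }, function(err, docs) {
  // docs: all ferraris with a top speed of 300
});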
