MongoDB and Mongoose: Nested Array of Document Reference IDs


I have been diving into a study of MongoDB and came across a particularly interesting pattern in which to store relationships between documents. This pattern involves the parent document containing an array of ids referencing the child document as follows:
//Parent Schema
export interface Post extends mongoose.Document {
content: string;
dateCreated: string;
comments: Comment[];
}
let postSchema = new mongoose.Schema({
content: {
type: String,
required: true
},
dateCreated: {
type: String,
required: true
},
comments: [{ type: mongoose.Schema.Types.ObjectId, ref: 'Comment' }] //nested array of child reference ids
});
And the child being referenced:
//Child Schema
export interface Comment extends mongoose.Document {
content: string;
dateCreated: string;
}
let commentSchema = new mongoose.Schema({
content: {
type: String,
required: true
},
dateCreated: {
type: String,
required: true
}
});
This all seems fine and dandy until I go to send a request from the front end to create a new comment. The request has to contain the Post _id (to update the post) and the new Comment, which are both common to a request one would send when using a normal relational database. The issue appears when it comes time to write the new Comment to the database. Instead of one db write, like you would do in a normal relational database, I have to do 2 writes AND 1 read. The first write to insert the new Comment and retrieve the _id. Then a read to retrieve the Post by the Post _id sent with the request so I can push the new Comment _id to the nested reference array. Finally, a last write to update the Post back into the database.
This seems extremely inefficient. My question is two-fold:
Is there a better/more efficient way to handle this relationship pattern (parent containing an array of child reference ids)?
If not, what would be the benefit of using this pattern as opposed to A) storing the parent _id in a property on the child similar to a traditional foreign key, or B) taking advantage of MongoDB documents and storing an array of the Comments as opposed to an array of reference ids to the Comments.
Thanks in advance for your insight!

Regarding your first question:
You specifically ask for a better way to work with child ids that are stored in the parent. I'm pretty sure there is no better way to deal with this, if it has to be this pattern.
But this problem also exists in relational databases. If you want to save your post in a relational database (using that pattern), you also have to first create the comment, get its ID, and then update the post. Granted, you can send all these tasks in a single request, which is probably more efficient than using Mongoose, but the type of work that needs to be done is the same.
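As a rough sketch of the work described above: the intermediate read is not strictly required, because MongoDB's $push operator can append to the array in place, which leaves two writes. The helper name buildPushCommentUpdate is made up for illustration, and the commented Mongoose calls assume models named Comment and Post built from the schemas in the question.

```javascript
// Hypothetical helper: builds the filter and update documents that append
// a new comment id to a post's `comments` array without reading the post first.
function buildPushCommentUpdate(postId, commentId) {
  return {
    filter: { _id: postId },
    update: { $push: { comments: commentId } },
  };
}

// Assumed usage with Mongoose (needs a live connection, so not run here):
//   const comment = await Comment.create({ content, dateCreated }); // write 1
//   const op = buildPushCommentUpdate(postId, comment._id);
//   await Post.updateOne(op.filter, op.update);                     // write 2

const op = buildPushCommentUpdate('post-1', 'comment-9');
console.log(JSON.stringify(op));
```

The two writes are still not atomic as a pair, so this only trims the read; it does not remove the maintenance cost of the pattern.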
Regarding your second question:
The benefit over variant A is that you can, for example, get the post and instantly know how many comments it has, without asking MongoDB to go through probably hundreds of documents.
The benefit over variant B is that you can store more references to comments in a single document (a single post) than whole comments, because of MongoDB's 16MB document size limit.
The downside, however, is the one you mentioned: it's inefficient to maintain that structure. I take it that this is only an example to showcase the scenario, so here is what I would do:
I would decide on a case-by-case basis what to use.
If the document will be read a lot and not written to much, AND it is unlikely to grow larger than 16MB: embed the sub-document. This way you can get all the data in a single query.
If you need to reference the document from multiple other documents AND your data really must be consistent, then you have no choice but to reference it.
If you need to reference the document from multiple other documents BUT data consistency is not that important AND the restrictions from the first bullet point apply, then embed the sub-documents and write code to keep your data consistent.
If you need to reference the document from multiple other documents, and they are written to a lot but not read that often, you're probably better off referencing them, as this is easier to code, because you don't need to write code to sync duplicate data.
In this specific case (post/comment), referencing the parent from the child (letting the child know the parent's _id) is probably a good idea, because it's easier to maintain than the other way around, and the document might grow larger than 16MB if the comments were embedded directly. If I knew for sure that the document would NOT grow larger than 16MB, embedding them would be better, because it's faster to query the data that way.
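A minimal sketch of that parent-reference variant (A), assuming a hypothetical post field on each comment that holds the parent's _id. The helper names are invented for illustration, and the Mongoose calls in the comments assume a model named Comment.

```javascript
// Hypothetical helper: the document to insert for a new comment. This single
// insert is the only write needed; the post itself is never touched.
function buildCommentDoc(postId, content, dateCreated) {
  return { post: postId, content, dateCreated };
}

// Hypothetical helper: the filter that finds (or counts) a post's comments.
// An index on `post` keeps this from scanning the whole collection.
function commentsOfPostFilter(postId) {
  return { post: postId };
}

// Assumed usage with Mongoose (not run here):
//   await Comment.create(buildCommentDoc(postId, 'Nice post', '2016-03-04'));
//   const comments = await Comment.find(commentsOfPostFilter(postId));
//   const total = await Comment.countDocuments(commentsOfPostFilter(postId));

console.log(buildCommentDoc('post-1', 'Nice post', '2016-03-04'));
```

The trade-off is the one named above: the comment count now costs a query instead of being readable straight off the post.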

Related

storing data as object vs array in MongoDb for write performance

Should I store objects in an Array or inside an Object with top importance given Write Speed?
I'm trying to decide whether data should be stored as an array of objects, or using nested objects inside a mongodb document.
In this particular case, I'm keeping track of a set of continually updating files that I add and update and the file name acts as a key and the number of lines processed within the file.
the document looks something like this
{
t_id:1220,
some-other-info: {}, // there's other info here not updated frequently
files: {
log1-txt: {filename:"log1.txt",numlines:233,filesize:19928},
log2-txt: {filename:"log2.txt",numlines:2,filesize:843}
}
}
or this
{
t_id:1220,
some-other-info: {},
files:[
{filename:"log1.txt",numlines:233,filesize:19928},
{filename:"log2.txt",numlines:2,filesize:843}
]
}
I am making an assumption that handling a document, especially when it comes to updates, it is easier to deal with objects, because the location of the object can be determined by the name; unlike an array, where I have to look through each object's value until I find the match.
Because the object key will have periods, I will need to convert (or drop) the periods to create a valid key (fi.le.log to filelog or fi-le-log).
I'm not worried about the files' possible duplicate names emerging (such as fi.le.log and fi-le.log) so I would prefer to use Objects, because the number of files is relatively small, but the updates are frequent.
Or would it be better to handle this data in a separate collection for best write performance...
{
"_id": ObjectId('56d9f1202d777d9806000003'),"t_id": "1220","filename": "log1.txt","filesize": 1843,"numlines": 554
},
{
"_id": ObjectId('56d9f1392d777d9806000004'),"t_id": "1220","filename": "log2.txt","filesize": 5231,"numlines": 3027
}

From what I understand you are talking about write speed, without any read consideration. So we have to think about how you will insert/update your document.
We have to compare (assuming you know the _id you are replacing, replace {key} by the key name, in your example log1-txt or log2-txt):
db.Col.update({ _id: '' }, { $set: { 'files.{key}': object }})
vs
db.Col.update({ _id: '', 'files.filename': '{key}'}, { $set: { 'files.$': object }})
The second one means that MongoDB has to browse the array, find the matching index, and update it. The first one means MongoDB just updates the specified field.
Worse:
The second command will not work if the matching filename is not present in the array! So you have to execute it, check whether nMatched is 0, and create the entry if so. That's really bad for write speed (see "MongoDB: upsert sub-document").
If you will never or almost never run read queries or the aggregation framework on this collection, go for the first one; it will be faster. If you want to aggregate, unwind, or run analytics on the files you parsed to get statistics about file sizes and line counts, consider the second one; it will save you some headaches.
Pure write speed will be better with the first solution.
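To make the comparison concrete, here is a small sketch that builds both update documents as plain objects. The function names and the period-to-dash key conversion are illustrative assumptions (the question proposes converting or dropping periods); the resulting filter/update pairs would be passed to db.Col.update as in the commands above.

```javascript
// Turn a filename into a key without periods, e.g. "log1.txt" -> "log1-txt".
function toFieldKey(filename) {
  return filename.replace(/\./g, '-');
}

// Object layout: one $set on a dotted path. It creates the entry if it is
// missing and overwrites it if it exists, so one command covers both cases.
function buildObjectUpdate(docId, file) {
  return {
    filter: { _id: docId },
    update: { $set: { ['files.' + toFieldKey(file.filename)]: file } },
  };
}

// Array layout: the positional $ operator only works when the filter matches
// an existing element, so a miss needs a second command that pushes the entry.
function buildArrayUpdate(docId, file) {
  return {
    filter: { _id: docId, 'files.filename': file.filename },
    update: { $set: { 'files.$': file } },
    fallbackWhenNoMatch: {
      filter: { _id: docId },
      update: { $push: { files: file } },
    },
  };
}

const file = { filename: 'log1.txt', numlines: 233, filesize: 19928 };
console.log(JSON.stringify(buildObjectUpdate(1220, file).update));
console.log(JSON.stringify(buildArrayUpdate(1220, file).filter));
```

The fallbackWhenNoMatch property is just a way to show the extra round trip in one place; it is not something the driver understands.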

Is this an optimal structure for querying MongoDB?

I am trying to find which approach is more scalable.
I have a user who has requested a seat in a carpool trip, and the user needs to be able to see all trips that apply to them. My models look like this:
var UserSchema = new mongoose.Schema({
id: String,
name: String,
trips: [String] // An array of strings, which holds the id of trips
});
var TripSchema = new mongoose.Schema({
id: String,
description: String,
passengers: [String] // An array of strings, which holds the id of users
});
So when the user goes to see all trips that apply to them, my backend will search through all the trips in the Mongo database.
I am deciding between 2 approaches:
Search through all trips and return the trips where the user's id is in the passengers array
Search through all trips and return the trips with an id matching an id in the user's trips array.
I believe approach #2 is better because it does not have to search deeper in the Trip model. I am just seeking confirmation and wondering if there is anything else I should consider.

If you don't do big data, I would simply say that it does not matter: both are good enough. But if you really have millions of queries on millions of users and trips...
For option 1 you only have one query, but you would have to make sure that the passengers field is indexed, so you would need to maintain another index for this to be efficient. Another index impacts your write performance.
For option 2 you always have to do two queries.
First query for the user object in the user collection, then do an $in-style query to load the trip items that match any of the trip ids from user.trips. You will query on the _id field, which is always indexed. Of course, if you always load your user anyway, there is only one query that really counts.
You would also have to consider whether write or read performance matters more. Your model is pretty inefficient for writes, because for every new trip you need to update two collections (the trip and the user). So currently you double your writes, and writes are usually more expensive than reads.
And finally: easy and maintainable code is usually more important than a bit of performance, so just use the Mongoose populate feature and all of this is done for you. Don't store the references as Strings but as type ObjectId, and use the ref keyword in your model.
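The two approaches boil down to two different query filters. The sketch below builds them as plain objects; the helper names are hypothetical, and the commented Mongoose calls assume models named User and Trip that store references in _id form rather than in the custom id field from the question.

```javascript
// Option 1: one query on the trips collection. Matching a single value
// against an array field selects documents whose array contains that value.
// Needs an index on `passengers` to stay fast.
function tripsByPassengerFilter(userId) {
  return { passengers: userId };
}

// Option 2: load the user first, then fetch the trips whose _id appears
// in user.trips.
function tripsByIdsFilter(user) {
  return { _id: { $in: user.trips } };
}

// Assumed usage with Mongoose (not run here):
//   const trips1 = await Trip.find(tripsByPassengerFilter(userId));
//   const user = await User.findById(userId);
//   const trips2 = await Trip.find(tripsByIdsFilter(user));
// With ObjectId refs declared in the schema, the second pair can shrink to:
//   const populated = await User.findById(userId).populate('trips');

console.log(JSON.stringify(tripsByIdsFilter({ trips: ['t1', 't2'] })));
```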

Mongodb parent reference tree get current location and full path

I'm making parental reference tree with MongoDB and Mongoose. My schema looks like this
var NodesSchema = new Schema({
_id: {
type: ShortId,
len: 7
},
name: { // name of the file or folder
type: String,
required: true
},
isFile: { // is the node file or folder
type: Boolean,
required: true
},
location: { // location, null for root
type: ShortId,
default: null
},
data: { // optional if isFile is true
type: String
}
});
Note that files/folders are rename-able.
In my current setup if I want to get files in specific folder I perform the following query:
NodesModel.find({ location: 'LOCATION_ID' })
If I want to get a single file/folder I run:
NodesModel.findOne({ _id: 'ITEM_ID' })
and the location field looks like f8mNslZ1 but if I want to get the location folder name I need to do second query.
Unfortunately if I want to get path to root I need to do a recursive query, which might be slow if I have 300 nested folders.
So I have been searching and figured out the following possible solution:
Should I change the location field from string to object and save the information in it as following:
location: {
_id: 'LOCATION_ID',
name: 'LOCATION_NAME',
fullpath: '/FOLDERNAME1/FOLDERNAME2'
}
The problem with this solution is that files/folders are rename-able. On rename I would have to update all children. Renaming happens much more rarely than indexing, but if the folder has 1000 items, I guess it would be a problem.
My questions are:
Is my suggestion with the location object instead of string viable? What problems might it cause?
Are there better ways to realize this?
How can I improve my code?

Looking at your Node Schema, if you change the location property to an object, you'll have two places where you state the Node's name, so be mindful of updating both name properties. Usually you want to keep your database as DRY as possible, and in most cases doing nested queries is quite common. That being said, you know your database much better than I do, and if you see a significant performance delay from doing more queries, then just be sure to update all name properties.
In addition to this, if you have your location's fullpath property be a string, and let's say you run into a case where you have to rename a folder, you'll have to analyze the whole string by breaking it down and comparing substrings to a new value for the new folder name. This can get tedious.
A possible solution could be to store the full path as an array instead of a string, having the order be the next folder in the chain, that way you can quickly compare and update when need be.

The different ways to model tree structures are extensively covered in the MongoDB docs.
The way you are proposing is one of them.
Depending on how frequent folder renaming is expected to happen (and/or any other hierarchy changes more complex than adding a new leaf node) you might consider storing the "path" as an "array of ancestors" instead. But whichever way you happen to denormalize or materialize the path up the tree in each folder, the trade-off is that for faster look-ups, you will have slower and/or more complicated updates.
In your case it seems clear to optimize for the read and not for the rare update - in addition to being less frequent, it seems that renames could be done asynchronously where that's simply not possible with displaying names of parent folders.
While DRY is a great principle in programming, it's pretty much not applicable to non-relational databases. Unless you are using a strictly relational database and normal forms, don't apply it to your schema design; in fact, it would be specifically discouraged in MongoDB, as you would then be using the wrong tool for the job.
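The "array of ancestors" idea mentioned in both answers can be tried out in memory before touching the schema. This is only a sketch: the node shape (an ancestors array of { _id, name } pairs, root first), the sample ids, and the function names are all assumptions, not the question's schema.

```javascript
// In-memory stand-in for the nodes collection.
const nodes = [
  { _id: 'a1', name: 'home', isFile: false, ancestors: [] },
  { _id: 'b2', name: 'docs', isFile: false, ancestors: [{ _id: 'a1', name: 'home' }] },
  {
    _id: 'c3', name: 'notes.txt', isFile: true,
    ancestors: [{ _id: 'a1', name: 'home' }, { _id: 'b2', name: 'docs' }],
  },
];

// Reading the full path needs no extra query: join the stored names.
function fullPath(node) {
  return '/' + node.ancestors.map((a) => a.name).concat(node.name).join('/');
}

// Renaming touches the folder itself plus every descendant's ancestor entry.
// Returns how many documents were modified.
function renameFolder(allNodes, folderId, newName) {
  let modified = 0;
  for (const node of allNodes) {
    let changed = false;
    if (node._id === folderId) {
      node.name = newName;
      changed = true;
    }
    for (const ancestor of node.ancestors) {
      if (ancestor._id === folderId) {
        ancestor.name = newName;
        changed = true;
      }
    }
    if (changed) modified += 1;
  }
  return modified;
}

console.log(fullPath(nodes[2])); // "/home/docs/notes.txt"
renameFolder(nodes, 'b2', 'documents');
console.log(fullPath(nodes[2])); // "/home/documents/notes.txt"
```

Against a real database, the rename loop would probably become one bulk update over all documents whose ancestors contain the folder's _id, which is the part that can run asynchronously.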

Mongoose behavior and schema

I am currently learning Node.js along with MongoDB, and there are two things that confuse me a bit.
(1),
When a new Schema and model name are used (not in db), the name is changed into its plural form. Example:
mongoose.model('person', personSchema);
In the database, the table will be called "people" instead.
Isn't this likely to confuse new developers? Why have they implemented it this way?
(2),
The second thing is that whenever I want to refer to an existing model in MongoDB (assume that a table called people exists in the db), I still have to define a Schema in my Node.js code in order to create a model that refers to the table.
personSchema = new mongoose.Schema({});
mongoose.model('person',personSchema);
The unusual thing is that it does not seem to matter how I define the schema: it can be empty like above, or filled with random attributes, yet the model will always get the right table and CRUD operations perform normally.
Then what is the use of a Schema, other than defining the table structure when creating a new table?
Many thanks,

Actually two questions, you usually do better asking one, just for future reference.
1. Pluralization
Short form is that it is good practice. In more detail, this is generally logical as what you are referring to is a "collection" of items or objects rather. So the general inference in a "collection" is "many" and therefore a plural form of what the "object" itself is named.
So a "people" collection implies that it is in fact made up of many "person" objects, just as "dogs" to "dog" or "cats" to "cat". Not necessarily "bovines" to "cow", but generally speaking mongoose does not really deal with Polymorphic entities, so there would not be "bull" or "bison" objects in there unless just specified by some other property to "cow".
You can of course change this if you want in either of these forms and specify your own name:
var personSchema = new Schema({ ... },{ "collection": "person" });
mongoose.model( "Person", personSchema, "person" );
But a model generally has a "singular" name, and the "collection" takes the plural form as good practice when there are many. Besides, every SQL database ORM I can think of also does it this way. So really this is just following the practice that most people are already used to.
2. Why Schema?
MongoDB is actually "schemaless", so it does not have any internal concept of "schema", which is one big difference from SQL based relational databases which hold their own definition of "schema" in a "table" definition.
While this is often actually a "strength" of MongoDB in that data is not tied to a certain layout, some people actually like it that way, or generally want to otherwise encapsulate logic that governs how data is stored.
For these reasons, mongoose supports the concept of defining a "Schema". This allows you to say "which fields" are "allowed" in the collection (model) this is "tied" to, and which "type" of data may be contained.
You can of course have a "schemaless" approach, but the schema object you "tie" to your model still must be defined, just not "strictly":
var personSchema = new Schema({ },{ "strict": false });
mongoose.model( "Person", personSchema );
Then you can pretty much add whatever you want as data without any restriction.
The reverse case though is that people "usually" do want some type of rules enforced, such as which fields and what types. This means that only the "defined" things can happen:
var personSchema = new Schema({
name: { type: String, required: true },
age: Number,
sex: { type: String, enum: ["M","F"] },
children: [{ type: Schema.Types.ObjectId, ref: "Person" }],
country: { type: String, default: "Australia" }
});
So the rules there break down to:
"name" must have "String" data in it only. Bit of a JavaScript idiom here, as everything in JavaScript will actually stringify. The other thing here is "required", so that if this field is not present in the object sent to .save() it will throw a validation error.
"age" must be numeric. If you try to .save() this object with data other than numeric supplied in this field then you will throw a validation error.
"sex" must be a string again, but this time we are adding a "constraint" to say what the valid values are. In the same way this also can throw a validation error if you do not supply the correct data.
"children" is actually an Array of items, but these are just "reference" ObjectId values that point to different items in another model. Or in this case this one. So this will keep that ObjectId reference in there when you add to "children". Mongoose can actually .populate() these later with their actual "Person" objects when requested to do so. This emulates a form of "embedding" in MongoDB, but used when you actually want to store the object separately without "embedding" every time.
"country" is again just a String and requires nothing special, but we give it a default value to fill in if no other is supplied explicitly.
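To see what such rules amount to, here is a hand-rolled checker in plain JavaScript that imitates three of them (required, enum, default). It is not Mongoose's implementation and it skips type casting entirely; the rules shape and the function name checkPerson are invented for illustration.

```javascript
// Simplified rule table, loosely modelled on the schema above.
const rules = {
  name: { required: true },
  sex: { enum: ['M', 'F'] },
  country: { default: 'Australia' },
};

// Returns { doc, errors }: doc has defaults filled in, errors lists violations.
function checkPerson(input) {
  const doc = { ...input };
  const errors = [];
  for (const [field, rule] of Object.entries(rules)) {
    // Fill in the default before checking anything else.
    if (doc[field] === undefined && 'default' in rule) {
      doc[field] = rule.default;
    }
    if (rule.required && doc[field] === undefined) {
      errors.push(field + ' is required');
    }
    if (rule.enum && doc[field] !== undefined && !rule.enum.includes(doc[field])) {
      errors.push(field + ' must be one of ' + rule.enum.join(', '));
    }
  }
  return { doc, errors };
}

console.log(checkPerson({ name: 'Ann', sex: 'F' }));
console.log(checkPerson({ sex: 'X' }).errors);
```

With a real Mongoose model the same checks run when the document is validated or saved, and fields that are not in the schema are dropped unless strict mode is turned off.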
There are many other things you can do, I would suggest really reading through the documentation. Everything is explained in a lot of detail there, and if you have specific questions then you can always ask, "here" (for example).
So MongoDB does things differently to how SQL databases work, and throws out some of the things that are generally held in "opinion" to be better implemented at the application business logic layer anyway.
Hence in Mongoose, it tries to "put back" some of the good things people like about working with traditional relational databases, and allow some rules and good practices to be easily encapsulated without writing other code.
There is also some logic there that helps in "emulating" ( cannot stress enough ) "joins", as there are methods that "help" you in being able to retrieve "related" data from other sources, by essentially providing definitions where which "model" that data resides in within the "Schema" definition.
Did I also not mention that "Schema" definitions are again just objects and re-usable? Well yes, they are, and can in fact be tied to "many" models, which may or may not reside on the same database.
Everything here has a lot more function and purpose than you are currently aware of; the good advice here is to head forth and "learn". That is the usual path to the realization ... "Oh, now I see, that's why they do it that way".

Firebase - Get All Data That Contains

I have a firebase model where each object looks like this:
done: boolean
|
tags: array
|
text: string
Each object's tag array can contain any number of strings.
How do I obtain all objects with a matching tag? For example, find all objects where the tag contains "email".

Many of the more common search scenarios, such as searching by attribute (as your tag array would contain), will be baked into Firebase as the API continues to expand.
In the meantime, it's certainly possible to grow your own. One approach, based on your question, would be to simply "index" the list of tags with a list of records that match:
/tags/$tag/record_ids...
Then to search for records containing a given tag, you just do a quick query against the tags list:
new Firebase('URL/tags/'+tagName).once('value', function(snap) {
var listOfRecordIds = snap.val();
});
This is a pretty common NoSQL mantra: put more effort into the initial write to make reads easy later. It's also a common denormalization approach (and one most SQL databases use internally, on a much more sophisticated level).
Also see the post Frank mentioned as that will help you expand into more advanced search topics.
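The "more effort on the initial write" part could be sketched like this. The records/ path, the helper names, and storing true as the index value are assumptions for illustration; only the /tags/$tag/record_ids layout comes from the answer above.

```javascript
// Hypothetical helper: builds one map of paths to values that writes the
// record and adds its id under every tag, so tags/<tag>/<recordId> exists
// for each of the record's tags.
function buildTaggedWrite(recordId, record) {
  const updates = {};
  updates['records/' + recordId] = record;
  for (const tag of record.tags) {
    updates['tags/' + tag + '/' + recordId] = true;
  }
  return updates;
}

// Turns the value read from tags/<tag> (snap.val() above) into a list of
// record ids; a tag with no records reads back as null.
function recordIdsFromTagValue(value) {
  return Object.keys(value || {});
}

const updates = buildTaggedWrite('rec1', {
  done: false,
  tags: ['email', 'work'],
  text: 'Reply to Sam',
});
console.log(Object.keys(updates));
console.log(recordIdsFromTagValue({ rec1: true, rec7: true }));
```

The map would then be written in one go from the root reference, assuming an SDK version that supports multi-path updates; otherwise each path can be set individually.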
