MongoDB queries optimisation - javascript

I want to retrieve several pieces of information from my User model, which looks like this:
var userSchema = new mongoose.Schema({
email: { type: String, unique: true, lowercase: true },
password: String,
created_at: Date,
updated_at: Date,
genre : { type: String, enum: ['Teacher', 'Student', 'Guest'] },
role : { type: String, enum: ['user', 'admin'], default: 'user' },
active : { type: Boolean, default: false },
profile: {
name : { type: String, default: '' },
headline : { type: String, default: '' },
description : { type: String, default: '' },
gender : { type: String, default: '' },
ethnicity : { type: String, default: '' },
age : { type: String, default: '' }
},
contacts : {
email : { type: String, default: '' },
phone : { type: String, default: '' },
website : { type: String, default: '' }
},
location : {
formattedAddress : { type: String, default: '' },
country : { type: String, default: '' },
countryCode : { type: String, default: '' },
state : { type: String, default: '' },
city : { type: String, default: '' },
postcode : { type: String, default: '' },
lat : { type: String, default: '' },
lng : { type: String, default: '' }
}
});
On the homepage I have a location filter that lets you browse users by country or city.
Each entry also shows the number of users in it:
United Kingdom
All Cities (300)
London (150)
Liverpool (80)
Manchester (70)
France
All Cities (50)
Paris (30)
Lille (20)
Nederland
All Cities (10)
Amsterdam (10)
Etc...
That is the homepage. I also have Students and Teachers pages, where I want the same information restricted to, for example, how many teachers there are in each country and city.
What I'm trying to do is build a single MongoDB query that retrieves all of this information.
At the moment the query looks like this:
User.aggregate([
{
$group: {
_id: { city: '$location.city', country: '$location.country', genre: '$genre' },
count: { $sum: 1 }
}
},
{
$group: {
_id: '$_id.country',
count: { $sum: '$count' },
cities: {
$push: {
city: '$_id.city',
count: '$count'
}
},
genres: {
$push: {
genre: '$_id.genre',
count: '$count'
}
}
}
}
], function(err, results) {
if (err) return next(err);
res.json({
res: results
});
});
The problem is that I don't know how to get all the information I need.
I don't know how to get the total number of users in every country.
I have the user count for each country.
I have the user count for each city.
I don't know how to get the same counts for a specific genre.
Is it possible to get all of this information with a single query in Mongo?
Otherwise, I could chain a few promises that make 2 or 3 separate requests to Mongo, like this:
getSomething
.then(getSomethingElse)
.then(getSomethingElseAgain)
.done
I'm sure it would be easier to store the counts every time the data changes, but is that good for performance when there are more than 5,000 or 10,000 users in the DB?
Sorry, I'm still learning, and I think these things are crucial to understanding MongoDB performance and optimisation.
Thanks

What you want is a "faceted search" result, where you hold statistics about the matched terms in the current result set. While there are products that "appear" to do all the work in a single response, most generic storage engines need multiple operations to produce this.
With MongoDB you can use two queries: one to get the results themselves and another to get the facet information. This gives results similar to the faceted results available from dedicated search engine products like Solr or ElasticSearch.
But to do this effectively, you want to store the facet data in your documents in a form that is easy to query. A very effective form for what you want is an array of tokenized data:
{
"otherData": "something",
"facets": [
"country:UK",
"city:London-UK",
"genre:Student"
]
}
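As a minimal sketch, the tokens could be derived from the fields the schema already has whenever a user is saved. The helper name `buildFacets` and the use of `location.countryCode` for the country token are assumptions made for illustration, not something the original schema prescribes:

```javascript
// Hypothetical helper: derive facet tokens from a user document.
// Assumes the `genre` and `location` fields from the question's schema.
function buildFacets(user) {
  const facets = [];
  const { countryCode, city } = user.location || {};

  if (countryCode) {
    facets.push(`country:${countryCode}`);
    // The city token carries its country after a hyphen so the two can be related later.
    if (city) facets.push(`city:${city}-${countryCode}`);
  }
  if (user.genre) facets.push(`genre:${user.genre}`);

  return facets;
}

const exampleFacets = buildFacets({
  genre: 'Student',
  location: { city: 'London', countryCode: 'UK' }
});
console.log(exampleFacets); // [ 'country:UK', 'city:London-UK', 'genre:Student' ]
```

The returned array would be stored on the document (for example in a pre-save hook), so the tokens stay in step with the fields they are derived from.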
So "facets" is a single field in your document rather than data spread over multiple locations. This makes it very easy to index and query. You can then aggregate across your results and get the totals for each facet:
User.aggregate(
  [
    { "$unwind": "$facets" },
    { "$group": {
      "_id": "$facets",
      "count": { "$sum": 1 }
    }}
  ],
  function(err,results) {
  }
);
Or more ideally with some criteria in $match:
User.aggregate(
  [
    { "$match": { "facets": { "$in": ["genre:Student"] } } },
    { "$unwind": "$facets" },
    { "$group": {
      "_id": "$facets",
      "count": { "$sum": 1 }
    }}
  ],
  function(err,results) {
  }
);
Ultimately giving a response like:
{ "_id": "country:FR", "count": 50 },
{ "_id": "country:UK", "count": 300 },
{ "_id": "city:London-UK", "count": 150 },
{ "_id": "genre:Student", "count": 500 }
Such a structure is easy to traverse and inspect for things like the discrete "country" and the "city" that belongs to a "country" as that data is just separated consistently by a hyphen "-".
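To illustrate the traversal, here is a rough sketch that folds the flat facet counts into a country-to-cities structure. The helper name `nestLocationFacets` and the output shape are assumptions; it relies on tokens shaped like "country:UK" and "city:London-UK" as shown above:

```javascript
// Hypothetical helper: turn flat facet counts into a country -> cities structure.
// Assumes tokens shaped like "country:UK" and "city:London-UK".
function nestLocationFacets(results) {
  const countries = {};
  const ensure = (code) =>
    (countries[code] = countries[code] || { count: 0, cities: {} });

  for (const { _id, count } of results) {
    const sep = _id.indexOf(':');
    const type = _id.slice(0, sep);
    const value = _id.slice(sep + 1);

    if (type === 'country') {
      ensure(value).count = count;
    } else if (type === 'city') {
      // Split on the last hyphen so city names containing hyphens survive.
      const cut = value.lastIndexOf('-');
      ensure(value.slice(cut + 1)).cities[value.slice(0, cut)] = count;
    }
    // Other facet types (such as "genre") are ignored here.
  }
  return countries;
}

const nested = nestLocationFacets([
  { _id: 'country:UK', count: 300 },
  { _id: 'city:London-UK', count: 150 },
  { _id: 'genre:Student', count: 500 }
]);
console.log(JSON.stringify(nested)); // {"UK":{"count":300,"cities":{"London":150}}}
```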
Trying to mash up documents within arrays is a bad idea. There is also a BSON size limit of 16MB to respect, which mashing results together (especially if you are trying to keep document content) is very likely to exceed in the response.
For something as simple as the "overall count" of results from such a query, just sum up the elements of a particular facet type, or issue the same query arguments to a .count() operation:
User.count({ "facets": { "$in": ["genre:Student"] } },function(err,count) {
});
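The first option, summing the counts of one facet type, might look like the sketch below. The helper name `totalForFacetType` is made up for illustration, and the total is only meaningful if every document carries exactly one token of that type:

```javascript
// Hypothetical helper: total the counts of every facet of one type, e.g. "country".
function totalForFacetType(results, type) {
  const prefix = `${type}:`;
  return results
    .filter(({ _id }) => _id.startsWith(prefix))
    .reduce((sum, { count }) => sum + count, 0);
}

const total = totalForFacetType(
  [
    { _id: 'country:FR', count: 50 },
    { _id: 'country:UK', count: 300 },
    { _id: 'city:London-UK', count: 150 }
  ],
  'country'
);
console.log(total); // 350
```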
As stated above, particularly when implementing "paging" of results, the roles of getting the "Result Count", the "Facet Counts" and the actual "Page of Results" are all delegated to separate queries to the server.
There is nothing wrong with submitting each of those queries to the server in parallel and then combining a structure to feed to your template or application looking much like the faceted search result from one of the search engine products that offers this kind of response.
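A minimal sketch of issuing the three queries in parallel follows. It assumes `User` is a Mongoose model with the `facets` array described earlier; the function name, the `page` and `pageSize` parameters and the shape of the returned object are illustrative only:

```javascript
// Sketch: run the page query, the total count and the facet counts in parallel.
// `User` is expected to be a Mongoose model; the other names are illustrative.
async function fetchFacetedPage(User, filter, page = 1, pageSize = 20) {
  const [items, total, facets] = await Promise.all([
    // The current page of results
    User.find(filter).skip((page - 1) * pageSize).limit(pageSize).lean().exec(),
    // The overall number of matching documents
    User.countDocuments(filter).exec(),
    // One { _id, count } document per facet token in the matching set
    User.aggregate([
      { $match: filter },
      { $unwind: '$facets' },
      { $group: { _id: '$facets', count: { $sum: 1 } } }
    ]).exec()
  ]);
  return { items, total, facets };
}
```

A caller would then combine the three parts into whatever structure the template expects, for example `fetchFacetedPage(User, { facets: { $in: ['genre:Student'] } }, 2)`.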
Concluding
So put something in your document to mark the facets in a single place. An array of tokenized strings works well for this purpose. It also works well with query forms such as $in and $all for either "or" or "and" conditions on facet selection combinations.
Don't try and mash results or nest additions just to match some perceived hierarchical structure, but rather traverse the results received and use simple patterns in the tokens. It's very simple to
Run paged queries for the content as separate queries to either facets or overall counts. Trying to push all content in arrays and then limit out just to get counts does not make sense. The same would apply to a RDBMS solution to do the same thing, where paging result counts and the current page are separate query operations.
There is more information written on the MongoDB Blog about Faceted Search with MongoDB that also explains some other options. There are also articles on integration with external search solutions using mongoconnector or other approaches.

Related

How can I improve my query speed in MongoDB, NodeJS?

I have a collection that holds values coming from sensors. My collection looks like this:
const MainSchema: Schema = new Schema(
{
deviceId: {
type: mongoose.Types.ObjectId,
required: true,
ref: 'Device',
},
sensorId: {
type: mongoose.Types.ObjectId,
default: null,
ref: 'Sensor',
},
value: {
type: Number,
},
date: {
type: Date,
},
},
{
versionKey: false,
}
);
I want to read data from this collection through my endpoint. The collection is expected to hold more than 300,000 documents. I want to fetch the measurements together with the sensor details (such as name and description from "Sensor").
My Sensor collection:
const Sensor: Schema = new Schema(
{
name: {
type: String,
required: true,
min: 3,
},
description: {
type: String,
default: null,
},
type: {
type: String,
},
},
{
timestamps: true,
versionKey: false,
}
);
I use two methods to get data from MainSchema. The first approach looks like this (using aggregate):
startDate, endDate and _sensorId are passed as parameters to these functions.
const data= await MainSchema.aggregate([
{
$lookup: {
from: 'Sensor',
localField: 'sensorId',
foreignField: '_id',
as: 'sensorDetail',
},
},
{
$unwind: '$sensorDetail',
},
{
$match: {
$and: [
{ sensorId: new Types.ObjectId(_sensorId) },
{
date: {
$gte: new Date(startDate),
$lt: new Date(endDate),
},
},
],
},
},
{
$project: {
sensorDetail: {
name: 1,
description: 1,
},
value: 1,
date: 1,
},
},
{
$sort: {
_id: 1,
},
},
]);
The second approach looks like this (using find and populate):
const data= await MainSchema.find({
sensorId: _sensorId,
date: {
$gte: new Date(startDate),
$lte: new Date(endDate),
},
})
.lean()
.sort({ date: 1 })
.populate('sensorId', { name: 1, description: 1});
Execution time for the same data set:
First approach: 25 - 30 seconds
Second approach: 11 - 15 seconds
So how can I get this data faster? Which one is best practice?
And what else can I do to improve the query speed?
Overall @NeNaD's answer touches on a lot of the important points. What I'm going to say in this one should be considered in addition to that other information.
Index
Just to clarify, the ideal index here would be a compound index of { sensorId: 1, date: 1 }. This index follows the ESR Guidance for index key ordering and will provide the most efficient retrieval of the data according to the query predicates specified in the $match stage.
If the index: true annotation in Mongoose creates two separate single field indexes, then you should go manually create this index in the collection. MongoDB will only use one of those indexes to execute this query which will not be as efficient as using the compound index described above.
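As a rough sketch, the compound index can be declared on the schema with Mongoose's `Schema.prototype.index()`. The wrapper function `addSensorDateIndex` is only there for illustration; in the question's code the call would simply be made on `MainSchema` before the model is compiled:

```javascript
// Sketch: declare the compound index on the schema itself.
// `schema` is expected to be the Mongoose schema for the measurements collection.
function addSensorDateIndex(schema) {
  // Equality field (sensorId) first, then the sort/range field (date).
  schema.index({ sensorId: 1, date: 1 });
  return schema;
}
```

The same index can also be created directly in the shell with `createIndex({ sensorId: 1, date: 1 })` on the collection.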
Also regarding the existing approach, what is the purpose of the trailing $sort?
If the application (a chart in this situation) does not need sorted results then you should remove this stage entirely. If the client does need sorted results then you should:
Move the $sort stage earlier in the pipeline (behind the $match), and
Test if including the sort field in the index improves performance.
As written, the $sort is currently a blocking operation, which prevents any results from being returned to the client until they have all been processed. If you move the $sort stage up and can change it to sort on date (which probably makes sense for sensor data), then it should automatically use the compound index mentioned earlier to provide the sort in a non-blocking manner.
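Put together, the reordered pipeline might look like the sketch below. The function name `buildSensorPipeline` is hypothetical, `sensorObjectId` is assumed to already be an ObjectId, and the stage contents are copied from the question:

```javascript
// Sketch: filter and sort first so a { sensorId: 1, date: 1 } index can serve both.
// `sensorObjectId` is assumed to already be an ObjectId; the dates are Date instances.
function buildSensorPipeline(sensorObjectId, startDate, endDate) {
  return [
    { $match: { sensorId: sensorObjectId, date: { $gte: startDate, $lt: endDate } } },
    // Sorting on date directly after the $match lets the index provide the order.
    { $sort: { date: 1 } },
    { $lookup: { from: 'Sensor', localField: 'sensorId', foreignField: '_id', as: 'sensorDetail' } },
    { $unwind: '$sensorDetail' },
    { $project: { value: 1, date: 1, 'sensorDetail.name': 1, 'sensorDetail.description': 1 } }
  ];
}
```

The result would be passed to `MainSchema.aggregate(...)`, and the explain plan can confirm whether the index is actually used for both the filter and the sort.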
Stage Ordering
Ordering of aggregation stages is important, both for semantic purposes as well as for performance reasons. The database itself will attempt to do various things (such as reordering stages) to improve performance so long as it does not logically change the result set in any way. Some of these optimizations are described here. As these are version specific anyway, you can always take a look at the explain plan to get a better indication of what specific changes the database has applied. The fact that performance did not improve when you manually moved the $match to the beginning (which is generally a best practice) could suggest that the database was able to automatically do that on your behalf.
Schema
I'm a little curious about the schema itself. Is there any reason that there are two separate collections here?
My guess is that this is mostly a play at 'normalization' to help reduce data duplication. That is mostly fine, unless you find yourself constantly performing $lookups like this for most of your read operations. You could certainly consider testing what performance (and storage) looks like if you combine them.
Also for this particular operation, would it make sense to just issue two separate queries, one to get the measurements and one to get the sensor data (a single time)? The aggregation matches on sensorId and the value of that field is what is then used to match against the _id field from the other collection. Unless I'm doing the logic wrong, this should be the same data for each of the source documents.
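A minimal sketch of the two-query variant is shown here. It assumes `Measurement` and `Sensor` are the Mongoose models for the two collections (the parameter names are illustrative) and returns the sensor once alongside its readings:

```javascript
// Sketch: fetch the sensor once and the measurements separately, then combine.
// `Measurement` and `Sensor` are expected to be Mongoose models (names are illustrative).
async function fetchSensorReadings(Measurement, Sensor, sensorId, startDate, endDate) {
  const [sensor, readings] = await Promise.all([
    // The sensor details are identical for every reading, so load them a single time.
    Sensor.findById(sensorId, { name: 1, description: 1 }).lean().exec(),
    Measurement.find(
      { sensorId, date: { $gte: startDate, $lt: endDate } },
      { value: 1, date: 1 }
    ).sort({ date: 1 }).lean().exec()
  ]);
  return { sensor, readings };
}
```

This avoids repeating the same sensor document on every row of the response, at the cost of the client reading it from a separate property.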
Time Series Collections
Somewhat related to schema, have you looked into using Time Series Collections? I don't know what your specific goals or pain points are, but it seems that you may be working with IoT data. Time Series collections are purpose-built to help handle use cases like that. Might be worth looking into as they may help you achieve your goals with less hassle or overhead.
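For reference, a sketch of what the schema options could look like with Mongoose's `timeseries` option is given below. The field names come from the question's schema, while the choice of `sensorId` as the meta field and the `granularity` value are guesses that would need to be checked against the actual data:

```javascript
// Sketch of schema options for a time series collection (requires MongoDB 5.0 or later).
// Field names are taken from the question's schema; the granularity is a guess.
const timeSeriesOptions = {
  versionKey: false,
  timeseries: {
    timeField: 'date',      // must hold a Date in every document
    metaField: 'sensorId',  // identifies the series a measurement belongs to
    granularity: 'seconds'
  }
};

// These options would take the place of the second argument passed to `new Schema(...)`.
console.log(timeSeriesOptions.timeseries.timeField); // date
```

The option only takes effect when the collection is first created, so existing data would have to be migrated into a new collection.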
First step
Create indexes for the sensorId and date properties in the collection. You can do this by specifying index: true in your model:
const MainSchema: Schema = new Schema(
{
deviceId: { type: mongoose.Types.ObjectId, required: true, ref: 'Device' },
sensorId: { type: mongoose.Types.ObjectId, default: null, ref: 'Sensor', index: true },
value: { type: Number },
date: { type: Date, index: true },
},
{
versionKey: false,
}
);
Second step
Aggregation queries can take advantage of indexes only if your $match stage is the first stage in the pipeline, so you should change the order of the stages in your aggregation query:
const data = await MainSchema.aggregate([
  {
    $match: {
      sensorId: new Types.ObjectId(_sensorId),
      date: {
        $gte: new Date(startDate),
        $lt: new Date(endDate),
      },
    },
  },
  {
    $lookup: {
      from: 'Sensor',
      localField: 'sensorId',
      foreignField: '_id',
      as: 'sensorDetail',
    },
  },
  {
    $unwind: '$sensorDetail',
  },
  {
    $project: {
      sensorDetail: {
        name: 1,
        description: 1,
      },
      value: 1,
      date: 1,
    },
  },
  {
    $sort: {
      _id: 1,
    },
  },
]);

Remove Object from Array in MongoDB given ID

I am making an application that works with playlists. I am using MongoDB with mongoose and storing videos in each playlist via an array, like so:
{
_id: ObjectId("61ca1d0ddb66b3c5ff93df9c")
name: "Playlist A"
videos: [
{
title: "Video 1",
url: "www.YouTube.com/video1",
startTime: "20",
endTime: "40",
_id: ObjectId("61ca1d1ddb66b3c5ff93e0ba")
},
...
]
}
I want to be able to remove a video from a playlist based on the _id of a video. I have tried looking online for solutions but the solutions I've found don't work. This is what I am trying:
Playlist.updateOne(
{ _id: req.params.playlistId },
{ $pull: { videos: { _id: req.params.vidId } } }
)
When I run the code above and log the result I get the following (not sure if this is relevant):
{
acknowledged: true,
modifiedCount: 1,
upsertedId: null,
upsertedCount: 0,
matchedCount: 1
}
Please let me know if you need any more information; this is my first time posting a question :)
This is a type mismatch: req.params.vidId is a string, while the _id stored on each video is an ObjectId. The solution is to convert the string (req.params.vidId) to an ObjectId, like so:
Playlist.updateOne(
  { _id: req.params.playlistId },
  { $pull: { videos: { _id: new mongoose.Types.ObjectId(req.params.vidId) } } }
)
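Because constructing an ObjectId from a malformed string generally throws, a route handler may want to check the parameter first. The guard below is a hypothetical helper, not part of Mongoose, and only accepts the 24-character hex form:

```javascript
// Hypothetical guard: accept only 24-character hex strings before building an ObjectId.
function isObjectIdString(value) {
  return typeof value === 'string' && /^[0-9a-fA-F]{24}$/.test(value);
}

console.log(isObjectIdString('61ca1d1ddb66b3c5ff93e0ba')); // true
console.log(isObjectIdString('video1'));                   // false
```

A handler could respond with a 400 status when the check fails, and inspect `modifiedCount` on the update result to tell whether a video was actually removed.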

How to search for partial match using index in fauna db

I have a faunadb collection of users. The data is as follows:
{
"username": "Hermione Granger",
"fullName": "Hermione Jean Granger",
"DOB": "19-September-1979",
"bloodStatus": "Muggle-Born",
"gender": "Female",
"parents": [
"Wendell Wilkins",
"Monica Wilkins"
]
}
When I use an index I have to search for the whole phrase, i.e. Hermione Granger. But I want to search for just Hermione and get the result.
I came across a solution that seems to work.
The code below uses the faunadb client.
"all_items" is an index set up on a collection in Fauna that returns all items in the collection.
The lambda searches on the title field.
This will return any document whose title partially matches the search term.
I know this is a bit late; I hope it helps anyone else who may be looking to do this.
const response = await faunaClient.query(
q.Map(
q.Filter(
q.Paginate(q.Match(q.Index("all_items"))),
q.Lambda((ref) =>
q.ContainsStr(
q.LowerCase(
q.Select(["data", "title"], q.Get(ref))
),
title // <= this is your search term (lowercase it, since the field value is lowercased before comparing)
)
)
),
q.Lambda((ref) => q.Get(ref))
)
);
The Match function only applies an exact comparison. Partial matches are not supported.
One approach that might work for you is to store fields that would contain multiple values that need to be indexed as arrays.
When you index a field whose value is an array, the index creates multiple index entries for the document so that any one of the array items can be used to match entries. Note that this strategy increases the read and write operations involved.
Here's an example:
> CreateCollection({ name: "u" })
{
ref: Collection("u"),
ts: 1618532727920000,
history_days: 30,
name: 'u'
}
> Create(Collection("u"), { data: { n: ["Hermione", "Granger"] }})
{
ref: Ref(Collection("u"), "295985674342892032"),
ts: 1618532785650000,
data: { n: [ 'Hermione', 'Granger' ] }
}
> Create(Collection("u"), { data: { n: ["Harry", "Potter"] }})
{
ref: Ref(Collection("u"), "295985684233060864"),
ts: 1618532795080000,
data: { n: [ 'Harry', 'Potter' ] }
}
> Create(Collection("u"), { data: { n: ["Ginny", "Potter"] }})
{
ref: Ref(Collection("u"), "295985689713967616"),
ts: 1618532800300000,
data: { n: [ 'Ginny', 'Potter' ] }
}
> CreateIndex({
name: "u_by_n",
source: Collection("u"),
terms: [
{ field: ["data", "n"] }
]
})
{
ref: Index("u_by_n"),
ts: 1618533007000000,
active: true,
serialized: true,
name: 'u_by_n',
source: Collection("u"),
terms: [ { field: [ 'data', 'n' ] } ],
partitions: 1
}
> Paginate(Match(Index("u_by_n"), ["Potter"]))
{
data: [
Ref(Collection("u"), "295985684233060864"),
Ref(Collection("u"), "295985689713967616")
]
}
Note that you cannot query for multiple array items in a single field:
> Paginate(Match(Index("u_by_n"), ["Harry", "Potter"]))
{ data: [] }
The reason is that the index has only one field defined in terms, and successful matches require sending an array having the same structure as terms to Match.
To be able to search for both the full username and its individual parts, I'd suggest storing both the string and the array version of the username field in your documents, e.g. username: 'Hermione Granger' and username_items: ['Hermione', 'Granger']. Then create one index for searching the string field and another for the array field, and you can search either way.
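A small sketch of preparing such a document before it is written follows. The helper name `withUsernameItems` is hypothetical, and splitting on whitespace is an assumption about how the name parts should be separated:

```javascript
// Hypothetical helper: add an array version of `username` for the array-based index.
function withUsernameItems(user) {
  return {
    ...user,
    // Split on runs of whitespace and drop empty pieces.
    username_items: user.username.split(/\s+/).filter(Boolean)
  };
}

const doc = withUsernameItems({ username: 'Hermione Granger', gender: 'Female' });
console.log(doc.username_items); // [ 'Hermione', 'Granger' ]
```

Since Match compares exactly, the search term still has to match one of the stored items character for character, including case.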

How to ensure randomatic creates a unique identifier?

I'm using the randomatic npm package to generate the IDs I store in MongoDB, as Mongo only creates a long ObjectId, which cannot be used for invoice numbers. My current model is:
orders: [
{
orderReference: { type: String },
orderStatus: { type: String },
orderType: { type: String },
orderDate: { type: Date, default: Date.now },
itemDetails: { type: String },
purchaseOrder: {
orderReference: { type: String },
orderStatus: { type: String },
orderType: { type: String },
orderDate: { type: Date, default: Date.now },
itemDetails: { type: String },
},
thirdPartyOrder: {
orderReference: { type: String },
orderType: { type: String },
orderDate: { type: Date },
itemDetails: { type: String },
},
platformRevenue: {
orderReference: { type: String },
orderType: { type: String },
orderDate: { type: Date, default: Date.now },
itemDetails: { type: String }
}
}
],
In the current model, how do I check that a newly generated ID does not duplicate one already used by another order? randomatic does not create a unique ID by itself; it just creates a random one, whereas in my use case I want a unique 4-DIGIT NUMERIC ID for orderReference.
Is it possible that unique four-digit random IDs will run out one day, and should I create sequential IDs with an increment of +1 instead?
What is the industry standard for creating IDs for order references? I have mostly seen 4 to 6 digit numeric IDs on the invoices I have received.
Please help and suggest the best approach.
I would rather use your second suggestion, sequential IDs with increments.
While it is possible to use random IDs, you would have to check whether a given ID is already in the database before inserting, and generate a new one whenever it is.
What's more, with sequential IDs you can keep (approximate) track of the number of invoices issued.
The limit is 10,000 different IDs either way (0000 to 9999).
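One common way to get sequential numbers out of MongoDB is a counter document incremented atomically with `$inc`. The sketch below assumes a Mongoose model `Counter` whose schema uses a string `_id` and a numeric `seq` field; the model, the `'orderReference'` key and both function names are hypothetical:

```javascript
// Hypothetical helper: format a sequence number as a four-digit order reference.
function formatOrderReference(seq) {
  if (!Number.isInteger(seq) || seq < 0 || seq > 9999) {
    throw new RangeError('sequence number does not fit in four digits');
  }
  return String(seq).padStart(4, '0');
}

// Sketch: atomically increment a counter document and format the result.
// `Counter` is expected to be a Mongoose model with a string `_id` and a numeric `seq`.
async function nextOrderReference(Counter) {
  const counter = await Counter.findOneAndUpdate(
    { _id: 'orderReference' },
    { $inc: { seq: 1 } },
    { new: true, upsert: true } // return the updated document, creating it if missing
  );
  return formatOrderReference(counter.seq);
}

console.log(formatOrderReference(42)); // 0042
```

The range check makes the four-digit limit explicit: once the counter passes 9999 the helper throws instead of silently producing a five-digit reference.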
Hope it helps!

Is it possible to use query projection on the same collection that has a $elemMatch projection?

I understand that you can limit the items in a subcollection array using $elemMatch as a projection. When used this way, it returns all fields of the matching subdocuments, regardless of whether query projections are also specified.
Is it possible to limit the fields returned in the matching subdocuments? How would you do so?
Using version 3.8.9.
Given the simple schemas:
var CommentSchema = mongoose.Schema({
body: String,
created: {
by: {
type: String,
required: true
},
date: {
type: Date,
default: Date.now
}
}
});
var BlogSchema = mongoose.Schema({
title: String,
blog: String,
created: {
by: {
type: String,
required: true
},
date: {
type: Date,
default: Date.now
}
},
comments: [CommentSchema]
});
var Blog = mongoose.model('Blog', BlogSchema);
Example
Blog.findOne({_id: id}, {_id: 1, comments: {$elemMatch: {'created.by': 'Jane'}, body: 1}}, function(err, blog) {
console.log(blog.toJSON());
});
// outputs:
{
_id: 532cb63e25e4ad524ba17102,
comments: [
{
_id: 53757456d9c99f00009cdb5b,
body: 'My awesome comment',
created: { by: 'Jane', date: Fri Mar 21 2014 20:34:45 GMT-0400 (EDT) }
}
]
}
// DESIRED OUTPUT
{
_id: 532cb63e25e4ad524ba17102,
comments: [
{
body: 'My awesome comment'
}
]
}
Yes, there are two ways to do this. You can either use $elemMatch on the projection side as you already have, with slight changes:
Model.findById(id,
  { "comments": { "$elemMatch": { "created.by": "Jane" } } },
  function(err,doc) {
    // work with doc here
  }
);
Or just add the condition to the query portion and use the positional $ operator:
Model.findOne(
  { "_id": id, "comments.created.by": "Jane" },
  { "comments.$": 1 },
  function(err,doc) {
    // work with doc here
  }
);
Either way is perfectly valid.
If you wanted something a little more involved than that, you can use the .aggregate() method and its $project operator instead:
Model.aggregate([
  // Still match the document
  { "$match": { "_id": id, "comments.created.by": "Jane" } },
  // Unwind the array
  { "$unwind": "$comments" },
  // Only match elements, there can be more than 1
  { "$match": { "_id": id, "comments.created.by": "Jane" } },
  // Project only what you want
  { "$project": {
    "comments": {
      "body": "$comments.body",
      "by": "$comments.created.by"
    }
  }},
  // Group back each document with the array if you want to
  { "$group": {
    "_id": "$_id",
    "comments": { "$push": "$comments" }
  }}
],
function(err,result) {
  // result holds the reshaped documents
});
So the aggregation framework can be used for a lot more than simply aggregating results. Its $project operator gives you more flexibility than is available to projection using .find(). It also allows you to filter and return multiple array results, which is also something that cannot be done with projection in .find().
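As a variation on the pipeline above, and not something the answer itself uses, servers that support the `$filter` and `$map` expressions can do the same reshaping inside a single `$project`, without unwinding and regrouping. In this sketch the function name is hypothetical and `blogId` is assumed to already be an ObjectId:

```javascript
// Sketch: keep only one author's comments and only their `body` field.
// `blogId` is assumed to already be an ObjectId; `author` is the name to match.
function buildCommentBodyPipeline(blogId, author) {
  return [
    { $match: { _id: blogId, 'comments.created.by': author } },
    { $project: {
      comments: {
        $map: {
          // First narrow the array to the matching comments...
          input: {
            $filter: {
              input: '$comments',
              as: 'c',
              cond: { $eq: ['$$c.created.by', author] }
            }
          },
          as: 'c',
          // ...then keep only the body of each one.
          in: { body: '$$c.body' }
        }
      }
    }}
  ];
}
```

The pipeline would be passed to `Blog.aggregate(...)` and should yield documents shaped like the desired output in the question.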
