Is there a way in MongoDB to update a document based on a query for the 'count' of docs in another collection?
I'm looking for options to make it an atomic operation.
To be more specific, the following is how the two collections are designed:
Books(Collection)
_id
Name
isReviewed [Boolean]
Reviews (Collection)
_id
bookID [_id in Books Collection]
Comments
I came up with this design because the "Reviews" array for a book would keep growing and its entries are mutable.
Now, there's a requirement to set the isReviewed flag in the "Book" document to "true" when a "Review" doc is created for the respective Book. The flag stays "true" as long as at least one associated review exists for the book.
The same flag is set to "false" when no reviews exist for the book. (The default value for the flag is "false" when the Book doc is created.)
When a Review is deleted, the count of reviews for the book is calculated to see whether the flag can be set to "false" (if the count is 0, set it to false).
This system is designed for a multi-user environment sharing all resources such as books and reviews. All users are given read/write permissions on all resources. (I know it sounds weird to allow all users to create/read/edit/delete all reviews, but let's say that's the case functionally.)
Now, considering the above case, how can I ensure I perform an update related to setting the "isReviewed" flag based on the 'count' in the 'Reviews' collection?
Is this a case which can't be solved without transactions (I mean, do I need RDBMS)? I'm open to redesign my collections as well.
Any help is appreciated, Thank you
Since MongoDB provides atomicity at the document level, you can pre-join your Books and Reviews collections as below:
db.Books.insert(
  {
    "_id" : 1,
    "Name": "ABC",
    "reviews": [ ],
    "count" : 0
  }
)
P.S.: a document can be at most 16 MB in size.
If you expect "reviews" to grow beyond that, you will have to split the data into two collections and handle the case in your code.
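With the embedded design above, adding or removing a review touches a single Books document, so each change can go through one update call. A minimal sketch, assuming each embedded review carries its own `_id` and that an `isReviewed` field is kept next to `count` (neither appears in the sample document). The helper names are hypothetical; they only build the filter and update documents that would be passed to `db.Books.updateOne(filter, update)`:

```javascript
// Hypothetical helpers: they only build the documents for updateOne().

// Add a review: push it, bump the counter and set the flag in one update.
function addReviewOp(bookId, review) {
  return {
    filter: { _id: bookId },
    update: {
      $push: { reviews: review },
      $inc: { count: 1 },
      $set: { isReviewed: true }
    }
  };
}

// Remove a review: the filter only matches while the review still exists,
// so two concurrent deletes of the same review cannot decrement twice.
function removeReviewOp(bookId, reviewId) {
  return {
    filter: { _id: bookId, "reviews._id": reviewId },
    update: {
      $pull: { reviews: { _id: reviewId } },
      $inc: { count: -1 }
    }
  };
}
```

This sketch does not clear the flag on removal; one option is to treat `count > 0` as the flag instead of storing it separately.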
I have a collection of employees that has data sent to it. Right now there are 4 employees, but eventually there will be many more.
I want to add a grouping feature so that the user can sort the employees by their group. I am trying to find the best way to assign these employees to groups. I found the reference field type in Cloud Firestore and thought I could use it to solve my problem, but I am stuck and not sure of the most efficient way to use it to link employees to a group.
This is my database. Right now I have the employees doc (ex. 2569) and inside that is a sub-collection with 2 documents in itself.
So end goal is to assign employees groups and then be able to sort and display them separately. Right now I have the group name assigned in articles/group -> groupName: "example".
(Hopefully I can display them with .where("groupName", "==", "example") somehow in code, without hard-coding the group name. The group name will be created by the user, so it could be anything.)
Is what I am doing a good start? I know this question is a little odd but I am stuck and could really use some pointers on where to head next.
A collection group query would allow you to query all articles regardless of which employee contained them:
db.collectionGroup('articles')
.where('groupName', '==', 'X')
.get()
This would match documents in any collection (i.e. employees) where the last part of the collection path is articles. If you would like to find the employees who belong to a certain groupName, you may want to find the parent by retrieving the collection this DocumentReference belongs to.
Once you have the parent of the CollectionReference, you will get a reference to the containing DocumentReference of your subcollection.
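A minimal sketch of that parent lookup, assuming `querySnapshot` is the result of the collection group query above. The function name is hypothetical; `docs`, `ref`, `parent` and `id` are the properties the walk relies on:

```javascript
// Collect the IDs of the employee documents that own the matched articles.
function employeeIdsFromArticles(querySnapshot) {
  const ids = new Set();
  for (const doc of querySnapshot.docs) {
    // doc.ref.parent is the "articles" collection;
    // its parent is the employee document that contains it.
    const employeeRef = doc.ref.parent.parent;
    if (employeeRef) ids.add(employeeRef.id);
  }
  return [...ids];
}
```

The set removes duplicates when one employee has several matching article documents.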
I currently have a few issues with my Firestore querying technique. This follows on from a Stack Overflow post I made recently, "Querying with two array with firestore security rules".
The answer proposed adding the "ids" to an object, with the id as the key and the value simply being "true". I have done this, and now my structure looks like so:
This leaves me with this query:
db.collection('Depots')
.where(`products.${productId}`, '==', true)
.where(`users.${userId}`, '==', true)
.where('created', '>', 1585998560500)
.orderBy('created', 'asc')
.get();
This query throws an error asking me to create an index:
The query requires an index. You can create it here: ...
However, this tries to index the specific object key, i.e. QXooVYGBIFWKo6C, so products.QXooVYGBIFWKo6C. This is certainly not what I want, as this query changes and can have an infinite number of possibilities, which means I would have to create another index for each key entry in order to query it.
Is there any way to solve this issue? I am assuming it needs to index this query due to the different operators used in the query, so I was wondering if there were any workarounds to this issue.
Thank you very much in advance.
What you have here is a map field, for which indexes should usually be created automatically.
That indeed means that you'll have as many indexes as you have products, which means:
You are limited in how many products you can have, as there is a maximum of 40,000 index entries per document.
You pay more per document, as you pay for the storage of each index.
If these are not what you want, you'll have to switch back to your original model, with the query limitations you had there. There doesn't seem to be a solution that fits both of your requirements.
After our discussion in chat, this is the starting point I would suggest. Who knows what the end architecture will look like, but I think it will be this or something very close to it. You say that a user can exist in multiple depots at the same time and multiple depots can contain the same products, also at the same time. You also said that a depot can never have more than 40 users at a given time, so an array of 40 users would certainly not encroach on Firestore's document limit of 1,048,576 bytes.
[collection]
<documentId>
- field: value
[depots]
<UUID>
- depotId: string "depot456"
- productCount: num 5,000
<UUID>
- depotId: string "depot789"
- productCount: num 4,500
[products]
<UUID>
- productId: string "lotion123"
- depotId: string "depot456"
- users: [string] ["user10", "user27", "user33"]
<UUID>
- productId: string "lotion123"
- depotId: string "depot789"
- users: [string] ["user10", "user17", "user50"]
[users]
<userId>
- depots: [string] ["depot456", "depot999"]
<userId>
- depots: [string] ["depot333", "depot999"]
In NoSQL, storage is cheap and computation isn't, so denormalize your data as much as you need to make your queries possible and efficient (fast and cheap).
To find all depots in a single query where user10 and lotion123 are both true, query the products collection where productId equals x and users array-contains y and collect the depotId values from those results. If you want to preserve the array-contains operation for something else, you'd have to denormalize your data further (replace the array for a single user). Or you could split this query into two separate queries.
With this model, when a user leaves a depot, get all products where users array-contains that user and remove that userId from the array. And when a user joins a depot, get all products where depotId equals x and append that userId to the array.
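To make the lookup concrete, here is an in-memory stand-in for the query on the products collection, run against the two sample product documents from the layout above. It is plain JavaScript rather than a Firestore call, and the function name `depotsFor` is hypothetical:

```javascript
// Sample documents copied from the layout above.
const products = [
  { productId: "lotion123", depotId: "depot456", users: ["user10", "user27", "user33"] },
  { productId: "lotion123", depotId: "depot789", users: ["user10", "user17", "user50"] }
];

// In-memory equivalent of:
//   where('productId', '==', productId).where('users', 'array-contains', userId)
// followed by collecting the depotId of every match.
function depotsFor(productDocs, productId, userId) {
  return productDocs
    .filter(p => p.productId === productId && p.users.includes(userId))
    .map(p => p.depotId);
}

console.log(depotsFor(products, "lotion123", "user10")); // [ 'depot456', 'depot789' ]
console.log(depotsFor(products, "lotion123", "user17")); // [ 'depot789' ]
```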
Watch this video, and others by Rick, to get a solid handle on NoSQL: https://www.youtube.com/watch?v=HaEPXoXVf2k
@danwillm If you are not sure about the number of users and products, then your DB structure seems unfit for this situation, because Firestore documents have size and length limitations.
You should rather create a separate collection for products and users i.e normalize your data and have a reference for the user in the product collection.
User :
{
userId: documentId,
name: John,
...otherInfo
}
Product :
{
productId: documentId,
createdBy: userId,
createdOn:date,
productName:"exa",
...otherInfo
}
This way the size of each document stays bounded; try to avoid maps/arrays in Firestore if you are not sure about their size.
Also, in this case, the number of queries would be increased but you don't need many indexes in this case.
I've got a node/mongo web application. In the mongo database, I have a (very large -- in the millions) collection of documents called "stories". Each story has a few basic properties ("title", "author", "text", etc). I'm experimenting with a few machine learning algorithms that attempt to classify the text into different "moods".
Each of the different classifier algorithms has a record in the "algorithms" collection, which consists of a name and a "score", which represents how successfully the algorithm is performing.
I store all the classification results in a collection called "detectedMoods"; each record holds the storyID, algorithmID, and a list of detected moods.
Here's where we finally get to my question:
I want to keep a list of detected moods for each story IN the story document, where the list in the story document is the one generated by the Algorithm that currently has the highest "score". There's no guarantee that every algorithm will have been run on every story, and it's always possible to either add a new Algorithm with a higher score, or for the scores to change... though that will NOT happen very often (maybe once every couple of days). So it needs to work something like this:
For each Story, among the Algorithms where we have a detectedMoods record for this Story's storyID, find the one with the highest Score, and then store that algorithm's detected moods list in the "detectedMoods" property of the Story along with the algorithmID of the Algorithm that was used.
It feels like some variation of Map-Reduce makes sense here, but I can't quite figure out how to fit it into that model... Any thoughts? Do I need to do some scripting, or is this doable within a single Mongo command?
== Update, per request ==
Commenter requested example documents, so here goes.
Stories collection:
{
"_id":"s01",
"Title":"Misery",
"Author":"Stephen King",
"Text":"......",
}
{
"_id":"s02",
"Title":"Catch-22",
"Author":"Joseph Heller",
"Text":"......",
}
Algorithms collection:
{
"_id":"c01",
"Name":"algorithm A",
"Score":104
}
{
"_id":"c02",
"Name":"algorithm B",
"Score":22
}
DetectedMoods collection:
{
"_id":"fh3fha",
"algorithmID":"c01",
"storyID":"s01",
"moods":["desperate","afraid","bitter"]
}
{
"_id":"m12y49",
"algorithmID":"c02",
"storyID":"s01",
"moods":["bored","unhappy"]
}
{
"_id":"fj37ah",
"algorithmID":"c02",
"storyID":"s02",
"moods":["confused"]
}
Stories collection updates from pseudo-map-reduce:
{
<...Misery...>
"moods":["desperate","afraid","bitter"],
"algorithm":"c01"
}
{
<...Catch-22...>
"moods":["confused"],
"algorithm":"c02"
}
So, both algorithms (c01 and c02) were used to process "Misery", and as c01 has a better Score than c02, its results are the ones that get stored in the Story document for that story, along with a property that shows c01 was the source of those moods. However, "Catch-22" was processed only with c02, so it's the best option we have for that story, and thus its list of moods for that story is the one that gets stored in the Story document.
Hopefully that clarifies things.
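The selection rule described above can be sketched in plain JavaScript. This is an in-memory illustration of the logic only, not a MongoDB command, and the function name is hypothetical:

```javascript
// For each story, keep the detectedMoods record whose algorithm has the
// highest Score among the algorithms that were run on that story.
function bestMoodsPerStory(algorithms, detectedMoods) {
  const scoreOf = new Map(algorithms.map(a => [a._id, a.Score]));
  const best = new Map(); // storyID -> best record seen so far
  for (const rec of detectedMoods) {
    const current = best.get(rec.storyID);
    if (!current || scoreOf.get(rec.algorithmID) > scoreOf.get(current.algorithmID)) {
      best.set(rec.storyID, rec);
    }
  }
  // Shape each winner like the fields to be stored on the story document.
  const updates = {};
  for (const [storyID, rec] of best) {
    updates[storyID] = { moods: rec.moods, algorithm: rec.algorithmID };
  }
  return updates;
}
```

Run on the example documents, this yields the c01 moods for s01 and the c02 moods for s02, matching the expected updates listed above.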
The lookback api docs say PortfolioItem field is an index. Is the lowest portfolio item type also an index?
E.g: the Portfolio Item types in my workspace are Product, Milestone and Feature. will Feature be an index in Lookback API in addition to PortfolioItem?
The reason I ask is because only top-level UserStories have the PortfolioItem field, but both top-level and child UserStories have the Feature field. I want to query all User Stories under a particular Feature, which means I can't use PortfolioItem field, because it will not include child User Stories, only top-level User Stories.
Example of what I want to do if Feature is indexed:
Ext.create('Rally.data.lookback.SnapshotStore', {
listeners: {
load: function(store, data, success) {
//do stuff
}
},
autoLoad:true,
limit:Infinity,
fetch: ['ScheduleState', 'PlanEstimate', 'Feature', 'ObjectID'],
compress:true,
find: {
_TypeHierarchy: 'HierarchicalRequirement',
Children: null,
Release: //a release OID
},
hydrate: ['ScheduleState']
});
There may be some confusion with the use of the word 'index'. Some fields are "indexed" for fast lookup..."Feature" isn't one of them, though it is a valid field and you can search for it. More correctly, the field that is the lowest-level Portfolio Item type is kept in the snapshots.* Given what you're asking for, adding "Feature": {oid} to the find should give you what you want.
* The distinction is due to the fact that the label "Feature" can be changed to something else, so what is "Feature" in one workspace might be "Thing" in another.
The _ItemHierarchy field includes all levels of PortfolioItems through all levels of Stories to Defects, Tasks and (I'm pretty sure) TestCases. So, if you want "all User Stories under a particular Feature", simply specify find: {_ItemHierarchy: 1234567} where 1234567 is the ObjectID of the Feature. You could combine this with the _TypeHierarchy and Release clauses. If you combine it with the Children and _TypeHierarchy clauses as you propose, that will give you only leaf stories as opposed to all the levels. This is ideal if you are doing aggregations on fields like sum of PlanEstimate (points) or TaskActual, etc.
Note, I don't think this has anything to do with being indexed, so I may be misunderstanding your question. Please accept my apology if that's the case.
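As a sketch, the combined find clause could be built like this. The helper name is hypothetical; the field names are the ones used in the question and the answers above:

```javascript
// Hypothetical helper: builds the `find` clause for the SnapshotStore config.
function leafStoriesUnderFeature(featureOid, releaseOid) {
  return {
    _ItemHierarchy: featureOid,                // everything under this Feature
    _TypeHierarchy: 'HierarchicalRequirement', // user stories only
    Children: null,                            // leaf stories only
    Release: releaseOid
  };
}
```

The returned object would replace the `find` value in the store configuration from the question.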
I want to implement pagination on top of a MongoDB. For my range query, I thought about using ObjectIDs:
db.tweets.find({ _id: { $lt: maxID } }, { limit: 50 })
However, according to the docs, the structure of the ObjectID means that "ObjectId values do not represent a strict insertion order":
The relationship between the order of ObjectId values and generation time is not strict within a single second. If multiple systems, or multiple processes or threads on a single system generate values, within a single second; ObjectId values do not represent a strict insertion order. Clock skew between clients can also result in non-strict ordering even for values, because client drivers generate ObjectId values, not the mongod process.
I then thought about querying with a timestamp:
db.tweets.find({ created: { $lt: maxDate } }, { limit: 50 })
However, there is no guarantee the date will be unique — it's quite likely that two documents could be created within the same second. This means documents could be missed when paging.
Is there any sort of ranged query that would provide me with more stability?
It is perfectly fine to use ObjectId() though your syntax for pagination is wrong. You want:
db.tweets.find().limit(50).sort({"_id":-1});
This says you want tweets sorted by _id value in descending order and you want the most recent 50. Your problem is the fact that pagination is tricky when the current result set is changing, so rather than using skip for the next page, you want to make note of the smallest _id in the result set (the 50th most recent _id value) and then get the next page with:
db.tweets.find( {_id : { "$lt" : <50th _id> } } ).limit(50).sort({"_id":-1});
This will give you the next "most recent" tweets, without new incoming tweets messing up your pagination back through time.
There is absolutely no need to worry about whether _id value is strictly corresponding to insertion order - it will be 99.999% close enough, and no one actually cares on the sub-second level which tweet came first - you might even notice Twitter frequently displays tweets out of order, it's just not that critical.
If it is critical, then you would have to use the same technique but with "tweet date" where that date would have to be a timestamp, rather than just a date.
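The effect can be seen with a small in-memory stand-in for the two queries above. Plain numbers are used in place of ObjectIds and the function name is hypothetical; the point is only that a newly inserted item does not shift the second page:

```javascript
// In-memory equivalent of:
//   db.tweets.find({_id: {$lt: beforeId}}).sort({_id: -1}).limit(pageSize)
// Passing null as beforeId fetches the first page.
function pageBefore(tweets, beforeId, pageSize) {
  return tweets
    .filter(t => beforeId === null || t._id < beforeId)
    .sort((a, b) => b._id - a._id)
    .slice(0, pageSize);
}

const tweets = [1, 2, 3, 4, 5, 6, 7].map(n => ({ _id: n }));
const firstPage = pageBefore(tweets, null, 3);      // _id 7, 6, 5
const lastSeen = firstPage[firstPage.length - 1]._id;

tweets.push({ _id: 8 });                            // a new tweet arrives
const secondPage = pageBefore(tweets, lastSeen, 3); // _id 4, 3, 2
```

With skip(3) instead, the second page would have started at _id 5 after the insert, repeating an item the user has already seen.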
Wouldn't a tweet's "actual" timestamp (i.e. the time it was tweeted, and the criterion you want it sorted by) be different from its "insertion" timestamp (i.e. the time it was added to the local collection)? This depends on your application, of course, but it's a likely scenario that tweet inserts could be batched or otherwise end up being inserted in the "wrong" order. So, unless you work at Twitter (and have access to collections inserted in correct order), you wouldn't be able to rely just on $natural or ObjectID for sorting logic.
Mongo docs suggest skip and limit for paging:
db.tweets.find({created: {$lt: maxID}}).
sort({created: -1, username: 1}).
skip(50).limit(50); //second page
There is, however, a performance concern when using skip:
The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to get the offset or skip position before beginning to return result. As offset increases, cursor.skip() will become slower and more CPU intensive.
This happens because skip does not fit into the MapReduce model and is not an operation that would scale well, you have to wait for a sorted collection to become available before it can be "sliced". Now limit(n) sounds like an equally poor method as it applies a similar constraint "from the other end"; however with sorting applied, the engine is able to somewhat optimize the process by only keeping in memory n elements per shard as it traverses the collection.
An alternative is to use range based paging. After retrieving the first page of tweets, you know what the created value is for the last tweet, so all you have to do is substitute the original maxID with this new value:
db.tweets.find({created: {$lt: lastTweetOnCurrentPageCreated}}).
sort({created: -1, username: 1}).
limit(50); //next page
Performing a find condition like this can be easily parallelized. But how to deal with pages other than the next one? You don't know the begin date for pages number 5, 10, 20, or even the previous page! @SergioTulentsev suggests creative chaining of methods but I would advocate pre-calculating first-last ranges of the aggregate field in a separate pages collection; these could be re-calculated on update. Furthermore, if you're not happy with DateTime (note the performance remarks) or are concerned about duplicate values, you should consider compound indexes on timestamp + account tie (since a user can't tweet twice at the same time), or even an artificial aggregate of the two:
db.pages.
find({pagenum: 3})
> {pagenum:3, begin:"01-01-2014#BillGates", end:"03-01-2014#big_ben_clock"}
db.tweets.
find({_sortdate: {$lt: "03-01-2014#big_ben_clock", $gt: "01-01-2014#BillGates"}}).
sort({_sortdate: -1}).
limit(50) //third page
Using an aggregate field for sorting will work "on the fold" (although perhaps there are more kosher ways to deal with the condition). This could be set up as a unique index with values corrected at insert time, with a single tweet document looking like
{
_id: ...,
created: ..., //to be used in markup
user: ..., //also to be used in markup
_sortdate: "01-01-2014#BillGates" //sorting only, use date AND time
}
The following approach will work even if multiple documents are inserted/updated in the same millisecond, even from multiple clients (each of which generates ObjectIds). For simplicity, in the following queries I am projecting only _id and lastModifiedDate.
First page: fetch the results sorted by lastModifiedDate (descending), then ObjectId (ascending).
db.product.find({},{"_id":1,"lastModifiedDate":1}).sort({"lastModifiedDate":-1, "_id":1}).limit(2)
Note down the ObjectId and lastModifiedDate of the last record fetched in this page. (loid, lmd)
For the second page, add a query condition: (lastModifiedDate = lmd AND _id > loid) OR (lastModifiedDate < lmd)
db.product.find({$or:[{"lastModifiedDate":{$lt:lmd}},{$and:[{"lastModifiedDate":lmd},{"_id":{$gt:loid}}]}]},{"_id":1,"lastModifiedDate":1}).sort({"lastModifiedDate":-1, "_id":1}).limit(2)
Repeat the same for subsequent pages.
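Since the same condition is rebuilt for every page, it can be wrapped in a small helper. A minimal sketch with a hypothetical function name; it encodes (lastModifiedDate < lmd) OR (lastModifiedDate = lmd AND _id > loid), which matches a sort of {lastModifiedDate: -1, _id: 1}:

```javascript
// Builds the filter for the page that follows the record (lmd, loid).
function nextPageFilter(lmd, loid) {
  return {
    $or: [
      // strictly older records come later in the descending date order
      { lastModifiedDate: { $lt: lmd } },
      // same timestamp: fall back to ascending _id as the tie-breaker
      { lastModifiedDate: lmd, _id: { $gt: loid } }
    ]
  };
}
```

The result would be passed as the first argument to find(), with the same projection, sort and limit as for the first page.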
ObjectIds should be good enough for pagination if you limit your queries to the previous second (or don't care about the subsecond possibility of weirdness). If that is not good enough for your needs then you will need to implement an ID generation system that works like an auto-increment.
Update:
To query the previous second of ObjectIds you will need to construct an ObjectID manually.
See the specification of ObjectId http://docs.mongodb.org/manual/reference/object-id/
Try using this expression to do it from a mongos.
{ _id :
{
$lt : ObjectId(Math.floor((new Date).getTime()/1000 - 1).toString(16)+"ffffffffffffffff")
}
}
The 'f's at the end max out the bits that are not part of the timestamp, since you are doing a less-than query.
I recommend doing the actual ObjectId creation on your application server rather than on the mongos, since this type of calculation can slow you down if you have many users.
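On the application server, the hex string could be computed with a helper like the one below before being handed to the driver's ObjectId constructor. A minimal sketch with a hypothetical function name; it produces only the 24-character string, not the ObjectId itself:

```javascript
// Hex string of the largest ObjectId whose timestamp is one second
// before the second that `now` falls in.
function objectIdHexBefore(now) {
  const seconds = Math.floor(now.getTime() / 1000 - 1);
  // 8 hex digits of timestamp followed by 16 'f' digits = 24 hex characters.
  return seconds.toString(16).padStart(8, "0") + "f".repeat(16);
}

console.log(objectIdHexBefore(new Date(1000000000000)));
// 3b9ac9ffffffffffffffffff
```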
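I have built pagination using the MongoDB _id this way.
// import ObjectId from mongodb
let sortOrder = -1;
let query = {}; // find() expects a filter object, not an array
if (prev) {
  // previous page: documents with a larger _id, read in ascending order
  sortOrder = 1;
  query = {title: 'findTitle', _id: {$gt: ObjectId('_idValue')}};
}
if (next) {
  // next page: documents with a smaller _id, read in descending order
  sortOrder = -1;
  query = {title: 'findTitle', _id: {$lt: ObjectId('_idValue')}};
}
db.collection.find(query).limit(10).sort({_id: sortOrder})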