I've got a node/mongo web application. In the mongo database, I have a (very large -- in the millions) collection of documents called "stories". Each story has a few basic properties ("title", "author", "text", etc). I'm experimenting with a few machine learning algorithms that attempt to classify the text into different "moods".
Each of the different classifier algorithms has a record in the "algorithms" collection, which consists of a name and a "score", which represents how successfully the algorithm is performing.
I store all the classification results in a collection called "detectedMoods"; each record holds the storyID, algorithmID, and a list of detected moods.
Here's where we finally get to my question:
I want to keep a list of detected moods for each story IN the story document, where the list in the story document is the one generated by the Algorithm that currently has the highest "score". There's no guarantee that every algorithm will have been run on every story, and it's always possible to either add a new Algorithm with a higher score, or for the scores to change... though that will NOT happen very often (maybe once every couple of days). So it needs to work something like this:
For each Story, among the Algorithms where we have a detectedMoods record for this Story's storyID, find the one with the highest Score, and then store that algorithm's detected moods list in the "detectedMoods" property of the Story along with the algorithmID of the Algorithm that was used.
It feels like some variation of Map-Reduce makes sense here, but I can't figure out how to fit it into that model... Any thoughts? Do I need to do some scripting, or is this doable within a single Mongo command?
== Update, per request ==
A commenter requested example documents, so here goes.
Stories collection:
{
"_id":"s01",
"Title":"Misery",
"Author":"Stephen King",
"Text":"......",
}
{
"_id":"s02",
"Title":"Catch-22",
"Author":"Joseph Heller",
"Text":"......",
}
Algorithms collection:
{
"_id":"c01",
"Name":"algorithm A",
"Score":104
}
{
"_id":"c02",
"Name":"algorithm B",
"Score":22
}
DetectedMoods collection:
{
"_id":"fh3fha",
"algorithmID":"c01",
"storyID":"s01",
"moods":["desperate","afraid","bitter"]
}
{
"_id":"m12y49",
"algorithmID":"c02",
"storyID":"s01",
"moods":["bored","unhappy"]
}
{
"_id":"fj37ah",
"algorithmID":"c02",
"storyID":"s02",
"moods":["confused"]
}
Stories collection updates from pseudo-map-reduce:
{
<...Misery...>
"moods":["desperate","afraid","bitter"],
"algorithm":"c01"
}
{
<...Catch-22...>
"moods":["confused"],
"algorithm":"c02"
}
So, both algorithms (c01 and c02) were used to process "Misery", and as c01 has a better Score than c02, its results are the ones that get stored in the Story document for that story, along with a property that shows c01 was the source of those moods. However, "Catch-22" was processed only with c02, so it's the best option we have for that story, and thus its list of moods for that story is the one that gets stored in the Story document.
Hopefully that clarifies things.
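For what it's worth, my best guess so far is that on a modern MongoDB (4.2+ for $merge; older versions would need map-reduce or client-side scripting) this could be a single aggregation over detectedMoods. This is only a sketch based on the collection and field names in the examples above, and I haven't verified it:
db.detectedMoods.aggregate([
    // attach each result's algorithm so its Score is available
    { $lookup: { from: "algorithms", localField: "algorithmID", foreignField: "_id", as: "algo" } },
    { $unwind: "$algo" },
    // highest-scoring algorithm first, then keep one result per story
    { $sort: { "algo.Score": -1 } },
    { $group: { _id: "$storyID", moods: { $first: "$moods" }, algorithm: { $first: "$algorithmID" } } },
    // write the winning moods/algorithm back onto the matching story document
    { $merge: { into: "stories", on: "_id", whenMatched: "merge", whenNotMatched: "discard" } }
])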
The Redux documentation (https://redux.js.org/usage/structuring-reducers/normalizing-state-shape) recommends avoiding nested data and normalizing and flattening it instead. But I wonder whether it really is a good idea to flatten the state in the case of nested one-to-many relationships.
Let's say I want to model a collection of books, where each book has a certain number of pages, and each page has a certain number of sentences. Every sentence is in only one page and every page is in only one book. From what I understand, the Redux docs suggest the following normalized flat structure. Note that pages and sentences need ids that are globally unique.
state = {
    books: {
        allIds: ["book1", "book2"],
        byId: {
            "book1": {
                /*name: "First book",*/
                pages: ["page1", "page2"],
            },
            "book2": {/*...*/},
        },
    },
    pages: {
        byId: {
            "page1": {
                /*pageNumber: "1",*/
                sentences: ["sentence1", "sentence2"]
            },
            "page2": {/*...*/},
        }
    },
    sentences: {
        byId: {
            "sentence1": {
                /*contents: "First sentence"*/
            },
            "sentence2": {/*...*/},
        }
    },
};
But in practice I found that this representation makes it very difficult to do certain operations. For instance (deep) cloning a book requires coming up with a whole lot of new unique ids (for the cloned book, all the cloned pages, and all the cloned sentences), but I have no idea where I can generate them: the reducer is not allowed to generate random ids, and the action creator only has access to the book id so it doesn’t know how many random ids need to be generated.
On the other hand, in the representation below where we nest the {allIds, byId} objects, cloning books is very easy: you just need to come up with one new id for the new book and then do a regular deep clone. Note that the ids are now only required to be locally unique.
state.books = {
    allIds: ["book1", "book2"],
    byId: {
        "book1": {
            /*name: "First book",*/
            pages: {
                allIds: ["page1", "page2"],
                byId: {
                    "page1": {
                        /*pageNumber: "1",*/
                        sentences: {
                            allIds: ["sentence1", "sentence2"],
                            byId: {
                                "sentence1": {
                                    /*contents: "First sentence"*/
                                },
                                "sentence2": {/*...*/},
                            }
                        }
                    },
                    "page2": {/*...*/},
                }
            }
        },
        "book2": {/*...*/}
    }
};
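Concretely, the "clone a book" case under this nested shape could look roughly like the following sketch (Redux Toolkit; nanoid, current, and the action name are mine, and the single new id is generated in the prepare callback rather than in the reducer):
import { createSlice, nanoid, current } from "@reduxjs/toolkit";

const booksSlice = createSlice({
    name: "books",
    initialState: { allIds: [], byId: {} },   // the nested shape shown above
    reducers: {
        bookCloned: {
            // the one new id is generated here, in the action creator
            prepare(sourceBookId) {
                return { payload: { sourceBookId, newBookId: nanoid() } };
            },
            reducer(state, action) {
                const { sourceBookId, newBookId } = action.payload;
                // plain snapshot of the source book, then a regular deep clone
                const source = current(state.byId[sourceBookId]);
                state.byId[newBookId] = JSON.parse(JSON.stringify(source));
                state.allIds.push(newBookId);
            },
        },
    },
});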
This nested representation also seems to avoid every single pitfall mentioned in the Redux docs:
When a piece of data is duplicated in several places, it becomes harder to make sure that it is updated appropriately.
Not relevant for one-to-many relationships as data is not duplicated to begin with.
Nested data means that the corresponding reducer logic has to be more nested and therefore more complex. In particular, trying to update a deeply nested field can become very ugly very fast.
Well, no: with Immer (included by default in Redux Toolkit), updating a deeply nested field is super easy, as the sketch below shows. Even without Immer, it’s not that complicated. On the other hand, the flat representation makes a lot of other things way more complex (in particular cloning a book, which seems incredibly complicated to me, but also adding or removing pages). Yes, in order to access a sentence the sentenceId is not enough, you now also need to pass the pageId and the bookId, but that’s just two more arguments to a bunch of function calls.
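A minimal sketch of such an update, assuming Redux Toolkit and the nested shape above (the action payload shape is mine):
import { createSlice } from "@reduxjs/toolkit";

const booksSlice = createSlice({
    name: "books",
    initialState: { allIds: [], byId: {} },   // the nested shape shown above
    reducers: {
        sentenceEdited(state, action) {
            const { bookId, pageId, sentenceId, contents } = action.payload;
            // Immer turns this "mutation" into a proper immutable update
            state.byId[bookId].pages.byId[pageId].sentences.byId[sentenceId].contents = contents;
        },
    },
});

export const { sentenceEdited } = booksSlice.actions;
export default booksSlice.reducer;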
Since immutable data updates require all ancestors in the state tree to be copied and updated as well, and new object references will cause connected UI components to re-render, an update to a deeply nested data object could force totally unrelated UI components to re-render even if the data they're displaying hasn't actually changed.
Doesn’t seem to be an issue either, as long as you don’t select the whole state.books.byId[bookId] object. If you only select state.books.byId[bookId].name and state.books.byId[bookId].pages.allIds, then you have all the information you need, and changing a page or a sentence won’t make the book component rerender. It seems just as efficient as the flat version. I guess there is slightly more boilerplate, especially if you need to select many other fields, but it shouldn’t be too hard to manage.
In summary, I think that turning an array of books/pages/sentences into an {allIds, byId} object and having every book/page/sentence component select its own data is definitely a good idea, but I really don’t see the point of going one step further and flattening out everything. Keeping the data nested seems significantly easier to work with and doesn’t seem to have any real drawbacks.
So I guess my question is:
Am I missing something?
Is there something else I would gain by flattening my state, which would make it worth figuring out how to clone a book?
How would I go about filtering a set of records based on their child records?
Let's say I have a collection Item that has a field called bagId referencing another collection, Bag. I'd like to find all Items where a field on the referenced Bag matches some clause.
I.e. something like db.Items.find({ "where bag.type == 'Paper'" }) (pseudo-syntax). How would I go about doing this in MongoDB? I understand I'd have to join on Bags and then match where Item.bagId == Bag._id.
I used Studio3T to convert a SQL GROUP BY to a Mongo aggregate, but I'm just wondering if there's any de facto way to do this. The options I see:
1. Perform a data migration to simply include Bag.type on every Item document (I don't want to get into the habit of continuously making schema changes every time I want to sort/filter Items by Bag fields).
2. Use something like https://github.com/meteorhacks/meteor-aggregate (no luck with that syntax yet).
3. Grapher (https://github.com/cult-of-coders/grapher): I played around with this briefly and while it's cool I'm not sure if it'll actually solve my problem. I can use it to add Bag.type to every Item returned, but I don't see how that could help me filter every Item by Bag.type.
Is this just one of the tradeoffs of using a NoSQL dbms? What option above is recommended or are there any other ideas?
Thanks
You could use the $in functionality of MongoDB. It would look something like this:
const bagsIds = Bags.find({type: 'paper'}, {fields: {"_id": 1}}).map(function(bag) { return bag._id; });
const items = Items.find( { bagId: { $in: bagsIds } } ).fetch();
It would take some testing to see whether the reactivity of this solution still works the way you expect, and whether it remains suitable for larger collections, compared to going for your first option and performing the migration.
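If the two-step $in approach becomes a bottleneck (it first pulls the matching Bag _ids into application memory), another possibility on MongoDB 3.2+ is to do the join server-side with the aggregation framework's $lookup. A rough sketch, assuming the Items/Bags collection and field names from the question:
// join each Item to its Bag, then keep only Items whose Bag is of type Paper
db.Items.aggregate([
    { $lookup: { from: "Bags", localField: "bagId", foreignField: "_id", as: "bag" } },
    { $unwind: "$bag" },
    { $match: { "bag.type": "Paper" } }
])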
Should I store objects in an array or inside an object, with top importance given to write speed?
I'm trying to decide whether data should be stored as an array of objects, or using nested objects inside a mongodb document.
In this particular case, I'm keeping track of a set of continually updating files that I add and update, where the file name acts as a key and the value records the number of lines processed within the file.
The document looks something like this:
{
    t_id: 1220,
    "some-other-info": {}, // there's other info here not updated frequently
    files: {
        "log1-txt": { filename: "log1.txt", numlines: 233, filesize: 19928 },
        "log2-txt": { filename: "log2.txt", numlines: 2, filesize: 843 }
    }
}
or this
{
    t_id: 1220,
    "some-other-info": {},
    files: [
        { filename: "log1.txt", numlines: 233, filesize: 19928 },
        { filename: "log2.txt", numlines: 2, filesize: 843 }
    ]
}
I am making the assumption that when handling a document, especially when it comes to updates, it is easier to deal with objects, because the location of the entry can be determined by its name; unlike an array, where I have to look through each element's value until I find the match.
Because the object key will have periods, I will need to convert (or drop) the periods to create a valid key (fi.le.log to filelog or fi-le-log).
I'm not worried about the files' possible duplicate names emerging (such as fi.le.log and fi-le.log) so I would prefer to use Objects, because the number of files is relatively small, but the updates are frequent.
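(For what it's worth, the key sanitization I have in mind is just a dot replacement, something like this sketch:)
// illustrative only: turn "fi.le.log" into a usable field name such as "fi-le-log"
const key = filename.replace(/\./g, "-");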
Or would it be better to handle this data in a separate collection for best write performance...
{
    "_id": ObjectId('56d9f1202d777d9806000003'),
    "t_id": "1220",
    "filename": "log1.txt",
    "filesize": 1843,
    "numlines": 554
},
{
    "_id": ObjectId('56d9f1392d777d9806000004'),
    "t_id": "1220",
    "filename": "log2.txt",
    "filesize": 5231,
    "numlines": 3027
}
From what I understand, you are talking about write speed, without any read considerations. So we have to think about how you will insert/update your document.
We have to compare (assuming you know the _id you are updating; replace {key} with the key name, in your example log1-txt or log2-txt):
db.Col.update({ _id: '' }, { $set: { 'files.{key}': object }})
vs
db.Col.update({ _id: '', 'files.filename': '{key}'}, { $set: { 'files.$': object }})
The second one means that MongoDB has to scan the array, find the matching index, and update it. The first one means MongoDB just updates the specified field.
The worst part: the second command will not work if the matching filename is not present in the array! So you have to execute it, check whether nMatched is 0, and create the entry if so. That's really bad for write speed (see MongoDB: upsert sub-document).
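A sketch of that fallback in the mongo shell (illustrative _id and file values):
// try to update the existing array entry in place
var res = db.Col.update(
    { _id: 1220, "files.filename": "log1.txt" },
    { $set: { "files.$": { filename: "log1.txt", numlines: 233, filesize: 19928 } } }
);
// nothing matched: the entry is not in the array yet, so push it instead
if (res.nMatched === 0) {
    db.Col.update(
        { _id: 1220 },
        { $push: { files: { filename: "log1.txt", numlines: 233, filesize: 19928 } } }
    );
}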
If you will never/almost never use read queries or the aggregation framework on this collection: go for the first one, it will be faster. If you want to aggregate, unwind, and do some analytics on the files you parsed to get statistics about file sizes and line counts, you may consider the second one; you will avoid some headaches.
Pure write speed will be better with the first solution.
I have two classes - _User and Car. A _User will have a low/limited number of Cars that they own. Each Car has only ONE owner and thus an "owner" column that is a pointer to the _User. When I go to the user's page, I want to see their _User info and all of their Cars. I would like to make one call, in Cloud Code if necessary.
Here is where I get confused. There are 3 ways I could do this -
1. In _User have a relationship column called "cars" that points to each individual Car. If so, how come I can't use the "include(cars)" function on a relation to include the Cars' data in my query?!!
_User.cars = relationship, Car.owner = _User(pointer)
2. Query the _User, and then query all Cars with (owner == _User.objectId) separately. This is two queries though.
_User.cars = null, Car.owner = _User(pointer)
3. In _User have an array of pointers column called "cars". Manually inject pointers to cars upon car creation. When querying the user I would use "include(cars)".
_User.cars = [Car(pointer)], Car.owner = _User(pointer)
What is your recommended way to do this and why? Which one is the fastest? The documentation just leaves me further confused.
I recommend the 3rd option, and yes, you can ask to include an array. You don't even need to "manually inject" the pointers; you just need to add the objects into the array and they'll automatically be converted into pointers.
You've got the right ideas. Just to clarify them a bit:
1. A relation. User can have a relation column called cars. To get from user to car, there's a user query and then a second query like user.relation("cars").query, on which you would .find().
2. What you might call a belongs_to pointer in Car. To get from user to car, you'd have a query to get your user, and then you create a carQuery like carQuery.equalTo("user", user).
3. An array of pointers. For small-sized collections, this is superior to the relation, because you can aggressively load cars when querying the user by saying include("cars") on a user query. Not sure if there's a second query under the covers - probably not if Parse (mongo) is storing these as embedded.
But I wouldn't get too tied up over one or two queries. Using the promise forms of find() will keep your code nice and tidy. There probably is a small speed advantage to the array technique, which is good while the collection size is small (<100 is my rule of thumb).
It's easy to google (or I'll add here if you have a specific question) code examples for maintaining the relations and for getting from user->car or from car->user for each approach.
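For example, a sketch of approach 3 with the Parse JS SDK (the Car field read at the end is illustrative; assumes the SDK is already initialized):
// one request: fetch the user and hydrate the pointers in the "cars" array
const query = new Parse.Query(Parse.User);
query.include("cars");
query.get(someUserId).then(function (user) {
    const cars = user.get("cars");      // full Car objects, not bare pointers
    cars.forEach(function (car) {
        console.log(car.get("model"));  // "model" is an illustrative Car field
    });
});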
I want to implement pagination on top of a MongoDB. For my range query, I thought about using ObjectIDs:
db.tweets.find({ _id: { $lt: maxID } }, { limit: 50 })
However, according to the docs, the structure of the ObjectID means that "ObjectId values do not represent a strict insertion order":
The relationship between the order of ObjectId values and generation time is not strict within a single second. If multiple systems, or multiple processes or threads on a single system generate values, within a single second; ObjectId values do not represent a strict insertion order. Clock skew between clients can also result in non-strict ordering even for values, because client drivers generate ObjectId values, not the mongod process.
I then thought about querying with a timestamp:
db.tweets.find({ created: { $lt: maxDate } }, { limit: 50 })
However, there is no guarantee the date will be unique — it's quite likely that two documents could be created within the same second. This means documents could be missed when paging.
Is there any sort of ranged query that would provide me with more stability?
It is perfectly fine to use ObjectId(), though your syntax for pagination is wrong. You want:
db.tweets.find().limit(50).sort({"_id":-1});
This says you want tweets sorted by _id value in descending order and you want the most recent 50. Your problem is the fact that pagination is tricky when the current result set is changing - so rather than using skip for the next page, you want to make note of the smallest _id in the result set (the 50th most recent _id value) and then get the next page with:
db.tweets.find( {_id : { "$lt" : <50th _id> } } ).limit(50).sort({"_id":-1});
This will give you the next "most recent" tweets, without new incoming tweets messing up your pagination back through time.
There is absolutely no need to worry about whether the _id value strictly corresponds to insertion order - it will be 99.999% close enough, and no one actually cares at the sub-second level which tweet came first - you might even notice Twitter frequently displays tweets out of order; it's just not that critical.
If it is critical, then you would have to use the same technique but with "tweet date" where that date would have to be a timestamp, rather than just a date.
Wouldn't a tweet's "actual" timestamp (i.e. the time tweeted, and the criterion you want it sorted by) be different from a tweet's "insertion" timestamp (i.e. the time it was added to the local collection)? This depends on your application, of course, but it's a likely scenario that tweet inserts could be batched or otherwise end up being inserted in the "wrong" order. So, unless you work at Twitter (and have access to collections inserted in correct order), you wouldn't be able to rely just on $natural or ObjectID for sorting logic.
Mongo docs suggest skip and limit for paging:
db.tweets.find({created: {$lt: maxID}}).
sort({created: -1, username: 1}).
skip(50).limit(50); //second page
There is, however, a performance concern when using skip:
The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to get the offset or skip position before beginning to return results. As the offset increases, cursor.skip() will become slower and more CPU intensive.
This happens because skip does not fit into the MapReduce model and is not an operation that would scale well, you have to wait for a sorted collection to become available before it can be "sliced". Now limit(n) sounds like an equally poor method as it applies a similar constraint "from the other end"; however with sorting applied, the engine is able to somewhat optimize the process by only keeping in memory n elements per shard as it traverses the collection.
An alternative is to use range based paging. After retrieving the first page of tweets, you know what the created value is for the last tweet, so all you have to do is substitute the original maxID with this new value:
db.tweets.find({created: {$lt: lastTweetOnCurrentPageCreated}}).
sort({created: -1, username: 1}).
limit(50); //next page
Performing a find condition like this can be easily parallelized. But how to deal with pages other than the next one? You don't know the begin date for pages number 5, 10, 20, or even the previous page! @SergioTulentsev suggests creative chaining of methods, but I would advocate pre-calculating first-last ranges of the aggregate field in a separate pages collection; these could be re-calculated on update. Furthermore, if you're not happy with DateTime (note the performance remarks) or are concerned about duplicate values, you should consider compound indexes on timestamp + account as a tie-breaker (since a user can't tweet twice at the same time), or even an artificial aggregate of the two:
db.pages.
find({pagenum: 3})
> {pagenum:3; begin:"01-01-2014#BillGates"; end:"03-01-2014#big_ben_clock"}
db.tweets.
find({_sortdate: {$lt: "03-01-2014#big_ben_clock", $gt: "01-01-2014#BillGates"}}).
sort({_sortdate: -1}).
limit(50) //third page
Using an aggregate field for sorting will work "on the fold" (although perhaps there are more kosher ways to deal with the condition). This could be set up as a unique index with values corrected at insert time, with a single tweet document looking like
{
_id: ...,
created: ..., //to be used in markup
user: ..., //also to be used in markup
_sortdate: "01-01-2014#BillGates" //sorting only, use date AND time
}
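The unique index behind that could be created with something like the following (a sketch; on very old shells this would be ensureIndex):
// unique index on the artificial sort field
db.tweets.createIndex({ _sortdate: -1 }, { unique: true })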
The following approach will work even if multiple documents are inserted/updated in the same millisecond, even from multiple clients (each of which generates its own ObjectIds). For simplicity, in the following queries I am projecting only _id and lastModifiedDate.
For the first page, fetch the results sorted by lastModifiedDate (descending) and _id (ascending):
db.product.find({},{"_id":1,"lastModifiedDate":1}).sort({"lastModifiedDate":-1, "_id":1}).limit(2)
Note down the ObjectId and lastModifiedDate of the last record fetched on this page (call them loid and lmd).
For the second page, add a query condition that matches (lastModifiedDate = lmd AND _id > loid) OR (lastModifiedDate < lmd):
db.product.find({$or:[{"lastModifiedDate":{$lt:lmd}},{$and:[{"lastModifiedDate":lmd},{"_id":{$gt:loid}}]}]},{"_id":1,"lastModifiedDate":1}).sort({"lastModifiedDate":-1, "_id":1}).limit(2)
Repeat the same for subsequent pages.
ObjectIds should be good enough for pagination if you limit your queries to the previous second (or don't care about the subsecond possibility of weirdness). If that is not good enough for your needs then you will need to implement an ID generation system that works like an auto-increment.
Update:
To query the previous second of ObjectIds you will need to construct an ObjectID manually.
See the specification of ObjectId http://docs.mongodb.org/manual/reference/object-id/
Try using this expression to do it from a mongos.
{
    _id: {
        $lt: ObjectId(Math.floor((new Date).getTime()/1000 - 1).toString(16) + "ffffffffffffffff")
    }
}
The 'f's at the end are to max out the possible random bits that are not associated with the timestamp, since you are doing a less-than query.
I recommend doing the actual ObjectId creation on your application server rather than on the mongos, since this type of calculation can slow you down if you have many users.
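For instance, with the Node.js driver the boundary could be built roughly like this (a sketch; ObjectId.createFromTime zero-fills the non-timestamp bytes, so $lt a boundary built from the current second matches everything created before that second; collection name and page size are illustrative):
const { ObjectId } = require('mongodb');

// sketch: build the boundary on the app server, then query everything
// created before the current second (db is an already-connected Db handle)
async function previousSecondPage(db) {
    const boundary = ObjectId.createFromTime(Math.floor(Date.now() / 1000));
    return db.collection('tweets')
        .find({ _id: { $lt: boundary } })
        .sort({ _id: -1 })
        .limit(50)
        .toArray();
}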
I have built pagination using the MongoDB _id in this way.
// import ObjectId from mongodb
let sortOrder = -1;
let query = {};

if (prev) {
    sortOrder = 1;
    // _idValue is the boundary _id taken from the page currently displayed
    query = { title: 'findTitle', _id: { $gt: ObjectId('_idValue') } };
}

if (next) {
    sortOrder = -1;
    query = { title: 'findTitle', _id: { $lt: ObjectId('_idValue') } };
}

db.collection.find(query).limit(10).sort({ _id: sortOrder });