MongoDB: how to find 10 random documents in a collection of 100? - javascript

Is MongoDB capable of finding a number of random documents without making multiple queries?
For example, I implemented this on the JS side after loading all the documents in the collection, which is wasteful - hence I wanted to check whether this can be done better with one db query.
The path I took on the JS side:
get all data
make an array of the IDs
shuffle array of IDs (random order)
splice the array to the number of documents required
create a list of documents by selecting them from the whole collection, one by one, by the IDs left after the two previous operations
The two major drawbacks are that I am loading all the data - or I make multiple queries.
Any suggestions much appreciated.

This was answered a long time ago and, since then, MongoDB has greatly evolved.
As posted in another answer, MongoDB has supported sampling within the Aggregation Framework since version 3.2:
The way you could do this is:
db.products.aggregate([{$sample: {size: 5}}]); // You want to get 5 docs
Or:
db.products.aggregate([
{$match: {category:"Electronic Devices"}}, // filter the results
{$sample: {size: 5}} // You want to get 5 docs
]);
However, there are some caveats about the $sample operator
(as of Nov 6th, 2017, when the latest version was 3.4): $sample uses an efficient pseudo-random cursor only if all of the following conditions are met:
$sample is the first stage of the pipeline
N is less than 5% of the total documents in the collection
The collection contains more than 100 documents
If any of the above conditions are NOT met, $sample performs a collection scan followed by a random sort to select N documents. That is what happens in the last example above, where the $match stage comes first.
OLD ANSWER
You could always run:
db.products.find({category:"Electronic Devices"}).skip(Math.floor(Math.random()*YOUR_COLLECTION_SIZE))
But the order won't be random, and you will need two queries (a count to get YOUR_COLLECTION_SIZE), or you'll have to estimate how big the collection is (about 100 records, about 1000, about 10000...).
You could also add a field with a random number to every document and query by that number. The drawback here is that you would get the same results every time you run the same query. To fix that you can play with limit and skip, or even with sort. You could as well update those random numbers every time you fetch a record (which implies more queries).
I don't know if you are using Mongoose, Mongoid, or the Mongo driver directly for a specific language, so I'll write everything for the mongo shell.
Your product record would then look like this:
{
_id: ObjectId("..."),
name: "Awesome Product",
category: "Electronic Devices",
}
and I would suggest using:
{
_id: ObjectId("..."),
name: "Awesome Product",
category: "Electronic Devices",
_random_sample: Math.random()
}
Then you could do:
db.products.find({category:"Electronic Devices",_random_sample:{$gte:Math.random()}})
Then you could run the following periodically to refresh each document's _random_sample field:
var your_query = {} // updating every record would impact performance if there are a lot of them
your_query = {category: "Electronic Devices"} // better: restrict the update to a subset
// upsert = false, multi = true
db.products.update(your_query, {$set: {_random_sample: Math.random()}}, false, true)
Or, whenever you retrieve some records, you could update all of them or just a few (depending on how many records you've retrieved):
for (var i = 0; i < records.length; i++) {
    var query = {_id: records[i]._id};
    // upsert = false, multi = false
    db.products.update(query, {$set: {_random_sample: Math.random()}}, false, false);
}
EDIT
Be aware that
db.products.update(your_query, {$set: {_random_sample: Math.random()}}, false, true)
won't work very well: Math.random() is evaluated once on the client, so it will update every product that matches your query with the same random number. The last approach works better (updating some documents as you retrieve them).

Since 3.2 there is an easier way to get a random sample of documents from a collection:
$sample
New in version 3.2.
Randomly selects the specified number of documents from its input.
The $sample stage has the following syntax:
{ $sample: { size: <positive integer> } }
Source: MongoDB Docs
In this case:
db.products.aggregate([{$sample: {size: 10}}]);

Here is what I came up in the end:
var numberOfItems = 10;
// GET LIST OF ALL ID's
SchemaNameHere.find({}, { '_id': 1 }, function(err, data) {
if (err) return res.send(err);
// shuffle array, as per here https://github.com/coolaj86/knuth-shuffle
var arr = shuffle(data.slice(0));
// get only the first numberOfItems of the shuffled array
arr.splice(numberOfItems, arr.length - numberOfItems);
// new array to store all items
var return_arr = [];
// use async each, as per here http://justinklemm.com/node-js-async-tutorial/
async.each(arr, function(item, callback) {
// get items 1 by 1 and add to the return_arr
SchemaNameHere.findById(item._id, function(err, data) {
if (err) return res.send(err);
return_arr.push(data);
// go to the next item, or to the final callback if done
callback();
});
}, function(err) {
// run this when looped through all items in arr
res.json(return_arr);
});
});
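As an aside, the one-by-one findById calls above could be collapsed into a single query with $in. A minimal sketch (note that $in does not preserve the shuffled order, so reorder in JS afterwards if that matters):
// Replaces the async.each block: fetch all shuffled ids in one query.
SchemaNameHere.find({ _id: { $in: arr.map(function(item) { return item._id; }) } }, function(err, docs) {
    if (err) return res.send(err);
    res.json(docs);
});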

skip didn't work out for me. Here is what I wound up with:
var randomDoc = db.getCollection("collectionName").aggregate([ {
$match : {
// criteria to filter matches
}
}, {
$sample : {
size : 1
}
} ]).toArray()[0];
This gets a single random result matching the criteria.

$sample may not be best if you are using Mongoose, as aggregation bypasses Mongoose and you wouldn't get virtuals that way.
Instead, create a function in your back end that shuffles the results,
then return the shuffled array instead of the raw MongoDB result.
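A minimal sketch of such a shuffle (Fisher-Yates; the function and variable names are illustrative):
// Fisher-Yates shuffle, then take the first n documents.
function sampleDocs(docs, n) {
    const arr = docs.slice(); // don't mutate the original result
    for (let i = arr.length - 1; i > 0; i--) {
        const j = Math.floor(Math.random() * (i + 1));
        [arr[i], arr[j]] = [arr[j], arr[i]]; // swap
    }
    return arr.slice(0, n);
}
// e.g. with Mongoose: Article.find({}).then(docs => res.json(sampleDocs(docs, 10)));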

Related

Is this O(N) approach the only way of avoiding a while loop when walking this linked list in Javascript?

I have a data structure that is essentially a linked list stored in state. It represents a stream of changes (patches) to a base object. It is linked by key, rather than by object reference, to allow me to trivially serialise and deserialise the state.
It looks like this:
const latest = 'id4' // They're actually UUIDs, so I can't sort on them (text here for clarity)
const changes = {
id4: {patch: {}, previous: 'id3'},
id3: {patch: {}, previous: 'id2'},
id2: {patch: {}, previous: 'id1'},
id1: {patch: {}, previous: undefined},
}
At certain points, a user chooses to run an expensive calculation and the results get returned into state. We do not have results corresponding to every change, only to some. So results might look like:
const results = {
id3: {performance: 83.6},
id1: {performance: 49.6},
}
Given the changes object, I need to get the result closest to the tip of the changes list, in this case results.id3.
I've written a while loop to do this, and it's perfectly robust at present:
let id = latest
let referenceId = undefined
while (!!id) {
if (!!results[id]) {
referenceId = id
id = undefined
} else {
id = changes[id].previous
}
}
The approach is O(N) but that's the pathological case: I expect a long changelist but with fairly frequent results updates, such that you'd only have to walk back a few steps to find a matching result.
While loops can be vulnerable
Following the great work of Gene Kranz (read his book "Failure is not an option" to understand why NASA never uses recursion!) I try to avoid while loops in code bases: they tend to be susceptible to inadvertent mistakes.
For example, all that would be required to make an infinite loop here is to do delete changes.id1.
So, I'd like to avoid that vulnerability and instead fail to retrieve any result, because not returning a performance value can be handled; but the user's app hanging is REALLY bad!
Other approaches I tried
Sorted array O(N)
To avoid the while loop, I thought about sorting the changes object into an array ordered per the linked list, then simply looping through it.
The problem is that I have to traverse the whole changes list first to get the array in a sorted order, because I don't store an ordering key (it would violate the point of a linked list, because you could no longer do O(1) insert).
It's not a heavy operation, to push an id onto an array, but is still O(N).
The question
Is there a way of traversing this linked list without using a while loop, and without an O(N) approach to convert the linked list into a normal array?
Since you only need to append at the end and possibly remove from the end, the required structure is a stack. In JavaScript the best data structure to implement a stack is an array -- using its push and pop features.
So then you could do things like this:
const changes = [];
function addChange(id, patch) {
changes.push({id, patch});
}
function findRecentMatch(changes, constraints) {
for (let i = changes.length - 1; i >= 0; i--) {
const {id} = changes[i];
if (constraints[id]) return id;
}
}
// Demo
addChange("id1", { data: 10 });
addChange("id2", { data: 20 });
addChange("id3", { data: 30 });
addChange("id4", { data: 40 });
const results = {
id3: {performance: 83.6},
id1: {performance: 49.6},
}
const referenceId = findRecentMatch(changes, results);
console.log(referenceId); // id3
Depending on what you want to do with that referenceId you might want findRecentMatch to return the index in changes instead of the change-id itself. This gives you the possibility to still retrieve the id, but also to clip the changes list to end at that "version" (i.e. as if you popped all the entries up to that point, but then in one operation).
While writing out the question, I realised that rather than avoiding a while loop entirely, I can add an execution count and an escape hatch, which should be sufficient for the purpose.
This solution uses Object.keys(), which is strictly O(N), so it is not technically a correct answer to the question - but it is very fast.
If I needed it faster, I could restructure changes as a Map instead of a plain object and access changes.size, as per this answer.
let id = latest
let referenceId = undefined
const maxLoops = Object.keys(changes).length
let loop = 0
while (!!id && loop < maxLoops) {
loop++
if (!!results[id]) {
referenceId = id
id = undefined
} else {
id = changes[id].previous
}
}
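A minimal sketch of the Map-based variant mentioned above (assuming changes is rebuilt as a Map; the data here is illustrative):
const changes = new Map([
    ['id4', {patch: {}, previous: 'id3'}],
    ['id3', {patch: {}, previous: 'id2'}],
    ['id2', {patch: {}, previous: 'id1'}],
    ['id1', {patch: {}, previous: undefined}],
]);
let id = latest;
let referenceId = undefined;
let loop = 0;
// changes.size is O(1), unlike Object.keys(changes).length
while (id !== undefined && loop < changes.size) {
    loop++;
    if (results[id]) {
        referenceId = id;
        break;
    }
    id = changes.get(id).previous;
}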

Querying MarkLogic merged collection

I'm trying to write a query to get the unique values of an attribute from the final merged collection (sm-Survey-merged). Something like:
select distinct(participantID) from sm-Survey-merged;
I get a tree-cache error with the below equivalent JS query. Can someone help me with a better query?
[...new Set (fn.collection("sm-Survey-merged").toArray().map(doc => doc.root.participantID.valueOf()).sort(), "unfiltered")]
If there are a lot of documents, and you attempt to read them all in a single query, then you run the risk of blowing out the Expanded Tree Cache. You can try bumping up that limit, but with a large database with a lot of documents you are still likely to hit that limit.
The fastest and most efficient way to produce a list of the unique values is to create a range index, and select the values from that lexicon with cts.values().
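A minimal sketch, assuming a string range index has been configured on the participantID JSON property (the index setup itself is not shown):
// Server-side JavaScript: read the distinct values straight from the lexicon,
// scoped to the merged collection. Requires a range index on participantID.
const uniqueParticipantIDs = cts.values(
    cts.jsonPropertyReference("participantID"),
    null,
    [],
    cts.collectionQuery("sm-Survey-merged")
).toArray();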
Without an index, you could attempt to perform iterative queries that search and retrieve a set of random values, and then perform additional searches excluding those previously seen values. This still runs the risk of either blowing out the Expanded Tree Cache, timeouts, etc. So, may not be ideal - but would allow you to get some info now without reindexing the data.
You could experiment with the number of iterations and the search page size and see if that stays within limits and provides consistent results. Maybe add some logging or flags so you know whether you hit the iteration limit while more values remained, i.e. whether the list is complete or not. You could also try running without an iteration limit, but you risk out-of-memory (OOM) or Expanded Tree Cache (ETC) errors.
function distinctParticipantIDs(iterations, values) {
const participantIDs = new Set([]);
const docs = fn.subsequence(
cts.search(
cts.andNotQuery(
cts.collectionQuery("sm-Survey-merged"),
cts.jsonPropertyValueQuery("participantID", Array.from(values))
),
("unfiltered","score-random")),
1, 1000);
for (const doc of docs) {
const participantID = doc.root.participantID.valueOf();
participantIDs.add(participantID);
}
const uniqueParticipantIDs = new Set([...values, ...participantIDs]);
if (iterations > 0 && participantIDs.size > 0) {
//there are still unseen values, and we haven't hit our iteration limit, so keep searching
return distinctParticipantIDs(iterations - 1, uniqueParticipantIDs);
} else {
return uniqueParticipantIDs;
}
}
[...distinctParticipantIDs(100, new Set()) ];
Another option would be to run a CoRB job against the database, and apply the EXPORT-FILE-SORT option with ascending|distinct or descending|distinct, to dedup the values produced in an output file.
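A rough sketch of such a CoRB job's options file (the module and file names here are hypothetical placeholders):
# CoRB2 options: write one participantID per line, sorted and deduplicated.
URIS-MODULE=get-survey-uris.sjs|ADHOC
PROCESS-MODULE=extract-participant-id.sjs|ADHOC
PROCESS-TASK=com.marklogic.developer.corb.ExportBatchToFileTask
EXPORT-FILE-NAME=participant-ids.txt
EXPORT-FILE-SORT=ascending|distinct
THREAD-COUNT=8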

MongoDB - return X number of random records

In my node.js server, I am trying to return 4 random records from my collection.
Here is my current code. The issue is that it currently returns between 0 and 4 random records from my collection, whereas I want it to return exactly 4 (no more, no less) random records every time.
db.collection('articles')
.find()
.limit( 4 )
.skip(Math.round(Math.random() * 4))
.sort("date", -1).toArray()
Any help or advice is appreciated - thank you in advance!
I had a look at some similar questions, but they all only seem to generate between 0 and X random records, not a set amount.
You can use the $sample aggregation pipeline stage for that.
Randomly selects the specified number of documents from its input.
The $sample stage has the following syntax:
{ $sample: { size: <positive integer> } }
E.g. this code returns 4 random documents:
db.collection('articles').aggregate([
{ $sample: { size: 4 } }
]);
If you need to select x random documents matching some criteria, just add a $match stage before $sample, as sketched below.
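For example (the category field and its value are illustrative):
db.collection('articles').aggregate([
    { $match: { category: 'news' } }, // filter first...
    { $sample: { size: 4 } }          // ...then pick 4 random matches
]).toArray();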

How can I modify mongo query for limited values using push function

I have the following Mongo query, which gives me results like the ones in the screenshot.
How can I modify the query to limit the number of values in the nodes array?
For example, I want only the top 3 values in my nodes array.
{ $group: {
_id: "$url",
nodes: {
$push: {
totaltime: "$totaltime",
},
},
},
}
You got 4 items just because there are 4 distinct values of the url field in your collection; the _id param inside $group controls how documents are grouped, not how many values each nodes array holds. To keep only the top 3 values per url, see the sketch below.
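A sketch of one way to do this (assuming MongoDB 3.2+ for $slice in $project, and that "top 3" means the 3 largest totaltime values; the collection name is a placeholder):
db.collection.aggregate([
    { $sort: { totaltime: -1 } }, // largest totaltime values get pushed first
    { $group: {
        _id: "$url",
        nodes: { $push: { totaltime: "$totaltime" } }
    } },
    { $project: { nodes: { $slice: ["$nodes", 3] } } } // keep only the first 3
]);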

How to change result position based off parameter in a mongodb / mongoose query?

So I am using mongoose and node.js to access a mongodb database. I want to bump up each result based on a number (they are ordered by date created if none are bumped up). For example:
{ name: 'A',
bump: 0 },
{ name: 'B',
bump: 0 },
{ name: 'C',
bump: 2 },
{ name: 'D',
bump: 1 }
would be retrieved in the order: C, A, D, B. How can this be accomplished (without iterating through every entry in the database)?
Try something like this. Store a counter tracking the total number of threads, let's call it thread_count, initially set to 0; so you have a document somewhere that looks like {thread_count: 0}.
Every time a new thread is created, first call findAndModify() using {$inc : {thread_count:1}} as the modifier - i.e., increment the counter by 1 and return its new value.
Then when you insert the new thread, use the new value for the counter as the value for a field in its document, let's call it post_order.
So each document you insert has a value 1 greater each time. For example, the first 3 documents you insert would look like this:
{name:'foo', post_order:1, created_at:... } // value of thread_count is at 1
{name:'bar', post_order:2, created_at:... } // value of thread_count is at 2
{name:'baz', post_order:3, created_at:... } // value of thread_count is at 3
etc.
So effectively, you can query and order by post_order as ASCENDING, and it will return them in the order of oldest to newest (or DESCENDING for newest to oldest).
Then to "bump" a thread in its sorting order when it gets upvoted, you can call update() on the document with {$inc:{post_order:1}}. This will advance it by 1 in the order of result sorting. If two threads have the same value for post_order, created_at will differentiate which one comes first. So you will sort by post_order, created_at.
You will want to have an index on post_order and created_at.
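A minimal mongo-shell sketch of this scheme (collection and field names are illustrative):
// One-time setup: the counter document and the sort index.
db.counters.insert({ _id: "threads", thread_count: 0 });
db.threads.createIndex({ post_order: 1, created_at: 1 });

// On thread creation: atomically increment and read the counter...
var counter = db.counters.findAndModify({
    query: { _id: "threads" },
    update: { $inc: { thread_count: 1 } },
    new: true // return the post-increment value
});
// ...and stamp the new thread with it.
db.threads.insert({ name: "foo", post_order: counter.thread_count, created_at: new Date() });

// "Bump" a thread one notch up the ordering:
db.threads.update({ name: "foo" }, { $inc: { post_order: 1 } });

// Newest (or most bumped) first:
db.threads.find().sort({ post_order: -1, created_at: -1 });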
Let's say your results are in the variable response (which is an array); then I would do:
response.sort(function(obj1, obj2){
return obj2.bump - obj1.bump;
});
or if you also want to take name order into account:
response.sort(function(obj1, obj2){
var diff = obj2.bump - obj1.bump;
var nameDiff = (obj2.name > obj1.name)?-1:((obj2.name < obj1.name)?1:0);
return (diff == 0) ? nameDiff : diff;
});
Not a pleasant answer, but the solution you request is unrealistic. Here's my suggestion:
Add an OrderPosition property to your object instead of Bump.
Think of "bumping" as an event. It is best represented as an event-handler function. When an item gets "bumped" by whatever trigger in your business logic, the collection of items needs to be adjusted.
var currentOrder = this.OrderPosition;
this.OrderPosition = currentOrder - bump; // moves your object up the list
// then write a foreach loop here, iterating every item after this item's
// unadjusted position, adding 1 to move each of them down the list one notch
This does require iterating through many items, and I know you are trying to prevent that, but I do not think there is any other way to safely ensure the integrity of your item ordering - especially when relative to other pulled collections that occur later down the road.
I don't think a purely query-based solution is possible with your document schema (I assume you have createdDate and bump fields). Instead, I suggest a single field called sortorder to keep track of your desired retrieval order:
sortorder is initially the creation timestamp. If there are no "bumps", sorting by this field gives the correct order.
If there is a "bump," the sortorder is invalidated. So simply correct the sortorder values: each time a "bump" occurs swap the sortorder fields of the bumped document and the document directly ahead of it. This literally "bumps" the document up in the sort order.
When querying, sort by sortorder.
You can remove fields bump and createdDate if they are not used elsewhere.
As an aside, most social sites don't directly manipulate a post's display position based on its number of votes (or "bumps"). Instead, the number of votes is used to calculate a score. Then the posts are sorted and displayed by this score. In your case, you should combine createdDate and bumps into a single score that can be sorted in a query.
This site (StackOverflow.com) had a related meta discussion about how to determine "hot" questions. I think there was even a competition to come up with a new formula. The meta question also shared the formulas used by two other popular social news sites: Y Combinator Hacker News and Reddit.
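A rough sketch of such a score computed at query time (the weighting is purely illustrative, and $addFields requires MongoDB 3.4+):
// Score = bumps minus age in days, so fresh and frequently bumped posts rank high.
db.posts.aggregate([
    { $addFields: {
        score: {
            $subtract: [
                "$bump",
                { $divide: [
                    { $subtract: [ new Date(), "$createdDate" ] }, // age in ms
                    1000 * 60 * 60 * 24                            // ms per day
                ] }
            ]
        }
    } },
    { $sort: { score: -1 } }
]);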
