I have an array of JSON objects within a string:
[
{
"title": "Ham on Rye",
"alternativeTitles": [],
"secondaryYearSourceId": 0,
"sortTitle": "ham on rye",
"sizeOnDisk": 0,
"status": "released",
"overview": "A bizarre rite of passage at the local deli determines the fate of a generation of teenagers, leading some to escape their suburban town and dooming others to remain…",
"inCinemas": "2019-08-10T00:00:00Z",
"images": [
{
"coverType": "poster",
"url": "http://image.tmdb.org/t/p/original/px7iCT1SgsOOSAXaqHwP50o0jAI.jpg"
}
],
"downloaded": false,
"remotePoster": "http://image.tmdb.org/t/p/original/px7iCT1SgsOOSAXaqHwP50o0jAI.jpg",
"year": 2019,
"hasFile": false,
"profileId": 0,
"pathState": "dynamic",
"monitored": false,
"minimumAvailability": "tba",
"isAvailable": true,
"folderName": "",
"runtime": 0,
"tmdbId": 527176,
"titleSlug": "ham-on-rye-527176",
"genres": [],
"tags": [],
"added": "0001-01-01T00:00:00Z",
"ratings": {
"votes": 5,
"value": 6.9
},
"qualityProfileId": 0
}
]
When I try to parse my array of JSON objects it always returns a string, so when I try to access it like this:
let data = JSON.parse(JSON.stringify(jsonData));
console.log(data[0].title)
it returns a single character like "[", as if the data were still a string.
The problem is that you are stringifying something that is already a JSON string.
Change this:
let data = JSON.parse(JSON.stringify(jsonData));
to this:
let data = JSON.parse(jsonData);
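If jsonData may arrive either as a string or as an already-parsed array, a small guard keeps both cases working (a minimal sketch; jsonData is the name from the question):
// Parse only when the value is still a string; otherwise
// assume it is already an array/object.
let data = typeof jsonData === 'string' ? JSON.parse(jsonData) : jsonData;
console.log(data[0].title); // "Ham on Rye"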
In the frontend there is a list of game titles, each a string consisting of:
id, game name, date, price
The search shall accept multiple keywords, e.g. the user might type "1998 Streetfighter 2" or "Streetfighter 1998". Currently I create an array by splitting on spaces, which creates 3 keywords: ["1998", "Streetfighter", "2"]. Then I go through the collection of game titles to filter matches. Unfortunately this also gives back any title that includes "2", because there is no pattern recognition that identifies that "Streetfighter 2" belongs together. Is there a simple algorithm to provide a pattern search?
const allGames = [
"Streetfighter 1, 1992, 20",
"Streetfighter 2, 1998, 20",
"pokemon, 2016, 20",
"Diablo 3, 2015, 40",
"Super mario, 1995, 20",
"The Witcher, 2012, 20",
]
Your search query looks advanced enough to justify using a search engine. Just don't create one yourself (it's harder than you may think).
In this answer I'll be using Lunr.js
Here's a 2min crash course:
Transform your data into documents. (I've already converted your initial allGames array.)
Then create a search index where you:
Specify which property of a document holds a unique identifier. (In our case title.)
Define which properties should be indexed. (In our case all of them.)
Define a boost score for each property. (i.e. a match in the title has a higher relevance score than a match in the price.)
Add all documents to the search index.
Search! ;)
📢 Search Query FTW!
Notice the last search, I'm using a wildcard in the search string (street*) to find the two Street Fighter titles!
const createLunrIndex = docs =>
  lunr(function () {
    this.ref('title');
    // Boost matches in more important fields.
    this.field('title', { boost: 5 });
    this.field('date', { boost: 3 });
    this.field('price', { boost: 1 });
    for (const doc of docs) this.add(doc);
  });
const search = (lunrIndex, term) =>
  lunrIndex
    .search(term)
    .map(res => res.ref);
const gamesIndex = createLunrIndex(allGames);
console.log(
search(gamesIndex, '1998 Streetfighter 2')
);
console.log(
search(gamesIndex, 'Streetfighter 1998')
);
console.log(
search(gamesIndex, 'street*')
);
<script src="https://unpkg.com/lunr/lunr.js"></script>
<script>
const allGames =
[ { "title": "Streetfighter 1"
, "date": "1992"
, "price": "20"
}
,
{ "title": "Streetfighter 2"
, "date": "1998"
, "price": "20"
}
,
{ "title": "pokemon"
, "date": "2016"
, "price": "20"
}
,
{ "title": "Diablo 3"
, "date": "2015"
, "price": "40"
}
,
{ "title": "Super mario"
, "date": "1985"
, "price": "20"
}
,
{ "title": "The Witcher"
, "date": "2012"
, "price": "20"
}
]
</script>
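If you'd rather derive those documents from the original strings, here's a minimal sketch (assuming allGames is the original string array from the question and every entry keeps the "title, date, price" shape):
// Split e.g. "Streetfighter 2, 1998, 20" into its three fields.
const toDoc = s => {
  const [title, date, price] = s.split(',').map(part => part.trim());
  return { title, date, price };
};
const docs = allGames.map(toDoc);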
I am using the code below to build a table based on an API. It works fine when there is an object in an array (e.g. lineStatuses[0].statusSeverityDescription); however, when there is an object, in an object, in an array, it does not work and I get [object Object] returned.
Here is a sample of the JSON data from the URL (I am expecting undefined to be returned for the first record):
[
{
"$type": "Tfl.Api.Presentation.Entities.Line, Tfl.Api.Presentation.Entities",
"id": "bakerloo",
"name": "Bakerloo",
"modeName": "tube",
"disruptions": [],
"created": "2016-06-03T12:36:54.19Z",
"modified": "2016-06-03T12:36:54.19Z",
"lineStatuses": [
{
"$type": "Tfl.Api.Presentation.Entities.LineStatus, Tfl.Api.Presentation.Entities",
"id": 0,
"statusSeverity": 10,
"statusSeverityDescription": "Good Service",
"created": "0001-01-01T00:00:00",
"validityPeriods": []
}
],
"routeSections": [],
"serviceTypes": [
{
"$type": "Tfl.Api.Presentation.Entities.LineServiceTypeInfo, Tfl.Api.Presentation.Entities",
"name": "Regular",
"uri": "/Line/Route?ids=Bakerloo&serviceTypes=Regular"
}
]
},
{
"$type": "Tfl.Api.Presentation.Entities.Line, Tfl.Api.Presentation.Entities",
"id": "central",
"name": "Central",
"modeName": "tube",
"disruptions": [],
"created": "2016-06-03T12:36:54.037Z",
"modified": "2016-06-03T12:36:54.037Z",
"lineStatuses": [
{
"$type": "Tfl.Api.Presentation.Entities.LineStatus, Tfl.Api.Presentation.Entities",
"id": 0,
"lineId": "central",
"statusSeverity": 5,
"statusSeverityDescription": "Part Closure",
"reason": "CENTRAL LINE: Saturday 11 and Sunday 12 June, no service between White City and Ealing Broadway / West Ruislip. This is to enable track replacement work at East Acton and Ruislip Gardens. Replacement buses operate.",
"created": "0001-01-01T00:00:00",
"validityPeriods": [
{
"$type": "Tfl.Api.Presentation.Entities.ValidityPeriod, Tfl.Api.Presentation.Entities",
"fromDate": "2016-06-11T03:30:00Z",
"toDate": "2016-06-13T01:29:00Z",
"isNow": false
}
],
"disruption": {
"$type": "Tfl.Api.Presentation.Entities.Disruption, Tfl.Api.Presentation.Entities",
"category": "PlannedWork",
"categoryDescription": "PlannedWork",
"description": "CENTRAL LINE: Saturday 11 and Sunday 12 June, no service between White City and Ealing Broadway / West Ruislip. This is to enable track replacement work at East Acton and Ruislip Gardens. Replacement buses operate.",
"additionalInfo": "Replacement buses operate as follows:Service A: White City - East Acton - North Acton - West Acton - Ealing Common (for District and Piccadilly Lines) - Ealing BroadwayService B: White City - North Acton - Northolt - South Ruislip - Ruislip Gardens - West RuislipService C: White City - North Acton - Park Royal (Piccadilly Line) - Hanger Lane - Perivale - Greenford - Northolt",
"created": "2016-05-12T11:04:00Z",
"affectedRoutes": [],
"affectedStops": [],
"isBlocking": true,
"closureText": "partClosure"
}
}
],
"routeSections": [],
"serviceTypes": [
{
"$type": "Tfl.Api.Presentation.Entities.LineServiceTypeInfo, Tfl.Api.Presentation.Entities",
"name": "Regular",
"uri": "/Line/Route?ids=Central&serviceTypes=Regular"
}
]
}
]
I am also trying to use setInterval to refresh the tube-disruption DIV with updated data from the API, which is not working. Below is the code (with the URL returning some of the JSON data above). Any ideas what I am doing wrong?
var xmlhttp = new XMLHttpRequest();
var url = "https://api.tfl.gov.uk/line/mode/tube/status";
xmlhttp.onreadystatechange=function() {
if (xmlhttp.readyState == 4 && xmlhttp.status == 200) {
myFunctionDisruption(xmlhttp.responseText);
}
};
xmlhttp.open("GET", url, true);
xmlhttp.send();
setInterval(myFunctionDisruption, 600000);
function myFunctionDisruption(response) {
var arr = JSON.parse(response);
var i;
var out = "<table>";
for(i = 0; i < arr.length; i++) {
out += "<tr><td>" +
arr[i].lineStatuses[0].disruption.description + <!-- DOES NOT WORK -->
"</td></tr>";
}
out += "</table>";
document.getElementById("tube-disruption").innerHTML = out;
}
The code below will generate a table for you. The generic tableMaker function takes an array of one or more objects as its first argument. All objects should have the same keys (properties), since these keys are used to create the table header (if the second argument is set to true) and the values are used to create each row. It returns HTML table text. You can also practice it with some small, simple data of your own.
In your case you have multiple nested objects which, I guess, need to be converted into separate tables within the corresponding cells of the main table. For that purpose I have another function, tabelizer, which handles this job recursively by utilizing the tableMaker function. The result of tabelizer is the complete HTML text of your master table.
Let's see:
var tableMaker = (o, h) => {
      // "o" is an array of row objects; "h" toggles the header row.
      var keys = o.length && Object.keys(o[0]),
          // Wrap each value of array "a" in a <td> or <th> tag and close the row after the last one.
          rowMaker = (a, t) => a.reduce((p, c, i, a) => p + "<" + t + ">" + c + "</" + t + ">" + (i === a.length - 1 ? "</tr>" : ""), "<tr>"),
          // Header row (when "h" is true) followed by one row per object.
          rows = o.reduce((r, c) => r + rowMaker(keys.reduce((v, k) => v.concat(c[k]), []), "td"), h ? rowMaker(keys, "th") : "");
      return rows.length ? "<table>" + rows + "</table>" : "";
    },
data = [
{
"$type": "Tfl.Api.Presentation.Entities.Line, Tfl.Api.Presentation.Entities",
"id": "bakerloo",
"name": "Bakerloo",
"modeName": "tube",
"disruptions": [],
"created": "2016-06-03T12:36:54.19Z",
"modified": "2016-06-03T12:36:54.19Z",
"lineStatuses": [
{
"$type": "Tfl.Api.Presentation.Entities.LineStatus, Tfl.Api.Presentation.Entities",
"id": 0,
"statusSeverity": 10,
"statusSeverityDescription": "Good Service",
"created": "0001-01-01T00:00:00",
"validityPeriods": []
}
],
"routeSections": [],
"serviceTypes": [
{
"$type": "Tfl.Api.Presentation.Entities.LineServiceTypeInfo, Tfl.Api.Presentation.Entities",
"name": "Regular",
"uri": "/Line/Route?ids=Bakerloo&serviceTypes=Regular"
}
]
},
{
"$type": "Tfl.Api.Presentation.Entities.Line, Tfl.Api.Presentation.Entities",
"id": "central",
"name": "Central",
"modeName": "tube",
"disruptions": [],
"created": "2016-06-03T12:36:54.037Z",
"modified": "2016-06-03T12:36:54.037Z",
"lineStatuses": [
{
"$type": "Tfl.Api.Presentation.Entities.LineStatus, Tfl.Api.Presentation.Entities",
"id": 0,
"lineId": "central",
"statusSeverity": 5,
"statusSeverityDescription": "Part Closure",
"reason": "CENTRAL LINE: Saturday 11 and Sunday 12 June, no service between White City and Ealing Broadway / West Ruislip. This is to enable track replacement work at East Acton and Ruislip Gardens. Replacement buses operate.",
"created": "0001-01-01T00:00:00",
"validityPeriods": [
{
"$type": "Tfl.Api.Presentation.Entities.ValidityPeriod, Tfl.Api.Presentation.Entities",
"fromDate": "2016-06-11T03:30:00Z",
"toDate": "2016-06-13T01:29:00Z",
"isNow": false
}
],
"disruption": {
"$type": "Tfl.Api.Presentation.Entities.Disruption, Tfl.Api.Presentation.Entities",
"category": "PlannedWork",
"categoryDescription": "PlannedWork",
"description": "CENTRAL LINE: Saturday 11 and Sunday 12 June, no service between White City and Ealing Broadway / West Ruislip. This is to enable track replacement work at East Acton and Ruislip Gardens. Replacement buses operate.",
"additionalInfo": "Replacement buses operate as follows:Service A: White City - East Acton - North Acton - West Acton - Ealing Common (for District and Piccadilly Lines) - Ealing BroadwayService B: White City - North Acton - Northolt - South Ruislip - Ruislip Gardens - West RuislipService C: White City - North Acton - Park Royal (Piccadilly Line) - Hanger Lane - Perivale - Greenford - Northolt",
"created": "2016-05-12T11:04:00Z",
"affectedRoutes": [],
"affectedStops": [],
"isBlocking": true,
"closureText": "partClosure"
}
}
],
"routeSections": [],
"serviceTypes": [
{
"$type": "Tfl.Api.Presentation.Entities.LineServiceTypeInfo, Tfl.Api.Presentation.Entities",
"name": "Regular",
"uri": "/Line/Route?ids=Central&serviceTypes=Regular"
}
]
}
],
tabelizer = (a) => a.length ? tableMaker(a.map(e => Object.keys(e).reduce((p,k) => (p[k] = Array.isArray(e[k]) ? tabelizer(e[k]) : e[k],p),{})),true)
: "",
tableHTML = tabelizer(data);
document.write(tableHTML);
I used arrow functions, but they might not work in Safari or IE. You might need to convert them to the conventional function notation.
You can also try the code out at repl.it where you can see the HTML text displayed through console.log.
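For reference, a minimal sketch of tableMaker on its own, with tiny made-up sample rows (the data here is illustrative only):
var sample = [
  { "name": "Bakerloo", "modeName": "tube" },
  { "name": "Central", "modeName": "tube" }
];
// Produces: <table><tr><th>name</th><th>modeName</th></tr><tr><td>Bakerloo</td>...</table>
document.write(tableMaker(sample, true));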
I'm running a node.js server that sends queries to an elasticsearch instance. Here is an example of the JSON returned by the query:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 9290,
"max_score": 0,
"hits": []
},
"suggest": {
"postSuggest": [
{
"text": "a",
"offset": 0,
"length": 1,
"options": [
{
"text": "Academic Librarian",
"score": 2
},
{
"text": "Able Seamen",
"score": 1
},
{
"text": "Academic Dean",
"score": 1
},
{
"text": "Academic Deans-Registrar",
"score": 1
},
{
"text": "Accessory Designer",
"score": 1
}
]
}
]
}
}
I need to create an array containing each job title as a string. I've run into this weird behavior that I can't figure out. Whenever I try to pull values out of the JSON, I can't go below options or everything comes back as undefined.
For example:
arr.push(results.suggest.postSuggest) will push just what you'd expect: all the stuff inside postSuggest.
arr.push(results.suggest.postSuggest.options) will come up as undefined even though I can see it when I run it without .options. This is also true for anything below .options.
I think it may be because .options is some sort of built-in function that acts on variables, so instead of seeing options as JSON it is trying to run a function on results.suggest.postSuggest:
arr.push(results.suggest.postSuggest.options)
postSuggest is an array of objects, and options inside postSuggest is also an array of objects. So first you need to get the first element with postSuggest[0], and then postSuggest[0].options to get the array of options.
The snippet below may be useful:
var myObj = {..}
// used jquery just to demonstrate postSuggest is an Array
console.log($.isArray(myObj.suggest.postSuggest)) //return true
var getPostSuggest =myObj.suggest.postSuggest //Array of object
var getOptions = getPostSuggest[0].options; // 0 since it contain only one element
console.log(getOptions.length) ; // 5 , contain 5 objects
getOptions.forEach(function(item){
document.write("<pre>Score is "+ item.score + " Text</pre>")
})
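To get just the job titles as an array of strings (which is what the question asks for), a minimal sketch building on the same getOptions:
var titles = getOptions.map(function (item) {
    return item.text; // e.g. "Academic Librarian"
});
console.log(titles); // array of job title strings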
I have a query where I need to return 10 of "Type A" records, while returning all other records. How can I accomplish this?
Update: Admittedly, I could do this with two queries, but I wanted to avoid that, if possible, thinking it would be less overhead, and possibly more performant. My query already is an aggregation query that takes both kinds of records into account, I just need to limit the number of the one type of record in the results.
Update: the following is an example query that highlights the problem:
db.books.aggregate([
{$geoNear: {near: [-118.09771, 33.89244], distanceField: "distance", spherical: true}},
{$match: {"type": "Fiction"}},
{$project: {
'title': 1,
'author': 1,
'type': 1,
'typeSortOrder':
{$add: [
{$cond: [{$eq: ['$type', "Fiction"]}, 1, 0]},
{$cond: [{$eq: ['$type', "Science"]}, 0, 0]},
{$cond: [{$eq: ['$type', "Horror"]}, 3, 0]}
]},
}},
{$sort: {'typeSortOrder': 1}},
{$limit: 10}
])
db.books.aggregate([
{$geoNear: {near: [-118.09771, 33.89244], distanceField: "distance", spherical: true}},
{$match: {"type": "Horror"}},
{$project: {
'title': 1,
'author': 1,
'type': 1,
'typeSortOrder':
{$add: [
{$cond: [{$eq: ['$type', "Fiction"]}, 1, 0]},
{$cond: [{$eq: ['$type', "Science"]}, 0, 0]},
{$cond: [{$eq: ['$type', "Horror"]}, 3, 0]}
]},
}},
{$sort: {'typeSortOrder': 1}},
{$limit: 10}
])
db.books.aggregate([
{$geoNear: {near: [-118.09771, 33.89244], distanceField: "distance", spherical: true}},
{$match: {"type": "Science"}},
{$project: {
'title': 1,
'author': 1,
'type': 1,
'typeSortOrder':
{$add: [
{$cond: [{$eq: ['$type', "Fiction"]}, 1, 0]},
{$cond: [{$eq: ['$type', "Science"]}, 0, 0]},
{$cond: [{$eq: ['$type', "Horror"]}, 3, 0]}
]},
}},
{$sort: {'typeSortOrder': 1}},
{$limit: 10}
])
I would like to have all these records returned in one query, but limit the type to at most 10 of any category.
I realize that the typeSortOrder doesn't need to be conditional when the queries are broken out like this; I had it there for when the queries were one query originally (which is where I would like to get back to).
I don't think this is presently (2.6) possible to do with one aggregation pipeline. It's difficult to give a precise argument as to why not, but basically the aggregation pipeline performs transformations on streams of documents, one document at a time. There's no awareness within the pipeline of the state of the stream itself, which is what you'd need in order to determine that you've hit the limit for A's, B's, etc. and need to drop further documents of the same type. $group does bring multiple documents together and allows their field values in aggregate to affect the resulting group document ($sum, $avg, etc.). Maybe this makes some sense, but it's necessarily not rigorous, because there are simple operations you could add to make limiting based on type possible, e.g., adding a $push x accumulator to $group that only pushes the value if the array being pushed to has fewer than x elements.
Even if I did have a way to do it, I'd recommend just doing two aggregations. Keep it simple.
Problem
The results here are not impossible, but possibly impractical. As has been noted in general, you cannot "slice" an array or otherwise "limit" the number of results pushed onto one, yet the method for doing this per "type" is essentially to use arrays.
The "impractical" part is usually about the number of results, where too large a result set is going to blow up the BSON document limit when "grouping". But I'm going to consider this, along with some other recommendations on your "geo search", with the ultimate goal of returning at most 10 results of each "type".
Principle
To first consider and understand the problem, let's look at a simplified "set" of data and the pipeline code necessary to return the "top 2 results" from each type:
{ "title": "Title 1", "author": "Author 1", "type": "Fiction", "distance": 1 },
{ "title": "Title 2", "author": "Author 2", "type": "Fiction", "distance": 2 },
{ "title": "Title 3", "author": "Author 3", "type": "Fiction", "distance": 3 },
{ "title": "Title 4", "author": "Author 4", "type": "Science", "distance": 1 },
{ "title": "Title 5", "author": "Author 5", "type": "Science", "distance": 2 },
{ "title": "Title 6", "author": "Author 6", "type": "Science", "distance": 3 },
{ "title": "Title 7", "author": "Author 7", "type": "Horror", "distance": 1 }
That's a simplified view of the data and somewhat representative of the state of documents after an initial query. Now comes the trick of how to use the aggregation pipeline to get the "nearest" two results for each "type":
db.books.aggregate([
{ "$sort": { "type": 1, "distance": 1 } },
{ "$group": {
"_id": "$type",
"1": {
"$first": {
"_id": "$_id",
"title": "$title",
"author": "$author",
"distance": "$distance"
}
},
"books": {
"$push": {
"_id": "$_id",
"title": "$title",
"author": "$author",
"distance": "$distance"
}
}
}},
{ "$project": {
"1": 1,
"books": {
"$cond": [
{ "$eq": [ { "$size": "$books" }, 1 ] },
{ "$literal": [false] },
"$books"
]
}
}},
{ "$unwind": "$books" },
{ "$project": {
"1": 1,
"books": 1,
"seen": { "$eq": [ "$1", "$books" ] }
}},
{ "$sort": { "_id": 1, "seen": 1 } },
{ "$group": {
"_id": "$_id",
"1": { "$first": "$1" },
"2": { "$first": "$books" },
"books": {
"$push": {
"$cond": [ { "$not": "$seen" }, "$books", false ]
}
}
}},
{ "$project": {
"1": 1,
"2": 2,
"pos": { "$literal": [1,2] }
}},
{ "$unwind": "$pos" },
{ "$group": {
"_id": "$_id",
"books": {
"$push": {
"$cond": [
{ "$eq": [ "$pos", 1 ] },
"$1",
{ "$cond": [
{ "$eq": [ "$pos", 2 ] },
"$2",
false
]}
]
}
}
}},
{ "$unwind": "$books" },
{ "$match": { "books": { "$ne": false } } },
{ "$project": {
"_id": "$books._id",
"title": "$books.title",
"author": "$books.author",
"type": "$_id",
"distance": "$books.distance",
"sortOrder": {
"$add": [
{ "$cond": [ { "$eq": [ "$_id", "Fiction" ] }, 1, 0 ] },
{ "$cond": [ { "$eq": [ "$_id", "Science" ] }, 0, 0 ] },
{ "$cond": [ { "$eq": [ "$_id", "Horror" ] }, 3, 0 ] }
]
}
}},
{ "$sort": { "sortOrder": 1 } }
])
Of course that is just two results, but it outlines the process for getting n results, which naturally is done in generated pipeline code. Before moving on to the code, the process deserves a walk-through.
After any query, the first thing to do here is $sort the results, and this you want to basically do by both the "grouping key" which is the "type" and by the "distance" so that the "nearest" items are on top.
The reason for this is shown in the $group stages that will repeat. What is done is essentially "popping" the $first result off of each grouping stack. So that other documents are not lost, they are placed in an array using $push.
Just to be safe, the next stage is really only required after the "first step", but could optionally be added for similar filtering in the repetition. The main check here is whether the resulting "array" is larger than just one item. Where it is not, the contents are replaced with a single value of false, the reason for which is about to become evident.
After this "first step" the real repetition cycle beings, where that array is then "de-normalized" with $unwind and then a $project made in order to "match" the document that has been last "seen".
As only one of the documents will match this condition the results are again "sorted" in order to float the "unseen" documents to the top, while of course maintaining the grouping order. The next thing is similar to the first $group step, but where any kept positions are maintained and the "first unseen" document is "popped off the stack" again.
The document that was "seen" is then pushed back to the array, not as itself but as a value of false. This will not match the kept value, and it is generally the way to handle this without being "destructive" to the array contents, where you don't want the operations to fail should there not be enough matches to cover the n results required.
Cleaning up when complete, the next "projection" adds an array to the final documents, now grouped by "type", representing each position in the n results required. When this array is unwound, the documents can again be grouped back together, but now all in a single array that possibly contains several false values but is n elements long.
Finally unwind the array again, use $match to filter out the false values, and project to the required document form.
Practicality
The problem as stated earlier is with the number of results being filtered as there is a real limit on the number of results that can be pushed into an array. That is mostly the BSON limit, but you also don't really want 1000's of items even if that is still under the limit.
The trick here is keeping the initial "match" small enough that the "slicing operations" becomes practical. There are some things with the $geoNear pipeline process that can make this a possibility.
The obvious one is limit. By default this is 100, but you clearly want to have something in the range of:
(the number of categories you can possibly match) X ( required matches )
But if this is essentially a number not in the 1000's then there is already some help here.
The others are maxDistance and minDistance, where essentially you put upper and lower bounds on how "far out" to search. The max bound is the general limiter while the min bound is useful when "paging", which is the next helper.
When "upwardly paging", you can use the query argument in order to exclude the _id values of documents "already seen" using the $nin query. In much the same way, the minDistance can be populated with the "last seen" largest distance, or at least the smallest largest distance by "type". This allows some concept of filtering out things that have already been "seen" and getting another page.
Really a topic in itself, but those are the general things to look for in reducing that initial match in order to make the process practical.
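As a sketch of such an initial stage (the numbers here are placeholders, and lastSeenIds is a hypothetical array of _id values from the previous page):
{ "$geoNear": {
    "near": [-118.09771, 33.89244],
    "distanceField": "distance",
    "spherical": true,
    "limit": 30,                                   // e.g. 3 types x 10 required matches
    "maxDistance": 5000,                           // upper bound on how "far out" to search
    "minDistance": 1000,                           // lower bound, useful when "paging"
    "query": { "_id": { "$nin": lastSeenIds } }    // exclude documents already seen
}}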
Implementing
The general problem of returning "10 results at most, per type" is clearly going to want some code in order to generate the pipeline stages. No-one wants to type that out, and practically speaking you will probably want to change that number at some point.
So now to the code that generates the monster pipeline. All the code is in JavaScript, but the principles are easy to translate:
var coords = [-118.09771, 33.89244];
var key = "$type";
var val = {
"_id": "$_id",
"title": "$title",
"author": "$author",
"distance": "$distance"
};
var maxLen = 10;
var stack = [];
var pipe = [];
var fproj = { "$project": { "pos": { "$literal": [] } } };
pipe.push({ "$geoNear": {
"near": coords,
"distanceField": "distance",
"spherical": true
}});
pipe.push({ "$sort": {
"type": 1, "distance": 1
}});
for ( var x = 1; x <= maxLen; x++ ) {
fproj["$project"][""+x] = 1;
fproj["$project"]["pos"]["$literal"].push( x );
var rec = {
"$cond": [ { "$eq": [ "$pos", x ] }, "$"+x ]
};
if ( stack.length == 0 ) {
rec["$cond"].push( false );
} else {
var lval = stack.pop();
rec["$cond"].push( lval );
}
stack.push( rec );
if ( x == 1) {
pipe.push({ "$group": {
"_id": key,
"1": { "$first": val },
"books": { "$push": val }
}});
pipe.push({ "$project": {
"1": 1,
"books": {
"$cond": [
{ "$eq": [ { "$size": "$books" }, 1 ] },
{ "$literal": [false] },
"$books"
]
}
}});
} else {
pipe.push({ "$unwind": "$books" });
var proj = {
"$project": {
"books": 1
}
};
proj["$project"]["seen"] = { "$eq": [ "$"+(x-1), "$books" ] };
var grp = {
"$group": {
"_id": "$_id",
"books": {
"$push": {
"$cond": [ { "$not": "$seen" }, "$books", false ]
}
}
}
};
for ( var n = x; n >= 1; n-- ) {
if ( n != x )
proj["$project"][""+n] = 1;
grp["$group"][""+n] = ( n == x ) ? { "$first": "$books" } : { "$first": "$"+n };
}
pipe.push( proj );
pipe.push({ "$sort": { "_id": 1, "seen": 1 } });
pipe.push(grp);
}
}
pipe.push(fproj);
pipe.push({ "$unwind": "$pos" });
pipe.push({
"$group": {
"_id": "$_id",
"msgs": { "$push": stack[0] }
}
});
pipe.push({ "$unwind": "$books" });
pipe.push({ "$match": { "books": { "$ne": false } }});
pipe.push({
"$project": {
"_id": "$books._id",
"title": "$books.title",
"author": "$books.author",
"type": "$_id",
"distance": "$books",
"sortOrder": {
"$add": [
{ "$cond": [ { "$eq": [ "$_id", "Fiction" ] }, 1, 0 ] },
{ "$cond": [ { "$eq": [ "$_id", "Science" ] }, 0, 0 ] },
{ "$cond": [ { "$eq": [ "$_id", "Horror" ] }, 3, 0 ] },
]
}
}
});
pipe.push({ "$sort": { "sortOrder": 1, "distance": 1 } });
Alternate
Of course the end result here and the general problem with all above is that you really only want the "top 10" of each "type" to return. The aggregation pipeline will do it, but at the cost of keeping more than 10 and then "popping off the stack" until 10 is reached.
An alternate approach is to "brute force" this with mapReduce and "globally scoped" variables. Not as nice, since the results are all in arrays, but it may be a practical approach:
db.collection.mapReduce(
function () {
if ( !stash.hasOwnProperty(this.type) ) {
stash[this.type] = [];
}
if ( stash[this.type].length < maxLen ) {
stash[this.type].push({
"title": this.title,
"author": this.author,
"type": this.type,
"distance": this.distance
});
emit( this.type, 1 );
}
},
function(key,values) {
return 1; // really just want to keep the keys
},
{
"query": {
"location": {
"$nearSphere": [-118.09771, 33.89244]
}
},
"scope": { "stash": {}, "maxLen": 10 },
"finalize": function(key,value) {
return { "msgs": stash[key] };
},
"out": { "inline": 1 }
}
)
This is a real cheat which just uses the "global scope" to keep a single object whose keys are the grouping keys. The results are pushed onto an array in that global object until the maximum length is reached. Results are already sorted by nearest, so the mapper just gives up doing anything with the current document after the 10 are reached per key.
The reducer does essentially nothing here, since the real data is kept in the global stash; it just returns 1. The finalize then just "pulls" the value from the global and returns it in the result.
Simple, but of course you don't have all the $geoNear options if you really need them, and this form has the hard limit of 100 documents as the output from the initial query.
This is a classic case for subquery/join which is not supported by MongoDB. All joins and subquery-like operations need to be implemented in the application logic. So multiple queries is your best bet. Performance of the multiple query approach should be good if you have an index on type.
Alternatively you can write a single aggregation query minus the type-matching and limit clauses and then process the stream in your application logic to limit documents per type.
This approach will be low on performance for large result sets because documents may be returned in random order. Your limiting logic will then need to traverse the entire result set.
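A minimal sketch of that application-side limiting (here cursor stands for the result of that single aggregation, and the cap of 10 per type is the requirement from the question):
var perType = 10;
var counts = {};
var results = [];
cursor.forEach(function (doc) {
    // Count documents per type and keep at most perType of each.
    counts[doc.type] = (counts[doc.type] || 0) + 1;
    if (counts[doc.type] <= perType) results.push(doc);
});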
I guess you can use cursor.limit() on a cursor to specify the maximum number of documents the cursor will return. limit() is analogous to the LIMIT statement in a SQL database. You must apply limit() to the cursor before retrieving any documents from the database.
I guess this example should help:
// Apply the limit before iterating the cursor:
var myCursor = db.bios.find().limit( 5 );