What is the maximum value for a compound CouchDB key? - javascript

I'm using what seems to be a common trick for creating a join view:
// a Customer has many Orders; show them together in one view:
function (doc) {
  if (doc.Type == "customer") {
    emit([doc._id, 0], doc);
  } else if (doc.Type == "order") {
    emit([doc.customer_id, 1], doc);
  }
}
I know I can use the following query to get a single customer and all related Orders:
?startkey=["some_customer_id"]&endkey=["some_customer_id", 2]
But now I've tied my query very closely to my view code. Is there a value I can put where I put my "2" to more clearly say, "I want everything tied to this Customer"? I think I've seen
?startkey=["some_customer_id"]&endkey=["some_customer_id", {}]
But I'm not sure that {} is certain to sort after everything else.
Credit to cmlenz for the join method.
Further clarification from the CouchDB wiki page on collation:
The query startkey=["foo"]&endkey=["foo",{}] will match most array keys with "foo" in the first element, such as ["foo","bar"] and ["foo",["bar","baz"]]. However it will not match ["foo",{"an":"object"}]
So {} is late in the sort order, but definitely not last.

I have two thoughts.
Use timestamps
Instead of using simple 0 and 1 for their collation behavior, use the timestamp at which the record was created (assuming it is part of the record), a la [doc._id, doc.created_at]. Then you could query your view with a startkey of some sufficiently early date (the epoch would probably work) and an endkey of "now", e.g. date +%s. That key range should always include everything, and it has the added benefit of collating by date, which is probably what you want anyway.
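For instance, a minimal sketch of that view (treating created_at as a unix timestamp is an assumption about your order documents):

function (doc) {
  if (doc.Type == "customer") {
    emit([doc._id, 0], doc); // 0 collates before any real timestamp
  } else if (doc.Type == "order") {
    emit([doc.customer_id, doc.created_at], doc);
  }
}

You would then query with something like ?startkey=["some_customer_id", 0]&endkey=["some_customer_id", <now as epoch seconds>].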
or, just don't worry about it
You could just index by the customer_id and nothing more. This would have the nice advantage of being able to query using just key=<customer_id>. Sure, the records won't be collated when they come back, but is that an issue for your application? Unless you are expecting tons of records back, it would likely be trivial to simply pluck the customer record out of the list once you have the data retrieved by your application.
For example in ruby:
customer_record = records.find { |record| record.type == "customer" }
records.delete_if { |record| record.type == "customer" } # leaves just the orders
Anyway, the timestamp approach is probably the more attractive answer for your case.

Rather than trying to find the greatest possible value for the second element in your array key, I would suggest instead trying to find the least possible value greater than the first: ?startkey=["some_customer_id"]&endkey=["some_customer_id\u0000"]&inclusive_end=false.

CouchDB is mostly written in Erlang. I don't think there would be an upper limit on compound/composite key tuple sizes other than system resources (e.g. a key so long it used all available memory). The limits of CouchDB scalability are unknown, according to the CouchDB site. I would guess that you could keep adding fields into a huge composite key and the only thing that would stop you is system resources or hard limits such as maximum integer sizes on the target architecture.
Since CouchDB stores everything using JSON, it is probably limited to the largest number values allowed by the ECMAScript standard. All numbers in JavaScript are stored as IEEE 754 double-precision floats. I believe a 64-bit double can represent magnitudes from about 5e-324 (the smallest denormal) up to 1.7976931348623157e+308.

It seems like it would be nice to have a feature where endKey could be inclusive instead of exclusive.

This should do the trick:
?startkey=["some_customer_id"]&endkey=["some_customer_id", "\uFFFF"]
This should include anything whose second key element is a string of characters below \uFFFF, which covers effectively all Unicode strings.
http://wiki.apache.org/couchdb/View_collation

Related

Inserting numbers (floats) in Elasticsearch with the node client library

I'm trying to insert documents into Elasticsearch, they come as a format like:
{
  total: 1,
  subtotal: 1.2,
  totalDiscount: 0
}
The issue I'm having is with the zeroes, in JavaScript you can't force '0' to be represented as '0.0' or '0.00'.
I can't use text in the mappings in ES, as I want to obviously do mathematical operations on these fields. So I'm using a 'float' mapping for all of the above.
So, for each of those fields I have something like:
"subtotal": {
"type": "float"
},
I've tried all sorts of different combinations. Storing them as 'text' doesn't let me query them the way I want; if I don't define the mapping, I get a 'long' type for the fields, which truncates them; if I use float, I get an exception mapper [totalDiscount] cannot be changed from type [float] to [long]; and if I remove them completely, skipping the save, I get an error too:
Rejecting mapping update to [...] as the final mapping would have more than 1 type
Any help much appreciated, thanks.
Update:
the scaled_float didn't work well for me, so I ended up doing this "the Stripe way", i.e. representing all monetary amounts in cents: safe, less space on disk, and it just works without having to define a mapping.
I also used https://currency.js.org/ to make sure the multiplication and output wouldn't suffer from the well-known issues with floats in JS.
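As a rough sketch of that conversion (the field names are just illustrative; currency.js's intValue gives the amount in cents):

const currency = require('currency.js');

// store integer cents in Elasticsearch; no float mapping required
const doc = {
  total: currency(1).intValue,          // 100
  subtotal: currency(1.2).intValue,     // 120
  totalDiscount: currency(0).intValue   // 0
};

// convert back for display
const subtotal = currency(doc.subtotal / 100).format(); // "$1.20"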
as this might be useful to someone reading, I think the answer might be using this sort of mapping:
"price": {
"type": "scaled_float",
"scaling_factor": 100
}
Not only is it more disk-efficient, but it won't have the above issues.
I'll keep this thread updated, to see if that works.
I am not familiar with the node client library, but in elasticsearch, the errors signify that -
mapper [totalDiscount] cannot be changed from type [float] to [long]
From the above error, it seems that when you created the index, the totalDiscount field was defined with the float field data type, and you are now changing it to the long data type. This is not possible, which is why the above error is thrown.
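For example, you could avoid the conflict entirely by (re)creating the index with the mapping defined up front, so dynamic mapping never gets a chance to guess long (a sketch assuming ES 7.x; the index name is a placeholder):

PUT /my_index
{
  "mappings": {
    "properties": {
      "total": { "type": "float" },
      "subtotal": { "type": "float" },
      "totalDiscount": { "type": "float" }
    }
  }
}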
Rejecting mapping update to [...] as the final mapping would have more than 1 type
This error occurs because mapping types were deprecated in the APIs in 7.0, with breaking changes to the index creation, put mapping, get mapping, put template, get template and get field mappings APIs. Refer to this to learn more about the removal of mapping types.

Dynamic Frequency Map from MongoDB Keys

I'm using MiniMongo through Meteor, and I'm trying to create a frequency table based off of a dynamic set of queries.
I have two main fields, localHour and localDay. I expect many overlaps, and I'd like to determine where the most overlaps occur. My current method of doing this is so.
if (TempStats.findOne({
  localHour: hours,
  localDay: day
})) { // checks if there is already some entry on the same day/hour
  TempStats.update({ // if so, we just increment frequency
    localHour: hours,
    localDay: day
  }, {
    $inc: { freq: 1 }
  });
} else { // if nothing exists yet, we put in a new entry
  TempStats.insert({
    localHour: hours,
    localDay: day,
    freq: 1
  });
}
Essentially, this code runs every time I have new data I want to insert. It works fine at the moment, in that, after all data is inserted, I can sort by frequency to find what set of hours & days occurs the most often (TempStats.find({}, {sort: {freq: -1}}).fetch()).
However, I'm looking more for a way to search by frequency for any key. For instance, searching for the day which everything occurs on the most often as opposed to both the date and hour. With my current way of doing this, I would need to have multiple databases and different methods of inserting for each, which is a bit ridiculous. Is there a Mongo (specifically MiniMongo) solution to do frequency maps based on keys?
Thanks!
It looks like miniMongo does not in fact support aggregation, which makes this kind of operation difficult. One way to go about it would be aggregating yourself at the end of each day and inserting that aggregate record into your db (without the hour field or with it set to something like -1). Conversely as wastefully you could also update that record at the time of each insert. This would allow you to use the same collection for both and is fairly common in other dbs.
Also you should consider #nickmilon's first suggestion since the use of an upsert statement with the $inc operator would reduce your example to a single operation per data point.
A small note on your code: the part that comes as an else statement is not really required. Your update alone will do the complete job if you combine it with the option upsert: true; it will insert a new document when none matches, and $inc will set the freq field to 1 as desired. See: here and here.
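A minimal sketch of that single-operation version, using the same fields as the question:

// one upsert replaces the whole find/insert/update branch above;
// if no document matches the selector, one is inserted and $inc sets freq to 1
TempStats.upsert(
  { localHour: hours, localDay: day },
  { $inc: { freq: 1 } }
);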
For alternative ways to count your frequencies: assuming you store the date as a datetime object, I would suggest using an aggregation (I am not sure if MiniMongo has added support for aggregation yet, but there are solutions). With aggregation you can use datetime operators such as $hour, $week, etc. for grouping, and count the frequencies without having to keep counts in the database.
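On a server-side MongoDB that supports aggregation, that could look roughly like this (the collection and createdAt field names are assumptions):

db.tempstats.aggregate([
  // bucket by day and hour extracted from the datetime
  { $group: {
      _id: { day: { $dayOfMonth: "$createdAt" }, hour: { $hour: "$createdAt" } },
      freq: { $sum: 1 }
  }},
  { $sort: { freq: -1 } }
]);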
This is basically a simple map-reduce problem.
First, don't separate the derived data into 2 fields. This violates DB best practices. If the data comes to you this way, use it to create a Date object. I assume you have a bunch of collections that are being subscribed to, and you then aggregate all those into this temporary local collection. This is the mapping step of the map-reduce pattern. At this point, since your query is unknown, it's a waste of CPU (even though it's your client's) to aggregate. Map first, reduce second. What you should have is a collection full of datetimes; call it TempMapCollection if you wish. Now, use a forEach() and pass in your reduce function (by day, by hour, etc).
You can reduce into another local collection, or into a javascript object. I like using collections, but if the objects are complex, you'll get EJSON errors all up in there. Since your objects are nothing more than a datetime, let's use collections.
so you've got something like:
TempMapCollection.find().forEach(function (doc) {
  // reducing by day here; use getHours() instead to reduce by hour
  var day = doc.dateTime.getDate();
  TempReduceCollection.upsert({ timequery: day }, { $inc: { freq: 1 } });
});
Now query your reduce collection. This has the added benefit that you won't have to re-map if you want to do 2 unique queries.
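Querying the reduce collection is then a plain sorted find, for example:

// most frequent time bucket first
TempReduceCollection.find({}, { sort: { freq: -1 } }).fetch();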

Range query for MongoDB pagination

I want to implement pagination on top of a MongoDB. For my range query, I thought about using ObjectIDs:
db.tweets.find({ _id: { $lt: maxID } }, { limit: 50 })
However, according to the docs, the structure of the ObjectID means that "ObjectId values do not represent a strict insertion order":
The relationship between the order of ObjectId values and generation time is not strict within a single second. If multiple systems, or multiple processes or threads on a single system generate values, within a single second; ObjectId values do not represent a strict insertion order. Clock skew between clients can also result in non-strict ordering even for values, because client drivers generate ObjectId values, not the mongod process.
I then thought about querying with a timestamp:
db.tweets.find({ created: { $lt: maxDate } }, { limit: 50 })
However, there is no guarantee the date will be unique — it's quite likely that two documents could be created within the same second. This means documents could be missed when paging.
Is there any sort of ranged query that would provide me with more stability?
It is perfectly fine to use ObjectId() though your syntax for pagination is wrong. You want:
db.tweets.find().limit(50).sort({"_id":-1});
This says you want tweets sorted by _id value in descending order and you want the most recent 50. Your problem is the fact that pagination is tricky when the current result set is changing - so rather than using skip for the next page, you want to make note of the smallest _id in the result set (the 50th most recent _id value), and then get the next page with:
db.tweets.find( {_id : { "$lt" : <50th _id> } } ).limit(50).sort({"_id":-1});
This will give you the next "most recent" tweets, without new incoming tweets messing up your pagination back through time.
There is absolutely no need to worry about whether _id value is strictly corresponding to insertion order - it will be 99.999% close enough, and no one actually cares on the sub-second level which tweet came first - you might even notice Twitter frequently displays tweets out of order, it's just not that critical.
If it is critical, then you would have to use the same technique but with "tweet date" where that date would have to be a timestamp, rather than just a date.
Wouldn't a tweet's "actual" timestamp (i.e. the time tweeted, and the criterion you want it sorted by) be different from the tweet's "insertion" timestamp (i.e. the time it was added to the local collection)? This depends on your application, of course, but it's a likely scenario that tweet inserts could be batched or otherwise end up being inserted in the "wrong" order. So, unless you work at Twitter (and have access to collections inserted in the correct order), you wouldn't be able to rely on just $natural or ObjectID for sorting logic.
Mongo docs suggest skip and limit for paging:
db.tweets.find({created: {$lt: maxID}}).
  sort({created: -1, username: 1}).
  skip(50).limit(50); // second page
There is, however, a performance concern when using skip:
The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to get the offset or skip position before beginning to return result. As offset increases, cursor.skip() will become slower and more CPU intensive.
This happens because skip does not fit into the MapReduce model and is not an operation that would scale well, you have to wait for a sorted collection to become available before it can be "sliced". Now limit(n) sounds like an equally poor method as it applies a similar constraint "from the other end"; however with sorting applied, the engine is able to somewhat optimize the process by only keeping in memory n elements per shard as it traverses the collection.
An alternative is to use range based paging. After retrieving the first page of tweets, you know what the created value is for the last tweet, so all you have to do is substitute the original maxID with this new value:
db.tweets.find({created: {$lt: lastTweetOnCurrentPageCreated}}).
  sort({created: -1, username: 1}).
  limit(50); // next page
Performing a find condition like this can be easily parallelized. But how do you deal with pages other than the next one? You don't know the begin date for page number 5, 10, 20, or even the previous page! #SergioTulentsev suggests creative chaining of methods, but I would advocate pre-calculating first-last ranges of the aggregate field in a separate pages collection; these could be re-calculated on update. Furthermore, if you're not happy with DateTime (note the performance remarks) or are concerned about duplicate values, you should consider compound indexes on timestamp + account as a tie-breaker (since a user can't tweet twice at the same time), or even an artificial aggregate of the two:
db.pages.
  find({pagenum: 3})
> {pagenum: 3, begin: "01-01-2014#BillGates", end: "03-01-2014#big_ben_clock"}
db.tweets.
  find({_sortdate: {$lt: "03-01-2014#big_ben_clock", $gt: "01-01-2014#BillGates"}}).
  sort({_sortdate: -1}).
  limit(50) // third page
Using an aggregate field for sorting will work "on the fold" (although perhaps there are more kosher ways to deal with the condition). This could be set up as a unique index with values corrected at insert time, with a single tweet document looking like
{
  _id: ...,
  created: ..., // to be used in markup
  user: ..., // also to be used in markup
  _sortdate: "01-01-2014#BillGates" // sorting only, use date AND time
}
The following approach will work even if multiple documents are inserted or updated in the same millisecond, even from multiple clients (which generate the ObjectIds). For simplicity, the following queries project only _id and lastModifiedDate.
First page: fetch the results sorted by lastModifiedDate (descending), then ObjectId (ascending).
db.product.find({}, {"_id": 1, "lastModifiedDate": 1}).sort({"lastModifiedDate": -1, "_id": 1}).limit(2)
Note down the ObjectId and lastModifiedDate of the last record fetched in this page (call them loid and lmd).
For the second page, use a query condition matching (lastModifiedDate == lmd AND _id > loid) OR (lastModifiedDate < lmd):
db.product.find(
  {$or: [
    {"lastModifiedDate": {$lt: lmd}},
    {$and: [{"lastModifiedDate": lmd}, {"_id": {$gt: loid}}]}
  ]},
  {"_id": 1, "lastModifiedDate": 1}
).sort({"lastModifiedDate": -1, "_id": 1}).limit(2)
Repeat the same for subsequent pages.
ObjectIds should be good enough for pagination if you limit your queries to the previous second (or don't care about the subsecond possibility of weirdness). If that is not good enough for your needs then you will need to implement an ID generation system that works like an auto-increment.
Update:
To query the previous second of ObjectIds you will need to construct an ObjectID manually.
See the specification of ObjectId http://docs.mongodb.org/manual/reference/object-id/
Try using this expression to do it from a mongos.
{
  _id: {
    $lt: ObjectId(Math.floor((new Date).getTime() / 1000 - 1).toString(16) + "ffffffffffffffff")
  }
}
The 'f's at the end are there to max out the possible random bits that are not associated with the timestamp, since you are doing a less-than query.
I recommend doing the actual ObjectId creation on your application server rather than on the mongos, since this type of calculation can slow you down if you have many users.
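A sketch of constructing that boundary in application code with the Node.js driver (not taken from the answer above, just an illustration):

const { ObjectId } = require('mongodb');

// previous second as an 8-hex-char timestamp, padded with "f"s so the
// boundary sorts after every ObjectId generated during that second
const seconds = Math.floor(Date.now() / 1000) - 1;
const boundary = new ObjectId(seconds.toString(16).padStart(8, '0') + 'ffffffffffffffff');

// then query with: { _id: { $lt: boundary } }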
I have built pagination using MongoDB's _id this way:
// import ObjectId from mongodb
let sortOrder = -1;
let query = {};
if (prev) {
  sortOrder = 1;
  query = { title: 'findTitle', _id: { $gt: ObjectId('_idValue') } };
}
if (next) {
  sortOrder = -1;
  query = { title: 'findTitle', _id: { $lt: ObjectId('_idValue') } };
}
// note: when paging backwards (prev), reverse the fetched batch before display
db.collection.find(query).limit(10).sort({ _id: sortOrder });

Equal Precedence View Collation CouchDB?

According to the view collation documentation for CouchDB (http://wiki.apache.org/couchdb/View_collation), member order does matter for collation. I was wondering if there is a way to disable this behaviour so that collation order does not matter? I want to be able to "search" my views such that the documents that are emitted satisfy all the key ranges for each field.
here is some more on view collation for your reference: CouchDB sorting and filtering in the same view
Likewise, if it were possible to set CouchDB so that order does not matter for view collation, the following parameters used for the GET request should return only docs where doc.phone_number == "ZZZZZZZ", whereas right now it returns the documents that fall within the range of the first 3 keys and completely ignores the last key. This occurs because the last key has the least precedence in the current collation scheme.
startkey: [null,null,null,"ZZZZZZZ"],
endkey: ["\ufff0","\ufff0","\ufff0","ZZZZZZZZ"],
Sample Mapping Function
var map = function (doc) {
  /*
  // Keys emitted
  1. name
  2. address
  3. age
  4. phone_number
  */
  emit([doc.name, doc.address, doc.num_age, doc.phone_number], doc._id);
};
Is this possible, or do I have to create multiple views to perform this? The use of multiple views seems very inefficient.
I've read that CouchDB-Lucene (How to realize complex search filters in couchdb? Should I avoid temporary views?) would be helpful for complex searching, but that doesn't seem applicable in this case.
Use of multiple views is not inefficient, quite to the contrary : having four views (name, address, age and phone number) will not use significantly more time or memory than having a single view emit everything. It is the simple, straightforward, efficient way of performing "WHERE field = value" queries in CouchDB.
If you are in fact looking for "WHERE field = value AND field2 = value2" queries, then CouchDB will not help you, and you will need to use Lucene.
You need to understand that the collation merely describes how keys are ordered. Even if you could specify an arbitrary collation, you would still have to deal with the fact that CouchDB needs you to define an order for the keys, and only lets you query contiguous ranges of keys. This is not compatible with multi-dimensional range queries.
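As a concrete sketch of the one-view-per-field approach, the phone-number lookup from the question becomes its own view queried with key= (field name taken from the question's map function):

// by_phone view: supports an exact-match query like ?key="ZZZZZZZ"
function (doc) {
  if (doc.phone_number) emit(doc.phone_number, doc._id);
}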

How do I create a "like" filter view in CouchDB

Here's an example of what I need in sql:
SELECT name FROM employ WHERE name LIKE '%bro%'
How do I create view like that in CouchDB?
The simple answer is that CouchDB views aren't ideal for this.
The more complicated answer is that this type of query tends to be very inefficient in typical SQL engines too, and so if you grant that there will be tradeoffs with any solution then CouchDB actually has the benefit of letting you choose your tradeoff.
1. The SQL Ways
When you do SELECT ... WHERE name LIKE %bro%, all the SQL engines I'm familiar with must do what's called a "full table scan". This means the server reads every row in the relevant table, and brute force scans the field to see if it matches.
You can do this in CouchDB 2.x with a Mango query using the $regex operator. The query would look something like this for the basic case:
{"selector":{
"name": {
"$regex": "bro"
}
}}
There do not appear to be any options exposed for case-sensitivity, etc. but you could extend it to match only at the beginning/end or more complicated patterns. If you can also restrict your query via some other (indexable) field operator, that would likely help performance. As the documentation warns:
Regular expressions do not work with indexes, so they should not be used to filter large data sets. […]
You can do a full scan in CouchDB 1.x too, using a temporary view:
POST /some_database/_temp_view
{"map": "function (doc) { if (doc.name && doc.name.indexOf('bro') !== -1) emit(null); }"}
This will look through every single document in the database and give you a list of matching documents. You can tweak the map function to also match on a document type, or to emit with a certain key for ordering — emit(doc.timestamp) — or some data value useful to your purpose — emit(null, doc.name).
2. The "tons of disk space available" way
Depending on your source data size you could create an index that emits every possible "interior string" as its permanent (on-disk) view key. That is to say for a name like "Dobros" you would emit("dobros"); emit("obros"); emit("bros"); emit("ros"); emit("os"); emit("s");. Then for a term like '%bro%' you could query your view with startkey="bro"&endkey="bro\uFFFF" to get all occurrences of the lookup term. Your index will be approximately the size of your text content squared, but if you need to do an arbitrary "find in string" faster than the full DB scan above and have the space this might work. You'd be better served by a data structure designed for substring searching though.
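A sketch of such a suffix-emitting map function (the name field is just an example):

function (doc) {
  if (doc.name) {
    var name = doc.name.toLowerCase();
    // emit every suffix: "dobros", "obros", "bros", "ros", "os", "s"
    for (var i = 0; i < name.length; i++) {
      emit(name.slice(i), null);
    }
  }
}

// query: ?startkey="bro"&endkey="bro\ufff0"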
Which brings us to...
3. The Full Text Search way
You could use a CouchDB plugin (couchdb-lucene now via Dreyfus/Clouseau for 2.x, ElasticSearch, SQLite's FTS) to generate an auxiliary text-oriented index into your documents.
Note that most full text search indexes don't naturally support arbitrary wildcard prefixes either, likely for similar reasons of space efficiency as we saw above. Usually full text search doesn't imply "brute force binary search", but "word search". YMMV though, take a look around at the options available in your full text engine.
If you don't really need to find "bro" anywhere in a field, you can implement a basic "find a word starting with X" search with regular CouchDB views by just splitting on various locale-specific word separators and emitting these "words" as your view keys. This will be more efficient than the above, scaling proportionally to the amount of data indexed.
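A sketch of that word-prefix index (the separator set here is illustrative, not locale-complete):

function (doc) {
  if (doc.name) {
    doc.name.toLowerCase().split(/[\s,.;:]+/).forEach(function (word) {
      if (word) emit(word, null);
    });
  }
}

// then find words starting with "bro" via ?startkey="bro"&endkey="bro\ufff0"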
Unfortunately, doing searches using LIKE %...% aren't really how CouchDB Views work, but you can accomplish a great deal of search capability by installing couchdb-lucene, it's a fulltext search engine that creates indexes on your database that you can do more sophisticated searches with.
The typical way to "search" a database for a given key, without any 3rd party tools, is to create a view that emits the value you are looking for as the key. In your example:
function (doc) {
emit(doc.name, doc);
}
This outputs a list of all the names in your database.
Now, you would "search" based on the first letters of your key. For example, if you are searching for names that start with "bro".
/db/_design/test/_view/names?startkey="bro"&endkey="brp"
Notice I took the last letter of the search parameter, and "incremented" the last letter in it. Again, if you want to perform searches, rather than aggregating statistics, you should use a fulltext search engine like lucene. (see above)
You can use regular expressions. As per this table you can write something like this to return any id that contains "SMS".
{
  "selector": {
    "_id": {
      "$regex": "sms"
    }
  }
}
Basic regexes you can use on that include:
"sms$" is roughly equivalent to LIKE "%sms"
"^sms" is roughly equivalent to LIKE "sms%"
You can read more on regular expressions here
I found a simple view that works for my problem:
{"getavailableproduct": {
"map": "function(doc) { var prefix = doc['productid'].match(/[A-Za-z0-9]+/g); if(prefix) for(var pre in prefix) { emit(prefix[pre],null); } }"
}
}
This view code splits a sentence into individual keywords, so I can call
?key="[search_keyword]"
But I need something more complex, because this only finds the complete word I type (e.g. eat, food, etc.); if I type a partial word (e.g. ea from eat, or foo from food), it does not work.
I know it is an old question, but: what about using a "list" function? You can have all your normal views, and then add a "list" function to the design document to process the view's results:
{
  "_id": "_design/...",
  "views": {
    "employees": "..."
  },
  "lists": {
    "by_name": "..."
  }
}
And the function attached to "by_name" should be something like:
function (head, req) {
  provides('json', function () {
    var filtered = [];
    while (row = getRow()) {
      // We can retrieve all row information from the view
      var key = row.key;
      var value = row.value;
      var doc = req.query.include_docs ? row.doc : {};
      if (value.name.indexOf(req.query.name) == 0) {
        if (req.query.include_docs) {
          filtered.push({ key: key, value: value, doc: doc });
        } else {
          filtered.push({ key: key, value: value });
        }
      }
    }
    return toJSON({ total_rows: filtered.length, rows: filtered });
  });
}
You can, of course, use regular expressions too. It's not a perfect solution, but it works to me.
You could emit your documents like normal: emit(doc.name, null). I would throw a toLowerCase() on that name to remove case sensitivity.
and then query the view with a slew of keys to see if something "like" the query shows up.
keys = differentVersions("bro"); // returns ["bro", "br", "bo", "ro", "cro", "dro", ..., "zro"]
$.couch("db").view("employeesByName", { keys: keys, success: dealWithIt } )
Some considerations
Obviously that array can get really big really fast depending on what differentVersions returns. You might hit a post data limit at some point or conceivably get slow lookups.
The results are only as good as differentVersions is at giving you guesses for what the person meant to spell. Obviously this function can be as simple or complex as you like. In this example I tried two strategies, a) removed a letter and pushed that, and b) replaced the letter at position n with all other letters. So if someone had been looking for "bro" but typed in "gro" or "bri" or even "bgro", differentVersions would have permuted that to "bro" at some point.
While not ideal, it's still pretty fast since a look up in Couch's b-trees is fast.
Why can't we just use indexOf() in the view?
