Inserting numbers (floats) in Elasticsearch with the node client library - javascript

I'm trying to insert documents into Elasticsearch, they come as a format like:
{
  total: 1,
  subtotal: 1.2,
  totalDiscount: 0
}
The issue I'm having is with the zeroes: in JavaScript you can't force 0 to be represented as 0.0 or 0.00.
I can't use text in the ES mappings, as I obviously want to do mathematical operations on these fields, so I'm using a 'float' mapping for all of the above.
So, for each of those fields I have something like:
"subtotal": {
"type": "float"
},
I've tried all sorts of different combinations. Storing them as 'text' doesn't let me query them the way I want; if I don't define the mapping I get a 'long' type for the fields, which truncates them; if I use 'float' I get an exception, mapper [totalDiscount] cannot be changed from type [float] to [long]; and if I remove them completely, skipping the save, I get an error too:
Rejecting mapping update to [...] as the final mapping would have more than 1 type
Any help much appreciated, thanks.

Update:
The scaled_float didn't work well for me, so I ended up doing this "the Stripe way",
i.e. representing all monetary amounts in integer cents: it's safe, takes less space on disk, and just works without having to define a mapping.
I also used https://currency.js.org/ to make sure the multiplications and output wouldn't suffer from the well-known floating-point issues in JS.
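For reference, a minimal sketch of that cents conversion with currency.js (the order shape here is just the example from the question; field names for the indexed document are illustrative):

const currency = require('currency.js');

// Hypothetical order with float amounts coming from upstream.
const order = { total: 1, subtotal: 1.2, totalDiscount: 0 };

// Convert every monetary field to integer cents before indexing;
// currency.js avoids the usual float rounding issues in the multiplication.
const docToIndex = {
  totalCents: currency(order.total).intValue,               // 100
  subtotalCents: currency(order.subtotal).intValue,         // 120
  totalDiscountCents: currency(order.totalDiscount).intValue // 0
};
// Dynamic mapping will pick `long` for these integers, which is exactly what we want here.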

As this might be useful to someone reading, I think the answer might be to use this sort of mapping:
"price": {
"type": "scaled_float",
"scaling_factor": 100
}
Not only is it more disk-efficient, but it won't have the above issues.
I'll keep this thread updated to see if that works.
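If that pans out, a sketch of defining the mapping up front with the official Node client might look like this (a 7.x-style request; the index name orders is just an example, not from the question):

const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: 'http://localhost:9200' });

async function createOrdersIndex() {
  // Create the index with an explicit scaled_float mapping so dynamic
  // mapping never gets a chance to guess `long` when the first value is 0.
  await client.indices.create({
    index: 'orders',
    body: {
      mappings: {
        properties: {
          total:         { type: 'scaled_float', scaling_factor: 100 },
          subtotal:      { type: 'scaled_float', scaling_factor: 100 },
          totalDiscount: { type: 'scaled_float', scaling_factor: 100 }
        }
      }
    }
  });
}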

I am not familiar with the Node client library, but in Elasticsearch the errors signify the following:
mapper [totalDiscount] cannot be changed from type [float] to [long]
From the above error, it seems that when you created the index, the totalDiscount field was defined with the float data type, and now you are trying to change it to the long data type. That is not possible, which is why the error is thrown.
Rejecting mapping update to [...] as the final mapping would have more than 1 type
This error occurs because mapping types are deprecated in the APIs in 7.0, with breaking changes to the index creation, put mapping, get mapping, put template, get template and get field mappings APIs. Refer to the documentation on the removal of mapping types to learn more.

Related

Suggest type of an object based on a given title

I am working on an Ember application that deals with geospatial data processing. Part of this project is importing a JSON object that describes a data layer, which contains fields corresponding to data entries. For example, suppose I am importing a data layer called "Laundry Facilities"; the JSON will look something like this:
{
  key: "laundryFacilities",
  label: "Laundry Facilities",
  fields: [
    {
      "label": "Name of Facility",
      "key": "name"
    },
    {
      "label": "Number of Dryers",
      "key": "numberDryers"
    }
  ]
}
At some point in my data import workflow, the user must specify a type for each field. For example, the type for "Name of Facility" would be a string, and the type for "Number of Dryers" would be an integer. I'd like to be able to provide a suggested type to the user based on the label or key attribute rather than forcing them to specify the type for every field. Is there any kind of algorithm, package, framework, etc. that provides functionality for guessing a data type based on something qualitative like a label describing the data field? Or does anyone know of another way I could implement this? I know not to expect 100% accuracy, but even a rough type guess would be extremely helpful. Bonus points if it's an Ember addon.
Your best bet is to write some simple heuristic, not much more complicated than a bunch of keywords mapping to types. As you've described, 'number' probably means a number type and 'name' likely means a 'name' type.
In general, you're describing a classification problem. This is going to be difficult to solve with a (presumably) small set of training examples. If you can get a decent number of examples of column names, I would first try a decision tree or a logistic regression, which would take the presence of certain words as features, and produce a data type as the output variable.
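For what it's worth, a minimal sketch of that keyword heuristic (the keyword table and field shape are purely illustrative, not from the question):

// Map of keywords to suggested types; extend as you see more labels.
const KEYWORD_TYPES = [
  { pattern: /number|count|qty|quantity|total/i, type: 'integer' },
  { pattern: /date|time|year/i,                  type: 'date' },
  { pattern: /name|title|description|label/i,    type: 'string' }
];

// Guess a type from a field's label and key; fall back to 'string'.
function suggestType(field) {
  const text = `${field.label} ${field.key}`;
  const match = KEYWORD_TYPES.find(entry => entry.pattern.test(text));
  return match ? match.type : 'string';
}

// suggestType({ label: 'Number of Dryers', key: 'numberDryers' }) // 'integer'
// suggestType({ label: 'Name of Facility', key: 'name' })         // 'string'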

storing data as object vs array in MongoDb for write performance

Should I store objects in an array or inside an object, with top priority given to write speed?
I'm trying to decide whether data should be stored as an array of objects or as nested objects inside a MongoDB document.
In this particular case, I'm keeping track of a set of continually updated files that I add to and update; the file name acts as the key, along with the number of lines processed within that file.
The document looks something like this:
{
  t_id: 1220,
  "some-other-info": {}, // there's other info here, not updated frequently
  files: {
    "log1-txt": { filename: "log1.txt", numlines: 233, filesize: 19928 },
    "log2-txt": { filename: "log2.txt", numlines: 2, filesize: 843 }
  }
}
or this
{
  t_id: 1220,
  "some-other-info": {},
  files: [
    { filename: "log1.txt", numlines: 233, filesize: 19928 },
    { filename: "log2.txt", numlines: 2, filesize: 843 }
  ]
}
I am assuming that when handling a document, especially when it comes to updates, it is easier to deal with objects, because the location of an entry can be determined by its name; unlike an array, where I have to look through each element's value until I find a match.
Because the object key will have periods, I will need to convert (or drop) the periods to create a valid key (fi.le.log to filelog or fi-le-log).
I'm not worried about possible duplicate file names emerging from that (such as fi.le.log and fi-le.log), so I would prefer to use objects, because the number of files is relatively small but the updates are frequent.
Or would it be better to handle this data in a separate collection for best write performance...
{
  "_id": ObjectId('56d9f1202d777d9806000003'),
  "t_id": "1220",
  "filename": "log1.txt",
  "filesize": 1843,
  "numlines": 554
},
{
  "_id": ObjectId('56d9f1392d777d9806000004'),
  "t_id": "1220",
  "filename": "log2.txt",
  "filesize": 5231,
  "numlines": 3027
}
From what I understand you are talking about write speed, without any read consideration. So we have to think about how you will insert/update your document.
We have to compare the two update patterns (assuming you know the _id of the document you are updating; replace {key} with the key name, in your example log1-txt or log2-txt):
db.Col.update({ _id: '' }, { $set: { 'files.{key}': object }})
vs
db.Col.update({ _id: '', 'files.filename': '{key}'}, { $set: { 'files.$': object }})
The second one means that MongoDB has to scan the array, find the matching index, and update it. The first one means MongoDB just updates the specified field.
The worst part:
The second command will not work if the matching filename is not present in the array! So you have to execute it, check whether nMatched is 0, and create the entry if so. That's really bad for write speed (see MongoDB: upsert sub-document).
If you will never (or almost never) use read queries or the aggregation framework on this collection, go for the first one; it will be faster. If you want to aggregate, unwind, and do some analytics on the files you parsed to get statistics about file sizes and line numbers, you may consider using the second one; you will avoid some headaches.
Pure write speed will be better with the first solution.
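A rough sketch of both patterns side by side with the Node.js driver (assuming col is a collection handle; variable names are illustrative):

async function saveFileInfo(col, file) {
  const entry = { filename: file.filename, numlines: file.numlines, filesize: file.filesize };

  // Object form: one targeted $set, no array scan. Keys must not contain dots.
  const safeKey = file.filename.replace(/\./g, '-'); // "log1.txt" -> "log1-txt"
  await col.updateOne({ t_id: 1220 }, { $set: { ['files.' + safeKey]: entry } });

  // Array form: MongoDB has to scan the array for the matching filename first.
  const res = await col.updateOne(
    { t_id: 1220, 'files.filename': file.filename },
    { $set: { 'files.$': entry } }
  );
  // If nothing matched, the entry does not exist yet and needs a second write.
  if (res.matchedCount === 0) {
    await col.updateOne({ t_id: 1220 }, { $push: { files: entry } });
  }
}

// saveFileInfo(db.collection('Col'), { filename: 'log1.txt', numlines: 233, filesize: 19928 });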

How do I create a "like" filter view in CouchDB

Here's an example of what I need in sql:
SELECT name FROM employ WHERE name LIKE '%bro%'
How do I create view like that in CouchDB?
The simple answer is that CouchDB views aren't ideal for this.
The more complicated answer is that this type of query tends to be very inefficient in typical SQL engines too, and so if you grant that there will be tradeoffs with any solution then CouchDB actually has the benefit of letting you choose your tradeoff.
1. The SQL Ways
When you do SELECT ... WHERE name LIKE %bro%, all the SQL engines I'm familiar with must do what's called a "full table scan". This means the server reads every row in the relevant table, and brute force scans the field to see if it matches.
You can do this in CouchDB 2.x with a Mango query using the $regex operator. The query would look something like this for the basic case:
{"selector":{
"name": {
"$regex": "bro"
}
}}
There do not appear to be any options exposed for case-sensitivity, etc. but you could extend it to match only at the beginning/end or more complicated patterns. If you can also restrict your query via some other (indexable) field operator, that would likely help performance. As the documentation warns:
Regular expressions do not work with indexes, so they should not be used to filter large data sets. […]
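As suggested above, a sketch of a selector that narrows the scan with an indexable field before applying the regex (this assumes your documents carry a type field, which may not be the case for you):

{"selector": {
  "type": "employee",
  "name": { "$regex": "bro" }
}}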
You can do a full scan in CouchDB 1.x too, using a temporary view:
POST /some_database/_temp_view
{"map": "function (doc) { if (doc.name && doc.name.indexOf('bro') !== -1) emit(null); }"}
This will look through every single document in the database and give you a list of matching documents. You can tweak the map function to also match on a document type, or to emit with a certain key for ordering — emit(doc.timestamp) — or some data value useful to your purpose — emit(null, doc.name).
2. The "tons of disk space available" way
Depending on your source data size you could create an index that emits every possible "interior string" as its permanent (on-disk) view key. That is to say for a name like "Dobros" you would emit("dobros"); emit("obros"); emit("bros"); emit("ros"); emit("os"); emit("s");. Then for a term like '%bro%' you could query your view with startkey="bro"&endkey="bro\uFFFF" to get all occurrences of the lookup term. Your index will be approximately the size of your text content squared, but if you need to do an arbitrary "find in string" faster than the full DB scan above and have the space this might work. You'd be better served by a data structure designed for substring searching though.
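A minimal sketch of such a map function, lowercasing so lookups are case-insensitive:

function (doc) {
  if (!doc.name) return;
  var name = doc.name.toLowerCase();
  // Emit every suffix of the name as a key: "dobros", "obros", "bros", "ros", "os", "s"
  for (var i = 0; i < name.length; i++) {
    emit(name.slice(i), null);
  }
}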
Which brings us to...
3. The Full Text Search way
You could use a CouchDB plugin (couchdb-lucene now via Dreyfus/Clouseau for 2.x, ElasticSearch, SQLite's FTS) to generate an auxiliary text-oriented index into your documents.
Note that most full text search indexes don't naturally support arbitrary wildcard prefixes either, likely for similar reasons of space efficiency as we saw above. Usually full text search doesn't imply "brute force binary search", but "word search". YMMV though, take a look around at the options available in your full text engine.
If you don't really need to find "bro" anywhere in a field, you can implement a basic "find a word starting with X" search with regular CouchDB views by just splitting on various locale-specific word separators and emitting those "words" as your view keys (see the sketch below). This will be more efficient than the above, scaling proportionally to the amount of data indexed.
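A sketch of that word-level view, splitting on a simple non-alphanumeric separator (adjust the separator for your locale and field names):

function (doc) {
  if (!doc.name) return;
  // Emit each lowercased word of the name as its own key.
  doc.name.toLowerCase().split(/[^a-z0-9]+/).forEach(function (word) {
    if (word) emit(word, null);
  });
}

// Querying with ?startkey="bro"&endkey="bro\ufff0" then returns rows whose words start with "bro".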
Unfortunately, searching with LIKE %...% isn't really how CouchDB views work, but you can accomplish a great deal of search capability by installing couchdb-lucene, a full-text search engine that creates indexes on your database with which you can do more sophisticated searches.
The typical way to "search" a database for a given key, without any 3rd party tools, is to create a view that emits the value you are looking for as the key. In your example:
function (doc) {
  emit(doc.name, doc);
}
This outputs a list of all the names in your database.
Now, you would "search" based on the first letters of your key. For example, if you are searching for names that start with "bro".
/db/_design/test/_view/names?startkey="bro"&endkey="brp"
Notice I took the search parameter and "incremented" its last letter. Again, if you want to perform searches rather than aggregate statistics, you should use a full-text search engine like Lucene (see above).
You can use regular expressions. As per this table you can write something like this to return any id that contains "SMS".
{
  "selector": {
    "_id": {
      "$regex": "sms"
    }
  }
}
Basic regexes you can use with that include:
"sms$", roughly equivalent to LIKE "%sms"
"^sms", roughly equivalent to LIKE "sms%"
You can read more on regular expressions here
I found a simple view that works for my problem:
{"getavailableproduct": {
"map": "function(doc) { var prefix = doc['productid'].match(/[A-Za-z0-9]+/g); if(prefix) for(var pre in prefix) { emit(prefix[pre],null); } }"
}
}
From this view code, I split a key sentence into key words,
and I can call
?key="[search_keyword]"
But I need more complex code, because with this I can only find the exact word I type (e.g. eat, food, etc.);
if I type an incomplete word (e.g. ea from eat, or foo from food), the code does not work.
I know it is an old question, but: what about using a "list" function? You can have all your normal views, and then add a "list" function to the design document to process the view's results:
{
  "_id": "_design/...",
  "views": {
    "employees": "..."
  },
  "lists": {
    "by_name": "..."
  }
}
And the function attached to "by_name" should be something like:
function (head, req) {
  provides('json', function () {
    var filtered = [];
    var row;
    while (row = getRow()) {
      // We can retrieve all row information from the view
      var key = row.key;
      var value = row.value;
      var doc = req.query.include_docs ? row.doc : {};
      if (value.name.indexOf(req.query.name) == 0) {
        if (req.query.include_docs) {
          filtered.push({ key: key, value: value, doc: doc });
        } else {
          filtered.push({ key: key, value: value });
        }
      }
    }
    return toJSON({ total_rows: filtered.length, rows: filtered });
  });
}
You can, of course, use regular expressions too. It's not a perfect solution, but it works for me.
You could emit your documents like normal. emit(doc.name, null); I would throw a toLowerCase() on that name to remove case sensitivity.
and then query the view with a slew of keys to see if something "like" the query shows up.
keys = differentVersions("bro"); // returns ["bro", "br", "bo", "ro", "cro", "dro", ..., "zro"]
$.couch("db").view("employeesByName", { keys: keys, success: dealWithIt } )
Some considerations
Obviously that array can get really big really fast depending on what differentVersions returns. You might hit a post data limit at some point or conceivably get slow lookups.
The results are only as good as differentVersions is at giving you guesses for what the person meant to spell. Obviously this function can be as simple or complex as you like. In this example I tried two strategies, a) removed a letter and pushed that, and b) replaced the letter at position n with all other letters. So if someone had been looking for "bro" but typed in "gro" or "bri" or even "bgro", differentVersions would have permuted that to "bro" at some point.
While not ideal, it's still pretty fast, since a lookup in Couch's B-trees is fast.
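For illustration, a rough sketch of what a differentVersions helper could look like, implementing the two strategies described above (drop one letter, and substitute each letter with every other letter); this is an assumption about its internals, not code from the answer:

function differentVersions(term) {
  var letters = 'abcdefghijklmnopqrstuvwxyz';
  var versions = [term];
  for (var i = 0; i < term.length; i++) {
    // a) remove the letter at position i: "bro" -> "ro", "bo", "br"
    versions.push(term.slice(0, i) + term.slice(i + 1));
    // b) replace the letter at position i with every other letter: "aro", "cro", ..., "zro"
    for (var j = 0; j < letters.length; j++) {
      if (letters[j] !== term[i]) {
        versions.push(term.slice(0, i) + letters[j] + term.slice(i + 1));
      }
    }
  }
  return versions;
}

// differentVersions("bro") -> ["bro", "ro", "aro", ..., "bo", ..., "br", ..., "brz"]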
Why can't we just use indexOf() in a view?

Autocompletion results formatting with JQuery

I am currently using this autocomplete plugin. It's pretty straightforward. It accepts a URL, and then uses that data to perform an auto-complete.
This is my code to auto-complete it.
autocompleteurl = '/misc/autocomplete/?q='+$("#q").val()
$("#q").autocomplete(autocompleteurl, {multiple:true});
If someone types "apple", that autocompleteurl page will return this result:
apple store,applebees,apple.com,apple trailers,apple store locator,apple vacations,applebees menu,apple iphone,apple tablet,apple tv
However, for some reason, when I actually use this auto-complete, everything is jumbled together. The plugin treats the entire page as one big string, instead of splitting on the commas and treating them as individual items.
Can someone tell me what options I need to put in order to treat them as individual items? I've tried many options but none work.
From the manual (http://docs.jquery.com/Plugins/Autocomplete/autocomplete#url_or_dataoptions)
A value of "foo" would result in this
request url:
my_autocomplete_backend.php?q=foo&limit=10
The result must return with one value
on each line. The result is presented
in the order the backend sends it.
From what you have posted it seems like you have it comma separated.
The plugin automatically adds the q to the querystring and uses the current value of the text box as the value.
This should be sufficient as long as you're returning the data in the correct format:
$("#q").autocomplete('/misc/autocomplete/', {multiple:true});
#alex I'm getting quirky behavior too, for 2/3/4-character inputs.
See http://docs.jquery.com/Plugins/Autocomplete/autocomplete#toptions .
If you set the minChars option to 2 or 3 it makes things more sane.
There's funny behavior when you have 5 results for "ab" and the same 5 results for "abc": it does nothing, giving the impression that it is not working!
But it is working, and I suspect it has to do with the caching options.

What is the maximum value for a compound CouchDB key?

I'm using what seems to be a common trick for creating a join view:
// a Customer has many Orders; show them together in one view:
function (doc) {
  if (doc.Type == "customer") {
    emit([doc._id, 0], doc);
  } else if (doc.Type == "order") {
    emit([doc.customer_id, 1], doc);
  }
}
I know I can use the following query to get a single customer and all related Orders:
?startkey=["some_customer_id"]&endkey=["some_customer_id", 2]
But now I've tied my query very closely to my view code. Is there a value I can put where I put my "2" to more clearly say, "I want everything tied to this Customer"? I think I've seen
?startkey=["some_customer_id"]&endkey=["some_customer_id", {}]
But I'm not sure that {} is certain to sort after everything else.
Credit to cmlenz for the join method.
Further clarification from the CouchDB wiki page on collation:
The query startkey=["foo"]&endkey=["foo",{}] will match most array keys with "foo" in the first element, such as ["foo","bar"] and ["foo",["bar","baz"]]. However it will not match ["foo",{"an":"object"}]
So {} is late in the sort order, but definitely not last.
I have two thoughts.
Use timestamps
Instead of using the simple 0 and 1 for their collation behavior, use the timestamp at which the record was created (assuming it is part of the record), a la [doc._id, doc.created_at]. Then you could query your view with a startkey of some sufficiently early date (epoch would probably work) and an endkey of "now", e.g. date +%s. That key range should always include everything, and it has the added benefit of collating by date, which is probably what you want anyway.
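A sketch of that variant, assuming created_at is a unix timestamp on both document types (the field name is an assumption from the answer, not from the question):

// View: key on [customer id, creation timestamp] instead of [id, 0|1].
function (doc) {
  if (doc.Type == "customer") {
    emit([doc._id, doc.created_at], doc);
  } else if (doc.Type == "order") {
    emit([doc.customer_id, doc.created_at], doc);
  }
}

// Query: bracket the whole time range, where NOW is the current unix timestamp:
// ?startkey=["some_customer_id", 0]&endkey=["some_customer_id", NOW]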
or, just don't worry about it
You could just index by the customer_id and nothing more. This would have the nice advantage of being able to query using just key=<customer_id>. Sure, the records won't be collated when they come back, but is that an issue for your application? Unless you are expecting tons of records back, it would likely be trivial to simply pluck the customer record out of the list once you have the data retrieved by your application.
For example, in Ruby:
order_records = records.delete_if { |record| record.type == "customer" }
Anyway, the timestamp approach is probably the more attractive answer for your case.
Rather than trying to find the greatest possible value for the second element in your array key, I would suggest instead trying to find the least possible value greater than the first: ?startkey=["some_customer_id"]&endkey=["some_customer_id\u0000"]&inclusive_end=false.
CouchDB is mostly written in Erlang. I don't think there would be an upper limit on compound/composite key tuple sizes other than system resources (e.g. a key so long it used all available memory). The limits of CouchDB scalability are unknown according to the CouchDB site. I would guess that you could keep adding fields into a huge composite primary key and the only thing that would stop you is system resources or hard limits such as maximum integer sizes on the target architecture.
Since CouchDB stores everything as JSON, it is probably limited to the largest number values allowed by the ECMAScript standard. All numbers in JavaScript are stored as IEEE 754 double-precision floats, which can represent magnitudes up to about 1.7976931348623157e+308 (and as small as 5e-324).
It seems like it would be nice to have a feature where endKey could be inclusive instead of exclusive.
This should do the trick:
?startkey=["some_customer_id"]&endkey=["some_customer_id", "\uFFFF"]
This should include anything that starts with a character less than \uFFFF (i.e. any Unicode character).
http://wiki.apache.org/couchdb/View_collation
