Store language (ISO 639) as Number - javascript

I'm working on a MongoDB database and so far have stored some information as Numbers instead of Strings because I assumed that would be more efficient. For example, I store countries following ISO 3166-1 numeric and sex following ISO/IEC 5218. But so far I have not found a similar standard for languages, ISO 639 does not appear to have a matching list of numeric codes.
What would be the right way to do this? Should I just use the String codes?
Thanks!

If you're a fan of the numbers, you can use country calling codes, although they "only" represent the ITU members (193 countries according to Wikipedia). But hey, they have Somalia and Palestine, so that's a good hint about how global this is.
However, storing everything in an encoded format (numbers here) implies an on-the-fly decoding step whenever any piece of data is requested (with translation tables held in RAM instead of on the database's disk). That decoding probably happens on the server, whose CPU is precious, or you may have offloaded the issue to the client, overworking the precious, time-critical server-client link in the process.
So, back in the 90s, when a 40 MB HDD was expensive, that might have been interesting. Today, the relative cost of storing data versus processing it has flipped... not counting the time it takes you to think through and implement the transformations. All that said, IMHO, this level of efficiency actually kills efficiency. ;)
EDIT: Oops, just realized I misthought (does that verb even exist?) the country/language issue. Countries you have sorted out already, my bad. I know of no numbered list of languages. However, the second part of the post might still be relevant...

If you are after raw performance and/or want to achieve really small data sizes, I would suggest you use either the three-letter codes from ISO 639-2 (higher granularity) or the two-letter codes from ISO 639-1 (lower granularity).
To my knowledge, no programming language has a helper for this standard built in, so you'd need to build your own translator (code <-> full name), which, however, should be trivial.
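For illustration, a minimal sketch of such a translator in JavaScript; the table below is a tiny hand-picked subset of ISO 639-1, not the full standard:

// Illustrative subset of ISO 639-1; a real table would cover the whole standard.
var LANGUAGES = {
  en: 'English',
  fr: 'French',
  de: 'German',
  ja: 'Japanese'
};

// Build the reverse mapping (full name -> code) once, up front.
var CODES = {};
for (var code in LANGUAGES) {
  CODES[LANGUAGES[code]] = code;
}

console.log(LANGUAGES['ja']);    // 'Japanese'
console.log(CODES['French']);    // 'fr'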
And as others already mentioned, you have to assess the cost involved (e.g. no longer being able to simply look at the data and understand it right away) for yourself. I personally do recommend keeping data sizes small, since BSON parsing and string operations are horribly expensive compared to dealing with numbers (or shorter strings, for that matter). With small data sets this won't make a noticeable difference. If, however, you need to churn through millions of documents or more, optimizations like this can become mission-critical.

Related

Client side search engine optimization

Due to the reasons outlined in this question I am building my own client-side search engine rather than using the ydn-full-text library, which is based on fullproof. What it boils down to is that fullproof spawns "too freaking many records", on the order of 300,000, whilst (after stemming) there are only about 7,700 unique words. So my 'theory' is that fullproof is based on traditional assumptions which only apply to the server side:
Huge indices are fine
Processor power is expensive
(and the assumption of dealing with longer records, which just isn't applicable in my case, as my records are on average only 24 words¹)
Whereas on the client side:
Huge indices take ages to populate
Processing power is still limited, but relatively cheaper than on the server side
Based on these assumptions I started off with an elementary inverted index (giving just 7,700 records, as IndexedDB is a document/NoSQL database). This inverted index has been stemmed using the Lancaster stemmer (the most aggressive of the two or three popular ones), and during a search I retrieve the index for each of the words, assign a score based on the overlap of the different indices and on the similarity of the typed word vs. the original (Jaro-Winkler distance).
The problem with this approach:
Combination of "popular_word + popular_word" is extremely expensive
So, finally getting to my question: how can I alleviate the above problem with minimal growth of the index? I do understand that my approach will be CPU-intensive, but as a traditional full-text search index seems unusably big, this seems to be the only reasonable road to go down. (Pointing me to good resources or papers is also appreciated.)
¹ This is a more or less artificial splitting of unstructured texts into small segments; however, this artificial splitting is standardized in the relevant field, so it has been used here as well. I have not studied the effect on index size of keeping these 'snippets' together versus throwing huge chunks of text at fullproof. I assume it would not make a huge difference, but if I am mistaken then please do point this out.
This is a great question, thanks for bringing some quality to the IndexedDB tag.
While this answer isn't quite production ready, I wanted to let you know that if you launch Chrome with --enable-experimental-web-platform-features then there should be a couple features available that might help you achieve what you're looking to do.
IDBObjectStore.openKeyCursor() - value-free cursors, in case you can get away with the stem only
IDBCursor.continuePrimaryKey(key, primaryKey) - allows you to skip over items with the same key
I was informed of these by an IDB developer on the Chrome team, and while I've yet to experiment with them myself, this seems like the perfect use case.
My thought is that if you approach this problem with two different indexes on the same column, you might be able to get that join-like behavior you're looking for without bloating your stores with gratuitous indexes.
While consecutive writes are pretty terrible in IDB, reads are great. Good performance across 7,700 entries should be quite attainable.
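To make that concrete, here's a rough sketch of what a value-free lookup over an inverted index might look like; the store and index names ('postings', 'term') are made up for the example, and these APIs may still sit behind the experimental flag mentioned above:

// Assumed schema: object store 'postings' with auto-incrementing keys and
// an index 'term' over the stemmed term of each posting.
var open = indexedDB.open('search', 1);
open.onupgradeneeded = function () {
  var store = open.result.createObjectStore('postings', { autoIncrement: true });
  store.createIndex('term', 'term');
};
open.onsuccess = function () {
  var index = open.result
    .transaction('postings')
    .objectStore('postings')
    .index('term');
  // Key-only cursor: walks (key, primaryKey) pairs without loading values.
  var req = index.openKeyCursor(IDBKeyRange.only('run'));
  req.onsuccess = function () {
    var cursor = req.result;
    if (!cursor) return;               // done
    console.log(cursor.primaryKey);    // record id for this posting
    cursor.continue();                 // or cursor.continuePrimaryKey(...) to skip ahead
  };
};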

Rough Unicode -> Language Code without CLDR?

I am writing a dictionary app. If a user types a Unicode character, I want to check which language the character belongs to.
e.g.
字 - returns ['zh', 'ja', 'ko']
العربية - returns ['ar']
a - returns ['en', 'fr', 'de'] //and many more
й - returns ['ru', 'be', 'bg', 'uk']
I searched and found that it could be done with CLDR https://stackoverflow.com/a/6445024/41948
Or Google API Python - can I detect unicode string language code?
But in my case:
Looking up a large charmap DB seems to cost a lot of storage and memory
Calling an API is too slow, and it requires a network connection
It doesn't need to be very accurate; about an 80% correct ratio is acceptable
Simple and fast is the main requirement
It's OK to cover just the UCS-2 BMP characters.
Any tips?
I need to use this in Python and Javascript. Thanks!
Would it be sufficient to narrow the glyph down to language families? If so, you could create a set of ranges (language -> code range) based on the mapping of the BMP like the one shown at http://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane or the Scripts section of the Unicode charts page - http://www.unicode.org/charts/
Reliably determining the parent language for glyphs is definitely more complicated because of the number of shared symbols. If you only need 80% accuracy, you could potentially adjust your ranges for certain languages to intentionally include or leave out certain characters, if that simplifies your ranges.
Edit: I re-read the question you referenced CLDR from, and the first answer regarding code -> language mapping. I think that direction is definitely out of the question, but the reverse seems feasible, if a bit computationally expensive. With clever data structuring, you could identify language families first and then drill down to the actual language ranges from there, reducing traversals through irrelevant language -> range pairs.
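As a rough illustration of the range-based approach (the ranges and language lists below are hand-picked examples, not a complete mapping):

// Hand-picked script ranges; a real table would cover far more of the BMP.
var SCRIPT_RANGES = [
  { from: 0x0400, to: 0x04FF, langs: ['ru', 'be', 'bg', 'uk'] }, // Cyrillic
  { from: 0x0600, to: 0x06FF, langs: ['ar', 'fa', 'ur'] },       // Arabic
  { from: 0x3040, to: 0x309F, langs: ['ja'] },                   // Hiragana
  { from: 0x4E00, to: 0x9FFF, langs: ['zh', 'ja', 'ko'] }        // CJK Unified Ideographs
];

function guessLanguages(ch) {
  var cp = ch.charCodeAt(0); // BMP only, per the question's constraints
  for (var i = 0; i < SCRIPT_RANGES.length; i++) {
    var r = SCRIPT_RANGES[i];
    if (cp >= r.from && cp <= r.to) return r.langs;
  }
  return [];
}

guessLanguages('й'); // ['ru', 'be', 'bg', 'uk']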
If the number of languages is relatively small (or the number you care about is fairly small), you could use a Bloom filter for each language. Bloom filters let you do very cheap membership tests (which can have false positives) without having to store all of the members (in this case the code points) in memory. Then you build your result set by checking the code point against each language's preconstructed filter. It's tuneable - if you get too many false positives, you can use a larger size filter, at the cost of memory.
There are Bloom filter implementations for Python and Javascript. (Hey - I've met the guy who did this one! http://www.jasondavies.com/bloomfilter/)
Bloom filters: http://en.m.wikipedia.org/wiki/Bloom_filter
Doing a bit more reading: if you only need the BMP (65,536 code points), you could just store a straight bit set for each language, or a 2D bit array of language x code point.
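For the bit-set variant, a minimal sketch (one 65,536-bit set per language is only 8 KB; the Cyrillic fill below is purely illustrative):

// One bit per BMP code point: 65,536 bits = 8 KB per language.
function BmpBitSet() {
  this.bits = new Uint8Array(65536 / 8);
}
BmpBitSet.prototype.add = function (cp) {
  this.bits[cp >> 3] |= 1 << (cp & 7);
};
BmpBitSet.prototype.has = function (cp) {
  return (this.bits[cp >> 3] & (1 << (cp & 7))) !== 0;
};

var russian = new BmpBitSet();
for (var cp = 0x0400; cp <= 0x04FF; cp++) russian.add(cp); // illustrative fill
russian.has('й'.charCodeAt(0)); // true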
How many languages do you want to consider?

C.S. Basics: Understanding Data Packets, Protocols, Wireshark

The Quest
I'm trying to talk to an SRCDS server from node.js via the RCON protocol.
The RCON protocol seems to be explained well enough; implementations can be found at the bottom of the site in every major programming language. Using those is simple enough, but understanding the protocol and developing a JS library is what I set out to do.
Background
Being a self-taught programmer, I skipped a lot of computer science basics and learned only what I needed to accomplish what I wanted. I started coding with PHP, eventually wrapped my head around OO, talked to databases, etc. I'm currently programming with JavaScript, more specifically doing web stuff with node.js.
Binary Data?!?!
I've read and understood the absolute binary basics. But when it comes to the packet data I'm totally lost. I'd like to read and understand the Wireshark output, but I can't make any sense of it. My biggest problem is probably that I don't understand what the binary representations of the various INT and STRING (char ...) types look like from JS, and how I convert the data I get from the server into something usable in the program.
Help
So I'd be more than grateful if someone can point me to a tutorial on these topics. Tutorial as in "explanation that mere mortals can understand, preferably not written by a C.S. professor". :)
When I'm looking at the PHP reference implementation I see (too much) magic happening there which I can't translate to JS. Sending and reading data from a socket is no problem, but I need to know how PHP's unpack function works and how I can do the same in JS with node.js.
So I hope you can see what I'm trying to accomplish here. First and foremost is understanding the whole theory needed to make implementing the protocol a breeze. But because I'm only good with scripting languages, it would be incredibly helpful if someone could guide me a bit through the HOWTO part in PHP/JS.
Thank you so much for your time!
I applaud the low level protocol pursuit.
I'll tell you the path I took. My approach was to use the client and server that already spoke the protocol and use libpcap to do analysis. I created a library that was able to unpack the custom protocol I was analyzing during this phase.
It's super helpful to start with diagrams like the TCP header layout from the wiki on TCP. It's an incredibly useful way to visualize the structure of the binary data. It's tightly packed, so slicing it apart requires attention to detail.
Buffers and Binary
I read up on Buffer. It's the way you deal with binary in node: http://nodejs.org/docs/v0.4.8/api/buffers.html -- the first thing to realize here is that buffers can be accessed byte by byte via array syntax, i.e. buffer[0] and such.
Visualization
It's helpful to be able to dump your binary data into a hex representation. I used https://github.com/a2800276/hexy.js to achieve this.
node_pcap
I grabbed https://github.com/mranney/node_pcap -- this is the equivalent of wireshark, but you can programmatically poke at all outgoing and incoming traffic. I added udp payload support: https://github.com/jmoyers/node_pcap/commit/2852a8123486339aa495ede524427f6e5302326d
I read through all of mranney's "unpack" code https://github.com/mranney/node_pcap/blob/master/pcap.js#L116-171
I found https://github.com/rmustacc/node-ctype
I read through all their "unpack" code https://github.com/rmustacc/node-ctype/blob/master/ctio.js
Now, things to remember when you're looking through this stuff: most of the time they're taking a binary Buffer representation and converting it to a native javascript type, like, say, Number or String. They'll use advanced techniques to do so -- bitwise operations like shifts and such. You don't necessarily need to understand all of that.
The key things are:
1) endianness -- the ordering of bytes (network and host byte order can be the reverse of each other), as this pertains to how things are unpacked
2) Javascript's Number representation is quirky -- node-ctype goes into detail in the comments about how the various number types are converted to javascript's Number. Integer, float, double, etc. are all Number in javascript land.
In the end, it's likely fine if you just USE these unpackers for your adventures. I ended up having to unpack things that weren't covered in these libraries, like GUIDs and such, and it was tremendously helpful to study the source.
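For a feel of what such an unpacker looks like, here is a sketch of decoding a Source RCON packet header with Node's Buffer; the field layout follows the published RCON spec, and the read* methods shown come from later Node versions than the v0.4.8 docs linked above:

// RCON packets: three little-endian int32s (size, id, type), then a
// null-terminated ASCII body and one more empty null-terminated string.
function unpackRcon(buf) {
  var size = buf.readInt32LE(0); // byte count of everything after this field
  var id   = buf.readInt32LE(4); // request id, echoed back by the server
  var type = buf.readInt32LE(8); // packet type (auth, exec command, response)
  // size covers id (4) + type (4) + body + two trailing null bytes
  var body = buf.toString('ascii', 12, 12 + size - 10);
  return { size: size, id: id, type: type, body: body };
}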
Isolate the traffic you're looking at
Filter, filter, filter. Target one host. Target one direction. Target one message type. Focus on stripping off data that has a known fixed length first -- often times the header in a protocol is a good place to start. Once you get the header unpacking into a nice json structure from binary, you are well on your way.
After that, it's one field at a time, top to bottom, one message at a time. You can use Buffer#slice and the unpack functions from node-ctype to grab each piece of data in turn.

What does sorting mean in non-alphabetic (i.e., Asian) languages?

I have some code that sorts table columns by object properties. It occurred to me that in Japanese or Chinese (non-alphabetic languages), the strings that are sent to the sort function would be compared the way strings in an alphabetic language would be.
Take for example a list of Japanese surnames:
寿拘 (Suzuki)
松坂 (Matsuzaka)
松井 (Matsui)
山田 (Yamada)
藤本 (Fujimoto)
When I sort the above list via Javascript, the result is:
寿拘 (Suzuki)
山田 (Yamada)
松井 (Matsui)
松坂 (Matsuzaka)
藤本 (Fujimoto)
This is different from the ordering of the Japanese syllabary, which would arrange the list phonetically (the way a Japanese dictionary would):
寿拘 (Suzuki)
藤本 (Fujimoto)
松井 (Matsui)
松坂 (Matsuzaka)
山田 (Yamada)
What I want to know is:
Does one double-byte character really get compared against the other in a sort function?
What really goes on in such a sort?
(Extra credit) Does the result of such a sort mean anything at all? Does the concept of sorting really work in Asian (and other) languages? If so, what does it mean and what should one strive for in creating a compare function for those languages?
ADDENDUM TO SUMMARIZE ANSWERS AND DRAW CONCLUSIONS:
First, thanks to all who contributed to the discussion. This has been very informative and helpful. Special shout-outs to bobince, Lie Ryan, Gumbo, Jeffrey Zheng, and Larry K, for their in-depth and thoughtful analyses. I awarded the check mark to Larry K for pointing me toward a solution my question failed to foresee, but I up-ticked every answer I found useful.
The consensus appears to be that:
Chinese and Japanese character strings are sorted by Unicode code points, and their ordering may be predicated on a rationale that may be in some way intelligible to knowledgeable readers but is not likely to be of much practical value in helping users to find the information they're seeking.
The kind of compare function that would be required to make a sort semantically or phonetically useful is far too cumbersome to consider pursuing, especially since the results would probably be less than satisfactory, and in any case the comparison algorithms would have to be changed for each language. Best just to allow the sort to proceed without even attempting a compare function.
I was probably asking the wrong question here. That is, I was thinking too much "inside the box" without considering that the real question is not how do I make sorting useful in these languages, but how do I provide the user with a useful way of finding items in a list. Westerners automatically think of sorting for this purpose, and I was guilty of that. Larry K pointed me to a Wikipedia article that suggests a filtering function might be more useful for Asian readers. This is what I plan to pursue, as it's at least as fast as sorting, client-side. I will keep the column sorting because it's well understood in Western languages, and because speakers of any language would find the sorting of dates and other numerical-based data types useful. But I will also add that filtering mechanism, which would be useful in long lists for any language.
Does one double-byte character really get compared against the other in a sort function?
The native String type in JavaScript is based on UTF-16 code units, and that's what gets compared. For characters in the Basic Multilingual Plane (which all these are), this is the same as Unicode code points.
The term ‘double-byte’ as in encodings like Shift-JIS has no meaning in a web context: DOM and JavaScript strings are natively Unicode, the original bytes in the encoded page received by the browser are long gone.
Does the result of such a sort mean anything at all?
Little. Unicode code points do not claim to offer any particular ordering... for one, because there is no globally-accepted ordering. Even for the most basic case of ASCII Latin characters, languages disagree (eg. on whether v and w are the same letter, or whether the uppercase of i is I or İ). And CJK gets much gnarlier than that.
The main Unicode CJK Unified Ideographs block happens to be ordered by radical and number of strokes (Kangxi dictionary order), which may be vaguely useful. But use characters from any of the other CJK extension blocks, or mix in some kana, or romaji, and there will be no meaningful ordering between them.
The Unicode Consortium do attempt to define some general ordering rules, but it's complex and not generally attempted at a language level. Systems that really need language-sensitive sorting abilities (eg. OSes, databases) tend to have their own collation schemes.
This is different from the ordering of the Japanese syllabary
Yes. Above and beyond collation issues in general, it's a massively difficult task to handle kanji accurately by syllable, because you have to guess at the pronunciation. JavaScript can't realistically know that by ‘藤本’ you mean ‘Fujimoto’ and not ‘touhon’; this sort of thing requires in-depth built-in dictionaries and still-unreliable heuristics... not the sort of thing you want to build in to a programming language.
You could implement the Unicode Collation Algorithm in Javascript if you want something better than the default JS sort for strings. Might improve some things. Though as the Unicode doc states:
Collation is not uniform; it varies according to language and culture: Germans, French and Swedes sort the same characters differently. It may also vary by specific application: even within the same language, dictionaries may sort differently than phonebooks or book indices. For non-alphabetic scripts such as East Asian ideographs, collation can be either phonetic or based on the appearance of the character.
The Wikipedia article points out that since collation is so tough in non-alphabetic scripts, nowadays the answer is to make it very easy to look up information by entering characters, rather than by looking through a list.
I suggest that you talk to truly knowledgeable end users of your application to see how they would best like it to behave. The problem of ordering Chinese characters is not unique to your application.
Also, if you don't want to implement collation in your system, another solution would be for you to create an Ajax service that stores the names in MySQL or another database, then looks up the data with an ORDER BY clause.
Strings are compared character by character where the code point value defines the order:
The comparison of strings uses a simple lexicographic ordering on sequences of code point values. There is no attempt to use the more complex, semantically oriented definitions of character or string equality and collating order defined in the Unicode specification. Therefore strings that are canonically equal according to the Unicode standard could test as unequal. In effect this algorithm assumes that both strings are already in normalised form.
If you need more than this, you will need to use a string comparison that can take collations into account.
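In today's JavaScript (well after this discussion), Intl.Collator is the usual way to get such a collation-aware comparison; support and the set of available collations vary by runtime:

// German collation sorts 'ä' next to 'a'; plain code-point order does not.
var collator = new Intl.Collator('de');
['ä', 'z', 'a'].sort(collator.compare); // ['a', 'ä', 'z']
['ä', 'z', 'a'].sort();                 // ['a', 'z', 'ä'] (code-point order)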
Others have answered the other questions; I will take on this one:
what should one strive for in creating a compare function for those languages?
One way to do it is to create a program that can "read" the characters; that is, one able to map hanzi/kanji characters to their "sound" (pinyin/hiragana reading). At the simplest level, this means a database that maps hanzi/kanji to sounds. Of course this is more difficult than it sounds (pun not intended), since a lot of characters can have different pronunciations in different contexts, and Chinese has many different dialects to consider.
Another way is to order by stroke order. This means there would need to be a database that maps hanzi/kanji to their strokes. Another problem: Chinese and Japanese write with different stroke orders. However, aside from the Japanese/Chinese difference, stroke ordering is much more consistent within a single text, since hanzi/kanji characters are almost always written using the same stroke order irrespective of what they mean or how they are read. A similar idea is to sort by radicals instead of plain stroke order.
The third way is sorting by Unicode code points. This is simple and always gives an indisputably consistent ordering; however, the problem is that the sort order is meaningless to humans.
The last way is to rethink the need for absolute ordering, and just use some heuristic to sort by relevance to the user's need. For example, in shopping cart software, you can sort depending on the user's buying habits or by price. This kinda avoids the problem, but most of the time it works (except if you're compiling a dictionary).
As you notice, the first two methods require creating a huge database of one-to-many mappings, and they still don't always give a useful result. The third method also requires a huge database, but many programming languages already have it built in. The last way is a bit of a heuristic, probably the most useful, but it is doomed to never give consistent ordering (much worse than the first two methods).
Yes, the characters get compared. They are usually compared based on their Unicode code points, though, which are quite different between hiragana and kanji -- making the sort potentially useless in Japanese. (Kanji were borrowed from Chinese, but the order they'd appear in in Chinese doesn't correspond to the order of the hiragana that'd represent the same meaning.) There are collations that could render some of the characters "equal" for comparison purposes, but I don't know if there's one that'll consider a kanji to be equivalent to the hiragana that'd comprise its pronunciation -- especially since a character can have a number of different pronunciations.
In Chinese or Korean, or other languages that don't have 3 different alphabets (one of which is quite irregular), it'd probably be less of an issue.
Those are sorted by code point value, ascending. This is certainly meaningless for human readers. It's not impossible to devise a sensible sorting scheme for Japanese, but sorting Chinese characters is hard (partly because we don't necessarily know whether we're looking at Japanese or Chinese), and a lot of programmers punt to this solution.
The normal string comparison functions in many programming languages are designed to ensure that strings can be sorted into a unique order, to allow algorithms like binary search and duplicate-detection to work correctly. To sort data in a fashion meaningful to a human reader, one must know what the data represents. For example, in a list of English movie titles, "El Mariachi" would typically sort under "E", but in a list of Spanish movie titles it would sort under "M". The application will need information beyond that contained in the strings themselves to know how the strings should be sorted.
The answers to Q1 (can you sort) and Q3 (is sort meaningful) are both "yes" for Chinese (from a mainland perspective). For Q2 (how to sort):
All Chinese characters have definite pronunciation (some are polyphonic) as defined in pinyin, and it's far more common (as in virtually all Chinese dictionaries) to sort by pinyin, where there is no ambiguity. Characters with the same pronunciation are then sorted by stroke order.
The polyphonic characters pose an extra challenge for sorting, as their pinyin usually depends on the word they appear in (I've heard Japanese characters can be even hairier). For example, the character 阿 is pronounced a(1) in 阿姨 (tone in parentheses) and e(1) in 阿胶. So if you need to sort words or sentences, you cannot simply look at one character at a time from each item.
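In modern runtimes (long after this answer was written), the pinyin collation described here is exposed through Intl.Collator where the ICU data includes it; a quick hedged sketch:

// Sorts by pinyin: běijīng (B) < guǎngzhōu (G) < shànghǎi (S).
var pinyin = new Intl.Collator('zh-u-co-pinyin');
['上海', '北京', '广州'].sort(pinyin.compare); // ['北京', '广州', '上海']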
Recall that in JavaScript, you can pass into sort() a function in which you can implement sort yourself, in order to achieve a sort that matters to humans:
myarray.sort(function (a, b) {
  // Return a negative number, zero, or a positive number based on the
  // comparison of the two strings; localeCompare applies a locale-aware
  // collation where the runtime supports it.
  return a.localeCompare(b);
});

Creating and parsing huge strings with javascript?

I have a simple piece of data that I'm storing on a server, as a plain string. It is kind of ridiculous, but it looks like this:
name|date|grade|description|name|date|grade|description|repeat for a long time
This string can be up to 1.4 MB in size. The idea is that it's a bunch of student records, just strung together with a simple pipe delimiter. It's a very poor serialization method.
Once this massive string is pushed to the client, it is split along the pipes into student records again, using javascript.
I've been timing how long it takes to create, and split, these strings on the client side. The times are actually quite good, the slowest run I've seen on a few different machines is 0.2 seconds for 10,000 'student records', which has a final string size of ~1.4mb.
I realize this is quite bizarre, just wondering if there are any inherent problems with creating and splitting such large strings using javascript? I don't know how different browsers implement their javascript engines. I've tried this on the 'major' browsers, but don't know how this would perform on earlier versions of each.
Yeah looking for any comments on this, this is more for fun than anything else!
Thanks
String splitting of 1.4 MB of data is not a problem for decent machines; instead, you should worry about your users' internet connection speed. I've tried to do spell check with an 800 KB dictionary (half the size of your data), and the main issue was loading time.
But it looks like your student records could be put in a database, and you might not need to load everything at load time. So how about paginating the user records, or using Ajax requests to search for certain user names?
If it's a really large string, it may pay to continuously slice the string with 'string'.slice(from, to) to only process a smaller subset, appending the individual items to the end of the output with list.push() or something similar.
String split methods are probably the most efficient way of doing this, though, even in IE. Processing individual characters using string.charAt(x) is extremely slow and will often trigger the browser's unresponsive-script warning as it stalls the browser. Using string split methods would certainly be much faster than splitting using regular expressions.
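For what it's worth, a minimal sketch of rebuilding the records after a single split (assuming every record is exactly the four fields from the question and no field contains a pipe):

function parseRecords(payload) {
  var fields = payload.split('|');
  var records = [];
  // Walk the flat field list four entries at a time.
  for (var i = 0; i + 3 < fields.length; i += 4) {
    records.push({
      name: fields[i],
      date: fields[i + 1],
      grade: fields[i + 2],
      description: fields[i + 3]
    });
  }
  return records;
}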
It may also be possible to encode the data as a JSON array; some newer browsers such as IE8/WebKit/FF3.5 have fast JSON parsing built in via JSON.parse(data). But using eval(JSON) may overwhelm the browser if there's enough data, so it's probably a bad idea. It may pay to compare performance, though.
A much better approach in a lot of cases is to use AJAX and only load some of the data at once from the server, which would also save download time.
Besides S. Mark's excellent comments about local vs. transfer speed and the tip to re-encode using AJAX, I suggest a (long-term) move away from JavaScript in the browser (assuming that's where it runs) to either a non-browser implementation of JS (or possibly another language).
Browser-based JS seems a weak link in a data-transfer chain and nothing I would want to run unmonitored, since browsers are upgraded from time to time and breaking your JS transfer might be an unanticipated side effect!
