proper algo to convert strings to integers while keeping semantic value


I've been trying to convert natural language strings into integers for use in a long short-term memory (LSTM) neural network. I tried converting to binary, using a bag of words, and an associative array with each letter corresponding to a prime number.
I looked into Google's word2vec just to convert the words into word-vectors, but I'm looking for something I can implement in the browser. This is why I am looking for an algorithm that I can write in js.
I know there are Node.js implementations of word2vec, but they just run word2vec on the command line.
This is different from the question I asked earlier because I am looking for something that retains semantic meaning. I thought about using word similarity techniques, but didn't know how to implement Resnik similarity in JS.
I greatly appreciate any help or direction in converting natural-language sentences, or just the topic of them, to word vectors or an array of ints.
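For reference, the word-to-index lookup table that a bag-of-words representation is built on can be sketched in a few lines of plain JavaScript. This only illustrates turning words into integers in the browser; it does not retain semantic meaning, and the names `buildVocabulary`, `tokenize` and `encode` are made up for this example.

```javascript
// Very naive tokenizer: lowercase, then split on anything that is not a-z or 0-9.
function tokenize(sentence) {
  return sentence.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Assign each distinct word an integer index, in order of first appearance.
function buildVocabulary(sentences) {
  var vocab = {};
  var next = 0;
  sentences.forEach(function (sentence) {
    tokenize(sentence).forEach(function (word) {
      if (!Object.prototype.hasOwnProperty.call(vocab, word)) {
        vocab[word] = next++;
      }
    });
  });
  return vocab;
}

// Encode a sentence as an array of word indices; unknown words become -1.
function encode(sentence, vocab) {
  return tokenize(sentence).map(function (word) {
    return Object.prototype.hasOwnProperty.call(vocab, word) ? vocab[word] : -1;
  });
}

var vocab = buildVocabulary(["the cat sat", "the dog sat down"]);
var ints = encode("The dog sat", vocab);
console.log(ints); // [0, 3, 2]
```

Indices assigned this way are arbitrary labels: words with nearby indices are not related in meaning, which is exactly the limitation the question describes.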

Related

converting LARGE string to an integer [duplicate]

How do I parse a 20-digit number using JavaScript and jQuery?

A 20-digit number is generally too big for a JavaScript native numeric type, so you'll need to find a "big number" package to use. Here's one that I've seen mentioned on Stack Overflow and that looks interesting: http://silentmatt.com/biginteger/
That one is just integers. If you need decimal fractions too, you'll have to find something else.
If you just want to check that a 20-digit string looks like a number, and you don't need to know the value, you can do that with a regular expression.
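A minimal sketch of that regular-expression check, assuming "looks like a number" means exactly 20 decimal digits with no sign or separators (the function name is made up for this example):

```javascript
// Returns true only if the input is a string of exactly 20 decimal digits.
function looksLikeTwentyDigitNumber(s) {
  return typeof s === "string" && /^\d{20}$/.test(s);
}

console.log(looksLikeTwentyDigitNumber("12345678901234567890")); // true
console.log(looksLikeTwentyDigitNumber("1234567890123456789x")); // false
console.log(looksLikeTwentyDigitNumber("123"));                  // false
```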

You can't have a 20-digit number in JavaScript: it will be converted to floating point and lose its low-order digits.
You can keep 20 digits (or 200 or 2000) intact as a string, or as an array of digits, but to do any math on it you need a big integer object or library.
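The difference can be seen directly. This is a small sketch with an arbitrary 20-digit value; it assumes an engine that provides `Number.isSafeInteger` (ES2015 or later).

```javascript
var digits = "12345678901234567890";

// Converting to a Number goes through a 64-bit float, so the low digits are lost.
var asNumber = Number(digits);
console.log(String(asNumber) === digits);    // false
console.log(Number.isSafeInteger(asNumber)); // false

// Kept as a string or as an array of digits, nothing is lost.
var asArray = digits.split("").map(Number);
console.log(asArray.length);                 // 20
console.log(asArray.join("") === digits);    // true
```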

Normally you use Number() or parseInt("", [radix]) to parse a string into a real number.
I am guessing you are asking about what happens when the string you parse is above the integer threshold. In this case it greatly depends on what you are trying to accomplish.
There are some libraries that allow working with big numbers, such as https://silentmatt.com/biginteger/, which is mentioned in another answer (I did not test it, but it looks OK). Also try searching for BigInt JavaScript or BigMath.
In short: working with VERY large numbers or exact decimals is a challenge in every programming language and often requires very specific mathematical libraries, which are less convenient and often a lot slower than working in the "normal" (int/long) range. That is obviously not an issue when you REALLY need those big numbers.
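As a sketch of one way to handle the threshold without a library: engines that support the native `BigInt` type (a later addition to the language, ES2020) can fall back to it when a value is too large for an exact Number. The helper name `parseInteger` is made up for this example.

```javascript
// Parse a string of decimal digits.
// Returns a Number when the value is exactly representable, otherwise a BigInt.
function parseInteger(text) {
  if (!/^-?\d+$/.test(text)) {
    throw new Error("not a decimal integer: " + text);
  }
  var n = Number(text);
  return Number.isSafeInteger(n) ? n : BigInt(text);
}

console.log(typeof parseInteger("42"));                   // "number"
console.log(typeof parseInteger("12345678901234567890")); // "bigint"
console.log(parseInteger("12345678901234567890") + 1n);   // 12345678901234567891n
```

Returning two different types keeps the common case fast but means callers must not mix the results in arithmetic; always returning a BigInt is the simpler design if every value may be large.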

'+' or parseInt() - which one is efficient to convert string to integer in javascript

In JavaScript code, I have seen people using the '+' character to convert a string to an integer, as in:
var i = +"2";
Another way is to use the parseInt() function, as follows:
var i = parseInt("2");
I want to know which one is more efficient.
Sorry, I should also add that I am dealing with integers only and the data is huge, so even a little difference in performance would be good for me.

It depends on the browser.
I've created a nasty little test case for some string-to-number conversion possibilities I know.
I've also added possibilities to convert to floating-point numbers, since in JavaScript Numbers are Numbers, no matter whether they have a fractional part or not.
Check it out. Corrections and suggestions appreciated!
As some other folks said in the comments below the question: I also think it's better not to think too much about it, but to focus on readability.

Long story short, don't worry about it; use whatever is more convenient for you and the actual case. Micro-optimizations like this are useless. I'd say just remember that you might need to pass the radix parameter to parseInt if your number is (or looks) octal or in some other format.
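Beyond speed, the two conversions do not behave the same on every input, which may matter more than performance. A short sketch of the differences, including the radix parameter:

```javascript
// Plain digits: both give the same result.
console.log(+"2");                 // 2
console.log(parseInt("2", 10));    // 2

// Trailing non-digits: unary plus rejects the whole string,
// parseInt stops at the first character it cannot use.
console.log(+"12px");              // NaN
console.log(parseInt("12px", 10)); // 12

// Empty string: unary plus gives 0, parseInt gives NaN.
console.log(+"");                  // 0
console.log(parseInt("", 10));     // NaN

// Radix: without it, a "0x" prefix is read as hexadecimal.
console.log(parseInt("0x1A"));     // 26
console.log(parseInt("0x1A", 10)); // 0
console.log(parseInt("11", 2));    // 3
```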

Use of letters for doing matrix math in Javascript

I'm doing a course in Quantum Computation. In it, we represent possible actions, or operators, by matrices. I've been looking into creating a webpage for solving these maths problems.
It is also a small challenge for myself in order to freshen up my JS.
I've been looking at various options, like Sylvester, MathJax and MathML.
Problem: none of the above appear to provide functionality for using letters throughout my computation.
For instance, in Quantum Computation we often multiply a matrix containing unknowns alpha and beta with other matrices.
This is the sort of math I need to do:
http://i.stack.imgur.com/vH9Dk.gif
Ideally, I'd write this in the style of:
M=[[a],[b]], which of course, I cannot. Further, I'd like to be able to multiply to get "2*a" etc.
Any suggestions?

As suggested in the comments on the question, you could use strings. Then you just have to write your own matrix-matrix multiplication routine which will understand the difference between an entry containing a string and an entry containing a number.
However, as soon as you do more than one of these, you'll end up with expressions as well as variables and numbers. So we can generalise this to make every element be an expression. This is the beginnings of a symbolic algebra system, as @High Performance Mark pointed out.
In JavaScript, I would guess that you want a set of expression objects, each implementing an interface including a method that returns whether the expression is determined or not yet. The gnarly bit is simplifying the resulting expressions to resolve the values of the variables.
Alternatively, do a bit more maths beforehand; move the variables out of the equations, and then let the code do the calculation.
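The string-based approach from the first paragraph can be sketched as follows. This is only an illustration under simple assumptions: entries are either numbers or strings, the only simplifications are for 0 and 1, and the names `mulEntry`, `addEntry` and `multiply` are made up for this example rather than taken from any library.

```javascript
// Multiply two entries; an entry is either a number or a string such as "a".
function mulEntry(x, y) {
  if (typeof x === "number" && typeof y === "number") return x * y;
  if (x === 0 || y === 0) return 0;
  if (x === 1) return y;
  if (y === 1) return x;
  return x + "*" + y;
}

// Add two entries; sums of symbols are parenthesised so later products stay correct.
function addEntry(x, y) {
  if (typeof x === "number" && typeof y === "number") return x + y;
  if (x === 0) return y;
  if (y === 0) return x;
  return "(" + x + " + " + y + ")";
}

// Ordinary row-by-column matrix product, using the two helpers above.
function multiply(A, B) {
  var C = [];
  for (var i = 0; i < A.length; i++) {
    C.push([]);
    for (var j = 0; j < B[0].length; j++) {
      var sum = 0;
      for (var k = 0; k < B.length; k++) {
        sum = addEntry(sum, mulEntry(A[i][k], B[k][j]));
      }
      C[i].push(sum);
    }
  }
  return C;
}

var M = [["a"], ["b"]];
console.log(multiply([[0, 1], [1, 0]], M));  // [["b"], ["a"]]
console.log(multiply([[2, 0], [0, 2]], M));  // [["2*a"], ["2*b"]]
console.log(multiply([[1, 1], [1, -1]], M)); // [["(a + b)"], ["(a + -1*b)"]]
```

The last result shows where the real work begins: turning "(a + -1*b)" into "a - b", or collecting like terms, needs the expression objects and simplification step described above.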

What does sorting mean in non-alphabetic (i.e, Asian) languages?

I have some code that sorts table columns by object properties. It occurred to me that in Japanese or Chinese (non-alphabetical languages), the strings that are sent to the sort function would be compared the way an alphabetical language would.
Take for example a list of Japanese surnames:
寿拘 (Suzuki)
松坂 (Matsuzaka)
松井 (Matsui)
山田 (Yamada)
藤本 (Fujimoto)
When I sort the above list via Javascript, the result is:
寿拘 (Suzuki)
山田 (Yamada)
松井 (Matsui)
松坂 (Matsuzaka)
藤本 (Fujimoto)
This is different from the ordering of the Japanese syllabary, which would arrange the list phonetically (the way a Japanese dictionary would):
寿拘 (Suzuki)
藤本 (Fujimoto)
松井 (Matsui)
松坂 (Matsuzaka)
山田 (Yamada)
What I want to know is:
Does one double-byte character really get compared against the other in a sort function?
What really goes on in such a sort?
(Extra credit) Does the result of such a sort mean anything at all? Does the concept of sorting really work in Asian (and other) languages? If so, what does it mean and what should one strive for in creating a compare function for those languages?
ADDENDUM TO SUMMARIZE ANSWERS AND DRAW CONCLUSIONS:
First, thanks to all who contributed to the discussion. This has been very informative and helpful. Special shout-outs to bobince, Lie Ryan, Gumbo, Jeffrey Zheng, and Larry K, for their in-depth and thoughtful analyses. I awarded the check mark to Larry K for pointing me toward a solution my question failed to foresee, but I up-ticked every answer I found useful.
The consensus appears to be that:
Chinese and Japanese character strings are sorted by Unicode code points, and their ordering may be predicated on a rationale that may be in some way intelligible to knowledgeable readers but is not likely to be of much practical value in helping users to find the information they're seeking.
The kind of compare function that would be required to make a sort semantically or phonetically useful is far too cumbersome to consider pursuing, especially since the results would probably be less than satisfactory, and in any case the comparison algorithms would have to be changed for each language. Best just to allow the sort to proceed without even attempting a compare function.
I was probably asking the wrong question here. That is, I was thinking too much "inside the box" without considering that the real question is not how do I make sorting useful in these languages, but how do I provide the user with a useful way of finding items in a list. Westerners automatically think of sorting for this purpose, and I was guilty of that. Larry K pointed me to a Wikipedia article that suggests a filtering function might be more useful for Asian readers. This is what I plan to pursue, as it's at least as fast as sorting, client-side. I will keep the column sorting because it's well understood in Western languages, and because speakers of any language would find the sorting of dates and other numerical-based data types useful. But I will also add that filtering mechanism, which would be useful in long lists for any language.
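A filtering mechanism of the kind described in the last point can be very small. This is a minimal sketch assuming plain substring matching on the displayed text; the function name `filterItems` is made up for this example.

```javascript
// Return the items whose text contains the query as a substring.
// An empty query matches everything.
function filterItems(items, query) {
  if (!query) return items.slice();
  return items.filter(function (item) {
    return item.indexOf(query) !== -1;
  });
}

var surnames = ["寿拘", "松坂", "松井", "山田", "藤本"];
console.log(filterItems(surnames, "松")); // ["松坂", "松井"]
console.log(filterItems(surnames, ""));   // all five names
```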

Does one double-byte character really get compared against the other in a sort function?
The native String type in JavaScript is based on UTF-16 code units, and that's what gets compared. For characters in the Basic Multilingual Plane (which all these are), this is the same as Unicode code points.
The term ‘double-byte’ as in encodings like Shift-JIS has no meaning in a web context: DOM and JavaScript strings are natively Unicode, the original bytes in the encoded page received by the browser are long gone.
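To see what is actually being compared, the code units can be printed next to each name. A small sketch using the surnames from the question:

```javascript
var surnames = ["寿拘", "松坂", "松井", "山田", "藤本"];

// Print the UTF-16 code units of each string in hexadecimal.
surnames.forEach(function (name) {
  var units = [];
  for (var i = 0; i < name.length; i++) {
    units.push(name.charCodeAt(i).toString(16));
  }
  console.log(name, units.join(" "));
});

// sort() with no comparator orders strings by those code units.
var sorted = surnames.slice().sort();
console.log(sorted); // ["寿拘", "山田", "松井", "松坂", "藤本"]
```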
Does the result of such a sort mean anything at all?
Little. Unicode code points do not claim to offer any particular ordering... for one, because there is no globally-accepted ordering. Even for the most basic case of ASCII Latin characters, languages disagree (eg. on whether v and w are the same letter, or whether the uppercase of i is I or İ). And CJK gets much gnarlier than that.
The main Unicode CJK Unified Ideographs block happens to be ordered by radical and number of strokes (Kangxi dictionary order), which may be vaguely useful. But use characters from any of the other CJK extension blocks, or mix in some kana, or romaji, and there will be no meaningful ordering between them.
The Unicode Consortium do attempt to define some general ordering rules, but it's complex and not generally attempted at a language level. Systems that really need language-sensitive sorting abilities (eg. OSes, databases) tend to have their own collation schemes.
This is different from the ordering of the Japanese syllabary
Yes. Above and beyond collation issues in general, it's a massively difficult task to handle kanji accurately by syllable, because you have to guess at the pronunciation. JavaScript can't realistically know that by ‘藤本’ you mean ‘Fujimoto’ and not ‘touhon’; this sort of thing requires in-depth built-in dictionaries and still-unreliable heuristics... not the sort of thing you want to build in to a programming language.

You could implement the Unicode Collation Algorithm in Javascript if you want something better than the default JS sort for strings. Might improve some things. Though as the Unicode doc states:
Collation is not uniform; it varies according to language and culture: Germans, French and Swedes sort the same characters differently. It may also vary by specific application: even within the same language, dictionaries may sort differently than phonebooks or book indices. For non-alphabetic scripts such as East Asian ideographs, collation can be either phonetic or based on the appearance of the character.
The Wikipedia article points out that since collation is so tough in non-alphabetic scripts, nowadays the answer is to make it very easy to look up information by entering characters, rather than by looking through a list.
I suggest that you talk to truly knowledgeable end users of your application to see how they would best like it to behave. The problem of ordering Chinese characters is not unique to your application.
Also, if you don't want to implement the collation in your system, another solution would be for you to create an Ajax service that stores the names in a MySQL or other database, then looks up the data with an ORDER BY clause.

Strings are compared character by character where the code point value defines the order:
The comparison of strings uses a simple lexicographic ordering on sequences of code point value values. There is no attempt to use the more complex, semantically oriented definitions of character or string equality and collating order defined in the Unicode specification. Therefore strings that are canonically equal according to the Unicode standard could test as unequal. In effect this algorithm assumes that both strings are already in normalised form.
If you need more than this, you will need to use a string comparison that can take collations into account.
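In current engines, one comparison that takes collations into account is `Intl.Collator` from the ECMAScript Internationalization API. That API is not mentioned in the answers above and is assumed to be available here; the sketch below uses a German example because its result is easy to check, and makes no claim about the exact order produced for kanji or hanzi.

```javascript
var words = ["z", "\u00e4", "a"]; // "\u00e4" is the letter a with umlaut

// Default sort: plain code unit order, so the umlaut lands after "z".
console.log(words.slice().sort());

// Locale-aware sort: the umlaut is placed next to "a" under German rules.
var german = new Intl.Collator("de");
console.log(words.slice().sort(german.compare));

// The same API accepts other locales such as "ja" or "zh". The resulting order
// of kanji/hanzi depends on the collation data shipped with the engine.
var japanese = new Intl.Collator("ja");
console.log(["松井", "山田", "藤本"].sort(japanese.compare));
```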

Others have answered the other questions, I will take on this one:
what should one strive for in creating a compare function for those languages?
One way to do it is to create a program that can "read" the characters; that is, one able to map hanzi/kanji characters to their "sound" (pinyin/hiragana reading). At the simplest level, this means a database that maps hanzi/kanji to sounds. Of course this is more difficult than it sounds (pun not intended), since a lot of characters have different pronunciations in different contexts, and Chinese has many different dialects to consider.
Another way is to order by stroke order. This means there would need to be a database that maps hanzi/kanji to their strokes. Another problem: Chinese and Japanese are written with different stroke orders. However, aside from the difference between Japanese and Chinese, stroke ordering is much more consistent within a single text, since hanzi/kanji characters are almost always written using the same stroke order irrespective of what they mean or how they are read. A similar idea is to sort by radicals instead of plain stroke order.
The third way is sorting by Unicode code points. This is simple and always gives an indisputably consistent ordering; however, the problem is that the sort order is meaningless to humans.
The last way is to rethink the need for absolute ordering, and just use some heuristic to sort by relevance to the user's needs. For example, in shopping cart software, you can sort depending on the user's buying habits or by price. This kind of avoids the problem, but most of the time it works (except if you're compiling a dictionary).
As you notice, the first two methods require creating a huge database of one-to-many mappings, but they still don't always give a useful result. The third method also requires a huge database, but many programming languages already have this database built in. The last way is a bit of a heuristic, probably the most useful, but it is doomed to never give a consistent ordering (much worse than the first two methods).
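The first approach, sorting by reading, can be sketched with a tiny lookup table. The table below is hypothetical and hand-written from the romanizations in the question; a real one would be the large, context-dependent database described above. Comparing hiragana by code unit only approximates dictionary order (voiced and small kana are interleaved), which is enough for this illustration.

```javascript
// Hypothetical hand-written table: surname -> hiragana reading.
var readings = {
  "寿拘": "すずき",
  "松坂": "まつざか",
  "松井": "まつい",
  "山田": "やまだ",
  "藤本": "ふじもと"
};

// Compare two names by their readings; names without a reading sort last.
function compareByReading(a, b) {
  var ra = readings[a], rb = readings[b];
  if (ra === undefined && rb === undefined) return a < b ? -1 : a > b ? 1 : 0;
  if (ra === undefined) return 1;
  if (rb === undefined) return -1;
  return ra < rb ? -1 : ra > rb ? 1 : 0;
}

var names = ["寿拘", "松坂", "松井", "山田", "藤本"];
var byReading = names.slice().sort(compareByReading);
console.log(byReading); // ["寿拘", "藤本", "松井", "松坂", "山田"]
```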

Yes, the characters get compared. They are usually compared based on their Unicode code points, though, which are quite different between hiragana and kanji, making the sort potentially useless in Japanese. (Kanji are borrowed from Chinese, but the order they'd appear in Chinese doesn't correspond to the order of the hiragana that'd represent the same meaning.) There are collations that could render some of the characters "equal" for comparison purposes, but I don't know if there's one that'll consider a kanji to be equivalent to the hiragana that'd comprise its pronunciation, especially since a character can have a number of different pronunciations.
In Chinese or Korean, or other languages that don't have 3 different alphabets (one of which is quite irregular), it'd probably be less of an issue.

Those are sorted by code point value, ascending. This is certainly meaningless for human readers. It's not impossible to devise a sensible sorting scheme for Japanese, but sorting Chinese characters is hard (partly because we don't necessarily know whether we're looking at Japanese or Chinese), and a lot of programmers punt to this solution.

The normal string comparison functions in many programming languages are designed to ensure that strings can be sorted into a unique order, to allow algorithms like binary search and duplicate-detection to work correctly. To sort data in a fashion meaningful to a human reader, one must know what the data represents. For example, in a list of English movie titles, "El Mariachi" would typically sort under "E", but in a list of Spanish movie titles it would sort under "M". The application will need information beyond that contained in the strings themselves to know how the strings should be sorted.
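The movie-title example can be sketched as a sort key that depends on the language of the list. The article lists and the name `sortKey` are assumptions made for this illustration; the point is only that the language has to be supplied from outside the string.

```javascript
// Hypothetical per-language lists of leading articles to ignore when sorting.
var ARTICLES = {
  en: ["the", "a", "an"],
  es: ["el", "la", "los", "las"]
};

// Build the key a title is sorted under for a given language.
function sortKey(title, lang) {
  var words = title.toLowerCase().split(" ");
  var articles = ARTICLES[lang] || [];
  if (words.length > 1 && articles.indexOf(words[0]) !== -1) {
    words.shift(); // drop the leading article
  }
  return words.join(" ");
}

console.log(sortKey("El Mariachi", "en")); // "el mariachi"
console.log(sortKey("El Mariachi", "es")); // "mariachi"
```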

The answers to Q1 (can you sort) and Q3 (is sort meaningful) are both "yes" for Chinese (from a mainland perspective). For Q2 (how to sort):
All Chinese characters have definite pronunciation (some are polyphonic) as defined in pinyin, and it's far more common (as in virtually all Chinese dictionaries) to sort by pinyin, where there is no ambiguity. Characters with the same pronunciation are then sorted by stroke order.
The polyphonic characters pose an extra challenge for sorting, as their pinyin usually depends on the word they are in (I heard Japanese characters could be even more hairy). For example, the character 阿 is pronounced a(1) in 阿姨 (tone in parentheses), and e(1) in 阿胶. So if you need to sort words or sentences, you cannot simply look at one character at a time from each item.

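Recall that in JavaScript, you can pass sort() a function in which you implement the comparison yourself, in order to achieve a sort that matters to humans:
myarray.sort(function (a, b) {
  // return a negative number, 0, or a positive number based on the comparison of the two strings
  return a < b ? -1 : a > b ? 1 : 0; // placeholder: replace with a comparison that suits your data
});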

GWT's JSONParser producing incorrect values for numbers

I'm working with GWT and parsing a JSON result from an ASP.NET web service method that returns a DataTable. I can parse the result into a JSONValue/JSONObject just fine. The issue I'm having is that one of my columns is a DECIMAL(20, 0) and the values that are getting parsed into JSON aren't exact. To demonstrate without the need for a web service call, in GWT I threw this together:
String jsonString = "{value:4768428229311981600}";
JSONObject jsonObject = JSONParser.parse( jsonString ).isObject();
Window.alert( jsonObject.toString() );
This in turn alerts:
{"value":4768428229311982000}
My understanding is that GWT's JSONParser just uses eval() to do the parsing, so is this some sort of number/precision issue with JavaScript that I've never been aware of? I'll admit I don't work with numbers that much in JavaScript, and I might be able to work around this by changing the .NET web service to return this column as a string, but I'd really rather not do that.
Thanks for any help.

There was a similar question I answered sometime ago - Arbitrary precision in GWT
A more up to date answer is that BigDecimal support looks on track for GWT 2.1
Until then, if you don't need to do calculation with the numbers client side, I recommend passing them around as strings.
Additionally, looking at your example, you could transfer them as strings and maybe use the emulated GWT java.lang.Long.
Last ditch, you can try the svn version of BigDecimal GWT-Math - it shouldn't be all that bad to drop the java files into your jar (It would not need to be compiled, since it's all emul code)
If you go that route, you would still have to pass the numbers as JSON strings, but you could perform meaningful math.

Well, Javascript just uses ordinary IEEE 754 64-bit floating point, so there's an inherent precision limit. The language does not provide support for arbitrary-sized integers (or, really, any pure integer at all). You're going to have to use a string representation when you need to manipulate the values in Javascript, and hopefully you won't have to do any math.
edit: I've looked before at this: http://www-cs-students.stanford.edu/~tjw/jsbn/
It seems like a fairly hairy solution if you don't need to do much manipulating of the numbers, but it might be worth looking at. There may be less ambitious variations on that idea out there.
In any case, that's not going to help you with straight interpretation of JSON unless you also wired up a variant JSON parser to construct numeric values using such a library.
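The same rounding can be reproduced in plain JavaScript, outside GWT, with the value from the question. This sketch uses the standard `JSON.parse` (with the key quoted, as strict JSON requires) and assumes an engine with native `BigInt` for the last step.

```javascript
// As a JSON number, the value passes through a 64-bit float and is rounded.
var asNumber = JSON.parse('{"value":4768428229311981600}').value;
console.log(String(asNumber) === "4768428229311981600"); // false

// As a JSON string, every digit survives the round trip.
var asString = JSON.parse('{"value":"4768428229311981600"}').value;
console.log(asString === "4768428229311981600");         // true

// If arithmetic is needed, convert the string with BigInt (modern engines only).
console.log(BigInt(asString) + 1n);                      // 4768428229311981601n
```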
