Why does NFKC normalize of all digits not work? - javascript

In JavaScript I am using NFKC normalization via String.prototype.normalize to normalize fullwidth to standard ASCII halfwidth characters.
'1'.normalize('NFKC') === '1'
> true
However, looking at more obscure digits like ૫ which is the digit 5 in Gujarati it does not normalize.
'૫'.normalize('NFKC') === '5'
> false
What am I missing?

Unicode normalisation is meant for characters that are variants of each other, not for every set of characters that might have similar meanings.
The character ‘1’ (FULLWIDTH DIGIT ONE) is essentially just the character ‘1’ (DIGIT ONE) with slightly different styling and would not have been encoded if it was not necessary for compatibility. They are – in some contexts – completely interchangeable, so the former was assigned a decomposition mapping to the latter. The character ‘૫’ (GUJARATI DIGIT FIVE) does not have a decomposition mapping because it is not a variant of any other character; it is its own distinct thing.
You can consult the Unicode Character Database to see which characters decompose and which (i.e. most of them) don’t. The link to the tool you posted as part of your question shows you for example that ૫ does not change under any form of Unicode normalisation.

You are looking the wrong problem.
Unicode is main purpose is about encoding characters (without loosing information). Fonts and other programs should be able to interpret such characters and give a glyph (according combination code point, nearby characters, and other characteristics outside code points [like language, epoch, font characteristic [script and non-script, uppercase, italic, etc changes how to combine characters and ligature (and also glyph form).
There are two main normalization (canonical and compatible) [and two variant: decomposed, and composed when possible]. Canonical normalization remove unneeded characters (repetition) and order composing characters in a standard way. Compatible normalization remove "compatible characters": characters that are in Unicode just not to lose information on converting to and from other charset.
Some digits (like small 2 exponent) have compatible character as normal digit (this is a formatting question, unicode is not about formatting). But on the other cases, digits in different characters should be keep ad different characters.
That was about normalization.
But you want to get the numeric value of a unicode character (warning: it could depends on other characters, position, etc.).
Unicode database provides also such property.
With Javascript, you may use unicode-properties javasript package, which provide you also the function getNumericValue(codePoint). This packages seems to use efficient compression of database, but so I don't know how fast it could be. The database is huge.

The NFKC which you are using here stands for Compatibility Decomposition followed by Canonical Composition, which in trivial english means first break things to smaller more often used symbols and then combine them to find the equivalent simpler character. For example 𝟘->0, fi->fi (Codepoint fi=64257).
It does not do conversion to ASCII, for example in ख़(2393)-> ख़([2326, 2364])
Reference:https://unicode.org/reports/tr15/#Norm_Forms
For simpler understanding:https://towardsdatascience.com/difference-between-nfd-nfc-nfkd-and-nfkc-explained-with-python-code-e2631f96ae6c

Related

Find matches between given data

I have seen that sort function in javascript transforms every letter of a word in ascii code to permit the comparison between words when an alphabetical sort is required.How this function manage to find asciI code for every letters?Does It slide a list for every letters?
What is the method with the function associate ascii code to a letter?
Thank you so much for help:)
Abstractly, characters are numbers in that what we think of as a character "a" and the number "97" are both a byte "‭01100001‬".
It is a lot more nuanced than that as unicode supports multi-byte characters and numbers in javascript are multi-byte floating points but the concept holds at a high level.
An "encoding" such as ASCII, WE8DEC, or one of the Unicode flavors is essentially a way to map a (set of) byte to what we think of as a character.
So, if numbers can be sorted, then so too can characters and thus strings.
You might also be interested in this post that explains the native sorting rules : How does JavaScript decide sort order with characters from different character sets?

How well is Node.js' support for Unicode?

According to its language specification JavaScript has some problems with Unicode (if I understand it correctly), as text is always handled as one character consisting of 16 bits internally.
JavaScript: The Good Parts speaks out in a similar way.
When you search Google for V8's support of UTF-8, you get contradictory statements.
So: What is the state of Unicode support in Node.js (0.10.26 was the current version when this question was asked)? Does it handle UTF-8 will all possible codepoints correctly, or doesn't it?
If not: What are possible workarounds?
The two sources you cite, the language specification and Crockford's “JavaScript: The Good Parts” (page 103) say the same thing, although the latter says it much more concisely (and clearly, if you already know the subject). For reference I'll cite Crockford:
JavaScript was designed at a time when Unicode was expected to have at most 65,536 characters. It has since grown to have a capacity of more than 1 million characters.
JavaScript's characters are 16 bits. That is enough to cover the original 65,536 (which is now known as the Basic Multilingual Plane). Each of the remaining million characters can be represented as a pair of characters. Unicode considers the pair to be a single character. JavaScript thinks the pair is two distinct characters.
The language specification calls the 16-bit unit a “character” and a “code unit”. A “Unicode character”, or “code point”, on the other hand, can (in rare cases) need two 16-bit “code units” to be represented.
All of JavaScript's string properties and methods, like length, substr(), etc., work with 16-bit “characters” (it would be very inefficient to work with 16-bit/32-bit Unicode characters, i.e., UTF-16 characters). E.g., this means that, if you are not careful, with substr() you can leave one half of a 32-bit UTF-16 Unicode character alone. JavaScript won't complain as long as you don't display it, and maybe won't even complain if you do. This is because, as the specification says, JavaScript does not check that the characters are valid UTF-16, it only assumes they are.
In your question you ask
Does [Node.js] handle UTF-8 will all possible codepoints correctly, or doesn't it?
Since all possible UTF-8 codepoints are converted to UTF-16 (as one or two 16-bit “characters”) in input before anything else happens, and vice versa in output, the answer depends on what you mean by “correctly”, but if you accept JavaScript's interpretation of this “correctly”, the answer is “yes”.
For further reading and head-scratching: https://mathiasbynens.be/notes/javascript-unicode
The JavaScript string type is UTF-16 so its Unicode support is 100%. All UTF forms support all Unicode code points.
Here is a general breakdown of the common forms:
UTF-8 - 8-bit code units; variable width (code points are 1-4 code units)
UTF-16 - 16-bit code units; variable width (code points are 1-2 code units); big-endian or little-endian
UTF-32 - 32-bit code units; fixed width; big-endian or little endian
UTF-16 was popularized when it was thought every code point would fit in 16 bits. This was not the case. UTF-16 was later redesigned to allow code points to take two code units and the old version was renamed UCS-2.
However, it turns out that visible widths do not equate very well to memory storage units anyway so both UTF-16 and UTF-32 are of limited utility. Natural language is complex and in many cases sequences of code points combine in surprising ways.
The measurement of width for a "character" depends on context. Memory? Number of visible graphemes? Render width in pixels?
UTF-16 remains in common use because many of today's popular languages/environments (Java/JavaScript/Windows NT) were born in the '90s. It is not broken. However, UTF-8 is usually preferred.
If you are suffering from a data-loss/corruption issue it is usually because of a defect in a transcoder or the misuse of one.

Javascript Regex + Unicode Diacritic Combining Characters`

I want to match this character in the African Yoruba language 'ẹ́'. Usually this is made by combining an 'é' with a '\u0323' under dot diacritic. I found that:
'é\u0323'.match(/[é]\u0323/) works but
'ẹ́'.match(/[é]\u0323/) does not work.
I don't just want to match e. I want to match all combinations. Right now, my solution involves enumerating all combinations. Like so: /[ÁÀĀÉÈĒẸE̩Ẹ́É̩Ẹ̀È̩Ẹ̄Ē̩ÍÌĪÓÒŌỌO̩Ọ́Ó̩Ọ̀Ò̩Ọ̄Ō̩ÚÙŪṢS̩áàāéèēẹe̩ẹ́é̩ẹ̀è̩ẹ̄ē̩íìīóòōọo̩ọ́ó̩ọ̀ò̩ọ̄ō̩úùūṣs̩]/
Could there not be a shorter and thus better way to do this, or does regex matching in javascript of unicode diacritic combining characters not work this easily?
Thank you
Normally the solution would be to use Unicode properties and/or scripts, but JavaScript does not support them natively.
But there exists the lib XRegExp that adds this support. With this lib you can use
\p{L}: to match any kind of letter from any language.
\p{M}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
So your character class would look like this:
[\p{L}\p{M}]+
that would match all possible letters that are in the Unicode table.
If you want to limit it, you can have a look at Unicode scripts and replace \p{L} by a script, they collect all letters from certain languages. e.g. \p{Latin} for all Latin letters or \p{Cyrillic} for all Cyrillic letters.
Usually this is made by combining an 'é' with a '\u0323' under dot diacritic
However, that isn't what you have here:
'ẹ́'
that's not U+0065,U+0323 but U+1EB9,U+0301 - combining an ẹ with an acute diacritic.
The usual solution would be to normalise each string (typically to Unicode Normal Form C) before doing the comparison.
I don't just want to match e. I want to match all combinations
Matching without diacriticals is typically done by normalising to Normal Form D and removing all the combining diacritical characters.
Unfortunately normalisation is not available in JS, so if you want it you would have to drag in code to do it, which would have to include a large Unicode data table. One such effort is unorm. For picking up characters based on Unicode preoperties like being a combining diacritical, you'd also need a regexp engine with support for the Unicode database, such as XRegExp Unicode Categories.
Server-side languages (eg Python, .NET) typically have native support for Unicode normalisation, so if you can do the processing on the server that would generally be easier.

Using charCodeAt() and fromCharCode to obtain Unicode characters (code value > 55349) with JS

I use "".charCodeAt(pos) to get the Unicode number for a strange character, and then String.fromCharCode for the reverse.
But I'm having problems with characters that have a Unicode number greater than 55349. For example, the Blackboard Bold characters. If I want Lowercase Blackboard Bold X (𝕩), which has a Unicode number of 120169, if I alert the code from JavaScript:
alert(String.fromCharCode(120169));
I get another character. The same thing happens if I log an Uppercase Blackboard Bold X (𝕏), which has a Unicode number of 120143, from directly within JavaScript:
s="𝕏";
alert(s.charCodeAt(0))
alert(s.charCodeAt(1))
Output:
55349
56655
Is there a method to work with these kind of characters?
Internally, Javascript stores strings in a 16-bit encoding resembling UCS2 and UTF-16. (I say resembling, since it’s really neither of those two). The fact that they’re 16-bits means that characters outside the BMP, with code points above 65535, will be split up into two different characters. If you store the two different characters separately, and recombine them later, you should get the original character without problem.
Recognizing that you have such a character can be rather tricky, though.
Mathias Bynens has written a blog post about this: JavaScript’s internal character encoding: UCS-2 or UTF-16?. It’s very interesting (though a bit arcane at times), and concludes with several references to code libraries that support the conversion from UCS-2 to UTF-16 and vice versa. You might be able to find what you need in there.

Can I depend on the behavior of charCodeAt() and fromCharCode() to remain the same?

I have written a personal web app that uses charCodeAt() to convert text that is input by the user into the relevant character codes (for example ⊇ is converted to 8839 for storage), which is then sent to Perl, which sends them to MySQL. To retrieve the input text, the app uses fromCharCode() to convert the numbers back to text.
I chose to do this because Perl's unicode support is very hard to deal with correctly. So Perl and MySQL only see numbers, which makes life a lot simpler.
My question is can I depend on fromCharCode() to always convert a number like 8834 to the relevant character? I don't know what standard it uses, but let's say it uses UTF-8, if it is changed to use UTF-16 in the future, this will obviously break my program if there is no backward compatibility.
I know that my ideas about these concepts aren't that clear, therefore please care to clarify if I've shown a misunderstanding.
fromCharCode and toCharCode deal with Unicode code points, i.e. numbers between 0 and 65535(0xffff), assuming all characters are in the Basic-Multilingual Plane(BMP). Unicode and the code points are permanent, so you can trust them to remain the same forever.
Encodings such as UTF-8 and UTF-16 take a stream of code points (numbers) and output a byte stream. JavaScript is somewhat strange in that characters outside the BMP have to be constructed by two calls to toCharCode, according to UTF-16 rules. However, virtually every character you'll ever encounter (including Chinese, Japanese etc.) is in the BMP, so your program will work even if you don't handle these cases.
One thing you can do is convert the numbers back into bytes (in big-endian int16 format), and interpret the resulting text as UTF-16. The behavior of fromCharCode and toCharCode is fixed in current JavaScript implementations and will not ever change.
I chose to do this because Perl's unicode support is very hard to deal with correctly.
This is ɴᴏᴛ true!
Perl has the strongest Unicode support of any major programming language. It is much easier to work with Unicode if you use Perl than if you use any of C, C++, Java, C♯, Python, Ruby, PHP, or Javascript. This is not hyperbole and boosterism from uneducated, blind allegiance.; it is a considered appraisal based on more than ten years of professional experience and study.
The problems encountered by naïve users are virtually always because they have deceived themselves about what Unicode is. The number-one worst brain-bug is thinking that Unicode is like ASCII but bigger. This is absolutely and completely wrong. As I wrote elsewhere:
It’s fundamentally and critically not true that Uɴɪᴄᴏᴅᴇ is just some enlarged character set relative to ᴀsᴄɪɪ. At most, that’s true of nothing more than the stultified ɪsᴏ‑10646. Uɴɪᴄᴏᴅᴇ includes much much more that just the assignment of numbers to glyphs: rules for collation and comparisons, three forms of casing, non-letter casing, multi-codepoint casefolding, both canonical and compatible composed and decomposed normalization forms, serialization forms, grapheme clusters, word- and line-breaking, scripts, numeric equivs, widths, bidirectionality, mirroring, print widths, logical ordering exclusions, glyph variants, contextual behavior, locales, regexes, multiple forms of combining classes, multiple types of decompositions, hundreds and hundreds of critically useful properties, and much much much more‼
Yes, that’s a lot, but it has nothing to do with Perl. It has to do with Unicode. That Perl allows you to access these things when you work with Unicode is not a bug but a feature. That those other languages do not allow you full access to Unicode can by no means be construed as a point in their favor: rather, those are all major bugs of the highest possible severity, because if you cannot work with Unicode in the 21st century, then that language is a primitive, broken, and fundamentally useless for the demanding requirements of modern text processing.
Perl is not. And it is a gazillion times easier to do those things right in Perl than in those other languages; in most of them, you cannot even begin to work around their design flaws. You’re just plain screwed. If a language doesn’t provide full Unicode support, it is not fit for this century; discard it.
Perl makes Unicode infinitely easier than languages that don’t let you use Unicode properly can ever do.
In this answer, you will find at the front, Seven Simple Steps for dealing with Unicode in Perl, and at the bottom of that same answer, you will find some boilerplate code that will help. Understand it, then use it. Do not accept brokenness. You have to learn Unicode before you can use Unicode.
And that is why there is no simple answer. Perl makes it easy to work with Unicode, provided that you understand what Unicode really is. And if you’re dealing with external sources, you are doing to have to arrange for that source to use some sort of encoding.
Also read up on all the stuff I said about 𝔸𝕤𝕤𝕦𝕞𝕖 𝔹𝕣𝕠𝕜𝕖𝕟𝕟𝕖𝕤𝕤. Those are things that you truly need to understand. Another brokenness issue that falls out of Rule #49 is that Javascript is broken because it doesn’t treat all valid Unicode code points in exactly the same way irrespective of their plane. Javascript is broken in almost all the other ways, too. It is unsuitable for Unicode work. Just Rule #34 will kill you, since you can’t get Javascript to follow the required standard about what things like \w are defined to do in Unicode regexes.
It’s amazing how many languages are utterly useless for Unicode. But Perl is most definitely not one of those!
In my opinion it won't break.
Read Joel Spolsky's article on Unicode and character encoding. Relevant part of the article is quoted below:
Every letter in every
alphabet is assigned a number by
the Unicode consortium which is
written like this: U+0639. This
number is called a code point. The U+
means "Unicode" and the numbers are
hexadecimal. The English letter A would
be U+0041.
It does not matter whether this magical number is encoded in utf-8 or utf-16 or any other encoding. The number will still be the same.
As pointed out in other answers, fromCharCode() and toCharCode() deal with Unicode code points for any code point in the Basic Multilingual Plane (BMP). Strings in JavaScript are UCS-2 encoded, and any code point outside the BMP is represented as two JavaScript characters. None of these things are going to change.
To handle any Unicode character on the JavaScript side, you can use the following function, which will return an array of numbers representing the sequence of Unicode code points for the specified string:
var getStringCodePoints = (function() {
function surrogatePairToCodePoint(charCode1, charCode2) {
return ((charCode1 & 0x3FF) << 10) + (charCode2 & 0x3FF) + 0x10000;
}
// Read string in character by character and create an array of code points
return function(str) {
var codePoints = [], i = 0, charCode;
while (i < str.length) {
charCode = str.charCodeAt(i);
if ((charCode & 0xF800) == 0xD800) {
codePoints.push(surrogatePairToCodePoint(charCode, str.charCodeAt(++i)));
} else {
codePoints.push(charCode);
}
++i;
}
return codePoints;
}
})();
var str = "𝌆";
var codePoints = getStringCodePoints(s);
console.log(str.length); // 2
console.log(codePoints.length); // 1
console.log(codePoints[0].toString(16)); // 1d306
JavaScript Strings are UTF-16 this isn't something that is going to be changed.
But don't forget that UTF-16 is variable length encoding.
In 2018, you can use String.codePointAt() and String.fromCodePoint().
These methods work even if a character is not in the Basic-Multilingual Plane(BMP).

Categories

Resources