Unexpected behaviour of String.fromCodePoint / String#codePointAt (Firefox/ES6) - javascript

Since version 29 of Firefox, Mozilla provides the String.fromCodePoint and String#codePointAt methods and also published polyfills on the respective MDN pages.
So it happens that I am currently trying this out and it seems that I am missing something important, as splitting the string "ä☺𠜎" into codepoints and reassembling it from these returns an, at least for me, unexpected result.
I've built a test case: http://jsfiddle.net/dcodeIO/YhwP7/
var str = "ä☺𠜎";
...split it, reassemble it...
Am I missing something?

This is not a problem of .codePointAt, but more of the char encoding of the character 𠜎. 𠜎 has a javascript string length of 2.
Why?
Because Javascript Strings are encoded using 2-byte UTF-16. 𠜎 ( charcode: 132878 ) is greater than 2-byte UTF-16 ( 0-65535 ). This means it needs to be encoded using 4-byte UTF-16. Its UTF-16 representation is 0xD841 0xDF0E consuming two characters in the string.
When using .charAt() you will see the correct values:
var string = "𠜎";
console.log( string.charAt(0), string.charAt(1) ); // logs 55361 57102 (0xD841 0xDF0E)
Why doesn't it display 228, 9786, 55361, 57102?
Thats because .codePointAt() converts 4-byte UTF-16 characters to integers correctly ( 132878 ).
So why does it output 57,102 then?
Because you are iterating for str.length in your loop, which returns 4 (because "𠜎".length == "), so .codePointAt() will get executed on str[3] which is 57102.

Related

Javascript: Convert Unicode Character to hex string [duplicate]

I'm using a barcode scanner to read a barcode on my website (the website is made in OpenUI5).
The scanner works like a keyboard that types the characters it reads. At the end and the beginning of the typing it uses a special character. These characters are different for every type of scanner.
Some possible characters are:
█
▄
–
—
In my code I use if (oModelScanner.oData.scanning && oEvent.key == "\u2584") to check if the input from the scanner is ▄.
Is there any way to get the code from that character in the \uHHHH style? (with the HHHH being the hexadecimal code for the character)
I tried the charCodeAt but this returns the decimal code.
With the codePointAt examples they make the code I need into a decimal code so I need a reverse of this.
Javascript strings have a method codePointAt which gives you the integer representing the Unicode point value. You need to use a base 16 (hexadecimal) representation of that number if you wish to format the integer into a four hexadecimal digits sequence (as in the response of Nikolay Spasov).
var hex = "▄".codePointAt(0).toString(16);
var result = "\\u" + "0000".substring(0, 4 - hex.length) + hex;
However it would probably be easier for you to check directly if you key code point integer match the expected code point
oEvent.key.codePointAt(0) === '▄'.codePointAt(0);
Note that "symbol equality" can actually be trickier: some symbols are defined by surrogate pairs (you can see it as the combination of two halves defined as four hexadecimal digits sequence).
For this reason I would recommend to use a specialized library.
you'll find more details in the very relevant article by Mathias Bynens
var hex = "▄".charCodeAt(0).toString(16);
var result = "\\u" + "0000".substring(0, 4 - hex.length) + hex;
If you want to print the multiple code points of a character, e.g., an emoji, you can do this:
const facepalm = "🤦🏼‍♂️";
const codePoints = Array.from(facepalm)
.map((v) => v.codePointAt(0).toString(16))
.map((hex) => "\\u{" + hex + "}");
console.log(codePoints);
["\u{1f926}", "\u{1f3fc}", "\u{200d}", "\u{2642}", "\u{fe0f}"]
If you are wondering about the components and the length of 🤦🏼‍♂️, check out this article.

When is an umlaut not an umlaut (u-umlaut maps to charCode 117)

> let uValFromStr = "Würzburg".charCodeAt(1)
undefined
> uValFromStr
117
> String.fromCharCode(252)
'ü'
> "Würzburg".charCodeAt(1) === String.fromCharCode(252)
false
>
We have a case where an umlaut in a string is failing a simple string comparison test because its value is actually mapped to charCode 117. The u-umlaut should be mapped to charCode 252. Note the first two lines where we extract the character's charCode. So when this occurs, a user enters a text string matching the first three characters and the match fails as code is evaluating 117===252.
Any ideas as to how this can occur? We have numerous use cases with umlauts in our data which work correctly so it is not an endemic issue but rather one that is particular to this input only (so far).
The ü in that specific "Würzburg" string is written using the Unicode code point for u (U+0075) followed by an umlaut combining mark (U+0308) which modifies it, but the ü you're comparing it to is written with the single Unicode code point for u-with-umlaut (U+00FC). Nearly all of JavaScript's string handling is quite naive, which is why they aren't equal. This naive (but fast!) nature has two parts: 1) It doesn't know about combining marks, which is why "Würzburg".length is 9 instead of 8 (if the ü is written using U+0075 and U+00FC); and 2) JavaScript "characters" are actually UTF-16 code units, which may be only half of a code point ("😊".length is 2, for instance, because although it's a single Unicode code point (U+1F60A), it requires two code units to be expressed in UTF-16). (One can argue that JavaScript strings are UCS-2 because they tolerate invalid surrogate pairs [pairs of code units that, taken together, describe a code point], but the spec says "...each element in the String is treated as a UTF-16 code unit value...")
You can solve this problem with comparing those two umlauted u's by using normalization, via JavaScript's (relatively new) normalize method:
const word = "Würzburg";
// Iteration moves through the string by code points, not code units
for (const ch of word) {
console.log(`${ch} = ${ch.codePointAt(0)}`);
}
const char = String.fromCharCode(252);
const normalizedWord = word.normalize();
const normalizedChar = char.normalize();
// Using iteration to grab the second "character" (code point) from the string
const [, secondCharOfWord] = normalizedWord;
console.log(normalizedChar === secondCharOfWord); // true
.as-console-wrapper {
max-height: 100% !important;
}
In that example we use the default normalization ("NFC," Normalization Form C), which prefers specific code points to combining marks, so the normalized version of the word uses u-with-umlaut code point U+00FC. There are other normalization forms available by passing an argument to normalize (such as Normalization Form D, which prefers combining marks to specific character code points), but the default is usually the one you want.

decoding array of utf8 strings inside a stream

I faced a weird problem today after trying to decode a utf8 formatted string. It's being fetched through stream as an array of strings but formatted in utf8 somehow (I'm using fast-csv). However as you can see in the console if I log it directly it shows the correct version but when it's inside an object literal it's back to utf8 encoded version.
var stream = fs
.createReadStream(__dirname + '/my.csv')
.pipe(csv({ ignoreEmpty: true }))
.on('data', data => {
console.log(data[0])
// prints farren#rogers.com
console.log({ firstName: data[0] })
// prints { firstName: '\u0000f\u0000a\u0000r\u0000r\u0000e\u0000n\u0000#\u0000r\u0000o\u0000g\u0000e\u0000r\u0000s\u0000.\u0000c\u0000o\u0000m\u0000' }
})
Any solution or explanations are appreciated.
Edit: even after decoding using utf8.js and then pass it in the object literal, I still encounter the same problem.
JavaScript uses UTF-16 for Strings. It also has a numeric escape notation for a UTF-16 code unit. So, when you see this output in your debugger
\u0000f\u0000a\u0000r\u0000r\u0000e\u0000n
It is saying that the String's code units are \u0000 f \u0000 a etc. The \uHHHH escape means the UTF-16 code unit HHHH in hexadecimal. \u0000 is the single (unpaired) UTF-16 code unit need for the U+0000 (NUL) Unicode codepoint. So, something is being interpreted as NUL f NUL a, etc.
UTF-8 code units are 8 bits each. NUL in UTF-8 is 0x00. f is 0x66.
UTF-16 code units are 16 bits each. NULL is 0x0000. f is 0x0066. When 16-bit values are stored as bytes, endianness applies. In little endian, 0x0066 is written as 0x66 0x00. In big endian, 0x00 0x66.
So, if bytes of UTF-16 code units (such as the ones in the example data) are interpreted as UTF-8 (or perhaps other encodings), f can be read as NUL f or f NUL.
The fundamental rule of character encodings is to read with the same encoding that text was written with. No doing so can lead to data loss and corruption that can go on undetected. Not knowing what the encoding is to begin with is data loss itself and a failed communication.
You can learn more about Unicode at Unicode.org. You can learn more about languages and technologies that use it from their respective specifications—they are all very upfront and clear about it. JavaScript, Java, C#, VBA/VB4/VB5/VB6, VB.NET, F#, HTML, XML, T-SQL,…. (Okay, VB4 documentation might not be quite as clear but the point is that this is very common and not new [VBPJ Sept. 1996], though we all are still struggling to assimilate it.)

Reassembling negative Python marshal int's into Javascript numbers

I'm writing a client-side Python bytecode interpreter in Javascript (specifically Typescript) for a class project. Parsing the bytecode was going fine until I tried out a negative number.
In Python, marshal.dumps(2) gives 'i\x02\x00\x00\x00' and marshal.dumps(-2) gives 'i\xfe\xff\xff\xff'. This makes sense as Python represents integers using two's complement with at least 32 bits of precision.
In my Typescript code, I use the equivalent of Node.js's Buffer class (via a library called BrowserFS, instead of ArrayBuffers and etc.) to read the data. When I see the character 'i' (i.e. buffer.readUInt8(offset) == 105, signalling that the next thing is an int), I then call readInt32LE on the next offset to read a little-endian signed long (4 bytes). This works fine for positive numbers but not for negative numbers: for 1 I get '1', but for '-1' I get something like '-272777233'.
I guess that Javascript represents numbers in 64-bit (floating point?). So, it seems like the following should work:
var longval = buffer.readInt32LE(offset); // reads a 4-byte long, gives -272777233
var low32Bits = longval & 0xffff0000; //take the little endian 'most significant' 32 bits
var newval = ~low32Bits + 1; //invert the bits and add 1 to negate the original value
//but now newval = 272826368 instead of -2
I've tried a lot of different things and I've been stuck on this for days. I can't figure out how to recover the original value of the Python integer from the binary marshal string using Javascript/Typescript. Also I think I deeply misunderstand how bits work. Any thoughts would be appreciated here.
Some more specific questions might be:
Why would buffer.readInt32LE work for positive ints but not negative?
Am I using the correct method to get the 'most significant' or 'lowest' 32 bits (i.e. does & 0xffff0000 work how I think it does?)
Separate but related: in an actual 'long' number (i.e. longer than '-2'), I think there is a sign bit and a magnitude, and I think this information is stored in the 'highest' 2 bits of the number (i.e. at number & 0x000000ff?) -- is this the correct way of thinking about this?
The sequence ef bf bd is the UTF-8 sequence for the "Unicode replacement character", which Unicode encoders use to represent invalid encodings.
It sounds like whatever method you're using to download the data is getting accidentally run through a UTF-8 decoder and corrupting the raw datastream. Be sure you're using blob instead of text, or whatever the equivalent is for the way you're downloading the bytecode.
This got messed up only for negative values because positive values are within the normal mapping space of UTF-8 and thus get translated 1:1 from the original byte stream.

Javascript encoding breaking & combining multibyte characters?

I'm planning to use a client-side AES encryption for my web-app.
Right now, I've been looking for ways to break multibyte characters into one byte-'non-characters' ,encrypt (to have the same encrypted text length),
de-crypt them back, convert those one-byte 'non-characters' back to multibyte characters.
I've seen the wiki for UTF-8 (the supposedly-default encoding for JS?) and UTF-16, but I can't figure out how to detect "fragmented" multibyte characters and how I can combine them back.
Thanks : )
JavaScript strings are UTF-16 stored in 16-bit "characters". For Unicode characters ("code points") that require more than 16 bits (some code points require 32 bits in UTF-16), each JavaScript "character" is actually only half of the code point.
So to "break" a JavaScript character into bytes, you just take the character code and split off the high byte and the low byte:
var code = str.charCodeAt(0); // The first character, obviously you'll have a loop
var lowbyte = code & 0xFF;
var highbyte = (code & 0xFF00) >> 8;
(Even though JavaScript's numbers are floating point, the bitwise operators work in terms of 32-bit integers, and of course in our case only 16 of those bits are relevant.)
You'll never have an odd number of bytes, because again this is UTF-16.
You could simply convert to UTF8... For example by using this trick
function encode_utf8(s) {
return unescape(encodeURIComponent(s));
}
function decode_utf8(s) {
return decodeURIComponent(escape(s));
}
Considering you are using crypto-js, you can use its methods to convert to utf8 and return to string. See here:
var words = CryptoJS.enc.Utf8.parse('𤭢');
var utf8 = CryptoJS.enc.Utf8.stringify(words);
The 𤭢 is probably a botched example of Utf8 character.
By looking at the other examples (see the Latin1 example), I'll say that with parse you convert a string to Utf8 (technically you convert it to Utf8 and put in a special array used by crypto-js of type WordArray) and the result can be passed to the Aes encoding algorithm and with stringify you convert a WordArray (for example obtained by decoding algorithm) to an Utf8.
JsFiddle example: http://jsfiddle.net/UpJRm/

Categories

Resources