TextEncoder / TextDecoder not round tripping - javascript

I'm definitely missing something about the TextEncoder and TextDecoder behavior. It seems to me that the following code should round-trip, but it doesn't:
new TextDecoder().decode(new TextEncoder().encode(String.fromCharCode(55296))).charCodeAt(0);
Since I'm just encoding and decoding the string, the char code seems like it should be the same, but this returns 65533 instead of 55296. What am I missing?

Based on some spelunking, the TextEncoder.encode() method appears to take an argument of type USVString, where USV stands for Unicode Scalar Value. According to this page, a USV cannot be a high-surrogate or low-surrogate code point.
Also, according to MDN:
A USVString is a sequence of Unicode scalar values. This definition
differs from that of DOMString or the JavaScript String type in that
it always represents a valid sequence suitable for text processing,
while the latter can contain surrogate code points.
So, my guess is your String argument to encode() is getting converted to a USVString (either implicitly or within encode()). Based on this page, it looks like converting a String to a USVString first treats it as a DOMString and then follows this procedure, which includes replacing any unpaired surrogate with U+FFFD, the "Replacement Character" — which is the code point you see, 65533.
The reason String.fromCharCode(55296).charCodeAt(0) works, I believe, is that it doesn't need to do this String -> USVString conversion.
As to why TextEncoder.encode() was designed this way, I don't understand the Unicode details well enough to attempt an explanation, but I suspect it's to simplify implementation, since the only output encoding it supports seems to be UTF-8 in a Uint8Array. I'm guessing that requiring a USVString argument without lone surrogates (instead of a native UTF-16 String that may contain them) simplifies the encoding to UTF-8, or maybe makes some encoding/decoding use cases simpler.
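To make that concrete, here is a minimal sketch you can run in a browser console or modern Node; it uses nothing beyond the standard TextEncoder/TextDecoder API, and shows that a lone surrogate is replaced during encoding while a valid surrogate pair survives the round trip:
const enc = new TextEncoder();
const dec = new TextDecoder();
// A lone high surrogate (55296 = 0xD800) is not a Unicode scalar value, so encode() replaces it with U+FFFD.
dec.decode(enc.encode(String.fromCharCode(55296))).charCodeAt(0);         // 65533 (0xFFFD)
// A valid surrogate pair encodes to four UTF-8 bytes and decodes back unchanged.
dec.decode(enc.encode(String.fromCharCode(55296, 57091))).codePointAt(0); // 66307 (0x10303)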

For those (like me) who aren't sure what "unicode surrogates" are:
The problem
The character code 55296 (0xD800) is not a valid character by itself, so this part of the code is already a problem:
String.fromCharCode(55296)
fromCharCode does return a one-element string containing that lone surrogate, but because a lone surrogate is not a valid Unicode scalar value, TextEncoder.encode() replaces it with the error character "�" (the replacement character, code 65533), which is what you get back after decoding.
Codes like 55296 are only valid as the first element of a pair of codes. Pairs of codes are used to represent the characters that didn't fit in Unicode's Basic Multilingual Plane. (There are a lot of characters outside the Basic Multilingual Plane, so they need two 16-bit numbers to encode them.)
For example, here is a valid use of the code 55296:
console.log(String.fromCharCode(55296, 57091));
It produces the character "𐌃" (U+10303, from the Old Italic block used to write the ancient Etruscan alphabet).
The solution
This code will round-trip correctly:
const code = new TextEncoder().encode(String.fromCharCode(55296, 57091));
console.log(new TextDecoder().decode(code).charCodeAt(0)); // Returns 55296
But beware: .charCodeAt only returns the first half of the pair. A safer option might be to use .codePointAt to get the character back as a single code point:
const code = new TextEncoder().encode(String.fromCharCode(55296, 57091));
console.log(new TextDecoder().decode(code).codePointAt(0)); // Returns 66307
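For the curious, the relationship between the pair (55296, 57091) and the single code point 66307 is plain arithmetic; here is a small sketch of the standard UTF-16 surrogate-pair decoding formula (not specific to TextEncoder at all):
const high = 55296; // 0xD800
const low = 57091;  // 0xDF03
// Each surrogate contributes 10 bits; the result is offset by 0x10000.
const codePoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
console.log(codePoint); // 66307 (0x10303, the character "𐌃")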

Related

Text encoding that produces legible encodings suitable as Javascript identifiers?

I'm working on a tool that reads arbitrary data files and creates a table out of their data, which I then store in a database. I'd like to preserve the column headers. The column headers are already ASCII text (or maybe latin1), but they contain characters that aren't valid in variable names (e.g., spaces, %), so I need to encode them somehow. I'm looking for an encoding for the column titles that has these properties:
Legible: it would be nice if the encoded text looked as similar as possible to the unencoded text (i.e., for debugging).
Legal identifier: I'd like the encoded text to be a valid JavaScript identifier (ECMA-262 Section 7.6).
Invertible: I'd like to be able to get the exact original text back from the encoded text.
I can think of approaches that work for two of the three cases, but I don't know how to get all three. For example, URL encoding doesn't produce legal identifier names; I think I could transform base64 to be legal, but it isn't legible; and what I've got currently just does some substitutions, so it's not invertible.
Efficiency isn't a concern, so if necessary, I could store the encoded and unencoded texts together. The best option I can think of is to use URL encoding and then swap percent signs for $. I thought there would be better options than this, but I can't find anything. Is there anything better?
This pair of methods relying on Guava's PercentEscaper seems to meet my requirements. Guava doesn't provide an unescaper, but given my simple needs here, I can just use a simple URLDecoder.
import com.google.common.net.PercentEscaper;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

private static PercentEscaper escaper = new PercentEscaper("", false);

static String getIdentifier(String str) {
    // Minimal safe characters, but leaves letters and digits alone, so it's somewhat legible.
    String escaped = escaper.escape(str);
    // JavaScript identifiers can't start with a digit, and the escaper doesn't know the first
    // character has different rules, so prepend "%3" to turn a leading digit into its %3x escape.
    if (!escaped.isEmpty() && Character.isDigit(escaped.charAt(0))) {
        escaped = "%3" + escaped;
    }
    // A percent isn't valid in a JavaScript identifier, so use _ as our special character instead.
    escaped = escaped.replace('%', '_');
    return escaped;
}

static String invertIdentifier(String str) throws UnsupportedEncodingException {
    String unescaped = str.replace('_', '%');
    return URLDecoder.decode(unescaped, "UTF-8");
}
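For reference, here is a minimal JavaScript sketch of the same idea (the helper names toIdentifier/fromIdentifier are mine, and it assumes the headers really are ASCII/latin1, so every char code fits in two hex digits):
// Replace every character outside [A-Za-z0-9] with '_' plus its two-digit hex code,
// so the result is a legal identifier and fully invertible.
const toIdentifier = (str) => {
  let escaped = str.replace(/[^A-Za-z0-9]/g,
    (ch) => '_' + ch.charCodeAt(0).toString(16).padStart(2, '0'));
  // Identifiers can't start with a digit, so escape a leading digit the same way "%3" did above.
  if (/^[0-9]/.test(escaped)) {
    escaped = '_3' + escaped; // '_3' + '5' decodes back to 0x35, i.e. '5'
  }
  return escaped;
};
const fromIdentifier = (id) =>
  id.replace(/_([0-9a-f]{2})/g, (m, hex) => String.fromCharCode(parseInt(hex, 16)));
toIdentifier('My Column (%)');             // => 'My_20Column_20_28_25_29'
fromIdentifier('My_20Column_20_28_25_29'); // => 'My Column (%)'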

Insert Unicode character into JavaScript

I need to insert an Omega (Ω) onto my HTML page. I am using its HTML-escaped code to do that, so I can write &#937; and get Ω. That's all fine and well when I put it into an HTML element; however, when I try to put it into my JS, e.g. var Omega = &#937;, it parses that code as JS and the whole thing doesn't work. Anyone know how to go about this?
I'm guessing that you actually want Omega to be a string containing an uppercase omega? In that case, you can write:
var Omega = '\u03A9';
(Because Ω is the Unicode character with codepoint U+03A9; that is, 03A9 is 937, except written as four hexadecimal digits.)
Edited to add (in 2022): There now exists an alternative form that better supports codepoints above U+FFFF:
let Omega = '\u{03A9}';
let desertIslandEmoji = '\u{1F3DD}';
Judging from https://caniuse.com/mdn-javascript_builtins_string_unicode_code_point_escapes, most or all browsers added support for it in 2015, so it should be reasonably safe to use.
Although @ruakh gave a good answer, I will add some alternatives for completeness:
You could in fact use even var Omega = '&#937;' in JavaScript, but only if your JavaScript code is:
inside an event attribute, as in onclick="var Omega = '&#937;'; alert(Omega)", or
in a script element inside an XHTML (or XHTML + XML) document served with an XML content type.
In these cases, the code will first (before being passed to the JavaScript interpreter) be parsed by an HTML parser, so that character references like &#937; are recognized. The restrictions make this an impractical approach in most cases.
You can also enter the Ω character as such, as in var Omega = 'Ω', but then the character encoding must allow that, the encoding must be properly declared, and you need software that lets you enter such characters. This is a clean solution and quite feasible if you use UTF-8 encoding for everything and are prepared to deal with the issues it creates. Source code will be readable, and reading it, you immediately see the character itself instead of a code notation. On the other hand, it may cause surprises if other people start working with your code.
Using the \u notation, as in var Omega = '\u03A9', works independently of character encoding, and it is in practice almost universal. However, as such it can be used only up to U+FFFF, i.e. up to \uffff, but most characters that most people have ever heard of fall into that area. (If you need “higher” characters, you need to use either surrogate pairs or one of the two approaches above.)
You can also construct a character using the String.fromCharCode() method, passing as a parameter the Unicode number, in decimal as in var Omega = String.fromCharCode(937) or in hexadecimal as in var Omega = String.fromCharCode(0x3A9). This works up to U+FFFF. This approach can be used even when you have the Unicode number in a variable.
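For instance (a small example to go with the surrogate-pair remark above), a character beyond U+FFFF such as U+1F3DD can still be written with the \u notation as a pair of surrogates:
var desertIsland = '\uD83C\uDFDD'; // U+1F3DD written as the surrogate pair D83C + DFDD
console.log(desertIsland.length);  // 2 (two UTF-16 code units, one visible character)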
One option is to put the character literally in your script, e.g.:
const omega = 'Ω';
This requires that you let the browser know the correct source encoding; see Unicode in JavaScript.
However, if you can't or don't want to do this (e.g. because the character is too exotic and can't be expected to be available in the code editor font), the safest option may be to use new-style string escape or String.fromCodePoint:
const omega = '\u{3a9}';
// or:
const omega = String.fromCodePoint(0x3a9);
This is not restricted to a single 16-bit code unit but works for all Unicode code points. In comparison, the other approaches mentioned here have the following downsides:
HTML escapes (const omega = '&#937;';): only work when rendered unescaped in an HTML element
old-style string escapes (const omega = '\u03A9';): restricted to code points up to U+FFFF
String.fromCharCode: restricted to code points up to U+FFFF
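A quick illustration of that last restriction (my own example; only the two standard methods are used):
String.fromCharCode(0x1F3DD);  // '\uF3DD' – the value is truncated to 16 bits, not the intended character
String.fromCodePoint(0x1F3DD); // '🏝' – the full code point U+1F3DD, stored as a surrogate pair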
The answer is correct, but you don't need to declare a variable.
A string can contain your character:
"This string contains omega, that looks like this: \u03A9"
Unfortunately, such ASCII escape codes are still needed to produce these characters, but I am still waiting (after too many years...) for the day when UTF-8 is as universally supported as ASCII once was, and ASCII is just a remembrance of the past.
I found this question when trying to implement a Font Awesome-style icon system in HTML. I have an API that provides me with a hex string and I need to convert it to the corresponding Unicode character to match the icon font-family.
Say I have the string const code = 'f004'; from my API. I can't do simple string concatenation (const unicode = '\u' + code;) since '\u' on its own is an invalid escape sequence; this will in fact cause a syntax error if you try.
@coldfix mentioned using String.fromCodePoint, but it takes a number as an argument, not a string.
To finally cross the finish line, just add parseInt and pass 16 (since hex is base 16) as its second parameter. You'll finally get a Unicode character from a simple hex string.
This is what I did:
const code = 'f004';
const toUnicode = code => String.fromCodePoint(parseInt(code, 16));
toUnicode(code);
// => '\uf004'
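Because String.fromCodePoint is used (rather than String.fromCharCode), the same helper should also handle code points above U+FFFF, for example:
toUnicode('1f3dd'); // => '🏝' (U+1F3DD, stored as two UTF-16 code units)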
Try using Function(), like this:
var code = "2710";
var char = Function("return '\\u" + code + "';")();
It works well; just do not add any 's or "s or spaces inside the code string.
In the example, char is "✐".

what kind of encoding is this?

I've got some data from DBpedia using Jena, and since Jena's output is XML-based, there are some circumstances where XML characters need to be treated differently, like the following:
Guns n&amp;#39; Roses
I just want to know: what kind of encoding is this?
I want to decode/encode my input based on the above encoding with the help of JavaScript and send it back to a servlet.
(Edited post: the space I originally put between & and amp was only there because I couldn't find a way to stop Stack Overflow from rendering the entity; the actual data has no space.)
Seems to be XML entity encoding, and a numeric character reference (decimal).
A numeric character reference refers to a character by its Universal Character Set/Unicode code point, and uses the format &#nnnn; or &#xhhhh;.
You can get some info here: List of XML and HTML character entity references on Wikipedia.
Your character is number 39, being the apostrophe: &#39; decodes to ', and it can also be referenced with a character entity reference: &apos;.
To decode this using JavaScript, you could use for example php.js, which has an html_entity_decode() function (note that it depends on get_html_translation_table()).
UPDATE: in reply to your edit: basically that is the same; the only difference is that it was encoded twice (possibly by mistake). &amp; is the ampersand: &.
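If you'd rather not pull in php.js, a minimal sketch that decodes character references in the browser is to let the HTML parser do the work via DOMParser (this assumes the input is plain entity-encoded text with no markup you want to preserve):
const decodeEntities = (str) =>
  new DOMParser().parseFromString(str, 'text/html').documentElement.textContent;
decodeEntities('Guns n&amp;#39; Roses');                 // => "Guns n&#39; Roses" (decoded once)
decodeEntities(decodeEntities('Guns n&amp;#39; Roses')); // => "Guns n' Roses" (decoded twice, matching the double encoding)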
This is an SGML/HTML/XML numeric character entity reference.
In this case, &#39;, for an apostrophe (').

Why does Closure Compiler insist on adding more bytes?

If I give Closure Compiler something like this:
window.array = '0123456789'.split('');
It "compiles" it to this:
window.array="0,1,2,3,4,5,6,7,8,9".split(",");
Now as you can tell, that's bigger. Is there any reason why Closure Compiler is doing this?
I think this is what's going on, but I am by no means certain...
The code that causes the insertion of commas is tryMinimizeStringArrayLiteral in PeepholeSubstituteAlternateSyntax.java.
That method contains a list of characters that are likely to have a low Huffman encoding, and are therefore more favourable to split on than other characters. You can see the result of this if you try something like this:
"a b c d e f g".split(" "); //Uncompiled, split on spaces
"a,b,c,d,e,f,g".split(","); //Compiled, split on commas (same size)
The compiler will replace the character you try to split on with one it thinks is favourable. It does so by iterating over a list of candidate delimiters and choosing the most favourable one that does not occur within any of the strings:
// These delimiters are chars that appears a lot in the program therefore
// probably have a small Huffman encoding.
NEXT_DELIMITER: for (char delimiter : new char[]{',', ' ', ';', '{', '}'}) {
  for (String cur : strings) {
    if (cur.indexOf(delimiter) != -1) {
      continue NEXT_DELIMITER;
    }
  }
  String template = Joiner.on(delimiter).join(strings);
  //...
}
In the above snippet you can see the array of characters the compiler claims to be optimal to split on. The comma is first (which is why in my space example above, the spaces have been replaced by commas).
I believe the insertion of commas in the case where the string to split on is the empty string may simply be an oversight. There does not appear to be any special treatment of this case, so it's treated like any other split call and each character is joined with the first appropriate character from the array shown in the above snippet.
Another example of how the compiler deals with the split method:
"a,;b;c;d;e;f;g".split(";"); //Uncompiled, split on semi-colons
"a, b c d e f g".split(" "); //Compiled, split on spaces
This time, since the original string already contains a comma (and we don't want to split on the comma character), the comma can't be chosen from the array of low-Huffman-encoded characters, so the next best choice is selected (the space).
Update
Following some further research into this, it is definitely not a bug. This behaviour is actually by design, and in my opinion it's a very clever little optimisation, when you bear in mind that the Closure compiler tends to favour the speed of the compiled code over size.
Above I mentioned Huffman encoding a couple of times. The Huffman coding algorithm, explained very simply, assigns a weight to each character appearing in the text to be encoded. The weight is based on the frequency with which each character appears. These frequencies are used to build a binary tree in which the most common characters end up closest to the root. That means the most common characters are quicker to decode, since they are closer to the root of the tree.
The Huffman algorithm is also a large part of the DEFLATE algorithm used by gzip, so if your web server is configured to use gzip, your users will be benefiting from this clever optimisation.
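To make the idea concrete, here is a toy sketch of Huffman code assignment (my own illustration; it is not the compiler's or gzip's actual code):
function huffmanCodes(text) {
  // Count character frequencies.
  const freq = {};
  for (const ch of text) freq[ch] = (freq[ch] || 0) + 1;
  // Start with one leaf node per character.
  let nodes = Object.entries(freq).map(([ch, weight]) => ({ ch, weight }));
  // Repeatedly merge the two lightest nodes until a single tree remains.
  while (nodes.length > 1) {
    nodes.sort((a, b) => a.weight - b.weight);
    const [a, b] = nodes.splice(0, 2);
    nodes.push({ left: a, right: b, weight: a.weight + b.weight });
  }
  // Walk the tree to read off each character's bit code.
  const codes = {};
  (function walk(node, prefix) {
    if (node.ch !== undefined) { codes[node.ch] = prefix || '0'; return; }
    walk(node.left, prefix + '0');
    walk(node.right, prefix + '1');
  })(nodes[0], '');
  return codes;
}
huffmanCodes('aaaabbc'); // 'a' (most frequent) gets the shortest code: a -> '1', b -> '01', c -> '00'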
This issue was fixed on Apr 20, 2012; see this revision:
https://code.google.com/p/closure-compiler/source/detail?r=1267364f742588a835d78808d0eef8c9f8ba8161
Ironically, split in the compiled code has nothing to do with split in the source. Consider:
Source : a = ["0","1","2","3","4","5"]
Compiled: a="0,1,2,3,4,5".split(",")
Here, split is just a way to represent long arrays (long enough that the sum of all the quotes and commas is longer than the extra .split(",")). So, what's going on in your example? First, the compiler sees a string function applied to a constant and evaluates it right away:
'0123456789'.split('') => ["0","1","2","3","4","5","6","7","8","9"]
At some later point, when generating output, the compiler considers this array to be "long" and writes it in the above "split" form:
["0","1","2","3","4","5","6","7","8","9"] => "0,1,2,3,4,5,6,7,8,9".split(",")
Note that all information about split('') in the source is already lost at this point.
If the source string were shorter, it would be generated in the plain array form, without the extra split:
Source : a = '0123'.split('')
Compiled: a=["0","1","2","3"]
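To see why the compiler switches forms at all, here is a rough sketch of the size comparison between the two output forms (my own illustration, not the compiler's actual heuristic):
// Render an array of single-character strings either as a plain array literal
// or as the "join then split" form, and pick whichever is shorter.
const asArrayLiteral = (items) => '[' + items.map((s) => `"${s}"`).join(',') + ']';
const asSplitForm = (items) => `"${items.join(',')}".split(",")`;
const shorterForm = (items) => {
  const a = asArrayLiteral(items);
  const b = asSplitForm(items);
  return b.length < a.length ? b : a;
};
shorterForm(['0', '1', '2', '3']);
// => '["0","1","2","3"]'                   (short arrays stay as literals)
shorterForm(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']);
// => '"0,1,2,3,4,5,6,7,8,9".split(",")'    (long arrays get the split form)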
