Failed to declare and display UTF-8 character properly in JSON - javascript

I have a JSON object where one attribute contains a static special character - https://www.compart.com/en/unicode/U+1F514
I have tried to store the string both as encoded UTF-8 "\xF0\x9F\x94\x94"
or tried to print it using its HEX value - String.fromCharCode(0x1F514) or decimal value String.fromCharCode(128276)
But it all results in an empty character/empty square character in Google Chrome.
How can I store this character properly, statically in a simple JSON {header1:"____"} and then echo it?
I am also not able to display it in IntelliJ, so if you have a comment regarding this side issue I would be very thankful.

For historical reasons, JavaScript doesn't have full Unicode support: its strings are sequences of 16-bit UTF-16 code units, because the language creators assumed that 16 bits would always be enough to encode a single character. JSON inherits that, and \u escapes only accept 4 hexadecimal digits.
The workaround is to split a character above U+FFFF (which takes 4 bytes in UTF-16) into two 2-byte code units, a so-called surrogate pair, as in:
var raw = "🔔";
var doesNotWork = "\u1F514";
var works = "\uD83D\uDD14";
console.log(raw, doesNotWork, works);
... or get rid of escapes and just embed the actual character:
var data = ["🔔"];
var json = JSON.stringify(data);
console.log(json, JSON.parse(json));
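As a quick check of the points above, here is a minimal sketch (the variable names are illustrative and not from the question) that builds U+1F514 in several ways and round-trips it through JSON. It assumes an ES2015+ engine, where String.fromCodePoint and \u{...} escapes are available. It also shows why String.fromCharCode(0x1F514) from the question fails: the argument is truncated to 16 bits.

```javascript
// U+1F514 (BELL) lies above U+FFFF, so it occupies two UTF-16 code units.
const bell = String.fromCodePoint(0x1F514);
const viaSurrogates = "\uD83D\uDD14"; // explicit surrogate pair
const viaBraces = "\u{1F514}";        // ES2015 code point escape

// String.fromCharCode keeps only the low 16 bits: 0x1F514 becomes 0xF514.
const truncated = String.fromCharCode(0x1F514);

// JSON round trip of the object shape used in the question.
const json = JSON.stringify({ header1: bell });
const roundTripped = JSON.parse(json).header1;

console.log(bell === viaSurrogates, bell === viaBraces); // true true
console.log(truncated === "\uF514");                     // true
console.log(bell.length, [...bell].length);              // 2 1
console.log(roundTripped === bell);                      // true
```

Whether the glyph is actually drawn still depends on the font used by the browser or editor.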

I think the problem is that the font doesn't support this symbol, hence the square character being drawn. If there is no specific reason why you are using this character, you could draw it with an icon, or use a character in an icon font.

Related

Matching output for HttpServerUtility.UrlTokenEncode in NodeJS Javascript

I am looking at an example in dotnet which looks like the following: https://dotnetfiddle.net/t0y8yD.
The output for the HttpServerUtility.UrlTokenEncode method is:
Pn55YBwEH2S2BEM5qlNrq-LMNE8BDdHYwbWKFEHiPZo1
When I try to complete the same in NodeJS with encodeURI, encodeURIComponent or any other attempt I get the following:
Pn55YBwEH2S2BEM5qlNrq+LMNE8BDdHYwbWKFEHiPZo=
As you can see from the above, the '+' should be a '-' and the last character is different. The hash is created the same way and yields the same buffer.
var hmac = crypto.createHmac("sha256", buf);
hmac.update("9644873");
var hash = hmac.digest("base64");
How can I get the two to match? One other important note is that this is one use case and I am unsure if there are other chars that do the same.
I am unsure if the dotnet variant is incorrect or the NodeJS version is. However, the comparison will be done on the dotnet side, so I need node to match that.
The difference of the two results is caused by the use of Base64URL encoding in the C# code vs. Base64 encoding in node.js.
Base64URL and Base64 are almost identical, but Base64 encoding uses the characters +, / and =, which have a special meaning in URLs and thus have to be avoided. In Base64URL encoding + is replaced with -, / with _ and = (the padding character on the end) is either percent-encoded as %3D or simply omitted.
In your code you're calculating an HMAC-SHA256 hash, so you get a 256-bit result, i.e. 32 bytes. In Base64/Base64URL every character represents 6 bits, therefore you need 256/6 = 42.67 => 43 characters. Those 43 characters carry 258 bits, so 2 'lonesome' bits are left over at the end, and because Base64 output is emitted in groups of 4 characters, one padding char (=) is added to reach 44.
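The arithmetic above can be sketched as follows. The helper name base64Lengths is hypothetical and only illustrates the calculation; it is not part of any library.

```javascript
// Number of Base64 characters needed for `byteCount` bytes of input.
function base64Lengths(byteCount) {
  const unpadded = Math.ceil((byteCount * 8) / 6); // each character carries 6 bits
  const padded = Math.ceil(byteCount / 3) * 4;     // output comes in groups of 4
  return { unpadded, padded, padding: padded - unpadded };
}

console.log(base64Lengths(32)); // { unpadded: 43, padded: 44, padding: 1 }
console.log(base64Lengths(31)); // { unpadded: 42, padded: 44, padding: 2 }
```

So a 32-byte HMAC-SHA256 digest always ends with exactly one = in standard Base64.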
The question now is why HttpServerUtility.UrlTokenEncode adds a 1 as a replacement for the padding char on the end. I didn't find anything in the documentation. But you should keep in mind that it's insignificant anyway.
To get the same in node.js, you can use the package base64url, or just use simple replace statements on the base64 encoded hash.
With base64url package:
const base64url = require('base64url');
var hmacB64 = "Pn55YBwEH2S2BEM5qlNrq+LMNE8BDdHYwbWKFEHiPZo="
var hmacB64url = base64url.fromBase64(hmacB64)
console.log(hmacB64url)
The result is:
Pn55YBwEH2S2BEM5qlNrq-LMNE8BDdHYwbWKFEHiPZo
as you can see, this library just omits the padding char.
With replace, also replacing the padding = with 1:
var hmacB64 = "Pn55YBwEH2S2BEM5qlNrq+LMNE8BDdHYwbWKFEHiPZo="
console.log(hmacB64.replace(/\//g,'_').replace(/\+/g,'-').replace(/=+$/,'1'))
The result is:
Pn55YBwEH2S2BEM5qlNrq-LMNE8BDdHYwbWKFEHiPZo1
I tried the C# code with different data and always got '1' on the end, so replacing = with 1 seems to be OK, though it doesn't seem to conform to the RFC.
The other alternative, if this is an option for you, is to change the C# code. Use normal base64 encoding plus string replace to get base64url output instead of using HttpServerUtility.UrlTokenEncode
A possible solution for that is described here
I'm new here so I can't comment (need 50 reputation), but I would like to add to @jqs's answer that if the string ends with two "=", they need to be replaced with "2". So my replace looks like:
hmacB64.replace(/\//g,'_').replace(/\+/g,'-').replace(/==$/,'2').replace(/=$/,'1')
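The two answers can be combined into one helper. This is only a sketch: the function name toUrlToken is hypothetical, and it assumes that the trailing digit is simply the number of stripped padding characters, which matches the "1" and "2" cases observed above (the "0" case for unpadded input is an assumption that should be verified against the .NET side).

```javascript
// Convert a standard Base64 string into the form observed from UrlTokenEncode:
// '+' -> '-', '/' -> '_', padding removed and replaced by its count.
function toUrlToken(base64) {
  const padding = (base64.match(/=+$/) || [""])[0].length;
  const body = base64.slice(0, base64.length - padding);
  return body.replace(/\+/g, "-").replace(/\//g, "_") + String(padding);
}

const token = toUrlToken("Pn55YBwEH2S2BEM5qlNrq+LMNE8BDdHYwbWKFEHiPZo=");
console.log(token); // Pn55YBwEH2S2BEM5qlNrq-LMNE8BDdHYwbWKFEHiPZo1
```

In node.js the input would typically be the result of hmac.digest("base64") from the question.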

Text encoding that produces legible encodings suitable as Javascript identifiers?

I'm working on a tool that reads arbitrary data files and creates a table out of its data which I then store in a database. I'd like to preserve the column headers. The column headers are already ASCII text (or maybe latin1), but they have characters that aren't valid variable names (e.g., spaces, %), so I need to encode them somehow. I'm looking for an encoding for the column titles that has these properties:
Legible: it would be nice if the encoded text looked as similar as possible to the unencoded text (i.e., for debugging).
Legal identifier: I'd like the encoded text to be a valid JavaScript identifier (ECMA-262 Section 7.6).
Invertible: I'd like to be able to get the exact original text back from the encoded text.
I can think of approaches that work for 2 of the 3 cases, but I don't know how to get all 3. E.g., URL encoding doesn't produce legal identifier names. I think I could transform base64 to be legal, but it isn't legible. What I've got currently just does some substitutions, so it's not invertible.
Efficiency isn't a concern, so if necessary, I could store the encoded and unencoded texts together. The best option I can think of is to use URL encoding and then swap percents for $. I thought there would be better options than this, but I can't find anything. Is there anything better?
This pair of methods relying on Guava's PercentEscaper seems to meet my requirements. Guava doesn't provide an unescaper, but given my simple needs here, I can just use a simple URLDecoder.
private static PercentEscaper escaper = new PercentEscaper('',false)
static String getIdentifier(String str) {
//minimal safe characters, but leaves letters alone, so it's somewhat legible
String escaped = escaper.escape(str);
//javascript identifiers can't start with a digit, and the escaper doesn't know the first
//character has different rules. so prepend a "%3" to encode the digit
if(Character.isDigit(escaped.charAt(0))){
escaped = "%3"+escaped
}
//a percent isn't valid in a javascript identifier, so we'll use _ as our special character
escaped = escaped.replace('%','_');
return escaped;
}
static String invertIdentifier(String str){
String unescaped = str.replace('_','%');
unescaped = URLDecoder.decode(unescaped, "UTF-8");
return unescaped;
}
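For comparison, the same idea can be sketched in plain JavaScript without Guava. The function names toIdentifier and fromIdentifier are hypothetical. The sketch relies on the built-in encodeURIComponent, additionally escapes the punctuation that encodeURIComponent leaves alone (including _ itself, so that every _ in the output is an escape marker), and does not handle the empty string or reserved words such as class.

```javascript
// Encode arbitrary text as a legible, invertible JavaScript identifier.
function toIdentifier(text) {
  let escaped = encodeURIComponent(text)
    // encodeURIComponent leaves these characters unescaped; escape them too.
    .replace(/[-._~!'()*]/g, c => "%" + c.charCodeAt(0).toString(16).toUpperCase());
  // Identifiers can't start with a digit: "%3" + digit is the escape for that digit.
  if (/^[0-9]/.test(escaped)) {
    escaped = "%3" + escaped;
  }
  // '%' is not valid in an identifier, so use '_' as the escape character.
  return escaped.replace(/%/g, "_");
}

// Invert toIdentifier: every '_' is an escape marker.
function fromIdentifier(identifier) {
  return decodeURIComponent(identifier.replace(/_/g, "%"));
}

console.log(toIdentifier("Growth % (2nd qtr)")); // Growth_20_25_20_282nd_20qtr_29
console.log(toIdentifier("2nd col"));            // _32nd_20col
console.log(fromIdentifier("_32nd_20col"));      // 2nd col
```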

How to feed strange characters to javaScript? [duplicate]

I need to insert an Omega (Ω) onto my html page. I am using its HTML escaped code to do that, so I can write &#937; and get Ω. That's all fine and well when I put it into a HTML element; however, when I try to put it into my JS, e.g. var Omega = &#937;, it parses that code as JS and the whole thing doesn't work. Anyone know how to go about this?
I'm guessing that you actually want Omega to be a string containing an uppercase omega? In that case, you can write:
var Omega = '\u03A9';
(Because Ω is the Unicode character with codepoint U+03A9; that is, 03A9 is 937, except written as four hexadecimal digits.)
Edited to add (in 2022): There now exists an alternative form that better supports codepoints above U+FFFF:
let Omega = '\u{03A9}';
let desertIslandEmoji = '\u{1F3DD}';
Judging from https://caniuse.com/mdn-javascript_builtins_string_unicode_code_point_escapes, most or all browsers added support for it in 2015, so it should be reasonably safe to use.
Although @ruakh gave a good answer, I will add some alternatives for completeness:
You could in fact even use var Omega = '&#937;' in JavaScript, but only if your JavaScript code is:
inside an event attribute, as in onclick="var Omega = '&#937;'; alert(Omega)", or
in a script element inside an XHTML (or XHTML + XML) document served with an XML content type.
In these cases, the code is first parsed by an HTML parser (before being passed to the JavaScript interpreter), so character references like &#937; are recognized. The restrictions make this an impractical approach in most cases.
You can also enter the Ω character as such, as in var Omega = 'Ω', but then the character encoding must allow that, the encoding must be properly declared, and you need software that lets you enter such characters. This is a clean solution and quite feasible if you use UTF-8 encoding for everything and are prepared to deal with the issues created by it. Source code will be readable, and reading it, you immediately see the character itself, instead of code notations. On the other hand, it may cause surprises if other people start working with your code.
Using the \u notation, as in var Omega = '\u03A9', works independently of character encoding, and it is in practice almost universal. However, as such it can only be used up to U+FFFF, i.e. up to \uffff, but most characters that most people have ever heard of fall into that range. (If you need “higher” characters, you need to use either surrogate pairs or one of the two approaches above.)
You can also construct a character using the String.fromCharCode() method, passing as a parameter the Unicode number, in decimal as in var Omega = String.fromCharCode(937) or in hexadecimal as in var Omega = String.fromCharCode(0x3A9). This works up to U+FFFF. This approach can be used even when you have the Unicode number in a variable.
One option is to put the character literally in your script, e.g.:
const omega = 'Ω';
This requires that you let the browser know the correct source encoding, see Unicode in JavaScript
However, if you can't or don't want to do this (e.g. because the character is too exotic and can't be expected to be available in the code editor font), the safest option may be to use new-style string escape or String.fromCodePoint:
const omega = '\u{3a9}';
// or:
const omega = String.fromCodePoint(0x3a9);
This is not restricted to a single UTF-16 code unit but works for all Unicode code points. In comparison, the other approaches mentioned here have the following downsides:
HTML escapes (const omega = '&#937';): only work when rendered unescaped in an HTML element
old style string escapes (const omega = '\u03A9';): restricted to code points up to U+FFFF (one UTF-16 code unit per escape)
String.fromCharCode: restricted to code points up to U+FFFF (one UTF-16 code unit per argument)
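To make the U+FFFF limitation concrete, here is a small sketch of how a code point above that limit maps to the surrogate pair that the old-style escapes and String.fromCharCode need. The helper name toSurrogatePair is hypothetical.

```javascript
// Split a code point above U+FFFF into its two UTF-16 code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point does not need a surrogate pair");
  }
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F3DD);
console.log(high.toString(16), low.toString(16));               // d83c dfdd
console.log(String.fromCharCode(high, low) === "\u{1F3DD}");    // true
console.log(String.fromCharCode(0x3A9) === String.fromCodePoint(0x3A9)); // true
```

Below U+10000 both functions agree, which is why the difference goes unnoticed for characters like Ω.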
The answer is correct, but you don't need to declare a variable.
A string can contain your character:
"This string contains omega, that looks like this: \u03A9"
Unfortunately, those ASCII escape codes are still needed for displaying UTF-8 characters. I have been waiting (for too many years...) for the day when UTF-8 will be as universal as ASCII was, and ASCII will be just a remembrance of the past.
I found this question when trying to implement a font-awesome style icon system in html. I have an API that provides me with a hex string and I need to convert it to unicode to match with the font-family.
Say I have the string const code = 'f004'; from my API. I can't do simple string concatenation (const unicode = '\u' + code;) since the system needs to recognize that it's unicode and this will in fact cause a syntax error if you try.
@coldfix mentioned using String.fromCodePoint but it takes a number as an argument, not a string.
To finally cross the finish line, just add parseInt and pass 16 (since hex is base 16) as its second argument. You'll finally get a unicode string from a simple hex string.
This is what I did:
const code = 'f004';
const toUnicode = code => String.fromCodePoint(parseInt(code, 16));
toUnicode(code);
// => '\uf004'
Try using Function(), like this:
var code = "2710"
var char = Function("return '\\u"+code+"';")()
It works well, just do not add any 's or "s or spaces.
In the example, char is "✐".
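The same result can be had without evaluating generated code. The following is a sketch with a hypothetical helper name, hexToChar, that validates the hex string first and then uses String.fromCodePoint, so stray quotes or spaces are rejected instead of being executed.

```javascript
// Convert a hex string such as "2710" into the corresponding character.
function hexToChar(hex) {
  if (!/^[0-9a-fA-F]{1,6}$/.test(hex)) {
    throw new TypeError("not a hex code point: " + hex);
  }
  // String.fromCodePoint throws a RangeError for values above 0x10FFFF.
  return String.fromCodePoint(parseInt(hex, 16));
}

console.log(hexToChar("2710") === "\u2710");     // true
console.log(hexToChar("f004") === "\uf004");     // true
console.log(hexToChar("1F514") === "\u{1F514}"); // true
```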

What is the best way to serialize a JavaScript object into something that can be used as a fragment identifier (url#hash)?

My page state can be described by a JavaScript object that can be serialized into JSON. But I don't think a JSON string is suitable for use in a fragment ID due to, for example, the spaces and double-quotes.
Would encoding the JSON string into a base64 string be sensible, or is there a better way? My goal is to allow the user to bookmark the page and then upon returning to that bookmark, have a piece of JavaScript read window.location.hash and change state accordingly.
I think you are on the right track. Let's write down the requirements:
The encoded string must be usable as hash, i.e. only letters and numbers.
The original value must be possible to restore, i.e. hashing (md5, sha1) is not an option.
It shouldn't be too long, to remain usable.
There should be an implementation in JavaScript, so it can be generated in the browser.
Base64 would be a great solution for that. Only problem: base64 also contains characters like +, / and =, so you win nothing compared to simply attaching a JSON string (which also would have to be URL encoded).
BUT: Luckily, there's a variant of base64 called base64url which is exactly what you need. It is specifically designed for the type of problem you're describing.
However, I was not able to find a JS implementation; maybe you have to write one yourself, or do a bit more research than my half-assed 15 seconds scanning the first 5 Google results.
EDIT: On second thought, I think you don't need to write your own implementation. Use a normal implementation, and simply replace the “forbidden” characters with something you find appropriate for your URLs.
Base64 is an excellent way to store binary data in text. It uses just 33% more characters than the original data has bytes, and mostly uses 0-9, a-z, and A-Z. It also has three other characters, /, = and +, that would need to be percent-encoded to be stored in a URL. If you simply URL-encoded the binary data instead, it could take up to 300% (3x) of the original size.
If you're only storing the characters in the fragment of the URL, the base64-encoded text doesn't need to be re-encoded and will not change. But if you want to send the data as part of the actual URL to visit, then it matters.
As referenced by lxg, there is a base64url variant for that. It is a modified version of base64 that replaces the characters that are unsafe in a URL. Here is how to encode it:
function tobase64url(s) {
  return btoa(s).replace(/\+/g,'-').replace(/\//g,'_').replace(/=/g,'');
}
console.log(tobase64url('\x00\xff\xff\xf1\xf1\xf1\xff\xff\xfe'));
// Returns "AP__8fHx___-" instead of "AP//8fHx///+"
And to decode a base64url string from the URL:
function frombase64url(s) {
  return atob(s.replace(/-/g,'+').replace(/_/g, '/'));
}
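Applied to the original question, a state object could be round-tripped through base64url roughly like this. It is a sketch under assumptions: the function names encodeState and decodeState are hypothetical, and the JSON is converted to UTF-8 bytes first because btoa only accepts characters up to U+00FF. The padding is restored before decoding so that stricter atob implementations also accept the input.

```javascript
// Serialize a JSON-compatible object into a base64url string.
function encodeState(state) {
  const bytes = new TextEncoder().encode(JSON.stringify(state));
  let binary = "";
  for (const b of bytes) binary += String.fromCharCode(b);
  return btoa(binary).replace(/\+/g, "-").replace(/\//g, "_").replace(/=+$/, "");
}

// Restore the object from a base64url string produced by encodeState.
function decodeState(encoded) {
  let base64 = encoded.replace(/-/g, "+").replace(/_/g, "/");
  base64 += "=".repeat((4 - (base64.length % 4)) % 4); // re-add stripped padding
  const binary = atob(base64);
  const bytes = Uint8Array.from(binary, c => c.charCodeAt(0));
  return JSON.parse(new TextDecoder().decode(bytes));
}

const encoded = encodeState({ tab: "search", query: "omega Ω" });
console.log(/^[A-Za-z0-9_-]*$/.test(encoded));  // true
console.log(decodeState(encoded).query);        // omega Ω
```

In a browser, the encoded string would be assigned to window.location.hash and read back from it (minus the leading #) on page load.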
Use encodeURIComponent and decodeURIComponent to serialize data for the fragment (aka hash) part of the URL.
This is safe because the character set output by encodeURIComponent is a subset of the character set allowed in the fragment. Specifically, encodeURIComponent escapes all characters except:
A - Z
a - z
0 - 9
- . _ ~ ! ' ( ) *
So the output includes the above characters, plus escaped characters, which are % followed by hexadecimal digits.
The set of allowed characters in the fragment is:
A - Z
a - z
0 - 9
? / : @ - . _ ~ ! $ & ' ( ) * + , ; =
percent-encoded characters (a % followed by hexadecimal digits)
This set of allowed characters includes all the characters output by encodeURIComponent, plus a few other characters.
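A minimal sketch of this approach follows. The state object and variable names are made up for illustration; in a browser, the hash string would be written to and read from window.location.hash.

```javascript
// Example page state (hypothetical).
const state = { tab: "search", query: "omega Ω" };

// Serialize: JSON, then percent-encode everything outside the safe set.
const hash = "#" + encodeURIComponent(JSON.stringify(state));

// Restore: strip the leading '#', decode, parse.
const restored = JSON.parse(decodeURIComponent(hash.slice(1)));

console.log(hash);
console.log(restored.query === state.query); // true
```

The result is longer than base64url for data with many escaped characters, but it stays partly readable and needs no extra helper code.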
