Convert Windows-1252 hex value to Unicode in JavaScript

Let's say I have a string containing the Windows-1252 hex value for a character, and I would like to turn it into the appropriate Unicode character:
const asciiHex = '85' //represents hellip
parseInt(asciiHex, 16) //I get 133 as expected
I can't use String.fromCharCode here, as it takes Unicode code points rather than ASCII/Windows-1252 codes (in Unicode, hellip is 8230 decimal). Is anyone aware of a simple conversion?
btw I am doing this in node 6

You don't mention the input encoding: in which character encoding is \x85 mapped to the horizontal ellipsis? Turns out that's Windows-1252, which Node.js doesn't "speak" out of the box.
A module that can encode/decode it is windows-1252.
To convert your hex code to the corresponding Unicode string:
const windows1252 = require('windows-1252');
let asciiHex = '85';
// hex pair → one-byte Buffer → 'binary' (latin1) string → Windows-1252 decode
let text = windows1252.decode( Buffer.from(asciiHex, 'hex').toString('binary') );
console.log( text ); // outputs: …
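If you already depend on iconv-lite elsewhere, it can decode Windows-1252 straight from the byte Buffer. A minimal sketch, assuming the iconv-lite package is installed:
const iconv = require('iconv-lite');
let asciiHex = '85';
// decode the raw byte 0x85 as Windows-1252 ('win1252' is an iconv-lite alias)
let text = iconv.decode(Buffer.from(asciiHex, 'hex'), 'win1252');
console.log(text); // outputs: …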

Related

Why won't window.btoa work on – ” characters in Javascript?

So I'm converting a string to BASE64 as shown in the code below...
var str = "Hello World";
var enc = window.btoa(str);
This yields SGVsbG8gV29ybGQ=. However, if I add characters such as – and ”, as in the code shown below, the conversion fails. What is the reason behind this? Thank you so much.
var str = "Hello – World”";
var enc = window.btoa(str);
btoa is an exotic function in that it requires a "binary string", an 8-bit-clean string format. It doesn't work with Unicode values above char code 255, such as your en dash and "fancy" quote symbol.
You'll either have to turn your string into a new string that conforms to single-byte packing (and then manually reconstitute the result of the associated atob), or you can URI-encode the data first, making it safe:
> var str = `Hello – World`;
> window.btoa(encodeURIComponent(str));
"SGVsbG8lMjAlRTIlODAlOTMlMjBXb3JsZA=="
And then remember to decode it again when unpacking:
> var base64= "SGVsbG8lMjAlRTIlODAlOTMlMjBXb3JsZA==";
> decodeURIComponent(window.atob(base64));
"Hello – World"
The problem is that the character ” lies outside the Latin-1 range.
For this you can use unescape (now deprecated):
var str = "Hello – World”";
var enc = btoa(unescape(encodeURIComponent(str)));
alert(enc);
And to decode:
var encStr = "SGVsbG8g4oCTIFdvcmxk4oCd";
var dec = decodeURIComponent(escape(window.atob(encStr)))
alert(dec);
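Since escape and unescape are deprecated, here is a sketch of the same round trip built on TextEncoder/TextDecoder instead (assuming a runtime that supports them):
function b64EncodeUnicode(str) {
    // UTF-8 encode, then map each byte to one code unit so btoa accepts it
    var bytes = new TextEncoder().encode(str);
    var bin = '';
    for (var i = 0; i < bytes.length; i++) bin += String.fromCharCode(bytes[i]);
    return btoa(bin);
}
function b64DecodeUnicode(b64) {
    // reverse: binary string back to bytes, then UTF-8 decode
    var bin = atob(b64);
    var bytes = Uint8Array.from(bin, function (c) { return c.charCodeAt(0); });
    return new TextDecoder().decode(bytes);
}
b64EncodeUnicode("Hello – World”"); // "SGVsbG8g4oCTIFdvcmxk4oCd"
b64DecodeUnicode("SGVsbG8g4oCTIFdvcmxk4oCd"); // "Hello – World”"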
This ultimately owes to a deficiency in the JavaScript type system.
JavaScript strings are strings of 16-bit code units, which are customarily interpreted as UTF-16. The Base64 encoding is a method of transforming an 8-bit byte stream into a string of digits, by taking each three bytes and mapping them into four digits, each covering 6 bits: 3 × 8 = 4 × 6. As we see, this is crucially dependent on the bit width of each symbol.
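A worked example of that regrouping, using the classic "Man" sample:
// 'M' = 01001101, 'a' = 01100001, 'n' = 01101110  (three 8-bit bytes)
// regrouped into four 6-bit digits: 010011 010110 000101 101110
// = 19, 22, 5, 46 = 'T', 'W', 'F', 'u' in the Base64 alphabet
btoa('Man'); // "TWFu"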
At the time the btoa function was defined, JavaScript had no type for 8-bit byte streams, so the API was defined to take the ordinary 16-bit string type as input, with the restriction that each code unit was supposed to be confined to the range [U+0000, U+00FF]; when encoded into ISO-8859-1, such a string would reproduce the intended byte stream exactly.
The character – is U+2013, while ” is U+201D; neither of those characters fits into the above-mentioned range, so the function rejects the input.
If you want to convert Unicode text into Base64, you need to pick a character encoding and convert it into a byte string first, and encode that. Asking for a Base64 encoding of a Unicode string itself is meaningless.
The most bulletproof way is to work on the binary data directly.
For this, you can encode your string to an ArrayBuffer object representing the UTF-8 version of your string.
Then a FileReader instance will be able to give you the base64 quite easily.
var str = "Hello – World”";
var buf = new TextEncoder().encode( str );
var reader = new FileReader();
reader.onload = evt => { console.log( reader.result.split(',')[1] ); };
reader.readAsDataURL( new Blob([buf]) );
And since the Blob() constructor automagically encodes DOMString instances to UTF-8, we can even get rid of the TextEncoder object:
var str = "Hello – World”";
var reader = new FileReader();
reader.onload = evt => { console.log( reader.result.split(',')[1] ); };
reader.readAsDataURL( new Blob([str]) );
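Since FileReader is asynchronous, a small Promise wrapper can make this easier to consume (a sketch; the function name is illustrative):
function blobToBase64(str) {
    return new Promise(function (resolve) {
        var reader = new FileReader();
        reader.onload = function () { resolve(reader.result.split(',')[1]); };
        reader.readAsDataURL(new Blob([str]));
    });
}
blobToBase64("Hello – World”").then(function (b64) {
    console.log(b64); // "SGVsbG8g4oCTIFdvcmxk4oCd"
});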

How to convert unicode emoji into hex codepoint (with multiple groups)

I'm building an application that takes emoji shortnames (like :flag_cf:) and converts them, through a series of operations, into hex codepoints (which are the keys in a map used to return Twitter emoji/twemoji).
I have a utility (emojione.shortnameToUnicode()) that converts the shortnames into native unicode emoji, but I'm having trouble with converting the native unicode emoji into hex codepoints.
I've been using:
const unicode = emojione.shortnameToUnicode(str);
const decCodepoint = unicode.codePointAt(0);
const hexCodepoint = decCodepoint.toString(16);
This works fine when the resulting hex codepoint is a single value. However, emoji like flags seem to have two; :flag_cn:, for example, is 1f1f9-1f1f7, but my process above only returns the first hex codepoint (namely 1f1f9).
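A minimal sketch of one way to handle this: JavaScript's string iterator walks a string code point by code point (surrogate pairs included), so Array.from can split the emoji before converting each part (the function name is illustrative):
// convert every code point in the string to hex, joined with '-'
function toHexCodepoints(str) {
    return Array.from(str)
        .map(function (ch) { return ch.codePointAt(0).toString(16); })
        .join('-');
}
toHexCodepoints('\u{1F1F9}\u{1F1F7}'); // "1f1f9-1f1f7"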

Converting unicode to currency symbol in javascript

I am working with currency symbols in Appcelerator, building apps for Android and iOS. I want to make many parameters dynamic, so I am passing this value (u20b9) through the API to the app. I can't pass the value like this (\u20b9) for various reasons, so I am passing it without the slash.
When I use the code below, it works properly:
var unicode = '\u20b9';
alert(unicode);
Output:- ₹
When I use the code below:
var unicode = '\\'+'u20b9';
alert(unicode);
Output:- \u20b9
Because of this, instead of ₹ it prints \u20b9 everywhere, which I don't want.
Thanks in advance.
The following works for me:
console.log(String.fromCharCode(0x20aa)); // ₪ - Israeli Shekel
console.log(String.fromCharCode(0x24)); // $ - US Dollar
console.log(String.fromCharCode(0x20b9)); // ₹ - Indian Rupee
alert(String.fromCharCode(0x20aa) + "\n" + String.fromCharCode(0x24) + "\n" + String.fromCharCode(0x20b9));
As far as I understand, you need to pass string values of Unicode characters via an API. Obviously you can't use the string code without the slash, because that makes it an invalid escape, and if you include the slash, the value is converted to the character before it is sent. So what you can do here is pass the string without the slash and the 'u' character, and then parse the remaining characters as hexadecimal.
See the following code snippet:
// this won't work as you have included 'u' which is not a hexadecimal character
var unicode = 'u20b9';
String.fromCharCode(parseInt(unicode, 16));
// It WORKS! as the string now has only hexadecimal characters
var unicode = '20b9';
String.fromCharCode( parseInt(unicode, 16) ); // yields the rupee character (16 tells parseInt the string is hexadecimal)
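If the value can still arrive with the leading 'u' (as in the question), a hedged helper can strip it first (the function name is illustrative):
// accept either 'u20b9' or '20b9' and return the symbol
function symbolFromHex(code) {
    var hex = code.replace(/^u/, ''); // drop a leading 'u' if present
    return String.fromCharCode(parseInt(hex, 16));
}
symbolFromHex('u20b9'); // ₹
symbolFromHex('20b9');  // ₹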
I hope it solves your query!

How can I convert a string into a unicode character?

In JavaScript, '\uXXXX' produces a Unicode character. But how can I get a Unicode character when the XXXX part is a variable?
For example:
var input = '2122';
console.log('\\u' + input); // returns a string: "\u2122"
console.log(new String('\\u' + input)); // returns a string: "\u2122"
The only way I can think of to make it work is to use eval, yet I hope there's a better solution:
var input = '2122';
var char = '\\u' + input;
console.log(eval("'" + char + "'")); // returns a character: "™"
Use String.fromCharCode() like this: String.fromCharCode(parseInt(input,16)). When you put a Unicode value in a string using \u, it is interpreted as a hexadecimal value, so you need to specify the base (16) when using parseInt.
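For example, with the input from the question:
var input = '2122';
String.fromCharCode(parseInt(input, 16)); // "™"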
String.fromCharCode("0x" + input)
or
String.fromCharCode(parseInt(input, 16)) as they are 16-bit numbers (UTF-16).
JavaScript uses UCS-2 internally.
Thus, String.fromCharCode(codePoint) won't work for supplementary Unicode characters, for example when codePoint is 119558 (0x1D306, the '𝌆' character).
If you want to create a string based on a non-BMP Unicode code point, you could use Punycode.js's utility functions to convert between UCS-2 strings and Unicode code points:
// `String.fromCharCode` replacement that doesn’t make you enter the surrogate halves separately
punycode.ucs2.encode([0x1d306]); // '𝌆'
punycode.ucs2.encode([119558]); // '𝌆'
punycode.ucs2.encode([97, 98, 99]); // 'abc'
Since ES2015 (ES6) you can use
String.fromCodePoint(number)
to get Unicode values bigger than 0xFFFF.
So, in every modern browser, you can write it this way:
var input = '2122';
console.log(String.fromCodePoint(input));
or if it is a hex number:
var input = '2122';
console.log(String.fromCodePoint(parseInt(input, 16)));
More info:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/fromCodePoint
Edit (2021):
fromCodePoint is not just used for bigger numbers, but also to combine Unicode emojis.
For example, to draw a waving hand, you have to write:
String.fromCodePoint(0x1F44B);
But if you want a waving hand with a skin tone, you have to combine it:
String.fromCodePoint(0x1F44B, 0x1F3FC);
On newer platforms you can even combine two emoji to create a new one, for example a heart and fire to create a burning heart:
String.fromCodePoint(0x2764, 0xFE0F, 0x200D, 0x1F525);
32-bit number:
<script>
document.write(String.fromCodePoint(0x1F44B));
</script>
<br>
32-bit number + skin:
<script>
document.write(String.fromCodePoint(0x1F44B, 0x1F3FE));
</script>
<br>
32-bit number + another emoji:
<script>
document.write(String.fromCodePoint(0x2764, 0xFE0F, 0x200D, 0x1F525));
</script>
var hex = '2122';
var char = unescape('%u' + hex);
console.log(char);
This logs "™" (note that unescape is deprecated).

How to convert mixed ascii and unicode to a string in javascript?

I have a mixed source of unicode and ascii characters, for example:
var source = "\u5c07\u63a2\u8a0e HTML5 \u53ca\u5176\u4ed6";
How do I convert it to a string by extending the uniCodeToString function below, which I wrote myself in JavaScript? This function can convert pure Unicode escapes to a string.
function uniCodeToString(source){
    // for example, source = "\u5c07\u63a2\u8a0e"
    var escapedSource = escape(source);
    var codeArray = escapedSource.split("%u");
    var str = "";
    for(var i = 1; i < codeArray.length; i++){
        str += String.fromCharCode("0x" + codeArray[i]);
    }
    return str;
}
Use encodeURIComponent; escape was never meant for Unicode.
var source = "\u5c07\u63a2\u8a0e HTML5 \u53ca\u5176\u4ed6";
var enc = encodeURIComponent(source);
// enc: "%E5%B0%87%E6%8E%A2%E8%A8%8E%20HTML5%20%E5%8F%8A%E5%85%B6%E4%BB%96"
decodeURIComponent(enc);
// returns: "將探討 HTML5 及其他"
I think you are misunderstanding the purpose of Unicode escape sequences.
var source = "\u5c07\u63a2\u8a0e HTML5 \u53ca\u5176\u4ed6";
JavaScript strings are always Unicode (each code unit is a 16 bit UTF-16 encoded value.) The purpose of the escapes is to allow you to describe values that are unsupported by the encoding used to save the source file (e.g. the HTML page or .JS file is encoded as ISO-8859-1) or to overcome things like keyboard limitations. This is no different to using \n to indicate a linefeed code point.
The above string ("將探討 HTML5 及其他") is made up of the values 5c07 63a2 8a0e 0020 0048 0054 004d 004c 0035 0020 53ca 5176 4ed6 whether you write the sequence as a literal or in escape sequences.
See the String Literals section of ECMA-262 for more details.
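You can verify that the two spellings produce the identical string:
"\u5c07\u63a2\u8a0e HTML5 \u53ca\u5176\u4ed6" === "將探討 HTML5 及其他"; // true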
