How to convert unicode emoji into hex codepoint (with multiple groups) - javascript

I'm building an application that takes emoji shortnames (like :flag_cf:) and converts them, through a series of operations, into hex codepoints (which are the keys in a map used to return Twitter emoji/twemoji).
I have a utility (emojione.shortnameToUnicode()) that converts the shortnames into native unicode emoji, but I'm having trouble converting the native unicode emoji into hex codepoints.
I've been using:
const unicode = emojione.shortnameToUnicode(str);
const decCodepoint = unicode.codePointAt(0);
const hexCodepoint = decCodepoint.toString(16);
This works fine when the emoji maps to a single codepoint. However, emoji like flags consist of two: :flag_cn:, for example, is 1f1e8-1f1f3, and my process above only returns the first hex codepoint (namely 1f1e8).
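One way to handle the multi-codepoint case (a sketch, reusing the emojione utility above): iterate the string with the spread operator, which yields whole code points rather than UTF-16 code units, and join the hex values with a hyphen:

const unicode = emojione.shortnameToUnicode(str); // e.g. ":flag_cn:" gives "🇨🇳"
const hexCodepoint = [...unicode]
    .map(ch => ch.codePointAt(0).toString(16))
    .join('-'); // "1f1e8-1f1f3"

This produces the hyphen-joined key style described above for every code point in the emoji, flags included.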

Related

cannot get utf8 icon with charAt js

let a = ["😁"];
let b = "😁";
console.log(a[0], b, b.charAt(0));
"😁"
"😁"
"�"
charAt prints a question mark here ... can someone enlighten me on how to get the emoji out of a string with charAt?
Emojis are represented using multiple UTF-16 code units:
let b = "😁"
console.log(b.length) // 2: the emoji takes two code units
With charAt you only get one part of the emoji.
With codePointAt you can get:
a non-negative integer that is the Unicode code point value at the given position.
However, you need to know where the emoji starts, because the index itself still counts UTF-16 code units:
let b = "😁👆👍"
console.dir(String.fromCodePoint(b.codePointAt(0))); // 😁
console.dir(String.fromCodePoint(b.codePointAt(2))); // 👆, because each emoji here occupies two code units
You could split your string using the spread operator ... and then access the visible character in question by index. This works because the string iterator steps through code points, not individual code units.
let b = "😁test👆test👍"
function splitStringByCodePoint(str) {
  return [...str] // the iterator yields one array entry per code point
}
console.log(splitStringByCodePoint(b)[5]) // 👆
But that won't work with emojis like 👆🏽, because those consist of 👆 plus a separate code point representing the variation (the skin-tone modifier).
let b = "👆🏽"
console.log(b.length) // 4: two code points, each stored as a surrogate pair
function splitStringByCodePoint(str) {
  return [...str]
}
console.log(splitStringByCodePoint(b)) // ["👆", "🏽"]
console.log(String.fromCodePoint(b.codePointAt(0))); // 👆
console.log(String.fromCodePoint(b.codePointAt(2))); // 🏽
If you want to support all emojis you currently need to write your own parser or look for a library that does that.
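For instance, newer runtimes ship Intl.Segmenter, which can split a string into grapheme clusters so that modifier sequences stay together (a sketch, assuming your environment supports grapheme granularity):

function splitGraphemes(str) {
  // Segment by user-perceived character (grapheme cluster)
  const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
  return [...segmenter.segment(str)].map(s => s.segment);
}
console.log(splitGraphemes("👆🏽test👍")); // ["👆🏽", "t", "e", "s", "t", "👍"]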

How to get the correct element from a unicode string?

I want to get specific letters from a Unicode string by index. However, it doesn't work as expected.
Example:
var handwriting = `𝖆𝖇𝖈𝖉𝖊𝖋𝖌𝖍𝖎𝖏𝖐𝖑𝖒𝖓𝖔𝖕𝖖𝖗𝖘𝖙𝖚𝖛𝖜𝖝𝖞𝖟𝕬𝕭𝕮𝕯𝕰𝕱𝕲𝕳𝕴𝕵𝕶𝕷𝕸𝕹𝕺𝕻𝕼𝕽𝕾𝕿𝖀𝖁𝖂𝖃𝖄𝖅1234567890`
var normal = `abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890`
console.log(normal[3]) // gives 'd' but
console.log(handwriting[3]) // gives '�' instead of '𝖉'
Also, length doesn't work as expected: normal.length gives the correct value of 62, but handwriting.length gives 114.
Indexing doesn't work as expected. How can I access the elements of a Unicode string?
I tried this in Python and it works perfectly, but in JavaScript it does not.
I need the exact characters from the Unicode string, e.g. the expected output for index 3 is 'd' and '𝖉'.
In JavaScript, a string is a sequence of 16-bit code units. Since these characters are encoded above the Basic Multilingual Plane, each of them is represented by a pair of code units, also known as a surrogate pair.
The Unicode number of 𝖆 is U+1D586, and 0x1D586 is greater than 0xFFFF, the largest value a single 16-bit code unit can hold. So 𝖆 is represented by a pair of code units, also known as a surrogate pair:
console.log("𝖆".length) // 2
console.log("𝖆" === "\uD835\uDD86") // true
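For the curious, the arithmetic behind that pair is the standard UTF-16 encoding (a small worked sketch):

// Encode U+1D586 as a surrogate pair
const offset = 0x1D586 - 0x10000;       // 0xD586, a 20-bit value
const high = 0xD800 + (offset >> 10);   // 0xD835, from the top 10 bits
const low = 0xDC00 + (offset & 0x3FF);  // 0xDD86, from the bottom 10 bits
console.log(String.fromCharCode(high, low)); // "𝖆"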
One way is to create an array of characters using the spread syntax or Array.from() and then get the index you need:
var handwriting = `𝖆𝖇𝖈𝖉𝖊𝖋𝖌𝖍𝖎𝖏𝖐𝖑𝖒𝖓𝖔𝖕𝖖𝖗𝖘𝖙𝖚𝖛𝖜𝖝𝖞𝖟𝕬𝕭𝕮𝕯𝕰𝕱𝕲𝕳𝕴𝕵𝕶𝕷𝕸𝕹𝕺𝕻𝕼𝕽𝕾𝕿𝖀𝖁𝖂𝖃𝖄𝖅1234567890`
console.log([...handwriting][3]) // 𝖉
console.log(Array.from(handwriting)[3]) // 𝖉
A Unicode escape like '\u00E9' represents a single 16-bit code unit, so when characters need two of them, it is normal that your string is longer than expected.
To get the real length of a Unicode string, you have to convert it to an array:
let charArray = [...handwriting]
console.log(charArray.length) // 62
Each item of the array is a character of your string.
charArray[3] will return the Unicode character corresponding to '𝖉'.

Convert Windows-1252 hex value to Unicode in JavaScript

Let's say I have a string containing the Windows-1252 hex value for a character; I would like to turn it into the appropriate Unicode character:
const asciiHex = '85' // represents hellip
parseInt(asciiHex, 16) // I get 133 as expected
I can't do String.fromCharCode now, as that takes Unicode code points rather than ASCII values (in Unicode, hellip is 8230 decimal). Is anyone aware of any simple conversion?
BTW, I am doing this in Node 6.
You don't mention the input encoding: in which character encoding is \x85 mapped to the horizontal ellipsis? Turns out that's Windows-1252, which Node.js doesn't "speak" out of the box.
A module that can encode/decode it is windows-1252.
To convert your hex code into a (Unicode) JavaScript string:
const windows1252 = require('windows-1252');
let asciiHex = '85';
let text = windows1252.decode(Buffer.from(asciiHex, 'hex').toString('binary'));
console.log(text); // outputs: …
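Alternatively, a runtime with TextDecoder can do this without a third-party module (a sketch, assuming your environment's TextDecoder accepts the 'windows-1252' label, as modern browsers and standard full-ICU Node builds do):

const decoder = new TextDecoder('windows-1252');
console.log(decoder.decode(Buffer.from('85', 'hex'))); // outputs: …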

How to convert unicode \:0936\:093e\:092e to Devanagari script?

I have created a Twitter bot based on Google Apps Script and wolfram|alpha. The bot answers questions the way wolfram|alpha does.
I need to translate a string from English to Devanagari.
I get the result as \:0936\:093e\:092e, which should be converted to "शाम".
Here is a link for more info - https://codepoints.net/U+0936?lang=en
How can I achieve this using Google Apps Script (JavaScript)?
The values in \:0936\:093e\:092e are UTF-16 character codes, but are not expressed in a way that will render the characters you need. If they were, you could use the answer from Expressing UTF-16 unicode characters in JavaScript directly.
Demo
This script extracts the hexadecimal numbers from the given string, then uses the getUnicodeCharacter() function from the linked question to convert each number, or codepoint, into its Unicode character.
function utf16demo() {
  var str = "\:0936\:093e\:092e";
  // Drop the leading ":" and split on the remaining ones, then parse each hex chunk
  var charCodes = str.replace(/\:/, '').split('\:').map(function(st) { return parseInt(st, 16); });
  var newStr = '';
  for (var ch = 0; ch < charCodes.length; ch++) {
    newStr += getUnicodeCharacter(charCodes[ch]);
  }
  Logger.log(newStr);
}
Log
[15-11-03 23:04:16:096 EST] शाम
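The getUnicodeCharacter() function itself comes from the linked question; a minimal version along those lines might look like this (a sketch that does not validate surrogate-range input):

function getUnicodeCharacter(cp) {
  if (cp >= 0 && cp <= 0xFFFF) {
    // BMP codepoint: one UTF-16 code unit
    return String.fromCharCode(cp);
  }
  // Astral codepoint: encode as a surrogate pair
  cp -= 0x10000;
  return String.fromCharCode(0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
}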

How to convert mixed ascii and unicode to a string in javascript?

I have a mixed source of unicode and ascii characters, for example:
var source = "\u5c07\u63a2\u8a0e HTML5 \u53ca\u5176\u4ed6";
How do I convert it to a string by leveraging and extending the uniCodeToString function below, which I wrote in JavaScript? This function can convert pure Unicode escapes to a string.
function uniCodeToString(source) {
  // for example, source = "\u5c07\u63a2\u8a0e"
  var escapedSource = escape(source);
  var codeArray = escapedSource.split("%u");
  var str = "";
  for (var i = 1; i < codeArray.length; i++) {
    str += String.fromCharCode("0x" + codeArray[i]);
  }
  return str;
}
Use encodeURIComponent; escape was never meant for Unicode.
var source = "\u5c07\u63a2\u8a0e HTML5 \u53ca\u5176\u4ed6";
var enc = encodeURIComponent(source)
//returned value: (String)
%E5%B0%87%E6%8E%A2%E8%A8%8E%20HTML5%20%E5%8F%8A%E5%85%B6%E4%BB%96
decodeURIComponent(enc)
//returned value: (String)
將探討 HTML5 及其他
I think you are misunderstanding the purpose of Unicode escape sequences.
var source = "\u5c07\u63a2\u8a0e HTML5 \u53ca\u5176\u4ed6";
JavaScript strings are always Unicode (each code unit is a 16-bit UTF-16 encoded value). The purpose of the escapes is to allow you to describe values that are unsupported by the encoding used to save the source file (e.g. the HTML page or .js file is encoded as ISO-8859-1), or to overcome things like keyboard limitations. This is no different from using \n to indicate a line-feed code point.
The above string ("將探討 HTML5 及其他") is made up of the values 5c07 63a2 8a0e 0020 0048 0054 004d 004c 0035 0020 53ca 5176 4ed6, whether you write the sequence as a literal or in escape sequences.
See the String Literals section of ECMA-262 for more details.
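A quick check illustrating the point: the escaped literal and the directly typed characters compare as identical strings.

var source = "\u5c07\u63a2\u8a0e HTML5 \u53ca\u5176\u4ed6";
console.log(source === "將探討 HTML5 及其他"); // true
console.log(source.charCodeAt(0).toString(16)); // "5c07"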
