How to get unicode name from a string character in JavaScript [duplicate] - javascript

I have the following:
function showUnicode()
{
var text = prompt( 'Enter the wanted text', 'Unicode' ),
unicode = 0,
ntext,
temp,
i = 0
;
// got the text now transform it in unicode
for(i; i < text.length; i++)
{
unicode += text.charCodeAt(i)
}
// now do an alert
alert( 'Here is the unicode:\n' + unicode + '\nof:\n' + text )
}
Thanks for the suggestion to initialize unicode. Now, though, the unicode variable ends up holding the Unicode value of the last character. Why is that?

JavaScript exposes strings as sequences of 16-bit code units (UCS-2/UTF-16).
This means that supplementary Unicode symbols are exposed as two separate code units (the surrogate halves). For example, '𝌆'.length == 2, even though it's only one Unicode character.
Because of this, if you want to get the Unicode code point for every character in a string, you'll need to convert the string of code units into an array of Unicode code points (where each surrogate pair forms a single code point). You could use Punycode.js's utility functions for this:
punycode.ucs2.decode('abc'); // [97, 98, 99]
punycode.ucs2.decode('𝌆'); // [119558]
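On ES2015 and later runtimes the same decoding can be done without a library, because string iteration walks code points rather than code units. A minimal sketch, assuming an ES2015+ environment; `toCodePoints` is a hypothetical helper name, not part of Punycode.js or the language:

```javascript
// Hypothetical helper: list the Unicode code points of a string.
// Array.from iterates the string by code point, so a surrogate pair
// arrives as one two-unit string; codePointAt(0) then reads its full value.
function toCodePoints(str) {
  return Array.from(str, (ch) => ch.codePointAt(0));
}

console.log(toCodePoints('abc'));          // [97, 98, 99]
console.log(toCodePoints('\uD834\uDF06')); // [119558], i.e. U+1D306
```

The second call writes the character as its two surrogate halves to show that they are decoded into a single code point.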

You should initialize the unicode variable to something, or you're adding the char codes to undefined.

NaN = Not a Number
You need to initialize "unicode" as a numeric type:
var unicode = 0

Related

How to get the correct element from a unicode string?

I want to get specific letters from an unicode string using index. However, it doesn't work as expected.
Example:
var handwriting = `๐–†๐–‡๐–ˆ๐–‰๐–Š๐–‹๐–Œ๐–๐–Ž๐–๐–๐–‘๐–’๐–“๐–”๐–•๐––๐–—๐–˜๐–™๐–š๐–›๐–œ๐–๐–ž๐–Ÿ๐•ฌ๐•ญ๐•ฎ๐•ฏ๐•ฐ๐•ฑ๐•ฒ๐•ณ๐•ด๐•ต๐•ถ๐•ท๐•ธ๐•น๐•บ๐•ป๐•ผ๐•ฝ๐•พ๐•ฟ๐–€๐–๐–‚๐–ƒ๐–„๐–…1234567890`
var normal = `abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890`
console.log(normal[3]) // gives 'd' but
console.log(handwriting[3]) // gives '๏ฟฝ' instead of '๐–‰'
also length doesn't work as expected normal.length gives correct value as 62 but handwriting.length gives 114.
Indexing doesn't work as expected. How can I access the elements of unicode array?
I tried this on python it works perfectly but in Javascript it is not working.
I need exact characters from the unicode string like an expected output of 'd' '๐–‰' for index 3
In JavaScript, a string is a sequence of 16-bit code units. Since these characters lie above the Basic Multilingual Plane, each one is represented by a pair of code units, also known as a surrogate pair.
The code point of 𝖆 is U+1D586, and 0x1D586 is greater than 0xFFFF (2^16 - 1), the largest value a single code unit can hold. So 𝖆 is represented by a surrogate pair:
console.log("𝖆".length) // 2
console.log("𝖆" === "\uD835\uDD86") // true
One way is to create an array of characters using the spread syntax or Array.from() and then get the index you need
var handwriting = `𝖆𝖇𝖈𝖉𝖊𝖋𝖌𝖍𝖎𝖏𝖐𝖑𝖒𝖓𝖔𝖕𝖖𝖗𝖘𝖙𝖚𝖛𝖜𝖝𝖞𝖟𝕬𝕭𝕮𝕯𝕰𝕱𝕲𝕳𝕴𝕵𝕶𝕷𝕸𝕹𝕺𝕻𝕼𝕽𝕾𝕿𝖀𝖁𝖂𝖃𝖄𝖅1234567890`
console.log([...handwriting][3])
console.log(Array.from(handwriting)[3])
An escape such as '\u00E9' covers a single 16-bit code unit, but each of these characters takes two code units, so it is normal for length to be larger than the number of visible characters.
To get the real length of a Unicode string, convert it to an array:
let charArray = [...handwriting]
console.log(charArray.length) // 62
Each item of the array is one character of the string.
charArray[3] returns the character '𝖉'.
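If only one character is needed, the full array does not have to be built. The sketch below walks the string with for...of, which iterates by code point; `charAtCodePoint` is a hypothetical helper name, and the sample string is written with \u{...} escapes (ES2015) for the first four bold Fraktur letters:

```javascript
// Hypothetical helper: return the character at a code point index,
// or undefined when the index is out of range.
function charAtCodePoint(str, index) {
  let i = 0;
  // for...of iterates by code point, never splitting a surrogate pair.
  for (const ch of str) {
    if (i === index) return ch;
    i++;
  }
  return undefined;
}

const sample = '\u{1D586}\u{1D587}\u{1D588}\u{1D589}';
console.log(charAtCodePoint(sample, 3) === '\u{1D589}'); // true
console.log(sample.length);                              // 8 (code units)
```

Note that this counts code points, not user-perceived characters, so sequences such as emoji joined with U+200D would still be split into their parts.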

Reveal all non-printing ANSI characters and metacharacters in a Javascript string

I'm receiving piped stdout output from a multitude of fairly random shell processes, all as input (stdin) on a single node.js process. For debugging and for parsing, I need to be able to handle the different special character codes that are being piped into the process. It would really help me to see invisible characters (mostly for debugging) and to deal with them accordingly once I've identified the patterns in which they are used.
Given a javascript string with ANSI special characters \u001b* and/or metacharacters such as \n, \t, \r etc., how can one reveal these special characters so they aren't actually rendered, but rather exposed as their code value instead.
For example, let's say I have the following string printed in green (can't show the green colour on SO):
This is a string.
We are now using the green color.
I would like to be able to do a console.log (for example) on this string and have it replace the non-printing characters, metacharacters/newlines, color codes etc with their ANSI codes:
"\u001b[32m\tThis is a string.\nWe are now using the green color.\n"
I can do something like the following, but it is too specific, hard-coded, and inefficient:
line = line.replace(/[\f]/g, '\\f');
line = line.replace(/\u0008/g, '\\b');
line = line.replace(/\u001b|\u001B/g, '\\u001b');
line = line.replace(/\r\n|\r|\n/g, '\\n');
...
Try this:
var map = { // Special characters
'\\': '\\',
'\n': 'n',
'\r': 'r',
'\t': 't'
};
str = str.replace(/[\\\n\r\t]/g, function(i) {
return '\\'+map[i];
});
str = str.replace(/[^ -~]/g, function(i){
return '\\u'+("000" + i.charCodeAt(0).toString(16)).slice(-4);
});
Here's a version that loops through the string, tests to see if it's a normal printable character and, if not, looks it up in a special table for your own representation of that character and if not found in the table, displays whatever default representation you want:
var tagKeys = {
'\n': 'New Line \n',
'\u0009': 'Tab',
'\u2028': 'Line Separator'
/* and so on */
};
function tagSpecialChars(str) {
var output = "", ch, replacement;
for (var i = 0; i < str.length; i++) {
ch = str.charAt(i);
if (ch < ' ' || ch > '~') {
replacement = tagKeys[ch];
if (replacement) {
ch = replacement;
} else {
// default value
// could also use charCodeAt() to get the numeric value
ch = '*****';
}
}
output += ch;
}
return output;
}
Demo: http://jsfiddle.net/jfriend00/bCYa4/
This is obviously not some fancy regex solution, but you said performance was important and you rarely find the best performing operation using a regex and certainly not if you're going to use a whole bunch of them. Plus every regex replace has to loop through the whole string anyway.
This workman-like solution just loops through the input string once and lets you customize the display conversion for any non-printable character you want and also determine what you want to display when it's a non-printable character that you don't have a special display representation for.
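The two approaches above can also be combined into one pass: a single replace call with a callback, a small table for the common escapes, and a \uXXXX fallback built from charCodeAt. This is only a sketch; `revealInvisible` and `NAMED_ESCAPES` are hypothetical names, and padStart assumes an ES2017+ runtime such as a current Node.js:

```javascript
// Hypothetical lookup table for characters that have a short escape form.
const NAMED_ESCAPES = { '\\': '\\\\', '\n': '\\n', '\r': '\\r', '\t': '\\t' };

// Hypothetical helper: make control and non-ASCII characters visible.
function revealInvisible(str) {
  // Match a backslash, or any code unit outside printable ASCII (space to tilde).
  return str.replace(/\\|[^ -~]/g, (ch) => {
    if (NAMED_ESCAPES[ch]) return NAMED_ESCAPES[ch];
    // Fall back to a four-digit \uXXXX escape of the UTF-16 code unit.
    return '\\u' + ch.charCodeAt(0).toString(16).padStart(4, '0');
  });
}

console.log(revealInvisible('\u001b[32m\tThis is a string.\n'));
// prints: \u001b[32m\tThis is a string.\n
```

Because the regular expression has no u flag, a supplementary character is reported as its two surrogate halves, each with its own \uXXXX escape.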

How can I convert a string into a unicode character?

In JavaScript, '\uXXXX' produces a Unicode character. But how can I get a Unicode character when the XXXX part is a variable?
For example:
var input = '2122';
console.log('\\u' + input); // returns a string: "\u2122"
console.log(new String('\\u' + input)); // returns a string: "\u2122"
The only way I can think of to make it work, is to use eval; yet I hope there's a better solution:
var input = '2122';
var char = '\\u' + input;
console.log(eval("'" + char + "'")); // returns a character: "™"
Use String.fromCharCode() like this: String.fromCharCode(parseInt(input, 16)). When you put a Unicode value in a string using \u, it is interpreted as a hexadecimal value, so you need to specify the base (16) when using parseInt.
String.fromCharCode("0x" + input)
or
String.fromCharCode(parseInt(input, 16)), as they are 16-bit numbers (UTF-16 code units).
JavaScript exposes strings as 16-bit code units (UCS-2/UTF-16).
Thus, String.fromCharCode(codePoint) won't work for supplementary Unicode characters, for example if codePoint is 119558 (0x1D306, the '𝌆' character).
If you want to create a string based on a non-BMP Unicode code point, you could use Punycode.js's utility functions to convert between strings of code units and arrays of code points:
// `String.fromCharCode` replacement that doesn't make you enter the surrogate halves separately
punycode.ucs2.encode([0x1d306]); // '𝌆'
punycode.ucs2.encode([119558]); // '𝌆'
punycode.ucs2.encode([97, 98, 99]); // 'abc'
Since ES2015 (ES6) you can use
String.fromCodePoint(number)
to get Unicode values bigger than 0xFFFF.
So, in every modern browser, you can write it this way when the input is a decimal number:
var input = '8482'; // 8482 is 0x2122 in decimal
console.log(String.fromCodePoint(Number(input)));
or if it is a hex number:
var input = '2122';
console.log(String.fromCodePoint(parseInt(input, 16)));
More info:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/fromCodePoint
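Since parseInt silently stops at the first invalid digit, it can help to validate the hex string before converting it. A minimal sketch, assuming the input is a bare hex string with no prefix; `charFromHex` is a hypothetical helper name:

```javascript
// Hypothetical helper: turn a hex string such as '2122' into a character.
function charFromHex(hex) {
  // Reject anything that is not 1 to 6 hex digits before parsing,
  // because parseInt('21zz', 16) would quietly return 0x21.
  if (!/^[0-9a-fA-F]{1,6}$/.test(hex)) {
    throw new RangeError('Not a hex code point: ' + hex);
  }
  const codePoint = parseInt(hex, 16);
  // String.fromCodePoint itself throws a RangeError above 0x10FFFF.
  return String.fromCodePoint(codePoint);
}

console.log(charFromHex('2122') === '\u2122'); // true (trade mark sign)
console.log(charFromHex('1D306').length);      // 2 (one code point, two code units)
```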
Edit (2021):
fromCodePoint is not just used for bigger numbers, but also to combine Unicode emojis.
For example, to draw a waving hand, you have to write:
String.fromCodePoint(0x1F44B);
But if you want a waving hand with a skin tone, you have to combine it:
String.fromCodePoint(0x1F44B, 0x1F3FC);
Newer emoji can even combine two existing emoji with a zero-width joiner (0x200D), for example a heart and a fire, to create a burning heart:
String.fromCodePoint(0x2764, 0xFE0F, 0x200D, 0x1F525);
32-bit number:
<script>
document.write(String.fromCodePoint(0x1F44B));
</script>
<br>
32-bit number + skin:
<script>
document.write(String.fromCodePoint(0x1F44B, 0x1F3FE));
</script>
<br>
32-bit number + another emoji:
<script>
document.write(String.fromCodePoint(0x2764, 0xFE0F, 0x200D, 0x1F525));
</script>
var hex = '2122';
var char = unescape('%u' + hex); // unescape is deprecated; prefer String.fromCodePoint(parseInt(hex, 16))
console.log(char);
returns "™".

How is a non-breaking space represented in a JavaScript string?

This apparently is not working:
X = $td.text();
if (X == '&nbsp;') {
X = '';
}
Is there something about a non-breaking space or the ampersand that JavaScript doesn't like?
&nbsp; is an HTML entity. When doing .text(), all HTML entities are decoded to their character values.
Instead of comparing using the entity, compare using the actual raw character:
var x = $td.text();
if (x == '\xa0') { // Non-breaking space is char 0xa0 (160 dec)
x = '';
}
Or you can create the character from its character code:
var x = $td.text();
if (x == String.fromCharCode(160)) { // Non-breaking space is char 160
x = '';
}
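When the goal is simply to treat such a cell as empty, comparing against one exact character can be avoided. The sketch below relies on String.prototype.trim, which treats U+00A0 as whitespace; `isBlankCell` and `NBSP` are hypothetical names, and the text is assumed to have been read already (for example with .text()):

```javascript
const NBSP = '\u00A0'; // same character as '\xa0' and String.fromCharCode(160)

// Hypothetical helper: true when the text is empty or only whitespace.
// trim() removes U+00A0 along with ordinary spaces, tabs and newlines.
function isBlankCell(text) {
  return text.trim() === '';
}

console.log(isBlankCell(NBSP));                 // true
console.log(isBlankCell(NBSP + 'x'));           // false
console.log(NBSP === String.fromCharCode(160)); // true
```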
More information about String.fromCharCode is available here:
fromCharCode - MDC Doc Center
More information about character codes for different charsets are available here:
Windows-1252 Charset
UTF-8 Charset
Remember that .text() strips out markup, thus I don't believe you're going to find &nbsp; in a non-markup result.
Made into an answer:
var p = $('<p>').html('&nbsp;');
if (p.text() == String.fromCharCode(160) && p.text() == '\xA0')
alert('Character 160');
Shows an alert, as the character the entity represents is returned instead of the markup.
That entity is converted to the char it represents when the browser renders the page. JS (jQuery) reads the rendered page, thus it will not encounter such a text sequence. The only way it could encounter such a thing is if you're double encoding entities.
The jQuery docs for text() say: "Due to variations in the HTML parsers in different browsers, the text returned may vary in newlines and other white space."
I'd use $td.html() instead.

What is an easy way to call Asc() and Chr() in JavaScript for Unicode values?

I am not that familiar with Javascript, and am looking for the function that returns the UNICODE value of a character, and given the UNICODE value, returns the string equivalent. I'm sure there is something simple, but I don't see it.
Example:
ASC("A") = 65
CHR(65) = "A"
ASC("ਔ") = 2580
CHR(2580) = "ਔ"
Have a look at:
String.fromCharCode(64)
and
"A".charCodeAt(0)
The first must be called on the String class (literally String.fromCharCode...) and will return "@" (for 64). The second should be run on a String instance (e.g., "@@@".charCodeAt...) and returns the Unicode code of the first character (the '0' is a position within the string; you can get the codes for other characters in the string by changing that to another number).
The script snippet:
document.write("Unicode for character ਔ is: " + "ਔ".charCodeAt(0) + "<br />");
document.write("Character 2580 is " + String.fromCharCode(2580) + "<br />");
gives:
Unicode for character ਔ is: 2580
Character 2580 is ਔ
Because JavaScript exposes strings as 16-bit code units (UCS-2/UTF-16), String.fromCharCode(codePoint) won't work for supplementary Unicode characters, for example if codePoint is 119558 (0x1D306, the '𝌆' character).
If you want to create a string based on a non-BMP Unicode code point, you could use Punycode.js's utility functions to convert between strings of code units and arrays of code points:
// `String.fromCharCode` replacement that doesn't make you enter the surrogate halves separately
punycode.ucs2.encode([0x1d306]); // '𝌆'
punycode.ucs2.encode([119558]); // '𝌆'
punycode.ucs2.encode([97, 98, 99]); // 'abc'
If you want to get the Unicode code point for every character in a string, you'll need to convert the string of code units into an array of code points (where each surrogate pair forms a single code point). You could use Punycode.js's utility functions for this:
punycode.ucs2.decode('abc'); // [97, 98, 99]
punycode.ucs2.decode('𝌆'); // [119558]
Example for generating an alphabet array:
const arr = [];
// 26 letters, starting from the char code of 'A'
for (let i = 0; i < 26; i++) {
arr.push(String.fromCharCode('A'.charCodeAt(0) + i));
}
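For the original ASC()/CHR() question, the code-point-aware built-ins give the closest equivalents on ES2015+ runtimes. A minimal sketch; `asc` and `chr` are hypothetical wrapper names chosen to mirror the question, not existing JavaScript functions:

```javascript
// Hypothetical equivalent of ASC(): code point of the first character.
function asc(str) {
  // codePointAt(0) reads a whole surrogate pair when one starts at index 0,
  // and returns undefined for an empty string.
  return str.codePointAt(0);
}

// Hypothetical equivalent of CHR(): character for a code point.
function chr(codePoint) {
  return String.fromCodePoint(codePoint);
}

console.log(asc('A'));          // 65
console.log(chr(65));           // "A"
console.log(asc(chr(2580)));    // 2580
console.log(asc(chr(0x1D306))); // 119558
```

Unlike charCodeAt and String.fromCharCode, these round-trip values above 0xFFFF without handling surrogate halves by hand.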
