Javascript: Convert Unicode Character to hex string [duplicate]

I'm using a barcode scanner to read a barcode on my website (the website is made in OpenUI5).
The scanner works like a keyboard that types the characters it reads. At the end and the beginning of the typing it uses a special character. These characters are different for every type of scanner.
Some possible characters are:
█
▄
–
—
In my code I use if (oModelScanner.oData.scanning && oEvent.key == "\u2584") to check if the input from the scanner is ▄.
Is there any way to get the code from that character in the \uHHHH style? (with the HHHH being the hexadecimal code for the character)
I tried charCodeAt, but it returns the decimal code.
The codePointAt examples I found also turn the character into a decimal code, so I need the reverse of that.

JavaScript strings have a codePointAt method, which returns the integer value of the Unicode code point at a given index. To format that integer as a sequence of four hexadecimal digits, convert it to base 16 and pad it with leading zeros (as in the answer by Nikolay Spasov):
var hex = "▄".codePointAt(0).toString(16);
var result = "\\u" + "0000".substring(0, 4 - hex.length) + hex;
However, it would probably be easier to check directly whether the key's code point matches the expected code point:
oEvent.key.codePointAt(0) === '▄'.codePointAt(0);
Note that "symbol equality" can actually be trickier: some symbols are encoded as surrogate pairs (you can think of a pair as two halves, each written as a sequence of four hexadecimal digits).
For this reason I would recommend using a specialized library.
You'll find more details in the very relevant article by Mathias Bynens.
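As a minimal sketch of the formatting step described above, the helper below (toUnicodeEscape is a hypothetical name, not part of any library) pads with padStart and falls back to the \u{...} form for code points above U+FFFF, which do not fit in four hex digits:

```javascript
// Hypothetical helper: format the first code point of `ch` as an escape sequence.
function toUnicodeEscape(ch) {
  const codePoint = ch.codePointAt(0);
  const hex = codePoint.toString(16).toUpperCase();
  // Code points up to U+FFFF fit the classic \uHHHH form; larger ones need \u{...}.
  return codePoint <= 0xffff
    ? "\\u" + hex.padStart(4, "0")
    : "\\u{" + hex + "}";
}

console.log(toUnicodeEscape("▄")); // \u2584
console.log(toUnicodeEscape("A")); // \u0041
console.log(toUnicodeEscape("😀")); // \u{1F600}
```

The result is only a display string; for the comparison in the question, comparing oEvent.key with the character itself is still the simpler route.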

var hex = "▄".charCodeAt(0).toString(16);
var result = "\\u" + "0000".substring(0, 4 - hex.length) + hex;

If you want to print the multiple code points of a character, e.g., an emoji, you can do this:
const facepalm = "🤦🏼‍♂️";
const codePoints = Array.from(facepalm)
.map((v) => v.codePointAt(0).toString(16))
.map((hex) => "\\u{" + hex + "}");
console.log(codePoints);
["\u{1f926}", "\u{1f3fc}", "\u{200d}", "\u{2642}", "\u{fe0f}"]
If you are wondering about the components and the length of 🤦🏼‍♂️, check out this article.

Related

When is an umlaut not an umlaut (u-umlaut maps to charCode 117)

> let uValFromStr = "Würzburg".charCodeAt(1)
undefined
> uValFromStr
117
> String.fromCharCode(252)
'ü'
> "Würzburg".charCodeAt(1) === String.fromCharCode(252)
false
>
We have a case where an umlaut in a string is failing a simple string comparison test because its value is actually mapped to charCode 117. The u-umlaut should be mapped to charCode 252. Note the first two lines where we extract the character's charCode. So when this occurs, a user enters a text string matching the first three characters and the match fails as code is evaluating 117===252.
Any ideas as to how this can occur? We have numerous use cases with umlauts in our data which work correctly so it is not an endemic issue but rather one that is particular to this input only (so far).

The ü in that specific "Würzburg" string is written using the Unicode code point for u (U+0075) followed by an umlaut combining mark (U+0308) which modifies it, but the ü you're comparing it to is written with the single Unicode code point for u-with-umlaut (U+00FC). Nearly all of JavaScript's string handling is quite naive, which is why they aren't equal. This naive (but fast!) nature has two parts: 1) It doesn't know about combining marks, which is why "Würzburg".length is 9 instead of 8 (if the ü is written using U+0075 and U+0308); and 2) JavaScript "characters" are actually UTF-16 code units, which may be only half of a code point ("😊".length is 2, for instance, because although it's a single Unicode code point (U+1F60A), it requires two code units to be expressed in UTF-16). (One can argue that JavaScript strings are UCS-2 because they tolerate invalid surrogate pairs [pairs of code units that, taken together, describe a code point], but the spec says "...each element in the String is treated as a UTF-16 code unit value...")
You can solve this problem with comparing those two umlauted u's by using normalization, via JavaScript's (relatively new) normalize method:
const word = "Würzburg";
// Iteration moves through the string by code points, not code units
for (const ch of word) {
console.log(`${ch} = ${ch.codePointAt(0)}`);
}
const char = String.fromCharCode(252);
const normalizedWord = word.normalize();
const normalizedChar = char.normalize();
// Using iteration to grab the second "character" (code point) from the string
const [, secondCharOfWord] = normalizedWord;
console.log(normalizedChar === secondCharOfWord); // true
In that example we use the default normalization ("NFC," Normalization Form C), which prefers specific code points to combining marks, so the normalized version of the word uses u-with-umlaut code point U+00FC. There are other normalization forms available by passing an argument to normalize (such as Normalization Form D, which prefers combining marks to specific character code points), but the default is usually the one you want.
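To make the difference between the two forms concrete, here is a small sketch. The strings are written with explicit escapes so the composed and decomposed spellings cannot be confused, and equalsNormalized is a hypothetical helper name:

```javascript
// Hypothetical helper: compare two strings after normalizing both to NFC.
function equalsNormalized(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

const decomposed = "Wu\u0308rzburg"; // u (U+0075) + combining diaeresis (U+0308)
const composed = "W\u00FCrzburg"; // single code point U+00FC

console.log(decomposed.length, composed.length); // 9 8
console.log(decomposed === composed); // false
console.log(equalsNormalized(decomposed, composed)); // true
console.log(composed.normalize("NFD").length); // 9
```

Normalizing both sides before comparing avoids having to know which spelling a given input used.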

How to generate a Shift_JIS(SJIS) percent encoded string in JavaScript

I'm new to both JavaScript and Google Apps Script, and I'm having trouble converting text written in a cell to Shift-JIS (SJIS) encoded characters.
For example, the Japanese string "あいう" should be encoded as "%82%A0%82%A2%82%A4" not as "%E3%81%82%E3%81%84%E3%81%86" which is UTF-8 encoded.
I tried EncodingJS and the built-in urlencode() function, but both return the UTF-8 encoded string.
Could anyone tell me how to get the SJIS-encoded characters properly in GAS? Thank you.

You want to do the URL encode from あいう to %82%A0%82%A2%82%A4 as Shift-JIS of the character set.
%E3%81%82%E3%81%84%E3%81%86 is the result converted as UTF-8.
You want to achieve this using Google Apps Script.
If my understanding is correct, how about this answer? Please think of this as just one of several possible answers.
Points of this answer:
To work with the Shift-JIS character set in Google Apps Script, the data has to be handled as binary data. This is because when a Shift-JIS value is retrieved as a string in Google Apps Script, the character set is automatically changed to UTF-8. Please be careful about this.
Sample script 1:
In order to convert from あいう to %82%A0%82%A2%82%A4, how about the following script? In this case, this script can be used for HIRAGANA characters.
function muFunction() {
var str = "あいう";
var bytes = Utilities.newBlob("").setDataFromString(str, "Shift_JIS").getBytes();
var res = bytes.map(function(byte) {return "%" + ("0" + (byte & 0xFF).toString(16)).slice(-2)}).join("").toUpperCase();
Logger.log(res)
}
Result:
You can see the following result at the log.
%82%A0%82%A2%82%A4
Sample script 2:
If you want to convert the values including the KANJI characters, how about the following script? In this case, 本日は晴天なり is converted to %96%7B%93%FA%82%CD%90%B0%93V%82%C8%82%E8.
function muFunction() {
var str = "本日は晴天なり";
var conv = Utilities.newBlob("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz*-.#_").getBytes().map(function(e) {return ("0" + (e & 0xFF).toString(16)).slice(-2)});
var bytes = Utilities.newBlob("").setDataFromString(str, "Shift_JIS").getBytes();
var res = bytes.map(function(byte) {
var n = ("0" + (byte & 0xFF).toString(16)).slice(-2);
return conv.indexOf(n) != -1 ? String.fromCharCode(parseInt(n[0], 16).toString(2).length == 4 ? parseInt(n, 16) - 256 : parseInt(n, 16)) : ("%" + n).toUpperCase();
}).join("");
Logger.log(res)
}
Result:
You can see the following result at the log.
%96%7B%93%FA%82%CD%90%B0%93V%82%C8%82%E8
When 本日は晴天なり is converted with sample script 1, it becomes %96%7B%93%FA%82%CD%90%B0%93%56%82%C8%82%E8. This can also be decoded. But it seems that the result produced by sample script 2 is the form that is generally used.
Flow:
The flow of this script is as follows.
Create new blob as the empty data.
Put the text value あいう into the blob. At that point, the text is stored using the Shift-JIS character set.
In this case, even when blob.getDataAsString("Shift_JIS") is used, the result becomes UTF-8. So the blob has to be used as binary data without converting it to string data. This is the important point in this answer.
Convert the blob to a byte array.
Convert the byte array from signed values to unsigned hexadecimal.
In Google Apps Script, the byte array holds signed values. So it is required to convert them to unsigned hexadecimal.
For KANJI characters, when one of the 2 bytes of a character corresponds to an ASCII character, that byte has to be written as the ASCII character itself. The script of "Sample script 2" can be used for this situation.
At above sample, 天 becomes %93V.
Add % to the top character of each byte.
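The last steps of this flow (signed bytes to unsigned hexadecimal, then adding %) can be sketched in plain JavaScript without the Apps Script services. This is only an illustration: percentEncodeBytes is a hypothetical helper, the signed byte arrays are hand-written to correspond to the percent-encoded results shown above, and the set of characters kept as literals is an assumption rather than the exact list used in sample script 2.

```javascript
// Hypothetical helper: percent-encode an array of signed bytes (-128..127),
// such as the one returned by getBytes() in Apps Script.
function percentEncodeBytes(signedBytes, keepUnreserved = false) {
  // Assumed set of characters that are emitted as literals instead of %XX.
  const unreserved = /^[0-9A-Za-z*\-._]$/;
  return signedBytes
    .map((b) => {
      const unsigned = b & 0xff; // signed -> unsigned (0..255)
      const ch = String.fromCharCode(unsigned);
      if (keepUnreserved && unreserved.test(ch)) return ch;
      return "%" + unsigned.toString(16).toUpperCase().padStart(2, "0");
    })
    .join("");
}

// Shift_JIS bytes of "あいう" as signed values (0x82 0xA0 0x82 0xA2 0x82 0xA4).
const hiragana = [-126, -96, -126, -94, -126, -92];
console.log(percentEncodeBytes(hiragana)); // %82%A0%82%A2%82%A4

// Shift_JIS bytes of "本日は晴天なり" as signed values.
const kanji = [-106, 123, -109, -6, -126, -51, -112, -80, -109, 86, -126, -56, -126, -24];
console.log(percentEncodeBytes(kanji, true)); // %96%7B%93%FA%82%CD%90%B0%93V%82%C8%82%E8
```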
References:
newBlob(data)
setDataFromString(string, charset)
getBytes()
map()
If I misunderstood your question and this was not the direction you want, I apologize.

Let libraries do the hard work! EncodingJS, which you mentioned, can produce URL-encoded Shift-JIS strings from ordinary String objects.
Loading the library in Apps Script is a bit tricky, but nonetheless possible as demonstrated in this answer:
/**
* Specific to Apps Script. See:
* https://stackoverflow.com/a/33315754/13301046
*
* You can instead use <script>, import or require()
* depending on the environment the code runs in.
*/
eval(UrlFetchApp.fetch('https://cdnjs.cloudflare.com/ajax/libs/encoding-japanese/2.0.0/encoding.js').getContentText());
URL encoding is achieved as follows:
function muFunction() {
const utfString = '本日は晴天なり';
const sjisArray = Encoding.convert(utfString, {
to: 'SJIS',
from: 'UNICODE'
})
const sjisUrlEncoded = Encoding.urlEncode(sjisArray)
Logger.log(sjisUrlEncoded)
}
This emits a URL-encoded Shift-JIS string to the log:
'%96%7B%93%FA%82%CD%90%B0%93V%82%C8%82%E8'

What is a surrogate pair?

I came across this code in a javascript open source project.
validator.isLength = function (str, min, max) {
// match surrogate pairs in string or declare an empty array if none found in string
var surrogatePairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g) || [];
// subtract the surrogate pairs string length from main string length
var len = str.length - surrogatePairs.length;
// now compare string length with min and max ... also make sure max is defined(in other words, max param is optional for function)
return len >= min && (typeof max === 'undefined' || len <= max);
};
As far as I understand, the above code is checking the length of the string but not taking the surrogate pairs into account. So:
Is my understanding of the code correct?
What are surrogate pairs?
I have thus far only figured out that this is related to encoding.

Yes. Your understanding is correct. The function checks the length of the string in Unicode code points.
JavaScript uses UTF-16 to encode its strings. This means 16-bit (two-byte) code units are used to represent Unicode code points.
Some characters in Unicode (such as emoji) have code points so high that they cannot be stored in 2 bytes (16 bits), so they are encoded as two UTF-16 code units (4 bytes). These are called surrogate pairs.
Try this
var len = "😀".length // There is an emoji in the string (if you don’t see it)
vs
var str = "😀"
var surrogatePairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g) || [];
var len = str.length - surrogatePairs.length;
In the first example len will be 2 because the emoji consists of two UTF-16 code units. In the second example len will be 1.
You might want to read
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
by Joel Spolsky
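As an alternative sketch to the regular expression above: string iteration in modern JavaScript walks code points rather than code units, so the same count can be obtained with Array.from. The function name codePointLength is hypothetical.

```javascript
// Hypothetical helper: count code points instead of UTF-16 code units.
function codePointLength(str) {
  // Array.from uses the string iterator, which yields whole code points.
  return Array.from(str).length;
}

console.log("😀".length); // 2 (two code units)
console.log(codePointLength("😀")); // 1 (one code point)
console.log(codePointLength("a😀b")); // 3
```

Note that this still counts code points, not user-perceived characters, so combining marks are counted separately.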

For your second question:
1. What is a "surrogate pair" in Java?
The term "surrogate pair" refers to a means of encoding Unicode characters with high code-points in the UTF-16 encoding scheme.
In the Unicode character encoding, characters are mapped to values between 0x0 and 0x10FFFF.
Internally, Java uses the UTF-16 encoding scheme to store strings of Unicode text. In UTF-16, 16-bit (two-byte) code units are used. Since 16 bits can only contain the range of characters from 0x0 to 0xFFFF, some additional complexity is used to store values above this range (0x10000 to 0x10FFFF). This is done using pairs of code units known as surrogates.
The surrogate code units are in two ranges known as "high surrogates" (0xD800 to 0xDBFF) and "low surrogates" (0xDC00 to 0xDFFF), depending on whether they are allowed at the start or the end of the two-code-unit sequence.
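The arithmetic behind the encoding can be sketched in JavaScript, which uses the same UTF-16 scheme. The function name toSurrogatePair is hypothetical; the calculation subtracts 0x10000 and splits the remaining 20 bits into two 10-bit halves:

```javascript
// Hypothetical helper: split a code point above U+FFFF into its surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("code point does not need a surrogate pair");
  }
  const offset = codePoint - 0x10000; // 20-bit value
  const high = 0xd800 + (offset >> 10); // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f600);
console.log(high.toString(16), low.toString(16)); // d83d de00
console.log(String.fromCharCode(high, low) === "😀"); // true
```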
https://msdn.microsoft.com/en-us/library/windows/desktop/dd374069%28v=vs.85%29.aspx?f=255&MSPPError=-2147217396
Hope this helps.

Did you try to just google it?
The best description is http://unicodebook.readthedocs.io/unicode_encodings.html#surrogates
In UTF-16 some characters are stored in 16 bits and others in 32 bits.
A surrogate pair is a character representation that takes 32 bits.
Some character codes are reserved to be the first one in such pairs.

RegEx to filter out all but one decimal point [duplicate]

I need a regular expression for decimal/float numbers like 12 12.2 1236.32 123.333 and +12.00 or -12.00 or ...123.123... for use in JavaScript and jQuery.
Thank you.

Optionally match a + or - at the beginning, followed by one or more decimal digits, optionally followed by a decimal point and one or more decimal digits until the end of the string:
/^[+-]?\d+(\.\d+)?$/
RegexPal
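A quick way to see what this pattern accepts is to run it over a few inputs. The sample list below is only illustrative:

```javascript
const decimalPattern = /^[+-]?\d+(\.\d+)?$/;

const samples = ["12", "12.2", "+12.00", "-12.00", "12.", ".5", "1.2.3", "abc"];
for (const sample of samples) {
  console.log(sample, decimalPattern.test(sample));
}
// 12 true, 12.2 true, +12.00 true, -12.00 true,
// 12. false, .5 false, 1.2.3 false, abc false
```

Because of the ^ and $ anchors, the whole string has to be a number; a second decimal point makes the test fail.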

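The right expression should be as follows:
[+-]?([0-9]*[.])?[0-9]+
This finds a match in each of the following (the pattern is unanchored, so for inputs such as 1. it matches only the leading 1):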
+1
+1.
+.1
+0.1
1
1.
.1
0.1
Here is Python example:
import re
#print if found
print(bool(re.search(r'[+-]?([0-9]*[.])?[0-9]+', '1.0')))
#print result
print(re.search(r'[+-]?([0-9]*[.])?[0-9]+', '1.0').group(0))
Output:
True
1.0
If you are using mac, you can test on command line:
python -c "import re; print(bool(re.search(r'[+-]?([0-9]*[.])?[0-9]+', '1.0')))"
python -c "import re; print(re.search(r'[+-]?([0-9]*[.])?[0-9]+', '1.0').group(0))"

You can check for text validation and also only one decimal point validation using isNaN
var val = $('#textbox').val();
var floatValues = /[+-]?([0-9]*[.])?[0-9]+/;
if (val.match(floatValues) && !isNaN(val)) {
// your function
}

This is an old post but it was the top search result for "regular expression for floating point" or something like that and doesn't quite answer _my_ question. Since I worked it out I will share my result so the next person who comes across this thread doesn't have to work it out for themselves.
All of the answers thus far accept a leading 0 on numbers with two (or more) digits on the left of the decimal point (e.g. 0123 instead of just 123) This isn't really valid and in some contexts is used to indicate the number is in octal (base-8) rather than the regular decimal (base-10) format.
Also these expressions accept a decimal with no leading zero (.14 instead of 0.14) or without a trailing fractional part (3. instead of 3.0). That is valid in some programming contexts (including JavaScript) but I want to disallow them (because for my purposes those are more likely to be an error than intentional).
Ignoring "scientific notation" like 1.234E7, here is an expression that meets my criteria:
/^((-)?(0|([1-9][0-9]*))(\.[0-9]+)?)$/
or if you really want to accept a leading +, then:
/^((\+|-)?(0|([1-9][0-9]*))(\.[0-9]+)?)$/
I believe that regular expression will perform a strict test for the typical integer or decimal-style floating point number.
When matched:
$1 contains the full number that matched
$2 contains the (possibly empty) leading sign (+/-)
$3 contains the value to the left of the decimal point
$5 contains the value to the right of the decimal point, including the leading .
By "strict" I mean that the number must be the only thing in the string you are testing.
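A short sketch of how the numbered groups come out in JavaScript, using the variant that accepts a leading +. The variable names are only for illustration:

```javascript
const strictNumber = /^((\+|-)?(0|([1-9][0-9]*))(\.[0-9]+)?)$/;

const m = strictNumber.exec("-3.14");
console.log(m[1]); // -3.14 (the full number)
console.log(m[2]); // -     (the sign)
console.log(m[3]); // 3     (left of the decimal point)
console.log(m[5]); // .14   (right of the decimal point, including the dot)

// Inputs the strict pattern rejects:
console.log(strictNumber.test("0123")); // false (leading zero)
console.log(strictNumber.test(".14")); // false (no digit before the point)
console.log(strictNumber.test("3.")); // false (no digit after the point)
```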
If you want to extract just the float value out of a string that contains other content use this expression:
/((\b|\+|-)(0|([1-9][0-9]*))(\.[0-9]+)?)\b/
Which will find -3.14 in "negative pi is approximately -3.14." or in "(-3.14)" etc.
The numbered groups have the same meaning as above (except that $2 is now an empty string ("") when there is no leading sign, rather than null).
But be aware that it will also try to extract whatever numbers it can find. E.g., it will extract 127.0 from 127.0.0.1.
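The extraction behaviour, including the 127.0.0.1 caveat, can be checked with a small sketch. The helper name extractFirstNumber is hypothetical:

```javascript
const floatInText = /((\b|\+|-)(0|([1-9][0-9]*))(\.[0-9]+)?)\b/;

// Hypothetical helper: return the first number found in the text, or null.
function extractFirstNumber(text) {
  const match = text.match(floatInText);
  return match ? match[1] : null;
}

console.log(extractFirstNumber("negative pi is approximately -3.14.")); // -3.14
console.log(extractFirstNumber("(-3.14)")); // -3.14
console.log(extractFirstNumber("Home is 127.0.0.1.")); // 127.0
console.log(extractFirstNumber("no digits here")); // null
```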
If you want something more sophisticated than that then I think you might want to look at lexical analysis instead of regular expressions. I'm guessing one could create a look-ahead-based expression that would recognize that "Pi is 3.14." contains a floating point number but Home is 127.0.0.1. does not, but it would be complex at best. If your pattern depends on the characters that come after it in non-trivial ways you're starting to venture outside of regular expressions' sweet-spot.

Paulpro and lbsweek answers led me to this:
re=/^[+-]?(?:\d*\.)?\d+$/;
>> /^[+-]?(?:\d*\.)?\d+$/
re.exec("1")
>> Array [ "1" ]
re.exec("1.5")
>> Array [ "1.5" ]
re.exec("-1")
>> Array [ "-1" ]
re.exec("-1.5")
>> Array [ "-1.5" ]
re.exec(".5")
>> Array [ ".5" ]
re.exec("")
>> null
re.exec("qsdq")
>> null

For anyone new:
I made a RegExp for the E scientific notation (without spaces).
const floatR = /^([+-]?(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)(?:[eE][+-]?[0-9]+)?)$/;
let str = "-2.3E23";
let m = floatR.exec(str);
parseFloat(m[1]); //=> -2.3e+23
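If you prefer the shorthand, you could replace all [0-9] by \d in the RegExp; in JavaScript the two are equivalent, because \d matches only the ASCII digits 0-9.
To match non-ASCII Unicode digits you would need \p{Nd} together with the Unicode flag u at the end of the RegExp (note that parseFloat does not parse such digits).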
For a better understanding of the pattern see https://regexper.com/.
And for making RegExp, I can suggest https://regex101.com/.
EDIT: found another site for viewing RegExp in color: https://jex.im/regulex/.
EDIT 2: although op asks for RegExp specifically you can check a string in JS directly:
const isNum = (num)=>!Number.isNaN(Number(num));
isNum("123.12345678E+3");//=> true
isNum("80F");//=> false
converting the string to a number (or NaN) with Number()
then checking if it is NOT NaN with !Number.isNaN()

If you want it to work with e, use this expression:
[+-]?[0-9]+([.][0-9]+)?([eE][+-]?[0-9]+)?
Here is a JavaScript example:
var re = /^[+-]?[0-9]+([.][0-9]+)?([eE][+-]?[0-9]+)?$/;
console.log(re.test('1'));
console.log(re.test('1.5'));
console.log(re.test('-1'));
console.log(re.test('-1.5'));
console.log(re.test('1E-100'));
console.log(re.test('1E+100'));
console.log(re.test('.5'));
console.log(re.test('foo'));

Here is my JS method, which handles zeros at the head of the string.
1- ^0[0-9]+\.?[0-9]*$ : finds numbers that start with 0 followed by further digits before the decimal separator ("."). I use this to distinguish strings such as "0.111" from "01.111".
2- ([1-9]{1}[0-9]*\.?[0-9]*) : if the string starts with 0, only the part beginning at the first non-zero digit is taken into account. The parentheses are there because I want to capture only the part that conforms to the regex.
3- ([0-9]*\.?[0-9]*) : captures the decimal number when the string has no leading zeros.
In JavaScript, st.match(regex) returns an array whose first element contains the conforming part. I use this method in the input element's onChange event: if the user enters something that violates the regex, the violating part is not shown in the element's value at all, but any part that conforms to the regex stays in the element's value.
const floatRegexCheck = (st) => {
const regx1 = new RegExp("^0[0-9]+\\.?[0-9]*$"); // for finding numbers starting with 0
let regx2 = new RegExp("([1-9]{1}[0-9]*\\.?[0-9]*)"); //if regx1 matches then this will remove 0s at the head.
if (!st.match(regx1)) {
regx2 = new RegExp("([0-9]*\\.?[0-9]*)"); //if number does not contain 0 at the head of string then standard decimal formatting takes place
}
st = st.match(regx2);
if (st?.length > 0) {
st = st[0];
}
return st;
}

Here is a more rigorous answer
^[+-]?0(?![0-9]).[0-9]*(?![.])$|^[+-]?[1-9]{1}[0-9]*.[0-9]*$|^[+-]?.[0-9]+$
The following values will match (a leading + or - sign also works)
.11234
0.1143424
11.21
1.
The following values will not match
00.1
1.0.00
12.2350.0.0.0.0.
.
....
How it works
The (?! regex) construct is a negative lookahead, i.e. a NOT operation
Let's break down the regex at the | operator, which works like a logical OR
^[+-]?0(?![0-9]).[0-9]*(?![.])$
This part checks values that start with 0
First, match an optional + or - sign (zero or one time): ^[+-]?
Then check whether it has a leading zero: 0
If it has, the next character must not be a digit, because we don't want to accept 00.123: (?![0-9])
Then match the dot exactly once, followed by any number of fraction digits: .[0-9]*
Last, reject the value if another dot follows the fraction part: (?![.])$
Now see the second part
^[+-]?[1-9]{1}[0-9]*.[0-9]*$
^[+-]? same as above
If it starts with a non-zero digit, match that first digit exactly once, followed by any number of digits: [1-9]{1}[0-9]* e.g. 12.3 , 1.2, 105.6
Match the dot once, followed by any number of digits: .[0-9]*$
Now see the third part
^[+-]?.{1}[0-9]+$
This checks values that start with . e.g. .12, .34565
^[+-]? same as above
Match the dot once, followed by one or more digits: .[0-9]+$

Javascript encoding breaking & combining multibyte characters?

I'm planning to use a client-side AES encryption for my web-app.
Right now, I've been looking for ways to break multibyte characters into one-byte 'non-characters', encrypt them (to have the same encrypted text length), decrypt them back, and convert those one-byte 'non-characters' back to multibyte characters.
I've seen the wiki for UTF-8 (the supposedly-default encoding for JS?) and UTF-16, but I can't figure out how to detect "fragmented" multibyte characters and how I can combine them back.
Thanks : )

JavaScript strings are UTF-16 stored in 16-bit "characters". For Unicode characters ("code points") that require more than 16 bits (some code points require 32 bits in UTF-16), each JavaScript "character" is actually only half of the code point.
So to "break" a JavaScript character into bytes, you just take the character code and split off the high byte and the low byte:
var code = str.charCodeAt(0); // The first character, obviously you'll have a loop
var lowbyte = code & 0xFF;
var highbyte = (code & 0xFF00) >> 8;
(Even though JavaScript's numbers are floating point, the bitwise operators work in terms of 32-bit integers, and of course in our case only 16 of those bits are relevant.)
You'll never have an odd number of bytes, because again this is UTF-16.
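Putting the split into a loop and adding the reverse step gives a round trip like the sketch below. The function names and the high-byte-first order are assumptions made for illustration:

```javascript
// Hypothetical helpers: split each UTF-16 code unit into two bytes and back.
function stringToBytes(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    bytes.push((code & 0xff00) >> 8, code & 0xff); // high byte, then low byte
  }
  return bytes;
}

function bytesToString(bytes) {
  let result = "";
  for (let i = 0; i < bytes.length; i += 2) {
    result += String.fromCharCode((bytes[i] << 8) | bytes[i + 1]);
  }
  return result;
}

const bytes = stringToBytes("a😊");
console.log(bytes); // [0, 97, 216, 61, 222, 10]
console.log(bytesToString(bytes) === "a😊"); // true
```

Surrogate pairs need no special handling here: each half is an ordinary 16-bit code unit, and recombining the code units in order restores the pair.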

You could simply convert to UTF-8, for example by using this trick:
function encode_utf8(s) {
return unescape(encodeURIComponent(s));
}
function decode_utf8(s) {
return decodeURIComponent(escape(s));
}
Considering you are using crypto-js, you can use its methods to convert to utf8 and return to string. See here:
var words = CryptoJS.enc.Utf8.parse('𤭢');
var utf8 = CryptoJS.enc.Utf8.stringify(words);
The 𤭢 is probably a botched example of Utf8 character.
Looking at the other examples (see the Latin1 example), I'd say that parse converts a string to UTF-8 (technically, it converts it to UTF-8 and stores it in a WordArray, a special array type used by crypto-js), and the result can be passed to the AES encryption algorithm. stringify converts a WordArray (for example, one obtained from the decryption algorithm) back to a UTF-8 string.
JsFiddle example: http://jsfiddle.net/UpJRm/
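In environments that provide the standard TextEncoder and TextDecoder classes (current browsers and recent Node.js versions; this availability is an assumption about your target), the same UTF-8 conversion can be sketched without escape/unescape or an external library:

```javascript
const encoder = new TextEncoder(); // always encodes to UTF-8
const decoder = new TextDecoder("utf-8");

const utf8Bytes = encoder.encode("€"); // Uint8Array of UTF-8 bytes
console.log(Array.from(utf8Bytes)); // [226, 130, 172]
console.log(decoder.decode(utf8Bytes)); // €

// Round trip with a character outside the Basic Multilingual Plane.
const original = "𤭢";
console.log(encoder.encode(original).length); // 4
console.log(decoder.decode(encoder.encode(original)) === original); // true
```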
