What is a surrogate pair? - javascript

I came across this code in a JavaScript open source project:
validator.isLength = function (str, min, max) {
// match surrogate pairs in string or declare an empty array if none found in string
var surrogatePairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g) || [];
// subtract the surrogate pairs string length from main string length
var len = str.length - surrogatePairs.length;
// now compare string length with min and max ... also make sure max is defined(in other words, max param is optional for function)
return len >= min && (typeof max === 'undefined' || len <= max);
};
As far as I understand, the above code is checking the length of the string but not taking the surrogate pairs into account. So:
Is my understanding of the code correct?
What are surrogate pairs?
I have thus far only figured out that this is related to encoding.

Yes, your understanding is correct. The function returns the length of the string in Unicode code points.
JavaScript uses UTF-16 to encode its strings, which means each code unit is two bytes (16 bits), and most code points fit into a single code unit.
However, some characters in Unicode (the emojis, for example) have code points so high that they cannot be stored in 2 bytes (16 bits), so they are encoded as two UTF-16 code units (4 bytes). These two code units are called a surrogate pair.
Try this
var len = "😀".length // There is an emoji in the string (if you don't see it)
vs
var str = "😀"
var surrogatePairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g) || [];
var len = str.length - surrogatePairs.length;
In the first example len will be 2 because the emoji consists of two UTF-16 code units. In the second example len will be 1.
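If you just want the code-point length directly, here is a shorter sketch (assuming an ES6 environment; the spread operator iterates a string by code points rather than code units):
var str = "😀";
var codeUnits = str.length;        // 2 (UTF-16 code units)
var codePoints = [...str].length;  // 1 (code points; a surrogate pair counts once)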
You might want to read
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
by Joel Spolsky

For your second question:
1. What is a "surrogate pair" in Java?
The term "surrogate pair" refers to a means of encoding Unicode characters with high code-points in the UTF-16 encoding scheme.
In the Unicode character encoding, characters are mapped to values between 0x0 and 0x10FFFF.
Internally, Java uses the UTF-16 encoding scheme to store strings of Unicode text. In UTF-16, 16-bit (two-byte) code units are used. Since 16 bits can only contain the range of characters from 0x0 to 0xFFFF, some additional complexity is used to store values above this range (0x10000 to 0x10FFFF). This is done using pairs of code units known as surrogates.
The surrogate code units are in two ranges known as "low surrogates" and "high surrogates", depending on whether they are allowed at the start or end of the two code unit sequence.
https://msdn.microsoft.com/en-us/library/windows/desktop/dd374069%28v=vs.85%29.aspx?f=255&MSPPError=-2147217396
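As a rough illustration (my own sketch, not taken from the quoted text), the two reserved ranges can be checked directly against a string's code units:
// Classify a UTF-16 code unit by the reserved surrogate ranges
function surrogateType(codeUnit) {
  if (codeUnit >= 0xD800 && codeUnit <= 0xDBFF) return "high surrogate"; // first half of a pair
  if (codeUnit >= 0xDC00 && codeUnit <= 0xDFFF) return "low surrogate";  // second half of a pair
  return "not a surrogate";
}

surrogateType("𝄞".charCodeAt(0)); // "high surrogate" (U+1D11E is encoded as 0xD834 0xDD1E)
surrogateType("𝄞".charCodeAt(1)); // "low surrogate"
surrogateType("a".charCodeAt(0)); // "not a surrogate"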
Hope this helps.

Did you try to just google it?
The best description is http://unicodebook.readthedocs.io/unicode_encodings.html#surrogates
In UTF-16, some characters are stored in a single 16-bit code unit and others in two code units (32 bits).
A surrogate pair is the two-code-unit (32-bit) representation of one of those characters.
Certain code unit ranges are reserved for the first (high) and second (low) halves of such pairs.

Related

Javascript: Convert Unicode Character to hex string [duplicate]

I'm using a barcode scanner to read a barcode on my website (the website is made in OpenUI5).
The scanner works like a keyboard that types the characters it reads. At the beginning and end of the input it types a special character. These characters differ for every type of scanner.
Some possible characters are:
█
▄
–
—
In my code I use if (oModelScanner.oData.scanning && oEvent.key == "\u2584") to check if the input from the scanner is ▄.
Is there any way to get the code from that character in the \uHHHH style? (with the HHHH being the hexadecimal code for the character)
I tried the charCodeAt but this returns the decimal code.
With the codePointAt examples they make the code I need into a decimal code so I need a reverse of this.
JavaScript strings have a method codePointAt which gives you the integer representing the Unicode code point value. You need a base-16 (hexadecimal) representation of that number if you wish to format it as a four-hexadecimal-digit sequence (as in the answer by Nikolay Spasov).
var hex = "▄".codePointAt(0).toString(16);
var result = "\\u" + "0000".substring(0, 4 - hex.length) + hex;
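If your environment has String.prototype.padStart (ES2017), the zero-padding above can be written a bit more directly; a small sketch for BMP characters:
// Sketch: format a BMP character (code point <= 0xFFFF) as a \uHHHH escape
function toUnicodeEscape(ch) {
  return "\\u" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
}

toUnicodeEscape("▄"); // "\u2584"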
However, it would probably be easier to check directly whether your key's code point integer matches the expected code point:
oEvent.key.codePointAt(0) === '▄'.codePointAt(0);
Note that "symbol equality" can actually be trickier: some symbols are defined by surrogate pairs (you can see it as the combination of two halves defined as four hexadecimal digits sequence).
For this reason I would recommend to use a specialized library.
you'll find more details in the very relevant article by Mathias Bynens
var hex = "▄".charCodeAt(0).toString(16);
var result = "\\u" + "0000".substring(0, 4 - hex.length) + hex;
If you want to print the multiple code points of a character, e.g., an emoji, you can do this:
const facepalm = "🤦🏼‍♂️";
const codePoints = Array.from(facepalm)
.map((v) => v.codePointAt(0).toString(16))
.map((hex) => "\\u{" + hex + "}");
console.log(codePoints);
["\u{1f926}", "\u{1f3fc}", "\u{200d}", "\u{2642}", "\u{fe0f}"]
If you are wondering about the components and the length of 🤦🏼‍♂️, check out this article.

How do JSON parsers encode unicode characters not in the basic multilingual plane?

I'm writing a JSON parser in Xojo. It's working apart from the fact that I can't figure out how to encode and decode Unicode strings that are not in the basic multilingual plane (BMP). In other words, my parser dies if it encounters something greater than \uFFFF.
The specs say:
To escape a code point that is not in the Basic Multilingual Plane,
the character may be represented as a twelve-character sequence,
encoding the UTF-16 surrogate pair corresponding to the code point. So
for example, a string containing only the G clef character (U+1D11E)
may be represented as "\uD834\uDD1E". However, whether a processor of
JSON texts interprets such a surrogate pair as a single code point or
as an explicit surrogate pair is a semantic decision that is
determined by the specific processor.
What I don't understand is what is the algorithm to go from U+1D11E to \uD834\uDD1E. I can't find any explanation of how to "encode the UTF-16 surrogate pair corresponding to the code point".
For example, say I want to encode the smiley face character (U+1F600). What would this be as a UTF-16 surrogate pair and what is the working to derive it?
Could somebody please at least point me in the correct direction?
Taken from the Wikipedia article linked by Remy Lebeau in the comments above (link):
To encode U+10437 (𐐷) to UTF-16:
Subtract 0x10000 from the code point, leaving 0x0437.
For the high surrogate, shift right by 10 (divide by 0x400), then add 0xD800, resulting in 0x0001 + 0xD800 = 0xD801.
For the low surrogate, take the low 10 bits (remainder of dividing by 0x400), then add 0xDC00, resulting in 0x0037 + 0xDC00 = 0xDC37.
To decode U+10437 (𐐷) from UTF-16:
Take the high surrogate (0xD801) and subtract 0xD800, then multiply by 0x400, resulting in 0x0001 × 0x400 = 0x0400.
Take the low surrogate (0xDC37) and subtract 0xDC00, resulting in 0x37.
Add these two results together (0x0437), and finally add 0x10000 to get the final decoded UTF-32 code point, 0x10437.
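Since the question also asks about U+1F600, here is the same arithmetic as a small JavaScript sketch (the parser itself is in Xojo, so treat this only as an illustration of the algorithm, not a drop-in solution):
// Encode a supplementary code point (0x10000-0x10FFFF) as a UTF-16 surrogate pair
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;
  var high = 0xD800 + (offset >> 10);   // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

// Decode a surrogate pair back into a code point
function fromSurrogatePair(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

toSurrogatePair(0x1F600).map(function (n) { return n.toString(16).toUpperCase(); });
// ["D83D", "DE00"], i.e. U+1F600 is escaped as "\uD83D\uDE00"

fromSurrogatePair(0xD834, 0xDD1E).toString(16).toUpperCase();
// "1D11E", the G clef from the spec example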

String comparison - Javascript

I'm trying to get my head around string comparisons in JavaScript:
function f(str){
return str[0] < str[str.length -1]
}
f("a+"); // false
In ASCII: 'a' == 97, '+' == 43
Am I correct in thinking my test: f(str) is based on ASCII values above?
You don't need a function or a complicated test pulling a string apart for this. Just do 'a' < '+' and learn from what happens. Or, more simply, check the char's charcode using 'a'.charCodeAt(0).
You are almost right. It is based on Unicode code units (not code points; a code unit is the 16-bit encoded form), not on ASCII values.
From the ECMAScript 2015 specification:
If both px and py are Strings, then
If py is a prefix of px, return false. (A String value p is a prefix of String value q if q can be the result of concatenating p and some other String r. Note that any String is a prefix of itself, because r may be the empty String.)
If px is a prefix of py, return true.
Let k be the smallest nonnegative integer such that the code unit at index k within px is different from the code unit at index k within py. (There must be such a k, for neither String is a prefix of the other.)
Let m be the integer that is the code unit value at index k within px.
Let n be the integer that is the code unit value at index k within py.
If m < n, return true. Otherwise, return false.
Note 2
The comparison of Strings uses a simple lexicographic ordering on
sequences of code unit values. There is no attempt to use the more
complex, semantically oriented definitions of character or string
equality and collating order defined in the Unicode specification.
Therefore String values that are canonically equal according to the
Unicode standard could test as unequal. In effect this algorithm
assumes that both Strings are already in normalized form. Also, note
that for strings containing supplementary characters, lexicographic
ordering on sequences of UTF-16 code unit values differs from that on
sequences of code point values.
Basically it means that string comparison is based on a lexicographical order of "code units", i.e. the numeric values of the string's UTF-16 units.
JavaScript engines are allowed to use either UCS-2 or UTF-16 (which is the same for most practical purposes).
So, technically, your function is based on UTF-16 values and you were comparing 0x0061 and 0x002B.
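You can verify the code units being compared yourself; a quick sketch:
"a".charCodeAt(0).toString(16); // "61"  (0x0061)
"+".charCodeAt(0).toString(16); // "2b"  (0x002B)
"a" < "+";                      // false, because 0x0061 > 0x002B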

Javascript encoding breaking & combining multibyte characters?

I'm planning to use a client-side AES encryption for my web-app.
Right now, I've been looking for ways to break multibyte characters into one-byte 'non-characters', encrypt them (so the encrypted text keeps the same length),
decrypt them, and convert those one-byte 'non-characters' back into multibyte characters.
I've seen the wiki for UTF-8 (the supposedly-default encoding for JS?) and UTF-16, but I can't figure out how to detect "fragmented" multibyte characters and how I can combine them back.
Thanks : )
JavaScript strings are UTF-16 stored in 16-bit "characters". For Unicode characters ("code points") that require more than 16 bits (some code points require 32 bits in UTF-16), each JavaScript "character" is actually only half of the code point.
So to "break" a JavaScript character into bytes, you just take the character code and split off the high byte and the low byte:
var code = str.charCodeAt(0); // The first character, obviously you'll have a loop
var lowbyte = code & 0xFF;
var highbyte = (code & 0xFF00) >> 8;
(Even though JavaScript's numbers are floating point, the bitwise operators work in terms of 32-bit integers, and of course in our case only 16 of those bits are relevant.)
You'll never have an odd number of bytes, because again this is UTF-16.
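Putting that together, a minimal sketch (the helper names are my own, not a library API) that splits a whole string into bytes and rebuilds it:
// Split every UTF-16 code unit into two bytes
function stringToBytes(str) {
  var bytes = [];
  for (var i = 0; i < str.length; i++) {
    var code = str.charCodeAt(i);
    bytes.push((code & 0xFF00) >> 8); // high byte
    bytes.push(code & 0xFF);          // low byte
  }
  return bytes;
}

// Recombine pairs of bytes into UTF-16 code units (surrogate halves rejoin automatically)
function bytesToString(bytes) {
  var str = "";
  for (var i = 0; i < bytes.length; i += 2) {
    str += String.fromCharCode((bytes[i] << 8) | bytes[i + 1]);
  }
  return str;
}

bytesToString(stringToBytes("a😀")) === "a😀"; // true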
You could simply convert to UTF8... For example by using this trick
function encode_utf8(s) {
return unescape(encodeURIComponent(s));
}
function decode_utf8(s) {
return decodeURIComponent(escape(s));
}
Considering you are using crypto-js, you can use its methods to convert to utf8 and return to string. See here:
var words = CryptoJS.enc.Utf8.parse('𤭢');
var utf8 = CryptoJS.enc.Utf8.stringify(words);
The 𤭢 is probably a botched example of a UTF-8 character.
Judging from the other examples (see the Latin1 example), I'd say that with parse you convert a string to UTF-8 (technically, you convert it to UTF-8 and put it into a special crypto-js array of type WordArray), and the result can be passed to the AES encryption algorithm; with stringify you convert a WordArray (for example, one obtained from the decryption algorithm) back to a UTF-8 string.
JsFiddle example: http://jsfiddle.net/UpJRm/

How many bytes in a JavaScript string?

I have a javascript string which is about 500K when being sent from the server in UTF-8. How can I tell its size in JavaScript?
I know that JavaScript uses UCS-2, so does that mean 2 bytes per character? However, does it depend on the JavaScript implementation? Or on the page encoding or maybe content-type?
You can use the Blob to get the string size in bytes.
Examples:
console.info(
new Blob(['😂']).size, // 4
new Blob(['👍']).size, // 4
new Blob(['😂👍']).size, // 8
new Blob(['👍😂']).size, // 8
new Blob(['I\'m a string']).size, // 12
// from Premasagar correction of Lauri's answer for
// strings containing lone characters in the surrogate pair range:
// https://stackoverflow.com/a/39488643/6225838
new Blob([String.fromCharCode(55555)]).size, // 3
new Blob([String.fromCharCode(55555, 57000)]).size // 4 (not 6)
);
This function will return the byte size of any UTF-8 string you pass to it.
function byteCount(s) {
return encodeURI(s).split(/%..|./).length - 1;
}
Source
JavaScript engines are free to use UCS-2 or UTF-16 internally. Most engines that I know of use UTF-16, but whatever choice they made, itโ€™s just an implementation detail that wonโ€™t affect the languageโ€™s characteristics.
The ECMAScript/JavaScript language itself, however, exposes characters according to UCS-2, not UTF-16.
Source
If you're using node.js, there is a simpler solution using buffers :
function getBinarySize(string) {
return Buffer.byteLength(string, 'utf8');
}
There is an npm lib for that: https://www.npmjs.org/package/utf8-binary-cutter (from yours faithfully)
String values are not implementation-dependent; according to the ECMA-262 3rd Edition specification, each character represents a single 16-bit unit of UTF-16 text:
4.3.16 String Value
A string value is a member of the type String and is a finite ordered sequence of zero or more 16-bit unsigned integer values.
NOTE Although each value usually represents a single 16-bit unit of UTF-16 text, the language does not place any restrictions or requirements on the values except that they be 16-bit unsigned integers.
These are 3 ways I use:
TextEncoder
new TextEncoder().encode("myString").length
Blob
new Blob(["myString"]).size
Buffer
Buffer.byteLength("myString", 'utf8')
Try this combination, using the unescape JS function:
const byteAmount = unescape(encodeURIComponent(yourString)).length
Full encoding process example:
const s = "1 a ั„ โ„– # ยฎ"; // length is 11
const s2 = encodeURIComponent(s); // length is 41
const s3 = unescape(s2); // length is 15 [1-1,a-1,ф-2,№-3,#-1,®-2]
const s4 = escape(s3); // length is 39
const s5 = decodeURIComponent(s4); // length is 11
Note that if you're targeting node.js you can use Buffer.from(string).length:
var str = "\u2620"; // => "โ˜ "
str.length; // => 1 (character)
Buffer.from(str).length // => 3 (bytes)
The size of a JavaScript string is
Pre-ES6: 2 bytes per character
ES6 and later: 2 bytes per character,
or 5 or more bytes per character
Pre-ES6
Always 2 bytes per character. UTF-16 is not allowed because the spec says "values must be 16-bit unsigned integers". Since UTF-16 can use 4-byte (two code unit) characters, it would violate the 2-byte requirement. Crucially, while UTF-16 cannot be fully supported, the standard does require that the two-byte characters used are valid UTF-16 characters. In other words, pre-ES6 JavaScript strings support a subset of UTF-16 characters.
ES6 and later
2 bytes per character, or 5 or more bytes per character. The additional sizes come into play because ES6 (ECMAScript 6) adds support for Unicode code point escapes. Using a unicode escape looks like this: \u{1D306}
Practical notes
This doesn't relate to the internal implementation of a particular engine. For example, some engines use data structures and libraries with full UTF-16 support, but what they provide externally doesn't have to be full UTF-16 support. Also, an engine may provide external UTF-16 support but is not mandated to do so.
For ES6, practically speaking, characters will never be more than 5 bytes long (2 bytes for the escape point + 3 bytes for the Unicode code point) because the latest version of Unicode only has 136,755 possible characters, which fit easily into 3 bytes. However, this is technically not limited by the standard, so in principle a single character could use, say, 4 bytes for the code point and 6 bytes total.
Most of the code examples here for calculating byte size don't seem to take into account ES6 Unicode code point escapes, so the results could be incorrect in some cases.
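For example, a quick sketch showing that a code point escape above the BMP still expands to a surrogate pair inside the string:
const s = "\u{1D306}";  // ES6 code point escape
s.length;               // 2 (two UTF-16 code units)
s === "\uD834\uDF06";   // true, the equivalent surrogate pair
new Blob([s]).size;     // 4 (UTF-8 bytes)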
UTF-8 encodes characters using 1 to 4 bytes per code point. As CMS pointed out in the accepted answer, JavaScript will store each character internally using 16 bits (2 bytes).
If you parse each character in the string via a loop and count the number of bytes used per code point, and then multiply the total count by 2, you should have JavaScript's memory usage in bytes for that UTF-8 encoded string. Perhaps something like this:
getStringMemorySize = function( _string ) {
    "use strict";
    var codePoint
        , accum = 0
    ;
    for( var stringIndex = 0, endOfString = _string.length; stringIndex < endOfString; stringIndex++ ) {
        codePoint = _string.charCodeAt( stringIndex );
        if( codePoint < 0x100 ) {
            accum += 1;
            continue;
        }
        if( codePoint < 0x10000 ) {
            accum += 2;
            continue;
        }
        if( codePoint < 0x1000000 ) {
            accum += 3;
        } else {
            accum += 4;
        }
    }
    return accum * 2;
}
Examples:
getStringMemorySize( 'I' ); // 2
getStringMemorySize( '❤' ); // 4
getStringMemorySize( '𠀰' ); // 8
getStringMemorySize( 'I❤𠀰' ); // 14
The answer from Lauri Oherd works well for most strings seen in the wild, but will fail if the string contains lone characters in the surrogate pair range, 0xD800 to 0xDFFF. E.g.
byteCount(String.fromCharCode(55555))
// URIError: URI malformed
This longer function should handle all strings:
function bytes (str) {
    var bytes = 0, len = str.length, codePoint, next, i;
    for (i = 0; i < len; i++) {
        codePoint = str.charCodeAt(i);
        // Lone surrogates cannot be passed to encodeURI
        if (codePoint >= 0xD800 && codePoint < 0xE000) {
            if (codePoint < 0xDC00 && i + 1 < len) {
                next = str.charCodeAt(i + 1);
                if (next >= 0xDC00 && next < 0xE000) {
                    bytes += 4;
                    i++;
                    continue;
                }
            }
        }
        bytes += (codePoint < 0x80 ? 1 : (codePoint < 0x800 ? 2 : 3));
    }
    return bytes;
}
E.g.
bytes(String.fromCharCode(55555))
// 3
It will correctly calculate the size for strings containing surrogate pairs:
bytes(String.fromCharCode(55555, 57000))
// 4 (not 6)
The results can be compared with Node's built-in function Buffer.byteLength:
Buffer.byteLength(String.fromCharCode(55555), 'utf8')
// 3
Buffer.byteLength(String.fromCharCode(55555, 57000), 'utf8')
// 4 (not 6)
A single element in a JavaScript String is considered to be a single UTF-16 code unit. That is to say, a String's characters are stored as 16-bit code units, and 16 bits equal 2 bytes (8 bits = 1 byte).
The charCodeAt() method can be used to return an integer between 0 and 65535 representing the UTF-16 code unit at the given index.
The codePointAt() method can be used to return the entire code point value of a Unicode character (i.e. its value in a 32-bit encoding such as UTF-32).
When a character can't be represented in a single 16-bit code unit, it is encoded as a surrogate pair and therefore uses two code units (2 × 16 bits = 4 bytes).
See Unicode encodings for different encodings and their code ranges.
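A short sketch of the difference for a character outside the BMP:
const clef = "𝄞";    // U+1D11E, stored as a surrogate pair
clef.length;          // 2 code units
clef.charCodeAt(0);   // 55348 (0xD834, high surrogate)
clef.charCodeAt(1);   // 56606 (0xDD1E, low surrogate)
clef.codePointAt(0);  // 119070 (0x1D11E, the full code point)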
The Blob interface's size property returns the size of the Blob or File in bytes.
const getStringSize = (s) => new Blob([s]).size;
I'm working with an embedded version of the V8 Engine.
I've tested a single string, pushing 1000 characters at each step, in UTF-8.
The first test used the single-byte (8-bit, ANSI) character "A" (hex: 41), the second a two-byte (16-bit) character "Ω" (hex: CE A9), and the third a three-byte (24-bit) character "☺" (hex: E2 98 BA).
In all three cases the device reports out of memory at 888,000 characters, using ca. 26,348 KB of RAM.
Result: the characters are not stored dynamically, and not with only 16 bits. OK, perhaps that only holds for my case (embedded 128 MB RAM device, V8 engine, C++/Qt). The character encoding has nothing to do with the amount of RAM the JavaScript engine uses; e.g. encodeURI etc. are only useful for high-level data transmission and storage.
Embedded or not, the fact is that the characters are not stored in only 16 bits.
Unfortunately I have no 100% answer as to what JavaScript does at the low level.
Btw, I've tested the same thing (the first test above) with an array of the character "A", pushing 1000 items every step (exactly the same test, just with the string replaced by an array). The system ran out of memory (as expected) after using 10,416 KB, with an array length of 1,337,000.
So the JavaScript engine is not simply restricted; it's a bit more complex.
You can try this:
function byteLength(str) {
    // counts 1 byte for each Latin-1 character and 2 bytes for everything else
    var b = str.match(/[^\x00-\xff]/g);
    return (str.length + (!b ? 0 : b.length));
}
It worked for me.
