How many bytes in a JavaScript string?

I have a JavaScript string which is about 500K when being sent from the server in UTF-8. How can I tell its size in JavaScript?
I know that JavaScript uses UCS-2, so does that mean 2 bytes per character? However, does it depend on the JavaScript implementation? Or on the page encoding or maybe the content-type?

You can use the Blob API to get the string size in bytes.
Examples:
console.info(
    new Blob(['😂']).size,               // 4
    new Blob(['👍']).size,               // 4
    new Blob(['😂👍']).size,             // 8
    new Blob(['👍😂']).size,             // 8
    new Blob(['I\'m a string']).size,    // 12
    // from Premasagar's correction of Lauri's answer for
    // strings containing lone characters in the surrogate pair range:
    // https://stackoverflow.com/a/39488643/6225838
    new Blob([String.fromCharCode(55555)]).size,        // 3
    new Blob([String.fromCharCode(55555, 57000)]).size  // 4 (not 6)
);

This function returns the UTF-8 byte size of any string you pass to it.
function byteCount(s) {
    return encodeURI(s).split(/%..|./).length - 1;
}
Source
JavaScript engines are free to use UCS-2 or UTF-16 internally. Most engines that I know of use UTF-16, but whatever choice they made, it’s just an implementation detail that won’t affect the language’s characteristics.
The ECMAScript/JavaScript language itself, however, exposes characters according to UCS-2, not UTF-16.
Source

If you're using Node.js, there is a simpler solution using buffers:
function getBinarySize(string) {
    return Buffer.byteLength(string, 'utf8');
}
There is an npm lib for that: https://www.npmjs.org/package/utf8-binary-cutter (from yours truly)

String values are not implementation dependent. According to the ECMA-262 3rd Edition Specification, each character represents a single 16-bit unit of UTF-16 text:
4.3.16 String Value
A string value is a member of the type String and is a finite ordered sequence of zero or more 16-bit unsigned integer values.
NOTE Although each value usually represents a single 16-bit unit of UTF-16 text, the language does not place any restrictions or requirements on the values except that they be 16-bit unsigned integers.

These are 3 ways I use:
TextEncoder
new TextEncoder().encode("myString").length
Blob
new Blob(["myString"]).size
Buffer
Buffer.byteLength("myString", 'utf8')
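For instance, all three agree on a string that mixes ASCII, a BMP character and an emoji (a quick sketch; Buffer is assumed to be available, i.e. a Node.js environment):
const s = "I❤😂";
new TextEncoder().encode(s).length; // 8 (1 + 3 + 4 UTF-8 bytes)
new Blob([s]).size;                 // 8
Buffer.byteLength(s, "utf8");       // 8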

Try this combination using the unescape JS function:
const byteAmount = unescape(encodeURIComponent(yourString)).length
Full encoding process example:
const s = "1 a ф № # ®"; // length is 11
const s2 = encodeURIComponent(s); // length is 41
const s3 = unescape(s2); // length is 15 [1-1,a-1,ф-2,№-3,#-1,®-2]
const s4 = escape(s3); // length is 41
const s5 = decodeURIComponent(s4); // length is 11
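Note that escape/unescape are deprecated; if a modern environment with TextEncoder is available, it reports the same byte count without them (a sketch using the same sample string as above):
new TextEncoder().encode("1 a ф № # ®").length; // 15, same as s3.length above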

Note that if you're targeting node.js you can use Buffer.from(string).length:
var str = "\u2620"; // => "☠"
str.length; // => 1 (character)
Buffer.from(str).length // => 3 (bytes)

The size of a JavaScript string is
Pre-ES6: 2 bytes per character
ES6 and later: 2 bytes per character,
or 5 or more bytes per character
Pre-ES6
Always 2 bytes per character. UTF-16 is not allowed because the spec says "values must be 16-bit unsigned integers". Since UTF-16 strings can use 3- or 4-byte characters, they would violate the 2-byte requirement. Crucially, while UTF-16 cannot be fully supported, the standard does require that the two-byte characters used be valid UTF-16 characters. In other words, pre-ES6 JavaScript strings support a subset of UTF-16 characters.
ES6 and later
2 bytes per character, or 5 or more bytes per character. The additional sizes come into play because ES6 (ECMAScript 6) adds support for Unicode code point escapes. Using a unicode escape looks like this: \u{1D306}
Practical notes
This doesn't relate to the internal implementation of a particular engine. For example, some engines use data structures and libraries with full UTF-16 support, but what they provide externally doesn't have to be full UTF-16 support. Also, an engine may provide external UTF-16 support but is not mandated to do so.
For ES6, practically speaking characters will never be more than 5 bytes long (2 bytes for the escape point + 3 bytes for the Unicode code point) because the latest version of Unicode only has 136,755 possible characters, which fits easily into 3 bytes. However this is technically not limited by the standard, so in principle a single character could use, say, 4 bytes for the code point and 6 bytes total.
Most of the code examples here for calculating byte size don't seem to take into account ES6 Unicode code point escapes, so the results could be incorrect in some cases.
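For illustration, here is what the escape mentioned above produces at runtime, and the counts the other approaches would report for it (a small sketch; TextEncoder is assumed to be available):
const s = "\u{1D306}";              // one code point outside the BMP
s.length;                           // 2 (a surrogate pair of UTF-16 code units)
new TextEncoder().encode(s).length; // 4 bytes in UTF-8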

UTF-8 encodes characters using 1 to 4 bytes per code point. As CMS pointed out in the accepted answer, JavaScript will store each character internally using 16 bits (2 bytes).
If you parse each character in the string via a loop and count the number of bytes used per code point, and then multiply the total count by 2, you should have JavaScript's memory usage in bytes for that UTF-8 encoded string. Perhaps something like this:
getStringMemorySize = function( _string ) {
    "use strict";
    var codePoint, accum = 0;
    for( var stringIndex = 0, endOfString = _string.length; stringIndex < endOfString; stringIndex++ ) {
        codePoint = _string.charCodeAt( stringIndex );
        if( codePoint < 0x100 ) {
            accum += 1;
            continue;
        }
        if( codePoint < 0x10000 ) {
            accum += 2;
            continue;
        }
        if( codePoint < 0x1000000 ) {
            accum += 3;
        } else {
            accum += 4;
        }
    }
    return accum * 2;
};
Examples:
getStringMemorySize( 'I' ); // 2
getStringMemorySize( '❤' ); // 4
getStringMemorySize( '𠀰' ); // 8
getStringMemorySize( 'I❤𠀰' ); // 14

The answer from Lauri Oherd works well for most strings seen in the wild, but will fail if the string contains lone characters in the surrogate pair range, 0xD800 to 0xDFFF. E.g.
byteCount(String.fromCharCode(55555))
// URIError: URI malformed
This longer function should handle all strings:
function bytes (str) {
    var bytes = 0, len = str.length, codePoint, next, i;
    for (i = 0; i < len; i++) {
        codePoint = str.charCodeAt(i);
        // Lone surrogates cannot be passed to encodeURI
        if (codePoint >= 0xD800 && codePoint < 0xE000) {
            if (codePoint < 0xDC00 && i + 1 < len) {
                next = str.charCodeAt(i + 1);
                if (next >= 0xDC00 && next < 0xE000) {
                    bytes += 4;
                    i++;
                    continue;
                }
            }
        }
        bytes += (codePoint < 0x80 ? 1 : (codePoint < 0x800 ? 2 : 3));
    }
    return bytes;
}
E.g.
bytes(String.fromCharCode(55555))
// 3
It will correctly calculate the size for strings containing surrogate pairs:
bytes(String.fromCharCode(55555, 57000))
// 4 (not 6)
The results can be compared with Node's built-in function Buffer.byteLength:
Buffer.byteLength(String.fromCharCode(55555), 'utf8')
// 3
Buffer.byteLength(String.fromCharCode(55555, 57000), 'utf8')
// 4 (not 6)

A single element in a JavaScript string is a single UTF-16 code unit. That is to say, string characters are stored as 16-bit values (1 code unit each), and 16 bits equal 2 bytes (8 bits = 1 byte).
The charCodeAt() method can be used to return an integer between 0 and 65535 representing the UTF-16 code unit at the given index.
The codePointAt() method can be used to return the entire code point value of a Unicode character (i.e. its UTF-32 value).
When a character can't be represented in a single 16-bit code unit, it is stored as a surrogate pair and therefore uses two code units (2 x 16 bits = 4 bytes).
See Unicode encodings for different encodings and their code ranges.
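For example, the difference between the two methods shows up with a character outside the Basic Multilingual Plane (a small sketch):
const s = "😂";          // U+1F602, stored as a surrogate pair
s.length;                 // 2 (two UTF-16 code units)
s.charCodeAt(0);          // 55357 (0xD83D, the high surrogate)
s.codePointAt(0);         // 128514 (0x1F602, the full code point)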

The Blob interface's size property returns the size of the Blob or File in bytes.
const getStringSize = (s) => new Blob([s]).size;

I'm working with an embedded version of the V8 engine.
I've tested a single string, growing it by 1000 characters per step, in UTF-8. The first test used the single-byte (8-bit, ANSI) character "A" (hex: 41), the second a two-byte (16-bit) character "Ω" (hex: CE A9), and the third a three-byte (24-bit) character "☺" (hex: E2 98 BA).
In all three cases the device reports out-of-memory at 888,000 characters, using about 26,348 KB of RAM.
Result: the characters are not stored dynamically, and not with only 16 bits each. OK, perhaps only for my case (embedded 128 MB RAM device, V8 engine, C++/Qt). The character encoding has nothing to do with the size in RAM of the JavaScript engine; things like encodeURI etc. are only useful for high-level data transmission and storage.
Embedded or not, the fact is that the characters are not stored in only 16 bits.
Unfortunately I have no 100% answer as to what JavaScript does at the low level.
By the way, I've tested the same thing (the first test above) with an array of the character "A", pushing 1000 items per step (exactly the same test, just with the string replaced by an array). The system runs out of memory (as intended) after using 10,416 KB, at an array length of 1,337,000.
So the JavaScript engine is not simply restricted; it's rather more complex.

You can try this:
function getByteLen(str) {
    // Counts each code unit above U+00FF as 2 bytes, everything else as 1.
    var b = str.match(/[^\x00-\xff]/g);
    return str.length + (!b ? 0 : b.length);
}
It worked for me.
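Keep in mind this is only an approximation of the UTF-8 size: every code unit above U+00FF is counted as 2 bytes, while UTF-8 actually needs 3 bytes for some of them (a quick check of the getByteLen function above):
getByteLen("ф");   // 2, matches UTF-8
getByteLen("€");   // 2, but the UTF-8 encoding of "€" is 3 bytes
getByteLen("😂");  // 4, matches UTF-8 only because the emoji is two code units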

Related

Decode a prepended VarInt from a byte stream of unknown size in javascript/Nodejs

I am writing a small utility library for me to request the server status of a given Minecraft host in JS on Node. I am using the Server List Ping Protocol as outlined here (https://wiki.vg/Server_List_Ping) and got it mostly working as expected, albeit with big trouble working with unsupported data types (VarInt); I had to scour the internet to find a way of converting JS numbers into VarInts in order to craft the necessary packet buffers:
function toVarIntBuffer(integer) {
    let buffer = Buffer.alloc(0);
    while (true) {
        let tmp = integer & 0b01111111;
        integer >>>= 7;
        if (integer != 0) {
            tmp |= 0b10000000;
        }
        buffer = Buffer.concat([buffer, Buffer.from([tmp])]);
        if (integer <= 0) break;
    }
    return buffer;
}
Right now I am able to request a server status by sending the handshake packet and then the query packet and do receive a JSON response with the length of the response prepended as a VarInt.
However, the issue is that I simply don't know how to safely identify the VarInt at the beginning of the JSON response (as it can be anywhere up to 5 bytes) and decode it back to a readable number so I can get the proper length of the response byte stream.
[...] as with all strings this is prefixed by its length as a VarInt
(from the protocol documentation)
My current super hacky workaround is to concatenate the chunks as String until the concatenated string contains the same count of '{'s and '}'s (meaning a full json object) and slice the json response at the first '{' before parsing it.
However I am very unhappy with this hacky, inefficient, inelegant and possibly unreliable way of solving the issue and would rather decode the VarInt in front of the JSON response in order to get a proper length to compare against.
I don't know this protocol, but VarInts in protobuf are encoded with the MSB as a continuation bit:
Each byte in a varint, except the last byte, has the most significant bit (msb) set – this indicates that there are further bytes to come. The lower 7 bits of each byte are used to store the two's complement representation of the number in groups of 7 bits, least significant group first.
Note: Too long for a comment, so posting as an answer.
Update: I browsed a bit through the URL you gave, and it is indeed the ProtoBuf VarInt. It is also described there with pseudo-code:
https://wiki.vg/Protocol#VarInt_and_VarLong
VarInt and VarLong: Variable-length formats such that smaller numbers use fewer bytes. These are very similar to Protocol Buffer Varints: the 7 least significant bits are used to encode the value and the most significant bit indicates whether there's another byte after it for the next part of the number. The least significant group is written first, followed by each of the more significant groups; thus, VarInts are effectively little endian (however, groups are 7 bits, not 8). VarInts are never longer than 5 bytes, and VarLongs are never longer than 10 bytes.
Pseudocode to read and write VarInts and VarLongs: (see the linked page)
Thanks to the reference material that @thst pointed me to, I was able to slap together a working way of reading VarInts in JavaScript.
function readVarInt(buffer) {
    let value = 0;
    let length = 0;
    let currentByte;
    while (true) {
        currentByte = buffer[length];
        value |= (currentByte & 0x7F) << (length * 7);
        length += 1;
        if (length > 5) {
            throw new Error('VarInt exceeds allowed bounds.');
        }
        if ((currentByte & 0x80) != 0x80) break;
    }
    return value;
}
buffer must be a byte stream starting with the VarInt, ideally using the std Buffer class.
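For example, a round trip with the toVarIntBuffer function from the question (25565 is just a sample value, the default Minecraft port):
const encoded = toVarIntBuffer(25565); // <Buffer dd c7 01>
readVarInt(encoded);                   // 25565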

What is this type of string called?

In python, we can do something like print("some random string".encode().decode('utf-16')) which will output: 潳敭爠湡潤瑳楲杮.
I feel like that is UTF-16, but I'm not really sure, because I can't reproduce it in any other language. My goal is to create a function that will do exactly this, but in JavaScript. The problem is that I can't find out what type of string this is...
Does someone know what this is called and/or how I could reproduce this in JS?
A string is a sequence of runes. Unicode is a standard for assigning numeric values to those runes. UTF-8 or UTF-16 are standards for encoding a sequence of runes, as represented by their unicode numeric values, as a sequence of bytes.
What you did there is use encode with the default encoding, which is UTF-8, to get a sequence of bytes which you then tried to decode back to runes as if the bytes had come from a UTF-16 encoding. Basically (because your input string fits in a 1-byte encoding for UTF-8) you're taking pairs of characters from the input, jamming their bytes together and hoping that the resulting value is a legal UTF-16 encoding of something (which in general you cannot count on being true). You'll also run into issues if the utf-8 encoding is not an even number of bytes, of course.
If you really need to do this thing in javascript, you could do something like this:
const str = "some random string";
var buf = new ArrayBuffer(str.length);
// Reinterpret the sequence of bytes as a sequence of byte pairs.
var bufView = new Uint16Array(buf);
for (var i=0, strLen=str.length; i < strLen-1; i+=2) {
var c1 = str.charCodeAt(i);
var c2 = str.charCodeAt(i+1);
if (c1 > 127 || c2 > 127) {
// This will be a problem. How you handle it is up to you.
}
bufView[i/2] = c1 << 8 | c2;
}
console.log(String.fromCharCode.apply(String, bufView));
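Alternatively, a sketch that follows the byte-level description above more literally, assuming TextEncoder and DataView are available (the helper name is mine):
function utf8AsUtf16LE(str) {
    // UTF-8 encode, then reinterpret each byte pair as a little-endian UTF-16
    // code unit - which is what Python's encode().decode('utf-16') does on a
    // typical little-endian machine when no BOM is present.
    const bytes = new TextEncoder().encode(str);
    const usable = bytes.length - (bytes.length % 2); // drop a trailing odd byte
    const view = new DataView(bytes.buffer, bytes.byteOffset, usable);
    let out = "";
    for (let i = 0; i < usable; i += 2) {
        out += String.fromCharCode(view.getUint16(i, true)); // true = little-endian
    }
    return out;
}
console.log(utf8AsUtf16LE("some random string")); // same output as the Python example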

Is there an equivalent to C's *(unsigned int*)(char) = 123 in Javascript?

I'm dealing with some C source code I'm trying to convert over to Javascript, I've hit a snag at this line
char ddata[512];
*(unsigned int*)(ddata+0)=123;
*(unsigned int*)(ddata+4)=add-8;
memset(ddata+8,0,add-8);
I'm not sure exactly what is happening here, I understand they're casting the char to an unsigned int, but what is the ddata+0 and stuff doing here? Thanks.
You can't say.
That's because the behaviour on casting a char* to an unsigned* is undefined unless the pointer started off as an unsigned*, which, in your case it didn't.
ddata + 0 is equivalent to ddata.
ddata + 4 is equivalent to &ddata[4], i.e. the address of the 5th element of the array.
For what it's worth, it looks like the C programmer is attempting to serialise a couple of unsigned literals into a byte array. But the code is a mess; aside from what I've already said they appear to be assuming that an unsigned occupies 4 bytes, which is not necessarily the case.
The code fragment is storing a record id (123) as a 4-byte integer in the first 4 bytes of a char buffer ddata. It then stores a length (add-8) in the following 4 bytes and finally initializes the following add-8 bytes to 0.
Translating this to JavaScript can be done in different ways, but probably not by constructing a string with the same contents. The reason is that strings are not byte buffers in JavaScript; they contain Unicode code points, so writing the string to storage might perform some unwanted conversions.
The best solution depends on your actual target platform, where byte arrays may be available to more closely match the intended semantics of your C code.
Note that the above code is not portable and has undefined behavior for various reasons, notably because ddata might not be properly aligned to be used as the address to store an unsigned int via the cast *(unsigned int*)ddata = 123;, because it assumes int to be 4 bytes and because it relies on unspecified byte ordering.
On the Redhat linux box it probably works as expected, and the same C code would probably perform correctly on MacOS, that uses the same Intel architecture with little endian ordering. How best to translate this to Javascript requires more context and specifications.
In the mean time, the code would best be rewritten this way:
unsigned char ddata[512];
if (add <= 512) {
ddata[0] = 123;
ddata[1] = 0;
ddata[2] = 0;
ddata[3] = 0;
ddata[4] = ((add-8) >> 0) & 255;
ddata[5] = ((add-8) >> 8) & 255;
ddata[6] = ((add-8) >> 16) & 255;
ddata[7] = ((add-8) >> 24) & 255;
memset(ddata + 8, 0, add - 8);
}
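A comparable JavaScript sketch would use a typed array rather than a string, assuming the same little-endian, 4-byte int layout as the rewritten C code above (buildRecord and its add parameter are hypothetical names for illustration):
function buildRecord(add) {
    const ddata = new Uint8Array(512);   // zero-filled, so no explicit memset is needed
    const view = new DataView(ddata.buffer);
    view.setUint32(0, 123, true);        // record id, little-endian
    view.setUint32(4, add - 8, true);    // length field, little-endian
    return ddata;
}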

Counting the byte size of a file encoded in ISO 8859-7 in JavaScript

Background
I am writing an esoteric language called Jolf. It is used on the lovely site codegolf SE. If you don't already know, a lot of challenges are scored in bytes. People have made lots of languages that utilize either their own encoding or a pre-existing encoding.
On the interpreter for my language, I have a byte counter. As you might expect, it counts the number of bytes in the code. Until now, I've been using a UTF-8 en/decoder (utf8.js). I am now using the ISO 8859-7 encoding, which has Greek characters. The text upload doesn't actually work either; I need to count the actual bytes contained within an uploaded file. Also, is there a way to read the contents of said encoded file?
Question
Given a file encoded in ISO 8859-7 obtained from an <input> element on the page, is there any way to obtain the number of bytes contained in that file? And, given "plaintext" (i.e. text put directly into a <textarea>), how might I count the bytes in that as if it was encoded in ISO 8859-7?
What I've tried
The input element is called isogreek. The file resides in the <input> element. The content is ΦX族: a Greek character and a Latin character (each of which should be one byte) and a Chinese character, which should be more than one byte (?).
isogreek.files[0].size; // is 3; should be more.
var reader = new FileReader();
reader.readAsBinaryString(isogreek.files[0]); // corrupts the string to `ÖX?`
reader.readAsText(isogreek.files[0]); // �X?
reader.readAsText(isogreek.files[0],"ISO 8859-7"); // �X?
Extended from this comment.
As @pvg mentioned in the comments, the string resulting from readAsBinaryString would be correct, but is corrupted for two reasons:
A. The result is encoded in ISO-8859-1. You can use a function to fix this:
function convertFrom1to7(text) {
    // charset is the set of chars in the ISO-8859-7 encoding from 0xA0 and up, encoded with this format:
    // - If the character is in the same position as in ISO-8859-1/Unicode, use a "!".
    // - If the character is a Greek char with 720 subtracted from its char code, use a ".".
    // - Otherwise, use \uXXXX format.
    var charset = "!\u2018\u2019!\u20AC\u20AF!!!!.!!!!\u2015!!!!...!...!.!....................!............................................!";
    var newtext = "", newchar = "";
    for (var i = 0; i < text.length; i++) {
        var char = text[i];
        newchar = char;
        if (char.charCodeAt(0) >= 160) {
            newchar = charset[char.charCodeAt(0) - 160];
            if (newchar === "!") newchar = char;
            if (newchar === ".") newchar = String.fromCharCode(char.charCodeAt(0) + 720);
        }
        newtext += newchar;
    }
    return newtext;
}
B. The Chinese character isn't a part of the ISO-8859-7 charset (because the charset supports up to 256 unique chars, as the table shows). If you want to include arbitrary Unicode characters in a program, you will probably need to do one of these two things:
Count the bytes of that program in e.g. UTF-8 or UTF-16. This can be done pretty easily with the library you linked. However, if you want this to be done automatically, you'll need a function that checks whether the content of the textarea is a valid ISO-8859-7 file, like this:
function isValidISO_8859_7(text) {
    var charset = /[\u0000-\u00A0\u2018\u2019\u00A3\u20AC\u20AF\u00A6-\u00A9\u037A\u00AB-\u00AD\u2015\u00B0-\u00B3\u0384-\u0386\u00B7\u0388-\u038A\u00BB\u038C\u00BD\u038E-\u03CE]/;
    var valid = true;
    for (var i = 0; i < text.length; i++) {
        valid = valid && charset.test(text[i]);
    }
    return valid;
}
Create your own, custom variant of ISO-8859-7 that uses a specific byte (or more than one) to signify that the next 2 or 3 bytes belong to a single Unicode char. This can be pretty much as simple or complex as you like, from one char signifying a 2-byte char and one signifying a 3-byter to everything between 80 and 9F setting up for the next few. Here's a basic example that uses 80 as the 2-byter and 81 as the 3-byter (assumes the text is encoded in ISO-8859-1):
function reUnicode(text) {
    var newtext = "";
    for (var i = 0; i < text.length; i++) {
        if (text.charCodeAt(i) === 0x80) {
            newtext += String.fromCharCode((text.charCodeAt(++i) << 8) + text.charCodeAt(++i));
        } else if (text.charCodeAt(i) === 0x81) {
            var charcode = (text.charCodeAt(++i) << 16) + (text.charCodeAt(++i) << 8) + text.charCodeAt(++i) - 65536;
            newtext += String.fromCharCode(0xD800 + (charcode >> 10), 0xDC00 + (charcode & 1023)); // Convert into a UTF-16 surrogate pair
        } else {
            newtext += convertFrom1to7(text[i]);
        }
    }
    return newtext;
}
I can go into either method in more detail if you desire.
The three characters you gave as an example encode to 6 bytes in UTF-8: ce a6 58 e6 97 8f (0x58 = X). Also: JavaScript works with UTF-16, which results in some funny things like ("abc".length === "ΦX族".length) being true.
You most probably need to go to the full length and check every single character's byte length by its code value. You may also need to check two characters in some cases (UTF-32 to UTF-16). A BOM needs to be placed and checked too, if necessary (always necessary if you work with files from unknown sources).
EDIT: added on request:
The encoding of characters in JavaScript is always UTF-16, a two-byte representation of a character. That was all well and nice until they suddenly (ha!) found out that two bytes are not really sufficient for all of the alphabets of the world, so they expanded the Unicode range to four bytes: UTF-32.
Well, the Unicode consortium did so, but the ECMA committee did not.
It cannot be said that hell broke loose, but it is quite close in some circumstances, and one of those is your case, because you want to mix one-byte encodings with multiple-byte encodings, different ones even.
One byte fits well in two bytes, but three or more bytes do not fit well in two bytes, so the so-called surrogates were invented. These surrogates are also the reason why it is not so simple to reverse a string in JavaScript.
As I said: a large can of worms.

What is a surrogate pair?

I came across this code in a javascript open source project.
validator.isLength = function (str, min, max) {
    // match surrogate pairs in string or declare an empty array if none found in string
    var surrogatePairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g) || [];
    // subtract the surrogate pairs string length from main string length
    var len = str.length - surrogatePairs.length;
    // now compare string length with min and max ... also make sure max is defined (in other words, the max param is optional for the function)
    return len >= min && (typeof max === 'undefined' || len <= max);
};
As far as I understand, the above code is checking the length of the string but not taking the surrogate pairs into account. So:
Is my understanding of the code correct?
What are surrogate pairs?
I have thus far only figured out that this is related to encoding.
Yes, your understanding is correct. The function returns the length of the string in Unicode code points.
JavaScript uses UTF-16 to encode its strings. This means two bytes (16 bits) are used to represent one Unicode code point.
Now there are characters (like the emojis) in Unicode that have such a high code point that they cannot be stored in 2 bytes (16 bits), so they need to be encoded into two UTF-16 code units (4 bytes). These are called surrogate pairs.
Try this
var len = "😀".length // There is an emoji in the string (in case you don't see it)
vs
var str = "😀"
var surrogatePairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g) || [];
var len = str.length - surrogatePairs.length;
In the first example len will be 2 because the emoji consists of two UTF-16 code units. In the second example len will be 1.
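In modern JavaScript you can also count code points directly, because string iteration is code-point aware:
[..."😀"].length;        // 1
Array.from("😀").length; // 1
"😀".length;             // 2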
You might want to read
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
by Joel Spolsky
For your second question:
1. What is a "surrogate pair" in Java?
The term "surrogate pair" refers to a means of encoding Unicode characters with high code-points in the UTF-16 encoding scheme.
In the Unicode character encoding, characters are mapped to values between 0x0 and 0x10FFFF.
Internally, Java uses the UTF-16 encoding scheme to store strings of Unicode text. In UTF-16, 16-bit (two-byte) code units are used. Since 16 bits can only contain the range of characters from 0x0 to 0xFFFF, some additional complexity is used to store values above this range (0x10000 to 0x10FFFF). This is done using pairs of code units known as surrogates.
The surrogate code units are in two ranges known as "low surrogates" and "high surrogates", depending on whether they are allowed at the start or end of the two code unit sequence.
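The mapping itself is simple arithmetic; for example, computing the pair for U+1F600 (the emoji from the previous answer) in JavaScript:
const cp = 0x1F600;                            // 😀
const high = 0xD800 + ((cp - 0x10000) >> 10);  // 0xD83D
const low = 0xDC00 + ((cp - 0x10000) & 0x3FF); // 0xDE00
String.fromCharCode(high, low) === "😀";       // true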
https://msdn.microsoft.com/en-us/library/windows/desktop/dd374069%28v=vs.85%29.aspx?f=255&MSPPError=-2147217396
Hope this helps.
Did you try to just google it?
The best description is http://unicodebook.readthedocs.io/unicode_encodings.html#surrogates
In UTF-16, most characters are stored in a single 16-bit code unit, while others need two code units (32 bits).
A surrogate pair is a character representation that takes two 16-bit code units (4 bytes).
Certain code unit ranges are reserved for the first (high) and second (low) halves of such pairs.
