I am trying to ascertain if there is a standard arithmetical formula which, given the length of an unencoded string, will reveal the length of that string when it has been base-64 encoded.
Here is a list of strings and their base-64 encodings:
A : QQ==
AB : QUI=
ABC : QUJD
ABCD : QUJDRA==
ABCDE : QUJDREU=
ABCDEF : QUJDREVG
ABCDEFG : QUJDREVGRw==
ABCDEFGH : QUJDREVGR0g=
ABCDEFGHI : QUJDREVGR0hJ
ABCDEFGHIJ : QUJDREVGR0hJSg==
ABCDEFGHIJK : QUJDREVGR0hJSks=
ABCDEFGHIJKL : QUJDREVGR0hJSktM
Here are the string lengths of the original strings and the lengths of their base-64 encoded strings (not including the = signs sometimes appended to the end of the encoding):
1 : 2
2 : 3
3 : 4
4 : 6
5 : 7
6 : 8
7 : 10
8 : 11
9 : 12
10 : 14
11 : 15
12 : 16
What single formula, when applied to the numbers on the left, results in the numbers on the right?
The function in https://stackoverflow.com/a/57945696/230983 does exactly what Rounin needs. But if you want to support Unicode characters, you cannot rely on the length property, so you need something else to count the number of bytes. A simple way to solve this is to use blobs:
/**
 * Guess the number of Base64 characters required by specified string
 *
 * @param {String} str
 * @returns {Number}
 */
function detectB64CharsLength(str) {
  const blob = new Blob([str]);
  return Math.ceil(blob.size * (4 / 3));
}
/**
 * A dirty hack for encoding Unicode characters to Base64
 *
 * @link https://developer.mozilla.org/en-US/docs/Web/API/WindowBase64/Base64_encoding_and_decoding#The_Unicode_Problem
 * @param {String} data
 * @returns {String}
 */
function utoa(data) {
  return btoa(unescape(encodeURIComponent(data)));
}
// Run some tests and make sure everything is ok
['a', 'ab', 'ββ', '😀'].map(v => {
  console.log(v, detectB64CharsLength(v), utoa(v));
});
Your question is muddled, because of the part where you say "not including the = signs sometimes appended to the end of the encoding".
I'm not saying the length of the non-= portion of a base64 encoding result is uninteresting -- perhaps you have valid reasons for wanting to know that.
But if you are trying to calculate, say, the storage needed for a base64 encoding result, you need to include storage for the = signs; a base64 result cannot be decoded without them. Observe:
$ echo -n 'ABCDE' | base64
QUJDREU=
$ echo -n 'QUJDREU=' | base64 --decode | od -c
0000000 A B C D E
$ echo -n 'QUJDREU' | base64 --decode | od -c
0000000 A B C
NOTE #1 : It is possible to not store the =-signs, because it is possible to calculate when they are missing from a given base64 result; they don't strictly speaking need to be stored, but they do need to be supplied for the decoding operation. But then you'd need a custom decoding operation that first looks to see if the padding is missing. I wager that storing at worst 2 extra bytes is far less expensive than the hassle / complexity / unexpectedness of a custom base64 decoding function.
NOTE #2 : As per follow-up comments, some libraries have base64 functions that support missing padding. Treatment of padding is implementation-specific. In some contexts, padding is mandatory (per the relevant specs). Each of the following is a reasonable treatment of padding for any specific library:
implicit padding : assume padding characters for inputs whose length is one or two bytes short of a multiple of 4 bytes (note: 3 bytes short is still invalid, since base64 encoding can only be 0, 1, or 2 bytes short)
best-effort decoding : decode the longest portion of the input that is divisible by 4 bytes
assume truncation : reject as invalid an input whose length is not divisible by 4 bytes, on the assumption that this indicates an incomplete transmission
Again, which of these is most correct will depend upon the context in which the code in question is operating, and different library authors will make different determinations on this; a sketch of the implicit-padding approach follows below.
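To make the implicit-padding option concrete, here is a minimal sketch (not taken from any particular library; padBase64 is a made-up name): since a padded result is always a multiple of 4 characters long, the stripped = signs can be restored before handing the string to a standard decoder.
function padBase64(input) {
  var remainder = input.length % 4;
  if (remainder === 1) {
    // 3 characters short is never valid base64
    throw new Error('Invalid base64 length');
  }
  // Restore the stripped '=' signs, if any
  return remainder === 0 ? input : input + '='.repeat(4 - remainder);
}
padBase64('QUJDREU'); // "QUJDREU=" -- now decodes to "ABCDE"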
The answer from @Victor is the best answer; it is the most germane to the context of the question (JavaScript), and considers the crucial bytes-vs-characters issue as well.
As I was finishing typing out the question above, I realised (I think) what the formula is.
Divide the original string length by 3.
Round up that new number
Add the rounded up new number to the original string length
Like this:
const getLengthOfStringAfterBase64Encoding = (string) => {
  const stringLength = string.length;
  const base64EncodedStringLength = stringLength + Math.ceil(stringLength / 3);
  return base64EncodedStringLength;
}
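For completeness (not part of the original question): if you do include the = signs, the encoded length is simply the unencoded length rounded up to the next multiple of 4 characters, i.e. 4 * Math.ceil(stringLength / 3). A sketch, assuming one byte per character:
const getPaddedBase64Length = (stringLength) => 4 * Math.ceil(stringLength / 3);
getPaddedBase64Length(5); // 8 -- "ABCDE" encodes to "QUJDREU=", which is 8 characters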
Now I have to convert a hexadecimal value encoded in a string to a byte.
var str = "5e";
var b = // should then be 0x5e
If str = "6b", then b = 0x6b, and so on.
Is there any function in JavaScript, like Java's
Byte.parseByte(str, 16)
Thanks in advance
The function you want is parseInt
parseInt("6b", 16) // returns 107
The first argument to parseInt is a string representation of the number and the second argument is the base. Use 10 for decimal and 16 for hexadecimal.
From your comment, if you expect "an output of 0x6b" from the string "6b", then just prepend "0x" to your string and manipulate it further as you need. There is no JavaScript type that will output a hexadecimal in a readable format prefixed with '0x' other than a string.
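For example (a small sketch of that suggestion; the exact formatting is up to you):
var n = parseInt("6b", 16);                          // 107
var hex = "0x" + n.toString(16);                     // "0x6b"
var padded = "0x" + n.toString(16).padStart(2, "0"); // "0x6b", fixed two-digit width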
I solved it by using just
new Buffer("32476832", 'hex')
This solved my problem and gave me the desired buffer:
<Buffer 32 47 68 32>
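Note that the new Buffer(...) constructor has since been deprecated in Node.js; the same result can be obtained with the current API:
const buf = Buffer.from("32476832", "hex");
console.log(buf); // <Buffer 32 47 68 32>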
In JavaScript, I need to convert two bytes into a 16 bit integer, so that I can convert a stream of audio data into an array of signed PCM values.
Most answers online for converting bytes to 16 bit integers use the following, but it does not work correctly for negative numbers.
var result = (((byteA & 0xFF) << 8) | (byteB & 0xFF));
You need to consider that negatives are represented in 2's complement, and that JavaScript uses 32-bit integers to perform bitwise operations. Because of this, if it's a negative value, you need to fill the upper 16 bits of the number with 1's. So, here is a solution:
var sign = byteA & (1 << 7);
var x = (((byteA & 0xFF) << 8) | (byteB & 0xFF));
var result = x;
if (sign) {
  result = 0xFFFF0000 | x; // fill in the most significant bits with 1's
}
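An alternative sketch that avoids the explicit sign check: shift the 16-bit value into the upper half of a 32-bit integer and arithmetically shift it back down, which sign-extends automatically.
function toInt16(byteA, byteB) {
  var x = ((byteA & 0xFF) << 8) | (byteB & 0xFF);
  // << 16 then >> 16: the arithmetic right shift copies the sign bit back down
  return (x << 16) >> 16;
}
toInt16(0xFF, 0xFF); // -1
toInt16(0x7F, 0xFF); // 32767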
How can I perform bitwise right shift on binary?
>> in JS applies to integers. So, 6 >> 1 (bitwise shift right by 1) gives 3 as the result (110 >> 1 = 011 – correct). That is nice, but... is it possible to use the shift right operator with binary?
I need 110 >> 1 to work on the binary digits. As written, it shows 55 as the result, and that is arithmetically correct: 110 (decimal) is 01101110 in binary, and 01101110 >> 1 = 00110111, which is 55. Correct. But I want 011 as the result, treating "110" as binary! How can I do this in JS?
This looks like string manipulation to me. How about:
function shr(x, s) {
  return '0'.repeat(Math.min(s, x.length)) +
    x.substr(0, Math.max(0, x.length - s));
}
>>> shr('110', 1)
'011'
Alternatively, you can use the usual bitwise operators and simply convert from a string representation beforehand (once), and convert back to a string representation afterwards (once).
Here's one way to do it:
(parseInt("110",2) >> 1).toString(2) // gives "11"
parseInt can take a radix as a parameter, so if you pass it 2 it will treat the string as binary. So you convert to a number, shift it and then use toString (which conveniently also will take a radix as a parameter) to convert back to a binary string.
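If you also want to keep the original width (leading zeros included), one option is to pad the result back out, e.g. with padStart. A sketch, assuming the input string's length is the width you want:
function shrBinary(binStr, n) {
  var shifted = parseInt(binStr, 2) >> n;                   // shift as a number
  return shifted.toString(2).padStart(binStr.length, "0");  // restore leading zeros
}
shrBinary("110", 1); // "011"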
I have a javascript string which is about 500K when being sent from the server in UTF-8. How can I tell its size in JavaScript?
I know that JavaScript uses UCS-2, so does that mean 2 bytes per character? Or does it depend on the JavaScript implementation? Or on the page encoding, or maybe the content-type?
You can use a Blob to get the string size in bytes.
Examples:
console.info(
  new Blob(['😂']).size, // 4
  new Blob(['👍']).size, // 4
  new Blob(['😂👍']).size, // 8
  new Blob(['👍😂']).size, // 8
  new Blob(['I\'m a string']).size, // 12
  // from Premasagar correction of Lauri's answer for
  // strings containing lone characters in the surrogate pair range:
  // https://stackoverflow.com/a/39488643/6225838
  new Blob([String.fromCharCode(55555)]).size, // 3
  new Blob([String.fromCharCode(55555, 57000)]).size // 4 (not 6)
);
This function will return the byte size of any UTF-8 string you pass to it.
function byteCount(s) {
  return encodeURI(s).split(/%..|./).length - 1;
}
Source
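For example (byte counts as produced by the UTF-8 percent-encoding that encodeURI performs):
byteCount("a");  // 1
byteCount("ß");  // 2 (U+00DF is two UTF-8 bytes)
byteCount("☠");  // 3 (U+2620 is three UTF-8 bytes)
byteCount("😀"); // 4 (U+1F600 is four UTF-8 bytes)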
JavaScript engines are free to use UCS-2 or UTF-16 internally. Most engines that I know of use UTF-16, but whatever choice they made, it’s just an implementation detail that won’t affect the language’s characteristics.
The ECMAScript/JavaScript language itself, however, exposes characters according to UCS-2, not UTF-16.
Source
If you're using Node.js, there is a simpler solution using buffers:
function getBinarySize(string) {
  return Buffer.byteLength(string, 'utf8');
}
There is an npm lib for that: https://www.npmjs.org/package/utf8-binary-cutter (from yours truly)
String values are not implementation-dependent; according to the ECMA-262 3rd Edition Specification, each character represents a single 16-bit unit of UTF-16 text:
4.3.16 String Value

A string value is a member of the type String and is a finite ordered sequence of zero or more 16-bit unsigned integer values.

NOTE Although each value usually represents a single 16-bit unit of UTF-16 text, the language does not place any restrictions or requirements on the values except that they be 16-bit unsigned integers.
These are the 3 ways I use:
TextEncoder
new TextEncoder().encode("myString").length
Blob
new Blob(["myString"]).size
Buffer
Buffer.byteLength("myString", 'utf8')
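All three agree on multi-byte input; for example, with a 4-byte emoji (the Buffer variant assumes Node.js):
const s = "😂"; // U+1F602, four bytes in UTF-8
new TextEncoder().encode(s).length; // 4
new Blob([s]).size;                 // 4
Buffer.byteLength(s, "utf8");       // 4 (Node.js only)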
Try this combination using the unescape JS function:
const byteAmount = unescape(encodeURIComponent(yourString)).length
Full encoding process example:
const s = "1 a ф № # ®"; // length is 11
const s2 = encodeURIComponent(s); // length is 41
const s3 = unescape(s2); // length is 15 [1-1,a-1,ф-2,№-3,#-1,®-2]
const s4 = escape(s3); // length is 39
const s5 = decodeURIComponent(s4); // length is 11
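The same byte count can be cross-checked without the deprecated escape/unescape functions, e.g. with TextEncoder:
new TextEncoder().encode(s).length; // 15, same as byteAmount above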
Note that if you're targeting Node.js you can use Buffer.from(string).length:
var str = "\u2620"; // => "☠"
str.length; // => 1 (character)
Buffer.from(str).length // => 3 (bytes)
The size of a JavaScript string is
Pre-ES6: 2 bytes per character
ES6 and later: 2 bytes per character, or 5 or more bytes per character
Pre-ES6
Always 2 bytes per character. UTF-16 is not allowed because the spec says "values must be 16-bit unsigned integers". Since UTF-16 strings can use 4-byte characters, they would violate the 2-byte requirement. Crucially, while UTF-16 cannot be fully supported, the standard does require that the two-byte characters used are valid UTF-16 characters. In other words, pre-ES6 JavaScript strings support a subset of UTF-16 characters.
ES6 and later
2 bytes per character, or 5 or more bytes per character. The additional sizes come into play because ES6 (ECMAScript 6) adds support for Unicode code point escapes. Using a Unicode code point escape looks like this: \u{1D306}
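For illustration (just the escape notation; at runtime the escaped character is exposed as two UTF-16 code units):
const ch = "\u{1D306}";          // code point escape for U+1D306
ch.length;                       // 2 (a surrogate pair)
ch.codePointAt(0).toString(16);  // "1d306"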
Practical notes
This doesn't relate to the internal implementation of a particular engine. For example, some engines use data structures and libraries with full UTF-16 support, but what they provide externally doesn't have to be full UTF-16 support. Also, an engine may provide external UTF-16 support but is not mandated to do so.
For ES6, practically speaking, characters will never be more than 5 bytes long (2 bytes for the escape point + 3 bytes for the Unicode code point) because the latest version of Unicode only has 136,755 possible characters, which fits easily into 3 bytes. However, this is technically not limited by the standard, so in principle a single character could use, say, 4 bytes for the code point and 6 bytes total.
Most of the code examples here for calculating byte size don't seem to take into account ES6 Unicode code point escapes, so the results could be incorrect in some cases.
UTF-8 encodes characters using 1 to 4 bytes per code point. As CMS pointed out in the accepted answer, JavaScript will store each character internally using 16 bits (2 bytes).
If you parse each character in the string via a loop and count the number of bytes used per code point, and then multiply the total count by 2, you should have JavaScript's memory usage in bytes for that UTF-8 encoded string. Perhaps something like this:
const getStringMemorySize = function( _string ) {
  "use strict";

  var codePoint
    , accum = 0
    ;

  for( var stringIndex = 0, endOfString = _string.length; stringIndex < endOfString; stringIndex++ ) {
    codePoint = _string.charCodeAt( stringIndex );

    if( codePoint < 0x100 ) {
      accum += 1;
      continue;
    }

    if( codePoint < 0x10000 ) {
      accum += 2;
      continue;
    }

    if( codePoint < 0x1000000 ) {
      accum += 3;
    } else {
      accum += 4;
    }
  }

  return accum * 2;
};
Examples:
getStringMemorySize( 'I' ); // 2
getStringMemorySize( '❤' ); // 4
getStringMemorySize( '𠀰' ); // 8
getStringMemorySize( 'I❤𠀰' ); // 14
The answer from Lauri Oherd works well for most strings seen in the wild, but will fail if the string contains lone characters in the surrogate pair range, 0xD800 to 0xDFFF. E.g.
byteCount(String.fromCharCode(55555))
// URIError: URI malformed
This longer function should handle all strings:
function bytes (str) {
  var bytes = 0, len = str.length, codePoint, next, i;

  for (i = 0; i < len; i++) {
    codePoint = str.charCodeAt(i);

    // Lone surrogates cannot be passed to encodeURI
    if (codePoint >= 0xD800 && codePoint < 0xE000) {
      if (codePoint < 0xDC00 && i + 1 < len) {
        next = str.charCodeAt(i + 1);

        if (next >= 0xDC00 && next < 0xE000) {
          bytes += 4;
          i++;
          continue;
        }
      }
    }

    bytes += (codePoint < 0x80 ? 1 : (codePoint < 0x800 ? 2 : 3));
  }

  return bytes;
}
E.g.
bytes(String.fromCharCode(55555))
// 3
It will correctly calculate the size for strings containing surrogate pairs:
bytes(String.fromCharCode(55555, 57000))
// 4 (not 6)
The results can be compared with Node's built-in function Buffer.byteLength:
Buffer.byteLength(String.fromCharCode(55555), 'utf8')
// 3
Buffer.byteLength(String.fromCharCode(55555, 57000), 'utf8')
// 4 (not 6)
A single element in a JavaScript String is considered to be a single UTF-16 code unit. That is to say, String characters are stored in 16 bits (1 code unit), and 16 bits equal 2 bytes (8 bits = 1 byte).
The charCodeAt() method can be used to return an integer between 0 and 65535 representing the UTF-16 code unit at the given index.
The codePointAt() can be used to return the entire code point value for Unicode characters, e.g. UTF-32.
When a character can't be represented in a single 16-bit code unit, UTF-16 encodes it as a surrogate pair, which therefore uses two code units (2 x 16 bits = 4 bytes).
See Unicode encodings for different encodings and their code ranges.
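For example, a character outside the Basic Multilingual Plane occupies two code units, and charCodeAt and codePointAt report it differently:
const face = "😀"; // U+1F600, stored as a surrogate pair
face.length;                      // 2 (two UTF-16 code units = 4 bytes)
face.charCodeAt(0).toString(16);  // "d83d" (high surrogate only)
face.codePointAt(0).toString(16); // "1f600" (full code point)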
The Blob interface's size property returns the size of the Blob or File in bytes.
const getStringSize = (s) => new Blob([s]).size;
I'm working with an embedded version of the V8 Engine.
I've tested a single string, appending 1000 characters at each step. UTF-8.
The first test used the single-byte (8-bit, ANSI) character "A" (hex: 41), the second the two-byte (16-bit) character "Ω" (hex: CE A9), and the third the three-byte (24-bit) character "☺" (hex: E2 98 BA).
In all three cases the device reports out of memory at 888 000 characters, using ca. 26 348 KB of RAM.
Result: the characters are not stored dynamically, and not in only 16 bits. OK, perhaps this only holds for my case (embedded 128 MB RAM device, V8 engine, C++/Qt). The character encoding has nothing to do with the size in RAM used by the JavaScript engine; e.g. encodeURI etc. is only useful for high-level data transmission and storage.
Embedded or not, the fact is that the characters are not stored in only 16 bits.
Unfortunately I have no 100% answer as to what JavaScript does at the low level.
By the way, I've tested the same thing (the first test above) with an array of the character "A", pushing 1000 items at every step (exactly the same test, just with the string replaced by an array), and the system ran out of memory (as expected) after using 10 416 KB, with an array length of 1 337 000.
So, the JavaScript engine is not simply restricted; it's a bit more complex.
You can try this:
function getByteSize(str) {
  var b = str.match(/[^\x00-\xff]/g);
  return str.length + (!b ? 0 : b.length);
}
It worked for me.