Russian characters converting to binary incorrectly (JavaScript)

Russian characters converting to binary incorrectly (JavaScript) - javascript

I'm writing a program in JavaScript that needs to convert text to 8-bit binary, which I accomplish with a loop that uses "exampleVariable.charCodeAt(i).toString(2)", then appends "0"s to the front until the length of the binary representation of each character is 8 bits. However, when Russian characters are passed into the function, each character is converted to an 11-bit binary representation, when it should actually be 16 bits. For example, "д" converts to 10000110100, when, in actuality, it should convert to "1101000010110100". Any ideas on how to fix this?

It looks like you are trying to get the binary representation of the UTF-8 representation of the character. JavaScript uses UTF-16 internally, so you will have to do some work to do the translation. There are various libraries out there, we'd need to know more about the environment to recommend the right tools. If you wanted to code it up yourself, it would be roughly:
function codepointToUTF_8(code) {
if (code < 0x07f) {
return [code];
} else if (code < 0x800) {
var byte1 = 0xc0 | (code >> 6 );
var byte2 = 0x80 | (code & 0x3f);
return [ byte1, byte2 ];
} else if (code < 0x10000) {
var byte1 = 0xe0 | ( code >> 12 );
var byte2 = 0x80 | ((code >> 6 ) & 0x3f);
var byte3 = 0x80 | ( code & 0x3f);
return [ byte1, byte2, byte3 ];
} else {
var byte1 = 0xf0 | ( code >> 18 );
var byte2 = 0x80 | ((code >> 12) & 0x3f);
var byte3 = 0x80 | ((code >> 6) & 0x3f);
var byte4 = 0x80 | ( code & 0x3f);
return [ byte1, byte2, byte3, byte4 ];
}
}
function strToUTF_8 (str) {
var result = [];
for (var i = 0; i < str.length; i++) {
// NOTE: this will not handle anything beyond the BMP
result.push(codepointToUTF_8(str.charCodeAt(i)));
}
console.log('result = ', result);
return [].concat.apply([], result);
}
function byteToBinary (b) {
var str = b.toString(2);
while (str.length < 8) {
str = '0' + str;
}
return str;
}
function toBinaryUTF_8 (str) {
return strToUTF_8(str).map(byteToBinary).join(' ');
}
console.log("абвгд => '" + toBinaryUTF_8("абвгд") + "'");
When I execute this I get:
абвгд => '11010000 10110000 11010000 10110001 11010000 10110010 11010000 10110011 11010000 10110100'
I haven't tested this thoroughly, but it should handle the Russian characters OK. It produces an array of character codes, which if you translate as you were trying before with 8 binary bits per character, you should be fine.

Related

How to convert a string to base64 encoding using byte array in JavaScript?

I have the below .NET code to convert a string to Base64 encoding by first converting it to byte array. I tried different answers on Stack Overflow to convert the string in byte array and then use btoa() function for base64 encoding in JavaScript. But, I'm not getting the exact encoded value as shared below.
For string value,
BBFDC43D-4890-4558-BB89-50D802014A97
I need Base64 encoding as,
PcT9u5BIWEW7iVDYAgFKlw==
.NET code:
String str = "BBFDC43D-4890-4558-BB89-50D802014A97"
Guid guid = new Guid(str);
Console.WriteLine(guid); // bbfdc43d-4890-4558-bb89-50d802014a97
Byte[] bytes = guid.ToByteArray();
Console.WriteLine(bytes); // System.Byte[]
String s = Convert.ToBase64String(bytes, Base64FormattingOptions.InsertLineBreaks);
Console.WriteLine(s); // PcT9u5BIWEW7iVDYAgFKlw==
Currently, I tried with the below code, which is not producing the desired result:
function strToUtf8Bytes(str) {
const utf8 = [];
for (let ii = 0; ii < str.length; ii++) {
let charCode = str.charCodeAt(ii);
if (charCode < 0x80) utf8.push(charCode);
else if (charCode < 0x800) {
utf8.push(0xc0 | (charCode >> 6), 0x80 | (charCode & 0x3f));
} else if (charCode < 0xd800 || charCode >= 0xe000) {
utf8.push(0xe0 | (charCode >> 12), 0x80 | ((charCode >> 6) & 0x3f), 0x80 | (charCode & 0x3f));
} else {
ii++;
// Surrogate pair:
// UTF-16 encodes 0x10000-0x10FFFF by subtracting 0x10000 and
// splitting the 20 bits of 0x0-0xFFFFF into two halves
charCode = 0x10000 + (((charCode & 0x3ff) << 10) | (str.charCodeAt(ii) & 0x3ff));
utf8.push(
0xf0 | (charCode >> 18),
0x80 | ((charCode >> 12) & 0x3f),
0x80 | ((charCode >> 6) & 0x3f),
0x80 | (charCode & 0x3f),
);
}
}
return utf8;
}
const str = "BBFDC43D-4890-4558-BB89-50D802014A97";
const strByteArr = strToUtf8Bytes(str);
const strBase64 = btoa(strByteArr);
// NjYsNjYsNzAsNjgsNjcsNTIsNTEsNjgsNDUsNTIsNTYsNTcsNDgsNDUsNTIsNTMsNTMsNTYsNDUsNjYsNjYsNTYsNTcsNDUsNTMsNDgsNjgsNTYsNDgsNTAsNDgsNDksNTIsNjUsNTcsNTU=

Your problem is caused by the following:
btoa() is using ASCII encoding
guid.ToByteArray(); does not use ASCII encoding
If you modify your C# code like this:
String str = "BBFDC43D-4890-4558-BB89-50D802014A97";
//Guid guid = new Guid(str);
//Console.WriteLine(guid);
// bbfdc43d-4890-4558-bb89-50d802014a97
//Byte[] bytes = guid.ToByteArray();
byte[] bytes = System.Text.Encoding.ASCII.GetBytes(str);
//Console.WriteLine(bytes); // System.Byte[]
String s = Convert.ToBase64String(bytes, Base64FormattingOptions.InsertLineBreaks);
Console.WriteLine(s);
You will get the following output:
QkJGREM0M0QtNDg5MC00NTU4LUJCODktNTBEODAyMDE0QTk3
Which will be the same string as the one returned from the btoa() function:
var rawString = "BBFDC43D-4890-4558-BB89-50D802014A97";
var b64encoded = btoa(rawString);
console.log(b64encoded);
Output:
QkJGREM0M0QtNDg5MC00NTU4LUJCODktNTBEODAyMDE0QTk3
UPDATE - Since you can't modify the C# code
You should adapt your Javascript code by combining Piotr's answer and this SO answer
function guidToBytes(guid) {
var bytes = [];
guid.split('-').map((number, index) => {
var bytesInChar = index < 3 ? number.match(/.{1,2}/g).reverse() : number.match(/.{1,2}/g);
bytesInChar.map((byte) => { bytes.push(parseInt(byte, 16)); })
});
return bytes;
}
function arrayBufferToBase64(buffer) {
var binary = '';
var bytes = new Uint8Array(buffer);
var len = bytes.byteLength;
for (var i = 0; i < len; i++) {
binary += String.fromCharCode(bytes[i]);
}
return btoa(binary);
}
var str = "BBFDC43D-4890-4558-BB89-50D802014A97";
var guidBytes = guidToBytes(str);
var b64encoded = arrayBufferToBase64(guidBytes);
console.log(b64encoded);
Output:
PcT9u5BIWEW7iVDYAgFKlw==

The problem with your code is representation of Guid. In C# code you are converting "BBFDC43D-4890-4558-BB89-50D802014A97" into UUID which is a 128-bit number. In JavaScript code, you are doing something else. You iterate through the string and calculate a byte array of a string. They are simply not equal.
Now you have to options
Implement proper guid conversion in JS (this may help: https://gist.github.com/daboxu/4f1dd0a254326ac2361f8e78f89e97ae)
In C# calculate byte array in the same way as in JS

Your string is a hexadecimal value, which you use to create a GUID. Then you convert the GUID into a byte array with:
Byte[] bytes = guid.ToByteArray();
The GUID is a 16-byte value which can be represented as a hexadecimal value. When you convert this GUID into a byte array, you will get the 16 bytes of the value, not the byte representation of the hexadecimal value.
In the provided JavaScript function you are doing something else: You are converting the string directly to a byte array.
In C# you do the equivalent with an Encoding:
String str = "BBFDC43D-4890-4558-BB89-50D802014A97";
Byte[] bytes = Encoding.UTF8.GetBytes(str);

Is there a way to redefine the standard Ascii character set that Javascript charCodeAt and fromCharCode calls from within a function?

For encoding, Javascript pulls from the standard Anscii table for mapping characters. I found the following function below that brilliantly and correctly encodes to Anscii85/Base85. But I want to encode to the Z85 variation because it contains the set of symbols that I require. My understanding is that the Anscii85/Base85 encoding should work exactly the same, except that Z85 maps the values in a different order from the Anscii standard, and uses a different combination of symbols from the standard Ansii85 mapping as well. So the character set is the only difference:
Ansci85 uses the 85 characters, 32 through 126 (reference):
"!\"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstu
Z85 uses a custom set of 85 characters (reference):
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}#%$#
My question is, is there any way to redefine the character set that charCodeAt and fromCharCode refer to in this function so that it would then encode in Z85?
// By Steve Hanov. Released to the public domain.
function encodeAscii85(input) {
// Remove Adobe standard prefix
// var output = "<~";
var chr1, chr2, chr3, chr4, chr, enc1, enc2, enc3, enc4, enc5;
var i = 0;
while (i < input.length) {
// Access past the end of the string is intentional.
chr1 = input.charCodeAt(i++);
chr2 = input.charCodeAt(i++);
chr3 = input.charCodeAt(i++);
chr4 = input.charCodeAt(i++);
chr = ((chr1 << 24) | (chr2 << 16) | (chr3 << 8) | chr4) >>> 0;
enc1 = (chr / (85 * 85 * 85 * 85) | 0) % 85 + 33;
enc2 = (chr / (85 * 85 * 85) | 0) % 85 + 33;
enc3 = (chr / (85 * 85) | 0 ) % 85 + 33;
enc4 = (chr / 85 | 0) % 85 + 33;
enc5 = chr % 85 + 33;
output += String.fromCharCode(enc1) +
String.fromCharCode(enc2);
if (!isNaN(chr2)) {
output += String.fromCharCode(enc3);
if (!isNaN(chr3)) {
output += String.fromCharCode(enc4);
if (!isNaN(chr4)) {
output += String.fromCharCode(enc5);
}
}
}
}
// Remove Adobe standard suffix
// output += "~>";
return output;
}
Extra notes:
Alternately, I thought I could use something like the following function, but the problem is that it doesn't properly encode Anscii85 in the first place. If it was correct, Hello world! should encode to 87cURD]j7BEbo80, but this function encodes it to RZ!iCB=*gD0D5_+ (reference).
I don't understand the algorithm enough to know what is wrong with the mapping here. Ideally, if it was encoding correctly, I should be able to update this function to use the Z85 character set:
// Adapted from: Ascii85 JavaScript implementation, 2012.10.16 Jim Herrero
// Original: https://jsfiddle.net/nderscore/bbKS4/
var Ascii85 = {
// Ascii85 mapping
_alphabet: "!\"#$%&'()*+,-./0123456789:;<=>?#"+
"ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`"+
"abcdefghijklmnopqrstu"+
"y"+ // short form 4 spaces (optional)
"z", // short form 4 nulls (optional)
// functions
encode: function(input) {
var alphabet = Ascii85._alphabet,
useShort = alphabet.length > 85,
output = "", buffer, val, i, j, l;
for (i = 0, l = input.length; i < l;) {
buffer = [0,0,0,0];
for (j = 0; j < 4; j++)
if(input[i])
buffer[j] = input.charCodeAt(i++);
for (val = buffer[3], j = 2; j >= 0; j--)
val = val*256+buffer[j];
if (useShort && !val)
output += alphabet[86];
else if (useShort && val == 0x20202020)
output += alphabet[85];
else {
for (j = 0; j < 5; j++) {
output += alphabet[val%85];
val = Math.floor(val/85);
}
}
}
return output;
}
};

Character codes are character codes. You can't change the behavior of String.fromCharCode() or String.charCodeAt().
However, you can store your custom character set in an array and use array indexing and Array.indexOf() to look up entries.
Updating this function to work with Z85 will be tricky, though, because String.fromCharCode() and String.charCodeAt() are used in two different contexts -- they're sometimes used to access the unencoded string (which doesn't need to change), and sometimes for the encoded string (which does). You will need to take care to not confuse the two.

Javascript: unicode character to BYTE based hex escape sequence (NOT surrogates)

In javascript I am trying to make unicode into byte based hex escape sequences that are compatible with C:
ie. 😄
becomes: \xF0\x9F\x98\x84 (correct)
NOT javascript surrogates, not \uD83D\uDE04 (wrong)
I cannot figure out the math relationship between the four bytes C wants vs the two surrogates javascript uses. I suspect the algorithm is far more complex than my feeble attempts.
Thanks for any tips.

encodeURIComponent does this work:
var input = "\uD83D\uDE04";
var result = encodeURIComponent(input).replace(/%/g, "\\x"); // \xF0\x9F\x98\x84
Upd: Actually, C strings can contain digits and letters without escaping, but if you really need to escape them:
function escape(s, escapeEverything) {
if (escapeEverything) {
s = s.replace(/[\x10-\x7f]/g, function (s) {
return "-x" + s.charCodeAt(0).toString(16).toUpperCase();
});
}
s = encodeURIComponent(s).replace(/%/g, "\\x");
if (escapeEverything) {
s = s.replace(/\-/g, "\\");
}
return s;
}

Found a solution here: http://jonisalonen.com/2012/from-utf-16-to-utf-8-in-javascript/
I would have never figured out THAT math, wow.
somewhat minified
function UTF8seq(s) {
var i,c,u=[];
for (i=0; i < s.length; i++) {
c = s.charCodeAt(i);
if (c < 0x80) { u.push(c); }
else if (c < 0x800) { u.push(0xc0 | (c >> 6), 0x80 | (c & 0x3f)); }
else if (c < 0xd800 || c >= 0xe000) { u.push(0xe0 | (c >> 12), 0x80 | ((c>>6) & 0x3f), 0x80 | (c & 0x3f)); }
else { i++; c = 0x10000 + (((c & 0x3ff)<<10) | (s.charCodeAt(i) & 0x3ff));
u.push(0xf0 | (c >>18), 0x80 | ((c>>12) & 0x3f), 0x80 | ((c>>6) & 0x3f), 0x80 | (c & 0x3f)); }
}
for (i=0; i < u.length; i++) { u[i]=u[i].toString(16); }
return '\\x'+u.join('\\x');
}

Your C code expects an UTF-8 string (the symbol is represented as 4 bytes). The JS representation you see is UTF-16 however (the symbol is represented as 2 uint16s, a surrogate pair).
You will first need to get the (Unicode) code point for your symbol (from the UTF-16 JS string), then build the UTF-8 representation for it from that.
Since ES6 you can use the codePointAt method for the first part, which I would recommend using as a shim even if not supported. I guess you don't want to decode surrogate pairs yourself :-)
For the rest, I don't think there's a library method, but you can write it yourself according to the spec:
function hex(x) {
x = x.toString(16);
return (x.length > 2 ? "\\u0000" : "\\x00").slice(0,-x.length)+x.toUpperCase();
}
var c = "😄";
console.log(c.length, hex(c.charCodeAt(0))+hex(c.charCodeAt(1))); // 2, "\uD83D\uDE04"
var cp = c.codePointAt(0);
var bytes = new Uint8Array(4);
bytes[3] = 0x80 | cp & 0x3F;
bytes[2] = 0x80 | (cp >>>= 6) & 0x3F;
bytes[1] = 0x80 | (cp >>>= 6) & 0x3F;
bytes[0] = 0xF0 | (cp >>>= 6) & 0x3F;
console.log(Array.prototype.map.call(bytes, hex).join("")) // "\xf0\x9f\x98\x84"
(tested in Chrome)

javascript convert to utf8 the result of readAsBinaryString

I have a file in the following format:
utf-8 encoded text block
separator
binary data block
I use JavaScript's FileReader to read the file as a binary string using
FileReader.readAsBinaryString like so:
var reader = new FileReader();
reader.onload = function(evt){
// Here I use the separator position to divide the file content into
// header and binary
...
console.log(header);
};
FileReader.onerror = function (evt) {
onFailure(evt.target.error.code);
}
reader.readAsBinaryString(blobFile);
The header is not parsed as UTF-8. I know that FileReader.readAsText takes the encoding of the file into account while FileReader.readAsBinaryString reads the file byte by byte.
How do I convert the header to utf8? reading the file twice, once as binary string to read the binary data and again as text to get the first block as utf8 encoded don't appeal to me.

I found the answer on http://snipplr.com/view/31206/:
I have tested it on French characters and it converts then to utf8 without any issues.
function readUTF8String(bytes) {
var ix = 0;
if (bytes.slice(0, 3) == "\xEF\xBB\xBF") {
ix = 3;
}
var string = "";
for (; ix < bytes.length; ix++) {
var byte1 = bytes[ix].charCodeAt(0);
if (byte1 < 0x80) {
string += String.fromCharCode(byte1);
} else if (byte1 >= 0xC2 && byte1 < 0xE0) {
var byte2 = bytes[++ix].charCodeAt(0);
string += String.fromCharCode(((byte1 & 0x1F) << 6) + (byte2 & 0x3F));
} else if (byte1 >= 0xE0 && byte1 < 0xF0) {
var byte2 = bytes[++ix].charCodeAt(0);
var byte3 = bytes[++ix].charCodeAt(0);
string += String.fromCharCode(((byte1 & 0xFF) << 12) + ((byte2 & 0x3F) << 6) + (byte3 & 0x3F));
} else if (byte1 >= 0xF0 && byte1 < 0xF5) {
var byte2 = bytes[++ix].charCodeAt(0);
var byte3 = bytes[++ix].charCodeAt(0);
var byte4 = bytes[++ix].charCodeAt(0);
var codepoint = ((byte1 & 0x07) << 18) + ((byte2 & 0x3F) << 12) + ((byte3 & 0x3F) << 6) + (byte4 & 0x3F);
codepoint -= 0x10000;
string += String.fromCharCode(
(codepoint >> 10) + 0xD800, (codepoint & 0x3FF) + 0xDC00
);
}
}
return string;
}

the result is a atring so you can iterate and convert every byte to its ascii representation using String.fromCharCode something like.
var cursor=0
var header="";
while(cursor!=blob.length && blob[cursor]!=/*separator code*/){
header+=String.fromCharCode(blob[cursor]);
cursor+=1;
}
or
var pos=blob.indexOf(/*separator*/);
var header=String.fromCharCode.apply(this,blob.substr(0,pos).split(' '))

Failed to execute 'btoa' on 'Window': The string to be encoded contains characters outside of the Latin1 range.

The error in the title is thrown only in Google Chrome, according to my tests. I'm base64 encoding a big XML file so that it can be downloaded:
this.loader.src = "data:application/x-forcedownload;base64,"+
btoa("<?xml version=\"1.0\" encoding=\"utf-8\"?>"
+"<"+this.gamesave.tagName+">"
+this.xml.firstChild.innerHTML
+"</"+this.gamesave.tagName+">");
this.loader is hidden iframe.
This error is actually quite a change because normally, Google Chrome would crash upon btoa call. Mozilla Firefox has no problems here, so the issue is browser related.
I'm not aware of any strange characters in file. Actually I do believe there are no non-ascii characters.
Q:
How do I find the problematic characters and replace them so that Chrome stops complaining?
I have tried to use Downloadify to initiate the download, but it does not work. It's unreliable and throws no errors to allow debug.

If you have UTF8, use this (actually works with SVG source), like:
btoa(unescape(encodeURIComponent(str)))
example:
var imgsrc = 'data:image/svg+xml;base64,' + btoa(unescape(encodeURIComponent(markup)));
var img = new Image(1, 1); // width, height values are optional params
img.src = imgsrc;
If you need to decode that base64, use this:
var str2 = decodeURIComponent(escape(window.atob(b64)));
console.log(str2);
Example:
var str = "äöüÄÖÜçéèñ";
var b64 = window.btoa(unescape(encodeURIComponent(str)))
console.log(b64);
var str2 = decodeURIComponent(escape(window.atob(b64)));
console.log(str2);
Note: if you need to get this to work in mobile-safari, you might need to strip all the white-space from the base64 data...
function b64_to_utf8( str ) {
str = str.replace(/\s/g, '');
return decodeURIComponent(escape(window.atob( str )));
}
2017 Update
This problem has been bugging me again.
The simple truth is, atob doesn't really handle UTF8-strings - it's ASCII only.
Also, I wouldn't use bloatware like js-base64.
But webtoolkit does have a small, nice and very maintainable implementation:
/**
*
* Base64 encode / decode
* http://www.webtoolkit.info
*
**/
var Base64 = {
// private property
_keyStr: "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="
// public method for encoding
, encode: function (input)
{
var output = "";
var chr1, chr2, chr3, enc1, enc2, enc3, enc4;
var i = 0;
input = Base64._utf8_encode(input);
while (i < input.length)
{
chr1 = input.charCodeAt(i++);
chr2 = input.charCodeAt(i++);
chr3 = input.charCodeAt(i++);
enc1 = chr1 >> 2;
enc2 = ((chr1 & 3) << 4) | (chr2 >> 4);
enc3 = ((chr2 & 15) << 2) | (chr3 >> 6);
enc4 = chr3 & 63;
if (isNaN(chr2))
{
enc3 = enc4 = 64;
}
else if (isNaN(chr3))
{
enc4 = 64;
}
output = output +
this._keyStr.charAt(enc1) + this._keyStr.charAt(enc2) +
this._keyStr.charAt(enc3) + this._keyStr.charAt(enc4);
} // Whend
return output;
} // End Function encode
// public method for decoding
,decode: function (input)
{
var output = "";
var chr1, chr2, chr3;
var enc1, enc2, enc3, enc4;
var i = 0;
input = input.replace(/[^A-Za-z0-9\+\/\=]/g, "");
while (i < input.length)
{
enc1 = this._keyStr.indexOf(input.charAt(i++));
enc2 = this._keyStr.indexOf(input.charAt(i++));
enc3 = this._keyStr.indexOf(input.charAt(i++));
enc4 = this._keyStr.indexOf(input.charAt(i++));
chr1 = (enc1 << 2) | (enc2 >> 4);
chr2 = ((enc2 & 15) << 4) | (enc3 >> 2);
chr3 = ((enc3 & 3) << 6) | enc4;
output = output + String.fromCharCode(chr1);
if (enc3 != 64)
{
output = output + String.fromCharCode(chr2);
}
if (enc4 != 64)
{
output = output + String.fromCharCode(chr3);
}
} // Whend
output = Base64._utf8_decode(output);
return output;
} // End Function decode
// private method for UTF-8 encoding
,_utf8_encode: function (string)
{
var utftext = "";
string = string.replace(/\r\n/g, "\n");
for (var n = 0; n < string.length; n++)
{
var c = string.charCodeAt(n);
if (c < 128)
{
utftext += String.fromCharCode(c);
}
else if ((c > 127) && (c < 2048))
{
utftext += String.fromCharCode((c >> 6) | 192);
utftext += String.fromCharCode((c & 63) | 128);
}
else
{
utftext += String.fromCharCode((c >> 12) | 224);
utftext += String.fromCharCode(((c >> 6) & 63) | 128);
utftext += String.fromCharCode((c & 63) | 128);
}
} // Next n
return utftext;
} // End Function _utf8_encode
// private method for UTF-8 decoding
,_utf8_decode: function (utftext)
{
var string = "";
var i = 0;
var c, c1, c2, c3;
c = c1 = c2 = 0;
while (i < utftext.length)
{
c = utftext.charCodeAt(i);
if (c < 128)
{
string += String.fromCharCode(c);
i++;
}
else if ((c > 191) && (c < 224))
{
c2 = utftext.charCodeAt(i + 1);
string += String.fromCharCode(((c & 31) << 6) | (c2 & 63));
i += 2;
}
else
{
c2 = utftext.charCodeAt(i + 1);
c3 = utftext.charCodeAt(i + 2);
string += String.fromCharCode(((c & 15) << 12) | ((c2 & 63) << 6) | (c3 & 63));
i += 3;
}
} // Whend
return string;
} // End Function _utf8_decode
}
https://www.fileformat.info/info/unicode/utf8.htm
For any character equal to or below 127 (hex 0x7F), the UTF-8
representation is one byte. It is just the lowest 7 bits of the full
unicode value. This is also the same as the ASCII value.
For characters equal to or below 2047 (hex 0x07FF), the UTF-8
representation is spread across two bytes. The first byte will have
the two high bits set and the third bit clear (i.e. 0xC2 to 0xDF). The
second byte will have the top bit set and the second bit clear (i.e.
0x80 to 0xBF).
For all characters equal to or greater than 2048 but less than 65535
(0xFFFF), the UTF-8 representation is spread across three bytes.

Use a library instead
We don't have to reinvent the wheel. Just use a library to save the time and headache.
js-base64
https://github.com/dankogai/js-base64 is good and I confirm it supports unicode very well.
Base64.encode('dankogai'); // ZGFua29nYWk=
Base64.encode('小飼弾'); // 5bCP6aO85by+
Base64.encodeURI('小飼弾'); // 5bCP6aO85by-
Base64.decode('ZGFua29nYWk='); // dankogai
Base64.decode('5bCP6aO85by+'); // 小飼弾
// note .decodeURI() is unnecessary since it accepts both flavors
Base64.decode('5bCP6aO85by-'); // 小飼弾

Using btoa with unescape and encodeURIComponent didn't work for me. Replacing all the special characters with XML/HTML entities and then converting to the base64 representation was the only way to solve this issue for me. Some code:
base64 = btoa(str.replace(/[\u00A0-\u2666]/g, function(c) {
return '&#' + c.charCodeAt(0) + ';';
}));

I just thought I should share how I actually solved the problem and why I think this is the right solution (provided you don't optimize for old browser).
Converting data to dataURL (data: ...)
var blob = new Blob(
// I'm using page innerHTML as data
// note that you can use the array
// to concatenate many long strings EFFICIENTLY
[document.body.innerHTML],
// Mime type is important for data url
{type : 'text/html'}
);
// This FileReader works asynchronously, so it doesn't lag
// the web application
var a = new FileReader();
a.onload = function(e) {
// Capture result here
console.log(e.target.result);
};
a.readAsDataURL(blob);
Allowing user to save data
Apart from obvious solution - opening new window with your dataURL as URL you can do two other things.
1. Use fileSaver.js
File saver can create actual fileSave dialog with predefined filename. It can also fallback to normal dataURL approach.
2. Use (experimental) URL.createObjectURL
This is great for reusing base64 encoded data. It creates a short URL for your dataURL:
console.log(URL.createObjectURL(blob));
//Prints: blob:http://stackoverflow.com/7c18953f-f5f8-41d2-abf5-e9cbced9bc42
Don't forget to use the URL including the leading blob prefix. I used document.body again:
You can use this short URL as AJAX target, <script> source or <a> href location. You're responsible for destroying the URL though:
URL.revokeObjectURL('blob:http://stackoverflow.com/7c18953f-f5f8-41d2-abf5-e9cbced9bc42')

As an complement to Stefan Steiger answer: (as it doesn't look nice as a comment)
Extending String prototype:
String.prototype.b64encode = function() {
return btoa(unescape(encodeURIComponent(this)));
};
String.prototype.b64decode = function() {
return decodeURIComponent(escape(atob(this)));
};
Usage:
var str = "äöüÄÖÜçéèñ";
var encoded = str.b64encode();
console.log( encoded.b64decode() );
NOTE:
As stated in the comments, using unescape is not recommended as it may be removed in the future:
Warning: Although unescape() is not strictly deprecated (as in "removed from the Web standards"), it is defined in Annex B of the ECMA-262 standard, whose introduction states:
… All of the language features and behaviours specified in this annex have one or more undesirable characteristics and in the absence of legacy usage would be removed from this specification.
Note: Do not use unescape to decode URIs, use decodeURI or decodeURIComponent instead.

btoa() only support characters from String.fromCodePoint(0) up to String.fromCodePoint(255). For Base64 characters with a code point 256 or higher you need to encode/decode these before and after.
And in this point it becomes tricky...
Every possible sign are arranged in a Unicode-Table. The Unicode-Table is divided in different planes (languages, math symbols, and so on...). Every sign in a plane has a unique code point number. Theoretically, the number can become arbitrarily large.
A computer stores the data in bytes (8 bit, hexadecimal 0x00 - 0xff, binary 00000000 - 11111111, decimal 0 - 255). This range normally use to save basic characters (Latin1 range).
For characters with higher codepoint then 255 exist different encodings. JavaScript use 16 bits per sign (UTF-16), the string called DOMString. Unicode can handle code points up to 0x10fffff. That means, that a method must be exist to store several bits over several cells away.
String.fromCodePoint(0x10000).length == 2
UTF-16 use surrogate pairs to store 20bits in two 16bit cells. The first higher surrogate begins with 110110xxxxxxxxxx, the lower second one with 110111xxxxxxxxxx. Unicode reserved own planes for this: https://unicode-table.com/de/#high-surrogates
To store characters in bytes (Latin1 range) standardized procedures use UTF-8.
Sorry to say that, but I think there is no other way to implement this function self.
function stringToUTF8(str)
{
let bytes = [];
for(let character of str)
{
let code = character.codePointAt(0);
if(code <= 127)
{
let byte1 = code;
bytes.push(byte1);
}
else if(code <= 2047)
{
let byte1 = 0xC0 | (code >> 6);
let byte2 = 0x80 | (code & 0x3F);
bytes.push(byte1, byte2);
}
else if(code <= 65535)
{
let byte1 = 0xE0 | (code >> 12);
let byte2 = 0x80 | ((code >> 6) & 0x3F);
let byte3 = 0x80 | (code & 0x3F);
bytes.push(byte1, byte2, byte3);
}
else if(code <= 2097151)
{
let byte1 = 0xF0 | (code >> 18);
let byte2 = 0x80 | ((code >> 12) & 0x3F);
let byte3 = 0x80 | ((code >> 6) & 0x3F);
let byte4 = 0x80 | (code & 0x3F);
bytes.push(byte1, byte2, byte3, byte4);
}
}
return bytes;
}
function utf8ToString(bytes, fallback)
{
let valid = undefined;
let codePoint = undefined;
let codeBlocks = [0, 0, 0, 0];
let result = "";
for(let offset = 0; offset < bytes.length; offset++)
{
let byte = bytes[offset];
if((byte & 0x80) == 0x00)
{
codeBlocks[0] = byte & 0x7F;
codePoint = codeBlocks[0];
}
else if((byte & 0xE0) == 0xC0)
{
codeBlocks[0] = byte & 0x1F;
byte = bytes[++offset];
if(offset >= bytes.length || (byte & 0xC0) != 0x80) { valid = false; break; }
codeBlocks[1] = byte & 0x3F;
codePoint = (codeBlocks[0] << 6) + codeBlocks[1];
}
else if((byte & 0xF0) == 0xE0)
{
codeBlocks[0] = byte & 0xF;
for(let blockIndex = 1; blockIndex <= 2; blockIndex++)
{
byte = bytes[++offset];
if(offset >= bytes.length || (byte & 0xC0) != 0x80) { valid = false; break; }
codeBlocks[blockIndex] = byte & 0x3F;
}
if(valid === false) { break; }
codePoint = (codeBlocks[0] << 12) + (codeBlocks[1] << 6) + codeBlocks[2];
}
else if((byte & 0xF8) == 0xF0)
{
codeBlocks[0] = byte & 0x7;
for(let blockIndex = 1; blockIndex <= 3; blockIndex++)
{
byte = bytes[++offset];
if(offset >= bytes.length || (byte & 0xC0) != 0x80) { valid = false; break; }
codeBlocks[blockIndex] = byte & 0x3F;
}
if(valid === false) { break; }
codePoint = (codeBlocks[0] << 18) + (codeBlocks[1] << 12) + (codeBlocks[2] << 6) + (codeBlocks[3]);
}
else
{
valid = false; break;
}
result += String.fromCodePoint(codePoint);
}
if(valid === false)
{
if(!fallback)
{
throw new TypeError("Malformed utf-8 encoding.");
}
result = "";
for(let offset = 0; offset != bytes.length; offset++)
{
result += String.fromCharCode(bytes[offset] & 0xFF);
}
}
return result;
}
function decodeBase64(text, binary)
{
if(/[^0-9a-zA-Z\+\/\=]/.test(text)) { throw new TypeError("The string to be decoded contains characters outside of the valid base64 range."); }
let codePointA = 'A'.codePointAt(0);
let codePointZ = 'Z'.codePointAt(0);
let codePointa = 'a'.codePointAt(0);
let codePointz = 'z'.codePointAt(0);
let codePointZero = '0'.codePointAt(0);
let codePointNine = '9'.codePointAt(0);
let codePointPlus = '+'.codePointAt(0);
let codePointSlash = '/'.codePointAt(0);
function getCodeFromKey(key)
{
let keyCode = key.codePointAt(0);
if(keyCode >= codePointA && keyCode <= codePointZ)
{
return keyCode - codePointA;
}
else if(keyCode >= codePointa && keyCode <= codePointz)
{
return keyCode + 26 - codePointa;
}
else if(keyCode >= codePointZero && keyCode <= codePointNine)
{
return keyCode + 52 - codePointZero;
}
else if(keyCode == codePointPlus)
{
return 62;
}
else if(keyCode == codePointSlash)
{
return 63;
}
return undefined;
}
let codes = Array.from(text).map(character => getCodeFromKey(character));
let bytesLength = Math.ceil(codes.length / 4) * 3;
if(codes[codes.length - 2] == undefined) { bytesLength = bytesLength - 2; } else if(codes[codes.length - 1] == undefined) { bytesLength--; }
let bytes = new Uint8Array(bytesLength);
for(let offset = 0, index = 0; offset < bytes.length;)
{
let code1 = codes[index++];
let code2 = codes[index++];
let code3 = codes[index++];
let code4 = codes[index++];
let byte1 = (code1 << 2) | (code2 >> 4);
let byte2 = ((code2 & 0xf) << 4) | (code3 >> 2);
let byte3 = ((code3 & 0x3) << 6) | code4;
bytes[offset++] = byte1;
bytes[offset++] = byte2;
bytes[offset++] = byte3;
}
if(binary) { return bytes; }
return utf8ToString(bytes, true);
}
function encodeBase64(bytes) {
if (bytes === undefined || bytes === null) {
return '';
}
if (bytes instanceof Array) {
bytes = bytes.filter(item => {
return Number.isFinite(item) && item >= 0 && item <= 255;
});
}
if (
!(
bytes instanceof Uint8Array ||
bytes instanceof Uint8ClampedArray ||
bytes instanceof Array
)
) {
if (typeof bytes === 'string') {
const str = bytes;
bytes = Array.from(unescape(encodeURIComponent(str))).map(ch =>
ch.codePointAt(0)
);
} else {
throw new TypeError('bytes must be of type Uint8Array or String.');
}
}
const keys = [
'A',
'B',
'C',
'D',
'E',
'F',
'G',
'H',
'I',
'J',
'K',
'L',
'M',
'N',
'O',
'P',
'Q',
'R',
'S',
'T',
'U',
'V',
'W',
'X',
'Y',
'Z',
'a',
'b',
'c',
'd',
'e',
'f',
'g',
'h',
'i',
'j',
'k',
'l',
'm',
'n',
'o',
'p',
'q',
'r',
's',
't',
'u',
'v',
'w',
'x',
'y',
'z',
'0',
'1',
'2',
'3',
'4',
'5',
'6',
'7',
'8',
'9',
'+',
'/'
];
const fillKey = '=';
let byte1;
let byte2;
let byte3;
let sign1 = ' ';
let sign2 = ' ';
let sign3 = ' ';
let sign4 = ' ';
let result = '';
for (let index = 0; index < bytes.length; ) {
let fillUpAt = 0;
// tslint:disable:no-increment-decrement
byte1 = bytes[index++];
byte2 = bytes[index++];
byte3 = bytes[index++];
if (byte2 === undefined) {
byte2 = 0;
fillUpAt = 2;
}
if (byte3 === undefined) {
byte3 = 0;
if (!fillUpAt) {
fillUpAt = 3;
}
}
// tslint:disable:no-bitwise
sign1 = keys[byte1 >> 2];
sign2 = keys[((byte1 & 0x3) << 4) + (byte2 >> 4)];
sign3 = keys[((byte2 & 0xf) << 2) + (byte3 >> 6)];
sign4 = keys[byte3 & 0x3f];
if (fillUpAt > 0) {
if (fillUpAt <= 2) {
sign3 = fillKey;
}
if (fillUpAt <= 3) {
sign4 = fillKey;
}
}
result += sign1 + sign2 + sign3 + sign4;
if (fillUpAt) {
break;
}
}
return result;
}
let base64 = encodeBase64("\u{1F604}"); // unicode code point escapes for smiley
let str = decodeBase64(base64);
console.log("base64", base64);
console.log("str", str);
document.body.innerText = str;
how to use it: decodeBase64(encodeBase64("\u{1F604}"))
demo: https://jsfiddle.net/qrLadeb8/

Another solution for browser without using unescape:
function ToBinary(str)
{
let result="";
str=encodeURIComponent(str);
for(let i=0;i<str.length;i++)
if(str[i]=="%")
{
result+=String.fromCharCode(parseInt(str.substring(i+1,i+3),16));
i+=2;
}
else
result+=str[i];
return result;
}
btoa(ToBinary("тест"));//0YLQtdGB0YI=

I just ran into this problem myself.
First, modify your code slightly:
var download = "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
+"<"+this.gamesave.tagName+">"
+this.xml.firstChild.innerHTML
+"</"+this.gamesave.tagName+">";
this.loader.src = "data:application/x-forcedownload;base64,"+
btoa(download);
Then use your favorite web inspector, put a breakpoint on the line of code that assigns this.loader.src, then execute this code:
for (var i = 0; i < download.length; i++) {
if (download[i].charCodeAt(0) > 255) {
console.warn('found character ' + download[i].charCodeAt(0) + ' "' + download[i] + '" at position ' + i);
}
}
Depending on your application, replacing the characters that are out of range may or may not work, since you'll be modifying the data. See the note on MDN about unicode characters with the btoa method:
https://developer.mozilla.org/en-US/docs/Web/API/window.btoa

A solution that converts the string to utf-8, which is slightly shorter than the utf-16 or URLEncoded versions many of the other answers suggest. It's also more compatible with how other languages like python and PHP would decode the strings:
Encode
function btoa_utf8(value) {
return btoa(
String.fromCharCode(
...new TextEncoder('utf-8')
.encode(hash_value)
)
);
}
Decode
function atob_utf8(value) {
const value_latin1 = atob(value);
return new TextDecoder('utf-8').decode(
Uint8Array.from(
{ length: value_latin1.length },
(element, index) => value_latin1.charCodeAt(index)
)
)
}
You can replace the 'utf-8' string in either of these with a different character encoding if you prefer.
Note This depends on the TextEncoder class. This is supported in most browsers nowadays but if you need to target older browsers check if it's available.

Develop Reference

JavaScript is the programming language of the Web.

Russian characters converting to binary incorrectly (JavaScript) - javascript

Related

How to convert a string to base64 encoding using byte array in JavaScript?

Is there a way to redefine the standard Ascii character set that Javascript charCodeAt and fromCharCode calls from within a function?

Javascript: unicode character to BYTE based hex escape sequence (NOT surrogates)

javascript convert to utf8 the result of readAsBinaryString

Failed to execute 'btoa' on 'Window': The string to be encoded contains characters outside of the Latin1 range.

Categories

Resources