JavaScript: Unicode character to byte-based hex escape sequence (NOT surrogates)

In JavaScript I am trying to turn Unicode characters into byte-based hex escape sequences that are compatible with C:
i.e. 😄
becomes: \xF0\x9F\x98\x84 (correct)
NOT JavaScript surrogates, not \uD83D\uDE04 (wrong)
I cannot figure out the math relationship between the four bytes C wants and the two surrogates JavaScript uses. I suspect the algorithm is far more complex than my feeble attempts.
Thanks for any tips.

encodeURIComponent can do the work:
var input = "\uD83D\uDE04";
var result = encodeURIComponent(input).replace(/%/g, "\\x"); // \xF0\x9F\x98\x84
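One thing to be aware of: encodeURIComponent leaves unreserved ASCII characters (letters, digits, and -_.!~*'()) untouched, so they appear literally in the result:
var mixed = "a-😄";
console.log(encodeURIComponent(mixed).replace(/%/g, "\\x")); // "a-\xF0\x9F\x98\x84"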
Update: Actually, C strings can contain digits and letters without escaping, but if you really need to escape everything:
function escape(s, escapeEverything) {
  if (escapeEverything) {
    // Temporarily mark ASCII characters in the \x10-\x7f range with "-x";
    // the "-" is turned into "\" at the end, after encodeURIComponent has run.
    s = s.replace(/[\x10-\x7f]/g, function (c) {
      return "-x" + c.charCodeAt(0).toString(16).toUpperCase();
    });
  }
  s = encodeURIComponent(s).replace(/%/g, "\\x");
  if (escapeEverything) {
    s = s.replace(/\-/g, "\\");
  }
  return s;
}
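For example, calling it both ways:
console.log(escape("😄"));       // "\xF0\x9F\x98\x84"
console.log(escape("ab", true)); // "\x61\x62"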

Found a solution here: http://jonisalonen.com/2012/from-utf-16-to-utf-8-in-javascript/
I would have never figured out THAT math, wow.
Somewhat minified:
function UTF8seq(s) {
  var i, c, u = [];
  for (i = 0; i < s.length; i++) {
    c = s.charCodeAt(i);
    if (c < 0x80) { u.push(c); }
    else if (c < 0x800) { u.push(0xc0 | (c >> 6), 0x80 | (c & 0x3f)); }
    else if (c < 0xd800 || c >= 0xe000) { u.push(0xe0 | (c >> 12), 0x80 | ((c >> 6) & 0x3f), 0x80 | (c & 0x3f)); }
    else {
      // surrogate pair: combine the two UTF-16 units into one code point
      i++; c = 0x10000 + (((c & 0x3ff) << 10) | (s.charCodeAt(i) & 0x3ff));
      u.push(0xf0 | (c >> 18), 0x80 | ((c >> 12) & 0x3f), 0x80 | ((c >> 6) & 0x3f), 0x80 | (c & 0x3f));
    }
  }
  // pad each byte to two hex digits (e.g. 0x0a -> "0a", not "a")
  for (i = 0; i < u.length; i++) { u[i] = ('0' + u[i].toString(16)).slice(-2); }
  return '\\x' + u.join('\\x');
}
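A sample call:
console.log(UTF8seq("😄")); // "\xf0\x9f\x98\x84"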

Your C code expects a UTF-8 string (the symbol is represented as 4 bytes). The JS representation you see is UTF-16, however (the symbol is represented as two uint16 code units, a surrogate pair).
You will first need to get the (Unicode) code point for your symbol from the UTF-16 JS string, then build the UTF-8 representation for it from that.
Since ES6 you can use the codePointAt method for the first part; where it isn't supported, I would recommend a shim. I guess you don't want to decode surrogate pairs yourself :-)
For the rest, I don't think there's a library method, but you can write it yourself according to the spec:
function hex(x) {
  x = x.toString(16);
  return (x.length > 2 ? "\\u0000" : "\\x00").slice(0, -x.length) + x.toUpperCase();
}
var c = "😄";
console.log(c.length, hex(c.charCodeAt(0)) + hex(c.charCodeAt(1))); // 2, "\uD83D\uDE04"
var cp = c.codePointAt(0);
var bytes = new Uint8Array(4);
// Fill from the last byte upwards, taking 6 bits of the code point each time.
// (This assumes a code point above 0xFFFF, i.e. a 4-byte sequence.)
bytes[3] = 0x80 | cp & 0x3F;
bytes[2] = 0x80 | (cp >>>= 6) & 0x3F;
bytes[1] = 0x80 | (cp >>>= 6) & 0x3F;
bytes[0] = 0xF0 | (cp >>>= 6) & 0x3F;
console.log(Array.prototype.map.call(bytes, hex).join("")); // "\xF0\x9F\x98\x84"
(tested in Chrome)

Related

JavaScript equivalent of Java's String.getBytes(StandardCharsets.UTF_8)

I have the following Java code:
String str = "\u00A0";
byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
System.out.println(Arrays.toString(bytes));
This outputs the following byte array:
[-62, -96]
I am trying to get the same result in JavaScript. I have tried the solution posted here:
https://stackoverflow.com/a/51904484/12177456
function strToUtf8Bytes(str) {
  const utf8 = [];
  for (let ii = 0; ii < str.length; ii++) {
    let charCode = str.charCodeAt(ii);
    if (charCode < 0x80) utf8.push(charCode);
    else if (charCode < 0x800) {
      utf8.push(0xc0 | (charCode >> 6), 0x80 | (charCode & 0x3f));
    } else if (charCode < 0xd800 || charCode >= 0xe000) {
      utf8.push(0xe0 | (charCode >> 12), 0x80 | ((charCode >> 6) & 0x3f), 0x80 | (charCode & 0x3f));
    } else {
      ii++;
      // Surrogate pair:
      // UTF-16 encodes 0x10000-0x10FFFF by subtracting 0x10000 and
      // splitting the 20 bits of 0x0-0xFFFFF into two halves
      charCode = 0x10000 + (((charCode & 0x3ff) << 10) | (str.charCodeAt(ii) & 0x3ff));
      utf8.push(
        0xf0 | (charCode >> 18),
        0x80 | ((charCode >> 12) & 0x3f),
        0x80 | ((charCode >> 6) & 0x3f),
        0x80 | (charCode & 0x3f),
      );
    }
  }
  return utf8;
}
console.log(strToUtf8Bytes("\u00A0"));
But this gives the following (unsigned byte values, as in a https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Uint8Array):
[194, 160]
This is a problem for me as I'm using the GraalJS engine and need to pass the array to a Java function that expects a byte[], so any value in the array > 127 will cause an error, as described here:
https://github.com/oracle/graal/issues/2118
Note I also tried the TextEncoder class instead of the strToUtf8Bytes function as described here:
java string.getBytes("UTF-8") javascript equivalent
but it gives the same result as above.
Is there something else I can try here so that I can get JavaScript to generate the same array as Java?
The result is the same in terms of bytes; JS just defaults to unsigned bytes.
The U in Uint8Array stands for “unsigned”; the signed variant is called Int8Array.
The conversion is easy: just pass the result to the Int8Array constructor:
console.log(new Int8Array(new TextEncoder().encode("\u00a0"))); // Int8Array [ -62, -96 ]
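The same conversion works for the plain array returned by strToUtf8Bytes above, since constructing a typed array wraps each value into the signed 8-bit range:
console.log(Int8Array.from(strToUtf8Bytes("h\u00A0i"))); // Int8Array [ 104, -62, -96, 105 ]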

How to convert a string to base64 encoding using byte array in JavaScript?

I have the below .NET code that converts a string to Base64 encoding by first converting it to a byte array. I tried different answers on Stack Overflow to convert the string to a byte array and then used the btoa() function for Base64 encoding in JavaScript, but I'm not getting the expected encoded value, as shared below.
For string value,
BBFDC43D-4890-4558-BB89-50D802014A97
I need Base64 encoding as,
PcT9u5BIWEW7iVDYAgFKlw==
.NET code:
String str = "BBFDC43D-4890-4558-BB89-50D802014A97";
Guid guid = new Guid(str);
Console.WriteLine(guid); // bbfdc43d-4890-4558-bb89-50d802014a97
Byte[] bytes = guid.ToByteArray();
Console.WriteLine(bytes); // System.Byte[]
String s = Convert.ToBase64String(bytes, Base64FormattingOptions.InsertLineBreaks);
Console.WriteLine(s); // PcT9u5BIWEW7iVDYAgFKlw==
Currently, I am trying the below code, which does not produce the desired result:
function strToUtf8Bytes(str) {
  const utf8 = [];
  for (let ii = 0; ii < str.length; ii++) {
    let charCode = str.charCodeAt(ii);
    if (charCode < 0x80) utf8.push(charCode);
    else if (charCode < 0x800) {
      utf8.push(0xc0 | (charCode >> 6), 0x80 | (charCode & 0x3f));
    } else if (charCode < 0xd800 || charCode >= 0xe000) {
      utf8.push(0xe0 | (charCode >> 12), 0x80 | ((charCode >> 6) & 0x3f), 0x80 | (charCode & 0x3f));
    } else {
      ii++;
      // Surrogate pair:
      // UTF-16 encodes 0x10000-0x10FFFF by subtracting 0x10000 and
      // splitting the 20 bits of 0x0-0xFFFFF into two halves
      charCode = 0x10000 + (((charCode & 0x3ff) << 10) | (str.charCodeAt(ii) & 0x3ff));
      utf8.push(
        0xf0 | (charCode >> 18),
        0x80 | ((charCode >> 12) & 0x3f),
        0x80 | ((charCode >> 6) & 0x3f),
        0x80 | (charCode & 0x3f),
      );
    }
  }
  return utf8;
}
const str = "BBFDC43D-4890-4558-BB89-50D802014A97";
const strByteArr = strToUtf8Bytes(str);
const strBase64 = btoa(strByteArr);
// NjYsNjYsNzAsNjgsNjcsNTIsNTEsNjgsNDUsNTIsNTYsNTcsNDgsNDUsNTIsNTMsNTMsNTYsNDUsNjYsNjYsNTYsNTcsNDUsNTMsNDgsNjgsNTYsNDgsNTAsNDgsNDksNTIsNjUsNTcsNTU=
Your problem is caused by the following:
btoa() is using ASCII encoding
guid.ToByteArray() does not use ASCII encoding
(Note that btoa(strByteArr) first coerces the array to a comma-separated string, which is why you got the long NjYs… output above.)
If you modify your C# code like this:
String str = "BBFDC43D-4890-4558-BB89-50D802014A97";
//Guid guid = new Guid(str);
//Console.WriteLine(guid);
// bbfdc43d-4890-4558-bb89-50d802014a97
//Byte[] bytes = guid.ToByteArray();
byte[] bytes = System.Text.Encoding.ASCII.GetBytes(str);
//Console.WriteLine(bytes); // System.Byte[]
String s = Convert.ToBase64String(bytes, Base64FormattingOptions.InsertLineBreaks);
Console.WriteLine(s);
You will get the following output:
QkJGREM0M0QtNDg5MC00NTU4LUJCODktNTBEODAyMDE0QTk3
Which will be the same string as the one returned from the btoa() function:
var rawString = "BBFDC43D-4890-4558-BB89-50D802014A97";
var b64encoded = btoa(rawString);
console.log(b64encoded);
Output:
QkJGREM0M0QtNDg5MC00NTU4LUJCODktNTBEODAyMDE0QTk3
UPDATE - Since you can't modify the C# code
You should adapt your JavaScript code by combining Piotr's answer and this SO answer
function guidToBytes(guid) {
  var bytes = [];
  guid.split('-').forEach((group, index) => {
    // .NET's Guid.ToByteArray() stores the first three groups little-endian,
    // so their byte pairs are reversed; the last two groups stay big-endian.
    var bytesInGroup = index < 3 ? group.match(/.{1,2}/g).reverse() : group.match(/.{1,2}/g);
    bytesInGroup.forEach((byte) => { bytes.push(parseInt(byte, 16)); });
  });
  return bytes;
}
function arrayBufferToBase64(buffer) {
  var binary = '';
  var bytes = new Uint8Array(buffer);
  var len = bytes.byteLength;
  for (var i = 0; i < len; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}
var str = "BBFDC43D-4890-4558-BB89-50D802014A97";
var guidBytes = guidToBytes(str);
var b64encoded = arrayBufferToBase64(guidBytes);
console.log(b64encoded);
Output:
PcT9u5BIWEW7iVDYAgFKlw==
The problem with your code is the representation of the Guid. In the C# code you convert "BBFDC43D-4890-4558-BB89-50D802014A97" into a UUID, which is a 128-bit number. In the JavaScript code you do something else: you iterate through the string and calculate the byte array of the string. They are simply not equal.
Now you have two options:
Implement proper guid conversion in JS (this may help: https://gist.github.com/daboxu/4f1dd0a254326ac2361f8e78f89e97ae)
In C#, calculate the byte array the same way as in JS
Your string is a hexadecimal value, which you use to create a GUID. Then you convert the GUID into a byte array with:
Byte[] bytes = guid.ToByteArray();
The GUID is a 16-byte value which can be represented as a hexadecimal value. When you convert this GUID into a byte array, you will get the 16 bytes of the value, not the byte representation of the hexadecimal value.
In the provided JavaScript function you are doing something else: You are converting the string directly to a byte array.
In C# you do the equivalent with an Encoding:
String str = "BBFDC43D-4890-4558-BB89-50D802014A97";
Byte[] bytes = Encoding.UTF8.GetBytes(str);
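For completeness, the JavaScript counterpart of that C# snippet (assuming TextEncoder is available, as in browsers and Node) produces the same Base64 string as btoa on the raw string:
const utf8Bytes = new TextEncoder().encode("BBFDC43D-4890-4558-BB89-50D802014A97");
console.log(btoa(String.fromCharCode(...utf8Bytes))); // QkJGREM0M0QtNDg5MC00NTU4LUJCODktNTBEODAyMDE0QTk3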

Is there a way to redefine the standard ASCII character set that JavaScript's charCodeAt and fromCharCode calls use from within a function?

For encoding, JavaScript pulls from the standard ASCII table for mapping characters. I found the following function below that brilliantly and correctly encodes to ASCII85/Base85. But I want to encode to the Z85 variation because it contains the set of symbols that I require. My understanding is that the ASCII85/Base85 encoding should work exactly the same, except that Z85 maps the values in a different order from the ASCII standard, and uses a different combination of symbols from the standard ASCII85 mapping as well. So the character set is the only difference:
ASCII85 uses the 85 characters 33 through 117 (reference):
"!\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstu"
Z85 uses a custom set of 85 characters (reference):
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#
My question is, is there any way to redefine the character set that charCodeAt and fromCharCode refer to in this function so that it would then encode in Z85?
// By Steve Hanov. Released to the public domain.
function encodeAscii85(input) {
  // Adobe standard prefix omitted:
  // var output = "<~";
  var output = "";
  var chr1, chr2, chr3, chr4, chr, enc1, enc2, enc3, enc4, enc5;
  var i = 0;
  while (i < input.length) {
    // Access past the end of the string is intentional.
    chr1 = input.charCodeAt(i++);
    chr2 = input.charCodeAt(i++);
    chr3 = input.charCodeAt(i++);
    chr4 = input.charCodeAt(i++);
    chr = ((chr1 << 24) | (chr2 << 16) | (chr3 << 8) | chr4) >>> 0;
    enc1 = (chr / (85 * 85 * 85 * 85) | 0) % 85 + 33;
    enc2 = (chr / (85 * 85 * 85) | 0) % 85 + 33;
    enc3 = (chr / (85 * 85) | 0) % 85 + 33;
    enc4 = (chr / 85 | 0) % 85 + 33;
    enc5 = chr % 85 + 33;
    output += String.fromCharCode(enc1) + String.fromCharCode(enc2);
    if (!isNaN(chr2)) {
      output += String.fromCharCode(enc3);
      if (!isNaN(chr3)) {
        output += String.fromCharCode(enc4);
        if (!isNaN(chr4)) {
          output += String.fromCharCode(enc5);
        }
      }
    }
  }
  // Adobe standard suffix omitted:
  // output += "~>";
  return output;
}
Extra notes:
Alternately, I thought I could use something like the following function, but the problem is that it doesn't properly encode ASCII85 in the first place. If it were correct, Hello world! should encode to 87cURD]j7BEbo80, but this function encodes it to RZ!iCB=*gD0D5_+ (reference).
I don't understand the algorithm well enough to know what is wrong with the mapping here. Ideally, once it encodes correctly, I should be able to update this function to use the Z85 character set:
// Adapted from: Ascii85 JavaScript implementation, 2012.10.16 Jim Herrero
// Original: https://jsfiddle.net/nderscore/bbKS4/
var Ascii85 = {
  // Ascii85 mapping
  _alphabet: "!\"#$%&'()*+,-./0123456789:;<=>?@" +
             "ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`" +
             "abcdefghijklmnopqrstu" +
             "y" + // short form 4 spaces (optional)
             "z",  // short form 4 nulls (optional)
  // functions
  encode: function (input) {
    var alphabet = Ascii85._alphabet,
        useShort = alphabet.length > 85,
        output = "", buffer, val, i, j, l;
    for (i = 0, l = input.length; i < l;) {
      buffer = [0, 0, 0, 0];
      for (j = 0; j < 4; j++)
        if (input[i])
          buffer[j] = input.charCodeAt(i++);
      for (val = buffer[3], j = 2; j >= 0; j--)
        val = val * 256 + buffer[j];
      if (useShort && !val)
        output += alphabet[86];
      else if (useShort && val == 0x20202020)
        output += alphabet[85];
      else {
        for (j = 0; j < 5; j++) {
          output += alphabet[val % 85];
          val = Math.floor(val / 85);
        }
      }
    }
    return output;
  }
};
Character codes are character codes. You can't change the behavior of String.fromCharCode() or String.charCodeAt().
However, you can store your custom character set in an array and use array indexing and Array.indexOf() to look up entries.
Updating this function to work with Z85 will be tricky, though, because String.fromCharCode() and String.charCodeAt() are used in two different contexts -- they're sometimes used to access the unencoded string (which doesn't need to change), and sometimes for the encoded string (which does). You will need to take care to not confuse the two.
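A minimal sketch of that approach for the encoding direction, assuming the input is an array of bytes whose length is a multiple of 4 (which the Z85 spec requires); the test vector is the one from the ZeroMQ spec:
var Z85_ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz" +
                   "ABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#";
function encodeZ85(bytes) {
  if (bytes.length % 4 !== 0) throw new Error("Z85 input must be a multiple of 4 bytes");
  var output = "";
  for (var i = 0; i < bytes.length; i += 4) {
    // Big-endian 32-bit group; >>> 0 forces an unsigned result
    var val = ((bytes[i] << 24) | (bytes[i + 1] << 16) | (bytes[i + 2] << 8) | bytes[i + 3]) >>> 0;
    // Emit five base-85 digits, most significant first
    for (var div = 85 * 85 * 85 * 85; div >= 1; div = Math.floor(div / 85)) {
      output += Z85_ALPHABET[Math.floor(val / div) % 85];
    }
  }
  return output;
}
console.log(encodeZ85([0x86, 0x4F, 0xD2, 0x6F, 0xB5, 0x59, 0xF7, 0x5B])); // "HelloWorld"
Decoding would go the other way, using Z85_ALPHABET.indexOf(ch) to recover each digit's value.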

Russian characters converting to binary incorrectly (JavaScript)

I'm writing a program in JavaScript that needs to convert text to 8-bit binary, which I accomplish with a loop that uses "exampleVariable.charCodeAt(i).toString(2)", then appends "0"s to the front until the length of the binary representation of each character is 8 bits. However, when Russian characters are passed into the function, each character is converted to an 11-bit binary representation, when it should actually be 16 bits. For example, "д" converts to 10000110100, when, in actuality, it should convert to "1101000010110100". Any ideas on how to fix this?
It looks like you are trying to get the binary representation of the UTF-8 representation of the character. JavaScript uses UTF-16 internally, so you will have to do some work to do the translation. There are various libraries out there; we'd need to know more about the environment to recommend the right tools. If you wanted to code it up yourself, it would be roughly:
function codepointToUTF_8(code) {
  if (code < 0x80) {
    return [code];
  } else if (code < 0x800) {
    var byte1 = 0xc0 | (code >> 6);
    var byte2 = 0x80 | (code & 0x3f);
    return [byte1, byte2];
  } else if (code < 0x10000) {
    var byte1 = 0xe0 | (code >> 12);
    var byte2 = 0x80 | ((code >> 6) & 0x3f);
    var byte3 = 0x80 | (code & 0x3f);
    return [byte1, byte2, byte3];
  } else {
    var byte1 = 0xf0 | (code >> 18);
    var byte2 = 0x80 | ((code >> 12) & 0x3f);
    var byte3 = 0x80 | ((code >> 6) & 0x3f);
    var byte4 = 0x80 | (code & 0x3f);
    return [byte1, byte2, byte3, byte4];
  }
}
function strToUTF_8(str) {
  var result = [];
  for (var i = 0; i < str.length; i++) {
    // NOTE: this will not handle anything beyond the BMP
    result.push(codepointToUTF_8(str.charCodeAt(i)));
  }
  return [].concat.apply([], result);
}
function byteToBinary(b) {
  var str = b.toString(2);
  while (str.length < 8) {
    str = '0' + str;
  }
  return str;
}
function toBinaryUTF_8(str) {
  return strToUTF_8(str).map(byteToBinary).join(' ');
}
console.log("абвгд => '" + toBinaryUTF_8("абвгд") + "'");
When I execute this I get:
абвгд => '11010000 10110000 11010000 10110001 11010000 10110010 11010000 10110011 11010000 10110100'
I haven't tested this thoroughly, but it should handle the Russian characters OK. It produces an array of byte values; if you format those as you were doing before, with 8 binary digits per byte, you should be fine.
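If input beyond the BMP ever matters, one way to lift that limitation (a sketch assuming an ES6 environment, since it relies on for...of and codePointAt) is to iterate by code point:
function strToUTF_8_full(str) {
  var result = [];
  // for...of walks the string by code point, so a surrogate pair arrives as one character
  for (var ch of str) {
    result.push(codepointToUTF_8(ch.codePointAt(0)));
  }
  return [].concat.apply([], result);
}
console.log(strToUTF_8_full("😄")); // [240, 159, 152, 132]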

RSA Encrypt String longer than key

Hi guys,
I'm encrypting strings with JS and a public key. I use the JSBN code, but the problem is in creating a BigInt.
RSA.js does:
// Copyright (c) 2005 Tom Wu
var m = pkcs1pad2(text,(this.n.bitLength()+7)>>3); // getting the BigInt
if(m == null) return null; // nullcheck
var c = this.doPublic(m); // encrypting the BigInt
The problem is in pkcs1pad2. The function has a check: if the text is longer than the bit length of the key allows, it returns; otherwise it creates the BigInt.
// Copyright (c) 2005 Tom Wu
if(n < s.length + 11) { // TODO: fix for utf-8
alert("Message too long for RSA");
return null;
}
var ba = new Array();
var i = s.length - 1;
while(i >= 0 && n > 0) {
var c = s.charCodeAt(i--);
if(c < 128) { // encode using utf-8
ba[--n] = c;
} else if((c > 127) && (c < 2048)) {
ba[--n] = (c & 63) | 128;
ba[--n] = (c >> 6) | 192;
} else {
ba[--n] = (c & 63) | 128;
ba[--n] = ((c >> 6) & 63) | 128;
ba[--n] = (c >> 12) | 224;
}
}
ba[--n] = 0;
var rng = new SecureRandom();
var x = new Array();
while(n > 2) { // random non-zero pad
x[0] = 0;
while(x[0] == 0) rng.nextBytes(x);
ba[--n] = x[0];
}
ba[--n] = 2;
ba[--n] = 0;
return new BigInteger(ba);
I can't figure out what the author means by "// TODO: fix for utf-8". Can anyone explain this and/or give a working answer?
This tries to implement PKCS#1 v1.5 padding as defined in PKCS#1. The input of PKCS#1 v1.5 padding for encryption must be at least 11 bytes smaller than the size of the modulus:
M : message to be encrypted, an octet string of length mLen,
where mLen <= k - 11
If the message is larger, then RSA encryption cannot continue. Usually that is not a problem: you encrypt using a symmetric cipher with a random key and then encrypt that random key using RSA. This is called a hybrid cryptosystem.
The reason why the comment is there is that JavaScript - like many scripting languages - has some trouble distinguishing between bytes and text. If s is text, then s.length is likely to return the number of characters, not bytes. This is deliberate in languages that implement weak typing, but it doesn't make it easier to create and use cryptographic algorithms in JavaScript.
If you use a multi-byte encoding such as UTF-8, then characters that encode to 2 or more bytes will increase the total plaintext size (in bytes), so the length calculation may fail at that point. Either that, or you will lose the characters that cannot be encoded in a single byte - from a quick look at the code, that is what will likely happen if you go beyond the ASCII (7-bit, values 0..127) range of characters.
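If you do want the length check to be byte-accurate, one option (a sketch assuming TextEncoder is available, which JSBN itself predates) is to measure the UTF-8 size up front:
var text = "привет";
var byteLength = new TextEncoder().encode(text).length;
console.log(text.length, byteLength); // 6 12
// Compare byteLength (not text.length) against n - 11 before padding.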
