I have the following Java code:
String str = "\u00A0";
byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
System.out.println(Arrays.toString(bytes));
This outputs the following byte array:
[-62, -96]
I am trying to get the same result in Javascript. I have tried the solution posted here:
https://stackoverflow.com/a/51904484/12177456
function strToUtf8Bytes(str) {
const utf8 = [];
for (let ii = 0; ii < str.length; ii++) {
let charCode = str.charCodeAt(ii);
if (charCode < 0x80) utf8.push(charCode);
else if (charCode < 0x800) {
utf8.push(0xc0 | (charCode >> 6), 0x80 | (charCode & 0x3f));
} else if (charCode < 0xd800 || charCode >= 0xe000) {
utf8.push(0xe0 | (charCode >> 12), 0x80 | ((charCode >> 6) & 0x3f), 0x80 | (charCode & 0x3f));
} else {
ii++;
// Surrogate pair:
// UTF-16 encodes 0x10000-0x10FFFF by subtracting 0x10000 and
// splitting the 20 bits of 0x0-0xFFFFF into two halves
charCode = 0x10000 + (((charCode & 0x3ff) << 10) | (str.charCodeAt(ii) & 0x3ff));
utf8.push(
0xf0 | (charCode >> 18),
0x80 | ((charCode >> 12) & 0x3f),
0x80 | ((charCode >> 6) & 0x3f),
0x80 | (charCode & 0x3f),
);
}
}
return utf8;
}
console.log(strToUtf8Bytes("\u00A0"));
But this gives the following (unsigned values, the same as a https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Uint8Array would hold):
[194, 160]
This is a problem for me as I'm using the graal js engine, and need to pass the array to a java function that expects a byte[], so any value in the array > 127 will cause an error, as described here:
https://github.com/oracle/graal/issues/2118
Note I also tried the TextEncoder class instead of the strToUtf8Bytes function as described here:
java string.getBytes("UTF-8") javascript equivalent
but it gives the same result as above.
Is there something else I can try here so that I can get JavaScript to generate the same array as Java?
The result is the same in terms of bytes; JavaScript just defaults to unsigned bytes.
U in Uint8Array stands for “unsigned”; the signed variant is called Int8Array.
The conversion is easy: just pass the result to the Int8Array constructor:
console.log(new Int8Array(new TextEncoder().encode("\u00a0"))); // Int8Array [ -62, -96 ]
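If you need a plain JavaScript array of signed values rather than a typed array, the same mapping can be done by hand; a minimal sketch (the helper name is mine):
function toSignedBytes(unsignedBytes) {
  // map 128..255 down to -128..-1, matching Java's signed byte range
  return unsignedBytes.map(function (b) { return b > 127 ? b - 256 : b; });
}
console.log(toSignedBytes(strToUtf8Bytes("\u00A0"))); // [ -62, -96 ]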
Related
In Python, I have the following code to store bytes in a variable:
K = b"\x00" * 32
I was trying to write a JavaScript equivalent of this code to get a byte string, using the following code:
function toUTF8Array(str) {
var utf8 = [];
for (var i = 0; i < str.length; i++) {
var charcode = str.charCodeAt(i);
if (charcode < 0x80) utf8.push(charcode);
else if (charcode < 0x800) {
utf8.push(0xc0 | (charcode >> 6), 0x80 | (charcode & 0x3f));
} else if (charcode < 0xd800 || charcode >= 0xe000) {
utf8.push(
0xe0 | (charcode >> 12),
0x80 | ((charcode >> 6) & 0x3f),
0x80 | (charcode & 0x3f)
);
}
// surrogate pair
else {
i++;
// UTF-16 encodes 0x10000-0x10FFFF by
// subtracting 0x10000 and splitting the
// 20 bits of 0x0-0xFFFFF into two halves
charcode =
0x10000 + (((charcode & 0x3ff) << 10) | (str.charCodeAt(i) & 0x3ff));
utf8.push(
0xf0 | (charcode >> 18),
0x80 | ((charcode >> 12) & 0x3f),
0x80 | ((charcode >> 6) & 0x3f),
0x80 | (charcode & 0x3f)
);
}
}
return utf8;
}
But it is generating a byte array as follows:
[
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0 ]
I don't want the output as a plain array of numbers; I need a bytes output like Python's (i.e. like the result of K = b"\x00" * 32).
How can I achieve that?
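For what it's worth, the closest JavaScript analogue of Python's bytes object is a typed array; a minimal sketch, assuming 32 zero bytes is all that's needed:
// 32 zero bytes, analogous to Python's K = b"\x00" * 32
const K = new Uint8Array(32);
console.log(K); // Uint8Array(32) [ 0, 0, 0, ... ]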
I have the .NET code below, which converts a string to Base64 encoding by first converting it to a byte array. I tried different Stack Overflow answers to convert the string into a byte array and then used the btoa() function for Base64 encoding in JavaScript, but I'm not getting the exact encoded value shared below.
For string value,
BBFDC43D-4890-4558-BB89-50D802014A97
I need Base64 encoding as,
PcT9u5BIWEW7iVDYAgFKlw==
.NET code:
String str = "BBFDC43D-4890-4558-BB89-50D802014A97"
Guid guid = new Guid(str);
Console.WriteLine(guid); // bbfdc43d-4890-4558-bb89-50d802014a97
Byte[] bytes = guid.ToByteArray();
Console.WriteLine(bytes); // System.Byte[]
String s = Convert.ToBase64String(bytes, Base64FormattingOptions.InsertLineBreaks);
Console.WriteLine(s); // PcT9u5BIWEW7iVDYAgFKlw==
Currently, I am trying the code below, which is not producing the desired result:
function strToUtf8Bytes(str) {
const utf8 = [];
for (let ii = 0; ii < str.length; ii++) {
let charCode = str.charCodeAt(ii);
if (charCode < 0x80) utf8.push(charCode);
else if (charCode < 0x800) {
utf8.push(0xc0 | (charCode >> 6), 0x80 | (charCode & 0x3f));
} else if (charCode < 0xd800 || charCode >= 0xe000) {
utf8.push(0xe0 | (charCode >> 12), 0x80 | ((charCode >> 6) & 0x3f), 0x80 | (charCode & 0x3f));
} else {
ii++;
// Surrogate pair:
// UTF-16 encodes 0x10000-0x10FFFF by subtracting 0x10000 and
// splitting the 20 bits of 0x0-0xFFFFF into two halves
charCode = 0x10000 + (((charCode & 0x3ff) << 10) | (str.charCodeAt(ii) & 0x3ff));
utf8.push(
0xf0 | (charCode >> 18),
0x80 | ((charCode >> 12) & 0x3f),
0x80 | ((charCode >> 6) & 0x3f),
0x80 | (charCode & 0x3f),
);
}
}
return utf8;
}
const str = "BBFDC43D-4890-4558-BB89-50D802014A97";
const strByteArr = strToUtf8Bytes(str);
const strBase64 = btoa(strByteArr);
// NjYsNjYsNzAsNjgsNjcsNTIsNTEsNjgsNDUsNTIsNTYsNTcsNDgsNDUsNTIsNTMsNTMsNTYsNDUsNjYsNjYsNTYsNTcsNDUsNTMsNDgsNjgsNTYsNDgsNTAsNDgsNDksNTIsNjUsNTcsNTU=
Your problem is caused by the following:
btoa() is using ASCII encoding
guid.ToByteArray() does not use ASCII encoding
If you modify your C# code like this:
String str = "BBFDC43D-4890-4558-BB89-50D802014A97";
//Guid guid = new Guid(str);
//Console.WriteLine(guid);
// bbfdc43d-4890-4558-bb89-50d802014a97
//Byte[] bytes = guid.ToByteArray();
byte[] bytes = System.Text.Encoding.ASCII.GetBytes(str);
//Console.WriteLine(bytes); // System.Byte[]
String s = Convert.ToBase64String(bytes, Base64FormattingOptions.InsertLineBreaks);
Console.WriteLine(s);
You will get the following output:
QkJGREM0M0QtNDg5MC00NTU4LUJCODktNTBEODAyMDE0QTk3
Which will be the same string as the one returned from the btoa() function:
var rawString = "BBFDC43D-4890-4558-BB89-50D802014A97";
var b64encoded = btoa(rawString);
console.log(b64encoded);
Output:
QkJGREM0M0QtNDg5MC00NTU4LUJCODktNTBEODAyMDE0QTk3
UPDATE - Since you can't modify the C# code
You should adapt your JavaScript code by combining Piotr's answer and this SO answer:
function guidToBytes(guid) {
var bytes = [];
guid.split('-').map((number, index) => {
var bytesInChar = index < 3 ? number.match(/.{1,2}/g).reverse() : number.match(/.{1,2}/g);
bytesInChar.map((byte) => { bytes.push(parseInt(byte, 16)); })
});
return bytes;
}
function arrayBufferToBase64(buffer) {
var binary = '';
var bytes = new Uint8Array(buffer);
var len = bytes.byteLength;
for (var i = 0; i < len; i++) {
binary += String.fromCharCode(bytes[i]);
}
return btoa(binary);
}
var str = "BBFDC43D-4890-4558-BB89-50D802014A97";
var guidBytes = guidToBytes(str);
var b64encoded = arrayBufferToBase64(guidBytes);
console.log(b64encoded);
Output:
PcT9u5BIWEW7iVDYAgFKlw==
The problem with your code is the representation of the Guid. In the C# code you are converting "BBFDC43D-4890-4558-BB89-50D802014A97" into a UUID, which is a 128-bit number. In the JavaScript code you are doing something else: you iterate through the string and calculate the byte array of the string itself. They are simply not equal.
Now you have two options:
1. Implement proper GUID conversion in JS (this may help: https://gist.github.com/daboxu/4f1dd0a254326ac2361f8e78f89e97ae)
2. In C#, calculate the byte array in the same way as in JS
Your string is a hexadecimal value, which you use to create a GUID. Then you convert the GUID into a byte array with:
Byte[] bytes = guid.ToByteArray();
The GUID is a 16-byte value which can be represented as a hexadecimal value. When you convert this GUID into a byte array, you will get the 16 bytes of the value, not the byte representation of the hexadecimal value.
In the provided JavaScript function you are doing something else: You are converting the string directly to a byte array.
In C# you do the equivalent with an Encoding:
String str = "BBFDC43D-4890-4558-BB89-50D802014A97";
Byte[] bytes = Encoding.UTF8.GetBytes(str);
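For comparison, the JavaScript that mirrors Encoding.UTF8.GetBytes followed by Convert.ToBase64String would look roughly like this (a sketch of the string-to-bytes route, not of the GUID conversion the question ultimately needs):
const str = "BBFDC43D-4890-4558-BB89-50D802014A97";
const bytes = new TextEncoder().encode(str);     // UTF-8 bytes of the string itself
const b64 = btoa(String.fromCharCode(...bytes)); // Base64 of those bytes
console.log(b64); // QkJGREM0M0QtNDg5MC00NTU4LUJCODktNTBEODAyMDE0QTk3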
I have a piece of data returned from a (serial) device. It contains 4 bytes of information that I need to turn into a usable, human-readable value.
I tried a lot of code examples from the Internet, but I can't get a grip on it.
Here is an example of an outcome, but I have no formula for how to get this in JavaScript:
34 32 33 39 37 30 41 34 Bus voltage-float: A4703942=46.36
// ( so Voltage is 46.36 )
How do I get this from the Hex A4703942 in JavaScript?
I know it has something to do with a float, little or big endian... yes?
Check the URL below:
http://babbage.cs.qc.cuny.edu/IEEE-754.old/32bit.html
On that page, enter the string 423970A4, which is your byte string reversed (because of endianness), and click Compute; you will get 46.36.
The JavaScript found in that page's source will help you with this conversion.
To answer my own question, more or less:
//extract usable data from the returned Hex
function hex2float(num) {
var sign = (num & 0x80000000) ? -1 : 1;
var exponent = ((num >> 23) & 0xff) - 127;
var mantissa = 1 + ((num & 0x7fffff) / 0x800000); // 23 fraction bits, scaled by 2^23
return sign * mantissa * Math.pow(2, exponent);
}
//round to two decimal places
function roundToTwo(num) {
return +(Math.round(num + "e+2") + "e-2");
}
console.log(roundToTwo(hex2float("0x" + yourHexInput)));
(Sometimes the bytes/hex need to be flipped first to become the right input.)
For example:
function swap32(val) {
return ((val & 0xFF) << 24)
| ((val & 0xFF00) << 8)
| ((val >> 8) & 0xFF00)
| ((val >> 24) & 0xFF);
}
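A DataView handles both the float decoding and the byte order in one place; a minimal sketch (the helper name is mine), assuming the hex arrives in the little-endian order shown by the device:
function hexToFloatLE(hex) {
  // "A4703942" -> [0xA4, 0x70, 0x39, 0x42]
  var bytes = hex.match(/.{2}/g).map(function (h) { return parseInt(h, 16); });
  // read the 4 bytes as a little-endian IEEE 754 single-precision float
  return new DataView(new Uint8Array(bytes).buffer).getFloat32(0, true);
}
console.log(roundToTwo(hexToFloatLE("A4703942"))); // 46.36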
I'm writing a program in JavaScript that needs to convert text to 8-bit binary, which I accomplish with a loop that uses "exampleVariable.charCodeAt(i).toString(2)", then appends "0"s to the front until the length of the binary representation of each character is 8 bits. However, when Russian characters are passed into the function, each character is converted to an 11-bit binary representation, when it should actually be 16 bits. For example, "д" converts to 10000110100, when, in actuality, it should convert to "1101000010110100". Any ideas on how to fix this?
It looks like you are trying to get the binary representation of the UTF-8 encoding of the character. JavaScript uses UTF-16 internally, so you will have to do some work to do the translation. There are various libraries out there; we'd need to know more about the environment to recommend the right tools. If you wanted to code it up yourself, it would be roughly:
function codepointToUTF_8(code) {
if (code < 0x80) { // single-byte (ASCII) range, including 0x7f
return [code];
} else if (code < 0x800) {
var byte1 = 0xc0 | (code >> 6 );
var byte2 = 0x80 | (code & 0x3f);
return [ byte1, byte2 ];
} else if (code < 0x10000) {
var byte1 = 0xe0 | ( code >> 12 );
var byte2 = 0x80 | ((code >> 6 ) & 0x3f);
var byte3 = 0x80 | ( code & 0x3f);
return [ byte1, byte2, byte3 ];
} else {
var byte1 = 0xf0 | ( code >> 18 );
var byte2 = 0x80 | ((code >> 12) & 0x3f);
var byte3 = 0x80 | ((code >> 6) & 0x3f);
var byte4 = 0x80 | ( code & 0x3f);
return [ byte1, byte2, byte3, byte4 ];
}
}
function strToUTF_8 (str) {
var result = [];
for (var i = 0; i < str.length; i++) {
// NOTE: this will not handle anything beyond the BMP
result.push(codepointToUTF_8(str.charCodeAt(i)));
}
console.log('result = ', result);
return [].concat.apply([], result);
}
function byteToBinary (b) {
var str = b.toString(2);
while (str.length < 8) {
str = '0' + str;
}
return str;
}
function toBinaryUTF_8 (str) {
return strToUTF_8(str).map(byteToBinary).join(' ');
}
console.log("абвгд => '" + toBinaryUTF_8("абвгд") + "'");
When I execute this I get:
абвгд => '11010000 10110000 11010000 10110001 11010000 10110010 11010000 10110011 11010000 10110100'
I haven't tested this thoroughly, but it should handle the Russian characters OK. It produces an array of byte values; if you format each of them with 8 binary digits, as you were doing before, you should be fine.
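If characters beyond the BMP ever matter, one way to adapt strToUTF_8 is to iterate by code point instead of by UTF-16 code unit; a rough sketch (not tested beyond the example above):
function strToUTF_8_full(str) {
  var result = [];
  // for...of walks the string by code point, so a surrogate pair arrives as one character
  for (var ch of str) {
    result.push(codepointToUTF_8(ch.codePointAt(0)));
  }
  return [].concat.apply([], result);
}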
In JavaScript I am trying to turn Unicode into byte-based hex escape sequences that are compatible with C:
ie. 😄
becomes: \xF0\x9F\x98\x84 (correct)
NOT javascript surrogates, not \uD83D\uDE04 (wrong)
I cannot figure out the math relationship between the four bytes C wants vs the two surrogates javascript uses. I suspect the algorithm is far more complex than my feeble attempts.
Thanks for any tips.
encodeURIComponent does this work:
var input = "\uD83D\uDE04";
var result = encodeURIComponent(input).replace(/%/g, "\\x"); // \xF0\x9F\x98\x84
Upd: Actually, C strings can contain digits and letters without escaping, but if you really need to escape them:
function escape(s, escapeEverything) {
  if (escapeEverything) {
    // temporarily mark printable ASCII as "-xNN"; the "-" placeholder survives encodeURIComponent
    s = s.replace(/[\x10-\x7f]/g, function (s) {
      return "-x" + s.charCodeAt(0).toString(16).toUpperCase();
    });
  }
  // encodeURIComponent emits %NN for every UTF-8 byte; swap "%" for "\x"
  s = encodeURIComponent(s).replace(/%/g, "\\x");
  if (escapeEverything) {
    // turn the "-xNN" placeholders into real "\xNN" escapes
    s = s.replace(/\-/g, "\\");
  }
  return s;
}
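For example, with the emoji from the question (and an ASCII prefix to show the escapeEverything flag):
console.log(escape("\uD83D\uDE04"));          // \xF0\x9F\x98\x84
console.log(escape("hi \uD83D\uDE04", true)); // \x68\x69\x20\xF0\x9F\x98\x84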
Found a solution here: http://jonisalonen.com/2012/from-utf-16-to-utf-8-in-javascript/
I would have never figured out THAT math, wow.
somewhat minified
function UTF8seq(s) {
var i,c,u=[];
for (i=0; i < s.length; i++) {
c = s.charCodeAt(i);
if (c < 0x80) { u.push(c); }
else if (c < 0x800) { u.push(0xc0 | (c >> 6), 0x80 | (c & 0x3f)); }
else if (c < 0xd800 || c >= 0xe000) { u.push(0xe0 | (c >> 12), 0x80 | ((c>>6) & 0x3f), 0x80 | (c & 0x3f)); }
else { i++; c = 0x10000 + (((c & 0x3ff)<<10) | (s.charCodeAt(i) & 0x3ff));
u.push(0xf0 | (c >>18), 0x80 | ((c>>12) & 0x3f), 0x80 | ((c>>6) & 0x3f), 0x80 | (c & 0x3f)); }
}
for (i=0; i < u.length; i++) { u[i]=u[i].toString(16); }
return '\\x'+u.join('\\x');
}
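For example, with the emoji from the question:
console.log(UTF8seq("\uD83D\uDE04")); // \xf0\x9f\x98\x84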
Your C code expects a UTF-8 string (the symbol is represented as 4 bytes). The JS representation you see is UTF-16, however (the symbol is represented as 2 uint16s, a surrogate pair).
You will first need to get the (Unicode) code point for your symbol (from the UTF-16 JS string), then build the UTF-8 representation for it from that.
Since ES6 you can use the codePointAt method for the first part; I would recommend it even if you have to shim it where it is not supported. I guess you don't want to decode surrogate pairs yourself :-)
For the rest, I don't think there's a library method, but you can write it yourself according to the spec:
function hex(x) {
x = x.toString(16);
return (x.length > 2 ? "\\u0000" : "\\x00").slice(0,-x.length)+x.toUpperCase();
}
var c = "😄";
console.log(c.length, hex(c.charCodeAt(0))+hex(c.charCodeAt(1))); // 2, "\uD83D\uDE04"
var cp = c.codePointAt(0);
var bytes = new Uint8Array(4);
bytes[3] = 0x80 | cp & 0x3F;
bytes[2] = 0x80 | (cp >>>= 6) & 0x3F;
bytes[1] = 0x80 | (cp >>>= 6) & 0x3F;
bytes[0] = 0xF0 | (cp >>>= 6) & 0x3F;
console.log(Array.prototype.map.call(bytes, hex).join("")); // "\xF0\x9F\x98\x84"
(tested in Chrome)