Converting UUID to bytes [duplicate] - javascript

How can I convert a string into a byte array using JavaScript? The output should be equivalent to the following C# code.
UnicodeEncoding encoding = new UnicodeEncoding();
byte[] bytes = encoding.GetBytes(AnyString);
Note that UnicodeEncoding is by default UTF-16 with little-endianness.
Edit: I have a requirement to match the byte array generated client-side with the one generated server-side by the above C# code.

Update 2018 - The easiest way in 2018 should be TextEncoder
let utf8Encode = new TextEncoder();
utf8Encode.encode("abc");
// Uint8Array [ 97, 98, 99 ]
Caveats - The returned element is a Uint8Array, and not all browsers support it.
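Note also that TextEncoder always encodes to UTF-8, so its output will not match the UTF-16LE bytes produced by the C# code in the question. A minimal sketch of working around the Uint8Array caveat (assuming a runtime that ships TextEncoder):
const encoder = new TextEncoder();  // always UTF-8
const view = encoder.encode("abc"); // Uint8Array [ 97, 98, 99 ]
const bytes = Array.from(view);     // plain Array [ 97, 98, 99 ]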

If you are looking for a solution that works in node.js, you can use this:
var myBuffer = [];
var str = 'Stack Overflow';
var buffer = new Buffer(str, 'utf16le');
for (var i = 0; i < buffer.length; i++) {
myBuffer.push(buffer[i]);
}
console.log(myBuffer);
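Note that the new Buffer(...) constructor shown above has since been deprecated in Node.js; a sketch of the same idea using the newer Buffer.from API:
const str = 'Stack Overflow';
const buffer = Buffer.from(str, 'utf16le'); // preferred over new Buffer(...)
const myBuffer = Array.from(buffer);        // plain array of byte values
console.log(myBuffer);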

In C# running this
UnicodeEncoding encoding = new UnicodeEncoding();
byte[] bytes = encoding.GetBytes("Hello");
will create an array containing:
72,0,101,0,108,0,108,0,111,0
For a character whose code is greater than 255, both bytes of its UTF-16 code unit appear in the output.
If you want very similar behavior in JavaScript you can do this (v2 is a bit more robust, while the original version will only work for 0x00 ~ 0xff):
var str = "Hello竜";
var bytes = []; // char codes
var bytesv2 = []; // char codes
for (var i = 0; i < str.length; ++i) {
var code = str.charCodeAt(i);
bytes = bytes.concat([code]);
bytesv2 = bytesv2.concat([code & 0xff, code / 256 >>> 0]);
}
// 72, 101, 108, 108, 111, 31452
console.log('bytes', bytes.join(', '));
// 72, 0, 101, 0, 108, 0, 108, 0, 111, 0, 220, 122
console.log('bytesv2', bytesv2.join(', '));

I suppose C# and Java produce equal byte arrays. If you have non-ASCII characters, it's not enough to add an additional 0. My example contains a few special characters:
var str = "Hell ö € Ω 𝄞";
var bytes = [];
var charCode;
for (var i = 0; i < str.length; ++i)
{
charCode = str.charCodeAt(i);
bytes.push((charCode & 0xFF00) >> 8);
bytes.push(charCode & 0xFF);
}
alert(bytes.join(' '));
// 0 72 0 101 0 108 0 108 0 32 0 246 0 32 32 172 0 32 3 169 0 32 216 52 221 30
I don't know whether C# places a BOM (Byte Order Mark), but when using UTF-16, Java's String.getBytes adds the following bytes: 254 255.
String s = "Hell ö € Ω ";
// now add a character outside the BMP (Basic Multilingual Plane)
// we take the violin-symbol (U+1D11E) MUSICAL SYMBOL G CLEF
s += new String(Character.toChars(0x1D11E));
// surrogate codepoints are: d834, dd1e, so one could also write "\ud834\udd1e"
byte[] bytes = s.getBytes("UTF-16");
for (byte aByte : bytes) {
System.out.print((0xFF & aByte) + " ");
}
// 254 255 0 72 0 101 0 108 0 108 0 32 0 246 0 32 32 172 0 32 3 169 0 32 216 52 221 30
Edit:
Added a special character (U+1D11E) MUSICAL SYMBOL G CLEF (outside the BMP, so it takes not only 2 bytes in UTF-16, but 4).
Current JavaScript versions use "UCS-2" internally, so this symbol takes the space of 2 normal characters.
I'm not sure, but when using charCodeAt it seems we get exactly the surrogate code points also used in UTF-16, so non-BMP characters are handled correctly.
This problem is absolutely non-trivial. It might depend on the used JavaScript versions and engines. So if you want reliable solutions, you should have a look at:
https://github.com/koichik/node-codepoint/
http://mathiasbynens.be/notes/javascript-escapes
Mozilla Developer Network: charCodeAt
BigEndian vs. LittleEndian

UTF-16 Byte Array
JavaScript encodes strings as UTF-16, just like C#'s UnicodeEncoding, so creating a byte array is relatively straightforward.
JavaScript's charCodeAt() returns a 16-bit code unit (aka a 2-byte integer between 0 and 65535). You can split it into distinct bytes using the following:
function strToUtf16Bytes(str) {
const bytes = [];
for (let ii = 0; ii < str.length; ii++) {
const code = str.charCodeAt(ii); // x00-xFFFF
bytes.push(code & 255, code >> 8); // low, high
}
return bytes;
}
For example:
strToUtf16Bytes('🌵');
// [ 60, 216, 53, 223 ]
This works between C# and JavaScript because they both support UTF-16. However, if you want to get a UTF-8 byte array from JS, you must transcode the bytes.
UTF-8 Byte Array
The solution feels somewhat non-trivial, but I used the code below in production with great success (original source).
Also, for the interested reader, I published my unicode helpers that help me work with string lengths reported by other languages such as PHP.
/**
* Convert a string to a unicode byte array
* @param {string} str
* @return {Array} of bytes
*/
export function strToUtf8Bytes(str) {
const utf8 = [];
for (let ii = 0; ii < str.length; ii++) {
let charCode = str.charCodeAt(ii);
if (charCode < 0x80) utf8.push(charCode);
else if (charCode < 0x800) {
utf8.push(0xc0 | (charCode >> 6), 0x80 | (charCode & 0x3f));
} else if (charCode < 0xd800 || charCode >= 0xe000) {
utf8.push(0xe0 | (charCode >> 12), 0x80 | ((charCode >> 6) & 0x3f), 0x80 | (charCode & 0x3f));
} else {
ii++;
// Surrogate pair:
// UTF-16 encodes 0x10000-0x10FFFF by subtracting 0x10000 and
// splitting the 20 bits of 0x0-0xFFFFF into two halves
charCode = 0x10000 + (((charCode & 0x3ff) << 10) | (str.charCodeAt(ii) & 0x3ff));
utf8.push(
0xf0 | (charCode >> 18),
0x80 | ((charCode >> 12) & 0x3f),
0x80 | ((charCode >> 6) & 0x3f),
0x80 | (charCode & 0x3f),
);
}
}
return utf8;
}
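Usage, reusing the cactus emoji from the UTF-16 example above (a worked example added here, not part of the original answer):
strToUtf8Bytes('🌵');
// [ 240, 159, 140, 181 ]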

Inspired by @hgoebl's answer. His code is for UTF-16 and I needed something for US-ASCII. So here's a more complete answer covering US-ASCII, UTF-16, and UTF-32.
/** @returns {Array} bytes of US-ASCII */
function stringToAsciiByteArray(str)
{
var bytes = [];
for (var i = 0; i < str.length; ++i)
{
var charCode = str.charCodeAt(i);
if (charCode > 0xFF) // char > 1 byte since charCodeAt returns the UTF-16 value
{
throw new Error('Character ' + String.fromCharCode(charCode) + ' can\'t be represented by a US-ASCII byte.');
}
bytes.push(charCode);
}
return bytes;
}
/** @returns {Array} bytes of UTF-16 Big Endian without BOM */
function stringToUtf16ByteArray(str)
{
var bytes = [];
//currently the function returns without BOM. Uncomment the next line to change that.
//bytes.push(254, 255); //Big Endian Byte Order Marks
for (var i = 0; i < str.length; ++i)
{
var charCode = str.charCodeAt(i);
//char > 2 bytes is impossible since charCodeAt can only return 2 bytes
bytes.push((charCode & 0xFF00) >>> 8); //high byte (might be 0)
bytes.push(charCode & 0xFF); //low byte
}
return bytes;
}
/** @returns {Array} bytes of UTF-32 Big Endian without BOM */
function stringToUtf32ByteArray(str)
{
var bytes = [];
//currently the function returns without BOM. Uncomment the next line to change that.
//bytes.push(0, 0, 254, 255); //Big Endian Byte Order Marks
for (var i = 0; i < str.length; ++i)
{
var charPoint = str.codePointAt(i);
//char > 4 bytes is impossible since codePointAt can only return 4 bytes
if (charPoint > 0xFFFF) ++i; //codePointAt consumed a surrogate pair, so skip the low surrogate
bytes.push((charPoint & 0xFF000000) >>> 24);
bytes.push((charPoint & 0xFF0000) >>> 16);
bytes.push((charPoint & 0xFF00) >>> 8);
bytes.push(charPoint & 0xFF);
}
return bytes;
}
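A few worked usage examples (mine, using characters that appear in the answers above):
stringToAsciiByteArray('Hi');        // [ 72, 105 ]
stringToUtf16ByteArray('aΩ');        // [ 0, 97, 3, 169 ]
stringToUtf32ByteArray('\u{1D11E}'); // [ 0, 1, 209, 30 ]  (MUSICAL SYMBOL G CLEF)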
UTF-8 is variable length and isn't included because I would have to write the encoding myself. UTF-8 and UTF-16 are variable length; UTF-8, UTF-16, and UTF-32 have a minimum number of bits, as their names indicate. If a UTF-32 character has the code point 65, it has 3 leading zero bytes, whereas the same code point in UTF-16 has only 1 leading zero byte. US-ASCII, on the other hand, is fixed-width 8 bits, which means it can be translated directly to bytes.
String.prototype.charCodeAt returns at most 2 bytes (a 16-bit code unit) and matches UTF-16 exactly. For UTF-32, however, String.prototype.codePointAt is needed, which was introduced with ECMAScript 6 (Harmony). Because charCodeAt can return values covering more characters than US-ASCII can represent, the function stringToAsciiByteArray throws in such cases instead of splitting the character in half and taking either or both bytes.
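For example, charCodeAt only sees the individual surrogates of a non-BMP character, while codePointAt reconstructs the full code point (a small illustration, not part of the original answer):
'\u{1D11E}'.length;         // 2      (two UTF-16 code units)
'\u{1D11E}'.charCodeAt(0);  // 55348  (0xD834, the high surrogate)
'\u{1D11E}'.codePointAt(0); // 119070 (0x1D11E, the full code point)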
Note that this answer is non-trivial because character encoding is non-trivial. What kind of byte array you want depends on what character encoding you want those bytes to represent.
JavaScript has the option of internally using either UTF-16 or UCS-2, but since it has methods that act as if it were UTF-16, I don't see why any browser would use UCS-2.
Also see: https://mathiasbynens.be/notes/javascript-encoding
Yes I know the question is 4 years old but I needed this answer for myself.

Since I cannot comment on the answer, I'd build on Jin Izzraeel's answer
var myBuffer = [];
var str = 'Stack Overflow';
var buffer = new Buffer(str, 'utf16le');
for (var i = 0; i < buffer.length; i++) {
myBuffer.push(buffer[i]);
}
console.log(myBuffer);
by saying that you could use this if you want to use a Node.js buffer in your browser.
https://github.com/feross/buffer
Therefore, Tom Stickel's objection is not valid, and the answer is indeed a valid answer.
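A rough sketch of what that might look like when bundling for the browser (assuming the feross/buffer package is installed as buffer; the trailing slash forces the npm package rather than the Node.js core module):
var Buffer = require('buffer/').Buffer; // feross/buffer, bundled via browserify/webpack

var str = 'Stack Overflow';
var buffer = Buffer.from(str, 'utf16le');
var myBuffer = Array.from(buffer);
console.log(myBuffer);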

String.prototype.encodeHex = function () {
return this.split('').map(e => e.charCodeAt())
};
// Note: the decoder has to live on Array.prototype, since encodeHex returns an
// array of char codes (strings have no .map method):
Array.prototype.decodeHex = function () {
return this.map(e => String.fromCharCode(e)).join('')
};

The best solution I've come up with on the spot (though most likely crude) would be:
String.prototype.getBytes = function() {
var bytes = [];
for (var i = 0; i < this.length; i++) {
var charCode = this.charCodeAt(i);
var cLen = Math.max(1, Math.ceil(Math.log(charCode + 1) / Math.log(256))); // bytes needed for this char code
for (var j = 0; j < cLen; j++) {
bytes.push((charCode >> (j*8)) & 0xFF); // shift right to extract each byte, low byte first
}
}
return bytes;
}
Though I notice this question has been here for over a year.

I know the question is almost 4 years old, but this is what worked smoothly for me:
String.prototype.encodeHex = function () {
var bytes = [];
for (var i = 0; i < this.length; ++i) {
bytes.push(this.charCodeAt(i));
}
return bytes;
};
Array.prototype.decodeHex = function () {
var str = [];
var hex = this.toString().split(',');
for (var i = 0; i < hex.length; i++) {
str.push(String.fromCharCode(hex[i]));
}
return str.toString().replace(/,/g, "");
};
var str = "Hello World!";
var bytes = str.encodeHex();
alert('The Hexa Code is: '+bytes+' The original string is: '+bytes.decodeHex());
or, if you want to work with strings only, and no Array, you can use:
String.prototype.encodeHex = function () {
var bytes = [];
for (var i = 0; i < this.length; ++i) {
bytes.push(this.charCodeAt(i));
}
return bytes.toString();
};
String.prototype.decodeHex = function () {
var str = [];
var hex = this.split(',');
for (var i = 0; i < hex.length; i++) {
str.push(String.fromCharCode(hex[i]));
}
return str.toString().replace(/,/g, "");
};
var str = "Hello World!";
var bytes = str.encodeHex();
alert('The Hexa Code is: '+bytes+' The original string is: '+bytes.decodeHex());

Here is the same function that @BrunoLM posted, converted to a String prototype function:
String.prototype.getBytes = function () {
var bytes = [];
for (var i = 0; i < this.length; ++i) {
bytes.push(this.charCodeAt(i));
}
return bytes;
};
If you define the function as such, then you can call the .getBytes() method on any string:
var str = "Hello World!";
var bytes = str.getBytes();

You don't need underscore, just use built-in map:
var string = 'Hello World!';
document.write(string.split('').map(function(c) { return c.charCodeAt(); }));

Related

Equivalent of Java's getBytes in JavaScript for different encodings

I have a function in Java that I need to convert to JavaScript and that contains this line:
byte[] bytes = ttText.getBytes(Charset.forName("Cp1250"));
ttText is a String. I need to do the same: get the bytes of a string encoded in Cp1250 (windows-1250), modify those bytes, and then convert them back to a string. Is there a way to do that in JavaScript?
I discovered TextEncoder and TextDecoder, for example, but support for encodings other than UTF-8 was dropped from the encoder some time ago.
var cp1250 = '€ ‚ „…†‡ ‰Š‹ŚŤŽŹ ‘’“”•–— ™š›śťžź ˇ˘Ł¤Ą¦§¨©Ş«¬­®Ż°±˛ł´µ¶·¸ąş»Ľ˝ľżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖ×ŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőö÷řůúűüýţ˙';
function encodeCP1250(text) {
var buf = [];
for (var i = 0; i < text.length; i++) {
var code = cp1250.indexOf(text[i]);
if (code >= 0) {
code += 128;
} else {
code = text.charCodeAt(i);
}
buf.push(code > 255 ? 32 : code);
}
return buf;
}
function decodeCP1250(buf) {
var text = '';
for (var i = 0; i < buf.length; i++) {
var code = buf[i];
text += code > 127 ? cp1250[code - 128] : String.fromCharCode(code);
}
return text;
}
var buf = encodeCP1250('AÁÂĂÄ'); // [65, 193, 194, 195, 196]
var text = decodeCP1250(buf); // 'AÁÂĂÄ'
Update: Chrome and Firefox have TextDecoder as an experimental feature, but TextEncoder works only with UTF-8.
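So decoding (bytes back to text) can be left to the platform, while encoding still needs the lookup-table approach above. A small sketch, assuming a browser whose TextDecoder accepts the 'windows-1250' label:
var decoder = new TextDecoder('windows-1250');
var text = decoder.decode(new Uint8Array([65, 193, 194, 195, 196])); // 'AÁÂĂÄ'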
Try this.
https://mths.be/windows-1250
This looks promising. It provides support for both encoding and decoding.
All you need to do is add the library and use the methods.
var encodedData = windows1250.encode(text);
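and, since the library handles both directions, decoding should be the mirror image (a sketch based on its documented API):
var decodedText = windows1250.decode(encodedData); // back to the original string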

Longitudinal redundancy check in Javascript

I'm working with a system that integrates a Point of Sale (POS) device; I use chrome.serial to scan ports and read credit card data.
The problem I'm facing is that I need to concat the LRC from a string in this format:
STX = '\002' (2 HEX) (Start of text)
LLL = Length of the data (doesn't include STX or ETX, but does include the command).
Command C50 {C = a message from PC to POS, 50 = the actual code that "prints" a message on the POS}
ETX = '\003' (3 HEX) (End of text)
LRC = Longitudinal Redundancy Check
A message example would be as follows:
'\002014C50HELLO WORLD\003'
Here 002 is the STX, 014 is the length from C50 through the final D, and 003 is the ETX.
I found some algorithms in C# like this one or this one, and even this one in Java. I even saw this question that was removed from SO (found via Google's cache), which asks the same thing as I do but had no examples or answers.
I also made this Java algorithm:
private int calculateLRC(String str) {
int result = 0;
for (int i = 0; i < str.length(); i++) {
String char1 = str.substring(i, i + 1);
char[] char2 = char1.toCharArray();
int number = char2[0];
result = result ^ number;
}
return result;
}
and tried porting it to JavaScript (where my knowledge is poor):
function calculateLRC2(str) {
var result = 0;
for (var i = 0; i < str.length; i++) {
var char1 = str.substring(i, i + 1);
//var char2[] = char1.join('');
var number = char1;
result = result ^ number;
}
return result.toString();
}
and after following Wikipedia's pseudocode I tried this:
function calculateLRC(str) {
var buffer = convertStringToArrayBuffer(str);
var lrc;
for (var i = 0; i < str.length; i++) {
lrc = (lrc + buffer[i]) & 0xFF;
}
lrc = ((lrc ^ 0xFF) + 1) & 0xFF;
return lrc;
}
This is how I call the above method:
var finalMessage = '\002014C50HELLO WORLD\003'
var lrc = calculateLRC(finalMessage);
console.log('lrc: ' + lrc);
finalMessage = finalMessage.concat(lrc);
console.log('finalMessage: ' + finalMessage);
However, after trying all these methods, I still can't send a message to the POS correctly. I've spent 3 days now trying to fix this and can't move on until I finish it.
Does anyone know another way to calculate the LRC, or can you tell me what I'm doing wrong here? It needs to be in JavaScript, since the POS communicates with the PC through Node.js.
Oh, by the way, the convertStringToArrayBuffer code is from the chrome.serial documentation:
var writeSerial=function(str) {
chrome.serial.send(connectionId, convertStringToArrayBuffer(str), onSend);
}
// Convert string to ArrayBuffer
var convertStringToArrayBuffer=function(str) {
var buf=new ArrayBuffer(str.length);
var bufView=new Uint8Array(buf);
for (var i=0; i<str.length; i++) {
bufView[i]=str.charCodeAt(i);
}
return buf;
}
Edit: After testing I came up with this algorithm, which returns a 'z' (lower case) for the following input: \002007C50HOLA\003.
function calculateLRC (str) {
var bytes = [];
var lrc = 0;
for (var i = 0; i < str.length; i++) {
bytes.push(str.charCodeAt(i));
}
for (var i = 0; i < str.length; i++) {
lrc ^= bytes[i];
console.log('lrc: ' + lrc);
//console.log('lrcString: ' + String.fromCharCode(lrc));
}
console.log('bytes: ' + bytes);
return String.fromCharCode(lrc);
}
However, with some longer inputs, and especially when trying to read card data, the LRC sometimes becomes a control character, which might be a problem in my case since I use those characters in my string. Is there a way to force the LRC to avoid those characters? Or maybe I'm doing it wrong and that's why I'm getting those characters as output.
I solved the LRC issue by calculating it with the following method, after reading @Jack A.'s answer and modifying it into this one:
function calculateLRC (str) {
var bytes = [];
var lrc = 0;
for (var i = 0; i < str.length; i++) {
bytes.push(str.charCodeAt(i));
}
for (var i = 0; i < str.length; i++) {
lrc ^= bytes[i];
}
return String.fromCharCode(lrc);
}
Explanation of what it does:
1st: it converts the received string to its character-code equivalent (charCodeAt()).
2nd: it calculates the LRC by XORing the previously calculated LRC (0 on the 1st iteration) with each character's code.
3rd: it converts the final code back to its equivalent char (fromCharCode()) and returns that char to the main function (or whatever function called it).
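For reference, calling it with the test message from the edit above (written with \u escapes for the control characters) yields the 'z' mentioned there:
var lrc = calculateLRC('\u0002007C50HOLA\u0003');
console.log(lrc);               // 'z'
console.log(lrc.charCodeAt(0)); // 122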
Your pseudocode-based algorithm is using addition. For the XOR version, try this:
function calculateLRC(str) {
var buffer = convertStringToArrayBuffer(str);
var lrc = 0;
for (var i = 0; i < str.length; i++) {
lrc = (lrc ^ buffer[i]) & 0xFF;
}
return lrc;
}
I think your original attempt at the XOR version was failing because you needed to get the character code. The number variable still contained a string when you did result = result ^ number, so the results were probably not what you expected.
This is a SWAG since I don't have Node.js installed at the moment, so I can't verify that it will work.
Another thing I would be concerned about is character encoding. JavaScript uses UTF-16 for text, so converting any non-ASCII characters to 8-bit bytes may give unexpected results.
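If that is a concern, one option (my addition, not part of the original answer; the name calculateLrcStrict is made up here) is to reject code units that don't fit in a single byte before XORing:
function calculateLrcStrict(str) {
  var lrc = 0;
  for (var i = 0; i < str.length; i++) {
    var code = str.charCodeAt(i);
    if (code > 0xFF) {
      // Not representable as a single byte; encode the string to bytes first.
      throw new Error('Non 8-bit character at index ' + i);
    }
    lrc = (lrc ^ code) & 0xFF;
  }
  return lrc;
}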

How to convert binary data to base64 from CouchDB attachments

I'm trying to get binary data from my CouchDB server, but I can't use it. The response contains a string that represents binary data, but if I try to encode it in base64 with the btoa function, it gives me this error:
Uncaught InvalidCharacterError: 'btoa' failed: The string to be encoded contains characters outside of the Latin1 range.
I know that I can get the data directly encoded in base64, but I don't want to.
$.ajax({
url: "http://localhost:5984/testdb/7d9de7a8f2cab6c0b3409d4495000e3f/img",
headers: {
Authorization: 'Basic ' + btoa("name:password"),
},
success: function(data){
/*console.log(JSON.parse(jsonData));
console.log(imageData);*/
document.getElementById("immagine").src = "Data:image/jpg;base64," + btoa(data);
console.log(data);
}
});
any idea?
Start with the knowledge that each char in a Base64 String
var chars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'.split('');
represents a Number, specifically from 0 to 63. Next, consider that this range of numbers is all the numbers you can write with 6 bits and that we normally think about binary data in bytes which are 8 bits long.
So now we can conclude that the transformation we wish to achieve is from 8-bit integers to 6-bit integers, which looks a bit like this
xxxxxx xxyyyy yyyyzz zzzzzz
where each letter describes which byte the bit is in, and the spaces describe the breaks between the 6 bit integers.
Once we have the 6-bit numbers, we can simply transform each to its char and finally add = signs if we need to indicate that the number of bytes was not a multiple of 3 (and that they're not simply 0).
So how do we do this?
var arr8 = 'foobar'; // assuming binary string
var i, // to iterate
s1, s2, s3, s4, // sixes
e1, e2, e3, // eights
b64 = ''; // result
// some code I prepared earlier
for (i = 0; i < arr8.length; i += 3) {
e1 = arr8[i ];
e1 = e1 ? e1.charCodeAt(0) & 255 : 0;
e2 = arr8[i + 1];
e2 = e2 ? e2.charCodeAt(0) & 255 : 0;
e3 = arr8[i + 2];
e3 = e3 ? e3.charCodeAt(0) & 255 : 0;
// xxxxxxxx yyyyyyyy zzzzzzzz -> xxxxxx xxyyyy yyyyzz zzzzzz
s1 = e1 >>> 2 ;
s2 = ((e1 & 3) << 4) + (e2 >>> 4);
s3 = ((e2 & 15) << 2) + (e3 >>> 6);
s4 = e3 & 63 ;
b64 += chars[s1] + chars[s2];
if (arr8[i + 2] !== undefined)
b64 += chars[s3] + chars[s4];
else if (arr8[i + 1] !== undefined)
b64 += chars[s3] + '=';
else
b64 += '==';
}
// and the result
b64; // "Zm9vYmFy"
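For completeness: the btoa error in the question happens because the ajax response is decoded as text, so some code units fall outside the Latin1 range btoa accepts. One common workaround (a sketch, independent of the answer above, reusing the URL and element id from the question) is to request the attachment as an ArrayBuffer and base64-encode the raw bytes:
var xhr = new XMLHttpRequest();
xhr.open('GET', 'http://localhost:5984/testdb/7d9de7a8f2cab6c0b3409d4495000e3f/img');
xhr.setRequestHeader('Authorization', 'Basic ' + btoa('name:password'));
xhr.responseType = 'arraybuffer';
xhr.onload = function () {
  var bytes = new Uint8Array(xhr.response);
  var binary = '';
  for (var i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]); // each byte is 0-255, safe for btoa
  }
  document.getElementById('immagine').src = 'data:image/jpg;base64,' + btoa(binary);
};
xhr.send();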

Get the Value of a byte with Javascript

I'm not sure if the title makes any sense, but here is what I need.
This is my byte: ���������ˇ�����0F*E��ù� �
I have already managed to get the values of this byte with this php snippet:
<?php
$var = "���������ˇ�����0F*E��ù�";
for($i = 0; $i < strlen($var); $i++)
{
echo ord($var[$i])."<br/>";
}
?>
The result was:
0
0
0
0
0
0
0
0
2
2
0
255
0
0
0
0
0
2
48
70
1
42
69
0
0
1
157
0
But now I need to do the exact same thing without PHP, in JavaScript.
Any help would be appreciated
If you're wanting to get the numeric value of each character in a string in JavaScript, that can be done like so:
var someString = "blarg";
for(var i=0;i<someString.length;i++) {
var char = someString.charCodeAt(i);
}
String.charCodeAt(index) returns the Unicode code-point value of the specified character in the string. It does not behave like PHP or C where it returns the numeric value of a fixed 8-bit encoding (i.e. ASCII). Assuming your string is a human-readable string (as opposed to raw binary data) then using charCodeAt is perfectly fine. If you're working with raw binary data then don't use a JavaScript string.
If your strings contain characters that have Unicode code-points below 128 then charCodeAt behaves the same as ord in PHP or C's char type, however the example you've provided contains non-ASCII characters, so Unicode's (sometimes complicated) rules will come into play.
See the documentation on charCodeAt here: https://developer.mozilla.org/en-US/docs/JavaScript/Reference/Global_Objects/String/charCodeAt
The PHP string is treated as 8-bit bytes (0..255) while JavaScript uses 16-bit Unicode characters (0..65535). Depending on your string you may either split it into (16-bit) char codes or into bytes. If you know your string contains only 8-bit chars, you may ignore the "hiByte" (see below) to get the same results as in PHP.
function toByteVersionA(s) {
var len = s.length;
// char codes
var charCodes = new Array();
for(var i=0; i<len; i++) {
charCodes.push(s.charCodeAt(i).toString());
}
var charCodesString = charCodes.join(" ");
return charCodesString;
}
function toByteVersionB(s) {
var len = s.length;
var bytes = new Array();
for(var i=0; i<len; i++) {
var charCode = s.charCodeAt(i);
var loByte = charCode & 255;
var hiByte = charCode >> 8;
bytes.push(loByte.toString());
bytes.push(hiByte.toString());
}
var bytesString = bytes.join(" ");
return bytesString;
}
function toByteVersionC(s) {
var len = s.length;
var bytes = new Array();
for(var i=0; i<len; i++) {
var charCode = s.charCodeAt(i);
var loByte = charCode & 255;
bytes.push(loByte.toString());
}
var bytesString = bytes.join(" ");
return bytesString;
}
var myString = "abc"; // whatever your String is
var myBytes = toByteVersionA(myString); // whatever version you want
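For example, with the string "abc" the three versions produce (worked results added here):
toByteVersionA('abc'); // "97 98 99"
toByteVersionB('abc'); // "97 0 98 0 99 0"  (low byte then high byte per char)
toByteVersionC('abc'); // "97 98 99"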
