How can I convert a string to a byte array using JavaScript? The output should be equivalent to the C# code below.
UnicodeEncoding encoding = new UnicodeEncoding();
byte[] bytes = encoding.GetBytes(AnyString);
UnicodeEncoding defaults to UTF-16 with little-endian byte order.
Edit: I need the byte array generated client-side to match the one generated server-side by the above C# code.
Update 2018 - The easiest way in 2018 should be TextEncoder
let utf8Encode = new TextEncoder();
utf8Encode.encode("abc");
// Uint8Array [ 97, 98, 99 ]
Caveats - The returned value is a Uint8Array, and not all browsers support it.
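Also note that TextEncoder always produces UTF-8, while the C# UnicodeEncoding in the question produces UTF-16 with little-endian byte order, so the two byte arrays will not match. A minimal sketch of the latter (the function name is my own; it assumes a modern browser or Node.js with DataView support):
function strToUtf16LeBytes(str) {
  // two bytes per UTF-16 code unit
  const buf = new ArrayBuffer(str.length * 2);
  const view = new DataView(buf);
  for (let i = 0; i < str.length; i++) {
    // true = little-endian, matching C#'s UnicodeEncoding default
    view.setUint16(i * 2, str.charCodeAt(i), true);
  }
  return new Uint8Array(buf);
}
strToUtf16LeBytes("Hello");
// Uint8Array [ 72, 0, 101, 0, 108, 0, 108, 0, 111, 0 ]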
If you are looking for a solution that works in node.js, you can use this:
var myBuffer = [];
var str = 'Stack Overflow';
var buffer = Buffer.from(str, 'utf16le'); // Buffer.from replaces the deprecated new Buffer(str, encoding)
for (var i = 0; i < buffer.length; i++) {
myBuffer.push(buffer[i]);
}
console.log(myBuffer);
In C# running this
UnicodeEncoding encoding = new UnicodeEncoding();
byte[] bytes = encoding.GetBytes("Hello");
Will create an array with
72,0,101,0,108,0,108,0,111,0
For a character whose code is greater than 255, both bytes will be non-zero (see the 竜 example below).
If you want very similar behavior in JavaScript you can do this (bytesv2 is the more robust version, while the original bytes version only works for codes 0x00 ~ 0xff):
var str = "Hello竜";
var bytes = []; // char codes
var bytesv2 = []; // char codes
for (var i = 0; i < str.length; ++i) {
var code = str.charCodeAt(i);
bytes = bytes.concat([code]);
bytesv2 = bytesv2.concat([code & 0xff, code / 256 >>> 0]);
}
// 72, 101, 108, 108, 111, 31452
console.log('bytes', bytes.join(', '));
// 72, 0, 101, 0, 108, 0, 108, 0, 111, 0, 220, 122
console.log('bytesv2', bytesv2.join(', '));
I suppose C# and Java produce equal byte arrays. If you have non-ASCII characters, it's not enough to add an additional 0. My example contains a few special characters:
var str = "Hell ö € Ω 𝄞";
var bytes = [];
var charCode;
for (var i = 0; i < str.length; ++i)
{
charCode = str.charCodeAt(i);
bytes.push((charCode & 0xFF00) >> 8);
bytes.push(charCode & 0xFF);
}
alert(bytes.join(' '));
// 0 72 0 101 0 108 0 108 0 32 0 246 0 32 32 172 0 32 3 169 0 32 216 52 221 30
I don't know if C# places a BOM (Byte Order Mark), but when using UTF-16, Java's String.getBytes adds the following BOM bytes: 254 255.
String s = "Hell ö € Ω ";
// now add a character outside the BMP (Basic Multilingual Plane)
// we take the violin-symbol (U+1D11E) MUSICAL SYMBOL G CLEF
s += new String(Character.toChars(0x1D11E));
// surrogate codepoints are: d834, dd1e, so one could also write "\ud834\udd1e"
byte[] bytes = s.getBytes("UTF-16");
for (byte aByte : bytes) {
System.out.print((0xFF & aByte) + " ");
}
// 254 255 0 72 0 101 0 108 0 108 0 32 0 246 0 32 32 172 0 32 3 169 0 32 216 52 221 30
Edit:
Added a special character (U+1D11E) MUSICAL SYMBOL G CLEF (outside the BMP, so it takes not 2 bytes in UTF-16, but 4).
Current JavaScript versions use "UCS-2" internally, so this symbol takes the space of 2 normal characters.
I'm not sure, but when using charCodeAt it seems we get exactly the surrogate code points also used in UTF-16, so non-BMP characters are handled correctly.
This problem is absolutely non-trivial. It might depend on the used JavaScript versions and engines. So if you want reliable solutions, you should have a look at:
https://github.com/koichik/node-codepoint/
http://mathiasbynens.be/notes/javascript-escapes
Mozilla Developer Network: charCodeAt
BigEndian vs. LittleEndian
UTF-16 Byte Array
JavaScript encodes strings as UTF-16, just like C#'s UnicodeEncoding, so creating a byte array is relatively straightforward.
JavaScript's charCodeAt() returns a 16-bit code unit (aka a 2-byte integer between 0 and 65535). You can split it into distinct bytes using the following:
function strToUtf16Bytes(str) {
const bytes = [];
for (let ii = 0; ii < str.length; ii++) {
const code = str.charCodeAt(ii); // x00-xFFFF
bytes.push(code & 255, code >> 8); // low, high
}
return bytes;
}
For example:
strToUtf16Bytes('🌵');
// [ 60, 216, 53, 223 ]
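As a quick sanity check (a sketch, reusing the "Hello" sample from the C# answer above), the output lines up with UnicodeEncoding.GetBytes("Hello"):
strToUtf16Bytes("Hello");
// [ 72, 0, 101, 0, 108, 0, 108, 0, 111, 0 ]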
This works between C# and JavaScript because they both support UTF-16. However, if you want to get a UTF-8 byte array from JS, you must transcode the bytes.
UTF-8 Byte Array
The solution feels somewhat non-trivial, but I used the code below in production with great success (original source).
Also, for the interested reader, I published my unicode helpers that help me work with string lengths reported by other languages such as PHP.
/**
* Convert a string to a unicode byte array
* @param {string} str
* @return {Array} of bytes
*/
export function strToUtf8Bytes(str) {
const utf8 = [];
for (let ii = 0; ii < str.length; ii++) {
let charCode = str.charCodeAt(ii);
if (charCode < 0x80) utf8.push(charCode);
else if (charCode < 0x800) {
utf8.push(0xc0 | (charCode >> 6), 0x80 | (charCode & 0x3f));
} else if (charCode < 0xd800 || charCode >= 0xe000) {
utf8.push(0xe0 | (charCode >> 12), 0x80 | ((charCode >> 6) & 0x3f), 0x80 | (charCode & 0x3f));
} else {
ii++;
// Surrogate pair:
// UTF-16 encodes 0x10000-0x10FFFF by subtracting 0x10000 and
// splitting the 20 bits of 0x0-0xFFFFF into two halves
charCode = 0x10000 + (((charCode & 0x3ff) << 10) | (str.charCodeAt(ii) & 0x3ff));
utf8.push(
0xf0 | (charCode >> 18),
0x80 | ((charCode >> 12) & 0x3f),
0x80 | ((charCode >> 6) & 0x3f),
0x80 | (charCode & 0x3f),
);
}
}
return utf8;
}
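A couple of hedged spot checks (assuming the function has been imported; the second line is the standard two-byte UTF-8 sequence for é):
strToUtf8Bytes("abc"); // [ 97, 98, 99 ]  -- matches the TextEncoder example above
strToUtf8Bytes("é");   // [ 195, 169 ]    -- 0xC3 0xA9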
Inspired by @hgoebl's answer. His code is for UTF-16 and I needed something for US-ASCII. So here's a more complete answer covering US-ASCII, UTF-16, and UTF-32.
/** @returns {Array} bytes of US-ASCII */
function stringToAsciiByteArray(str)
{
var bytes = [];
for (var i = 0; i < str.length; ++i)
{
var charCode = str.charCodeAt(i);
if (charCode > 0xFF) // char > 1 byte since charCodeAt returns the UTF-16 value
{
throw new Error('Character ' + String.fromCharCode(charCode) + ' can\'t be represented by a US-ASCII byte.');
}
bytes.push(charCode);
}
return bytes;
}
/** @returns {Array} bytes of UTF-16 Big Endian without BOM */
function stringToUtf16ByteArray(str)
{
var bytes = [];
//currently the function returns without BOM. Uncomment the next line to change that.
//bytes.push(254, 255); //Big Endian Byte Order Marks
for (var i = 0; i < str.length; ++i)
{
var charCode = str.charCodeAt(i);
//char > 2 bytes is impossible since charCodeAt can only return 2 bytes
bytes.push((charCode & 0xFF00) >>> 8); //high byte (might be 0)
bytes.push(charCode & 0xFF); //low byte
}
return bytes;
}
/** @returns {Array} bytes of UTF-32 Big Endian without BOM */
function stringToUtf32ByteArray(str)
{
var bytes = [];
//currently the function returns without BOM. Uncomment the next line to change that.
//bytes.push(0, 0, 254, 255); //Big Endian Byte Order Marks
for (var i = 0; i < str.length; ++i)
{
var charPoint = str.codePointAt(i);
//char > 4 bytes is impossible since codePointAt can only return 4 bytes
if (charPoint > 0xFFFF) ++i; //this code point came from a surrogate pair, so skip its second (low surrogate) code unit
bytes.push((charPoint & 0xFF000000) >>> 24);
bytes.push((charPoint & 0xFF0000) >>> 16);
bytes.push((charPoint & 0xFF00) >>> 8);
bytes.push(charPoint & 0xFF);
}
return bytes;
}
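A quick usage sketch of the three functions, reusing the MUSICAL SYMBOL G CLEF (U+1D11E) from the earlier answers as the non-BMP example (expected values traced by hand, so treat them as an assumption):
stringToAsciiByteArray("A");   // [ 65 ]
stringToUtf16ByteArray("A𝄞"); // [ 0, 65, 216, 52, 221, 30 ]    -- surrogate pair d834 dd1e
stringToUtf32ByteArray("A𝄞"); // [ 0, 0, 0, 65, 0, 1, 209, 30 ] -- code point 0x1D11E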
UTF-8 isn't included because, being variable length, I would have to write the encoding myself (UTF-16 is also variable length once you count surrogate pairs). UTF-8, UTF-16, and UTF-32 have a minimum number of bits, as their names indicate. If a UTF-32 character has code point 65, it has 3 leading 0 bytes; the same character in UTF-16 has only 1 leading 0 byte. US-ASCII, on the other hand, is fixed width (one byte per character), so it can be translated directly to bytes.
String.prototype.charCodeAt returns at most 2 bytes and matches UTF-16 exactly. For UTF-32, however, String.prototype.codePointAt is needed, which was introduced in ECMAScript 6 (Harmony). Because charCodeAt can return values that US-ASCII cannot represent, the function stringToAsciiByteArray throws in such cases instead of splitting the character in half and taking either or both bytes.
Note that this answer is non-trivial because character encoding is non-trivial. What kind of byte array you want depends on what character encoding you want those bytes to represent.
JavaScript engines may internally use either UTF-16 or UCS-2, but since the string methods act as if it is UTF-16, I don't see why any browser would use UCS-2.
Also see: https://mathiasbynens.be/notes/javascript-encoding
Yes I know the question is 4 years old but I needed this answer for myself.
Since I cannot comment on the answer, I'd build on Jin Izzraeel's answer
var myBuffer = [];
var str = 'Stack Overflow';
var buffer = Buffer.from(str, 'utf16le'); // Buffer.from replaces the deprecated new Buffer(str, encoding)
for (var i = 0; i < buffer.length; i++) {
myBuffer.push(buffer[i]);
}
console.log(myBuffer);
by saying that you could use this if you want to use a Node.js buffer in your browser.
https://github.com/feross/buffer
Therefore, Tom Stickel's objection is not valid, and the answer is indeed a valid answer.
String.prototype.encodeHex = function () {
return this.split('').map(e => e.charCodeAt())
};
Array.prototype.decodeHex = function () {
return this.map(e => String.fromCharCode(e)).join('')
};
The best solution I've come up with on the spot (though most likely crude) would be:
String.prototype.getBytes = function() {
var bytes = [];
for (var i = 0; i < this.length; i++) {
var charCode = this.charCodeAt(i);
var cLen = charCode > 0xFF ? 2 : 1; // charCodeAt returns at most 0xFFFF, so each character is 1 or 2 bytes
for (var j = 0; j < cLen; j++) {
bytes.push((charCode >> (j*8)) & 0xFF); // shift right (not left) to extract each byte, low byte first
}
}
return bytes;
}
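A small usage sketch of the fixed version (竜 is the U+7ADC example from the earlier answer, and it yields the same low/high byte pair as bytesv2 there):
"A竜".getBytes(); // [ 65, 220, 122 ] -- 'A' is one byte, 竜 (0x7ADC) is two bytes, low byte first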
Though I notice this question has been here for over a year.
I know the question is almost 4 years old, but this is what worked smoothly for me:
String.prototype.encodeHex = function () {
var bytes = [];
for (var i = 0; i < this.length; ++i) {
bytes.push(this.charCodeAt(i));
}
return bytes;
};
Array.prototype.decodeHex = function () {
var str = [];
var hex = this.toString().split(',');
for (var i = 0; i < hex.length; i++) {
str.push(String.fromCharCode(hex[i]));
}
return str.toString().replace(/,/g, "");
};
var str = "Hello World!";
var bytes = str.encodeHex();
alert('The Hexa Code is: '+bytes+' The original string is: '+bytes.decodeHex());
or, if you want to work with strings only, and no Array, you can use:
String.prototype.encodeHex = function () {
var bytes = [];
for (var i = 0; i < this.length; ++i) {
bytes.push(this.charCodeAt(i));
}
return bytes.toString();
};
String.prototype.decodeHex = function () {
var str = [];
var hex = this.split(',');
for (var i = 0; i < hex.length; i++) {
str.push(String.fromCharCode(hex[i]));
}
return str.toString().replace(/,/g, "");
};
var str = "Hello World!";
var bytes = str.encodeHex();
alert('The Hexa Code is: '+bytes+' The original string is: '+bytes.decodeHex());
Here is the same function that @BrunoLM posted, converted to a String prototype function:
String.prototype.getBytes = function () {
var bytes = [];
for (var i = 0; i < this.length; ++i) {
bytes.push(this.charCodeAt(i));
}
return bytes;
};
If you define the function as such, then you can call the .getBytes() method on any string:
var str = "Hello World!";
var bytes = str.getBytes();
You don't need underscore, just use built-in map:
var string = 'Hello World!';
document.write(string.split('').map(function(c) { return c.charCodeAt(); }));
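A more modern variant of the same idea, as a sketch (it assumes Uint8Array.from is available; code units above 255 would be truncated to a single byte):
var byteView = Uint8Array.from('Hello World!', function (c) { return c.charCodeAt(0); });
console.log(byteView);
// Uint8Array [ 72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 33 ]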
During my research I found the following information, but it doesn't really match my problem:
http://www.cplusplus.com/forum/beginner/31776/
Base 10 to base n conversions
https://cboard.cprogramming.com/cplusplus-programming/83808-base10-base-n-converter.html
So I'd like to implement a custom Base64 to BaseN encoding and decoding using C++.
I should be able to convert a Base64 string like "IloveC0mpil3rs" to a custom-base (e.g. Base4) string like "10230102010301" and back again.
Additionally, I should be able to use a custom charset (alphabet) for the base digits; the default one is presumably "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ".
So I should also be able to use a shuffled one (a kind of encoding :) ) like this: "J87opBEyWwDQdNAYujzshP3LOx1T0XK2e+ZrvFnticbCS64a9/Il5GmgVkqUfRMH".
I thought about translating the convertBase function below from JavaScript into C++, but I'm obviously a beginner and ran into big problems: my code is not working as expected and I cannot find the error.
string encoded = convertBase("Test", 64, 4); // gets 313032130131000
cout << encoded << endl;
string decoded = convertBase(encoded, 4, 64); // error
cout << decoded << endl;
C++ code: (not working)
std::string convertBase(string value, int from_base, int to_base) {
string range = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ+/";
string from_range = range.substr(0, from_base),
to_range = range.substr(0, to_base);
int dec_value = 0;
int index = 0;
string reversed(value.rbegin(), value.rend());
for(std::string::iterator it = reversed.begin(); it != reversed.end(); ++it) {
index++;
char digit = *it;
if (!range.find(digit)) return "error";
dec_value += from_range.find(digit) * pow(from_base, index);
}
string new_value = "";
while (dec_value > 0) {
new_value = to_range[dec_value % to_base] + new_value;
dec_value = (dec_value - (dec_value % to_base)) / to_base;
}
return new_value;
}
javascript code: (working)
function convertBase(value, from_base, to_base) {
var range = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ+/'.split('');
var from_range = range.slice(0, from_base);
var to_range = range.slice(0, to_base);
var dec_value = value.split('').reverse().reduce(function (carry, digit, index) {
if (from_range.indexOf(digit) === -1) throw new Error('Invalid digit `'+digit+'` for base '+from_base+'.');
return carry += from_range.indexOf(digit) * (Math.pow(from_base, index));
}, 0);
var new_value = '';
while (dec_value > 0) {
new_value = to_range[dec_value % to_base] + new_value;
dec_value = (dec_value - (dec_value % to_base)) / to_base;
}
return new_value || '0';
}
let encoded = convertBase("Test", 64, 4)
console.log(encoded);
let decoded = convertBase(encoded, 4, 64)
console.log(decoded);
Any help fixing my code would be very much appreciated!
Is it possible to create a GUID using 16 characters of hex? The reason I ask is that Cloudflare uses 16 characters to identify each request to their system (they call them "Ray IDs"). They look much nicer than other GUID formats (I know this is a silly preference).
The key space would contain these characters:
0-9
a-f
---
16 total possible characters
Example: adttlo9dOd8haoww
Also, any hint at a basic algorithm for generating these would be awesome.
Lastly, I'm open to leaving the "hex" format and using:
0-9
a-z
A-Z
---
62 total possible characters
Example: dhmpLTuPFWEwM8UL
When you need to create your own unique IDs, use the current date and time. If your application is distributed, use the date and time plus a node identifier; this is the simplest solution:
function d2h(d) {
return d.toString(16);
}
function h2d (h) {
return parseInt(h, 16);
}
function stringToHex (tmp) {
var str = '',
i = 0,
tmp_len = tmp.length,
c;
for (; i < tmp_len; i += 1) {
c = tmp.charCodeAt(i);
str += d2h(c) + ' ';
}
return str;
}
function hexToString (tmp) {
var arr = tmp.split(' '),
str = '',
i = 0,
arr_len = arr.length,
c;
for (; i < arr_len; i += 1) {
c = String.fromCharCode( h2d( arr[i] ) );
str += c;
}
return str;
}
//if you can get UTC time, that is even better
// Tue, 30 Jun 2015 23:01:04 GMT
var time = Date();
var server_point = "S1";
//you can encrypt this generated id with Blowfish or something; remember that encrypting the existing bytes makes the result longer
//remove spaces
var reg = new RegExp("[ ]+","g");
time = time.replace(reg, "");
var hexaResult = stringToHex(time + server_point);
alert(hexaResult.replace(reg, ""));
Or you can use a cryptographic random generator:
https://developer.mozilla.org/en-US/docs/Web/API/RandomSource/getRandomValues
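For example, a browser-side sketch along those lines (the function name is mine; it assumes the Web Crypto API is available) that yields a 16-character hex id like the question asks for:
function createShortId() {
  var bytes = new Uint8Array(8); // 8 random bytes -> 16 hex characters
  window.crypto.getRandomValues(bytes);
  return Array.prototype.map.call(bytes, function (b) {
    return ('0' + b.toString(16)).slice(-2); // zero-pad each byte to two hex digits
  }).join('');
}
console.log(createShortId()); // e.g. "9f3c0a7d12b45e88"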
Guids are typically 32 hex characters with dashes at different intervals.
var crypto = require("crypto");
function create_guid() {
var hexstring = crypto.randomBytes(16).toString("hex"); // 16 bytes generates a 32 character hex string
var guidstring = hexstring.substring(0,8) + "-" + hexstring.substring(8,12) + "-" + hexstring.substring(12,16) + "-" + hexstring.substring(16,20) + "-" + hexstring.substring(20);
return guidstring;
}
You can simply modify the above function to return hexstring instead of guidstring if you don't want the dashes. And if you want just 16 characters instead of 32:
function create_guid_simple() {
var hexstring = crypto.randomBytes(8).toString("hex"); // 8 bytes is a 16 character string
return hexstring;
}
I am trying without much success to convert a very large hex number to decimal.
My problem is that using decimal = parseInt(hex, 16)
gives me errors in the number when I try to convert a hex number longer than 14 digits.
I have no problem with this in Java, but JavaScript does not seem to be accurate beyond 14 hex digits.
I have tried "BigNumber" but this gives me the same erroneous result.
I have trawled the web to the best of my ability and found web sites that will do the conversion but cannot figure out how to do the conversion longhand.
I have tried getting each character in turn and multiplying it by its place value, i.e. for 123456789abcdef:
15 * Math.pow(16, 0) + 14 * Math.pow(16, 1) ... etc., but I think (being a noob) my subroutines may not have been all they should be, because I got a completely (and I mean really!) different answer.
If it helps, I can post what I have written so far for you to look at, but I am hoping someone has a simple answer for me.
<script>
function Hex2decimal(hex){
var stringLength = hex.length;
var characterPosition = stringLength;
var character;
var hexChars = new Array();
hexChars[0] = "0";
hexChars[1] = "1";
hexChars[2] = "2";
hexChars[3] = "3";
hexChars[4] = "4";
hexChars[5] = "5";
hexChars[6] = "6";
hexChars[7] = "7";
hexChars[8] = "8";
hexChars[9] = "9";
hexChars[10] = "a";
hexChars[11] = "b";
hexChars[12] = "c";
hexChars[13] = "d";
hexChars[14] = "e";
hexChars[15] = "f";
var index = 0;
var hexChar;
var result;
// document.writeln(hex);
while (characterPosition >= 0)
{
// document.writeln(characterPosition);
character = hex.charAt(characterPosition);
while (index < hexChars.length)
{
// document.writeln(index);
document.writeln("String Character = " + character);
hexChar = hexChars[index];
document.writeln("Hex Character = " + hexChar);
if (hexChar == character)
{
result = hexChar;
document.writeln(result);
}
index++
}
// document.write(character);
characterPosition--;
}
return result;
}
</script>
Thank you.
Paul
The New 'n' Easy Way
var hex = "7FDDDDDDDDDDDDDDDDDDDDDD";
if (hex.length % 2) { hex = '0' + hex; }
var bn = BigInt('0x' + hex);
var d = bn.toString(10);
BigInts are now available in most browsers (except IE).
Earlier in this answer:
BigInts are now available in both node.js and Chrome. Firefox shouldn't be far behind.
If you need to deal with negative numbers, that requires a bit of work:
How to handle Signed JS BigInts
Essentially:
function hexToBn(hex) {
if (hex.length % 2) {
hex = '0' + hex;
}
var highbyte = parseInt(hex.slice(0, 2), 16)
var bn = BigInt('0x' + hex);
if (0x80 & highbyte) {
// You'd think `bn = ~bn;` would work... but it doesn't
// manually perform two's complement (flip bits, add one)
// (because ~ on a BigInt gives -bn - 1, not a fixed-width two's complement)
bn = BigInt('0b' + bn.toString(2).split('').map(function (i) {
return '0' === i ? 1 : 0
}).join('')) + BigInt(1);
bn = -bn;
}
return bn;
}
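Tracing the function on small inputs gives results like these (a hedged sketch, worked out by hand):
hexToBn("7f"); // 127n  (high bit clear, stays positive)
hexToBn("ff"); // -1n   (high bit set, interpreted as two's complement)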
Ok, let's try this:
function h2d(s) {
function add(x, y) {
var c = 0, r = [];
var x = x.split('').map(Number);
var y = y.split('').map(Number);
while(x.length || y.length) {
var s = (x.pop() || 0) + (y.pop() || 0) + c;
r.unshift(s < 10 ? s : s - 10);
c = s < 10 ? 0 : 1;
}
if(c) r.unshift(c);
return r.join('');
}
var dec = '0';
s.split('').forEach(function(chr) {
var n = parseInt(chr, 16);
for(var t = 8; t; t >>= 1) {
dec = add(dec, dec);
if(n & t) dec = add(dec, '1');
}
});
return dec;
}
Test:
t = 'dfae267ab6e87c62b10b476e0d70b06f8378802d21f34e7'
console.log(h2d(t))
prints
342789023478234789127089427304981273408912349586345899239
which is correct (feel free to verify).
Notice that "0x" + "ff" will be considered as 255, so convert your hex value to a string and add "0x" ahead.
function Hex2decimal(hex)
{
return ("0x" + hex) / 1;
}
If you are using the '0x' notation for your Hex String, don't forget to add s = s.slice(2) to remove the '0x' prefix.
Keep in mind that JavaScript only has a single numeric type (double), and does not provide any separate integer types. So it may not be possible for it to store exact representations of your numbers.
In order to get exact results you need to use a library for arbitrary-precision integers, such as BigInt.js. For example, the code:
var x = str2bigInt("5061756c205768697465",16,1,1);
var s = bigInt2str(x, 10);
$('#output').text(s);
Correctly converts 0x5061756c205768697465 to the expected result of 379587113978081151906917.
Here is a jsfiddle if you would like to experiment with the code listed above.
The BigInt constructor can take a hex string as argument:
/** #param hex = "a83b01cd..." */
function Hex2decimal(hex) {
return BigInt("0x" + hex).toString(10);
}
Usage:
Hex2decimal("100");
Output:
256
A rip-off from the other answer, but without the meaningless 0 padding =P