Javascript encoding breaking & combining multibyte characters?

I'm planning to use a client-side AES encryption for my web-app.
Right now I'm looking for a way to break multibyte characters into single-byte 'non-characters', encrypt them (so the encrypted text keeps the same length), decrypt them, and then convert those single-byte 'non-characters' back into multibyte characters.
I've read the wiki articles on UTF-8 (supposedly the default encoding for JS?) and UTF-16, but I can't figure out how to detect "fragmented" multibyte characters or how to combine them back.
Thanks : )

JavaScript strings are UTF-16 stored in 16-bit "characters". For Unicode characters ("code points") that require more than 16 bits (some code points require 32 bits in UTF-16), each JavaScript "character" is actually only half of the code point.
So to "break" a JavaScript character into bytes, you just take the character code and split off the high byte and the low byte:
var code = str.charCodeAt(0); // The first character, obviously you'll have a loop
var lowbyte = code & 0xFF;
var highbyte = (code & 0xFF00) >> 8;
(Even though JavaScript's numbers are floating point, the bitwise operators work in terms of 32-bit integers, and of course in our case only 16 of those bits are relevant.)
You'll never have an odd number of bytes, because again this is UTF-16.
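A minimal sketch of the full round trip, assuming the high byte is stored first. The helper names stringToBytes and bytesToString are made up for illustration:

```javascript
// Split every UTF-16 code unit into two bytes (high byte first).
function stringToBytes(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    bytes.push((code & 0xff00) >> 8, code & 0xff);
  }
  return bytes;
}

// Recombine byte pairs into code units and rebuild the string.
function bytesToString(bytes) {
  let out = "";
  for (let i = 0; i < bytes.length; i += 2) {
    out += String.fromCharCode((bytes[i] << 8) | bytes[i + 1]);
  }
  return out;
}

// "a", the euro sign and U+24B62 (which needs a surrogate pair).
const sample = "a\u20AC\u{24B62}";
const bytes = stringToBytes(sample);    // 4 code units -> 8 bytes
const roundTrip = bytesToString(bytes); // same string as sample
```

Because surrogate pairs are just two ordinary code units, nothing special is needed to "detect" them here: as long as the bytes come back in the same order, the halves line up again.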

You could simply convert to UTF-8, for example by using this trick:
function encode_utf8(s) {
return unescape(encodeURIComponent(s));
}
function decode_utf8(s) {
return decodeURIComponent(escape(s));
}
Considering you are using crypto-js, you can use its methods to convert to utf8 and return to string. See here:
var words = CryptoJS.enc.Utf8.parse('𤭢');
var utf8 = CryptoJS.enc.Utf8.stringify(words);
The 𤭢 (U+24B62) in that example is a character outside the Basic Multilingual Plane, so it takes four bytes in UTF-8.
Judging by the other examples (see the Latin1 one), parse converts a string to UTF-8 and stores the result in a WordArray, the special array type crypto-js uses, which can then be passed to the AES encryption routine. stringify does the reverse: it takes a WordArray (for example, the output of decryption) and converts its UTF-8 bytes back into a string.
JsFiddle example: http://jsfiddle.net/UpJRm/
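The escape/unescape functions used in the trick above are deprecated. As an alternative that neither answer mentions, here is a rough sketch with the standard TextEncoder and TextDecoder classes, assuming an environment that provides them (current browsers and Node.js do):

```javascript
const encoder = new TextEncoder();          // always encodes to UTF-8
const decoder = new TextDecoder("utf-8");

// "a" (1 byte), the euro sign (3 bytes) and U+24B62 (4 bytes).
const text = "a\u20AC\u{24B62}";

const utf8Bytes = encoder.encode(text);     // Uint8Array of UTF-8 bytes
const restored = decoder.decode(utf8Bytes); // back to a JavaScript string
```

The byte array is what you would hand to the cipher, and the decrypted bytes go back through decode.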

Related

Javascript: Convert Unicode Character to hex string [duplicate]

I'm using a barcode scanner to read a barcode on my website (the website is made in OpenUI5).
The scanner works like a keyboard that types the characters it reads. At the end and the beginning of the typing it uses a special character. These characters are different for every type of scanner.
Some possible characters are:
█
▄
–
—
In my code I use if (oModelScanner.oData.scanning && oEvent.key == "\u2584") to check if the input from the scanner is ▄.
Is there any way to get the code from that character in the \uHHHH style? (with the HHHH being the hexadecimal code for the character)
I tried charCodeAt, but it returns the decimal code.
The codePointAt examples I found also turn the character into a decimal code, so I need the reverse of that.

JavaScript strings have a codePointAt method that gives you the integer value of the Unicode code point. To format that integer as a four-digit hexadecimal sequence (as in Nikolay Spasov's answer), convert it to base 16:
var hex = "▄".codePointAt(0).toString(16);
var result = "\\u" + "0000".substring(0, 4 - hex.length) + hex;
However, it would probably be easier to check directly whether the key's code point matches the expected one:
oEvent.key.codePointAt(0) === '▄'.codePointAt(0);
Note that "symbol equality" can be trickier than it looks: some symbols are encoded as surrogate pairs (think of them as two halves, each written as a four-digit hexadecimal sequence).
For this reason I would recommend using a specialized library.
You'll find more details in the very relevant article by Mathias Bynens.

var hex = "▄".charCodeAt(0).toString(16);
var result = "\\u" + "0000".substring(0, 4 - hex.length) + hex;
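The same idea can be wrapped in a small helper that also copes with code points above U+FFFF. This is only a sketch: toUnicodeEscape is a made-up name, and it assumes padStart is available (ES2017 and later):

```javascript
// Hypothetical helper: format the first code point of a string as an escape.
function toUnicodeEscape(ch) {
  const codePoint = ch.codePointAt(0);
  const hex = codePoint.toString(16);
  // BMP code points fit the four-digit \uHHHH form; larger ones need \u{...}.
  return codePoint <= 0xffff
    ? "\\u" + hex.padStart(4, "0")
    : "\\u{" + hex + "}";
}

const lowerHalfBlock = toUnicodeEscape("\u2584"); // the scanner's ▄ character
const enDash = toUnicodeEscape("\u2013");
const facePalm = toUnicodeEscape("\u{1F926}");    // needs the \u{...} form
```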

If you want to print the multiple code points of a character, e.g., an emoji, you can do this:
const facepalm = "🤦🏼‍♂️";
const codePoints = Array.from(facepalm)
.map((v) => v.codePointAt(0).toString(16))
.map((hex) => "\\u{" + hex + "}");
console.log(codePoints);
["\u{1f926}", "\u{1f3fc}", "\u{200d}", "\u{2642}", "\u{fe0f}"]
If you are wondering about the components and the length of 🤦🏼‍♂️, check out this article.

JS base n string to base64

I have a string (with numbers under 128) separated by a comma:
"127,25,34,52,46,2,34,4,6,1"
Because there are 10 digits and one comma, that makes 11 total characters. How can I convert this string from "base 11" to "base 64"? I would like to compress this string into base64. I tried window.btoa, but it produces a larger output because the browser doesn't know that the string only has 11 characters.
Thanks in advance.

Base64 encoding never produces shorter strings. It is not intended as a compression tool, but as a way to reduce the character set to 64 readable characters, on the assumption that the input may draw on a larger character set (even if not all of those characters are actually used).
Given the format of your string, why not take those numbers and use them as ASCII, and then apply Base64 encoding on that?
Demo:
let s = "127,25,34,52,46,2,34,4,6,1";
console.log(s);
let encoded = btoa(String.fromCharCode(...s.match(/\d+/g)));
console.log(encoded);
let decoded = Array.from(atob(encoded), c => c.charCodeAt()).join();
console.log(decoded);

decoding array of utf8 strings inside a stream

I ran into a weird problem today while trying to decode a UTF-8 formatted string. It's being fetched through a stream as an array of strings, but somehow formatted in UTF-8 (I'm using fast-csv). As you can see in the console output, if I log it directly it shows the correct version, but when it's inside an object literal it's back to the UTF-8 encoded version.
var stream = fs
.createReadStream(__dirname + '/my.csv')
.pipe(csv({ ignoreEmpty: true }))
.on('data', data => {
console.log(data[0])
// prints farren#rogers.com
console.log({ firstName: data[0] })
// prints { firstName: '\u0000f\u0000a\u0000r\u0000r\u0000e\u0000n\u0000#\u0000r\u0000o\u0000g\u0000e\u0000r\u0000s\u0000.\u0000c\u0000o\u0000m\u0000' }
})
Any solution or explanations are appreciated.
Edit: even after decoding it with utf8.js and then passing it into the object literal, I still run into the same problem.

JavaScript uses UTF-16 for Strings. It also has a numeric escape notation for a UTF-16 code unit. So, when you see this output in your debugger
\u0000f\u0000a\u0000r\u0000r\u0000e\u0000n
It is saying that the String's code units are \u0000 f \u0000 a etc. The \uHHHH escape means the UTF-16 code unit HHHH in hexadecimal. \u0000 is the single (unpaired) UTF-16 code unit needed for the U+0000 (NUL) Unicode codepoint. So, something is being interpreted as NUL f NUL a, etc.
UTF-8 code units are 8 bits each. NUL in UTF-8 is 0x00. f is 0x66.
UTF-16 code units are 16 bits each. NUL is 0x0000. f is 0x0066. When 16-bit values are stored as bytes, endianness applies. In little endian, 0x0066 is written as 0x66 0x00. In big endian, 0x00 0x66.
So, if bytes of UTF-16 code units (such as the ones in the example data) are interpreted as UTF-8 (or perhaps other encodings), f can be read as NUL f or f NUL.
The fundamental rule of character encodings is to read with the same encoding that the text was written with. Not doing so can lead to data loss and corruption that can go undetected. Not knowing what the encoding is to begin with is data loss itself and a failed communication.
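The effect is easy to reproduce in Node.js. The sketch below assumes the file was written as little-endian UTF-16 (the actual CSV may well be big-endian or carry a byte order mark) and uses a made-up sample string:

```javascript
// Sample text, written out as UTF-16LE bytes: 66 00 61 00 72 00 ...
const original = "farren";
const utf16leBytes = Buffer.from(original, "utf16le");

// Reading those bytes with the wrong encoding leaves a NUL after every letter.
const misread = utf16leBytes.toString("utf8");

// Reading them with the encoding they were written in restores the text.
const correct = utf16leBytes.toString("utf16le");

console.log(JSON.stringify(misread));
console.log(JSON.stringify(correct));
```

For the stream in the question, the equivalent fix would be to tell fs.createReadStream which encoding the file really uses instead of relying on the default.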
You can learn more about Unicode at Unicode.org. You can learn more about languages and technologies that use it from their respective specifications—they are all very upfront and clear about it. JavaScript, Java, C#, VBA/VB4/VB5/VB6, VB.NET, F#, HTML, XML, T-SQL,…. (Okay, VB4 documentation might not be quite as clear but the point is that this is very common and not new [VBPJ Sept. 1996], though we all are still struggling to assimilate it.)

Unexpected behaviour of String.fromCodePoint / String#codePointAt (Firefox/ES6)

Since version 29 of Firefox, Mozilla provides the String.fromCodePoint and String#codePointAt methods and also published polyfills on the respective MDN pages.
So it happens that I am currently trying this out and it seems that I am missing something important, as splitting the string "ä☺𠜎" into codepoints and reassembling it from these returns an, at least for me, unexpected result.
I've built a test case: http://jsfiddle.net/dcodeIO/YhwP7/
var str = "ä☺𠜎";
...split it, reassemble it...
Am I missing something?

This is not a problem with .codePointAt, but with how the character 𠜎 is encoded. 𠜎 has a JavaScript string length of 2.
Why?
Because JavaScript strings are encoded as UTF-16 with 16-bit (2-byte) code units. The code point of 𠜎 (132878) is greater than what a single code unit can hold (0-65535), so it has to be encoded with two code units (4 bytes), a surrogate pair. Its UTF-16 representation is 0xD841 0xDF0E, which takes up two "characters" in the string.
When using .charCodeAt() you will see these values:
var string = "𠜎";
console.log( string.charCodeAt(0), string.charCodeAt(1) ); // logs 55361 57102 (0xD841 0xDF0E)
Why doesn't it display 228, 9786, 55361, 57102?
That's because .codePointAt() correctly converts a surrogate pair into a single integer (132878).
So why does it output 57102 then?
Because your loop iterates up to str.length, which is 4 (because "𠜎".length == 2), so .codePointAt() also gets executed at index 3, where it finds the lone low surrogate 57102.
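One way around the stray value is to iterate by code point instead of by index. A short sketch, with the test string written as escapes so the exact code points are unambiguous:

```javascript
// U+00E4, U+263A and U+2070E, the three characters from the question.
const str = "\u00E4\u263A\u{2070E}";

// Array.from uses the string iterator, which walks code points rather than
// UTF-16 code units, so the surrogate pair is visited once.
const codePoints = Array.from(str, (ch) => ch.codePointAt(0));

// Reassemble the string from the collected code points.
const rebuilt = String.fromCodePoint(...codePoints);

console.log(codePoints);      // three entries, although str.length is 4
console.log(rebuilt === str);
```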

Retrieving 16-bit integer from LWIP server response

On the server side I have the following loop. It takes each 16-bit integer (from 0 to 639) and separates it into two 8-bit chars to fill the buffer (1280 bytes), which is then sent via TCP/IP to the client.
.c
unsigned int data2[1000];
char *p;
len = generate_http_header(buf, "js", 1280);
p = buf + len;
for (j=0; j<640; j++)
{
char_out[1]=(unsigned char)(data2[j]&0x00FF);
char_out[0]=(unsigned char)((data2[j]>>8)&(0x00FF));
*p=char_out[0];
p=p+1;
*p=char_out[1];
p=p+1;
}
....
tcp_write(pcb, buf, len, 1);
tcp_output(pcb);
On the client side I want to retrieve the 16-bit integers from the JSON object. I came up with this solution, but something is going wrong and I cannot get all the integer values (0 to 639).
.js
var bin=o.responseText;
for(i=0;i<1000;i=i+2)
{
a=bin[i].charCodeAt();
b=bin[i+1].charCodeAt();
// Get binary representation.
a=parseInt(a).toString(2);
a=parseInt(a);
//alert('a(bin) before:'+a);
b=parseInt(b).toString(2);
b=parseInt(b);
//padding zeros left.
a=pad(a,8);
b=pad(b,8)
//Concatenate and convert to string.
a=a.toString();
b=b.toString();
c=a+b;
//Convert to decimal
c=parseInt(c,2);
//alert('DECIMAL FINAL NUMBER:'+c)
fin=fin+c.toString();
}
alert('FINAL NUMBER'+fin);
I used Firebug to see the HTTP response from the server:
����������� �
���
������������������� �!�"�#�$�%�&�'�(�)�*�+�,�-�.�/ � 0�1�2�3�4�5�6�7�8�9�:�;�<�=�>�?�#�A�B�C�D�E�F�G�H�I�J�K�L�M�N�O�P�Q�R�S�T�U�V�W�X�Y�Z�[�\ �]� ^�_�`�a�b�c�d�e�f�g�h�i�j�k�l�m�n�o�p�q�r�s�t�u�v�w�x�y�z�{�|�}�~������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������
!"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~���������������������������������������������������������������������������������������������������������������������������������
!"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUV�����������QR���� ��Ps������������$���������������P�������������
After running the .js code I get the expected numbers from 0 to 127 (0, 1, 2, ... 127), but from 128 to 256 every number comes out as 255 instead of 128, 129, 130, ... 256. After 256 every number is correct and in sequence (257, ... 639). I think the problem is related to charCodeAt(), which returns the Unicode value of the character. For some reason it always returns 255, as if I had the same character every time, but that is impossible because the server is sending 129, 130, 131, ... 255. Any idea what could be happening?
Before settling on the current solution I tried to retrieve the 16-bit integer directly from the JSON object, but could not remove the dependency on a LUT. How can I get the 8 bits of each char in o.responseText="abcdefgh..." without using a LUT to find the equivalent ASCII code and then the binary representation? I think it's possible with the bitwise operator &, but in that case I would still need to convert to the binary equivalent first and then to an integer. How can I perform bitwise operations directly on strings in JavaScript?

It looks like your data is being interpreted as UTF-8. UTF-8 is ASCII-compatible, so all ASCII characters (up to 127) are displayed fine. The remaining bytes are not valid UTF-8 on their own, so the program displaying the data substitutes the replacement character � for them. Try changing the encoding on the client (the receiving program) to ISO-8859-1.
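Another option, not part of the answer above, is to skip text decoding altogether and read the response as raw bytes. The sketch below assumes the body arrives as an ArrayBuffer (for example by setting responseType = "arraybuffer" on the XMLHttpRequest before sending) and that the server writes the high byte first, as in the C loop. decodeUint16BE is a made-up helper name:

```javascript
// Hypothetical helper: read consecutive big-endian 16-bit integers.
function decodeUint16BE(buffer) {
  const view = new DataView(buffer);
  const values = [];
  for (let offset = 0; offset + 1 < view.byteLength; offset += 2) {
    values.push(view.getUint16(offset, false)); // false = big endian
  }
  return values;
}

// Stand-in for the response body: 127, 128, 255, 256 and 639 as byte pairs.
const demoBytes = new Uint8Array([
  0x00, 0x7f, 0x00, 0x80, 0x00, 0xff, 0x01, 0x00, 0x02, 0x7f,
]);
const decodedValues = decodeUint16BE(demoBytes.buffer);
console.log(decodedValues);
```

Since no character decoding is involved, bytes in the 128 to 255 range come through unchanged.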
