Create UTF8 String from UTF Codes in JavaScript

I have the byte representation of UTF8, e.g.
195, 156 for "Ü" (capital U Umlaut)
I need to create a string for display in JavaScript out of these numbers - everything I tried failed.
No method I found recognizes 195 as a UTF-8 leading byte; everything just gave me "Ã".
So how do I get a string to display from my stream of UTF8 bytes?

You're working with the decimal representations of the individual bytes of the characters. For the example given, you have 195, 156. First, convert them to base 16: C3, 9C. From there you can use JavaScript's decodeURIComponent function.
console.log(decodeURIComponent(`%${(195).toString(16)}%${(156).toString(16)}`));
If you're doing this with a lot of characters, you probably want to find a library that implements string encoding / decoding. For example, node's Buffer objects do this internally.
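For example, here is a minimal sketch (not from the original answer) using the standard TextDecoder API in the browser, or Buffer in Node.js, to decode an array of UTF-8 byte values:
// Browser: decode a Uint8Array of UTF-8 bytes into a string.
const bytes = new Uint8Array([195, 156]);             // UTF-8 bytes for "Ü"
const text = new TextDecoder('utf-8').decode(bytes);
console.log(text);                                    // "Ü"
// Node.js equivalent using Buffer:
// console.log(Buffer.from([195, 156]).toString('utf8'));  // "Ü"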

Related

JS base n string to base64

I have a string (with numbers under 128) separated by a comma:
"127,25,34,52,46,2,34,4,6,1"
Because there are 10 digits and one comma, that makes 11 possible characters in total. How can I convert this string from "base 11" to "base 64"? I would like to compress this string into base64. I tried window.btoa, but it produces a larger output because the browser doesn't know that the string only uses 11 distinct characters.
Thanks in advance.
Base64 encoding never produces shorter strings. It is not intended as a compression tool, but as a means to reduce the used character set to 64 readable characters, taking into account that the input may use a larger character set (even if not all of those characters are actually used).
Given the format of your string, why not take those numbers and use them as ASCII, and then apply Base64 encoding on that?
Demo:
let s = "127,25,34,52,46,2,34,4,6,1";
console.log(s);
// Turn each number into a single character, then Base64-encode the result.
let encoded = btoa(String.fromCharCode(...s.match(/\d+/g)));
console.log(encoded);
// Reverse: decode, read each character code back, and rejoin with commas.
let decoded = Array.from(atob(encoded), c => c.charCodeAt()).join();
console.log(decoded);

decoding array of utf8 strings inside a stream

I faced a weird problem today after trying to decode a UTF-8 formatted string. It's being fetched through a stream as an array of strings, but somehow formatted in UTF-8 (I'm using fast-csv). However, as you can see in the console, if I log it directly it shows the correct version, but when it's inside an object literal it's back to the UTF-8 encoded version.
var stream = fs
  .createReadStream(__dirname + '/my.csv')
  .pipe(csv({ ignoreEmpty: true }))
  .on('data', data => {
    console.log(data[0])
    // prints farren#rogers.com
    console.log({ firstName: data[0] })
    // prints { firstName: '\u0000f\u0000a\u0000r\u0000r\u0000e\u0000n\u0000#\u0000r\u0000o\u0000g\u0000e\u0000r\u0000s\u0000.\u0000c\u0000o\u0000m\u0000' }
  })
Any solution or explanations are appreciated.
Edit: even after decoding using utf8.js and then passing it into the object literal, I still encounter the same problem.
JavaScript uses UTF-16 for Strings. It also has a numeric escape notation for a UTF-16 code unit. So, when you see this output in your debugger
\u0000f\u0000a\u0000r\u0000r\u0000e\u0000n
It is saying that the String's code units are \u0000 f \u0000 a etc. The \uHHHH escape means the UTF-16 code unit HHHH in hexadecimal. \u0000 is the single (unpaired) UTF-16 code unit needed for the U+0000 (NUL) Unicode codepoint. So, something is being interpreted as NUL f NUL a, etc.
UTF-8 code units are 8 bits each. NUL in UTF-8 is 0x00. f is 0x66.
UTF-16 code units are 16 bits each. NUL is 0x0000. f is 0x0066. When 16-bit values are stored as bytes, endianness applies. In little endian, 0x0066 is written as 0x66 0x00; in big endian, as 0x00 0x66.
So, if the bytes of UTF-16 code units (such as the ones in the example data) are interpreted as UTF-8 (or perhaps some other encoding), an f can be read as NUL f or f NUL, depending on the byte order.
The fundamental rule of character encodings is to read with the same encoding that the text was written with. Not doing so can lead to data loss and corruption that can go on undetected. Not knowing what the encoding is to begin with is data loss itself and a failed communication.
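As a minimal, hypothetical Node.js sketch (assuming the CSV file really is UTF-16 encoded, which the interleaved NULs suggest), the fix is to decode the raw bytes with the encoding they were written in before parsing:
// Assumes the file is UTF-16 rather than UTF-8. The '\u0000f' pattern above suggests
// big-endian order; Buffer decodes UTF-16LE natively, so the byte pairs are swapped
// first (skip swap16() if the file is little-endian).
const fs = require('fs');
const raw = fs.readFileSync(__dirname + '/my.csv'); // raw bytes, no encoding applied yet
const text = raw.swap16().toString('utf16le');      // decode with the encoding it was written in
console.log(text.slice(0, 40));                     // no interleaved NULs any more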
You can learn more about Unicode at Unicode.org. You can learn more about languages and technologies that use it from their respective specifications—they are all very upfront and clear about it. JavaScript, Java, C#, VBA/VB4/VB5/VB6, VB.NET, F#, HTML, XML, T-SQL,…. (Okay, VB4 documentation might not be quite as clear but the point is that this is very common and not new [VBPJ Sept. 1996], though we all are still struggling to assimilate it.)

javascript atob("CDA=") doesn't provide the expected behaviour (at least on Chrome 32)

I'm trying to convert the base64 encoded string "CDA=" into a binary buffer, using JavaScript. I have tried calling the function atob, but the result is always an empty array.
I have tried atob with character strings that I encoded with btoa, and atob provides the expected result. So it seems that it doesn't always fail, but probably only when the base64 string represents binary data. From the internet, I see that binary data should also be handled... Does anyone have an explanation for this behaviour?
atob() returns a string, not an array.
Your Base64 string decodes to the bytes 0x08 0x30, which are interpreted as <backspace><zero> when you look at it and see:
> window.atob("CDA=")
"0"
However both bytes are present:
> window.atob("CDA=").charCodeAt(0)
8
> window.atob("CDA=").charCodeAt(1)
48
If you want an array, see Creating a Blob from a base64 string in JavaScript.
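As a small sketch of one common approach (not taken from the linked answer), you can map the decoded string's character codes into a typed array:
// Turn the atob() result into a byte array.
const binaryString = window.atob("CDA=");
const bytes = Uint8Array.from(binaryString, c => c.charCodeAt(0));
console.log(bytes); // Uint8Array [ 8, 48 ]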

Javascript encoding breaking & combining multibyte characters?

I'm planning to use a client-side AES encryption for my web-app.
Right now, I've been looking for ways to break multibyte characters into single-byte 'non-characters', encrypt them (so the encrypted text has the same length),
decrypt them, and convert those single-byte 'non-characters' back into multibyte characters.
I've seen the wiki for UTF-8 (the supposedly-default encoding for JS?) and UTF-16, but I can't figure out how to detect "fragmented" multibyte characters and how I can combine them back.
Thanks : )
JavaScript strings are UTF-16 stored in 16-bit "characters". For Unicode characters ("code points") that require more than 16 bits (some code points require 32 bits in UTF-16), each JavaScript "character" is actually only half of the code point.
So to "break" a JavaScript character into bytes, you just take the character code and split off the high byte and the low byte:
var code = str.charCodeAt(0); // The first character, obviously you'll have a loop
var lowbyte = code & 0xFF;
var highbyte = (code & 0xFF00) >> 8;
(Even though JavaScript's numbers are floating point, the bitwise operators work in terms of 32-bit integers, and of course in our case only 16 of those bits are relevant.)
You'll never have an odd number of bytes, because again this is UTF-16.
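A minimal sketch of that idea applied to a whole string (the helper names toBytes/fromBytes are hypothetical), splitting every UTF-16 code unit into two bytes and recombining them:
// Hypothetical helpers: UTF-16 code units <-> byte pairs (big-endian here, by choice).
function toBytes(str) {
  var bytes = [];
  for (var i = 0; i < str.length; i++) {
    var code = str.charCodeAt(i);
    bytes.push((code & 0xFF00) >> 8, code & 0xFF); // high byte, then low byte
  }
  return bytes;
}
function fromBytes(bytes) {
  var chars = [];
  for (var i = 0; i < bytes.length; i += 2) {
    chars.push(String.fromCharCode((bytes[i] << 8) | bytes[i + 1]));
  }
  return chars.join('');
}
console.log(fromBytes(toBytes("Ü𤭢")) === "Ü𤭢"); // true – surrogate pairs survive as two code units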
You could simply convert to UTF-8... for example by using this trick:
function encode_utf8(s) {
  return unescape(encodeURIComponent(s));
}
function decode_utf8(s) {
  return decodeURIComponent(escape(s));
}
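For instance, a quick usage sketch (note that escape/unescape are deprecated, so treat this as a legacy trick):
// "Ü" is one UTF-16 code unit but encodes to two UTF-8 bytes.
console.log(encode_utf8("Ü").length);        // 2 – two single-byte "characters"
console.log(decode_utf8(encode_utf8("Ü")));  // "Ü"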
Considering you are using crypto-js, you can use its methods to convert to UTF-8 and back to a string. See here:
var words = CryptoJS.enc.Utf8.parse('𤭢');
var utf8 = CryptoJS.enc.Utf8.stringify(words);
The 𤭢 is probably a botched example of a UTF-8 character.
Looking at the other examples (see the Latin1 example), I'd say that with parse you convert a string to UTF-8 (technically, you convert it to UTF-8 and put it in a special array type used by crypto-js, WordArray), and the result can be passed to the AES encryption algorithm; with stringify you convert a WordArray (for example, one obtained from the decryption algorithm) back to a UTF-8 string.
JsFiddle example: http://jsfiddle.net/UpJRm/

Retrieving 16-bit integer from LWIP server response

On the server side I have the following loop: it takes a 16-bit integer (from 0 to 639) and separates it into two 8-bit chars to fill the buffer (1280 bytes). This is then sent via TCP/IP to the client.
.c
unsigned int data2[1000];
char *p;
len = generate_http_header(buf, "js", 1280);
p = buf + len;
for (j = 0; j < 640; j++)
{
    char_out[1] = (unsigned char)(data2[j] & 0x00FF);
    char_out[0] = (unsigned char)((data2[j] >> 8) & 0x00FF);
    *p = char_out[0];
    p = p + 1;
    *p = char_out[1];
    p = p + 1;
}
....
tcp_write(pcb, buf, len, 1);
tcp_output(pcb);
On the client side I want to retrieve the 16-bit integers from the JSON object. I came up with this solution, but something is happening and I cannot get all the integer values (0 to 639).
.js
var bin = o.responseText;
for (i = 0; i < 1000; i = i + 2)
{
    a = bin[i].charCodeAt();
    b = bin[i + 1].charCodeAt();
    // Get binary representation.
    a = parseInt(a).toString(2);
    a = parseInt(a);
    //alert('a(bin) before:'+a);
    b = parseInt(b).toString(2);
    b = parseInt(b);
    // Padding zeros left.
    a = pad(a, 8);
    b = pad(b, 8);
    // Concatenate and convert to string.
    a = a.toString();
    b = b.toString();
    c = a + b;
    // Convert to decimal.
    c = parseInt(c, 2);
    //alert('DECIMAL FINAL NUMBER:'+c)
    fin = fin + c.toString();
}
alert('FINAL NUMBER' + fin);
I used Firebug to see the HTTP response from the server:
����������� �
���
������������������� �!�"�#�$�%�&�'�(�)�*�+�,�-�.�/ � 0�1�2�3�4�5�6�7�8�9�:�;�<�=�>�?�#�A�B�C�D�E�F�G�H�I�J�K�L�M�N�O�P�Q�R�S�T�U�V�W�X�Y�Z�[�\ �]� ^�_�`�a�b�c�d�e�f�g�h�i�j�k�l�m�n�o�p�q�r�s�t�u�v�w�x�y�z�{�|�}�~������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������
!"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~���������������������������������������������������������������������������������������������������������������������������������
!"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUV�����������QR���� ��Ps������������$���������������P�������������
After running the .js code I get the right numbers from 0 to 127 (0, 1, 2, ... 127), but from 128 to 256 every number comes out as 255 instead of 128, 129, 130, ... 256. After 256 every number is OK and in sequence (257, ..., 639). I think the problem is related to the function charCodeAt(), which returns the Unicode value of the character. For some reason it always returns 255, as if I had the same character every time, but that is impossible because the server is sending 129, 130, 131, ..., 255. Any idea what could be happening? Before using the current solution I tried to retrieve the 16-bit integer directly from the JSON object but could not remove the dependency on a LUT. How can I get the 8 bits of each char in o.responseText = "abcdefgh..." without using a LUT to find the equivalent ASCII code and then the binary representation? I think it's possible with a bitwise operator &, but in that case I still need to convert to the binary equivalent first and then to an integer. How can I perform bitwise operations directly on strings in JavaScript?
It looks like your data is being displayed as UTF-8. UTF-8 is ASCII-compatible, so all ASCII characters (values up to 127) are displayed fine; the remaining bytes are not valid UTF-8 on their own, so the program that displays the data replaces these invalid bytes with the replacement character �. Try changing the client (receiving program) encoding to ISO-8859-1.
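As a minimal sketch of one way to avoid the charset problem on the client entirely (not from the original answer), you can ask XMLHttpRequest for raw bytes instead of decoded text and rebuild each 16-bit value with bitwise operators:
// Request the response as raw bytes rather than decoded text.
var xhr = new XMLHttpRequest();
xhr.open('GET', '/data', true);        // hypothetical URL for the LWIP endpoint
xhr.responseType = 'arraybuffer';
xhr.onload = function () {
  var bytes = new Uint8Array(xhr.response);
  var values = [];
  for (var i = 0; i + 1 < bytes.length; i += 2) {
    values.push((bytes[i] << 8) | bytes[i + 1]);   // big-endian byte pair -> 16-bit integer
  }
  console.log(values);                 // 0, 1, 2, ... 639, with no charset corruption
};
xhr.send();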
