decoding array of utf8 strings inside a stream

I ran into a weird problem today while trying to decode a UTF-8 formatted string. It's fetched through a stream as an array of strings, but somehow it is still encoded (I'm using fast-csv). As you can see below, if I log the value directly the console shows the correct text, but when the value is inside an object literal it shows the encoded version again.
var fs = require('fs')
var csv = require('fast-csv')

var stream = fs
  .createReadStream(__dirname + '/my.csv')
  .pipe(csv({ ignoreEmpty: true }))
  .on('data', data => {
    console.log(data[0])
    // prints farren#rogers.com
    console.log({ firstName: data[0] })
    // prints { firstName: '\u0000f\u0000a\u0000r\u0000r\u0000e\u0000n\u0000#\u0000r\u0000o\u0000g\u0000e\u0000r\u0000s\u0000.\u0000c\u0000o\u0000m\u0000' }
  })
Any solution or explanations are appreciated.
Edit: even after decoding with utf8.js and then passing the result into the object literal, I still hit the same problem.

JavaScript uses UTF-16 for Strings. It also has a numeric escape notation for a UTF-16 code unit. So, when you see this output in your debugger
\u0000f\u0000a\u0000r\u0000r\u0000e\u0000n
It is saying that the String's code units are \u0000 f \u0000 a etc. The \uHHHH escape means the UTF-16 code unit HHHH in hexadecimal. \u0000 is the single (unpaired) UTF-16 code unit needed for the U+0000 (NUL) Unicode code point. So, something is being interpreted as NUL f NUL a, etc.
UTF-8 code units are 8 bits each. NUL in UTF-8 is 0x00. f is 0x66.
UTF-16 code units are 16 bits each. NUL is 0x0000. f is 0x0066. When 16-bit values are stored as bytes, endianness applies. In little endian, 0x0066 is written as 0x66 0x00. In big endian, it is written as 0x00 0x66.
So, if bytes of UTF-16 code units (such as the ones in the example data) are interpreted as UTF-8 (or perhaps other encodings), f can be read as NUL f or f NUL.
The fundamental rule of character encodings is to read with the same encoding that the text was written with. Not doing so can lead to data loss and corruption that can go undetected. Not knowing what the encoding is to begin with is data loss in itself and a failed communication.
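As a minimal sketch of that mismatch in Node.js (the sample string is made up for illustration, it is not the asker's file), the same bytes can be decoded with two different encodings:

```javascript
// Encode a short sample string as UTF-16LE: 66 00 61 00 72 00.
const bytes = Buffer.from('far', 'utf16le');

// Reading with the encoding the text was written with recovers it.
const right = bytes.toString('utf16le');

// Reading the same bytes as UTF-8 turns every 0x00 byte into a NUL character.
const wrong = bytes.toString('utf8');

console.log(JSON.stringify(right)); // "far"
console.log(JSON.stringify(wrong)); // "f\u0000a\u0000r\u0000"
```

If the CSV file in the question really is UTF-16LE, one thing to try is passing { encoding: 'utf16le' } to fs.createReadStream. Whether the CSV parser then handles the decoded strings (and any byte order mark) as expected is an assumption to verify.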
You can learn more about Unicode at Unicode.org. You can learn more about languages and technologies that use it from their respective specifications—they are all very upfront and clear about it. JavaScript, Java, C#, VBA/VB4/VB5/VB6, VB.NET, F#, HTML, XML, T-SQL,…. (Okay, VB4 documentation might not be quite as clear but the point is that this is very common and not new [VBPJ Sept. 1996], though we all are still struggling to assimilate it.)

Related

what character encoding can yield identical results for md5 with an array buffer?

So, if I do:
fileReader.onload = function (e) {
  console.log(md5(e.target.result));
};
fileReader.readAsArrayBuffer(blob);
I get: df9206f11a5c4fc7841fca94522f19f2
But, if I do:
fileReader.onload = function (e) {
  console.log(md5(e.target.result));
};
fileReader.readAsText(blob);
I get a completely different hash. I assume this is due to character encoding? So I am curious, what encoding can I use which will result in an identical hash?

Using readAsArrayBuffer() will read the source as a "pure" byte-range independent of what the data represents and its byte-order.
Using readAsText() decodes the bytes into a JavaScript string. Without an encoding argument the File API falls back to UTF-8, and the resulting string is held as UTF-16 (or UCS-2) code units, so hashing that string is not the same as hashing the original bytes. This produces a completely different result, as you noticed.
If you know the source is in for example UTF-8 text format you can read it using the optional encoding options with readAsText(blob[, encoding]) (see supported encoding types).
Any common single-byte code page should suffice in that case, since MD5 signatures as text are always within the ASCII range. The main issue, then, is that each byte must be read as a single character rather than combined into wider units as with UTF-16/UCS-2.
A different problem could be byte order. In that case an alternative is to read the data as an ArrayBuffer and then use TextDecoder (see example answer) with the correct byte order, e.g. little-endian or big-endian (denoted "le" and "be", as in "utf-16be", in the encoding types linked above). TextDecoder also has a BOM option (ignoreBOM) for this approach.
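A rough sketch of the single-byte idea, using Node's built-in crypto module in place of the md5() function and FileReader from the question (that substitution, the md5Hex helper name, and the sample bytes are all assumptions for illustration):

```javascript
const crypto = require('crypto');

// Hypothetical helper: MD5 of a Buffer, or of a string encoded as `encoding`.
function md5Hex(data, encoding) {
  const hash = crypto.createHash('md5');
  if (typeof data === 'string') {
    hash.update(data, encoding);
  } else {
    hash.update(data);
  }
  return hash.digest('hex');
}

// Arbitrary bytes, including some that are not valid UTF-8 (0xff, 0x80).
const raw = Buffer.from([0x00, 0x66, 0xc3, 0x9c, 0xff, 0x80]);

const fromBytes = md5Hex(raw);
// latin1 maps every byte to exactly one character and back again.
const viaLatin1 = md5Hex(raw.toString('latin1'), 'latin1');
// UTF-8 replaces the invalid bytes with U+FFFD, so the bytes change.
const viaUtf8 = md5Hex(raw.toString('utf8'), 'utf8');

console.log(fromBytes === viaLatin1); // true
console.log(fromBytes === viaUtf8);   // false
```

The hashes only match when turning the string back into bytes reproduces the original bytes exactly, which is why hashing the ArrayBuffer directly is the safer route.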

Create UTF8 String from UTF Codes in Javascript

I have the byte representation of UTF8, e.g.
195, 156 for "Ü" (capital U Umlaut)
I need to create a string for display in JavaScript out of these numbers - everything I tried failed.
No method I found recognizes 195 as a UTF-8 lead byte; they all gave me "Ã" instead.
So how do I get a string to display from my stream of UTF8 bytes?

You're working with decimal representations of the individual bytes of the character. For the example given, you have 195, 156. First, convert them to base 16: C3, 9C. From there you can use JavaScript's decodeURIComponent function.
console.log(decodeURIComponent(`%${(195).toString(16)}%${(156).toString(16)}`));
If you're doing this with a lot of characters, you probably want to find a library that implements string encoding / decoding. For example, node's Buffer objects do this internally.
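For a whole stream of bytes, a sketch along these lines may be easier than building the percent-encoded string by hand. It assumes an environment with TextDecoder (modern browsers, recent Node.js); decodeViaPercent is a hypothetical helper name, and it zero-pads each byte because values below 16 would otherwise yield a malformed escape such as %9:

```javascript
// The two bytes from the question: UTF-8 for "Ü".
const utf8Bytes = Uint8Array.from([195, 156]);

// Option 1: let the platform decode the bytes.
const viaTextDecoder = new TextDecoder('utf-8').decode(utf8Bytes);

// Option 2: the percent-encoding trick, generalized to any byte array.
function decodeViaPercent(byteValues) {
  const escaped = Array.from(byteValues)
    .map(b => '%' + b.toString(16).padStart(2, '0'))
    .join('');
  return decodeURIComponent(escaped);
}

console.log(viaTextDecoder);              // Ü
console.log(decodeViaPercent(utf8Bytes)); // Ü
```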

Reassembling negative Python marshal int's into Javascript numbers

I'm writing a client-side Python bytecode interpreter in Javascript (specifically Typescript) for a class project. Parsing the bytecode was going fine until I tried out a negative number.
In Python, marshal.dumps(2) gives 'i\x02\x00\x00\x00' and marshal.dumps(-2) gives 'i\xfe\xff\xff\xff'. This makes sense as Python represents integers using two's complement with at least 32 bits of precision.
In my Typescript code, I use the equivalent of Node.js's Buffer class (via a library called BrowserFS, instead of ArrayBuffers and etc.) to read the data. When I see the character 'i' (i.e. buffer.readUInt8(offset) == 105, signalling that the next thing is an int), I then call readInt32LE on the next offset to read a little-endian signed long (4 bytes). This works fine for positive numbers but not for negative numbers: for 1 I get '1', but for '-1' I get something like '-272777233'.
I guess that Javascript represents numbers in 64-bit (floating point?). So, it seems like the following should work:
var longval = buffer.readInt32LE(offset); // reads a 4-byte long, gives -272777233
var low32Bits = longval & 0xffff0000; //take the little endian 'most significant' 32 bits
var newval = ~low32Bits + 1; //invert the bits and add 1 to negate the original value
//but now newval = 272826368 instead of -2
I've tried a lot of different things and I've been stuck on this for days. I can't figure out how to recover the original value of the Python integer from the binary marshal string using Javascript/Typescript. Also I think I deeply misunderstand how bits work. Any thoughts would be appreciated here.
Some more specific questions might be:
Why would buffer.readInt32LE work for positive ints but not negative?
Am I using the correct method to get the 'most significant' or 'lowest' 32 bits (i.e. does & 0xffff0000 work how I think it does?)
Separate but related: in an actual 'long' number (i.e. longer than '-2'), I think there is a sign bit and a magnitude, and I think this information is stored in the 'highest' 2 bits of the number (i.e. at number & 0x000000ff?) -- is this the correct way of thinking about this?

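The value -272777233 is 0xEFBDBFEF as a signed 32-bit integer, so the four bytes you read were ef bf bd ef. The sequence ef bf bd is the UTF-8 encoding of the "Unicode replacement character" (U+FFFD), which decoders emit in place of invalid input.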
It sounds like whatever method you're using to download the data is getting accidentally run through a UTF-8 decoder and corrupting the raw datastream. Be sure you're using blob instead of text, or whatever the equivalent is for the way you're downloading the bytecode.
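This got messed up only for negative values because the bytes of small positive values are all below 0x80, which is plain ASCII in UTF-8, and thus get translated 1:1 from the original byte stream. Bytes such as 0xfe and 0xff never occur in valid UTF-8, so each one is replaced.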
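The corruption can be reproduced in Node.js with a small sketch. The round trip through a UTF-8 string below only simulates what a text-mode download would do; it is an assumption about the asker's setup, not their actual code:

```javascript
// Payload of marshal.dumps(-2) after the 'i' type byte.
const original = Buffer.from([0xfe, 0xff, 0xff, 0xff]);
const clean = original.readInt32LE(0);

// Simulate an accidental text download: decode as UTF-8, then re-encode.
const mangled = Buffer.from(original.toString('utf8'), 'utf8');
const corrupted = mangled.readInt32LE(0);

console.log(clean);                   // -2
console.log(mangled.toString('hex')); // efbfbdefbfbdefbfbdefbfbd
console.log(corrupted);               // -272777233
```

So readInt32LE itself handles negative numbers correctly; the bytes had already been replaced before it ran.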

Javascript encoding breaking & combining multibyte characters?

I'm planning to use a client-side AES encryption for my web-app.
Right now I'm looking for a way to break multibyte characters into one-byte 'non-characters', encrypt them (so that the encrypted text has the same length), decrypt them, and then convert those one-byte 'non-characters' back into multibyte characters.
I've seen the wiki for UTF-8 (the supposedly-default encoding for JS?) and UTF-16, but I can't figure out how to detect "fragmented" multibyte characters and how I can combine them back.
Thanks : )

JavaScript strings are UTF-16 stored in 16-bit "characters". For Unicode characters ("code points") that require more than 16 bits (some code points require 32 bits in UTF-16), each JavaScript "character" is actually only half of the code point.
So to "break" a JavaScript character into bytes, you just take the character code and split off the high byte and the low byte:
var code = str.charCodeAt(0); // The first character, obviously you'll have a loop
var lowbyte = code & 0xFF;
var highbyte = (code & 0xFF00) >> 8;
(Even though JavaScript's numbers are floating point, the bitwise operators work in terms of 32-bit integers, and of course in our case only 16 of those bits are relevant.)
You'll never have an odd number of bytes, because again this is UTF-16.
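Put into a loop, and with the reverse step added, this might look like the sketch below. The function names are made up for illustration, and the high byte is emitted first by choice; any fixed order works as long as both functions agree:

```javascript
// Split every UTF-16 code unit into two bytes (high byte first).
function stringToBytes(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    bytes.push((code & 0xff00) >> 8, code & 0xff);
  }
  return bytes;
}

// Recombine byte pairs into code units and build the string back.
function bytesToString(bytes) {
  let str = '';
  for (let i = 0; i < bytes.length; i += 2) {
    str += String.fromCharCode((bytes[i] << 8) | bytes[i + 1]);
  }
  return str;
}

const sample = 'a\u00dc\ud852\udf62'; // 'a', 'Ü', and a surrogate pair
const sampleBytes = stringToBytes(sample);
console.log(sampleBytes.length);                    // 8
console.log(bytesToString(sampleBytes) === sample); // true
```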

You could simply convert to UTF8... For example by using this trick
function encode_utf8(s) {
  return unescape(encodeURIComponent(s));
}
function decode_utf8(s) {
  return decodeURIComponent(escape(s));
}
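Since escape and unescape are deprecated, a sketch of the same conversion with TextEncoder and TextDecoder may be preferable where those are available (modern browsers and recent Node.js; that availability is an assumption about your target environment). Unlike the trick above, it yields real bytes in a Uint8Array rather than a string of byte-valued characters:

```javascript
const encoder = new TextEncoder();            // always encodes to UTF-8
const decoder = new TextDecoder('utf-8');

const text = '\ud852\udf62';                  // U+24B62, one code point
const encoded = encoder.encode(text);         // Uint8Array of UTF-8 bytes
const decoded = decoder.decode(encoded);

console.log(text.length);        // 2 (UTF-16 code units)
console.log(encoded.length);     // 4 (UTF-8 bytes: f0 a4 ad a2)
console.log(decoded === text);   // true
```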
Considering you are using crypto-js, you can use its own methods to convert a string to UTF-8 and back. For example:
var words = CryptoJS.enc.Utf8.parse('𤭢');
var utf8 = CryptoJS.enc.Utf8.stringify(words);
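The 𤭢 in that example is a character outside the Basic Multilingual Plane (U+24B62), so it takes four bytes in UTF-8 and makes a reasonable multibyte test case.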
Judging by the other examples (see the Latin1 one), parse converts a string to UTF-8 and stores the bytes in crypto-js's own array type, WordArray, which can then be passed to the AES encryption routine. stringify does the reverse: it takes a WordArray (for example, one obtained from decryption) and interprets its bytes as UTF-8 to produce a string.
JsFiddle example: http://jsfiddle.net/UpJRm/

JavaScript-friendly binary-safe data format design (not JSON or XML)

First and foremost: JSON and XML are not an option in this specific case, please don't suggest them. If this makes it easier to accept that fact, imagine that I intend to reinvent the wheel for self-education.
Back to the point:
I need to design a binary-safe data format to encode some datagrams I send to a particular dumb server that I write (in C if that matters).
To simplify the question, let's say that I'm sending only numbers, strings and arrays.
Important fact: Server does not (and should not) know anything about Unicode and stuff. It treats all strings as binary blobs (and never looks inside them).
The format that I originally devised is as follows:
Datagram: <Number:size>\n<Value1>...<ValueN>
Value:
Number: N\n<Value>\n
String: S\n<Number:size-in-bytes>\n<bytes>\n
Array: A\n<Number:size>\n<Value0>...<ValueN>
Example:
[ 1, "foo", [] ]
Serializes as follows:
1 ; number of items in datagram
A ; -- array --
3 ; number of items in array
N ; -- number --
1 ; number value
S ; -- string --
3 ; string size in bytes
foo ; string bytes
A ; -- array --
0 ; number of items in array
The problem is that I can not reliably get a string size in bytes in JavaScript.
So, the question is: how to change the format, so a string can be both saved in JS and loaded in C neatly.
I do not want to add Unicode support to the server.
And I do not quite want to decode strings on server (say, from base64 or simply to unescape \xNN sequences) — this would require work with dynamic string buffers, which, given how dumb the server is, is not so desirable...
Any clues?

It seems that reading UTF-8 in plain C is not that scary after all. So I'm extending the protocol to handle UTF-8 strings natively. (But will appreciate an answer to this question as it stands.)
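A minimal sketch of the JavaScript side under that choice, assuming Node.js Buffers and the format exactly as described in the question, with string sizes counted in UTF-8 bytes. The function names serializeValue and serializeDatagram are hypothetical:

```javascript
// Serialize one value as N (number), S (string) or A (array).
function serializeValue(value) {
  if (typeof value === 'number') {
    return Buffer.from('N\n' + value + '\n', 'ascii');
  }
  if (typeof value === 'string') {
    const body = Buffer.from(value, 'utf8'); // body.length is the size in bytes
    return Buffer.concat([
      Buffer.from('S\n' + body.length + '\n', 'ascii'),
      body,
      Buffer.from('\n', 'ascii'),
    ]);
  }
  if (Array.isArray(value)) {
    const header = Buffer.from('A\n' + value.length + '\n', 'ascii');
    return Buffer.concat([header, ...value.map(serializeValue)]);
  }
  throw new TypeError('unsupported value type');
}

// A datagram is an item count followed by the serialized values.
function serializeDatagram(values) {
  const header = Buffer.from(values.length + '\n', 'ascii');
  return Buffer.concat([header, ...values.map(serializeValue)]);
}

const datagram = serializeDatagram([[1, 'foo', []]]);
console.log(JSON.stringify(datagram.toString('latin1')));
// "1\nA\n3\nN\n1\nS\n3\nfoo\nA\n0\n"
```

The server can then copy exactly that many bytes for each string without interpreting them. In a browser, new TextEncoder().encode(str).length gives the same UTF-8 byte count.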
