Encode ArrayBuffer of arbitrary length to custom alphabet - javascript

I have asked a few questions before about how to convert integers into a custom alphabet, and about some encoding topics I still don't fully understand yet, but I'm getting there. What I'm wondering about now is how to convert an arbitrary-length ArrayBuffer (basically just a bunch of bits of arbitrary length) into a custom alphabet, without using major library helpers like JavaScript's toString or parseInt.
Such a value is far bigger than the maximum integer, since the input could be a whole paragraph or document.
From my understanding so far, I would do this:
var array = new Uint8Array(500000)
array[0] = 123
array[1] = 123
array[2] = 123
// ... fill it in with some stuff.
stringify(array.buffer, '123abc')
// encode to 6-character alphabet, such as:
// 1a2ba3caa13a...
Then I feel stuck... There is this helpful example on how to do it for integers. But I am having difficulty applying it to this new situation.
Also would be helpful to know how to convert it back into the ArrayBuffer from the string that used the custom alphabet, so it would go both ways.
The conversion of array.buffer to some example output like 1a2ba3caa13a... would happen similarly to the radix stringifying in the linked question (well, I don't actually know how it would work). But it would go through the bits somehow and encode them using characters from the custom alphabet, like hex encoding, base64 encoding, etc.
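For illustration, here is a minimal sketch of one way this could work, treating the whole buffer as a single big number via BigInt and repeatedly dividing by the alphabet size. The function names mirror the stringify call above; note that, as written, leading zero bytes are lost on the round trip (a real version needs extra handling, the way base58 prefixes one character per leading zero byte), and repeated BigInt division gets slow for very large buffers:
function stringify(buffer, alphabet) {
  const bytes = new Uint8Array(buffer);
  const base = BigInt(alphabet.length);
  // Pack the bytes into one big integer (big-endian).
  let n = 0n;
  for (const b of bytes) n = (n << 8n) | BigInt(b);
  // Peel off digits in the custom base, least significant first.
  let out = '';
  do {
    out = alphabet[Number(n % base)] + out;
    n /= base;
  } while (n > 0n);
  return out;
}
function parse(str, alphabet) {
  const base = BigInt(alphabet.length);
  let n = 0n;
  for (const ch of str) n = n * base + BigInt(alphabet.indexOf(ch));
  // Unpack the big integer back into bytes (big-endian).
  const bytes = [];
  while (n > 0n) {
    bytes.unshift(Number(n & 0xffn));
    n >>= 8n;
  }
  return new Uint8Array(bytes).buffer;
}
// stringify(new Uint8Array([123, 123, 123]).buffer, '123abc') produces a short string over 1,2,3,a,b,c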

How can I convert this UTF-8 string to plain text in javascript and how can a normal user write it in a textarea [duplicate]

While reviewing JavaScript concepts, I found String.normalize(). This is not something that shows up in W3Schools' "JavaScript String Reference", which is probably why I missed it before.
I found more information about it on HackerRank, which states:
Returns a string containing the Unicode Normalization Form of the
calling string's value.
With the example:
var s = "HackerRank";
console.log(s.normalize());
console.log(s.normalize("NFKC"));
having as output:
HackerRank
HackerRank
Also, in GeeksForGeeks:
The string.normalize() is an inbuilt function in javascript which is
used to return a Unicode normalisation form of a given input string.
with the example:
<script>
// Taking a string as input.
var a = "GeeksForGeeks";
// calling normalize function.
b = a.normalize('NFC')
c = a.normalize('NFD')
d = a.normalize('NFKC')
e = a.normalize('NFKD')
// Printing normalised form.
document.write(b +"<br>");
document.write(c +"<br>");
document.write(d +"<br>");
document.write(e);
</script>
having as output:
GeeksForGeeks
GeeksForGeeks
GeeksForGeeks
GeeksForGeeks
Maybe the examples given are just really bad as they don't allow me to see any change.
I wonder... what's the point of this method?
It depends on what you will do with the strings: often you do not need it (if you are just getting input from the user and showing it back to the user). But to check/search/use such strings as keys, etc., you may want a unique way to identify the same string (semantically speaking).
The main problem is that you may have two strings which are semantically the same but have two different representations: e.g. one with an accented character [one code point], and one with a base character plus a combining accent [one code point for the character, one for the combining accent]. The user may not be in control of how the input text is sent, so you may end up with two different user names, or two different passwords. And if you mangle the data, you may get different results depending on the initial string. Users do not like that.
Another problem is the order of combining characters. You may have an accent and a lower tail (e.g. a cedilla): you could express this with several combinations: "pure char, tail, accent", "pure char, accent, tail", "char+tail, accent", "char+accent, tail".
And you may have degenerate cases (especially if you type from a keyboard): you may get code points which should be removed (you may have an arbitrarily long string which is equivalent to just a few bytes).
In any case, for sorting strings, you (or your library) need a normalized form: if you already provide the right one, the library will not need to transform it again.
So: you want the same (semantically speaking) string to have the same sequence of Unicode code points.
Note: if you are working directly on UTF-8, you should also take care of UTF-8's special cases: the same code point could be written in different ways [using more bytes than necessary]. This can also be a security problem.
The K forms are often used for "searches" and similar tasks: CO2 and CO₂ will be interpreted in the same manner, but this could change the meaning of the text, so they should usually be applied only internally, for temporary tasks, while keeping the original text.
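To make the K point concrete, here is a quick check you can run in a console (CO₂ below is written with U+2082 SUBSCRIPT TWO; the expected results are in the comments):
const label = 'CO\u2082';                        // "CO₂"
console.log(label === 'CO2');                    // false: different code points
console.log(label.normalize('NFC') === 'CO2');   // false: NFC keeps the subscript
console.log(label.normalize('NFKC') === 'CO2');  // true: the K forms fold the subscript 2 to a plain 2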
As stated in the MDN documentation, String.prototype.normalize() returns the Unicode Normalization Form of the string. This is because in Unicode, some characters can have more than one code point representation.
This is the example (taken from MDN):
const name1 = '\u0041\u006d\u00e9\u006c\u0069\u0065';
const name2 = '\u0041\u006d\u0065\u0301\u006c\u0069\u0065';
console.log(`${name1}, ${name2}`);
// expected output: "Amélie, Amélie"
console.log(name1 === name2);
// expected output: false
console.log(name1.length === name2.length);
// expected output: false
const name1NFC = name1.normalize('NFC');
const name2NFC = name2.normalize('NFC');
console.log(`${name1NFC}, ${name2NFC}`);
// expected output: "Amélie, Amélie"
console.log(name1NFC === name2NFC);
// expected output: true
console.log(name1NFC.length === name2NFC.length);
// expected output: true
As you can see, the string Amélie has two different Unicode representations. With normalization, we can reduce the two forms to the same string.
Very beautifully explained here --> https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize
Short answer: the point is that characters are represented through a coding scheme like ASCII, UTF-8, etc. (we mostly use UTF-8), and some characters have more than one representation. So two strings may render identically, yet their underlying code points may differ, and string comparison may fail. We use normalize() to reduce them to a single representation.
// source from MDN
let string1 = '\u00F1'; // ñ
let string2 = '\u006E\u0303'; // ñ
string1 = string1.normalize('NFC');
string2 = string2.normalize('NFC');
console.log(string1 === string2); // true
console.log(string1.length); // 1
console.log(string2.length); // 1
Normalization of strings isn't exclusive to JavaScript - see, for instance, Python. The valid argument values are defined by the Unicode standard (more on Unicode normalization).
When it comes to JavaScript, note that there's documentation under both String.normalize() and String.prototype.normalize(). As @ChrisG mentions:
String.prototype.normalize() is correct in a technical sense, because
normalize() is a dynamic method you call on instances, not the class
itself. The point of normalize() is to be able to compare Strings that
look the same but don't consist of the same characters, as shown in
the example code on MDN.
Then, when it comes to usage, I found a great example of String.normalize():
let s1 = 'sabiá';
let s2 = 'sabiá';
// one is in NFC, the other in NFD, so they're different
console.log(s1 == s2); // false
// with normalization, they become the same
console.log(s1.normalize('NFC') === s2.normalize('NFC')); // true
// transform string into array of codepoints
function codepoints(s) { return Array.from(s).map(c => c.codePointAt(0).toString(16)); }
// printing the codepoints you can see the difference
console.log(codepoints(s1)); // [ "73", "61", "62", "69", "e1" ]
console.log(codepoints(s2)); // [ "73", "61", "62", "69", "61", "301" ]
So while sabiá and sabiá in this example look the same to the human eye, and even via console.log(), without normalization we get a different result when comparing them. By inspecting the code points, we can see why they differ.
There are some great answers here already, but I wanted to throw in a practical example.
I enjoy Bible translation as a hobby. I wasn't too thrilled with the flashcard options out there in the wild in my price range (free), so I made my own. The problem is, there is more than one way to type Hebrew and Greek in Unicode and get the exact same thing. For example:
בָּא
בָּא
These should look identical on your screen, and for all practical purposes they are identical. However, the first was typed with the qamats (the little t shaped thing under it) before the dagesh (the dot in the middle of the letter) and the second was typed with the dagesh before the qamats. Now, since you're just reading this, you don't care. And your web browser doesn't care. But when my flashcards compare the two, then they aren't the same. To the code behind the scenes, it's no different than saying "center" and "centre" are the same.
Similarly, in Greek:
ἀ
ἀ
These two should look nearly identical, but the top is one Unicode character and the second one is two Unicode characters. Which one is going to end up typed in my flashcards is going to depend on which keyboard I'm sitting at.
When I'm adding flashcards, believe it or not, I don't always type in vocab lists of 100 words. That's why God gave us spreadsheets. And sometimes the places I'm importing the lists from do it one way, and sometimes they do it the other way, and sometimes they mix it. But when I'm typing, I'm not trying to memorize the order in which the dagesh or qamats appear, or whether the accents are typed as a separate character or not. Regardless of whether I remember to type the dagesh first or not, I want to get the right answer, because really it's the same answer in every practical sense either way.
So I normalize the order before saving the flashcards and I normalize the order before checking it, and the result is that it doesn't matter which way I type it, it comes out right!
If you want to check out the results:
https://sthelenskungfu.com/flashcards/
You need a Google or Facebook account to log in, so it can track progress and such. As far as I know (or care) only my daughter and I currently use it.
It's free, but eternally in beta.

decoding array of utf8 strings inside a stream

I faced a weird problem today after trying to decode a UTF-8 formatted string. It's being fetched through a stream as an array of strings, but formatted in UTF-8 somehow (I'm using fast-csv). However, as you can see in the console, if I log a value directly it shows the correct version, but when it's inside an object literal it's back to the UTF-8 encoded version.
var stream = fs
  .createReadStream(__dirname + '/my.csv')
  .pipe(csv({ ignoreEmpty: true }))
  .on('data', data => {
    console.log(data[0])
    // prints farren#rogers.com
    console.log({ firstName: data[0] })
    // prints { firstName: '\u0000f\u0000a\u0000r\u0000r\u0000e\u0000n\u0000#\u0000r\u0000o\u0000g\u0000e\u0000r\u0000s\u0000.\u0000c\u0000o\u0000m\u0000' }
  })
Any solution or explanations are appreciated.
Edit: even after decoding using utf8.js and then passing the value into the object literal, I still encounter the same problem.
JavaScript uses UTF-16 for Strings. It also has a numeric escape notation for a UTF-16 code unit. So, when you see this output in your debugger
\u0000f\u0000a\u0000r\u0000r\u0000e\u0000n
It is saying that the String's code units are \u0000 f \u0000 a etc. The \uHHHH escape means the UTF-16 code unit HHHH in hexadecimal. \u0000 is the single (unpaired) UTF-16 code unit needed for the U+0000 (NUL) Unicode codepoint. So, something is being interpreted as NUL f NUL a, etc.
UTF-8 code units are 8 bits each. NUL in UTF-8 is 0x00. f is 0x66.
UTF-16 code units are 16 bits each. NUL is 0x0000. f is 0x0066. When 16-bit values are stored as bytes, endianness applies. In little endian, 0x0066 is written as 0x66 0x00; in big endian, as 0x00 0x66.
So, if bytes of UTF-16 code units (such as the ones in the example data) are interpreted as UTF-8 (or perhaps other encodings), f can be read as NUL f or f NUL.
The fundamental rule of character encodings is to read with the same encoding that the text was written with. Not doing so can lead to data loss and corruption that can go undetected. Not knowing what the encoding is to begin with is data loss itself and a failed communication.
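To see the mix-up in miniature, here is a sketch in Node that only uses Buffer to fabricate the bytes (the string 'farren' mirrors the question's data): 16-bit code units written out as bytes and then read back with an 8-bit decoder come back with a NUL glued to every character. If the CSV on disk turns out to be UTF-16, the practical fix is to tell the read stream its real encoding, e.g. fs.createReadStream(path, { encoding: 'utf16le' }), or to decode the raw buffer yourself with that encoding.
const le = Buffer.from('farren', 'utf16le');         // bytes: 66 00 61 00 72 00 72 00 65 00 6e 00
console.log(JSON.stringify(le.toString('latin1')));  // "f\u0000a\u0000r\u0000r\u0000e\u0000n\u0000" (wrong decoder)
console.log(le.toString('utf16le'));                 // "farren" (matching decoder)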
You can learn more about Unicode at Unicode.org. You can learn more about languages and technologies that use it from their respective specifications—they are all very upfront and clear about it. JavaScript, Java, C#, VBA/VB4/VB5/VB6, VB.NET, F#, HTML, XML, T-SQL,…. (Okay, VB4 documentation might not be quite as clear but the point is that this is very common and not new [VBPJ Sept. 1996], though we all are still struggling to assimilate it.)

Reassembling negative Python marshal int's into Javascript numbers

I'm writing a client-side Python bytecode interpreter in Javascript (specifically Typescript) for a class project. Parsing the bytecode was going fine until I tried out a negative number.
In Python, marshal.dumps(2) gives 'i\x02\x00\x00\x00' and marshal.dumps(-2) gives 'i\xfe\xff\xff\xff'. This makes sense as Python represents integers using two's complement with at least 32 bits of precision.
In my Typescript code, I use the equivalent of Node.js's Buffer class (via a library called BrowserFS, instead of ArrayBuffers and the like) to read the data. When I see the character 'i' (i.e. buffer.readUInt8(offset) == 105, signalling that the next thing is an int), I then call readInt32LE on the next offset to read a little-endian signed long (4 bytes). This works fine for positive numbers but not for negative numbers: for 1 I get '1', but for '-1' I get something like '-272777233'.
I guess that Javascript represents numbers in 64-bit (floating point?). So, it seems like the following should work:
var longval = buffer.readInt32LE(offset); // reads a 4-byte long, gives -272777233
var low32Bits = longval & 0xffff0000; //take the little endian 'most significant' 32 bits
var newval = ~low32Bits + 1; //invert the bits and add 1 to negate the original value
//but now newval = 272826368 instead of -2
I've tried a lot of different things and I've been stuck on this for days. I can't figure out how to recover the original value of the Python integer from the binary marshal string using Javascript/Typescript. Also I think I deeply misunderstand how bits work. Any thoughts would be appreciated here.
Some more specific questions might be:
Why would buffer.readInt32LE work for positive ints but not negative?
Am I using the correct method to get the 'most significant' or 'lowest' 32 bits (i.e. does & 0xffff0000 work the way I think it does?)
Separate but related: in an actual 'long' number (i.e. longer than '-2'), I think there is a sign bit and a magnitude, and I think this information is stored in the 'highest' 2 bits of the number (i.e. at number & 0x000000ff?) -- is this the correct way of thinking about this?
The sequence ef bf bd is the UTF-8 sequence for the "Unicode replacement character", which Unicode encoders use to represent invalid encodings.
It sounds like whatever method you're using to download the data is getting accidentally run through a UTF-8 decoder and corrupting the raw datastream. Be sure you're using blob instead of text, or whatever the equivalent is for the way you're downloading the bytecode.
This got messed up only for negative values because the bytes of small positive values all fall below 0x80, inside the single-byte range of UTF-8, and thus get translated 1:1 from the original byte stream; the 0xfe and 0xff bytes of a negative value are not valid UTF-8 and get replaced.
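As a hedged sketch of that advice (assuming the bytecode is fetched over HTTP; the URL and the readMarshalInt name are placeholders, so swap in however you actually load the data), keep the download binary and read the marshal int through a little-endian signed 32-bit view:
async function readMarshalInt(url) {
  const resp = await fetch(url);
  const bytes = new Uint8Array(await resp.arrayBuffer()); // binary, never .text()
  const view = new DataView(bytes.buffer);
  if (bytes[0] === 0x69) {                 // 'i', marshal's int type code
    return view.getInt32(1, true);         // little-endian, signed: fe ff ff ff gives -2
  }
  throw new Error('not a marshal int');
}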

Encoding strings to small sizes for QRCode generation

I'm generating QR codes from strings that could very easily be longer than a QR code can handle. I'm looking for suggestions on algorithms to encode these strings as compactly as possible, or a proof that a string cannot be shrunk any further.
Since I'm encoding a series of items, I can represent them using IDs and delimit them using pipes, as in the following lookup table:
function encodeLookUp(character) {
  switch (character) {
    case '0': return '0000';
    case '1': return '0001';
    case '2': return '0010';
    case '3': return '0011';
    case '4': return '0100';
    case '5': return '0101';
    case '6': return '0110';
    case '7': return '0111';
    case '8': return '1000';
    case '9': return '1001';
    case '|': return '1010';
    case ':': return '1011';
  }
  return false;
}
Using this table I am effectively doing a base-16 encoding, so each ASCII character from the original string becomes half a byte (4 bits) in the new string, halving the length.
Starting String: 01251548|4654654:4465464 // ID1 | ID2 : ID3 demonstrates both pipes.
Bit String: 000000010010010100010101010010001010010001100101010001100101010010110100010001100101010001100100
Result String: %H¤eFT´FTd // Half the length of the starting string.
Then this new ASCII string is translated according to the QR code specification.
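For illustration, a small sketch of the nibble packing described above (packToBytes is my own helper name, built on the encodeLookUp table from the question):
function packToBytes(s) {
  // Concatenate the 4-bit codes, then cut the bit string into bytes.
  const bits = Array.from(s, ch => encodeLookUp(ch)).join('');
  const bytes = new Uint8Array(Math.ceil(bits.length / 8));
  for (let i = 0; i < bits.length; i += 8) {
    bytes[i / 8] = parseInt(bits.slice(i, i + 8).padEnd(8, '0'), 2);
  }
  return bytes;
}
// packToBytes('01251548|4654654:4465464').length === 12, half of the 24-character input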
EDIT: The maximum number of characters currently encodable: 384
CLARIFICATION: Both the numeric length of the IDs and the quantity of IDs or pipes are variable, with a tendency towards one. I am looking to reduce this algorithm so that, on average, it produces the fewest characters by the time it's a 'result string'.
NOTE: The result string is only an ASCII representation of the binary string I've encoded the data into, to conform with standard QR code specifications and readers.
If you have relatively non-random data, a Huffman encoding might be a good solution.
Using the function, you're going to lose a lot of space (since 4 bits are way too much storage for 12 combinations).
I'd start by looking at the maximum length possible for your IDs and find a suitable storage block.
If you are storing these items serially in a fixed count (say, 4 IDs), you would need id_length*id_count at most, and you won't need to use any separators.
Edit: Again, depending on the number of IDs you want to write and their expected maximum length, there may be different types of encodings to compress it down. RLE (run-length encoding) comes to mind.
QR codes support a binary mode, and that's going to be the most efficient way for you to store your IDs. Either:
Pick a length (in bytes) that is sufficient to store all your IDs, and encode the QR-code as a series of fixed-length integers. 4 bytes (32 bits) is a standard choice that ought to cover the likely range, or
If you want to be able to encode a wide range of IDs, but expect most of the values to be small, use a variable-length encoding scheme. One example is to use the lowest 7 bits of each byte to store the integer, and the most significant bit to indicate whether there are any further bytes (a sketch of this follows below).
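A minimal sketch of that variable-length (varint-style) scheme; encodeVarint/decodeVarint are illustrative names, and the use of JavaScript's 32-bit bitwise operators limits it to IDs below 2^31:
function encodeVarint(n) {
  const bytes = [];
  do {
    let b = n & 0x7f;        // lowest 7 bits
    n >>>= 7;
    if (n > 0) b |= 0x80;    // set the continuation bit if more bytes follow
    bytes.push(b);
  } while (n > 0);
  return Uint8Array.from(bytes);
}
function decodeVarint(bytes, offset = 0) {
  let value = 0, shift = 0, i = offset;
  for (;;) {
    const b = bytes[i++];
    value |= (b & 0x7f) << shift;
    if ((b & 0x80) === 0) break;  // continuation bit clear: last byte of this integer
    shift += 7;
  }
  return { value, nextOffset: i };
}
// Round trip: decodeVarint(encodeVarint(4654654)).value === 4654654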
Also note that QR codes can be a lot larger than 384 characters!
Edit: From your original question, though, it looks like you're encoding more than just a series of integers - you have at least two different types of delimiters. Where can they appear and in what circumstances? The encoding format is going to depend on those parameters.
QR codes already have special encoding modes that are optimized for digits, or just alphanumeric data. It would probably be easier to take advantage of these rather than invent a scheme.
If you're going to do something custom, I think you'll find it hard to beat something like gzip compression. Just gzip the bytes, encode the bytes in byte mode, and decompress on the other end.
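A hedged sketch of that suggestion using Node's built-in zlib (the QR step itself is left out; any library with a byte/binary mode can carry the compressed bytes). Note that for payloads as short as the example, the gzip header overhead can outweigh the savings, so it's worth measuring on realistic data:
const zlib = require('zlib');
const payload = '01251548|4654654:4465464';
const compressed = zlib.gzipSync(Buffer.from(payload, 'ascii'));
// ...store `compressed` in the QR code using byte mode...
// On the reading side:
const restored = zlib.gunzipSync(compressed).toString('ascii');
console.log(restored === payload); // true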
As a start of an answer to my own question:
If I start with a string of numbers
I can parse that string for patterns and hold those patterns in special symbols that are able to take up the other 4 slots available in my Huffman tree.
EDIT: Example: starting string 12222345, ending string 12x345, where x is a symbol that means 'repeat the last symbol 3 more times'.

JavaScript-friendly binary-safe data format design (not JSON or XML)

First and foremost: JSON and XML are not an option in this specific case, please don't suggest them. If this makes it easier to accept that fact, imagine that I intend to reinvent the wheel for self-education.
Back to the point:
I need to design a binary-safe data format to encode some datagrams I send to a particular dumb server that I write (in C if that matters).
To simplify the question, let's say that I'm sending only numbers, strings and arrays.
Important fact: Server does not (and should not) know anything about Unicode and stuff. It treats all strings as binary blobs (and never looks inside them).
The format that I originally devised is as follows:
Datagram: <Number:size>\n<Value1>...<ValueN>
Value:
Number: N\n<Value>\n
String: S\n<Number:size-in-bytes>\n<bytes>\n
Array: A\n<Number:size>\n<Value0>...<ValueN>
Example:
[ 1, "foo", [] ]
Serializes as follows:
1 ; number of items in datagram
A ; -- array --
3 ; number of items in array
N ; -- number --
1 ; number value
S ; -- string --
3 ; string size in bytes
foo ; string bytes
A ; -- array --
0 ; number of items in array
The problem is that I cannot reliably get a string's size in bytes in JavaScript.
So, the question is: how to change the format, so a string can be both saved in JS and loaded in C neatly.
I do not want to add Unicode support to the server.
And I do not quite want to decode strings on server (say, from base64 or simply to unescape \xNN sequences) — this would require work with dynamic string buffers, which, given how dumb the server is, is not so desirable...
Any clues?
It seems that reading UTF-8 in plain C is not that scary after all, so I'm extending the protocol to handle UTF-8 strings natively. (But I will still appreciate an answer to this question as it stands.)
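With UTF-8 strings on the wire, a minimal sketch of getting the byte size on the JS side (assuming an environment with TextEncoder, which always produces UTF-8):
const encoder = new TextEncoder();
const bytes = encoder.encode('héllo');  // Uint8Array of UTF-8 bytes
console.log('héllo'.length);            // 5 (UTF-16 code units)
console.log(bytes.length);              // 6 (bytes), which is the <size-in-bytes> the format needs
// so a string Value becomes: 'S\n' + bytes.length + '\n' + the raw bytes + '\n'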
