Reassembling negative Python marshal int's into Javascript numbers - javascript

I'm writing a client-side Python bytecode interpreter in Javascript (specifically Typescript) for a class project. Parsing the bytecode was going fine until I tried out a negative number.
In Python, marshal.dumps(2) gives 'i\x02\x00\x00\x00' and marshal.dumps(-2) gives 'i\xfe\xff\xff\xff'. This makes sense as Python represents integers using two's complement with at least 32 bits of precision.
In my Typescript code, I use the equivalent of Node.js's Buffer class (via a library called BrowserFS, instead of ArrayBuffers and etc.) to read the data. When I see the character 'i' (i.e. buffer.readUInt8(offset) == 105, signalling that the next thing is an int), I then call readInt32LE on the next offset to read a little-endian signed long (4 bytes). This works fine for positive numbers but not for negative numbers: for 1 I get '1', but for '-1' I get something like '-272777233'.
I guess that Javascript represents numbers in 64-bit (floating point?). So, it seems like the following should work:
var longval = buffer.readInt32LE(offset); // reads a 4-byte long, gives -272777233
var low32Bits = longval & 0xffff0000; //take the little endian 'most significant' 32 bits
var newval = ~low32Bits + 1; //invert the bits and add 1 to negate the original value
//but now newval = 272826368 instead of -2
I've tried a lot of different things and I've been stuck on this for days. I can't figure out how to recover the original value of the Python integer from the binary marshal string using Javascript/Typescript. Also I think I deeply misunderstand how bits work. Any thoughts would be appreciated here.
Some more specific questions might be:
Why would buffer.readInt32LE work for positive ints but not negative?
Am I using the correct method to get the 'most significant' or 'lowest' 32 bits (i.e. does & 0xffff0000 work how I think it does?)
Separate but related: in an actual 'long' number (i.e. longer than '-2'), I think there is a sign bit and a magnitude, and I think this information is stored in the 'highest' 2 bits of the number (i.e. at number & 0x000000ff?) -- is this the correct way of thinking about this?

The sequence ef bf bd is the UTF-8 sequence for the "Unicode replacement character", which Unicode encoders use to represent invalid encodings.
It sounds like whatever method you're using to download the data is getting accidentally run through a UTF-8 decoder and corrupting the raw datastream. Be sure you're using blob instead of text, or whatever the equivalent is for the way you're downloading the bytecode.
This got messed up only for negative values because positive values are within the normal mapping space of UTF-8 and thus get translated 1:1 from the original byte stream.

Related

Implementing extendible hash table in javascript: how to use binary number as index

I'm studying data structures and trying to implement extendible hashing from scratch in Javascript and I'm confused. Here is an example I'm using as reference hash table with binary labels
Example: to store "john":35 in a table of size: 8 indexes / depth 3 (last 3 digits of binary hash)
"john" gets converted to a hash, example: 13,
13 is converted to a binary: 1101
find which index of the table 1101 belongs to, by looking at the last 3 digits "101"
This is where I'm stuck. Am I suppose to convert 101 back to decimal form (which would be 5), to then access the index by doing array[5]? Is there a way to label the array indexes in binary format like array[101] (but then wouldn't it be better to use an object?)? This seems like a lot of unnecessary extra steps to avoid just using modulo (13%8), am I missing something? Is this implementation useful in not-javascript language?
First post - thanks in advance!
Internally, all data in the computer is stored in binary, so you can't "convert" from decimal to binary since everything is already binary (it's just shown to use as decimal). If you want to print out a number as binary for debugging purposes, you can do:
console.log((5).toString(2)); // will print "101"
The .toString(2) method converts the number to a string with the binary representation of the number.
You can also write numbers in binary by starting it with 0b:
let x = 0b1101; // == 13
If you want to get the last few binary digits of a number, use the modulo operator to 2 to the power of the number of digits you want:
(0b1101 % (2**3)).toString(2) // "101"
With the table selected, you probably want to use the rest of the number that you haven't used already as the index in the table. We can use the bitshift operator, >>, to do this:
(0b1101 >> 3).toString(2) // "1", right three bits cut off
With a longer number:
// Note that underscores don't mean anything, they are just used for spacing
(0b1101_1101 >> 3).toString(2) // "11011" you can see that the right three bits have been cut off
Keep in mind that you probably shouldn't be using .toString(2) to actually store anything in the table; it should only be used for debugging.

Buffer to integer. Having trouble understanding this line of code

I'm looking for help in understanding this line of code in the npm moudle hash-index.
The purpose of this module is to be a function which returns the sha-1 hash of an input mod by the second argument you pass.
The specific function in this module that I don't understand is this one that takes a Buffer as input and returns an integer:
var toNumber = function (buf) {
return buf.readUInt16BE(0) * 0xffffffff + buf.readUInt32BE(2)
}
I can't seem to figure out why those specific offsets of the buffer are chosen and what the purpose of multiplying by 0xffffffff is.
This module is really interesting to me and any help in understanding how it's converting buffers to integers would be greatly appreciated!
It prints the first UINT32 (Unsigned Integer 32 bits) in the buffer.
First, it reads the first two bytes (UINT16) of the buffer, using Big Endian, then, it multiplies it by 0xFFFFFFFF.
Then, it reads the second four bytes (UINT32) in the buffer, and adds it to the multiplied number - resulting in a number constructed from the first 6 bytes of the buffer.
Example: Consider [Buffer BB AA CC CC DD ... ]
0xbb * 0xffffffff = 0xbaffffff45
0xbaffffff45 + 0xaaccccdd = 0xbbaacccc22
And regarding the offsets, it chose that way:
First time, it reads from byte 0 to byte 1 (coverts to type - UINT16)
second time, it reads from byte 2 to byte 5 (converts to type - UINT32)
So to sum it up, it constructs a number from the first 6 bytes of the buffer using big endian notation, and returns it to the calling function.
Hope that's answers your question.
Wikipedia's Big Endian entry
EDIT
As someone pointed in the comments, I was totally wrong about 0xFFFFFFFF being a left-shift of 32, it's just a number multiplication - I'm assuming it's some kind of inner protocol to calculate a correct legal buffer header that complies with what they expect.
EDIT 2
After looking on the function in the original context, I've come to this conclusion:
This function is a part of a hashing flow, and it works in that manner:
Main flow receives a string input and a maximum number for the hash output, it then takes the string input, plugs it in the SHA-1 hashing function.
SHA-1 hashing returns a Buffer, it takes that Buffer, and applies the hash-indexing on it, as can be seen in the following code excerpt:
return toNumber(crypto.createHash('sha1').update(input).digest()) % max
Also, it uses a modulu to make sure the hash index returned doesn't exceed the maximum possible hash.
Multiplication by 2 is equivalent to a shift of bits to the left by 1, so the purpose of multiplying by 2^16 is the equivalent of shifting the bits left 16 times.
Here is a similar question already answered:
Bitwise Logic in C

Working with string (array?) of bits of an unspecified length

I'm a javascript code monkey, so this is virgin territory for me.
I have two "strings" that are just zeros and ones:
var first = "00110101011101010010101110100101010101010101010";
var second = "11001010100010101101010001011010101010101010101";
I want to perform a bitwise & (which I've never before worked with) to determine if there's any index where 1 appears in both strings.
These could potentially be VERY long strings (in the thousands of characters). I thought about adding them together as numbers, then converting to strings and checking for a 2, but javascript can't hold precision in large intervals and I get back numbers as strings like "1.1111111118215729e+95", which doesn't really do me much good.
Can I take two strings of unspecified length (they may not be the same length either) and somehow use a bitwise & to compare them?
I've already built the loop-through-each-character solution, but 1001^0110 would strike me as a major performance upgrade. Please do not give the javascript looping solution as an answer, this question is about using bitwise operators.
As you already noticed yourself, javascript has limited capabilities if it's about integer values. You'll have to chop your strings into "edible" portions and work your way through them. Since the parseInt() function accepts a base, you could convert 64 characters to an 8 byte int (or 32 to a 4 byte int) and use an and-operator to test for set bits (if (a & b != 0))
var first = "00110101011101010010101110100101010101010101010010001001010001010100011111",
second = "10110101011101010010101110100101010101010101010010001001010001010100011100",
firstInt = parseInt(first, 2),
secondInt = parseInt(second, 2),
xorResult = firstInt ^ secondInt, //524288
xorString = xorResult.toString(2); //"10000000000000000000"

Any way to reliably compress a short string?

I have a string exactly 53 characters long that contains a limited set of possible characters.
[A-Za-z0-9\.\-~_+]{53}
I need to reduce this to length 50 without loss of information and using the same set of characters.
I think it should be possible to compress most strings down to 50 length, but is it possible for all possible length 53 strings? We know that in the worst case 14 characters from the possible set will be unused. Can we use this information at all?
Thanks for reading.
If, as you stated, your output strings have to use the same set of characters as the input string, and if you don't know anything special about the requirements of the input string, then no, it's not possible to compress every possible 53-character string down to 50 characters. This is a simple application of the pigeonhole principle.
Your input strings can be represented as a 53-digit number in base 67, i.e., an integer from 0 to 6753 - 1 ≅ 6*1096.
You want to map those numbers to an integer from 0 to 6750 - 1 ≅ 2*1091.
So by the pigeonhole principle, you're guaranteed that 673 = 300,763 different inputs will map to each possible output -- which means that, when you go to decompress, you have no way to know which of those 300,763 originals you're supposed to map back to.
To make this work, you have to change your requirements. You could use a larger set of characters to encode the output (you could get it down to 50 characters if each one had 87 possible values, instead of the 67 in the input). Or you could identify redundancy in the input -- perhaps the first character can only be a '3' or a '5', the nineteenth and twentieth are a state abbreviation that can only have 62 different possible values, that sort of thing.
If you can't do either of those things, you'll have to use a compression algorithm, like Huffman coding, and accept the fact that some strings will be compressible (and get shorter) and others will not (and will get longer).
What you ask is not possible in the most general case, which can be proven very simply.
Say it was possible to encode an arbitrary 53 character string to 50 chars in the same set. Do that, then add three random characters to the encoded string. Then you have another arbitrary, 53 character string. How do you compress that?
So what you want can not be guaranteed to work for any possible data. However, it is possible that all your real data has low enough entropy that you can devise a scheme that will work.
In that case, you will probably want to do some variant of Huffman coding, which basically allocates variable-bit-length encodings for the characters in your set, using the shortest encodings for the most commonly used characters. You can analyze all your data to come up with a set of encodings. After Huffman coding, your string will be a (hopefully shorter) bitstream, which you encode to your character set at 6 bits per character. It may be short enough for all your real data.
A library-based encoding like Smaz (referenced in another answer) may work as well. Again, it is impossible to guarantee that it will work for all possible data.
One byte (character) can encode 256 values (0-255) but your set of valid characters uses only 67 values, which can be represented in 7 bits (alas, 6 bits gets you only 64) and none of your characters uses the high bit of the byte.
Given that, you can throw away the high bit and store only 7 bits, running the initial bits of the next character into the "spare" space of the first character. This would require only 47 bytes of space to store. (53 x 7 = 371 bits, 371 / 8 = 46.4 == 47)
This is not really considered compression, but rather a change in encoding.
For example "ABC" is 0x41 0x42 0x43
0x41 0x42 0x43 // hex values
0100 0001 0100 0010 0100 0011 // binary
100 0001 100 0010 100 0011 // drop high bit
// run it all together
100000110000101000011
// split as 8 bits (and pad to 8)
10000011 00001010 00011[000]
0x83 0x0A 0x18
As an example these 3 characters won't save any space, but your 53 characters will always come out as 47, guaranteed.
Note, however, that the output will not be in your original character set, if that is important to you.
The process becomes:
original-text --> encode --> store output-text (in database?)
retrieve --> decode --> original-text restored
If I remember correctly Huffman coding is going to be the most compact way to store the data. It has been too long since I used it to write the algorithm quickly, but the general idea is covered here, but if I remember correctly what you do is:
get the count for each character that is used
prioritize them based on how frequently they occurred
build a tree based off the prioritization
get the compressed bit representation of each character by traversing the tree (start at the root, left = 0 right = 1)
replace each character with the bits from the tree
Smaz is a simple compression library suitable for compressing very short strings.

JSON transfer of bigint: 12000000000002539 is converted to 12000000000002540?

I'm transferring raw data like [{id: 12000000000002539, Name: "Some Name"}] and I'm getting the object [{id: 12000000000002540, Name: "Some Name"}] after parsing, for now server side converting id into string seems to help.
But is there a better way to transfer bigint data correctly?
The value is actually not exceeding the maximum numeric value in JavaScript (which is "only" 1.7308 or so).
However, the value is exceeding the range of "integral precision". It is not that the wrong number is sent: rather, it is that the literal 12000000000002539 can only be represented as precisely as 12000000000002540, and thus there was never the correct numeric value in JavaScript. (The range of integrals is about +/- 253.)
This is an interesting phenomena of using a double relative-precision (binary64 in IEEE-754 speak) type to store all numeric values, including integers:
12000000000002539 === 12000000000002540 // true
The maximum significant number of decimal digits that be precisely stored as a numeric value is 15 (15.95, really). In the above, there are 17 significant digits, so some of the least-significant information is silently lost. In this case, as the JavaScript parser/engine reads in the literal value.
The only safe way to handle integral numbers of this magnitude in JavaScript is to use a string literal or to break it down in another fashion (e.g. a custom numeric type or a "bigint library"). However, I recommend just using a string, as it is human readable, relatively compact (only two extra characters in JSON), and doesn't require special serialization. Since the value is just an "id" in this case, I hope that math does not need to be performed upon it :)
Happy coding.

Categories

Resources