Any way to reliably compress a short string?

Any way to reliably compress a short string? - javascript

I have a string exactly 53 characters long that contains a limited set of possible characters.
[A-Za-z0-9\.\-~_+]{53}
I need to reduce this to length 50 without loss of information and using the same set of characters.
I think it should be possible to compress most strings down to 50 length, but is it possible for all possible length 53 strings? We know that in the worst case 14 characters from the possible set will be unused. Can we use this information at all?
Thanks for reading.

If, as you stated, your output strings have to use the same set of characters as the input string, and if you don't know anything special about the requirements of the input string, then no, it's not possible to compress every possible 53-character string down to 50 characters. This is a simple application of the pigeonhole principle.
Your input strings can be represented as a 53-digit number in base 67, i.e., an integer from 0 to 6753 - 1 ≅ 6*1096.
You want to map those numbers to an integer from 0 to 6750 - 1 ≅ 2*1091.
So by the pigeonhole principle, you're guaranteed that 673 = 300,763 different inputs will map to each possible output -- which means that, when you go to decompress, you have no way to know which of those 300,763 originals you're supposed to map back to.
To make this work, you have to change your requirements. You could use a larger set of characters to encode the output (you could get it down to 50 characters if each one had 87 possible values, instead of the 67 in the input). Or you could identify redundancy in the input -- perhaps the first character can only be a '3' or a '5', the nineteenth and twentieth are a state abbreviation that can only have 62 different possible values, that sort of thing.
If you can't do either of those things, you'll have to use a compression algorithm, like Huffman coding, and accept the fact that some strings will be compressible (and get shorter) and others will not (and will get longer).

What you ask is not possible in the most general case, which can be proven very simply.
Say it was possible to encode an arbitrary 53 character string to 50 chars in the same set. Do that, then add three random characters to the encoded string. Then you have another arbitrary, 53 character string. How do you compress that?
So what you want can not be guaranteed to work for any possible data. However, it is possible that all your real data has low enough entropy that you can devise a scheme that will work.
In that case, you will probably want to do some variant of Huffman coding, which basically allocates variable-bit-length encodings for the characters in your set, using the shortest encodings for the most commonly used characters. You can analyze all your data to come up with a set of encodings. After Huffman coding, your string will be a (hopefully shorter) bitstream, which you encode to your character set at 6 bits per character. It may be short enough for all your real data.
A library-based encoding like Smaz (referenced in another answer) may work as well. Again, it is impossible to guarantee that it will work for all possible data.

One byte (character) can encode 256 values (0-255) but your set of valid characters uses only 67 values, which can be represented in 7 bits (alas, 6 bits gets you only 64) and none of your characters uses the high bit of the byte.
Given that, you can throw away the high bit and store only 7 bits, running the initial bits of the next character into the "spare" space of the first character. This would require only 47 bytes of space to store. (53 x 7 = 371 bits, 371 / 8 = 46.4 == 47)
This is not really considered compression, but rather a change in encoding.
For example "ABC" is 0x41 0x42 0x43
0x41 0x42 0x43 // hex values
0100 0001 0100 0010 0100 0011 // binary
100 0001 100 0010 100 0011 // drop high bit
// run it all together
100000110000101000011
// split as 8 bits (and pad to 8)
10000011 00001010 00011[000]
0x83 0x0A 0x18
As an example these 3 characters won't save any space, but your 53 characters will always come out as 47, guaranteed.
Note, however, that the output will not be in your original character set, if that is important to you.
The process becomes:
original-text --> encode --> store output-text (in database?)
retrieve --> decode --> original-text restored

If I remember correctly Huffman coding is going to be the most compact way to store the data. It has been too long since I used it to write the algorithm quickly, but the general idea is covered here, but if I remember correctly what you do is:
get the count for each character that is used
prioritize them based on how frequently they occurred
build a tree based off the prioritization
get the compressed bit representation of each character by traversing the tree (start at the root, left = 0 right = 1)
replace each character with the bits from the tree

Smaz is a simple compression library suitable for compressing very short strings.

Related

Implementing extendible hash table in javascript: how to use binary number as index

I'm studying data structures and trying to implement extendible hashing from scratch in Javascript and I'm confused. Here is an example I'm using as reference hash table with binary labels
Example: to store "john":35 in a table of size: 8 indexes / depth 3 (last 3 digits of binary hash)
"john" gets converted to a hash, example: 13,
13 is converted to a binary: 1101
find which index of the table 1101 belongs to, by looking at the last 3 digits "101"
This is where I'm stuck. Am I suppose to convert 101 back to decimal form (which would be 5), to then access the index by doing array[5]? Is there a way to label the array indexes in binary format like array[101] (but then wouldn't it be better to use an object?)? This seems like a lot of unnecessary extra steps to avoid just using modulo (13%8), am I missing something? Is this implementation useful in not-javascript language?
First post - thanks in advance!

Internally, all data in the computer is stored in binary, so you can't "convert" from decimal to binary since everything is already binary (it's just shown to use as decimal). If you want to print out a number as binary for debugging purposes, you can do:
console.log((5).toString(2)); // will print "101"
The .toString(2) method converts the number to a string with the binary representation of the number.
You can also write numbers in binary by starting it with 0b:
let x = 0b1101; // == 13
If you want to get the last few binary digits of a number, use the modulo operator to 2 to the power of the number of digits you want:
(0b1101 % (2**3)).toString(2) // "101"
With the table selected, you probably want to use the rest of the number that you haven't used already as the index in the table. We can use the bitshift operator, >>, to do this:
(0b1101 >> 3).toString(2) // "1", right three bits cut off
With a longer number:
// Note that underscores don't mean anything, they are just used for spacing
(0b1101_1101 >> 3).toString(2) // "11011" you can see that the right three bits have been cut off
Keep in mind that you probably shouldn't be using .toString(2) to actually store anything in the table; it should only be used for debugging.

Reassembling negative Python marshal int's into Javascript numbers

I'm writing a client-side Python bytecode interpreter in Javascript (specifically Typescript) for a class project. Parsing the bytecode was going fine until I tried out a negative number.
In Python, marshal.dumps(2) gives 'i\x02\x00\x00\x00' and marshal.dumps(-2) gives 'i\xfe\xff\xff\xff'. This makes sense as Python represents integers using two's complement with at least 32 bits of precision.
In my Typescript code, I use the equivalent of Node.js's Buffer class (via a library called BrowserFS, instead of ArrayBuffers and etc.) to read the data. When I see the character 'i' (i.e. buffer.readUInt8(offset) == 105, signalling that the next thing is an int), I then call readInt32LE on the next offset to read a little-endian signed long (4 bytes). This works fine for positive numbers but not for negative numbers: for 1 I get '1', but for '-1' I get something like '-272777233'.
I guess that Javascript represents numbers in 64-bit (floating point?). So, it seems like the following should work:
var longval = buffer.readInt32LE(offset); // reads a 4-byte long, gives -272777233
var low32Bits = longval & 0xffff0000; //take the little endian 'most significant' 32 bits
var newval = ~low32Bits + 1; //invert the bits and add 1 to negate the original value
//but now newval = 272826368 instead of -2
I've tried a lot of different things and I've been stuck on this for days. I can't figure out how to recover the original value of the Python integer from the binary marshal string using Javascript/Typescript. Also I think I deeply misunderstand how bits work. Any thoughts would be appreciated here.
Some more specific questions might be:
Why would buffer.readInt32LE work for positive ints but not negative?
Am I using the correct method to get the 'most significant' or 'lowest' 32 bits (i.e. does & 0xffff0000 work how I think it does?)
Separate but related: in an actual 'long' number (i.e. longer than '-2'), I think there is a sign bit and a magnitude, and I think this information is stored in the 'highest' 2 bits of the number (i.e. at number & 0x000000ff?) -- is this the correct way of thinking about this?

The sequence ef bf bd is the UTF-8 sequence for the "Unicode replacement character", which Unicode encoders use to represent invalid encodings.
It sounds like whatever method you're using to download the data is getting accidentally run through a UTF-8 decoder and corrupting the raw datastream. Be sure you're using blob instead of text, or whatever the equivalent is for the way you're downloading the bytecode.
This got messed up only for negative values because positive values are within the normal mapping space of UTF-8 and thus get translated 1:1 from the original byte stream.

How many characters 2 bytes can hold? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 8 years ago.
Improve this question
If I say 2 bytes can hold 510 characters of data will I be wrong? based on the fact that 1 byte can hold 0-255 max character

One byte is :
8 bits, each one can be either 0 or 1
something that can represent 256 distinct values
Two bytes are ..two bytes.
16 bits
something that can represent 65536 distinct values
There is no meaning in what a byte (or two bytes) is, if you don't know the encoding used, what each single one of the 256 (or 65536) values are supposed to be/mean.
If you're talking about Char, you can't either say it's one or two or fifty chars...
ASCII encoding holds 128 distinct characters (95 chars can be displayed while the remaining are control chars) ranging from code 0 to 127 (Byte value expressed in decimal literal)
Unicode encoding (v7) is a generic encoding. You have the UTF-8, the UTF-16 Little Endian or Big Endian, and the UTF-32 Little Endian or Big Endian.
UTF-8 requires either 1, 2, 3 or 4 bytes to represent one single character.
UTF-16 is a fixed-size character encoding : each char requires 2 bytes.
UTF-32 is also a fixed size character encoding that requires 4 bytes per character.
There are hundreds of different Encodings that can represent one character for each of the 256 unique values a single Byte can represent. Like ANSI.
So I tend to say, yes, you're wrong thinking two bytes can hold 510 characters of data, assuming you're using one of the above encoding or similar.
But again, a Byte is a Byte, not a Char !
Let's imagine a (new) custom encoding with specific parser and formatter where each bit [0 or 1] define the selection of one word/text/string stored in a dictionary, and following words/text/string selection depends on the previous selected word (previous bit value)
The purpose of such type of encoding is somewhat useless, but hey ! Because you used a dictionary, you can affirm one single byte can represent exactly 510 characters of data (or even more) because of the use of this specific encoding/decoding..!
Again, a byte is a byte, saying it holds one, two, zero or 510 characters doesn't mean anything if you don't define first what is the encoding used.
EDIT !
And while it's out of the scope of the question, compression is even more evil - and generally uses dictionary ;) - But compression are only effective from a certain amount of bytes....

A character is a graphical representation of a concept and may occupy an arbitrary number of bytes. For example, character "S" (capital letter 'S') occupies 1 byte whereas character 💋 (kissing lips) occupies 3 bytes.

I think your answer is wrong. byte is 1 character. a character in binary is a series of 8 on or offs or 0 or 1s. one of those is a bit and 8 bits make a byte so 1 byte is one character.so 2 bytes hold two characters.

It depends on the format of the string. 1 byte per character in ASCII and 2 bytes per character in Unicode. so 2 byte can hold only single Unicode character or 2 ASCII character.
The following code will explain my answer
MsgBox(System.Text.ASCIIEncoding.Unicode.GetByteCount("h")) '<--- displays 2
MsgBox(System.Text.ASCIIEncoding.ASCII.GetByteCount("h")) '<--- displays 1

Encoding strings to small sizes for QRCode generation

I'm generating QR codes using strings that could very easily be longer in length then a QRCode could handle. I'm looking for suggestions on algorithms to encode these strings as small as possible, or a proof that the string cannot be shrunk any further.
Since I'm encoding a series of items, I can represent them using ID's and delineate them using pipes as in the following lookup table:
function encodeLookUp(character){
switch(character){
case '0': return '0000';
case '1': return '0001';
case '2': return '0010';
case '3': return '0011';
case '4': return '0100';
case '5': return '0101';
case '6': return '0110';
case '7': return '0111';
case '8': return '1000';
case '9': return '1001';
case '|': return '1010';
case ':': return '1011';
}
return false;
}
Using this table I am already doing a base 16 encoding, therefore each 32 ascii character from the original string becomes half a character in the new string (effectively halving the length).
Starting String: 01251548|4654654:4465464 // ID1 | ID2 : ID3 demonstrates both pipes.
Bit String: 000000010010010100010101010010001010010001100101010001100101010010110100010001100101010001100100
Result String: %H¤eFT´FTd // Half the length of the starting string.
Then this new ascii code, is translated according to QRCode specification.
EDIT: The most amount of characters currently encodable: 384
CLARIFICATION: Both ID numberic length, and the quantity of ID's or pipes is variable with a tendancy towards one. I am looking to be able to reduce this algorithm to contain on average the least amount of characters by the time its a 'result string'.
NOTE: The result string is only an ascii represenetaion of the binary string i've encoded with the data to conform with standard QRCode specifications and readers.

If you have relatively non-random data, a Huffman encoding might be a good solution.

Using the function, you're going to loose a lot of space (since 4 bits are way too much storage for 12 combinations).
I'd start by looking at the maximum length possible for your IDs and find a suitable storage block.
If you are storing these items serially in a fixed count (say, 4 ids). You would need id_length*id_count at most, and you won't need to use any separators.
Edit: Again according to the number of IDs you want to write and their expected maximum length, there may be different types of encodings to compress it done. RLE (run length encoding) came to my mind.

QR codes support a binary mode, and that's going to be the most efficient way for you to store your IDs. Either:
Pick a length (in bytes) that is sufficient to store all your IDs, and encode the QR-code as a series of fixed-length integers. 4 bytes (32 bits) is a standard choice that ought to cover the likely range, or
If you want to be able to encode a wide range of IDs, but expect most of the values to be small, use a variable-length encoding scheme. One example is to use the lowest 7 bits of each byte to store the integer, and the most significant bit to indicate if there are any further bytes.
Also note that QR codes can be a lot larger than 384 characters!
Edit: From your original question, though, it looks like you're encoding more than just a series of integers - you have at least two different types of delimiters. Where can they appear and in what circumstances? The encoding format is going to depend on those parameters.

QR codes already have special encoding modes that are optimized for digits, or just alphanumeric data. It would probably be easier to take advantage of these rather than invent a scheme.
If you're going to do something custom, I think you'll find it hard to beat something like gzip compression. Just gzip the bytes, encode the bytes in byte mode, and decompress on the other end.

As a start of an answer to my own question:
If I start with a string of numbers
I can parse that string for patterns and hold those patters in special symbols that are able to take up the other 4 spaces available in my Huffman tree.
EDIT: Example: staring string 12222345, ending string 12x345. Where x is a symbol that means 'repeat the last symbol 3 more times'

Hash 32bit int to 16bit int?

What are some simple ways to hash a 32-bit integer (e.g. IP address, e.g. Unix time_t, etc.) down to a 16-bit integer?
E.g. hash_32b_to_16b(0x12345678) might return 0xABCD.
Let's start with this as a horrible but functional example solution:
function hash_32b_to_16b(val32b) {
return val32b % 0xffff;
}
Question is specifically about JavaScript, but feel free to add any language-neutral solutions, preferably without using library functions.
The context for this question is generating unique IDs (e.g. a 64-bit ID might be composed of several 16-bit hashes of various 32-bit values). Avoiding collisions is important.
Simple = good. Wacky+obfuscated = amusing.

The key to maximizing the preservation of entropy of some original 32-bit 'signal' is to ensure that each of the 32 input bits has an independent and equal ability to alter the value of the 16-bit output word.
Since the OP is requesting a bit-size which is exactly half of the original, the simplest way to satisfy this criteria is to xor the upper and lower halves, as others have mentioned. Using xor is optimal because—as is obvious by the definition of xor—independently flipping any one of the 32 input bits is guaranteed to change the value of the 16-bit output.
The problem becomes more interesting when you need further reduction beyond just half-the-size, say from a 32-bit input to, let's say, a 2-bit output. Remember, the goal is to preserve as much entropy from the source as possible, so solutions which involve naively masking off the two lowest bits with (i & 3) are generally heading in the wrong direction; doing that guarantees that there's no way for any bits except the unmasked bits to affect the result, and that generally means there's an arbitrary, possibly valuable part of the runtime signal which is being summarily discarded without principle.
Following from the earlier paragraph, you could of course iterate with xor three additional times to produce a 2-bit output with the desired property of being equally-influenced by each/any of the input bits. That solution is still optimally correct of course, but involves looping or multiple unrolled operations which, as it turns out, aren't necessary!
Fortunately, there is a nice technique of only two operations which gives the same optimal result for this situation. As with xor, it not only ensures that, for any given 32-bit value, twiddling any input bit will result in a change to the 2-bit output, but also that, given a uniform distribution of input values, the distribution of 2-bit output values will also be perfectly uniform. In the current example, the method divides the 4,294,967,296 possible input values into exactly 1,073,741,824 each of the four possible 2-bit hash results { 0, 1, 2, 3 }.
The method I mention here uses specific magic values that I discovered via exhaustive search, and which don't seem to be discussed very much elsewhere on the internet, at least for the particular use under discussion here (i.e., ensuring a uniform hash distribution that's maximally entropy-preserving). Curiously, according to this same exhaustive search, the magic values are in fact unique, meaning that for each of target bit-widths { 16, 8, 4, 2 }, the magic value I show below is the only value that, when used as I show here, satisfies the perfect hashing criteria outlined above.
Without further ado, the unique and mathematically optimal procedure for hashing 32-bits to n = { 16, 8, 4, 2 } is to multiply by the magic value corresponding to n (unsigned, discarding overflow), and then take the n highest bits of the result. To isolate those result bits as a hash value in the range [0 ... (2ⁿ - 1)], simply right-shift (unsigned!) the multiplication result by 32 - n bits.
The "magic" values, and C-like expression syntax are as follows:
Method
Maximum-entropy-preserving hash for reducing 32 bits to. . .
Target Bits Multiplier Right Shift Expression [1, 2]
----------- ------------ ----------- -----------------------
16 0x80008001 16 (i * 0x80008001) >> 16
8 0x80808081 24 (i * 0x80808081) >> 24
4 0x88888889 28 (i * 0x88888889) >> 28
2 0xAAAAAAAB 30 (i * 0xAAAAAAAB) >> 30
Maximum-entropy-preserving hash for reducing 64 bits to. . .
Target Bits Multiplier Right Shift Expression [1, 2]
----------- ------------------ ----------- -------------------------------
32 0x8000000080000001 32 (i * 0x8000000080000001) >> 32
16 0x8000800080008001 48 (i * 0x8000800080008001) >> 48
8 0x8080808080808081 56 (i * 0x8080808080808081) >> 56
4 0x8888888888888889 60 (i * 0x8888888888888889) >> 60
2 0xAAAAAAAAAAAAAAAB 62 (i * 0xAAAAAAAAAAAAAAAB) >> 62
Notes:
Use unsigned multiply and discard any overflow (64-bit multiply is not needed).
If isolating the result using right-shift (as shown), be sure to use an unsigned shift operation.
Further discussion
I find this all this quite cool. In practical terms, the key information-theoretical requirement is the guarantee that, for any m-bit input value and its corresponding n-bit hash value result, flipping any one of the m source bits always causes some change in the n-bit result value. Now although there are 2ⁿ possible result values in total, one of them is already "in-use" (by the result itself) since "switching" to that one from any other result would be no change at all. This leaves 2ⁿ - 1 result values that are eligible to be used by the entire set of m input values flipped by a single bit.
Let's consider an example; in fact, to show how this technique might seem to border on spooky or downright magical, we'll consider the more extreme case where m = 64 and n = 2. With 2 output bits there are four possible result values, { 0, 1, 2, 3 }. Assuming an arbitrary 64-bit input value 0x7521d9318fbdf523, we obtain its 2-bit hash value of 1:
(0x7521d9318fbdf523 * 0xAAAAAAAAAAAAAAAB) >> 62 // result --> '1'
So the result is 1 and the claim is that no value in the set of 64 values where a single-bit of 0x7521d9318fbdf523 is toggled may have that same result value. That is, none of those 64 other results can use value 1 and all must instead use either 0, 2, or 3. So in this example it seems like every one of the 2⁶⁴ input values—to the exclusion of 64 other input values—will selfishly hog one-quarter of the output space for itself. When you consider the sheer magnitude of these interacting constraints, can a simultaneously satisfying solution overall even exist?
Well sure enough, to show that (exactly?) one does, here are the hash result values, listed in order, for inputs that flipping a single bit of 0x7521d9318fbdf523 (one at a time), from MSB (position 63) down to LSB (0).
3 2 0 3 3 3 3 3 3 0 0 0 3 0 3 3 0 3 3 3 0 0 3 3 3 0 0 3 3 0 3 3 // continued…
0 0 3 0 0 3 0 3 0 0 0 3 0 3 3 3 0 3 0 3 3 3 3 3 3 0 0 0 3 0 0 3 // notice: no '1' values
As you can see, there are no 1 values, which entails that every bit in the source "as-is" must be contributing to influence the result (or, if you prefer, the de facto state of each-and-every bit in 0x7521d9318fbdf523 is essential to keeping the entire overall result from being "not-1"). Because no matter what single-bit change you make to the 64-bit input, the 2-bit result value will no longer be 1.
Keep in mind that the "missing-value" table shown above was dumped from the analysis of just the one randomly-chosen example value 0x7521d9318fbdf523; every other possible input value has a similar table of its own, each one eerily missing its owner's actual result value while yet somehow being globally consistent across its set-membership. This property essentially corresponds to maximally preserving the available entropy during the (inherently lossy) bit-width reduction task.
So we see that every one of the 2⁶⁴ possible source values independently imposes, on exactly 64 other source values, the constraint of excluding one of the possible result values. What defies my intuition about this is that there are untold quadrillions of these 64-member sets, each of whose members also belongs to 63 other, seemingly unrelated bit-twiddling sets. Yet somehow despite this most confounding puzzle of interwoven constraints, it is nevertheless trivial to exploit the one (I surmise) resolution which simultaneously satisfies them all exactly.
All this seems related to something you may have noticed in the tables above: namely, I don't see any obvious way to extend the technique to the case of compressing down to a 1-bit result. In this case, there are only two possible result values { 0, 1 }, so if any/every given (e.g.) 64-bit input value still summarily excludes its own result from being the result for all 64 of its single-bit-flip neighbors, then that now essentially imposes the other, only remaining value on those 64. The math breakdown we see in the table seems to be signalling that a simultaneous result under such conditions is a bridge too far.
In other words, the special 'information-preserving' characteristic of xor (that is, its luxuriously reliable guarantee that, as opposed to and, or, etc., it c̲a̲n̲ and w̲i̲l̲l̲ always change a bit) not surprisingly exacts a certain cost, namely, a fiercely non-negotiable demand for a certain amount of elbow room—at least 2 bits—to work with.

I think this is the best you're going to get. You could compress the code to a single line but the var's are there for now as documentation:
function hash_32b_to_16b(val32b) {
var rightBits = val32b & 0xffff; // Left-most 16 bits
var leftBits = val32b & 0xffff0000; // Right-most 16 bits
leftBits = leftBits >>> 16; // Shift the left-most 16 bits to a 16-bit value
return rightBits ^ leftBits; // XOR the left-most and right-most bits
}
Given the parameters of the problem, the best solution would have each 16-bit hash correspond to exactly 2^16 32-bit numbers. It would also IMO hash sequential 32-bit numbers differently. Unless I'm missing something, I believe this solution does those two things.
I would argue that security cannot be a consideration in this problem, as the hashed value is just too few bits. I believe that the solution I gave provides even distribution of 32-bit numbers to 16-bit hashes

This depends on the nature of the integers.
If they can contain some bit-masks, or can differ by powers of two, then simple XORs will have high probability of collisions.
You can try something like (i>>16) ^ ((i&0xffff) * p) with p being a prime number.
Security-hashes like MD5 are all good, but they are obviously an overkill here. Anything more complex than CRC16 is overkill.

I would say just apply a standard hash like sha1 or md5 and then grab the last 16 bits of that.

Assuming that you expect the least significant bits to 'vary' the most, I think you're probably going to get a good enough distribution by just using the lower 16-bits of the value as a hash.
If the numbers you're going to hash won't have that kind of distribution, then the additional step of xor-ing in the upper 16 bits might be helpful.
Of course this suggestion is if you're intending to use the hash merely for some sort of lookup/storage scheme and aren't looking for the crypto-related properties of non-guessability and non-reversability (which the xor-ing suggestions don't really buy you either).

Something simple like this....
function hash_32b_to_16b(val32b) {
var h = hmac(secretKey, sha512);
var v = val32b;
for(var i = 0; i < 4096; ++i)
v = h(v);
return v % 0xffff;
}

Develop Reference

JavaScript is the programming language of the Web.