Compressing numbers in JavaScript to binary format

I need to convert the following to a binary format (and later recoup) in the smallest amount of data possible.
my_arr = [
[128,32 ,22,23],
[104,53 ,21,25],
[150,55 ,79,23],
[104,101,23,8 ],
[57 ,117,13,21],
[37 ,135,21,20],
[81 ,132,23,6 ],
[81 ,138,7 ,8 ],
[97 ,138,7 ,8 ]...
the numbers don't exceed 399
If I use a 0 for each digit (8 0's in a row = 8) and a 1 as separator, the first line looks like this:
010010000000011000100110010011001000
This is really long for numbers like 99
If I pad each number to three digits and convert each in turn to actual binary the first line looks like this:
000100101000000000110010000000100010000000100011
This works out as 12 chars per number.
As the first digit will never be 4 or above, I can save two bits by encoding 0 as 00, 1 as 01, 2 as 10 and 3 as 11. Hence 10 chars per number.
On the whole this reduces the size down to about 90% of the first option (on average) but is there a shorter way?
edit: yes as a string of 1's and 0's... and it doesn’t need to be shorter than the original integers... just the shortest possible way of writing it using only 2 symbols

If the values are evenly distributed between 0 and 399, then a pretty good encoding would be to take three values and encode them as a base 400 three-digit integer, i.e. val1 + 400*val2 + 400*400*val3. Since 400^3 = 64,000,000 is less than 2^26 = 67,108,864, that integer fits in 26 bits. Four successive 26-bit values fit exactly in 13 bytes (104 bits), so you get an average of 13/12 bytes, or about 8.67 bits, per value.
That's about as good as you're going to be able to do, unless the distribution of values is biased or if there is repetition or correlation, in which case you would be able to compress them more.
To deal with the details, you can use the number of bytes in the encoded sequence to determine the number of values, which may not be a multiple of three. If it is not a multiple of three, then there will be one or two values on the end, coded simply as nine bits each. Since it takes eight bits to go from 18 to 26 bits to add a value, there is no ambiguity in the count.
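The packing described above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: it emits a string of 1's and 0's (as the question asks) instead of bytes, and the names encodeTriples and decodeTriples are made up for this example.

```javascript
const BASE = 400; // values are assumed to be integers in 0..399

// Pack each full group of three values into 26 bits; leftovers get 9 bits each.
function encodeTriples(values) {
  let bits = "";
  let i = 0;
  for (; i + 3 <= values.length; i += 3) {
    const packed = values[i] + BASE * values[i + 1] + BASE * BASE * values[i + 2];
    bits += packed.toString(2).padStart(26, "0");
  }
  for (; i < values.length; i++) {
    bits += values[i].toString(2).padStart(9, "0");
  }
  return bits;
}

// Reverse the packing: 26-bit groups first, then any 9-bit leftovers.
function decodeTriples(bits) {
  const values = [];
  let pos = 0;
  while (bits.length - pos >= 26) {
    let packed = parseInt(bits.slice(pos, pos + 26), 2);
    values.push(packed % BASE);
    packed = Math.floor(packed / BASE);
    values.push(packed % BASE, Math.floor(packed / BASE));
    pos += 26;
  }
  while (pos < bits.length) {
    values.push(parseInt(bits.slice(pos, pos + 9), 2));
    pos += 9;
  }
  return values;
}

const encoded = encodeTriples([128, 32, 22, 23]);
console.log(encoded.length);         // 35 (26 bits for the triple + 9 for the leftover)
console.log(decodeTriples(encoded)); // [ 128, 32, 22, 23 ]
```

Because the leftover is always 0, 9 or 18 bits, which is less than 26, the decoder can tell how many trailing values there are from the length alone.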

A good starting point would be to create constant-length blocks of ones and zeroes, which gives you easy to decode strings.
400 in binary is 110010000 (9 digits), so each number can be encoded as its binary representation zero-padded to a constant 9 characters.
encoding the first row:
var padTo9 = function( bin ){
while( bin.length<9 ){ bin = "0" + bin; }
return bin;
}
[128,32 ,22,23].map( function(i){ return padTo9( i.toString(2) ) }).join('');
/* result:
"010000000000100000000010110000010111"
*/
decoding
"010000000000100000000010110000010111".match(/[0-1]{9}/g).map( function(i){ return parseInt( i, 2 ) });
/* result:
[128, 32, 22, 23]
*/
I think the only way to get a shorter string is to use variable block lengths, which would require control symbols telling the decoder how many characters the following numbers are encoded in. But those symbols would have to be values of 400 or above (so they can't be mistaken for data) and would still be 9 characters long, so I don't think it would help given a random distribution of data.

max 399:
2**9 = 512 is the smallest power of two greater than 399, so each number can be stored in 9 bits;
convert each to binary, pad to 9 digits, and concatenate

Related

Implementing extendible hash table in javascript: how to use binary number as index

I'm studying data structures and trying to implement extendible hashing from scratch in Javascript and I'm confused. Here is an example I'm using as reference hash table with binary labels
Example: to store "john":35 in a table of size: 8 indexes / depth 3 (last 3 digits of binary hash)
"john" gets converted to a hash, example: 13,
13 is converted to a binary: 1101
find which index of the table 1101 belongs to, by looking at the last 3 digits "101"
This is where I'm stuck. Am I supposed to convert 101 back to decimal form (which would be 5), to then access the index by doing array[5]? Is there a way to label the array indexes in binary format like array[101] (but then wouldn't it be better to use an object?)? This seems like a lot of unnecessary extra steps to avoid just using modulo (13%8), am I missing something? Is this implementation useful in languages other than JavaScript?
First post - thanks in advance!
Internally, all data in the computer is stored in binary, so you can't "convert" from decimal to binary since everything is already binary (it's just shown to us as decimal). If you want to print out a number as binary for debugging purposes, you can do:
console.log((5).toString(2)); // will print "101"
The .toString(2) method converts the number to a string with the binary representation of the number.
You can also write numbers in binary by starting it with 0b:
let x = 0b1101; // == 13
If you want to get the last few binary digits of a number, take the number modulo 2 to the power of the number of digits you want:
(0b1101 % (2**3)).toString(2) // "101"
With the table selected, you probably want to use the rest of the number that you haven't used already as the index in the table. We can use the bitshift operator, >>, to do this:
(0b1101 >> 3).toString(2) // "1", right three bits cut off
With a longer number:
// Note that underscores don't mean anything, they are just used for spacing
(0b1101_1101 >> 3).toString(2) // "11011" you can see that the right three bits have been cut off
Keep in mind that you probably shouldn't be using .toString(2) to actually store anything in the table; it should only be used for debugging.
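For the indexing step in the question, a bit mask gives the same result as the modulo shown above, and the result is an ordinary number that can be used directly as an array index. A minimal sketch, assuming hash is a non-negative 32-bit integer and depth is below 31; bucketIndex is a hypothetical helper name, not part of any library:

```javascript
// Keep only the lowest `depth` bits of the hash.
// (1 << depth) - 1 is a mask of `depth` ones, e.g. depth 3 gives 0b111.
function bucketIndex(hash, depth) {
  return hash & ((1 << depth) - 1);
}

const table = new Array(8).fill(null); // 8 slots, depth 3
const index = bucketIndex(13, 3);      // 0b1101 & 0b111
table[index] = { key: "john", value: 35 };

console.log(index);             // 5
console.log(index.toString(2)); // "101", for debugging only
```

There is no need to label the array in binary: 5 and 0b101 are the same number, so array[5] and array[0b101] refer to the same slot.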

Getting specific digits in a binary number in javascript

I need to be able to take the first 8 digits in a binary number, and save that value to a variable, then save the next 8, and so on. I read this https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Bitwise_Operators on bitwise operations, but didn't see anything about getting a specific digit or set of digits. I suppose I could just AND the number in question with another number that is all zeros except for the digits in question, which would be ones. For instance if the number in question was 10110011010111, and I wanted the first 5 digits, I could do 0010110011010111 & 0000000000011111 which would return 0000000000010111, which would be fine, but if there's a better or more direct way to do this, I would prefer that.
Edit: I'm doing this to be able to store a number as a number in base 256, so I can use color to encode information. I don't need to know the actual ones and zeros in those locations, but what number they would be taken in groups of 8, and saving that number.
You could use splice:
var str = '10110011010111';
var arr = str.split('');
console.log(arr.splice(arr.length - 5, 5).join('')); // prints 10111
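If the value is already a number rather than a string, the AND-with-a-mask idea from the question combined with a right shift yields the base-256 digits directly. A rough sketch, assuming n is a non-negative integer below 2^32 (the range where JavaScript's bitwise operators are safe); toBase256 is a name invented for this example:

```javascript
// Split a number into 8-bit groups, least significant group first.
function toBase256(n) {
  const digits = [];
  do {
    digits.push(n & 0xff); // keep the lowest 8 bits
    n = n >>> 8;           // drop them and move on to the next group
  } while (n > 0);
  return digits;
}

const n = 0b10110011010111;
console.log(n & 0b11111);  // 23, i.e. 0b10111, the lowest five bits
console.log(toBase256(n)); // [ 215, 44 ] since 44 * 256 + 215 === n
```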

How can I find the missing integer in a string of random non-repeating integers in a range

I'm having trouble even beginning to think of how to do this.
I need to find a missing number in a string of random numbers that don't have separators.
Here's an example: 14036587109. In this example, the missing number is 2, the range is 0-10 (inclusive).
How can I write a JavaScript/Node.JS program to solve this?
The part I can't figure out is how the program separates the numbers; In the example above, how would the program know that the number 10 (before the last number) isn't the numbers 1 and 0.
There are two things we know about the missing integer: the total number of digits in the input tells us the number of digits in the missing integer, and (as @samgak mentioned in a comment) counting the occurrences of each digit in the input tells us which digits the missing integer is made of. This may give us a quick path to the solution, if one of the permutations of those digits is missing from the input. If it doesn't, then:
Find the integers from highest to lowest number of digits; if the range is e.g. 0-999, then search the 3-digit integers first, then 2, then 1.
If an integer is only present at one location in the input, mark it as found, and remove it from the input.
Then, start again with the longest integers that haven't been found yet, and look at the ones that are present at two locations; try both options, and then check whether all other integers that rely on the digits we're using are also present; e.g. if 357 is present at two locations:
... 1235789 ... 2435768 ...
      357         357
     23          43
    123         243
     235         435
       578         576
        78          76
        789         768
When trying the first location for the 357, check whether there is another possibility for 23, 123, 235, 578, 78, and 789. For the second location, check 43, 243, 435, 576, 76 and 768.
If these checks show that only one of the options is possible, mark the number as found and remove it from the input.
Go on to do this for shorter integers, and for integers that are present at 3, 4, ... locations. If, after doing this to a certain point, there is still no result, you may have to recursively try several options, which will quickly lead to a huge number of options. (With especially crafted large input, it is probably possible to thwart this method and make it unusably slow.) But the average complexity with random input may be decent.
Actually, when you find an integer that is only present in one location in the input, but it is a permutation of the missing digits, you should not remove it, because it could be the missing integer. So the algorithm might be: remove all integers you can unequivocally locate in the input, then try removing all possible missing integers one by one, and look for inconsistencies, i.e. other missing numbers that don't have the correct length or digits.
It's all a question of heuristics, of course. You try something simple, if that doesn't work you try something more complicated, if that doesn't work, you try something even more complicated... and at each step there are several options, and each one could be optimal for some input strings but not for others.
E.g. if the range is 0-5000, you'd start by marking the 4-digit integers that are only present at one location. But after that, you could do the same thing again (because integers that were present twice could have had one of their options removed) until there's no more improvement, or you could check integers that are present twice, or integers that are present up to x times, or move on to 3-digit integers... I don't think there's a straightforward way to know which of these options will give the best result.
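The digit-counting observation at the start of this answer can be sketched as below. It only identifies which digits the missing integer is made of, not their order, and it assumes the input really is the range with exactly one integer removed; missingDigits is a hypothetical helper name.

```javascript
// Count the digits the full range should contain, subtract the digits
// actually present, and return whatever is left over (in ascending order).
function missingDigits(input, start, end) {
  const counts = new Array(10).fill(0);
  for (let n = start; n <= end; n++) {
    for (const ch of String(n)) counts[Number(ch)]++;
  }
  for (const ch of input) counts[Number(ch)]--;
  let digits = "";
  counts.forEach((count, digit) => {
    if (count > 0) digits += String(digit).repeat(count);
  });
  return digits;
}

console.log(missingDigits("14036587109", 0, 10)); // "2"
```

When the result has a single digit, or all its permutations but one are out of range or present in the input, the missing integer follows immediately; otherwise the search described above is still needed.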
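This solution works for the example input. Note that it matches digits greedily, so it is a heuristic rather than a general method: for the range 0-10 with 1 missing, the lone 1 belonging to 10 is consumed when checking 1, and 10 is reported as missing instead.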
We can think about the numbers in the string as a pool of digits that we can choose from. We start at startRange and go through to endRange, looking for each number along the way in our pool of digits.
When we find a number that can be composed from our pool of digits, we delete those digits from our pool of digits, as those digits are already being used to form a number in our range.
As soon as we come across a number that cannot be composed from our pool of digits, that must be the missing number.
const str = "14036587109"; // input
const numsLeft = str.split("").map(num => parseInt(num)); // array of numbers
const startRange = 0;
const endRange = 10;
for(let i = startRange; i <= endRange ; i++) {
// check if number can be formed given the numbers left in numsLeft
const numFound = findNum(numsLeft, i);
if(!numFound) {
console.log("MISSING: " + i); // prints 2
break;
}
}
function findNum(numsLeft, i) {
// array of digits
const numsToFind = String(i).split("").map(num => parseInt(num));
// default is true, if all digits are found in numsLeft
let found = true;
numsToFind.forEach(num => {
// find digit in numsLeft
const numFoundIndex = numsLeft.indexOf(num);
if(numFoundIndex < 0) {
// digit was not found in numsLeft
found = false;
return;
} else {
// digit was found; delete digit from numsLeft
numsLeft.splice(numFoundIndex, 1);
}
});
return found;
}
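var input = '10436587109';
// candidates are listed longest first, so the regex prefers "10" over "1" followed by "0"
var range = [10,9,8,7,6,5,4,3,2,1,0];
var expr1 = new RegExp(range.join('|'),'g');
var expr2 = new RegExp('[0-9]','g');
// collect both the multi-digit matches and every single digit that occurs
var a = input.match(expr1).map(Number).concat(input.match(expr2).map(Number));
// any number in the range that never appears is reported as missing
var x = range.filter(function(i){ return a.indexOf(i)===-1; }); // [2]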

Any way to reliably compress a short string?

I have a string exactly 53 characters long that contains a limited set of possible characters.
[A-Za-z0-9\.\-~_+]{53}
I need to reduce this to length 50 without loss of information and using the same set of characters.
I think it should be possible to compress most strings down to 50 length, but is it possible for all possible length 53 strings? We know that in the worst case 14 characters from the possible set will be unused. Can we use this information at all?
Thanks for reading.
If, as you stated, your output strings have to use the same set of characters as the input string, and if you don't know anything special about the requirements of the input string, then no, it's not possible to compress every possible 53-character string down to 50 characters. This is a simple application of the pigeonhole principle.
Your input strings can be represented as a 53-digit number in base 67, i.e., an integer from 0 to 67^53 - 1 ≈ 6 × 10^96.
You want to map those numbers to an integer from 0 to 67^50 - 1 ≈ 2 × 10^91.
So by the pigeonhole principle, an average of 67^3 = 300,763 different inputs will map to each possible output, and at least one output must receive that many or more. When you go to decompress, you have no way to know which of those originals you're supposed to map back to.
To make this work, you have to change your requirements. You could use a larger set of characters to encode the output (you could get it down to 50 characters if each one had 87 possible values, instead of the 67 in the input). Or you could identify redundancy in the input -- perhaps the first character can only be a '3' or a '5', the nineteenth and twentieth are a state abbreviation that can only have 62 different possible values, that sort of thing.
If you can't do either of those things, you'll have to use a compression algorithm, like Huffman coding, and accept the fact that some strings will be compressible (and get shorter) and others will not (and will get longer).
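The numbers in the counting argument above can be checked with BigInt arithmetic. This is only a verification sketch; smallestBase is a hypothetical helper that searches for the smallest alphabet size whose strings of length toLen can cover every string of length fromLen.

```javascript
const inputs = 67n ** 53n;  // number of possible 53-character strings
const outputs = 67n ** 50n; // number of possible 50-character strings

// Average number of inputs that must share each output.
const inputsPerOutput = inputs / outputs;
console.log(inputsPerOutput); // 300763n

// Smallest alphabet size b such that b^toLen >= symbols^fromLen.
function smallestBase(symbols, fromLen, toLen) {
  const needed = BigInt(symbols) ** BigInt(fromLen);
  let base = BigInt(symbols);
  while (base ** BigInt(toLen) < needed) base++;
  return Number(base);
}

console.log(smallestBase(67, 53, 50)); // 87
```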
What you ask is not possible in the most general case, which can be proven very simply.
Say it was possible to encode an arbitrary 53 character string to 50 chars in the same set. Do that, then add three random characters to the encoded string. Then you have another arbitrary, 53 character string. How do you compress that?
So what you want can not be guaranteed to work for any possible data. However, it is possible that all your real data has low enough entropy that you can devise a scheme that will work.
In that case, you will probably want to do some variant of Huffman coding, which basically allocates variable-bit-length encodings for the characters in your set, using the shortest encodings for the most commonly used characters. You can analyze all your data to come up with a set of encodings. After Huffman coding, your string will be a (hopefully shorter) bitstream, which you encode to your character set at 6 bits per character. It may be short enough for all your real data.
A library-based encoding like Smaz (referenced in another answer) may work as well. Again, it is impossible to guarantee that it will work for all possible data.
One byte (character) can encode 256 values (0-255) but your set of valid characters uses only 67 values, which can be represented in 7 bits (alas, 6 bits gets you only 64) and none of your characters uses the high bit of the byte.
Given that, you can throw away the high bit and store only 7 bits, running the initial bits of the next character into the "spare" space of the first character. This would require only 47 bytes of space to store. (53 x 7 = 371 bits, 371 / 8 = 46.375, rounded up to 47)
This is not really considered compression, but rather a change in encoding.
For example "ABC" is 0x41 0x42 0x43
0x41 0x42 0x43 // hex values
0100 0001 0100 0010 0100 0011 // binary
100 0001 100 0010 100 0011 // drop high bit
// run it all together
100000110000101000011
// split as 8 bits (and pad to 8)
10000011 00001010 00011[000]
0x83 0x0A 0x18
As an example these 3 characters won't save any space, but your 53 characters will always come out as 47, guaranteed.
Note, however, that the output will not be in your original character set, if that is important to you.
The process becomes:
original-text --> encode --> store output-text (in database?)
retrieve --> decode --> original-text restored
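The encode and decode steps above might look like the following sketch. It assumes every character has a code below 128 (true for the character set in the question) and that the original length is known when decoding; pack7 and unpack7 are illustrative names, and going through a bit string keeps the example short rather than fast.

```javascript
// Encode: write each character as 7 bits, then regroup the bits into bytes.
function pack7(text) {
  let bits = "";
  for (const ch of text) {
    bits += ch.charCodeAt(0).toString(2).padStart(7, "0");
  }
  while (bits.length % 8 !== 0) bits += "0"; // pad the last byte with zeros
  const bytes = [];
  for (let i = 0; i < bits.length; i += 8) {
    bytes.push(parseInt(bits.slice(i, i + 8), 2));
  }
  return bytes;
}

// Decode: expand the bytes back to bits and read `length` groups of 7.
function unpack7(bytes, length) {
  const bits = bytes.map(b => b.toString(2).padStart(8, "0")).join("");
  let text = "";
  for (let i = 0; i < length; i++) {
    text += String.fromCharCode(parseInt(bits.slice(i * 7, i * 7 + 7), 2));
  }
  return text;
}

console.log(pack7("ABC"));             // [ 131, 10, 24 ], i.e. 0x83 0x0A 0x18
console.log(unpack7(pack7("ABC"), 3)); // "ABC"
```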
If I remember correctly, Huffman coding is going to be the most compact way to store the data, at least among codes that assign a whole number of bits to each character. It has been too long since I used it to write the algorithm quickly, but the general idea is:
get the count for each character that is used
prioritize them based on how frequently they occurred
build a tree based off the prioritization
get the compressed bit representation of each character by traversing the tree (start at the root, left = 0 right = 1)
replace each character with the bits from the tree
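The steps above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not a tuned implementation: it re-sorts an array instead of using a priority queue, builds the code table from the text being encoded, and all function names are invented for the example.

```javascript
// Steps 1-4: count characters, build the tree, read the codes off the tree.
function buildHuffmanCodes(text) {
  const counts = new Map();
  for (const ch of text) counts.set(ch, (counts.get(ch) || 0) + 1);

  const nodes = [...counts].map(([ch, count]) => ({ ch, count, left: null, right: null }));
  while (nodes.length > 1) {
    nodes.sort((a, b) => a.count - b.count);  // least frequent first
    const [left, right] = nodes.splice(0, 2); // merge the two rarest nodes
    nodes.push({ ch: null, count: left.count + right.count, left, right });
  }

  const codes = {};
  const walk = (node, prefix) => {
    if (node.ch !== null) {
      codes[node.ch] = prefix || "0"; // a lone symbol still needs one bit
      return;
    }
    walk(node.left, prefix + "0");  // left = 0
    walk(node.right, prefix + "1"); // right = 1
  };
  if (nodes.length === 1) walk(nodes[0], "");
  return codes;
}

// Step 5: replace each character with its bits.
function huffmanEncode(text, codes) {
  return [...text].map(ch => codes[ch]).join("");
}

// Decoding works because no code is a prefix of another.
function huffmanDecode(bits, codes) {
  const byCode = {};
  for (const ch in codes) byCode[codes[ch]] = ch;
  let out = "";
  let current = "";
  for (const bit of bits) {
    current += bit;
    if (current in byCode) {
      out += byCode[current];
      current = "";
    }
  }
  return out;
}

const text = "aaaaaaaabbbc";
const codes = buildHuffmanCodes(text);
const bits = huffmanEncode(text, codes);
console.log(bits.length);                // 16 bits instead of 12 * 8
console.log(huffmanDecode(bits, codes)); // "aaaaaaaabbbc"
```

The decoder needs the same code table, so in practice the table is either fixed in advance from representative data or stored alongside the output.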
Smaz is a simple compression library suitable for compressing very short strings.

What is the fastest method to calculate substring

I have a huge "binary" string, like: 1110 0010 1000 1111 0000 1100 1010 0111....
Its length is 0 modulo 4, and may reach 500,000.
I have also a corresponding array: {14, 2, 8, 15, 0, 12, 10, 7, ...}
(every number in the array corresponds to 4 bits in the string)
Given this string, this array, and a number N, I need to calculate the following substring string.substr(4*N, 4), i.e.:
for N=0 the result should be 1110
for N=1 the result should be 0010
I need to perform this task many many times, and my question is what would be the fastest method to calculate this substring ?
One method is to calculate the substring directly: string.substr(4*N, 4). I'm afraid this is not efficient for such huge strings.
Another method is to use array[N].toString(2) and then pad the result with leading zeros if needed. I'm not sure how fast this is.
Maybe you have other ideas?
Where does the string come from? Why not represent the string not as binary, but as hex, and then you can store each four-binary-digit section as a single character? (You could obviously pack it twice that densely if you wanted, or actually now that I think of it, 4 times, since Javascript strings are 16-bit Unicode). Then finding a single group would be a single call to "charAt()", and you'd just have to expand to the binary form via a lookup table.
edit — oh well duhh, you already have an array. In that case don't do the substring work at all; it's crazy. Just grab the array element and translate it through a lookup array into the 4-binary-digit string.
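A lookup array of this kind could be built once and reused, along the lines of the sketch below. The names NIBBLES and nibbleAt are made up for the example, and the array values are assumed to be integers from 0 to 15.

```javascript
// Precompute the 4-character binary string for every value 0..15.
const NIBBLES = Array.from({ length: 16 }, (_, i) => i.toString(2).padStart(4, "0"));

// Equivalent to string.substr(4 * n, 4), but without touching the big string.
function nibbleAt(array, n) {
  return NIBBLES[array[n]];
}

const array = [14, 2, 8, 15, 0, 12, 10, 7];
console.log(nibbleAt(array, 0)); // "1110"
console.log(nibbleAt(array, 1)); // "0010"
```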
You could consider representing your huge string as a Rope data structure. A rope is basically a binary tree whose leaves are arrays of characters. A node in the tree has a left child and a right child, the left child being the first part of the string, while the right child the final part.
By using a rope, substring operations become logarithmic in complexity, rather than linear, as they are for regular strings.
If you want it padded, you could do this:
var elem = array[N]
var str = "" + ((elem>>3)&1) + ((elem>>2)&1) + ((elem>>1)&1) + (elem&1);
The array already has exactly what you need, does it not, save that you need to print it in binary format. Fortunately, sprintf for javascript is available.
