I have seen quite a few compression methods for JS, but in most cases the compressed data was a string containing text. I need to compress an array of fewer than 10^7 floats in the range 0-1.
As precision is not really important, I can eventually save it as a string containing only the digits 0-9 (keeping only the first 2 digits after the decimal point of each float). What method would be best for data like this? I'd like the smallest possible output, but it also shouldn't take more than ~10 sec to decompress the string, which is up to about 10,000,000 characters when saving 2 digits per float.
The data contains records of a sound waveform, for visualization on archaic browsers that don't support the Web Audio API. The waveform is recorded at 20 fps on the Chrome client, compressed, and stored in a server DB, then sent back to IE or Firefox on request to draw the visualization. So I need lossy compression, and it can be really lossy, to reach a size that can be sent along with the song metadata request. I hope compression at the wav -> mp3 64k level would be possible (like 200:1 or so); no one will notice that the waveform is not perfect in the visualization. I thought maybe about saving these floats as 0-9a-z, which gives 36 instead of 100 steps but reduces the record of one frequency to 1 character. But what next, what compression should I use on this string of 0-z characters to achieve the best ratio? Would LZMA be suitable for a string like this? Compression/decompression would run in a web worker, so it doesn't need to be instant: decompression around 10 sec, and compression doesn't matter much, as long as it's less than one song, so about 2 min.
Taking a shot in the dark: if you can truly rely on only the first two digits after the decimal (i.e. there are no 0.00045s in the array), and two digits of precision are enough, the easiest thing to do would be to multiply by 256 and take the integer part as a byte:
encoded = Math.min(255, Math.floor(floatValue * 256)); // clamp so floatValue === 1.0 still fits in one byte
decoded = encoded / 256.0;
This comes out to a 4:1 compression ratio (one byte per value instead of a 4-byte float). However, if you know more about your data, you can squeeze even more entropy out of your values.
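A minimal sketch of this quantization over a whole array, assuming the input is a plain Array or Float32Array of values in [0, 1] (the function names are mine, not from the answer):

    // Quantize floats in [0, 1] to one byte each (lossy, ~1/256 precision)
    function quantize(floats) {
      var bytes = new Uint8Array(floats.length);
      for (var i = 0; i < floats.length; i++) {
        bytes[i] = Math.min(255, Math.floor(floats[i] * 256));
      }
      return bytes;
    }

    // Recover approximate floats from the quantized bytes
    function dequantize(bytes) {
      var floats = new Float32Array(bytes.length);
      for (var i = 0; i < bytes.length; i++) {
        floats[i] = bytes[i] / 256.0;
      }
      return floats;
    }

The resulting Uint8Array can then be fed to a general-purpose compressor (e.g. a deflate or LZMA library running in the web worker) for further size reduction.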
Related
Is there any way to encode a file of, for example, 2 GB without "chopping" it into chunks? Files larger than 2 GB throw an error that the file is too large for fs, and splitting into smaller chunks doesn't work either, because of encoding/decoding problems. Thanks for any help :)
Base64 isn't a good solution for large file transfer.
It's simple and easy to work with, but it will increase your file size. See MDN's article about this. I would recommend looking into best practices for data transfer in JS; MDN has another article on this that breaks down the DataTransfer API.
Encoded size increase
Each Base64 digit represents exactly 6 bits of data. So, three 8-bit bytes of the input string/binary file (3×8 bits = 24 bits) can be represented by four 6-bit Base64 digits (4×6 = 24 bits).
This means that the Base64 version of a string or file will be at least 133% the size of its source (a ~33% increase). The increase may be larger if the encoded data is small. For example, the string "a" with length === 1 gets encoded to "YQ==" with length === 4: a 300% increase.
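A quick way to see the size increase in practice, assuming a browser context where btoa is available:

    // Compare raw length to Base64-encoded length
    var raw = "Hello, world!";                 // 13 ASCII characters = 13 bytes
    var encoded = btoa(raw);                   // "SGVsbG8sIHdvcmxkIQ=="
    console.log(raw.length, encoded.length);   // 13 20 (~54% larger here)

The overhead approaches the theoretical ~33% as inputs get larger; small inputs suffer proportionally more because of the 4-character padding granularity.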
Additionally
Could you share what you're trying to do, and add an MRE? There are so many different ways to tackle this problem that it's hard to narrow down without knowing any of the requirements.
Very simple question: how much data (in bytes) do strings take up? Do they take up 1 byte per character?
I tried searching it up, but W3Schools doesn't say...
I want to know this to reduce bandwidth in my web app.
Also, for anyone who knows: does socket.io automatically JSON-stringify when using socket.emit()?
A string is a character array, so it will take up roughly sizeof(char) * numberOfCharacters, ignoring other fields of the String class for now. A character can be 1 byte or 2 bytes depending on the system and the type of characters being represented (Unicode etc.).
However, from your question, you are more interested in the data being transported over the network. Note that data is always exchanged as bytes (byte[]), so the string will be converted into a byte[] representation first and then sent over.
To limit bandwidth usage, you can enable compression and choose an interoperable serialization technique (protobuf, Smile, Fast Infoset, etc.).
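If you want to measure the actual wire size of a string in the browser, a minimal check, assuming TextEncoder is available (it is in all modern browsers), looks like this:

    // UTF-8 byte length of a string: ASCII takes 1 byte per char, others 2-4
    var bytes = new TextEncoder().encode("héllo").length;
    console.log(bytes); // 6, not 5: "é" needs 2 bytes in UTF-8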
I am writing compression and decompression functions for strings containing base-10 digits. I figure that, since only 10 distinct characters are being acted upon, there must exist a much smaller string that can represent large strings. The compressed result is encoded in ISO-8859-7, so I can use 256 characters in the result string.
For example, I want to take a string that represents a 1000-digit number (this one, for example) and "compress" it. Numbers of this length exceed the number type in the language I am working in, JavaScript, so numeric manipulation/conversion is out of the question. The compression software I use (shoco) does not compress numbers. At all.
How might I go about doing this? Is there an algorithm that can be used to compress numbers? I am not looking for speed, but rather for optimal compression for the majority of numbers, not just the number given as an example.
If you work on the number in groups of three digits, you can represent each triplet in 10 bits with very little wastage. Then you "just" need to create a stream of 8-bit octets from your stream of 10-bit triples, which will require a certain amount of bit-shifting, but is not awfully complicated.
That assumes that your number consists of a multiple of 3 digits (you could pad it with leading zeros) or that you know how many digits it contains (in which case you could pad it at the end with trailing zeros). If you encoded subsequences into 50 bit units, you would have enough codespace to encode digit sequences of up to 15 digits, not just exactly 15 digits, which would avoid the need to pad. You could just barely get away with that in a language which uses 53-bit floating point as a common numeric type, but it might or might not be worth the extra complication.
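A sketch of the triplet packing in JavaScript (the function name is mine, and this assumes the decoder knows the digit count, as discussed above):

    // Pack a decimal digit string into bytes, 10 bits per 3-digit group
    function packDigits(digits) {
      while (digits.length % 3 !== 0) digits = "0" + digits; // pad with leading zeros
      var out = [], acc = 0, accBits = 0;
      for (var i = 0; i < digits.length; i += 3) {
        var triplet = parseInt(digits.slice(i, i + 3), 10); // 0..999 fits in 10 bits
        acc = (acc << 10) | triplet;
        accBits += 10;
        while (accBits >= 8) {                 // flush complete bytes
          accBits -= 8;
          out.push((acc >> accBits) & 0xFF);
        }
      }
      if (accBits > 0) out.push((acc << (8 - accBits)) & 0xFF); // final partial byte
      return new Uint8Array(out);
    }

Decoding reverses this: read 10 bits at a time and emit each value as three digits (zero-padded), stopping once the known digit count is reached.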
rici's answer, using 10 bits for every three digits, is indeed what I would use for a practical application.
However since you asked for the optimal compression and stated that you don't care about speed, that would be generating a binary representation of the decimal number using multiple precision arithmetic. This code has already been written for you in the GMP library. That library is highly optimized and quite fast, so you would not see a huge speed impact, depending on what else you're doing with the numbers.
As an example your 1000-digit number would take 418 bytes to code using 334 sets of 10 bits. It would take 416 bytes when encoded as a single, large, binary integer. On a 2 GHz i7, I get 1.9 µs for the 1000-digit conversion using sets of 10 bits, vs. 55 µs using multiple precision arithmetic to make a big integer.
Update:
I missed the javascript tag until someone pointed it out in a comment. You can use Crunch for multiple-precision arithmetic in javascript.
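These answers predate native big integers in JavaScript; in a modern runtime the same multiple-precision conversion can be sketched with the built-in BigInt (assuming BigInt support, and noting that leading zeros in the input are not preserved):

    // Convert a decimal digit string to its binary (byte) representation
    function decimalToBytes(digits) {
      var hex = BigInt(digits).toString(16);
      if (hex.length % 2) hex = "0" + hex;      // round up to whole bytes
      var bytes = new Uint8Array(hex.length / 2);
      for (var i = 0; i < bytes.length; i++) {
        bytes[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
      }
      return bytes;
    }
    // A 1000-digit number yields about 416 bytes, matching the figure above.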
Update 2:
As pointed out by rici, the comparison above assumes that the length of the input is known a priori for both encodings. However, if the stream of bits needs to be embedded in a larger stream and the number of digits is not known a priori, then a means must be provided to determine where the number ends.
The 10-bit encoding of three digits permits using a final 10-bit code as that marker, since 24 of the possible values are unused. In fact, we can use 10 of those 24 to provide one more digit to the number. (We could even add a "half" digit by using 20 values for 0..19, allowing a leading 1 if present in that position. Or we could use that for a sign to allow negative integers. But I digress.) This turns out to be perfect for the case of 1000 digits, which is a multiple of three, plus one. Then 1000 digits can be encoded with an end marker in 418 bytes, the same as before when not requiring an end marker. (In a stream of bits it can actually be 417.5 bytes.)
For the binary integer we can either precede it with a length in bits, or use bit stuffing to mark the end of the stream with a series of one bits. The overhead is about the same either way. We'll do the latter to make it easy to handle arbitrary-length integers. The 1000-digit integer will take 3322 bits, or 415 bytes and two bits. We can choose the maximum run of one bits in the data to be 11 long. When 11 1's appear in a row, a 0 bit is stuffed into the stream. If 12 1's are seen in a row, then you have reached the end of the stream (the 12 1's and a preceding 0 are discarded.) Using 11 will add 13 bits to the end, plus allowing up to one bit of stuffing to fill the last byte (the mean number of stuffed bits is 0.81), bringing the total bytes to 417.
So there is still a gain, four bits to be precise, though less now due to the advantage of the unused 10-bit patterns.
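A sketch of the bit-stuffing termination described above, using plain arrays of 0/1 values (my own illustration of the scheme, not code from the answer):

    // Stuff a 0 after every run of 11 one-bits, then terminate with
    // a 0 followed by 12 ones, so 12 consecutive ones can only mean "end"
    function stuffBits(bits) {
      var out = [], run = 0;
      for (var i = 0; i < bits.length; i++) {
        out.push(bits[i]);
        run = bits[i] === 1 ? run + 1 : 0;
        if (run === 11) { out.push(0); run = 0; }   // stuffed bit
      }
      out.push(0);                                   // terminating 0 ...
      for (var j = 0; j < 12; j++) out.push(1);      // ... then 12 ones
      while (out.length % 8 !== 0) out.push(1);      // pad the last byte
      return out;
    }

    // Reverse: skip stuffed zeros, stop at a run of 12 ones, and discard
    // that run plus the preceding 0 (as described in the answer)
    function unstuffBits(stuffed) {
      var out = [], run = 0;
      for (var i = 0; i < stuffed.length; i++) {
        if (stuffed[i] === 1) {
          run++;
          if (run === 12) {        // end marker reached
            out.length -= 11;      // drop the 11 ones already emitted
            out.pop();             // drop the preceding 0
            return out;
          }
          out.push(1);
        } else if (run === 11) {
          run = 0;                 // stuffed 0: skip it
        } else {
          run = 0;
          out.push(0);
        }
      }
      throw new Error("no end marker found");
    }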
I want to send game result data as binary, partly for efficiency (sending 6 bytes per item instead of 13... that's more than halving the total amount of data to send, and as there can be a few hundred of these items, the result is a huge saving), and partly for obfuscation (people monitoring network activity would see seemingly random bytes instead of distinguishable data).
My "code" (not in use yet, just a prototype) is as follows:
String.fromCharCode.apply(
  null,
  somevar.toString(16)                              // integer -> hex string
    .split(/(?=(?:..)+$)/)                          // split into 2-digit pairs from the right
    .map(function (a) { return parseInt(a, 16); })  // each pair -> byte value
)
This will convert any integer value into a binary string value.
However, I seem to recall that AJAX and binary data don't mix. I'd like to know what range of values is safe to use. Should I stick to the range 32-255, or go even safer and stick to 32-127? In the case of 32-255, I can use 15 as the base in the above code and add 32 to all the numbers, so that'd work for me.
But really I'm more interested in the character-range question, and whether there is any cross-browser (among browsers that support Canvas) way to transfer binary data.
AJAX and binary data do not conflict with each other. What happens is, when you make an AJAX call, the data is posted as form data. When you post form data, you usually encode it as application/x-www-form-urlencoded. The encoded data only contains letters/numbers and certain special characters; for example, a space is encoded as %20. For this reason, it may not save you any space at all even if you convert your "normal" letters to binary, because eventually everything has to be encoded again.
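That said, if you control both ends, XMLHttpRequest level 2 can send and receive raw bytes with no form encoding at all. A minimal sketch, assuming a hypothetical /scores endpoint that accepts an octet stream:

    // Send a typed array as a raw binary request body
    var payload = new Uint8Array([7, 42, 255, 0, 13, 99]); // e.g. one 6-byte item
    var xhr = new XMLHttpRequest();
    xhr.open("POST", "/scores");
    xhr.setRequestHeader("Content-Type", "application/octet-stream");
    xhr.responseType = "arraybuffer";      // receive binary back as well
    xhr.onload = function () {
      var reply = new Uint8Array(xhr.response);
      console.log("got", reply.length, "bytes");
    };
    xhr.send(payload.buffer);

This sidesteps the character-range question entirely, since no string encoding is involved.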
I am developing a PhoneGap application in HTML5/JavaScript. I have a string of around 1000 characters comprising GUIDs in the format below:
1=0a0a8907-40b9-4e81-8c4d-d01af26efb78;2=cd4713339;3=Cjdnd;4=19120581-21e5-42b9-b85f-3b8c5b1206d9;5=hdhsfsdfsd;6=30a21580-48f3-40e8-87a3-fa6e39e6412f; ...............
I have to write this particular string into a QR code. Is there any working technique to compress this string before storing it in the QR code? The QR code generated from this string is too complex and is not easily read by the QR scanners of mobile phones. Please suggest an approach to reduce the string to around 200-250 characters, which can be easily read.
Any help is appreciated.
In your question you have the following sample data:
1=0a0a8907-40b9-4e81-8c4d-d01af26efb78;2=cd4713339;3=Cjdnd;
4=19120581-21e5-42b9-b85f-3b8c5b1206d9;5=hdhsfsdfsd;6=30a21
580-48f3-40e8-87a3-fa6e39e6412f; ..............
Where 1, 4 & 6 look like version 4 UUIDs as described here. I suspect that 2, 3 and 5 might actually be UUIDs as well?!
The binary representation of a UUID is 128 bits long. It should be fairly simple to convert to this representation by just reading the hex digits of the UUID and converting them to binary, which gives 16 bytes per UUID.
However, as the UUIDs are version 4, they are based on random data, which in effect counters further compression (apart from the few bits representing the UUID version). So beyond getting rid of the counters (1=, 2=) and the separator ;, no further compression seems to be possible.
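A small sketch of that conversion (assuming well-formed UUID strings; the function name is mine):

    // Convert a UUID string to its 16-byte binary representation
    function uuidToBytes(uuid) {
      var hex = uuid.replace(/-/g, "");      // 32 hex digits
      var bytes = new Uint8Array(16);
      for (var i = 0; i < 16; i++) {
        bytes[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
      }
      return bytes;
    }
    // "0a0a8907-40b9-4e81-8c4d-d01af26efb78" -> 16 bytes instead of 36 characters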
QR codes encode data using different character sets depending on the range of characters being used. IOW, if you use just ascii digits it will use an encoding that doesn't use 8 bits per digit. See the wikipedia page on QR codes.
Because of the characters in your example, e.g., lower case, you'll be using a binary encoding which is way overkill for your actual information content.
Presuming you have control over the decoder, you could use any compression library to compress your ASCII data before encoding, encode/decode the binary result, and then decompress it in the decoder. There is a world of techniques for trying to get the most out of the compression. You can also start with a non-ASCII encoding and eliminate redundant information like the #= parts.
Couldn't say, though, how much this will buy you.
If you have access to a database already, can you create a table to support this? If so, archive the value and use an ID for QR.
1) Simple schema: ID = bigint with Identity(1000,1), set as primary key; Value = NVARCHAR(MAX). Yes, this is a bit overkill, so modify to taste.
2) Create a function to add your string value to the table and get the ID back as a string for the QR code.
3) Create another function to return the string value when passed a valid ID number.
Stays below the 200 character limit for a very long time.
You don't need the whole GUID; the full 128 bits could eliminate all but one record out of 2^128 records (enough to address every bit of digital information on Earth many times over).
How many records do you need to distinguish? Probably a lot less than 4 billion, right? That's 2^32, so just take the first quarter of each GUID, and there's your 1000 characters down to 250.
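For illustration, a sketch of that truncation on key=value data like the sample (assuming the first 8 hex digits of each GUID stay unique across your record counts):

    // Keep only the first 8 hex digits (32 bits) of each GUID-like value
    var input = "1=0a0a8907-40b9-4e81-8c4d-d01af26efb78;2=19120581-21e5-42b9-b85f-3b8c5b1206d9";
    var shortened = input.split(";").map(function (pair) {
      var kv = pair.split("=");
      return kv[0] + "=" + kv[1].replace(/-/g, "").slice(0, 8);
    }).join(";");
    // "1=0a0a8907;2=19120581"

Note that the sample data also includes non-GUID values (cd4713339, Cjdnd); this sketch truncates everything uniformly, so guard accordingly if some fields must stay intact.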