If I say 2 bytes can hold 510 characters of data, will I be wrong? I'm basing this on the fact that 1 byte can hold a value from 0 to 255.
One byte is :
8 bits, each one can be either 0 or 1
something that can represent 256 distinct values
Two bytes are ..two bytes.
16 bits
something that can represent 65536 distinct values
There is no meaning in what a byte (or two bytes) is, if you don't know the encoding used, what each single one of the 256 (or 65536) values are supposed to be/mean.
If you're talking about Char, you can't say whether it's one or two or fifty chars either...
ASCII encoding holds 128 distinct characters (95 chars can be displayed while the remaining are control chars) ranging from code 0 to 127 (Byte value expressed in decimal literal)
Unicode (v7) is a character set rather than a single encoding. Its encodings are UTF-8, UTF-16 (Little Endian or Big Endian), and UTF-32 (Little Endian or Big Endian).
UTF-8 requires either 1, 2, 3 or 4 bytes to represent one single character.
UTF-16 is a variable-width encoding: each char requires 2 bytes if it is in the Basic Multilingual Plane, or 4 bytes (a surrogate pair) otherwise.
UTF-32 is also a fixed size character encoding that requires 4 bytes per character.
There are hundreds of single-byte encodings that assign one character to each of the 256 unique values a single Byte can represent, such as the Windows "ANSI" code pages.
So I tend to say, yes, you're wrong thinking two bytes can hold 510 characters of data, assuming you're using one of the above encodings or something similar.
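To make the per-encoding sizes concrete, here is a minimal sketch that counts the bytes one string needs in UTF-8, UTF-16 and UTF-32. The helper name `byteCounts` is hypothetical, and the sketch assumes a runtime that provides the standard `TextEncoder` (browsers and recent Node.js) and input without unpaired surrogates.

```javascript
// Hypothetical helper: bytes needed to store `text` in three Unicode encodings.
function byteCounts(text) {
  // Spreading a string iterates by code point, not by UTF-16 code unit.
  const codePoints = [...text].length;
  return {
    utf8: new TextEncoder().encode(text).length, // 1 to 4 bytes per code point
    utf16: text.length * 2,                      // 2 bytes per UTF-16 code unit
    utf32: codePoints * 4,                       // always 4 bytes per code point
  };
}

// "A", "é" (U+00E9), "€" (U+20AC) and the kiss mark (U+1F48B).
for (const sample of ["A", "\u00e9", "\u20ac", "\u{1F48B}"]) {
  console.log(sample, byteCounts(sample));
}
```

The same character can therefore cost anywhere from 1 to 4 bytes, which is why a byte count says nothing about a character count until the encoding is fixed.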
But again, a Byte is a Byte, not a Char !
Let's imagine a (new) custom encoding with a specific parser and formatter, where each bit [0 or 1] selects one word/text/string stored in a dictionary, and each following selection depends on the previously selected word (the previous bit value).
Such an encoding would be fairly useless, but hey! Because you used a dictionary, you can claim one single byte represents exactly 510 characters of data (or even more) thanks to this specific encoding/decoding..!
Again, a byte is a byte; saying it holds one, two, zero or 510 characters doesn't mean anything if you don't first define which encoding is used.
EDIT !
And while it's out of the scope of the question, compression is even more evil - and generally uses a dictionary ;) - but compression is only effective beyond a certain number of bytes....
A character is a graphical representation of a concept and may occupy a varying number of bytes, depending on the encoding. For example, in UTF-8 the character "S" (capital letter 'S') occupies 1 byte whereas the character 💋 (kiss mark, U+1F48B) occupies 4 bytes.
I think your answer is wrong. In a single-byte encoding such as ASCII, a byte is 1 character. A character in binary is a series of 8 ons or offs (0s or 1s). One of those is a bit, and 8 bits make a byte, so 1 byte is one character and 2 bytes hold two characters.
It depends on the encoding of the string: 1 byte per character in ASCII, and 2 bytes per character in what .NET calls Unicode (UTF-16) for characters in the Basic Multilingual Plane. So 2 bytes can hold a single such Unicode character or 2 ASCII characters.
The following code illustrates my answer:
MsgBox(System.Text.Encoding.Unicode.GetByteCount("h")) '<--- displays 2 (UTF-16)
MsgBox(System.Text.Encoding.ASCII.GetByteCount("h")) '<--- displays 1
I'm writing a JSON parser in Xojo. It's working, apart from the fact that I can't figure out how to encode and decode Unicode strings that are not in the Basic Multilingual Plane (BMP). In other words, my parser dies if it encounters something greater than \uFFFF.
The specs say:
To escape a code point that is not in the Basic Multilingual Plane,
the character may be represented as a twelve-character sequence,
encoding the UTF-16 surrogate pair corresponding to the code point. So
for example, a string containing only the G clef character (U+1D11E)
may be represented as "\uD834\uDD1E". However, whether a processor of
JSON texts interprets such a surrogate pair as a single code point or
as an explicit surrogate pair is a semantic decision that is
determined by the specific processor.
What I don't understand is what is the algorithm to go from U+1D11E to \uD834\uDD1E. I can't find any explanation of how to "encode the UTF-16 surrogate pair corresponding to the code point".
For example, say I want to encode the smiley face character (U+1F600). What would this be as a UTF-16 surrogate pair and what is the working to derive it?
Could somebody please at least point me in the correct direction?
Taken from the Wikipedia article linked by Remy Lebeau in the comments above:
To encode U+10437 (𐐷) to UTF-16:
Subtract 0x10000 from the code point, leaving 0x0437.
For the high surrogate, shift right by 10 (divide by 0x400), then add 0xD800, resulting in 0x0001 + 0xD800 = 0xD801.
For the low surrogate, take the low 10 bits (remainder of dividing by 0x400), then add 0xDC00, resulting in 0x0037 + 0xDC00 = 0xDC37.
To decode U+10437 (𐐷) from UTF-16:
Take the high surrogate (0xD801) and subtract 0xD800, then multiply by 0x400, resulting in 0x0001 × 0x400 = 0x0400.
Take the low surrogate (0xDC37) and subtract 0xDC00, resulting in 0x37.
Add these two results together (0x0437), and finally add 0x10000 to get the final decoded UTF-32 code point, 0x10437.
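The quoted steps translate almost directly into code. The following is a sketch in JavaScript rather than Xojo, and the function names `toSurrogatePair`, `fromSurrogatePair` and `toJsonEscape` are hypothetical; the arithmetic is the same in any language.

```javascript
// Split a code point above U+FFFF into its UTF-16 high and low surrogates.
function toSurrogatePair(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("code point must be in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xd800 + (offset >> 10);   // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits
  return [high, low];
}

// Recombine a surrogate pair into the original code point.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xd800) << 10) + (low - 0xdc00);
}

// Format the pair as the twelve-character JSON escape sequence.
function toJsonEscape(codePoint) {
  return toSurrogatePair(codePoint)
    .map((unit) => "\\u" + unit.toString(16).toUpperCase().padStart(4, "0"))
    .join("");
}

console.log(toJsonEscape(0x1d11e)); // \uD834\uDD1E (the G clef from the spec)
console.log(toJsonEscape(0x1f600)); // \uD83D\uDE00 (the smiley face)
```

For the smiley face, 0x1F600 - 0x10000 = 0xF600; the top 10 bits are 0x3D and the bottom 10 bits are 0x200, giving 0xD83D and 0xDE00.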
In JavaScript, numbers are always stored as double-precision floats. This is fine if you aren't sending huge amounts of them as binary without compression and don't need to conserve memory. If you need to make these numbers smaller, how do you do so?
The obvious goal would be to store numbers into the smallest possible byte size, for example 208 : 1 byte, 504 : 2 bytes. Even better would be smallest number of bit size, for example 208 : 8 bits, 504 : 9 bits.
example:
//myNetwork is a supposed network API that sends as binary
var x = 208;
myNetwork.send(x); // sends 01000011010100000000000000000000
myNetwork.send(x.toString()); //sends 001100100011000000111000
There are also typed arrays, but converting to a typed array is tricky if the data isn't already a blob or file. On certain network APIs in JavaScript the raw data is often represented as a string before you can touch it.
encoding
//myNetwork is a supposed network API that sends as binary
var x = 208;
myNetwork.send(String.fromCharCode(x)); //sends 11010000 , also known as Ð
decoding
var receivedString = "Ð";
var decodedNum = receivedString.charCodeAt(0); //208
The string method mentioned is 24 bits, whereas this is only 8 bits.
The drawback of this method is that there is obviously some waste if your values don't fill a whole number of bytes. For example, you should be able to store 512 values in 9 bits, however you'd be forced to go up to 16 bits (2 bytes), which is 65,536 values, because characters occupy whole bytes. However, it is fine if you'll be utilizing the full range of values.
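Where typed arrays are usable, the same idea works without going through strings. This is a minimal sketch, assuming non-negative integers and big-endian byte order; the names `packUint`, `unpackUint` and `bitsNeeded` are hypothetical, and the `myNetwork` API from the question is not used here.

```javascript
// Store a non-negative integer in the fewest whole bytes (big-endian).
function packUint(n) {
  if (!Number.isSafeInteger(n) || n < 0) {
    throw new RangeError("expected a non-negative safe integer");
  }
  const bytes = [];
  do {
    bytes.unshift(n % 256);   // lowest byte goes last
    n = Math.floor(n / 256);
  } while (n > 0);
  return Uint8Array.from(bytes);
}

// Rebuild the integer from its bytes.
function unpackUint(bytes) {
  let n = 0;
  for (const b of bytes) n = n * 256 + b;
  return n;
}

// Smallest number of bits that can represent n.
function bitsNeeded(n) {
  return n.toString(2).length;
}

console.log(packUint(208).length, bitsNeeded(208)); // 1 byte, 8 bits
console.log(packUint(504).length, bitsNeeded(504)); // 2 bytes, 9 bits
```

A receiver still has to know how many bytes belong to each number, so a real protocol would need a fixed width or a length prefix on top of this.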
I need to convert the following to a binary format (and later recoup) in the smallest amount of data possible.
my_arr = [
[128,32 ,22,23],
[104,53 ,21,25],
[150,55 ,79,23],
[104,101,23,8 ],
[57 ,117,13,21],
[37 ,135,21,20],
[81 ,132,23,6 ],
[81 ,138,7 ,8 ],
[97 ,138,7 ,8 ]...
the numbers don't exceed 399
If I use a 0 for each digit (8 0's in a row = 8) and a 1 as separator, the first line looks like this:
010010000000011000100110010011001000
This is really long for numbers like 99
If I pad each number to three digits and convert each in turn to actual binary the first line looks like this:
000100101000000000110010000000100010000000100011
This works out as 12 chars per number.
As the first digit won't ever be a 4 or above, I can save two digits by treating 0 as 00, 1 as 01, 2 as 10 and 3 as 11. Hence 10 chars per number.
On the whole this reduces the size down to about 90% of the first option (on average) but is there a shorter way?
edit: yes as a string of 1's and 0's... and it doesn’t need to be shorter than the original integers... just the shortest possible way of writing it using only 2 symbols
If the values are evenly distributed between 0 and 399, then a pretty good encoding would be to take three values and encode them as a base 400 three-digit integer. I.e. val1 + 400*val2 + 400*400*val3. Then that integer will fit nicely in 26 bits. Four successive 26-bit values will fit in 13 bytes. Then you get an average of 13/12 bytes per value.
That's about as good as you're going to be able to do, unless the distribution of values is biased or if there is repetition or correlation, in which case you would be able to compress them more.
To deal with the details, you can use the number of bytes in the encoded sequence to determine the number of values, which may not be a multiple of three. If it is not a multiple of three, then there will be one or two values on the end, coded simply as nine bits each. Since it takes eight bits to go from 18 to 26 bits to add a value, there is no ambiguity in the count.
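The base-400 packing can be sketched as follows. The names `packTriple`, `unpackTriple` and `tripleToBits` are hypothetical, and the sketch covers only full groups of three values, not the one or two trailing values coded as nine bits each.

```javascript
const BASE = 400; // values are 0..399

// Combine three values into one integer below 400^3 = 64,000,000 (< 2^26).
function packTriple(a, b, c) {
  for (const v of [a, b, c]) {
    if (!Number.isInteger(v) || v < 0 || v >= BASE) {
      throw new RangeError("values must be integers in 0..399");
    }
  }
  return a + BASE * b + BASE * BASE * c;
}

// Recover the three values from the packed integer.
function unpackTriple(n) {
  return [n % BASE, Math.floor(n / BASE) % BASE, Math.floor(n / (BASE * BASE))];
}

// Render the packed integer as a fixed-width 26-character bit string.
function tripleToBits(a, b, c) {
  return packTriple(a, b, c).toString(2).padStart(26, "0");
}

console.log(tripleToBits(128, 32, 22).length);      // 26
console.log(unpackTriple(packTriple(128, 32, 22))); // [128, 32, 22]
```

That is 26 bits for three values, against 27 bits when each value is written separately as 9 bits.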
A good starting point would be to create constant-length blocks of ones and zeroes, which gives you easy to decode strings.
400 in binary is 110010000, which requires 9 characters to encode each number as its binary representation zero-padded to constant length.
encoding the first row:
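// Left-pad a binary string with zeros to exactly 9 characters.
var padTo9 = function( bin ){
while( bin.length<9 ){ bin = "0" + bin; }
return bin;
}; // this semicolon is required: without it the next line is parsed as an index into the function
[128,32 ,22,23].map( function(i){ return padTo9( i.toString(2) ) }).join('');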
/* result:
"010000000000100000000010110000010111"
*/
decoding
"010000000000100000000010110000010111".match(/[0-1]{9}/g).map( function(i){ return parseInt( i, 2 ) });
/* result:
[128, 32, 22, 23]
*/
I think the only way to get a shorter string is to use variable block length, which would require adding some control symbols to tell the decoder that the following numbers are encoded in a specific number of characters. But these symbols would have to be values of 400 or more (which the data never uses) and still 9 characters long, so I think it wouldn't help given a random distribution of data.
max 399:
2**9 = 512 is the smallest power of two that covers the 400 values 0 to 399, so each number can be stored as 9 bits;
convert each to binary, and concat
I have a string exactly 53 characters long that contains a limited set of possible characters.
[A-Za-z0-9\.\-~_+]{53}
I need to reduce this to length 50 without loss of information and using the same set of characters.
I think it should be possible to compress most strings down to 50 length, but is it possible for all possible length 53 strings? We know that in the worst case 14 characters from the possible set will be unused. Can we use this information at all?
Thanks for reading.
If, as you stated, your output strings have to use the same set of characters as the input string, and if you don't know anything special about the requirements of the input string, then no, it's not possible to compress every possible 53-character string down to 50 characters. This is a simple application of the pigeonhole principle.
Your input strings can be represented as a 53-digit number in base 67, i.e., an integer from 0 to 67^53 - 1 ≈ 6 × 10^96.
You want to map those numbers to an integer from 0 to 67^50 - 1 ≈ 2 × 10^91.
So by the pigeonhole principle, you're guaranteed that an average of 67^3 = 300,763 different inputs will map to each possible output -- which means that, when you go to decompress, you have no way to know which of those 300,763 originals you're supposed to map back to.
To make this work, you have to change your requirements. You could use a larger set of characters to encode the output (you could get it down to 50 characters if each one had 87 possible values, instead of the 67 in the input). Or you could identify redundancy in the input -- perhaps the first character can only be a '3' or a '5', the nineteenth and twentieth are a state abbreviation that can only have 62 different possible values, that sort of thing.
If you can't do either of those things, you'll have to use a compression algorithm, like Huffman coding, and accept the fact that some strings will be compressible (and get shorter) and others will not (and will get longer).
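The counting argument above can be checked with exact integer arithmetic. This is a small sketch using JavaScript `BigInt`; the helper `minAlphabetSize` is a hypothetical name and simply searches upward for the smallest alphabet whose 50-character strings cover all the inputs.

```javascript
const ALPHABET = 67n;              // size of [A-Za-z0-9.\-~_+]
const inputs = ALPHABET ** 53n;    // number of distinct 53-character strings
const outputs = ALPHABET ** 50n;   // number of distinct 50-character strings

// Average number of inputs that must share each output.
const inputsPerOutput = inputs / outputs;

// Smallest alphabet size whose strings of `length` characters cover `inputCount`.
function minAlphabetSize(inputCount, length) {
  let size = 2n;
  while (size ** length < inputCount) size += 1n;
  return size;
}

console.log(inputsPerOutput);              // 300763n
console.log(minAlphabetSize(inputs, 50n)); // 87n
```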
What you ask is not possible in the most general case, which can be proven very simply.
Say it was possible to encode an arbitrary 53 character string to 50 chars in the same set. Do that, then add three random characters to the encoded string. Then you have another arbitrary, 53 character string. How do you compress that?
So what you want can not be guaranteed to work for any possible data. However, it is possible that all your real data has low enough entropy that you can devise a scheme that will work.
In that case, you will probably want to do some variant of Huffman coding, which basically allocates variable-bit-length encodings for the characters in your set, using the shortest encodings for the most commonly used characters. You can analyze all your data to come up with a set of encodings. After Huffman coding, your string will be a (hopefully shorter) bitstream, which you encode to your character set at 6 bits per character. It may be short enough for all your real data.
A library-based encoding like Smaz (referenced in another answer) may work as well. Again, it is impossible to guarantee that it will work for all possible data.
One byte (character) can encode 256 values (0-255) but your set of valid characters uses only 67 values, which can be represented in 7 bits (alas, 6 bits gets you only 64) and none of your characters uses the high bit of the byte.
Given that, you can throw away the high bit and store only 7 bits, running the initial bits of the next character into the "spare" space of the first character. This would require only 47 bytes of space to store (53 x 7 = 371 bits; 371 / 8 = 46.375, which rounds up to 47).
This is not really considered compression, but rather a change in encoding.
For example "ABC" is 0x41 0x42 0x43
0x41 0x42 0x43 // hex values
0100 0001 0100 0010 0100 0011 // binary
100 0001 100 0010 100 0011 // drop high bit
// run it all together
100000110000101000011
// split as 8 bits (and pad to 8)
10000011 00001010 00011[000]
0x83 0x0A 0x18
As an example these 3 characters won't save any space, but your 53 characters will always come out as 47, guaranteed.
Note, however, that the output will not be in your original character set, if that is important to you.
The process becomes:
original-text --> encode --> store output-text (in database?)
retrieve --> decode --> original-text restored
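The encode and decode steps could look like the sketch below. The names `pack7` and `unpack7` are hypothetical, and the decoder assumes the character count (53 in the question) is known in advance so that padding bits are never mistaken for a character.

```javascript
// Pack 7-bit ASCII characters tightly, 8 output bits at a time.
function pack7(text) {
  const out = [];
  let acc = 0;   // bits not yet written
  let nbits = 0; // how many bits are in acc
  for (const ch of text) {
    const code = ch.charCodeAt(0);
    if (code > 0x7f) throw new RangeError("only 7-bit characters are supported");
    acc = (acc << 7) | code;
    nbits += 7;
    while (nbits >= 8) {
      nbits -= 8;
      out.push((acc >> nbits) & 0xff);
      acc &= (1 << nbits) - 1; // keep only the leftover bits
    }
  }
  if (nbits > 0) out.push((acc << (8 - nbits)) & 0xff); // zero-pad the last byte
  return Uint8Array.from(out);
}

// Unpack `count` characters from bytes produced by pack7.
function unpack7(bytes, count) {
  let acc = 0;
  let nbits = 0;
  let text = "";
  for (const b of bytes) {
    acc = (acc << 8) | b;
    nbits += 8;
    while (nbits >= 7 && text.length < count) {
      nbits -= 7;
      text += String.fromCharCode((acc >> nbits) & 0x7f);
      acc &= (1 << nbits) - 1;
    }
  }
  return text;
}

console.log(pack7("ABC"));                   // bytes 0x83 0x0A 0x18
console.log(unpack7(pack7("ABC"), 3));       // "ABC"
console.log(pack7("A".repeat(53)).length);   // 47
```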
If I remember correctly, Huffman coding is going to be the most compact way to store the data. It has been too long since I used it to write the algorithm quickly, but if I remember correctly, what you do is:
get the count for each character that is used
prioritize them based on how frequently they occurred
build a tree based off the prioritization
get the compressed bit representation of each character by traversing the tree (start at the root, left = 0 right = 1)
replace each character with the bits from the tree
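The steps above can be sketched as follows. This is a minimal illustration rather than a tuned implementation: the names `buildHuffmanCodes` and `huffmanEncode` are hypothetical, the tree is built by re-sorting an array instead of using a priority queue, and the output is a string of "0" and "1" characters rather than packed bits.

```javascript
// Build a map from character to its Huffman bit string.
function buildHuffmanCodes(text) {
  // Count how often each character occurs.
  const counts = new Map();
  for (const ch of text) counts.set(ch, (counts.get(ch) || 0) + 1);

  // Repeatedly merge the two least frequent nodes into one parent.
  const nodes = [...counts].map(([symbol, weight]) => ({ symbol, weight }));
  while (nodes.length > 1) {
    nodes.sort((a, b) => a.weight - b.weight);
    const [left, right] = nodes.splice(0, 2);
    nodes.push({ left, right, weight: left.weight + right.weight });
  }

  // Walk from the root: left appends "0", right appends "1".
  const codes = new Map();
  const walk = (node, prefix) => {
    if (node.symbol !== undefined) {
      codes.set(node.symbol, prefix || "0"); // a lone symbol still needs one bit
      return;
    }
    walk(node.left, prefix + "0");
    walk(node.right, prefix + "1");
  };
  if (nodes.length === 1) walk(nodes[0], "");
  return codes;
}

// Replace each character with its bits from the tree.
function huffmanEncode(text, codes) {
  return [...text].map((ch) => codes.get(ch)).join("");
}

const codes = buildHuffmanCodes("aaaabbc");
console.log(codes);
console.log(huffmanEncode("aaaabbc", codes).length); // total bits used
```

The decoder needs the same code table (or the character counts) to rebuild the tree, and that overhead matters for strings as short as 53 characters.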
Smaz is a simple compression library suitable for compressing very short strings.
I'm generating QR codes using strings that could very easily be longer than a QR code can handle. I'm looking for suggestions on algorithms to encode these strings as small as possible, or a proof that the string cannot be shrunk any further.
Since I'm encoding a series of items, I can represent them using ID's and delineate them using pipes as in the following lookup table:
function encodeLookUp(character){
switch(character){
case '0': return '0000';
case '1': return '0001';
case '2': return '0010';
case '3': return '0011';
case '4': return '0100';
case '5': return '0101';
case '6': return '0110';
case '7': return '0111';
case '8': return '1000';
case '9': return '1001';
case '|': return '1010';
case ':': return '1011';
}
return false;
}
Using this table I am already doing a base 16 encoding, so each 8-bit ASCII character from the original string becomes half a character (4 bits) in the new string, effectively halving the length.
Starting String: 01251548|4654654:4465464 // ID1 | ID2 : ID3 demonstrates both pipes.
Bit String: 000000010010010100010101010010001010010001100101010001100101010010110100010001100101010001100100
Result String: %H¤eFT´FTd // Half the length of the starting string.
This new ASCII string is then translated according to the QR code specification.
EDIT: The largest number of characters currently encodable: 384
CLARIFICATION: Both the numeric length of each ID and the quantity of IDs or pipes are variable, with a tendency towards one. I am looking to reduce this algorithm so that the 'result string' contains, on average, the fewest characters.
NOTE: The result string is only an ASCII representation of the binary string I've encoded with the data, to conform with standard QR code specifications and readers.
If you have relatively non-random data, a Huffman encoding might be a good solution.
Using the function, you're going to lose a lot of space, since 4 bits can encode 16 combinations and you only use 12.
I'd start by looking at the maximum length possible for your IDs and find a suitable storage block.
If you are storing these items serially in a fixed count (say, 4 IDs), you would need id_length*id_count at most, and you won't need to use any separators.
Edit: Again, depending on the number of IDs you want to write and their expected maximum length, there may be different types of encodings to compress it down. RLE (run length encoding) came to my mind.
QR codes support a binary mode, and that's going to be the most efficient way for you to store your IDs. Either:
Pick a length (in bytes) that is sufficient to store all your IDs, and encode the QR-code as a series of fixed-length integers. 4 bytes (32 bits) is a standard choice that ought to cover the likely range, or
If you want to be able to encode a wide range of IDs, but expect most of the values to be small, use a variable-length encoding scheme. One example is to use the lowest 7 bits of each byte to store the integer, and the most significant bit to indicate if there are any further bytes.
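The variable-length scheme in the second option can be sketched as below. The names `encodeVarint` and `decodeVarint` are hypothetical, and emitting the least significant 7-bit group first is an assumption (the answer does not fix a byte order); it matches the common LEB128-style layout.

```javascript
// Encode a non-negative integer 7 bits at a time, low group first.
// The top bit of each byte is set when more bytes follow.
function encodeVarint(n) {
  if (!Number.isSafeInteger(n) || n < 0) {
    throw new RangeError("expected a non-negative safe integer");
  }
  const bytes = [];
  do {
    let byte = n % 128;
    n = Math.floor(n / 128);
    if (n > 0) byte |= 0x80; // continuation flag
    bytes.push(byte);
  } while (n > 0);
  return bytes;
}

// Decode one integer starting at `start`; returns the value and the next offset.
function decodeVarint(bytes, start = 0) {
  let value = 0;
  let scale = 1;
  let pos = start;
  while (true) {
    if (pos >= bytes.length) throw new RangeError("truncated varint");
    const byte = bytes[pos++];
    value += (byte & 0x7f) * scale;
    scale *= 128;
    if ((byte & 0x80) === 0) return { value, next: pos };
  }
}

console.log(encodeVarint(5));            // [5]
console.log(encodeVarint(300));          // [172, 2]
console.log(decodeVarint([172, 2]).value); // 300
```

IDs below 128 cost one byte each, and because every encoded number marks its own end, no separator is needed between consecutive IDs.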
Also note that QR codes can be a lot larger than 384 characters!
Edit: From your original question, though, it looks like you're encoding more than just a series of integers - you have at least two different types of delimiters. Where can they appear and in what circumstances? The encoding format is going to depend on those parameters.
QR codes already have special encoding modes that are optimized for digits, or just alphanumeric data. It would probably be easier to take advantage of these rather than invent a scheme.
If you're going to do something custom, I think you'll find it hard to beat something like gzip compression. Just gzip the bytes, encode the bytes in byte mode, and decompress on the other end.
As a start of an answer to my own question:
If I start with a string of numbers
I can parse that string for patterns and hold those patterns in special symbols that take up the other 4 spaces available in my Huffman tree.
EDIT: Example: starting string 12222345, ending string 12x345, where x is a symbol that means 'repeat the last symbol 3 more times'.
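One way to use the four spare codes is sketched below. The assignment of the letters w, x, y and z to "repeat the last symbol 2, 3, 4 or 5 more times" is an assumption made for illustration (only x comes from the example above), as are the names `encodeRuns` and `decodeRuns`. The sketch assumes the input contains only digits, '|' and ':'.

```javascript
// Hypothetical meaning of the four spare 4-bit codes.
const REPEAT_COUNT = new Map([["w", 2], ["x", 3], ["y", 4], ["z", 5]]);
const REPEAT_SYMBOL = new Map([[2, "w"], [3, "x"], [4, "y"], [5, "z"]]);

// Replace runs of a repeated symbol with the symbol plus repeat markers.
function encodeRuns(text) {
  let out = "";
  let i = 0;
  while (i < text.length) {
    const ch = text[i];
    let run = 1;
    while (text[i + run] === ch) run++;
    out += ch;
    let remaining = run - 1;              // copies still to be represented
    while (remaining >= 2) {
      const k = Math.min(remaining, 5);
      out += REPEAT_SYMBOL.get(k);
      remaining -= k;
    }
    if (remaining === 1) out += ch;       // a single leftover copy is written literally
    i += run;
  }
  return out;
}

// Expand repeat markers back into copies of the last literal symbol.
function decodeRuns(encoded) {
  let out = "";
  let last = "";
  for (const ch of encoded) {
    if (REPEAT_COUNT.has(ch)) {
      out += last.repeat(REPEAT_COUNT.get(ch));
    } else {
      out += ch;
      last = ch;
    }
  }
  return out;
}

console.log(encodeRuns("12222345")); // "12x345"
console.log(decodeRuns("12x345"));   // "12222345"
```

This only pays off when IDs actually contain runs of three or more identical digits; otherwise the output is the same length as the input.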