Compressing Guid based string to write on QR Code - javascript

I am developing a phonegap application in html5/javascript. I have a string of around 1000 characters comprising of guids in below format
1=0a0a8907-40b9-4e81-8c4d-d01af26efb78;2=cd4713339;3=Cjdnd;4=19120581-21e5-42b9-b85f-3b8c5b1206d9;5=hdhsfsdfsd;6=30a21580-48f3-40e8-87a3-fa6e39e6412f; ...............
I have to write this particular string into a QR code. Is there any working technique to compress this string and store in QR code. The QR generated by this string is too complex and is not easily read by the QR scanner of mobile phones. Pls suggest the approach to reduce the size of string to around 200-250 character which can be easily read.
Any help is appreciated.

In your question you have the following sample data:
1=0a0a8907-40b9-4e81-8c4d-d01af26efb78;2=cd4713339;3=Cjdnd;
4=19120581-21e5-42b9-b85f-3b8c5b1206d9;5=hdhsfsdfsd;6=30a21
580-48f3-40e8-87a3-fa6e39e6412f; ..............
Where 1, 4 & 6 looks like version 4 UUIDs as described here. I suspect that 2, 3 and 5 might also actually be UUIDs?!
The binary representation of a UUIDs are 128 bits long, and they should be fairly simple to convert to this representation by just reading the hex digits of the UUIDs and convert to binary. This gives 16 bytes per UUID.
However - as the UUID's are version 4, they are based on random data, that in effect counter further compression (appart from the few bits representing the UUID version). So apart from getting rid of the counters (1=, 2=) and the seperater: ;, no further compression seem to be possible.

QR codes encode data using different character sets depending on the range of characters being used. IOW, if you use just ascii digits it will use an encoding that doesn't use 8 bits per digit. See the wikipedia page on QR codes.
Because of the characters in your example, e.g., lower case, you'll be using a binary encoding which is way overkill for your actual information content.
Presuming you have control over the decoder, you could use any compression library to take your ascii data and compress it before encoding, encode/decode the binary result, and then decompress it in the decoder. There are a world of techniques for trying to get the most out of the compression. You can also start with a non-ascii encoding and elminate redudant information like the #= parts.
Couldn't say, though, how much this will buy you.

If you have access to a database already, can you create a table to support this? If so, archive the value and use an ID for QR.
1) Simple schema: ID = bigint with Identity (1000,1) and set as primary key, Value = NVARCHAR(MAX). Yes this is a bit overkill, so modify to taste.
2) Create a function to add your string value to the table and get the ID back as a string for the QR code.
3) Create another function to return the string value when passed a valid ID number.
Stays below the 200 character limit for a very long time.

You don't need the whole guid; that could eliminate all but one record out of 2^128 records (enough to address every bit of digital information on earth many times over).
How many records do you need to eliminate? Probably a lot less than 4 billion right? That's 2^32, so just take the first 1/4 of the guid and there's your 1000 characters to 250.

Related

How much data does a string data type take up?

Very simple question, how much data (bytes) do strings take up? Do they take up 1 byte per character?
I tried searching it up, but ws schools doesn't say...
I want to know this to reduce bandwidth in my web app.
Also, for anyone that knows, does socket.io automatically json stringify when using socket.emit();?
String is a character array. So, it will take up roughly sizeof(char) * noOfCharacters ignoring other fields in String class for now. Character can be of 1 byte or 2 bytes depending upon the system, the type of chars being represented- unicode etc.
However, from your question, you are more interested in data being transported over the network. Note that data is always exchanged in bytes (byte[]) and thus string will be converted into byte[] representation first and then ported over.
To limit the bandwidth usage, you can enable compression, choose interoperable serialisation technique(protobuf, smile, fastinfoset etc)

Encode and compress a string with JavaScript

I'm working on a client side app. Users can select a few widgets on the page and share their selection with friends by sending them the URL of the page. I'm planning on saving the user's widget selections via a query string. I'd like the URL to be as small as possible so that it's easier for people to share.
Now to my question. I have a string of characters (8) that I'd like to encode so that output of the encoding is significantly smaller. I realize that 8 characters isn't very big but it's got potential to get larger in the future.
//using hex encoding results in a saving of 1 character
(98765432).toString(16) //"5e30a78"
example.com?q=98765432 vs example.com?q=5e30a78
Ideally I'd like the new string to be 4 characters or less. What are my options for encoding a string that will be used in URLs?
I've looked at this question: How can I quickly encode and then compress a short string containing numbers in c# but the encoded string is still too long.
Short tale about compression:
Let's say that you have an alphabet A and you have a set of words W(A) in alphabet A. Consider function
f: W(A) -> W(A)
which takes a word w and maps it into a word f(w) in the same alphabet.
Now it can be shown that if this function is invertible and there is a word w1 such that
length(f(w1)) < length(w1)
(i.e. we've compressed the word) then there exists a word w2 such that the opposite holds
length(f(w2)) > length(w2)
So this means that every compression method you've ever heard of is actually an illusion. For every method there is a file that will be larger after compression. It works because compression methods make assumptions about initial files. For example that these are words written in natural language. They are optimized for such cases and fail for other cases like whitenoise.
Back to your problem. If you wish to compress [a-zA-Z0-9] words onto itself and all cases are possible then you are doomed.
But there are at least two things you can think about:
Find most common [a-zA-Z0-9] words and map them onto small words. For example you found out that the case example.com?q=98765432 is most common among your users. Then you will map it to example.com?c=1 (note the parameter change). You will need a dictionary for such mappings. Of course for same rare cases you will end up with larger url, e.g. example.com?q=abcd will be mapped to example.com?c=abcdefgh unfortunately.
Restrict your input alphabet and enlarge your output alphabet. The bigger the difference, the bigger real compression is possible. Note that unfortunately there is a quite low upper limit for the alphabet used in URLs, namely 128 (ascii characters). For example if you have alphabet A={1,2} and B={1,2,3,4,5,6} then you can map 1~1, 2~2, 11~3, 12~4, 21~5, 22~6 which basically means that every word in A can be written in B in such a way that you reduce the size by half.

Javascript - Alternative to lzw compression for Database entry

I have strings (about 1-5Kb) of the form:
FF,A3V,X7Y,aA4,....
lzw compresses these really nicely, but includes Turkish characters. These are then submitted to a MySQL database.
Sometimes MySQL can 'play-up' and not submit these properly, putting question marks '?' in place of the Turkish characters. They can do this even when you have your text areas properly defined. Exporting and reimporting the table can sort this out. This is fine for my test database, but not something I am happy with when this goes live.
Consequently I am looking for an alternative to lzw, which will compress but only using normal letters/numbers etc.
Does anyone know of a PUBLIC DOMAIN compression method that avoid Turkish Characters (and any other non-standard characters)? Can anyone point me to some code in javascript (or c++ or c# which I can convert)?
To expand a bit on what's been said in the comments... Storing strings of bytes, such as the output from a compression algorithm typically contains, in a VARCHAR or CHAR or TEXT column is not valid usage.
These column types are not for byte strings, they are for strings of valid characters only. Not every string of bytes contains valid strings of characters in any given character set... and MySQL isn't going to allow invalid characters (which, for some character sets, the correlation between "character" and "byte" isn't 1:1).
In the good ol' days™, the two were interchangeable but this is not the case any more (and hasn't been, to one degree or another, for a while).
If your column type, instead, were BINARY or VARBINARY or BLOB, the issue should disappear, because those data types are for binary data.

Character Encoding: â?

I am trying to piece together the mysterious string of characters â?? I am seeing quite a bit of in our database - I am fairly sure this is a result of conversion between character encodings, but I am not completely positive.
The users are able to enter text (or cut and paste) into a Ext-Js rich text editor. The data is posted to a severlet which persists it to the database, and when I view it in the database i see those strange characters...
is there any way to decode these back to their original meaning, if I was able to discover the correct encoding - or is there a loss of bits or bytes that has occured through the conversion process?
Users are cutting and pasting from multiple versions of MS Word and PDF. Does the encoding follow where the user copied from?
Thank you
website is UTF-8
We are using ms sql server 2005;
SELECT serverproperty('Collation') -- Server default collation.
Latin1_General_CI_AS
SELECT databasepropertyex('xxxx', 'Collation') -- Database default
SQL_Latin1_General_CP1_CI_AS
and the column:
Column_name Type Computed Length Prec Scale Nullable TrimTrailingBlanks FixedLenNullInSource Collation
text varchar no -1 yes no yes SQL_Latin1_General_CP1_CI_AS
The non-Unicode equivalents of the
nchar, nvarchar, and ntext data types
in SQL Server 2000 are listed below.
When Unicode data is inserted into one
of these non-Unicode data type columns
through a command string (otherwise
known as a "language event"), SQL
Server converts the data to the data
type using the code page associated
with the collation of the column. When
a character cannot be represented on a
code page, it is replaced by a
question mark (?), indicating the data
has been lost. Appearance of
unexpected characters or question
marks in your data indicates your data
has been converted from Unicode to
non-Unicode at some layer, and this
conversion resulted in lost
characters.
So this may be the root cause of the problem... and not an easy one to solve on our end.
â is encoded as 0xE2 in ISO-8859-1 and windows-1252. 0xE2 is also a lead byte for a three-byte sequence in UTF-8. (Specifically, for the range U+2000 to U+2FFF, which includes the windows-1252 characters –—‘’‚“”„†‡•…‰‹›€™).
So it looks like you have text encoded in UTF-8 that's getting misinterpreted as being in windows-1252, and displays as a â followed by two unprintable characters.
This is an something of an educated guess that you're just experiencing a naive conversion of Word/PDF documents to HTML. (windows-1252 to utf8 most likely) If that's the case probably 2/3 of the mysterious characters from Word documents are "smart quotes" and most of the rest are a result of their other "smart" editing features, elipsis, em dashes, etc. PDF's probably have similar features.
I would also guess that if the formatting after pasting into the ExtJS editor looks OK, then the encoding is getting passed along. Depending on the resulting use of the text, you may not need to convert.
If I'm still on base, and we're not talking about internationalization issues, then I can add that there are Word to HTML converters out there, but I don't know the details of how they operate, and I had mixed success when evaluating them. There is almost certainly some small information loss/error involved with such converters, since they need to make guesses about the original source of the "smart" characters. In my isolated case it was easier to just go back to the users and have them turn off the "smart" features.
The issue is clear: if the browser is good enough, a form in a web page can accept any Unicode character you can type or paste. If the character belongs to the HTML charset, it will be sent as is. If it doesn't, it'll get converted to an HTML entity. SQL Server will perform the appropriate conversion and silently corrupt your data when a character does not have an equivalent.
There's not much you can do to fully fix it but you can make a workaround: let your servlet perform the conversion. This way you have full control about it. You can, for instance, compile a list of the most common non-Latin1 characters users paste (smart quotes, unicode spaces...), which should be fairly easy to identify from context, and replace them with something else better that ?. Or you use a library that makes this for you.
Or you can switch your DB to Unicode :)
you're storing unicode data that uses 2 bytes per charcter into a varchar type columns that uses 1 byte per character. any text that uses 2 bytes per chars will have 1 byte lost when stored in the db.
all you need to do is change varchar column to nvarchar.
and then change sql parameters you're using in code of course.

Javascript client-data compression

I am trying to develop a paint brush application thru processingjs.
This API has function loadPixels() that will load the RGB values in to the array.
Now i want to store the array in the server db.
The problem is the size of the array, when i convert to a string the size is 5 MB.
Is the best solution is to do compression at javascript level? How to do it?
See http://rosettacode.org/wiki/LZW_compression#JavaScript for an LZW compression example. It works best on longer strings with repeated patterns.
From the Wikipedia article on LZW:
A dictionary is initialized to contain
the single-character strings
corresponding to all the possible
input characters (and nothing else
except the clear and stop codes if
they're being used). The algorithm
works by scanning through the input
string for successively longer
substrings until it finds one that is
not in the dictionary. When such a
string is found, the index for the
string less the last character (i.e.,
the longest substring that is in the
dictionary) is retrieved from the
dictionary and sent to output, and the
new string (including the last
character) is added to the dictionary
with the next available code. The last
input character is then used as the
next starting point to scan for
substrings.
In this way, successively longer
strings are registered in the
dictionary and made available for
subsequent encoding as single output
values. The algorithm works best on
data with repeated patterns, so the
initial parts of a message will see
little compression. As the message
grows, however, the compression ratio
tends asymptotically to the
maximum.
JavaScript implementation of Gzip has a couple answers that are relevant.
Also, Javascript LZW and Huffman Coding with PHP and JavaScript are other implementations I found.

Categories

Resources