Javascript - Alternative to lzw compression for Database entry


I have strings (about 1-5Kb) of the form:
FF,A3V,X7Y,aA4,....
LZW compresses these really nicely, but the compressed output includes Turkish characters. These are then submitted to a MySQL database.
Sometimes MySQL can 'play-up' and not submit these properly, putting question marks '?' in place of the Turkish characters. They can do this even when you have your text areas properly defined. Exporting and reimporting the table can sort this out. This is fine for my test database, but not something I am happy with when this goes live.
Consequently I am looking for an alternative to lzw, which will compress but only using normal letters/numbers etc.
Does anyone know of a PUBLIC DOMAIN compression method that avoids Turkish characters (and any other non-standard characters)? Can anyone point me to some code in JavaScript (or C++ or C#, which I can convert)?

To expand a bit on what's been said in the comments... Storing strings of bytes, such as the output of a compression algorithm, in a VARCHAR, CHAR, or TEXT column is not valid usage.
These column types are not for byte strings; they are for strings of valid characters only. Not every string of bytes is a valid string of characters in a given character set, and MySQL isn't going to allow invalid characters (for some character sets, the correlation between "character" and "byte" isn't even 1:1).
In the good ol' days™, the two were interchangeable but this is not the case any more (and hasn't been, to one degree or another, for a while).
If your column type, instead, were BINARY or VARBINARY or BLOB, the issue should disappear, because those data types are for binary data.
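If changing the column type is not an option, another route is to keep the compressed data pure ASCII before it reaches MySQL. Here is a minimal sketch, assuming the LZW step yields an array of integer codes and that the dictionary is capped at 4096 entries, so every code fits in 12 bits. The names codesToAscii and asciiToCodes are made up for illustration:

```javascript
// Base64 alphabet: 64 plain ASCII symbols, 6 bits each.
const B64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Write each 12-bit LZW code as exactly two ASCII characters.
function codesToAscii(codes) {
  let out = "";
  for (const code of codes) {
    if (!Number.isInteger(code) || code < 0 || code > 4095) {
      throw new RangeError("code does not fit in 12 bits: " + code);
    }
    out += B64_ALPHABET[code >> 6] + B64_ALPHABET[code & 63];
  }
  return out;
}

// Reverse step: every pair of characters becomes one 12-bit code again.
function asciiToCodes(text) {
  const codes = [];
  for (let i = 0; i < text.length; i += 2) {
    const hi = B64_ALPHABET.indexOf(text[i]);
    const lo = B64_ALPHABET.indexOf(text[i + 1]);
    if (hi < 0 || lo < 0) throw new Error("not a valid encoded string");
    codes.push((hi << 6) | lo);
  }
  return codes;
}
```

The result costs two characters per code, but it survives any text column and any connection character set, since it never leaves the ASCII range.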

Related

How viable is base128 encoding for scenarios like JavaScript strings?

I recently found that base32, base64 and base128 are the most efficient forms of base-n encoding, and that while base58, Ascii85, base91, base92 et al do provide some efficiency improvements over the ubiquitous base64 due to their use of more characters, there are some mapping losses; for example, there happen to be 272 indices per character-pair in base92 that are impossible to map to from base-10 powers of 2 and are thus completely wasted. (Base91 encoding only has a similar loss of 89 characters (as found by the script in the link above) but it's patented.)
It would be great if it were viable to use base128 in modern-day real-world scenarios.
There are 92 characters available within 0x21 (33) to 0x7E (126) sans \ and ", which make for a great start to creating JSONifiable strings with the most characters possible.
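The count is easy to check. A small sketch (the variable names are mine) that builds that alphabet and confirms JSON adds nothing beyond the two surrounding quotes:

```javascript
// Collect printable ASCII 0x21..0x7E, skipping the two characters
// that JSON has to escape inside a string: the double quote and the backslash.
let jsonSafeAlphabet = "";
for (let code = 0x21; code <= 0x7e; code++) {
  const ch = String.fromCharCode(code);
  if (ch !== '"' && ch !== "\\") jsonSafeAlphabet += ch;
}

// Serialized length minus raw length should be just the two enclosing quotes.
const jsonOverhead =
  JSON.stringify(jsonSafeAlphabet).length - jsonSafeAlphabet.length;

console.log(jsonSafeAlphabet.length, jsonOverhead); // 92 2
```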
Here are a few ways I envisage the rest of the characters could be found. This is the question I'm asking.
Just dumbly use Unicode
Two-byte Unicode characters could be used to fill in the remaining 36 required indices. Highly suboptimal; I wouldn't be surprised if this was worse than base64 on the wire. Would only be useful for Unicode character counting scenarios like tweet length. Not exactly what I'm going for.
Select 36 non-Unicode characters from within the upper (>128) ASCII range
JavaScript was built with the expectation that character encoding configuration will occasionally go horribly wrong. So the language (and web browsers) handle printing arbitrary and unprintable binary data just fine. So why not just use the upper ASCII range? It's there to be used, right?
One very real problem could be data going over HTTP and falling through one or more proxies on the way between my browser and the server. How badly could this go? I'm aware that WebSockets over HTTP caused some real pain a couple years ago, and potentially even today.
Kind of use UTF-8 in interesting ways
UTF-8 defines 1- to 4-byte long sequences to encapsulate Unicode codepoints. Bytes 2 to 4 always start with 10xxxxxx. There are 64 characters within that range. If I pass through a naïve proxy that filters characters outside the Unicode range on a character-by-character basis, using bytes within this range might mean my data would get through unscathed!
Determine 36 magic bytes that will work for various esoteric reasons
Maybe there are some high ASCII characters that will successfully traverse >99% of the Internet infrastructure for various historical or implementational reasons. What characters might these be?
Base64 is ubiquitous and has wound up being used everywhere, and it's easy to understand why: it was defined in 1987 to use a carefully-chosen, very restricted alphabet of A-Z, a-z, 0-9, + and / that was (and remains) difficult for most environments (such as mainframes using non-ASCII encoding) to have problems with.
EBCDIC mainframes and MIME email are still very much out there, but today base64 has also wound up as a heavily-used pipe within JavaScript to handle the case of "something in this data path might choke on binary", and the collective overhead it adds is nontrivial.
There's currently only one other question on SO regarding the general viability of base128 encoding, and literally every single answer has one or more issues. The accepted answer suggests that base128 must exactly use the first 128 characters of ASCII, and the only answer that acknowledges that the encoded alphabet can use any characters proceeds to claim that base128 is not in use because the encoded characters must be easily retypeable (which base58 is optimized for, FWIW). All the others have various problems (which I can explain further if desired).
This question is an attempt to re-ask the above with some additional unambiguous subject clarification, in the hope that a concrete go/no-go can be determined.

It's viable in the sense of being technically possible, but it's not viable in the sense of being able to achieve a result better than a much simpler alternative: using HTTP gzip compression. In practice if compression is enabled, the Huffman encoding of the strings will negate the 1/3 increase in size from base64 encoding because each character in the base64 string has only 6 bits of entropy.
As a test, I generated a 1 MB file of random data using a utility like Dummy File Creator, then base64 encoded it and gzipped the resulting file using 7zip.
Original data: 1,048,576 bytes
Base64 encoded data: 1,398,104 bytes
Gzipped base64 encoded data: 1,060,329 bytes
That's only a 1.12% increase in size (and the overhead of encoding -> compressing -> decompressing -> decoding).
Base128 encoding would take 1,198,373 bytes, so you'd have to compress it too if you wanted comparable file size. Gzip compression is a standard feature in all modern browsers so what's the case for base128 and all the extra complexity that would entail?
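The two encoded sizes quoted above follow directly from the bits carried per output character. A quick sketch (function names are mine; base128Size assumes an ideal packing of 7 bits per output byte with no padding or header):

```javascript
// Base64 emits 4 characters for every 3 input bytes, padded to a full group.
function base64Size(byteCount) {
  return 4 * Math.ceil(byteCount / 3);
}

// An ideal base128 carries 7 bits per output byte, so 8 input bits cost 8/7 bytes.
function base128Size(byteCount) {
  return Math.ceil((byteCount * 8) / 7);
}

const original = 1048576;
console.log(base64Size(original));  // 1398104
console.log(base128Size(original)); // 1198373
```

The gzipped figure cannot be derived this way; it has to be measured on actual data.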

Select 36 non-Unicode characters from within the upper (>128) ASCII range
base128 is not effective because you must use characters with codes of 128 or greater. For characters with codes >= 128, Chrome sends two bytes (so a 1 MB string of such characters becomes 2 MB when sent), and you lose all the benefit. This phenomenon doesn't appear with base64 strings (so we lose only ~33%). More details here in the "update" section.
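You can see the doubling without a network trace, assuming the string is sent as UTF-8 (the usual case for request bodies). A short sketch using the standard TextEncoder; the helper name utf8Length is mine:

```javascript
// Number of bytes a string occupies once encoded as UTF-8.
function utf8Length(text) {
  return new TextEncoder().encode(text).length;
}

const asciiOnly = "A".repeat(1000);       // code 65, below 128
const upperRange = "\u00ff".repeat(1000); // code 255, above 127

console.log(utf8Length(asciiOnly));  // 1000
console.log(utf8Length(upperRange)); // 2000
```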

The reason base64 is used so much is that it encodes a binary stream using only English letters and digits.
Technically we can use higher bases but the problem with them is that they will need to fit some character set.
UTF-8 is one of the widely used charsets and if you are using XML or JSON to transmit data, you can very well use a Base256 encoding like the below
https://github.com/bharatmicrosystems/base256

Kind of use UTF-8 in interesting ways
UTF-8 defines 1- to 4-byte long sequences to encapsulate Unicode codepoints. Bytes 2 to 4 always start with 10xxxxxx. There are 64 characters within that range. If I pass through a naïve proxy that filters characters outside the Unicode range on a character-by-character basis, using bytes within this range might mean my data would get through unscathed!
This is actually quite viable and has been used in base-122. Despite the name, it's in fact base-128 because the 6 invalid values (128 – 122) are encoded specially so that a series of 14 bits can always be represented with at most 2 bytes, exactly like base-128 where 7 bits will be encoded in 1 byte, and in reality can be optimized to be more efficient than base-128.
Base-122 encoding takes chunks of seven bits of input data at a time. If the chunk maps to a legal character, it is encoded with the single byte UTF-8 character: 0xxxxxxx. If the chunk would map to an illegal character, we instead use the two-byte UTF-8 character: 110xxxxx 10xxxxxx. Since there are only six illegal code points, we can distinguish them with only three bits. Denoting these bits as sss gives us the format: 110sssxx 10xxxxxx. The remaining eight bits could seemingly encode more input data. Unfortunately, two-byte UTF-8 characters representing code points less than 0x80 are invalid. Browsers will parse invalid UTF-8 characters into error characters. A simple way of enforcing code points greater than 0x80 is to use the format 110sss1x 10xxxxxx, equivalent to a bitwise OR with 0x80 (this can likely be improved, see §4). Figure 3 summarizes the complete base-122 encoding.
§2.2 Base-122 Encoding
You can find the implementation on github
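To make the two-byte form concrete, here is a sketch of just that packing step. It is my reading of the quoted format 110sss1x 10xxxxxx, not code taken from the base-122 repository. It assumes the seven x bits carry the next 7-bit chunk of input, and the names packTwoByte and unpackTwoByte are made up:

```javascript
// illegalIndex: which of the six illegal code points was hit (0..5).
// nextSevenBits: the following 7-bit chunk of input, stored in the x bits.
function packTwoByte(illegalIndex, nextSevenBits) {
  if (illegalIndex < 0 || illegalIndex > 5) {
    throw new RangeError("illegalIndex must be 0..5");
  }
  if (nextSevenBits < 0 || nextSevenBits > 127) {
    throw new RangeError("nextSevenBits must be 0..127");
  }
  // 110sss1x: the fixed 1 bit keeps the code point at 0x80 or above.
  const b1 = 0b11000010 | (illegalIndex << 2) | (nextSevenBits >> 6);
  // 10xxxxxx: continuation byte with the low six bits of the chunk.
  const b2 = 0b10000000 | (nextSevenBits & 0b00111111);
  return [b1, b2];
}

function unpackTwoByte(b1, b2) {
  return {
    illegalIndex: (b1 >> 2) & 0b111,
    nextSevenBits: ((b1 & 1) << 6) | (b2 & 0b00111111),
  };
}
```

Every pair produced this way has a lead byte between 0xC2 and 0xD7, so it is always a well-formed two-byte UTF-8 sequence.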
The accepted answer suggests that base128 must exactly use the first 128 characters of ASCII, ...
Base-122 doesn't exactly use the first 128 ASCII characters, so it can be encoded normally in a null-terminated string. But as
... and the only answer that acknowledges that the encoded alphabet can use any characters proceeds to claim that that base128 is not in use because the encoded characters must be easily retypeable (which base58 is optimized for, FWIW)
Encodings that use non-printable characters are generally not for typing by hand but for transmission. For example, base-122 is optimized for storing binary data in JavaScript strings in a UTF-8 HTML file, which probably works best for your use case.

Encode and compress a string with JavaScript

I'm working on a client side app. Users can select a few widgets on the page and share their selection with friends by sending them the URL of the page. I'm planning on saving the user's widget selections via a query string. I'd like the URL to be as small as possible so that it's easier for people to share.
Now to my question. I have a string of characters (8) that I'd like to encode so that output of the encoding is significantly smaller. I realize that 8 characters isn't very big but it's got potential to get larger in the future.
//using hex encoding results in a saving of 1 character
(98765432).toString(16) //"5e30a78"
example.com?q=98765432 vs example.com?q=5e30a78
Ideally I'd like the new string to be 4 characters or less. What are my options for encoding a string that will be used in URLs?
I've looked at this question: How can I quickly encode and then compress a short string containing numbers in c# but the encoded string is still too long.

Short tale about compression:
Let's say that you have an alphabet A and you have a set of words W(A) in alphabet A. Consider function
f: W(A) -> W(A)
which takes a word w and maps it into a word f(w) in the same alphabet.
Now it can be shown that if this function is invertible and there is a word w1 such that
length(f(w1)) < length(w1)
(i.e. we've compressed the word) then there exists a word w2 such that the opposite holds
length(f(w2)) > length(w2)
So this means that every compression method you've ever heard of is actually an illusion. For every method there is a file that will be larger after compression. It works because compression methods make assumptions about the input files, for example that these are words written in natural language. They are optimized for such cases and fail for other cases like white noise.
Back to your problem. If you wish to compress [a-zA-Z0-9] words onto itself and all cases are possible then you are doomed.
But there are at least two things you can think about:
Find the most common [a-zA-Z0-9] words and map them onto short words. For example, say you found out that the case example.com?q=98765432 is the most common among your users. Then you will map it to example.com?c=1 (note the parameter change). You will need a dictionary for such mappings. Of course, for some rare cases you will end up with a larger URL, e.g. example.com?q=abcd will unfortunately be mapped to example.com?c=abcdefgh.
Restrict your input alphabet and enlarge your output alphabet. The bigger the difference, the bigger real compression is possible. Note that unfortunately there is a quite low upper limit for the alphabet used in URLs, namely 128 (ascii characters). For example if you have alphabet A={1,2} and B={1,2,3,4,5,6} then you can map 1~1, 2~2, 11~3, 12~4, 21~5, 22~6 which basically means that every word in A can be written in B in such a way that you reduce the size by half.
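The second idea can be sketched for the digits-only case from the question: the input alphabet is [0-9] and the output alphabet is [0-9A-Za-z], i.e. base 62. This assumes the value is a non-negative safe integer, and the function names are mine:

```javascript
const BASE62_DIGITS =
  "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

// Rewrite a non-negative integer using 62 symbols instead of 10.
function toBase62(value) {
  if (!Number.isSafeInteger(value) || value < 0) {
    throw new RangeError("expected a non-negative safe integer");
  }
  if (value === 0) return BASE62_DIGITS[0];
  let out = "";
  while (value > 0) {
    out = BASE62_DIGITS[value % 62] + out;
    value = Math.floor(value / 62);
  }
  return out;
}

function fromBase62(text) {
  let value = 0;
  for (const ch of text) {
    const digit = BASE62_DIGITS.indexOf(ch);
    if (digit < 0) throw new Error("invalid base62 character: " + ch);
    value = value * 62 + digit;
  }
  return value;
}
```

For 98765432 this gives 5 characters rather than the 4 hoped for: 62^4 is about 14.8 million, so only values below that fit in 4 characters with this alphabet.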

Character Encoding: â?

I am trying to piece together the mysterious string of characters â?? that I am seeing quite a bit of in our database. I am fairly sure this is a result of conversion between character encodings, but I am not completely positive.
The users are able to enter text (or cut and paste) into an Ext JS rich text editor. The data is posted to a servlet which persists it to the database, and when I view it in the database I see those strange characters...
Is there any way to decode these back to their original meaning, if I was able to discover the correct encoding, or is there a loss of bits or bytes that has occurred through the conversion process?
Users are cutting and pasting from multiple versions of MS Word and PDF. Does the encoding follow where the user copied from?
Thank you
website is UTF-8
We are using ms sql server 2005;
SELECT serverproperty('Collation') -- Server default collation.
Latin1_General_CI_AS
SELECT databasepropertyex('xxxx', 'Collation') -- Database default
SQL_Latin1_General_CP1_CI_AS
and the column:
Column_name:          text
Type:                 varchar
Computed:             no
Length:               -1
Prec:                 (blank)
Scale:                (blank)
Nullable:             yes
TrimTrailingBlanks:   no
FixedLenNullInSource: yes
Collation:            SQL_Latin1_General_CP1_CI_AS
The non-Unicode equivalents of the nchar, nvarchar, and ntext data types in SQL Server 2000 are listed below. When Unicode data is inserted into one of these non-Unicode data type columns through a command string (otherwise known as a "language event"), SQL Server converts the data to the data type using the code page associated with the collation of the column. When a character cannot be represented on a code page, it is replaced by a question mark (?), indicating the data has been lost. Appearance of unexpected characters or question marks in your data indicates your data has been converted from Unicode to non-Unicode at some layer, and this conversion resulted in lost characters.
So this may be the root cause of the problem... and not an easy one to solve on our end.

â is encoded as 0xE2 in ISO-8859-1 and windows-1252. 0xE2 is also a lead byte for a three-byte sequence in UTF-8. (Specifically, for the range U+2000 to U+2FFF, which includes the windows-1252 characters –—‘’‚“”„†‡•…‰‹›€™).
So it looks like you have text encoded in UTF-8 that's getting misinterpreted as being in windows-1252, and displays as a â followed by two unprintable characters.
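The effect is easy to reproduce, and to undo as long as the bytes themselves survived. A rough sketch with made-up function names: it reads each byte as one ISO-8859-1 character (windows-1252 differs for bytes 0x80 to 0x9F), and the repair only works if nothing was already replaced by a literal ? on the way into the database:

```javascript
// Simulate the bug: UTF-8 bytes interpreted one byte per character.
function misreadUtf8AsLatin1(text) {
  const bytes = new TextEncoder().encode(text);
  return Array.from(bytes, (b) => String.fromCharCode(b)).join("");
}

// Undo it: turn each character back into a byte, then decode as UTF-8.
// With fatal: true the decoder throws if the bytes are not valid UTF-8.
function repairLatin1Mojibake(garbled) {
  const bytes = Uint8Array.from(garbled, (ch) => ch.charCodeAt(0));
  return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}

const smartQuote = "\u2019"; // right single quotation mark, UTF-8 bytes E2 80 99
const garbled = misreadUtf8AsLatin1(smartQuote);
console.log(garbled.length);                               // 3
console.log(garbled[0]);                                   // â
console.log(repairLatin1Mojibake(garbled) === smartQuote); // true
```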

This is something of an educated guess that you're just experiencing a naive conversion of Word/PDF documents to HTML (windows-1252 to UTF-8, most likely). If that's the case, probably 2/3 of the mysterious characters from Word documents are "smart quotes" and most of the rest are a result of its other "smart" editing features: ellipses, em dashes, etc. PDFs probably have similar features.
I would also guess that if the formatting after pasting into the ExtJS editor looks OK, then the encoding is getting passed along. Depending on the resulting use of the text, you may not need to convert.
If I'm still on base, and we're not talking about internationalization issues, then I can add that there are Word to HTML converters out there, but I don't know the details of how they operate, and I had mixed success when evaluating them. There is almost certainly some small information loss/error involved with such converters, since they need to make guesses about the original source of the "smart" characters. In my isolated case it was easier to just go back to the users and have them turn off the "smart" features.

The issue is clear: if the browser is good enough, a form in a web page can accept any Unicode character you can type or paste. If the character belongs to the HTML charset, it will be sent as is. If it doesn't, it'll get converted to an HTML entity. SQL Server will perform the appropriate conversion and silently corrupt your data when a character does not have an equivalent.
There's not much you can do to fully fix it, but you can make a workaround: let your servlet perform the conversion. This way you have full control over it. You can, for instance, compile a list of the most common non-Latin1 characters users paste (smart quotes, Unicode spaces...), which should be fairly easy to identify from context, and replace them with something better than ?. Or you can use a library that does this for you.
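The replacement-list idea might look roughly like this. It is shown in JavaScript for illustration only (in the servlet it would be the equivalent Java), the list of characters is a guess at what users typically paste, and it assumes the column can only hold ISO-8859-1:

```javascript
// Hypothetical replacement table for common "smart" characters.
const LATIN1_REPLACEMENTS = new Map([
  ["\u2018", "'"],   // left single quote
  ["\u2019", "'"],   // right single quote
  ["\u201C", '"'],   // left double quote
  ["\u201D", '"'],   // right double quote
  ["\u2013", "-"],   // en dash
  ["\u2014", "--"],  // em dash
  ["\u2026", "..."], // ellipsis
]);

// Replace known characters, keep anything Latin-1 can hold,
// and fall back to "?" explicitly instead of letting the database do it.
function toLatin1Safe(text) {
  let out = "";
  for (const ch of text) {
    if (LATIN1_REPLACEMENTS.has(ch)) out += LATIN1_REPLACEMENTS.get(ch);
    else if (ch.codePointAt(0) <= 0xff) out += ch;
    else out += "?";
  }
  return out;
}
```

Doing this in your own code at least makes the lossy step visible and testable.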
Or you can switch your DB to Unicode :)

You're storing Unicode data that uses 2 bytes per character in a varchar column that uses 1 byte per character. Any text that uses 2 bytes per character will have 1 byte lost when stored in the DB.
All you need to do is change the varchar column to nvarchar,
and then change the SQL parameters you're using in code, of course.

Insert EBCDIC character into javascript string

I need to create an EBCDIC string within my javascript and save it into an EBCDIC database. A process on the EBCDIC system then uses the data. I haven't had any problems until I came across the character '¬'. In EBCDIC it is hex value of 5F. All of the usual letters and symbols seem to automagically convert with no problem. Any idea how I can create the EBCDIC value for '¬' within javascript so I can store it properly in the EBCDIC db?
Thanks!

If "all of the usual letters and symbols seem to automagically convert", then I very strongly suspect that you do not have to create an EBCDIC string in Javascript. The character codes for Latin letters and digits are completely different in EBCDIC than they are in Unicode, so something in your server code is already converting the strings.
Thus what you need to determine is how that process works, and specifically you need to find out how the translation maps character codes from Unicode source into the EBCDIC equivalents. Once you know that, you'll know what Unicode character to use in your Javascript code.
As a further note: every single time I've been told by an IT organization that their mainframe software requires that data be supplied in EBCDIC, that advice has been dead wrong. The fact that there's some external interface means that something in the pile of iron that makes up the mainframe and its tentacles, something the IT people have forgotten about and probably couldn't find if they needed to, is already mapping "real world" character encodings like Unicode into EBCDIC. How does it work? Well, it may be impossible to figure out.
You might try whether this works: var notSign = "\u00AC";
edit: also: here's a good reference for HTML entities and Unicode glyphs: http://www.elizabethcastro.com/html/extras/entities.html The HTML/XML syntax uses decimal numbers for the character codes. For Javascript, you have to convert those to hex, and the notation in Javascript strings is "\u" followed by a 4-digit hex constant. (That reference isn't complete, but it's pretty easy to read and it's got lots of useful symbols.)
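The decimal-to-hex step is mechanical. A small sketch (decimalToEscape is a made-up helper, and it assumes the code is at most 65535 so that it fits the 4-digit form):

```javascript
// Turn a decimal character code, as used in HTML entities like &#172;,
// into the text of the matching JavaScript escape sequence.
function decimalToEscape(code) {
  return "\\u" + code.toString(16).toUpperCase().padStart(4, "0");
}

console.log(decimalToEscape(172));                  // \u00AC
console.log(String.fromCharCode(172) === "\u00AC"); // true
```

If the code is only needed at runtime, String.fromCharCode(172) gives the character directly with no conversion by hand.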

Javascript client-data compression

I am trying to develop a paint brush application with Processing.js.
This API has a function loadPixels() that loads the RGB values into an array.
Now I want to store the array in the server DB.
The problem is the size of the array: when I convert it to a string, the size is 5 MB.
Is the best solution to do the compression at the JavaScript level? How do I do it?

See http://rosettacode.org/wiki/LZW_compression#JavaScript for an LZW compression example. It works best on longer strings with repeated patterns.
From the Wikipedia article on LZW:
A dictionary is initialized to contain the single-character strings corresponding to all the possible input characters (and nothing else except the clear and stop codes if they're being used). The algorithm works by scanning through the input string for successively longer substrings until it finds one that is not in the dictionary. When such a string is found, the index for the string less the last character (i.e., the longest substring that is in the dictionary) is retrieved from the dictionary and sent to output, and the new string (including the last character) is added to the dictionary with the next available code. The last input character is then used as the next starting point to scan for substrings.
In this way, successively longer strings are registered in the dictionary and made available for subsequent encoding as single output values. The algorithm works best on data with repeated patterns, so the initial parts of a message will see little compression. As the message grows, however, the compression ratio tends asymptotically to the maximum.
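The description above translates fairly directly into code. This is a minimal sketch, not the Rosetta Code version: it assumes every input character has a code below 256, uses no clear or stop codes, lets the dictionary grow without limit, and returns plain integer codes rather than a packed string:

```javascript
function lzwCompress(input) {
  // Start with all single-character strings for codes 0..255.
  const dict = new Map();
  for (let i = 0; i < 256; i++) dict.set(String.fromCharCode(i), i);
  let nextCode = 256;
  let current = "";
  const output = [];
  for (const ch of input) {
    const candidate = current + ch;
    if (dict.has(candidate)) {
      current = candidate; // keep extending the match
    } else {
      output.push(dict.get(current));    // emit the longest known substring
      dict.set(candidate, nextCode++);   // register the new, longer string
      current = ch;                      // restart from the last character
    }
  }
  if (current !== "") output.push(dict.get(current));
  return output;
}

function lzwDecompress(codes) {
  if (codes.length === 0) return "";
  const dict = new Map();
  for (let i = 0; i < 256; i++) dict.set(i, String.fromCharCode(i));
  let nextCode = 256;
  let previous = dict.get(codes[0]);
  let result = previous;
  for (let i = 1; i < codes.length; i++) {
    const code = codes[i];
    let entry;
    if (dict.has(code)) entry = dict.get(code);
    // Special case: the code refers to the entry being built right now.
    else if (code === nextCode) entry = previous + previous[0];
    else throw new Error("invalid LZW code: " + code);
    result += entry;
    dict.set(nextCode++, previous + entry[0]);
    previous = entry;
  }
  return result;
}

console.log(lzwCompress("ABABABA")); // [ 65, 66, 256, 258 ]
```

For pixel data the codes still have to be serialized somehow before they go to the server, and that serialization step decides how much of the saving actually survives.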

JavaScript implementation of Gzip has a couple answers that are relevant.
Also, Javascript LZW and Huffman Coding with PHP and JavaScript are other implementations I found.
