Javascript string compression for URL hash parameter - javascript

I'm looking to store a lot of data in a URL hash parameter without exceeding URL character limits.
Are there any conventional ways of compressing string length which could be then decoded on another page load?
I've seen LZW encoding used for similar solutions; however, would special characters be valid for this use?

LZW encoding technically works; you'll just need to convert the LZW-encoded binary into URL-safe base64, so that the output doesn't contain special characters. Here's an MDN article on base64 in JavaScript; the URL-safe variant of base64 just replaces + with - and / with _. Of course, you're not likely to reduce the size of your string by much by doing this, unless the data you want to store is extremely compressible.
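As a rough sketch of that conversion (assuming the LZW step hands you a "binary string" with one character per byte; lzwCompress below is only a placeholder for whatever implementation you use):

function toUrlSafeBase64(binaryString) {
  return btoa(binaryString)
    .replace(/\+/g, '-')
    .replace(/\//g, '_')
    .replace(/=+$/, '');        // padding adds nothing and '=' is awkward in URLs
}

function fromUrlSafeBase64(encoded) {
  let b64 = encoded.replace(/-/g, '+').replace(/_/g, '/');
  while (b64.length % 4) b64 += '=';   // restore padding before atob
  return atob(b64);
}

// e.g. location.hash = '#d=' + toUrlSafeBase64(lzwCompress(stateString));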

You can look at smaz or shoco, which are designed for the compression of short strings. Most compression methods don't really get rolling until well after your URL length limit, so you need a specialized compressor for this case if you expect to get any gain. You can then encode the binary result using a scheme like Base 64 or a more efficient coding that uses all of the URI-safe characters.
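If you did want to squeeze out the last fraction of a percent with a coding that uses all of the URI-safe characters, one (deliberately unoptimized) sketch is to treat the compressed bytes as one big number and re-express it in base 66, the number of RFC 3986 unreserved characters. This uses BigInt, ignores leading zero bytes and is quadratic, so it is only an illustration of the idea:

const SAFE = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~'; // 66 unreserved chars

function encodeUriSafe(bytes) {        // bytes: Uint8Array from the compressor
  let n = 0n;
  for (const b of bytes) n = (n << 8n) | BigInt(b);
  let out = '';
  while (n > 0n) {
    out = SAFE[Number(n % 66n)] + out;
    n /= 66n;
  }
  return out || SAFE[0];
}

The gain over base64 is under 1% (log2 66 ≈ 6.04 bits per character versus 6), so in practice URL-safe base64 is usually the sensible choice.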

Related

How atob doesn't convert from Buffer with base64?

I have data that I compress using the lz-string package.
I also convert the result to base64 and use the atob() function to convert back from base64.
The problem is that atob() doesn't work as expected, but Buffer.from(b64, 'base64').toString(); does.
Why? How do I fix that? I need to use atob on the client side (Buffer does not exist in the browser).
StackBlitz example
Use decodeURIComponent and escape to decode the UTF-8 bytes into a proper string:
const non64 = decodeURIComponent(escape(window.atob( b64 )));
The more effective option (see below) would be, if your LZ library supports it, not to interpret the decoded buffer as a string at all, but to pass it to the library as a Uint8Array directly. You can do that with
const buffer = Uint8Array.from(atob(b64), c => c.charCodeAt(0))
And then if you really need a string, you can use a TextDecoder, which is a bit less hacky than Shlomi's admittedly very nice solution:
const text = new TextDecoder().decode(buffer)
There are a couple of reasons why using a TypedArray is more effective, and an implementation of LZ should really work on TypedArrays rather than strings (and probably use WebAssembly). Obviously you skip the UTF-8 decoding, but the more significant reason is that in JavaScript, strings are represented in memory as UTF-16, so each character takes at least 2 bytes (exactly 2 bytes in the case of a binary string), whereas the Uint8Array, as the name suggests, only uses one byte per item.
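A self-contained round trip (just a sketch combining the pieces above) shows the difference:

const b64 = btoa(String.fromCharCode(...new TextEncoder().encode('héllo wörld')));
console.log(atob(b64));                                         // mojibake: UTF-8 bytes read as UTF-16 code units
const buffer = Uint8Array.from(atob(b64), c => c.charCodeAt(0));
console.log(new TextDecoder().decode(buffer));                  // "héllo wörld"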

CryptoJS.enc.Base64.parse vs Base64.decodeBase64, what's the difference?

Want to understand how these two are different, or whether they are the same?
var key2 = CryptoJS.enc.Base64.parse(apiKey);
&
byte[] decodedBase64APIKeyByteArray = Base64.decodeBase64(apiKey);
I have gone through the APIs of both, and it seems like both are doing conversions, but my question is: would the conversion be the same for the same input?
Will the output of both be the same?
Both decode normal base64 with the default base64 alphabet including possible padding characters at the end.
There are a few differences however.
Documentation: The commons-codec one is at least somewhat documented.
The input: The commons-codec decoder accepts base64 containing line endings and other whitespace and strips them (required for e.g. MIME decoding). A quick look at the CryptoJS code shows that it requires base64 without whitespace. So the Java-based decoder accepts more forms of input.
The implementation: The CryptoJS parsing brings tears to my eyes, and not of joy. It has terrible performance, if only because of how it handles the base64 without streaming. It even resorts to an indexOf to look up possible padding characters up front, which is both ugly and slow. Apache's implementation is only slightly better. Both should only be used for relatively small amounts of data.
The output: CryptoJS returns a WordArray while commons-codec returns a byte array. For keys this doesn't matter much, as Java usually expects a byte array for SecretKeySpec, while CryptoJS directly uses a WordArray as the key.
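If you want to compare the two outputs directly, a sketch like the following (assuming the usual WordArray layout of big-endian 32-bit words plus a sigBytes count) flattens the CryptoJS result into plain bytes:

const wa = CryptoJS.enc.Base64.parse(apiKey);
const bytes = new Uint8Array(wa.sigBytes);
for (let i = 0; i < wa.sigBytes; i++) {
  bytes[i] = (wa.words[i >>> 2] >>> (24 - (i % 4) * 8)) & 0xff;
}
// `bytes` should now match what Base64.decodeBase64(apiKey) returns on the Java side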

How viable is base128 encoding for scenarios like JavaScript strings?

I recently found that base32, base64 and base128 are the most efficient forms of base-n encoding, and that while base58, Ascii85, base91, base92 et al do provide some efficiency improvements over the ubiquitous base64 due to their use of more characters, there are some mapping losses; for example, there happen to be 272 indices per character-pair in base92 that are impossible to map to from base-10 powers of 2 and are thus completely wasted. (Base91 encoding only has a similar loss of 89 characters (as found by the script in the link above) but it's patented.)
It would be great if it were viable to use base128 in modern-day real-world scenarios.
There are 92 characters available within 0x21 (33) to 0x7E (126) sans \ and ", which make for a great start to creating JSONifiable strings with the most characters possible.
Here are a few ways I envisage the rest of the characters could be found. This is the question I'm asking.
Just dumbly use Unicode
Two-byte Unicode characters could be used to fill in the remaining 36 required indices. Highly suboptimal; I wouldn't be surprised if this was worse than base64 on the wire. Would only be useful for Unicode character counting scenarios like tweet length. Not exactly what I'm going for.
Select 36 non-Unicode characters from within the upper (>128) ASCII range
JavaScript was built with the expectation that character encoding configuration will occasionally go horribly wrong. So the language (and web browsers) handle printing arbitrary and unprintable binary data just fine. So why not just use the upper ASCII range? It's there to be used, right?
One very real problem could be data going over HTTP and passing through one or more "can opener" proxies (intermediaries that inspect or rewrite traffic) on the way between my browser and the server. How badly could this go? I'm aware that WebSockets over HTTP caused some real pain a couple of years ago, and potentially even today.
Kind of use UTF-8 in interesting ways
UTF-8 defines 1- to 4-byte long sequences to encapsulate Unicode codepoints. Bytes 2 to 4 always start with 10xxxxxx. There are 64 characters within that range. If I pass through a naïve proxy that filters characters outside the Unicode range on a character-by-character basis, using bytes within this range might mean my data would get through unscathed!
Determine 36 magic bytes that will work for various esoteric reasons
Maybe there are some high ASCII characters that will successfully traverse >99% of the Internet infrastructure for various historical or implementational reasons. What characters might these be?
Base64 is ubiquitous and has wound up being used everywhere, and it's easy to understand why: it was defined in 1987 to use a carefully chosen, very restricted alphabet of A-Z, a-z, 0-9, + and / that was (and remains) unlikely to cause problems in most environments (such as mainframes using non-ASCII encodings).
EBCDIC mainframes and MIME email are still very much out there, but today base64 has also wound up as a heavily-used pipe within JavaScript to handle the case of "something in this data path might choke on binary", and the collective overhead it adds is nontrivial.
There's currently only one other question on SO regarding the general viability of base128 encoding, and literally every single answer has one or more issues. The accepted answer suggests that base128 must exactly use the first 128 characters of ASCII, and the only answer that acknowledges that the encoded alphabet can use any characters proceeds to claim that base128 is not in use because the encoded characters must be easily retypeable (which base58 is optimized for, FWIW). All the others have various problems (which I can explain further if desired).
This question is an attempt to re-ask the above with some additional unambiguous subject clarification, in the hope that a concrete go/no-go can be determined.
It's viable in the sense of being technically possible, but it's not viable in the sense of being able to achieve a result better than a much simpler alternative: using HTTP gzip compression. In practice if compression is enabled, the Huffman encoding of the strings will negate the 1/3 increase in size from base64 encoding because each character in the base64 string has only 6 bits of entropy.
As a test, I tried generating a 1 MB file of random data using a utility like Dummy File Creator, then base64-encoded it and gzipped the resulting file using 7zip.
Original data: 1,048,576 bytes
Base64 encoded data: 1,398,104 bytes
Gzipped base64 encoded data: 1,060,329 bytes
That's only a 1.12% increase in size (and the overhead of encoding -> compressing -> decompressing -> decoding).
Base128 encoding would take 1,198,373 bytes, so you'd have to compress it too if you wanted comparable file size. Gzip compression is a standard feature in all modern browsers so what's the case for base128 and all the extra complexity that would entail?
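You can roughly reproduce the experiment with Node's built-ins (a sketch; exact sizes will differ slightly from the 7zip figures above):

const crypto = require('crypto');
const zlib = require('zlib');

const raw = crypto.randomBytes(1024 * 1024);             // 1 MiB of incompressible data
const b64 = raw.toString('base64');                      // ~33% larger
const gz  = zlib.gzipSync(Buffer.from(b64, 'ascii'));    // squeezes the 6-bits-per-char redundancy back out

console.log(raw.length, b64.length, gz.length);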
Select 36 non-Unicode characters from within the upper (>128) ASCII range
base128 is not effective here because you must use characters with codes above 127. For characters with codes >= 128, Chrome sends two bytes, so a 1 MB string of such characters grows to 2 MB when sent, and you lose all the gain. For base64 strings this phenomenon doesn't appear (so we only lose ~33%). More details here, in the "update" section.
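You can see the effect this answer describes directly (a quick check, not part of the original answer):

new TextEncoder().encode('A').length   // 1 byte on the wire
new TextEncoder().encode('ÿ').length   // 2 bytes: U+00FF becomes 0xC3 0xBF in UTF-8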
The reason base64 is used so much is that it uses English letters and digits to encode a binary stream.
Technically we can use higher bases, but the problem is that they still need to fit into some character set.
UTF-8 is one of the most widely used charsets, and if you are using XML or JSON to transmit data, you can very well use a base256 encoding like the one below:
https://github.com/bharatmicrosystems/base256
Kind of use UTF-8 in interesting ways
UTF-8 defines 1- to 4-byte long sequences to encapsulate Unicode codepoints. Bytes 2 to 4 always start with 10xxxxxx. There are 64 characters within that range. If I pass through a naïve proxy that filters characters outside the Unicode range on a character-by-character basis, using bytes within this range might mean my data would get through unscathed!
This is actually quite viable and has been used in base-122. Despite the name, it's in fact base-128: the 6 invalid values (128 - 122) are encoded specially so that a series of 14 bits can always be represented with at most 2 bytes, exactly like base-128, where 7 bits are encoded in 1 byte; in practice it can even be tuned to be more efficient than a plain base-128.
Base-122 encoding takes chunks of seven bits of input data at a time. If the chunk maps to a legal character, it is encoded with the single byte UTF-8 character: 0xxxxxxx. If the chunk would map to an illegal character, we instead use the two-byte UTF-8 character: 110xxxxx 10xxxxxx. Since there are only six illegal code points, we can distinguish them with only three bits. Denoting these bits as sss gives us the format: 110sssxx 10xxxxxx. The remaining eight bits could seemingly encode more input data. Unfortunately, two-byte UTF-8 characters representing code points less than 0x80 are invalid. Browsers will parse invalid UTF-8 characters into error characters. A simple way of enforcing code points greater than 0x80 is to use the format 110sss1x 10xxxxxx, equivalent to a bitwise OR with 0x80 (this can likely be improved, see §4). Figure 3 summarizes the complete base-122 encoding.
§2.2 Base-122 Encoding
You can find the implementation on GitHub.
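For a feel of the scheme, here is a simplified sketch of the per-chunk decision described in the quote (not the real codec, which also handles chunk boundaries and the final partial chunk):

const ILLEGAL = [0x00, 0x0A, 0x0D, 0x22, 0x26, 0x5C];   // NUL, \n, \r, ", &, \

function encodeChunk(chunk, nextChunk) {   // chunk, nextChunk: 7-bit values
  const s = ILLEGAL.indexOf(chunk);
  if (s === -1) return [chunk];                            // 0xxxxxxx: one byte
  return [0b11000010 | (s << 2) | (nextChunk >> 6),        // 110sss1x
          0b10000000 | (nextChunk & 0x3F)];                // 10xxxxxx
}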
The accepted answer suggests that base128 must exactly use the first 128 characters of ASCII, ...
Base-122 doesn't exactly use the first 128 ASCII characters, so it can be stored normally in a null-terminated string. As for:
... and the only answer that acknowledges that the encoded alphabet can use any characters proceeds to claim that base128 is not in use because the encoded characters must be easily retypeable (which base58 is optimized for, FWIW)
Encodings that use non-printable characters are generally not meant for typing by hand but for transmission. For example, base-122 is optimized for storing binary data in JavaScript strings in a UTF-8 HTML file, which probably works best for your use case.

javascript string internal representation

As far as I know, Java uses UTF-16 to represent chars and strings internally,
so if we load a text file, it is automatically decoded from its original encoding to UTF-16.
Now the same can also be said for JavaScript:
it also uses UTF-16 as its internal string representation.
Suppose we load a string x encoded in UTF-8 using Ajax;
a conversion takes place so that JavaScript can represent that string internally in UTF-16.
Please tell me if any of what I stated is correct or not,
because the real question is yet to come...
Now suppose the browser is rendering a page using UTF-8 encoding,
and using JavaScript we want the browser to also render the Ajax string x (as you normally would).
Would, in this case, a further conversion be needed from UTF-16 to UTF-8?
Thanks in advance.
According to this article, it is UCS-2 or UTF-16.
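A quick way to see the distinction (my illustration, not part of the linked article): .length counts UTF-16 code units, while TextEncoder shows the UTF-8 bytes that actually travel over the wire.

'é'.length                              // 1 UTF-16 code unit
'😀'.length                             // 2 (surrogate pair)
new TextEncoder().encode('é').length    // 2 UTF-8 bytes
new TextEncoder().encode('😀').length   // 4 UTF-8 bytes

Conversions to and from UTF-8 generally happen only at the boundaries (network, files); handing the string to the DOM does not require another explicit conversion on your part.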

something better than base64 to encode data that doesn't take up too much processing power

I have a web site in which users can add multiple items, and sometimes the URL can get long. I thought that by using base64 encoding I'd pass the data along, but base64 output can contain a slash, and I use slashes to separate items because my web server cannot handle path segments (anything between 2 slashes) longer than 255 characters; otherwise I'd get a 403 error.
Is there another way I can encode data quickly in JavaScript so that there's a 0% chance that a slash will occur in the result?
I'm looking for something not too processor-intensive, and if possible I want to go for something better than character swapping.
I will understand if I need to use a library, but the only encoding built into JavaScript (to my knowledge) is base64 (via the btoa and atob functions) and I want something different.
I also want the solution to work with older web browsers.
What you need is encodeURIComponent, which is part of the JavaScript spec and automatically available in every JavaScript environment:
var url = 'example.com/someextenstion/' + encodeURIComponent(theString);
There are many ways to address this, but one of the simplest is going to be to take an implementation of atob and btoa and modify it to use a - instead of a / when encoding. You'll have to rename the functions so they don't mask the standard functions, but here's some JavaScript source code that does the trick: github. In that particular implementation just replace the / in _ALPHA with a - (or any character of your choosing).
It might be faster to just do as Amit suggests: use the standard functions and do a quick string replace of / on conversion: str.replace(/\//g,'-'); and perform the reverse on decoding, but it doesn't seem like performance will be critical in this application.
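A minimal version of that swap (a sketch; '-' is safe to substitute because it never appears in standard base64 output):

function encodeNoSlash(s) { return btoa(s).replace(/\//g, '-'); }
function decodeNoSlash(s) { return atob(s.replace(/-/g, '/')); }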
