I'd like to deliver custom binary data to the browser. It's actually images, but I need to deliver multiple versions of the same image plus some metadata. Network performance should be achieved by using just a single GET request, and it should unpack fast in the browser. So far I have thought of these solutions:
image sprite (what about the metadata?)
ZIP
msgpack
JSON + base64 encoding
I don't care about < IE8. I think avoiding XHR is not possible in my case, but the same-origin policy makes it even worse, as I need to load from a different (sub)domain. That could be worked around by server-side routing, though on the other hand that prevents using a CDN.
That depends on the data structure, but if you need to do something with this data in JavaScript, there are two (three) ways to achieve this:
JSON + base64 or escaping special characters
XML + base64 or escaping special characters (choosing between the two depends on preference)
The harder but most effective option: plain text with the 0x00 character and special marks escaped (define three special codes: 0x20 0x40 for the 0x00 character, 0x20 0x41 as a mark, and 0x20 0x42 for a literal 0x20), as sketched below.
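A minimal sketch of that escaping scheme, assuming my reading of it is right (0x20 acts as an escape prefix, and byte arrays stand in for the wire format):

// Sketch only: 0x20 is the escape prefix; 0x20 0x40 encodes 0x00,
// 0x20 0x42 encodes a literal 0x20, and 0x20 0x41 is reserved as a "mark".
function escapeBinary(bytes) {
  var out = [];
  for (var i = 0; i < bytes.length; i++) {
    if (bytes[i] === 0x00) out.push(0x20, 0x40);
    else if (bytes[i] === 0x20) out.push(0x20, 0x42);
    else out.push(bytes[i]);
  }
  return out;
}

function unescapeBinary(bytes) {
  var out = [];
  for (var i = 0; i < bytes.length; i++) {
    if (bytes[i] !== 0x20) { out.push(bytes[i]); continue; }
    var code = bytes[++i];
    if (code === 0x40) out.push(0x00);
    else if (code === 0x42) out.push(0x20);
    // 0x41 would signal the separator "mark" described above
  }
  return out;
}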
It turned out that today's (and the near future's, i.e. IE9) browsers have very poor support for binary data, so whatever you decide to use has to be encodable into a JavaScript String, i.e. JSON + base64.
For curious minds: http://status-501.tumblr.com/post/20293218962/delivering-binary-data-to-browser
I recently found that base32, base64 and base128 are the most efficient forms of base-n encoding, and that while base58, Ascii85, base91, base92 et al do provide some efficiency improvements over the ubiquitous base64 due to their use of more characters, there are some mapping losses; for example, there happen to be 272 indices per character-pair in base92 that are impossible to map to from base-10 powers of 2 and are thus completely wasted. (Base91 encoding only has a similar loss of 89 characters (as found by the script in the link above) but it's patented.)
It would be great if it were viable to use base128 in modern-day real-world scenarios.
There are 92 characters available within 0x21 (33) to 0x7E (126) sans \ and ", which make for a great start to creating JSONifiable strings with the most characters possible.
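For concreteness, a quick sketch that enumerates that starting alphabet:

var alphabet = '';
for (var c = 0x21; c <= 0x7E; c++) {
  if (c !== 0x22 && c !== 0x5C) alphabet += String.fromCharCode(c); // skip " and \
}
console.log(alphabet.length); // 92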
Here are a few ways I envisage the rest of the characters could be found. This is the question I'm asking.
Just dumbly use Unicode
Two-byte Unicode characters could be used to fill in the remaining 36 required indices. Highly suboptimal; I wouldn't be surprised if this was worse than base64 on the wire. Would only be useful for Unicode character counting scenarios like tweet length. Not exactly what I'm going for.
Select 36 non-Unicode characters from within the upper (>128) ASCII range
JavaScript was built with the expectation that character encoding configuration will occasionally go horribly wrong. So the language (and web browsers) handle printing arbitrary and unprintable binary data just fine. So why not just use the upper ASCII range? It's there to be used, right?
One very real problem could be data going over HTTP and passing through one or more meddling proxies on the way between my browser and the server. How badly could this go? I'm aware that WebSockets over HTTP caused some real pain a couple of years ago, and potentially even today.
Kind of use UTF-8 in interesting ways
UTF-8 defines 1- to 4-byte long sequences to encapsulate Unicode codepoints. Bytes 2 to 4 always start with 10xxxxxx. There are 64 characters within that range. If I pass through a naïve proxy that filters characters outside the Unicode range on a character-by-character basis, using bytes within this range might mean my data would get through unscathed!
Determine 36 magic bytes that will work for various esoteric reasons
Maybe there are some high ASCII characters that will successfully traverse >99% of the Internet infrastructure for various historical or implementational reasons. What characters might these be?
Base64 is ubiquitous and has wound up being used everywhere, and it's easy to understand why: it was defined in 1987 to use a carefully chosen, very restricted alphabet of A-Z, a-z, 0-9, + and / that was (and remains) hard for most environments (such as mainframes using non-ASCII encodings) to mishandle.
EBCDIC mainframes and MIME email are still very much out there, but today base64 has also wound up as a heavily-used pipe within JavaScript to handle the case of "something in this data path might choke on binary", and the collective overhead it adds is nontrivial.
There's currently only one other question on SO regarding the general viability of base128 encoding, and literally every single answer has one or more issues. The accepted answer suggests that base128 must exactly use the first 128 characters of ASCII, and the only answer that acknowledges that the encoded alphabet can use any characters proceeds to claim that base128 is not in use because the encoded characters must be easily retypeable (which base58 is optimized for, FWIW). All the others have various problems (which I can explain further if desired).
This question is an attempt to re-ask the above with some additional unambiguous subject clarification, in the hope that a concrete go/no-go can be determined.
It's viable in the sense of being technically possible, but it's not viable in the sense of being able to achieve a result better than a much simpler alternative: using HTTP gzip compression. In practice if compression is enabled, the Huffman encoding of the strings will negate the 1/3 increase in size from base64 encoding because each character in the base64 string has only 6 bits of entropy.
As a test, I tried generating a 1 MB file of random data using a utility like Dummy File Creator, base64 encoded it, and gzipped the resulting file using 7-Zip.
Original data: 1,048,576 bytes
Base64 encoded data: 1,398,104 bytes
Gzipped base64 encoded data: 1,060,329 bytes
That's only a 1.12% increase in size (and the overhead of encoding -> compressing -> decompressing -> decoding).
Base128 encoding would take 1,198,373 bytes, so you'd have to compress it too if you wanted comparable file size. Gzip compression is a standard feature in all modern browsers so what's the case for base128 and all the extra complexity that would entail?
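For what it's worth, the experiment above can be roughly reproduced with Node.js built-ins (a sketch; exact numbers will vary with the gzip implementation and compression level):

var crypto = require('crypto');
var zlib = require('zlib');

var original = crypto.randomBytes(1024 * 1024);         // 1 MiB of random data
var base64 = Buffer.from(original.toString('base64'));  // ~4/3 the size
var gzipped = zlib.gzipSync(base64);                    // compresses back down

console.log('original:       ' + original.length);
console.log('base64 encoded: ' + base64.length);
console.log('gzipped base64: ' + gzipped.length);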
Select 36 non-Unicode characters from within the upper (>128) ASCII range
Base128 is not effective because you must use characters with codes greater than 128. For characters with codes >= 128, Chrome sends two bytes, so a string with 1 MB of such characters becomes 2 MB when sent, and you lose all the benefit. This phenomenon doesn't appear with base64 strings (where we lose only ~33%). More details here in the "update" section.
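You can see the blow-up directly in the browser: Blob serializes JavaScript strings as UTF-8, so (a sketch):

console.log(new Blob(['a']).size);      // 1 byte on the wire
console.log(new Blob(['\u00E9']).size); // 2 bytes for one character above 0x7F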
Base64 is used a lot because it uses English letters and numbers to encode a binary stream.
Technically we can use higher bases but the problem with them is that they will need to fit some character set.
UTF-8 is one of the most widely used charsets, and if you are using XML or JSON to transmit data, you can very well use a Base256 encoding like the one below:
https://github.com/bharatmicrosystems/base256
Kind of use UTF-8 in interesting ways
UTF-8 defines 1- to 4-byte long sequences to encapsulate Unicode codepoints. Bytes 2 to 4 always start with 10xxxxxx. There are 64 characters within that range. If I pass through a naïve proxy that filters characters outside the Unicode range on a character-by-character basis, using bytes within this range might mean my data would get through unscathed!
This is actually quite viable and has been used in base-122. Despite the name, it's in fact base-128, because the 6 invalid values (128 − 122) are encoded specially, so that a series of 14 bits can always be represented with at most 2 bytes, exactly like base-128 where 7 bits are encoded in 1 byte. In reality it can be optimized to be even more efficient than base-128.
Base-122 encoding takes chunks of seven bits of input data at a time. If the chunk maps to a legal character, it is encoded with the single byte UTF-8 character: 0xxxxxxx. If the chunk would map to an illegal character, we instead use the two-byte UTF-8 character: 110xxxxx 10xxxxxx. Since there are only six illegal code points, we can distinguish them with only three bits. Denoting these bits as sss gives us the format: 110sssxx 10xxxxxx. The remaining eight bits could seemingly encode more input data. Unfortunately, two-byte UTF-8 characters representing code points less than 0x80 are invalid. Browsers will parse invalid UTF-8 characters into error characters. A simple way of enforcing code points greater than 0x80 is to use the format 110sss1x 10xxxxxx, equivalent to a bitwise OR with 0x80 (this can likely be improved, see §4). Figure 3 summarizes the complete base-122 encoding.
§2.2 Base-122 Encoding
You can find the implementation on GitHub.
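To make the quoted format concrete, here is a sketch of just the two-byte case (a hypothetical helper, not the actual base-122 code; the six illegal code points are the ones listed in the post). The 3-bit sss index identifies the illegal chunk, and the free x bits carry the next 7-bit chunk, so two bytes still hold 14 bits:

var ILLEGAL = [0x00, 0x0A, 0x0D, 0x22, 0x26, 0x5C]; // null, \n, \r, ", &, backslash

function encodeIllegalPair(illegalChunk, nextChunk) {
  var sss = ILLEGAL.indexOf(illegalChunk);  // 3 bits identify the illegal chunk
  var b1 = 0xC2                             // 110sss1x template (sss = 0, x = 0)
         | (sss << 2)                       // place sss
         | ((nextChunk >> 6) & 0x01);       // top bit of the next chunk
  var b2 = 0x80 | (nextChunk & 0x3F);       // 10xxxxxx: its low 6 bits
  return [b1, b2];
}

The real implementation has additional cases (such as the final partial chunk), but this shows why no capacity is wasted on the six illegal values.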
The accepted answer suggests that base128 must exactly use the first 128 characters of ASCII, ...
Base-122 doesn't exactly use the first 128 ASCII characters, so it can be encoded normally in a null-terminated string. But as for the claim that:
... and the only answer that acknowledges that the encoded alphabet can use any characters proceeds to claim that that base128 is not in use because the encoded characters must be easily retypeable (which base58 is optimized for, FWIW)
Encodings that use non-printable characters are generally not for typing by hand but for transmission. For example, base-122 is optimized for storing binary data in JavaScript strings in a UTF-8 HTML file, which probably works best for your use case.
I have a website in which users can add multiple items, and sometimes the URL can get long. I thought I'd pass the data along base64-encoded, but base64 output can contain a slash, which is a problem: I use the slash to separate items because my web server cannot handle path segments (anything between two slashes) longer than 255 characters without returning a 403 error.
Is there another way I can encode data quickly in JavaScript so that there's a 0% chance a slash will occur in the result?
I'm looking for something not too processor intensive and if possible, I want to go for something better than character swapping.
I will understand if I need to visit a library, but the only encoding built into JavaScript (to my knowledge) is base64 (via the btoa and atob functions), and I want something different.
I also want to be able to make the solution work with older web browsers as well.
What you need is encodeURIComponent, which is part of the JavaScript spec and automatically available in all JavaScript environments:
var url = 'example.com/someextension/' + encodeURIComponent(theString);
There are many ways to address this but one of the simplest is going to be to take an implementation of atob and btoa and modify it to use a - instead of a / when encoding. You'll have to rename the functions so they don't mask the standard function, but here's some JavaScript source code that does the trick: github. In that particular implementation just replace the / in _ALPHA with a - (or any character of your choosing).
It might be faster to just do as Amit suggests: use the standard functions and do a quick string replace of / on encoding (str.replace(/\//g,'-')) and perform the reverse on decoding, but it doesn't seem like performance will be critical in this application.
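A sketch of that approach (assumes btoa/atob exist, i.e. IE10+ or a polyfill for the older browsers the question mentions; '-' is unambiguous because it never appears in standard base64 output):

function encodeSlashless(str) {
  return btoa(str).replace(/\//g, '-');   // swap / for - after encoding
}
function decodeSlashless(encoded) {
  return atob(encoded.replace(/-/g, '/')); // restore / before decoding
}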
How do I decode cp-1251 to UTF-8 in JavaScript?
The cp-1251 data comes from a data feed and needs to be decoded on the JS client side.
There is no way to change the server-side output, since it comes from a third party, and for certain reasons I cannot use any server-side programming to convert the data feed into another one.
(Assuming that by "UTF-8" you meant the JS strings in their native encoding...)
Depending on the format your 'cp-1251' data is in and depending on the browsers you need to support, you can choose from:
TextDecoder.decode() API (decodes a sequence of octets from a typed array, like Uint8Array); if you're using web sockets, you can get an ArrayBuffer out of them to decode. See the sketch after this list.
https://github.com/mathiasbynens/windows-1251 operates on what it calls 'byte strings' (JS strings consisting of characters like \u00XY, where 0xXY is the encoded byte).
build the decoding table yourself (example)
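A minimal sketch of the TextDecoder route, assuming the feed bytes arrived as a typed array ('windows-1251' is a label defined by the Encoding Standard):

var bytes = new Uint8Array([0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2]); // "Привет" in cp-1251
var text = new TextDecoder('windows-1251').decode(bytes);
console.log(text); // "Привет"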
Note that in most cases (though not something as low-level as web sockets) it might be easier to read the data in the correct encoding before it ends up as a JS string; for example, you can force XMLHttpRequest to use a certain encoding even if the server misreports it:
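A sketch (the /datafeed URL is hypothetical, and overrideMimeType is not available in very old IE):

var xhr = new XMLHttpRequest();
xhr.open('GET', '/datafeed');                              // hypothetical URL
xhr.overrideMimeType('text/plain; charset=windows-1251');  // force the decoder
xhr.onload = function () {
  console.log(xhr.responseText);                           // decoded as cp-1251
};
xhr.send();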
I need to be able to compress a string in JavaScript, without saving a temporary file. I am then going to send this compressed data via a POST and receive it in Python, so I need to be able to decompress it there. I implemented the following, http://rosettacode.org/wiki/LZW_compression, only to discover that it only works on ASCII characters. I am going to be reading web pages and never know what characters I'll get.
(The reason I need to do this is because the strings can become quite long and therefore take too long for slow networks to post.)
You can try base64-encoding the string beforehand (this will yield a compressed stream from 1.5 to twice the size it would have if it had been possible to compress it directly).
There is another implementation (this one of the Deflate algorithm used by gzip) here.
Or you might try to escape the non-ASCII characters by replacing them with \xNN (NN = hex code of the character). Of course you will also have to escape the backslash itself.
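A sketch of that escaping (note that JS string characters can exceed 0xFF, so anything above that is written as \uNNNN here):

function escapeNonAscii(str) {
  function hex(n, width) { return ('0000' + n.toString(16)).slice(-width); }
  return str
    .replace(/\\/g, '\\\\')                   // escape the backslash first
    .replace(/[^\x00-\x7F]/g, function (ch) {
      var cp = ch.charCodeAt(0);
      return cp <= 0xFF ? '\\x' + hex(cp, 2) : '\\u' + hex(cp, 4);
    });
}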
Anyway, you are unlikely to achieve more than about a 10X increase in speed, and I fear this would be more than balanced by the encoding overhead. Without knowing more about the use case, I'd suggest going with Deflate.
From an OP comment:
The JavaScript reads the DOM elements and sends that. It won't work to point to the source for various reasons, a main one being that I need the elements created by JavaScript on the page. I also need the computed style that the browser calculates for me.
One solution would be to automate a browser using Selenium with Python and then retrieve the DOM from that.
Use deflate in JavaScript and zlib in Python. (LZW is ancient and obsolete; modern methods are much better.) In between, use base-85 encoding, picking 85 ASCII characters that experimentation or standards documentation indicate can make it through a POST unscathed. Base-85 simply treats each character as a digit in a base-85 number, where five such digits encode 32 bits.
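A sketch of the deflate side, assuming the pako library (https://github.com/nodeca/pako) is loaded via a script tag as a global; base64 is used here instead of base-85 for brevity:

// The Python side would then be: zlib.decompress(base64.b64decode(body))
function compressForPost(str) {
  var bytes = new TextEncoder().encode(str);  // UTF-8 bytes of the string
  var deflated = pako.deflate(bytes);         // Uint8Array, zlib-wrapped deflate
  var binary = '';
  for (var i = 0; i < deflated.length; i++) binary += String.fromCharCode(deflated[i]);
  return btoa(binary);                        // encode for safe POST transport
}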
First of all, I am aware of this question:
How do I load binary image data using Javascript and XMLHttpRequest?
and specifically best answer therein, http://emilsblog.lerch.org/2009/07/javascript-hacks-using-xhr-to-load.html.
So accessing binary data from JavaScript works with Firefox (and later versions of Chrome actually seem to work too; I don't know about Opera). So far so good.
But I am still hoping to find a way to access binary data with a modern IE (ideally IE 6, but at least IE 7+), without using VB.
It has been mentioned that XHR.responseBody would not work (if it contains zero bytes), but I was wondering if this might have been resolved in newer versions, or if there might be alternate settings that would allow simple binary data access.
Specific use case for me is that of accessing data returned by a web service that is encoded using a binary data transfer format (including byte combinations that are not legal in UTF-8 encoding).
It's possible with IE10, using responseType=arraybuffer or blob. You only had to wait for a few years...
http://msdn.microsoft.com/en-us/library/ie/br212474%28v=vs.94%29.aspx
http://msdn.microsoft.com/en-us/library/ie/hh673569%28v=vs.85%29.aspx
OK, I have found some interesting leads, although not a completely good solution yet.
One obvious thing I tried was to play with encodings. There are two obvious things that really should work:
Latin-1 (aka ISO-8859-1): it is a single-byte encoding, mapping one-to-one with Unicode. So theoretically it should be enough to declare a content type of "text/plain; charset=ISO-8859-1" and get character-per-byte. Alas, due to the idiotic logic of browsers (and the even more idiotic mandate of HTML5!), some transcoding occurs which changes the high control character range (codes 128 - 159) in strange ways. Apparently this is due to the mandatory assumption that the encoding really is Windows-1252 (why? for some silly reason, but it is what it is).
UCS-2 is a fixed-length 2-byte encoding that predated UTF-16; it simply splits 16-bit character codes into 2 bytes. Alas, browsers do not seem to support it.
UTF-16 might work, theoretically, but there is the problem of surrogate pair characters (0xD800 - 0xDFFF) which are reserved. And if byte pairs that encode these characters are included, corruption occurs.
However: it seems the conversion for Latin-1 might be reversible, and if so, I bet I could make use of it after all. All mutations are from one byte (0x00 - 0xFF) into larger-than-byte values, and there are no ambiguous mappings, at least in Firefox. If this holds true for other browsers, it will be possible to map values back and remove the ill effects of the automatic transcoding. And that would then work for multiple browsers, including IE (with the caveat of needing something special to deal with null values).
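If the transcoding really is reversible, undoing it is just a 32-entry lookup. A sketch, assuming the browser decoded the bytes as Windows-1252 (the table follows the WHATWG Encoding spec's mapping of bytes 0x80 - 0x9F):

var WIN1252_HIGH = [
  0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
  0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
];
var REVERSE = {};
for (var i = 0; i < WIN1252_HIGH.length; i++) REVERSE[WIN1252_HIGH[i]] = 0x80 + i;

// Map one decoded character back to the original byte value.
function charToByte(ch) {
  var cp = ch.charCodeAt(0);
  if (cp in REVERSE) return REVERSE[cp];  // undo the 0x80-0x9F transcoding
  if (cp < 0x100) return cp;              // byte passed through untouched
  throw new Error('unmappable code point: ' + cp);
}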
Finally, some useful links for conversions of datatypes are:
http://www.merlyn.demon.co.uk/js-exact.htm#IEEE (to handle floating points to/from binary IEEE representation)
http://jsfromhell.com/classes/binary-parser (for general parsing)
You can use the JScript "VBArray" object to get at these bytes in IE (without using VBScript):
var data = new VBArray(xhr.responseBody).toArray();
I guess the answer is a plain "no", as per this post: how do I access XHR responseBody (for binary data) from Javascript in IE?
(or: "use VBScript to help")