Javascript UTF-8 illegal characters when minifying

Javascript UTF-8 illegal characters when minifying - javascript

I am trying my hand at Javascript minification using a self extracting compression based on LZW compression. In order for it to work the minified Javascript is stored in a string using the new ES6 template string delimiters.
Part of the compression involves creating a dictionary, each item in the dictionary is represent by a single UTF-8 character in the code string. This gives me around 128 dictionary entries but means that the code string will contain each character from 128-255 UTF-8.
It works well on every browser I have tried, and being the only ones I am interested in supporting. I have zero interest in supporting legacy browsers.
To see how my code fairs with existing minifiers I do a comparison, and I am more than happy with the results.
Then I thought, let's see what the various javascript minifiers will do to the my minified code. This is where I struck problems as at least half the minifiers fail and report Illegal characters. Upon examination I find that it is characters in the range 128-255 UTF-8 that are marked as illegal.
Being under the impression that javascript is compliant with ASCII, UTF-8, and UTF-16 I then try to find out why the UTF-8 string is causing these minifiers to fail. I have searched to find a list of illegal UTF-8 Javascript characters and can't find any.
Could someone please shed some light on why using UTF-8 characters inside javascript strings is causing these minifiers to fail (not one YUI minifier works), and which if any UTF-8 characters need special treatment within a Javascript template string. Note I am only interested in characters codes 128-255 as I have correctly encoded all characters below theses.

Related

How viable is base128 encoding for scenarios like JavaScript strings?

I recently found that base32, base64 and base128 are the most efficient forms of base-n encoding, and that while base58, Ascii85, base91, base92 et al do provide some efficiency improvements over the ubiquitous base64 due to their use of more characters, there are some mapping losses; for example, there happen to be 272 indices per character-pair in base92 that are impossible to map to from base-10 powers of 2 and are thus completely wasted. (Base91 encoding only has a similar loss of 89 characters (as found by the script in the link above) but it's patented.)
It would be great if it were viable to use base128 in modern-day real-world scenarios.
There are 92 characters available within 0x21 (33) to 0x7E (126) sans \ and ", which make for a great start to creating JSONifiable strings with the most characters possible.
Here are a few ways I envisage the rest of the characters could be found. This is the question I'm asking.
Just dumbly use Unicode
Two-byte Unicode characters could be used to fill in the remaining 36 required indices. Highly suboptimal; I wouldn't be surprised if this was worse than base64 on the wire. Would only be useful for Unicode character counting scenarios like tweet length. Not exactly what I'm going for.
Select 36 non-Unicode characters from within the upper (>128) ASCII range
JavaScript was built with the expectation that character encoding configuration will occasionally go horribly wrong. So the language (and web browsers) handle printing arbitrary and unprintable binary data just fine. So why not just use the upper ASCII range? It's there to be used, right?
One very real problem could be data going over HTTP and falling through one or more can openers proxies on the way between my browser and the server. How badly could this go? I'm aware that WebSockets over HTTP caused some real pain a couple years ago, and potentially even today.
Kind of use UTF-8 in interesting ways
UTF-8 defines 1- to 4-byte long sequences to encapsulate Unicode codepoints. Bytes 2 to 4 always start with 10xxxxxx. There are 64 characters within that range. If I pass through a naïve proxy that filters characters outside the Unicode range on a character-by-character basis, using bytes within this range might mean my data would get through unscathed!
Determine 36 magic bytes that will work for various esoteric reasons
Maybe there are some high ASCII characters that will successfully traverse >99% of the Internet infrastructure for various historical or implementational reasons. What characters might these be?
Base64 is ubiquitous and has wound up being used everywhere, and it's easy to understand why: it was defined in 1987 to use a carefully-chosen, very restricted alphabet of A-Z, a-z, 0-9, + and / that was (and remains) difficult for most environments (such as mainframes using non-ASCII encoding) to have problems with.
EBCDIC mainframes and MIME email are still very much out there, but today base64 has also wound up as a heavily-used pipe within JavaScript to handle the case of "something in this data path might choke on binary", and the collective overhead it adds is nontrivial.
There's currently only one other question on SO regarding the general viability of base128 encoding, and literally every single answer has one or more issues. The accepted answer suggests that base128 must exactly use the first 128 characters of ASCII, and the only answer that acknowledges that the encoded alphabet can use any characters proceeds to claim that that base128 is not in use because the encoded characters must be easily retypeable (which base58 is optimized for, FWIW). All the others have various problems (which I can explain further if desired).
This question is an attempt to re-ask the above with some additional unambiguous subject clarification, in the hope that a concrete go/no-go can be determined.

It's viable in the sense of being technically possible, but it's not viable in the sense of being able to achieve a result better than a much simpler alternative: using HTTP gzip compression. In practice if compression is enabled, the Huffman encoding of the strings will negate the 1/3 increase in size from base64 encoding because each character in the base64 string has only 6 bits of entropy.
As a test, I tried generating a 1Mb file of random data using a utility like Dummy File Creator. Then base64 encoded it and gzipped the resulting file using 7zip.
Original data: 1,048,576 bytes
Base64 encoded data: 1,398,104 bytes
Gzipped base64 encoded data: 1,060,329 bytes
That's only a 1.12% increase in size (and the overhead of encoding -> compressing -> decompressing -> decoding).
Base128 encoding would take 1,198,373 bytes, so you'd have to compress it too if you wanted comparable file size. Gzip compression is a standard feature in all modern browsers so what's the case for base128 and all the extra complexity that would entail?

Select 36 non-Unicode characters from within the upper (>128) ASCII range
base128 is not effective because you must use characters witch codes greater than '128'. For charater witch codes >=128 chrome send two bytes... (so string witch 1MB of this characters on sending will be change to 2MB bytes... so you loose all profit). For base64 strings this phenomena does't appear (so we loose only ~33%). More details here in "update" section.

The problem why base64 is used a lot is because they use English alphabets and numbers to encode a binary stream.
Technically we can use higher bases but the problem with them is that they will need to fit some character set.
UTF-8 is one of the widely used charsets and if you are using XML or JSON to transmit data, you can very well use a Base256 encoding like the below
https://github.com/bharatmicrosystems/base256

Kind of use UTF-8 in interesting ways
UTF-8 defines 1- to 4-byte long sequences to encapsulate Unicode codepoints. Bytes 2 to 4 always start with 10xxxxxx. There are 64 characters within that range. If I pass through a naïve proxy that filters characters outside the Unicode range on a character-by-character basis, using bytes within this range might mean my data would get through unscathed!
This is actually quite viable and has been used in base-122. Despite the name, it's in fact base-128 because the 6 invalid values (128 – 122) are encoded specially so that a series of 14 bits can always be represented with at most 2 bytes, exactly like base-128 where 7 bits will be encoded in 1 byte, and in reality can be optimized to be more efficient than base-128
Base-122 encoding takes chunks of seven bits of input data at a time. If the chunk maps to a legal character, it is encoded with the single byte UTF-8 character: 0xxxxxxx. If the chunk would map to an illegal character, we instead use the the two-byte UTF-8 character: 110xxxxx 10xxxxxx. Since there are only six illegal code points, we can distinguish them with only three bits. Denoting these bits as sss gives us the format: 110sssxx 10xxxxxx. The remaining eight bits could seemingly encode more input data. Unfortunately, two-byte UTF-8 characters representing code points less than 0x80 are invalid. Browsers will parse invalid UTF-8 characters into error characters. A simple way of enforcing code points greater than 0x80 is to use the format 110sss1x 10xxxxxx, equivalent to a bitwise OR with 0x80 (this can likely be improved, see §4). Figure 3 summarizes the complete base-122 encoding.
§2.2 Base-122 Encoding
You can find the implementation on github
The accepted answer suggests that base128 must exactly use the first 128 characters of ASCII, ...
Base-122 doesn't exactly use the first 128 ASCII characters, so it can be encoded normally in a null-terminated string. But as
... and the only answer that acknowledges that the encoded alphabet can use any characters proceeds to claim that that base128 is not in use because the encoded characters must be easily retypeable (which base58 is optimized for, FWIW)
Encodings that use non-printable characters are generally not for typing by hand but for transmission. For example base-122 is optimized for storing binary data in JavaScript strings in a UTF-8 html file which probably works best for your use case

Javascript - Alternative to lzw compression for Database entry

I have strings (about 1-5Kb) of the form:
FF,A3V,X7Y,aA4,....
lzw compresses these really nicely, but includes Turkish characters. These are then submitted to a MySQL database.
Sometimes MySQL can 'play-up' and not submit these properly, putting question marks '?' in place of the Turkish characters. They can do this even when you have your text areas properly defined. Exporting and reimporting the table can sort this out. This is fine for my test database, but not something I am happy with when this goes live.
Consequently I am looking for an alternative to lzw, which will compress but only using normal letters/numbers etc.
Does anyone know of a PUBLIC DOMAIN compression method that avoid Turkish Characters (and any other non-standard characters)? Can anyone point me to some code in javascript (or c++ or c# which I can convert)?

To expand a bit on what's been said in the comments... Storing strings of bytes, such as the output from a compression algorithm typically contains, in a VARCHAR or CHAR or TEXT column is not valid usage.
These column types are not for byte strings, they are for strings of valid characters only. Not every string of bytes contains valid strings of characters in any given character set... and MySQL isn't going to allow invalid characters (which, for some character sets, the correlation between "character" and "byte" isn't 1:1).
In the good ol' days™, the two were interchangeable but this is not the case any more (and hasn't been, to one degree or another, for a while).
If your column type, instead, were BINARY or VARBINARY or BLOB, the issue should disappear, because those data types are for binary data.

HTML1121: Codepage unicode is not allowed, only codepage utf-8 is allowed

I see this error in Visual Studio 2012 as I'm trying to get my HTML5 app running inside a native Windows 8 app:
HTML1121: Codepage unicode is not allowed, only codepage utf-8 is allowed.
Clearly it's a character encoding issue, but I'm not familiar with the differences between unicode and UTF-8. Can anyone shed some light on this?

If you are bringing files into your project from outside VS, use VS and the Save filename As feature and select the Save With Encoding from the Save dropdown. Choose UTF-8 Encoding. This will normally solve the problem you are experiencing.
All JavaScript files (files with a .js extension) included in the app package are converted into bytecode that the JavaScript engine can consume directly. This requires UTF-8 encoding, IIRC.

When Microsoft says Unicode they generally mean UTF-16:
... UTF-16 (wide character) encoding, which is the most common encoding of Unicode and the one used for native Unicode encoding on Windows operating systems.
http://msdn.microsoft.com/en-us/library/windows/desktop/dd374081(v=vs.85).aspx
The designMode flag ends up forcing the browser to fall back to UTF-16, whereas windows 8 expects UTF-8 (that decision to migrate to UTF-8 is relatively recent). Your best option is to keep designMode off and rework the page

Unicode is a standard. It assigns characters to abstract code points. But there's more, most of the work is actually towards creating properties for those code points as well as defining relationships between them.
For example, the character A (LATIN CAPITAL LETTER A) is assigned to code point U+0041. Properties defined for this code point include for example that its General Category is Letter, Uppercase and that it's written from left-to-right. It has a relationship with the code point U+0061, in that U+0061 is its lowercase mapping. So that's Unicode.
There are Unicode Transformation Formats for mapping these abstract code points to actual concrete bytes in a computer. And this is what is relevant when specifiying encoding, "code page" or "charset". You should use UTF-8.
Also, "Unicode" can actually refer to the encoding UTF-16LE in some Microsoft contexts.

using unicode in Javascript

In JavaScript we can use the below line of code(which uses Unicode) for displaying copyright symbol:
var x = "\u00A9 RPeripherals";
Why can't we type the copyright symbol directly using ALT code (alt+0169) like below :
var x = "© RPeripherals" ;
What is the difference between these two methods?

Why can't we type the copyright symbol directly using ALT code (alt+0169) like below :
Who says so? Of course you can. Just configure your code editor to use UTF-8 encoding for source files. You should never use anything else to begin with...
What is the difference between these two methods?
The difference is that using the \uXXXX scheme you are transmitting at best 2 and at worst 5 extra bytes on the wire. This kind of spelling may help if you need to embed characters in your source code, which your font cannot display properly. For example, I don't have traditional Chinese characters in the font I'm using for programming, so if I type Chinese characters into my code editor, I'll see a bunch of question marks or rectangles with Unicode codepoint digits instead of actual characters. But someone who has Chinese glyphs in the font wouldn't have that problem.
If me and that person want to share our source code, it would be preferable that the other person uses \uXXXX scheme, as I would be able to verify which character is that by looking it up in the Unicode table. That's about all the difference.
EDIT
ECMAScript standard (v 262/5.1) says specifically that
A conforming implementation of this Standard shall interpret
characters in conformance with the Unicode Standard, Version 3.0 or
later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted
encoding form, implementation level 3. If the adopted ISO/IEC 10646-1
subset is not otherwise specified, it is presumed to be the BMP
subset, collection 300. If the adopted encoding form is not otherwise
specified, it presumed to be the UTF-16 encoding form.
So, the standard guarantees that character encoding is Unicode, and enforces the use of UTF-16 (that's strange, I thought it was UTF-8), but I don't think that this is what happens in practice... I believe that browsers use UTF-8 as default. Perhaps this have changed in the later standards, but this is the one last universally accepted.

Why can't we directly type the copyright symbol directly
Because JavaScript engines are capable of parsing UTF-8 encoded source files.
What is the difference between these two methods?
One is short, requires the source file be encoded in an encoding that supports the character, and requires that you type a character that isn't printed on the keyboard's buttons.
The other is (comparatively) long, can be expressed entirely in ASCII, and can be typed with characters printed on the buttons of a standard keyboard.

Insert EBCDIC character into javascript string

I need to create an EBCDIC string within my javascript and save it into an EBCDIC database. A process on the EBCDIC system then uses the data. I haven't had any problems until I came across the character '¬'. In EBCDIC it is hex value of 5F. All of the usual letters and symbols seem to automagically convert with no problem. Any idea how I can create the EBCDIC value for '¬' within javascript so I can store it properly in the EBCDIC db?
Thanks!

If "all of the usual letters and symbols seem to automagically convert", then I very strongly suspect that you do not have to create an EBCDIC string in Javascript. The character codes for Latin letters and digits are completely different in EBCDIC than they are in Unicode, so something in your server code is already converting the strings.
Thus what you need to determine is how that process works, and specifically you need to find out how the translation maps character codes from Unicode source into the EBCDIC equivalents. Once you know that, you'll know what Unicode character to use in your Javascript code.
As a further note: every single time I've been told by an IT organization that their mainframe software requires that data be supplied in EBCDIC, that advice has been dead wrong. The fact that there's some external interface means that something in the pile of iron that makes up the mainframe and it's tentacles, something the IT people have forgotten about and probably couldn't find if they needed to, is already mapping "real world" character encodings like Unicode into EBCDIC. How does it work? Well, it may be impossible to figure out.
You might try whether this works: var notSign = "\u00AC";
edit: also: here's a good reference for HTML entities and Unicode glyphs: http://www.elizabethcastro.com/html/extras/entities.html The HTML/XML syntax uses decimal numbers for the character codes. For Javascript, you have to convert those to hex, and the notation in Javascript strings is "\u" followed by a 4-digit hex constant. (That reference isn't complete, but it's pretty easy to read and it's got lots of useful symbols.)

Develop Reference

JavaScript is the programming language of the Web.