HTML1121: Codepage unicode is not allowed, only codepage utf-8 is allowed - javascript

I see this error in Visual Studio 2012 as I'm trying to get my HTML5 app running inside a native Windows 8 app:
HTML1121: Codepage unicode is not allowed, only codepage utf-8 is allowed.
Clearly it's a character encoding issue, but I'm not familiar with the differences between Unicode and UTF-8. Can anyone shed some light on this?

If you are bringing files into your project from outside VS, open each file in Visual Studio, use File > Save As..., and choose Save with Encoding from the Save button's dropdown. Choose UTF-8 encoding. This will normally solve the problem you are experiencing.
All JavaScript files (files with a .js extension) included in the app package are converted into bytecode that the JavaScript engine can consume directly. IIRC, this step requires the source files to be UTF-8 encoded.

When Microsoft says Unicode they generally mean UTF-16:
... UTF-16 (wide character) encoding, which is the most common encoding of Unicode and the one used for native Unicode encoding on Windows operating systems.
http://msdn.microsoft.com/en-us/library/windows/desktop/dd374081(v=vs.85).aspx
The designMode flag ends up forcing the browser to fall back to UTF-16, whereas Windows 8 expects UTF-8 (the decision to migrate to UTF-8 is relatively recent). Your best option is to keep designMode off and rework the page.

Unicode is a standard. It assigns characters to abstract code points. But there's more: most of the work actually goes toward defining properties for those code points, as well as relationships between them.
For example, the character A (LATIN CAPITAL LETTER A) is assigned to code point U+0041. Properties defined for this code point include for example that its General Category is Letter, Uppercase and that it's written from left-to-right. It has a relationship with the code point U+0061, in that U+0061 is its lowercase mapping. So that's Unicode.
There are Unicode Transformation Formats for mapping these abstract code points to actual concrete bytes in a computer, and this is what is relevant when specifying an encoding, "code page" or "charset". You should use UTF-8.
Also, "Unicode" can actually refer to the encoding UTF-16LE in some Microsoft contexts.
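To make the distinction concrete, here is a minimal sketch in Node: the same two code points produce different bytes depending on which transformation format is chosen.

```javascript
// The same abstract code points, serialized under two different
// Unicode Transformation Formats.
const s = 'A\u00A9'; // LATIN CAPITAL LETTER A (U+0041) + COPYRIGHT SIGN (U+00A9)

const utf8 = Buffer.from(s, 'utf8');     // 'A' -> 1 byte, '©' -> 2 bytes
const utf16 = Buffer.from(s, 'utf16le'); // every BMP code point -> 2 bytes

console.log([...utf8]);  // [0x41, 0xC2, 0xA9]
console.log([...utf16]); // [0x41, 0x00, 0xA9, 0x00]
```

UTF-16LE is the "Unicode" that Windows APIs historically mean; UTF-8 is what the web (and the app packaging above) expects.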

Related

JavaScript/NodeJS RTF CJK Conversions

I'm working on a node module that parses RTF files and does some find and replace. I have already come up with a solution for special characters expressed in escaped Unicode here, but have run into a wall when it comes to CJK characters. Is there an easy way to do these conversions in JavaScript, either with a library or built in?
Example:
An RTF file viewed in plain text contains:
Now testing symbols {鈴:200638d}
When parsed in NodeJS, this part of the file looks like:
Now testing symbols \{
\f1 \'e2\'8f
\f0 :200638d\}\
I understand that \f1 and \f0 denote font changes, and the \'e2\'8f block is the actual character... but how can I take \'e2\'8f and convert it back to 鈴, or conversely, convert 鈴 to \'e2\'8f?
I have tried looking up the character in different encodings and am not seeing anything that remotely resembles \'e2\'8f. I understand that the RTF control \'hh is A hexadecimal value, based on the specified character set (may be used to identify 8-bit values) (source) or maybe the better definition comes from Microsoft RTF Spec; %xHH (OCTET with the hexadecimal value of HH) (download) but I have no idea what to do with that information to get conversions going on this.
I was able to parse your sample file using my RTF parser and retrieve the correct character.
The key thing is that the \fonttbl command, as the name suggests, defines the fonts used in the document. As part of the definition of each font, the \fcharset command determines the character set to be used with that font. You need to use this to interpret the character data correctly.
My parser maps the argument of \fcharset to a codeset name, which is then translated to a character set name that can be used to retrieve the correct Java Charset. Your character set handling will obviously be different as you are working in JavaScript, but hopefully this information will help you move forward.
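The same approach can be sketched in JavaScript: map the \fcharsetN argument to an encoding label, collect the raw bytes from the \'hh escapes, and decode them with that encoding. The \fcharset values below are the standard ones from the RTF specification; the final decode step assumes an environment whose TextDecoder supports the relevant legacy encoding (a browser, or Node built with full ICU).

```javascript
// Common \fcharset values from the RTF specification, mapped to
// WHATWG encoding labels.
const FCHARSET_TO_ENCODING = {
  0: 'windows-1252',   // ANSI
  128: 'shift_jis',    // Japanese
  134: 'gbk',          // Simplified Chinese (GB2312)
  136: 'big5',         // Traditional Chinese
  161: 'windows-1253', // Greek
  177: 'windows-1255', // Hebrew
  204: 'windows-1251', // Cyrillic
};

// Collect the raw bytes from a run of \'hh escapes, e.g. "\'e2\'8f".
function bytesFromRtfEscapes(run) {
  return Uint8Array.from(
    [...run.matchAll(/\\'([0-9a-fA-F]{2})/g)],
    (m) => parseInt(m[1], 16)
  );
}

const bytes = bytesFromRtfEscapes("\\'e2\\'8f"); // Uint8Array [0xe2, 0x8f]
// With the encoding implied by the active font's \fcharset, the bytes
// decode back to the original character, e.g.:
//   new TextDecoder(FCHARSET_TO_ENCODING[charset]).decode(bytes)
```

Going the other way (character to \'hh) needs an encoder for the legacy code page, which TextEncoder does not provide; a library such as iconv-lite would be the usual choice there.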

How do I send a Windows-1255 encoded file using Express?

I'm writing an API that creates text files.
It's used by legacy software that requires the files to use the Windows-1255 encoding.
I'm creating the content of the file from a good old JavaScript string.
These are the relevant parts of the code I have so far:
var iconv = require('iconv-lite');
const str = 'Hello world, שלום עולם';
const encoded_str = iconv.encode(str, 'win1255', {addBOM: true});
response.status(200).send('data:text/plain;base64,' + encoded_str.toString('base64'));
It sends a text file successfully. Its ASCII content is preserved and is shown nicely when I open it in notepad, but any non-ASCII (think: Hebrew) characters are garbled.
I have a gut feeling it has something to do with the base64 conversion.
(the file is later opened using an HTML <a href="..."> tag)
Your code is correct.
The reason you're seeing garbled characters has more to do with your Windows settings.
Windows-1255 is an old standard. These days we use Unicode encodings such as UTF-8.
Windows-1255, like other Windows code pages, is an 8-bit SBCS (single-byte character set).
The first 128 values are ASCII-compatible. The rest take a different meaning based on the encoding: Hebrew encodings give them Hebrew meanings, Japanese encodings give them Japanese meanings, etc...
There aren't enough bits to represent the wide variety of symbols.
If you go to your Windows settings and change the language used for non-Unicode programs, it will change which meanings the upper 128 values take.
Go and set it to Hebrew, and your content won't be garbled anymore.
Further reading: Joel on Software - The absolute minimum every software developer absolutely positively must know about unicode and character sets, no excuses.
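The code-page mechanics can be seen directly. A small sketch (assuming an environment whose TextDecoder knows these legacy labels, e.g. a browser or Node with full ICU): the same four bytes are the Hebrew word שלום under Windows-1255 but mojibake under Windows-1252.

```javascript
// The same four bytes interpreted under two different code pages.
const bytes = Uint8Array.from([0xf9, 0xec, 0xe5, 0xed]);

const asHebrew = new TextDecoder('windows-1255').decode(bytes);
console.log(asHebrew); // שלום

const asWestern = new TextDecoder('windows-1252').decode(bytes);
console.log(asWestern); // ùìåí - same bytes, wrong code page
```

This is exactly what happens when Windows is set to treat non-Unicode text as Western rather than Hebrew.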

Javascript UTF-8 illegal characters when minifying

I am trying my hand at JavaScript minification using a self-extracting compression scheme based on LZW compression. For it to work, the minified JavaScript is stored in a string using the new ES6 template string delimiters.
Part of the compression involves creating a dictionary; each item in the dictionary is represented by a single character in the code string. This gives me around 128 dictionary entries, but it means the code string will contain each of the characters 128-255.
It works well on every browser I have tried, and those are the only ones I am interested in supporting. I have zero interest in supporting legacy browsers.
To see how my code fares against existing minifiers I did a comparison, and I am more than happy with the results.
Then I thought, let's see what the various JavaScript minifiers will do to my minified code. This is where I struck problems: at least half the minifiers fail and report illegal characters. Upon examination I find that it is characters in the range 128-255 that are marked as illegal.
Being under the impression that JavaScript is compliant with ASCII, UTF-8, and UTF-16, I then tried to find out why the UTF-8 string is causing these minifiers to fail. I have searched for a list of illegal UTF-8 JavaScript characters and can't find any.
Could someone please shed some light on why using these characters inside JavaScript strings causes the minifiers to fail (not one YUI minifier works), and which, if any, UTF-8 characters need special treatment within a JavaScript template string? Note I am only interested in character codes 128-255, as I have correctly encoded all characters below these.
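For what it's worth, characters in the U+0080-U+00FF range are perfectly legal in JavaScript string literals; tools that reject them are usually assuming ASCII-only input. A common workaround (a generic sketch, not specific to any particular minifier) is to escape such characters to \uXXXX form, which makes the output byte-for-byte ASCII at the cost of six characters per escape:

```javascript
// Replace every character above U+007F with its \uXXXX escape so the
// output is plain ASCII and safe for ASCII-only tooling.
function toAsciiSafe(source) {
  return source.replace(/[\u0080-\uffff]/g, (ch) =>
    '\\u' + ch.charCodeAt(0).toString(16).padStart(4, '0')
  );
}

const dict = String.fromCharCode(0x80, 0xff); // two "dictionary" characters
console.log(toAsciiSafe('`' + dict + '`'));   // prints `\u0080\u00ff`
```

The escaped form parses back to the identical string, so the self-extractor's dictionary is unaffected; only the on-disk size grows.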

Squared Question Mark Sign on CSV file read from JS

I'm reading a CSV file in my JS, but characters with accents (á, ó...) are being replaced with a black square question mark (�).
I always have this sort of problem in PHP, but I'm using JS and I don't know how to fix it.
The problem is in the UTF-8 encoding of the file or of the HTML. Is there a way to fix this in code?
Thanks
This character is U+FFFD, REPLACEMENT CHARACTER, commonly used to replace invalid data in streams thought to be some Unicode encoding.
For example, if you had the text "Résumé" encoded as ISO 8859-1 and wanted to convert it to UTF-16, but told the conversion routine that the text was UTF-8, then the library would probably produce the UTF-16 text "R�sum�" (the other alternative would be to throw an error and not give any results).
Another way these may appear is if a web page declares that it is UTF-8 but it is not actually UTF-8. The browser is likely to do the re-encoding described above and the replacement characters will show up in the rendered web-page, but viewing the source with an editor that ignores or disregards the HTML encoding info will show the characters correctly.
From your comments it looks like your process is something like:
Excel -> export to csv -> process csv in js -> produce html
Windows software typically uses the platform's 'encoding for non-Unicode programs' for encoding eight bit text, not UTF-8. So the CSV file is probably Windows CP1252 (If you're using a version of windows set up for most of the western world), and if your javascript program is reading that data and copying it directly into HTML source that's supposed to be UTF-8, that would cause a problem that fits your description.
What you need to do is convert from whatever encoding the CSV is using to UTF-8. JavaScript hasn't traditionally had the facilities to do this, so your best bet is probably to convert the file after exporting it from Excel but before accessing it in JS.
Other alternatives are to change the encoding the HTML page is using to whatever the csv uses, or to not specify an encoding and leave it up to the browser to guess.
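Note that modern browsers do now have a facility for this: the TextDecoder API can decode legacy encodings such as windows-1252 directly, so the CSV's raw bytes can be converted in JS before being put into the page. A sketch (the fetch URL is hypothetical):

```javascript
// Decode a CP1252-encoded CSV in the browser instead of letting its
// bytes be misread as UTF-8.
async function loadCsvAsCp1252(url) {
  const buf = await (await fetch(url)).arrayBuffer();
  return new TextDecoder('windows-1252').decode(buf);
}

// The byte 0xE1 is 'á' in Windows-1252 but an invalid UTF-8 sequence,
// which is exactly what produces the U+FFFD replacement character.
const sample = Uint8Array.from([0x61, 0xe1]); // "aá" in CP1252
console.log(new TextDecoder('windows-1252').decode(sample)); // aá
console.log(new TextDecoder('utf-8').decode(sample));        // a�
```

The key is to decode with the encoding the file actually uses, never with the one you wish it used.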

using unicode in Javascript

In JavaScript we can use the below line of code(which uses Unicode) for displaying copyright symbol:
var x = "\u00A9 RPeripherals";
Why can't we type the copyright symbol directly using ALT code (alt+0169) like below :
var x = "© RPeripherals" ;
What is the difference between these two methods?
Why can't we type the copyright symbol directly using ALT code (alt+0169) like below :
Who says so? Of course you can. Just configure your code editor to use UTF-8 encoding for source files. You should never use anything else to begin with...
What is the difference between these two methods?
The difference is that with the \uXXXX scheme you are transmitting extra bytes on the wire: \u00A9 is six ASCII characters, whereas © is only two bytes in UTF-8. This kind of spelling may help if you need to embed characters in your source code which your font cannot display properly. For example, I don't have traditional Chinese characters in the font I'm using for programming, so if I type Chinese characters into my code editor, I'll see a bunch of question marks or rectangles with Unicode code point digits instead of the actual characters. But someone who has Chinese glyphs in their font wouldn't have that problem.
If that person and I want to share our source code, it would be preferable for them to use the \uXXXX scheme, as I would be able to verify which character it is by looking it up in a Unicode table. That's about all the difference.
EDIT
ECMAScript standard (v 262/5.1) says specifically that
A conforming implementation of this Standard shall interpret
characters in conformance with the Unicode Standard, Version 3.0 or
later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted
encoding form, implementation level 3. If the adopted ISO/IEC 10646-1
subset is not otherwise specified, it is presumed to be the BMP
subset, collection 300. If the adopted encoding form is not otherwise
specified, it is presumed to be the UTF-16 encoding form.
So, the standard guarantees that the character encoding is Unicode and mandates UCS-2 or UTF-16 as the adopted encoding form (that's strange, I thought it was UTF-8), but I don't think that this is what happens in practice... I believe that browsers use UTF-8 as the default for source files. Perhaps this has changed in later standards, but this is the last universally accepted one.
Why can't we type the copyright symbol directly
Because JavaScript engines are capable of parsing UTF-8 encoded source files.
What is the difference between these two methods?
One is short, requires the source file be encoded in an encoding that supports the character, and requires that you type a character that isn't printed on the keyboard's buttons.
The other is (comparatively) long, can be expressed entirely in ASCII, and can be typed with characters printed on the buttons of a standard keyboard.
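Both spellings produce exactly the same one-character string once the source has been parsed, which is easy to check (a minimal sketch, assuming the file is saved as UTF-8):

```javascript
// The escape sequence and the literal character are indistinguishable
// after parsing; they denote the same code point, U+00A9.
const escaped = "\u00A9 RPeripherals";
const literal = "© RPeripherals"; // requires the file to be saved as UTF-8

console.log(escaped === literal); // true
console.log(escaped.length);      // 14
```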
