Writing Javascript using UTF-16 character encoding

Here is what I am trying but am not sure how to get this working or if it is even possible -
I have an HTML page MyHTMLPage.htm and I want to src a Javascript from this HTML file. This is pretty straightforward. I plan to include a <script src = "MyJavascript.js"></script> tag in my HTML file and that should take care of it.
However, I want to create my Javascript file using UTF-16 encoding. So, I plan to use the following tag <script charset="UTF-16" src="MyJavascript.js"></script> in my HTML file to take care of that.
Now the problem I am really stuck at is how do I create the Javascript using UTF-16 encoding - E.g. let's say my Javascript code is alert(1); I created my Javascript file with the contents as \u0061\u006c\u0065\u0072\u0074\u0028\u0031\u0029\u003b but that does not seem to execute as valid Javascript at runtime.
To summarize, here is what I have -
MyHTMLPage.html
...
...
...
<script charset="UTF-16" src="MyJavascript.js"></script>
...
...
...
MyJavascript.js
\u0061\u006c\u0065\u0072\u0074\u0028\u0031\u0029\u003b
When I open the HTML page in Firefox, I get the error - "Syntax error - Illegal character" right at the beginning of the MyJavascript.js file. I have also tried adding the BOM character "\ufeff" at the beginning of the above Javascript but I still get the same error.
I know I could create my Javascript file as "alert(1);" and save it with UTF-16 encoding in a text editor, and then the browser runs it fine. However, is there a way I could use "\u" notation (or an alternate escape character) and still get the Javascript to execute?
Thanks,

You are misunderstanding character encoding. Character encoding is a scheme of how characters are represented as bits behind the scenes.
You would not write \u004a in your file to "make it utf-16" as that is literally a sequence of 6 characters:
\, u, 0, 0, 4, a
And if you saved the above as utf-16, it would be represented as the following bytes (shown in hexadecimal, big-endian, without a BOM):
005C0075
00300030
00340061
Had you saved it as utf-8 it would be:
5C753030
3461
Which takes 50% of the space and bandwidth. It takes even less to write that character literally ("J"): just a byte (4A) in utf-8.
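The byte counts above can be checked with a short script. This is a minimal sketch, not part of the original answer; the helper names utf8Hex and utf16beHex are made up for illustration, and it assumes a runtime that provides the standard TextEncoder (current browsers and Node.js do).

```javascript
// Hex dump of a string encoded as UTF-8, using the standard TextEncoder.
function utf8Hex(text) {
  const bytes = new TextEncoder().encode(text);
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0"))
    .join("")
    .toUpperCase();
}

// Hex dump of a string encoded as UTF-16 big-endian without a BOM:
// every 16-bit code unit becomes two bytes, high byte first.
function utf16beHex(text) {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    out += text.charCodeAt(i).toString(16).padStart(4, "0");
  }
  return out.toUpperCase();
}

// The six literal characters \ u 0 0 4 a (the backslash is doubled so
// that the JavaScript parser does not turn the escape into "J").
const escaped = "\\u004a";

console.log(utf16beHex(escaped)); // 005C00750030003000340061 (12 bytes)
console.log(utf8Hex(escaped));    // 5C7530303461 (6 bytes)
console.log(utf8Hex("J"));        // 4A (1 byte)
```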
The "\u"-notation is a way to reference any BMP character by just using a small set of ASCII characters. If you were working with a text editor with no Unicode support, you could write "\u2665" instead of literally writing "♥", and the browser would show it properly.
If you for some weird reason still want to use utf-16, simply write the code normally, save the file as utf-16 and serve it with the proper charset header.
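If a text editor is not at hand, the "save the file as utf-16" step could also be done from a script. The following is only a sketch under assumptions: it relies on Node.js (Buffer is not available in browsers), the helper name encodeUtf16leWithBom is hypothetical, and it picks little-endian with a leading BOM as one common UTF-16 variant.

```javascript
// Encode JavaScript source text as UTF-16LE, prefixed with a byte order
// mark (U+FEFF) so that a reader can detect the byte order.
function encodeUtf16leWithBom(source) {
  return Buffer.from("\ufeff" + source, "utf16le");
}

// The script body is written normally, with no \u escapes around the syntax.
const encoded = encodeUtf16leWithBom("alert(1);");

console.log(encoded.length);                            // 20 (2 BOM bytes + 9 characters * 2)
console.log(encoded.subarray(0, 2).toString("hex"));    // fffe
```

The resulting buffer can be written to disk with fs.writeFileSync(path, encoded). Serving it "with the proper charset header" would then mean a response header along the lines of Content-Type: text/javascript; charset=utf-16, though how that header is configured depends on the server.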

Related

JavaScript/NodeJS RTF CJK Conversions

I'm working on a node module that parses RTF files and does some find and replace. I have already come up with a solution for special characters expressed in escaped unicode here, but have run into a wall when it comes to CJK characters. Is there an easy way to do these conversions in JavaScript, either with a library or built in?
Example:
An RTF file viewed in plain text contains:
Now testing symbols {鈴:200638d}
When parsed in NodeJS, this part of the file looks like:
Now testing symbols \{
\f1 \'e2\'8f
\f0 :200638d\}\
I understand that \f1 and \f0 denote font changes, and the \'e2\'8f block is the actual character... but how can I take \'e2\'8f and convert it back to 鈴, or conversely, convert 鈴 to \'e2\'8f?
I have tried looking up the character in different encodings and am not seeing anything that remotely resembles \'e2\'8f. I understand that the RTF control \'hh is "A hexadecimal value, based on the specified character set (may be used to identify 8-bit values)" (source), or maybe the better definition comes from the Microsoft RTF Spec: "%xHH (OCTET with the hexadecimal value of HH)" (download), but I have no idea what to do with that information to get conversions going on this.

I was able to parse your sample file using my RTF parser and retrieve the correct character.
The key thing is the \fonttbl command, which, as the name suggests, defines the fonts used in the document. As part of the definition of each font, the \fcharset command determines the character set to be used with this font. You need to use this to correctly interpret the character data.
My parser maps the argument of \fcharset to a codeset name here, and this is then translated to a character set name which can be used to retrieve the correct Java Charset here. Your character set handling will obviously be different as you are working in Javascript, but hopefully this information will help you move forward.
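In JavaScript, the same idea might look roughly like the sketch below: collect the bytes from the \'hh escapes of a run, then decode them with the encoding that the run's font selects. The function names are hypothetical, the mapping from an \fcharset number to an encoding label is left to the caller because it is not spelled out here, and which labels TextDecoder accepts depends on the runtime (some Node.js builds only ship a small set).

```javascript
// Collect the byte values of the \'hh escapes found in a run of RTF text.
function rtfHexEscapesToBytes(run) {
  const bytes = [];
  const pattern = /\\'([0-9a-fA-F]{2})/g;
  let match;
  while ((match = pattern.exec(run)) !== null) {
    bytes.push(parseInt(match[1], 16));
  }
  return Uint8Array.from(bytes);
}

// Decode the escaped bytes using an encoding label chosen by the caller.
// The label must correspond to the \fcharset of the font active for this
// run; deriving it from the \fonttbl entry is NOT shown in this sketch.
function decodeRtfRun(run, encodingLabel) {
  return new TextDecoder(encodingLabel).decode(rtfHexEscapesToBytes(run));
}

// String.raw keeps the backslashes exactly as they appear in the RTF file.
const run = String.raw`\f1 \'e2\'8f`;
console.log(Array.from(rtfHexEscapesToBytes(run))); // [ 226, 143 ]
```

Going the other way (character to \'hh) needs an encoder for the same character set, which TextEncoder does not provide for anything but UTF-8, so a library would be needed for that direction.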

Unicode -- What's going on here?

This code:
console.log('😀');
console.log('\uD83D\uDE00');
From HTML script tag:
ðŸ˜€
😀
Run by pasting into the browser console (same browser):
😀
😀
What's going on here that causes the first console.log('😀'); to fail when it's included with a script tag, but work fine when run in the browser console? The obvious problem seems to be that it isn't being converted to a surrogate pair, since the second line works as expected.

Your HTML file is not saved in the same encoding that the HTTP headers or HTML meta tags advertise. The file is interpreted in the wrong encoding, resulting in the wrong characters. That doesn't matter for the Unicode escape sequence, which is pure ASCII, but it does matter for the non-ASCII literal.
Concrete guess: the file is saved as UTF-8 but advertised as ISO-8859-1.
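The effect can be reproduced outside the browser. The sketch below assumes Node.js (it uses Buffer) and writes the emoji with an escape so that the example itself does not depend on how it is saved. Note that Node's "latin1" maps each byte straight to the code point of the same value, whereas browsers treat an ISO-8859-1 label as windows-1252, which is why the page shows Ÿ, ˜ and € for the last three bytes instead of invisible control characters.

```javascript
// U+1F600 written as a code point escape; identical to the surrogate
// pair "\uD83D\uDE00".
const emoji = "\u{1F600}";
console.log(emoji === "\uD83D\uDE00"); // true

// Saving the literal as UTF-8 produces four bytes.
const utf8Bytes = Buffer.from(emoji, "utf8");
console.log(utf8Bytes.toString("hex")); // f09f9880

// Reading those four bytes with a single-byte encoding yields four
// separate characters instead of one emoji; the first one is "ð" (U+00F0).
const misread = utf8Bytes.toString("latin1");
console.log(misread.length);                     // 4
console.log(misread.charCodeAt(0).toString(16)); // f0
```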

json.loads in python containing non standard characters

What I'm trying to do is grab some text in my web application (string literals containing non-English characters such as ă) with Javascript, pass it to an object, use JSON.stringify() on the object, and then pass the result to a Python script.
The Python script is intended to load the JSON data and eventually print the text on a POS printer, so the final data has to be in ASCII hex format of a specific code page, which means I'll have to perform some character conversion either before or after the Python script gets the data.
Basically something like this:
someObject.arrayOfTextInputs.push("how can I handle the ă character");
--run a python script and pass to it: JSON.stringify(someObject) --
In Python:
import json
import sys
jsonStuff = sys.argv[1]
myObject = json.loads(jsonStuff)
Now if I simply pass the strings as is, the Python script hangs upon json.loads because of the ă character. If I replace the character prior to stringify with a '\xNN' representation matching a value from the code page I need, json.loads still hangs.
Same thing for using '\uNNNN'.
I also printed out the JSON I get before handing it over to json.loads(), and normally it just prints out some weird hex? character/image instead of ă.
However, replacing it with its UTF-8 representation (in Javascript), \xc4\x83, makes the print in Python display the character properly (although it creates problems in the next steps).
The same thing happens when replacing ă with \xC7, which is the matching character in code page 852 (Latin-2), and then calling jsonStuff.decode('cp852').
What are my options here?
Edit: Thanks for the welcome!
I am using Python 2.7, which from what I've gathered uses standard ASCII encoding.
If I skip any conversion I get the exception: ValueError: Invalid control character at some character/byte..
If I convert the character in Javascript (before doing stringify() on the object) with an escape matching the same character from a UTF-8 table, "\u0103", I get the same exception.
If I convert the character to some UTF-8 character that falls into the standard ASCII character set ("\u0045"), it loads fine. I guess the decoder can automatically map the Unicode "regular" characters into their ASCII representatives.
Same thing for converting to, say, "\x45".
If I add the strict=False argument to the loads() function, I can load any escaped character, but then I'm not sure how to handle it in my Python script.
I have to admit, the stringify and loads() part really makes me lose track, since I'm starting out with UTF-8 in my IDE, using escape characters from a different encoding, then calling stringify (only UTF-8 encoded stuff is valid JSON?) and passing it into Python, which can't handle UTF-8 past the standard character set. And I have to end up with '\xNN' of a specific code page (say Latin-2) in the end, just before the print.
Should I try to pass anything using strict=False and handle it from there within Python, or is it possible to send everything encoded using some code page?
I'll add some code in a bit.
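One detail that may explain why the "\u0103" attempt changed nothing: inside a Javascript string literal, "\u0103" is simply another way of typing ă, so JSON.stringify still emits the raw character. A possible way to keep the whole payload ASCII, sketched below with a hypothetical helper named toAsciiJson, is to escape the non-ASCII characters after stringifying. This only addresses the transport step and does not cover the later code page conversion.

```javascript
// Stringify a value, then rewrite every non-ASCII UTF-16 code unit as a
// JSON \uXXXX escape so that the result contains only ASCII characters.
// Characters outside the BMP come out as two escapes (a surrogate pair),
// which is how JSON represents them.
function toAsciiJson(value) {
  return JSON.stringify(value).replace(/[\u0080-\uffff]/g, (unit) =>
    "\\u" + unit.charCodeAt(0).toString(16).padStart(4, "0")
  );
}

const someObject = { arrayOfTextInputs: ["how can I handle the \u0103 character"] };

console.log("\u0103" === "ă");        // true: the escape IS the character
console.log(toAsciiJson(someObject)); // {"arrayOfTextInputs":["how can I handle the \u0103 character"]}

// The escaped form is still valid JSON and parses back to the same data.
console.log(JSON.parse(toAsciiJson(someObject)).arrayOfTextInputs[0] ===
  someObject.arrayOfTextInputs[0]); // true
```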

Squared Question Mark Sign on CSV file read from JS

I'm reading a CSV file in my JS, but accented characters (á, ó...) are being replaced with a black square question mark (�).
I always have this sort of problem in PHP, but I'm using JS and I don't know how to fix it.
The problem is in the UTF-8 encoding of the file or of the HTML. Is there a way to fix this in code?
Thanks

This character is U+FFFD, REPLACEMENT CHARACTER, commonly used to replace invalid data in streams thought to be some Unicode encoding.
For example, if you had the text "Résumé" encoded as ISO 8859-1 and wanted to convert it to UTF-16, but told the conversion routine that the text was UTF-8, then the library would probably produce the UTF-16 text "R�sum�" (the other alternative would be to throw an error and not give any results).
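The "Résumé" case can be reproduced with the standard TextDecoder, which exists in current browsers and Node.js. This is a small sketch; the byte values are simply the ISO 8859-1 encoding of the word, written out by hand.

```javascript
// "Résumé" encoded as ISO 8859-1: é is the single byte 0xE9.
const latin1Bytes = Uint8Array.of(0x52, 0xe9, 0x73, 0x75, 0x6d, 0xe9);

// Decoding as UTF-8 (the wrong encoding): 0xE9 announces a multi-byte
// sequence that never arrives, so each é becomes U+FFFD.
const lenient = new TextDecoder("utf-8").decode(latin1Bytes);
console.log(lenient === "R\ufffdsum\ufffd"); // true

// With { fatal: true } the decoder takes the other route mentioned above
// and throws instead of substituting replacement characters.
let threw = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(latin1Bytes);
} catch (err) {
  threw = true;
}
console.log(threw); // true
```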
Another way these may appear is if a web page declares that it is UTF-8 but it is not actually UTF-8. The browser is likely to do the re-encoding described above and the replacement characters will show up in the rendered web-page, but viewing the source with an editor that ignores or disregards the HTML encoding info will show the characters correctly.
From your comments it looks like your process is something like:
Excel -> export to csv -> process csv in js -> produce html
Windows software typically uses the platform's 'encoding for non-Unicode programs' for encoding eight bit text, not UTF-8. So the CSV file is probably Windows CP1252 (If you're using a version of windows set up for most of the western world), and if your javascript program is reading that data and copying it directly into HTML source that's supposed to be UTF-8, that would cause a problem that fits your description.
What you need to do is convert from whatever encoding the CSV is using to UTF-8. Javascript has historically had limited facilities for this (newer runtimes provide TextDecoder), so the simplest option is probably to convert the file after exporting it from Excel but before accessing it in JS.
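If a Node.js step is available between the export and the page, the conversion could be sketched as follows. Assumptions to note: the helper name latin1BytesToUtf8 is made up, Buffer is Node-specific, and Node's "latin1" is plain ISO 8859-1, which agrees with Windows CP1252 for accented letters such as á and ó but not for the bytes 0x80 to 0x9F (curly quotes, the euro sign and similar).

```javascript
// Re-encode bytes that hold ISO 8859-1 text as UTF-8.
function latin1BytesToUtf8(bytes) {
  // Step 1: interpret each byte as the code point of the same value.
  const text = Buffer.from(bytes).toString("latin1");
  // Step 2: encode the resulting string as UTF-8.
  return Buffer.from(text, "utf8");
}

// á is the single byte 0xE1 in ISO 8859-1 and the two bytes C3 A1 in UTF-8.
console.log(latin1BytesToUtf8([0xe1]).toString("hex")); // c3a1

// A CSV fragment "á,ó" survives the round trip as real characters.
console.log(latin1BytesToUtf8([0xe1, 0x2c, 0xf3]).toString("utf8") === "\u00e1,\u00f3"); // true
```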
Other alternatives are to change the encoding the HTML page is using to whatever the csv uses, or to not specify an encoding and leave it up to the browser to guess.

UTF-8 in HTML input added by JavaScript

I just don't get it.
My case is that my application is sending all the needed GUI text by JSON at page startup from my PHP server. On my PHP server I have all text special characters written in UTF-8. Example: F&uuml;r
So on the client side I have exactly the same value, and it gets displayed nicely everywhere except on input fields. When I do this with JavaScript:
document.getElementById('myInputField').value = "F&uuml;r";
Then it is written exactly like that without any transformation into the special character.
Did I understand something wrong in UTF-8 concepts?
Thanks for any hints.

The notation &uuml; has nothing particular to do with UTF-8. The use of character references is a common way of avoiding the need to use UTF-8; they can be used with any encoding, but if you use UTF-8, you don’t need them.
The notation &uuml; is an HTML notation, not JavaScript. Whether it gets interpreted by HTML rules when it appears inside your JavaScript code depends on the context (like JavaScript inside an HTML document vs. separate JavaScript file). This problem is best avoided by using either characters as such or by using JavaScript notations for characters.
For example, &uuml; means the same as &#xfc;, i.e. U+00FC, ü (u with diaeresis). The JavaScript notation, for use inside string literals, for this is \u00fc (\u followed by exactly four hexadecimal digits). E.g., the following sets the value to “Für”:
document.getElementById('myInputField').value = "F\u00fcr";
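The difference between the two notations is visible without any DOM. The short sketch below only compares strings; the variable names are illustrative.

```javascript
// A JavaScript escape is resolved by the JavaScript parser: three characters.
const withEscape = "F\u00fcr";
console.log(withEscape === "Für"); // true
console.log(withEscape.length);    // 3

// An HTML character reference means nothing to JavaScript: it stays a
// plain run of eight characters, ampersand and semicolon included.
const withEntity = "F&uuml;r";
console.log(withEntity.length);         // 8
console.log(withEntity === withEscape); // false
```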

You're using what are called HTML entities to encode characters, which is not the same as UTF-8, but of course a UTF-8 string can include HTML entities.
I think the problem is that a string assigned to the input's value from JavaScript is not parsed as HTML, so entities in it are not decoded, and you have to use some other notation when assigning the text input value. I think you have two options:
Decode the HTML entity on the client side. A quite ugly solution is to piggyback on the decoder available in the browser (I'm using jQuery in the example, but you probably get the point).
inputElement.value = $("<p/>").html("F&uuml;r").text();
Another option, which I think is nicer, is to not send HTML entities in the server response but instead use proper UTF-8 encoding for all characters, which should work fine when put into text nodes or tag attributes. This assumes the HTML page uses UTF-8 encoding, of course.
