JavaScript string internal representation

As far as I know, Java uses UTF-16 to represent chars and strings internally,
so if we load text from a file, it is automatically converted from its original encoding to UTF-16.
The same can be said for JavaScript:
it also uses UTF-16 as its internal string representation.
Suppose we load a string x encoded in UTF-8 using Ajax;
a conversion takes place so that JavaScript can represent that string internally in UTF-16.
Please tell me whether what I have stated is correct,
because the real question is yet to come...
Now suppose the browser is rendering a page using UTF-8 encoding,
and using JavaScript we want the browser to also render the Ajax string x (as you normally do).
Would a further conversion from UTF-16 to UTF-8 be needed in this case?
Thanks in advance.

According to this article, it is UCS-2 or UTF-16.
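A minimal sketch of what the question describes, assuming a runtime that provides the standard `TextDecoder` global (modern browsers and recent Node.js). The byte array stands in for a UTF-8 Ajax response body, and the helper name `codeUnits` is made up for illustration. It shows that, once decoded, the string is observed from script as a sequence of 16-bit code units, whatever the wire encoding was.

```javascript
// UTF-8 bytes for "ă" (U+0103), standing in for an Ajax response body.
const utf8Bytes = new Uint8Array([0xc4, 0x83]);

// Decoding turns the bytes into an ordinary JS string.
const x = new TextDecoder("utf-8").decode(utf8Bytes);

// Hypothetical helper: list the 16-bit code units of a string as hex.
function codeUnits(s) {
  const units = [];
  for (let i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i).toString(16));
  }
  return units;
}

// Two bytes in UTF-8, but a single code unit in the string.
const singleUnit = codeUnits(x);

// A character outside the Basic Multilingual Plane takes two code units
// (a surrogate pair), which is why .length reports 2 for it.
const surrogatePair = codeUnits("\u{1F600}");
```

Rendering `x` into the page (for example through `textContent`) passes a string to the DOM, not bytes, so the page's own UTF-8 encoding does not require a manual conversion step in script.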

Related

Javascript string compression for URL hash parameter

I'm looking to store a lot of data in a URL hash parameter without exceeding URL character limits.
Are there any conventional ways of compressing string length which could be then decoded on another page load?
I've seen LZW encoding used for similar solutions, however would special characters be valid for this use?
LZW encoding technically works; you'll just need to convert the LZW-encoded binary into URL-safe base64, so that the output doesn't contain special characters. Here's an MDN article on base64 in JavaScript; the URL-safe variant of base64 just replaces + with - and / with _. Of course, you're not likely to reduce the size of your string by much by doing this, unless the data you want to store is extremely compressible.
You can look at smaz or shoco, which are designed for the compression of short strings. Most compression methods don't really get rolling until well after your URL length limit, so you need a specialized compressor for this case if you expect to get any gain. You can then encode the binary result using a scheme like Base 64 or a more efficient coding that uses all of the URI-safe characters.
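The URL-safe base64 step mentioned above can be sketched as follows. This assumes the compressed data is already available as a `Uint8Array` and that `btoa`/`atob` exist (browsers and recent Node.js). The function names are made up for illustration, and stripping the `=` padding is an extra choice that the answers above do not mention.

```javascript
// Encode raw bytes as URL-safe base64: "+" becomes "-", "/" becomes "_".
function toUrlSafeBase64(bytes) {
  let binary = "";
  for (const b of bytes) {
    binary += String.fromCharCode(b); // one char per byte, as btoa expects
  }
  return btoa(binary)
    .replace(/\+/g, "-")
    .replace(/\//g, "_")
    .replace(/=+$/, ""); // optional: drop padding to save characters
}

// Reverse the mapping, restore the padding, and return the bytes.
function fromUrlSafeBase64(text) {
  const padding = "=".repeat((4 - (text.length % 4)) % 4);
  const standard = text.replace(/-/g, "+").replace(/_/g, "/") + padding;
  const binary = atob(standard);
  return Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
}

// Example: these three bytes would produce "+//+" in standard base64.
const encoded = toUrlSafeBase64(new Uint8Array([0xfb, 0xff, 0xfe]));
const decoded = fromUrlSafeBase64(encoded);
```

Base64 output is about a third larger than its input, so this only pays off when the compression step saves more than that.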

json.loads in python containing non standard characters

What I'm trying to do is grab some text in my web application (string literals containing non-English characters such as ă) with JavaScript, pass it to an object, use JSON.stringify() on the object, and then pass the result to a Python script.
The Python script is intended to load the JSON data and eventually print the text on a POS printer, so the final data has to be in ASCII hex format for a specific code page. This means I'll have to perform some character conversion either before or after the Python script gets the data.
Basically something like this:
someObject.arrayOfTextInputs.push("how can I handle the ă character");
--run a python script and pass to it: JSON.stringify(someObject) --
In Python:
jsonStuff = sys.argv[1]
myObject= json.loads(jsonStuff)
Now if I simply pass the strings as is, the Python script hangs upon json.loads because of the ă character. If I replace the character, prior to stringify, with a '\xNN' representation matching a value from the code page I need, json.loads still hangs.
The same goes for using '\uNNNN'.
I also printed out the JSON I get before handing it over to json.loads(), and normally it just prints some weird hex (?) character/image instead of ă.
However, replacing it with its UTF-8 representation (in JavaScript), \xc4\x83, makes the print in Python display the character properly (although it creates problems in the next steps).
The same thing happens when replacing ă with \xC7, which is the matching character in code page 852 (Latin-2), and then calling jsonStuff.decode('cp852').
What are my options here?
Edit: Thanks for the welcome!
I am using Python 2.7, which from what I've gathered uses standard ASCII encoding.
If I skip any conversion, I get the exception: ValueError: Invalid control character at some character/byte..
If I convert the character in JavaScript (before calling stringify() on the object) to an escape matching the same character from a UTF-8 table, "\u0103", I get the same exception.
If I convert the character to some UTF-8 character that falls into the standard ASCII character set ("\u0045"), it loads fine. I guess the decoder can automatically map the "regular" Unicode characters to their ASCII representatives.
The same goes for converting to, say, "\x45".
If I add the strict=False argument to the loads() function, I can load any escaped character, but then I'm not sure how to handle it in my Python script.
I have to admit, the stringify and loads() part really makes me lose track, since I'm starting out with UTF-8 in my IDE, using escape characters from a different encoding, then calling stringify (is only UTF-8 encoded data valid JSON?) and passing it into Python, which can't handle UTF-8 past the standard character set. And I have to end up with '\xNN' of a specific code page (say Latin-2) just before the print.
Should I try to pass anything using strict=False and handle it from there within Python, or is it possible to send everything encoded using some code page?
I'll add some code in a bit.
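One option on the JavaScript side, sketched here under assumptions and not verified against the setup in the question, is to make the JSON text pure ASCII by escaping every non-ASCII code unit as `\uXXXX` after calling `JSON.stringify()`. The function name `toAsciiJson` is made up for illustration. The result is still valid JSON, so `json.loads` should return Unicode strings that the Python script can then encode to the printer's code page. Whether this removes the ValueError depends on how the argument reaches the script.

```javascript
// Serialize to JSON, then escape every code unit above 0x7F as \uXXXX so the
// resulting text contains only ASCII characters.
function toAsciiJson(value) {
  return JSON.stringify(value).replace(/[\u0080-\uffff]/g, (unit) =>
    "\\u" + unit.charCodeAt(0).toString(16).padStart(4, "0")
  );
}

const someObject = { arrayOfTextInputs: [] };
someObject.arrayOfTextInputs.push("how can I handle the ă character");

// ASCII-only JSON text, suitable for passing as a command-line argument.
const payload = toAsciiJson(someObject);
```

Note that writing `"\u0103"` inside a JavaScript string literal is not the same thing: the engine turns that escape into the character ă before `JSON.stringify()` ever sees it, so the escaping has to happen on the serialized text.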

Decoding cp1251 to UTF-8 in javascript

How to decode cp-1251 to UTF-8 in javascript?
The cp-1251 data comes from a datafeed and has to be decoded on the JS client side.
There is no way to change the server-side output, since it belongs to a 3rd party, and for some reason I would rather not use any server-side programming to convert the datafeed into another datafeed.
(Assuming that by "UTF-8" you meant the JS strings in their native encoding...)
Depending on the format your 'cp-1251' data is in and depending on the browsers you need to support, you can choose from:
TextDecoder.decode() API (decodes a sequence of octets from a typed array, like Uint8Array) - if you're using web sockets, you can get an ArrayBuffer out of it to decode.
https://github.com/mathiasbynens/windows-1251 operates on something it calls 'byte strings' (JS Strings consisting of characters like \u00XY, where 0xXY is the encoded byte).
Build the decoding table yourself (example).
Note that in most cases (not something as low-level as websockets though) it might be easier to read the data in the correct encoding before it ends up as a JS string (for example, you can force XMLHttpRequest to use a certain encoding even if the server misreports the encoding).
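The "build the decoding table yourself" option can be sketched as below. This is a deliberately partial table, not a full windows-1251 implementation: it covers ASCII, the contiguous Cyrillic block at 0xC0 to 0xFF, and the letters Ё and ё, and maps every other byte to U+FFFD. The function name is made up for illustration, and a real decoder would need the remaining 0x80 to 0xBF entries.

```javascript
// Partial cp1251 decoder: bytes in, JS string out.
function decodeCp1251Partial(bytes) {
  let out = "";
  for (const b of bytes) {
    if (b < 0x80) {
      out += String.fromCharCode(b); // ASCII range is unchanged
    } else if (b >= 0xc0) {
      // 0xC0..0xFF map linearly onto U+0410 (А) .. U+044F (я)
      out += String.fromCharCode(0x0410 + (b - 0xc0));
    } else if (b === 0xa8) {
      out += "\u0401"; // Ё
    } else if (b === 0xb8) {
      out += "\u0451"; // ё
    } else {
      out += "\ufffd"; // not covered by this partial table
    }
  }
  return out;
}

// "Привет" encoded as cp1251 bytes.
const greeting = decodeCp1251Partial(
  new Uint8Array([0xcf, 0xf0, 0xe8, 0xe2, 0xe5, 0xf2])
);
```

Where `TextDecoder` supports the `windows-1251` label, it covers the full table and is the simpler choice.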

Squared Question Mark Sign on CSV file read from JS

I'm reading a CSV file in my JS, but characters with accents (á, ó...) are being replaced with a black square question mark (�).
I always have this sort of problem in PHP, but here I'm using JS and I don't know how to fix it.
The problem is in the UTF-8 encoding of the file or of the HTML. Is there a way to fix this in code?
Thanks
This character is U+FFFD, REPLACEMENT CHARACTER, commonly used to replace invalid data in streams thought to be some Unicode encoding.
For example, if you had the text "Résumé" encoded as ISO 8859-1 and wanted to convert it to UTF-16, but told the conversion routine that the text was UTF-8, then the library would probably produce the UTF-16 text "R�sum�" (the other alternative would be to throw an error and not give any results).
Another way these may appear is if a web page declares that it is UTF-8 but it is not actually UTF-8. The browser is likely to do the re-encoding described above and the replacement characters will show up in the rendered web-page, but viewing the source with an editor that ignores or disregards the HTML encoding info will show the characters correctly.
From your comments it looks like your process is something like:
Excel -> export to csv -> process csv in js -> produce html
Windows software typically uses the platform's 'encoding for non-Unicode programs' for encoding eight-bit text, not UTF-8. So the CSV file is probably Windows CP1252 (if you're using a version of Windows set up for most of the western world), and if your JavaScript program is reading that data and copying it directly into HTML source that's supposed to be UTF-8, that would cause a problem that fits your description.
What you need to do is convert from whatever encoding the CSV is using to UTF-8. JavaScript doesn't really have the facilities to do this, so your best bet is probably to convert the file after exporting it from Excel but before accessing it in JS.
Other alternatives are to change the encoding the HTML page is using to whatever the csv uses, or to not specify an encoding and leave it up to the browser to guess.
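The "Résumé" example above can be reproduced with a short sketch, assuming a runtime with the standard `TextDecoder` global. The bytes are the ISO 8859-1 encoding of "Résumé". Decoding them as UTF-8 yields replacement characters, while mapping each byte straight to the code point of the same value (which is how ISO 8859-1 is defined) recovers the text.

```javascript
// "Résumé" in ISO 8859-1: é is the single byte 0xE9.
const latin1Bytes = new Uint8Array([0x52, 0xe9, 0x73, 0x75, 0x6d, 0xe9]);

// Wrong assumption: treat the bytes as UTF-8. 0xE9 starts a multi-byte
// sequence that is never completed, so the decoder emits U+FFFD instead.
const misdecoded = new TextDecoder("utf-8").decode(latin1Bytes);

// Right assumption: every ISO 8859-1 byte equals its Unicode code point.
const recovered = String.fromCharCode(...latin1Bytes);
```

Once the replacement characters are in the string, the original bytes are lost, so the fix has to happen at the decoding step and not afterwards.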

What function can I use in JavaScript to convert from one character encoding to another?

Is there a built-in or user-defined function that I can use in JavaScript or jQuery to convert from one character encoding to another?
For Example,
FROM "utf-8" TO "windows-1256"
OR
FROM "windows-1256" TO "utf-8"
A practical use case: you have a PHP page with a specific character encoding such as "windows-1256" that you cannot change because of business needs, and you use Ajax to fetch a block of data from the database as JSON, which uses "utf-8" encoding only. You then need to convert the JSON output to the page's encoding so that the characters and strings are displayed correctly.
Thanks in advance.
From the standpoint of a JavaScript runtime environment, there's really no such thing as character encodings – the messiness of encodings is abstracted away from you. By spec, all JS source text is interpreted as Unicode characters, and all Strings are Unicode.
As such, there's no way in JavaScript to represent characters in anything other than Unicode. Look at the methods available on a String instance – you'll see there's nothing related to character encoding.
Because JavaScript strings are Unicode, AJAX data sent through jQuery goes over the wire in a Unicode encoding, namely UTF-8. From the jQuery AJAX docs:
Data will always be transmitted to the server using UTF-8 charset; you must decode this appropriately on the server side.
Your PHP script is going to have to cope with Unicode input from AJAX calls.
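A small sketch of what the server can expect to receive, using only standard built-ins (`encodeURIComponent` and, where available, `TextEncoder`). The helper name `utf8Bytes` is made up for illustration. Both produce UTF-8 regardless of the encoding declared by the page, which is why the decoding has to happen on the PHP side.

```javascript
// Hypothetical helper: the UTF-8 bytes of a string as a plain array.
function utf8Bytes(text) {
  return Array.from(new TextEncoder().encode(text));
}

// The Arabic letter ain (U+0639) as it would appear in a form-encoded body.
const formValue = encodeURIComponent("\u0639");

// The same character as raw UTF-8 bytes.
const rawBytes = utf8Bytes("\u0639");
```

Here `formValue` is the percent-encoded form of the same two bytes that `rawBytes` holds.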
