Convert ISO/Windows charsets to UTF-8 in JavaScript

I'm developing a Firefox plugin, and I fetch web pages to do some analysis for the user. The problem is that when I fetch pages (via XMLHttpRequest) that are not UTF-8 encoded, the string I see is mangled, for example Hebrew pages in windows-1255 or Chinese pages in gb2312.
I already tried the following:
var uDecoder = Components.classes["@mozilla.org/intl/scriptableunicodeconverter"].getService(Components.interfaces.nsIScriptableUnicodeConverter);
uDecoder.charset = "windows-1255";
alert(xhr.responseText);
var decoder = Components.classes["@mozilla.org/intl/utf8converterservice;1"].getService(Components.interfaces.nsIUTF8ConverterService);
alert(decoder.convertStringToUTF8(xhr.responseText, "WINDOWS-1255", true));
I also tried escape/unescape/encodeURIComponent.
Any ideas?

Once XMLHttpRequest has tried to decode a non-UTF-8 string using UTF-8, you've already lost. The byte sequences in the page that weren't valid UTF-8 sequences will have been mangled (typically converted to �, the U+FFFD replacement character). No amount of re-encoding/decoding will get them back.
Pages that specify a Content-Type: text/html;charset=something HTTP header should be OK. Pages that don't have a real HTTP header but do have a <meta> version of it won't be, because XMLHttpRequest doesn't know about parsing HTML so it won't see the meta. If you know in advance the charset you want, you can tell XMLHttpRequest and it'll use it:
xhr.open(...);
xhr.overrideMimeType('text/html;charset=gb2312');
xhr.send();
(This is a currently non-standardised Mozilla extension.)
If you don't know the charset in advance, you can request the page once, scan the response text for a <meta> charset declaration, parse the charset name out, and request the page again with that charset, as in the sketch below.
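A rough sketch of that two-request approach, using the Mozilla-extension overrideMimeType mentioned above (the function name and sniffing regex here are illustrative, not a robust HTML parser):

// Sketch: fetch once, sniff the <meta> charset, then re-fetch with it.
function fetchWithSniffedCharset(url, callback) {
    var probe = new XMLHttpRequest();
    probe.open("GET", url, true);
    probe.onreadystatechange = function () {
        if (probe.readyState !== 4) return;
        // The ASCII parts of the page survive the bad UTF-8 decode,
        // so the charset name in the <meta> tag is still readable.
        var m = /<meta[^>]+charset\s*=\s*["']?\s*([\w-]+)/i.exec(probe.responseText);
        var charset = m ? m[1] : "utf-8";
        var real = new XMLHttpRequest();
        real.open("GET", url, true);
        real.overrideMimeType("text/html;charset=" + charset); // Mozilla extension
        real.onreadystatechange = function () {
            if (real.readyState === 4) callback(real.responseText);
        };
        real.send(null);
    };
    probe.send(null);
}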
In theory you could get a binary response in a single request:
xhr.overrideMimeType('text/html;charset=iso-8859-1');
and then convert that from bytes-as-chars to UTF-8. However, iso-8859-1 wouldn't work for this because the browser interprets that charset as really being Windows code page 1252.
You could maybe use another codepage that maps every byte to a character, and do a load of tedious character replacements to map every character in that codepage to the character it would have been in real-ISO-8859-1, then do the conversion. Most encodings don't map every byte, but Arabic (cp1256) might be a candidate for this?
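For illustration, here is roughly what the cleanup step could look like if you accepted the windows-1252 interpretation and reversed it by hand. The two encodings differ only in the 0x80-0x9F range, so you map the cp1252 characters back to their original byte values. This sketch uses the standard cp1252 table and assumes the browser passes the five undefined cp1252 bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) through as C1 control characters, which is exactly the fragility that makes a fully-mapped codepage like cp1256 more attractive:

// Reverse map: cp1252 character back to its original byte value.
var CP1252 = {
    "\u20AC": 0x80, "\u201A": 0x82, "\u0192": 0x83, "\u201E": 0x84,
    "\u2026": 0x85, "\u2020": 0x86, "\u2021": 0x87, "\u02C6": 0x88,
    "\u2030": 0x89, "\u0160": 0x8A, "\u2039": 0x8B, "\u0152": 0x8C,
    "\u017D": 0x8E, "\u2018": 0x91, "\u2019": 0x92, "\u201C": 0x93,
    "\u201D": 0x94, "\u2022": 0x95, "\u2013": 0x96, "\u2014": 0x97,
    "\u02DC": 0x98, "\u2122": 0x99, "\u0161": 0x9A, "\u203A": 0x9B,
    "\u0153": 0x9C, "\u017E": 0x9E, "\u0178": 0x9F
};
function cp1252ToBytes(s) {
    var bytes = [];
    for (var i = 0; i < s.length; i++) {
        var c = s.charAt(i);
        // Undo the cp1252 remapping; all other characters already
        // have code points equal to their original byte values.
        bytes.push(CP1252.hasOwnProperty(c) ? CP1252[c] : s.charCodeAt(i));
    }
    return bytes; // raw bytes, ready for a real charset conversion
}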

Related

Is URL encoding sufficient for href attribute for protection against XSS?

According to OWASP, user input into an href attribute should "...except for alphanumeric characters, escape all characters with ASCII values less than 256 with the %HH escaping format."
I don't understand the rationale behind this. Why can't URL encoding do the job? I have spent hours trying to create an attack vector for a URL string that is dynamically generated and presented back to the user, and to me, it seems like a pretty solid protection against XSS attacks.
I've also been looking into this for a while now, and most people advise using URL encoding alongside HTML encoding. I totally get why HTML encoding alone is insufficient, because other vectors such as onclick=alert() can still be used.
Can someone show me an example of an attack vector being used to manipulate an href which is being rendered with URL encoding and without HTML encoding or the encoding suggested by owasp.org in rule #5?
No. If someone injects javascript:alert(0), it will work; no method of encoding will prevent that. You should block the javascript: URI scheme, along with all other URI schemes that would allow XSS there, such as data: and blob:.
Recommended action is not to directly reflect user input into a link.
Additionally, it is important to remember not to block these schemes with a naive exact-match filter such as preg_replace, as line feeds would bypass it and produce an XSS payload. For example: java%0a%0dscript:alert(0);. As you can see, a CRLF was placed in the middle of the payload to prevent PHP (or another server-side language) from recognizing the javascript: scheme you have blocked. But HTML will still render this as javascript:alert(0);, because CRLF is whitespace that HTML ignores within the value of an element's attribute, yet it is significant to the server-side string match. An allowlist check is sketched below.
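A minimal sketch of that allowlist approach in JavaScript (the helper name and scheme list are illustrative; strip the characters HTML ignores before checking, so the CRLF trick above can't slip through):

function isSafeHref(href) {
    // Remove ASCII control characters and spaces before checking;
    // over-stripping is fine for a safety check, and it means
    // "java\nscript:" collapses to "javascript:" and gets caught.
    var cleaned = href.replace(/[\u0000-\u0020]/g, "").toLowerCase();
    var colon = cleaned.indexOf(":");
    var slash = cleaned.indexOf("/");
    // No scheme present: a relative URL, which is safe here.
    if (colon === -1 || (slash !== -1 && slash < colon)) return true;
    // Allow only known-safe schemes rather than blocklisting bad ones.
    return /^(https?|mailto|ftp):/.test(cleaned);
}

isSafeHref("https://example.com/page");            // true
isSafeHref("java\u000A\u000Dscript:alert(0)");     // false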
The encoding is context dependent. When you have a URL inside a HTML document, then you need both URL encoding and HTML encoding, but at different times.
...except for alphanumeric characters, escape all characters with ASCII values less than 256 with the %HH escaping format.
This is recommending URL encoding, but not for the whole URL. The context is inserting URL parameters into the URL: they need to be URL encoded simply to allow, for example, & symbols in a value.
Do not encode complete or relative URL's with URL encoding!
This is a separate rule for the whole URL. Once the URL is assembled and encoded, then when inserting it into an HTML attribute you apply HTML encoding.
You can't apply URL encoding to a complete URL, because it is already URL encoded and encoding it again will result in double encoding, corrupting the URL. For example, any % symbols in the original URL will be wrong.
HTML encoding is needed because characters like the ampersand are valid in URLs but have a different meaning in HTML, where they introduce character entities. It's possible for a URL to contain strings that look like HTML entities but aren't, so they need to be encoded when the URL is inserted into an HTML document.
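Putting the two layers together in JavaScript might look like this (htmlEscape is a hypothetical helper written out here; it is not a built-in):

// Layer 1: URL-encode only the parameter value, so "&" and "%" survive.
var url = "https://example.com/search?q=" + encodeURIComponent("fish & chips 50%");

// Layer 2: HTML-encode the finished URL when writing it into an attribute.
function htmlEscape(s) {
    return s.replace(/&/g, "&amp;").replace(/</g, "&lt;")
            .replace(/>/g, "&gt;").replace(/"/g, "&quot;");
}
var markup = '<a href="' + htmlEscape(url) + '">search</a>';
// markup: <a href="https://example.com/search?q=fish%20%26%20chips%2050%25">search</a>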

How to force browsers not to normalize a unicode URL?

Most browsers, such as Firefox and Chrome, do Unicode normalization on URLs before requesting them. For example, when chrome or firefox want to open this link:
http://fa.wikipedia.org/wiki/سید_محمد_خاتمی
which contains persian Unicode characters, they automatically convert this string into:
http://fa.wikipedia.org/wiki/%D8%B3%DB%8C%D8%AF_%D9%85%D8%AD%D9%85%D8%AF_%D8%AE%D8%A7%D8%AA%D9%85%DB%8C
I want to modify the hyperlinks on my website in a way that prevents browsers from normalizing Unicode characters, such that when a user clicks on a link, its pure (original) URL is requested from the server.
Is there any trick for that? E.g. a small javascript code in the source page that links to such URLs.
UPDATE: When I request the URL from a programming language, e.g. Java's HttpURLConnection, it requests the original URL and does not apply any normalization (unless I explicitly call UrlNormalizer.normalize(url)). However, most browsers and Linux's GET command do the normalization.
For example, when chrome or firefox want to open this link: http://fa.wikipedia.org/wiki/سید_محمد_خاتمی
That's not a valid URI. It's an IRI. Web browsers and other client tools that support IRI will convert it to the ASCII-only URI form (percent-UTF-8-encoded paths and Punycode-encoded hostnames) for you behind the scenes.
When I request the url by a programming language, e.g. Java's HttpURLConnection, it requests the original URL
HttpURLConnection doesn't support IRI. It tries to send the URI as-is anyway, but it should really have rejected it for being invalid.
I want to modify the hyperlinks in my website in a way to prevent browsers from normalizing unicode characters, such that when a user clicks on a linke, its pure (original) URL is requested from the server.
It is not valid according to the HTTP standard to send raw non-ASCII bytes in the request-line (RFC7230 absolute path -> RFC3986 segment). Web servers do different, unpredictable things when presented with such invalid requests. It is at all times best avoided.
There is no way to tell IRI-aware browsers to ignore proper behaviour and send non-ASCII request lines, but why would you want to? What are you trying to do here?
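For completeness, if you just need to know the exact URI form the browser will request, you can compute it yourself in JavaScript. A sketch: encodeURI percent-UTF-8-encodes the path, and the newer URL API, where available, also Punycode-encodes a non-ASCII hostname (the hostname in the second example is illustrative):

encodeURI("http://fa.wikipedia.org/wiki/سید_محمد_خاتمی");
// "http://fa.wikipedia.org/wiki/%D8%B3%DB%8C%D8%AF_%D9%85%D8%AD%D9%85%D8%AF_%D8%AE%D8%A7%D8%AA%D9%85%DB%8C"

new URL("http://مثال.example/سید").href;
// "http://xn--mgbh0fb.example/%D8%B3%DB%8C%D8%AF"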

External javascript writes in latin-1

I'm a bit stuck, given my page includes an external JavaScript which uses document.write. The problem is my page is UTF-8 encoded, and the contents written are encoded in latin-1, which causes some display problems.
Is there any way to handle this ?
I have to admit never having had to mix encodings, but in theory you should be able to specify the charset attribute on the script tag; just be sure you're not conflicting with it when serving the external file. From the HTML spec:
The charset attribute gives the character encoding of the external script resource...its value must be a valid character encoding name, must be an ASCII case-insensitive match for the preferred MIME name for that encoding, and must match the encoding given in the charset parameter of the Content-Type metadata of the external file, if any.
So that will tell the browser how to interpret the script data, provided your server provides the same charset (or doesn't supply any charset) in the Content-Type header when serving up the script file.
Once the browser is reading the script with the right charset, you should be okay, because by the time JavaScript is dealing with strings, they're UTF-16 (according to Section 8.4 of the 5th edition spec).
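Concretely, the combination might look like this (file name illustrative), with the server either sending charset=ISO-8859-1 for the script or omitting the charset parameter entirely:

<!-- The page itself is UTF-8... -->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<!-- ...but the external script is declared as Latin-1, so the browser
     decodes it correctly before it ever becomes JavaScript strings. -->
<script src="legacy-writer.js" charset="ISO-8859-1"></script>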

iPhone browser/IIS/Tomcat, Japanese locale, http parameters getting messed

First the environment: the client is a mobile Safari on iPhone, the server consists of a Tomcat 5.5 fronted by IIS.
I have a piece of javascript code that sends a single parameter to the server and gets back some response:
var url = "/abc/ABCServlet";
var paramsString = "name=SomeName";
xmlhttpobj = getXmlHttpObject(); //Browser specific object returned
xmlhttpobj.onreadystatechange = callbackFunction;
xmlhttpobj.open("GET", url + "?" + paramsString, true);
xmlhttpobj.send(null);
This works fine when the iPhone language/locale is EN/US; but when the locale/language is changed to Japanese the query parameter received by the server becomes "SomeName#" without the quotes. Somehow a # is getting appended at the end.
Any clues why?
Hopefully, all you need to do is add a meta tag to the top of your HTML page that specifies the correct character set (e.g. <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />) and match whatever encoding your datafiles are expecting.
If that doesn't work, ensure that you are using the same character encoding (preferably UTF-8) throughout your application. Your server-side scripts and any files that include text strings you will be adding directly to the response stream should be saved with that single encoding. It's a good idea to have your servers send a "Content-Type" HTTP header of the same encoding if possible (e.g. "text/html; charset=utf-8"). And you should ensure that the mobile safari page that's doing the displaying has the right Content-Type meta tag.
Japanese developers have a nasty habit of storing files in EUC or ISO-2022-JP, both of which can force some browsers to use different font faces and can seriously break your page if the browser is expecting a Roman charset. The good news is that if you're forced to use one of the Japanese encodings, that encoding will typically display English text correctly. It's the extended characters you need to look out for.
Now I may be wrong, but I THOUGHT that loading these files via AJAX was not a problem (I think the browser remaps the character data according to the character set for every text file it loads), but as you start mixing document encodings in a single file (and especially in your document body), bad things can happen. Maybe mobile safari requires the same encoding for both HTML files and AJAX files. I hope not. That would be ugly.
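On the client side, one hedge that often helps is to percent-encode the parameter value explicitly, so the bytes on the wire are the same regardless of the page's charset or the device locale. A tweak of the code from the question:

var url = "/abc/ABCServlet";
// encodeURIComponent always emits percent-encoded UTF-8, independent
// of the page's encoding or the iPhone's language/locale setting.
var paramsString = "name=" + encodeURIComponent("SomeName");
xmlhttpobj = getXmlHttpObject(); // browser-specific object, as in the question
xmlhttpobj.onreadystatechange = callbackFunction;
xmlhttpobj.open("GET", url + "?" + paramsString, true);
xmlhttpobj.send(null);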

Having encoded a unicode string in javascript, how can I decode it in Python?

Platform: App Engine
Framework: webapp / CGI / WSGI
On my client side (JS), I construct a URL by concatenating a base URL with a Unicode string:
http://www.foo.com/地震
then I call encodeURI to get
http://www.foo.com/%E5%9C%B0%E9%9C%87
and I put this in an HTML form value.
The form gets submitted to PayPal, where I've set the encoding to 'utf-8'.
PayPal then (through IPN) makes a post request on the said URL.
On my server side, WSGIApplication tries to extract the unicode string using a regular expression I've defined:
(r'/paypal-listener/(.+?)', c.PayPalIPNListener)
I'd try to decode it by calling
query = unquote_plus(query).decode('utf-8')
(or a variation) but I'd get the error
/paypal-listener/%E5%9C%B0%E9%9C%87
... (omitted) ...
'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
(the first line is the request URL)
When I check the length of query, Python says it has length 18, which suggests to me that '%E5%9C%B0%E9%9C%87' has not been decoded in any way.
In principle this should work:
>>> urllib.unquote_plus('http://www.foo.com/%E5%9C%B0%E9%9C%87').decode('utf-8')
u'http://www.foo.com/\u5730\u9707'
However, note that:
unquote_plus is for application/x-www-form-urlencoded data such as POSTed forms and query string parameters. In the path part of a URL, + means a literal plus sign, not a space, so you should use plain unquote here.
You shouldn't generally unquote a whole URL. Characters that have special meaning in a component of the URL will be lost. You should split the URL into parts, get the single pathname component (%E5%9C%B0%E9%9C%87) that you are interested in, and then unquote it.
(If you want to fully convert a URI to an IRI like http://www.foo.com/地震 things are a bit more complicated. Only the path/query/fragment part of an IRI is UTF-8-%-encoded; the domain name is mapped between Unicode and bytes using the oddball ‘Punycode’ IDN scheme.)
This gets received in my python server side.
What exactly is your server-side? Server, gateway, framework? And how are you getting the url variable?
You appear to be getting a UnicodeEncodeError, which is about unexpected non-ASCII characters in the input to the unquote function, not a decoding problem at all. So I suggest that something has already decoded the path part of your URL to a Unicode string of some sort. Let's see the repr of that variable!
There are unfortunately a number of serious problems with several web servers that makes using Unicode in the pathname part of a URL very unreliable, not just in Python but generally.
The main problem is that the PATH_INFO variable is defined (by the CGI specification, and subsequently by WSGI) to be pre-decoded. This is a dreadful mistake, partly because it means you can't tell an encoded %2F apart from a real slash in a path part, but more seriously because decoding a %-sequence introduces a Unicode decode step that is out of the hands of the application. Server environments differ greatly in how non-ASCII %-escapes in the URL are handled, and it is often impossible to recreate the exact sequence of bytes that the web browser passed in.
IIS is a particular problem in that it will try to parse the URL path as UTF-8 by default, falling back to the wildly-unreliable system default codepage (eg. cp1252 on a Western Windows install) if the path isn't a valid UTF-8 sequence, but without telling you. You are then likely to have fairly severe problems trying to read any non-ASCII characters in PATH_INFO out of the environment variables map, because Windows envvars are Unicode but are accessed by Python 2 and many others as bytes in the system codepage.
Apache mitigates the problem by providing an extra non-standard environ REQUEST_URI that holds the original, completely undecoded URL submitted by the browser, which is easy to handle manually. However if you are using URL rewriting or error documents, that unmapped URL may not match what you thought it was going to be.
Some frameworks attempt to fix up these problems, with varying degrees of success. WSGI 1.1 is expected to make a stab at standardising this, but in the meantime the practical position we're left in is that Unicode paths won't work everywhere, and hacks to try to fix it on one server will typically break it on another.
You can always use URL rewriting to convert a Unicode path into a Unicode query parameter. Since the QUERY_STRING environ variable is not decoded outside of the application, it is much easier to handle predictably.
Assuming the HTML page is encoded in UTF-8, it should just be a simple path.decode('utf-8') if the framework decodes the URL's percent escapes.
If it doesn't, you could use:
urllib.unquote(path).decode('utf-8') if the URL is http://www.foo.com/地震
urllib.unquote_plus(path).decode('utf-8') if you're talking about a parameter sent via AJAX or in an HTML <form>
(see http://docs.python.org/library/urllib.html#urllib.unquote)
EDIT: Please supply us with the following information if you're still having problems to help us track this problem down:
Which web framework you're using inside of google app engine, e.g. Django, WebOb, CGI etc
How you're getting the URL in your app (please add a short code sample if you can)
repr(url) of when you add http://www.foo.com/地震 as the URL
Try adding this as the URL and post repr(url) so we can make sure the server isn't decoding the characters as either latin-1 or Windows-1252:
http://foo.com/¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
EDIT 2: Seeing as it's an actual URL (and not in the query section i.e. not http://www.foo.com/?param=%E5%9C%B0%E9%9C%87), doing
query = unquote(query.encode('ascii')).decode('utf-8')
is probably safe. It should be unquote and not unquote_plus if you're decoding the actual URL though. I don't know why google passes the URL as a unicode object but I doubt the actual URL passed to the app would be decoded using windows-1252 etc. I was a bit concerned as I thought it was decoding the query incorrectly (i.e. the parameters passed to GET or POST) but it doesn't seem to be doing that by the looks of it.
Usually there is a function in server-side languages to decode URLs; there is one in Python as well. You can also use JavaScript's decodeURIComponent() function in your case.
urllib.unquote() doesn't like a unicode string in this case. Pass it a byte string and decode afterwards to get unicode.
This works:
>>> u = u'http://www.foo.com/%E5%9C%B0%E9%9C%87'
>>> print urllib.unquote(u.encode('ascii'))
http://www.foo.com/地震
>>> print urllib.unquote(u.encode('ascii')).decode('utf-8')
http://www.foo.com/地震
This doesn't (see also urllib.unquote decodes percent-escapes with Latin-1):
>>> print urllib.unquote(u)
http://www.foo.com/å °é
Decoding a string that is already unicode doesn't work:
>>> print urllib.unquote(u).decode('utf-8')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File ".../lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 19-24: ordinal not in range(128)
Check out this approach:
var uri = "https://rasamarasa.com/service/catering/ගාල්ල-Galle";
var uri_enc = encodeURIComponent(uri);
var uri_dec = decodeURIComponent(uri_enc);
var res = "Encoded URI: " + uri_enc + "<br>" + "Decoded URI: " + uri_dec;
document.getElementById("demo").innerHTML = res;
For more, see this reference:
https://www.w3schools.com/jsref/jsref_decodeuricomponent.asp
Aaaah, the dreaded
'ascii' codec can't encode characters in position... ordinal not in range
error. It's hard to avoid when dealing with languages like Japanese in Python.
This is not a URL encode/decode issue in this case. Your data is most likely already decoded and ready to go.
I would try getting rid of the call to decode and see what happens. If you get garbage but no error, it probably means people are sending you data in one of the other lovely Japanese-specific encodings: EUC-JP, ISO-2022-JP, Shift_JIS, or perhaps even the elusive ISO-2022-JP-EXT, which is nowadays only rarely spotted in the wild. That latter case seems pretty unlikely, though.
Edit: I'd also take a look at this for reference:
What is the difference between encode/decode?
