This code downloads a sample latin1/ISO-8859-1 encoded file and saves it to disk. Open that file and you'll see the strange question mark characters �. https://stackoverflow.com/a/3527176/779159 explains it's because of the wrong encoding being applied, and latin1 should fix it.
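const fs = require('fs')
const request = require('request')
const url = 'http://vancouver-webpages.com/multilingual/french.asis'
// encoding: null keeps the response as raw Buffers (no string decoding)
request.get(url, { encoding: null })
  .pipe(fs.createWriteStream('/tmp/file.txt', { defaultEncoding: 'latin1' }))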
But using the request and fs modules, I can't get it to save in latin1 encoding. How do I fix this code?
Node v8.1.4 should support latin1 (aka 'binary') as one of its encodings for Buffer. I just tested your code and it actually works fine. I use Atom as my text editor and, initially, it assumed the file was UTF-8, so the question mark characters appeared. When I switched from UTF-8 to 'Auto-Detect', everything appeared okay.
Atom then reported 'Windows 1252' as the encoding, but the file displays the same way if I select 'ISO 8859-1'. So make sure that whatever editor you are using detects the character encoding correctly. It is not Node's fault!
By the way, an interesting thing to note, according to the docs for Node v8.1.4, in one of the sections for Buffer:
Today's browsers follow the WHATWG spec which aliases both 'latin1'
and ISO-8859-1 to win-1252. This means that while doing something like
http.get(), if the returned charset is one of those listed in the
WHATWG spec it's possible that the server actually returned
win-1252-encoded data, and using 'latin1' encoding may incorrectly
decode the characters.
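To reproduce the editor effect described above without downloading anything, here is a minimal sketch in Node. The sample bytes are an assumption chosen for illustration (the word "café" in ISO-8859-1); they are not taken from the file in the question.

```javascript
// Bytes for "café" in ISO-8859-1: 0xE9 is "é" (illustrative sample, not from french.asis).
const bytes = Buffer.from([0x63, 0x61, 0x66, 0xe9]);

// Decoding with the encoding the bytes were written in gives the expected text.
const asLatin1 = bytes.toString('latin1');

// Decoding the same bytes as UTF-8 fails on 0xE9, which becomes U+FFFD,
// the replacement character that editors render as a question mark symbol.
const asUtf8 = bytes.toString('utf8');

console.log(asLatin1);               // café
console.log(asUtf8 === 'caf\uFFFD'); // true
```

The bytes on disk are identical in both cases; only the decoding applied by the reader differs.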
Related
I am developing in Node.js.
I read the text content of a .txt file using the fs module. The file contains:
Привет мир
Hello world
Hello world
...
Then I want to send this content to the user with res.end(fileString).
But my browser can't decode the Russian characters; only the English words display correctly.
First and foremost, be sure that the browser is set correctly to use UTF-8 encoding.
If the text still does not show correctly, you can try an alternate encoding for sending the data, depending on how you're doing this.
As an example, you can encode the Russian in base64 prior to sending it over to the browser, and decode it on the browser side.
On the Node.js side, you can use Buffer to encode the UTF-8 string to base64.
On the client/browser side, the atob() and btoa() functions are used rather than Buffer. atob() decodes base64 into a binary string (one character per byte), so those bytes must still be decoded as UTF-8, for example with TextDecoder, before the browser can show the text correctly.
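The base64 round trip described above can be sketched in Node as follows. Both halves use Buffer here so the example runs in one process; in a real page the receiving half would run in the browser. The variable names are illustrative.

```javascript
const text = 'Привет мир';

// Sending side: UTF-8 string -> bytes -> base64 (ASCII-only, safe in any charset).
const encoded = Buffer.from(text, 'utf8').toString('base64');

// Receiving side: base64 -> bytes -> UTF-8 string.
const rawBytes = Buffer.from(encoded, 'base64');
const decoded = rawBytes.toString('utf8');

// Treating each byte as one character (which is what a binary string is) gives mojibake.
const misread = rawBytes.toString('latin1');

console.log(decoded === text); // true
console.log(misread === text); // false
```

If the only goal is to display the text, declaring the charset in the response, for instance with res.setHeader('Content-Type', 'text/plain; charset=utf-8'), is usually simpler than base64.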
I have a problem trying to set a route in Node JS with Express framework.
My route is this one:
app.get('/campaña/nueva', sms.nueva);
But I can't get it to work because of the evil "Ñ" (it works with an "N", though).
I used CodeIgniter for a while, and there you can configure which characters are allowed.
Do you know of any workaround or way to enable it in Node?
I think you'll need to handle both a URL-encoded and perhaps a UTF-8 (and possibly Latin-1 also) variant. Check the following:
How are your clients (browsers) sending the URL?
URL encoded as %C3%B1 ?
chrome and firefox send the %C3%B1 encoding
I would presume this is the dominant and compliant behavior
Unicode ?
I tested with curl and it looks to send a single character which I presume is just whatever encoding it got from my terminal, which is probably UTF-8.
Based on that, try adjusting your route. You could use a regex or an explicit list:
app.get('/campaña/nueva', sms.nueva)
app.get('/campa%c3%b1a/nueva', sms.nueva)
//Or for convenience if you like
app.get('/' + encodeURIComponent('campaña') + '/nueva', sms.nueva)
My guess is ultimately most browsers are going to send the URL-encoded versions, so you can probably get by with just that last version.
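As a quick check of what the browser actually sends, here is a small sketch using only the built-in encodeURIComponent and decodeURIComponent. The routePath variable is a hypothetical name for illustration.

```javascript
// What a browser puts on the wire for the path segment "campaña".
const encodedSegment = encodeURIComponent('campaña');
console.log(encodedSegment); // campa%C3%B1a

// Percent-escapes are case-insensitive hex, so both spellings decode to the same text.
const fromUpper = decodeURIComponent('campa%C3%B1a');
const fromLower = decodeURIComponent('campa%c3%b1a');
console.log(fromUpper === fromLower); // true

// Hypothetical helper value: build the encoded route path once instead of typing it by hand.
const routePath = '/' + encodedSegment + '/nueva';
console.log(routePath); // /campa%C3%B1a/nueva
```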
I ran into the same problem with $ in my route. URL encoded character doesn't work in my case, but escaped one works.
So I ended up with
app.get('/\\$myRoute', function (req, res) {
  // handle the request
})
The file input, in Jade:
input#upload(type='file', accept="text/xml, .csv")
and I read it in JS:
var file = document.getElementById('upload').files[0];
var reader = new FileReader();
reader.onloadend = function(e){
var file = e.target.result;
};
reader.readAsBinaryString(file);
I get a line:
"mail;name;ТеÑÑ"
where ТеÑÑ, the last element in the file, is a Russian word.
How do I fix the charset?
The symptom is clear: you are (inadvertently) splicing UTF-8 (judging by your tag) content into something that is being presented as something else (not-UTF-8), hence mojibake ensues.
Make sure that every pass the content goes through is UTF-8 clean or preserves the original content byte-for-byte exactly. That includes setting Content-type headers appropriately (Likely: text/html; charset=utf-8).
This precise issue is why it is recommended to use UTF-8 for all the things. Set up your DBs to use UTF-8, set up your webserver to serve UTF-8, set up your source code to be in UTF-8, set up your editors to save in UTF-8 by default, set up your HTTP headers and meta tags to advertise UTF-8, do not accept anything that is not UTF-8 or transcode it where feasible. Anything that is not UTF-8 is just asking for trouble.
Why standardise on UTF-8, you ask? Because its low 7-bit range happens to look like ASCII, which can make a whole world of difference in interoperability with broken/legacy things that don't really understand much else.
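The mojibake in the question is what UTF-8 bytes look like when each byte is shown as its own character, which is what readAsBinaryString produces. A minimal Node sketch of that effect follows; the word 'Тест' is an assumption used for illustration, since the original file is not available. In the browser, reader.readAsText(file, 'UTF-8') would be the analogous way to decode the bytes as UTF-8.

```javascript
// Assumed sample word; the real file content from the question is not known.
const original = 'Тест';

// The file on disk holds the UTF-8 bytes of the text (2 bytes per Cyrillic letter).
const utf8Bytes = Buffer.from(original, 'utf8');

// Reading one character per byte turns 4 letters into 8 garbage characters.
const misread = utf8Bytes.toString('latin1');

// Because no bytes were lost, the damage can be undone by re-decoding as UTF-8.
const repaired = Buffer.from(misread, 'latin1').toString('utf8');

console.log(misread.length);        // 8
console.log(repaired === original); // true
```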
Platform: App Engine
Framework: webapp / CGI / WSGI
On my client side (JS), I construct a URL by concatenating a URL with a Unicode string:
http://www.foo.com/地震
then I call encodeURI to get
http://www.foo.com/%E5%9C%B0%E9%9C%87
and I put this in a HTML form value.
The form gets submitted to PayPal, where I've set the encoding to 'utf-8'.
PayPal then (through IPN) makes a post request on the said URL.
On my server side, WSGIApplication tries to extract the unicode string using a regular expression I've defined:
(r'/paypal-listener/(.+?)', c.PayPalIPNListener)
I'd try to decode it by calling
query = unquote_plus(query).decode('utf-8')
(or a variation) but I'd get the error
/paypal-listener/%E5%9C%B0%E9%9C%87
... (omitted) ...
'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
(the first line is the request URL)
When I check the length of query, Python says it has length 18, which suggests to me that '%E5%9C%B0%E9%9C%87' has not been decoded in any way.
In principle this should work:
>>> urllib.unquote_plus('http://www.foo.com/%E5%9C%B0%E9%9C%87').decode('utf-8')
u'http://www.foo.com/\u5730\u9707'
However, note that:
1. unquote_plus is for application/x-www-form-urlencoded data such as POSTed forms and query string parameters. In the path part of a URL, + means a literal plus sign, not space, so you should use plain unquote here.
2. You shouldn't generally unquote a whole URL. Characters that have special meaning in a component of the URL will be lost. You should split the URL into parts, get the single pathname component (%E5%9C%B0%E9%9C%87) that you are interested in, and then unquote it.
(If you want to fully convert a URI to an IRI like http://www.foo.com/地震 things are a bit more complicated. Only the path/query/fragment part of an IRI is UTF-8-%-encoded; the domain name is mapped between Unicode and bytes using the oddball ‘Punycode’ IDN scheme.)
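The two points above (split first, then decode; + is literal in a path) apply in any language. Here is a sketch in JavaScript using the built-in URL and URLSearchParams classes; the URL with an escaped slash is a hypothetical example, not one from the question.

```javascript
// Hypothetical URL whose first path segment contains an escaped slash (%2F).
const url = new URL('http://www.foo.com/a%2Fb/%E5%9C%B0%E9%9C%87');

// Right: split the still-encoded path, then decode each segment on its own.
const segments = url.pathname
  .split('/')
  .filter(part => part.length > 0)
  .map(part => decodeURIComponent(part));
console.log(segments); // [ 'a/b', '地震' ]

// Wrong: decoding the whole path first turns %2F into a real separator.
const brokenSegments = decodeURIComponent(url.pathname)
  .split('/')
  .filter(part => part.length > 0);
console.log(brokenSegments); // [ 'a', 'b', '地震' ]

// In a path, + stays a plus sign; in form-encoded data it means a space.
const plusInPath = decodeURIComponent('1+1');
const plusInForm = new URLSearchParams('q=1+1').get('q');
console.log(plusInPath); // 1+1
console.log(plusInForm); // 1 1
```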
You wrote: "This gets received in my python server side."
What exactly is your server-side? Server, gateway, framework? And how are you getting the url variable?
You appear to be getting a UnicodeEncodeError, which is about unexpected non-ASCII characters in the input to the unquote function, not a decoding problem at all. So I suggest that something has already decoded the path part of your URL to a Unicode string of some sort. Let's see the repr of that variable!
There are unfortunately a number of serious problems with several web servers that make using Unicode in the pathname part of a URL very unreliable, not just in Python but generally.
The main problem is that the PATH_INFO variable is defined (by the CGI specification, and subsequently by WSGI) to be pre-decoded. This is a dreadful mistake partly because of issue (1) above, which means you can't get %2F in a path part, but more seriously because decoding a %-sequence introduces a Unicode decode step that is out of the hands of the application. Server environments differ greatly in how non-ASCII %-escapes in the URL are handled, and it is often impossible to recreate the exact sequence of bytes that the web browser passed in.
IIS is a particular problem in that it will try to parse the URL path as UTF-8 by default, falling back to the wildly-unreliable system default codepage (eg. cp1252 on a Western Windows install) if the path isn't a valid UTF-8 sequence, but without telling you. You are then likely to have fairly severe problems trying to read any non-ASCII characters in PATH_INFO out of the environment variables map, because Windows envvars are Unicode but are accessed by Python 2 and many others as bytes in the system codepage.
Apache mitigates the problem by providing an extra non-standard environ REQUEST_URI that holds the original, completely undecoded URL submitted by the browser, which is easy to handle manually. However if you are using URL rewriting or error documents, that unmapped URL may not match what you thought it was going to be.
Some frameworks attempt to fix up these problems, with varying degrees of success. WSGI 1.1 is expected to make a stab at standardising this, but in the meantime the practical position we're left in is that Unicode paths won't work everywhere, and hacks to try to fix it on one server will typically break it on another.
You can always use URL rewriting to convert a Unicode path into a Unicode query parameter. Since the QUERY_STRING environ variable is not decoded outside of the application, it is much easier to handle predictably.
Assuming the HTML page is encoded in utf-8, it should just be a simple path.decode('utf-8') if the framework decodes the URL's percent-escapes.
If it doesn't, you could use:
urllib.unquote(path).decode('utf-8') if the URL is http://www.foo.com/地震
urllib.unquote_plus(path).decode('utf-8') if you're talking about a parameter sent via AJAX or in an HTML <form>
(see http://docs.python.org/library/urllib.html#urllib.unquote)
EDIT: Please supply us with the following information if you're still having problems to help us track this problem down:
Which web framework you're using inside of Google App Engine, e.g. Django, WebOb, CGI etc
How you're getting the URL in your app (please add a short code sample if you can)
repr(url) of when you add http://www.foo.com/地震 as the URL
Try adding this as the URL and post repr(url) so we can make sure the server isn't decoding the characters as either latin-1 or Windows-1252:
http://foo.com/¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
EDIT 2: Seeing as it's an actual URL (and not in the query section i.e. not http://www.foo.com/?param=%E5%9C%B0%E9%9C%87), doing
query = unquote(query.encode('ascii')).decode('utf-8')
is probably safe. It should be unquote and not unquote_plus if you're decoding the actual URL, though. I don't know why Google passes the URL as a unicode object, but I doubt the actual URL passed to the app would be decoded using windows-1252 etc. I was a bit concerned as I thought it was decoding the query incorrectly (i.e. the parameters passed to GET or POST), but it doesn't seem to be doing that by the looks of it.
Usually there is a function in server-side languages to decode urls, there might be one in Python as well. You can also use the decodeURIComponent() function of javascript in your case.
urllib.unquote() doesn't like unicode-string in this case. Pass it byte-string and decode afterwards to get unicode.
This works:
>>> u = u'http://www.foo.com/%E5%9C%B0%E9%9C%87'
>>> print urllib.unquote(u.encode('ascii'))
http://www.foo.com/地震
>>> print urllib.unquote(u.encode('ascii')).decode('utf-8')
http://www.foo.com/地震
This doesn't (see also urllib.unquote decodes percent-escapes with Latin-1):
>>> print urllib.unquote(u)
http://www.foo.com/å °é
Decoding a string that is already unicode doesn't work:
>>> print urllib.unquote(u).decode('utf-8')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File ".../lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 19-24: ordinal not in range(128)
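The Latin-1 behaviour shown above can be imitated outside Python to see why the output is garbage and why it is recoverable. This is a sketch in JavaScript (Node); unquoteAsLatin1 is a hypothetical helper written for this illustration, not a library function.

```javascript
const escaped = '%E5%9C%B0%E9%9C%87';

// Hypothetical helper mimicking the failing case: each %XX becomes the
// code point XX, i.e. the escaped bytes are read as Latin-1.
const unquoteAsLatin1 = s =>
  s.replace(/%([0-9A-Fa-f]{2})/g, (_, hex) => String.fromCharCode(parseInt(hex, 16)));

const wrong = unquoteAsLatin1(escaped);    // six characters of mojibake
const right = decodeURIComponent(escaped); // decodes the escapes as UTF-8

// The mojibake is reversible: turn the code points back into bytes,
// then decode those bytes as UTF-8.
const repaired = Buffer.from(wrong, 'latin1').toString('utf8');

console.log(wrong.length);       // 6
console.log(right);              // 地震
console.log(repaired === right); // true
```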
Try it this way:
var uri = "https://rasamarasa.com/service/catering/ගාල්ල-Galle";
var uri_enc = encodeURIComponent(uri);
var uri_dec = decodeURIComponent(uri_enc);
var res = "Encoded URI: " + uri_enc + "<br>" + "Decoded URI: " + uri_dec;
document.getElementById("demo").innerHTML = res;
for more check this link
https://www.w3schools.com/jsref/jsref_decodeuricomponent.asp
aaaah, the dreaded
'ascii' codec can't encode characters in position... ordinal not in range
error. unavoidable when dealing with languages like Japanese in python...
this is not a url encode/decode issue in this case. your data is most likely already decoded and ready to go.
i would try getting rid of the call to 'decode' and see what happens. if you get garbage but no error it probably means people are sending you data in one of the other lovely japanese specific encodings: eucjp, iso-2022-jp, shift-jis, or perhaps even the elusive iso-2022-jp-ext which is nowadays only rarely spotted in the wild. this latter case seems pretty unlikely though.
edit: id also take a look at this for reference:
What is the difference between encode/decode?
I have a problem I don't know how to solve.
I have an Indy10 HTTP server. I have used both Indy9 and Indy10 HTTP servers in many applications and never had any problems. But now I am using the Indy10 HTTP server with the ExtJS JavaScript RIA framework.
The problem occurs when I submit data that contains non-ANSI characters. For instance, when I submit the letter "č", which is a letter in the 1250 codepage (Slovenian, Croatian...), I get the following in Indy under "unparsed params": "%C4%8D". This is the correct hexadecimal representation of the letter "č" in UTF-8 encoding. All my pages are UTF-8 and I never had any problems submitting form data to Indy. I debugged the code and saw that I actually get a sequence of bytes like this: [37, 67, 52, 37, 56, 68]. This is the byte representation of the string "%C4%8D". But of course Indy cannot convert this correctly to UTF-16. For example, the actual form field:
FirstName=črt
comes out like this when submitted:
FirstName=%C4%8Drt
I don't know how to solve this. I looked at ExtJS forums, but there is nothing on this topic. Anybody know anything about this kind of problem?
EDIT:
If I encode the params as JSON they arrive correctly. I also tried to URL-decode the params, but the result is not correct. Maybe I missed something. I will look at this again. And yes, it seems that ExtJS URL-encodes the params.
EDIT2:
Ok, I have discovered more. I compared the actual content of the post data. It is like this:
Delphi 2006 (Indy10): FirstName=%C4%8D
Delphi 2010 (Indy10): FirstName=%C4%8D
In both cases the unparsed params are identical. I have ParseParams turned on, and in BDS2006 they are correctly parsed, but under 2010 they are not. This is the Indy10 bundled with Delphi. Is there a bug in this version or am I doing something wrong?
EDIT3:
I downloaded the latest nightly build of Indy10. Still the same issue.
EDIT4:
I am forced to accept my own answer.
To answer on this topic.
This is definitely not working as it should under unicode. Indy uses unicode strings internally. The problem is when parameters are decoded to TStringList. The problem is the line:
Params.Add(TIdURI.URLDecode(s));
found in the "TIdHTTPRequestInfo.DecodeAndSetParams". It does not decode params correctly, probably because it is working over unicode strings.
The workaround I found is to use "HTTPDecode" from "HTTPApp.pas".
Params := TStringList.Create;
try
  Params.StrictDelimiter := True;
  Params.Delimiter := '&';
  // parse the parameters and store them into temporary string list
  Params.DelimitedText := UTF8ToString(HTTPDecode(UTF8String(Request.UnparsedParams)));
  // do something with params...
finally
  Params.Free;
end;
But I cannot believe that such a common task is not working correctly. Can someone confirm this is really a bug or am I just doing something wrong?
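For comparison, here is what a correct decode of this data should produce, sketched in JavaScript (Node) rather than Delphi. It says nothing about Indy's internals; it only shows the expected result of percent-decoding followed by UTF-8 decoding. The LastName field is a hypothetical addition for illustration.

```javascript
// The six bytes seen in the debugger are just the ASCII text "%C4%8D".
const rawBytes = Buffer.from([37, 67, 52, 37, 56, 68]);
const escapedText = rawBytes.toString('ascii');
console.log(escapedText); // %C4%8D

// Percent-decoding yields the bytes C4 8D, which are "č" in UTF-8.
const letter = decodeURIComponent(escapedText);
console.log(letter); // č

// Parsing a whole form body; URLSearchParams decodes the values as UTF-8.
// LastName=Novak is a made-up second field.
const params = new URLSearchParams('FirstName=%C4%8Drt&LastName=Novak');
const firstName = params.get('FirstName');
console.log(firstName); // črt
```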
It appears the string is URL encoded, so you use the following code to decode:
uses
idURI;
value := TIdURI.URLDecode( value );
edit
It appears there is a case where the decoder does not properly decode the double bytes as a single character. Looking at the source, it does appear that it would decode properly if the character is coded like %UC48D but in my testing this still does not decode properly. What is interesting is that the TidURI.ParamsEncode function generates the proper encoding, but this encoding is not reversible using the proper routines in the latest version of Indy 10.
I'm using Delphi 7 and migrated to Indy 10. I found a similar problem with Portuguese characters and solved it by changing the source as shown below:
procedure TIdHTTPRequestInfo.DecodeAndSetParams(const AValue: String);
...
//Params.Add(TIdURI.URLDecode(s)); //-- assumes UTF-8
Params.Add(TIdURI.URLDecode(s,TIdTextEncoding.Default)); //-- ASCII worked
...
end;