I implemented a form submission method that uses XMLHttpRequest. I saw the new HTML5 feature, FormData, that allows submission of files along with forms. Cool! However, there's a problem with accented characters, specifically those stupid smart quotes that Word makes (yes, I'm a little biased against those characters). I used to have it submit to a hidden iframe, the old-school way, and I never had a problem with the variety of weird characters that were put in there. But I thought this would be better. It's turning out to be a bigger headache :-/
Let's look at the code. My javascript function (note the commented out line):
var xhr = new XMLHttpRequest();
var fd = new FormData(form);
xhr.addEventListener("error", uploadFailed, false);
xhr.addEventListener("abort", uploadCanceled, false);
xhr.addEventListener("load", uploadComplete, false);
xhr.open($(form).attr('method'), $(form).attr('action'));
//xhr.setRequestHeader("Content-Type", "application/x-www-form-urlencoded; charset=ISO-8859-1");
xhr.send(fd);
This is a shortened view, check out line 1510 at http://archive.cyark.org/sitemanager/sitemanager.js to view the entire function.
Then on the receiving php page, I have at the top:
header('Content-Type: text/html; charset=ISO-8859-1');
Followed by some basic php to build a string with the post data and submit it as an update to mysql.
So what do I do? If I uncomment the content-type setting in javascript it totally breaks the POST data in my php script. I don't know if the problem is in javascript, php or mysql. Any thoughts?
Encoding problems are sometimes hard to debug. In short the best solution is to literally use UTF8 as encoding everywhere. That is, every component of your application stack.
Your page seems to be delivered as ISO-LATIN-1 (sent via HTTP header from your webserver) which leads browsers to use latin1 or some Windows equivalent like windows-1252 even though you may have META elements in your HTML's HEAD telling user agents to use UTF8. The HTTP header takes precedence. Check the delivery of your other file formats (especially .js) to be UTF8 as well. If your problems are still appearing after configuring everything client side related (HTML, JS, XHR etc.) to use UTF8 you will have to start checking your server side for problems.
This may include such simple problems as PHP files not being proper UTF8 (very unlikely on linux servers I'd say) but usually consists of problems with mysql configurations (server and client), database and table default encoding (and collation) and the correct connection settings. Problems may also be caused by incorrect PHP ini or mbstring configuration settings.
Examples (not complete; using mysql here as a common database example):
MySQL configuration
[mysqld]
default_character_set = utf8
character_set_client = utf8
character_set_server = utf8
[client]
default_character_set = utf8
Please note that those settings differ between MySQL 5.1 and 5.5, and using the wrong variable may prevent mysqld from starting. See http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html and http://dev.mysql.com/doc/refman/5.1/en/server-system-variables.html for details.
You may check your mysql variables via CLI:
mysql> SHOW VARIABLES LIKE '%char%';
Variable_name Value
character_set_client utf8
character_set_connection utf8
character_set_database utf8
character_set_filesystem binary
character_set_results utf8
character_set_server utf8
character_set_system utf8
When creating databases and tables try to use something like
CREATE DATABASE $db /*!40100 DEFAULT CHARACTER SET utf8 */
PHP.ini settings (should be the default already):
default_charset = "utf-8"
MB-String extension of PHP uses latin1 by default and should be reconfigured if used:
[mbstring]
mbstring.internal_encoding = UTF-8
mbstring.http_output = UTF-8
...some more perhaps...
Webserver settings (Apache used as example, applies to other servers as well):
# httpd.conf
AddDefaultCharset UTF-8
PHP source codes may use header settings like:
header('Content-type: text/html; charset=UTF-8');
Shell (bash) settings:
# ~/.profile
export LC_CTYPE=en_US.UTF-8
export LANG=en_US.UTF-8
The above list is just meant to hint at the pitfalls that may await you in certain situations. Every single component of your web stack must be able to handle UTF8 and should be configured correctly to do so. That said, a correct UTF8 HTTP header is usually enough to sort most problems out. Good luck! :-)
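To see why a mismatched charset mangles smart quotes, here is a minimal Node.js sketch. The sample string and variable names are made up for illustration, and it assumes the browser sends FormData fields as UTF-8. It shows what those bytes look like to code that reads them as ISO-8859-1:

```javascript
// Assumption: the browser serialises FormData fields as UTF-8.
const text = '\u201Chello\u201D';         // "hello" wrapped in Word-style smart quotes
const sent = Buffer.from(text, 'utf8');   // the bytes that go over the wire

// Each smart quote takes three bytes in UTF-8 (E2 80 9C and E2 80 9D).
console.log(sent.length);                 // 11

// Reading the same bytes as ISO-8859-1 turns every byte into its own character.
const misread = sent.toString('latin1');
console.log(misread.length);              // 11 characters instead of 7

// Reading them as UTF-8 restores the original text.
const correct = sent.toString('utf8');
console.log(correct === text);            // true
```

If the receiving page really has to stay on ISO-8859-1, the incoming data would need to be transcoded on the server; otherwise declaring UTF-8 on both sides avoids the mismatch.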
Related
The file input, in Jade:
input#upload(type='file', accept="text/xml, .csv")
and I read it in JS:
var file = document.getElementById('upload').files[0];
var reader = new FileReader();
reader.onloadend = function(e){
var file = e.target.result;
};
reader.readAsBinaryString(file);
I get a line:
"mail;name;ТеÑÑ"
where ТеÑÑ in the last element in the file is a russian word.
How do I fix the charset?
The symptom is clear: you are (inadvertently) splicing UTF-8 (judging by your tag) content into something that is being presented as something else (not-UTF-8), hence mojibake ensues.
Make sure that every pass the content goes through is UTF-8 clean or preserves the original content byte-for-byte exactly. That includes setting Content-type headers appropriately (Likely: text/html; charset=utf-8).
This precise issue is why it is recommended to use UTF-8 for all the things. Set up your DBs to use UTF-8, set up your webserver to serve UTF-8, set up your source code to be in UTF-8, set up your editors to save in UTF-8 by default, set up your HTTP headers and meta tags to advertise UTF-8, do not accept anything that is not UTF-8 or transcode it where feasible. Anything that is not UTF-8 is just asking for trouble.
Why standardise on UTF-8, you ask? Because its low 7-bit range happens to look like ASCII, which can make a whole world of difference in interoperability with broken/legacy things that don't really understand much else.
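As a rough illustration of the advice above, the following Node.js sketch simulates what readAsBinaryString hands back for a UTF-8 file (one character per byte) and decodes it properly. The helper name binaryStringToText and the sample word "Тест" are assumptions made for the example. In the browser, calling reader.readAsText(file, 'UTF-8') instead of readAsBinaryString is probably the simpler route.

```javascript
// Hypothetical helper: turn a "binary string" (one character per byte, as
// produced by FileReader.readAsBinaryString) back into text by decoding
// the underlying bytes as UTF-8.
function binaryStringToText(binaryString) {
  const bytes = Uint8Array.from(binaryString, (ch) => ch.charCodeAt(0));
  return new TextDecoder('utf-8').decode(bytes);
}

// Simulate what readAsBinaryString would return for a UTF-8 encoded file.
// The sample line, including the Russian word, is an assumption.
const originalLine = 'mail;name;Тест';
const binaryString = Buffer.from(originalLine, 'utf8').toString('latin1');

console.log(binaryString === originalLine);                      // false: mojibake
console.log(binaryStringToText(binaryString) === originalLine);  // true
```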
We are implementing a client-side web application that communicates with the server exclusively via XMLHttpRequests (an AJAX engine).
The XHR responses usually are plain text with some XML on it but in this case, the server is sending compressed data in .tgz file type. We know for sure that the data that the server is sending is correct because if we use an HTTP command-line client such as curl, the file sent as response is valid and contains the expected data.
However, when making an AJAX call and saving the response as a downloadable Blob, the file we obtain is larger than the correct one and is not recognized by the decompressor. It gives the following error:
gzip: stdin: not in gzip format
/bin/gtar: Child returned status 1
/bin/gtar: Error is not recoverable: exiting now
The code I'm using is the following:
*$.AJAX*.done(function(data){
window.URL = window.webkitURL || window.URL;
var contentType = 'application/x-compressed-tar';
var file = new Blob([data], {type: contentType});
var a = document.createElement('a'),
ev = document.createEvent("MouseEvents");
a.download = "browser_download2.tgz";
a.href = window.URL.createObjectURL(file);
ev.initMouseEvent("click", true, false, self, 0, 0, 0, 0, 0,
false, false, false, false, 0, null);
a.dispatchEvent(ev);
});
I left out the parameters used to make the AJAX call, but let's assume they are not the problem, as I do receive a response. I used this contentType because it is the same one reported by curl, but I tried different ones too. The code may look a little weird, so let me break it down: I'm basically creating a link and attaching the download URL and the file name to it (it's a dirty way to be able to name the file). Finally, I'm simulating a click on the link.
I compared the correct tgz file and the one obtained via browser with a hex viewer and I observed the repetition of patterns in the corrupted one (EF, BF and BD, all along the file) that is not present in the correct one.
So I can think of some possible causes:
(a) The browser is adding extra characters, or maybe the response
header is still in the downloaded file.
(b) The file has been partially decompressed, because when I inspect
the request header I can see "Accept-Encoding: gzip, deflate";
although I don't know if the browser (Firefox in my case)
automatically decompresses data.
(c) The code that I'm using to blob the data is not correct; although
it worked fine with a text/plain file on another
occasion.
Edit
I also provide you the links to the hex inspection:
(a) Corrupted file: http://en.webhex.net/view/278aac05820c34dfbdd2217c03970dd9/0
(b) (Presumably) correct file: http://en.webhex.net/view/4a01894b814c17d2ec71ba49ac48e683
I don't know if this thread will be helpful for somebody, but just in case I figured out the cause and a possible solution for my problem.
The cause
JavaScript strings hold text (Unicode), not raw bytes. When the response is handled as text, the browser decodes it (as UTF-8 here), and every byte sequence that is not valid text is replaced by the Unicode replacement character U+FFFD. This is why the corrupted file is larger and why the bytes EF, BF and BD, which are the UTF-8 encoding of U+FFFD, repeat throughout it in the hex viewer.
The solution
Recent browser versions implement so-called typed arrays: JavaScript arrays that can store data in different formats, including raw binary. If one specifies that the XMLHttpRequest response is binary, the data is stored correctly and, when put into a Blob and saved, the file is not corrupted. Check out the code I used:
var xhr = new XMLHttpRequest();
xhr.open('POST', url, true);
xhr.responseType = 'arraybuffer';
Notice that the key point is to set responseType to "arraybuffer". It may also be worth noting that I decided not to use jQuery for the AJAX call anymore. It implements this feature poorly, and all my attempts to get binary data through jQuery were in vain (overrideMimeType, described elsewhere, didn't work in my case). Plain old XMLHttpRequest, by contrast, worked nicely.
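The EF BF BD pattern can be reproduced outside the browser. The sketch below (Node.js, with a made-up four-byte sample standing in for the start of a gzip stream) decodes binary data as UTF-8 text and writes it back out, which is roughly what happens when a text response is put into a Blob:

```javascript
// First bytes of a gzip stream: magic number 1F 8B, then method and flags.
const gzipHeader = Uint8Array.from([0x1f, 0x8b, 0x08, 0x00]);

// Treating the response as text: 0x8B is not valid UTF-8 on its own,
// so the decoder substitutes U+FFFD for it.
const asText = new TextDecoder('utf-8').decode(gzipHeader);

// Writing that string back out re-encodes it as UTF-8.
const roundTripped = new TextEncoder().encode(asText);

const hex = (bytes) =>
  Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join(' ');

console.log(hex(gzipHeader));    // 1f 8b 08 00
console.log(hex(roundTripped));  // 1f ef bf bd 08 00
```

The round-tripped data is longer than the original and the invalid byte is gone for good, which matches the larger, unreadable archive described in the question. Keeping the response as an ArrayBuffer avoids the text decoding step entirely.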
Is it possible to save text to a new text file using JavaScript/jQuery without using PHP? The text I'm trying to save may contain HTML entities, JS, HTML, CSS and PHP scripts that I don't want to escape or URL-encode!
If this can only be achieved using PHP, how can I pass the text to PHP without encoding it?
You must have a server-side script to handle your request, it can't be done using javascript.
To send raw data without URIencoding or escaping special characters to the php and save it as new txt file you can send ajax request using post method and FormData like:
JS:
var data = new FormData();
data.append("data" , "the_text_you_want_to_save");
var xhr = (window.XMLHttpRequest) ? new XMLHttpRequest() : new ActiveXObject("Microsoft.XMLHTTP");
xhr.open( 'post', '/path/to/php', true );
xhr.send(data);
PHP:
if(!empty($_POST['data'])){
$data = $_POST['data'];
$fname = time() . ".txt";//timestamp-based name (not random: two requests in the same second collide)
$file = fopen("upload/" .$fname, 'w');//creates new file
fwrite($file, $data);
fclose($file);
}
Edit:
As Florian mentioned below, the XHR fallback is not required, since FormData is not supported in older browsers anyway (see FormData browser compatibility), so you can declare the XHR variable as:
var xhr = new XMLHttpRequest();
Also please note that this works only in browsers that support FormData, such as IE 10+.
It's not possible to save content to the website using only client-side scripting such as JavaScript and jQuery, but by submitting the data in an AJAX POST request you could perform the other half very easily on the server-side.
However, I would not recommend having raw content such as scripts so easily writeable to your hosting as this could easily be exploited. If you want to learn more about AJAX POST requests, you can read the jQuery API page:
http://api.jquery.com/jQuery.post/
And here are some things you ought to be aware of if you still want to save raw script files on your hosting. You have to be very careful with security if you are handling files like this!
File uploading (most of this applies if sending plain text too if javascript can choose the name of the file)
http://www.developershome.com/wap/wapUpload/wap_upload.asp?page=security
https://www.owasp.org/index.php/Unrestricted_File_Upload
If you still want to work in JavaScript and avoid PHP, CGI, and things like that, it's no longer true that you can't do server side scripts with JavaScript.
With Node.js, you can do server-side JavaScript. Of course, you have to have a host that can run a Node.js server. But once you get it up and running, you can write the server script to accept a JSON-formatted string from your client-side scripts. Then, based on the JSON string received, the server-side script can create and save files. Of course, you want to make sure you write secure code: check what is being sent to your server and verify it's not malicious before creating and saving the files. You probably also want to rate-limit requests and pause between file writes so that you're less susceptible to a denial-of-service attack.
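A minimal sketch of the file-saving half of that idea, assuming the client posts a JSON body of the form {"data": "..."}. The function name saveText, the size limit and the naming scheme are all hypothetical, and the HTTP server part is left out so the example stays self-contained:

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

const MAX_BYTES = 64 * 1024; // hypothetical upper bound on accepted text

// Hypothetical handler body: validate the JSON payload, then write the text
// to a file whose name is chosen by the server, never by the client.
function saveText(jsonBody, uploadDir) {
  const payload = JSON.parse(jsonBody);
  if (typeof payload.data !== 'string') {
    throw new TypeError('expected a string in "data"');
  }
  if (Buffer.byteLength(payload.data, 'utf8') > MAX_BYTES) {
    throw new RangeError('payload too large');
  }
  const name = Date.now() + '-' + Math.random().toString(36).slice(2) + '.txt';
  const target = path.join(uploadDir, name);
  fs.writeFileSync(target, payload.data, 'utf8'); // stored as-is, no escaping
  return target;
}

// Usage: save some raw markup into a temporary directory and read it back.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'upload-'));
const sample = '<?php echo "hi"; ?> <b>\u201Cquoted\u201D</b>';
const saved = saveText(JSON.stringify({ data: sample }), dir);
console.log(fs.readFileSync(saved, 'utf8') === sample); // true
```

Because the text travels inside a JSON string, the client does not need to URL-encode it. The .txt extension and the server-chosen name are there so that uploaded script content is less likely to be executed by the web server.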
First the environment: the client is a mobile Safari on iPhone, the server consists of a Tomcat 5.5 fronted by IIS.
I have a piece of javascript code that sends a single parameter to the server and gets back some response:
var url = "/abc/ABCServlet";
var paramsString = "name=SomeName"
xmlhttpobj = getXmlHttpObject(); //Browser specific object returned
xmlhttpobj.onreadystatechange = callbackFunction;
xmlhttpobj.open("GET", url + "?" + paramsString, true);
xmlhttpobj.send(null);
This works fine when the iPhone language/locale is EN/US; but when the locale/language is changed to Japanese the query parameter received by the server becomes "SomeName#" without the quotes. Somehow a # is getting appended at the end.
Any clues why?
Hopefully, all you need to do is add a meta tag to the top of your HTML page that specifies the correct character set (e.g. <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />) and match whatever encoding your datafiles are expecting.
If that doesn't work, ensure that you are using the same character encoding (preferably UTF-8) throughout your application. Your server-side scripts and any files that include text strings you will be adding directly to the response stream should be saved with that single encoding. It's a good idea to have your servers send a "Content-Type" HTTP header of the same encoding if possible (e.g. "text/html; charset=utf-8"). And you should ensure that the mobile safari page that's doing the displaying has the right Content-Type meta tag.
Japanese developers have a nasty habit of storing files in EUC or ISO-2022-JP, both of which often force the browser to use different font faces and can seriously break your page if the browser is expecting a Roman charset. The good news is that if you're forced to use one of the Japanese encodings, that encoding will typically display right for most English text. It's the extended characters you need to look out for.
Now I may be wrong, but I THOUGHT that loading these files via AJAX was not a problem (I think the browser remaps the character data according to the character set for every text file it loads), but as you start mixing document encodings in a single file (and especially in your document body), bad things can happen. Maybe mobile safari requires the same encoding for both HTML files and AJAX files. I hope not. That would be ugly.
Platform: App Engine
Framework: webapp / CGI / WSGI
On my client side (JS), I construct a URL by concatenating a URL with a Unicode string:
http://www.foo.com/地震
then I call encodeURI to get
http://www.foo.com/%E5%9C%B0%E9%9C%87
and I put this in a HTML form value.
The form gets submitted to PayPal, where I've set the encoding to 'utf-8'.
PayPal then (through IPN) makes a post request on the said URL.
On my server side, WSGIApplication tries to extract the unicode string using a regular expression I've defined:
(r'/paypal-listener/(.+?)', c.PayPalIPNListener)
I'd try to decode it by calling
query = unquote_plus(query).decode('utf-8')
(or a variation) but I'd get the error
/paypal-listener/%E5%9C%B0%E9%9C%87
... (omitted) ...
'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
(the first line is the request URL)
When I check the length of query, Python says it has length 18, which suggests to me that '%E5%9C%B0%E9%9C%87' has not been decoded in any way.
In principle this should work:
>>> urllib.unquote_plus('http://www.foo.com/%E5%9C%B0%E9%9C%87').decode('utf-8')
u'http://www.foo.com/\u5730\u9707'
However, note that:
unquote_plus is for application/x-www-form-urlencoded data such as POSTed forms and query string parameters. In the path part of a URL, + means a literal plus sign, not space, so you should use plain unquote here.
You shouldn't generally unquote a whole URL. Characters that have special meaning in a component of the URL will be lost. You should split the URL into parts, get the single pathname component (%E5%9C%B0%E9%9C%87) that you are interested in, and then unquote it.
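The two points above can be sketched in JavaScript as well (the answer itself is about Python; this is only an analogue, and decodePathSegments is a made-up helper name). The URL is split into components first and each path segment is decoded separately:

```javascript
const url = new URL('http://www.foo.com/paypal-listener/%E5%9C%B0%E9%9C%87');

// Split first, decode second, so an escaped slash (%2F) inside a segment
// is not mistaken for a path separator.
function decodePathSegments(pathname) {
  return pathname.split('/').filter(Boolean).map((s) => decodeURIComponent(s));
}

console.log(decodePathSegments(url.pathname));       // [ 'paypal-listener', '地震' ]
console.log(decodePathSegments('/a%2Fb/c'));         // [ 'a/b', 'c' ]

// In a path, "+" is a literal plus sign; in form/query data it means a space.
console.log(decodeURIComponent('a+b'));              // a+b
console.log(new URLSearchParams('q=a+b').get('q'));  // a b
```

Decoding the whole path before splitting would have turned '/a%2Fb/c' into three segments instead of two, which is the kind of information loss described above.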
(If you want to fully convert a URI to an IRI like http://www.foo.com/地震 things are a bit more complicated. Only the path/query/fragment part of an IRI is UTF-8-%-encoded; the domain name is mapped between Unicode and bytes using the oddball ‘Punycode’ IDN scheme.)
This gets received in my python server side.
What exactly is your server-side? Server, gateway, framework? And how are you getting the url variable?
You appear to be getting a UnicodeEncodeError, which is about unexpected non-ASCII characters in the input to the unquote function, not a decoding problem at all. So I suggest that something has already decoded the path part of your URL to a Unicode string of some sort. Let's see the repr of that variable!
There are unfortunately a number of serious problems with several web servers that make using Unicode in the pathname part of a URL very unreliable, not just in Python but generally.
The main problem is that the PATH_INFO variable is defined (by the CGI specification, and subsequently by WSGI) to be pre-decoded. This is a dreadful mistake partly because of issue (1) above, which means you can't get %2F in a path part, but more seriously because decoding a %-sequence introduces a Unicode decode step that is out of the hands of the application. Server environments differ greatly in how non-ASCII %-escapes in the URL are handled, and it is often impossible to recreate the exact sequence of bytes that the web browser passed in.
IIS is a particular problem in that it will try to parse the URL path as UTF-8 by default, falling back to the wildly-unreliable system default codepage (eg. cp1252 on a Western Windows install) if the path isn't a valid UTF-8 sequence, but without telling you. You are then likely to have fairly severe problems trying to read any non-ASCII characters in PATH_INFO out of the environment variables map, because Windows envvars are Unicode but are accessed by Python 2 and many others as bytes in the system codepage.
Apache mitigates the problem by providing an extra non-standard environ REQUEST_URI that holds the original, completely undecoded URL submitted by the browser, which is easy to handle manually. However if you are using URL rewriting or error documents, that unmapped URL may not match what you thought it was going to be.
Some frameworks attempt to fix up these problems, with varying degrees of success. WSGI 1.1 is expected to make a stab at standardising this, but in the meantime the practical position we're left in is that Unicode paths won't work everywhere, and hacks to try to fix it on one server will typically break it on another.
You can always use URL rewriting to convert a Unicode path into a Unicode query parameter. Since the QUERY_STRING environ variable is not decoded outside of the application, it is much easier to handle predictably.
Assuming the HTML page is encoded in UTF-8, it should just be a simple path.decode('utf-8') if the framework decodes the URL's percent-escapes.
If it doesn't, you could use:
urllib.unquote(path).decode('utf-8') if the URL is http://www.foo.com/地震
urllib.unquote_plus(path).decode('utf-8') if you're talking about a parameter sent via AJAX or in an HTML <form>
(see http://docs.python.org/library/urllib.html#urllib.unquote)
EDIT: Please supply us with the following information if you're still having problems to help us track this problem down:
Which web framework you're using inside of google app engine, e.g. Django, WebOb, CGI etc
How you're getting the URL in your app (please add a short code sample if you can)
repr(url) of when you add http://www.foo.com/地震 as the URL
Try adding this as the URL and post repr(url) so we can make sure the server isn't decoding the characters as either latin-1 or Windows-1252:
http://foo.com/¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
EDIT 2: Seeing as it's an actual URL (and not in the query section i.e. not http://www.foo.com/?param=%E5%9C%B0%E9%9C%87), doing
query = unquote(query.encode('ascii')).decode('utf-8')
is probably safe. It should be unquote and not unquote_plus if you're decoding the actual URL though. I don't know why google passes the URL as a unicode object but I doubt the actual URL passed to the app would be decoded using windows-1252 etc. I was a bit concerned as I thought it was decoding the query incorrectly (i.e. the parameters passed to GET or POST) but it doesn't seem to be doing that by the looks of it.
Usually there is a function in server-side languages to decode urls, there might be one in Python as well. You can also use the decodeURIComponent() function of javascript in your case.
urllib.unquote() doesn't like unicode-string in this case. Pass it byte-string and decode afterwards to get unicode.
This works:
>>> u = u'http://www.foo.com/%E5%9C%B0%E9%9C%87'
>>> print urllib.unquote(u.encode('ascii'))
http://www.foo.com/地震
>>> print urllib.unquote(u.encode('ascii')).decode('utf-8')
http://www.foo.com/地震
This doesn't (see also urllib.unquote decodes percent-escapes with Latin-1):
>>> print urllib.unquote(u)
http://www.foo.com/å °é
Decoding a string that is already unicode doesn't work:
>>> print urllib.unquote(u).decode('utf-8')
Traceback (most recent call last):
File "<input>", line 1, in <module>
File ".../lib/python2.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 19-24: ordinal not in range(128)
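The same Latin-1 versus UTF-8 distinction can be reproduced in JavaScript, as a rough analogue of the Python session above. The sketch uses the legacy unescape() function, which maps each %XX escape to the code point XX, and then repairs the result by treating those code points as bytes:

```javascript
const escaped = '%E5%9C%B0%E9%9C%87';

// unescape() is a legacy function that maps each %XX to the code point XX,
// which amounts to a Latin-1 reading of the bytes.
const latin1Reading = unescape(escaped);
console.log(latin1Reading.length);          // 6 characters, one per byte

// Recover the bytes and decode them as UTF-8 instead.
const repaired = Buffer.from(latin1Reading, 'latin1').toString('utf8');
console.log(repaired);                      // 地震

// decodeURIComponent performs the UTF-8 decoding in one step.
console.log(decodeURIComponent(escaped));   // 地震
```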
Try it this way:
var uri = "https://rasamarasa.com/service/catering/ගාල්ල-Galle";
var uri_enc = encodeURIComponent(uri);
var uri_dec = decodeURIComponent(uri_enc);
var res = "Encoded URI: " + uri_enc + "<br>" + "Decoded URI: " + uri_dec;
document.getElementById("demo").innerHTML = res;
for more check this link
https://www.w3schools.com/jsref/jsref_decodeuricomponent.asp
aaaah, the dreaded
'ascii' codec can't encode characters in position... ordinal not in range
error. unavoidable when dealing with languages like Japanese in python...
this is not a url encode/decode issue in this case. your data is most likely already decoded and ready to go.
i would try getting rid of the call to 'decode' and see what happens. if you get garbage but no error it probably means people are sending you data in one of the other lovely japanese specific encodings: eucjp, iso-2022-jp, shift-jis, or perhaps even the elusive iso-2022-jp-ext which is nowadays only rarely spotted in the wild. this latter case seems pretty unlikely though.
edit: i'd also take a look at this for reference:
What is the difference between encode/decode?