Problems with charset when reading a file. How to fix it - javascript

Input to read the file (Jade):
input#upload(type='file', accept='text/xml, .csv')
and read it in JS:
var file = document.getElementById('upload').files[0];
var reader = new FileReader();
reader.onloadend = function (e) {
    var contents = e.target.result;
};
reader.readAsBinaryString(file);
I get the line:
"mail;name;ТеÑÑ"
where ТеÑÑ, the last element in the file, is a Russian word. How do I fix the charset?

The symptom is clear: you are (inadvertently) splicing UTF-8 content (judging by your tag) into something that is being presented as something else, not UTF-8, hence the mojibake.
Make sure that every pass the content goes through is either UTF-8 clean or preserves the original content byte-for-byte. That includes setting Content-Type headers appropriately (likely text/html; charset=utf-8).
This precise issue is why it is recommended to use UTF-8 for all the things. Set up your DBs to use UTF-8, set up your webserver to serve UTF-8, set up your source code to be in UTF-8, set up your editors to save in UTF-8 by default, set up your HTTP headers and meta tags to advertise UTF-8, do not accept anything that is not UTF-8 or transcode it where feasible. Anything that is not UTF-8 is just asking for trouble.
Why standardise on UTF-8, you ask? Because its low 7-bit range happens to coincide with ASCII, which can make a whole world of difference in interoperability with broken/legacy things that don't really understand much else.
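In this particular case, the quickest fix is usually to let FileReader decode the file as text with an explicit encoding instead of using readAsBinaryString. A minimal sketch (assuming the CSV really is UTF-8; swap in 'windows-1251' if it turns out to be a legacy Cyrillic encoding):
var file = document.getElementById('upload').files[0];
var reader = new FileReader();
reader.onloadend = function (e) {
    var text = e.target.result;   // already a properly decoded JavaScript string
    console.log(text);            // the Cyrillic column should now display correctly
};
reader.readAsText(file, 'UTF-8'); // explicit encoding instead of readAsBinaryString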

Related

How to save a file in NodeJS using a latin1 character encoding?

This code downloads a sample latin1/ISO-8859-1 encoded file and saves it to disk. Open that file and you'll see the strange question mark characters �. https://stackoverflow.com/a/3527176/779159 explains it's because of the wrong encoding being applied, and latin1 should fix it.
const request = require('request')
const fs = require('fs')

const url = 'http://vancouver-webpages.com/multilingual/french.asis'
request.get(url, { encoding: null })
  .pipe(fs.createWriteStream('/tmp/file.txt', { defaultEncoding: 'latin1' }))
But using the request and fs modules, I can't get it to save in latin1 encoding. How do I fix this code?
Node v8.1.4 should support latin1 (aka 'binary') as one of its encodings for Buffer. I just tested your code and it actually works fine. I use Atom as my text editor and, initially, it assumed 'UTF-8', which is why the question mark characters appeared. When I switched from UTF-8 to 'Auto-Detect', everything appeared fine.
Atom then reported the encoding as 'Windows 1252', but it works the same way if I select 'ISO 8859-1'. So make sure that whatever editor you are using detects the character encoding correctly. It is not Node's fault!
By the way, an interesting thing to note, according to the docs for Node v8.1.4, in one of the sections for Buffer:
Today's browsers follow the WHATWG spec which aliases both 'latin1' and ISO-8859-1 to win-1252. This means that while doing something like http.get(), if the returned charset is one of those listed in the WHATWG spec it is possible that the server actually returned win-1252-encoded data, and using 'latin1' encoding may incorrectly decode the characters.
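For completeness, a minimal sketch of a byte-preserving variant, reusing the request and fs modules and the URL from the question (this uses the callback form rather than a pipe, purely for illustration): keep the response as a raw Buffer, write the bytes to disk untouched, and decode explicitly only when you need text.
const fs = require('fs');
const request = require('request');

const url = 'http://vancouver-webpages.com/multilingual/french.asis';
request.get(url, { encoding: null }, (err, res, body) => {
    if (err) throw err;
    fs.writeFileSync('/tmp/file.txt', body);   // body is a Buffer; bytes are written verbatim
    console.log(body.toString('latin1'));      // decode as latin1/ISO-8859-1 only for display
});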

AJAX response gives a corrupted compressed (.tgz) file

We are implementing a client-side web application that communicates with the server exclusively via XMLHttpRequests (an AJAX engine).
The XHR responses are usually plain text with some XML in it, but in this case the server is sending compressed data as a .tgz file. We know for sure that the data the server sends is correct, because if we use an HTTP command-line client such as curl, the file sent as the response is valid and contains the expected data.
However, when making the AJAX call and turning the response into a downloadable Blob, the file we obtain is different in size (larger) than the correct one and is not recognized by the decompressor. It gives the following error:
gzip: stdin: not in gzip format
/bin/gtar: Child returned status 1
/bin/gtar: Error is not recoverable: exiting now
The code I'm using is the following:
$.ajax( /* request parameters omitted, see below */ ).done(function (data) {
    window.URL = window.webkitURL || window.URL;
    var contentType = 'application/x-compressed-tar';
    var file = new Blob([data], { type: contentType });
    var a = document.createElement('a'),
        ev = document.createEvent("MouseEvents");
    a.download = "browser_download2.tgz";
    a.href = window.URL.createObjectURL(file);
    ev.initMouseEvent("click", true, false, self, 0, 0, 0, 0, 0,
        false, false, false, false, 0, null);
    a.dispatchEvent(ev);
});
I have omitted the parameters used to make the AJAX call, but let's assume that is not the problem, as I do receive a response correctly. I used this contentType because it is the same one reported by curl, but I have tried different ones. The code may look a little odd, so I'll break it down for you: I'm basically creating a link, attaching the download URL and the file name to it (a dirty way to be able to name the file), and finally virtually clicking the link.
I compared the correct tgz file and the one obtained via the browser in a hex viewer and observed repeated patterns in the corrupted one (EF, BF and BD, all along the file) that are not present in the correct one.
Therefore I think about some possible causes:
(a) The browser is adding extra characters, or maybe the response header is still in the downloaded file.
(b) The file has been partially decompressed, because when I inspect the request headers I can see "Accept-Encoding: gzip, deflate"; although I don't know whether the browser (Firefox in my case) automatically decompresses data.
(c) The code that I'm using to blob the data is not correct, although it accomplished the aim well with a plain/text file on another occasion.
Edit
I also provide you the links to the hex inspection:
(a) Corrupted file: http://en.webhex.net/view/278aac05820c34dfbdd2217c03970dd9/0
(b) (Presumably) correct file: http://en.webhex.net/view/4a01894b814c17d2ec71ba49ac48e683
I don't know if this thread will be helpful to somebody, but just in case, I figured out the cause of my problem and a possible solution.
The cause
JavaScript strings store text as Unicode characters; by default they are not suited to holding binary data byte-for-byte, which is why wrong characters are so easily introduced. This also explains the repetitions of EF BF BD observed in the hex viewer: EF BF BD is the UTF-8 encoding of the Unicode replacement character U+FFFD, which the decoder substitutes for every byte sequence it cannot interpret.
The solution
Recent browser versions implement so-called typed arrays: JavaScript arrays that can store data in different formats, including binary. If one specifies that the XMLHttpRequest response is in binary format, the data will be stored correctly and, when turned into a Blob and written to a file, the file will not be corrupted. Check out the code I used:
var xhr = new XMLHttpRequest();
xhr.open('POST', url, true);
xhr.responseType = 'arraybuffer';
Notice that the key point is to set the responseType to "arraybuffer". It may also be worth noting that I decided not to use jQuery for the AJAX call any more: it supports this feature poorly, and all my attempts to make it work through jQuery were in vain (the overrideMimeType approach described elsewhere didn't work in my case). Plain old XMLHttpRequest worked nicely instead.
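A slightly fuller sketch of the same idea, combining the arraybuffer response with the download-link trick from the question (a.click() replaces the deprecated initMouseEvent; this is an illustration rather than the exact original code):
var xhr = new XMLHttpRequest();
xhr.open('POST', url, true);
xhr.responseType = 'arraybuffer';              // keep the payload as raw bytes
xhr.onload = function () {
    var blob = new Blob([xhr.response], { type: 'application/x-compressed-tar' });
    var a = document.createElement('a');
    a.download = 'browser_download2.tgz';
    a.href = window.URL.createObjectURL(blob);
    a.click();                                 // virtually click the link to start the download
};
xhr.send();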

submitting form with accented characters via xmlhttprequest

I implemented this form submission method that uses XMLHttpRequest. I saw the new HTML5 feature, FormData, that allows submission of files along with forms. Cool! However, there's a problem with accented characters, specifically those stupid smart quotes that Word makes (yes, I'm a little biased against those characters). I used to submit to a hidden iframe, the old school way, and I never had a problem with the variety of weird characters that got put in there. But I thought this would be better. It's turning out to be a bigger headache :-/
Let's look at the code. My javascript function (note the commented out line):
var xhr = new XMLHttpRequest();
var fd = new FormData(form);
xhr.addEventListener("error", uploadFailed, false);
xhr.addEventListener("abort", uploadCanceled, false);
xhr.addEventListener("load", uploadComplete, false);
xhr.open($(form).attr('method'), $(form).attr('action'));
//xhr.setRequestHeader("Content-Type", "application/x-www-form-urlencoded; charset=ISO-8859-1");
xhr.send(fd);
This is a shortened view, check out line 1510 at http://archive.cyark.org/sitemanager/sitemanager.js to view the entire function.
Then on the receiving php page, I have at the top:
header('Content-Type: text/html; charset=ISO-8859-1');
Followed by some basic php to build a string with the post data and submit it as an update to mysql.
So what do I do? If I uncomment the content-type setting in javascript it totally breaks the POST data in my php script. I don't know if the problem is in javascript, php or mysql. Any thoughts?
Encoding problems are sometimes hard to debug. In short, the best solution is to use UTF-8 as the encoding everywhere, in every component of your application stack.
Your page seems to be delivered as ISO-LATIN-1 (sent via an HTTP header from your webserver), which leads browsers to use latin1 or some Windows equivalent like windows-1252, even though you may have META elements in your HTML's HEAD telling user agents to use UTF-8. The HTTP header takes precedence. Check that your other file formats (especially .js) are delivered as UTF-8 as well. If your problems still appear after configuring everything client-side (HTML, JS, XHR etc.) to use UTF-8, you will have to start checking your server side for problems.
This may include such simple problems as PHP files not being proper UTF8 (very unlikely on linux servers I'd say) but usually consists of problems with mysql configurations (server and client), database and table default encoding (and collation) and the correct connection settings. Problems may also be caused by incorrect PHP ini or mbstring configuration settings.
Examples (not complete; using mysql here as a common database example):
MySQL configuration
[mysqld]
default_character_set = utf8
character_set_client = utf8
character_set_server = utf8
[client]
default_character_set = utf8
Please note, that those settings are different for mysql version 5.1 and 5.5 and may prevent the mysqld from starting when using the wrong variable. See http://dev.mysql.com/doc/refman//5.5/en/server-system-variables.html and http://dev.mysql.com/doc/refman/5.1/en/server-system-variables.html for details.
You may check your mysql variables via CLI:
mysql> SHOW VARIABLES LIKE '%char%';
Variable_name              Value
character_set_client       utf8
character_set_connection   utf8
character_set_database     utf8
character_set_filesystem   binary
character_set_results      utf8
character_set_server       utf8
character_set_system       utf8
When creating databases and tables try to use something like
CREATE DATABASE $db /*!40100 DEFAULT CHARACTER SET utf8 */
PHP.ini settings (should be the default already):
default_charset = "utf-8"
MB-String extension of PHP uses latin1 by default and should be reconfigured if used:
[mbstring]
mbstring.internal_encoding = UTF-8
mbstring.http_output = UTF-8
...some more perhaps...
Webserver settings (Apache used as example, applies to other servers as well):
# httpd.conf
AddDefaultCharset UTF-8
PHP source codes may use header settings like:
header('Content-type: text/html; charset=UTF-8');
Shell (bash) settings:
# ~/.profile
export LC_CTYPE=en_US.UTF-8
export LANG=en_US.UTF-8
The above list is presented here just to give you a hint of the pitfalls that may await you in certain situations. Every single component of your web stack must be able to use UTF-8 and should be configured correctly to do so. That said, a single correct UTF-8 HTTP header is usually enough to sort out most problems. Good luck! :-)
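As a small client-side complement to the list above, a hedged sketch of how the commented-out header from the question could be made consistent with a UTF-8-only stack when you serialize the body yourself (the fields object is a placeholder, not part of the original form; when sending a FormData object you should not set a Content-Type header manually at all):
var xhr = new XMLHttpRequest();
xhr.open(form.method, form.action);
xhr.setRequestHeader('Content-Type', 'application/x-www-form-urlencoded; charset=UTF-8');
var body = Object.keys(fields)                 // fields is a plain { name: value } object
    .map(function (k) { return encodeURIComponent(k) + '=' + encodeURIComponent(fields[k]); })
    .join('&');
xhr.send(body);                                // encodeURIComponent always emits UTF-8 percent-escapes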

Including base64 gzipped stylesheets/images in javascript?

I know you can include css and images, among other file types, which have been stored in base64 form within a javascript file. However, those are decently huge... and gzipped, they shrink down a LOT, even with the ~33% overhead from base64 encoding.
Non-gzipped, images are data:image/gif;base64, data:image/jpeg, data:image/png, and css is data:text/css;base64. What mime type can/should I be using, then, to include css or image data URIs which are gzipped? (Or if gzip+base64 can't work, is there any other compression I can do to bring down the string's size, while still keeping the data stored within the javascript?)
Edit
I think the question is being misunderstood. I am not asking if I should include gzipped base64 strings within javascript. Yes, I know it's best, in most cases, to gzip the javascript and other files on the server end. But that is not applicable for a userscript; a userscript has no server, and consists of only a single file. Firefox allows a @require directive, but Opera and Chrome do not, and local file security issues come into play when loading any local files. Thus anything needed by the script has to be either: 1) on the web (slow) or 2) embedded in the userscript (big).
Now this question assumes that big is preferable to slow, but that big does not have to mean we totally ignore just how big; if it can be smaller, that's an improvement.
So assuming that a base64 string is embedded in javascript, the question is how to make it into something meaningful.
Either:
1) atob() can convert raw base64-encoded gzip to raw gzip within javascript. (atob does not need to know the mediatype). The question then would be how to decompress that raw gzipped css or image file so that the resulting output can be fed into the document.
or 2) given the proper mediatype, browsers at least theoretically (per the datauri RFC) should be able to load any file directly from a datauri. "" is sufficient to load a non-gzipped css stylesheet. The question here would be what link type attribute and datauri mediatype combination should work (and which browsers would it work for)? Preferably, for a userscript, this would be a combination that works in Opera, FF, and Chrome.
In HTTP, compression is most often only applied for transmission to reduce the payload that is to be transmitted. This is done by the Content-Encoding header field.
But the data URL scheme is very limited and you can only specify the media type:
dataurl := "data:" [ mediatype ] [ ";base64" ] "," data
Although you could use a multipart message, most user agents don't support them in data URLs. It would also be questionable whether the additional data needed to describe such a multipart message wouldn't outweigh the bytes you save by compressing the actual payload.
So compressing the data in a data URL is possible in theory but impracticable. It is better to simply compress the whole document the data URL is embedded in.
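If you still want to pursue option 1 from the question (decompressing in script), here is a hedged sketch assuming a JavaScript inflate library such as pako is bundled with the userscript; base64Gzip is a placeholder for the embedded base64 string:
var bytes = Uint8Array.from(atob(base64Gzip), function (c) { return c.charCodeAt(0); });
var css = pako.ungzip(bytes, { to: 'string' });   // gunzip back to a stylesheet string

var style = document.createElement('style');      // inject the recovered CSS into the page
style.textContent = css;
document.head.appendChild(style);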

How to parse xml in JavaScript?

I'm trying to fetch and parse an XML-file through JavaScript. I don't control the XML-file.
Now somehow the encoding of some XML-files changed, which results in the code not being able to parse the file as far as I can tell. It used to be ANSI, some are Unicode now (and those are failing). Is there a way for me to correctly get the content, so both versions (ANSI and Unicode) work?
Files just start with:
<?xml version="1.0"?>
And the only JavaScript used to parse it is:
var parser = new DOMParser();
var dom = parser.parseFromString(responseDetails.responseText,"application/xml");
If the encoding isn't correctly specified, I think you're going to have to chop the header off, then attach a new header specifying a candidate encoding. Parse that, and if it fails, attach a new header with a new candidate encoding. And so on.
Of course, a successful parse doesn't imply you've got the right encoding, but an encoding that passes the parsing stage.
The real fix is to correct the original XML, unfortunately.
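One way to implement the try-each-candidate-encoding idea with current APIs, assuming you can obtain the raw response bytes (e.g. via responseType = 'arraybuffer'); this is a sketch, not a drop-in replacement for the code above:
function parseXmlBytes(buffer) {
    var candidates = ['utf-8', 'utf-16le', 'windows-1252'];   // encodings to try, in order
    var parser = new DOMParser();
    for (var i = 0; i < candidates.length; i++) {
        var text = new TextDecoder(candidates[i]).decode(buffer);
        var dom = parser.parseFromString(text, 'application/xml');
        if (!dom.querySelector('parsererror')) {
            return dom;   // parsed cleanly; it may still be the wrong encoding, as noted above
        }
    }
    return null;          // no candidate produced a well-formed document
}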
