Character encoding works on one page, but not the other - javascript

I have a page http://199.193.248.80/test/test.php that contains the « character.
But when I read this page with JS on http://199.193.248.80/test/test.html, the character turns into �.
Both pages are using charset Windows-1252, so I have no idea why it works on one page but not the other. What needs to be done to fix this?

This is probably because PHP sets a different character set in the headers (when serving the .php) than Apache does (when serving the .html). Browsers use the character set mentioned in the response headers; in fact, it overrides any <meta> tags.
By default PHP chooses iso-8859-1, I believe, but you can override the character set in PHP by using:
header('Content-Type: text/html; charset=windows-1252');
Or change php.ini for a global change.
See also:
http://httpd.apache.org/docs/2.0/mod/core.html#adddefaultcharset (for Apache)
http://www.php.net/manual/en/ini.core.php#ini.default-charset (for PHP)
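To confirm which charset each page is actually served with, you can inspect the header from the browser itself; a minimal sketch using the question's own URL:
var xhr = new XMLHttpRequest();
xhr.open('HEAD', '/test/test.php', true);
xhr.onreadystatechange = function () {
  if (xhr.readyState === 4) {
    // Whatever charset shows up here is the one the browser uses;
    // it wins over any <meta> tag in the page.
    alert(xhr.getResponseHeader('Content-Type'));
  }
};
xhr.send(null);
Run the same check against /test/test.html and compare the two results.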

I suggest using the HTML-entity form instead: &laquo;
This way it doesn't matter what charset you use for your file, because the browser just parses the entity.
In PHP you can use $str = htmlentities( $str ); to encode a string
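If you need the same escaping on the client side, a rough JavaScript counterpart (the helper name encodeEntities is made up) could look like this:
function encodeEntities(str) {
  // Replace every code point above US-ASCII with its numeric
  // character reference, so the file's charset no longer matters.
  return str.replace(/[\u0080-\uFFFF]/g, function (ch) {
    return '&#' + ch.charCodeAt(0) + ';';
  });
}
// encodeEntities('«') returns '&#171;', which any charset can carry.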

utf-8 text is being garbled

I'm still new to webdev and dealing with character set encodings. I've read http://kunststube.net/encoding/ along with a few other pieces on the subject.
My problem is that I've got a bunch of text that I'm pulling from a server. It is encoded and served as utf-8.
However, when I display the strings, the French/Spanish accents are garbled. I've googled around and it seems JavaScript engines use UCS-2 or UTF-16 internally. Is there something I have to do to get them to treat my text as UTF-8? I have <meta charset="utf-8"> in my HTML, but it doesn't seem to do anything.
Any ideas?
Without any links I can't inspect what you are doing directly, but you shouldn't need to do anything special inside JavaScript to get it to work; just make sure all your sources are served as UTF-8 correctly and that the browser is interpreting them as such.
You may need to make sure your server (Apache? IIS?) is sending the appropriate encoding header. For example, in PHP:
header('Content-Type: text/plain; charset=utf-8');
header('Content-Type: text/html; charset=utf-8');
Or it can be done in .htaccess; there are several ways. Per file extension:
AddCharset UTF-8 .html
or for specific files only:
<Files "example.js">
AddCharset UTF-8 .js
</Files>
refs:
http://us2.php.net/manual/fr/function.header.php
https://www.w3.org/International/questions/qa-htaccess-charset.en
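To verify what the browser actually picked, you can check from the console; this is a read-only check, not a fix:
// document.characterSet (document.charset in older IE) reports the
// encoding the browser settled on for the current document; if it
// doesn't say "UTF-8", the headers or the meta tag still need fixing.
alert(document.characterSet || document.charset);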
If you don't have a meta tag in your HTML, put one in the <head>:
<meta charset="UTF-8">
Otherwise, you have to declare the character encoding of your script file explicitly, for example with the legacy charset attribute on the <script> tag.
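If you load scripts dynamically, the same declaration can be made from code; a small sketch, where the path is hypothetical and charset is the legacy DOM property:
var script = document.createElement('script');
script.src = '/js/app.js'; // hypothetical path
// Tells the browser how to decode the file when the server
// doesn't send a charset of its own.
script.charset = 'UTF-8';
document.getElementsByTagName('head')[0].appendChild(script);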

How to initialize textarea content with some text

Writing an HTTP server in Ruby, I need to edit files in the browser, including files that contain source code (HTML, JavaScript and Ruby). I need to put any text file's content into the value of a textarea:
"<textarea>__CONTENT__</textarea>".gsub('__CONTENT__',File.read(filename))
However, this doesn't work if the file contains certain special substrings, such as </textarea>. So I tried to 'prepare' the data by doing certain replacements in the file content. However, there is an issue if the file contains source code with HTML/Ruby content, and especially if I try to send the source of my HTTP server. This chain of replacements seems good:
File.read(__FILE__).gsub(/&/,"&"+"amp;").gsub('<',"&"+"lt;").gsub('>',"&"+"gt;")
However, this is not good enough. There is an issue (in the web browser) when the file contains \'! Is there a useful technique to place any text in the textarea (server side and/or browser side)?
CGI::escapeHTML will "prepare" strings to be HTML-safe.
require 'cgi'
CGI::escapeHTML(File.read(__FILE__))
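If the escaping ever has to happen in the browser instead of the server, a rough JavaScript counterpart of CGI::escapeHTML (hypothetical helper, covering the usual four characters) might be:
function escapeHTML(s) {
  // & must be replaced first, or the other entities get double-escaped.
  return s.replace(/&/g, '&amp;')
          .replace(/</g, '&lt;')
          .replace(/>/g, '&gt;')
          .replace(/"/g, '&quot;');
}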
This form is good:
CGI::escapeHTML(File.read(__FILE__))
except for the backslash character: a double backslash comes out as a single one.
I found this workaround:
Server side, replace each backslash with &99992222;
CGI::escapeHTML(File.read(@uri).gsub('\\','&9999'+'2222;'))
Browser side, replace "&99992222;" in the textarea with the backslash character:
var node=document.getElementById('textarea_1');
node.value=node.value.replace(/&9{4}2{4};/g,String.fromCharCode(92));
Hoping that no source contains &99992222;!

submitting form with accented characters via xmlhttprequest

I implemented this form submission method that uses XMLHttpRequest. I saw the new HTML5 feature, FormData, that allows submission of files along with forms. Cool! However, there's a problem with accented characters, specifically those stupid smart quotes that Word makes (yes, I'm a little biased against those characters). I used to have it submit to a hidden iframe, the old-school way, and I never had a problem with the variety of weird characters that got put in there. But I thought this would be better. It's turning out to be a bigger headache :-/
Let's look at the code. My JavaScript function (note the commented-out line):
var xhr = new XMLHttpRequest();
var fd = new FormData(form);
xhr.addEventListener("error", uploadFailed, false);
xhr.addEventListener("abort", uploadCanceled, false);
xhr.addEventListener("load", uploadComplete, false);
xhr.open($(form).attr('method'), $(form).attr('action'));
//xhr.setRequestHeader("Content-Type", "application/x-www-form-urlencoded; charset=ISO-8859-1");
xhr.send(fd);
This is a shortened view, check out line 1510 at http://archive.cyark.org/sitemanager/sitemanager.js to view the entire function.
Then on the receiving php page, I have at the top:
header('Content-Type: text/html; charset=ISO-8859-1');
Followed by some basic php to build a string with the post data and submit it as an update to mysql.
So what do I do? If I uncomment the content-type setting in javascript it totally breaks the POST data in my php script. I don't know if the problem is in javascript, php or mysql. Any thoughts?
Encoding problems are sometimes hard to debug. In short, the best solution is to use UTF-8 as the encoding everywhere, in every component of your application stack.
Your page seems to be delivered as ISO-LATIN-1 (sent via HTTP header from your webserver), which leads browsers to use latin1 or some Windows equivalent like windows-1252, even though you may have META elements in your HTML's HEAD telling user agents to use UTF-8. The HTTP header takes precedence. Check that your other file formats (especially .js) are delivered as UTF-8 as well. If your problems still appear after configuring everything client-side (HTML, JS, XHR etc.) to use UTF-8, you will have to start checking your server side for problems.
This may include such simple problems as PHP files not being proper UTF-8 (very unlikely on Linux servers, I'd say), but usually comes down to problems with the MySQL configuration (server and client), the database and table default encoding (and collation), and the correct connection settings. Problems may also be caused by incorrect PHP ini or mbstring configuration settings.
Examples (not complete; using mysql here as a common database example):
MySQL configuration
[mysqld]
default_character_set = utf8
character_set_client = utf8
character_set_server = utf8
[client]
default_character_set = utf8
Please note that those settings differ between MySQL versions 5.1 and 5.5 and may prevent mysqld from starting if the wrong variable is used. See http://dev.mysql.com/doc/refman//5.5/en/server-system-variables.html and http://dev.mysql.com/doc/refman/5.1/en/server-system-variables.html for details.
You may check your mysql variables via CLI:
mysql> SHOW VARIABLES LIKE '%char%';
Variable_name             Value
character_set_client      utf8
character_set_connection  utf8
character_set_database    utf8
character_set_filesystem  binary
character_set_results     utf8
character_set_server      utf8
character_set_system      utf8
When creating databases and tables try to use something like
CREATE DATABASE $db /*!40100 DEFAULT CHARACTER SET utf8 */
php.ini settings (should be the default already):
default_charset = "utf-8"
PHP's mbstring extension uses latin1 by default and should be reconfigured if it is in use:
[mbstring]
mbstring.internal_encoding = UTF-8
mbstring.http_output = UTF-8
...some more perhaps...
Webserver settings (Apache used as example, applies to other servers as well):
# httpd.conf
AddDefaultCharset UTF-8
PHP source codes may use header settings like:
header('Content-type: text/html; charset=UTF-8');
Shell (bash) settings:
# ~/.profile
export LC_CTYPE=en_US.UTF-8
export LANG=en_US.UTF-8
The above list is presented here just to give you a hint at the pitfalls that may await you in certain situations. Every single component of your web stack must be able to use UTF-8 and should be configured correctly to do so. Nonetheless, a single correct HTTP header declaring UTF-8 is usually enough to sort out most problems. Good luck! :-)
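One JavaScript-side detail worth adding: FormData is always transmitted as multipart/form-data with UTF-8 text, and setting the Content-Type header by hand throws away the multipart boundary the browser would have generated, which would explain why the commented-out line breaks the POST. A minimal sketch with made-up form id and endpoint:
var xhr = new XMLHttpRequest();
var fd = new FormData(document.getElementById('myForm')); // hypothetical id
xhr.open('POST', '/submit.php'); // hypothetical endpoint
// Do not call setRequestHeader('Content-Type', ...) here: the browser
// supplies "multipart/form-data; boundary=..." itself, and the PHP
// side should then treat the received text as UTF-8.
xhr.send(fd);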

iPhone browser/IIS/Tomcat, Japanese locale, http parameters getting messed

First, the environment: the client is mobile Safari on an iPhone; the server consists of a Tomcat 5.5 fronted by IIS.
I have a piece of javascript code that sends a single parameter to the server and gets back some response:
var url = "/abc/ABCServlet";
var paramsString = "name=SomeName"
xmlhttpobj = getXmlHttpObject(); //Browser specific object returned
xmlhttpobj.onreadystatechange = callbackFunction;
xmlhttpobj.open("GET", url + "?" + paramsString, true);
xmlhttpobj.send(null);
This works fine when the iPhone language/locale is EN/US; but when the locale/language is changed to Japanese the query parameter received by the server becomes "SomeName#" without the quotes. Somehow a # is getting appended at the end.
Any clues why?
Hopefully, all you need to do is add a meta tag to the top of your HTML page that specifies the correct character set (e.g. <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />) and match whatever encoding your datafiles are expecting.
If that doesn't work, ensure that you are using the same character encoding (preferably UTF-8) throughout your application. Your server-side scripts and any files that include text strings you will be adding directly to the response stream should be saved with that single encoding. It's a good idea to have your servers send a "Content-Type" HTTP header of the same encoding if possible (e.g. "text/html; charset=utf-8"). And you should ensure that the mobile safari page that's doing the displaying has the right Content-Type meta tag.
Japanese developers have a nasty habit of storing files in EUC or ISO-2022-JP, both of which often force the browser to use different font faces in some browsers and can seriously break your page if the browser is expecting a Roman charset. The good news is that if you're forced to use one of the Japanese encodings, that encoding will typically display most English text correctly. It's the extended characters you need to look out for.
Now I may be wrong, but I THOUGHT that loading these files via AJAX was not a problem (I think the browser remaps the character data according to the character set for every text file it loads), but as you start mixing document encodings in a single file (and especially in your document body), bad things can happen. Maybe mobile safari requires the same encoding for both HTML files and AJAX files. I hope not. That would be ugly.
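Separately from the page encoding, it is worth percent-encoding the query value before appending it to the URL; encodeURIComponent always emits UTF-8 escapes, so the result is locale-independent. A sketch reusing the helpers from the question:
var url = "/abc/ABCServlet";
var paramsString = "name=" + encodeURIComponent("SomeName");
xmlhttpobj = getXmlHttpObject(); // browser-specific object, as in the question
xmlhttpobj.onreadystatechange = callbackFunction;
// The value travels as %XX escapes, so the server should no longer
// see stray characters appended to it.
xmlhttpobj.open("GET", url + "?" + paramsString, true);
xmlhttpobj.send(null);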

Convert ISO/Windows charsets to UTF-8 in Javascript

I'm developing a Firefox plugin and I fetch web pages to do some analysis for the user. The problem is that when I try to get (via XMLHttpRequest) pages that are not UTF-8 encoded, the string I see is messed up, for example Hebrew pages in windows-1255 or Chinese pages in gb2312.
I already tried the following:
var uDecoder=Components.classes["@mozilla.org/intl/scriptableunicodeconverter"].getService(Components.interfaces.nsIScriptableUnicodeConverter);
uDecoder.charset="windows-1255";
alert( xhr.responseText );
var decoder=Components.classes["@mozilla.org/intl/utf8converterservice;1"].getService(Components.interfaces.nsIUTF8ConverterService);
alert(decoder.convertStringToUTF8(xhr.responseText,"WINDOWS-1255",true));
I also tried escape/unescape/encodeURIComponent.
Any ideas?
Once XMLHttpRequest has tried to decode a non-UTF-8 string using UTF-8, you've already lost. The byte sequences in the page that weren't valid UTF-8 sequences will have been mangled (typically converted to �, the U+FFFD replacement character). No amount of re-encoding/decoding will get them back.
Pages that specify a Content-Type: text/html;charset=something HTTP header should be OK. Pages that don't have a real HTTP header but do have a <meta> version of it won't be, because XMLHttpRequest doesn't know about parsing HTML so it won't see the meta. If you know in advance the charset you want, you can tell XMLHttpRequest and it'll use it:
xhr.open(...);
xhr.overrideMimeType('text/html;charset=gb2312');
xhr.send();
(This is a currently non-standardised Mozilla extension.)
If you don't know the charset in advance, you can request the page once, scan the returned text for a <meta> charset, parse that out, and request the page again with the new charset.
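A rough sketch of that two-request approach, with a deliberately naive regex for the <meta> tag (a real parser would need to handle more variations):
function fetchWithSniffedCharset(url, callback) {
  var probe = new XMLHttpRequest();
  probe.open('GET', url, true);
  // First pass: treat the body as single-byte text just to find the meta tag.
  probe.overrideMimeType('text/html;charset=iso-8859-1');
  probe.onreadystatechange = function () {
    if (probe.readyState !== 4) return;
    var m = /<meta[^>]*charset=["']?([\w-]+)/i.exec(probe.responseText);
    var charset = m ? m[1] : 'utf-8';
    // Second pass: re-request with the charset we found.
    var real = new XMLHttpRequest();
    real.open('GET', url, true);
    real.overrideMimeType('text/html;charset=' + charset);
    real.onreadystatechange = function () {
      if (real.readyState === 4) callback(real.responseText);
    };
    real.send(null);
  };
  probe.send(null);
}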
In theory you could get a binary response in a single request:
xhr.overrideMimeType('text/html;charset=iso-8859-1');
and then convert that from bytes-as-chars to UTF-8. However, iso-8859-1 wouldn't work for this because the browser interprets that charset as really being Windows code page 1252.
You could maybe use another codepage that maps every byte to a character, do a load of tedious character replacements to map every character in that codepage to the character it would have been in real ISO-8859-1, and then do the conversion. Most encodings don't map every byte, but Arabic (cp1256) might be a candidate for this?
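If you do try the bytes-as-chars route, the windows-1252 reinterpretation can be undone by hand. A sketch, assuming (as Firefox does) that the five bytes cp1252 leaves undefined come through unchanged as C1 control characters, so only the printable extras need remapping; the helper names are made up:
// Unicode code point -> original cp1252 byte, for the 27 printable
// characters windows-1252 places in the 0x80-0x9F range.
var CP1252_EXTRAS = {
  8364: 128, 8218: 130, 402: 131, 8222: 132, 8230: 133, 8224: 134,
  8225: 135, 710: 136, 8240: 137, 352: 138, 8249: 139, 338: 140,
  381: 142, 8216: 145, 8217: 146, 8220: 147, 8221: 148, 8226: 149,
  8211: 150, 8212: 151, 732: 152, 8482: 153, 353: 154, 8250: 155,
  339: 156, 382: 158, 376: 159
};
function toBytes(str) {
  var bytes = [];
  for (var i = 0; i < str.length; i++) {
    var c = str.charCodeAt(i);
    // Characters outside the table already equal their byte value.
    bytes.push(CP1252_EXTRAS[c] !== undefined ? CP1252_EXTRAS[c] : c);
  }
  return bytes; // raw bytes, ready for a real charset table (e.g. cp1255)
}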
