According to MDN, the encodeURI() function encodes a URI by:
replacing each instance of certain characters by one, two, three, or four escape sequences representing the UTF-8 encoding of the character
However, when invoking encodeURI('\u0082') (in Chrome) I'm getting %C2%82 as output.
I expected to get %82 or %00%82. What does the %C2 mean?
The '0082' in '\u0082' is the Unicode code point, not the UTF-8 byte representation.
UTF-8 maps the U+0082 code point to two bytes: C2 82.
Unicode to UTF-8 mapping table
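A quick way to check this in a console (a rough sketch; the manual two-byte calculation only applies to code points up to U+07FF):
// encodeURI percent-encodes the UTF-8 bytes of the character
console.log(encodeURI('\u0082'));                    // "%C2%82"
// The same two bytes, computed by hand for a code point <= U+07FF
var cp = 0x0082;
var byte1 = 0xC0 | (cp >> 6);                        // 0xC2
var byte2 = 0x80 | (cp & 0x3F);                      // 0x82
console.log(byte1.toString(16), byte2.toString(16)); // c2 82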
Decoding %C2 at http://www.albionresearch.com/misc/urlencode.php leads to Â
When dealing with German texts and ISO 8859-15 / ISO 8859-1 vs. UTF-8, I often ran into the Ã character. The two characters look quite similar. Might this also be an encoding problem?
Maybe HTML encoding issues - "Â" character showing up instead of " " helps.
I'm writing some code that scans a string for TeX-style Greek characters (like \Delta or \alpha) and replaces them with the corresponding Unicode symbol. It works fine for the non-italic Greek characters. The problem is that I want to use mathematical italic for the lower case. These codes are one digit longer. For example, the code for the letter alpha is 1d6fc. When I put \u1d6fc into my string it displays as the character that matches \u1d6f (a lower case m with a superimposed tilde) followed by the letter c. How do I force the "correct" reading of the code?
You have to use UTF-16 surrogate pairs for characters outside the Basic Multilingual Plane (code points above U+FFFF). In your particular case, you can use 0xD835 0xDEFC:
console.log('\uD835\uDEFC')
Here is a handy pair calculator. If you don't have to worry about Internet Explorer, you can also use String.fromCodePoint(), which will deal with that mess for you. If you do have to worry about Internet Explorer, MDN has a polyfill for that method.
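For the curious, the arithmetic behind such a pair calculator is short; here is a sketch of the standard UTF-16 conversion (not taken from the linked page):
var cp = 0x1D6FC;                             // mathematical italic small alpha
var high = 0xD800 + ((cp - 0x10000) >> 10);   // 0xD835
var low  = 0xDC00 + ((cp - 0x10000) & 0x3FF); // 0xDEFC
console.log(String.fromCharCode(high, low));  // same as '\uD835\uDEFC'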
To produce a \u escape sequence with more than 4 hex digits (code point belonging to a so-called astral plane), you can use the Unicode code point escape notation \u{xxxxx}:
console.log ('\u{1d6fc}');
or you can call String.fromCodePoint with the code point value expressed in hexadecimal using the 0x prefix notation:
console.log (String.fromCodePoint (0x1d6fc));
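Applied to the original problem, a minimal sketch of the replacement could look like this (the lookup table and its entries are only illustrative, and String.fromCodePoint needs the polyfill mentioned above on Internet Explorer):
// Map a few TeX-style names to code points; extend as needed
var greek = { '\\alpha': 0x1D6FC, '\\beta': 0x1D6FD, '\\Delta': 0x394 };
function replaceTeX(text) {
  return text.replace(/\\[A-Za-z]+/g, function (name) {
    return name in greek ? String.fromCodePoint(greek[name]) : name;
  });
}
console.log(replaceTeX('angle \\alpha and \\Delta x')); // "angle 𝛼 and Δ x"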
I have a pretty simple question, but a few simple Google and Stack Exchange queries were not able to answer it, so I guess I'm missing something here.
Here are my simplified parameters:
I'm using JavaScript.
I have a text that needs to be URL-encoded, and the text has more than one line.
My question is: what is the character for a newline before the text gets encoded? (I know that after encoding, the newline will become %0A.)
I guess asking "What character is decoded when decoding %0A?" amounts to the same question.
Those codes consist of a percent sign followed by a two-character hexadecimal number representing a byte value.
So in this case, the byte value is 0A, representing the ASCII newline character. This is commonly written as \n inside strings in JavaScript (and others, like PHP).
But I think your question suggests you want to do some search and replace for this character. I would not do that, since there can be other characters too that need encoding. Instead, use the function encodeURIComponent, which can encode the entire string for you. There is encodeURI as well, but in your case, I think the first is more appropriate.
This example shows how special characters (newline, space, and others) are encoded to a URL-friendly format. Note that the diacritic é translates to the two bytes of its UTF-8 representation.
document.write(encodeURIComponent("Normal text\nEéy, check the specials: /, + and \t!"));
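As a quick check (not part of the original answer), decoding the result gives back the original string, newline included; the %0A in the output is the encoded \n:
var encoded = encodeURIComponent("Line one\nLine two");
console.log(encoded);                     // "Line%20one%0ALine%20two"
console.log(decodeURIComponent(encoded)); // prints the two lines again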
Say I have the following JavaScript instruction:
var a="hiàja, c . Non di–g t";
a contains binary data, i.e., any ASCII from 0-255.
Before which ASCII bytes should I add a backslash so that a is read properly (for example, before ")?
Should I use a specific charset and content type other than text/javascript and UTF-8?
Thanks
The ASCII range is 0 to 127, but strings are not limited to ASCII in JavaScript. According to the ECMAScript standard, “All characters may appear literally in a string literal except for the closing quote character, backslash, carriage return, line separator, paragraph separator, and line feed.” If the encoding of your document is suitable (e.g., windows-1252 or utf-8) and properly declared, you can use your example string as it is.
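If you do need to build such a literal programmatically, only the characters listed in that quote have to be escaped. A minimal sketch, with a made-up function name:
function escapeForDoubleQuotedLiteral(s) {
  return s.replace(/\\/g, '\\\\')         // backslash first, so it is not escaped twice
          .replace(/"/g, '\\"')           // the closing quote character
          .replace(/\r/g, '\\r')          // carriage return
          .replace(/\n/g, '\\n')          // line feed
          .replace(/\u2028/g, '\\u2028')  // line separator
          .replace(/\u2029/g, '\\u2029'); // paragraph separator
}
console.log('var a="' + escapeForDoubleQuotedLiteral('hi"à\nja') + '";');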
When I JSON.stringify() the following object:
var exampleObject = { "name" : "Žiga Kovač", "kraj" : "Žužemberk"};
I get different results between browsers.
IE8 and Google Chrome return:
{"name":"\u017diga Kova\u010d","kraj":"\u017du\u017eemberk"}
While Firefox and Opera return:
{"name":"Žiga Kovač","kraj":"Žužemberk"}
I am using the browser's native JSON implementation in all 4 browsers. If I undefine the native JSON implementation and replace it with the one from json.org, then all browsers return:
{"name":"Žiga Kovač","kraj":"Žužemberk"}
Why is this happening, which result is correct, and is it possible to make all browsers return:
{"name":"\u017diga Kova\u010d","kraj":"\u017du\u017eemberk"}
?
These two representations are absolutely equivalent.
One uses Unicode escape sequences (\uxxxx) to represent a Unicode character, the other uses the actual Unicode character. json.org defines a string as:
string
- ""
- "chars"
chars
- char
- char chars
char
- any Unicode character except " or \ or control characters
- one of: \" \\ \/ \b \f \n \r \t
- \u four-hex-digits
There is no difference in the strings themselves, only in their representation. This is the same thing HTML does when you use the entity &copy;, the numeric reference &#169;, or the literal character © to represent the copyright sign.
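You can verify the equivalence yourself, and if you really want every browser to emit the escaped form, you can post-process the output of JSON.stringify. A sketch (the helper name is made up; this is not a native JSON API):
var escaped = '{"name":"\\u017diga Kova\\u010d"}';
var literal = '{"name":"Žiga Kovač"}';
console.log(JSON.parse(escaped).name === JSON.parse(literal).name); // true

// Escape every non-ASCII UTF-16 unit by hand after stringifying
function stringifyAscii(value) {
  return JSON.stringify(value).replace(/[\u007f-\uffff]/g, function (c) {
    return '\\u' + ('0000' + c.charCodeAt(0).toString(16)).slice(-4);
  });
}
console.log(stringifyAscii({ name: 'Žiga Kovač' })); // {"name":"\u017diga Kova\u010d"}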
The visibly 'correct' version is a UTF-8 string, while the escaped version is an ASCII string containing Unicode escape sequences. The first one can be used in an HTTP body (as long as the content encoding is set to UTF-8), whereas the second one can also be used in an HTTP GET request.
If you want to use the UTF-8 version in a GET request, you need to escape it first, using encodeURIComponent.
When the content is received on the server side, the native string implementation will make sure that it contains exactly the same data (from all clients), provided that the HTTP transmission is correct.
Your browser will generally handle the encoding of it, if you send it as an HTTP POST body.
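For example (the URL and parameter name are made up, just to show where encodeURIComponent fits in):
var payload = JSON.stringify({ name: 'Žiga Kovač' });
var url = 'https://example.com/api/save?data=' + encodeURIComponent(payload);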
Both results are correct, as long as your first example is encoded in UTF-8.
E.g. \u017d is just another notation for Ž (017D is its Unicode code point).
They are all correct. Some are returning it encoded in UTF-8, and some in ASCII.
Is it possible to detect binary data in JavaScript?
I'd like to be able to detect binary data and convert it to hex for easier readability/debugging.
After more investigation I've realized that detecting binary data is not the right question, because binary data can contain regular characters as well as non-printable characters.
Outis's question and answer (/[\x00-\x1F]/) is really the best we can do in an attempt to detect binary characters.
Note: you must remove line feeds and possibly other characters from your ASCII string for the check to actually work.
If by "binary", you mean "contains non-printable characters", try:
/[\x00-\x1F]/.test(data)
If whitespace is considered non-binary data, try:
/[\x00-\x08\x0E-\x1F]/.test(data)
If you know the string is either ASCII or binary, use:
/[\x00-\x1F\x80-\xFF]/.test(data)
or:
/[\x00-\x08\x0E-\x1F\x80-\xFF]/.test(data)
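To also get the hex dump mentioned in the question, here is a small sketch combining one of the tests above with a byte-by-byte conversion (the function names are just examples):
function looksBinary(data) {
  return /[\x00-\x08\x0E-\x1F]/.test(data); // non-printable, whitespace excluded
}
function toHex(data) {
  return Array.prototype.map.call(data, function (c) {
    return ('00' + c.charCodeAt(0).toString(16)).slice(-2);
  }).join(' ');
}
console.log(looksBinary('abc\x00def')); // true
console.log(toHex('abc\x00def'));       // "61 62 63 00 64 65 66"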