When encoding a query string to be sent to a web server - when do you use escape() and when do you use encodeURI() or encodeURIComponent():
Use escape:
escape("% +&=");
OR
use encodeURI() / encodeURIComponent()
encodeURI("http://www.google.com?var1=value1&var2=value2");
encodeURIComponent("var1=value1&var2=value2");
escape()
Don't use it!
escape() is defined in section B.2.1.1 escape and the introduction text of Annex B says:
... All of the language features and behaviours specified in this annex have one or more undesirable characteristics and in the absence of legacy usage would be removed from this specification. ...
... Programmers should not use or assume the existence of these features and behaviours when writing new ECMAScript code....
Behaviour:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/escape
Special characters are encoded with the exception of: #*_+-./
The hexadecimal form for characters, whose code unit value is 0xFF or less, is a two-digit escape sequence: %xx.
For characters with a greater code unit, the four-digit format %uxxxx is used. This is not allowed within a query string (as defined in RFC3986):
query = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "#"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
A percent sign is only allowed if it is directly followed by two hexdigits, percent followed by u is not allowed.
encodeURI()
Use encodeURI when you want a working URL. Make this call:
encodeURI("http://www.example.org/a file with spaces.html")
to get:
http://www.example.org/a%20file%20with%20spaces.html
Don't call encodeURIComponent since it would destroy the URL and return
http%3A%2F%2Fwww.example.org%2Fa%20file%20with%20spaces.html
Note that encodeURI, like encodeURIComponent, does not escape the ' character.
encodeURIComponent()
Use encodeURIComponent when you want to encode the value of a URL parameter.
var p1 = encodeURIComponent("http://example.org/?a=12&b=55")
Then you may create the URL you need:
var url = "http://example.net/?param1=" + p1 + "¶m2=99";
And you will get this complete URL:
http://example.net/?param1=http%3A%2F%2Fexample.org%2F%Ffa%3D12%26b%3D55¶m2=99
Note that encodeURIComponent does not escape the ' character. A common bug is to use it to create html attributes such as href='MyUrl', which could suffer an injection bug. If you are constructing html from strings, either use " instead of ' for attribute quotes, or add an extra layer of encoding (' can be encoded as %27).
For more information on this type of encoding you can check: http://en.wikipedia.org/wiki/Percent-encoding
The difference between encodeURI() and encodeURIComponent() are exactly 11 characters encoded by encodeURIComponent but not by encodeURI:
I generated this table easily with console.table in Google Chrome with this code:
var arr = [];
for(var i=0;i<256;i++) {
var char=String.fromCharCode(i);
if(encodeURI(char)!==encodeURIComponent(char)) {
arr.push({
character:char,
encodeURI:encodeURI(char),
encodeURIComponent:encodeURIComponent(char)
});
}
}
console.table(arr);
I found this article enlightening :
Javascript Madness: Query String Parsing
I found it when I was trying to undersand why decodeURIComponent was not decoding '+' correctly. Here is an extract:
String: "A + B"
Expected Query String Encoding: "A+%2B+B"
escape("A + B") = "A%20+%20B" Wrong!
encodeURI("A + B") = "A%20+%20B" Wrong!
encodeURIComponent("A + B") = "A%20%2B%20B" Acceptable, but strange
Encoded String: "A+%2B+B"
Expected Decoding: "A + B"
unescape("A+%2B+B") = "A+++B" Wrong!
decodeURI("A+%2B+B") = "A+++B" Wrong!
decodeURIComponent("A+%2B+B") = "A+++B" Wrong!
encodeURIComponent doesn't encode -_.!~*'(), causing problem in posting data to php in xml string.
For example:
<xml><text x="100" y="150" value="It's a value with single quote" />
</xml>
General escape with encodeURI
%3Cxml%3E%3Ctext%20x=%22100%22%20y=%22150%22%20value=%22It's%20a%20value%20with%20single%20quote%22%20/%3E%20%3C/xml%3E
You can see, single quote is not encoded.
To resolve issue I created two functions to solve issue in my project, for Encoding URL:
function encodeData(s:String):String{
return encodeURIComponent(s).replace(/\-/g, "%2D").replace(/\_/g, "%5F").replace(/\./g, "%2E").replace(/\!/g, "%21").replace(/\~/g, "%7E").replace(/\*/g, "%2A").replace(/\'/g, "%27").replace(/\(/g, "%28").replace(/\)/g, "%29");
}
For Decoding URL:
function decodeData(s:String):String{
try{
return decodeURIComponent(s.replace(/\%2D/g, "-").replace(/\%5F/g, "_").replace(/\%2E/g, ".").replace(/\%21/g, "!").replace(/\%7E/g, "~").replace(/\%2A/g, "*").replace(/\%27/g, "'").replace(/\%28/g, "(").replace(/\%29/g, ")"));
}catch (e:Error) {
}
return "";
}
encodeURI() - the escape() function is for javascript escaping, not HTTP.
Small comparison table Java vs. JavaScript vs. PHP.
1. Java URLEncoder.encode (using UTF8 charset)
2. JavaScript encodeURIComponent
3. JavaScript escape
4. PHP urlencode
5. PHP rawurlencode
char JAVA JavaScript --PHP---
[ ] + %20 %20 + %20
[!] %21 ! %21 %21 %21
[*] * * * %2A %2A
['] %27 ' %27 %27 %27
[(] %28 ( %28 %28 %28
[)] %29 ) %29 %29 %29
[;] %3B %3B %3B %3B %3B
[:] %3A %3A %3A %3A %3A
[#] %40 %40 # %40 %40
[&] %26 %26 %26 %26 %26
[=] %3D %3D %3D %3D %3D
[+] %2B %2B + %2B %2B
[$] %24 %24 %24 %24 %24
[,] %2C %2C %2C %2C %2C
[/] %2F %2F / %2F %2F
[?] %3F %3F %3F %3F %3F
[#] %23 %23 %23 %23 %23
[[] %5B %5B %5B %5B %5B
[]] %5D %5D %5D %5D %5D
----------------------------------------
[~] %7E ~ %7E %7E ~
[-] - - - - -
[_] _ _ _ _ _
[%] %25 %25 %25 %25 %25
[\] %5C %5C %5C %5C %5C
----------------------------------------
char -JAVA- --JavaScript-- -----PHP------
[ä] %C3%A4 %C3%A4 %E4 %C3%A4 %C3%A4
[ф] %D1%84 %D1%84 %u0444 %D1%84 %D1%84
I recommend not to use one of those methods as is. Write your own function which does the right thing.
MDN has given a good example on url encoding shown below.
var fileName = 'my file(2).txt';
var header = "Content-Disposition: attachment; filename*=UTF-8''" + encodeRFC5987ValueChars(fileName);
console.log(header);
// logs "Content-Disposition: attachment; filename*=UTF-8''my%20file%282%29.txt"
function encodeRFC5987ValueChars (str) {
return encodeURIComponent(str).
// Note that although RFC3986 reserves "!", RFC5987 does not,
// so we do not need to escape it
replace(/['()]/g, escape). // i.e., %27 %28 %29
replace(/\*/g, '%2A').
// The following are not required for percent-encoding per RFC5987,
// so we can allow for a little better readability over the wire: |`^
replace(/%(?:7C|60|5E)/g, unescape);
}
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURIComponent
For the purpose of encoding javascript has given three inbuilt functions -
escape() - does not encode #*/+
This method is deprecated after the ECMA 3 so it should be avoided.
encodeURI() - does not encode ~!##$&*()=:/,;?+'
It assumes that the URI is a complete URI, so does not encode reserved characters that have special meaning in the URI.
This method is used when the intent is to convert the complete URL instead of some special segment of URL.
Example - encodeURI('http://stackoverflow.com');
will give - http://stackoverflow.com
encodeURIComponent() - does not encode - _ . ! ~ * ' ( )
This function encodes a Uniform Resource Identifier (URI) component by replacing each instance of certain characters by one, two, three, or four escape sequences representing the UTF-8 encoding of the character. This method should be used to convert a component of URL. For instance some user input needs to be appended
Example - encodeURIComponent('http://stackoverflow.com');
will give - http%3A%2F%2Fstackoverflow.com
All this encoding is performed in UTF 8 i.e the characters will be converted in UTF-8 format.
encodeURIComponent differ from encodeURI in that it encode reserved characters and Number sign # of encodeURI
Also remember that they all encode different sets of characters, and select the one you need appropriately. encodeURI() encodes fewer characters than encodeURIComponent(), which encodes fewer (and also different, to dannyp's point) characters than escape().
Just try encodeURI() and encodeURIComponent() yourself...
console.log(encodeURIComponent('##$%^&*'));
Input: ##$%^&*. Output: %40%23%24%25%5E%26*. So, wait, what happened to *? Why wasn't this converted? It could definitely cause problems if you tried to do linux command "$string". TLDR: You actually want fixedEncodeURIComponent() and fixedEncodeURI(). Long-story...
When to use encodeURI()? Never. encodeURI() fails to adhere to RFC3986 with regard to bracket-encoding. Use fixedEncodeURI(), as defined and further explained at the MDN encodeURI() Documentation...
function fixedEncodeURI(str) {
return encodeURI(str).replace(/%5B/g, '[').replace(/%5D/g, ']');
}
When to use encodeURIComponent()? Never. encodeURIComponent() fails to adhere to RFC3986 with regard to encoding: !'()*. Use fixedEncodeURIComponent(), as defined and further explained at the MDN encodeURIComponent() Documentation...
function fixedEncodeURIComponent(str) {
return encodeURIComponent(str).replace(/[!'()*]/g, function(c) {
return '%' + c.charCodeAt(0).toString(16);
});
}
Then you can use fixedEncodeURI() to encode a single URL piece, whereas fixedEncodeURIComponent() will encode URL pieces and connectors; or, simply, fixedEncodeURI() will not encode +#?=:#;,$& (as & and + are common URL operators), but fixedEncodeURIComponent() will.
I've found that experimenting with the various methods is a good sanity check even after having a good handle of what their various uses and capabilities are.
Towards that end I have found this website extremely useful to confirm my suspicions that I am doing something appropriately. It has also proven useful for decoding an encodeURIComponent'ed string which can be rather challenging to interpret. A great bookmark to have:
http://www.the-art-of-web.com/javascript/escape/
The accepted answer is good.
To extend on the last part:
Note that encodeURIComponent does not escape the ' character. A common
bug is to use it to create html attributes such as href='MyUrl', which
could suffer an injection bug. If you are constructing html from
strings, either use " instead of ' for attribute quotes, or add an
extra layer of encoding (' can be encoded as %27).
If you want to be on the safe side, percent encoding unreserved characters should be encoded as well.
You can use this method to escape them (source Mozilla)
function fixedEncodeURIComponent(str) {
return encodeURIComponent(str).replace(/[!'()*]/g, function(c) {
return '%' + c.charCodeAt(0).toString(16);
});
}
// fixedEncodeURIComponent("'") --> "%27"
Inspired by Johann's table, I've decided to extend the table. I wanted to see which ASCII characters get encoded.
var ascii = " !\"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~";
var encoded = [];
ascii.split("").forEach(function (char) {
var obj = { char };
if (char != encodeURI(char))
obj.encodeURI = encodeURI(char);
if (char != encodeURIComponent(char))
obj.encodeURIComponent = encodeURIComponent(char);
if (obj.encodeURI || obj.encodeURIComponent)
encoded.push(obj);
});
console.table(encoded);
Table shows only the encoded characters. Empty cells mean that the original and the encoded characters are the same.
Just to be extra, I'm adding another table for urlencode() vs rawurlencode(). The only difference seems to be the encoding of space character.
<script>
<?php
$ascii = str_split(" !\"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~", 1);
$encoded = [];
foreach ($ascii as $char) {
$obj = ["char" => $char];
if ($char != urlencode($char))
$obj["urlencode"] = urlencode($char);
if ($char != rawurlencode($char))
$obj["rawurlencode"] = rawurlencode($char);
if (isset($obj["rawurlencode"]) || isset($obj["rawurlencode"]))
$encoded[] = $obj;
}
echo "var encoded = " . json_encode($encoded) . ";";
?>
console.table(encoded);
</script>
Modern rewrite of #johann-echavarria's answer:
console.log(
Array(256)
.fill()
.map((ignore, i) => String.fromCharCode(i))
.filter(
(char) =>
encodeURI(char) !== encodeURIComponent(char)
? {
character: char,
encodeURI: encodeURI(char),
encodeURIComponent: encodeURIComponent(char)
}
: false
)
)
Or if you can use a table, replace console.log with console.table (for the prettier output).
I have this function...
var escapeURIparam = function(url) {
if (encodeURIComponent) url = encodeURIComponent(url);
else if (encodeURI) url = encodeURI(url);
else url = escape(url);
url = url.replace(/\+/g, '%2B'); // Force the replacement of "+"
return url;
};
I'm trying to fix a string which has some encoding characters in it.
I thought I should be able to match the hex characters of the special characters and convert them back to a normal character.
Here is my example code:
let str = "url('https\3a //');";
str = str.replace(/\x5C\x33\x61\x20/g,":"); // equivalent to '\3a '
console.log(str);
I expected the output to be url('https://'); but I actually got url('https a //');
What am I missing? jsfiddle here. Is this some sort of multi byte character issue? I looked at the resulting string in a hex editor and the replaced characters seem to be \x03\x61\x20 rather than the expected \x3A.
EDIT: why has this been down voted? It is a fair question isn't it?
Is the code that you use really needs to be in this form?
I got the desired result using "3a".
str = "url('https\3a //');";
str = str.replace(/\3a /g,":"); // equivalent to '\3a '
console.log(str);
//result: url('https://');
I have found this today and can't make out why it fails:
Basically if you take some obscure symbol like
"👉"
then "👉".charCodeAt(0) in chrome console - you will get the code 55357, but when you revert the operation with String.fromCharCode(55357) it produces "�"
Even if I do it like this String.fromCharCode("👉".charCodeAt(0)) it produces "�" however String.fromCharCode("👉".charCodeAt(0)).charCodeAt(0) is still 55357, so information isn't lost, and it implies that it is Chrome that can't find correct symbol to map to 55357.
Why Chrome cannot represent symbol correctly? Is it because it cannot map it to font correctly? How do I make double conversion to be shown as "👉" again?
If you log
"👉".length
you will get 2, that is, the string actually contains 2 characters, not one. This is because JS only supports 16-bit unicode (BMP) and encodes "astral plane" symbols with "surrogate pairs". Your symbol is \uD83D\uDC49 internally, and when you do .charCodeAt(0) you only get \uD83D, which is invalid unicode.
More on https://mathiasbynens.be/notes/javascript-unicode
Following script will get the 'correct' char code (128073)
(("👉".charCodeAt(0)-0xD800)*0x400) + ("👉".charCodeAt(1)-0xDC00) + 0x10000
one then can convert it to HTML char code like this:
"&#x"+(((("👉".charCodeAt(0)-0xD800)*0x400) + ("👉".charCodeAt(1)-0xDC00) + 0x10000)).toString(16)+";"
And string extension:
String.prototype.charCodeUTF32 = function(){
return ((((this.charCodeAt(0)-0xD800)*0x400) + (this.charCodeAt(1)-0xDC00) + 0x10000));
};
Hope this saves you some time.
TypeScript to convert a text containing emojis:
private emoji2html(text: string): string {
const regexAstralSymbols = /([\uD800-\uDBFF])([\uDC00-\uDFFF])/g;
return text.replace(regexAstralSymbols, (m, first, second) =>
`&#x${(first + second).charCodeUTF32().toString(16)};`);
}
I am developing a small character encoder generator where the user input their text and on the click of a button, it outputs the encoded version.
I've defined an object of the characters that need to be encoded like so:
map = {
'©' : '©',
'&' : '&'
},
And here is the loop that gets the values from the map and replaces them:
Object.keys(map).forEach(function (ico) {
var icoE = ico.replace(/([.?*+^$[\]\\(){}|-])/g, "\\$1");
raw = raw.replace( new RegExp(icoE, 'g'), map[ico] );
});
I am them simply outputting the result to a textarea. This all works fine, however the problem I'm facing is this.
© is replaced with © however the & symbol at the beginning of this is then converted to & so it ends up being ©.
I see why this is happening however I'm not sure how to go about ensuring that & is not replaced within character encoded strings.
Here is a JSFiddle for a live preview of what I mean:
http://jsfiddle.net/4m3nw/1/
Any help would be much appreciated
Prelude: Apart from regex, an idea worth considering is something like this JS function that already handles html entities. Now, on to the regex question.
HTML Special Characters, Negative Lookahead
In HTML, special characters can look not only like © but also like —, and they can have upper-case characters.
To replace ampersands that are not immediately followed by a hash or word characters and a semicolon, you can use something like this:
&(?!(?:#[0-9]+|[a-z]+);)
See the demo.
Make sure to use the i flag to activate case-insensitive mode
& matches the literal ampersand
The negative lookahead (?!(?:#[0-9]+|[a-z]+);) asserts that it is not followed by...
(?:#[0-9]+|[a-z]+) a hash and digits, | OR letters...
then a semicolon.
Reference
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind
The problem is that since you process the same string you replace the &in ©. If you re-order your map then that seemingly solves the problem. However according to the ECMAScript specifications, this is not a given, so you would be relying on implementation details of the ECMAScript engine used.
What you can do to make sure it will always work is to swap the keys so that & is always processed first:
map = {
'©' : '©',
'&' : '&'
};
var keys = Object.keys(map);
keys[keys.indexOf('&')] = keys[0];
keys[0] = '&';
keys.forEach(function (ico) {
var icoE = ico.replace(/([.?*+^$[\]\\(){}|-])/g, "\\$1");
raw = raw.replace( new RegExp(icoE, 'g'), map[ico] );
});
Obviously you need to add checks for the &'s existence if it isn't always there.
jsFiddle Demo.
Probably the simplest code change is to reorder your map by putting the ampersand on top.
In Javascript '\uXXXX' returns in a unicode character. But how can I get a unicode character when the XXXX part is a variable?
For example:
var input = '2122';
console.log('\\u' + input); // returns a string: "\u2122"
console.log(new String('\\u' + input)); // returns a string: "\u2122"
The only way I can think of to make it work, is to use eval; yet I hope there's a better solution:
var input = '2122';
var char = '\\u' + input;
console.log(eval("'" + char + "'")); // returns a character: "™"
Use String.fromCharCode() like this: String.fromCharCode(parseInt(input,16)). When you put a Unicode value in a string using \u, it is interpreted as a hexdecimal value, so you need to specify the base (16) when using parseInt.
String.fromCharCode("0x" + input)
or
String.fromCharCode(parseInt(input, 16)) as they are 16bit numbers (UTF-16)
JavaScript uses UCS-2 internally.
Thus, String.fromCharCode(codePoint) won’t work for supplementary Unicode characters. If codePoint is 119558 (0x1D306, for the '𝌆' character), for example.
If you want to create a string based on a non-BMP Unicode code point, you could use Punycode.js’s utility functions to convert between UCS-2 strings and UTF-16 code points:
// `String.fromCharCode` replacement that doesn’t make you enter the surrogate halves separately
punycode.ucs2.encode([0x1d306]); // '𝌆'
punycode.ucs2.encode([119558]); // '𝌆'
punycode.ucs2.encode([97, 98, 99]); // 'abc'
Since ES5 you can use
String.fromCodePoint(number)
to get unicode values bigger than 0xFFFF.
So, in every new browser, you can write it in this way:
var input = '2122';
console.log(String.fromCodePoint(input));
or if it is a hex number:
var input = '2122';
console.log(String.fromCodePoint(parseInt(input, 16)));
More info:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/fromCodePoint
Edit (2021):
fromCodePoint is not just used for bigger numbers, but also to combine Unicode emojis.
For example, to draw a waving hand, you have to write:
String.fromCodePoint(0x1F44B);
But if you want a waving hand with a skin tone, you have to combine it:
String.fromCodePoint(0x1F44B, 0x1F3FC);
In future (or from now), you will even be able to combine 2 emoji to create a new one, for example a heart and a fire, to create a burning heart:
String.fromCodePoint(0x2764, 0xFE0F, 0x200D, 0x1F525);
32-bit number:
<script>
document.write(String.fromCodePoint(0x1F44B));
</script>
<br>
32-bit number + skin:
<script>
document.write(String.fromCodePoint(0x1F44B, 0x1F3FE));
</script>
<br>
32-bit number + another emoji:
<script>
document.write(String.fromCodePoint(0x2764, 0xFE0F, 0x200D, 0x1F525));
</script>
var hex = '2122';
var char = unescape('%u' + hex);
console.log(char);
will returns " ™ "