decodeURIComponent replacing + in token string by space string [duplicate] - javascript

This question already has answers here:
URL encoding the space character: + or %20?
(5 answers)
Closed 1 year ago.
Sometimes the spaces get URL encoded to the + sign, and some other times to %20. What is the difference and why should this happen?

+ means a space only in application/x-www-form-urlencoded content, such as the query part of a URL:
http://www.example.com/path/foo+bar/path?query+name=query+value
In this URL, the parameter name is query name with a space and the value is query value with a space, but the folder name in the path is literally foo+bar, not foo bar.
%20 is a valid way to encode a space in either of these contexts. So if you need to URL-encode a string for inclusion in part of a URL, it is always safe to replace spaces with %20 and pluses with %2B. This is what, e.g., encodeURIComponent() does in JavaScript. Unfortunately it's not what urlencode does in PHP (rawurlencode is safer).
See Also
HTML 4.01 Specification application/x-www-form-urlencoded
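To make the advice above concrete, here is a minimal sketch (not from the original answer). The token value and the `token` parameter name are made up, and `URLSearchParams` stands in for whatever form-style decoder the receiving side happens to use.

```javascript
// Hypothetical token containing both a plus sign and a space.
const token = "abc+def ghi";

// encodeURIComponent turns "+" into %2B and " " into %20.
const encoded = encodeURIComponent(token);
console.log(encoded); // abc%2Bdef%20ghi

// A form-style query parser (URLSearchParams here) recovers the original value.
const safe = new URLSearchParams("token=" + encoded).get("token");
console.log(safe === token); // true

// Without encoding, the same parser reads the raw "+" as a space.
const unsafe = new URLSearchParams("token=" + token).get("token");
console.log(unsafe); // abc def ghi

// decodeURIComponent on its own leaves a literal "+" alone.
console.log(decodeURIComponent("abc+def")); // abc+def
```

So when a token comes back with spaces where the pluses were, the usual suspect is a form-style decoder reading a `+` that was never encoded as `%2B`, rather than `decodeURIComponent` itself.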

So, the answers here are all a bit incomplete. The use of a '%20' to encode a space in URLs is explicitly defined in RFC 3986, which defines how a URI is built. There is no mention in this specification of using a '+' for encoding spaces - if you go solely by this specification, a space must be encoded as '%20'.
The mention of using '+' for encoding spaces comes from the various incarnations of the HTML specification - specifically in the section describing content type 'application/x-www-form-urlencoded'. This is used for posting form data.
Now, the HTML 2.0 specification (RFC 1866) explicitly said, in section 8.2.2, that the query part of a GET request's URL string should be encoded as 'application/x-www-form-urlencoded'. This, in theory, suggests that it's legal to use a '+' in the URL in the query string (after the '?').
But... does it really? Remember, HTML is itself a content specification, and URLs with query strings can be used with content other than HTML. Further, while the later versions of the HTML spec continue to define '+' as legal in 'application/x-www-form-urlencoded' content, they completely omit the part saying that GET request query strings are defined as that type. There is, in fact, no mention whatsoever about the query string encoding in anything after the HTML 2.0 specification.
Which leaves us with the question - is it valid? Certainly there's a lot of legacy code which supports '+' in query strings, and a lot of code which generates it as well. So odds are good you won't break if you use '+'. (And, in fact, I did all the research on this recently because I discovered a major site which failed to accept '%20' in a GET query as a space. They actually failed to decode any percent encoded character. So the service you're using may be relevant as well.)
But from a pure reading of the specifications, without the language from the HTML 2.0 specification carried over into later versions, URLs are covered entirely by RFC 3986, which means spaces ought to be converted to '%20'. And definitely that should be the case if you are requesting anything other than an HTML document.

http://www.example.com/some/path/to/resource?param1=value1
The part before the question mark must use percent-encoding (so %20 for a space). After the question mark you can use either %20 or + for a space. If you need an actual + after the question mark, use %2B.
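The path/query distinction can be checked with the WHATWG `URL` API, which most browsers and Node.js ship. This is only a sketch; the value `"a b+c"` is an invented example.

```javascript
// Parse the example URL from the first answer.
const url = new URL("http://www.example.com/path/foo+bar/path?query+name=query+value");

// In the path, "+" is just a plus sign, even after percent-decoding.
const pathSegments = url.pathname.split("/").map((s) => decodeURIComponent(s));
console.log(pathSegments); // [ '', 'path', 'foo+bar', 'path' ]

// In the query, the form-style parser turns "+" into a space.
const queryValue = url.searchParams.get("query name");
console.log(queryValue); // query value

// Going the other way, searchParams writes a space as "+" and a real plus as %2B.
const built = new URL("http://www.example.com/some/path/to/resource");
built.searchParams.set("param1", "a b+c");
console.log(built.search); // ?param1=a+b%2Bc
```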

For compatibility reasons, it's better to always encode spaces as "%20", not as "+".
RFC 1866 (the HTML 2.0 specification) specified that space characters should be encoded as "+" in "application/x-www-form-urlencoded" key-value pairs (see paragraph 8.2.1, subparagraph 1). This way of encoding form data also appears in later HTML specifications; look for the paragraphs about application/x-www-form-urlencoded.
Here is an example of a URL string where RFC 1866 allows encoding spaces as pluses: "http://example.com/over/there?name=foo+bar". So, according to RFC 1866, spaces can be replaced by pluses only after the "?". Everywhere else, spaces should be encoded as %20. But since it's hard to determine the context, the best practice is to never encode spaces as "+".
I would recommend percent-encoding all characters except the "unreserved" set defined in RFC 3986, section 2.3:
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
The only situation when you may want to encode spaces as "+" (one byte) rather than "%20" (three bytes) is when you know for sure how to interpret the context, and when the size of the query string is of the essence.
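In JavaScript, one way to get "everything except unreserved" is to start from encodeURIComponent and patch the five characters it leaves alone. A minimal sketch; `encodeStrict` is a hypothetical helper name, not a built-in.

```javascript
// Percent-encode everything except the RFC 3986 "unreserved" characters.
// encodeURIComponent already does this except for ! ' ( ) and *, so those
// are encoded in a second pass.
function encodeStrict(value) {
  return encodeURIComponent(value).replace(
    /[!'()*]/g,
    (ch) => "%" + ch.charCodeAt(0).toString(16).toUpperCase()
  );
}

console.log(encodeStrict("foo bar")); // foo%20bar
console.log(encodeStrict("a+b")); // a%2Bb
console.log(encodeStrict("it's (ok)!*")); // it%27s%20%28ok%29%21%2A
console.log(encodeStrict("A-z_0.9~")); // A-z_0.9~
```

The output still decodes with plain decodeURIComponent, so nothing special is needed on the reading side.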

What's the difference? See the other answers.
When should we use + instead of %20? Use + if, for some reason, you want to make the URL query string (?.....) or hash fragment (#....) more readable. Example: You can actually read this:
https://www.google.se/#q=google+doesn%27t+encode+:+and+uses+%2B+instead+of+spaces
(%2B = +)
But the following is a lot harder to read (at least to me):
https://www.google.se/#q=google%20doesn%27t%20oops%20:%20%20this%20text%20%2B%20is%20different%20spaces
I would think + is unlikely to break anything, since Google uses + (see the 1st link above) and they've probably thought about this. I'm going to use + myself just because readable + Google thinks it's OK.

Related

Encoding all the special characters in Javascript

I have to encode the string that I receive here and pass it as a URL parameter, so I don't believe I can pass either a / or a parenthesis (. Consider the following string:
KEY WEST / Florida(FL)
I am trying the following
encodeURIComponent("KEY WEST / Florida(FL)")
"KEY%20WEST%20%2F%20Florida(FL)"
escape("KEY WEST / Florida(FL)")
"KEY%20WEST%20/%20Florida%28FL%29"
Neither of them encodes the string in a way that I can decode later in my code: the first one keeps the () and the second one keeps the /.
How do I do this in one shot and at a later time decode it when needed?
Also it seems like escape() has been deprecated so which way to do the encoding is preferred?
For URL encoding, encodeURI and encodeURIComponent functions should be used.
encodeURI encodes only special characters, while encodeURIComponent also encodes characters that have a meaning in the URL, so it is used to encode query strings, for example.
Parentheses, however, are allowed anywhere in the URL without encoding (as explained here), which is why encodeURIComponent leaves them as-is.
The escape function is not officially deprecated, but it can be treated as if it were and should be avoided.
so which way to do the encoding is preferred?
For entire URLs, encodeURI
For URL parts, e.g. query string or fragment, encodeURIComponent
Also see When are you supposed to use escape instead of encodeURI / encodeURIComponent?
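Applied to the string from the question, that looks roughly like this. The `http://example.com/search` URL and the `city` parameter are invented for the sketch.

```javascript
const place = "KEY WEST / Florida(FL)";

// encodeURI is meant for whole URLs, so it leaves "/" alone.
console.log(encodeURI(place)); // KEY%20WEST%20/%20Florida(FL)

// encodeURIComponent is meant for a single part, so "/" becomes %2F.
const encodedPlace = encodeURIComponent(place);
console.log(encodedPlace); // KEY%20WEST%20%2F%20Florida(FL)

// The unencoded parentheses do no harm: decoding restores the original string.
console.log(decodeURIComponent(encodedPlace) === place); // true

// Hypothetical URL, only to show the value surviving a trip through a query string.
const link = new URL("http://example.com/search?city=" + encodedPlace);
console.log(link.searchParams.get("city") === place); // true
```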

JSON not transferring "+" character to CGI script

I am currently working on a personal project where I have to deal with some chemical formulas.
I have a form with JavaScript where I enter these formulas. The formulas are entered in a LaTeX-like style for super- and subscript.
An example formula can be found below:
Fe^{3+}
When I use JavaScript to read the form and console.log() the formula, it comes out as expected.
However if I send the formula to the back-end (Python with CGI), the + character seems to have disappeared and been replaced with a space.
I thought it had something to do with escaping the character, since parts of the formula look a lot like regex's, but after looking around, I couldn't find anything that would suggest that I had to escape the + character.
And now I have absolutely no idea how to resolve this... I could use a different character and replace it on the back-end but that seems like it is not the optimal solution...
Most important question: How did you invoke the CGI script?
With HTTP GET or HTTP POST?
If you're using HTTP POST and the data is transferred in the request body as JSON, then you don't need to escape the "+" sign. (A body sent as application/x-www-form-urlencoded follows the same rule as a query string, so there the "+" still has to be escaped.)
But if you're using HTTP GET, then the "+" sign in the query string is decoded according to the URL-encoding rules (thus, "+" becomes a space) before the value is passed to the CGI script.
So in the latter scenario, you need to escape the "+" sign as %2B (and other 'special' characters such as "%" and "?").
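On the JavaScript side, the escaping can be done with encodeURIComponent before the value goes into the URL. A small sketch: the `formula` parameter name is an assumption, and `URLSearchParams` only stands in for the form decoding that the Python CGI side performs.

```javascript
const formula = "Fe^{3+}";

// Sent raw in a query string, the "+" is read back as a space by a form-style parser.
const raw = new URLSearchParams("formula=" + formula).get("formula");
console.log(raw); // Fe^{3 }

// Encoded first, the value survives.
const encodedFormula = encodeURIComponent(formula);
console.log(encodedFormula); // Fe%5E%7B3%2B%7D
const received = new URLSearchParams("formula=" + encodedFormula).get("formula");
console.log(received === formula); // true
```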

Unable to download a file which is containing special character in filename in javascript [duplicate]

This question already has answers here:
URL encoding the space character: + or %20?
(5 answers)
Closed 1 year ago.

Parsing RFC 7231 User-Agent strings in Javascript

I need to parse arbitrary RFC 7231 User-Agent strings. Is there a way to do so? I found ua-parser.js but it tries to be "smarter" by attaching semantic value to browser/device/OS/CPU information, and I just need an array of elements where each has a name and version and optional comment.
I think I could do this via a single regular expression, but the comments make it tricky. Maybe something like /([A-Za-z0-9!#$%&'*+.^`|~-])\/([A-Za-z0-9!#$%&'*+.^`|~-])\s+(\([^)]+\))?/, where
[A-Za-z0-9!#$%&'*+.^`|~-] is a token as described in RFC 7230 (used twice, once for the name and once for the version)
\s+(\([^)]+\))? is required whitespace optionally followed by non-parentheses characters surrounded by parentheses
but this doesn't handle multiple sub-User-Agent strings and I am not sure whether I missed something in defining comments.
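Because comments can nest and can contain backslash-escaped characters, a small hand-written scanner may be easier than a single regular expression. The following is only a sketch under my reading of the RFC 7231 grammar (product = token ["/" version], with comments in parentheses). `parseUserAgent` and `readComment` are hypothetical names, and attaching each comment to the product before it is a design choice, not something the RFC prescribes. Note that the token class here includes `_` and uses a `+` quantifier.

```javascript
// Token characters from RFC 7230 (tchar), including "_".
const TOKEN = /^[A-Za-z0-9!#$%&'*+\-.^_`|~]+/;

// Reads a comment starting at ua[start] === "(" and returns its inner text
// plus the index just past the closing ")". Handles nesting and "\x" pairs.
function readComment(ua, start) {
  let depth = 0;
  let text = "";
  for (let j = start; j < ua.length; j++) {
    const c = ua[j];
    if (c === "\\" && j + 1 < ua.length) {
      text += ua[++j]; // quoted-pair: keep only the escaped character
      continue;
    }
    if (c === "(") {
      depth++;
      if (depth === 1) continue; // skip the outermost "("
    } else if (c === ")") {
      depth--;
      if (depth === 0) return { text, end: j + 1 };
    }
    text += c;
  }
  throw new SyntaxError("unterminated comment at index " + start);
}

// Returns an array of { name, version, comments } objects.
// Lenient sketch: whitespace between elements is skipped, not required.
function parseUserAgent(ua) {
  const products = [];
  let i = 0;
  while (i < ua.length) {
    if (ua[i] === " " || ua[i] === "\t") {
      i++;
    } else if (ua[i] === "(") {
      if (products.length === 0) throw new SyntaxError("comment before first product");
      const { text, end } = readComment(ua, i);
      products[products.length - 1].comments.push(text);
      i = end;
    } else {
      const name = TOKEN.exec(ua.slice(i));
      if (!name) throw new SyntaxError("unexpected character at index " + i);
      i += name[0].length;
      let version = null;
      if (ua[i] === "/") {
        const ver = TOKEN.exec(ua.slice(i + 1));
        if (!ver) throw new SyntaxError("missing version at index " + i);
        version = ver[0];
        i += 1 + version.length;
      }
      products.push({ name: name[0], version, comments: [] });
    }
  }
  return products;
}

const parsed = parseUserAgent(
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36"
);
console.log(parsed.map((p) => p.name)); // [ 'Mozilla', 'AppleWebKit', 'Safari' ]
console.log(parsed[0].comments); // [ 'X11; Linux x86_64' ]
```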

javascript query string encoding

Why do encodeURI and encodeURIComponent encode spaces as hex values, when I see other encodings using the plus sign? There's something I'm obviously missing.
Thanks!
IIRC, + is the form-encoded space, while %20 is the standard URI encoding.
They are interchangeable in a query string, so don't worry about which one you use there.
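The catch is on the decoding side: decodeURIComponent does not know about the form convention, so a value that may contain + for spaces needs one extra step. A rough sketch; `decodeFormComponent` is a hypothetical helper, and the `q` parameter is made up.

```javascript
// Decode one form-encoded component by hand.
// "+" must become a space before percent-decoding, otherwise an encoded
// plus (%2B) would be turned into a space as well.
function decodeFormComponent(value) {
  return decodeURIComponent(value.replace(/\+/g, " "));
}

console.log(decodeFormComponent("a+b%20c")); // a b c
console.log(decodeFormComponent("1%2B1")); // 1+1

// The two encoders disagree on the space, but both results decode the same way.
const formStyle = new URLSearchParams({ q: "a b" }).toString();
const uriStyle = "q=" + encodeURIComponent("a b");
console.log(formStyle); // q=a+b
console.log(uriStyle); // q=a%20b
```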
"+" is allowed as a substitute for spaces, however there are lots of other special characters that need escaping as hex values (in the form %nn). Presumably the authors of encodeURI and encodeURIComponent decided to use hex values for everything including space, since it made their code simpler and they didn't think the extra two characters for each space in a uri was really that important.
Look here for a discussion of the differences between escape, encodeURI, and encodeURIComponent, with interactive examples for all three of them:
http://xkr.us/articles/javascript/encode-compare/
To summarize:
The escape() method does not encode the + character, which is interpreted as a space on the server side as well as generated by forms with spaces in their fields. Due to this shortcoming and the fact that this function fails to handle non-ASCII characters correctly, you should avoid use of escape() whenever possible. The best alternative is usually encodeURIComponent().
escape() will not encode: @*/+
Use of the encodeURI() method is a bit more specialized than escape() in that it encodes for URIs as opposed to the querystring, which is part of a URL. Use this method when you need to encode a string to be used for any resource that uses URIs and needs certain characters to remain un-encoded. Note that this method does not encode the ' character, as it is a valid character within URIs.
encodeURI() will not encode: ~!@#$&*()=:/,;?+'
Lastly, the encodeURIComponent() method should be used in most cases when encoding a single component of a URI. This method will encode certain chars that would normally be recognized as special chars for URIs so that many components may be included. Note that this method does not encode the ' character, as it is a valid character within URIs.
encodeURIComponent() will not encode: ~!*()'
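If you would rather check these lists than trust a summary, a short loop over the printable ASCII range does it. This is just a sketch; `unencodedPunctuation` is a made-up helper name.

```javascript
// Lists the printable ASCII punctuation that an encoder returns unchanged.
function unencodedPunctuation(encode) {
  let kept = "";
  for (let code = 0x21; code <= 0x7e; code++) {
    const ch = String.fromCharCode(code);
    if (/[A-Za-z0-9]/.test(ch)) continue; // letters and digits are never encoded
    if (encode(ch) === ch) kept += ch;
  }
  return kept;
}

console.log(unencodedPunctuation(encodeURIComponent)); // !'()*-._~
console.log(unencodedPunctuation(encodeURI)); // !#$&'()*+,-./:;=?@_~
```

The output is in ASCII order and also shows -, . and _, which the lists above leave out.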
