How to encode ".." for use in a URL path? - javascript

I'm trying to construct a URL with something like:
var myUrl = '/path/to/api/' + encodeURIComponent(str);
But if str is .. then your browser automatically lops off a path segment so that the URL becomes /path/to which is not what I want.
I've tried encoding .. as %2E%2E but your browser still re-interprets it before the request is sent. Is there anything I can do to have path actually come through to my server as /path/to/api/..?

I believe this is not supported because the behaviour would violate RFC 3986.
From Section 2.3. Unreserved Characters (emphasis mine):
Characters that are allowed in a URI but do not have a reserved
purpose are called unreserved. These include uppercase and lowercase
letters, decimal digits, hyphen, period, underscore, and tilde.
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
URIs that differ in the replacement of an unreserved character with
its corresponding percent-encoded US-ASCII octet are equivalent: they
identify the same resource. However, URI comparison implementations
do not always perform normalization prior to comparison (see Section
6). For consistency, percent-encoded octets in the ranges of ALPHA
(%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E),
underscore (%5F), or tilde (%7E) should not be created by URI
producers and, when found in a URI, should be decoded to their
corresponding unreserved characters by URI normalizers.
From Section 6.2.2.3. Path Segment Normalization (emphasis mine):
The complete path segments "." and ".." are intended only for use
within relative references (Section 4.1) and are removed as part of
the reference resolution process (Section 5.2). However, some
deployed implementations incorrectly assume that reference resolution
is not necessary when the reference is already a URI and thus fail to
remove dot-segments when they occur in non-relative paths. URI
normalizers should remove dot-segments by applying the
remove_dot_segments algorithm to the path, as described in
Section 5.2.4.):

I've actually done similar by double encoding the text, then un-encoding it on the server back end. However, mine were query parameters, not part of the path.
PS. This is written on my phone, I'll add an example later.

Seeing as there's no solution, there's not much we can do but error:
export function encodeUriComponent(str) {
if(str === '.' || str === '..') {
throw new Error(`Cannot URI-encode "${str}" per RFC 3986 §6.2.2.3`)
}
return encodeURIComponent(str);
}
I feel that this is a better option than arbitrarily modifying the URL path which is exactly what I was trying to avoid by using encodeURIComponent.

Related

Matching ouput for HttpServerUtility.UrlTokenEncode in NodeJS Javascript

I am looking at an example in dotnet which looks like the following: https://dotnetfiddle.net/t0y8yD.
The output for the HttpServerUtility.UrlTokenEncode method is:
Pn55YBwEH2S2BEM5qlNrq-LMNE8BDdHYwbWKFEHiPZo1
When I try to complete the same in NodeJS with encodeURI, encodeURIComponent or any other attempt I get the following:
Pn55YBwEH2S2BEM5qlNrq+LMNE8BDdHYwbWKFEHiPZo=
As you can see from the above the '-' should be a '+' and the last character part is different. The hash is created the same and outputs the same buffer.
var hmac = crypto.createHmac("sha256", buf);
hmac.update("9644873");
var hash = hmac.digest("base64");
How can I get the two to match? One other important note is that this is one use case and I am unsure if there are other chars that do the same.
I am unsure if the dotnet variant is incorrect or the NodeJS version is. However, the comparison will be done on the dotnet side, so I need node to match that.
The difference of the two results is caused by the use of Base64URL encoding in the C# code vs. Base64 encoding in node.js.
Base64URL and Base64 are almost identical, but Base64 encoding uses the characters +, / and =, which have a special meaning in URLs and thus have to be avoided. In Base64URL encoding + is replaced with -, / with _ and = (the padding character on the end) is either replaced with %20 or simply omitted.
In your code you're calculating a HMAC-SHA256 hash, so you get a 256 bit result, which can be encoded in 32 bytes. In Base64/Base64URL every character represents 6 bits, therefore you would need 256/6 = 42,66 => 43 Base64 characters. With 43 characters you would have 2 'lonesome' bits on the end, therefore a padding char (=) is added.
The question now is why HttpServerUtility.UrlTokenEncode adds a 1 as a replacement for the padding char on the end. I didn't find anything in the documentation. But you you should keep in mind that it's insignificant anyway.
To to get the same in node.js, you can use the package base64url, or just use simple replace statements on the base64 encoded hash.
With base64url package:
const base64url = require('base64url');
var hmacB64 = "Pn55YBwEH2S2BEM5qlNrq+LMNE8BDdHYwbWKFEHiPZo="
var hmacB64url = base64url.fromBase64(hmacb64)
console.log(hmacB64url)
The result is:
Pn55YBwEH2S2BEM5qlNrq-LMNE8BDdHYwbWKFEHiPZo
as you can see, this library just omits the padding char.
With replace, also replacing the padding = with 1:
var hmacB64 = "Pn55YBwEH2S2BEM5qlNrq+LMNE8BDdHYwbWKFEHiPZo="
console.log(hmacb64.replace(/\//g,'_').replace(/\+/g,'-').replace(/\=+$/m,'1'))
The result is:
Pn55YBwEH2S2BEM5qlNrq-LMNE8BDdHYwbWKFEHiPZo1
I tried the C# code with different data and always got '1' on the end, so to replace = with 1 seems to be ok, though it doesn't seem to be conform to the RFC.
The other alternative, if this is an option for you, is to change the C# code. Use normal base64 encoding plus string replace to get base64url output instead of using HttpServerUtility.UrlTokenEncode
A possible solution for that is described here
I'm new here so I can't comment (need 50 reputation), but I would like to add to #jqs answer that if the string ends with two "=", the replace needs to be done with "2". So my replace looks like:
hmacb64.replace(///g,'_').replace(/+/g,'-').replace(/\=\=$/m,'2').replace(/\=$/m,'1')

file and directory regex

I am trying to create a regEx for file and directory path validation.
I have implemented this, but its failing 1 of the conditions, that it should not allow ie multiple slashes together.
Also, no other special character should not be allowed
var x = /^(\\|\/){1}([a-zA-Z0-9\s\-_\#\-\^!#$%&]*?(\\|\/)?)+(\.[a-z\/\/]+)?$/i
test 1 -> / (should pass)
test 2 -> /asdf (should pass)
test 3 -> /asdf/scd.csv (should pass)
test 4 -> //asdf (should fail, currently passing)
test 5 -> /asd/ads/c.csv/ (should pass)
test 6 -> asd/asfd/a (should fail)
Can suggestion how to solve this?
The path //asdf is valid on LINUX, UNIX, iOS, and Android, so your code already works. However, if it is important for some reason to invalidate that particular set of valid paths, simply substitute a plus sign in place of the an asterisk after the [a-z...] character group. That will cause invalidation of multiple path separators with no intervening characters.
It is probably useful to comment on larger issues with the regex approach and details.
1) You can use [\/] instead of (\|/), however both will allow false positives on every combination of operating system and file system. (Those that require forward slash should exclude backslashes as a separator and vice versa.)
2) The character range [a-zA-Z0-9\s-_\#-\^!#$%&] in the question is not the permissible character range for directory path elements for any known combination of operating system and file system. For instance, a period is valid in directory names for most.
3) Permissible character ranges are not portable. (The most reliable way to test path validation is to touch the file name on the actual file system, meaning actually instantiate an empty file and capture any indications of instantiation failure.)
4) You don't want or need a question mark after your asterisk or after your second (\|/) group. They don't create a bug, but they waste either compilation or run time, and they obfuscate your regex purpose.
5) You also need to repeat the character range just before the extension or rearrange like the example below.
6) You don't need to add the A-Z range to the a-z range if you use \i as a flag at the end of the regex.
7) It appears from the list of desired results that relative paths are to be filtered out, but there is no explicit mention of that as a rule for the solution.
With hesitation, this code is provided to demonstrate a few of the above improvements.
// This code is not production worthy
// for reasons (1) through (3) given
// above and is provided only for the
// purpose of clarifying points made.
var re = /^([\\/][a-z0-9\s\-_\#\-\^!#$%&]*)+(\.[a-z][a-z0-9]+)?$/i
console.log(
[
'/',
'/asdf',
'/asdf/scd.csv',
'//asdf',
'/asd/ads/c.csv/',
'asd/asfd/a'
].map(RegExp.prototype.test, re))
Try using /^(\/|([\\/][\w\s#^!#$%&-]+)+(\.[a-z]+[\\/]?)?)$/i instead, which forces at least one character to match between each slash:
var regex = /^(\/|([\\/][\w\s#^!#$%&-]+)+(\.[a-z]+[\\/]?)?)$/i
console.log([
'/',
'/asdf',
'/asdf/scd.csv',
'//asdf',
'/asd/ads/c.csv/',
'asd/asfd/a'
].map(RegExp.prototype.test, regex))
((\/[\w\s\.#^!#$%&-]+)+\/?)|\/[\w\.\s#^!#$%&-]*
This was tested to match your sample input,
BUT on np++ (i.e. perl-regex flavor), because I have no experience with javascript.
Therefor here the same in flavor-indpendent prose.
"(slash and character many times, followed by optional slash)
or
slash and zero or more characters".
Note1: I added explicit "." to allowed characters.
Note2: I assume your "\/" means, "explicit slash, not backslash".

Replacement for javascript escape?

I know that the escape function has been deprecated and that you should use encodeURI or encodeURIComponent instead. However, the encodeUri and encodeUriComponent doesn't do the same thing as escape.
I want to create a mailto link in javascript with Swedish åäö. Here are a comparison between escape, encodeURIComponent and encodeURI:
var subject="åäö";
var body="bodyåäö";
console.log("mailto:?subject="+escape(subject)+"&body=" + escape(body));
console.log("mailto:?subject="+encodeURIComponent(subject)+"&body=" + encodeURIComponent(body));
console.log("mailto:?subject="+encodeURI(subject)+"&body=" + encodeURI(body));
Output:
mailto:?subject=My%20subject%20with%20%E5%E4%F6&body=My%20body%20with%20more%20characters%20and%20swedish%20%E5%E4%F6
mailto:?subject=My%20subject%20with%20%C3%A5%C3%A4%C3%B6&body=My%20body%20with%20more%20characters%20and%20swedish%20%C3%A5%C3%A4%C3%B6
mailto:?subject=My%20subject%20with%20%C3%A5%C3%A4%C3%B6&body=My%20body%20with%20more%20characters%20and%20swedish%20%C3%A5%C3%A4%C3%B6
Only the mailto link created with "escape" opens a properly formatted mail in Outlook using IE or Chrome. When using encodeURI or encodeURIComponent the subject says:
My subject with åäö
and the body is also looking messed up.
Is there some other function besides escape that I can use to get the working mailto link?
escape() is defined in section B.2.1.2 escape and the introduction text of Annex B says:
... All of the language features and behaviours specified in this annex have one or more undesirable characteristics and in the absence of legacy usage would be removed from this specification. ...
For characters, whose code unit value is 0xFF or less, escape() produces a two-digit escape sequence: %xx. This basically means, that escape() converts a string containing only characters from U+0000 to U+00FF to an percent-encoded string using the latin-1 encoding.
For characters with a greater code unit, the four-digit format %uxxxx is used. This is not allowed within the hfields section (where subject and body are stored) of an mailto:-URI (as defined in RFC6068):
mailtoURI = "mailto:" [ to ] [ hfields ]
to = addr-spec *("," addr-spec )
hfields = "?" hfield *( "&" hfield )
hfield = hfname "=" hfvalue
hfname = *qchar
hfvalue = *qchar
...
qchar = unreserved / pct-encoded / some-delims
some-delims = "!" / "$" / "'" / "(" / ")" / "*"
/ "+" / "," / ";" / ":" / "#"
unreserved and pct-encoded are defined in STD66:
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
A percent sign is only allowed if it is directly followed by two hexdigits, percent followed by u is not allowed.
Using a self-implemented version, that behaves exactly like escape doesn't solve anything - instead just continue to use escape, it won't be removed anytime soon.
To summerise: Your previous usage of escape() generated latin1-percent-encoded mailto-URIs if all characters are in the range U+0000 to U+00FF, otherwise an invalid URI was generated (which might still be correctly interpreted by some applications, if they had javascript-encode/decode compatibility in mind).
It is more correct (no risk of creating invalid URIs) and future-proof, to generate UTF8-percent-encoded mailto-URIs using encodeURIComponent() (don't use encodeURI(), it does not escape ?, /, ...). RFC6068 requires usage of UTF-8 in many places (but allows other encodings for "MIME encoded words and for bodies in composed email messages").
Example:
text_latin1="Swedish åäö"
text_other="Emoji 😎"
document.getElementById('escape-latin-1-link').href="mailto:?subject="+escape(text_latin1);
document.getElementById('escape-other-chars-link').href="mailto:?subject="+escape(text_other);
document.getElementById('utf8-link').href="mailto:?subject="+encodeURIComponent(text_latin1);
document.getElementById('utf8-other-chars-link').href="mailto:?subject="+encodeURIComponent(text_other);
function mime_word(text){
q_encoded = encodeURIComponent(text) //to utf8 percent encoded
.replace(/[_!'()*]/g, function(c){return '%'+c.charCodeAt(0).toString(16).toUpperCase();})// encode some more chars as utf8
.replace(/%20/g,'_') // mime Q-encoding is using underscore as space
.replace(/%/g,'='); //mime Q-encoding uses equal instead of percent
return encodeURIComponent('=?utf-8?Q?'+q_encoded+'?=');//add mime word stuff and escape for uri
}
//don't use mime_word for body!!!
document.getElementById('mime-word-link').href="mailto:?subject="+mime_word(text_latin1);
document.getElementById('mime-word-other-chars-link').href="mailto:?subject="+mime_word(text_other);
<a id="escape-latin-1-link">escape()-latin1</a><br/>
<a id="escape-other-chars-link">escape()-emoji</a><br/>
<a id="utf8-link">utf8</a><br/>
<a id="utf8-other-chars-link">utf8-emoji</a><br/>
<a id="mime-word-link">mime-word</a><br/>
<a id="mime-word-other-chars-link">mime-word-emoji</a><br/>
For me, the UTF-8 links and the Mime-Word links work in Thunderbird. Only the plain UTF-8 links work in Windows 10 builtin Mailapp and my up-to-date version of Outlook.
To quote the MDN Documentation directly...
This function was used mostly for URL queries (the part of a URL following ?)—not for escaping ordinary String literals, which use the format "\xHH". (HH are two hexadecimal digits, and the form \xHH\xHH is used for higher-plane Unicode characters.)
The problem you are experiencing is because escape() does not support the UTF-8 while encodeURI() and encodeURIComponent() do.
But to be absolutely clear: never use encodeURI() or encodeURIComponent(). Let's just try it out:
console.log(encodeURIComponent('##*'));
Input: ##*. Output: %40%23*. Ordinarily, once user input is cleansed, I feel like I can trust that cleansed input. But if I ran rm * on my Linux system to delete a file specified by a user, that would literally delete all files on my system, even if I did the encoding 100% completely server-side. This is a massive bug in encodeURI() and encodeURIComponent(), which MDN Web docs clearly point with a solution.
Use fixedEncodeURI(), when trying to encode a complete URL (i.e., all of example.com?arg=val), as defined and further explained at the MDN encodeURI() Documentation...
function fixedEncodeURI(str) {
return encodeURI(str).replace(/%5B/g, '[').replace(/%5D/g, ']');
}
Or, you may need to use use fixedEncodeURIComponent(), when trying to encode part of a URL (i.e., the arg or the val in example.com?arg=val), as defined and further explained at the MDN encodeURIComponent() Documentation...
function fixedEncodeURIComponent(str) {
return encodeURIComponent(str).replace(/[!'()*]/g, function(c) {
return '%' + c.charCodeAt(0).toString(16);
});
}
If you are having trouble distinguishing what fixedEncodeURI(), fixedEncodeURIComponent(), and escape() do, I always like to simplify it with:
fixedEncodeURI() : will not encode +#?=:#;,$& to their http-encoded equivalents (as & and + are common URL operators)
fixedEncodeURIComponent() will encode +#?=:#;,$& to their http-encoded equivalents.
The escape() function was deprecated in JavaScript version 1.5. Use encodeURI() or encodeURIComponent() instead.
example
string: "May/June 2016, Volume 72, Issue 3"
escape: "May/June%202016%2C%20Volume%2072%2C%20Issue%203"
encodeURI: "May/June%202016,%20Volume%2072,%20Issue%203"
encodeURIComponent:"May%2FJune%202016%2C%20Volume%2072%2C%20Issue%203"
source https://www.w3schools.com/jsref/jsref_escape.asp

What is the best way to serialize a JavaScript object into something that can be used as a fragment identifier (url#hash)?

My page state can be described by a JavaScript object that can be serialized into JSON. But I don't think a JSON string is suitable for use in a fragment ID due to, for example, the spaces and double-quotes.
Would encoding the JSON string into a base64 string be sensible, or is there a better way? My goal is to allow the user to bookmark the page and then upon returning to that bookmark, have a piece of JavaScript read window.location.hash and change state accordingly.
I think you are on a good way. Let's write down the requirements:
The encoded string must be usable as hash, i.e. only letters and numbers.
The original value must be possible to restore, i.e. hashing (md5, sha1) is not an option.
It shouldn't be too long, to remain usable.
There should be an implementation in JavaScript, so it can be generated in the browser.
Base64 would be a great solution for that. Only problem: base64 also contains characters like - and +, so you win nothing compared to simply attaching a JSON string (which also would have to be URL encoded).
BUT: Luckily, theres a variant of base64 called base64url which is exactly what you need. It is specifically designed for the type of problem you're describing.
However, I was not able to find a JS implementation; maybe you have to write one youself – or do a bit more research than my half-assed 15 seconds scanning the first 5 Google results.
EDIT: On a second thought, I think you don't need to write an own implementation. Use a normal implementation, and simply replace the “forbidden” characters with something you find appropriate for your URLs.
Base64 is an excellent way to store binary data in text. It uses just 33% more characters/bytes than the original data and mostly uses 0-9, a-z, and A-Z. It also has three other characters that would need encoded to be stored in the URL, which are /, =, and +. If you simply used URL encoding, it would take up 300% (3x) the size.
If you're only storing the characters in the fragment of the URL, base64-encoded text it doesn't need to be re-encoded and will not change. But if you want to send the data as part of the actual URL to visit, then it matters.
As referenced by lxg, there there is a base64url variant for that. This is a modified version of base64 to replace unsafe characters to store in the URL. Here is how to encode it:
function tobase64url(s) {
return btoa(x).replace(/\+/g,'-').replace(/\//g,'_').replace(/=/g,'');
}
console.log(tobase64url('\x00\xff\xff\xf1\xf1\xf1\xff\xff\xfe'));
// Returns "AP__8fHx___-" instead of "AP//8fHx///+"
And to decode a base64 string from the URL:
function frombase64url(s) {
return atob(x.replace(/-/g,'+').replace(/_/g, '/'));
}
Use encodeURIComponent and decodeURIComponent to serialize data for the fragment (aka hash) part of the URL.
This is safe because the character set output by encodeURIComponent is a subset of the character set allowed in the fragment. Specifically, encodeURIComponent escapes all characters except:
A - Z
a - z
0 - 9
- . _ ~ ! ' ( ) *
So the output includes the above characters, plus escaped characters, which are % followed by hexadecimal digits.
The set of allowed characters in the fragment is:
A - Z
a - z
0 - 9
? / : # - . _ ~ ! $ & ' ( ) * + , ; =
percent-encoded characters (a % followed by hexadecimal digits)
This set of allowed characters includes all the characters output by encodeURIComponent, plus a few other characters.

Is this a valid email address?

"Françoise Lefèvre"#example.com
I'm reading RFC 5321 to try to actually understand what constitutes a valid email address -- and I'm probably making this a lot more difficult than it needs to be -- but this has been bugging me.
i.e., within a quoted string, any
ASCII graphic or space is permitted
without blackslash-quoting except
double-quote and the backslash itself.
Does this mean that ASCII extended character sets are valid within quotes? Or does that imply standard ASCII table only?
EDIT - With the answers in mind, here's a simple jQuery validator that could work in supplement to the the plugin's built-in email validation to check the characters.
jQuery.validator.addMethod("ascii_email", function( value, element ) {
// In compliance with RFC 5321, this allows all standard printing ASCII characters in quoted text.
// Unquoted text must be ASCII-US alphanumeric or one of the following: ! # $ % & ' * + - / = ? ^ _ ` { | } ~
// # and . get a free pass, as this is meant to be used together with the email validator
var result = this.optional(element) ||
(
/^[\u002a\u002b\u003d\u003f\u0040\u0020-\u0027\u002d-u002f\u0030-\u0039\u0041-\u005a\u005e-\u007e]+$/.test(value.replace(/(["])(?:\\\1|.)*?\1/, "")) &&
/^[\u0020-\u007e]+$/.test(value.match(/(["])(?:\\\1|.)*?\1/, ""))
);
return result;
}, "Invalid characters");
The plugin's built-in validation appears to be pretty good, except for catching invalid characters. Out of the test cases listed here it only disallows comments, folding whitespace and addresses lacking a TDL (ie: #localhost, #255.255.255.255) -- all of which I can easily live without.
According to this MSDN page the extended ASCII characters aren't valid, currently, but there is a proposed specification that would change this.
http://msdn.microsoft.com/en-us/library/system.net.mail.mailaddress(VS.90).aspx
The important part is here:
Thomas Lee is correct in that a quoted
local part is valid in an email
address and certain mail addresses may
be invalid if not in a quoted string.
However, the characters that others of
you have mentioned such as the umlaut
and the agave are not in the ASCII
character set, they are extended
ASCII. In RFC 2822 (and subsequent
RFC's 5322 and 3696) the dtext
specification (allowed in quoted local
parts) only allows most ASCII values
(RFC 2822, section 3.4.1) which
includes values in ranges from 33-90
and 94-126. RFC 5335 has been proposed
that would allow non-ascii characters
in the addr-spec, however it is still
labeled as experimental and as such is
not supported in MailAddress.
In this RFC, ASCII means US-ASCII , i.e., no characters with a value greater than 127 are allowed. As a proof, here are some quotes from RFC 5321:
The mail data may contain any of the 128 ASCII character codes, [...]
[...]
Systems MUST NOT define mailboxes in such a way as to require the use in SMTP of non-ASCII characters (octets with the high order bit set to one) or ASCII "control characters" (decimal value 0-31 and 127). These characters MUST NOT be used in MAIL or RCPT commands or other commands that require mailbox names.
These quotes quite clearly imply that characters with a value greater than 127 are considered non-ASCII. Since such characters are explicitly forbidden in MAIL TO or RCPT commands, it is impossible to use them for e-mail addresses.
Thus, "Francoise Lefevre"#example.com is a perfectly valid address (according to the RFC), whereas "Françoise Lefèvre"#example.com is not.
Technically yes, but read on:
While the above definition for
Local-part is relatively permissive,
for maximum interoperability, a host
that expects to receive mail SHOULD
avoid defining mailboxes where the
Local-part requires (or uses) the
Quoted-string form or where the
Local-part is case- sensitive.
...
Systems MUST NOT define mailboxes in
such a way as to require the use in
SMTP of non-ASCII characters.
The HTML5 spec has an interesting take on the issue of valid email addresses:
A valid e-mail address is a string that matches the ABNF production 1*( atext / "." ) "#" ldh-str 1*( "." ldh-str ) where atext is defined in RFC 5322 section 3.2.3, and ldh-str is defined in RFC 1034 section 3.5.
The nice thing about this, of course, is that you can then take a look at the open source browser's source code for validating it (look for the IsValidEmailAddress function). Of course it's in C, but not too hard to translate to JS.

Categories

Resources