Character/URI encoding in JavaScript getting out of sync? - javascript

I have a question about encoding special/extended UTF-8 characters in URLs in JavaScript. The same question applies to many characters like the Registered R-circle, but my example uses an umlaut:
ü = %C3%BC in UTF-8 (four rows from bottom of http://www.utf8-chartable.de/)
If the URL contains an umlaut already represented as UTF-8 percent-escapes (ü = %C3%BC) and I run it through encodeURIComponent, the % signs are encoded, so the string now looks like "%25C3%25BC", and it gets correctly processed by my system. This is good.
url = "http://foo.com/bar.html?%C3%BC"
url = encodeURIComponent(url);
// url is now represented as "http%3A%2F%2Ffoo.com%2Fbar.html%3F%25C3%25BC"
However, the bad: if the pre-encoded string contains an unencoded character, the actual umlaut, then after encoding it looks like "%C3%BC" and fails because, I believe, the % signs should be encoded, too:
url = "http://foo.com/bar.html?ü"
url = encodeURIComponent(url);
// url is now represented as "http%3A%2F%2Ffoo.com%2Fbar.html%3F%C3%BC"
I think it fails because it is less thoroughly encoded than the rest of the url.
So, beyond general advice or answers to questions I don't know to ask, what I think I want to know is how to get the raw umlaut (and all other special characters) to fully encode. Is that what is incorrect?
Thanks for your help!
Nate

You cannot encode a URL all at once. If you have already concatenated the host, path, parameters, etc., together then it's impossible to correctly determine which characters actually need to be encoded and which characters are separators that need to be left alone.
The only reliable way to build a URL is by concatenating already-encoded values:
"http://foo.com/bar.html?" + encodeURIComponent("%C3%BC")
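A minimal sketch of that approach, assuming a hypothetical helper named buildUrl and a hypothetical parameter name q (neither appears in the question). Each key and value is encoded on its own, and the separators ?, = and & are added afterwards, so they are never escaped:

```javascript
// Hypothetical helper: encode every key and value separately, then join them
// with the literal separators that must stay unescaped.
function buildUrl(base, params) {
  const query = Object.entries(params)
    .map(([key, value]) => encodeURIComponent(key) + '=' + encodeURIComponent(value))
    .join('&');
  return query ? base + '?' + query : base;
}

// A raw umlaut is escaped once.
console.log(buildUrl('http://foo.com/bar.html', { q: 'ü' }));
// → http://foo.com/bar.html?q=%C3%BC

// A value that already contains percent signs has those percent signs escaped.
console.log(buildUrl('http://foo.com/bar.html', { q: '%C3%BC' }));
// → http://foo.com/bar.html?q=%25C3%25BC
```

Both outputs are correctly encoded; they simply represent different values ("ü" versus the six-character string "%C3%BC"). Which one the receiving system expects depends on how many times it decodes.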

Related

Matching output for HttpServerUtility.UrlTokenEncode in NodeJS Javascript

I am looking at an example in dotnet which looks like the following: https://dotnetfiddle.net/t0y8yD.
The output for the HttpServerUtility.UrlTokenEncode method is:
Pn55YBwEH2S2BEM5qlNrq-LMNE8BDdHYwbWKFEHiPZo1
When I try to complete the same in NodeJS with encodeURI, encodeURIComponent or any other attempt I get the following:
Pn55YBwEH2S2BEM5qlNrq+LMNE8BDdHYwbWKFEHiPZo=
As you can see from the above, the '+' should be a '-' and the last character is different. The hash is created the same way and outputs the same buffer.
var hmac = crypto.createHmac("sha256", buf);
hmac.update("9644873");
var hash = hmac.digest("base64");
How can I get the two to match? One other important note is that this is one use case and I am unsure if there are other chars that do the same.
I am unsure if the dotnet variant is incorrect or the NodeJS version is. However, the comparison will be done on the dotnet side, so I need node to match that.
The difference of the two results is caused by the use of Base64URL encoding in the C# code vs. Base64 encoding in node.js.
Base64URL and Base64 are almost identical, but Base64 encoding uses the characters +, / and =, which have a special meaning in URLs and thus have to be avoided. In Base64URL encoding + is replaced with -, / with _ and = (the padding character on the end) is either percent-encoded as %3D or simply omitted.
In your code you're calculating an HMAC-SHA256 hash, so you get a 256-bit result, which is 32 bytes. In Base64/Base64URL every character represents 6 bits, so you need 256/6 = 42.67 => 43 Base64 characters. 43 characters hold 258 bits, so the last character carries 2 unused bits, and a padding char (=) is added to bring the length up to a multiple of 4 (44 characters).
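The length arithmetic can be checked quickly in Node.js. This is only a sketch: the key 'example-key' is a placeholder, not the key from the question, so the digest itself will differ while the lengths stay the same.

```javascript
const crypto = require('crypto');

// 'example-key' is a hypothetical key used only for illustration.
const digest = crypto.createHmac('sha256', 'example-key').update('9644873').digest();
const b64 = digest.toString('base64');

console.log(digest.length);          // 32 bytes = 256 bits
console.log(b64.length);             // 44 characters (43 data characters + 1 padding)
console.log(/[^=]=$/.test(b64));     // true: exactly one trailing '='
```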
The question now is why HttpServerUtility.UrlTokenEncode adds a 1 as a replacement for the padding char on the end. I didn't find anything in the documentation. But you should keep in mind that it's insignificant anyway.
To get the same in node.js, you can use the package base64url, or just use simple replace statements on the base64 encoded hash.
With base64url package:
const base64url = require('base64url');
var hmacB64 = "Pn55YBwEH2S2BEM5qlNrq+LMNE8BDdHYwbWKFEHiPZo="
var hmacB64url = base64url.fromBase64(hmacB64)
console.log(hmacB64url)
The result is:
Pn55YBwEH2S2BEM5qlNrq-LMNE8BDdHYwbWKFEHiPZo
As you can see, this library just omits the padding char.
With replace, also replacing the padding = with 1:
var hmacB64 = "Pn55YBwEH2S2BEM5qlNrq+LMNE8BDdHYwbWKFEHiPZo="
console.log(hmacB64.replace(/\//g,'_').replace(/\+/g,'-').replace(/\=+$/m,'1'))
The result is:
Pn55YBwEH2S2BEM5qlNrq-LMNE8BDdHYwbWKFEHiPZo1
I tried the C# code with different data and always got '1' on the end, so replacing = with 1 seems to be OK, though it doesn't seem to conform to the RFC.
The other alternative, if this is an option for you, is to change the C# code. Use normal base64 encoding plus string replace to get base64url output instead of using HttpServerUtility.UrlTokenEncode.
A possible solution for that is described here
I'm new here so I can't comment (need 50 reputation), but I would like to add to #jqs answer that if the string ends with two "=", the replace needs to be done with "2". So my replace looks like:
hmacb64.replace(/\//g,'_').replace(/\+/g,'-').replace(/\=\=$/m,'2').replace(/\=$/m,'1')
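The two replace variants above can be folded into one function that appends the number of padding characters instead of a fixed digit. This is a sketch with a hypothetical name, urlTokenEncode, and it assumes that unpadded input gets a trailing 0 (an extrapolation of the 1 and 2 cases described above, not something shown in the answers); empty input is not handled.

```javascript
// Hypothetical helper mimicking the observed UrlTokenEncode output:
// '+' -> '-', '/' -> '_', and the trailing '=' characters replaced by their count.
function urlTokenEncode(base64) {
  const padding = (base64.match(/=+$/) || [''])[0];
  const body = base64.slice(0, base64.length - padding.length);
  // Assumption: a string without padding gets a trailing '0'.
  return body.replace(/\+/g, '-').replace(/\//g, '_') + padding.length;
}

console.log(urlTokenEncode('Pn55YBwEH2S2BEM5qlNrq+LMNE8BDdHYwbWKFEHiPZo='));
// → Pn55YBwEH2S2BEM5qlNrq-LMNE8BDdHYwbWKFEHiPZo1
```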

Encode and decode a string in JavaScript

Let's say I have a string, called str, which is equal to "Hello, world!"
Is there a way to choose some unicode characters like "azertyuiopqsdfghjklmwxcvbn1234567890-" and return an encoded string that contains only the chosen characters?
It should return something like "hvfebi iehfhe" (well, something encoded and not human readable, but which is decodable)?
Thanks
There's no such thing as a built-in function that you can configure with an arbitrary set of characters to encode and decode, unless you find a library that does that or you implement it yourself.
But you can use base64 encoding, which uses the "A-Z", "a-z", "0-9", "+", "/" and "=" characters to encode the string.
The native browser functions are on window: btoa() to encode and atob() to decode.
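A short round trip with the string from the question, as a sketch. Note that btoa() only accepts characters with code points up to 0xFF, so text outside Latin-1 would first have to be converted to bytes.

```javascript
const str = 'Hello, world!';

const encoded = btoa(str);     // encode to base64
const decoded = atob(encoded); // decode back to the original string

console.log(encoded); // "SGVsbG8sIHdvcmxkIQ=="
console.log(decoded); // "Hello, world!"
```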
Edit after your comments:
Any function you make up to solve this can be decoded anyway by analysing the code, so there is no point in hiding it. If you don't want it to be plain base64, you can make a function that encodes it several times.

Text encoding that produces legible encodings suitable as Javascript identifiers?

I'm working on a tool that reads arbitrary data files and creates a table out of its data which I then store in a database. I'd like to preserve the column headers. The column headers are already ASCII text (or maybe latin1), but they have characters that aren't valid variable names (e.g., spaces, %), so I need to encode them somehow. I'm looking for an encoding for the column titles that has these properties:
Legible: it would be nice if the encoded text looked as similar as possible to the unencoded text (i.e., for debugging).
Legal identifier: I'd like the encoded text to be a valid JavaScript identifier (ECMA-262 Section 7.6).
Invertible: I'd like to be able to get the exact original text back from the encoded text.
I can think of approaches that work for 2 of the 3 cases, but I don't know how to get all 3. E.g., url encoding doesn't produce legal identifier names, I think I could transform base64 to be legal, but it isn't legible, what I've got currently just does some substitutions so it's not invertible.
Efficiency isn't a concern, so if necessary, I could store the encoded and unencoded texts together. The best option I can think of is to use url encoding and then swap percents for $. I thought there would be better options than this though, but I can't find anything. Is there anything better?
This pair of methods relying on Guava's PercentEscaper seems to meet my requirements. Guava doesn't provide an unescaper, but given my simple needs here, I can just use a simple URLDecoder.
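// requires: com.google.common.net.PercentEscaper, java.net.URLDecoder,
// java.io.UnsupportedEncodingException
// "" = no safe characters beyond letters and digits; false = escape a space as %20, not '+'
private static PercentEscaper escaper = new PercentEscaper("", false);

static String getIdentifier(String str) {
    //minimal safe characters, but leaves letters alone, so it's somewhat legible
    String escaped = escaper.escape(str);
    //javascript identifiers can't start with a digit, and the escaper doesn't know the first
    //character has different rules. so prepend a "%3" to encode the digit
    //(e.g. "5x" becomes "%35x", which still decodes to "5x")
    if (!escaped.isEmpty() && Character.isDigit(escaped.charAt(0))) {
        escaped = "%3" + escaped;
    }
    //a percent isn't valid in a javascript identifier, so we'll use _ as our special character
    escaped = escaped.replace('%', '_');
    return escaped;
}

static String invertIdentifier(String str) throws UnsupportedEncodingException {
    //every literal _ in the input was escaped, so each _ here stands for a %
    String unescaped = str.replace('_', '%');
    unescaped = URLDecoder.decode(unescaped, "UTF-8");
    return unescaped;
}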
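The same idea can be sketched in JavaScript without Guava. The function names toIdentifier and fromIdentifier are hypothetical; the sketch escapes every UTF-8 byte that is not an ASCII letter or digit as _XX, which is not necessarily identical to what PercentEscaper produces. An empty input yields an empty string, which is not a valid identifier.

```javascript
// Hypothetical helper: keep ASCII letters and digits, escape every other byte as _XX.
function toIdentifier(str) {
  let out = '';
  for (const byte of new TextEncoder().encode(str)) {
    const ch = String.fromCharCode(byte);
    out += /[A-Za-z0-9]/.test(ch)
      ? ch
      : '_' + byte.toString(16).toUpperCase().padStart(2, '0');
  }
  // An identifier can't start with a digit: "_3" + digit reads back as %3N, i.e. the digit.
  if (/^[0-9]/.test(out)) {
    out = '_3' + out;
  }
  return out;
}

// Hypothetical inverse: every '_' marks an escape, so turn it back into '%' and decode.
function fromIdentifier(id) {
  return decodeURIComponent(id.replace(/_/g, '%'));
}

console.log(toIdentifier('5% growth'));         // "_35_25_20growth"
console.log(fromIdentifier('_35_25_20growth')); // "5% growth"
```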

What is the best way to serialize a JavaScript object into something that can be used as a fragment identifier (url#hash)?

My page state can be described by a JavaScript object that can be serialized into JSON. But I don't think a JSON string is suitable for use in a fragment ID due to, for example, the spaces and double-quotes.
Would encoding the JSON string into a base64 string be sensible, or is there a better way? My goal is to allow the user to bookmark the page and then upon returning to that bookmark, have a piece of JavaScript read window.location.hash and change state accordingly.
I think you are on the right track. Let's write down the requirements:
The encoded string must be usable as hash, i.e. only letters and numbers.
The original value must be possible to restore, i.e. hashing (md5, sha1) is not an option.
It shouldn't be too long, to remain usable.
There should be an implementation in JavaScript, so it can be generated in the browser.
Base64 would be a great solution for that. Only problem: base64 also contains characters like +, / and =, so you win nothing compared to simply attaching a JSON string (which would also have to be URL encoded).
BUT: Luckily, there's a variant of base64 called base64url which is exactly what you need. It is specifically designed for the type of problem you're describing.
However, I was not able to find a JS implementation; maybe you have to write one yourself – or do a bit more research than my half-assed 15 seconds scanning the first 5 Google results.
EDIT: On a second thought, I think you don't need to write an own implementation. Use a normal implementation, and simply replace the “forbidden” characters with something you find appropriate for your URLs.
Base64 is an excellent way to store binary data in text. It uses just 33% more characters/bytes than the original data and mostly uses 0-9, a-z, and A-Z. It also has three other characters that would need to be encoded to be stored in the URL, which are /, =, and +. If you simply used URL encoding on the raw data, it could take up to 300% (3x) of the original size.
If you're only storing the characters in the fragment of the URL, base64-encoded text doesn't need to be re-encoded and will not change. But if you want to send the data as part of the actual URL to visit, then it matters.
As referenced by lxg, there is a base64url variant for that. This is a modified version of base64 that replaces the characters that are unsafe to store in a URL. Here is how to encode it:
function tobase64url(s) {
  return btoa(s).replace(/\+/g,'-').replace(/\//g,'_').replace(/=/g,'');
}
console.log(tobase64url('\x00\xff\xff\xf1\xf1\xf1\xff\xff\xfe'));
// Returns "AP__8fHx___-" instead of "AP//8fHx///+"
And to decode a base64 string from the URL:
function frombase64url(s) {
  return atob(s.replace(/-/g,'+').replace(/_/g, '/'));
}
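For the original use case (a JSON-serializable state object), the same replacements can be combined with JSON.stringify. This is a sketch with hypothetical function names; it routes the text through UTF-8 bytes first because btoa() rejects characters above 0xFF, and it relies on atob() accepting input without padding.

```javascript
// Hypothetical helper: state object -> JSON -> UTF-8 bytes -> base64url text.
function stateToBase64Url(state) {
  const bytes = new TextEncoder().encode(JSON.stringify(state));
  let binary = '';
  for (const b of bytes) {
    binary += String.fromCharCode(b);
  }
  return btoa(binary).replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, '');
}

// Hypothetical inverse: base64url text -> UTF-8 bytes -> JSON -> state object.
function base64UrlToState(text) {
  const binary = atob(text.replace(/-/g, '+').replace(/_/g, '/'));
  const bytes = Uint8Array.from(binary, (c) => c.charCodeAt(0));
  return JSON.parse(new TextDecoder().decode(bytes));
}

const state = { tab: 'übersicht', page: 3 };
const hash = stateToBase64Url(state);
console.log(hash);                   // only A-Z, a-z, 0-9, '-' and '_'
console.log(base64UrlToState(hash)); // { tab: 'übersicht', page: 3 }
```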
Use encodeURIComponent and decodeURIComponent to serialize data for the fragment (aka hash) part of the URL.
This is safe because the character set output by encodeURIComponent is a subset of the character set allowed in the fragment. Specifically, encodeURIComponent escapes all characters except:
A - Z
a - z
0 - 9
- . _ ~ ! ' ( ) *
So the output includes the above characters, plus escaped characters, which are % followed by hexadecimal digits.
The set of allowed characters in the fragment is:
A - Z
a - z
0 - 9
? / : @ - . _ ~ ! $ & ' ( ) * + , ; =
percent-encoded characters (a % followed by hexadecimal digits)
This set of allowed characters includes all the characters output by encodeURIComponent, plus a few other characters.
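A minimal sketch of this approach for a JSON-serializable state object. The function names stateToFragment and fragmentToState are hypothetical; in a browser the result would be assigned to and read from window.location.hash.

```javascript
// Hypothetical helper: serialize the state and escape it for use after '#'.
function stateToFragment(state) {
  return '#' + encodeURIComponent(JSON.stringify(state));
}

// Hypothetical inverse: strip the leading '#', unescape, and parse.
function fragmentToState(hash) {
  return JSON.parse(decodeURIComponent(hash.replace(/^#/, '')));
}

const fragment = stateToFragment({ view: 'list', filter: 'a b "c"' });
console.log(fragment);
console.log(fragmentToState(fragment)); // { view: 'list', filter: 'a b "c"' }
```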

what kind of encoding is this?

I've got some data from DBpedia using Jena, and since Jena's output is based on XML, there are some circumstances in which XML characters need to be treated differently, like the following:
Guns n &#39; Roses
I just want to know what kind of encoding this is.
I want to decode/encode my input based on the above encoding with the help of JavaScript and send it back to a servlet.
(edited post if you remove the space between & and amp you will get the correct character since in stackoverflow I couldn't find a way to do that I decided to put like that!)
Seems to be XML entity encoding, and a numeric character reference (decimal).
A numeric character reference refers to a character by its Universal
Character Set/Unicode code point, and uses the format &#nnnn; or &#xhhhh;
You can get some info here: List of XML and HTML character entity references on Wikipedia.
Your character is number 39, being the apostrophe: ', which can also be referenced with a character entity reference: &apos;.
To decode this using JavaScript, you could use for example php.js, which has an html_entity_decode() function (note that it depends on get_html_translation_table()).
UPDATE: in reply to your edit: Basically that is the same, the only difference is that it was encoded twice (possibly by mistake). &amp; is the ampersand: &.
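For numeric character references alone, a small function is enough and no library is needed. This is a sketch with a hypothetical name, decodeNumericEntities; it handles only the decimal and hexadecimal numeric forms, not named entities, and does not guard against out-of-range code points.

```javascript
// Hypothetical helper: replace &#NNN; and &#xHHH; with the character they refer to.
function decodeNumericEntities(text) {
  return text.replace(/&#(x[0-9a-fA-F]+|[0-9]+);/g, (match, ref) => {
    const codePoint = ref[0] === 'x'
      ? parseInt(ref.slice(1), 16)
      : parseInt(ref, 10);
    return String.fromCodePoint(codePoint);
  });
}

console.log(decodeNumericEntities('Guns n &#39; Roses')); // "Guns n ' Roses"

// A doubly encoded value needs the ampersand entity undone first.
const twice = 'Guns n &amp;#39; Roses';
console.log(decodeNumericEntities(twice.replace(/&amp;/g, '&'))); // "Guns n ' Roses"
```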
This is an SGML/HTML/XML numeric character entity reference.
In this case for an apostrophe '.
