JavaScript encodeURI returns unexpected value

I have a problem URL-encoding text with JavaScript.
I am in Germany, where we have these "Umlaute" (ÄÖÜ), and these letters cause some problems.
An online encoder/decoder returned the following results for the word "Äpfel" (apples):
Äpfel >>> url-encode >>> %C3%84pfel
%C3%84pfel >>> url-decode >>> Äpfel
For testing, I created the following PHP file (poc.php) with no PHP content, just the JavaScript:
<script type="text/javascript">
var t = "Äpfel";
t = encodeURI(t);
alert(t);
t = decodeURI(t);
alert(t);
</script>
The first alert returns "%EF%BF%BDpfel", which differs from the result of the online encoder.
The second alert returns "�pfel" (yes, the diamond with the "?").
It seems that JavaScript cannot decode the text it just encoded.
I guess the cause of this behaviour is somewhere in the PHP settings. When I just rename the file from "poc.php" to "poc.html", the encoding is correct and the alerts return the same results as the online encoder/decoder.
When I check the current encoding, both JavaScript and PHP report "utf-8".
In my "real" project I have a ".js" file included in my PHP file (with the same problem):
<script type="text/javascript" src="scripts/functions.js"></script>
Does anybody have an idea what causes this behaviour?

The weird byte sequence %EF%BF%BD you're receiving is the UTF-8 encoding of the Unicode replacement character, that is, literally the � symbol.
The JavaScript portion can URL-decode the text it just URL-encoded; it was simply asked to encode the symbol for a missing character in the first place.
So: some part of your system is not using UTF-8 but some other character set, and an unnecessary conversion is being done. My guess is that the file is encoded in Latin-1, a.k.a. ISO 8859-1, and PHP tries to read it as if it were UTF-8, converting the unrecognized byte 0xC4 ('Ä' in Latin-1) to the replacement character.
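For illustration, here is a quick browser-console check (my own sketch, not part of the question) that reproduces both outputs: the expected percent-encoding of "Äpfel", and what you get once 'Ä' has already been replaced by U+FFFD before encodeURI ever runs.
// Sketch for a browser console, assuming a UTF-8 page:
console.log(encodeURI("Äpfel"));      // "%C3%84pfel"  - the expected UTF-8 percent-encoding
console.log(encodeURI("\uFFFDpfel")); // "%EF%BF%BDpfel" - what you see when 'Ä' is already U+FFFD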

Related

Decoding Base64 String in Java

I'm using Java and I have a Base64-encoded string that I wish to decode and then transform.
The correct decoded value is obtained in JavaScript through the function atob(), but in Java, using Base64.decodeBase64(), I cannot get an equal value.
Example:
For:
String str = "AAAAAAAAAAAAAAAAAAAAAMaR+ySCU0Yzq+AV9pNCCOI=";
With JavaScript atob(str) I get ->
"Æ‘û$‚SF3«àö“Bâ"
With Java new String(Base64.decodeBase64(str)) I get ->
"Æ?û$?SF3«à§ö?â"
Another way I tried to fix the issue is to run JavaScript in Java with a Nashorn engine, but I'm getting an error near the "$" symbol.
Current Code:
ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");
String script2 = "function decoMemo(memoStr){ print(atob(memoStr).split('')" +
        ".map((aChar) => `0${aChar.charCodeAt(0).toString(16)}`" +
        ".slice(-2)).join('').toUpperCase());}";
try {
    engine.eval(script2);
    Invocable inv = (Invocable) engine;
    String returnValue = (String) inv.invokeFunction("decoMemo", memoTest);
    System.out.print("\n result: " + returnValue);
} catch (ScriptException | NoSuchMethodException e1) {
    e1.printStackTrace();
}
Any help would be appreciated. I searched in a lot of places but can't find the correct answer.
btoa is broken and shouldn't be used.
The problem is, bytes aren't characters. Base64 encoding does only one thing. It converts bytes to a stream of characters that survive just about any text-based transport mechanism. And Base64 decoding does that one thing in reverse, it converts such characters into bytes.
And the confusion is, you're printing those bytes as if they are characters. They are not.
You end up with the exact same bytes, but JavaScript and Java disagree on how you're supposed to turn them into an ersatz string when you try to print them to a console. That's the mistake - bytes aren't characters. Some charset encoding has to be applied to do that, and you don't want any of this, because these bytes clearly aren't intended to be printed like that.
JavaScript sort of half-equates characters and bytes and will freely convert one to the other, picking some arbitrary encoding. Oof. JavaScript sucks in this regard; it is what it is. The MDN docs on btoa explain why you shouldn't use it. You're running into exactly that problem.
I'm not entirely sure how you fix it in JavaScript - but perhaps you don't need to. Java is decoding the bytes perfectly well, as is JavaScript, but JavaScript then turns those bytes into characters in some silly fashion, and that's what causes the problem.
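For what it's worth, here is a sketch of one way to keep the result as bytes on the JavaScript side (my own illustration, assuming a browser environment), producing the same kind of hex dump the decoMemo function was aiming for:
// Decode Base64 to raw bytes and hex-dump them, without ever treating the bytes as text:
var str = "AAAAAAAAAAAAAAAAAAAAAMaR+ySCU0Yzq+AV9pNCCOI=";
var bytes = Uint8Array.from(atob(str), function (c) { return c.charCodeAt(0); });
var hex = Array.from(bytes, function (b) {
    return ("0" + b.toString(16)).slice(-2);
}).join("").toUpperCase();
console.log(hex); // hex representation of the decoded bytes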
What you have there is not a text string at all. The giveaway is the AAs at the beginning. Those map to a run of zero bytes, which doesn't translate to meaningful text in any standard character set.
So what you have there is most likely binary data. Converting it to a string is not going to give you meaningful text.
Now to explain the difference you are seeing between Java and JavaScript. It looks to me as if both Java and JavaScript are making a "best effort" attempt to convert the binary data as if it were encoded in ISO-8859-1 (aka ISO Latin-1).
The problem is that some of the byte codes map to unassigned codes.
In the Java case, those unassigned codes are being mapped to ?, either when the string is created or when it is being output.
In the JavaScript case, either the unassigned codes are not included in the string, or they are being removed when you attempt to display them.
For the record, this is how an online Base64 decoder rendered the above for me:
����������������Æû$SF3«àöBâ
The unassigned codes are 0x91 0x82 and 0x93. 0x15 and 0x0B are non-printing control codes.
But the bottom line is that you should not be converting this data into a string in either Java or in Javascript. It should be treated as binary; i.e. an array of byte values.
byte[] data = Base64.getDecoder().decode(str);

Why are special characters automatically turning into some other characters in a string value?

My actual characters:
ÆÐƎƏƐƔIJŊŒẞÞǷȜæðǝəɛɣijŋœĸſßþƿȝĄƁÇĐƊĘĦĮƘŁØƠŞȘŢȚŦŲƯY̨Ƴąɓçđɗęħįƙłøơşșţțŧųưy̨ƴÁÀÂÄǍĂĀÃÅǺĄÆǼǢƁĆĊĈČÇĎḌĐƊÐÉÈĖÊËĚĔĒĘẸƎƏƐĠĜǦĞĢƔáàâäǎăāãåǻąæǽǣɓćċĉčçďḍđɗðéèėêëěĕēęẹǝəɛġĝǧğģɣĤḤĦIÍÌİÎÏǏĬĪĨĮỊIJĴĶƘĹĻŁĽĿʼNŃN̈ŇÑŅŊÓÒÔÖǑŎŌÕŐỌØǾƠŒĥḥħıíìiîïǐĭīĩįịijĵķƙĸĺļłľŀʼnńn̈ňñņŋóòôöǒŏōõőọøǿơœŔŘŖŚŜŠŞȘṢẞŤŢṬŦÞÚÙÛÜǓŬŪŨŰŮŲỤƯẂẀŴẄǷÝỲŶŸȲỸƳŹŻŽẒŕřŗſśŝšşșṣßťţṭŧþúùûüǔŭūũűůųụưẃẁŵẅƿýỳŷÿȳỹƴźżžẓ
The above characters automatically turn into
’'‘ÆÃÆŽÆÆƔIJŊŒẞÞǷȜæðÇəɛɣijŋœĸſßþƿÈÄ„ÆÇÄƊĘĦĮƘÅØƠŞȘŢȚŦŲƯY̨Ƴąɓçđɗęħįƙłøơşșţțŧųưy̨ƴÃÀÂÄÇĂĀÃÅǺĄÆǼǢÆĆĊĈČÇĎḌÄÆŠÃÉÈĖÊËĚĔĒĘẸƎÆÆĠĜǦĞĢƔáàâäǎăÄãåǻąæǽǣɓćċĉÄçÄá¸Ä‘ɗðéèėêëěĕēęẹÇəɛġÄǧğģɣĤḤĦIÃÌİÎÃÇĬĪĨĮỊIJĴĶƘĹĻÅĽĿʼNŃN̈ŇÑŅŊÓÒÔÖǑŎŌÕÅỌØǾƠŒĥḥħıíìiîïÇĭīĩįịijĵķƙĸĺļłľŀʼnńn̈ňñņŋóòôöǒÅÅõőá»Ã¸Ç¿Æ¡Å“ŔŘŖŚŜŠŞȘṢẞŤŢṬŦÞÚÙÛÜǓŬŪŨŰŮŲỤƯẂẀŴẄǷÃỲŶŸȲỸƳŹŻŽẒŕřŗſśÅšşșṣßťţṭŧþúùûüǔŭūũűůųụưẃáºÅµáº…ƿýỳŷÿȳỹƴź
I got that output when I tried to console.log the string.
That's not exactly a question, but it's obvious your file encodings are not what you expect them to be. Make sure everything is UTF-8 through and through.
Add the line below between the HTML head tags:
<meta charset="UTF-8"/>
1 - UTF-8 vs ANSI
Your first block is in UTF-8 format; the second is what that UTF-8 text looks like when it is read as ANSI.
Somewhere during your translation process, the strings changed from UTF-8 to ANSI. Make sure all your text sources are saved as UTF-8.
You can check with a free text editor like Notepad++.
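As a small illustration of the mismatch (my own sketch; it assumes a browser with TextEncoder/TextDecoder), here is what happens when UTF-8 bytes are read back with a Windows-1252 ("ANSI") decoder:
// Encode a character as UTF-8, then decode the bytes with the wrong charset:
var bytes = new TextEncoder().encode("Æ");                 // UTF-8 bytes [0xC3, 0x86]
console.log(new TextDecoder("windows-1252").decode(bytes)); // "Ã†" - the kind of garbling shown above
console.log(new TextDecoder("utf-8").decode(bytes));        // "Æ"  - decoded with the right charset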
2 - Understand JavaScript string codes
Each char has a given code, independent of what you think is there.
For "special" chars, what looks like an A might have a different code than the default A.
A small example:
var letter1 = String.fromCharCode(65); // output: "A"
var letter2 = String.fromCharCode(913); // output: "Α"
console.log(letter1);
console.log(letter2);
console.log(letter1 === letter2);
So when you apply any logic to a string, it will not give you the result you are expecting if the char codes are not exactly the same.

JSON unicode characters conversion

I came across this strange JSON which I can't seem to decode.
To simplify things, let's say it's a JSON string:
"\uffffffe2\uffffff94\uffffff94\uffffffe2\uffffff94\uffffff80\uffffffe2\uffffff94\uffffff80 mystring"
After decoding, it should look like the following:
└── mystring
Neither JS nor PHP seems to convert it correctly:
js> JSON.parse('"\uffffffe2\uffffff94\uffffff94\uffffffe2\uffffff94\uffffff80\uffffffe2\uffffff94\uffffff80 mystring"')
ffe2ff94ff94ffe2ff94ff80ffe2ff94ff80 mystring
PHP behaves the same
php> json_decode('"\uffffffe2\uffffff94\uffffff94\uffffffe2\uffffff94\uffffff80\uffffffe2\uffffff94\uffffff80 mystring"')
ffe2ff94ff94ffe2ff94ff80ffe2ff94ff80 mystring
Any ideas how to properly parse this JSON string would be welcome.
It is not a valid JSON string - JSON supports only 4 hex digits after \u. The results from both PHP and JS are correct.
It is not possible to decode this using standard functions.
Where did you get this JSON string?
As for the correct JSON for the string you want to get - it should be "\u2514\u2500\u2500 mystring", or just "└── mystring" (JSON supports any Unicode characters in strings except " and \).
Also, if you need to encode a character that requires more than two bytes, it will result in two escape codes (a surrogate pair); for example, "𩄎" would be "\ud864\udd0e" when escaped.
So, if you really need to decode the string above, you can fix it before decoding by replacing \uffffffe2 with \uffff\uffe2 via a regexp (in JS it would be something like: s.replace(/(\\u[A-Fa-f0-9]{4})([A-Fa-f0-9]{4})/gi,'$1\\u$2') ).
But anyway, the character codes in the string specified above do not look right.
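To make the regexp fix above concrete, here is a sketch (my own illustration, with the escaped text held in a plain JS string). As noted, the result parses, but the code points still are not "└── mystring":
// The escaped text as it might arrive over the wire (literal backslash-u sequences):
var raw = '"\\uffffffe2\\uffffff94\\uffffff94\\uffffffe2\\uffffff94\\uffffff80\\uffffffe2\\uffffff94\\uffffff80 mystring"';
// Split every 8-hex-digit escape into two 4-digit escapes so JSON.parse accepts it:
var fixed = raw.replace(/(\\u[A-Fa-f0-9]{4})([A-Fa-f0-9]{4})/gi, '$1\\u$2');
console.log(JSON.parse(fixed)); // parses now, but the characters are still not "└──"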

How to convert this text to correct HTML characters using Javascript

How to convert this text to correct HTML characters using Javascript:
'PingAsyncTask - Token v\ufffdlido'
Put in your console:
console.log('PingAsyncTask - Token v\ufffdlido');
I already tried all the common functions:
https://gist.github.com/chrisveness/bcb00eb717e6382c5608
http://monsur.hossa.in/2012/07/20/utf-8-in-javascript.html
http://jsfromhell.com/geral/utf-8
Can anyone help me?
If your document is already UTF-8 you don't need to do anything special. The string is already encoded correctly in JavaScript, so when you write it into the document it'll show up correctly. You can see it in this fiddle: https://jsfiddle.net/baar4ew8/
P.S. The character in your code (\ufffd) is U+FFFD, the Unicode replacement character. Most fonts render it as a black diamond with a question mark inside, or just an empty box. Here's how Stack Overflow renders it:
�
If you're seeing that in your output, your string is being rendered correctly.
If you think you should be seeing some other character, then your problem isn't in the HTML or JavaScript—it's with the source of your data, whatever that might be. When a program converts text from a non-Unicode encoding to a Unicode encoding like UTF-8, characters that don't exist in Unicode are replaced with U+FFFD (�)—hence "replacement character." If you're expecting some character that does exist in Unicode but you're getting U+FFFD then it might be the case that the program converting the text to UTF-8 doesn't know what encoding it was originally in and so converted it incorrectly. For example, if you stored text with encoding X in a database table with encoding Y without first converting it to encoding Y.
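As a hedged illustration of that last point (my own sketch, assuming the original text was the Latin-1/Windows-1252 encoding of "válido"), decoding bytes with the wrong charset is exactly what produces U+FFFD:
// "válido" as ISO-8859-1 bytes; 0xE1 ('á') is not a valid standalone byte in UTF-8:
var bytes = Uint8Array.of(0x76, 0xE1, 0x6C, 0x69, 0x64, 0x6F);
console.log(new TextDecoder("utf-8").decode(bytes));      // "v�lido" - U+FFFD appears
console.log(new TextDecoder("iso-8859-1").decode(bytes)); // "válido" - the right charset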

Decode JSON string in Mojolicious that was encoded with JSON.stringify

I am trying to send a JavaScript variable as a JSON string to Mojolicious and I am having problems decoding it on the Perl side. My page uses UTF-8 encoding.
The JSON string (the value of $self->param('routes_jsonstr')) seems to have the correct value, but Mojo::JSON can't decode it. The code works well when there are no UTF-8 characters. What am I doing wrong?
Javascript code:
var routes = [{
    addr1: 'Škofja Loka', // string with UTF-8 character
    addr2: 'Kranj'
}];
var routes_jsonstr = JSON.stringify(routes);
$.get(url.on_route_change, {
    routes_jsonstr: routes_jsonstr
});
Perl code:
sub on_route_change {
    my $self = shift;
    my $routes = j( $self->param('routes_jsonstr') );
    warn $self->param('routes_jsonstr');
    warn Dumper $routes;
}
Server output
Wide character in warn at /opt/mojo/routes/script/../lib/Routes/Homepage.pm line 76.
[{"addr1":"Škofja Loka","addr2":"Kranj"}] at /opt/mojo/routes/script/../lib/Routes/Homepage.pm line 76.
$VAR1 = undef;
The last line above shows that decoding of the JSON string didn't work. When there are no UTF-8 characters to decode on the Perl side, everything works fine and $routes contains the expected data.
A Mojolicious-style solution can be found here:
http://showmetheco.de/articles/2010/10/how-to-avoid-unicode-pitfalls-in-mojolicious.html
In the JavaScript I only changed $.get() to $.post(); see the sketch below.
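For completeness, a minimal sketch of what that JavaScript change looks like (my own illustration; everything else stays as in the question):
var routes = [{
    addr1: 'Škofja Loka',
    addr2: 'Kranj'
}];
// The only change from the question: $.post() instead of $.get().
$.post(url.on_route_change, {
    routes_jsonstr: JSON.stringify(routes)
});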
The updated and working Perl code now looks like this:
use Mojo::ByteStream 'b';

sub on_route_change {
    my $self = shift;
    my $routes = j( b( $self->param('routes_jsonstr') )->encode('UTF-8') );
}
Tested with many different utf8 strings.
Wide character warnings happen when you print. This is not due to how you decode your Unicode but to your STDOUT encoding. Try use utf8::all, available from CPAN, which will set all your IO handles to UTF-8. Avoiding decoding probably isn't fixing the problem, but rather making it worse. The only reason it appears to work is that your terminal is fixing things up for you.
You can take away at least some of the pain by escaping the problematic characters; see https://stackoverflow.com/a/4901205/17389.
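One way to sidestep the issue along those lines is to escape every non-ASCII character as a \uXXXX sequence before the JSON leaves the browser. A hedged sketch of that idea (my own illustration; the helper name is made up):
// Replace every non-ASCII character in the serialized JSON with its \uXXXX escape,
// so only 7-bit ASCII travels over the wire:
function escapeNonAscii(json) {
    return json.replace(/[\u0080-\uffff]/g, function (c) {
        return '\\u' + ('000' + c.charCodeAt(0).toString(16)).slice(-4);
    });
}
console.log(escapeNonAscii(JSON.stringify([{ addr1: 'Škofja Loka' }])));
// '[{"addr1":"\u0160kofja Loka"}]'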
