Unicode encode string - javascript

I'm json_encoding some strings. Sometimes they contain binary data, which causes the encoding to fail with error code JSON_ERROR_UTF8. Running the strings through utf8_encode gets around this error. However, ✓ (a Unicode checkmark) gets encoded as \u00e2\u009c\u0093, which, when interpreted by JavaScript and rendered in your browser, actually looks like â.
How can I fix this? Is there another encoding I can use?
echo json_encode(utf8_encode('✓')); // "\u00e2\u009c\u0093"
Now press F12 and paste that into your JavaScript console (quotes included). It should output â.
Please note that
echo json_encode('✓'); // "\u2713"
Works as intended. The issue is that sometimes the string will contain binary data which json_encode can't handle, so I need to sanitize every string without breaking the strings it can handle.
More examples:
json_encode(chr(200)); // false (bad)
json_encode(utf8_encode(chr(200))); // "\u00c8" (good)
json_encode('✓'); // "\u2713" (good)
json_encode(utf8_encode('✓')); // "\u00e2\u009c\u0093" (bad)
So you see, utf8_encode works well for some strings and breaks others.
This is strictly for logging. I don't care if the binary data comes out weird, I just don't want it to mess with valid strings.
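To see what those escapes really are, here is a quick sketch you can paste into the console: the three code points are exactly the UTF-8 bytes of ✓ (0xE2 0x9C 0x93) promoted to individual characters, which is what double-encoding produces.
// The UTF-8 bytes of the checkmark:
var bytes = new Uint8Array([0xE2, 0x9C, 0x93]);
// Read as UTF-8 they are the checkmark again:
console.log(new TextDecoder().decode(bytes)); // "✓"
// Promote each byte to its own code point (what utf8_encode does to
// input that is already UTF-8) and you get the mojibake:
console.log(String.fromCharCode(0xE2, 0x9C, 0x93)); // "â" plus two control characters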

Running strings through this function
function _utf8($str) {
    // Only transcode strings that aren't already valid UTF-8,
    // leaving existing UTF-8 text (like ✓) untouched.
    if (!mb_check_encoding($str, 'UTF-8')) {
        return utf8_encode($str);
    }
    return $str;
}
(taken and modified from here)
Seems to give the results I'm after.
Checkmarks are left alone, but chr(200) and other weirdness is encoded:
json_encode(_utf8('✓')); // "\u2713"
json_encode(_utf8(chr(200))); // "\u00c8"

EDIT: This question is unanswerable. Encoding arbitrary binary data is one thing, keeping UTF-8 characters intact is something completely separate. What's to stop the byte sequence 0xe29c93 from being interpreted as ✓ when it shows up in your binary data?
According to the json_encode PHP reference page, you can use the following syntax to encode Unicode characters:
json_encode($data, JSON_UNESCAPED_UNICODE);
It makes json_encode pass Unicode characters through unescaped.

Related

Decoding Base64 String in Java

I'm using Java and I have a Base64 encoded string that I wish to decode and then transform.
The correct decoded value is obtained in JavaScript through the function atob(), but in Java, using Base64.decodeBase64(), I cannot get an equal value.
Example:
For:
String str = "AAAAAAAAAAAAAAAAAAAAAMaR+ySCU0Yzq+AV9pNCCOI="
With JavaScript atob(str) I get ->
"Æ‘û$‚SF3«àö“Bâ"
With Java new String(Base64.decodeBase64(str)) I get ->
"Æ?û$?SF3«à§ö?â"
Another way I could fix the issue is to run JavaScript in Java with a Nashorn engine, but I'm getting an error near the "$" symbol.
Current Code:
ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");
String script2 = "function decoMemo(memoStr){ print(atob(memoStr).split('')" +
        ".map((aChar) => `0${aChar.charCodeAt(0).toString(16)}`" +
        ".slice(-2)).join('').toUpperCase());}";
try {
    engine.eval(script2);
    Invocable inv = (Invocable) engine;
    String returnValue = (String) inv.invokeFunction("decoMemo", memoTest);
    System.out.print("\n result: " + returnValue);
} catch (ScriptException | NoSuchMethodException e1) {
    e1.printStackTrace();
}
Any help would be appreciated. I've searched a lot of places but can't find the correct answer.
btoa is broken and shouldn't be used.
The problem is, bytes aren't characters. Base64 encoding does only one thing. It converts bytes to a stream of characters that survive just about any text-based transport mechanism. And Base64 decoding does that one thing in reverse, it converts such characters into bytes.
And the confusion is, you're printing those bytes as if they are characters. They are not.
You end up with the exact same bytes, but javascript and java disagree on how you're supposed to turn that into an ersatz string because you're trying to print it to a console. That's a mistake - bytes aren't characters. Thus, some sort of charset encoding is being used, and you don't want any of this, because these characters clearly aren't intended to be printed like that.
Javascript sort of half-equates characters and bytes and will freely convert one to the other, picking some random encoding. Oof. Javascript sucks in this regard; it is what it is. The MDN docs on btoa explain why you shouldn't use it. You're running into that problem.
Not entirely sure how you fix it in javascript - but perhaps you don't need it. Java is decoding the bytes perfectly well, as is javascript, but javascript then turns those bytes into characters in some silly fashion and that's causing the problem.
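Perhaps something like this, treating atob's result purely as a byte container and building the hex dump the Nashorn snippet was after (a sketch, not a definitive fix):
var str = "AAAAAAAAAAAAAAAAAAAAAMaR+ySCU0Yzq+AV9pNCCOI=";
// Decode Base64 straight to bytes; never print atob's return value.
var bytes = Uint8Array.from(atob(str), function (c) { return c.charCodeAt(0); });
// Hex dump of the raw bytes, matching Java's Base64.getDecoder().decode(str):
var hex = Array.from(bytes, function (b) { return b.toString(16).padStart(2, '0'); }).join('').toUpperCase();
console.log(hex);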
What you have there is not a text string at all. The giveaway is the AA's at the beginning. Those map to a number of zero bytes. That doesn't translate to meaningful text in any standard character set.
So what you have there is most likely binary data. Converting it to a string is not going to give you meaningful text.
Now to explain the difference you are seeing between Java and Javascript. It looks to me as if both Java and Javascript are making a "best effort" attempt to convert the binary data as if it was encoded in ISO-8859-1 (aka ISO LATIN-1).
The problem is that some of the byte codes map to unassigned codes.
In the Java case those unassigned codes are being mapped to ?, either when the string is created or when it is being output.
In the Javascript case, either the unassigned codes are not included in the string, or they are being removed when you attempt to display them.
For the record, this is how an online Base64 decoder rendered the above for me:
����������������Æû$SF3«àöBâ
The unassigned codes are 0x91, 0x82 and 0x93; 0x15 and 0x0B are non-printing control codes.
But the bottom line is that you should not be converting this data into a string in either Java or in Javascript. It should be treated as binary; i.e. an array of byte values.
byte[] data = Base64.getDecoder().decode(str);

TextEncoder / TextDecoder not round tripping

I'm definitely missing something about the TextEncoder and TextDecoder behavior. It seems to me like the following code should round-trip, but it doesn't seem to:
new TextDecoder().decode(new TextEncoder().encode(String.fromCharCode(55296))).charCodeAt(0);
Since I'm just encoding and decoding the string, the char code seems like it should be the same, but this returns 65533 instead of 55296. What am I missing?
Based on some spelunking, the TextEncoder.encode() method appears to take an argument of type USVString, where USV stands for Unicode Scalar Value. According to this page, a USV cannot be a high-surrogate or low-surrogate code point.
Also, according to MDN:
A USVString is a sequence of Unicode scalar values. This definition
differs from that of DOMString or the JavaScript String type in that
it always represents a valid sequence suitable for text processing,
while the latter can contain surrogate code points.
So, my guess is your String argument to encode() is getting converted to a USVString (either implicitly or within encode()). Based on this page, it looks like to convert from String to USVString, it first converts it to a DOMString, and then follows this procedure, which includes replacing all surrogates with U+FFFD, which is the code point you see, 65533, the "Replacement Character".
The reason String.fromCharCode(55296).charCodeAt(0) works I believe is because it doesn't need to do this String -> USVString conversion.
As to why TextEncoder.encode() was designed this way, I don't understand the unicode details well enough to attempt to explain, but I suspect it's to simplify implementation since the only output encoding it supports seems to be UTF-8, in an Uint8Array. I'm guessing requiring a USVString argument without surrogates (instead of a native UTF-16 String possibly with surrogates) simplifies the encoding to UTF-8, or maybe makes some encoding/decoding use cases simpler?
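Newer engines let you observe that String -> USVString conversion directly; isWellFormed and toWellFormed are ES2024 additions, so this sketch assumes a recent runtime:
var s = String.fromCharCode(55296);          // a lone high surrogate
console.log(s.isWellFormed());               // false
console.log(s.toWellFormed().charCodeAt(0)); // 65533, the same replacement the TextEncoder round trip makes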
For those (like me) who aren't sure what "unicode surrogates" are:
The problem
The character code 55296 is not a valid character by itself. So this part of the code is already a problem:
String.fromCharCode(55296)
Since there is no valid character at that code on its own (it's a lone surrogate), the encode/decode round trip replaces it with the error character "�", which happens to have the code 65533.
Codes like 55296 are only valid as the first element of a pair of codes. Pairs of codes are used to represent the characters that didn't fit in Unicode's Basic Multilingual Plane. (There are a lot of characters outside the Basic Multilingual Plane, so they need two 16-bit numbers to encode them.)
For example, here is a valid use of the code 55296:
console.log(String.fromCharCode(55296, 57091));
It returns the character "𐌃", from the ancient Etruscan alphabet.
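If you're curious, the arithmetic tying that pair to a single code point is only a few lines (a sketch):
var hi = 55296, lo = 57091;
var cp = (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
console.log(cp);                       // 66307, i.e. U+10303
console.log(String.fromCodePoint(cp)); // "𐌃"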
The solution
This code will round-trip correctly:
const code = new TextEncoder().encode(String.fromCharCode(55296, 57091));
console.log(new TextDecoder().decode(code).charCodeAt(0)); // Returns 55296
But beware: .charCodeAt only returns the first part of the pair. A safer option might be to use String.codePointAt to convert the character into a single 32-bit code:
const code = new TextEncoder().encode(String.fromCharCode(55296, 57091));
console.log(new TextDecoder().decode(code).codePointAt(0)); // Returns 66307

Javascript encodeURI returns unexpected value

I have a problem URL-encoding a text with JavaScript.
I am in Germany, where we have the "Umlaute" (ÄÖÜ), and these letters cause some problems.
An online encoder/decoder returned the following results for the word "Äpfel" (apples).
Äpfel >>> url-encode >>> %C3%84pfel
%C3%84pfel >>> url-decode >>> Äpfel
For testing, I created the following PHP file (poc.php) with no PHP content, just the JavaScript:
<script type="text/javascript">
var t = "Äpfel";
t = encodeURI(t);
alert(t);
t = decodeURI(t);
alert(t);
</script>
The first alert returns "%EF%BF%BDpfel", which differs from the result of the online encoder.
The second alert returns "�pfel" (yes, the diamond with the "?").
It seems that javascript cannot decode the text it just encoded.
I guess the cause of this behaviour is somewhere in the PHP settings. When I just rename the file from "poc.php" to "poc.html" the encoding is correct and the alerts return the same results as the online encoder/decoder.
When I check the current encoding, javascript and php return "utf-8".
In my "real" project I have a ".js" file included in my php-file (with the same problem).
<script type="text/javascript" src="scripts/functions.js"></script>
Does anybody have an idea what causes this behaviour?
The weird byte stream %EF%BF%BD you're receiving is the UTF-8 encoding of the Unicode replacement character, that is, literally the � symbol.
The JavaScript portion can URL-decode the text it just URL-encoded; it was simply asked to encode the replacement character in the first place.
So: some part of your system is not using UTF-8 but some other character set instead, and an unnecessary conversion is being done. My guess is that the file is encoded in Latin-1 (aka ISO 8859-1) and PHP tries to read it as if it were UTF-8, converting the unrecognized byte 0xC4 ('Ä' in Latin-1) to the replacement character.
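A quick sketch of both behaviours side by side, assuming the page itself is parsed as UTF-8:
console.log(encodeURI("Äpfel"));      // "%C3%84pfel", matching the online encoder
console.log(decodeURI("%C3%84pfel")); // "Äpfel"
// If the Ä was already mangled into U+FFFD before encodeURI ran,
// you get exactly the output from the question:
console.log(encodeURI("\uFFFDpfel")); // "%EF%BF%BDpfel"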

Error Parsing JSON with escaped quotes

I am getting the following JSON object when I call the URL from the browser (I expect no data in it):
"{\"data\":[], \"SkipToken\":\"\", \"top\":\"\"}"
However, when I try to call it in JavaScript, it gives me a JSON parsing error:
dspservice.callService(URL, "GET", "", function (data) {
var dataList = JSON.parse(data);
)};
This code was working before; I have no idea why it all of a sudden stopped working and started throwing this error.
You say the server is returning the JSON (omitting the enclosing quotes):
{\"data\":[], \"SkipToken\":\"\", \"top\":\"\"}
This is invalid JSON. The quote marks in JSON surrounding strings and property names must not be preceded by a backslash. The backslash in JSON is for escaping characters inside a string, most importantly an embedded double quote mark. (It can escape other characters inside strings too, but that is not relevant here.)
Correct JSON would be:
{"data":[], "SkipToken":"", "top":""}
If your server returned this, it would parse correctly.
The confusion here, and the reports by other posters that it seems like your string should work, lies in the fact that in a simple-minded test, where I type this string into the console:
var x = "{\"data\":[], \"SkipToken\":\"\", \"top\":\"\"}";
the JavaScript string literal escaping mechanism, which is entirely distinct from the use of escapes in JSON, results in a string with the value
{"data":[], "SkipToken":"", "top":""}
which of course JSON.parse can handle just fine. But Javascript string escaping applies to string literals in source code, not to things coming down from the server.
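You can watch the two layers separately in the console; this sketch uses String.raw only as a convenient way to type the server's literal characters:
// What the JS parser makes of the literal: plain quotes, valid JSON.
var literal = "{\"data\":[], \"SkipToken\":\"\", \"top\":\"\"}";
console.log(JSON.parse(literal)); // works
// What the server actually sends: the backslash-quote pairs survive.
var wire = String.raw`{\"data\":[], \"SkipToken\":\"\", \"top\":\"\"}`;
// JSON.parse(wire) throws: a backslash is illegal outside a JSON string.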
To fix the server's incorrectly-escaped JSON, you have two possibilities. One is to tell the server guys they don't need to (and must not) put backslashes before quote marks (except for quote marks inside strings). Then everything will work.
The other approach is to undo the escaping yourself before handing it off to JSON.parse. A first cut at this would be a simple regexp such as
data.replace(/\\"/g, '"')
as in
var dataList = JSON.parse(data.replace(/\\"/g, '"'));
It might need additional tweaking depending on how the server guys are escaping quotes inside strings; are they sending \"\\"\", or possibly \"\\\"\"?
I cannot explain why this code that was working suddenly stopped working. My best guess is a change on the server side that started escaping the double quotes.
Since there is nothing wrong with the JSON string you gave us, the only other explanation is that the data being passed to your function is something other than what you listed.
To test this hypothesis, run the following code:
dspservice.callService(URL, "GET", "", handler(data));
function handler(data) {
var goodData = "{\"data\":[], \"SkipToken\":\"\", \"top\":\"\"}";
alert(goodData); // display the correct JSON string
var goodDataList = JSON.parse(goodData); // parse good string (should work)
alert(data); // display string in question
var dataList = JSON.parse(data); // try to parse it (should fail)
}
If the goodData JSON string can be parsed with no issues, and data appears to be incorrectly-formatted, then you have the answer to your question.
Place a breakpoint on the first line of the handler function, where goodData is defined. Then step through the code. From what you told me in your comments, it is still crashing during a JSON parse, but I'm willing to wager that it is failing on the second parse and not the first.
Did you mean that your JSON is like this?
"{\"data\":[], \"SkipToken\":\"\", \"top\":\"\"}"
Then data in your callback would be like this:
'"{\"data\":[], \"SkipToken\":\"\", \"top\":\"\"}"'
Because data is the fetched text content string.
You don't have to add extra quotes in your JSON:
{"data":[], "SkipToken":"", "top":""}

Decode JSON string in Mojolicious that was encoded with JSON.stringify

I am trying to send a JavaScript variable as a JSON string to Mojolicious and I am having problems decoding it on the Perl side. My page uses UTF-8 encoding.
The JSON string (the value of $self->param('routes_jsonstr')) seems to have the correct value, but Mojo::JSON can't decode it. The code works well when there are no UTF-8 characters. What am I doing wrong?
Javascript code:
var routes = [{
    addr1: 'Škofja Loka', // string with a UTF-8 character
    addr2: 'Kranj'
}];
var routes_jsonstr = JSON.stringify(routes);
$.get(url.on_route_change, {
    routes_jsonstr: routes_jsonstr
});
Perl code:
sub on_route_change {
    my $self = shift;
    my $routes = j( $self->param('routes_jsonstr') );
    warn $self->param('routes_jsonstr');
    warn Dumper $routes;
}
Server output
Wide character in warn at /opt/mojo/routes/script/../lib/Routes/Homepage.pm line 76.
[{"addr1":"Škofja Loka","addr2":"Kranj"}] at /opt/mojo/routes/script/../lib/Routes/Homepage.pm line 76.
$VAR1 = undef;
The last line above shows that decoding of the JSON string didn't work ($routes is undef). When there are no UTF-8 characters to decode on the Perl side, everything works fine and $routes contains the expected data.
Mojolicious style solution can be found here:
http://showmetheco.de/articles/2010/10/how-to-avoid-unicode-pitfalls-in-mojolicious.html
In Javascript I only changed $.get() to $.post().
Updated and working Perl code now looks like this:
use Mojo::ByteStream 'b';

sub on_route_change {
    my $self = shift;
    my $routes = j( b( $self->param('routes_jsonstr') )->encode('UTF-8') );
}
Tested with many different utf8 strings.
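For what it's worth, the data already leaves the browser correctly: jQuery percent-encodes the parameter as UTF-8 bytes, which is exactly what arrives on the Perl side still needing a decode. A small sketch:
var routes_jsonstr = JSON.stringify([{ addr1: 'Škofja Loka', addr2: 'Kranj' }]);
console.log(encodeURIComponent(routes_jsonstr));
// "%5B%7B%22addr1%22%3A%22%C5%A0kofja%20Loka%22%2C%22addr2%22%3A%22Kranj%22%7D%5D"
// Š travels as the UTF-8 byte pair %C5%A0.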
Wide character warnings happen when you print. This is not due to how you decode your Unicode but to your STDOUT encoding. Try utf8::all, available from CPAN, which will set all your IO handles to UTF-8. Avoiding decoding probably isn't fixing the problem but making it worse; the only reason it appears to work is that your terminal is fixing things up for you.
You can take away at least some of the pain by escaping the problematic characters; see https://stackoverflow.com/a/4901205/17389.
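A sketch of that linked approach: \u-escape every non-ASCII code point before sending, so only ASCII hits the wire regardless of handle encodings (the asciiJson name is mine, not from the linked answer):
function asciiJson(value) {
    // \uXXXX-escape everything above 0x7F.
    return JSON.stringify(value).replace(/[\u0080-\uffff]/g, function (ch) {
        return '\\u' + ch.charCodeAt(0).toString(16).padStart(4, '0');
    });
}
console.log(asciiJson([{ addr1: 'Škofja Loka' }])); // [{"addr1":"\u0160kofja Loka"}]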
