Decode JSON string in Mojolicious that was encoded with JSON.stringify - javascript

I am trying to send a JavaScript variable as a JSON string to Mojolicious, and I am having problems decoding it on the Perl side. My page uses UTF-8 encoding.
The JSON string (the value of $self->param('routes_jsonstr')) seems to have the correct value, but Mojo::JSON can't decode it. The code works well when there are no UTF-8 characters. What am I doing wrong?
Javascript code:
var routes = [{
    addr1: 'Škofja Loka', // string with utf-8 character
    addr2: 'Kranj'
}];
var routes_jsonstr = JSON.stringify(routes);
$.get(url.on_route_change, {
    routes_jsonstr: routes_jsonstr
});
Perl code:
use Mojo::JSON 'j';
use Data::Dumper;

sub on_route_change {
    my $self = shift;
    my $routes = j( $self->param('routes_jsonstr') );
    warn $self->param('routes_jsonstr');
    warn Dumper $routes;
}
Server output:
Wide character in warn at /opt/mojo/routes/script/../lib/Routes/Homepage.pm line 76.
[{"addr1":"Škofja Loka","addr2":"Kranj"}] at /opt/mojo/routes/script/../lib/Routes/Homepage.pm line 76.
$VAR1 = undef;
The last line above shows that decoding the JSON string didn't work. When there are no UTF-8 characters to decode on the Perl side, everything works fine and $routes contains the expected data.

A Mojolicious-style solution can be found here:
http://showmetheco.de/articles/2010/10/how-to-avoid-unicode-pitfalls-in-mojolicious.html
In the JavaScript I only changed $.get() to $.post().
The updated and working Perl code now looks like this:
use Mojo::ByteStream 'b';
use Mojo::JSON 'j';

sub on_route_change {
    my $self = shift;
    my $routes = j( b( $self->param('routes_jsonstr') )->encode('UTF-8') );
}
Tested with many different UTF-8 strings.
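For completeness, a sketch of an alternative on the JavaScript side (assuming jQuery): sending the JSON as the request body with an explicit UTF-8 content type avoids the form-parameter round trip entirely, and Mojolicious can then read the payload via $self->req->json instead of param():
$.ajax({
    url: url.on_route_change,  // same endpoint as above
    type: 'POST',
    contentType: 'application/json; charset=UTF-8',
    data: JSON.stringify(routes)
});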

Wide character warnings happen when you print. This is not due to how you decode your Unicode but to your STDOUT encoding. Try utf8::all, available from CPAN, which will set all your IO handles to UTF-8. Avoiding decoding probably isn't fixing the problem but rather making it worse; the only reason it appears to work is that your terminal is fixing things up for you.

You can take away at least some of the pain by escaping the problematic characters; see https://stackoverflow.com/a/4901205/17389.
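In the spirit of the linked answer, a small hedged sketch (the helper name jsonStringifyAscii is hypothetical): escape every non-ASCII character after stringifying, so the wire format is pure ASCII and immune to charset mix-ups:
function jsonStringifyAscii(value) {
    // replace each non-ASCII BMP character with its \uXXXX escape
    return JSON.stringify(value).replace(/[\u007f-\uffff]/g, function(c) {
        return '\\u' + ('0000' + c.charCodeAt(0).toString(16)).slice(-4);
    });
}
jsonStringifyAscii(['Škofja Loka']); // '["\u0160kofja Loka"]' - ASCII only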

Related

Decoding Base64 String in Java

I'm using Java and I have a Base64-encoded string that I wish to decode and then perform some transformations on.
The correct decoded value is obtained in JavaScript through the function atob(), but in Java, using Base64.decodeBase64(), I cannot get an equal value.
Example:
For:
String str = "AAAAAAAAAAAAAAAAAAAAAMaR+ySCU0Yzq+AV9pNCCOI=";
With JavaScript atob(str) I get ->
"Æ‘û$‚SF3«àö“Bâ"
With Java new String(Base64.decodeBase64(str)) I get ->
"Æ?û$?SF3«à§ö?â"
Another way I could fix the issue is to run JavaScript in Java with a Nashorn engine, but I'm getting an error near the "$" symbol.
Current Code:
ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");
String script2 = "function decoMemo(memoStr){ print(atob(memoStr).split('')" +
        ".map((aChar) => `0${aChar.charCodeAt(0).toString(16)}`" +
        ".slice(-2)).join('').toUpperCase());}";
try {
    engine.eval(script2);
    Invocable inv = (Invocable) engine;
    String returnValue = (String) inv.invokeFunction("decoMemo", memoTest);
    System.out.print("\n result: " + returnValue);
} catch (ScriptException | NoSuchMethodException e1) {
    e1.printStackTrace();
}
Any help would be appreciated. I searched in a lot of places but can't find the correct answer.
btoa is broken and shouldn't be used.
The problem is, bytes aren't characters. Base64 encoding does only one thing. It converts bytes to a stream of characters that survive just about any text-based transport mechanism. And Base64 decoding does that one thing in reverse, it converts such characters into bytes.
And the confusion is, you're printing those bytes as if they are characters. They are not.
You end up with the exact same bytes, but javascript and java disagree on how you're supposed to turn that into an ersatz string because you're trying to print it to a console. That's a mistake - bytes aren't characters. Thus, some sort of charset encoding is being used, and you don't want any of this, because these characters clearly aren't intended to be printed like that.
Javascript sort of half-equates characters and bytes and will freely convert one to the other, picking some random encoding. Oof. Javascript sucks in this regard, it is what it is. The MDN docs on btoa explain why you shouldn't use it. You're running into that problem.
Not entirely sure how you fix it in javascript - but perhaps you don't need it. Java is decoding the bytes perfectly well, as is javascript, but javascript then turns those bytes into characters in some silly fashion, and that's causing the problem.
What you have there is not a text string at all. The giveaway is the AA's at the beginning. Those map to a number of zero bytes. That doesn't translate to meaningful text in any standard character set.
So what you have there is most likely binary data. Converting it to a string is not going to give you meaningful text.
Now to explain the difference you are seeing between Java and Javascript. It looks to me as if both Java and Javascript are making a "best effort" attempt to convert the binary data as if it was encoded in ISO-8859-1 (aka ISO LATIN-1).
The problem is some of the bytes codes are mapping to unassigned codes.
In the Java case those unassigned codes are being mapped to ?, either when the string is created or when it is being output.
In the Javascript case, either the unassigned codes are not included in the string, or they are being removed when you attempt to display them.
For the record, this is how an online base64 decoder rendered the above for me:
����������������Æû$SF3«àöBâ
The unassigned codes are 0x91 0x82 and 0x93. 0x15 and 0x0B are non-printing control codes.
But the bottom line is that you should not be converting this data into a string in either Java or in Javascript. It should be treated as binary; i.e. an array of byte values.
byte[] data = Base64.getDecoder().decode(str);
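For comparison, a hedged JavaScript sketch of the same idea - decode straight to bytes and never build a string from them; if a hex dump is what the Nashorn snippet was after, format the bytes explicitly (variable names here are illustrative):
var bytes = Uint8Array.from(atob(str), function(c) { return c.charCodeAt(0); }); // browser
// var bytes = Buffer.from(str, 'base64'); // Node.js equivalent
var hex = Array.prototype.map.call(bytes, function(b) {
    return ('0' + b.toString(16)).slice(-2); // two hex digits per byte
}).join('').toUpperCase();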

Javascript encodeURI returns unexpected value

I have a problem URL-encoding a text with javascript.
I am in Germany, where we have these "Umlaute" (ÄÖÜ), and these letters cause some problems.
An online encoder/decoder returned the following results for the word "Äpfel" (apples):
Äpfel >>> url-encode >>> %C3%84pfel
%C3%84pfel >>> url-decode >>> Äpfel
For testing, I created the following PHP file (poc.php) with no PHP content, just the JavaScript:
<script type="text/javascript">
    var t = "Äpfel";
    t = encodeURI(t);
    alert(t);
    t = decodeURI(t);
    alert(t);
</script>
The first alert returns "%EF%BF%BDpfel", which differs from the result of the online encoder.
The second alert returns "�pfel" (yes, the diamond with the "?").
It seems that javascript cannot decode the text it just encoded.
I guess the cause of this behaviour is somewhere in the PHP settings. When I just rename the file from "poc.php" to "poc.html" the encoding is correct and the alerts return the same results as the online encoder/decoder.
When I check the current encoding, javascript and php return "utf-8".
In my "real" project I have a ".js" file included in my php-file (with the same problem).
<script type="text/javascript" src="scripts/functions.js"></script>
Does anybody have an idea what causes this behaviour?
The weird byte sequence %EF%BF%BD you're receiving is the UTF-8 encoding of the Unicode replacement character, that is, literally the � symbol.
The JavaScript portion can url-decode the text it just url-encoded; it was simply asked to encode the symbol that stands for a missing symbol.
So: some part of your system is not using UTF-8 but some other character set instead, and there's an unnecessary conversion being done. My guess is that the file is encoded in Latin-1, aka ISO 8859-1, and PHP tries to read it as if it were UTF-8, converting the unrecognized byte 0xC4 ('Ä' in Latin-1) to the replacement character symbol.
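The diagnosis is easy to confirm in a console (a quick illustrative check):
encodeURI('\uFFFD'); // "%EF%BF%BD" - the replacement character, exactly what the question observed
encodeURI('Ä');      // "%C3%84"    - what a correctly read 'Ä' yields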

How to handle weirdly combined websocket messages?

I'm connecting to an external websocket api using the node ws library (node 10.8.0 on Ubuntu 16.04). I've got a listener which simply parses the json and passes it to the callback:
this.ws.on('message', (rawdata) => {
    let data = null;
    try {
        data = JSON.parse(rawdata);
    } catch (e) {
        console.log('Failed parsing the following string as json: ' + rawdata);
        return;
    }
    mycallback(data);
});
I now receive errors in which the rawdata looks as follows (I formatted it and removed irrelevant contents):
�~A
{
"id": 1,
etc..
}�~�
{
"id": 2,
etc..
I then wondered: what are these characters? Seeing the structure, I initially thought that the first weird sign must be an opening bracket of an array ([) and the second one a comma (,), so that together they create an array of objects.
I then investigated the problem further by writing the rawdata to a file whenever a JSON parsing error occurs. In an hour or so it saved about 1500 of these error files, meaning this happens a lot. I cat'ed a couple of these files in the terminal and uploaded an example below:
A few things are interesting here:
The files always start with one of these weird signs.
The files appear to exist out of multiple messages which should have been received separately. The weird signs separate those individual messages.
The files always end with an unfinished json object.
The files are of varying lengths. They are not always the same size and are thus not cut off on a specific length.
I'm not very experienced with websockets, but could it be that my websocket somehow receives a stream of messages that it concatenates together, with these weird signs as separators, and then randomly cuts off the last message? Maybe because I'm getting a constant, very fast stream of messages?
Or could it be caused by an error (or functionality) on the server side that combines those individual messages?
Does anybody know what's going on here? All tips are welcome!
[EDIT]
@bendataclear suggested interpreting it as UTF-8. So I did, and I pasted a screenshot of the results below. The first print is as-is, and the second one is interpreted as UTF-8. To me this doesn't look like anything. I could of course convert to UTF-8 and then split by those characters. Although the last message is always cut off, this would at least make some of the messages readable. Other ideas are still welcome though.
My assumption is that you're working only with English/ASCII characters and something probably messed up the stream. (NOTE: I am assuming there are no special characters.) If that is so, then I suggest you pass the entire JSON string into this function:
function cleanString(input) {
    var output = "";
    for (var i = 0; i < input.length; i++) {
        // keep only 7-bit ASCII characters
        if (input.charCodeAt(i) <= 127) {
            output += input.charAt(i);
        }
    }
    return output;
}

// example
console.log(cleanString("�~�"));
You can refer to How to remove invalid UTF-8 characters from a JavaScript string?
EDIT
From RFC 6455, the WebSocket protocol specification published by the Internet Engineering Task Force (IETF):
A common class of security problems arises when sending text data
using the wrong encoding. This protocol specifies that messages with
a Text data type (as opposed to Binary or other types) contain UTF-8-
encoded data. Although the length is still indicated and, for this
version of the protocol, applications implementing this protocol
should use the length to determine where the frame actually ends,
sending data in an improper encoding may still break assumptions
that applications built on top of this protocol may make, leading to
anything from misinterpretation of data to loss of data or potential
security bugs.
The spec further says that the "Payload data" is text data encoded as UTF-8, and notes that a particular text frame might include a partial UTF-8 sequence; however, the whole message MUST contain valid UTF-8. Invalid UTF-8 in reassembled messages is handled as described in the section Handling Errors in UTF-8-Encoded Data, which states that when an endpoint is to interpret a byte stream as UTF-8 but finds that the byte stream is not, in fact, a valid UTF-8 stream, that endpoint MUST Fail the WebSocket Connection. This rule applies both during the opening handshake and during subsequent data exchange.
I really believe that your error (or functionality) is coming from the server side, which combines your individual messages, so I suggest adding logic that ensures everything you process is valid UTF-8 before you parse it. You might also want to install npm install --save-optional utf-8-validate to efficiently check whether a message contains valid UTF-8, as required by the spec.
You might also want to add an if condition to help you do some checks. Note that the message.type check below comes from the 'websocket' (WebSocket-Node) package API rather than from ws, which hands the listener the raw data directly:
connection.on('message', (message) => {
    if (message.type === 'utf8') { // accept only text frames
        // handle message.utf8Data here
    }
});
I hope this helps.
The problem you have is that one side sends JSON in a different encoding than the other side uses to interpret it.
Try to solve this problem with the following code:
const { StringDecoder } = require('string_decoder');

this.ws.on('message', (rawdata) => {
    const decoder = new StringDecoder('utf8');
    const buffer = Buffer.from(rawdata);
    console.log(decoder.write(buffer));
});
Or with UTF-16 (note that Node's StringDecoder calls this encoding 'utf16le'):
const { StringDecoder } = require('string_decoder');

this.ws.on('message', (rawdata) => {
    const decoder = new StringDecoder('utf16le');
    const buffer = Buffer.from(rawdata);
    console.log(decoder.write(buffer));
});
Please read: String Decoder Documentation
It seems your output contains some spaces. If you have any spaces or find any special characters, please use Unicode escapes to fill them in.
Here is the list of Unicode characters.
This might help, I think.
Those characters are known as "REPLACEMENT CHARACTER" - used to replace an unknown, unrecognized or unrepresentable character.
From: https://en.wikipedia.org/wiki/Specials_(Unicode_block)
The replacement character � (often a black diamond with a white question mark or an empty square box) is a symbol found in the Unicode standard at code point U+FFFD in the Specials table. It is used to indicate problems when a system is unable to render a stream of data to a correct symbol. It is usually seen when the data is invalid and does not match any character
Checking section 8, Error Handling, of the (older draft) WebSocket protocol:
8.1. Handling Errors in UTF-8 from the Server
When a client is to interpret a byte stream as UTF-8 but finds that the byte stream is not in fact a valid UTF-8 stream, then any bytes or sequences of bytes that are not valid UTF-8 sequences MUST be interpreted as a U+FFFD REPLACEMENT CHARACTER.
8.2. Handling Errors in UTF-8 from the Client
When a server is to interpret a byte stream as UTF-8 but finds that the byte stream is not in fact a valid UTF-8 stream, behavior is undefined. A server could close the connection, convert invalid byte sequences to U+FFFD REPLACEMENT CHARACTERs, store the data verbatim, or perform application-specific processing. Subprotocols layered on the WebSocket protocol might define specific behavior for servers.
How to deal with this depends on the implementation or library in use; for example, from the post Implementing Web Socket servers with Node.js:
socket.ondata = function(d, start, end) {
    //var data = d.toString('utf8', start, end);
    var original_data = d.toString('utf8', start, end);
    var data = original_data.split('\ufffd')[0].slice(1);
    if (data == "kill") {
        socket.end();
    } else {
        sys.puts(data);
        socket.write("\u0000", "binary");
        socket.write(data, "utf8");
        socket.write("\uffff", "binary");
    }
};
In this case, if a � is found it will do:
var data = original_data.split('\ufffd')[0].slice(1);
if (data == "kill") {
socket.end();
}
Another thing that you could do is update Node to the latest stable version; from the post OpenSSL and Breaking UTF-8 Change (fixed in Node v0.8.27 and v0.10.29):
As of these releases, if you try and pass a string with an unmatched surrogate pair, Node will replace that character with the unknown unicode character (U+FFFD). To preserve the old behavior set the environment variable NODE_INVALID_UTF8 to anything (even nothing). If the environment variable is present at all it will revert to the old behavior.

Unicode encode string

I'm json_encoding some strings. Sometimes they contain binary data. This causes the encoding to fail with the error code JSON_ERROR_UTF8. Running the strings through utf8_encode gets around this error. However, ✓ (a Unicode checkmark) gets encoded as \u00e2\u009c\u0093, which, when interpreted by JavaScript and rendered in your browser, actually looks like â.
How can I fix this? Is there another encoding I can use?
echo json_encode(utf8_encode('✓')); // "\u00e2\u009c\u0093"
Now press F12 and paste that into your JavaScript console (quotes included). It should output â.
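To spell out that console check (illustrative):
JSON.parse('"\\u00e2\\u009c\\u0093"'); // "â" plus two invisible control characters (U+009C, U+0093)
JSON.parse('"\\u2713"');               // "✓" - the intended checkmark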
Please note that
echo json_encode('✓'); // "\u2713"
Works as intended. The issue is that sometimes the string will contain binary data which json_encode can't handle, so I need to sanitize every string without breaking the strings it can handle.
More examples:
json_encode(chr(200)); // false (bad)
json_encode(utf8_encode(chr(200))); // "\u00c8" (good)
json_encode('✓'); // "\u2713" (good)
json_encode(utf8_encode('✓')); // "\u00e2\u009c\u0093" (bad)
So you see, encoding it works well for some strings and breaks others.
This is strictly for logging. I don't care if the binary data comes out weird, I just don't want it to mess with valid strings.
Running strings through this function
function _utf8($str) {
    if (!mb_check_encoding($str, 'UTF-8')) {
        return utf8_encode($str);
    }
    return $str;
}
(taken and modified from here)
Seems to give the results I'm after.
Checkmarks are left alone, but chr(200) and other weirdness is encoded:
json_encode(_utf8(chr(200))); // "\u00c8"
EDIT: This question is unanswerable. Encoding arbitrary binary data is one thing, keeping UTF-8 characters intact is something completely separate. What's to stop the byte sequence 0xe29c93 from being interpreted as ✓ when it shows up in your binary data?
According to the json_encode PHP reference page, you can use the following syntax to encode Unicode characters:
json_encode($data, JSON_UNESCAPED_UNICODE);
It should make it pass unicode characters through unescaped.
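On the JavaScript side both spellings decode identically, which is why the flag only affects the readability of the encoded output (a quick illustrative check):
JSON.parse('"\\u2713"') === JSON.parse('"✓"'); // true - JSON escaping of non-ASCII is optional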

Convert cryptic string to a readable one with JavaScript (UTF-8)

I found out that when I save this distorted string ("Ã„uÃŸerungen Ã¼ben") as an ANSI text file, then open it with Firefox and choose "Unicode" in the Firefox menu, it turns into a readable German format ("Äußerungen üben").
The same thing is possible with my text editor (Notepad++).
Is there any way to achieve this with JavaScript? E.g. the following would be nice:
var output = makeReadable("Ã„uÃŸerungen Ã¼ben");
Unfortunately, I get these kinds of distorted strings from an external source which doesn't care about UTF-8 and provides all data as ANSI.
PS: Saving the file as UTF-8 and setting the charset to UTF-8 in the META tag has no effect.
Edit:
Now I have solved it by making a list of all common UTF-8/ANSI distortions (more than 1300) and writing a function that replaces all wrong character combinations with the right character. It works fine :-) .
I think the encoding of the "distorted string" in your question got munged further by posting it here. But a quick Google search for "javascript convert from utf-8" returns this blog post as the top hit:
http://ecmanaut.blogspot.com/2006/07/encoding-decoding-utf8-in-javascript.html
So it turns out that encoding and decoding UTF-8 in JavaScript is really easy. This works great for me:
var original = "Äußerungen üben";
var utf8 = unescape(encodeURIComponent(original));
//return utf8; // something like "Ã„uÃŸerungen Ã¼ben"
var output = decodeURIComponent(escape(utf8));
return output;
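For what it's worth, here is a modern sketch of the wished-for makeReadable() using TextDecoder. It assumes the distorted string is UTF-8 bytes that were mis-read as ISO-8859-1 (one byte per character, byte value equal to code point), which is not guaranteed for every "ANSI" source:
function makeReadable(distorted) {
    // recover the original byte values from their one-character-per-byte form
    // (only valid for an ISO-8859-1 misreading; Windows-1252 would need a mapping table)
    var bytes = Uint8Array.from(distorted, function(c) { return c.charCodeAt(0); });
    // decode those bytes as the UTF-8 they originally were
    return new TextDecoder('utf-8').decode(bytes);
}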
