JSON unicode characters conversion

I came across this strange JSON which I can't seem to decode.
To simplify things, let's say it's a JSON string:
"\uffffffe2\uffffff94\uffffff94\uffffffe2\uffffff94\uffffff80\uffffffe2\uffffff94\uffffff80 mystring"
After decoding, it should look as follows:
└── mystring
Neither JS nor PHP seems to convert it correctly.
js> JSON.parse('"\uffffffe2\uffffff94\uffffff94\uffffffe2\uffffff94\uffffff80\uffffffe2\uffffff94\uffffff80 mystring"')
ffe2ff94ff94ffe2ff94ff80ffe2ff94ff80 mystring
PHP behaves the same
php> json_decode('"\uffffffe2\uffffff94\uffffff94\uffffffe2\uffffff94\uffffff80\uffffffe2\uffffff94\uffffff80 mystring"')
ffe2ff94ff94ffe2ff94ff80ffe2ff94ff80 mystring
Any ideas how to properly parse this JSON string would be welcome.

It is not a valid JSON string: JSON allows exactly 4 hex digits after \u. The results from both PHP and JS are correct.
It is not possible to decode this using standard functions.
Where did you get this JSON string?
As for the correct JSON for the string you want to get: it should be "\u2514\u2500\u2500 mystring", or just "└── mystring" (JSON allows any Unicode character unescaped in strings except ", \ and control characters below U+0020).
Also, if you need to encode a character outside the Basic Multilingual Plane (above U+FFFF), it is escaped as a surrogate pair, i.e. two escape codes; for example "𩄎" would be "\ud864\udd0e" when escaped.
So, if you really need to decode the string above, you can fix it before decoding by replacing \uffffffe2 with \uffff\uffe2 via a regexp (for JS it would be something like: s.replace(/(\\u[A-Fa-f0-9]{4})([A-Fa-f0-9]{4})/gi,'$1\\u$2') ). That makes the text parseable, but it does not produce the characters you expect.
In any case, the character codes in the string above do not look right.
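One possible reading, offered as an assumption rather than something the question confirms: the low bytes of the escapes (e2 94 94, e2 94 80, e2 94 80) are exactly the UTF-8 encodings of └ and ─, which would mean the producer sign-extended each UTF-8 byte to 32 bits before escaping it. Under that assumption, a repair step could look roughly like the sketch below. The helper name repairSignExtendedEscapes is made up for illustration, and TextDecoder is assumed to be available (browsers and recent Node versions).

```javascript
// Hypothetical helper. Assumes every \uffffffXX escape stands for one
// UTF-8 byte XX that the producer sign-extended to 32 bits.
function repairSignExtendedEscapes(jsonText) {
  const decoder = new TextDecoder("utf-8");
  // Decode each run of consecutive \uffffffXX escapes as a whole,
  // because a single character may span several bytes.
  return jsonText.replace(/(?:\\uffffff[0-9a-f]{2})+/gi, (run) => {
    const bytes = run
      .split(/\\uffffff/i)
      .filter(Boolean)
      .map((hex) => parseInt(hex, 16));
    return decoder.decode(Uint8Array.from(bytes));
  });
}

// The raw JSON text from the question (backslashes doubled for the JS literal).
const raw =
  '"\\uffffffe2\\uffffff94\\uffffff94\\uffffffe2\\uffffff94\\uffffff80' +
  '\\uffffffe2\\uffffff94\\uffffff80 mystring"';

const fixed = JSON.parse(repairSignExtendedEscapes(raw));
console.log(fixed); // └── mystring
```

Note that this would also rewrite a legitimate \uffff escape that happens to be followed by the text "ffXX", so it is only suitable when the input is known to come from the broken producer.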

Related

Keeping escaped unicode characters with JSON.stringify of JSON.parse

I have an input JSON like this (which really contains the literal values "\u2013" (the encoded form of a unicode character)):
{"source":"Subject: NEED: 11/5 BNA-MSL \u2013 1200L Departure - 1 Pax"}
I read it with JSON.parse and it reads the \u2013 as –, which is fine for display in my app.
However, I need to export again the same JSON, to send it down to some other app. I want to keep the same format and have back the \u2013 into the JSON. I am doing JSON.stringify, but it keeps the – in the output.
Any idea what I could do to keep the \u syntax?

Using a replacer function in a JSON.stringify call didn't work - strings returned from the replacer with an escaped backslash produce a double backslash in output, and a single backslashed character is unescaped in output if possible.
Simply re-escaping the stringify result has potential:
const obj = {"source":"Subject: NEED: 11/5 BNA-MSL \u2013 1200L Departure - 1 Pax"}
console.log(" stringify: ", JSON.stringify( obj));
console.log("& replaceAll: ", JSON.stringify(obj).replaceAll('\u2013', '\\u2013'));
using more complex string modifications as necessary.
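As a rough sketch of what such a more general modification might look like, the stringify result can be post-processed so that every non-ASCII UTF-16 code unit is written as a \uXXXX escape. The function name stringifyAscii is hypothetical. The output is still valid JSON and parses back to the same value.

```javascript
// Hypothetical helper: JSON.stringify, then escape every non-ASCII
// UTF-16 code unit in the resulting text as \uXXXX.
function stringifyAscii(value) {
  return JSON.stringify(value).replace(/[\u0080-\uffff]/g, (ch) =>
    "\\u" + ch.charCodeAt(0).toString(16).padStart(4, "0")
  );
}

const obj = { source: "Subject: NEED: 11/5 BNA-MSL \u2013 1200L Departure - 1 Pax" };
const text = stringifyAscii(obj);
console.log(text);
// {"source":"Subject: NEED: 11/5 BNA-MSL \u2013 1200L Departure - 1 Pax"}
```

Because the replacement works per code unit, a character above U+FFFF comes out as two escapes (a surrogate pair), which is what JSON expects.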
However, this looks very much like an XY problem. It might be better to fix the downstream parsing so that it handles JSON text as JSON text rather than using it in raw form, particularly given that JSON text is encoded in UTF-8 and can carry non-ASCII characters without special treatment.

UINT8 Array to String without escape characters

I'm parsing a Uint8 array that is an HTML document. It contains a script tag which in turn contains JSON data that I would like to parse.
I first converted the array to text:
data = Buffer.from(str).toString('utf8')
I then searched for the script tag, and extracted the string containing the JSON:
... {\"phrase\":\"Go to \"California\"\",\"color\":\"red\",\"html\":\"<div class=\"myclass\">Ok</div>\"} ...
I then did a replace to clean it up.
data = data.replace(/\\"/g, "\"").replace(/\\/g, "")
{"phrase":"Go to "California"","color":"red","html":"<div class="myclass">Ok</div>"}
I tried to parse it using JSON.parse() and got an error because the attributes contain quotes. Is there a way to process this further using a regex? Or perhaps a library? I am working with Cheerio, so I can use that if helpful.

The escape characters are necessary if you want to parse the JSON. The embedded quotes would need to be double escaped, so the extracted text isn't even valid JSON.
"{\"phrase\":\"Go to \\\"California\\\"\",\"color\":\"red\",\"html\":\"<div class=\\\"myclass\\\">Ok</div>\"}"
or, using single quotes:
'{"phrase":"Go to \\"California\\"","color":"red","html":"<div class=\\"myclass\\">Ok</div>"}'
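To illustrate why the backslashes matter, here is a small sketch that builds the doubly encoded form with JSON.stringify and then undoes it with two JSON.parse calls. The object literal is reconstructed from the question's data for illustration only.

```javascript
// Data reconstructed from the question, for illustration only.
const data = {
  phrase: 'Go to "California"',
  color: "red",
  html: '<div class="myclass">Ok</div>',
};

// First level: JSON text. Embedded quotes are escaped as \"
const inner = JSON.stringify(data);

// Second level: a JSON string literal that contains the JSON text.
// Each \" from the first level becomes \\\" here.
const outer = JSON.stringify(inner);

// Undo both levels in reverse order instead of stripping backslashes.
const roundTripped = JSON.parse(JSON.parse(outer));
console.log(roundTripped.phrase); // Go to "California"
```

Stripping the backslashes by hand removes exactly the information the parser needs to tell a quote inside a value from the quote that ends it.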

Thanks.
After some more tinkering around, I realized that I should have encoded the data to Uint8 at the source (a Lambda function) before transmitting it for further processing. So now, I have:
1. Text
2. Encode text to Uint8
3. Return from Lambda function
4. Decode from Uint8 to text
5. Process readily, as there are no escape characters
Before, I was skipping step 2, so Lambda was encoding the text however it does by default.
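The steps above could be sketched as follows, assuming TextEncoder and TextDecoder are available on both sides. The payload is hypothetical, and the actual transport from the Lambda function is left out.

```javascript
// Hypothetical payload; the real data comes from the Lambda function.
const payload = { phrase: 'Go to "California"', color: "red" };

// Producer side: text, then explicit UTF-8 bytes.
const jsonText = JSON.stringify(payload);
const bytes = new TextEncoder().encode(jsonText); // Uint8Array

// ... bytes are returned by the producer and received by the consumer ...

// Consumer side: bytes back to text, then parse.
const decodedText = new TextDecoder("utf-8").decode(bytes);
const parsed = JSON.parse(decodedText);
console.log(parsed.phrase); // Go to "California"
```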

Decode utf8 character on javascript

I have a badly configured third party service that outputs strings like this:
"SK Uni=C4=8Dov vs Prostejov"
I want to replace on the fly all the wrong characters it sends me, so my modules work with the correctly decoded string
I have found on this website (https://www.compart.com/en/unicode/U+010D) that the =C4=8D substring corresponds to the utf-8 character č
https://www.compart.com/en/unicode/U+010D
č
...
UTF-8 Encoding: 0xC4 0x8D
UTF-16 Encoding: 0x010D
UTF-32 Encoding: 0x0000010D
...
but I cannot find the way to decode it automatically.
I've tried with:
>> String.fromCodePoint(0xc48d)
"쒍"
>> String.fromCodePoint("0xc4 0x8d")
RangeError
>> String.fromCharCode(0xc48d)
"쒍"
etc...
If I do it with the utf-16 code, String.fromCodePoint(0x010D) outputs the correct character.
How can I make it work with utf-8 instead of utf-16 codes?
Should I convert my string to UTF-16 to achieve what I want? If so, how can I convert it?

Since the encoding is almost identical to percent escapes used in URLs, you can simply use:
decodeURIComponent("SK Uni=C4=8Dov vs Prostejov".replace(/=/g, "%"))
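One caveat with the one-liner: decodeURIComponent throws a URIError if the text already contains a literal % that is not part of a valid escape. A slightly more defensive sketch, assuming the service emits quoted-printable-style =XX escapes of UTF-8 bytes, decodes only those escapes and leaves everything else untouched. The function name decodeEqualsEscapes is made up for illustration.

```javascript
// Hypothetical helper. Assumes =XX sequences are hex-encoded UTF-8 bytes.
function decodeEqualsEscapes(input) {
  const decoder = new TextDecoder("utf-8");
  // Decode each run of consecutive =XX escapes together, since one
  // character may be spread over several bytes.
  return input.replace(/(?:=[0-9A-Fa-f]{2})+/g, (run) => {
    const bytes = run
      .split("=")
      .filter(Boolean)
      .map((hex) => parseInt(hex, 16));
    return decoder.decode(Uint8Array.from(bytes));
  });
}

console.log(decodeEqualsEscapes("SK Uni=C4=8Dov vs Prostejov")); // SK Uničov vs Prostejov
console.log(decodeEqualsEscapes("100% =C4=8D")); // 100% č
```

Byte sequences that are not valid UTF-8 come out as U+FFFD replacement characters instead of raising an error.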

Encoding in C# and Decoding in Javascript

I have encoded some text in C# like below:
var encodedCredential = Convert.ToBase64String(Encoding.Unicode.GetBytes(JsonConvert.SerializeObject("Sample text")));
The encoded string is: IgBTAGEAbQBwAGwAZQAgAHQAZQB4AHQAIgA=
I want to decode the encoded string in JavaScript.
I have tried the below:
decodeURIComponent(atob("IgBTAGEAbQBwAGwAZQAgAHQAZQB4AHQAIgA="))
decodeURIComponent(atob("IgBTAGEAbQBwAGwAZQAgAHQAZQB4AHQAIgA=").replace(' ',''))
The result is something different: there appears to be a space after each letter, and I can't even replace the spaces.

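Encoding.Unicode in C# is UTF-16LE, so every ASCII character is followed by a zero byte; those zero bytes are the "spaces" you see after atob. You need to use UTF-8 encoding in C# instead. Export the base64 with this command: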
Convert.ToBase64String(Encoding.UTF8.GetBytes("Sample text"))
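If the C# side cannot be changed, the original UTF-16LE output can also be decoded on the JavaScript side. A minimal sketch, assuming atob and TextDecoder are available (browsers and recent Node versions):

```javascript
// The base64 string from the question (UTF-16LE bytes of a JSON string).
const b64 = "IgBTAGEAbQBwAGwAZQAgAHQAZQB4AHQAIgA=";

// atob yields a "binary string": one character per byte.
const binary = atob(b64);
const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));

// Interpret the bytes as UTF-16LE, which is what Encoding.Unicode produces.
const json = new TextDecoder("utf-16le").decode(bytes);

// The C# code serialized the text as JSON first, so parse it to drop the quotes.
const value = JSON.parse(json);
console.log(value); // Sample text
```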

@King_Fisher, you shouldn't be getting additional spaces. Also, the replace method with a string pattern replaces only the first occurrence.
Here's what I did with your code (see attached screenshot)

Converting JSON strings with escaped Unicode characters to JavaScript objects

I have a JSON string which contains an escaped Unicode character. The JSON includes this snippet:
I co-ordinate our Chat Literacy network \u2013 an online group for practitioners of Information Literacy
The \u2013 is a long dash.
I'm using
var theObject = eval ("(" + jsonString + ")");
to convert the JSON string to a JavaScript object. I need to use a version of SpiderMonkey that doesn't have a direct JSON to Object method in it.
After conversion, the character in question becomes the Unicode control character \0013 which is an invalid UTF-8 character.
Is there another way I can convert the JSON to an object which will preserve the correct long-dash character? Maybe some other JSON to Object method I can load?
This happens with some other characters also, like curly quotes.
Thanks,
Doug

eval() is evil. Stay away from it.
Try using JSON 3: http://bestiejs.github.io/json3/
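For comparison, here is a small sketch of the same conversion through JSON.parse, assuming an implementation is available (native, or supplied by a polyfill such as the one suggested above). The "text" key is hypothetical; only the sentence fragment comes from the question.

```javascript
// JSON text with an escaped en dash; the "text" key is made up for illustration.
const jsonString =
  '{"text":"I co-ordinate our Chat Literacy network \\u2013 an online group"}';

const theObject = JSON.parse(jsonString);

// Pick out the character right after "network " and inspect its code.
const marker = "network ";
const dash = theObject.text.charAt(theObject.text.indexOf(marker) + marker.length);
console.log(dash.charCodeAt(0).toString(16)); // 2013
```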
