Decode a UTF-8 character in JavaScript - javascript

I have a badly configured third party service that outputs strings like this:
"SK Uni=C4=8Dov vs Prostejov"
I want to replace all the wrongly encoded characters it sends me on the fly, so that my modules work with the correctly decoded string.
I have found on this website (https://www.compart.com/en/unicode/U+010D) that the =C4=8D substring corresponds to the UTF-8 encoding of the character č:
https://www.compart.com/en/unicode/U+010D
č
...
UTF-8 Encoding: 0xC4 0x8D
UTF-16 Encoding: 0x010D
UTF-32 Encoding: 0x0000010D
...
but I cannot find a way to decode it automatically.
I've tried with:
>> String.fromCodePoint(0xc48d)
"쒍"
>> String.fromCodePoint("0xc4 0x8d")
RangeError
>> String.fromCharCode(0xc48d)
"쒍"
etc.
If I use the UTF-16 code instead, String.fromCodePoint(0x010D) outputs the correct character.
How can I make it work with the UTF-8 bytes instead of the UTF-16 code?
Should I convert my string to UTF-16 to achieve what I want? If so, how can I convert it?

Since the encoding is almost identical to the percent escapes used in URLs, you can simply replace every = with % and let decodeURIComponent do the decoding:
decodeURIComponent("SK Uni=C4=8Dov vs Prostejov".replace(/=/g, "%"))
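One caveat with the replace-then-decode approach: decodeURIComponent throws a URIError when the text contains a literal % or an = that is not followed by a valid escape. A minimal sketch that only touches =XX sequences and decodes them as UTF-8 bytes follows. The helper name decodeEqualsEscapes is hypothetical, and the sketch assumes a runtime with TextDecoder (modern browsers, recent Node.js) and that the service always emits uppercase hex digits, as in the sample.

```javascript
// Hypothetical helper: decodes "=XX" hex escapes (quoted-printable style) as UTF-8 bytes.
function decodeEqualsEscapes(input) {
  // Match whole runs of escapes so that multi-byte characters are decoded together.
  return input.replace(/(?:=[0-9A-F]{2})+/g, (run) => {
    const bytes = run.match(/[0-9A-F]{2}/g).map((hex) => parseInt(hex, 16));
    return new TextDecoder("utf-8").decode(new Uint8Array(bytes));
  });
}

console.log(decodeEqualsEscapes("SK Uni=C4=8Dov vs Prostejov")); // SK Uničov vs Prostejov
```

Text without any =XX sequence passes through unchanged, so a stray % or = no longer breaks the call.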

Related

Decoding Base64 String in Java

I'm using Java and I have a Base64-encoded string that I wish to decode and then transform with some further operations.
The correct decoded value is obtained in JavaScript through the atob() function, but in Java, using Base64.decodeBase64(), I cannot get the same value.
Example:
For:
String str = "AAAAAAAAAAAAAAAAAAAAAMaR+ySCU0Yzq+AV9pNCCOI="
With JavaScript atob(str) I get ->
"Æ‘û$‚SF3«àö“Bâ"
With Java new String(Base64.decodeBase64(str)) I get ->
"Æ?û$?SF3«à§ö?â"
Another way I could fix the issue is to run the JavaScript in Java with a Nashorn engine, but I'm getting an error near the "$" symbol.
Current Code:
ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");
String script2 = "function decoMemo(memoStr){ print(atob(memoStr).split('')" +
".map((aChar) => `0${aChar.charCodeAt(0).toString(16)}`" +
".slice(-2)).join('').toUpperCase());}";
try {
    engine.eval(script2);
    Invocable inv = (Invocable) engine;
    String returnValue = (String) inv.invokeFunction("decoMemo", memoTest);
    System.out.print("\n result: " + returnValue);
} catch (ScriptException | NoSuchMethodException e1) {
    e1.printStackTrace();
}
Any help would be appreciated. I have searched in a lot of places but can't find the correct answer.
btoa is broken and shouldn't be used.
The problem is, bytes aren't characters. Base64 encoding does only one thing. It converts bytes to a stream of characters that survive just about any text-based transport mechanism. And Base64 decoding does that one thing in reverse, it converts such characters into bytes.
And the confusion is, you're printing those bytes as if they are characters. They are not.
You end up with the exact same bytes, but JavaScript and Java disagree on how you're supposed to turn them into an ersatz string, because you're trying to print it to a console. That's a mistake: bytes aren't characters. Some sort of charset encoding is being applied, and you don't want any of this, because these bytes clearly aren't intended to be printed like that.
JavaScript sort of half-equates characters and bytes and will freely convert one to the other, picking some random encoding. Oof. JavaScript sucks in this regard; it is what it is. The MDN docs on btoa explain why you shouldn't use it. You're running into that problem.
I'm not entirely sure how you fix it in JavaScript, but perhaps you don't need to. Java is decoding the bytes perfectly well, as is JavaScript, but JavaScript then turns those bytes into characters in some silly fashion, and that's causing the problem.
What you have there is not a text string at all. The giveaway is the AA's at the beginning. Those map to a number of zero bytes. That doesn't translate to meaningful text in any standard character set.
So what you have there is most likely binary data. Converting it to a string is not going to give you meaningful text.
Now to explain the difference you are seeing between Java and JavaScript. It looks to me as if both Java and JavaScript are making a "best effort" attempt to convert the binary data as if it was encoded in ISO-8859-1 (aka ISO LATIN-1).
The problem is that some of the byte codes map to unassigned codes.
In the Java case, those unassigned codes are being mapped to ?, either when the string is created or when it is being output.
In the JavaScript case, either the unassigned codes are not included in the string, or they are being removed when you attempt to display them.
For the record, this is how an online base64 decoder decoded the above for me:
����������������Æû$SF3«àöBâ
The unassigned codes are 0x91, 0x82 and 0x93. 0x15 and 0x08 are non-printing control codes.
But the bottom line is that you should not be converting this data into a string in either Java or JavaScript. It should be treated as binary, i.e. an array of byte values:
byte[] data = Base64.getDecoder().decode(str);
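The same idea can be sketched on the JavaScript side: keep the decoded data as bytes and inspect it as hex rather than printing it as text. This assumes a runtime where atob is available as a global (browsers, recent Node.js); the variable names are only illustrative.

```javascript
const str = "AAAAAAAAAAAAAAAAAAAAAMaR+ySCU0Yzq+AV9pNCCOI=";

// atob returns a "binary string": one char code in the range 0-255 per decoded byte.
const bytes = Uint8Array.from(atob(str), (ch) => ch.charCodeAt(0));

// Render the bytes as two-digit hex values instead of treating them as characters.
const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("").toUpperCase();

console.log(bytes.length); // 32
console.log(hex);
```

Comparing the hex dump with the bytes produced by Base64.getDecoder().decode(str) in Java is a more reliable check than comparing two console renderings of the "string".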

Javascript: convert CSV string into a) UTF-8 and b) a 2D array

Two questions in one; I'm not sure if that's allowed, but they're directly related to the same code.
I retrieve a CSV string as an HTTP response in JavaScript. The string seems to come in UTF-16 encoding, as it has for example 'â¬' instead of '€'.
a) How can I convert this to UTF-8 in vanilla Javascript?
Once that's done, how do I
b) transform the multi-line CSV into a 2D array in vanilla Javascript?
Thanks!
[UPDATE]
Based on anqooqie's pointers, I looked into re-encoding the string.
OK, clear. To be honest, I went a slightly different way (the reencode function didn't work for me; it threw a generic error code) and now do the following:
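var O = new ActiveXObject('ADODB.Stream');
O.Type = 2;                // adTypeText
O.Open();
O.Charset = 'ISO-8859-1';  // write the mis-decoded string back out as its original bytes
O.LineSeparator = 10;      // adLF
O.WriteText(csvStr);
O.Position = 0;
O.Charset = 'UTF-8';       // read those same bytes back as UTF-8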
And this works fine and in pretty much a split second (even though it's a 35K row CSV). Now if I want to put it back into the csvStr, I would do
csvStr = O.ReadText
but this takes ages - is that expected or am I doing something wrong?
For putting it into a 2D array, I split on the LineSeparator and then loop using a regex, which seems to work.
var A = new Array
A.push(csvStr[0].match(/"[^"]*"|[^,]+/g))
The vast delay on the readText is bothering me though, especially as the WriteText is so quick. Any help is appreciated.
It looks like you are confused about character encoding terms, so let's go over them.
A string is just a string.
There is no "UTF-16 string" or "UTF-8 string".
A character encoding is a protocol that converts between a string and a byte array.
UTF-16 is one such character encoding.
UTF-8 and ISO-8859-1 are character encodings too.
In UTF-16, the string '€' is encoded to the byte array 20 AC.
In UTF-8, the string '€' is encoded to the byte array E2 82 AC.
In ISO-8859-1, the byte array E2 82 AC is decoded to the string 'â¬'.
Now you can see that 'â¬' is not a "UTF-16 string".
It is '€' encoded as UTF-8 and mistakenly decoded as ISO-8859-1.
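The byte values above can be checked with a short sketch, assuming a runtime that provides TextEncoder and TextDecoder. ISO-8859-1 decoding is written out by hand here as a byte-to-char-code mapping, because the "iso-8859-1" label of TextDecoder is defined by the WHATWG Encoding Standard to mean windows-1252, which would produce a slightly different mojibake.

```javascript
// Encode the euro sign as UTF-8 and show the resulting bytes as hex.
const utf8Bytes = new TextEncoder().encode("€");
const hex = Array.from(utf8Bytes, (b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");
console.log(hex); // E2 82 AC

// ISO-8859-1 maps every byte to the code point with the same number,
// so "decoding" is just turning each byte into a char code.
const misdecoded = String.fromCharCode(...utf8Bytes);
console.log(misdecoded.length); // 3 characters instead of 1

// Decoding the same bytes with the right encoding gives the original string back.
const decoded = new TextDecoder("utf-8").decode(utf8Bytes);
console.log(decoded === "€"); // true
```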
a) How can I convert this to UTF-8 in vanilla Javascript?
What you should do is fix the code that retrieves the CSV file.
I cannot tell you how to fix it since I do not know your code, but I believe that it currently decodes the CSV file as ISO-8859-1.
You should change that character encoding from ISO-8859-1 to UTF-8.
If the code is not yours and you cannot fix it, you can use a workaround.
In other words, you can 1) re-encode the mistakenly decoded string as ISO-8859-1, and 2) re-decode the resulting bytes as UTF-8.
1)
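// Returns the ISO-8859-1 bytes of inputString as an array of numbers.
// A plain loop is used so that very long strings do not hit argument-count limits.
function reencode(inputString) {
    var bytes = new Array(inputString.length);
    for (var i = 0; i < inputString.length; i++) { bytes[i] = inputString.charCodeAt(i); }
    return bytes;
}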
2)
See this answer.
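For reference, both workaround steps can be combined in a few lines when TextDecoder is available. This is only a sketch under that assumption (it does not apply to the ActiveX/JScript environment from the question), and fixMojibake is a hypothetical name.

```javascript
// Hypothetical helper: undoes "UTF-8 bytes decoded as ISO-8859-1".
function fixMojibake(misdecoded) {
  // Step 1: re-encode as ISO-8859-1, i.e. one byte per character.
  const bytes = new Uint8Array(misdecoded.length);
  for (let i = 0; i < misdecoded.length; i++) {
    const code = misdecoded.charCodeAt(i);
    if (code > 0xff) {
      throw new RangeError("Not an ISO-8859-1 character at index " + i);
    }
    bytes[i] = code;
  }
  // Step 2: decode those bytes as UTF-8.
  return new TextDecoder("utf-8").decode(bytes);
}

console.log(fixMojibake("\u00e2\u0082\u00ac")); // €
```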
b) How do I transform the multi-line CSV into a 2D array in vanilla Javascript?
See this answer.
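As a rough illustration of part b), a small hand-written parser can turn CSV text into a 2D array. This sketch assumes comma separators, double-quoted fields with "" as the escaped quote, and LF or CRLF line endings; parseCsv is a hypothetical name, and a tested CSV library is usually the safer choice for real data.

```javascript
// Hypothetical helper: parses CSV text into an array of rows, each an array of fields.
function parseCsv(text) {
  const rows = [];
  let row = [];
  let field = "";
  let inQuotes = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"' && text[i + 1] === '"') {
        field += '"'; // doubled quote inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false; // closing quote
      } else {
        field += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      row.push(field);
      field = "";
    } else if (ch === "\n" || ch === "\r") {
      if (ch === "\r" && text[i + 1] === "\n") i++; // treat CRLF as one line break
      row.push(field);
      rows.push(row);
      row = [];
      field = "";
    } else {
      field += ch;
    }
  }
  // Flush the last row when the text does not end with a line break.
  if (field !== "" || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}

console.log(parseCsv('a,b\n"x,1","y ""z"""\n'));
```

Unlike a per-line regex, this keeps commas and line breaks that appear inside quoted fields.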

Encode and decode a string in JavaScript

Let's say I have a string, called str, which is equal to "Hello, world!"
Is there a way to choose some Unicode characters like "azertyuiopqsdfghjklmwxcvbn1234567890-" and return an encoded string that contains only the chosen characters?
It should return something like "hvfebi iehfhe" (that is, something encoded and not human-readable, but which can be decoded).
Thanks
There's no built-in function that lets you configure an arbitrary set of characters to encode and decode with; you would have to find a library that does that or implement it yourself.
But you can use base64 encoding, which uses the "A-Z", "a-z", "0-9", "+", "/" and "=" characters to encode the string.
The native browser functions are on window: btoa() to encode and atob() to decode.
Edit after your comments:
Any function you make up to solve this can be reversed anyway by analysing the code, so there is no point in trying to hide the data this way. If you don't want it to be plain base64, you can make a function that encodes it several times.
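If you do want to implement it yourself, one possible approach is to treat the UTF-8 bytes of the string as one big number and write that number in base N, where N is the size of the chosen alphabet. The sketch below is an illustration only: the function names are hypothetical, it assumes BigInt, TextEncoder and TextDecoder are available, and it offers no secrecy at all.

```javascript
const ALPHABET = "azertyuiopqsdfghjklmwxcvbn1234567890-";

// Hypothetical helper: encodes text using only characters from `alphabet`.
function encodeWithAlphabet(text, alphabet = ALPHABET) {
  const base = BigInt(alphabet.length);
  // Start at 1 (a marker byte) so that leading zero bytes survive the round trip.
  let value = 1n;
  for (const byte of new TextEncoder().encode(text)) {
    value = value * 256n + BigInt(byte);
  }
  let out = "";
  while (value > 0n) {
    out = alphabet[Number(value % base)] + out;
    value /= base;
  }
  return out;
}

// Hypothetical helper: reverses encodeWithAlphabet.
function decodeWithAlphabet(encoded, alphabet = ALPHABET) {
  const base = BigInt(alphabet.length);
  let value = 0n;
  for (const ch of encoded) {
    const digit = alphabet.indexOf(ch);
    if (digit < 0) throw new Error("Character not in alphabet: " + ch);
    value = value * base + BigInt(digit);
  }
  const bytes = [];
  // Stop at 1, which is the marker added by the encoder.
  while (value > 1n) {
    bytes.unshift(Number(value % 256n));
    value /= 256n;
  }
  return new TextDecoder("utf-8").decode(new Uint8Array(bytes));
}

const encoded = encodeWithAlphabet("Hello, world!");
console.log(encoded);
console.log(decodeWithAlphabet(encoded)); // Hello, world!
```

The BigInt arithmetic is quadratic in the input length, so this suits short strings; for long inputs a chunked scheme such as base64 or base32 is a better fit.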

JSON unicode characters conversion

I came across this strange JSON which I can't seem to decode.
To simplify things, let's say it's a JSON string:
"\uffffffe2\uffffff94\uffffff94\uffffffe2\uffffff94\uffffff80\uffffffe2\uffffff94\uffffff80 mystring"
After decoding it should look as following:
└── mystring
Neither JS nor PHP seems to convert it correctly.
js> JSON.parse('"\uffffffe2\uffffff94\uffffff94\uffffffe2\uffffff94\uffffff80\uffffffe2\uffffff94\uffffff80 mystring"')
ffe2ff94ff94ffe2ff94ff80ffe2ff94ff80 mystring
PHP behaves the same
php> json_decode('"\uffffffe2\uffffff94\uffffff94\uffffffe2\uffffff94\uffffff80\uffffffe2\uffffff94\uffffff80 mystring"')
ffe2ff94ff94ffe2ff94ff80ffe2ff94ff80 mystring
Any ideas how to properly parse this JSON string would be welcome.
It is not a valid JSON string: JSON supports only 4 hex digits after \u. The results from both PHP and JS are correct.
It is not possible to decode this using standard functions.
Where did you get this JSON string?
As for the correct JSON for the string you want to get, it should be "\u2514\u2500\u2500 mystring", or just "└── mystring" (JSON allows any Unicode character unescaped in a string, except ", \ and control characters).
Also, if you need to escape a character outside the Basic Multilingual Plane (one that needs two UTF-16 code units), it will result in two escape codes; for example, "𩄎" would be "\ud864\udd0e" when escaped.
So, if you really need to decode the string above, you can fix it before decoding by replacing \uffffffe2 with \uffff\uffe2 via a regexp (for JS it would be something like: s.replace(/(\\u[A-Fa-f0-9]{4})([A-Fa-f0-9]{4})/gi,'$1\\u$2') ).
But in any case, the character codes in the string above do not look right.
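One possible reading of the odd codes, offered as an assumption rather than a confirmed diagnosis: each \uffffffXX looks like a single byte XX that was sign-extended to 32 bits before being escaped (0xffffffe2 is the byte 0xE2 seen as a negative signed char), and the bytes E2 94 94, E2 94 80, E2 94 80 are the UTF-8 encodings of └, ─ and ─. Under that assumption, the raw text can be repaired before any JSON parsing; decodeSignExtendedEscapes is a hypothetical name and the sketch relies on TextDecoder being available.

```javascript
// Hypothetical helper: treats each \uffffffXX as the single byte XX and decodes runs as UTF-8.
function decodeSignExtendedEscapes(raw) {
  return raw.replace(/(?:\\uffffff[0-9a-f]{2})+/gi, (run) => {
    const bytes = run
      .match(/\\uffffff[0-9a-f]{2}/gi)
      .map((esc) => parseInt(esc.slice(-2), 16));
    return new TextDecoder("utf-8").decode(new Uint8Array(bytes));
  });
}

// The backslashes are doubled so that this literal holds the raw, unparsed text.
const raw =
  "\\uffffffe2\\uffffff94\\uffffff94\\uffffffe2\\uffffff94\\uffffff80" +
  "\\uffffffe2\\uffffff94\\uffffff80 mystring";

console.log(decodeSignExtendedEscapes(raw)); // └── mystring
```

The cleaner fix is still on the producing side, where the bytes should be treated as unsigned (or the text decoded as UTF-8) before it is escaped.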

How to decode Chinese hex string into Chinese characters or JavaScript?

I am working on a Rails app.
I am using an API that returns some Chinese provinces.
The API returns the provinces in hex strings, for example:
{ "\xE5\x8C\x97\xE4\xBA\xAC" => "some data" }
My JavaScript calls a controller that returns this hash. I put all the province strings into a dropdown, but the strings show up as a black diamond with a question mark in the middle. How do I convert the Ruby hex string into the actual Chinese characters, 北京? Or, if possible, can I convert the hex string into Chinese characters in JavaScript?
The bytes \xE5\x8C\x97 are the UTF-8 representation of 北 and \xE4\xBA\xAC is the UTF-8 representation of 京. So this string:
"\xE5\x8C\x97\xE4\xBA\xAC"
is 北京 if the bytes are interpreted as UTF-8. That you're seeing hex codes instead of Chinese characters suggests that the string's encoding is binary:
> s = "\xE5\x8C\x97\xE4\xBA\xAC"
=> "北京"
> s.encoding
=> #<Encoding:UTF-8>
> s.force_encoding('binary')
=> "\xE5\x8C\x97\xE4\xBA\xAC"
So this API you're talking to is speaking UTF-8 but somewhere your application is losing track of what encoding that string is supposed to be. If you force the encoding to be UTF-8 then the problem goes away:
> s.force_encoding('utf-8')
=> "北京"
You should fix this encoding problem at the very edge of your application, where it reads data from this remote API. Once that's done, everything should be sensible UTF-8 everywhere that you care about. This should fix your JavaScript problem as well, as JavaScript is quite happy to work with UTF-8.
I think you can do it like this (see the docs):
Ruby:
2.1.2 :002 > require 'uri'
=> true
2.1.2 :003 > URI.decode("\xE5\x8C\x97\xE4\xBA\xAC")
=> "北京"
JavaScript: decodeURIComponent(URIstring)
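On the JavaScript side, a minimal sketch of both routes looks like this, assuming the six bytes from the question are available either as numbers or as percent escapes, and that TextDecoder exists in the runtime. The variable names are only illustrative.

```javascript
// The six bytes from the question, interpreted as UTF-8.
const bytes = new Uint8Array([0xe5, 0x8c, 0x97, 0xe4, 0xba, 0xac]);
const fromBytes = new TextDecoder("utf-8").decode(bytes);
console.log(fromBytes); // 北京

// The same bytes written as percent escapes.
const fromEscapes = decodeURIComponent("%E5%8C%97%E4%BA%AC");
console.log(fromEscapes === fromBytes); // true
```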
