Why won't window.btoa work on – ” characters in JavaScript?

So I'm converting a string to BASE64 as shown in the code below...
var str = "Hello World";
var enc = window.btoa(str);
This yields SGVsbG8gV29ybGQ=. However, if I add the characters – and ”, as in the code below, the conversion fails. What is the reason behind this? Thank you so much.
var str = "Hello – World”";
var enc = window.btoa(str);

btoa is an exotic function in that it requires a "binary string", which is an 8-bit clean string format: every character code must fit in a single byte. It doesn't work with Unicode characters whose code is above 255, such as your en dash and "fancy" quote symbol.
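To see which characters btoa will object to, you can scan the string for code units above 255. This is only a minimal sketch; the helper name findNonLatin1 is made up for illustration and is not part of any API.

```javascript
// Hypothetical helper: collect the characters whose code unit is above 0xFF,
// i.e. the ones btoa cannot treat as a single byte.
function findNonLatin1(str) {
  const offenders = [];
  for (let i = 0; i < str.length; i++) {
    if (str.charCodeAt(i) > 0xff) {
      offenders.push(str[i]);
    }
  }
  return offenders;
}

console.log(findNonLatin1("Hello World"));    // nothing found: safe for btoa
console.log(findNonLatin1("Hello – World”")); // finds the en dash and the closing quote
```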
You'll either have to turn your string into a new string that conforms to single-byte packing (and then manually reconstitute the result of the associated atob), or you can URI-encode the data first, making it safe:
> var str = `Hello – World`;
> window.btoa(encodeURIComponent(str));
"SGVsbG8lMjAlRTIlODAlOTMlMjBXb3JsZA=="
And then remember to decode it again when unpacking:
> var base64= "SGVsbG8lMjAlRTIlODAlOTMlMjBXb3JsZA==";
> decodeURIComponent(window.atob(base64));
"Hello – World"

The problem is that the characters – and ” lie outside the Latin-1 range.
To work around this you can use unescape (now deprecated):
var str = "Hello – World”";
var enc = btoa(unescape(encodeURIComponent(str)));
alert(enc);
And to decode:
var encStr = "SGVsbG8g4oCTIFdvcmxk4oCd";
var dec = decodeURIComponent(escape(window.atob(encStr)))
alert(dec);

This ultimately owes to a deficiency in the JavaScript type system.
JavaScript strings are strings of 16-bit code units, which are customarily interpreted as UTF-16. The Base64 encoding is a method of transforming an 8-bit byte stream into a string of digits, by taking each three bytes and mapping them into four digits, each covering 6 bits: 3 × 8 = 4 × 6. As we see, this is crucially dependent on the bit width of each symbol.
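The 3 × 8 = 4 × 6 arithmetic can be sketched for a single group of three bytes. The function name encodeTriple is an illustrative assumption, and the sketch ignores padding for inputs whose length is not a multiple of three.

```javascript
// The standard Base64 alphabet: index 0..63 maps to one digit.
const BASE64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Hypothetical helper: pack three 8-bit bytes into 24 bits,
// then cut those 24 bits into four 6-bit digits.
function encodeTriple(b0, b1, b2) {
  const bits = (b0 << 16) | (b1 << 8) | b2;
  return (
    BASE64_ALPHABET[(bits >> 18) & 63] +
    BASE64_ALPHABET[(bits >> 12) & 63] +
    BASE64_ALPHABET[(bits >> 6) & 63] +
    BASE64_ALPHABET[bits & 63]
  );
}

// "Hel" is the bytes 0x48 0x65 0x6c.
console.log(encodeTriple(0x48, 0x65, 0x6c)); // "SGVs"
```

The result matches the first four digits of SGVsbG8gV29ybGQ=, the encoding of "Hello World" from the question.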
At the time the btoa function was defined, JavaScript had no type for 8-bit byte streams, so the API was defined to take the ordinary 16-bit string type as input, with the restriction that each code unit was supposed to be confined to the range [U+0000, U+00FF]; when encoded into ISO-8859-1, such a string would reproduce the intended byte stream exactly.
The character – is U+2013, while ” is U+201D; neither of those characters fits into the above-mentioned range, so the function rejects the string (browsers throw an InvalidCharacterError).
If you want to convert Unicode text into Base64, you need to pick a character encoding and convert it into a byte string first, and encode that. Asking for a Base64 encoding of a Unicode string itself is meaningless.
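As a sketch of that advice, assuming UTF-8 is the encoding you pick and that TextEncoder, TextDecoder, btoa and atob are available as globals (browsers and recent Node.js versions), the conversion could look like this. The function names utf8ToBase64 and base64ToUtf8 are made up for illustration.

```javascript
// Encode text as UTF-8 bytes, then Base64-encode those bytes.
function utf8ToBase64(text) {
  const bytes = new TextEncoder().encode(text); // Uint8Array of UTF-8 bytes
  let binary = "";
  for (const b of bytes) {
    binary += String.fromCharCode(b); // one code unit per byte, all <= 0xFF
  }
  return btoa(binary);
}

// Reverse the steps: Base64 to bytes, then bytes to text via UTF-8.
function base64ToUtf8(base64) {
  const binary = atob(base64);
  const bytes = Uint8Array.from(binary, ch => ch.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

const encoded = utf8ToBase64("Hello – World”");
console.log(encoded);               // "SGVsbG8g4oCTIFdvcmxk4oCd"
console.log(base64ToUtf8(encoded)); // "Hello – World”"
```

Whoever decodes the Base64 has to know that the bytes are UTF-8; the Base64 text itself does not record the encoding.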

The most bulletproof way is to work on binary data directly.
For this, you can encode your string to a Uint8Array holding the UTF-8 version of your string.
Then a FileReader instance will be able to give you the Base64 quite easily.
var str = "Hello – World”";
var buf = new TextEncoder().encode( str );
var reader = new FileReader();
reader.onload = evt => { console.log( reader.result.split(',')[1] ); };
reader.readAsDataURL( new Blob([buf]) );
And since the Blob() constructor does automagically encode DOMString instances to UTF-8, we could even get rid of the TextEncoder object:
var str = "Hello – World”";
var reader = new FileReader();
reader.onload = evt => { console.log( reader.result.split(',')[1] ); };
reader.readAsDataURL( new Blob([str]) );

Related

Convert a file to Base64 using JavaScript and convert it back to a file using C#

I am trying to convert PDF and image files to Base64 using JavaScript and convert them back to files using C# in a Web API.
Javascript
var filesSelected = document.getElementById("inputFileToLoad").files;
if (filesSelected.length > 0)
{
    var fileToLoad = filesSelected[0];
    var fileReader = new FileReader();
    fileReader.onload = function(fileLoadedEvent)
    {
        var textAreaFileContents = document.getElementById("textAreaFileContents");
        textAreaFileContents.innerHTML = fileLoadedEvent.target.result;
    };
    fileReader.readAsDataURL(fileToLoad);
}
C#
Byte[] bytes = Convert.FromBase64String(dd[0].Image_base64Url);
File.WriteAllBytes(actualSavePath,bytes);
But in the API I get this exception: {"The input is not a valid Base-64 string as it contains a non-base 64 character, more than two padding characters, or an illegal character among the padding characters. "}
Please tell me how to proceed with this...
Thanks
According to the MDN page for FileReader.readAsDataURL, the generated URLs are prefixed with something like data:image/jpeg;base64,. Have a look at your generated string: look for the occurrence of base64, and take the Base64 data that starts after this prefix.
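The prefix can also be removed in the browser before the string is sent to the API. A minimal sketch, where stripDataUrlPrefix is a made-up helper name and the sample data URL is only an illustration:

```javascript
// Hypothetical helper: return only the Base64 payload of a data URL
// such as the one produced by FileReader.readAsDataURL().
function stripDataUrlPrefix(dataUrl) {
  const marker = ";base64,";
  const index = dataUrl.indexOf(marker);
  if (index === -1) {
    throw new Error("Not a Base64 data URL");
  }
  return dataUrl.slice(index + marker.length);
}

console.log(stripDataUrlPrefix("data:text/plain;base64,SGVsbG8gV29ybGQ="));
// "SGVsbG8gV29ybGQ="
```

In the onload handler from the question, this would be applied to fileLoadedEvent.target.result.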
Because the FileReader.readAsDataURL() produces a string that is prefixed with extra metadata (the "URL" portion), you need to strip it off on the C# side. Here's some sample code:
// Sample string from FileReader.readAsDataURL()
var base64 = "";
// Some known piece of information that will be in the above string
const string identifier = ";base64,";
// Find where it exists in the input string
var dataIndex = base64.IndexOf(identifier);
// Take the portion after this identifier; that's the real base-64 portion
var cleaned = base64.Substring(dataIndex + identifier.Length);
// Get the bytes
var bytes = Convert.FromBase64String(cleaned);
You could condense this down if it's too verbose, I just wanted to explain it step by step.
var bytes = Convert.FromBase64String(base64.Substring(base64.IndexOf(";base64,") + 8));

Convert Windows-1252 hex value to Unicode in JavaScript

Let's say I have a string containing Windows-1252 hex value for a character, I would like to make that appropriate Unicode character:
const asciiHex = '85' //represents hellip
parseInt(asciiHex, 16) //I get 133 as expected
I can't do String.fromCharCode now as this takes Unicode codes, rather than ASCII (in unicode hellip is 8230 (decimal)). Is anyone aware of any simple conversion?
btw I am doing this in node 6
You don't mention the input encoding: in which character encoding is \x85 mapped to the horizontal ellipsis? Turns out that's Windows-1252, which Node.js doesn't "speak" out of the box.
A module that can encode/decode it is windows-1252.
To convert your hex code to a UTF-8 encoded string:
const windows1252 = require('windows-1252');
let asciiHex = '85';
let utf8text = windows1252.decode( Buffer.from(asciiHex, 'hex').toString('binary') );
console.log( utf8text ); // outputs: …
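If a third-party module is not an option, a runtime whose TextDecoder recognises the "windows-1252" label can do the same conversion. This is a sketch under that assumption: browsers qualify, as do Node.js builds that ship full ICU data, which is newer than the Node 6 mentioned in the question. The function name windows1252HexToString is made up for illustration.

```javascript
// Hypothetical helper: turn a hex string such as "85" into bytes and
// decode those bytes as Windows-1252.
function windows1252HexToString(hex) {
  const bytes = new Uint8Array(hex.length / 2);
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  // Assumption: this runtime supports the "windows-1252" label.
  return new TextDecoder("windows-1252").decode(bytes);
}

console.log(windows1252HexToString("85")); // the horizontal ellipsis, U+2026
```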

CryptoJS not decrypting non-Latin characters faithfully

I am trying to use CryptoJS AES, like so:
var msg = "café";
var key = "something";
var c = CryptoJS.AES.encrypt(msg, key).toString();
CryptoJS.AES.decrypt(c, key).toString(CryptoJS.enc.Latin1);
Unfortunately this returns cafÃ©, not café. Clearly Latin1 is not the right encoding to use, but I can't find a better one. Is there a solution?
Thanks.
You are just using the wrong encoding.
The proper way is to use CryptoJS.enc.Utf8.
So, please try:
CryptoJS.AES.decrypt(c, key).toString(CryptoJS.enc.Utf8);
https://code.google.com/p/crypto-js/#The_Hasher_Input
The hash algorithms accept either strings or instances of CryptoJS.lib.WordArray [...] an array of 32-bit words. When you pass a string, it's automatically converted to a WordArray encoded as UTF-8.
So, when you pass a string (and don't use CryptoJS.enc.* to generate a WordArray) it automatically converts the string (message) to a utf8 WordArray.
See here for sample roundtrip encrypt/decrypt:
https://code.google.com/p/crypto-js/#The_Cipher_Output
Here's a jsfiddle to play with CryptoJS
https://jsfiddle.net/8qbf4746/4/
var message = "café";
var key = "something";
var encrypted = CryptoJS.AES.encrypt(message, key);
//equivalent to CryptoJS.AES.encrypt(CryptoJS.enc.Utf8.parse(message), key);
var decrypted = CryptoJS.AES.decrypt(encrypted, key);
$('#1').text("Encrypted: "+encrypted);
$('#2').text("Decrypted: "+decrypted.toString(CryptoJS.enc.Utf8));
To emphasize my point here is the same thing using Latin1 encoding:
https://jsfiddle.net/3a8tf48f/2/
var message = "café";
var key = "something";
var encrypted = CryptoJS.AES.encrypt(CryptoJS.enc.Latin1.parse(message), key);
var decrypted = CryptoJS.AES.decrypt(encrypted, key);
$('#1').text("Encrypted: " + encrypted);
$('#2').text("Decrypted: " + decrypted.toString(CryptoJS.enc.Latin1));
On a side note, the API would probably be better if it only accepted WordArray and didn't overload the toString method (which is just a convenience interface to CryptoJS.enc.*.stringify). The string conversion magic is a little misleading.
You are trying to decrypt your data as a Latin1 string, even though your input string is not in Latin1. The encoding used by CryptoJS internally is not the same as the encoding you use to write the input file.
You need to specify the same encoding both when encrypting (for the string -> byte array conversion) and when decrypting (for the byte array -> string conversion).
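The mismatch can be reproduced without CryptoJS. The sketch below only assumes the standard TextEncoder and TextDecoder globals; decodeAsLatin1 is a made-up helper that maps each byte straight to the character with the same code.

```javascript
// UTF-8 bytes of "café": 63 61 66 c3 a9 (the é takes two bytes).
const utf8Bytes = new TextEncoder().encode("caf\u00e9");

// Hypothetical helper: interpret every byte as one Latin-1 character.
function decodeAsLatin1(bytes) {
  return Array.from(bytes, b => String.fromCharCode(b)).join("");
}

const wrong = decodeAsLatin1(utf8Bytes);                  // "cafÃ©"
const right = new TextDecoder("utf-8").decode(utf8Bytes); // "café"

console.log(wrong, right);
```

Reading the two UTF-8 bytes of é as two separate Latin-1 characters is exactly what produces the Ã© in the question.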

Convert this unicode to string with javascript (Thai Language)

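&#xe21;&#xe2d;&#xe40;&#xe15;&#xe2d;&#xe23;&#xe4c;&#xe44;&#xe0b;&#xe04;&#xe4c;
Can I convert this Unicode to a string with JS? (It is the Thai language.)
I use
console.log(String.fromCharCode("&#xe21;&#xe2d;&#xe40;&#xe15;&#xe2d;&#xe23;&#xe4c;&#xe44;&#xe0b;&#xe04;&#xe4c;"));
And it's not correct. If it were right, it would show มอเตอร์ไซค์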
Your Unicode string is encoded using HTML entity notation. Generally that means that whatever encoded the string expected it to end up in the middle of an HTML document, where it would be seen by an HTML parser.
If you've somehow got that string in JavaScript in a browser, you can get to the encoded Unicode by letting the browser parse it:
var str = "&#xe21;&#xe2d;&#xe40;&#xe15;&#xe2d;&#xe23;&#xe4c;&#xe44;&#xe0b;&#xe04;&#xe4c;";
var elem = document.createElement("div");
elem.innerHTML = str;
alert(elem.textContent);
The String.fromCharCode() function expects one or more numeric arguments; it won't understand HTML entities. Thus if you're not in a browser (for example, if you've got the string in a Node.js program), you could convert the string with your own code:
var str = "&#xe21;&#xe2d;&#xe40;&#xe15;&#xe2d;&#xe23;&#xe4c;&#xe44;&#xe0b;&#xe04;&#xe4c;";
var thai = String.fromCharCode.apply(String, str.match(/x[^;]*;/g).map(function(n) { return parseInt(n.slice(1, -1), 16); }));
That conversion will only work when the code points involved are within the first 64K values (the Basic Multilingual Plane), because String.fromCharCode takes 16-bit code units.
You may want something like this :
var input = "&#xe21;&#xe2d;&#xe40;&#xe15;&#xe2d;&#xe23;&#xe4c;&#xe44;&#xe0b;&#xe04;&#xe4c;";
var output = input.replace(/&#x[0-9A-Fa-f]+;/g,
    function(htmlCode) {
        var codePoint = parseInt( htmlCode.slice(3, -1), 16 );
        return String.fromCharCode( codePoint );
    });
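For entities beyond the first 64K code points, String.fromCodePoint (ES2015) can stand in for String.fromCharCode. A minimal sketch, with decodeHexEntities as a made-up name; it handles only hexadecimal entities of the form &#x...; and will throw a RangeError for values above 0x10FFFF.

```javascript
// Hypothetical helper: replace each hexadecimal HTML entity with the
// character it names, including code points above U+FFFF.
function decodeHexEntities(input) {
  return input.replace(/&#x([0-9A-Fa-f]+);/g, (match, hex) =>
    String.fromCodePoint(parseInt(hex, 16))
  );
}

console.log(decodeHexEntities("&#xe21;&#xe2d;")); // "มอ"
console.log(decodeHexEntities("&#x1F600;"));      // one emoji, stored as a surrogate pair
```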

How to convert mixed ascii and unicode to a string in javascript?

I have a mixed source of unicode and ascii characters, for example:
var source = "\u5c07\u63a2\u8a0e HTML5 \u53ca\u5176\u4ed6";
How do I convert it to a string by leveraging and extending the below uniCodeToString function written by myself in Javascript? This function can convert pure unicode to string.
function uniCodeToString(source){
    //for example, source = "\u5c07\u63a2\u8a0e"
    var escapedSource = escape(source);
    var codeArray = escapedSource.split("%u");
    var str = "";
    for(var i=1; i<codeArray.length; i++){
        str += String.fromCharCode("0x"+codeArray[i]);
    }
    return str;
}
Use encodeURIComponent; escape was never meant for Unicode.
var source = "\u5c07\u63a2\u8a0e HTML5 \u53ca\u5176\u4ed6";
var enc=encodeURIComponent(source)
//returned value: (String)
%E5%B0%87%E6%8E%A2%E8%A8%8E%20HTML5%20%E5%8F%8A%E5%85%B6%E4%BB%96
decodeURIComponent(enc)
//returned value: (String)
將探討 HTML5 及其他
I think you are misunderstanding the purpose of Unicode escape sequences.
var source = "\u5c07\u63a2\u8a0e HTML5 \u53ca\u5176\u4ed6";
JavaScript strings are always Unicode (each code unit is a 16-bit UTF-16 encoded value). The purpose of the escapes is to allow you to describe values that are unsupported by the encoding used to save the source file (e.g. the HTML page or .JS file is encoded as ISO-8859-1) or to overcome things like keyboard limitations. This is no different from using \n to indicate a linefeed code point.
The above string ("將探討 HTML5 及其他") is made up of the values 5c07 63a2 8a0e 0020 0048 0054 004d 004c 0035 0020 53ca 5176 4ed6 whether you write the sequence as a literal or in escape sequences.
See the String Literals section of ECMA-262 for more details.
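A quick check makes the point concrete: the escaped and the literal spellings compare equal, and both hold the same code units. The helper name codeUnitsHex is made up for illustration.

```javascript
const escaped = "\u5c07\u63a2\u8a0e HTML5 \u53ca\u5176\u4ed6";
const literal = "將探討 HTML5 及其他";

// Hypothetical helper: list every 16-bit code unit as four hex digits.
function codeUnitsHex(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i).toString(16).padStart(4, "0"));
  }
  return units.join(" ");
}

console.log(escaped === literal);   // true
console.log(codeUnitsHex(escaped));
// 5c07 63a2 8a0e 0020 0048 0054 004d 004c 0035 0020 53ca 5176 4ed6
```

So no conversion function is needed: the string already contains the characters.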
