Why are atob and btoa not reversible - javascript

I'm trying to find a simple way to record and temporarily obfuscate answers to "quiz" questions I'm writing in Markdown. (I'll tell the students the quiz answers during the presentation, so I'm not looking for any kind of secure encryption.)
I thought I could use atob('message I want to obfuscate') then tell students they can use btoa() in their developer tools panel to reverse the process. However the following does not return 'one':
btoa( atob('one') )
Does anyone know why this doesn't return 'one'? Are there other methods built into JavaScript that will allow one to loosely encrypt and decrypt a message? (I'm working with absolute beginners who might be confused by functions and who would be very confused trying to add libraries to a page).

That is the reason.
In Base64 encoding, the length of the encoded output string must be a
multiple of 4. If it's not, the output is padded with additional
pad characters (=). On decoding, these extra padding characters are
discarded.
var string1 = "one",
string2 = "one2";
console.log("Value of string1", string1)
console.log("Decoded string1", atob(string1))
console.log("Encoded string1", btoa(atob(string1)))
console.log("-------------------------------------")
console.log("Value of string2", string2)
console.log("Decoded string2", atob(string2))
console.log("Encoded string2", btoa(atob(string2)))

As @george pointed out, one must use btoa() before using atob():
atob( btoa( 'hello' ) )

btoa means binary to ascii: input is Binary=any kind of data: text, images, audio. Output is Ascii=its base64 encoding, which is an ascii subset, i.e. a text string containing only upper and lowercase letters, digits, plus, slash, and the equal sign (only for padding at the end).
atob means ascii to binary: input MUST be a subset of Ascii, i.e. the result of a base64 encoded string. Output is Binary=any type of data (text, image, audio, ...).
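To make the direction concrete, here is a small sketch that can be pasted into a browser console (or a recent Node.js, which also exposes btoa and atob as globals). The variable names are only illustrative.

```javascript
// btoa encodes (binary string -> Base64 text); atob decodes (Base64 text -> binary string).
const encoded = btoa('hello');   // "aGVsbG8="
const decoded = atob(encoded);   // "hello"
console.log(encoded, decoded);

// Decoding first only round-trips when the input already is canonical Base64.
// 'one' has 3 characters (18 bits); only 16 bits fit into whole bytes, so 2 bits are lost.
const lossy = btoa(atob('one'));      // "onc="
// 'one2' has 4 characters (24 bits = exactly 3 bytes), so nothing is lost.
const lossless = btoa(atob('one2'));  // "one2"
console.log(lossy, lossless);
```

For the quiz use case, that means obfuscating with btoa('answer') and letting students reveal it with atob(...).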

Related

Decoding Base64 String in Java

I'm using Java and I have a Base64 encoded string that I wish to decode and then do some operations to transform.
The correct decoded value is obtained in JavaScript through function atob(), but in java, using Base64.decodeBase64() I cannot get an equal value.
Example:
For:
String str = "AAAAAAAAAAAAAAAAAAAAAMaR+ySCU0Yzq+AV9pNCCOI="
With JavaScript atob(str) I get ->
"Æ‘û$‚SF3«àö“Bâ"
With Java new String(Base64.decodeBase64(str)) I get ->
"Æ?û$?SF3«à§ö?â"
Another way I could fix the issue is to run JavaScript in Java with a Nashorn engine, but I'm getting an error near the "$" symbol.
Current Code:
ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");
String script2 = "function decoMemo(memoStr){ print(atob(memoStr).split('')" +
".map((aChar) => `0${aChar.charCodeAt(0).toString(16)}`" +
".slice(-2)).join('').toUpperCase());}";
try {
engine.eval(script2);
Invocable inv = (Invocable) engine;
String returnValue = (String)inv.invokeFunction("decoMemo", memoTest );
System.out.print("\n result: " + returnValue);
} catch (ScriptException | NoSuchMethodException e1) {
e1.printStackTrace();
}
Any help would be appreciated. I searched a lot of places but couldn't find the correct answer.
btoa is broken and shouldn't be used.
The problem is, bytes aren't characters. Base64 encoding does only one thing. It converts bytes to a stream of characters that survive just about any text-based transport mechanism. And Base64 decoding does that one thing in reverse, it converts such characters into bytes.
And the confusion is, you're printing those bytes as if they are characters. They are not.
You end up with the exact same bytes, but javascript and java disagree on how you're supposed to turn that into an ersatz string because you're trying to print it to a console. That's a mistake - bytes aren't characters. Thus, some sort of charset encoding is being used, and you don't want any of this, because these characters clearly aren't intended to be printed like that.
Javascript sort of half-equates characters and bytes and will freely convert one to the other, picking some random encoding. Oof. Javascript sucks in this regard, it is what it is. The MDN docs on btoa explain why you shouldn't use it. You're running into that problem.
Not entirely sure how you fix it in javascript - but perhaps you don't need it. Java is decoding the bytes perfectly well, as is javascript, but javascript then turns those bytes into characters in some silly fashion and that's causing the problem.
What you have there is not a text string at all. The giveaway is the AA's at the beginning. Those map to a number of zero bytes. That doesn't translate to meaningful text in any standard character set.
So what you have there is most likely binary data. Converting it to a string is not going to give you meaningful text.
Now to explain the difference you are seeing between Java and Javascript. It looks to me as if both Java and Javascript are making a "best effort" attempt to convert the binary data as if it was encoded in ISO-8859-1 (aka ISO LATIN-1).
The problem is that some of the byte codes are mapping to unassigned codes.
In the Java case those unassigned codes are being mapped to ?, either when the string is created or when it is being output.
In the Javascript case, either the unassigned codes are not included in the string, or they are being removed when you attempt to display them.
For the record, this is how an online base64 decoder rendered the above for me:
����������������Æû$SF3«àöBâ
The unassigned codes are 0x91, 0x82 and 0x93. 0x15 and 0x08 are non-printing control codes.
But the bottom line is that you should not be converting this data into a string in either Java or in Javascript. It should be treated as binary; i.e. an array of byte values.
byte[] data = Base64.getDecoder().decode(str);
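The same idea on the JavaScript side, as a rough sketch: atob still does the Base64 decoding, but the result is immediately turned into a byte array and printed as hex instead of being displayed as text. The variable names are made up for illustration.

```javascript
const str = 'AAAAAAAAAAAAAAAAAAAAAMaR+ySCU0Yzq+AV9pNCCOI=';

// atob returns a "binary string": one character per byte, each with a code from 0 to 255.
const bytes = Uint8Array.from(atob(str), (ch) => ch.charCodeAt(0));

// Render every byte as two hex digits rather than as a character.
const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, '0'))
  .join('')
  .toUpperCase();

console.log(hex);
```

Comparing hex output like this between Java and JavaScript sidesteps the charset question entirely, because no byte is ever interpreted as text.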

Encode and decode a string in JavaScript

Let's say I have a string, called str, which is equal to "Hello, world!"
Is there a way to choose some unicode characters like "azertyuiopqsdfghjklmwxcvbn1234567890-" and return an encoded string that contains only the chosen characters?
It should return something like "hvfebi iehfhe" (well, something encoded and not human readable, but which is decodable)?
Thanks
There's no built-in function that you can configure with an arbitrary set of characters to encode and decode, unless you find a library that does that or you implement it yourself.
But you can use base64 encoding, which uses the "A-Z", "a-z", "0-9", "+", "/" and "=" characters to encode the string.
The native browser functions are on window: btoa() to encode and atob() to decode.
Edit after your comments:
Any function you make up to solve this will be decodable anyway by analysing the code, so there is no point in hiding it. If you don't want it to be simply base64, you can make a function that encodes it several times.
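If you do want to implement it yourself, one possible sketch is below. It is not a standard encoding: it simply writes each UTF-8 byte of the input as two symbols from the chosen alphabet, which works for any alphabet with at least 16 distinct characters. The names ALPHABET, encodeWithAlphabet and decodeWithAlphabet are hypothetical.

```javascript
// The 37 characters from the question; any set of 16 or more distinct characters would do.
const ALPHABET = 'azertyuiopqsdfghjklmwxcvbn1234567890-';

function encodeWithAlphabet(text) {
  const n = ALPHABET.length;
  let out = '';
  // Work on UTF-8 bytes so that every value is in the range 0..255.
  for (const byte of new TextEncoder().encode(text)) {
    // Two "digits" in base n are enough because n * n >= 256.
    out += ALPHABET[Math.floor(byte / n)] + ALPHABET[byte % n];
  }
  return out;
}

function decodeWithAlphabet(encoded) {
  const n = ALPHABET.length;
  const bytes = new Uint8Array(encoded.length / 2);
  for (let i = 0; i < bytes.length; i++) {
    // Rebuild each byte from its two base-n digits.
    bytes[i] = ALPHABET.indexOf(encoded[2 * i]) * n + ALPHABET.indexOf(encoded[2 * i + 1]);
  }
  return new TextDecoder().decode(bytes);
}

const secret = encodeWithAlphabet('Hello, world!');
console.log(secret, decodeWithAlphabet(secret));
```

The output is twice as long as the input in bytes, and like base64 it is obfuscation only, not encryption.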

Text encoding that produces legible encodings suitable as Javascript identifiers?

I'm working on a tool that reads arbitrary data files and creates a table out of its data which I then store in a database. I'd like to preserve the column headers. The column headers are already ASCII text (or maybe latin1), but they have characters that aren't valid variable names (e.g., spaces, %), so I need to encode them somehow. I'm looking for an encoding for the column titles that has these properties:
Legible: it would be nice if the encoded text looked as similar as possible to the unencoded text (i.e., for debugging).
Legal identifier: I'd like the encoded text to be a valid JavaScript identifier (ECMA-262 Section 7.6).
Invertible: I'd like to be able to get the exact original text back from the encoded text.
I can think of approaches that work for 2 of the 3 cases, but I don't know how to get all 3. E.g., url encoding doesn't produce legal identifier names, I think I could transform base64 to be legal, but it isn't legible, what I've got currently just does some substitutions so it's not invertible.
Efficiency isn't a concern, so if necessary, I could store the encoded and unencoded texts together. The best option I can think of is to use url encoding and then swap percents for $. I thought there would be better options than this though, but I can't find anything. Is there anything better?
This pair of methods relying on Guava's PercentEscaper seems to meet my requirements. Guava doesn't provide an unescaper, but given my simple needs here, I can just use a simple URLDecoder.
private static PercentEscaper escaper = new PercentEscaper("", false);
static String getIdentifier(String str) {
//no extra safe characters; letters and digits are left alone, so the result is somewhat legible
String escaped = escaper.escape(str);
//javascript identifiers can't start with a digit, and the escaper doesn't know the first
//character has different rules, so prepend "%3" to turn the digit into its percent escape (e.g. "5" becomes "%35")
if(Character.isDigit(escaped.charAt(0))){
escaped = "%3"+escaped;
}
//a percent sign isn't valid in a javascript identifier, so we'll use _ as our special character
escaped = escaped.replace('%','_');
return escaped;
}
static String invertIdentifier(String str){
String unescaped = str.replace('_','%');
unescaped = URLDecoder.decode(unescaped, "UTF-8");
return unescaped;
}
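For comparison, the same escape-then-swap idea can be sketched in JavaScript without Guava. This is an assumption-laden illustration, not a port of the code above: toIdentifier and fromIdentifier are hypothetical names, every UTF-8 byte that is not an ASCII letter or digit becomes _XX, and reserved words such as "class" are not handled.

```javascript
function toIdentifier(text) {
  let out = '';
  for (const byte of new TextEncoder().encode(text)) {
    const ch = String.fromCharCode(byte);
    // Keep ASCII letters and digits; escape everything else (including "_") as _XX.
    out += /[A-Za-z0-9]/.test(ch)
      ? ch
      : '_' + byte.toString(16).toUpperCase().padStart(2, '0');
  }
  // Identifiers can't start with a digit, so escape a leading digit as well.
  if (/^[0-9]/.test(out)) {
    out = '_' + out.charCodeAt(0).toString(16).toUpperCase() + out.slice(1);
  }
  return out;
}

function fromIdentifier(identifier) {
  // Every "_" in the encoded form starts an escape, so it maps back to "%".
  return decodeURIComponent(identifier.replace(/_/g, '%'));
}

console.log(toIdentifier('Tax rate %'));  // Tax_20rate_20_25
console.log(toIdentifier('2nd col'));     // _32nd_20col
```

Because a literal underscore is itself escaped (as _5F), the mapping stays invertible.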

How to display unicode / hexadecimal emoji and octal literals in HTML using Vue.js

So I'm getting such response from webserver:
"\ud83d\ude48\ud83d\ude02\ud83d\ude30\ud83d\ude09\ud83d\udc4f\ud83c\udffd\ud83d\udc4c\ud83c\udffd\ud83d\udd1d\u2714\ufe0f\ud83d\ude42 \344\366\374\337\u015b\u0161"
which after decoding should look like this:
🙈😂😰😉👏🏽👌🏽🔝✔️🙂 äöüßśš
äöüß are encoded as octal literals \344\366\374\337
To display correctly this message (not encoded plain text) I've used:
{{ JSON.parse('"' + messageContent.message + '"') }}
That worked perfectly for escaped unicode values, but not when octal literals appear. ES6 won't allow octal literals since they are deprecated, so an error occurs. What I've done is find the octal literals with a regex and then parse them using String.fromCharCode(parseInt(parseInt(val.replace('\\', ''), 8), 10)), so that from e.g. \344 I get ä. After replacing the octals, I have to search for any unicode escapes and, again, parse them one by one using JSON.parse(`"${val}"`) (this is the same case as described below: if I hardcode a string and return just \ud83d\ude48, I don't have to parse it with JSON.parse; it just returns 🙈). I don't believe this is an optimal solution.
The other strange thing for me is when I try display message directly from server response (even if it does not contain any octal literals) using
{{ response.message }} it will print as normal string, but when I create new variable and assign exact the same value as I receive from server:
message='\ud83d\ude48\ud83d\ude02\ud83d\ude30\ud83d\ude09\ud83d\udc4f\ud83c\udffd\ud83d\udc4c\ud83c\udffd\ud83d\udd1d\u2714\ufe0f\ud83d\ude42'
and then display it
{{ message }} displayed value is 🙈😂😰😉👏🏽👌🏽🔝✔️🙂.
And one last thing: even when I use my algorithm (I'm just looking for text that matches /\\[[a-zA-Z0-9]{1,5}\\[[a-zA-Z0-9]{1,5}/g), sometimes it does not parse unicode well. E.g., if the user changes the skin color, the unicode message would be \ud83d\udc4d\ud83c\udffd, decoded: 👍🏽, but with this regex it would be 👍�\udffd
It's possible to make some small changes on the backend side if it's necessary, but it's used also by mobile apps that are finished so that changes should not affect them.
Thanks for any help.
Try manually decoding the unicode escape sequences (\uXXXX) and octal escape sequences (\XXX) as follows:
const response = '\\ud83d\\ude48\\ud83d\\ude02\\ud83d\\ude30\\ud83d\\ude09\\ud83d\\udc4f\\ud83c\\udffd\\ud83d\\udc4c\\ud83c\\udffd\\ud83d\\udd1d\\u2714\\ufe0f\\ud83d\\ude42 \\344\\366\\374\\337\\u015b\\u0161'
const decoded = response
.replace(/\\u(....)/g, (match, p1) => String.fromCharCode(parseInt(p1, 16)))
.replace(/\\(\d{3})/g, (match, p1) => String.fromCharCode(parseInt(p1, 8)))
console.log(decoded)
The server is sending you a string containing the literal characters \ud83d\ude48 (and so on), so the string must be explicitly decoded somehow by converting the escape sequences into the unicode characters they represent. On the other hand, if a string literal in JavaScript code contains the characters \ud83d\ude48 then it will be automatically decoded into 🙈.
Observe the difference between these two strings:
console.log('\ud83d\ude48')
console.log('\\ud83d\\ude48')
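A slightly stricter variant of the decoding, offered only as a sketch: it handles both escape forms in a single pass and only accepts real hex digits after \u and three-digit octal values up to \377. The function name decodeEscapes is hypothetical.

```javascript
function decodeEscapes(text) {
  // One pass over the string, so output produced by one rule is never re-scanned by the other.
  return text.replace(
    /\\u([0-9a-fA-F]{4})|\\([0-3][0-7]{2})/g,
    (match, hex, octal) =>
      String.fromCharCode(hex !== undefined ? parseInt(hex, 16) : parseInt(octal, 8))
  );
}

// Surrogate halves are converted one by one and pair up again in the resulting string.
console.log(decodeEscapes('\\ud83d\\udc4d\\ud83c\\udffd \\344\\366\\374\\337'));
```

Because each \uXXXX unit is converted independently, emoji with skin-tone modifiers need no special handling.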

What is the best way to serialize a JavaScript object into something that can be used as a fragment identifier (url#hash)?

My page state can be described by a JavaScript object that can be serialized into JSON. But I don't think a JSON string is suitable for use in a fragment ID due to, for example, the spaces and double-quotes.
Would encoding the JSON string into a base64 string be sensible, or is there a better way? My goal is to allow the user to bookmark the page and then upon returning to that bookmark, have a piece of JavaScript read window.location.hash and change state accordingly.
I think you are on the right track. Let's write down the requirements:
The encoded string must be usable as hash, i.e. only letters and numbers.
The original value must be possible to restore, i.e. hashing (md5, sha1) is not an option.
It shouldn't be too long, to remain usable.
There should be an implementation in JavaScript, so it can be generated in the browser.
Base64 would be a great solution for that. Only problem: base64 also contains characters like / and +, so you win nothing compared to simply attaching a JSON string (which also would have to be URL encoded).
BUT: Luckily, there's a variant of base64 called base64url which is exactly what you need. It is specifically designed for the type of problem you're describing.
However, I was not able to find a JS implementation; maybe you have to write one yourself – or do a bit more research than my half-assed 15 seconds scanning the first 5 Google results.
EDIT: On a second thought, I think you don't need to write an own implementation. Use a normal implementation, and simply replace the “forbidden” characters with something you find appropriate for your URLs.
Base64 is an excellent way to store binary data in text. It uses just 33% more characters than the original data has bytes and mostly uses 0-9, a-z, and A-Z. It also uses three other characters that would need to be encoded to be stored in a URL, which are /, =, and +. If you simply used URL encoding on arbitrary binary data, it could take up to 300% (3x) the size.
If you're only storing the characters in the fragment of the URL, base64-encoded text doesn't need to be re-encoded and will not change. But if you want to send the data as part of the actual URL to visit, then it matters.
As referenced by lxg, there is a base64url variant for that. This is a modified version of base64 that replaces the characters that are unsafe to store in a URL. Here is how to encode it:
function tobase64url(s) {
return btoa(s).replace(/\+/g,'-').replace(/\//g,'_').replace(/=/g,'');
}
console.log(tobase64url('\x00\xff\xff\xf1\xf1\xf1\xff\xff\xfe'));
// Returns "AP__8fHx___-" instead of "AP//8fHx///+"
And to decode a base64 string from the URL:
function frombase64url(s) {
return atob(s.replace(/-/g,'+').replace(/_/g, '/'));
}
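Applied to the original question, a page-state object could be turned into a fragment along these lines. This is a sketch under assumptions: stateToHash and hashToState are hypothetical names, and the JSON is converted to UTF-8 bytes first because btoa only accepts characters with codes up to 255.

```javascript
function stateToHash(state) {
  // JSON -> UTF-8 bytes -> binary string, so non-Latin-1 characters don't make btoa throw.
  const bytes = new TextEncoder().encode(JSON.stringify(state));
  let binary = '';
  for (const b of bytes) binary += String.fromCharCode(b);
  // Standard Base64, then the base64url substitutions with padding removed.
  return btoa(binary).replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, '');
}

function hashToState(hash) {
  const binary = atob(hash.replace(/-/g, '+').replace(/_/g, '/'));
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return JSON.parse(new TextDecoder().decode(bytes));
}

const example = { page: 3, query: 'caf\u00e9 \u2615', tags: ['a/b', 'c+d'] };
const hash = stateToHash(example);
console.log(hash, hashToState(hash));
```

In a browser, the result would be assigned to window.location.hash and read back with window.location.hash.slice(1) to drop the leading #.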
Use encodeURIComponent and decodeURIComponent to serialize data for the fragment (aka hash) part of the URL.
This is safe because the character set output by encodeURIComponent is a subset of the character set allowed in the fragment. Specifically, encodeURIComponent escapes all characters except:
A - Z
a - z
0 - 9
- . _ ~ ! ' ( ) *
So the output includes the above characters, plus escaped characters, which are % followed by hexadecimal digits.
The set of allowed characters in the fragment is:
A - Z
a - z
0 - 9
? / : @ - . _ ~ ! $ & ' ( ) * + , ; =
percent-encoded characters (a % followed by hexadecimal digits)
This set of allowed characters includes all the characters output by encodeURIComponent, plus a few other characters.
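A minimal sketch of that approach for a JSON-serializable state object; the function names stateToFragment and fragmentToState are hypothetical.

```javascript
function stateToFragment(state) {
  // Quotes, braces, colons and spaces in the JSON all become %XX escapes.
  return encodeURIComponent(JSON.stringify(state));
}

function fragmentToState(fragment) {
  return JSON.parse(decodeURIComponent(fragment));
}

const fragment = stateToFragment({ a: 'b c' });
console.log(fragment);                   // %7B%22a%22%3A%22b%20c%22%7D
console.log(fragmentToState(fragment));  // { a: 'b c' }
```

In a browser this would be wired up with window.location.hash = stateToFragment(state) when saving, and fragmentToState(window.location.hash.slice(1)) when restoring.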
