Text encoding that produces legible encodings suitable as Javascript identifiers? - javascript

I'm working on a tool that reads arbitrary data files and creates a table out of its data which I then store in a database. I'd like to preserve the column headers. The column headers are already ASCII text (or maybe latin1), but they have characters that aren't valid variable names (e.g., spaces, %), so I need to encode them somehow. I'm looking for an encoding for the column titles that has these properties:
Legible: it would be nice if the encoded text looked as similar as possible to the unencoded text (i.e., for debugging).
Legal identifier: I'd like the encoded text to be a valid JavaScript identifier (ECMA-262 Section 7.6).
Invertible: I'd like to be able to get the exact original text back from the encoded text.
I can think of approaches that work for 2 of the 3 cases, but I don't know how to get all 3. E.g., url encoding doesn't produce legal identifier names, I think I could transform base64 to be legal, but it isn't legible, what I've got currently just does some substitutions so it's not invertible.
Efficiency isn't a concern, so if necessary, I could store the encoded and unencoded texts together. The best option I can think of is to use url encoding and then swap percents for $. I thought there would be better options than this though, but I can't find anything. Is there anything better?

This pair of methods relying on Guava's PercentEscaper seems to meet my requirements. Guava doesn't provide an unescaper, but given my simple needs here, I can just use a simple URLDecoder.
private static PercentEscaper escaper = new PercentEscaper('',false)
static String getIdentifier(String str) {
//minimal safe characters, but leaves letters alone, so it's somewhat legible
String escaped = escaper.escape(str);
//javascript identifiers can't start with a digit, and the escaper doesn't know the first
//character has different rules. so prepend a "%3" to encode the digit
if(Character.isDigit(escaped.charAt(0))){
escaped = "%3"+escaped
}
//a percent isn't a valid in a javascript identifier, so we'll use _ as our special character
escaped = escaped.replace('%','_');
return escaped;
}
static String invertIdentifier(String str){
String unescaped = str.replace('_','%');
unescaped = URLDecoder.decode(unescaped, "UTF-8");
return unescaped;
}

Related

Failed to declare and display UTF-8 character properly in JSON

I have a JSON object where one attribute contains a static special character - https://www.compart.com/en/unicode/U+1F514
I have tried to store the string both as encoded UTF-8 "\xF0\x9F\x94\x94"
or tried to print it using its HEX value - String.fromCharCode(0x1F514) or decimal value String.fromCharCode(128276)
But it all results in an empty charater/empty square character in Google Chrome.
How can I please store this character properly, statically in a simple JSON {header1:"____"} and then echo it?
Also not able to display it in IntelliJ - so if you have a comment regarding this side issue would be very thankful.
For historial reasons, JavaScript doesn't have full Unicode support because language creators assumed that UTF-16 would never need more than 2-bytes to encode a single character. JSON inherits that and \u entities only accept 4 hexadecimal characters.
You need to use a workaround that basically consists on splitting the actual 4-byte UTF-16 character in two 2-byte characters, as in:
var raw = "🔔";
var doesNotWork = "\u1F514";
var works = "\uD83D\uDD14";
console.log(raw, doesNotWork, works);
... or get rid of entities and just dump the actual binary character:
var data = ["🔔"];
var json = JSON.stringify(data);
console.log(json, JSON.parse(json));
I think that the problem is that the font doesn't have support for such symbol, hence the square character being drawn. If there is not an specific reason as why you are using this character, you could draw it with an icon, or using a character in an icon font.

How to feed strange characters to javaScript? [duplicate]

I need to insert an Omega (Ω) onto my html page. I am using its HTML escaped code to do that, so I can write Ω and get Ω. That's all fine and well when I put it into a HTML element; however, when I try to put it into my JS, e.g. var Omega = Ω, it parses that code as JS and the whole thing doesn't work. Anyone know how to go about this?
I'm guessing that you actually want Omega to be a string containing an uppercase omega? In that case, you can write:
var Omega = '\u03A9';
(Because Ω is the Unicode character with codepoint U+03A9; that is, 03A9 is 937, except written as four hexadecimal digits.)
Edited to add (in 2022): There now exists an alternative form that better supports codepoints above U+FFFF:
let Omega = '\u{03A9}';
let desertIslandEmoji = '\u{1F3DD}';
Judging from https://caniuse.com/mdn-javascript_builtins_string_unicode_code_point_escapes, most or all browsers added support for it in 2015, so it should be reasonably safe to use.
Although #ruakh gave a good answer, I will add some alternatives for completeness:
You could in fact use even var Omega = 'Ω' in JavaScript, but only if your JavaScript code is:
inside an event attribute, as in onclick="var Omega = '&#937';
alert(Omega)" or
in a script element inside an XHTML (or XHTML + XML) document
served with an XML content type.
In these cases, the code will be first (before getting passed to the JavaScript interpreter) be parsed by an HTML parser so that character references like Ω are recognized. The restrictions make this an impractical approach in most cases.
You can also enter the Ω character as such, as in var Omega = 'Ω', but then the character encoding must allow that, the encoding must be properly declared, and you need software that let you enter such characters. This is a clean solution and quite feasible if you use UTF-8 encoding for everything and are prepared to deal with the issues created by it. Source code will be readable, and reading it, you immediately see the character itself, instead of code notations. On the other hand, it may cause surprises if other people start working with your code.
Using the \u notation, as in var Omega = '\u03A9', works independently of character encoding, and it is in practice almost universal. It can however be as such used only up to U+FFFF, i.e. up to \uffff, but most characters that most people ever heard of fall into that area. (If you need “higher” characters, you need to use either surrogate pairs or one of the two approaches above.)
You can also construct a character using the String.fromCharCode() method, passing as a parameter the Unicode number, in decimal as in var Omega = String.fromCharCode(937) or in hexadecimal as in var Omega = String.fromCharCode(0x3A9). This works up to U+FFFF. This approach can be used even when you have the Unicode number in a variable.
One option is to put the character literally in your script, e.g.:
const omega = 'Ω';
This requires that you let the browser know the correct source encoding, see Unicode in JavaScript
However, if you can't or don't want to do this (e.g. because the character is too exotic and can't be expected to be available in the code editor font), the safest option may be to use new-style string escape or String.fromCodePoint:
const omega = '\u{3a9}';
// or:
const omega = String.fromCodePoint(0x3a9);
This is not restricted to UTF-16 but works for all unicode code points. In comparison, the other approaches mentioned here have the following downsides:
HTML escapes (const omega = '&#937';): only work when rendered unescaped in an HTML element
old style string escapes (const omega = '\u03A9';): restricted to UTF-16
String.fromCharCode: restricted to UTF-16
The answer is correct, but you don't need to declare a variable.
A string can contain your character:
"This string contains omega, that looks like this: \u03A9"
Unfortunately still those codes in ASCII are needed for displaying UTF-8, but I am still waiting (since too many years...) the day when UTF-8 will be same as ASCII was, and ASCII will be just a remembrance of the past.
I found this question when trying to implement a font-awesome style icon system in html. I have an API that provides me with a hex string and I need to convert it to unicode to match with the font-family.
Say I have the string const code = 'f004'; from my API. I can't do simple string concatenation (const unicode = '\u' + code;) since the system needs to recognize that it's unicode and this will in fact cause a syntax error if you try.
#coldfix mentioned using String.fromCodePoint but it takes a number as an argument, not a string.
To finally cross the finish line, just add parseInt and pass 16 (since hex is base 16) to it's second parameter. You'll finally get a unicode string from a simple hex string.
This is what I did:
const code = 'f004';
const toUnicode = code => String.fromCodePoint(parseInt(code, 16));
toUnicode(code);
// => '\uf004'
Try using Function(), like this:
var code = "2710"
var char = Function("return '\\u"+code+"';")()
It works well, just do not add any 's or "s or spaces.
In the example, char is "✐".

What is the best way to serialize a JavaScript object into something that can be used as a fragment identifier (url#hash)?

My page state can be described by a JavaScript object that can be serialized into JSON. But I don't think a JSON string is suitable for use in a fragment ID due to, for example, the spaces and double-quotes.
Would encoding the JSON string into a base64 string be sensible, or is there a better way? My goal is to allow the user to bookmark the page and then upon returning to that bookmark, have a piece of JavaScript read window.location.hash and change state accordingly.
I think you are on a good way. Let's write down the requirements:
The encoded string must be usable as hash, i.e. only letters and numbers.
The original value must be possible to restore, i.e. hashing (md5, sha1) is not an option.
It shouldn't be too long, to remain usable.
There should be an implementation in JavaScript, so it can be generated in the browser.
Base64 would be a great solution for that. Only problem: base64 also contains characters like - and +, so you win nothing compared to simply attaching a JSON string (which also would have to be URL encoded).
BUT: Luckily, theres a variant of base64 called base64url which is exactly what you need. It is specifically designed for the type of problem you're describing.
However, I was not able to find a JS implementation; maybe you have to write one youself – or do a bit more research than my half-assed 15 seconds scanning the first 5 Google results.
EDIT: On a second thought, I think you don't need to write an own implementation. Use a normal implementation, and simply replace the “forbidden” characters with something you find appropriate for your URLs.
Base64 is an excellent way to store binary data in text. It uses just 33% more characters/bytes than the original data and mostly uses 0-9, a-z, and A-Z. It also has three other characters that would need encoded to be stored in the URL, which are /, =, and +. If you simply used URL encoding, it would take up 300% (3x) the size.
If you're only storing the characters in the fragment of the URL, base64-encoded text it doesn't need to be re-encoded and will not change. But if you want to send the data as part of the actual URL to visit, then it matters.
As referenced by lxg, there there is a base64url variant for that. This is a modified version of base64 to replace unsafe characters to store in the URL. Here is how to encode it:
function tobase64url(s) {
return btoa(x).replace(/\+/g,'-').replace(/\//g,'_').replace(/=/g,'');
}
console.log(tobase64url('\x00\xff\xff\xf1\xf1\xf1\xff\xff\xfe'));
// Returns "AP__8fHx___-" instead of "AP//8fHx///+"
And to decode a base64 string from the URL:
function frombase64url(s) {
return atob(x.replace(/-/g,'+').replace(/_/g, '/'));
}
Use encodeURIComponent and decodeURIComponent to serialize data for the fragment (aka hash) part of the URL.
This is safe because the character set output by encodeURIComponent is a subset of the character set allowed in the fragment. Specifically, encodeURIComponent escapes all characters except:
A - Z
a - z
0 - 9
- . _ ~ ! ' ( ) *
So the output includes the above characters, plus escaped characters, which are % followed by hexadecimal digits.
The set of allowed characters in the fragment is:
A - Z
a - z
0 - 9
? / : # - . _ ~ ! $ & ' ( ) * + , ; =
percent-encoded characters (a % followed by hexadecimal digits)
This set of allowed characters includes all the characters output by encodeURIComponent, plus a few other characters.

Insert Unicode character into JavaScript

I need to insert an Omega (Ω) onto my html page. I am using its HTML escaped code to do that, so I can write Ω and get Ω. That's all fine and well when I put it into a HTML element; however, when I try to put it into my JS, e.g. var Omega = Ω, it parses that code as JS and the whole thing doesn't work. Anyone know how to go about this?
I'm guessing that you actually want Omega to be a string containing an uppercase omega? In that case, you can write:
var Omega = '\u03A9';
(Because Ω is the Unicode character with codepoint U+03A9; that is, 03A9 is 937, except written as four hexadecimal digits.)
Edited to add (in 2022): There now exists an alternative form that better supports codepoints above U+FFFF:
let Omega = '\u{03A9}';
let desertIslandEmoji = '\u{1F3DD}';
Judging from https://caniuse.com/mdn-javascript_builtins_string_unicode_code_point_escapes, most or all browsers added support for it in 2015, so it should be reasonably safe to use.
Although #ruakh gave a good answer, I will add some alternatives for completeness:
You could in fact use even var Omega = 'Ω' in JavaScript, but only if your JavaScript code is:
inside an event attribute, as in onclick="var Omega = '&#937';
alert(Omega)" or
in a script element inside an XHTML (or XHTML + XML) document
served with an XML content type.
In these cases, the code will be first (before getting passed to the JavaScript interpreter) be parsed by an HTML parser so that character references like Ω are recognized. The restrictions make this an impractical approach in most cases.
You can also enter the Ω character as such, as in var Omega = 'Ω', but then the character encoding must allow that, the encoding must be properly declared, and you need software that let you enter such characters. This is a clean solution and quite feasible if you use UTF-8 encoding for everything and are prepared to deal with the issues created by it. Source code will be readable, and reading it, you immediately see the character itself, instead of code notations. On the other hand, it may cause surprises if other people start working with your code.
Using the \u notation, as in var Omega = '\u03A9', works independently of character encoding, and it is in practice almost universal. It can however be as such used only up to U+FFFF, i.e. up to \uffff, but most characters that most people ever heard of fall into that area. (If you need “higher” characters, you need to use either surrogate pairs or one of the two approaches above.)
You can also construct a character using the String.fromCharCode() method, passing as a parameter the Unicode number, in decimal as in var Omega = String.fromCharCode(937) or in hexadecimal as in var Omega = String.fromCharCode(0x3A9). This works up to U+FFFF. This approach can be used even when you have the Unicode number in a variable.
One option is to put the character literally in your script, e.g.:
const omega = 'Ω';
This requires that you let the browser know the correct source encoding, see Unicode in JavaScript
However, if you can't or don't want to do this (e.g. because the character is too exotic and can't be expected to be available in the code editor font), the safest option may be to use new-style string escape or String.fromCodePoint:
const omega = '\u{3a9}';
// or:
const omega = String.fromCodePoint(0x3a9);
This is not restricted to UTF-16 but works for all unicode code points. In comparison, the other approaches mentioned here have the following downsides:
HTML escapes (const omega = '&#937';): only work when rendered unescaped in an HTML element
old style string escapes (const omega = '\u03A9';): restricted to UTF-16
String.fromCharCode: restricted to UTF-16
The answer is correct, but you don't need to declare a variable.
A string can contain your character:
"This string contains omega, that looks like this: \u03A9"
Unfortunately still those codes in ASCII are needed for displaying UTF-8, but I am still waiting (since too many years...) the day when UTF-8 will be same as ASCII was, and ASCII will be just a remembrance of the past.
I found this question when trying to implement a font-awesome style icon system in html. I have an API that provides me with a hex string and I need to convert it to unicode to match with the font-family.
Say I have the string const code = 'f004'; from my API. I can't do simple string concatenation (const unicode = '\u' + code;) since the system needs to recognize that it's unicode and this will in fact cause a syntax error if you try.
#coldfix mentioned using String.fromCodePoint but it takes a number as an argument, not a string.
To finally cross the finish line, just add parseInt and pass 16 (since hex is base 16) to it's second parameter. You'll finally get a unicode string from a simple hex string.
This is what I did:
const code = 'f004';
const toUnicode = code => String.fromCodePoint(parseInt(code, 16));
toUnicode(code);
// => '\uf004'
Try using Function(), like this:
var code = "2710"
var char = Function("return '\\u"+code+"';")()
It works well, just do not add any 's or "s or spaces.
In the example, char is "✐".

ASCII Text String Shortening

I am not really interested in security or anything of that nature, but I need some function(s) that allow me to "compress"/"decompress" a string. I have tried Base64, but that has a big issue with the size of the string, it makes it longer. I also know about this Huffman stuff, but that doesn't work either because it too makes it longer (less in terms of memory, it is an integer).
In other words, I want some arbitrary string 'djshdjkash' to be encoded to some other string 'dhaldhnctu'. Be able to go from one to another, and have the new string's length be equal to or less than the original.
Is this possible with Javascript, has it already been done?
Needed to clarify, as I said security is not the objective, just to disguise the string and keeps its length (or shorten it). Base64 is the best example, but it makes strings longer. ROT13 is neat, but doesn't cover all ASCII characters, only letters.
You need compression, not encoding. Encoding generally adds bits. Google "String Compression Algorithms."
Since ROT13 is out because it only affects alphas, why not just implement something across a larger character set. Set up a from array of characters containing your entire printable character set and a to array containing the same characters in a different order.
Then for every character in your string, if it's in the from array, replace it with the equivalent position in the to array.
This yields no compression at all but will satisfy all your requirements (shorter or same length, disguised string).
In pseudo-code, something like:
chfrom = "ABCDEF..."
chto = "1$#zX^..."
def encode(s1):
s2 = ""
foreach ch in s1:
idx = chfrom.find(ch)
if idx == -1:
s2 += ch
else:
s2 += chto[idx]
return s2
def decode(s1):
# same as encode but swap chfrom and chto.
ROT13?
http://en.wikipedia.org/wiki/ROT13
I'm not sure what exactly you want to compress. If it is the length of the string (as seen by String.length(), you could compress two ASCII characters into a Unicode character. So a string like hello, world (12 characters) might result in \u6865\u6c6c\u6f2c\u206f\u6f72\u6c64 (6 characters). You have to be very careful though that you don't generate invalid characters like \uFFFF and that you can always go back from the compressed string to the uncompressed one.
On the other hand, if you want to reduce the length of the string literal, this way is completely wrong. So please clarify under what circumstances you want to compress the strings.
You can use a simple substitution cipher. Here's an example in JavaScript.
Note that there are tools out there to break substitution ciphers. Make sure security isn't an issue here before going down this path.

Categories

Resources