Compare strings with different encodings

I've just needed to compare two strings in JavaScript, and the comparison of specific strings sometimes failed.
One value was obtained with jQuery via the text() method (from some auto-generated HTML):
var value1 = $('#somelement').text();
The other value is hardcoded in a JavaScript file (from me).
After some testing I found that these strings have different encodings, which became clear when I logged them with the escape() function.
Firebug showed me something like this:
console.log(escape(value1));
"blabla%A0%28blub%29"
console.log(escape(value2));
"blabla%20%28blub%29"
So in the end it's the whitespace with different encodings that made my comparison fail.
So my question is: how to handle this correctly? Can I just replace the whitespace to be equal? But I guess there are other control characters - like tab, return and so on - which could mess up my comparison?

"So in the end it's the whitespace with different encodings that made my comparison fail."
No, it is not a different encoding. It is just a different whitespace character: a non-breaking space (U+00A0, which escape() shows as %A0).
"Can I just replace the whitespace to be equal? But I guess there are other control characters - like tab, return and so on - which could mess up my comparison?"
You can replace all of them. You might want to try something like
value1.replace(/\s+/g, " ").replace(/^\s+|\s+$/g, "") === value2
which collapses each run of whitespace (of all kinds, including returns and non-breaking spaces) into a single space and also trims the string before the comparison.
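As a minimal sketch of the same idea, the normalization can be wrapped in a helper and applied to both sides, so it does not matter which string carries the odd whitespace. The names normalizeWhitespace and sameText are made up for this example, and it assumes an ES5-or-later engine, where \s and trim() both cover U+00A0.

```javascript
// Collapse every run of whitespace (spaces, tabs, newlines, U+00A0, ...)
// into one ordinary space, then drop leading and trailing whitespace.
function normalizeWhitespace(text) {
  return text.replace(/\s+/g, " ").trim();
}

// Hypothetical helper: compare two strings after normalizing both sides.
function sameText(a, b) {
  return normalizeWhitespace(a) === normalizeWhitespace(b);
}

var value1 = "blabla\u00A0(blub)"; // as read from the generated HTML
var value2 = "blabla (blub)";      // as hardcoded in the script
console.log(value1 === value2);        // false
console.log(sameText(value1, value2)); // true
```

Normalizing both operands keeps the comparison symmetric, which helps if the hardcoded value ever picks up a stray tab or trailing newline.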

Related

Text encoding that produces legible encodings suitable as Javascript identifiers?

I'm working on a tool that reads arbitrary data files and creates a table out of its data which I then store in a database. I'd like to preserve the column headers. The column headers are already ASCII text (or maybe latin1), but they have characters that aren't valid variable names (e.g., spaces, %), so I need to encode them somehow. I'm looking for an encoding for the column titles that has these properties:
Legible: it would be nice if the encoded text looked as similar as possible to the unencoded text (i.e., for debugging).
Legal identifier: I'd like the encoded text to be a valid JavaScript identifier (ECMA-262 Section 7.6).
Invertible: I'd like to be able to get the exact original text back from the encoded text.
I can think of approaches that work for 2 of the 3 cases, but I don't know how to get all 3. E.g., URL encoding doesn't produce legal identifier names. I think I could transform base64 to be legal, but it isn't legible. What I've got currently just does some substitutions, so it's not invertible.
Efficiency isn't a concern, so if necessary, I could store the encoded and unencoded texts together. The best option I can think of is to use url encoding and then swap percents for $. I thought there would be better options than this though, but I can't find anything. Is there anything better?

This pair of methods relying on Guava's PercentEscaper seems to meet my requirements. Guava doesn't provide an unescaper, but given my simple needs here, I can just use a simple URLDecoder.
import com.google.common.net.PercentEscaper;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// No extra safe characters: only a-z, A-Z and 0-9 pass through unescaped.
private static final PercentEscaper escaper = new PercentEscaper("", false);

static String getIdentifier(String str) {
    // Minimal safe characters, but leaves letters alone, so it's somewhat legible.
    String escaped = escaper.escape(str);
    // JavaScript identifiers can't start with a digit, and the escaper doesn't know the
    // first character has different rules, so prepend "%3" to percent-encode the digit
    // (the digits 0-9 are %30-%39).
    if (!escaped.isEmpty() && Character.isDigit(escaped.charAt(0))) {
        escaped = "%3" + escaped;
    }
    // A percent sign isn't valid in a JavaScript identifier, so we'll use _ as our special character.
    escaped = escaped.replace('%', '_');
    return escaped;
}

static String invertIdentifier(String str) throws UnsupportedEncodingException {
    String unescaped = str.replace('_', '%');
    unescaped = URLDecoder.decode(unescaped, "UTF-8");
    return unescaped;
}
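For comparison, here is a rough JavaScript sketch of the same scheme without Guava. It is not the code above ported line by line: the function names toIdentifier and fromIdentifier are invented, and it assumes the headers are ASCII or latin1 (character codes up to 0xFF), as the question states. Letters and non-leading digits pass through; everything else, including _ itself, becomes _ followed by two hex digits.

```javascript
// Encode text as a legal, mostly legible JavaScript identifier.
// Assumption: every character code is <= 0xFF (ASCII / latin1).
function toIdentifier(text) {
  var out = "";
  for (var i = 0; i < text.length; i++) {
    var ch = text.charAt(i);
    var code = text.charCodeAt(i);
    if (code > 0xFF) {
      throw new RangeError("only ASCII/latin1 input is supported in this sketch");
    }
    var isLetter = /[A-Za-z]/.test(ch);
    var isDigit = /[0-9]/.test(ch);
    if (isLetter || (isDigit && i > 0)) {
      out += ch; // safe as-is
    } else {
      // Escape as _XX; a leading digit is escaped too (e.g. "1" -> "_31").
      var hex = code.toString(16).toUpperCase();
      out += "_" + (hex.length < 2 ? "0" + hex : hex);
    }
  }
  return out;
}

// Invert toIdentifier: every _ in the encoded form starts a two-digit hex escape.
function fromIdentifier(identifier) {
  return identifier.replace(/_([0-9A-F]{2})/g, function (match, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
}

console.log(toIdentifier("Total %"));       // Total_20_25
console.log(toIdentifier("1st col"));       // _31st_20col
console.log(fromIdentifier("_31st_20col")); // 1st col
```

One caveat: an empty header encodes to an empty string, which is not a valid identifier, so that case would need separate handling.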

Efficiently remove common patterns from a string

I am trying to write a function to calculate how likely two strings are to mean the same thing. In order to do this I am converting to lower case and removing special characters from the strings before I compare them. Currently I am removing the strings '.com' and 'the' using String.replace(substring, '') and special characters using String.replace(regex, '')
str = str.toLowerCase()
.replace('.com', '')
.replace('the', '')
.replace(/[&\/\\#,+()$~%.'":*?<>{}]/g, '');
Is there a better regex that I can use to remove the common patterns like '.com' and 'the' as well as the special characters? Or some other way to make this more efficient?
As my dataset grows I may find other common meaningless patterns that need to be removed before trying to match strings and would like to avoid the performance hit of chaining more replace functions.
Examples:
Fish & Chips? => fish chips
stackoverflow.com => stackoverflow
The Lord of the Rings => lord of rings

You can combine the replace calls into a single one with a regexp like this:
str = str.toLowerCase().replace(/\.com|the|[&\/\\#,+()$~%.'":*?<>{}]/g, '');
The different strings to remove are alternatives separated by pipes |
This makes it easy enough to add more strings to the regexp.
If you are storing the words to remove in an array, you can generate the regex using the RegExp constructor, e.g.:
var words = ["\\.com", "the"];
var rex = new RegExp(words.join("|") + "|[&\\/\\\\#,+()$~%.'\":*?<>{}]", "g");
Then reuse rex for each string:
str = str.toLowerCase().replace(rex, "");
Note the additional escaping required because instead of a regular expression literal, we're using a string, so the backslashes (in the words array and in the final bit) need to be escaped, as does the " (because I used " for the string quotes).
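A variation on this, sketched under a few assumptions: the word lists are kept as plain text and escaped by a helper, "the" is matched only as a whole word (so it is not stripped out of words like "other"), and leftover runs of spaces are collapsed so the output matches the examples in the question. The names escapeRegExp, substrings, wholeWords and normalize are invented for this sketch.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

var substrings = [".com"]; // removed wherever they occur
var wholeWords = ["the"];  // removed only as standalone words
var special = /[&\/\\#,+()$~%.'":*?<>{}]/; // same character class as above

var alternatives = substrings.map(escapeRegExp)
  .concat(wholeWords.map(function (word) {
    return "\\b" + escapeRegExp(word) + "\\b";
  }))
  .concat([special.source]);

// Built once, reused for every string.
var noise = new RegExp(alternatives.join("|"), "g");

function normalize(str) {
  return str.toLowerCase()
    .replace(noise, "")
    .replace(/\s+/g, " ") // removing words can leave double spaces
    .trim();
}

console.log(normalize("Fish & Chips?"));         // fish chips
console.log(normalize("stackoverflow.com"));     // stackoverflow
console.log(normalize("The Lord of the Rings")); // lord of rings
```

Taking the character class from a regex literal via .source avoids the double escaping that the string form needs.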

The problem with this question is that I'm sure you have a very concrete idea in your mind of what you want to do, but the solution you have arrived at (removing uninformative letters before making an is-identical comparison) may not be the best for the comparison you want to do.
I think perhaps a better idea would be to use a different comparison method and a different data structure than a string. A very simple example would be to condense your strings to sets of characters (in Python, set('string')) and then compare set similarity/difference. Another method might be to create a directed acyclic graph, or a substring trie. The main point is that it's probably OK to reduce the information from the original string and store/compare that. However, don't underestimate the value of storing the original string, as it will help you down the road if you want to change the way you compare.
Finally, if your strings are really, really long, you might want to use a perceptual hash, which is like an MD5 hash except that similar strings have similar hashes. However, you will most likely have to roll your own for short strings, and define what you think is important data and what is superfluous.
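To make the set-based suggestion concrete, here is a small sketch that scores two strings by the Jaccard similarity of their character sets (shared characters divided by all distinct characters). The choice of Jaccard and the name jaccardSimilarity are assumptions for illustration; the answer itself does not name a specific measure, and sets of words or n-grams would work the same way.

```javascript
// Jaccard similarity of the character sets of two strings:
// |A intersect B| / |A union B|, a value between 0 and 1.
function jaccardSimilarity(a, b) {
  var setA = new Set(a.toLowerCase());
  var setB = new Set(b.toLowerCase());
  if (setA.size === 0 && setB.size === 0) {
    return 1; // two empty strings are treated as identical
  }
  var shared = 0;
  setA.forEach(function (ch) {
    if (setB.has(ch)) shared++;
  });
  return shared / (setA.size + setB.size - shared);
}

console.log(jaccardSimilarity("abc", "ABC")); // 1
console.log(jaccardSimilarity("abc", "xyz")); // 0
```

Because character order and repetition are ignored, anagrams score 1, so this is only a coarse first filter.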

Why does Closure Compiler insist on adding more bytes?

If I give Closure Compiler something like this:
window.array = '0123456789'.split('');
It "compiles" it to this:
window.array="0,1,2,3,4,5,6,7,8,9".split(",");
Now as you can tell, that's bigger. Is there any reason why Closure Compiler is doing this?

I think this is what's going on, but I am by no means certain...
The code that causes the insertion of commas is tryMinimizeStringArrayLiteral in PeepholeSubstituteAlternateSyntax.java.
That method contains a list of characters that are likely to have a low Huffman encoding, and are therefore preferable to split on than other characters. You can see the result of this if you try something like this:
"a b c d e f g".split(" "); //Uncompiled, split on spaces
"a,b,c,d,e,f,g".split(","); //Compiled, split on commas (same size)
The compiler will replace the character you try to split on with one it thinks is favourable. It does so by iterating over the characters of the string and finding the most favourable splitting character that does not occur within the string:
// These delimiters are chars that appears a lot in the program therefore
// probably have a small Huffman encoding.
NEXT_DELIMITER: for (char delimiter : new char[]{',', ' ', ';', '{', '}'}) {
for (String cur : strings) {
if (cur.indexOf(delimiter) != -1) {
continue NEXT_DELIMITER;
}
}
String template = Joiner.on(delimiter).join(strings);
//...
}
In the above snippet you can see the array of characters the compiler claims to be optimal to split on. The comma is first (which is why in my space example above, the spaces have been replaced by commas).
I believe the insertion of commas in the case where the string to split on is the empty string may simply be an oversight. There does not appear to be any special treatment of this case, so it's treated like any other split call and each character is joined with the first appropriate character from the array shown in the above snippet.
Another example of how the compiler deals with the split method:
"a,;b;c;d;e;f;g".split(";"); //Uncompiled, split on semi-colons
"a, b c d e f g".split(" "); //Compiled, split on spaces
This time, since the original string already contains a comma (and we don't want to split on the comma character), the comma can't be chosen from the array of low-Huffman-encoded characters, so the next best choice is selected (the space).
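The selection logic in the Java snippet can be mimicked in a few lines of JavaScript to check both examples. This is only an illustration of the loop quoted above, not the compiler's code; pickDelimiter is a made-up name.

```javascript
// Return the first candidate delimiter that occurs in none of the strings,
// or null if every candidate is taken. Same candidate order as the Java loop.
function pickDelimiter(strings) {
  var candidates = [",", " ", ";", "{", "}"];
  for (var i = 0; i < candidates.length; i++) {
    var delimiter = candidates[i];
    var unused = strings.every(function (s) {
      return s.indexOf(delimiter) === -1;
    });
    if (unused) return delimiter;
  }
  return null;
}

var first = "a b c d e f g".split(" ");
console.log(JSON.stringify(first.join(pickDelimiter(first))));   // "a,b,c,d,e,f,g"

var second = "a,;b;c;d;e;f;g".split(";");
console.log(JSON.stringify(second.join(pickDelimiter(second)))); // "a, b c d e f g"
```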
Update
Following some further research into this, it is definitely not a bug. This behaviour is actually by design, and in my opinion it's a very clever little optimisation, when you bear in mind that the Closure compiler tends to favour the speed of the compiled code over size.
Above I mentioned Huffman encoding a couple of times. The Huffman coding algorithm, explained very simply, assigns a weight to each character appearing in the text to be encoded. The weight is based on the frequency with which each character appears. These frequencies are used to build a binary tree in which the most common characters sit closest to the root. That means the most common characters get the shortest codes.
The Huffman algorithm is a large part of the DEFLATE algorithm used by gzip, so if your web server is configured to use gzip, your users will be benefiting from this clever optimisation.

This issue was fixed on Apr 20, 2012; see revision:
https://code.google.com/p/closure-compiler/source/detail?r=1267364f742588a835d78808d0eef8c9f8ba8161

Ironically, split in the compiled code has nothing to do with split in the source. Consider:
Source : a = ["0","1","2","3","4","5"]
Compiled: a="0,1,2,3,4,5".split(",")
Here, split is just a way to represent long arrays (long enough for the sum of all the quotes and commas to be longer than split(",") ). So, what's going on in your example? First, the compiler sees a string function applied to a constant and evaluates it right away:
'0123456789'.split('') => ["0","1","2","3","4","5","6","7","8","9"]
At some later point, when generating output, the compiler considers this array to be "long" and writes it in the above "split" form:
["0","1","2","3","4","5","6","7","8","9"] => "0,1,2,3,4,5,6,7,8,9".split(",")
Note that all information about split('') in the source is already lost at this point.
If the source string were shorter, it would be generated in the array form, without extra splitting:
Source : a = '0123'.split('')
Compiled: a=["0","1","2","3"]
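The length trade-off can be checked with a quick sketch that renders an array both ways and keeps the shorter text. This is not the compiler's actual heuristic, just the arithmetic behind it; shorterForm is a hypothetical name, and the sketch assumes no element contains a comma.

```javascript
// Render an array of strings as a literal and as "…".split(","),
// and return whichever source text is shorter.
// Assumption: no element contains a comma.
function shorterForm(items) {
  var literal = JSON.stringify(items);
  var viaSplit = JSON.stringify(items.join(",")) + '.split(",")';
  return literal.length <= viaSplit.length ? literal : viaSplit;
}

console.log(shorterForm("0123".split("")));       // ["0","1","2","3"]
console.log(shorterForm("0123456789".split(""))); // "0,1,2,3,4,5,6,7,8,9".split(",")
```

Each element costs two extra quote characters in the literal form, while the split form pays a fixed 11 characters for .split(","), so the split form wins once the array is long enough.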

JavaScript/HTML/Unicode accents: á != á

I want to check if a user submitted string is the same as the string in my answer key. Sometimes the words involve Spanish accents (like in sábado), and that makes the condition always false.
I have Firebug log $('#answer').val() and it shows up as sábado. (The á comes from a button that inserts the value á, if that matters) whereas logging the answer from the answer key shows s&aacute;bado (how I wrote it in the actual answer key).
I have tried replacing the &aacute in the answer key with a normal á, but it still doesn't work, and results in a Unicode diamond-question-mark. When I do that and also replace the value of the button that makes the user-submitted á, the condition works correctly, but then the button, the user string, and the answer string all have the weird Unicode diamond-question-mark.
I have also tried using á in both places and it's no different from using á. Both my HTML and Javascript are using charset="utf-8".
How can I fix this?

If you're consistently using UTF-8, there's no need for HTML entities except to encode syntax (i.e. <, >, & and, within attributes, ").
For anything else, use the proper characters, and your problems should go away - until you run into Unicode normalization issues, i.e. the difference between 'a\u0301' and '\u00E1'...
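The normalization difference mentioned here can be handled with String.prototype.normalize, available in ES2015 and later. A minimal sketch, with sameAnswer as a made-up helper name, that brings both strings to NFC before comparing:

```javascript
var decomposed = "sa\u0301bado"; // "a" followed by a combining acute accent
var precomposed = "s\u00E1bado"; // single precomposed "á" character

console.log(decomposed === precomposed); // false: different code units

// Hypothetical helper: compare after normalizing both sides to NFC.
function sameAnswer(submitted, expected) {
  return submitted.normalize("NFC") === expected.normalize("NFC");
}

console.log(sameAnswer(decomposed, precomposed)); // true
```

This only addresses normalization; it does not help if one side still contains an undecoded entity such as &aacute; or was saved in a different charset.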

The issue is that you're not using the real UTF-8 characters in both strings (entered answer and the key). You should NOT be supplying "a button that inserts the value á" -- Re: "if that matters" it does!
The characters should be added by the keyboard input system. And your comparison string should also be only utf-8 characters. It should NOT be character entities.

ASCII Text String Shortening

I am not really interested in security or anything of that nature, but I need some function(s) that allow me to "compress"/"decompress" a string. I have tried Base64, but that has a big issue with the size of the string, it makes it longer. I also know about this Huffman stuff, but that doesn't work either because it too makes it longer (less in terms of memory, it is an integer).
In other words, I want some arbitrary string 'djshdjkash' to be encoded to some other string 'dhaldhnctu'. I want to be able to go from one to the other, and have the new string's length be equal to or less than the original.
Is this possible with JavaScript? Has it already been done?
To clarify: as I said, security is not the objective, I just want to disguise the string and keep its length (or shorten it). Base64 is the best example, but it makes strings longer. ROT13 is neat, but doesn't cover all ASCII characters, only letters.

You need compression, not encoding. Encoding generally adds bits. Google "String Compression Algorithms."

Since ROT13 is out because it only affects alphas, why not just implement something across a larger character set. Set up a from array of characters containing your entire printable character set and a to array containing the same characters in a different order.
Then for every character in your string, if it's in the from array, replace it with the equivalent position in the to array.
This yields no compression at all but will satisfy all your requirements (shorter or same length, disguised string).
In pseudo-code, something like:
chfrom = "ABCDEF..."
chto = "1$#zX^..."
def encode(s1):
    s2 = ""
    foreach ch in s1:
        idx = chfrom.find(ch)
        if idx == -1:
            s2 += ch
        else:
            s2 += chto[idx]
    return s2
def decode(s1):
    # same as encode but swap chfrom and chto.
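A runnable JavaScript sketch of that pseudo-code follows. The makeCipher name is invented, and the example alphabets are an assumption: printable ASCII (codes 33 to 126) mapped onto the same characters rotated by 47 places, which is the well-known ROT47 variant of ROT13. Any other reordering of the alphabet would work equally well.

```javascript
// Build encode/decode functions from two equal-length alphabets.
// Characters not found in the source alphabet are passed through unchanged.
function makeCipher(from, to) {
  if (from.length !== to.length) {
    throw new Error("alphabets must have the same length");
  }
  var forward = new Map();
  var backward = new Map();
  for (var i = 0; i < from.length; i++) {
    forward.set(from.charAt(i), to.charAt(i));
    backward.set(to.charAt(i), from.charAt(i));
  }
  function translate(text, table) {
    return Array.from(text, function (ch) {
      return table.has(ch) ? table.get(ch) : ch;
    }).join("");
  }
  return {
    encode: function (text) { return translate(text, forward); },
    decode: function (text) { return translate(text, backward); }
  };
}

// Example alphabets (an assumption): printable ASCII rotated by 47, i.e. ROT47.
var chfrom = "";
for (var code = 33; code <= 126; code++) chfrom += String.fromCharCode(code);
var chto = chfrom.slice(47) + chfrom.slice(0, 47);

var cipher = makeCipher(chfrom, chto);
var disguised = cipher.encode("djshdjkash");
console.log(disguised.length === "djshdjkash".length);    // true
console.log(cipher.decode(disguised) === "djshdjkash");   // true
```

The output always has the same length as the input, so this disguises the string but never shortens it.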

ROT13?
http://en.wikipedia.org/wiki/ROT13

I'm not sure what exactly you want to compress. If it is the length of the string (as seen by the String length property), you could compress two ASCII characters into one Unicode character. So a string like hello, world (12 characters) might result in \u6865\u6c6c\u6f2c\u2077\u6f72\u6c64 (6 characters). You have to be very careful though that you don't generate invalid characters like \uFFFF and that you can always go back from the compressed string to the uncompressed one.
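A sketch of that packing scheme, under stated assumptions: the input is 7-bit ASCII without NUL characters, the first character of each pair goes in the high byte, and an odd trailing character is stored with a zero low byte. The function names pack and unpack are made up. With these limits the largest code unit produced is 0x7F7F, so neither surrogates nor \uFFFF can appear.

```javascript
// Pack pairs of ASCII characters into single UTF-16 code units.
// Assumption: input contains only character codes 0x01..0x7F.
function pack(ascii) {
  if (!/^[\x01-\x7F]*$/.test(ascii)) {
    throw new RangeError("input must be 7-bit ASCII without NUL characters");
  }
  var out = "";
  for (var i = 0; i < ascii.length; i += 2) {
    var high = ascii.charCodeAt(i);
    // An odd trailing character is paired with a zero low byte.
    var low = i + 1 < ascii.length ? ascii.charCodeAt(i + 1) : 0;
    out += String.fromCharCode((high << 8) | low);
  }
  return out;
}

// Reverse pack: split each code unit back into its two bytes.
function unpack(packed) {
  var out = "";
  for (var i = 0; i < packed.length; i++) {
    var unit = packed.charCodeAt(i);
    out += String.fromCharCode(unit >> 8);
    var low = unit & 0xFF;
    if (low !== 0) out += String.fromCharCode(low); // zero marks padding
  }
  return out;
}

var packed = pack("hello, world");
console.log(packed.length);                     // 6
console.log(unpack(packed) === "hello, world"); // true
```

This halves the length property only. Stored as UTF-8, most of the packed characters take three bytes each, so the byte size usually grows.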
On the other hand, if you want to reduce the length of the string literal, this way is completely wrong. So please clarify under what circumstances you want to compress the strings.

You can use a simple substitution cipher. Here's an example in JavaScript.
Note that there are tools out there to break substitution ciphers. Make sure security isn't an issue here before going down this path.
