Comparing strings with localeCompare vs ===? - javascript

I ran into a pretty strange issue with my latest JS project. I usually compare strings using ===, but when comparing the string properties of two different objects I got false even though they were exactly the same strings. I tested this in my Node.js interpreter by doing the following:
> x = {str: 'hello'}
{ str: 'hello' }
> y = {str: 'hello'}
{ str: 'hello' }
> y.str === x.str
true
So I couldn't figure out why my code wasn't working. But when I switched from === to str1.localeCompare, BOOM, it works. What's the difference between the two?

=== checks that the two strings are exactly the same sequence of UTF-16 code units (same length, same code unit at each position).
.localeCompare() allows for the fact that you may want to ignore certain differences in the strings (such as punctuation, diacriticals or case) and still have them compare as equal, or that you may want to ignore certain differences when deciding which string comes before the other. It also provides lots of options to control which comparison features are or are not used.
If you read the MDN documentation for String.prototype.localeCompare(), you can see a whole bunch of options you can pass in to control how the comparison works. On a plain ASCII string with no special characters and all in the same case, you are unlikely to see a difference, but once you get into diacriticals or case issues, localeCompare() has both more features and more options to control the comparison.
Some of the options available for controlling the comparison:
numeric collation
diacritical sensitivity
ability to ignore punctuation
case first: control whether upper or lower case compares first
In addition, localeCompare() returns a value (negative, 0 or positive) that is perfectly aligned to use with a .sort() callback.
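As a rough illustration of the difference (a sketch, not taken from the original answer): the sample strings and the 'en' locale below are assumptions chosen for the demo, and the code presumes a runtime with Intl support such as a current Node.js or browser.

```javascript
// Strict equality: true only when both strings hold the same sequence of code units.
const strictEqual = 'r\u00e9sum\u00e9' === 'resume'; // "résumé" vs "resume"

// sensitivity: 'base' ignores accent and case differences; 0 means "compares equal".
const baseCompare = 'r\u00e9sum\u00e9'.localeCompare('RESUME', 'en', { sensitivity: 'base' });

// The return value plugs straight into a sort() callback.
// numeric: true orders embedded digits by value, so "item2" sorts before "item10".
const files = ['item10', 'item2', 'item1'];
const plainSorted = [...files].sort();
const numericSorted = [...files].sort((a, b) => a.localeCompare(b, 'en', { numeric: true }));

console.log(strictEqual, baseCompare, plainSorted, numericSorted);
```

If === reports false for strings that look identical, logging each string's length and code points is usually the quickest way to find the hidden difference.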

Related

How can I convert this UTF-8 string to plain text in javascript and how can a normal user write it in a textarea [duplicate]

While reviewing JavaScript concepts, I found String.normalize(). This is not something that shows up in W3School's "JavaScript String Reference", and, hence, it is the reason I might have missed it before.
I found more information about it in HackerRank which states:
Returns a string containing the Unicode Normalization Form of the
calling string's value.
With the example:
var s = "HackerRank";
console.log(s.normalize());
console.log(s.normalize("NFKC"));
having as output:
HackerRank
HackerRank
Also, in GeeksForGeeks:
The string.normalize() is an inbuilt function in javascript which is
used to return a Unicode normalisation form of a given input string.
with the example:
<script>
// Taking a string as input.
var a = "GeeksForGeeks";
// calling normalize function.
b = a.normalize('NFC')
c = a.normalize('NFD')
d = a.normalize('NFKC')
e = a.normalize('NFKD')
// Printing normalised form.
document.write(b +"<br>");
document.write(c +"<br>");
document.write(d +"<br>");
document.write(e);
</script>
having as output:
GeeksForGeeks
GeeksForGeeks
GeeksForGeeks
GeeksForGeeks
Maybe the examples given are just really bad as they don't allow me to see any change.
I wonder... what's the point of this method?
It depends on what you will do with the strings: often you do not need it (if you are just taking input from the user and showing it back to the user). But to check, search, or use such strings as keys, you may want a unique way to identify the same string (semantically speaking).
The main problem is that you may have two strings which are semantically the same but have two different representations: e.g. one with an accented character [one code point], and one with a character combined with an accent [one code point for the character, one for the combining accent]. Users may not be in control of how the input text is sent, so you may end up with two different user names, or two different passwords. Also, if you transform the data, you may get different results depending on the initial string. Users do not like that.
Another problem is the order of combining characters. You may have an accent and a lower tail (e.g. a cedilla), and you may express this with several combinations: "pure char, tail, accent", "pure char, accent, tail", "char+tail, accent", "char+accent, cedilla".
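The four combinations can be tried out directly. This is a minimal sketch that assumes "c with cedilla and acute" as the example character; the answer itself does not name one.

```javascript
// Four ways to write "c with cedilla and acute", following the combinations listed above.
const forms = [
  'c\u0327\u0301', // pure char, tail (cedilla), accent (acute)
  'c\u0301\u0327', // pure char, accent, tail
  '\u00e7\u0301',  // char+tail (c with cedilla), accent
  '\u0107\u0327',  // char+accent (c with acute), cedilla
];

// As raw strings they are all different...
const distinctRaw = new Set(forms).size;
// ...but NFC puts the combining marks in canonical order and composes them.
const distinctNfc = new Set(forms.map((s) => s.normalize('NFC'))).size;

console.log(distinctRaw, distinctNfc);
```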
And you may have degenerate cases (especially if you type from a keyboard): you may get code points which should be removed (you may have an infinitely long string which could be equivalent to a few bytes).
In any case, for sorting strings, you (or your library) require a normalized form: if you already provide the right form, the library will not need to transform it again.
So: you want the same (semantically speaking) string to have the same sequence of Unicode code points.
Note: if you are working directly on UTF-8, you should also care about special cases of UTF-8: the same code point could be written in different ways [using more bytes]. This could also be a security problem.
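To illustrate the UTF-8 note, here is a small sketch using the standard TextDecoder API (available as a global in browsers and recent Node.js). The byte values are an assumed example: 0xC0 0xAF is a well-known overlong encoding of "/".

```javascript
// 0x2F is the normal one-byte encoding of "/".
const normal = new Uint8Array([0x2f]);
// 0xC0 0xAF is an "overlong" two-byte encoding of the same code point; it is invalid UTF-8.
const overlong = new Uint8Array([0xc0, 0xaf]);

const strict = new TextDecoder('utf-8', { fatal: true });
const decodedNormal = strict.decode(normal);

let overlongRejected = false;
try {
  strict.decode(overlong);
} catch (err) {
  // With fatal: true, malformed input raises a TypeError instead of being decoded.
  overlongRejected = err instanceof TypeError;
}
console.log(decodedNormal, overlongRejected);
```

A decoder that accepted the overlong form would let "/" slip past checks done on the raw bytes, which is the kind of security problem the note hints at.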
The K forms are often used for "searches" and similar tasks: CO2 and CO₂ will be interpreted in the same manner. But this could change the meaning of the text, so it should usually be used only internally, for temporary tasks, while keeping the original text.
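A small sketch of that CO2 / CO₂ case. The matches() helper is hypothetical and only illustrates "normalize internally, keep the original text".

```javascript
const original = 'CO\u2082'; // "CO" followed by U+2082 SUBSCRIPT TWO

// Canonical normalization (NFC) keeps the subscript digit.
const nfc = original.normalize('NFC');
// Compatibility normalization (NFKC) folds it into a plain "2".
const nfkc = original.normalize('NFKC');

// Hypothetical search helper: compare folded keys, keep the original text for display.
function matches(text, query) {
  return text.normalize('NFKC').includes(query.normalize('NFKC'));
}
const found = matches('Emissions of CO\u2082 rose', 'CO2');
console.log(nfc === original, nfkc, found);
```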
As stated in the MDN documentation, String.prototype.normalize() returns the Unicode Normalization Form of the string. This is because in Unicode, some characters can have more than one code point representation.
This is the example (taken from MDN):
const name1 = '\u0041\u006d\u00e9\u006c\u0069\u0065';
const name2 = '\u0041\u006d\u0065\u0301\u006c\u0069\u0065';
console.log(`${name1}, ${name2}`);
// expected output: "Amélie, Amélie"
console.log(name1 === name2);
// expected output: false
console.log(name1.length === name2.length);
// expected output: false
const name1NFC = name1.normalize('NFC');
const name2NFC = name2.normalize('NFC');
console.log(`${name1NFC}, ${name2NFC}`);
// expected output: "Amélie, Amélie"
console.log(name1NFC === name2NFC);
// expected output: true
console.log(name1NFC.length === name2NFC.length);
// expected output: true
As you can see, the string Amélie has two different Unicode representations. With normalization, we can reduce the two forms to the same string.
Very beautifully explained here --> https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize
Short answer: characters are represented through an encoding scheme such as ASCII or UTF-8 (we mostly use UTF-8), and some characters have more than one representation. So two strings may render identically while their underlying Unicode code points differ, and a string comparison may fail. We use normalize() to convert both to a single representation.
// source from MDN
let string1 = '\u00F1'; // ñ
let string2 = '\u006E\u0303'; // ñ
string1 = string1.normalize('NFC');
string2 = string2.normalize('NFC');
console.log(string1 === string2); // true
console.log(string1.length); // 1
console.log(string2.length); // 1
Normalization of strings isn't exclusive to JavaScript; see, for instance, Python. The valid values for the argument are defined by Unicode (more on Unicode normalization).
When it comes to JavaScript, note that there's documentation for both String.normalize() and String.prototype.normalize(). As @ChrisG mentions:
String.prototype.normalize() is correct in a technical sense, because
normalize() is a dynamic method you call on instances, not the class
itself. The point of normalize() is to be able to compare Strings that
look the same but don't consist of the same characters, as shown in
the example code on MDN.
Then, when it comes to its usage, I found a great example of String.normalize() that has
let s1 = 'sabiá';
let s2 = 'sabiá';
// one is in NFC, the other in NFD, so they're different
console.log(s1 == s2); // false
// with normalization, they become the same
console.log(s1.normalize('NFC') === s2.normalize('NFC')); // true
// transform string into array of codepoints
function codepoints(s) { return Array.from(s).map(c => c.codePointAt(0).toString(16)); }
// printing the codepoints you can see the difference
console.log(codepoints(s1)); // [ "73", "61", "62", "69", "e1" ]
console.log(codepoints(s2)); // [ "73", "61", "62", "69", "61", "301" ]
So while the two spellings of sabiá in this example look the same to the human eye, even when printed with console.log(), comparing them without normalization gives different results. By analyzing the code points, we can see why: they are different.
There are some great answers here already, but I wanted to throw in a practical example.
I enjoy Bible translation as a hobby. I wasn't too thrilled at the flashcard option out there in the wild in my price range (free) so I made my own. The problem is, there is more than one way to do Hebrew and Greek in Unicode to get the exact same thing. For example:
בָּא
בָּא
These should look identical on your screen, and for all practical purposes they are identical. However, the first was typed with the qamats (the little t shaped thing under it) before the dagesh (the dot in the middle of the letter) and the second was typed with the dagesh before the qamats. Now, since you're just reading this, you don't care. And your web browser doesn't care. But when my flashcards compare the two, then they aren't the same. To the code behind the scenes, it's no different than saying "center" and "centre" are the same.
Similarly, in Greek:
ἀ
ἀ
These two should look nearly identical, but the top is one Unicode character and the second one is two Unicode characters. Which one is going to end up typed in my flashcards is going to depend on which keyboard I'm sitting at.
When I'm adding flashcards, believe it or not, I don't always type in vocab lists of 100 words. That's why God gave us spreadsheets. And sometimes the places I'm importing the lists from do it one way, and sometimes they do it the other way, and sometimes they mix it. But when I'm typing, I'm not trying to memorize the order in which the dagesh or qamats appear or whether the accents are typed as a separate character or not. Regardless of whether I remember to type the dagesh first or not, I want to get the right answer, because really it's the same answer in every practical sense either way.
So I normalize the order before saving the flashcards and I normalize the order before checking it, and the result is that it doesn't matter which way I type it, it comes out right!
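The idea can be sketched roughly as follows. The sameAnswer() helper is hypothetical (the flashcard site's actual code is not shown here), and the strings are written with escapes so the two typing orders stay distinct.

```javascript
// bet + qamats + dagesh + alef, and the same word with the dagesh typed before the qamats.
const qamatsFirst = '\u05d1\u05b8\u05bc\u05d0';
const dageshFirst = '\u05d1\u05bc\u05b8\u05d0';

// Alpha with smooth breathing: one precomposed character vs. letter + combining mark.
const greekComposed = '\u1f00';
const greekDecomposed = '\u03b1\u0313';

// Hypothetical answer check: normalize both sides before comparing.
function sameAnswer(expected, typed) {
  return expected.normalize('NFC') === typed.normalize('NFC');
}

const hebrewRawEqual = qamatsFirst === dageshFirst;
const hebrewNormalizedEqual = sameAnswer(qamatsFirst, dageshFirst);
const greekRawEqual = greekComposed === greekDecomposed;
const greekNormalizedEqual = sameAnswer(greekComposed, greekDecomposed);
console.log(hebrewRawEqual, hebrewNormalizedEqual, greekRawEqual, greekNormalizedEqual);
```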
If you want to check out the results:
https://sthelenskungfu.com/flashcards/
You need a Google or Facebook account to log in, so it can track progress and such. As far as I know (or care) only my daughter and I currently use it.
It's free, but eternally in beta.

How to explain such esoteric JS code

This code is equal to alert(1), but how does it work ? I don't see eval here.
/ㅤ/-[ㅤ=''],ᅠ=!ㅤ+ㅤ,ㅤㅤ=!ᅠ+ㅤ,ㅤᅠ=ㅤ+{},ᅠㅤ=ᅠ[ㅤ++],ᅠᅠ=ᅠ[ᅠㅤㅤ=ㅤ
],ᅠㅤᅠ=++ᅠㅤㅤ+ㅤ,ㅤㅤㅤ=ㅤᅠ[ᅠㅤㅤ+ᅠㅤᅠ],ᅠ[ㅤㅤㅤ+=ㅤᅠ[ㅤ]+(ᅠ.ㅤㅤ+ㅤᅠ)[ㅤ]+ㅤㅤ[ᅠㅤᅠ]+ᅠㅤ+ᅠᅠ+ᅠ
[ᅠㅤㅤ]+ㅤㅤㅤ+ᅠㅤ+ㅤᅠ[ㅤ]+ᅠᅠ][ㅤㅤㅤ](ㅤㅤ[ㅤ]+ㅤㅤ[ᅠㅤㅤ]+ᅠ[ᅠㅤᅠ]+ᅠᅠ+ᅠㅤ+"(ㅤ)")()
This is JSFuck, an esoteric programming language, that is actually valid JavaScript, so you don't need any special interpreter/compiler to run it.
The most popular one involves the use of just 6 characters ([]()!+), but yours is a bit different since it also uses /, =, ", ', ,, {, } and (blank).
It works by taking advantage of some nice features of JavaScript.
For instance, we know that [] is a truthy value, therefore ![] yields false.
With that same logic, we can get true by executing !![].
Numbers can be achieved too. We know that false is equal to 0, so the following expression makes sense: 0 + false == 0, right ? And it does. We know that false can be written as ![], and we know that we can omit the 0 on the left-side of the expression: +![] == 0.
Same can be said with true and 1: +!![]
The number 2 can be achieved by adding up two 1s: (+!![])+(+!![]), and so on.
With logic like this you can do pretty much anything.
For instance, a popular way to get the letter "a" is by producing a NaN result, converting it to string ("NaN"), and then getting the letter at index 1, which is "a".
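Putting those building blocks together, a minimal sketch (the variable names are only there for readability; real JSFuck inlines everything):

```javascript
const zero = +[];             // [] -> "" -> 0
const one = +!![];            // !![] is true, unary + turns it into 1
const two = one + one;        // numbers add up as usual
const nan = +[![]];           // [false] -> "false" -> NaN
const nanText = nan + [];     // NaN + "" -> "NaN"
const letterA = nanText[one]; // character at index 1
console.log(zero, one, two, nanText, letterA);
```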
Ok so.. We know we can get "alert(1)", but how do we execute this?
We can't use eval, because that would require characters not allowed in JSFuck.
Well, the way most people do it is like this:
Identify a well-known function of Array.prototype, let's say indexOf
Obtain its constructor, which is the Function constructor
Pass stringified code to this constructor
Execute the result
So, as a summary:
// You can try this on your browser!
[]["indexOf"]["constructor"]("alert(1)")()
We know that we can generate alphabetic characters on JSFuck, and we also know we can generate numbers, so that line of code up there is actually very possible.
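To see why the constructor trick works, here is a small sketch. It uses a harmless function body instead of alert(1) so that it also runs outside a browser.

```javascript
// Any function will do as a starting point; its constructor is Function itself.
const FunctionCtor = []['indexOf']['constructor'];
const isFunctionCtor = FunctionCtor === Function;

// Function(body) builds a new function from source text; calling it runs that code.
const answer = FunctionCtor('return 6 * 7')();
console.log(isFunctionCtor, answer);
```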

Why does Number('') returns 0 whereas parseInt('') returns NaN

I have gone through similar questions and answers on StackOverflow and found this:
parseInt("123hui")
returns 123
Number("123hui")
returns NaN
Since parseInt() parses up to the first non-digit and returns whatever it has parsed, while Number() tries to convert the entire string into a number, why do parseInt('') and Number('') behave differently?
I feel ideally parseInt should return NaN just like it does with Number("123hui")
Now my next question:
As 0 == '' returns true I believe it interprets like 0 == Number('') which is true. So does the compiler really treat it like 0 == Number('') and not like 0 == parseInt('') or am I missing some points?
The difference is due in part to Number() making use of additional logic for type coercion. Included in the rules it follows for that is:
A StringNumericLiteral that is empty or contains only white space is converted to +0.
Whereas parseInt() is defined to simply find and evaluate numeric characters in the input, based on the given or detected radix. And, it was defined to expect at least one valid character.
13) If S contains a code unit that is not a radix-R digit, let Z be the substring of S consisting of all code units before the first such code unit; otherwise, let Z be S.
14) If Z is empty, return NaN.
Note: 'S' is the input string after any leading whitespace is removed.
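The two sets of rules can be compared side by side. A quick sketch, with the radix passed explicitly to keep parseInt() unambiguous:

```javascript
const results = {
  numberEmpty: Number(''),               // empty string -> 0
  numberBlank: Number('   '),            // whitespace only -> 0
  numberMixed: Number('123hui'),         // the whole string must be numeric -> NaN
  parseIntEmpty: parseInt('', 10),       // no digits to read -> NaN
  parseIntMixed: parseInt('123hui', 10), // stops at the first non-digit -> 123
  parseIntPadded: parseInt('  42', 10),  // leading whitespace is skipped -> 42
};
console.log(results);
```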
As 0=='' returns true I believe it interprets like 0==Number('') [...]
The rules that == uses are defined as Abstract Equality.
And, you're right about the coercion/conversion that's used. The relevant step is #6:
If Type(x) is Number and Type(y) is String,
return the result of the comparison x == ToNumber(y).
To answer your question about 0 == '' returning true:
Below is the comparison of a number and string:
The Equals Operator (==)
Type(x)                      Type(y)    Result
---------------------------------------------------------------------
x and y are the same type               Strict Equality (===) Algorithm
Number                       String     x == toNumber(y)
and toNumber does the following to a string argument:
toNumber:
Argument type    Result
---------------------------------------------------
String           In effect evaluates Number(string)
                 "abc" -> NaN
                 "123" -> 123
Number('') returns 0. So that leaves you with 0==0 which is evaluated using Strict Equality (===) Algorithm
The Strict Equals Operator (===)
Type      Values                 Result
----------------------------------------------------------
Number    x same value as y      true
          (but not NaN)
You can find the complete list at javascriptweblog.wordpress.com - truth-equality-and-javascript.
parseInt("") is NaN because the standard says so even if +"" is 0 instead (also simply because the standard says so, implying for example that "" == 0).
Don't look for logic in this because there's no deep profound logic, just history.
You are in my opinion making a BIG mistake... the sooner you correct it the better will be for your programming life with Javascript. The mistake is that you are assuming that every choice made in programming languages and every technical detail about them is logical. This is simply not true.
Especially for Javascript.
Please remember that Javascript was "designed" in a rush and, just because of fate, it became extremely popular overnight. This forced the community to standardize it before any serious thought was given to the details, and therefore it was basically "frozen" in its current sad state before any serious testing in the field.
There are parts that are so bad they aren't even funny (e.g. with statement or the == equality operator that is so broken that serious js IDEs warn about any use of it: you get things like A==B, B==C and A!=C even using just normal values and without any "special" value like null, undefined, NaN or empty strings "" and not because of precision problems).
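One concrete instance of the A==B, B==C, A!=C situation, as a sketch with values picked for illustration (they are not from the answer):

```javascript
const a = '1';
const b = 1;
const c = '01';

const aEqualsB = a == b; // '1' is converted to the number 1
const bEqualsC = b == c; // '01' is converted to the number 1
const aEqualsC = a == c; // two strings: compared character by character, no conversion
console.log(aEqualsB, bEqualsC, aEqualsC);
```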
Nonsense special cases are everywhere in Javascript and trying to put them in a logical frame is, unfortunately, a wasted effort. Just learn its oddities by reading a lot and enjoy the fantastic runtime environment it provides (this is where Javascript really shines... browsers and their JIT are a truly impressive piece of technology: you can write a few lines and get real useful software running on a gajillion of different computing devices).
The official standard where all oddities are enumerated is quite hard to read because it aims to be very accurate, and unfortunately the rules it has to specify are really complex.
Moreover as the language gains more features the rules will get even more and more complex: for example what is for ES5 just another weird "special" case (e.g. ToPrimitive operation behavior for Date objects) becomes a "normal" case in ES6 (where ToPrimitive can be customized).
Not sure if this "normalization" is something to be happy about... the real problem is the frozen starting point and there are no easy solutions now (if you don't want to throw away all existing javascript code, that is).
The normal path for a language is starting clean and nice and symmetric and small. Then when facing real world problems the language gains (is infected by) some ugly parts (because the world is ugly and asymmetrical).
Javascript is like that. Except that it didn't start nice and clean and moreover there was no time to polish it before throwing it in the game.

Javascript eval - obfuscation?

I came across some eval code:
eval('[+!+[]+!+[]+!+[]+!+[]+!+[]]');
This code equals the integer 5.
What is this type of thing called? I've tried searching the web but I can't seem to figure out what this is referred to. I find this very interesting and would like to know where/how one learns how to print different things instead of just the integer 5. Letters, symbols and etc. Since I can't pin point a pattern in that code I've had 0 success taking from and adding to it to make different results.
Is this some type of obfuscation?
This type of obfuscation, eval() aside, is known as Non-alphanumeric obfuscation.
To be completely Non-alphanumeric, the eval would have to be performed by Array constructor prototypes functions and subscript notation:
[]["sort"]["constructor"]("string to be evaled")();
These strings are then converted to non-alphanumeric form.
AFAIK, it was first proposed by Yosuke Hasegawa around 2009.
If you want to see it in action see this tool: http://www.jsfuck.com/
It is not considered a good type of obfuscation because it is easy to reverse back to the original source code, without even having to run the code (statically). Plus, it increases the file size tremendously.
But it's an interesting form of obfuscation that explores JavaScript type coercion. To learn more about it I recommend this presentation, slide 33:
http://www.slideshare.net/auditmark/owasp-eu-tour-2013-lisbon-pedro-fortuna-protecting-java-script-source-code-using-obfuscation
That's called Non-alphanumeric JavaScript and it's possible because of JavaScript type coercion capabilities. There are actually some ways to call eval/Function without having to use alpha characters:
[]["filter"]["constructor"]('[+!+[]+!+[]+!+[]+!+[]+!+[]]')()
After you replace strings "filter" and "constructor" by non-alphanumeric representations you have your full non-alphanumeric JavaScript.
If you want to play around with this, there is a site where you can do it: http://www.jsfuck.com/.
Check this https://github.com/aemkei/jsfuck/blob/master/jsfuck.js for more examples like the following:
'a': '(false+"")[1]',
'b': '(Function("return{}")()+"")[2]',
'c': '([]["filter"]+"")[3]',
...
To get the value as 5, the expression should have been like this
+[!+[] + !+[] + !+[] + !+[] + !+[]]
Let us analyze the common elements first. !+[].
An empty array is actually truthy, but that does not matter here, because the unary + is applied to it before the !.
The + operator applied to an array literal tries to convert it to a number; an empty array first becomes the empty string, which JavaScript evaluates to 0.
The ! operator then converts 0 to true.
So,
console.log(!+[]);
would print true. Now, the expression can be reduced like this
+[true + true + true + true + true]
Since true is treated as 1 in arithmetic expressions, the actual expression becomes
+[ 5 ]
The + operator tries to convert the array ([ 5 ]) to a number and which results in 5. That is why you are getting 5.
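The reduction can be checked step by step. In this sketch the last line is the expression from the question, evaluated directly rather than through eval():

```javascript
const inner = +[];        // [] -> "" -> 0
const negated = !+[];     // !0 -> true
const sum = !+[] + !+[] + !+[] + !+[] + !+[]; // true + true + ... -> 5
const unwrapped = +[sum]; // [5] -> "5" -> 5

// In the question's version the leading + applies to the first !+[] only,
// so the result is an array containing 5 rather than the bare number.
const fromQuestion = [+!+[]+!+[]+!+[]+!+[]+!+[]];
console.log(inner, negated, sum, unwrapped, fromQuestion);
```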
I don't know of any term used to describe this type of code, aside from "abusing eval()".
I find this very interesting and would like to know where/how one
learns how to print different things instead of just the integer 5.
Letters, symbols and etc. Since I can't pin point a pattern in that
code I've had 0 success taking from and adding to it to make different
results.
This part I can at least partially answer. The eval() you pasted relies heavily on Javascript's strange type coercion rules. There are a lot of pages on the web describing various strange consequences of the coercion rules, which you can easily search for. But I can't find any reference on type coercion with the specific goal of getting "surprising" output from things like eval(), unless you count this video by Destroy All Software (the Javascript part starts at 1:20). Most of them understandably focus on how to avoid strange bugs in your code. For your purpose, I believe the most useful things I know of are:
The ! operator will convert anything to a boolean.
The unary + operator will convert anything to a number.
The binary + operator will coerce its arguments to either numbers or strings before adding or concatenating them. Normally it will only go with strings if one or the other argument is already a string, but there are exceptions.
The bitwise operators output integers, regardless of input.
Converting to numbers is complicated. Booleans will go to 0 or 1. Strings are converted with the same rules as Number() (not parseInt()): the whole string must be a valid numeric literal, otherwise the result is NaN, and an empty string becomes 0. I forget the rest.
Converting an object to a string or number will invoke its toString or valueOf method respectively, if one exists.
You get the idea. thefourtheye's answer walks through exactly how these rules apply to the example you gave. I'm not even going to try summarizing what happens when Dates, Functions, and Regexps are involved.
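A few of these rules in action, as a sketch. The object named thing is hypothetical and exists only to show which method each conversion calls.

```javascript
const toBoolean = !'';          // ! converts to boolean and negates: true
const toNumber = +'42';         // unary + converts to number: 42
const concatenated = 1 + '2';   // a string operand makes binary + concatenate: "12"
const added = 1 + true;         // no strings involved, so true becomes 1: 2
const bitwise = '7.9' | 0;      // bitwise operators yield 32-bit integers: 7
const strictString = +'12px';   // the whole string must be numeric: NaN

// Hypothetical object showing which method each conversion calls.
const thing = {
  valueOf() { return 10; },
  toString() { return 'ten'; },
};
const asNumber = +thing;        // number conversion uses valueOf(): 10
const asString = String(thing); // string conversion uses toString(): "ten"
console.log(toBoolean, toNumber, concatenated, added, bitwise, strictString, asNumber, asString);
```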
Is this some type of obfuscation?
Normally you'd simply get obfuscation for free as part of minification, so I have no idea why someone would write that in real production code (assuming that's where you found it).

Why Javascript ===/== string equality sometimes has constant time complexity and sometimes has linear time complexity?

After I found that the common/latest Javascript implementations are using String Interning for a performance boost (Do common JavaScript implementations use string interning?), I thought === for strings would take constant O(1) time. So I gave a wrong answer to this question:
JavaScript string equality performance comparison
Since according to the OP of that question it is O(N), doubling the string input doubles the time the equality needs. He didn't provide any jsPerf, so more investigation is needed.
So my scenario using string interning would be:
var str1 = "stringwithmillionchars"; //stored in address 51242
var str2 = "stringwithmillionchars"; //stored in address 12313
The "stringwithmillionchars" would be stored once let's say in address 201012 of memory
and both str1 and str2 would be "pointing" to this address 201012. This address could then be determined with some kind of hashing to map to specific locations in memory.
So when doing
"stringwithmillionchars" === "stringwithmillionchars"
would look like
getContentOfAddress(51242)===getContentOfAddress(12313)
or 201012 === 201012
which would take O(1)/constant time
JSPerfs/Performance updates:
JSPerf seems to show constant time even if the string is 16 times longer?? Please have a look:
http://jsperf.com/eqaulity-is-constant-time
Probably the strings are too small on the above:
This probably shows linear time (thanks to sergioFC); the strings are built with a loop. I tried without functions and it was still linear time. I changed it a bit: http://jsfiddle.net/f8yf3c7d/3/ .
According to https://www.dropbox.com/s/8ty3hev1b109qjj/compare.html?dl=0 (12MB file that sergioFC made), when you have a string whose value was assigned directly in quotes, then no matter how big t1 and t2 are (e.g. 5930496 chars), the comparison takes 0-1 ms (effectively instant).
It seems that when you build a string using a for loop or a function then the string is not interned. So interning happens only when you directly assign a string with quotes like var str = "test";
Based on all the Performance Tests (see original post) for strings a and b the operation a === b takes:
constant time O(1) if the strings are interned. From the examples it seems that interning only happens with directly assigned strings like var str = "test"; and not if you build it with concatenation using for-loops or functions.
linear time O(N) in all other cases: the lengths of the two strings are compared first. If they are equal, a character-by-character comparison follows; otherwise the strings are of course not equal. N is the length of the string.
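The non-interned path can be pictured roughly like this. The stringsEqual() function is a hypothetical stand-in for what an engine might do internally, not actual engine code.

```javascript
// Hypothetical stand-in for the comparison when no identity shortcut applies.
function stringsEqual(a, b) {
  // Different lengths can never be equal, so this check is O(1).
  if (a.length !== b.length) return false;
  // Otherwise compare code unit by code unit: O(N) in the worst case.
  for (let i = 0; i < a.length; i++) {
    if (a.charCodeAt(i) !== b.charCodeAt(i)) return false;
  }
  return true;
}

// Building the strings at runtime mirrors the "built in a loop" case from the tests.
const left = 'ab'.repeat(1000);
const right = 'ab'.repeat(1000);
const sameContent = stringsEqual(left, right);
const differentLength = stringsEqual(left, right + 'c');
const differentChar = stringsEqual('abc', 'abd');
console.log(sameContent, differentLength, differentChar);
```

An interned pair would skip all of this and compare two addresses, which is where the constant-time results would come from.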
According to the ECMAScript 5.1 Specification's Strict Equal Comparison algorithm, even if the type of Objects being compared is String, all the characters are checked to see if they are equal.
If Type(x) is String, then return true if x and y are exactly the same sequence of characters (same length and same characters in corresponding positions); otherwise, return false.
Interning is strictly an implementation detail, to boost performance. The language standard doesn't impose any rules in that regard. So it's up to the implementers of the specification to intern strings or not.
First of all, it would be nice to see a JSPerf test which demonstrates the claim that doubling the string size doubles the execution time.
Next, let's take that as granted. Here's my (unproven, unchecked, and probably unrelated to reality) theory.
Comparing two memory addresses is fast, no matter how much data is referenced. But you have to INTERN these strings first. If you have in your code
var a = "1234";
var b = "1234";
Then the engine first has to understand that these two strings are the same and can point to the same address. So at least once these strings have to be compared fully. So basically there are the following options:
The engine compares and interns strings directly when parsing the code. In this case equal strings should get the same address.
The engine may say "these strings are too big, I don't want to intern them" and keep two copies.
The engine may intern these strings later.
In the two latter cases string comparison will influence the test results. In the last case - even if the strings are finally interned.
But as I wrote, it's a wild theory, for theory's sake. I'd first like to see some JSPerf.
