How can I convert this UTF-8 string to plain text in javascript and how can a normal user write it in a textarea [duplicate] - javascript

While reviewing JavaScript concepts, I found String.normalize(). This is not something that shows up in W3School's "JavaScript String Reference", and, hence, it is the reason I might have missed it before.
I found more information about it in HackerRank which states:
Returns a string containing the Unicode Normalization Form of the
calling string's value.
With the example:
var s = "HackerRank";
console.log(s.normalize());
console.log(s.normalize("NFKC"));
having as output:
HackerRank
HackerRank
Also, in GeeksForGeeks:
The string.normalize() is an inbuilt function in javascript which is
used to return a Unicode normalisation form of a given input string.
with the example:
<script>
// Taking a string as input.
var a = "GeeksForGeeks";
// calling normalize function.
b = a.normalize('NFC')
c = a.normalize('NFD')
d = a.normalize('NFKC')
e = a.normalize('NFKD')
// Printing normalised form.
document.write(b +"<br>");
document.write(c +"<br>");
document.write(d +"<br>");
document.write(e);
</script>
having as output:
GeeksForGeeks
GeeksForGeeks
GeeksForGeeks
GeeksForGeeks
Maybe the examples given are just really bad as they don't allow me to see any change.
I wonder... what's the point of this method?

It depends on what will do with strings: often you do not need it (if you are just getting input from user, and putting it to user). But to check/search/use as key/etc. such strings, you may want a unique way to identify the same string (semantically speaking).
The main problem is that you may have two strings which are semantically the same, but with two different representations: e.g. one with a accented character [one code point], and one with a character combined with accent [one code point for character, one for combining accent]. User may not be in control on how the input text will be sent, so you may have two different user names, or two different password. But also if you mangle data, you may get different results, depending on initial string. Users do not like it.
An other problem is about unique order of combining characters. You may have an accent, and a lower tail (e.g. cedilla): you may express this with several combinations: "pure char, tail, accent", "pure char, accent, tail", "char+tail, accent", "char+accent, cedilla".
And you may have degenerate cases (especially if you type from a keyboard): you may get code points which should be removed (you may have a infinite long string which could be equivalent of few bytes.
In any case, for sorting strings, you (or your library) requires a normalized form: if you already provide the right, the lib will not need to transform it again.
So: you want that the same (semantically speaking) string has the same sequence of unicode code points.
Note: If you are doing directly on UTF-8, you should also care about special cases of UTF-8: same codepoint could be written in different ways [using more bytes]. Also this could be a security problem.
The K is often used for "searches" and similar tasks: CO2 and CO₂ will be interpreted in the same manner, but this could change the meaning of the text, so it should often used only internally, for temporary tasks, but keeping the original text.

As stated in MDN documentation, String.prototype.normalize() return the Unicode Normalized Form of the string. This because in Unicode, some characters can have different representation code.
This is the example (taken from MDN):
const name1 = '\u0041\u006d\u00e9\u006c\u0069\u0065';
const name2 = '\u0041\u006d\u0065\u0301\u006c\u0069\u0065';
console.log(`${name1}, ${name2}`);
// expected output: "Amélie, Amélie"
console.log(name1 === name2);
// expected output: false
console.log(name1.length === name2.length);
// expected output: false
const name1NFC = name1.normalize('NFC');
const name2NFC = name2.normalize('NFC');
console.log(`${name1NFC}, ${name2NFC}`);
// expected output: "Amélie, Amélie"
console.log(name1NFC === name2NFC);
// expected output: true
console.log(name1NFC.length === name2NFC.length);
// expected output: true
As you can see, the string Amélie as two different Unicode representations. With normalization, we can reduce the two forms to the same string.

Very beautifully explained here --> https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize
Short answer : The point is, characters are represented through a coding scheme like ascii, utf-8 , etc.,(We use mostly UTF-8). And some characters have more than one representation. So 2 string may render similarly, but their unicode may vary! So string comparrision may fail here! So we use normaize to return a single type of representation
// source from MDN
let string1 = '\u00F1'; // ñ
let string2 = '\u006E\u0303'; // ñ
string1 = string1.normalize('NFC');
string2 = string2.normalize('NFC');
console.log(string1 === string2); // true
console.log(string1.length); // 1
console.log(string2.length); // 1

Normalization of strings isn't exclusive of JavaScript - see for instances in Python. The values valid for the arguments are defined by the Unicode (more on Unicode normalization).
When it comes to JavaScript, note that there's documentation with String.normalize() and String.prototype.normalize(). As #ChrisG mentions
String.prototype.normalize() is correct in a technical sense, because
normalize() is a dynamic method you call on instances, not the class
itself. The point of normalize() is to be able to compare Strings that
look the same but don't consist of the same characters, as shown in
the example code on MDN.
Then, when it comes to its usage, found a great example of the usage of String.normalize() that has
let s1 = 'sabiá';
let s2 = 'sabiá';
// one is in NFC, the other in NFD, so they're different
console.log(s1 == s2); // false
// with normalization, they become the same
console.log(s1.normalize('NFC') === s2.normalize('NFC')); // true
// transform string into array of codepoints
function codepoints(s) { return Array.from(s).map(c => c.codePointAt(0).toString(16)); }
// printing the codepoints you can see the difference
console.log(codepoints(s1)); // [ "73", "61", "62", "69", "e1" ]
console.log(codepoints(s2)); // [ "73", "61", "62", "69", "61", "301" ]
So while saibá e saibá in this example look the same to the human eye or even if we used console.log(), we can see that without normalization when comparing them we'd get different results. Then, by analyzing the codepoints, we see they're different.

There are some great answers here already, but I wanted to throw in a practical example.
I enjoy Bible translation as a hobby. I wasn't too thrilled at the flashcard option out there in the wild in my price range (free) so I made my own. The problem is, there is more than one way to do Hebrew and Greek in Unicode to get the exact same thing. For example:
בָּא
בָּא
These should look identical on your screen, and for all practical purposes they are identical. However, the first was typed with the qamats (the little t shaped thing under it) before the dagesh (the dot in the middle of the letter) and the second was typed with the dagesh before the qamats. Now, since you're just reading this, you don't care. And your web browser doesn't care. But when my flashcards compare the two, then they aren't the same. To the code behind the scenes, it's no different than saying "center" and "centre" are the same.
Similarly, in Greek:
ἀ
ἀ
These two should look nearly identical, but the top is one Unicode character and the second one is two Unicode characters. Which one is going to end up typed in my flashcards is going to depend on which keyboard I'm sitting at.
When I'm adding flashcards, believe it or not, I don't always type in vocab lists of 100 words. That's why God gave us spreadsheets. And sometimes the places I'm importing the lists from do it one way, and sometimes they do it the other way, and sometimes they mix it. But when I'm typing, I'm not trying to memorize the order that the dagesh or quamats appear or if the accents are typed as a separate character or not. Regardless if I remember to type the dagesh first or not, I want to get the right answer, because really it's the same answer in every practical sense either way.
So I normalize the order before saving the flashcards and I normalize the order before checking it, and the result is that it doesn't matter which way I type it, it comes out right!
If you want to check out the results:
https://sthelenskungfu.com/flashcards/
You need a Google or Facebook account to log in, so it can track progress and such. As far as I know (or care) only my daughter and I currently use it.
It's free, but eternally in beta.

Related

Why doesn't my function correctly replace when using some regex pattern

This is an extension of this SO question
I made a function to see if i can correctly format any number. The answers below work on tools like https://regex101.com and https://regexr.com/, but not within my function(tried in node and browser):
const
const format = (num, regex) => String(num).replace(regex, '$1')
Basically given any whole number, it should not exceed 15 significant digits. Given any decimal, it should not exceed 2 decimal points.
so...
Now
format(0.12345678901234567890, /^\d{1,13}(\.\d{1,2}|\d{0,2})$/)
returns 0.123456789012345678 instead of 0.123456789012345
but
format(0.123456789012345,/^-?(\d*\.?\d{0,2}).*/)
returns number formatted to 2 deimal points as expected.
Let me try to explain what's going on.
For the given input 0.12345678901234567890 and the regex /^\d{1,13}(\.\d{1,2}|\d{0,2})$/, let's go step by step and see what's happening.
^\d{1,13} Does indeed match the start of the string 0
(\. Now you've opened a new group, and it does match .
\d{1,2} It does find the digits 1 and 2
|\d{0,2} So this part is skipped
) So this is the end of your capture group.
$ This indicates the end of the string, but it won't match, because you've still got 345678901234567890 remaining.
Javascript returns the whole string because the match failed in the end.
Let's try removing $ at the end, to become /^\d{1,13}(\.\d{1,2}|\d{0,2})/
You'd get back ".12345678901234567890". This generates a couple of questions.
Why did the preceding 0 get removed?
Because it was not part of your matching group, enclosed with ().
Why did we not get only two decimal places, i.e. .12?
Remember that you're doing a replace. Which means that by default, the original string will be kept in place, only the parts that match will get replaced. Since 345678901234567890 was not part of the match, it was left intact. The only part that matched was 0.12.
Answer to title question: your function doesn't replace, because there's nothing to replace - the regex doesn't match anything in the string. csb's answer explains that in all details.
But that's perhaps not the answer you really need.
Now, it seems like you have an XY problem. You ask why your call to .replace() doesn't work, but .replace() is definitely not a function you should use. Role of .replace() is replacing parts of string, while you actually want to create a different string. Moreover, in the comments you suggest that your formatting is not only for presenting data to user, but you also intend to use it in some further computation. You also mention cryptocurriencies.
Let's cope with these problems one-by-one.
What to do instead of replace?
Well, just produce the string you need instead of replacing something in the string you don't like. There are some edge cases. Instead of writing all-in-one regex, just handle them one-by-one.
The following code is definitely not best possible, but it's main aim is to be simple and show exactly what is going on.
function format(n) {
const max_significant_digits = 15;
const max_precision = 2;
let digits_before_decimal_point;
if (n < 0) {
// Don't count minus sign.
digits_before_decimal_point = n.toFixed(0).length - 1;
} else {
digits_before_decimal_point = n.toFixed(0).length;
}
if (digits_before_decimal_point > max_significant_digits) {
throw new Error('No good representation for this number');
}
const available_significant_digits_for_precision =
Math.max(0, max_significant_digits - digits_before_decimal_point);
const effective_max_precision =
Math.min(max_precision, available_significant_digits_for_precision);
const with_trailing_zeroes = n.toFixed(effective_max_precision);
// I want to keep the string and change just matching part,
// so here .replace() is a proper method to use.
const withouth_trailing_zeroes = with_trailing_zeroes.replace(/\.?0*$/, '');
return withouth_trailing_zeroes;
}
So, you got the number formatted the way you want. What now?
What can you use this string for?
Well, you can display it to the user. And that's mostly it. The value was rounded to (1) represent it in a different base and (2) fit in limited precision, so it's pretty much useless for any computation. And, BTW, why would you convert it to String in the first place, if what you want is a number?
Was the value you are trying to print ever useful in the first place?
Well, that's the most serious question here. Because, you know, floating point numbers are tricky. And they are absolutely abysmal for representing money. So, most likely the number you are trying to format is already a wrong number.
What to use instead?
Fixed-point arithmetic is the most obvious answer. Works most of the time. However, it's pretty tricky in JS, where number may slip into floating-point representation almost any time. So, it's better to use decimal arithmetic library. Optionally, switch to a language that has built-in bignums and decimals, like Python.

Comparing strings with localeCompare vs ===?

I ran into a pretty strange issue with my latest JS project. I usually compare strings using === but when comparing the string properties of two of different objects I got false even though they were the exact same strings. I tested this in my Node.js interpreter by doing the following:
> x = {str: 'hello'}
{ str: 'hello' }
> y = {str: 'hello'}
{ str: 'hello' }
> y.str === x.str
true
So I couldnt figure out why my code wasnt working. But when I switch from using === to str1.localeCompare BOOM, it works. Whats the difference between the two?
=== looks for exactly the same bytes in the strings.
.localeCompare() allows for the fact that you may want to ignore certain differences in the strings (such as puncutation or diacriticals or case) and still allow them to compare the same or you want to ignore certain differences when deciding which string is before the other. And, it provides lots of options to control what comparison features are or are not used.
If you read the MDN documentation for string.prototype.localeCompare(), you can see a whole bunch of options you can pass in to control how the compare works. On a plain ascii string with no special characters in it that are all the same case, you are unlikely to see a difference, but start getting into diacriticals or case issues and localCompare() has both more features and more options to control the comparison.
Some of the options available for controlling the comparison:
numeric collation
diacritical sensitivity
ability to ignore punctuation
case first
control whether upper or lower case compares first
In addition, localeCompare() returns a value (negative, 0 or positive) that is perfectly aligned to use with a .sort() callback.

Javascript eval - obfuscation?

I came across some eval code:
eval('[+!+[]+!+[]+!+[]+!+[]+!+[]]');
This code equals the integer 5.
What is this type of thing called? I've tried searching the web but I can't seem to figure out what this is referred to. I find this very interesting and would like to know where/how one learns how to print different things instead of just the integer 5. Letters, symbols and etc. Since I can't pin point a pattern in that code I've had 0 success taking from and adding to it to make different results.
Is this some type of obfuscation?
This type of obfuscation, eval() aside, is known as Non-alphanumeric obfuscation.
To be completely Non-alphanumeric, the eval would have to be performed by Array constructor prototypes functions and subscript notation:
[]["sort"]["constructor"]("string to be evaled");
These strings are then converted to non-alphanumeric form.
AFAIK, it was first proposed by Yosuke Hosogawa around 2009.
If you want to see it in action see this tool: http://www.jsfuck.com/
It is not considered a good type of obfuscation because it is easy to reverse back to the original source code, without even having to run the code (statically). Plus, it increases the file size tremendously.
But its an interesting form of obfuscation that explores JavaScript type coercion. To learn more about it I recommend this presentation, slide 33:
http://www.slideshare.net/auditmark/owasp-eu-tour-2013-lisbon-pedro-fortuna-protecting-java-script-source-code-using-obfuscation
That's called Non-alphanumeric JavaScript and it's possible because of JavaScript type coercion capabilities. There are actually some ways to call eval/Function without having to use alpha characters:
[]["filter"]["constructor"]('[+!+[]+!+[]+!+[]+!+[]+!+[]]')()
After you replace strings "filter" and "constructor" by non-alphanumeric representations you have your full non-alphanumeric JavaScript.
If you want to play around with this, there is a site where you can do it: http://www.jsfuck.com/.
Check this https://github.com/aemkei/jsfuck/blob/master/jsfuck.js for more examples like the following:
'a': '(false+"")[1]',
'b': '(Function("return{}")()+"")[2]',
'c': '([]["filter"]+"")[3]',
...
To get the value as 5, the expression should have been like this
+[!+[] + !+[] + !+[] + !+[] + !+[]]
Let us analyze the common elements first. !+[].
An empty array literal in JavaScript is considered as Falsy.
+ operator applied to an array literal, will try convert it to a number and since it is empty, JavaScript will evaluate it to 0.
! operator converts 0 to true.
So,
console.log(!+[]);
would print true. Now, the expression can be reduced like this
+[true + true + true + true + true]
Since true is treated as 1 in arithmetic expressions, the actual expression becomes
+[ 5 ]
The + operator tries to convert the array ([ 5 ]) to a number and which results in 5. That is why you are getting 5.
I don't know of any term used to describe this type of code, aside from "abusing eval()".
I find this very interesting and would like to know where/how one
learns how to print different things instead of just the integer 5.
Letters, symbols and etc. Since I can't pin point a pattern in that
code I've had 0 success taking from and adding to it to make different
results.
This part I can at least partially answer. The eval() you pasted relies heavily on Javascript's strange type coercion rules. There are a lot of pages on the web describing various strange consequences of the coercion rules, which you can easily search for. But I can't find any reference on type coercion with the specific goal of getting "surprising" output from things like eval(), unless you count this video by Destroy All Software (the Javascript part starts at 1:20). Most of them understandably focus on how to avoid strange bugs in your code. For your purpose, I believe the most useful things I know of are:
The ! operator will convert anything to a boolean.
The unary + operator will convert anything to a number.
The binary + operator will coerce its arguments to either numbers or strings before adding or concatenating them. Normally it will only go with strings if one or the other argument is already a string, but there are exceptions.
The bitwise operators output integers, regardless of input.
Converting to numbers is complicated. Booleans will go to 0 or 1. Strings will attempt to use parseInt(), or produce NaN if that fails. I forget the rest.
Converting an object to a string or number will invoke its "toString" or "toValue" method respectively, if one exists.
You get the idea. thefourtheye's answer walks through exactly how these rules apply to the example you gave. I'm not even going to try summarizing what happens when Dates, Functions, and Regexps are involved.
Is this some type of obfuscation?
Normally you'd simply get obfuscation for free as part of minification, so I have no idea why someone would write that in real production code (assuming that's where you found it).

Is it better to compare strings using toLowerCase or toUpperCase in JavaScript?

I'm going through a code review and I'm curious if it's better to convert strings to upper or lower case in JavaScript when attempting to compare them while ignoring case.
Trivial example:
var firstString = "I might be A different CASE";
var secondString = "i might be a different case";
var areStringsEqual = firstString.toLowerCase() === secondString.toLowerCase();
or should I do this:
var firstString = "I might be A different CASE";
var secondString = "i might be a different case";
var areStringsEqual = firstString.toUpperCase() === secondString.toUpperCase();
It seems like either "should" or would work with limited character sets like only English letters, so is one more robust than the other?
As a note, MSDN recommends normalizing strings to uppercase, but that is for managed code (presumably C# & F# but they have fancy StringComparers and base libraries):
http://msdn.microsoft.com/en-us/library/bb386042.aspx
Revised answer
It's been quite a while when I answered this question. While cultural issues still holds true (and I don't think they will ever go away), the development of ECMA-402 standard made my original answer... outdated (or obsolete?).
The best solution for comparing localized strings seems to be using function localeCompare() with appropriate locales and options:
var locale = 'en'; // that should be somehow detected and passed on to JS
var firstString = "I might be A different CASE";
var secondString = "i might be a different case";
if (firstString.localeCompare(secondString, locale, {sensitivity: 'accent'}) === 0) {
// do something when equal
}
This will compare two strings case-insensitive, but accent-sensitive (for example ą != a).
If this is not sufficient for performance reasons, you may want to use eithertoLocaleUpperCase()ortoLocaleLowerCase()` passing the locale as a parameter:
if (firstString.toLocaleUpperCase(locale) === secondString.toLocaleUpperCase(locale)) {
// do something when equal
}
In theory there should be no differences. In practice, subtle implementation details (or lack of implementation in the given browser) may yield different results...
Original answer
I am not sure if you really meant to ask this question in Internationalization (i18n) tag, but since you did...
Probably the most unexpected answer is: neither.
There are tons of problems with case conversion, which inevitably leads to functional issues if you want to convert the character case without indicating the language (like in JavaScript case). For instance:
There are many natural languages that don't have concept of upper- and lowercase characters. No point in trying to convert them (although this will work).
There are language specific rules for converting the string. German sharp S character (ß) is bound to be converted into two upper case S letters (SS).
Turkish and Azerbaijani (or Azeri if you prefer) has "very strange" concept of two i characters: dotless ı (which converts to uppercase I) and dotted i (which converts to uppercase İ <- this font does not allow for correct presentation, but this is really different glyph).
Greek language has many "strange" conversion rules. One particular rule regards to uppercase letter sigma (Σ) which depending on a place in a word has two lowercase counterparts: regular sigma (σ) and final sigma (ς). There are also other conversion rules in regard to "accented" characters, but they are commonly omitted during implementation of conversion function.
Some languages has title-case letters, i.e. Lj which should be converted to things like LJ or less appropriately LJ. The same may regard to ligatures.
Finally there are many compatibility characters that may mean the same as what you are trying to compare to, but be composed of completely different characters. To make it worse, things like "ae" may be the equivalent of "ä" in German and Finnish, but equivalent of "æ" in Danish.
I am trying to convince you that it is really better to compare user input literally, rather than converting it. If it is not user-related, it probably doesn't matter, but case conversion will always take time. Why bother?
Some other options have been presented, but if you must use toLowerCase, or
toUpperCase, I wanted some actual data on this. I pulled the full list
of two byte characters that fail with toLowerCase or toUpperCase. I then
ran this test:
let pairs = [
[0x00E5,0x212B],[0x00C5,0x212B],[0x0399,0x1FBE],[0x03B9,0x1FBE],[0x03B2,0x03D0],
[0x03B5,0x03F5],[0x03B8,0x03D1],[0x03B8,0x03F4],[0x03D1,0x03F4],[0x03B9,0x1FBE],
[0x0345,0x03B9],[0x0345,0x1FBE],[0x03BA,0x03F0],[0x00B5,0x03BC],[0x03C0,0x03D6],
[0x03C1,0x03F1],[0x03C2,0x03C3],[0x03C6,0x03D5],[0x03C9,0x2126],[0x0392,0x03D0],
[0x0395,0x03F5],[0x03D1,0x03F4],[0x0398,0x03D1],[0x0398,0x03F4],[0x0345,0x1FBE],
[0x0345,0x0399],[0x0399,0x1FBE],[0x039A,0x03F0],[0x00B5,0x039C],[0x03A0,0x03D6],
[0x03A1,0x03F1],[0x03A3,0x03C2],[0x03A6,0x03D5],[0x03A9,0x2126],[0x0398,0x03F4],
[0x03B8,0x03F4],[0x03B8,0x03D1],[0x0398,0x03D1],[0x0432,0x1C80],[0x0434,0x1C81],
[0x043E,0x1C82],[0x0441,0x1C83],[0x0442,0x1C84],[0x0442,0x1C85],[0x1C84,0x1C85],
[0x044A,0x1C86],[0x0412,0x1C80],[0x0414,0x1C81],[0x041E,0x1C82],[0x0421,0x1C83],
[0x1C84,0x1C85],[0x0422,0x1C84],[0x0422,0x1C85],[0x042A,0x1C86],[0x0463,0x1C87],
[0x0462,0x1C87]
];
let upper = 0, lower = 0;
for (let pair of pairs) {
let row = 'U+' + pair[0].toString(16).padStart(4, '0') + ' ';
row += 'U+' + pair[1].toString(16).padStart(4, '0') + ' pass: ';
let s = String.fromCodePoint(pair[0]);
let t = String.fromCodePoint(pair[1]);
if (s.toUpperCase() == t.toUpperCase()) {
row += 'toUpperCase ';
upper++;
} else {
row += ' ';
}
if (s.toLowerCase() == t.toLowerCase()) {
row += 'toLowerCase';
lower++;
}
console.log(row);
}
console.log('upper pass: ' + upper + ', lower pass: ' + lower);
Interestingly, one of the pairs fails with both. But based on this,
toUpperCase is the best option.
It never depends upon the browser as it is only the JavaScript which is involved.
both will give the performance based upon the no of characters need to be changed (flipping case)
var areStringsEqual = firstString.toLowerCase() === secondString.toLowerCase();
var areStringsEqual = firstString.toUpperCase() === secondString.toUpperCase();
If you use test prepared by #adeneo you can feel it's browser dependent, but make some other test inputs like:
"AAAAAAAAAAAAAAAAAAAAAAAAAAAA"
and
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
and compare.
Javascript performance depends upon the browser if some DOM API or any DOM manipulation/interaction is there, otherwise for all plain JavaScript, it will give the same performance.

Why Javascript ===/== string equality sometimes has constant time complexity and sometimes has linear time complexity?

After I found that the common/latest Javascript implementations are using String Interning for perfomance boost (Do common JavaScript implementations use string interning?), I thought === for strings would get the constant O(1) time. So I gave a wrong answer to this question:
JavaScript string equality performance comparison
Since according to the OP of that question it is O(N), doubling the string input doubles the time the equality needs. He didn't provide any jsPerf so more investigation is needed,
So my scenario using string interning would be:
var str1 = "stringwithmillionchars"; //stored in address 51242
var str2 = "stringwithmillionchars"; //stored in address 12313
The "stringwithmillionchars" would be stored once let's say in address 201012 of memory
and both str1 and str2 would be "pointing" to this address 201012. This address could then be determined with some kind of hashing to map to specific locations in memory.
So when doing
"stringwithmillionchars" === "stringwithmillionchars"
would look like
getContentOfAddress(51242)===getContentOfAddress(12313)
or 201012 === 201012
which would take O(1)/constant time
JSPerfs/Performance updates:
JSPerf seems to show constant time even if the string is 16 times longer?? Please have a look:
http://jsperf.com/eqaulity-is-constant-time
Probably the strings are too small on the above:
This probably show linear time (thanks to sergioFC) the strings are built with a loop. I tried without functions - still linear time / I changed it a bit http://jsfiddle.net/f8yf3c7d/3/ .
According to https://www.dropbox.com/s/8ty3hev1b109qjj/compare.html?dl=0 (12MB file that sergioFC made) when you have a string and you already have assigned the value in quotes no matter how big the t1 and t2 are (e.g 5930496 chars), it is taking it 0-1ms/instant time.
It seems that when you build a string using a for loop or a function then the string is not interned. So interning happens only when you directly assign a string with quotes like var str = "test";
Based on all the Performance Tests (see original post) for strings a and b the operation a === b takes:
constant time O(1) if the strings are interned. From the examples it seems that interning only happens with directly assigned strings like var str = "test"; and not if you build it with concatenation using for-loops or functions.
linear time O(N) since in all the other cases the length of the two strings is compared first. If it is equal then we have character by character comparison. Else of course they are not equal. N is the length of the string.
According to the ECMAScript 5.1 Specification's Strict Equal Comparison algorithm, even if the type of Objects being compared is String, all the characters are checked to see if they are equal.
If Type(x) is String, then return true if x and y are exactly the same sequence of characters (same length and same characters in corresponding positions); otherwise, return false.
Interning is strictly an implementation thingy, to boost performance. The language standard doesn't impose any rules in that regard. So, its up to the implementers of the specification to intern strings or not.
First of all, it would be nice to see a JSPerf test which demonstrates the claim that doubling the string size doubles the execution time.
Next, let's take that as granted. Here's my (unproven, unchecked, and probably unrelated to reality) theory.
Compairing two memory addresses is fast, no matter how much data is references. But you have to INTERN this strings first. If you have in your code
var a = "1234";
var b = "1234";
Then the engine first has to understand that these two strings are the same and can point to the same address. So at least once these strings has to be compared fully. So basically here are the following options:
The engine compares and interns strings directly when parsing the code. In this case equals strings should get the same address.
The engine may say "these strings are two big, I don't want to intern them" and has two copies.
The engine may intern these strings later.
In the two latter cases string comparison will influence the test results. In the last case - even if the strings are finally interned.
But as I wrote, a wild theory, for theory's sage. I'd first like to see some JSPerf.

Categories

Resources