String Comparison vs. Hashing


I recently learned about the rolling hash data structure; one of its prime uses is searching for a substring within a string. Here are some advantages that I noticed:
Comparing two strings can be expensive so this should be avoided if possible
Hashing the strings and comparing the hashes is generally much faster than comparing strings, however rehashing the new substring each time traditionally takes linear time
A rolling hash is able to rehash the new substring in constant time, making it much quicker and more efficient for this task
I went ahead and implemented a rolling hash in JavaScript and began to analyze the speed between a rolling hash, traditional rehashing, and just comparing the substrings against each other.
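For reference, a rolling-hash search of the kind described above might look roughly like the sketch below. This is not the implementation that was benchmarked; `BASE`, `MOD`, `hashOf`, and `rollingSearch` are names and values assumed here purely for illustration.

```javascript
// Minimal Rabin-Karp style sketch. BASE and MOD are illustrative choices,
// not values taken from the benchmarked implementation.
const BASE = 256;
const MOD = 1000003; // small prime, keeps every intermediate product below 2**53

// Hash of str[start .. start + length), computed the slow (linear) way.
function hashOf(str, start, length) {
  let h = 0;
  for (let i = start; i < start + length; i++) {
    h = (h * BASE + str.charCodeAt(i)) % MOD;
  }
  return h;
}

// Returns the index of the first occurrence of pattern in text, or -1.
function rollingSearch(text, pattern) {
  const n = text.length;
  const m = pattern.length;
  if (m === 0) return 0;
  if (m > n) return -1;

  // BASE^(m-1) mod MOD, needed to remove the outgoing character.
  let highPower = 1;
  for (let i = 1; i < m; i++) highPower = (highPower * BASE) % MOD;

  const target = hashOf(pattern, 0, m);
  let windowHash = hashOf(text, 0, m);

  for (let i = 0; ; i++) {
    // Compare the actual characters only when the hashes agree.
    if (windowHash === target && text.startsWith(pattern, i)) return i;
    if (i + m >= n) return -1;
    // Constant-time update: drop text[i], append text[i + m].
    const outgoing = (text.charCodeAt(i) * highPower) % MOD;
    windowHash = ((windowHash - outgoing + MOD) * BASE + text.charCodeAt(i + m)) % MOD;
  }
}

console.log(rollingSearch('hello world', 'world')); // 6
```

The character comparison on a hash match is still needed, because two different windows can share a hash value.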
In my findings, the larger the substring, the longer the traditional rehashing approach took to run (as expected), whereas the rolling hash ran incredibly fast (as expected). However, comparing the substrings directly ran much faster than the rolling hash. How could this be?
For the sake of perspective, let's say the running times for the functions searching through a ~2.4 million character string for a 100 character substring were the following:
Rolling Hash - 0.809 seconds
Traditional Rehashing - 71.009 seconds
Just comparing the strings (no hashing) - 0.089 seconds
How could the string comparing be so much faster than the rolling hash? Could it just have something to do with JavaScript in particular? Strings are a primitive type in JavaScript; would this cause string comparisons to run in constant time?
My main confusion is as to how/why string comparisons are so fast in JavaScript, when I was under the impression that they were supposed to be relatively slow.
Note:
By string comparisons I'm referring to something like stringA === stringB
Note:
I asked this question over on the Computer Science Community and was informed that I should ask it here as well because this is most likely JavaScript specific.

After some testing and analysis, I've come to the conclusion that there were a few reasons as to why my rolling hash approach was running slightly slower than simply comparing the two strings.
If the rolling hash claims to run in constant time, how can it be slower than comparing strings?
Functions are relatively slow - calling a function is slightly slower than simply executing code inline. In my particular case, a function had to be called on my object every time the rolling hash rehashed its internal window, so it took slightly longer to run than the string comparison, whose code was simply inline. Because my benchmark has the rolling hash "shift" over 2 million times, this function-call slowdown shows up more clearly.
But why is the string comparison so fast?
Strings are primitive - Because strings are a primitive type in JavaScript, comparing two strings will most likely invoke a routine that is coded directly within the interpreter. This low-level evaluation can run about as fast as the architecture allows (similar to comparing numbers).
In Conclusion
Comparing strings in JavaScript ends up being faster than a rolling hash in this scenario for two reasons: strings are primitive, which lets the interpreter work with them very quickly, and each function call adds a slight overhead that slows the rolling hash down on a very small scale.

Related

Is an array of ints actually implemented as an array of ints in JavaScript / V8?

There is a claim in this article that an array of ints in JavaScript is implemented by a C++ array of ints.
However, according to MDN, unless you specifically use BigInts, all numbers in JavaScript are represented as doubles.
If I do:
const arr = [0, 1, 2, 3];
What is the actual representation in the V8 engine?
The code for V8 is on GitHub, but I don't know where to look.

(V8 developer here.)
"C++ array of ints" is a bit of a simplification, but the key idea described in that article is correct, and an array [0, 1, 2, 3] will be stored as an array of "Smis".
What's a "Smi"? While every Number in JavaScript must behave like an IEEE754 double, V8 internally represents numbers as "small integer" (31 bits signed integer value + 1 bit tag) when it can, i.e. when the number has an integral value in the range -2**30 to 2**30-1, to improve efficiency. Engines can generally do whatever they want under the hood, as long as things behave as if the implementation followed the spec to the letter. So when the spec (or MDN documentation) says "all Numbers are doubles", what it really means from the engine's (or an engine developer's) point of view is "all Numbers must behave as if they were doubles".
When an array contains only Smis, then the array itself keeps track of that fact, so that values loaded from such arrays know their type without having to check. This matters e.g. for a[i] + 1, where the implementation of + doesn't have to check whether a[i] is a Smi when it's already known that a is a Smi array.
When the first number that doesn't fit the Smi range is stored in the array, it'll be transitioned to an array of doubles (strictly speaking still not a "C++ array", rather a custom array on the garbage-collected heap, but it's similar to a C++ array, so that's a good way to explain it).
When the first non-Number is stored in an array, what happens depends on what state the array was in before: if it was a "Smi array", then it only needs to forget the fact that it contains only Smis. No rewriting is needed, as Smis are valid object pointers thanks to their tag bit. If the array was a "double array" before, then it does have to be rewritten, so that each element is a valid object pointer. All the doubles will be "boxed" as so-called "heap numbers" (objects on the managed heap that only wrap a double value) at this point.
In summary, I'd like to point out that in the vast majority of cases, there's no need to worry about any of these internal implementation tricks, or even be aware of them. I certainly understand your curiosity though! Also, array representations are one of the more common reasons why microbenchmarks that don't account for implementation details can easily be misleading by suggesting results that won't carry over to a larger app.
Addressing comments:
V8 does sometimes even use int16 or lower.
Nope, it does not. It may or may not start doing so in the future; though if anything does change, I'd guess that untagged int32 is more likely to be introduced than int16; also if anything does change about the implementation then of course the observable behavior would not change.
If you believe that your application would benefit from int16 storage, you can use an Int16Array to enforce that, but be sure to measure whether that actually benefits you, because quite likely it won't, and may even decrease performance depending on what your app does with its arrays.
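As a small illustration of what "enforcing int16" means in practice, here is a sketch using the standard `Int16Array` typed array. The variable name `samples` and the stored values are made up for the example, and it says nothing about whether this is faster in any given app.

```javascript
// Sketch only: shows the storage rules an Int16Array enforces.
const samples = new Int16Array(3); // three 16-bit signed slots, zero-filled

samples[0] = 1234;  // fits, stored unchanged
samples[1] = 40000; // out of range, wraps modulo 2**16
samples[2] = 1.9;   // fractional part is dropped

console.log(samples[0]); // 1234
console.log(samples[1]); // -25536
console.log(samples[2]); // 1
console.log(samples.BYTES_PER_ELEMENT); // 2
```

Unlike a regular array, out-of-range or fractional values are silently converted on store, which is one more reason to measure before switching.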
It may start to be a double when you make it a decimal
Slightly more accurately: there are several reasons why an array of Smis needs to be converted to an array of doubles, such as:
storing a fractional value in it, e.g. 0.5
storing a large value in it, e.g. 2**34
storing NaN or Infinity or -0 in it
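The conditions listed above can be written down as a small predicate. This is only a sketch: `wouldFitSmi`, `SMI_MIN`, and `SMI_MAX` are hypothetical names, not a V8 API, and the range is the 31-bit one quoted earlier in this answer.

```javascript
// Hypothetical helper mirroring the rules described above; not part of V8.
// The range assumes the 31-bit Smi layout mentioned in this answer.
const SMI_MIN = -(2 ** 30);
const SMI_MAX = 2 ** 30 - 1;

function wouldFitSmi(value) {
  return (
    typeof value === 'number' &&
    Number.isInteger(value) && // rejects 0.5, NaN, Infinity
    !Object.is(value, -0) &&   // -0 needs a double representation
    value >= SMI_MIN &&
    value <= SMI_MAX
  );
}

console.log([0, 1, 2, 3].every(wouldFitSmi)); // true
console.log(wouldFitSmi(0.5));                // false
console.log(wouldFitSmi(2 ** 34));            // false
console.log(wouldFitSmi(-0));                 // false
```

From plain JavaScript you cannot observe which representation the engine actually picked; the helper only restates the rule of thumb.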

Complexity of Array methods

In a team project we need to remove the first element of an array, so I called Array.prototype.shift(). A colleague then saw the question "Why is pop faster than shift?" and suggested reversing the array, popping, and reversing again, using Array.prototype.pop() and Array.prototype.reverse().
Intuitively this will be slower (?), since my approach takes O(n), I think, while the other needs O(n) plus O(n). Of course, in asymptotic notation this is the same. However, notice the verb I used: think!
Of course I could write some code, use jsPerf and benchmark, but this takes time (in contrast with deciding via time complexity, e.g. an O(n^3) vs. an O(n) algorithm).
However, convincing someone with my opinion is much harder than pointing them to the Standard (if it referred to complexity).
So how to find the Time Complexity of these methods?
For example in C++ std::reverse() clearly states:
Complexity
Linear in half the distance between first and last: Swaps elements.

how to find the Time Complexity of these methods?
You cannot find them in the Standard.
ECMAScript is a Standard for scripting languages. JavaScript is such a language.
The ECMA specification does not specify a bounding complexity. Every JavaScript engine is free to implement its own functionality, as long as it is compatible with the Standard.
As a result, you have to benchmark with jsPerf, or even look at the source code of a specific JavaScript engine, if you would like.
Or, as robertklep's comment mentioned:
"The Standard doesn't mandate how these methods should be implemented. Also, JS is an interpreted language, so things like JIT and GC may start playing a role depending on array sizes and how often the code is called. In other words: benchmarking is probably your only option to get an idea on how different JS engines perform".
There is further evidence for this claim ([0], [1], [2]).

An indisputable method is to run a comparative benchmark (provided you do it correctly and the other party is not acting in bad faith). Don't worry about theoretical complexities.
Otherwise you will have a hard time convincing someone who doesn't see the obvious: a shift moves every element once, while two reversals move it twice.
By the way, a shift is optimal, as every element has to move at least once. And if properly implemented as a memmove, it is very fast (while a single reversal cannot be as fast).
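A comparative benchmark of the two approaches could be sketched like this. The function names, array size, and round count are arbitrary assumptions, `Date.now()` is a coarse timer, and a single run like this is only a rough indication; results will vary by engine and machine.

```javascript
// Rough benchmark sketch; sizes and names are illustrative assumptions.
function removeFirstWithShift(arr) {
  arr.shift();
  return arr;
}

function removeFirstWithReversePop(arr) {
  arr.reverse();
  arr.pop();
  arr.reverse();
  return arr;
}

// Builds a fresh array, applies fn `rounds` times, and logs the elapsed time.
function timeIt(label, fn, size, rounds) {
  const arr = Array.from({ length: size }, (_, i) => i);
  const start = Date.now();
  for (let r = 0; r < rounds; r++) fn(arr);
  console.log(`${label}: ${Date.now() - start} ms`);
  return arr;
}

const afterShift = timeIt('shift', removeFirstWithShift, 100000, 100);
const afterReversePop = timeIt('reverse/pop/reverse', removeFirstWithReversePop, 100000, 100);

// Sanity check: both approaches must leave the same contents behind.
console.log(afterShift.length === afterReversePop.length &&
            afterShift[0] === afterReversePop[0]); // true
```

Checking that both variants produce the same array is part of "doing it correctly": a benchmark of two functions that do different work proves nothing.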

What's the rationale for using insertion sort over shell sort in Array.sort in V8

V8 uses quicksort for arrays longer than 10 elements, and insertion sort for arrays of 10 elements or fewer. Here is the source:
function InnerArraySort(array, length, comparefn) {
// In-place QuickSort algorithm.
// For short (length <= 10) arrays, insertion sort is used for efficiency.
I'm wondering what's the rationale for not using shell-sort instead of an insertion sort? I understand that it probably doesn't make a difference for an array of 10 elements, but still. Any ideas?

The original rationale is lost to history; the commit that introduced InsertionSort for short arrays (all the way back in 2008) only mentions that it's faster than QuickSort (for such short arrays). So it boils down to: someone implemented it that way, and nobody else saw a reason to change it since.
Since InsertionSort is known to be very efficient for short arrays, I agree that changing it probably doesn't make a difference -- and there are lots of things for the team to work on that actually do make a difference.

Great question. The rationale is simple: it is actually faster to use insertion sort on those small arrays, at least typically. Java in fact made the same switch a long while ago; its code uses insertion sort if the array is less than 7 long. See here. It is under the function sort1 at the top.
Basically what happens (in most cases) for such small arrays is that the overhead of Quicksort makes it slower than insertion sort. Insertion sort in these cases is much more likely to approach its best-case performance of O(n), while Quicksort is still likely to stay at O(n log n).
Shell sort, on the other hand, tends to be much slower than insertion sort. That being said, it can be much faster (relatively). The best case for insertion sort is still O(n), whereas the best case for shell sort is O(n log n). So for any n under ten, insertion sort has the potential to be faster from a mathematical standpoint. Unfortunately for shell sort, there is also a lot more swapping involved, so it can become much slower. Insertion sort tends to get by with O(1) swaps, whereas shell sort is likely to need around O(n) swaps. Swaps are costly on real machines because they tend to need a third temporary register (there are ways of using XOR, but that is still three instructions on the CPU, typically). Therefore, insertion sort still wins on an actual machine, typically.
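To make the cutoff idea concrete, here is a minimal sketch of a quicksort that hands short ranges to insertion sort. It is not V8's InnerArraySort; `CUTOFF`, `insertionSort`, and `hybridSort` are names assumed for illustration, and the partition scheme is a plain textbook one.

```javascript
// Sketch of a hybrid sort; not the V8 implementation.
const CUTOFF = 10; // same threshold as the source comment quoted in the question

// Sorts a[from .. to) in place.
function insertionSort(a, from, to, compare) {
  for (let i = from + 1; i < to; i++) {
    const item = a[i];
    let j = i - 1;
    // Shift larger elements one slot to the right, then drop item in place.
    while (j >= from && compare(a[j], item) > 0) {
      a[j + 1] = a[j];
      j--;
    }
    a[j + 1] = item;
  }
}

function hybridSort(a, compare = (x, y) => x - y, from = 0, to = a.length) {
  if (to - from <= CUTOFF) {
    insertionSort(a, from, to, compare);
    return a;
  }
  // Textbook Lomuto partition around the middle element (moved to the end).
  // Note: inputs with many equal elements make this recurse deeply.
  const mid = from + ((to - from) >> 1);
  [a[mid], a[to - 1]] = [a[to - 1], a[mid]];
  const pivot = a[to - 1];
  let store = from;
  for (let i = from; i < to - 1; i++) {
    if (compare(a[i], pivot) < 0) {
      [a[i], a[store]] = [a[store], a[i]];
      store++;
    }
  }
  [a[store], a[to - 1]] = [a[to - 1], a[store]];
  hybridSort(a, compare, from, store);
  hybridSort(a, compare, store + 1, to);
  return a;
}

console.log(hybridSort([5, 3, 8, 1, 9, 2])); // [ 1, 2, 3, 5, 8, 9 ]
```

Whether 10, 7, or some other cutoff is best is an empirical question, which fits the point above that these choices are made by measuring.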

Reverse string comparison

I'm using a Dictionary (associative array, hash table, any of these synonyms).
The keys used to uniquely identify values are fairly long strings. However, I know that these strings tend to differ at the tail, rather than the head.
The fastest way to find a value in a JS object is to test the existence of
object[key], but is that also the case for extremely long, largely similar, keys (+100 chars), in a fairly large Dictionary (+1000 entries)?
Are there alternatives for this case, or is this a completely moot question, because accessing values by key is already insanely fast?

Long story short: it doesn't matter much. JS will internally use a hash table (as you already said yourself), so it will need to calculate a hash of your keys for insertion and (in some cases) for accessing elements.
Calculating a hash (for most reasonable hash functions) will take slightly longer for long keys than for short keys (I would guess about linearly longer), but it doesn't matter whether the changes are at the tail or at the head.
You could decide to roll your own hashes instead, cache these somehow, and use these as keys, but this would leave it up to you to deal with hash collisions. It will be very hard to do better than the default implementation, and is almost certainly not worth the trouble.
Moreover, for an associative array with only 1000 elements, probably none of this matters. Modern CPUs can process billions of instructions per second. Even a plain linear search through the whole array will likely perform just fine, unless you have to do it very, very often.

Hash tables (dictionary, map, etc.) first check the hash code, and only then, if necessary (in case of a collision, i.e. at least two keys have the same hash code), perform equals. If you experience performance problems, the first thing you have to check, IMHO, is hash code collisions. It may turn out (bad implementation or weird keys) that the hash code is computed on, say, the first 3 chars (it's a wild exaggeration, of course):
"abc123".hashCode() ==
"abc456".hashCode() ==
...
"abc789".hashCode()
and so you have a lot of collisions, have to perform equals, and finally end up with a slow O(N) routine. In that case, you have to think of a better hash.
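The exaggerated case above can be reproduced with a toy example. Both hash functions below are hypothetical and written only for this illustration; they are not what any JS engine uses for object keys.

```javascript
// Deliberately bad hash (hypothetical): only looks at the first three characters.
function prefixHash(key) {
  let h = 0;
  for (let i = 0; i < Math.min(3, key.length); i++) {
    h = (h * 31 + key.charCodeAt(i)) | 0; // "| 0" keeps h a 32-bit integer
  }
  return h;
}

// Same formula, but over every character of the key.
function fullHash(key) {
  let h = 0;
  for (let i = 0; i < key.length; i++) {
    h = (h * 31 + key.charCodeAt(i)) | 0;
  }
  return h;
}

// Number of distinct hash codes produced for a set of keys.
function countDistinctHashes(keys, hashFn) {
  return new Set(keys.map(hashFn)).size;
}

const keys = ['abc123', 'abc456', 'abc789'];
console.log(countDistinctHashes(keys, prefixHash)); // 1, every key collides
console.log(countDistinctHashes(keys, fullHash));   // 3
```

With the prefix-only hash every lookup falls back to comparing keys one by one, which is the O(N) behaviour described above.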

What's the difference between .substr(0,1) or .charAt(0)?

We were wondering in this thread if there was a real difference between the use of .substr(0,1) and the use of .charAt(0) when you want to get the first character (actually, it could apply to any case where you want only one char).
Is either one faster than the other?

Measuring it is the key!
Go to http://jsperf.com/substr-or-charat to benchmark it yourself.
substr(0,1) runs at 21,100,301 operations per second on my machine, charAt(0) runs 550,852,974 times per second.
I suspect that charAt accesses the string as an array internally, rather than splitting the string.
As found in the comments, accessing the char directly using string[0] is slightly faster than using charAt(0).

Unless your whole script is based on the need for doing fast string manipulation, I wouldn't worry about the performance aspect at all. I'd use charAt() on the grounds that it's readable and the most specific tool for the job provided by the language. Also, substr() is not strictly standard, and while it's very unlikely any new ECMAScript implementation would omit it, it could happen. The standards-based alternatives to str.charAt(0) are str.substring(0, 1) and str.slice(0, 1), and for ECMAScript 5 implementations, str[0].
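For a quick side-by-side of the alternatives mentioned here, a small sketch follows (the variable names are arbitrary). The one behavioural difference worth knowing shows up on an empty string.

```javascript
// All four give the same result for a non-empty string.
const str = 'hello';
console.log(str.charAt(0));       // "h"
console.log(str.substring(0, 1)); // "h"
console.log(str.slice(0, 1));     // "h"
console.log(str[0]);              // "h"

// On an empty string, bracket access differs from the method calls.
const empty = '';
console.log(empty.charAt(0) === '');       // true
console.log(empty.substring(0, 1) === ''); // true
console.log(empty.slice(0, 1) === '');     // true
console.log(empty[0] === undefined);       // true
```

So if the string may be empty, charAt(0) and str[0] are not interchangeable without a check.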
