Reverse string comparison

Reverse string comparison - javascript

I'm using a Dictionary (associative array, hash table, any of these synonyms).
The keys used to uniquely identify values are fairly long strings. However, I know that these strings tend to differ at the tail, rather than the head.
The fastest way to find a value in a JS object is to test the existence of object[key], but is that also the case for extremely long, largely similar keys (100+ chars) in a fairly large Dictionary (1000+ entries)?
Are there alternatives for this case, or is this a completely moot question, because accessing values by key is already insanely fast?

Long story short: it doesn't matter much. JS will internally use a hash table (as you already said yourself), so it will need to calculate a hash of your keys for insertion and (in some cases) for accessing elements.
Calculating a hash (for most reasonable hash functions) will take slightly longer for long keys than for short keys (I would guess about linearly longer), but it doesn't matter whether the changes are at the tail or at the head.
You could decide to roll your own hashes instead, cache these somehow, and use these as keys, but this would leave it up to you to deal with hash collisions. It will be very hard to do better than the default implementation, and is almost certainly not worth the trouble.
Moreover, for an associative array with only 1000 elements, probably none of this matters. Modern CPUs can process billions of instructions per second. Even a plain linear search through the whole array will likely perform just fine, unless you have to do it very, very often.
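To make the scenario concrete, here is a minimal sketch of the situation described in the question: about 1000 keys of 100+ characters that share a long head and differ only at the tail. The key shape and the names makeKey, dictionary and lookup are assumptions made up for illustration; nothing here comes from the asker's code.

```javascript
// Hypothetical key shape: a long shared head followed by a short unique tail.
const sharedHead = "x".repeat(100);

function makeKey(i) {
  return sharedHead + "-" + i;
}

// Fill a plain object with 1000 long, largely similar keys.
const dictionary = {};
for (let i = 0; i < 1000; i++) {
  dictionary[makeKey(i)] = i;
}

// Look up a value; hasOwnProperty avoids false hits on inherited properties.
function lookup(key) {
  return Object.prototype.hasOwnProperty.call(dictionary, key)
    ? dictionary[key]
    : undefined;
}

console.log(lookup(makeKey(42)));   // 42
console.log(lookup(makeKey(5000))); // undefined
```

Whether the keys differ at the head or the tail does not change how this code is written; any difference would only show up when you measure it in a specific engine.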

Hash tables (dictionary, map, etc.) first check the hash code, and only then, if necessary (in case of a collision, where at least two keys have the same hash code), perform equals. If you experience performance problems, the first thing to check, IMHO, is hash code collisions. It may turn out (bad implementation or weird keys) that the hash code is computed on, say, the first 3 chars (it's a wild exaggeration, of course):
"abc123".hashCode() ==
"abc456".hashCode() ==
...
"abc789".hashCode()
and so you have a lot of collisions, have to perform equals each time, and finally end up with a slow O(N) lookup. In that case, you have to think up a better hash.
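The exaggerated case above can be sketched in JavaScript. Both hash functions below are hypothetical illustrations (weakHash deliberately looks at only the first 3 characters, fullHash walks the whole string in the style of Java's String.hashCode); neither is what any JS engine actually uses internally.

```javascript
// Deliberately weak hash (hypothetical): only the first 3 characters count.
function weakHash(key) {
  let h = 0;
  for (let i = 0; i < Math.min(3, key.length); i++) {
    h = (h * 31 + key.charCodeAt(i)) | 0; // keep the value in 32-bit range
  }
  return h;
}

// Hash over every character of the key.
function fullHash(key) {
  let h = 0;
  for (let i = 0; i < key.length; i++) {
    h = (h * 31 + key.charCodeAt(i)) | 0;
  }
  return h;
}

// Number of distinct hash values produced for a list of keys.
function countDistinctHashes(keys, hashFn) {
  return new Set(keys.map(hashFn)).size;
}

const sampleKeys = ["abc123", "abc456", "abc789"];
console.log(countDistinctHashes(sampleKeys, weakHash)); // 1: every key collides
console.log(countDistinctHashes(sampleKeys, fullHash)); // 3: no collisions
```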

Related

String Comparison vs. Hashing

I recently learned about the rolling hash data structure, and one of its prime uses is searching for a substring within a string. Here are some advantages that I noticed:
Comparing two strings can be expensive so this should be avoided if possible
Hashing the strings and comparing the hashes is generally much faster than comparing strings, however rehashing the new substring each time traditionally takes linear time
A rolling hash is able to rehash the new substring in constant time, making it much quicker and more efficient for this task
I went ahead and implemented a rolling hash in JavaScript and began to analyze the speed between a rolling hash, traditional rehashing, and just comparing the substrings against each other.
In my findings, the larger the substring, the longer it took for the traditional rehashing approach to run (as expected), whereas the rolling hash ran incredibly fast (as expected). However, comparing the substrings directly ran much faster than the rolling hash. How could this be?
For the sake of perspective, let's say the running times for the functions searching through a ~2.4 million character string for a 100 character substring were the following:
Rolling Hash - 0.809 seconds
Traditional Rehashing - 71.009 seconds
Just comparing the strings (no hashing) - 0.089 seconds
How could the string comparing be so much faster than the rolling hash? Could it just have something to do with JavaScript in particular? Strings are a primitive type in JavaScript; would this cause string comparisons to run in constant time?
My main confusion is as to how/why string comparisons are so fast in JavaScript, when I was under the impression that they were supposed to be relatively slow.
Note:
By string comparisons I'm referring to something like stringA === stringB
Note:
I asked this question over on the Computer Science Community and was informed that I should ask it here as well because this is most likely JavaScript specific.

After some testing and analysis, I've come to the conclusion that there were a few reasons as to why my rolling hash approach was running slightly slower than simply comparing the two strings.
If the rolling hash claims to run in constant time, how can it be slower than comparing strings?
Functions are relatively slow - calling a function is slightly slower than simply executing code inline. In my particular case, a function had to be called on my object every time the rolling hash rehashes its internal window, therefore taking slightly longer to run compared to the string comparison, since that code was simply inline. Especially since my benchmark has the rolling hash "shift" over 2 million iterations, this function slowdown can be seen more clearly.
But why is the string comparison so fast?
Strings are primitive - Basically, because strings are a primitive type in JavaScript, attempting to compare two strings will most likely invoke some routine that is coded directly within the interpreter. This low level evaluation can be done as fast as the architecture possibly can (similar to comparing numbers).
In Conclusion
Comparing strings in JavaScript will end up being faster than a rolling hash in this scenario because the strings are primitive, therefore allowing the interpreter to work with these elements very quickly, and because simply calling functions will create a slight overhead and slow down the process on a very small scale.
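For reference, a rolling-hash substring search of the kind discussed here might look like the sketch below. This is not the asker's implementation (which is not shown); the function name rollingHashSearch and the constants BASE and MOD are assumptions chosen so that all intermediate values stay well inside JavaScript's safe integer range.

```javascript
// Rabin-Karp style search: returns the index of the first match, or -1.
function rollingHashSearch(text, pattern) {
  const BASE = 256;     // assumed multiplier
  const MOD = 1000003;  // assumed prime modulus
  const m = pattern.length;
  if (m === 0) return 0;
  if (m > text.length) return -1;

  // BASE^(m-1) mod MOD, needed to remove the outgoing (leading) character.
  let highPow = 1;
  for (let i = 1; i < m; i++) highPow = (highPow * BASE) % MOD;

  // Hash the pattern and the first window of the text.
  let patternHash = 0;
  let windowHash = 0;
  for (let i = 0; i < m; i++) {
    patternHash = (patternHash * BASE + pattern.charCodeAt(i)) % MOD;
    windowHash = (windowHash * BASE + text.charCodeAt(i)) % MOD;
  }

  for (let start = 0; ; start++) {
    // Equal hashes can still be a collision, so confirm with a real comparison.
    if (windowHash === patternHash && text.startsWith(pattern, start)) {
      return start;
    }
    if (start + m >= text.length) return -1;

    // Constant-time update: drop text[start], append text[start + m].
    const outgoing = (text.charCodeAt(start) * highPow) % MOD;
    windowHash =
      ((windowHash - outgoing + MOD) * BASE + text.charCodeAt(start + m)) % MOD;
  }
}

console.log(rollingHashSearch("hello world", "world")); // 6
console.log(rollingHashSearch("hello world", "xyz"));   // -1
```

Each window shift costs a handful of arithmetic operations in script code, whereas the direct comparison is handled by the engine's built-in string routines, which is consistent with the timings reported above.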

linked list vs arrays for dictionaries

I was recently asked in an interview about the advantages and disadvantages of linked lists and arrays for implementing a dictionary of words, and also what the best data structure for implementing it is. This is where I messed things up. After googling, I couldn't find an answer specific to dictionaries, only general linked list vs. array explanations. What is the best-suited answer to the above question?

If you're just going to use it for lookups, then an array is the obvious best choice of the two. You can build the dictionary from a list of words in O(n log n)--just build an array and sort it. Lookups are O(log n) with a binary search.
Although you can build a linked list of words in O(n), lookups will require, on average, that you look at n/2 words. The difference is pretty large. Given an English dictionary of 128K words, a linked list lookup will take on average 64,000 string comparisons. A binary search will require at most 17.
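A minimal sketch of the sorted-array lookup described above, with a counter so the number of string comparisons can be observed. The function name, the returned shape and the tiny word list are assumptions for illustration only.

```javascript
// Binary search over a sorted array of words.
// Returns the index (or -1) plus the number of elements that were inspected.
function binarySearchWords(sortedWords, target) {
  let lo = 0;
  let hi = sortedWords.length - 1;
  let comparisons = 0;
  while (lo <= hi) {
    const mid = (lo + hi) >>> 1;
    comparisons++;
    if (sortedWords[mid] === target) return { index: mid, comparisons };
    if (sortedWords[mid] < target) lo = mid + 1;
    else hi = mid - 1;
  }
  return { index: -1, comparisons };
}

// Building the dictionary: collect the words, then sort once (O(n log n)).
const wordList = ["pear", "apple", "fig", "kiwi", "mango", "banana", "cherry"].sort();

console.log(binarySearchWords(wordList, "kiwi"));  // { index: 4, comparisons: 3 }
console.log(binarySearchWords(wordList, "zebra").index); // -1
```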
In addition, a linked list of n words will occupy more memory than an array of n words, because you need the next pointer in the list.
If you need the ability to update the dictionary, you'll probably still want to use an array if updates are infrequent compared to lookups (which is almost certainly the case). I can't think of a real-world example of a dictionary of words that's updated more frequently than it's queried.
As others have pointed out, neither array nor linked list is the best choice for a dictionary of words. But of the two options you're given, array is superior in almost all cases.

There is no one answer.
The two obvious choices would be something based on a hash table if you only want to look up individual items, or something based on a balanced tree if you want to look up ranges of items.
A sorted array can work well if you do a lot of searching and relatively little insertion or deletion. Finding situations where linked lists are preferred is rather more difficult. Depending on the situation (especially such things as finding all the words that start with, say, "ste"), tries can also work extremely well (and often do well at minimizing the storage needed for a given set of data as well).
Those are really broad categories though, not specific implementations. There are also variations such as extensible hashing and distributed hash tables that can be useful in specific situations (and also have somewhat tree-like properties, so things like range-based searching can be reasonably efficient).

The best data structure for implementing dictionaries is the suffix tree. You can also have a look at tries.

Well, if you're building a dictionary, you'd want it to be a sorted structure. So you're going for a sorted-array or a sorted linked-list.
For a linked list retrieval is O(n) since you have to examine all words until you find the one you need. For a sorted array, you can use binary search to find the right location, which is O(log n).
For a sorted array, insertion is O(log n) to find the right location (binary search) and then O(n) to insert because you need to push everything down. For a linked list, it would be O(n) to find the location and then O(1) to insert because you only have to adjust pointers. The same applies for deletion.
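The sorted-array insertion described above (binary search for the position, then shifting the tail) can be sketched as follows. The helper name insertSorted is an assumption; Array.prototype.splice does the O(n) shifting.

```javascript
// Insert a word into an already sorted array, keeping it sorted.
// Returns the index at which the word was placed.
function insertSorted(sortedWords, word) {
  // O(log n): find the first position whose element is not smaller than word.
  let lo = 0;
  let hi = sortedWords.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (sortedWords[mid] < word) lo = mid + 1;
    else hi = mid;
  }
  // O(n): splice shifts every later element one slot to the right.
  sortedWords.splice(lo, 0, word);
  return lo;
}

const sortedSample = ["apple", "cherry", "fig"];
console.log(insertSorted(sortedSample, "banana")); // 1
console.log(sortedSample); // [ 'apple', 'banana', 'cherry', 'fig' ]
```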
Since you aren't going to be updating a dictionary much, you can just build and then sort the array in O(n log n) time (using quicksort for example). After that, lookup is O(log n) using binary search. Furthermore, as delnan mentioned below, using an array has the advantage that everything you access is sequential in memory; i.e., the data are localized (locality of reference). This minimizes cache misses (which are expensive). With a linked list, the data are spread out all over and there is no guarantee that they are close together, which increases the chance of cache misses. With this in mind, given the two options, use the array.
You can do an even better job if you implement a sorted hashmap using a red-black tree (your tree entries, which are the keys can be coupled with a hashmap); here search, insert, and delete are O(log n). But it really depends on your behavior profile; if you're only doing lookup, a simple hashmap would be best (O(1) retrieval).
Another interesting data-structure you can use is a Trie, where insertion and lookup are O(m); m being the length of the string.
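A minimal trie sketch, assuming one node per character with a Map of children. The class and method names are made up for illustration. Insertion and lookup each walk at most m nodes for a word of length m, and prefix queries such as "all words starting with ste" fall out naturally.

```javascript
class Trie {
  constructor() {
    this.root = { children: new Map(), isWord: false };
  }

  // O(m): create or follow one node per character.
  insert(word) {
    let node = this.root;
    for (const ch of word) {
      if (!node.children.has(ch)) {
        node.children.set(ch, { children: new Map(), isWord: false });
      }
      node = node.children.get(ch);
    }
    node.isWord = true;
  }

  // Follow the characters of prefix; returns the node reached, or null.
  findNode(prefix) {
    let node = this.root;
    for (const ch of prefix) {
      node = node.children.get(ch);
      if (node === undefined) return null;
    }
    return node;
  }

  // O(m): true only if the exact word was inserted.
  has(word) {
    const node = this.findNode(word);
    return node !== null && node.isWord;
  }

  // Collect every stored word that begins with prefix.
  wordsWithPrefix(prefix) {
    const results = [];
    const start = this.findNode(prefix);
    if (start === null) return results;
    const walk = (node, word) => {
      if (node.isWord) results.push(word);
      for (const [ch, child] of node.children) walk(child, word + ch);
    };
    walk(start, prefix);
    return results;
  }
}

const trie = new Trie();
["step", "stem", "stone"].forEach((w) => trie.insert(w));
console.log(trie.has("step"));            // true
console.log(trie.has("ste"));             // false: only a prefix, not a word
console.log(trie.wordsWithPrefix("ste")); // [ 'step', 'stem' ]
```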

hash table - how often is the hash calculated for a given key?

I was asked this during an interview. My immediate answer was for every read and write. The interviewer then asked, "Are you sure the hash isn't cached in the table somewhere?"
This made me second guess myself. In the end, I stuck to my original answer, but out of curiosity, I figured I'd ask the question here.
Also note that this interview was for a JavaScript position but the question wasn't necessarily specific to JavaScript.
So, in general, is a key's hash computed once or for every read/write? What about specific to JavaScript?

Of course it depends on the implementation, and even if you ask about JS there are several implementations (V8, SpiderMonkey, MSFT etc.).
It should also depend on the application. If your application is one that more frequently uses the last item put into the hashtable, then it should make sense to somehow cache the hash. In some cases this would be preferable.
I guess the interviewer just tried to see how you handle second-guessing...

It depends on the hash table and the key types, and on whether we're talking about the key used to read/write or the keys already in the table. The hash value of the former can be, and sometimes is, cached in the object (example: strings in Python). The hash values of the latter can be, and sometimes are, cached in the table: instead of key, value pairs you store hash, key, value triples.
In both cases, the decision depends on the kind of keys: Are they large and expensive to hash? Is it worth the extra space and memory traffic? For example, it's probably a clear win for strings larger than a couple dozen characters, and probably useless or harmful for 2D points. Also note that the hash values can be used to avoid comparisons, which might be useful but doesn't seem as important.
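The "hash, key, value triples" idea can be sketched with a toy chained hash table. Everything here (the class name, the hashCalls counter, the hash function, the grow method) is a hypothetical illustration and says nothing about how a particular JavaScript engine stores object properties. The lookup key is hashed on every read and write, while keys already in the table keep their stored hash, so growing the table never rehashes them.

```javascript
class CachingHashTable {
  constructor(bucketCount = 8) {
    this.buckets = Array.from({ length: bucketCount }, () => []);
    this.hashCalls = 0; // counts how often a hash is actually computed
  }

  // Simple string hash kept in unsigned 32-bit range.
  computeHash(key) {
    this.hashCalls++;
    let h = 0;
    for (let i = 0; i < key.length; i++) {
      h = (h * 31 + key.charCodeAt(i)) >>> 0;
    }
    return h;
  }

  findEntry(key, hash) {
    const bucket = this.buckets[hash % this.buckets.length];
    // Comparing the cached hash first skips most full key comparisons.
    return bucket.find((e) => e.hash === hash && e.key === key);
  }

  set(key, value) {
    const hash = this.computeHash(key);
    const entry = this.findEntry(key, hash);
    if (entry) entry.value = value;
    else this.buckets[hash % this.buckets.length].push({ hash, key, value });
  }

  get(key) {
    const entry = this.findEntry(key, this.computeHash(key));
    return entry ? entry.value : undefined;
  }

  // Double the bucket count, redistributing entries by their stored hash.
  grow() {
    const old = this.buckets;
    this.buckets = Array.from({ length: old.length * 2 }, () => []);
    for (const bucket of old) {
      for (const entry of bucket) {
        this.buckets[entry.hash % this.buckets.length].push(entry);
      }
    }
  }
}

const table = new CachingHashTable();
table.set("alpha", 1);
table.set("beta", 2);
table.grow();                  // no hashes are recomputed here
console.log(table.hashCalls);  // 2
console.log(table.get("beta")); // 2
```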

Does assigning a new string value create garbage that needs collecting?

Consider this javascript code:
var s = "Some string";
s = "More string";
Will the garbage collector (GC) have work to do after this sort of operation?
(I'm wondering whether I should worry about assigning string literals when trying to minimize GC pauses.)
Edit: I'm slightly amused that, although I stated explicitly in my question that I needed to minimize GC, everyone assumed I'm wrong about that. If one really must know the particular details: I've got a game in javascript -- it runs fine in Chrome, but in Firefox has semi-frequent pauses that seem to be due to GC. (I've even checked with the MemChaser extension for Firefox, and the pauses coincide exactly with garbage collection.)

Yes, strings need to be garbage-collected, just like any other type of dynamically allocated object. And yes, this is a valid concern as careless allocation of objects inside busy loops can definitely cause performance issues.
However, string values are immutable (non-changeable), and most modern JavaScript implementations use "string interning", that is, they store only one instance of each unique string value. This means that if you have something like this...
var s1 = "abc",
s2 = "abc";
...only one instance of "abc" will be allocated. This only applies to string values, not String objects.
A couple of things to keep in mind:
Functions like substring, slice, etc. will allocate a new object for each function call (if called with different parameters).
Even though both variables point to the same data in memory, there are still two variables to process when the GC cycle runs. Having too many local variables can also hurt you as each of them will need to be processed by the GC, adding overhead.
Some further reading on writing high-performance JavaScript:
https://developer.mozilla.org/en-US/docs/JavaScript/Memory_Management
https://www.scirra.com/blog/76/how-to-write-low-garbage-real-time-javascript
http://jonraasch.com/blog/10-javascript-performance-boosting-tips-from-nicholas-zakas

Yes, but unless you are doing this in a loop millions of times it won't likely be a factor for you to worry about.

As you already noticed, JavaScript is not JavaScript. It runs on different platforms and thus will have different performance characteristics.
So the definite answer to the question "Will the GC have work to do after this sort of operation?" is: maybe. If the script is as short as you've shown it, then a JIT-Compiler might well drop the first string completely. But there's no rule in the language definition that says it has to be that way or the other way. So in the end it's like it is all too often in JavaScript: you have to try it.
The more interesting question might also be: how can you avoid garbage collection. And that is try to minimize the allocation of new objects. Games typically have a pretty constant amount of objects and often there won't be new objects until an old one gets unused. For strings this might be harder as they are immutable in JS. So try to replace strings with other (mutable) representations where possible.
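One common way to keep the number of live objects constant, as suggested above, is an object pool. The sketch below is only an illustration under assumptions: the Vec2Pool class and its acquire/release methods are made up, and whether pooling actually reduces GC pauses in a given engine has to be measured.

```javascript
// Reuses small {x, y} objects instead of allocating a new one every frame.
class Vec2Pool {
  constructor() {
    this.free = []; // objects that are currently unused
  }

  // Hand out a recycled object if one is available, otherwise allocate.
  acquire(x, y) {
    const v = this.free.pop() || { x: 0, y: 0 };
    v.x = x;
    v.y = y;
    return v;
  }

  // Return an object to the pool; the caller must stop using it afterwards.
  release(v) {
    this.free.push(v);
  }
}

const pool = new Vec2Pool();
const first = pool.acquire(1, 2);
pool.release(first);
const second = pool.acquire(3, 4);
console.log(second === first);   // true: the same object was reused
console.log(second.x, second.y); // 3 4
```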

Yes, the garbage collector will have a string object containing "Some string" to get rid of. And, in answer to your question, that string assignment will make work for the GC.
Because strings are immutable and are used a lot, the JS engine has a pretty efficient way of dealing with them. You should not notice any pauses from garbage collecting a few strings. The garbage collector has work to do all the time in the normal course of javascript programming. That's how it's supposed to work.
If you are observing pauses from GC, I rather doubt it's from a few strings. There is more likely a much bigger issue going on. Either you have thousands of objects needing GC or some very complicated task for the GC. We couldn't really speculate on that without study of the overall code.
This should not be a concern unless you were doing some enormous loop and dealing with tens of thousands of objects. In that case, one might want to program a little more carefully to minimize the number of intermediate objects that are created. But, absent that level of objects, you should first write clear, reliable code and then optimize for performance only when something has shown you that there is a performance issue to worry about.

To answer your question "I'm wondering whether I should worry about assigning string literals when trying to minimize GC pauses": No.
You really don't need to worry about this sort of thing with regard to garbage collection.
GC is only a concern when creating & destroying huge numbers of Javascript objects, or large numbers of DOM elements.

What is the complexity of retrieval/insertion in JavaScript associative arrays (dynamic object properties) in the major javascript engines?

Take the following code example:
var myObject = {};
var i = 100;
while (i--) {
myObject["foo"+i] = new Foo(i);
}
console.log(myObject["foo42"].bar());
I have a few questions.
What kind of data structure do the major engines (IE, Mozilla, Chrome, Safari) use for storing key-value pairs? I'd hope it's some kind of binary search tree, but I think they may use linked lists (due to the fact that iterating is done in insertion order).
If they do use a search tree, is it self balancing? Because the above code with a conventional search tree will create an unbalanced tree, causing worst case scenario of O(n) for searching, rather than O(log n) for a balanced tree.
I'm only asking this because I will be writing a library which will require efficient retrieval of keys from a data structure, and while I could implement my own or an existing red-black tree I would rather use native object properties if they're efficient enough.

The question is hard to answer for a couple of reasons. First, the modern browsers all heavily and dynamically optimize code while it is executing, so the algorithms chosen to access the properties might be different for the same code. Second, each engine uses different algorithms and heuristics to determine which access algorithm to use. Third, the ECMA specification dictates what the result must be, not how the result is achieved, so the engines have a lot of freedom to innovate in this area.
That said, given your example all the engines I am familiar with will use some form of a hash table to retrieve the value associated with foo42 from myObject. If you use an object like an associative array JavaScript engines will tend to favor a hash table. None that I am aware of use a tree for string properties. Hash tables are worst case O(N), best case O(1) and tend to be closer to O(1) than O(N) if the key generator is any good. Each engine will have a pattern you could use to get it to perform O(N) but that will be different for each engine. A balanced tree would guarantee worst case O(log N), but modifying a balanced tree while keeping it balanced is not free (each update also costs O(log N)), and hash tables are more often better than O(log N) for string keys and are O(1) to update (once you determine you need to, which is the same big-O as read) if there is space in the table (periodically O(N) to rebuild the table but the tables usually double in space which means you will only pay O(N) 7 or 8 times for the life of the table).
Numeric properties are special, however. If you access an object using integer numeric properties that have few or no gaps in range, that is, use the object like it is an array, the values will tend to be stored in a linear block of memory with O(1) access. Even if your access has gaps the engines will probably shift to a sparse array access which will probably be, at worst, O(log N).
Accessing a property by identifier is also special. If you access the property like,
myObject.foo42
and execute this code often (that is, the speed of this matters) and with the same or similar object this is likely to be optimized into one or two machine instructions. What makes objects similar also differs for each engine but if they are constructed by the same literal or function they are more likely to be treated as similar.
No engine that does at all well on the JavaScript benchmarks will use the same algorithm for every object. They all must dynamically determine how the object is being used and try to adjust the access algorithm accordingly.
