linked list vs arrays for dictionaries - javascript

I was recently asked in an interview about the advantages and disadvantages of linked lists and arrays for implementing a dictionary of words, and also what the best data structure for implementing one would be. This is where I messed things up. After googling, I couldn't find an answer specific to dictionaries, only general linked-list-versus-array explanations. What is the best answer to the above question?

If you're just going to use it for lookups, then an array is the obvious best choice of the two. You can build the dictionary from a list of words in O(n log n)--just build an array and sort it. Lookups are O(log n) with a binary search.
Although you can build a linked list of words in O(n), lookups will require, on average, that you look at n/2 words. The difference is pretty large. Given an English dictionary of 128K words, a linked list lookup will take on average 64,000 string comparisons. A binary search will require at most 17.
In addition, a linked list of n words will occupy more memory than an array of n words, because you need the next pointer in the list.
If you need the ability to update the dictionary, you'll probably still want to use an array if updates are infrequent compared to lookups (which is almost certainly the case). I can't think of a real-world example of a dictionary of words that's updated more frequently than it's queried.
As others have pointed out, neither array nor linked list is the best choice for a dictionary of words. But of the two options you're given, array is superior in almost all cases.
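For concreteness, here is a minimal sketch of the sorted-array approach described above; the word list is an illustrative assumption:

// Dictionary as a sorted array; lookup is a binary search, O(log n).
const words = ["apple", "banana", "cherry", "date"]; // already sorted

function contains(sorted, word) {
  let lo = 0, hi = sorted.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;           // midpoint of the current range
    if (sorted[mid] === word) return true;
    if (sorted[mid] < word) lo = mid + 1; // target is in the upper half
    else hi = mid - 1;                    // target is in the lower half
  }
  return false;
}

console.log(contains(words, "cherry")); // true, in at most ~log2(n) comparisons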

There is no one answer.
The two obvious choices would be something based on a hash table if you only want to look up individual items, or something based on a balanced tree if you want to look up ranges of items.
A sorted array can work well if you do a lot of searching and relatively little insertion or deletion. Finding situations where linked lists are preferred is rather more difficult. Depending on the situation (especially such things as finding all the words that start with, say, "ste"), tries can also work extremely well (and often do well at minimizing the storage needed for a given set of data as well).
Those are really broad categories though, not specific implementations. There are also variations such as extensible hashing and distributed hash tables that can be useful in specific situations (and also have somewhat tree-like properties, so things like range-based searching can be reasonably efficient).

The best data structure for implementing a dictionary is a suffix tree. You can also have a look at tries.

Well, if you're building a dictionary, you'd want it to be a sorted structure. So you're choosing between a sorted array and a sorted linked list.
For a linked list, retrieval is O(n), since you have to examine words one by one until you find the one you need. For a sorted array, you can use binary search to find the right location, which is O(log n).
For a sorted array, insertion is O(log n) to find the right location (binary search) and then O(n) to insert, because you need to shift everything down. For a linked list, it would be O(n) to find the location and then O(1) to insert, because you only have to adjust pointers. The same applies to deletion.
Since you aren't going to be updating a dictionary much, you can just build the array and then sort it in O(n log n) time (using quicksort, for example). After that, lookup is O(log n) using binary search. Furthermore, as delnan mentioned below, using an array has the advantage that everything you access is sequential in memory; i.e., the data are localized (locality of reference). This minimizes cache misses (which are expensive). With a linked list, the data are spread out all over, and there is no guarantee that they are close together, which increases the chance of cache misses. With this in mind, given the two options, use the array.
You can do an even better job if you implement a sorted map using a red-black tree (the tree entries, which are the keys, can be coupled with a hashmap); there, search, insert, and delete are O(log n). But it really depends on your behavior profile; if you're only doing lookups, a simple hashmap would be best (O(1) retrieval).
Another interesting data structure you can use is a trie, where insertion and lookup are O(m), m being the length of the string.
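As a rough sketch of the trie idea (the node layout, using plain objects as child maps, is an illustrative choice, not the only one):

// A trie: insert and lookup cost O(m) in the word length m.
class Trie {
  constructor() { this.root = {}; }
  insert(word) {
    let node = this.root;
    for (const ch of word) {
      node = node[ch] || (node[ch] = {}); // create child nodes as needed
    }
    node.isWord = true; // mark the end of a complete word
  }
  has(word) {
    let node = this.root;
    for (const ch of word) {
      node = node[ch];
      if (!node) return false; // prefix not present
    }
    return node.isWord === true;
  }
}

const t = new Trie();
t.insert("steam");
console.log(t.has("steam")); // true
console.log(t.has("ste"));   // false: a prefix, not a stored word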

Related

How are arrays implemented in JavaScript? What happened to the good old lists?

JavaScript provides a variety of data structures, ranging from simple objects to arrays, sets, maps, their weak variants, and ArrayBuffers.
Over the past half year I have found myself needing to recreate some of the more common structures, like deques, count maps, and mostly different variants of trees.
While looking at the ECMA specification, I could not find a description of how arrays are implemented at the memory level; supposedly this is up to the underlying engine?
Contrary to languages I am used to, arrays in JavaScript have a variable length, similar to lists. Does that mean that elements are not necessarily laid out next to each other in memory? Do splice, push, and pop actually result in a new allocation once a certain threshold is reached, similar to, for example, ArrayLists in Java? I am wondering whether arrays are the way to go for queues and stacks, or whether actual list implementations with references to the next element might be better suited in JavaScript in some cases (e.g., regarding overhead, as opposed to the native implementation of arrays).
If someone has some more in-depth literature, please feel encouraged to link it here.
The ECMAScript specification does not specify or require a specific implementation. That is up to the engine that implements the array to decide how best to store the data.
Arrays in the V8 engine have multiple forms based on how the array is being used. A sequential array with no holes that contains only one data type is highly optimized into something similar to an array in C++. But if it contains mixed types, or if it contains holes (blocks of the array with no value, often called a sparse array), it would have an entirely different implementation structure. And, as you can imagine, it may be dynamically changed from one implementation type to another if the data in the array changes to make it incompatible with its current optimized form.
Because arrays offer indexed, random access, they are not implemented as linked lists internally; linked lists have no efficient way to do random, indexed access.
Growing an array may require reallocating a larger block of memory and copying the existing array into it. Calling something like .splice() to remove items will have to copy portions of the array down to the lower position.
Whether or not it makes more sense to use your own linked list implementation for a queue instead of an array depends upon a bunch of things. If the queue gets very large, then it may be faster to deal with the individual allocations of a list, so as to avoid having to copy large portions of the queue around in order to manipulate it. If the queue never gets very large, then the overhead of moving data in an array is small, and the extra complication of a linked list and the extra allocations involved may not be worth it.
As an extreme example, a very large FIFO queue would not be particularly optimal as an array, because you'd be adding items at one end and removing items from the other end, which requires copying the entire array down to insert or remove an item from the bottom end; and if the length changed regularly, the engine would probably have to reallocate regularly too. Whether or not that copying overhead is relevant in your app would need to be tested with an actual performance test to see if it is worth doing something about.
But if your queue is always entirely the same data type and never has any holes in it, then V8 can optimize it to a C++-style block of memory, and calling .splice() on it to remove an item can be highly optimized (using CPU block-move instructions), which can be very, very fast. So you'd really have to test to decide whether it is worth trying to optimize beyond an array.
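To make the trade-off concrete, here is a minimal sketch of a linked-list FIFO queue; whether it actually beats a plain array in your app is exactly the kind of thing you'd have to measure:

// Linked-list FIFO queue: enqueue and dequeue are O(1) and never copy
// the rest of the queue, unlike Array.prototype.shift().
class Queue {
  constructor() { this.head = null; this.tail = null; }
  enqueue(value) {
    const node = { value, next: null };
    if (this.tail) this.tail.next = node; // append after the current tail
    else this.head = node;                // queue was empty
    this.tail = node;
  }
  dequeue() {
    if (!this.head) return undefined;     // queue is empty
    const { value } = this.head;
    this.head = this.head.next;
    if (!this.head) this.tail = null;     // queue became empty
    return value;
  }
}

const q = new Queue();
q.enqueue(1); q.enqueue(2);
console.log(q.dequeue(), q.dequeue()); // 1 2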
Here's a very good talk on how V8 stores and optimizes arrays:
Elements Kinds in V8
Here are some other reference articles on the topic:
How do JavaScript arrays work under the hood
V8 array source code
Performance tips in V8
How does V8 optimize large arrays

Does sorting JSON keys/attributes alphabetically make a difference to performance

Say I have a JSON file that represents a dictionary of words, where the word is the key and the value is the definition of that word. Would sorting the keys in alphabetical order in the JSON file make a difference to performance when searching for a word to find its definition (plus maybe other details about that word) in JS, if there were thousands of words or even more?
Or do JSON and JavaScript already have an algorithm built in to find results optimally, without the need to sort the data for better performance?
Also, I wouldn't mind having an alternative data structure or format or library that could give faster results for this kind of search problem suggested to me! (but of course, this suggestion isn't part of the main question)
There is no need to sort the keys in a JSON structure. Reading a key/value will not be faster after sorting. See the discussion here.
If you have very big structures, you can implement a trie yourself. See: Wiki TRIE. Or see this: trie.js
Sorting the keys does not impact the search performance. JavaScript objects may not remember the order of insertion.
If possible, use the Map object: the Map object holds key-value pairs and remembers the original insertion order of the keys. Any value (both objects and primitive values) may be used as either a key or a value.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Map
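A small sketch of the Map suggestion applied to the word/definition use case (the sample entries are made up for illustration):

// Map as a word → definition dictionary; lookups by key are average-case O(1),
// regardless of whether the keys were inserted in alphabetical order.
const dictionary = new Map([
  ["array", "an ordered, indexed collection of values"],
  ["trie", "a prefix tree keyed by characters"],
]);
dictionary.set("map", "a keyed collection that remembers insertion order");

console.log(dictionary.get("trie"));  // "a prefix tree keyed by characters"
console.log(dictionary.has("zebra")); // false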

Reverse string comparison

I'm using a Dictionary (associative array, hash table, any of these synonyms).
The keys used to uniquely identify values are fairly long strings. However, I know that these strings tend to differ at the tail, rather than the head.
The fastest way to find a value in a JS object is to test the existence of object[key], but is that also the case for extremely long, largely similar keys (100+ chars) in a fairly large Dictionary (1000+ entries)?
Are there alternatives for this case, or is this a completely moot question, because accessing values by key is already insanely fast?
Long story short: it doesn't matter much. JS will internally use a hash table (as you already said yourself), so it will need to calculate a hash of your keys for insertion and (in some cases) for accessing elements.
Calculating a hash (for most reasonable hash functions) will take slightly longer for long keys than for short keys (I would guess roughly linearly longer), but it doesn't matter whether the changes are at the tail or at the head.
You could decide to roll your own hashes instead, cache these somehow, and use these as keys, but this would leave it up to you to deal with hash collisions. It will be very hard to do better than the default implementation, and is almost certainly not worth the trouble.
Moreover, for an associative array with only 1000 elements, probably none of this matters. Modern CPUs can process billions of instructions per second. Even a plain linear search through the whole array will likely perform just fine, unless you have to do it very, very often.
Hash tables (dictionary, map, etc.) first compare hash codes, and only then, if necessary (in case of a collision, i.e. at least two keys having the same hash code), perform an equality check. If you experience performance problems, the first thing to check, IMHO, is hash-code collisions. It may turn out (through a bad implementation or weird keys) that the hash code is computed on, say, the first 3 chars (a wild exaggeration, of course):
"abc123".hashCode() ==
"abc456".hashCode() ==
...
"abc789".hashCode()
and so you have a lot of collisions, have to perform equality checks, and end up with a slow O(N) routine. In that case, you have to think up a better hash.
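To make the exaggerated example concrete (JS strings have no built-in hashCode; the function below is a deliberately bad stand-in):

// A toy hash that only looks at the first three characters,
// so all "abc..." keys collide into the same bucket.
function badHash(s) {
  let h = 0;
  for (const ch of s.slice(0, 3)) {
    h = (h * 31 + ch.charCodeAt(0)) | 0;
  }
  return h;
}

console.log(badHash("abc123") === badHash("abc456")); // true: collision
console.log(badHash("abc456") === badHash("abc789")); // true: collision
// Every such key lands in one bucket, so lookups degrade to O(N) equality checks.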

Did I just sort in O(n) on JavaScript?

Using the underscore.js library, I tried to abuse the indexing of a JavaScript object in order to sort an array a of integers or strings:
_(a).chain().indexBy(_.identity).values().value()
I realize it is kind of a "hack", but it actually yielded a sorted array in O(n) time...
Am I dreaming?
You aren't actually sorting anything.
Instead, you're building a hashtable and traversing it in hash order, which may be the same as sorted order for some sets.
It is possible to sort in O(n) using bucket sort (http://en.wikipedia.org/wiki/Bucket_sort), which I believe is what you attempted to write here, but as mentioned above you can't rely on the order of the values of an object.
It is possible to sort this way in O(n) if you have a limited range of values.
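For reference, a sketch of a counting-style bucket sort for small non-negative integers, which is the legitimate O(n + k) version of the "index by value" trick (the bounded range is the key assumption):

// Counting sort: O(n + k), where k is the size of the value range.
function countingSort(a, maxValue) {
  const counts = new Array(maxValue + 1).fill(0);
  for (const x of a) counts[x]++;            // bucket each value by its own index
  const out = [];
  for (let v = 0; v <= maxValue; v++) {
    for (let c = 0; c < counts[v]; c++) out.push(v);
  }
  return out;
}

console.log(countingSort([3, 1, 4, 1, 5], 5)); // [1, 1, 3, 4, 5]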
Your algorithm is not a comparison sort:
A comparison sort is a type of sorting algorithm that only reads the list elements through a single abstract comparison operation (often a "less than or equal to" operator or a three-way comparison) that determines which of two elements should occur first in the final sorted list.
You are using knowledge about the structure of the values (i.e. knowing that they're integers or strings) in your algorithm, by using those integers/strings as indexes. You are not adhering to the limitations imposed on a comparison sort, and thus you are not restricted to the O(n log n) boundary on time complexity.
Yes, you are dreaming :-)
It beggars belief that you would have found such a holy grail by accident. If that sequence of operations is a comparison-based sort, people who know this stuff have actually proven that it cannot be done in O(n) time.
I strongly suggest you run that code with dataset sizes of 10, 100, 1000, and so on and you'll see your assumption is incorrect.
Then check to see if you are actually sorting the array or whether this is just an artifact of its organisation. It seems very likely that the indexBy is simply creating an index structure where the order just happens to be the sort order you want, not something that would be guaranteed for all inputs.
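You can see that "artifact of organisation" directly: per the ECMAScript specification, integer-like own keys of a plain object are enumerated in ascending numeric order, while other string keys keep insertion order, which is why the trick appears to work for integers but not in general:

// Integer-like keys enumerate in ascending numeric order...
const nums = {};
for (const x of [42, 7, 19]) nums[x] = x;
console.log(Object.values(nums)); // [7, 19, 42]

// ...but ordinary string keys keep insertion order, so the "sort" fails:
const strs = {};
for (const s of ["pear", "apple"]) strs[s] = s;
console.log(Object.values(strs)); // ["pear", "apple"], not sorted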

What is the complexity of retrieval/insertion in JavaScript associative arrays (dynamic object properties) in the major javascript engines?

Take the following code example:
var myObject = {};
var i = 100;
while (i--) {
  myObject["foo" + i] = new Foo(i);
}
console.log(myObject["foo42"].bar());
I have a few questions.
What kind of data structure do the major engines (IE, Mozilla, Chrome, Safari) use for storing key-value pairs? I'd hope it's some kind of binary search tree, but I think they may use linked lists (due to the fact that iteration is done in insertion order).
If they do use a search tree, is it self-balancing? Because the above code with a conventional search tree would create an unbalanced tree, causing a worst-case scenario of O(n) for searching, rather than O(log n) for a balanced tree.
I'm only asking this because I will be writing a library which will require efficient retrieval of keys from a data structure, and while I could implement my own or an existing red-black tree I would rather use native object properties if they're efficient enough.
The question is hard to answer for a couple of reasons. First, the modern browsers all heavily and dynamically optimize code while it is executing, so the algorithm chosen to access the properties might differ for the same code. Second, each engine uses different algorithms and heuristics to determine which access algorithm to use. Third, the ECMA specification dictates what the result must be, not how the result is achieved, so the engines have a lot of freedom to innovate in this area.
That said, given your example, all the engines I am familiar with will use some form of hash table to retrieve the value associated with foo42 from myObject. If you use an object like an associative array, JavaScript engines will tend to favor a hash table. None that I am aware of use a tree for string properties. Hash tables are worst case O(N) and best case O(1), and they tend to be closer to O(1) than O(N) if the key generator is any good. Each engine has some pattern you could use to get it to perform at O(N), but that will be different for each engine. A balanced tree would guarantee worst case O(log N), but modifying a balanced tree while keeping it balanced is not O(log N), and hash tables are more often better than O(log N) for string keys. They are also O(1) to update (once you determine you need to, which has the same big-O as a read) if there is space in the table; the table periodically needs an O(N) rebuild, but since it usually doubles in size each time, you will only pay that O(N) 7 or 8 times over the life of the table.
Numeric properties are special, however. If you access an object using integer numeric properties that have few or no gaps in range, that is, use the object like an array, the values will tend to be stored in a linear block of memory with O(1) access. Even if your accesses have gaps, the engine will probably shift to a sparse-array representation, which will probably be, at worst, O(log N).
Accessing a property by identifier is also special. If you access the property like,
myObject.foo42
and execute this code often (that is, when the speed of this matters) with the same or similar objects, it is likely to be optimized into one or two machine instructions. What makes objects similar also differs for each engine, but if they are constructed by the same literal or function, they are more likely to be treated as similar.
No engine that does at all well on the JavaScript benchmarks will use the same algorithm for every object. They all must dynamically determine how the object is being used and try to adjust the access algorithm accordingly.
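A sketch of that "similar objects" point, based on commonly documented engine behavior (hidden classes/shapes); the speedup itself happens inside the engine and is not observable from the code:

// Objects built by the same constructor, with properties added in the same
// order, share an internal "shape", so `obj.i` can compile to a direct offset load.
function Foo(i) { this.i = i; }
Foo.prototype.bar = function () { return this.i * 2; };

const objs = [];
for (let i = 0; i < 100; i++) objs.push(new Foo(i)); // all instances share one shape

// Adding a property later gives this one object a different shape; call sites
// that now see both shapes become polymorphic and may be slower.
objs[0].extra = true;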
