Did I just sort in O(n) on JavaScript? - javascript

Using the Underscore.js library, I tried to abuse the indexing of a JavaScript object in order to sort an array a of integers or strings:
_(a).chain().indexBy(_.identity).values().value()
I realize it is kind of a "hack", but it actually yielded a sorted array in O(n) time...
Am I dreaming?

You aren't actually sorting anything.
Instead, you're building a hashtable and traversing it in hash order, which may be the same as sorted order for some sets.
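For what it's worth, the effect can be reproduced without Underscore: since ES2015, object property enumeration visits integer-like keys in ascending numeric order before other keys, so keying an object by the values and reading the values back looks like a sort for non-negative integers, but not for strings. A small illustration (the variable names are just for this example):
// Integer-like keys come back in ascending numeric order...
const byNumber = {};
for (const x of [3, 1, 2]) byNumber[x] = x;
console.log(Object.values(byNumber));   // [1, 2, 3] -- looks "sorted"
// ...but plain string keys come back in insertion order.
const byWord = {};
for (const w of ['pear', 'apple', 'fig']) byWord[w] = w;
console.log(Object.values(byWord));     // ['pear', 'apple', 'fig'] -- not sorted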

It is possible to sort in O(n) using bucket sort (http://en.wikipedia.org/wiki/Bucket_sort), which I believe is what you attempted to write here, but as mentioned above you can't rely on the order of an object's values.
Sorting this way in O(n) is only possible when the values come from a limited range.
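For illustration, here is a minimal counting-sort sketch (a special case of bucket sort), under the assumption that the values are small non-negative integers; the function name and the range parameter k are made up for this example:
// O(n + k) for n values in the range [0, k); only works for small non-negative integers.
function countingSort(a, k) {
  const counts = new Array(k).fill(0);
  for (const x of a) counts[x]++;        // count occurrences of each value
  const out = [];
  for (let v = 0; v < k; v++) {
    for (let i = 0; i < counts[v]; i++) out.push(v);
  }
  return out;
}
countingSort([3, 1, 2, 1, 0], 4);        // [0, 1, 1, 2, 3]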

Your algorithm is not a comparison sort:
A comparison sort is a type of sorting algorithm that only reads the
list elements through a single abstract comparison operation (often a
"less than or equal to" operator or a three-way comparison) that
determines which of two elements should occur first in the final
sorted list.
You are using knowledge about the structure of the values (i.e. knowing that they're integers or strings) in your algorithm, by using those integers/strings as indexes. You are not adhering to the limitations imposed on a comparison sort, and thus you are not restricted to the O(n log n) boundary on time complexity.

Yes, you are dreaming :-)
It beggars belief that you would have found such a holy grail by accident. If that sequence of operations were a comparison-based sort, it would contradict a proven result: people who know this stuff have shown that no comparison-based sort can run in O(n) time.
I strongly suggest you run that code with dataset sizes of 10, 100, 1000, and so on and you'll see your assumption is incorrect.
Then check to see if you are actually sorting the array or whether this is just an artifact of its organisation. It seems very likely that the indexBy is simply creating an index structure where the order just happens to be the sort order you want, not something that would be guaranteed for all inputs.

Related

Is an array of ints actually implemented as an array of ints in JavaScript / V8?

There is a claim in this article that an array of ints in JavaScript is implemented as a C++ array of ints.
However, according to MDN, unless you specifically use BigInts, all numbers in JavaScript are represented as doubles.
If I do:
const arr = [0, 1, 2, 3];
What is the actual representation in the V8 engine?
The code for V8 is here on GitHub, but I don't know where to look.
(V8 developer here.)
"C++ array of ints" is a bit of a simplification, but the key idea described in that article is correct, and an array [0, 1, 2, 3] will be stored as an array of "Smis".
What's a "Smi"? While every Number in JavaScript must behave like an IEEE754 double, V8 internally represents numbers as "small integer" (31 bits signed integer value + 1 bit tag) when it can, i.e. when the number has an integral value in the range -2**30 to 2**30-1, to improve efficiency. Engines can generally do whatever they want under the hood, as long as things behave as if the implementation followed the spec to the letter. So when the spec (or MDN documentation) says "all Numbers are doubles", what it really means from the engine's (or an engine developer's) point of view is "all Numbers must behave as if they were doubles".
When an array contains only Smis, then the array itself keeps track of that fact, so that values loaded from such arrays know their type without having to check. This matters e.g. for a[i] + 1, where the implementation of + doesn't have to check whether a[i] is a Smi when it's already known that a is a Smi array.
When the first number that doesn't fit the Smi range is stored in the array, it'll be transitioned to an array of doubles (strictly speaking still not a "C++ array", rather a custom array on the garbage-collected heap, but it's similar to a C++ array, so that's a good way to explain it).
When the first non-Number is stored in an array, what happens depends on what state the array was in before: if it was a "Smi array", then it only needs to forget the fact that it contains only Smis. No rewriting is needed, as Smis are valid object pointers thanks to their tag bit. If the array was a "double array" before, then it does have to be rewritten, so that each element is a valid object pointer. All the doubles will be "boxed" as so-called "heap numbers" (objects on the managed heap that only wrap a double value) at this point.
In summary, I'd like to point out that in the vast majority of cases, there's no need to worry about any of these internal implementation tricks, or even be aware of them. I certainly understand your curiosity though! Also, array representations are one of the more common reasons why microbenchmarks that don't account for implementation details can easily be misleading by suggesting results that won't carry over to a larger app.
Addressing comments:
V8 does sometimes even use int16 or lower.
Nope, it does not. It may or may not start doing so in the future; though if anything does change, I'd guess that untagged int32 is more likely to be introduced than int16; also if anything does change about the implementation then of course the observable behavior would not change.
If you believe that your application would benefit from int16 storage, you can use an Int16Array to enforce that, but be sure to measure whether that actually benefits you, because quite likely it won't, and may even decrease performance depending on what your app does with its arrays.
It may start to be a double when you make it a decimal
Slightly more accurately: there are several reasons why an array of Smis needs to be converted to an array of doubles, such as:
storing a fractional value in it, e.g. 0.5
storing a large value in it, e.g. 2**34
storing NaN or Infinity or -0 in it
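As a rough sketch of those transitions (the element-kind names in the comments are V8-internal terminology; %DebugPrint is a V8 intrinsic that only works when you run with --allow-natives-syntax, e.g. in Node or d8):
const arr = [0, 1, 2, 3];   // stored as Smis ("PACKED_SMI_ELEMENTS")
arr.push(0.5);              // fractional value: transitions to an array of doubles
arr.push('x');              // non-Number: transitions to tagged pointers, boxing the doubles
// %DebugPrint(arr);        // uncomment under --allow-natives-syntax to inspect the current kind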

IndexedDB Performance of Index: Select and Insert

Question 1:
Q1.1
What is the complexity of IndexedDB (JavaScript version) when doing a select or insert? Are the indexes actually "indexed"? Are they sorted or hashed? For example, when we use IDBKeyRange.only, does it take O(1), O(log(n)), or O(n) time?
Q1.2
How about the IDBKeyRange.bound? Is it sorting the index first and then doing the select?
Q1.3
What is the performance of IDBObjectStore.add()?
Q1.4
For index.openCursor(), is the index sorted in advance?
Question 2:
We are using IDBObjectStore.createIndex() to create indexes.
If the answer to Question 1 is yes (that is, the indexes really are indexed), how do we create an index with the option of being indexed or not? In other words, can I choose for some indexes to be sorted or hashed while others are not? Do we have this choice?
The specification doesn't mandate performance characteristics. It's reasonable to assume an implementation using a B-tree, and therefore operations O(log n), but you'd need to test implementations in browsers. If a particular browser performs poorly on common operations, it may be worth reporting it as an issue.
Q1.1 What is the complexity of IndexedDB (Javascript version) doing
select or insert? Whether the indexes are "indexed"? Is it sorted or hashed?
Indexes are sorted. https://w3c.github.io/IndexedDB/#index-construct
Q1.2 How about the IDBKeyRange.bound? Is it sorting the index first and then doing the select?
The lookup is against the sorted index.
Q1.3 What is the performance of IDBObjectStore.add()?
Assuming a B-tree, O(log n)
Q1.4 For index.openCursor(), is the index sorted in advance?
Yes.
Although technically an implementation could create the index lazily -- the specification does not require particular performance or implementation, just that the results are indistinguishable.
Question 2:... how to create index with options indexed or not indexed? In other words, I choose some indexes to be sorted or hashed, while others not. Do we have this choice?
No - the API doesn't expose such an option.
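For completeness, here is a minimal sketch of creating an index and scanning a key range with a cursor; the database, store, index, and field names are made up for illustration:
const open = indexedDB.open('exampleDb', 1);
open.onupgradeneeded = () => {
  const store = open.result.createObjectStore('people', { keyPath: 'id' });
  store.createIndex('byAge', 'age');           // index entries are kept in sorted key order
};
open.onsuccess = () => {
  const tx = open.result.transaction('people', 'readonly');
  const index = tx.objectStore('people').index('byAge');
  const range = IDBKeyRange.bound(18, 30);     // the lookup runs against the sorted index
  index.openCursor(range).onsuccess = (event) => {
    const cursor = event.target.result;
    if (cursor) {
      console.log(cursor.value);               // records with 18 <= age <= 30, in key order
      cursor.continue();
    }
  };
};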

What's the rationale for using insertion sort over shell sort in Array.sort in V8

V8 uses quicksort for arrays longer than 10 elements, and insertion sort for arrays up to that length. Here is the source:
function InnerArraySort(array, length, comparefn) {
// In-place QuickSort algorithm.
// For short (length <= 10) arrays, insertion sort is used for efficiency.
I'm wondering what's the rationale for not using shell-sort instead of an insertion sort? I understand that it probably doesn't make a difference for an array of 10 elements, but still. Any ideas?
The original rationale is lost to history; the commit that introduced InsertionSort for short arrays (all the way back in 2008) only mentions that it's faster than QuickSort (for such short arrays). So it boils down to: someone implemented it that way, and nobody else saw a reason to change it since.
Since InsertionSort is known to be very efficient for short arrays, I agree that changing it probably doesn't make a difference -- and there are lots of things for the team to work on that actually do make a difference.
Great question. The rationale is simple: it is actually faster to use insertion sort on such small arrays, at least typically. Java in fact made the same switch a long while ago; it now uses insertion sort if the array is less than 7 elements long. See here. It is under the function sort1 at the top.
Basically what happens (in most cases) for such small arrays is that the overhead of quicksort makes it slower than insertion sort. Insertion sort in these cases is much more likely to approach its best-case performance of O(n), while quicksort is still likely to stay at O(n log n).
Shell sort, on the other hand, tends to be much slower than insertion sort here. That said, it can be much faster (relatively). The best case for insertion sort is still O(n), whereas the best case for shell sort is O(n log n). Every length under ten should then have the potential to be faster from a mathematical standpoint. Unfortunately for shell sort, there is a lot more swapping involved. Insertion sort tends to pull off each insertion with O(1) swaps, whereas shell sort is likely to need around O(n) swaps. Swaps are costly on real machines because they tend to use a third temporary register (there are ways of using XOR, but that is still three instructions on the CPU, typically). Therefore, insertion sort still typically wins on an actual machine.
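For reference, a plain insertion-sort sketch (not V8's actual code) showing why it is cheap on short arrays: the inner loop rarely runs more than a few times, and there is no partitioning or recursion overhead:
function insertionSort(a, comparefn) {
  for (let i = 1; i < a.length; i++) {
    const element = a[i];
    let j = i - 1;
    while (j >= 0 && comparefn(a[j], element) > 0) {
      a[j + 1] = a[j];                   // shift larger elements one slot to the right
      j--;
    }
    a[j + 1] = element;                  // drop the element into its place
  }
  return a;
}
insertionSort([5, 2, 9, 1], (x, y) => x - y);   // [1, 2, 5, 9]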

linked list vs arrays for dictionaries

I was recently asked in an interview about the advantages and disadvantages of linked lists and arrays for implementing a dictionary of words, and also what the best data structure for implementing it would be. This is where I messed things up. After googling, I couldn't find an answer specific to dictionaries, only general linked-list-vs-array explanations. What is the best answer to the above question?
If you're just going to use it for lookups, then an array is the obvious best choice of the two. You can build the dictionary from a list of words in O(n log n)--just build an array and sort it. Lookups are O(log n) with a binary search.
Although you can build a linked list of words in O(n), lookups will require, on average, that you look at n/2 words. The difference is pretty large. Given an English dictionary of 128K words, a linked list lookup will take on average 64,000 string comparisons. A binary search will require at most 17.
In addition, a linked list of n words will occupy more memory than an array of n words, because you need the next pointer in the list.
If you need the ability to update the dictionary, you'll probably still want to use an array if updates are infrequent compared to lookups (which is almost certainly the case). I can't think of a real-world example of a dictionary of words that's updated more frequently than it's queried.
As others have pointed out, neither array nor linked list is the best choice for a dictionary of words. But of the two options you're given, array is superior in almost all cases.
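A small sketch of the approach described above, assuming the words are plain strings (the function names are made up for this example):
function buildDictionary(words) {
  return [...words].sort();              // O(n log n) build; the default sort is fine for strings
}
function contains(dict, word) {
  let lo = 0, hi = dict.length - 1;
  while (lo <= hi) {                     // binary search: O(log n) lookups
    const mid = (lo + hi) >> 1;
    if (dict[mid] === word) return true;
    if (dict[mid] < word) lo = mid + 1;
    else hi = mid - 1;
  }
  return false;
}
const dict = buildDictionary(['pear', 'apple', 'fig']);
contains(dict, 'fig');      // true
contains(dict, 'grape');    // false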
There is no one answer.
The two obvious choices would be something based on a hash table if you only want to look up individual items, or something based on a balanced tree if you want to look up ranges of items.
A sorted array can work well if you do a lot of searching and relatively little insertion or deletion. Finding situations where linked lists are preferred is rather more difficult. Depending on the situation (especially such things as finding all the words that start with, say, "ste"), tries can also work extremely well (and often do well at minimizing the storage needed for a given set of data as well).
Those are really broad categories though, not specific implementations. There are also variations such as extensible hashing and distributed hash tables that can be useful in specific situations (and also have somewhat tree-like properties, so things like range-based searching can be reasonably efficient).
The best data structure for implementing dictionaries is a suffix tree. You can also have a look at tries.
Well, if you're building a dictionary, you'd want it to be a sorted structure. So you're going for a sorted-array or a sorted linked-list.
For a linked list retrieval is O(n) since you have to examine all words until you find the one you need. For a sorted array, you can use binary search to find the right location, which is O(log n).
For a sorted array, insertion is O(log n) to find the right location (binary search) and then O(n) to insert because you need to push everything down. For a linked list, it would be O(n) to find the location and then O(1) to insert because you only have to adjust pointers. The same applies for deletion.
Since you aren't going to be updating a dictionary much, you can just build the array and then sort it in O(n log n) time (using quicksort, for example). After that, lookup is O(log n) using binary search. Furthermore, as delnan mentioned below, using an array has the advantage that everything you access is sequential in memory; i.e., the data are localized (locality of reference). This minimizes cache misses (which are expensive). With a linked list, the data are spread out all over and there is no guarantee that they are close together, which increases the chance of cache misses. With this in mind, given the two options, use the array.
You can do an even better job if you implement a sorted map using a red-black tree (the tree entries, which are the keys, can be coupled with a hashmap); there, search, insert, and delete are all O(log n). But it really depends on your usage profile; if you're only doing lookups, a simple hashmap would be best (O(1) retrieval).
Another interesting data-structure you can use is a Trie, where insertion and lookup are O(m); m being the length of the string.
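A minimal trie sketch to make the O(m) cost concrete, m being the length of the word (the class and method names are made up for this example):
class Trie {
  constructor() { this.children = new Map(); this.isWord = false; }
  insert(word) {                                 // O(m): one node per character
    let node = this;
    for (const ch of word) {
      if (!node.children.has(ch)) node.children.set(ch, new Trie());
      node = node.children.get(ch);
    }
    node.isWord = true;
  }
  contains(word) {                               // O(m): walk one node per character
    let node = this;
    for (const ch of word) {
      node = node.children.get(ch);
      if (!node) return false;
    }
    return node.isWord;
  }
}
const trie = new Trie();
trie.insert('apple');
trie.contains('apple');   // true
trie.contains('app');     // false (only a prefix)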

Big O of JavaScript arrays

Arrays in JavaScript are very easy to modify by adding and removing items. It somewhat masks the fact that arrays in most languages are fixed-size and require complex operations to resize. It seems that JavaScript makes it easy to write poorly performing array code. This leads to the question:
What performance (in terms of big O time complexity) can I expect from JavaScript implementations in regards to array performance?
I assume that all reasonable JavaScript implementations have at most the following big O's.
Access - O(1)
Appending - O(n)
Prepending - O(n)
Insertion - O(n)
Deletion - O(n)
Swapping - O(1)
JavaScript lets you pre-fill an array to a certain size, using new Array(length) syntax. (Bonus question: Is creating an array in this manner O(1) or O(n)) This is more like a conventional array, and if used as a pre-sized array, can allow O(1) appending. If circular buffer logic is added, you can achieve O(1) prepending. If a dynamically expanding array is used, O(log n) will be the average case for both of those.
Can I expect better performance for some things than my assumptions here? I don't expect anything is outlined in any specifications, but in practice, it could be that all major implementations use optimized arrays behind the scenes. Are there dynamically expanding arrays or some other performance-boosting algorithms at work?
P.S.
The reason I'm wondering this is that I'm researching some sorting algorithms, most of which seem to assume appending and deleting are O(1) operations when describing their overall big O.
NOTE: While this answer was correct in 2012, engines use very different internal representations for both objects and arrays today. This answer may or may not be true.
In contrast to most languages, which implement arrays with, well, arrays, in Javascript Arrays are objects, and values are stored in a hashtable, just like regular object values. As such:
Access - O(1)
Appending - Amortized O(1) (sometimes resizing the hashtable is required; usually only insertion is required)
Prepending - O(n) via unshift, since it requires reassigning all the indexes
Insertion - Amortized O(1) if the value does not exist. O(n) if you want to shift existing values (Eg, using splice).
Deletion - Amortized O(1) to remove a value, O(n) if you want to reassign indices via splice.
Swapping - O(1)
In general, setting or unsetting any key in a dict is amortized O(1), and the same goes for arrays, regardless of what the index is. Any operation that requires renumbering existing values is O(n) simply because you have to update all the affected values.
guarantee
There is no specified time complexity guarantee for any array operation. How arrays perform depends on the underlying data structure the engine chooses. Engines might also have different representations, and switch between them depending on certain heuristics. The initial array size might or might not be such a heuristic.
reality
For example, V8 uses (as of today) both hashtables and array lists to represent arrays. It also has various different representations for objects, so arrays and objects cannot be compared. Therefore array access is always better than O(n), and might even be as fast as a C++ array access. Appending is O(1), unless you reach the size of the datastructure and it has to be scaled (which is O(n)). Prepending is worse. Deletion can be even worse if you do something like delete array[index] (don't!), as that might force the engine to change its representation.
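To make that concrete, here is an illustrative sketch; the actual costs depend on the representation the engine has chosen for the array:
const a = [1, 2, 3];
a.push(4);          // append at the end: amortized O(1)
a.unshift(0);       // prepend: O(n), every element has to move
a.splice(2, 1);     // remove from the middle: O(n), later elements shift down
delete a[1];        // leaves a hole: may force a slower sparse representation -- avoid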
advice
Use arrays for numeric datastructures. That's what they are meant for. That's what engines will optimize them for. Avoid sparse arrays (or if you have to, expect worse performance). Avoid arrays with mixed datatypes (as that makes internal representations more complex).
If you really want to optimize for a certain engine (and version), check its sourcecode for the absolute answer.
