Arrays in JavaScript are very easy to modify by adding and removing items. It somewhat masks the fact that most languages arrays are fixed-size, and require complex operations to resize. It seems that JavaScript makes it easy to write poorly performing array code. This leads to the question:
What performance (in terms of big O time complexity) can I expect from JavaScript implementations in regards to array performance?
I assume that all reasonable JavaScript implementations have at most the following big O's.
Access - O(1)
Appending - O(n)
Prepending - O(n)
Insertion - O(n)
Deletion - O(n)
Swapping - O(1)
JavaScript lets you pre-fill an array to a certain size, using new Array(length) syntax. (Bonus question: Is creating an array in this manner O(1) or O(n)) This is more like a conventional array, and if used as a pre-sized array, can allow O(1) appending. If circular buffer logic is added, you can achieve O(1) prepending. If a dynamically expanding array is used, O(log n) will be the average case for both of those.
Can I expect better performance for some things than my assumptions here? I don't expect anything is outlined in any specifications, but in practice, it could be that all major implementations use optimized arrays behind the scenes. Are there dynamically expanding arrays or some other performance-boosting algorithms at work?
P.S.
The reason I'm wondering this is that I'm researching some sorting algorithms, most of which seem to assume appending and deleting are O(1) operations when describing their overall big O.
NOTE: While this answer was correct in 2012, engines use very different internal representations for both objects and arrays today. This answer may or may not be true.
In contrast to most languages, which implement arrays with, well, arrays, in Javascript Arrays are objects, and values are stored in a hashtable, just like regular object values. As such:
Access - O(1)
Appending - Amortized O(1) (sometimes resizing the hashtable is required; usually only insertion is required)
Prepending - O(n) via unshift, since it requires reassigning all the indexes
Insertion - Amortized O(1) if the value does not exist. O(n) if you want to shift existing values (Eg, using splice).
Deletion - Amortized O(1) to remove a value, O(n) if you want to reassign indices via splice.
Swapping - O(1)
In general, setting or unsetting any key in a dict is amortized O(1), and the same goes for arrays, regardless of what the index is. Any operation that requires renumbering existing values is O(n) simply because you have to update all the affected values.
guarantee
There is no specified time complexity guarantee for any array operation. How arrays perform depends on the underlying datastructure the engine chooses. Engines might also have different representations, and switch between them depending on certain heuristics. The initial array size might or might not be such an heuristic.
reality
For example, V8 uses (as of today) both hashtables and array lists to represent arrays. It also has various different representations for objects, so arrays and objects cannot be compared. Therefore array access is always better than O(n), and might even be as fast as a C++ array access. Appending is O(1), unless you reach the size of the datastructure and it has to be scaled (which is O(n)). Prepending is worse. Deletion can be even worse if you do something like delete array[index] (don't!), as that might force the engine to change its representation.
advice
Use arrays for numeric datastructures. That's what they are meant for. That's what engines will optimize them for. Avoid sparse arrays (or if you have to, expect worse performance). Avoid arrays with mixed datatypes (as that makes internal representations more complex).
If you really want to optimize for a certain engine (and version), check its sourcecode for the absolute answer.
Related
There is claim in this article that an array of ints in JavaScript is implemented by a C++ array of ints.
However; According to MDN unless you specifically use BigInts, in JavaScript all numbers are repressed as doubles.
If I do:
cont arr = [0, 1, 2, 3];
What is the actual representation in the V8 engine?
The code for V8 is here on github, but I don't know where to look:
(V8 developer here.)
"C++ array of ints" is a bit of a simplification, but the key idea described in that article is correct, and an array [0, 1, 2, 3] will be stored as an array of "Smis".
What's a "Smi"? While every Number in JavaScript must behave like an IEEE754 double, V8 internally represents numbers as "small integer" (31 bits signed integer value + 1 bit tag) when it can, i.e. when the number has an integral value in the range -2**30 to 2**30-1, to improve efficiency. Engines can generally do whatever they want under the hood, as long as things behave as if the implementation followed the spec to the letter. So when the spec (or MDN documentation) says "all Numbers are doubles", what it really means from the engine's (or an engine developer's) point of view is "all Numbers must behave as if they were doubles".
When an array contains only Smis, then the array itself keeps track of that fact, so that values loaded from such arrays know their type without having to check. This matters e.g. for a[i] + 1, where the implementation of + doesn't have to check whether a[i] is a Smi when it's already known that a is a Smi array.
When the first number that doesn't fit the Smi range is stored in the array, it'll be transitioned to an array of doubles (strictly speaking still not a "C++ array", rather a custom array on the garbage-collected heap, but it's similar to a C++ array, so that's a good way to explain it).
When the first non-Number is stored in an array, what happens depends on what state the array was in before: if it was a "Smi array", then it only needs to forget the fact that it contains only Smis. No rewriting is needed, as Smis are valid object pointers thanks to their tag bit. If the array was a "double array" before, then it does have to be rewritten, so that each element is a valid object pointer. All the doubles will be "boxed" as so-called "heap numbers" (objects on the managed heap that only wrap a double value) at this point.
In summary, I'd like to point out that in the vast majority of cases, there's no need to worry about any of these internal implementation tricks, or even be aware of them. I certainly understand your curiosity though! Also, array representations are one of the more common reasons why microbenchmarks that don't account for implementation details can easily be misleading by suggesting results that won't carry over to a larger app.
Addressing comments:
V8 does sometimes even use int16 or lower.
Nope, it does not. It may or may not start doing so in the future; though if anything does change, I'd guess that untagged int32 is more likely to be introduced than int16; also if anything does change about the implementation then of course the observable behavior would not change.
If you believe that your application would benefit from int16 storage, you can use an Int16Array to enforce that, but be sure to measure whether that actually benefits you, because quite likely it won't, and may even decrease performance depending on what your app does with its arrays.
It may start to be a double when you make it a decimal
Slightly more accurately: there are several reasons why an array of Smis needs to be converted to an array of doubles, such as:
storing a fractional value in it, e.g. 0.5
storing a large value in it, e.g. 2**34
storing NaN or Infinity or -0 in it
V8 uses quick-sort for arrays of the length over 10 elements, and insertion sort for arrays less than that. Here is the sources:
function InnerArraySort(array, length, comparefn) {
// In-place QuickSort algorithm.
// For short (length <= 10) arrays, insertion sort is used for efficiency.
I'm wondering what's the rationale for not using shell-sort instead of an insertion sort? I understand that it probably doesn't make a difference for an array of 10 elements, but still. Any ideas?
The original rationale is lost to history; the commit that introduced InsertionSort for short arrays (all the way back in 2008) only mentions that it's faster than QuickSort (for such short arrays). So it boils down to: someone implemented it that way, and nobody else saw a reason to change it since.
Since InsertionSort is known to be very efficient for short arrays, I agree that changing it probably doesn't make a difference -- and there are lots of things for the team to work on that actually do make a difference.
Great question. The rationale is simple, it is actually faster to use insertion sort on those small arrays, at least typically. Java in fact made the same switch a long while ago. Now they do insertion sort if the array is less than 7 long in their code. See here. It is under the function sort1 at the top.
Basically what happens (in most cases) for such small arrays is that the overhead for Quicksort makes it slower than insertion sort. Insertion sort in these cases is much more likely to approach it's best performance at O(n) while Quicksort is still likely to stay at O(n log n).
Shell sort on the other hand tends to be much slower than insertion sort. That being said, it can be much faster (relativly). The best case for insertion sort is still 0(n), whereas the best case for shell sort is O(n log n). All number under ten then should have the potential for being faster from a mathematical standpoint. Unfortunately for shell sort, there is a lot more swapping involved. Shell sort then can become much slower. Insertion sort tends to be able to pull off swapping with O(1) swaps, whereas Shell sort is likely to be around O(n) swaps. Swaps are costly in machines because they tend to end up using a third temp register for swapping (there are ways of using XOR, but that is still three commands on the CPU, typically). Therefore, insertion sort still wins on an actual machine, typically.
I'm using a Dictionary (associative array, hash table, any of these synonyms).
The keys used to uniquely identify values are fairly long strings. However, I know that these strings tend to differ at the tail, rather than the head.
The fastest way to find a value in a JS object is to test the existence of
object[key], but is that also the case for extremely long, largely similar, keys (+100 chars), in a fairly large Dictionary (+1000 entries)?
Are there alternatives for this case, or is this a completely moot question, because accessing values by key is already insanely fast?
Long story short; It doesn't matter much. JS will internally use a hash table (as you already said yourself), so it will need to calculate a hash of your keys for insertion and (in some cases) for accessing elements.
Calculating a hash (for most reasonable hash functions) will take slightly longer for long keys than for short keys (I would guess about linearly longer), but it doesn't matter whether the changes are at the tail or at the head.
You could decide to roll your own hashes instead, cache these somehow, and use these as keys, but this would leave it up to you to deal with hash collisions. It will be very hard to do better than the default implementation, and is almost certainly not worth the trouble.
Moreover, for an associative array with only 1000 elements, probably none of this matters. Modern CPUs can process close to / around billions of instructions per second. Even just a lineair search through the whole array will likely perform just fine, unless you have to do it very very often.
Hash tables (dictionary, map, etc.) first check for hash code, and only then, if necessary (in case of collision - at least two keys have the same hash code) perform equals. If you experience performance problems, the first thing you have to check, IMHO, is hash codes collision. It may appear (bad implementation or weird keys) that the hash code is computed on, say, 3 first chars (it's a wild exaggeration, of course):
"abc123".hashCode() ==
"abc456".hashCode() ==
...
"abc789".hashCode()
and so you have a lot of collisions, have to perform equals, and finally slow O(N) complexity routine. In that case, you have to think over a better hash.
Using underscorejs library, I tried to abuse the indexing of a JavaScript object, in order to sort an array a of integers or strings:
_(a).chain().indexBy(_.identity).values().value()
I realize it is kind of a "hack", but it actually yielded a sorted array in O(n) time...
Am I dreaming?
You aren't actually sorting anything.
Instead, you're building a hashtable and traversing it in hash order, which may be the same as sorted order for some sets.
It is possible to sort by O(n) using Bucket Sort http://en.wikipedia.org/wiki/Bucket_sort which is I believe what you attempted to write here, but as mentioned above you can't rely on the order of values of an object.
It is possible to sort this way in O(n) if you have limited number of values.
Your algorithm is not a comparison sort:
A comparison sort is a type of sorting algorithm that only reads the
list elements through a single abstract comparison operation (often a
"less than or equal to" operator or a three-way comparison) that
determines which of two elements should occur first in the final
sorted list.
You are using knowledge about the structure of the values (i.e. knowing that they're integers or strings) in your algorithm, by using those integers/strings as indexes. You are not adhering to the limitations imposed on a comparison sort, and thus you are not restricted to the O(n log n) boundary on time complexity.
Yes, you are dreaming :-)
It beggars belief that you would have found such a holy grail by accident. If that sequence of operations is a comparison-based sort, people who know this stuff have actually proven that it cannot be done in O(n) time.
I strongly suggest you run that code with dataset sizes of 10, 100, 1000, and so on and you'll see your assumption is incorrect.
Then check to see if you are actually sorting the array or whether this is just an artifact of its organisation. It seems very likely that the indexBy is simply creating an index structure where the order just happens to be the sort order you want, not something that would be guaranteed for all inputs.
Take the following code example:
var myObject = {};
var i = 100;
while (i--) {
myObject["foo"+i] = new Foo(i);
}
console.log(myObject["foo42"].bar());
I have a few questions.
What kind of data structure do the major engines (IE, Mozilla, Chrome, Safari) use for storing key-value pairs? I'd hope it's some kind Binary Search tree, but I think they may use linked lists (due to the fact iterating is done in insertion order).
If they do use a search tree, is it self balancing? Because the above code with a conventional search tree will create an unbalanced tree, causing worst case scenario of O(n) for searching, rather than O(log n) for a balanced tree.
I'm only asking this because I will be writing a library which will require efficient retrieval of keys from a data structure, and while I could implement my own or an existing red-black tree I would rather use native object properties if they're efficient enough.
The question is hard to answer for a couple reasons. First, the modern browsers all heavily and dynamically optimize code while it is executing so the algorithms chosen to access the properties might be different for the same code. Second, each engine uses different algorithms and heuristics to determine which access algorithm to use. Third, the ECMA specification dictates what the result of must be, not how the result is achieved so the engines have a lot of freedom to innovate in this area.
That said, given your example all the engines I am familiar with will use some form of a hash table to retrieve the value associated with foo42 from myobject. If you use an object like an associative array JavaScript engines will tend to favor a hash table. None that I am aware of use a tree for string properties. Hash tables are worst case O(N), best case O(1) and tend to be closer to O(1) than O(N) if the key generator is any good. Each engine will have a pattern you could use to get it to perform O(N) but that will be different for each engine. A balanced tree would guarantee worst case O(log N) but modifying a balanced tree while keeping it balanced is not O(log N) and hash tables are more often better than O(log N) for string keys and are O(1) to update (once you determine you need to, which is the same big-O as read) if there is space in the table (periodically O(N) to rebuild the table but the tables usually double in space which means you will only pay O(N) 7 or 8 times for the life of the table).
Numeric properties are special, however. If you access an object using integer numeric properties that have few or no gaps in range, that is, use the object like it is an array, the values will tend to be stored in a linear block of memory with O(1) access. Even if your access has gaps the engines will probably shift to a sparse array access which will probably be, at worst, O(log N).
Accessing a property by identifier is also special. If you access the property like,
myObject.foo42
and execute this code often (that is, the speed of this matters) and with the same or similar object this is likely to be optimized into one or two machine instructions. What makes objects similar also differs for each engine but if they are constructed by the same literal or function they are more likely to be treated as similar.
No engine that does at all well on the JavaScript benchmarks will use the same algorithm for every object. They all must dynamically determine how the object is being used and try to adjust the access algorithm accordingly.