How does V8 optimise the creation of very large arrays? - javascript

Recently, I had to work on optimising a task that involved the creation of really large arrays (~ 10⁸ elements).
I tested a few different methods, and, according to jsperf, the following option seemed to be the fastest.
var max = 10000000;
var arr = new Array(max);
for (let i = 0; i < max; i++) {
arr[i] = true;
}
Which was ~ 85% faster than
var max = 10000000;
var arr = [];
for (let i = 0; i < max; i++) {
arr.push(true);
}
And indeed, the first snippet was much faster in my actual app as well.
However, my understanding was that the V8 engine was able to perform optimised operations on arrays with the PACKED_SMI_ELEMENTS elements kind, as opposed to arrays with HOLEY_ELEMENTS.
So my question is the following:
if it's true that new Array(n) creates an array that's internally marked with HOLEY_ELEMENTS, (which I believe is true) and
if it's true that [] creates an array that's internally marked with PACKED_SMI_ELEMENTS (which I'm not too sure is true)
why is the first snippet faster than the second one?
Related questions I've been through:
Create a JavaScript array containing 1...N
Most efficient way to create a zero filled JavaScript array?

V8 developer here. The first snippet is faster because new Array(max) informs V8 how big you want the array to be, so it can allocate an array of the right size immediately; whereas in the second snippet with []/.push(), the array starts at zero capacity and has to be grown several times, which includes copying its existing elements to a new backing store.
https://www.youtube.com/watch?v=m9cTaYI95Zc is a good presentation but probably should have made it clearer how small the performance difference between packed and holey elements is, and how little you should worry about it.
In short: whenever you know how big you need an array to be, it makes sense to use new Array(n) to preallocate it to that size. When you don't know in advance how large it's going to be in the end, then start with an empty array (using [] or new Array() or new Array(0), doesn't matter) and grow it as needed (using a.push(...) or a[a.length] = ..., doesn't matter).
Side note: your "for loop with new Array() and push" benchmark creates an array that's twice as big as you want.
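If you want to check the elements kind yourself, one rough sketch (relying on V8's unsupported natives syntax, so the output and the exact kind names may differ between versions) is to run something like this with node --allow-natives-syntax or in d8:
// Sketch only: %DebugPrint is a V8-internal helper enabled by --allow-natives-syntax;
// its output includes the array's elements kind.
var preallocated = new Array(3);   // created holey
preallocated[0] = 0;
preallocated[1] = 1;
preallocated[2] = 2;
%DebugPrint(preallocated);         // expect something like HOLEY_SMI_ELEMENTS, even though it's now full

var pushed = [];                   // created packed
pushed.push(0, 1, 2);
%DebugPrint(pushed);               // expect something like PACKED_SMI_ELEMENTS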

Related

Why do the following two pieces of code run so differently?

Look at these two pieces of code; the second only adds the third line, yet it takes about 84 times as long. Can anybody explain why?
let LIMIT = 9999999;
let arr = new Array(LIMIT);
// arr.push(1);
console.time('Array insertion time');
for (let i = 1; i < LIMIT; i++) {
arr[i] = i;
}
console.timeEnd('Array insertion time');
let LIMIT = 9999999;
let arr = new Array(LIMIT);
arr.push(1);
console.time('Array insertion time');
for (let i = 1; i < LIMIT; i++) {
arr[i] = i;
}
console.timeEnd('Array insertion time');
The arr.push(1) operation creates a "sparse" array: it has a single element present at index 9999999. V8 switches the internal representation of such a sparse array to "dictionary mode", i.e. the array's backing store is an index→element dictionary, because that's significantly more memory efficient than allocating space for 10 million elements when only one of them is used.
The flip side is that accessing (reading or writing) elements of a dictionary-mode array is slower than for arrays in "fast/dense mode": every access has to compute the right dictionary index, and (in the scenario at hand) the dictionary has to be grown several times, which means copying all existing elements to a new backing store.
As the array is filled up, V8 notices that it's getting denser, and at some point transitions it back to "fast/dense mode". By then, most of the slowdown has already been observed. The remainder of the loop has some increased cost as well though, because by this time, the arr[i] = i; store has seen two types of arrays (dictionary mode and dense mode), so on every iteration it must detect which state the array is in now and handle it accordingly, which (unsurprisingly) costs more time than not having to make that decision.
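A tiny illustration of why that single push makes the array sparse in the first place:
var a = new Array(5);   // length 5, but no elements present yet
a.push(1);              // push appends at index a.length, i.e. at index 5
console.log(a.length);  // 6
console.log(4 in a);    // false - indices 0..4 are holes
console.log(5 in a);    // true  - the one element sits at the very end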
Generalized conclusion: with JavaScript being as dynamic and flexible as it is, engines can behave quite differently for very similar-looking pieces of code; for example because the engine optimizes one case for memory consumption and the other for execution speed, or because one of the cases lets it use some shortcut that's not applicable for the other (for whatever reason). The good news is that in many cases, correct and understandable/intuitive/simple code also tends to run quite well (in this example, the stray arr.push looks a lot like a bug).

Is there a way to return the rest of a JavaScript array

Is there a way to return the rest of an array in JavaScript i.e the portion of the array that consists of all elements but the first element of the array?
Note: I do not ask for returning a new array e.g. with arr.slice(1) etc. and I do not want to chop off the first element of the array e.g. with arr.shift().
For example, given the array [3, 5, 8], the rest of the array is [5, 8], and if the rest is changed, e.g. by an assignment (a destructive operation), the original array changes as well. I just figured out that this serves as a test proving the rest really is the rest of the array, and not a new array that merely consists of the remaining elements.
Note: The following code example is meant to describe what I want, but not specifically what I want to do (i.e. not the operations I want to perform). What I actually want to do is shown in the every algorithm at the bottom.
var arr = [3, 5, 8];
var rest = rest(arr); // rest is [5, 8]
rest.push(13); // rest is [5, 8, 13] and hence the arr is [3, 5, 8, 13]
An example where I possibly need this, and would want to have it, is the following algorithm (and many others I am writing in that GitHub organization), in which I currently always use arr.slice(1):
function every(lst, f) {
if (lst.length === 0) {
return false;
} else {
if (f(lst[0]) === true) {
return every(lst.slice(1), f);
} else {
return false;
}
}
}
I think having what I ask for instead of arr.slice(1) would keep the memory usage of such algorithms low and retain the recursive-functional style I want to employ.
No, this is generally not possible. There are no "views on" or "pointers to" normal arrays1.
You might use a Proxy to fake it, but I doubt this is a good idea.
1: It's trivial to do this on typed arrays (which are views on a backing buffer), but notice that you cannot push to them.
I possibly need this and I would want to have it for recursive-functional style algorithms where I currently use arr.slice(1) but would prefer to keep memory usage low
Actually, all of these implementations do have low memory usage - they don't allocate more memory than the input. Repeatedly calling slice(1) does lead to high pressure on the garbage collector, though.
If you were looking for better efficiency, I would recommend to
avoid recursion. JS engines still haven't implemented tail-call optimisation, so recursion isn't cheap.
not to pass around (new copies of) arrays. Simply pass around an index at which to start, e.g. by using an inner recursive function that closes over the array parameter and accesses array[i] instead of array[0]. See @Pointy's updated answer for an example.
If you were looking for a more functional style, I would recommend to use folds. (Also known as reduce in JavaScript, although you might need to roll your own if you want laziness). Implement your algorithms in terms of fold, then it's easy to swap out the fold implementation for a more efficient (e.g. iterative) one.
Last but not least, for higher efficiency while keeping a recursive style you can use iterators. Their interface might not look especially functional, but if you insist you could easily create an immutable wrapper that lazily produces a linked list.
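As a rough sketch of the fold suggestion (not the question's code; note that reduce visits every element, so unlike the recursive version it doesn't stop early):
// Illustrative only: every expressed in terms of a fold (reduce).
function every(lst, f) {
  return lst.reduce(function (acc, x) {
    return acc && f(x);   // once acc is false, && skips calling f, but reduce still iterates
  }, true);
}
console.log(every([2, 4, 6], function (x) { return x % 2 === 0; })); // true
console.log(every([2, 3, 6], function (x) { return x % 2 === 0; })); // false
// Note: an empty list yields the seed value true here, unlike the code in the question.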
Please test this function; it returns a copy of the rest of the array whose push is overridden so that pushed values are mirrored back to the original array:
function rest(arr) {
  var a = arr.slice(1);
  // Override push on the copy so that pushed values also land in the original array.
  a.push = function() {
    for (var i = 0, l = arguments.length; i < l; i++) {
      this[this.length] = arguments[i];
      // this.length has just grown by one, which is exactly the next
      // free index in the original (one element longer) array.
      arr[this.length] = arguments[i];
    }
    return this.length;
  };
  return a;
}
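For example, using the test from the question:
var arr = [3, 5, 8];
var r = rest(arr);   // r is [5, 8]
r.push(13);          // r is now [5, 8, 13] ...
console.log(arr);    // ... and arr is [3, 5, 8, 13]
Note that only push is mirrored; other mutations of r (or later changes to arr) are not kept in sync.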
Based on the code posted in the update to the question, it's clear why you might want to be able to "alias" a portion of an array. Here is an alternative that is more typical of how I would solve the (correctly) perceived efficiency problem with your implementation:
function every(lst, f) {
function r(index) {
if (index >= lst.length)
return true; // different from OP, but I think correct
return f(lst[index]) && r(index+1);
}
return r(0);
}
That is still a recursive solution to the problem, but no array copy is made; the array is not changed at all. The general pattern is common even in more characteristically functional programming languages (Erlang comes to mind personally): the "public" API for some recursive code is augmented by an "internal" or "private" API that provides some extra tools for keeping track of the progress of the recursion.
original answer
You're looking for Array.prototype.shift.
var arr = [1, 2, 3];
var first = arr.shift();
console.log(first); // 1
console.log(arr); // [2, 3]
This is a linear time operation: the execution cost is relative to the length of the original array. For most small arrays that does not really matter much, but if you're doing lots of such work on large arrays you may want to explore a better data structure.
Note that with ordinary arrays it is not possible to create a new "shadow" array that overlaps another array. You can do something like that with typed arrays, but for general purpose use in most code typed arrays are somewhat awkward.
The first limitation of typed arrays is that they are, of course, typed, which means that the array "view" onto the backing storage buffer gives you values of only one consistent type. The second limitation is that the only available types are numeric types: integers and floating-point numbers of various "physical" (storage) sizes. The third limitation is that the size of a typed array is fixed; you can't extend the array without creating a new backing buffer and copying.
Such limitations would be quite familiar to a FORTRAN programmer of course.
So to create an array for holding 5 32-bit integers, you'd write
var ints = new Int32Array(5);
You can put values into the array just like you put values into an ordinary array, so long as you get the type right (well close enough):
for (let i = 0; i < 5; i++)
ints[i] = i;
console.log(ints); // [0, 1, 2, 3, 4]
Now: to do what the OP asked, you'd grab the buffer from the array we just created, and then make a new typed array on top of the same buffer at an offset from the start. The offsets when doing this are always in bytes, regardless of the type used to create the original array. That's super useful for things like looking at the individual parts of a floating point value, and other "bit-banging" sorts of jobs, though of course that doesn't come up much in normal JavaScript coding. Anyway, to get something like the rest array from the original question:
var rest = new Int32Array(ints.buffer, 4);
In that statement, the "4" means that the new array will be a view into the buffer starting 4 bytes from the beginning; 32-bit integers being 4 bytes long, that means that the new view will skip the first element of the original array.
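And to confirm that rest really is a view onto the same storage rather than a copy:
rest[0] = 99;             // writes into the shared buffer
console.log(ints[1]);     // 99 - the original array sees the change
console.log(rest.length); // 4 - one element shorter than ints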
Since JavaScript can't do this, the only real solution to your problem is WebAssembly. Otherwise use Proxy.
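If you do want to experiment with the Proxy idea, here is a minimal sketch of my own (not battle-tested; the length and index bookkeeping is easy to get subtly wrong, and every access goes through a trap, so don't expect it to be fast):
// restView(arr) returns a proxy that behaves like arr without its first element,
// forwarding reads and writes (including push) to the original array.
function restView(arr) {
  var isIndex = function (p) { return typeof p === 'string' && /^\d+$/.test(p); };
  return new Proxy(arr, {
    get: function (target, prop, receiver) {
      if (prop === 'length') return Math.max(target.length - 1, 0);
      if (isIndex(prop)) return target[Number(prop) + 1];
      return Reflect.get(target, prop, receiver);
    },
    set: function (target, prop, value) {
      if (prop === 'length') { target.length = value + 1; return true; }
      if (isIndex(prop)) { target[Number(prop) + 1] = value; return true; }
      return Reflect.set(target, prop, value);
    }
  });
}

var arr = [3, 5, 8];
var rest = restView(arr);  // behaves like [5, 8]
rest.push(13);
console.log(arr);          // [3, 5, 8, 13]
console.log(rest[0]);      // 5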

What's the shortest way to repopulate an array by reference?

I have an array which is bound by reference as a model (to handsontable). Let's call it data. At some point I need to recalculate it from scratch (let's call the recalculated new array freshData; it may have a different length). Assigning data = freshData doesn't do the job since this only changes what data references and doesn't alter the bound model. But calling .splice and .push on data does the job:
data.splice(0,data.length);
for(var i = 0; i < freshData.length; i++)
data.push(freshData[i]);
I wonder: can this be done in a shorter manner? Like, without a loop or may be even using a single method? data.concat(freshData) doesn't help since it creates a new array, it doesn't change data itself. Also, this iterating looks somewhat suboptimal in terms of performance...
If you have ES2015 support OR babel:
data.push(...freshData)
Otherwise just go with
data.push.apply(data, freshData);
You can push the whole array at once without the loop:
data.push(...freshData)
https://runkit.com/arthur/59406f521229b300129a7960
You can use splice as a oneliner
data.splice(0, data.length, ...freshData);
Alternatively, use data.length = 0 to empty the array and then put in the new data, either using a loop or passing multiple parameters to push. Notice that using spread syntax or apply might overflow the callstack with too large arrays.
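For very large arrays, one way around that callstack concern (a sketch of the idea above, with an arbitrarily chosen chunk size) is to clear the array in place and push in chunks:
data.length = 0;                        // empty the array in place, keeping the reference
var CHUNK = 10000;                      // assumed chunk size; tune as needed
for (var i = 0; i < freshData.length; i += CHUNK) {
  // each apply call only ever passes CHUNK arguments, so the argument list stays small
  data.push.apply(data, freshData.slice(i, i + CHUNK));
}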
Or do it the hard way with assignments:
for (var i=0; i<freshData.length; i++)
data[i] = freshData[i];
data.length = i;
this iterating looks somewhat suboptimal in terms of performance...
No, there's always a need to iterate in some way or another. The time complexity will be linear to the size of the old and new data. But performance shouldn't be your first concern, focus on readability and correctness. If it's actually a tight spot, do your own comparison benchmark to be sure. I'd suspect that my assignment solution would be the fastest, as it doesn't do a method call and tends to avoid array resizing where possible.

Do arrays with gaps in their indices entail any benefits that compensate their disadvantages

In Javascript arrays may have gaps in their indices, which should not be confused with elements that are simply undefined:
var a = new Array(1), i;
a.push(1, undefined);
for (i = 0; i < a.length; i++) {
if (i in a) {
console.log("set with " + a[i]);
} else {
console.log("not set");
}
}
// logs:
// not set
// set with 1
// set with undefined
Since these gaps corrupt the length property, I'm not sure whether they should be avoided whenever possible. If so, I would treat them as an edge case and not handle them by default:
// default:
function head(xs) {
return xs[0];
}
// only when necessary:
function gapSafeHead(xs) {
var i;
for (i = 0; i < xs.length; i++) {
if (i in xs) {
return xs[i];
}
}
}
Besides the fact that head is very concise, another advantage is that it can be used on all array-like data types. head is just a single simple example. If such gaps had to be considered throughout the code, the overhead would be significant.
This is likely to come up in any language that overloads hash tables to provide something that colloquially is called an "array". PHP, Lua and JavaScript are three such languages. If you depend on strict sequential numeric array behavior, then it will be an inconvenience for you. More generally, the behavior provides conveniences as well.
Here's a classic algorithm question: to delete a member from the middle of a data structure, which data structure is "better": A linked list or an array?
You're supposed to say "linked list", because deleting a node from a linked list doesn't require you to shift the rest of the array down one index. But linked lists have other pitfalls, so is there another data structure we can use? You can use a sparse array*.
In many languages that provide this hashy type of arrays, removing any arbitrary member of the array will change the length. Unfortunately, JavaScript does not change the length, so you lose out a little there. But nevertheless, the array is "shorter", at least from the Object.keys perspective.
*Many sparse arrays are implemented using linked lists, so don't apply this too generally. In these languages, though, they're hash tables with predictable ordered numeric keys.
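To illustrate the length point for JavaScript specifically:
var a = ['a', 'b', 'c'];
delete a[1];                  // leaves a hole; nothing is shifted
console.log(a.length);        // 3 - length is unchanged
console.log(1 in a);          // false
console.log(Object.keys(a));  // ['0', '2'] - "shorter" from the Object.keys perspective

var b = ['a', 'b', 'c'];
b.splice(1, 1);               // actually removes the slot and shifts the rest down
console.log(b.length);        // 2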
Of course, the question is a subjective one, but I argue that the gaps should certainly be avoided, if possible. Arrays are special Javascript objects with very specific purposes. You can totally hack on arrays, manipulate the length property, add properties with keys other than numbers (e.g. myArray["foo"] = "bar"), but these mostly devolve into antipatterns. If you need some special form of pseudo-array, you can always just code it yourself with a regular object. After all, typeof [] === "object"
It's not like gaps inherently break your code, but I would avoid pursuing them intentionally.
Does that answer your question?

Store a data table as array of row objects, or as an object of column arrays?

Main question: whether to store a data table as array of row objects, or as an object of column arrays.
Proximate question: How to measure the memory footprint of an object.
Practical question: How do I read the memory profiler in Chrome?
Background
Working with rectangular data tables in Javascript, in the browser and/or Node.js. Many leading libraries like D3 and Crossfilter store data as arrays of objects, e.g.
var rows =
[{name: 'apple', price: 1.79, ...},
{name: 'berry', price: 3.49, ...},
{name: 'cherry', price: 4.29, ...}, ...
]
However, it seems that with many columns (my use case) and potentially many rows, the overhead of storing keys can become very heavy, and it would be more efficient to store the data (and iterate over it) with each column as an array, as in:
var cols = {
name: ['apple', 'berry', 'cherry', ...],
price: [1.79, 3.49, 4.29, ...],
...
}
Profiling question
One answer to this post describes using the Chrome memory profile: JavaScript object size
I set up the simplistic benchmark below. The code can be copied/pasted into the Chrome console and executed. I then looked at the Chrome profiler, but I'm not sure how to read it.
At first glance, the retained size seems clearly in favor of columns:
window.rowData: 294,170,760 bytes
window.colData: 44,575,896 bytes
But if I click on each, they give me the same (huge) retained size:
window.rowData: 338,926,668 bytes
window.colData: 338,926,668 bytes
Benchmark code
The following code can be copy/pasted to Chrome console:
function makeid(len) {
var text = "";
var possible = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
for (var i = 0; i < len; i++)
text += possible.charAt(Math.floor(Math.random() * possible.length));
return text;
}
/* create set of 400 string keys with 8 random characters.*/
var keys = window.keys = [], i, c;
for ( i=0; i < 400; i++) {
keys.push(makeid(8));
}
/* METHOD 1: Create array of objects with {colName: cellValue} pairs */
var rows = window.rowData = [];
for ( i = 0; i < 10000; i++) {
var row = {};
for ( c = 0; c < 400; c++) {
row[keys[c]] = Math.random();
}
rows.push(row);
}
/* METHOD 2: Create set of columns { colName: [values]} */
var cols = window.colData = {};
for ( c=0; c<400; c++) {
var col = cols[keys[c]] = [];
for ( i=0; i<10000; i++) {
col[i] = rows[i][keys[c]];
}
}
I would be very careful about storing data in this fashion.
The main thing that worries me is usability. In my opinion, the biggest drawback of storing data in columns like this is that you now become responsible for managing insertion and removal of data in an atomic fashion. You will need to be very careful to ensure that if you remove or insert a value in one column, you also remove or insert a value at the same location in all of the other columns. You'll also have to make sure that whatever is using the data does not read values in the middle of the removal/insertion. If something tries to read a "row" from the data before the update finishes, it will see an inconsistent view which would be a bad thing. This all sounds very complicated and generally unpleasant to handle in Javascript to me.
When data is stored as objects in an array, you can handle insertion/deletion very simply. Simply remove or add an entire object to the array and you're done. The whole operation is atomic so you don't have to worry about timing, and you'll never have to worry about forgetting to remove an item from a column.
As far as memory usage is concerned, it really depends on the actual data you are storing. If you have data like that shown in your test example, where every "row" has a value in every "column", you will likely save some memory because the interpreter does not need to store the names of keys for each value in an object. How this is done is implementation specific, however, and after a little research I couldn't really identify whether this is the case or not. I could easily imagine a clever interpreter using a look-up table to store shared key names, in which case you will have almost negligible overhead when storing objects in an array compared to the column solution.
Also, if your data happens to be sparse, i.e. not every row has a value for every column, you could actually use more memory storing data in columns. In the column scheme you will need to store a value in every single column for every row, even if it's a null or some other indicator of empty space, to maintain alignment. If you store objects in an array, you can leave off key/value pairs where necessary. If there are a lot of key/value pairs that you can leave off, you can save a ton of memory.
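To make the sparse-data point concrete (illustrative values only):
// Row storage can simply omit missing fields:
var sparseRows = [
  { name: 'apple', price: 1.79 },
  { name: 'berry' }               // no price key at all
];
// Column storage must keep a placeholder in every column to stay aligned:
var sparseCols = {
  name:  ['apple', 'berry'],
  price: [1.79, null]             // placeholder needed so the indices line up
};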
As Donald Knuth said, "Premature optimization is the root of all evil." By storing your data in columns like that, you will be taking on a lot of extra work to make sure that your data is consistent (which may lead to fragile code) and you will be making your code much harder to read because people won't be expecting data to be stored like that. You should only inflict these things upon yourself if you really, really, need to. My recommendation would be to stick to the objects in an array solution, since it make your code much easier to read and write, and it's pretty unlikely that you actually need to save the memory that would be saved by the column solution. If, down the line, you have performance issues you can re-visit the idea of storing data this way. Even then, I'd be willing to bet that there are other, easier ways of making things run faster.
