JavaScript: Removing empty elements from a large array by index pattern

I would like to reduce the size of a JavaScript array by removing empty nodes at regular intervals (most likely every even or odd node). Is there a simple and efficient way of doing this using built-in JavaScript or d3.js methods?
Background
For a data-driven, in-browser application, I have an array with indexing representing units on a horizontal time scale.
Initial timescale intervals are an algorithmic best guess, but often, far from all index points have data associated with them. In fact, on the basis of minimum time intervals found, I can often identify a regular pattern of unused elements which can be removed entirely and without impacting any data.
Mapping a timescale to real data is easy enough using d3 selections (empty elements are ignored), but given the size of these arrays and the fact that they are passed around a bit, early removal seems to make sense. Where data does exist, it is very large (a tree), so deletions are perhaps best made in-situ rather than through creation of a new array.
From the array documentation (native and d3.js) I see a couple of possible approaches, but am a little wary both of compatibility issues and possible side effects. Perhaps surprisingly, I also found no examples related to array index pattern matching.
To sum up:
the nodes to be deleted follow a pattern (every 2nd node etc)
these nodes are guaranteed empty.
no further dependencies (jQuery etc) thanks.
Many thanks

You can either do it 'by hand' or use the filter function.
The filter function is faster... to write:
function keepOddNodes(x, i) { return (i & 1); } // returns true for odd indices, i.e. the nodes to keep
var myNewArray = myOldArray.filter(keepOddNodes);
... but the good old for loop (in-place) is way, way faster:
var dst = 0;
for (var i = 0, len = myArray.length; i < len; i++) {
    if (i & 1) myArray[dst++] = myArray[i]; // keep only the odd-index nodes, compacting in place
}
myArray.length = dst; // truncate the now-unused tail
You can easily change the if (i & 1) to if (myTestFunction(i)) for more generic filtering.
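For instance, a minimal sketch of that more generic in-place version (compactInPlace and keepIndex are illustrative names of my own, not from the answer above):
function compactInPlace(arr, keepIndex) {
    var dst = 0;
    for (var i = 0, len = arr.length; i < len; i++) {
        if (keepIndex(i)) arr[dst++] = arr[i]; // copy down only the indices the predicate accepts
    }
    arr.length = dst; // truncate the tail in place, no new array is created
    return arr;
}
// e.g. drop every third node (keep indices that are not multiples of 3):
compactInPlace(myArray, function (i) { return i % 3 !== 0; });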
For performance, you can check here that it's more than 100 times faster with a for loop:
http://jsperf.com/filter-odd-items-in-array/2

Related

JavaScript: Efficiently move items in and out of a fixed-size array

If I have an array that I want to be of fixed size N for the purpose of caching the most recent of N items, then once limit N is reached, I'll have to get rid of the oldest item while adding the newest item.
Note: I don't care if the newest item is at the beginning or end of the array, just as long as the items get removed in the order that they are added.
The obvious ways are either:
push() and shift() (so that cache[0] contains the oldest item), or
unshift() and pop() (so that cache[0] contains the newest item)
Basic idea:
var cache = [], limit = 10000;
function cacheItem( item ) {
    // In case we want to do anything with the oldest item
    // before it's gone forever.
    var oldest = [];
    cache.push( item );
    // Use WHILE and >= instead of just IF in case the cache
    // was altered by more than one item at some point.
    while ( cache.length >= limit ) {
        oldest.push( cache.shift() );
    }
    return oldest;
}
However, I've read about memory issues with shift and unshift since they alter the beginning of the array and move everything else around, but unfortunately, one of those methods has to be used to do it this way!
Qs:
Are there other ways to do this that would be better performance-wise?
If the two ways I already mentioned are the best, are there specific advantages/disadvantages I need to be aware of?
Conclusion
After doing some more research into data structures (I've never programmed in other languages, so if it's not native to Javascript, I likely haven't heard of it!) and doing a bunch of benchmarking in multiple browsers with both small and large arrays as well as small and large numbers of reads / writes, here's what I found:
The 'circular buffer' method proposed by Bergi is hands-down THE best as far as performance goes (for reasons explained in the answer and comments), and hence it has been accepted as the answer. However, it's not as intuitive, and it makes it difficult to write your own 'extra' functions (since you always have to take the offset into account). If you're going to use this method, I recommend an already-created one like this circular buffer on GitHub.
The 'pop/unpush' method is much more intuitive, and performs fairly well, except at the most extreme numbers.
The 'copyWithin' method is, sadly, terrible for performance (tested in multiple browsers), quickly creating unacceptable latency. It also has no IE support. It's such a simple method! I wish it worked better.
The 'linked list' method, proposed in the comments by Felix Kling, is actually a really good option. I initially disregarded it because it seemed like a lot of extra stuff I didn't need, but to my surprise....
What I actually needed was a Least Recently Used (LRU) Map (which employs a doubly-linked list). Now, since I didn't specify my additional requirements in my original question, I'm still marking Bergi's answer as the best answer to that specific question. However, since I needed to know if a value already existed in my cache, and if so, mark it as the newest item in the cache, the additional logic I had to add to my circular buffer's add() method (primarily indexOf()) made it not much more efficient than the 'pop/unpush' method. HOWEVER, the performance of the LRUMap in these situations blew both of the other two out of the water!
So to summarize:
Linked List -- most options while still maintaining great performance
Circular Buffer -- best performance for just adding and getting
Pop / Unpush -- most intuitive and simplest
copyWithin -- terrible performance currently, no reason to use
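As a side note for anyone reading this later: the built-in Map preserves insertion order, so a minimal LRU cache can be sketched without hand-rolling a doubly-linked list. This is only a rough illustration of the idea, not the LRUMap implementation mentioned above:
var limit = 10000;
var lru = new Map(); // Map iterates its keys in insertion order
function lruGet(key) {
    if (!lru.has(key)) return undefined;
    var value = lru.get(key);
    lru.delete(key);      // re-insert to mark this entry as most recently used
    lru.set(key, value);
    return value;
}
function lruSet(key, value) {
    if (lru.has(key)) lru.delete(key);
    lru.set(key, value);
    if (lru.size > limit) {
        lru.delete(lru.keys().next().value); // evict the least recently used (first-inserted) entry
    }
}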
If I have an array that caches the most recent of N items, once limit N is reached, I'll have to get rid of the oldest while adding the newest.
You are not looking to copy stuff around within the array, which would take O(n) steps every time.
Instead, this is the perfect use case for a ring buffer. Just keep an offset to the "start" and "end" of the list, then access your buffer with that offset and modulo its length.
var cache = new Array(10000);
cache.offset = 0;
function cacheItem(item) {
    cache[cache.offset++] = item;
    cache.offset %= cache.length;
}
function cacheGet(i) { // backwards, 0 is most recent
    return cache[(cache.offset - 1 - i + cache.length) % cache.length];
}
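A quick usage illustration of the above, assuming the buffer starts empty:
cacheItem('a');
cacheItem('b');
cacheItem('c');
console.log(cacheGet(0)); // 'c' - the most recently cached item
console.log(cacheGet(2)); // 'a' - the oldest of the three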
You could use Array#copyWithin.
The copyWithin() method shallow copies part of an array to another location in the same array and returns it, without modifying its size.
Description
The copyWithin works like C and C++'s memmove, and is a high-performance method to shift the data of an Array. This especially applies to the TypedArray method of the same name. The sequence is copied and pasted as one operation; pasted sequence will have the copied values even when the copy and paste region overlap.
The copyWithin function is intentionally generic, it does not require that its this value be an Array object.
The copyWithin method is a mutable method. It does not alter the length of this, but will change its content and create new properties if necessary.
var array = [0, 1, 2, 3, 4, 5];
array.copyWithin(0, 1);
console.log(array); // [1, 2, 3, 4, 5, 5]
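For what it's worth (the asker's conclusion above notes that this approach benchmarked poorly), here is a minimal sketch of how copyWithin could be applied to the cache problem; the limit check and the names are my own:
var cache = [], limit = 10000;
function cacheItem(item) {
    if (cache.length < limit) {
        cache.push(item);               // still room: just append
    } else {
        cache.copyWithin(0, 1);         // shift everything one slot to the left, dropping the oldest
        cache[cache.length - 1] = item; // overwrite the now-duplicated last slot with the newest item
    }
}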
You need to splice the existing item and put it in the front using unshift (as the newest item). If the item doesn't already exist in your cache, then you can unshift and pop.
function cacheItem( item )
{
    var index = cache.indexOf( item );
    if ( index != -1 ) {
        cache.splice( index, 1 );   // already cached: remove it from its old position
    } else if ( cache.length >= limit ) {
        cache.pop();                // at capacity (limit from the question's code): drop the oldest item
    }
    cache.unshift( item );          // the newest item goes to the front
}
item needs to be a String or Number; otherwise you'll need to write your own lookup using findIndex to locate the object (if item is an object).
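For example, a rough sketch of that findIndex variant, assuming each cached object carries an id property (the property name and the limit check are my own assumptions):
function cacheObject(item) {
    var index = cache.findIndex(function (cached) {
        return cached.id === item.id;   // compare by id instead of by reference
    });
    if (index !== -1) {
        cache.splice(index, 1);         // already cached: remove it from its old position
    } else if (cache.length >= limit) {
        cache.pop();                    // at capacity: drop the oldest item
    }
    cache.unshift(item);                // the newest item goes to the front
}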

How do JavaScript Arrays resize internally?

I've been trying to implement a collection type of class (similar to List found in C#) in JavaScript that has some custom functionalities. I also wanted it to be somewhat optimized (I've read some articles on how to properly use JavaScript Arrays).
I thought to myself "if we don't define an initial size to an Array and we keep adding objects to it, internally it will have to allocate a new size for each insertion, that must be slow. I can avoid this by allocating a new size myself (changing the array length), somewhat similar to how it is done in C#, doubling in size whenever the max capacity is reached (I know it's not this trivial but it's a start)".
I tried to implement this idea and found out that it is way slower (about 10 times slower):
// This simplified approach of my implementation is faster...
var array = [];
var counter = 0;
function addItem(newItem) {
    array[++counter] = newItem;
}
// ...than this version that resizes the array when a limit is reached
var array = [];
array.length = INITIAL_SIZE;
/*
Alternatively
var array = new Array(INITIAL_SIZE);
*/
var counter = 0;
function addItem(newItem) {
    if( CheckCapacity(counter + 1) ) { // Checks if the maximum size is reached and, if so, grows array.length to the new size
        array[++counter] = newItem;
    }
}
Before testing this, I thought to myself, "since I've reserved a new size for the array when I call CheckCapacity(counter + 1), internally it (the JavaScript Array) won't have to perform as many operations as in the first function, since I make sure there is more than enough space available", i.e., the array[++counter] = newItem line in the second function should be faster than the same line in the first function.
I've even used different arrays which contained pre-calculated sizes for the one holding the items; it still was slower.
So back to my question, how does the implementation of a JavaScript Array allocate the necessary size? Am I correct to assume that not much can be done to speed this process up? To me it made sense that one of the drawbacks of having an object (the JavaScript Array) that dynamically allocates more memory each time a new item is added would be a loss of speed (unless it has pretty good algorithms implemented, but I don't know, hence my question).
In JavaScript, an Array is an abstraction. How it is implemented (and when allocation and resizing is performed) is left up to the JavaScript engine - the ECMAScript specification does not dictate how this is done. So there is basically no precise way to know.
In practice, JavaScript engines are very clever about how they allocate memory and they make sure not to allocate too much. In my opinion, they are far more sophisticated than C#'s List -- because JavaScript engines can dynamically change the underlying data structure depending on the situation. The algorithms vary, but most will consider whether there are any "holes" in your array:
var array = [];
array[0] = "foo" // Is a resizable array
array[1] = "bar" // Is a resizable array
array[2] = "baz" // Is a resizable array
array[1000000] = "hello"; // Is now a hash table
console.log(array[1000000]) // "hello"
If you use arrays normally and use contiguous keys starting at zero, then there are no "holes" and most JavaScript engines will represent the JavaScript array by using a resizable array data structure. Now consider the fourth assignment: I've created a so-called "hole" of roughly a million slots (the hole spans indices 3-999999). It turns out JavaScript engines are clever enough not to allocate ~1 million slots in memory for this massive hole. When the engine detects the hole, it represents the JavaScript array using a dictionary / hash-table-like data structure (it uses a binary search tree where the keys are hashed) to save space. It won't store space for the hole, just four mappings: (0, "foo"), (1, "bar"), (2, "baz"), (1000000, "hello").
Unfortunately, accessing the Array is now slower for the engine because it will have to compute a hash and traverse a tree. When there are no holes, we use a resizable array and we have quicker access times, but when we have a hole the Array's performance is slower. The common terminology is to say an Array is a dense array when it is without any holes (it uses a resizable array = better performance), and an Array is a sparse array when it has one or more holes (it uses a hash table = slower performance). For best performance in general, try to use dense arrays.
Now to finish off, let me tell you that the following is a bad idea:
var array = new Array(1000000);
array[0] = "foo"; // Is a hash table
The array above has a hole of size ~1 million (it's like this: ["foo", undefined, undefined, ... undefined]) and therefore it is using a hash table as the underlying data structure. So implementing the resizing yourself is a bad idea - it will create a hole and cause worse performance rather than better. You're only confusing the JavaScript engine.
This is what your code was doing: your array always had a hole in it and therefore was using a hash table as the underlying data structure, giving slower performance compared to an array without any holes (aka the first version of your code).
Am I correct to assume that not much can be done to speed this process up?
Yes, there is little to be done on the user's side regarding pre-allocation of space. To speed up JavaScript arrays in general, you want to avoid creating sparse arrays (avoid creating holes):
Don't pre-allocate using new Array(size). Instead "grow as you go". The engine will work out the size of the underlying resizable array itself.
Use contiguous integer keys starting at 0. Don't start from a big integer. Don't add keys that are not integers (e.g. don't use strings as keys).
Try not to delete keys in the middle of arrays (don't delete the element at index 5 from an array with indices 0-9 filled in); see the short example after this list.
Don't convert to and from dense and sparse arrays (i.e. don't repeatedly add and remove holes). There's an overhead for the engine to convert to and from the resizable array vs hash-table representations.
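To illustrate the third point in that list, a small sketch of my own (following the answer's terminology):
var a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
delete a[5];     // leaves a hole at index 5 - the array can fall back to the slower sparse representation
var b = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
b.splice(5, 1);  // removes the element and shifts the rest down - b stays dense (its length becomes 9)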
The disadvantage of [JavaScript Arrays over C# Lists is that they] dynamically allocate more memory each time a new item is added
No, not necessarily. C# Lists and JavaScript Arrays are basically the same when the JavaScript array has no holes. Both are resizable arrays. The difference is that:
C# Lists give the user more control over the behaviour of the resizable array. In JavaScript, you have no control over it -- it's inside the engine.
C# Lists allow the user to preallocate memory for better performance, whereas in JavaScript, you should let the engine automatically work out how to preallocate memory in the underlying resizable array for better performance.

Is it a bad idea to use indexOf inside loops?

I was studying big O notation for a technical interview and then I realized that JavaScript's indexOf method may have a time complexity of O(N), as it traverses each element of an array and returns the index where it's found.
We also know that a time complexity of O(n^2) (n square) is not a good performance measure for larger data.
So is it a bad idea to use indexOf inside loops? In JavaScript, it's common to see code where the indexOf method is used inside loops, maybe to test equality or to prepare some object.
Rather than arrays, should we prefer objects wherever necessary, as they provide lookup with constant time performance O(1)?
Any suggestions will be appreciated.
It can be a bad idea to use indexOf inside loops, especially if the data structure you are searching through is quite large.
One workaround is to have a hash table or dictionary containing the index of every item, which you can generate in O(N) time by looping through the data structure and then update every time you add to the data structure.
If you push something onto the end of the data structure, it will take O(1) time to update this table; the worst case is pushing something onto the beginning of the data structure, which will take O(N).
In most scenarios it will be worth it, as getting the index will then be O(1) time.
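A rough sketch of that idea, assuming the values themselves can be used as Map keys (the names are mine, not from the answer):
var items = ['a', 'b', 'c', 'd'];
var indexByValue = new Map();
items.forEach(function (value, i) {
    indexByValue.set(value, i);            // one O(N) pass builds the lookup table
});
function appendItem(value) {
    indexByValue.set(value, items.length); // O(1) update when pushing onto the end
    items.push(value);
}
console.log(indexByValue.get('c')); // 2 - an O(1) average-time lookup instead of items.indexOf('c')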
To be honest, tl;dr. But I did some speed tests of the various ways of checking for occurrences in a string (if that is your goal for using indexOf; if you are actually trying to get the position of the match, I personally don't know how to help you there). The ways I tested were:
.includes()
.match()
.indexOf()
(There are also the variants such as .search(), .lastIndexOf(), etc. Those I have not tested).
Here is the test:
var test = 'test string';
console.time('match');
console.log(test.match(/string/));
console.timeEnd('match');
console.time('includes');
console.log(test.includes('string'));
console.timeEnd('includes');
console.time('indexOf');
console.log(test.indexOf('string') !== -1); // -1 means not found
console.timeEnd('indexOf');
I know they are not loops, but they show that all are basically the same speed. And honestly, each does different things. Depending on what you need (do you want to search by RegEx? Do you need to be pre-ECMAScript 2015 compatible? etc. - I have not even listed all of them), is it really necessary to analyze it this much?
From my tests, sometimes indexOf() would win, sometimes one of the other ones would win.
Based on the browser, indexOf has different implementations (using graphs, trees, ...), so the time complexity of each indexOf also differs.
What is clear, though, is that implementing indexOf as a plain simple loop would be naive, and I don't think any browser implements it that way. Therefore, using indexOf inside a for loop is not the same as using two nested for loops.
So, this one:
// could be O(n*m) which m is so small
// could be O(log n)
// or any other O(something) that is for sure smaller than O(n^2)
console.time('1')
firstArray.forEach(item => {
    secondArray.indexOf(item)
})
console.timeEnd('1')
is different than:
// has O(n^2)
console.time('2')
firstArray.forEach(item => {
    secondArray.forEach(secondItem => {
        // extra things to do here
    })
})
console.timeEnd('2')

Are integers faster than strings as keys for lookup tables in Javascript?

My data returned from the backend contains a lot of referential data and I need to access it efficiently, so I'm thinking about creating (object's id) => (object itself) type lookups. The IDs for objects are returned as strings, and I wonder whether integers are faster than strings as hash keys.
playerLookup = {};
for (var i = 0; i < players.length; i++) {
    var player = players[i];
    playerLookup[player.id] = player;
    // vs.
    playerLookup[parseInt(player.id)] = player;
}
According to the jsperf test http://jsperf.com/testasdfa, the integer lookup is considerably (~25%) faster on Chrome. Not sure if the test covers the scenario properly. What do you think?
My opinion is that the fastest way to find the element is through modular hash tabling.
Make playerLookup an array of n elements, with each element of the array set to -1 (or some value that lets you know that slot hasn't been set yet).
When you store a playerId, store it in playerLookup[parseInt(player.id) % n].
The work complexity of finding an item from the hash table this way is O(1), but the work complexity of the methods you've listed above is O(x), where x = playerLookup.length (regardless of whether you're using strings or numbers as keys).
To make the hash table smaller, choose a smaller n. The smaller the n, the more likely we'll get clashes.
Clashes
To deal with clashes, make each element of playerLookup an array. If you're adding a playerId to playerLookup in a spot that already contains another player, list the new one alongside it (i.e. now both are in the same spot). If you look up a player and find a spot in the hash table with more than one player in it, simply iterate across this array until you find the player. This iteration will have the same work complexity as the first implementation you had in mind, but with two advantages:
It's less likely to even need to happen, because of the modular hash table
When it does happen, the array we're iterating across is, on average, n times smaller than the original one you would have implemented (where n is our modulus variable).
For mathematical reasons (with moduli) I recommend choosing a prime number for n.
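A rough sketch of the described bucketing scheme (the value of n, the function names, and the string comparison on id are my own choices for illustration):
var n = 101;                    // a prime modulus spreads clashes more evenly
var playerLookup = new Array(n);
function addPlayer(player) {
    var slot = parseInt(player.id, 10) % n;
    if (!playerLookup[slot]) playerLookup[slot] = []; // lazily create the bucket
    playerLookup[slot].push(player);                  // clashing ids simply share the bucket
}
function findPlayer(id) {
    var bucket = playerLookup[parseInt(id, 10) % n] || [];
    for (var i = 0; i < bucket.length; i++) {         // usually a very short iteration
        if (bucket[i].id === id) return bucket[i];
    }
    return null;
}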

Why is looping through an Array so much faster than JavaScript's native `indexOf`?

Why is looping through an Array so much faster than JavaScript's native indexOf? Is there an error or something that I'm not accounting for? I expected native implementations would be faster.
Browser        For Loop      While Loop     indexOf
Chrome 10.0    50,948,997    111,272,979    12,807,549
Firefox 3.6     9,308,421     62,184,430     2,089,243
Opera 11.10    11,756,258     49,118,462     2,335,347
http://jsben.ch/#/xm2BV
Five years later, a lot has changed in browsers. Now indexOf performance has increased and is definitely better than any custom alternative.
Chrome Version 49.0.2623.87 (64-bit)
Ok, looking at the other benchmarks here I am scratching my head at the way that most developers seem to do their benchmarking.
Apologies, but the way it is done leads to horribly wrong conclusions, so I have to go a bit meta and give a comment on the answers provided.
What is wrong with the other benchmarks here
Measuring where to find element 777 in an array that never changes, always leading to index 117, seems so inappropriate for obvious reasons that I have trouble explaining why. You can't reasonably extrapolate anything from such an overly specific benchmark! The only analogy I can come up with is performing anthropological research on one person, and then calling the findings a generalized overview of the entire culture of the country that this person lives in. The other benchmarks aren't much better.
Even worse: the accepted answer is an image without a link to the benchmark that was used, so we have no way to check whether the code for that benchmark is correct (I hope it is a screenshot of a jsperf link that was originally in the question and later edited out in favour of the new jsben.ch link). It's not even an explanation of the original question: why one performs better than the other (a highly debatable statement to begin with).
First, you should know that not all benchmarking sites are created equal - some can add significant errors to certain types of measurements due to their own framework interfering with the timing.
Now, we are supposed to be comparing the performance of different ways to do linear search on an array. Think about the algorithm itself for a second:
look at a value for a given index into an array.
compare the value to another value.
if equal, return the index
if it is not equal, move to the next index and compare the next value.
That's the whole linear search algorithm, right?
So some of the linked benchmarks compare sorted and unsorted arrays (sometimes incorrectly labeled "random", despite being in the same order each iteration - relevant XKCD). It should be obvious that this does not affect the above algorithm in any way - the comparison operator does not see that all values increase monotonically.
Yes, ordered vs unsorted arrays can matter, when comparing the performance of linear search to binary or interpolation search algorithms, but nobody here is doing that!
Furthermore, all benchmarks shown use a fixed length array, with a fixed index into it. All that tells you is how quickly indexOf finds that exact index for that exact length - as stated above, you cannot generalise anything from this.
Here is the result of more-or-less copying the benchmark linked in the question to perf.zone (which is more reliable than jsben.ch), but with the following modifications:
we pick a random value of the array each run, meaning we assume each element is as likely to be picked as any other
we benchmark for 100 and for 1000 elements
we compare integers and short strings.
https://run.perf.zone/view/for-vs-while-vs-indexof-100-integers-1516292563568
https://run.perf.zone/view/for-vs-while-vs-indexof-1000-integers-1516292665740
https://run.perf.zone/view/for-vs-while-vs-indexof-100-strings-1516297821385
https://run.perf.zone/view/for-vs-while-vs-indexof-1000-strings-1516293164213
Here are the results on my machine:
https://imgur.com/a/fBWD9
As you can see, the result changes drastically depending on the benchmark and the browser being used, and each of the options wins in at least one of the scenarios: cached length vs uncached length, while loop vs for-loop vs indexOf.
So there is no universal answer here, and this will surely change in the future as browsers and hardware changes as well.
Should you even be benchmarking this?
It should be noted that before you proceed to build benchmarks, you should determine whether or not the linear search part is a bottleneck to begin with! It probably isn't, and if it is, the better strategy is probably to use a different data structure for storing and retrieving your data anyway, and/or a different algorithm.
That is not to say that this question is irrelevant - it is rare, but it can happen that linear search performance matters; I happen to have an example of that: establishing the speed of constructing/searching through a prefix trie constructed through nested objects (using dictionary look-up) or nested arrays (requiring linear search).
As can be seen in this GitHub comment, the benchmarks involve various realistic and best/worst-case payloads on various browsers and platforms. Only after going through all that do I draw conclusions about expected performance. In my case, for most realistic situations the linear search through an array is faster than dictionary look-up, but the worst-case performance is worse to the point of freezing the script (and easy to construct), so the implementation was marked as an "unsafe" method to signal to others that they should think about the context the code would be used in.
Jon J's answer is also a good example of taking a step back to think about the real problem.
What to do when you do have to micro-benchmark
So let's assume we know that we did our homework and established that we need to optimize our linear search.
What matters then is the eventual index at which we expect to find our element (if at all), the type of data being searched, and of course which browsers to support.
In other words: is any index equally likely to be found (uniform distribution), or is it more likely to be centered around the middle (normal distribution)? Will we find our data at the start or near the end? Is our value guaranteed to be in the array, or only a certain percentage of the time? What percentage?
Am I searching an array of strings? Objects? Numbers? If they're numbers, are they floating point values or integers? Are we trying to optimize for older smartphones, up-to-date laptops, or school desktops stuck with IE10?
This is another important thing: do not optimize for the best-case performance, optimize for realistic worst-case performance. If you are building a web app where 10% of your customers use very old smartphones, optimize for them; their experience will be unbearable with bad performance, while the micro-optimization is wasted on the newest generation of flagship phones.
Ask yourself these questions about the data you are applying linear search to, and the context within which you do it. Then make test-cases fitting for these criteria, and test them on the browsers/hardware that represents the targets you are supporting.
Probably because the actual indexOf implementation is doing a lot more than just looping through the array. You can see the Firefox internal implementation of it here:
https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/Array/indexOf
There are several things that can slow down the loop that are there for sanity purposes:
The array is being re-cast to an Object
The fromIndex is being cast to a Number
They're using Math.max instead of a ternary
They're using Math.abs
indexOf does a bunch of type-checking and validation that the for loop and while loop ignore.
Here's the indexOf algorithm:
https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/Array/indexOf
Edit: My guess is indexOf is faster for big arrays because it caches the length of the array before looping through it.
Run the test one more time with the edits I've made.
I've increased the size of the array, and made the index you're searching for larger as well. It seems in large arrays indexOf may be a faster choice.
http://jsben.ch/#/xm2BV
EDIT: Based on more tests, indexOf seems to run faster than a for loop in the version of Safari I'm using (5.0.3) and slower in just about everything else.
It might be worth noting that if all you are trying to do is keep a list of items and check for existence (e.g. avoid adding duplicate IDs to an array), it would be far faster to keep an OBJECT with keys that reflect each ID. If you think I'm wrong, compare the following with an array + indexOf. We are talking 181ms for the object method vs. 1 MINUTE for the array indexOf method.
var objs = []
var i_uid = {} // method 1
var a_uid = [] // method 2
var total_count = 100000, idLen = 5
var ts, te, cObj = 0
// method 1
ts = new Date()
while (cObj < total_count) {
    var u = uid(idLen),
        o = {
            uid: u,
            text: 'something',
            created: new Date()
        }
    if (i_uid[u] === undefined) { // ensure unique uids only (explicit check, since the stored value 0 is falsy)
        objs.push(o)
        i_uid[u] = cObj // current array position as placeholder
        cObj++
    }
    else {
        console.log('unique violation [duplicate uid', u, ']')
    }
}
te = new Date()
console.log('loaded ' + total_count + ' with object method in', (te - ts), 'ms')
i_uid = {} // free-up
cObj = 0 // reset
objs = [] // reset
// method 2
ts = new Date()
while (cObj < total_count) {
    var u = uid(idLen),
        o = {
            uid: u,
            text: 'something',
            created: new Date()
        }
    if (a_uid.indexOf(u) == -1) { // ensure unique uids only
        objs.push(o)
        a_uid.push(u)
        cObj++
    }
    else {
        console.log('unique violation [duplicate uid', u, ']')
    }
}
te = new Date()
console.log('loaded ' + total_count + ' with array + indexOf method in', (te - ts), 'ms')
function uid(l) {
    var t = '',
        p = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789',
        pl = p.length
    for (var i = 0; i < l; i++)
        t += p.charAt(Math.floor(Math.random() * pl))
    return t
}
