JavaScript string concatenation speed

Can someone explain this one to me:
http://jsperf.com/string-concatenation-1/2
If you're lazy, I tested A) vs B):
A)
var innerHTML = "";
items.forEach(function(item) {
  innerHTML += item;
});
B)
var innerHTML = items.join("");
Where items for both tests is the same 500-element array of strings, with each string being random and between 100 and 400 characters in length.
A) ends up being 10x faster. How can this be? I always thought concatenating with join("") was an optimization trick. Is there something flawed with my tests?

Using join("") was an optimization trick for composing large strings on IE6, to avoid O(n**2) buffer copies. It was never expected to be a huge performance win for composing small strings, since the O(n**2) cost only really dominates the overhead of building an array for largish n.
Modern interpreters get around this by using "dependent strings". See this Mozilla bug for an explanation of dependent strings and some of their advantages and drawbacks.
Basically, modern interpreters know about a number of different kinds of strings:
An array of characters
A slice (substring) of another string
A concatenation of two other strings
This makes concatenation and substring O(1) at the cost of sometimes keeping too much of a substringed buffer alive resulting in inefficiency or complexity in the garbage collector.
Some modern interpreters have played around with the idea of further decomposing the first kind (the flat array of characters) into byte[]s for ASCII-only strings, and arrays of uint16s when a string contains a UTF-16 code unit that can't fit into one byte. But I don't know whether that idea is actually in any interpreter.
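To make the "concatenation of two other strings" idea concrete, here is a purely illustrative rope-style sketch in JavaScript. The names (ropeConcat, ropeFlatten) are invented for this sketch; no engine exposes its string internals this way.
// Illustrative only: a rope-style "concatenation node", roughly the idea
// engines use internally. Concatenation just records the two halves (O(1));
// characters are only copied into one flat buffer when somebody needs them.
function ropeConcat(left, right) {
  return { kind: "concat", left: left, right: right, length: left.length + right.length };
}

function ropeFlatten(node) {
  if (typeof node === "string") return node;               // already a flat string
  return ropeFlatten(node.left) + ropeFlatten(node.right); // lazy flattening
}

var rope = ropeConcat("Hello, ", "world");  // O(1): no characters copied yet
console.log(ropeFlatten(rope));             // "Hello, world": the copy happens here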

Here the author of the Lua programming language explains the buffer-copying overhead that Mike Samuel describes above. The examples are in Lua, but the issue is the same in JavaScript.

Related

Is an array of ints actually implemented as an array of ints in JavaScript / V8?

There is a claim in this article that an array of ints in JavaScript is implemented as a C++ array of ints.
However, according to MDN, unless you specifically use BigInts, all numbers in JavaScript are represented as doubles.
If I do:
const arr = [0, 1, 2, 3];
What is the actual representation in the V8 engine?
The code for V8 is on GitHub, but I don't know where to look.
(V8 developer here.)
"C++ array of ints" is a bit of a simplification, but the key idea described in that article is correct, and an array [0, 1, 2, 3] will be stored as an array of "Smis".
What's a "Smi"? While every Number in JavaScript must behave like an IEEE 754 double, V8 internally represents numbers as "small integers" (a 31-bit signed integer value plus a 1-bit tag) when it can, i.e. when the number has an integral value in the range -2**30 to 2**30-1, to improve efficiency. Engines can generally do whatever they want under the hood, as long as things behave as if the implementation followed the spec to the letter. So when the spec (or MDN documentation) says "all Numbers are doubles", what it really means from the engine's (or an engine developer's) point of view is "all Numbers must behave as if they were doubles".
When an array contains only Smis, then the array itself keeps track of that fact, so that values loaded from such arrays know their type without having to check. This matters e.g. for a[i] + 1, where the implementation of + doesn't have to check whether a[i] is a Smi when it's already known that a is a Smi array.
When the first number that doesn't fit the Smi range is stored in the array, it'll be transitioned to an array of doubles (strictly speaking still not a "C++ array", rather a custom array on the garbage-collected heap, but it's similar to a C++ array, so that's a good way to explain it).
When the first non-Number is stored in an array, what happens depends on what state the array was in before: if it was a "Smi array", then it only needs to forget the fact that it contains only Smis. No rewriting is needed, as Smis are valid object pointers thanks to their tag bit. If the array was a "double array" before, then it does have to be rewritten, so that each element is a valid object pointer. All the doubles will be "boxed" as so-called "heap numbers" (objects on the managed heap that only wrap a double value) at this point.
In summary, I'd like to point out that in the vast majority of cases, there's no need to worry about any of these internal implementation tricks, or even be aware of them. I certainly understand your curiosity though! Also, array representations are one of the more common reasons why microbenchmarks that don't account for implementation details can easily be misleading by suggesting results that won't carry over to a larger app.
Addressing comments:
V8 does sometimes even use int16 or lower.
Nope, it does not. It may or may not start doing so in the future; though if anything does change, I'd guess that untagged int32 is more likely to be introduced than int16; also if anything does change about the implementation then of course the observable behavior would not change.
If you believe that your application would benefit from int16 storage, you can use an Int16Array to enforce that, but be sure to measure whether that actually benefits you, because quite likely it won't, and may even decrease performance depending on what your app does with its arrays.
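For completeness, a tiny sketch of the Int16Array option just mentioned; whether it actually helps is workload-dependent, as noted:
// Typed arrays give guaranteed 16-bit storage, at the cost of losing normal
// Array methods and silently truncating out-of-range values.
const a = new Int16Array(4); // [0, 0, 0, 0]
a[0] = 1000;                 // stored as a 16-bit signed integer
a[1] = 70000;                // out of range: truncated modulo 2**16, then read as int16
console.log(a[1]);           // 4464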
It may start to be a double when you make it a decimal
Slightly more accurately: there are several reasons why an array of Smis needs to be converted to an array of doubles, such as:
storing a fractional value in it, e.g. 0.5
storing a large value in it, e.g. 2**34
storing NaN or Infinity or -0 in it
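As a rough illustration of the transitions described above, the comments below use the "elements kind" names from V8's public blog posts. None of this is observable from plain JavaScript, and the details may differ between V8 versions:
// The comments describe what V8 is commonly documented to do internally.
const arr = [0, 1, 2, 3]; // stored as Smis (PACKED_SMI_ELEMENTS)

arr.push(2 ** 34);        // too large for a Smi: the array is transitioned to
                          // raw doubles (PACKED_DOUBLE_ELEMENTS)

arr.push(0.5);            // fractional values also require the double representation

arr.push("hello");        // non-Number: generic elements (PACKED_ELEMENTS);
                          // the doubles are boxed as heap numbers at this point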

JavaScript String split: fixed-width vs. delimited performance

I am building a string to be parsed into an array by JavaScript. I can make it delimited or I can make the fields fixed-width. To test it, I built this jsperf test using a data string where the fields are both fixed-width and comma-delimited:
https://jsperf.com/string-split-fixed
I have only tested on Windows with Firefox and Chrome, so please run the test from other OSes and browsers. My two test results are clear: String.prototype.split() is the winner by a large margin.
Is my fixed-width code not efficient enough, or is the built-in string split function simply superior? Is there a way to code it so that the fixed-width parsing triumphs? If this were C/C++, properly written fixed-width code would be the clear winner. But I know JavaScript is an entirely different beast.
String.prototype.split() is a built-in JavaScript function. Expect it to be highly optimized for the particular JS engine and to be written not in JavaScript but in C++.
It should thus not come as a surprise that you can't match its performance with pure JavaScript code.
String operations like splitting a delimited string are inherently memory-bound. Hence, knowing the location of delimiters doesn't really help much, since the entire string still needs to be traversed at least once (to copy the delimited fragments). Fixed-position splitting might be faster for strings that exceed D-cache size, but your string is just 13KB long, so traversing it multiple times isn't going to matter.
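For reference, here is a hedged sketch of the two approaches being compared. The field width and delimiter are invented for this sketch; it is not the original jsperf code:
// Hypothetical layout: 10-character fields, each followed by a comma.
const FIELD_WIDTH = 10;

function parseDelimited(s) {
  return s.split(",");                     // built-in, implemented inside the engine
}

function parseFixedWidth(s) {
  const out = [];
  for (let i = 0; i < s.length; i += FIELD_WIDTH + 1) { // +1 skips the delimiter
    out.push(s.substring(i, i + FIELD_WIDTH));
  }
  return out;
}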

What's the rationale for using insertion sort over shell sort in Array.sort in V8

V8 uses QuickSort for arrays longer than 10 elements, and insertion sort for shorter arrays. Here is the source:
function InnerArraySort(array, length, comparefn) {
// In-place QuickSort algorithm.
// For short (length <= 10) arrays, insertion sort is used for efficiency.
I'm wondering what's the rationale for not using shell-sort instead of an insertion sort? I understand that it probably doesn't make a difference for an array of 10 elements, but still. Any ideas?
The original rationale is lost to history; the commit that introduced InsertionSort for short arrays (all the way back in 2008) only mentions that it's faster than QuickSort (for such short arrays). So it boils down to: someone implemented it that way, and nobody else saw a reason to change it since.
Since InsertionSort is known to be very efficient for short arrays, I agree that changing it probably doesn't make a difference -- and there are lots of things for the team to work on that actually do make a difference.
Great question. The rationale is simple: it is actually faster to use insertion sort on those small arrays, at least typically. Java in fact made the same switch a long while ago; they now use insertion sort when the array is shorter than 7 elements. See here, under the function sort1 at the top.
Basically what happens (in most cases) for such small arrays is that the overhead of Quicksort makes it slower than insertion sort. Insertion sort in these cases is much more likely to approach its best-case performance of O(n), while Quicksort is still likely to stay at O(n log n).
Shell sort, on the other hand, tends to be much slower than insertion sort. That being said, it can be much faster (relatively). The best case for insertion sort is still O(n), whereas the best case for shell sort is O(n log n), so anything under ten elements should have the potential to be faster from a mathematical standpoint. Unfortunately for shell sort, there is a lot more swapping involved, so it can become much slower. Insertion sort tends to be able to pull off its swapping with O(1) swaps, whereas shell sort is likely to be around O(n) swaps. Swaps are costly on real machines because they tend to require a third temporary register (there are ways of using XOR, but that is still three instructions on the CPU, typically). Therefore, insertion sort still wins on an actual machine, typically.
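For concreteness, here is a plain JavaScript sketch of the insertion sort being discussed. It is an illustration only, not V8's actual implementation:
// Straightforward insertion sort, the kind of routine engines fall back to
// for short ranges.
function insertionSort(a, comparefn) {
  for (let i = 1; i < a.length; i++) {
    const element = a[i];
    let j = i - 1;
    // Shift larger elements one slot to the right instead of swapping pairs,
    // which is part of why insertion sort is so cheap on short or nearly
    // sorted input.
    while (j >= 0 && comparefn(a[j], element) > 0) {
      a[j + 1] = a[j];
      j--;
    }
    a[j + 1] = element;
  }
  return a;
}

insertionSort([5, 2, 9, 1], (x, y) => x - y); // [1, 2, 5, 9]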

Javascript: substring time complexity [duplicate]

If we have a huge string, named str1, say 5 million characters long, and then str2 = str1.substr(5555, 100) so that str2 is 100 characters long and is a substring of str1 starting at 5555 (or any other randomly selected position).
How does JavaScript store str2 internally? Is the string content copied, or is the new string sort of virtual, storing only a reference to the original string plus a position and size?
I know this is implementation-dependent; the ECMAScript standard (probably) does not define what's under the hood of the string implementation. But I want to hear from some expert who knows V8 or SpiderMonkey from the inside well enough to clarify this.
Thank you
AFAIK V8 has four string representations:
ASCII
UTF-16
concatenation of multiple strings
slice of another string
Adventures in the land of substrings and RegExps has great explanations and illustrations.
Thus, it does not have to copy the string; it just stores beginning and ending markers that point into the other string.
SpiderMonkey does the same thing. (See Large substrings ~9000x faster in Firefox than Chrome: why? ... though the answer for Chrome is outdated.)
This can give real speed boosts, but sometimes this is undesirable, since it can cause small strings to hold onto the memory of the larger parent string (V8 bug report)
This old blog post of mine explains it, as well as some other string representation forms: https://web.archive.org/web/20170607033600/http://blog.cdleary.com:80/2012/01/string-representation-in-spidermonkey/
Search for "dependent string". I think I know what you might be getting at with the question: they can be problematic things at times, because if there are no other references to the original, you can keep a giant string alive just to keep a bitty little substring that's actually semantically reachable. There are things an implementation could do to mitigate that problem, like record information on a GC-generation basis to see if such one-dependent-string entities exist and collapse them to their minimal size, but last I knew that was not being done. (Essentially, with that kind of approach you're recovering runtime_refcount == 1 style information at GC-sweep time.)
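One commonly suggested workaround for that retention problem is to force a flat copy of the small substring so the giant parent can be collected. Whether the copy trick below actually works is engine- and version-dependent; it is a folk remedy, not a guarantee:
// If str1 is huge and only str2 is needed long-term, a dependent/sliced string
// can keep all of str1 alive.
let str1 = "x".repeat(5000000);
let str2 = str1.substr(5555, 100); // may be stored as a slice referencing str1

str2 = (" " + str2).slice(1);      // folk trick intended to force a flat copy
str1 = null;                       // now the 5M-character buffer can be collected,
                                   // assuming the copy trick worked in this engine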

String Comparison vs. Hashing

I recently learned about the rolling hash data structure, and one of its prime uses is searching for a substring within a string. Here are some advantages that I noticed:
Comparing two strings can be expensive so this should be avoided if possible
Hashing the strings and comparing the hashes is generally much faster than comparing the strings directly; however, rehashing the new substring each time traditionally takes linear time
A rolling hash is able to rehash the new substring in constant time, making it much quicker and more efficient for this task
I went ahead and implemented a rolling hash in JavaScript and began to analyze the speed between a rolling hash, traditional rehashing, and just comparing the substrings against each other.
In my findings, the larger the substring, the longer the traditional rehashing approach took to run (as expected), while the rolling hash ran incredibly fast (as expected). However, comparing the substrings against each other ran much faster than the rolling hash. How could this be?
For the sake of perspective, let's say the running times for the functions searching through a ~2.4 million character string for a 100 character substring were the following:
Rolling Hash - 0.809 seconds
Traditional Rehashing - 71.009 seconds
Just comparing the strings (no hashing) - 0.089 seconds
How could the string comparing be so much faster than the rolling hash? Could it just have something to do with JavaScript in particular? Strings are a primitive type in JavaScript; would this cause string comparisons to run in constant time?
My main confusion is as to how/why string comparisons are so fast in JavaScript, when I was under the impression that they were supposed to be relatively slow.
Note:
By string comparisons I'm referring to something like stringA === stringB
Note:
I asked this question over on the Computer Science Community and was informed that I should ask it here as well because this is most likely JavaScript specific.
After some testing and analysis, I've come to the conclusion that there were a few reasons as to why my rolling hash approach was running slightly slower than simply comparing the two strings.
If the rolling hash claims to run in constant time, how can it be slower than comparing strings?
Functions are relatively slow - calling a function is slightly slower than simply executing code inline. In my particular case, a function had to be called on my object every time the rolling hash rehashes its internal window, therefore taking slightly longer to run compared to the string comparison, since that code was simply inline. Especially since my benchmark has the rolling hash "shift" over 2 million iterations, this function-call slowdown can be seen even more clearly.
But why is the string comparison so fast?
Strings are primitive - Basically, because strings are a primitive type in JavaScript, attempting to compare two strings will most likely invoke some routine coded directly within the interpreter. This low-level evaluation can be done as fast as the architecture allows (similar to comparing numbers).
In Conclusion
Comparing strings in JavaScript ends up being faster than a rolling hash in this scenario because strings are primitive, which lets the interpreter work with them very quickly, and because simply calling functions creates a slight overhead that slows the process down on a very small scale.
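For reference, here is a minimal Rabin-Karp-style rolling hash sketch in JavaScript. BASE and MOD are arbitrary choices made for this sketch, and it is not the questioner's original implementation:
// Minimal Rabin-Karp-style rolling hash substring search.
const BASE = 256;
const MOD = 1000000007;

function rollingSearch(text, pattern) {
  const m = pattern.length;
  if (m === 0 || m > text.length) return -1;

  // Precompute BASE^(m-1) % MOD, used to remove the outgoing character.
  let highPow = 1;
  for (let i = 0; i < m - 1; i++) highPow = (highPow * BASE) % MOD;

  let target = 0;
  let rolling = 0;
  for (let i = 0; i < m; i++) {
    target = (target * BASE + pattern.charCodeAt(i)) % MOD;
    rolling = (rolling * BASE + text.charCodeAt(i)) % MOD;
  }

  for (let i = 0; ; i++) {
    // Only fall back to a real string comparison when the hashes match.
    if (rolling === target && text.substring(i, i + m) === pattern) return i;
    if (i + m >= text.length) return -1;
    // Slide the window by one: drop text[i], add text[i + m]. O(1) per step.
    rolling = (rolling - (text.charCodeAt(i) * highPow) % MOD + MOD) % MOD;
    rolling = (rolling * BASE + text.charCodeAt(i + m)) % MOD;
  }
}

rollingSearch("hello world", "world"); // 6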
