'indexOf' method performance optimization - javascript

According to the answers to Is JavaScript a pass-by-reference or pass-by-value language?, JavaScript passes strings by value. Because of that, calling the indexOf method would trigger copying of the content.
In my case, I am parsing a big string to look for data, and I use indexOf heavily. The string can be as large as 100-200 KB, and I may need to call indexOf up to 1000 times per full scan.
I'm afraid this will pollute memory with unnecessary copies of the string and could hurt performance.
Am I correct in my conclusion? If so, what is the proper way to deal with my task?
Potentially, I could use regular expressions, but at the moment that looks too complex.

Strings are a strange beast in Javascript, they seem to inhabit a middle ground between primitive types and objects. While technically they're considered primitive, they can in many circumstances be treated as if they were a reference, due to their immutability.
Given that they are immutable, I'd be very surprised if a copy of the string was passed to any function, since that would be both hideously expensive and totally unnecessary.
A seemingly easy way to find out would be to pass a string to a function and change one of its characters to see if that's reflected in the original string on return. However, as mentioned, the immutability of strings makes that impossible.
It may be possible to test this indirectly by executing indexOf("a") against one of two strings in a loop many millions of times.
The string to search would either be "a" or "a very long string containing many thousands of characters ...".
If the strings are passed by reference, there should be no noticeable difference in the times. Passing by value should be detectable since you have to copy the string millions of times and a short string should copy faster than a long one.
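A minimal sketch of such a timing test (sizes and iteration counts are made up; treat the numbers as relative only):

const shortStr = "a";
const longStr = "a".repeat(100000);
function time(label, s) {
  const t0 = Date.now();
  for (let i = 0; i < 5000000; i++) {
    s.indexOf("a"); // "a" is at index 0 in both strings, so the search itself costs the same
  }
  console.log(label, Date.now() - t0, "ms");
}
time("short", shortStr);
time("long", longStr);
// If strings were copied on every call, "long" should be dramatically slower.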
But, as I said, it's probably unnecessary since the engine will most likely be as optimised as possible.

.indexOf() is merely a search of the string that returns a numeric index into it. No copy is made, because none is needed; this doesn't really have anything to do with value/reference semantics in Javascript.
Strings in Javascript are immutable. That means they can never be changed and these indexes into the string always point at the same place in the string. Any operation that operates on a string to make a change, returns a new string leaving the old one.
This allows for some interesting optimizations in the implementation. Because a string is immutable, it can be shared among all points of code that have a reference to it. Whenever anyone calls a function to modify the string, it simply returns a new string object created from the old one plus the modification.
If you were to use the index from .indexOf() with .slice() or something like that, then you would be copying part of the original string into a new string object (likely using some additional memory).
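For example (a tiny illustration; the string is made up):

const s = "find the needle in the haystack";
const i = s.indexOf("needle"); // just a number (9); no copy of s is made
const part = s.slice(i, i + 6); // this *does* create a new string, "needle"
console.log(s); // the original string is unchanged (immutable)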
If you want to test this for yourself, feel free to run as many .indexOf() operations as you want on a large string and watch memory usage.
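A rough way to try that (assuming Node.js, where process.memoryUsage() is available):

const big = "x".repeat(200 * 1024); // ~200 KB string, as in the question
const before = process.memoryUsage().heapUsed;
let hits = 0;
for (let i = 0; i < 1000; i++) {
  if (big.indexOf("x", i) !== -1) hits++;
}
const after = process.memoryUsage().heapUsed;
console.log(hits, "searches, heap delta:", after - before, "bytes");
// The delta should stay near zero - no copies of `big` are made.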

Related

Javascript: substring time complexity [duplicate]

Suppose we have a huge string named str1, say 5 million characters long, and then str2 = str1.substr(5555, 100), so that str2 is 100 characters long and is a substring of str1 starting at position 5555 (or any other randomly selected position).
How does JavaScript store str2 internally? Are the string contents copied, or is the new string sort of virtual - only a reference to the original string plus position and size values?
I know this is implementation dependent, ECMAScript standard (probably) does not define what's under the hood of the string implementation. But I want to know from some expert who knows V8 or SpiderMonkey from inside well enough to clarify this.
Thank you
AFAIK V8 has four string representations:
ASCII
UTF-16
concatenation of multiple strings
slice of another string
Adventures in the land of substrings and RegExps has great explanations and illustrations.
Thus, it does not have to copy the string; it just keeps beginning and ending markers pointing into the other string.
SpiderMonkey does the same thing. (See Large substrings ~9000x faster in Firefox than Chrome: why? ... though the answer for Chrome is outdated.)
This can give real speed boosts, but sometimes this is undesirable, since it can cause small strings to hold onto the memory of the larger parent string (V8 bug report)
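A hypothetical illustration of that hazard (loadHugeDocument is a placeholder, and the flattening trick at the end is an engine-dependent folk remedy, not a guarantee):

function firstWord(hugeText) {
  // may be represented internally as a slice pointing into hugeText
  return hugeText.slice(0, hugeText.indexOf(" "));
}
let word = firstWord(loadHugeDocument());
// `word` is tiny, but in some engines it can keep the whole parent buffer alive.
// A common workaround to force a flat, independent copy:
word = (" " + word).slice(1);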
This old blog post of mine explains it, as well as some other string representation forms: https://web.archive.org/web/20170607033600/http://blog.cdleary.com:80/2012/01/string-representation-in-spidermonkey/
Search for "dependent string". I think I know what you might be getting at with the question: they can be problematic things, at times, because if there are no references to the original, you can keep a giant string around in order to keep a bitty little substring that's actually semantically reachable. There are things that an implementation could do to mitigate that problem, like record information on a GC-generation basis to see if such one-dependent-string entities exist and collapse them to their minimal size, but last I knew of that was not being done. (Essentially with that kind of approach you're recovering runtime_refcount == 1 style information at GC-sweep time.)

Reverse string comparison

I'm using a Dictionary (associative array, hash table, any of these synonyms).
The keys used to uniquely identify values are fairly long strings. However, I know that these strings tend to differ at the tail, rather than the head.
The fastest way to find a value in a JS object is to test the existence of
object[key], but is that also the case for extremely long, largely similar, keys (+100 chars), in a fairly large Dictionary (+1000 entries)?
Are there alternatives for this case, or is this a completely moot question, because accessing values by key is already insanely fast?
Long story short: it doesn't matter much. JS will internally use a hash table (as you already said yourself), so it will need to calculate a hash of your keys for insertion and (in some cases) for accessing elements.
Calculating a hash (for most reasonable hash functions) will take slightly longer for long keys than for short keys (I would guess about linearly longer), but it doesn't matter whether the changes are at the tail or at the head.
You could decide to roll your own hashes instead, cache these somehow, and use these as keys, but this would leave it up to you to deal with hash collisions. It will be very hard to do better than the default implementation, and is almost certainly not worth the trouble.
Moreover, for an associative array with only 1000 elements, probably none of this matters. Modern CPUs can process billions of instructions per second; even a linear search through the whole array would likely perform just fine, unless you have to do it very, very often.
Hash tables (dictionary, map, etc.) first compare hash codes, and only then, if necessary (in case of a collision, i.e. at least two keys having the same hash code), perform equals. If you experience performance problems, the first thing to check, IMHO, is hash code collisions. It may turn out (with a bad implementation or weird keys) that the hash code is computed on, say, the first 3 chars (a wild exaggeration, of course):
"abc123".hashCode() ==
"abc456".hashCode() ==
...
"abc789".hashCode()
and so you have a lot of collisions, have to perform equals, and end up with a slow O(N) routine. In that case, you have to think up a better hash.
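If you want to convince yourself that long, similar keys are still fast, here is a rough micro-benchmark sketch (the key shape and counts mirror the question; absolute timings will vary by engine):

const dict = {};
for (let i = 0; i < 1000; i++) {
  dict["x".repeat(110) + i] = i; // ~113-char keys sharing a long head, differing at the tail
}
const probe = "x".repeat(110) + "500";
console.time("1e6 lookups");
let found = 0;
for (let i = 0; i < 1000000; i++) {
  if (dict[probe] !== undefined) found++;
}
console.timeEnd("1e6 lookups");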

Can I increase lookup speed by positioning properties in object?

I've seen a lot of questions about the fastest way to access object properties (like using . vs []), but can't seem to find whether it's faster to retrieve object properties that are declared higher than others in object literal syntax.
I'm working with an object that could contain up to 40,000 properties, each of which is an Array of length 2. I'm using it as a lookup by value.
I know that maybe 5% of the properties will be the ones I need to retrieve most often. Is either of the following worth doing for increased performance (decreased lookup time)?
Set the most commonly needed properties at the top of the object literal syntax?
If #1 has no effect, should I create two separate objects, one with the most common 5% of properties, search that one first, then if the property isn't found there, then look through the object with all the less-common properties?
Or, is there a better way?
I did a js perf here: http://jsperf.com/object-lookup-perf
I basically injected 40000 props with random keys into an object, saved the "first" and "last" keys and looked them up in different tests. I was surprised by the result, because accessing the first was 35% slower than accessing the last entry.
Also, having an object of 5 or 40000 entries didn’t make any noticeable difference.
The test case can most likely be improved and I probably missed something, but there is a start for you.
Note: I only tested chrome
Yes, something like indexOf searches front to back, so placing common items closer to the front will return them faster. Most "basic" search algorithms are simple top-down linear scans - at least for arrays.
If you have so many properties, they must be computed, no? So you could replace the (string, most probably) key computation with an integer hash computation, then use that hash as an index into a regular array.
You might even use one single array by putting values in the 2*i-th and (2*i+1)-th slots.
If you can use a typed array here, do it - you could hardly go faster. A sketch of the idea follows.
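A sketch of that idea, assuming an FNV-1a-style hash (collisions are silently ignored here for brevity; a real table would have to handle them):

function hash(str) {
  let h = 2166136261; // FNV-1a offset basis
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 16777619); // FNV prime
  }
  return h >>> 0;
}
const SIZE = 1 << 16;
const table = new Float64Array(2 * SIZE); // each entry holds a pair of values
function put(key, a, b) {
  const i = hash(key) & (SIZE - 1);
  table[2 * i] = a;     // the 2*i-th slot
  table[2 * i + 1] = b; // the (2*i+1)-th slot
}
function get(key) {
  const i = hash(key) & (SIZE - 1);
  return [table[2 * i], table[2 * i + 1]];
}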
Set the most commonly needed properties at the top of the object literal syntax?
No. Choose readability over performance. If you've got few enough properties that you use a literal in the code, it won't matter anyway; and you should order the properties in a logical sequence.
Property lookup in objects is usually based on hash maps, so position should not make a substantial difference. Depending on the hash implementation, some lookups might be negligibly slower, but I'd guess this is quite random and depends heavily on the applied optimisations. It should not matter.
If #1 has no effect, should I create two separate objects, one with the most common 5% of properties, search that one first, then if the property isn't found there, then look through the object with all the less-common properties?
Yes. If you've got really huge objects (with thousands of properties), this is a good idea. Depending on the used data structure, the size of the object might influence the lookup time, so if you've got a smaller object for the more frequent properties it should be faster. It's possible that different structures are chosen for the two objects, which could perform better than the single one - especially if you know beforehand in which object to look. However you will need to test this hypothesis with your actual data, and you should beware of premature [micro-]optimisation.
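A minimal sketch of that two-object split (helper names are made up):

const hot = Object.create(null);  // the ~5% most frequently needed entries
const cold = Object.create(null); // everything else
function lookup(key) {
  const v = hot[key];
  return v !== undefined ? v : cold[key]; // assumes stored values are never undefined
}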

Why aren't strings mutable? [duplicate]

Possible Duplicate:
Why can't strings be mutable in Java and .NET?
Why .NET String is immutable?
Several languages have chosen this, such as C#, Java, and Python. If it is intended to save memory or gain efficiency for operations like comparison, what effect does it have on concatenation and other modifying operations?
Immutable types are a good thing generally:
They work better for concurrency (you don't need to lock something that can't change!)
They reduce errors: mutable objects are vulnerable to being changed when you don't expect it which can introduce all kinds of strange bugs ("action at a distance")
They can be safely shared (i.e. multiple references to the same object) which can reduce memory consumption and improve cache utilisation.
Sharing also makes copying a very cheap O(1) operation when it would be O(n) if you have to take a defensive copy of a mutable object. This is a big deal because copying is an incredibly common operation (e.g. whenever you want to pass parameters around....)
As a result, it's a pretty reasonable language design choice to make strings immutable.
Some languages (particularly functional languages like Haskell and Clojure) go even further and make pretty much everything immutable. This enlightening video is very much worth a look if you are interested in the benefits of immutability.
There are a couple of minor downsides for immutable types:
Operations that create a changed string like concatenation are more expensive because you need to construct new objects. Typically the cost is O(n+m) for concatenating two immutable Strings, though it can go as low as O(log (m+n)) if you use a tree-based string data structure like a Rope. Plus you can always use special tools like Java's StringBuilder if you really need to concatenate Strings efficiently.
A small change on a large string can result in the need to construct a completely new copy of the large String, which obviously increases memory consumption. Note however that this isn't usually a big issue in garbage-collected languages since the old copy will get garbage collected pretty quickly if you don't keep a reference to it.
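The same trade-off shows up in JavaScript with repeated concatenation; a hedged sketch of the usual mitigation (note that modern engines often optimize += with rope-like internals, so measure before assuming):

let slow = "";
for (let i = 0; i < 10000; i++) {
  slow += i + ","; // each += may build a brand-new string
}
const parts = [];
for (let i = 0; i < 10000; i++) {
  parts.push(i + ","); // collect the pieces...
}
const fast = parts.join(""); // ...and concatenate once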
Overall though, the advantages of immutability vastly outweigh the minor disadvantages. Even if you are only interested in performance, the concurrency advantages and cheapness of copying will in general make immutable strings much more performant than mutable ones with locking and defensive copying.
It's mainly intended to prevent programming errors. For example, Strings are frequently used as keys in hashtables. If they could change, the hashtable would become corrupted. And that's just one example where having a piece of data change while you're using it causes problems. Security is another: if you're checking whether a user is allowed to access a file at a given path before executing the operation they requested, the string containing the path had better not be mutable...
It becomes even more important when you're doing multithreading. Immutable data can be safely passed around between threads while mutable data causes endless headaches.
Basically, immutable data makes the code that works on it easier to reason about. Which is why purely functional languages try to keep everything immutable.
In Java, not only String but all the primitive wrapper classes (Integer, Double, Character, etc.) are immutable. I am not sure of the exact reason, but I think these are the basic data types on which all programming schemes operate; if they could change, things could go wild. To be more specific, I'll use an example: say you have opened a socket connection to a remote host. The host name would be a String and the port an Integer. What if those values were modified after the connection was established?
As far as performance is concerned, Java interns string literals in a string constant pool. The pool is indexed, and if you use the literal "String" twice, both references point to the same pooled object.
Having strings immutable also makes obtaining new string references cheap, as the same or similar strings will be readily available from the pool of previously created Strings, reducing the cost of new object creation.

Does assigning a new string value create garbage that needs collecting?

Consider this javascript code:
var s = "Some string";
s = "More string";
Will the garbage collector (GC) have work to do after this sort of operation?
(I'm wondering whether I should worry about assigning string literals when trying to minimize GC pauses.)
Edit: I'm slightly amused that, although I stated explicitly in my question that I need to minimize GC, everyone assumed I'm wrong about that. If one really must know the particular details: I've got a game in javascript - it runs fine in Chrome, but in Firefox it has semi-frequent pauses that seem to be due to GC. (I've even checked with the MemChaser extension for Firefox, and the pauses coincide exactly with garbage collection.)
Yes, strings need to be garbage-collected, just like any other type of dynamically allocated object. And yes, this is a valid concern as careless allocation of objects inside busy loops can definitely cause performance issues.
However, string values are immutable (unchangeable), and most modern JavaScript implementations use "string interning", that is, they store only one instance of each unique string value. This means that if you have something like this...
var s1 = "abc",
s2 = "abc";
...only one instance of "abc" will be allocated. This only applies to string values, not String objects.
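For example:

var s1 = "abc",
    s2 = "abc";
console.log(s1 === s2); // true - equal primitive values (one interned instance)
var o1 = new String("abc"),
    o2 = new String("abc");
console.log(o1 === o2); // false - two distinct String objects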
A couple of things to keep in mind:
Functions like substring, slice, etc. will allocate a new object for each function call (if called with different parameters).
Even though both variables point to the same data in memory, there are still two variables to process when the GC cycle runs. Having too many local variables can also hurt you, as each of them will need to be processed by the GC, adding overhead.
Some further reading on writing high-performance JavaScript:
https://developer.mozilla.org/en-US/docs/JavaScript/Memory_Management
https://www.scirra.com/blog/76/how-to-write-low-garbage-real-time-javascript
http://jonraasch.com/blog/10-javascript-performance-boosting-tips-from-nicholas-zakas
Yes, but unless you are doing this in a loop millions of times it won't likely be a factor for you to worry about.
As you already noticed, JavaScript is not JavaScript. It runs on different platforms and thus will have different performance characteristics.
So the definite answer to the question "Will the GC have work to do after this sort of operation?" is: maybe. If the script is as short as you've shown it, then a JIT-Compiler might well drop the first string completely. But there's no rule in the language definition that says it has to be that way or the other way. So in the end it's like it is all too often in JavaScript: you have to try it.
The more interesting question might also be: how can you avoid garbage collection. And that is try to minimize the allocation of new objects. Games typically have a pretty constant amount of objects and often there won't be new objects until an old one gets unused. For strings this might be harder as they are immutable in JS. So try to replace strings with other (mutable) representations where possible.
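One hedged example of that idea for, say, a game's score display (names are illustrative):

const scoreCache = [];
function scoreLabel(score) {
  let s = scoreCache[score];
  if (s === undefined) {
    s = "Score: " + score; // allocated once per distinct value
    scoreCache[score] = s;
  }
  return s; // repeated frames with the same score reuse one string
}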
Yes, the garbage collector will have a string object containing "Some string" to get rid of. And, in answer to your question, that string assignment will make work for the GC.
Because strings are immutable and are used a lot, the JS engine has a pretty efficient way of dealing with them. You should not notice any pauses from garbage collecting a few strings. The garbage collector has work to do all the time in the normal course of javascript programming. That's how it's supposed to work.
If you are observing pauses from GC, I rather doubt it's from a few strings. There is more likely a much bigger issue going on. Either you have thousands of objects needing GC or some very complicated task for the GC. We couldn't really speculate on that without study of the overall code.
This should not be a concern unless you are doing some enormous loop and dealing with tens of thousands of objects. In that case, one might want to program a little more carefully to minimize the number of intermediate objects that are created. But, absent that level of objects, you should first write clear, reliable code and then optimize for performance only when something has shown you that there is a performance issue to worry about.
To answer your question "I'm wondering whether I should worry about assigning string literals when trying to minimize GC pauses": No.
You really don't need to worry about this sort of thing with regard to garbage collection.
GC is only a concern when creating & destroying huge numbers of Javascript objects, or large numbers of DOM elements.
