What's the difference between .substr(0,1) or .charAt(0)? - javascript

We were wondering in this thread if there was a real difference between the use of .substr(0,1) and the use of .charAt(0) when you want to get the first character (actually, it could apply to any case where you want only one char).
Is either one faster than the other?

Measuring it is the key!
Go to http://jsperf.com/substr-or-charat to benchmark it yourself.
substr(0,1) runs at 21,100,301 operations per second on my machine, charAt(0) runs 550,852,974 times per second.
I suspect that charAt accesses the string as an array internally, rather than splitting the string.
As found in the comments, accessing the char directly using string[0] is slightly faster than using charAt(0).

Unless your whole script is based on the need for doing fast string manipulation, I wouldn't worry about the performance aspect at all. I'd use charAt() on the grounds that it's readable and the most specific tool for the job provided by the language. Also, substr() is not strictly standard, and while it's very unlikely any new ECMAScript implementation would omit it, it could happen. The standards-based alternatives to str.charAt(0) are str.substring(0, 1) and str.slice(0, 1), and for ECMAScript 5 implementations, str[0].
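For reference, here is how the alternatives mentioned above behave; this is just an illustrative snippet, and note that for an empty string charAt/substring/slice return "" while str[0] returns undefined:

var str = "hello";
str.charAt(0);       // "h"
str.substring(0, 1); // "h"
str.slice(0, 1);     // "h"
str[0];              // "h" (ECMAScript 5 bracket access)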

Related

Complexity of Array methods

In a team project we need to remove the first element of an array, thus I called Array.prototype.shift(). Now a guy saw Why is pop faster than shift?, thus suggested to first reverse the array, pop, and reverse again, with Array.prototype.pop() and Array.prototype.reverse().
Intuitively this will be slower (?), since my approach takes O(n) I think, while the other needs O(n) plus O(n). Of course in asymptotic notation this is the same. However, notice the verb I used: think!
Of course I could write some code, put it on jsPerf and benchmark it, but this takes time (in contrast with deciding via time complexity alone, e.g. an O(n³) vs. an O(n) algorithm).
However, convincing someone with my opinion alone is much harder than pointing him to the Standard (if it referred to complexity).
So how to find the Time Complexity of these methods?
For example in C++ std::reverse() clearly states:
Complexity
Linear in half the distance between first and last: Swaps elements.
how to find the Time Complexity of these methods?
You cannot find them in the Standard.
ECMAScript is a Standard for scripting languages. JavaScript is such a language.
The ECMA specification does not specify a bounding complexity. Every JavaScript engine is free to implement its own functionality, as long as it is compatible with the Standard.
As a result, you have to benchmark with jsPerf, or even look at the source code of a specific JavaScript engine, if you would like.
Or, as robertklep's comment mentioned:
"The Standard doesn't mandate how these methods should be implemented. Also, JS is an interpreted language, so things like JIT and GC may start playing a role depending on array sizes and how often the code is called. In other words: benchmarking is probably your only option to get an idea on how different JS engines perform".
There is further evidence for this claim ([0], [1], [2]).
An indisputable method is to do a comparative benchmark (provided you do it correctly and the other party is not arguing in bad faith). Don't worry about theoretical complexities.
Otherwise you will have a hard time convincing someone who doesn't see the obvious: a shift moves every element once, while two reversals move it twice.
By the way, a shift is optimal, as every element has to move at least once. And if properly implemented as a memmove, it is very fast (while a single reversal cannot be as fast).
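For those who do want to measure, a rough sketch like the following can be run in Node or a browser console (the array size and round count are arbitrary, and the slice() copy cost is identical for both candidates, so any difference beyond it comes from the removal strategy):

function viaShift(arr) {
  arr.shift();
}

function viaReversePopReverse(arr) {
  arr.reverse();
  arr.pop();
  arr.reverse();
}

const template = Array.from({ length: 10000 }, (_, i) => i);

for (const fn of [viaShift, viaReversePopReverse]) {
  console.time(fn.name);
  for (let r = 0; r < 1000; r++) fn(template.slice());
  console.timeEnd(fn.name);
}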

String Comparison vs. Hashing

I recently learned about the rolling hash data structure, and one of its prime uses is searching for a substring within a string. Here are some advantages that I noticed:
Comparing two strings can be expensive so this should be avoided if possible
Hashing the strings and comparing the hashes is generally much faster than comparing the strings themselves; however, rehashing each new substring traditionally takes linear time
A rolling hash is able to rehash the new substring in constant time (see the sketch below), making it much quicker and more efficient for this task
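For reference, a minimal sketch of that constant-time update (a Rabin-Karp style polynomial hash; the base, modulus, and names here are illustrative, not the asker's actual implementation):

const BASE = 256;
const MOD = 1000000007; // a large prime keeps collisions rare and values in safe-integer range

// Hash a whole window once, in O(window length).
function hashOf(str) {
  let h = 0;
  for (let i = 0; i < str.length; i++) {
    h = (h * BASE + str.charCodeAt(i)) % MOD;
  }
  return h;
}

// Slide the window one character to the right in O(1):
// drop outChar from the left, append inChar on the right.
// power must be BASE^(windowLength - 1) % MOD, precomputed once.
function roll(hash, outChar, inChar, power) {
  hash = (hash - outChar.charCodeAt(0) * power) % MOD;
  if (hash < 0) hash += MOD;
  return (hash * BASE + inChar.charCodeAt(0)) % MOD;
}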
I went ahead and implemented a rolling hash in JavaScript and began to analyze the speed between a rolling hash, traditional rehashing, and just comparing the substrings against each other.
In my findings, the larger the substring, the longer it took for the traditional rehashing approach to run (as expected), while the rolling hash ran incredibly fast (as expected). However, comparing the substrings directly ran much faster than the rolling hash. How could this be?
For the sake of perspective, let's say the running times for the functions searching through a ~2.4 million character string for a 100 character substring were the following:
Rolling Hash - 0.809 seconds
Traditional Rehashing - 71.009 seconds
Just comparing the strings (no hashing) - 0.089 seconds
How could the string comparing be so much faster than the rolling hash? Could it just have something to do with JavaScript in particular? Strings are a primitive type in JavaScript; would this cause string comparisons to run in constant time?
My main confusion is as to how/why string comparisons are so fast in JavaScript, when I was under the impression that they were supposed to be relatively slow.
Note:
By string comparisons I'm referring to something like stringA === stringB
Note:
I asked this question over on the Computer Science Community and was informed that I should ask it here as well because this is most likely JavaScript specific.
After some testing and analysis, I've come to the conclusion that there are a few reasons why my rolling hash approach was running slower than simply comparing the two strings.
If the rolling hash claims to run in constant time, how can it be slower than comparing strings?
Functions are relatively slow - calling a function is slightly slower than simply executing code inline. In my particular case, a function had to be called on my object every time the rolling hash rehashes its internal window, therefore taking slightly longer to run compared to the string comparison, since that code was simply inline. Especially since my benchmark has the rolling hash "shift" over 2 million times, this function-call slowdown shows up clearly.
But why is the string comparison so fast?
Strings are primitive - Basically, because strings are a primitive type in JavaScript, attempting to compare two strings will most likely invoke a routine coded directly within the interpreter. This low-level evaluation can be done as fast as the architecture possibly allows (similar to comparing numbers).
In Conclusion
Comparing strings in JavaScript will end up being faster than a rolling hash in this scenario because the strings are primitive, therefore allowing the interpreter to work with these elements very quickly, and because simply calling functions will create a slight overhead and slow down the process on a very small scale.
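If you want to sanity-check the function-call part of this conclusion yourself, a rough micro-benchmark along these lines is easy to run (the update step is illustrative, and modern engines may inline the hot function, so the gap can be small or absent):

function update(h, c) { return ((h * 31) + c) | 0; } // illustrative hash step

console.time('inline');
let h1 = 0;
for (let i = 0; i < 2400000; i++) h1 = ((h1 * 31) + (i & 0xff)) | 0;
console.timeEnd('inline');

console.time('function call');
let h2 = 0;
for (let i = 0; i < 2400000; i++) h2 = update(h2, i & 0xff);
console.timeEnd('function call');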

How can I programmatically identify evil regexes?

Is there an algorithm to determine whether a given JavaScript regex is vulnerable to ReDoS? The algorithm doesn't have to be perfect - some false positives and false negatives are acceptable. (I'm specifically interested in ECMA-262 regexes.)
It is hard to verify whether a regexp is evil or not without actually running it. You could try detecting some of the patterns detailed in the Wiki and generalise them:
e.g. For
(a+)+
([a-zA-Z]+)*
(a|aa)+
(a|a?)+
(.*a){x} for x > 10
You could check for )+ or )* or ){ sequences and validate against them. However, I guarantee that an attacker will find a way around them.
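A rough heuristic along those lines might look like the sketch below; it flags a repeated group that itself contains a quantifier, or overlapping alternation inside a repeated group, and it will produce both false positives and false negatives, as warned above:

function looksEvil(source) {
  // nested quantifiers, e.g. (a+)+, ([a-zA-Z]+)* or (.*a){12}
  if (/\([^)]*[+*][^)]*\)\s*[+*{]/.test(source)) return true;
  // repeated alternation with overlapping branches, e.g. (a|aa)+ or (a|a?)+
  if (/\((\w)\|\1\??\w*\)[+*{]/.test(source)) return true;
  return false;
}

looksEvil('(a+)+');        // true
looksEvil('([a-zA-Z]+)*'); // true
looksEvil('^[0-9]+$');     // false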
In essence, allowing user-set regexps is a minefield. However, if you can time out the regexp search, terminate the thread and then mark that regexp as "bad", you can mitigate the threat somewhat. If the regexp is to be used later, you could also validate it by running it against an expected input at the point of entry.
You will still need to be able to terminate it later if the text evaluated at that stage interacts differently with your regexp, and to mark it as bad so it is not used again without user intervention.
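In Node.js, one way to sketch that timeout-and-terminate mitigation is to run the match in a worker thread and kill it if it does not answer in time (the names, the 200 ms budget, and the whole wrapper are illustrative, not a hardened solution):

const { Worker } = require('worker_threads');

function testWithTimeout(pattern, input, ms = 200) {
  return new Promise((resolve) => {
    const worker = new Worker(
      `const { parentPort, workerData } = require('worker_threads');
       parentPort.postMessage(new RegExp(workerData.pattern).test(workerData.input));`,
      { eval: true, workerData: { pattern, input } }
    );
    const timer = setTimeout(() => {
      worker.terminate();             // runaway backtracking: kill the thread
      resolve({ timedOut: true });    // caller should mark this regexp as "bad"
    }, ms);
    worker.once('message', (matched) => {
      clearTimeout(timer);
      worker.terminate();
      resolve({ timedOut: false, matched });
    });
  });
}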
TL;DR sort of, but not fully
In [9]: re.compile("(a+)+", re.DEBUG)
max_repeat 1 4294967295
  subpattern 1
    max_repeat 1 4294967295
      literal 97
Note those nested repeats 1..N; for large N, that's bad.
This takes care of all the Wikipedia examples, except (a|aa)+ and a*b?a*x.
Likewise it's hard to account for back-references, if your engine supports those.
IMO an evil regexp is a combination of two factors: combinatorial explosion and an oversight in the engine implementation. Thus, the worst case also depends on the regexp engine and sometimes on flags. Backtracking is not always easy to identify.
Simple cases, however, can be identified.

Is there a performance penalty using capture groups in RegExp#test?

Disclaimer: my question is not focused on the exercise, it's just an example (although if you have any interesting tips on the example itself, feel free to share!).
Say I'm working with parsing some strings with Regex in JavaScript, and my main focus is performance (speed).
I have a piece of regex which checks for a numeric string, and then parses it using Number if it's numeric:
if (/^\[[0-9]+]$/.test(str)) {
    val = Number(str.match(/^\[([0-9]+)]$/)[1]);
}
Note how the conditional test does not have a capture group around the digits. This leads to writing out basically the same regex twice, except with a capture group the second time.
What I would like to know is this; does adding a capture group to a regex used alongside test() in a condition affect performance in any way? I'd like to simply use the capture regex in both places, as long as there is no performance hit.
And as to why I'm doing test() then match() rather than match() and checking for null: I want to keep parsing as fast as possible when there's a miss, but it's OK to be a little slower when there's a hit.
If it's not clear from the above, I'm referring to JavaScript's regex engine - although if this differs across engines it'd be nice to know too. I'm working specifically in Node.js here, should it also differ across JS engines.
Thanks in advance!
Running 2 regexes that are very similar in scope will almost always be slower than running a single one, because regexes are greedy (meaning they will try to match as much as they can, which usually means taking close to the maximum amount of time possible).
What you're asking is basically: does saving a little memory in the worst case (i.e. using .test to avoid the cost of the capture) work out faster than simply spending the extra memory on the capture? The answer is no; using the extra memory speeds up your process.
Don't take my word for it though, here's a jsperf: http://jsperf.com/regex-perf-numbers
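For comparison, a single-regex version of the snippet from the question does the check and the capture in one pass (whether this wins on a miss is exactly what the jsperf above measures):

var m = str.match(/^\[([0-9]+)]$/);
if (m) {
    val = Number(m[1]);
}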

Recursive Descent Parser for something simple?

I'm writing a parser for a templating language which compiles into JS (if that's relevant). I started out with a few simple regexes, which seemed to work, but regexes are very fragile, so I decided to write a parser instead. I started by writing a simple parser that remembered state by pushing/popping off of a stack, but things kept escalating until I had a recursive descent parser on my hands.
Soon after, I compared the performance of all my previous parsing methods. The recursive descent parser was by far the slowest. I'm stuck: Is it worth using a recursive descent parser for something simple, or am I justified in taking shortcuts? I would love to go the pure regex route, which is insanely fast (almost 3 times faster than the RD parser), but is very hacky and unmaintainable to a degree. I suppose performance isn't terribly important because compiled templates are cached, but is a recursive descent parser the right tool for every task? I guess my question could be viewed as more of a philosophical one: to what degree is it worth sacrificing maintainability/flexibility for performance?
Recursive descent parsers can be extremely fast.
These are usually organized with a lexer that uses regular expressions to recognize language tokens, which are fed to the parser. Most of the work in processing the source text is done character-by-character by the lexer, using the insanely fast FSAs that the REs are often compiled into.
The parser only sees tokens occasionally compared to the rate at which the lexer sees characters, so its speed often doesn't matter. However, when comparing parser-to-parser speeds, ignoring the time required to lex the tokens, recursive descent parsers can be very fast because they implement the parser stack using function calls, which are already very efficient compared to a general parser's push-current-state-on-a-simulated-stack approach.
So, you can have your cake and eat it, too. Use regexps for the lexemes. Use the parser (any kind; recursive descent is just fine) to process lexemes. You should be pleased with the performance.
This approach also satisfies the observation made by other answers: write it in a way that makes it maintainable. Lexer/parser separation does this very nicely, I assure you.
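As a toy illustration of that split (a regex lexer feeding a recursive descent parser for simple arithmetic; this is not the asker's templating grammar):

function lex(src) {
  const re = /\s*(\d+|[+*()])/y;   // sticky flag: scan token by token
  const tokens = [];
  let m;
  while (re.lastIndex < src.length && (m = re.exec(src))) tokens.push(m[1]);
  return tokens;
}

function parse(tokens) {
  let pos = 0;
  const peek = () => tokens[pos];
  const next = () => tokens[pos++];

  function expr() {                // expr := term ('+' term)*
    let value = term();
    while (peek() === '+') { next(); value += term(); }
    return value;
  }
  function term() {                // term := factor ('*' factor)*
    let value = factor();
    while (peek() === '*') { next(); value *= factor(); }
    return value;
  }
  function factor() {              // factor := number | '(' expr ')'
    if (peek() === '(') { next(); const v = expr(); next(); return v; }
    return Number(next());
  }
  return expr();
}

parse(lex('2 + 3 * (4 + 1)'));     // 17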
Readability first, performance later...
So if your parser makes the code more readable, then it is the right tool.
to what degree is it worth sacrificing maintainability/flexibility for performance?
I think it's very important to write clear, maintainable code as a first priority. Until your code not only shows up as a bottleneck but your application's performance actually suffers from it, you should consider clear code to be the best code.
It's also important not to reinvent the wheel. The comment about taking a look at another parser is a very good one. There are often common, ready-made solutions for writing routines such as this.
Recursion is very elegant when applied to a suitable problem. In my own experience, slow code due to recursion is the exception, not the norm.
A Recursive Descent Parser should be faster
...or you're doing something wrong.
First off, your code should be broken into 2 distinct steps. Lexer + Parser.
Some reference examples online will first tokenize the entire input into a large intermediate data structure and then pass that along to the parser. While good for demonstration, don't do this; it doubles time and memory complexity. Instead, as soon as a match is determined by the lexer, notify the parser of either a state transition or a state transition plus data.
As for the lexer: this is probably where you'll find your current bottleneck. If the lexer is cleanly separated from your parser, you can try swapping between Regex and non-Regex implementations to compare performance.
Regex isn't, by any means, faster than reading raw strings. It just avoids some common mistakes by default. Specifically, the unnecessary creation of string objects. Ideally, your lexer should scan your code and produce an output with zero intermediate data except the bare minimum required to track state within your parser. Memory-wise you should have:
Raw input (ie source)
Parser state (ex isExpression, isStatement, row, col)
Data (Ex AST, Tree, 2D Array, etc).
For instance, if your current lexer matches a non-terminal and copies every char over one by one until it reaches the next terminal, you're essentially recreating that string for every letter matched. Keep in mind that string data types are immutable; concat will always create a new string. You should be scanning the text using pointer arithmetic or some equivalent.
To fix this problem, you need to scan from the startPos of the non-terminal to the end of the non-terminal and copy only when a match is complete.
Regex supports all of this by default out of the box, which is why it's a preferred tool for writing lexers. Instead of trying to write a Regex that parses your entire grammar, write one that only focuses on matching terminals & non-terminals as capture groups. Skip tokenization, and pass the results directly into your parser/state machine.
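A rough sketch of that idea: one regex whose alternates are the token classes, matched with the sticky flag and handed straight to a consumer callback, with no intermediate token array (the token names and the {{ }} delimiters are illustrative, not a particular templating syntax):

const TOKEN = /([A-Za-z_]\w*)|(\d+)|(\{\{|\}\})|(\s+)|(.)/y;

function scan(src, onToken) {
  let m;
  while (TOKEN.lastIndex < src.length && (m = TOKEN.exec(src))) {
    if (m[1]) onToken('ident', m[1]);
    else if (m[2]) onToken('number', m[2]);
    else if (m[3]) onToken('delim', m[3]);
    else if (m[5]) onToken('char', m[5]);
    // m[4] (whitespace) is skipped; nothing is buffered between matches
  }
  TOKEN.lastIndex = 0;             // reset so scan() can be called again
}

scan('{{ user }} has 3 items', (type, text) => console.log(type, text));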
The key here is: don't try to use Regex as a state machine. At best it will only work for Regular (i.e. Chomsky Type III, no stack) declarative syntaxes -- hence the name Regular Expression. For example, HTML is a Context-Free (i.e. Chomsky Type II, stack-based) declarative syntax, which is why a Regex alone is never enough to parse it. Your grammar, and generally all templating syntaxes, fall into this category. You've clearly hit the limit of Regex already, so you're on the right track.
Use Regex for tokenization only. If you're really concerned with performance, rewrite your lexer to eliminate any and all unnecessary string copying and/or intermediate data. See if you can outperform the Regex version.
The key being: the Regex version is easier to understand and maintain, whereas your hand-rolled lexer will likely be just a tinge faster if written correctly. Conventional wisdom says, do yourself a favor and prefer the former. In terms of Big-O complexity, there shouldn't be any difference between the two. They're two forms of the same thing.
