Is there a performance penalty using capture groups in RegExp#test? - javascript

Disclaimer: my question is not focused on the exercise, it's just an example (although if you have any interesting tips on the example itself, feel free to share!).
Say I'm working with parsing some strings with Regex in JavaScript, and my main focus is performance (speed).
I have a piece of regex which checks for a numeric string, and then parses it using Number if it's numeric:
if (/^\[[0-9]+]$/.test(str)) {
val = Number(str.match(/^\[([0-9]+)$/)[1]);
}
Note how the conditional test does not have a capture group around the digits. This leads to writing out basically the same regex twice, except with a capture group the second time.
What I would like to know is this; does adding a capture group to a regex used alongside test() in a condition affect performance in any way? I'd like to simply use the capture regex in both places, as long as there is no performance hit.
And to the question as why I'm doing test() then match() rather than match() and checking null; I want to keep parsing as fast as possible when there's a miss, but it's ok to be a little slower when there's a hit.
If it's not clear from the above, I'm referring to JavaScript's regex engine - although if this differs across engines it'd be nice to know too. I'm working specifically in Node.js here, should it also differ across JS engines.
Thanks in advance!

Doing 2 regexps - that are very similar in scope - will almost always be slower than doing a single one because regexps are greedy (that means that they will try to match as much as they can, usually meaning take the maximum amount of time possible).
What you're asking is basically: is the cost of fewer memory in the worst case scenario (aka using the .test to save on memory from capture) faster than just using the extra memory? The answer is no, using extra memory speeds up your process.
Don't take my word for it though, here's a jsperf: http://jsperf.com/regex-perf-numbers

Related

How can I programmatically identify evil regexes?

Is there an algorithm to determine whether a given JavaScript regex is vulnerable to ReDoS? The algorithm doesn't have to be perfect - some false positives and false negatives are acceptable. (I'm specifically interested in ECMA-262 regexes.)
It is hard to verify whether a regexp is evil or not without actually running it. You could try detecting some of the patterns detailed in the Wiki and generalise them:
e.g. For
(a+)+
([a-zA-Z]+)*
(a|aa)+
(a|a?)+
(.*a){x} for x > 10
You could check for )+ or )* or ){ sequences and validate against them. However, I guarantee that an attacker will find their way round them.
In essence it is a minefield to allow user set regexps. However, if you can timeout the regexp search, terminate the thread and then mark that regexp as "bad" you can mitigate the threat somewhat. In the case that the regexp is used later, maybe you could validate it by running it against an expected input at point of entry?
Later you will still need to be able to terminate it if the text evaluated at the later stage has a different effect with your regexp and mark it as bad so it will not be used again without user intervention.
TL;DR sort of, but not fully
In [9]: re.compile("(a+)+", re.DEBUG)
max_repeat 1 4294967295
subpattern 1
max_repeat 1 4294967295
literal 97
Note those nested repeat 1..N, for large N, that's bad.
This takes care of all Wikipedia examples, except (a|aa)+and a*b?a*x.
Likewise it's hard to account for back-references, if your engine supports those.
IMO evil regexp is combination of two factors: combinatorial explosion and oversight in engine implementation. Thus, worst case also depends on regexp engine and sometimes flags. Backtracking is not always easy to identify.
Simple cases, however, can be identified.

Is there a way to match only top level parentheses with regex?

With Javascript, suppose I have a string like (1)(((2)(3))4), can I get a regex to match just (1) and (((2)(3))4), or do I need to do something more complicated?
Ideally the regex would return ["((2)(3))","4"] if you searched ((2)(3))4. Actually that's really a requirement. The point is to group things into the chunks that need to be worked on first, like the way parentheses work in math.
No, there is no way to match only top level parentheses with regex
Looking only at the top level doesn't make the problem easier than general "parsing" of recursive structures. (See this relevant popular SO question with a great answer).
Here's a simple intuitive reason why Regex can't parse arbitrary levels of nesting:
To keep track of the level of nesting, one must count. If one wants to be able to keep track of an arbitrary level of nesting, one needs an arbitrarily large number while running the program.
But regular expressions are exactly those that can be implemented by DFAs, that is Deterministice finite automatons. These have only a finite number of states. Thus they can't keep track of an arbitrarily large number.
This argument works also for your specific concern of being only interested in the top level parentheses.
To recognize the top level parentheses, you must keep track of arbitrary nesting preceding any one of them:
((((..arbitrarily deep nesting...))))((.....)).......()......
^toplevel ^^ ^ ^^
So yes, you need something more powerful than regex.
While if you are very pragmatic, for your concrete application it might be okay to say that you won't encounter any nesting deeper than, say, 1000 (and so you might be willing to go with regex), it's also a very practical fact that any regex recognizing a nesting level of more than 2 is basically unreadable.
Well, here is one way to do it. As Jo So pointed out, you can't really do it in javascript with indefinite amounts of recursion, but you can make something arbitrarily recursive pretty easily. I'm not sure how the performance scales though.
First I figured out that you need recursion. Then I realized that you can just make your regex 'recursive' by just copying and pasting recursively, like so (using curly braces for clarity):
Starting regex
Finds stuff in brackets that isn't itself brackets.
/{([^{}])*}/g
Then copy and paste the whole regex inside itself! (I spaced it out so you can see where it was pasted in.) So now it is basically like a( x | a( x )b )b
/{([^{}] | {([^{}])*} )*}/g
That will get you one level of recursion and you can continue ad nauseum in this fashion and actually double the amount of recursions each time:
//matches {4{3{2{1}}}}
/{([^{}]|{([^{}]|{([^{}]|{([^{}])*})*})*})*}/g
//matches {8{7{6{5{4{3{2{1}}}}}}}}
/{([^{}]|{([^{}]|{([^{}]|{([^{}]|{([^{}]|{([^{}]|{([^{}]|{([^{}])*})*})*})*})*})*})*})*}/g
Finally I just add |[^{}]+ on the end of the expression to match stuff that is completely outside of brackets. Crazy, but it works for my needs. I feel like there is probably some clever way to combine this concept with a recursive function in order to get a truly recursive matcher, but I can't think of it now.
If you can be sure that the parentheses are balanced (I'm sure there are other resources out there that can answer that question for you if required) and if by "top-level" you're happy to find local as well as global maxima then all you need to do is find any content that starts with an open bracket and closes with a close-bracket, with no intermediate open-bracket between the two:
I think the following should do that for you and helpfully group any "top-level" content:
\(([^\(]*?)\)
That content may not all be at the same "level", but if you think of the nested brackets as describing the branching of a tree, the regex will return to you the leaves. If you pre-process your text to be wrapped in parentheses to start with, and the earlier assumptions are met, you can guarantee always getting at least one "leaf".

regex for matching finite-depth nested strings -- slow, crashy behavior

I was writing some regexes in my text editor (Sublime) today in an attempt to quickly find specific segments of source code, and it required getting a little creative because sometimes the function call might contain more function calls. For example I was looking for jQuery selectors:
$("div[class='should_be_using_dot_notation']");
$(escapeJQSelector("[name='crazy{"+getName(object)+"}']"));
I don't consider it unreasonable to expect one of my favorite powertools (regex) to help me do this sort of searching, but it's clear that the expression required to parse the second bit of code there will be somewhat complex as there are two levels of nested parens.
I am sufficiently versed in the theory to know that this sort of parsing is exactly what a context-free grammar parser is for, and that building out a regex is likely to suck up more memory and time (perhaps in an exponential rather than O(n^3) fashion). However I am not expecting to see that sort of feature available in my text editor or web browser any time soon, and I just wanted to squeak by with a big nasty regex.
Starting from this (This matches zero levels of nested parens, and no trivial empty ones):
\$\([^)(]+?\)
Here's what the one-level nested parens one I came up with looks like:
\$\(((\([^)(]*\))|[^)(])+?\)
Breaking it down:
\$\( begin text
( groups the contents of the $() call
(\( groups a level 1 nested pair of parens
[^)(]* only accept a valid pair of parens (it shall contain anything but parens)
\)) close level 1 nesting
| contents also can be
[^)(] anything else that also is not made of parens
)+? not sure if this should be plus or star or if can be greedy (the contents are made up of either a level 1 paren group or any other character)
\) end
This worked great! But I need one more level of nesting.
I started typing up the two-level nested expression in my editor and it began to pause for 2-3 seconds at a time when I put in *'s.
So I gave up on that and moved to regextester.com, and before very long at all, the entire browser tab was frozen.
My question is two-fold.
What's a good way of constructing an arbitrary-level regex? Is this something that only human pattern-recognition can ever hope to achieve? It seems to me that I can get a good deal of intuition for how to go about making the regex capable of matching two levels of nesting based on the similarities between the first two. I think this could just be distilled down into a few "guidelines".
Why does regex parsing on non-enormous regexes block or freeze for so long?
I understand the O(n) linear time is for n where n is length of input to run the regex over (i.e. my test strings). But in a system where it recompiles the regex each time I type a new character into it, what would cause it to freeze up? Is this necessarily a bug in the regex code (I hope not, I thought the Javascript regex impl was pretty solid)? Part of my reasoning moving to a different regex tester from my editor was that I'd no longer be running it (on each keypress) over all ~2000 lines of source code, but it did not prevent the whole environment from locking up as I edited my regex. It would make sense if each character changed in the regex would correspond to some simple transformation in the DFA that represents that expression. But this appears not to be the case. If there are certain exponential time or space consequences to adding a star in a regex, it could explain this super-slow-to-update behavior.
Meanwhile I'll just go work out the next higher nested regexes by hand and copy them in to the fields once i'm ready to test them...
Um. Okay, so nobody wants to write the answer, but basically the answer here is
Backtracking
It can cause exponential runtime when you do certain non-greedy things.
The answer to the first part of my question:
The two-nested expression is as follows:
\$\(((\(((\([^)(]*\))|[^)(])*\))|[^)(])*\)
The transformation to make the next nested expression is to replace instances of [^)(]* with ((\([^)(]*\))|[^)(])*, or, as a meta-regex (where the replace-with section does not need escaping):
s/\[^\)\(\]\*/((\([^)(]*\))|[^)(])*/
This is conceptually straightforward: In the expression matching N levels of nesting, if we replace the part that forbids more nesting with something that matches one more level of nesting then we get the expression for N+1 levels of nesting!
To match an arbitrary number of nested (), with only one pair on each level of nesting, you could use the following, changing 2 to whatever number of nested () you require
/(?:\([^)(]*){2}(?:[^)(]*\)){2}/
To avoid excessive backtracking you want to avoid using nested quantifiers, particularly when the sub-pattern on both sides of an inner alternation is capable of matching the same substring.

tunable diff algorithm

I'm interested in finding a more-sophisticated-than-typical algorithm for finding differences between strings, that can be "tuned" via some parameters, to balance between such things as "maximize count of identical characters" vs. "maximize the length of spans" vs. "try to keep whole words intact".
Ultimately, I want to be able to make the results as human readable as possible. For instance, if a long sentence has been replaced with an entirely new sentence, where the only things it has in common with the original are the words "the" "and" and "a" in that order, I might want it treated as if the whole sentence is changed, rather than just that 4 particular spans are changed --- just like how a reasonable person would see it.
Does such a thing exist? Although I'm working in javascript/node.js, an algorithm in any language would be helpful.
I'm actually ok with something that uses Monte Carlo methods or the like, if its results are better. Computation time is not an issue (within reason), nor is determinism.
Note: although this is beyond the scope of what I'm asking, I'll throw one more thing out there just in case: It would also be great if it could recognize changes that are out of order....for instance if someone changes the order of two paragraphs while leaving them otherwise identical, it would be awesome if it recognized it as a simple move, rather than as one subtraction and and one unrelated addition.
I've had good luck with diff_match_patch. There are some good options for tuning it for readability.
Try http://prettydiff.com/ Its code is already formatted for compatibility with CommonJS, which is the framework Node uses.

What's the difference between .substr(0,1) or .charAt(0)?

We were wondering in this thread if there was a real difference between the use of .substr(0,1) and the use of .charAt(0) when you want to get the first character (actually, it could apply to any case where you wan only one char).
Is any of each faster than the other?
Measuring it is the key!
Go to http://jsperf.com/substr-or-charat to benchmark it yourself.
substr(0,1) runs at 21,100,301 operations per second on my machine, charAt(0) runs 550,852,974 times per second.
I suspect that charAt accesses the string as an array internally, rather than splitting the string.
As found in the comments, accessing the char directly using string[0] is slightly faster than using charAt(0).
Unless your whole script is based on the need for doing fast string manipulation, I wouldn't worry about the performance aspect at all. I'd use charAt() on the grounds that it's readable and the most specific tool for the job provided by the language. Also, substr() is not strictly standard, and while it's very unlikely any new ECMAScript implementation would omit it, it could happen. The standards-based alternatives to str.charAt(0) are str.substring(0, 1) and str.slice(0, 1), and for ECMAScript 5 implementations, str[0].

Categories

Resources