regex for matching finite-depth nested strings -- slow, crashy behavior - javascript

I was writing some regexes in my text editor (Sublime) today in an attempt to quickly find specific segments of source code, and it required getting a little creative because sometimes the function call might contain more function calls. For example I was looking for jQuery selectors:
$("div[class='should_be_using_dot_notation']");
$(escapeJQSelector("[name='crazy{"+getName(object)+"}']"));
I don't consider it unreasonable to expect one of my favorite powertools (regex) to help me do this sort of searching, but it's clear that the expression required to parse the second bit of code there will be somewhat complex as there are two levels of nested parens.
I am sufficiently versed in the theory to know that this sort of parsing is exactly what a context-free grammar parser is for, and that building out a regex is likely to suck up more memory and time (perhaps in an exponential rather than O(n^3) fashion). However I am not expecting to see that sort of feature available in my text editor or web browser any time soon, and I just wanted to squeak by with a big nasty regex.
Starting from this (This matches zero levels of nested parens, and no trivial empty ones):
\$\([^)(]+?\)
Here's what the one-level nested parens one I came up with looks like:
\$\(((\([^)(]*\))|[^)(])+?\)
Breaking it down:
\$\( begin text
( groups the contents of the $() call
(\( groups a level 1 nested pair of parens
[^)(]* only accept a valid pair of parens (it shall contain anything but parens)
\)) close level 1 nesting
| contents also can be
[^)(] anything else that also is not made of parens
)+? not sure if this should be plus or star or if can be greedy (the contents are made up of either a level 1 paren group or any other character)
\) end
This worked great! But I need one more level of nesting.
I started typing up the two-level nested expression in my editor and it began to pause for 2-3 seconds at a time when I put in *'s.
So I gave up on that and moved to regextester.com, and before very long at all, the entire browser tab was frozen.
My question is two-fold.
What's a good way of constructing an arbitrary-level regex? Is this something that only human pattern-recognition can ever hope to achieve? It seems to me that I can get a good deal of intuition for how to go about making the regex capable of matching two levels of nesting based on the similarities between the first two. I think this could just be distilled down into a few "guidelines".
Why does regex parsing on non-enormous regexes block or freeze for so long?
I understand the O(n) linear time is for n where n is length of input to run the regex over (i.e. my test strings). But in a system where it recompiles the regex each time I type a new character into it, what would cause it to freeze up? Is this necessarily a bug in the regex code (I hope not, I thought the Javascript regex impl was pretty solid)? Part of my reasoning moving to a different regex tester from my editor was that I'd no longer be running it (on each keypress) over all ~2000 lines of source code, but it did not prevent the whole environment from locking up as I edited my regex. It would make sense if each character changed in the regex would correspond to some simple transformation in the DFA that represents that expression. But this appears not to be the case. If there are certain exponential time or space consequences to adding a star in a regex, it could explain this super-slow-to-update behavior.
Meanwhile I'll just go work out the next higher nested regexes by hand and copy them into the fields once I'm ready to test them...

Um. Okay, so nobody wants to write the answer, but basically the answer here is
Backtracking
It can cause exponential runtime when you do certain non-greedy things.
The answer to the first part of my question:
The two-nested expression is as follows:
\$\(((\(((\([^)(]*\))|[^)(])*\))|[^)(])*\)
The transformation to make the next nested expression is to replace instances of [^)(]* with ((\([^)(]*\))|[^)(])*, or, as a meta-regex (where the replace-with section does not need escaping):
s/\[\^\)\(\]\*/((\([^)(]*\))|[^)(])*/
This is conceptually straightforward: In the expression matching N levels of nesting, if we replace the part that forbids more nesting with something that matches one more level of nesting then we get the expression for N+1 levels of nesting!
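The replacement step above can be automated. The following is a minimal sketch, not the author's code: the function name nestedCallPattern is hypothetical, it uses non-capturing groups instead of the capturing ones shown above, and it starts from a zero-level pattern that uses * so that the replacement target is present.

```javascript
// Builds a regex that matches $( ... ) containing up to `levels` levels of nested parens.
// Each loop iteration applies the transformation described above: the part that
// forbids nesting is replaced by a part that allows exactly one more level.
function nestedCallPattern(levels) {
  const flat = "[^)(]*";                      // content with no parens at all
  const oneMore = "(?:\\([^)(]*\\)|[^)(])*";  // content that may hold one nested pair
  let source = "\\$\\(" + flat + "\\)";
  for (let i = 0; i < levels; i++) {
    // A string pattern replaces only the first occurrence, and after every step
    // there is exactly one occurrence of `flat` left (the innermost one).
    source = source.replace(flat, () => oneMore);
  }
  return new RegExp(source);
}

const call = "$(escapeJQSelector(\"[name='crazy{\"+getName(object)+\"}']\"));";
console.log(nestedCallPattern(1).test(call)); // false: the call nests two levels deep
console.log(nestedCallPattern(2).test(call)); // true
```

The two alternatives inside each group start with different characters (an opening paren versus anything that is not a paren), so the engine never has two ways to consume the same text.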

To match an arbitrary number of nested (), with only one pair on each level of nesting, you could use the following, changing 2 to whatever number of nested () you require
/(?:\([^)(]*){2}(?:[^)(]*\)){2}/
To avoid excessive backtracking you want to avoid using nested quantifiers, particularly when the sub-pattern on both sides of an inner alternation is capable of matching the same substring.

Related

In a stringified array is it possible to differentiate between quotes that were in a string and those that surrounded the string itself?

Some Context:
• I'm still learning to code atm (started less than a year ago)
• I'm mostly self taught at that since I think my computer science class feels
too slow.
• The website I'm learning on is code.org, specifically in the "game lab"
• The site's coding environments only use ES5 because they don't want to
update them to ES6 or something like that
• In class we're making function libraries and while not required, I want
mine to be "highly usable," for lack of a better term, while also being
reasonably short (prefer not to automate things if I can get them done
quicker somehow, but that's just personal preference).
So now for where the actual question comes in: in a stringified array, is it possible to differentiate between a quotation mark that was inside a string and a quotation mark that actually denotes a string? Because I noticed something confusing with the output of JSON.parse(JSON.stringify()) on code.org, specifically, if you write something like,
JSON.parse(JSON.stringify(['hi","hi']))
the output will be ["hi","hi"] which looks just like an array containing two strings (on code.org it doesn't show the \'s), but still contains just one, which is fine unless you're using a regular expression to detect whether or not a match is within a string (if every quotation mark after the match has a "partner"), which is what I'm doing in 4 different functions. One flattens a list (since ES5 doesn't have Array.prototype.flat()), one removes all instances of the arguments from a list, one removes all instances of specified operand types, and one replaces all instances of an argument with the one that follows it.
Now I know the odds of a string containing an odd number of quotation marks (whether single or double) are likely extremely low, but it still bothers me that I don't have a way to differentiate between quotes formerly within a string and quotes which formerly denoted a string (in an array after it's been stringified), as these functions otherwise work exactly as intended. The regular expression I'm using to determine if there's an even number of quotes left in the stringified array is /(?=[^"]*(?:(?:"[^"]*){2})*$)/ where you put the match before the lookahead assertion and anything you absolutely want to follow before the first [^"]*.
To highlight the actual issue I'm trying to solve, this is my flatten function (since it's the shortest of the 4), and yeah, yeah, I know "eval bad" but it's extremely convenient to use here since it shortens the actual modification into a single line, and I highly doubt anyone's actually going to find a way to abuse it given its implementation ("this" needs to be an array for splice to work, so if I'm not mistaken, there isn't really a way to abuse it, but tell me if I'm wrong, since I probably am).
Array.prototype.flatten = function() {
eval(('this.splice(0,this.length,' + JSON.stringify(this).replace(/[\[\]](?=[^"]*(?:(?:"[^"]*){2})*$)/g, '') + ')').replace(/,(?=((,[^"]*(?:(?:"[^"]*){2})*)*.$))/g, ''));
return this;
};
This works really well outside of the previously specified conditions, but if I were to call it with something like [1,'"'] it'd find 3 quotation marks after the \[ and wouldn't be able to remove it but would be able to remove the \], thus when eval actually gets to .splice(), it would look like eval('this.splice(0,this.length,[1,"\"")') causing the error Unexpected token ')' to be thrown
Any help on this is appreciated, even if it's just telling me it isn't possible, thanks for reading my ramblings.
TL;DR: in a stringified array is it possible to differentiate between " and \" (string wrapping quotes of strings within a stringified array and quotes within a string within a stringified array) in a regular expression or any other method using only the tools available in ES5 (site I'm learning on doesn't want to update their project environments for whatever reason)
You are having a problem because your input is not a regular language (JSON is a context-free grammar) and cannot be correctly parsed with regular expressions.
Can you explain why JSON.parse is unacceptable? It is even in ancient browsers and versions of node.js.
Someone writing a json parser might use bison or yacc, so if this is a learning experience consider playing with jison.
I ended up finding a way to do this. For whatever reason (either I didn't notice last night because I was tired or it legitimately changed overnight, though likely the former) I can now see the \" when viewing the value of the stringified array, and lo and behold, modifying the regular expression so that it ignored instances of \" resolved the issue.
New regular expression for quotation mark pair matching now reads:
// old even number of quotation marks after match check
/(?=[^"]*(?:(?:"[^"]*){2})*$)/
// new even number of quotation marks after match check
/(?=(\\"|[^"])*(?:(?:(?<!\\)"(\\"|[^"])*){2})*$)/
// (only real difference is that it accounts for the \)
Sorry for anyone who may have misunderstood the question due to how all over the place it was, I'm aware that I tend to end up writing a lot more than is necessary and it often leads to tangents that muddle my view of what I was initially asking, which in turn makes the point I'm actually trying to get across even harder to grasp at. Thanks to those who still tried to help me regardless of how much of a mess of a first question this was.
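For comparison, the same distinction can be made without a regular expression by walking the stringified text once and remembering whether the current position is inside a string. This is only a sketch written in ES5 style; the function name delimitingQuoteIndexes is made up for illustration, and it assumes the input came from JSON.stringify.

```javascript
// Returns the indexes of the quotation marks that open or close a string in
// JSON text. Quotes preceded by an escaping backslash inside a string are skipped.
function delimitingQuoteIndexes(json) {
  var indexes = [];
  var inString = false;
  for (var i = 0; i < json.length; i++) {
    var ch = json.charAt(i);
    if (inString && ch === "\\") {
      i++; // the next character is escaped, so it cannot end the string
    } else if (ch === '"') {
      indexes.push(i);
      inString = !inString;
    }
  }
  return indexes;
}

// JSON.stringify(['hi","hi']) is the 13-character text ["hi\",\"hi"]
console.log(delimitingQuoteIndexes(JSON.stringify(['hi","hi']))); // [ 1, 11 ]
```

Only two delimiting quotes are reported, so the array is correctly seen as holding one string.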

JS lexing---multi line string

I am making a JS lexer as part of my study. In JS, single-line strings start with " or ' and end with the same character, unless that character is preceded by a backslash.
In my current code, I loop through every character and append them to existing tokens based on flags like "string" or "regex". So it feels natural to implement multi-line strings with " or ', because it seems that it would not affect any other part of my lexer.
Is there any practical reason why new line is not allowed as contents of strings?
Many languages, but not all, prohibit unescaped newlines in string literals. So JavaScript is certainly not unique here.
But the motivation really has little to do with the ease, difficulty or efficiency of lexical analysis. In fact, for lexical analysis the simplest syntax is to allow any character rather than having to include special-case checks. [Note 1]
There are other considerations, though; notably, the importance of a program to be readable and easy to debug. Long strings put an extra load on someone reading the code, because they may not be aware that a section of program text is actually part of a string literal. (There's a similar problem with multiline comments, which is why it's usually considered good style to mark every line in a long comment in some way, for example with a vertical column of stars at the left-hand margin. No such solution exists for string literals, though.)
Also, unterminated multiline strings can be annoying to correct. If strings cannot span lines, the error will be detected on the line containing the problem. But multiline strings might continue until the beginning of the next string, then trigger a syntax error when the contents of the next string are accidentally parsed as program text. Or worse, the result might be a completely incorrect parse of what was supposed to be program text, followed by another incorrect string literal starting where the second literal ends, and continuing from there.
That also makes it hard for developer tools, such as editors and syntax highlighters, to deal with program text as it is being typed.
In the end, you may or may not find these arguments compelling, and a language designer might have other aesthetic preferences as well. I can't really speak for the original designers of the JavaScript language, and neither of us can take a voyage in time to argue with them and maybe change their decision.
For better or worse, languages are designed according to particular subjective judgements, and if the language is successful these judgements become permanent features. They are things you have to accept if you are using a language and they're not usually worth obsessing about. You get used to them, or you find a different language to program in, with its own syntax quirks.
When you design your own language, you will need to resolve a large number of syntactic questions, and you will undoubtedly run into cases where the answer is not clearcut because there is no objectively correct unique solution. Whatever you do, someone will want to argue with you. Perhaps you can refer them to this answer.
Notes:
There is actually a historic reason for not allowing multiline string literals, which is much clearer but has been more or less irrelevant for several decades.
Once Upon A Time, common filesystems considered text files to be linear arrays of fixed-length lines (often 80 character lines, matching a Hollerith card). One advantage of such a filesystem is that it could instantly navigate to a particular line number in a file, since all lines were the same length. But in any case, for systems where programs were entered on punched cards, the fixed length lines were just part of the environment.
To make all lines the same length, lines needed to be filled out with space characters. This would obviously make multiline string literals awkward, and that's why C never allowed multiline string literals, instead relying on a syntactic feature where consecutive string literals are automatically concatenated into a single literal.
In the end, fixed-line-length filesystems proved to be unpopular, and I don't think you're likely to run into one these days. But a careful reading of the C and Posix standards shows that such filesystems must still be usable by conforming implementations, with the consequence that a fully portable program must be prepared to deal with line length limits on output and trailing whitespace on input.
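There is also this syntax, where a backslash at the end of a line continues the string literal onto the next line (the line break itself is not part of the resulting string):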
const string =
'line1\
line2\
line3'
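To make the single-line rule from the question concrete, here is a rough sketch of how a character-by-character lexer might read one string literal. The name lexString and the returned shape are assumptions for illustration; escape handling is deliberately simplified (only \n is translated), and only "\n" is treated as a line terminator.

```javascript
// Reads the string literal that starts at src[start], which must be ' or ".
// Returns the decoded value and the index just past the closing quote.
function lexString(src, start) {
  const quote = src[start];
  if (quote !== '"' && quote !== "'") {
    throw new SyntaxError("Expected a quote at index " + start);
  }
  let value = "";
  for (let i = start + 1; i < src.length; i++) {
    const ch = src[i];
    if (ch === "\\") {
      const next = src[++i];
      if (next === undefined) break;        // backslash at end of input
      if (next === "\n") continue;          // line continuation contributes nothing
      value += next === "n" ? "\n" : next;  // simplified: only \n is translated
    } else if (ch === "\n") {
      // The error is reported on the line where the literal started.
      throw new SyntaxError("Unescaped newline in string starting at index " + start);
    } else if (ch === quote) {
      return { value: value, end: i + 1 };
    } else {
      value += ch;
    }
  }
  throw new SyntaxError("Unterminated string starting at index " + start);
}

console.log(lexString("'line1\\\nline2'", 0).value); // line1line2
```

Allowing raw newlines would only mean deleting the branch that throws on "\n", which matches the point that the restriction is not there to make lexing easier.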

Inefficient regular expression? [duplicate]

I recently became aware of Regular expression Denial of Service attacks, and decided to root out so-called 'evil' regex patterns wherever I could find them in my codebase - or at least those that are used on user input. The examples given at the OWASP link above and wikipedia are helpful, but they don't do a great job of explaining the problem in simple terms.
A description of evil regexes, from wikipedia:
the regular expression applies repetition ("+", "*") to a complex subexpression;
for the repeated subexpression, there exists a match which is also a suffix of another valid match.
With examples, again from wikipedia:
(a+)+
([a-zA-Z]+)*
(a|aa)+
(a|a?)+
(.*a){x} for x > 10
Is this a problem that just doesn't have a simpler explanation? I'm looking for something that would make it easier to avoid this problem while writing regexes, or to find them within an existing codebase.
Why Are Evil Regexes A Problem?
Because computers do exactly what you tell them to do, even if it's not what you meant or is totally unreasonable. If you ask a regex engine to prove that, for some given input, there either is or is not a match for a given pattern, then the engine will attempt to do that no matter how many different combinations must be tested.
Here is a simple pattern inspired by the first example in the OP's post:
^((ab)*)+$
Given the input:
abababababababababababab
The regex engine tries something like (abababababababababababab) and a match is found on the first try.
But then we throw the monkey wrench in:
abababababababababababab a
The engine will first try (abababababababababababab) but that fails because of that extra a. This causes catastrophic backtracking, because our pattern (ab)*, in a show of good faith, will release one of its captures (it will "backtrack") and let the outer pattern try again. For our regex engine, that looks something like this:
(abababababababababababab) - Nope
(ababababababababababab)(ab) - Nope
(abababababababababab)(abab) - Nope
(abababababababababab)(ab)(ab) - Nope
(ababababababababab)(ababab) - Nope
(ababababababababab)(abab)(ab) - Nope
(ababababababababab)(ab)(abab) - Nope
(ababababababababab)(ab)(ab)(ab) - Nope
(abababababababab)(abababab) - Nope
(abababababababab)(ababab)(ab) - Nope
(abababababababab)(abab)(abab) - Nope
(abababababababab)(abab)(ab)(ab) - Nope
(abababababababab)(ab)(ababab) - Nope
(abababababababab)(ab)(abab)(ab) - Nope
(abababababababab)(ab)(ab)(abab) - Nope
(abababababababab)(ab)(ab)(ab)(ab) - Nope
(ababababababab)(ababababab) - Nope
(ababababababab)(abababab)(ab) - Nope
(ababababababab)(ababab)(abab) - Nope
(ababababababab)(ababab)(ab)(ab) - Nope
(ababababababab)(abab)(abab)(ab) - Nope
(ababababababab)(abab)(ab)(abab) - Nope
(ababababababab)(abab)(ab)(ab)(ab) - Nope
(ababababababab)(ab)(abababab) - Nope
(ababababababab)(ab)(ababab)(ab) - Nope
(ababababababab)(ab)(abab)(abab) - Nope
(ababababababab)(ab)(abab)(ab)(ab) - Nope
(ababababababab)(ab)(ab)(ababab) - Nope
(ababababababab)(ab)(ab)(abab)(ab) - Nope
(ababababababab)(ab)(ab)(ab)(abab) - Nope
(ababababababab)(ab)(ab)(ab)(ab)(ab) - Nope
                              ...
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abababab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ababab)(ab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(abab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(ab)(ab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ababab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(ab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab) - Nope
The number of possible combinations scales exponentially with the length of the input and, before you know it, the regex engine is eating up all your system resources trying to solve this thing until, having exhausted every possible combination of terms, it finally gives up and reports "There is no match." Meanwhile your server has turned into a burning pile of molten metal.
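The listing above can be reproduced mechanically. The sketch below does not trace a real regex engine; it is a model of the search space, with a hypothetical function name, that enumerates every way the outer + can split n copies of "ab" into groups, longest group first.

```javascript
// Enumerates every way to split n copies of "ab" into consecutive non-empty
// groups, in the same order as the listing above.
function groupings(n) {
  if (n === 0) return [""];
  const result = [];
  for (let first = n; first >= 1; first--) {
    for (const rest of groupings(n - first)) {
      result.push("(" + "ab".repeat(first) + ")" + rest);
    }
  }
  return result;
}

console.log(groupings(3)); // [ '(ababab)', '(abab)(ab)', '(ab)(abab)', '(ab)(ab)(ab)' ]
console.log(groupings(12).length); // 2048
```

Each extra "ab" doubles the count (2^(n-1) groupings for n copies), so the 12 copies in the example already give 2048 candidates, and 30 copies would give more than 500 million.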
How to Spot Evil Regexes
It's actually very tricky. Catastrophic backtracking in modern regex engines is similar in nature to the halting problem which Alan Turing proved was impossible to solve. I have written problematic regexes myself, even though I know what they are and generally how to avoid them. Wrapping everything you can in an atomic group can help to prevent the backtracking issue. It basically tells the regex engine not to revisit a given expression - "lock whatever you matched on the first try". Note, however, that atomic expressions don't prevent backtracking within the expression, so ^(?>((ab)*)+)$ is still dangerous, but ^(?>(ab)*)+$ is safe (it'll match (abababababababababababab) and then refuse to give up any of its matched characters, thus preventing catastrophic backtracking).
Unfortunately, once it's written, it's actually very hard to immediately or quickly find a problem regex. In the end, recognizing a bad regex is like recognizing any other bad code - it takes a lot of time and experience and/or a single catastrophic event.
Interestingly, since this answer was first written, a team at the University of Texas at Austin published a paper describing the development of a tool capable of performing static analysis of regular expressions with the express purpose of finding these "evil" patterns. The tool was developed to analyse Java programs, but I suspect that in the coming years we'll see more tools developed around analysing and detecting problematic patterns in JavaScript and other languages, especially as the rate of ReDoS attacks continues to climb.
Static Detection of DoS Vulnerabilities in Programs that use Regular Expressions
Valentin Wüstholz, Oswaldo Olivo, Marijn J. H. Heule, and Isil Dillig
The University of Texas at Austin
Detecting evil regexes
Try Nicolaas Weideman's RegexStaticAnalysis project.
Try my ensemble-style vuln-regex-detector which has a CLI for Weideman's tool and others.
Rules of thumb
Evil regexes are always due to ambiguity in the corresponding NFA, which you can visualize with tools like regexper.
Here are some forms of ambiguity. Don't use these in your regexes.
Nesting quantifiers like (a+)+ (aka "star height > 1"). This can cause exponential blow-up. See substack's safe-regex tool.
Quantified Overlapping Disjunctions like (a|a)+. This can cause exponential blow-up.
Quantified Overlapping Adjacencies like \d+\d+. This can cause polynomial blow-up.
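The first rule of thumb can be approximated mechanically. What follows is a crude sketch and not how safe-regex or any of the tools above are implemented: the function name is made up, ? is ignored as a quantifier, and a fixed count such as {2} is treated like any other quantifier, so false positives are expected.

```javascript
// Reports whether a quantified group itself contains a quantifier
// (a rough "star height > 1" check on the regex source text).
function hasNestedQuantifier(source) {
  const isQuantifier = (c) => c === "*" || c === "+" || c === "{";
  const stack = []; // one flag per open group: was a quantifier seen inside it?
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === "\\") { i++; continue; }              // skip the escaped character
    if (ch === "[") {                                // skip a whole character class
      while (++i < source.length && source[i] !== "]") {
        if (source[i] === "\\") i++;
      }
      continue;
    }
    if (ch === "(") { stack.push(false); continue; }
    if (ch === ")") {
      const innerQuantified = stack.pop() === true;
      const quantified = isQuantifier(source[i + 1]);
      if (innerQuantified && quantified) return true;
      // A quantifier inside this group is also inside the enclosing group.
      if (stack.length > 0 && (innerQuantified || quantified)) {
        stack[stack.length - 1] = true;
      }
      continue;
    }
    if (isQuantifier(ch) && stack.length > 0) stack[stack.length - 1] = true;
  }
  return false;
}

console.log(hasNestedQuantifier("(a+)+"));    // true
console.log(hasNestedQuantifier("\\d+\\d+")); // false: the second and third rules are not covered
```

A pattern flagged by this check is not necessarily ambiguous; the result is a hint about where to look, not a verdict.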
Additional resources
I wrote this paper on super-linear regexes. It includes loads of references to other regex-related research.
What you call an "evil" regex is a regex that exhibits catastrophic backtracking. The linked page (which I wrote) explains the concept in detail. Basically, catastrophic backtracking happens when a regex fails to match and different permutations of the same regex can find a partial match. The regex engine then tries all those permutations. If you want to go over your code and inspect your regexes these are the 3 key issues to look at:
Alternatives must be mutually exclusive. If multiple alternatives can match the same text then the engine will try both if the remainder of the regex fails. If the alternatives are in a group that is repeated, you have catastrophic backtracking. A classic example is (.|\s)* to match any amount of any text when the regex flavor does not have a "dot matches line breaks" mode. If this is part of a longer regex then a subject string with a sufficiently long run of spaces (matched by both . and \s) will break the regex. The fix is to use (.|\n)* to make the alternatives mutually exclusive or even better to be more specific about which characters are really allowed, such as [\r\n\t\x20-\x7E] for ASCII printables, tabs, and line breaks.
Quantified tokens that are in sequence must either be mutually exclusive with each other or be mutually exclusive with what comes between them. Otherwise both can match the same text and all combinations of the two quantifiers will be tried when the remainder of the regex fails to match. A classic example is a.*?b.*?c to match 3 things with "anything" between them. When c can't be matched the first .*? will expand character by character until the end of the line or file. For each expansion the second .*? will expand character by character to match the remainder of the line or file. The fix is to realize that you can't have "anything" between them. The first run needs to stop at b and the second run needs to stop at c. With single characters a[^b]*+b[^c]*+c is an easy solution. Since we now stop at the delimiter, we can use possessive quantifiers to further increase performance.
A group that contains a token with a quantifier must not have a quantifier of its own unless the quantified token inside the group can only be matched with something else that is mutually exclusive with it. That ensures that there is no way that fewer iterations of the outer quantifier with more iterations of the inner quantifier can match the same text as more iterations of the outer quantifier with fewer iterations of the inner quantifier. This is the problem illustrated in JDB's answer.
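The first two fixes might look like this in JavaScript. This is a sketch with made-up variable names; JavaScript's RegExp has no possessive quantifiers, so plain * is used where the text above writes *+.

```javascript
// Rule 1: "." and "\s" can both match a space, so (.|\s)* gives the engine a
// choice at every space. A single character class removes that choice;
// [\s\S] matches any character, including line breaks.
const anyText = /[\s\S]*/;

// Rule 2: each run stops at its own delimiter, so the two quantifiers can
// never compete for the same characters.
const delimited = /a[^b]*b[^c]*c/;

console.log(anyText.exec("one\ntwo")[0] === "one\ntwo"); // true
console.log(delimited.exec("xa12b34cy")[0]);             // a12b34c
console.log(delimited.test("a12b34"));                   // false, and it fails quickly
```

Note that the delimited version is not an exact drop-in for a.*?b.*?c: the negated classes also match line breaks, which the dot does not by default.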
While I was writing my answer I decided that this merited a full article on my website. This is now online too.
I would sum it up as "A repetition of a repetition". The first example you listed is a good one, as it states "the letter a, one or more times in a row. This can again happen one or more times in a row".
What to look for in this case is combination of the quantifiers, such as * and +.
A somewhat more subtle thing to look out for is the third and fourth one. Those examples contain an OR operation, in which both sides can be true. This combined with a quantifier of the expression can result in a LOT of potential matches depending on the input string.
To sum it up, TLDR-style:
Be careful how quantifiers are used in combination with other operators.
I have, surprisingly, come across ReDoS quite a few times while performing source code reviews. One thing I would recommend is to use a timeout with whatever regular expression engine you are using.
For example, in C# I can create the regular expression with a TimeSpan match timeout as the third constructor argument.
using System;
using System.Text.RegularExpressions;

// Verbatim string (@"...") so the backslashes reach the regex engine unchanged.
string pattern = @"^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$";
// The third argument is the match timeout.
Regex regexTags = new Regex(pattern, RegexOptions.None, TimeSpan.FromSeconds(1.0));
try
{
    // description is the (untrusted) input string to clean up.
    string noTags = regexTags.Replace(description, "");
    Console.WriteLine(noTags);
}
catch (RegexMatchTimeoutException)
{
    Console.WriteLine("RegEx match timeout");
}
This regex is vulnerable to denial of service and without the timeout will spin and eat resources. With the timeout, it will throw a RegexMatchTimeoutException after the given timeout and will not cause the resource usage leading to a Denial of Service condition.
You will want to experiment with the timeout value to make sure it works for your usage.
I would say this is related to the regex engine in use. You may not always be able to avoid these types of regexes, but if your regex engine is built right, then it is less of a problem. See this blog series for a great deal of information on the topic of regex engines.
Note the caveat at the bottom of the article: matching regexes that use backreferences is an NP-complete problem. There is currently no known way to process them efficiently, and you might want to disallow them in your input.
I don't think you can recognize such regexes, at least not all of them or not without restrictively limiting their expressiveness. If you'd really care about ReDoSs, I'd try to sandbox them and kill their processing with a timeout. It also might be possible that there are RegEx implementations that let you limit their max backtracking amount.
There are some ways I can think of that you could implement some simplification rules by running them on small test inputs or analyzing the regex's structure.
(a+)+ can be reduced using some sort of rule for replacing redundant operators to just (a+)
([a-zA-Z]+)* could also be simplified with our new redundancy combining rule to ([a-zA-Z]*)
The computer could run tests by running the small subexpressions of the regex against randomly-generated sequences of the relevant characters or sequences of characters, and seeing what groups they all end up in. For the first one, the computer is like, hey the regex wants a's, so let's try it with 6aaaxaaq. It then sees that all the a's, and only the first group, end up in one group, and concludes that no matter how many a's it puts, it won't matter, since + gets all in the group. The second one is like, hey, the regex wants a bunch of letters, so let's try it with -fg0uj=, and then it sees that again each bunch is all in one group, so it gets rid of the + at the end.
Now we need a new rule to handle the next ones: The eliminate-irrelevant-options rule.
With (a|aa)+, the computer takes a look at it and is like, we like that big second one, but we can use that first one to fill in more gaps, let's get as many aa's as we can, and see if we can get anything else after we're done. It could run it against another test string, like 'eaaa#a~aa.' to determine that.
You can protect yourself from (a|a?)+ by having the computer realize that the strings matched by a? are not the droids we are looking for, because since it can always match anywhere, we decide that we don't like things like (a?)+, and throw it out.
We protect from (.*a){x} by getting it to realize that the characters matched by a would have already been grabbed by .*. We then throw out that part and use another rule to replace the redundant quantifiers in (.*){x}.
While implementing a system like this would be very complicated, this is a complicated problem, and a complicated solution may be necessary. You should also use techniques other people have brought up, like only allowing the regex some limited amount of execution resources before killing it if it doesn't finish.

Is there a performance penalty using capture groups in RegExp#test?

Disclaimer: my question is not focused on the exercise, it's just an example (although if you have any interesting tips on the example itself, feel free to share!).
Say I'm working with parsing some strings with Regex in JavaScript, and my main focus is performance (speed).
I have a piece of regex which checks for a numeric string, and then parses it using Number if it's numeric:
if (/^\[[0-9]+]$/.test(str)) {
val = Number(str.match(/^\[([0-9]+)]$/)[1]);
}
Note how the conditional test does not have a capture group around the digits. This leads to writing out basically the same regex twice, except with a capture group the second time.
What I would like to know is this; does adding a capture group to a regex used alongside test() in a condition affect performance in any way? I'd like to simply use the capture regex in both places, as long as there is no performance hit.
And to the question as why I'm doing test() then match() rather than match() and checking null; I want to keep parsing as fast as possible when there's a miss, but it's ok to be a little slower when there's a hit.
If it's not clear from the above, I'm referring to JavaScript's regex engine - although if this differs across engines it'd be nice to know too. I'm working specifically in Node.js here, should it also differ across JS engines.
Thanks in advance!
Doing 2 regexps - that are very similar in scope - will almost always be slower than doing a single one because regexps are greedy (that means that they will try to match as much as they can, usually meaning take the maximum amount of time possible).
What you're asking is basically: is using less memory in the worst-case scenario (that is, using .test to avoid the memory needed for the capture) faster than just using the extra memory? The answer is no, using the extra memory speeds up your process.
Don't take my word for it though, here's a jsperf: http://jsperf.com/regex-perf-numbers
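A single-pass version of the snippet from the question might look like the sketch below. The names BRACKETED_NUMBER and parseBracketedNumber are made up here, and no performance claim is attached to it; measure on your own inputs.

```javascript
// One regex with a capture group, compiled once and used for both the check
// and the extraction.
const BRACKETED_NUMBER = /^\[([0-9]+)]$/;

function parseBracketedNumber(str) {
  const match = BRACKETED_NUMBER.exec(str); // null on a miss
  return match === null ? undefined : Number(match[1]);
}

console.log(parseBracketedNumber("[42]")); // 42
console.log(parseBracketedNumber("[4x]")); // undefined
```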

Is there a way to match only top level parentheses with regex?

With Javascript, suppose I have a string like (1)(((2)(3))4), can I get a regex to match just (1) and (((2)(3))4), or do I need to do something more complicated?
Ideally the regex would return ["((2)(3))","4"] if you searched ((2)(3))4. Actually that's really a requirement. The point is to group things into the chunks that need to be worked on first, like the way parentheses work in math.
No, there is no way to match only top level parentheses with regex
Looking only at the top level doesn't make the problem easier than general "parsing" of recursive structures. (See this relevant popular SO question with a great answer).
Here's a simple intuitive reason why Regex can't parse arbitrary levels of nesting:
To keep track of the level of nesting, one must count. If one wants to be able to keep track of an arbitrary level of nesting, one needs an arbitrarily large number while running the program.
But regular expressions are exactly those languages that can be recognized by DFAs, that is, deterministic finite automata. These have only a finite number of states. Thus they can't keep track of an arbitrarily large number.
This argument works also for your specific concern of being only interested in the top level parentheses.
To recognize the top level parentheses, you must keep track of arbitrary nesting preceding any one of them:
((((..arbitrarily deep nesting...))))((.....)).......()......
^toplevel                           ^^       ^       ^^
So yes, you need something more powerful than regex.
While if you are very pragmatic, for your concrete application it might be okay to say that you won't encounter any nesting deeper than, say, 1000 (and so you might be willing to go with regex), it's also a very practical fact that any regex recognizing a nesting level of more than 2 is basically unreadable.
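The "something more powerful" can be as small as a loop with a depth counter. The following is a sketch rather than a complete parser; the function name topLevelChunks is hypothetical, and it assumes parentheses are the only kind of bracket that matters.

```javascript
// Splits text into top-level chunks: each balanced (...) group at depth 0,
// plus any runs of text between or after such groups.
function topLevelChunks(text) {
  const chunks = [];
  let depth = 0;
  let start = 0;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch === "(") {
      if (depth === 0) {
        if (i > start) chunks.push(text.slice(start, i)); // text before the group
        start = i;
      }
      depth++;
    } else if (ch === ")") {
      depth--;
      if (depth < 0) throw new SyntaxError("Unbalanced ')' at index " + i);
      if (depth === 0) {
        chunks.push(text.slice(start, i + 1));
        start = i + 1;
      }
    }
  }
  if (depth !== 0) throw new SyntaxError("Unbalanced '(' in input");
  if (start < text.length) chunks.push(text.slice(start));
  return chunks;
}

console.log(topLevelChunks("(1)(((2)(3))4)")); // [ '(1)', '(((2)(3))4)' ]
console.log(topLevelChunks("((2)(3))4"));      // [ '((2)(3))', '4' ]
```

The counter is exactly the unbounded number that a finite automaton cannot store, which is why this works for any nesting depth.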
Well, here is one way to do it. As Jo So pointed out, you can't really do it in javascript with indefinite amounts of recursion, but you can make something arbitrarily recursive pretty easily. I'm not sure how the performance scales though.
First I figured out that you need recursion. Then I realized that you can just make your regex 'recursive' by just copying and pasting recursively, like so (using curly braces for clarity):
Starting regex
Finds stuff in brackets that isn't itself brackets.
/{([^{}])*}/g
Then copy and paste the whole regex inside itself! (I spaced it out so you can see where it was pasted in.) So now it is basically like a( x | a( x )b )b
/{([^{}] | {([^{}])*} )*}/g
That will get you one level of recursion, and you can continue ad nauseam in this fashion and actually double the number of levels each time:
//matches {4{3{2{1}}}}
/{([^{}]|{([^{}]|{([^{}]|{([^{}])*})*})*})*}/g
//matches {8{7{6{5{4{3{2{1}}}}}}}}
/{([^{}]|{([^{}]|{([^{}]|{([^{}]|{([^{}]|{([^{}]|{([^{}]|{([^{}])*})*})*})*})*})*})*})*}/g
Finally I just add |[^{}]+ on the end of the expression to match stuff that is completely outside of brackets. Crazy, but it works for my needs. I feel like there is probably some clever way to combine this concept with a recursive function in order to get a truly recursive matcher, but I can't think of it now.
If you can be sure that the parentheses are balanced (I'm sure there are other resources out there that can answer that question for you if required) and if by "top-level" you're happy to find local as well as global maxima then all you need to do is find any content that starts with an open bracket and closes with a close-bracket, with no intermediate open-bracket between the two:
I think the following should do that for you and helpfully group any "top-level" content:
\(([^\(]*?)\)
That content may not all be at the same "level", but if you think of the nested brackets as describing the branching of a tree, the regex will return to you the leaves. If you pre-process your text to be wrapped in parentheses to start with, and the earlier assumptions are met, you can guarantee always getting at least one "leaf".
