How can I make this regular expression not result in "catastrophic backtracking"? - javascript

I'm trying to use a URL matching regular expression that I got from http://daringfireball.net/2010/07/improved_regex_for_matching_urls
(?xi)
\b
( # Capture 1: entire matched URL
(?:
https?:// # http or https protocol
| # or
www\d{0,3}[.] # "www.", "www1.", "www2." … "www999."
| # or
[a-z0-9.\-]+[.][a-z]{2,4}/ # looks like domain name followed by a slash
)
(?: # One or more:
[^\s()<>]+ # Run of non-space, non-()<>
| # or
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
)+
(?: # End with:
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
| # or
[^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct chars
)
)
Based on the answers to another question, it appears that there are cases that cause this regex to backtrack catastrophically. For example:
var re = /\b((?:https?:\/\/|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i;
re.test("http://google.com/?q=(AAAAAAAAAAAAAAAAAAAAAAAAAAAAA)")
... can take a really long time to execute (e.g. in Chrome)
It seems to me that the problem lies in this part of the code:
(?: # One or more:
[^\s()<>]+ # Run of non-space, non-()<>
| # or
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
)+
... which seems to be roughly equivalent to (.+|\((.+|(\(.+\)))*\))+, which looks like it contains (.+)+
Is there a change I can make that will avoid that?

Changing it to the following should prevent the catastrophic backtracking:
(?xi)
\b
( # Capture 1: entire matched URL
(?:
https?:// # http or https protocol
| # or
www\d{0,3}[.] # "www.", "www1.", "www2." … "www999."
| # or
[a-z0-9.\-]+[.][a-z]{2,4}/ # looks like domain name followed by a slash
)
(?: # One or more:
[^\s()<>]+ # Run of non-space, non-()<>
| # or
\(([^\s()<>]|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
)+
(?: # End with:
\(([^\s()<>]|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
| # or
[^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct chars
)
)
The only change that was made was to remove the + after the first [^\s()<>] in each of the "balanced parens" portions of the regex.
Here is the one-line version for testing with JS:
var re = /\b((?:https?:\/\/|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i;
re.test("http://google.com/?q=(AAAAAAAAAAAAAAAAAAAAAAAAAAAAA")
The problematic part of the original regex is the balanced-parentheses section. To simplify the explanation of why the backtracking occurs, I am going to remove the nested-parentheses alternative entirely, because it isn't relevant here:
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # original
\(([^\s()<>]+)*\) # expanded below
\( # literal '('
( # start group, repeat zero or more times
[^\s()<>]+ # one or more non-special characters
)* # end group
\) # literal ')'
Consider what happens with the string '(AAAAA'. The literal ( matches, AAAAA is consumed by the group, and then ) fails to match. At this point the group gives up one A, leaving AAAA captured, and the engine tries to continue the match from there. Because the group is followed by *, it can match multiple times, so now ([^\s()<>]+)* matches AAAA on the first pass and A on the second. When this fails too, the first capture gives up another A, which is then consumed by the second capture.
This would go on for a long while resulting in the following attempts to match, where each comma-separated group indicates a different time that the group is matched, and how many characters that instance matched:
AAAAA
AAAA, A
AAA, AA
AAA, A, A
AA, AAA
AA, AA, A
AA, A, AA
AA, A, A, A
....
I may have counted wrong, but I'm pretty sure it adds up to 16 attempts (the 2^4 ways of splitting five characters into consecutive runs) before it is determined that the regex cannot match. Each additional character doubles the number of ways to split the run, so the number of steps grows exponentially with the length of the string.
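To make the growth concrete, here is a small sketch that counts how many ways a run of n characters can be split into consecutive non-empty pieces, which is the set of possibilities ([^\s()<>]+)* can end up trying. The function name countSplits is made up for this illustration, and the count is a model of the splits, not an exact trace of any particular regex engine.

```javascript
// Count the ways to split a run of n characters into consecutive
// non-empty pieces (e.g. AAAAA -> "AAAA, A", "AAA, AA", ...).
// countSplits is a hypothetical name used only for this illustration.
function countSplits(n) {
  if (n === 0) return 1; // nothing left to split: one way
  var total = 0;
  // Choose the length of the first piece, then split the remainder.
  for (var first = 1; first <= n; first++) {
    total += countSplits(n - first);
  }
  return total;
}

console.log(countSplits(5));  // 16
console.log(countSplits(10)); // 512
console.log(countSplits(20)); // 524288
```

The result is 2^(n-1), so the 29 A's in the test string above allow 2^28 splits.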
By removing the + and changing this to \(([^\s()<>])*\), you would avoid this backtracking scenario.
Adding the alternation back in to check for the nested parentheses doesn't cause any problems.
Note that you may want to add some sort of anchor to the end of the string, because currently "http://google.com/?q=(AAAAAAAAAAAAAAAAAAAAAAAAAAAAA" will match up to just before the (, so re.test(...) would return true because http://google.com/?q= matches.
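As a rough sketch of that point, the snippet below uses exec to show what the corrected regex actually matched, and compares the match length with the input length instead of anchoring the pattern. The helper name matchesWholeString is an assumption for this example, and the length comparison is only one possible way to require a full match.

```javascript
// The corrected one-line regex from above.
var re = /\b((?:https?:\/\/|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i;

// Hypothetical helper: true only if the match starts at index 0
// and spans the entire input.
function matchesWholeString(input) {
  var m = re.exec(input);
  return m !== null && m.index === 0 && m[0].length === input.length;
}

var unclosed = "http://google.com/?q=(AAAAAAAAAAAAAAAAAAAAAAAAAAAAA";

console.log(re.test(unclosed));            // true
console.log(re.exec(unclosed)[0]);         // "http://google.com/?q="
console.log(matchesWholeString(unclosed)); // false
console.log(matchesWholeString("http://example.com/path")); // true
```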

Related

Regex split iif expression

I am trying to test a Regex which should be able to split the following expressions into 3 parts:
test
true
false
If there are multiple iif nested, it should give me multiple matches
And I have some patterns:
iif(testExpression, trueExpression, falseExpression)
iif((#HasMinimunRegulatedCAR#==0),(([t219]>1.5) OR ([t219]<-0.5)),(([t223]>1.5) OR ([t223]<-0.5)))
iif((#HasMinimunRegulatedCAR#==1), iif((#MFIUsePAR30#==1), ([t224]>0.25), iif((#MFIUsePAR90#==1), ([t225]>0.25), (1==1))), iif((#MFIUsePAR30#==1), ([t220]>0.25), iif((#MFIUsePAR90#==1), ([t221]>0.25),(1==1))))
I am using this expression but it doesn't work when I have multiple iif nested
(^(iif\()|[^,]+)
I am running my tests using:
https://regex101.com/
The expected output should be
testExpression
trueExpression
falseExpression
(#HasMinimunRegulatedCAR#==0)
([t219]>1.5) OR ([t219]<-0.5)
([t223]>1.5) OR ([t223]<-0.5)
(#HasMinimunRegulatedCAR#==1)
iif((#MFIUsePAR30#==1), ([t224]>0.25), iif((#MFIUsePAR90#==1), ([t225]>0.25), (1==1)))
iif((#MFIUsePAR30#==1), ([t220]>0.25), iif((#MFIUsePAR90#==1), ([t221]>0.25),(1==1)))
If .Net regular expressions are an option, you can use balancing groups to capture your iifs.
One option will capture all nested parentheses into a group called Exp:
(?>
(?:iif)?(?<Open>\((?<Start>)) # When seeing an open parenthesis, mark
# a beginning of an expression.
# Use the stack Open to count our depth.
|
(?<-Open>(?<Exp-Start>)\)) # Close parenthesis. Capture an expression.
|
[^()\n,] # Match any regular character
|
(?<Exp-Start>),(?<Start>) # On a comma, capture an expression,
# and start a new one.
)+?
(?(Open)(?!))
Working Example, switch to the Table tab.
This option captures more than you ask for, but it does give you the full data you need to fully parse the equation.
Another option is to capture only the top-level parts:
iif\(
(?:
(?<TopLevel>(?>
(?<Open>\() # Increase depth count using the Open stack.
|
(?<-Open>\)) # Decrease depth count.
|
[^()\n,]+ # Match any other boring character
|
(?(Open),|(?!)) # Match a comma, but only if we're already inside parentheses.
)+)
(?(Open)(?!))
|
,\s* # Match a comma between top-level expressions.
)+
\)
Working Example
Here, the group $TopLevel will have three captures. For example, on your last example, it captures:
Capture 0: (#HasMinimunRegulatedCAR#==1)
Capture 1: iif((#MFIUsePAR30#==1), ([t224]>0.25), iif((#MFIUsePAR90#==1), ([t225]>0.25), (1==1)))
Capture 2: iif((#MFIUsePAR30#==1), ([t220]>0.25), iif((#MFIUsePAR90#==1), ([t221]>0.25),(1==1)))
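If .Net balancing groups are not available (JavaScript regular expressions, for instance, do not have them), the top-level split can also be done without a regex by tracking parenthesis depth. This is a minimal sketch rather than a full parser: the function name splitIif is made up, and it assumes the input contains no commas or parentheses inside string literals.

```javascript
// Split the arguments of the first iif(...) at top-level commas.
// splitIif is a hypothetical helper; returns null when no balanced
// iif( ... ) is found.
function splitIif(expr) {
  var start = expr.indexOf("iif(");
  if (start === -1) return null;
  var parts = [];
  var depth = 0;     // nesting depth inside the outer iif(
  var current = "";
  for (var i = start + 4; i < expr.length; i++) {
    var ch = expr[i];
    if (ch === "(") depth++;
    if (ch === ")") {
      if (depth === 0) {           // this closes the outer iif(
        parts.push(current.trim());
        return parts;
      }
      depth--;
    }
    if (ch === "," && depth === 0) {
      parts.push(current.trim()); // top-level comma ends an argument
      current = "";
    } else {
      current += ch;
    }
  }
  return null; // ran out of input before the closing parenthesis
}

console.log(splitIif("iif(testExpression, trueExpression, falseExpression)"));
// [ 'testExpression', 'trueExpression', 'falseExpression' ]
```

Calling splitIif again on any part that starts with iif( gives the nested levels. Unlike the expected output in the question, this sketch keeps the outer parentheses around each argument.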

Livecycle RegExp - trouble with decimal

Within Livecycle, I am validating that the number entered is between 0 and 10, allowing quarter hours. With the help of this post, I've written the following.
if (!xfa.event.newText.match(/^(([10]))$|^((([0-9]))$|^((([0-9]))\.?((25)|(50)|(5)|(75)|(0)|(00))))$/))
{
xfa.event.change = "";
};
The problem is that periods are not being accepted. I have tried wrapping the \. in parentheses, but that did not work either. The field is a text field with no special formatting, and the code is in the change event.
Yikes, that's a convoluted regex. This can be simplified a lot:
/^(?:10|[0-9](?:\.(?:[27]?5)?0*)?)$/
Explanation:
^ # Start of string
(?: # Start of group:
10 # Either match 10
| # or
[0-9] # Match 0-9
(?: # optionally followed by this group:
\. # a dot
(?:[27]?5)? # either 25, 75 or 5 (also optional)
0* # followed by optional zeroes
)? # As said before, make the group optional
) # End of outer group
$ # End of string
Test it live on regex101.com.
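For a quick check outside Livecycle, the simplified pattern can be run against a few sample values in plain JavaScript. The variable name and the sample inputs below are chosen for illustration only.

```javascript
var quarterHours = /^(?:10|[0-9](?:\.(?:[27]?5)?0*)?)$/;

// Values the pattern accepts, including a bare trailing dot.
var accepted = ["0", "7", "10", "7.", "7.0", "7.25", "7.5", "7.50", "7.75"];
accepted.forEach(function (s) {
  console.log(s, quarterHours.test(s)); // true for each
});

// Values the pattern rejects.
var rejected = ["11", "10.5", "7.3", "7.2", "abc"];
rejected.forEach(function (s) {
  console.log(s, quarterHours.test(s)); // false for each
});
```

Note that 7.2 is rejected even though it is a prefix of 7.25, which may matter if the check runs on every keystroke in a change event.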

Pattern suggestions to work around alternation operator limitations

I'm trying to build a regex pattern matcher.
The complete string pattern is as follows:
AB123456C12
Letter A
Letter B
six digits
one letter
two digits.
I'm trying to match as much as possible, but partial inputs are allowed as long as the initial AB is present.
The RegEx engine is Javascript. Hoping to be fully cross-browser compatible.
I do have a pattern that works:
^AB([0-9]{6}[A-Z][0-9]{0,2}|[0-9]{0,6})$
But it only works when the arguments of the alternation operator are in this order. In other words,
^AB([0-9]{0,6}|[0-9]{6}[A-Z][0-9]{0,2})$
doesn't work - which makes me believe that the solution may not work in some obscure browser.
So, any other way to define that pattern?
Thanks.
Edited for clarity: the following inputs must be matched by the regex:
AB
AB123
AB123456Z
The following inputs are to be rejected:
B
B123456Z12
ABC
123456
This may help
^AB[0-9]{6}[A-Z][0-9]{2}$
I think you are looking for this.
# ^AB(?:(?:[0-9]{6}(?:[A-Z][0-9]{0,2})?)|[0-9]{1,5})?$
^ # BOS
AB # AB
(?: # Optional cl-1
(?: # Required cl-2
[0-9]{6} # Required 6 digits
(?: # Optional cl-3
[A-Z] # Required A-Z letter
[0-9]{0,2} # Required 0-2 (0 means optional)
)? # End cl-3
) # End cl-2
| # or
[0-9]{1,5} # Required 1-5 digits
)? # End cl-1
$ # EOS
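As a sanity check, here is the pattern above run against the inputs listed in the question, plus the complete string. The variable names are only for this sketch.

```javascript
var partial = /^AB(?:(?:[0-9]{6}(?:[A-Z][0-9]{0,2})?)|[0-9]{1,5})?$/;

// Inputs from the question that must match, plus the complete string.
var accepted = ["AB", "AB123", "AB123456Z", "AB123456C12"];
// Inputs from the question that must be rejected.
var rejected = ["B", "B123456Z12", "ABC", "123456"];

accepted.forEach(function (s) {
  console.log(s, partial.test(s)); // true for each
});
rejected.forEach(function (s) {
  console.log(s, partial.test(s)); // false for each
});
```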

Javascript Regular expression for currency amount with spaces

I have this regular expression
/^[',",\+,<,>,\(,\*,\-,%]?([£,$,€]?\d+([\,,\.]\d+)?[£,$,€]?\s*[\-,\/,\,,\.,\+]?[\/]?\s*)+[',",\+, <,>,\),\*,\-,%]?$/
It matches $55.5 very well, but some of my test data contains values like $ 55.5 (that is, with a space after the $ sign).
The answers on this link are not working for me.
Currency / Percent Regular Expression
So, how can I change it to accept the spaces as well?
Try following RegEx:
/^[',",\+,<,>,\(,\*,\-,%]?([£,$,€]?\s*\d+([\,,\.]\d+)?[£,$,€]?\s*[\-,\/,\,,\.,\+]?[\/]?\s*)+[',",\+, <,>,\),\*,\-,%]?$/
Let me know if it worked!
Demo Here
TLDR:
/^[',",\+,<,>,\(,\*,\-,%]?([£,$,€]?\s*\d+([\,,\.]\d+)?[£,$,€]?\s*[\-,\/,\,,\.,\+]?[\/]?\s*)+[',",\+, <,>,\),\*,\-,%]?$/
The science bit
Ok, I'm guessing that you didn't construct the original regular expression, so here are the pieces of it, with the addition marked:
^ # match from the beginning of the string
[',",\+,<,>,\(,\*,\-,%]? # optionally one of these symbols
( # start a group
[£,$,€]? # optionally one of these symbols
\s* # <--- NEW ADDITION: optional whitespace (zero or more characters)
\d+ # then one or more decimal digits
( # start group
[\,,\.] # comma or a dot
\d+ # then one or more decimal digits
)? # group optional (comma/dot and digits or neither)
[£,$,€]? # optionally one of these symbols
\s* # optionally whitespace
[\-,\/,\,,\.,\+]? # optionally one of these symbols
[\/]? # optionally a /
\s* # optionally whitespace
)+ # this whole group one or more times
[',",\+, <,>,\),\*,\-,%]? # optionally one of these symbols
$ # match to the end of the string
Much of this is poking about matching stuff around the currency amount, so you could reduce that.
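To see the effect of the added \s*, the two versions can be compared side by side. The variable names original and updated are just labels for this sketch.

```javascript
// Regex from the question (no whitespace allowed after the currency sign).
var original = /^[',",\+,<,>,\(,\*,\-,%]?([£,$,€]?\d+([\,,\.]\d+)?[£,$,€]?\s*[\-,\/,\,,\.,\+]?[\/]?\s*)+[',",\+, <,>,\),\*,\-,%]?$/;

// Regex from the answer (adds \s* after the optional currency sign).
var updated = /^[',",\+,<,>,\(,\*,\-,%]?([£,$,€]?\s*\d+([\,,\.]\d+)?[£,$,€]?\s*[\-,\/,\,,\.,\+]?[\/]?\s*)+[',",\+, <,>,\),\*,\-,%]?$/;

console.log(original.test("$55.5"));  // true
console.log(original.test("$ 55.5")); // false
console.log(updated.test("$55.5"));   // true
console.log(updated.test("$ 55.5"));  // true
```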

Check for open, unescaped quotes in a given string

A part of the application I'm building allows you to evaluate bash commands in an interactive terminal. On enter, the command is run. I'm trying to make it a bit more flexible, and allow for commands spanning multiple lines.
I already check for a trailing backslash, now I'm trying to figure out how to tell if there is an open string. I have not been successful in writing a regex for this, as it should also support escaped quotes.
For example:
echo "this is a
\"very\" cool quote"
If you want a regex that matches a string (subject) only if it doesn't contain unbalanced (unescaped) quotes, then try the following:
/^(?:[^"\\]|\\.|"(?:\\.|[^"\\])*")*$/.test(subject)
Explanation:
^ # Match the start of the string.
(?: # Match either...
[^"\\] # a character besides quotes or backslash
| # or
\\. # any escaped character
| # or
" # a closed string, i. e. one that starts with a quote,
(?: # followed by either
\\. # an escaped character
| # or
[^"\\] # any other character except quote or backslash
)* # any number of times,
" # and a closing quote.
)* # Repeat as often as needed.
$ # Match the end of the string.
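A short usage sketch, wrapping the regex in a helper so it can be called on each line of input as it accumulates. The name hasBalancedQuotes is an assumption, and the check only covers double quotes; bash single quotes would need separate handling.

```javascript
// Hypothetical helper around the regex above: returns true when every
// unescaped double quote in subject is closed.
function hasBalancedQuotes(subject) {
  return /^(?:[^"\\]|\\.|"(?:\\.|[^"\\])*")*$/.test(subject);
}

// First line only: the string is still open.
console.log(hasBalancedQuotes('echo "this is a')); // false

// Both lines: the escaped quotes inside do not close the string.
console.log(hasBalancedQuotes('echo "this is a\n\\"very\\" cool quote"')); // true

// An escaped quote outside a string does not open one.
console.log(hasBalancedQuotes('echo \\"not a string')); // true
```

When the helper returns false, the terminal could keep reading lines before running the command.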
