Regex split iif expression - javascript

I am trying to test a Regex which should be able to split the following expressions into 3 parts:
test
true
false
If there are multiple iif nested, it should give me multiple matches
And I have some patterns:
iif(testExpression, trueExpression, falseExpression)
iif((#HasMinimunRegulatedCAR#==0),(([t219]>1.5) OR ([t219]<-0.5)),(([t223]>1.5) OR ([t223]<-0.5)))
iif((#HasMinimunRegulatedCAR#==1), iif((#MFIUsePAR30#==1), ([t224]>0.25), iif((#MFIUsePAR90#==1), ([t225]>0.25), (1==1))), iif((#MFIUsePAR30#==1), ([t220]>0.25), iif((#MFIUsePAR90#==1), ([t221]>0.25),(1==1))))
I am using this expression but it doesn't work when I have multiple iif nested
(^(iif\()|[^,]+)
I am running my tests using:
https://regex101.com/
The expected output should be
testExpression
trueExpression
falseExpression
(#HasMinimunRegulatedCAR#==0)
([t219]>1.5) OR ([t219]<-0.5)
([t223]>1.5) OR ([t223]<-0.5)
(#HasMinimunRegulatedCAR#==1)
iif((#MFIUsePAR30#==1), ([t224]>0.25), iif((#MFIUsePAR90#==1), ([t225]>0.25), (1==1)))
iif((#MFIUsePAR30#==1), ([t220]>0.25), iif((#MFIUsePAR90#==1), ([t221]>0.25),(1==1)))

If .NET regular expressions are an option, you can use balancing groups to capture your iif expressions.
One option will capture all nested parentheses into a group called Exp:
(?>
(?:iif)?(?<Open>\((?<Start>)) # When seeing an open parenthesis, mark
# a beginning of an expression.
# Use the stack Open to count our depth.
|
(?<-Open>(?<Exp-Start>)\)) # Close parenthesis. Capture an expression.
|
[^()\n,] # Match any regular character
|
(?<Exp-Start>),(?<Start>) # On a comma, capture an expression,
# and start a new one.
)+?
(?(Open)(?!))
Working Example, switch to the Table tab.
This option captures more than you ask for, but it does give you the full data you need to fully parse the equation.
Another option is to capture only the top-level parts:
iif\(
(?:
(?<TopLevel>(?>
(?<Open>\() # Increase depth count using the Open stack.
|
(?<-Open>\)) # Decrease depth count.
|
[^()\n,]+ # Match any other boring character
|
(?(Open),|(?!)) # Match a comma, but only if we're already inside parentheses.
)+)
(?(Open)(?!))
|
,\s* # Match a comma between top-level expressions.
)+
\)
Working Example
Here, the group $TopLevel will have three captures. For example, on your last example, it captures:
Capture 0: (#HasMinimunRegulatedCAR#==1)
Capture 1: iif((#MFIUsePAR30#==1), ([t224]>0.25), iif((#MFIUsePAR90#==1), ([t225]>0.25), (1==1)))
Capture 2: iif((#MFIUsePAR30#==1), ([t220]>0.25), iif((#MFIUsePAR90#==1), ([t221]>0.25),(1==1)))
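JavaScript regular expressions have no balancing groups, so the patterns above cannot be ported directly. As a rough sketch of the same idea in plain JavaScript, a small scanner can count parenthesis depth and split on commas that sit directly inside the outer iif(...). The function name splitIifArgs is made up for this illustration, and the sketch assumes the input starts with iif( and contains no parentheses or commas inside string literals.

```javascript
// Hypothetical helper (not from the original answer): returns the top-level
// arguments of an expression of the form iif(a, b, c), or null if the input
// does not start with "iif(" or its parentheses are unbalanced.
function splitIifArgs(expr) {
  const head = /^\s*iif\s*\(/.exec(expr);
  if (!head) return null;

  const open = head[0].length - 1; // index of the opening parenthesis
  const parts = [];
  let depth = 0;
  let start = open + 1;

  for (let i = open; i < expr.length; i++) {
    const ch = expr[i];
    if (ch === "(") {
      depth++;
    } else if (ch === ")") {
      depth--;
      if (depth === 0) {
        // Closing parenthesis of the outer iif: emit the last argument.
        parts.push(expr.slice(start, i).trim());
        return parts;
      }
    } else if (ch === "," && depth === 1) {
      // A comma directly inside the outer iif separates two arguments.
      parts.push(expr.slice(start, i).trim());
      start = i + 1;
    }
  }
  return null; // ran out of input before the outer iif was closed
}

console.log(splitIifArgs("iif(testExpression, trueExpression, falseExpression)"));
```

To walk nested expressions, the same function can be called again on any returned part that itself starts with iif(. Unlike the expected output in the question, this sketch keeps the outer parentheses of each argument.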

Related

Using RegExp to test words for letter count

I've been trying to use RegExp in JS to test a string for a certain count of a substrings. I would like a purely RegExp approach so I can combine it with my other search criteria, but have not had any luck.
As a simple example I would like to test a word if it has exactly 2-3 as.
case   test string   expected result
1      cat           false
2      shazam        true
3      abracadabra   false
Most of my guesses at regex fail case 3. Example: ^(?<!.*a.*)((.*a.*){2,3})(?!.*a.*)$
You could use this regex, which allows any other characters, including whitespace, around the a's:
^[^a]*(?:a[^a]*){2,3}$
Or, if you are matching in multiline mode and don't want a match to span lines:
^[^a\r\n]*(?:a[^a\r\n]*){2,3}$
^ # Begin
[^a]* # optional not-a
(?: # Grp
a # single a
[^a]* # optional not-a
){2,3} # Get 2 or 3 a only
$ # End
https://regex101.com/r/O3SYKx/1
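As a quick check, the first pattern can be run against the three cases from the question in JavaScript. The helper name hasTwoOrThreeAs and the extra word "banana" are illustrative additions, not part of the original answer.

```javascript
// Pattern from the answer: any non-a run, then 2 or 3 repetitions of
// "a followed by a non-a run", anchored at both ends.
const twoOrThreeAs = /^[^a]*(?:a[^a]*){2,3}$/;

// Hypothetical wrapper used only for this demonstration.
function hasTwoOrThreeAs(word) {
  return twoOrThreeAs.test(word);
}

for (const word of ["cat", "shazam", "abracadabra", "banana"]) {
  console.log(word, hasTwoOrThreeAs(word));
}
```

Because every character is either an a (counted by the group) or a non-a (consumed by [^a]*), the anchors force the whole string to contain two or three a's, which is why "abracadabra" with five a's is rejected.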

RegEx for an autocomplete feature

I am writing a search bar with an autocomplete feature that is hooked up to an endpoint. I am using regex to determine the "context" that I am in inside of the query I type in the search bar. The three contexts are "attribute," "value," and "operator." The two operators that are allowed are "AND" and "OR." Below is an example query.
Color: Blue AND Size: "Women's Large" (<-- multi-word values or attribute names are surrounded by quotation marks)
I need my regex to match after you put a space after Blue, and if the user begins typing "A/AN/AND/O/OR", I need it to keep matching. Once they have put a space after the operator, I need it to stop matching.
This is the expression I have come up with.
const contextIsOperator = /[\w\d\s"]+: *[\w\s\d"]+ [\w]*$/
It matches once I put a space after "Blue," but matches for everything I put after that. If I replace the last * in the expression with a +, it works when I put a space after "Blue" and start manually typing one of the operators, but not if I just have a space after "Blue."
The pattern I have in my head written in words is:
group of one or more characters/digits/spaces/quotation marks
followed by a colon
followed by an optional space
followed by another group of one or more characters/digits/space/quotation marks
followed by a space (after the value)
followed by one or more characters (this is the operator)
How do I solve this problem?
Change [\w]* to something that just matches AND, OR, or one of their prefixes. Then you can make it optional with ?
[\w\s"]+: *[\w\s"]+ (A|AN|AND|O|OR)?$
DEMO
Note that Size: Women's Large won't match this because the apostrophe isn't in \w; that only matches letters, digits, and underscore. You'll need to add any other punctuation characters that you want to allow in these fields to the character set.
As is, your language is not deterministic enough to be properly modeled with a regex. That being said, there are 2 approaches you can take:
Require all values (the stuff after a : and before an operator) to be enclosed in quotes
Build a simple state machine that can parse the data more intelligently. (Google Finite State Machine Parser)
If you choose to use the first method, you can use the following regex:
^(("?[\w\s]+"?): ?("[\w\s']+")( (AND|OR) )?)+$
I would explain the different components, but regex101 already does for me with really good visuals and detail.
Edit: this is the final one, check the unit tests here
const regex = /((("[\w\s"'']+(?="\b))"|[\w"'']+):\s?(("[\w\s"'']+(?="\b))"|[\w"'']+)\s(AND|OR)(?=\b\s))+/
That monstrosity should match (NOTE: QUOTED KEYS/VALUES MUST BE DOUBLE QUOTED):
Color: Blue AND "Size5":"Women's Large"
"weird KEy":regularvalue OR otherKey: "quoted value"
Here you go, try this out
^(?:"[^"]*"|[^\s:]+):[ ](?:"[^"]*"|[^\s:]+)[ ](?:A(?:N(?:D(?:[ ](*SKIP)(?!))?)?)?|O(?:R(?:[ ](*SKIP)(?!))?)?)?
https://regex101.com/r/neUQ0g/1
Explained
^ # BOS
(?: # Attribute
"
[^"]*
"
|
[^\s:]+
)
:
[ ]
(?: # Value
"
[^"]*
"
|
[^\s:]+
)
[ ] # Start matching after Attribute: Value + space
(?: # Operator
A
(?:
N
(?:
D
(?: # Stop matching after 'AND '
[ ]
(*SKIP)
(?!)
)?
)?
)?
|
O
(?:
R
(?: # Stop matching after 'OR '
[ ]
(*SKIP)
(?!)
)?
)?
)?
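Note that (*SKIP) is a PCRE backtracking verb and is not available in JavaScript regular expressions. A rough JavaScript-only sketch of the same idea is to anchor the pattern at the end of the input with $, so that anything typed after a complete operator prevents a match. The name contextIsOperator comes from the question; the exact pattern and the sample queries below are assumptions made for illustration, and the sketch requires multi-word attributes and values to be double quoted.

```javascript
// Sketch: true when the text ends with `attribute: value ` followed by an
// optional prefix of AND / OR. Quoted fields may contain spaces and
// apostrophes; unquoted fields may not contain whitespace, colons or quotes.
const contextIsOperator =
  /(?:^|\s)(?:"[^"]*"|[^\s:"]+): ?(?:"[^"]*"|[^\s:"]+) (?:A|AN|AND|O|OR)?$/;

const samples = [
  'Color: Blue',                             // no space after the value yet
  'Color: Blue ',                            // space typed: operator context
  'Color: Blue AN',                          // operator being typed
  'Color: Blue AND ',                        // operator finished: no match
  'Color: Blue AND Size: "Women\'s Large" O' // second pair, operator started
];

for (const query of samples) {
  console.log(JSON.stringify(query), contextIsOperator.test(query));
}
```

The (?:^|\s) prefix lets the pattern lock onto the last attribute/value pair in a longer query rather than only the first one.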

JavaScript regular expression to remember last bracket type encountered

In JavaScript, I want to be able to match text that is:
(surrounded by parentheses)
[surrounded by square brackets]
not surrounded by either type of bracket
In the following expression...
none[square](round)(accept]able)[wrong).text
... there should be 4 matches, for none, [square], (round) and (accept]able). However [wrong) should not match because there is no closing ] to be found.
In my best attempt so far...
([([])[A-Za-z]+[\])]|[^\[()\]]+
... (accept], able and [wrong) are incorrectly matched, while (accept]able) as a whole is not matched. I'm not too concerned about (accept]able); I would prefer no match at all to a match with imbalanced brackets.
I am guessing that I need to replace the [\])] expression with one that checks the value of the initial matching group, and uses ) if the first match was ( or ] if the first match was [.
I have tried working with conditional expressions. These seem to work well in PCRE and Python, but not in JavaScript.
Is this a problem that can be solved in a JavaScript regular expression on its own, or will I have to handle this piecemeal in a bulky JavaScript function?
One way to do that is to match both cases (acceptable and non-acceptable) and separate the results into two different capture groups. Then, whatever you need to do with the results, you only have to test which group succeeded:
/(\[[^\]]*\]|\([^)]*\)|[a-z]+)|([\[(][\s\S]*?(?:[\])]|$))/gi
pattern details:
( # acceptable capture group
\[ [^\]]* \]
|
\( [^)]* \)
|
[a-z]+
)
|
( # non-acceptable capture group
[\[(] [\s\S]*? (?: [\])] | $ ) # unclosed parens
)
This pattern doesn't care if a square bracket is enclosed between round brackets and vice versa, but you can easily be more restrictive with this pattern, which forbids any other brackets between brackets (square or round):
( # acceptable capture group
\[ [^()\[\]]* \]
|
\( [^()\[\]]* \)
|
[a-z]+
)
|
( # non-acceptable capture group
[\[(] [\s\S]*? (?: [\])] | $ ) # unclosed parens
)
Note about these two patterns: you can choose the default behavior when an unclosed bracket is found. The two patterns are designed to stop the non-acceptable part at the first closing bracket or, if none is found, at the end of the string. You can change this behavior so that an unclosed bracket always extends to the end of the string, like this: [\[(][\s\S]*$
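To make the two-group approach concrete, here is a small JavaScript sketch that runs the first pattern over the sample string from the question and sorts the matches by which capture group succeeded. The function name classifyBrackets is hypothetical, and String.prototype.matchAll requires an ES2020-capable engine.

```javascript
// First pattern from the answer: group 1 = acceptable, group 2 = unclosed.
const bracketPattern =
  /(\[[^\]]*\]|\([^)]*\)|[a-z]+)|([\[(][\s\S]*?(?:[\])]|$))/gi;

// Hypothetical helper: splits matches into accepted and rejected lists.
function classifyBrackets(text) {
  const accepted = [];
  const rejected = [];
  for (const match of text.matchAll(bracketPattern)) {
    if (match[1] !== undefined) {
      accepted.push(match[1]);
    } else {
      rejected.push(match[2]);
    }
  }
  return { accepted, rejected };
}

console.log(classifyBrackets("none[square](round)(accept]able)[wrong).text"));
```

With this pattern the trailing word text is also reported as an acceptable unbracketed run, while [wrong) lands in the rejected list.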
I'm not quite sure if I get all of the possible strings, but maybe this does the trick?
/\[([A-Za-z]*)\]|\(([\]A-Za-z]*)\)/gm
You can use the following :
/^(\[[^\[]+?\]|\([^\(]+?\)|[^\[\(]+)$/gm
See DEMO
This will do it for you:
\((\w*\s*)\)|\[(\w*)\]|\((\w*\s*|\])*\)|\((\w*\s*|\[)*\)|\[(\w*\s*|\()*\]|\[(\w*\s*|\))*\]|^\b\w*\s*\b
Demo here:
https://regex101.com/r/mV6gD2/2

Regular expression for excluding some characters with multiline matching

I want to ensure that the user input doesn't contain characters like <, > or &#, whether it is text input or textarea. My pattern:
var pattern = /^((?!&#|<|>).)*$/m;
The problem is, that it still matches multiline strings from a textarea like
this text matches
though this should not, because of this character <
EDIT:
To be clearer, I need to exclude the &# combination only, not & or # on their own.
Please suggest the solution. Very grateful.
You're probably not looking for the m (multiline) switch but the s (DOTALL) switch. Unfortunately, the s flag doesn't exist in JavaScript engines that predate ES2018.
However good news that DOTALL can be simulated using [\s\S]. Try following regex:
/^(?![\s\S]*?(&#|<|>))[\s\S]*$/
OR:
/^((?!&#|<|>)[\s\S])*$/
Live Demo
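The difference between the original pattern and the first replacement can be seen with the textarea input from the question. This is only a sketch; the variable names are made up for illustration.

```javascript
// Pattern from the question: with /m, ^ and $ match around a single line,
// so one clean line is enough for the whole test to succeed.
const perLine = /^((?!&#|<|>).)*$/m;

// First replacement from the answer: no /m, and [\s\S] also crosses newlines,
// so the whole input has to be free of the forbidden sequences.
const wholeInput = /^(?![\s\S]*?(&#|<|>))[\s\S]*$/;

const textarea =
  "this text matches\nthough this should not, because of this character <";

console.log(perLine.test(textarea));    // the faulty result the question describes
console.log(wholeInput.test(textarea)); // rejects the input as intended
console.log(wholeInput.test("a & b # c\nsecond line")); // & and # alone are fine
```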
I don't think you need a lookaround assertion in this case. Simply use a negated character class:
var pattern = /^[^<>&#]*$/m;
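If you're also disallowing the characters -, [ and ], make sure to escape them or put them where they are read literally. In JavaScript, an unescaped ] directly after [^ does not work, because [^] is itself a complete class that matches any character, so escape the bracket:
var pattern = /^[^\][<>&#-]*$/m;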
Alternate answer to specific question:
anubhava's solution works accurately, but is slow because it must perform a negative lookahead at each and every character position in the string. A simpler approach is to use reverse logic, i.e. instead of verifying that /^((?!&#|<|>)[\s\S])*$/ does match, verify that /[<>]|&#/ does NOT match. To illustrate this, let's create a function, hasSpecial(), which tests if a string has one of the special chars. Here are two versions; the first uses anubhava's second regex:
function hasSpecial_1(text) {
    // If regex matches, then string does NOT contain special chars.
    return !/^((?!&#|<|>)[\s\S])*$/.test(text);
}
function hasSpecial_2(text) {
    // If regex matches, then string contains (at least) one special char.
    return /[<>]|&#/.test(text);
}
These two functions are functionally equivalent, but the second one is probably quite a bit faster.
Note that when I originally read this question, I misinterpreted it to really want to exclude HTML special chars (including HTML entities). If that were the case, then the following solution will do just that.
Test if a string contains HTML special Chars:
It appears that the OP wants to ensure a string does not contain any special HTML characters, including <, >, as well as decimal and hex HTML entities (numeric character references of the form &#NNN; and &#xHHH;). If this is the case, then the solution should probably also exclude the other (named) type of HTML entities, such as &amp; and &lt;. The solution below excludes all three forms of HTML entities as well as the <> tag delimiters.
Here are two approaches: (Note that both approaches do allow the sequence: &# if it is not part of a valid HTML entity.)
FALSE test using positive regex:
function hasHtmlSpecial_1(text) {
/* Commented regex:
# Match string having no special HTML chars.
^ # Anchor to start of string.
[^<>&]* # Zero or more non-[<>&] (normal*).
(?: # Unroll the loop. ((special normal*)*)
& # Allow a & but only if
(?! # not an HTML entity (3 valid types).
(?: # One from 3 types of HTML entities.
[a-z\d]+ # either a named entity,
| \#\d+ # or a decimal entity,
| \#x[a-f\d]+ # or a hex entity.
) # End group of HTML entity types.
; # All entities end with ";".
) # End negative lookahead.
[^<>&]* # More (normal*).
)* # End unroll the loop.
$ # Anchor to end of string.
*/
var re = /^[^<>&]*(?:&(?!(?:[a-z\d]+|#\d+|#x[a-f\d]+);)[^<>&]*)*$/i;
// If regex matches, then string does NOT contain HTML special chars.
return re.test(text) ? false : true;
}
Note that the above regex utilizes Jeffrey Friedl's "Unrolling-the-Loop" efficiency technique and will run very quickly for both matching and non-matching cases. (See his regex masterpiece: Mastering Regular Expressions (3rd Edition))
TRUE test using negative regex:
function hasHtmlSpecial_2(text) {
/* Commented regex:
# Match string having one special HTML char.
[<>] # Either a tag delimiter
| & # or a & if start of
(?: # one of 3 types of HTML entities.
[a-z\d]+ # either a named entity,
| \#\d+ # or a decimal entity,
| \#x[a-f\d]+ # or a hex entity.
) # End group of HTML entity types.
; # All entities end with ";".
*/
var re = /[<>]|&(?:[a-z\d]+|#\d+|#x[a-f\d]+);/i;
// If regex matches, then string contains (at least) one special HTML char.
return re.test(text) ? true : false;
}
Note also that I have included a commented version of each of these (non-trivial) regexes in the form of a JavaScript comment.
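As a brief usage sketch, the negative regex can be exercised against a few inputs. The sample strings and the names htmlSpecial and containsHtmlSpecial are illustrative choices, not part of the original answer.

```javascript
// Same pattern as in hasHtmlSpecial_2: a tag delimiter, or an & that starts
// a named, decimal or hex entity terminated by ";".
const htmlSpecial = /[<>]|&(?:[a-z\d]+|#\d+|#x[a-f\d]+);/i;

// Hypothetical wrapper for the demonstration.
function containsHtmlSpecial(text) {
  return htmlSpecial.test(text);
}

const inputs = ["AT&T", "a &# b", "1 < 2", "&amp;", "&#160;", "&#xA0;"];
for (const input of inputs) {
  console.log(input, containsHtmlSpecial(input));
}
```

A bare & or a bare &# passes because neither is followed by a complete entity ending in a semicolon.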

How can I make this regular expression not result in "catastrophic backtracking"?

I'm trying to use a URL matching regular expression that I got from http://daringfireball.net/2010/07/improved_regex_for_matching_urls
(?xi)
\b
( # Capture 1: entire matched URL
(?:
https?:// # http or https protocol
| # or
www\d{0,3}[.] # "www.", "www1.", "www2." … "www999."
| # or
[a-z0-9.\-]+[.][a-z]{2,4}/ # looks like domain name followed by a slash
)
(?: # One or more:
[^\s()<>]+ # Run of non-space, non-()<>
| # or
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
)+
(?: # End with:
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
| # or
[^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct chars
)
)
Based on the answers to another question, it appears that there are cases that cause this regex to backtrack catastrophically. For example:
var re = /\b((?:https?:\/\/|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i;
re.test("http://google.com/?q=(AAAAAAAAAAAAAAAAAAAAAAAAAAAAA)")
... can take a really long time to execute (e.g. in Chrome)
It seems to me that the problem lies in this part of the code:
(?: # One or more:
[^\s()<>]+ # Run of non-space, non-()<>
| # or
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
)+
... which seems to be roughly equivalent to (.+|\((.+|(\(.+\)))*\))+, which looks like it contains (.+)+
Is there a change I can make that will avoid that?
Changing it to the following should prevent the catastrophic backtracking:
(?xi)
\b
( # Capture 1: entire matched URL
(?:
https?:// # http or https protocol
| # or
www\d{0,3}[.] # "www.", "www1.", "www2." … "www999."
| # or
[a-z0-9.\-]+[.][a-z]{2,4}/ # looks like domain name followed by a slash
)
(?: # One or more:
[^\s()<>]+ # Run of non-space, non-()<>
| # or
\(([^\s()<>]|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
)+
(?: # End with:
\(([^\s()<>]|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
| # or
[^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct chars
)
)
The only change that was made was to remove the + after the first [^\s()<>] in each of the "balanced parens" portions of the regex.
Here is the one-line version for testing with JS:
var re = /\b((?:https?:\/\/|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i;
re.test("http://google.com/?q=(AAAAAAAAAAAAAAAAAAAAAAAAAAAAA")
The problem portion of the original regex is the balanced parentheses section. To simplify the explanation of why the backtracking occurs, I am going to remove the nested parentheses portion completely, because it isn't relevant here:
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # original
\(([^\s()<>]+)*\) # expanded below
\( # literal '('
( # start group, repeat zero or more times
[^\s()<>]+ # one or more non-special characters
)* # end group
\) # literal ')'
Consider what happens here with the string '(AAAAA'. The literal ( matches, then AAAAA is consumed by the group, and the ) fails to match. At this point the group gives up one A, leaving AAAA captured, and the engine attempts to continue the match from there. Because the group is followed by *, it can match multiple times, so now ([^\s()<>]+)* matches AAAA and then A on the second pass. When this fails, an additional A is given up by the first iteration and consumed by the second.
This would go on for a long while resulting in the following attempts to match, where each comma-separated group indicates a different time that the group is matched, and how many characters that instance matched:
AAAAA
AAAA, A
AAA, AA
AAA, A, A
AA, AAA
AA, AA, A
AA, A, AA
AA, A, A, A
....
I may have counted wrong, but I'm pretty sure it adds up to 16 steps before it is determined that the regex cannot match. As you continue to add additional characters to the string the number of steps to figure this out grows exponentially.
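The count of 16 can be checked by enumerating every way to split a run of n characters into consecutive non-empty groups, which is what ([^\s()<>]+)* effectively tries before giving up. The sketch below only models those splits; it does not measure the exact number of steps a real regex engine takes, and the function name splits is made up for illustration.

```javascript
// Hypothetical helper: lists every way to cut a run of n characters into
// consecutive non-empty groups, longest first group first (the order a
// greedy quantifier would try them in).
function splits(n) {
  if (n === 0) return [[]];
  const result = [];
  for (let first = n; first >= 1; first--) {
    for (const rest of splits(n - first)) {
      result.push([first, ...rest]);
    }
  }
  return result;
}

// Show the group lengths as runs of A, as in the listing above.
for (const s of splits(5)) {
  console.log(s.map((len) => "A".repeat(len)).join(", "));
}

// The number of splits doubles with each extra character.
for (const n of [5, 10, 20]) {
  console.log(n, splits(n).length);
}
```

For n characters there are 2^(n-1) such splits, which is why the 29-character run in the test URL is enough to stall the engine.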
By removing the + and changing this to \(([^\s()<>])*\), you would avoid this backtracking scenario.
Adding the alternation back in to check for the nested parentheses doesn't cause any problems.
Note that you may want to add some sort of anchor to the end of the string, because currently "http://google.com/?q=(AAAAAAAAAAAAAAAAAAAAAAAAAAAAA" will match up to just before the (, so re.test(...) would return true because http://google.com/?q= matches.
