Need a regular expression in javascript [duplicate] - javascript

This question already has answers here:
Regular expression to match a line that doesn't contain a word
(34 answers)
Closed 2 years ago.
I know that I can negate group of chars as in [^bar] but I need a regular expression where negation applies to the specific word - so in my example how do I negate an actual bar, and not "any chars in bar"?

A great way to do this is to use negative lookahead:
^(?!.*bar).*$
The negative lookahead construct is the pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point. Inside the lookahead [is any regex pattern].

Unless performance is of utmost concern, it's often easier just to run your results through a second pass, skipping those that match the words you want to negate.
Regular expressions usually mean you're doing scripting or some sort of low-performance task anyway, so find a solution that is easy to read, easy to understand and easy to maintain.

Solution:
^(?!.*STRING1|.*STRING2|.*STRING3).*$
xxxxxx OK
xxxSTRING1xxx KO (is whether it is desired)
xxxSTRING2xxx KO (is whether it is desired)
xxxSTRING3xxx KO (is whether it is desired)

You could either use a negative look-ahead or look-behind:
^(?!.*?bar).*
^(.(?<!bar))*?$
Or use just basics:
^(?:[^b]+|b(?:$|[^a]|a(?:$|[^r])))*$
These all match anything that does not contain bar.

The following regex will do what you want (as long as negative lookbehinds and lookaheads are supported), matching things properly; the only problem is that it matches individual characters (i.e. each match is a single character rather than all characters between two consecutive "bar"s), possibly resulting in a potential for high overhead if you're working with very long strings.
b(?!ar)|(?<!b)a|a(?!r)|(?<!ba)r|[^bar]

I came across this forum thread while trying to identify a regex for the following English statement:
Given an input string, match everything unless this input string is exactly 'bar'; for example I want to match 'barrier' and 'disbar' as well as 'foo'.
Here's the regex I came up with
^(bar.+|(?!bar).*)$
My English translation of the regex is "match the string if it starts with 'bar' and it has at least one other character, or if the string does not start with 'bar'.

The accepted answer is nice but is really a work-around for the lack of a simple sub-expression negation operator in regexes. This is why grep --invert-match exits. So in *nixes, you can accomplish the desired result using pipes and a second regex.
grep 'something I want' | grep --invert-match 'but not these ones'
Still a workaround, but maybe easier to remember.

If it's truly a word, bar that you don't want to match, then:
^(?!.*\bbar\b).*$
The above will match any string that does not contain bar that is on a word boundary, that is to say, separated from non-word characters. However, the period/dot (.) used in the above pattern will not match newline characters unless the correct regex flag is used:
^(?s)(?!.*\bbar\b).*$
Alternatively:
^(?!.*\bbar\b)[\s\S]*$
Instead of using any special flag, we are looking for any character that is either white space or non-white space. That should cover every character.
But what if we would like to match words that might contain bar, but just not the specific word bar?
(?!\bbar\b)\b\[A-Za-z-]*bar[a-z-]*\b
(?!\bbar\b) Assert that the next input is not bar on a word boundary.
\b\[A-Za-z-]*bar[a-z-]*\b Matches any word on a word boundary that contains bar.
See Regex Demo

Extracted from this comment by bkDJ:
^(?!bar$).*
The nice property of this solution is that it's possible to clearly negate (exclude) multiple words:
^(?!bar$|foo$|banana$).*

I wish to complement the accepted answer and contribute to the discussion with my late answer.
#ChrisVanOpstal shared this regex tutorial which is a great resource for learning regex.
However, it was really time consuming to read through.
I made a cheatsheet for mnemonic convenience.
This reference is based on the braces [], (), and {} leading each class, and I find it easy to recall.
Regex = {
'single_character': ['[]', '.', {'negate':'^'}],
'capturing_group' : ['()', '|', '\\', 'backreferences and named group'],
'repetition' : ['{}', '*', '+', '?', 'greedy v.s. lazy'],
'anchor' : ['^', '\b', '$'],
'non_printable' : ['\n', '\t', '\r', '\f', '\v'],
'shorthand' : ['\d', '\w', '\s'],
}

Just thought of something else that could be done. It's very different from my first answer, as it doesn't use regular expressions, so I decided to make a second answer post.
Use your language of choice's split() method equivalent on the string with the word to negate as the argument for what to split on. An example using Python:
>>> text = 'barbarasdbarbar 1234egb ar bar32 sdfbaraadf'
>>> text.split('bar')
['', '', 'asd', '', ' 1234egb ar ', '32 sdf', 'aadf']
The nice thing about doing it this way, in Python at least (I don't remember if the functionality would be the same in, say, Visual Basic or Java), is that it lets you know indirectly when "bar" was repeated in the string due to the fact that the empty strings between "bar"s are included in the list of results (though the empty string at the beginning is due to there being a "bar" at the beginning of the string). If you don't want that, you can simply remove the empty strings from the list.

I had a list of file names, and I wanted to exclude certain ones, with this sort of behavior (Ruby):
files = [
'mydir/states.rb', # don't match these
'countries.rb',
'mydir/states_bkp.rb', # match these
'mydir/city_states.rb'
]
excluded = ['states', 'countries']
# set my_rgx here
result = WankyAPI.filter(files, my_rgx) # I didn't write WankyAPI...
assert result == ['mydir/city_states.rb', 'mydir/states_bkp.rb']
Here's my solution:
excluded_rgx = excluded.map{|e| e+'\.'}.join('|')
my_rgx = /(^|\/)((?!#{excluded_rgx})[^\.\/]*)\.rb$/
My assumptions for this application:
The string to be excluded is at the beginning of the input, or immediately following a slash.
The permitted strings end with .rb.
Permitted filenames don't have a . character before the .rb.

Related

Replace a phrase in a string that is being broken up into 2 separate lines [duplicate]

Is there a simple way to ignore the white space in a target string when searching for matches using a regular expression pattern? For example, if my search is for "cats", I would want "c ats" or "ca ts" to match. I can't strip out the whitespace beforehand because I need to find the begin and end index of the match (including any whitespace) in order to highlight that match and any whitespace needs to be there for formatting purposes.
You can stick optional whitespace characters \s* in between every other character in your regex. Although granted, it will get a bit lengthy.
/cats/ -> /c\s*a\s*t\s*s/
While the accepted answer is technically correct, a more practical approach, if possible, is to just strip whitespace out of both the regular expression and the search string.
If you want to search for "my cats", instead of:
myString.match(/m\s*y\s*c\s*a\*st\s*s\s*/g)
Just do:
myString.replace(/\s*/g,"").match(/mycats/g)
Warning: You can't automate this on the regular expression by just replacing all spaces with empty strings because they may occur in a negation or otherwise make your regular expression invalid.
Addressing Steven's comment to Sam Dufel's answer
Thanks, sounds like that's the way to go. But I just realized that I only want the optional whitespace characters if they follow a newline. So for example, "c\n ats" or "ca\n ts" should match. But wouldn't want "c ats" to match if there is no newline. Any ideas on how that might be done?
This should do the trick:
/c(?:\n\s*)?a(?:\n\s*)?t(?:\n\s*)?s/
See this page for all the different variations of 'cats' that this matches.
You can also solve this using conditionals, but they are not supported in the javascript flavor of regex.
You could put \s* inbetween every character in your search string so if you were looking for cat you would use c\s*a\s*t\s*s\s*s
It's long but you could build the string dynamically of course.
You can see it working here: http://www.rubular.com/r/zzWwvppSpE
If you only want to allow spaces, then
\bc *a *t *s\b
should do it. To also allow tabs, use
\bc[ \t]*a[ \t]*t[ \t]*s\b
Remove the \b anchors if you also want to find cats within words like bobcats or catsup.
This approach can be used to automate this
(the following exemplary solution is in python, although obviously it can be ported to any language):
you can strip the whitespace beforehand AND save the positions of non-whitespace characters so you can use them later to find out the matched string boundary positions in the original string like the following:
def regex_search_ignore_space(regex, string):
no_spaces = ''
char_positions = []
for pos, char in enumerate(string):
if re.match(r'\S', char): # upper \S matches non-whitespace chars
no_spaces += char
char_positions.append(pos)
match = re.search(regex, no_spaces)
if not match:
return match
# match.start() and match.end() are indices of start and end
# of the found string in the spaceless string
# (as we have searched in it).
start = char_positions[match.start()] # in the original string
end = char_positions[match.end()] # in the original string
matched_string = string[start:end] # see
# the match WITH spaces is returned.
return matched_string
with_spaces = 'a li on and a cat'
print(regex_search_ignore_space('lion', with_spaces))
# prints 'li on'
If you want to go further you can construct the match object and return it instead, so the use of this helper will be more handy.
And the performance of this function can of course also be optimized, this example is just to show the path to a solution.
The accepted answer will not work if and when you are passing a dynamic value (such as "current value" in an array loop) as the regex test value. You would not be able to input the optional white spaces without getting some really ugly regex.
Konrad Hoffner's solution is therefore better in such cases as it will strip both the regest and test string of whitespace. The test will be conducted as though both have no whitespace.

How to exclude a whole word from a match using regexp [duplicate]

This question already has answers here:
Regular expression to match a line that doesn't contain a word
(34 answers)
Closed 2 years ago.
I know that I can negate group of chars as in [^bar] but I need a regular expression where negation applies to the specific word - so in my example how do I negate an actual bar, and not "any chars in bar"?
A great way to do this is to use negative lookahead:
^(?!.*bar).*$
The negative lookahead construct is the pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point. Inside the lookahead [is any regex pattern].
Unless performance is of utmost concern, it's often easier just to run your results through a second pass, skipping those that match the words you want to negate.
Regular expressions usually mean you're doing scripting or some sort of low-performance task anyway, so find a solution that is easy to read, easy to understand and easy to maintain.
Solution:
^(?!.*STRING1|.*STRING2|.*STRING3).*$
xxxxxx OK
xxxSTRING1xxx KO (is whether it is desired)
xxxSTRING2xxx KO (is whether it is desired)
xxxSTRING3xxx KO (is whether it is desired)
You could either use a negative look-ahead or look-behind:
^(?!.*?bar).*
^(.(?<!bar))*?$
Or use just basics:
^(?:[^b]+|b(?:$|[^a]|a(?:$|[^r])))*$
These all match anything that does not contain bar.
The following regex will do what you want (as long as negative lookbehinds and lookaheads are supported), matching things properly; the only problem is that it matches individual characters (i.e. each match is a single character rather than all characters between two consecutive "bar"s), possibly resulting in a potential for high overhead if you're working with very long strings.
b(?!ar)|(?<!b)a|a(?!r)|(?<!ba)r|[^bar]
I came across this forum thread while trying to identify a regex for the following English statement:
Given an input string, match everything unless this input string is exactly 'bar'; for example I want to match 'barrier' and 'disbar' as well as 'foo'.
Here's the regex I came up with
^(bar.+|(?!bar).*)$
My English translation of the regex is "match the string if it starts with 'bar' and it has at least one other character, or if the string does not start with 'bar'.
The accepted answer is nice but is really a work-around for the lack of a simple sub-expression negation operator in regexes. This is why grep --invert-match exits. So in *nixes, you can accomplish the desired result using pipes and a second regex.
grep 'something I want' | grep --invert-match 'but not these ones'
Still a workaround, but maybe easier to remember.
If it's truly a word, bar that you don't want to match, then:
^(?!.*\bbar\b).*$
The above will match any string that does not contain bar that is on a word boundary, that is to say, separated from non-word characters. However, the period/dot (.) used in the above pattern will not match newline characters unless the correct regex flag is used:
^(?s)(?!.*\bbar\b).*$
Alternatively:
^(?!.*\bbar\b)[\s\S]*$
Instead of using any special flag, we are looking for any character that is either white space or non-white space. That should cover every character.
But what if we would like to match words that might contain bar, but just not the specific word bar?
(?!\bbar\b)\b\[A-Za-z-]*bar[a-z-]*\b
(?!\bbar\b) Assert that the next input is not bar on a word boundary.
\b\[A-Za-z-]*bar[a-z-]*\b Matches any word on a word boundary that contains bar.
See Regex Demo
Extracted from this comment by bkDJ:
^(?!bar$).*
The nice property of this solution is that it's possible to clearly negate (exclude) multiple words:
^(?!bar$|foo$|banana$).*
I wish to complement the accepted answer and contribute to the discussion with my late answer.
#ChrisVanOpstal shared this regex tutorial which is a great resource for learning regex.
However, it was really time consuming to read through.
I made a cheatsheet for mnemonic convenience.
This reference is based on the braces [], (), and {} leading each class, and I find it easy to recall.
Regex = {
'single_character': ['[]', '.', {'negate':'^'}],
'capturing_group' : ['()', '|', '\\', 'backreferences and named group'],
'repetition' : ['{}', '*', '+', '?', 'greedy v.s. lazy'],
'anchor' : ['^', '\b', '$'],
'non_printable' : ['\n', '\t', '\r', '\f', '\v'],
'shorthand' : ['\d', '\w', '\s'],
}
Just thought of something else that could be done. It's very different from my first answer, as it doesn't use regular expressions, so I decided to make a second answer post.
Use your language of choice's split() method equivalent on the string with the word to negate as the argument for what to split on. An example using Python:
>>> text = 'barbarasdbarbar 1234egb ar bar32 sdfbaraadf'
>>> text.split('bar')
['', '', 'asd', '', ' 1234egb ar ', '32 sdf', 'aadf']
The nice thing about doing it this way, in Python at least (I don't remember if the functionality would be the same in, say, Visual Basic or Java), is that it lets you know indirectly when "bar" was repeated in the string due to the fact that the empty strings between "bar"s are included in the list of results (though the empty string at the beginning is due to there being a "bar" at the beginning of the string). If you don't want that, you can simply remove the empty strings from the list.
I had a list of file names, and I wanted to exclude certain ones, with this sort of behavior (Ruby):
files = [
'mydir/states.rb', # don't match these
'countries.rb',
'mydir/states_bkp.rb', # match these
'mydir/city_states.rb'
]
excluded = ['states', 'countries']
# set my_rgx here
result = WankyAPI.filter(files, my_rgx) # I didn't write WankyAPI...
assert result == ['mydir/city_states.rb', 'mydir/states_bkp.rb']
Here's my solution:
excluded_rgx = excluded.map{|e| e+'\.'}.join('|')
my_rgx = /(^|\/)((?!#{excluded_rgx})[^\.\/]*)\.rb$/
My assumptions for this application:
The string to be excluded is at the beginning of the input, or immediately following a slash.
The permitted strings end with .rb.
Permitted filenames don't have a . character before the .rb.

Add an additional condition to RFC 5322 Email Regex [duplicate]

This question already has answers here:
Regular expression to match a line that doesn't contain a word
(34 answers)
Closed 2 years ago.
I know that I can negate group of chars as in [^bar] but I need a regular expression where negation applies to the specific word - so in my example how do I negate an actual bar, and not "any chars in bar"?
A great way to do this is to use negative lookahead:
^(?!.*bar).*$
The negative lookahead construct is the pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point. Inside the lookahead [is any regex pattern].
Unless performance is of utmost concern, it's often easier just to run your results through a second pass, skipping those that match the words you want to negate.
Regular expressions usually mean you're doing scripting or some sort of low-performance task anyway, so find a solution that is easy to read, easy to understand and easy to maintain.
Solution:
^(?!.*STRING1|.*STRING2|.*STRING3).*$
xxxxxx OK
xxxSTRING1xxx KO (is whether it is desired)
xxxSTRING2xxx KO (is whether it is desired)
xxxSTRING3xxx KO (is whether it is desired)
You could either use a negative look-ahead or look-behind:
^(?!.*?bar).*
^(.(?<!bar))*?$
Or use just basics:
^(?:[^b]+|b(?:$|[^a]|a(?:$|[^r])))*$
These all match anything that does not contain bar.
The following regex will do what you want (as long as negative lookbehinds and lookaheads are supported), matching things properly; the only problem is that it matches individual characters (i.e. each match is a single character rather than all characters between two consecutive "bar"s), possibly resulting in a potential for high overhead if you're working with very long strings.
b(?!ar)|(?<!b)a|a(?!r)|(?<!ba)r|[^bar]
I came across this forum thread while trying to identify a regex for the following English statement:
Given an input string, match everything unless this input string is exactly 'bar'; for example I want to match 'barrier' and 'disbar' as well as 'foo'.
Here's the regex I came up with
^(bar.+|(?!bar).*)$
My English translation of the regex is "match the string if it starts with 'bar' and it has at least one other character, or if the string does not start with 'bar'.
The accepted answer is nice but is really a work-around for the lack of a simple sub-expression negation operator in regexes. This is why grep --invert-match exits. So in *nixes, you can accomplish the desired result using pipes and a second regex.
grep 'something I want' | grep --invert-match 'but not these ones'
Still a workaround, but maybe easier to remember.
If it's truly a word, bar that you don't want to match, then:
^(?!.*\bbar\b).*$
The above will match any string that does not contain bar that is on a word boundary, that is to say, separated from non-word characters. However, the period/dot (.) used in the above pattern will not match newline characters unless the correct regex flag is used:
^(?s)(?!.*\bbar\b).*$
Alternatively:
^(?!.*\bbar\b)[\s\S]*$
Instead of using any special flag, we are looking for any character that is either white space or non-white space. That should cover every character.
But what if we would like to match words that might contain bar, but just not the specific word bar?
(?!\bbar\b)\b\[A-Za-z-]*bar[a-z-]*\b
(?!\bbar\b) Assert that the next input is not bar on a word boundary.
\b\[A-Za-z-]*bar[a-z-]*\b Matches any word on a word boundary that contains bar.
See Regex Demo
Extracted from this comment by bkDJ:
^(?!bar$).*
The nice property of this solution is that it's possible to clearly negate (exclude) multiple words:
^(?!bar$|foo$|banana$).*
I wish to complement the accepted answer and contribute to the discussion with my late answer.
#ChrisVanOpstal shared this regex tutorial which is a great resource for learning regex.
However, it was really time consuming to read through.
I made a cheatsheet for mnemonic convenience.
This reference is based on the braces [], (), and {} leading each class, and I find it easy to recall.
Regex = {
'single_character': ['[]', '.', {'negate':'^'}],
'capturing_group' : ['()', '|', '\\', 'backreferences and named group'],
'repetition' : ['{}', '*', '+', '?', 'greedy v.s. lazy'],
'anchor' : ['^', '\b', '$'],
'non_printable' : ['\n', '\t', '\r', '\f', '\v'],
'shorthand' : ['\d', '\w', '\s'],
}
Just thought of something else that could be done. It's very different from my first answer, as it doesn't use regular expressions, so I decided to make a second answer post.
Use your language of choice's split() method equivalent on the string with the word to negate as the argument for what to split on. An example using Python:
>>> text = 'barbarasdbarbar 1234egb ar bar32 sdfbaraadf'
>>> text.split('bar')
['', '', 'asd', '', ' 1234egb ar ', '32 sdf', 'aadf']
The nice thing about doing it this way, in Python at least (I don't remember if the functionality would be the same in, say, Visual Basic or Java), is that it lets you know indirectly when "bar" was repeated in the string due to the fact that the empty strings between "bar"s are included in the list of results (though the empty string at the beginning is due to there being a "bar" at the beginning of the string). If you don't want that, you can simply remove the empty strings from the list.
I had a list of file names, and I wanted to exclude certain ones, with this sort of behavior (Ruby):
files = [
'mydir/states.rb', # don't match these
'countries.rb',
'mydir/states_bkp.rb', # match these
'mydir/city_states.rb'
]
excluded = ['states', 'countries']
# set my_rgx here
result = WankyAPI.filter(files, my_rgx) # I didn't write WankyAPI...
assert result == ['mydir/city_states.rb', 'mydir/states_bkp.rb']
Here's my solution:
excluded_rgx = excluded.map{|e| e+'\.'}.join('|')
my_rgx = /(^|\/)((?!#{excluded_rgx})[^\.\/]*)\.rb$/
My assumptions for this application:
The string to be excluded is at the beginning of the input, or immediately following a slash.
The permitted strings end with .rb.
Permitted filenames don't have a . character before the .rb.

RegEx to invalid a string if it contains a certain word in Javascript [duplicate]

I know it's possible to match a word and then reverse the matches using other tools (e.g. grep -v). However, is it possible to match lines that do not contain a specific word, e.g. hede, using a regular expression?
Input:
hoho
hihi
haha
hede
Code:
grep "<Regex for 'doesn't contain hede'>" input
Desired output:
hoho
hihi
haha
The notion that regex doesn't support inverse matching is not entirely true. You can mimic this behavior by using negative look-arounds:
^((?!hede).)*$
The regex above will match any string, or line without a line break, not containing the (sub)string 'hede'. As mentioned, this is not something regex is "good" at (or should do), but still, it is possible.
And if you need to match line break chars as well, use the DOT-ALL modifier (the trailing s in the following pattern):
/^((?!hede).)*$/s
or use it inline:
/(?s)^((?!hede).)*$/
(where the /.../ are the regex delimiters, i.e., not part of the pattern)
If the DOT-ALL modifier is not available, you can mimic the same behavior with the character class [\s\S]:
/^((?!hede)[\s\S])*$/
Explanation
A string is just a list of n characters. Before, and after each character, there's an empty string. So a list of n characters will have n+1 empty strings. Consider the string "ABhedeCD":
┌──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┐
S = │e1│ A │e2│ B │e3│ h │e4│ e │e5│ d │e6│ e │e7│ C │e8│ D │e9│
└──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┘
index 0 1 2 3 4 5 6 7
where the e's are the empty strings. The regex (?!hede). looks ahead to see if there's no substring "hede" to be seen, and if that is the case (so something else is seen), then the . (dot) will match any character except a line break. Look-arounds are also called zero-width-assertions because they don't consume any characters. They only assert/validate something.
So, in my example, every empty string is first validated to see if there's no "hede" up ahead, before a character is consumed by the . (dot). The regex (?!hede). will do that only once, so it is wrapped in a group, and repeated zero or more times: ((?!hede).)*. Finally, the start- and end-of-input are anchored to make sure the entire input is consumed: ^((?!hede).)*$
As you can see, the input "ABhedeCD" will fail because on e3, the regex (?!hede) fails (there is "hede" up ahead!).
Note that the solution to does not start with “hede”:
^(?!hede).*$
is generally much more efficient than the solution to does not contain “hede”:
^((?!hede).)*$
The former checks for “hede” only at the input string’s first position, rather than at every position.
If you're just using it for grep, you can use grep -v hede to get all lines which do not contain hede.
ETA Oh, rereading the question, grep -v is probably what you meant by "tools options".
Answer:
^((?!hede).)*$
Explanation:
^the beginning of the string,
( group and capture to \1 (0 or more times (matching the most amount possible)),
(?! look ahead to see if there is not,
hede your string,
) end of look-ahead,
. any character except \n,
)* end of \1 (Note: because you are using a quantifier on this capture, only the LAST repetition of the captured pattern will be stored in \1)
$ before an optional \n, and the end of the string
The given answers are perfectly fine, just an academic point:
Regular Expressions in the meaning of theoretical computer sciences ARE NOT ABLE do it like this. For them it had to look something like this:
^([^h].*$)|(h([^e].*$|$))|(he([^h].*$|$))|(heh([^e].*$|$))|(hehe.+$)
This only does a FULL match. Doing it for sub-matches would even be more awkward.
If you want the regex test to only fail if the entire string matches, the following will work:
^(?!hede$).*
e.g. -- If you want to allow all values except "foo" (i.e. "foofoo", "barfoo", and "foobar" will pass, but "foo" will fail), use: ^(?!foo$).*
Of course, if you're checking for exact equality, a better general solution in this case is to check for string equality, i.e.
myStr !== 'foo'
You could even put the negation outside the test if you need any regex features (here, case insensitivity and range matching):
!/^[a-f]oo$/i.test(myStr)
The regex solution at the top of this answer may be helpful, however, in situations where a positive regex test is required (perhaps by an API).
FWIW, since regular languages (aka rational languages) are closed under complementation, it's always possible to find a regular expression (aka rational expression) that negates another expression. But not many tools implement this.
Vcsn supports this operator (which it denotes {c}, postfix).
You first define the type of your expressions: labels are letter (lal_char) to pick from a to z for instance (defining the alphabet when working with complementation is, of course, very important), and the "value" computed for each word is just a Boolean: true the word is accepted, false, rejected.
In Python:
In [5]: import vcsn
c = vcsn.context('lal_char(a-z), b')
c
Out[5]: {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z} → 𝔹
then you enter your expression:
In [6]: e = c.expression('(hede){c}'); e
Out[6]: (hede)^c
convert this expression to an automaton:
In [7]: a = e.automaton(); a
finally, convert this automaton back to a simple expression.
In [8]: print(a.expression())
\e+h(\e+e(\e+d))+([^h]+h([^e]+e([^d]+d([^e]+e[^]))))[^]*
where + is usually denoted |, \e denotes the empty word, and [^] is usually written . (any character). So, with a bit of rewriting ()|h(ed?)?|([^h]|h([^e]|e([^d]|d([^e]|e.)))).*.
You can see this example here, and try Vcsn online there.
Here's a good explanation of why it's not easy to negate an arbitrary regex. I have to agree with the other answers, though: if this is anything other than a hypothetical question, then a regex is not the right choice here.
With negative lookahead, regular expression can match something not contains specific pattern. This is answered and explained by Bart Kiers. Great explanation!
However, with Bart Kiers' answer, the lookahead part will test 1 to 4 characters ahead while matching any single character. We can avoid this and let the lookahead part check out the whole text, ensure there is no 'hede', and then the normal part (.*) can eat the whole text all at one time.
Here is the improved regex:
/^(?!.*?hede).*$/
Note the (*?) lazy quantifier in the negative lookahead part is optional, you can use (*) greedy quantifier instead, depending on your data: if 'hede' does present and in the beginning half of the text, the lazy quantifier can be faster; otherwise, the greedy quantifier be faster. However if 'hede' does not present, both would be equal slow.
Here is the demo code.
For more information about lookahead, please check out the great article: Mastering Lookahead and Lookbehind.
Also, please check out RegexGen.js, a JavaScript Regular Expression Generator that helps to construct complex regular expressions. With RegexGen.js, you can construct the regex in a more readable way:
var _ = regexGen;
var regex = _(
_.startOfLine(),
_.anything().notContains( // match anything that not contains:
_.anything().lazy(), 'hede' // zero or more chars that followed by 'hede',
// i.e., anything contains 'hede'
),
_.endOfLine()
);
Benchmarks
I decided to evaluate some of the presented Options and compare their performance, as well as use some new Features.
Benchmarking on .NET Regex Engine: http://regexhero.net/tester/
Benchmark Text:
The first 7 lines should not match, since they contain the searched Expression, while the lower 7 lines should match!
Regex Hero is a real-time online Silverlight Regular Expression Tester.
XRegex Hero is a real-time online Silverlight Regular Expression Tester.
Regex HeroRegex HeroRegex HeroRegex HeroRegex Hero is a real-time online Silverlight Regular Expression Tester.
Regex Her Regex Her Regex Her Regex Her Regex Her Regex Her Regex Hero is a real-time online Silverlight Regular Expression Tester.
Regex Her is a real-time online Silverlight Regular Expression Tester.Regex Hero
egex Hero egex Hero egex Hero egex Hero egex Hero egex Hero Regex Hero is a real-time online Silverlight Regular Expression Tester.
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRegex Hero is a real-time online Silverlight Regular Expression Tester.
Regex Her
egex Hero
egex Hero is a real-time online Silverlight Regular Expression Tester.
Regex Her is a real-time online Silverlight Regular Expression Tester.
Regex Her Regex Her Regex Her Regex Her Regex Her Regex Her is a real-time online Silverlight Regular Expression Tester.
Nobody is a real-time online Silverlight Regular Expression Tester.
Regex Her o egex Hero Regex Hero Reg ex Hero is a real-time online Silverlight Regular Expression Tester.
Results:
Results are Iterations per second as the median of 3 runs - Bigger Number = Better
01: ^((?!Regex Hero).)*$ 3.914 // Accepted Answer
02: ^(?:(?!Regex Hero).)*$ 5.034 // With Non-Capturing group
03: ^(?!.*?Regex Hero).* 7.356 // Lookahead at the beginning, if not found match everything
04: ^(?>[^R]+|R(?!egex Hero))*$ 6.137 // Lookahead only on the right first letter
05: ^(?>(?:.*?Regex Hero)?)^.*$ 7.426 // Match the word and check if you're still at linestart
06: ^(?(?=.*?Regex Hero)(?#fail)|.*)$ 7.371 // Logic Branch: Find Regex Hero? match nothing, else anything
P1: ^(?(?=.*?Regex Hero)(*FAIL)|(*ACCEPT)) ????? // Logic Branch in Perl - Quick FAIL
P2: .*?Regex Hero(*COMMIT)(*FAIL)|(*ACCEPT) ????? // Direct COMMIT & FAIL in Perl
Since .NET doesn't support action Verbs (*FAIL, etc.) I couldn't test the solutions P1 and P2.
Summary:
The overall most readable and performance-wise fastest solution seems to be 03 with a simple negative lookahead. This is also the fastest solution for JavaScript, since JS does not support the more advanced Regex Features for the other solutions.
Not regex, but I've found it logical and useful to use serial greps with pipe to eliminate noise.
eg. search an apache config file without all the comments-
grep -v '\#' /opt/lampp/etc/httpd.conf # this gives all the non-comment lines
and
grep -v '\#' /opt/lampp/etc/httpd.conf | grep -i dir
The logic of serial grep's is (not a comment) and (matches dir)
Since no one else has given a direct answer to the question that was asked, I'll do it.
The answer is that with POSIX grep, it's impossible to literally satisfy this request:
grep "<Regex for 'doesn't contain hede'>" input
The reason is that with no flags, POSIX grep is only required to work with Basic Regular Expressions (BREs), which are simply not powerful enough for accomplishing that task, because of lack of alternation in subexpressions. The only kind of alternation it supports involves providing multiple regular expressions separated by newlines, and that doesn't cover all regular languages, e.g. there's no finite collection of BREs that matches the same regular language as the extended regular expression (ERE) ^(ab|cd)*$.
However, GNU grep implements extensions that allow it. In particular, \| is the alternation operator in GNU's implementation of BREs. If your regular expression engine supports alternation, parentheses and the Kleene star, and is able to anchor to the beginning and end of the string, that's all you need for this approach. Note however that negative sets [^ ... ] are very convenient in addition to those, because otherwise, you need to replace them with an expression of the form (a|b|c| ... ) that lists every character that is not in the set, which is extremely tedious and overly long, even more so if the whole character set is Unicode.
Thanks to formal language theory, we get to see how such an expression looks like. With GNU grep, the answer would be something like:
grep "^\([^h]\|h\(h\|eh\|edh\)*\([^eh]\|e[^dh]\|ed[^eh]\)\)*\(\|h\(h\|eh\|edh\)*\(\|e\|ed\)\)$" input
(found with Grail and some further optimizations made by hand).
You can also use a tool that implements EREs, like egrep, to get rid of the backslashes, or equivalently, pass the -E flag to POSIX grep (although I was under the impression that the question required avoiding any flags to grep whatsoever):
egrep "^([^h]|h(h|eh|edh)*([^eh]|e[^dh]|ed[^eh]))*(|h(h|eh|edh)*(|e|ed))$" input
Here's a script to test it (note it generates a file testinput.txt in the current directory). Several of the expressions presented in other answers fail this test.
#!/bin/bash
REGEX="^\([^h]\|h\(h\|eh\|edh\)*\([^eh]\|e[^dh]\|ed[^eh]\)\)*\(\|h\(h\|eh\|edh\)*\(\|e\|ed\)\)$"
# First four lines as in OP's testcase.
cat > testinput.txt <<EOF
hoho
hihi
haha
hede
h
he
ah
head
ahead
ahed
aheda
ahede
hhede
hehede
hedhede
hehehehehehedehehe
hedecidedthat
EOF
diff -s -u <(grep -v hede testinput.txt) <(grep "$REGEX" testinput.txt)
In my system it prints:
Files /dev/fd/63 and /dev/fd/62 are identical
as expected.
For those interested in the details, the technique employed is to convert the regular expression that matches the word into a finite automaton, then invert the automaton by changing every acceptance state to non-acceptance and vice versa, and then converting the resulting FA back to a regular expression.
As everyone has noted, if your regular expression engine supports negative lookahead, the regular expression is much simpler. For example, with GNU grep:
grep -P '^((?!hede).)*$' input
However, this approach has the disadvantage that it requires a backtracking regular expression engine. This makes it unsuitable in installations that are using secure regular expression engines like RE2, which is one reason to prefer the generated approach in some circumstances.
Using Kendall Hopkins' excellent FormalTheory library, written in PHP, which provides a functionality similar to Grail, and a simplifier written by myself, I've been able to write an online generator of negative regular expressions given an input phrase (only alphanumeric and space characters currently supported, and the length is limited): http://www.formauri.es/personal/pgimeno/misc/non-match-regex/
For hede it outputs:
^([^h]|h(h|e(h|dh))*([^eh]|e([^dh]|d[^eh])))*(h(h|e(h|dh))*(ed?)?)?$
which is equivalent to the above.
with this, you avoid to test a lookahead on each positions:
/^(?:[^h]+|h++(?!ede))*+$/
equivalent to (for .net):
^(?>(?:[^h]+|h+(?!ede))*)$
Old answer:
/^(?>[^h]+|h+(?!ede))*$/
Aforementioned (?:(?!hede).)* is great because it can be anchored.
^(?:(?!hede).)*$ # A line without hede
foo(?:(?!hede).)*bar # foo followed by bar, without hede between them
But the following would suffice in this case:
^(?!.*hede) # A line without hede
This simplification is ready to have "AND" clauses added:
^(?!.*hede)(?=.*foo)(?=.*bar) # A line with foo and bar, but without hede
^(?!.*hede)(?=.*foo).*bar # Same
An, in my opinon, more readable variant of the top answer:
^(?!.*hede)
Basically, "match at the beginning of the line if and only if it does not have 'hede' in it" - so the requirement translated almost directly into regex.
Of course, it's possible to have multiple failure requirements:
^(?!.*(hede|hodo|hada))
Details: The ^ anchor ensures the regex engine doesn't retry the match at every location in the string, which would match every string.
The ^ anchor in the beginning is meant to represent the beginning of the line. The grep tool matches each line one at a time, in contexts where you're working with a multiline string, you can use the "m" flag:
/^(?!.*hede)/m # JavaScript syntax
or
(?m)^(?!.*hede) # Inline flag
Here's how I'd do it:
^[^h]*(h(?!ede)[^h]*)*$
Accurate and more efficient than the other answers. It implements Friedl's "unrolling-the-loop" efficiency technique and requires much less backtracking.
Another option is that to add a positive look-ahead and check if hede is anywhere in the input line, then we would negate that, with an expression similar to:
^(?!(?=.*\bhede\b)).*$
with word boundaries.
The expression is explained on the top right panel of regex101.com, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs, if you like.
RegEx Circuit
jex.im visualizes regular expressions:
If you want to match a character to negate a word similar to negate character class:
For example, a string:
<?
$str="aaa bbb4 aaa bbb7";
?>
Do not use:
<?
preg_match('/aaa[^bbb]+?bbb7/s', $str, $matches);
?>
Use:
<?
preg_match('/aaa(?:(?!bbb).)+?bbb7/s', $str, $matches);
?>
Notice "(?!bbb)." is neither lookbehind nor lookahead, it's lookcurrent, for example:
"(?=abc)abcde", "(?!abc)abcde"
The OP did not specify or Tag the post to indicate the context (programming language, editor, tool) the Regex will be used within.
For me, I sometimes need to do this while editing a file using Textpad.
Textpad supports some Regex, but does not support lookahead or lookbehind, so it takes a few steps.
If I am looking to retain all lines that Do NOT contain the string hede, I would do it like this:
1. Search/replace the entire file to add a unique "Tag" to the beginning of each line containing any text.
Search string:^(.)
Replace string:<##-unique-##>\1
Replace-all
2. Delete all lines that contain the string hede (replacement string is empty):
Search string:<##-unique-##>.*hede.*\n
Replace string:<nothing>
Replace-all
3. At this point, all remaining lines Do NOT contain the string hede. Remove the unique "Tag" from all lines (replacement string is empty):
Search string:<##-unique-##>
Replace string:<nothing>
Replace-all
Now you have the original text with all lines containing the string hede removed.
If I am looking to Do Something Else to only lines that Do NOT contain the string hede, I would do it like this:
1. Search/replace the entire file to add a unique "Tag" to the beginning of each line containing any text.
Search string:^(.)
Replace string:<##-unique-##>\1
Replace-all
2. For all lines that contain the string hede, remove the unique "Tag":
Search string:<##-unique-##>(.*hede)
Replace string:\1
Replace-all
3. At this point, all lines that begin with the unique "Tag", Do NOT contain the string hede. I can now do my Something Else to only those lines.
4. When I am done, I remove the unique "Tag" from all lines (replacement string is empty):
Search string:<##-unique-##>
Replace string:<nothing>
Replace-all
Since the introduction of ruby-2.4.1, we can use the new Absent Operator in Ruby’s Regular Expressions
from the official doc
(?~abc) matches: "", "ab", "aab", "cccc", etc.
It doesn't match: "abc", "aabc", "ccccabc", etc.
Thus, in your case ^(?~hede)$ does the job for you
2.4.1 :016 > ["hoho", "hihi", "haha", "hede"].select{|s| /^(?~hede)$/.match(s)}
=> ["hoho", "hihi", "haha"]
Through PCRE verb (*SKIP)(*F)
^hede$(*SKIP)(*F)|^.*$
This would completely skips the line which contains the exact string hede and matches all the remaining lines.
DEMO
Execution of the parts:
Let us consider the above regex by splitting it into two parts.
Part before the | symbol. Part shouldn't be matched.
^hede$(*SKIP)(*F)
Part after the | symbol. Part should be matched.
^.*$
PART 1
Regex engine will start its execution from the first part.
^hede$(*SKIP)(*F)
Explanation:
^ Asserts that we are at the start.
hede Matches the string hede
$ Asserts that we are at the line end.
So the line which contains the string hede would be matched. Once the regex engine sees the following (*SKIP)(*F) (Note: You could write (*F) as (*FAIL)) verb, it skips and make the match to fail. | called alteration or logical OR operator added next to the PCRE verb which inturn matches all the boundaries exists between each and every character on all the lines except the line contains the exact string hede. See the demo here. That is, it tries to match the characters from the remaining string. Now the regex in the second part would be executed.
PART 2
^.*$
Explanation:
^ Asserts that we are at the start. ie, it matches all the line starts except the one in the hede line. See the demo here.
.* In the Multiline mode, . would match any character except newline or carriage return characters. And * would repeat the previous character zero or more times. So .* would match the whole line. See the demo here.
Hey why you added .* instead of .+ ?
Because .* would match a blank line but .+ won't match a blank. We want to match all the lines except hede , there may be a possibility of blank lines also in the input . so you must use .* instead of .+ . .+ would repeat the previous character one or more times. See .* matches a blank line here.
$ End of the line anchor is not necessary here.
The TXR Language supports regex negation.
$ txr -c '#(repeat)
#{nothede /~hede/}
#(do (put-line nothede))
#(end)' Input
A more complicated example: match all lines that start with a and end with z, but do not contain the substring hede:
$ txr -c '#(repeat)
#{nothede /a.*z&~.*hede.*/}
#(do (put-line nothede))
#(end)' -
az <- echoed
az
abcz <- echoed
abcz
abhederz <- not echoed; contains hede
ahedez <- not echoed; contains hede
ace <- not echoed; does not end in z
ahedz <- echoed
ahedz
Regex negation is not particularly useful on its own but when you also have intersection, things get interesting, since you have a full set of boolean set operations: you can express "the set which matches this, except for things which match that".
It may be more maintainable to two regexes in your code, one to do the first match, and then if it matches run the second regex to check for outlier cases you wish to block for example ^.*(hede).* then have appropriate logic in your code.
OK, I admit this is not really an answer to the posted question posted and it may also use slightly more processing than a single regex. But for developers who came here looking for a fast emergency fix for an outlier case then this solution should not be overlooked.
The below function will help you get your desired output
<?PHP
function removePrepositions($text){
$propositions=array('/\bfor\b/i','/\bthe\b/i');
if( count($propositions) > 0 ) {
foreach($propositions as $exceptionPhrase) {
$text = preg_replace($exceptionPhrase, '', trim($text));
}
$retval = trim($text);
}
return $retval;
}
?>
I wanted to add another example for if you are trying to match an entire line that contains string X, but does not also contain string Y.
For example, let's say we want to check if our URL / string contains "tasty-treats", so long as it does not also contain "chocolate" anywhere.
This regex pattern would work (works in JavaScript too)
^(?=.*?tasty-treats)((?!chocolate).)*$
(global, multiline flags in example)
Interactive Example: https://regexr.com/53gv4
Matches
(These urls contain "tasty-treats" and also do not contain "chocolate")
example.com/tasty-treats/strawberry-ice-cream
example.com/desserts/tasty-treats/banana-pudding
example.com/tasty-treats-overview
Does Not Match
(These urls contain "chocolate" somewhere - so they won't match even though they contain "tasty-treats")
example.com/tasty-treats/chocolate-cake
example.com/home-cooking/oven-roasted-chicken
example.com/tasty-treats/banana-chocolate-fudge
example.com/desserts/chocolate/tasty-treats
example.com/chocolate/tasty-treats/desserts
As long as you are dealing with lines, simply mark the negative matches and target the rest.
In fact, I use this trick with sed because ^((?!hede).)*$ looks not supported by it.
For the desired output
Mark the negative match: (e.g. lines with hede), using a character not included in the whole text at all. An emoji could probably be a good choice for this purpose.
s/(.*hede)/🔒\1/g
Target the rest (the unmarked strings: e.g. lines without hede). Suppose you want to keep only the target and delete the rest (as you want):
s/^🔒.*//g
For a better understanding
Suppose you want to delete the target:
Mark the negative match: (e.g. lines with hede), using a character not included in the whole text at all. An emoji could probably be a good choice for this purpose.
s/(.*hede)/🔒\1/g
Target the rest (the unmarked strings: e.g. lines without hede). Suppose you want to delete the target:
s/^[^🔒].*//g
Remove the mark:
s/🔒//g
^((?!hede).)*$ is an elegant solution, except since it consumes characters you won't be able to combine it with other criteria. For instance, say you wanted to check for the non-presence of "hede" and the presence of "haha." This solution would work because it won't consume characters:
^(?!.*\bhede\b)(?=.*\bhaha\b)
How to use PCRE's backtracking control verbs to match a line not containing a word
Here's a method that I haven't seen used before:
/.*hede(*COMMIT)^|/
How it works
First, it tries to find "hede" somewhere in the line. If successful, at this point, (*COMMIT) tells the engine to, not only not backtrack in the event of a failure, but also not to attempt any further matching in that case. Then, we try to match something that cannot possibly match (in this case, ^).
If a line does not contain "hede" then the second alternative, an empty subpattern, successfully matches the subject string.
This method is no more efficient than a negative lookahead, but I figured I'd just throw it on here in case someone finds it nifty and finds a use for it for other, more interesting applications.
Simplest thing that I could find would be
[^(hede)]
Tested at https://regex101.com/
You can also add unit-test cases on that site
A simpler solution is to use the not operator !
Your if statement will need to match "contains" and not match "excludes".
var contains = /abc/;
var excludes =/hede/;
if(string.match(contains) && !(string.match(excludes))){ //proceed...
I believe the designers of RegEx anticipated the use of not operators.

Unable to find a string matching a regex pattern

While trying to submit a form a javascript regex validation always proves to be false for a string.
Regex:- ^(([a-zA-Z]:)|(\\\\{2}\\w+)\\$?)(\\\\(\\w[\\w].*))+(.jpeg|.JPEG|.jpg|.JPG)$
I have tried following strings against it
abc.jpg,
abc:.jpg,
a:.jpg,
a:asdas.jpg,
What string could possible match this regex ?
This regex won't match against anything because of that $? in the middle of the string.
Apparently using the optional modifier ? on the end string symbol $ is not correct (if you paste it on https://regex101.com/ it will give you an error indeed). If the javascript parser ignores the error and keeps the regex as it is this still means you are going to match an end string in the middle of a string which is supposed to continue.
Unescaped it was supposed to match a \$ (dollar symbol) but as it is written it won't work.
If you want your string to be accepted at any cost you can probably use Firebug or a similar developer tool and edit the string inside the javascript code (this, assuming there's no server side check too and assuming it's not wrong aswell). If you ignore the $? then a matching string will be \\\\w\\\\ww.jpg (but since the . is unescaped even \\\\w\\\\ww%jpg is a match)
Of course, I wrote this answer assuming the escaping is indeed the one you showed in the question. If you need to find a matching pattern for the correctly escaped one ^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))+(\.jpeg|\.JPEG|\.jpg|\.JPG)$ then you can use this tool to find one http://fent.github.io/randexp.js/ (though it will find weird matches). A matching pattern is c:\zz.jpg
If you are just looking for a regular expression to match what you got there, go ahead and test this out:
(\w+:?\w*\.[jpe?gJPE?G]+,)
That should match exactly what you are looking for. Remove the optional comma at the end if you feel like it, of course.
If you remove escape level, the actual regex is
^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))+(.jpeg|.JPEG|.jpg|.JPG)$
After ^start the first pipe (([a-zA-Z]:)|(\\{2}\w+)\$?) which matches an alpha followed by a colon or two backslashes followed by one or more word characters, followed by an optional literal $. There is some needless parenthesis used inside.
The second part (\\(\w[\w].*))+ matches a backslash, followed by two word characters \w[\w] which looks weird because it's equivalent to \w\w (don't need a character class for second \w). Followed by any amount of any character. This whole thing one or more times.
In the last part (.jpeg|.JPEG|.jpg|.JPG) one probably forgot to escape the dot for matching a literal. \. should be used. This part can be reduced to \.(JPE?G|jpe?g).
It would match something like
A:\12anything.JPEG
\\1$\anything.jpg
Play with it at regex101. A better readable could be
^([a-zA-Z]:|\\{2}\w+\$?)(\\\w{2}.*)+\.(jpe?g|JPE?G)$
Also read the explanation on regex101 to understand any pattern, it's helpful!

Categories

Resources