I am looking at this sub-expression (this is in JavaScript):
(?:^|.....)
I know that ? means "zero or one times" when it follows a character, but not sure what it means in this context.
When working with groups, you often have several options that modify the behavior of the group:
(foo) // default behavior, matches "foo" and stores a back-reference
(?:foo) // non-capturing group: matches "foo", but doesn't store a back-ref
(?i:foo) // matches "foo" case-insensitively
(?=foo) // matches "foo", but does not advance the current position
// ("positive zero-width look-ahead assertion")
(?!foo) // succeeds only if "foo" does not follow, and does not advance the position
// ("negative zero-width look-ahead assertion")
to name a few.
They all begin with "?", which is the way to indicate a group modifier. The question mark has nothing to do with optionality in this case.
It simply says:
(?:^foo) // match "foo" at the start of the line, but do not store a back-ref
Sometimes it's just overkill to store a back-reference to some part of the match that you are not going to use anyway. When the group is there only to make a complex expression atomic (e.g. it should either match or fail as a whole), storing a back-reference is an unnecessary waste of resources that can even slow down the regex a bit. And sometimes, you just want group 1 to be the first group relevant to you, instead of the first group in the regex.
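A minimal sketch of that last point, using a made-up input string ("go north" is only an illustration, not from the question). With a plain group the word "go" occupies group 1; with (?:...) the first group you actually care about becomes group 1:

```javascript
// Capturing group: "go" is stored as group 1, the direction as group 2.
const capturing = "go north".match(/(go) (\w+)/);

// Non-capturing group: "go" is still matched, but the direction is now group 1.
const nonCapturing = "go north".match(/(?:go) (\w+)/);

console.log(capturing.slice(1));    // [ 'go', 'north' ]
console.log(nonCapturing.slice(1)); // [ 'north' ]
```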
You're probably seeing it in this context
(?:...)
It means that the group won't be captured or used for back-references.
EDIT: To reflect your modified question:
(?:^|....)
means "match the beginning of the line or match ..." but don't capture the group or use it for back-references.
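Since the right-hand side of the alternation is elided in the question, here is a sketch with a hypothetical stand-in: a comma. The pattern below (an assumption, not the asker's regex) finds a run of digits either at the start of the string or right after a comma, and only the digits are stored:

```javascript
// Hypothetical pattern: digits at the start of the input or directly after a comma.
const fieldPattern = /(?:^|,)(\d+)/g;

const fields = [];
for (const m of "10,20,x,30".matchAll(fieldPattern)) {
  // m[1] holds the digits; the (?:^|,) prefix is matched but not stored.
  fields.push(m[1]);
}

console.log(fields); // [ '10', '20', '30' ]
```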
(?:some stuff) means that you don't want to capture what the expression in the parentheses matches. Normally the pieces of a regexp grouped in parentheses are captured and can be referenced individually (this is called using backreferences).
See http://www.regular-expressions.info/brackets.html
Short Answer
It flags the (parenthetical) group as a non-capturing group.
Details About This Particular Expression
The notation for a non-capturing group is:
(?:<expression>)
In the instance you presented, the caret (^) is part of the expression, not part of the non-capturing group notation. Outside a character class it is the usual start-of-input anchor (or start-of-line anchor when the m flag is set).
It looks like they're using an 'or' operator (the pipe) with the caret. So they're looking to match either the start of the input or whatever is on the right of the pipe, but not capture the expression as a group (accomplished with the ?: at the beginning of the group).
In General
Non-capturing groups allow you to group an expression in a way that won't be back-referenceable, and can also slightly improve the performance of the expression.
"(?:x) Matches 'x' but does not remember the match."
https://developer.mozilla.org/en/Core_JavaScript_1.5_Guide/Regular_Expressions
?: generally marks the group as non-capturing. You can do some research here.
I'm almost positive any regex engine should support it, but when I switch between engines I run into some quirks.
Edit: This does seem to be the case; non-capturing groups seem to work fine.
Related
I have a regex for a game that should match strings in the form of go [anything] or [cardinal direction], and capture either the [anything] or the [cardinal direction]. For example, the following would match:
go north
go foo
north
And the following would not match:
foo
go
I was able to do this using two separate regexes: /^(?:go (.+))$/ to match the first case, and /^(north|east|south|west)$/ to match the second case. I tried to combine the regexes to be /^(?:go (.+))|(north|east|south|west)$/. The regex matches all of my test cases correctly, but it doesn't correctly capture for the second case. I tried plugging the regex into RegExr and noticed that even though the first case wasn't being matched against, it was still being captured.
How can I correct this?
Try using the positive lookbehind feature to find the word "go".
(north|east|south|west|(?<=go ).+)$
Note that this solution prevents you from including ^ at the start of the regex, because the lookbehind does not consume the text "go ", so the match itself starts after it.
You have to move the closing parenthesis to the end of the pattern to have both patterns between anchors, or else you would allow a match before one of the cardinal directions and it would still capture the cardinal direction at the end of the string.
Then in the JavaScript you can check for the group 1 or group 2 value.
^(?:go (.+)|(north|east|south|west))$
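A sketch of how the group check could look in JavaScript. The helper name parseCommand is made up for illustration; the pattern is the one given above. Only one of the two groups takes part in any match, and the other is undefined:

```javascript
const command = /^(?:go (.+)|(north|east|south|west))$/;

// Hypothetical helper: returns the captured target, or null if the input is not a command.
function parseCommand(input) {
  const m = command.exec(input);
  if (m === null) return null;
  // Group 1 is set for "go [anything]", group 2 for a bare cardinal direction.
  return m[1] !== undefined ? m[1] : m[2];
}

console.log(parseCommand("go north")); // 'north'
console.log(parseCommand("go foo"));   // 'foo'
console.log(parseCommand("north"));    // 'north'
console.log(parseCommand("foo"));      // null
console.log(parseCommand("go"));       // null
```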
Using a lookbehind assertion (if supported), you might also get a match only instead of capture groups.
In that case, you can match the rest of the line, asserting go to the left at the start of the string, or match only 1 of the cardinal directions:
(?<=^go ).+|^(?:north|east|south|west)$
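A sketch of the lookbehind variant, assuming an engine that supports lookbehind (it was added in ES2018, so older browsers will reject the pattern). The helper name parseWithLookbehind is hypothetical. Here the whole match, m[0], is already the value you want, so no capture groups are needed:

```javascript
const commandMatch = /(?<=^go ).+|^(?:north|east|south|west)$/;

// Hypothetical helper: returns the matched text, or null if there is no match.
function parseWithLookbehind(input) {
  const m = input.match(commandMatch);
  return m === null ? null : m[0];
}

console.log(parseWithLookbehind("go north")); // 'north'
console.log(parseWithLookbehind("go foo"));   // 'foo'
console.log(parseWithLookbehind("north"));    // 'north'
console.log(parseWithLookbehind("foo"));      // null
console.log(parseWithLookbehind("go"));       // null
```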
This is an example string:
123456#p654321
Currently, I am using this match to capture 123456 and 654321 into two different groups:
([0-9].*)#p([0-9].*)
But on occasions, the #p654321 part of the string will not be there, so I will only want to capture the first group. I tried to make the second group "optional" by appending ? to it, which works, but only as long as there is a #p at the end of the remaining string.
What would be the best way to solve this problem?
You have the #p outside of the capturing group, which makes it a required piece of the result. You are also using the dot character (.) improperly. Dot (in most reg-ex variants) will match any character. Change it to:
([0-9]*)(?:#p([0-9]*))?
The (?:) syntax is how you get a non-capturing group. We then capture just the digits that you're interested in. Finally, we make the whole thing optional.
Also, most reg-ex variants have a \d character class for digits. So you could simplify even further:
(\d*)(?:#p(\d*))?
As another person has pointed out, the * operator could potentially match zero digits. To prevent this, use the + operator instead:
(\d+)(?:#p(\d+))?
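A quick check of that last pattern in JavaScript (the language is an assumption, the question doesn't name one). When the optional #p part is absent, the second group simply comes back as undefined:

```javascript
const idPattern = /(\d+)(?:#p(\d+))?/;

const withSuffix = "123456#p654321".match(idPattern);
const withoutSuffix = "123456".match(idPattern);

console.log(withSuffix[1], withSuffix[2]);       // 123456 654321
console.log(withoutSuffix[1], withoutSuffix[2]); // 123456 undefined
```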
Your regex will actually match no digits, because you've used * instead of +.
This is what (I think) you want:
(\d+)(?:#p(\d+))?
This article says:
Make sure you are clear on the fact that an expression pattern is
tested on each individual character. And that, just because the engine
moves forward when following the pattern and looking for a match it
still backtracks and examines each character in a string until a match
is found or if the global flag is set until all characters are
examined.
But what I tested in Javascript
"aaa#bbb".match(/a+#b+/g)
does not produce a result like:
["aaa#bbb", "aa#bbb", "a#bbb"]
It only produces ["aaa#bbb"]. It seems it does not examine each character to test the pattern. Can anyone explain the matching steps a little? Thanks.
/g does not mean it will try to find every possible subset of characters in the input string which may match the given pattern. It means that once a match is found, it will continue searching for more substrings which may match the pattern starting from the previous match onward.
For example:
"aaa#bbb ... aaaa#bbbb".match(/a+#b+/g);
Will produce
["aaa#bbb", "aaaa#bbbb"]
That explanation is mixing two distinct concepts that IMO should be kept separate
A) backtracking
When looking for a match the normal behavior for a quantifier (?, *, +) is to be "greedy", i.e. to munch as much as possible... for example in /(a+)([^b]+)/ tested with aaaacccc all the a will be part of group 1 even if of course they also match the char set [^b] (everything but b).
However if grabbing too much is going to prevent a match the RE rules require that the quantifier "backtracks" capturing less if this allows the expression to match. For example in (a+)([^b]+) tested with aaaa the group 1 will get only three as, to leave one for group 2 to match.
You can change this greedy behavior with "non-greedy quantifiers" like *?, +?, ??. In this case the engine still backtracks, but in the reverse direction: a non-greedy subexpression will eat as little as possible while still allowing the rest of the expression to match. For example (a+)(a+b+) tested with aaabbb will leave two a's for group 1 and abbb for group 2, but (a+?)(a+b+) with the same string will leave only one a for group 1, because this is the minimum that allows the remaining part to match.
Note also that, because of backtracking, the greedy/non-greedy choice doesn't change whether an expression has a match, only how big the match is and how much goes to each sub-expression.
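The examples from part A can be checked directly in JavaScript. This is just a sketch that prints the group contents for the patterns discussed above:

```javascript
// Greedy a+ keeps all the a's when nothing forces it to give any back.
const noBacktrack = "aaaacccc".match(/(a+)([^b]+)/);
// Greedy a+ must give one "a" back so that [^b]+ has something to match.
const giveBack = "aaaa".match(/(a+)([^b]+)/);
// Greedy vs. non-greedy first group on the same input.
const greedy = "aaabbb".match(/(a+)(a+b+)/);
const lazy = "aaabbb".match(/(a+?)(a+b+)/);

console.log(noBacktrack.slice(1)); // [ 'aaaa', 'cccc' ]
console.log(giveBack.slice(1));    // [ 'aaa', 'a' ]
console.log(greedy.slice(1));      // [ 'aa', 'abbb' ]
console.log(lazy.slice(1));        // [ 'a', 'aabbb' ]
```

In both of the last two cases the overall match is the same string, aaabbb; only the split between the groups differs.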
B) the "global" option
This is something totally unrelated to backtracking and simply means that instead of stopping at the first match the search must find all non-overlapping matches. This is done by finding the first match and then starting again the search after the end of the match.
Note that each match is computed using the standard regexp rules and there is no look-ahead or backtracking between different matches: in other words if making for example a greedy match shorter would give more matches in the string this option is not considered... a+[^b]+ tested with aaaaaa is going to give only one match even if g option is specified and even if the substrings aa, aa, aa would have been each a valid match for the regexp.
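A small sketch of part B, using the inputs already mentioned in this thread. The g flag changes how many matches are returned, not how each individual match is computed:

```javascript
// One greedy match consumes the whole string, so nothing is left for a second match.
const onlyOne = "aaaaaa".match(/a+[^b]+/g);

// Without g: stop at the first match. With g: continue after the end of each match.
const first = "aaa#bbb ... aaaa#bbbb".match(/a+#b+/);
const all = "aaa#bbb ... aaaa#bbbb".match(/a+#b+/g);

console.log(onlyOne);  // [ 'aaaaaa' ]
console.log(first[0]); // 'aaa#bbb'
console.log(all);      // [ 'aaa#bbb', 'aaaa#bbbb' ]
```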
When the global flag is used, it starts searching for the next match after the end of the previous match, to prevent generating lots of overlapping matches like that.
If you don't specify /g, the engine will stop as soon as a match is found.
If you do specify /g, it will keep going after a match. It will, however, still not produce overlapping matches, which is what you're asking about.
This happens because of how a regex engine works.
What a regex tries to do:
Each attempt finds a single match, the best one available at the current position.
What a regex won't do:
It will not return every substring combination contained within a single match, as you expected in your case.
When your "aaa#bbb".match(/a+#b+/g) scenario would work:
Try an input like aaa#bbbHiaa#bbbHelloa#bbbSEEYOU instead, which will give you
aaa#bbb
aa#bbb
a#bbb
I am trying to solve the same problem in JavaScript with the regexp mentioned here: Check if string is repetition of an unknown substring
I translated the regex in the first answer to Javascript: ^(.+){2,}$
But it does not work as I expect:
'SingleSingleSingle'.replace(/^(.+){2,}$/m, '$1') // returns 'e' instead of expected 'Single'
What am I overlooking?
I currently have no explanation for why it returns e, but . matches any character and .{2,} basically just means "match any two or more characters".
What you want is to match whatever you captured in the capture group, by using backreferences:
/^(.+)\1+$/m
I just noticed that this is also what the answer you linked to suggests to use: /(.+)\1+/. The expression is exactly the same, there is nothing you have to change for JavaScript.
I think the reason why you get 'e' is that {2,} implies two or more repetitions of a match to the regex that precedes it, in this case (.+). {2,} does not guarantee that the repetitions match each other, only that they all qualify as a match for (.+).
From what I can see (using Expresso) it looks like the first match to (.+) is 'SingleSingleSingl' (due to greedy matching) and the second match is 'e'. Since capturing groups only remember their last match, that is why replace() is giving you back 'e'. If you use (.+?) (for non-greedy or reluctant matching) each individual character will match, but you will still only get the last one, 'e'.
Using a back reference, as Felix mentioned, is the only way that I know of to guarantee that the repetitions match each other.
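To see the difference side by side, here is a minimal sketch. The helper name repeatedUnit is made up for illustration; the two patterns are the ones from this thread:

```javascript
// Hypothetical helper: returns the unit captured by group 1 if the whole
// string is that unit repeated two or more times, otherwise null.
function repeatedUnit(s) {
  const m = /^(.+)\1+$/.exec(s);
  return m === null ? null : m[1];
}

console.log(repeatedUnit("SingleSingleSingle")); // 'Single'
console.log(repeatedUnit("abcd"));               // null

// The quantified group from the question keeps only its last iteration.
console.log("SingleSingleSingle".replace(/^(.+){2,}$/m, "$1")); // 'e'
```

Because (.+) is greedy, group 1 ends up holding the longest unit that works, not necessarily the shortest: for "abababab" it is "abab". Use (.+?) if you want the shortest unit.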
I want to capture thing in nothing globally and case insensitively.
For some reason this doesn't work:
"Nothing thing nothing".match(/no(thing)/gi);
jsFiddle
The captured array is Nothing,nothing instead of thing,thing.
I thought parentheses delimit the matching pattern? What am I doing wrong?
(yes, I know this will also match in nothingness)
If you use the global flag, the match method will return all overall matches. This is the equivalent of the first element of every match array you would get without global.
To get all groups from each match, loop:
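// Keep the regex in a variable: exec() tracks its position in re.lastIndex,
// and a regex literal inside the loop condition would be re-created (and reset) on every pass.
var re = /no(thing)/gi;
var match;
while ((match = re.exec("Nothing thing nothing")) !== null)
{
// Do something with match; match[1] is the captured group
}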
This will give you ["Nothing", "thing"] and ["nothing", "thing"].
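In engines that support it (String.prototype.matchAll was added in ES2020), the same loop can be written without managing the regex state yourself. A sketch that collects just the captured group:

```javascript
const groups = [];
for (const m of "Nothing thing nothing".matchAll(/no(thing)/gi)) {
  // m[0] is the overall match ("Nothing" / "nothing"), m[1] is the group.
  groups.push(m[1]);
}

console.log(groups); // [ 'thing', 'thing' ]
```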
Parentheses or no, the whole of the matched substring is always captured--think of it as the default capturing group. What the explicit capturing groups do is enable you to work with smaller chunks of text within the overall match.
The tutorial you linked to does indeed list the grouping constructs under the heading "pattern delimiters", but it's wrong, and the actual description isn't much better:
(pattern), (?:pattern) Matches entire contained pattern.
Well of course they're going to match it (or try to)! But what the parentheses do is treat the entire contained subpattern as a unit, so you can (for example) add a quantifier to it:
(?:foo){3} // "foofoofoo"
(?:...) is a pure grouping construct, while (...) also captures whatever the contained subpattern matches.
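A short sketch of why the grouping matters for quantifiers. Without the parentheses, {3} applies only to the last character:

```javascript
const grouped = /^(?:foo){3}$/; // the quantifier applies to the whole group
const ungrouped = /^foo{3}$/;   // the quantifier applies only to the final "o"

console.log(grouped.test("foofoofoo"));   // true
console.log(ungrouped.test("foofoofoo")); // false
console.log(ungrouped.test("foooo"));     // true
```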
With just a quick look-through I spotted several more examples of inaccurate, ambiguous, or incomplete descriptions. I suggest you unbookmark that tutorial immediately and bookmark this one instead: regular-expressions.info.
Parentheses do nothing in this regex.
The regex /no(thing)/gi matches the same text as /nothing/gi.
Parentheses are used for grouping. If you don't refer to the groups (using $1, $2) or read the group values from the match, the () are useless.
So, this regex will find only the sequence n-o-t-h-i-n-g. The word thing doesn't start with 'no', so it doesn't match.
EDIT:
Change it to /(no)?thing/gi and it will work.
It works because (...)? marks an optional part.