How do you use non captured elements in a Javascript regex? - javascript

I want to capture thing in nothing globally and case insensitively.
For some reason this doesn't work:
"Nothing thing nothing".match(/no(thing)/gi);
The captured array is Nothing,nothing instead of thing,thing.
I thought parentheses delimit the matching pattern? What am I doing wrong?
(yes, I know this will also match in nothingness)

If you use the global flag, the match method will return all overall matches. This is the equivalent of the first element of every match array you would get without global.
To get all groups from each match, loop:
var re = /no(thing)/gi; // keep one regex object so its lastIndex advances between calls
var match;
while ((match = re.exec("Nothing thing nothing")) !== null)
{
// Do something with match
}
This will give you ["Nothing", "thing"] and ["nothing", "thing"].
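If only the captured group is wanted, each result's element at index 1 can be collected. The following is a minimal sketch; the variable names are illustrative, and the matchAll variant assumes an engine that supports String.prototype.matchAll (ES2020).

```javascript
const text = "Nothing thing nothing";

// Reuse one regex object so exec() resumes from re.lastIndex on each call.
const re = /no(thing)/gi;
const fromExec = [];
let m;
while ((m = re.exec(text)) !== null) {
  fromExec.push(m[1]); // m[0] is the whole match, m[1] is the first group
}

// Same result with matchAll, which requires the g flag.
const fromMatchAll = Array.from(text.matchAll(/no(thing)/gi), (hit) => hit[1]);

console.log(fromExec);     // ["thing", "thing"]
console.log(fromMatchAll); // ["thing", "thing"]
```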

Parentheses or no, the whole of the matched substring is always captured--think of it as the default capturing group. What the explicit capturing groups do is enable you to work with smaller chunks of text within the overall match.
The tutorial you linked to does indeed list the grouping constructs under the heading "pattern delimiters", but it's wrong, and the actual description isn't much better:
(pattern), (?:pattern) Matches entire contained pattern.
Well of course they're going to match it (or try to)! But what the parentheses do is treat the entire contained subpattern as a unit, so you can (for example) add a quantifier to it:
(?:foo){3} // "foofoofoo"
(?:...) is a pure grouping construct, while (...) also captures whatever the contained subpattern matches.
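The difference between the two constructs shows up in the result array. A small sketch (the input string is just an example):

```javascript
const input = "foofoofoo";

// Non-capturing: the result holds only the overall match.
const plain = Array.from(/(?:foo){3}/.exec(input));

// Capturing: a quantified group keeps what its last repetition matched.
const captured = Array.from(/(foo){3}/.exec(input));

console.log(plain);    // ["foofoofoo"]
console.log(captured); // ["foofoofoo", "foo"]
```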
With just a quick look-through I spotted several more examples of inaccurate, ambiguous, or incomplete descriptions. I suggest you unbookmark that tutorial immediately and bookmark this one instead: regular-expressions.info.

Parentheses do nothing in this regex.
The regex /no(thing)/gi matches the same text as /nothing/gi.
Parentheses are used for grouping. If you don't refer to the groups (using $1, $2) or read them from the match result, the () are useless.
So this regex only finds the sequence n-o-t-h-i-n-g. The word thing doesn't start with 'no', so it doesn't match.
EDIT:
Change it to /(no)?thing/gi and it will work.
It works because ()? marks an optional part.

Related

JS RegEx replacement of a non-captured group?

I'm currently going through the book "Eloquent JavaScript". There's an exercise at the end of Chapter 9 on Regular Expressions whose solution I couldn't fully understand. The description of the exercise can be found here.
TL;DR: The objective is to replace single quotes (') with double quotes (") in a given string while keeping single quotes in contractions, using the replace method with a RegEx of course.
Now, after solving this exercise using my own method, I checked the proposed solution, which looks like this:
console.log(text.replace(/(^|\W)'|'(\W|$)/g, '$1"$2'));
The RegEx looks fine and is quite understandable, but what I fail to understand is the usage of the replacements, mainly why using $2 works. As far as I know, this regular expression will only take one path of two, either (^|\W)' or '(\W|$), and each of these paths results in a single captured group, so we should only have $1 available. And yet $2 is capturing what comes after the single quote without an explicit second capture group in that path. One can argue that there are two groups, but then again $2 is capturing a different string than the one intended by the second group.
My questions:
Why is $2 actually a valid string and not undefined, and what is it referring to precisely?
Is this one of JavaScript's RegEx quirks?
Does this mean $1, $2... don't always refer to explicit groups?
The backreferences are initialized with an empty string upon each match, so there will be no issues if a group is not matched. And it is no quirk, it is in compliance with the ES5 standard.
Here is a quote from Backreferences to Failed Groups:
According to the official ECMA standard, a backreference to a non-participating capturing group must successfully match nothing just like a backreference to a participating group that captured nothing does.
So, when a group does not participate in the match, a reference to it yields an empty string, not undefined. It is not a quirk, just a "feature". It is not always what you expect, but it is how it works.
In your scenario, one of the two references is always empty on a match, since there are two alternative branches and only one of them matches each time. The point is to restore the character matched by whichever group participated. Both $1 and $2 are used because one of them contains the text to restore while the other contains only empty text.
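The behaviour can be checked directly. In this sketch (the sample sentence is made up for illustration), exec reports the group that did not participate as undefined, while the $1 and $2 tokens in the replacement string substitute an empty string for it:

```javascript
// Opening quote: the first alternative matches, so group 2 does not participate.
const opening = /(^|\W)'|'(\W|$)/.exec("'hi");
console.log(opening[1], opening[2]); // "" undefined

// Closing quote: the second alternative matches, so group 1 does not participate.
const closing = /(^|\W)'|'(\W|$)/.exec("hi'");
console.log(closing[1], closing[2]); // undefined ""

// In the replacement string, the missing group simply contributes nothing.
const text = "'I'm the cook,' he said, 'it's my job.'";
const fixed = text.replace(/(^|\W)'|'(\W|$)/g, '$1"$2');
console.log(fixed); // "I'm the cook," he said, "it's my job."
```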

The behavior of /g mode matching

This article says:
Make sure you are clear on the fact that an expression pattern is
tested on each individual character. And that, just because the engine
moves forward when following the pattern and looking for a match it
still backtracks and examines each character in a string until a match
is found or if the global flag is set until all characters are
examined.
But when I test this in JavaScript,
"aaa#bbb".match(/a+#b+/g)
does not produce a result like:
["aaa#bbb", "aa#bbb", "a#bbb"]
It only produces ["aaa#bbb"]. It seems it does not examine each character to test the pattern. Can anyone explain the matching steps a little? Thanks.
/g does not mean it will try to find every possible subset of characters in the input string which may match the given pattern. It means that once a match is found, it will continue searching for more substrings which may match the pattern starting from the previous match onward.
For example:
"aaa#bbb ... aaaa#bbbb".match(/a+#b+/g);
Will produce
["aaa#bbb", "aaaa#bbbb"]
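To see where each search resumes, the same input can be stepped through with exec, which records the resume position in lastIndex. A small sketch (the steps array is only there for inspection):

```javascript
const re = /a+#b+/g;
const input = "aaa#bbb ... aaaa#bbbb";

const steps = [];
let m;
while ((m = re.exec(input)) !== null) {
  // m.index is where the match starts; re.lastIndex is where the next search begins.
  steps.push({ match: m[0], start: m.index, resumeAt: re.lastIndex });
}

console.log(steps);
// [ { match: "aaa#bbb", start: 0, resumeAt: 7 },
//   { match: "aaaa#bbbb", start: 12, resumeAt: 21 } ]
```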
That explanation mixes two distinct concepts that IMO should be kept separate:
A) backtracking
When looking for a match, the normal behavior of a quantifier (?, *, +) is to be "greedy", i.e. to munch as much as possible. For example, in /(a+)([^b]+)/ tested with aaaacccc, all the a's will be part of group 1, even though they would also match the character set [^b] (everything but b).
However, if grabbing too much would prevent a match, the regex rules require the quantifier to "backtrack" and capture less if that allows the expression to match. For example, with (a+)([^b]+) tested on aaaa, group 1 gets only three a's, leaving one for group 2 to match.
You can change this greedy behavior with "non-greedy quantifiers" like *?, +?, ??. The engine still backtracks in this case, but in the reverse direction: a non-greedy subexpression eats as little as possible while still allowing the rest of the expression to match. For example, (a+)(a+b+) tested with aaabbb leaves two a's for group 1 and abbb for group 2, whereas (a+?)(a+b+) with the same string leaves only one a for group 1, because this is the minimum that allows the remaining part to match.
Note that, because of backtracking, choosing greedy or non-greedy does not change whether an expression matches, only how big the match is and how much goes to each sub-expression.
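The group contents described above can be verified with exec. In this sketch, groups is a hypothetical helper that returns only the captured groups:

```javascript
// Hypothetical helper: run the regex once and return just the captured groups.
const groups = (re, s) => Array.from(re.exec(s)).slice(1);

const greedy = groups(/(a+)([^b]+)/, "aaaacccc"); // group 1 takes every a
const gaveBack = groups(/(a+)([^b]+)/, "aaaa");   // group 1 backtracks by one a
const greedyAB = groups(/(a+)(a+b+)/, "aaabbb");  // greedy: as many a's as possible
const lazyAB = groups(/(a+?)(a+b+)/, "aaabbb");   // lazy: as few a's as possible

console.log(greedy);   // ["aaaa", "cccc"]
console.log(gaveBack); // ["aaa", "a"]
console.log(greedyAB); // ["aa", "abbb"]
console.log(lazyAB);   // ["a", "aabbb"]
```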
B) the "global" option
This is something totally unrelated to backtracking and simply means that instead of stopping at the first match the search must find all non-overlapping matches. This is done by finding the first match and then starting again the search after the end of the match.
Note that each match is computed using the standard regexp rules and there is no look-ahead or backtracking between different matches: in other words if making for example a greedy match shorter would give more matches in the string this option is not considered... a+[^b]+ tested with aaaaaa is going to give only one match even if g option is specified and even if the substrings aa, aa, aa would have been each a valid match for the regexp.
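The aaaaaa case is easy to confirm. A short sketch:

```javascript
// With g, the single greedy match consumes the whole string.
const all = "aaaaaa".match(/a+[^b]+/g);

// Each "aa" would be a valid match on its own, but the engine never splits a match up.
const pairMatches = /^a+[^b]+$/.test("aa");

console.log(all);         // ["aaaaaa"]
console.log(pairMatches); // true
```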
When the global flag is used, it starts searching for the next match after the end of the previous match, to prevent generating lots of overlapping matches like that.
If you don't specify /g, the engine will stop as soon as a match is found.
If you do specify /g, it will keep going after a match. It will, however, still not produce overlapping matches, which is what you're asking about.
It's because of how a regex matches.
What a regex tries to do:
Every regex tries to find the best match at each position.
What a regex won't do:
It will not return the different combinations contained in a single match, as in your case.
When your "aaa#bbb".match(/a+#b+/g) scenario works:
Try an input such as aaa#bbbHiaa#bbbHelloa#bbbSEEYOU instead, which will give you
aaa#bbb
aa#bbb
a#bbb

silent group not working in javascript regex match()

I'm trying to extract (potentially hyphenated) words from a string that have been marked with a '#'.
So for example from the string
var s = '#moo, #baa and #moo-baa are writing an email to a#bc.de'
I would like to return
['#moo', '#baa', '#moo-baa']
To make sure I don't capture the email address, I check that the group is preceded by a white-space character OR the beginning of the line:
s.match(/(^|\s)#(\w+[-\w+]*)/g)
This seems to do the trick, but it also captures the spaces, which I don't want:
["#moo", " #baa", " #moo-baa"]
Silencing the grouping like this
s.match(/(?:^|\s)#(\w+[-\w+]*)/g)
doesn't seem to work, it returns the same result as before. I also tried the opposite, and checked that there's no \w or \S in front of the group, but that also excludes the beginning of the line. I know I could simply trim the spaces off, but I'd really like to get this working with just a single 'match' call.
Anybody have a suggestion what I'm doing wrong? Thanks a lot in advance!!
[edit]
I also just noticed: Why is it returning the '#' symbols as well?! I mean, it's what I want, but why is it doing that? They're outside of the group, aren't they?
As far as I know, String.match returns only the whole matches when the "g" modifier is used. With that modifier you are asking for every match of the whole expression rather than the numbered sub-matches (groups) of a single match. A global match does not return groups; each element of the array is a complete match.
In your case, the regular expression you were looking for might be this:
'#moo, #baa and #moo-baa are writing an email to a#bc.de'.match(/(?!\b)(#[\w\-]+)/g);
You are looking for every "#" symbol that is not at a word boundary, which is why the "#" in a#bc.de is skipped. So there is no need for silent groups.
If you don't want to capture the space, don't put the \s inside of the parentheses. Anything inside the parentheses will be returned as part of the capture group.
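Another way to keep the original "start of line or whitespace" check and still get the tags without the spaces is to read the capture group from each match instead of the whole match. This is only a sketch: it assumes an engine with String.prototype.matchAll (ES2020), and the pattern #\w+(?:-\w+)* is one assumed reading of "potentially hyphenated word", not the asker's exact expression.

```javascript
const s = "#moo, #baa and #moo-baa are writing an email to a#bc.de";

// Non-capturing prefix (start or whitespace), then group 1 holds the tag itself.
const tagRe = /(?:^|\s)(#\w+(?:-\w+)*)/g;

// match() with g returns whole matches, so the leading spaces are still there.
const wholeMatches = s.match(tagRe);

// matchAll() exposes the groups of every match, so group 1 can be picked out.
const tags = Array.from(s.matchAll(tagRe), (m) => m[1]);

console.log(wholeMatches); // ["#moo", " #baa", " #moo-baa"]
console.log(tags);         // ["#moo", "#baa", "#moo-baa"]
```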

Metacharacters and parenthesis in regular expressions

Can anyone elaborate/translate this regular expression into English?
Thank you.
var g = "123456".match(/(.)(.)/);
I have noticed that the output looks like this:
12,1,2
and I know that dot means any character except new line. But what does this actually do?
A pair of parentheses (without a ? as the first character, which indicates other behaviour) will capture its contents into a group.
In your example, the first item in the array is the entire match, and subsequent items are any group matches.
It might be clearer if your code was something like:
var g = "123456".match(/.(.).(.)./);
This will match five characters, placing the second and fourth into groups 1 and 2 respectively, so outputting 12345,2,4
If you want pure grouping without capturing the content, use (?:...) syntax, the ?: part indicating a non-capturing group. (There are various assorted group things, like lookaheads and other fun stuff.)
Let me know if that is clear, or would further explanation help?
It looks for two characters - any characters because of the dots - and 'captures' them so that you can look for the whole string that was matched, and for each of the substrings (captures) as well.
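Destructuring the result makes the layout of the array explicit. A small sketch using the two expressions from this thread (the variable names are illustrative):

```javascript
// Without the g flag, match() returns [whole match, group 1, group 2, ...].
const [whole, first, second] = "123456".match(/(.)(.)/);
console.log(whole, first, second); // 12 1 2

// Only the second and fourth characters are captured here.
const wide = Array.from("123456".match(/.(.).(.)./));
console.log(wide); // ["12345", "2", "4"]
```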

What does the "?:^" regular expression mean?

I am looking at this sub-expression (this is in JavaScript):
(?:^|.....)
I know that ? means "zero or one times" when it follows a character, but not sure what it means in this context.
When working with groups, you often have several options that modify the behavior of the group:
(foo) // default behavior, matches "foo" and stores a back-reference
(?:foo) // non-capturing group: matches "foo", but doesn't store a back-ref
(?i:foo) // matches "foo" case-insensitively (inline modifier; not available in every engine, including older JavaScript ones)
(?=foo) // succeeds only if "foo" follows, but does not advance the current position
// ("positive zero-width look-ahead assertion")
(?!foo) // succeeds only if "foo" does not follow, and does not advance the position
// ("negative zero-width look-ahead assertion")
to name a few.
They all begin with "?", which is the way to indicate a group modifier. The question mark has nothing to do with optionality in this case.
It simply says:
(?:^foo) // match "foo" at the start of the line, but do not store a back-ref
Sometimes it's just overkill to store a back-reference to some part of the match that you are not going to use anyway. When the group is there only to make a complex expression behave as a unit (e.g. it should either match or fail as a whole), storing a back-reference is an unnecessary waste of resources that can even slow down the regex a bit. And sometimes you just want group 1 to be the first group relevant to you, instead of the first group in the regex.
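A few of these constructs in action, as a hedged sketch (the sample strings are made up):

```javascript
// Non-capturing alternation: the name ends up in group 1, not group 2.
const named = Array.from(/(?:Mr|Ms)\. (\w+)/.exec("Ms. Smith"));

// Positive look-ahead: "bar" must follow, but it is not part of the match.
const ahead = /foo(?=bar)/.exec("foobar")[0];

// Negative look-ahead: succeeds only when "bar" does not follow.
const notBar = [/foo(?!bar)/.test("foobar"), /foo(?!bar)/.test("foobaz")];

console.log(named);  // ["Ms. Smith", "Smith"]
console.log(ahead);  // "foo"
console.log(notBar); // [false, true]
```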
You're probably seeing it in this context
(?:...)
It means that the group won't be captured or used for back-references.
EDIT: To reflect your modified question:
(?:^|....)
means "match the beginning of the line or match ..." but don't capture the group or use it for back-references.
(?:some stuff) means that you don't want to capture the expression in the parentheses separately. Normally the pieces of a regexp grouped in parentheses are captured and can be referenced individually (this is called using backreferences).
See http://www.regular-expressions.info/brackets.html
Short Answer
It flags the (parenthetical) group as a non-capturing group.
Details About This Particular Expression
The notation for a non-capturing group is:
(?:<expression>)
In the instance you presented, the caret (^) is part of the expression, not part of the non-capturing group notation. Outside a character class it is the usual anchor for the start of the input (or of a line, with the m flag).
They are using an 'or' operator (the pipe) with the caret, so the group matches either the start of the input or whatever is on the right of the pipe, without capturing the expression as a group (accomplished with the ?: at the beginning of the group).
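A quick check of how the caret behaves inside such a group: outside a character class it acts as the start-of-input anchor, and a literal caret has to be escaped. The pattern below is an assumed stand-in for the elided (?:^|.....) expression, with a comma as the right-hand alternative:

```javascript
// Start of input OR a comma, followed by one word character.
const anchorRe = /(?:^|,)\w/g;

const fields = "a,b".match(anchorRe);           // "a" via the anchor, ",b" via the comma
const caretText = "^a".match(anchorRe);         // the ^ in the pattern is not a literal caret
const escaped = "^a".match(/(?:\^|,)\w/g);      // \^ matches a literal caret

console.log(fields);    // ["a", ",b"]
console.log(caretText); // null
console.log(escaped);   // ["^a"]
```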
In General
Non-capturing groups allow you to group an expression in a way that can't be back-referenced, which may also slightly improve the performance of the expression.
"(?:x) Matches 'x' but does not remember the match."
https://developer.mozilla.org/en/Core_JavaScript_1.5_Guide/Regular_Expressions
?: generally makes the group non-capturing. You can do some research here.
I'm almost positive any regex engine should support this, but when I switch between engines I do run into some quirks.
Edit: This should be the case; non-capturing groups seem to work fine.
