JS RegEx replacement of a non-captured group? - javascript

I'm currently going through the book "Eloquent JavaScript". There's an exercise at the end of Chapter 9 on Regular Expressions whose solution I couldn't fully understand. A description of the exercise can be found here.
TL;DR: The objective is to replace single quotes (') with double quotes (") in a given string while keeping single quotes in contractions, using the replace method with a RegEx of course.
Now, after solving this exercise using my own method, I checked the proposed solution, which looks like this:
console.log(text.replace(/(^|\W)'|'(\W|$)/g, '$1"$2'));
The RegEx looks fine and it's quite understandable, but what I fail to understand is the usage of replacements, mainly why using $2 works ? As far as I know this regular expression will only take one path of two, either (^|\W)' or '(\W|$) each of these paths will only result in a single captured group, so we will only have $1 available. And yet $2 is capturing what comes after the single quote without having an explicit second capture group that does this in the regular expression. One can argue that there are two groups, but then again $2 is capturing a different string than the one intended by the second group.
My questions :
Why is $2 actually a valid string and not undefined, and what precisely does it refer to?
Is this one of JavaScript's RegEx quirks?
Does this mean $1, $2... don't always refer to explicit groups?

The backreferences are initialized with an empty string upon each match, so there will be no issues if a group is not matched. And it is no quirk, it is in compliance with the ES5 standard.
Here is a quote from Backreferences to Failed Groups:
According to the official ECMA standard, a backreference to a non-participating capturing group must successfully match nothing just like a backreference to a participating group that captured nothing does.
So, when a group does not participate in the match, a reference to it yields an empty string, not undefined. It is not a quirk, just a "feature". It is sometimes unexpected, but that is how it works.
In your scenario, one of the two references is always empty upon a match, since there are two alternative branches and only one matches each time. The point is to restore the char matched by whichever group participated. Both references are used because one of them contains the text to restore while the other contributes only an empty string.
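To see this in action, here is a minimal sketch. The sample sentence and the variable names are my own (not from the book); the regex and replacement string are the ones from the question. It also logs the raw captures through a replacer function, where the group from the branch that did not match arrives as undefined, whereas in a replacement string the same group expands to an empty string.

```javascript
// Hypothetical sample text; the regex and '$1"$2' come from the question.
const quoted = "'I'm the cook,' he said, 'it's my job.'";
const pattern = /(^|\W)'|'(\W|$)/g;

// A replacer function receives the raw captures for every match.
// The group belonging to the branch that did not match is undefined here.
const seen = [];
quoted.replace(pattern, (match, before, after) => {
  seen.push([match, before, after]);
  return match; // leave the text unchanged, we only want to inspect
});
console.log(seen[0]); // first match: group 1 is "" (matched ^), group 2 is undefined

// In a replacement string, the non-participating group expands to "".
const fixedQuotes = quoted.replace(pattern, '$1"$2');
console.log(fixedQuotes);
```

So each replacement is effectively "whatever was before the quote" + " + "whatever was after it", with one of the two always empty.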

Related

JavaScript regular expression unexpected behavior

Let's have the following (a bit complex) regular expression in JavaScript:
\{\{\s*(?:(?:\:)([\w\$]+))?\#(?:([\w\$\/]+#?)?([\s\S]*?))?(\.([\w\$\/]*))?\s*\}\}
I am wondering why it matches the whole string here:
{{:control#}}x{{*>*}}
but not in the following case (where a space is added after #):
{{:control# }}x{{*>*}}
In PHP or Python, it matches in both cases just the first part {{: ... }}.
I want JavaScript to match only the first part as well. Is it possible without hacking (?!}}) before [\s\S]?
Moreover, is performance the reason for this different behavior in JavaScript, or is it just a bug in specification?
You can use a lazy ?? quantifier to achieve the same behavior in JavaScript:
\{\{\s*(?:(?::)([\w$]+))?#(?:([\w$\/]+#?)?([\s\S]*?))??(\.([\w$\/]*))?\s*}}
(The only change is the ?? right after the non-capturing group that wraps [\s\S]*?.)
See demo
From rexegg.com:
A?? Zero or one A, zero if that still allows the overall pattern to match (lazy)
This is no bug, and is right according to the ECMA standard specifications JavaScript abides by.
Here, in (?:([\w$\/]+#?)?([\s\S]*?))?, we have an optional non-capturing group that can match an empty text. JavaScript regex engine "consumes" empty texts in optional groups for them to be later accessible via backreferences. This problem is closely connected with the Backreferences to Failed Groups. E.g. ((q)?b\2) will match b in JavaScript, but it won't match in Python and PCRE.
According to the official ECMA standard, a backreference to a non-participating capturing group must successfully match nothing just like a backreference to a participating group that captured nothing does.
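A quick way to check the JavaScript side of that claim, as a small sketch using the ((q)?b\2) pattern mentioned above (the variable names are just for illustration):

```javascript
// Group 2, (q)?, does not participate when the input is just "b".
const failedGroupRe = /((q)?b\2)/;
const failedGroupMatch = failedGroupRe.exec('b');

// In JavaScript the backreference \2 matches the empty string,
// so the overall pattern still matches.
console.log(failedGroupMatch[0]); // whole match
console.log(failedGroupMatch[2]); // the non-participating group
```

The whole match is b, and the non-participating group is reported as undefined in the result array.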
This subpattern is responsible for the behaviour:
([\w\$\/]+#?)? // P1
as it matches greedily, your whole test string (without the space) gets consumed.
As @stribizhev suggests, making the designated part of your regex non-greedy results in a conservative match.
Both versions will match up to and including #, since both patterns contain this character without any occurrence restrictions.
In the second test string (with the space after #), P1 cannot match the whitespace. Instead, the whitespace is matched by the subsequent subexpression ([\s\S]*?), which finishes the match.

Why 'ABC'.replace('B', '$`') gives AAC

Why does this code print AAC instead of the expected A$`C?
console.log('ABC'.replace('B', '$`'));
==>
AAC
And how to make it give the expected result?
To insert a literal $ you have to pass $$, because $`:
Inserts the portion of the string that precedes the matched substring.
console.log('ABC'.replace('B', "$$`"));
See the documentation.
Other patterns:
Pattern  Inserts
$$  Inserts a $.
$&  Inserts the matched substring.
$`  Inserts the portion of the string that precedes the matched substring.
$'  Inserts the portion of the string that follows the matched substring.
$n  Where n is a positive integer less than 100, inserts the nth parenthesized submatch string, provided the first argument was a RegExp object. Note that this is 1-indexed. If a group n is not present (e.g., if group is 3), it will be replaced as a literal (e.g., $3).
$<Name>  Where Name is a capturing group name. If the group is not in the match, or not in the regular expression, or if a string was passed as the first argument to replace instead of a regular expression, this resolves to a literal (e.g., $<Name>). Only available in browser versions supporting named capturing groups.
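As a rough illustration of the patterns listed above, here is a small sketch run against the 'ABC' string from the question. The results object and its keys are my own naming, and the $3 case assumes a current engine that leaves a reference to a missing group as a literal, as the list describes.

```javascript
const src = 'ABC';

const results = {
  dollar: src.replace('B', '$$'),         // literal dollar sign
  matched: src.replace('B', '$&$&'),      // the matched substring, twice
  before: src.replace('B', '$`'),         // text preceding the match
  after: src.replace('B', "$'"),          // text following the match
  group: src.replace(/(B)/, '[$1]'),      // first capturing group
  missingGroup: src.replace(/(B)/, '$3'), // no group 3, stays literal
  escapedTick: src.replace('B', '$$`'),   // $$ escapes the dollar, so ` is plain text
};

console.log(results);
```

The last entry is the fix from the answer: $$ turns into a single $, so the backtick after it is no longer part of a special pattern.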
JSFiddle
Also, there is more on the reference link I've posted above. If you still have any issue or doubt, you can probably find an answer there; the pattern list above was taken from the link posted at the beginning of the answer.
It is worth saying, in my opinion, that a $ which does not form one of the patterns above is inserted as-is, so a lone $ doesn't need to be escaped, and neither does something like $AAA.
In the comments above a user asked why you need to "escape" $ with another $. I'm not truly sure, but since an invalid pattern is not interpreted, I suspect that $$ is a special case: it covers the situations where you need to replace the match with a dollar sign followed by a "pattern-locked" character, like the backtick (`) or the &.
In any other case the dollar sign doesn't need to be escaped, so it probably makes sense that they decided on such a specific rule. Otherwise you would have needed to escape the $ everywhere else in a replacement string, even in something like "hello, $ hey this one is a dollar".
If you’re still interested and want to read more, please also check regular-expressions.info and this JSFiddle with more cases.
In the replacement the $ dollar sign has a special meaning and is used when data from the match should be used in the replacement.
MDN: String.prototype.replace(): Specifying a string as a parameter
$$ Inserts a "$".
$` Inserts the portion of the string that precedes the matched substring.
As long as the $ does not result in a combination that has a special meaning, then it will be just handled as a regular char. But you should still always write it as a $$ in the replacement because otherwise, it might fail in future if a new $x combination is added.
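If the replacement text comes from somewhere else and you want it inserted verbatim, one way to follow this advice is sketched below. escapeReplacement is a hypothetical helper name of my own, not a built-in; the second variant relies on the fact that the return value of a replacer function is not scanned for $ patterns.

```javascript
// Hypothetical helper: double every "$" so replace() inserts the text verbatim.
// In the replacement string '$$$$', each "$$" pair produces one literal "$".
function escapeReplacement(text) {
  return text.replace(/\$/g, '$$$$');
}

const literal = '$`';

// Variant 1: escape the replacement string first.
const viaEscape = 'ABC'.replace('B', escapeReplacement(literal));
console.log(viaEscape);

// Variant 2: a replacer function's return value is inserted as-is.
const viaFunction = 'ABC'.replace('B', () => literal);
console.log(viaFunction);
```

Both variants give A$`C; the function form avoids having to think about escaping at all.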

What is the purpose of the passive (non-capturing) group in a Javascript regex?

What is the purpose of the passive group in a Javascript regex?
The passive group is prefaced by a question mark colon: (?:group)
In other words, these 2 things appear identical:
"hello world".match(/hello (?:world)/)
"hello world".match(/hello world/)
In what situations do you need the non capturing group and why?
Two use cases for capturing groups
A capturing group in a regex has actually two distinct goals (as the name "capturing group" itself suggests):
Grouping — if you need a group to be treated as a single entity in order to apply some operation to the whole group.
Probably the most trivial example is including an optional sequence of characters, e.g. "foo" optionally followed by "bar", in regex terms: /foo(bar)?/ (capturing group) or /foo(?:bar)?/ (non-capturing group). Note that the trailing ? is applied to the whole group (bar) (which consists of a simple character sequence bar in this case). In case you just want to check if the input matches your regex, it really doesn't matter whether you use a capturing or a non-capturing group — they act the same (except that a non-capturing group is slightly faster).
Capturing — if you need to extract a part of the input.
For example, you want to get number of rabbits from an input like "The farm contains 8 cows and 89 rabbits" (not very good English, I know). The regex could be /(\d+)\s*rabbits\b/. On successful match, you can get the value matched by the capturing group from JavaScript code (or any other programming language).
In this example, you have a single capturing group, so you access it via index 1 of the match result (index 0 holds the whole match; see this answer for details).
Now imagine you want to ensure that the "place" is called "farm" or "ranch". If it's not the case, then you don't want to extract the number of rabbits (in regex terms — you don't want the regex to match).
So you rewrite your regex as follows: /(farm|ranch).*\b(\d+)\s*rabbits\b/. The regex works by itself, but your JavaScript is broken — there are two capturing groups now and you must change your code to get the contents of the second capturing group for the number of rabbits (i.e. change the index from 1 to 2). The first group now contains the string "farm" or "ranch", which you didn't intend to extract.
A non-capturing group comes to rescue: /(?:farm|ranch).*\b(\d+)\s*rabbits\b/. It still matches either "farm" or "ranch", but doesn't capture it, thus not shifting the indexes of subsequent capturing groups. And your JavaScript code works fine without changing.
The example may be oversimplified, but consider that you have a very complex regex with many groups, and you want to capture only few of them. Non-capturing groups are really helpful then — you don't have to count all of your groups (only capturing ones).
Besides, non-capturing groups serve documentation purposes: for someone who reads your code, a non-capturing group is an indication that you are not interested in extracting its contents, you just want to ensure that it matches.
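The rabbits example can be sketched as follows. The variable names are mine; the input and the three regexes are the ones from the text. Keep in mind that in a JavaScript match result, index 0 is the whole match and the capturing groups start at index 1.

```javascript
const farmInput = 'The farm contains 8 cows and 89 rabbits';

// One capturing group: the number of rabbits is at index 1.
const simple = farmInput.match(/(\d+)\s*rabbits\b/);
console.log(simple[1]);

// Adding a capturing group in front shifts the number to index 2.
const shifted = farmInput.match(/(farm|ranch).*\b(\d+)\s*rabbits\b/);
console.log(shifted[1], shifted[2]);

// A non-capturing group keeps the number at index 1.
const stable = farmInput.match(/(?:farm|ranch).*\b(\d+)\s*rabbits\b/);
console.log(stable[1]);
```

With the non-capturing version, code that reads index 1 keeps working after the farm/ranch check is added.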
A few words on separation of concerns
Capturing groups are a typical example of breaking the SoC principle. This syntax construct serves two distinct purposes. As the problems herewith became evident, an additional construct (?:) was introduced to disable one of the two features.
It was just a design mistake. Maybe a lack of "free special characters" played its role... but it was still a poor design.
Regex is a very old, powerful and widely used concept. For the reasons of backwards compatibility, this flaw is now unlikely to be fixed. It's just a lesson of how important the separation of concerns is.
Non-capturing groups have just one difference from "normal" (capturing) groups: they don't require the regex engine to remember what they matched.
The use case is that sometimes you must (or should) use a group not because you are interested in what it captures but for syntactic reasons. In these situations it makes sense to use a non-capturing group instead of a "standard" capturing one because it is less resource intensive -- but if you don't care about that, a capturing group will behave in the exact same manner.
Your specific example does not make a good case for using non-capturing groups exactly because the two expressions are identical. A better example might be:
input.match(/hello (?:world|there)/)
In addition to the answers above, if you're using String.prototype.split() and you use a capturing group, the output array contains the captured results (see MDN). If you use a non-capturing group that doesn't happen.
var myString = 'Hello 1 word. Sentence number 2.';
var splits = myString.split(/(\d)/);
console.log(splits);
Outputs:
["Hello ", "1", " word. Sentence number ", "2", "."]
Whereas swapping /(\d)/ for /(?:\d)/ results in:
["Hello ", " word. Sentence number ", "."]
When you want to apply modifiers to the group.
/hello (?:world)?/
/hello (?:world)*/
/hello (?:world)+/
/hello (?:world){3,6}/
etc.
Use them when you need a conditional and don't care about which of the choices cause the match.
Non-capturing groups can simplify the result of matching a complex expression. Here, group 1 is always the text that follows "said". If the alternation were a capturing group instead, it would take index 1 and push that text to group 2.
/hello (?:world|foobar )?said (.+)/
I have just found a different use for it. I was trying to capture a nested group but wanted the whole collection of the repeating group as one element:
So for AbbbbC
(A)((?:b)*)(C)
gives three groups: A, bbbb, C
For AC it also gives three groups: A, an empty string, C
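A small sketch of that pattern, with a quantified capturing group next to it for contrast. The variable names are mine, and the (A)(b)*(C) variant is added only for comparison; it is not from the answer above.

```javascript
const wrapped = /(A)((?:b)*)(C)/;   // the repetition is captured as a whole
const repeated = /(A)(b)*(C)/;      // the capturing group itself is repeated

console.log(wrapped.exec('AbbbbC').slice(1));  // groups for AbbbbC
console.log(wrapped.exec('AC').slice(1));      // middle group is an empty string

// A repeated capturing group only keeps its last iteration,
// and is undefined when it never ran.
console.log(repeated.exec('AbbbbC').slice(1));
console.log(repeated.exec('AC').slice(1));
```

Wrapping the repeated non-capturing group in one capturing group is what collects all the b's into a single element.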

What does $1, $2, etc. mean in Regular Expressions?

Time and time again I see $1 and $2 being used in code. What does it mean? Can you please include examples?
When you create a regular expression you have the option of capturing portions of the match and saving them as placeholders. They are numbered starting at $1.
For instance:
/A(\d+)B(\d+)C/
This will capture from A90B3C the values 90 and 3. If you need to group things but don't want to capture them, use the (?:...) version instead of (...).
The numbers start from left to right in the order the brackets are open. That means:
/A((\d+)B)(\d+)C/
Matching against the same string will capture 90B, 90 and 3.
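In JavaScript this might look like the sketch below (variable names are just for illustration). The last line shows the captures being reused as $1 and $2 in a replacement string.

```javascript
const sample = 'A90B3C';

// Two groups, numbered left to right.
const flat = sample.match(/A(\d+)B(\d+)C/);
console.log(flat.slice(1));

// Nested groups are numbered by the position of their opening bracket.
const nestedGroups = sample.match(/A((\d+)B)(\d+)C/);
console.log(nestedGroups.slice(1));

// $1 and $2 refer to the same captures inside a replacement string.
const swapped = sample.replace(/A(\d+)B(\d+)C/, '$2-$1');
console.log(swapped);
```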
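This is especially useful in replacement string syntax (format strings), for example for case folding in find-and-replace operations. To reference a capture, use $n, where n is the capture group number. In many editors and regex flavors $0 means the entire match; in JavaScript's String.prototype.replace the entire match is written $& instead. Example (editor syntax with case-folding escapes, not JavaScript): Find: (<a.*?>)(.*?)(</a>) Replace: $1\u$2\e$3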

How do you use non captured elements in a Javascript regex?

I want to capture thing in nothing globally and case insensitively.
For some reason this doesn't work:
"Nothing thing nothing".match(/no(thing)/gi);
jsFiddle
The captured array is Nothing,nothing instead of thing,thing.
I thought parentheses delimit the matching pattern? What am I doing wrong?
(yes, I know this will also match in nothingness)
If you use the global flag, the match method will return all overall matches. This is the equivalent of the first element of every match array you would get without global.
To get all groups from each match, loop:
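var re = /no(thing)/gi; // create the regex once so lastIndex persists between exec() calls
var match;
while ((match = re.exec("Nothing thing nothing")) !== null)
{
// match[0] is the whole match, match[1] is the captured "thing"
}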
This will give you ["Nothing", "thing"] and ["nothing", "thing"].
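If your environment supports String.prototype.matchAll (ES2020, so this is an assumption about your target browsers), the same loop can be written more compactly. This is only a sketch, and the variable names are mine:

```javascript
const words = 'Nothing thing nothing';

// With the global flag, match() returns only the overall matches.
const overall = words.match(/no(thing)/gi);
console.log(overall);

// matchAll() yields one full match array per match, including the groups.
const captured = Array.from(words.matchAll(/no(thing)/gi), (m) => m[1]);
console.log(captured);
```

Note that matchAll requires the g flag on the regex and throws a TypeError without it.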
Parentheses or no, the whole of the matched substring is always captured--think of it as the default capturing group. What the explicit capturing groups do is enable you to work with smaller chunks of text within the overall match.
The tutorial you linked to does indeed list the grouping constructs under the heading "pattern delimiters", but it's wrong, and the actual description isn't much better:
(pattern), (?:pattern) Matches entire contained pattern.
Well of course they're going to match it (or try to)! But what the parentheses do is treat the entire contained subpattern as a unit, so you can (for example) add a quantifier to it:
(?:foo){3} // "foofoofoo"
(?:...) is a pure grouping construct, while (...) also captures whatever the contained subpattern matches.
With just a quick look-through I spotted several more examples of inaccurate, ambiguous, or incomplete descriptions. I suggest you unbookmark that tutorial immediately and bookmark this one instead: regular-expressions.info.
Parentheses do nothing in this regex.
The regex /no(thing)/gi is same as /nothing/gi.
Parentheses are used for grouping. If you don't put any reference to groups (using $1, $2) or count for group, the () are useless.
So, this regex will find only the sequence n-o-t-h-i-n-g. The word thing doesn't start with 'no', so it doesn't match.
EDIT:
Change it to /(no)?thing/gi and it will work.
It works because ()? marks an optional part.
