Regex Javascript Capture groups with quantifier Not Working - javascript

I have this nice regex:
*(?:(?:([0-9]+)(?:d| ?days?)(?:, ?| )?)|(?:([0-9]+)(?:h| ?hours?)(?:, ?| )?)|(?:([0-9]+)(?:m| ?minutes?)(?:, ?| )?)|(?:([0-9]+)(?:s| ?seconds?)(?:, ?| )?))+
that pretty much matches a human-readable time delta. It works in PHP, Python, and Go, but for some reason the capture groups do not work in JavaScript. Here is a working PHP example on regex101 that shows the working capture groups. You will notice that upon changing it to JavaScript (ECMAScript) mode, the capture groups will only capture the last value. Can somebody please help and clarify what I am doing wrong, and why it doesn't work in JS?

Here's a simpler example that demonstrates the issue:
console.log(
  '34'.match(/(?:(3)|(4))+/)
); // in JavaScript, group 1 is undefined and group 2 is "4"
In PHP, whenever a capture group is matched, it will be put into the result. In contrast, in JavaScript, things are more complicated: when there are capturing groups on one side of an alternation |, whenever the whole alternation token is entered, there are 2 possibilities:
The alternation that is taken contains the capture group, and the result will have the capture group index set to the matched value
The alternation that is taken does not contain the capture group, in which case the result will have undefined assigned to that index - even if the capturing group was matched previously.
This is described in the specification:
Any capturing parentheses inside a portion of the pattern skipped by | produce undefined values instead of Strings.
and
Step 4 of the RepeatMatcher clears Atom's captures each time Atom is repeated.
In other words, each iteration of the outermost quantifier (* in the specification's example, + here) clears all captured Strings contained in the quantified Atom.
In your case, the easiest fix is to remove the repeating outermost group, so that only one subsequence is matched at a time (e.g. 1m, then 1d), and then iterate through the matches instead of trying to match everything in one go. To ensure that all the matches are next to each other (e.g. 1m1d, and not 1m 1d), check each match's index while iterating to see whether it directly follows the previous match.
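A minimal sketch of that approach follows. The helper name parseTimeDelta and the shape of the returned object are assumptions made for illustration and do not come from the question. It matches one unit per iteration with matchAll and stops at the first gap between matches:

```javascript
// Hypothetical helper (not from the question): parses strings like "1d 2h, 30m".
// Each unit is matched on its own, so no capture is cleared by a repeated group.
const SEP = '(?:, ?| )?';
const UNIT = new RegExp(
  [
    `([0-9]+)(?:d| ?days?)${SEP}`,
    `([0-9]+)(?:h| ?hours?)${SEP}`,
    `([0-9]+)(?:m| ?minutes?)${SEP}`,
    `([0-9]+)(?:s| ?seconds?)${SEP}`,
  ].join('|'),
  'g'
);

function parseTimeDelta(input) {
  const result = { days: 0, hours: 0, minutes: 0, seconds: 0 };
  let expectedIndex = null; // where the next match must start to be adjacent

  for (const m of input.matchAll(UNIT)) {
    // Stop at the first match that does not directly follow the previous one.
    if (expectedIndex !== null && m.index !== expectedIndex) break;
    expectedIndex = m.index + m[0].length;

    // Exactly one of groups 1..4 is defined for each match.
    if (m[1] !== undefined) result.days += Number(m[1]);
    else if (m[2] !== undefined) result.hours += Number(m[2]);
    else if (m[3] !== undefined) result.minutes += Number(m[3]);
    else if (m[4] !== undefined) result.seconds += Number(m[4]);
  }
  return result;
}

console.log(parseTimeDelta('1d 2h, 30m'));
```

Summing into named fields is only one possible design; collecting the raw matches into an array would work just as well if repeated units need to be kept apart.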

RegExp capturing non-match

I have a regex for a game that should match strings in the form of go [anything] or [cardinal direction], and capture either the [anything] or the [cardinal direction]. For example, the following would match:
go north
go foo
north
And the following would not match:
foo
go
I was able to do this using two separate regexes: /^(?:go (.+))$/ to match the first case, and /^(north|east|south|west)$/ to match the second case. I tried to combine the regexes to be /^(?:go (.+))|(north|east|south|west)$/. The regex matches all of my test cases correctly, but it doesn't correctly capture for the second case. I tried plugging the regex into RegExr and noticed that even though the first case wasn't being matched against, it was still being captured.
How can I correct this?
Try using the positive lookbehind feature to find the word "go".
(north|east|south|west|(?<=go ).+)$
Note that this solution prevents you from including ^ at the start of the regex, because the text "go" is not actually included in the group.
You have to move the closing parenthesis to the end of the pattern to have both patterns between anchors, or else you would allow a match before one of the cardinal directions and it would still capture the cardinal direction at the end of the string.
Then in the JavaScript you can check for the group 1 or group 2 value.
^(?:go (.+)|(north|east|south|west))$
Regex demo
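As a rough illustration of checking group 1 or group 2, the sketch below wraps the anchored pattern in a hypothetical parseCommand function. The function name and the null return for non-matching input are assumptions, not part of the question:

```javascript
const COMMAND = /^(?:go (.+)|(north|east|south|west))$/;

// Hypothetical helper: returns the captured target, or null when the input does not match.
function parseCommand(input) {
  const m = COMMAND.exec(input);
  if (m === null) return null;
  // Group 1 is set for "go <anything>", group 2 for a bare cardinal direction.
  return m[1] !== undefined ? m[1] : m[2];
}

for (const input of ['go north', 'go foo', 'north', 'foo', 'go']) {
  console.log(JSON.stringify(input), '->', parseCommand(input));
}
```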
Using a lookbehind assertion (if supported), you might also get a match only instead of capture groups.
In that case, you can match the rest of the line, asserting go to the left at the start of the string, or match only 1 of the cardinal directions:
(?<=^go ).+|^(?:north|east|south|west)$
Regex demo
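The match-only variant could be used along these lines. This is a sketch that assumes an engine with lookbehind support (ES2018 or later); matchTarget is a hypothetical name:

```javascript
// Requires lookbehind support (ES2018 or later).
const TARGET = /(?<=^go ).+|^(?:north|east|south|west)$/;

// Hypothetical helper: the whole match is the target, so no capture groups are needed.
function matchTarget(input) {
  const m = input.match(TARGET);
  return m === null ? null : m[0];
}

for (const input of ['go north', 'go foo', 'north', 'foo', 'go']) {
  console.log(JSON.stringify(input), '->', matchTarget(input));
}
```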

How can I capture an optional asterisk with my regex?

I am trying to get sections from an API which is in markdown, so I'm using this:
(?<=\*\*(Test)\*\*)(.*?)(?=\*\*end\*\*)
https://regex101.com/r/r8aiVk/1
Result here should be Test and then this is a test, which it is. Awesome.
This works fine, however, some titles have an asterisk at the end, which is where I'm running into an issue. I loop through titles with the one regex, but I want to capture that one optional asterisk.
So with this example following, I want to be able to capture the asterisk along with the rest:
https://regex101.com/r/r8aiVk/2
The result here should be Test* this is a test.
I've tried various different ways, such as (\*?) and a few other variants, but I am unable to get this working.
The lookbehind implementation in JavaScript tricked you: to match the lookbehind pattern, the regex iterator goes backwards, and tries to match its pattern that way. Since it is executed at each location (your lookbehind is the first atom in the regex), it checks the start of string, then *, then **, then T, etc. and once it matches **Test**, it calls it a day. So, the next * is consumed with .*?.
You can get what you need using a mere consuming pattern:
/\*\*(Test\*?)\*\*(.*?)\*\*end\*\*/g
See the regex demo.
This pattern will be processed normally, from left to right, matching
\*\* - a ** substring
(Test\*?) - capturing Test or Test* into Group 1
\*\* - a ** substring
(.*?) - Capturing group 2: any 0+ chars other than line break chars, as few as possible
\*\*end\*\* - **end** substring.
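A short sketch of the consuming pattern in use. The sample strings are assumptions modeled on the question's description, not copies of the regex101 test inputs, and extractSections is a hypothetical name:

```javascript
const SECTION = /\*\*(Test\*?)\*\*(.*?)\*\*end\*\*/g;

// Hypothetical helper: returns every section as { title, body }.
function extractSections(markdown) {
  return Array.from(markdown.matchAll(SECTION), (m) => ({
    title: m[1], // "Test" or "Test*"
    body: m[2],  // text between the title and **end**
  }));
}

console.log(extractSections('**Test** this is a test **end**'));
console.log(extractSections('**Test*** this is a test **end**'));
```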

What is the purpose of the passive (non-capturing) group in a Javascript regex?

What is the purpose of the passive group in a Javascript regex?
The passive group is prefaced by a question mark colon: (?:group)
In other words, these 2 things appear identical:
"hello world".match(/hello (?:world)/)
"hello world".match(/hello world/)
In what situations do you need the non capturing group and why?
Two use cases for capturing groups
A capturing group in a regex has actually two distinct goals (as the name "capturing group" itself suggests):
Grouping: if you need a group to be treated as a single entity in order to apply something to the whole group.
Probably the most trivial example is including an optional sequence of characters, e.g. "foo" optionally followed by "bar", in regex terms: /foo(bar)?/ (capturing group) or /foo(?:bar)?/ (non-capturing group). Note that the trailing ? is applied to the whole group (bar) (which consists of a simple character sequence bar in this case). If you just want to check whether the input matches your regex, it really doesn't matter whether you use a capturing or a non-capturing group; they act the same (except that a non-capturing group is slightly faster).
Capturing: if you need to extract a part of the input.
For example, you want to get number of rabbits from an input like "The farm contains 8 cows and 89 rabbits" (not very good English, I know). The regex could be /(\d+)\s*rabbits\b/. On successful match, you can get the value matched by the capturing group from JavaScript code (or any other programming language).
In this example, you have a single capturing group, so in JavaScript you access it via index 1 of the match result (index 0 holds the entire match).
Now imagine you want to ensure that the "place" is called "farm" or "ranch". If it's not the case, then you don't want to extract the number of rabbits (in regex terms, you don't want the regex to match).
So you rewrite your regex as follows: /(farm|ranch).*\b(\d+)\s*rabbits\b/. The regex works by itself, but your JavaScript is broken: there are two capturing groups now, and you must change your code to get the contents of the second capturing group for the number of rabbits (i.e. change the index from 1 to 2). The first group now contains the string "farm" or "ranch", which you didn't intend to extract.
A non-capturing group comes to rescue: /(?:farm|ranch).*\b(\d+)\s*rabbits\b/. It still matches either "farm" or "ranch", but doesn't capture it, thus not shifting the indexes of subsequent capturing groups. And your JavaScript code works fine without changing.
The example may be oversimplified, but consider that you have a very complex regex with many groups, and you want to capture only few of them. Non-capturing groups are really helpful then — you don't have to count all of your groups (only capturing ones).
Besides, non-capturing groups serve documentation purposes: for someone who reads your code, a non-capturing group is an indication that you are not interested in extracting its contents; you just want to ensure that it matches.
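The rabbits example can be checked directly. This is a small sketch using the sentence and patterns from the text; the variable names are illustrative only:

```javascript
const input = 'The farm contains 8 cows and 89 rabbits';

// Single capturing group: the count is at index 1 (index 0 is the whole match).
const simple = input.match(/(\d+)\s*rabbits\b/);

// Capturing (farm|ranch) shifts the count to index 2.
const capturing = input.match(/(farm|ranch).*\b(\d+)\s*rabbits\b/);

// Non-capturing (?:farm|ranch) keeps the count at index 1.
const nonCapturing = input.match(/(?:farm|ranch).*\b(\d+)\s*rabbits\b/);

console.log(simple[1]);                     // "89"
console.log(capturing[1], capturing[2]);    // "farm" "89"
console.log(nonCapturing[1]);               // "89"
```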
A few words on separation of concerns
Capturing groups are a typical example of breaking the SoC principle. This syntax construct serves two distinct purposes. As the problems herewith became evident, an additional construct (?:) was introduced to disable one of the two features.
It was just a design mistake. Maybe a lack of "free special characters" played its role... but it was still a poor design.
Regex is a very old, powerful and widely used concept. For the reasons of backwards compatibility, this flaw is now unlikely to be fixed. It's just a lesson of how important the separation of concerns is.
Non-capturing groups have just one difference from "normal" (capturing) groups: they don't require the regex engine to remember what they matched.
The use case is that sometimes you must (or should) use a group not because you are interested in what it captures but for syntactic reasons. In these situations it makes sense to use a non-capturing group instead of a "standard" capturing one because it is less resource intensive -- but if you don't care about that, a capturing group will behave in the exact same manner.
Your specific example does not make a good case for using non-capturing groups exactly because the two expressions are identical. A better example might be:
input.match(/hello (?:world|there)/)
In addition to the answers above, if you're using String.prototype.split() and you use a capturing group, the output array contains the captured results (see MDN). If you use a non-capturing group that doesn't happen.
var myString = 'Hello 1 word. Sentence number 2.';
var splits = myString.split(/(\d)/);
console.log(splits);
Outputs:
["Hello ", "1", " word. Sentence number ", "2", "."]
Whereas swapping /(\d)/ for /(?:\d)/ results in:
["Hello ", " word. Sentence number ", "."]
When you want to apply modifiers to the group.
/hello (?:world)?/
/hello (?:world)*/
/hello (?:world)+/
/hello (?:world){3,6}/
etc.
Use them when you need a conditional and don't care about which of the choices cause the match.
Non-capturing groups can simplify the result of matching a complex expression. Here, group 1 always holds the speaker's name. If the optional group were a capturing group instead, the name would move to group 2.
/hello (?:world|foobar )?said (.+)/
I have just found a different use for it. I was trying to capture a nested group but wanted the whole collection of the repeating group as one element.
So for AbbbbC,
(A)((?:b)*)(C)
gives three groups: A, bbbb, C.
For AC it also gives three groups: A, an empty string, C.
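A quick check of both inputs in JavaScript, where a group that matches zero repetitions yields an empty string (the variable names are illustrative):

```javascript
const re = /(A)((?:b)*)(C)/;

// slice(1) drops the full match and keeps only the three groups.
const withBs = 'AbbbbC'.match(re).slice(1);
const withoutBs = 'AC'.match(re).slice(1);

console.log(withBs);    // ["A", "bbbb", "C"]
console.log(withoutBs); // ["A", "", "C"]
```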

The behavior of /g mode matching

This article mentions:
Make sure you are clear on the fact that an expression pattern is
tested on each individual character. And that, just because the engine
moves forward when following the pattern and looking for a match it
still backtracks and examines each character in a string until a match
is found or if the global flag is set until all characters are
examined.
But what I tested in Javascript
"aaa#bbb".match(/a+#b+/g)
does not produce a result like:
["aaa#bbb", "aa#bbb", "a#bbb"]
It only produces ["aaa#bbb"]. It seems it does not examine each character to test the pattern. Can anyone explain the matching steps a little? Thanks.
/g does not mean it will try to find every possible subset of characters in the input string which may match the given pattern. It means that once a match is found, it will continue searching for more substrings which may match the pattern starting from the previous match onward.
For example:
"aaa#bbb ... aaaa#bbbb".match(/a+#b+/g);
Will produce
["aaa#bbb", "aaaa#bbbb"]
That explanation is mixing two distinct concepts that IMO should be kept separated
A) backtracking
When looking for a match, the normal behavior for a quantifier (?, *, +) is to be "greedy", i.e. to munch as much as possible. For example, in /(a+)([^b]+)/ tested with aaaacccc, all the a's will be part of group 1 even though they would also match the character set [^b] (everything but b).
However, if grabbing too much would prevent a match, the regex rules require the quantifier to "backtrack" and capture less if this allows the expression to match. For example, in (a+)([^b]+) tested with aaaa, group 1 will get only three a's, leaving one for group 2 to match.
You can change this greedy behavior with "non-greedy quantifiers" like *?, +?, ??. In this case the engine still backtracks, but with the reverse meaning: a non-greedy subexpression will eat as little as possible while still allowing the rest of the expression to match. For example, (a+)(a+b+) tested with aaabbb will leave two a's for group 1 and abbb for group 2, but (a+?)(a+b+) with the same string will leave only one a for group 1, because this is the minimum that allows the remaining part to match.
Note also that, because of backtracking, the greedy/non-greedy choice doesn't change whether an expression has a match, only how big the match is and how much goes to each sub-expression.
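The four cases described above can be reproduced with a few lines. The groups helper is a hypothetical convenience, not part of the answer:

```javascript
// Returns the captured groups only (drops the full match at index 0).
const groups = (re, s) => s.match(re).slice(1);

const greedyNoConflict = groups(/(a+)([^b]+)/, 'aaaacccc'); // ["aaaa", "cccc"]
const greedyBacktrack = groups(/(a+)([^b]+)/, 'aaaa');      // ["aaa", "a"]
const greedyFirst = groups(/(a+)(a+b+)/, 'aaabbb');         // ["aa", "abbb"]
const lazyFirst = groups(/(a+?)(a+b+)/, 'aaabbb');          // ["a", "aabbb"]

console.log(greedyNoConflict, greedyBacktrack, greedyFirst, lazyFirst);
```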
B) the "global" option
This is something totally unrelated to backtracking and simply means that instead of stopping at the first match the search must find all non-overlapping matches. This is done by finding the first match and then starting again the search after the end of the match.
Note that each match is computed using the standard regexp rules and there is no look-ahead or backtracking between different matches: in other words if making for example a greedy match shorter would give more matches in the string this option is not considered... a+[^b]+ tested with aaaaaa is going to give only one match even if g option is specified and even if the substrings aa, aa, aa would have been each a valid match for the regexp.
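One way to watch the global search resume after each match is to loop with exec and read lastIndex. This is a sketch; the sample text and the names steps and single are assumptions for illustration:

```javascript
const re = /a+#b+/g;
const text = 'aaa#bbb ... aaaa#bbbb';

// Collect each match with its start and the index where the next search resumes.
const steps = [];
let m;
while ((m = re.exec(text)) !== null) {
  steps.push({ match: m[0], start: m.index, resumeAt: re.lastIndex });
}
console.log(steps);

// A greedy match is never shortened to yield more matches.
const single = 'aaaaaa'.match(/a+[^b]+/g);
console.log(single); // ["aaaaaa"]
```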
When the global flag is used, it starts searching for the next match after the end of the previous match, to prevent generating lots of overlapping matches like that.
If you don't specify /g, the engine will stop as soon as a match is found.
If you do specify /g, it will keep going after a match. It will, however, still not produce overlapping matches, which is what you're asking about.
That's because of how regex matching works.
What a regex tries to do:
Every regex tries to find the best match at each position.
What a regex won't do:
It will not return every sub-combination of a single match, as you expected in your case.
When your expected output for "aaa#bbb".match(/a+#b+/g) does occur:
Try an input such as aaa#bbbHiaa#bbbHelloa#bbbSEEYOU instead, which will give you
aaa#bbb
aa#bbb
a#bbb
