Why is this second backreference not working? - javascript

I want to make use of backreferences as much as possible, to avoid duplicating combinations of many patterns.
Other requirements: use fewer literals, without constructing a new RegExp, while staying general.
Original title: Why is this negative lookahead with capturing group not working?
For example, a string:
1.'2.2'.33.'4.4'.5.(…etc)
I want to match the period-separated segments, keeping quoted segments whole and stripping their quotes. That is, to match:
1, 2.2, 33, 4.4, 5, (…etc).
A working regex is:
(?<=(["'])(?!\.)).*?(?=\1)|((?!["']|\.).)+
console.log(
"1.'2.2'.33.'4.4'.5.(…etc)".match(
/(?<=(["'])(?!\.)).*?(?=\1)|((?!["']|\.).)+/g
)
)
A non-working one is:
(?<=(["'])(?!\.)).*?(?=\1)|((?!\1|\.).)+
^^
console.log(
"1.'2.2'.33.'4.4'.5.(…etc)".match(
/(?<=(["'])(?!\.)).*?(?=\1)|((?!\1|\.).)+/g
)
)
It does not match 1, 33, 5, (…etc).
Why does the \1 (marked ^^ above) not work, and how can I correct it? Thank you!

The main point of confusion seems to be that backreferences are not "regex subroutines": they don't let you reuse part of the pattern elsewhere. Instead, they match again the exact string that the group matched earlier.
For example:
console.log(/(\w)\1/.test('AB')); // false: B is not a repeat of A
console.log(/(\w)\1/.test('AA')); // true
console.log(/(\w)\1/.test('BB')); // true
(\w)\1 does not match AB, but it does match AA and BB. The \1 part only matches the exact string that was matched by the (\w) group before.
In your case,
(?<=(["'])(?!\.)).*?(?=\1)
|
((?!\1|\.).)+
there are two branches separated by |. The second branch contains a backreference (\1) to a capturing group in the first branch ((["'])).
This can never match because the second branch is only tried if the first branch failed to match anything, but in that case the first capturing group also failed to match anything, so what string would \1 refer to?
If the capturing group referred to by a backreference never matched anything, JavaScript treats the backreference as matching the empty string.
The empty string matches at every position, so (?!\1) always fails, and with it the whole second branch.
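A minimal check of that behaviour, using made-up patterns rather than the one from the question: a backreference to a group that did not participate matches the empty string, so a negative lookahead on it can never succeed.

```javascript
// Group 1 lives in the first branch; the second branch runs only when it did not match.
const unsetRef = /(a)|b\1/.exec("b");
console.log(unsetRef[0]);              // "b": \1 matched the empty string
console.log(unsetRef[1]);              // undefined: group 1 never participated

// (?!\1) asks "is the empty string NOT here?", which is false at every position.
console.log(/(a)|(?!\1)b/.test("b"));  // false
console.log(/(a)|(?!x)b/.test("b"));   // true, for comparison
```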

console.log(
"1.'2.2'.33.'4.4'.5.(…etc)".match(
/(["'])[\d.]+\1|\d+/g
)
)
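If the quotes should also be dropped from the result, one possible sketch (not part of the original answer) is to capture the quoted body in its own group and read that group when it is set. The helper name splitSegments is made up for illustration, and String.prototype.matchAll and the ?? operator are assumed to be available (ES2020).

```javascript
// Hypothetical helper: quoted segments are kept whole, with their quotes stripped.
function splitSegments(input) {
  // Branch 1: opening quote, lazy body (group 2), same closing quote via \1.
  // Branch 2: a run of characters that are neither quotes nor periods.
  const pattern = /(["'])(.*?)\1|[^."']+/g;
  return Array.from(input.matchAll(pattern), (m) => m[2] ?? m[0]);
}

console.log(splitSegments("1.'2.2'.33.'4.4'.5"));
// [ '1', '2.2', '33', '4.4', '5' ]
```

Here \1 sits in the same branch as the group it refers to, so it always has a captured value when it is evaluated.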

Related

Regex to follow pattern except between braces

I am having a tough time figuring out a clean Regex (in a Javascript implementation) that will capture as much of a line as it can following a pattern, but anything inside braces doesn't need to follow the pattern. I'm not sure the best way to explain that except by example:
For example:
Let's say the pattern is, the line must start with 0, end with a 0 anywhere, but only allow sequence of 1, 2 or 3 in between, so I use ^(0[123]+0). This should match the first part of the strings:
0213123123130
012312312312303123123
01231230123123031230
etc.
But I want to be able to insert {gibberish} between braces into the line and have the Regex allow it to disrupt the pattern. i.e., ignore the pattern of the curly braces and everything inside, but still capture the full string including the {gibberish}. So this would capture everything in bold:
01232231{whatever 3 gArBaGe? I want.}121{foo}2310312{bar}3120123
and a 0 inside the braces does not end the capture prematurely, even if the pattern is correct.
01213123123123{21310030123012301}31231230123
EDIT: Now, I know I could just do something like ^0[123]*?(?:{.*})*?[123]*?0 maybe? But that only works if there is a single set of braces, and now I have to duplicate my [123] pattern. As that [123] pattern gets more complex, having it appear more than once in the Regex starts getting really incomprehensible. Something like the best regex trick seemed promising but I couldn't figure out how to apply it here. Using crazy lookarounds seems like the only way now but I would hope there's a cleaner way.
Since you've specified that you want the whole match including the garbage, you can use ^0([123]+(?:{[^}]*}[123]*)*)0 and use $1 to get the part between the 0s, or $& (match[0] in code) to get everything that matched.
https://regex101.com/r/iFSabs/3
Here's the rundown on how the regex works:
^ anchors the match to start at the beginning of the line
0 matches a literal zero character
([123]+(?:{[^}]*}[123]*)*) is a capturing group that captures everything inside of it.
[123]+ matches one or more instances of 1, 2, or 3
(?:{[^}]*}[123]*)* is a non-capturing group. I.e. it'll be part of the match, but won't have a $# for use in replacement or the match.
{[^}]*} matches a literal { followed by any number of non } characters followed by }
[123]* matches zero or more instances of 1, 2, or 3
Then this whole non-capturing group can be matched 0 or more times.
The process behind this regex is known as unrolling the loop. http://www.softec.lu/site/RegularExpressions/UnrollingTheLoop gives a good description of it. (with a few typo fixes)
The unrolling the loop technique is based on the hypothesis that in
most case, you [know] in a [repeated] alternation, which case should be
the most usual and which one is exceptional. We will called the first
one, the normal case and the second one, the special case. The general
syntax of the unrolling the loop technique could then be written as:
normal* ( special normal* )*
Which could means something like, match the normal case, if you find a
special case, matched it than match the normal case again. [You'll] notice
that part of this syntax could [potentially] lead to a super-linear
match.
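As a rough sketch of how the normal* ( special normal* )* shape can address the duplication concern from the question, the two sub-patterns can be kept in variables and assembled once. This does rely on the RegExp constructor, and the variable names are illustrative only.

```javascript
// "normal" and "special" follow the naming in the quote above.
const normal = String.raw`[123]`;       // the usual case
const special = String.raw`\{[^}]*\}`;  // the exceptional case: a braced section

// ^0 ( normal+ ( special normal* )* ) 0
const unrolled = new RegExp(`^0(${normal}+(?:${special}${normal}*)*)0`);

const m = unrolled.exec("01{foo 0}2{bar}30rest");
console.log(m[0]); // "01{foo 0}2{bar}30"
console.log(m[1]); // "1{foo 0}2{bar}3"
```

If the [123] part grows more complex, only the normal variable has to change.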
Example using Regex#test and Regex#match:
const strings = [
'0213123123130',
'012312312312303123123',
'01231230123123031230',
'01213123123123{21310030123012301}31231230123',
'01212121{hello 0}121312',
'012321212211231{whatever 3 gArBaGe? I want.}1212313123120123',
'012321212211231{whatever 3 gArBaGe? I want.}121231{extra garbage}3123120123',
];
const regex = /^0([123]+(?:{[^}]*}[123]*)*)0/
console.log('tests')
console.log(strings.map(string => `'${string}': ${regex.test(string)}`))
console.log('matches');
let matches = strings
.map((string) => regex.exec(string))
.map((match) => (match ? match[1] : undefined));
console.log(matches);
Robo Robok's answer is what I'd go with if you only want to keep the non-braced part, although I'd use a slightly different regex ({[^}]*}) for a bit more performance.
How about the other way around? Checking the string with curly tags removed:
const string = '012321212211231{whatever 3 gArBaGe? I want.}1212313123120123{foo}123';
const stringWithoutTags = string.replace(/\{.*?\}/g, '');
const result = /^(0[123]+0)/.test(stringWithoutTags);
You say you need to capture everything, including the gibberish, so I think a simple pattern like this should work:
^(0(?:[123]|{.+?})+0)
That allows a string starting with 0, and then any of your pattern characters (1, 2, or 3), or one of the { gibberish } sections, and allows that to repeat to handle multiple gibberish sections, and finally it must end with a 0.
https://regex101.com/r/K4teGY/2
You might use
^0[123]*(?:{[^{}]*}[123]*)*0
^ Start of string
0 Match a zero
[123]* Match 0+ times either 1, 2 or 3
(?: Non capture group
{[^{}]*}[123]* match from an opening till closing } followed by 0+ either 1, 2 or 3
)* Close group and repeat 0+ times
0 Match a zero
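A quick, hedged check of this variant against a few inputs (the test strings are invented for illustration, and the braces are escaped here, which does not change the meaning). Because it uses [123]* right after the leading 0, it also accepts a braced section immediately after that 0.

```javascript
const re = /^0[123]*(?:\{[^{}]*\}[123]*)*0/;

console.log(re.exec("012{a0b}310")[0]); // "012{a0b}310": the 0 inside the braces is skipped
console.log(re.exec("0{x}10")[0]);      // "0{x}10"
console.log(re.exec("0123012")[0]);     // "01230": stops at the first closing 0
console.log(re.exec("0{a"));            // null: the brace is never closed
```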

Why isn't this group capturing all items that appear in parentheses?

I'm trying to create a regex that will capture a string not enclosed by parentheses in the first group, followed by any amount of strings enclosed by parentheses.
e.g.
2(3)(4)(5)
Should be: 2 - first group, 3 - second group, and so on.
What I came up with is this regex: (I'm using JavaScript)
([^()]*)(?:\((([^)]*))\))*
However, when I enter a string like A(B)(C)(D), I only get the A and D captured.
https://regex101.com/r/HQC0ib/1
Can anyone help me out on this, and possibly explain where the error is?
Since you cannot use a \G anchor in a JS regex (to match consecutive matches), and there is no per-group capture stack as in the .NET or PyPI regex libraries, you need a two-step approach: 1) match the strings as whole streaks of text, and then 2) post-process them to get the values required.
var s = "2(3)(4)(5) A(B)(C)(D)";
var rx = /[^()\s]+(?:\([^)]*\))*/g;
var res = [], m;
while(m=rx.exec(s)) {
res.push(m[0].split(/[()]+/).filter(Boolean));
}
console.log(res);
I added \s to the negated character class [^()] since I added the examples as a single string.
Pattern details
[^()\s]+ - 1 or more chars other than (, ) and whitespace
(?:\([^)]*\))* - 0 or more sequences of:
\( - a (
[^)]* - 0+ chars other than )
\) - a )
The splitting regex is [()]+ that matches 1 or more ) or ( chars, and filter(Boolean) removes empty items.
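A variation on the same two-step idea, sketched with String.prototype.matchAll (assumed available, ES2020). The helper name parseItems is hypothetical: it takes the leading text first and then collects every parenthesised item.

```javascript
// Hypothetical helper for a single token such as "A(B)(C)(D)".
function parseItems(token) {
  // Step 1: everything before the first parenthesis (may be empty).
  const head = token.match(/^[^()]*/)[0];
  // Step 2: one match per "(...)" section; capture group 1 is its content.
  const items = Array.from(token.matchAll(/\(([^)]*)\)/g), (m) => m[1]);
  return [head, ...items];
}

console.log(parseItems("A(B)(C)(D)")); // [ 'A', 'B', 'C', 'D' ]
console.log(parseItems("2(3)(4)(5)")); // [ '2', '3', '4', '5' ]
```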
You cannot have an undetermined number of capture groups. The number of capture groups you get is determined by the regular expression, not by the input it parses. A capture group that occurs within another repetition will indeed only retain the last of those repetitions.
If you know the maximum number of repetitions you can encounter, then just repeat the pattern that many times, and make each of them optional with a ?. For instance, this will capture up to 4 items within parentheses:
([^()]*)(?:\(([^)]*)\))?(?:\(([^)]*)\))?(?:\(([^)]*)\))?(?:\(([^)]*)\))?
It's not an error. In a regex, when you repeat a capture group, as in (...)*, only the last occurrence is stored in the backreference.
For example:
On a string "a,b,c,d", if you match /(,[a-z])+/ then the back reference of capture group 1 (\1) will give ",d".
If you want it to return more, then you could surround it in another capture group.
--> With /((?:,[a-z])+)/ then \1 will give ",b,c,d".
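Both behaviours can be checked directly; this is just the example above written out, plus the pattern from the question:

```javascript
const text = "a,b,c,d";

// The group is overwritten on each repetition, so only the last one survives.
console.log(text.match(/(,[a-z])+/)[1]);     // ",d"

// Wrapping the whole repetition in an outer group keeps all of it.
console.log(text.match(/((?:,[a-z])+)/)[1]); // ",b,c,d"

// The pattern from the question behaves the same way.
const m = "A(B)(C)(D)".match(/([^()]*)(?:\((([^)]*))\))*/);
console.log(m[1], m[2]);                     // A D
```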
To get those numbers between the parentheses you could also just try to match the word characters.
For example:
var str = "2(3)(14)(B)";
var matches = str.match(/\w+/g);
console.log(matches);

Satisfying two condition in one regex pattern in javascript

I am not sure if I have put the question right.
I want to satisfy both the text with one regular expression.
text1 = 'foobar';
text2 = 'foobar-baz';
Expected Output of text1
$1 should be bar
$2 should be ''
Expected Output of text2
$1 should be bar
$2 should be baz
Here is what I have tried:
/foo([a-z0-9\-_=\+\/]+)(\-(.*))?/i
result for text1 is correct but for text2, $1 gets the full string foobar-baz
The problem here is due to the possible inclusion of - in the first capturing group. There are 2 cases:
There are one or more - in the string, and you want to pick the last group delimited by the hyphen. Intuitively, we think of greedy quantifier, and a simple solution like:
input.match(/foo([a-z0-9_=+\/-]+)-(.*)/)
would work.
However, the second case, where there is no - in the string, causes a problem when combined with the previous case.
Since [a-z0-9_=+\/-]+ contains -, if you make -(.*) optional, given an input in the first case, it will just match to the end of the string and put everything in the first capturing group.
We need to control the backtracking behavior so that when there is at least one -, it must match it and match the last one, and allow the first group to gobble up everything when there is no -.
One solution which makes minimal change to your current regex is:
input.match(/foo([a-z0-9_=+\/-]+?)(?:-([a-z0-9_=+\/]*))?$/)
The lazy quantifier makes the engine try the left-most - first, and the anchor $ together with the character class without - at the end forces the engine to split only at the last -, if any.
Note that the second capturing group will be undefined when there is no -.
Sample input output:
'foogoobarbaz'.match(/foo([a-z0-9_=+\/-]+?)(?:-([a-z0-9_=+\/]*))?$/)
> [ "foogoobarbaz", "goobarbaz", undefined ]
'foogoobar-baz'.match(/foo([a-z0-9_=+\/-]+?)(?:-([a-z0-9_=+\/]*))?$/)
> [ "foogoobar-baz", "goobar", "baz" ]
'foogoo-bar-baz'.match(/foo([a-z0-9_=+\/-]+?)(?:-([a-z0-9_=+\/]*))?$/)
> [ "foogoo-bar-baz", "goo-bar", "baz" ]
You can use a non-capturing group:
/foo([a-z0-9\-_=\+\/]+)(?:-(.*))?/i
That solves the problem of avoiding the additional capture group. However, your pattern still has the problem of including - as a valid character for the first string. Because of that, when you execute the pattern against "foobar-baz", the entire fragment "bar-baz" will match the first group in the pattern.
You're going to have to decide what it is you want to match; your rule is currently at odds with the result you seek. If you remove the - from the first group:
/foo([a-z0-9_=\+\/]+)(?:-(.*))?/i
then you get the result you say you're looking for.
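A small check of that last pattern against the two inputs from the question. The unmatched optional group comes back as undefined, so ?? "" (an addition here, assuming ES2020) is used to get the empty string the question asks for; the helper name parts is made up.

```javascript
const pattern = /foo([a-z0-9_=+\/]+)(?:-(.*))?/i;

// Hypothetical helper returning [$1, $2], with "" standing in for a missing $2.
function parts(text) {
  const m = text.match(pattern);
  return m ? [m[1], m[2] ?? ""] : null;
}

console.log(parts("foobar"));     // [ 'bar', '' ]
console.log(parts("foobar-baz")); // [ 'bar', 'baz' ]
```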

JavaScript Regular Expressions Basics

I'm trying to learn Regular Expressions and at the moment I've gathered a very basic understanding from all of the overviews from W3, Mozilla or http://www.regular-expressions.info/, but when I was exploring this wikibook http://en.wikibooks.org/wiki/JavaScript/Regular_Expressions it gave this example:
"abbc".replace(/(.)\1/g, "$1") => "abc"
which I have no idea why is true (the wikibook didn't really explain), but I tried it myself and it does drop the second b. I know \1 is a backreference to the captured group (.), but . is the any character besides a new line symbol... Wouldn't that still pick up the second b? Trying a few variations didn't clear things up either...
"abbc".replace(/(.)/g, "$1") => "abbc"
"aabc".replace(/(.)*/g, "$1") => "c"
Does anybody have a good in depth tutorial on Javascript Regular Expressions (I've looked at a couple of books and they're very generalized for about 15 languages and no real emphasis on Javascript).
First One
(.) matches and captures a single character to Group 1, so (.)\1 matches two of the same characters, for instance AA.
In the string, the only match for this pattern is bb.
By replacing these two characters bb by the Group 1 capture buffer $1, i.e. b, we replace two chars with one, effectively removing one b.
Second One
Again (.) matches and captures a single character, capturing it to Group 1.
The pattern matches each character in the string in turn.
The replacement is the Group 1 capture buffer $1, so we replace each character with itself. Therefore the string is unchanged.
Third One
Here, forgetting the parentheses for a moment, .* matches the whole string: this is the match.
The quantifier * means that the Group 1 is reset every time a single character is matched (new group numbers are not created, as group numbering is done from left to right).
For every character that is matched, that character is therefore captured to Group 1—until the next capture resets Group 1.
The end value of Group 1 is the last capture, which is the last character c.
We replace the match (i.e., the whole string) with Group 1 (i.e. c), so the replacement string is c.
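To watch Group 1 being overwritten, a replacer function can record each match (a sketch; the seen array is only for illustration). Note that with the g flag the pattern also produces a second, empty match at the end of the string, where Group 1 is unset.

```javascript
const seen = [];
const result = "aabc".replace(/(.)*/g, (match, group1, offset) => {
  seen.push({ match, group1, offset });
  // An unset group is undefined in a replacer; "$1" would insert "" instead.
  return group1 ?? "";
});

console.log(result); // "c"
console.log(seen);
// [ { match: 'aabc', group1: 'c', offset: 0 },
//   { match: '', group1: undefined, offset: 4 } ]
```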
The details of group numbering are important to grasp, and I highly recommend you read the linked article about "the gory details".
Reference
Capture Group Numbering & Naming: The Gory Details
JavaScript Regex Basics
Backreferences
This is quite simple when broken down:
With "abbc".replace(/(.)\1/g, "$1"), the result is "abc" because:
(.) captures one character.
\1 refers back to whatever that first group captured.
So the pattern says "find the same character twice in a row", and the replacement puts the captured character back once. Any doubled character is therefore matched and replaced by a single copy.

Regular expression match 0 or exact number of characters

I want to match an input string in JavaScript with 0 or 2 consecutive dashes, not 1, i.e. not a range.
If the string is:
-g:"apple" AND --projectName:"grape": it should match --projectName:"grape".
-g:"apple" AND projectName:"grape": it should match projectName:"grape".
-g:"apple" AND -projectName:"grape": it should not match, i.e. return null.
--projectName:"grape": it should match --projectName:"grape".
projectName:"grape": it should match projectName:"grape".
-projectName:"grape": it should not match, i.e. return null.
To simplify this question considering this example, the RE should match the preceding 0 or 2 dashes and whatever comes next. I will figure out the rest. The question still comes down to matching 0 or 2 dashes.
Using -{0,2} matches 0, 1, 2 dashes.
Using -{2,} matches 2 or more dashes.
Using -{2} matches only 2 dashes.
How to match 0 or 2 occurrences?
Answer
If you split your "word-like" patterns on spaces, you can use this regex and your wanted value will be in the first capturing group:
(?:^|\s)((?:--)?[^\s-]+)
\s is any whitespace character (tab, whitespace, newline...)
[^\s-] is anything except a whitespace-like character or a -
Once again the problem is anchoring the regex so that the relevant part isn't completely optional: here the anchor ^ or a mandatory whitespace \s plays this role.
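A sketch of how that pattern could be applied to the inputs from the question, collecting the first capturing group of every match (the helper name tokens is made up, and matchAll is assumed to be available). Note that it also returns other dash-free words such as AND, so the wanted token still has to be picked out afterwards.

```javascript
// Hypothetical helper: every space-separated word with zero or two leading dashes.
function tokens(input) {
  const pattern = /(?:^|\s)((?:--)?[^\s-]+)/g;
  return Array.from(input.matchAll(pattern), (m) => m[1]);
}

console.log(tokens('-g:"apple" AND --projectName:"grape"'));
// [ 'AND', '--projectName:"grape"' ]
console.log(tokens('-g:"apple" AND -projectName:"grape"'));
// [ 'AND' ]
console.log(tokens('-projectName:"grape"'));
// []
```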
What we want to do
Basically you want to check if your expression (two dashes) is there or not, so you can use the ? operator:
(?:--)?
"Either two or none", (?:...) is a non capturing group.
Avoiding confusion
You want to match "zero or two dashes", so if this is your entire regex it will always find a match: in an empty string, in --, in -, in foobar... What gets matched in those strings may be just an empty string, but the regex will still return a match.
This is a common source of misunderstanding, so bear in mind the rule that if everything in your regex is optional, it will always find a match.
If you want to only return a match if your entire string is made of zero or two dashes, you need to anchor the regex:
^(?:--)?$
^ and $ match the beginning and the end of the string, respectively.
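The difference between the unanchored and the anchored form can be seen in a short check:

```javascript
// Unanchored: everything is optional, so there is always a (possibly empty) match.
console.log(/(?:--)?/.exec("-")[0]);      // ""
console.log(/(?:--)?/.exec("foobar")[0]); // ""

// Anchored: the whole string must be exactly zero or two dashes.
const zeroOrTwo = /^(?:--)?$/;
console.log(zeroOrTwo.test(""));    // true
console.log(zeroOrTwo.test("--"));  // true
console.log(zeroOrTwo.test("-"));   // false
console.log(zeroOrTwo.test("---")); // false
```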
a(-{2})?(?!-)
This is using "a" as an example. It matches a followed by either zero or two dashes, and the (?!-) rejects a further dash after that.
Edit:
According to your example, this should work
(?<!-)(-{2})?projectName:"[a-zA-Z]*"
Edit 2:
Older JavaScript engines do not support lookbehinds (they were added in ES2018).
Try this:
[^-](-{2})?projectName:"[a-zA-Z]*"
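One caveat with [^-] is that it needs a character to consume, so it cannot match at the very start of the string. A possible workaround, sketched here as an assumption rather than as the answerer's method, is (?:^|[^-]) with the wanted text in a capturing group. The lookbehind variant is shown alongside for engines that support it; the variable names are illustrative.

```javascript
// Lookbehind variant (needs an engine with lookbehind support).
const withLookbehind = /(?<!-)(?:--)?projectName:"[a-zA-Z]*"/;
// Fallback: (?:^|[^-]) consumes the guard character, group 1 holds the result.
const fallback = /(?:^|[^-])((?:--)?projectName:"[a-zA-Z]*")/;

const samples = [
  '-g:"apple" AND --projectName:"grape"',
  '-g:"apple" AND projectName:"grape"',
  '-g:"apple" AND -projectName:"grape"',
  '--projectName:"grape"',
  'projectName:"grape"',
  '-projectName:"grape"',
];

for (const s of samples) {
  const a = withLookbehind.exec(s);
  const b = fallback.exec(s);
  // Both columns should agree: the token with 0 or 2 dashes, or null for 1 dash.
  console.log(a ? a[0] : null, "|", b ? b[1] : null);
}
```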
