Why isn't this group capturing all items that appear in parentheses? - javascript

I'm trying to create a regex that will capture a string not enclosed by parentheses in the first group, followed by any number of strings enclosed by parentheses.
e.g.
2(3)(4)(5)
Should be: 2 - first group, 3 - second group, and so on.
What I came up with is this regex: (I'm using JavaScript)
([^()]*)(?:\((([^)]*))\))*
However, when I enter a string like A(B)(C)(D), I only get the A and D captured.
https://regex101.com/r/HQC0ib/1
Can anyone help me out on this, and possibly explain where the error is?

Since you cannot use a \G anchor in a JS regex (to match consecutive matches), and there is no capture stack for each capturing group as in the .NET or PyPI regex libraries, you need a two-step approach: 1) match each whole streak of text, and then 2) post-process each match to get the required values.
var s = "2(3)(4)(5) A(B)(C)(D)";
var rx = /[^()\s]+(?:\([^)]*\))*/g;
var res = [], m;
while ((m = rx.exec(s))) {
  res.push(m[0].split(/[()]+/).filter(Boolean));
}
console.log(res);
I added \s to the negated character class [^()] since I added the examples as a single string.
Pattern details
[^()\s]+ - 1 or more chars other than (, ) and whitespace
(?:\([^)]*\))* - 0 or more sequences of:
\( - a (
[^)]* - 0+ chars other than )
\) - a )
The splitting regex is [()]+ that matches 1 or more ) or ( chars, and filter(Boolean) removes empty items.
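The two-step approach above can also be wrapped into a small helper. This is only a sketch: the function name splitGroups is made up for illustration, and it uses String.prototype.matchAll instead of the exec loop, which assumes an ES2020-capable runtime.

```javascript
// Hypothetical helper; the name is illustrative and not from the answer above.
function splitGroups(input) {
  // Step 1: match each streak such as "2(3)(4)(5)" as a whole.
  const streak = /[^()\s]+(?:\([^)]*\))*/g;
  // Step 2: split every streak on runs of parentheses and drop empty items.
  return Array.from(input.matchAll(streak), (m) =>
    m[0].split(/[()]+/).filter(Boolean)
  );
}

const groups = splitGroups("2(3)(4)(5) A(B)(C)(D)");
console.log(groups);
```

Note that filter(Boolean) also drops empty parenthesized items, so an input like A() yields just A.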

You cannot have an undetermined number of capture groups. The number of capture groups you get is determined by the regular expression, not by the input it parses. A capture group that occurs within another repetition will indeed only retain the last of those repetitions.
If you know the maximum number of repetitions you can encounter, then just repeat the pattern that many times, and make each of them optional with a ?. For instance, this will capture up to 4 items within parentheses:
([^()]*)(?:\(([^)]*)\))?(?:\(([^)]*)\))?(?:\(([^)]*)\))?(?:\(([^)]*)\))?
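As a rough illustration of the difference, the sketch below runs a repeated group and the unrolled pattern against the same input. The ^ and $ anchors and the variable names are additions for the example, and the repeated pattern is a simplified form of the one in the question.

```javascript
// Repeated group: the capture is overwritten on every repetition,
// so only the last parenthesized item is retained.
const repeated = /^([^()]*)(?:\(([^)]*)\))*$/;
const repeatedResult = "A(B)(C)(D)".match(repeated).slice(1);
console.log(repeatedResult);

// Unrolled pattern: one optional group per expected item (up to 4 here).
const unrolled =
  /^([^()]*)(?:\(([^)]*)\))?(?:\(([^)]*)\))?(?:\(([^)]*)\))?(?:\(([^)]*)\))?$/;
const unrolledResult = "A(B)(C)(D)".match(unrolled).slice(1);
console.log(unrolledResult);
```

Groups that did not participate in the match come back as undefined, so the caller may want to filter those out.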

It's not an error. In regex, when you repeat a capture group, as in (...)*, only the last occurrence is kept in the backreference.
For example:
On a string "a,b,c,d", if you match /(,[a-z])+/ then the back reference of capture group 1 (\1) will give ",d".
If you want it to return more, then you could surround it in another capture group.
--> With /((?:,[a-z])+)/ then \1 will give ",b,c,d".
To get those numbers between the parentheses you could also just try to match the word characters.
For example:
var str = "2(3)(14)(B)";
var matches = str.match(/\w+/g);
console.log(matches);
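The repeated-group behaviour described in this answer can be checked directly. A minimal sketch, with variable names chosen only for illustration:

```javascript
const list = "a,b,c,d";

// The group is repeated, so group 1 holds only the last repetition.
const lastOnly = list.match(/(,[a-z])+/);
console.log(lastOnly[0], lastOnly[1]);

// Wrapping the whole repetition in a capture group keeps the full run.
const wholeRun = list.match(/((?:,[a-z])+)/);
console.log(wholeRun[1]);
```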

Related

Javascript regex remove substrings not in larger string

I have an input string containing a math expression that may contain comma-separated values that I need to remove if they do not occur within an aggregation function. In those cases, I just want the first value to remain.
Consider the following example strings:
max ( 100,200,30,4 ) GOOD expression, do nothing
min ( 10,23,111 ) GOOD expression, do nothing
min ( 10,20 ) GOOD expression, do nothing
10,2,34 + 4 BAD expression, remove extra comma-number sequences => 10 + 4
So far I have tried surrounding a comma-number pattern (,\d+)+ with negative lookbehind/lookaheads:
str.replaceAll(/(?<!(max|min)\s\(\s\d+)(,\d+)+(?!\s\))/g, '');
However while this picks up the comma-number sequence outside of functions, this also incorrectly matches in valid situations as well:
max ( 100,200,30,4 ) GOOD expression
             ^^^ BAD match
min ( 10,23,111 ) GOOD expression
           ^^^ BAD match
min ( 10,20 ) GOOD expression
        GOOD (non-match)
10,2,34 + 4 BAD expression
  ^^^^^ GOOD match
In each instance, I understand why it's matching, but I'm at a loss as to how to prevent it.
How can I do this?
You could use a capture group to capture what you want to keep, and match what you want to remove.
In the replacement you could check for group 1. If it exists, return the group, else return an empty string so that what is matched is removed.
((?:max|min)\s\(\s*\d+(?:\s*,\s*\d+)*\s*\))|(?:,\d+)+
( Capture group 1
(?:max|min)\s Match either max or min and a whitespace char
\(\s*\d+ match ( optional whitespace chars and 1+ digits
(?:\s*,\s*\d+)*\s* Optionally repeat matching a comma between optional whitespace chars and 1+ digits, followed by optional whitespace chars
\) Match )
) Close group 1
| Or
(?:,\d+)+ Match 1+ times a comma and 1+ digits (You could also add \s* again for optional whitespace chars before and after the comma)
const regex = /((?:max|min)\s\(\s*\d+(?:\s*,\s*\d+)*\s*\))|(?:,\d+)+/g;
let items = [
  "max ( 100,200,30,4 )",
  "min ( 10,23,111 )",
  "min ( 10,20 )",
  "10,2,34 + 4"
].map(s => s.replace(regex, (m, g1) => g1 !== undefined ? g1 : ""));
console.log(items);
Took me a while to figure out what was going on in The fourth bird's answer. Quite a stroke of genius if you ask me.
For the sake of discussion, I will simplify the regex to the following, to find substrings that are not part of larger strings:
// all bcd's that are not in abcde
const regex = /(abcde)|(?:bcd)/g
If a match is found above (on either side of the pipe), an array is returned containing the full match at index 0, with additional indexes 1..n populated by capture groups in the expression as they occur in the expression from left to right.
By putting a capture group just on one side of the pipe, we know which side the match occurred on by whether indexes 1..n have anything in them.
If the match is made on left side of the pipe, index 1 will contain abcde since the whole side is a capture group.
If the match is made on the right side of the pipe (a non-capture group), nothing is captured and index 1 will be undefined.
We can then use a simple replaceAll(regex, '$1');, where any matches found are replaced by the contents of the first capture group. Matches found on the left side of the pipe get replaced by themselves; those on the right get replaced with nothing.
// all bcd's that are not in abcde
const regex = /(abcde)|(?:bcd)/g
console.log('abcdebcdbcdbcd'.replaceAll(regex, '$1'))
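To see which side of the pipe each match came from, one option is to inspect index 1 of every match. The sketch below assumes String.prototype.matchAll is available; the names sideRegex and sides are only for illustration.

```javascript
// Same simplified pattern: capture group on the left, none on the right.
const sideRegex = /(abcde)|(?:bcd)/g;

// Index 1 is defined only when the left alternative matched.
const sides = Array.from("abcdebcd".matchAll(sideRegex), (m) => ({
  match: m[0],
  side: m[1] !== undefined ? "left" : "right",
}));
console.log(sides);
```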

Regex match multiple same expression multiple times

I have this string: {bgRed Please run a task, {red a list has been provided below}. I need to do a string replace to remove the braces and also the first word.
So below I would want to remove {bgRed and {red, and then the trailing brace, which I can do separately.
I have managed to create this regex, but it is only matching {bgRed and not {red, can someone lend a hand?
/^\{.+?(?=\s)/gm
Note that you are using the ^ anchor at the start, which makes your pattern match only at the start of a line (mind also the m modifier). .+?(?=\s|$) is too cumbersome: to match any 1+ chars up to the first whitespace or the end of string, use {\S+ (or {\S* if you plan to match { without any non-whitespace chars after it).
You may use
s = s.replace(/{\S*|}/g, '')
You may trim the outcome to get rid of resulting leading/trailing spaces:
s = s.replace(/{\S*|}/g, '').trim()
Details
{\S* - { char followed with 0 or more non-whitespace characters
| - or
} - a } char.
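Run against the string from the question, the pattern above leaves two spaces where {red was removed, because the spaces on both sides of the tag survive. The sketch below shows that, plus a possible variant (an assumption, not part of the answer) that also consumes one whitespace character after each tag.

```javascript
const tagged = "{bgRed Please run a task, {red a list has been provided below}";

// Pattern from the answer: the spaces around "{red" both remain.
const plain = tagged.replace(/\{\S*|\}/g, "").trim();
console.log(JSON.stringify(plain));

// Variant: \s? additionally removes one whitespace char after each tag.
const tidy = tagged.replace(/\{\S*\s?|\}/g, "");
console.log(JSON.stringify(tidy));
```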
If the goal is to go from
"{bgRed Please run a task, {red a list has been provided below}"
to
"Please run a task, a list has been provided below"
a regex with two capture groups seems simplest:
const original = "{bgRed Please run a task, {red a list has been provided below}";
const rex = /\{\w+ ([^{]+)\{\w+ ([^}]+)}/g;
const result = original.replace(rex, "$1$2");
console.log(result);
\{\w+ ([^{]+)\{\w+ ([^}]+)} is:
\{ - a literal {
\w+ - one or more word characters ("bgRed")
a literal space
([^{]+) one or more characters that aren't {, captured to group 1
\{ - another literal {
\w+ - one or more word characters ("red")
a literal space
([^}]+) - one or more characters that aren't }, captured to group 2
} - a literal }
The replacement uses $1 and $2 to swap in the capture group contents.

regex if capture group matches string

I need to build a simple script to hyphenate Romanian words. I've seen several and they don't implement the rules correctly.
var words = "arta codru";
Rule: if 2 consonants are between 2 vowels, then they become split between syllables unless they belong in this array in which case both consonants move to the second syllable:
var exceptions_to_regex2 = ["bl","cl","dl","fl","gl","hl","pl","tl","vl","br","cr","dr","fr","gr","hr","pr","tr","vr"];
Expected result: ar-ta co-dru
The code so far:
https://playcode.io/156923?tabs=console&script.js&output
var words = "arta codru";
var exceptions_to_regex2 = ["bl","cl","dl","fl","gl","hl","pl","tl","vl","br","cr","dr","fr","gr","hr","pr","tr","vr"];
var regex2 = /([aeiou])([bcdfghjklmnprstvwxy]{1})(?=[bcdfghjklmnprstvwxy]{1})([aeiou])/gi;
console.log(words.replace(regex2, '$1$2-'));
console.log("desired result: ar-ta co-dru");
Now I would need to do something like this:
if (exceptions_to_regex2.includes($2+$3)){
words.replace(regex2, '$1-');
}
else {
words.replace(regex2, '$1$2-');
}
Obviously it doesn't work because I can't just use the capture groups as I would a regular variable. Please help.
You may code your exceptions as a pattern to check for after a vowel, and stop matching there, or you may still consume any other consonant before another vowel, and replace with the backreference to the whole match with a hyphen right after:
.replace(/[aeiou](?:(?=[bcdfghptv][lr])|[bcdfghj-nprstvwxy](?=[bcdfghj-nprstvwxy][aeiou]))/g, '$&-')
Add i modifier after g if you need case insensitive matching.
Details
[aeiou] - a vowel
(?: - start of a non-capturing group:
(?=[bcdfghptv][lr]) - a positive lookahead that requires the exception letter clusters to appear immediately to the right of the current position
| - or
[bcdfghj-nprstvwxy] - a consonant
(?=[bcdfghj-nprstvwxy][aeiou]) - followed with any consonant and a vowel
) - end of the non-capturing group.
The $& in the replacement pattern is the placeholder for the whole match value (at regex101, only $0 can be used for this at the moment, since the site does not support language-specific replacement patterns).
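For comparison, here is a sketch of the pattern-only solution next to a callback-based variant that is closer to what the question attempted. The function names are hypothetical, and the callback version is an assumption about how the exceptions array could be consulted; the two are only checked against the sample words here.

```javascript
// Exception clusters taken from the question.
const clusters = new Set([
  "bl", "cl", "dl", "fl", "gl", "hl", "pl", "tl", "vl",
  "br", "cr", "dr", "fr", "gr", "hr", "pr", "tr", "vr",
]);

// Pattern-only version, as in the answer above.
function hyphenateWithPattern(text) {
  const rule =
    /[aeiou](?:(?=[bcdfghptv][lr])|[bcdfghj-nprstvwxy](?=[bcdfghj-nprstvwxy][aeiou]))/gi;
  return text.replace(rule, "$&-");
}

// Callback version: the second consonant is captured inside the lookahead,
// so it can be inspected without being consumed by the match.
function hyphenateWithCallback(text) {
  const rule =
    /([aeiou])([bcdfghj-nprstvwxy])(?=([bcdfghj-nprstvwxy])[aeiou])/gi;
  return text.replace(rule, (match, vowel, first, second) =>
    clusters.has((first + second).toLowerCase())
      ? vowel + "-" + first // exception: both consonants start the next syllable
      : vowel + first + "-" // otherwise split between the two consonants
  );
}

const byPattern = hyphenateWithPattern("arta codru");
const byCallback = hyphenateWithCallback("arta codru");
console.log(byPattern, "|", byCallback);
```

A replace callback receives the capture groups as ordinary arguments, which is what makes the lookup in the exceptions set possible.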

Replace last character of a matched string using regex

I need to post-process lines in a file by replacing the last character of a string matching a certain pattern.
The string in question is:
BRDA:2,1,0,0
I'd like to replace the last digit from 0 to 1. The second and third digits are variable, but the strings I want to affect will always start with BRDA:2.
I know I can match the entire string using regex like so
/BRDA:2,\d,\d,0/
How would I get at that last digit for performing a replace on?
Thanks
You may match and capture the parts of the string with capturing groups to be able to restore those parts in the replacement result later with backreferences. What you need to replace/remove should be just matched.
So, use
var s = "BRDA:2,1,0,0"
s = s.replace(/(BRDA:2,\d+,\d+,)\d+/, "$11")
console.log(s)
If you need to match the whole string you also need to wrap the pattern with ^ and $:
s = s.replace(/^(BRDA:2,\d+,\d+,)\d+$/, "$11")
Details:
^ - a start of string anchor
(BRDA:2,\d+,\d+,) - Capturing group #1 matching:
BRDA:2, - a literal substring BRDA:2,
\d+ - 1 or more digits
, - a comma
\d+, - 1+ digits and a comma
\d+ - 1+ digits.
The replacement - $11 - is parsed by the JS regex engine as the $1 backreference to the value kept inside Group 1 followed by a literal 1 digit, because the pattern has only one capturing group (so there is no Group 11).
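If the $11 form looks too easy to misread, a named group or a replacement callback avoids the ambiguity. The sketch below is one way to write it; the group name prefix and the variable names are assumptions for the example, and named groups require an ES2018-capable runtime.

```javascript
const line = "BRDA:2,1,0,0";

// Numbered form from the answer: $1 followed by a literal "1".
const numbered = line.replace(/^(BRDA:2,\d+,\d+,)\d+$/, "$11");

// Named group: $<prefix> cannot be confused with a two-digit group number.
const named = line.replace(/^(?<prefix>BRDA:2,\d+,\d+,)\d+$/, "$<prefix>1");

// Callback: the captured prefix arrives as a plain string argument.
const viaCallback = line.replace(
  /^(BRDA:2,\d+,\d+,)\d+$/,
  (match, prefix) => prefix + "1"
);

console.log(numbered, named, viaCallback);
```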

regular expression capture groups

I'm learning regular expressions (currently in JavaScript).
My question is this:
I have a single string of some length.
This string contains at least three (obligatory) patterns.
As a result, I want to call rule.exec() on the string and get a three-element array, with each pattern in a separate element.
How should I approach this? I have managed it for now, but with a lot of ups and downs, and I don't know what EXACTLY should be done to capture a group. Is it the parentheses () that separate each group of a regular expression?
My Regular Expression Rule example:
var rule = /([a-zA-Z0-9].*\s?(#classs?)+\s+[a-zA-Z0-9][^><]*)/g;
var str = "<Home #class www.tarjom.ir><string2 stringValue2>";
var res;
var keys = [];
var values = [];
while ((res = rule.exec(str)) != null) {
  values.push(res[0]);
}
console.log(values);
// begin to slice them
var sliced = [];
for (var item in values) {
  // convert each item into an array, then add it to a super array
  sliced.push(values[item].split(" "));
}
/// Last Updated on 7th of Esfand
console.log(sliced);
And the return result (with firefox 27 - firebug console.log)
[["Home", "#class", "www.tarjom.ir"]]
I have got what I needed, I just need a clarification on the return pattern.
Yes, parentheses capture everything between them. Captured groups are numbered by their opening parenthesis. So if /(foo)((bar)baz)/ matches, your first captured group will contain foo, your second barbaz, and your third bar. In some dialects, only the first 9 capturing groups are numbered.
Captured groups can be used for backreferencing. If you want to match "foobarfoo", /(foo)bar\1/ will do that, where \1 means "the first group I captured".
There are ways to avoid capturing, if you just need the parenthesis for grouping. For instance, if you want to match either "foo" or "foobar", /(foo(bar)?)/ will do so, but may have captured "bar" in its second group. If you want to avoid this, use /(foo(?:bar)?)/ to only have one capture, either "foo" or "foobar".
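The numbering, backreference, and non-capturing examples above can be tried out as follows. This is a small sketch; the variable names are only for illustration.

```javascript
// Groups are numbered by the position of their opening parenthesis.
const nested = "foobarbaz".match(/(foo)((bar)baz)/);
const nestedGroups = nested.slice(1);
console.log(nestedGroups);

// \1 refers back to whatever the first group captured.
const backref = /(foo)bar\1/.test("foobarfoo");
console.log(backref);

// (?:...) groups without capturing, so only one group is reported.
const nonCapturing = "foobar".match(/(foo(?:bar)?)/);
console.log(nonCapturing.length, nonCapturing[1]);
```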
The reason your code shows three values is something else. First, you do a match. Then, you take the whole match (res[0], which here equals your first capture because the entire pattern is wrapped in one group) and split that on a space. That is what you put in your array of results. Note that you push the entire array in there at once, so you end up with an array of arrays. Hence the double brackets.
Your regex matches (pretending we're in Perl's eXtended legibility mode):
/ # matching starts
( # open 1st capturing group
[a-zA-Z0-9] # match 1 character that's in a-z, A-Z, or 0-9
.* # match as much of any character possible
\s? # optionally match a white space (this will generally never happen, since the .* before it will have gobbled it up)
( # open 2nd capturing group
#classs? # match '#class' or '#classs'
)+ # close 2nd group, matching it once or more
\s+ # match one or more white space characters
[a-zA-Z0-9] # match 1 character that's in a-z, A-Z, or 0-9
[^><]* # match any number of characters that's not an angle bracket
) # close 1st capturing group
/g # modifiers - g: match globally (repeatedly match throughout entire input)
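If the aim is to get the three parts as separate captures instead of splitting afterwards, one possible approach is to give each part its own group. The pattern below is a hypothetical rewrite that assumes the three parts are separated by whitespace and enclosed in angle brackets, as in the sample string; it is not the asker's rule.

```javascript
const markup = "<Home #class www.tarjom.ir><string2 stringValue2>";

// One capture group per part: name, the literal #class marker, and the value.
const threeParts = /<(\w+)\s+(#class)\s+([^<>\s]+)>/g;

// slice(1) drops the full match and keeps only the three captures.
const parts = Array.from(markup.matchAll(threeParts), (m) => m.slice(1));
console.log(parts);
```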
