Javascript regex remove substrings not in larger string - javascript

I have an input string containing a math expression that may contain comma-separated values, which I need to remove if they do not occur within an aggregation function. In those cases, I just want the first value to remain.
Consider the following example strings:
max ( 100,200,30,4 ) GOOD expression, do nothing
min ( 10,23,111 ) GOOD expression, do nothing
min ( 10,20 ) GOOD expression, do nothing
10,2,34 + 4 BAD expression, remove extra comma-number sequences => 10 + 4
So far I have tried surrounding a comma-number pattern (,\d+)+ with negative lookbehind/lookaheads:
str.replaceAll(/(?<!(max|min)\s\(\s\d+)(,\d+)+(?!\s\))/g, '');
However, while this picks up the comma-number sequence outside of functions, it also incorrectly matches in valid situations:
max ( 100,200,30,4 ) GOOD expression
             ^^^ BAD match
min ( 10,23,111 ) GOOD expression
           ^^^ BAD match
min ( 10,20 ) GOOD expression
GOOD (non-match)
10,2,34 + 4 BAD expression
  ^^^^^ GOOD match
In each instance, I understand why it's matching, but I am at a loss as to how to prevent it.
How can I do this?

You could use a capture group to capture what you want to keep, and match what you want to remove.
In the replacement you could check for group 1. If it exists, return the group, else return an empty string so that what is matched is removed.
((?:max|min)\s\(\s*\d+(?:\s*,\s*\d+)*\s*\))|(?:,\d+)+
( Capture group 1
(?:max|min)\s Match either max or min and a whitespace char
\(\s*\d+ Match (, optional whitespace chars and 1+ digits
(?:\s*,\s*\d+)*\s* Optionally repeat matching a comma between optional whitespace chars and 1+ digits, followed by optional whitespace chars
\) Match )
) Close group 1
| Or
(?:,\d+)+ Match 1+ times a comma and 1+ digits (You could also add \s* again for optional whitespace chars before and after the comma)
const regex = /((?:max|min)\s\(\s*\d+(?:\s*,\s*\d+)*\s*\))|(?:,\d+)+/g;
let items = [
"max ( 100,200,30,4 )",
"min ( 10,23,111 )",
"min ( 10,20 )",
"10,2,34 + 4"
].map(s => s.replace(regex, (m, g1) => g1 !== undefined ? g1 : ""));
console.log(items)
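The suggestion in the explanation above, allowing optional whitespace around the commas, might look like the sketch below. The names spacedRegex and keepFirstValue are made up for illustration, and the extra \s* after max|min is an assumption about how lenient the input should be.

```javascript
// Variant of the pattern above that tolerates spaces around the commas.
// Group 1 still captures a whole max(...)/min(...) call that must be kept.
const spacedRegex = /((?:max|min)\s*\(\s*\d+(?:\s*,\s*\d+)*\s*\))|(?:\s*,\s*\d+)+/g;

// Hypothetical helper: keep group 1 when it is set, otherwise drop the match.
const keepFirstValue = s =>
  s.replace(spacedRegex, (m, g1) => (g1 !== undefined ? g1 : ""));

console.log(keepFirstValue("max ( 100, 200 ,30 )")); // left unchanged
console.log(keepFirstValue("10 , 2, 34 + 4"));       // "10 + 4"
```

Note that the whitespace before the first stray comma is removed together with the comma-number run, so the first value stays directly in front of what follows.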

Took me a while to figure out what was going on in The fourth bird's answer. Quite a stroke of genius if you ask me.
For the sake of discussion, I will simplify the regex to the following, to find substrings that are not part of larger strings:
// all bcd's that are not in abcde
const regex = /(abcde)|(?:bcd)/g
If a match is found above (on either side of the pipe), an array is returned containing the full match at index 0, with additional indexes 1..n populated by capture groups in the expression as they occur in the expression from left to right.
By putting a capture group just on one side of the pipe, we know which side the match occurred on by whether indexes 1..n have anything in them.
If the match is made on left side of the pipe, index 1 will contain abcde since the whole side is a capture group.
If the match is made on the right side of the pipe (a non-capture group), nothing is captured and index 1 will be undefined.
We can then use a simple replaceAll(regex, '$1');, where any matches found are replaced by the contents of the first capture group. Matches found on the left side of the pipe get replaced by themselves; those on the right get replaced with nothing.
// all bcd's that are not in abcde
const regex = /(abcde)|(?:bcd)/g
console.log('abcdebcdbcdbcd'.replaceAll(regex, '$1'))
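Applied back to the original question, the same '$1' trick should work with the regex from the accepted answer. This is only a sketch: the sample strings are the ones from the question plus one mixed expression I made up, and it uses replace with the g flag rather than replaceAll so it also runs on older engines.

```javascript
// Regex from the answer above: group 1 is a complete max/min call,
// the right-hand alternative is a stray comma-number run.
const regex = /((?:max|min)\s\(\s*\d+(?:\s*,\s*\d+)*\s*\))|(?:,\d+)+/g;

const samples = [
  "max ( 100,200,30,4 )",
  "10,2,34 + 4",
  "10,2 + max ( 1,2 )" // made-up mixed case
];

// '$1' is empty when the right-hand alternative matched, so those matches vanish.
const cleaned = samples.map(s => s.replace(regex, "$1"));
console.log(cleaned);
// [ 'max ( 100,200,30,4 )', '10 + 4', '10 + max ( 1,2 )' ]
```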

Related

Check for a specific suffix by RegEx and select entire match including suffix

First of all, this is my first question in the community, so please pardon my mistakes, experts! I am learning regex and ran into a scenario where I cannot work out the answer by myself.
Let's say there is a huge paragraph. Can we first match on the basis of a specific suffix (say '%') and only then go back and select the desired text, including the suffix?
e.g. part of the text is "abcd efghMNP 0.40 % ijkl mnopSNP -3.20 % xyz".
Now in this, if you notice - and I got this much - that there is pattern like /([MS]NP[\s\d\.-%]+)/
I want to replace "MNP 0.40 %" or "SNP -3.20 %" with blank. The replacing part seems easy :) But the problem is that, with all my learning, I am not able to select the desired text ONLY IF there is a '%' at the end of the match.
The matching sequence I want is: if the suffix '%' exists, then match the preceding pattern, and if that succeeds, select everything including the suffix and replace it with an empty string.
There are several expressions that would do so, for instance this one with added constraints:
[A-Z]{3}\s+[-+]?[0-9.]+\s*%
Test
const regex = /[A-Z]{3}\s+[-+]?[0-9.]+\s*%/gm;
const str = `abcd efghMNP 0.40 % ijkl mnopSNP -3.20 % xyz
"MNP 0.40 %" or "SNP -3.20 %"`;
const subst = ``;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log(result);
Or a simplified version would be:
[A-Z]{3}(.*?)%
You cannot go back in the matching once you have encountered a suffix %, but what you can do is make it part of the pattern so that it has to be matched.
In JavaScript you could use a zero-length lookahead assertion (?=...) to make sure that what is on the right contains a pattern (in this case a %), but that is no real benefit here, as you want the % to be part of the match.
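To illustrate why the lookahead does not help here, a small sketch: the assertion checks for the % but does not consume it, so the suffix is left behind after the replacement. The number pattern [\d.]+ is a simplification I picked for the example, not something from the question.

```javascript
const text = "abcd efghMNP 0.40 % ijkl";

// The lookahead (?=\s*%) only asserts that a % follows; it is not part of the match.
const withLookahead = /[MS]NP\s*-?[\d.]+(?=\s*%)/g;

const matched = text.match(withLookahead);    // [ 'MNP 0.40' ]
const replaced = text.replace(withLookahead, "");

console.log(matched);
console.log(replaced); // "abcd efgh % ijkl" (the % is still there)
```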
A bit more specific match could be:
[MS]NP\s*-?\d+(?:\.\d+)?\s*%
[MS]NP Match M or S followed by NP
\s*-? Match 0+ times a whitespace char followed by an optional -
\d+(?:\.\d+)? Match 1+ digits followed by an optional part to match a dot and 1+ digits
\s*% Match 0+ whitespace chars followed by matching %
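A quick way to try this more specific pattern on the sample text from the question; the second string is a made-up case without the % suffix, to show that nothing is removed there.

```javascript
const specific = /[MS]NP\s*-?\d+(?:\.\d+)?\s*%/g;

const withSuffix = "abcd efghMNP 0.40 % ijkl mnopSNP -3.20 % xyz";
const withoutSuffix = "abcd efghMNP 0.40 ijkl"; // hypothetical input with no %

const stripped = withSuffix.replace(specific, "");
const untouched = withoutSuffix.replace(specific, "");

console.log(stripped);  // "abcd efgh ijkl mnop xyz"
console.log(untouched); // "abcd efghMNP 0.40 ijkl"
```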

Why isn't this group capturing all items that appear in parentheses?

I'm trying to create a regex that will capture a string not enclosed by parentheses in the first group, followed by any amount of strings enclosed by parentheses.
e.g.
2(3)(4)(5)
Should be: 2 - first group, 3 - second group, and so on.
What I came up with is this regex: (I'm using JavaScript)
([^()]*)(?:\((([^)]*))\))*
However, when I enter a string like A(B)(C)(D), I only get the A and D captured.
https://regex101.com/r/HQC0ib/1
Can anyone help me out on this, and possibly explain where the error is?
Since you cannot use a \G anchor in JS regex (to match consecutive matches), and there is no stack for each capturing group as in the .NET or PyPI regex libraries, you need to use a 2-step approach: 1) match the strings as whole streaks of text, and then 2) post-process to get the values required.
var s = "2(3)(4)(5) A(B)(C)(D)";
var rx = /[^()\s]+(?:\([^)]*\))*/g;
var res = [], m;
while(m=rx.exec(s)) {
res.push(m[0].split(/[()]+/).filter(Boolean));
}
console.log(res);
I added \s to the negated character class [^()] since I added the examples as a single string.
Pattern details
[^()\s]+ - 1 or more chars other than (, ) and whitespace
(?:\([^)]*\))* - 0 or more sequences of:
\( - a (
[^)]* - 0+ chars other than )
\) - a )
The splitting regex is [()]+ that matches 1 or more ) or ( chars, and filter(Boolean) removes empty items.
You cannot have an undetermined number of capture groups. The number of capture groups you get is determined by the regular expression, not by the input it parses. A capture group that occurs within another repetition will indeed only retain the last of those repetitions.
If you know the maximum number of repetitions you can encounter, then just repeat the pattern that many times, and make each of them optional with a ?. For instance, this will capture up to 4 items within parentheses:
([^()]*)(?:\(([^)]*)\))?(?:\(([^)]*)\))?(?:\(([^)]*)\))?(?:\(([^)]*)\))?
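A short sketch of how the fixed-size pattern above behaves. The helper name captureUpToFour is made up; it just drops the optional groups that did not participate in the match.

```javascript
// One leading group plus four optional parenthesised groups, as in the pattern above.
const upToFour = /([^()]*)(?:\(([^)]*)\))?(?:\(([^)]*)\))?(?:\(([^)]*)\))?(?:\(([^)]*)\))?/;

// Hypothetical helper: return only the groups that actually matched.
function captureUpToFour(input) {
  const m = input.match(upToFour);
  return m ? m.slice(1).filter(g => g !== undefined) : [];
}

console.log(captureUpToFour("A(B)(C)(D)"));       // [ 'A', 'B', 'C', 'D' ]
console.log(captureUpToFour("2(3)(4)"));          // [ '2', '3', '4' ]
console.log(captureUpToFour("A(B)(C)(D)(E)(F)")); // [ 'A', 'B', 'C', 'D', 'E' ]
```

The last call shows the limitation: a fifth parenthesised item is simply not captured.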
It's not an error. It's just that in regex, when you repeat a capture group (...)*, only the last occurrence will be put in the backreference.
For example:
On a string "a,b,c,d", if you match /(,[a-z])+/ then the back reference of capture group 1 (\1) will give ",d".
If you want it to return more, then you could surround it in another capture group.
--> With /((?:,[a-z])+)/ then \1 will give ",b,c,d".
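The two cases above can be checked directly in a console. The variable names are only for illustration.

```javascript
const input = "a,b,c,d";

// Repeated capture group: only the last repetition is retained.
const lastOnly = input.match(/(,[a-z])+/);

// Outer capture group around a repeated non-capturing group: the whole run is retained.
const wholeRun = input.match(/((?:,[a-z])+)/);

console.log(lastOnly[0], lastOnly[1]); // ",b,c,d" ",d"
console.log(wholeRun[1]);              // ",b,c,d"
```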
To get those numbers between the parentheses you could also just try to match the word characters.
For example:
var str = "2(3)(14)(B)";
var matches = str.match(/\w+/g);
console.log(matches);

RegEx: Capture Word immediately between certain text and the opened parenthesis and closed parenthesis

I'm not really an expert in regex, especially the harder parts.
I want the strings that sit between parentheses and come after the word "index".
"(NO3) index(doc.id(), doc.description) index (doc.id)"
would return
"[ 'doc.id(), doc.description' , 'doc.id' ]"
what I have done so far
https://jsfiddle.net/asjbcvve/
Parentheses inside the matching string can make this hard. A recursive regex will match that, but not all regex engines implement it. JS, for example, doesn't (PCRE does).
regex with recursion
This doesn't work in JS and many other regex engines
index\s*\((([^\(\)]*(\([^\(\)]*\g<2>\))?)*)
regex without recursion with 1 nested parenthesis
index\s*\((([^\(\)]*(\([^\(\)]*\))?)*)
They both catch what you want in group 1.
Example:
var rx = /index\s*\((([^\(\)]*(\([^\(\)]*\))?)*)/g; //works with 1 nested parentheses
var rx_recursion = /index\s*\((([^\(\)]*(\([^\(\)]*\g<2>\))?)*)/g; //works with any number of nested parentheses, but the JS regex engine doesn't support recursion
var res = [], m;
var s = "(NO3) index(doc.id(s)(), doc.description) index (doc.id) index(nestet.doesnt.work((())))";
while ((m=rx.exec(s)) !== null) {
res.push(m[1]);
}
document.body.innerHTML = "<pre>" + JSON.stringify(res, 0, 4) + "</pre>";
Regex explanation
index\s* - Match literal 'index' followed by any number of whitespace characters
\( - Match literal opening parenthesis character
( - Group 1
( - Group 2
[^\(\)]* - Match anything that is not parentheses
( - Group 3
\( - Match literal opening parenthesis
[^\(\)]* - Match anything that is not parentheses
\g<2> - Recursively match group 2
\) - Match literal closing parenthesis
)? - End group 3, match it zero or one time
)* - End group 2, match it zero or more times
) - End group 1
If you need to match multiple nested parentheses but the engine of your choice doesn't support recursion, just replace \g<2> with the literal text of the whole group 2. Repeat this as many times as the number of nesting levels you expect to appear in the string.
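Writing the nested copies out by hand gets tedious, so the expansion could also be generated. The sketch below is a variation on the idea, not the exact regex from this answer: it uses an unrolled form ([^()]* followed by repeated parenthesised chunks) and, unlike the pattern above, it also requires the closing parenthesis. balancedUpTo and indexArgs are hypothetical helper names.

```javascript
// Build a pattern for text whose parentheses are balanced up to `depth` levels.
function balancedUpTo(depth) {
  let body = "[^()]*"; // depth 0: no parentheses at all
  for (let i = 0; i < depth; i++) {
    // One more level: plain text, then any number of "( inner ) plain text" chunks.
    body = `[^()]*(?:\\((?:${body})\\)[^()]*)*`;
  }
  return body;
}

// Collect the argument text of every index(...) call, allowing `depth` nested levels.
function indexArgs(s, depth) {
  const rx = new RegExp(`index\\s*\\((${balancedUpTo(depth)})\\)`, "g");
  return Array.from(s.matchAll(rx), m => m[1]);
}

const s = "(NO3) index(doc.id(), doc.description) index (doc.id) index(a(b(c)))";
console.log(indexArgs(s, 2));
// [ 'doc.id(), doc.description', 'doc.id', 'a(b(c))' ]
```

With depth 1 the last call, index(a(b(c))), is nested too deeply and is skipped entirely, which is the trade-off of emulating recursion with a fixed number of levels.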

JS Regex: Remove anything (ONLY) after a word

I want to remove all of the symbols (The symbol depends on what I select at the time) after each word, without knowing what the word could be. But leave them in before each word.
A couple of examples:
!!hello! my! !!name!!! is !!bob!! should return...
!!hello my !!name is !!bob ; for !
and
$remove$ the$ targetted$# $$symbol$$# only $after$ a $word$ should return...
$remove the targetted# $$symbol# only $after a $word ; for $
You need to use capture groups and replace:
"!!hello! my! !!name!!! is !!bob!!".replace(/([a-zA-Z]+)(!+)/g, '$1');
Which works for your test string. To work for any generic character or group of characters:
var stripTrailing = trail => {
let regex = new RegExp(`([a-zA-Z0-9]+)(${trail}+)`, 'g');
return str => str.replace(regex, '$1');
};
Note that this fails on any characters that have meaning in a regular expression: []{}+*^$. etc. Escaping those programmatically is left as an exercise for the reader.
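For completeness, one common way to do that escaping is sketched below. escapeRegExp is a helper name I am assuming (it is not a built-in in most engines), and stripTrailingEscaped is just the function above with the escaping added and the unused second capture group made non-capturing.

```javascript
// Escape every character that has a special meaning in a regular expression.
const escapeRegExp = s => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Same idea as stripTrailing above, but safe for symbols such as $ or +.
const stripTrailingEscaped = trail => {
  const regex = new RegExp(`([a-zA-Z0-9]+)(?:${escapeRegExp(trail)})+`, "g");
  return str => str.replace(regex, "$1");
};

console.log(stripTrailingEscaped("$")("$remove$ the$ targetted$# $$symbol$$# only $after$ a $word$"));
// "$remove the targetted# $$symbol# only $after a $word"
console.log(stripTrailingEscaped("!")("!!hello! my! !!name!!! is !!bob!!"));
// "!!hello my !!name is !!bob"
```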
UPDATE
Per your comment I thought an explanation might help you, so:
First, there's no way in this case to replace only part of a match, you have to replace the entire match. So we need to find a pattern that matches, split it into the part we want to keep and the part we don't, and replace the whole match with the part of it we want to keep. So let's break up my regex above into multiple lines to see what's going on:
First we want to match any number of sequential alphanumeric characters, that would be the 'word' to strip the trailing symbol from:
( // denotes capturing group for the 'word'
[ // [] means 'match any character listed inside brackets'
a-z // list of alpha character a-z
A-Z // same as above but capitalized
0-9 // list of digits 0 to 9
]+ // plus means one or more times
)
The capturing group means we want to have access to just that part of the match.
Then we have another group
(
! // I used ES6's string interpolation to insert the arg here
+ // match that exclamation (or whatever) one or more times
)
Then we add the g flag so the replace will happen for every match in the target string; without the flag, it stops after the first match. JavaScript provides a convenient shorthand for accessing the capturing groups in the form of automatically interpolated symbols: the '$1' above means 'insert the contents of the first capture group here in this string'.
So, in the above, if you replaced '$1' with '$1$2' you'd see the same string you started with, if you did 'foo$2' you'd see foo in place of every word trailed by one or more !, etc.
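Those three replacement strings can be compared side by side on a shortened version of the test string (the shortening is mine, just to keep the output readable):

```javascript
const rx = /([a-zA-Z0-9]+)(!+)/g;
const sample = "!!hello! my!";

const keepWord = sample.replace(rx, "$1");   // word only, trailing ! dropped
const keepBoth = sample.replace(rx, "$1$2"); // word plus trailing !, i.e. no change
const fooOnly = sample.replace(rx, "foo$2"); // word swapped for foo, trailing ! kept

console.log(keepWord); // "!!hello my"
console.log(keepBoth); // "!!hello! my!"
console.log(fooOnly);  // "!!foo! foo!"
```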

regular expression capture groups

I'm learning regular expression (currently on Javascript).
My question is that:
I have a straight string of some length.
In this string there are at least (obligatory) three patterns.
And as a result I want to rule.exec() string and get a three-elements array. Each pattern into a separate element.
How should I approach this? I have managed to get there, but with a lot of ups and downs, and I don't know what EXACTLY should be done to capture a group. Is it the parentheses () that separate each group of a regular expression?
My Regular Expression Rule example:
var rule = /([a-zA-Z0-9].*\s?(#classs?)+\s+[a-zA-Z0-9][^><]*)/g;
var str = "<Home #class www.tarjom.ir><string2 stringValue2>";
var res;
var keys = [];
var values = [];
while((res = rule.exec(str)) != null)
{
values.push(res[0]);
}
console.log(values);
// begin to slice them
var sliced = [];
for(var item in values)
{
sliced.push(values[item].split(" "));// converting each item into an array and the assign them to a super array
}
/// Last Updated on 7th of Esfand
console.log(sliced);
And the return result (with firefox 27 - firebug console.log)
[["Home", "#class", "www.tarjom.ir"]]
I have got what I needed, I just need a clarification on the return pattern.
Yes, parentheses capture everything between them. Captured groups are numbered by their opening parenthesis. So if /(foo)((bar)baz)/ matches, your first captured group will contain foo, your second barbaz, and your third bar. In some dialects, only the first 9 capturing groups are numbered.
Captured groups can be used for backreferencing. If you want to match "foobarfoo", /(foo)bar\1/ will do that, where \1 means "the first group I captured".
There are ways to avoid capturing, if you just need the parenthesis for grouping. For instance, if you want to match either "foo" or "foobar", /(foo(bar)?)/ will do so, but may have captured "bar" in its second group. If you want to avoid this, use /(foo(?:bar)?)/ to only have one capture, either "foo" or "foobar".
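The numbering, backreference and non-capturing examples above can be tried out like this (variable names are only for illustration):

```javascript
// Groups are numbered by the position of their opening parenthesis.
const nested = "foobarbaz".match(/(foo)((bar)baz)/);
console.log(nested.slice(1)); // [ 'foo', 'barbaz', 'bar' ]

// \1 must repeat whatever the first group captured.
const backref = /(foo)bar\1/;
console.log(backref.test("foobarfoo")); // true
console.log(backref.test("foobarbaz")); // false

// Capturing vs non-capturing inner group.
const capturing = "foobar".match(/(foo(bar)?)/);
const nonCapturing = "foobar".match(/(foo(?:bar)?)/);
console.log(capturing.slice(1));    // [ 'foobar', 'bar' ]
console.log(nonCapturing.slice(1)); // [ 'foobar' ]
```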
The reason your code shows three values is something else. First, you do a match. Then, you take the full match (res[0], which here is identical to your first capture, since the whole pattern is wrapped in parentheses) and split that on a space. That is what you put in your array of results. Note that you push the entire array in there at once, so you end up with an array of arrays. Hence the double brackets.
Your regex matches (pretending we're in Perl's eXtended legibility mode):
/ # matching starts
( # open 1st capturing group
[a-zA-Z0-9] # match 1 character that's in a-z, A-Z, or 0-9
.* # match as much of any character possible
\s? # optionally match a white space (this will generally never happen, since the .* before it will have gobbled it up)
( # open 2nd capturing group
#classs? # match '#class' or '#classs'
)+ # close 2nd group, matching it once or more
\s+ # match one or more white space characters
[a-zA-Z0-9] # match 1 character that's in a-z, A-Z, or 0-9
[^><]* # match any number of characters that's not an angle bracket
) # close 1st capturing group
/g # modifiers - g: match globally (repeatedly match throughout entire input)
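If the goal is to get the three parts straight out of exec() without the extra split, one capture group per part would do it. The pattern below is my own guess at what the three parts are (a name, the literal #class, and the rest up to the closing bracket), so treat it as a sketch rather than a drop-in replacement.

```javascript
// One capture group per wanted part; the angle brackets themselves are not captured.
const rule = /<(\S+)\s+(#class)\s+([^<>]+)>/;
const str = "<Home #class www.tarjom.ir><string2 stringValue2>";

const res = rule.exec(str);
// res[0] is the full match; res[1]..res[3] are the three capture groups.
const parts = res ? res.slice(1) : [];

console.log(parts); // [ 'Home', '#class', 'www.tarjom.ir' ]
```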
