Regex match multiple same expression multiple times

I have this string: {bgRed Please run a task, {red a list has been provided below}. I need to do a string replace to remove the braces and also the first word after each opening brace.
So in the string above I want to remove {bgRed and {red, and then the trailing brace, which I can do separately.
I have managed to create this regex, but it only matches {bgRed and not {red. Can someone lend a hand?
/^\{.+?(?=\s)/gm

Note that you are using the ^ anchor at the start, which makes your pattern match only at the start of a line (mind the m modifier, too). Also, .+?(?=\s) is too cumbersome: to match one or more chars up to the first whitespace or the end of the string, use {\S+ (or {\S* if you also want to match a { with no non-whitespace chars after it).
You may use
s = s.replace(/{\S*|}/g, '')
You may trim the outcome to get rid of resulting leading/trailing spaces:
s = s.replace(/{\S*|}/g, '').trim()
See the regex demo.
Details
{\S* - a { char followed by 0 or more non-whitespace characters
| - or
} - a } char.
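A quick check against the string from the question may help here. This is only a sketch: the second variant, which also consumes one whitespace character after each tag (\s?), is an assumption added to avoid the double space the first variant leaves behind, and is not part of the answer above.

```javascript
const input = "{bgRed Please run a task, {red a list has been provided below}";

// Pattern from the answer: remove every "{tag" and every "}".
const stripped = input.replace(/\{\S*|\}/g, "").trim();

// Assumed variant: also swallow one whitespace char that follows "{tag".
const strippedTight = input.replace(/\{\S*\s?|\}/g, "").trim();

console.log(JSON.stringify(stripped));      // note the two spaces after the comma
console.log(JSON.stringify(strippedTight)); // single spaces throughout
```

Which variant fits depends on whether the spacing of the output matters for your use case.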

If the goal is to go from
"{bgRed Please run a task, {red a list has been provided below}"
to
"Please run a task, a list has been provided below"
a regex with two capture groups seems simplest:
const original = "{bgRed Please run a task, {red a list has been provided below}";
const rex = /\{\w+ ([^{]+)\{\w+ ([^}]+)}/g;
const result = original.replace(rex, "$1$2");
console.log(result);
\{\w+ ([^{]+)\{\w+ ([^}]+)} is:
\{ - a literal {
\w+ - one or more word characters ("bgRed")
(space) - a literal space
([^{]+) - one or more characters that aren't {, captured to group 1
\{ - another literal {
\w+ - one or more word characters ("red")
(space) - another literal space
([^}]+) - one or more characters that aren't }, captured to group 2
} - a literal }
The replacement uses $1 and $2 to swap in the capture group contents.

Related

Javascript regex space or

I created this javascript regex
(?<=\s|^|\.)[^ ]+\(
Here is my regex fiddle. The lines I am testing against are:
a bcde(
a bc.de(
bc(
See how these strings are matched:
instead of matching on line 2
bc.de(
I wish to get only
.de(

You can use
(?<=[\s.]|^)[^\s.]+\(
See the regex demo. If you do not want to match any whitespace, use a regular space:
(?<=[ .]|^)[^ .]+\(
Details:
(?<=[\s.]|^) - a positive lookbehind that requires a whitespace, start of string or a . to occur immediately to the left of the current location
[^\s.]+ - any one or more chars other than whitespace and a dot
\( - a ( char.
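To see what the lookbehind pattern returns for the three test lines, a small sketch like the following can be run in any engine that supports lookbehind (ES2018 or later). The variable names are illustrative only.

```javascript
// Lookbehind pattern from the answer above.
const pattern = /(?<=[\s.]|^)[^\s.]+\(/;
const lines = ["a bcde(", "a bc.de(", "bc("];

// Take the whole match (index 0) for each line, or undefined if none.
const found = lines.map((line) => line.match(pattern)?.[0]);

console.log(found); // [ 'bcde(', 'de(', 'bc(' ]
```

Because the dot is only asserted by the lookbehind and not consumed, the second line yields de( rather than .de(.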
Note that it would be much better to use a consuming pattern here rather than rely on the lookbehind. You could match everything up to the first dot or, if there is no dot, match the first whitespace or the start of the string, followed by one or more chars other than a space up to a ( char. The point here is that you need to capture the part of the pattern you need to extract:
const regex = /(?:^[^.\r\n]*\.|\s|^)([^ (]+)\(/;
const texts = ["a bcde(", "a bc.de(", "bc("];
for (const text of texts) {
  console.log(text, '=>', text.match(regex)?.[1]);
}

regex if capture group matches string

I need to build a simple script to hyphenate Romanian words. I've seen several and they don't implement the rules correctly.
var words = "arta codru";
Rule: if 2 consonants are between 2 vowels, then they become split between syllables unless they belong in this array in which case both consonants move to the second syllable:
var exceptions_to_regex2 = ["bl","cl","dl","fl","gl","hl","pl","tl","vl","br","cr","dr","fr","gr","hr","pr","tr","vr"];
Expected result: ar-ta co-dru
The code so far:
https://playcode.io/156923?tabs=console&script.js&output
var words = "arta codru";
var exceptions_to_regex2 = ["bl","cl","dl","fl","gl","hl","pl","tl","vl","br","cr","dr","fr","gr","hr","pr","tr","vr"];
var regex2 = /([aeiou])([bcdfghjklmnprstvwxy]{1})(?=[bcdfghjklmnprstvwxy]{1})([aeiou])/gi;
console.log(words.replace(regex2, '$1$2-'));
console.log("desired result: ar-ta co-dru");
Now I would need to do something like this:
if (exceptions_to_regex2.includes($2+$3)){
words.replace(regex2, '$1-');
}
else {
words.replace(regex2, '$1$2-');
}
Obviously it doesn't work because I can't just use the capture groups as I would a regular variable. Please help.

You may code your exceptions as a pattern to check for after a vowel, and stop matching there, or you may still consume any other consonant before another vowel, and replace with the backreference to the whole match with a hyphen right after:
.replace(/[aeiou](?:(?=[bcdfghptv][lr])|[bcdfghj-nprstvwxy](?=[bcdfghj-nprstvwxy][aeiou]))/g, '$&-')
Add i modifier after g if you need case insensitive matching.
See the regex demo.
Details
[aeiou] - a vowel
(?: - start of a non-capturing group:
(?=[bcdfghptv][lr]) - a positive lookahead that requires the exception letter clusters to appear immediately to the right of the current position
| - or
[bcdfghj-nprstvwxy] - a consonant
(?=[bcdfghj-nprstvwxy][aeiou]) - followed with any consonant and a vowel
) - end of the non-capturing group.
The $& in the replacement pattern is the placeholder for the whole match value (at regex101, only $0 can be used at the moment, since the site does not support language-specific replacement patterns).
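If you would rather keep the exceptions in an array, as in the question, one possible approach is to pass a function to replace: the capture groups then arrive as ordinary arguments. The sketch below is an assumption about how that could look, not the answer's method; the names EXCEPTIONS, CONSONANT, clusterRx and hyphenate are made up for illustration.

```javascript
const EXCEPTIONS = ["bl","cl","dl","fl","gl","hl","pl","tl","vl",
                    "br","cr","dr","fr","gr","hr","pr","tr","vr"];
const CONSONANT = "[bcdfghjklmnprstvwxy]";

// vowel, first consonant, then a lookahead that captures the second
// consonant (without consuming it) and requires a vowel after it.
const clusterRx = new RegExp(
  `([aeiou])(${CONSONANT})(?=(${CONSONANT})[aeiou])`, "gi");

function hyphenate(text) {
  return text.replace(clusterRx, (match, vowel, first, second) =>
    EXCEPTIONS.includes((first + second).toLowerCase())
      ? `${vowel}-${first}`  // exception: both consonants start the next syllable
      : `${vowel}${first}-`  // default: split between the two consonants
  );
}

console.log(hyphenate("arta codru")); // ar-ta co-dru
```

The callback form trades a slightly longer pattern for an exception list that can be edited without touching the regex.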

Why isn't this group capturing all items that appear in parentheses?

I'm trying to create a regex that will capture a string not enclosed by parentheses in the first group, followed by any amount of strings enclosed by parentheses.
e.g.
2(3)(4)(5)
Should be: 2 - first group, 3 - second group, and so on.
What I came up with is this regex: (I'm using JavaScript)
([^()]*)(?:\((([^)]*))\))*
However, when I enter a string like A(B)(C)(D), I only get the A and D captured.
https://regex101.com/r/HQC0ib/1
Can anyone help me out on this, and possibly explain where the error is?

Since you cannot use a \G anchor in a JS regex (to match consecutive matches), and there is no stack for each capturing group as in the .NET or PyPI regex libraries, you need a 2-step approach: 1) match the strings as whole streaks of text, and then 2) post-process them to get the required values.
var s = "2(3)(4)(5) A(B)(C)(D)";
var rx = /[^()\s]+(?:\([^)]*\))*/g;
var res = [], m;
while ((m = rx.exec(s))) {
  res.push(m[0].split(/[()]+/).filter(Boolean));
}
console.log(res);
I added \s to the negated character class [^()] since I added the examples as a single string.
Pattern details
[^()\s]+ - 1 or more chars other than (, ) and whitespace
(?:\([^)]*\))* - 0 or more sequences of:
\( - a (
[^)]* - 0+ chars other than )
\) - a )
The splitting regex is [()]+ that matches 1 or more ) or ( chars, and filter(Boolean) removes empty items.

You cannot have an undetermined number of capture groups. The number of capture groups you get is determined by the regular expression, not by the input it parses. A capture group that occurs within another repetition will indeed only retain the last of those repetitions.
If you know the maximum number of repetitions you can encounter, then just repeat the pattern that many times, and make each of them optional with a ?. For instance, this will capture up to 4 items within parentheses:
([^()]*)(?:\(([^)]*)\))?(?:\(([^)]*)\))?(?:\(([^)]*)\))?(?:\(([^)]*)\))?
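As a rough illustration of the fixed-maximum approach, the sketch below applies that pattern to A(B)(C)(D) and drops the groups that did not participate. The variable names are illustrative.

```javascript
// One leading item plus up to four optional parenthesised items.
const upToFour =
  /([^()]*)(?:\(([^)]*)\))?(?:\(([^)]*)\))?(?:\(([^)]*)\))?(?:\(([^)]*)\))?/;

const match = "A(B)(C)(D)".match(upToFour);

// Groups that did not take part in the match are undefined; filter them out.
const items = match.slice(1).filter((group) => group !== undefined);

console.log(items); // [ 'A', 'B', 'C', 'D' ]
```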

It's not an error. It's just that when you repeat a capture group, as in (...)*, only the last occurrence is put in the backreference.
For example:
On a string "a,b,c,d", if you match /(,[a-z])+/ then the back reference of capture group 1 (\1) will give ",d".
If you want it to return more, then you could surround it in another capture group.
--> With /((?:,[a-z])+)/ then \1 will give ",b,c,d".
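The two cases can be checked directly in JavaScript; this small sketch just reads group 1 from each match.

```javascript
const text = "a,b,c,d";

// Repeated capture group: only the last repetition is retained.
const lastOnly = text.match(/(,[a-z])+/)[1];

// Outer capture group around a repeated non-capturing group: the whole run.
const wholeRun = text.match(/((?:,[a-z])+)/)[1];

console.log(lastOnly); // ,d
console.log(wholeRun); // ,b,c,d
```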
To get those numbers between the parentheses you could also just try to match the word characters.
For example:
var str = "2(3)(14)(B)";
var matches = str.match(/\w+/g);
console.log(matches);

JS Regex: Remove anything (ONLY) after a word

I want to remove all of the symbols (The symbol depends on what I select at the time) after each word, without knowing what the word could be. But leave them in before each word.
A couple of examples:
!!hello! my! !!name!!! is !!bob!! should return...
!!hello my !!name is !!bob ; for !
and
$remove$ the$ targetted$# $$symbol$$# only $after$ a $word$ should return...
$remove the targetted# $$symbol# only $after a $word ; for $

You need to use capture groups and replace:
"!!hello! my! !!name!!! is !!bob!!".replace(/([a-zA-Z]+)(!+)/g, '$1');
Which works for your test string. To work for any generic character or group of characters:
var stripTrailing = trail => {
  let regex = new RegExp(`([a-zA-Z0-9]+)(${trail}+)`, 'g');
  return str => str.replace(regex, '$1');
};
Note that this fails on any characters that have meaning in a regular expression: []{}+*^$. etc. Escaping those programmatically is left as an exercise for the reader.
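For readers who want a starting point for that exercise, here is one possible sketch. The helper names escapeRegExp and stripTrailingSafe are hypothetical, and wrapping the escaped symbol in a non-capturing group (so that a multi-character symbol repeats as a whole) is a design choice of this sketch rather than something the answer specifies.

```javascript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a function that strips `trail` when it follows an alphanumeric word.
function stripTrailingSafe(trail) {
  const regex = new RegExp(`([a-zA-Z0-9]+)(?:${escapeRegExp(trail)})+`, "g");
  return (str) => str.replace(regex, "$1");
}

console.log(stripTrailingSafe("!")("!!hello! my! !!name!!! is !!bob!!"));
console.log(stripTrailingSafe("$")(
  "$remove$ the$ targetted$# $$symbol$$# only $after$ a $word$"));
```

With the escaping in place, the $ example from the question works as well as the ! one.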
UPDATE
Per your comment I thought an explanation might help you, so:
First, there's no way in this case to replace only part of a match, you have to replace the entire match. So we need to find a pattern that matches, split it into the part we want to keep and the part we don't, and replace the whole match with the part of it we want to keep. So let's break up my regex above into multiple lines to see what's going on:
First we want to match any number of sequential alphanumeric characters, that would be the 'word' to strip the trailing symbol from:
( // denotes capturing group for the 'word'
[ // [] means 'match any character listed inside brackets'
a-z // list of alpha character a-z
A-Z // same as above but capitalized
0-9 // list of digits 0 to 9
]+ // plus means one or more times
)
The capturing group means we want to have access to just that part of the match.
Then we have another group
(
! // I used ES6's string interpolation to insert the arg here
+ // match that exclamation (or whatever) one or more times
)
Then we add the g flag so the replace happens for every match in the target string; without the flag, it stops after the first match. JavaScript provides a convenient shorthand for accessing the capturing groups in the form of automatically interpolated symbols: the '$1' above means 'insert the contents of the first capture group here in this string'.
So, in the above, if you replaced '$1' with '$1$2' you'd see the same string you started with, if you did 'foo$2' you'd see foo in place of every word trailed by one or more !, etc.

Finding duplicates with regular expressions, how does this actually work? [duplicate]

I'm a regular expression newbie and I can't quite figure out how to write a single regular expression that would "match" any duplicate consecutive words such as:
Paris in the the spring.
Not that that is related.
Why are you laughing? Are my my regular expressions THAT bad??
Is there a single regular expression that will match ALL of the repeated words above (the the, that that, my my)?

Try this regular expression:
\b(\w+)\s+\1\b
Here \b is a word boundary and \1 references the captured match of the first group.
Regex101 example here
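As a quick sanity check, the pattern can be run over the three sentences from the question. The sketch below uses JavaScript with the g flag and matchAll; the variable names are illustrative.

```javascript
const duplicateWord = /\b(\w+)\s+\1\b/g;

const sentences = [
  "Paris in the the spring.",
  "Not that that is related.",
  "Why are you laughing? Are my my regular expressions THAT bad??",
];

// Collect the full text of every match across all sentences.
const hits = sentences.flatMap((sentence) =>
  [...sentence.matchAll(duplicateWord)].map((m) => m[0]));

console.log(hits); // [ 'the the', 'that that', 'my my' ]
```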

I believe this regex handles more situations:
/(\b\S+\b)\s+\b\1\b/
A good selection of test strings can be found here: http://callumacrae.github.com/regex-tuesday/challenge1.html

The below expression should work correctly to find any number of duplicated words. The matching can be case insensitive.
String regex = "\\b(\\w+)(\\s+\\1\\b)+";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
    input = input.replaceAll(m.group(0), m.group(1));
}
Sample Input : Goodbye goodbye GooDbYe
Sample Output : Goodbye
Explanation:
The regex expression:
\b : Start of a word boundary
(\w+) : One or more word characters, captured as group 1
(\s+\1\b)+ : One or more whitespace characters followed by a word that matches the previous word and ends at a word boundary. Wrapping the whole thing in + finds one or more repetitions.
Grouping :
m.group(0) : Shall contain the matched group in above case Goodbye goodbye GooDbYe
m.group(1) : Shall contain the first word of the matched pattern in above case Goodbye
Replace method shall replace all consecutive matched words with the first instance of the word.
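For comparison, the same idea can be sketched in JavaScript, where a single replace call with the g and i flags does the collapsing without a find loop. The function name collapseRepeats is hypothetical, and the repeated part is written as a non-capturing group here because only group 1 is needed.

```javascript
// A word, then one or more whitespace-separated repeats of the same word.
const repeatedWord = /\b(\w+)(?:\s+\1\b)+/gi;

// Replace each run of repeats with its first word.
function collapseRepeats(text) {
  return text.replace(repeatedWord, "$1");
}

console.log(collapseRepeats("Goodbye goodbye GooDbYe"));   // Goodbye
console.log(collapseRepeats("Paris in the the spring."));  // Paris in the spring.
```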

Try the regex below. Its parts:
\b - word boundary at the start of a word
(\w+) - one or more word characters, captured as group 1
\W+ - one or more non-word characters between the repeats
\1 - the same word that group 1 already matched
\b - word boundary at the end of the word
(...)* - the repeat may occur any number of times
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class DuplicateWords { // class name is illustrative
    public static void main(String[] args) {
        String regex = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        Scanner in = new Scanner(System.in);
        int numSentences = Integer.parseInt(in.nextLine());
        while (numSentences-- > 0) {
            String input = in.nextLine();
            Matcher m = p.matcher(input);
            // Check for subsequences of input that match the compiled pattern
            while (m.find()) {
                input = input.replaceAll(m.group(0), m.group(1));
            }
            // Prints the modified sentence.
            System.out.println(input);
        }
        in.close();
    }
}

Regex to Strip 2+ duplicate words (consecutive/non-consecutive words)
Try this regex that can catch 2 or more duplicate words and only leave behind one single word. And the duplicate words need not even be consecutive.
/\b(\w+)\b(?=.*?\b\1\b)/ig
Here, \b is used for Word Boundary, ?= is used for positive lookahead, and \1 is used for back-referencing.
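A small sketch of how this lookahead pattern behaves in a replace may be useful. Note two things it makes visible: the word that survives is the last occurrence, and the removed words leave their surrounding spaces behind, so the whitespace clean-up step below is an addition of this sketch. The function name keepLastOccurrences is hypothetical.

```javascript
// Matches a word only if the same word appears again later on the line.
const earlierDuplicate = /\b(\w+)\b(?=.*?\b\1\b)/gi;

function keepLastOccurrences(text) {
  return text
    .replace(earlierDuplicate, "") // drop every occurrence except the last
    .replace(/\s+/g, " ")          // collapse the gaps left behind
    .trim();
}

console.log(keepLastOccurrences("one two one three two")); // one three two
```

Since . does not match line breaks by default, the lookahead only sees duplicates on the same line.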

The widely-used PCRE library can handle such situations (you won't achieve the same with POSIX-compliant regex engines, though):
(\b\w+\b)\W+\1

Here is one that catches multiple words multiple times:
(\b\w+\b)(\s+\1)+

No. That is an irregular grammar. There may be engine-/language-specific regular expressions that you can use, but there is no universal regular expression that can do that.

This is the regex I use to remove duplicate phrases in my twitch bot:
(\S+\s*)\1{2,}
(\S+\s*) looks for any run of non-whitespace characters, followed by optional whitespace.
\1{2,} then requires at least 2 more consecutive copies of that phrase. So if there are 3 or more identical phrases in a row, it matches.

Since some developers are coming to this page in search of a solution which not only eliminates duplicate consecutive non-whitespace substrings, but triplicates and beyond, I'll show the adapted pattern.
Pattern: /(\b\S+)(?:\s+\1\b)+/ (Pattern Demo)
Replace: $1 (replaces the fullstring match with capture group #1)
This pattern greedily matches a "whole" non-whitespace substring, then requires one or more copies of the matched substring which may be delimited by one or more whitespace characters (space, tab, newline, etc).
Specifically:
\b (word boundary) characters are vital to ensure partial words are not matched.
The second parenthetical is a non-capturing group, because this variable width substring does not need to be captured -- only matched/absorbed.
the + (one or more quantifier) on the non-capturing group is more appropriate than * because * will "bother" the regex engine to capture and replace singleton occurrences -- this is wasteful pattern design.
*note if you are dealing with sentences or input strings with punctuation, then the pattern will need to be further refined.

The example in Javascript: The Good Parts can be adapted to do this:
var doubled_words = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1(?:\s|$)/gi;
\b uses \w for word boundaries, where \w is equivalent to [0-9A-Z_a-z]. If you don't mind that limitation, the accepted answer is fine.
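If that ASCII limitation does matter, one option in engines with Unicode property escapes and lookbehind (ES2018 or later) is to spell out the word class and its boundaries yourself. The sketch below is an assumption about how that could be done, not the book's pattern; the names are illustrative.

```javascript
// Letters, digits and underscore from any script stand in for \w,
// and the lookarounds stand in for \b.
const unicodeDuplicate =
  /(?<![\p{L}\p{N}_])([\p{L}\p{N}_]+)\s+\1(?![\p{L}\p{N}_])/iu;

// ASCII-only pattern for comparison.
const asciiDuplicate = /\b(\w+)\s+\1\b/i;

const sample = "une \u00e9cole \u00e9cole ici"; // "une école école ici"

const unicodeHit = sample.match(unicodeDuplicate)?.[0];
const asciiHit = sample.match(asciiDuplicate);

console.log(unicodeHit); // école école
console.log(asciiHit);   // null, because é is not matched by \w
```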

This expression (inspired from Mike, above) seems to catch all duplicates, triplicates, etc, including the ones at the end of the string, which most of the others don't:
s.replace(/(^|\s+)(\S+)(($|\s+)\2)+/g, "$1$2")
I know the question asked to match duplicates only, but a triplicate is just 2 duplicates next to each other :)
First, I put (^|\s+) to make sure it starts with a full word, otherwise "child's steak" would go to "child'steak" (the "s"'s would match). Then it matches a full word ((\S+), or (\b\S+\b) in the test below), followed by an end of string ($) or a number of spaces (\s+) and the same word again, the whole repeated one or more times.
I tried it like this and it worked well:
var s = "here here here here is ahi-ahi ahi-ahi ahi-ahi joe's joe's joe's joe's joe's the result result result";
console.log(s.replace(/(\b\S+\b)(($|\s+)\1)+/g, "$1"));
--> here is ahi-ahi joe's the result

Try this regular expression it fits for all repeated words cases:
\b(\w+)\s+\1(?:\s+\1)*\b

I think another solution would be to use named capture groups and backreferences like this:
/.* (?<mytoken>\w+)\s+\k<mytoken> .*/
OR
/.*(?<mytoken>\w{3,}).+\k<mytoken>.*/
Kotlin:
val regex = Regex(""".* (?<myToken>\w+)\s+\k<myToken> .*""")
val input = "This is a test test data"
val result = regex.find(input)
println(result!!.groups["myToken"]!!.value)
Java:
var pattern = Pattern.compile(".* (?<myToken>\\w+)\\s+\\k<myToken> .*");
var matcher = pattern.matcher("This is a test test data");
var isFound = matcher.find();
var result = matcher.group("myToken");
System.out.println(result);
JavaScript:
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/;
const input = "This is a test test data";
const result = regex.exec(input);
console.log(result.groups.myToken);
// OR
const regexAll = /.* (?<myToken>\w+)\s+\k<myToken> .*/g;
const matches = [...input.matchAll(regexAll)];
console.log(matches[0].groups.myToken);
All the above detect the test as the duplicate word.
Tested with Kotlin 1.7.0-Beta, Java 11, Chrome and Firefox 100.

You can use this pattern:
\b(\w+)(?:\W+\1\b)+
This pattern can be used to match all duplicated word groups in sentences. :)
Here is a sample util function written in Java 17, which replaces all duplications with the first occurrence:
public String removeDuplicates(String input) {
    var regex = "\\b(\\w+)(?:\\W+\\1\\b)+";
    var pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    var matcher = pattern.matcher(input);
    while (matcher.find()) {
        input = input.replaceAll(matcher.group(), matcher.group(1));
    }
    return input;
}

As far as I can see, none of these would match:
London in the
the winter (with the winter on a new line )
Although matching duplicates on the same line is fairly straightforward, I haven't been able to come up with a solution for the situation in which they stretch over two lines (with Perl).

To find duplicate words that have no leading or trailing non whitespace character(s) other than a word character(s), you can use whitespace boundaries on the left and on the right making use of lookarounds.
The pattern will have a match in:
Paris in the the spring.
Not that that is related.
The pattern will not have a match in:
This is $word word
(?<!\S)(\w+)\s+\1(?!\S)
Explanation
(?<!\S) Negative lookbehind, assert not a non whitespace char to the left of the current location
(\w+) Capture group 1, match 1 or more word characters
\s+ Match 1 or more whitespace characters (note that this can also match a newline)
\1 Backreference to match the same as in group 1
(?!\S) Negative lookahead, assert not a non whitespace char to the right of the current location
See a regex101 demo.
To find 2 or more duplicate words:
(?<!\S)(\w+)(?:\s+\1)+(?!\S)
This part of the pattern (?:\s+\1)+ uses a non capture group to repeat 1 or more times matching 1 or more whitespace characters followed by the backreference to match the same as in group 1.
See a regex101 demo.
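To make the difference from plain \b boundaries concrete, the following sketch compares the two styles on the examples above, plus a duplicate split across a line break. It assumes an engine with lookbehind support; the helper find and the variable names are illustrative.

```javascript
const wsBounded = /(?<!\S)(\w+)(?:\s+\1)+(?!\S)/g;   // whitespace boundaries
const wordBounded = /\b(\w+)(?:\s+\1\b)+/g;          // word boundaries

// Return the full text of every match of rx in text.
const find = (rx, text) => [...text.matchAll(rx)].map((m) => m[0]);

const plain = find(wsBounded, "Paris in the the spring.");
const dollarWs = find(wsBounded, "This is $word word");
const dollarWord = find(wordBounded, "This is $word word");
const acrossLines = find(wsBounded, "London in the\nthe winter");

console.log(plain);       // [ 'the the' ]
console.log(dollarWs);    // []
console.log(dollarWord);  // [ 'word word' ]
console.log(acrossLines); // one match spanning the line break
```

The last case works because \s+ also matches a newline, as noted in the explanation.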
Alternatives without using lookarounds
You could also make use of a leading and trailing alternation matching either a whitespace char or assert the start/end of the string.
Then use a capture group 1 for the value that you want to get, and use a second capture group with a backreference \2 to match the repeated word.
Matching 2 duplicate words:
(?:\s|^)((\w+)\s+\2)(?:\s|$)
See a regex101 demo.
Matching 2 or more duplicate words:
(?:\s|^)((\w+)(?:\s+\2)+)(?:\s|$)
See a regex101 demo.

Use this in case you want case-insensitive checking for duplicate words.
(?i)\\b(\\w+)\\s+\\1\\b
