regex if capture group matches string - javascript

I need to build a simple script to hyphenate Romanian words. I've seen several and they don't implement the rules correctly.
var words = "arta codru";
Rule: if 2 consonants are between 2 vowels, then they become split between syllables unless they belong in this array in which case both consonants move to the second syllable:
var exceptions_to_regex2 = ["bl","cl","dl","fl","gl","hl","pl","tl","vl","br","cr","dr","fr","gr","hr","pr","tr","vr"];
Expected result: ar-ta co-dru
The code so far:
https://playcode.io/156923?tabs=console&script.js&output
var words = "arta codru";
var exceptions_to_regex2 = ["bl","cl","dl","fl","gl","hl","pl","tl","vl","br","cr","dr","fr","gr","hr","pr","tr","vr"];
var regex2 = /([aeiou])([bcdfghjklmnprstvwxy]{1})(?=[bcdfghjklmnprstvwxy]{1})([aeiou])/gi;
console.log(words.replace(regex2, '$1$2-'));
console.log("desired result: ar-ta co-dru");
Now I would need to do something like this:
if (exceptions_to_regex2.includes($2+$3)){
words.replace(regex2, '$1-');
}
else {
words.replace(regex2, '$1$2-');
}
Obviously it doesn't work because I can't just use the capture groups as I would a regular variable. Please help.

You may code your exceptions as a pattern to check for after a vowel, and stop matching there, or you may still consume any other consonant before another vowel, and replace with the backreference to the whole match with a hyphen right after:
.replace(/[aeiou](?:(?=[bcdfghptv][lr])|[bcdfghj-nprstvwxy](?=[bcdfghj-nprstvwxy][aeiou]))/g, '$&-')
Add i modifier after g if you need case insensitive matching.
See the regex demo.
Details
[aeiou] - a vowel
(?: - start of a non-capturing group:
(?=[bcdfghptv][lr]) - a positive lookahead that requires the exception letter clusters to appear immediately to the right of the current position
| - or
[bcdfghj-nprstvwxy] - a consonant
(?=[bcdfghj-nprstvwxy][aeiou]) - followed with any consonant and a vowel
) - end of the non-capturing group.
The $& in the replacement pattern is the placeholder for the whole match value (at regex101, $0 can only be used at this moment, since the Web site does not support language specific only replacement patterns).

Related

Regex match multiple same expression multiple times

I have got this string {bgRed Please run a task, {red a list has been provided below}, I need to do a string replace to remove the braces and also the first word.
So below I would want to remove {bgRed and {red and then the trailing brace which I can do separate.
I have managed to create this regex, but it is only matching {bgRed and not {red, can someone lend a hand?
/^\{.+?(?=\s)/gm
Note you are using ^ anchor at the start and that makes your pattern only match at the start of a line (mind also the m modifier). .+?(?=\s|$) is too cumbersome, you want to match any 1+ chars up to the first whitespace or end of string, use {\S+ (or {\S* if you plan to match { without any non-whitespace chars after it).
You may use
s = s.replace(/{\S*|}/g, '')
You may trim the outcome to get rid of resulting leading/trailing spaces:
s = s.replace(/{\S*|}/g, '').trim()
See the regex demo and the regex graph:
Details
{\S* - { char followed with 0 or more non-whitespace characters
| - or
} - a } char.
If the goal is go to from
"{bgRed Please run a task, {red a list has been provided below}"
to
"Please run a task, a list has been provided below"
a regex with two capture groups seems simplest:
const original = "{bgRed Please run a task, {red a list has been provided below}";
const rex = /\{\w+ ([^{]+)\{\w+ ([^}]+)}/g;
const result = original.replace(rex, "$1$2");
console.log(result);
\{\w+ ([^{]+)\{\w+ ([^}]+)} is:
\{ - a literal {
\w+ - one or more word characters ("bgRed")
a literal space
([^{]+) one or more characters that aren't {, captured to group 1
\{ - another literal {
\w+ - one or more word characters ("red")
([^}]+) - one or more characters that aren't }, captured to group 2
} - a literal }
The replacement uses $1 and $2 to swap in the capture group contents.

JavaScript Regex start of string clarification + str.replace()

got a question about the start of string regex anchor tag ^.
I was trying to sanitize a string to check if it's a palindrome and found a solution to use regex but couldn't wrap my head around the explanations I found for the start of string anchor tag:
To my understanding:
^ denotes that whatever expression that follows must match, starting from the beginning of the string.
Question:
Why then is there a difference between the two output below:
1)
let x = 'A man, a plan, a canal: Panama';
const re = new RegExp(/[^a-z]/, 'gi');
console.log(x.replace(re, '*'));
Output: A*man**a*plan**a*canal**Panama
VS.
2)
let x = 'A man, a plan, a canal: Panama';
const re = new RegExp(/[a-z]/, 'gi');
console.log(x.replace(re, '*'));
Output: * ***, * ****, * *****: ******
VS.
3)
let x = 'A man, a plan, a canal: Panama';
const re = new RegExp(/^[a-z]/, 'gi');
console.log(x.replace(re, '*'));
Output: * man, a plan, a canal: Panama
Please let me know if my explanation for each of the case above is off:
1) Confused about this one. If it matches a character class of [a-z] case insensitive + global find, with start of string anchor ^ denoting that it must match at the start of each string, should it not return all the words in the sentence? Since each word is a match of [a-z] insensitive characters that occurs at the start of each string per global find iteration?
(i.e.
finds "A" at the start
then on the next iteration, it should start search on the remaining string " man"
finds a space...and moves on to search "man"?
and so on and so forth...
Q: Why does it then when I call replace does it only targets the non alpha stuff? Should I in this case be treating ^ as inverting [a-z]?
2) This seems pretty straight forward, finds all occurrence of [a-z]and replaces them with the start. Inverse case of 1)??
3) Also confused about this one. I'm not sure how this is different from 1).
/^[a-z]/gi to me means: "starting at the start of the string being looked at, match all alpha characters, case insensitive. Repeat for global find".
Compared to:
1) /[^a-z]/gi to me means: "match all character class that starts each line with alpha character. case insensitive, repeat search for global find."
To mean they mean exactly the same #_#. Please let me know how my understanding is off for the above cases.
Your first expression [^a-z] matches anything other than an alphabetic, lower case letter, therefore that's why when you replace with * all the special characters such as whitespace, commas and colons are replaced.
Your second expression [a-z] matches any alphabetic, lower case letter, therefore the special characters mentioned are not replaced by *.
Your third expression ^[a-z] matches a alphabetic, lower case letter at the start of the string, therefore only the first letter is replaced by *.
For the first two expressions, the global flag g ensures that all characters that match the specified pattern, regardless of their position in the string, are replaced. For the third pattern however, since ^ anchors the pattern at the beginning of the string, only the first letter is replaced.
As you mentioned, the i flag ensures case insensitivity, so that all three patterns operate on both lower and upper case alphabetic letters, from a to z and A to Z.
The character ^ therefore has two meanings:
It negates characters in a character set.
It asserts position at the start of string.
^ denotes that whatever expression that follows must match, starting from the beginning of the string.
That's only when it's the first thing in the regex; it has other purposes when used elsewhere:
/[^a-z]/gi
In the above regex, the ^ does not indicate anchoring the match to the beginning of a string; it inverts the rest of the contents of the [] -- so the above regex will match any single character except a-z. Since you're using the g flag it will repeat that match for all characters in the string.
/[a-z]/gi
The above is not inverted, so will match a single instance of any character from a-z (and again because of the g flag will repeat to match all of those instances.)
/^[a-z]/gi
In this last example, the caret anchors the match to the beginning of the string; the bracketed portion will match any single a-z character. The g flag is still in use, so the regex would try to continue matching more characters later in the string -- but none of them except the first one will will meet the anchored-to-start requirement, so this will end up matching only the first character (if it's within a-z), exactly as if the g flag was not in use.
(When used anywhere in a regex other than the start of the regex or the start of a [] group, the ^ will be treated as a literal ^.)
If you're trying to detect palindromes, you'll want to remove everything except letter characters (and will probably want to convert everything to the same letter case, instead of having to detect that "P" == "p":)
const isPalindrome = function(input) {
let str = input.toLowerCase().replace(/[^a-z]/g,'');
return str === str.split('').reverse().join('')
}
console.log(isPalindrome("Able was I, ere I saw Elba!"))
console.log(isPalindrome("No, it never propagates if I set a ”gap“ or prevention."))
console.log(isPalindrome("Are we not pure? “No, sir!” Panama’s moody Noriega brags. “It is garbage!” Irony dooms a man –– a prisoner up to new era."))
console.log(isPalindrome("Taco dog is not a palindrome."))

Why isn't this group capturing all items that appear in parentheses?

I'm trying to create a regex that will capture a string not enclosed by parentheses in the first group, followed by any amount of strings enclosed by parentheses.
e.g.
2(3)(4)(5)
Should be: 2 - first group, 3 - second group, and so on.
What I came up with is this regex: (I'm using JavaScript)
([^()]*)(?:\((([^)]*))\))*
However, when I enter a string like A(B)(C)(D), I only get the A and D captured.
https://regex101.com/r/HQC0ib/1
Can anyone help me out on this, and possibly explain where the error is?
Since you cannot use a \G anchor in JS regex (to match consecutive matches), and there is no stack for each capturing group as in a .NET / PyPi regex libraries, you need to use a 2 step approach: 1) match the strings as whole streaks of text, and then 2) post-process to get the values required.
var s = "2(3)(4)(5) A(B)(C)(D)";
var rx = /[^()\s]+(?:\([^)]*\))*/g;
var res = [], m;
while(m=rx.exec(s)) {
res.push(m[0].split(/[()]+/).filter(Boolean));
}
console.log(res);
I added \s to the negated character class [^()] since I added the examples as a single string.
Pattern details
[^()\s]+ - 1 or more chars other than (, ) and whitespace
(?:\([^)]*\))* - 0 or more sequences of:
\( - a (
[^)]* - 0+ chars other than )
\) - a )
The splitting regex is [()]+ that matches 1 or more ) or ( chars, and filter(Boolean) removes empty items.
You cannot have an undetermined number of capture groups. The number of capture groups you get is determined by the regular expression, not by the input it parses. A capture group that occurs within another repetition will indeed only retain the last of those repetitions.
If you know the maximum number of repetitions you can encounter, then just repeat the pattern that many times, and make each of them optional with a ?. For instance, this will capture up to 4 items within parentheses:
([^()]*)(?:\(([^)]*)\))?(?:\(([^)]*)\))?(?:\(([^)]*)\))?(?:\(([^)]*)\))?
It's not an error. It's just that in regex when you repeat a capture group (...)* that only the last occurence will be put in the backreference.
For example:
On a string "a,b,c,d", if you match /(,[a-z])+/ then the back reference of capture group 1 (\1) will give ",d".
If you want it to return more, then you could surround it in another capture group.
--> With /((?:,[a-z])+)/ then \1 will give ",b,c,d".
To get those numbers between the parentheses you could also just try to match the word characters.
For example:
var str = "2(3)(14)(B)";
var matches = str.match(/\w+/g);
console.log(matches);

JS Regex: Remove anything (ONLY) after a word

I want to remove all of the symbols (The symbol depends on what I select at the time) after each word, without knowing what the word could be. But leave them in before each word.
A couple of examples:
!!hello! my! !!name!!! is !!bob!! should return...
!!hello my !!name is !!bob ; for !
and
$remove$ the$ targetted$# $$symbol$$# only $after$ a $word$ should return...
$remove the targetted# $$symbol# only $after a $word ; for $
You need to use capture groups and replace:
"!!hello! my! !!name!!! is !!bob!!".replace(/([a-zA-Z]+)(!+)/g, '$1');
Which works for your test string. To work for any generic character or group of characters:
var stripTrailing = trail => {
let regex = new RegExp(`([a-zA-Z0-9]+)(${trail}+)`, 'g');
return str => str.replace(regex, '$1');
};
Note that this fails on any characters that have meaning in a regular expression: []{}+*^$. etc. Escaping those programmatically is left as an exercise for the reader.
UPDATE
Per your comment I thought an explanation might help you, so:
First, there's no way in this case to replace only part of a match, you have to replace the entire match. So we need to find a pattern that matches, split it into the part we want to keep and the part we don't, and replace the whole match with the part of it we want to keep. So let's break up my regex above into multiple lines to see what's going on:
First we want to match any number of sequential alphanumeric characters, that would be the 'word' to strip the trailing symbol from:
( // denotes capturing group for the 'word'
[ // [] means 'match any character listed inside brackets'
a-z // list of alpha character a-z
A-Z // same as above but capitalized
0-9 // list of digits 0 to 9
]+ // plus means one or more times
)
The capturing group means we want to have access to just that part of the match.
Then we have another group
(
! // I used ES6's string interpolation to insert the arg here
+ // match that exclamation (or whatever) one or more times
)
Then we add the g flag so the replace will happen for every match in the target string, without the flag it returns after the first match. JavaScript provides a convenient shorthand for accessing the capturing groups in the form of automatically interpolated symbols, the '$1' above means 'insert contents of the first capture group here in this string'.
So, in the above, if you replaced '$1' with '$1$2' you'd see the same string you started with, if you did 'foo$2' you'd see foo in place of every word trailed by one or more !, etc.

Finding duplicates with regular expressions, how does this actually work? [duplicate]

I'm a regular expression newbie and I can't quite figure out how to write a single regular expression that would "match" any duplicate consecutive words such as:
Paris in the the spring.
Not that that is related.
Why are you laughing? Are my my regular expressions THAT bad??
Is there a single regular expression that will match ALL of the bold strings above?
Try this regular expression:
\b(\w+)\s+\1\b
Here \b is a word boundary and \1 references the captured match of the first group.
Regex101 example here
I believe this regex handles more situations:
/(\b\S+\b)\s+\b\1\b/
A good selection of test strings can be found here: http://callumacrae.github.com/regex-tuesday/challenge1.html
The below expression should work correctly to find any number of duplicated words. The matching can be case insensitive.
String regex = "\\b(\\w+)(\\s+\\1\\b)+";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
input = input.replaceAll(m.group(0), m.group(1));
}
Sample Input : Goodbye goodbye GooDbYe
Sample Output : Goodbye
Explanation:
The regex expression:
\b : Start of a word boundary
\w+ : Any number of word characters
(\s+\1\b)* : Any number of space followed by word which matches the previous word and ends the word boundary. Whole thing wrapped in * helps to find more than one repetitions.
Grouping :
m.group(0) : Shall contain the matched group in above case Goodbye goodbye GooDbYe
m.group(1) : Shall contain the first word of the matched pattern in above case Goodbye
Replace method shall replace all consecutive matched words with the first instance of the word.
Try this with below RE
\b start of word word boundary
\W+ any word character
\1 same word matched already
\b end of word
()* Repeating again
public static void main(String[] args) {
String regex = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";// "/* Write a RegEx matching repeated words here. */";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE/* Insert the correct Pattern flag here.*/);
Scanner in = new Scanner(System.in);
int numSentences = Integer.parseInt(in.nextLine());
while (numSentences-- > 0) {
String input = in.nextLine();
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
input = input.replaceAll(m.group(0),m.group(1));
}
// Prints the modified sentence.
System.out.println(input);
}
in.close();
}
Regex to Strip 2+ duplicate words (consecutive/non-consecutive words)
Try this regex that can catch 2 or more duplicate words and only leave behind one single word. And the duplicate words need not even be consecutive.
/\b(\w+)\b(?=.*?\b\1\b)/ig
Here, \b is used for Word Boundary, ?= is used for positive lookahead, and \1 is used for back-referencing.
Example
Source
The widely-used PCRE library can handle such situations (you won't achieve the the same with POSIX-compliant regex engines, though):
(\b\w+\b)\W+\1
Here is one that catches multiple words multiple times:
(\b\w+\b)(\s+\1)+
No. That is an irregular grammar. There may be engine-/language-specific regular expressions that you can use, but there is no universal regular expression that can do that.
This is the regex I use to remove duplicate phrases in my twitch bot:
(\S+\s*)\1{2,}
(\S+\s*) looks for any string of characters that isn't whitespace, followed whitespace.
\1{2,} then looks for more than 2 instances of that phrase in the string to match. If there are 3 phrases that are identical, it matches.
Since some developers are coming to this page in search of a solution which not only eliminates duplicate consecutive non-whitespace substrings, but triplicates and beyond, I'll show the adapted pattern.
Pattern: /(\b\S+)(?:\s+\1\b)+/ (Pattern Demo)
Replace: $1 (replaces the fullstring match with capture group #1)
This pattern greedily matches a "whole" non-whitespace substring, then requires one or more copies of the matched substring which may be delimited by one or more whitespace characters (space, tab, newline, etc).
Specifically:
\b (word boundary) characters are vital to ensure partial words are not matched.
The second parenthetical is a non-capturing group, because this variable width substring does not need to be captured -- only matched/absorbed.
the + (one or more quantifier) on the non-capturing group is more appropriate than * because * will "bother" the regex engine to capture and replace singleton occurrences -- this is wasteful pattern design.
*note if you are dealing with sentences or input strings with punctuation, then the pattern will need to be further refined.
The example in Javascript: The Good Parts can be adapted to do this:
var doubled_words = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1(?:\s|$)/gi;
\b uses \w for word boundaries, where \w is equivalent to [0-9A-Z_a-z]. If you don't mind that limitation, the accepted answer is fine.
This expression (inspired from Mike, above) seems to catch all duplicates, triplicates, etc, including the ones at the end of the string, which most of the others don't:
/(^|\s+)(\S+)(($|\s+)\2)+/g, "$1$2")
I know the question asked to match duplicates only, but a triplicate is just 2 duplicates next to each other :)
First, I put (^|\s+) to make sure it starts with a full word, otherwise "child's steak" would go to "child'steak" (the "s"'s would match). Then, it matches all full words ((\b\S+\b)), followed by an end of string ($) or a number of spaces (\s+), the whole repeated more than once.
I tried it like this and it worked well:
var s = "here here here here is ahi-ahi ahi-ahi ahi-ahi joe's joe's joe's joe's joe's the result result result";
print( s.replace( /(\b\S+\b)(($|\s+)\1)+/g, "$1"))
--> here is ahi-ahi joe's the result
Try this regular expression it fits for all repeated words cases:
\b(\w+)\s+\1(?:\s+\1)*\b
I think another solution would be to use named capture groups and backreferences like this:
.* (?<mytoken>\w+)\s+\k<mytoken> .*/
OR
.*(?<mytoken>\w{3,}).+\k<mytoken>.*/
Kotlin:
val regex = Regex(""".* (?<myToken>\w+)\s+\k<myToken> .*""")
val input = "This is a test test data"
val result = regex.find(input)
println(result!!.groups["myToken"]!!.value)
Java:
var pattern = Pattern.compile(".* (?<myToken>\\w+)\\s+\\k<myToken> .*");
var matcher = pattern.matcher("This is a test test data");
var isFound = matcher.find();
var result = matcher.group("myToken");
System.out.println(result);
JavaScript:
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/;
const input = "This is a test test data";
const result = regex.exec(input);
console.log(result.groups.myToken);
// OR
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/g;
const input = "This is a test test data";
const result = [...input.matchAll(regex)];
console.log(result[0].groups.myToken);
All the above detect the test as the duplicate word.
Tested with Kotlin 1.7.0-Beta, Java 11, Chrome and Firefox 100.
You can use this pattern:
\b(\w+)(?:\W+\1\b)+
This pattern can be used to match all duplicated word groups in sentences. :)
Here is a sample util function written in java 17, which replaces all duplications with the first occurrence:
public String removeDuplicates(String input) {
var regex = "\\b(\\w+)(?:\\W+\\1\\b)+";
var pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
var matcher = pattern.matcher(input);
while (matcher.find()) {
input = input.replaceAll(matcher.group(), matcher.group(1));
}
return input;
}
As far as I can see, none of these would match:
London in the
the winter (with the winter on a new line )
Although matching duplicates on the same line is fairly straightforward,
I haven't been able to come up with a solution for the situation in which they
stretch over two lines. ( with Perl )
To find duplicate words that have no leading or trailing non whitespace character(s) other than a word character(s), you can use whitespace boundaries on the left and on the right making use of lookarounds.
The pattern will have a match in:
Paris in the the spring.
Not that that is related.
The pattern will not have a match in:
This is $word word
(?<!\S)(\w+)\s+\1(?!\S)
Explanation
(?<!\S) Negative lookbehind, assert not a non whitespace char to the left of the current location
(\w+) Capture group 1, match 1 or more word characters
\s+ Match 1 or more whitespace characters (note that this can also match a newline)
\1 Backreference to match the same as in group 1
(?!\S) Negative lookahead, assert not a non whitespace char to the right of the current location
See a regex101 demo.
To find 2 or more duplicate words:
(?<!\S)(\w+)(?:\s+\1)+(?!\S)
This part of the pattern (?:\s+\1)+ uses a non capture group to repeat 1 or more times matching 1 or more whitespace characters followed by the backreference to match the same as in group 1.
See a regex101 demo.
Alternatives without using lookarounds
You could also make use of a leading and trailing alternation matching either a whitespace char or assert the start/end of the string.
Then use a capture group 1 for the value that you want to get, and use a second capture group with a backreference \2 to match the repeated word.
Matching 2 duplicate words:
(?:\s|^)((\w+)\s+\2)(?:\s|$)
See a regex101 demo.
Matching 2 or more duplicate words:
(?:\s|^)((\w+)(?:\s+\2)+)(?:\s|$)
See a regex101 demo.
Use this in case you want case-insensitive checking for duplicate words.
(?i)\\b(\\w+)\\s+\\1\\b

Categories

Resources