I created this javascript regex
(?<=\s|^|\.)[^ ]+\(
Here is my regex fiddle. The lines I am testing against are:
a bcde(
a bc.de(
bc(
See how these strings are matched:
instead of matching on line 2
bc.de(
I wish to get only
.de(
You can use
(?<=[\s.]|^)[^\s.]+\(
See the regex demo. If you do not want to match any whitespace, use a regular space:
(?<=[ .]|^)[^ .]+\(
Details:
(?<=[\s.]|^) - a positive lookbehind that requires a whitespace, start of string or a . to occur immediately to the left of the current location
[^\s.]+ - any one or more chars other than whitespace and a dot
\( - a ( char.
Note that is would be much better to use a consuming pattern here rather than rely on the lookbehind. You could match all till the first dot, or if there is no dot, match the first whitespace, or start of string, that are followed with any one or more chars other than space till a ( char. The point here is that you need to capture the part of the pattern you need to extract:
const regex = /(?:^[^.\r\n]*\.|\s|^)([^ (]+)\(/;
const texts = ["a bcde(", "a bc.de(", "bc("];
for (const text of texts) {
console.log(text, '=>', text.match(regex)?.[1]);
}
I'm trying to check for every occurrence that a string has an # at the beginning of a string.
So something like this works for only one string occurance
const comment = "#barnowl is cool"
const regex = /#[a-z]/i;
if (comment.charAt(0).includes("#")) {
if (regex.test(comment)) {
// do something
console.log('test passeed')
} else {
// do something else
}
} else {
// do other
}
but....
What if you have a textarea and a user uses the # multiple times to reference another user this test will no longer work because charAt(0) is looking for the first character in a string.
What regex test is doable in a situation where you have to check the occurrence of a # followed by a space. I know i can ditch charAt(0) and use comment.includes("#") but i want to use a regex pattern to check if there is space after wards
So if user does #username followed by a space after words, the regex should pass.
Doing this \s doesn't seem to make the test pass
const regex = /#[a-z]\s/i; // shouldn't this check for white space after a letter ?
demo:
https://jsbin.com/riraluxape/edit?js,console
I think your expression is very close. There are two things that are missing:
The [a-z] match is only looking for one character, so in order to look for multiple characters it needs to be [a-z]+.
The flags section is missing the g modifier, which enables the expression to look through the entire text string instead of just the first match.
I believe the regular expression declaration should be adjusted to the following:
const regex = /#[a-z]+\s/ig;
Is this what you want? Matching all the occurrences of the mention?
const regex = /#\w+/ig
I used the \w flag here which matches any word character.
To check for multiple matches instead of only the first one, append g to the regex:
const regex = /#[a-z]*\s/ig;
Your regex with \s actually works, see: https://regex101.com/r/gyMyvB/1
I have got this string {bgRed Please run a task, {red a list has been provided below}, I need to do a string replace to remove the braces and also the first word.
So below I would want to remove {bgRed and {red and then the trailing brace which I can do separate.
I have managed to create this regex, but it is only matching {bgRed and not {red, can someone lend a hand?
/^\{.+?(?=\s)/gm
Note you are using ^ anchor at the start and that makes your pattern only match at the start of a line (mind also the m modifier). .+?(?=\s|$) is too cumbersome, you want to match any 1+ chars up to the first whitespace or end of string, use {\S+ (or {\S* if you plan to match { without any non-whitespace chars after it).
You may use
s = s.replace(/{\S*|}/g, '')
You may trim the outcome to get rid of resulting leading/trailing spaces:
s = s.replace(/{\S*|}/g, '').trim()
See the regex demo and the regex graph:
Details
{\S* - { char followed with 0 or more non-whitespace characters
| - or
} - a } char.
If the goal is go to from
"{bgRed Please run a task, {red a list has been provided below}"
to
"Please run a task, a list has been provided below"
a regex with two capture groups seems simplest:
const original = "{bgRed Please run a task, {red a list has been provided below}";
const rex = /\{\w+ ([^{]+)\{\w+ ([^}]+)}/g;
const result = original.replace(rex, "$1$2");
console.log(result);
\{\w+ ([^{]+)\{\w+ ([^}]+)} is:
\{ - a literal {
\w+ - one or more word characters ("bgRed")
a literal space
([^{]+) one or more characters that aren't {, captured to group 1
\{ - another literal {
\w+ - one or more word characters ("red")
([^}]+) - one or more characters that aren't }, captured to group 2
} - a literal }
The replacement uses $1 and $2 to swap in the capture group contents.
I'm a regular expression newbie and I can't quite figure out how to write a single regular expression that would "match" any duplicate consecutive words such as:
Paris in the the spring.
Not that that is related.
Why are you laughing? Are my my regular expressions THAT bad??
Is there a single regular expression that will match ALL of the bold strings above?
Try this regular expression:
\b(\w+)\s+\1\b
Here \b is a word boundary and \1 references the captured match of the first group.
Regex101 example here
I believe this regex handles more situations:
/(\b\S+\b)\s+\b\1\b/
A good selection of test strings can be found here: http://callumacrae.github.com/regex-tuesday/challenge1.html
The below expression should work correctly to find any number of duplicated words. The matching can be case insensitive.
String regex = "\\b(\\w+)(\\s+\\1\\b)+";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
input = input.replaceAll(m.group(0), m.group(1));
}
Sample Input : Goodbye goodbye GooDbYe
Sample Output : Goodbye
Explanation:
The regex expression:
\b : Start of a word boundary
\w+ : Any number of word characters
(\s+\1\b)* : Any number of space followed by word which matches the previous word and ends the word boundary. Whole thing wrapped in * helps to find more than one repetitions.
Grouping :
m.group(0) : Shall contain the matched group in above case Goodbye goodbye GooDbYe
m.group(1) : Shall contain the first word of the matched pattern in above case Goodbye
Replace method shall replace all consecutive matched words with the first instance of the word.
Try this with below RE
\b start of word word boundary
\W+ any word character
\1 same word matched already
\b end of word
()* Repeating again
public static void main(String[] args) {
String regex = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";// "/* Write a RegEx matching repeated words here. */";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE/* Insert the correct Pattern flag here.*/);
Scanner in = new Scanner(System.in);
int numSentences = Integer.parseInt(in.nextLine());
while (numSentences-- > 0) {
String input = in.nextLine();
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
input = input.replaceAll(m.group(0),m.group(1));
}
// Prints the modified sentence.
System.out.println(input);
}
in.close();
}
Regex to Strip 2+ duplicate words (consecutive/non-consecutive words)
Try this regex that can catch 2 or more duplicate words and only leave behind one single word. And the duplicate words need not even be consecutive.
/\b(\w+)\b(?=.*?\b\1\b)/ig
Here, \b is used for Word Boundary, ?= is used for positive lookahead, and \1 is used for back-referencing.
Example
Source
The widely-used PCRE library can handle such situations (you won't achieve the the same with POSIX-compliant regex engines, though):
(\b\w+\b)\W+\1
Here is one that catches multiple words multiple times:
(\b\w+\b)(\s+\1)+
No. That is an irregular grammar. There may be engine-/language-specific regular expressions that you can use, but there is no universal regular expression that can do that.
This is the regex I use to remove duplicate phrases in my twitch bot:
(\S+\s*)\1{2,}
(\S+\s*) looks for any string of characters that isn't whitespace, followed whitespace.
\1{2,} then looks for more than 2 instances of that phrase in the string to match. If there are 3 phrases that are identical, it matches.
Since some developers are coming to this page in search of a solution which not only eliminates duplicate consecutive non-whitespace substrings, but triplicates and beyond, I'll show the adapted pattern.
Pattern: /(\b\S+)(?:\s+\1\b)+/ (Pattern Demo)
Replace: $1 (replaces the fullstring match with capture group #1)
This pattern greedily matches a "whole" non-whitespace substring, then requires one or more copies of the matched substring which may be delimited by one or more whitespace characters (space, tab, newline, etc).
Specifically:
\b (word boundary) characters are vital to ensure partial words are not matched.
The second parenthetical is a non-capturing group, because this variable width substring does not need to be captured -- only matched/absorbed.
the + (one or more quantifier) on the non-capturing group is more appropriate than * because * will "bother" the regex engine to capture and replace singleton occurrences -- this is wasteful pattern design.
*note if you are dealing with sentences or input strings with punctuation, then the pattern will need to be further refined.
The example in Javascript: The Good Parts can be adapted to do this:
var doubled_words = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1(?:\s|$)/gi;
\b uses \w for word boundaries, where \w is equivalent to [0-9A-Z_a-z]. If you don't mind that limitation, the accepted answer is fine.
This expression (inspired from Mike, above) seems to catch all duplicates, triplicates, etc, including the ones at the end of the string, which most of the others don't:
/(^|\s+)(\S+)(($|\s+)\2)+/g, "$1$2")
I know the question asked to match duplicates only, but a triplicate is just 2 duplicates next to each other :)
First, I put (^|\s+) to make sure it starts with a full word, otherwise "child's steak" would go to "child'steak" (the "s"'s would match). Then, it matches all full words ((\b\S+\b)), followed by an end of string ($) or a number of spaces (\s+), the whole repeated more than once.
I tried it like this and it worked well:
var s = "here here here here is ahi-ahi ahi-ahi ahi-ahi joe's joe's joe's joe's joe's the result result result";
print( s.replace( /(\b\S+\b)(($|\s+)\1)+/g, "$1"))
--> here is ahi-ahi joe's the result
Try this regular expression it fits for all repeated words cases:
\b(\w+)\s+\1(?:\s+\1)*\b
I think another solution would be to use named capture groups and backreferences like this:
.* (?<mytoken>\w+)\s+\k<mytoken> .*/
OR
.*(?<mytoken>\w{3,}).+\k<mytoken>.*/
Kotlin:
val regex = Regex(""".* (?<myToken>\w+)\s+\k<myToken> .*""")
val input = "This is a test test data"
val result = regex.find(input)
println(result!!.groups["myToken"]!!.value)
Java:
var pattern = Pattern.compile(".* (?<myToken>\\w+)\\s+\\k<myToken> .*");
var matcher = pattern.matcher("This is a test test data");
var isFound = matcher.find();
var result = matcher.group("myToken");
System.out.println(result);
JavaScript:
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/;
const input = "This is a test test data";
const result = regex.exec(input);
console.log(result.groups.myToken);
// OR
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/g;
const input = "This is a test test data";
const result = [...input.matchAll(regex)];
console.log(result[0].groups.myToken);
All the above detect the test as the duplicate word.
Tested with Kotlin 1.7.0-Beta, Java 11, Chrome and Firefox 100.
You can use this pattern:
\b(\w+)(?:\W+\1\b)+
This pattern can be used to match all duplicated word groups in sentences. :)
Here is a sample util function written in java 17, which replaces all duplications with the first occurrence:
public String removeDuplicates(String input) {
var regex = "\\b(\\w+)(?:\\W+\\1\\b)+";
var pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
var matcher = pattern.matcher(input);
while (matcher.find()) {
input = input.replaceAll(matcher.group(), matcher.group(1));
}
return input;
}
As far as I can see, none of these would match:
London in the
the winter (with the winter on a new line )
Although matching duplicates on the same line is fairly straightforward,
I haven't been able to come up with a solution for the situation in which they
stretch over two lines. ( with Perl )
To find duplicate words that have no leading or trailing non whitespace character(s) other than a word character(s), you can use whitespace boundaries on the left and on the right making use of lookarounds.
The pattern will have a match in:
Paris in the the spring.
Not that that is related.
The pattern will not have a match in:
This is $word word
(?<!\S)(\w+)\s+\1(?!\S)
Explanation
(?<!\S) Negative lookbehind, assert not a non whitespace char to the left of the current location
(\w+) Capture group 1, match 1 or more word characters
\s+ Match 1 or more whitespace characters (note that this can also match a newline)
\1 Backreference to match the same as in group 1
(?!\S) Negative lookahead, assert not a non whitespace char to the right of the current location
See a regex101 demo.
To find 2 or more duplicate words:
(?<!\S)(\w+)(?:\s+\1)+(?!\S)
This part of the pattern (?:\s+\1)+ uses a non capture group to repeat 1 or more times matching 1 or more whitespace characters followed by the backreference to match the same as in group 1.
See a regex101 demo.
Alternatives without using lookarounds
You could also make use of a leading and trailing alternation matching either a whitespace char or assert the start/end of the string.
Then use a capture group 1 for the value that you want to get, and use a second capture group with a backreference \2 to match the repeated word.
Matching 2 duplicate words:
(?:\s|^)((\w+)\s+\2)(?:\s|$)
See a regex101 demo.
Matching 2 or more duplicate words:
(?:\s|^)((\w+)(?:\s+\2)+)(?:\s|$)
See a regex101 demo.
Use this in case you want case-insensitive checking for duplicate words.
(?i)\\b(\\w+)\\s+\\1\\b
I have a part of the word and I should find full word in the string using regular expressions.
For example, I have the following text:
If it bothers you, call it a "const identifier" instead.
It doesn't matter whether you call max a const variable or a const identififfiieer. What matters...
And the part of the word: identifi. I have to find both: identifier and identififfiieer.
I tried the following regex (javascript):
[\ ,!##$%^&*()\.\"]*(identifi.*?)[\ ,!##$%^&*()\d\.\"]
So it searches the part of word surrounded by punctuation characters or space. Sometime this regex works fine, but in this case it also includes quote and dot int the match. What's wrong with it? Maybe there is a better idea?
You can use
\bidentifi.*?\b
Which means:
Assert the position at a word boundary
Match the characters "identifi" literally
Match any single character that is not not a line break
Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
Assert the position at a word boundary
'foo "bar identifier"'.match(/\bidentifi.*?\b/g); // ["identifier"]
'foo identififfiieer. bar'.match(/\bidentifi.*?\b/g); // ["identififfiieer"]
You can use \w*identifi\w*
\w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits.
Here is a demo, showing the regex and its matches.
As a side note, your original regex actually works fine if you use the capturing group:
var body = 'If it bothers you, call it a "const identifier" instead.\nIt doesn\'t matter whether you call max a const variable or a const identififfiieer. What matters...';
var reg = /[\ ,!##$%^&*()\.\"]*(identifi.*?)[\ ,!##$%^&*()\d\.\"]/g;
var match;
while (match = reg.exec(body)) {
console.log('>' + match[1] + '<');
}
This outputs:
>identifier<
>identififfiieer<
Here's a demo for this code.