I want a regex that finds runs of two or more consecutive spaces and replaces every space in such a run with another value.
For example:
I love   to eat    cake
There are 3 spaces after love and 4 spaces after eat. I want my regex to leave single spaces alone and, wherever there are two or more spaces in a row, replace each of those spaces with a value.
The output I am trying to go for:
I love---to eat----cake
I tried something like
myStr.replace(/ +{2,}/g, '-')
You may use this code with a lookahead and a lookbehind:
const s = 'I love   to eat    cake';
var r = s.replace(/ (?= )|(?<= ) /g, '-');
console.log(r);
//=> 'I love---to eat----cake'
RegEx Details:
(?= ): Match a space only if it is followed by a space
|: OR
(?<= ) : Match a space only if it is preceded by a space
You can match two or more whitespaces and replace with the same amount of hyphens:
const s = 'I love   to eat    cake';
console.log(s.replace(/\s{2,}/g, (x) => '-'.repeat(x.length)) )
The same approach can be used in Python (since you asked), re.sub(r'\s{2,}', lambda x: '-' * len(x.group()), s), see the Python demo.
Also, you may replace any whitespace that is followed by a whitespace char or preceded by one using
const s = 'I love   to eat    cake';
console.log(s.replace(/\s(?=\s|(?<=\s.))/gs, '-') )
console.log(s.replace(/\s(?=\s|(?<=\s\s))/g, '-') )
See this regex demo. Here, s flag makes . match any char. g makes the regex replace all occurrences. Also,
\s - matches any whitespace
(?=\s|(?<=\s.)) - a positive lookahead that matches a location that is immediately followed with a whitespace char (\s), or (|) if it is immediately preceded with a whitespace and any one char (which is the matched whitespace). If you use (?<=\s\s) version, there is no need of s flag, \s\s just makes sure the whitespace before the matched whitespace is checked.
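As a quick sanity check, the three approaches above can be run side by side. This is only a sketch: the helper names are made up for illustration, and it assumes an engine with lookbehind support (ES2018 or later). Note that the first pattern only handles literal spaces, while the other two treat any whitespace (tabs, newlines) the same way.

```javascript
// Hypothetical helpers, one per approach shown above.
const viaLookarounds = (s) => s.replace(/ (?= )|(?<= ) /g, '-');
const viaCallback = (s) => s.replace(/\s{2,}/g, (run) => '-'.repeat(run.length));
const viaNestedLookbehind = (s) => s.replace(/\s(?=\s|(?<=\s\s))/g, '-');

// 3 spaces after "love", 4 spaces after "eat", single spaces elsewhere.
const sample = 'I love   to eat    cake';

for (const fn of [viaLookarounds, viaCallback, viaNestedLookbehind]) {
  console.log(fn(sample)); // I love---to eat----cake
}
```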
I'm very new to regex. Right now I'm trying to use regex to prepare my markup string before sending it to the database.
Here is an example string:
#[admin](user:3) Testing this string #[hellotessginal](user:4) Hey!
So far I am able to match the entire term, e.g. #[admin](user:3), using /#\[(.*?)]\((.*?):(\d+)\)/g
But the next step is to remove the (user:3) part, leaving me with #[admin].
Hence the result of passing through the stripper function would be:
#[admin] Testing this string #[hellotessginal] Hey!
Please help!
You may use
s.replace(/(#\[[^\][]*])\([^()]*?:\d+\)/g, '$1')
See the regex demo. Details:
(#\[[^\][]*]) - Capturing group 1: #[, 0 or more chars other than [ and ] as many as possible and then ]
\( - a ( char
[^()]*? - 0 or more (but as few as possible) chars other than ( and )
: - a colon
\d+ - 1+ digits
\) - a ) char.
The $1 in the replacement pattern refers to the value captured in Group 1.
See the JavaScript demo:
const rx = /(#\[[^\][]*])\([^()]*?:\d+\)/g;
const remove_parens = (string, regex) => string.replace(regex, '$1');
let s = '#[admin](user:3) Testing this string #[hellotessginal](user:4) Hey!';
s = remove_parens(s, rx);
console.log(s);
Try this:
var str = "#[admin](user:3) Testing this string #[hellotessginal](user:4) Hey!";
str = str.replace(/ *\([^)]*\) */g, ' ');
console.log(str);
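One thing to keep in mind with this shorter pattern is that it removes every parenthesised group, not only the ones that follow a #[...] tag. A small sketch of the difference, using a made-up input string with an unrelated parenthetical:

```javascript
// Illustrative input: one tag plus an unrelated parenthetical.
const text = '#[admin](user:3) see (note) here';

// Broad pattern from this answer: removes every "(...)" group.
const broad = text.replace(/ *\([^)]*\) */g, ' ');

// Narrower pattern: only "(...:digits)" directly after "#[...]".
const narrow = text.replace(/(#\[[^\][]*])\([^()]*?:\d+\)/g, '$1');

console.log(broad);  // #[admin] see here
console.log(narrow); // #[admin] see (note) here
```

If the input never contains other parentheses, both give the same result.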
You can replace matches of the following regular expression with empty strings.
str.replace(/(?<=\#\[(.*?)\])\(.*?:\d+\)/g, '');
I've assumed the strings for which "admin" and "user" are placeholders in the example cannot contain the characters in the string "()[]". If that's not the case please leave a comment and I will adjust the regex.
I've kept the first capture group on the assumption that it is needed for some unstated purpose. If it's not needed, remove it:
(?<=\#\[.*?\])\(.*?:\d+\)
There is of course no point creating a capture group for a substring that is to be replaced with an empty string.
Javascript's regex engine performs the following operations.
(?<= : begin positive lookbehind
\#\[ : match '#['
(.*?) : match 0+ chars, lazily, save to capture group 1
\] : match ']'
) : end positive lookbehind
\(.*?:\d+\) : match '(', 0+ chars, lazily, ':', 1+ digits, ')'
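A minimal sketch of the lookbehind idea as a reusable function. The name stripUserRefs is hypothetical, and the pattern is tightened to negated character classes on the stated assumption that the names never contain brackets or parentheses. It needs an engine with lookbehind support.

```javascript
// Hypothetical helper: drop "(name:id)" only when it directly follows "#[...]".
function stripUserRefs(markup) {
  // The lookbehind checks for the tag without consuming it,
  // so replacing the match with '' leaves "#[...]" untouched.
  return markup.replace(/(?<=#\[[^\][]*\])\([^()]*:\d+\)/g, '');
}

console.log(stripUserRefs('#[admin](user:3) Testing this string #[hellotessginal](user:4) Hey!'));
// #[admin] Testing this string #[hellotessginal] Hey!
```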
I have the RegEx below to validate a string.
var str = "Thebestthingsinlifearefree";
var patt = /[^0-9A-Za-z !\\#$%&()*+,\-.\/:;<=>?#\[\]^_`{|}~]*/g;
var res = patt.test(str);
The result is always true, but I thought it would be false, because I am checking for any character that is not in the patt variable.
The given string is valid: it contains only upper- and lower-case letters. I'm not sure what is wrong with the pattern.
Here's your code:
var str = "Thebestthingsinlifearefree";
var patt = /[^0-9A-Za-z !\\#$%&()*+,\-.\/:;<=>?#\[\]^_`{|}~]*/g;
console.log(patt.test(str));
The regex
/[^0-9A-Za-z !\\#$%&()*+,\-.\/:;<=>?#\[\]^_`{|}~]*/g
will match anything since it accepts match of length 0 due to the quantifier *.
Just add anchors:
var str = "Thebestthingsinlifearefree";
var patt = /^[^0-9A-Za-z !\\#$%&()*+,\-.\/:;<=>?#\[\]^_`{|}~]*$/;
console.log(patt.test(str));
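To see the zero-length match for yourself, here is a small sketch. The character class is shortened to letters and digits purely for illustration; the behaviour is the same with the full list.

```javascript
// Simplified class, for illustration only.
const unanchored = /[^0-9A-Za-z]*/;
const anchored = /^[^0-9A-Za-z]*$/;

// Without anchors, * happily matches the empty string at position 0.
console.log('abc'.match(unanchored)[0].length); // 0
console.log(unanchored.test('abc'));            // true

// With anchors, the whole string must consist of excluded characters.
console.log(anchored.test('abc')); // false
console.log(anchored.test('!!!')); // true
```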
Here's an explanation of your regex:
[^0-9A-Za-z !\\#$%&()*+,\-.\/:;<=>?#\[\]^_`{|}~]* match a single character not present in the list below
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
0-9 a single character in the range between 0 and 9
A-Z a single character in the range between A and Z (case sensitive)
a-z a single character in the range between a and z (case sensitive)
! a single character in the list ! literally
\\ matches the character \ literally
#$%&()*+, a single character in the list #$%&()*+, literally (case sensitive)
\- matches the character - literally
. the literal character .
\/ matches the character / literally
:;<=>?# a single character in the list :;<=>?# literally (case sensitive)
\[ matches the character [ literally
\] matches the character ] literally
^_`{|}~ a single character in the list ^_`{|}~ literally
Note that:
A search for a missing pattern is better expressed by a negative condition in code (!patt.test...).
Inside a character class, characters like ., (, ), ?, { and } are already literal, so prefixing them with a backslash (\) is optional; it is harmless and is done below only for readability.
var str = "Thebestthingsinlifearefree";
var patt = /[0-9A-Za-z !\\#$%&\(\)*+,\-\.\/:;<=>\?#\[\]^_`\{|\}~]/;
var res = !patt.test(str);
console.log(res);
This will print false, as expected.
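If the goal is to accept only strings made up entirely of the listed characters (that is my reading of the question, not something it states outright), a positive, anchored class may be easier to reason about. A sketch, with hypothetical names:

```javascript
// Hypothetical allow-list check: every character must be in the class.
// Note: an empty string also passes because of the * quantifier.
const ALLOWED_ONLY = /^[0-9A-Za-z !\\#$%&()*+,\-.\/:;<=>?\[\]^_`{|}~]*$/;
const isValid = (str) => ALLOWED_ONLY.test(str);

console.log(isValid('Thebestthingsinlifearefree')); // true
console.log(isValid('café'));                       // false (é is not in the list)
```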
I'm a regular expression newbie and I can't quite figure out how to write a single regular expression that would "match" any duplicate consecutive words such as:
Paris in the the spring.
Not that that is related.
Why are you laughing? Are my my regular expressions THAT bad??
Is there a single regular expression that will match ALL of the repeated words above (the the, that that, my my)?
Try this regular expression:
\b(\w+)\s+\1\b
Here \b is a word boundary and \1 references the captured match of the first group.
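A short sketch applying this pattern to the sentences from the question. The helper name findDoubledWords is made up for the example; the g flag is added so that matchAll can return every hit.

```javascript
// Hypothetical helper: list every "word word" pair found in the text.
function findDoubledWords(text) {
  return [...text.matchAll(/\b(\w+)\s+\1\b/g)].map((m) => m[0]);
}

const sentences = [
  'Paris in the the spring.',
  'Not that that is related.',
  'Why are you laughing? Are my my regular expressions THAT bad??',
];

for (const s of sentences) {
  console.log(findDoubledWords(s)); // [ 'the the' ], [ 'that that' ], [ 'my my' ]
}
```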
I believe this regex handles more situations:
/(\b\S+\b)\s+\b\1\b/
A good selection of test strings can be found here: http://callumacrae.github.com/regex-tuesday/challenge1.html
The below expression should work correctly to find any number of duplicated words. The matching can be case insensitive.
String regex = "\\b(\\w+)(\\s+\\1\\b)+";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
input = input.replaceAll(m.group(0), m.group(1));
}
Sample Input : Goodbye goodbye GooDbYe
Sample Output : Goodbye
Explanation:
The regex expression:
\b : A word boundary at the start of the word
\w+ : One or more word characters
(\s+\1\b)+ : One or more whitespace characters followed by a word which matches the previous word and ends on a word boundary. Wrapping the whole thing in + finds one or more repetitions.
Grouping :
m.group(0) : Shall contain the matched group in above case Goodbye goodbye GooDbYe
m.group(1) : Shall contain the first word of the matched pattern in above case Goodbye
Replace method shall replace all consecutive matched words with the first instance of the word.
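The loop above feeds the matched text back into replaceAll as a new regex. A single replace with a back-reference to group 1 avoids that round trip. Here is a sketch of the same idea in JavaScript (chosen only to keep the examples on this page in one language; collapseRepeats is a hypothetical name):

```javascript
// Hypothetical helper: keep the first word of each run of repeated words.
function collapseRepeats(text) {
  // g: every run, i: case-insensitive, $1: the first word as it was written.
  return text.replace(/\b(\w+)(?:\s+\1\b)+/gi, '$1');
}

console.log(collapseRepeats('Goodbye goodbye GooDbYe'));  // Goodbye
console.log(collapseRepeats('Paris in the the spring.')); // Paris in the spring.
```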
Try the RE below:
\b start of a word (word boundary)
\w+ one or more word characters (captured as group 1)
\W+ one or more non-word characters
\1 the same word matched already
\b end of word
()* repeat the group zero or more times
public static void main(String[] args) {
    String regex = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";
    Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    Scanner in = new Scanner(System.in);
    int numSentences = Integer.parseInt(in.nextLine());
    while (numSentences-- > 0) {
        String input = in.nextLine();
        Matcher m = p.matcher(input);
        // Check for subsequences of input that match the compiled pattern
        while (m.find()) {
            input = input.replaceAll(m.group(0), m.group(1));
        }
        // Prints the modified sentence.
        System.out.println(input);
    }
    in.close();
}
Regex to Strip 2+ duplicate words (consecutive/non-consecutive words)
Try this regex that can catch 2 or more duplicate words and only leave behind one single word. And the duplicate words need not even be consecutive.
/\b(\w+)\b(?=.*?\b\1\b)/ig
Here, \b is used for Word Boundary, ?= is used for positive lookahead, and \1 is used for back-referencing.
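A sketch of how this might be used for stripping. Two things worth knowing: the lookahead removes the earlier occurrences, so the last copy of each word is the one that survives, and the removed words leave their spaces behind, so the example tidies whitespace afterwards. The helper name is hypothetical, and . does not cross line breaks unless the s flag is added.

```javascript
// Hypothetical helper: drop every word that appears again later in the text.
function removeEarlierDuplicates(text) {
  return text
    .replace(/\b(\w+)\b(?=.*?\b\1\b)/gi, '') // remove words that recur later
    .replace(/\s+/g, ' ')                    // collapse the gaps left behind
    .trim();
}

console.log(removeEarlierDuplicates('one two one three two')); // one three two
```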
The widely-used PCRE library can handle such situations (you won't achieve the same with POSIX-compliant regex engines, though):
(\b\w+\b)\W+\1
Here is one that catches multiple words multiple times:
(\b\w+\b)(\s+\1)+
No. That is an irregular grammar. There may be engine-/language-specific regular expressions that you can use, but there is no universal regular expression that can do that.
This is the regex I use to remove duplicate phrases in my twitch bot:
(\S+\s*)\1{2,}
(\S+\s*) looks for any string of characters that isn't whitespace, followed by optional whitespace.
\1{2,} then requires at least 2 further copies of that phrase immediately after it, so the pattern matches once a phrase appears three or more times in a row (each copy must carry the same trailing whitespace).
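A few quick checks of how this pattern behaves, as a sketch. The inputs are made up. The third case is easy to trip over: because the trailing whitespace is part of the captured phrase, the last copy needs it too.

```javascript
const tripled = /(\S+\s*)\1{2,}/;

console.log(tripled.test('hi hi'));       // false: only two copies
console.log(tripled.test('hi hi hi hi')); // true: "hi " occurs three times in a row
console.log(tripled.test('hi hi hi'));    // false: the third "hi" has no trailing space
console.log(tripled.test('aaa'));         // true: also matches repeats inside a word
```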
Since some developers are coming to this page in search of a solution which not only eliminates duplicate consecutive non-whitespace substrings, but triplicates and beyond, I'll show the adapted pattern.
Pattern: /(\b\S+)(?:\s+\1\b)+/ (Pattern Demo)
Replace: $1 (replaces the fullstring match with capture group #1)
This pattern greedily matches a "whole" non-whitespace substring, then requires one or more copies of the matched substring which may be delimited by one or more whitespace characters (space, tab, newline, etc).
Specifically:
\b (word boundary) characters are vital to ensure partial words are not matched.
The second parenthetical is a non-capturing group, because this variable width substring does not need to be captured -- only matched/absorbed.
the + (one or more quantifier) on the non-capturing group is more appropriate than * because * will "bother" the regex engine to capture and replace singleton occurrences -- this is wasteful pattern design.
*note if you are dealing with sentences or input strings with punctuation, then the pattern will need to be further refined.
The example in Javascript: The Good Parts can be adapted to do this:
var doubled_words = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1(?:\s|$)/gi;
\b uses \w for word boundaries, where \w is equivalent to [0-9A-Z_a-z]. If you don't mind that limitation, the accepted answer is fine.
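If that limitation does matter, one option in modern JavaScript is to build the boundaries from Unicode property escapes instead of \b and \w. This is a sketch, not the book's approach; it assumes an engine that supports the u flag, \p{...} and lookbehind.

```javascript
// ASCII-only version from the accepted answer.
const asciiDoubles = /\b(\w+)\s+\1\b/g;

// Unicode-aware variant: letters, digits and underscore in any script.
const unicodeDoubles = /(?<![\p{L}\p{N}_])([\p{L}\p{N}_]+)\s+\1(?![\p{L}\p{N}_])/gu;

const text = 'en été été dernier';

console.log(text.match(asciiDoubles));   // null
console.log(text.match(unicodeDoubles)); // [ 'été été' ]
```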
This expression (inspired from Mike, above) seems to catch all duplicates, triplicates, etc, including the ones at the end of the string, which most of the others don't:
s.replace(/(^|\s+)(\S+)(($|\s+)\2)+/g, "$1$2")
I know the question asked to match duplicates only, but a triplicate is just 2 duplicates next to each other :)
First, I put (^|\s+) to make sure it starts with a full word, otherwise "child's steak" would go to "child'steak" (the "s"'s would match). Then, it matches a full word ((\S+)), followed by an end of string ($) or a number of spaces (\s+) and the same word again, the whole repeated one or more times.
I tried it like this and it worked well:
var s = "here here here here is ahi-ahi ahi-ahi ahi-ahi joe's joe's joe's joe's joe's the result result result";
console.log(s.replace(/(\b\S+\b)(($|\s+)\1)+/g, "$1"));
//=> here is ahi-ahi joe's the result
Try this regular expression it fits for all repeated words cases:
\b(\w+)\s+\1(?:\s+\1)*\b
I think another solution would be to use named capture groups and backreferences like this:
.* (?<mytoken>\w+)\s+\k<mytoken> .*
OR
.*(?<mytoken>\w{3,}).+\k<mytoken>.*
Kotlin:
val regex = Regex(""".* (?<myToken>\w+)\s+\k<myToken> .*""")
val input = "This is a test test data"
val result = regex.find(input)
println(result!!.groups["myToken"]!!.value)
Java:
var pattern = Pattern.compile(".* (?<myToken>\\w+)\\s+\\k<myToken> .*");
var matcher = pattern.matcher("This is a test test data");
var isFound = matcher.find();
var result = matcher.group("myToken");
System.out.println(result);
JavaScript:
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/;
const input = "This is a test test data";
const result = regex.exec(input);
console.log(result.groups.myToken);
// OR
const regexAll = /.* (?<myToken>\w+)\s+\k<myToken> .*/g;
const results = [...input.matchAll(regexAll)];
console.log(results[0].groups.myToken);
All the above detect the test as the duplicate word.
Tested with Kotlin 1.7.0-Beta, Java 11, Chrome and Firefox 100.
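The leading and trailing .* wrappers mean the patterns above find at most one duplicate per string. A variant without them, using word boundaries, can list every duplicated word. This is a sketch; the group name word and the helper name are arbitrary.

```javascript
// Hypothetical helper: return each word that is immediately repeated.
function duplicatedWords(text) {
  const re = /\b(?<word>\w+)\s+\k<word>\b/g;
  return [...text.matchAll(re)].map((m) => m.groups.word);
}

console.log(duplicatedWords('This is a test test data')); // [ 'test' ]
console.log(duplicatedWords('the the and that that'));    // [ 'the', 'that' ]
```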
You can use this pattern:
\b(\w+)(?:\W+\1\b)+
This pattern can be used to match all duplicated word groups in sentences. :)
Here is a sample util function written in java 17, which replaces all duplications with the first occurrence:
public String removeDuplicates(String input) {
    var regex = "\\b(\\w+)(?:\\W+\\1\\b)+";
    var pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    var matcher = pattern.matcher(input);
    while (matcher.find()) {
        input = input.replaceAll(matcher.group(), matcher.group(1));
    }
    return input;
}
As far as I can see, none of these would match:
London in the
the winter (with the winter on a new line )
Although matching duplicates on the same line is fairly straightforward,
I haven't been able to come up with a solution for the situation in which they
stretch over two lines. ( with Perl )
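For what it's worth, \s also matches line breaks, so the \s+ based patterns do span two lines as long as the whole text is in a single string rather than being read line by line (that is an assumption about the setup here). A sketch in JavaScript; the same should hold in Perl once the full text is in one scalar.

```javascript
// The duplicate is split across a line break.
const text = 'London in the\nthe winter';

const match = text.match(/\b(\w+)\s+\1\b/);

// JSON.stringify makes the embedded newline visible.
console.log(JSON.stringify(match && match[0])); // "the\nthe"
```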
To find duplicate words that have no leading or trailing non whitespace character(s) other than a word character(s), you can use whitespace boundaries on the left and on the right making use of lookarounds.
The pattern will have a match in:
Paris in the the spring.
Not that that is related.
The pattern will not have a match in:
This is $word word
(?<!\S)(\w+)\s+\1(?!\S)
Explanation
(?<!\S) Negative lookbehind, assert not a non whitespace char to the left of the current location
(\w+) Capture group 1, match 1 or more word characters
\s+ Match 1 or more whitespace characters (note that this can also match a newline)
\1 Backreference to match the same as in group 1
(?!\S) Negative lookahead, assert not a non whitespace char to the right of the current location
See a regex101 demo.
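A quick comparison of the whitespace-boundary pattern with the plain \b version, as a sketch with made-up variable names. The last input shows a side effect to be aware of: a duplicate directly followed by punctuation is not matched by the whitespace-boundary pattern.

```javascript
const wordBoundary = /\b(\w+)\s+\1\b/;
const whitespaceBoundary = /(?<!\S)(\w+)\s+\1(?!\S)/;

const inputs = ['Not that that is related.', 'This is $word word', 'It is the the.'];

for (const s of inputs) {
  console.log(s, wordBoundary.test(s), whitespaceBoundary.test(s));
}
// Not that that is related. true true
// This is $word word true false
// It is the the. true false
```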
To find 2 or more duplicate words:
(?<!\S)(\w+)(?:\s+\1)+(?!\S)
This part of the pattern (?:\s+\1)+ uses a non capture group to repeat 1 or more times matching 1 or more whitespace characters followed by the backreference to match the same as in group 1.
See a regex101 demo.
Alternatives without using lookarounds
You could also make use of a leading and trailing alternation matching either a whitespace char or assert the start/end of the string.
Then use a capture group 1 for the value that you want to get, and use a second capture group with a backreference \2 to match the repeated word.
Matching 2 duplicate words:
(?:\s|^)((\w+)\s+\2)(?:\s|$)
See a regex101 demo.
Matching 2 or more duplicate words:
(?:\s|^)((\w+)(?:\s+\2)+)(?:\s|$)
See a regex101 demo.
Use this in case you want case-insensitive checking for duplicate words.
(?i)\\b(\\w+)\\s+\\1\\b
How to match word in string that contain exactly "3 digits and 3 letters"?
e.g. 100BLA
var regex = ?;
var string = "word word 100BLA word";
var desiredString = string.match(regex);
\d matches a digit
[a-zA-Z] matches a letter
{3} is the quantifier that matches exactly 3 repetitions
^ Anchor to match the start of the string
$ Anchor to match the end of the string
So if you use all this new knowledge, you will come to a regex like this:
^\d{3}[a-zA-Z]{3}$
Update:
Since the input example has changed after I wrote my answer, here is the update:
If your word is part of a larger string, you don't need the anchors ^ and $. Instead, you have to use word boundaries \b.
\b\d{3}[a-zA-Z]{3}\b
INITIAL (incomplete)
var regex = /[0-9]{3}[A-Za-z]{3}/;
EDIT 1 (incomplete)
var regex = /[0-9]{3}[A-Za-z]{3}\b/; // used \b for word boundary
EDIT 2 (correct)
var regex = /\b[0-9]{3}[A-Za-z]{3}\b/; // used \b at start and end for whole word boundary
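To show why both boundaries matter, here is a small sketch with a hypothetical helper, findCode, run against the question's string and two near misses:

```javascript
// Hypothetical helper: return the first "3 digits + 3 letters" word, or null.
function findCode(text) {
  const m = text.match(/\b\d{3}[a-zA-Z]{3}\b/);
  return m ? m[0] : null;
}

console.log(findCode('word word 100BLA word')); // 100BLA
console.log(findCode('word 1000BLA word'));     // null (4 digits)
console.log(findCode('word 100BLAH word'));     // null (4 letters)
```

Without the leading \b, the second input would match 000BLA; without the trailing one, the third would match 100BLA.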