Regex - string repetition of min length - javascript

I want to find any string repetition. I have following code:
let match: Object;
let repetition: ?string;
while ((match = /(.+?)\1+/g.exec(string)) !== null && repetition === null) {
repetition = match[1];
}
It finds the 'abc' repetition in 'weabcabcjy', but it also finds 'll' in 'all'. I would like the regex to enforce a minimum length of 2 characters for the repeated chunk, so that it always compares at least two characters against the following two.

The .+? pattern finds any one or more chars other than linebreak characters, so ll in all will get matched since the first l will be captured into Group 1 and the second one will be matched with \1+.
To only find repetitions of 2+ character chunks you may use a lazy limiting quantifier {2,}?:
/(.{2,}?)\1+/g
See the regex demo.
The (.{2,}?)\1+ pattern will match and capture into Group 1 any two or more, but as few as possible, characters other than linebreak symbols and then 1 or more same consecutive substrings.
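A minimal sketch of the pattern in use. The helper name findRepetition is made up for illustration and is not part of the original code:

```javascript
// Hypothetical helper: returns the first repeated chunk of at least
// two characters, or null when there is none.
function findRepetition(text) {
  const match = /(.{2,}?)\1+/.exec(text);
  return match ? match[1] : null;
}

console.log(findRepetition('weabcabcjy')); // 'abc'
console.log(findRepetition('all'));        // null ('ll' is a single-character repeat)
```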

Related

Regex matching multiple numbers in a string

I would like to extract numbers from a string such as
There are 1,000 people in those 3 towns.
and get an array like ["1,000", "3"].
I got the following number matching Regex from Justin in this question
^[+-]?(\d*|\d{1,3}(,\d{3})*)(\.\d+)?\b$
This works great for checking if it is a number but to make it work on a sentence you need to remove the "^" and "$".
regex101 with start/end defined
regex101 without start/end defined
Without the start and end defined you get a bunch of 0-length matches. These can easily be discarded, but the regex now also splits any numbers with a comma in them.
How do I make that regex (or a new regex) work on sentences and still find numbers with commas in them?
A bonus would be not having all the 0 length matches as well.
The expression /-?\d(?:[,\d]*\.\d+|[,\d]*)/g should do it, if you're okay with allowing different groups such as 1,00,000 (which isn't unknown in some locales). I feel like I should be able to simplify that further, but when I try the example "333.33" gets broken up into "333" and "33" as separate numbers. With the above it's kept together.
Live Example:
const str = "There are 10,000 people in those 3 towns. That's 3,333.33 people per town, roughly. Which is about -67.33 from last year.";
const rex = /-?\d(?:[,\d]*\.\d+|[,\d]*)/g;
let match;
while ((match = rex.exec(str)) !== null) {
console.log(match[0]);
}
Breaking /-?\d(?:[,\d]*\.\d+|[,\d]*)/g down:
-? - an optional minus sign (thank you to x15 for flagging that up in his/her answer!)
\d - a digit
(?:...|...) - a non-capturing group containing an alternation between
[,\d]*\.\d+ - zero or more commas and digits followed by a . and one or more digits, e.g. 3,333.33; or
[,\d]* - zero or more commas and digits
The first alternative will match greedily, falling back to the second alternative if there's no decimal point.
One alternate approach is to split with space and see if the value can be parsed to a number,
let numberExtractor = str => str.split(/\s+/)
.filter(v => v && parseFloat(v.replace(/[.,]/g, '')))
console.log(numberExtractor('There are 1,000 people in those 3 towns. some more numbers -23.012 1,00,000,00'))
To match integer and decimal numbers where the whole part can have optional
commas between digits, but the decimal part cannot, use this:
/[+-]?(?:(?:\d(?:,(?=\d))?)+(?:\.\d*)?|\.\d+)/
https://regex101.com/r/yOuBPx/1
The input sample does not reflect all the boundary conditions this regex handles.
Best to experiment to see its full effect.
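As a rough illustration, the pattern above can be applied to a sentence by adding the g flag (an assumption, since the pattern is shown without flags). The function name extractNumbers is hypothetical:

```javascript
// The pattern from the answer, with the g flag added so that
// String.prototype.match returns every match.
const numberPattern = /[+-]?(?:(?:\d(?:,(?=\d))?)+(?:\.\d*)?|\.\d+)/g;

// Hypothetical helper: returns all number-like substrings, or [] if none.
function extractNumbers(sentence) {
  return sentence.match(numberPattern) || [];
}

console.log(extractNumbers('There are 1,000 people in those 3 towns.'));
// [ '1,000', '3' ]
```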

Why isn't this group capturing all items that appear in parentheses?

I'm trying to create a regex that will capture a string not enclosed by parentheses in the first group, followed by any amount of strings enclosed by parentheses.
e.g.
2(3)(4)(5)
Should be: 2 - first group, 3 - second group, and so on.
What I came up with is this regex: (I'm using JavaScript)
([^()]*)(?:\((([^)]*))\))*
However, when I enter a string like A(B)(C)(D), I only get the A and D captured.
https://regex101.com/r/HQC0ib/1
Can anyone help me out on this, and possibly explain where the error is?
Since you cannot use a \G anchor in JS regex (to match consecutive matches), and there is no stack for each capturing group as there is in the .NET and PyPI regex libraries, you need to use a 2-step approach: 1) match the strings as whole streaks of text, and then 2) post-process to get the values required.
var s = "2(3)(4)(5) A(B)(C)(D)";
var rx = /[^()\s]+(?:\([^)]*\))*/g;
var res = [], m;
while(m=rx.exec(s)) {
res.push(m[0].split(/[()]+/).filter(Boolean));
}
console.log(res);
I added \s to the negated character class [^()] since I added the examples as a single string.
Pattern details
[^()\s]+ - 1 or more chars other than (, ) and whitespace
(?:\([^)]*\))* - 0 or more sequences of:
\( - a (
[^)]* - 0+ chars other than )
\) - a )
The splitting regex is [()]+ that matches 1 or more ) or ( chars, and filter(Boolean) removes empty items.
You cannot have an undetermined number of capture groups. The number of capture groups you get is determined by the regular expression, not by the input it parses. A capture group that occurs within another repetition will indeed only retain the last of those repetitions.
If you know the maximum number of repetitions you can encounter, then just repeat the pattern that many times, and make each of them optional with a ?. For instance, this will capture up to 4 items within parentheses:
([^()]*)(?:\(([^)]*)\))?(?:\(([^)]*)\))?(?:\(([^)]*)\))?(?:\(([^)]*)\))?
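A short sketch of how the fixed-size pattern above behaves in JavaScript. The name captureItems is made up for this example. Optional groups that do not participate in the match come back as undefined, so they are filtered out:

```javascript
// One leading group plus up to four optional parenthesised groups.
const upToFour = /([^()]*)(?:\(([^)]*)\))?(?:\(([^)]*)\))?(?:\(([^)]*)\))?(?:\(([^)]*)\))?/;

// Hypothetical helper: returns the captured items as an array.
function captureItems(text) {
  const match = upToFour.exec(text); // never null: every part of the pattern may match nothing
  // slice(1) drops the full match; groups that did not participate are undefined
  return match.slice(1).filter(group => group !== undefined);
}

console.log(captureItems('A(B)(C)(D)')); // [ 'A', 'B', 'C', 'D' ]
```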
It's not an error. It's just that in regex, when you repeat a capture group (...)*, only the last occurrence is put in the backreference.
For example:
On a string "a,b,c,d", if you match /(,[a-z])+/ then the back reference of capture group 1 (\1) will give ",d".
If you want it to return more, then you could surround it in another capture group.
--> With /((?:,[a-z])+)/ then \1 will give ",b,c,d".
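The two cases above can be checked directly in JavaScript. The variable names are chosen for this sketch only:

```javascript
const letters = 'a,b,c,d';

// Repeated capture group: only the last repetition is retained.
const lastOnly = letters.match(/(,[a-z])+/)[1];

// Outer capture group around a repeated non-capturing group:
// the whole run is retained.
const wholeRun = letters.match(/((?:,[a-z])+)/)[1];

console.log(lastOnly); // ',d'
console.log(wholeRun); // ',b,c,d'
```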
To get those numbers between the parentheses you could also just try to match the word characters.
For example:
var str = "2(3)(14)(B)";
var matches = str.match(/\w+/g);
console.log(matches);

Finding duplicates with regular expressions, how does this actually work? [duplicate]

I'm a regular expression newbie and I can't quite figure out how to write a single regular expression that would "match" any duplicate consecutive words such as:
Paris in the the spring.
Not that that is related.
Why are you laughing? Are my my regular expressions THAT bad??
Is there a single regular expression that will match ALL of the bold strings above?
Try this regular expression:
\b(\w+)\s+\1\b
Here \b is a word boundary and \1 references the captured match of the first group.
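For instance, a quick check of this pattern against the three sentences from the question might look like this in JavaScript (the g flag and the variable names are additions for the sketch):

```javascript
const doubledWord = /\b(\w+)\s+\1\b/g;

const sentences = [
  'Paris in the the spring.',
  'Not that that is related.',
  'Why are you laughing? Are my my regular expressions THAT bad??',
];

// Collect the doubled words found in each sentence.
const found = sentences.map(sentence => sentence.match(doubledWord));

console.log(found); // [ [ 'the the' ], [ 'that that' ], [ 'my my' ] ]
```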
Regex101 example here
I believe this regex handles more situations:
/(\b\S+\b)\s+\b\1\b/
A good selection of test strings can be found here: http://callumacrae.github.com/regex-tuesday/challenge1.html
The below expression should work correctly to find any number of duplicated words. The matching can be case insensitive.
String regex = "\\b(\\w+)(\\s+\\1\\b)+";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
input = input.replaceAll(m.group(0), m.group(1));
}
Sample Input : Goodbye goodbye GooDbYe
Sample Output : Goodbye
Explanation:
The regex expression:
\b : Start of a word boundary
\w+ : One or more word characters
(\s+\1\b)+ : One or more whitespace characters followed by a word which matches the previous word and ends at a word boundary. Wrapping the whole thing in + finds more than one repetition.
Grouping :
m.group(0) : Shall contain the matched group in above case Goodbye goodbye GooDbYe
m.group(1) : Shall contain the first word of the matched pattern in above case Goodbye
Replace method shall replace all consecutive matched words with the first instance of the word.
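Try this with the RE below:
\b : start of a word (word boundary)
\w+ : one or more word characters (captured as group 1)
\W+ : one or more non-word characters separating the words
\1 : the same word that was already matched
\b : end of the word
()* : repeats the separator-plus-word group zero or more times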
public static void main(String[] args) {
String regex = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Scanner in = new Scanner(System.in);
int numSentences = Integer.parseInt(in.nextLine());
while (numSentences-- > 0) {
String input = in.nextLine();
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
input = input.replaceAll(m.group(0),m.group(1));
}
// Prints the modified sentence.
System.out.println(input);
}
in.close();
}
Regex to Strip 2+ duplicate words (consecutive/non-consecutive words)
Try this regex that can catch 2 or more duplicate words and only leave behind one single word. And the duplicate words need not even be consecutive.
/\b(\w+)\b(?=.*?\b\1\b)/ig
Here, \b is used for Word Boundary, ?= is used for positive lookahead, and \1 is used for back-referencing.
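A small sketch of this pattern with String.prototype.replace. Because the lookahead only succeeds while the same word still appears later, it is the earlier occurrences that get removed and the last one that stays. The helper name keepLastOccurrences and the whitespace clean-up step are additions for this example:

```javascript
const earlierDuplicate = /\b(\w+)\b(?=.*?\b\1\b)/gi;

// Hypothetical helper: drops every word that occurs again later in the
// sentence, then collapses the leftover whitespace.
function keepLastOccurrences(sentence) {
  return sentence
    .replace(earlierDuplicate, '')
    .replace(/\s+/g, ' ')
    .trim();
}

console.log(keepLastOccurrences('the cat and the dog and the bird'));
// 'cat dog and the bird'
```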
The widely-used PCRE library can handle such situations (you won't achieve the same with POSIX-compliant regex engines, though):
(\b\w+\b)\W+\1
Here is one that catches multiple words multiple times:
(\b\w+\b)(\s+\1)+
No. That is an irregular grammar. There may be engine-/language-specific regular expressions that you can use, but there is no universal regular expression that can do that.
This is the regex I use to remove duplicate phrases in my twitch bot:
(\S+\s*)\1{2,}
(\S+\s*) looks for any string of characters that isn't whitespace, followed by optional whitespace.
\1{2,} then looks for 2 or more further instances of that phrase. So if there are 3 or more identical phrases in a row, it matches.
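A minimal JavaScript sketch of that behaviour. Replacing with $1 is an assumption about how the pattern is used, and collapseSpam is a made-up name:

```javascript
// Hypothetical helper: collapses three or more identical phrases in a
// row into a single one, leaving pairs untouched.
function collapseSpam(message) {
  return message.replace(/(\S+\s*)\1{2,}/g, '$1');
}

console.log(collapseSpam('spam spam spam eggs')); // 'spam eggs'
console.log(collapseSpam('ha ha tea'));           // 'ha ha tea' (only two in a row)
```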
Since some developers are coming to this page in search of a solution which not only eliminates duplicate consecutive non-whitespace substrings, but triplicates and beyond, I'll show the adapted pattern.
Pattern: /(\b\S+)(?:\s+\1\b)+/ (Pattern Demo)
Replace: $1 (replaces the fullstring match with capture group #1)
This pattern greedily matches a "whole" non-whitespace substring, then requires one or more copies of the matched substring which may be delimited by one or more whitespace characters (space, tab, newline, etc).
Specifically:
\b (word boundary) characters are vital to ensure partial words are not matched.
The second parenthetical is a non-capturing group, because this variable width substring does not need to be captured -- only matched/absorbed.
the + (one or more quantifier) on the non-capturing group is more appropriate than * because * will "bother" the regex engine to capture and replace singleton occurrences -- this is wasteful pattern design.
*note if you are dealing with sentences or input strings with punctuation, then the pattern will need to be further refined.
The example in Javascript: The Good Parts can be adapted to do this:
var doubled_words = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1(?:\s|$)/gi;
\b uses \w for word boundaries, where \w is equivalent to [0-9A-Z_a-z]. If you don't mind that limitation, the accepted answer is fine.
This expression (inspired from Mike, above) seems to catch all duplicates, triplicates, etc, including the ones at the end of the string, which most of the others don't:
s.replace(/(^|\s+)(\S+)(($|\s+)\2)+/g, "$1$2")
I know the question asked to match duplicates only, but a triplicate is just 2 duplicates next to each other :)
First, I put (^|\s+) to make sure it starts with a full word, otherwise "child's steak" would go to "child'steak" (the "s"'s would match). Then, it matches all full words ((\b\S+\b)), followed by an end of string ($) or a number of spaces (\s+), the whole repeated more than once.
I tried it like this and it worked well:
var s = "here here here here is ahi-ahi ahi-ahi ahi-ahi joe's joe's joe's joe's joe's the result result result";
console.log( s.replace( /(\b\S+\b)(($|\s+)\1)+/g, "$1"))
--> here is ahi-ahi joe's the result
Try this regular expression it fits for all repeated words cases:
\b(\w+)\s+\1(?:\s+\1)*\b
I think another solution would be to use named capture groups and backreferences like this:
/.* (?<mytoken>\w+)\s+\k<mytoken> .*/
OR
/.*(?<mytoken>\w{3,}).+\k<mytoken>.*/
Kotlin:
val regex = Regex(""".* (?<myToken>\w+)\s+\k<myToken> .*""")
val input = "This is a test test data"
val result = regex.find(input)
println(result!!.groups["myToken"]!!.value)
Java:
var pattern = Pattern.compile(".* (?<myToken>\\w+)\\s+\\k<myToken> .*");
var matcher = pattern.matcher("This is a test test data");
var isFound = matcher.find();
var result = matcher.group("myToken");
System.out.println(result);
JavaScript:
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/;
const input = "This is a test test data";
const result = regex.exec(input);
console.log(result.groups.myToken);
// OR, using matchAll with a global regex
const globalRegex = /.* (?<myToken>\w+)\s+\k<myToken> .*/g;
const results = [...input.matchAll(globalRegex)];
console.log(results[0].groups.myToken);
All the above detect the test as the duplicate word.
Tested with Kotlin 1.7.0-Beta, Java 11, Chrome and Firefox 100.
You can use this pattern:
\b(\w+)(?:\W+\1\b)+
This pattern can be used to match all duplicated word groups in sentences. :)
Here is a sample util function written in Java 17, which replaces all duplications with the first occurrence:
public String removeDuplicates(String input) {
var regex = "\\b(\\w+)(?:\\W+\\1\\b)+";
var pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
var matcher = pattern.matcher(input);
while (matcher.find()) {
input = input.replaceAll(matcher.group(), matcher.group(1));
}
return input;
}
As far as I can see, none of these would match:
London in the
the winter (with the winter on a new line )
Although matching duplicates on the same line is fairly straightforward,
I haven't been able to come up with a solution for the situation in which they
stretch over two lines. ( with Perl )
To find duplicate words that have no leading or trailing non whitespace character(s) other than a word character(s), you can use whitespace boundaries on the left and on the right making use of lookarounds.
The pattern will have a match in:
Paris in the the spring.
Not that that is related.
The pattern will not have a match in:
This is $word word
(?<!\S)(\w+)\s+\1(?!\S)
Explanation
(?<!\S) Negative lookbehind, assert not a non whitespace char to the left of the current location
(\w+) Capture group 1, match 1 or more word characters
\s+ Match 1 or more whitespace characters (note that this can also match a newline)
\1 Backreference to match the same as in group 1
(?!\S) Negative lookahead, assert not a non whitespace char to the right of the current location
See a regex101 demo.
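The behaviour described above can be sketched in JavaScript, assuming an engine with lookbehind support (for example a current Node.js or Chrome). The helper name findIsolatedDouble is hypothetical:

```javascript
const isolatedDouble = /(?<!\S)(\w+)\s+\1(?!\S)/;

// Hypothetical helper: returns the matched duplicate, or null.
function findIsolatedDouble(sentence) {
  const match = isolatedDouble.exec(sentence);
  return match ? match[0] : null;
}

console.log(findIsolatedDouble('Paris in the the spring.'));  // 'the the'
console.log(findIsolatedDouble('Not that that is related.')); // 'that that'
console.log(findIsolatedDouble('This is $word word'));        // null ('$' precedes the first 'word')
```

Because \s+ also matches a newline, the same pattern finds a duplicate that is split across two lines.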
To find 2 or more duplicate words:
(?<!\S)(\w+)(?:\s+\1)+(?!\S)
This part of the pattern (?:\s+\1)+ uses a non capture group to repeat 1 or more times matching 1 or more whitespace characters followed by the backreference to match the same as in group 1.
See a regex101 demo.
Alternatives without using lookarounds
You could also make use of a leading and trailing alternation matching either a whitespace char or assert the start/end of the string.
Then use a capture group 1 for the value that you want to get, and use a second capture group with a backreference \2 to match the repeated word.
Matching 2 duplicate words:
(?:\s|^)((\w+)\s+\2)(?:\s|$)
See a regex101 demo.
Matching 2 or more duplicate words:
(?:\s|^)((\w+)(?:\s+\2)+)(?:\s|$)
See a regex101 demo.
Use this in case you want case-insensitive checking for duplicate words.
(?i)\\b(\\w+)\\s+\\1\\b

Regular expression for Password match in PHP and Javascript

I have built a regular expression for password policy match, but it is not working as expected
/((?=.\d)(?=.[a-z])(?=.[A-Z])(?=.[##\$%!])(?!(.)*\1{2,}).{6,20})/
Password must satisfy below rules
-> must have 1 digit
-> must have 1 upper case letter
-> must have 1 lower case letter
-> must have 1 special character from given list
-> minimum 6 character long
-> maximum 20 character long
-> Not more than 2 identical characters
So it matches
aDm!n1, Adw1n#
but it must not match below
aaaD!n1, teSt#111
I have searched for this regular expression and found "(?!(.)*\1{2,})" is not working properly
I am not getting why it is not working even though it has lookahead negative assertion.
Thanks in advance
You need to provide start and end anchors.
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[##\$%\!])(?!.*(.).*\1.*\1).{6,20}$
DEMO
To match only strings in which no character occurs more than twice, use a negative lookahead like (?!.*(.).*\1.*\1), which asserts that the string we are about to match does not contain any character three or more times.
(?!) Negative lookahead which checks if there isn't
.* Any character zero or more times.
(.) A single character was captured.
.* Any character zero or more times.
\1 Reference to group index 1. That is, it refers to the character which was already captured by group 1.
.* Any character zero or more times.
\1 Back-referencing to the character which was present inside the group index 1.
Code:
> var re = /^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[##\$%\!])(?!.*(.).*\1.*\1).{6,20}$/;
undefined
> re.test('aDm!n1')
true
> re.test('Adw1n#')
true
> re.test('tetSt#11')
false
(?:(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[##\$%!])(?!.*(.).*\1.*\1).{6,20})
You need this. See the demo.
http://regex101.com/r/hQ9xT1/22
Your regex was failing because
(?!(.)*\1{2,}) only looks for consecutive repeated characters, not for any character that occurs three or more times anywhere in the string. So use (?!.*(.).*\1.*\1).
var re = /^(?:(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[##\$%!])(?!.*(.).*\1.*\1).{6,20})$/gm;
var str = 'aaaD!n1\nteSt#111\naDm!n1\nAdw1n#';
var m;
while ((m = re.exec(str)) != null) {
if (m.index === re.lastIndex) {
re.lastIndex++;
}
// View your result using the m-variable.
// eg m[0] etc.
}
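To make the difference between the two checks concrete, here is a small sketch that isolates the inner patterns of the lookaheads discussed above. The variable names are invented for this example, and the test strings are the ones used in this thread:

```javascript
// Same character three or more times in a row.
const tripleRun = /(.)\1{2,}/;
// Same character three or more times anywhere in the string.
const tripleAnywhere = /(.).*\1.*\1/;

const checks = ['teSt#111', 'tetSt#11', 'aDm!n1'].map(password => ({
  password,
  run: tripleRun.test(password),
  anywhere: tripleAnywhere.test(password),
}));

console.log(checks);
// 'teSt#111': run true,  anywhere true  ('111' is consecutive)
// 'tetSt#11': run false, anywhere true  (three separate 't' characters)
// 'aDm!n1'  : run false, anywhere false
```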

Regex split string of numbers at finding of Alpha Characters

OK Regex is one of the most confusing things to me. I'm trying to do this in Javascript. I have a search field that the user will enter a series of characters. Codes are either:
999MC111
or just
999MC
There is ALWAYS 2 Alpha characters. BUT there may be 1-4 characters at the front and sometimes 1-4 characters at the end.
If the code ENDS with the Alpha characters, then I run a certain ajax script. If there are Numbers + 2 letters + numbers....it runs a different ajax script.
My struggle is I know \d is for 2 digits....but it may not always be 2 digits.
So what would my regex code be to split this into an array. or something.
I think correct regex would be (/^([0-9]+)([a-zA-z]+)([0-9]+)$/
But how do i make sure its ONLY 2 alpha characters in middle?
Thanks
You could use the regex /\d$/ to determine if it ends with a digit.
\d matches a digit character, and $ matches the end of the string. The / characters enclose the expression.
Try running this in your javascript console, line by line.
var values = ['999MC111', '999MC', '999XYZ111']; // some test values
// does it end in digits?
!!values[0].match(/\d$/); // evaluates to true
!!values[1].match(/\d$/); // evaluates to false
To specify the exact number of tokens you must use braces {}, so if you know that there are 2 alphabetic tokens you put {2}, and if you know that there could be 0-4 digits you put {0,4}:
^([0-9]{0,4})([a-zA-Z]{2})([0-9]{0,4})$
The above RegEx evaluates as follows:
999MC ---> TRUE
999MC111 --> TRUE
999MAC111 ---> FALSE
MC ---> TRUE
The splitting of the expression into capturing groups is done by means of grouping subexpressions into parentheses
As you can see in the following link:
http://regexr.com?2vfhv
you obtain this:
3 capturing groups:
group 1: ([0-9]{0,4})
group 2: ([a-zA-Z]{2})
group 3: ([0-9]{0,4})
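A sketch of how the three groups could be read in JavaScript. It assumes the letter class is written [a-zA-Z] (the range A-z would also admit a few punctuation characters), and the function name splitCode and the returned property names are invented for illustration:

```javascript
const codePattern = /^([0-9]{0,4})([a-zA-Z]{2})([0-9]{0,4})$/;

// Hypothetical helper: splits a code into its three parts,
// or returns null when the code does not fit the format.
function splitCode(code) {
  const match = codePattern.exec(code);
  if (!match) return null;
  const [, leading, letters, trailing] = match;
  return { leading, letters, trailing };
}

console.log(splitCode('999MC111'));  // { leading: '999', letters: 'MC', trailing: '111' }
console.log(splitCode('999MC'));     // { leading: '999', letters: 'MC', trailing: '' }
console.log(splitCode('999MAC111')); // null
```

An empty trailing string then signals that the code ends with the letters, which is the case that needs the other ajax script.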
The regex /^\d{1,4}[a-zA-Z]{2}\d{0,4}$/ matches a series of 1-4 digits, followed by a series of 2 alpha characters, followed by another series of 0-4 digits.
This regex: /^\d{1,4}[a-zA-Z]{2}$/ matches a series of 1-4 digits, followed only by 2 alpha characters.
OK, so I didn't really care about the middle 2 characters. All that really mattered was the first set of numbers and the last set of numbers (if any).
So essentially I just needed to deal with digits. So I did this:
var lead = '123mc444'; //For example purposes
var regex = /(\d+)/g;
var result = (lead.match(regex));
var memID = result[0]; //First set of numbers is member id
if(result[1] != undefined) {
var leadID = result[1];
}
