Regex capturing group only capturing last occurrence - javascript

Here is my input:
start
#var=somevar1
#var=somevar2
end
I am using this regex:
start(?:\s*\n*(?:#var=(.*)\s*)*)\s*\n*end
It should give the output
somevar1
somevar2
but it gives just somevar2.
Is there any way to get all occurrences of the capturing group?

One option is to use a positive lookbehind with an infinite quantifier to assert start followed by a newline to the left. In JavaScript, [^] matches any character, including a newline.
See the support for lookbehinds.
(?<=^start\n[^]*#var=)\S+(?=[^]*\nend$)
Regex demo
const regex = /(?<=^start\n[^]*#var=)\S+(?=[^]*\nend$)/gm;
const str = `start
#var=somevar1
#var=somevar2
end`;
// With the g flag, match() returns an array of all full matches
console.log(str.match(regex));
If only lines of the form #var=somevar can precede and follow a value, rather than other content:
(?<=^start\n\s*(?:#var=\S+\s+)*#var=)\S+(?=(?:\s*#var=\S+)*\s*\nend$)
See another regex demo
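A repeated capturing group keeps only the value of its last iteration, which is why the original pattern returns just somevar2. Where lookbehind is not available, a two-step approach is another option. The sketch below is only one way to do it; the helper name extractVars and the block pattern are assumptions made for illustration and are not part of the answer above.

```javascript
// Hypothetical helper: collect every #var= value between start and end.
function extractVars(text) {
  // Step 1: isolate the block between the start line and the end line.
  const block = text.match(/^start\n([\s\S]*?)\nend$/m);
  if (!block) return [];
  // Step 2: matchAll (with the g flag) visits every #var= line in that block.
  return [...block[1].matchAll(/^#var=(.*)$/gm)].map((m) => m[1]);
}

const sample = `start
#var=somevar1
#var=somevar2
end`;
console.log(extractVars(sample)); // [ 'somevar1', 'somevar2' ]
```

Splitting the work this way keeps each pattern simple, at the cost of scanning the block twice.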

Related

Validate text with javascript RegEX

I'm trying to validate text with JavaScript but can't find out why it's not working.
I have been using https://regex101.com/ for testing, where it works, but in my script it fails:
var check = "test"
var pattern = new RegExp('^(?!\.)[a-zA-Z0-9._-]+$(?<!\.)','gmi');
if (!pattern.test(check)) validate_check = false;else validate_check = true;
What I'm looking for is: first and last char not a dot, and the string may only contain [a-zA-Z0-9._-].
But the above check always fails, even on the word test.

+$(?<!\.) is invalid in your RegEx
$ will match the end of the text or line (with the m flag)
Negative lookbehind → (?<!Y)X will match X, but only if Y is not before it

What about a simpler RegEx?
var checks = ["test", "1-t.e_s.t0", ".test", "test.", ".test."];
checks.forEach(check => {
  // Note: [^.] accepts any non-dot character at both ends, and the pattern needs at least 3 characters
  var pattern = new RegExp('^[^.][a-zA-Z0-9\._-]+[^.]$','gmi');
  console.log(check, pattern.test(check));
});
Your code should look like this:
var check = "test";
var pattern = new RegExp('^[^.][a-zA-Z0-9\._-]+[^.]$','gmi');
var validate_check = pattern.test(check);
console.log(validate_check);

A few notes about the pattern:
You are using the RegExp constructor with a string, where you have to double escape the backslash. With a single backslash, the pattern becomes ^(?!.)[a-zA-Z0-9._-]+$(?<!.) and the first negative lookahead makes the pattern fail whenever there is a character other than a newline to the right. That is why it does not match test.
If you use the /i flag for a case insensitive match, you can shorten [A-Za-z] to just one of the ranges like [a-z], or use \w to match a word character (letters, digits and underscore), all of which your character class already allows.
The negative lookbehind (?<!\.) is not invalid in your pattern, but it is not supported in every environment.
For your requirements, you don't have to use lookarounds. If you also want to allow a single char, you can use:
^[\w-]+(?:[\w.-]*[\w-])?$
^ Start of string
[\w-]+ Match 1+ occurrences of a word character or -
(?: Non capture group
[\w.-]*[\w-] Match optional word chars, dots or hyphens, followed by a single word char or hyphen
)? Close non capture group and make it optional
$ End of string
Regex demo
const regex = /^[\w-]+(?:[\w.-]*[\w-])?$/;
["test", "abc....abc", "a", ".test", "test."]
.forEach((s) =>
console.log(`${s} --> ${regex.test(s)}`)
);
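The double-escaping note above can be checked directly by looking at the source property of the compiled pattern. This is a small sketch; the variable names single and doubled are made up for the comparison.

```javascript
// In a string literal, '\.' collapses to '.', so the lookarounds lose their backslash.
const single = new RegExp('^(?!\.)[a-zA-Z0-9._-]+$(?<!\.)');
// Doubling the backslash keeps the escaped dot in the compiled pattern.
const doubled = new RegExp('^(?!\\.)[a-zA-Z0-9._-]+$(?<!\\.)');

console.log(single.source);  // ^(?!.)[a-zA-Z0-9._-]+$(?<!.)
console.log(doubled.source); // ^(?!\.)[a-zA-Z0-9._-]+$(?<!\.)
console.log(single.test('test'), doubled.test('test'));    // false true
console.log(doubled.test('.test'), doubled.test('test.')); // false false
```

Using a regex literal such as /^(?!\.)[a-zA-Z0-9._-]+$(?<!\.)/ avoids the string escaping step altogether.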

Capturing the character before the regex

I have a quick question about a regex that I wrote in JavaScript. It is the following (?<=,)(.*)(?=:) and it captures everything between , and :. I want it, however, to capture the comma itself too, as in.
So,<< this is what my regex captures at the moment>>: end would become
So<<, this is what my regex captures at the moment>>: end.
I tried using a . before the , in the regex but it doesn't seem to be working.

Use a simple capturing group. It is shorter than your current regex, and group 1 holds the comma plus the text up to the colon:
var regex = /(,.*?):/;
var string = "So,<< this is what my regex captures at the moment>>: end";
console.log(string.match(regex)[1]);
Explanation:
() - denotes a capturing group
, - match a comma
.*? - match any amount of any characters, as few as possible
: - match a colon

Assuming the double arrows indicate the start and the end of what your current pattern matches, you could match the comma and then 1+ times any character except a colon using a negated character class:
,[^:]+
If the colon at the end should be there, you could use a capturing group:
(,[^:]+):
Regex demo
You can omit the positive lookahead (?=:) by just matching the colon because you are already using a capturing group to get the match.
const regex = /(,[^:]+):/;
const str = `So,<< this is what my regex captures at the moment>>: end`;
let res = str.match(regex);
console.log(res[1]);

As you said :
So,<< this is what my regex captures at the moment>>: end would become
So<<, this is what my regex captures at the moment>>: end.
you could use replace like this :
var str = `So,<< this is what my regex captures at the moment>>: end`;
var replace = str.replace(/(.*?)(,)(<<)(.*)/,"$1$3$2$4");
console.log(replace);

How to get last occurrence with regex javascript?

Could you help me extract "women-watches" from the string:
https://www.aliexpress.com/category/200214036/women-watches.html?spm=2114.search0103.0.0.160b628cMC1npI&site=glo&SortType=total_tranpro_desc&g=y&needQuery=n&shipFromCountry=cn&tag=
I tried
\/(?:.(?!\/.+\.))+$
But I don't know how to do it right.

One option could be to use a capturing group to match a word character or a hyphen. Your match will be in the first capturing group.
^.*?\/([\w-]+)\.html
That will match:
^ Start of the string
.*? Match any character except a newline non greedy
\/ Match /
([\w-]+) Capturing group to match 1+ times a word character or a hyphen
\.html Match .html
Regex demo
const regex = /^.*?\/([\w-]+)\.html/;
const str = `https://www.aliexpress.com/category/200214036/women-watches.html?spm=2114.search0103.0.0.160b628cMC1npI&site=glo&SortType=total_tranpro_desc&g=y&needQuery=n&shipFromCountry=cn&tag=`;
console.log(str.match(regex)[1]);
Another option to match from the last occurrence of the forward slash could be to match a forward slash and use a negative lookahead to check that no more forward slashes follow. Then use a capturing group to match any character except a dot:
\/(?!.*\/)([^.]+)\.html
Regex demo
const regex = /\/(?!.*\/)([^.]+)\.html/;
const str = `https://www.aliexpress.com/category/200214036/women-watches.html?spm=2114.search0103.0.0.160b628cMC1npI&site=glo&SortType=total_tranpro_desc&g=y&needQuery=n&shipFromCountry=cn&tag=`;
console.log(str.match(regex)[1]);
Without using a regex, you might use the DOM (in a browser) and split:
const str = `https://www.aliexpress.com/category/200214036/women-watches.html?spm=2114.search0103.0.0.160b628cMC1npI&site=glo&SortType=total_tranpro_desc&g=y&needQuery=n&shipFromCountry=cn&tag=`;
let elm = document.createElement("a");
elm.href = str;
let part = elm.pathname.split('/').pop().split('.')[0];
console.log(part);
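Outside a browser there is no document object, so a variation on the DOM approach is to parse the string with the standard URL class. This is a minimal sketch; the helper name lastSegment is hypothetical and the sample URL is a shortened form of the one in the question.

```javascript
// Hypothetical helper: return the last path segment of a URL without ".html".
function lastSegment(url) {
  // pathname excludes the query string, so no lookahead is needed to skip it.
  const { pathname } = new URL(url);
  return pathname.split('/').pop().replace(/\.html$/, '');
}

console.log(
  lastSegment('https://www.aliexpress.com/category/200214036/women-watches.html?site=glo&g=y')
); // women-watches
```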

regex capturing group/alternative combination not working in quotes

I am trying to use this expression:
var reg = "/(jan|feb|mar)[A-z]*\[0-9]/"
to capture at least the first three letters of the month (or more letters) plus a digit. This does not work, however. When I remove the parentheses, it works, but then the [A-z]*[0-9] bit only applies to march. Please help, thanks.

Your regex is incorrect, also the regex should not be a string.
Use regex /(jan|feb|mar)[a-z]*[0-9]/i
Regex explanation: https://regex101.com/r/9Qv2dy/2
Snippet:
var reg = /(jan|feb|mar)[a-z]*[0-9]/i;
console.log('January1'.match(reg));

Your code contains several issues.
The /.../ regex literal should not be put inside quotes.
[A-z] matches more than just letters, you need [A-Za-z]
A \[ pattern matches a literal [ char. To match a digit, you need [0-9] or \d. To match 1 or more digits: [0-9]+ or \d+.
Use
var reg = /(?:jan|feb|mar)[a-z]*[0-9]/i;
See JS demo:
var reg = /(?:jan|feb|mar)[a-z]*[0-9]/i;
console.log("Date: January1".match(reg));
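The point about [A-z] can be seen with a quick check: the range runs from code point 65 (A) to 122 (z), and the six characters between Z and a fall inside it. The names loose and strict below are only illustrative.

```javascript
// The range A-z also covers [ \ ] ^ _ and `, which sit between Z and a.
const loose = /^[A-z]$/;
const strict = /^[A-Za-z]$/;

for (const ch of ['[', '\\', ']', '^', '_', '`']) {
  // Each line prints the character, then true (loose) and false (strict).
  console.log(JSON.stringify(ch), loose.test(ch), strict.test(ch));
}
```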

Finding duplicates with regular expressions, how does this actually work? [duplicate]

I'm a regular expression newbie and I can't quite figure out how to write a single regular expression that would "match" any duplicate consecutive words such as:
Paris in the the spring.
Not that that is related.
Why are you laughing? Are my my regular expressions THAT bad??
Is there a single regular expression that will match ALL of the bold strings above?

Try this regular expression:
\b(\w+)\s+\1\b
Here \b is a word boundary and \1 references the captured match of the first group.
Regex101 example here
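As a quick check of this pattern against the sentences from the question, here is a small JavaScript sketch. The helper name firstDuplicate is hypothetical.

```javascript
// Hypothetical helper: return the first consecutive duplicate found, or null.
function firstDuplicate(sentence) {
  const m = sentence.match(/\b(\w+)\s+\1\b/);
  return m ? m[0] : null;
}

console.log(firstDuplicate('Paris in the the spring.'));   // the the
console.log(firstDuplicate('Not that that is related.'));  // that that
console.log(firstDuplicate('Why are you laughing? Are my my regular expressions THAT bad??')); // my my
console.log(firstDuplicate('No repeats here.'));           // null
```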

I believe this regex handles more situations:
/(\b\S+\b)\s+\b\1\b/
A good selection of test strings can be found here: http://callumacrae.github.com/regex-tuesday/challenge1.html

The expression below should find any number of consecutive duplicated words. The matching can be case insensitive.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String input = "Goodbye goodbye GooDbYe";
String regex = "\\b(\\w+)(\\s+\\1\\b)+";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
    // Replace the whole run of repeated words with its first word
    input = input.replaceAll(m.group(0), m.group(1));
}
Sample Input : Goodbye goodbye GooDbYe
Sample Output : Goodbye
Explanation:
The regex expression:
\b : Start of a word boundary
\w+ : One or more word characters
(\s+\1\b)+ : One or more whitespace characters followed by a word that matches the previous word and ends at a word boundary. Wrapping the whole thing in + finds more than one repetition.
Grouping :
m.group(0) : Contains the whole match, in the above case Goodbye goodbye GooDbYe
m.group(1) : Contains the first word of the matched pattern, in the above case Goodbye
The replaceAll method replaces all consecutive matched words with the first instance of the word.

Try this with the RE below:
\b start of word (word boundary)
\w+ one or more word characters
\W+ one or more non-word characters between the words
\1 same word matched already
\b end of word
()* repeating again, zero or more times
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public static void main(String[] args) {
    // Regex matching a word followed by any number of repetitions of itself
    String regex = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";
    Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    Scanner in = new Scanner(System.in);
    // The first input line holds the number of sentences that follow
    int numSentences = Integer.parseInt(in.nextLine());
    while (numSentences-- > 0) {
        String input = in.nextLine();
        Matcher m = p.matcher(input);
        // Check for subsequences of input that match the compiled pattern
        while (m.find()) {
            // Quote the match so it is treated as literal text, not as a regex
            input = input.replaceAll(Pattern.quote(m.group(0)), m.group(1));
        }
        // Prints the modified sentence.
        System.out.println(input);
    }
    in.close();
}

Regex to Strip 2+ duplicate words (consecutive/non-consecutive words)
Try this regex that can catch 2 or more duplicate words and only leave behind one single word. And the duplicate words need not even be consecutive.
/\b(\w+)\b(?=.*?\b\1\b)/ig
Here, \b is used for Word Boundary, ?= is used for positive lookahead, and \1 is used for back-referencing.
Example
Source
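To make the effect of the lookahead concrete: it matches every word that occurs again later on the same line, so removing the matches leaves the last occurrence of each word rather than the first. The sketch below assumes that removing the matches and then collapsing the leftover whitespace is acceptable; the helper name stripDuplicates is hypothetical.

```javascript
// Hypothetical helper: drop every word that appears again later in the text.
function stripDuplicates(text) {
  return text
    .replace(/\b(\w+)\b(?=.*?\b\1\b)/gi, '') // remove earlier occurrences
    .replace(/\s+/g, ' ')                     // collapse the gaps left behind
    .trim();
}

console.log(stripDuplicates('hello world hello again world')); // hello again world
```

Since . does not match a newline by default, duplicates on different lines are not detected by this pattern.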

The widely-used PCRE library can handle such situations (you won't achieve the same with POSIX-compliant regex engines, though):
(\b\w+\b)\W+\1

Here is one that catches multiple words multiple times:
(\b\w+\b)(\s+\1)+
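One thing to watch with this pattern is that nothing anchors the end of the repeated part, so the backreference can match the beginning of a longer word. The comparison below is a sketch; the variant with a trailing \b is an adaptation for illustration, not part of the answer above.

```javascript
const withoutBoundary = /(\b\w+\b)(\s+\1)+/;
// Adapted variant: \b after the backreference rejects partial words.
const withBoundary = /(\b\w+\b)(\s+\1\b)+/;

console.log(withoutBoundary.test('the theme'));         // true ("the the" inside "the theme")
console.log(withBoundary.test('the theme'));            // false
console.log(withBoundary.exec('it is is is fine')[0]);  // is is is
```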

No. That is an irregular grammar. There may be engine-/language-specific regular expressions that you can use, but there is no universal regular expression that can do that.

This is the regex I use to remove duplicate phrases in my twitch bot:
(\S+\s*)\1{2,}
(\S+\s*) looks for any string of characters that isn't whitespace, followed by optional whitespace.
\1{2,} then requires at least 2 more consecutive copies of that phrase. If there are 3 or more identical phrases in a row, it matches.
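A short JavaScript sketch of how this pattern behaves when used with replace. The sample messages are made up; replacing with $1 is an assumption about how the bot uses the match.

```javascript
const spam = /(\S+\s*)\1{2,}/g;

// Three copies in a row are collapsed to one.
console.log('spam spam spam ok'.replace(spam, '$1')); // spam ok
// Two copies are below the threshold and stay as they are.
console.log('spam spam ok'.replace(spam, '$1'));      // spam spam ok
// The captured phrase includes its trailing whitespace, so a run that ends
// the string without trailing whitespace is left untouched.
console.log('hi spam spam spam'.replace(spam, '$1')); // hi spam spam spam
```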

Since some developers are coming to this page in search of a solution which not only eliminates duplicate consecutive non-whitespace substrings, but triplicates and beyond, I'll show the adapted pattern.
Pattern: /(\b\S+)(?:\s+\1\b)+/ (Pattern Demo)
Replace: $1 (replaces the fullstring match with capture group #1)
This pattern greedily matches a "whole" non-whitespace substring, then requires one or more copies of the matched substring which may be delimited by one or more whitespace characters (space, tab, newline, etc).
Specifically:
\b (word boundary) characters are vital to ensure partial words are not matched.
The second parenthetical is a non-capturing group, because this variable width substring does not need to be captured -- only matched/absorbed.
The + (one or more quantifier) on the non-capturing group is more appropriate than * because * will "bother" the regex engine to capture and replace singleton occurrences, which is wasteful pattern design.
Note: if you are dealing with sentences or input strings with punctuation, then the pattern will need to be further refined.
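The pattern and replacement above can be tried in JavaScript as follows. The helper name collapseRepeats is hypothetical, and the g flag is added so that every run in the string is replaced.

```javascript
// Hypothetical helper: collapse each run of repeated substrings to one copy.
function collapseRepeats(text) {
  return text.replace(/(\b\S+)(?:\s+\1\b)+/g, '$1');
}

console.log(collapseRepeats('one one one two two three')); // one two three
console.log(collapseRepeats('this is is a test'));         // this is a test
```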

The example in Javascript: The Good Parts can be adapted to do this:
var doubled_words = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1(?:\s|$)/gi;
\b uses \w for word boundaries, where \w is equivalent to [0-9A-Z_a-z]. If you don't mind that limitation, the accepted answer is fine.

This expression (inspired from Mike, above) seems to catch all duplicates, triplicates, etc, including the ones at the end of the string, which most of the others don't:
s.replace(/(^|\s+)(\S+)(($|\s+)\2)+/g, "$1$2")
I know the question asked to match duplicates only, but a triplicate is just 2 duplicates next to each other :)
First, I put (^|\s+) to make sure it starts with a full word, otherwise "child's steak" would go to "child'steak" (the "s"'s would match). Then, it matches all full words ((\b\S+\b)), followed by an end of string ($) or a number of spaces (\s+), the whole repeated more than once.
I tried it like this and it worked well:
var s = "here here here here is ahi-ahi ahi-ahi ahi-ahi joe's joe's joe's joe's joe's the result result result";
console.log(s.replace(/(\b\S+\b)(($|\s+)\1)+/g, "$1"));
--> here is ahi-ahi joe's the result

Try this regular expression it fits for all repeated words cases:
\b(\w+)\s+\1(?:\s+\1)*\b

I think another solution would be to use named capture groups and backreferences like this:
.* (?<mytoken>\w+)\s+\k<mytoken> .*
OR
.*(?<mytoken>\w{3,}).+\k<mytoken>.*
Kotlin:
val regex = Regex(""".* (?<myToken>\w+)\s+\k<myToken> .*""")
val input = "This is a test test data"
val result = regex.find(input)
println(result!!.groups["myToken"]!!.value)
Java:
var pattern = Pattern.compile(".* (?<myToken>\\w+)\\s+\\k<myToken> .*");
var matcher = pattern.matcher("This is a test test data");
var isFound = matcher.find();
var result = matcher.group("myToken");
System.out.println(result);
JavaScript:
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/;
const input = "This is a test test data";
const result = regex.exec(input);
console.log(result.groups.myToken);
// OR, with matchAll (which requires the g flag)
const regexAll = /.* (?<myToken>\w+)\s+\k<myToken> .*/g;
const results = [...input.matchAll(regexAll)];
console.log(results[0].groups.myToken);
All the above detect the test as the duplicate word.
Tested with Kotlin 1.7.0-Beta, Java 11, Chrome and Firefox 100.

You can use this pattern:
\b(\w+)(?:\W+\1\b)+
This pattern can be used to match all duplicated word groups in sentences. :)
Here is a sample utility function written in Java 17, which replaces all duplications with the first occurrence:
public String removeDuplicates(String input) {
var regex = "\\b(\\w+)(?:\\W+\\1\\b)+";
var pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
var matcher = pattern.matcher(input);
while (matcher.find()) {
input = input.replaceAll(matcher.group(), matcher.group(1));
}
return input;
}

As far as I can see, none of these would match:
London in the
the winter (with "the winter" on a new line)
Although matching duplicates on the same line is fairly straightforward, I haven't been able to come up with a solution (with Perl) for the situation in which they stretch over two lines.

To find duplicate words that are not directly preceded or followed by a non-whitespace character other than a word character, you can use whitespace boundaries on the left and on the right by means of lookarounds.
The pattern will have a match in:
Paris in the the spring.
Not that that is related.
The pattern will not have a match in:
This is $word word
(?<!\S)(\w+)\s+\1(?!\S)
Explanation
(?<!\S) Negative lookbehind, assert not a non whitespace char to the left of the current location
(\w+) Capture group 1, match 1 or more word characters
\s+ Match 1 or more whitespace characters (note that this can also match a newline)
\1 Backreference to match the same as in group 1
(?!\S) Negative lookahead, assert not a non whitespace char to the right of the current location
See a regex101 demo.
To find 2 or more duplicate words:
(?<!\S)(\w+)(?:\s+\1)+(?!\S)
This part of the pattern (?:\s+\1)+ uses a non capture group to repeat 1 or more times matching 1 or more whitespace characters followed by the backreference to match the same as in group 1.
See a regex101 demo.
Alternatives without using lookarounds
You could also make use of a leading and trailing alternation matching either a whitespace char or assert the start/end of the string.
Then use a capture group 1 for the value that you want to get, and use a second capture group with a backreference \2 to match the repeated word.
Matching 2 duplicate words:
(?:\s|^)((\w+)\s+\2)(?:\s|$)
See a regex101 demo.
Matching 2 or more duplicate words:
(?:\s|^)((\w+)(?:\s+\2)+)(?:\s|$)
See a regex101 demo.
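The whitespace-boundary pattern can be checked against the examples above with a few lines of JavaScript. The helper name findDup is hypothetical, and the last sample adds a line break to show that \s+ also spans a newline when the text is held in a single string.

```javascript
const wsBounded = /(?<!\S)(\w+)\s+\1(?!\S)/;

// Hypothetical helper: return the matched duplicate, or null if there is none.
function findDup(s) {
  const m = s.match(wsBounded);
  return m ? m[0] : null;
}

console.log(findDup('Not that that is related.'));                  // that that
console.log(findDup('This is $word word'));                         // null
console.log(JSON.stringify(findDup('London in the\nthe winter')));  // "the\nthe"
```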

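Use this in case you want case-insensitive checking for duplicate words. The pattern is written for a string literal (hence the doubled backslashes), for example in Java, and uses the inline (?i) flag: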
(?i)\\b(\\w+)\\s+\\1\\b
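In JavaScript, which is what the rest of this page mostly uses, the usual way to get the same effect is the i flag after the closing slash of a regex literal rather than a leading (?i). A minimal sketch, with a made-up variable name:

```javascript
// The i flag also makes the backreference \1 compare case-insensitively.
const dupIgnoreCase = /\b(\w+)\s+\1\b/i;

console.log(dupIgnoreCase.test('Goodbye goodbye'));     // true
console.log(/\b(\w+)\s+\1\b/.test('Goodbye goodbye'));  // false
```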

Develop Reference

JavaScript is the programming language of the Web.
