Regex to match all words except those in parentheses - javascript

I'm using the following regex to match all words:
mystr.replace(/([^\W_]+[^\s-]*) */g, function (match, p1, index, title) {...});
Note that words can contain special characters like German Umlauts.
How can I match all words excluding those inside parentheses?
If I have the following string:
here wäre c'è (don't match this one) match this
I would like to get the following output:
here
wäre
c'è
match
this
The trailing spaces don't really matter.
Is there an easy way to achieve this with regex in javascript?
EDIT:
I cannot remove the text in parentheses, as the final string "mystr" should also contain this text, whereas string operations will be performed on text that matches. The final string contained in "mystr" could look like this:
Here Wäre C'è (don't match this one) Match This

Try this:
var str = "here wäre c'è (don't match this one) match this";
str.replace(/\([^\)]*\)/g, '') // remove text inside parens (& parens)
.match(/(\S+)/g); // match remaining text
// ["here", "wäre", "c'è", "match", "this"]

Thomas, resurrecting this question because it had a simple solution that wasn't mentioned and that doesn't require replacing then matching (one step instead of two steps). (Found your question while doing some research for a general question about how to exclude patterns in regex.)
Here's our simple regex (see it at work on regex101, looking at the Group captures in the bottom right panel):
\(.*?\)|([^\W_]+[^\s-]*)
The left side of the alternation matches complete (parenthesized phrases). We will ignore these matches. The right side matches and captures words to Group 1, and we know they are the right words because they were not matched by the expression on the left.
This program shows how to use the regex (see the matches in the online demo):
<script>
var subject = 'here wäre c\'è (don\'t match this one) match this';
var regex = /\(.*?\)|([^\W_]+[^\s-]*)/g;
var group1Caps = [];
var match = regex.exec(subject);
// put Group 1 captures in an array
while (match != null) {
if( match[1] != null ) group1Caps.push(match[1]);
match = regex.exec(subject);
}
document.write("<br>*** Matches ***<br>");
// print one capture per line (an indexed loop avoids the implicit global `key`)
for (var i = 0; i < group1Caps.length; i++) document.write(group1Caps[i],"<br>");
</script>
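Since the goal in the question is to transform the matched words while leaving the parenthesized text in place, the same alternation can also be passed straight to replace. This is a minimal sketch, assuming the transformation is "capitalize the first letter" (which is what the expected output in the question suggests); capitalizeOutsideParens is a hypothetical helper name, not something from the answers above.

```javascript
// Hypothetical helper: capitalize every word that is not inside parentheses.
function capitalizeOutsideParens(text) {
  var regex = /\(.*?\)|([^\W_]+[^\s-]*)/g;
  return text.replace(regex, function (match, word) {
    // `word` (Group 1) is empty when the left branch matched a
    // parenthesized phrase; return that phrase unchanged.
    if (!word) return match;
    return word.charAt(0).toUpperCase() + word.slice(1);
  });
}

var result = capitalizeOutsideParens("here wäre c'è (don't match this one) match this");
console.log(result); // Here Wäre C'è (don't match this one) Match This
```

Note that [^\W_] is ASCII-only in JavaScript, so a word that starts with a non-ASCII letter (for example über) is only matched from its first ASCII character onward.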
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...

Related

Finding duplicates with regular expressions, how does this actually work? [duplicate]

I'm a regular expression newbie and I can't quite figure out how to write a single regular expression that would "match" any duplicate consecutive words such as:
Paris in the the spring.
Not that that is related.
Why are you laughing? Are my my regular expressions THAT bad??
Is there a single regular expression that will match ALL of the bold strings above?
Try this regular expression:
\b(\w+)\s+\1\b
Here \b is a word boundary and \1 references the captured match of the first group.
Regex101 example here
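To see the pattern at work on the sentences from the question, here is a small JavaScript sketch. The helper name findDuplicateWords is made up for illustration; the regex is the one from this answer with the g flag added so that every duplicate is reported.

```javascript
// Global flag so that every duplicate in the text is reported.
const duplicateWord = /\b(\w+)\s+\1\b/g;

// Hypothetical helper: return the words that are immediately repeated.
function findDuplicateWords(text) {
  return Array.from(text.matchAll(duplicateWord), (m) => m[1]);
}

const samples = [
  "Paris in the the spring.",
  "Not that that is related.",
  "Why are you laughing? Are my my regular expressions THAT bad??",
];
for (const s of samples) {
  console.log(findDuplicateWords(s)); // ["the"], then ["that"], then ["my"]
}
```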
I believe this regex handles more situations:
/(\b\S+\b)\s+\b\1\b/
A good selection of test strings can be found here: http://callumacrae.github.com/regex-tuesday/challenge1.html
The below expression should work correctly to find any number of duplicated words. The matching can be case insensitive.
String regex = "\\b(\\w+)(\\s+\\1\\b)+";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
input = input.replaceAll(m.group(0), m.group(1));
}
Sample Input : Goodbye goodbye GooDbYe
Sample Output : Goodbye
Explanation:
The regular expression:
\b : A word boundary
(\w+) : One or more word characters, captured as group 1
(\s+\1\b)+ : One or more whitespace characters followed by the word captured in group 1, ending at a word boundary. The trailing + lets the group repeat, so more than one repetition is matched.
Grouping:
m.group(0) : The whole match, in the above case Goodbye goodbye GooDbYe
m.group(1) : The first word of the match, in the above case Goodbye
The replace call substitutes each run of consecutive repeated words with the first instance of the word.
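Try this with the regex below:
\b word boundary at the start of the word
\w+ one or more word characters (captured as group 1)
\W+ one or more non-word characters separating the repeats
\1 the same word that was already matched
\b word boundary at the end of the word
()* the repetition may occur zero or more times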
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name is arbitrary; it is only here to make the snippet compile.
public class DuplicateWords {
    public static void main(String[] args) {
        String regex = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        Scanner in = new Scanner(System.in);
        // The first input line holds the number of sentences that follow.
        int numSentences = Integer.parseInt(in.nextLine());
        while (numSentences-- > 0) {
            String input = in.nextLine();
            Matcher m = p.matcher(input);
            // Check for subsequences of input that match the compiled pattern
            while (m.find()) {
                // String.replace is literal, so punctuation in the match is not parsed as regex syntax
                input = input.replace(m.group(0), m.group(1));
            }
            // Prints the modified sentence.
            System.out.println(input);
        }
        in.close();
    }
}
Regex to Strip 2+ duplicate words (consecutive/non-consecutive words)
Try this regex that can catch 2 or more duplicate words and only leave behind one single word. And the duplicate words need not even be consecutive.
/\b(\w+)\b(?=.*?\b\1\b)/ig
Here, \b is used for Word Boundary, ?= is used for positive lookahead, and \1 is used for back-referencing.
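A short JavaScript sketch of what this pattern does in practice. Because the lookahead asks whether the word occurs again later, the matches are the earlier occurrences, so removing them keeps the last occurrence of each word. The helper keepLastOccurrences and the whitespace clean-up step are additions for illustration, not part of the answer.

```javascript
const repeatedLater = /\b(\w+)\b(?=.*?\b\1\b)/ig;

// Hypothetical helper: drop earlier occurrences, then tidy the leftover spaces.
function keepLastOccurrences(text) {
  return text.replace(repeatedLater, "").replace(/\s+/g, " ").trim();
}

console.log("one two one three two".match(repeatedLater)); // ["one", "two"]
console.log(keepLastOccurrences("one two one three two")); // one three two
```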
The widely-used PCRE library can handle such situations (you won't achieve the same with POSIX-compliant regex engines, though):
(\b\w+\b)\W+\1
Here is one that catches multiple words multiple times:
(\b\w+\b)(\s+\1)+
No. That is an irregular grammar. There may be engine-/language-specific regular expressions that you can use, but there is no universal regular expression that can do that.
This is the regex I use to remove duplicate phrases in my twitch bot:
(\S+\s*)\1{2,}
(\S+\s*) captures any run of non-whitespace characters, followed by optional whitespace.
\1{2,} then requires at least two further repetitions of that phrase, so the pattern matches once a phrase appears three or more times in a row.
Since some developers are coming to this page in search of a solution which not only eliminates duplicate consecutive non-whitespace substrings, but triplicates and beyond, I'll show the adapted pattern.
Pattern: /(\b\S+)(?:\s+\1\b)+/ (Pattern Demo)
Replace: $1 (replaces the fullstring match with capture group #1)
This pattern greedily matches a "whole" non-whitespace substring, then requires one or more copies of the matched substring which may be delimited by one or more whitespace characters (space, tab, newline, etc).
Specifically:
\b (word boundary) characters are vital to ensure partial words are not matched.
The second parenthetical is a non-capturing group, because this variable width substring does not need to be captured -- only matched/absorbed.
The + (one or more quantifier) on the non-capturing group is more appropriate than * because * will "bother" the regex engine to capture and replace singleton occurrences -- this is wasteful pattern design.
Note: if you are dealing with sentences or input strings with punctuation, then the pattern will need to be further refined.
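Applied in JavaScript, the pattern and replacement above could look like the following sketch (collapseRepeats is a hypothetical helper name). The second call illustrates the punctuation caveat: a repeated token that ends in a period is left alone, because the trailing \b does not hold after a non-word character at the end of the string.

```javascript
const repeatedRun = /(\b\S+)(?:\s+\1\b)+/g;

// Hypothetical helper: replace each run of repeats with its first copy.
function collapseRepeats(text) {
  return text.replace(repeatedRun, "$1");
}

console.log(collapseRepeats("the the the quick brown fox fox")); // the quick brown fox
console.log(collapseRepeats("end. end."));                       // end. end.
```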
The example in Javascript: The Good Parts can be adapted to do this:
var doubled_words = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1(?:\s|$)/gi;
\b uses \w for word boundaries, where \w is equivalent to [0-9A-Z_a-z]. If you don't mind that limitation, the accepted answer is fine.
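The ASCII limitation is easy to reproduce. The sketch below compares the accepted pattern with a Unicode-aware variant; the variant is an assumption of mine rather than something from the book or the answers, and it relies on Unicode property escapes and lookbehind, both of which need an ES2018-capable engine.

```javascript
// ASCII-only version from the accepted answer.
const asciiDoubled = /\b(\w+)\s+\1\b/i;

// Unicode-aware sketch: letters, digits and underscore count as word characters.
const unicodeDoubled = /(?<![\p{L}\p{N}_])([\p{L}\p{N}_]+)\s+\1(?![\p{L}\p{N}_])/iu;

console.log(asciiDoubled.test("wäre wäre gut"));      // false
console.log(unicodeDoubled.exec("wäre wäre gut")[1]); // wäre
```

The first test fails because ä is not a \w character, so \b(\w+) can only ever capture the fragments w and re.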
This expression (inspired from Mike, above) seems to catch all duplicates, triplicates, etc, including the ones at the end of the string, which most of the others don't:
str.replace(/(^|\s+)(\S+)(($|\s+)\2)+/g, "$1$2")
I know the question asked to match duplicates only, but a triplicate is just 2 duplicates next to each other :)
First, I put (^|\s+) to make sure it starts with a full word, otherwise "child's steak" would go to "child'steak" (the "s"'s would match). Then, it matches all full words ((\b\S+\b)), followed by an end of string ($) or a number of spaces (\s+), the whole repeated more than once.
I tried it like this and it worked well:
var s = "here here here here is ahi-ahi ahi-ahi ahi-ahi joe's joe's joe's joe's joe's the result result result";
console.log( s.replace( /(\b\S+\b)(($|\s+)\1)+/g, "$1"))
--> here is ahi-ahi joe's the result
Try this regular expression it fits for all repeated words cases:
\b(\w+)\s+\1(?:\s+\1)*\b
I think another solution would be to use named capture groups and backreferences like this:
/.* (?<mytoken>\w+)\s+\k<mytoken> .*/
OR
/.*(?<mytoken>\w{3,}).+\k<mytoken>.*/
Kotlin:
val regex = Regex(""".* (?<myToken>\w+)\s+\k<myToken> .*""")
val input = "This is a test test data"
val result = regex.find(input)
println(result!!.groups["myToken"]!!.value)
Java:
var pattern = Pattern.compile(".* (?<myToken>\\w+)\\s+\\k<myToken> .*");
var matcher = pattern.matcher("This is a test test data");
var isFound = matcher.find();
var result = matcher.group("myToken");
System.out.println(result);
JavaScript:
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/;
const input = "This is a test test data";
const result = regex.exec(input);
console.log(result.groups.myToken);
// OR (renamed variables so that both variants can live in one scope)
const regexAll = /.* (?<myToken>\w+)\s+\k<myToken> .*/g;
const inputAll = "This is a test test data";
const resultAll = [...inputAll.matchAll(regexAll)];
console.log(resultAll[0].groups.myToken);
All the above detect the test as the duplicate word.
Tested with Kotlin 1.7.0-Beta, Java 11, Chrome and Firefox 100.
You can use this pattern:
\b(\w+)(?:\W+\1\b)+
This pattern can be used to match all duplicated word groups in sentences. :)
Here is a sample util function written in java 17, which replaces all duplications with the first occurrence:
public String removeDuplicates(String input) {
var regex = "\\b(\\w+)(?:\\W+\\1\\b)+";
var pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
var matcher = pattern.matcher(input);
while (matcher.find()) {
// String.replace is literal, so punctuation inside the match is not parsed as regex syntax
input = input.replace(matcher.group(), matcher.group(1));
}
return input;
}
As far as I can see, none of these would match:
London in the
the winter (with the winter on a new line )
Although matching duplicates on the same line is fairly straightforward,
I haven't been able to come up with a solution for the situation in which they
stretch over two lines. ( with Perl )
To find duplicate words that have no leading or trailing non whitespace character(s) other than a word character(s), you can use whitespace boundaries on the left and on the right making use of lookarounds.
The pattern will have a match in:
Paris in the the spring.
Not that that is related.
The pattern will not have a match in:
This is $word word
(?<!\S)(\w+)\s+\1(?!\S)
Explanation
(?<!\S) Negative lookbehind, assert not a non whitespace char to the left of the current location
(\w+) Capture group 1, match 1 or more word characters
\s+ Match 1 or more whitespace characters (note that this can also match a newline)
\1 Backreference to match the same as in group 1
(?!\S) Negative lookahead, assert not a non whitespace char to the right of the current location
See a regex101 demo.
To find 2 or more duplicate words:
(?<!\S)(\w+)(?:\s+\1)+(?!\S)
This part of the pattern (?:\s+\1)+ uses a non capture group to repeat 1 or more times matching 1 or more whitespace characters followed by the backreference to match the same as in group 1.
See a regex101 demo.
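Both lookaround patterns can be tried in JavaScript, assuming an engine with lookbehind support (ES2018 or later). The helper allMatches is hypothetical and only wraps String.prototype.match so that "no match" comes back as an empty array.

```javascript
const pair = /(?<!\S)(\w+)\s+\1(?!\S)/g;       // exactly one repetition
const run = /(?<!\S)(\w+)(?:\s+\1)+(?!\S)/g;   // one or more repetitions

// Hypothetical helper: all full matches, or an empty array.
function allMatches(re, text) {
  return text.match(re) || [];
}

console.log(allMatches(pair, "Paris in the the spring.")); // ["the the"]
console.log(allMatches(pair, "This is $word word"));       // []
console.log(allMatches(run, "no no no way"));              // ["no no no"]
```

Because \s+ also matches a newline, the first pattern finds the pair in "London in the\nthe winter" as well, provided the text is searched as one string.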
Alternatives without using lookarounds
You could also make use of a leading and trailing alternation matching either a whitespace char or assert the start/end of the string.
Then use a capture group 1 for the value that you want to get, and use a second capture group with a backreference \2 to match the repeated word.
Matching 2 duplicate words:
(?:\s|^)((\w+)\s+\2)(?:\s|$)
See a regex101 demo.
Matching 2 or more duplicate words:
(?:\s|^)((\w+)(?:\s+\2)+)(?:\s|$)
See a regex101 demo.
Use this in case you want case-insensitive checking for duplicate words.
(?i)\\b(\\w+)\\s+\\1\\b

Regex to not match when not in quotes

I'm looking to create a JS Regex that matches double spaces
([-!$%^&*()_+|~=`{}\[\]:";'<>?,.\w\/]\s\s[^\s])
The RegEx should match double spaces (not including the start or end of a line, when wrapped within quotes).
Any help on this would be greatly appreciated.
For example:
var x = 1,
Y = 2;
would be fine, whereas
var x =  1;
would not (there is more than one space after the = sign).
Also if it was
console.log("I am some console output");
would be fine as it is within double quotes
This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."
We can solve it with a beautifully-simple regex:
(["']) \1|([ ]{2})
The left side of the alternation | matches complete ' ' and " ". We will ignore these matches. The right side matches and captures double spaces to Group 2, and we know they are the right ones because they were not matched by the expression on the left.
This program shows how to use the regex in JavaScript, where we will retrieve the Group 2 captures:
var the_captures = [];
var string = 'your_test_string';
var myregex = /(["']) \1|([ ]{2})/g;
var thematch = myregex.exec(string);
while (thematch != null) {
// Group 2 is undefined when the left (quoted) branch matched; skip those
if (thematch[2] !== undefined) {
// add it to array of captures
the_captures.push(thematch[2]);
document.write(thematch[2],"<br />");
}
// match the next one
thematch = myregex.exec(string);
}
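If the intent is to ignore double spaces anywhere inside a quoted string, the left branch can be widened so that it consumes the whole string. This is a variation on the technique, not the pattern given in the answer; it assumes quoted strings sit on one line and contain no escaped quotes, and findDoubleSpaces is a hypothetical helper name.

```javascript
// Left branch: a complete quoted string (ignored). Right branch: a double space (captured).
const doubleSpaceOutsideQuotes = /(["']).*?\1|([ ]{2})/g;

// Hypothetical helper: return the index of every double space outside quotes.
function findDoubleSpaces(line) {
  const positions = [];
  for (const m of line.matchAll(doubleSpaceOutsideQuotes)) {
    if (m[2] !== undefined) positions.push(m.index);
  }
  return positions;
}

console.log(findDoubleSpaces('var x =  1;'));                       // [7]
console.log(findDoubleSpaces('console.log("I am  some output");')); // []
```

Leading indentation would still be reported by this sketch, so a real check would need to skip whitespace at the start of the line first.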
A Neat Variation for Perl and PCRE
In the original answer, I hadn't noticed that this was a JavaScript question (the tag was added later), so I had given this solution:
(["']) \1(*SKIP)(*FAIL)|[ ]{2}
Here, thanks to (*SKIP)(*FAIL) magic, we can directly match the spaces, without capture groups.
See demo.
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
Article about matching a pattern unless...
Simple solution:
/\s{2,}/
This matches all occurrences of two or more consecutive whitespace characters. If you need to match the entire line, but only if it contains two or more consecutive whitespace characters:
/^.*\s{2,}.*$/
If the whitespaces don't need to be consecutive:
/^(.*\s.*){2,}$/

match a string not after another string

This
var re = /[^<a]b/;
var str = "<a>b";
console.log(str.match(re)[0]);
matches >b.
However, I don't understand why this pattern /[^<a>]b/ doesn't match anything. I want to capture only the "b".
The reason why /[^<a>]b/ doesn't do anything is that you are ignoring <, a, and > as individual characters, so rewriting it as /[^><a]b/ would do the same thing. I doubt this is what you want, though. Try the following:
var re = /<a>(b)/;
var str = "<a>b";
console.log(str.match(re)[1]);
This regex looks for a string that looks like <a>b first, but it captures the b with the parentheses. To access the b, simply use [1] when you call .match instead of [0], which would return the entire string (<a>b).
What you're using here is a match for a b preceded by any character that is not listed in the character class. The syntax is [^a-z+-], where a-z+- is the set of excluded characters (in this case, the range of the lowercase Latin letters, a plus sign and a minus sign). So, what your regex pattern matches is any b preceded by a character that is NOT < or a. Since > is not in that set, it matches.
A character class basically works the same as a list of characters that are separated by OR pipes: [abcd] matches the same as (a|b|c|d). Character classes just have the extra ability to express ranges, as in [a-d], using a dash between the endpoints. Putting a ^ at the start of a class turns it into a negated one, so it will match anything BUT the characters listed.
What you are looking for is a negative lookahead. Those can exclude something from matching longer strings. Those work in this format: (?!do not match) where do not match uses the normal regex syntax. In this case, you want to test if the preceding string does not match <a>, so just use:
(?!<a>)(.{3}|^.{0,2})b
That will match the b when it is either preceded by three characters that are not <a>, or by fewer characters that are at the start of the line.
PS: what you are probably looking for is the "negative lookbehind", which wasn't available in JavaScript regular expressions when this answer was written (it was added in ES2018 and works in current engines). It is written (?<!<a>)b. On engines without lookbehind support, you'll have to use the alternative regex above.
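On an engine that supports lookbehind (ES2018 or later), the lookbehind form can be tried directly. A minimal sketch; indexOfFreeB is a hypothetical helper that returns the position of the first b not preceded by <a>, or -1 if there is none.

```javascript
const bNotAfterTag = /(?<!<a>)b/;

// Hypothetical helper: index of the first "b" that does not follow "<a>".
function indexOfFreeB(text) {
  const m = bNotAfterTag.exec(text);
  return m ? m.index : -1;
}

console.log(indexOfFreeB("<a>b"));      // -1
console.log(indexOfFreeB("<p>b"));      // 3
console.log(indexOfFreeB("ab <a>b b")); // 1
```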
You could write a pattern to match the anchor tags and then replace them with an empty string:
var str = "<a>b</a>";
str = str.replace(/((<a[\w\s=\[\]\'\"\-]*>)|<\/a>)/gi,'');
This reduces each of the following strings to 'b':
<a>b</a>
<a class='link-l3'>b</a>
to better get familiar with regEx patterns you may find this website very useful regExPal
Your code :
var re = /[^<a>]b/;
var str = "<a>b";
console.log(str.match(re));
Why does [^<a>]b not match anything?
[^<a>]b means: any character except <, a or >, followed by b.
Here b is preceded by >, which is one of the excluded characters, so there is no match.
If you want to match b, write it like this:
var re = /(?:[\<a\>])(b)/;
var str = "<a>b";
console.log(str.match(re)[1]);

JS regexp to match special characters

I'm trying to find a JavaScript regexp for this string: ![](). It needs to be an exact match, though, so:
`!()[]` // No match
hello!()[] // No match
!()[]hello // No Match
!()[] // Match
!()[] // Match (with a whitespace before and/or after)
I tried this: \b![]()\b. It works for words, like \bhello\b, but not for those characters.
The characters in your string are regex metacharacters and need to be escaped. Also use \s if you want to match whitespace. Try the following:
\s?!(?:\[\]\(\)|\(\)\[\])\s?
EDIT: Added a capture group to extract ![]() if needed
EDIT2: I missed that you wanted the order of [] and () to be independent. I've added it in this fiddle: http://jsfiddle.net/MfFAd/3/
This matches your example:
\s*!\[\]\(\)\s*
Though the match also includes the spaces before and after !()[].
I think \b does not work here because ![]() is not a word. Check out this quote from MDN:
\b - Matches a word boundary. A word boundary matches the position where a word character is not followed or preceeded by another word-character. Note that a matched word boundary is not included in the match. In other words, the length of a matched word boundary is zero.
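The effect is easy to check. In the sketch below, \b around ![]() only succeeds when the string is glued to word characters, which is the opposite of what the question wants. The whitespace-based variant is an assumption on my part (it needs lookbehind support, ES2018 or later) and is not taken from the answers.

```javascript
const withWordBoundary = /\b!\[\]\(\)\b/;
const withWhitespaceBoundary = /(?<!\S)!\[\]\(\)(?!\S)/;

console.log(withWordBoundary.test("![]()"));             // false
console.log(withWordBoundary.test("a![]()b"));           // true
console.log(withWhitespaceBoundary.test(" ![]() "));     // true
console.log(withWhitespaceBoundary.test("hello![]()"));  // false
```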
Let's create a function for convenience :
function find(r, s) {
return (s.match(r) || []).slice(-1);
}
The following regular expression accepts only the searched string and whitespaces :
var r = /^\s*(!\[\]\(\))\s*$/;
find(r, '![]() '); // ["![]()"]
find(r, '!()[] '); // []
find(r, 'hello ![]()'); // []
This one searches a sub-string surrounded by whitespaces or string boundaries :
var r = /(?:^|\s)(!\[\]\(\))(?:\s|$)/;
find(r, '![]() '); // ["![]()"]
find(r, 'hello ![]()'); // ["![]()"]
find(r, 'hello![]()'); // []
To match all characters except letters and numbers, you can use this regex:
/[^A-Z0-9]/gi
g - global search (the whole text, not just the first match)
i - case insensitive
To exclude further characters from the match, for example . and , add them to the class:
/[^A-Z0-9\.\,]/gi
To match the exact string, escape it, group it and add the global flag
/(\!\[\]\(\))/g
so that it finds all matches.

How to match start of line and white space in a lookahead expression

I am sure this is really easy but how do I match
match either start of line or whitespace
match a-z
match either end of line or whitespace
I only want to return item no. 2 so for the following string
"one 1.ignore two 2ignore ignore3 three"
The expression will return
["one","two","three"]
Thanks
You would need lookbehind for a regex that matches exactly these items, which was not supported in JavaScript when this was written (lookbehind arrived with ES2018). Without it, either you do a manual iteration and extract matching groups (as demonstrated by @Some1.Kill.The.DJ), or you split the string instead of matching:
str.split(/\s+(?:\S*?(?![a-z])\S+\s+)*/);
This expression does match all whitespaces combined with words that contain at least one character that is not [a-z]. However, this regex is complicated and not easy to maintain; also it does yield empty strings sometimes. Better, do something like
str.split(/\s+/).filter(RegExp.prototype.test.bind(/^[a-z]+$/));
Use this code:
var str = 'one 1.ignore two 2ignore ignore3 three';
str = str.replace(/\s(?=[a-z])/ig, function(text, p1) {
return p1 ? p1 : text;
});
var arr = str.match(/([a-z]+)(?=\s|$)/ig);
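On engines with lookbehind support (ES2018 or later), the three conditions from the question can be written as one pattern. This is a sketch under that assumption; standaloneWords is a hypothetical helper name.

```javascript
// (?<!\S): start of string or whitespace before; (?!\S): end of string or whitespace after.
const standaloneWord = /(?<!\S)[a-z]+(?!\S)/g;

// Hypothetical helper: all runs of a-z delimited by whitespace or string edges.
function standaloneWords(text) {
  return text.match(standaloneWord) || [];
}

console.log(standaloneWords("one 1.ignore two 2ignore ignore3 three"));
// ["one", "two", "three"]
```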
