This regex (regular expression) finds all words or groups of words that begin with a capital letter.
But it should exclude a capitalized word that comes right after a dot followed by a space: i.e., in ". Hello you" it should exclude Hello, because a dot and a space precede it.
The goal is to replace every match in a text with an href link, while excluding ". Any word beginning with a capital letter".
It looks like this:
// EXCLUDE: (. Hello) a dot and a space precede the capitalized word
const regex = /\b((?!\.[\s]+)(?:[A-Z][\p{L}0-9-_]+)(?:\s+[A-Z][\p{L}0-9-_]+)*)\b/ug;
const subst = '$1';
I thought that (?!\.[\s]+) would do the trick, but it doesn't.
Here is a test on regex101: https://regex101.com/r/nwyL8I/3
Thank you.
The correct way to express a negative lookbehind assertion for your situation would be (?<!\.\s+) and not (?!\.\s+), which is a negative lookahead assertion. So I would use:
((?<!\.\s+)\b(?:[A-Z][\p{L}0-9-_]+)(?:\s+[A-Z][\p{L}0-9-_]+)*)\b
But (?:[A-Z][\p{L}0-9-_]+) will not match words with a single letter, such as A. Is that what you really want?
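As a rough illustration of the lookbehind version, here is a small sketch that wraps each match in a link. The sample sentence and the "#" href are assumptions made up for the example, and the character class is written as [\p{L}0-9_-] (hyphen last), which matches the same characters. It needs an engine that supports lookbehind.

```javascript
// Sample text is an assumption for illustration only.
const text = "Hello World is here. Hello you and John Smith.";

// (?<!\.\s+) rejects a word that directly follows a dot plus whitespace.
const regex = /((?<!\.\s+)\b(?:[A-Z][\p{L}0-9_-]+)(?:\s+[A-Z][\p{L}0-9_-]+)*)\b/gu;

// Collect the captured groups.
const matches = [...text.matchAll(regex)].map((m) => m[1]);
console.log(matches); // ["Hello World", "John Smith"]

// "#" is a placeholder href, not a real target.
const linked = text.replace(regex, '<a href="#">$1</a>');
console.log(linked);
```

The second Hello is skipped because ". " precedes it, while the first Hello has nothing before it and is matched.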
The current regular expression seems to match words or groups of words that start with a capital letter and excludes words that are preceded by a dot and a space. However, the exclusion of words after a dot followed by a space might not be working as expected.
The issue is that (?!\.[\s]+) is a negative lookahead: it tests the text that follows the current position, whereas the dot and space you want to rule out come before the word. You can replace it with a negative lookbehind, placed before the word boundary:
const regex = /((?<!\.\s+)\b(?:[A-Z][\p{L}0-9-_]+)(?:\s+[A-Z][\p{L}0-9-_]+)*)\b/ug;
This modification should ensure that words preceded by a dot and whitespace are excluded from the match, provided the engine supports lookbehind.
For example, I want to match all strings that contain the word 'cat' or 'dog' such as concatenation, doghouse, underdog, catastrophe, or endogamy. But I want to exclude the words dogs or cats from being matched. I tried this task using the following regex.
\\w*(cat|dog)(s(?=\w+))*\
But this regex doesn't help me select whatever is after the s. Is there some other way to achieve this? Any help is appreciated.
If you also don't want to match dogsdogs you might write the pattern as:
\b(?!\w*(?:cats\b|dogs\b))\w*(?:cat|dog)\w*
The pattern matches:
\b a word boundary
(?! Negative lookahead, assert that to the right is not
\w*(?:cats\b|dogs\b) Match optional word characters followed by cats or dogs and then a word boundary
) Close the lookahead
\w*(?:cat|dog)\w* Match cat or dog between optional word characters
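A quick way to try the pattern in JavaScript, as a sketch; the sample string is made up for the test and is not from the question:

```javascript
// The lookahead rejects any word that ends in "cats" or "dogs".
const regex = /\b(?!\w*(?:cats\b|dogs\b))\w*(?:cat|dog)\w*/g;

// Hypothetical sample input.
const sample = "concatenation doghouse underdog cats dogs dogsdogs catstick";

const found = sample.match(regex);
console.log(found); // ["concatenation", "doghouse", "underdog", "catstick"]
```

Here cats, dogs and dogsdogs are all rejected by the lookahead, while catstick passes because the s is not followed by a word boundary.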
If a lookbehind assertion is supported, and you also want to allow other non whitespace characters, you can use \S to match a non whitespace character instead of \w that matches a word character.
(?<!\S)(?!\S*(?:cats\b|dogs\b))\S*(?:cat|dog)\S*
I understand your requirements as: match everything that has cat/dog anywhere in word apart from the specific words 'cats' and 'dogs'
\b(?!cats\b|dogs\b)(?=\S*cat\S*|\S*dog\S*)\S*\b
(very) Rough human translation: Find a point where a word isn't cats or dogs (ending with word boundary) and then find a point where a word has cat or dog (either at start, middle, or end) then match everything till the end of the word from that point
Note: flavour - PCRE2
This regex avoids a lookbehind, which is not supported by all browsers.
const regex = /\b(?!cats\b|dogs\b)[a-z]*(?:cat|dog)[a-z]*\b/gi;
const m = 'concatenation, doghouse, underdog, catastrophe, endogamy, dogshore and catstick should match, but not cats and dogs.'.match(regex);
console.log(m);
Output:
[
"concatenation",
"doghouse",
"underdog",
"catastrophe",
"endogamy",
"dogshore",
"catstick"
]
Explanation of regex:
\b -- word boundary
(?!cats\b|dogs\b) -- negative lookahead for just cats or dogs
[a-z]* -- optional alpha chars
(?:cat|dog) -- non-capture group for literal cat or dog
[a-z]* -- optional alpha chars
\b -- word boundary
I'm trying to create a custom word boundary (like \b) that also takes words starting or ending with the unicode characters "ÆØÅæøå" into consideration.
Now the only thing I can come up with is this ugly thing
((?<![\wÆØÅæøå])(?=[\wÆØÅæøå])|(?![\wÆØÅæøå])(?<=[\wÆØÅæøå]))
Is there a more elegant solution to this? Or is this the only way.
You can use:
(?<!\p{L}\p{M}*|[\p{N}_]) // leading word boundary, similar to \<, [[:<:]] or \m in other flavors
(?![\p{L}\p{N}_]) // trailing word boundary, similar to \>, [[:>:]] or \M
Compile the regex with the u modifier to enable Unicode category classes.
The (?<!\p{L}\p{M}*|[\p{N}_]) is a negative lookbehind that matches a location not immediately preceded by a letter (optionally followed by diacritic marks), a digit, or an underscore.
The (?![\p{L}\p{N}_]) is a negative lookahead that matches a location not immediately followed by a letter, a digit, or an underscore.
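To see the two assertions at work in JavaScript, here is a minimal sketch. The wholeWord helper and the sample text are hypothetical, introduced only for this example, and the helper assumes the word contains no regex metacharacters. The å is written as \u00e5 so the example does not depend on how the file is encoded.

```javascript
// Hypothetical helper: wraps a literal word in the two boundary assertions.
// Assumes `word` contains no regex metacharacters.
function wholeWord(word) {
  const lead = String.raw`(?<!\p{L}\p{M}*|[\p{N}_])`;
  const trail = String.raw`(?![\p{L}\p{N}_])`;
  return new RegExp(lead + word + trail, "gu");
}

const text = "\u00e5l k\u00e5l \u00e5len \u00e5l"; // "ål kål ålen ål"

// Unicode-aware boundaries: only the two standalone words match.
const hits = [...text.matchAll(wholeWord("\u00e5l"))].map((m) => m.index);
console.log(hits); // [0, 12]

// Plain \b treats å as a non-word character, so it only matches inside "kål".
const naive = [...text.matchAll(/\b\u00e5l\b/g)].map((m) => m.index);
console.log(naive); // [4]
```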
I want to detect words starting with $ but ignore words starting with $$ because I want to give the user a way to escape that character.
I have tried many things, but the closest I got was this: [^\$]\$\w+
In a string like The side bar $$includes a| $Cheatsheet|, it matches $Cheatsheet together with the white space in front of it. It should match only the word $Cheatsheet, without the space.
How can I do it? Any ideas?
The regex you tried, [^\$]\$\w+, matches any character that is not a dollar sign, followed by a dollar sign and one or more word characters. That would match, for example, a$Cheatsheet, or $Cheatsheet with a leading space. Note that you don't have to escape the dollar sign in the character class.
If negative lookbehinds are supported, to match a $word that is not preceded by another dollar sign you could use:
(?<!\$)\$\w+
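For instance, run against the string from the question (a small sketch, assuming an engine with lookbehind support):

```javascript
const text = "The side bar $$includes a| $Cheatsheet|";

// (?<!\$) makes sure the matched $ is not preceded by another $.
const found = text.match(/(?<!\$)\$\w+/g);
console.log(found); // ["$Cheatsheet"]
```

The first $ of $$includes is rejected because it is followed by another $ rather than a word character, and the second one is rejected by the lookbehind.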
Without a lookbehind you could match what you don't want and capture what you do want in a capturing group.
\${2}\w+|(\$\w+)
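In JavaScript, this "match what you don't want, capture what you do want" idea could look roughly like the sketch below; the variable names are only illustrative. Matches where group 1 did not participate are the escaped $$ words and are dropped.

```javascript
const text = "The side bar $$includes a| $Cheatsheet|";
const regex = /\${2}\w+|(\$\w+)/g;

const wanted = [];
for (const m of text.matchAll(regex)) {
  // Group 1 is undefined when the first alternative ($$word) matched.
  if (m[1] !== undefined) wanted.push(m[1]);
}
console.log(wanted); // ["$Cheatsheet"]
```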
If the dollar sign must also not appear in the middle of a word, you could use:
\S(?:\$+\w+)+\$?|(\$\w+)
You want to escape a $ with $. That means, you need
/(?:^|[^$])(?:\${2})*\B(\$\w+)/g
Details
(?:^|[^$]) - start of string or any char but $
(?:\${2})* - 0 or more repetitions of double dollar (this is required to avoid matching literal dollars)
\B - requires start of string or non-word char before the next $
(\$\w+) - Group 1: a $ and then 1+ word chars.
JS demo:
var s = "The $side bar $$includes a| $Cheatsheet|, $$$$full aaa$side";
var res = [], m;
var rx = /(?:^|[^$])(?:\${2})*\B(\$\w+)/g;
while(m=rx.exec(s)) {
res.push(m[1]);
}
console.log(res);
Since negative lookbehinds are NOT yet supported in all JavaScript engines (https://github.com/tc39/proposal-regexp-lookbehind), you can start with your regex and introduce a match group:
[^\$](\$\w+)
then, to exclude aaa$bbb, it is possible to use:
\s(\$\w+)
edit: and to match at the beginning or after punctuation:
(?:^|[^$\w])(\$\w+)
https://regex101.com/r/3cW5oY/2
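A short sketch of that last pattern in use; the test string is an assumption put together for the example. Since the character before the $ is consumed by the match, the wanted text is read from group 1:

```javascript
// Hypothetical test string.
const text = "The $side bar $$includes a| $Cheatsheet|, aaa$side";

// Start of string, or a char that is neither $ nor a word char, then $word.
const regex = /(?:^|[^$\w])(\$\w+)/g;

const found = [...text.matchAll(regex)].map((m) => m[1]);
console.log(found); // ["$side", "$Cheatsheet"]
```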
This is my string:
var re = "i have a string";
And this is my expression:
var str = re.replace(/(^[a-z])/g, function(x){return x.toUpperCase();});
I want it to uppercase the first character of every word, but the replacement above only uppercases the first character of the string, even though I added /g at the end.
Where is my problem?
You can use the \b to mark a boundary to the expression.
const re = 'i am a string';
console.log(re.replace(/(\b[a-z])/g, (x) => x.toUpperCase()));
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Vlaz' comment looks like the right answer to me -- by putting a "^" at the beginning of your pattern, you've guaranteed that you'll only find the first character -- the others won't match the pattern despite the "/g" because they don't immediately follow the start of the line.
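To make the difference concrete, here is a small side-by-side sketch using the string from the question (the variable names are just for the example):

```javascript
const s = "i have a string";
const upper = (c) => c.toUpperCase();

// ^ anchors to the start of the string, so /g still finds one match only.
const anchored = s.replace(/^[a-z]/g, upper);
console.log(anchored); // "I have a string"

// \b matches at the start of every word.
const everyWord = s.replace(/\b[a-z]/g, upper);
console.log(everyWord); // "I Have A String"
```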
What I need is the last match. In the case below, that is the word test, without the $ signs or any other special characters:
Test String:
$this$ $is$ $a$ $test$
Regex:
\b(\w+)\b
The $ represents the end of the string, so...
\b(\w+)$
However, your test string seems to have dollar sign delimiters, so if those are always there, then you can use that instead of \b.
\$(\w+)\$$
var s = "$this$ $is$ $a$ $test$";
document.body.textContent = /\$(\w+)\$$/.exec(s)[1];
If there could be trailing spaces, then add \s* before the end.
\$(\w+)\$\s*$
And finally, if there could be other non-word stuff at the end, then use \W* instead.
\b(\w+)\W*$
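Wrapped in a function, the most permissive variant might be used like this. It is only a sketch: lastWord is a hypothetical helper name, and returning null when there is no word is a choice made for the example.

```javascript
// Hypothetical helper: returns the last run of word characters, or null.
function lastWord(s) {
  const m = /\b(\w+)\W*$/.exec(s);
  return m ? m[1] : null;
}

const a = lastWord("$this$ $is$ $a$ $test$");
const b = lastWord("Marvelous Marvin Hagler was a very talented boxer!");
const c = lastWord("!!!");
console.log(a, b, c); // "test" "boxer" null
```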
In some cases a word may be followed by non-word characters. For example, take the following sentence:
Marvelous Marvin Hagler was a very talented boxer!
If we want to match the word boxer, the previous answers that anchor the word directly to the end will not suffice, because an exclamation mark follows the word. The following expression captures it and also takes into account extraneous whitespace, newlines and any other trailing non-word characters.
[a-zA-Z]+?(?=\s*?[^\w]*?$)
https://regex101.com/r/D3bRHW/1
The expression breaks down as follows:
[a-zA-Z] looks for letters only, either uppercase or lowercase.
+? expands only as far as necessary.
(?= ... ) is a positive lookahead.
\s*? allows optional whitespace after the word.
[^\w]*? extends that to any non-word characters.
$ asserts the end of the line.
The benefit here is that we do not need any flags or word boundaries, and the expression takes trailing non-word characters into account.
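Here is a quick check of the expression in JavaScript. This is a sketch: the trailing spaces and newline were added to the sample sentence only to exercise the whitespace handling.

```javascript
// Sample sentence with extra trailing whitespace (added for the test).
const sentence = "Marvelous Marvin Hagler was a very talented boxer!  \n";

// The lazy +? grows until only non-word characters remain before the end.
const result = sentence.match(/[a-zA-Z]+?(?=\s*?[^\w]*?$)/);
const word = result ? result[0] : null;
console.log(word); // "boxer"
```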
var input = "$this$ $is$ $a$ $test$";
If you use var result = input.match(/\b(\w+)\b/g), an array of all the matches will be returned. You can then get the last one by calling pop() on the result or by reading result[result.length - 1].
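A minimal sketch of that approach. Note that it uses a regex literal with the g flag so that every match is returned, and it falls back to an empty array when nothing matches, which is a choice made for this example:

```javascript
const input = "$this$ $is$ $a$ $test$";

// With /g, match() returns every full match (capture groups are ignored).
const words = input.match(/\b\w+\b/g) || [];
console.log(words); // ["this", "is", "a", "test"]

// The last element is at length - 1.
const last = words[words.length - 1];
console.log(last); // "test"
```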
Your regex will find a word, and since regexes operate left to right it will find the first word.
A \w+ matches as many consecutive word characters (letters, digits, underscore) as it can, but it must match at least one.
A \b matches the position between a word character and a non-word character. In your case these are the positions between the '$' characters and the letters.
What you need is to anchor your regex to the end of the input which is denoted in a regex by the $ character.
To support an input that may have more than just a '$' character at the end of the line, spaces or a period for instance, you can use \W+ which matches as many non-alphanumeric characters as it can:
\$(\w+)\W+$
Avoid regex - use .split and .pop the result. Use .replace to remove the special characters:
var match = str.split(' ').pop().replace(/[^\w\s]/gi, '');