How to understand regex '\b'? - javascript

I am learning regex, but I can't understand '\b', which matches a word boundary. There are three situations:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
I can't understand the third situation. For example:
var reg = /end\bend/g;
var string = 'wenkend,end,end,endend';
alert( reg.test(string) ) ; //false
The '\b' requires a '\w' character on one side and a non-'\w' character on the other. The string 'end,end' should satisfy the rule: the character after the first 'end' is ',' and the character before the last 'end' is ',', so why is the result false? Could you help? Thanks in advance!
============dividing line=============
With your help, I understand it now. In 'end,end' the first 'end' matches and is followed by a boundary, but the next character is ',' rather than 'e', so /end\bend/ fails.
In other words, /end\bend/g and similar patterns can never match anything.
Thanks again

\b matches a position, not a character. The regex /end\bend/g says the string end must come first. It must then be followed by a word boundary, and the position right before , is one, so \b matches. But \b is zero-length: the regex engine does not advance in the string and stays at ,. The next token in your regex is e, and e doesn't match ,, so the regex fails. Here is what happens step by step:
-----------------
/end\bend/g, "end,end" (match)
 |            |
-----------------
/end\bend/g, "end,end" (both regex and string position moved - match)
    |            |
------------------
/end\bend/g, "end,end" (the previous match was zero-length, so only regex position moved - not match)
      |          |
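The walkthrough above can be checked directly. This is a minimal sketch (the variable names are illustrative, not from the original post): it lists the offsets where \b succeeds in "end,end" and shows that the pattern only matches once the comma is consumed explicitly.

```javascript
// Offsets at which the zero-width \b assertion succeeds in "end,end".
const boundaryOffsets = Array.from('end,end'.matchAll(/\b/g), (m) => m.index);
console.log(boundaryOffsets); // [0, 3, 4, 7]

const text = 'wenkend,end,end,endend';

// \b consumes nothing, so "end" + boundary + "end" would need a boundary
// between "d" and "e", which are both word characters: impossible.
const withBoundaryOnly = /end\bend/.test(text);
console.log(withBoundaryOnly); // false

// Consuming the comma explicitly lets the second "end" line up.
const withComma = /end\b,end/.test(text);
console.log(withComma); // true
```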

With (most) regular expression engines, you can match, capture characters and assert positions within a string.
For the purpose of this example let's assume the string
Rogue One: A Star Wars Story
where you want to match the character o (which is there twice, after R and after t). Now suppose you want to restrict the position and match an o only when it comes right before a lowercase r.
You write (with a positive lookahead):
o(?=r)
Now generalize the idea of zero-width assertions: you want to look for a word character ahead while making sure there's no word character immediately behind. For this you could write:
(?=\w)(?<!\w)
A positive lookahead and a negative lookbehind, combined. We're almost there :) You only need the reverse case (a word character behind and no word character ahead), which is:
(?<=\w)(?!\w)
If you combine these two, you'll eventually get (see the | in the middle):
(?:(?=\w)(?<!\w)|(?<=\w)(?!\w))
Which is equivalent to \b (and a lot longer). Coming back to our string, this is true for:
Rogue One: A Star Wars Story
# right before R
# right after e in Rogue
# right before O of One
# right after e of One (: is not a word character)
# and so on...
See a demo on regex101.com.
To conclude, you can think of \b as a zero-width assertion which only ensures a position within the string.
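The claimed equivalence can be checked with a short sketch. It assumes an engine with lookbehind support (ES2018 or later); the helper name offsetsOf is made up for this illustration.

```javascript
const sample = 'Rogue One: A Star Wars Story';

// Collect the start offset of every match (illustrative helper).
const offsetsOf = (re, s) => Array.from(s.matchAll(re), (m) => m.index);

const viaBoundary = offsetsOf(/\b/g, sample);
const viaLookarounds = offsetsOf(/(?:(?=\w)(?<!\w)|(?<=\w)(?!\w))/g, sample);

// Both patterns succeed at exactly the same positions.
console.log(viaBoundary.join(',') === viaLookarounds.join(',')); // true
console.log(viaBoundary.slice(0, 4)); // [0, 5, 6, 9]

// o(?=r) only matches the "o" in "Story", not the one in "Rogue".
const oBeforeR = offsetsOf(/o(?=r)/g, sample);
console.log(oBeforeR); // [25]
```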

Try this Expression
/(end)\b|\b(end)/g

Related

RegExp avoid double space and space before characters

I'm trying to write a regular expression that disallows double spaces anywhere in a string and also requires a single space before MO or GO, with no space allowed at the beginning or end of the string.
Example 1 : It is 40 GO right
Example 2 : It is 40GO wrong
Example 3 : It is 40  GO wrong (double space)
Here's what I've done so far ^[^ ][a-zA-Z0-9 ,()]*[^;'][^ ]$, which prevents spaces at the beginning and at the end, and also the ";" character. This one works like a charm.
My issue is not allowing double spaces anywhere in the string, and also forcing spaces right before MO or GO characters.
After a few hours of research, I've tried these (starting from the previous RegExp I wrote):
To prevent the double spaces: ^[^ ][a-zA-Z0-9 ,()]*((?!.* {2}).+)[^;'][^ ]$
To force a single space before MO: ^[^ ][a-zA-Z0-9 ,()]*(?=\sMO)*[^;'][^ ]$
But neither of the last two actually works. I'd be thankful to anyone who helps me figure this out.
The lookahead (?!.* {2}) can be omitted. Instead, start and end the match with a non-whitespace character, and put a single space inside an optionally repeated group.
If the string cannot contain ' or ;, note that [^;'][^ ]$ only requires that the second-to-last character is not one of those characters.
You can omit that part anyway, as the character class [a-zA-Z0-9,()] does not match ; or '.
Note that character classes like [^ ] and [^;'] each consume exactly one character, which gives the pattern you tried a minimum length.
To handle GO and MO, you can rule out any occurrence of GO or MO preceded by a non-whitespace character.
^(?!.*\S[MG]O\b)[a-zA-Z0-9,()]+(?: [a-zA-Z0-9,()]+)*$
The pattern matches:
^ Start of string
(?!.*\S[MG]O\b) Negative lookahead, assert that there is no non-whitespace character directly followed by MO or GO to the right. The word boundary \b prevents a partial word match
[a-zA-Z0-9,()]+ Start the match with 1+ occurrences of any of the listed characters (Note that there is no space in it)
(?: [a-zA-Z0-9,()]+)* Optionally repeat the same character class with a leading space
$ End of string
Regex demo
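A quick way to try the pattern against the examples from the question. This is only a sketch: the sample strings are assumptions based on the question, with the third one taken to contain a double space.

```javascript
const pattern = /^(?!.*\S[MG]O\b)[a-zA-Z0-9,()]+(?: [a-zA-Z0-9,()]+)*$/;

const samples = [
  'It is 40 GO',  // single space before GO
  'It is 40GO',   // no space before GO
  'It is 40  GO', // double space
  ' It is 40 GO', // leading space
];

const results = samples.map((s) => pattern.test(s));
console.log(results); // [true, false, false, false]
```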

Unexpected result with Regular Expression in JavaScript

This is my String.
var re = "i have a string";
And this is my expression:
var str = re.replace(/(^[a-z])/g, function(x){return x.toUpperCase();});
I want it to make the first character of every word uppercase, but the replacement above uppercases only the first character of the string, even though I added the g flag at the end.
Where is my problem?
You can use the \b to mark a boundary to the expression.
const re = 'i am a string';
console.log(re.replace(/(\b[a-z])/g, (x) => x.toUpperCase()));
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Vlaz' comment looks like the right answer to me -- by putting a "^" at the beginning of your pattern, you've guaranteed that you'll only find the first character -- the others won't match the pattern despite the "/g" because they don't immediately follow the start of the line.
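To see the difference side by side, here is a small sketch (variable names are illustrative). It also shows one caveat of the \b approach: an apostrophe is not a word character, so \b matches right after it as well.

```javascript
const sentence = 'i have a string';
const toUpper = (ch) => ch.toUpperCase();

// ^ only matches at the start of the string, so /g still finds a single match.
const anchored = sentence.replace(/^[a-z]/g, toUpper);
console.log(anchored); // "I have a string"

// \b matches at the start of every word.
const perWord = sentence.replace(/\b[a-z]/g, toUpper);
console.log(perWord); // "I Have A String"

// Caveat: there is a word boundary between "'" and "t" too.
const withApostrophe = "don't stop".replace(/\b[a-z]/g, toUpper);
console.log(withApostrophe); // "Don'T Stop"
```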

Regular expression to match line separated size strings

I am writing a regular expression to validate an input string, which is a line-separated list of sizes ([width]x[height]).
Valid input example:
300x200
50x80
100x100
The regular expression I initially came up with is (https://regex101.com/r/H9JDjA/1):
^(\d+x\d+[\r\n|\r|\n]*)+$
This regular expression matches my input but also matches this invalid input (size can't be 100x100x200):
300x200
50x80
100x100x200
Adding a word boundary at the end seems to have fixed this issue:
^(\d+x\d+[\r\n|\r|\n]*\b)+$
My questions:
Why does the initial regular expression without the word boundary fail? It looks like I am matching one or more instances of a \d+ (number), followed by the character 'x', followed by a \d+ (number), followed by zero or more new lines from various operating systems.
How can I validate input that has multiple trailing new line characters? The following doesn't work for some inputs like this:
500x500\n100x100\n\n\n384384
^(\d+x\d+[\r\n|\r|\n]\b)+|[\r\n|\r|\n]$
Isolate the problem with this target 100x100x200
For now, forget about the anchors in the regex.
The minimum regex is \d+x\d+ since it only has to be satisfied once
for a match to take place.
The maximum is something like this \d+x\d+ (?: (?:\r?\n | \r)* \d+x\d+ )*
Since \r?\n|\r is optional, it can be reduced to this \d+x\d+ (?: \d+x\d+ )*
The result, when applied to the target string, is:
100x100x200 matches.
But, since you've anchored the regex ^$, it is forced to break up
the middle 100 to make it match.
100x10 from \d+x\d+
0x200 from (?: \d+x\d+ )*
So, that is why the first regex seemingly matches 100x100x200.
To avoid all of that, just require a line break between them, and
make the trailing linebreaks optional (if you need to validate the whole
string, otherwise leave it and the end anchor off).
^\d+x\d+(?:(?:\r?\n|\r)+\d+x\d+)*(?:\r?\n|\r)*$
A better view of it
^
\d+ x \d+
(?:
(?: \r? \n | \r )+
\d+ x \d+
)*
(?: \r? \n | \r )*
$
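The behaviour described above can be verified with a few test strings. This is a minimal sketch; the sample inputs are taken from the question and the variable names are illustrative.

```javascript
const original = /^(\d+x\d+[\r\n|\r|\n]*)+$/;
const strict = /^\d+x\d+(?:(?:\r?\n|\r)+\d+x\d+)*(?:\r?\n|\r)*$/;

const valid = '300x200\n50x80\n100x100';
const invalid = '300x200\n50x80\n100x100x200';
const trailingBreaks = '500x500\n100x100\n\n\n';
const trailingGarbage = '500x500\n100x100\n\n\n384384';

// The original pattern accepts the invalid input by splitting it
// into "100x10" and "0x200".
console.log(original.test(invalid)); // true

// The stricter pattern requires a line break between sizes.
console.log(strict.test(valid)); // true
console.log(strict.test(invalid)); // false
console.log(strict.test(trailingBreaks)); // true
console.log(strict.test(trailingGarbage)); // false
```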
Your initial regular expression "fails" because of the +:
^(\d+x\d+[\r\n|\r|\n]*)+$
-----------------------^ here
Your parenthesized pattern (\d+x\d+[\r\n|\r|\n]*) says match one or more digits followed by an "x" followed by one or more digits followed by zero or more newlines. The + after that says match one or more of the entire parenthesized pattern, which means that for an input like 100x200x300 your pattern can backtrack and match 100x20 and then 0x300, so it looks like it matches the entire line.
If you're simply trying to extract dimensions from a newline-separated string, I would use the following regular expression with a multiline flag:
^(\d+x\d+)$
https://regex101.com/r/H9JDjA/2
Side note: In your expression, [\r\n|\r|\n] is actually saying match any one instance of \r, \n, |, \r, |, or \n (i.e. it's quite redundant, and you probably aren't meaning to match |). If you want to match a sequential set of any combination of \r or \n, you can simply use [\r\n]+.
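The side note about the character class is easy to confirm. A short sketch, with illustrative variable names:

```javascript
// Inside [...] the | is a literal character, not alternation.
const lineBreakClass = /^[\r\n|\r|\n]$/;
const matchesPipe = lineBreakClass.test('|');
console.log(matchesPipe); // true

// As a consequence the question's pattern accepts "|" as a separator.
const questionPattern = /^(\d+x\d+[\r\n|\r|\n]*)+$/;
const acceptsPipeSeparator = questionPattern.test('1x1|2x2');
console.log(acceptsPipeSeparator); // true

// [\r\n]+ matches only runs of carriage returns and line feeds.
const onlyBreaks = /^[\r\n]+$/;
console.log(onlyBreaks.test('\r\n\n')); // true
console.log(onlyBreaks.test('|')); // false
```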
You can use the multiline modifier, which should make life easier:
var input = "\n\
300x200x400\n\
50x80\n\
\n\
\n\
300x200\n\
50x80\n\
100x100x200x100\n";
var allSizes = input.match(/^\d+x\d+/gm); // with the m flag, ^ matches at the start of every line
// Iterate by index; for...in is not meant for arrays.
for (var i = 0; i < allSizes.length; i++)
  console.log(allSizes[i]);
Prints:
300x200
50x80
300x200
50x80
100x100
Try this regex out
^[0-9]{1,4}x[0-9]{1,4}|[(\r\n|\r|\n)]+$
It'll match these inputs.
1x1
10x10
100x100
2000x2938
\n
\r
\r\n
but not all of 100x100x200 (only the leading 100x100 is matched, because the first alternative has no end anchor)

With a JS Regex matching exact word but not hyphenated words starting with said word

I could not find a match to this question.
I have a string like so
var s="one two one-two one-three one one_four"
and my function is as follows
function replaceMatches( str, word )
{
var pattern=new RegExp( '\\b('+word+')\\b','g' )
return str.replace( pattern, '' )
}
the problem is if I run the function like
var problem=replaceMatches( s,'one' )
it returns " two -two -three  one_four".
The function replaces every "one" like it should, but it treats a hyphenated word as
two words and replaces the "one" before the hyphen.
My question is not about the function but about the regex. What literal regex will match
only the words "one" in my string and not "one-two" or "one-\w"<--you know what I mean lol
basically
var pat=/\b(one)\b/g
"one one-two one".replace( pat, '')
I want the above ^ to return
" one-two "
only replace the exact match "one" and not the one in "one-two"
the "one" on the end is important too; the regex must work if the match is at the very end
Thank you, sorry if my question is relatively confusing. I am just trying to get my learn on, and expand my personal library.
What do you consider to be a word?
A word is a sequence of 1 or more word characters, and word boundary \b is defined based upon the definition of word character (and non-word character).
Word character as defined by \w in JavaScript RegExp is shorthand for character class [a-zA-Z0-9_].
What is your definition of a "word"? Let's say your definition is [a-zA-Z0-9_-].
Emulating word boundary
This post describes how to emulate a word boundary in languages that support look-behind and look-ahead. JavaScript did not support look-behind when this was written (it was added in ES2018), so the approach below emulates it.
Let us assume the word to be replaced is one for simplicity.
We can limit the replacement with the following code:
inputString.replace(/([^a-zA-Z0-9_-]|^)one(?![a-zA-Z0-9_-])/g, "$1")
Note: I use the expanded form [a-zA-Z0-9_-] instead of [\w-] to avoid association with \w.
Break down the regex:
(
[^a-zA-Z0-9_-] # Negated character class of "word" character
| # OR
^ # Beginning of string
)
one # Keyword
(?! # Negative look-ahead
[a-zA-Z0-9_-] # Word character
)
I emulate the negative look-behind (which is (?<![a-zA-Z0-9_-]) if supported) by matching a character from negated character class of "word" character and ^ beginning of string. This is natural, since if we can't find a "word" character, then it must be either a non-"word" character or beginning of the string. Everything is wrapped in a capturing group so that it can be replaced back later.
Since one is only replaced if there is no "word" character before or after it, there is no risk of missing a match.
Putting together
Since you are removing "word"s, you must make sure your keyword contains only "word" characters.
function replaceMatches(str, keyword)
{
  // The keyword must not contain non-"word" characters
  if (!/^[a-zA-Z0-9_-]+$/.test(keyword)) {
    throw new Error("not a word");
  }
  // Customize [a-zA-Z0-9_-] and [^a-zA-Z0-9_-] with your definition of
  // "word" character
  var pattern = new RegExp('([^a-zA-Z0-9_-]|^)' + keyword + '(?![a-zA-Z0-9_-])', 'g')
  // $1 puts back the character (if any) that was matched before the keyword
  return str.replace(pattern, '$1')
}
You need to escape meta-characters in the keyword if your definition of "word" character includes regex meta-characters.
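On engines that do support look-behind (ES2018 and later), the same idea can be written without the capturing-group workaround. This is a sketch under that assumption, not part of the original answer; the function names escapeRegExp and removeExactWord are made up for the illustration.

```javascript
// Escape regex meta-characters so the keyword is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Remove `keyword` only where it is not adjacent to a "word" character,
// with "word" character defined as [a-zA-Z0-9_-] as in the answer above.
function removeExactWord(str, keyword) {
  const wordChar = '[a-zA-Z0-9_-]';
  const pattern = new RegExp(
    '(?<!' + wordChar + ')' + escapeRegExp(keyword) + '(?!' + wordChar + ')',
    'g'
  );
  return str.replace(pattern, '');
}

console.log(JSON.stringify(removeExactWord('one one-two one', 'one')));
// " one-two "
```

Because both assertions are zero-width, nothing around the keyword is consumed, so no replacement group is needed.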
Use this for your RegExp:
function replaceMatches( str, word ) {
var pattern = new RegExp('(^|[^-])\\b('+word+')\\b([^-]|$)', 'g');
return str.replace(pattern, '$1$3')
}
The (^|[^-]) will match either the start of the string or any character except -. The ([^-]|$) will match either a character other than - or the end of the string.
I'm not a JS pattern function expert but the function should replace all.
As for the hyphen in 'one-two': the position between one and - is a word boundary (i.e. \b), and the
end of the string is a word boundary if a \w character comes right before it.
But it sounds like you may want 'one' to be preceded by a space or the beginning of the line.
([ ]|^)one\b In that case you want to make the replacement capture group 1, thus stripping out 'one' only.
And, I'm not sure how that function call works in JS.
Edit: after new expected output, the regex could be -
([ ]|^)one(?=[ ]|$)

Examples and explanation for javascript regular expression (x), decimal point, and word boundary

Can someone give a better explanation for these special characters examples in here? Or provide some clearer examples?
(x)
The '(foo)' and '(bar)' in the pattern /(foo) (bar) \1 \2/ match and
remember the first two words in the string "foo bar foo bar". The \1
and \2 in the pattern match the string's last two words.
decimal point
For example, /.n/ matches 'an' and 'on' in "nay, an apple is on the
tree", but not 'nay'.
Word boundary \b
/\w\b\w/ will never match anything, because a word character can never
be followed by both a non-word and a word character.
non word boundary \B
/\B../ matches 'oo' in "noonday", and /y\B./ matches 'ye' in
"possibly yesterday."
I have no idea what the above examples are showing :(
Much thanks!
Parentheses (aka capture groups)
Parentheses are used to indicate a group of symbols in the regular expression that, when matched, is 'remembered' in the match result. The groups are numbered in order and referred to as \1, \2, and so on. In the example /(foo) (bar) \1 \2/ we remember the match foo as \1, and the match bar as \2. This means that the string "foo bar foo bar" matches the regular expression because the third and fourth terms (the \1 and \2) match the same text as the first and second capture groups (i.e. (foo) and (bar)). You can use capture groups in JavaScript like this:
/id:(\d+)/.exec("the item has id:57") // => ["id:57", "57"]
Note that in the return we get the whole match, and the subsequent groups that were captured.
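A short sketch of both uses, backreferences inside the pattern and captured groups in the result (variable names are illustrative):

```javascript
// \1 and \2 must repeat exactly what groups 1 and 2 captured.
const repeated = /(foo) (bar) \1 \2/;
console.log(repeated.test('foo bar foo bar')); // true
console.log(repeated.test('foo bar bar foo')); // false

// exec returns the whole match first, then each captured group.
const match = /id:(\d+)/.exec('the item has id:57');
console.log(match[0]); // "id:57"
console.log(match[1]); // "57"
console.log(match.index); // 13
```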
Decimal point (aka wildcard)
A decimal point is used to represent a single character that can have any value. This means that the regular expression /.n/ will match any two-character string where the second character is an 'n'. So /.n/.test("on") // => true, /.n/.test("an") // => true but /.n/.test("or") // => false. DrC brings up a good point in the comments that . won't match a newline character unless the s (dotAll) flag is set; that only becomes an issue when the input spans multiple lines.
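The example from the question can be run as is. This sketch also shows the newline behaviour, assuming an engine that supports the s (dotAll) flag from ES2018; the variable names are illustrative.

```javascript
const sentence = 'nay, an apple is on the tree';

// "nay" is not matched: its "n" has no character in front of it.
const found = sentence.match(/.n/g);
console.log(found); // ["an", "on"]

// By default the dot does not match a line feed.
const plainDot = /.n/.test('\nn');
console.log(plainDot); // false

// With the s flag the dot matches line terminators too.
const dotAll = /.n/s.test('\nn');
console.log(dotAll); // true
```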
Word boundaries
A word boundary matches the position between a word character and a non-word character, or the start or end of the string when the adjacent character is a word character. In JavaScript the word characters are the ASCII letters, the digits, and the underscore (mdn); everything else is a non-word character. The trick for word boundaries is that they are zero-width assertions, which means they don't consume a character. That's why /\w\b\w/ will never match: you can never have a word boundary between two word characters.
Non-word boundaries
The opposite of a word boundary: instead of matching a point that goes from non-word to word or from word to non-word (i.e. the ends of a word), it matches points that lie between two characters of the same type. So for our examples, /\B../ will match the first point in the string that is between two characters of the same type, plus the next two characters. In this case that point is between the first 'n' and 'o', and the next two characters are "oo". In the second example, /y\B./, we are looking for the character 'y' followed by a character of matching type (so a word character), and the '.' will match that second character. So "possibly yesterday" won't match on the 'y' at the end of "possibly" because the next character is a space, which is a non-word character, but it will match the 'y' at the beginning of "yesterday", because it's followed by a word character, which is then included in the match by the '.' in the regular expression.
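Both \B examples, plus the /\w\b\w/ case from the question, can be checked with a small sketch (variable names and the extra sample strings are illustrative):

```javascript
// \B succeeds between "n" and "o", then .. takes the next two characters.
const first = /\B../.exec('noonday');
console.log(first[0], first.index); // "oo" 1

// The "y" ending "possibly" is followed by a space (a boundary), so it is
// skipped; the "y" starting "yesterday" is followed by a word character.
const second = /y\B./.exec('possibly yesterday.');
console.log(second[0], second.index); // "ye" 9

// /\w\b\w/ cannot match: no boundary exists between two word characters.
const anyMatch = ['ab', 'a b', 'a-b', 'noonday'].some((s) => /\w\b\w/.test(s));
console.log(anyMatch); // false
```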
Overall, regular expressions are popular in many languages and rest on a sound theoretical basis, so there's a lot of material on these characters. In general, JavaScript regular expressions are very similar to Perl-compatible regular expressions (PCRE), but not exactly the same, so the majority of your questions about JavaScript regular expressions would be answered by any PCRE regex tutorial (of which there are many).
Hope that helps!
