JavaScript regular expressions - groups

I'm currently studying regular expression groups. I'm having trouble fully understanding the first example presented in the book under groups. The book gives the following example:
/(\S+) (\S*) ?\b(\S+)/
I understand that this will match at most three words (consisting of any character except whitespace), where the second word and the space after it are optional.
What I have trouble understanding is the function of the word boundary \b, which makes the last group start its match at the beginning of the third word.
When there are three words, it makes no difference whether it is included or not.
When there are only two words, there is a difference in group #2 and group #3.
So, my question is as follows:
When there are two words, why is the presence of \b causing group#2 to be an empty string as expected, but when not present causes group #2 to contain the second word minus the last letter and group #3 to contain the last letter of the second word?

When there are two words, why is the presence of \b causing group#2 to be an empty string as expected
Look at the first and third groups - being (\S+), they must contain characters. When there are two words, those two words must go into the first and third group - the second group, since it's repeated with *, will not consume any characters, and will be the empty string.
but when not present causes group #2 to contain the second word minus the last letter and group #3 to contain the last letter of the second word?
When the pattern is
(\S+) (\S*) ?(\S+)
once the engine has matched the first word, the engine will start trying to match the second word. So if the input is foo bar, we can consider how the pattern (\S*) ?(\S+) works on bar.
The engine first tries to consume all remaining characters in the string with the \S*. This fails, because the last group is required to contain at least one character, so the engine backs up a step and has the \S* group match all but the last character. This results in a successful match, because the rest of the pattern, " ?(\S+)", does match from the position before the last character: the optional space matches nothing and the last group takes the final character.
You can see this process visually here:
https://regex101.com/r/RAkEOt/1/debugger
In the first pattern, the word boundary before the last group ensures that the second group does not match any characters when there are only two words in the string. Rather than backtracking to just before the last character, the engine must back up all the way until a word boundary is found.
The original pattern may be slightly flawed: \b matches a word boundary, but not every non-space character is a word character. It (probably undesirably) matches "foo it's", where it' goes into the second group and s goes into the third group.
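As a quick check of the behaviour described above, here is a minimal sketch that runs the book's pattern with and without \b. The helper name groups is made up for this illustration and is not from the book.

```javascript
// The book's pattern, and the same pattern with the \b removed.
const withBoundary = /(\S+) (\S*) ?\b(\S+)/;
const withoutBoundary = /(\S+) (\S*) ?(\S+)/;

// Hypothetical helper: return only the three captured groups of a match.
const groups = (re, text) => text.match(re).slice(1);

console.log(groups(withBoundary, "foo bar"));     // [ 'foo', '', 'bar' ]
console.log(groups(withoutBoundary, "foo bar"));  // [ 'foo', 'ba', 'r' ]
console.log(groups(withBoundary, "foo bar baz")); // [ 'foo', 'bar', 'baz' ]

// The apostrophe is not a word character, so a boundary exists before "s".
console.log(groups(withBoundary, "foo it's"));    // [ 'foo', "it'", 's' ]
```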

The difference comes from the second group (\S*), which captures any number of non-whitespace characters. So, when you have two words but three groups, and the last one is (\S+) (match at least one non-whitespace character), the regex engine will try to satisfy both group 2 and group 3.
Remember that the engine only looks for some way to make the whole pattern match, and you haven't told it not to split the word like that. With the input "the brown", the second group's \S* initially matches everything, grabbing brown. The next part of the pattern is an optional space, which passes by matching nothing. Then the engine gets to the final group \S+, and since that needs at least one character, the second group gives back characters one by one until group 3 is satisfied.
You can see this here - I've defined the third group to have at least two mandatory characters, hence it only gets two:
let [, group1, group2, group3] = "the brown".match(/(\S+) (\S*) ?(\S{2,})/);
console.log("group 1:", group1); // group 1: the
console.log("group 2:", group2); // group 2: bro
console.log("group 3:", group3); // group 3: wn
When you instead add the word boundary \b to the pattern, you cannot have group 2 have any characters and satisfy the later condition - when a regex consumes a character the rest of the pattern will only continue from that character onward, hence you cannot have, for example group 2 match b and then have a word boundary followed by rown. The only way that (\S+) (\S*) ?\b(\S+) can be satisfied is the following:
group 1 matches the
the space character is matched
group 2 matches nothing, which is acceptable as it can match any amount, including zero
the optional space matches zero spaces
there is a word boundary
group 3 consumes the rest of the letters - brown
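To see why group 3 has to start at brown, it can help to list the positions where \b can match in the input. This is only a small sketch; the variable names are illustrative.

```javascript
const text = "the brown";

// Collect the index of every position where the zero-width \b matches.
const boundaries = [...text.matchAll(/\b/g)].map((m) => m.index);
console.log(boundaries); // [ 0, 3, 4, 9 ]

// No boundary lies inside "brown", so after "the" and the space are
// consumed, the only place group 3 can begin is index 4.
const [, group1, group2, group3] = text.match(/(\S+) (\S*) ?\b(\S+)/);
console.log([group1, group2, group3]); // [ 'the', '', 'brown' ]
```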

Related

Regex for match beginning with 2 letters and ending with 3 letters

Example input:
'Please find the ref AB45676785567XYZ. which is used to identify reference number'
Example output:
'AB45676785567XYZ'
I need a RegExp to return the match exactly matching my requirements; i.e. the substring where the first 2 and last 3 characters are letters.
The first 2 and last 3 letters are unknown.
I've tried this RegExp:
[a-zA-Z]{2}[^\s]*?[a-zA-Z]{3}
But it is not matching as intended.
Your current RegExp, applied globally, matches the following substrings of the input:
Pleas, AB45676785567XYZ, which, ident, refer, numbe
This is because your RegExp, [a-zA-Z]{2}[^\s]*?[a-zA-Z]{3}, is asking for:
[a-zA-Z]{2} Begins with 2 letters (either case)
[^\s]*? Contains any run of non-whitespace characters, as short as possible (possibly empty)
[a-zA-Z]{3} Ends with 3 letters (either case)
In your current example, restricting the letters to uppercase would leave only the match you seek:
[A-Z]{2}[^\s]+[A-Z]{3}
Alternatively, requiring numbers between the 2 beginning and 3 ending letters would also produce the match you want:
[a-zA-Z]{2}\d+[a-zA-Z]{3}
What is really important here is word boundaries, \b. Try: \b[a-zA-Z]{2}\w+[a-zA-Z]{3}\b
Explanation:
\b - word boundary
[a-zA-Z]{2} - match any letter, 2 times
\w+ - match one or more word characters
[a-zA-Z]{3} - match any letter, 3 times
\b - word boundary
CAUTION: your requirements are ambiguous, as any word consisting of 5 or more letters would satisfy them.
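The caution is easy to see by running the suggested patterns against the example input. The sketch below uses variable names of my own choosing and only the patterns quoted in this thread.

```javascript
const input =
  "Please find the ref AB45676785567XYZ. which is used to identify reference number";

// Word-boundary version: 2 letters, one or more word characters, 3 letters.
const bounded = /\b[a-zA-Z]{2}\w+[a-zA-Z]{3}\b/g;
// Stricter versions: digits only in the middle, or uppercase letters only.
const digitsBetween = /[a-zA-Z]{2}\d+[a-zA-Z]{3}/g;
const upperOnly = /[A-Z]{2}[^\s]+[A-Z]{3}/g;

// Ordinary words of six or more letters also fit the bounded pattern.
console.log(input.match(bounded));
// [ 'Please', 'AB45676785567XYZ', 'identify', 'reference', 'number' ]

console.log(input.match(digitsBetween)); // [ 'AB45676785567XYZ' ]
console.log(input.match(upperOnly));     // [ 'AB45676785567XYZ' ]
```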
Start with 2 letters :
[a-zA-Z]{2}
Digits in the middle :
\d+
Finish with 3 letters :
[a-zA-Z]{3}
Full Regex :
[a-zA-Z]{2}\d+[a-zA-Z]{3}
If the middle text is Alpha-Numeric, you can use this :
[A-Z]{2}[^\s]+[A-Z]{3}

JavaScript Regular Expressions Basics

I'm trying to learn Regular Expressions and at the moment I've gathered a very basic understanding from all of the overviews from W3, Mozilla or http://www.regular-expressions.info/, but when I was exploring this wikibook http://en.wikibooks.org/wiki/JavaScript/Regular_Expressions it gave this example:
"abbc".replace(/(.)\1/g, "$1") => "abc"
which I have no idea why is true (the wikibook didn't really explain), but I tried it myself and it does drop the second b. I know \1 is a backreference to the captured group (.), but . is the any character besides a new line symbol... Wouldn't that still pick up the second b? Trying a few variations didn't clear things up either...
"abbc".replace(/(.)/g, "$1") => "abbc"
"aabc".replace(/(.)*/g, "$1") => "c"
Does anybody have a good in depth tutorial on Javascript Regular Expressions (I've looked at a couple of books and they're very generalized for about 15 languages and no real emphasis on Javascript).
First One
(.) matches and captures a single character to Group 1, so (.)\1 matches two of the same characters, for instance AA.
In the string, the only match for this pattern is bb.
By replacing these two characters bb with the Group 1 capture buffer $1, i.e. b, we replace two chars with one, effectively removing one b.
Second One
Again (.) matches and captures a single character, capturing it to Group 1.
The pattern matches each character in the string in turn.
The replacement is the Group 1 capture buffer $1, so we replace each character with itself. Therefore the string is unchanged.
Third One
Here, forgetting the parentheses for a moment, .* matches the whole string: this is the match.
The quantifier * means that the Group 1 is reset every time a single character is matched (new group numbers are not created, as group numbering is done from left to right).
For every character that is matched, that character is therefore captured to Group 1, until the next capture resets Group 1.
The end value of Group 1 is the last capture, which is the last character, c.
We replace the match (i.e., the whole string) with Group 1 (i.e. c), so the replacement string is c.
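The three cases can be run side by side. This is a small sketch using the exact calls from the question; the variable names are only for illustration.

```javascript
// The three calls from the question.
const collapsed = "abbc".replace(/(.)\1/g, "$1"); // doubled character -> single
const unchanged = "abbc".replace(/(.)/g, "$1");   // each character -> itself
const lastOnly = "aabc".replace(/(.)*/g, "$1");   // whole string -> last capture
console.log(collapsed, unchanged, lastOnly); // abc abbc c

// Inspecting the third pattern without replacing: the whole string is the
// match, and Group 1 holds only the last character it captured.
const m = "aabc".match(/(.)*/);
console.log(m[0], m[1]); // aabc c
```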
The details of group numbering are important to grasp, and I highly recommend you read the linked article about "the gory details".
Reference
Capture Group Numbering & Naming: The Gory Details
JavaScript Regex Basics
Backreferences
This is quite simple when broken down:
With "abbc".replace(/(.)\1/g, "$1"), the result is "abc" because:
(.) matches and captures one character.
\1 is a backreference to whatever the first group captured.
So what it says is "find the same character twice in a row" and replace the pair with the captured character. Any doubled character would match and be replaced by a single copy.

Difference between (\w)* and \w?

I'm trying to study regexes, and I came upon this confusing scenario:
Suppose you have the text:
hello world
If you run the regex (\w)*, it gives:
['hello', 'o']
What I expected was:
['hello', 'h']
Doesn't \w mean any word character?
Another example:
Text:
Delicious cake
(\w)* output:
['Delicious', 's']
What I expected:
['Delicious', 'D']
'*' matches the preceding element zero or more times and binds tightly to the element on its left.
Example: m*o will match o, mo, mmo, mmmmo and so on.
Parentheses () are used to mark sub-expressions, also called capture groups.
So (\w)* is a repeated capturing group.
Sam, the reason why (\w)* returns "s" in Group 1 against "delicious" is that there can only be one Group 1. Each time a new character is matched by (\w), the parentheses force the new value of the character to be captured into Group 1. "s" is the last character, so it is the final Group 1 reported to you by the engine.
If you wanted to capture the first letter into Group 1 instead, you could go with something like:
(\w)\w*
This causes the first character to be captured. There is no quantifier on the capturing parentheses, so Group 1 doesn't change. The remaining \w* optionally match any additional characters.
Also please note that when you run (\w)* against "hello world", the matches are not "hello" and "o" as you stated. The matches (if you match them all) are "hello" and "world". The Group 1 captures are "o" and "d", the last letters of each word.
Reference: All about capture
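A short sketch of the difference in JavaScript. Using + instead of * in the last pattern is my own choice, made only to keep empty matches out of the output.

```javascript
const text = "Delicious cake";

// Quantifier on the group: Group 1 is overwritten on every iteration.
const repeated = text.match(/(\w)*/);
console.log(repeated[0], repeated[1]); // Delicious s

// No quantifier on the group: Group 1 keeps the first character.
const firstLetter = text.match(/(\w)\w*/);
console.log(firstLetter[0], firstLetter[1]); // Delicious D

// All matches in "hello world", each paired with its final Group 1 value.
const perWord = [..."hello world".matchAll(/(\w)+/g)].map((m) => [m[0], m[1]]);
console.log(perWord); // [ [ 'hello', 'o' ], [ 'world', 'd' ] ]
```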
Remember, a repeated capturing group always captures only its last iteration.
So:
(\w)* on hello will match one character at a time until it reaches the last character it can match.
Thus it will get o in the capture group.
(\w)* on helloworld will match one character at a time until it reaches the last character it can match.
Thus it will get d in the capture group.
(\w)* on hello123 will match one character at a time until it reaches the last character it can match.
Thus it will get 3 in the capture group.
(\w)* on helloworld#3w4 will match one character at a time until it reaches the last character it can match. Thus it will get d in the capture group, since # is not a valid \w character (only [_0-9a-zA-Z] allowed).
(\w)*
Match the regular expression below and capture its match into backreference number 1 «(\w)*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Note: You repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations. «*»
Match a single character that is a “word character” (letters, digits, and underscores) «\w»
Will give you two matches:
hello
world
\w
Match a single character that is a “word character” (letters, digits, and underscores) «\w»
Will match every character (individually) on the sentence:
h
e
l
l
o
w
o
r
l
d
\w is a RegEx shortcut for [_a-zA-Z0-9] which means any letter, digit, or an underscore.
When you add an asterisk * after anything, it means it can appear from 0 to unlimited times.
If you want to match all the letters in your input, use \w
If you want to match whole words in your input, use \w+ (use + and not * since a word has at least one letter)
Also, when you surround part of your RegEx with parentheses, that part becomes a capture group, which means it will appear in your results. That is why (\w)* is different from (\w*).
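The four variants mentioned above can be compared directly. This is a minimal sketch with illustrative variable names.

```javascript
const text = "hello world";

// \w with the g flag: every word character individually.
const singleChars = text.match(/\w/g);
console.log(singleChars.join("")); // helloworld

// \w+ with the g flag: whole words.
const words = text.match(/\w+/g);
console.log(words); // [ 'hello', 'world' ]

// (\w)* captures only the last repeated character; (\w*) captures the word.
const groupRepeated = text.match(/(\w)*/);
const groupAroundAll = text.match(/(\w*)/);
console.log(groupRepeated[0], groupRepeated[1]);   // hello o
console.log(groupAroundAll[0], groupAroundAll[1]); // hello hello
```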
Useful RegEx sites:
RegexPal
Debuggex

HTML5 Pattern for allowing only 5 comma separated words without including a space

I basically don't know how to write HTML5 patterns, so that's why I am asking this question. I just want to create a pattern that would be applied to an input[type=text] through the pattern attribute newly introduced by HTML5, but I have no idea how to implement that pattern.
The pattern includes the following things:
Allow only 5 comma separated words
No space could be added.
^(\w+,){4}\w+$ is the pattern you need: repeat "any number of word characters followed by comma" four times, then "just word characters". If you want "up to five", the solution would be
^(\w+,){0,4}\w+$
Detailed explanation (adapted from http://www.regex101.com):
^ assert position at start of the string (i.e. start matching from start of string)
1st Capturing group (\w+,){0,4}
Quantifier {0,4}: Between 0 and 4 times, as many times as possible, giving back as needed [greedy]
\w+ match any word character [a-zA-Z0-9_]
Quantifier +: Between one and unlimited times, as many times as possible, giving back as needed [greedy]
, matches the character , literally
\w+ match any word character [a-zA-Z0-9_]
Quantifier +: Between one and unlimited times, as many times as possible, giving back as needed [greedy]
$ assert position at end of the string
If you don't want numbers as part of the match, replace every instance of \w in the above with [a-zA-Z] - it will match only lowercase and uppercase letters A through Z.
UPDATE In response to your comment, if you want no group of characters to be more than 10 long, you can modify the above expression to
^(\w{1,10},){0,4}\w{1,10}$
Now the "quantifier" is {1,10} instead of + : that means "between 1 and 10 times" instead of "1 or more times".
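Before putting any of these into a pattern attribute, they can be tried out in plain JavaScript. This is only a sketch: the variable names and the sample inputs are made up for the test.

```javascript
const exactlyFive = /^(\w+,){4}\w+$/;
const upToFive = /^(\w+,){0,4}\w+$/;
const upToFiveMax10 = /^(\w{1,10},){0,4}\w{1,10}$/;

console.log(exactlyFive.test("one,two,three,four,five")); // true
console.log(exactlyFive.test("one,two,three"));           // false (only 3 words)

console.log(upToFive.test("one,two,three"));              // true
console.log(upToFive.test("one,two,three,four,five,six")); // false (6 words)
console.log(upToFive.test("one, two"));                   // false (contains a space)

console.log(upToFiveMax10.test("abcdefghij,two"));        // true  (10 characters)
console.log(upToFiveMax10.test("abcdefghijk,two"));       // false (11 characters)
```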

Examples and explanation for javascript regular expression (x), decimal point, and word boundary

Can someone give a better explanation for these special characters examples in here? Or provide some clearer examples?
(x)
The '(foo)' and '(bar)' in the pattern /(foo) (bar) \1 \2/ match and
remember the first two words in the string "foo bar foo bar". The \1
and \2 in the pattern match the string's last two words.
decimal point
For example, /.n/ matches 'an' and 'on' in "nay, an apple is on the
tree", but not 'nay'.
Word boundary \b
/\w\b\w/ will never match anything, because a word character can never
be followed by both a non-word and a word character.
non word boundary \B
/\B../ matches 'oo' in "noonday", and /y\B./ matches 'ye' in
"possibly yesterday."
totally having no idea what the above example is showing :(
Much thanks!
Parentheses (aka capture groups)
Parentheses are used to indicate a group of symbols in the regular expression that, when matched, are 'remembered' in the match result. Each group is numbered in order and can be referred to as \1, \2, and so on. In the example /(foo) (bar) \1 \2/ we remember the match foo as \1, and the match bar as \2. This means that the string "foo bar foo bar" matches the regular expression because the third and fourth terms (the \1 and \2) match whatever the first and second capture groups (i.e. (foo) and (bar)) matched. You can use capture groups in JavaScript like this:
/id:(\d+)/.exec("the item has id:57") // => ["id:57", "57"]
Note that in the return we get the whole match, and the subsequent groups that were captured.
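The backreference example from the question can be checked the same way. This is a small sketch; the second test string is my own, added to show a failing case.

```javascript
const pattern = /(foo) (bar) \1 \2/;

// \1 must repeat what group 1 matched, and \2 what group 2 matched.
console.log(pattern.test("foo bar foo bar")); // true
console.log(pattern.test("foo bar bar foo")); // false

// exec returns the whole match first, then each captured group.
const result = pattern.exec("foo bar foo bar");
console.log(result[0]);            // foo bar foo bar
console.log(result[1], result[2]); // foo bar
```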
Decimal point (aka wildcard)
A decimal point is used to represent a single character that can have any value. This means that the regular expression /.n/ will match any two character string where the second character is an 'n'. So /.n/.test("on") // => true, /.n/.test("an") // => true but /.n/.test("or") // => false. DrC brings up a good point in the comments that this won't match a newline character. Note that the multiline flag (m) does not change this, since it only affects ^ and $; it is the dotAll flag (s) that lets . match newlines.
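How the dot treats a newline under the different flags can be checked with a short sketch. The test string is made up for the illustration, and the s flag assumes an engine with ES2018 support.

```javascript
const text = "\nn"; // a newline followed by the letter n

console.log(/.n/.test(text));  // false: . does not match the newline
console.log(/.n/m.test(text)); // false: the m flag only changes ^ and $
console.log(/.n/s.test(text)); // true:  the s (dotAll) flag lets . match it
```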
Word boundaries
A word boundary matches the position between a word character and a non-word character (or between a word character and the start or end of the string). In JavaScript the word characters are the alphanumerics (a-z, A-Z, 0-9) and the underscore; non-word is everything else! The trick for word boundaries is that they are zero-width assertions, which means they don't consume a character. That's why /\w\b\w/ will never match: you can never have a word boundary between two word characters.
Non-word boundaries
The opposite of a word boundary, instead of matching a point that goes from non-word to word, or word to non-word (i.e. the ends of a word) it will match points where it's moving between the same types of character. So for our examples /\B../ will match the first point in the string that is between two characters of the same type and the next two characters, in this case it's between the first 'n' and 'o', and the next two characters are "oo". In the second example /y\B./ we are looking for the character 'y' followed by a character of matching type (so a word character), and the '.' will match that second character. So "possibly yesterday" won't match on the 'y' at the end of "possibly" because the next character is a space, which is a non word, but it will match the 'y' at the beginning of "yesterday", because it's followed by a word character, which is then included in the match by the '.' in the regular expression.
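The examples quoted in the question can be run as they are. This is a minimal sketch; the variable names are illustrative, and the string passed to the first test is arbitrary.

```javascript
// \w\b\w can never match: no boundary exists between two word characters.
const neverMatches = /\w\b\w/.test("any words at all");
console.log(neverMatches); // false

// \B fails at index 0 (start of a word), so the match starts after the "n".
const noonday = "noonday".match(/\B../)[0];
console.log(noonday); // oo

// The "y" ending "possibly" is followed by a boundary, so it is skipped.
const yesterday = "possibly yesterday.".match(/y\B./)[0];
console.log(yesterday); // ye

// The dot example: any character followed by "n".
const dotN = "nay, an apple is on the tree".match(/.n/g);
console.log(dotN); // [ 'an', 'on' ]
```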
Overall, regular expressions are popular in many languages and rest on a sound theoretical basis, so there's a lot of material on these characters. In general, JavaScript's regular expressions are very similar to Perl-style ones such as PCRE (but not exactly the same!), so the majority of your questions about JavaScript regular expressions would be answered by any Perl or PCRE regex tutorial (of which there are many).
Hope that helps!
