Difference between (\w)* and \w? - javascript

I'm trying to study regexes, and I came upon this confusing scenario:
Suppose you have the text:
hello world
If you run the regex (\w)*, it gives:
['hello', 'o']
What I expected was:
['hello', 'h']
Doesn't \w mean any word character?
Another example:
Text:
Delicious cake
(\w)* output:
['Delicious', 's']
What I expected:
['Delicious', 'D']

'*' matches the preceding element zero or more times and binds tightly to the element on its left.
Example: m*o will match o, mo, mmo, mmmmo and so on.
Parentheses () are used to mark sub-expressions, also called capture groups.
So (\w)* is a repeated capturing group.
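To make the "binds tightly" point concrete, here is a small sketch (the sample strings are made up for illustration). It shows that * repeats only the single element to its left unless parentheses group more than one element.

```javascript
// * applies only to the single element directly on its left (here: m).
const starOnM = "o mo mmo mmmmo".match(/m*o/g);
console.log(starOnM); // [ 'o', 'mo', 'mmo', 'mmmmo' ]

// In ab*, only b is repeated, so the first match on "ababab" is just "ab".
const starOnB = "ababab".match(/ab*/)[0];
console.log(starOnB); // ab

// Wrapping ab in parentheses makes * repeat the whole pair.
const starOnGroup = "ababab".match(/(ab)*/)[0];
console.log(starOnGroup); // ababab
```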
Regex Demo

Sam, the reason why (\w)* returns "s" in Group 1 against "delicious" is that there can only be one Group 1. Each time a new character is matched by (\w), the parentheses force the new value of the character to be captured into Group 1. "s" is the last character, so it is the final Group 1 reported to you by the engine.
If you wanted to capture the first letter into Group 1 instead, you could go with something like:
(\w)\w*
This causes the first character to be captured. There is no quantifier on the capturing parentheses, so Group 1 doesn't change. The remaining \w* optionally match any additional characters.
Also please note that when you run (\w)* against "hello world", the matches are not "hello" and "o" as you stated. The matches (if you match them all) are "hello" and "world". The Group 1 captures are "o" and "d", the last letters of each word.
Reference: All about capture
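The behaviour described in this answer can be checked with a short sketch. The variable names are only for illustration; the empty matches that (\w)* also produces between words are filtered out here.

```javascript
const text = "hello world";

// Quantified group: Group 1 is overwritten on every iteration,
// so it ends up holding the last character of the match.
const repeated = text.match(/(\w)*/);
console.log(repeated[0], repeated[1]); // hello o

// Unquantified group followed by \w*: Group 1 keeps the first character.
const firstOnly = text.match(/(\w)\w*/);
console.log(firstOnly[0], firstOnly[1]); // hello h

// All non-empty matches of (\w)* together with their final Group 1 value.
const allMatches = [...text.matchAll(/(\w)*/g)]
  .filter((m) => m[0] !== "")
  .map((m) => [m[0], m[1]]);
console.log(allMatches); // [ [ 'hello', 'o' ], [ 'world', 'd' ] ]
```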

Remember, a repeated capturing group always keeps only its last iteration.
So:
(\w)* on hello matches one character at a time until it reaches the end of the match.
Thus o ends up in the capture group.
(\w)* on helloworld matches one character at a time until it reaches the end of the match.
Thus d ends up in the capture group.
(\w)* on hello123 matches one character at a time until it reaches the end of the match.
Thus 3 ends up in the capture group.
(\w)* on helloworld#3w4 matches one character at a time until it reaches the end of the first match. Thus d ends up in the capture group, since # is not a word character (\w allows only [_0-9a-zA-Z]).
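The four cases above can be reproduced in one go. This is just a sketch that takes the first match of (\w)* on each input and reads Group 1.

```javascript
const inputs = ["hello", "helloworld", "hello123", "helloworld#3w4"];

// Without the g flag, match() returns the first match; index 1 is Group 1,
// which holds the character captured by the last iteration of (\w).
const lastCaptures = inputs.map((s) => s.match(/(\w)*/)[1]);
console.log(lastCaptures); // [ 'o', 'd', '3', 'd' ]
```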

(\w)*
Match the regular expression below and capture its match into backreference number 1 «(\w)*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Note: You repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations. «*»
Match a single character that is a “word character” (letters, digits, and underscores) «\w»
Will give you two matches:
hello
world
\w
Match a single character that is a “word character” (letters, digits, and underscores) «\w»
Will match every character (individually) on the sentence:
h
e
l
l
o
w
o
r
l
d
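As a rough check of the two listings above in JavaScript (the explanation itself comes from a regex tool, so this is only an equivalent sketch): with the g flag, \w returns each character separately, while (\w)* returns whole words plus some empty matches, which are dropped here.

```javascript
const sentence = "hello world";

// \w with the g flag: one match per word character.
const chars = sentence.match(/\w/g);
console.log(chars); // [ 'h', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd' ]

// (\w)* with the g flag: whole runs of word characters.
// It can also match the empty string, so empty matches are filtered out.
const words = sentence.match(/(\w)*/g).filter((w) => w !== "");
console.log(words); // [ 'hello', 'world' ]
```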

\w is a RegEx shortcut for [_a-zA-Z0-9] which means any letter, digit, or an underscore.
When you add an asterisk * after anything, it means it can appear from 0 to unlimited times.
If you want to match all the letters in your input, use \w
If you want to match whole words in your input, use \w+ (use + and not * since a word has at least one letter)
Also, when you surround part of your RegEx with parentheses, it becomes a capture group, which means its content will appear in your results. This is why (\w)* is different from (\w*).
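A quick sketch of the difference, using the "Delicious cake" text from the question:

```javascript
const input = "Delicious cake";

// \w+ with the g flag returns the whole words.
const wholeWords = input.match(/\w+/g);
console.log(wholeWords); // [ 'Delicious', 'cake' ]

// Group inside the repetition: Group 1 holds only the last character.
const innerGroup = input.match(/(\w)*/);
console.log(innerGroup[0], innerGroup[1]); // Delicious s

// Repetition inside the group: Group 1 holds the whole word.
const outerGroup = input.match(/(\w*)/);
console.log(outerGroup[0], outerGroup[1]); // Delicious Delicious
```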
Useful RegEx sites:
RegexPal
Debuggex

Related

JavaScript Regular Expression to repeat element match in string

Yes I am a newbie, I have looked online and cannot seem to find the answer to the following, I know it must be simple.
I have a simple string and need to match spaced Capitals eg T G D ......repeater,
Secondly I need to match capitals with a dot between them and no space eg T.G.D ........repeater
I have the current string = str.match(/ [A-Z] [A-Z] | [A-Z].[A-Z]/g)
but this only matches the first two, e.g. T G. I need it to match the whole run wherever it finds the pattern, e.g. T G D E F L ...repeater, as one match.
Likewise, it only matches T.G and nothing after it. I need it to match T.G.D.L.T ...repeater (which may or may not end with a dot).
any help will be appreciated.
You might use a capturing group matching either a dot or a space in a character class ([. ])
Match the first 2 capitals and capture the space or dot between them in group 1. Then optionally repeat what was captured, using a backreference followed by a capital A-Z.
\b[A-Z]([. ])[A-Z](?:\1[A-Z])*\b
\b Word boundary
[A-Z]([. ])[A-Z] Match A-Z, capture either a space or a dot in group 1, then match A-Z again
(?:\1[A-Z])* Repeat 0+ times matching what it captured in group 1 followed by A-Z
\b Word boundary
Regex demo
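Here is a small sketch of that pattern in use. The sample sentence is invented for illustration; the g flag is added so that every run is returned.

```javascript
const capitalsRun = /\b[A-Z]([. ])[A-Z](?:\1[A-Z])*\b/g;

// Hypothetical input containing one space-separated run and one dot-separated run.
const sample = "Call T G D E F L now or T.G.D.L.T later";
const runsFound = sample.match(capitalsRun);
console.log(runsFound); // [ 'T G D E F L', 'T.G.D.L.T' ]

// Because of the backreference \1, the separator cannot change mid-run:
// in "T G.D" only "T G" is matched.
const mixed = "T G.D".match(capitalsRun);
console.log(mixed); // [ 'T G' ]
```

Note that a trailing dot after the last capital is not part of the match with this pattern.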

Regex to find character only if it occurs 4 times

I'm stuck on making this Regex. I tried using look-ahead and look-behind together, but I couldn't use the capture group in the look-behind. I need to extract characters from a string ONLY if it occurs 4 times.
If I have these strings
3346AAAA44
3973BBBBBB44
9755BBBBBBAAAA44
The first one will match because it has 4 A's in a row.
The second one will NOT match because it has 6 B's in a row.
The third one will match because it still has 4 A's. What makes it even more frustrating is that it can be any char from A to Z occurring 4 times.
Positioning does not matter.
EDIT: My attempt at the regex, doesn't work.
(([A-Z])\2\2\2)(?<!\2*)(?!\2*)
If lookbehind is allowed, after capturing the character, negative lookbehind for \1. (because if that matches, the start of the match is preceded by the same character as the captured first character). Then backreference the group 3 times, and negative lookahead for the \1:
`3346AAAA44
3973BBBBBB44
9755BBBBBBAAAA44`
  .split('\n')
  .forEach((str) => {
    // Logs the match array, or null when there is no run of exactly 4
    console.log(str.match(/([a-z])(?<!\1.)\1{3}(?!\1)/i));
  });
([a-z]) - Capture a character
(?<!\1.) Negative lookbehind: assert that the captured character is not itself preceded by the same character (the . covers the character just captured, and \1 checks the one before it)
\1{3} - Match the same character that was captured 3 more times
(?!\1) - After the 4th match, make sure it's not followed by the same character
Another version without lookbehind (see demo). The captured sequence of 4 equal characters will be rendered in Group 2.
(?:^|(?:(?=(\w)(?!\1))).)(([A-Z])\3{3})(?:(?!\3)|$)
(?:^|(?:(?=(\w)(?!\1))).) - ensure it's the beginning of the string. Otherwise, the 2nd char must be different from the 1st one - if yes, skip the 1st char.
(([A-Z])\3{3}) Capture 4 repeated [A-Z] chars
(?:(?!\3)|$) - ensure the first char after those 4 is different. Or it's the end of the string
As it was suggested by bobble-bubble in this comment - the expression above can be simplified to (demo):
(?:^|(\w)(?!\1))(([A-Z])\3{3})(?!\3)
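A sketch of the simplified lookbehind-free pattern applied to the strings from the question (plus "AAAA" as an extra test for the start-of-string case). The run of four characters is read from Group 2, as described above.

```javascript
const fourInARow = /(?:^|(\w)(?!\1))(([A-Z])\3{3})(?!\3)/;

const samples = ["3346AAAA44", "3973BBBBBB44", "9755BBBBBBAAAA44", "AAAA"];

// Group 2 holds the run of exactly four equal letters, if there is one.
const runs = samples.map((s) => {
  const m = s.match(fourInARow);
  return m ? m[2] : null;
});
console.log(runs); // [ 'AAAA', null, 'AAAA', 'AAAA' ]
```

Keep in mind that the full match (index 0) may include the one character in front of the run, which is why Group 2 is used.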
Another variant could be capturing the first char in a group 1.
Then assert that the 2 chars to the left of the current position (the captured char and the one before it) are not both equal to group 1, and match group 1 an additional 3 times, which gives a total of 4 of the same chars.
Then assert what is on the right is not group 1.
([A-Z])(?<!\1\1)\1{3}(?!\1)
([A-Z]) Capture group 1, match a single char A-Z
(?<!\1\1) Negative lookbehind, assert what is on the left is not 2 times group 1
\1{3} Match 3 times group 1
(?!\1) Assert what is on the right is not group 1
For example
const pattern = /([A-Z])(?<!\1\1)\1{3}(?!\1)/g;
[
  "3346AAAA44",
  "3973BBBBBB44",
  "9755BBBBBBAAAA44",
  "AAAA",
  "AAAAB",
  "BAAAAB"
].forEach((s) =>
  // With the g flag, match() returns an array of full matches, or null
  console.log(s + " --> " + s.match(pattern))
);

javascript regular expressions - groups

I'm currently studying regular expression groups. I'm having trouble fully understanding the first example presented in the book under groups. The book gives the following example:
/(\S+) (\S*) ?\b(\S+)/
I understand that this will match at most three words (consisting of any character except a white space), where the second word and space is optional.
What I have trouble understanding is the function of the boundary condition to start the match of the last group at the beginning of the third word.
When there are three words, it makes no difference whether it is included or not.
When there are only two words there is a difference between group #2 and group #3
So, my question is as follows
When there are two words, why is the presence of \b causing group#2 to be an empty string as expected, but when not present causes group #2 to contain the second word minus the last letter and group #3 to contain the last letter of the second word?
When there are two words, why is the presence of \b causing group#2 to be an empty string as expected
Look at the first and third groups - being (\S+), they must contain characters. When there are two words, those two words must go into the first and third group - the second group, since it's repeated with *, will not consume any characters, and will be the empty string.
but when not present causes group #2 to contain the second word minus the last letter and group #3 to contain the last letter of the second word?
When the pattern is
(\S+) (\S*) ?(\S+)
once the engine has matched the first word, the engine will start trying to match the second word. So if the input is foo bar, we can consider how the pattern (\S*) ?(\S+) works on bar.
The engine first tries to consume all remaining characters in the string with the \S*. This fails, because the last group is required to contain at least one character, so the engine backs up a step and has the \S* group match all but the last character. This results in a successful match, because the rest of the pattern (an optional space followed by (\S+)) does match at the position before the last character.
You can see this process visually here:
https://regex101.com/r/RAkEOt/1/debugger
In the first pattern, the word boundary before the last group ensures that the second group does not match any characters when there are only two words in the string. Rather than backtracking to just before the last character, the engine must back up all the way until a word boundary is found.
The original pattern may be slightly flawed: \b matches a word boundary, but not every non-space character is a word character. For example, it (probably undesirably) matches "foo it's" with it' going into the second group and s going into the third group.
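The points above can be verified with a short sketch. The helper name groups is made up for this example; it just returns the three captured groups.

```javascript
const withBoundary = /(\S+) (\S*) ?\b(\S+)/;
const withoutBoundary = /(\S+) (\S*) ?(\S+)/;

// Hypothetical helper: return only the captured groups of the first match.
const groups = (re, s) => s.match(re).slice(1);

// Two words: \b forces group 2 to stay empty.
console.log(groups(withBoundary, "foo bar")); // [ 'foo', '', 'bar' ]

// Two words, no \b: group 2 backtracks by just one character.
console.log(groups(withoutBoundary, "foo bar")); // [ 'foo', 'ba', 'r' ]

// The apostrophe creates a word boundary inside the second word.
console.log(groups(withBoundary, "foo it's")); // [ 'foo', "it'", 's' ]
```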
The difference comes from the second group (\S*) - it will capture any amount of non-whitespace characters. So, when you have two words but three groups where the last one is (\S+) - match at least one non-whitespace character, the regex engine will try to satisfy both group 2 and 3.
Remember that the engine is matching a pattern, and you've not told it not to match like that. The second group's \S* is greedy, so it initially matches everything, grabbing brown. The next part of the pattern is an optional space, which passes by matching nothing. Then the engine gets to the final group \S+, and since that requires at least one character, the second group gives back characters one at a time until group 3 is satisfied.
You can see this here - I've defined the third group to have at least two mandatory characters, hence it only gets two:
let [ , group1, group2, group3] = "the brown".match(/(\S+) (\S*) ?(\S{2,})/);
console.log("group 1:", group1)
console.log("group 2:", group2)
console.log("group 3:", group3)
When you instead add the word boundary \b to the pattern, you cannot have group 2 have any characters and satisfy the later condition - when a regex consumes a character the rest of the pattern will only continue from that character onward, hence you cannot have, for example group 2 match b and then have a word boundary followed by rown. The only way that (\S+) (\S*) ?\b(\S+) can be satisfied is the following:
group 1 matches the
the space character is matched
group 2 matches nothing, which is acceptable as it can match any amount, including zero
the optional space matches zero spaces
there is a word boundary
group 3 consumes the rest of the letters - brown

JavaScript Regular Expressions Basics

I'm trying to learn Regular Expressions and at the moment I've gathered a very basic understanding from all of the overviews from W3, Mozilla or http://www.regular-expressions.info/, but when I was exploring this wikibook http://en.wikibooks.org/wiki/JavaScript/Regular_Expressions it gave this example:
"abbc".replace(/(.)\1/g, "$1") => "abc"
which I have no idea why is true (the wikibook didn't really explain), but I tried it myself and it does drop the second b. I know \1 is a backreference to the captured group (.), but . is the any character besides a new line symbol... Wouldn't that still pick up the second b? Trying a few variations didn't clear things up either...
"abbc".replace(/(.)/g, "$1") => "abbc"
"aabc".replace(/(.)*/g, "$1") => "c"
Does anybody have a good in depth tutorial on Javascript Regular Expressions (I've looked at a couple of books and they're very generalized for about 15 languages and no real emphasis on Javascript).
First One
(.) matches and captures a single character to Group 1, so (.)\1 matches two of the same characters, for instance AA.
In the string, the only match for this pattern is bb.
By replacing these two characters bb by the Group 1 capture buffer $1, i.e. b, we replace two chars with one, effectively removing one b.
Second One
Again (.) matches and captures a single character, capturing it to Group 1.
The pattern matches each character in the string in turn.
The replacement is the Group 1 capture buffer $1, so we replace each character with itself. Therefore the string is unchanged.
Third One
Here, forgetting the parentheses for a moment, .* matches the whole string: this is the match.
The quantifier * means that the Group 1 is reset every time a single character is matched (new group numbers are not created, as group numbering is done from left to right).
For every character that is matched, that character is therefore captured to Group 1—until the next capture resets Group 1.
The end value of Group 1 is the last capture, which is the last character c.
We replace the match (i.e., the whole string) with Group 1 (i.e. c), so the replacement string is c.
The details of group numbering are important to grasp, and I highly recommend you read the linked article about "the gory details".
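For reference, the three cases can be run side by side. This is only a sketch; the exec call is added to show the match and the final Group 1 value for the third pattern.

```javascript
// First one: bb is matched and replaced by the single captured b.
const dedupe = "abbc".replace(/(.)\1/g, "$1");
console.log(dedupe); // abc

// Second one: every character is replaced by itself.
const unchanged = "abbc".replace(/(.)/g, "$1");
console.log(unchanged); // abbc

// Third one: the whole string is one match and Group 1 ends as "c".
// The trailing empty match has no Group 1, so it contributes nothing.
const lastOnly = "aabc".replace(/(.)*/g, "$1");
console.log(lastOnly); // c

const firstMatch = /(.)*/.exec("aabc");
console.log(firstMatch[0], firstMatch[1]); // aabc c
```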
Reference
Capture Group Numbering & Naming: The Gory Details
JavaScript Regex Basics
Backreferences
This is quite simple when broken down:
With "abbc".replace(/(.)\1/g, "$1"), the result is "abc" because:
(.) matches and captures one character.
\1 is a backreference to that first capture group.
So what it says is "find 2 times the same letter" and replace it with the reference. So any doubled character would match and be replaced by the reference.
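A couple of extra inputs (chosen here just for illustration) show that any doubled character is collapsed, and that matches are pairs, so a run of three keeps two characters.

```javascript
// Every doubled letter is reduced to a single one.
const collapsed = "bookkeeper".replace(/(.)\1/g, "$1");
console.log(collapsed); // bokeper

// "aaa" contains one pair plus a leftover a, so two a's remain.
const triple = "aaa".replace(/(.)\1/g, "$1");
console.log(triple); // aa
```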

HTML5 Pattern for allowing only 5 comma separated words without including a space

I basically don't know how to write HTML5 patterns, which is why I am asking this question. I want to create a pattern to apply to an input[type=text] through the pattern attribute newly introduced by HTML5, but I have no idea how to implement it.
The pattern includes the following things:
Allow only 5 comma separated words
No space could be added.
^(\w+,){4}\w+$ is the pattern you need: repeat "any number of word characters followed by comma" four times, then "just word characters". If you want "up to five", the solution would be
^(\w+,){0,4}\w+$
Detailed explanation (adapted from http://www.regex101.com):
^ assert position at start of the string (i.e. start matching from start of string)
1st Capturing group (\w+,){0,4}
Quantifier {0,4}: Between 0 and 4 times, as many times as possible, giving back as needed [greedy]
\w+ match any word character [a-zA-Z0-9_]
Quantifier +: Between one and unlimited times, as many times as possible, giving back as needed [greedy]
, matches the character , literally
\w+ match any word character [a-zA-Z0-9_]
Quantifier +: Between one and unlimited times, as many times as possible, giving back as needed [greedy]
$ assert position at end of the string
If you don't want numbers as part of the match, replace every instance of \w in the above with [a-zA-Z] - it will match only lowercase and uppercase letters A through Z.
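Before putting the pattern into the pattern attribute, it can be tried out in plain JavaScript. This is a sketch with made-up test values; the ^ and $ anchors are kept as written above.

```javascript
const exactlyFive = /^(\w+,){4}\w+$/;
const upToFive = /^(\w+,){0,4}\w+$/;

console.log(exactlyFive.test("a,b,c,d,e"));  // true
console.log(exactlyFive.test("a,b,c,d"));    // false (only four words)
console.log(exactlyFive.test("a,b, c,d,e")); // false (contains a space)

console.log(upToFive.test("a"));             // true
console.log(upToFive.test("a,b,c,d,e"));     // true
console.log(upToFive.test("a,b,c,d,e,f"));   // false (six words)
```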
UPDATE In response to your comment, if you want no group of characters to be more than 10 long, you can modify the above expression to
^(\w{1,10},){0,4}\w{1,10}$
Now the "quantifier" is {1,10} instead of + : that means "between 1 and 10 times" instead of "1 or more times".
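The length limit can be checked the same way. Again, this is a sketch and the sample values are invented.

```javascript
const limitedWords = /^(\w{1,10},){0,4}\w{1,10}$/;

console.log(limitedWords.test("short,words"));    // true
console.log(limitedWords.test("abcdefghij"));     // true  (exactly 10 characters)
console.log(limitedWords.test("abcdefghijk"));    // false (11 characters)
console.log(limitedWords.test("ok,abcdefghijk")); // false (second word too long)
```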
