JavaScript Regular Expressions Basics - javascript

I'm trying to learn Regular Expressions and at the moment I've gathered a very basic understanding from all of the overviews from W3, Mozilla or http://www.regular-expressions.info/, but when I was exploring this wikibook http://en.wikibooks.org/wiki/JavaScript/Regular_Expressions it gave this example:
"abbc".replace(/(.)\1/g, "$1") => "abc"
which I have no idea why is true (the wikibook didn't really explain), but I tried it myself and it does drop the second b. I know \1 is a backreference to the captured group (.), but . is the any character besides a new line symbol... Wouldn't that still pick up the second b? Trying a few variations didn't clear things up either...
"abbc".replace(/(.)/g, "$1") => "abbc"
"aabc".replace(/(.)*/g, "$1") => "c"
Does anybody have a good in depth tutorial on Javascript Regular Expressions (I've looked at a couple of books and they're very generalized for about 15 languages and no real emphasis on Javascript).

First One
(.) matches and captures a single character to Group 1, so (.)\1 matches two of the same characters, for instance AA.
In the string, the only match for this pattern is bb.
By replacing these two characters bb with the Group 1 capture buffer $1, i.e. b, we replace two chars with one, effectively removing one b.
Second One
Again (.) matches and captures a single character, capturing it to Group 1.
The pattern matches each character in the string in turn.
The replacement is the Group 1 capture buffer $1, so we replace each character with itself. Therefore the string is unchanged.
Third One
Here, forgetting the parentheses for a moment, .* matches the whole string: this is the match.
The quantifier * means that Group 1 is reset every time a single character is matched (new group numbers are not created, as group numbering is done from left to right).
For every character that is matched, that character is therefore captured to Group 1, until the next capture resets Group 1.
The end value of Group 1 is the last capture, which is the last character c.
We replace the match (i.e., the whole string) with Group 1 (i.e. c), so the replacement string is c.
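The three cases above can be checked in a console. This is a minimal sketch that uses only the strings from the question; exec is used (without the g flag) simply to expose what Group 1 holds after the first match.

```javascript
// Case 1: (.)\1 matches a character followed by the same character.
const first = /(.)\1/.exec("abbc");
console.log(first[0], first[1]); // bb b
console.log("abbc".replace(/(.)\1/g, "$1")); // abc

// Case 2: each character is matched on its own and replaced with itself.
const second = "abbc".replace(/(.)/g, "$1");
console.log(second); // abbc

// Case 3: the match is the whole string; Group 1 keeps only the last capture.
const third = /(.)*/.exec("aabc");
console.log(third[0], third[1]); // aabc c
console.log("aabc".replace(/(.)*/g, "$1")); // c
```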
The details of group numbering are important to grasp, and I highly recommend you read the linked article about "the gory details".
Reference
Capture Group Numbering & Naming: The Gory Details
JavaScript Regex Basics
Backreferences

This is quite simple when broken down:
With "abbc".replace(/(.)\1/g, "$1"), the result is "abc" because:
(.) matches and captures one character.
\1 is a backreference to that first captured group.
So what it says is "find 2 times the same letter" and replace it with the reference. So any doubled character would match and be replaced by the reference.
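To see what "any doubled character" means in practice, here is a small sketch. The helper names collapsePairs and collapseRuns and the input strings are made up for illustration; the second pattern, (.)\1+, is a variation that is not in the question.

```javascript
// Replaces each pair of identical adjacent characters with one copy.
const collapsePairs = (s) => s.replace(/(.)\1/g, "$1");
// Variation: \1+ swallows the whole run, not just one repeat.
const collapseRuns = (s) => s.replace(/(.)\1+/g, "$1");

console.log(collapsePairs("aabbc"));  // abc
console.log(collapsePairs("aaa"));    // aa (only the first pair is collapsed)
console.log(collapseRuns("aaabbbc")); // abc
```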

Related

Regex Javascript Capture groups with quantifier Not Working

I have this nice regex:
*(?:(?:([0-9]+)(?:d| ?days?)(?:, ?| )?)|(?:([0-9]+)(?:h| ?hours?)(?:, ?| )?)|(?:([0-9]+)(?:m| ?minutes?)(?:, ?| )?)|(?:([0-9]+)(?:s| ?seconds?)(?:, ?| )?))+
that pretty much matches a human-readable time-delta. It works in PHP, Python, and Go, but for some reason the capture groups do not work in JavaScript. Here is a working PHP example on regex101 that shows the working capture groups. You will notice that upon changing it to JavaScript (ECMAScript) mode, the capture groups will only capture the last value. Can somebody please help and clarify what I am doing wrong, and why it doesn't work in JS?
Here's a simpler example that demonstrates the issue:
console.log(
'34'.match(/(?:(3)|(4))+/)
);
In PHP, whenever a capture group is matched, it will be put into the result. In contrast, in JavaScript, things are more complicated: when there are capturing groups on one side of an alternation |, whenever the whole alternation token is entered, there are 2 possibilities:
The alternation that is taken contains the capture group, and the result will have the capture group index set to the matched value
The alternation that is taken does not contain the capture group, in which case the result will have undefined assigned to that index - even if the capturing group was matched previously.
This is described in the specification:
Any capturing parentheses inside a portion of the pattern skipped by | produce undefined values instead of Strings.
and
Step 4 of the RepeatMatcher clears Atom's captures each time Atom is repeated.
because each iteration of the outermost * clears all captured Strings contained in the quantified Atom
In your case, the easiest tweak to fix it would be to remove the repeating outermost group, so that only one subsequence is matched at a time, e.g. 1m, and then 1d, then iterate through the matches, instead of trying to match everything all in one go. To ensure that all the matches are next to each other (e.g. 1m1d, and not 1m 1d), check the index while iterating through the matches to see if it's next to a previous match or not.
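A rough sketch of that approach follows. It is not the original pattern: the four alternatives are folded into one unit group, the longer spellings are listed before the one-letter forms, and parseDelta is a hypothetical helper name. Stopping at the first gap between matches is one possible reading of "check the index".

```javascript
// One unit per match; there is no outer repeated group, so captures are not cleared.
// Longer spellings come first so that "days" is not cut short at "d".
const UNIT = /([0-9]+)( ?days?|d| ?hours?|h| ?minutes?|m| ?seconds?|s)(?:, ?| )?/g;

// Hypothetical helper: sums the amounts per unit for a run of adjacent matches.
function parseDelta(input) {
  const totals = { d: 0, h: 0, m: 0, s: 0 };
  let expectedIndex = null;
  for (const match of input.matchAll(UNIT)) {
    // Stop at the first gap: this match does not start where the last one ended.
    if (expectedIndex !== null && match.index !== expectedIndex) break;
    const unit = match[2].trim()[0]; // "d", "h", "m" or "s"
    totals[unit] += Number(match[1]);
    expectedIndex = match.index + match[0].length;
  }
  return totals;
}

console.log(parseDelta("2 days, 3h 15minutes")); // { d: 2, h: 3, m: 15, s: 0 }
```

Because each match carries exactly one number and one unit, nothing depends on how the engine treats captures inside a repeated alternation.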

Why is this second backreference not working?

I want to make use of backreferences as much as possible, avoiding the duplication of combinations of many patterns.
Other requirements: Use less literals without constructing new RegExp while maintaining generality.
Original title: Why is this negative lookahead with capturing group not working?
For example, a string:
1.'2.2'.33.'4.4'.5.(…etc)
I want to match the parts separated by periods, except that quoted parts must not be split and their quotes should be dropped. That is, to match:
1, 2.2, 33, 4.4, 5, (…etc).
A working regex is:
(?<=(["'])(?!\.)).*?(?=\1)|((?!["']|\.).)+
console.log(
"1.'2.2'.33.'4.4'.5.(…etc)".match(
/(?<=(["'])(?!\.)).*?(?=\1)|((?!["']|\.).)+/g
)
)
A non-working one is:
(?<=(["'])(?!\.)).*?(?=\1)|((?!\1|\.).)+
^^
console.log(
"1.'2.2'.33.'4.4'.5.(…etc)".match(
/(?<=(["'])(?!\.)).*?(?=\1)|((?!\1|\.).)+/g
)
)
— it does not match 1, 33, 5, (…etc).
Why is it (\1←^^) non-working and how to correct it? Thank you!
The main point of confusion seems to be that backreferences are not like "regex subroutines"; they don't let you reuse parts of the pattern elsewhere. What they do is they let you match the exact string that was matched before again.
For example:
console.log(/(\w)\1/.test('AB'));
console.log(/(\w)\1/.test('AA'));
console.log(/(\w)\1/.test('BB'));
(\w)\1 does not match AB, but it does match AA and BB. The \1 part only matches the exact string that was matched by the (\w) group before.
In your case,
(?<=(["'])(?!\.)).*?(?=\1)
|
((?!\1|\.).)+
there are two branches separated by |. The second branch contains a backreference (\1) to a capturing group in the first branch ((["'])).
This can never match because the second branch is only tried if the first branch failed to match anything, but in that case the first capturing group also failed to match anything, so what string would \1 refer to?
If the capturing group referred to by a backreference never matched anything, browsers behave as if it were the empty string.
The empty string always matches, so (?!\1) always fails.
console.log(
"1.'2.2'.33.'4.4'.5.(…etc)".match(
/(["'])[\d.]+\1|\d+/g
)
)
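The empty-string behaviour described above can be checked in isolation. The patterns and the input "b" below are made up for illustration and are much smaller than the ones in the question.

```javascript
// Group 1 never participates when the second branch is taken,
// so \1 matches the empty string and the overall match still succeeds.
const unset = /(a)|b\1/.exec("b");
console.log(unset[0], unset[1]); // b undefined

// A negative lookahead on an unset backreference can never succeed.
const viaBackref = /(a)|(?!\1)b/.test("b");
// Writing the literal out instead of \1 behaves as intended.
const viaLiteral = /(a)|(?!a)b/.test("b");
console.log(viaBackref, viaLiteral); // false true
```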

Forcing a Strict Character Order in a Regex Expression

I'm trying to create a regex in Javascript that has a limited order the characters can be placed in, but I'm having trouble getting the validation to be fully correct.
The criteria for the expression is a little complicated. The user must input strings with the following criteria:
The string contains two parts, an initial group, and an end group.
The groups are separated by a colon (:).
Strings are separated by a semi-colon (;).
The initial group can start with one optional forward-slash and end with one optional forward-slash, but these forward-slashes may not appear anywhere else in the group.
Inside forward-slashes, one optional underscore may appear on either end, but they may not appear anywhere else in the group.
Inside these optional elements, the user may enter any number of numbers or letters, uppercase or lowercase, but exactly one of these characters must be surrounded with angular brackets (<>).
If the letter inside the brackets is an uppercase C, it may be followed by one of a lowercase u or v.
The end group may contain one or more of a number or letter, uppercase or lowercase (If it is an uppercase C, it can be followed by a lowercase u or v.) or one asterisk (*), but not both.
A string must be able to validate with multiple groupings.
This probably sounds a little confusing.
For example, the following examples are valid:
<C>:Cu;
<Cu>:Cv;
/_V<C>V:C;
/_VV<Cv>VV_/:Cu;
_<V>:V1;
_<V>_:V1;
_<V>/:V1;
_<V>:*;
_<m>:n;
The following are invalid:
Cu:Cv;
Cu:Cv
CuCv;
<Cu/>:Cv;
<Cu_>:Cv;
<Cu>:Cv/;
_/<Cu>:Cv;
<Cu>/_:Cv;
They should validate when grouped together like so.
<Cu>:Cv;/_V<C>V:C;_<V>:V1;_<V>/:V1;_<V>:*;_<m>:n;
Hopefully, these examples help you understand what I'm trying to match.
I created the following regexp and tested it on Regex101.com, but this is the closest I could come:
\\/{0,1}_{0,1}[A-Za-z0-9]{0,}<{1}[A-Za-z0-9]{1,2}>{1}[A-Za-z0-9]{0,}_{0,1}\\/{0,1}):([A-Za-z0-9]{1,2}|\\*;$
It's mostly correct, but it allows strings that should be invalid such as:
_/<C>:C;
If an underscore comes before the first forward-slash, it should be rejected. Otherwise, my regexp seems to be correct for all other cases.
If anyone has any suggestions on how to fix this, or knows of a way to match all criteria much more efficiently, any help is appreciated.
The following seems to fulfill all the criteria:
(?:^|;)(\/?_?[a-zA-Z0-9]*<(?:[a-zA-Z]|C[uv]?)>[a-zA-Z0-9]*_?\/?):([a-zA-Z0-9]+|\*)(?=;|$)
It puts each of the "groups" in a capturing group so you can access them individually.
Details:
(?:^|;) A non-capturing group to make sure the string is either at the beginning or starts with a semicolon.
( Start of group 1.
\/?_? An optional forward-slash followed by an optional underscore.
[a-zA-Z0-9]* Any letter or number - Matches zero or more.
<(?:[a-zA-Z]|C[uv]?)> Mandatory <> pair containing one letter or the capital letter C followed by a lowercase u or v.
[a-zA-Z0-9]* Any letter or number - Matches zero or more.
_?\/? An optional underscore followed by an optional forward-slash.
) End of group1.
: Matches a colon character literally.
([a-zA-Z0-9]+|\*) Group 2 - containing one or more numbers or letters or a single * character.
(?=;|$) A positive Lookahead to make sure the string is either followed by a semicolon or is at the end.
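As a quick check of the pattern above, the sketch below runs it with the g flag over the grouped example from the question and reads the two capturing groups. The names ENTRY, initial and end are made up here.

```javascript
// The pattern from the answer, with the g flag added so matchAll can be used.
const ENTRY = /(?:^|;)(\/?_?[a-zA-Z0-9]*<(?:[a-zA-Z]|C[uv]?)>[a-zA-Z0-9]*_?\/?):([a-zA-Z0-9]+|\*)(?=;|$)/g;

const input = "<Cu>:Cv;/_V<C>V:C;_<V>:V1;_<V>/:V1;_<V>:*;_<m>:n;";
const entries = [...input.matchAll(ENTRY)].map((m) => ({ initial: m[1], end: m[2] }));

console.log(entries.length); // 6
console.log(entries[1]);     // { initial: '/_V<C>V', end: 'C' }

// The case the question asked about: underscore before the leading slash.
const rejected = [..."_/<C>:C;".matchAll(ENTRY)];
console.log(rejected.length); // 0
```

Note that the pattern searches for valid entries rather than anchoring the whole input, so an invalid entry sitting between valid ones is skipped, not reported.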
Did you mean this?
/^(?:(^|\s*;\s*)(?:\/_|_)?[a-z]*<[a-z]+>[a-z]*_?\/?:(?:[a-z0-9]+|\*)(?=;))+;$/i
We start with a case-insensitive expression /.../i to keep it more readable. You have to rewrite it to a case-sensitive expression if you only want to allow uppercase at the beginning of a word.
^ means the beginning of the string. $ means the end of the string.
The whole string ends with ';' after multiple repetitions of the inner expression (?:...)+ where + means 1 or more occurrences. ;$ at the end includes the last semicolon in the result. It is not necessary for a test only, since the look-ahead already does the job.
(^|\s*;\s*) every part is at the beginning of the string or after a semicolon surrounded by arbitrary whitespace, including linefeeds. Use \n if you do not want to allow spaces and tabs.
(?:...|...) is a non-captured alternative. ? after a character or group is the quantifier 0/1 - none or once.
So (?:\/_|_)? means '/_', '_' or nothing. Use \/?_? if you do want to allow strings starting with a single slash as well.
[a-z]*<[a-z]+>[a-z]* 0 or more letters followed by <...> with at least one letter inside and again followed by 0 or more letters.
_?\/?: optional '_', optional '/', mandatory : in this sequence.
(?:[a-z0-9]+|\*) The part after the colon contains letters and numbers or the asterisk.
(?=;) Look-ahead: Every group must be followed by a semicolon. Look-ahead conditions do not move the search position.
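Since this expression is anchored at both ends, it can validate a whole input with test. A small sketch, using strings from the question plus one made-up input ("/<C>:C;") to show the lone-slash case mentioned above; WHOLE is a made-up name.

```javascript
const WHOLE = /^(?:(^|\s*;\s*)(?:\/_|_)?[a-z]*<[a-z]+>[a-z]*_?\/?:(?:[a-z0-9]+|\*)(?=;))+;$/i;

const grouped = "<Cu>:Cv;/_V<C>V:C;_<V>:V1;_<V>/:V1;_<V>:*;_<m>:n;";
console.log(WHOLE.test(grouped));    // true
console.log(WHOLE.test("_/<C>:C;")); // false (underscore before the slash)
console.log(WHOLE.test("<Cu>:Cv"));  // false (no trailing semicolon)
console.log(WHOLE.test("/<C>:C;"));  // false (lone leading slash is not allowed by (?:\/_|_)?)
```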

Understanding the usage of \b in regex that matches multiple strings

I just found the below regex online while browsing:
(?:^|\b)(bitcoin atm|new text|bitcoin|test a|test)(?!\w)
I was just curious to know what is the advantage of using (?:^|\b) here ?
I understand that basically (?:) means it is a non-capturing group, but I am a bit stumped by ^|\b in this particular parenthesis. Here I understand that ^ basically means assert start of string.
The examples of \b on MDN gave me a fair understanding of what \b does, but I am still not able to put things into context based on the example I have provided.
The (?:^|\b) is a non-capturing group that contains 2 alternatives both of which are zero-width assertions. That means, they just match locations in a string, and thus do not affect the text you get in the match.
Besides, as the next subpattern matches b, n or t as the first char (a word char) the \b (a word boundary) in the first non-capturing group will also match them in the beginning of a string, making ^ (start of string anchor) alternative branch redundant here.
Thus, you may safely use
\b(bitcoin atm|new text|bitcoin|test a|test)(?!\w)
and even
\b(bitcoin atm|new text|bitcoin|test a|test)\b
since the alternatives end with a word char here.
If the alternatives in the (bitcoin atm|new text|bitcoin|test a|test) group are user-defined, dynamic, and can start or end with a non-word char, then the (?:^|\b) and (?!\w) regex patterns make sense, but the result would not be precise: (?:^|\b)\.txt(?!\w) will not match .txt as a whole word, because \b then requires a word char before the dot. I would use (?:^|\W) rather than (?:^|\b).
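The points above can be tried out directly. The input strings below are made up for illustration; the patterns are the ones discussed in this answer.

```javascript
const WORDS = /\b(bitcoin atm|new text|bitcoin|test a|test)\b/;
console.log(WORDS.exec("buy bitcoin atm now")[1]); // bitcoin atm
console.log(WORDS.test("bitcoins"));               // false

// With an alternative that starts with a non-word char, \b gives surprising results.
const withBoundary = /(?:^|\b)\.txt(?!\w)/;
const withNonWord = /(?:^|\W)\.txt(?!\w)/;
console.log(withBoundary.test("open .txt")); // false (no boundary between space and dot)
console.log(withBoundary.test("a.txt"));     // true  (boundary between "a" and the dot)
console.log(withNonWord.test("open .txt"));  // true
console.log(withNonWord.test("a.txt"));      // false
```

One thing to keep in mind with (?:^|\W): the non-word character is consumed, so it becomes part of the overall match.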

Difference between (\w)* and \w?

I'm trying to study regexes, and I came upon this confusing scenario:
Suppose you have the text:
hello world
If you run the regex (\w)*, it gives:
['hello', 'o']
What I expected was:
['hello', 'h']
Doesn't \w mean any word character?
Another example:
Text:
Delicious cake
(\w)* output:
['Delicious', 's']
What I expected:
['Delicious', 'D']
'*' matches the preceding part zero or more times and binds tightly to the element on its left.
Example: m*o will match o, mo, mmo, mmmmo and so on.
Parentheses () are used to mark sub-expressions, also called capture groups.
So (\w)* is a repeated capturing group.
Sam, the reason why (\w)* returns "s" in Group 1 against "delicious" is that there can only be one Group 1. Each time a new character is matched by (\w), the parentheses force the new value of the character to be captured into Group 1. "s" is the last character, so it is the final Group 1 reported to you by the engine.
If you wanted to capture the first letter into Group 1 instead, you could go with something like:
(\w)\w*
This causes the first character to be captured. There is no quantifier on the capturing parentheses, so Group 1 doesn't change. The remaining \w* optionally match any additional characters.
Also please note that when you run (\w)* against "hello world", the matches are not "hello" and "o" as you stated. The matches (if you match them all) are "hello" and "world". The Group 1 captures are "o" and "d", the last letters of each word.
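A short sketch of the difference, using the text from the question. The matchAll line with the g flag is an addition here, to show one capture per word.

```javascript
const text = "hello world";

// Quantified group: Group 1 is overwritten on every iteration.
const repeated = /(\w)*/.exec(text);
console.log(repeated[0], repeated[1]); // hello o

// Unquantified group followed by \w*: Group 1 keeps the first letter.
const firstOnly = /(\w)\w*/.exec(text);
console.log(firstOnly[0], firstOnly[1]); // hello h

// With the g flag, every word yields its own first-letter capture.
const initials = [...text.matchAll(/(\w)\w*/g)].map((m) => m[1]);
console.log(initials); // [ 'h', 'w' ]
```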
Reference: All about capture
Remember, a repeated capturing group always captures the last iteration.
So.
(\w)* on hello will check one character at a time until it reaches the last match.
Thus will get o in the capture group.
(\w)* on helloworld will check one character at a time until it reaches the last match.
Thus will get d in the capture group.
(\w)* on hello123 will check one character at a time until it reaches the last match.
Thus will get 3 in the capture group.
(\w)* on helloworld#3w4 will check one character at a time until it reaches the last match. Thus will get d in the capture group, since # is not a valid word character (only [_0-9a-zA-Z] allowed).
(\w)*
Match the regular expression below and capture its match into backreference number 1 «(\w)*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Note: You repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations. «*»
Match a single character that is a “word character” (letters, digits, and underscores) «\w»
Will give you two matches:
hello
world
\w
Match a single character that is a “word character” (letters, digits, and underscores) «\w»
Will match every character (individually) on the sentence:
h
e
l
l
o
w
o
r
l
d
\w is a RegEx shortcut for [_a-zA-Z0-9] which means any letter, digit, or an underscore.
When you add an asterisk * after anything, it means it can appear from 0 to unlimited times.
If you want to match all the letters in your input, use \w
If you want to match whole words in your input, use \w+ (use + and not * since a word has at least one letter)
Also, when you surround part of your RegEx with parentheses, it becomes a capture group, which means it will appear in your results. That is why (\w)* is different from (\w*)
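The suggestions above can be compared side by side. A minimal sketch on the text from the question; the variable name sample is made up.

```javascript
const sample = "hello world";

console.log(sample.match(/\w/g).length); // 10 (one match per word character)
console.log(sample.match(/\w+/g));       // [ 'hello', 'world' ]

// Quantifier outside the group: Group 1 ends up holding the last iteration.
console.log(/(\w)*/.exec(sample)[1]); // o
// Quantifier inside the group: Group 1 holds everything that was repeated.
console.log(/(\w*)/.exec(sample)[1]); // hello
```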
Useful RegEx sites:
RegexPal
Debuggex
