What am I doing wrong with my regex? - javascript

I am trying to capture "Rio Grande Do Leste" from:
...
<h1>Rio Grande Do Leste<br />
...
using
var myregexp = /<h1>()<br/;
var nomeAldeiaDoAtaque = myregexp.exec(document);
what am I doing wrong?
update:
2 questions remain:
1) searching (document) didn't produce any result, but changing it to (document.body.innerHTML) worked. Why is that?
2) I had to change it to: myregexp.exec(document.body.innerHTML)[1]; to get what I want, otherwise it would give me some result which includes <h1>. Why is that?
3) (answered) why do I need to use ".*"? I thought it would collect anything between ()?

Try /<h1>(.*?)<br/.
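As a rough illustration of the suggestion above, and of the two follow-up questions in the update, the sketch below runs both patterns against a string. The `html` value is a made-up stand-in for `document.body.innerHTML`, and the plain object is a stand-in for `document`; both are assumptions so that the snippet runs outside a browser.

```javascript
// Hypothetical stand-in for document.body.innerHTML (not taken from the real page).
const html = '<div>\n<h1>Rio Grande Do Leste<br />\n</h1>\n</div>';

const emptyGroup = /<h1>()<br/;    // the pattern from the question
const lazyGroup = /<h1>(.*?)<br/;  // the suggested fix

// "<h1>" is not immediately followed by "<br" here, so the empty group cannot match.
const withEmptyGroup = emptyGroup.exec(html);

// exec() returns an array: index 0 is the whole match, index 1 is group 1.
const withLazyGroup = lazyGroup.exec(html);
console.log(withEmptyGroup);    // null
console.log(withLazyGroup[0]);  // <h1>Rio Grande Do Leste<br
console.log(withLazyGroup[1]);  // Rio Grande Do Leste

// exec() converts a non-string argument to a string before searching.
// A plain object becomes "[object Object]"; a browser document becomes a
// similar "[object ...]" string, which contains no <h1> markup at all.
const fakeDocument = {};  // stand-in for `document`
const coerced = String(fakeDocument);
const withObject = lazyGroup.exec(fakeDocument);
console.log(coerced);     // [object Object]
console.log(withObject);  // null
```

This is why `[1]` is needed to get only the captured text, and why passing the document object itself finds nothing.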

On capturing group
A capturing group attempts to capture what it matches. This has some important consequences:
A group that matches nothing, can never capture anything.
A group that only matches an empty string, can only capture an empty string.
A group that captures repeatedly in a match attempt only gets to keep the last capture
Generally true for most flavors, but .NET regex is an exception (see related question)
Here's a simple pattern that contains 2 capturing groups:
(\d+) (cats|dogs)
\___/ \_________/
  1        2
Given i have 16 cats, 20 dogs, and 13 turtles, there are 2 matches (as seen on rubular.com):
16 cats is a match: group 1 captures 16, group 2 captures cats
20 dogs is a match: group 1 captures 20, group 2 captures dogs
Now consider this slight modification on the pattern:
(\d)+ (cats|dogs)
\__/  \_________/
 1        2
Now group 1 matches \d, i.e. a single digit. In most flavors, a group that matches repeatedly (thanks to the + in this case) only gets to keep the last match. Thus, in most flavors, only the last digit that was matched is captured by group 1 (as seen on rubular.com):
16 cats is a match: group 1 captures 6, group 2 captures cats
20 dogs is a match: group 1 captures 0, group 2 captures dogs
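The same comparison can be reproduced in JavaScript. This is a small sketch using `String.prototype.matchAll`; the variable names are illustrative only.

```javascript
const input = 'i have 16 cats, 20 dogs, and 13 turtles';

// Quantifier inside the group: the group captures the whole run of digits.
const wholeNumber = [...input.matchAll(/(\d+) (cats|dogs)/g)]
  .map((m) => [m[1], m[2]]);

// Quantifier outside the group: the group is re-captured on every repetition,
// so only the last digit survives.
const lastDigit = [...input.matchAll(/(\d)+ (cats|dogs)/g)]
  .map((m) => [m[1], m[2]]);

console.log(wholeNumber); // [ [ '16', 'cats' ], [ '20', 'dogs' ] ]
console.log(lastDigit);   // [ [ '6', 'cats' ], [ '0', 'dogs' ] ]
```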
References
regular-expressions.info/Use Round Brackets for Capturing
Is there a regex flavor that allows me to count the number of repetitions matched by * and +?
.NET regex keeps intermediate captures!
On greedy vs reluctant vs negated character class
Now let's consider the problem of matching "everything between A and ZZ". As it turns out, this specification is ambiguous: we will come up with 3 patterns that do this, and they will yield different matches. Which one is "correct" depends on the expectation, which is not properly conveyed in the original statement.
We use the following as input:
eeAiiZooAuuZZeeeZZfff
We use 3 different patterns:
A(.*)ZZ yields 1 match: AiiZooAuuZZeeeZZ (as seen on ideone.com)
This is the greedy variant; group 1 matched and captured iiZooAuuZZeee
A(.*?)ZZ yields 1 match: AiiZooAuuZZ (as seen on ideone.com)
This is the reluctant variant; group 1 matched and captured iiZooAuu
A([^Z]*)ZZ yields 1 match: AuuZZ (as seen on ideone.com)
This is the negated character class variant; group 1 matched and captured uu
Here's a visual representation of what they matched:
         ___n
        /   \              n = negated character class
eeAiiZooAuuZZeeeZZfff      r = reluctant
  \_________/r   /         g = greedy
   \____________/g
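For readers who prefer to check this in JavaScript rather than on ideone.com, here is a minimal sketch of the three variants run on the same input. The variable names are illustrative.

```javascript
const input = 'eeAiiZooAuuZZeeeZZfff';

const greedy = input.match(/A(.*)ZZ/);       // takes as much as possible
const reluctant = input.match(/A(.*?)ZZ/);   // takes as little as possible
const negated = input.match(/A([^Z]*)ZZ/);   // group may not contain any Z

console.log(greedy[0], greedy[1]);       // AiiZooAuuZZeeeZZ iiZooAuuZZeee
console.log(reluctant[0], reluctant[1]); // AiiZooAuuZZ iiZooAuu
console.log(negated[0], negated[1]);     // AuuZZ uu
```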
See related question for a more in-depth treatment on the difference between these 3 techniques.
Related questions
Difference between .*? and .* for regex
Greedy vs reluctant vs negated character class, detailed explanation with illustrative examples
Going back to the question
So let's go back to the question and see what's wrong with pattern:
<h1>()<br
    \/
    1
Group 1 matches the empty string, therefore the whole pattern overall can only match <h1><br, and group 1 can only capture the empty string.
One can try to "fix" this in many different ways. The 3 obvious ones to try are:
<h1>(.*)<br; greedy
<h1>(.*?)<br; reluctant
<h1>([^<]*)<br; negated character class
You will find that none of the above "work" all the time; there will be problems with some HTML. This is to be expected: regex is the "wrong" tool for the job. You can try to make the pattern more and more complicated, to get it "right" more often and "wrong" less often. More than likely you'll end up with a horrible mess that no one can understand and/or maintain, and it still probably won't work "right" 100% of the time.
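To make the "none of these work all the time" point concrete, here is a small sketch that applies the three candidate fixes to two invented HTML fragments. Both fragments and the `capture` helper are assumptions made for illustration, not markup from the asker's page.

```javascript
// Hypothetical helper: return group 1, or null when the pattern does not match.
const capture = (regex, text) => {
  const m = regex.exec(text);
  return m ? m[1] : null;
};

const greedy = /<h1>(.*)<br/;
const reluctant = /<h1>(.*?)<br/;
const negated = /<h1>([^<]*)<br/;

// Invented fragment with two <br /> tags inside the heading.
const twoBreaks = '<h1>Rio Grande<br />Do Leste<br /></h1>';
// Invented fragment with another tag before the <br />.
const nestedTag = '<h1><b>Rio</b> Grande<br /></h1>';

const results = {
  greedyTwoBreaks: capture(greedy, twoBreaks),       // 'Rio Grande<br />Do Leste'
  reluctantTwoBreaks: capture(reluctant, twoBreaks), // 'Rio Grande'
  negatedTwoBreaks: capture(negated, twoBreaks),     // 'Rio Grande'
  greedyNested: capture(greedy, nestedTag),          // '<b>Rio</b> Grande'
  reluctantNested: capture(reluctant, nestedTag),    // '<b>Rio</b> Grande'
  negatedNested: capture(negated, nestedTag),        // null
};
console.log(results);
```

Each variant gives a surprising answer on at least one of the two inputs, which is the behaviour the paragraph above warns about.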

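or
^(<h1>)(.+)(<br />)
Here the text ends up in group 2; the + belongs inside the group, because a repeated group such as (.)+ keeps only the last character it matched. Go here to test:
gskinner.com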

Related

How to make the middle capture group work when surrounded by wildcards?

I use Regex quite a bit. I'm no master, but I've surprised myself with how difficult this has been.
We have a Regex string like this:
^(?:remind me ).*? (to|that|about|its|it's)? ?(.*)$
I want it to match both of the following strings, and assign some value to the first capture group.
remind me in 24 hours test
remind me in 24 hours to test
Assigning this little "to" to the first capture group is proving very difficult.
I could work-around this by doing two passes like below and then checking if the result is null or not, but that seems like madness, so I'm hoping to learn a better approach to this.
const regex1 = /^(?:remind me ).*? (to|that|about|its|it's)? ?(.*)$/i
const regex2 = /(to|that|about|its|it's) ?(.*)$/i
const matches1 = 'remind me in 24 hours to test'.match(regex1)[2]
const matches2 = matches1.match(regex2)
console.log(matches2)
// String1 output: null
// String2 output: [ 'to test', 'to', 'test', index: 9, input: '24 hours to test', groups: undefined ]
On related questions:
I've seen numerous other questions about this - but none of the "solutions" seem applicable here, as most of the answers are tailored to the user's specific issue, and I haven't been able to figure out how to fix our issue using them as a reference.
I read this answer, and it improved my understanding of greedy vs lazy, but did not help me understand how to resolve my issue without crummy code.
TLDR: Desired results would look like below, matching the whole string with to in the first capture group. The contents of the second capture group are not important to us except that the group is not empty.
It works if you remove the optional quantifier from the first capturing group and put .*? together with the capture group into another non-capturing group and make this outer group optional:
^remind me +(?:.*?\b(to|that|about|its|it's)\b *)?(.*)$
See this demo at regex101 (I also made a few small changes: adding word boundaries, changing the quantifiers to allow a variable number of spaces, and removing the non-capturing group at the start, which looks unneeded).
To understand why this works, first have a look at the simple pattern (a)? and how this results in one capture of a and three empty matches in abc while getting four empty matches in e.g. xyz.
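The behaviour of `(a)?` described above can be checked with a short sketch like the following. The `listMatches` helper is a name invented for this example.

```javascript
// Hypothetical helper: list every match of /(a)?/g with its position and group 1.
const listMatches = (text) =>
  [...text.matchAll(/(a)?/g)].map((m) => ({
    index: m.index,
    match: m[0],
    group1: m[1], // undefined when the optional group did not participate
  }));

const inAbc = listMatches('abc');
const inXyz = listMatches('xyz');

console.log(inAbc); // one match 'a' at index 0, then empty matches at 1, 2 and 3
console.log(inXyz); // four empty matches, at 0, 1, 2 and 3
```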
Simplify your current pattern to e.g. ^a.*?(b)?(.*), investigate it in the regex101 debugger, and click the matches tab on the left side. For the string abc the regex engine first matches a. The next character b matches the optional group and the capture succeeds. Using the same pattern on another string, acbc, after matching the first a the next character is a c. Because b is optional, the group matches the empty string between a and the adjacent c (click around step 7 at match 2) and nothing gets captured.
But refactoring this pattern to ^a(?:.*?(b))?(.*) and now looking into the debugger (watch steps 3 to 12) you can see that at the same position after the first a the grouped (?:.*?(b))? part fits in here for both test strings. The first group captures the substring before proceeding in the pattern.
With your current pattern there are even some strings that will let the first group capture (demo).
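The debugger walk-through above can also be reproduced directly in JavaScript. The sketch below runs the simplified patterns and then the full patterns on the two sample sentences; the `groups` helper is invented for this example, and the `i` flag is carried over from the question's code.

```javascript
// Hypothetical helper: return [group 1, group 2] or null.
const groups = (regex, text) => {
  const m = regex.exec(text);
  return m ? [m[1], m[2]] : null;
};

// Simplified patterns from the explanation.
const optionalAlone = /^a.*?(b)?(.*)/;
const optionalWithLazy = /^a(?:.*?(b))?(.*)/;

const simplified = {
  aloneAbc: groups(optionalAlone, 'abc'),        // ['b', 'c']
  aloneAcbc: groups(optionalAlone, 'acbc'),      // [undefined, 'cbc']
  wrappedAbc: groups(optionalWithLazy, 'abc'),   // ['b', 'c']
  wrappedAcbc: groups(optionalWithLazy, 'acbc'), // ['b', 'c']
};

// Full patterns: the one from the question and the suggested rewrite.
const original = /^(?:remind me ).*? (to|that|about|its|it's)? ?(.*)$/i;
const rewritten = /^remind me +(?:.*?\b(to|that|about|its|it's)\b *)?(.*)$/i;

const full = {
  originalWithTo: groups(original, 'remind me in 24 hours to test'),   // [undefined, '24 hours to test']
  rewrittenWithTo: groups(rewritten, 'remind me in 24 hours to test'), // ['to', 'test']
  rewrittenWithout: groups(rewritten, 'remind me in 24 hours test'),   // [undefined, 'in 24 hours test']
};

console.log(simplified, full);
```

Both sentences still match with the rewritten pattern; group 1 is simply `undefined` when no keyword is present.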

Regex to follow pattern except between braces

I am having a tough time figuring out a clean Regex (in a Javascript implementation) that will capture as much of a line as it can following a pattern, but anything inside braces doesn't need to follow the pattern. I'm not sure the best way to explain that except by example:
For example:
Let's say the pattern is, the line must start with 0, end with a 0 anywhere, but only allow sequence of 1, 2 or 3 in between, so I use ^(0[123]+0). This should match the first part of the strings:
0213123123130
012312312312303123123
01231230123123031230
etc.
But I want to be able to insert {gibberish} between braces into the line and have the Regex allow it to disrupt the pattern. i.e., ignore the pattern of the curly braces and everything inside, but still capture the full string including the {gibberish}. So this would capture everything in bold:
01232231{whatever 3 gArBaGe? I want.}121{foo}2310312{bar}3120123
and a 0 inside the braces does not end the capture prematurely, even if the pattern is correct.
01213123123123{21310030123012301}31231230123
EDIT: Now, I know I could just do something like ^0[123]*?(?:{.*})*?[123]*?0 maybe? But that only works if there is a single set of braces, and now I have to duplicate my [123] pattern. As that [123] pattern gets more complex, having it appear more than once in the Regex starts getting really incomprehensible. Something like the best regex trick seemed promising but I couldn't figure out how to apply it here. Using crazy lookarounds seems like the only way now but I would hope there's a cleaner way.
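Since you've specified that you want the whole match including the garbage, you can use ^0([123]+(?:{[^}]*}[123]*)*)0 and use $1 to get the part between the 0s, or $0 to get everything that matched (in a JavaScript replacement string the whole match is written $&, and on a match result it is match[0]).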
https://regex101.com/r/iFSabs/3
Here's the rundown on how the regex works:
^ anchors the match to start at the beginning of the line
0 matches a literal zero character
([123]+(?:{[^}]*}[123]*)*) is a capturing group that captures everything inside of it.
[123]+ matches one or more instances of 1, 2, or 3
(?:{[^}]*}[123]*)* is a non-capturing group. I.e. it'll be part of the match, but won't have a $# for use in replacement or the match.
{[^}]*} matches a literal { followed by any number of non } characters followed by }
[123]* matches zero or more instances of 1, 2, or 3
Then this whole non-capturing group can be matched 0 or more times.
The process behind this regex is known as unrolling the loop. http://www.softec.lu/site/RegularExpressions/UnrollingTheLoop gives a good description of it. (with a few typo fixes)
The unrolling the loop technique is based on the hypothesis that in
most case, you [know] in a [repeated] alternation, which case should be
the most usual and which one is exceptional. We will called the first
one, the normal case and the second one, the special case. The general
syntax of the unrolling the loop technique could then be written as:
normal* ( special normal* )*
Which could means something like, match the normal case, if you find a
special case, matched it than match the normal case again. [You'll] notice
that part of this syntax could [potentially] lead to a super-linear
match.
Example using RegExp#test and RegExp#exec:
const strings = [
  '0213123123130',
  '012312312312303123123',
  '01231230123123031230',
  '01213123123123{21310030123012301}31231230123',
  '01212121{hello 0}121312',
  '012321212211231{whatever 3 gArBaGe? I want.}1212313123120123',
  '012321212211231{whatever 3 gArBaGe? I want.}121231{extra garbage}3123120123',
];
const regex = /^0([123]+(?:{[^}]*}[123]*)*)0/
console.log('tests')
console.log(strings.map(string => `'${string}': ${regex.test(string)}`))
console.log('matches');
let matches = strings
  .map((string) => regex.exec(string))
  .map((match) => (match ? match[1] : undefined));
console.log(matches);
Robo Robok's answer is the one I'd go with if you only want to keep the non-braced part, although I'd use a slightly different regex ({[^}]*}) for a bit more performance.
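The question also worries about the `[123]` sub-pattern appearing more than once. One possible way to ease that, sketched here as an assumption rather than as part of the answer above, is to assemble the unrolled pattern from named pieces with `new RegExp`. The names `normal`, `special` and `unrolled`, and the first test string, are invented for this example.

```javascript
// The "normal" and "special" cases of the unrolled loop, written once each.
const normal = '[123]';
const special = '\\{[^}]*\\}'; // a {...} section; braces escaped for clarity

// normal+ ( special normal* )* wrapped between the two zeros.
const unrolled = new RegExp(`^0(${normal}+(?:${special}${normal}*)*)0`);

const hit = unrolled.exec('012{x0y}31{z}20123');
const miss = unrolled.exec('01212121{hello 0}121312');

console.log(hit[0]); // 012{x0y}31{z}20
console.log(hit[1]); // 12{x0y}31{z}2
console.log(miss);   // null, because no 0 follows outside the braces
```

If the normal case grows more complex, only the `normal` string needs to change.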
How about the other way around? Checking the string with curly tags removed:
const string = '012321212211231{whatever 3 gArBaGe? I want.}1212313123120123{foo}123';
const stringWithoutTags = string.replace(/\{.*?\}/g, '');
const result = /^(0[123]+0)/.test(stringWithoutTags);
You say you need to capture everything, including the gibberish, so I think a simple pattern like this should work:
^(0(?:[123]|{.+?})+0)
That allows a string starting with 0, and then any of your pattern characters (1, 2, or 3), or one of the { gibberish } sections, and allows that to repeat to handle multiple gibberish sections, and finally it must end with a 0.
https://regex101.com/r/K4teGY/2
You might use
^0[123]*(?:{[^{}]*}[123]*)*0
^ Start of string
0 Match a zero
[123]* Match 0+ times either 1, 2 or 3
(?: Non capture group
{[^{}]*}[123]* Match from an opening { to a closing }, followed by 0+ times either 1, 2 or 3
)* Close group and repeat 0+ times
0 Match a zero
Regex demo
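The three patterns suggested in the answers above agree on well-formed input but differ on edge cases. The following sketch compares them; all three test strings are invented for this comparison.

```javascript
const unrolledPlus = /^0([123]+(?:{[^}]*}[123]*)*)0/; // first answer
const alternation = /^(0(?:[123]|{.+?})+0)/;          // alternation with lazy dot
const unrolledStar = /^0[123]*(?:{[^{}]*}[123]*)*0/;  // last answer

// Return the full match or null.
const whole = (regex, text) => {
  const m = regex.exec(text);
  return m ? m[0] : null;
};

const wellFormed = '012{x0y}31{z}20123';
const strayChar = '01{a}x{b}20'; // an "x" outside any braces
const emptyBraces = '01{}20';

const comparison = [unrolledPlus, alternation, unrolledStar].map((regex) => [
  whole(regex, wellFormed),
  whole(regex, strayChar),
  whole(regex, emptyBraces),
]);
console.log(comparison);
// unrolledPlus: [ '012{x0y}31{z}20', null, '01{}20' ]
// alternation:  [ '012{x0y}31{z}20', '01{a}x{b}20', null ]
// unrolledStar: [ '012{x0y}31{z}20', null, '01{}20' ]
```

The lazy `.+?` can backtrack across a closing brace, so the alternation pattern accepts the stray `x`, and it rejects empty braces because `.+?` needs at least one character.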

Why is this second backreference not working?

I want to make use of backreferences as much as possible, avoiding the duplication of combinations of many patterns.
Other requirements: Use fewer literals without constructing a new RegExp, while maintaining generality.
Original title: Why is this negative lookahead with capturing group not working?
For example, a string:
1.'2.2'.33.'4.4'.5.(…etc)
— I want to match the characters separated by periods, and the quoted ones are not segmented and the quotes are truncated. That is to match:
1, 2.2, 33, 4.4, 5, (…etc).
A working regex is:
(?<=(["'])(?!\.)).*?(?=\1)|((?!["']|\.).)+
console.log(
  "1.'2.2'.33.'4.4'.5.(…etc)".match(
    /(?<=(["'])(?!\.)).*?(?=\1)|((?!["']|\.).)+/g
  )
)
A non-working one is:
(?<=(["'])(?!\.)).*?(?=\1)|((?!\1|\.).)+
                               ^^
console.log(
  "1.'2.2'.33.'4.4'.5.(…etc)".match(
    /(?<=(["'])(?!\.)).*?(?=\1)|((?!\1|\.).)+/g
  )
)
— it does not match 1, 33, 5, (…etc).
Why is it (\1←^^) non-working and how to correct it? Thank you!
The main point of confusion seems to be that backreferences are not like "regex subroutines"; they don't let you reuse parts of the pattern elsewhere. What they do is they let you match the exact string that was matched before again.
For example:
console.log(/(\w)\1/.test('AB'));
console.log(/(\w)\1/.test('AA'));
console.log(/(\w)\1/.test('BB'));
(\w)\1 does not match AB, but it does match AA and BB. The \1 part only matches the exact string that was matched by the (\w) group before.
In your case,
(?<=(["'])(?!\.)).*?(?=\1)
|
((?!\1|\.).)+
there are two branches separated by |. The second branch contains a backreference (\1) to a capturing group in the first branch ((["'])).
This can never match because the second branch is only tried if the first branch failed to match anything, but in that case the first capturing group also failed to match anything, so what string would \1 refer to?
If the capturing group referred to by a backreference never matched anything, browsers behave as if it were the empty string.
The empty string always matches, so (?!\1) always fails.
console.log(
  "1.'2.2'.33.'4.4'.5.(…etc)".match(
    /(["'])[\d.]+\1|\d+/g
  )
)
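The point about a backreference to a group that never matched can be seen in isolation with a sketch like this one. The patterns are simplified stand-ins invented for the demonstration, not the asker's full regex.

```javascript
// A backreference to a group that did not participate matches the empty string,
// so this pattern matches a bare "b".
const unsetBackrefMatches = /(a)?\1b/.test('b'); // true

// Second branch uses \1: the first branch failed, so \1 is empty, the negative
// lookahead (?!\1|\.) always fails, and nothing can be matched.
const withBackref = /(["'])[\d.]+\1|((?!\1|\.).)+/;
// Second branch spells the quote characters out again instead.
const withLiteral = /(["'])[\d.]+\1|((?!["']|\.).)+/;

const brokenResult = withBackref.exec('33');  // null
const workingResult = withLiteral.exec('33'); // match '33'

console.log(unsetBackrefMatches);
console.log(brokenResult);
console.log(workingResult[0]);
```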

Issue getting the shortest match using optional groups

I want to allow any 0 to 2 characters between each group in the (this is)?.??.??(an)?.??.??(example sentence) regex. It should match the bolded text in the below strings:
blah blah. An example sentence
blah blah. This is an example sentence
Something something Example sentence
Now, in the first example, the match is ah. example sentence. I thought adding 2 question marks to "." would mean that the regex engine will prefer to match 0 chars.
I'm using regex within VBA in MS Word, implemented by CreateObject("vbscript.regexp"), which as I understand it uses the VBScript regex flavor, which as I understand it is the same as the JavaScript flavor.
When searching 0020002101 should 2.??.??.??101 not prefer 2101 to 20002101?
A regex engine cannot "prefer" anything. It matches from left to right. Once the 2 is found (the first 2) it starts matching the subsequent subpatterns, and when a match is found, it is returned.
In your case, you need to use the .{0,2} inside the optional groups,
(this is.{0,2})?(an.{0,2})?(example sentence)
^^^^^^ ^^^^^^
See the regex demo.
If the order of the optional strings is important, make them nested:
(this is.{0,2}(an.{0,2})?)?(example sentence)
See another regex demo. This regex will match an with 0 to 2 chars after it only if this is with 0 to 2 chars is found before it.
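Since the question notes that the VBScript flavor behaves like the JavaScript one, the difference can be sketched in JavaScript. The `i` flag is an assumption (the sample text capitalises "An" and "This"), and the last test string is invented to show the leftmost-match rule.

```javascript
const lazyDots = /(this is)?.??.??(an)?.??.??(example sentence)/i; // from the question
const flat = /(this is.{0,2})?(an.{0,2})?(example sentence)/i;     // first suggestion
const nested = /(this is.{0,2}(an.{0,2})?)?(example sentence)/i;   // nested suggestion

const short = 'blah blah. An example sentence';
const long = 'blah blah. This is an example sentence';

const lazyShort = short.match(lazyDots)[0]; // '. An example sentence'
const flatShort = short.match(flat)[0];     // 'An example sentence'
const flatLong = long.match(flat)[0];       // 'This is an example sentence'
const nestedShort = short.match(nested)[0]; // 'example sentence'

// Lazy quantifiers do not make the engine skip ahead to a shorter match:
// the leftmost starting position that can succeed wins.
const leftmost = '2ab101 2101'.match(/2.??.??101/)[0]; // '2ab101'

console.log({ lazyShort, flatShort, flatLong, nestedShort, leftmost });
```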

JavaScript Regular Expressions Basics

I'm trying to learn Regular Expressions and at the moment I've gathered a very basic understanding from all of the overviews from W3, Mozilla or http://www.regular-expressions.info/, but when I was exploring this wikibook http://en.wikibooks.org/wiki/JavaScript/Regular_Expressions it gave this example:
"abbc".replace(/(.)\1/g, "$1") => "abc"
which I have no idea why is true (the wikibook didn't really explain), but I tried it myself and it does drop the second b. I know \1 is a backreference to the captured group (.), but . is the any character besides a new line symbol... Wouldn't that still pick up the second b? Trying a few variations didn't clear things up either...
"abbc".replace(/(.)/g, "$1") => "abbc"
"aabc".replace(/(.)*/g, "$1") => "c"
Does anybody have a good in depth tutorial on Javascript Regular Expressions (I've looked at a couple of books and they're very generalized for about 15 languages and no real emphasis on Javascript).
First One
(.) matches and captures a single character to Group 1, so (.)\1 matches two of the same characters, for instance AA.
In the string, the only match for this pattern is bb.
By replacing these two characters bb by the Group 1 capture buffer $1, i.e. b, we replace two chars with one, effectively removing one b.
Second One
Again (.) matches and captures a single character, capturing it to Group 1.
The pattern matches each character in the string in turn.
The replacement is the Group 1 capture buffer $1, so we replace each character with itself. Therefore the string is unchanged.
Third One
Here, forgetting the parentheses for a moment, .* matches the whole string: this is the match.
The quantifier * means that the Group 1 is reset every time a single character is matched (new group numbers are not created, as group numbering is done from left to right).
For every character that is matched, that character is therefore captured to Group 1—until the next capture resets Group 1.
The end value of Group 1 is the last capture, which is the last character c.
We replace the match (i.e., the whole string) with Group 1 (i.e. c), so the replacement string is c.
The details of group numbering are important to grasp, and I highly recommend you read the linked article about "the gory details".
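The three cases discussed above can be run side by side. This sketch also lists the individual matches of the third pattern, which shows one detail not spelled out in the explanation: with the g flag there is a second, empty match at the end of the string, where Group 1 is unset and `$1` contributes nothing.

```javascript
const first = 'abbc'.replace(/(.)\1/g, '$1'); // 'abc'
const second = 'abbc'.replace(/(.)/g, '$1');  // 'abbc'
const third = 'aabc'.replace(/(.)*/g, '$1');  // 'c'

// Each entry is [whole match, group 1] for the third pattern.
const thirdMatches = [...'aabc'.matchAll(/(.)*/g)].map((m) => [m[0], m[1]]);

console.log(first, second, third);
console.log(thirdMatches); // [ [ 'aabc', 'c' ], [ '', undefined ] ]
```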
Reference
Capture Group Numbering & Naming: The Gory Details
JavaScript Regex Basics
Backreferences
This is quite simple when broken down:
With "abbc".replace(/(.)\1/g, "$1"), the result is "abc" because:
(.) matches and captures one character.
\1 refers back to whatever the first capturing group matched.
So what it says is "find 2 times the same letter" and replace it with the reference. So any doubled character would match and be replaced by the reference.
