Regular expression for extracting a noun in variable order - javascript

I have the following text:
Action by Toni Kroos, Real Madrid. Rayo Vallecano 2, Real Madrid 0.
where the nouns are Toni Kroos, Real Madrid (team1) and Rayo Vallecano (team2).
I need a regular expression with a named capturing group that returns these results given the following variations:
Action by Toni Kroos, Real Madrid. Rayo Vallecano 2, Real Madrid 0.
Expected result: Rayo Vallecano
Action by Toni Kroos, Real Madrid. Real Madrid 0, Rayo Vallecano 2.
Expected result: Rayo Vallecano
My naive intention was to negate the backreference captured in team1 and use it on the second sentence. So when it is about to match Real Madrid or Rayo Vallecano, it would discard Real Madrid as it is the same value as team1. So team2 would return Rayo Vallecano. No luck so far with something like this (it only works on the first example):
^Action by .*\, (?<team1>.*)\. (?!\1)(?<team2>.*)( \d+\,| \d+\.).
In plain English, my expectation is a regex to pick either the first noun or the second one on the second sentence (after the first .) so team2 would be either Real Madrid or Rayo Vallecano in the examples, and then discard the one that matches the named capturing group team1 (Real Madrid in the example). So it wouldn't matter the order of the noun in the second sentence.
I'm no expert with regular expressions, so I'm not sure that's possible to achieve with one unique pattern that fits both examples. Is it possible to get such expression? If so, I would appreciate the solution with an explanation of the pattern used. Thanks in advance.
EDIT: The language I'll be using is JavaScript

You might write the pattern using \1 to refer to the first capture group and use the named group team1 and team2 only once.
^Action by [^,]*, (?<team1>[^.]+)[.,] (?:\1[^,]*, )?(?<team2>[^,]+) \d+[,.]
Explanation
^ Start of string
Action by [^,]*, Match Action by followed by optional chars other than a comma and then match ,
(?<team1>[^.]+)[.,] Named group team1 matches 1+ chars other than . and then matches either . or ,
(?:\1[^,]*, )? Optionally match what is matched by group 1 using a backreference followed by optional chars other than , followed by matching ,
(?<team2>[^,]+) Named group team2 match 1+ chars other than , and then match a space
\d+[,.] Match 1+ digits followed by , or .
See a regex101 demo.
const regex = /^Action by [^,]*, (?<team1>[^.]+)[.,] (?:\1[^,]*, )?(?<team2>[^,]+) \d+[,.]/;
[
`Action by Toni Kroos, Real Madrid. Rayo Vallecano 2, Real Madrid 0.`,
`Action by Toni Kroos, Real Madrid. Real Madrid 0, Rayo Vallecano 2.`
].forEach(s => {
const m = s.match(regex);
if (m) {
console.log(m.groups);
}
});
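If a single pattern turns out to be hard to maintain, an alternative sketch (not the pattern from the answer above) is to capture both teams of the second sentence and pick the one that differs from team1 in plain JavaScript. The function name otherTeam and the group names a and b are hypothetical, and the pattern assumes the exact sentence layout of the two examples.

```javascript
// Hypothetical helper: returns the team in the score sentence that is NOT team1.
function otherTeam(line) {
  const m = line.match(
    /^Action by [^,]*, (?<team1>[^.]+)\. (?<a>[^,]+?) \d+, (?<b>[^,]+?) \d+\.$/
  );
  if (!m) return null; // line does not follow the assumed layout
  const { team1, a, b } = m.groups;
  return a === team1 ? b : a;
}

const first = otherTeam("Action by Toni Kroos, Real Madrid. Rayo Vallecano 2, Real Madrid 0.");
const second = otherTeam("Action by Toni Kroos, Real Madrid. Real Madrid 0, Rayo Vallecano 2.");
console.log(first, "|", second);
```

The comparison happens outside the regex, so the order of the two teams in the second sentence does not matter.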

Take a look at this: (https://regex101.com/r/yYgl5R/1)
Regex: ^Action by .*\, (.*)\. (?<team1>.*)( \d+\,) (?<team2>.*)(\d+\.)
This matches only the teams, and the score

You can use the following regular expression with named capturing groups:
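Action by.*, (?<team1>.*)\. (?<team2>.*) (\d+), \k<team1> (\d+)\.
This regex matches the text and captures the values of team1 and team2 using named capturing groups. Note that JavaScript does not allow the same group name to be declared twice within the same alternative, so the second occurrence of team1 is written as the named backreference \k<team1>. As a consequence, this pattern only matches the first variation, where team1 appears last in the second sentence.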

Related

Match numbers that are not following an alphabet and place them in a capturing group (globally)

For example, if I have text like this:
"hello800 more text 1234 and 567"
It should match 1234 and 567, and not 800 (since it is following hello's o, which is not a number).
It is similar to what programming languages do, for example, in JavaScript, abc123 is a variable, while 50 alone, not following text, is treated as a number.
Please do have in mind that I want to negate only characters from the set [A-Za-z] and not others. For example, +33 and -33 should still return 33.
My first trial was to match a NOT set:
[^A-Za-z]([0-9]+)
That didn't work at all.
My second trial was to reverse the string and to use a negative lookahead:
/([0-9]+)(?![A-Za-z])/g
It works only if there is 1 digit. 1a does not match the 1 (and that's good), but 123a matches the 12 (and that's bad).
you can try the following regex
\b\d+\b
you can check the demo here
Additionally, if you want the match to include a leading sign such as +33, you can try the regex
[+-]?\b\d+\b
demo
var str = "hello800 more text 1234 and 567 blah blah +33 end";
var result = str.match(/\b\d+\b/g);
console.log(result);
Explanation:
\b : word boundary, makes sure there is no letter, digit, or underscore immediately before
\d+ : 1 or more digits
\b : word boundary, makes sure there is no letter, digit, or underscore immediately after
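Because \b also treats the underscore as a word character, \b\d+\b skips the 9 in x_9. If only letters (and a preceding digit) should block a match, as the question asks, one possible sketch uses a negative lookbehind. This assumes a JavaScript engine with lookbehind support (ES2018 or later); the sample string is made up for illustration.

```javascript
const sample = "hello800 more text 1234 and 567, +33 -33 x_9";

// Word-boundary version from the answer above
const wordBoundary = sample.match(/\b\d+\b/g);

// Lookbehind version: a run of digits not preceded by a letter or another digit
const lookbehind = sample.match(/(?<![A-Za-z\d])\d+/g);

console.log(wordBoundary); // ["1234", "567", "33", "33"]
console.log(lookbehind);   // ["1234", "567", "33", "33", "9"]
```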
Try this \b(\d+?)\b
\b Asserts a word boundary (the position between a word character and a non-word character, or the start or end of the string)
\d Matches a digit
Demo Here

Use regex to split a string at the first letter

I have a regex that can split a string at the first digit
var name = 'Achill Road Boy 1:30 Ayr'
var horsedata = name.match(/^(\D+)(.*)$/);
var horse = horsedata[1]; // "Achill Road Boy "
var meeting = horsedata[2]; // "1:30 Ayr"
However, I now need to further split
var meetingdata = meeting.match(?what is the regex);
var racetime = meetingdata[1]; // "1:30 "
var course = meetingdata[2]; // "Ayr"
What is the regex to split the string at the first letter?
You can use single regex to do that:
^([^\d]+) +(\d+):(\d+) (.*)$
It will catch name, hour and minute separately, and track name, in groups 1, 2, 3 and 4.
Note that I have added ^ and $ to the expression, meaning that this expression should match given string completely, from start to finish, which I think are useful safeguards against matching something inside the string which you didn't expect initially. They may, however, interfere with your task, so you can remove them if you don't need them.
When tinkering with regular expressions I always use this nifty tool, http://regex101.com - it has a very useful interface to debug regular expressions and also their execution time. Here's a link to a regular expression above: https://regex101.com/r/jYgc9K/1. It also gives you a nice clear breakdown of this regular expression:
Full match 0-24 `Achill Road Boy 1:30 Ayr`
Group 1. 0-15 `Achill Road Boy`
Group 2. 16-17 `1`
Group 3. 18-20 `30`
Group 4. 21-24 `Ayr`
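As a minimal sketch, the same pattern can be used from JavaScript with array destructuring. The variable names horse, hour, minute and course are only illustrative.

```javascript
const name = 'Achill Road Boy 1:30 Ayr';
const parts = name.match(/^([^\d]+) +(\d+):(\d+) (.*)$/);

if (parts) {
  // parts[0] is the full match; groups 1 to 4 follow
  const [, horse, hour, minute, course] = parts;
  console.log({ horse, hour, minute, course });
}
```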
Lastly, a word of advice: there's a famous saying by Jamie Zawinski, a very smart guy. It goes like this:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
There's a lot of truth in this saying.
Given the string:
1:30 Ayr
The capture group from this regex will give you Ayr:
^[^a-zA-Z]*([a-zA-Z\s]+)$
Regular Expression Key:
^ - start of match
[^a-zA-Z]* - from 0 to any number of non-alphabet characters
([a-zA-Z\s]+) - capture group: from 1 to any number of alphabet characters and spaces
$ - end of match
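To get both halves asked for in the question, a small variation on the pattern above (an assumption, not part of the answer) gives the leading non-letter part a capture group of its own:

```javascript
const meeting = '1:30 Ayr';
// Group 1: everything before the first letter; group 2: letters and spaces to the end
const meetingdata = meeting.match(/^([^a-zA-Z]*)([a-zA-Z\s]+)$/);

const racetime = meetingdata[1]; // "1:30 "
const course = meetingdata[2];   // "Ayr"
console.log(JSON.stringify(racetime), JSON.stringify(course));
```

The trailing space stays with racetime, matching the expected values in the question; call trim() on it if that is not wanted.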

Regex calling backreferences from all group iterations

I'm catching international numbers and running a regex to replace the characters people like to put between numbers.
I'm using the below RegEx:
[+]([0-9]{1,3})(([\s\-\.\(\)]*)([0-9]*)([\s\-\.\(\)]*)){1,3}
It works great but when I use a repeated group, it only catches the last iteration. When I use the regex101 site to debug my regular expression, I see:
A repeated capturing group will only capture the last iteration. Put a
capturing group around the repeated group to capture all iterations
I want to take the advice but I'm not sure how I can put a capturing group around the repeated group. See: https://regex101.com/r/pT3cK9/1
As stated in comments, the simplest way to clean phone numbers would be to define a list of unwanted characters, and replace them by spaces:
'+94 (666) 999-5555'.replace(/[ .()-]+/g, ' '); // +94 666 999 5555
'+42 555.123.4567'.replace(/[ .()-]+/g, ' '); // +42 555 123 4567
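As for the regex101 hint itself, "put a capturing group around the repeated group" means the quantifier goes inside an outer pair of parentheses, with the inner group made non-capturing. The sketch below uses a simplified version of the pattern from the question (an assumption, chosen to keep the example readable) to contrast the two forms.

```javascript
const input = '+94 (666) 999-5555';

// Quantifier on the capturing group: group 2 keeps only the last iteration
const lastOnly = /\+(\d{1,3})(?:[\s\-.()]*(\d+)){1,3}/.exec(input);

// Capturing group around the repeated (non-capturing) group: group 2 keeps all iterations
const wrapped = /\+(\d{1,3})((?:[\s\-.()]*\d+){1,3})/.exec(input);

console.log(lastOnly[2]);                   // "5555"
console.log(wrapped[2]);                    // " (666) 999-5555"
console.log(wrapped[2].replace(/\D/g, '')); // "6669995555"
```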

Difference between (\w)* and \w?

I'm trying to study regexes, and I came upon this confusing scenario:
Suppose you have the text:
hello world
If you run the regex (\w)*, it gives:
['hello', 'o']
What I expected was:
['hello', 'h']
Doesn't \w mean any word character?
Another example:
Text:
Delicious cake
(\w)* output:
['Delicious', 's']
What I expected:
['Delicious', 'D']
'*' matches the preceding part zero or more times and binds tightly to the element on its left.
Example: m*o will match o, mo, mmo, mmmmo and so on.
Parentheses () are used to mark sub-expressions, also called capture groups.
So (\w)* is a repeated capturing group.
Regex Demo
Sam, the reason why (\w)* returns "s" in Group 1 against "delicious" is that there can only be one Group 1. Each time a new character is matched by (\w), the parentheses force the new value of the character to be captured into Group 1. "s" is the last character, so it is the final Group 1 reported to you by the engine.
If you wanted to capture the first letter into Group 1 instead, you could go with something like:
(\w)\w*
This causes the first character to be captured. There is no quantifier on the capturing parentheses, so Group 1 doesn't change. The remaining \w* optionally match any additional characters.
Also please note that when you run (\w)* against "hello world", the matches are not "hello" and "o" as you stated. The matches (if you match them all) are "hello" and "world". The Group 1 captures are "o" and "d", the last letters of each word.
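A short JavaScript sketch of the behaviour described above. Since (\w)* can also match the empty string, the empty matches are filtered out here; the variable names are illustrative.

```javascript
const text = 'hello world';

// Repeated capturing group: Group 1 holds the last character of each word
const repeated = [...text.matchAll(/(\w)*/g)]
  .filter(m => m[0] !== '')        // drop the zero-length matches
  .map(m => [m[0], m[1]]);

// No quantifier on the group: Group 1 holds the first character of each word
const firstLetter = [...text.matchAll(/(\w)\w*/g)].map(m => [m[0], m[1]]);

console.log(repeated);    // [["hello", "o"], ["world", "d"]]
console.log(firstLetter); // [["hello", "h"], ["world", "w"]]
```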
Reference: All about capture
Remember, a repeated capturing group always captures the last iteration.
So:
(\w)* on hello will check one character at a time until it reaches the last match.
Thus it will get o in the capture group.
(\w)* on helloworld will check one character at a time until it reaches the last match.
Thus it will get d in the capture group.
(\w)* on hello123 will check one character at a time until it reaches the last match.
Thus it will get 3 in the capture group.
(\w)* on helloworld#3w4 will check one character at a time until it reaches the last match. Thus it will get d in the capture group, since # is not a valid word character (only [_0-9a-zA-Z] allowed).
(\w)*
Match the regular expression below and capture its match into backreference number 1 «(\w)*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Note: You repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations. «*»
Match a single character that is a “word character” (letters, digits, and underscores) «\w»
Will give you two matches:
hello
world
\w
Match a single character that is a “word character” (letters, digits, and underscores) «\w»
Will match every character (individually) on the sentence:
h
e
l
l
o
w
o
r
l
d
\w is a RegEx shortcut for [_a-zA-Z0-9] which means any letter, digit, or an underscore.
When you add an asterisk * after anything, it means it can appear from 0 to unlimited times.
If you want to match all the letters in your input, use \w
If you want to match whole words in your input, use \w+ (use + and not * since a word has at least one letter)
Also, when you surround part of your RegEx with parentheses, that part becomes a capture group, which means it will appear in your results. That is why (\w)* is different from (\w*)
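The four patterns mentioned above can be compared side by side with a quick sketch like this one (variable names are illustrative):

```javascript
const input = 'hello world';

const chars = input.match(/\w/g);           // every word character individually
const words = input.match(/\w+/g);          // whole words
const repeatedGroup = input.match(/(\w)*/); // group is repeated: keeps the last character
const wholeGroup = input.match(/(\w*)/);    // repetition is inside the group: keeps the whole run

console.log(chars.length);     // 10
console.log(words);            // ["hello", "world"]
console.log(repeatedGroup[1]); // "o"
console.log(wholeGroup[1]);    // "hello"
```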
Useful RegEx sites:
RegexPal
Debuggex

What am I doing wrong with my regex?

I am trying to capture "Rio Grande Do Leste" from:
...
<h1>Rio Grande Do Leste<br />
...
using
var myregexp = /<h1>()<br/;
var nomeAldeiaDoAtaque = myregexp.exec(document);
what am I doing wrong?
update:
2 questions remain:
1) searching (document) didn't produce any result, but changing it to (document.body.innerHTML) worked. Why is that?
2) I had to change it to: myregexp.exec(document.body.innerHTML)[1]; to get what I want, otherwise it would give me some result which includes <h1>. why is that?
3) (answered) why do I need to use ".*" ? I thought it would collect anything between ()?
Try /<h1>(.*?)<br/.
On capturing group
A capturing group attempts to capture what it matches. This has some important consequences:
A group that matches nothing, can never capture anything.
A group that only matches an empty string, can only capture an empty string.
A group that captures repeatedly in a match attempt only gets to keep the last capture
Generally true for most flavors, but .NET regex is an exception (see related question)
Here's a simple pattern that contains 2 capturing groups:
(\d+) (cats|dogs)
\___/ \_________/
  1        2
Given i have 16 cats, 20 dogs, and 13 turtles, there are 2 matches (as seen on rubular.com):
16 cats is a match: group 1 captures 16, group 2 captures cats
20 dogs is a match: group 1 captures 20, group 2 captures dogs
Now consider this slight modification on the pattern:
(\d)+ (cats|dogs)
\__/  \_________/
 1         2
Now group 1 matches \d, i.e. a single digit. In most flavors, a group that matches repeatedly (thanks to the + in this case) only gets to keep the last match. Thus, in most flavors, only the last digit that was matched is captured by group 1 (as seen on rubular.com):
16 cats is a match: group 1 captures 6, group 2 captures cats
20 dogs is a match: group 1 captures 0, group 2 captures dogs
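JavaScript is one of the flavors that keeps only the last capture, so the two patterns can be compared with a short sketch like the following (variable names are illustrative):

```javascript
const sentence = 'i have 16 cats, 20 dogs, and 13 turtles';

// Quantifier inside the group: group 1 captures the whole number
const whole = [...sentence.matchAll(/(\d+) (cats|dogs)/g)].map(m => [m[1], m[2]]);

// Quantifier outside the group: group 1 keeps only the last digit matched
const lastDigit = [...sentence.matchAll(/(\d)+ (cats|dogs)/g)].map(m => [m[1], m[2]]);

console.log(whole);     // [["16", "cats"], ["20", "dogs"]]
console.log(lastDigit); // [["6", "cats"], ["0", "dogs"]]
```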
References
regular-expressions.info/Use Round Brackets for Capturing
Is there a regex flavor that allows me to count the number of repetitions matched by * and +?
.NET regex keeps intermediate captures!
On greedy vs reluctant vs negated character class
Now let's consider the problem of matching "everything between A and ZZ". As it turns out, this specification is ambiguous: we will come up with 3 patterns that do this, and they will yield different matches. Which one is "correct" depends on the expectation, which is not properly conveyed in the original statement.
We use the following as input:
eeAiiZooAuuZZeeeZZfff
We use 3 different patterns:
A(.*)ZZ yields 1 match: AiiZooAuuZZeeeZZ (as seen on ideone.com)
This is the greedy variant; group 1 matched and captured iiZooAuuZZeee
A(.*?)ZZ yields 1 match: AiiZooAuuZZ (as seen on ideone.com)
This is the reluctant variant; group 1 matched and captured iiZooAuu
A([^Z]*)ZZ yields 1 match: AuuZZ (as seen on ideone.com)
This is the negated character class variant; group 1 matched and captured uu
Here's a visual representation of what they matched:
         ___n
        /   \              n = negated character class
eeAiiZooAuuZZeeeZZfff      r = reluctant
  \_________/r   /         g = greedy
   \____________/g
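The same three patterns can be checked in JavaScript with a minimal sketch (variable names are illustrative):

```javascript
const subject = 'eeAiiZooAuuZZeeeZZfff';

const greedy = subject.match(/A(.*)ZZ/);      // as much as possible
const reluctant = subject.match(/A(.*?)ZZ/);  // as little as possible
const negated = subject.match(/A([^Z]*)ZZ/);  // no Z allowed between A and ZZ

console.log(greedy[0], greedy[1]);       // AiiZooAuuZZeeeZZ iiZooAuuZZeee
console.log(reluctant[0], reluctant[1]); // AiiZooAuuZZ iiZooAuu
console.log(negated[0], negated[1]);     // AuuZZ uu
```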
See related question for a more in-depth treatment on the difference between these 3 techniques.
Related questions
Difference between .*? and .* for regex
Greedy vs reluctant vs negated character class, detailed explanation with illustrative examples
Going back to the question
So let's go back to the question and see what's wrong with pattern:
<h1>()<br
    \/
    1
Group 1 matches the empty string, therefore the whole pattern overall can only match <h1><br, and group 1 can only match the empty string.
One can try to "fix" this in many different ways. The 3 obvious ones to try are:
<h1>(.*)<br; greedy
<h1>(.*?)<br; reluctant
<h1>([^<]*)<br; negated character class
You will find that none of the above "work" all the time; there will be problems with some HTML. This is to be expected: regex is the "wrong" tool for the job. You can try to make the pattern more and more complicated, to get it "right" more often and "wrong" less often. More than likely you'll end up with a horrible mess that no one can understand and/or maintain, and it would still probably not work "right" 100% of the time.
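For the simple snippet in the question, the reluctant variant can be tried as in the sketch below. The html string is a stand-in for document.body.innerHTML, which is not available outside a browser. Index 0 of the result is the whole match (including <h1>) and index 1 is the first capture group, which is why [1] is needed. Also, exec converts a non-string argument to a string first, so passing document itself would most likely search something like "[object HTMLDocument]"; a plain object is used here to illustrate that.

```javascript
// Stand-in for document.body.innerHTML (assumption: the real page has more markup)
const html = '...\n<h1>Rio Grande Do Leste<br />\n...';

const result = /<h1>(.*?)<br/.exec(html);
console.log(result[0]); // "<h1>Rio Grande Do Leste<br"  (whole match)
console.log(result[1]); // "Rio Grande Do Leste"         (capture group 1)

// A plain object stands in for `document`: exec() stringifies it to "[object Object]"
const notAString = {};
const noMatch = /<h1>(.*?)<br/.exec(notAString);
console.log(noMatch); // null
```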
or
^(<h1>)(.)+(<br />)
go here to test
gskinner.com
