What does $1, $2, etc. mean in Regular Expressions? - javascript

Time and time again I see $1 and $2 being used in code. What does it mean? Can you please include examples?

When you create a regular expression you have the option of capturing portions of the match and saving them as placeholders. They are numbered starting at $1.
For instance:
/A(\d+)B(\d+)C/
This will capture from A90B3C the values 90 and 3. If you need to group things but don't want to capture them, use the (?:...) version instead of (...).
The numbers start from left to right in the order the brackets are open. That means:
/A((\d+)B)(\d+)C/
Matching against the same string will capture 90B, 90 and 3.
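The numbering rules above can be checked directly in JavaScript. This is a minimal sketch; the helper name captures is made up for illustration and is not part of any library.

```javascript
// Hypothetical helper: return only the captured groups of the first match.
function captures(re, text) {
  const m = re.exec(text); // m[0] is the whole match, m[1..] are the groups
  return m === null ? null : m.slice(1);
}

// Two groups, numbered left to right.
console.log(captures(/A(\d+)B(\d+)C/, "A90B3C"));   // [ '90', '3' ]

// Nested groups are numbered by the position of their opening bracket.
console.log(captures(/A((\d+)B)(\d+)C/, "A90B3C")); // [ '90B', '90', '3' ]

// (?:...) groups without capturing, so only one group is reported.
console.log(captures(/A(?:\d+)B(\d+)C/, "A90B3C")); // [ '3' ]
```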

This is especially useful in replacement string syntax (i.e. format strings), and it works well with case folding in find-and-replace operations. To reference a capture, use $n, where n is the capture group number. In many editors and regex libraries $0 means the entire match; in JavaScript's String.prototype.replace() the entire match is written $& instead. Example:
Find: (<a.*?>)(.*?)(</a>)
Replace: $1\u$2\e$3
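The \u and \e case operators in that example belong to editor-style replacement syntax; JavaScript's replace() does not support them. As a rough JavaScript equivalent, here is a sketch that assumes \u is meant to upper-case the first character of the captured link text. The sample HTML string and the function name capitalizeLinkText are invented for illustration.

```javascript
const html = '<a href="/home">home page</a>';
const linkRe = /(<a.*?>)(.*?)(<\/a>)/;

// $1 and $3 put the tags back unchanged; $2 is the link text.
const bracketed = html.replace(linkRe, "$1[$2]$3");
console.log(bracketed); // <a href="/home">[home page]</a>

// Case changes need a callback: it receives the whole match, then each group.
function capitalizeLinkText(input) {
  return input.replace(linkRe, (match, open, text, close) =>
    open + text.charAt(0).toUpperCase() + text.slice(1) + close);
}

console.log(capitalizeLinkText(html)); // <a href="/home">Home page</a>
```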

Related

How to use regex ?: operator and get the right group in my case? [duplicate]

This is an example string:
123456#p654321
Currently, I am using this match to capture 123456 and 654321 in to two different groups:
([0-9].*)#p([0-9].*)
But on occasions, the #p654321 part of the string will not be there, so I will only want to capture the first group. I tried to make the second group "optional" by appending ? to it, which works, but only as long as there is a #p at the end of the remaining string.
What would be the best way to solve this problem?
You have the #p outside of the capturing group, which makes it a required piece of the result. You are also using the dot character (.) improperly. Dot (in most reg-ex variants) will match any character. Change it to:
([0-9]*)(?:#p([0-9]*))?
The (?:) syntax is how you get a non-capturing group. We then capture just the digits that you're interested in. Finally, we make the whole thing optional.
Also, most reg-ex variants have a \d character class for digits. So you could simplify even further:
(\d*)(?:#p(\d*))?
As another person has pointed out, the * operator could potentially match zero digits. To prevent this, use the + operator instead:
(\d+)(?:#p(\d+))?
Your regex will actually match no digits, because you've used * instead of +.
This is what (I think) you want:
(\d+)(?:#p(\d+))?
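For reference, this is how the final pattern behaves in JavaScript. The sketch adds ^ and $ anchors, which are an assumption (the answers above do not use them), so that the whole string has to fit the pattern. The function name parseIds is made up.

```javascript
// Hypothetical helper around the pattern from the answers above.
// The ^ and $ anchors are an added assumption, not part of the original.
function parseIds(text) {
  const m = /^(\d+)(?:#p(\d+))?$/.exec(text);
  if (m === null) return null;
  // m[2] is undefined when the optional "#p..." part is absent.
  return { first: m[1], second: m[2] };
}

console.log(parseIds("123456#p654321")); // { first: '123456', second: '654321' }
console.log(parseIds("123456"));         // { first: '123456', second: undefined }
```

Note that the non-capturing group (?:...) does not take a number, so the second set of digits is still group 2.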

JS RegEx replacement of a non-captured group?

I'm currently going through the book "Eloquent JavaScript". There's an exercise at the end of Chapter 9 on regular expressions whose solution I couldn't quite understand. A description of the exercise can be found here.
TL;DR: The objective is to replace single quotes (') with double quotes (") in a given string while keeping the single quotes in contractions, using the replace method with a regular expression of course.
After solving the exercise with my own method, I checked the proposed solution, which looks like this:
console.log(text.replace(/(^|\W)'|'(\W|$)/g, '$1"$2'));
The RegEx looks fine and it's quite understandable, but what I fail to understand is the usage of replacements, mainly why using $2 works ? As far as I know this regular expression will only take one path of two, either (^|\W)' or '(\W|$) each of these paths will only result in a single captured group, so we will only have $1 available. And yet $2 is capturing what comes after the single quote without having an explicit second capture group that does this in the regular expression. One can argue that there are two groups, but then again $2 is capturing a different string than the one intended by the second group.
My questions :
Why is $2 actually a valid string rather than undefined, and what precisely does it refer to?
Is this one of JavaScript's RegEx quirks?
Does this mean $1, $2... don't always refer to explicit groups?
The backreferences are initialized with an empty string upon each match, so there will be no issues if a group is not matched. And it is no quirk, it is in compliance with the ES5 standard.
Here is a quote from Backreferences to Failed Groups:
According to the official ECMA standard, a backreference to a non-participating capturing group must successfully match nothing just like a backreference to a participating group that captured nothing does.
So, when a capturing group does not participate in the match, a reference to it yields an empty string, not undefined. It is not a quirk, just a "feature": not always what you expect, but it is how it works.
In your scenario, one of the two references is empty on every match, since there are two alternative branches and only one of them matches each time. The point is to restore the character matched by whichever group participated. Both $1 and $2 appear in the replacement because one of them holds the text to restore while the other holds an empty string.
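A small sketch makes this visible. The first function is the book's solution wrapped in a function (the name fixQuotes is made up); the second, also hypothetical, records what a replacement callback receives for each group. In a callback, a group that did not participate arrives as undefined, whereas in a replacement string its $n placeholder expands to an empty string.

```javascript
function fixQuotes(text) {
  return text.replace(/(^|\W)'|'(\W|$)/g, '$1"$2');
}

console.log(fixQuotes("'I'm the cook,' he said, 'it's my job.'"));
// "I'm the cook," he said, "it's my job."

// Record the two group values seen for every match.
function groupsSeen(text) {
  const seen = [];
  text.replace(/(^|\W)'|'(\W|$)/g, (match, before, after) => {
    seen.push([before, after]);
    return match; // leave the text unchanged
  });
  return seen;
}

console.log(groupsSeen("a 'hi' b"));
// [ [ ' ', undefined ], [ undefined, ' ' ] ]
```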

Complex string parsing in Javascript

I am attempting to parse a complex string in JavaScript, and I'm pretty horrible with Regular Expressions, so I haven't had much luck. The data is loaded into a variable formatted as follows:
Miami 2.5 O (207.5) 125.0 | Oklahoma City -2.5 U (207.5) -145.0 (Feb 20, 2014 08:05 PM)
I am trying to parse that string following these parameters:
1) Each value must be loaded into its own variable (i.e. separate variables for Miami, 2.5 O, (207.5), etc.)
2) String must split at pipe character (I have this working with .split(" | ") )
3) I am dealing with city names that include spaces
4) The date at the end must be isolated and removed
I have a feeling regular expressions must be used, but I'm seriously hoping there is a different way to approach this. The example provided is just that, an example from a much larger data set. I can provide the full data set if requested.
More direct version of my question: Given the data above, what concepts / procedures can I use to intelligently parse the string elements into their own variables?
If RegEx must be used, will I need multiple expressions?
Thanks in advance for your help!
EDIT: In an effort to supply multiple pathways to a solution I'll explain the overarching problem as well. This data is the return of a RSS / XML item. The string mentioned above is sports odds, and is all contained in the title node of the feed I'm using. If anyone has a better XML / RSS feed for sports odds, I would be ecstatic for that as well.
EDIT 2: Thanks to the replies, I can run a RegEx that matches the data points needed. I'm now having trouble iterating through the matches and returning them correctly. I have the RegEx loaded into its own function:
function regExExtract (txt){
var exp = /([^|\d]+) ([-\d.]+ [A-Z]) (\([^)]+\)) ([-\d.]+) (\([^)]+\))?/g;
var comp_arr = exp.exec(txt);
return comp_arr;
}
And it is being called with:
var title_arr = regExExtract(title);
Title is loaded with the data string listed above. I assume I'm using the global flag correctly to ensure all matches are considered, but I'm not sure I'm loading the matches correctly. I apologize for my ignorance, this is all brand new to me.
As requested below, my expected output is ultimately a table with a row for each city, and its subsequent data. Each cell in each row corresponds to a data point.
I have created a JS Fiddle with what I've done, and what the expected output is:
http://jsfiddle.net/vDkQD/2/
Potential Final Edit: With the assistance of Robin and rewt, I have come up with:
http://jsfiddle.net/hMJx3/
Wouldn't a regex like
/([^|\d]+) ([-\d.]+ [A-Z]) (\([^)]+\)) ([-\d.]+) (\([^)]+\))?/g
do the trick? Obviously, this is based on the example string you gave, and if there are other patterns possible this should be updated... But if it is that fixed it's not so complicated.
Afterwards you just have to go through the captured groups for each match, and you'll have your data parsed. Live demo for fun: http://regex101.com/r/kF5zD3
Explanation
[^|\d] everything but a pipe or a digit. This is to account for strange city names that [a-zA-Z ] might not catch
[-\d.] a digit, a dot or a hyphen
\([^)]+\) opening parenthesis, everything that isn't a closing parenthesis, closing parenthesis.
Quick incomplete pointers on regex
Here, the regex is the part between the /. The g after is a flag, thanks to it the regex won't stop after hitting the first match and will return every match
The match is what the whole expression will find. Here, the match will be everything between two | in your string. The capturing groups are a very useful tool that allows you to extract data from this match: they are delimited by parentheses, which are special characters in regex. (a)b will match ab, the first captured group of this match will be a
[...] means any one character inside will do. [abc] will match a or b or c.
+ is a quantifier, another special character, meaning "one or more of what precedes me". a+ means "one or more a" and will match aaaaa.
\d is a shortcut for [0-9] (yes, - is a special range character inside of [...]. That's why in [-\d.], which is equivalent to [-0-9.], it's directly following the opening bracket)
since parentheses are special characters, when you actually want to match a parenthesis you need to escape it: regex (\(a\))b will match (a)b, the first captured group of this match will be (a) with the parentheses
? means what precedes is optional (zero or one instances)
^ when put at the beginning of a [...] statement means "everything but what's in the brackets". [^a]+ will match bcd-*ù but not aa
If you really know nothing about regex, since I believe it's the right tool for your case, I suggest you take a quick look at a tutorial, just to get a better idea of what you're dealing with. The way to set flags, loop through matches and their respective captured groups will depend on your language and how you call your regex.
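In JavaScript specifically, a single exec() call returns only one match even with the g flag; the flag makes each later call resume where the previous match ended. A minimal sketch of the loop, using the pattern from this answer (the function name extractRows and the trimming step are additions for illustration):

```javascript
function extractRows(title) {
  const exp = /([^|\d]+) ([-\d.]+ [A-Z]) (\([^)]+\)) ([-\d.]+) (\([^)]+\))?/g;
  const rows = [];
  let m;
  // With the g flag, each exec() call continues after the previous match
  // and returns null once there are no more matches.
  while ((m = exp.exec(title)) !== null) {
    // Drop m[0] (the whole match) and trim stray spaces around each group;
    // the optional date group is undefined when it is absent.
    rows.push(m.slice(1).map((s) => (s === undefined ? s : s.trim())));
  }
  return rows;
}

const title =
  "Miami 2.5 O (207.5) 125.0 | Oklahoma City -2.5 U (207.5) -145.0 (Feb 20, 2014 08:05 PM)";
console.log(extractRows(title));
// [ [ 'Miami', '2.5 O', '(207.5)', '125.0', undefined ],
//   [ 'Oklahoma City', '-2.5 U', '(207.5)', '-145.0', '(Feb 20, 2014 08:05 PM)' ] ]
```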
[A-Z][a-z]+( [A-Z][a-z]+)* -?[0-9]+\.[0-9] [OU] \(-?[0-9]+\.[0-9]\) -?[0-9]+\.[0-9]
This should match a single part of your long string under the following assumptions:
The city consists only of alpha characters, each word starts with an uppercase character and is at least 2 characters long.
Numbers have an optional sign and exactly one digit after the decimal point
the single character is either O or U
Now it is up to you to:
Properly create capturing parentheses
Check whether my assumptions are right
In order to match the date:
\([JFMASOND][a-z]{2} [0-9]?[0-9], [0-9]{4} [0-9]{2}:[0-9]{2} [AP]M\)$
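One possible way to use the date pattern in JavaScript is sketched below. It wraps the date text in a capturing group and allows an optional leading space so the date can be cut off cleanly before splitting on the pipe; both tweaks, and the names DATE_RE and splitTitle, are assumptions added for illustration.

```javascript
// Date pattern from above, with a capturing group around the date text
// and an optional leading space (both are additions for this sketch).
const DATE_RE =
  / ?\(([JFMASOND][a-z]{2} [0-9]?[0-9], [0-9]{4} [0-9]{2}:[0-9]{2} [AP]M)\)$/;

function splitTitle(title) {
  const m = DATE_RE.exec(title);
  return {
    date: m === null ? null : m[1],               // text inside the parentheses
    parts: title.replace(DATE_RE, "").split(" | "), // one entry per team
  };
}

console.log(splitTitle(
  "Miami 2.5 O (207.5) 125.0 | Oklahoma City -2.5 U (207.5) -145.0 (Feb 20, 2014 08:05 PM)"
));
// { date: 'Feb 20, 2014 08:05 PM',
//   parts: [ 'Miami 2.5 O (207.5) 125.0', 'Oklahoma City -2.5 U (207.5) -145.0' ] }
```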

Javascript regexp search with capture

How would I capture the two integers from the following string into two different variables using a regexp javascript search?
"10 of 25"
Your regex is going to be very specific to your strings, so this answer might be specific to whatever your actual use case is. However, just put the digits in capturing groups. The .+? in front of each group means "match anything lazily until you find two digits".
So if there is a chance you'll have two digits that shouldn't be captured, you'd want to add some extra checks such as a positive lookahead/lookbehind for quotes, etc.
.+?(\d\d).+?(\d\d).+?
Simply refer to each capture group as $1, $2, etc.
Use ?: in a group to make it non-capturing, fwiw.
http://regex101.com/r/vN6jO2
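To get the two integers into variables in JavaScript, read them from the array returned by exec() (or match()) rather than from $1 and $2, which only have meaning inside a replacement string. The sketch below uses a different, simpler pattern than the one above: .+? needs at least one character before the first pair of digits, so that pattern only fits when something such as a quote precedes the number. The function name parseProgress is made up.

```javascript
// Hypothetical helper: pull both integers out of strings like "10 of 25".
function parseProgress(text) {
  const m = /(\d+) of (\d+)/.exec(text);
  if (m === null) return null;
  // m[1] and m[2] hold the captured digit strings; convert them to numbers.
  return { current: Number(m[1]), total: Number(m[2]) };
}

console.log(parseProgress("10 of 25")); // { current: 10, total: 25 }
```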

How to explain "$1,$2" in JavaScript when using regular expression?

A piece of JavaScript code is as follows:
num = "11222333";
re = /(\d+)(\d{3})/;
re.test(num);
num.replace(re, "$1,$2");
I could not understand the grammar of "$1,$2". The book from which this code comes says $1 means RegExp.$1, $2 means RegExp.$2. But these explanations lead to more questions:
It is known that in JavaScript, the name of variables should begin with letter or _, how can $1 be a valid name of member variable of RegExp here?
If I input $1, the command line says it is not defined; if I input "$1", the command line only echoes $1, not 11222. So, how does the replace method know what "$1,$2" mean?
Thank you.
It's not a "variable" - it's a placeholder that is used in the .replace() call. $n represents the nth capture group of the regular expression.
var num = "11222333";
// This regex captures the last 3 digits as capture group #2
// and all preceding digits as capture group #1
var re = /(\d+)(\d{3})/;
console.log(re.test(num));
// This replace call replaces the match of the regex (which happens
// to match everything) with the first capture group ($1) followed by
// a comma, followed by the second capture group ($2)
console.log(num.replace(re, "$1,$2"));
$1 is the first group from your regular expression, $2 is the second. Groups are defined by brackets, so your first group ($1) is whatever is matched by (\d+). You'll need to do some reading up on regular expressions to understand what that matches.
It is known that in Javascript, the name of variables should begin with letter or _, how can $1 be a valid name of member variable of RegExp here?
This isn't true. $ is a valid variable name as is $1. You can find this out just by trying it. See jQuery and numerous other frameworks.
You are misinterpreting that line of code. You should consider the string "$1,$2" a format specifier that the replace function uses internally to know what to do. replace runs the regular expression passed to it, which has two capture groups (the two parenthesized blocks), and builds the replacement from what they captured: $1 refers to the first group, $2 to the second. The call therefore returns 11222,333; num itself is unchanged, because replace returns a new string.
It is known that in Javascript, the name of variables should begin with letter or _,
No, it's not. $1 is a perfectly valid variable. You have to assign to it first though:
$variable = "this is a test"
This is how jQuery uses a variable called $ as an alias for the jQuery object.
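A short sketch of the distinction, with made-up names: $1 as an identifier is an ordinary variable, while "$1" inside a replacement string is a placeholder that only replace() interprets. The two are unrelated.

```javascript
function identifierVersusPlaceholder() {
  // $1 is a legal identifier; this is just a normal variable.
  const $1 = "an ordinary variable";

  // Inside the replacement string, $1 and $2 refer to the capture groups.
  const swapped = "ab".replace(/(a)(b)/, "$2$1");

  return { variable: $1, swapped: swapped };
}

console.log(identifierVersusPlaceholder());
// { variable: 'an ordinary variable', swapped: 'ba' }
```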
The book from which this code comes says $1 means RegExp.$1, $2 means RegExp.$2.
This book is made of paper, and paper offers no resistance to whoever writes on it :-). But perhaps you only misinterpreted what is actually written in the book.
Actually, it depends on the context.
1) In the context of the replace() method of String, $1, $2, ... $99 (1 through 99) are placeholders. They are handled internally by the replace() method (and they have nothing to do with RegExp.$1, RegExp.$2, etc., which are probably not even defined (see point 2)). See String.prototype.replace() #Specifying_a_string_as_a_parameter. Compare this with the return value of the match() method of String when the g flag is not used, which is similar to the return value of the exec() method of RegExp. Compare also with the arguments passed implicitly to an (optional) function specified as the second argument of replace().
2) RegExp.$1, RegExp.$2, ... RegExp.$9 (1 through 9 only) are non-standard properties of RegExp, as you may see at RegExp.$1-$9 and Deprecated and obsolete features. They seem to be implemented in your browser, but for somebody else they might not be defined. To use them, you always need to prefix $1, $2, etc. with RegExp. (including the dot). These properties are static, read-only and stored on the global RegExp object, not on an individual regular expression object. In any case, you should not use them. The $1 through $99 used internally by the replace() method of String are stored elsewhere.
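The three comparisons in point 1 can be sketched side by side, reusing the num and re values from the question. The other variable names are made up for illustration.

```javascript
const num = "11222333";
const re = /(\d+)(\d{3})/;

// 1. Placeholders in a replacement string.
const viaPlaceholders = num.replace(re, "$1,$2");

// 2. A replacement function receives the match, then each captured group.
const viaCallback = num.replace(re, (match, g1, g2) => g1 + "," + g2);

// 3. Without the g flag, match() returns the same array shape as exec():
//    index 0 is the whole match, indexes 1 and 2 are the groups.
const m = num.match(re);
const viaMatchArray = m[1] + "," + m[2];

console.log(viaPlaceholders, viaCallback, viaMatchArray);
// 11222,333 11222,333 11222,333
console.log(num); // 11222333 (replace() does not modify the original string)
```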
Have a nice day!
