Javascript regexp search with capture - javascript

How would I capture the two integers from the following string into two different variables using a regexp javascript search?
"10 of 25"

Your regex is going to be very specific to your strings, so this answer might be specific to whatever your actual use case is. However, just put the digits in capturing groups. The .+? in front of each group means "lazily match one or more of any character until you find two digits". Because .+? needs at least one character, the pattern below matches the string together with its surrounding quotes, not the bare text 10 of 25.
So if there is a chance you'll have two digits that shouldn't be captured, you'd want to add some extra checks, such as a positive lookahead/lookbehind for quotes, etc.
.+?(\d\d).+?(\d\d).+?
Simply refer to each capture group as $1, $2, etc.
Use ?: in a group to make it non-capturing, fwiw.
http://regex101.com/r/vN6jO2
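As a minimal sketch of the same idea in JavaScript, the snippet below uses a tighter pattern than the one above, (\d+)\s+of\s+(\d+), and reads the groups from the array returned by exec(). The function name parseProgress and the field names current and total are hypothetical, and the pattern assumes the input always looks like "N of M".

```javascript
// Hypothetical helper (not from the original answer): extracts both integers
// from strings shaped like "10 of 25".
function parseProgress(text) {
  // Group 1 is the first integer, group 2 is the second integer.
  const match = /(\d+)\s+of\s+(\d+)/.exec(text);
  if (match === null) {
    return null; // no "N of M" found
  }
  // match[0] is the whole match; the captures start at index 1.
  return { current: Number(match[1]), total: Number(match[2]) };
}

const progress = parseProgress("10 of 25");
console.log(progress.current, progress.total); // 10 25
```

Inside a replacement string passed to replace(), the same groups are available as $1 and $2; on a match array they are match[1] and match[2].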

Related

Defining a financial number with a regular expression

The following are valid numbers for various financial displays depending on the region, etc:
1,000
1000.00
1,000,000.00
2.000.000,00
123748 # without commas
4 294 967 295,000
0.24
.24
24
24.
What would be a better approach to finding a regex for the above, doing one large regex or multiple regexes for each pattern? For example, an individual pattern being like:
no_thousands_separator = '\d+[.,]?\d{0,3}?'
You likely want to design an expression similar, though perhaps not identical, to:
^\.?(?:(?:\d{1,3}[,. ])*\d{1,3}(?:\.\d{2})?|\d+\.\d+|\d+)\.?$
One way to design that is to start with your most complicated pattern, write an expression for it, and then adapt it as you work down to your simplest pattern.
I've just added a \.? at the beginning and at the end of the expression, which isn't really right; you should incorporate those wherever you actually need them.
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
This expression does not validate the numbers; it only lets those formats pass.
You can make one regex, to keep things more simple:
\d*(?:([., ])(?:\d{3}\1)*\d{3})?(?:[.,]\d*)?
Inspect on regex101.com
How does this work?
\d* Where leading digits can occur
([., ]) Capture the thousands separator (the enclosing group is optional)
\d{3} Match 3 digits between separators
\1 Recall the thousands separator
(?:[.,]\d*)? Optionally matches the decimal part (no thousands separator allowed here)
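To try the single expression against the sample numbers from the question, a rough JavaScript check could look like this. The ^ and $ anchors and the helper name looksLikeAmount are assumptions added here so that the whole string has to match. Note that every part of the pattern is optional, so as written it also accepts an empty string.

```javascript
// The expression from the answer, wrapped in ^...$ (an assumption added here).
const amountPattern = /^\d*(?:([., ])(?:\d{3}\1)*\d{3})?(?:[.,]\d*)?$/;

// Hypothetical helper name.
function looksLikeAmount(text) {
  return amountPattern.test(text);
}

// The sample values listed in the question.
const samples = [
  "1,000", "1000.00", "1,000,000.00", "2.000.000,00", "123748",
  "4 294 967 295,000", "0.24", ".24", "24", "24.",
];
const results = samples.map(looksLikeAmount);
console.log(results.every(Boolean)); // true

// Mixed thousands separators are rejected because \1 must repeat the first one.
console.log(looksLikeAmount("1,000.000,00")); // false
```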

How to use regex ?: operator and get the right group in my case? [duplicate]

This is an example string:
123456#p654321
Currently, I am using this match to capture 123456 and 654321 into two different groups:
([0-9].*)#p([0-9].*)
But on occasions, the #p654321 part of the string will not be there, so I will only want to capture the first group. I tried to make the second group "optional" by appending ? to it, which works, but only as long as there is a #p at the end of the remaining string.
What would be the best way to solve this problem?
You have the #p outside of the capturing group, which makes it a required piece of the result. You are also using the dot character (.) improperly. Dot (in most reg-ex variants) will match any character. Change it to:
([0-9]*)(?:#p([0-9]*))?
The (?:) syntax is how you get a non-capturing group. We then capture just the digits that you're interested in. Finally, we make the whole thing optional.
Also, most reg-ex variants have a \d character class for digits. So you could simplify even further:
(\d*)(?:#p(\d*))?
As another person has pointed out, the * operator could potentially match zero digits. To prevent this, use the + operator instead:
(\d+)(?:#p(\d+))?
Your regex can actually match zero digits, because you've used * instead of +.
This is what (I think) you want:
(\d+)(?:#p(\d+))?
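A short JavaScript sketch of how the final pattern behaves with and without the #p part. The ^ and $ anchors and the variable names are assumptions added for illustration; they are not part of the answers above.

```javascript
// Anchors (^ and $) are an assumption added here so the whole input must match.
const idPattern = /^(\d+)(?:#p(\d+))?$/;

const both = "123456#p654321".match(idPattern);
console.log(both[1], both[2]); // 123456 654321

const firstOnly = "123456".match(idPattern);
// The optional group did not participate, so its capture is undefined.
console.log(firstOnly[1], firstOnly[2]); // 123456 undefined
```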

JS RegEx replacement of a non-captured group?

I'm currently going through the book "Eloquent JavaScript". There's an exercise at the end of Chapter 9 on Regular Expressions whose solution I couldn't fully understand. A description of the exercise can be found here.
TL;DR: The objective is to replace single quotes (') with double quotes (") in a given string while keeping single quotes in contractions. Using the replace method with a RegEx, of course.
Now, after actually solving this exercise using my own method, I checked the proposed solution, which looks like this:
console.log(text.replace(/(^|\W)'|'(\W|$)/g, '$1"$2'));
The RegEx looks fine and it's quite understandable, but what I fail to understand is the usage of replacements, mainly why using $2 works ? As far as I know this regular expression will only take one path of two, either (^|\W)' or '(\W|$) each of these paths will only result in a single captured group, so we will only have $1 available. And yet $2 is capturing what comes after the single quote without having an explicit second capture group that does this in the regular expression. One can argue that there are two groups, but then again $2 is capturing a different string than the one intended by the second group.
My questions :
Why $2 is actually a valid string and is not undefined, and what is it referring to precisely?
Is this one of JavaScript RegEx quirks ?
Does this mean $1, $2... don't always refer to explicit groups ?
The backreferences are initialized with an empty string upon each match, so there will be no issues if a group is not matched. And it is no quirk, it is in compliance with the ES5 standard.
Here is a quote from Backreferences to Failed Groups:
According to the official ECMA standard, a backreference to a non-participating capturing group must successfully match nothing just like a backreference to a participating group that captured nothing does.
So, when a group does not participate in the match, a backreference to it refers to an empty string, not undefined. It is not a quirk, just a "feature". It is sometimes unexpected, but it is just how it works.
In your scenario, either of the backreferences is empty upon a match since there are two alternative branches and only one matches each time. The point is to restore the char matched in either of the groups. Both backreferences are used as either of them contains the text to restore while the other only contains empty text.
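The behaviour can be checked with a small sketch. The sample sentence and the variable names are illustrative assumptions; the first regex is the one from the question. The last lines show a related detail: in the array returned by match(), a group that did not participate is undefined, while $n in a replacement string substitutes an empty string for it.

```javascript
// Sample sentence, used here as an illustrative input.
const text = "'I'm the cook,' he said, 'it's my job.'";
const quoted = text.replace(/(^|\W)'|'(\W|$)/g, '$1"$2');
console.log(quoted); // "I'm the cook," he said, "it's my job."

// Smaller illustration: only one of the two groups participates in each match.
const marked = "ab".replace(/(a)|(b)/g, "[$1|$2]");
console.log(marked); // [a|][|b]

// In the match array, by contrast, the unmatched group shows up as undefined.
const m = "b".match(/(a)|(b)/);
console.log(m[1] === undefined, m[2]); // true b
```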

What is the purpose of the passive (non-capturing) group in a Javascript regex?

What is the purpose of the passive group in a Javascript regex?
The passive group is prefaced by a question mark colon: (?:group)
In other words, these 2 things appear identical:
"hello world".match(/hello (?:world)/)
"hello world".match(/hello world/)
In what situations do you need the non capturing group and why?
Two use cases for capturing groups
A capturing group in a regex has actually two distinct goals (as the name "capturing group" itself suggests):
Grouping: if you need a group to be treated as a single entity in order to apply some operation to the whole group.
Probably the most trivial example is including an optional sequence of characters, e.g. "foo" optionally followed by "bar", in regex terms: /foo(bar)?/ (capturing group) or /foo(?:bar)?/ (non-capturing group). Note that the trailing ? is applied to the whole group (bar) (which consists of a simple character sequence bar in this case). If you just want to check whether the input matches your regex, it really doesn't matter whether you use a capturing or a non-capturing group; they act the same (except that a non-capturing group may be slightly faster).
Capturing: if you need to extract a part of the input.
For example, you want to get number of rabbits from an input like "The farm contains 8 cows and 89 rabbits" (not very good English, I know). The regex could be /(\d+)\s*rabbits\b/. On successful match, you can get the value matched by the capturing group from JavaScript code (or any other programming language).
In this example, you have a single capturing group, so in the array returned by match() you access it via index 1; index 0 holds the whole match (see this answer for details).
Now imagine you want to ensure that the "place" is called "farm" or "ranch". If it's not the case, then you don't want to extract the number of rabbits (in regex terms, you don't want the regex to match).
So you rewrite your regex as follows: /(farm|ranch).*\b(\d+)\s*rabbits\b/. The regex works by itself, but your JavaScript is broken: there are two capturing groups now and you must change your code to get the contents of the second capturing group for the number of rabbits (i.e. change the index from 1 to 2). The first group now contains the string "farm" or "ranch", which you didn't intend to extract.
A non-capturing group comes to rescue: /(?:farm|ranch).*\b(\d+)\s*rabbits\b/. It still matches either "farm" or "ranch", but doesn't capture it, thus not shifting the indexes of subsequent capturing groups. And your JavaScript code works fine without changing.
The example may be oversimplified, but consider that you have a very complex regex with many groups, and you want to capture only few of them. Non-capturing groups are really helpful then — you don't have to count all of your groups (only capturing ones).
Besides, non-capturing groups serve documentation purposes: for someone who reads your code, a non-capturing group is an indication that you are not interested in extracting its contents; you just want to ensure that it matches.
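The rabbit example can be sketched in JavaScript as follows. The variable names are illustrative assumptions. In the array returned by match(), index 0 holds the whole match, so the first capturing group is at index 1.

```javascript
const input = "The farm contains 8 cows and 89 rabbits";

// One capturing group: the rabbit count is at index 1 of the match array.
const simple = input.match(/(\d+)\s*rabbits\b/);
console.log(simple[1]); // 89

// Capturing the place as well shifts the count to index 2.
const shifted = input.match(/(farm|ranch).*\b(\d+)\s*rabbits\b/);
console.log(shifted[1], shifted[2]); // farm 89

// A non-capturing group keeps the count at index 1.
const stable = input.match(/(?:farm|ranch).*\b(\d+)\s*rabbits\b/);
console.log(stable[1]); // 89

// Inputs without "farm" or "ranch" no longer match at all.
const elsewhere = "The zoo contains 89 rabbits".match(/(?:farm|ranch).*\b(\d+)\s*rabbits\b/);
console.log(elsewhere); // null
```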
A few words on separation of concerns
Capturing groups are a typical example of breaking the SoC principle. This syntax construct serves two distinct purposes. Once the resulting problems became evident, an additional construct (?:) was introduced to disable one of the two features.
It was just a design mistake. Maybe a lack of "free special characters" played its role... but it was still a poor design.
Regex is a very old, powerful and widely used concept. For the reasons of backwards compatibility, this flaw is now unlikely to be fixed. It's just a lesson of how important the separation of concerns is.
Non-capturing groups have just one difference from "normal" (capturing) groups: they don't require the regex engine to remember what they matched.
The use case is that sometimes you must (or should) use a group not because you are interested in what it captures but for syntactic reasons. In these situations it makes sense to use a non-capturing group instead of a "standard" capturing one because it is less resource intensive -- but if you don't care about that, a capturing group will behave in the exact same manner.
Your specific example does not make a good case for using non-capturing groups exactly because the two expressions are identical. A better example might be:
input.match(/hello (?:world|there)/)
In addition to the answers above, if you're using String.prototype.split() and you use a capturing group, the output array contains the captured results (see MDN). If you use a non-capturing group that doesn't happen.
var myString = 'Hello 1 word. Sentence number 2.';
var splits = myString.split(/(\d)/);
console.log(splits);
Outputs:
["Hello ", "1", " word. Sentence number ", "2", "."]
Whereas swapping /(\d)/ for /(?:\d)/ results in:
["Hello ", " word. Sentence number ", "."]
When you want to apply quantifiers to the group.
/hello (?:world)?/
/hello (?:world)*/
/hello (?:world)+/
/hello (?:world){3,6}/
etc.
Use them when you need a conditional and don't care about which of the choices causes the match.
Non-capturing groups can simplify the result of matching a complex expression. Here, group 1 is always the speaker's name. Without the non-capturing group, the speaker's name may end up in group 1 or group 2.
/hello (?:world|foobar )?said (.+)/
I have just found a different use for it. I was trying to capture a nested group but wanted the whole collection of the repeating group as one element:
So for AbbbbC
(A)((?:b)*)(C)
gives three groups A bbbb C
For AC, it also gives three groups: A, an empty string (the group matched zero b's), and C
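A quick JavaScript check of this pattern, with a capturing inner group shown for comparison. The variable names are illustrative assumptions.

```javascript
// Non-capturing inner group: the repeated b's arrive as one element.
const withInner = "AbbbbC".match(/(A)((?:b)*)(C)/);
console.log(withInner.slice(1)); // [ 'A', 'bbbb', 'C' ]

// With a capturing inner group there is an extra entry holding only the last repetition.
const capturingInner = "AbbbbC".match(/(A)((b)*)(C)/);
console.log(capturingInner.slice(1)); // [ 'A', 'bbbb', 'b', 'C' ]

// With no b's at all, the middle group still participates and captures "".
const noBs = "AC".match(/(A)((?:b)*)(C)/);
console.log(noBs.slice(1)); // [ 'A', '', 'C' ]
```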

What does $1, $2, etc. mean in Regular Expressions?

Time and time again I see $1 and $2 being used in code. What does it mean? Can you please include examples?
When you create a regular expression you have the option of capturing portions of the match and saving them as placeholders. They are numbered starting at $1.
For instance:
/A(\d+)B(\d+)C/
This will capture from A90B3C the values 90 and 3. If you need to group things but don't want to capture them, use the (?:...) version instead of (...).
The numbers start from left to right in the order the brackets are open. That means:
/A((\d+)B)(\d+)C/
Matching against the same string will capture 90B, 90 and 3.
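In JavaScript, the two patterns above can be tried like this. The variable names and the replacement string "$2 and $1" are illustrative assumptions.

```javascript
// Two sibling groups, numbered left to right.
const flat = "A90B3C".match(/A(\d+)B(\d+)C/);
console.log(flat[1], flat[2]); // 90 3

// Nested groups are numbered by the position of their opening bracket.
const nested = "A90B3C".match(/A((\d+)B)(\d+)C/);
console.log(nested[1], nested[2], nested[3]); // 90B 90 3

// In String.prototype.replace, $1 and $2 insert the captured text.
const swapped = "A90B3C".replace(/A(\d+)B(\d+)C/, "$2 and $1");
console.log(swapped); // 3 and 90
```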
This is especially useful in replacement string syntax (i.e. format strings), for example for case folding in find-and-replace operations. To reference a capture, use $n, where n is the capture group number. In that syntax, $0 means the entire match (in JavaScript's replace(), the entire match is written $& instead).
Example:
Find: (<a.*?>)(.*?)(</a>)
Replace: $1\u$2\e$3
