Regex to count the number of capturing groups in a regex - javascript

I need a regex that examines arbitrary regex (as a string), returning the number of capturing groups. So far I have...
arbitrary_regex.toString().match(/\((|[^?].*?)\)/g).length
This works in some cases, on the assumption that any group starting with a question mark is non-capturing. It also counts empty groups.
It does not work for brackets included in character classes, or escaped brackets, and possibly some other scenarios.

Modify your regex so that it will match an empty string, then match an empty string and see how many groups it returns:
var num_groups = (new RegExp(regex.toString() + '|')).exec('').length - 1;
Example: http://jsfiddle.net/EEn6G/
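A minimal sketch of the same trick wrapped in a function. The name countCapturingGroups is hypothetical, and the sketch assumes the input is a valid RegExp object. It reads the pattern from the source and flags properties rather than toString(), so the delimiting slashes and flags are not treated as part of the pattern:

```javascript
// Hypothetical helper: counts capturing groups by making the pattern match ''.
function countCapturingGroups(re) {
  // Appending "|" adds an empty alternative, so exec('') always succeeds.
  const alwaysMatches = new RegExp(re.source + '|', re.flags);
  // The result holds the whole match at index 0 plus one slot per capturing group.
  return alwaysMatches.exec('').length - 1;
}

console.log(countCapturingGroups(/(a)(?:b)(c(d))/)); // 3
console.log(countCapturingGroups(/\(a\)[(]/));       // 0
```

Because the regex engine itself parses the pattern, escaped brackets and brackets inside character classes are handled without any extra work.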

The accepted answer is what you should use in any production system. However, if you wanted to solve it using a regex for fun, you can do that as shown below. It assumes the regex you want the number of groups in is correct.
Note that the number of groups is just the number of non-literal (s in the regex. The strategy we're going to take is instead of matching all the correct (, we're going to split on all the incorrect stuff in between them.
re.toString().split(/(\(\?|\\\[|\[(?:\\\]|.)*?\]|\\\(|[^(])+/g).length - 1
You can see how it works on www.debuggex.com.
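For experimenting with the "skip everything that is not a real (" idea, here is a hedged variant rather than the answer's exact expression. In current engines String.prototype.split also returns the text captured by any group in the separator, which changes the array length, so this sketch deletes the unwanted parts with replace and counts what is left. The helper name countGroupsByDeletion is hypothetical. It assumes a valid regex without the v flag, and it treats named groups such as (?<name>...) as capturing:

```javascript
// Hypothetical helper: deletes everything that is not the "(" of a capturing group.
function countGroupsByDeletion(re) {
  // Alternatives, in order:
  //   \\[\s\S]                    an escaped character such as \( or \[
  //   \[(?:\\[\s\S]|[^\]\\])*\]   a whole character class such as [(]
  //   \(\?(?!<(?![=!]))           the "(?" of (?: (?= (?! (?<= (?<! but not of (?<name>
  //   [^(]                        any other character that is not "("
  const notACapturingParen = /\\[\s\S]|\[(?:\\[\s\S]|[^\]\\])*\]|\(\?(?!<(?![=!]))|[^(]/g;
  // Only the opening parentheses of capturing groups survive the deletion.
  return re.source.replace(notACapturingParen, '').length;
}

console.log(countGroupsByDeletion(/(a)(?:b)(c(d))/)); // 3
console.log(countGroupsByDeletion(/\(a\)[(]/));       // 0
```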

Related

Regex exact match on number, not digit

I have a scenario where I need to find and replace a number in a large string using javascript. Let's say I have the number 2 and I want to replace it with 3 - it sounds pretty straightforward until I get occurrences like 22, 32, etc.
The string may look like this:
"note[2] 2 2_ someothertext_2 note[32] 2finally_2222 but how about mymomsays2."
I want to turn it into this:
"note[3] 3 3_ someothertext_3 note[32] 3finally_2222 but how about mymomsays3."
Obviously this means .replace('2','3') is out of the picture so I went to regex. I find it easy to get an exact match when I am dealing with string start to end ie: /^2$/g. But that is not what I have. I tried grouping, digit only, wildcards, etc and I can't get this to match correctly.
Any help on how to exactly match a number (where 0 <= number <= 500 is possible, but no constraints needed in regex for range) would be greatly appreciated.
The task is to find (and replace) "single" digit 2, not embedded in
a number composed of multiple digits.
In regex terms, this can be expressed as:
Match digit 2.
Previous char (if any) can not be a digit.
Next char (if any) can not be a digit.
The regex for the first condition is straightforward - just 2.
In other flavours of regex, e.g. PCRE, to forbid the previous
char you could use negative lookbehind, but older Javascript engines
do not support it (lookbehind was only added in ES2018).
So, to circumvent this, we must:
Put a capturing group matching either start of text or something
other than a digit: (^|\D).
Then put regex matching just 2: 2.
The last condition, fortunately, can be expressed as negative lookahead,
because even Javascript regex supports it: (?!\d).
So the whole regex is:
(^|\D)2(?!\d)
Having found such a match, you have to replace it with the content
of the first capturing group and 3 (the replacement digit).
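As a rough sketch of that replacement step, the helper below builds the pattern for any non-negative integer and puts the captured prefix back with a replacer function, which avoids any ambiguity in how a replacement string like $13 is read. The name replaceWholeNumber and its parameters are hypothetical:

```javascript
// Hypothetical helper: replaces `from` only where it stands alone as a number.
// Assumes `from` and `to` are non-negative integers.
function replaceWholeNumber(text, from, to) {
  // (^|\D) captures the start of the text or one non-digit;
  // (?!\d) forbids a digit right after the number.
  const pattern = new RegExp('(^|\\D)' + String(from) + '(?!\\d)', 'g');
  // Put the captured prefix back, followed by the replacement number.
  return text.replace(pattern, (match, prefix) => prefix + String(to));
}

const input = 'note[2] 2 2_ someothertext_2 note[32] 2finally_2222 but how about mymomsays2.';
console.log(replaceWholeNumber(input, 2, 3));
// note[3] 3 3_ someothertext_3 note[32] 3finally_2222 but how about mymomsays3.
```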
You can use negative look-ahead:
(\D|^)2(?!\d)
Replace with: ${1}3
If look behind is supported:
(?<!\d)2(?!\d)
Replace with: 3
See regex in use here
(\D|\b)2(?!\d)
(\D|\b) Capture either a non-digit character or a position that matches a word boundary
(?!\d) Negative lookahead ensuring what follows is not a digit
Alternations:
(^|\D)2(?!\d) # Thanks to @Wiktor in the comments below
(?<!\d)2(?!\d) # At the time of writing works in Chrome 62+
const regex = /(\D|\b)2(?!\d)/g
const str = `note[2] 2 2_ someothertext_2 note[32] 2finally_2222 but how about mymomsays2.`
const subst = "$13"
console.log(str.replace(regex, subst))

Regex finding the last string that doesn't contain a number

Usually in my system I have the following string:
http://localhost/api/module
To find the last part of the string (which is my route) I've been using the following:
/[^\/]+$/g
However, there may be cases where my string looks a bit different, such as:
http://localhost/api/module/123
Using the above regex, it would then return 123. When my string looks like this, I know that the last part will always be a number. So my question is: how do I make sure that I can always find the last string that does not contain a number?
This is what I came up with, which strictly matches only module for the following lines:
http://localhost/api/module
http://localhost/api/module/123
http://localhost/api/module/123a
http://localhost/api/module/a123
http://localhost/api/module/a123a
http://localhost/api/module/1a3
(?!\w*\d\w*)[^\/][a-zA-Z]+(?=\/\w*\d+\w*|$)
Explanation
I basically just extended your expression with a negative lookahead and a positive lookahead, so it matches only when the following conditions hold:
(?!\w*\d\w*) May contain letters, but no digits
[a-zA-Z]+ Really, truly only consists of one or more letters (was needed)
(?=\/\w*\d+\w*|$) The match is followed either by a slash and a segment containing digits, or by the end of the line
See this in action in my sample at Regex101.
partYouWant = urlString.replace(/^.*\/([a-zA-Z]+)[\/0-9]*$/,'$1')
Here it is in action:
urlString="http://localhost/api/module/123"
urlString.replace(/^.*\/([a-zA-Z]+)[\/0-9]*$/,'$1')
-->"module"
urlString="http://localhost/api/module"
urlString.replace(/^.*\/([a-zA-Z]+)[\/0-9]*$/,'$1')
-->"module"
It just uses a capture expression to find the last non-numeric part.
It's going to do this too, not sure if this is what you want:
urlString="http://localhost/api/module/123/456"
urlString.replace(/^.*\/([a-zA-Z]+)[\/0-9]*$/,'$1')
-->"module"
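For comparison, a sketch that avoids a single large regex: split the URL on slashes and walk backwards until a segment without digits turns up. This is an alternative technique, not the method used in the answers above, and the name lastSegmentWithoutDigits is hypothetical:

```javascript
// Hypothetical helper: returns the last path segment that contains no digit.
function lastSegmentWithoutDigits(url) {
  // Drop the empty pieces produced by "//" and by a trailing slash.
  const segments = url.split('/').filter((segment) => segment !== '');
  // Walk from the end and return the first segment with no digit in it.
  for (let i = segments.length - 1; i >= 0; i--) {
    if (!/\d/.test(segments[i])) return segments[i];
  }
  return null; // every segment contained a digit
}

console.log(lastSegmentWithoutDigits('http://localhost/api/module'));      // module
console.log(lastSegmentWithoutDigits('http://localhost/api/module/123'));  // module
console.log(lastSegmentWithoutDigits('http://localhost/api/module/a123a')); // module
```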
/([0-9])\w+/g
That would select the numbers. You could use it to remove that part from the URL. What language are you using it for?

Regex for digits and hyphen only

I am trying to understand regex, for digits of length 10 I can simply do
/^[0-9]{10}$/
for hyphen only I can do
/^[-]$/
combining the two using group expression will result in
/^([0-9]{10})|([-])$/
This expression does not work as intended: it will somehow match part of the string instead of not matching at all when the string is invalid.
How do I make the regex expression that accepts only "-" or 10 digits?
It would have worked fine to combine your two regexps exactly as you had them. In other words, just use the alternation/pipe operator to combine
/^[0-9]{10}$/
and
/^[-]$/
as is, directly into
/^[0-9]{10}$|^[-]$/
↑↑↑↑↑↑↑↑↑↑↑ ↑↑↑↑↑ YOUR ORIGINAL REGEXPS, COMBINED AS IS WITH |
That would have worked fine. As others have pointed out, you don't need to specify the hyphen in a character class, so
/^[0-9]{10}$|^-$/
↑ SIMPLIFY [-] TO JUST -
Now, we notice that each of the two alternatives has a ^ at the beginning and a $ at the end. That is a bit duplicative, and it also makes it a little harder to see immediately that the regexp is always matching things from beginning to end. Therefore, we can rewrite this, as explained in other answers, by taking the ^ and $ out of both sub-regexps and combining their contents using the grouping operator ():
/^([0-9]{10}|-)$/
↑↑↑↑↑↑↑↑↑↑↑↑↑ GROUP REGEXP CONTENTS WITH PARENS, WITH ANCHORS OUTSIDE
That would also work fine, but you could use \d instead of [0-9], so the final, simplest version is:
/^(\d{10}|-)$/
↑↑ USE \d FOR DIGITS
If for some reason you don't want to "capture" the group, use (?:, as in
/^(?:\d{10}|-)$/
↑↑ DON'T CAPTURE THE GROUP
By the way, in your original attempt to combine the two regexps, I noticed that you parenthesized them as in
/^([0-9]{10})|([-])$/
↑↑↑↑↑↑↑↑↑↑↑ ↑↑↑↑↑ YOU PARENTHESIZED THE SUB-REGEXPS
But actually this is not necessary, because the pipe (alternation, or "or") operator has low precedence already (actually it has the lowest precedence of any regexp operator); "low precedence" means it will apply only after things on both sides are already processed, so what you wrote here is identical to
/^[0-9]{10}|[-]$/
which, however, still won't work for the reasons mentioned in other answers.
How do I make the regex expression that accepts only "-" or 10 digits?
You can use:
/^([0-9]{10}|-)$/
Your regex does not anchor both alternatives: because of where the parentheses sit, ^ applies only to the digits and $ applies only to the hyphen.
Here is the effective breakdown of OP's regex:
^([0-9]{10}) # matches 10 digits at start
| # OR
([-])$ # matches hyphen at end
which will cause OP's regex to match any input starting with 10 digits or ending with hyphen making these invalid inputs also a valid match:
1234567890111
1234----
------------------
1234567890--------
To get the regex expression that accepts only "-" or 10 digits - change your regexp as shown below:
^(\d{10}|-)$
The problem with your regex is that it looks for strings that either
start with 10 digits, i.e. ^([0-9]{10}), or
end with "-", i.e. ([-])$
You need an additional wrapping ^( .. )$ to get this to work, i.e.
/^(([0-9]{10})|([-]))$/
Better yet /^([0-9]{10}|-)$/ since [-] and - are both the same.
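A quick way to see the difference is to run both patterns over a few inputs. This is only an illustrative sketch; the variable names broken and fixed and the sample list are made up for the comparison:

```javascript
// The original attempt: ^ binds only to the digits, $ only to the hyphen.
const broken = /^([0-9]{10})|([-])$/;
// Anchors outside a (non-capturing) group, so they apply to both alternatives.
const fixed = /^(?:\d{10}|-)$/;

const samples = ['-', '1234567890', '1234567890111', '1234----', '123'];
for (const sample of samples) {
  console.log(JSON.stringify(sample), 'broken:', broken.test(sample), 'fixed:', fixed.test(sample));
}
// "-"              broken: true   fixed: true
// "1234567890"     broken: true   fixed: true
// "1234567890111"  broken: true   fixed: false
// "1234----"       broken: true   fixed: false
// "123"            broken: false  fixed: false
```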

Regexp: excluding a word but including non-standard punctuation

I want to find strings that contain words in a particular order, allowing non-standard characters in between the words but excluding a particular word or symbol.
I'm using javascript's replace function to find all instances and put into an array.
So, I want select...from, with anything except 'from' in between the words. Or I can separate select...from from select...from (, as long as I exclude nesting. I think the answer is the same for both, i.e. how do I write: find x and not y within the same regexp?
From the internet, I feel this should work: /\bselect\b^(?!from).*\bfrom\b/gi but this finds no matches.
This works to find all select...from: /\bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b/gi but modifying it to exclude the parenthesis "(" at the end prevents any matches: /\bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b\s*^\(/gi
Can anyone tell me how to exclude words and symbols within this regexp?
Many thanks
Emma
Edit: partial string input:
left outer join [stage].[db].[table14] o on p.Project_id = o.project_id
left outer join
(
select
different_id
,sum(costs) - ( sum(brushes) + sum(carpets) + sum(fabric) + sum(other) + sum(chairs)+ sum(apples) ) as overallNumber
from
(
select ace from [stage].db.[table18] J
Javascript:
sequel = stringInputAsAbove;
var tst = sequel.replace(/\bselect\b[\s\S]*?\bfrom\b/gi, function(a,b) { console.log('match: '+a); selects.push(b); return a; });
console.log(selects);
Console.log(selects) should print an array of numbers, where each number is the starting character of a select...from. This works for the second regexp I gave in my info, printing: [95, 251]. Your \s\S variation does the same, @stribizhev.
The first example ^(?!from).* should do likewise but returns [].
The third example \s*^\( should return 251 only but returns []. However I have just noticed that the positive expression \s*\( does give 95, so some progress! It's the negatives I'm getting wrong.
Your \bselect\b^(?!from).*\bfrom\b regex doesn't work as expected because:
^ here means the beginning of a line, not negation of the next part, so
\bselect\b^ means the word select followed by the beginning of a
line. After removing ^ the regex starts to match something
(DEMO), but it is still not correct.
In multiline text, .* without modification will not match a newline,
so the regex will match select...from only within a single line. If you
change it to (.|\n)* (as a simple example) it will match across
lines, but the result is still not correct.
The * is a greedy quantifier, so it will match as much as possible.
If you use the reluctant quantifier *?, the regex will match up to the first
occurrence of the word from, and it will start to return a relatively
correct result.
\bselect\b(?!from) means: match the separate word select that is not
directly followed by from. But the \b after select already requires a
non-word character (or the end of the text) at that position, so from can
never follow directly, and (?!from) is redundant.
In effect you will get regex very similar to what Stribizhev gave you: \bselect\b(.|\n)*?\bfrom\b
In the third expression you make the same mistake: \bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b\s*^\( uses ^ as (I assume) a negation, not the beginning of a line. Remove ^ and you will again get a relatively valid result (a match from select through from to the opening parenthesis ( ).
Your second regex works similarly to \bselect\b(.|\n)*?\bfrom\b or \bselect\b[\s\S]*?\bfrom\b.
I wrote "relatively valid result" because I also think that parsing SQL with regex can be very complicated, so I am not sure it will work in every case.
You can also try to use positive lookahead to match just position in text, like:
(?=\bselect\b(?:.|\n)*?\bfrom\b)
DEMO - the () was added to the regex just to return the beginning index of the match in groups, so it is easier to check its validity
Negation in regex
We use ^ as negation in a character class; for example, [^a-z] means match anything but a letter, so it will match a digit, symbol, whitespace, etc., but not a letter in the range a to z (Look here). But this negation works at the level of a single character. If you use [^from] it will prevent the regex from matching the characters f, r, o and m (demo). Likewise, [^from]{4} will avoid matching from, but also form, morf, etc.
To exclude a whole word from matching, you need a negative lookahead such as (?!from), which fails if the word from follows the given position. To avoid matching a whole line that contains from you could use ^(?!.*from.*).+$ (demo).
However, in your case you don't need this construction, because if you replace the greedy .*\bfrom with .*?\bfrom it will match up to the first occurrence of this word. What's more, it would cause problems. Take a look at this regex: it will not match anything, because (?![\s\S]*from[\s\S]*) is not restricted by anything, so it succeeds only if there is no from anywhere after select, yet we also want to match from! In effect the regex tries to match and exclude from at once, and fails. So the (?!.*word.*) construction works much better for excluding a whole line that contains a given word.
So what do we do if we don't want to match a word inside a fragment of a match? I think select\b([^f]|f(?!rom))*?\bfrom\b is a good solution. With ([^f]|f(?!rom))*? it matches everything between select and from without stepping over a from, yet it still lets the closing \bfrom\b match.
But if you would like to match only select...from not followed by (, then it is a good idea to use (?!\(). But in your regex (multiline, with (.|\n)*? or [\s\S]*?) this will cause it to match up to the next select...from part, because the reluctant quantifier will move the place where it stops in order to make the whole regex match. In my opinion, a good solution would again be to use:
select\b([^f]|f(?!rom))*?\bfrom\b(?!\s*?\()
which will not overlap additional select..from and will not match if there is \( after select...from - check it here
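A small sketch of how that last expression could be used to collect the starting offsets, as the question asks. It is lightly adapted (a leading \b, a non-capturing group, and \s* in the lookahead), the helper name selectOffsets and the sample SQL string are made up for illustration, and String.prototype.matchAll needs an ES2020 environment:

```javascript
// select ... from, where the span between them never crosses another "from",
// and the closing "from" is not followed by optional whitespace and "(".
const selectFrom = /\bselect\b(?:[^f]|f(?!rom))*?\bfrom\b(?!\s*\()/gi;

// Hypothetical helper: start offset of every matching select...from.
function selectOffsets(sql) {
  return Array.from(sql.matchAll(selectFrom), (match) => match.index);
}

const sql = 'select a, b from t1 join ( select c from ( select d from t2 ) x ) y';
console.log(selectOffsets(sql)); // [ 0, 43 ]
```

The middle select is skipped because its from is followed by "(", and the lazy group cannot stretch past that from to reach the next one.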

The behavior of /g mode matching

This article mentions:
Make sure you are clear on the fact that an expression pattern is
tested on each individual character. And that, just because the engine
moves forward when following the pattern and looking for a match it
still backtracks and examines each character in a string until a match
is found or if the global flag is set until all characters are
examined.
But when I tested this in Javascript,
"aaa#bbb".match(/a+#b+/g)
does not produce a result like:
["aaa#bbb", "aa#bbb", "a#bbb"]
It only produces ["aaa#bbb"]. It seems it does not examine each character to test the pattern. Can anyone explain the matching steps a little? Thanks.
/g does not mean it will try to find every possible subset of characters in the input string which may match the given pattern. It means that once a match is found, it will continue searching for more substrings which may match the pattern starting from the previous match onward.
For example:
"aaa#bbb ... aaaa#bbbb".match(/a+#b+/g);
Will produce
["aaa#bbb", "aaaa#bbbb"]
That explanation is mixing two distinct concepts that IMO should be kept separated
A) backtracking
When looking for a match the normal behavior for a quantifier (?, *, +) is to be "greedy", i.e. to munch as much as possible... for example in /(a+)([^b]+)/ tested with aaaacccc all the a will be part of group 1 even if of course they also match the char set [^b] (everything but b).
However if grabbing too much is going to prevent a match the RE rules require that the quantifier "backtracks" capturing less if this allows the expression to match. For example in (a+)([^b]+) tested with aaaa the group 1 will get only three as, to leave one for group 2 to match.
You can change this greedy behavior with "non-greedy quantifiers" like *?, +?, ??. In this case the engine still backtracks, but with the reverse meaning: a non-greedy subexpression will eat as little as possible while still allowing the rest of the expression to match. For example (a+)(a+b+) tested with aaabbb will leave two as for group 1 and abbb for group 2, but (a+?)(a+b+) with the same string will instead leave only one a for group 1, because this is the minimum that allows the remaining part to match.
Note also that, because of backtracking, the greedy/non-greedy option doesn't change whether an expression has a match or not, but only how big the match is and how much goes to each sub-expression.
B) the "global" option
This is something totally unrelated to backtracking and simply means that instead of stopping at the first match the search must find all non-overlapping matches. This is done by finding the first match and then starting again the search after the end of the match.
Note that each match is computed using the standard regexp rules and there is no look-ahead or backtracking between different matches: in other words if making for example a greedy match shorter would give more matches in the string this option is not considered... a+[^b]+ tested with aaaaaa is going to give only one match even if g option is specified and even if the substrings aa, aa, aa would have been each a valid match for the regexp.
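The examples from this answer can be checked directly in a console. The small helper captureGroups is hypothetical and only strips the full match from the result so the groups are easier to read:

```javascript
// Hypothetical helper: returns just the captured groups of the first match.
function captureGroups(re, text) {
  const match = text.match(re);
  return match ? match.slice(1) : null;
}

// Greedy: group 1 takes every "a" it can.
console.log(captureGroups(/(a+)([^b]+)/, 'aaaacccc')); // [ 'aaaa', 'cccc' ]
// Backtracking: group 1 gives one "a" back so group 2 can match.
console.log(captureGroups(/(a+)([^b]+)/, 'aaaa'));     // [ 'aaa', 'a' ]
console.log(captureGroups(/(a+)(a+b+)/, 'aaabbb'));    // [ 'aa', 'abbb' ]
// Non-greedy: group 1 takes as little as still lets the rest match.
console.log(captureGroups(/(a+?)(a+b+)/, 'aaabbb'));   // [ 'a', 'aabbb' ]
// Global flag: one greedy match, not three shorter ones.
console.log('aaaaaa'.match(/a+[^b]+/g));               // [ 'aaaaaa' ]
```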
When the global flag is used, it starts searching for the next match after the end of the previous match, to prevent generating lots of overlapping matches like that.
If you don't specify /g, the engine will stop as soon as a match is found.
If you do specify /g, it will keep going after a match. It will, however, still not produce overlapping matches, which is what you're asking about.
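To make the contrast concrete, the sketch below first shows the non-overlapping behaviour of /g and then one possible way to get the overlapping results the question expected. The lookahead-with-capture approach is not taken from the answers here; it is a separate technique, shown under the assumption that String.prototype.matchAll (ES2020) is available:

```javascript
// With /g, matching resumes after the end of the previous match.
console.log('aaa#bbb'.match(/a+#b+/g));               // [ 'aaa#bbb' ]
console.log('aaa#bbb ... aaaa#bbbb'.match(/a+#b+/g)); // [ 'aaa#bbb', 'aaaa#bbbb' ]

// A lookahead consumes no characters, so the engine retries at every position;
// the capturing group inside it records what could be matched from there.
const overlapping = Array.from('aaa#bbb'.matchAll(/(?=(a+#b+))/g), (match) => match[1]);
console.log(overlapping); // [ 'aaa#bbb', 'aa#bbb', 'a#bbb' ]
```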
It's because of the following.
What regex tries to do:
Every regex tries to find the best match at each position.
What regex won't do:
It will not return every combination contained in a single match, as in your case.
When a scenario like your "aaa#bbb".match(/a+#b+/g) does work:
Try an input such as aaa#bbbHiaa#bbbHelloa#bbbSEEYOU instead, which will give you
aaa#bbb
aa#bbb
a#bbb
