Metacharacters and parenthesis in regular expressions

Metacharacters and parenthesis in regular expressions - javascript

Can anyone elaborate/translate this regular expression into English?
Thank you.
var g = "123456".match(/(.)(.)/);
I have noticed that the output looks like this:
12,1,2
and I know that dot means any character except new line. But what does this actually do?

A pair of parenthesis (without a ? as the first character, indicating other behaviour) will capture the contents to a group.
In your example, the first item in the array is the entire match, and subsequent items are any group matches.
It might be clearer if your code was something like:
var g = "123456".match(/.(.).(.)./);
This will match five characters, placing the second and fourth into groups 1 and 2 respectively, so outputting 12345,2,4
If you want pure grouping without capturing the content, use (?:...) syntax, the ?: part indicating a non-capturing group. (There are various assorted group things, like lookaheads and other fun stuff.)
Let me know if that is clear, or would further explanation help?

It looks for two characters - any characters because of the dots - and 'captures' them so that you can look for the whole string that was matched, and for each of the substrings (captures) as well.

Related

How to use regex ?: operator and get the right group in my case? [duplicate]

This is an example string:
123456#p654321
Currently, I am using this match to capture 123456 and 654321 in to two different groups:
([0-9].*)#p([0-9].*)
But on occasions, the #p654321 part of the string will not be there, so I will only want to capture the first group. I tried to make the second group "optional" by appending ? to it, which works, but only as long as there is a #p at the end of the remaining string.
What would be the best way to solve this problem?

You have the #p outside of the capturing group, which makes it a required piece of the result. You are also using the dot character (.) improperly. Dot (in most reg-ex variants) will match any character. Change it to:
([0-9]*)(?:#p([0-9]*))?
The (?:) syntax is how you get a non-capturing group. We then capture just the digits that you're interested in. Finally, we make the whole thing optional.
Also, most reg-ex variants have a \d character class for digits. So you could simplify even further:
(\d*)(?:#p(\d*))?
As another person has pointed out, the * operator could potentially match zero digits. To prevent this, use the + operator instead:
(\d+)(?:#p(\d+))?

Your regex will actually match no digits, because you've used * instead of +.
This is what (I think) you want:
(\d+)(?:#p(\d+))?

Regexp: excluding a word but including non-standard punctuation

I want to find strings that contain words in a particular order, allowing non-standard characters in between the words but excluding a particular word or symbol.
I'm using javascript's replace function to find all instances and put into an array.
So, I want select...from, with anything except 'from' in between the words. Or I can separate select...from from select...from (, as long as I exclude nesting. I think the answer is the same for both, i.e. how do I write: find x and not y within the same regexp?
From the internet, I feel this should work: /\bselect\b^(?!from).*\bfrom\b/gi but this finds no matches.
This works to find all select...from: /\bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b/gi but modifying it to exclude the parenthesis "(" at the end prevents any matches: /\bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b\s*^\(/gi
Can anyone tell me how to exclude words and symbols within this regexp?
Many thanks
Emma
Edit: partial string input:
left outer join [stage].[db].[table14] o on p.Project_id = o.project_id
left outer join
(
select
different_id
,sum(costs) - ( sum(brushes) + sum(carpets) + sum(fabric) + sum(other) + sum(chairs)+ sum(apples) ) as overallNumber
from
(
select ace from [stage].db.[table18] J
Javascript:
sequel = stringInputAsAbove;
var tst = sequel.replace(/\bselect\b[\s\S]*?\bfrom\b/gi, function(a,b) { console.log('match: '+a); selects.push(b); return a; });
console.log(selects);
Console.log(selects) should print an array of numbers, where each number is the starting character of a select...from. This works for the second regexp I gave in my info, printing: [95, 251]. Your \s\S variation does the same, #stribizhev.
The first example ^(?!from).* should do likewise but returns [].
The third example \s*^\( should return 251 only but returns []. However I have just noticed that the positive expression \s*\( does give 95, so some progress! It's the negatives I'm getting wrong.

Your \bselect\b^(?!from).*\bfrom\b regex doesn't work as expected because:
^ means here beginning of a line, not negation of next part, so
the \bselect\b^ means, select word followed by beginning of a
line. After removal of ^ regex start to match something
(DEMO) but it is still invalid.
in multiline text .* without modification will not match new line,
so regex will match only select...from in single lines, but if you
change it for (.|\n)* (as a simple example) it will match
multiline, but still invalid
the * is greede quantifire, so it will match as much a possible,
but if you use reluctant quantifire *?, regex will match to first
occurance of from word, and int will start to return relativly
correct result.
\bselect\b(?!from) means match separate select word which is not
directly followed by separate from word, so it would be
selectfrom somehow composed of separate words (because
select\bfrom) so (?!from) doesn't work and it is redundant
In effect you will get regex very similar to what Stribizhev gave you: \bselect\b(.|\n)*?\bfrom\b
In third expression you meke same mistake: \bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b\s*^\( using ^ as (I assume) a negation, not beginning of a line. Remove ^ and you will again get relativly valid result (match from select through from to closing parathesis ) ).
Your second regex works similar to \bselect\b(.|\n)*?\bfrom\b or \bselect\b[\s\S]*?\bfrom\b.
I wrote "relativly valid result", as I also think, that parsing SQL with regex could be very camplicated, so I am not sure if it will work in every case.
You can also try to use positive lookahead to match just position in text, like:
(?=\bselect\b(?:.|\n)*?\bfrom\b)
DEMO - the () was added to regex just to return beginning index of match in groups, so it would be easier to check it validity
Negation in regex
We use ^ as negation in character class, for example [^a-z] means match anything but not letter, so it will match number, symbol, whitespace, etc, but not letter from range a to z (Look here). But this negation is on a level of single character. I you use [^from] it will prevent regex from matching characters f,r,o and m (demo). Also the [^from]{4} will avoid matching from but also form, morf, etc.
To exlude the whole word from matching by regex, you need to use negative look ahead, like (?!from), which will fail to match, if there will be chosen word from fallowing given position. To avoid matching whole line containing from you could use ^(?!.*from.*).+$ (demo).
However in your case, you don't need to use this construction, because if you replace greedy quantifire .*\bfrom with .*?\bfrom it will match to first occurance of this word. Whats more it would couse problems. Take a look on this regex, it will not match anything because (?![\s\S]*from[\s\S]*) is not restricted by anything, so it will match only if there is no from after select, but we want to match also from! in effect this regex try to match and exclude from at once, and fail. so the (?!.*word.*) construction works much better to exclude matching line with given word.
So what to do if we don't what to match a word in a fragment of a match? I think select\b([^f]|f(?!rom))*?\bfrom\b is a good solution. With ([^f]|f(?!rom))*? it will match everything between select and from, but will not exclude from.
But if you would like to match only select...from not followed by ( then it is good idea to use (?!\() like. But in your regex (multiline, use of (.|\n)*? or [\s\S]*? it will cause to match up to next select...from part, because reluctant quantifire will chenge a plece where it need to match to make whole regex . In my opinion, good solution would be to use again:
select\b([^f]|f(?!rom))*?\bfrom\b(?!\s*?\()
which will not overlap additional select..from and will not match if there is \( after select...from - check it here

silent group not working in javascript regex match()

I'm trying to extract (potentially hyphenated) words from a string that have been marked with a '#'.
So for example from the string
var s = '#moo, #baa and #moo-baa are writing an email to a#bc.de'
I would like to return
['#moo', '#baa', '#moo-baa']
To make sure I don't capture the email address, I check that the group is preceded by a white-space character OR the beginning of the line:
s.match(/(^|\s)#(\w+[-\w+]*)/g)
This seems to do the trick, but it also captures the spaces, which I don't want:
["#moo", " #baa", " #moo-baa"]
Silencing the grouping like this
s.match(/(?:^|\s)#(\w+[-\w+]*)/g)
doesn't seem to work, it returns the same result as before. I also tried the opposite, and checked that there's no \w or \S in front of the group, but that also excludes the beginning of the line. I know I could simply trim the spaces off, but I'd really like to get this working with just a single 'match' call.
Anybody have a suggestion what I'm doing wrong? Thanks a lot in advance!!
[edit]
I also just noticed: Why is it returning the '#' symbols as well?! I mean, it's what I want, but why is it doing that? They're outside of the group, aren't they?

As far as I know, the whole match is returned from String.match when using the "g" modifier. Because, with the modifier you are telling the function to match the whole expression instead of creating numbered matches from sub-expressions (groups). A global match does not return groups, instead the groups are the matches themselves.
In your case, the regular expression you were looking for might be this:
'#moo, #baa and #moo-baa are writing an email to a#bc.de'.match(/(?!\b)(#[\w\-]+)/g);
You are looking for every "#" symbol that doesn't follow a word boundary. So there is no need for silent groups.

If you don't want to capture the space, don't put the \s inside of the parentheses. Anything inside the parentheses will be returned as part of the capture group.

JavaScript and regular expressions: get the number of parenthesized subpattern

I have to get the number of parenthesized substring matches in a regular expression:
var reg=/([A-Z]+?)(?:[a-z]*)(?:\([1-3]|[7-9]\))*([1-9]+)/g,
nbr=0;
//Some code
alert(nbr); //2
In the above example, the total is 2: only the first and the last couple of parentheses will create grouping matches.
How to know this number for any regular expressions?
My first idea was to check the value of RegExp.$1 to RegExp.$9, but even if there are no corresponding parenthseses, these values are not null, but empty string...
I've also seen the RegExp.lastMatch property, but this one represents only the value of the last matched characters, not the corresponding number.
So, I've tried to build another regular expression to scan any RegExp and count this number, but it's quite difficult...
Do you have a better solution to do that?
Thanks in advance!

Javascripts RegExp.match() method returns an Array of matches. You might just want to check the length of that result array.
var mystr = "Hello 42 world. This 11 is a string 105 with some 2 numbers 55";
var res = mystr.match(/\d+/g);
console.log( res.length );

Well, judging from the code snippet we can assume that the input pattern is always a valid regular expression, because otherwise it would fail before the some code partm right? That makes the task much easier!
Because We just need to count how many starting capturing parentheses there are!
var reg = /([A-Z]+?)(?:[a-z]*)(?:\([1-3]|[7-9]\))*([1-9]+)/g;
var nbr = (' '+reg.source).match(/[^\\](\\\\)*(?=\([^?])/g);
nbr = nbr ? nbr.length : 0;
alert(nbr); // 2
And here is a breakdown:
[^\\] Make sure we don't start the match with an escaping slash.
(\\\\)* And we can have any number of escaped slash before the starting parenthes.
(?= Look ahead. More on this later.
\( The starting parenthes we are looking for.
[^?] Make sure it is not followed by a question mark - which means it is capturing.
) End of look ahead
Why match with look ahead? To check that the parenthes is not an escaped entity, we need to capture what goes before it. No big deal here. We know JS doens't have look behind.
Problem is, if there are two starting parentheses sticking together, then once we capture the first parenthes the second parenthes would have nothing to back it up - its back has already been captured!
So to make sure a parenthes can be the starting base of the next one, we need to exclude it from the match.
And the space added to the source? It is there to be the back of the first character, in case it is a starting parenthes.

How do you use non captured elements in a Javascript regex?

I want to capture thing in nothing globally and case insensitively.
For some reason this doesn't work:
"Nothing thing nothing".match(/no(thing)/gi);
jsFiddle
The captured array is Nothing,nothing instead of thing,thing.
I thought parentheses delimit the matching pattern? What am I doing wrong?
(yes, I know this will also match in nothingness)

If you use the global flag, the match method will return all overall matches. This is the equivalent of the first element of every match array you would get without global.
To get all groups from each match, loop:
var match;
while(match = /no(thing)/gi.exec("Nothing thing nothing"))
{
// Do something with match
}
This will give you ["Nothing", "thing"] and ["nothing", "thing"].

Parentheses or no, the whole of the matched substring is always captured--think of it as the default capturing group. What the explicit capturing groups do is enable you to work with smaller chunks of text within the overall match.
The tutorial you linked to does indeed list the grouping constructs under the heading "pattern delimiters", but it's wrong, and the actual description isn't much better:
(pattern), (?:pattern) Matches entire contained pattern.
Well of course they're going to match it (or try to)! But what the parentheses do is treat the entire contained subpattern as a unit, so you can (for example) add a quantifier to it:
(?:foo){3} // "foofoofoo"
(?:...) is a pure grouping construct, while (...) also captures whatever the contained subpattern matches.
With just a quick look-through I spotted several more examples of inaccurate, ambiguous, or incomplete descriptions. I suggest you unbookmark that tutorial immediately and bookmark this one instead: regular-expressions.info.

Parentheses do nothing in this regex.
The regex /no(thing)/gi is same as /nothing/gi.
Parentheses are used for grouping. If you don't put any reference to groups (using $1, $2) or count for group, the () are useless.
So, this regex will find only this sequence n-o-t-h-i-n-g. The word thing does'nt starts with 'no', so it doen't match.
EDIT:
Change to /(no)?thing/gi and will work.
Will work because ()? indicates a optional part.

Develop Reference

JavaScript is the programming language of the Web.

Metacharacters and parenthesis in regular expressions - javascript

Can anyone elaborate/translate this regular expression into English? Thank you. var g = "123456".match(/(.)(.)/); I have noticed that the output looks like this: 12,1,2 and I know that dot means any character except new line. But what does this actually do?

It looks for two characters - any characters because of the dots - and 'captures' them so that you can look for the whole string that was matched, and for each of the substrings (captures) as well.

Related

How to use regex ?: operator and get the right group in my case? [duplicate]

Regexp: excluding a word but including non-standard punctuation

silent group not working in javascript regex match()

JavaScript and regular expressions: get the number of parenthesized subpattern

How do you use non captured elements in a Javascript regex?

Categories

Resources