Regex to follow pattern except between braces - javascript

I am having a tough time figuring out a clean regex (in a JavaScript implementation) that will capture as much of a line as it can following a pattern, but where anything inside braces doesn't need to follow the pattern. The best way I can explain it is by example:
Let's say the pattern is: the line must start with a 0, the match ends at the next 0 (wherever it occurs), and only a sequence of 1, 2 or 3 is allowed in between, so I use ^(0[123]+0). This should match the first part of the strings:
0213123123130
012312312312303123123
01231230123123031230
etc.
But I want to be able to insert {gibberish} between braces into the line and have the Regex allow it to disrupt the pattern. i.e., ignore the pattern of the curly braces and everything inside, but still capture the full string including the {gibberish}. So this would capture everything in bold:
01232231{whatever 3 gArBaGe? I want.}121{foo}2310312{bar}3120123
and a 0 inside the braces does not end the capture prematurely, even if the pattern is correct.
01213123123123{21310030123012301}31231230123
EDIT: Now, I know I could just do something like ^0[123]*?(?:{.*})*?[123]*?0 maybe? But that only works if there is a single set of braces, and now I have to duplicate my [123] pattern. As that [123] pattern gets more complex, having it appear more than once in the Regex starts getting really incomprehensible. Something like the best regex trick seemed promising but I couldn't figure out how to apply it here. Using crazy lookarounds seems like the only way now but I would hope there's a cleaner way.

Since you've specified that you want the whole match including the garbage, you can use ^0([123]+(?:{[^}]*}[123]*)*)0 and use $1 to get the part between the 0s, or $& (match[0] in the result array) to get everything that matched.
https://regex101.com/r/iFSabs/3
Here's the rundown on how the regex works:
^ anchors the match to start at the beginning of the line
0 matches a literal zero character
([123]+(?:{[^}]*}[123]*)*) is a capturing group that captures everything inside of it.
[123]+ matches one or more instances of 1, 2, or 3
(?:{[^}]*}[123]*)* is a non-capturing group. I.e. it'll be part of the match, but won't have a $# for use in replacement or the match.
{[^}]*} matches a literal { followed by any number of non } characters followed by }
[123]* matches zero or more instances of 1, 2, or 3
Then this whole non-capturing group can be matched 0 or more times.
The process behind this regex is known as unrolling the loop. http://www.softec.lu/site/RegularExpressions/UnrollingTheLoop gives a good description of it. (with a few typo fixes)
The unrolling the loop technique is based on the hypothesis that in
most case, you [know] in a [repeated] alternation, which case should be
the most usual and which one is exceptional. We will called the first
one, the normal case and the second one, the special case. The general
syntax of the unrolling the loop technique could then be written as:
normal* ( special normal* )*
Which could means something like, match the normal case, if you find a
special case, matched it than match the normal case again. [You'll] notice
that part of this syntax could [potentially] lead to a super-linear
match.
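As a rough sketch of that normal* ( special normal* )* shape applied to this question, the regex can be assembled from two named pieces, so the [123] part is written only once in the source. The helper name buildUnrolled and the variable names are hypothetical and do not come from the quoted article:

```javascript
// Hypothetical helper: assembles 0 normal+ (special normal*)* 0 from two
// pattern fragments, so a more complex "normal" part is defined in one place.
function buildUnrolled(normal, special) {
  return new RegExp(`^0((?:${normal})+(?:${special}(?:${normal})*)*)0`);
}

// normal: one allowed character; special: a {...} block with no } inside.
const unrolled = buildUnrolled('[123]', '\\{[^}]*\\}');

const sample = '01213123123123{21310030123012301}31231230123';
const found = unrolled.exec(sample);

// found[0] is the whole match, found[1] is the part between the two zeros.
console.log(found ? found[0] : null);
console.log(found ? found[1] : null);
```

The fragments are plain strings, so backslashes in them have to be doubled, which is the main cost of this design.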
Example using RegExp#test and RegExp#exec:
const strings = [
'0213123123130',
'012312312312303123123',
'01231230123123031230',
'01213123123123{21310030123012301}31231230123',
'01212121{hello 0}121312',
'012321212211231{whatever 3 gArBaGe? I want.}1212313123120123',
'012321212211231{whatever 3 gArBaGe? I want.}121231{extra garbage}3123120123',
];
const regex = /^0([123]+(?:{[^}]*}[123]*)*)0/
console.log('tests')
console.log(strings.map(string => `'${string}': ${regex.test(string)}`))
console.log('matches');
let matches = strings
.map((string) => regex.exec(string))
.map((match) => (match ? match[1] : undefined));
console.log(matches);
Robo Robok's answer is the approach I'd take if you only want to keep the non-braced part, although I'd use a slightly different regex ({[^}]*}) for a bit more performance.

How about the other way around? Checking the string with curly tags removed:
const string = '012321212211231{whatever 3 gArBaGe? I want.}1212313123120123{foo}123';
const stringWithoutTags = string.replace(/\{.*?\}/g, '');
const result = /^(0[123]+0)/.test(stringWithoutTags);
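Wrapped up as a function and run against a few of the strings from the question, this could look like the sketch below. The function name is made up for illustration, and it uses [^}]* instead of .*? inside the braces, which assumes a tag never contains a }. Note that this approach only validates: the match is found in the stripped string, so the brace contents are not part of any captured text.

```javascript
// Hypothetical helper: strip {...} sections first, then apply the plain pattern.
function startsWithPatternIgnoringBraces(input) {
  const withoutTags = input.replace(/\{[^}]*\}/g, '');
  return /^0[123]+0/.test(withoutTags);
}

const checks = [
  '0213123123130',
  '01212121{hello 0}121312',
  '012321212211231{whatever 3 gArBaGe? I want.}1212313123120123',
].map(startsWithPatternIgnoringBraces);

// The second string has no closing 0 outside the braces, so it is rejected.
console.log(checks);
```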

You say you need to capture everything, including the gibberish, so I think a simple pattern like this should work:
^(0(?:[123]|{.+?})+0)
That allows a string starting with 0, and then any of your pattern characters (1, 2, or 3), or one of the { gibberish } sections, and allows that to repeat to handle multiple gibberish sections, and finally it must end with a 0.
https://regex101.com/r/K4teGY/2
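One thing that may be worth checking with this pattern: because .+? can grow past a closing brace when the rest of the match needs it, characters sitting between two brace sections can end up being swallowed. The sketch below compares it with a [^}]* variant (a substitution for comparison, not part of this answer) on a made-up input where a stray x sits outside the braces:

```javascript
// Pattern from the answer (braces escaped for clarity) and a stricter variant.
const lazyDot = /^(0(?:[123]|\{.+?\})+0)/;
const noClosingBrace = /^(0(?:[123]|\{[^}]*\})+0)/;

// Made-up input: the "x" is outside any braces and is not 1, 2 or 3.
const stray = '01{a}x{b}20';

// .+? backtracks and expands to "a}x{b", so the whole string is accepted.
console.log(lazyDot.test(stray));
// [^}]* cannot cross the first "}", so the stray "x" makes the match fail.
console.log(noClosingBrace.test(stray));

// Both accept input where everything outside the braces follows the pattern.
const clean = '01{a}2{b}20';
console.log(lazyDot.test(clean), noClosingBrace.test(clean));
```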

You might use
^0[123]*(?:{[^{}]*}[123]*)*0
^ Start of string
0 Match a zero
[123]* Match 0+ times either 1, 2 or 3
(?: Non capture group
{[^{}]*}[123]* Match from an opening { to a closing }, followed by 0+ times either 1, 2 or 3
)* Close group and repeat 0+ times
0 Match a zero

Related

How to extract separate parts of a string with a regex

I'm trying to build a regex that can process the following:
abc
abc-def
where the -def part is optional.
I'm wanting to get capture groups for the "abc", and optional "def" part.
I've tried this (in Javascript) but can't seem to figure out the optional part:
/^(.*)+(-(.*))?$/
It matches both examples but the optional part is contained in the first capture group. This should be simple, but I can't seem to get it right.
You're close, try a ? to make the expression lazy.
/^(.*?)(-(.*))?$/
You can try /^([^-]+)(-(.*))?$/. One issue is that the first + is outside of the capture group, which means the group only keeps the last repetition. Secondly, the .* is greedy and will match a -, gobbling all the way to the end of the line.
Runnable example:
console.log("abc-def".match(/^([^-]*)(-(.*))?$/));
console.log("abc".match(/^([^-]*)(-(.*))?$/));
You may not need to capture the substring starting with -, in which case /^([^-]*)(?:-(.*))?$/ could work.
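A small sketch of how that last pattern might be used, with destructuring to pull out both parts. The function name splitOptional is hypothetical; the second part comes back as undefined when there is no -:

```javascript
// Hypothetical helper around /^([^-]*)(?:-(.*))?$/.
function splitOptional(input) {
  const match = /^([^-]*)(?:-(.*))?$/.exec(input);
  if (!match) return null; // e.g. a line break somewhere after the "-"
  const [, head, tail] = match;
  return { head, tail };
}

console.log(splitOptional('abc'));         // tail is undefined
console.log(splitOptional('abc-def'));     // head "abc", tail "def"
console.log(splitOptional('abc-def-ghi')); // everything after the first "-" goes to tail
```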

Regexp: excluding a word but including non-standard punctuation

I want to find strings that contain words in a particular order, allowing non-standard characters in between the words but excluding a particular word or symbol.
I'm using javascript's replace function to find all instances and put into an array.
So, I want select...from, with anything except 'from' in between the words. Or I can separate select...from from select...from (, as long as I exclude nesting. I think the answer is the same for both, i.e. how do I write: find x and not y within the same regexp?
From the internet, I feel this should work: /\bselect\b^(?!from).*\bfrom\b/gi but this finds no matches.
This works to find all select...from: /\bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b/gi but modifying it to exclude the parenthesis "(" at the end prevents any matches: /\bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b\s*^\(/gi
Can anyone tell me how to exclude words and symbols within this regexp?
Many thanks
Emma
Edit: partial string input:
left outer join [stage].[db].[table14] o on p.Project_id = o.project_id
left outer join
(
select
different_id
,sum(costs) - ( sum(brushes) + sum(carpets) + sum(fabric) + sum(other) + sum(chairs)+ sum(apples) ) as overallNumber
from
(
select ace from [stage].db.[table18] J
Javascript:
sequel = stringInputAsAbove;
var tst = sequel.replace(/\bselect\b[\s\S]*?\bfrom\b/gi, function(a,b) { console.log('match: '+a); selects.push(b); return a; });
console.log(selects);
console.log(selects) should print an array of numbers, where each number is the starting character of a select...from. This works for the second regexp I gave in my info, printing: [95, 251]. Your \s\S variation does the same, @stribizhev.
The first example ^(?!from).* should do likewise but returns [].
The third example \s*^\( should return 251 only but returns []. However I have just noticed that the positive expression \s*\( does give 95, so some progress! It's the negatives I'm getting wrong.
Your \bselect\b^(?!from).*\bfrom\b regex doesn't work as expected because:
^ here means the beginning of a line (or of the whole input without the m flag), not negation of the next part, so \bselect\b^ means the word select followed by the beginning of a line. After removing ^ the regex starts to match something, but it is still not correct.
In multiline text, .* will not match a new line without modification, so the regex will only match a select...from that sits on a single line. If you change it to (.|\n)* (as a simple example) it will match across lines, but it is still not correct.
The * is a greedy quantifier, so it will match as much as possible. If you use the reluctant quantifier *?, the regex will match up to the first occurrence of the word from, and it will start to return a relatively correct result.
\bselect\b(?!from) means: match a separate word select which is not directly followed by the text from. Because of the \b, the character after select can never be a word character such as f, so (?!from) always succeeds and is redundant.
In effect you will get a regex very similar to what Stribizhev gave you: \bselect\b(.|\n)*?\bfrom\b
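The points above can be checked on a small multiline input. This is only a sketch; the sample text is made up and much simpler than real SQL:

```javascript
// Made-up sample: the first select...from spans several lines.
const text = 'select\n a\nfrom t; select b from u';

// 1. "^" right after select can never be satisfied, so nothing matches.
const withCaret = text.match(/\bselect\b^(?!from).*\bfrom\b/gi);

// 2. "." stops at line breaks, so only the single-line statement is found.
const dotOnly = text.match(/\bselect\b.*\bfrom\b/gi);

// 3. Greedy [\s\S]* runs to the last "from", merging both statements.
const greedy = text.match(/\bselect\b[\s\S]*\bfrom\b/gi);

// 4. Reluctant [\s\S]*? stops at the first "from" after each select.
const reluctant = text.match(/\bselect\b[\s\S]*?\bfrom\b/gi);

console.log(withCaret, dotOnly, greedy, reluctant);
```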
In the third expression you make the same mistake: \bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b\s*^\( uses ^ as (I assume) a negation, not as the beginning of a line. Remove ^ and you will again get a relatively valid result (a match from select through from to the opening parenthesis ( ).
Your second regex works similarly to \bselect\b(.|\n)*?\bfrom\b or \bselect\b[\s\S]*?\bfrom\b.
I wrote "relatively valid result" because I also think that parsing SQL with regex can be very complicated, so I am not sure it will work in every case.
You can also try to use a positive lookahead to match just a position in the text, like:
(?=\bselect\b(?:.|\n)*?\bfrom\b)
In the demo, the () was added to the regex just to return the beginning index of the match in groups, so it is easier to check its validity.
Negation in regex
We use ^ as negation in a character class. For example, [^a-z] means match anything but a letter, so it will match a digit, symbol, whitespace, etc., but not a letter in the range a to z. But this negation works at the level of a single character. If you use [^from] it will prevent the regex from matching the characters f, r, o and m. Also, [^from]{4} will avoid matching from, but also form, morf, etc.
To exclude a whole word from being matched, you need a negative lookahead such as (?!from), which fails if the chosen word follows the given position. To avoid matching a whole line containing from you could use ^(?!.*from.*).+$.
However, in your case you don't need this construction, because if you replace the greedy .*\bfrom with the reluctant .*?\bfrom it will match up to the first occurrence of this word. What's more, it would cause problems. A regex that puts (?![\s\S]*from[\s\S]*) right after select will not match anything, because the lookahead is not restricted by anything: it only succeeds if there is no from anywhere after select, but we also want to match from! In effect this regex tries to match and exclude from at once, and fails. So the (?!.*word.*) construction works much better for excluding a whole line containing a given word.
So what to do if we don't want to match a word inside a fragment of a match? I think select\b([^f]|f(?!rom))*?\bfrom\b is a good solution. With ([^f]|f(?!rom))*? it will match everything between select and from without running past a from, and the closing from is still matched.
But if you would like to match only a select...from that is not followed by (, then it is a good idea to use (?!\(). But in your regex (multiline, using (.|\n)*? or [\s\S]*?) it will cause the match to extend up to the next select...from part, because the reluctant quantifier will change the place where it matches in order to make the whole regex succeed. In my opinion, a good solution would be to use again:
select\b([^f]|f(?!rom))*?\bfrom\b(?!\s*?\()
which will not overlap an additional select...from and will not match if there is a ( after select...from.
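To see the difference the f(?!rom) alternation makes, here is a sketch on a made-up two-level query (not the asker's full input). With [\s\S]*? the negative lookahead pushes the match forward to the inner from; with the alternation the outer select is simply skipped:

```javascript
// Made-up input: the outer from is followed by "(", the inner one is not.
const sql = 'select a from (\nselect b from t)';

// Reluctant wildcard plus lookahead: the match overlaps the inner statement.
const overlapping = sql.match(/\bselect\b[\s\S]*?\bfrom\b(?!\s*?\()/gi);

// Alternation from the answer: the middle part can never contain "from".
const isolated = sql.match(/select\b([^f]|f(?!rom))*?\bfrom\b(?!\s*?\()/gi);

console.log(overlapping); // one match spanning both lines
console.log(isolated);    // only the inner select ... from
```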

Satisfying two condition in one regex pattern in javascript

I am not sure if I have put the question right.
I want to satisfy both the text with one regular expression.
text1 = 'foobar';
text2 = 'foobar-baz';
Expected Output of text1
$1 should be bar
$2 should be ''
Expected Output of text2
$1 should be bar
$2 should be baz
Here is what I have tried:
/foo([a-z0-9\-_=\+\/]+)(\-(.*))?/i
result for text1 is correct but for text2, $1 gets the full string foobar-baz
The problem here is due to the possible inclusion of - in the first capturing group. There are 2 cases:
There are one or more - in the string, and you want to pick the last group delimited by the hyphen. Intuitively, we think of greedy quantifier, and a simple solution like:
input.match(/foo([a-z0-9_=+\/-]+)-(.*)/)
would work.
However the second case, where there are no - in the string, combined with the previous case, causes problem.
Since [a-z0-9_=+\/-]+ contains -, if you make -(.*) optional, given an input in the first case, it will just match to the end of the string and put everything in the first capturing group.
We need to control the backtracking behavior so that when there is at least one -, it must match it and match the last one, and allow the first group to gobble up everything when there is no -.
One solution which makes minimal change to your current regex is:
input.match(/foo([a-z0-9_=+\/-]+?)(?:-([a-z0-9_=+\/]*))?$/)
The lazy quantifier makes the engine try from the left-most - first, and the anchor $ together with the character class without - at the end forces the engine to split only at the last -, if any.
Note that the second capturing group will be undefined when there is no -.
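If an empty string is wanted for $2 instead of undefined, as in the question, the result can be normalized after matching. A minimal sketch with a hypothetical helper name, keeping the i flag from the question's original pattern:

```javascript
// Hypothetical helper: returns [$1, $2], with '' when the optional part is missing.
function splitAfterFoo(input) {
  const m = input.match(/foo([a-z0-9_=+\/-]+?)(?:-([a-z0-9_=+\/]*))?$/i);
  if (!m) return null;
  return [m[1], m[2] === undefined ? '' : m[2]];
}

console.log(splitAfterFoo('foobar'));     // [ 'bar', '' ]
console.log(splitAfterFoo('foobar-baz')); // [ 'bar', 'baz' ]
```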
Sample input output:
'foogoobarbaz'.match(/foo([a-z0-9_=+\/-]+?)(?:-([a-z0-9_=+\/]*))?$/)
> [ "foogoobarbaz", "goobarbaz", undefined ]
'foogoobar-baz'.match(/foo([a-z0-9_=+\/-]+?)(?:-([a-z0-9_=+\/]*))?$/)
> [ "foogoobar-baz", "goobar", "baz" ]
'foogoo-bar-baz'.match(/foo([a-z0-9_=+\/-]+?)(?:-([a-z0-9_=+\/]*))?$/)
> [ "foogoo-bar-baz", "goo-bar", "baz" ]
You can use a non-capturing group:
/foo([a-z0-9\-_=\+\/]+)(?:-(.*))?/i
That solves the problem of avoiding the additional capture group. However, your pattern still has the problem of including - as a valid character for the first string. Because of that, when you execute the pattern against "foobar-baz", the entire fragment "bar-baz" will match the first group in the pattern.
You're going to have to decide what it is you want to match; your rule is currently at odds with the result you seek. If you remove the - from the first group:
/foo([a-z0-9_=\+\/]+)(?:-(.*))?/i
then you get the result you say you're looking for.

JavaScript and regular expressions: get the number of parenthesized subpattern

I have to get the number of parenthesized substring matches in a regular expression:
var reg=/([A-Z]+?)(?:[a-z]*)(?:\([1-3]|[7-9]\))*([1-9]+)/g,
nbr=0;
//Some code
alert(nbr); //2
In the above example, the total is 2: only the first and the last couple of parentheses will create grouping matches.
How to know this number for any regular expressions?
My first idea was to check the values of RegExp.$1 to RegExp.$9, but even if there are no corresponding parentheses, these values are not null but empty strings...
I've also seen the RegExp.lastMatch property, but this one represents only the value of the last matched characters, not the corresponding number.
So, I've tried to build another regular expression to scan any RegExp and count this number, but it's quite difficult...
Do you have a better solution to do that?
Thanks in advance!
JavaScript's String.prototype.match() method returns an Array of matches. You might just want to check the length of that result array.
var mystr = "Hello 42 world. This 11 is a string 105 with some 2 numbers 55";
var res = mystr.match(/\d+/g);
console.log( res.length );
Well, judging from the code snippet we can assume that the input pattern is always a valid regular expression, because otherwise it would fail before the "some code" part, right? That makes the task much easier,
because we just need to count how many opening capturing parentheses there are!
var reg = /([A-Z]+?)(?:[a-z]*)(?:\([1-3]|[7-9]\))*([1-9]+)/g;
var nbr = (' '+reg.source).match(/[^\\](\\\\)*(?=\([^?])/g);
nbr = nbr ? nbr.length : 0;
alert(nbr); // 2
And here is a breakdown:
[^\\] Make sure we don't start the match with an escaping backslash.
(\\\\)* And we can have any number of escaped backslashes before the opening parenthesis.
(?= Look ahead. More on this later.
\( The opening parenthesis we are looking for.
[^?] Make sure it is not followed by a question mark - which means it is capturing.
) End of look ahead
Why match with look ahead? To check that the parenthesis is not an escaped entity, we need to capture what goes before it. No big deal here. JS didn't have look behind at the time (it was added in ES2018).
Problem is, if there are two opening parentheses sticking together, then once we capture the first parenthesis the second parenthesis would have nothing to back it up - its back has already been captured!
So to make sure a parenthesis can be the starting base of the next one, we need to exclude it from the match.
And the space added to the source? It is there to be the back of the first character, in case it is an opening parenthesis.
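Another way to get the same number, not used in the answer above, is to let the engine do the counting: append an empty alternative to the source so the new regex always matches the empty string, then look at the length of the result array, which holds the full match plus one slot per capturing group. A sketch, with a hypothetical function name:

```javascript
// Hypothetical helper: counts capturing groups by matching against ''.
// The trailing "|" adds an empty alternative, so exec('') always succeeds.
function countCapturingGroups(re) {
  const probe = new RegExp(re.source + '|', re.flags);
  return probe.exec('').length - 1;
}

const reg = /([A-Z]+?)(?:[a-z]*)(?:\([1-3]|[7-9]\))*([1-9]+)/g;
console.log(countCapturingGroups(reg)); // 2

// A "(" inside a character class is not a group and is not counted.
console.log(countCapturingGroups(/[(]a/)); // 0
```

Since the pattern is parsed by the engine itself, cases such as a ( inside a character class are handled without extra rules, whereas a regex that scans the source text would count that parenthesis.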

Metacharacters and parenthesis in regular expressions

Can anyone elaborate/translate this regular expression into English?
Thank you.
var g = "123456".match(/(.)(.)/);
I have noticed that the output looks like this:
12,1,2
and I know that dot means any character except new line. But what does this actually do?
A pair of parentheses (without a ? as the first character, which indicates other behaviour) will capture the contents to a group.
In your example, the first item in the array is the entire match, and subsequent items are any group matches.
It might be clearer if your code was something like:
var g = "123456".match(/.(.).(.)./);
This will match five characters, placing the second and fourth into groups 1 and 2 respectively, so outputting 12345,2,4
If you want pure grouping without capturing the content, use (?:...) syntax, the ?: part indicating a non-capturing group. (There are various assorted group things, like lookaheads and other fun stuff.)
Let me know if that is clear, or would further explanation help?
It looks for two characters - any characters because of the dots - and 'captures' them so that you can look for the whole string that was matched, and for each of the substrings (captures) as well.
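A short sketch of what the result array holds, and how a non-capturing group changes it:

```javascript
// Index 0 is the whole match; each capturing group adds one more entry.
const g = '123456'.match(/(.)(.)/);
console.log(g[0], g[1], g[2]); // 12 1 2

// (?:.) still consumes a character but does not add an entry.
const h = '123456'.match(/(?:.)(.)/);
console.log(h.length, h[0], h[1]); // 2 12 2
```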
