I am not sure if I have put the question right.
I want to match both of these texts with a single regular expression.
text1 = 'foobar';
text2 = 'foobar-baz';
Expected Output of text1
$1 should be bar
$2 should be ''
Expected Output of text2
$1 should be bar
$2 should be baz
Here is what I have tried:
/foo([a-z0-9\-_=\+\/]+)(\-(.*))?/i
The result for text1 is correct, but for text2, $1 captures the whole remainder bar-baz.
The problem here is due to the possible inclusion of - in the first capturing group. There are 2 cases:
There are one or more - in the string, and you want to split off the last group delimited by a hyphen. Intuitively, a greedy quantifier comes to mind, and a simple solution like:
input.match(/foo([a-z0-9_=+\/-]+)-(.*)/)
would work.
However, the second case, where there is no - in the string, combined with the previous case, causes a problem.
Since [a-z0-9_=+\/-]+ contains -, if you make -(.*) optional, given an input in the first case, it will just match to the end of the string and put everything in the first capturing group.
We need to control the backtracking behavior so that when there is at least one -, it must match it and match the last one, and allow the first group to gobble up everything when there is no -.
One solution which makes minimal change to your current regex is:
input.match(/foo([a-z0-9_=+\/-]+?)(?:-([a-z0-9_=+\/]*))?$/)
The lazy quantifier makes the engine try the left-most - first, and the anchor $ together with the character class without - at the end forces the engine to split only at the last -, if any.
Note that the second capturing group will be undefined when there is no -.
Sample input output:
'foogoobarbaz'.match(/foo([a-z0-9_=+\/-]+?)(?:-([a-z0-9_=+\/]*))?$/)
> [ "foogoobarbaz", "goobarbaz", undefined ]
'foogoobar-baz'.match(/foo([a-z0-9_=+\/-]+?)(?:-([a-z0-9_=+\/]*))?$/)
> [ "foogoobar-baz", "goobar", "baz" ]
'foogoo-bar-baz'.match(/foo([a-z0-9_=+\/-]+?)(?:-([a-z0-9_=+\/]*))?$/)
> [ "foogoo-bar-baz", "goo-bar", "baz" ]
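If you would rather get '' than undefined for the missing second part, as the question asks, the match can be wrapped in a small helper. This is only a minimal sketch: splitFoo and fooPattern are hypothetical names introduced here, and the regex is the one from this answer.

```javascript
// Hypothetical helper built around the regex from this answer.
const fooPattern = /foo([a-z0-9_=+\/-]+?)(?:-([a-z0-9_=+\/]*))?$/;

function splitFoo(text) {
  const match = text.match(fooPattern);
  if (match === null) return null; // text does not fit the pattern at all
  // match[2] is undefined when there is no hyphen; the default turns it into ''.
  const [, first, second = ''] = match;
  return [first, second];
}

console.log(splitFoo('foobar'));         // [ 'bar', '' ]
console.log(splitFoo('foobar-baz'));     // [ 'bar', 'baz' ]
console.log(splitFoo('foogoo-bar-baz')); // [ 'goo-bar', 'baz' ]
```

The destructuring default only kicks in for undefined, so an empty second part after a trailing hyphen still comes back as ''.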
You can use a non-capturing group:
/foo([a-z0-9\-_=\+\/]+)(?:-(.*))?/i
That solves the problem of avoiding the additional capture group. However, your pattern still has the problem of including - as a valid character for the first string. Because of that, when you execute the pattern against "foobar-baz", the entire fragment "bar-baz" will match the first group in the pattern.
You're going to have to decide what it is you want to match; your rule is currently at odds with the result you seek. If you remove the - from the first group:
/foo([a-z0-9_=\+\/]+)(?:-(.*))?/i
then you get the result you say you're looking for.
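A quick check of that last pattern, as a sketch; the variable names are made up for illustration. Note that with - removed from the first group, a string with several hyphens is split at the first one, not the last.

```javascript
// The pattern from this answer, with - removed from the first character class.
const noHyphenInFirst = /foo([a-z0-9_=\+\/]+)(?:-(.*))?/i;

const r1 = 'foobar'.match(noHyphenInFirst);
const r2 = 'foobar-baz'.match(noHyphenInFirst);
const r3 = 'foogoo-bar-baz'.match(noHyphenInFirst);

console.log(r1[1], r1[2]); // bar undefined
console.log(r2[1], r2[2]); // bar baz
console.log(r3[1], r3[2]); // goo bar-baz  (split at the FIRST hyphen)
```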
I use Regex quite a bit. I'm no master, but I've surprised myself with how difficult this has been.
We have a Regex string like this:
^(?:remind me ).*? (to|that|about|its|it's)? ?(.*)$
I want it to match both of the following strings, and assign some value to the first capture group.
remind me in 24 hours test
remind me in 24 hours to test
Assigning this little "to" to the first capture group is proving very difficult.
I could work-around this by doing two passes like below and then checking if the result is null or not, but that seems like madness, so I'm hoping to learn a better approach to this.
const regex1 = /^(?:remind me ).*? (to|that|about|its|it's)? ?(.*)$/i
const regex2 = /(to|that|about|its|it's) ?(.*)$/i
const matches1 = 'remind me in 24 hours to test'.match(regex1)[2]
const matches2 = matches1.match(regex2)
console.log(matches2)
// String1 output: null
// String2 output: [ 'to test', 'to', 'test', index: 9, input: '24 hours to test', groups: undefined ]
On related questions:
I've seen numerous other questions about this - but none of the "solutions" seem applicable here, as most of the answers are tailored to the user's specific issue, and I haven't been able to figure out how to fix our issue using them as a reference.
I read this answer, and it improved my understanding of greedy vs lazy, but did not help me understand how to resolve my issue without crummy code.
TLDR: Desired results would look like below, matching the whole string with to in the first capture group. The contents of the second capture group are not important to us except that the group is not empty.
It works if you remove the optional quantifier from the first capturing group and put .*? together with the capture group into another non-capturing group and make this outer group optional:
^remind me +(?:.*?\b(to|that|about|its|it's)\b *)?(.*)$
See this demo at regex101 (I also made some small changes: adding word boundaries, changing the quantifiers to allow variable spacing, and removing the non-capturing group at the start, which looks unneeded).
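The pattern can be checked against both sample strings from the question with a short script. This is a sketch; remind, samples and results are illustrative names, and the i flag is carried over from the question's code.

```javascript
const remind = /^remind me +(?:.*?\b(to|that|about|its|it's)\b *)?(.*)$/i;

const samples = ['remind me in 24 hours test', 'remind me in 24 hours to test'];

const results = samples.map((text) => {
  const [, keyword, rest] = text.match(remind);
  return { keyword, rest };
});

console.log(results);
// [ { keyword: undefined, rest: 'in 24 hours test' },
//   { keyword: 'to', rest: 'test' } ]
```

Both strings match in a single pass; the first capture group is simply undefined when no keyword is present.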
To understand why this works, first have a look at the simple pattern (a)? and how this results in one capture of a and three empty matches in abc while getting four empty matches in e.g. xyz.
Simplify your current pattern to e.g. ^a.*?(b)?(.*), investigate it in the regex101 debugger and click the matches tab on the left side. For the string abc the regex engine first matches a. The next character b matches the optional group and the capture succeeds. Using the same pattern on another string acbc, after matching the first a the next character is a c. Because (b)? is optional, it matches the empty string between a and the adjacent c (click around step 7 at match 2) and nothing gets captured.
But refactoring this pattern to ^a(?:.*?(b))?(.*) and now looking into the debugger (watch steps 3 to 12) you can see that at the same position after the first a the grouped (?:.*?(b))? part fits in here for both test strings. The first group captures the substring before proceeding in the pattern.
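The same experiment can be run outside the debugger. The sketch below uses the simplified patterns from this explanation; flat, nested and the other variable names are only illustrative.

```javascript
// (a)? alone: one real capture in 'abc', otherwise only empty matches.
const abcMatches = 'abc'.match(/(a)?/g); // [ 'a', '', '', '' ]
const xyzMatches = 'xyz'.match(/(a)?/g); // [ '', '', '', '' ]

// Optional capture on its own vs. wrapped together with the lazy .*?
const flat = /^a.*?(b)?(.*)/;
const nested = /^a(?:.*?(b))?(.*)/;

const flatAbc = 'abc'.match(flat);       // [ 'abc', 'b', 'c' ]
const flatAcbc = 'acbc'.match(flat);     // [ 'acbc', undefined, 'cbc' ]
const nestedAbc = 'abc'.match(nested);   // [ 'abc', 'b', 'c' ]
const nestedAcbc = 'acbc'.match(nested); // [ 'acbc', 'b', 'c' ]

console.log(abcMatches, xyzMatches);
console.log(flatAcbc[1], nestedAcbc[1]); // undefined b
```

In the flat pattern the lazy .*? never needs to grow, because (b)? is happy to match nothing. In the nested pattern the engine has to try growing .*? before it may give up on the whole optional group.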
With your current pattern there are even some strings that will let the first group capture (demo).
I am having a tough time figuring out a clean Regex (in a Javascript implementation) that will capture as much of a line as it can following a pattern, but anything inside braces doesn't need to follow the pattern. I'm not sure the best way to explain that except by example:
For example:
Let's say the pattern is, the line must start with 0, end with a 0 anywhere, but only allow sequence of 1, 2 or 3 in between, so I use ^(0[123]+0). This should match the first part of the strings:
0213123123130
012312312312303123123
01231230123123031230
etc.
But I want to be able to insert {gibberish} between braces into the line and have the Regex allow it to disrupt the pattern. i.e., ignore the pattern of the curly braces and everything inside, but still capture the full string including the {gibberish}. So this would capture everything in bold:
01232231{whatever 3 gArBaGe? I want.}121{foo}2310312{bar}3120123
and a 0 inside the braces does not end the capture prematurely, even if the pattern is correct.
01213123123123{21310030123012301}31231230123
EDIT: Now, I know I could just do something like ^0[123]*?(?:{.*})*?[123]*?0 maybe? But that only works if there is a single set of braces, and now I have to duplicate my [123] pattern. As that [123] pattern gets more complex, having it appear more than once in the Regex starts getting really incomprehensible. Something like the best regex trick seemed promising but I couldn't figure out how to apply it here. Using crazy lookarounds seems like the only way now but I would hope there's a cleaner way.
Since you've specified that you want the whole match including the garbage, you can use ^0([123]+(?:{[^}]*}[123]*)*)0 and use $1 to get the part between the 0s, or $0 to get everything that matched.
https://regex101.com/r/iFSabs/3
Here's the rundown on how the regex works:
^ anchors the match to start at the beginning of the line
0 matches a literal zero character
([123]+(?:{[^}]*}[123]*)*) is a capturing group that captures everything inside of it.
[123]+ matches one or more instances of 1, 2, or 3
(?:{[^}]*}[123]*)* is a non-capturing group. I.e. it'll be part of the match, but won't have a $# for use in replacement or the match.
{[^}]*} matches a literal { followed by any number of non } characters followed by }
[123]* matches zero or more instances of 1, 2, or 3
Then this whole non-capturing group can be matched 0 or more times.
The process behind this regex is known as unrolling the loop. http://www.softec.lu/site/RegularExpressions/UnrollingTheLoop gives a good description of it. (with a few typo fixes)
The unrolling the loop technique is based on the hypothesis that in
most case, you [know] in a [repeated] alternation, which case should be
the most usual and which one is exceptional. We will called the first
one, the normal case and the second one, the special case. The general
syntax of the unrolling the loop technique could then be written as:
normal* ( special normal* )*
Which could means something like, match the normal case, if you find a
special case, matched it than match the normal case again. [You'll] notice
that part of this syntax could [potentially] lead to a super-linear
match.
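Since the question worries about repeating a growing [123] pattern, one option is to build the unrolled regex from named parts, so the normal case is written only once. This is a sketch under that assumption; normal, special and unrolled are names made up here to mirror the quote, not part of any library.

```javascript
// Building blocks, written once. Braces are escaped for clarity.
const normal = '[123]';        // the "normal" case
const special = '\\{[^}]*\\}'; // the "special" case: one {...} section

// normal+ ( special normal* )*  wrapped in the leading and trailing 0
const unrolled = new RegExp(`^0(${normal}+(?:${special}${normal}*)*)0`);

const hit = unrolled.exec('012{a0b}30');
console.log(hit[0]); // 012{a0b}30
console.log(hit[1]); // 12{a0b}3

// At least one normal character is required right after the first 0.
console.log(unrolled.exec('0{x}10')); // null
```

If the normal case becomes more complex, only the normal string has to change.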
Example using RegExp#test and RegExp#exec:
const strings = [
'0213123123130',
'012312312312303123123',
'01231230123123031230',
'01213123123123{21310030123012301}31231230123',
'01212121{hello 0}121312',
'012321212211231{whatever 3 gArBaGe? I want.}1212313123120123',
'012321212211231{whatever 3 gArBaGe? I want.}121231{extra garbage}3123120123',
];
const regex = /^0([123]+(?:{[^}]*}[123]*)*)0/
console.log('tests')
console.log(strings.map(string => `'${string}': ${regex.test(string)}`))
console.log('matches');
let matches = strings
.map((string) => regex.exec(string))
.map((match) => (match ? match[1] : undefined));
console.log(matches);
Robo Robok's answer is what I'd go with if you only want to keep the non-braced part, although using a slightly different regex ({[^}]*}) for a bit more performance.
How about the other way around? Checking the string with curly tags removed:
const string = '012321212211231{whatever 3 gArBaGe? I want.}1212313123120123{foo}123';
const stringWithoutTags = string.replace(/\{.*?\}/g, '');
const result = /^(0[123]+0)/.test(stringWithoutTags);
You say you need to capture everything, including the gibberish, so I think a simple pattern like this should work:
^(0(?:[123]|{.+?})+0)
That allows a string starting with 0, and then any of your pattern characters (1, 2, or 3), or one of the { gibberish } sections, and allows that to repeat to handle multiple gibberish sections, and finally it must end with a 0.
https://regex101.com/r/K4teGY/2
You might use
^0[123]*(?:{[^{}]*}[123]*)*0
^ Start of string
0 Match a zero
[123]* Match 0+ times either 1, 2 or 3
(?: Non capture group
{[^{}]*}[123]* match from an opening till closing } followed by 0+ either 1, 2 or 3
)* Close group and repeat 0+ times
0 Match a zero
Regex demo
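One practical difference between this pattern and the {.+?} pattern from the previous answer shows up on input with stray characters between brace sections. The sketch below uses made-up test strings and variable names; braces are escaped here, which is equivalent in JavaScript.

```javascript
const lazyBraces = /^(0(?:[123]|\{.+?\})+0)/;           // pattern from the previous answer
const negatedBraces = /^0[123]*(?:\{[^{}]*\}[123]*)*0/; // pattern from this answer

const wellFormed = '012{a}3{b}10 tail';
const stray = '01{a}x{b}10'; // an x sits outside any braces

console.log(lazyBraces.exec(wellFormed)[0]);    // 012{a}3{b}10
console.log(negatedBraces.exec(wellFormed)[0]); // 012{a}3{b}10

// .+? is allowed to grow across the first } to rescue the match,
// while [^{}]* can never cross a brace.
console.log(lazyBraces.exec(stray)[0]); // 01{a}x{b}10
console.log(negatedBraces.exec(stray)); // null
```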
var a = 'a\na'
console.log(a.match(/.*/g)) // ['a', '', 'a', '']
Why are there two empty strings in the result?
If empty strings are matched, why isn't there one at the beginning and at the end of each line as well, giving 4 empty strings?
I am not looking for how to select 'a's but just want to understand the presence of the empty strings.
The best explanation I can offer for the following:
'ab\na'.match(/.*/g)
["ab", "", "a", ""]
is that the dot in a JavaScript regex is not in dotAll mode by default, meaning that it does not match newlines. When the .* pattern is applied to ab\na, it first matches ab and stops at the newline. The next attempt starts just before the newline, where .* can only match the empty string. Then a is matched, and finally one more empty match is found at the end of the string, because .* can also match zero characters there.
If you just want to extract the non-empty content of each line, then you may try the following:
console.log('ab\na'.match(/.+/g))
["ab", "a"]
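To see where each match comes from, it helps to print the match positions. This is a small sketch using String.prototype.matchAll (ES2020) on the string from the question; found and withDotAll are illustrative names.

```javascript
const text = 'a\na';

// Pair every match with the index at which it was found.
const found = [...text.matchAll(/.*/g)].map((m) => [m.index, m[0]]);
console.log(found);
// [ [ 0, 'a' ], [ 1, '' ], [ 2, 'a' ], [ 3, '' ] ]
// index 1 is the position just before the newline, index 3 is the end of the string.

// With the s (dotAll) flag the dot also matches the newline.
const withDotAll = text.match(/.*/gs);
console.log(withDotAll); // [ 'a\na', '' ]
```

There is no empty match at index 0 or 2 because a non-empty match is found at those positions first.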
Let's say if there are empty strings, why isn't there one at beginning
and at the end...
.* is greedy. It swallows a complete line at once. By a line I mean everything before a line break. When it reaches the end of a line, it matches again, with zero length, because of the star quantifier.
If you want 4, you may add ? to the star quantifier to make it lazy (.*?), but this regex gives different results in different flavors because of the way they handle zero-length matches.
You can try .*? with both PCRE and JS engines in regex101 and see the differences.
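For the JavaScript side only, the lazy variant can be checked directly. This sketch makes no claim about PCRE or other flavors; the variable names are illustrative.

```javascript
const text = 'a\na';

// Lazy star: every attempt is satisfied with zero characters.
const lazy = text.match(/.*?/g);
console.log(lazy); // [ '', '', '', '' ]

// Greedy star, for comparison.
const greedy = text.match(/.*/g);
console.log(greedy); // [ 'a', '', 'a', '' ]
```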
Question:
You may ask why the engine tries to find a match at the end of a line when the whole thing is already matched.
Answer:
It's because the end of a line and the end of the string are positions in their own right, so not everything has been matched yet. One position is left that still has a chance to match, and the star quantifier matches it.
This remaining position is the end of the line here, which is also where $ matches when the m flag is on. A . doesn't match at this position, but .* or .*? does, because they also match zero-length positions, as does any pattern that can match nothing, like \d*, \D*, a* or b?
The star operator * means there can be any number of occurrences (even 0 occurrences). With the expression used, an empty string can be a match. I'm not sure what you are looking for, but maybe the + operator (1 or more occurrences) would be better?
To add some more info: regex quantifiers are greedy by default (in most flavors you can override this behaviour), so the pattern will pick as much of the text as it can. In this case, it will pick the a, because it matches the regex, so "\na" is still left. "\n" does not match ".", so the only available match is the empty string. Then we process the next line, and again we can match an "a". After this, only the empty string matches the regex.
* Matches the preceding expression 0 or more times.
. matches any single character except the newline character.
That is what the official documentation says about . and *. So I guess the array you received is something like this:
[ the first "any character" of the first line, following "nothing", the first "any character" of the second line, following "nothing"]
And the new-line character is just ignored
I'm using a regular expression to repeatedly match the first and second part of a section. It does this fine.
I also want it to optionally capture the last instance of something in the first part which is exactly three digits. I don't need the value of that sub-part. I only need to know it is there. So, I use the first and third group from the match and test if the second group is undefined or not.
The problem I'm having is in JavaScript mode the second group result is always undefined.
See this at regex101.com.
After seeing it's not working in JavaScript, change to any of the other modes (Golang, Python...) and the first match will provide '123' for the second group. That's what I'd like it to do in JavaScript, but, it doesn't.
So, why doesn't it do the same in JavaScript? Is there a way I can make a regular expression in JavaScript to produce the desired results?
Try this out in JavaScript:
var regex = /((?:([0-9]{3})|[^,])+?)(?:,([^\.]*))?(?:\.|$)/g
, string = 'some 123 thing, to test this with. which shows, not working in JS only'
console.log(regex.exec(string))
/* prints this:
[ 'some 123 thing, to test this with.',
'some 123 thing',
undefined, // <--------------------- I want this to be '123'
' to test this with',
index: 0,
input: 'the whole string...'
] */
console.log(regex.exec(string))
/* prints this:
[ ' which shows, not working in JS only',
' which shows',
undefined, // this one is correct
' not working in JS only',
index: 34,
input: 'the whole string...'
] */
I realize that capturing group only provides the last instance of matching that spot. That's okay because I only want to know it was in there at least once. I'm testing whether the match result's second group exists or is undefined.
This is a simplified example. The basic format is text which is divided by something, in this example, a period (.). Then, the subsection is divided by something else, in this example, a comma. And the first part may have something special in it, in this example, a three digit number.
What I want to do:
var match = regex.exec(string)
if (match) {
var first = match[1]
, second = match[3]
, isSpecial = match[2] !== undefined
}
With each iteration of the repeated non-capturing group, the contents of the capturing group inside it (Group 2) are reset and repopulated; this is how ECMAScript regex works. Your regex would work if the 3-digit chunk appeared right before the comma: try testing it with the string some 123, to test this with.
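The reset behavior can be reproduced with a much smaller pattern. The sketch below first uses a toy regex and then the regex from the question; the variable names are illustrative.

```javascript
// Toy case: the capture from the first iteration is cleared by the second one.
const toyAb = /(?:(a)|b)+/.exec('ab'); // [ 'ab', undefined ]
const toyBa = /(?:(a)|b)+/.exec('ba'); // [ 'ba', 'a' ]

// The regex from the question (without the g flag, so each exec starts at 0).
const original = /((?:([0-9]{3})|[^,])+?)(?:,([^\.]*))?(?:\.|$)/;

// Digits followed by more text: later iterations clear Group 2.
const digitsInMiddle = original.exec('some 123 thing, to test this with.');
// Digits directly before the comma: they are matched by the last iteration.
const digitsAtEnd = original.exec('some 123, to test this with.');

console.log(toyAb[1], toyBa[1]);                // undefined a
console.log(digitsInMiddle[2], digitsAtEnd[2]); // undefined 123
```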
You need to move the second capturing group outside of the repeated non-capturing group, like:
/([^,]*?(?:([0-9]{3})[^,]*)?)(?:,([^.]*))?(?:\.|$)/g
  ^^^^^^^^^^^^^^^^^^^^^^^^^^
See the regex demo.
Now, the first capturing group pattern matches:
[^,]*? - any 0+ chars other than , as few as possible
(?:([0-9]{3})[^,]*)? - one or zero occurrences of:
([0-9]{3}) - Group 2 capturing 3 digits
[^,]* - zero or more chars other than , as many as possible.
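Put together with the loop the question describes, the reworked regex might be used as follows. This is a sketch: it uses matchAll instead of repeated exec calls, and it filters out the empty match that the pattern produces at the very end of the string; sectionRegex and sections are illustrative names.

```javascript
const sectionRegex = /([^,]*?(?:([0-9]{3})[^,]*)?)(?:,([^.]*))?(?:\.|$)/g;
const text = 'some 123 thing, to test this with. which shows, not working in JS only';

const sections = [...text.matchAll(sectionRegex)]
  .filter((m) => m[0] !== '') // drop the zero-length match at the end of the input
  .map((m) => ({
    first: m[1],
    second: m[3],
    isSpecial: m[2] !== undefined, // Group 2 is set only when 3 digits were found
  }));

console.log(sections);
// [ { first: 'some 123 thing', second: ' to test this with', isSpecial: true },
//   { first: ' which shows', second: ' not working in JS only', isSpecial: false } ]
```

matchAll also avoids getting stuck on the zero-length match, which a plain while loop around exec would have to handle by advancing lastIndex manually.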
I want to find strings that contain words in a particular order, allowing non-standard characters in between the words but excluding a particular word or symbol.
I'm using javascript's replace function to find all instances and put into an array.
So, I want select...from, with anything except 'from' in between the words. Or I can separate select...from from select...from (, as long as I exclude nesting. I think the answer is the same for both, i.e. how do I write: find x and not y within the same regexp?
From the internet, I feel this should work: /\bselect\b^(?!from).*\bfrom\b/gi but this finds no matches.
This works to find all select...from: /\bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b/gi but modifying it to exclude the parenthesis "(" at the end prevents any matches: /\bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b\s*^\(/gi
Can anyone tell me how to exclude words and symbols within this regexp?
Many thanks
Emma
Edit: partial string input:
left outer join [stage].[db].[table14] o on p.Project_id = o.project_id
left outer join
(
select
different_id
,sum(costs) - ( sum(brushes) + sum(carpets) + sum(fabric) + sum(other) + sum(chairs)+ sum(apples) ) as overallNumber
from
(
select ace from [stage].db.[table18] J
Javascript:
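var sequel = stringInputAsAbove; // the partial input shown above
var selects = []; // collects the starting offset of each match
// With no capture groups in the regex, the callback's second argument is the match offset.
var tst = sequel.replace(/\bselect\b[\s\S]*?\bfrom\b/gi, function(a,b) { console.log('match: '+a); selects.push(b); return a; });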
console.log(selects);
Console.log(selects) should print an array of numbers, where each number is the starting character of a select...from. This works for the second regexp I gave in my info, printing: [95, 251]. Your \s\S variation does the same, #stribizhev.
The first example ^(?!from).* should do likewise but returns [].
The third example \s*^\( should return 251 only but returns []. However I have just noticed that the positive expression \s*\( does give 95, so some progress! It's the negatives I'm getting wrong.
Your \bselect\b^(?!from).*\bfrom\b regex doesn't work as expected because:
^ here means the beginning of a line, not negation of the next part, so
\bselect\b^ means the word select followed by the beginning of a
line. After removing ^ the regex starts to match something
(DEMO), but it is still not correct.
In multiline text, .* without modification will not match a newline,
so the regex will match only a select...from on a single line. If you
change it to (.|\n)* (as a simple example) it will match across
lines, but it is still not correct.
* is a greedy quantifier, so it will match as much as possible.
If you use the reluctant quantifier *?, the regex will match up to the first
occurrence of the word from, and it will start to return a relatively
correct result.
\bselect\b(?!from) means: match a separate word select which is not
directly followed by from. But because of the \b after select, the next
character can never be the f of from, so (?!from) always succeeds
and is redundant.
In effect you will get regex very similar to what Stribizhev gave you: \bselect\b(.|\n)*?\bfrom\b
In the third expression you make the same mistake: \bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b\s*^\( uses ^ as (I assume) a negation, not the beginning of a line. Remove ^ and you will again get a relatively valid result (a match from select through from to the opening parenthesis ( ).
Your second regex works similarly to \bselect\b(.|\n)*?\bfrom\b or \bselect\b[\s\S]*?\bfrom\b.
I wrote "relatively valid result" because I also think that parsing SQL with regex can be very complicated, so I am not sure it will work in every case.
You can also try to use positive lookahead to match just position in text, like:
(?=\bselect\b(?:.|\n)*?\bfrom\b)
DEMO - the () was added to the regex just to return the starting index of the match in groups, so it is easier to check its validity
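Both ways of collecting the starting offsets can be compared side by side. The sketch below uses a shortened, made-up version of the SQL from the question, and [\s\S] in place of (?:.|\n); sql, viaReplace and viaLookahead are illustrative names.

```javascript
// Shortened stand-in for the input in the question.
const sql = [
  'left outer join',
  '(',
  'select',
  'different_id',
  ',sum(costs) as overallNumber',
  'from',
  '(',
  'select ace from [stage].db.[table18] J',
].join('\n');

// 1) replace() with a callback: with no capture groups, the 2nd argument is the offset.
const viaReplace = [];
sql.replace(/\bselect\b[\s\S]*?\bfrom\b/gi, (match, offset) => {
  viaReplace.push(offset);
  return match; // leave the text unchanged
});

// 2) A lookahead-only regex: every match is empty, only its position matters.
const viaLookahead = [...sql.matchAll(/(?=\bselect\b[\s\S]*?\bfrom\b)/gi)]
  .map((m) => m.index);

console.log(viaReplace, viaLookahead); // both hold the offsets of the two select keywords
```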
Negation in regex
We use ^ as negation in a character class; for example, [^a-z] means match anything but a letter, so it will match a digit, symbol, whitespace, etc., but not a letter in the range a to z (Look here). But this negation works at the level of a single character. If you use [^from] it will prevent the regex from matching the characters f, r, o and m (demo). Also, [^from]{4} will avoid matching from, but also form, morf, etc.
To exclude a whole word from matching, you need to use a negative lookahead, like (?!from), which fails to match if the chosen word from follows the given position. To avoid matching a whole line containing from you could use ^(?!.*from.*).+$ (demo).
However, in your case you don't need this construction, because if you replace the greedy .*\bfrom with .*?\bfrom it will match up to the first occurrence of this word. What's more, it would cause problems. Take a look at this regex: it will not match anything, because (?![\s\S]*from[\s\S]*) is not restricted by anything, so it will match only if there is no from after select, but we also want to match from! In effect this regex tries to match and exclude from at once, and fails. So the (?!.*word.*) construction works much better for excluding a line that contains a given word.
So what do we do if we don't want to match a word within a fragment of a match? I think select\b([^f]|f(?!rom))*?\bfrom\b is a good solution. With ([^f]|f(?!rom))*? it will match everything between select and from that does not itself contain from, while still allowing the closing from to match.
But if you would like to match only a select...from not followed by (, then it is a good idea to use something like (?!\(). However, in your regex (multiline, using (.|\n)*? or [\s\S]*?) it will cause a match up to the next select...from part, because the reluctant quantifier will extend to a later from in order to make the whole regex match. In my opinion, a good solution would again be to use:
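The difference between the two kinds of negation is easy to see on tiny inputs. This is a sketch with made-up strings and variable names.

```javascript
// Character-class negation works per character: f, r, o and m are all excluded.
const perChar = 'form 42'.match(/[^from]/g);
console.log(perChar); // [ ' ', '4', '2' ]

// So [^from]{4} rejects any 4 characters built from those letters, not just "from".
const fourChars = /^[^from]{4}$/;
console.log(fourChars.test('from'), fourChars.test('morf'), fourChars.test('abcd'));
// false false true

// Negative lookahead excludes the whole word: keep only lines without "from".
const lines = 'select a\nfrom b\nwhere c'.match(/^(?!.*from.*).+$/gm);
console.log(lines); // [ 'select a', 'where c' ]
```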
select\b([^f]|f(?!rom))*?\bfrom\b(?!\s*?\()
which will not run over into an additional select...from and will not match if there is a ( after select...from - check it here
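The effect can be compared against the plain lazy version on a shortened, made-up version of the input from the question. In this sketch the group is written as non-capturing and the i flag from the question is added; sql, strict and loose are illustrative names.

```javascript
// Shortened stand-in for the input in the question.
const sql = [
  'left outer join',
  '(',
  'select',
  'different_id',
  ',sum(costs) as overallNumber',
  'from',
  '(',
  'select ace from [stage].db.[table18] J',
].join('\n');

// Cannot run past a "from": the first select...from is rejected because ( follows it.
const strict = sql.match(/select\b(?:[^f]|f(?!rom))*?\bfrom\b(?!\s*\()/gi);
console.log(strict); // [ 'select ace from' ]

// Plain lazy quantifier: it stretches to the second from to satisfy the lookahead.
const loose = sql.match(/\bselect\b[\s\S]*?\bfrom\b(?!\s*\()/gi);
console.log(loose.length); // 1, a single match spanning both select...from parts
```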