capture group with optional second capture group containing first group pattern [duplicate]

capture group with optional second capture group containing first group pattern [duplicate] - javascript

This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 2 years ago.
I have this gigantic ugly string:
J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM
J0000010: Project name: E:\foo.pf
J0000011: Job name: MBiek Direct Mail Test
J0000020: Document 1 - Completed successfully
I'm trying to extract pieces from it using regex. In this case, I want to grab everything after Project Name up to the part where it says J0000011: (the 11 is going to be a different number every time).
Here's the regex I've been playing with:
Project name:\s+(.*)\s+J[0-9]{7}:
The problem is that it doesn't stop until it hits the J0000020: at the end.
How do I make the regex stop at the first occurrence of J[0-9]{7}?

Make .* non-greedy by adding '?' after it:
Project name:\s+(.*?)\s+J[0-9]{7}:

Using non-greedy quantifiers here is probably the best solution, also because it is more efficient than the greedy alternative: Greedy matches generally go as far as they can (here, until the end of the text!) and then trace back character after character to try and match the part coming afterwards.
However, consider using a negative character class instead:
Project name:\s+(\S*)\s+J[0-9]{7}:
\S means “everything except a whitespace and this is exactly what you want.

Well, ".*" is a greedy selector. You make it non-greedy by using ".*?" When using the latter construct, the regex engine will, at every step it matches text into the "." attempt to match whatever make come after the ".*?". This means that if for instance nothing comes after the ".*?", then it matches nothing.
Here's what I used. s contains your original string. This code is .NET specific, but most flavors of regex will have something similar.
string m = Regex.Match(s, #"Project name: (?<name>.*?) J\d+").Groups["name"].Value;

I would also recommend you experiment with regular expressions using "Expresso" - it's a utility a great (and free) utility for regex editing and testing.
One of its upsides is that its UI exposes a lot of regex functionality that people unexprienced with regex might not be familiar with, in a way that it would be easy for them to learn these new concepts.
For example, when building your regex using the UI, and choosing "*", you have the ability to check the checkbox "As few as possible" and see the resulting regex, as well as test its behavior, even if you were unfamiliar with non-greedy expressions before.
Available for download at their site:
http://www.ultrapico.com/Expresso.htm
Express download:
http://www.ultrapico.com/ExpressoDownload.htm

(Project name:\s+[A-Z]:(?:\\w+)+.[a-zA-Z]+\s+J[0-9]{7})(?=:)
This will work for you.
Adding (?:\\w+)+.[a-zA-Z]+ will be more restrictive instead of .*

Related

Ensure that all lines in a file match a known RegEx pattern, without relying on \A \z [duplicate]

This question already has an answer here:
\z PCRE equivalent in JavaScript regex to match all markdown list items
(1 answer)
Closed 6 months ago.
In a multi-line file I want to check that all lines match one or more complex patterns (which may each cover multiple lines).
I can make this work just fine with this RegEx
\A(patternA\n|patternB\n)*\z
(For some languages it works with \Z instead of \z)
It will match:
patternA
patternB
patternA
and will reject
patternA
patternC
patternB
But it does not work in JavaScript (where I need to execute the test) because JavaScript RegEx apparently does not support the anchors \A (start of file) or \z (end of file). And if those anchors are left off then I just get back a set of matches (the first and third lines in my second example above), without the information that there are also non-matches.
At the moment, the only thing I can think of is to run the RegEx check without those two anchors, and then check that the sum of the length of all the matches equals the length of the overall text, but this seems rather clunky.
Is there a simple/elegant way to implement this check in JavaScript RegEx?

I now think the best solution may be to invert logic so that it searches for anything that does not match the required patterns, and passes the check if no match is found. The following RegEx, running under JavaScript, for example, matches the second example from my original post and not the first:
^(?!patternA$|patternB$|$)
The last option ($) is needed as otherwise it always matches the (empty) line following the last newline.

If the individual patterns are complex, it may be both easier on the regex engine and simpler to understand to write imperative code that loops through each of the lines and checks for each of the patterns in order.
This will let each of the patterns stay one regex. They do not have to be baked into the "whole file" pattern when one is updated, added or removed. The different patterns can also use different regex flags and so on.

Regular Expression first coincidense [duplicate]

This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 2 years ago.
I have this gigantic ugly string:
J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM
J0000010: Project name: E:\foo.pf
J0000011: Job name: MBiek Direct Mail Test
J0000020: Document 1 - Completed successfully
I'm trying to extract pieces from it using regex. In this case, I want to grab everything after Project Name up to the part where it says J0000011: (the 11 is going to be a different number every time).
Here's the regex I've been playing with:
Project name:\s+(.*)\s+J[0-9]{7}:
The problem is that it doesn't stop until it hits the J0000020: at the end.
How do I make the regex stop at the first occurrence of J[0-9]{7}?

Make .* non-greedy by adding '?' after it:
Project name:\s+(.*?)\s+J[0-9]{7}:

Using non-greedy quantifiers here is probably the best solution, also because it is more efficient than the greedy alternative: Greedy matches generally go as far as they can (here, until the end of the text!) and then trace back character after character to try and match the part coming afterwards.
However, consider using a negative character class instead:
Project name:\s+(\S*)\s+J[0-9]{7}:
\S means “everything except a whitespace and this is exactly what you want.

Well, ".*" is a greedy selector. You make it non-greedy by using ".*?" When using the latter construct, the regex engine will, at every step it matches text into the "." attempt to match whatever make come after the ".*?". This means that if for instance nothing comes after the ".*?", then it matches nothing.
Here's what I used. s contains your original string. This code is .NET specific, but most flavors of regex will have something similar.
string m = Regex.Match(s, #"Project name: (?<name>.*?) J\d+").Groups["name"].Value;

I would also recommend you experiment with regular expressions using "Expresso" - it's a utility a great (and free) utility for regex editing and testing.
One of its upsides is that its UI exposes a lot of regex functionality that people unexprienced with regex might not be familiar with, in a way that it would be easy for them to learn these new concepts.
For example, when building your regex using the UI, and choosing "*", you have the ability to check the checkbox "As few as possible" and see the resulting regex, as well as test its behavior, even if you were unfamiliar with non-greedy expressions before.
Available for download at their site:
http://www.ultrapico.com/Expresso.htm
Express download:
http://www.ultrapico.com/ExpressoDownload.htm

(Project name:\s+[A-Z]:(?:\\w+)+.[a-zA-Z]+\s+J[0-9]{7})(?=:)
This will work for you.
Adding (?:\\w+)+.[a-zA-Z]+ will be more restrictive instead of .*

Regex - Validate that the local part of the email is not ending with a dot while only allowing certain characters without using a lookbehind

I was using a lookbehind to check for a dot before the # but just realized not all browsers are supporting lookbehinds. It works perfect in Chrome but fails in Firefox and IE.
This is what I came up with but it certainly is messy
^([a-zA-Z0-9&^*%#~{}=+?`_-]\.?)*[a-zA-Z0-9&^*%#~{}=+?`_-]#([a-zA-Z0-9]+\.)+[a-zA-Z]$
Is there a simpler and/or more elegant way to do this? I don't think I can negate the dot (^.) because I'm only allowing certain characters to be present in the local part.

This ([a-zA-Z0-9&^*%#~{}=+?`_-].?)*[a-zA-Z0-9&^*%#~{}=+?`_-] part is not messy, but inefficient, because the * quantifies a group containing an obligatory part, [...], and an optional \.?. Instead of (ab?)*a, you may use a+(?:ba+)* that will make matching linear and swift, in your case, [a-zA-Z0-9&^*%#~{}=+?`_-]+(?:.[a-zA-Z0-9&^*%#~{}=+?`_-]+)*.
More, [a-zA-Z0-9_] equals \w in JS regex, you may use this to shorten the pattern.
Besides, the last [a-zA-Z]$ pattern only matches a single letter, you most probably need [a-zA-Z]{2}$ there, as TLDs consist of 2+ letters.
So, you may use
^[\w&^*%#~{}=+?`-]+(?:\.[\w&^*%#~{}=+?`-]+)*#(?:[a-zA-Z0-9]+\.)+[a-zA-Z]{2,}$
See the regex demo.

Javascript Find a string between two strings, but keep each occurence of the match [duplicate]

This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 2 years ago.
I have this gigantic ugly string:
J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM
J0000010: Project name: E:\foo.pf
J0000011: Job name: MBiek Direct Mail Test
J0000020: Document 1 - Completed successfully
I'm trying to extract pieces from it using regex. In this case, I want to grab everything after Project Name up to the part where it says J0000011: (the 11 is going to be a different number every time).
Here's the regex I've been playing with:
Project name:\s+(.*)\s+J[0-9]{7}:
The problem is that it doesn't stop until it hits the J0000020: at the end.
How do I make the regex stop at the first occurrence of J[0-9]{7}?

Make .* non-greedy by adding '?' after it:
Project name:\s+(.*?)\s+J[0-9]{7}:

Using non-greedy quantifiers here is probably the best solution, also because it is more efficient than the greedy alternative: Greedy matches generally go as far as they can (here, until the end of the text!) and then trace back character after character to try and match the part coming afterwards.
However, consider using a negative character class instead:
Project name:\s+(\S*)\s+J[0-9]{7}:
\S means “everything except a whitespace and this is exactly what you want.

Well, ".*" is a greedy selector. You make it non-greedy by using ".*?" When using the latter construct, the regex engine will, at every step it matches text into the "." attempt to match whatever make come after the ".*?". This means that if for instance nothing comes after the ".*?", then it matches nothing.
Here's what I used. s contains your original string. This code is .NET specific, but most flavors of regex will have something similar.
string m = Regex.Match(s, #"Project name: (?<name>.*?) J\d+").Groups["name"].Value;

I would also recommend you experiment with regular expressions using "Expresso" - it's a utility a great (and free) utility for regex editing and testing.
One of its upsides is that its UI exposes a lot of regex functionality that people unexprienced with regex might not be familiar with, in a way that it would be easy for them to learn these new concepts.
For example, when building your regex using the UI, and choosing "*", you have the ability to check the checkbox "As few as possible" and see the resulting regex, as well as test its behavior, even if you were unfamiliar with non-greedy expressions before.
Available for download at their site:
http://www.ultrapico.com/Expresso.htm
Express download:
http://www.ultrapico.com/ExpressoDownload.htm

(Project name:\s+[A-Z]:(?:\\w+)+.[a-zA-Z]+\s+J[0-9]{7})(?=:)
This will work for you.
Adding (?:\\w+)+.[a-zA-Z]+ will be more restrictive instead of .*

help making a "universal" regex Javascript compatible

I found a very nice URL regex matcher on this site: http://daringfireball.net/2010/07/improved_regex_for_matching_urls . It states that it's free to use and that it's cross language compatible (including Javascript). First of all, I have to escape some of the slashes to get it to compile at all. When I do that, it works fine on Rubular.com (where I generally test regexes), with the strange side effect that each match has 5 fields: 1 is the url, and the extra 4 are empty. When I put this in JS, I get the error "Invalid Group". I am using Node.js if that makes any difference, but I wish I could understand that error. I'd like to cut back on the unnecessary empty match fields, but I don't even know where to begin diagnosing this beast. This is what I had after escaping:
(?xi)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’] ))

Actually, you don't need the first capturing group either; it's the same as the whole match in this case, and that can always be accessed via $&. You can change all the capturing groups to non-capturing by adding ?: after the opening parens:
/\b(?:(?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\((?:[^\s()<>]+|(\(?:[^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]+|(?:\(?:[^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i
That "invalid group" error is due to the inline modifiers (i.e., (?xi)) which, as #kirilloid observed, are not supported in JavaScript. Jon Gruber (the regex's author) was mistaken about that, as he was about JS supporting free-spacing mode.
Just FYI, the reason you had to escape the slashes is because you were using regex-literal notation, the most common form of which uses the forward-slash as the regex delimiter. In other words, it's the language (Ruby or JavaScript) that requires you to escape that particular character, not the regex. Some languages let you choose different regex delimiters, while others don't support regex literals at all.
But these are all language issues, not regex issues; the regex itself appears to work as advertised.

Seemes, that you copied it wrong.
http://www.regular-expressions.info/javascript.html
No mode modifiers to set matching options within the regular expression.
No regular expression comments
I.e. (?xi) at the beginning is useless.
x is useless at all for compacted RegExp
i can be replaced with flag
All these result in:
/\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i
Tested and working in Google Chrome => should work in Node.js

Develop Reference

JavaScript is the programming language of the Web.

capture group with optional second capture group containing first group pattern [duplicate] - javascript

Make .* non-greedy by adding '?' after it: Project name:\s+(.*?)\s+J[0-9]{7}:

(Project name:\s+[A-Z]:(?:\\w+)+.[a-zA-Z]+\s+J[0-9]{7})(?=:) This will work for you. Adding (?:\\w+)+.[a-zA-Z]+ will be more restrictive instead of .*

Related

Ensure that all lines in a file match a known RegEx pattern, without relying on \A \z [duplicate]

Regular Expression first coincidense [duplicate]

Regex - Validate that the local part of the email is not ending with a dot while only allowing certain characters without using a lookbehind

Javascript Find a string between two strings, but keep each occurence of the match [duplicate]

help making a "universal" regex Javascript compatible

Categories

Resources