JavaScript / Using system generated regex for validation - javascript

We want to use a regex to validate a document structure. To do this, we simplify both the document and the regex. The regex is generated from the schema that is used for the validation.
The application is completely client-side and written in JavaScript.
A simple example is this regex:
regex1 = new RegExp(/~(A{1}B?C?(D*|E*|F*|G*)+){1}~/g)
That means the document can have this structure:
A
-B
-D
-D
-D
-D
-D
So the document structure is parsed to ~ABDDDDD~
Now I want to validate if I can add "A" to the end which would result in this string: ~ABDDDDDA~
This no longer matches the regex:
"~ABDDDDDA~".match(regex1)
This works quite well, but the document structure can grow and look like this: ~ABDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD~
A matching value is matched quite fast, but if the value then becomes:
~ABDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDA~
It takes a very long time; most of the time I just close the browser and reopen it.
Does anyone have ideas how to solve it?
Thanks!
UPDATE
The regex should also cover more; the structure can be quite dynamic. I have not used a regex generator; this example was produced by a self-developed script and is just an example.
In this case there is one root element A, then an optional B and an optional C, and then any number of D, E, F, G in no particular order. But at least one!
So it should be valid for:
"~ABDDDDDFEG~"
"~AGGGGGEGGD~"
"~ABCDEFG~"
"~ABCDDDDDDDDDDDDDDDEFGGGGGG~"
Additionally, it is possible that E is limited to 0-5 occurrences.
As soon as I work with the match either(A | B), there are real performance issues in all browsers. (IE, Chrome, Firefox)
Any ideas? Are there any alternatives to "match either(A | B)" with better performance?

The resulting regex should be as close as possible to:
~AB?C?[DEFG]*A?~
There are a lot of simplifications to do in your regex generator to get rid of the following points:
{1}: is redundant, you can remove it everywhere
(A*|B*)+: matches exactly the same strings as [AB]*, and its nested quantifiers are what cause the catastrophic backtracking on input that does not match
Here is a Regex101: https://regex101.com/r/Lc6Fx8/1
Also, if you want help fixing your Regex generator, you should post some info about it.
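A minimal sketch of what the simplified pattern could look like in JavaScript. The + (instead of *) for the "at least one" rule, the lookahead that caps E at five occurrences, the ^ and $ anchors, and the names structure and isValid are assumptions layered on top of the suggestion above, not the output of the original generator.

```javascript
// Hypothetical simplified pattern (an assumption, not the generator's real output).
// [DEFG]+ replaces the nested (D*|E*|F*|G*)+ group, so there is only one way
// to consume each character and the engine cannot backtrack exponentially.
// The negative lookahead rejects documents with six or more E's (limit 0-5).
const structure = /^~AB?C?(?!(?:[^E~]*E){6})[DEFG]+~$/;

function isValid(doc) {
  return structure.test(doc);
}

const longDoc = "~AB" + "D".repeat(50) + "~";
const longDocPlusA = "~AB" + "D".repeat(50) + "A~";

console.log(isValid("~ABDDDDDFEG~")); // true
console.log(isValid(longDoc));        // true
console.log(isValid(longDocPlusA));   // false, and it fails immediately
console.log(isValid("~ABEEEEEED~"));  // false, six E's exceed the limit
```

Because no g flag is used, test() keeps no state between calls, so the same RegExp object can safely be reused for every validation.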

Related

capture group with optional second capture group containing first group pattern [duplicate]

This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 2 years ago.
I have this gigantic ugly string:
J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM
J0000010: Project name: E:\foo.pf
J0000011: Job name: MBiek Direct Mail Test
J0000020: Document 1 - Completed successfully
I'm trying to extract pieces from it using regex. In this case, I want to grab everything after Project Name up to the part where it says J0000011: (the 11 is going to be a different number every time).
Here's the regex I've been playing with:
Project name:\s+(.*)\s+J[0-9]{7}:
The problem is that it doesn't stop until it hits the J0000020: at the end.
How do I make the regex stop at the first occurrence of J[0-9]{7}?
Make .* non-greedy by adding '?' after it:
Project name:\s+(.*?)\s+J[0-9]{7}:
Using non-greedy quantifiers here is probably the best solution, also because it is more efficient than the greedy alternative: greedy matches generally go as far as they can (here, until the end of the text!) and then backtrack character by character to try to match the part that comes afterwards.
However, consider using a negated character class instead:
Project name:\s+(\S*)\s+J[0-9]{7}:
\S means "everything except whitespace", and this is exactly what you want.
Well, ".*" is a greedy selector. You make it non-greedy by using ".*?". With the latter construct, the regex engine will, at every step where it matches text into the ".", first attempt to match whatever may come after the ".*?". This means that if, for instance, nothing comes after the ".*?", then it matches nothing.
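The difference between the greedy and the non-greedy version can be sketched in JavaScript as follows. It assumes the log arrives as a single line with the entries separated by spaces (in JavaScript, "." does not match newlines unless the s flag is set); the names log, greedy and lazy are illustrative only.

```javascript
// The log entries from the question, assumed here to be joined on one line.
const log = [
  "J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM",
  "J0000010: Project name: E:\\foo.pf",
  "J0000011: Job name: MBiek Direct Mail Test",
  "J0000020: Document 1 - Completed successfully",
].join(" ");

const greedy = /Project name:\s+(.*)\s+J[0-9]{7}:/;
const lazy = /Project name:\s+(.*?)\s+J[0-9]{7}:/;

// Greedy: .* runs to the end of the text and backtracks to the LAST J-number.
const greedyName = greedy.exec(log)[1];
// Lazy: .*? grows one character at a time and stops at the FIRST J-number.
const lazyName = lazy.exec(log)[1];

console.log(greedyName); // E:\foo.pf J0000011: Job name: MBiek Direct Mail Test
console.log(lazyName);   // E:\foo.pf
```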
Here's what I used. s contains your original string. This code is .NET specific, but most flavors of regex will have something similar.
string m = Regex.Match(s, @"Project name: (?<name>.*?) J\d+").Groups["name"].Value;
I would also recommend you experiment with regular expressions using "Expresso" - it's a great (and free) utility for regex editing and testing.
One of its upsides is that its UI exposes a lot of regex functionality that people inexperienced with regex might not be familiar with, in a way that makes it easy for them to learn these new concepts.
For example, when building your regex using the UI, and choosing "*", you have the ability to check the checkbox "As few as possible" and see the resulting regex, as well as test its behavior, even if you were unfamiliar with non-greedy expressions before.
Available for download at their site:
http://www.ultrapico.com/Expresso.htm
Express download:
http://www.ultrapico.com/ExpressoDownload.htm
(Project name:\s+[A-Z]:(?:\\\w+)+\.[a-zA-Z]+\s+J[0-9]{7})(?=:)
This will work for you.
Using (?:\\\w+)+\.[a-zA-Z]+ instead of .* makes the pattern more restrictive.

Regex - Removing parts of URL path

I am useless at Regex and I want to remove parts of a URL that are not always consistent.
The URL might be:
www.test.com /en/ restOfPath
or
www.test.com /en/en_gb/ restOfPath
Then depending on the country values might change to:
www.test.com /es/ restOfPath
or
www.test.com /es/es_es/ restOfPath
I am therefore looking to always remove the parts in bold (the language and locale segments, such as /en/ or /en/en_gb/), so that I can split the remainder of the path and create a logical naming scheme that is language/location agnostic.
I am doing this as a workaround to build out a data layer until the client can implement it properly when they launch their new website. I have managed to build an if/else statement as a workaround, which is a bit clunky, but I would like a cleaner solution.
Probably this will help you
(?:\/([a-z]{2})(?:\/([a-z]{2}_[A-Z]{2}))?)
This pattern finds the first / followed by two lowercase letters, optionally followed by another / with an aa_AA segment.
I put code samples for you on regex101
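A rough sketch of how such a pattern could be applied in JavaScript. The function name stripLocale, the ^ anchor, the trailing lookahead and the i flag are assumptions: the i flag is added because the URLs in the question use a lowercase region (en_gb), which [A-Z]{2} alone would not match.

```javascript
// Hypothetical helper: removes a leading /xx or /xx/xx_yy locale prefix.
// (?=\/|$) makes sure the two-letter segment is a whole path segment,
// so a path such as /about/us is left untouched.
const localePrefix = /^\/[a-z]{2}(?:\/[a-z]{2}_[a-z]{2})?(?=\/|$)/i;

function stripLocale(path) {
  return path.replace(localePrefix, "");
}

console.log(stripLocale("/en/restOfPath"));       // /restOfPath
console.log(stripLocale("/es/es_es/restOfPath")); // /restOfPath
console.log(stripLocale("/about/us"));            // /about/us
```

The remaining path can then be split on "/" to build the language-agnostic names.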
I believe this is what you're after:
\/.*(?=\/.*?)
https://regex101.com/r/OZIseI/4
It uses a positive look ahead to exclude the last / from the match

Is there a length limitation when using replace method of a string?

I have a big string (1116902 char length) that I want to process with a regex (pretty simple one). I get a response from a soap server that is encoded in base64. So I just get the result between the appropriate xml tags and then decode the response.
This works for a small request. But when I get a big response back, the callback function of the replace() method is never called. I have tested the string on the regex101 website and it can find the result. So I wonder if there is a limitation in my JavaScript engine. I'm working on a Wakanda Server V10 that uses WebKit as its JavaScript engine. I cannot provide the string because it contains some enterprise information.
Here is my regex : /xsd:base64Binary">((.|\n)*?)<\/responseData>/
I thought there might be a special character that is not covered by the ((.|\n)*?) group. But then why does regex101 find the result? (So maybe it is the JavaScript engine.)
Can anybody help me?
Thanks
If you can guarantee that there are no tags between your start and end delimiter, which sounds like it might be the case, you could just change your RE to
/xsd:base64Binary">([^<]*)<\/responseData>/
which shouldn't require any backtracking and might work for you.
[^<] simply means everything but the < character. Since there shouldn't be any tags between the open and closing tags of your section (at least that's what I understand) that will accept everything until you hit your closing tag. The important thing is that the RE engine can tell immediately whether something matches that or not, so no branching or backtracking is required.
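As an illustration, the negated character class can be used like this in JavaScript. The function name extractBase64, the sample XML wrapper and the generated payload are made up for the sketch; the real response could not be shared.

```javascript
// [^<]* consumes the payload in one pass: every character is either "<" or not,
// so the engine never has to remember alternative paths to backtrack into.
const responsePattern = /xsd:base64Binary">([^<]*)<\/responseData>/;

function extractBase64(xml) {
  const match = responsePattern.exec(xml);
  return match ? match[1] : null;
}

// Hypothetical response with one million characters of Base64 payload.
const payload = "QUJD".repeat(250000);
const xml =
  '<responseData xsi:type="xsd:base64Binary">' + payload + "</responseData>";

console.log(extractBase64(xml).length); // 1000000
```

The captured string can then be passed to whatever Base64 decoder the platform provides.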
