Regex to find if there is only one block of code - javascript

My input is a string, I want to validate that there is only one first-level block of code.
Examples :
{ abc } TRUE
{ a { bc } } TRUE
{ a {{}} } TRUE
{ abc {efg}{hij}} TRUE
{ a b cde }{aa} FALSE
/^\{.*\}$/ matches all five cases. Can you help me find a regex that rejects the last case?
Language is JavaScript.

EDIT: I started writing this answer before JavaScript was specified. I will leave it as is for the record, since it fully explains the regex.
In short: In JavaScript I cannot think of a reliable solution. In other engines there are several options:
1. Recursion (on which I will expand below)
2. Balancing groups (.NET)
For solution 2 (which won't work in JS either), I'll refer you to the example in this question.
Recursive Regex
In Perl, PCRE (e.g. Notepad++, PHP, R) and Matthew Barnett's regex module for Python, you can use:
^({(?:[^{}]++|(?1))*})$
The idea is to match exactly one set of nested braces. Anything more makes the regex fail.
See what matches and fails in the Regex Demo.
Explanation
The ^ anchor asserts that we are at the beginning of the string
The outer parentheses define Group 1 (or Subroutine 1)
{ match the opening brace
(?: ... )* zero or more times, we will...
[^{}]++ match any chars that are not { or }
OR |
(?1) repeat the expression of subroutine 1
} match closing brace
The $ anchor asserts that we are at the end of the string. Therefore,

This is a terrible workaround.
Since this is in Javascript there's not really much to do, but please see the following regex:
/^{([^{}]*|{})*}$/
Where you copy ([^{}]*|{})* and insert it between the last pair of curly brackets (rinse and repeat). Every duplication of this pattern allows another level of nesting between your elements. (This is a workaround for the lack of recursion in JS regex, required to solve nesting problems.)
Online Regex Demo
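The copy-and-insert step can also be done programmatically. The following is only a sketch of that idea: buildSingleBlockRegex and maxDepth are hypothetical names, and the inner alternative is written as a single-character class [^{}] rather than [^{}]* to avoid stacking one quantifier on another.

```javascript
// Hypothetical helper: builds the "unrolled" pattern for a fixed maximum
// nesting depth inside the outer pair of braces.
function buildSingleBlockRegex(maxDepth) {
  // Innermost level: no braces allowed at all.
  let inner = '[^{}]*';
  for (let i = 0; i < maxDepth; i++) {
    // Each pass allows one more level of {...} inside the current level.
    inner = '(?:[^{}]|\\{' + inner + '\\})*';
  }
  return new RegExp('^\\{' + inner + '\\}$');
}

const twoLevels = buildSingleBlockRegex(2);
console.log(twoLevels.test('{ a {{}} }'));       // nested two levels deep
console.log(twoLevels.test('{ a b cde }{aa}'));  // two top-level blocks
```

Input nested deeper than maxDepth is rejected, so the limit has to be chosen for the data at hand.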

In JavaScript what you need to do is strip out all the nested blocks until no nested blocks are left and then check whether there are still multiple blocks:
var r = input.replace(/(['"])(?:(?!\1|\\).|\\.)*\1|\/(?![*/])(?:[^\\/]|\\.)+\/[igm]*|\/\/[^\n]*(?:\n|$)|\/\*(?:[^*]|\*(?!\/))*\*\//gi, '');
if (r.split('{').length != r.split('}').length || r.indexOf('}') < r.indexOf('{')) {
// ERROR
continue;
}
while (r.match(/\{[^}]*\{[^{}]*\}/))
r = r.replace(/(\{[^}]*)\{[^{}]*\}/g, '$1');
if (r.match(/\}.*\{/)) {
// FALSE
} else {
// TRUE
}
Working JSFiddle
Make sure the regex in the while condition and the regex in the replace match the same text; otherwise the loop may never terminate.
Updated to address ERROR cases and remove anything in comments, strings and regex-literals first after Unihedron asked.
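As a minimal sketch, the strip-and-check loop can be wrapped in a function. The name hasSingleTopLevelBlock is hypothetical, the step that removes strings, comments and regex literals is left out for brevity, and both regex objects are built from one source string so the loop condition and the replacement cannot drift apart.

```javascript
// Hypothetical wrapper around the loop from the answer above.
function hasSingleTopLevelBlock(input) {
  const nestedSource = '(\\{[^}]*)\\{[^{}]*\\}';
  const hasNested = new RegExp(nestedSource);         // loop condition
  const stripNested = new RegExp(nestedSource, 'g');  // replacement
  let r = input;
  // Remove innermost nested blocks until none are left.
  while (hasNested.test(r)) {
    r = r.replace(stripNested, '$1');
  }
  // A closing brace followed later by an opening brace means a second block.
  return !/\}.*\{/.test(r);
}

['{ abc }', '{ a { bc } }', '{ a {{}} }', '{ abc {efg}{hij}}', '{ a b cde }{aa}']
  .forEach(function (s) { console.log(s, hasSingleTopLevelBlock(s)); });
```

This sketch assumes the braces are already balanced; the ERROR check from the answer would still be needed in front of it.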

(\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*
Code for brackets


Catastrophic backtracking in regular expression

I am using the regular expression below:
[^\/]*([A-Za-z]+_([A-Za-z]+|-?[A-Z0-9]+(\.[A_Z0-9]+)?|(?:_|:|:-|-[a-zA-Z]+|\.[a-zA-Z]+|[A-Z0-9a-z]+|=|\s|\?|\%|\.|!|#|\*)?)+(?=(,|\/)))+|,[^\/]*
and it reports catastrophic backtracking when I try to match it against this input string:
/w_100/h_500/e_saturation:50,e_tint:red:blue/c_crop,a_100,l_text:Neucha_26_bold:Loremipsum./l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc/1488800313_DSC_0334__3_.JPG_mweubp.jpg
The expected output array of the matching regex should look like this:
[ 'w_100',
'h_500',
'e_saturation:50,e_tint:red:blue',
'c_crop,a_100,l_text:Neucha_26_bold:Loremipsum.',
'l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc' ]
I don't want the image name 1488800313_DSC_0334__3_.JPG_mweubp.jpg to be included in the match.
Is there any way to fix this backtracking in the regular expression, or can you suggest a good regex for my input string?
The problem
You use a lot of alternations when a character class would be more effective. Also, you're getting the catastrophic backtracking due to the following quantifier:
[^\/]*([A-Za-z]+_([A-Za-z]+|-?[A-Z0-9]+(\.[A_Z0-9]+)?|(?:_|:|:-|-[a-zA-Z]+|\.[a-zA-Z]+|[A-Z0-9a-z]+|=|\s|\?|\%|\.|!|#|\*)?)+(?=(,|\/)))+|,[^\/]*
^
It's trying to match any of the alternations you have, but keeps backtracking and never makes it past all your alternations (it's sometimes comparable to an infinite loop). In your case, your regex is so inefficient that it times out. I removed half your pattern and it still takes half a second to complete with almost 200K steps (and that's only half your pattern).
Original Answer
How can it be fixed?
The first step is to fix the quantifier and prevent it from continuously backtracking. This is actually quite easy, just make it possessive: + becomes ++. Changing the quantifier to possessive yields a pattern that takes about 56ms to complete and approx 9K steps (on my computer).
Second step is to improve the efficiency of the pattern. Change your alternations to character classes where possible.
(?:_|:|:-|-[a-zA-Z]+|\.[a-zA-Z]+|[A-Z0-9a-z]+|=|\s|\?|\%|\.|!|#|\*)?
# should instead be
(?::-|[_:-=\s?%.!#*]|[-.][a-zA-Z]+|[A-Z0-9a-z]+)?
It's much shorter, much more concise and less prone to errors.
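One thing worth checking when folding alternations into a character class is hyphen placement: inside [...], a hyphen between two characters forms a range, so :-= covers every character from : to =, which also includes ; and <. A small sketch of the difference (the variable names are made up for illustration):

```javascript
// ":-=" inside a class is a range from ":" (0x3A) to "=" (0x3D).
const withRange = /[_:-=\s?%.!#*]/;
// Listing ":" and "=" without a hyphen between them keeps them literal.
const withoutRange = /[_:=\s?%.!#*]/;

console.log(withRange.test(';'));     // true: ";" falls inside the range
console.log(withoutRange.test(';'));  // false
```

Whether the extra characters matter depends on the input; for strings that never contain ; or < both classes behave the same.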
The new pattern
See regex in use here
This pattern only takes 271 steps and less than one millisecond to complete (yes, using PCRE engine, works in Java too)
(?<=[,\/])[A-Za-z]+_(?:[A-Z0-9a-z]+|-?[A-Z0-9]+(?:\.[A-Z0-9]+)?|:-|[_:-=\s?%.!#*]|[-.][a-zA-Z]+)++
I also changed your positive lookahead to a positive lookbehind (?<=[,\/]) to improve performance.
Additionally, if you don't need all the specific logic, you can quite simply use the following regex (just under half as many steps as my regex above):
See regex in use here
(?<=[,\/])[A-Za-z]+_[^,\/]+
Results
P.S. I'm assuming there's a typo in your expected output and that the / between l_text and l_fetch should also be split on; needs clarification.
This results in the following array:
w_100
h_500
e_saturation:50
e_tint:red:blue
c_crop
a_100
l_text:Neucha_26_bold:Loremipsum.
l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc
Edit #1
The OP clarified the expected results. I added , to the character class in the fourth option of the non-capture group:
See regex in use here
(?<=[,\/])[A-Za-z]+_(?:[A-Z0-9a-z]+|-?[A-Z0-9]+(?:\.[A-Z0-9]+)?|:-|[_:-=\s?%.!#*,]|[-.][a-zA-Z]+)++
And in its shortened form:
See regex in use here
(?<=\/)[A-Za-z]+_[^\/]+
Results
This results in the following array:
w_100
h_500
e_saturation:50,e_tint:red:blue
c_crop,a_100,l_text:Neucha_26_bold:Loremipsum.
l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc
Edit #2
The OP presented another input and identified issues with Edit #1 related to that input. I added logic to force a fail on the last item in a string.
New test string:
/w_100/h_500/e_saturation:50,e_tint:red:blue/c_crop,a_100,l_text:Neucha_26_bold:Loremipsum./l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc/sample_url_image.jpg
See regex in use here
(?<=\/)(?![A-Za-z]+_[^\/]+$)[A-Za-z]+_[^\/]+
Same results as in Edit #1.
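This shortened pattern contains no possessive quantifier, so it can also be tried in a JavaScript engine that supports lookbehind. A minimal sketch, using a shortened, made-up test string rather than the OP's full input:

```javascript
// Same pattern as above, with the g flag so match() returns every segment.
const segmentRe = /(?<=\/)(?![A-Za-z]+_[^\/]+$)[A-Za-z]+_[^\/]+/g;

// Hypothetical, shortened input in the same shape as the test string.
const sample = '/w_100/h_500/e_saturation:50,e_tint:red:blue/c_crop,a_100/sample_url_image.jpg';

const segments = sample.match(segmentRe) || [];
console.log(segments);
```

The negative lookahead rejects only a segment that runs to the end of the string, which is how the trailing file name is excluded.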
PCRE version (if anyone is looking for it) - more efficient than the method above:
See regex in use here
(?<=\/)[A-Za-z]+_[^\/]+(?:$(*SKIP)(*FAIL))?
Assuming your example has a typo, e.g. the last / would be split too:
You can simply split on /, then filter out the .jpg items:
function splitWithFilter(line, filter) {
var filterRe = filter ? new RegExp(filter, 'i') : null;
return line
.replace(/^\//, '') // remove leading /
.split(/\//)
//.filter(Boolean) // filter out empty items (alternative to above replace())
.filter(function(item) {
return !filterRe || !item.match(filterRe);
});
}
var str = "/w_100/h_500/e_saturation:50,e_tint:red:blue/c_crop,a_100,l_text:Neucha_26_bold:Loremipsum./l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc/1488800313_DSC_0334__3_.JPG_mweubp.jpg";
console.log(JSON.stringify(splitWithFilter(str, '\\.jpg$'), null, ' '));
Expected output:
[
"w_100",
"h_500",
"e_saturation:50,e_tint:red:blue",
"c_crop,a_100,l_text:Neucha_26_bold:Loremipsum.",
"l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc"
]

Regex returns nothing to repeat [duplicate]

I'm new to Regex and I'm trying to work it into one of my new projects to see if I can learn it and add it to my repertoire of skills. However, I'm hitting a roadblock here.
I'm trying to see if the user's input has illegal characters in it by using the .search function as so:
if (name.search("[\[\]\?\*\+\|\{\}\\\(\)\#\.\n\r]") != -1) {
...
}
However, when I try to execute the function that contains this line, it throws the following error for that specific line:
Uncaught SyntaxError: Invalid regular expression: /[[]?*+|{}\()#.
]/: Nothing to repeat
I can't for the life of me see what's wrong with my code. Can anyone point me in the right direction?
You need to double the backslashes used to escape the regular expression special characters, because the backslash is interpreted by the code that reads the string literal rather than passed on to the regular expression parser. However, as @Bohemian points out, most of those backslashes aren't needed. Unfortunately, his answer suffers from the same problem as yours. What you actually want is:
"[\\[\\]?*+|{}\\\\()#.\n\r]"
Note the quadrupled backslash. That is definitely needed. The string passed to the regular expression compiler is then identical to @Bohemian's string, and works correctly.
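A quick way to see the difference is to build the regex from the doubled-backslash string, compare it with the equivalent regex literal, and try the single-escaped string from the question. This is just a sketch; the variable names and sample inputs are made up.

```javascript
// Doubled backslashes in the string literal reach the regex parser intact.
const fromString = new RegExp("[\\[\\]?*+|{}\\\\()#.\n\r]");
// The same character class written as a regex literal.
const fromLiteral = /[\[\]?*+|{}\\()#.\n\r]/;

const samples = ['plain name', 'a*b', 'back\\slash', 'line\nbreak', 'x[1]'];
for (const s of samples) {
  console.log(JSON.stringify(s), fromString.test(s), fromLiteral.test(s));
}

// With single backslashes the string literal swallows the escapes, and the
// regex parser sees "[[]?*+|{}\()#." followed by the line breaks and "]".
let singleEscapeError = null;
try {
  new RegExp("[\[\]\?\*\+\|\{\}\\\(\)\#\.\n\r]");
} catch (e) {
  singleEscapeError = e;
}
console.log(singleEscapeError && singleEscapeError.name);
```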
Building off of @Bohemian, I think the easiest approach would be to just use a regex literal, e.g.:
if (name.search(/[\[\]?*+|{}\\()#.\n\r]/) != -1) {
// ... stuff ...
}
Regex literals are nice because you don't have to escape the escape character, and some IDEs will highlight invalid regex (very helpful for me as I constantly screw them up).
For Google travelers: this stupidly unhelpful error message is also presented when you make a typo and double up the + regex operator:
Okay:
\w+
Not okay:
\w++
Firstly, in a character class [...] most characters don't need escaping - they are just literals.
So, your regex should be:
"[\[\]?*+|{}\\()#.\n\r]"
This compiles for me.
In my case I had to validate a phone number with a regex, and I was getting the same error:
Invalid regular expression: /+923[0-9]{2}-(?!1234567)(?!1111111)(?!7654321)[0-9]{7}/: Nothing to repeat'
The cause in my case was the + operator right after the opening / of the regex. Enclosing the + in square brackets, [+], and sending the request again worked like a charm.
The following will work:
/[+]923[0-9]{2}-(?!1234567)(?!1111111)(?!7654321)[0-9]{7}/
This answer may help those who get the same error for the same reason I did. Cheers :)
For example, I ran into this in Express (Node.js) when trying to create a route for paths not starting with /internal:
app.get(`\/(?!internal).*`, (req, res)=>{
After a lot of trial and error, it worked once I passed it as a RegExp object using new RegExp():
app.get(new RegExp("\/(?!internal).*"), (req, res)=>{
This may help if you run into this issue in routing.
This can also happen if you begin a regex with ?.
? may function as a quantifier -- so ? may expect something else to come before it, thus the "nothing to repeat" error. Nothing preceded it in the regex string so it didn't get to quantify anything; there was nothing to repeat / nothing to quantify.
? also has another role -- if the ? is preceded by ( it may indicate the beginning of a lookaround assertion or some other special construct. See example below.
If one forgets to write the () parentheses around the following lookbehind assertion ?<=x, this will cause the OP's error:
Incorrect:
const xThenFive = /?<=x5/;
Correct:
const xThenFive = /(?<=x)5/;
This /(?<=x)5/ is a positive lookbehind: we're looking for a 5 that is preceded by an x e.g. it would match the 5 in x563 but not the 5 in x652.
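The cases described in the answers above can be checked without crashing a script by compiling each pattern through the RegExp constructor inside a try/catch. This is only a sketch; compiles is a hypothetical helper name.

```javascript
// Hypothetical helper: reports whether a pattern source compiles.
function compiles(source) {
  try {
    new RegExp(source);
    return true;
  } catch (e) {
    return false; // e.g. "Nothing to repeat"
  }
}

console.log(compiles('\\w+'));      // single quantifier
console.log(compiles('\\w++'));     // doubled quantifier
console.log(compiles('+923'));      // quantifier at the very start
console.log(compiles('[+]923'));    // literal + inside a character class
console.log(compiles('?<=x5'));     // lookbehind without its parentheses
console.log(compiles('(?<=x)5'));   // well-formed lookbehind

// The lookbehind matches the 5 in x563 but nothing in x652.
const xThenFive = /(?<=x)5/;
console.log(xThenFive.exec('x563').index, xThenFive.exec('x652'));
```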

Javascript regex to check for error tags

I need to write a regex that will tell me if any back-end framework that I'm working with is throwing an error and then store those errors in an array for retrieval if necessary.
The issue is, they use different tags for errors. Tags are as follows:
{{error}}, <<error>>, [[error]], and <{:error:}>
Usually, but not always, a set of braces will come after. Inside the braces will be a string; either an explanation of the error, or a JSON string containing more info about the error, like this:
<<error>> { Something has gone terribly wrong. }
<<error>> {
{"some":"json"}
}
<{:error:}> { What went wrong? }
As of now, it's undergoing a specific check for each tag, which is rather inefficient, like this:
if ( string.indexOf('<<error>>') >= 0 )
// Remove << and >>
if ( string.indexOf('[[error]]') >= 0 )
// Remove [[ and ]]
// So forth...
Then, I am left with a string like this:
error { Some description. }
or
error {
{"some":"json"}
}
Which I need a regex to extract what's between the brackets. This was the regex I wrote, but it falls short on numerous things:
string.match('/error\s?\{([^\}]+)\}/gi');
As I said, this procedure is very inefficient and has issues.
First, it doesn't allow the braces {} after error to be optional. They should be optional.
Second, it does not allow JSON, because the character class [^}] stops matching at the JSON's own closing }. So I need some way of matching all characters in a set until the opening bracket of error is closed. Is this possible?
Given the comments on my first answer, I'd use this regular expression as a replace to convert the data into single-line json, the regex also removes comments. It removes newlines that do not start with a properly labeled error. Multiline must be on.
(?:\/[\s\S*]*?\*\/|\/\/.*$|\s*^\s*(?!<<|{{|\[\[|<{:))) (demo)
or (?:\s*^\s*(?!<<|{{|\[\[|<{:)) if there are never comments to remove
Then run this regex on the reformatted string to extract the error information:
({{error}}|<<error>>|\[\[error\]\]|<{:error:}>)[ \t]*(?:(.*)}\s*$)? demo
I'll leave the other answer intact as I think it basically explains the problems that a person can encounter doing this.
Good question. Explained your problem, showed what you've tried, gave enough examples of input.
Regex, especially Javascript's limited implementation, is not ideal for parsing many languages and data objects. It can be difficult in this scenario to capture say 5. .* wants to go to 6 and .*? wants to go to 4.
{
{
{
} // 4
} // 5
} // 6
However, if your code is really indented like your examples (it may not be, that could be you making it readable), you should be able to use something like this ({{error}}|<<error>>|\[\[error\]\]|<{:error:}>)\s*(\s*{(.*?(?=$)|[\s\S]*?^)})?, (demo)
What this is doing is
1. Capturing from { to } on the same line; if it can't, it proceeds to step 2 (the alternation).
2. Capturing everything between { and } as long as } starts the line.
If the } is always prefixed by a certain number of spaces, you can prefix the marked } with that number of spaces in the regex.
({{error}}|<<error>>|\[\[error\]\]|<{:error:}>)\s*(\s*{(.*?(?=$)|[\s\S]*?^)})?
^
If the } is always prefixed by the same number of spaces as the opening error marker, you can do this
([\t ]*)({{error}}|<<error>>|\[\[error\]\]|<{:error:}>)(?:[ \t]*({(.*?(?=}$)|[\s\S]*?^\1)})?) (demo)
For this example, it's important to look at the full sample indent text. I demonstrate how it can go wrong.
If these won't work, you'll need a more code-oriented solution, but at the very least you can detect presence of errors with this
({{error}}|<<error>>|\[\[error\]\]|<{:error:}>)
Chris85's simpler version is bad form: it could match <<error]] and any other combination, something he's probably aware of.
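For the more code-oriented route, one possible sketch is to use the tag regex only to locate each marker and then count braces by hand to find the matching close. extractErrors and TAG are hypothetical names, and the sketch assumes that braces inside the body are balanced (a } inside a JSON string value would break it).

```javascript
// Tag alternatives from the answer above, with every brace escaped.
const TAG = /\{\{error\}\}|<<error>>|\[\[error\]\]|<\{:error:\}>/g;

// Hypothetical helper: returns one { tag, body } entry per error marker.
function extractErrors(text) {
  const results = [];
  let m;
  TAG.lastIndex = 0;
  while ((m = TAG.exec(text)) !== null) {
    let i = TAG.lastIndex;
    // Skip spaces and tabs between the tag and an optional "{".
    while (text[i] === ' ' || text[i] === '\t') i++;
    let body = null;
    if (text[i] === '{') {
      let depth = 0;
      for (let j = i; j < text.length; j++) {
        if (text[j] === '{') {
          depth++;
        } else if (text[j] === '}' && --depth === 0) {
          body = text.slice(i + 1, j).trim();
          TAG.lastIndex = j + 1; // resume the tag search after the block
          break;
        }
      }
    }
    results.push({ tag: m[0], body: body });
  }
  return results;
}

const log = '<<error>> { Something has gone terribly wrong. }\n' +
  '<<error>> {\n  {"some":"json"}\n}\n' +
  '[[error]]\n' +
  '<{:error:}> { What went wrong? }';
console.log(extractErrors(log));
```

A tag with no braces after it yields a null body, which covers the case where the braces are optional.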

Breaking a String into Chunks based on Pattern

I have one string, that looks like this:
a[abcdefghi,2,3,jklmnopqr]
The beginning "a" is fixed and non-changing, however the content within the brackets is and can follow a pattern. It will always be an alphabetical string, possibly followed by numbers separated by commas or more strings and/or numbers.
I'd like to be able to break it into chunks of the string and any numbers that follow it until the "]" or another string is met.
Probably best explained through examples and expected ideal results:
a[abcdefghi] -> "abcdefghi"
a[abcdefghi,2] -> "abcdefghi,2"
a[abcdefghi,2,3,jklmnopqr] -> "abcdefghi,2,3" and "jklmnopqr"
a[abcdefghi,2,3,jklmnopqr,stuvwxyz] -> "abcdefghi,2,3" and "jklmnopqr" and "stuvwxyz"
a[abcdefghi,2,3,jklmnopqr,1,9,stuvwxyz] -> "abcdefghi,2,3" and "jklmnopqr,1,9" and "stuvwxyz"
a[abcdefghi,1,jklmnopqr,2,stuvwxyz,3,4] -> "abcdefghi,1" and "jklmnopqr,2" and "stuvwxyz,3,4"
Ideally a malformed string would be partially caught (but this is a nice extra):
a[2,3,jklmnopqr,1,9,stuvwxyz] -> "jklmnopqr,1,9" and "stuvwxyz"
I'm using Javascript and I realize a regex won't bring me all the way to the solution I'd like but it could be a big help. The alternative is to do a lot of manual string parsing, which I can do but doesn't seem like the best answer.
Advice and tips appreciated.
UPDATE: Yes, I did mean alphabetical (A-Za-z) instead of alphanumeric. Edited to reflect that. Thanks for letting me know.
You'd probably want to do this in 2 steps. First, match against:
a\[([^[\]]*)\]
and extract group 1. That'll be the stuff in the square brackets.
Next, repeatedly match against:
[a-z]+(,[0-9]+)*
That'll match things like "abcdefghi,2,3". After the first match you'll need to see if the next character is a comma and if so skip over it. (BTW: if you really meant alphanumeric rather than alphabetic like your examples, use [a-z0-9]*[a-z][a-z0-9]* instead of [a-z]+.)
Alternatively, split the string on commas and reassemble into your word with number groups.
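The two-step approach could look roughly like this in JavaScript. chunk is a hypothetical name, and the sketch uses a non-capturing group plus the g and i flags so that match() returns the chunks directly and the separating commas are skipped automatically.

```javascript
// Hypothetical helper implementing the two steps described above.
function chunk(input) {
  // Step 1: pull out whatever sits between the square brackets.
  const outer = input.match(/a\[([^[\]]*)\]/);
  if (!outer) return [];
  // Step 2: a run of letters followed by any number of ",digits" groups.
  return outer[1].match(/[a-z]+(?:,[0-9]+)*/gi) || [];
}

console.log(chunk('a[abcdefghi,2,3,jklmnopqr,1,9,stuvwxyz]'));
console.log(chunk('a[2,3,jklmnopqr,1,9,stuvwxyz]')); // malformed: leading numbers
```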
Why wouldn't a regex bring you all the way to a solution?
The following regex works against the given data, but it makes a few assumptions (at least two alphas followed by comma separated single digits).
([a-z]{2,}(?:,\d)*)
Example:
re = new RegExp('[a-z]{2,}(?:,\\d)*', 'g')
matches = re.exec("a[abcdefghi,2,3,jklmnopqr,1,9,stuvwxyz]")
Assuming you can easily break out the string between the brackets, something like this might be what you're after:
> re = new RegExp('[a-z]+(?:,\\d)*(?:,?)', 'gi')
> while (match = re.exec("abcdefghi,2,3,jklmnopqr,1,9,stuvwxyz")) { print(match[0]) }
abcdefghi,2,3,
jklmnopqr,1,9,
stuvwxyz
This has the advantage of working partially in your malformed case:
> while (match = re.exec("2,3,jklmnopqr,1,9,stuvwxyz")) { print(match[0]) }
jklmnopqr,1,9,
stuvwxyz
The first character class [a-z] can be modified if you meant for it to be truly alphanumeric.

How to search csv string and return a match by using a Javascript regex

I'm trying to extract the first user-right from semicolon separated string which matches a pattern.
Users rights are stored in format:
LAA;LA_1;LA_2;LE_3;
String is empty if user does not have any rights.
My best solution so far is to use the following regex in regex.replace statement:
.*?;(LA_[^;]*)?.*
(The question mark at the end of the group is there so that the whole line still matches when the user does not have the right; the line is then replaced with an empty string to signal that she doesn't have it.)
However, it doesn't work correctly in case the searched right is in the first position:
LA_1;LA_2;LE_3;
It is easy to fix it by just adding a semicolon at the beginning of line before regex replace but my question is, why doesn't the following regex match it?
.*?(?:(?:^|;)(LA_[^;]*))?.*
I have tried numerous other regular expressions to find the solution but so far without success.
I am not sure I get your question right, but in regards to the regular expressions you are using, you are overcomplicating them for no clear reason (at least not to me). You might want something like:
function getFirstRight(rights) {
var m = rights.match(/(^|;)(LA_[^;]*)/)
return m ? m[2] : "";
}
You could just split the string first:
function getFirstRight(rights)
{
return rights.split(";",1)[0] || "";
}
To answer the specific question "why doesn't the following regex match it?", one problem is the mix of this at the beginning:
.*?
eventually followed by:
^|;
Which might be like saying, skip over any extra characters until you reach either the start or a semicolon. But you can't skip over anything and then later arrive at the start (unless it involves newlines in a multiline string).
Something like this works:
.*?(\bLA_[^;]*).*
Meaning, skip over characters until a word boundary followed by "LA_".
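To see the behaviour concretely, here is a small sketch that runs the regex from the question next to a plain match on (^|;). The sample strings are taken from the question; the variable names are made up. In the question's regex the lazy .*? first matches nothing, the optional group is tried at position 0, and when it fails there it is simply skipped, so the trailing .* consumes the whole line and the capture stays undefined.

```javascript
// Regex from the question: the optional group is only ever tried at position 0.
const questionRe = /.*?(?:(?:^|;)(LA_[^;]*))?.*/;
console.log(questionRe.exec('LA_1;LA_2;LE_3;')[1]);      // right is first
console.log(questionRe.exec('LAA;LA_1;LA_2;LE_3;')[1]);  // right is not first

// A plain search for "start or semicolon, then LA_..." finds both.
function getFirstRight(rights) {
  const m = rights.match(/(^|;)(LA_[^;]*)/);
  return m ? m[2] : '';
}
console.log(getFirstRight('LA_1;LA_2;LE_3;'));
console.log(getFirstRight('LAA;LA_1;LA_2;LE_3;'));
console.log(JSON.stringify(getFirstRight('')));
```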
