Catastrophic backtracking in regular expression - JavaScript

I am using the regular expression below:
[^\/]*([A-Za-z]+_([A-Za-z]+|-?[A-Z0-9]+(\.[A_Z0-9]+)?|(?:_|:|:-|-[a-zA-Z]+|\.[a-zA-Z]+|[A-Z0-9a-z]+|=|\s|\?|\%|\.|!|#|\*)?)+(?=(,|\/)))+|,[^\/]*
and it reports catastrophic backtracking when I try to match it against this input string:
/w_100/h_500/e_saturation:50,e_tint:red:blue/c_crop,a_100,l_text:Neucha_26_bold:Loremipsum./l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc/1488800313_DSC_0334__3_.JPG_mweubp.jpg
The expected output array of the matching regex will be like
[ 'w_100',
'h_500',
'e_saturation:50,e_tint:red:blue',
'c_crop,a_100,l_text:Neucha_26_bold:Loremipsum.',
'l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc' ]
I don't want the image name 1488800313_DSC_0334__3_.JPG_mweubp.jpg to be included in the match.
Is there any way to avoid this backtracking in the regular expression, or can you suggest a better regex for my input string?

The problem
You use a lot of alternations when a character class would be more effective. Also, you're getting the catastrophic backtracking due to the following quantifier:
[^\/]*([A-Za-z]+_([A-Za-z]+|-?[A-Z0-9]+(\.[A_Z0-9]+)?|(?:_|:|:-|-[a-zA-Z]+|\.[a-zA-Z]+|[A-Z0-9a-z]+|=|\s|\?|\%|\.|!|#|\*)?)+(?=(,|\/)))+|,[^\/]*
^
It's trying to match any of the alternations you have, but keeps backtracking and never makes it past all your alternations (it's sometimes comparable to an infinite loop). In your case, your regex is so inefficient that it times out. I removed half your pattern and it takes half a second to complete with almost 200K steps (and that's only half your pattern).
Original Answer
How can it be fixed?
First step is to fix the quantifier and prevent it from continuously backtracking. This is actually quite easy, just make it possessive: + becomes ++ (supported by PCRE and Java, but not by JavaScript's built-in RegExp). Changing the quantifier to possessive yields a pattern that takes about 56ms to complete and approx 9K steps (on my computer).
Second step is to improve the efficiency of the pattern. Change your alternations to character classes where possible.
(?:_|:|:-|-[a-zA-Z]+|\.[a-zA-Z]+|[A-Z0-9a-z]+|=|\s|\?|\%|\.|!|#|\*)?
# should instead be
(?::-|[_:-=\s?%.!#*]|[-.][a-zA-Z]+|[A-Z0-9a-z]+)?
It's much shorter, much more concise and less prone to errors.
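Since the question is tagged JavaScript, where ++ is a syntax error, here is a minimal sketch of a commonly used stand-in: capture inside a lookahead, then consume the capture with a backreference. The engine does not backtrack into a lookahead once it has succeeded, so the group behaves atomically. The pattern is deliberately tiny and is not the OP's regex; the names atomicRe and nearMiss are made up for illustration.

```javascript
// In a backtracking engine, the naive /^(a+)+$/ can take exponential time
// on a near-miss input such as "aaaa...!".
// Emulated atomic group: the lookahead captures the longest run of "a",
// then \1 consumes exactly that run. A finished lookahead is never re-entered.
const atomicRe = /^(?:(?=(a+))\1)+$/;

// Near miss: everything matches until the very last character.
const nearMiss = "a".repeat(5000) + "!";

console.log(atomicRe.test("aaaa"));   // true
console.log(atomicRe.test(nearMiss)); // false
```

The same lookahead-plus-backreference wrapper can be placed around the inner group of a larger pattern, at the cost of an extra capture group.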
The new pattern
See regex in use here
This pattern only takes 271 steps and less than one millisecond to complete (yes, using PCRE engine, works in Java too)
(?<=[,\/])[A-Za-z]+_(?:[A-Z0-9a-z]+|-?[A-Z0-9]+(?:\.[A-Z0-9]+)?|:-|[_:-=\s?%.!#*]|[-.][a-zA-Z]+)++
I also changed your positive lookahead to a positive lookbehind (?<=[,\/]) to improve performance.
Additionally, if you don't need all the specific logic, you can quite simply use the following regex (just under half as many steps as my regex above):
See regex in use here
(?<=[,\/])[A-Za-z]+_[^,\/]+
Results
This results in the following array:
P.S. I'm assuming there's a typo in your expected output and that the / between l_text and l_fetch should also be split on; needs clarification.
w_100
h_500
e_saturation:50
e_tint:red:blue
c_crop
a_100
l_text:Neucha_26_bold:Loremipsum.
l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc
Edit #1
The OP clarified the expected results. I added , to the character class in the fourth option of the non-capture group:
See regex in use here
(?<=[,\/])[A-Za-z]+_(?:[A-Z0-9a-z]+|-?[A-Z0-9]+(?:\.[A-Z0-9]+)?|:-|[_:-=\s?%.!#*,]|[-.][a-zA-Z]+)++
And in its shortened form:
See regex in use here
(?<=\/)[A-Za-z]+_[^\/]+
Results
This results in the following array:
w_100
h_500
e_saturation:50,e_tint:red:blue
c_crop,a_100,l_text:Neucha_26_bold:Loremipsum.
l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc
Edit #2
The OP presented another input and identified issues with Edit #1 related to that input. I added logic to force a fail on the last item in a string.
New test string:
/w_100/h_500/e_saturation:50,e_tint:red:blue/c_crop,a_100,l_text:Neucha_26_bold:Loremipsum./l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc/sample_url_image.jpg
See regex in use here
(?<=\/)(?![A-Za-z]+_[^\/]+$)[A-Za-z]+_[^\/]+
Same results as in Edit #1.
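For anyone who wants to try the Edit #2 pattern from JavaScript, here is a small sketch. It assumes an engine with lookbehind support (ES2018 or later); the variable names are my own.

```javascript
// Requires lookbehind support (ES2018+); older engines throw a SyntaxError.
const segmentRe = /(?<=\/)(?![A-Za-z]+_[^\/]+$)[A-Za-z]+_[^\/]+/g;

const url =
  "/w_100/h_500/e_saturation:50,e_tint:red:blue" +
  "/c_crop,a_100,l_text:Neucha_26_bold:Loremipsum." +
  "/l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc" +
  "/sample_url_image.jpg";

// match() returns null when nothing matches, so fall back to an empty array.
const parts = url.match(segmentRe) || [];
console.log(parts);
```

The negative lookahead rejects the final segment because it is the only one that runs all the way to the end of the string.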
PCRE version (if anyone is looking for it) - more efficient than the method above:
See regex in use here
(?<=\/)[A-Za-z]+_[^\/]+(?:$(*SKIP)(*FAIL))?

Assuming your example has a typo, i.e. the last / should be split on too:
You can simply split on /, then filter out the .jpg items:
function splitWithFilter(line, filter) {
  var filterRe = filter ? new RegExp(filter, 'i') : null;
  return line
    .replace(/^\//, '') // remove leading /
    .split(/\//)
    //.filter(Boolean) // filter out empty items (alternative to above replace())
    .filter(function(item) {
      return !filterRe || !item.match(filterRe);
    });
}
var str = "/w_100/h_500/e_saturation:50,e_tint:red:blue/c_crop,a_100,l_text:Neucha_26_bold:Loremipsum./l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc/1488800313_DSC_0334__3_.JPG_mweubp.jpg";
console.log(JSON.stringify(splitWithFilter(str, '\\.jpg$'), null, ' '));
Expected output:
[
"w_100",
"h_500",
"e_saturation:50,e_tint:red:blue",
"c_crop,a_100,l_text:Neucha_26_bold:Loremipsum.",
"l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc"
]

Related

How do I allow only one (dash or dot or underscore) in a user form input using regular expression in JavaScript?

I'm trying to implement username form validation in JavaScript where the username
can't start with numbers
can't have whitespaces
can't have any symbols except at most one dot, one underscore, and one dash
example of a valid username: the_user-one.123
example of an invalid username: 1----- user
I've been trying to implement this for a while, but I couldn't figure out how to allow only one of each permitted symbol:
const usernameValidation = /(?=^[\w.-]+$)^\D/g
console.log(usernameValidation.test('1username')) //false
console.log(usernameValidation.test('username-One')) //true
How about using a negative lookahead at the start:
^(?!\d|.*?([_.-]).*\1)[\w.-]+$
This will check if the string
neither starts with digit
nor contains two [_.-] by use of capture and backreference
See this demo at regex101 (more explanation on the right side)
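A quick way to try the pattern above in JavaScript is sketched below. The sample names are my own, and the regex is created without the g flag because test() on a global regex keeps state in lastIndex between calls.

```javascript
// No g flag: a global regex would carry lastIndex from one test() to the next.
const usernameRe = /^(?!\d|.*?([_.-]).*\1)[\w.-]+$/;

const samples = [
  "the_user-one.123", // one of each symbol
  "1----- user",      // starts with a digit, repeats "-", has a space
  "user--name",       // "-" appears twice
  "user name",        // whitespace
  "plainuser"         // no symbols at all
];

for (const name of samples) {
  console.log(name.padEnd(20), usernameRe.test(name));
}
```

Only the first and last samples pass.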
Preface: Due to my severe carelessness, I assumed the context was usage of the HTML pattern attribute instead of JavaScript input validation. I leave this answer here for posterity in case anyone really wants to do this with regex.
Although regex does have functionality to represent a pattern occurring consecutively within a certain number of times (via {<lower-bound>,<upper-bound>}), I'm not aware of regex having "elegant" functionality to enforce a set of patterns each occurring within a range of number of times but in any order and with other patterns possibly in between.
Some workarounds I can think of:
Make a regex that allows for one of each permutation of ordering of special characters (note: newlines added for readability):
^(?:
(?:(?:(?:[A-Za-z][A-Za-z0-9]*\.?)|\.)[A-Za-z0-9]*-?[A-Za-z0-9]*_?)|
(?:(?:(?:[A-Za-z][A-Za-z0-9]*\.?)|\.)[A-Za-z0-9]*_?[A-Za-z0-9]*-?)|
(?:(?:(?:[A-Za-z][A-Za-z0-9]*-?)|-)[A-Za-z0-9]*\.?[A-Za-z0-9]*_?)|
(?:(?:(?:[A-Za-z][A-Za-z0-9]*-?)|-)[A-Za-z0-9]*_?[A-Za-z0-9]*\.?)|
(?:(?:(?:[A-Za-z][A-Za-z0-9]*_?)|_)[A-Za-z0-9]*\.?[A-Za-z0-9]*-?)|
(?:(?:(?:[A-Za-z][A-Za-z0-9]*_?)|_)[A-Za-z0-9]*-?[A-Za-z0-9]*\.?)
)[A-Za-z0-9]*$
Note that the above regex can be simplified if you don't want usernames to start with special characters either.
Friendly reminder to also make sure you use the HTML attributes to enforce a minimum and maximum input character length where appropriate.
If you feel that regex isn't well suited to your use-case, know that you can do custom validation logic using javascript, which gives you much more control and can be much more readable compared to regex, but may require more lines of code to implement. Seeing the regex above, I would personally seriously consider the custom javascript route.
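To give an idea of what the custom JavaScript route could look like, here is a minimal sketch. The function name isValidUsername is hypothetical, and the rules encode my reading of the question (letters, digits, and at most one each of dot, underscore, and dash), so adjust them to your real requirements.

```javascript
// Hypothetical helper; the rules below are an interpretation of the question.
function isValidUsername(name) {
  if (name.length === 0) return false;
  // Rule 1: must not start with a digit.
  if (/^\d/.test(name)) return false;
  // Rule 2: only letters, digits, ".", "_" and "-" (so no whitespace).
  if (!/^[A-Za-z0-9._-]+$/.test(name)) return false;
  // Rule 3: each of ".", "_", "-" may appear at most once.
  return [".", "_", "-"].every(sym => name.split(sym).length - 1 <= 1);
}

console.log(isValidUsername("the_user-one.123")); // true
console.log(isValidUsername("1----- user"));      // false
console.log(isValidUsername("user..name"));       // false
```

Each rule is a separate line, so it is easy to report which rule failed instead of a single pass/fail result.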
Note: I find https://regex101.com/ very helpful in learning, writing, and testing regex. Make sure to set the "flavour" to "JavaScript" in your case.
I have to admit that Bobble bubble's solution is the better fit. Here is a comparison of the different cases:
console.log("Comparison between mine and Bobble Bubble's solution:\n\nusername mine,BobbleBubble");
["valid-usrId1","1nvalidUsrId","An0therVal1d-One","inva-lid.userId","anot-her.one","test.-case"].forEach(u=>console.log(u.padEnd(20," "),chck(u)));
function chck(s){
  return [!!s.match(/^[a-zA-Z][a-zA-Z0-9._-]*$/) && ( s.match(/[._-]/g) || []).length<2, // mine
    !!s.match(/^(?!\d|.*?([_.-]).*\1)[\w.-]+$/)].join(","); // Bobble bubble
}
The differences can be seen in the last three test cases.

regex between two character positions with known start and end indices

In regex, generally speaking, is there a way to select data between two line positions? I'm not even sure the correct terminology (character/line position, index, column?) after a few days of reading up on regex, but what I mean is...
Select the data between two indices, what is between ^.{4} and ^.{7}, for example:
TESTINGREGEX
ISNTTHEBEST!
or
TESTINGREGEXCANBEFUN
ISNTTHEBEST!ANDFARFROMFUN
the results I'm looking for would be:
TESTREGEX
ISNTBEST!
and
TESTREGEXCANBEFUN
ISNTBEST!ANDFARFROMFUN
I'm wondering, so I can learn if it's possible, how to achieve it? I'm very familiar with other ways to do this using other tools, but I'm curious how to achieve this using regex.
I've tried working with non capturing groups, and wondering if maybe I'm being limited by the fact that I'm attempting to apply this regex within the atom editor find and replace regex feature (falling victim to: Avoiding Common Pitfalls), so I'm hoping to get a few suggestions to broaden my knowledge and try out. I'm guessing javascript, and/or sed style regex answers would be acceptable...really anything would help!
EDIT:
.{3}(?=.{5}$) from Mark's answer works for me and with the example text I gave in the OP. And it's a good thing to know when able to count from the $ end of line. But I'm realizing I actually need the opposite... I need to count out from the ^ start of line. Is this not possible; re: comments on there being no support for lookbehind?
With just regex it's possible, just not in older JavaScript engines. The regex (?<=^.{4}).+(?=.{5}$) matches the text between the 4th character and the 5th-to-last character. JavaScript did not support lookbehind until ES2018, so in an engine without it you'll have to use some amount of JavaScript beyond a simple .replace(regex, "") to remove those characters.
The next closest regex possible without lookbehind would be .{3}(?=.{5}$), which matches the 3 characters before the 5th-to-last character.
Without lookbehind, matching only something that starts a few characters after the start of a string is impossible with pure regex.
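For readers on a newer engine: lookbehind assertions were added to JavaScript in ES2018, so counting from the start of the line can be done directly there. A minimal sketch, assuming such an engine (older ones throw a SyntaxError on this literal); the name dropMiddle is made up:

```javascript
// Requires ES2018 lookbehind. The m flag makes ^ match at every line start,
// and g applies the replacement once per line.
const dropMiddle = /(?<=^.{4}).{3}/gm;

const text = "TESTINGREGEXCANBEFUN\nISNTTHEBEST!ANDFARFROMFUN";
console.log(text.replace(dropMiddle, ""));
// TESTREGEXCANBEFUN
// ISNTBEST!ANDFARFROMFUN
```

The lookbehind pins the match to the position right after the first four characters of each line, so only characters 5 to 7 are removed.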
The regex ^(.{4}).{3}(.{5})$ (expressed in JavaScript's dialect, but the features used in it are quite common) will give you two capture groups you can combine to get the output you describe:
function test(str) {
  var match = str.match(/^(.{4}).{3}(.{5})$/);
  console.log(str, '=>', match[1] + match[2]);
}
test("TESTINGREGEX");
test("ISNTTHEBEST!");
If the lines are of varying length and you want to ignore everything after the end of what you want, just drop the $ assertion at the end.
If the purpose is to get the text between two character offsets then regular expressions are overkill. Just use slice:
function exclude(str, i, j) {
  return str.slice(0, i) + str.slice(j);
}
console.log(exclude("TESTINGREGEX", 4, 7));
console.log(exclude("ISNTTHEBEST!", 4, 7));
If you really need to do this with regular expressions then proceed as follows:
function exclude(str, i, j) {
  return str.replace(new RegExp(`^(.{${i}})(.{${j-i}})`), "$1");
}
console.log(exclude("TESTINGREGEX", 4, 7));
console.log(exclude("ISNTTHEBEST!", 4, 7));

Javascript regex to pick all multi line text between two strings [duplicate]

var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre.*?<\/pre>/gm );
alert(arr); // null
I'd want the PRE block be picked up, even though it spans over newline characters. I thought the 'm' flag does it. Does not.
Found the answer here before posting. Since I thought I knew JavaScript (read three books, worked hours) and there wasn't an existing solution at SO, I'll dare to post anyways. throw stones here
So the solution is:
var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre[\s\S]*?<\/pre>/gm );
alert(arr); // <pre>...</pre> :)
Does anyone have a less cryptic way?
Edit: this is a duplicate but since it's harder to find than mine, I don't remove.
It proposes [^] as a "multiline dot". What I still don't understand is why [.\n] does not work. Guess this is one of the sad parts of JavaScript..
DON'T use (.|[\r\n]) instead of . for multiline matching.
DO use [\s\S] instead of . for multiline matching
Also, avoid greediness where not needed by using *? or +? quantifier instead of * or +. This can have a huge performance impact.
See the benchmark I have made: https://jsben.ch/R4Hxu
Using [^]: fastest
Using [\s\S]: 0.83% slower
Using (.|\r|\n): 96% slower
Using (.|[\r\n]): 96% slower
NB: You can also use [^] but it is deprecated in the below comment.
[.\n] does not work because . has no special meaning inside of [], it just means a literal .. (.|\n) would be a way to specify "any character, including a newline". If you want to match all newlines, you would need to add \r as well to include Windows and classic Mac OS style line endings: (.|[\r\n]).
That turns out to be somewhat cumbersome, as well as slow, (see KrisWebDev's answer for details), so a better approach would be to match all whitespace characters and all non-whitespace characters, with [\s\S], which will match everything, and is faster and simpler.
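The difference is easy to check with the string from the question. A small sketch (variable names are mine):

```javascript
const ss = "<pre>aaaa\nbbb\nccc</pre>ddd";

// Inside [], the dot is a literal ".", so this class only accepts "." or "\n".
// The ">" right after "<pre" is neither, so there is no match at all.
const withLiteralDot = ss.match(/<pre[.\n]*?<\/pre>/g);

// [\s\S] is "whitespace or non-whitespace", i.e. any character including newlines.
const withAnyChar = ss.match(/<pre[\s\S]*?<\/pre>/g);

console.log(withLiteralDot); // null
console.log(withAnyChar);    // [ '<pre>aaaa\nbbb\nccc</pre>' ]
```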
In general, you shouldn't try to use a regexp to match the actual HTML tags. See, for instance, these questions for more information on why.
Instead, try actually searching the DOM for the tag you need (using jQuery makes this easier, but you can always do document.getElementsByTagName("pre") with the standard DOM), and then search the text content of those results with a regexp if you need to match against the contents.
You do not specify your environment and version of JavaScript (ECMAScript), and I realise this post was from 2009, but just for completeness:
With the release of ECMAScript 2018 we can now use the s flag to cause . to match \n (see https://stackoverflow.com/a/36006948/141801).
Thus:
let s = 'I am a string\nover several\nlines.';
console.log('String: "' + s + '".');
let r = /string.*several.*lines/s; // Note 's' modifier
console.log('Match? ' + r.test(s)); // 'test' returns true
This is a recent addition and will not work in many current environments; for example, Node v8.7.0 does not seem to recognise it. It works in Chromium, though, and I'm using it in a TypeScript test I'm writing; presumably it will become more mainstream as time goes by.
Now there's the s (single line) modifier, which lets the dot match newlines as well :)
\s will also match new lines :D
Just add the s behind the slash
/<pre>.*?<\/pre>/gms
[.\n] doesn't work, because dot in [] (by regex definition; not javascript only) means the dot-character. You can use (.|\n) (or (.|[\n\r])) instead.
I have tested it (Chrome) and it's working for me (both [^] and [^\0]), by changing the dot (.) with either [^\0] or [^] , because dot doesn't match line break (See here: http://www.regular-expressions.info/dot.html).
var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre[^\0]*?<\/pre>/gm );
alert(arr); //Working
In addition to above-said examples, it is an alternate.
^[\\w\\s]*$
Where \w is for words and \s is for white spaces
[\\w\\s]*
This one was beyond helpful for me, especially for matching multiple things that include new lines, every single other answer ended up just grouping all of the matches together.

Regular Expression to MATCH ALL words in a query, in any order

I'm trying to build a search feature for a project which narrows down items based on a user search input and if it matches the keywords listed against items. For this, I'm saving the item keywords in a data attribute and matching the query with these keywords using a RegExp pattern.
I'm currently using this expression, which I know is not correct and need your help on that:
new RegExp('\\b(' + query + ')', 'gi'))) where query is | separated values of the query entered by the user (e.g. \\b(meat|pasta|dinner)). This returns me a match even if there is only 1 match, say for example - meat
Just to throw some context, here's a small example:
If a user types: meat pasta dinner it should list all items which have ALL the 3 keywords listed against them i.e. meat pasta and dinner. These are independent of the order they're typed in.
Can you help me with an expression which will match ALL words in a query, in any order?
You can achieve this with lookahead assertions
^(?=.*\bmeat\b)(?=.*\bpasta\b)(?=.*\bdinner\b).+
See it here on Regexr
(?=.*\bmeat\b) is a positive lookahead assertion, that ensures that \bmeat\b is somewhere in the string. Same for the other keywords and the .+ is then actually matching the whole string, but only if the assertions are true.
But it will match also on "dinner meat Foobar pasta"
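Since the query comes from user input, the lookaheads would normally be built at runtime. Here is a rough sketch of how that might look in JavaScript. Both function names are hypothetical, and it assumes the keywords are plain words (the \b anchors behave differently around punctuation) and that the searched text is a single line.

```javascript
// Escape regex metacharacters so user input is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: builds one lookahead per whitespace-separated word.
function buildAllWordsRegex(query) {
  const words = query.trim().split(/\s+/).filter(Boolean);
  const lookaheads = words.map(w => "(?=.*?\\b" + escapeRegExp(w) + "\\b)");
  return new RegExp("^" + lookaheads.join(""), "i");
}

const allWordsRe = buildAllWordsRegex("meat pasta dinner");
console.log(allWordsRe.test("dinner with pasta and meat")); // true
console.log(allWordsRe.test("meat and pasta only"));         // false
```

Note that an empty query produces a regex that matches every item, which may or may not be what the search box should do.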
stema's answer is technically correct, but it doesn't take performance into account at all. Look aheads are extremely slow (in the context of regular expressions, which are lightning fast). Even with the current logic, the regular expression is not optimal.
So here are some measurements, calculated on larger strings which contain all three words, running the search 1000 times and using four different approaches:
stema's regular expression
/^(?=.*\bmeat\b)(?=.*\bpasta\b)(?=.*\bdinner\b).+/
result: 605ms
optimized regular expression
/^(?=.*?\bmeat\b)(?=.*?\bpasta\b)(?=.*?\bdinner\b)/
uses lazy matching and doesn't need the end all selector
result: 291ms
permutation regular expression
/(\bmeat\b.*?(\bpasta\b.*?\bdinner\b|\bdinner\b.*?\bpasta\b)|\bpasta\b.*?(\bmeat\b.*?\bdinner\b|\bdinner\b.*?\bmeat\b)|\bdinner\b.*?(\bpasta\b.*?\bmeat\b|\bmeat\b.*?\bpasta\b))/
result: 56ms
this is fast because the first alternative is the one that matches; if only the last alternative matched, it would be even slower than the lookahead one (300 ms)
array of regular expressions
var regs=[/\bmeat\b/,/\bpasta\b/,/\bdinner\b/];
var result = regs.every(reg=>reg.test(text));
result: 26ms
Note that if the strings are crafted to not match, then the results are:
521ms
220ms
161ms - much slower because it has to go through all the branches
14ms
As you can see, in all cases just using a loop is an order of magnitude faster, not to mention easier to read.
The original question was asking for a regular expression, so my answer to that is the permutation regular expression, but I would not use it, as its size would grow exponentially with the number of search words.
Also, in most cases this performance issue is academic, but it is necessary to be highlighted.
Your regex looks pretty good:
\b(meat|pasta|dinner)\b
Check that the length of matches equals the number of keywords (in this case, three):
string.match(re).length === numberOfKeywords
where re is the regex with a g flag, string is the data and numberOfKeywords is the number of keywords
This assumes that there are no repeated keywords.
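Two things can trip up the plain length check: match() returns null when nothing matches, and a keyword that appears twice in the data inflates the count. A sketch that sidesteps both by collecting the distinct matches is below; the function name is hypothetical and the keywords are assumed to be plain words without regex metacharacters.

```javascript
// Hypothetical helper: true only if every keyword occurs at least once.
function containsAllKeywords(text, keywords) {
  const re = new RegExp("\\b(" + keywords.join("|") + ")\\b", "gi");
  // match() returns null on no match, so fall back to an empty array.
  const found = new Set((text.match(re) || []).map(m => m.toLowerCase()));
  return keywords.every(k => found.has(k.toLowerCase()));
}

const keywords = ["meat", "pasta", "dinner"];
console.log(containsAllKeywords("Meat pasta, more meat, dinner", keywords)); // true
console.log(containsAllKeywords("meat meat meat", keywords));                // false
console.log(containsAllKeywords("salad", keywords));                         // false
```

The second call is the case where counting matches would wrongly report success, since "meat" alone produces three matches.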
Based on the accepted answer I wrote a simple Java method that builds the regex from an array of keywords
public static String regexIfAllKeywordsExists(String[] keywords) {
    StringBuilder sb = new StringBuilder("^");
    for (String keyword : keywords) {
        sb.append("(?=.*\\b");
        sb.append(keyword);
        sb.append("\\b)");
    }
    sb.append(".+");
    return sb.toString();
}

Breaking a String into Chunks based on Pattern

I have one string, that looks like this:
a[abcdefghi,2,3,jklmnopqr]
The beginning "a" is fixed and non-changing, however the content within the brackets is and can follow a pattern. It will always be an alphabetical string, possibly followed by numbers separated by commas or more strings and/or numbers.
I'd like to be able to break it into chunks of the string and any numbers that follow it until the "]" or another string is met.
Probably best explained through examples and expected ideal results:
a[abcdefghi] -> "abcdefghi"
a[abcdefghi,2] -> "abcdefghi,2"
a[abcdefghi,2,3,jklmnopqr] -> "abcdefghi,2,3" and "jklmnopqr"
a[abcdefghi,2,3,jklmnopqr,stuvwxyz] -> "abcdefghi,2,3" and "jklmnopqr" and "stuvwxyz"
a[abcdefghi,2,3,jklmnopqr,1,9,stuvwxyz] -> "abcdefghi,2,3" and "jklmnopqr,1,9" and "stuvwxyz"
a[abcdefghi,1,jklmnopqr,2,stuvwxyz,3,4] -> "abcdefghi,1" and "jklmnopqr,2" and "stuvwxyz,3,4"
Ideally a malformed string would be partially caught (but this is a nice extra):
a[2,3,jklmnopqr,1,9,stuvwxyz] -> "jklmnopqr,1,9" and "stuvwxyz"
I'm using JavaScript and I realize a regex won't bring me all the way to the solution I'd like, but it could be a big help. The alternative is to do a lot of manual string parsing, which I can do but doesn't seem like the best answer.
Advice, tips appreciated.
UPDATE: Yes I did mean alphabetical (A-Za-z) instead of alphanumeric. Edited to reflect that. Thanks for letting me know.
You'd probably want to do this in 2 steps. First, match against:
a\[([^[\]]*)\]
and extract group 1. That'll be the stuff in the square brackets.
Next, repeatedly match against:
[a-z]+(,[0-9]+)*
That'll match things like "abcdefghi,2,3". After the first match you'll need to see if the next character is a comma and if so skip over it. (BTW: if you really meant alphanumeric rather than alphabetic like your examples, use [a-z0-9]*[a-z][a-z0-9]* instead of [a-z]+.)
Alternatively, split the string on commas and reassemble into your word with number groups.
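The two-step approach described above could be sketched in JavaScript roughly as follows. The function name chunks is made up, and the sketch uses match() with the g flag, which advances past the separating commas on its own so no manual skipping is needed.

```javascript
// Hypothetical helper implementing the two steps described above.
function chunks(input) {
  // Step 1: pull out whatever sits between "a[" and "]".
  const outer = /^a\[([^[\]]*)\]$/.exec(input);
  if (!outer) return [];
  // Step 2: each chunk is a word followed by zero or more ",number" parts.
  return outer[1].match(/[a-z]+(?:,[0-9]+)*/gi) || [];
}

console.log(chunks("a[abcdefghi]"));
// [ 'abcdefghi' ]
console.log(chunks("a[abcdefghi,2,3,jklmnopqr,1,9,stuvwxyz]"));
// [ 'abcdefghi,2,3', 'jklmnopqr,1,9', 'stuvwxyz' ]
console.log(chunks("a[2,3,jklmnopqr,1,9,stuvwxyz]"));
// [ 'jklmnopqr,1,9', 'stuvwxyz' ]
```

The last call shows the malformed case from the question: the leading numbers have no word to attach to, so they are simply skipped.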
Why wouldn't a regex bring you all the way to a solution?
The following regex works against the given data, but it makes a few assumptions (at least two alphas followed by comma separated single digits).
([a-z]{2,}(?:,\\d)*)
Example:
re = new RegExp('[a-z]{2,}(?:,\\d)*', 'g')
matches = re.exec("a[abcdefghi,2,3,jklmnopqr,1,9,stuvwxyz]")
Assuming you can easily break out the string between the brackets, something like this might be what you're after:
> re = new RegExp('[a-z]+(?:,\\d)*(?:,?)', 'gi')
> while (match = re.exec("abcdefghi,2,3,jklmnopqr,1,9,stuvwxyz")) { print(match[0]) }
abcdefghi,2,3,
jklmnopqr,1,9,
stuvwxyz
This has the advantage of working partially in your malformed case:
> while (match = re.exec("2,3,jklmnopqr,1,9,stuvwxyz")) { print(match[0]) }
jklmnopqr,1,9,
stuvwxyz
The first character class [a-z] can be modified if you meant for it to be truly alphanumeric.
