Regex error in Netbeans not present in other editors - javascript

I have the following regular expression that works fine in my application code and other code editors have not reported a problem with it. It is used to validate a password.
/^(?=.*[A-Za-z])+(?=.*[\d])+(?=.*[^A-Za-z\d\s])+.*$/
So in other words:
Must have one letter
Must have one digit
Must have one non-letter, non-digit
Now it seems netbeans has a fairly decent regex parser and it has reported that this is an erroneous statement. But as i am new to regex I cannot spot the error. Is it due to using the positive lookahead ?= with the one or more + at the end?
When I take out the + the error goes away, but the regex stops performing in my application.
If anyone can tell me what is wrong with my expression that would be great.
The statement is used in a jQuery validation plugin that i use, if that helps. Also due to the fact I am using a plugin, I would prefer not splitting this into several smaller (clearly simpler and cleaner) expressions. That would require a great deal of work.

It never makes sense to apply a quantifier to a zero-width assertion such as a lookahead. The whole point of such assertions is that they allow you to assert that some condition is true, without consuming any of the text--that is, advancing the current match position. Some regex flavors treat that as a syntax error, while others effectively ignore the quantifier. Getting rid of those plus signs makes your regex correct:
/^(?=.*[A-Za-z])(?=.*\d)(?=.*[^A-Za-z\d\s]).*$/
If it doesn't work as expected, you may be running into the infamous IE lookahead bug. The usual workaround is to reorder things so the first lookahead is anchored at the end, like so:
/^(?=.{8,15}$)(?=.*[A-Za-z])(?=.*\d)(?=.*[^A-Za-z\d\s]).*/
The (?=.{8,15}$) is just an example; I have no idea what your real requirements are. If you do want to impose minimum and maximum length limits, this is the ideal place to do it.

Related

Regex to match certain characters and exclude certain characters but without negative lookahead

I want a regex that matches all emojis (or most of them) but excludes certain characters (such as “|”|‘|’|…|—).
This regex does the job via negative lookahead:
/(?!\u201C|\u201D|\u2018|\u2019|\u2026|\u2014)(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])/
But apparently Google Scripts doesn't support this. Error:
Invalid regular expression pattern
(?!“|”|‘|’|…|—)(©|®|[ -㌀]|?[퀀-?]|?[퀀-?]|?[퀀-?])
Is there another way to achieve my goal (a regex that works with Google Script's findText)?
Option 1
Maybe,
[\u{1f300}-\u{1f5ff}\u{1f900}-\u{1f9ff}\u{1f600}-\u{1f64f}\u{1f680}-\u{1f6ff}\u{2600}-\u{26ff}\u{2700}-\u{27bf}\u{1f1e6}-\u{1f1ff}\u{1f191}-\u{1f251}\u{1f004}\u{1f0cf}\u{1f170}-\u{1f171}\u{1f17e}-\u{1f17f}\u{1f18e}\u{3030}\u{2b50}\u{2b55}\u{2934}-\u{2935}\u{2b05}-\u{2b07}\u{2b1b}-\u{2b1c}\u{3297}\u{3299}\u{303d}\u{00a9}\u{00ae}\u{2122}\u{23f3}\u{24c2}\u{23e9}-\u{23ef}\u{25b6}\u{23f8}-\u{23fa}]
might be working OK for your desired emojis.
Demo
Option 2
Otherwise, you might want to negate those undesired chars using char classes, such as:
[these unicode ranges &&[^these unicodes]]
which would become pretty complicated, yet possible.
Option 3
Using this option you can most likely solve your problem much simpler. I guess, your problem is that those undesired punctuations are already among the desired unicodes. Check to see if that'd be the case. For example, in
[\u100-\u200]
you might have \u150 and \u175 as undesired chars, which you want them to be removed from your desired ranges of unicodes that you already have.
You can then simply remove those from the range, such as with:
[\u100-\u149\u151-\u174\u176-\u200]
and as simple as that the problem would be solved.
Source
javascript unicode emoji regular expressions

Inefficient regular expression? [duplicate]

I recently became aware of Regular expression Denial of Service attacks, and decided to root out so-called 'evil' regex patterns wherever I could find them in my codebase - or at least those that are used on user input. The examples given at the OWASP link above and wikipedia are helpful, but they don't do a great job of explaining the problem in simple terms.
A description of evil regexes, from wikipedia:
the regular expression applies repetition ("+", "*") to a complex subexpression;
for the repeated subexpression, there exists a match which is also a suffix of another valid match.
With examples, again from wikipedia:
(a+)+
([a-zA-Z]+)*
(a|aa)+
(a|a?)+
(.*a){x} for x > 10
Is this a problem that just doesn't have a simpler explanation? I'm looking for something that would make it easier to avoid this problem while writing regexes, or to find them within an existing codebase.
Why Are Evil Regexes A Problem?
Because computers do exactly what you tell them to do, even if it's not what you meant or is totally unreasonable. If you ask a regex engine to prove that, for some given input, there either is or is not a match for a given pattern, then the engine will attempt to do that no matter how many different combinations must be tested.
Here is a simple pattern inspired by the first example in the OP's post:
^((ab)*)+$
Given the input:
abababababababababababab
The regex engine tries something like (abababababababababababab) and a match is found on the first try.
But then we throw the monkey wrench in:
abababababababababababab a
The engine will first try (abababababababababababab) but that fails because of that extra a. This causes catastrophic backtracking, because our pattern (ab)*, in a show of good faith, will release one of its captures (it will "backtrack") and let the outer pattern try again. For our regex engine, that looks something like this:
(abababababababababababab) - Nope
(ababababababababababab)(ab) - Nope
(abababababababababab)(abab) - Nope
(abababababababababab)(ab)(ab) - Nope
(ababababababababab)(ababab) - Nope
(ababababababababab)(abab)(ab) - Nope
(ababababababababab)(ab)(abab) - Nope
(ababababababababab)(ab)(ab)(ab) - Nope
(abababababababab)(abababab) - Nope
(abababababababab)(ababab)(ab) - Nope
(abababababababab)(abab)(abab) - Nope
(abababababababab)(abab)(ab)(ab) - Nope
(abababababababab)(ab)(ababab) - Nope
(abababababababab)(ab)(abab)(ab) - Nope
(abababababababab)(ab)(ab)(abab) - Nope
(abababababababab)(ab)(ab)(ab)(ab) - Nope
(ababababababab)(ababababab) - Nope
(ababababababab)(abababab)(ab) - Nope
(ababababababab)(ababab)(abab) - Nope
(ababababababab)(ababab)(ab)(ab) - Nope
(ababababababab)(abab)(abab)(ab) - Nope
(ababababababab)(abab)(ab)(abab) - Nope
(ababababababab)(abab)(ab)(ab)(ab) - Nope
(ababababababab)(ab)(abababab) - Nope
(ababababababab)(ab)(ababab)(ab) - Nope
(ababababababab)(ab)(abab)(abab) - Nope
(ababababababab)(ab)(abab)(ab)(ab) - Nope
(ababababababab)(ab)(ab)(ababab) - Nope
(ababababababab)(ab)(ab)(abab)(ab) - Nope
(ababababababab)(ab)(ab)(ab)(abab) - Nope
(ababababababab)(ab)(ab)(ab)(ab)(ab) - Nope
                              ...
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abababab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ababab)(ab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(abab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(ab)(ab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ababab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(ab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab) - Nope
The number of possible combinations scales exponentially with the length of the input and, before you know it, the regex engine is eating up all your system resources trying to solve this thing until, having exhausted every possible combination of terms, it finally gives up and reports "There is no match." Meanwhile your server has turned into a burning pile of molten metal.
How to Spot Evil Regexes
It's actually very tricky. Catastrophic backtracking in modern regex engines is similar in nature to the halting problem which Alan Turing proved was impossible to solve. I have written problematic regexes myself, even though I know what they are and generally how to avoid them. Wrapping everything you can in an atomic group can help to prevent the backtracking issue. It basically tells the regex engine not to revisit a given expression - "lock whatever you matched on the first try". Note, however, that atomic expressions don't prevent backtracking within the expression, so ^(?>((ab)*)+)$ is still dangerous, but ^(?>(ab)*)+$ is safe (it'll match (abababababababababababab) and then refuse to give up any of it's matched characters, thus preventing catastrophic backtracking).
Unfortunately, once it's written, it's actually very hard to immediately or quickly find a problem regex. In the end, recognizing a bad regex is like recognizing any other bad code - it takes a lot of time and experience and/or a single catastrophic event.
Interestingly, since this answer was first written, a team at the University of Texas at Austin published a paper describing the development of a tool capable of performing static analysis of regular expressions with the express purpose of finding these "evil" patterns. The tool was developed to analyse Java programs, but I suspect that in the coming years we'll see more tools developed around analysing and detecting problematic patterns in JavaScript and other languages, especially as the rate of ReDoS attacks continues to climb.
Static Detection of DoS Vulnerabilities in
Programs that use Regular Expressions
Valentin Wüstholz, Oswaldo Olivo, Marijn J. H. Heule, and Isil Dillig
The University of Texas at Austin
Detecting evil regexes
Try Nicolaas Weideman's RegexStaticAnalysis project.
Try my ensemble-style vuln-regex-detector which has a CLI for Weideman's tool and others.
Rules of thumb
Evil regexes are always due to ambiguity in the corresponding NFA, which you can visualize with tools like regexper.
Here are some forms of ambiguity. Don't use these in your regexes.
Nesting quantifiers like (a+)+ (aka "star height > 1"). This can cause exponential blow-up. See substack's safe-regex tool.
Quantified Overlapping Disjunctions like (a|a)+. This can cause exponential blow-up.
Avoid Quantified Overlapping Adjacencies like \d+\d+. This can cause polynomial blow-up.
Additional resources
I wrote this paper on super-linear regexes. It includes loads of references to other regex-related research.
What you call an "evil" regex is a regex that exhibits catastrophic backtracking. The linked page (which I wrote) explains the concept in detail. Basically, catastrophic backtracking happens when a regex fails to match and different permutations of the same regex can find a partial match. The regex engine then tries all those permutations. If you want to go over your code and inspect your regexes these are the 3 key issues to look at:
Alternatives must be mutually exclusive. If multiple alternatives can match the same text then the engine will try both if the remainder of the regex fails. If the alternatives are in a group that is repeated, you have catastrophic backtracking. A classic example is (.|\s)* to match any amount of any text when the regex flavor does not have a "dot matches line breaks" mode. If this is part of a longer regex then a subject string with a sufficiently long run of spaces (matched by both . and \s) will break the regex. The fix is to use (.|\n)* to make the alternatives mutually exclusive or even better to be more specific about which characters are really allowed, such as [\r\n\t\x20-\x7E] for ASCII printables, tabs, and line breaks.
Quantified tokens that are in sequence must either be mutually exclusive with each other or be mutually exclusive what comes between them. Otherwise both can match the same text and all combinations of the two quantifiers will be tried when the remainder of the regex fails to match. A classic example is a.*?b.*?c to match 3 things with "anything" between them. When c can't be matched the first .*? will expand character by character until the end of the line or file. For each expansion the second .*? will expand character by character to match the remainder of the line or file. The fix is to realize that you can't have "anything" between them. The first run needs to stop at b and the second run needs to stop at c. With single characters a[^b]*+b[^c]*+c is an easy solution. Since we now stop at the delimiter, we can use possessive quantifiers to further increase performance.
A group that contains a token with a quantifier must not have a quantifier of its own unless the quantified token inside the group can only be matched with something else that is mutually exclusive with it. That ensures that there is no way that fewer iterations of the outer quantifier with more iterations of the inner quantifier can match the same text as more iterations of the outer quantifier with fewer iterations of the inner quantifier. This is the problem illustrated in JDB's answer.
While I was writing my answer I decided that this merited a full article on my website. This is now online too.
I would sum it up as "A repetition of a repetition". The first example you listed is a good one, as it states "the letter a, one or more times in a row. This can again happen one or more times in a row".
What to look for in this case is combination of the quantifiers, such as * and +.
A somewhat more subtle thing to look out for is the third and fourth one. Those examples contain an OR operation, in which both sides can be true. This combined with a quantifier of the expression can result in a LOT of potential matches depending on the input string.
To sum it up, TLDR-style:
Be careful how quantifiers are used in combination with other operators.
I have surprisingly come across ReDOS quite a few times performing source code reviews. One thing I would recommend is to use a timeout with whatever Regular Expression engine that you are using.
For example, in C# I can create the regular expression with a TimeSpan attribute.
string pattern = #"^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$";
Regex regexTags = new Regex(pattern, RegexOptions.None, TimeSpan.FromSeconds(1.0));
try
{
string noTags = regexTags.Replace(description, "");
System.Console.WriteLine(noTags);
}
catch (RegexMatchTimeoutException ex)
{
System.Console.WriteLine("RegEx match timeout");
}
This regex is vulnerable to denial of service and without the timeout will spin and eat resources. With the timeout, it will throw a RegexMatchTimeoutException after the given timeout and will not cause the resource usage leading to a Denial of Service condition.
You will want to experiment with the timeout value to make sure it works for your usage.
I would say this is related to the regex engine in use. You may not always be able to avoid these types of regexes, but if your regex engine is built right, then it is less of a problem. See this blog series for a great deal of information on the topic of regex engines.
Note the caveat at the bottom of the article, in that backtracking is an NP-Complete problem. There currently is no way to efficiently process them, and you might want to disallow them in your input.
I don't think you can recognize such regexes, at least not all of them or not without restrictively limiting their expressiveness. If you'd really care about ReDoSs, I'd try to sandbox them and kill their processing with a timeout. It also might be possible that there are RegEx implementations that let you limit their max backtracking amount.
There are some ways I can think of that you could implement some simplification rules by running them on small test inputs or analyzing the regex's structure.
(a+)+ can be reduced using some sort of rule for replacing redundant operators to just (a+)
([a-zA-Z]+)* could also be simplified with our new redundancy combining rule to ([a-zA-Z]*)
The computer could run tests by running the small subexpressions of the regex against randomly-generated sequences of the relevant characters or sequences of characters, and seeing what groups they all end up in. For the first one, the computer is like, hey the regex wants a's, so lets try it with 6aaaxaaq. It then sees that all the a's, and only the first groupm end up in one group, and concludes that no matter how many a's is puts, it won't matter, since + gets all in the group. The second one, is like, hey, the regex wants a bunch of letters, so lets try it with -fg0uj=, and then it sees that again each bunch is all in one group, so it gets rid of the + at the end.
Now we need a new rule to handle the next ones: The eliminate-irrelevant-options rule.
With (a|aa)+, the computer takes a look at it and is like, we like that big second one, but we can use that first one to fill in more gaps, lets get ans many aa's as we can, and see if we can get anything else after we're done. It could run it against another test string, like `eaaa#a~aa.' to determine that.
You can protect yourself from (a|a?)+ by having the computer realize that the strings matched by a? are not the droids we are looking for, because since it can always match anywhere, we decide that we don't like things like (a?)+, and throw it out.
We protect from (.*a){x} by getting it to realize that the characters matched by a would have already been grabbed by .*. We then throw out that part and use another rule to replace the redundant quantifiers in (.*){x}.
While implementing a system like this would be very complicated, this is a complicated problem, and a complicated solution may be necessary. You should also use techniques other people have brought up, like only allowing the regex some limited amount of execution resources before killing it if it doesn't finish.

Regex - Validate that the local part of the email is not ending with a dot while only allowing certain characters without using a lookbehind

I was using a lookbehind to check for a dot before the # but just realized not all browsers are supporting lookbehinds. It works perfect in Chrome but fails in Firefox and IE.
This is what I came up with but it certainly is messy
^([a-zA-Z0-9&^*%#~{}=+?`_-]\.?)*[a-zA-Z0-9&^*%#~{}=+?`_-]#([a-zA-Z0-9]+\.)+[a-zA-Z]$
Is there a simpler and/or more elegant way to do this? I don't think I can negate the dot (^.) because I'm only allowing certain characters to be present in the local part.
This ([a-zA-Z0-9&^*%#~{}=+?`_-].?)*[a-zA-Z0-9&^*%#~{}=+?`_-] part is not messy, but inefficient, because the * quantifies a group containing an obligatory part, [...], and an optional \.?. Instead of (ab?)*a, you may use a+(?:ba+)* that will make matching linear and swift, in your case, [a-zA-Z0-9&^*%#~{}=+?`_-]+(?:.[a-zA-Z0-9&^*%#~{}=+?`_-]+)*.
More, [a-zA-Z0-9_] equals \w in JS regex, you may use this to shorten the pattern.
Besides, the last [a-zA-Z]$ pattern only matches a single letter, you most probably need [a-zA-Z]{2}$ there, as TLDs consist of 2+ letters.
So, you may use
^[\w&^*%#~{}=+?`-]+(?:\.[\w&^*%#~{}=+?`-]+)*#(?:[a-zA-Z0-9]+\.)+[a-zA-Z]{2,}$
See the regex demo.

Regex - PHP to Javascript

I have the following regex:
<div\s*class="selectionDescription">(.*?)<\/div>
Which works with PHP perfectly. I am aware that javascript does not support the \s flag.
I have tried using the \g flag, however my pattern is not matched.
I am looking to match everything inside the div in the following string:
<div class="selectionDescription">Text to match</div>
I receive the following error in javascript:
Uncaught SyntaxError: Invalid flags supplied to RegExp constructor 's'(…)
Your pattern seems to work.
s is not a flag, so if you are trying something like new RegExp('<div\s*class="selectionDescription">(.*?)<\/div>', 's') then yes, you would find an error.
You do not need to add any flags, except perhaps the g flag to capture this div many times. (Check it out)
Maybe check out a quick primer on Javascript's regular expressions?
If you are spanning multiple lines, and you mean the single line mode that s provides, you can emulate that with [\S\s] or some other similar "all inclusive, all exclusive" style: [\d\D], [\W\w], etc.
That will allow it to span multiple lines and still match:
<div\s*class="selectionDescription">([\S\s]*?)<\/div>
You need to be wary of using lazy *? quantifiers, however. Take a look at https://regex101.com/r/xD2jV8/1 where the number of steps is 220.
If the content between <div> and </div> tags is very large, this becomes very computationally expensive, very fast.
While slightly less readable,
<div\s*class="selectionDescription">((?:[^<]+|<(?!\/div>))*)<\/div>
would do the same but within only 69 steps.
And at that point, https://regex101.com/r/xD2jV8/3 slightly optimizes it further, but HTML really isn't the best way to handle things with HTML. jQuery could perform this quickly and much "safer": $('div.selectionDescription').html()
Of course, you may not have access to this at this point, but HTML is usually not the best thing to use for parsing HTML.

Do browsers support different HTML5 pattern regexp features?

I had a simple RegEx pattern in a customer-facing payment form on our website:
<input type="text" pattern="(|\$)[0-9]*(|\.[0-9]{2})"
title="Please enter a valid number in the amount field" required>
It was added to help quickly notify customers when they fail to enter a valid number, before hitting the server-side validation.
After four customers called in complaining that they were unable to submit the form because their browser continually told them the amount they had entered was incorrect, I did some digging and discovered that IE10+ doesn't like the back of that expression--any amount entered that did not include a decimal point was accepted, anything with a decimal was rejected. The pattern works in my development environment (Chrome 30+) and in Opera 12, but Firefox 27 won't validate it at all.
I read the specs, which just says:
If specified, the attribute's value must match the JavaScript Pattern production. [ECMA262]
And since the only browsers that support pattern are capable of supporting ECMAScript 5, I figure this includes the full support of all Javascript regular expressions.
Where can I learn more about the quirks between pattern support in the different browsers?
The problem seems to an IE-only bug. Your link to the spec is pretty dead on, heres the bit IE is missing:
... except that the pattern attribute is matched against the entire value, not just any subset (somewhat as if it implied a ^(?: at the start of the pattern and a )$ at the end)
You can actually fix this bug by doing just that to your own pattern - namely:
^(?:(|\$)[0-9]*(|\.[0-9]{2}))$
This is working for me in IE9 and IE10, as well as Chrome. See updated fiddle
The technical reason this happens is a bit more complex:
If you read the EMCA 5.1 spec, in section 15.10.2.3, it talks about how alternations should be evaluated. Basically, each 'part' of the | is evaluated left to right, until one is found that matches. That value is assumed unless there is a problem in the 'sequel', in which case the other possibilities in the alternation are evaluated.
What it seems IE is doing is matching the beginning of your string using the empty parts of your alternations, and it works: \$[digits][empty] matches the start of $12.12 up to the decimal point. IE's regex engine (correctly) says that this is a match, because a substring matched, and it's not been told to check to the end of the string.
Once the regex engine (without the anchors to force the whole string to match) returns true, that there was a match, some engineer at Microsoft took a shortcut and told the pattern attribute to also check that the matched part equals the whole string, and there's where the failure comes from. The engine only matched part of the string, even though it could have matched more, so the secondary check fails, thinking there is extraneous input at the end.
This case is subtle, so I'm not too surprised it hasn't been caught before. I have created a bug report https://connect.microsoft.com/IE/feedback/details/836117/regex-bug-in-pattern-validator to see if there is a response from Microsoft.
The reason this relates to the EMCA spec is that if the engine was told to match the whole string, it would have backtracked when it hit the decimal and tried to match the 2nd part of the alternation, found and matched (\.[0-9{2}), and the whole thing would have worked.
Now, for some workarounds:
Add the anchors ^(?: and )$ to your patterns
Don't use empty alternations. Personally, I like using the optional $ instead for these cases. Your pattern becomes (\$?)[0-9]*(\.[0-9]{2})? and will work because ? is a greedy match, and the engine will consume the whole string if possible, rather than alternation, which is first match
Swap the order on your alternations. If the longer string is tested first, it will match first, and be used first. This has come up in other languages - Why order matters in this RegEx with alternation?
PS: Be careful with the * for your digits. Right now, "$" is a valid match because * allows for 0 digits. My recommendation for your full regex would be (\$)?(\d+)(\.\d{2})?

Categories

Resources