Why does regex take too long to evaluate for certain values? - javascript

Below is my regex:
(https?:\/\/)([a-zA-Z]{2,6}\.)*((?!.*[|!{}[\]^"*;]).)+(\.*)([a-zA-Z0-9\.\-\/\:\?&=_%#]+)+([&|?])+$
It is to validate a URL with a negative look-ahead to allow characters from other languages.
This is what happens when I test it at http://regex101.com/#javascript:
For -
http://server.com/path?id=1111111 - NO MATCH
http://server.com/path?id=11111111 - TIMEOUT Your expression took too long to evaluate.
http://server.com/path?id=111111111111111111111& - MATCH
Observations:
When the value of the query parameter grows beyond a certain length, it times out.
But for a matching URL, the length of the parameter value doesn't matter.
Why does it time out beyond a certain length? Which part of the regex do I need to modify?
Note: the regex requires the URL to end with ? or &.
Thanks in advance.
EDIT:
What I need is a regex to validate all standard URLs (e.g. www.xyz.com or someip:port followed by path parameters and/or query parameters, etc.). It should support characters from other languages as well, with an additional validation that requires the URL to end with ? or &.

The (…+)+ in ([a-zA-Z0-9\.\-\/\:\?&=_%#]+)+ leads to catastrophic backtracking. Removing one of the pluses should help.
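To see the effect on a small scale, here is a minimal sketch that uses a simplified stand-in for the tail of the pattern. The reduced character class and the names `nested` and `flat` are assumptions made for illustration; this is not the full URL regex. Both variants accept and reject the same inputs, and only the amount of backtracking on a failed match differs. The inputs are kept short on purpose so that the nested version still finishes.

```javascript
// Simplified stand-ins (hypothetical), not the full URL regex from the question.
const nested = /^([a-z0-9=?]+)+&$/; // (x+)+ : on failure, every way of splitting the run is retried
const flat = /^[a-z0-9=?]+&$/;      // one plus removed: the run can only be matched one way

const samples = ["id=1111111&", "id=1111111", "id=11111111111"];

// Both patterns agree on every sample; what changes is how much work a failure costs.
const results = samples.map((s) => ({
  input: s,
  nested: nested.test(s),
  flat: flat.test(s),
}));

console.log(results);
```

In the full pattern, the groups in front of the character class can match many of the same characters, so some backtracking may remain even after this change.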

This was the best I could come up with:
\b([\d\w\.\/\+\-\?\:]*)((ht|f)tp(s|)\:\/\/|[\d\d\d|\d\d]\.[\d\d\d|\d\d]\.|www\.|\.tv|\.ac|\.com|\.edu|\.gov|\.int|\.mil|\.net|\.org|\.biz|\.info|\.name|\.pro|\.museum|\.co)([\d\w\.\/\%\+\-\=\&\?\:\\\"\'\,\|\~\;]*)\b
JSFiddle: (I used someone else's demo to test it :)
http://jsfiddle.net/3AE9p/
Of course this is not complete, but it is pretty close to what you would want and expect!

Related

Regex issue (long string)

I use the RegExp /^([0-9]*\.?[0-9])*$/ to test a string.
My first string is 1.2.840.346991791506342.1482500253171661 (long string) and the second is 1.2.3.201922311129.10038 (short string).
Both strings match successfully.
When I add a space at the end of the short string, it is reported as invalid, which is the right conclusion.
But when I add a space at the end of the long string, it should also be reported as invalid, yet the test hangs. Why does it hang?
Is some RegExp limit exhausted? What is the solution?
To reproduce this, you can use ^([0-9]*\.?[0-9])*$ directly in Notepad++.
The way you have written your regex, the nested quantifier leads to catastrophic backtracking, which causes the hang/timeout.
Catastrophic Backtracking Demo
You need to simplify your regex to something like this,
^[0-9]*(?:\.[0-9]+)*$
Let me know if this regex preserves your pattern.
Regex Demo not running into timeout
In general, you should avoid over-nesting quantifiers in your regex and write them as simply as you can. Even for a short string like 1.2.840.3469931313.313, see how many steps your regex takes,
135228 steps taken
and if you increase the string length a little, it runs into a timeout/catastrophic backtracking.
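As a rough check that the rewritten pattern preserves the original behaviour, the sketch below compares both on a few short inputs and then runs only the simplified pattern on the long string with a trailing space. The variable names and the choice of sample inputs are assumptions for illustration.

```javascript
const original = /^([0-9]*\.?[0-9])*$/;     // nested quantifier from the question
const simplified = /^[0-9]*(?:\.[0-9]+)*$/; // rewritten pattern from the answer

// Short inputs only for the original, so the comparison stays cheap.
const shortInputs = ["1.2.3", "1.2.3 ", ".5", "1..2", "1.", ""];
const agree = shortInputs.every((s) => original.test(s) === simplified.test(s));

// The long string from the question with a trailing space.
// Only the simplified pattern is run on it, to avoid triggering the hang.
const longInput = "1.2.840.346991791506342.1482500253171661 ";
const longResult = simplified.test(longInput);

console.log({ agree, longResult });
```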

Is regex a costly operation? It seems so, at least

I was writing a regex pattern for a string with a constant structure. It looks like this (the format itself is not so important; see the example further down):
[Data Code Format]: POP N/N/N (N: object): JSON object data
Here N represents a number or digit, and what's inside [ ] is a block of strings. The format is constant.
So I wrote a regex:
\s*((?:\S+\s+){1}\S+)\s*(?:\((?:.*?)\))?:\s*(\S*)\s*(\w+)
with this example string in mind:
%DATA-4-JSON: POP 0/1/0 (1: object): JSON object data
It works perfectly: regex101.com reports a successful match. But it takes 330 steps to get there.
My question is: it takes 330 steps to achieve this, which feels pretty heavy to me. Couldn't the same be achieved in fewer steps with if-else and other comparisons?
I mean, is regex string parsing really that heavy? 330 steps per string, for the tens of thousands of strings I need to parse, is going to be heavy, right?
Regexps can be costly when they have to backtrack a lot. When consecutive quantified subpatterns can match the same text (and especially when the patterns between them can match an empty string, which usually happens with the *, {0,x} or ? quantifiers), backtracking might play a bad trick on you.
What is bad about your regex?
A slight performance issue is caused by an unnecessary non-capturing group (?:\S+\s+){1}. It can be written as \S+\s+ and it will already decrease the number of steps (a bit, 302 steps). However, the worst part is that \S matches the : delimiter. Thus, the regex engine has to try a lot of possible routes to match the beginning of your expected match. Replace that \S+\s+\S+ with [^:\s]+\s+[^:\s]+, and the step amount will decrease to 159!
Now, coming to (?:\((?:.*?)\))? - removing the unnecessary inner non-capturing group gives another slight improvement (148 steps), and replacing .*? with a negated character class gives another boost (139 steps).
I'd use this for now:
\s*([^:\s]+\s+[^:\s]+)\s*(?:\([^()]*\))?:\s*(\S*)\s*(\w+)
If the initial \s* was obligatory, it would be another enhancement, but you would have different matches.
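A quick way to confirm that the optimized pattern still captures the same groups is to run both against the example string. This is a minimal sketch; the variable names are assumptions, and step counts are not measured here (regex101.com reports those).

```javascript
const original = /\s*((?:\S+\s+){1}\S+)\s*(?:\((?:.*?)\))?:\s*(\S*)\s*(\w+)/;
const optimized = /\s*([^:\s]+\s+[^:\s]+)\s*(?:\([^()]*\))?:\s*(\S*)\s*(\w+)/;

const line = "%DATA-4-JSON: POP 0/1/0 (1: object): JSON object data";

// slice(1) drops the full match and keeps only the three capture groups.
const before = line.match(original).slice(1);
const after = line.match(optimized).slice(1);

console.log(before, after);
```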

Regular expression anything but letters (javascript)

I want to validate a form field by checking if the input contains any letters. All other characters and numbers should be allowed. I'm quite bad at regular expressions, and I can't find a correct solution anywhere.
I've tried this:
/[^A-Za-z]/g
but this only returns false if the string consists of only letters (i.e. 432ad32d should return false as well).
Could anyone tell me how to do this?
Using a whitelist of allowed characters is the best approach in your case:
/^[-+\d(), ]+$/
Unicode has many things it calls a letter, so it is better not to mess with that in the first place. JavaScript regexes have also not been well suited to handling these: \p{L}, for instance, is only available since ES2018 and requires the u flag, so older engines need an external library.
Also, by using the whitelist approach you can be sure about the kinds of inputs which will be accepted by your form. You can't predict the kind of mess users could input otherwise. Think about things like this:
TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ
:-)
/[^A-Za-z]/
This regex matches a single non-letter, which isn't very useful. Yura Yakym's answer matches the beginning of the string, any number of non-letters, and then the end of the string, which is useful when it matches: it means your string contains only those things.
Another useful regex is:
/[A-Za-z]/
This matches a single letter, which is useful when it doesn't match: it means your string does not contain any letters at all.
For your question in general, "how can I ensure a string lacks letters?", I would use that second regex: I would try to match a letter, and hopefully fail to do so. For input validation though, I'd prefer a regex that describes all possible valid inputs. If /^[^A-Za-z]*$/ does so, then use that. If you have additional requirements, add those to it. Don't have multiple "no letters? OK. no non-dash special characters? OK." ... well, unless you want to provide error messages precisely about such things.
Try this regular expression: ^[^A-Za-z]*$
You need to include anchors
/^[^A-Za-z]+$/g
This will ensure the whole string consists of one or more non-letter characters (numbers/special characters).
You forgot about start and end markers. Also you don't need g flag.
/^[^A-Za-z]*$/
Anyway, that's strange, as I can still enter Cyrillic letters.
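The approaches from these answers can be put side by side in a short sketch. The function names are hypothetical, and the last helper assumes an engine with ES2018 Unicode property escapes (the u flag); it is included only to show why the ASCII-only patterns still let Cyrillic letters through.

```javascript
// True if the string contains at least one ASCII letter.
const containsAsciiLetter = (s) => /[A-Za-z]/.test(s);

// True if the whole string is free of ASCII letters (anchored, no g flag).
const hasNoAsciiLetters = (s) => /^[^A-Za-z]*$/.test(s);

// Assumption: ES2018 Unicode property escapes are available.
// True if the string contains no character Unicode classifies as a letter.
const hasNoLetters = (s) => !/\p{L}/u.test(s);

for (const input of ["432ad32d", "+1 (555) 123-4567", "привет"]) {
  console.log(input, {
    containsAsciiLetter: containsAsciiLetter(input),
    hasNoAsciiLetters: hasNoAsciiLetters(input),
    hasNoLetters: hasNoLetters(input),
  });
}
```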

regex pattern for URL in javascript

I'm using the following regex pattern for URL validation.
/[-a-zA-Z0-9#:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9#:%_\+.~#?&//=]*)?/gi;
But I need to exclude .com,
i.e. http://google/ should work.
What change needs to be done for this?
You had better use this lengthy expression from the jquery.validate.js extension. It is well tested and supports multilingual URLs. Don't be afraid of the Unicode and hexadecimal ranges inside the expression; they are only there to support multilingual URLs. See this reference (Unicode Characters) to understand what the Unicode ranges mean.
/^(https?|ftp):\/\/(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*#)?(((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?)(:\d*)?)(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)*)*)?)?(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|\/|\?)*)?$/i
Your expression above has several flaws. For example, the last part, \b(\/[-a-zA-Z0-9#:%_\+.~#?&//=]*)?, can match the whole URL by itself, independently of the preceding part.
Assuming you want to match everything, including URLs without the .com in them:
/[-a-zA-Z0-9#:%_\+.~#?&//=]{2,256}
(?:\.[a-z]{2,4})? // (?:) non-capturing group: this is where the .com is matched
// ? quantifier: 0 or 1 times
\b(\/[-a-zA-Z0-9#:%_\+.~#?&//=]*)?/gi
JSFIDDLE
Simply take this section: \.[a-z]{2,4} and replace it with (\.[a-z]{2,4})?.
The full regex:
[-a-zA-Z0-9#:%_\+.~#?&//=]{2,256}(\.[a-z]{2,4})?\b(\/[-a-zA-Z0-9#:%_\+.~#?&//=]*)?
And a demo.
Effectively what we're doing here is making the .xxxx optional, by wrapping it in () and using the ? quantifier so that the group matches zero or one time.
This will match both:
http://www.google.com/
and
http://localhost/
Caveat: this isn't the most efficient expression to accomplish what you want, but it is simply the smallest required adjustment needed to accomplish what you want.
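The difference between the two patterns can be checked with a small sketch. The names `cls`, `tldRequired`, and `tldOptional` are hypothetical, the character class is copied from the question as shown, and the g flag is left out on purpose because `test()` on a global regex keeps state in `lastIndex` between calls.

```javascript
// Character class shared by both patterns (built as a string, so "/" needs no escaping).
const cls = String.raw`[-a-zA-Z0-9#:%_\+.~#?&/=]`;

// Pattern from the question: the ".xxxx" part is required.
const tldRequired = new RegExp(String.raw`${cls}{2,256}\.[a-z]{2,4}\b(/${cls}*)?`, "i");

// Adjusted pattern: the ".xxxx" part is wrapped in an optional group.
const tldOptional = new RegExp(String.raw`${cls}{2,256}(\.[a-z]{2,4})?\b(/${cls}*)?`, "i");

for (const url of ["http://www.google.com/", "http://localhost/", "http://google/"]) {
  console.log(url, { required: tldRequired.test(url), optional: tldOptional.test(url) });
}
```

Neither pattern is anchored, so both only check that a URL-like substring exists somewhere in the input.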

JS regex - convert "any" plain text hostname/url/ip to a link

I have been looking for a JS regexp that converts plain text URLs or hostnames to clickable links, but none of the scripts I found meets my requirements. Unfortunately, I suck at regex and am unable to modify the expression to work the way I want.
The plain text I wish to convert to links are:
Anything starting with http(s):, ftp(s):, mailto: or file:
domain.tld[:port][path][file][querystring]
any.sub.domain.tld[:port][path][file][querystring]
0/255.0/255.0/255.0/255[:port][path][file][querystring]
localhost[:port][path][file][querystring]
[*] = optional.
Any help is highly appreciated!
If you can live with false positives, such as something.notavalidtld or 999.999.999.999 getting matched, what you are looking for is probably something like this. (Otherwise, it gets more messy.)
Start matching at the beginning of the string.
^(
Match anything starting with http/https/ftp/...
((https?|ftps?|mailto|file):.*?)
OR match all of the below.
|
Optionally match http/https/ftp/... followed by : and at least one /.
((https?|ftps?|mailto|file):/+)?
Match an IP address...
(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
...or a domain (with optional username/password, which also matches email addresses)...
|([\w\d.:_%+-]+@)?([\w\d-]+\.)+[\w\d]{2,}
... or localhost.
|localhost)
Optionally followed by a port number.
(:\d+)?
Optionally followed by any path/query string.
(/.*)?
Ensuring the string ends here.
)$
All the above parts should be joined together without any whitespace in between.
I haven't tested it extensively, so I might have missed something. But at least you have a starting point.
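Joined together, the parts above could look like the sketch below. The names `parts`, `linkPattern`, and `isLinkLike` are hypothetical, and the userinfo separator is assumed to be `@`, as the description of that part (username/password, email addresses) implies. It only tests whether a single token looks like a link; wrapping the token in an anchor element is left out.

```javascript
// Each entry corresponds to one commented part of the answer, in order.
const parts = [
  String.raw`^(`,
  String.raw`((https?|ftps?|mailto|file):.*?)`,   // anything starting with a known scheme
  String.raw`|`,
  String.raw`((https?|ftps?|mailto|file):/+)?`,   // optional scheme followed by slashes
  String.raw`(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}`, // IP address (no range check)
  String.raw`|([\w\d.:_%+-]+@)?([\w\d-]+\.)+[\w\d]{2,}`, // domain, optional userinfo (assumed "@")
  String.raw`|localhost)`,
  String.raw`(:\d+)?`,                            // optional port
  String.raw`(/.*)?`,                             // optional path/query string
  String.raw`)$`,
];

const linkPattern = new RegExp(parts.join(""));

// True if the whole token looks like a URL, hostname, IP, or email address.
const isLinkLike = (token) => linkPattern.test(token);

for (const token of ["http://example.com", "domain.tld:8080/path?x=1", "localhost:3000", "not a link"]) {
  console.log(token, isLinkLike(token));
}
```

As the answer warns, false positives such as 999.999.999.999 are accepted by this pattern.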
