Regex pattern for URL in JavaScript

I'm using the following regex pattern for URL validation:
/[-a-zA-Z0-9#:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9#:%_\+.~#?&//=]*)?/gi;
But I need the top-level domain (such as .com) to be optional, i.e. http://google/ should also be accepted.
What needs to change for this?

You'd be better off using this lengthy expression from the jquery.validate.js extension. It is well tested and supports multilingual URLs. Don't be afraid of the Unicode and hexadecimal ranges inside the expression; they are only there to support multilingual URLs. Refer to a Unicode character reference (Unicode Characters) to understand what the Unicode ranges below mean.
/^(https?|ftp):\/\/(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*#)?(((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?)(:\d*)?)(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)*)*)?)?(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|\/|\?)*)?$/i
Your expression above has several flaws. For example, its last part, \b(\/[-a-zA-Z0-9#:%_\+.~#?&//=]*)?, can match a whole URL by itself, regardless of what the preceding part of the expression matched.

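Assuming you want to match everything, including URLs without the .com in them (the pattern is split across lines and annotated here for explanation only; JavaScript regex literals don't allow line breaks or comments, so join the pieces into one line to use it):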
/[-a-zA-Z0-9#:%_\+.~#?&//=]{2,256}
(?:\.[a-z]{2,4})? // (?:...) is a non-capturing group; this is where the .com is matched
// the trailing ? quantifier makes the group match 0 or 1 times
\b(\/[-a-zA-Z0-9#:%_\+.~#?&//=]*)?/gi

Simply take this section: \.[a-z]{2,4} and replace it with (\.[a-z]{2,4})?.
The full regex:
[-a-zA-Z0-9#:%_\+.~#?&//=]{2,256}(\.[a-z]{2,4})?\b(\/[-a-zA-Z0-9#:%_\+.~#?&//=]*)?
Effectively, what we're doing here is making the .xxxx part optional by wrapping it in () and following it with the ? quantifier, which means "zero or one occurrence".
This will match both:
http://www.google.com/
and
http://localhost/
Caveat: this isn't the most efficient expression to accomplish what you want, but it is simply the smallest required adjustment needed to accomplish what you want.
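As a rough illustration of the adjusted pattern, here is a minimal sketch that runs it against both example URLs. The helper name firstUrlMatch is hypothetical, the slashes inside the character class are escaped for safety, and the g flag is dropped because a global regex keeps state between calls.

```javascript
// Pattern from the answer above, with the ".xxxx" part made optional.
// The "g" flag is omitted on purpose: a global regex remembers lastIndex between calls.
const urlPattern = /[-a-zA-Z0-9#:%_\+.~#?&\/\/=]{2,256}(\.[a-z]{2,4})?\b(\/[-a-zA-Z0-9#:%_\+.~#?&\/\/=]*)?/i;

// Hypothetical helper: returns the matched text, or null when nothing matches.
function firstUrlMatch(text) {
  const match = urlPattern.exec(text);
  return match === null ? null : match[0];
}

console.log(firstUrlMatch("http://www.google.com/")); // prints http://www.google.com/
console.log(firstUrlMatch("http://localhost/"));      // prints http://localhost/
```

Note that the pattern is not anchored with ^ and $, so it finds a URL-like substring rather than validating the whole input.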

Related

How to replace my current regular expression without using negative lookbehind

I have the following regular expression, which matches all double quotes except those that are escaped:
((?<![\\])")
How could I alter this to no longer use the negative lookbehind as it is not supported on some browsers?
Any help is greatly appreciated, thanks!
I haven't been able to get anything working so far.
You can match
/\\"|(")/
and keep only the captured matches. Being so simple, it should work with almost every regex engine.
This matches what you don't want (\\")--to be discarded--and captures what you do want (")--to be kept.
This technique has been referred to by one regex expert as The Greatest Regex Trick Ever. To get to the punch line at the link search for "(at last!)".
Neither of these may be a completely satisfactory solution.
This regex doesn't match just the unescaped "; the match may also include the preceding character, so additional logic is required to check whether the first character of the match is " and to adjust the match position otherwise:
(?:^|[^\\])(")
This may be a better choice, but it depends on positive lookahead - which may have the same issue as negative lookbehind.
Version 1a (again requires additional logic)
(?:^|\b)(?=[^\\])(")
Version 2a (depends on positive lookahead)
(?:^|\b|\\\\)(?=[^\\])(")
Assuming you also need to handle escaped backslashes followed by quotes (not in the question, but OK):
Version 1a (requires the additional logic):
(?:^|[^\\]|\\\\)(")
Building on this answer, I'd like to add that you may also want to ignore escaped backslashes, and match the closing quote in this string:
"ab\\"
In that case, /\\[\\"]|(")/g is what you're after.
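The match-and-discard technique from these answers could be sketched as follows. The helper name unescapedQuoteIndices is hypothetical; the pattern is the one given above, which also skips escaped backslashes.

```javascript
// Hypothetical helper: returns the positions of all unescaped double quotes.
// The first alternative consumes escaped backslashes and escaped quotes (to be discarded);
// the second alternative captures a bare quote in group 1 (to be kept).
function unescapedQuoteIndices(text) {
  const pattern = /\\[\\"]|(")/g;
  const indices = [];
  for (const match of text.matchAll(pattern)) {
    if (match[1] !== undefined) {
      indices.push(match.index);
    }
  }
  return indices;
}

// The string literal below is the six characters "ab\\" (quote, a, b, two backslashes, quote).
console.log(unescapedQuoteIndices('"ab\\\\"')); // prints [ 0, 5 ]
```

Only the matches where group 1 participated are kept, so no lookbehind is needed.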

Regex expression excludes links with weird URL

I have this regex expression (Java / JavaScript)
/(http|ftp|https):\/\/([\w+?\.\w+])+([a-zA-Z0-9\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)_\-\\=\\+\\\\\/\\?\\.\\:\\;\\'\\,]*\.(?:jpg|JPG|jpeg|JPEG|gif|GIF|png|PNG|bmp|BMP|tiff|TIFF))?/
But it seems to have issues with a URL like this one:
https://cdn.vox-cdn.com/thumbor/C07imD1SHmAnbObkg-nJ92N6sD8=/0x0:4799x3199/920x613/filters:focal(2017x1217:2783x1983):format(webp)/cdn.vox-cdn.com/uploads/chorus_image/image/62871037/seattle.0.jpg
What do you think is missing in my expression?
I want to accept valid image URL.
Your expression works for me in the validator I tested with (regex101.com), however, it matches as 3 separate capture groups. To capture it all as a single match, just wrap the whole statement in a set of parentheses.
Note: to be clear, there are simpler ways to do this, but to answer the specific question that the OP asked, this will make their statement match their supplied link.
((http|ftp|https):\/\/([\w+?\.\w+])+([a-zA-Z0-9\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)_\-\\=\\+\\\\\/\\?\\.\\:\\;\\'\\,]*\.(?:jpg|JPG|jpeg|JPEG|gif|GIF|png|PNG|bmp|BMP|tiff|TIFF))?)
EDIT: After assisting the OP in narrowing down the scope of their issue, a more appropriate regex statement would be something like this: /^(((http(s?))|((s?)ftp)):)([\w \D~!##$%^&*\\_/-=+/?.:;',]){1,}\.(jpg|gif|png)$/i
Let's break this down:
First, this says the URL must start with either 'http' with an optional 's' or, if that isn't there, 'ftp' with an optional 's' prefixing it, to account for secure forms of FTP. This must be followed by a colon. The next set accepts just about any commonly used character or symbol in a URL path. Finally, it ensures that the expression ends with an actual image extension. Wrapping the expression in /{expression}/i indicates that the expression is case insensitive, so it will match upper or lower case in any combination.
As a further note, you may also want to account for the print formats .jpeg, .tif, etc.
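To see how the narrowed-down expression behaves, here is a minimal sketch that applies it to the URL from the question. The helper name isImageUrl and the example.com URLs are hypothetical, and the slashes inside the character class are escaped so the literal is unambiguous.

```javascript
// The expression from the EDIT above: scheme, colon, a broad character class, image extension.
const imageUrlPattern = /^(((http(s?))|((s?)ftp)):)([\w \D~!##$%^&*\\_\/-=+\/?.:;',]){1,}\.(jpg|gif|png)$/i;

// Hypothetical helper: true when the whole string looks like an image URL.
function isImageUrl(url) {
  return imageUrlPattern.test(url);
}

const voxUrl = "https://cdn.vox-cdn.com/thumbor/C07imD1SHmAnbObkg-nJ92N6sD8=/0x0:4799x3199/920x613/filters:focal(2017x1217:2783x1983):format(webp)/cdn.vox-cdn.com/uploads/chorus_image/image/62871037/seattle.0.jpg";

console.log(isImageUrl(voxUrl));                           // prints true
console.log(isImageUrl("https://example.com/index.html")); // prints false
```

Because the character class contains both \w and \D, it accepts almost any character, so the check is effectively "starts with a known scheme and ends with a listed extension".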

Why does regex take too long to evaluate for certain value?

Below is my regex :
(https?:\/\/)([a-zA-Z]{2,6}\.)*((?!.*[|!{}[\]^"*;]).)+(\.*)([a-zA-Z0-9\.\-\/\:\?&=_%#]+)+([&|?])+$
It is to validate a URL with a negative look-ahead to allow characters from other languages.
This is what happens when I test it at http://regex101.com/#javascript:
http://server.com/path?id=1111111 - NO MATCH
http://server.com/path?id=11111111 - TIMEOUT Your expression took too long to evaluate.
http://server.com/path?id=111111111111111111111& - MATCH
Observations:
When the value of the query parameter is increased above certain length it times out.
But for a matching URL the length of parameter value doesn't matter.
Why does it time out for beyond certain length? Which part of regex do I need to modify?
Note: the regex requires the URL to end with ? or &.
Thanks in advance.
EDIT:
What I need is a regex to validate all standard URLs (e.g. www.xyz.com or someip:port followed by path parameters and/or query parameters, etc.). It should support characters from other languages as well, with an additional validation that requires the URL to end with ? or &.
The (…+)+ in ([a-zA-Z0-9\.\-\/\:\?&=_%#]+)+ leads to catastrophic backtracking. Removing one of the pluses should help.
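A minimal sketch of that fix, assuming the rest of the question's pattern stays unchanged: the outer + on the last character-class group is removed, so the group can no longer be split in exponentially many ways. The names fixedPattern and endsWithQueryMarker are hypothetical.

```javascript
// The question's pattern with the nested quantifier flattened:
// ([a-zA-Z0-9\.\-\/\:\?&=_%#]+)+  becomes  ([a-zA-Z0-9\.\-\/\:\?&=_%#]+)
const fixedPattern = /(https?:\/\/)([a-zA-Z]{2,6}\.)*((?!.*[|!{}[\]^"*;]).)+(\.*)([a-zA-Z0-9\.\-\/\:\?&=_%#]+)([&|?])+$/;

// Hypothetical helper wrapping the test call.
function endsWithQueryMarker(url) {
  return fixedPattern.test(url);
}

// The input that timed out in the question now fails fast instead of backtracking exponentially.
console.log(endsWithQueryMarker("http://server.com/path?id=11111111"));               // prints false
console.log(endsWithQueryMarker("http://server.com/path?id=111111111111111111111&")); // prints true
```

The remaining pattern still backtracks more than necessary (each character re-runs the lookahead), so this is only the smallest change that removes the nested quantifier, not a tuned validator.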
This was the best I could come up with:
\b([\d\w\.\/\+\-\?\:]*)((ht|f)tp(s|)\:\/\/|[\d\d\d|\d\d]\.[\d\d\d|\d\d]\.|www\.|\.tv|\.ac|\.com|\.edu|\.gov|\.int|\.mil|\.net|\.org|\.biz|\.info|\.name|\.pro|\.museum|\.co)([\d\w\.\/\%\+\-\=\&\?\:\\\"\'\,\|\~\;]*)\b
JSFiddle: (I used someone else's demo to test it :)
http://jsfiddle.net/3AE9p/
Of course, this is not complete, but it is pretty close to what you would want and expect!

Match unless “escape” character is present; must work inline and back-to-back

This question is very similar to Match unless "escape" character is present, however the approved solution doesn't work in all cases.
In my scenario, I'm using JavaScript and want to capture the contents of [square brackets], unless the opening bracket is escaped with a backslash, as in \[slash].
RegEx
(?:[^\\]|^)\[([^\]]*)\]
Sample to test against to see issue:
This is a [block], but as you can see it is trying to capture the preceeding character.
You can \[escape] a block, but this creates \[problems] with [blocks] that are [stacked][back][to][back].
JavaScript regex engines that predate ES2018 don't support lookbehind.
You can use this workaround:
(?:\s|^|\])\[(\w+)(?=\])
Group 1 captures your required data within []
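Applied to the second sample sentence from the question, the workaround could be used like this. The helper name captureBlocks is hypothetical, and the g flag is added so that every block is found.

```javascript
// The workaround pattern: the "[" must be preceded by whitespace, the start of input, or a "]".
// The closing "]" is only looked ahead at, so it stays available as the prefix of the next block.
const blockPattern = /(?:\s|^|\])\[(\w+)(?=\])/g;

// Hypothetical helper: returns the contents of every unescaped [block].
function captureBlocks(text) {
  return Array.from(text.matchAll(blockPattern), (match) => match[1]);
}

const sample = "You can \\[escape] a block, but this creates \\[problems] with [blocks] that are [stacked][back][to][back].";
console.log(captureBlocks(sample)); // prints [ 'blocks', 'stacked', 'back', 'to', 'back' ]
```

Because the capture uses \w+, a block containing spaces or punctuation, such as [two words], is not captured by this pattern.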

Match altered version of first match with only one expression?

I'm writing a brush for Alex Gorbatchev's Syntax Highlighter to get highlighting for Smalltalk code. Now, consider the following Smalltalk code:
aCollection do: [ :each | each shout ]
I want to find the block argument ":each" and then match "each" every time it occurs afterwards (for simplicity, let's say every occurrence and not just those inside the brackets).
Note that the argument can have any name, e.g. ":myArg".
My attempt to match ":each":
\:([\d\w]+)
This seems to work. The problem is for me to match the occurrences of "each". I thought something like this could work:
\:([\d\w]+)|\1
But the right hand side of the alternation seems to be treated as an independent expression, so backreferencing doesn't work.
Is it even possible to accomplish what I want in a single expression? Or would I have to use the backreference within a second expression (via another function call)?
You could do it in languages that support variable-length lookbehind (AFAIK only the .NET framework languages do; Perl 6 might). There you could highlight a word if it matches (?<=:(\w+)\b.*)\1. But JavaScript engines that predate ES2018 don't support lookbehind at all.
But anyway this regex would be very inefficient (I just checked a simple example in RegexBuddy, and the regex engine needs over 60 steps for nearly every character in the document to decide between match and non-match), so this is not a good idea if you want to use it for code highlighting.
I'd recommend you use the two-step approach you mentioned: First match :(\w+)\b (word boundary inserted for safety, \d is implied in \w), then do a literal search for match result \1.
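The two-step approach might look like the following sketch. The function name findBlockArgumentUses is hypothetical, and wrapping the second search in \b word boundaries (rather than a plain substring search) is an assumption made here so that longer identifiers containing the name are not matched.

```javascript
// Hypothetical helper: step 1 finds the first block argument (":name"),
// step 2 searches for later occurrences of that name as a whole word.
function findBlockArgumentUses(source) {
  const declaration = /:(\w+)\b/.exec(source);
  if (declaration === null) {
    return [];
  }
  // The captured name consists only of \w characters, so it needs no regex escaping.
  const uses = new RegExp("\\b" + declaration[1] + "\\b", "g");
  // Start searching right after the declaration itself.
  uses.lastIndex = declaration.index + declaration[0].length;
  const positions = [];
  let match;
  while ((match = uses.exec(source)) !== null) {
    positions.push(match.index);
  }
  return positions;
}

console.log(findBlockArgumentUses("aCollection do: [ :each | each shout ]")); // prints [ 26 ]
```

The colon in "do:" is skipped in step 1 because it is not directly followed by a word character.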
I believe the only thing stored by the Regex engine between matches is the position of the last match. Therefore, when looking for the next match, you cannot use a backreference to the match before.
So, no, I do not think that this is possible.
