Javascript url validation allowing relative and absolute urls - javascript

I'm trying to validate a field to allow relative and absolute urls. I'm using the regex from this post but it is allowing spaces in the url.
var urlRegex = new RegExp(/(\/?[\w-]+)(\/[\w-]+)*\/?|(((http|ftp|https):\/\/)?[\w-]+(\.[\w-]+)+([\w.,#?^=%&:\/~+#-]*[\w#?^=%&\/~+#-])?)/gi);
Example:
// this should work
this/will/work.aspx?say=hello
http://www.example.com/this/will/work.aspx?say=hello
// this shouldn't work but does
and/this will also work/even though it shouldn't
and/this-shouldn't/but it does/also
The code below is what I was originally using to validate just absolute urls and it was working perfectly. If I remember properly, I pulled it from the jquery source. If this could be modified to also accept relative urls that would be perfect, but this is out of my league.
var urlRegex = new RegExp(/^(https?|ftp):\/\/(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*#)?(((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?)(:\d*)?)(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)*)*)?)?(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|\/|\?)*)?$/i);

I think you just need to anchor the pattern so that it has to match the whole string:
var urlRegex = /^(\/?[\w-]+)(\/[\w-]+)*\/?|(((http|ftp|https):\/\/)?[\w-]+(\.[\w-]+)+([\w.,#?^=%&:\/~+#-]*[\w#?^=%&\/~+#-])?)$/gi;
The leading ^ and trailing $ means that the pattern has to match the entire string instead of just some part of it.
edit that said, the pattern has other problems. First, those HTML entities for & (&) need to be just "&". The slashes don't need to be escaped in [] groups, and we don't need the "g" suffix. That leaves us with:
var urlRegex = /^(?:(\/?[\w-]+)(\/[\w-]+)*\/?|(((http|ftp|https):\/\/)?[\w-]+(\.[\w-]+)*([\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?))$/i;
edit again - oops also need to wrap the whole thing.

I wrote an article about URI validation complete with code snippets for all the various URI components as defined by RFC3986 here:
Regular Expression URI Validation
You may find what you are looking for there. Note however that almost any string represents a valid URI - even an empty string!

Related

JS regex to get domain name from an email [duplicate]

How can I extract only top-level and second-level domain from a URL using regex? I want to skip all lower level domains. Any ideas?
Here's my idea,
Match anything that isn't a dot, three times, from the end of the line using the $ anchor.
The last match from the end of the string should be optional to allow for .com.au or .co.nz type of domains.
Both the last and second last matches will only match 2-3 characters, so that it doesn't confuse it with a second-level domain name.
Regex:
[^.]*\.[^.]{2,3}(?:\.[^.]{2,3})?$
Demonstration:
Regex101 Example
Updated 2019
This is an old question, and the challenge here is a lot more complicated as we start adding new vanity TLDs and more ccTLD second level domains (e.g. .co.uk, .org.uk). So much so, that a regular expression is almost guaranteed to return false positives or negatives.
The only way to reliably get the primary host is to call out to a service that knows about them, like the Public Suffix List.
There are several open-source libraries out there that you can use, like psl, or you can write your own.
Usage for psl is quite intuitive. From their docs:
var psl = require('psl');
// Parse domain without subdomain
var parsed = psl.parse('google.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'google'
console.log(parsed.domain); // 'google.com'
console.log(parsed.subdomain); // null
// Parse domain with subdomain
var parsed = psl.parse('www.google.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'google'
console.log(parsed.domain); // 'google.com'
console.log(parsed.subdomain); // 'www'
// Parse domain with nested subdomains
var parsed = psl.parse('a.b.c.d.foo.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'foo'
console.log(parsed.domain); // 'foo.com'
console.log(parsed.subdomain); // 'a.b.c.d'
Old answer
You could use this:
(\w+\.\w+)$
Without more details (a sample file, the language you're using), it's hard to discern exactly whether this will work.
Example: http://regex101.com/r/wD8eP2
Also, you can likely do that with some expression similar to,
^(?:https?:\/\/)(?:w{3}\.)?.*?([^.\r\n\/]+\.)([^.\r\n\/]+\.[^.\r\n\/]{2,6}(?:\.[^.\r\n\/]{2,6})?).*$
and add as much as capturing groups that you want to capture the components of a URL.
Demo
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
RegEx Circuit
jex.im visualizes regular expressions:
For anyone using JavaScript and wanting a simple way to extract the top and second level domains, I ended up doing this:
'example.aus.com'.match(/\.\w{2,3}\b/g).join('')
This matches anything with a period followed by two or three characters and then a word boundary.
Here's some example outputs:
'example.aus.com' // .aus.com
'example.austin.com' // .austin.com
'example.aus.com/howdy' // .aus.com
'example.co.uk/howdy' // .co.uk
Some people might need something a bit cleverer, but this was enough for me with my particular dataset.
Edit
I've realised there are actually quite a few second-level domains which are longer than 3 characters (and allowed). So, again for simplicity, I just removed the character counting element of my regex:
'example.aus.com'.match(/\.\w*\b/g).join('')
Since TLDs now include things with more than three-characters like .wang and .travel, here's a regex that satisfies these new TLDs:
([^.\s]+\.[^.\s]+)$
Strategy: starting at the end of the string, look for one or more characters that aren't periods or whitespace, followed by a single period, followed by one or more characters that aren't periods or whitespace.
http://regexr.com/3bmb3
With capturing groups you can achieve some magix.
For example, consider the following javascript:
let hostname = 'test.something.else.be';
let domain = hostname.replace(/^.+\.([^\.]+\.[^\.]+)$/, '$1');
document.write(domain);
This will result in a string containing 'else.com'. This is because the regex itself will match the complete string and the capturing group will be mapped to $1. So it replaces the complete string 'test.something.else.com' with '$1' which is actually 'else.com'.
The regex isn't pretty and can probably be made more dynamic with things like {3} for defining how many levels deep you want to look for subdomains, but this is just an illustration.
if you want all specific Top Level Domain name then you can write regular expression like this:
[RegularExpression("^(https?:\\/\\/)?(([\\w]+)?\\.?(\\w+\\.((za|zappos|zara|zero|zip|zippo|zm|zone|zuerich|zw))))\\/?$", ErrorMessage = "Is not a valid fully-qualified URL.")]
You can also put more domain name from this link:
https://www.icann.org/resources/pages/tlds-2012-02-25-en
The following regex matches a domain with root and tld extractions (named capture groups) from a url or domain string:
(?:\w+:\/{2})?(?<cs_domain>(?<cs_domain_sub>(?:[\w\-]+\.)*?)(?<cs_domain_root>[\w\-]+(?<cs_domain_tld>(?:\.\w{2})?(?:\.\w{2,3}|\.xn-+\w+|\.site|\.club))))\|
It's hard to say if it is perfect, but it works on all the test data sets that I have put it against including .club, .xn-1234, .co.uk, and other odd endings. And it does it in 5556 steps against 40k chars of logs, so the efficiency seems reasonable too.
If you need to be more specific:
/\.(?:nl|se|no|es|milru|fr|es|uk|ca|de|jp|au|us|ch|it|io|org|com|net|int|edu|mil|arpa)/
Based on http://www.seobythesea.com/2006/01/googles-most-popular-and-least-popular-top-level-domains/

How to grab URLs in JavaScript without harming embedded objects and inline URL

I wrote a RegExp to grab and encode URLs in JavaScript.
This works fine but, it introduced a bug into my app.
I have a span Element which is used to display Emojis like this:
<span style="background:url(http://localhost/res/emo/face/E004.png)"></span>
Now, I'm using this Regular Expression to grab and convert anything URL into actual HTML clickable links:
/((https?:\/\/)?[\w-]+(\.[\w-]+)+\.?(:\d+)?(\/\S*)?)/ig
This ended up encoding the emoji URL into a clickable link.
Can anyone adjust that Code to Ignore URLs inside Elements or embedded Objects???
Please I need help!
This is the code:
var urlRegex = /((https?:\/\/)?[\w-]+(\.[\w-]+)+\.?(:\d+)?(\/\S*)?)/ig;
return txt.replace(urlRegex, function (url) {
var hyperlink = url;
if(!hyperlink.match('^https?:\/\/')) {
hyperlink = 'http://' + hyperlink;
}
return `${url}`;
});
I don't that the URLS inside
<span style="background:url(http://localhost/res/emo/face/E004.png)"></span>
were touched.
You would need to use negative look behind, which has limited support in JavaScript. (see here https://stackoverflow.com/a/50434875/6853740)
Simply adding negative look behind to your existing regex still doesn't work as expected:
((?<!url\()(https?:\/\/)?[\w-]+(\.[\w-]+)+\.?(:\d+)?(\/\S*)?) still matches "E004.png" in your example. Even other URL regexs from this post (What is the best regular expression to check if a string is a valid URL?) also match that. You may need to consider only looking for links that start with http:// or https:// which may help you recraft a regex that will only match full URLs.

How to remove URL from a string completely in Javascript?

I have a string that may contain several url links (http or https). I need a script that would remove all those URLs from the string completely and return that same string without them.
I tried so far:
var url = "and I said http://fdsadfs.com/dasfsdadf/afsdasf.html";
var protomatch = /(https?|ftp):\/\//; // NB: not '.*'
var b = url.replace(protomatch, '');
console.log(b);
but this only removes the http part and keeps the link.
How to write the right regex that it would remove everything that follows http and also detect several links in the string?
Thank you so much!
You can use this regex:
var b = url.replace(/(?:https?|ftp):\/\/[\n\S]+/g, '');
//=> and I said
This regex matches and removes any URL that starts with http:// or https:// or ftp:// and matches up to next space character OR end of input. [\n\S]+ will match across multi lines as well.
Did you search for a url parser regex? This question has a few comprehensive answers Getting parts of a URL (Regex)
That said, if you want something much simpler (and maybe not as perfect), you should remember to capture the entire url string and not just the protocol.
Something like
/(https?|ftp):\/\/[\.[a-zA-Z0-9\/\-]+/
should work better. Notice that the added half parses the rest of the URL after the protocol.

Only match regex if it doesnt start with a pattern in javascript

I have a bit of a strange one here, I basically have a large chunk of text which may or may not contain links to images.
So lets say it does I have a pattern which will extract the image url fine, however once a match is found it is replaced with a element with the link as the src. Now the problem is there may be multiple matches within the text and this is where it gets tricky. As the url pattern will now match the src tags url, which will basically just enter an infinite loop.
So is there a way to ONLY match in regex if it doesnt start with a pattern like ="|=' ? as then it would match the url in something like:
some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6
but not
some image <img src="http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6">
I am not sure if it is possible, but if it is could someone point me in the right direction? A replace by itself will not suffice in this scenario as the url matched needs to be used elsewhere too so it needs to be used like a capture.
The main scenarios I need to account for are:
Many links in one block of varied text
A single link without any other text
A single link with other varied text
== edit ==
Here is the current regex I am using to match urls:
(\b(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))
== edit 2 ==
Just so everyone understands why I cannot use the /g command here is an answer which explains the issue, if I could use this /g like I originally tried then it would make things a lot simpler.
Javascript regex multiple captures again
What you are looking for is a negative look behind, but Javascript doesn't support any kind of look behinds, so you will either have to use a callback function to check what was matched and make sure it is not preceded by a ' or ", or you can use the following regex:
(?:^|[^"'])(\b(https?|ftp|file):\/\/[-a-zA-Z0-9+&##\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))
which has a single problem, that is in the case of a successful match it will catch one more character, the one right before the (\b(https?|ftp|file) pattern in the input, but I think you can deal with this easily.
Regex101 Demo
Using the /ig command at the end should work... the g is for global replace and the i is for case-insensitivity, which is necessary as you've only got A-Z instead of a-zA-Z.
Using the following vanilla JS appears to work for me (see jsfiddle)...
var test="some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6 some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6 some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6";
var re = new RegExp(/(\b(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))/ig);
document.getElementById("output").innerHTML = test.replace(re,"<img src=\"$1\"/>");
Although, what it does highlight is that the query string part of the URL (the ?v=6 is not being picked up with your RegEx).
For jQuery, it would be (see jsfiddle)...
$(document).ready(function(){
var test="some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6 some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6 some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6";
var re = new RegExp(/(\b(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))/ig);
$("#output").html(test.replace(re,"<img src=\"$1\"/>"));
});
Update
Just in case my example of using the same image URL in the example doesn't convince you - it also works with different URLs... see this jsfiddle update
var test="http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6 http://cdn.sstatic.net/serverfault/img/sprites.png?v=7";
var re = new RegExp(/(\b(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))/ig);
document.getElementById("output").innerHTML = test.replace(re,"<img src=\"$1\"/>");
Couldn't you just see if there is a whitespace in front of the url, instead of that word-boundary? seems to work, although you will have to remove the matched whitespace later.
(\s(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))
http://rubular.com/r/9wSc0HNWas
Edit: Damn, too slow :) I'll still leave this here as my regex is shorter ;)
as was said by freefaller, you might use /g flag to just find all matches in one go, if exec is not a must.
otherwise: you can add (="|=')? to the beginning of your regex, and check if $1 is undefined. if it is undefined, then it was not started with a ="|=' pattern

How to end a regular expression with a forward slash in Javascript

Problem
I am trying to match the hash part of a URL using Javascript. The hash will have the format
/#\/(.*)\//
This is easy to achieve using "new RegExp()" method of creating a JS regular expression, but I can't figure out how to do it using the standard format, because the two forward slashes at the end begin a comment. Is there another way to write this that won't start a comment?
Example
// works
myRegexp = new RegExp ('#\/(.*)\/');
// fails
myRegexp = /#\/(.*)\//
I am trying to match the hash part of a URL using Javascript.
Yeah, don't do that. There's a perfectly good URL parser built into every browser. Set an href on a location object (window.location or a link) and you can read/write URL parts from properties hostname, pathname, search, hash etc.
var a= document.createElement('a');
a.href= 'http://www.example.com/foo#bar#bar';
alert(a.hash); // #bar#bar
If you're putting a path-like /-separated list in the hash, I'd suggest hash.split('/') to follow.
As for the regex, both versions work identically for me. The trailing // does not cause a comment. If you just want to appease some dodgy syntax highlighting, you could potentially escape the / to \x2F.
It is not starting a comment, just like two slashes in a string. Look here: http://jsfiddle.net/Gr2qb/2/

Categories

Resources