Perfect URL Checking Regular Expression for MOST URLs - javascript

I am working on a project where I need to validate my URLs, and I stumbled upon the following RegEx pattern:
/(((http|ftp|https):\/{2})+(([0-9a-z_-]+\.)+(aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cu|cv|cx|cy|cz|cz|de|dj|dk|dm|do|dz|ec|ee|eg|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mn|mn|mo|mp|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|nom|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ra|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sj|sk|sl|sm|sn|so|sr|st|su|sv|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw|arpa)(:[0-9]+)?((\/([~0-9a-zA-Z\#\+\%#\.\/_-]+))?(\?[0-9a-zA-Z\+\%#\/&\[\];=_-]+)?)?))\b/imuS$/ # https://mathiasbynens.be/demo/url-regex
This lets me check URLs that always have a protocol in front of them (http, https or ftp). I would also like to let the user leave out the protocol and still have the URL be valid. How do I do this?
Are there any other RegEx patterns that are better or more accurate that I can use to validate my URLs? Thanks for all answers!

I'm currently working on a module that validates inputs. One of the validations required me to parse domains (hostnames) per:
RFC 952
RFC 1123
Trailing dots in domain names
To validate a domain I took a few steps; one of them was to use the browser's parsing logic with this cool trick:
function parseURI( str ) {
    var a = document.createElement( "a" );
    // If the string doesn't contain a protocol, the browser treats it as
    // relative to the current document location, so prepend one.
    a.href = /^(https?:\/\/)/i.test( str ) === false ? ( "http://" + str ) : str;
    // Since I can't overwrite a[property] - return an object I control ( Muahahah ).
    return {
        hash: a.hash,
        hostname: a.hostname,
        href: a.href,
        origin: a.origin,
        pathname: a.pathname,
        port: a.port,
        protocol: a.protocol,
        search: a.search,
        // When parsing the URL by the browser fails, the browser will
        // set the hostname based on the current document.location value,
        // so compare against its hostname (a string), not the Location object.
        valid: a.hostname !== document.location.hostname
    };
}
If validating a hostname | domain is what you are after, I can share my insights on the topic as well.
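For the hostname part itself, the label rules from RFC 952 and RFC 1123 can be checked without the DOM. The following is only a minimal sketch: the helper name isValidHostname is hypothetical, the 253-character limit is the commonly cited one for the textual form, and internationalized names are assumed to be already converted to their ASCII form.

```javascript
// Hypothetical helper: checks a hostname against the RFC 952 / RFC 1123 label rules.
function isValidHostname(hostname) {
  if (typeof hostname !== "string") return false;
  // A single trailing dot (fully qualified name) is allowed; drop it before checking.
  const name = hostname.endsWith(".") ? hostname.slice(0, -1) : hostname;
  // Commonly cited limit for the textual form without the trailing dot.
  if (name.length === 0 || name.length > 253) return false;
  // Each label: 1 to 63 characters, letters, digits and hyphens,
  // with no hyphen at the start or the end.
  const label = /^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$/i;
  return name.split(".").every((part) => label.test(part));
}

console.log(isValidHostname("example.com."));     // true
console.log(isValidHostname("-bad.example.com")); // false
```

Underscores are rejected here on purpose, because the RFCs above do not allow them in hostnames even though the regex in the question accepts them.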

I suggest you use character classes to compact the extension part of your regex, like this:
(aero|asia|arpa|a[c-gil-oq-uwxz]|biz|b[abd-jmnorstv-z]|cat|com|coop|c[acdf-ik-oruvxyz]|
d[ejkmoz]|edu|e[cegr-u]|f[ijkmor]|gov|g[abd-ilmnp-uwy]|h[kmnrtu]|info|int|i[del-oq-t]|
jobs|j[emop]|k[eghimnprwyz]|l[abcikr-vy]|mil|mobi|museum|m[acdeghklnopr-z]|
name|net|nom|n[acefgilopruz]|org|pro|p[ae-hk-nrstwy]|qa|r[easuw]|s[a-eg-ortuvyz]|
tel|travel|t[cdfghj-prtvwz]|u[agksyz]|v[aceginu]|w[fs]|y[etu]|z[amw])
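To see that the compact form accepts the same strings, one group can be compared against the written-out alternation. This sketch checks only the codes starting with "a"; the variable names are illustrative and not part of the original answer.

```javascript
// The "a" country codes from the original alternation, written out in full.
const longForm = /^(ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az)$/;
// The same set expressed with a single character class.
const compactForm = /^a[c-gil-oq-uwxz]$/;

// Compare both patterns on every two-letter string that starts with "a".
const letters = "abcdefghijklmnopqrstuvwxyz".split("");
const mismatches = letters
  .map((letter) => "a" + letter)
  .filter((tld) => longForm.test(tld) !== compactForm.test(tld));

console.log(mismatches); // []
```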

I modified it to the following so that the user can leave out the protocol;
/(((http|ftp|https):\/{2})?(([0-9a-z_-]+\.)+(aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cu|cv|cx|cy|cz|cz|de|dj|dk|dm|do|dz|ec|ee|eg|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mn|mn|mo|mp|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|nom|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ra|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sj|sk|sl|sm|sn|so|sr|st|su|sv|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw|arpa)(:[0-9]+)?((\/([~0-9a-zA-Z\#\+\%#\.\/_-]+))?(\?[0-9a-zA-Z\+\%#\/&\[\];=_-]+)?)?))\b$/
by making the first part (the protocol) optional using the ? operator.
http://www.regexr.com/ is a great tool to use for testing RegEx patterns and learning about how they work.
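The effect of the ? quantifier is easier to see on a shorter pattern. The sketch below is a simplified stand-in, not the pattern above: the TLD list is replaced by a generic suffix of two or more letters, and the path part accepts any non-whitespace characters.

```javascript
// Simplified stand-in for the long pattern: an optional protocol group,
// then dot-separated labels, an optional port and an optional path.
const urlPattern = /^((https?|ftp):\/\/)?([0-9a-z_-]+\.)+[a-z]{2,}(:[0-9]+)?(\/\S*)?$/i;

console.log(urlPattern.test("http://www.example.com")); // true
console.log(urlPattern.test("www.example.com"));        // true, protocol left out
console.log(urlPattern.test("http://"));                // false, no host
```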

Related

How to split domain with http or https in nodejs

Can anyone help me split the domain name, including http or https, from a URL string?
URL : https://www.test.com/abc/?a=1&b=1
Expected Output : https://www.test.com
Thanks in advance.
I strongly recommend you avoid using a home-grown regexp. Instead, use the node URL class:
https://nodejs.org/api/url.html
Not exactly sure which parts you want to keep or not (do you want to include the port? Do you want to decode IDNs?), but origin may be the way to go. Here’s the example straight out from the docs:
const { URL } = require('url');
const myURL = new URL('https://example.org/foo/bar?baz');
console.log(myURL.origin);
// Prints https://example.org
Otherwise, you could use the protocol and host or hostname components.
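As a rough illustration of the difference between those components, here is the URL from the question with a port added. The port 8443 is an assumption made only so that host and hostname differ.

```javascript
// URL is available as a global in current Node.js versions and in browsers.
const parsed = new URL("https://www.test.com:8443/abc/?a=1&b=1");

const fromOrigin = parsed.origin;                            // scheme, host name and port
const withPort = `${parsed.protocol}//${parsed.host}`;       // host includes the port
const withoutPort = `${parsed.protocol}//${parsed.hostname}`; // hostname drops the port

console.log(fromOrigin);  // https://www.test.com:8443
console.log(withPort);    // https://www.test.com:8443
console.log(withoutPort); // https://www.test.com
```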
You can also use the url-parse package to get the origin from a URL.
Refer to: https://www.npmjs.com/package/url-parse
var URL = require('url-parse');
const url_obj = new URL('https://test.com/abc/?a=1');
console.log(url_obj.origin); // https://test.com

How to check if url scheme is present in a url string javascript

I am trying to solve an issue where I need to know if there is a URL scheme (not limited to http, https) prepended to my url string.
I could do link.indexOf("://"); and then take the substring of anything before the "://", but consider a case such as:
example.com?url=http://www.eg.com
in this case, the substring will return me the whole string i.e.
example.com?url=http which is incorrect. It should return me "", since my url does not have a protocol prepended.
I need to find out whether the url is prepended with a protocol or not.
You can do it quite easily with a little bit of regex. The pattern /^[a-z0-9]+:\/\// will be able to extract it.
If you just want to test if it has it, use pattern.test() to get a boolean:
/^[a-z0-9]+:\/\//.test(url); // true
If you want what it is, use url.match() and wrap the protocol portion in parentheses:
url.match(/^([a-z0-9]+):\/\//)[1] // https
Here is a runnable example with a few example URLs.
const urls = ['file://test.com', 'http://test.com', 'https://test.com', 'example.com?http'];
console.log(
urls.map(url => (url.match(/^([a-z0-9]+):\/\//) || [])[1])
);
You could use the URL API which is supported in most browsers.
function getProtocol(str) {
try {
var u = new URL(str);
return u.protocol.slice(0, -1);
} catch (e) {
return '';
}
}
Usage
getProtocol('example.com?url=http://www.eg.com'); // returns ""
getProtocol('https://example.com?url=http://www.eg.com'); // returns "https"
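One caveat with the URL API approach is that a string such as localhost:8080 parses successfully, with "localhost" treated as the scheme. If only schemes followed by "://" should count, a regex based on the RFC 3986 scheme syntax may be a safer fit. This is a sketch under that assumption; the names schemePattern and getScheme are hypothetical.

```javascript
// Scheme syntax per RFC 3986: a letter, then letters, digits, "+", "-" or ".".
// Requiring "://" afterwards is a choice made for this sketch.
const schemePattern = /^([a-z][a-z0-9+.-]*):\/\//i;

// Hypothetical helper: returns the lower-cased scheme, or "" when none is prepended.
function getScheme(str) {
  const match = schemePattern.exec(str);
  return match ? match[1].toLowerCase() : "";
}

console.log(getScheme("example.com?url=http://www.eg.com")); // ""
console.log(getScheme("HTTPS://example.com"));               // "https"
console.log(getScheme("git+ssh://host/repo"));               // "git+ssh"
console.log(getScheme("localhost:8080"));                    // ""
```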

Ensure URL is relative before navigating via JavaScript's location.replace()

I have a login page https://example.com/login#destination where destination is the target URL the user was trying to navigate to when they were required to log in.
(i.e. https://example.com/destination)
The JavaScript I was thinking about using was
function onSuccessfulLogin() {
location.replace(location.hash.substring(1) || 'default')
}
This would result in an XSS vulnerability, by an attacker providing the link
https://example.com/login#javascript:..
Also I need to prevent navigation to a lookalike site after login.
https://example.com/login#https://looks-like-example.com
or https://example.com/login#//looks-like-example.com
How can I adjust onSuccessfulLogin to ensure the URL provided in the hash # portion is a relative URL, and not starting with javascript:, https:, // or any other absolute navigation scheme?
One thought is to evaluate the URL, and see if location.origin remains unchanged before navigating. Can you suggest how to do this, or a better approach?
From OWASP recommendations on Preventing Unvalidated Redirects and Forwards:
It is recommended that any such destination input be mapped to a value, rather than the actual URL or portion of the URL, and that server side code translate this value to the target URL.
So a safe approach would be mapping some keys to actual URLs:
// https://example.com/login#destination
var keyToUrl = {
destination: 'https://example.com/destination',
defaults: 'https://example.com/default'
};
function onSuccessfulLogin() {
var hash = location.hash.substring(1);
var url = keyToUrl[hash] || keyToUrl.defaults;
location.replace(url);
}
You could also consider providing only path part of the URL and appending it with a hostname in the code:
// https://example.com/login#destination
function onSuccessfulLogin() {
var path = location.hash.substring(1);
var url = 'https://example.com/' + path;
location.replace(url);
}
I would stick to the mapping though.
That is a very good point about the XSS vulnerability.
Most common protocols use only English alphabetic characters, so a regex like /^[a-z]+:/i would check for those (strictly, RFC 3986 also allows digits, +, - and . after the first letter). Alternatively, if we're feeling more inclusive, /^[^:\/?]+:/ matches any run of characters other than :, / or ? followed by a :. Then we can combine that with /^\/\// to also catch protocol-relative URLs, which gives us:
// Either
var rexIsProtocol = /(?:^[a-z]+:)|(?:^\/\/)/i;
// Or
var rexIsProtocol = /(?:^[^:\/?]+:)|(?:^\/\/)/i;
Then the test is like this:
var url = location.hash.substring(1).trim(); // trim to deal with whitespace
if (rexIsProtocol.test(url)) {
// It starts with a protocol
} else {
// It doesn't
}
That said, the only one I think you need to be particularly bothered by is the javascript: pseudo-protocol, so you might just test for that.
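The idea raised in the question, resolving the target and checking that the origin stays the same, can be sketched with the URL constructor. This is an illustration rather than a complete defense: the helper name safeRedirectTarget and the fallback URL are assumptions, and the OWASP mapping approach remains the stricter option.

```javascript
// Hypothetical helper: resolves `target` against `base` and returns the resolved
// URL only when it stays on the same origin; otherwise returns `fallback`.
function safeRedirectTarget(target, base, fallback) {
  try {
    const baseUrl = new URL(base);
    const resolved = new URL(target, baseUrl);
    return resolved.origin === baseUrl.origin ? resolved.href : fallback;
  } catch (e) {
    // Unparseable input is treated like a rejected target.
    return fallback;
  }
}

const base = "https://example.com/login";
const fallback = "https://example.com/default";

console.log(safeRedirectTarget("destination", base, fallback));
// https://example.com/destination
console.log(safeRedirectTarget("javascript:alert(1)", base, fallback));
// https://example.com/default
console.log(safeRedirectTarget("//looks-like-example.com", base, fallback));
// https://example.com/default
```

In the login page this could be called as location.replace(safeRedirectTarget(location.hash.substring(1) || 'default', location.href, '/default')), so that anything resolving to a different origin falls back to the default page.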

window.location.indexOf not working in Javascript

Below is what I have.
var myString = "http://localhost:8888/www.smart-kw.com/";
alert(myString.indexOf("localhost"));
This gives me an alert. However, if I change var myString = "http://localhost:8888/www.smart-kw.com/"; to var myString = window.location;, it doesn't work (I don't get an alert).
var myString = window.location;
alert(myString.indexOf("localhost"));
window.location is an accessor property, and getting its value gives you an object, not a string, and so it doesn't have an indexOf function. (It's perfectly understandable that people sometimes think it's a string, since when you set its value, the accessor property's setter accepts a string; that is, window.location = "some url"; actually works. But when you get it, you don't get a string.)
You can use window.location.toString(), String(window.location), or window.location.href to get a string for it if you like, or use any of its various properties to check specifics. From the link, given example url http://www.example.com:80/search?q=devmo#test:
hash: The part of the URL that follows the # symbol, including the # symbol. You can listen for the hashchange event to get notified of changes to the hash in supporting browsers. Example: #test
host: The host name and port number. Example: www.example.com:80
hostname: The host name (without the port number). Example: www.example.com
href: The entire URL. Example: http://www.example.com:80/search?q=devmo#test
pathname: The path (relative to the host). Example: /search
port: The port number of the URL. Example: 80
protocol: The protocol of the URL. Example: http:
search: The part of the URL that follows the ? symbol, including the ? symbol. Example: ?q=devmo
For instance, for your quoted example, you might check window.location.hostname === "localhost".
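The same component names are available on a URL object, which makes them easy to try outside a page. In this sketch port 8080 is used instead of the quoted port 80 (an assumption for illustration), because current URL parsers report a scheme's default port as an empty string and leave it out of host.

```javascript
// A URL object exposes the same component names as window.location.
const parts = new URL("http://www.example.com:8080/search?q=devmo#test");

console.log(parts.protocol); // http:
console.log(parts.host);     // www.example.com:8080
console.log(parts.hostname); // www.example.com
console.log(parts.port);     // 8080
console.log(parts.pathname); // /search
console.log(parts.search);   // ?q=devmo
console.log(parts.hash);     // #test

// With the default port for http, the port is dropped from port and host.
const defaultPort = new URL("http://www.example.com:80/search?q=devmo#test");
console.log(defaultPort.port === ""); // true
console.log(defaultPort.host);        // www.example.com
```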
As far as I know window.location is a Location object.
For instance, window.location.href will give you the entire URL.
var url = window.location.href;
alert(url.indexOf("domain"));
But this kind of check is bound to trigger false positives. You are better off using the window.location.hostname property, which holds the host name part.
var hostname = window.location.hostname;
alert(hostname === "my.domain.com");
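A quick way to see the false positive is to run both checks on a URL that merely contains the domain somewhere in it. The URL below is made up for illustration, and a URL object stands in for window.location so the snippet also runs outside a browser.

```javascript
// Made-up URL: the wanted domain appears in the host prefix and in the query,
// but the page is actually served from attacker.example.
const href = "http://my.domain.com.attacker.example/page?ref=my.domain.com";

// Substring check on the whole URL: matches even though the host is different.
const substringCheck = href.indexOf("my.domain.com") > -1;

// Exact comparison on the host name only.
const hostnameCheck = new URL(href).hostname === "my.domain.com";

console.log(substringCheck); // true
console.log(hostnameCheck);  // false
```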
I found a way to make this work:
window.location.href.indexOf("localhost") > -1
I actually use this for my projects as conditionals and it works just fine.

Get the current url but without the http:// part bookmarklet!

Guys I have a question, hoping you can help me out with this one. I have a bookmarklet;
javascript:q=(document.location.href);void(open('http://other.example.com/search.php?search='+location.href,'_self ','resizable,location,menubar,toolbar,scrollbars,status'));
which takes URL of the current webpage and search for it in another website. When I use this bookmarklet it takes the whole URL including http:// and searches for it. But now I would like to change this bookmarklet so it will take only the www.example.com or just example.com (without http://) and search for this url. Is it possible to do this and can you please help me with this one?
Thank you!
JavaScript can access the current URL in parts. For this URL:
http://css-tricks.com/example/index.html
window.location.protocol = "http:"
window.location.host = "css-tricks.com"
window.location.pathname = "/example/index.html"
please check: http://css-tricks.com/snippets/javascript/get-url-and-url-parts-in-javascript/
This should do it
location.href.replace(/https?:\/\//i, "")
Use document.location.host instead of document.location.href. That contains only the host name (plus the port, if one is specified) and not the full URL.
Use the URL API
A modern way to get a part of the URL can be to make a URL object from the url that you are given.
const { hostname } = new URL('https://www.some-site.com/test'); // www.some-site.com
You can of course just pass window location or any other url as an argument to the URL constructor.
Like this
const { hostname } = new URL(document.location.href);
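Since the question asks for either www.example.com or just example.com, the leading "www." can be stripped from the hostname as an extra step. This is a small sketch; the helper name bareHost and its stripWww parameter are hypothetical.

```javascript
// Hypothetical helper: returns the host name of a URL, optionally without "www.".
function bareHost(href, stripWww) {
  const { hostname } = new URL(href);
  return stripWww ? hostname.replace(/^www\./i, "") : hostname;
}

console.log(bareHost("http://www.example.com/a/b?c=1", false)); // www.example.com
console.log(bareHost("http://www.example.com/a/b?c=1", true));  // example.com
```

In the bookmarklet, the result would then be passed to the search URL in place of location.href.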
Do you have control over other.example.com? This should probably be done on the server side.
In which case:
preg_replace("/^https?:\/\/(.+)$/i","\\1", $url);
should work. Or, you could use str_replace(...), but be aware that that might strip 'http://' from somewhere inside the URL:
str_replace(array('http://','https://'), '', $url);
EDIT: or, if you just want the host name, you could try parse_url(...)?
Using javascript replace via regex matching:
javascript:q=(document.location.href.replace(/(https?|file):\/\//,''));void(open('http://other.example.com/search.php?search='+q,'_self ','resizable,location,menubar,toolbar,scrollbars,status'));
Replace (https?|file) with your choice, e.g. ftp, gopher, telnet etc.
