Detect FQDN and URL regex match (JavaScript)

This is not related to a previous question I posted. I need a regex to detect FQDNs such as google.ca/ and www.google.ca/ (it must detect the trailing forward slash) as well as URLs such as http://www.google.ca and https://www.stackoverflow.com. Can someone help me with this? I am using match (in JavaScript) to detect these FQDNs and URLs. Sorry if this seems like a repeat of my previous question, but it isn't (it's more specific).
I am using this to match Twitter's character count. When Twitter detects a URL or FQDN, it compresses the URL to 21 characters (if it's https) and to 20 characters otherwise, no matter how long it actually is.

Is "google.ca/" an FQDN? I guess it is, given that even http://uz/ resolves.
The question really is what exactly are you searching for? :)
Check if this one works for you: http://regexlib.com/redetails.aspx?regexp_id=1735&AspxAutoDetectCookieSupport=1
If not, regexlib.com is a good source, but I would suggest defining your requirements more precisely and explicitly.

You could just detect anything with a dot and no spaces, but that's likely to cause false positives.
E.g.
var s = "This is not related to a previous question I posted. I need a regex to detect FQDN such as google.ca/ and www.google.ca/ (must detect the forward slash) as well as urls such as http://www.google.ca and https://www.stackoverflow.com. Can someone help me with this"
console.log(s.match(/(https?\:\/\/)?([a-z0-9\-]+\.)+[a-z0-9\-]{2,8}\/?/ig))
Output
[
"google.ca/",
"www.google.ca/",
"http://www.google.ca",
"https://www.stackoverflow.com"
]
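Since the goal is Twitter-style length counting, the match list from the regex above can drive a small counter. This is only a sketch using the question's stated costs (21 characters for https, 20 otherwise); `tweetLength` is a hypothetical helper, not Twitter's actual algorithm:

```javascript
// Sketch: replace each matched URL/FQDN with a fixed character cost,
// per the question's numbers (21 if https, 20 otherwise). Not Twitter's
// real algorithm -- just the regex above feeding a counter.
var URL_RE = /(https?:\/\/)?([a-z0-9\-]+\.)+[a-z0-9\-]{2,8}\/?/ig;

function tweetLength(text) {
  var length = text.length;
  (text.match(URL_RE) || []).forEach(function (url) {
    var cost = /^https:\/\//i.test(url) ? 21 : 20;
    length += cost - url.length;
  });
  return length;
}
```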


Is this input sanitization regex safe?

I have an input field where I expect the user to enter the name of a place (city/town/village/whatever). I have this function which is use to sanitize the content of the input field.
sanitizeInput: function (input) {
  return input.replaceAll(/[&/\\#,+()$~%.^'":*?<>{}]/g, "");
}
I want to remove all special characters that I expect not to appear in a place name. I thought a blacklist regex would be better than a whitelist regex because there are still many characters that might legitimately appear in a place name.
My questions are:
Is this regex safe?
Could it be improved?
Do you see a way to attack the program using this regex?
EDIT: This is a tiny frontend-only project. There is no backend.
Your regex works for removing those special characters.
The answers are:
1. The regex is safe, but since, as you mentioned, this is a frontend-only project, the function runs in the browser, and the browser is not a safe place to do user-input sanitization. To be fully safe, you should also sanitize on a backend server.
2. You cannot improve the regex itself in this example. Instead of a regex, you could check indexOf for each special character (it may be faster, but it is more verbose and much more code), like:
str.indexOf('&') !== -1
str.indexOf('#') !== -1
etc.
3. Same as answer 1: the regex is safe, but because it runs in browser JavaScript, the code can be disabled, so please do server-side validation as well.
If you have any issue with this answer, please let me know by comment or reply.
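The indexOf suggestion above can be sketched as a plain loop over a blacklist string (the characters are copied from the question's character class). This is only an illustration of the idea, not a claim that it beats the regex:

```javascript
// Sketch: remove blacklisted characters without a regex. The blacklist
// string mirrors the character class from the question's regex.
var BLACKLIST = "&/\\#,+()$~%.^'\":*?<>{}";

function sanitizeInput(input) {
  var out = "";
  for (var i = 0; i < input.length; i++) {
    if (BLACKLIST.indexOf(input[i]) === -1) out += input[i];
  }
  return out;
}
```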
It is important to remember that front-end sanitization mainly improves the user experience and protects against accidental input errors; there are ways to get past front-end controls. For security purposes, rely on sanitizing data on the backend. This may not answer your question directly, but depending on what you use for a backend, you may need to sanitize certain things yourself, or it may have built-in controls that make further sanitization unnecessary.
P.S. Please forgive my lack of references, but this is worth researching on your own.
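For contrast with the blacklist the question weighs, a whitelist version is short. The allowed set below (letters, spaces, dots, apostrophes, hyphens) is an assumption about what place names contain, so widen it for your own data:

```javascript
// Sketch of a whitelist: strip everything NOT in an allowed set.
// \p{L} (with the "u" flag) matches letters in any script, which
// matters for place names like "Zürich" or "São Paulo". The allowed
// set here is an assumption -- adjust it for your data.
function sanitizePlaceName(input) {
  return input.replace(/[^\p{L} .'-]/gu, "");
}
```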

alternative to javascript: when colon is filtered

I set up a purposely vulnerable form on my website to test some XSS vectors.
Then I thought: with href XSS, if : is filtered, how would XSS be possible? You would have to insert javascript:alert(1) like this: <a href="javascript:alert(1)">. Say %3A isn't allowed either.
Thanks to anyone who can help me with this.
To answer your question literally: if only <, >, :, % and quotes are filtered, then you can still do
javascript&#58;
http://jsfiddle.net/zaN9m/
Please don't take this as a hint to just filter that extra sequence too. This is exactly the game where you are always one step behind, and you will never be able to allow harmless input if you take the "filter all bad chars out" approach.
Consider what happens when you start looking out for &#58; and I pass javascript:&:#:5:8:;. The filter will not detect &#58; there, but it will remove the colons, and the result after the colons are filtered out will be javascript&#58;, so it's again checkmate.
What you want to do instead is check whether the URL has a scheme and match it against known good schemes like http and https. You will then also have to apply HTML encoding to the result. This is bulletproof, and you don't have to play games.
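A sketch of that scheme-allowlist approach, leaning on the URL constructor rather than character filtering. SAFE_SCHEMES and the "#" fallback are illustrative choices, not a standard API beyond URL itself, and the HTML encoding step still applies when writing the result into markup:

```javascript
// Sketch: accept an href only if its scheme is on an allowlist.
// new URL() throws on unparseable input; the base URL resolves
// relative paths so "/page" stays usable. Unsafe input degrades to "#".
var SAFE_SCHEMES = ["http:", "https:", "mailto:"];

function safeHref(value) {
  try {
    var url = new URL(value, "https://example.com/");
    return SAFE_SCHEMES.indexOf(url.protocol) !== -1 ? url.href : "#";
  } catch (e) {
    return "#";
  }
}
```

Note how javascript:, data:, and entity tricks all fall out naturally: they either parse to a disallowed scheme or fail to parse at all.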

Why does this regex execute slowly?

So I've got a regex that identifies URLs:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
But when I use it to identify URLs that a user has entered, simply calling .test slows the page down considerably, even though, according to MDN, it's supposed to be faster than exec. Am I using an outdated method of testing regular expressions? Is there a faster method that I don't know about? Or is my regex just really long and complicated?
Here's a JSFiddle.
Edit:
It takes 20.7 seconds in Chrome v24, and 1:48.5 in Internet Explorer 9.
So it seems that the regex only lags when it processes a URL that has a query string, for example Product.aspx?Item=N82E16811139009 in the JSFiddle URL. When that part of the URL is removed, the regex performs correctly and quickly.
However, removing the last star from ([\/\w \.-]*)* makes the regex perform incorrectly, so using ([\/\w \.-]*) is not an option.
Rather, for the regex to be able to handle URLs with query strings, the last part needs to be removed:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
to
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*/
This is because the regex is designed to detect file types or a slash at the end of the URL, not a question mark and a query string. Removing the last part fixes the problem, and the regex runs correctly and quickly.
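The underlying cause is catastrophic backtracking: ([\/\w \.-]*)* nests one unbounded repetition inside another, so when the final $ fails on the "?", the engine retries exponentially many ways of splitting the path between the inner and outer star. With the anchor removed as described, the first greedy pass simply succeeds. A minimal check of the fixed pattern:

```javascript
// The fixed regex from above: same pattern, minus the trailing \/?$.
// Without the end anchor, the nested quantifier never needs to
// backtrack, so matching is immediate even on URLs with a query string.
var fixed = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*/;

var url = "http://www.newegg.com/Product/Product.aspx?Item=N82E16811139009";
console.log(fixed.test(url)); // true, and it returns instantly
```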

Getting parts of a URL in JavaScript

I have to match URLs in a text, linkify them, and then display only the host (domain name or IP address) to the user. How can I proceed in JavaScript?
Thanks.
PS: please don't tell me about this; those regular expressions are so buggy they can't match http://google.com
If you don't want to use regular expressions, then you'll need to use things like indexOf instead. For instance, search for "://" in the text of every element; if you find it and the bit in front of it looks like a protocol (or "scheme"), grab it and the following characters that are valid URI characters (RFC 2396). If the result ends in a dot or question mark, remove that character (it probably ends a sentence). There's not really a lot more to say.
Update: Ah, I see from your edit that you don't have a problem with regular expressions, just the ones in the answers to that question. Fair enough.
This may well be one of those places where trying to do it all with a regular expression is more work than it should be, but using regular expressions as part of the solution is helpful. For instance,
/[a-zA-Z][a-zA-Z0-9+\-.]*:\/\//
...may well be a helpful way to find the beginning of a URL, since the scheme portion must start with an alpha and then can have zero or more alpha, digit, +, -, or . prior to the : (section 3.1).
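Building on that, one workable split of labor is: a regex locates scheme-prefixed URLs in free text, and the URL constructor (my choice here, not something the answer prescribes) extracts the host so you don't hand-parse the rest:

```javascript
// Sketch: a regex finds scheme-prefixed URLs in free text, then the
// URL constructor pulls out the host. Unparseable candidates are
// silently dropped.
var SCHEME_URL = /[a-zA-Z][a-zA-Z0-9+\-.]*:\/\/[^\s]+/g;

function hostsIn(text) {
  return (text.match(SCHEME_URL) || []).map(function (u) {
    try { return new URL(u).host; } catch (e) { return null; }
  }).filter(Boolean);
}
```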

Javascript/Regex for finding just the root domain name without subdomains

I had a search and found lots of similar regex examples, but not quite what I need.
I want to be able to pass in the following urls and return the results:
www.google.com returns google.com
sub.domains.are.cool.google.com returns google.com
doesntmatterhowlongasubdomainis.idont.wantit.google.com returns google.com
sub.domain.google.com/no/thanks returns google.com
Hope that makes sense :)
Thanks in advance! -James
You can't do this with a regular expression because you don't know how many blocks are in the suffix.
For example google.com has a suffix of com. To get from subdomain.google.com to google.com you'd have to take the last two blocks - one for the suffix and one for google.
If you apply this logic to subdomain.google.co.uk though you would end up with co.uk.
You will actually need to look up the suffix from a list like http://publicsuffix.org/
Don't use regex, use the .split() method and work from there.
var s = domain.split('.');
If your use case is fairly narrow you could then check the TLDs as needed, and then return the last 2 or 3 segments as appropriate:
return s.slice(-2).join('.');
It'll make your eyes bleed less than any regex solution.
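Putting the split/slice idea together, with a tiny hard-coded set of two-part suffixes standing in for the full public-suffix list. The TWO_PART set below is an assumption for illustration; a real application should consult publicsuffix.org:

```javascript
// Sketch: strip scheme and path, split the host on dots, and keep the
// last two labels -- or three when the last two form a known two-part
// suffix. TWO_PART is a deliberately tiny stand-in for publicsuffix.org.
var TWO_PART = { "co.uk": true, "com.au": true, "co.jp": true };

function rootDomain(url) {
  var host = url.replace(/^https?:\/\//, "").split("/")[0];
  var parts = host.split(".");
  var lastTwo = parts.slice(-2).join(".");
  return TWO_PART[lastTwo] ? parts.slice(-3).join(".") : lastTwo;
}
```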
I've not done a lot of testing on this, but if I understand what you're asking for, this should be a decent starting point...
([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))\b
EDIT:
To clarify, it's looking for:
- one or more alphanumeric characters or dashes, followed by a literal dot
- then one of three things:
  - three or more alpha characters (i.e. com/net/mil/coop, etc.)
  - two alpha characters, followed by a literal dot, followed by two more alphas (i.e. co.uk)
  - two alpha characters (i.e. us/uk/to, etc.)
- and at the end of that, a word boundary (\b), meaning the end of the string, a space, or a non-word character (in regex, word characters are typically alphanumerics and underscore).
As I say, I didn't do much testing, but it seemed a reasonable jumping off point. You'd likely need to try it and tune it some, and even then, it's unlikely that you'll get 100% for all test cases. There are considerations like Unicode domain names and all sorts of technically-valid-but-you'll-likely-not-encounter-in-the-wild things that'll trip up a simple regex like this, but this'll probably get you 90%+ of the way there.
If you have a limited subset of data, I suggest keeping the regex simple, e.g.
(([a-z\-]+)(?:\.com|\.fr|\.co\.uk))
This will match:
www.google.com --> google.com
www.google.co.uk --> google.co.uk
www.foo-bar.com --> foo-bar.com
In my case, I know that all relevant URLs will be matched using this regex.
Collect a sample dataset and test it against your regex. While prototyping, you can do that using a tool such as https://regex101.com/r/aG9uT0/1. In development, automate it with a test script.
([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))(?!\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))\b
This is an improvement upon theracoonbear's answer.
I did a quick bit of testing and noticed that if you give it a domain whose subdomain itself has a subdomain, it will fail. I also want to point out that the "90%" estimate was definitely not generous; it will be a lot closer to 100% than you think. It works on all subdomains of the top 50 most visited websites, which account for a huge chunk of worldwide internet activity. The only time it would potentially fail is with Unicode domains, etc.
My solution starts off working the same way that theracoonbear's does. Instead of checking for a word boundary, it uses a negative lookahead to check if there is not something that could be a TLD at the end (just copied the TLD checking part over into a negative lookahead).
Without testing the validity of the top-level domain, I'm using an adaptation of stormsweeper's solution:
var domain = 'sub.domains.are.cool.google.com';
var s = domain.split('.');
var tld = s.slice(-2).join('.');
EDIT: Be careful of issues with three part TLDs like domain.co.uk.
