JavaScript regex patterns to pick up URLs - javascript

To start off I know this is bad practice. I know there are libraries out there that are supposed to help with this; however, this is the task to which I was assigned and changing this whole thing to work with a library will be much more work than we can take on right now (since we are on a tight time frame).
In our web app we have fields that people usually type URLs into. We have been assigned a task to 'linkify' anything that looks like a URL. The people who wrote our app seem to have used a regex to determine if a string of text is a URL. I am basing my regex off that (I am no regex guru, not even a novice).
The 'search' regex looks like this:
function DoesTextContainLinks(linktText) {
//replace all urls with links!
var linkifyValue = /((ftp|https?):\/\/)?(www\.)?([a-zA-Z0-9\-]{1,}\.){1,}[a-zA-Z0-9]{1,4}(:[0-9]{1,5})?(\/[a-zA-Z0-9\-\_\.\?\&\#]{1,})*(\/)?$/.test(linktText);
return linkifyValue;
}
Using this regex and https://regex101.com/ I have come up with two regexes that work most of the time.
function WrapLinkTextInAnchorTag(linkText) {
//capture links that only have www and add http to the beginning of them (regex ignores entries that have http, https, and ftp in them. They are handled by the next regex)
linkText = linkText.replace(/(^(?:(?!http).)*^(?:(?!ftp).)(www\.)?([a-zA-Z0-9\-]{1,}\.){1,}[a-zA-Z0-9]{1,4}(:[0-9]{1,5})?(\/[a-zA-Z0-9\-\_\.\?\&\#]{1,})*(\/)?$)/gim, "<a href='http://$1'>$1</a>");
//capture links that have https and http on them and fix those too. No need to prepend http here
linkText = linkText.replace(/(((https|http|ftp?):\/\/)?(www\.)?([a-zA-Z0-9\-]{1,}\.){1,}[a-zA-Z0-9]{1,4}(:[0-9]{1,5})?(\/[a-zA-Z0-9\-\_\.\?\&\#]{1,})*(\/)?$)/gim, "<a href='$1'>$1</a>");
return linkText;
}
The problem here is that some complex URLs seem to not work. I can't understand exactly why they don't work. regex101 is pretty bad ass in that it tells you what each part is doing; however, my trouble is combining these keywords in the regex to get them to do what I want. I have two scenarios to account for: when a user types www.something.com | ftp.something.com and when a user actually types http://www.something.com.
I am looking for some help in pointing out exactly what is wrong with my 2 regexes that prevents them from capturing complicated URLs like the one below:
https://pw.something.com/AAPS/default.aspx?guid=a5741c35-6fe1-31a1-b555-4028e931642b

I use this one ...
^(http|https|ftp)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?\/?([a-zA-Z0-9\-\._\?\,\'\/\\\+&%\$#\=~])*$
Look here ... Regex Tester
URL RegExp that requires (http, https, ftp)://, a nice domain, and a decent file/folder string. Allows : after the domain name, and these characters in the file/folder string (letters, numbers, - . _ ? , ' / \ + & % $ # = ~). It blocks all other special characters and is good for protecting against user input!

If you look closely you will notice that nowhere in your regexps do you match an = character. That's what's breaking on the example you give.
Changing the second regexp by adding a \= to the characters supported in the path:
linkText.replace(/(((https|http|ftp?):\/\/)?(www\.)?([a-zA-Z0-9\-]{1,}\.){1,}[a-zA-Z0-9]{1,4}(:[0-9]{1,5})?(\/[a-zA-Z0-9\-\_\.\?\&\#\=]{1,})*(\/)?$)/gim, "<a href='$1'>$1</a>");
Causes your example URL to match. That said it may be worth slogging through the RFC on urls (http://www.ietf.org/rfc/rfc3986.txt) to find other characters that might be allowed in URLs (even if they have special meanings) because you're probably missing some others.
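Rather than whitelisting path characters one at a time, a looser pattern can accept any run of non-whitespace after the host. The following is only a minimal sketch, not the asker's code: URL_PATTERN and linkify are hypothetical names, and it assumes plain-text input in which a URL starts with a scheme or with www. (a bare ftp.something.com is not covered). It does not HTML-escape the URL and will swallow trailing punctuation.

```javascript
// Hypothetical sketch: URL_PATTERN and linkify are illustrative names, not from the original post.
// Assumption: a URL starts with a scheme or "www." and its path runs to the next
// whitespace, quote, or angle bracket.
var URL_PATTERN = /\b(?:(?:https?|ftp):\/\/|www\.)[a-z0-9\-]+(?:\.[a-z0-9\-]+)+(?::[0-9]{1,5})?(?:\/[^\s<>"']*)?/gi;

function linkify(text) {
  return text.replace(URL_PATTERN, function (match) {
    // Prepend http:// only when no scheme was typed (the bare "www." case).
    var href = /^(?:https?|ftp):\/\//i.test(match) ? match : 'http://' + match;
    return "<a href='" + href + "'>" + match + "</a>";
  });
}

console.log(linkify('https://pw.something.com/AAPS/default.aspx?guid=a5741c35-6fe1-31a1-b555-4028e931642b'));
console.log(linkify('see www.something.com for details'));
```

Using a replacement function instead of two separate replace() passes avoids the second pass re-matching URLs that the first pass already wrapped.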

Related

JS regex to get domain name from an email [duplicate]

How can I extract only top-level and second-level domain from a URL using regex? I want to skip all lower level domains. Any ideas?
Here's my idea,
Match anything that isn't a dot, three times, from the end of the line using the $ anchor.
The last match from the end of the string should be optional to allow for .com.au or .co.nz type of domains.
Both the last and second last matches will only match 2-3 characters, so that it doesn't confuse it with a second-level domain name.
Regex:
[^.]*\.[^.]{2,3}(?:\.[^.]{2,3})?$
Demonstration:
Regex101 Example
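As a quick illustration of how this pattern behaves, here is a small sketch. The helper name registrableDomain is hypothetical, and it assumes the input is a bare hostname (no scheme or path). The last call shows the kind of false positive the pattern cannot avoid.

```javascript
// The pattern from the answer above, applied with String.prototype.match.
var DOMAIN_PATTERN = /[^.]*\.[^.]{2,3}(?:\.[^.]{2,3})?$/;

// Hypothetical helper name; expects a bare hostname, not a full URL.
function registrableDomain(hostname) {
  var m = hostname.match(DOMAIN_PATTERN);
  return m ? m[0] : null;
}

console.log(registrableDomain('www.example.com'));   // 'example.com'
console.log(registrableDomain('www.example.co.uk')); // 'example.co.uk'
// A three-letter second-level label looks just like the "co" in "co.uk":
console.log(registrableDomain('www.bbc.com'));       // 'www.bbc.com'
```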
Updated 2019
This is an old question, and the challenge here is a lot more complicated as we start adding new vanity TLDs and more ccTLD second level domains (e.g. .co.uk, .org.uk). So much so, that a regular expression is almost guaranteed to return false positives or negatives.
The only way to reliably get the primary host is to call out to a service that knows about them, like the Public Suffix List.
There are several open-source libraries out there that you can use, like psl, or you can write your own.
Usage for psl is quite intuitive. From their docs:
var psl = require('psl');
// Parse domain without subdomain
var parsed = psl.parse('google.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'google'
console.log(parsed.domain); // 'google.com'
console.log(parsed.subdomain); // null
// Parse domain with subdomain
var parsed = psl.parse('www.google.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'google'
console.log(parsed.domain); // 'google.com'
console.log(parsed.subdomain); // 'www'
// Parse domain with nested subdomains
var parsed = psl.parse('a.b.c.d.foo.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'foo'
console.log(parsed.domain); // 'foo.com'
console.log(parsed.subdomain); // 'a.b.c.d'
Old answer
You could use this:
(\w+\.\w+)$
Without more details (a sample file, the language you're using), it's hard to discern exactly whether this will work.
Example: http://regex101.com/r/wD8eP2
Also, you can likely do that with an expression similar to
^(?:https?:\/\/)(?:w{3}\.)?.*?([^.\r\n\/]+\.)([^.\r\n\/]+\.[^.\r\n\/]{2,6}(?:\.[^.\r\n\/]{2,6})?).*$
and add as many capturing groups as you need to capture the components of a URL.
Demo
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
For anyone using JavaScript and wanting a simple way to extract the top and second level domains, I ended up doing this:
'example.aus.com'.match(/\.\w{2,3}\b/g).join('')
This matches anything with a period followed by two or three characters and then a word boundary.
Here's some example outputs:
'example.aus.com' // .aus.com
'example.austin.com' // .com (".austin" is longer than three characters, so it is skipped)
'example.aus.com/howdy' // .aus.com
'example.co.uk/howdy' // .co.uk
Some people might need something a bit cleverer, but this was enough for me with my particular dataset.
Edit
I've realised there are actually quite a few second-level domains which are longer than 3 characters (and allowed). So, again for simplicity, I just removed the character counting element of my regex:
'example.aus.com'.match(/\.\w*\b/g).join('')
Since TLDs now include names with more than three characters, like .wang and .travel, here's a regex that accommodates these new TLDs:
([^.\s]+\.[^.\s]+)$
Strategy: starting at the end of the string, look for one or more characters that aren't periods or whitespace, followed by a single period, followed by one or more characters that aren't periods or whitespace.
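A short sketch of that strategy in JavaScript follows. The helper name lastTwoLabels is hypothetical, and the input is assumed to be a bare hostname. Note that it always returns exactly two labels, so multi-part suffixes such as .co.uk are cut short.

```javascript
// The pattern from the answer above: the last two dot-separated labels.
var LAST_TWO_LABELS = /([^.\s]+\.[^.\s]+)$/;

// Hypothetical helper name; expects a bare hostname.
function lastTwoLabels(hostname) {
  var m = hostname.match(LAST_TWO_LABELS);
  return m ? m[1] : null;
}

console.log(lastTwoLabels('shop.example.travel')); // 'example.travel'
console.log(lastTwoLabels('www.example.co.uk'));   // 'co.uk'
```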
http://regexr.com/3bmb3
With capturing groups you can achieve some magic.
For example, consider the following JavaScript:
let hostname = 'test.something.else.be';
let domain = hostname.replace(/^.+\.([^\.]+\.[^\.]+)$/, '$1');
document.write(domain);
This will result in a string containing 'else.be'. This is because the regex matches the complete string and the capturing group is mapped to $1, so it replaces the complete string 'test.something.else.be' with '$1', which is 'else.be'.
The regex isn't pretty and can probably be made more dynamic with things like {3} for defining how many levels deep you want to look for subdomains, but this is just an illustration.
If you want to allow only specific top-level domain names, you can write a regular expression like this:
[RegularExpression("^(https?:\\/\\/)?(([\\w]+)?\\.?(\\w+\\.((za|zappos|zara|zero|zip|zippo|zm|zone|zuerich|zw))))\\/?$", ErrorMessage = "Is not a valid fully-qualified URL.")]
You can also add more domain names from this list:
https://www.icann.org/resources/pages/tlds-2012-02-25-en
The following regex matches a domain with root and tld extractions (named capture groups) from a url or domain string:
(?:\w+:\/{2})?(?<cs_domain>(?<cs_domain_sub>(?:[\w\-]+\.)*?)(?<cs_domain_root>[\w\-]+(?<cs_domain_tld>(?:\.\w{2})?(?:\.\w{2,3}|\.xn-+\w+|\.site|\.club))))\|
It's hard to say if it is perfect, but it works on all the test data sets that I have put it against including .club, .xn-1234, .co.uk, and other odd endings. And it does it in 5556 steps against 40k chars of logs, so the efficiency seems reasonable too.
If you need to be more specific:
/\.(?:nl|se|no|es|ru|fr|uk|ca|de|jp|au|us|ch|it|io|org|com|net|int|edu|mil|arpa)/
Based on http://www.seobythesea.com/2006/01/googles-most-popular-and-least-popular-top-level-domains/

javascript Reg Exp to match specific domain name

I have been trying to write a regex that matches URLs from a specific domain name.
So if I want to check whether a URL is from example.com,
what would be the best regex?
This reg exp should match following type of URLs:
http://api.example.com/...
http://preview.example.com/...
http://www.example.com/...
http://purhcase.example.com/...
Just simple rule, like http://{something}.example.com/{something} then should pass.
Thank you.
I think this is what you're looking for: (https?:\/\/(.+?\.)?example\.com(\/[A-Za-z0-9\-\._~:\/\?#\[\]@!$&'\(\)\*\+,;\=]*)?).
It breaks down as follows:
https?:\/\/ to match http:// or https:// (you didn't mention https, but it seemed like a good idea);
(.+?\.)? to match anything before the first dot (I made it optional so that, for example, http://example.com/ would be found);
example\.com (example.com, of course);
(\/[A-Za-z0-9\-\._~:\/\?#\[\]@!$&'\(\)\*\+,;\=]*)?: a slash followed by every acceptable character in a URL; I made this optional so that http://example.com (without the final slash) would be found.
Example: https://regex101.com/r/kT8lP2/1
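Because the pattern above is unanchored and (.+?\.)? can run past the host, it also matches a URL such as https://www.google.com/q=www.example.com. A stricter variant might look like the sketch below. EXAMPLE_COM and isExampleDotCom are hypothetical names, and it assumes only absolute http or https URLs are being checked.

```javascript
// Hypothetical, stricter variant: anchored at the start, and the subdomain part
// may not contain "/", "?", "#" or "@", so it cannot run past the host.
var EXAMPLE_COM = /^https?:\/\/(?:[^\/?#@]+\.)?example\.com(?::\d+)?(?:[\/?#]|$)/i;

function isExampleDotCom(url) {
  return EXAMPLE_COM.test(url);
}

console.log(isExampleDotCom('http://api.example.com/api/url'));           // true
console.log(isExampleDotCom('http://example.com'));                       // true
console.log(isExampleDotCom('http://notexample.com/'));                   // false
console.log(isExampleDotCom('https://www.google.com/q=www.example.com')); // false
```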
Use indexOf javascript API. :)
var url = 'http://api.example.com/api/url';
var testUrl = 'example.com';
if(url.indexOf(testUrl) !== -1) {
console.log('URL passed the test');
} else{
console.log('URL failed the test');
}
EDIT:
Why use indexOf instead of a regular expression?
You see, what you have here for matching is a simple string (example.com), not a pattern. If you have a fixed string, there is no need to introduce semantic complexity by checking for patterns.
Regular expressions are best suited for deciding whether patterns are matched.
For example, suppose your requirement were that the domain name should start with ex, end with le, and contain alphanumeric characters in between, of which 4 must be upper case. That is the use case where a regular expression would prove beneficial.
You have a simple problem, so it's unnecessary to employ an army of 1000 angels to convince someone who loves you. ;)
Use this:
/^[a-zA-Z0-9_.+-]+@(?:(?:[a-zA-Z0-9-]+\.)?[a-zA-Z]+\.)?(domain|domain2)\.com$/g
To match the specific domain of your choice.
If you want to match only one domain then remove |domain2 from (domain|domain2) portion.
It will help you. https://www.regextester.com/94044
Not sure if this would work for your case, but it would probably be better to rely on the built in URL parser vs. using a regex.
var url = document.createElement('a');
url.href = "http://www.example.com/thing";
You can then read those values using the properties the API gives you:
url.protocol // (http:)
url.host // (www.example.com)
url.pathname // (/thing)
If that doesn't help you, something like this could work, but is likely too brittle:
var url = "http://www.example.com/thing";
var matches = url.match(/:\/\/(.[^\/]+)(.*)/);
// matches would return something like
// ["://www.example.com/thing", "www.example.com", "/thing"]
These posts could also help:
https://stackoverflow.com/a/3213643/4954530
https://stackoverflow.com/a/6168370
Good luck out there!
There are cases where the domain you're looking for could actually be found in the query section but not in the domain section: https://www.google.com/q=www.example.com
This answer would treat that case better.
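One way to sidestep the query-string problem is to parse the URL first and compare only the hostname. The sketch below uses the standard URL constructor (available in modern browsers and Node.js); hostBelongsTo is a hypothetical helper name, and it assumes the input is an absolute URL.

```javascript
// Hypothetical helper: returns true when the URL's host is the domain itself
// or one of its subdomains. Anything the URL parser rejects counts as false.
function hostBelongsTo(urlString, domain) {
  var hostname;
  try {
    hostname = new URL(urlString).hostname;
  } catch (e) {
    return false; // not an absolute URL
  }
  return hostname === domain || hostname.endsWith('.' + domain);
}

console.log(hostBelongsTo('http://api.example.com/api/url', 'example.com'));           // true
console.log(hostBelongsTo('https://www.google.com/q=www.example.com', 'example.com')); // false
```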
See this example on regex101.
As you pointed out, you only need example.com (write the domain, then an escaped period, then com), so use that in the regex.
Example
UPDATED
See the answer below

How to validate my URL efficiently using JavaScript?

My regex successfully validates many URLs except http://www.google
Here's my URL validator in JSFiddle: http://jsfiddle.net/z23nZ/2/
It correctly validates the following URLs:
http://www.google.com gives True
www.google.com gives True
http://www.rootsweb.ancestry.com/~mopoc/links.htm gives True
http:// www. gives False
...but not this one:
http://www.google gives True
It's not correct to return true in this case. How can I validate that case?
I think you need to way simplify this. There are plenty of URL validation RegExes out there, but as an exercise, I'll go through my thought process for constructing one.
First, you need to match a protocol if there is one: /((http|ftp)s?:\/\/)?
Then match any series of non-whitespace characters: \S+
If you're trying to pick out URLs from text, you'll want to look for signs that it is a URL. Look for dots or slashes, then more non-whitespace: [\.\/]\S*/
Now put it all together:
/(((http|ftp)s?:\/\/)|(\S+[\.\/]))\S*[^\s\.]*/
I'm guessing that you're attempting to match www.google because of the new TLDs... the fact is, such URLs might just look like google, and so any word could be a URL. Coming up with a catch-all regex which matches valid URLs and nothing else isn't possible, so you're best off going with something simple like the above.
Edit: I've stuck a | in there between the protocol part and the non-whitespace-then-dot-or-slash part to match http://google if people choose to write new URLs like that
Edit 2: See comments for the next improvement. It makes sure google.com matches, http://google matches, and even google/ matches, but not a..
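To see what the combined pattern accepts as written here (before the further improvement mentioned in Edit 2, which is not shown), a few inputs can be run through it. The variable name LOOKS_LIKE_URL is hypothetical.

```javascript
// The combined pattern from the answer above, unchanged.
var LOOKS_LIKE_URL = /(((http|ftp)s?:\/\/)|(\S+[\.\/]))\S*[^\s\.]*/;

console.log(LOOKS_LIKE_URL.test('http://google')); // true
console.log(LOOKS_LIKE_URL.test('google.com'));    // true
console.log(LOOKS_LIKE_URL.test('google/'));       // true
console.log(LOOKS_LIKE_URL.test('hello world'));   // false
// Without the later improvement, a lone "a." still passes:
console.log(LOOKS_LIKE_URL.test('a.'));            // true
```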

Matching hostname on string when it has no protocol://?

I use this js code to match a hostname from a string:
url.match(/:\/\/(www\.)?(.[^/:]+)/);
This works when the url has protocol:// at the beginning. For example:
This works fine:
var url = "http://domain.com/page";
url.match(/:\/\/(www\.)?(.[^/:]+)/);
But this doesn't:
var url = "domain.com/page";
url.match(/:\/\/(www\.)?(.[^/:]+)/);
I have tried:
url.match(/(:\/\/)?(www\.)?(.[^/:]+)/);
And that matches the hostname fine when the URL doesn't contain protocol://, but when it does contain it, it only returns the protocol and not the hostname.
How could I match the domain when the URL doesn't contain the protocol?
I used this function from Steven Levithan, it parses urls quite decently.
Here's how you use this function
alert(parseUri("www.domain.com/foo").host)
OK, before you have a brain meltdown from @xanatos's answer, here is a simple regex for basic needs. The other answers are more complete and handle more cases than this regex:
(?:(?:(?:\bhttps?|ftp)://)|^)([-A-Z0-9.]+)/
Group 1 will have your host name. URL parsing is a fragile thing to do with regexes. You were on the right track. You had two regexes that worked partially. I simply combined them.
Edit : I was tired yesterday night. Here is the regex for jscript
if (subject.match(/(?:(?:(?:\bhttps?|ftp):\/\/)|^)([\-a-z0-9.]+)\//i)) {
// Successful match
} else {
// Match attempt failed
}
This
var rx = /^(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?(?:\w+:\w+@)?(?:(?:[-\w]+\.)+(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?::[\d]{1,5})?(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/)+|\?|#)?(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?$/;
should be the uber-url parsing regex :-) Taken from here http://flanders.co.nz/2009/11/08/a-good-url-regular-expression-repost/
Test here: http://jsfiddle.net/Qznzx/1/
It shows the uselessness of regexes.
This might be a bit more complex than necessary but it seems to work:
^((?:.+?:\/\/)?(?:.[^/:]+)+)$
A non-capturing group for the protocol. From the start of the string match any number of characters until a :. There may be zero or one protocol.
A non-capturing group for the rest of the url. This part must exist.
Group it all up in single group.
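The asker's attempt with (:\/\/)? fails because the pattern is unanchored and every leading part is optional, so the match starts at index 0 and the last group captures "http" up to the colon. A possible alternative, sketched below, anchors at the start and makes the whole scheme optional. HOST_PATTERN and getHost are hypothetical names and not one of the answers above.

```javascript
// Hypothetical sketch: optional "scheme://", optional "www.", then the host
// up to the first "/", ":", "?" or "#".
var HOST_PATTERN = /^(?:[a-z][a-z0-9+.\-]*:\/\/)?(?:www\.)?([^\/:?#]+)/i;

function getHost(url) {
  var m = url.match(HOST_PATTERN);
  return m ? m[1] : null;
}

console.log(getHost('http://domain.com/page'));        // 'domain.com'
console.log(getHost('domain.com/page'));               // 'domain.com'
console.log(getHost('https://www.domain.com:8080/x')); // 'domain.com'
```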

JavaScript Regexp to Wrap URLs and Emails in anchors

I searched high and low but cannot seem to find a definitive answer to this, as is often the case with regexps. So I thought I'd ask here.
I'm trying to put together a regular expression I can use in JavaScript to replace all instances of URLs and email addresses (it doesn't need to be ever so strict) with anchor tags pointing to them.
Obviously this is something usually done very simply on the server side, but in this case it is necessary to work with plain text, so an elegant JavaScript solution to perform the replacements at runtime would be perfect.
The only problem is, as I've stated before, I have a huge regular-expression-shaped gaping hole in my skill set :(
I know that one of you has the answer at the tip of your fingers though :)
Well, blindly using regexps from http://www.osix.net/modules/article/?id=586
var emailRegex =
new RegExp(
// backslashes are doubled because the pattern is built from string literals
'([a-zA-Z0-9_\\-\\.]+)@((\\[[0-9]{1,3}' +
'\\.[0-9]{1,3}\\.[0-9]{1,3}\\.)|(([a-zA-Z0-9\\-]+\\.' +
')+))([a-zA-Z]{2,4}|[0-9]{1,3})(\\]?)',
"gi");
var urlRegex =
new RegExp(
'((https?://)' +
'?(([0-9a-z_!~*\'().&=+$%-]+: )?[0-9a-z_!~*\'().&=+$%-]+@)?' + //user@
'(([0-9]{1,3}\\.){3}[0-9]{1,3}' + // IP- 199.194.52.184
'|' + // allows either IP or domain
'([0-9a-z_!~*\'()-]+\\.)*' + // tertiary domain(s)- www.
'([0-9a-z][0-9a-z-]{0,61})?[0-9a-z]\\.' + // second level domain
'[a-z]{2,6})' + // first level domain- .com or .museum
'(:[0-9]{1,4})?' + // port number- :80
'((/?)|' + // a slash isn't required if there is no file name
'(/[0-9a-z_!~*\'().;?:@&=+$,%#-]+)+/?))',
"gi");
then
text.replace(emailRegex, "<a href='mailto:$&'>$&</a>");
and
text.replace(urlRegex, "<a href='$1'>$1</a>");
might work
Not a canned solution, but this will point you in the right direction.
I use Regex Coach to build and test my regexes. You can find plentiful examples of regexes for urls and email addresses online.
Here's a good article for urls...
https://blog.codinghorror.com/the-problem-with-urls/
Emails are more straightforward since they have to end in a .tld
You don't need to get fancy with that one since you're not validating, just matching, so off the top of my head...
[^\s]+@\w[\w-.]*\.[a-zA-Z]+
As always, this ("this" being "processing HTML with regex") is going to be difficult and error-prone. The following will work on reasonably well-formed input only, but here's what I would do:
find the element you want to process, take its innerHTML property value
iteratively find everything that already is a link (/<a\b.+?<\/a>/ig)
based on that, cut your string into "this isn't a link"- and "this is a link"-bits, appending all of them to a neatly ordered array
process the "non-link" bits only (those that don't begin with "<a "), looking for URL- or e-mail-address patterns
wrap every address you find in <a> tags
join() the array back to a string
set the innerHTML property to your new value
I am sure you will find regular expression examples that match e-mail addresses and URLs. Take the ones that suit you most, and use them in step 4.).
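The steps above can be sketched roughly as follows. This version works on an HTML string rather than on innerHTML directly, the names EXISTING_LINK, ADDRESS and linkifyHtml are hypothetical, and the URL and e-mail patterns are deliberately simple placeholders. It assumes existing <a> elements are the only markup that contains addresses.

```javascript
// Hypothetical sketch of the steps above; the patterns are simple placeholders.
var EXISTING_LINK = /(<a\b[^>]*>[\s\S]*?<\/a>)/gi;
var ADDRESS = /\b(https?:\/\/[^\s<>"']+)|([^\s<>@]+@\w[\w.\-]*\.[a-zA-Z]+)/gi;

function linkifyHtml(html) {
  // split() with a capturing group keeps the existing links as their own array items.
  var parts = html.split(EXISTING_LINK);
  for (var i = 0; i < parts.length; i++) {
    if (/^<a\b/i.test(parts[i])) continue; // already a link: leave it alone
    parts[i] = parts[i].replace(ADDRESS, function (match, url, email) {
      // Exactly one of url/email is defined, depending on which alternative matched.
      var href = url ? url : 'mailto:' + email;
      return '<a href="' + href + '">' + match + '</a>';
    });
  }
  return parts.join('');
}

console.log(linkifyHtml('Mail bob@example.com or see <a href="http://a.com">http://a.com</a> and http://b.org/x'));
```

In a page, the result would be assigned back to the element's innerHTML, as in the last step of the list.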
Just adding a bit of information on email regexps: most of them seem to ignore that domain names can have the characters 'åäö' in them. So if you care about that, make sure that the solution you are using has åäöÅÄÖ in the domain part of the regexp.
