Regular Expression problem - javascript

I need a regular expression for JavaScript that will get "jones.com/ca" from "Hello we are jones.com/ca in Tampa". The "jones.com/ca" could have any web URL extension (for example .net, .co, .gov) and any name. So the regular expression needs to find every instance of, say, ".com", together with all the text back to the preceding white space or beginning of line and forward to the next white space or end of line (minus any ending punctuation).
Right now I have as an example line "jones.com/ca some text", and using the JavaScript regular expression "\\(.+?^\\s).com?([^\\s]+)?\\" all I get is ".com/ca" as the output.

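This example will capture the specific domains com, org and gov:
\b\w+\.(?:com|org|gov)/[a-z]{2}\b
And this will capture almost any domain with a two- or three-letter extension:
\b\w+\.[a-z]{2,3}/[a-z]{2}\b
Both expect a slash and a two-letter path such as /ca after the domain, and both use word boundaries so that they do not capture the surrounding white space.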
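As a rough check, the two patterns can be run against the sample sentence from the question. This is only a sketch: the helper name firstMatch is made up for illustration, and the slash is escaped because the patterns are written here as JavaScript regex literals.

```javascript
// The two patterns from the answer, written as JavaScript regex literals.
const specific = /\b\w+\.(?:com|org|gov)\/[a-z]{2}\b/;
const generic = /\b\w+\.[a-z]{2,3}\/[a-z]{2}\b/;

// Hypothetical helper: returns the first match as a string, or null.
function firstMatch(re, text) {
  const m = text.match(re);
  return m ? m[0] : null;
}

const sample = "Hello we are jones.com/ca in Tampa";
console.log(firstMatch(specific, sample));
console.log(firstMatch(generic, "Visit jones.net/ca."));
```

Because of the trailing \b, a full stop right after the path is left out of the match.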

Matching URLs is a bit of a dark art. The following site has a fairly well-designed regex for this purpose: http://daringfireball.net/2010/07/improved_regex_for_matching_urls

A comprehensive regex for this is going to be much more complicated than you think. The list of top-level domains is fairly long (.gov, .info, .edu, .museum, etc.), and there are "special" domains like localhost as well. Also, many domains end in a two-letter country abbreviation (google.com.br for Google Brazil, for example, or del.icio.us).
The easiest thing would be to look for http(s):// or www at the beginning and just assume what comes after is a domain name. If you don't, you're going to either miss a lot, or get a lot of false positives.
You could try the following, but the last option (after the last |) is going to be open to a significant number of false positives:
/https?:\/\/\S+|www\.\S+|([-a-z0-9_]+\.)+(com|org|edu|gov|mil|info|[a-z]{2})(\/\S*)?|([-a-z0-9_]+\.)+[-a-z0-9_]+\/\S*/ig
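A minimal sketch of how that pattern might be used, assuming you want all matches from a piece of text with trailing sentence punctuation removed. The function name findUrlLike and the punctuation set are assumptions for illustration, not part of the answer above.

```javascript
// The pattern from the answer, unchanged.
const URL_LIKE = /https?:\/\/\S+|www\.\S+|([-a-z0-9_]+\.)+(com|org|edu|gov|mil|info|[a-z]{2})(\/\S*)?|([-a-z0-9_]+\.)+[-a-z0-9_]+\/\S*/ig;

// Hypothetical helper: collect every match and trim trailing punctuation,
// since \S+ and \S* happily swallow a full stop or comma that ends a sentence.
function findUrlLike(text) {
  const matches = text.match(URL_LIKE) || [];
  return matches.map((m) => m.replace(/[.,;:!?]+$/, ""));
}

console.log(findUrlLike("Hello we are jones.com/ca in Tampa"));
console.log(findUrlLike("Stop.It works")); // a false positive from the [a-z]{2} branch
```

The second call illustrates the false-positive risk: a missing space after a full stop is enough for the two-letter branch to fire.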

Related

How can I prevent extra characters at the end of my string in Regex?

Background
I'm working on a JavaScript application where users have to use a specific email domain to sign up (either @a.com or @b.com; anything else gets rejected).
I've created a regex that makes sure the user doesn't enter @a.com with nothing in front of it and limits users to only @a.com and @b.com. The last step is to make sure the user doesn't add extra characters to the end of @a.com by entering something like @a.com.gmail.com.
This is the regex I currently have:
\b[a-zA-Z0-9\.]*@(a.com|b.com)
Question
What can I use at the end to prevent anything from being added after a.com or b.com? I'm very novice at regex and have no idea where to start.
To solve your problem, add $ to the end of the regex. $ means the match must end at the end of the string.
You can also reduce (a.com|b.com) to [ab]\.com. Note that I've also escaped the dot, because an unescaped dot matches any character.
The character class [ab] means that exactly one of its characters must be matched.
As stated in the comments, be sure to use the (m)ultiline flag when the input holds one address per line; this way $ matches at the end of each line and the regex engine will treat each line as a separate string.
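Putting those suggestions together, a validation check could look roughly like the sketch below. It assumes addresses are written with the usual @ separator. The leading ^ and the + quantifier are my additions (so that at least one character is required before the separator), and the function name isAllowedEmail is hypothetical.

```javascript
// Assumed pattern: ^ and + are additions that anchor the start of the input
// and require at least one character before the @; [ab]\.com and $ come
// from the answer above.
const ALLOWED_EMAIL = /^[a-zA-Z0-9.]+@[ab]\.com$/;

// Hypothetical helper: true only for addresses at a.com or b.com.
function isAllowedEmail(address) {
  return ALLOWED_EMAIL.test(address);
}

console.log(isAllowedEmail("user@a.com"));
console.log(isAllowedEmail("user@a.com.gmail.com"));
```

No multiline flag is used here because the function checks a single address at a time.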

Trying to make URL-Matching RegEx faster for an IRC bot

Hello fellow programmers, long time lurker here xD
So I was writing this IRC bot on Node.js, and one of the main functionalities is to automatically timeout users that post links without having permissions.
After much testing and researching I came up with this regex that would match almost any URLs, considering that users will often try to circumvent the bot to post links without permission.
/((?!\w+\.+\s\w+\b)\w+\W*(\.|dot|d0t)\W*(aero|asia|biz|cat|com|coop|info|int|jobs|mobi|museum|name|net|org|post|pro|tel|travel|xxx|edu|gov|mil|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b)/i
It takes into consideration users adding spaces around dots, replacing dots with "dot", or adding special characters around the dots, while ignoring matches when users type something like "word. It was good" (where "It" happens to be a valid URL extension).
This regex takes care of almost any cases of users trying to circumvent the url protection, while matching almost no false positives, however my concern is that it might be a bit slow.
Does anyone know of a better regex that has the same function that runs faster or maybe know how to make improvements for this one to run faster?
Regex explanation:
Full regex:
/((?!\w+\.+\s\w+\b)\w+\W*(\.|dot|d0t)\W*(aero|asia|biz|cat|com|coop|info|int|jobs|mobi|museum|name|net|org|post|pro|tel|travel|xxx|edu|gov|mil|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b)/i
Groups:
(?!\w+\.+\s\w+) - Negative lookahead - Checks whether the user typed a word (\w+) followed by one or more dots (\.+) and a space (\s), and if so, whether the next characters are a word (\w+). If this group matches, the user is most likely ending a sentence with a full stop or ellipsis and starting another sentence, so the regex shouldn't match, even if the second sentence starts with a possible URL extension such as "is" or "so". The negative lookahead therefore stops the URL matching in that case.
\w+ - A word - this is the first part of the url considering a url such as google.com (this ignores the url protocol, if present, and the first part of the url, usually www, since our goal is just to detect urls, and not actually extract them for some other purpose).
\W*(\.|dot|d0t)\W* - Any number of non-alphanumerical characters followed by a dot (or ways to circumvent dot) followed by any number of non-alphanumerical characters - This prevents users from circumventing the filter by typing urls such as google(dot)com as well as spacing between the url words and the dots such as google . com.
(aero|asia|biz|cat|com|coop|info|int|jobs|mobi|museum|name|net|org|post|pro|tel|travel|xxx|edu|gov|mil|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw) - Matches any possible url domain extension - Not much to say here, this prevents false positives from users that do weird punctuations such as "phrase . Next phrase"
\b - A word boundary (the position between a word character and a non-word character, or the end of the string)
Edit: I've made the obvious improvement from (ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au...) to (a[cdefgilmnoqrstuwxz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnorsuvxyz]|d[dejkmoz]|e[ceghrstu]|f[ijkmor]|g[abdefghilmnpqrstuwy]|h[kmnrtu]|i[delmnoqrst]|j[emop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdeghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrstwy]|qa|r[eosuw]|s[abcdeghijklmnorstuvxyz]|t[cdfghjklmnoprtvwz]|u[agksyz]|v[aceginu]|w[fs]|y[etu]|z[amw]). Does anyone know of any more improvements, or a better way to do this?
Thanks in advance,
Gabriel.
Possibly faster:
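domainExtTable = { aero: true, asia: true, biz: true, ... }; // init just once
results = text.match(/((?!\w+\.+\s\w+\b)\w+\W*(\.|dot|d0t)\W*(\w{2,6})\b)/i); // {2,6} so that "museum" and "travel" fit
// group 1 is the whole match, group 2 the dot, group 3 the extension
domainExt = results ? results[3].toLowerCase() : null;
if (domainExt && domainExtTable.hasOwnProperty(domainExt)) { ... } // this is a match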
It is hard to say; it depends on how good the regexp compiler is.
Removing the lookahead is likely to speed this up much more. Just to be sure, you want to NOT match "google. com"?
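A slightly fuller sketch of the lookup-table idea, under some assumptions: the extension list is a deliberately short, hypothetical sample (the full list from the question would go there), the names DOMAIN_EXTENSIONS, CANDIDATE and containsLink are made up, and every candidate in the message is checked rather than only the first. Whether this is actually faster than the single big alternation would need to be measured.

```javascript
// Hypothetical, abbreviated extension list; the question's full list would go here.
const DOMAIN_EXTENSIONS = new Set(["com", "net", "org", "io", "it", "is", "so"]);

// Same shape as the question's regex, but the extension is captured generically
// (group 1) and checked against the set afterwards.
const CANDIDATE = /(?!\w+\.+\s\w+\b)\w+\W*(?:\.|dot|d0t)\W*(\w{2,6})\b/gi;

// Returns true if any word/dot/word candidate ends in a known extension.
function containsLink(text) {
  for (const match of text.matchAll(CANDIDATE)) {
    if (DOMAIN_EXTENSIONS.has(match[1].toLowerCase())) {
      return true;
    }
  }
  return false;
}

console.log(containsLink("check google(dot)com"));
console.log(containsLink("It was good. It was fun"));
```

Looping over all candidates matters because an earlier non-link candidate, such as a version number like 1.25, would otherwise hide a real link later in the same message.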

Email Regular Expression - Excluded Specified Set

I have been researching a regular expression for the better part of about six hours today. For the life of me, I can not figure it out. I have tried what feels like about a hundred different approaches to no avail. Any help is greatly appreciated!
The basic rules:
1 - Exclude these characters in the address portion (before the @ symbol): "()<>@,;:\[]*&^%$#!{}/"
2 - The address can contain a ".", but not two in a row.
I have an elegant solution to rule number one; however, rule number two is killing me! Here is what I have so far. (I'm only including the portion up to the @ sign to keep it simple.) Also, it is important to note that this regular expression is being used in JavaScript, so no conditional IF is allowed.
/^[^()<>@,;:\\[\]*&^%$#!{}//]+$/
First of all, I would suggest you always choose which characters you want to allow instead of the opposite; you never know what dangerous characters you might miss.
Secondly, this is the regular expression I always use for validating emails and it works perfectly. Hope it helps you out.
/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$/i
Rule number 2
/^(?:\.?[^.])+\.?$/
which means one or more sequences of (an optional dot followed by a mandatory non-dot), with an optional dot at the end.
Consider four two-character sequences:
xx matches as two non dot characters.
.x matches as an optional dot followed by a non-dot.
x. matches as a non-dot followed by an optional dot at the end.
.. does not match because there is no non-dot after the first dot.
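The four cases above can be checked directly. This is just a sketch, and the helper name hasNoDoubleDot is made up for illustration.

```javascript
// The "no two dots in a row" pattern from the answer, unchanged.
const NO_DOUBLE_DOT = /^(?:\.?[^.])+\.?$/;

// Hypothetical helper wrapping the test.
function hasNoDoubleDot(s) {
  return NO_DOUBLE_DOT.test(s);
}

for (const sample of ["xx", ".x", "x.", "..", "a.b.c", "a..b"]) {
  console.log(JSON.stringify(sample), hasNoDoubleDot(sample));
}
```

Note that this pattern only enforces rule number two; the excluded-character rule still has to be checked separately or folded into the [^.] class.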
One thing to remember about email addresses is that dots can appear in tricky places:
"..@"@.example.com
is a valid email address.
The "..@" is a perfectly valid quoted local-part production, and .example.com is just a way of saying example.com but resolved against the root DNS instead of using a host search path. example.com might resolve to example.com.myintranet.com if myintranet.com is on the host search path but .example.com always resolves to the absolute host example.com.
First of all, to your specifications:
^(?![\s\S]*\.\.)[^()<>@,;:\\[\]*&^%$#!{}/]+@.*$
It's just your regex with (?![\s\S]*\.\.) tacked onto the front, plus the @ and the rest of the address at the end. That's a negative lookahead, which prevents a match if there are two consecutive periods anywhere in the string.
Properly matching email addresses is quite a bit harder, however.
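As a rough illustration of the lookahead approach, the sketch below applies both rules at once. It assumes the address uses the usual @ separator and that one or more allowed characters must precede it. The constant and function names are hypothetical, and the slash is escaped because the pattern is written as a JavaScript regex literal.

```javascript
// Negative lookahead rejects ".." anywhere; the character class rejects the
// excluded characters in the part before the @.
const ADDRESS_RULES = /^(?![\s\S]*\.\.)[^()<>@,;:\\[\]*&^%$#!{}\/]+@.*$/;

// Hypothetical helper wrapping the test.
function passesBasicRules(address) {
  return ADDRESS_RULES.test(address);
}

console.log(passesBasicRules("john.doe@example.com"));
console.log(passesBasicRules("john..doe@example.com"));
```

Because the lookahead scans the whole string, two consecutive dots after the separator are rejected as well, which may or may not be what you want.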

Getting parts of a URL in JavaScript

I have to match URLs in a text, linkify them, and then display only the host--domain name or IP address--to the user. How can I proceed with JavaScript?
Thanks.
PS: please don't tell me about this; those regular expressions are so buggy they can't match http://google.com
If you don't want to use regular expressions, then you'll need to use things like indexOf and such instead. For instance, search for "://" in the text of every element and if you find it and the bit in front of it looks like a protocol (or "scheme"), grab it and the following characters that are valid URI characters (RFC2396). If the result ends in a dot or question mark, remove the dot or question (it probably ends a sentence). There's not really a lot more to say.
Update: Ah, I see from your edit that you don't have a problem with regular expressions, just the ones in the answers to that question. Fair enough.
This may well be one of those places where trying to do it all with a regular expression is more work than it should be, but using regular expressions as part of the solution is helpful. For instance,
/[a-zA-Z][a-zA-Z0-9+\-.]*:\/\//
...may well be a helpful way to find the beginning of a URL, since the scheme portion must start with an alpha and then can have zero or more alpha, digit, +, -, or . prior to the : (section 3.1).
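One way to combine these ideas, sketched under a few assumptions: the scheme pattern above finds the start of a candidate, everything up to the next white space (or <, >, ") is taken as the URL, trailing sentence punctuation is stripped, and the built-in URL class (available in browsers and Node.js) does the actual parsing. The function name extractHosts and the exact punctuation set are made up for illustration.

```javascript
// Scheme pattern from the answer, followed by a run of non-space characters.
const URL_CANDIDATE = /[a-zA-Z][a-zA-Z0-9+\-.]*:\/\/[^\s<>"]+/g;

// Hypothetical helper: returns the host of every URL-like candidate in the text.
function extractHosts(text) {
  const hosts = [];
  for (const [raw] of text.matchAll(URL_CANDIDATE)) {
    // A trailing dot, comma, question mark etc. most likely ends a sentence.
    const candidate = raw.replace(/[.,;:!?)]+$/, "");
    try {
      hosts.push(new URL(candidate).hostname);
    } catch (e) {
      // Not parseable as a URL; skip it.
    }
  }
  return hosts;
}

console.log(extractHosts("See http://google.com. Also https://192.168.0.1:8080/path?x=1, ok?"));
```

The same loop could build the link markup, using the full candidate as the href and the hostname as the visible text.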

Javascript/Regex for finding just the root domain name without sub domains

I had a search and found lots of similar regex examples, but not quite what I need.
I want to be able to pass in the following urls and return the results:
www.google.com returns google.com
sub.domains.are.cool.google.com returns google.com
doesntmatterhowlongasubdomainis.idont.wantit.google.com returns google.com
sub.domain.google.com/no/thanks returns google.com
Hope that makes sense :)
Thanks in advance!-James
You can't do this with a regular expression because you don't know how many blocks are in the suffix.
For example google.com has a suffix of com. To get from subdomain.google.com to google.com you'd have to take the last two blocks - one for the suffix and one for google.
If you apply this logic to subdomain.google.co.uk though you would end up with co.uk.
You will actually need to look up the suffix from a list like http://publicsuffix.org/
Don't use regex, use the .split() method and work from there.
var s = domain.split('.');
If your use case is fairly narrow you could then check the TLDs as needed, and then return the last 2 or 3 segments as appropriate:
return s.slice(-2).join('.');
It'll make your eyes bleed less than any regex solution.
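A minimal sketch of the split approach, assuming the input has no scheme prefix (as in the question's examples). The suffix set is a deliberately tiny, hypothetical sample. A real implementation would load the full list from publicsuffix.org. The names MULTI_LABEL_SUFFIXES and rootDomain are made up for illustration.

```javascript
// Hypothetical, deliberately tiny sample of two-label suffixes.
const MULTI_LABEL_SUFFIXES = new Set(["co.uk", "com.br", "com.au"]);

function rootDomain(input) {
  // Drop any path, then split the host into labels.
  const host = input.split("/")[0].toLowerCase();
  const labels = host.split(".");
  const lastTwo = labels.slice(-2).join(".");
  // Keep three labels when the last two form a known suffix, otherwise two.
  const keep = MULTI_LABEL_SUFFIXES.has(lastTwo) ? 3 : 2;
  return labels.slice(-keep).join(".");
}

console.log(rootDomain("sub.domains.are.cool.google.com"));
console.log(rootDomain("subdomain.google.co.uk"));
```

Any suffix missing from the set falls back to the last two labels, so the quality of the result depends entirely on the list.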
I've not done a lot of testing on this, but if I understand what you're asking for, this should be a decent starting point...
([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))\b
EDIT:
To clarify, it's looking for:
one or more alpha-numeric characters or dashes, followed by a literal dot
and then one of three things...
three or more alpha characters (i.e. com/net/mil/coop, etc.)
two alpha characters, followed by a literal dot, followed by two more alphas (i.e. co.uk)
two alpha characters (i.e. us/uk/to, etc)
and at the end of that, a word boundary (\b) meaning the end of the string, a space, or a non-word character (in regex word characters are typically alpha-numerics, and underscore).
As I say, I didn't do much testing, but it seemed a reasonable jumping off point. You'd likely need to try it and tune it some, and even then, it's unlikely that you'll get 100% for all test cases. There are considerations like Unicode domain names and all sorts of technically-valid-but-you'll-likely-not-encounter-in-the-wild things that'll trip up a simple regex like this, but this'll probably get you 90%+ of the way there.
If you have limited subset of data, I suggest to keep the regex simple, e.g.
(([a-z\-]+)(?:\.com|\.fr|\.co\.uk))
This will match:
www.google.com --> google.com
www.google.co.uk --> google.co.uk
www.foo-bar.com --> foo-bar.com
In my case, I know that all relevant URLs will be matched using this regex.
Collect a sample dataset and test it against your regex. While prototyping, you can do that using a tool such as https://regex101.com/r/aG9uT0/1. In development, automate it using a test script.
([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))(?!\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))\b
This is an improvement upon theracoonbear's answer.
I did a quick bit of testing and noticed that if you give it a domain where the subdomain has a subdomain, it will fail. I also wanted to point out that the "90%" was definitely not generous. It will be a lot closer to 100% than you think. It works on all subdomains of the top 50 most visited websites which accounts for a huge chunk of worldwide internet activity. The only time it would fail is potentially with unicode domains, etc.
My solution starts off working the same way that theracoonbear's does. Instead of checking for a word boundary, it uses a negative lookahead to check if there is not something that could be a TLD at the end (just copied the TLD checking part over into a negative lookahead).
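To see the lookahead version in action, here is a small sketch that builds the pattern from a shared TLD fragment so the two copies cannot drift apart. The names TLD, ROOT_DOMAIN and rootDomainByRegex are hypothetical, the groups are made non-capturing for brevity, and every character class is written as [A-Za-z].

```javascript
// Shared fragment: 3+ letters, or two letters + dot + two letters, or two letters.
const TLD = "(?:[A-Za-z]{3,}|[A-Za-z]{2}\\.[A-Za-z]{2}|[A-Za-z]{2})";

// A label plus TLD that is NOT followed by another dot + TLD.
const ROOT_DOMAIN = new RegExp("[A-Za-z0-9-]+\\." + TLD + "(?!\\." + TLD + ")\\b");

// Hypothetical helper: returns the matched root domain, or null.
function rootDomainByRegex(input) {
  const m = input.match(ROOT_DOMAIN);
  return m ? m[0] : null;
}

console.log(rootDomainByRegex("sub.domains.are.cool.google.com"));
console.log(rootDomainByRegex("www.google.co.uk"));
```

The pattern still guesses at suffixes by shape alone: a name such as google.com.br yields com.br, because "com" followed by ".br" looks like a label followed by a TLD.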
Without testing the validity of top level domain, I'm using an adaptation of stormsweeper's solution:
domain = 'sub.domains.are.cool.google.com'
s = domain.split('.')
tld = s.slice(-2..-1).join('.')
EDIT: Be careful of issues with three part TLDs like domain.co.uk.
