JS Regex: Parse urls with conditions - javascript

I had a requirement to parse a set of urls and extract specific elements from them under special conditions. To explain it further, consider a set of urls:
http://www.example.com/appName1/some/extra/parts/keyword/rest/of/the/url
http://www.somewebsite.com/appName2/some/extra/parts/keyword/rest/of/the/url
http://www.someothersite.com/appname3/rest/of/the/url
As you can see, there are two sets of urls, one having the word "keyword" in it and others which don't. In my code, I will receive the part of the url after domain name (eg: /appName1/some/extra/parts/keyword/rest/of/the/url).
I have two tasks, one check if the word "keyword" is present in the url, and second, to be done only if "keyword" is not present in url, parse the url to fetch the two groups as the appName and rest of the url (eg: grp 1. appName3 and grp 2. rest/of/the/url for url 3, as it doesn't have "keyword" in it). The whole thing should be done in one regex.
My progress:
I was able to parse the app name and rest of the url into groups, but was not able to apply the condition.
I found a way to select strings that don't have "keyword" in them, though I'm not sure if it's the right way to do it: ^((?!.*keyword).*)$
Next, to combine the above two, I tried something I found after a long search, which has the syntax (?(?=regex)then|else). The result was:
(?(?=^((?!.*keyword).*)$)\1)
But it says invalid group structure.
I had gone through many stackoverflow entries and tutorials, but couldn't reach the actual requirement. Please help me solve this.

Yes, this is in fact possible. As far as I understand, you have the following cases:
/appName/some/extra/parts/keyword/rest/of/the/url
/appName/rest/of/the/url
You want your regex to not match the first one at all, while in the second case you want "appName" in one group and "rest/of/the/url" in another. The following regex will do that:
^(?!.*\/keyword\/)\/(.*?)\/(.*)$
Explanation:
^ asserts position at the start of the string
(?!.*\/keyword\/) is a negative lookahead, and looks ahead to make sure the string does not contain /keyword/. This is where the magic happens
\/ matches "/", i.e. the slash right after the domain name
(.*?)\/ captures the first group (the app name in your example) lazily, stopping at the next slash
(.*)$ is the group that captures "rest/of/the/url"
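As a minimal sketch of how this pattern could be used in code (the helper name parsePath is hypothetical and not part of the answer), a failed match signals that "/keyword/" is present, while a successful match yields both groups:

```javascript
// Pattern from the answer above: reject any path containing "/keyword/",
// otherwise capture the app name and the rest of the path.
const pattern = /^(?!.*\/keyword\/)\/(.*?)\/(.*)$/;

// parsePath is a hypothetical helper name used only for illustration.
function parsePath(path) {
  const match = pattern.exec(path);
  if (match === null) {
    // Either "/keyword/" is present or the path has no second segment.
    return null;
  }
  return { appName: match[1], rest: match[2] };
}

console.log(parsePath('/appName1/some/extra/parts/keyword/rest/of/the/url')); // null
console.log(parsePath('/appname3/rest/of/the/url'));
// { appName: 'appname3', rest: 'rest/of/the/url' }
```

Note that the lookahead only rejects "keyword" when it is surrounded by slashes, so a path that ends in "/keyword" with no trailing slash would still match.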

Related

JS regex to get domain name from an email [duplicate]

How can I extract only top-level and second-level domain from a URL using regex? I want to skip all lower level domains. Any ideas?
Here's my idea,
Match anything that isn't a dot, three times, from the end of the line using the $ anchor.
The last match from the end of the string should be optional to allow for .com.au or .co.nz type of domains.
Both the last and second last matches will only match 2-3 characters, so that it doesn't confuse it with a second-level domain name.
Regex:
[^.]*\.[^.]{2,3}(?:\.[^.]{2,3})?$
Demonstration:
Regex101 Example
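A quick sketch of how this expression behaves on a few hostnames; the function name primaryDomain and the sample hostnames are made up for illustration:

```javascript
// Regex from the answer above: up to three dot-separated labels from the end,
// where the last one or two labels are 2-3 characters long.
const lastLabels = /[^.]*\.[^.]{2,3}(?:\.[^.]{2,3})?$/;

// primaryDomain is a hypothetical helper name.
function primaryDomain(hostname) {
  const match = hostname.match(lastLabels);
  return match ? match[0] : null;
}

console.log(primaryDomain('www.example.com'));    // 'example.com'
console.log(primaryDomain('www.example.co.uk'));  // 'example.co.uk'
console.log(primaryDomain('www.example.travel')); // null (TLD longer than 3 characters)
console.log(primaryDomain('www.ab.com'));         // 'www.ab.com' (short label mistaken for a second-level suffix)
```

The last two calls show where the length-based heuristic breaks down: long TLDs are missed, and short domain labels are treated as part of the suffix.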
Updated 2019
This is an old question, and the challenge here is a lot more complicated as we start adding new vanity TLDs and more ccTLD second level domains (e.g. .co.uk, .org.uk). So much so, that a regular expression is almost guaranteed to return false positives or negatives.
The only way to reliably get the primary host is to call out to a service that knows about them, like the Public Suffix List.
There are several open-source libraries out there that you can use, like psl, or you can write your own.
Usage for psl is quite intuitive. From their docs:
var psl = require('psl');
// Parse domain without subdomain
var parsed = psl.parse('google.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'google'
console.log(parsed.domain); // 'google.com'
console.log(parsed.subdomain); // null
// Parse domain with subdomain
var parsed = psl.parse('www.google.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'google'
console.log(parsed.domain); // 'google.com'
console.log(parsed.subdomain); // 'www'
// Parse domain with nested subdomains
var parsed = psl.parse('a.b.c.d.foo.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'foo'
console.log(parsed.domain); // 'foo.com'
console.log(parsed.subdomain); // 'a.b.c.d'
Old answer
You could use this:
(\w+\.\w+)$
Without more details (a sample file, the language you're using), it's hard to discern exactly whether this will work.
Example: http://regex101.com/r/wD8eP2
Also, you can likely do that with some expression similar to,
^(?:https?:\/\/)(?:w{3}\.)?.*?([^.\r\n\/]+\.)([^.\r\n\/]+\.[^.\r\n\/]{2,6}(?:\.[^.\r\n\/]{2,6})?).*$
and add as many capturing groups as you need to capture the components of a URL.
Demo
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
For anyone using JavaScript and wanting a simple way to extract the top and second level domains, I ended up doing this:
'example.aus.com'.match(/\.\w{2,3}\b/g).join('')
This matches anything with a period followed by two or three characters and then a word boundary.
Here's some example outputs:
'example.aus.com' // .aus.com
'example.austin.com' // .com (".austin" is longer than three characters, so it is not matched)
'example.aus.com/howdy' // .aus.com
'example.co.uk/howdy' // .co.uk
Some people might need something a bit cleverer, but this was enough for me with my particular dataset.
Edit
I've realised there are actually quite a few second-level domains which are longer than 3 characters (and allowed). So, again for simplicity, I just removed the character counting element of my regex:
'example.aus.com'.match(/\.\w*\b/g).join('')
Since TLDs now include things with more than three-characters like .wang and .travel, here's a regex that satisfies these new TLDs:
([^.\s]+\.[^.\s]+)$
Strategy: starting at the end of the string, look for one or more characters that aren't periods or whitespace, followed by a single period, followed by one or more characters that aren't periods or whitespace.
http://regexr.com/3bmb3
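A small sketch of that strategy in JavaScript (the helper name lastTwoLabels and the sample hostnames are assumptions for illustration):

```javascript
// Regex from the answer above: the last two dot-separated labels of the string.
const lastTwo = /([^.\s]+\.[^.\s]+)$/;

// lastTwoLabels is a hypothetical helper name.
function lastTwoLabels(hostname) {
  const match = lastTwo.exec(hostname);
  return match ? match[1] : null;
}

console.log(lastTwoLabels('www.example.travel')); // 'example.travel'
console.log(lastTwoLabels('www.example.co.uk'));  // 'co.uk'
console.log(lastTwoLabels('localhost'));          // null
```

Long TLDs such as .travel work, but multi-part suffixes such as .co.uk are cut down to the suffix itself, which is the trade-off of always taking exactly two labels.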
With capturing groups you can achieve some magic.
For example, consider the following javascript:
let hostname = 'test.something.else.be';
let domain = hostname.replace(/^.+\.([^\.]+\.[^\.]+)$/, '$1');
document.write(domain);
This will result in a string containing 'else.be'. This is because the regex itself will match the complete string and the capturing group will be mapped to $1. So it replaces the complete string 'test.something.else.be' with '$1', which is actually 'else.be'.
The regex isn't pretty and can probably be made more dynamic with things like {3} for defining how many levels deep you want to look for subdomains, but this is just an illustration.
If you want to match only specific top-level domain names, you can write a regular expression like this:
[RegularExpression("^(https?:\\/\\/)?(([\\w]+)?\\.?(\\w+\\.((za|zappos|zara|zero|zip|zippo|zm|zone|zuerich|zw))))\\/?$", ErrorMessage = "Is not a valid fully-qualified URL.")]
You can also add more domain names from this link:
https://www.icann.org/resources/pages/tlds-2012-02-25-en
The following regex matches a domain with root and tld extractions (named capture groups) from a url or domain string:
(?:\w+:\/{2})?(?<cs_domain>(?<cs_domain_sub>(?:[\w\-]+\.)*?)(?<cs_domain_root>[\w\-]+(?<cs_domain_tld>(?:\.\w{2})?(?:\.\w{2,3}|\.xn-+\w+|\.site|\.club))))\|
It's hard to say if it is perfect, but it works on all the test data sets that I have put it against including .club, .xn-1234, .co.uk, and other odd endings. And it does it in 5556 steps against 40k chars of logs, so the efficiency seems reasonable too.
If you need to be more specific:
/\.(?:nl|se|no|es|milru|fr|es|uk|ca|de|jp|au|us|ch|it|io|org|com|net|int|edu|mil|arpa)/
Based on http://www.seobythesea.com/2006/01/googles-most-popular-and-least-popular-top-level-domains/

How to check if string is a valid Figma link?

I'm building an app on NodeJS that uses Figma API, and I need to check if the string passed by a user is a valid Figma link. I'm currently using this simple regex expression to check the string:
/^https\:\/\/www.figma.com\/.*/i
However, it matches all links from figma.com, even the home page, not only links to the files and prototypes. Here is an example Figma link that should match:
https://www.figma.com/file/OoYmkiTlusAzIjYwAgSbv8wy/Test-File?node-id=0%3A1
Also the match should be positive if this is a prototype link, with proto instead of file in the path.
Moreover, since I'm using the Figma API, it would be useful to extract necessary parts of the URL such as the file ID and node ID at the same time.
TL;DR
✅ Use this expression to capture four most important groups (type, file id, file name and URL properties) and work from there.
/^(?:https:\/\/)?(?:www\.)?figma\.com\/(file|proto)\/([0-9a-zA-Z]{22,128})(?:\/?([^\?]+)?(.*))?$/
From the docs
This is the regex expression code provided by Figma on their developer documentation page about embeds:
/https://([w.-]+.)?figma.com/(file|proto)/([0-9a-zA-Z]{22,128})(?:/.*)?$/
🛑 However, it doesn't work in JS as the documentation is currently wrong and this expression has multiple issues:
Slashes and dots are not escaped with backslashes.
It doesn't match from the start of the string. I added the start of string anchor ^ after VLAZ pointed it out in the comments. This way we will avoid matching strings that don't start with https, for example malicious.site/?link=https://figma.com/...
It will match not only www. subdomain but any other amount of W which is not great (e.g. wwwww.) — it can be fixed by replacing letter match with a simpler expression. Also this is a useless capturing group, I'll make it non-capturing.
It would be nice if the link matched even if it doesn't begin with https://, as some engines (e.g. Twitter) strip this part for brevity, and if a person copies a link from there, it should still be valid.
After applying all the improvements, we are left with the following expression:
/^(?:https:\/\/)?(?:www\.)?figma\.com\/(file|proto)\/([0-9a-zA-Z]{22,128})(?:\/.*)?$/
There is also a dedicated NPM package that simply checks the URL against a similar pattern. However, it contains some of the flaws listed above, so I don't advise using it, especially for just one line of code.
Extracting parts of the URL
This expression is extremely useful to use with Figma API as it even extracts necessary parts from the URL such as type of link (proto/file) and the file key. You can access them by indexes.
You can also add a piece of regex to match specific keys in the query such as node-id:
/^(?:https:\/\/)?(?:www\.)?figma\.com\/(file|proto)\/([0-9a-zA-Z]{22,128})(?:\/.*)?node-id=([^&]*)$/
Now you can use it in code and get all the parts of the URL separately:
var pattern = /^(?:https:\/\/)?(?:www\.)?figma\.com\/(file|proto)\/([0-9a-zA-Z]{22,128})(?:\/.*)?node-id=([^&]*)$/
var matched = 'https://www.figma.com/file/OoYmkiTlusAzIjYwAgSbv8wy/Test-File?node-id=0%3A1'.match(pattern)
console.log('url:', matched[0]) // whole matched string
console.log('type:', matched[1]) // group 1
console.log('file key:', matched[2]) // group 2
console.log('node id:', matched[3]) // group 3
Digging deeper
I spent some time recreating this expression almost from scratch so it would match as many possible Figma file/prototype URLs without breaking things. Here are three similar versions of it that would work for different cases.
✅ This version captures the URL parameters and the name of the file separately for easier processing. You can check it here. I added it in the beginning of the answer, because I think it's the cleanest and most useful solution.
/^(?:https:\/\/)?(?:www\.)?figma\.com\/(file|proto)\/([0-9a-zA-Z]{22,128})(?:\/?([^\?]+)?(.*))?$/
The groups in it are as follows:
Group 1: file/proto
Group 2: file key/id
Group 3: file name (optional)
Group 4: url parameters (optional)
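A minimal sketch of reading those four groups in code; the function name parseFigmaUrl and the property names are hypothetical and chosen only for illustration:

```javascript
// Four-group expression from above: type, file key, file name, URL parameters.
const figmaUrl = /^(?:https:\/\/)?(?:www\.)?figma\.com\/(file|proto)\/([0-9a-zA-Z]{22,128})(?:\/?([^\?]+)?(.*))?$/;

// parseFigmaUrl is a hypothetical helper name.
function parseFigmaUrl(url) {
  const match = figmaUrl.exec(url);
  if (!match) {
    return null; // not a file or prototype link
  }
  const [, type, fileKey, fileName, params] = match;
  return { type, fileKey, fileName, params };
}

console.log(parseFigmaUrl('https://www.figma.com/file/OoYmkiTlusAzIjYwAgSbv8wy/Test-File?node-id=0%3A1'));
// { type: 'file', fileKey: 'OoYmkiTlusAzIjYwAgSbv8wy', fileName: 'Test-File', params: '?node-id=0%3A1' }
console.log(parseFigmaUrl('https://www.figma.com/')); // null
```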
✅ Next up, I wanted to do the same but separating the /duplicate part that can be added in the end of any Figma URL to create a duplicate of the file upon opening.
/^(?:https:\/\/)?(?:www\.)?figma\.com\/(file|proto)\/([0-9a-zA-Z]{22,128})(?:\/?([^\?]+)?([^\/]*)(\/duplicate)?)?$/
✅ And back to the node-id parameter. The following regex expression finds and captures multiple URLs inside a multiline string successfully. The only downside that I found in the end is that it (as well as all the previous ones) doesn't check if this URL contains unencoded special characters meaning that it can potentially break things, but it can be avoided by manually encoding all parameters using encodeURI() function.
/^(?:https:\/\/)?(?:www\.)?figma\.com\/(file|proto)\/([0-9a-zA-Z]{22,128})(?:\/([^\?\n\r\/]+)?((?:\?[^\/]*?node-id=([^&\n\r\/]+))?[^\/]*?)(\/duplicate)?)?$/gm
There are six groups that can be captured by this expression:
Group 1: file/proto
Group 2: file key/id
Group 3: file name (optional)
Group 4: url parameters (optional)
Group 5: node-id (optional; only present when group 4 is present)
Group 6: /duplicate

RegEx match only final domain name from any email address

I want to match only parent domain name from an email address, which might or might not have a subdomain.
So far I have tried this:
new RegExp(/.+#(:?.+\..+)/);
The results:
Input: abc#subdomain.maindomain.com
Output: ["abc#subdomain.maindomain.com", "subdomain.maindomain.com"]
Input: abc#maindomain.com
Output: ["abc#maindomain.com", "maindomain.com"]
I am interested in the second match (the group).
My objective is that in both cases, I want the group to match and give me only maindomain.com
Note: before the down vote, please note that neither have I been able to use existing answers, nor the question matches existing ones.
One simple regex you can use to get only the last 2 parts of the domain name is
/[^.]+\.[^.]+$/
It matches a sequence of non-period characters, followed by a period and another sequence of non-periods, all at the end of the string. This regex doesn't ensure that this domain name happens after a "#". If you want to make a regex that also does that, you could use lazy matching with "*?":
/#.*?([^.]+\.[^.]+)$/
However, I think that trying to do everything at once tends to make regexes more complicated and hard to read. In this problem I would prefer to do things in two steps: first check that the email has a "#" in it. Then take the part after the "#" and pass it to the simple regex, which will extract the domain name.
One advantage of separating things is that some changes are easier. For example, if you want to make sure that your email only has a single "#" in it, it's very easy to do in a separate step but would be tricky to achieve in the "do everything" regex.
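The two-step approach could be sketched like this. The function name mainDomain is hypothetical, the separator is "#" as in the question's examples, and the simple regex is written as [^.]+\.[^.]+$ with a + on the final character class so that it matches a whole label:

```javascript
// Step 2 regex: the last two dot-separated labels of the domain part.
const lastTwoLabels = /[^.]+\.[^.]+$/;

// mainDomain is a hypothetical helper name.
function mainDomain(address) {
  // Step 1: require exactly one "#" separator.
  const parts = address.split('#');
  if (parts.length !== 2) {
    return null;
  }
  // Step 2: apply the simple regex to the part after the separator.
  const match = parts[1].match(lastTwoLabels);
  return match ? match[0] : null;
}

console.log(mainDomain('abc#subdomain.maindomain.com')); // 'maindomain.com'
console.log(mainDomain('abc#maindomain.com'));           // 'maindomain.com'
console.log(mainDomain('a#b#maindomain.com'));           // null
```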
You can use this regex:
/#(?:[^.\s]+\.)*([^.\s]+\.[^.\s]+)$/gm
Use captured group #1 for your result.
It matches # followed by 0 or more instance of non-DOT text and a DOT i.e. (?:[^.\s]+\.)*.
Using ([^.\s]+\.[^.\s]+)$ it is matching and capturing last 2 components separated by a DOT.
RegEx Demo
With the following maindomain should always return the maindomain.com bit of the string.
var pattern = new RegExp(/(?:[\.#])(\w[\w-]*\w\.\w*)$/);
var str = "abc#subdomain.maindomain.com";
var maindomain = str.match(pattern)[1];
http://codepen.io/anon/pen/RRvWkr
EDIT: tweaked to disallow domains starting with a hyphen i.e - '-yahoo.com'

if pathname starts with, as well as contains . Regex

I am trying to test the pathname of the url, checking if pathname starts with privmsg as well as contains one of the words in the selection. And my quantifier is selecting that at least one word must be found.
New RegExp thanks to one of the answers and I extended it more.
var post = /(^\/privmsg\?).+(post|reply){1}(.*)?/;
My urls will look like
/privmsg?mode=post
/privmsg?mode=reply
/privmsg?mode=reply&p=2 //another way
Though we have other modes that I do not want. I need to just get the constant url beginning with privmsg and having at least post or reply in it. Can someone explain what is wrong with my regex string and if I used the quantifier incorrectly.
Problem now is that it is still coming out false...
You need to allow for arbitrary characters between ? and (post|reply) (i.e. mode=). E.g.:
var post = /^\/privmsg\?.+(post|reply){1}/g;
Here .+ matches any sequence of 1 or more characters.
Your regex is missing something to match mode=.
With your regex you will match strings like /privmsg?post.
So alter your regex to include mode=:
^\/privmsg\?.*(post|reply)$
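As a hedged sketch, here is a stricter variant (not taken from the answers) that assumes the mode is always passed as a mode= query parameter, as in the sample URLs. It has no g flag on purpose: a global regex remembers lastIndex between test() or exec() calls, so reusing it can return false for a string that should match. The helper name privmsgMode is hypothetical.

```javascript
// Assumed variant: path must start with /privmsg? and carry mode=post or
// mode=reply as a complete query parameter, anywhere in the query string.
const privmsgPattern = /^\/privmsg\?(?:.*&)?mode=(post|reply)(?:&|$)/;

// privmsgMode is a hypothetical helper name.
function privmsgMode(url) {
  const match = privmsgPattern.exec(url);
  return match ? match[1] : null;
}

console.log(privmsgMode('/privmsg?mode=post'));      // 'post'
console.log(privmsgMode('/privmsg?mode=reply&p=2')); // 'reply'
console.log(privmsgMode('/privmsg?mode=edit'));      // null
console.log(privmsgMode('/inbox?mode=post'));        // null
```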

Javascript match everything except given words

I'm working on a Node.js app, and I'm doing router matching.
I need to match all routes with all variables except the ones which begin with
"public", "static", "files", or the same words with an added "/".
I know I could do it using an if statement before the regexp, to check whether those words are within the url and, if they are, skip the regexp, but I don't want to add such nesting, and knowing how to do it using a regexp will come in handy in the future anyway.
I know how to match anything except some characters, e.g. [^0-9], but I can't use the same for words. I googled and found that lookahead could solve this, but I can't get it to work.
In the end, I'd like to use something like this (in pseudo code),
where the .+ would match only if the pattern does not match any of the given words.
match(/^(?!public|static|files) .+ /gi)
edit 1:
The format of the urls would be something like this, with or without slashes:
/controller/action/4/var:something/
I want to make a regexp that matches this controller - action - id
pattern, but at the same time wouldn't match patterns like this:
/public/images/4
or
static/files/somefile
In general, I'd like to know how to match a pattern, but only if it doesn't begin with given words.
E.g. something like this, which doesn't work (match .+, but only if it doesn't begin with the words mentioned before):
/^(?!public|static|files).+ /gi
Actually, I'm not having trouble with negative look-aheads. Something like this seems to work just fine, although it's not super extensible.
/^\/(?!public|static|files)([^\/]+)?\/?([^\/]+)?\/?([^\/]+)?\/?(.*)$/i
1st capture will be the controller, 2nd is the action, 3rd is the ID, and 4th is whatever is left.
See this jsfiddle
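A short sketch of using that expression as a route matcher; the function name matchRoute and the property names are hypothetical:

```javascript
// Regex from the answer above: reject paths starting with public, static or
// files, otherwise capture controller, action, id and whatever is left.
const route = /^\/(?!public|static|files)([^\/]+)?\/?([^\/]+)?\/?([^\/]+)?\/?(.*)$/i;

// matchRoute is a hypothetical helper name.
function matchRoute(path) {
  const match = route.exec(path);
  if (!match) {
    return null;
  }
  return { controller: match[1], action: match[2], id: match[3], rest: match[4] };
}

console.log(matchRoute('/controller/action/4/var:something/'));
// { controller: 'controller', action: 'action', id: '4', rest: 'var:something/' }
console.log(matchRoute('/public/images/4')); // null
```

Because the lookahead only checks the leading characters, it also rejects paths such as /publications/list. Adding a boundary inside the lookahead, for example (?!(?:public|static|files)(?:\/|$)), would be one way to restrict it to whole segments.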
