JS regex - convert "any" plain text hostname/url/ip to a link - javascript

I have been looking for a JS regexp that converts plain-text URLs or hostnames to clickable links, but none of the scripts I found meet my requirements. Unfortunately, I suck at regex and am unable to modify the expressions to work the way I want.
The plain text I wish to convert to links is:
Anything starting with http(s):, ftp(s):, mailto: or file:
domain.tld[:port][path][file][querystring]
any.sub.domain.tld[:port][path][file][querystring]
0/255.0/255.0/255.0/255[:port][path][file][querystring]
localhost[:port][path][file][querystring]
[*] = optional.
Any help is highly appreciated!

If you can live with false positives, such as something.notavalidtld or 999.999.999.999 getting matched, what you are looking for is probably something like this. (Otherwise, it gets more messy.)
Start matching at the beginning of the string.
^(
Match anything starting with http/https/ftp/...
((https?|ftps?|mailto|file):.*?)
OR match all of the below.
|
Optionally match http/https/ftp/... followed by : and at least one /.
((https?|ftps?|mailto|file):/+)?
Match an IP address...
(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
...or a domain (with optional username/password, which also matches email addresses)...
|([\w\d.:_%+-]+@)?([\w\d-]+\.)+[\w\d]{2,}
... or localhost.
|localhost)
Optionally followed by a port number.
(:\d+)?
Optionally followed by any path/query string.
(/.*)?
Ensuring the string ends here.
)$
All the above parts should be joined together without any whitespace in between.
I haven't tested it extensively, so I might have missed something. But at least you have a starting point.
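As a rough sketch of how the pieces above join up in JavaScript, the fragments can be concatenated and passed to the RegExp constructor. The names parts, linkPattern and isLinkLike are illustrative and not part of the answer, and the user-info separator is written as @ on the assumption that this is the intended character. Like the pattern itself, this has not been tested extensively.

```javascript
// Fragments of the pattern, in the order they are described above.
// String.raw keeps the backslashes exactly as written.
const parts = [
  String.raw`^(`,
  String.raw`((https?|ftps?|mailto|file):.*?)`,          // anything with a known scheme
  String.raw`|`,
  String.raw`((https?|ftps?|mailto|file):/+)?`,          // optional scheme plus slashes
  String.raw`(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}`,       // IP address
  String.raw`|([\w\d.:_%+-]+@)?([\w\d-]+\.)+[\w\d]{2,}`, // domain, optional user info (assumed @)
  String.raw`|localhost)`,
  String.raw`(:\d+)?`,                                   // optional port
  String.raw`(/.*)?`,                                    // optional path/query string
  String.raw`)$`,
];
const linkPattern = new RegExp(parts.join(''));

// Hypothetical helper: true when the whole string looks like a URL, host or IP.
function isLinkLike(text) {
  return linkPattern.test(text);
}

console.log(isLinkLike('http://example.com/a?b=c'));  // true
console.log(isLinkLike('sub.domain.tld:8080/path'));  // true
console.log(isLinkLike('10.0.0.1'));                  // true
console.log(isLinkLike('localhost:3000/index.html')); // true
console.log(isLinkLike('plain text'));                // false
```

Because the pattern is anchored with ^ and $, it checks one candidate string at a time rather than searching inside longer text.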

Related

How to create a regex for alphanumeric, space & apostrophe then ends only in alphanumeric?

I'm trying to create a regex based on some constraints, and I've used a couple helpful resources to try & test this. I understand I need an anchor ($) to check the end of the string but I guess I'm misunderstanding where the anchor should be placed.
I know that /([A-Za-z0-9' ])\w+/ will give me what I want in that it contains alphanumeric+spaces+apostrophe characters, but how do I ensure it only ends in alphanumeric?
What you have there basically says "give me one character that is alphanumeric, an apostrophe or a space, then one or more word characters". It is also unanchored, so it can match anywhere and your string could contain other characters as well. Based on your description, that might not be what you want.
I think you probably want this:
/^([A-Z0-9' ]+)(?:[A-Z0-9])$/i
This says "give me only alphanumeric, apostrophes or spaces and make sure there is an alphanumeric at the end". I took out the lowercase and just added the i (case-insensitive) flag, but you can switch it the other way too.
The (?:[A-Z0-9]) is a non-capturing group that checks to make sure there is an alphanumeric character here. Since it sits right next to the $ at the end, that character must be the last one in the string. You also need a ^ at the beginning to ensure that your whole string meets these criteria, not just part of it.
Here is an example:
const pattern = /^([A-Z0-9' ]+)(?:[A-Z0-9])$/i;
['This should work', "This'll also work", "This won't'"].forEach(s =>
  console.log(s, pattern.test(s))
);

Regex finding the last string that doesn't contain a number

Usually in my system I have the following string:
http://localhost/api/module
To find the last part of the string (which is my route) I've been using the following:
/[^\/]+$/g
However, there may be cases where my string looks a bit different, such as:
http://localhost/api/module/123
Using the above regex, it would then return 123. When my string looks like this, I know that the last part will always be a number. So my question is: how do I make sure that I always find the last string that does not contain a number?
This is what I came up with, which strictly matches only module for the following lines:
http://localhost/api/module
http://localhost/api/module/123
http://localhost/api/module/123a
http://localhost/api/module/a123
http://localhost/api/module/a123a
http://localhost/api/module/1a3
(?!\w*\d\w*)[^\/][a-zA-Z]+(?=\/\w*\d+\w*|$)
Explanation
I basically just extended your expression with a negative lookahead and a positive lookahead, so it matches only if all of the following conditions are true:
(?!\w*\d\w*) May contain letters, but no digits
[a-zA-Z]+ Really, truly only consists of one or more letters (was needed)
(?=\/\w*\d+\w*|$) The match is followed either by a slash and a segment containing digits, or by the end of the line
See this in action in my sample at Regex101.
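To try the expression outside Regex101, a small JavaScript check along these lines could be used. It is only a sketch: lastRoute is a hypothetical helper name, and the URLs are the sample lines listed above.

```javascript
// The lookahead-based expression from above, as a JavaScript regex literal.
const routePattern = /(?!\w*\d\w*)[^\/][a-zA-Z]+(?=\/\w*\d+\w*|$)/;

// Hypothetical helper: returns the first match, or null if there is none.
function lastRoute(url) {
  const match = url.match(routePattern);
  return match ? match[0] : null;
}

const samples = [
  'http://localhost/api/module',
  'http://localhost/api/module/123',
  'http://localhost/api/module/123a',
  'http://localhost/api/module/a123',
  'http://localhost/api/module/a123a',
  'http://localhost/api/module/1a3',
];

for (const url of samples) {
  console.log(url, '->', lastRoute(url)); // each line ends with "module"
}
```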
partYouWant = urlString.replace(/^.*\/([a-zA-Z]+)[\/0-9]*$/,'$1')
Here it is in action:
urlString="http://localhost/api/module/123"
urlString.replace(/^.*\/([a-zA-Z]+)[\/0-9]*$/,'$1')
-->"module"
urlString="http://localhost/api/module"
urlString.replace(/^.*\/([a-zA-Z]+)[\/0-9]*$/,'$1')
-->"module"
It just uses a capture expression to find the last non-numeric part.
It's going to do this too, not sure if this is what you want:
urlString="http://localhost/api/module/123/456"
urlString.replace(/^.*\/([a-zA-Z]+)[\/0-9]*$/,'$1')
-->"module"
/([0-9])\w+/g
That would select the numbers. You could use it to remove that part from the URL. What language are you using it for?

regex for ng-pattern for filepath

I have arrived at a regex for a file path that has these conditions:
Must match regex ^(\\\\[^\\]+\\[^\\]+|https?://[^/]+), so either something like \server\share (optionally followed by one or more "\folder"s), or an HTTP(S) URL
Cannot contain any invalid path name chars (", <, >, |)
How can I get a single regex to use in angular.js that meets these conditions?
Your current regex doesn't seem to match what you want. But assuming it is correctly doing what you want, this will add the negation:
^(?!.*[ "<>|])(\\\\[^\\]+\\[^\\]+|https?://[^/]+)
Here we added a negative lookahead that fails the match if any of the listed characters appears anywhere in the string. If none is found, the rest of the regular expression is evaluated as usual.
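A quick way to see the lookahead at work is to run the pattern against a few strings in plain JavaScript. This is a sketch with made-up sample inputs; pathPattern and candidates are illustrative names, and the forward slashes are escaped only because a regex literal is used.

```javascript
// Original pattern with the negative lookahead prepended.
const pathPattern = /^(?!.*[ "<>|])(\\\\[^\\]+\\[^\\]+|https?:\/\/[^/]+)/;

const candidates = [
  '\\\\server\\share',     // \\server\share        -> true
  '\\\\ser<ver\\share',    // contains <            -> false
  'https://example.com',   //                       -> true
  'https://exam"ple.com',  // contains "            -> false
];

for (const candidate of candidates) {
  console.log(candidate, pathPattern.test(candidate));
}
```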
If I understand your requirements correctly, you could probably do this:
^(?!.*[ "<>|])(\\\\|https?://).*$
This still rejects any of the invalid characters listed in the negative lookahead, meets your criterion of matching one or more path segments as well as http(s), and is much simpler.
The caveat is that if you require two or more path segments, or a trailing slash on the URL, then this will not work. This is what your regex seems to suggest.
So in that case, this is still somewhat cleaner than the original:
^(?!.*[ "<>|])(\\\\[^\\]+\\.|https?://[^/]+/).*$
One more point. You ask to match \server\share, yet your regex opens with \\\\. I have assumed that \server\share should be \\server\share and wrote the regexes accordingly. If this is not the case, then all instances of \\\\ in the examples I gave should be changed to \\.
OK, first the regex, then the explanation:
(?<folderorurl>(?<folder>(\\[^\\\s",<>|]+)+)|(?<url>https?:\/\/[^\s]+))
Your first condition is to match a folder name, which must not contain any of the characters ",<>| nor any whitespace. This is written as:
[^\s",<>|] # the caret negates the character class, meaning none of these characters may be matched
Additionally, we want to match a folder name optionally followed by another (sub)folder, so we have to add a backslash to the character class:
[^\\\s",<>|] # added backslash
Now we want to match as many characters as possible but at minimum one, this is what the plus sign is for (+). With this in mind, consider the following string:
\server\folder
At the moment, only "server" is matched, so we need to prepend a backslash, thus "\server" will be matched. Now, if you break a filepath down, it always consists of a backslash + somefoldername, so we need to match backslash + somefoldername unlimited times (but minimum one):
(\\[^\\\s",<>|]+)+
As this is getting somewhat unreadable, I have used a named capturing group ((?<folder>)):
(?<folder>(\\[^\\\s",<>|]+)+)
This will match everything like \server or \server\folder\subfolder\subfolder and store it in the group called folder.
Now comes the URL part. A URL consists of http or https followed by a colon, two forward slashes and "something afterwards":
https?:\/\/[^\s]+ # something afterwards = .+, but no whitespaces
Following the explanation above this is stored in a named group called "url":
(?<url>https?:\/\/[^\s]+)
Bear in mind, though, that this will match even invalid URL strings (e.g. https://www.google.com.256357216423727...). If this is OK for you, leave it; if not, you may want to have a look at this question here on SO.
Now, last but not least, let's combine the two elements with an or, store it in another named group (folderorurl) and we are done. Simple, right?
(?<folderorurl>(?<folder>(\\[^\\\s",<>|]+)+)|(?<url>https?:\/\/[^\s]+))
Now the folder or the URL can be found in the folderorurl group, while the individual parts are still saved in url or folder. Unfortunately, I know nothing about angular.js, but the regex should get you started. Additionally, see this regex101 demo for a working fiddle.
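In JavaScript engines that support named capture groups (ES2018 and later), the groups can be read from the groups property of the match. The snippet below is a sketch with made-up inputs; pathOrUrl, folderMatch and urlMatch are illustrative names.

```javascript
// The combined expression from above as a JavaScript regex literal.
const pathOrUrl = /(?<folderorurl>(?<folder>(\\[^\\\s",<>|]+)+)|(?<url>https?:\/\/[^\s]+))/;

// A folder path: only the "folder" group (and the outer group) is filled.
const folderMatch = pathOrUrl.exec('\\server\\folder\\subfolder');
console.log(folderMatch.groups.folder); // \server\folder\subfolder
console.log(folderMatch.groups.url);    // undefined

// A URL inside a sentence: only the "url" group (and the outer group) is filled.
const urlMatch = pathOrUrl.exec('see https://example.com/page for details');
console.log(urlMatch.groups.url);         // https://example.com/page
console.log(urlMatch.groups.folderorurl); // https://example.com/page
```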
Must match regex ^(\\\\[^\\]+\\[^\\]+|https?://[^/]+), so either something like \\server\share (optionally followed by one or more "\folder"s), or an HTTP(S) URL
Cannot contain any invalid path name chars (", <, >, |)
To introduce the second condition in your regex, you mainly just have to include the invalid characters in the negated character sets, e.g. instead of [^/] use [^/"<>|].
Here's a working example with a slightly rearranged regex:
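// Sample inputs: the first has only one leading backslash, the last three contain invalid characters.
const paths = [
  '\\server\\share',
  '\\\\server\\share',
  '\\\\server\\share\\folder',
  'http://www.invalid.de',
  'https://example.com',
  '\\\\<server\\share',
  'https://"host.com',
  '\\\\server"\\share',
];
// Either \\server\share with optional further folders, or an http(s) host, without ", <, > or |.
const pattern = /^\\(\\[^\\"<>|]+){2,}$|^https?:\/\/[^/"<>|]+$/;
for (const path of paths) {
  // Print each path followed by whether it matches.
  document.body.appendChild(document.createTextNode(path + ' ' + pattern.test(path)));
  document.body.appendChild(document.createElement('br'));
}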

Unable to find a string matching a regex pattern

While trying to submit a form, a JavaScript regex validation always evaluates to false for a string.
Regex: ^(([a-zA-Z]:)|(\\\\{2}\\w+)\\$?)(\\\\(\\w[\\w].*))+(.jpeg|.JPEG|.jpg|.JPG)$
I have tried the following strings against it:
abc.jpg,
abc:.jpg,
a:.jpg,
a:asdas.jpg,
What string could possibly match this regex?
This regex won't match against anything because of that $? in the middle of the string.
Apparently, using the optional modifier ? on the end-of-string symbol $ is not correct (if you paste it into https://regex101.com/ it will indeed give you an error). If the JavaScript parser ignores the error and keeps the regex as it is, this still means you are trying to match the end of the string in the middle of a string that is supposed to continue.
Unescaped, it was supposed to match a \$ (dollar symbol), but as it is written it won't work.
If you want your string to be accepted at any cost, you can probably use Firebug or a similar developer tool and edit the string inside the JavaScript code (this assumes there is no server-side check too, and that it is not wrong as well). If you ignore the $?, then a matching string will be \\\\w\\\\ww.jpg (but since the . is unescaped, even \\\\w\\\\ww%jpg is a match).
Of course, I wrote this answer assuming the escaping is indeed the one you showed in the question. If you need to find a matching string for the correctly escaped one, ^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))+(\.jpeg|\.JPEG|\.jpg|\.JPG)$, then you can use this tool to find one: http://fent.github.io/randexp.js/ (though it will find weird matches). A matching string is c:\zz.jpg
If you are just looking for a regular expression to match what you got there, go ahead and test this out:
(\w+:?\w*\.[jpe?gJPE?G]+,)
That should match exactly what you are looking for. Remove the optional comma at the end if you feel like it, of course.
If you remove escape level, the actual regex is
^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))+(.jpeg|.JPEG|.jpg|.JPG)$
After the ^ start anchor comes the first alternation (([a-zA-Z]:)|(\\{2}\w+)\$?), which matches a letter followed by a colon, or two backslashes followed by one or more word characters and an optional literal $. There are some needless parentheses inside.
The second part (\\(\w[\w].*))+ matches a backslash followed by two word characters; \w[\w] looks weird because it's equivalent to \w\w (the second \w doesn't need a character class). That is followed by any number of any characters. This whole group repeats one or more times.
In the last part (.jpeg|.JPEG|.jpg|.JPG), the dot was probably meant to be escaped to match a literal dot; \. should be used. This part can be reduced to \.(JPE?G|jpe?g).
It would match something like
A:\12anything.JPEG
\\1$\anything.jpg
Play with it at regex101. A more readable version could be
^([a-zA-Z]:|\\{2}\w+\$?)(\\\w{2}.*)+\.(jpe?g|JPE?G)$
Also read the explanation on regex101 to understand any pattern; it's helpful!
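As a quick sanity check, the more readable pattern can be run against the example paths and the strings from the question. This is a sketch; jpegPath and inputs are illustrative names, and backslashes are doubled only because they appear inside JavaScript string literals.

```javascript
// The simplified pattern from above as a regex literal.
const jpegPath = /^([a-zA-Z]:|\\{2}\w+\$?)(\\\w{2}.*)+\.(jpe?g|JPE?G)$/;

const inputs = [
  'A:\\12anything.JPEG',   // A:\12anything.JPEG  -> true
  '\\\\1$\\anything.jpg',  // \\1$\anything.jpg   -> true
  'c:\\zz.jpg',            // c:\zz.jpg           -> true
  'abc.jpg',               // no drive or share   -> false
  'a:asdas.jpg',           // no backslash        -> false
];

for (const input of inputs) {
  console.log(input, jpegPath.test(input));
}
```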

Exclude Email Addresses from Web Address Regex

Okay, I have two Regex patterns.
([a-zA-Z0-9]?http[s]?:\/\/)?((?:(?:\w+)\.)(?:\S+)(?:\.(?:\w+))+?)
[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+.[a-zA-Z]{2,6}
The first meets my needs for finding web addresses in a string. The second meets my needs for locating email addresses in a string. However, for some reason the first one is also finding email addresses that look like first.last@d1.d2.d3.d4 or first.last@d1.com. I need some help changing the first one so that it doesn't pick up those email addresses.
For example, you could fix it by excluding @:
([a-zA-Z0-9]?http[s]?:\/\/)?((?:(?:\w+)\.)(?:[^\s@]+)(?:\.(?:\w+))*?)
At the very end I suggest using *? instead of +?, because +? did not match a first-level domain without www.
Yet it still finds abc@gmail.com.
Sadly, I have no idea how to check that the first symbol before the matched substring is not @.
edit: bad solution
^[^@]*?([a-zA-Z0-9]?http[s]?:\/\/)?((?:(?:\w+)\.)(?:[^\s@]+)(?:\.(?:\w+))*?)
It checks that there are no @s from the start of the line up to the matched part.
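The missing check, that the character just before the match is not the email separator, can be expressed with a negative lookbehind in engines that support it (JavaScript since ES2018). The following is a simplified sketch rather than a drop-in replacement for the pattern above: findWebAddresses and the sample text are made up, @ is assumed to be the character that marks an email address, and the trailing lookahead, which rejects a candidate that runs directly into an @, is an extra assumption.

```javascript
// (?<![\w.@-])  the match must not start right after @ or in the middle of a token
// (?![^\s@]*@)  the match must not be followed, within the same token, by @
const webAddress = /(?<![\w.@-])(?:https?:\/\/)?\w+\.[^\s@]+(?![^\s@]*@)/g;

// Hypothetical helper: returns every web address found, or an empty array.
function findWebAddresses(text) {
  return text.match(webAddress) || [];
}

console.log(findWebAddresses('visit www.example.com or mail first.last@d1.com'));
// [ 'www.example.com' ]
console.log(findWebAddresses('see https://example.org/path?q=1 today'));
// [ 'https://example.org/path?q=1' ]
```

A side effect of this sketch is that URLs containing user info, such as http://user@host.com, are skipped as well.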
([a-zA-Z0-9]?http[s]?:\/\/)?((?:(?:\w+)\.)(?:\S+)(?:\.(?:\w+))+?)
Breaking this down, there are several problems...
( // capture protocol
[a-zA-Z0-9]? // matches an alphanumeric, optionally (do you really want that to start the string before the protocol?)
http[s]? // square brackets delimit a character class, so they are unnecessary here, although they don't change functionality
:\/\/ // matches ://
)? // make captured protocol optional
((?:(?:\w+)\.)(?:\S+)(?:\.(?:\w+))+?) // too many non-capturing groups, not enough patterns. Inefficient and causing your error
I would replace the regex with something more like this...
(https?:\/\/)?(\w[-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?
