Regex to find web addresses in short copy


I have a short piece of copy and need to match every link to a website in it. To keep things simple, I need to find addresses in these formats:
www.aaaaaa.bbbbbb
http://aaaaaa.bbbb
https://aa.bbbb
but also I need to take care of longer www/http/https versions:
www.aaaaa.bbbb.ccc.ddd.eeee
and so on, so the number of subdomains is not known in advance. I came up with this regex:
(www\.([a-zA-Z0-9-_]|\.(?!\s))+)[\s|,|$]|(http(s)?:\/\/(?!\.)([a-zA-Z0-9-_]|\.(?!\s))+)[\s|,|$]
If you test on:
this is some tex with www.somewIebsite.dfd.jhh.hjh inside of it or maybe http://www.ssss.com or maybe https://evenore.com hahaah blah
It works fine except when the address is at the very end of the text. The $ seems to work only when there is a \n at the end, and it fails for:
this is some tex with www.somewIebsite.dfd.jhh.hjh
I'm guessing the fix is simple and I'm missing something obvious, so how would I fix it? BTW, I posted the regex here if you want to quickly play around: https://regex101.com/r/eL1bI4/3

The problem is that you placed the end anchor $ inside the character class []:
[\s|,|$]
It is then interpreted literally as a dollar sign, and not as the anchor (the pipe character | is also interpreted literally, it's not needed there). The solution is to move the $ anchor outside:
(?:[\s,]|$)
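The difference is easy to check by running both forms against a few strings. This is only a minimal sketch: the one-letter sample pattern and the variable names are made up for illustration.

```javascript
// Inside [] the "$" and "|" are ordinary characters.
const insideClass = /a[\s|,|$]/;
// Outside the class, "$" is the end-of-input anchor again.
const outsideClass = /a(?:[\s,]|$)/;

console.log(insideClass.test("a"));   // false: nothing follows the "a"
console.log(insideClass.test("a$"));  // true: literal dollar sign
console.log(insideClass.test("a|"));  // true: literal pipe
console.log(outsideClass.test("a"));  // true: "a" sits at the end of the input
console.log(outsideClass.test("a$")); // false: "$" is neither whitespace nor a comma
```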
However, in this case it makes more sense to use a positive lookahead instead of the non-capturing group, so that the trailing space or comma is not included in the match:
(?=[\s,]|$)
As a result, you end up with the following regex pattern:
(www\.([a-zA-Z0-9-_]|\.(?!\s))+)(?=[\s,]|$)|(http(s)?:\/\/(?!\.)([a-zA-Z0-9-_]|\.(?!\s))+)(?=[\s,]|$)
See the working demo.
The updated version that handles trailing full stops:
(www\.([a-zA-Z0-9-_]|\.(?!\s|\.|$))+)(?=[\s,.]|$)|(http(s)?:\/\/(?!\.)([a-zA-Z0-9-_]|\.(?!\s|\.|$))+)(?=[\s,.]|$)
See the working demo.
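A possible way to use this from JavaScript is sketched below. It is a condensed variant of the pattern above, not the exact same text: the two branches are merged, the groups are non-capturing, and the "-" is moved to the end of the character class. The name findAddresses is hypothetical.

```javascript
// Condensed variant of the pattern above (assumption: captures are not needed).
const urlPattern =
  /(?:www\.|https?:\/\/(?!\.))(?:[a-zA-Z0-9_-]|\.(?![\s.]|$))+(?=[\s,.]|$)/g;

// Returns every address found in the text, or an empty array.
function findAddresses(text) {
  return text.match(urlPattern) || [];
}

console.log(findAddresses("this is some tex with www.somewIebsite.dfd.jhh.hjh"));
// [ 'www.somewIebsite.dfd.jhh.hjh' ]

console.log(findAddresses("maybe http://www.ssss.com or maybe https://evenore.com hahaah"));
// [ 'http://www.ssss.com', 'https://evenore.com' ]

console.log(findAddresses("Go to https://evenore.com, then www.aaaaa.bbbb.ccc.ddd.eeee."));
// [ 'https://evenore.com', 'www.aaaaa.bbbb.ccc.ddd.eeee' ]
```

Because the terminator is a lookahead, the trailing space, comma or full stop is never part of the returned address.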

Related

JavaScript - regex to check if the user wrote correctly formatted input

In my CLI, users can specify what they want to use.
A user command can look like this:
include=name1,name2,name3
category=name1,name2
category=name1
In other words, a command always consists of 3 parts:
command name: can be just include or category
=: is in every command
name or names of things they want to use, split by ,
How can I test this so that I always get true for a valid command and false for everything else?
I am really bad at regex, but I tried something like this:
/\category|include=\w/.test(str);
just to test the simplest case, which would be category=name1, but without success.
Can someone help me with this?

You were on the right path. Here's a fixed regex:
/^(category|include)=\w+(,\w+)*$/.test(str);
Note:
the parens around the alternative parts
the + after the \w so that you can have several characters
the optional (,\w+)*
the start and end of string marks (^ and $) in order to check the whole string
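The four points above can be checked with a small sketch like this one. The wrapper name isValidCommand is hypothetical; the regex is the one from this answer.

```javascript
// Returns true only for "category=..." or "include=..." with one or more
// comma-separated names and nothing else before or after.
function isValidCommand(str) {
  return /^(category|include)=\w+(,\w+)*$/.test(str);
}

console.log(isValidCommand("include=name1,name2,name3")); // true
console.log(isValidCommand("category=name1"));            // true
console.log(isValidCommand("category="));                 // false: no name
console.log(isValidCommand("include=name1,"));            // false: trailing comma
console.log(isValidCommand("other=name1"));               // false: unknown command
console.log(isValidCommand("xcategory=name1"));           // false: ^ anchors the start
```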

You can use this regex for your requirement:
/^(category|include)=(\w+(?:,\w+)*)$/
RegEx Demo
\w+(?:,\w+)* in the value part after = allows one or more comma-separated words.
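Since this version captures the whole value part in its second group, it can also be used to pull the names out. A minimal sketch, assuming a hypothetical helper called parseCommand:

```javascript
// Returns { command, names } for a valid command string, or null otherwise.
function parseCommand(str) {
  const match = /^(category|include)=(\w+(?:,\w+)*)$/.exec(str);
  if (match === null) {
    return null;
  }
  // match[1] is the command name, match[2] the comma-separated value part.
  return { command: match[1], names: match[2].split(",") };
}

console.log(parseCommand("include=name1,name2,name3"));
// { command: 'include', names: [ 'name1', 'name2', 'name3' ] }
console.log(parseCommand("category="));
// null
```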

Unable to determine the regex for a path format which could be stringA/stringB/StringX or stringA/stringX but not just stringA/stringB

This regex is for JavaScript. More specifically, stringA = content, stringB = dam, and stringX could be any string.
I have tried this regex and a few others:
^\/(content(?!\/(dam)))\/(.*)
but this would recognize
/content/asfcew
/content/reddam
/content/usa/texas
and would not recognize
/content/dam
which is good, but alongside it also does not recognize
/content/dam/asdfafa
/content/damred
which is not good.
Any suggestions are much appreciated, thanks.

You just need to add an end-of-string anchor $ to the look-ahead:
^\/(content(?!\/(dam$)))\/(.*)
                    ^
See demo
Now, (?!\/(dam$)) will only fail the match when /dam appears immediately before the end of the string.
Note that there are more capturing groups here than needed. You may remove them like this:
^\/content(?!\/dam$)\/(.*)
See another demo
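To check the simplified pattern against the paths from the question, something like the following sketch can be used (the variable names are made up):

```javascript
const contentPath = /^\/content(?!\/dam$)\/(.*)/;

const samples = [
  "/content/asfcew",
  "/content/reddam",
  "/content/usa/texas",
  "/content/dam/asdfafa",
  "/content/damred",
  "/content/dam",
];

for (const path of samples) {
  console.log(path, contentPath.test(path));
}
// Every path prints true except "/content/dam", which prints false.
```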

As the poster above said, you need to add an end-of-string anchor $ to the lookahead group.
To make it capture both /content/dam and the rest, use this pattern:
^\/(content(?=|\/(dam$)))\/(.*)
See demo here https://regex101.com/r/kO2cZ1/5

JavaScript match everything except given words

I'm working on a Node.js app, and I'm doing router matching.
I need to match all routes, with all their variables, except the ones that begin with
"public", "static" or "files", or the same words with an added "/".
I know I could use an if statement before the regexp to check whether those words are within the URL and skip the regexp if they are, but I don't want to add such nesting, and knowing how to do it with a regexp will come in handy in the future anyway.
I know how to match anything except some characters, e.g. [^0-9], but I can't use the same approach for words. I googled and found that lookahead could solve this, but I can't get it to work.
In the end, I'd like to use something like this (in pseudocode),
where the .+ would match only if the pattern does not match any of the given words:
match(/^(?!public|static|files) .+ /gi)
Edit 1:
The format of the URLs would be something like this, with or without slashes:
/controller/action/4/var:something/
I want to make a regexp that matches this controller - action - id
pattern, but at the same time doesn't match patterns like this:
/public/images/4
or
static/files/somefile
In general, I'd like to know how to match a pattern, but only if it doesn't begin with the given words,
e.g. something like this, which doesn't work
(match .+, but only if it doesn't contain the words mentioned before):
/^(?!public|static|files).+ /gi)

Actually, I'm not having trouble with negative look-aheads. Something like this seems to work just fine, although it's not super extensible.
/^\/(?!public|static|files)([^\/]+)?\/?([^\/]+)?\/?([^\/]+)?\/?(.*)$/i
1st capture will be the controller, 2nd is the action, 3rd is the ID, and 4th is whatever is left.
See this jsfiddle
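For reference, a rough sketch of how the four captures could be read in code. The function name matchRoute and the shape of the returned object are assumptions made for this example.

```javascript
const routePattern =
  /^\/(?!public|static|files)([^\/]+)?\/?([^\/]+)?\/?([^\/]+)?\/?(.*)$/i;

// Returns the route parts, or null when the URL does not match
// (for example because it starts with /public, /static or /files).
function matchRoute(url) {
  const m = routePattern.exec(url);
  if (m === null) {
    return null;
  }
  const [, controller, action, id, rest] = m;
  return { controller, action, id, rest };
}

console.log(matchRoute("/controller/action/4/var:something/"));
// { controller: 'controller', action: 'action', id: '4', rest: 'var:something/' }
console.log(matchRoute("/public/images/4"));      // null
console.log(matchRoute("static/files/somefile")); // null (no leading slash)
```

Note that the lookahead has no word boundary, so a path such as /publicity/... is rejected as well.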

Javascript RegEx to match URL's but exclude images

I need to replace all text links in a string of HTML text with actual clickable links. This works fine with the following RegEx:
/(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/gi
I then noticed that it also replaces images and already formatted links, so I figured I need to exclude links preceded by src=" and >. I searched a bit and read a lot about negative lookahead in many questions answered here. I tried this (I added something right after the opening /( ):
/(^(?!src="|>)\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/gi
But this doesn't match any link anymore. I tried several similar statements, without the ^, changing some brackets, and so on, but nothing seems to work. I also tried putting .{0} between the part I added and \b, to make sure it would only look at what is right in front of the URL and not consider anything farther away.

EDIT: The discussion was getting long, so I decided to update the answer instead.
Trusting that your original regex works, I'm just going to refer to a simplified version through the rest of this answer:
/\b(https?|ftp|file)/gi
Now, you attempted this:
/^(?!src="|>)\b(https?|ftp|file)/gi
 ^
The main error here is marked by a caret: the caret. That forces your regex to match from the beginning of the line, which is why it matched nothing. Let's remove that and move on:
/(?!src="|>)\b(https?|ftp|file)/gi
The main error, this time, is in your conception of lookahead assertions. As I explained in the comments, this assertion is redundant, because you are saying, "Match http or https or ftp or file, as long as none of these are src=" or >." It's almost so redundant that the sentence doesn't even make sense to us! What you want, instead, is a lookbehind assertion:
/(?<!src="|>)\b(https?|ftp|file)/gi
   ^
Why? Because you wish to find src=" or > behind the string you potentially wish to match. The problem? JavaScript didn't support lookbehind assertions until ES2018, so they may not be available in every engine you target. So, I suggested an alternative. Admittedly, it was flawed (although not the cause of the HTML breaking, as you brought up). Here it is, fixed:
/(.[^>"]|[^=]")\b(https?|ftp|file)/gi
  ^^^^^^^^^^^^
This is indeed a non-intuitive regex, and warrants explanation. It splits our cases into two. Say we have a two-character set. If the set doesn't end in > or ", then we're not suspicious of it; we're good to go; match any URL that might follow. However, if it does end in > or ", well, the only "forgivable" case is where the first character is not an =. So you see, a bit of logic trickery here.
Now, as for why this might break your HTML. Be sure to use JavaScript's replace, and substitute the first captured group back into the page! If you simply substitute each match with nothingness, you end up "eating up" the two-character sets, which we only meant to investigate, not destroy.
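html.replace(/(.[^>"]|[^=]")\b(https?|ftp|file)/gi,
    function (match, $1, $2) {
        // $1 is the two-character set we only inspected; put it back in front
        // of whatever you substitute for the rest of the match ($2 here).
        return $1 + $2;
    });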
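Putting the two-character check together with a full URL pattern could look roughly like the sketch below. The function name linkify, the anchor markup it produces and the exact URL character classes are assumptions for illustration, not something confirmed by the question.

```javascript
// Group 1: the two characters in front of the URL (inspected, then restored).
// Group 2: the URL itself.
const linkPattern =
  /(.[^>"]|[^=]")\b((?:https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/gi;

function linkify(html) {
  return html.replace(
    linkPattern,
    (match, prefix, url) => `${prefix}<a href="${url}">${url}</a>`
  );
}

console.log(linkify('Visit http://example.com today, not <img src="http://example.com/a.png">'));
// Visit <a href="http://example.com">http://example.com</a> today, not <img src="http://example.com/a.png">

console.log(linkify('<a href="http://x.com">http://x.com</a>'));
// <a href="http://x.com">http://x.com</a>   (unchanged)
```

One limitation of this approach: the pattern needs two characters in front of the URL, so a URL at the very start of the string is left untouched.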

I have to go home and haven't tested yet, but I'd feel more comfortable dealing with the easier task of isolating HTML you don't want out first.
Do a match to get an array of the stuff you don't want to deal with.
Rip it all out with a split.
Iterate the split array and replace URLs and then splice matched items back in
Join and return
The only assumption is that you don't end on an anchor or img tag in your text
function zipperParse(htmlText, matcher) {
    // Pieces to leave untouched (anchors, images), in document order.
    var zipBackInArray = htmlText.match(matcher) || [],
        // The text in between; always one item more than zipBackInArray.
        workingArray = htmlText.split(matcher),
        i = workingArray.length;
    while (i--) {
        // Assumes the helper returns the converted text. You got this one covered.
        workingArray[i] = buildAnchorTagIfURLPresent(workingArray[i]);
        // Working backwards makes splice much easier to use here.
        if (i > 0) {
            workingArray.splice(i, 0, zipBackInArray.pop());
        }
    }
    return workingArray.join('');
}
var toExclude = /<a[^>]*>[^>]*>|<img[^>]*>/g;
// Is supposed to match all img tags and anchor pairs, but does not handle tags inside anchors yet.
zipperParse(yourHtmlText, toExclude);
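A shorter variation on the same idea, swapping the match/split/splice steps for a single split with a capturing group (split then keeps the excluded pieces in the result array). This is only a sketch: buildAnchorTagIfURLPresent is a hypothetical stand-in with a deliberately simple URL pattern, and linkifyOutsideTags is a made-up name.

```javascript
// Hypothetical helper: wraps bare http(s) URLs in anchor tags.
function buildAnchorTagIfURLPresent(text) {
  return text.replace(/\bhttps?:\/\/[^\s<>"]+/gi, (url) => `<a href="${url}">${url}</a>`);
}

function linkifyOutsideTags(htmlText) {
  // With a capturing group, split() returns the separators too:
  // even indexes are plain text, odd indexes are the excluded tags.
  const parts = htmlText.split(/(<a[^>]*>[^>]*>|<img[^>]*>)/g);
  return parts
    .map((part, i) => (i % 2 === 0 ? buildAnchorTagIfURLPresent(part) : part))
    .join("");
}

console.log(linkifyOutsideTags(
  'See http://a.com and <a href="http://b.com">b</a> <img src="http://c.com/i.png">'
));
// See <a href="http://a.com">http://a.com</a> and <a href="http://b.com">b</a> <img src="http://c.com/i.png">
```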

This code works for me. Just replace XXXXXXXXXXXXXXXXXXXXXX with your Google API key. I put it in the functions.php of my WordPress theme. First check how the Google Maps script URL appears on your site, then match it against the string that gets replaced.
function remove_script_version( $src ) {
    $parts1 = explode( '?', $src );
    $parts2 = str_replace( '//maps.googleapis.com/maps/api/js', '//maps.googleapis.com/maps/api/js?language=es&v=3.31&libraries=places&key=XXXXXXXXXXXXXXXXXXXXXX&ver=3.31', $parts1 );
    return $parts2[0];
}
add_filter( 'script_loader_src', 'remove_script_version', 15, 1 );
add_filter( 'style_loader_src', 'remove_script_version', 15, 1 );

Building a Hashtag in Javascript without matching Anchor Names, BBCode or Escaped Characters

I would like to convert any instances of a hashtag in a String into a linked URL:
#hashtag -> should have "#hashtag" linked.
This is a #hashtag -> should have "#hashtag" linked.
This is a [url=http://www.mysite.com/#name]named anchor[/url] -> should not be linked.
This isn&#39;t a pretty way to use quotes -> should not be linked.
Here is my current code:
String.prototype.parseHashtag = function() {
return this.replace(/[^&][#]+[A-Za-z0-9-_]+(?!])/, function(t) {
var tag = t.replace("#","")
return t.link("http://www.mysite.com/tag/"+tag);
});
};
Currently, this appears to handle escaped characters (by excluding matches preceded by an ampersand) and named anchors, but it doesn't link the #hashtag if it's the first thing in the message, and it seems to include the 1-2 characters before the "#" in the link.
Halp!

How about the following:
/(^|[^&])#([A-Za-z0-9_-]+)(?![A-Za-z0-9_\]-])/g
matches the hashtags in your example. Since JavaScript engines before ES2018 don't support lookbehind, it tries to match either the start of the string or any character except & before the hashtag. It captures that character so it can be put back in the replacement. It also captures the name of the hashtag.
So, for example:
subject.replace(/(^|[^&])#([A-Za-z0-9_-]+)(?![A-Za-z0-9_\]-])/g, "$1http://www.mysite.com/tag/$2");
will transform
#hashtag
This is a #hashtag and this one #too.
This is a [url=http://www.mysite.com/#name]named anchor[/url]
This isn&#39;t a pretty way to use quotes
into
http://www.mysite.com/tag/hashtag
This is a http://www.mysite.com/tag/hashtag and this one http://www.mysite.com/tag/too.
This is a [url=http://www.mysite.com/#name]named anchor[/url]
This isn&#39;t a pretty way to use quotes
This probably isn't what t.link() (which I don't know) would have returned, but I hope it's a good starting point.
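To get an actual link instead of a bare URL, the replacement string can carry the anchor markup itself. A minimal sketch, assuming a hypothetical function name linkHashtags and the tag URL from the question:

```javascript
function linkHashtags(text) {
  return text.replace(
    /(^|[^&])#([A-Za-z0-9_-]+)(?![A-Za-z0-9_\]-])/g,
    // $1 restores the character in front of the "#", $2 is the tag name.
    '$1<a href="http://www.mysite.com/tag/$2">#$2</a>'
  );
}

console.log(linkHashtags("This is a #hashtag and this one #too."));
// This is a <a href="http://www.mysite.com/tag/hashtag">#hashtag</a> and this one <a href="http://www.mysite.com/tag/too">#too</a>.

console.log(linkHashtags("This is a [url=http://www.mysite.com/#name]named anchor[/url]"));
// unchanged

console.log(linkHashtags("This isn&#39;t a pretty way to use quotes"));
// unchanged
```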

There is an open-source Ruby gem called twitter-text that does this sort of thing (hashtags and @usernames). You might get some ideas and regexes from that, or try out this JavaScript port.
Using the JavaScript port, you'll want to just do:
var linked = TwitterText.auto_link_hashtags(text, {hashtag_url_base: "http://www.mysite.com/tag/"});

Tim, your solution was almost perfect. Here's what I ended up using:
subject.replace(/(^| )#([A-Za-z0-9_-]+)(?![A-Za-z0-9_\]-])/g, "$1#$2");
The only change is the first conditional, changed it to match the beginning of the string or a space character. (I tried \s, but that didn't work at all.)
