I searched high and low but cannot seem to find a definitive answer to this, as is often the case with regexps. So I thought I'd ask here.
I'm trying to put together a regular expression I can use in JavaScript to replace all instances of URLs and email addresses (doesn't need to be ever so strict) with anchor tags pointing to them.
Obviously this is something usually done very simply on the server-side, but in this case it is necessary to work with plain text, so an elegant JavaScript solution to perform the replacements at runtime would be perfect.
Only problem is, as I've stated before, I have a huge regular-expression-shaped gaping hole in my skill set :(
I know that one of you has the answer at the tip of your fingers though :)
Well, blindly using regexps from http://www.osix.net/modules/article/?id=586
var emailRegex =
new RegExp(
// backslashes are doubled because the pattern is built from string literals
'([a-zA-Z0-9_\\-\\.]+)@((\\[[0-9]{1,3}' +
'\\.[0-9]{1,3}\\.[0-9]{1,3}\\.)|(([a-zA-Z0-9\\-]+\\.' +
')+))([a-zA-Z]{2,4}|[0-9]{1,3})(\\]?)',
"gi");
var urlRegex =
new RegExp(
'((https?://)' +
'?(([0-9a-z_!~*\'().&=+$%-]+: )?[0-9a-z_!~*\'().&=+$%-]+@)?' + //user@
'(([0-9]{1,3}\\.){3}[0-9]{1,3}' + // IP- 199.194.52.184
'|' + // allows either IP or domain
'([0-9a-z_!~*\'()-]+\\.)*' + // tertiary domain(s)- www.
'([0-9a-z][0-9a-z-]{0,61})?[0-9a-z]\\.' + // second level domain
'[a-z]{2,6})' + // first level domain- .com or .museum
'(:[0-9]{1,4})?' + // port number- :80
'((/?)|' + // a slash isn't required if there is no file name
'(/[0-9a-z_!~*\'().;?:@&=+$,%#-]+)+/?))',
"gi");
then
text = text.replace(emailRegex, "<a href='mailto:$&'>$&</a>"); // $& is the whole match; $1 would be only the local part of the address
and
text = text.replace(urlRegex, "<a href='$1'>$1</a>");
might work
Not a canned solution, but this will point you in the right direction.
I use Regex Coach to build and test my regexes. You can find plentiful examples of regexes for urls and email addresses online.
Here's a good article for urls...
https://blog.codinghorror.com/the-problem-with-urls/
Emails are more straightforward since they have to end in a .tld
You don't need to get fancy with that one since you're not validating, just matching, so off the top of my head...
[^\s]+@\w[\w.-]*\.[a-zA-Z]+
As always, this ("this" being "processing HTML with regex") is going to be difficult and error-prone. The following will work on reasonably well-formed input only, but here's what I would do:
1. Find the element you want to process and take its innerHTML property value.
2. Iteratively find everything that already is a link (/<a\b.+?<\/a>/ig).
3. Based on that, cut your string into "this isn't a link" and "this is a link" bits, appending all of them to a neatly ordered array.
4. Process the "non-link" bits only (those that don't begin with "<a "), looking for URL or e-mail address patterns.
5. Wrap every address you find in <a> tags.
6. join() the array back to a string.
7. Set the innerHTML property to your new value.
I am sure you will find regular expression examples that match e-mail addresses and URLs. Take the ones that suit you most, and use them in step 4.
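The steps above could be sketched roughly like this. It is only an illustration: linkifyOutsideAnchors is a made-up name, the URL and e-mail patterns are deliberately loose placeholders rather than anything from this thread, and it works on a string so you would still read and write innerHTML yourself.

```javascript
// Existing links; the capturing group makes split() keep them in the result.
var ANCHOR_RE = /(<a\b[^>]*>[\s\S]*?<\/a>)/gi;

// Loose placeholder patterns (assumptions): group 1 = URL, group 2 = e-mail.
var TARGET_RE = /(\bhttps?:\/\/[^\s<]+[^\s<.,;:!?)])|([\w.+-]+@[\w-]+(?:\.[\w-]+)+)/gi;

function linkifyOutsideAnchors(html) {
  // Pieces alternate between "not a link" and "is a link".
  var parts = html.split(ANCHOR_RE);
  for (var i = 0; i < parts.length; i++) {
    if (/^<a\b/i.test(parts[i])) continue; // already a link, leave untouched
    // Single pass with a callback, so freshly created anchors are never re-scanned.
    parts[i] = parts[i].replace(TARGET_RE, function (match, url, email) {
      return url
        ? '<a href="' + url + '">' + url + '</a>'
        : '<a href="mailto:' + email + '">' + email + '</a>';
    });
  }
  return parts.join('');
}

var sample = 'See <a href="http://a.example/">http://a.example/</a> or ' +
  'http://b.example/page, mail me@example.org.';
console.log(linkifyOutsideAnchors(sample));
```

The existing link is passed through unchanged, while the bare URL and the address after it are wrapped, with the trailing comma and full stop left outside the tags.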
Just adding a bit of information on email regexps: most of them seem to ignore that domain names can have the characters 'åäö' in them. So if you care about that, make sure that the solution you are using has åäöÅÄÖ in the domain part of the regexp.
Related
To start off I know this is bad practice. I know there are libraries out there that are supposed to help with this; however, this is the task to which I was assigned and changing this whole thing to work with a library will be much more work than we can take on right now (since we are on a tight time frame).
In our web app we have fields that people usually type URLs into. We have been assigned a task to 'linkify' anything that looks like a URL. Currently the people who wrote our app seemed to have used a regex to determine if a string of text is a URL. I am basing my regex off that (I am no regex guru, not even a novice).
The 'search' regex looks like so
function DoesTextContainLinks(linktText) {
//replace all urls with links!
var linkifyValue = /((ftp|https?):\/\/)?(www\.)?([a-zA-Z0-9\-]{1,}\.){1,}[a-zA-Z0-9]{1,4}(:[0-9]{1,5})?(\/[a-zA-Z0-9\-\_\.\?\&\#]{1,})*(\/)?$/.test(linktText);
return linkifyValue;
}
Using this regex and https://regex101.com/ I have come up with two regexes that work most of the time.
function WrapLinkTextInAnchorTag(linkText) {
//capture links that only have www and add http to the beginning of them (regex ignores entries that have http, https, and ftp in them. They are handled by the next regexes)
linkText = linkText.replace(/(^(?:(?!http).)*^(?:(?!ftp).)(www\.)?([a-zA-Z0-9\-]{1,}\.){1,}[a-zA-Z0-9]{1,4}(:[0-9]{1,5})?(\/[a-zA-Z0-9\-\_\.\?\&\#]{1,})*(\/)?$)/gim, "<a href='http://$1'>$1</a>");
//capture links that have https and http on them and fix those too. No need to prepend http here
linkText = linkText.replace(/(((https|http|ftp?):\/\/)?(www\.)?([a-zA-Z0-9\-]{1,}\.){1,}[a-zA-Z0-9]{1,4}(:[0-9]{1,5})?(\/[a-zA-Z0-9\-\_\.\?\&\#]{1,})*(\/)?$)/gim, "<a href='$1'>$1</a>");
return linkText;
}
The problem here is that some complex URLs seem to not work. I can't understand exactly why they don't work. regex101 is pretty bad ass in that it tells you what each part is doing; however, my trouble is combining these keywords in the regex to get them to do what I want. I have two scenarios to account for : when a user types www.something.com | ftp.something.com and when a user actually types http://www.something.com.
I am looking for some help in pointing out exactly what is wrong with my 2 regexes that prevents them from capturing complicated URLs like the one below
https://pw.something.com/AAPS/default.aspx?guid=a5741c35-6fe1-31a1-b555-4028e931642b
I use this one ...
^(http|https|ftp)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?\/?([a-zA-Z0-9\-\._\?\,\'\/\\\+&%\$#\=~])*$
Look here ... Regex Tester
URL RegExp that requires (http, https, ftp)://, a nice domain, and a decent file/folder string. Allows : after the domain name, and these characters in the file/folder string (letters, numbers, - . _ ? , ' / \ + & % $ # = ~). It blocks all other special characters and is good for protecting against user input!
If you look closely you will notice that nowhere in your regexps do you match an = character. That's what's breaking on the example you give.
Changing the second regexp by adding a \= to the characters supported in the path:
linkText.replace(/(((https|http|ftp?):\/\/)?(www\.)?([a-zA-Z0-9\-]{1,}\.){1,}[a-zA-Z0-9]{1,4}(:[0-9]{1,5})?(\/[a-zA-Z0-9\-\_\.\?\&\#\=]{1,})*(\/)?$)/gim, "<a href='$1'>$1</a>");
Causes your example URL to match. That said it may be worth slogging through the RFC on urls (http://www.ietf.org/rfc/rfc3986.txt) to find other characters that might be allowed in URLs (even if they have special meanings) because you're probably missing some others.
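To see the effect in isolation, here is a cut-down sketch. These are not the exact regexps from the question: the two patterns are simplified and anchored at both ends, and they differ only in whether = is in the path character class.

```javascript
// Simplified, fully anchored variants (assumptions for illustration only).
var withoutEquals = /^https?:\/\/([a-zA-Z0-9\-]+\.)+[a-zA-Z0-9]{1,4}(\/[a-zA-Z0-9\-_.?&#]+)*\/?$/;
var withEquals = /^https?:\/\/([a-zA-Z0-9\-]+\.)+[a-zA-Z0-9]{1,4}(\/[a-zA-Z0-9\-_.?&#=]+)*\/?$/;

var complexUrl = 'https://pw.something.com/AAPS/default.aspx?guid=a5741c35-6fe1-31a1-b555-4028e931642b';

console.log(withoutEquals.test(complexUrl)); // false: matching stops at the "="
console.log(withEquals.test(complexUrl));    // true
```

Any other character missing from the class (for example % or +) would fail in the same way, which is why checking the RFC is worthwhile.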
I am trying to use a RegEx to search through a long string, and I am having trouble coming up with an expression. I am trying to search through some HTML for a set of tags beginning with a tag containing a certain value and ending with a different tag containing another value. The code I am currently using to attempt this is as follows:
matcher = new RegExp(".*(<[^>]+" + startText + "((?!" + endText + ").)*" + endText + ")", 'g');
data.replace(matcher, "$1");
The strangeness around the middle ( ((\\?\\!endText).)* ) is borrowed from another thread, found here, that seems to describe my problem. The issue I am facing is that the expression matches the beginning tag, but it does not find the ending tag and instead includes the remainder of the data. Also, the lookaround in the middle slowed the expression down a lot. Any suggestions as to how I can get this working?
EDIT: I understand that parsing HTML in RegEx isn't the best option (makes me feel dirty), but I'm in a time-crunch and any other alternative I can think of will take too long. It's hard to say what exactly the markup I will be parsing will look like, as I am creating it on the fly. The best I can do is to say that I am looking at a large table of data that is collected for a range of items on a range of dates. Both of these ranges can vary, and I am trying to select a certain range of dates from a single row. The approximate value of startText and endText are \\#\\#ASSET_ID\\#\\#_<YYYY_MM_DD>. The idea is to find the code that corresponds to this range of cells. (This edit could quite possibly have made this even more confusing, but I'm not sure how much more information I could really give without explaining the entire application).
EDIT: Well, this was a stupid question. Apparently, I just forgot to add .* after the last paren. Can't believe I spent so long on this! Thanks to those of you that tried to help!
First of all, why is there a .* Dot Asterisk in the beginning? If you have text like the following:
This is my Text
And you want "my Text" pulled out, you do my\sText. You don't have to do the .*.
That being said, since all you'll be matching now is what you need, you don't need the main Capture Group around "Everything". This: .*(xxx) is a huge no-no, and can almost always be replaced with this: xxx. In other words, your regex could be replaced with:
<[^>]+xxx((?!zzz).)*zzz
From there I examine what it's doing.
You are looking for an HTML opening delimiter <. You consume it.
You consume at least one character that is NOT a closing HTML delimiter, but can consume many. This is important, because if your tag is <table border=2>, then you have, at minimum, so far consumed <t, if not more.
You are now looking for a StartText. If that StartText is table, you'll never find it, because you have consumed the t. So replace that + with a *.
The regex still succeeds if what follows is NOT the closing text, but it starts checking from the VERY END of the document, because the asterisk is greedy. I suggest making it lazy by adding a ?.
When the backtracking fails, it will look for the closing text and gather it successfully.
The result of that logic:
<[^>]*xxx((?!zzz).)*?zzz
If you're going to use a dot anyway, which is okay for new Regex writers, but not suggested for seasoned, I'd go with this:
<[^>]*xxx.*?zzz
So for Javascript, your code would say:
matcher = new RegExp("<[^>]*" + startText + ".*?" + endText, 'gi');
I put the IgnoreCase "i" in there for good measure, but you may or may not want that.
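A small runnable sketch of that matcher, with two changes that are my own assumptions rather than part of the answer: escapeRegExp (a hypothetical helper) neutralises any regex metacharacters in startText and endText, and [\s\S]*? is swapped in for .*? so the match can cross line breaks. The sample markup and IDs are invented.

```javascript
// Hypothetical helper: escape regex metacharacters in literal text.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function buildRangeMatcher(startText, endText) {
  // Lazy quantifier: stop at the FIRST endText after the start tag.
  return new RegExp(
    '<[^>]*' + escapeRegExp(startText) + '[\\s\\S]*?' + escapeRegExp(endText),
    'gi'
  );
}

// Invented sample data for illustration.
var data = '<td id="A_2020_01_01">1</td><td id="A_2020_01_02">2</td>' +
  '<td id="A_2020_01_03">3</td>';
var found = data.match(buildRangeMatcher('A_2020_01_01', 'A_2020_01_02'));
console.log(found);
```

With a greedy quantifier the match would run on to the last occurrence of the end text in the data; the lazy version stops at the first one.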
I've got this regex pattern from WMD showdown.js file.
/<((https?|ftp|dict):[^'">\s]+)>/gi
and the code is:
text = text.replace(/<((https?|ftp|dict):[^'">\s]+)>/gi,"$1");
But when I set text to http://www.google.com, it does not anchor it, it returns the original text value as is (http://www.google.com).
P.S: I've tested it with RegexPal and it does not match.
Your code is searching for a url wrapped in <> like: <http://www.google.com>: RegexPal.
Just change it to /((https?|ftp|dict):[^'">\s]+)/gi if you don't want it to search for the <>: RegexPal
As long as you know your URLs start with http:// or https:// or whatever, you can use:
/((https?|s?ftp|dict|www)(:\/\/)?[A-Za-z0-9.\-]+)/gi
The expression will match until it encounters a character that is not allowed in the URL, i.e. is not in A-Za-z0-9.\-. It will not, however, detect anything of the form google.com, or anything that comes after the domain name, like parameters or subdirectory paths. If that is your requirement, then you can simply use the terminating condition you have above in your regex.
I know it seems pointless but it may be useful if you want the display name to be something abbreviated rather than the whole url in case of complex urls.
You could use:
var re = /(http|https|ftp|dict)(:\/\/\S+?)(\.?\s|\.?$)/gi;
with:
el.innerHTML = el.innerHTML.replace(re, '<a href=\'$1$2\'>$1$2<\/a>$3');
to also match URLs at the end of sentences.
But you need to be very careful with this technique, make sure the content of the element is more or less plain text and not complex markup. Regular expressions are not meant for, nor are they good at, processing or parsing HTML.
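Here is the same pattern tried on a plain string rather than on innerHTML, which makes it easy to see what happens to a full stop after a URL. linkifySentence is a made-up helper name and the sample sentence is invented.

```javascript
// Scheme, lazily matched rest of the URL, then the trailing ". ", " " or
// end-of-string, which is captured separately so it stays outside the link.
var re = /(http|https|ftp|dict)(:\/\/\S+?)(\.?\s|\.?$)/gi;

// Hypothetical helper working on a plain string.
function linkifySentence(text) {
  return text.replace(re, "<a href='$1$2'>$1$2</a>$3");
}

var linked = linkifySentence('Visit http://example.com/docs. Then ftp://files.example.com');
console.log(linked);
```

The dot that ends the first sentence lands after the closing tag, while the dots inside the host names stay part of the links.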
On my web app, I take a look at the current URL, and if the current URL is a form like this:
http://www.domain.com:11000/invite/abcde16989/root/index.html
-> All I need is to extract the ID which consists of 5 letters and 5 numbers (abcde16989) in another variable for further use.
So I need this:
var current_url = "the whole path, not just the hostname";
if (current_url has ID)
var ID = abcde16989;
You could always use split using / as the delimiter if the ID is always going to be in the same position, eg
var parts = current_url.split('/');
var id = parts[4];
Though your requirement of matching "5 letters and 5 numbers" really does suit a regex match.
var id = current_url.match(/[a-zA-Z]{5}[0-9]{5}/); // returns null if not found, otherwise an array whose first element is the ID
I'm assuming you don't need the full URL, but just the pathname to get your ID. Use the following:
var current_url = window.location.pathname; //gets the pathname
var split_url = current_url.split('/'); //splits the path at each /
var current_id = split_url[2]; //index 0 is "" (the path starts with /), 1 is "invite", 2 is your id, 3 would be "root"
alert(current_id);
Firstly, this doesn't need JQuery; this is simple Javascript. I'll amend your tags after I've replied to reflect this.
A regex would actually be quite an easy way to achieve this, and I don't think a simple one like this would be as difficult to understand as you think.
So I'll answer with the regex option anyway and then move on to other options:
var url = "http://www.domain.com:11000/invite/abcde16989/root/index.html";
//first method:
var id = url.match('^http://www.domain.com:11000/invite/(.+)/root/index.html$')[1];
//second method: (if you don't know exact format of the rest of the URL but you do know the format of the ID string)
var id = url.match('/([a-z]{5}[0-9]{5})/')[1];
The first method will get the string in the position you specified within the URL. It won't check the formatting; it just looks at the rest of the URL and grabs the bit of it you're asking for. This should be really easy to understand: It's basically just your URL, but with (.+) where your ID goes.
The second method looks specifically for a string in the format you asked for -- ie five letters and then five numbers. This is admittedly a bit harder to read, but should be fairly self explanatory if you look at it given those criteria.
In both cases, the regex itself will return an array of results, with array element zero being the whole string (ie in the first case, including the rest of the URL). This is where the (brackets) come in (ie the bit where we said (.+)). This tells the match function to put the contents of the brackets into another array element so we can use it. In both cases, this means that we can read the ID in array element [1].
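One thing to watch: match() returns null when nothing matches, so indexing [1] directly throws on a URL without an ID. A minimal sketch that guards against this, assuming the ID always follows /invite/ (extractInviteId is a hypothetical name):

```javascript
// Hypothetical helper: returns the ID string, or null when the URL has none.
function extractInviteId(currentUrl) {
  // One capture group: five letters then five digits, right after /invite/.
  var match = /\/invite\/([a-zA-Z]{5}[0-9]{5})(?:\/|$)/.exec(currentUrl);
  return match ? match[1] : null;
}

console.log(extractInviteId('http://www.domain.com:11000/invite/abcde16989/root/index.html'));
console.log(extractInviteId('http://www.domain.com:11000/about'));
```

The first call logs abcde16989 and the second logs null, so the caller can test the result with a plain if before using it.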
Okay, so how about the non-regex options:
In fact, it's going to be quite hard to do it in a simple way without regex in Javascript, since even the simple string splitting function uses a regex match to do the split (granted it would be a very simple one, it is still a regex). A couple of other people have already given you answers using this, but it is still a regex, so technically they've also not answered your question accurately.
I'm going to guess that actually one of these answers will be good enough for you (either mine or more likely one of the answers using split()), despite there still being a regex element. However if you really don't want anything to do with regex, you're going to have to start doing some slightly more complex string manipulation, probably using substring() (though there are other ways to do it).
Something along the lines of this:
var prefixstring="http://www.domain.com:11000/invite/";
var prefixlen=prefixstring.length;
var idlen=10;
var id = url.substring(prefixlen,idlen+prefixlen);
This gets the length of the portion of the URL in front of the ID, and then uses substring() to snip out the required bit. But I'm sure you'll agree that the regex options are simpler? ;-)
Hope that helps. (and I hope it helps you feel less afraid of regex!)
Related (but slightly different):
Javascript Regex: surround @_____, #_____, and http://______ with anchor tags in one pass?
I would like to surround all instances of @_______, #________, and http://________ with anchor tags. Multiple passes are fine with me.
For example, consider this Twitter message:
The quick brown fox @Spreadthemovie jumps over the lazy dog #cow, http://bit.ly/bC9Dy
Running it with the desired regex pattern would yield:
The quick brown fox @Spreadthemovie jumps over the lazy
dog #cow, http://bit.ly/bC9Dy
Only surround words that start with @, # or http://, so that the @gmail.com in dog@gmail.com would not be linked. Also, note how "#cow," turned into "<a href=urlB>#cow</a>," ... I only want alpha-numeric characters to be on the end of each anchor tagged substring. Also notice the href attribute.
If possible, please include actual javascript code with the regex pattern and replace function.
Many thanks! This problem has been plaguing me for a while
In my code I have a similar function; you can take a look and change it to fit your needs:
function checkChatUrl($matches)
{
if(strpos($matches[0],'http://www.xxx.pl/?task=forum')!==false) $n='>forum';
elseif(strpos($matches[0],'http://www.xxx.pl')!==false) $n='>xxx';
elseif(strpos($matches[0],'db.php')!==false) return "";
elseif(strpos($matches[0],'%22')!==false) return "";
else $n=">".substr($matches[1].$matches[2],0,10).((strlen($matches[1].$matches[2])>10)?'..':'');
return "<a href='http://$matches[1]$matches[2]' target=_blank $n</a>";
}
$text=preg_replace_callback("/\bhttp:\/\/([\w\.]+)([\#\,\/\~\?\&\=\;\-\w+\.\/]+)\b/i",'checkChatUrl',$text);
This was designed for URL links in chat: it shortens the displayed name, and for some URLs it uses prepared shortcuts.
str.replace(
/(\s|^)([@#])([\w\d]+)|(http:\/\/\S+)/g,
'$1$2$3$4'
);
For matching @ and # tags, I'd suggest using the \w metapattern (matches word characters - so it'll match digits and letters, but not whitespace/punctuation). Thus, you'd want something like the following patterns to pull out the matched items:
(@\w+)
(#\w+)
For matching URLs, a simple but naive pattern would be to just match http:// followed by any non-whitespace:
(http://\S+)
However, there are certain characters not valid in URLs that would get captured by this. A more sophisticated pattern that only allows characters which are valid in URLs would be the following:
(http://[a-zA-Z0-9+$_.+!*'(),#/-]+)
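Putting these pieces together, a single-pass version might look like the sketch below. It assumes the two prefixes are @ and #, and the link targets (a twitter.com profile URL and a twitter.com/hashtag/ URL) are my own placeholders, not something given in the question. linkifyTweet is a made-up name.

```javascript
function linkifyTweet(text) {
  // Either a tag preceded by whitespace/start of string, or a URL that
  // must end in a word character so trailing punctuation is left out.
  var pattern = /(^|\s)([@#])(\w+)|\b(https?:\/\/\S*\w)/g;
  return text.replace(pattern, function (match, lead, sigil, name, url) {
    if (url) {
      return '<a href="' + url + '">' + url + '</a>';
    }
    // Placeholder targets (assumptions): adjust to your own site.
    var href = sigil === '@'
      ? 'https://twitter.com/' + name
      : 'https://twitter.com/hashtag/' + name;
    return lead + '<a href="' + href + '">' + sigil + name + '</a>';
  });
}

var tweet = 'The quick brown fox @Spreadthemovie jumps over the lazy dog #cow, ' +
  'http://bit.ly/bC9Dy and dog@gmail.com';
console.log(linkifyTweet(tweet));
```

Because a tag only counts when it follows whitespace or the start of the string, the address dog@gmail.com is left alone, and the comma after #cow stays outside the link. Note that \w also accepts an underscore, so "alpha-numeric" is slightly loosened here.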
Here is a revised answer based on the revised question. You should actually put the revision/comment on the original question.
It uses 3 patterns for 3 actions and chains them. It uses the word boundary pattern (\b\B) as appropriate instead of (^|\s). This picks up patterns separated by punctuation and no space, eg #tweet,#tag
<script type=text/javascript>
function addTags(str) {
return str.replace(/\B(@)(\w+)/g, '<a href="//twitter.com/$2">$1$2</a>')
.replace(/\B(#)(\w+)/g, '$1$2')
.replace(/\b(http:\S+[^,.])/g, '$1')
;
}
function testTags() {
document.getElementById('outstr').innerHTML =
document.getElementById('outtxt').innerHTML =
addTags(document.getElementById('instr').value);
}
</script>
<input type=text size=100 id="instr" value="@begin ignore@email.com and then #cow to http://mysite.com and also http://yoursite.com."><br>
<p><textarea id="outtxt" cols=90></textarea>
<p id=outstr></p>
<p><button onclick="testTags();">TEST</button>
I tested it with the above.
One important thing!
Make sure you are aware of the possible risks in doing naive replacement on links.
Do not allow users to insert arbitrary HTML on your site. The name of the XSS game is sanitizing user input. If you stick to a whitelist based approach -- only allow input that you know to be good, and immediately discard anything else -- then you're usually well on your way to solving any XSS problems you might have.
Naïve replacement counts as allowing users to insert arbitrary HTML on your site.
At the very least, try to make sure that the resulting <a href=''> does not start with javascript:, as you'd be open to Cross-Site Scripting.
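As a rough sketch of the whitelist idea: split the raw text into URL and non-URL pieces first, escape every piece, and only wrap pieces whose scheme is explicitly allowed. escapeHtml and safeLinkify are hypothetical helpers, and allowing only http and https is an assumption you may want to widen.

```javascript
// Hypothetical helper: escape the characters that are special in HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

function safeLinkify(plainText) {
  // Whitelist: only http:// and https:// are ever turned into links,
  // so a javascript: URL can never end up in an href.
  var pieces = plainText.split(/(\bhttps?:\/\/[^\s<>"']+)/i);
  return pieces.map(function (piece, index) {
    var escaped = escapeHtml(piece);
    // split() with a capturing group puts the URLs at the odd indexes.
    return index % 2 === 1
      ? '<a href="' + escaped + '">' + escaped + '</a>'
      : escaped;
  }).join('');
}

var unsafe = '<b>hi</b> http://example.com/?a=1&b=2 javascript:alert(1)';
console.log(safeLinkify(unsafe));
```

Escaping after splitting, rather than before, keeps entities such as &quot; from being swallowed into a URL match, and everything that is not a whitelisted link comes out as inert text.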