Javascript RegEx to match URL's but exclude images - javascript

I need to replace all text links in a string of HTML text by actual clickable links. Works fine with the following RegEx:
/(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/gi
I then noticed that it also replaces image sources and links that are already formatted. I figure I need to exclude links preceded by src=" or > ... I searched a bit and read a lot about negative lookahead in many questions answered here. I tried this (added something at the start of the pattern):
/(^(?!src="|>)\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/gi
But this doesn't match any link anymore. I tried several similar patterns (without the ^, changing some brackets, and so on), but nothing seems to work. I also tried putting .{0} between the part I added and \b, to make sure it would only look at what is directly in front of the URL and not consider anything farther away.

EDIT: The discussion was getting long, so I decided to update the answer instead.
Trusting that your original regex works, I'm just going to refer to a simplified version through the rest of this answer:
/\b(https?|ftp|file)/gi
Now, you attempted this:
/^(?!src="|>)\b(https?|ftp|file)/gi
 ^
The main error here is marked by a caret: the caret. That forces your regex to match from the beginning of the line, which is why it matched nothing. Let's remove that and move on:
/(?!src="|>)\b(https?|ftp|file)/gi
The main error, this time, is in your conception of lookahead assertions. As I explained in the comments, this assertion is redundant, because you are saying, "Match http or https or ftp or file, as long as none of these are src=" or >." It's almost so redundant that the sentence doesn't even make sense to us! What you want, instead, is a lookbehind assertion:
/(?<!src="|>)\b(https?|ftp|file)/gi
   ^
Why? Because you wish to find src=" or > behind the string you potentially wish to match. The problem? JavaScript didn't support lookbehind assertions when this was written (they were only added in ES2018). So, I suggested an alternative. Admittedly, it was flawed (although not the cause of the HTML breaking, as you brought up). Here it is, fixed:
/(.[^>"]|[^=]")\b(https?|ftp|file)/gi
  ^^^^^^^^^^^^
This is indeed a non-intuitive regex, and warrants explanation. It splits our cases into two. Say we have a two-character set. If the set doesn't end in > or ", then we're not suspicious of it; we're good to go; match any URL that might follow. However, if it does end in > or ", well, the only "forgivable" case is where the first character is not an =. So you see, a bit of logic trickery here.
Now, as for why this might break your HTML. Be sure to use JavaScript's replace, and substitute the first captured group back into the page! If you simply substitute each match with nothingness, you end up "eating up" the two-character sets, which we only meant to investigate, not destroy.
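html.replace(/(.[^>"]|[^=]")\b(https?|ftp|file)/gi,
    function(match, $1, $2, offset, original) {
        // $1 is the two-character set we only inspected: put it back.
        // $2 is the matched scheme: keep it, or build your link from it.
        return $1 + $2;
    });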
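To make the idea concrete, here is a minimal sketch of the prefix-capturing workaround applied to linkifying. The URL pattern is deliberately simplified and the linkify name is made up for illustration; substitute your full pattern for the second group.

```javascript
// Sketch only: a simplified URL pattern stands in for the full one above.
// Group 1 is the two-character prefix we inspect; group 2 is the URL.
var urlAfterSafePrefix = /(.[^>"]|[^=]")\b((?:https?|ftp|file):\/\/[^\s<"]+)/gi;

// Hypothetical helper: wraps bare URLs in anchors and puts the prefix back.
function linkify(html) {
    return html.replace(urlAfterSafePrefix, function (match, prefix, url) {
        return prefix + '<a href="' + url + '">' + url + '</a>';
    });
}

console.log(linkify('see http://example.com now'));
console.log(linkify('<img src="http://example.com/a.png">')); // left unchanged
```

Because the pattern needs two characters in front of the URL, a URL at the very start of the string is not matched; prepending a couple of spaces before the replace is one crude way around that.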

I have to go home and haven't tested this yet, but I'd feel more comfortable tackling the easier task of first isolating the HTML you don't want to touch:
1. Do a match to get an array of the stuff you don't want to deal with.
2. Rip it all out with a split.
3. Iterate over the split array, replace URLs, and splice the matched items back in.
4. Join and return.
The only assumption is that your text doesn't end on an anchor or img tag.
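function zipperParse(htmlText, matcher) {
    // match() returns null when nothing matches, so fall back to an empty array
    var zipBackInArray = htmlText.match(matcher) || [],
        workingArray = htmlText.split(matcher),
        i = workingArray.length;
    while (i--) {
        // You got this one covered; it should return the linkified text
        workingArray[i] = buildAnchorTagIfURLPresent(workingArray[i]);
        // Working backwards makes splice much easier to use here.
        // There is one fewer match than there are pieces, so skip index 0.
        if (i > 0) {
            workingArray.splice(i, 0, zipBackInArray.pop());
        }
    }
    return workingArray.join('');
}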
var toExclude = /<a[^>]*>[^>]*>|<img[^>]*>/g;
// is supposed to match all img and anchor pairs but not handling tags inside anchors yet
zipperParse(yourHtmlText,toExclude);
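The same "set aside what you don't want first" idea can also be sketched as a single replace, assuming it is acceptable to treat whole anchors and img tags as opaque tokens. The pattern and the function name below are illustrative assumptions, and the URL part is simplified.

```javascript
// A tag alternative starts matching at its '<', which comes before any URL
// inside the tag, so those URLs are consumed as part of the tag token.
var tagOrUrl = /<a\b[^>]*>[\s\S]*?<\/a>|<img\b[^>]*>|\b(?:https?|ftp|file):\/\/[^\s<"]+/gi;

// Hypothetical helper: returns tags untouched and wraps bare URLs in anchors.
function linkifyOutsideTags(html) {
    return html.replace(tagOrUrl, function (token) {
        if (token.charAt(0) === '<') {
            return token; // existing anchor or img: leave as is
        }
        return '<a href="' + token + '">' + token + '</a>';
    });
}

console.log(linkifyOutsideTags('<img src="http://b.com/i.png"> http://c.com'));
```

Like the version above, this sketch does not handle tags nested inside anchors in any special way; it simply stops at the first closing a tag.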

This code works for me. Just replace the placeholder Google API key (XXXXXXXXXXXXXXXXXXXXXX) with your own. I put it in the functions.php of my WordPress theme. First check how the Google Maps script URL appears on your site, then match it against the string being replaced.
function remove_script_version( $src ) {
$parts1 = explode( '?', $src );
$parts2 = str_replace('//maps.googleapis.com/maps/api/js', '//maps.googleapis.com/maps/api/js?language=es&v=3.31&libraries=places&key=XXXXXXXXXXXXXXXXXXXXXX&ver=3.31', $parts1);
return $parts2[0]; }
add_filter( 'script_loader_src', 'remove_script_version', 15, 1 );
add_filter( 'style_loader_src', 'remove_script_version', 15, 1 );

Related

Javascript regex to check for error tags

I need to write a regex that will tell me if any back-end framework that I'm working with is throwing an error and then store those errors in an array for retrieval if necessary.
The issue is, they use different tags for errors. Tags are as follows:
{{error}}, <<error>>, [[error]], and <{:error:}>
Usually, but not always, a set of braces will come after. Inside the braces will be a string; either an explanation of the error, or a JSON string containing more info about the error, like this:
<<error>> { Something has gone terribly wrong. }
<<error>> {
{"some":"json"}
}
<{:error:}> { What went wrong? }
As of now, it's undergoing a specific check for each tag, which is rather inefficient, like this:
if ( string.indexOf('<<error>>') >= 0 )
// Remove << and >>
if ( string.indexOf('[[error]]') >= 0 )
// Remove [[ and ]]
// So forth...
Then, I am left with a string like this:
error { Some description. }
or
error {
{"some":"json"}
}
Which I need a regex to extract what's between the brackets. This was the regex I wrote, but it falls short on numerous things:
string.match('/error\s?\{([^\}]+)\}/gi');
As I said, this procedure is very inefficient and has issues.
First, it doesn't allow the braces {} after error to be optional. They should be optional.
Second, it does not allow JSON, because the character class [^}] stops matching at the JSON's own closing }. So I need some way of matching all characters until the opening brace after error is closed. Is this possible?
Given the comments on my first answer, I'd use this regular expression in a replace to convert the data into single-line JSON. It also removes comments, and it removes newlines that are not followed by a properly labeled error. Multiline mode must be on.
(?:\/\*[\s\S]*?\*\/|\/\/.*$|\s*^\s*(?!<<|{{|\[\[|<{:)) (demo)
or (?:\s*^\s*(?!<<|{{|\[\[|<{:)) if there are never comments to remove
Then match this regex against the reformatted string to extract the error information:
({{error}}|<<error>>|\[\[error\]\]|<{:error:}>)[ \t]*(?:(.*)}\s*$)? demo
I'll leave the other answer intact as I think it basically explains the problems that a person can encounter doing this.
Good question. Explained your problem, showed what you've tried, gave enough examples of input.
Regex, especially JavaScript's limited implementation, is not ideal for parsing many languages and data objects. In this scenario it can be difficult to capture up to, say, the brace marked 5: .* wants to go to 6 and .*? wants to stop at 4.
{
  {
    {
    } // 4
  } // 5
} // 6
However, if your code is really indented like your examples (it may not be, that could be you making it readable), you should be able to use something like this ({{error}}|<<error>>|\[\[error\]\]|<{:error:}>)\s*(\s*{(.*?(?=$)|[\s\S]*?^)})?, (demo)
What this is doing is
1. capturing from { to } on the same line; if it can't, it proceeds to step 2 (the alternation).
2. capturing everything between { and } as long as } starts the line.
If the } is always prefixed by a certain number of spaces, you can prefix the marked } with that number of spaces in the regex.
({{error}}|<<error>>|\[\[error\]\]|<{:error:}>)\s*(\s*{(.*?(?=$)|[\s\S]*?^)})?
^
If the } is always prefixed by the same number of spaces as the opening error marker, you can do this
([\t ]*)({{error}}|<<error>>|\[\[error\]\]|<{:error:}>)(?:[ \t]*({(.*?(?=}$)|[\s\S]*?^\1)})?) (demo)
For this example, it's important to look at the full sample indent text. I demonstrate how it can go wrong.
If these won't work, you'll need a more code-oriented solution, but at the very least you can detect the presence of errors with this:
({{error}}|<<error>>|\[\[error\]\]|<{:error:}>)
Chris85's simpler version is bad form: it could match <<error]] and any other combination, something he's probably aware of.
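As a rough idea of what a more code-oriented solution could look like, the sketch below finds each marker with a regex and then counts braces by hand to find the matching close. The extractErrors name and the returned shape are assumptions for illustration, and the brace counter does not understand braces inside JSON string values.

```javascript
// Matches any of the four well-formed markers (never a mix such as <<error]]).
var markerPattern = /\{\{error\}\}|<<error>>|\[\[error\]\]|<\{:error:\}>/g;

// Hypothetical helper: returns [{ tag, body }], with body null when no braces follow.
function extractErrors(text) {
    var errors = [], m;
    markerPattern.lastIndex = 0;
    while ((m = markerPattern.exec(text)) !== null) {
        var i = markerPattern.lastIndex, body = null;
        // Skip spaces and tabs (not newlines) between the marker and a brace.
        while (i < text.length && /[ \t]/.test(text.charAt(i))) {
            i++;
        }
        if (text.charAt(i) === '{') {
            var depth = 0, start = i;
            for (; i < text.length; i++) {
                var ch = text.charAt(i);
                if (ch === '{') {
                    depth++;
                } else if (ch === '}' && --depth === 0) {
                    body = text.slice(start + 1, i).trim();
                    markerPattern.lastIndex = i + 1; // resume after the closing brace
                    break;
                }
            }
        }
        errors.push({ tag: m[0], body: body });
    }
    return errors;
}

console.log(extractErrors('<<error>> {\n{"some":"json"}\n}\n[[error]]'));
```

The body comes back as a trimmed string, so a caller that expects JSON can try JSON.parse on it and fall back to the plain text when parsing throws.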

Regex to find web addresses in short copy

Given a short piece of copy, I need to match all occurrences of links to websites. To keep things simple, I need to find addresses in these formats:
www.aaaaaa.bbbbbb
http://aaaaaa.bbbb
https://aa.bbbb
but also I need to take care of longer www/http/https versions:
www.aaaaa.bbbb.ccc.ddd.eeee
etc. So basically the number of subdomains is not known. I came up with this regex:
(www\.([a-zA-Z0-9-_]|\.(?!\s))+)[\s|,|$]|(http(s)?:\/\/(?!\.)([a-zA-Z0-9-_]|\.(?!\s))+)[\s|,|$]
If you test on:
this is some tex with www.somewIebsite.dfd.jhh.hjh inside of it or maybe http://www.ssss.com or maybe https://evenore.com hahaah blah
It works fine, except when the address is at the very end. $ seems to work only when there is a \n at the end, and it fails for:
this is some tex with www.somewIebsite.dfd.jhh.hjh
I'm guessing the fix is simple and I'm missing something obvious, so how would I fix it? BTW, I posted the regex here if you want to quickly play around: https://regex101.com/r/eL1bI4/3
The problem is that you placed the end anchor $ inside the character group []
[\s|,|$]
It is then interpreted literally as a dollar sign, and not as the anchor (the pipe character | is also interpreted literally, it's not needed there). The solution is to move the $ anchor outside:
(?:[\s,]|$)
However, in this case it makes more sense to use a positive lookahead instead of the noncapturing group (you don't want trailing spaces, or commas):
(?=[\s,]|$)
In the result you will end up with the following regex pattern:
(www\.([a-zA-Z0-9-_]|\.(?!\s))+)(?=[\s,]|$)|(http(s)?:\/\/(?!\.)([a-zA-Z0-9-_]|\.(?!\s))+)(?=[\s,]|$)
See the working demo.
The updated version that handles trailing full stops:
(www\.([a-zA-Z0-9-_]|\.(?!\s|\.|$))+)(?=[\s,.]|$)|(http(s)?:\/\/(?!\.)([a-zA-Z0-9-_]|\.(?!\s|\.|$))+)(?=[\s,.]|$)
See the working demo.
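A quick way to see the difference for yourself is the sketch below. It uses shortened patterns rather than the full regex above, so the exact character classes are assumptions made to keep the example small.

```javascript
var text = 'this is some tex with www.somewIebsite.dfd.jhh.hjh';

// Inside [...] the $ is a literal dollar sign, so an address at the very end fails.
var dollarInClass = /www\.[a-zA-Z0-9._-]+[\s,$]/;
// Outside the class, $ is the end-of-input anchor; the lookahead consumes nothing.
var dollarAsAnchor = /www\.[a-zA-Z0-9._-]+(?=[\s,]|$)/;

console.log(dollarInClass.test(text));     // false
console.log(dollarAsAnchor.exec(text)[0]); // "www.somewIebsite.dfd.jhh.hjh"

// Simplified combined pattern (an assumption, not the exact regex from the answer).
// A dot is only consumed when it is not followed by whitespace, another dot, or the end.
var address = /(?:www\.|https?:\/\/)(?:[a-zA-Z0-9_-]|\.(?!\s|\.|$))+(?=[\s,.]|$)/g;
var sample = 'maybe http://www.ssss.com or maybe https://evenore.com.';
console.log(sample.match(address)); // the trailing full stop is left out of the second match
```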

regex replace on JSON is removing an Object from Array

I'm trying to improve my understanding of Regex, but this one has me quite mystified.
I started with some text defined as:
var txt = "{\"columns\":[{\"text\":\"A\",\"value\":80},{\"text\":\"B\",\"renderer\":\"gbpFormat\",\"value\":80},{\"text\":\"C\",\"value\":80}]}";
and do a replace as follows:
txt.replace(/\"renderer\"\:(.*)(?:,)/g,"\"renderer\"\:gbpFormat\,");
which results in:
"{"columns":[{"text":"A","value":80},{"text":"B","renderer":gbpFormat,"value":80}]}"
What I expected was for the renderer attribute value to have its quotes removed, which has happened, but the C column is also completely missing! I'd really love for someone to explain how my regex has removed column C.
As an extra bonus, if you could explain how to remove the quotes around any value for renderer (i.e. so I don't have to hard-code the value gbpFormat in the regex) that'd be fantastic.
You are using a greedy operator while you need a lazy one. Change this:
"renderer":(.*)(?:,)
^---- add here the '?' to make it lazy
To
"renderer":(.*?)(?:,)
Working demo
Your code should be:
txt.replace(/\"renderer\"\:(.*?)(?:,)/g,"\"renderer\"\:gbpFormat\,");
If you are learning regex, take a look at this documentation to learn more about greediness. A nice extract to understand this is:
Watch Out for The Greediness!
Suppose you want to use a regex to match an HTML tag. You know that
the input will be a valid HTML file, so the regular expression does
not need to exclude any invalid use of sharp brackets. If it sits
between sharp brackets, it is an HTML tag.
Most people new to regular expressions will attempt to use <.+>. They
will be surprised when they test it on a string like This is a
<EM>first</EM> test. You might expect the regex to match <EM> and when
continuing after that match, </EM>.
But it does not. The regex will match <EM>first</EM>. Obviously not
what we wanted. The reason is that the plus is greedy. That is, the
plus causes the regex engine to repeat the preceding token as often as
possible. Only if that causes the entire regex to fail, will the regex
engine backtrack. That is, it will go back to the plus, make it give
up the last iteration, and proceed with the remainder of the regex.
Like the plus, the star and the repetition using curly braces are
greedy.
Try like this:
txt = txt.replace(/"renderer":"(.*?)"/g,'"renderer":$1');
The issue in the expression you were using was this part:
(.*)(?:,)
The * quantifier is greedy by default, which means that it gobbles up as much as it can, so it will run up to the last comma in your string. The easiest solution would be to turn it into a non-greedy quantifier by adding a question mark after the asterisk, changing that part of your expression to look like this:
(.*?)(?:,)
For the solution I proposed at the top of this answer, I also removed the part matching the comma, because I think it's easier just to match everything between quotes. As for your bonus question, to replace the matched value instead of having to hardcode gbpFormat, I used a backreference ($1), which will insert the first matched group into the replacement string.
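To see what each quantifier actually captures, here is a small sketch run against the string from the question. The variable names are made up for illustration.

```javascript
var txt = '{"columns":[{"text":"A","value":80},{"text":"B","renderer":"gbpFormat","value":80},{"text":"C","value":80}]}';

// Greedy: (.*) runs to the end of the string and backtracks only to the LAST comma.
var greedyCapture = /"renderer":(.*)(?:,)/.exec(txt)[1];
// Lazy: (.*?) expands one character at a time and stops at the FIRST comma.
var lazyCapture = /"renderer":(.*?)(?:,)/.exec(txt)[1];

console.log(greedyCapture); // swallows the rest of column B and the start of column C
console.log(lazyCapture);   // just the quoted value

// Matching between the quotes and re-inserting the group drops the quotes
// without hard-coding the value.
var unquoted = txt.replace(/"renderer":"(.*?)"/g, '"renderer":$1');
console.log(unquoted);
```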
Don't manipulate JSON with regexp. It's too likely that you will break it, as you have found, and more importantly there's no need to.
In addition, once you have changed
'{"columns": [..."renderer": "gbpFormat", ...]}'
into
'{"columns": [..."renderer": gbpFormat, ...]}' // remove quotes from gbpFormat
then this is no longer valid JSON. (JSON requires that property values be numbers, quoted strings, booleans, null, objects, or arrays.) So you will not be able to parse it, or send it anywhere and have it interpreted correctly.
Therefore you should parse it to start with, then manipulate the resulting actual JS object:
var object = JSON.parse(txt);
object.columns.forEach(function(column) {
    // gbpFormat here is the actual function, not a string
    column.renderer = gbpFormat;
});
If you want to replace any quoted value of the renderer property with the value itself, then you could try
column.renderer = window[column.renderer];
Assuming that the value is available in the global namespace.
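If you would rather not depend on the global namespace, one option is an explicit lookup table. The sketch below assumes a hypothetical renderers object and a made-up gbpFormat implementation; only the names present in the table are swapped in.

```javascript
var txt = '{"columns":[{"text":"A","value":80},{"text":"B","renderer":"gbpFormat","value":80}]}';

// Hypothetical registry of renderer functions, keyed by the names used in the JSON.
var renderers = {
    gbpFormat: function (value) {
        return 'GBP ' + value.toFixed(2);
    }
};

var config = JSON.parse(txt);
config.columns.forEach(function (column) {
    // Only swap in a function when the name is one we know about.
    if (typeof column.renderer === 'string' &&
            Object.prototype.hasOwnProperty.call(renderers, column.renderer)) {
        column.renderer = renderers[column.renderer];
    }
});

console.log(config.columns[1].renderer(config.columns[1].value)); // "GBP 80.00"
```

Columns without a renderer property are left alone, and an unknown name stays a string instead of silently becoming undefined.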
This question falls into the category of "I need a regexp, or I wrote one and it's not working, and I'm not really sure why it has to be a regexp, but I heard they can do all kinds of things, so that's just what I imagined I must need." People use regexps to try to do far too many complex matching, splitting, scanning, replacement, and validation tasks, including on complex languages such as HTML, or in this case JSON. There is almost always a better way.
The only time I can imagine wanting to manipulate JSON with regexps is if the JSON is broken somehow, perhaps due to a bug in server code, and it needs to be fixed up in order to be parseable.

JavaScript + RegEx Complications- Searching Strings Not Containing SubString

I am trying to use a RegEx to search through a long string, and I am having trouble coming up with an expression. I am trying to search through some HTML for a set of tags beginning with a tag containing a certain value and ending with a different tag containing another value. The code I am currently using to attempt this is as follows:
matcher = new RegExp(".*(<[^>]+" + startText + "((?!" + endText + ").)*" + endText + ")", 'g');
data.replace(matcher, "$1");
The strangeness around the middle ( ((\\?\\!endText).)* ) is borrowed from another thread, found here, that seems to describe my problem. The issue I am facing is that the expression matches the beginning tag, but it does not find the ending tag and instead includes the remainder of the data. Also, the lookaround in the middle slowed the expression down a lot. Any suggestions as to how I can get this working?
EDIT: I understand that parsing HTML in RegEx isn't the best option (makes me feel dirty), but I'm in a time-crunch and any other alternative I can think of will take too long. It's hard to say what exactly the markup I will be parsing will look like, as I am creating it on the fly. The best I can do is to say that I am looking at a large table of data that is collected for a range of items on a range of dates. Both of these ranges can vary, and I am trying to select a certain range of dates from a single row. The approximate value of startText and endText are \\#\\#ASSET_ID\\#\\#_<YYYY_MM_DD>. The idea is to find the code that corresponds to this range of cells. (This edit could quite possibly have made this even more confusing, but I'm not sure how much more information I could really give without explaining the entire application).
EDIT: Well, this was a stupid question. Apparently, I just forgot to add .* after the last paren. Can't believe I spent so long on this! Thanks to those of you that tried to help!
First of all, why is there a .* Dot Asterisk in the beginning? If you have text like the following:
This is my Text
And you want "my Text" pulled out, you do my\sText. You don't have to do the .*.
That being said, since all you'll be matching now is what you need, you don't need the main Capture Group around "Everything". This: .*(xxx) is a huge no-no, and can almost always be replaced with this: xxx. In other words, your regex could be replaced with:
<[^>]+xxx((?!zzz).)*zzz
From there I examine what it's doing.
1. You are looking for an HTML opening delimiter <. You consume it.
2. You consume at least one character that is NOT a closing HTML delimiter, but can consume many. This is important, because if your tag is <table border=2>, then you have, at minimum, so far consumed <t, if not more.
3. You are now looking for a StartText. If that StartText is table, you'll never find it, because you have consumed the t. So replace that + with a *.
4. The regex still succeeds as long as what follows is NOT the closing text, but it starts from the VERY END of the document, because the asterisk is being greedy. I suggest making it lazy by adding a ?.
5. When the backtracking fails, it will look for the closing text and gather it successfully.
The result of that logic:
<[^>]*xxx((?!zzz).)*?zzz
If you're going to use a dot anyway, which is okay for new Regex writers, but not suggested for seasoned, I'd go with this:
<[^>]*xxx.*?zzz
So for Javascript, your code would say:
matcher = new RegExp("<[^>]*" + startText + ".*?" + endText, 'gi');
I put the IgnoreCase "i" in there for good measure, but you may or may not want that.
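Here is a small runnable sketch of that matcher. Two things are my own additions and not part of the answer above: the start and end texts are escaped before being spliced into the pattern, and [\s\S]*? is used in place of .*? so the match can span line breaks. The sample markup and IDs are invented.

```javascript
// Escapes regex metacharacters so the texts are matched literally.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: builds the lazy "start tag ... end text" matcher.
function buildRangeMatcher(startText, endText) {
    return new RegExp('<[^>]*' + escapeRegExp(startText) + '[\\s\\S]*?' + escapeRegExp(endText), 'gi');
}

var data = '<tr><td id="A_2020_01_01">1</td><td id="A_2020_01_02">2</td><td id="A_2020_01_03">3</td></tr>';
var matcher = buildRangeMatcher('A_2020_01_01', 'A_2020_01_02');
console.log(data.match(matcher));
```

The match ends right after the end text, so if you need the rest of the closing cell as well, extend the pattern past endText rather than widening the lazy part.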

Building a Hashtag in Javascript without matching Anchor Names, BBCode or Escaped Characters

I would like to convert any instances of a hashtag in a String into a linked URL:
#hashtag -> should have "#hashtag" linked.
This is a #hashtag -> should have "#hashtag" linked.
This is a [url=http://www.mysite.com/#name]named anchor[/url] -> should not be linked.
This isn&#39;t a pretty way to use quotes -> should not be linked.
Here is my current code:
String.prototype.parseHashtag = function() {
return this.replace(/[^&][#]+[A-Za-z0-9-_]+(?!])/, function(t) {
var tag = t.replace("#","")
return t.link("http://www.mysite.com/tag/"+tag);
});
};
Currently, this appears to fix escaped characters (by excluding matches preceded by an ampersand) and handles named anchors, but it doesn't link the #hashtag if it's the first thing in the message, and it seems to include the 1-2 characters prior to the "#" in the link.
Halp!
How about the following:
/(^|[^&])#([A-Za-z0-9_-]+)(?![A-Za-z0-9_\]-])/g
matches the hashtags in your example. Since JavaScript didn't support lookbehind when this was written (it was added in ES2018), it tries to match either the start of the string or any character except & before the hashtag. It captures that character so it can be put back in the replacement. It also captures the name of the hashtag.
So, for example:
subject.replace(/(^|[^&])#([A-Za-z0-9_-]+)(?![A-Za-z0-9_\]-])/g, "$1http://www.mysite.com/tag/$2");
will transform
#hashtag
This is a #hashtag and this one #too.
This is a [url=http://www.mysite.com/#name]named anchor[/url]
This isn&#39;t a pretty way to use quotes
into
http://www.mysite.com/tag/hashtag
This is a http://www.mysite.com/tag/hashtag and this one http://www.mysite.com/tag/too.
This is a [url=http://www.mysite.com/#name]named anchor[/url]
This isn&#39;t a pretty way to use quotes
This probably isn't what t.link() (which I don't know) would have returned, but I hope it's a good starting point.
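If the goal is an actual anchor rather than a bare URL, the same pattern can be used with a replacement function. This is only a sketch: the linkHashtags name and the exact anchor markup are assumptions, and the tag URL is the one from the question.

```javascript
var hashtagPattern = /(^|[^&])#([A-Za-z0-9_-]+)(?![A-Za-z0-9_\]-])/g;

// Hypothetical helper: puts the captured leading character back and wraps the tag.
function linkHashtags(text) {
    return text.replace(hashtagPattern, function (match, before, tag) {
        return before + '<a href="http://www.mysite.com/tag/' + tag + '">#' + tag + '</a>';
    });
}

console.log(linkHashtags('This is a #hashtag'));
console.log(linkHashtags('This isn&#39;t a pretty way to use quotes')); // left unchanged
```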
There is an open-source Ruby gem to do this sort of thing (hashtags and @usernames) called twitter-text. You might get some ideas and regexes from that, or try out this JavaScript port.
Using the JavaScript port, you'll want to just do:
var linked = TwitterText.auto_link_hashtags(text, {hashtag_url_base: "http://www.mysite.com/tag/"});
Tim, your solution was almost perfect. Here's what I ended up using:
subject.replace(/(^| )#([A-Za-z0-9_-]+)(?![A-Za-z0-9_\]-])/g, "$1#$2");
The only change is the first conditional, changed it to match the beginning of the string or a space character. (I tried \s, but that didn't work at all.)
