JavaScript + RegEx Complications- Searching Strings Not Containing SubString - javascript

I am trying to use a RegEx to search through a long string, and I am having trouble coming up with an expression. I am trying to search through some HTML for a set of tags beginning with a tag containing a certain value and ending with a different tag containing another value. The code I am currently using to attempt this is as follows:
matcher = new RegExp(".*(<[^>]+" + startText + "((?!" + endText + ").)*" + endText + ")", 'g');
data.replace(matcher, "$1");
The strangeness around the middle ( ((\\?\\!endText).)* ) is borrowed from another thread, found here, that seems to describe my problem. The issue I am facing is that the expression matches the beginning tag, but it does not find the ending tag and instead includes the remainder of the data. Also, the lookaround in the middle slowed the expression down a lot. Any suggestions as to how I can get this working?
EDIT: I understand that parsing HTML in RegEx isn't the best option (makes me feel dirty), but I'm in a time-crunch and any other alternative I can think of will take too long. It's hard to say what exactly the markup I will be parsing will look like, as I am creating it on the fly. The best I can do is to say that I am looking at a large table of data that is collected for a range of items on a range of dates. Both of these ranges can vary, and I am trying to select a certain range of dates from a single row. The approximate value of startText and endText are \\#\\#ASSET_ID\\#\\#_<YYYY_MM_DD>. The idea is to find the code that corresponds to this range of cells. (This edit could quite possibly have made this even more confusing, but I'm not sure how much more information I could really give without explaining the entire application).
EDIT: Well, this was a stupid question. Apparently, I just forgot to add .* after the last paren. Can't believe I spent so long on this! Thanks to those of you that tried to help!

First of all, why is there a .* Dot Asterisk in the beginning? If you have text like the following:
This is my Text
And you want "my Text" pulled out, you do my\sText. You don't have to do the .*.
That being said, since all you'll be matching now is what you need, you don't need the main Capture Group around "Everything". This: .*(xxx) is a huge no-no, and can almost always be replaced with this: xxx. In other words, your regex could be replaced with:
<[^>]+xxx((?!zzz).)*zzz
From there I examine what it's doing.
You are looking for an HTML opening Delimeter <. You consume it.
You consume at least one character that is NOT a Closing HTML Delimeter, but can consume many. This is important, because if your tag is <table border=2>, then you have, at minimum, so far consumed <t, if not more.
You are now looking for a StartText. If that StartText is table, you'll never find it, because you have consumed the t. So replace that + with a *.
The regex is still success if the following is NOT the closing text, but starts from the VERY END of the document, because the Asterisk is being Greedy. I suggest making it lazy by adding a ?.
When the backtracking fails, it will look for the closing text and gather it successfully.
The result of that logic:
<[^>]*xxx((?!zzz).)*?zzz
If you're going to use a dot anyway, which is okay for new Regex writers, but not suggested for seasoned, I'd go with this:
<[^>]*xxx.*?zzz
So for Javascript, your code would say:
matcher = new RegExp("<[^>]*" + startText + ".*?" + endText, 'gi');
I put the IgnoreCase "i" in there for good measure, but you may or may not want that.

Related

Match a word unless it is preceded by an equals sign?

I have the following string
class=use><em>use</em>
that when searched using us I want to transform into
class=use><em><b>us</b>e</em>
I've tried looking at relating answers but I can't quite get it working the way I want it to. I'm especially interested in this answer's callback approach.
Help appreciated
This is a good exercise for writing regular expressions, and here's a possible solution.
"useclass=use><em>use</em>".replace(/([^=]|^)(us)/g, "$1<b>$2</b>");
// returns "<b>us</b>eclass=use><em><b>us</b>e</em>"
([^=]|^) ensures that the prefix of any matched us is either not an equal sign, or it's the start of the string.
As #jamiec pointed out in the comments, if you are using this to parse/modify HTML, just stop right now. It's mathematically impossible to parse a CFG with a regular grammar (even with enhanced JS regexps you will have a bad time trying to achieve that.)
If you can make any assumptions about the structure of your document, you may be better off using an approach that operates on DOM elements directly rather than parsing the whole document with a regex.
Parsing HTML with a regex has certain problems that can be painful to deal with.
var element = document.querySelector('em');
element.innerHTML = element.innerHTML.replace('us', '<b>us</b>');
<div class=use><em>use</em>
</div>
I would first look for any character other than the equals sign [^=] and separate it by parentheses so that I can use it again in my replacement. Then another set of parentheses around the two characters us ought to do it:
var re = /([^=]|^)(us)/
That will give you two capture groups to work with (inside the parentheses), which you can represent with $1 and $2 in your replacement string.
str.replace( /([^=|^])(us)/, '$1<b>$2</b>' );

regex replace on JSON is removing an Object from Array

I'm trying to improve my understanding of Regex, but this one has me quite mystified.
I started with some text defined as:
var txt = "{\"columns\":[{\"text\":\"A\",\"value\":80},{\"text\":\"B\",\"renderer\":\"gbpFormat\",\"value\":80},{\"text\":\"C\",\"value\":80}]}";
and do a replace as follows:
txt.replace(/\"renderer\"\:(.*)(?:,)/g,"\"renderer\"\:gbpFormat\,");
which results in:
"{"columns":[{"text":"A","value":80},{"text":"B","renderer":gbpFormat,"value":80}]}"
What I expected was for the renderer attribute value to have it's quotes removed; which has happened, but also the C column is completely missing! I'd really love for someone to explain how my Regex has removed column C?
As an extra bonus, if you could explain how to remove the quotes around any value for renderer (i.e. so I don't have to hard-code the value gbpFormat in the regex) that'd be fantastic.
You are using a greedy operator while you need a lazy one. Change this:
"renderer":(.*)(?:,)
^---- add here the '?' to make it lazy
To
"renderer":(.*?)(?:,)
Working demo
Your code should be:
txt.replace(/\"renderer\"\:(.*?)(?:,)/g,"\"renderer\"\:gbpFormat\,");
If you are learning regex, take a look at this documentation to know more about greedyness. A nice extract to understand this is:
Watch Out for The Greediness!
Suppose you want to use a regex to match an HTML tag. You know that
the input will be a valid HTML file, so the regular expression does
not need to exclude any invalid use of sharp brackets. If it sits
between sharp brackets, it is an HTML tag.
Most people new to regular expressions will attempt to use <.+>. They
will be surprised when they test it on a string like This is a
first test. You might expect the regex to match and when
continuing after that match, .
But it does not. The regex will match first. Obviously not
what we wanted. The reason is that the plus is greedy. That is, the
plus causes the regex engine to repeat the preceding token as often as
possible. Only if that causes the entire regex to fail, will the regex
engine backtrack. That is, it will go back to the plus, make it give
up the last iteration, and proceed with the remainder of the regex.
Like the plus, the star and the repetition using curly braces are
greedy.
Try like this:
txt = txt.replace(/"renderer":"(.*?)"/g,'"renderer":$1');
The issue in the expression you were using was this part:
(.*)(?:,)
By default, the * quantifier is greedy by default, which means that it gobbles up as much as it can, so it will run up to the last comma in your string. The easiest solution would be to turn that in to a non-greedy quantifier, by adding a question mark after the asterisk and change that part of your expression to look like this
(.*?)(?:,)
For the solution I proposed at the top of this answer, I also removed the part matching the comma, because I think it's easier just to match everything between quotes. As for your bonus question, to replace the matched value instead of having to hardcode gbpFormat, I used a backreference ($1), which will insert the first matched group into the replacement string.
Don't manipulate JSON with regexp. It's too likely that you will break it, as you have found, and more importantly there's no need to.
In addition, once you have changed
'{"columns": [..."renderer": "gbpFormat", ...]}'
into
'{"columns": [..."renderer": gbpFormat, ...]}' // remove quotes from gbpFormat
then this is no longer valid JSON. (JSON requires that property values be numbers, quoted strings, objects, or arrays.) So you will not be able to parse it, or send it anywhere and have it interpreted correctly.
Therefore you should parse it to start with, then manipulate the resulting actual JS object:
var object = JSON.parse(txt);
object.columns.forEach(function(column) {
column.renderer = ghpFormat;
});
If you want to replace any quoted value of the renderer property with the value itself, then you could try
column.renderer = window[column.renderer];
Assuming that the value is available in the global namespace.
This question falls into the category of "I need a regexp, or I wrote one and it's not working, and I'm not really sure why it has to be a regexp, but I heard they can do all kinds of things, so that's just what I imagined I must need." People use regexps to try to do far too many complex matching, splitting, scanning, replacement, and validation tasks, including on complex languages such as HTML, or in this case JSON. There is almost always a better way.
The only time I can imagine wanting to manipulate JSON with regexps is if the JSON is broken somehow, perhaps due to a bug in server code, and it needs to be fixed up in order to be parseable.

Javascript RegEx to match URL's but exclude images

I need to replace all text links in a string of HTML text by actual clickable links. Works fine with the following RegEx:
/\b(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|!:,.;]*[-A-Z0-9+&##\/%=~_|])/gi
I then noticed it also replaces images and already formatted links. Figures I need to exclude links preceded by src" and > ... I searched a bit and read a lot on negative lookahead in many questions answered here. I tried this (added something right after the first /):
/(^(?!src="|>)\b(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|!:,.;]*[-A-Z0-9+&##\/%=~_|])/gi
But this doesn't match any link anymore. I tried several similar statements, without the ^, changing some brackets, etc etc, but simply nothing seems to work. I tried putting .{0} in between the part I added and \b, to make sure he would only look at stuff right in front of the url and not consider anything farther away.
EDIT: The discussion was getting long, so I decided to update the answer instead.
Trusting that your original regex works, I'm just going to refer to a simplified version through the rest of this answer:
/\b(https?|ftp|file)/gi
Now, you attempted this:
/^(?!src="|>)\b(https?|ftp|file)/gi
^
The main error here is marked by a caret: the caret. That forces your regex to match from the beginning of the line, which is why it matched nothing. Let's remove that and move on:
/(?!src="|>)\b(https?|ftp|file)/gi
The main error, this time, is in your conception of lookahead assertions. As I explained in the comments, this assertion is redundant, because you are saying, "Match http or https or ftp or file, as long as none of these are src=" or >." It's almost so redundant that the sentence doesn't even make sense to us! What you want, instead, is a lookbehind assertion:
/(?<!src="|>)\b(https?|ftp|file)/gi
^
Why? Because you wish to find src=" or > behind the string you potentially wish to match. The problem? JavaScript doesn't support lookbehind assertions. So, I suggested an alternative. Admittedly, it was flawed (although not the cause of the HTML breaking, as you brought up). Here it is, fixed:
/(.[^>"]|[^=]")\b(https?|ftp|file)/gi
^^^^^^^^^^^^
This is indeed a non-intuitive regex, and warrants explanation. It splits our cases into two. Say we have a two-character set. If the set doesn't end in > or ", then we're not suspicious of it; we're good to go; match any URL that might follow. However, if it does end in > or ", well, the only "forgivable" case is where the first character is not an =. So you see, a bit of logic trickery here.
Now, as for why this might break your HTML. Be sure to use JavaScript's replace, and substitute the first captured group back into the page! If you simply substitute each match with nothingness, you end up "eating up" the two-character sets, which we only meant to investigate, not destroy.
html.replace(/(.[^>"]|[^=]")\b(https?|ftp|file)/gi,
function(match, $1, offset, original) {
return $1;
});
I have to go home and haven't tested yet, but I'd feel more comfortable dealing with the easier task of isolating HTML you don't want out first.
Do a match to get an array of the stuff you don't want to deal with.
Rip it all out with a split.
Iterate the split array and replace URLs and then splice matched items back in
Join and return
The only assumption is that you don't end on an anchor or img tag in your text
function zipperParse(htmlText,matcher){
var zipBackInArray = htmlText.match(matcher),
workingArray = htmlText.split(matcher),
i = workingArray.length;
while(i--){
buildAnchorTagIfURLPresent(workingArray[i]); //You got this one covered
workingArray.splice(i,0,zipBackInArray.pop());
//working backwards makes splice much easier to use here
}
return workingArray.join('');
}
var toExclude = /<a[^>]*>[^>]*>|<img[^>]*>/g;
// is supposed to match all img and anchor pairs but not handling tags inside anchors yet
zipperParse(yourHtmlText,toExclude);
this code works for me... just change the Google Api KEY to exclude..=> XXXXXXXXXXXXXXXXXXXXXX i just put it in my functions.php theme of my wordpress. The first thing is to see, how your google maps code appears on your site, and then it is to match it to what is replaced.
function remove_script_version( $src ) {
$parts1 = explode( '?', $src );
$parts2 = str_replace('//maps.googleapis.com/maps/api/js', '//maps.googleapis.com/maps/api/js?language=es&v=3.31&libraries=places&key=XXXXXXXXXXXXXXXXXXXXXX&ver=3.31', $parts1);
return $parts2[0]; }
add_filter( 'script_loader_src', 'remove_script_version', 15, 1 );
add_filter( 'style_loader_src', 'remove_script_version', 15, 1 );

Javascript regex whitespace is being wacky

I'm trying to write a regex that searches a page for any script tags and extracts the script content, and in order to accommodate any HTML-writing style, I want my regex to include script tags with any arbitrary number of whitespace characters (e.g. <script type = blahblah> and <script type=blahblah> should both be found). My first attempt ended up with funky results, so I broke down the problem into something simpler, and decided to just test and play around with a regex like /\s*h\s*/g.
When testing it out on string, for some reason completely arbitrary amounts of whitespace around the 'h' would be a match, and other arbitrary amounts wouldn't, e.g. something like " h " would match but " h " wouldn't. Does anyone have an idea of why this occurring or the the error I'm making?
Since you're using JavaScript, why can't you just use getElementsByTagName('script')? That's how you should be doing it.
If you somehow have an HTML string, create an iframe and dump the HTML into it, then run getElementsByTagName('script') on it.
OK, to extend Kolink's answer, you don't need an iframe, or event handlers:
var temp = document.createElement('div');
temp.innerHTML = otherHtml;
var scripts = temp.getElementsByTagName('script');
... now scripts is a DOM collection of the script elements - and the script doesn't get executed ...
Why regex is not a fantastic idea for this:
As a <script> element may not contain the string </script> anywhere, writing a regex to match them would not be difficult: /<script[.\n]+?<\/script>/gi
It looks like you want to only match scripts with a specific type attribute. You could try to include that in your pattern too: /<script[^>]+type\s*=\s*(["']?)blahblah\1[.\n]*?<\/script>/gi - but that is horrible. (That's what happens when you use regular expressions on irregular strings, you need to simplify)
So instead you iterate through all the basic matched scripts, extract the starting tag: result.match(/<script[^>]*>/i)[0] and within that, search for your type attribute /type\s*=\s*((["'])blahblah\2|\bblahblah\b)/.test(startTag). Oh look - it's back to horrible - simplify!
This time via normalisation:
startTag = startTag.replace(/\s*=\s*/g, '=').replace(/=([^\s"'>]+)/g, '="$1"') - now you're in danger territory, what if the = is inside a quoted string? Can you see how it just gets more and more complicated?
You can only have this work using regex if you make robust assumptions about the HTML you'll use it on (i.e. to make it regular). Otherwise your problems will grow and grow and grow!
disclaimer: I haven't tested any of the regex used to see if they do what I say they do, they're just example attempts.

JavaScript Regexp to Wrap URLs and Emails in anchors

I searched high and low but cannot aeem to find a definitve answer to this. As is often the case with regexps. So I thought I'd ask here.
I'm trying to put together a regular expression i can use in JavaScript to replace all instances of URLs and email addresses (does'nt need to be ever so strict) with anchor tags pointing to them.
Obviously this is something usually done very simply on the server-side, but in this case it is necessary to work with plain text so an elegant JavaScript solution to perfom the replaces at runtime would be perfect.
Onl problem is, as I've stated before, I have a huge regular expression shaped gaping hole in my skill set :(
I know that one of you has the answer at the tip of your fingers though :)
Well, blindly using regexps from http://www.osix.net/modules/article/?id=586
var emailRegex =
new RegExp(
'([a-zA-Z0-9_\-\.]+)#((\[[0-9]{1,3}' +
'\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.' +
')+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)',
"gi");
var urlRegex =
new RegExp(
'((https?://)' +
'?(([0-9a-z_!~*\'().&=+$%-]+: )?[0-9a-z_!~*\'().&=+$%-]+#)?' + //user#
'(([0-9]{1,3}\.){3}[0-9]{1,3}' + // IP- 199.194.52.184
'|' + // allows either IP or domain
'([0-9a-z_!~*\'()-]+\.)*' + // tertiary domain(s)- www.
'([0-9a-z][0-9a-z-]{0,61})?[0-9a-z]\.' + // second level domain
'[a-z]{2,6})' + // first level domain- .com or .museum
'(:[0-9]{1,4})?' + // port number- :80
'((/?)|' + // a slash isn't required if there is no file name
'(/[0-9a-z_!~*\'().;?:#&=+$,%#-]+)+/?))',
"gi");
then
text.replace(emailRegex, "<a href='mailto::$1'>$1</a>");
and
text.replace(urlRegex, "<a href='$1'>$1</a>");
might to work
Not a canned solution, but this will point you in the right direction.
I use Regex Coach to build and test my regexes. You can find plentiful examples of regexes for urls and email addresses online.
Here's a good article for urls...
https://blog.codinghorror.com/the-problem-with-urls/
emails are more straight forward since they have to end in a .tld
You don't need to get fancy with that one since you're not validating, just matching, so off the top of my head...
[^\s]+#\w[\w-.]*.[a-zA-Z]+
As always, this ("this" being "processing HTML with regex") is going to be difficult and error-prone. The following will work on reasonably well-formed input only, but here's what I would do:
find the element you want to process, take it's innerHTML property value
iteratively find everything that already is a link (/(<a\b.+?</a>/ig)
based on that, cut your string into "this isn't a link"- and "this is a link"-bits, appending all of them them to a neatly orderd array
process the "non-link" bits only (those that don't begin with "<a "), looking for URL- or e-mail-address patterns
wrap every address you find in <a> tags
join() the array back to a string
set the innerHTML property to your new value
I am sure you will find regular expression examples that match e-mail addresses and URLs. Take the ones that suit you most, and use them in step 4.).
Just adding a bit of information on email regexps: Most of them seems to ignore that domain names can have the characters 'åäö' in them. So if your care about that, make sure that the solution you are using has åäöÅÄÖ in the domain part of the regexp.

Categories

Resources