Skipping over tags and spaces in regex html - javascript

I'm using this regex to find a string that starts with !?, ends with ?!, and has another variable in between (in this example "a891d050"). This is what I use:
var pattern = new RegExp(/!\\?.*\s*(a891d050){1}.*\s*\\?!/);
It matches correctly against this one:
!?v8qbQ5LZDnFLsny7VmVe09HJFL1/WfGD2A:::a891d050?!
But fails when the string is broken up with html tags.
<span class="userContent"><span>!?v8qbQ5LZDnFLsny7VmVe09HJFL1/</span><wbr /><span class="word_break"></span>WfGD2A:::a891d050?!</span></div></div></div></div>
I tried adding \s and {space}*, but it still fails.
The question is: what (special?) characters do I need to account for if I want to ignore whitespace and html tags in my match?
edit: this is how I use the regex:
var pattern = /!\?[\s\S]*a891d050[\s\S]*\?!/;
document.body.innerHTML = document.body.innerHTML.replace(pattern,"new content");
It appears to me that when it encounters the 'plain' string it replaces it correctly. But when faced with a string with tags and classes around it and inside it, it makes a mess of the classes or doesn't replace at all, depending on the context. So I decided to try jquery-replacetext-plugin (as it promises to leave tags as they were) like this:
$("body *").replaceText( pattern, "new content" );
But with no success, the results are the same as before.

Maybe this:
var pattern = /!\?[\s\S]*a891d050[\s\S]*\?!/;
[\s\S] should match any character. I have also removed {1}.
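As a small illustration of why [\s\S] is suggested over the dot: in JavaScript, . does not match line breaks unless the s flag is set, while [\s\S] matches any character. The sample string below is made up for the demonstration.

```javascript
// Hypothetical sample: the token is split across several lines.
const sample = '!?abc\na891d050\nxyz?!';

// "." stops at line breaks, so this pattern cannot span the newlines.
const dotPattern = /!\?.*a891d050.*\?!/;

// [\s\S] matches any character, including line breaks.
const anyCharPattern = /!\?[\s\S]*a891d050[\s\S]*\?!/;

console.log(dotPattern.test(sample));     // false
console.log(anyCharPattern.test(sample)); // true
```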

The problem was apparently solved by using this regex:
var pattern = /(!\?)(?:<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])*?>)?(.)*?(a891d050)(?:<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])*?>)?(.)*?(\?!)/;
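A different approach, not the one used above, is to strip the tags first and run the simple pattern on the remaining text. This is only a sketch: stripTags is a hypothetical helper, its tag regex is deliberately crude (it would break on a > inside an attribute value), and it only helps for detecting the token, not for replacing it inside innerHTML.

```javascript
// Hypothetical helper: removes anything that looks like <...>.
// Crude on purpose; a ">" inside an attribute value would confuse it.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, '');
}

const html = '<span class="userContent"><span>!?v8qbQ5LZDnFLsny7VmVe09HJFL1/</span>' +
  '<wbr /><span class="word_break"></span>WfGD2A:::a891d050?!</span>';

// Lazy quantifiers keep the match from running on to a later "?!".
const pattern = /!\?[\s\S]*?a891d050[\s\S]*?\?!/;

const text = stripTags(html);
const found = text.match(pattern);
console.log(found ? found[0] : 'no match');
```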

Related

Regular expression to match a string which is NOT matched by a given regexp

I've been hovering around some answers here, and I can't find a solution to my problem:
I have this regexp which matches everything inside an HTML span tag, including contents:
<span\b[^>]*>(.*?)</span>
and I want to find a way to make a search in all the text, except for what is matched with that regexp.
For example, if my text is:
var text = "...for there is a class of <span class="highlight">guinea</span> pigs which..."
... then the regexp would match:
<span class="highlight">guinea</span>
and I want to be able to make a regexp such that if I search for "class", regexp will match "...for there is a class of..."
and will not match inside the tag, like in
"... class="highlight"..."
The word to be matched ("class") might be anywhere within the text. I've tried
(?!<span\b[^>]*>(.*?)</span>)class
but it keeps searching inside tags as well.
I want to find a solution using only regexp, not dealing with DOM nor JQuery. Thanks in advance :).
Although I wouldn't recommend this, I would do something like below
(class)(?:(?=.*<span\b[^>]*>))|(?:(?<=<\/span>).*)(class)
You can see this in action here
Rubular Link for this regex
You can capture your matches from the groups and work with them as needed. If you can, use a HTML parser and then find matches from the text element.
It's not pretty, but if I get you right, this should do what you want. It's done with a single RegEx, but JS can't (to my knowledge) extract the result without joining the results in a loop.
The RegEx: /(?:<span\b[^>]*>.*?<\/span>)|(.)/g
Example js code:
var str = '...for there is a class of <span class="highlight">guinea</span> pigs which...',
pattern = /(?:<span\b[^>]*>.*?<\/span>)|(.)/g,
match,
res = '';
match = pattern.exec(str)
while( match != null )
{
if (match[1] !== undefined) res += match[1]; // a skipped span leaves group 1 undefined
match = pattern.exec(str)
}
document.writeln('Result:' + res);
In English: Do a non capturing test against your tag-expression or capture any character. Do this globally to get the entire string. The result is a capture group for each character in your string, except the tag. As pointed out, this is ugly - can result in a serious number of capture groups - but gets the job done.
If you need to send it in and retrieve the result in one call, I'd have to agree with previous contributors - It can't be done!
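A variation on the same alternation idea, sketched here under a few assumptions: instead of capturing every character, the unwanted span is matched first and only the search word is captured, so a hit counts only when group 1 is set. findOutsideSpans is a hypothetical name, and word is assumed to be a plain literal with no regex metacharacters.

```javascript
// Hypothetical helper: returns the indexes where `word` occurs outside
// <span>...</span>. Assumes `word` has no regex metacharacters.
function findOutsideSpans(text, word) {
  // Left alternative swallows whole spans; right one captures the word.
  const re = new RegExp('<span\\b[^>]*>.*?<\\/span>|(' + word + ')', 'g');
  const hits = [];
  let m;
  while ((m = re.exec(text)) !== null) {
    // Group 1 is only defined when the right alternative matched.
    if (m[1] !== undefined) hits.push(m.index);
  }
  return hits;
}

const text = '...for there is a class of <span class="highlight">guinea</span> pigs which...';
const hits = findOutsideSpans(text, 'class');
console.log(hits.length); // 1: the "class" inside the tag is skipped
```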

Javascript Regex only replacing first match occurence

I am using regular expressions to do some basic converting of wiki markup code into copy-pastable plain text, and I'm using javascript to do the work.
However, javascript's regex engine behaves much differently to the ones I've used previously as well as the regex in Notepad++ that I use on a daily basis.
For example- given a test string:
==Section Header==
===Subsection 1===
# Content begins here.
## Content continues here.
I want to end up with:
Section Header
Subsection 1
# Content begins here.
## Content continues here.
Simply remove all equals signs.
I began with the regex setup of:
var reg_titles = /(^)(=+)(.+)(=+)/
This regex searches for lines that begin with one or more equals signs and end with another set of one or more equals signs. Rubular shows that it matches my lines accurately and does not catch equals signs in the middle of content. http://www.rubular.com/r/46PrkPx8OB
The code to replace the string based on regex
var lines = $('.tb_in').val().split('\n'); //use jquery to grab text in a textarea, and split into an array of lines based on the \n
for(var i = 0;i < lines.length;i++){
line_temp = lines[i].replace(reg_titles, "");
lines[i] = line_temp; //replace line with temp
}
$('.tb_out').val(lines.join("\n")); //rejoin and print result
My result is unfortunately:
Section Header==
Subsection 1===
# Content begins here.
## Content continues here.
I cannot figure out why the regex replace function, when it finds multiple matches, seems to only replace the first instance it finds, not all instances.
Even when my regex is updated to:
var reg_titles = /(={2,})/
"Find any two or more equals", the output is still identical. It makes a single replacement and ignores all other matches.
No other regex engine I have used behaves this way. Running the same replace multiple times has no effect.
Any advice on how to get my string replace function to replace ALL instances of the matched regex instead of just the first one?
^=+|=+$
You can use this. Do not forget to add the g and m flags. Replace with an empty string. See the demo.
http://regex101.com/r/nA6hN9/28
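To make the effect of the flags concrete, here is a short sketch using the test string from the question. Without g, replace stops after the first match; with g and m, every leading and trailing run of equals signs on every line is removed. The variable names are made up for the example.

```javascript
// Test input taken from the question, joined into one multi-line string.
const wiki = [
  '==Section Header==',
  '===Subsection 1===',
  '# Content begins here.',
  '## Content continues here.'
].join('\n');

// No g flag: only the first match (the leading "==") is replaced.
const firstOnly = wiki.replace(/^=+|=+$/m, '');

// g replaces every match; m lets ^ and $ anchor at each line.
const allRemoved = wiki.replace(/^=+|=+$/gm, '');

console.log(firstOnly.split('\n')[0]);  // Section Header==
console.log(allRemoved.split('\n')[0]); // Section Header
```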
Add the g modifier to do a global search:
var reg_titles = /^(=+)(.+?)(=+)/g
Your regex is needlessly complex, and yet doesn't actually accomplish what you set out to do. :) You might try something like this instead:
var reg_titles = /^=+(.+?)=+$/;
var lines = $('.tb_in').val().split('\n');
lines.forEach(function(v, i, a) {
a[i] = v.replace(reg_titles, '$1');
})
$('.tb_out').val(lines.join("\n"));

Javascript Regular expression to remove unwanted <br>, &nbsp;

I have a JS string like this
<div id="grouplogo_nav"><br> <ul><br> <li><a class="group_hlfppt" target="_blank" href="http://www.hlfppt.org/">&nbsp;</a></li><br> </ul><br> </div>
I need to remove all <br> and &nbsp; that are only between > and <. I tried to write a regular expression, but didn't get it right. Does anybody have a solution?
EDIT :
Please note i want to remove only the tags b/w > and <
Avoid using regex on html!
Try creating a temporary div from the string, and using the DOM to remove any br tags from it. This is much more robust than parsing html with regex, which can be harmful to your health:
var tempDiv = document.createElement('div');
tempDiv.innerHTML = mystringwithBRin;
var nodes = tempDiv.childNodes;
for(var nodeId=nodes.length-1; nodeId >= 0; --nodeId) {
if(nodes[nodeId].tagName === 'BR') { // tagName is upper-case for HTML elements
tempDiv.removeChild(nodes[nodeId]);
}
}
var newStr = tempDiv.innerHTML;
Note that we iterate in reverse over the child nodes so that the node IDs remain valid after removing a given child node.
http://jsfiddle.net/fxfrt/
myString = myString.replace(/^(&nbsp;|<br>)+/, '');
... where /.../ denotes a regular expression, ^ denotes start of string, (&nbsp;|<br>) denotes "&nbsp; or <br>", and + denotes "one or more occurrences of the previous expression". And then simply replace that full match with an empty string.
s.replace(/(>)(?:&nbsp;|<br>)+(\s?<)/g,'$1$2');
Don't use this in production. See the answer from Phil H.
Edit: I try to explain it a bit and hope my english is good enough.
Basically we have two different kinds of parentheses here. The first pair and third pair () are normal parentheses. They are used to remember the characters that are matched by the enclosed pattern and group the characters together. For the second pair, we don't need to remember the characters for later use, so we disable the "remember" functionality by using the form (?:) and only group the characters to make the + work as expected. The + quantifier means "one or more occurrences", so &nbsp; or <br> must be there one or more times. The last part (\s?<) matches a whitespace character (\s), which can be missing or occur one time (?), followed by the character <. $1 and $2 are a kind of variable that is replaced by the remembered characters of the first and third parentheses.
MDN provides a nice table, which explains all the special characters.
You need to replace globally. Also don't forget that the <br> can be written self-closed, as <br />. Try this:
myString = myString.replace(/(&nbsp;|<br>|<br \/>)/g, '');
This worked for me; please note the m flag for multiple lines:
myString = myString.replace(/(&nbsp;|<br>|<br \/>)/gm, '');
myString = myString.replace(/^(&nbsp;|<br>)+/, '');
hope this helps
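For the narrower requirement in the question (remove the filler only where it sits between a > and a <), a string-only sketch could look like the following. It is not taken from any of the answers: the pattern and the variable names are assumptions, it also drops plain whitespace between tags, and the warning above about regex on HTML still applies.

```javascript
// Input string from the question.
const markup = '<div id="grouplogo_nav"><br> <ul><br> <li>' +
  '<a class="group_hlfppt" target="_blank" href="http://www.hlfppt.org/">&nbsp;</a>' +
  '</li><br> </ul><br> </div>';

// After a ">", consume runs of whitespace, &nbsp; and <br> / <br />,
// but only when the next character starts another tag.
const filler = />(?:\s|&nbsp;|<br\s*\/?>)+(?=<)/g;

const cleaned = markup.replace(filler, '>');
console.log(cleaned);
```

Because the lookahead does not consume the following <, consecutive gaps such as </li><br> </ul><br> </div> are all handled in one pass.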

RegExp: how to exclude matched groups from $N?

I've made a working regexp, but i think it's not the best use-case:
el = '<div style="color:red">123</div>';
el.replace(/(<div.*>)(\d+)(<\/div>)/g, '$1<b>$2</b>$3');
// expecting result: <div style="color:red"><b>123</b></div>
After googling i've found that (?: ... ) in regexps - means ignoring group match, thus:
el.replace(/(?:<div.*>)(\d+)(?:<\/div>)/g, '<b>$1</b>');
// returns <b>123</b>
but i need an expecting result from 1st example.
Is there a way to exclude 'em? just to write replace(/.../, '<b>$1</b>')?
This is just a little case for understanding how to exclude groups in regexp. And i know that we can't parse HTML with regexp :)
So you want to get the same result while only using the replacement <b>$1</b>?
In your case just replace(/\d+/, '<b>$&</b>') would suffice.
But if you want to make sure there are div tags around the number, you could use lookarounds and \K like in the following expression. Note that JS does not support \K, and lookbehind is only available in engines that implement ES2018 or later; without it you have to use a capturing group in JS.
<div[^>]*>\K\d+(?=</div>)
There's nothing wrong with a replacement value of '$1<b>$2</b>$3'. I would just change your regex to this:
el = '<div style="color:red">123</div>';
el.replace(/(<div[^>]*>)(\d+)(<\/div>)/g, '$1<b>$2</b>$3');
Changing how it matches the first div keeps the full match on the div tags, but makes sure it matches the minimum possible before the closing > of the first div tag rather than the maximum possible.
With your regex, you would not get what you wanted with this input string:
el = '<div style="color:red">123</div><div style="color:red">456</div>';
The problem with using something like:
el.replace(/\d+/, '<b>$&</b>')
is that it doesn't work properly with things like this:
el = '<div style="margin-left: 10px">123</div>'
because it picks up the numbers inside the div tag.
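In engines that implement ES2018 lookbehind assertions (an assumption about the runtime; older browsers will throw a syntax error), the surrounding tags can be required without capturing them, so the replacement only needs $&. A minimal sketch:

```javascript
// Lookbehind requires an opening <div ...> directly before the digits,
// lookahead requires </div> directly after; neither is part of the match.
const wrapDigits = /(?<=<div[^>]*>)\d+(?=<\/div>)/g;

const el = '<div style="margin-left: 10px">123</div>';
const result = el.replace(wrapDigits, '<b>$&</b>');

console.log(result); // <div style="margin-left: 10px"><b>123</b></div>
```

The 10 inside the style attribute is left alone because it is not directly preceded by a complete opening tag.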

REGEX to match starting and ending span tags without their inner text

I am using the following RegEx to do a replacement in a string:
<\/?(span)\b(?:\s+class="highlight")?>
But this regex has a flaw... Take this sample code for example:
<p>
Some text here
<span class="highlight">This is highlighted</span>
<span>This is not highlighted</span>
</p>
My regex will match both of the span tags although i only want the one with the class="highlight" set. How can I achieve this using RegEx?
PS: please do not tell me that I should not use RegEx for this because i will downgrade your answer as it is off-topic. This is a question for the RegEx guys.
EDIT: based on the accepted answer below i am using the following regex to do a replace
NOTE: code is in javascript (mootools)
var regex = new RegExp("(<span[^>]+class\\s*=\\s*(\"|')highlight\\2[^>]*>)(.*?)(</span>)",'g');
var replaced = element.get('html').replace(regex, "$3");
element.set('html', replaced);
The above regex will replace <span class="highlight">some text here</span> with "some text here" (without the double quotes).
This should give the most flexibility.
(<span[^>]+class\s*=\s*("|')highlight\2[^>]*>)[^<]*(</span>)
UPDATE:
The captured groups you need for the opening and closing tags are \1 and \3.
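A short sketch of how those groups can be used, with plain strings instead of MooTools. The sample markup is a one-line version of the snippet in the question, and the variable names are made up for the example.

```javascript
const html = '<p>Some text here <span class="highlight">This is highlighted</span> ' +
  '<span>This is not highlighted</span></p>';

// Group 1: opening tag, group 2: the quote character, group 3: closing tag.
const tagPattern = /(<span[^>]+class\s*=\s*("|')highlight\2[^>]*>)[^<]*(<\/span>)/;

const m = tagPattern.exec(html);
console.log(m[1]); // <span class="highlight">
console.log(m[3]); // </span>

// Drop both tags and keep the text between them, for every match.
const stripped = html.replace(
  new RegExp(tagPattern.source, 'g'),
  (whole, open, quote, close) => whole.slice(open.length, whole.length - close.length)
);
console.log(stripped);
```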
Just to show you that an alternative solution is not only possible but also better than using regex:
$$('span.highlight').each(function (node, idx, Elem) {
var txt = document.createTextNode(node.get('text'));
node.parentNode.replaceChild(txt, node)
});
See this fiddle: http://jsfiddle.net/Tomalak/umgZp/
(And this is just off the top of my hat, I've had zero exposure to MooTools so far. There might be more elegant ways than this.)
You are explicitly stating that the class="highlight" part is optional, by placing a ? after the group that matches it.
This should do it for you:
var regex = /(?:<span\s+[^>]*?\s*class\s*=\s*('|")(?:\S+\s+)*highlight(?:\s+\S+)*\1[^>]*>|<\/span>)/;
This will also include SPAN tags with class attributes like a b c highlight e f g.
Also, if you want to capture a SPAN tag with its matching ending, you can use this, and access groups 1 and 3 respectively for the opening and ending tags:
var regex = /(<span\s+[^>]*?\s*class\s*=\s*('|")(?:\S+\s+)*highlight(?:\s+\S+)*\2[^>]*>).*?(<\/span>)/;
