Using Regex to remove html elements and leave the content - javascript

Let's say I have the following HTML:
<b>Item 1</b> Text <br>
<b>Item 2</b> Text <br>
<b>Item 3</b> Text <br>
<p><font color="#000000" face="Arial, Helvetica, sans-serif"><b>Item 4:</b></font></p>
<p><font color="#000000" face="Arial, Helvetica, sans-serif">Detailed Description</font></p>
and am using the following regex to capture data: /(Item 1:.*?<br>)/gi, which returns <b>Item 1</b> Text <br>
How do I drop or remove the <b>, </b> and <br>
to be left with
Item 1 Text
I've been trying to make sense of this code <(\w+)[^>]*>.*<\/\1>, but so far no luck. All the examples I have seen on here seem to require an id or class, which my HTML does not have, so I'm a bit stuck getting those examples to fit my problem.

Try this regex: <[^>]*>
Replacing every match with an empty string removes all the HTML tags, opening and closing, with or without attributes, and leaves the text.
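A minimal sketch of how that pattern might be applied with String.prototype.replace. The helper name stripTags and the sample string are illustrative assumptions, not part of the answer above.

```javascript
// stripTags is a hypothetical helper name, not part of any library.
function stripTags(html) {
  // <[^>]*> matches "<", then any run of characters other than ">", then ">".
  // The g flag replaces every tag rather than only the first one.
  return html.replace(/<[^>]*>/g, "");
}

const sample = "<b>Item 1</b> Text <br>";
const stripped = stripTags(sample).trim();
console.log(stripped); // Item 1 Text
```

Note that a pattern like this treats the first ">" as the end of the tag, so it can misfire on attribute values that themselves contain ">".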

This should do the trick:
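// match() returns null when nothing matches, so fall back to an empty array
var matches = stringToTest.match(/(Item \d+.*?<br\/?>)/gi) || [];
for (var i = 0; i < matches.length; i++) {
    // strip every remaining tag from the captured item
    matches[i] = matches[i].replace(/<[^>]+>/g, '');
}
alert(matches);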
If you have jQuery:
alert(
$.map(stringToTest.match(/(Item \d+.*?<br\/?>)/gi), function(v) { return v.replace(/<[^>]+>/g, '') })
);

This regex will match b and br tags:
</?br?\s*/?>
To use it in Javascript you write something like this:
result = subject.replace(/<\/?br?\s*\/?>/img, "");
All the matched tags will be replaced with an empty string.
In my experience it is better to replace br tags with a space and to replace normal inline tags with an empty string, so that words on either side of a line break do not run together. If that is what you want to do, this next regex matches only b tags:
</?b\s*/?>
and this one matches only br tags:
</?br\s*/?>
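As a rough sketch of the two-step approach described above (br tags become a space, b tags are dropped), assuming a hypothetical helper named tagsToText and an extra whitespace-collapsing step that is not in the answer:

```javascript
// tagsToText is a hypothetical helper name used only for this illustration.
function tagsToText(html) {
  return html
    .replace(/<\/?br\s*\/?>/gi, " ") // <br>, <br/>, <br /> each become one space
    .replace(/<\/?b\s*\/?>/gi, "")   // <b> and </b> are removed outright
    .replace(/\s+/g, " ")            // assumption: collapse runs of whitespace
    .trim();
}

const cleaned = tagsToText("<b>Item 1</b> Text <br><b>Item 2</b> Text <br>");
console.log(cleaned); // Item 1 Text Item 2 Text
```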

In a regex, whatever is between () is a capture group that can be accessed later as a variable: \1, \2, \3 etc. inside the pattern, or $1, $2, $3 in a replacement string. So simply use them to capture the text you want.
I think this regex would work for you:
<b>(Item \d+)</b>(.*?)<br>
In detail, the expression means:
(Item \d+): any string formatted as "Item [at least 1 digit]"
(.*?): any sequence of characters; the ? makes the match lazy, so it stops at the first <br> that follows.
So now in <b>Item 5434</b>hel34lo 0345 345<br>, with regex above your captured groups are:
\1 = Item 5434
\2 = hel34lo 0345 345
I've never programmed in javascript, but more precisely, this piece of code might work:
var myString = "<b>Item 5434</b>hel34lo 0345 345<br>";
var myRegexp = /<b>(Item \d+)<\/b>(.*?)<br>/g; // the / in </b> must be escaped inside a regex literal
var match = myRegexp.exec(myString);
alert(match[1]); // Item 5434
alert(match[2]); // hel34lo 0345 345
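To collect every item rather than only the first, one option is String.prototype.matchAll with the same capture groups. This is a sketch; the variable names and the sample input are assumptions made for illustration.

```javascript
// Sample input modelled on the question; the text values are made up.
const itemHtml = "<b>Item 1</b> Text A <br>\n<b>Item 2</b> Text B <br>";

// Group 1 is the label, group 2 is the text up to the next <br>.
const itemPattern = /<b>(Item \d+)<\/b>(.*?)<br>/g;

const items = [];
for (const m of itemHtml.matchAll(itemPattern)) {
  items.push({ label: m[1], text: m[2].trim() });
}

console.log(items);
// [ { label: 'Item 1', text: 'Text A' }, { label: 'Item 2', text: 'Text B' } ]
```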

Related

Replace a specific character from a string with HTML tags

Given a text input, a specific character must be converted to a tag. For example, if the special character is *, the text between two special characters must appear in italics.
For example:
This is *my* wonderful *text*
must be converted to:
This is <i>my</i> wonderful <i>text</i>
So I've tried like:
const arr = "This is *my* wonderful *text*";
if (arr.includes('*')) {
arr[index] = arr.replace('*', '<i>');
}
It replaces the first star character with <i> but doesn't work if there are more special characters.
Any ideas?
You can use a regular expression to find any text that is surrounded by * and replace it with a tag, which in your example is the <i> tag. See the following:
Example
let str = "This is *my* wonderful *text*";
let regex = /(?<=\*)(.*?)(?=\*)/;
while (str.includes('*')) {
let matched = regex.exec(str);
let wrap = "<i>" + matched[1] + "</i>";
str = str.replace(`*${matched[1]}*`, wrap);
}
console.log(str);
here you go my friend:
var arr = "This is *my* wonderful *text*";
const matched = arr.match(/\*(?:.*?)\*/g);
for (let i = 0; i < matched.length; i++) {
arr = arr.replace(matched[i], `<i>${matched[i].replaceAll("*", "")}</i>`);
}
console.log(arr);
An explanation:
First, we match the regex globally by setting the g flag. Note that match with the global flag returns an array of all matches.
Second, we look for any characters that lie between two asterisks; the asterisks are escaped because * is a metacharacter.
.*? matches in a lazy (non-greedy) way, so we don't get everything from the first asterisk to the last one as a single match.
?: makes the group non-capturing. Then we replace every element we've matched with itself, minus the asterisks, wrapped in <i> tags.
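The same result can also be sketched as a single replace call with one capture group, which avoids the loop. The function name starsToItalics is an assumption made for this example.

```javascript
// starsToItalics is a hypothetical helper name.
function starsToItalics(text) {
  // \* matches a literal asterisk; (.*?) lazily captures the text up to the
  // next asterisk; $1 inserts that captured text into the replacement.
  return text.replace(/\*(.*?)\*/g, "<i>$1</i>");
}

const italic = starsToItalics("This is *my* wonderful *text*");
console.log(italic); // This is <i>my</i> wonderful <i>text</i>
```

An unpaired asterisk is left untouched, because the pattern only matches when a closing asterisk is found.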

regex replace first element

I need to change each <br> in an HTML string into two. What I can't get right is the case where one tag directly follows another:
(<br\s*\/?>)
will match all the tags in this text:
var text = 'text<BR><BR>text text<BR>text;'
and with the replace I get
text = text.replace(/(<br\s*\/?>)/gi, "<BR\/><BR\/>")
console.log(text); // text<BR/><BR/><BR/><BR/>text text<BR/><BR/>text;
Is there a way to add only one tag per run of tags with the regex, and achieve this:
console.log(text); // text<BR/><BR/><BR/>text text<BR/><BR/>text;
Or can I only achieve this with a loop?
You may use either
var text = 'text<BR><BR>text text<BR>text;'
text = text.replace(/(<br\s*\/?>)+/gi, "$&$1");
console.log(text); // => text<BR><BR><BR>text text<BR><BR>text;
Here, /(<br\s*\/?>)+/gi matches one or more consecutive <br> tags in a case-insensitive way, capturing each tag as it goes (the group buffer keeps only the value from the last iteration), and "$&$1" replaces the match with the whole match ($&) followed by the last <br> ($1).
Or
var text = 'text<BR><BR>text text<BR>text;'
text = text.replace(/(?:<br\s*\/?>)+/gi, function ($0) {
return $0.replace(/<br\s*\/?>/gi, "<BR/>") + "<BR/>";
})
console.log(text); // => text<BR/><BR/><BR/>text text<BR/><BR/>text;
Here, the (?:<br\s*\/?>)+ will also match 1 or more <br>s but without capturing each occurrence, and inside the callback, all <br>s will get normalized as <BR/> and a <BR/> will get appended to the result.
You can use a negative lookahead, (<br\s*\/?>)(?!<br\s*\/?>), to double only the last tag in any run of consecutive tags:
var text = 'text<BR><BR>text text<BR>text;'
text = text.replace(/(<br\s*\/?>)(?!<br\s*\/?>)/gi, "<BR\/><BR\/>")
console.log(text);

Modify tag position with regex

Suppose I have following string:
var text = "<p>Some text <ins>Text1</p><p>Text2 </ins><ins>Some other text </ins>and another text<ins>Text3</p><p>Text4 </ins></p>"
I need to clean up the above string into
var text = "<p>Some text Text1</p><p><ins>Text2 </ins><ins>Some other text </ins>and another text Text3</p><p><ins>Text4 </ins></p>"
Assume Text1, Text2, Text3, Text4 are random string
I tried the following, but it just makes a mess:
text.replace(/<ins>(.*?)<\/p><p>/g, '</p><p><ins>');
Thanks
ADDITIONAL EXPLANATION
Take a look at this:
<ins>Text1</p><p>Text2 </ins>
Above is wrong. It should be:
Text1</p><p><ins>Text2 </ins>
Please try the following regex:
function posChange() {
var text = "<p>Some text <ins>Text1</p><p>Text2 </ins><ins>Some other text </ins>and another text<ins>Text3</p><p>Text4 </ins></p>";
var textnew = text.replace(/(<ins>)([^<]+)(<\/p><p>)([^<]+)/g, '$2$3$1$4');
alert(textnew);
}
posChange()
REGEX EXPLANATION:
/(<ins>) 1st capturing group (i.e: <ins>)....$1
([^<]+) 2nd capturing group (i.e: Text1)....$2
(<\/p><p>) 3rd capturing group (i.e: </p><p>)..$3
([^<]+) 4th capturing group (i.e: Text2 )...$4
/g match all occurrences
Based on the requirements, for each match:
Original String: $1 $2 $3 $4
should be replaced with
New String: $2 $3 $1 $4
In this way, the position of each capturing group gets shifted with the help of regex.
You can remove all <ins>:
text = text.replace(/<ins>/g, '');
and then prepend <ins> to every run of text that ends with </ins> and contains no other tag:
var matches = text.match(/[^<>]+<\/ins>/g) || []; // match() returns null when nothing matches
for (var i = 0; i < matches.length; i++) {
    text = text.replace(matches[i], '<ins>' + matches[i]);
}
result:
<p>Some text Text1</p><p><ins>Text2 </ins><ins>Some other text </ins>and another textText3</p><p><ins>Text4 </ins></p>
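The two steps can also be sketched as two chained replace calls, using $& (the whole match) in the replacement instead of a loop. The variable names here are assumptions; the input is the string from the question.

```javascript
const brokenIns = "<p>Some text <ins>Text1</p><p>Text2 </ins><ins>Some other text </ins>and another text<ins>Text3</p><p>Text4 </ins></p>";

const fixedIns = brokenIns
  .replace(/<ins>/g, "")                  // step 1: drop every opening <ins>
  .replace(/[^<>]+<\/ins>/g, "<ins>$&");  // step 2: re-open <ins> before each tag-free run ending in </ins>

console.log(fixedIns);
// <p>Some text Text1</p><p><ins>Text2 </ins><ins>Some other text </ins>and another textText3</p><p><ins>Text4 </ins></p>
```

Because the second replace works on match positions rather than on match text, it also behaves predictably when two runs happen to contain identical text.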

Regular expression does not work

I am using the following regular expression in Javascript:
comment_body_content = comment_body_content.replace(
/(<span id="sc_start_commenttext-.*<\/span>)((.|\s)*)(<span id="sc_end_commenttext-.*<\/span>)/,
"$1$4"
);
I want to find in my HTML code this tag <span id="sc_start_commenttext-330"></span> (the number is always different) and the tag <span id="sc_end_commenttext-330"></span>. Then the text and HTML code between those tags should be deleted and given back.
Example before replacing:
Some text and code
<span id="sc_start_commenttext-330"></span>Some text and code<span id="sc_end_commenttext-330"></span>
Some Text and code
Example after replacing:
Some text and code
<span id="sc_start_commenttext-330"></span><span id="sc_end_commenttext-330"></span>
Some text and code
Sometimes my regular expression works and it replaces the text correctly, sometimes not - is there a mistake? Thank you for help!
Alex
You should use a pattern that matches the start with its corresponding end, for example:
/(<span id="sc_start_commenttext-(\d+)"><\/span>)[^]*?(<span id="sc_end_commenttext-\2"><\/span>)/
Here \2 in the end tag refers to the string matched by (\d+), which matches the digits 330 in the start tag. [^] is an empty negated character class; in JavaScript it matches any character, including line breaks.
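A small sketch of how the backreference pattern might be used to both empty the span pair and hand back the removed text, as the question asks. The function name cutComment, the extra capture group around [^]*?, and the sample string are assumptions added for illustration.

```javascript
// cutComment is a hypothetical helper name.
function cutComment(html) {
  // Group 1: start tag, group 2: the numeric id, group 3: enclosed content,
  // group 4: end tag whose id must equal group 2 (via the \2 backreference).
  const pattern = /(<span id="sc_start_commenttext-(\d+)"><\/span>)([^]*?)(<span id="sc_end_commenttext-\2"><\/span>)/;
  let removed = null;
  const result = html.replace(pattern, (whole, start, id, inner, end) => {
    removed = inner;    // remember what was cut out
    return start + end; // keep only the two marker spans
  });
  return { result, removed };
}

const commentBody = 'Some text\n<span id="sc_start_commenttext-330"></span>Remove\nthis<span id="sc_end_commenttext-330"></span>\nMore text';
const cut = cutComment(commentBody);
console.log(cut.removed); // "Remove\nthis"
```

If the start and end ids differ, nothing matches, so the input comes back unchanged and removed stays null.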
Using DOM.
var $spans = document.getElementsByTagName("span");
var str = "";
for (var i = 0, $span, $sibling; i < $spans.length; ++i) {
    $span = $spans[i];
    if (/^sc_start_commenttext/i.test($span.id)) {
        // remove every sibling up to the end marker
        while ($sibling = $span.nextSibling) {
            if (/^sc_end_commenttext/i.test($sibling.id)) {
                break;
            }
            // textContent works for text nodes and element nodes alike
            str += $sibling.textContent;
            $span.parentNode.removeChild($sibling);
        }
    }
}
console.log("The enclosed string was: ", str);
Here you have it.
I would start by replacing .* with [0-9]+"> (if I understand your intention correctly), since the greedy .* can run past the closing tag you meant to match.
I agree that it's normally a bad idea to use regexp to parse HTML, but it can be used effectively on non-nested HTML.
Using RegExp:
var str = 'First text and code<span id="sc_start_commenttext-330"></span>Remove text<span id="sc_end_commenttext-330"></span>Last Text and code';
var re = /(.*<span id="sc_start_commenttext-\d+"><\/span>).*(<span id="sc_end_commenttext-\d+"><\/span>.*)/;
str = str.replace(re, "$1$2"); // replace() returns a new string, so keep the result
Result:
First text and code<span id="sc_start_commenttext-330"></span><span id="sc_end_commenttext-330"></span>Last Text and code

JavaScript Replace Text with HTML Between it

I want to replace some text in a webpage (only the text), but when I replace via document.body.innerHTML I get stuck whenever markup sits between the words, like so:
HTML:
<p>test test </p>
<p>test2 test2</p>
<p>test3 test3</p>
Js:
var param = "test test test2 test2 test3";
var text = document.body.innerHTML;
document.body.innerHTML = text.replace(param, '*' + param + '*');
I would like to get:
*test test
test2 test2
test3* test3
HTML of 'desired' outcome:
<p>*test test </p>
<p>test2 test2</p>
<p>test3* test3</p>
So if I run the replace with the parameter above ("test test test2 test2 test3"), the <p></p> tags between the words are not taken into account, so the text is not found and nothing is replaced.
How can I replace the text with no "consideration" to the html markup that could be between it?
Thanks in advance.
Edit (for @Sonesh Dabhi):
Basically I need to replace text in a webpage, but when I scan the
webpage with the html in it the replace won't work, I need to scan and
replace based on text only
Edit 2:
'Raw' JavaScript Please (no jQuery)
This will do what you want: it builds a regular expression that finds the text even when tags sit between the words, and replaces within it. Give it a shot.
http://jsfiddle.net/WZYG9/5/
The magic is
(\s*(?:<\/?\w+>)*\s*)*
which matches any number of tags and the whitespace between them. In the code below it has double backslashes, because they must be escaped inside a string.
The regex itself looks for any number of whitespace characters (\s*). The inner group (?:<\/?\w+>)* matches any number of start or end tags. ?: tells JavaScript not to capture the group, so it is not counted among the numbered groups available to the replacement string. < is a literal less-than character. The forward slash (which begins an HTML end tag) needs to be escaped, and the question mark means 0 or 1 occurrences. This is followed by any number of whitespace characters.
Every run of whitespace within the "text to search" gets replaced with this regular expression, allowing it to match any amount of whitespace and tags between the words in the text and remember them in the numbered variables $1, $2, etc. The replacement string is built to put those remembered variables back in.
function wrapTextIn(text, character) {
    if (!character) character = "*"; // default to asterisk
    // trim the text
    text = text.replace(/(^\s+)|(\s+$)/g, "");
    // split into words on any run of whitespace, matching the regex built below
    var words = text.split(/\s+/);
    // return if there are no words
    if (words.length == 0)
        return;
    // build the regex
    var regex = new RegExp(text.replace(/\s+/g, "(\\s*(?:<\\/?\\w+>)*\\s*)*"), "g");
    // start with wrapping character
    var replace = character;
    // for each word, put it and the matching "tags" in the replacement string
    for (var i = 0; i < words.length; i++) {
        replace += words[i];
        // logical AND (&&), not bitwise &
        if (i != words.length - 1 && words.length > 1)
            replace += "$" + (i + 1);
    }
    // end with the wrapping character
    replace += character;
    // replace the html
    document.body.innerHTML = document.body.innerHTML.replace(regex, replace);
}
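The same idea can be sketched as a pure function over an HTML string, which makes it possible to try outside a browser. This is an illustration under stated assumptions, not the answer's exact code: the names wrapTextInHtml and escapeRegExp are hypothetical, the tag pattern is widened to allow attributes, words are escaped before being placed in the regex, and each gap is captured by one outer group, because in JavaScript a repeated capture group keeps only its last iteration.

```javascript
// Escape regex metacharacters so the search words are matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: wraps the first-to-last word of `text` in `character`,
// tolerating whitespace and tags between the words inside `html`.
function wrapTextInHtml(html, text, character = "*") {
  const words = text.trim().split(/\s+/).filter(Boolean);
  if (words.length === 0) return html;

  // One capture group per gap: one or more whitespace characters or tags.
  const gap = "((?:\\s|<\\/?\\w+[^>]*>)+)";
  const pattern = new RegExp(words.map(escapeRegExp).join(gap), "g");

  return html.replace(pattern, (match, ...rest) => {
    // The first words.length - 1 extra arguments are the captured gaps.
    const gaps = rest.slice(0, words.length - 1);
    let out = character;
    words.forEach((word, i) => {
      out += word + (i < gaps.length ? gaps[i] : "");
    });
    return out + character;
  });
}

const page = "<p>test test </p>\n<p>test2 test2</p>\n<p>test3 test3</p>";
const wrapped = wrapTextInHtml(page, "test test test2 test2 test3");
console.log(wrapped);
// <p>*test test </p>
// <p>test2 test2</p>
// <p>test3* test3</p>
```

In a page, the result could be assigned back with document.body.innerHTML = wrapTextInHtml(document.body.innerHTML, param), with the usual caveat that rewriting innerHTML discards event listeners attached to the replaced nodes.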
USE THAT FUNCTION TO GET TEXT.. no jquery required
First remove the tags: you can read document.body.textContent or document.body.innerText, or strip them with a regex as in this example:
var StrippedString = OriginalString.replace(/(<([^>]+)>)/ig,"");
Then find and replace (to replace every occurrence, add the g flag to the regex):
String.prototype.trim=function(){return this.replace(/^\s\s*/, '').replace(/\s\s*$/, '');};
var param = "test test test2 test2 test3";
var text = (document.body.textContent || document.body.innerText).trim();
var replaced = text.search(param) >= 0;
if(replaced) {
var re = new RegExp(param, 'g');
document.body.innerHTML = text.replace(re , '*' + param + '*');
} else {
//param was not replaced
//What to do here?
}
Note: stripping the markup this way means you lose the tags.
