Say I have text like this:
This should also be extracted, <strong>text</strong>
I need only the text from the entire string. I have tried this:
r = r.replace(/<strong[\s\S]*?>[\s\S]*?<\/strong>/g, "$1"); but it failed (strong is still there). Is there a proper way to do this?
Expected Result
This should also be extracted, text
Solution:
To target specific tag I used this:
r = r.replace(/<strong\b[^>]*>([^<>]*)<\/strong>/i, "**$1**")
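For reference, the original attempt has no capture group, so "$1" has nothing to refer to. Here is a minimal sketch of the capture-group approach, assuming the goal is the plain text shown in the expected result (so the replacement is just "$1"). The function name unwrapStrong is made up for illustration.

```javascript
// Hypothetical helper: unwraps <strong> elements whose content has no nested tags.
function unwrapStrong(input) {
  // ([^<>]*) captures the text between the tags; "$1" puts it back
  // without the surrounding <strong>...</strong>.
  return input.replace(/<strong\b[^>]*>([^<>]*)<\/strong>/gi, "$1");
}

const sample = "This should also be extracted, <strong>text</strong>";
console.log(unwrapStrong(sample)); // "This should also be extracted, text"
```

Because the capture excludes "<" and ">", a strong element that contains other tags is left untouched by this sketch.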
To parse HTML, you need an HTML parser. See this answer for why.
If you just want to remove <strong> and </strong> from the text, you don't need parsing, but of course simplistic solutions tend to fail, which is why you need an HTML parser to parse HTML. Here's a simplistic solution that removes <strong> and </strong>:
str = str.replace(/<\/?strong>/g, "")
var yourString = "This should also be extracted, <strong>text</strong>";
yourString = yourString.replace(/<\/?strong>/g, "")
display(yourString);
function display(msg) {
// Show a message, making sure any HTML tags show
// as text
var p = document.createElement('p');
p.innerHTML = msg.replace(/&/g, "&amp;").replace(/</g, "&lt;");
document.body.appendChild(p);
}
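The simple pattern above only matches the bare tags <strong> and </strong>. A slightly more tolerant sketch, still not a parser, also accepts attributes and uppercase tag names. stripStrongTags is a hypothetical name, and the pattern assumes no ">" appears inside an attribute value.

```javascript
// Hypothetical helper: removes opening and closing <strong> tags, keeping the text.
function stripStrongTags(input) {
  // \b ends the match at the tag name, so <stronger> is not affected.
  // [^>]* skips attributes up to the closing bracket.
  return input.replace(/<\/?strong\b[^>]*>/gi, "");
}

const withAttribute = 'Some <strong class="hot">bold</strong> text';

// The bare-tag pattern only removes the closing tag here.
const simpleResult = withAttribute.replace(/<\/?strong>/g, "");
console.log(simpleResult); // 'Some <strong class="hot">bold text'

console.log(stripStrongTags(withAttribute)); // "Some bold text"
```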
Back to parsing: In your case, you can easily do it with the browser's parser, if you're on a browser:
var yourString = "This should also be extracted, <strong>text</strong>";
var div = document.createElement('div');
div.innerHTML = yourString;
display(div.innerText || div.textContent);
function display(msg) {
// Show a message, making sure any HTML tags show
// as text
var p = document.createElement('p');
p.innerHTML = msg.replace(/&/g, "&amp;").replace(/</g, "&lt;");
document.body.appendChild(p);
}
Most browsers provide innerText; older versions of Firefox only provide textContent, which is why the || is there.
In a non-browser environment, you'll want some kind of DOM library (there are lots of them).
You can do this
var r = "This should also be extracted, <strong>text</strong>";
r = r.replace(/<(.+?)>([^<]+)<\/\1>/,"$2");
console.log(r);
The regex above is strict: it only strips a tag when its closing tag matches. If you want a more relaxed version, you can do:
r = r.replace(/<.+?>/g,"");
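To make the difference between the two patterns concrete, here is a small comparison on a string with several tags. The function names stripStrict and stripRelaxed are made up for this sketch.

```javascript
// Strict: one pair whose closing tag matches the opening tag, with no tags inside.
function stripStrict(input) {
  return input.replace(/<(.+?)>([^<]+)<\/\1>/, "$2");
}

// Relaxed: every run that looks like a tag, anywhere in the string.
function stripRelaxed(input) {
  return input.replace(/<.+?>/g, "");
}

const nested = "<p>Hello <strong>big</strong> <em>world</em></p>";

console.log(stripStrict(nested));  // "<p>Hello big <em>world</em></p>"
console.log(stripRelaxed(nested)); // "Hello big world"
```

The strict version has no g flag and cannot match an element that contains other elements, so it only unwraps the first simple pair it finds.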
Related
I'm getting text from a backend api in this form:
const serverText = "This is a link and so is this. This is also another boring link.";
I'm looking to get it into this form:
const formatted = "This is a link and so is this. This is also another boring link.";
I played around with this with regex but I'm not sure if this is the way to go since it's just outputting an array of the found words.
Is there an easier way to do this with vanilla Javascript without using any extra DOM tools?
Try this:
var yourHtml= `This is a link and so is this. This is also another boring link.`;
var parser = new DOMParser();
var htmlDoc = parser.parseFromString(yourHtml, 'text/html');
var text = htmlDoc.body.innerText;
console.log(text); // Returns: "This is a link and so is this. This is also another boring link."
This converts your HTML string into DOM, and uses .innerText to remove all html elements from your string - leaving only the text.
Update:
Created this simple function that returns text, and only requires the HTML string:
function textFromHTML(str) {
var parser = new DOMParser();
var htmlDoc = parser.parseFromString(str, 'text/html');
return htmlDoc.body.innerText;
}
/* --- Usage --- */
var yourHtml= `This is a link and so is this. This is also another boring link.`;
var text = textFromHTML(yourHtml);
console.log(text); // Returns text
Update 2 (RegEx):
Final version, but uses RegExp instead of the DOMParser():
function textFromHTML(str) {
return str.replace(new RegExp("<.*?>", "g"), "");
}
/* --- Usage --- */
var text = textFromHTML("Hello <span>World!</span> This string is HTML!");
console.log(text); // Returns: "Hello World! This string is HTML!"
I have a string javascript message like this one :
var message = "merci d&#39;ajouter";
And I want this text to be converted into this one (decoding) :
var result = "merci d'ajouter";
I don't want any replace method; I want a general JavaScript solution that works for every encoded character. Thanks in advance.
This is actually possible in native JavaScript.
Keep in mind that IE8 and earlier do not support textContent, so we will have to use innerText for them.
function decode(string) {
var div = document.createElement("div");
div.innerHTML = string;
return typeof div.textContent !== 'undefined' ? div.textContent : div.innerText;
}
var testString = document.getElementById("test-string");
var decodeButton = document.getElementById("decode-button");
var decodedString = document.getElementById("decoded-string");
var encodedString = "merci d&#39;ajouter";
decodeButton.addEventListener("click", function() {
decodedString.innerHTML = decode(encodedString);
});
<h1>Decode this html</h1>
<p id="test-string"></p>
<input type=button id="decode-button" value="Decode HTML"/>
<p id="decoded-string"></p>
An easier solution would be to use the Underscore.js library. This is a fantastic library that provides you with a lot of additional functionality.
Underscore provides an _.unescape(string) function:
The opposite of escape, replaces &amp;, &lt;, &gt;, &quot;, &#96; and &#x27; with their unescaped counterparts.
_.unescape('Zebras, Elephants &amp; Penguins');
=> "Zebras, Elephants & Penguins"
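If neither a DOM nor a library is available (for example outside the browser), a small hand-rolled decoder can cover the common cases. This is a rough sketch, not a complete decoder: decodeBasicEntities and NAMED_ENTITIES are hypothetical names, and only numeric references plus five named entities are handled.

```javascript
// Assumption: a deliberately small table; real HTML defines many more named entities.
const NAMED_ENTITIES = new Map([
  ["amp", "&"],
  ["lt", "<"],
  ["gt", ">"],
  ["quot", '"'],
  ["apos", "'"],
]);

function decodeBasicEntities(input) {
  // One pass over the string, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
  return input.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, (match, body) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    }
    // Unknown names are returned unchanged.
    return NAMED_ENTITIES.has(body) ? NAMED_ENTITIES.get(body) : match;
  });
}

console.log(decodeBasicEntities("merci d&#39;ajouter")); // "merci d'ajouter"
console.log(decodeBasicEntities("Zebras, Elephants &amp; Penguins")); // "Zebras, Elephants & Penguins"
```

In a browser, the div-based decode function shown earlier is the more general choice, since the browser knows every named entity.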
I am using Google Apps Script. I have the HTML of a web page saved as a string, and I am trying to extract content from it using RegEx. I want to fetch the data in the format below:
<font color="#FF0101">
Data which is want to fetch
</font>
Which RegEx should I use to get the data contained within <font> tags (opening and closing tags)? Note the color attribute: I only want to fetch the data from tags that have that color attribute with the value given in the code.
Instead of wrestling with using RegEx to parse HTML, you can use Google Apps Script's XmlService to interpret well-formed HTML text.
function myFunction() {
var xml = '<font color="#FF0101">Data which is want to fetch</font>';
var doc = XmlService.parse(xml);
var content = doc.getContent(0).getValue();
Logger.log( content ); // "Data which is want to fetch"
var color = doc.getContent(0).asElement().getAttribute('color').getValue();
Logger.log( color ); // "#FF0101"
}
You are using JavaScript, so you have NO excuse for trying to parse HTML with regex.
var div = document.createElement('div');
div.innerHTML = "your HTML here";
var match = div.querySelectorAll("font[color='#FF0101']");
// loop through `match` and get stuff
// e.g. match[0].textContent.replace(/^\s+|\s+$/g,'')
If JS was fully supported, you could use a DOM-based solution.
var html = "<font color=\"#FF0202\">NOT THIS ONE</font><font color=\"#FF0101\">\n Data which is want to fetch\n</font>";
var faketag = document.createElement('faketag');
faketag.innerHTML = html;
var arr = [];
[].forEach.call(faketag.getElementsByTagName("font"), function(v,i,a) {
if (v.hasAttributes() == true) {
for (var o = 0; o < v.attributes.length; o++) {
var attrib = v.attributes[o];
if (attrib.name === "color" && attrib.value === "#FF0101") {
arr.push(v.innerText.replace(/^\s+|\s+$/g, ""));
}
}
}
});
document.body.innerHTML = JSON.stringify(arr);
However, according to the GAS reference:
However, because Apps Script code runs on Google's servers (not client-side, except for HTML-service pages), browser-based features like DOM manipulation or the Window API are not available.
You may try obtaining the inner text of <font color="#FF0101"> tags with a regex:
function myFunction() {
var doc = DocumentApp.getActiveDocument();
var paras = doc.getParagraphs();
var MyRegex = new RegExp('<font\\b[^<]*\\s+color="#FF0101"[^<]*>([\\s\\S]*?)</font>','ig');
var match; // declared locally so it does not leak as an implicit global
for (var i = 0; i < paras.length; ++i) {
while ((match = MyRegex.exec(paras[i].getText())) !== null)
{
Logger.log(match[1]);
}
}
}
Result against <font color="#FF0202">NOT THIS ONE</font><font color="#FF0101"> Data which is want to fetch</font>:
The regex matches any font tag that has a color attribute with the value #FF0101 inside double quotation marks. Mind that regexps are not reliable when parsing HTML! A better regex for this task is
<font\\b[^<]*\\s+color="#FF0101"[^<]*>([^<]*(?:<(?!/font>)[^<]*)*)</font>
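As a quick check outside Apps Script, the same pattern can be written as a regex literal (single backslashes) and run against the sample input. This is only a sketch: extractFontText is a hypothetical name, and trimming the captured text is an added assumption.

```javascript
// Sample input: one <font> with a different color, one with the wanted color.
const html = '<font color="#FF0202">NOT THIS ONE</font>' +
  '<font color="#FF0101">\n Data which is want to fetch\n</font>';

function extractFontText(source) {
  // Group 1 takes text up to the next "<", then keeps going only while
  // that "<" does not start "/font>".
  const pattern = /<font\b[^<]*\s+color="#FF0101"[^<]*>([^<]*(?:<(?!\/font>)[^<]*)*)<\/font>/gi;
  const results = [];
  let match;
  while ((match = pattern.exec(source)) !== null) {
    results.push(match[1].trim());
  }
  return results;
}

console.log(extractFontText(html)); // [ 'Data which is want to fetch' ]
```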
In case your HTML data spans across several paragraphs, use
function myFunction() {
var doc = DocumentApp.getActiveDocument();
var text = doc.getBody().getText();
var MyRegex = new RegExp('<font\\b[^<]*\\s+color="#FF0101"[^<]*>([\\s\\S]*?)</font>','ig');
var match; // declared locally so it does not leak as an implicit global
while ((match = MyRegex.exec(text)) !== null)
{
Logger.log(match[1]);
}
}
With this input:
<font color="#FF0202">NOT THIS ONE</font>
<font color="#FF0101">
Data which is want to fetch
</font>
Result is:
I have some string with html tags.
var str = 'text
<script>
//some code etc
</script>
............... etc
';
I need to remove <script>....</script> using regexp with js's replace() function. Could not figure how to do it.
My efforts were:
/(<script).(</script>)/m
/<script.*>([\s\S]*)</script>/m
/(<script)*(</script>)/
/<script*</script>/
no success =(
Try...
/<script>[\s\S]*<\/script>/
If this is for arbitrary HTML, consider using DOM manipulation methods instead.
var fauxDocumentFragment = document.createElement("div");
fauxDocumentFragment.innerHTML = str;
var scriptElements = fauxDocumentFragment.getElementsByTagName("script");
while (scriptElements.length) {
scriptElements[0].parentNode.removeChild(scriptElements[0]);
}
If you're lucky enough to only have to support the newer browsers, go with...
var fauxDocumentFragment = document.createElement("div");
fauxDocumentFragment.innerHTML = str;
[].forEach.call(fauxDocumentFragment.querySelectorAll("script"), function(script) {
script.parentNode.removeChild(script);
});
You can try the following:
// [\s\S] also matches newlines; "." does not, and the m flag does not change that
str = str.replace(/<script[\s\S]*?>[\s\S]*?<\/script>/g, "");
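For strings where a regex is acceptable, here is a sketch that handles attributes, multi-line bodies and several script blocks. removeScriptBlocks is a made-up name, and the pattern is an assumption about reasonably tidy input, not a replacement for DOM methods.

```javascript
// Hypothetical helper: removes every <script>...</script> block from a string.
function removeScriptBlocks(html) {
  // [\s\S]*? is lazy, so each match stops at the nearest closing tag;
  // the g flag repeats the replacement for every block.
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, "");
}

const page = 'before<script>var a = 1;</script>middle' +
  '<script type="text/javascript">\nvar b = 2;\n</script>after';

console.log(removeScriptBlocks(page)); // "beforemiddleafter"
```

A greedy [\s\S]* would instead match from the first opening tag to the last closing tag and remove "middle" as well.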
I have an HTML element with a title inside, like this. <details>Name of page</details>
How can I make a regex to search for the <details> element, but only returning the text inside, Name of page?
You should never use regex to parse HTML. Especially not when the environment you use provides a DOM parser at your fingertips. Just use it:
var docpart = document.createElement("div"),
details, text = '';
docpart.innerHTML = "your <details>…HTML string…</details> here";
details = docpart.getElementsByTagName("details");
if (details.length > 0) {
text = details[0].textContent;
}
alert(text); // "…HTML string…"
Since you mentioned jQuery in your comment, things get simpler. Here is the jQuery equivalent of the above:
var inputHTML = "your <details>…HTML string…</details> here";
var details = $("<div>", {html: inputHTML}).find("details").text();
Try this regex:
/<details>(.*?)<\/details>/
The first capture group ($1) will contain the name.
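A short sketch of reading that capture group in JavaScript. extractDetailsText is a hypothetical name, and the pattern assumes a <details> tag without attributes whose content sits on one line.

```javascript
// Hypothetical helper: returns the text inside the first <details> element, or null.
function extractDetailsText(html) {
  const match = /<details>(.*?)<\/details>/.exec(html);
  // match[0] is the whole element, match[1] is the first capture group.
  return match ? match[1] : null;
}

console.log(extractDetailsText("<details>Name of page</details>")); // "Name of page"
console.log(extractDetailsText("<p>no details here</p>"));          // null
```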