How to use RegEx to get HTML content using Google Apps Script - javascript

I am using Google Apps Script. I have the HTML of a web page saved as a string, and I am trying to extract part of its content using RegEx. I want to fetch the data in the format below:
<font color="#FF0101">
Data which is want to fetch
</font>
Which RegEx should I use to get the data contained within the <font> tags (between the opening and closing tags)? Note the color attribute: I only want to fetch data from tags that have that color attribute with the value given in the code.

Instead of wrestling with using RegEx to parse HTML, you can use Google Apps Script's XmlService to interpret well-formed HTML text.
function myFunction() {
var xml = '<font color="#FF0101">Data which is want to fetch</font>';
var doc = XmlService.parse(xml);
var content = doc.getContent(0).getValue();
Logger.log( content ); // "Data which is want to fetch"
var color = doc.getContent(0).asElement().getAttribute('color').getValue();
Logger.log( color ); // "#FF0101"
}

You are using JavaScript, so you have NO excuse for trying to parse HTML with regex.
var div = document.createElement('div');
div.innerHTML = "your HTML here";
var match = div.querySelectorAll("font[color='#FF0101']");
// loop through `match` and get stuff
// e.g. match[0].textContent.replace(/^\s+|\s+$/g,'')

If JS was fully supported, you could use a DOM-based solution.
var html = "<font color=\"#FF0202\">NOT THIS ONE</font><font color=\"#FF0101\">\n Data which is want to fetch\n</font>";
var faketag = document.createElement('faketag');
faketag.innerHTML = html;
var arr = [];
[].forEach.call(faketag.getElementsByTagName("font"), function(v,i,a) {
if (v.hasAttributes() == true) {
for (var o = 0; o < v.attributes.length; o++) {
var attrib = v.attributes[o];
if (attrib.name === "color" && attrib.value === "#FF0101") {
arr.push(v.innerText.replace(/^\s+|\s+$/g, ""));
}
}
}
});
document.body.innerHTML = JSON.stringify(arr);
However, according to the Google Apps Script reference:
However, because Apps Script code runs on Google's servers (not client-side, except for HTML-service pages), browser-based features like DOM manipulation or the Window API are not available.
You may try obtaining the inner text of <font color="#FF0101"> tags with a regex:
function myFunction() {
var doc = DocumentApp.getActiveDocument();
var paras = doc.getParagraphs();
var MyRegex = new RegExp('<font\\b[^<]*\\s+color="#FF0101"[^<]*>([\\s\\S]*?)</font>','ig');
var match;
for (var i=0; i<paras.length; ++i) {
while ((match = MyRegex.exec(paras[i].getText())) !== null)
{
Logger.log(match[1]);
}
}
}
Result against <font color="#FF0202">NOT THIS ONE</font><font color="#FF0101"> Data which is want to fetch</font>:
The regex matches any font tag that has a color attribute with the value #FF0101 inside double quotation marks. Mind that regexes are not reliable for parsing HTML! A better regex for this task (written with the doubled backslashes needed inside a string passed to new RegExp) is
<font\\b[^<]*\\s+color="#FF0101"[^<]*>([^<]*(?:<(?!/font>)[^<]*)*)</font>
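To try the stricter pattern above outside Apps Script, here is a minimal sketch in plain JavaScript. The helper name extractFontText and the sample string are assumptions made for illustration; only the pattern itself comes from the answer. It uses no DOM or Apps Script services, so in principle the same function could be pasted into a script project, though that is not verified here.

```javascript
// Hypothetical helper (not part of any Apps Script API): returns the trimmed
// inner text of every <font> tag whose color attribute equals `color`.
function extractFontText(html, color) {
  // Escape regex metacharacters so the color value is matched literally.
  var escaped = color.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  var re = new RegExp(
    '<font\\b[^<]*\\s+color="' + escaped + '"[^<]*>' +
    '([^<]*(?:<(?!/font>)[^<]*)*)</font>', 'ig');
  var results = [];
  var match;
  // With the g flag, exec() resumes from the previous match on each call.
  while ((match = re.exec(html)) !== null) {
    results.push(match[1].trim());
  }
  return results;
}

// Sample input modelled on the question: only the second tag should match.
var sample = '<font color="#FF0202">NOT THIS ONE</font>\n' +
  '<font color="#FF0101">\nData which is want to fetch\n</font>';
var found = extractFontText(sample, '#FF0101');
console.log(found);
```

Like any regex approach, this assumes the color attribute is written with double quotes and that font tags are not nested.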
In case your HTML data spans across several paragraphs, use
function myFunction() {
var doc = DocumentApp.getActiveDocument();
var text = doc.getBody().getText();
var MyRegex = new RegExp('<font\\b[^<]*\\s+color="#FF0101"[^<]*>([\\s\\S]*?)</font>','ig');
var match;
while ((match = MyRegex.exec(text)) !== null)
{
Logger.log(match[1]);
}
}
With this input:
<font color="#FF0202">NOT THIS ONE</font>
<font color="#FF0101">
Data which is want to fetch
</font>
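The result is that only the content of the second tag is logged, because the first tag has a different color value.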

Related

Parsing an html tag with JavaScript

I want to get number that is stored in a tag like
var x="<a>1234</a>"; using JavaScript. How can I parse this tag to extract the numbers?
Parse the HTML and get value from the tag.
There are 2 methods :
Using DOMParser :
var x="<a>1234</a>";
var parser = new DOMParser();
var doc = parser.parseFromString(x, "text/html");
console.log(doc.querySelector('a').innerHTML)
Creating a dummy element
var x = "<a>1234</a>";
// create a dummy element and set content
var div = document.createElement('div');
div.innerHTML = x;
console.log(div.querySelector('a').innerHTML)
Or using regex (not preferred, but workable for simple HTML):
var x = "<a>1234</a>";
console.log(x.match(/<a>(.*)<\/a>/)[1])
console.log(x.match(/\d+/)[0])
REF : Using regular expressions to parse HTML: why not?
var x="<a>1234</a>".replace(/\D/g, "");
alert(x);
should work
With a regular expression, assuming x holds the HTML string:
var x = "<a>1234</a>";
var tagValue = x.match(/<a>(.*?)<\/a>/i)[1];
console.log(tagValue);

JavaScript/jQuery manipulate and then replace all links in my HTML content

I am trying to write a script that, after the page load, will replace all my existing links with links in a different format.
However, while I've managed to work out how to do the link string manipulation, I'm stuck on how to actually replace it on the page.
I have the following code which gets all the links from the page, and then loops through them doing a regular expression to see if they match my pattern and then if they do taking out the name information from the link and creating the new link structure - this bit all works. It's the next stage of doing the replace where I'm stuck.
var str;
var fn;
var ln;
var links = document.getElementsByTagName("a");
for(var i=0; i<links.length; i++) {
str = links[i].href.match(/\/Services\/(.*?)\/People\/(.*?(?=\.aspx))/gi);
if (links[i].href.match(/\/Services\/(.*?)\/People\/(.*?(?=\.aspx))/gi)) {
var linkSplit = links[i].href.split("/");
// Get the last one (so the .aspx and then split again).
// Now split again on the .
var fileNameSplit = linkSplit[linkSplit.length-1].split(".");
var nameSplit = fileNameSplit[0].split(/(?=[A-Z])/);
fn = nameSplit[0];
ln = nameSplit[1];
if(nameSplit[2]){
ln += nameSplit[2];
}
// Build replacement string
var replacementUrl = 'https://www.testsite.co.uk/services/people.aspx?fn='+fn+'&sn='+ln;
// Do the actual replacement
links[i].href.replace(links[i].href, replacementUrl);
}
}
I've tried a couple of different solutions to make it do the actual replacement, .replace, .replaceWith, and I've tried using a split/join to replace a string with an array that I found here - Using split/join to replace a string with an array
var html = document.getElementsByTagName('html')[0];
var block = html.innerHTML;
var replace_str = links[i].href;
var replace_with = replacementUrl;
var rep_block = block.split(replace_str).join(replace_with);
I've read these, but had no success applying the same logic:
Javascript: How do I change every word visible on screen?
jQuery replace all href="" with onclick="window.location="
How can I fix this problem?
It's simpler than that:
links[i].href = replacementUrl;
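The reason the original attempt did nothing is that String.prototype.replace returns a new string and never modifies the property it was called on, so the result has to be assigned back. As a rough sketch, the URL-building part of the question can be pulled into a pure function and its result assigned to href. The function name buildPeopleUrl and the sample URL are made up for illustration, and unlike the question's code it takes the name from the regex match rather than from the last path segment and joins all remaining name parts into the surname.

```javascript
// Hypothetical helper: returns the rewritten URL, or null when the href
// does not match the /Services/.../People/Name.aspx pattern.
function buildPeopleUrl(href) {
  var m = /\/Services\/(.*?)\/People\/(.*?)\.aspx/i.exec(href);
  if (!m) {
    return null;
  }
  // Split "JohnSmith" into ["John", "Smith"] at each capital letter.
  var nameParts = m[2].split(/(?=[A-Z])/);
  var fn = nameParts[0];
  var ln = nameParts.slice(1).join('');
  return 'https://www.testsite.co.uk/services/people.aspx?fn=' +
    encodeURIComponent(fn) + '&sn=' + encodeURIComponent(ln);
}

// Made-up sample link for demonstration.
var example = buildPeopleUrl(
  'https://www.testsite.co.uk/Services/Legal/People/JohnSmith.aspx');
console.log(example);

// In a browser, the result is assigned back to the link:
// var links = document.getElementsByTagName('a');
// for (var i = 0; i < links.length; i++) {
//   var url = buildPeopleUrl(links[i].href);
//   if (url) { links[i].href = url; }
// }
```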

Javascript replace tag but preserve content

Say i have a text like this:
This should also be extracted, <strong>text</strong>
I need the text only from the entire string, I have tried this:
r = r.replace(/<strong[\s\S]*?>[\s\S]*?<\/strong>/g, "$1"); but failed (strong is still there). Is there any proper way to do this?
Expected Result
This should also be extracted, text
Solution:
To target specific tag I used this:
r = r.replace(/<strong\b[^>]*>([^<>]*)<\/strong>/i, "$1")
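The pattern in the question has no parenthesised group, so $1 has nothing to refer to. A small sketch of the difference, using the question's own sample string (the variable names are just for illustration):

```javascript
var input = 'This should also be extracted, <strong>text</strong>';

// No capture group in the pattern: "$1" refers to nothing, and current
// JavaScript engines insert it literally.
var withoutGroup = input.replace(/<strong[\s\S]*?>[\s\S]*?<\/strong>/g, '$1');

// With a capture group around the tag's content, "$1" is that content.
var withGroup = input.replace(/<strong\b[^>]*>([\s\S]*?)<\/strong>/gi, '$1');

console.log(withoutGroup);
console.log(withGroup);
```

This still only handles simple, non-nested <strong> tags; it is not a general HTML parser.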
To parse HTML, you need an HTML parser. See this answer for why.
If you just want to remove <strong> and </strong> from the text, you don't need parsing, but of course simplistic solutions tend to fail, which is why you need an HTML parser to parse HTML. Here's a simplistic solution that removes <strong> and </strong>:
str = str.replace(/<\/?strong>/g, "")
var yourString = "This should also be extracted, <strong>text</strong>";
yourString = yourString.replace(/<\/?strong>/g, "")
display(yourString);
function display(msg) {
// Show a message, making sure any HTML tags show
// as text
var p = document.createElement('p');
p.innerHTML = msg.replace(/&/g, "&amp;").replace(/</g, "&lt;");
document.body.appendChild(p);
}
Back to parsing: In your case, you can easily do it with the browser's parser, if you're on a browser:
var yourString = "This should also be extracted, <strong>text</strong>";
var div = document.createElement('div');
div.innerHTML = yourString;
display(div.innerText || div.textContent);
function display(msg) {
// Show a message, making sure any HTML tags show
// as text
var p = document.createElement('p');
p.innerHTML = msg.replace(/&/g, "&amp;").replace(/</g, "&lt;");
document.body.appendChild(p);
}
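Older browsers differ here: old versions of Internet Explorer provide only innerText, and old versions of Firefox provide only textContent, which is why there's that || there. Current browsers support both.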
In a non-browser environment, you'll want some kind of DOM library (there are lots of them).
You can do this
var r = "This should also be extracted, <strong>text</strong>";
r = r.replace(/<(.+?)>([^<]+)<\/\1>/,"$2");
console.log(r);
I have used a fairly strict regex here. If you want a more relaxed version that removes every tag, you can use
r = r.replace(/<.+?>/g,"");

Remove HTML Tags From A String, Using jQuery

I have a simple string e.g.
var s = "<p>Hello World!</p><p>By Mars</p>";
How do I convert s to a jQuery object? My objective is to remove the <p>s and </p>s. I could have done this using regex, but that's rather not recommended.
In the simplest form (if I am understanding correctly):
var s = "<p>Hello World!</p><p>By Mars</p>";
var o = $(s);
var text = o.text();
Or you could use a conditional selector with a search context:
// load string as object, wrapped in an outer container to use for search context
var o = $("<div><p>Hello World!</p><p>By Mars</p></div>");
// sets the context to only look within o; otherwise, this will return all P tags
var tags = $("P", o);
tags.each(function(){
var tag = $(this); // get a jQuery object for the tag
// do something with the contents of the tag
});
If you are parsing large amounts of HTML (for example, interpreting the results of a screen scrape), use a server-side HTML parsing library, not jQuery (tons of posts on here about HTML parsing).
To get all the strings, use
var s = "<p>Hello World!</p><p>By Mars</p>";
var result = "";
$.each($(s), function(i){
result += " " + $(this).html();
});
If you don't want regex, why not just use split and join (replace with a string argument would only remove the first occurrence):
var s = "<p>Hello World!</p><p>By Mars</p>";
s = s.split('<p>').join('').split('</p>').join('');
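One caveat with the no-regex route: when the first argument to replace is a string, only the first occurrence is replaced. A minimal sketch of the difference, with removeAll as a hypothetical helper name:

```javascript
// Hypothetical helper: removes every occurrence of each literal token
// by splitting on the token and joining the pieces back together.
function removeAll(text, tokens) {
  return tokens.reduce(function (acc, token) {
    return acc.split(token).join('');
  }, text);
}

var s = '<p>Hello World!</p><p>By Mars</p>';

// String pattern: only the first <p> and the first </p> are removed.
var firstOnly = s.replace('<p>', '').replace('</p>', '');

// split/join removes all of them, still without a regex.
var allRemoved = removeAll(s, ['<p>', '</p>']);

console.log(firstOnly);
console.log(allRemoved);
```

This only matches the exact literal tokens, so a tag with attributes such as <p class="x"> would be left in place.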

How do I extract a background= value from a string containing HTML in JavaScript?

I have a string containing HTML loaded from another page. How do I extract the background property from its body tag using JavaScript?
The body tag in the string looks like this:
<body onload='init();' background='storage/images/jsb_background.jpg' link='#000000' vlink='#000000' alink='#000000' leftmargin='0' topmargin='0' marginwidth='0' marginheight='0'>
Thanks!
I patched together a regex to do this, which will search the data string variable (containing the HTML) for the background attribute of the body tag. The regex is stolen from here and modified a bit. I'm still new to regex, so I guess it can be done more fluently, but it still gets the job done
var data = ""; // put your HTML string here
var regex = /body.*background=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?/;
var result = regex.exec(data);
// exec() returns null when nothing matches, so check that first
if (result && result.length > 1) {
var background = result[1];
alert(background);
}
else {
//no match
}
This is my answer as I understand your problem (given the limited details and no code example)...
This is also assuming that your HTML string is valid html...
var html = yourString;
var background = "";
background = $(html).find("body").attr("background");
If you aren't actually appending your HTML string to the DOM there may not be a nice and easy jQuery way to do this. You may have to parse out the background attribute by hand.
var html = yourString;
var charStart = html.indexOf("<body");
var charEnd = html.indexOf(">", charStart);
var bodyTag = html.substring(charStart,charEnd+1);
charStart = bodyTag.indexOf("background='")+12;
charEnd = bodyTag.indexOf("'",charStart);
var background = bodyTag.substring(charStart,charEnd);
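The by-hand approach above can be made a little more forgiving with two small regexes: one to isolate the body tag, one to read the attribute with either quote style. This is only a sketch under stated assumptions: getBodyAttribute is a made-up name, attribute values are assumed to be quoted and to contain no > character, and the attribute name is assumed to be a plain word.

```javascript
// Hypothetical helper: returns the value of attribute `name` on the first
// <body> tag in `html`, or null if the tag or attribute is missing.
function getBodyAttribute(html, name) {
  // Isolate the opening body tag (assumes no ">" inside attribute values).
  var tag = /<body\b[^>]*>/i.exec(html);
  if (!tag) {
    return null;
  }
  // Match name="value" or name='value'; one of the two groups is filled.
  var attr = new RegExp(
    '\\s' + name + '\\s*=\\s*(?:"([^"]*)"|\'([^\']*)\')', 'i').exec(tag[0]);
  if (!attr) {
    return null;
  }
  return attr[1] !== undefined ? attr[1] : attr[2];
}

// The body tag from the question.
var page = "<html><body onload='init();' " +
  "background='storage/images/jsb_background.jpg' link='#000000' " +
  "vlink='#000000' alink='#000000' leftmargin='0' topmargin='0' " +
  "marginwidth='0' marginheight='0'><p>Hi</p></body></html>";
var background = getBodyAttribute(page, 'background');
console.log(background);
```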
