Is DOMParser().parseFromString() worth using? - javascript

https://developer.mozilla.org/en-US/docs/Web/API/DOMParser#DOMParser_HTML_extension
It looks like the DOMParser uses innerHTML to add stringified elements to the DOM. What's the advantage of using it?
I have compared the difference between using DOMParser().parseFromString() and using element.innerHTML below. Am I overlooking something?
Using DOMParser
const main = document.querySelector('main');
const newNodeString = '<body><h2>I made it on the page</h2><p>What should I do now?</p><select name="whichAdventure"><option>Chill for a sec.</option><option>Explore all that this page has to offer...</option><option>Run while you still can!</option></select><p>Thanks for your advice!</p></body>';
// Works as expected
let newNode = new DOMParser().parseFromString(newNodeString, 'text/html');
let div = document.createElement('div');
console.log('%cArray.from: ', 'border-bottom: 1px solid yellow;font-weight:1000;');
Array.from(newNode.body.children).forEach((node, index, array) => {
div.appendChild(node);
console.log('length:', array.length, 'index: ', index, 'node: ', node);
})
main.appendChild(div);
Using innerHTML
const main = document.querySelector('main');
const newNodeString = '<h2>I made it on the page</h2><p>What should I do now?</p><select name="whichAdventure"><option>Chill for a sec.</option><option>Explore all that this page has to offer...</option><option>Run while you still can!</option></select><p>Thanks for your advice!</p>';
// Works as expected
let div = document.createElement('div');
div.innerHTML = newNodeString;
main.appendChild(div);
I expect that DOMParser().parseFromString() provides some additional functionality that I'm unaware of.

Well, for one thing, DOMParser can parse XML files. It also validates that the XML is well formed and produces meaningful errors if not. More to the point, it is using the right tool for the right job.
I've used it in the past to take an uploaded XML file and produce HTML using a predefined XSLT style sheet, without getting a server involved.
Obviously, if all you're doing is appending the string to an existing DOM, and innerHTML (or outerHTML) works for you, continue using it.

Related

Puppeteer querySelectorAll doesn't get elements properly

I'm using Puppeteer and am trying to use document.querySelectorAll to get a list of elements to then loop over and do something, however, it seems that something is wrong in my code, it either returns nothing, undefined or an empty {} despite my elements being on the page, my JS:
let elements = await page.evaluate(() => document.querySelectorAll("div[class^='my-class--']"))
for (let el of Array.from(elements)) {
// do something
}
what's wrong with my elements and page.evaluate here?
As far as I understand, puppeteer returns all the HTML as a giant string. This is because Node doesn't run in the browser so the HTML doesn't get parsed. So DOM selectors won't work.
What you can do to solve this issue is to use the Cheerio.js module, which allows you to grab elements with JQuery as if it is a parsed DOM.
Since puppeteer returns all HTML as a string you could use DOMParser like in the below example.
let doc = new DOMParser().parseFromString('<template class="myClass"><span class="target">check it out</span></template>', 'text/html');
let templateContent = doc.querySelector("template");
let template = new DOMParser().parseFromString(templateContent.innerHTML, 'text/html');
let target = template.querySelector("span");
console.log([templateContent,target]);

Is there a better way to turn HTML to plain text in JavaScript than a series of Regex search/replaces

My goal is to retrieve HTML via a REST API and convert it to plain text. Then I send it through another API to Slack, which does not accept HTML (so far as I'm aware).
I am using a series of Regex scripts to accomplish this.
var noHtml = text.replace(/<(?:.|\n)*?>/gm, '');
var noHtmlEncodeSpace = noHtml.replace(/ /g, ' ');
var noHtmlEncodersquo = noHtmlEncodeSpace.replace(/’/g, "'");
var noHtmlEncodeldsquo = noHtmlEncodersquo.replace(/‘/g, "'");
var noHtmlEncodeSingleQuote = noHtmlEncodeldsquo.replace(/'/g, "'");
var noHtmlEncodeldquo = noHtmlEncodeSingleQuote.replace(/“/g, "`");
var noHtmlEncodeDoubleQuote = noHtmlEncodeldquo.replace(/"/g, "`");
var noHtmlEncoderdquo = noHtmlEncodeDoubleQuote.replace(/”/g, "`");
The results are as expected. But transforming HTML to plain text seems like it is a common-enough task in JavaScript that there may be a smarter way to do it.
I am new to JavaScript. Thank you for any guidance.
You might use DOMParser to safely parse the HTML string into a document, after which you can retrieve the textContent of the document:
const htmlStr = `<div>
foo ’’
</div>
<script>
alert('evil');
</` + `script>
<img src="badsrc" onerror="alert('evil')">`;
const doc = new DOMParser().parseFromString(htmlStr, 'text/html');
console.log(doc.body.textContent);
Depending on the text spacing desired, you might use the innerText property instead:
doc.body.innerText
(This is in contrast to, for example, setting the innerHTML of a newly created element, which wouldn't be as safe - the "evil" scripts could be executed before the textContent is retrieved)

Parsing Html from string into document

I'm trying to parse Html code from string into a document and start appending each node at a time to the real dom.
After some research i have encountered this api :
DOMImplementation.createHTMLDocument()
which works great but if some node has descendants than i need to first append the node and only after its in the real dom than i should start appending its descendants , i think i can use
document.importNode(externalNode, deep);
with the deep value set to false in order to copy only the parent node.
so my approach is good for this case and how should i preserve my order of appended nodes so i wont append the same node twice?
and one more problem is in some cases i need to add more html code into a specific location (for example after some div node) and continue appending , any idea how to do that correctly?
You can use the DOMParser for that:
const parser = new DOMParser();
const doc = parser.parseFromString('<h1>Hello World</h1>', 'text/html');
document.body.appendChild(doc.documentElement);
But if you want to append the same thing multiple times, you will have better performances using a template:
const template = document.createElement('template');
template.innerHTML = '<h1>Hello World</h1>';
const instance = template.cloneNode(true);
document.body.appendChild(instance.content);
const instance2 = template.cloneNode(true);
document.body.appendChild(instance2.content);
Hope this helps

Can we prevent assignment to innerHTML from requesting resources?

The following code results in an HTTP request for an image resource in both Firefox and Chrome.
var el = document.createElement('div');
el.innerHTML = "<img src='junk'/>";
As a programmer, I may or may not want el to be rendered. If I don't, then maybe I don't want a request being sent for the src.
dojo.toDom() shows the same behaviour.
Is there anyway to get a document fragment from a string, without referenced resources from being requested?
Use the DOMParser to create a full document structure from a given string.
Alternatively, use the beforeload event to intercept requests.
Much lighter memory to use strings to create DOM elements instead of creating documentFragments and working with them:
var div = document.createElement('div');
div.innerHTML = 'some text';
document.getElementById('someparent').appendChild('div');
Can be replaced with:
var div = '<div>some text</div>';
document.getElementById('someparent').innerHTML += div;

construct a DOM tree from a string without loading resources (specifically images)

So I am grabbing RSS feeds via AJAX. After processing them, I have a html string that I want to manipulate using various jQuery functionality. In order to do this, I need a tree of DOM nodes.
I can parse a HTML string into the jQuery() function.
I can add it as innerHTML to some hidden node and use that.
I have even tried using mozilla's nonstandard range.createContextualFragment().
The problem with all of these solutions is that when my HTML snippet has an <img> tag, firefox dutifully fetches whatever image is referenced. Since this processing is background stuff that isn't being displayed to the user, I'd like to just get a DOM tree without the browser loading all the images contained in it.
Is this possible with javascript? I don't mind if it's mozilla-only, as I'm already using javascript 1.7 features (which seem to be mozilla-only for now)
The answer is this:
var parser = new DOMParser();
var htmlDoc = parser.parseFromString(htmlString, "text/html");
var jdoc = $(htmlDoc);
console.log(jdoc.find('img'));
If you pay attention to your web requests you'll notice that none are made even though the html string is parsed and wrapped by jquery.
The obvious answer is to parse the string and remove the src attributes from img tags (and similar for other external resources you don't want to load). But you'll have already thought of that and I'm sure you're looking for something less troublesome. I'm also assuming you've already tried removing the src attribute after having jquery parse the string but before appending it to the document, and found that the images are still being requested.
I'm not coming up with anything else, but you may not need to do full parsing; this replacement should do it in Firefox with some caveats:
thestring = thestring.replace("<img ", "<img src='' ");
The caveats:
This appears to work in the current Firefox. That doesn't meant that subsequent versions won't choose to handle duplicated src attributes differently.
This assumes the literal string "general purpose assumption, that string could appear in an attribute value on a sufficiently...interesting...page, especially in an inline onclick handler like this: <a href='#' onclick='$("frog").html("<img src=\"spinner.gif\">")'> (Although in that example, the false positive replacement is harmless.)
This is obviously a hack, but in a limited environment with reasonably well-known data...
You can use the DOM parser to manipulate the nodes.
Just replace the src attributes, store their original values and add them back later on.
Sample:
(function () {
var s = "<img src='http://www.google.com/logos/olympics10-skijump-hp.png' /><img src='http://www.google.com/logos/olympics10-skijump-hp.png' />";
var parser = new DOMParser();
var dom = parser.parseFromString("<div id='mydiv' >" + s + "</div>", "text/xml");
var imgs = dom.getElementsByTagName("img");
var stored = [];
for (var i = 0; i < imgs.length; i++) {
var img = imgs[i];
stored.push(img.getAttribute("src"));
img.setAttribute("myindex", i);
img.setAttribute("src", null);
}
$(document.body).append(new XMLSerializer().serializeToString(dom));
alert("Images appended");
window.setTimeout(function () {
alert("loading images");
$("#mydiv img").each(function () {
this.src = stored[$(this).attr("myindex")];
})
alert("images loaded");
}, 2000);
})();

Categories

Resources