construct a DOM tree from a string without loading resources (specifically images) - javascript

So I am grabbing RSS feeds via AJAX. After processing them, I have an HTML string that I want to manipulate using various jQuery functionality. In order to do this, I need a tree of DOM nodes.
I can pass an HTML string to the jQuery() function.
I can add it as innerHTML to some hidden node and use that.
I have even tried using Mozilla's nonstandard range.createContextualFragment().
The problem with all of these solutions is that when my HTML snippet has an <img> tag, Firefox dutifully fetches whatever image is referenced. Since this processing is background stuff that isn't being displayed to the user, I'd like to just get a DOM tree without the browser loading all the images contained in it.
Is this possible with JavaScript? I don't mind if it's Mozilla-only, as I'm already using JavaScript 1.7 features (which seem to be Mozilla-only for now).

The answer is this:
var parser = new DOMParser();
var htmlDoc = parser.parseFromString(htmlString, "text/html");
var jdoc = $(htmlDoc);
console.log(jdoc.find('img'));
If you watch your network requests, you'll notice that none are made, even though the HTML string is parsed and wrapped by jQuery.

The obvious answer is to parse the string and remove the src attributes from img tags (and similar for other external resources you don't want to load). But you'll have already thought of that and I'm sure you're looking for something less troublesome. I'm also assuming you've already tried removing the src attribute after having jQuery parse the string but before appending it to the document, and found that the images are still being requested.
I'm not coming up with anything else, but you may not need to do full parsing; this replacement should do it in Firefox with some caveats:
thestring = thestring.replace(/<img /g, "<img src='' ");
The caveats:
This appears to work in the current Firefox. That doesn't mean that subsequent versions won't choose to handle duplicated src attributes differently.
This assumes the literal string "<img " only ever appears at the start of an image tag. As a general purpose assumption that is shaky: that string could appear in an attribute value on a sufficiently...interesting...page, especially in an inline onclick handler like this: <a href='#' onclick='$("frog").html("<img src=\"spinner.gif\">")'> (Although in that example, the false positive replacement is harmless.)
This is obviously a hack, but in a limited environment with reasonably well-known data...
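The "obvious answer" mentioned above, rewriting the string before anything parses it, could look roughly like the sketch below. It renames src to data-src on every img tag so that a later parse has nothing to fetch. The helper name deferImageSources and the data-src attribute are assumptions made for illustration, and a regular expression is not an HTML parser: a ">" inside an attribute value that precedes src, or attributes such as srcset, are not handled.

```javascript
// Hypothetical helper (not from the original answers): rename the src attribute
// of every <img> tag to data-src so that parsing the string triggers no fetch.
function deferImageSources(html) {
  // <img, then any attributes that precede src, then whitespace + "src="
  return html.replace(/<img\b([^>]*?)\ssrc\s*=/gi, "<img$1 data-src=");
}

// Example input with two images, one of which has an attribute before src
var sample = "<p><img src='a.png'><img alt=\"x\" src=\"b.png\"></p>";
var deferred = deferImageSources(sample);
console.log(deferred);
```

Once the tree has been built from the rewritten string, the original URLs are still available in data-src if the images are needed later.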

You can use the DOM parser to manipulate the nodes.
Just replace the src attributes, store their original values and add them back later on.
Sample:
(function () {
    var s = "<img src='http://www.google.com/logos/olympics10-skijump-hp.png' /><img src='http://www.google.com/logos/olympics10-skijump-hp.png' />";
    var parser = new DOMParser();
    var dom = parser.parseFromString("<div id='mydiv' >" + s + "</div>", "text/xml");
    var imgs = dom.getElementsByTagName("img");
    var stored = [];
    for (var i = 0; i < imgs.length; i++) {
        var img = imgs[i];
        // Remember the original URL and tag the image with its index
        stored.push(img.getAttribute("src"));
        img.setAttribute("myindex", i);
        // Drop src entirely; setAttribute("src", null) would store the string "null"
        img.removeAttribute("src");
    }
    $(document.body).append(new XMLSerializer().serializeToString(dom));
    alert("Images appended");
    window.setTimeout(function () {
        alert("loading images");
        // Put the stored URLs back, which starts the downloads
        $("#mydiv img").each(function () {
            this.src = stored[$(this).attr("myindex")];
        });
        alert("images loaded");
    }, 2000);
})();

Related

what's the difference between appending a child element and setting innerHTML

In the first example the script is executed, but in the second it is not, even though the resulting DOM is the same.
// executable
var c = 'alert("append a div in which there is a script element")';
var div = document.createElement('div');
var script_2 = document.createElement('script');
script_2.textContent = c;
div.appendChild(script_2);
document.body.appendChild(div);
// unexecutable although the dom result is same as above case
var d = '<script>alert("append a div that has script tag as innerHTML")';
var div_d = document.createElement('div');
div_d.innerHTML = d;
document.body.appendChild(div_d);
.innerHTML allows you to add as much HTML as you want in one easy call.
.appendChild allows you to add a single element (Or multiple elements if you append a DocumentFragment).
If you use .innerHTML then you need to include the opening and closing tags correctly. Your HTML must be well formed.
Elements created with document.createElement automatically get the appropriate opening and closing tags.
Your example for .innerHTML is not properly formed. Instead of:
var d = '<script>alert("append a div that has script tag as innerHTML")';
it should be:
var d = '<script>alert("append a div that has script tag as innerHTML")</script>';
UPDATE:
Interesting!!
I know that, in the past, your second example would have worked. But it seems that, probably for security reasons, the browser no longer allows you to insert <script> through .innerHTML.
I tried on Chrome 62 and it fails. Firefox 57 fails and Safari 11.0.2 fails.
My best guess is that this is a security update.
Look here:
https://developer.mozilla.org/en-US/docs/Web/API/Element/innerHTML
And go down to the Security considerations section.
It reads:
It is not uncommon to see innerHTML used to insert text in a web page. This comes with a security risk.
let name = "John";
// assuming 'el' is an HTML DOM element
el.innerHTML = name; // harmless in this case
// ...
name = "<script>alert('I am John in an annoying alert!')</script>";
el.innerHTML = name; // harmless in this case
Although this may look like a cross-site scripting attack, the result is harmless. HTML5 specifies that a <script> tag inserted via innerHTML should not execute.

Javascript: match using dom loaded page

I am trying to grab all links from a Google search results page using the Chrome console.
First I wanted to get the DOM-loaded source. I tried the code below.
var source = document.documentElement.innerHTML;
Now when I type source in the console, it shows the correct DOM-loaded source. But if I run alert(source); it shows the default HTML source of the page.
So the problem is that when I run the code below
source.match(/class="r"><a href="(.*?)"/);
it returns null, because the variable source holds the source code from before the DOM was loaded.
You can use DOM API (i.e. getElementsByTagName) to find all a tags in a page. Take a look:
var anchors = document.getElementsByTagName('A');
var matchingHrefs = Array.prototype.slice.call(anchors).filter(function(a) {
    return a.className == 'r';
}).map(function(a) {
    return a.href;
});
The Array.prototype.slice.call call turns the NodeList into a regular array.
You probably need to add the /g flag to your regex to match globally.
Like this:
yourHtml.match(/href="([^"]*)"/g)
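One detail worth spelling out: with the /g flag, String.prototype.match returns the whole matched substrings and drops the capture groups, so the URLs themselves are easier to collect with a RegExp.prototype.exec loop. The following is a minimal sketch of that; the function name extractHrefs and the sample markup are made up for illustration.

```javascript
// Hypothetical helper: collect the value of every double-quoted href attribute.
function extractHrefs(html) {
  var pattern = /href="([^"]*)"/g;
  var hrefs = [];
  var m;
  // With /g, exec() resumes from the previous match each time it is called
  while ((m = pattern.exec(html)) !== null) {
    hrefs.push(m[1]); // m[1] is the capture group, i.e. the URL without quotes
  }
  return hrefs;
}

// Made-up sample resembling the markup in the question
var sampleHtml =
  '<h3 class="r"><a href="http://a.example/">A</a></h3>' +
  '<h3 class="r"><a href="http://b.example/">B</a></h3>';
var hrefs = extractHrefs(sampleHtml);
console.log(hrefs);
```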

Can we prevent assignment to innerHTML from requesting resources?

The following code results in an HTTP request for an image resource in both Firefox and Chrome.
var el = document.createElement('div');
el.innerHTML = "<img src='junk'/>";
As a programmer, I may or may not want el to be rendered. If I don't, then maybe I don't want a request being sent for the src.
dojo.toDom() shows the same behaviour.
Is there any way to get a document fragment from a string without the referenced resources being requested?
Use the DOMParser to create a full document structure from a given string.
Alternatively, use the beforeload event to intercept requests.
It uses much less memory to build DOM elements from strings than to create documentFragments and work with them:
var div = document.createElement('div');
div.innerHTML = 'some text';
document.getElementById('someparent').appendChild(div);
Can be replaced with:
var div = '<div>some text</div>';
document.getElementById('someparent').innerHTML += div;

Given a string of HTML code, how can I go through every tag and remove the ones that are not in my whitelist (in jQuery)?

var whitelist = ['a','div','img', 'span'];
Given a block of HTML code, I want to go through every single tag using jQuery.
Then, if that tag is NOT in my whitelist, remove it and all its children.
The final string should now be sanitized.
How do I do that?
By the way, this is my current code to remove specific tags (but I decided I want to use a whitelist instead):
var $canvas = $('<div>'+canvas_html+'</div>');
var blacklist = ['script','object','param','embed','applet','app','iframe',
'form','input','link','meta','title','button','textarea',
'head','body','kbd'];
blacklist.forEach(function(r){
    // Remove from the one parsed tree; calling $(string) again would build a new tree each time
    $canvas.find(r).remove();
});
canvas_html = $canvas.html();
Try this:
var whitelist = ['a','div','img', 'span'];
var $wrapper = $('<div>'+canvas_html+'</div>');
$wrapper.find('*').each(function() {
    if ($.inArray(this.nodeName.toLowerCase(), whitelist) === -1) {
        $(this).remove();
    }
});
// Read the HTML from the wrapper, not from the matched set
var output = $wrapper.html();
// output contains the HTML with everything except those in the whitelist stripped off
try:
var $canvas = $(canvas);
$canvas.find(':not(' + whitelist.join(', ') + ')').remove();
var output = $canvas.html();
The idea is to turn the whitelist array into "el1, el2, el3" format, then use the :not selector to get the elements that are not in the whitelist, then delete them.
This obviously could be expensive depending on the size of your html and whitelist.
Unfortunately, using jQuery to sanitize HTML in order to prevent XSS is not safe, as jQuery is not just parsing the HTML, but actually creating elements out of it. Even though it doesn't insert these into the DOM, in some cases embedded Javascript will be executed. So, for example, the snippet:
$('<img src="http://i.imgur.com/cncfg.gif" onload="alert(\'gotcha\');"/>')
will trigger an alert.
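One way around that problem is to parse the string with DOMParser, as suggested earlier on this page, so that no live elements are created, and then apply the whitelist with plain DOM calls. The sketch below assumes a browser that supports "text/html" in DOMParser. The names stripDisallowed and sanitizeWithWhitelist are hypothetical, and this is an illustration of the idea rather than a vetted sanitizer.

```javascript
// Sketch only: stripDisallowed and sanitizeWithWhitelist are hypothetical names.

// Removes every descendant of `root` whose tag name is not in `whitelist`.
// Removing an element also removes its children, as the question asks.
function stripDisallowed(root, whitelist) {
  var allowed = whitelist.map(function (tag) { return tag.toLowerCase(); });
  // Snapshot the live collection first, because removal changes it while we iterate
  var elements = Array.prototype.slice.call(root.getElementsByTagName("*"));
  elements.forEach(function (el) {
    if (allowed.indexOf(el.nodeName.toLowerCase()) === -1 && el.parentNode) {
      el.parentNode.removeChild(el);
    }
  });
  return root;
}

// Browser-only: parses into a separate document instead of creating live elements.
function sanitizeWithWhitelist(html, whitelist) {
  var doc = new DOMParser().parseFromString("<div>" + html + "</div>", "text/html");
  var wrapper = doc.body.firstChild;
  return stripDisallowed(wrapper, whitelist).innerHTML;
}
```

Note that attributes such as onload survive on whitelisted tags, so the returned string would still need attribute filtering before being inserted into a live page.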

Is there a getElementsByTagName() like function for javascript string variables?

I can use the getElementsByTagName() function to get a collection of elements from an element in a web page.
I would like to be able to use a similar function on the contents of a javascript string variable instead of the contents of a DOM element.
How do I do this?
EDIT
I can do this by creating an element on the fly.
var myElement = new Element('div');
myElement.innerHTML = "<strong>hello</strong><em>there</em><strong>hot stuff</strong>";
var emCollection = myElement.getElementsByTagName('em');
alert(emCollection.length); // This gives 1
But creating an element on the fly for the convenience of using the getElementsByTagName() function just doesn't seem right and doesn't work with elements in Internet Explorer.
Injecting the string into DOM, as you have shown, is the easiest, most reliable way to do this. If you operate on a string, you will have to take into account all the possible escaping scenarios that would make something that looks like a tag not actually be a tag.
For example, you could have
<button value="<em>"/>
<button value="</em>"/>
in your markup - if you treat it as a string, you may think you have an <em> tag in there, but in actuality, you only have two button tags.
By injecting into DOM via innerHTML you are taking advantage of the browser's built-in HTML parser, which is pretty darn fast. Doing the same via regular expression would be a pain, and browsers don't generally provide DOM like functionality for finding elements within strings.
One other thing you could try would be parsing the string as XML, but I suspect this would be more troublesome and slower than the DOM injection method.
function countTags(html, tagName) {
    var matches = html.match(new RegExp("<" + tagName + "[\\s>]", "ig"));
    return matches ? matches.length : 0;
}
alert(
    countTags(
        "<strong>hello</strong><em>there</em><strong>hot stuff</strong>",
        "em"
    )
); // 1
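The regex count above is quick, but it is subject to the attribute-value pitfall described earlier in this thread. The sketch below repeats the same regex under a hypothetical name, countTagsByRegex, and runs it on the two-button markup from that answer, which contains no em element at all.

```javascript
// Same regex approach as above, repeated here so the example is self-contained.
// countTagsByRegex is a hypothetical name used only for this illustration.
function countTagsByRegex(html, tagName) {
  var matches = html.match(new RegExp("<" + tagName + "[\\s>]", "ig"));
  return matches ? matches.length : 0;
}

// Markup from the earlier answer: two buttons and no <em> element
var tricky = '<button value="<em>"/><button value="</em>"/>';
var regexCount = countTagsByRegex(tricky, "em");
console.log(regexCount); // 1, although the markup contains zero <em> elements
```

The text <em> inside the first value attribute is counted as a tag, which is why parsing into a tree is the more reliable route.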
var domParser = new DOMParser();
var htmlString = "<strong>hello</strong><em>there</em><strong>hot stuff</strong>";
var docElement = domParser.parseFromString(htmlString, "text/html").documentElement;
var emCollection = docElement.getElementsByTagName("em");
for (var i = 0; i < emCollection.length; i++) {
    console.log(emCollection[i]);
}
HTML in a string is nothing special. It's just text in a string. It needs to be parsed into a tree for it to be useful. This is why you need to create an element, then call getElementsByTagName on it, as you show in your example.
