Sanitizing html string with javascript using browser to interpret html

I want to use a whitelist of tags, attributes and values to sanitize an HTML string before I place it in the DOM. Can I safely construct a DOM element and traverse it to implement the whitelist filter, assuming that no malicious JavaScript can execute until I append the element to the document? Are there pitfalls to this approach?

It doesn't appear that anything will execute until you insert into the document, as per @rvighne's answer, but there are at least these (unusual) exceptions (tested in Firefox 27.0):
var userInput = '<a href="http://example.com" onclick="alert(\'boo!\')">Link<\/a>';
var el = document.createElement('div');
el.innerHTML = userInput;
el.addEventListener("click", function(e) {
if (e.target.nodeName.toLowerCase() === 'a') {
alert("I will also cause side effects; I shouldn't run on the wrong link!");
}
});
el.getElementsByTagName('a')[0].click(); // Alerts "boo!" and "I will also cause side effects; I shouldn't run on the wrong link!"
...or...
var userInput = '<a href="http://example.com" onclick="alert(\'boo!\')">Link<\/a>';
var el = document.createElement('div');
el.innerHTML = userInput;
el.addEventListener("cat", function(e) { this.getElementsByTagName('a')[0].click(); });
var event = new CustomEvent("cat", {"detail":{}});
el.dispatchEvent(event); // Alerts "boo!"
...or... (though setUserData is deprecated, it is still working):
var userInput = '<a href="http://example.com" onclick="alert(\'boo!\')">Link<\/a>';
var span = document.createElement('span');
span.innerHTML = userInput;
span.setUserData('key', 10, {handle: function (n1, n2, n3, src) {
src.getElementsByTagName('a')[0].click();
}});
var div = document.createElement('div');
div.appendChild(span);
span.cloneNode(); // Alerts "Boo!"
var imprt = document.importNode(span, true); // Alerts "Boo!"
var adopt = document.adoptNode(span, true); // Alerts "Boo!"
...or during iteration...
var userInput = '<a href="http://example.com" onclick="alert(\'boo!\')">Link<\/a>';
var span = document.createElement('span');
span.innerHTML = userInput;
var treeWalker = document.createTreeWalker(
span,
NodeFilter.SHOW_ELEMENT,
{ acceptNode: function(node) { node.click(); } },
false
);
var nodeList = [];
while(treeWalker.nextNode()) nodeList.push(treeWalker.currentNode); // Alerts 'Boo!'
But absent these kinds of (unusual) event interactions, merely building the markup into a DOM tree would not, as far as I have been able to detect, cause any side effects. (The examples above are of course contrived; one wouldn't expect to encounter them often, if at all.)

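A <script> element embedded in the HTML does not execute when the string is parsed through innerHTML. (Inline event handlers are a separate matter: an element such as <img src=x onerror=...> can start loading while still detached, so its handler may fire; parsing in a separate document, e.g. one created with document.implementation.createHTMLDocument, avoids that.) Try running this code on any page: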
var html = "<script>document.body.innerHTML = '';</script>";
var div = document.createElement('div');
div.innerHTML = html;
You will notice nothing change. If the "malicious" script in the HTML had run, the document would have vanished. So you can use the DOM to sanitize HTML without worrying about <script> elements in the input running, as long as your sanitizer snips them out, of course.
By the way, your approach is pretty safe and smarter than what most people try (parsing it with regex, the poor fools). However, it's best to rely on a good, trusted HTML sanitizing library for this, like HTML Purifier. Or, if you want to do it client-side, you can use ESAPI-JS (recommended by @Brett Zamir).
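The allow-list traversal described in the question can be sketched roughly as follows. This is a minimal illustration, not a vetted sanitizer: ALLOWED, isAllowedAttribute, sanitizeTree and sanitizeHtml are hypothetical names, the rule set is an arbitrary example, and parsing inside a separate document from document.implementation.createHTMLDocument is an assumption made here so that the untrusted markup never lives in the page's own document.

```javascript
// Hypothetical allow-list: tag name -> (attribute name -> value check).
// A Map avoids accidental hits on Object.prototype keys such as "constructor".
const ALLOWED = new Map([
  ['a', new Map([['href', (value) => /^https?:\/\//i.test(value)]])],
  ['b', new Map()],
  ['p', new Map()],
]);

// Pure decision function: may this attribute stay on an element with this tag?
function isAllowedAttribute(tagName, attrName, value) {
  const rules = ALLOWED.get(tagName.toLowerCase());
  if (!rules) return false;
  const check = rules.get(attrName.toLowerCase());
  return typeof check === 'function' && check(value) === true;
}

// Walk a subtree and strip every element or attribute not on the allow-list.
// Note: removing a disallowed element also drops everything inside it.
function sanitizeTree(root) {
  // Snapshot the element list so removals do not disturb the iteration.
  for (const el of Array.from(root.querySelectorAll('*'))) {
    if (!ALLOWED.has(el.tagName.toLowerCase())) {
      el.remove();
      continue;
    }
    // el.attributes is live, so iterate over a copy while removing.
    for (const attr of Array.from(el.attributes)) {
      if (!isAllowedAttribute(el.tagName, attr.name, attr.value)) {
        el.removeAttribute(attr.name);
      }
    }
  }
  return root;
}

// Browser-only entry point (assumption: parse in a separate, inert document).
function sanitizeHtml(html) {
  const doc = document.implementation.createHTMLDocument('');
  doc.body.innerHTML = html;
  return sanitizeTree(doc.body).innerHTML;
}
```

In a browser, passing the link markup from the first answer through sanitizeHtml should keep the <a> element and its href while dropping the onclick attribute. Keeping the decision logic in a pure function makes the policy testable without a DOM.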

You can use a "sandboxed" iframe that won't execute anything.
var iframe = document.createElement('iframe');
iframe['sandbox'] = 'allow-same-origin';
From w3schools:
The sandbox attribute enables an extra set of restrictions for the content in the iframe. When the sandbox attribute is present, it will:
block form submission
block script execution
disable APIs
...
P.S. That's, by the way, exactly how we do it in our HTML sanitizer, https://github.com/jitbit/HtmlSanitizer: we use the browser to interpret the HTML and convert it to a DOM. Feel free to check the code (or actually use the component).
(Disclaimer: I'm a contributor to that OSS project.)

Related

importNode and Microsoft Edge

I have a dynamic page where, with a bar button, I can change the main div content.
Most of the pages are static except one, which contains JavaScript (RGraph charts).
That's why, in order to make it work, I use the following code:
var data = new FormData();
data.append( 'action', 'charts' );
// clean the content
var myNode = document.getElementById("contentView");
while (myNode.firstChild)
{
myNode.removeChild(myNode.firstChild);
}
// set the new content
var div = document.createElement("div");
var t = document.createElement('template');
t.innerHTML = _connectToServer( data );
for (var i=0; i < t.content.childNodes.length; i++)
{
var node = document.importNode(t.content.childNodes[i], true);
div.appendChild(node);
}
document.getElementById("contentView").appendChild(div);
The problem is that, as far as I can see (and from what I have read), this code does not work in Microsoft Edge, and I would like it to work in Edge as well.
What's the best way to achieve that?

Ok, so I took this over to our DOM team to better understand this interop issue.
This is actually a Chrome bug per spec. When you create a template element and set its innerHTML, the parsed content is treated as a DocumentFragment, and a template element's contents can't contain executable script.
Here are the relevant spec links that cover this:
InnerHTML
Template Element
To work around this, as I pointed out earlier, you'll need to create a <script> element that is not inside a <template> element and set its textContent. Here's an example of this: http://jsbin.com/gizayuyape/edit?html,js,output
Or here is the code:
// Using textContent
var body = document.getElementsByTagName('body')[0];
var s = document.createElement('script');
s.textContent = 'document.write("Script textContent");';
body.appendChild(s);
The good news, at least on the interop front, is that Chrome has fixed this issue starting in version 67.

Set script tag from API call in header of index.html

I'm trying to integrate Dynatrace into my React app, and this works fine if I just add the required script tag manually to the head of my index.html.
However, I want to make use of Dynatrace's API, which returns the whole script tag element (so I can use it for different environments).
How can I add the script tag to my index.html after calling the API? Creating a script element from code won't work because the response of the API call is a script tag itself (returned as a string).
I tried creating a div element, setting the script as its innerHTML, and then appending it to the document. But scripts inserted via innerHTML don't get executed.
const wrapperDiv = document.createElement("div");
wrapperDiv.innerHTML = "<script>alert('simple test')</script>";
document.head.appendChild(wrapperDiv.firstElementChild);
Can this be done?
I found a roundabout way of doing this:
const wrapperDiv = document.createElement("div");
const scriptElement = document.createElement("script");
wrapperDiv.innerHTML = "<script src=... type=...></script>";
for(let i = 0; i < wrapperDiv.firstElementChild.attributes.length; i++){
const attr = wrapperDiv.firstElementChild.attributes[i];
scriptElement.setAttribute(attr.name, attr.value);
}
document.head.appendChild(scriptElement);
In this example the script only uses a src, but this can be done with the value as well. If there is a better way of doing this, please let me know.

This can be achieved without use of eval() :
const source = "alert('simple test')";
const wrapperScript = document.createElement("script");
wrapperScript.src = URL.createObjectURL(new Blob([source], { type: 'text/javascript' }));
document.head.appendChild(wrapperScript);
The code above creates a Blob containing your script and then an object URL for it (a URL that refers to a File or Blob object held in browser memory).
This solution relies on the fact that a dynamically added <script> element is evaluated by the browser when it has a src property.
Update:
Since the endpoint returns a <script> tag with some useful attributes, the best solution would be to clone the attributes (including src); your current approach is pretty good.
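The attribute-cloning approach can be wrapped in a small helper, sketched below. The names copyAttributes and injectScriptFromMarkup are hypothetical, and parsing the returned string through a <template> element is an assumption made here (markup parsed that way stays inert until a fresh script element is built from it).

```javascript
// Copy every attribute from one element-like object to another.
function copyAttributes(source, target) {
  for (const { name, value } of Array.from(source.attributes)) {
    target.setAttribute(name, value);
  }
  return target;
}

// Browser-only: turn a '<script ...></script>' string into a script element
// that the browser will actually execute once it is appended.
function injectScriptFromMarkup(markup, doc = document) {
  const template = doc.createElement('template');
  template.innerHTML = markup; // parsed, but not executed
  const parsed = template.content.querySelector('script');
  if (!parsed) {
    throw new Error('No <script> element found in markup');
  }
  const live = copyAttributes(parsed, doc.createElement('script'));
  live.textContent = parsed.textContent; // carries over inline code, if any
  doc.head.appendChild(live);
  return live;
}
```

A call such as injectScriptFromMarkup(responseText), where responseText stands for whatever string the API returned, would then replace the manual loop. Only do this with markup from a source you trust, since the script runs with full page privileges.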

How to prevent resource loading of unattached elements in Chrome

I'm working on Chrome extension and I have following problem:
var myDiv = document.createElement('div');
myDiv.innerHTML = '<img src="a.png">';
What happens now is that Chrome tries to load the "a.png" resource, even if I don't attach the "div" element to the document. Is there a way to prevent it?
In the extension I need to get data from a site that doesn't provide any API, so I have to parse the whole HTML to get the necessary data. Writing my own simple HTML parser could be tricky, so I would rather use the native HTML parser. However, in Chrome, when I put the whole source code into a temporary non-attached element (so that it gets parsed and I can filter out the necessary data), all the images (and possibly other resources) start to load as well, causing higher traffic or (in the case of relative paths) lots of errors in the console.

To prevent the resources from being loaded, you'll need to create your Node in an entirely new #document. You can use document.implementation.createHTMLDocument for this.
var dom = document.implementation.createHTMLDocument(); // make new #document
// now use this to..
var myDiv = dom.createElement('div'); // ..create a <div>
myDiv.innerHTML = '<img src="a.png">'; // ..parse HTML

You can delay parsing/loading of the HTML by storing it in a non-standard attribute and then assigning it to innerHTML "when the time comes":
myDiv.setAttribute('deferredHtml', '<img src="http://upload.wikimedia.org/wikipedia/commons/4/4e/Single_apple.png">');
window.loadDeferredImage = function() {
if(myDiv.hasAttribute('deferredHtml')) {
myDiv.innerHTML = myDiv.getAttribute('deferredHtml');
myDiv.removeAttribute('deferredHtml');
}
};
... onclick="loadDeferredImage()"
I created a jsfiddle illustrating this idea:
http://jsfiddle.net/akhikhl/CbCst/3/

Multiple HTML DOMs - Parse and Transfer Data

I am requesting full HTML5 documents via Ajax using jQuery. I want to be able to parse them and transfer elements to my main page DOM, ideally with all major browsers, including mobile. I don't want to create an iframe as I want the process to be as quick as possible. With Chrome & Firefox I can do the following:
var contents = $(document.createElement('html'));
contents[0].innerHTML = data; // data : HTML document string
This will create a proper document, somewhat surprisingly, just without a doctype. In IE9, however, one may not use innerHTML to set the contents of the html element. I tried the following, without any luck:
Create a DOM, open it, write to it and close it. Issue: on doc.open, IE9 throws an exception called "Unspecified error".
var doc = document.implementation.createHTMLDocument('');
doc.open();
doc.write(data);
doc.close();
Create an ActiveX DOM. This time the result is better, but upon transferring / copying elements between documents IE9 crashes. It is also bad because there is no IE8 support (for adoptNode / importNode).
var doc = new ActiveXObject('htmlfile');
doc.open();
doc.write(data);
doc.close();
contents = $(doc.documentElement);
document.adoptNode(contents);
I was thinking about recursively recreating the elements instead of transferring them between my documents, but that seems like an expensive task, given that I can have a lot of nodes to transfer. I like my last ActiveX example, as it will most likely work in IE8 and earlier (for parsing, at least).
Any ideas on this? Again, not only do I need to be able to parse the head and body, but I also need to be able to append these new elements to my main DOM.
Thanks much!

Answering my own question... To solve my issue I used all solutions mentioned in my post, with try/catch blocks if a browser throws an error (oh, how we love thee IE!). The following works in IE8, IE9, Chrome 23, Firefox 17, iOS 4 and 5, Android 3 & 4. I have not tested Android 2.1-2.3 and IE7.
var contents = $('');
try {
contents = $(document.createElement('html'));
contents[0].innerHTML = data;
}
catch(e) {
try {
var doc = document.implementation.createHTMLDocument('');
doc.open();
doc.write(data);
doc.close();
contents = $(doc.documentElement);
}
catch(e) {
var doc = new ActiveXObject('htmlfile');
doc.open();
doc.write(data);
doc.close();
contents = $(doc.documentElement);
}
}
At this point we can find elements using jQuery. Transferring them to a different DOM creates a bit of a problem. There are a couple of methods that do this (importNode and adoptNode), but they are not widely supported yet and/or are buggy. Given that our selector string is called 'selector', below I re-create the found elements and append them to '.someDiv'.
var fnd = contents.find(selector);
if(fnd.length) {
var newSelection = $('');
fnd.each(function() {
var n = document.createElement(this.tagName);
var attr = $(this).prop('attributes');
n.innerHTML = this.innerHTML;
$.each(attr,function() { $(n).attr(this.name, this.value); });
newSelection.push(n);
});
$('.someDiv').append(newSelection);
};

parsing HTML with Firefox

uri = 'http://www.nytimes.com/';
searchuri = 'http://www.google.com/search?';
searchuri += 'q='+ encodeURIComponent(uri) +'&btnG=Search+Directory&hl=en&cat=gwd%2FTop';
req = new XMLHttpRequest();
req.open('GET', searchuri, true);
req.onreadystatechange = function (aEvt) {
if (req.readyState == 4) {
if(req.status == 200) {
searchcontents = req.responseText;
myHTML = searchcontents;
var tempDiv = document.createElement('div');
tempDiv.innerHTML = myHTML.replace(/<script(.|\s)*?\/script>/g, '');
parsedHTML = tempDiv;
sitefound = sc_sitefound(uri, parsedHTML);
}
}
};
req.send(null);
function sc_sitefound(uri, parsedHTML) {
alert(parsedHTML);
gclasses = parsedHTML.getElementsByClassName('g');
for (var gclass in gclasses) {
atags = gclass.getElementsByTagName('a');
alert(atags);
tag1 = atags[0];
htmlattribute1 = tag1.getAttribute('html');
if (htmlattribute1 == uri) {
sitefound = htmlattribute1;
return sitefound;
}
}
return null;
}
parsedHTML is a XULElement
gclasses is an HTMLCollection
If there are many divs of class g in the Google Directory search results, why is gclasses empty?

var tempDiv = document.createElement('div');
If you're in an XUL environment, that's not creating an HTML element node: it'll be an XUL element. Since the innerHTML property is exclusive to HTMLElement and not other XML Elements, setting innerHTML on tempDiv will do nothing (other than adding a custom property containing the HTML string). Consequently there are no elements with class ‘g’ inside tempDiv... there are no elements at all inside it.
If you have a plain HTML document loaded in the browser, you could try using content.document.createElement to get an HTML wrapper element on which innerHTML will be available. This still isn't a brilliant way to parse a whole page of HTML because the document in question might have <head> content you can't put in a div, and HTTP headers that you'll be throwing away. Probably better to load the target file into an HTMLDocument object of its own. A good way to do that would be using an iframe. See this page for examples of both these approaches.
tempDiv.innerHTML = myHTML.replace(/<script(.|\s)*?\/script>/g, '');
It's seven shades of not-a-good-idea to process HTML with regex; this could go wrong in many ways when Google slightly change their page markup. Let the browser do the job of parsing instead. Setting innerHTML does not cause script elements to be executed straight away (further DOM manipulations can though); you can pick out the unwanted script elements later, if you need to. With the XUL iframe approach you can simply disable JavaScript on the iframe.
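To make the regex objection concrete, here is a small check that runs without a DOM. It applies the same pattern as the question to three inputs; the stripScripts name and the sample strings are made up for illustration.

```javascript
// Same pattern as in the question: case-sensitive, script elements only.
const stripScripts = (html) => html.replace(/<script(.|\s)*?\/script>/g, '');

const lower = '<p>hi</p><script>alert(1)</script>';
const upper = '<p>hi</p><SCRIPT>alert(1)</SCRIPT>';
const handler = '<img src="x" onerror="alert(1)">';

console.log(stripScripts(lower));   // <p>hi</p>
console.log(stripScripts(upper));   // unchanged: the pattern has no "i" flag
console.log(stripScripts(handler)); // unchanged: no script element involved
```

Only the first input is cleaned. HTML tag names are case-insensitive and inline handlers need no script element at all, which is why a parser-based approach is the safer route.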
for (var gclass in gclasses) {
The for...in loop is for use against Objects used as mappings. It should not be used for iterating a sequence (such as Array, NodeList or in this case HTMLCollection) as it doesn't do what you might expect. For iterating sequences, stick to the standard C-style for (var i= 0; i<sequence.length; i++) loop.
You could also do with adding var declarations for all your other local variables.
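The for...in point can be checked outside a browser with a plain array-like object. The collection below is only a stand-in for an HTMLCollection (an assumption for the sake of a runnable sketch); a real HTMLCollection behaves similarly and exposes further enumerable members.

```javascript
// Stand-in for an HTMLCollection so the sketch runs outside a browser.
const collection = { 0: 'first', 1: 'second', length: 2 };

// for...in visits property names (as strings), including "length".
const seenByForIn = [];
for (const key in collection) {
  seenByForIn.push(key);
}

// The index loop visits exactly the items.
const seenByIndex = [];
for (let i = 0; i < collection.length; i++) {
  seenByIndex.push(collection[i]);
}

console.log(seenByForIn); // [ '0', '1', 'length' ]
console.log(seenByIndex); // [ 'first', 'second' ]
```

In the question's loop, gclass is therefore a string such as "0" rather than an element, so gclass.getElementsByTagName is undefined and calling it throws a TypeError.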
