Parse and handle DOM that came as a string input

Here is the thing,
I have a textarea (with ID "input_container") full of HTML code, the simple example is:
<!doctype html>
<html>
<head></head>
<body>
the other place
</body>
</html>
How can I parse it and use it in jQuery as a legitimate DOM?
For example, to do
$(varWithDom).find(...)
with that DOM?
What I already tried
I tried to parse it using jQuery but a funny thing happened - jQuery removed the DOCTYPE and all of the HEAD, and left me with nothing but
the other place
My original method is here: jQuery HTML parser is removing some tags without a warning, why and how to prevent it?
I haven't found a solution yet. Any ideas? Might this be a bug in jQuery?

If you need the entire content as elements, you might try using an iframe.
// create and append new iframe
var iframe = document.createElement('iframe');
document.documentElement.appendChild(iframe);
// set its innerHTML
iframe.contentWindow.document.documentElement.innerHTML = varWithDom;
// grab the `window`
var win = iframe.contentWindow;
// remove the iframe
document.documentElement.removeChild(iframe);
Demo that grabs the head: http://jsfiddle.net/K6tR2/
original answer
It isn't so much jQuery removing it as it is the browser. This behavior will vary in different browsers.
One thing you might try would be to place the entire thing in a <div>, so that becomes your context...
$('<div>' + varWithDom + '</div>').find(...)
Now it won't really matter what is stripped away (unless you actually needed something in the <head>), because it will all be descendant of the outer div.
If you didn't want that, then you'd need to do your query twice, once with .find(), and once with .filter()...
var els = $( varWithDom );
var links = els.find( 'a[href]' ).add( els.filter( 'a[href]' ) );
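Another option, not mentioned above, is the browser's built-in DOMParser, which parses a complete document string (doctype, <head> and <body> included) into a separate Document without attaching anything to the page. The sketch below is a minimal wrapper under stated assumptions: parseHtmlDocument is a hypothetical helper name, and the Parser parameter exists only so the function can be exercised outside a browser.

```javascript
// Hypothetical helper: parse a complete HTML string into a detached Document.
// Assumes a browser-like environment that provides DOMParser.
function parseHtmlDocument(html, Parser = globalThis.DOMParser) {
  if (typeof Parser !== 'function') {
    throw new Error('DOMParser is not available in this environment');
  }
  // 'text/html' selects the HTML parser, so <head> and <body> are both kept.
  return new Parser().parseFromString(html, 'text/html');
}

// Usage in a browser (assumes jQuery is loaded and the textarea from the
// question exists):
//   var doc = parseHtmlDocument($('#input_container').val());
//   $(doc).find('body').text();  // query the parsed document like any other DOM
```

Scripts inside a document parsed this way are not executed, which makes it a reasonable fit for inspecting markup that came from user input.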

Related

Remove all elements from website except X

I'm not really familiar with Javascript, and even less with how Javascript works in Chrome's F12 developer tools. What I'm trying to do is have a favorite which, when clicked on, loads a web page but removes some of the clutter of the page which is loaded (I don't really care if it removes it before the page is loaded, or loads it and then removes it)
For now, I'm trying to figure out how to remove all elements except the one I want to keep (and its children), namely the one with the following HTML:
<div>
<ul class="c-list-news u-relative" data-load-more-content>...</ul>
</div>
I'm trying the following (from what I could find on SO), but I can't find the right selector (or I'm doing something else wrong, not quite sure):
var elem = document.querySelectorAll('body *:not(div ul.c-list-news, div ul.c-list-news *)');
for (var i = 0; i < elem.length; i++) {
    elem[i].parentElement.removeChild(elem[i]);
}
(PS : I haven't yet looked into how to put it into a favorite/extension, it will come later)

It's probably easier than you realize. :-) You can get the first element matching .c-list-news like this:
const cListNews = document.querySelector(".c-list-news");
If you want to keep its parent, just add .parentNode to that:
const divContainer = document.querySelector(".c-list-news").parentNode;
Then, wipe out body entirely:
document.body.innerHTML = "";
...and put the element back:
document.body.appendChild(cListNews); // Or `divContainer`
I'm not sure I'd expect the page to continue to be readable, though, since of course this completely changes where the element is in the DOM, which may well make the CSS fail.
You can't make a bookmark (favorite) that both loads the page and does this in one go, because javascript: bookmarks work within the context of the current page. You could use something like TamperMonkey which is an extension that lets you run a script automatically when you go to matching URLs.
But you can make a bookmark that you use when you're already on the page: Just use the javascript: pseudo-protocol and follow it with JavaScript code. For instance:
javascript:var divContainer %3D document.querySelector(".c-list-news").parentNode%3Bdocument.body.innerHTML %3D ""%3Bdocument.body.appendChild(divContainer)%3Bconsole.log("done")%3B
I created that by simply removing line breaks from the code (optional), running the code through encodeURIComponent, and putting javascript: on the front. (Some folks would also convert spaces to %20.)
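The conversion described above can be sketched as a small function. This is only an illustration: toBookmarklet is a hypothetical name, and it assumes every statement ends with a semicolon, since line breaks are dropped without adding any separator.

```javascript
// Hypothetical helper: turn a multi-line script into a javascript: URL.
// Assumes every statement ends with a semicolon, because line breaks are
// removed without inserting any separator.
function toBookmarklet(source) {
  const oneLine = source
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .join('');
  return 'javascript:' + encodeURIComponent(oneLine);
}

const bookmarklet = toBookmarklet(`
  var divContainer = document.querySelector(".c-list-news").parentNode;
  document.body.innerHTML = "";
  document.body.appendChild(divContainer);
`);
console.log(bookmarklet);
```

encodeURIComponent also escapes spaces and double quotes, so the printed URL is longer than a hand-trimmed one but decodes to the same code.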

Save the element to keep to a variable. Remove all nodes from the body, or the element that you want, and add the element to keep. Example:
let elementToKeep = document.getElementById('side');
const myNode = document.getElementsByTagName("body")[0];
while (myNode.firstChild) {
    myNode.removeChild(myNode.firstChild);
}
myNode.appendChild(elementToKeep);
Using the removeChild method can be faster than setting innerHTML to an empty string.
Check here: Remove all child elements of a DOM node in JavaScript

What's wrong with document.write? What's a viable alternative?

In tutorials I've learnt to use document.write. Now I understand that by many this is frowned upon. I've tried print(), but then it literally sends it to the printer.
So what are alternatives I should use, and why shouldn't I use document.write? Both w3schools and MDN use document.write.

The reason that your HTML is replaced is because of an evil JavaScript function: document.write().
It is most definitely "bad form." It only works as expected while the page is still loading; if you call it after the page has loaded, it replaces your entire document with the input. And it does not work at all in XHTML documents served as XML.
the problem:
document.write writes to the document stream. Calling document.write on a closed (or loaded) document automatically calls document.open which will clear the document.
-- quote from the MDN
document.write() has two henchmen, document.open(), and document.close(). When the HTML document is loading, the document is "open". When the document has finished loading, the document has "closed". Using document.write() at this point will erase your entire (closed) HTML document and replace it with a new (open) document. This means your webpage has erased itself and started writing a new page - from scratch.
I believe document.write() causes the browser to have a performance decrease as well (correct me if I am wrong).
an example:
This example writes output to the HTML document after the page has loaded. Watch document.write()'s evil powers clear the entire document when you press the "exterminate" button:
I am an ordinary HTML page. I am innocent, and purely for informational purposes. Please do not <input type="button" onclick="document.write('This HTML page has been successfully exterminated.')" value="exterminate"/>
me!
the alternatives:
.innerHTML This is a wonderful alternative, but this attribute has to be attached to the element where you want to put the text.
Example: document.getElementById('output1').innerHTML = 'Some text!';
.createTextNode() is the alternative recommended by the W3C.
Example: var para = document.createElement('p');
para.appendChild(document.createTextNode('Hello, '));
NOTE: This is known to have some performance decreases (slower than .innerHTML). I recommend using .innerHTML instead.
the example with the .innerHTML alternative:
I am an ordinary HTML page.
I am innocent, and purely for informational purposes.
Please do not
<input type="button" onclick="document.getElementById('output1').innerHTML = 'There was an error exterminating this page. Please replace <code>.innerHTML</code> with <code>document.write()</code> to complete extermination.';" value="exterminate"/>
me!
<p id="output1"></p>

Here is code that should replace document.write in-place:
document.write = function(s) {
    var scripts = document.getElementsByTagName('script');
    var lastScript = scripts[scripts.length - 1];
    lastScript.insertAdjacentHTML("beforebegin", s);
};

You can combine insertAdjacentHTML method and document.currentScript property.
The insertAdjacentHTML() method of the Element interface parses the specified text as HTML or XML and inserts the resulting nodes into the DOM tree at a specified position:
'beforebegin': Before the element itself.
'afterbegin': Just inside the element, before its first child.
'beforeend': Just inside the element, after its last child.
'afterend': After the element itself.
The document.currentScript property returns the <script> element whose script is currently being processed. The best position is beforebegin: the new HTML is inserted before the <script> itself. To match document.write's native behavior, one would position the text afterend, but then the nodes from consecutive calls to the function are not placed in the order in which you called them (as document.write does), but in reverse. The order in which your HTML appears is probably more important than where it is placed relative to the <script> tag, hence the use of beforebegin.
document.currentScript.insertAdjacentHTML(
'beforebegin',
'This is a document.write alternative'
)

As a recommended alternative to document.write you could use DOM manipulation to directly query and add node elements to the DOM.

Just dropping a note here to say that, although using document.write is highly frowned upon due to performance concerns (synchronous DOM injection and evaluation), there is also no actual 1:1 alternative if you are using document.write to inject script tags on demand.
There are a lot of great ways to avoid having to do this (e.g. script loaders like RequireJS that manage your dependency chains) but they are more invasive and so are best used throughout the site/application.

I fail to see the problem with document.write. If you are using it before the onload event fires, as you presumably are, to build elements from structured data for instance, it is the appropriate tool to use. There is no performance advantage to using insertAdjacentHTML or explicitly adding nodes to the DOM after it has been built. I just tested it three different ways with an old script I once used to schedule incoming modem calls for a 24/7 service on a bank of 4 modems.
By the time it is finished this script creates over 3000 DOM nodes, mostly table cells. On a 7 year old PC running Firefox on Vista, this little exercise takes less than 2 seconds using document.write from a local 12kb source file and three 1px GIFs which are re-used about 2000 times. The page just pops into existence fully formed, ready to handle events.
Using insertAdjacentHTML is not a direct substitute as the browser closes tags which the script requires remain open, and takes twice as long to ultimately create a mangled page. Writing all the pieces to a string and then passing it to insertAdjacentHTML takes even longer, but at least you get the page as designed. Other options (like manually re-building the DOM one node at a time) are so ridiculous that I'm not even going there.
Sometimes document.write is the thing to use. The fact that it is one of the oldest methods in JavaScript is not a point against it, but a point in its favor - it is highly optimized code which does exactly what it was intended to do and has been doing since its inception.
It's nice to know that there are alternative post-load methods available, but it must be understood that these are intended for a different purpose entirely; namely modifying the DOM after it has been created and memory allocated to it. It is inherently more resource-intensive to use these methods if your script is intended to write the HTML from which the browser creates the DOM in the first place.
Just write it and let the browser and interpreter do the work. That's what they are there for.
PS: I just tested using an onload param in the body tag and even at this point the document is still open and document.write() functions as intended. Also, there is no perceivable performance difference between the various methods in the latest version of Firefox. Of course there is a ton of caching probably going on somewhere in the hardware/software stack, but that's the point really - let the machine do the work. It may make a difference on a cheap smartphone though. Cheers!

The question depends on what you are actually trying to do.
Usually, instead of calling document.write you can use someElement.innerHTML or, better, document.createElement together with someElement.appendChild.
You can also consider using a library like jQuery and using the modification functions in there: http://api.jquery.com/category/manipulation/

This is probably the most correct, direct replacement: insertAdjacentHTML.

Try to use getElementById() or getElementsByName() to access a specific element and then to use innerHTML property:
<html>
<body>
<div id="myDiv1"></div>
<div id="myDiv2"></div>
<script type="text/javascript">
    // The script sits inside <body>, after the elements it looks up
    var myDiv1 = document.getElementById("myDiv1");
    var myDiv2 = document.getElementById("myDiv2");
    myDiv1.innerHTML = "<b>Content of 1st DIV</b>";
    myDiv2.innerHTML = "<i>Content of second DIV element</i>";
</script>
</body>
</html>

Use
var documentwrite = (value, method = "", display = "") => {
    var x;
    switch (display) {
        case "block":
            x = document.createElement("p");
            break;
        case "inline":
            x = document.createElement("span");
            break;
        default:
            x = document.createElement("p");
    }
    var t = document.createTextNode(value);
    x.appendChild(t);
    if (method == "") {
        document.body.appendChild(x);
    }
    else {
        document.querySelector(method).appendChild(x);
    }
};
and call the function based on your requirement as below
documentwrite("My sample text"); //print value inside body
documentwrite("My sample text inside id", "#demoid", "block"); // print value inside id and display block
documentwrite("My sample text inside class", ".democlass","inline"); // print value inside class and display inline

I'm not sure if this will work exactly, but I thought of
var docwrite = function(doc) {
document.write(doc);
};
This solved the problem with the error messages for me.

Add tags to <head> reliably in Javascript

I am writing a program that does the following:
Creates an iframe in the DOM
Makes an AJAX request to a page (a site's main page)
If the page has changed, I assign it to the iframe with iframe.srcdoc = contents;, where contents is what came back from the AJAX request
Note that this way any image etc. with a relative URL specified will not render correctly. To make it look right, I have to add a <base> tag to <head>.
I am very reluctant to use regexp like this:
contents = contents.replace('<head>','<head><base href="http://www.example.com/">');
Because it might stuff things up (but, am I being way too overcautious and over-paranoid?).
NOTE: I cannot do this by manipulating DOM: if I do iframe.srcdoc = contents; and then add the <base> tag, the page will still render incorrectly. The <base> tag needs to be there before I assign it to iframe.srcdoc...
How would you go about this?
Merc.

Use appendChild DOM operation to add the element.
document.getElementsByTagName('head')[0].appendChild('<base href="http://www.site.com" />');

In my humble opinion, passing a string to appendChild is not a good idea (it expects a node, not markup), so here is my approach.
// create new "base"-node
var node = document.createElement('base');
// set href="http://www.site.com"
node.setAttribute('href', 'http://www.site.com');
// append new "base"-node to first "head"-node in html-document
document.getElementsByTagName('head')[0].appendChild(node);
see W3-School The HTML DOM (Document Object Model) for details about DOM-Manipulation, DOM-Understanding and Javascript-Reference.
A solution that injects the "base" tag by string manipulation (without building a DOM first) is a regex that inserts the base tag just before the closing head tag.
contents = contents.replace(/<\/head>/ig, '<base href="http://www.site.com" />$&');
Another solution is to use jQuery to construct an "offline DOM" from the contents of the iframe and then apply DOM manipulation methods.
contents = jQuery(contents).find('head:first').append('<base ... />').html()
// no guarantee that this will work ;-) it is just off the top of my head, but it should work.
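A string-based variant that places the tag right after the opening <head> (so that it precedes any element carrying a relative URL) could look like the sketch below. It rests on assumptions: injectBase is a hypothetical helper, the regex expects a literal <head> tag whose attribute values contain no ">" character, and the input is returned unchanged when no <head> is found.

```javascript
// Hypothetical helper: insert a <base> tag right after the opening <head> tag.
// Assumes the markup contains a literal <head ...> tag and that none of its
// attribute values contain a '>' character.
function injectBase(html, href) {
  // Escape characters that would break out of the attribute value.
  const safeHref = String(href).replace(/&/g, '&amp;').replace(/"/g, '&quot;');
  const baseTag = '<base href="' + safeHref + '">';
  // Matches <head> and <head attr="...">, but not <header>.
  const headOpen = /<head(\s[^>]*)?>/i;
  if (!headOpen.test(html)) {
    return html; // no <head> found: leave the input untouched
  }
  // A replacer function avoids special handling of '$' in the inserted text.
  return html.replace(headOpen, (match) => match + baseTag);
}

// Usage in a browser, with the names from the question:
//   iframe.srcdoc = injectBase(contents, 'http://www.example.com/');
```

This keeps the whole operation on the string, which matches the constraint in the question that the tag must be present before the markup is assigned to srcdoc.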

Append html to jQuery element without running scripts inside the html

I have written some code that takes a string of html and cleans away any ugly HTML from it using jQuery (see an early prototype in this SO question). It works pretty well, but I stumbled on an issue:
When using .append() to wrap the html in a div, all script elements in the code are evaluated and run (see this SO answer for an explanation why this happens). I don't want this, I really just want them to be removed, but I can handle that later myself as long as they are not run.
I am using this code:
var wrapper = $('<div/>').append($(html));
I tried to do it this way instead:
var wrapper = $('<div>' + html + '</div>');
But that just brings forth the "Access denied" error in IE that the append() function fixes (see the answer I referenced above).
I think I might be able to rewrite my code to not require a wrapper around the html, but I am not sure, and I'd like to know if it is possible to append html without running scripts in it, anyway.
My questions:
How do I wrap a piece of unknown html
without running scripts inside it,
preferably removing them altogether?
Should I throw jQuery out the window
and do this with plain JavaScript and
DOM manipulation instead? Would that help?
What I am not trying to do:
I am not trying to put some kind of security layer on the client side. I am very much aware that it would be pointless.
Update: James' suggestion
James suggested that I should filter out the script elements, but look at these two examples (the original first, then James' suggestion):
jQuery("<p/>").append("<br/>hello<script type='text/javascript'>console.log('gnu!'); </script>there")
keeps the text nodes but writes gnu!
jQuery("<p/>").append(jQuery("<br/>hello<script type='text/javascript'>console.log('gnu!'); </script>there").not('script'))
Doesn't write gnu!, but also loses the text nodes.
Update 2:
James has updated his answer and I have accepted it. See my latest comment to his answer, though.

How about removing the scripts first?
var wrapper = $('<div/>').append($(html).not('script'));
Create the div container
Use plain JS to put html into div
Remove all script elements in the div
Assuming script elements in the html are not nested in other elements:
var wrapper = document.createElement('div');
wrapper.innerHTML = html;
$(wrapper).children().remove('script');
If script elements may be nested inside other elements, use find() instead:
var wrapper = document.createElement('div');
wrapper.innerHTML = html;
$(wrapper).find('script').remove();
This works for the case where html is just text and where html has text outside any elements.

You should remove the script elements:
var wrapper = $('<div/>').append($(html).remove("script"));
Second attempt:
node-validator can be used in the browser:
https://github.com/chriso/node-validator
var str = sanitize(large_input_str).xss();
Alternatively, PHPJS has a strip_tags function (regex/evil based):
http://phpjs.org/functions/strip_tags:535

The scripts in the HTML kept executing for me with all the simple methods mentioned here. Then I remembered that jQuery has a tool for this (since 1.8): jQuery.parseHTML. There is still a catch: according to the documentation, event handlers in attributes (e.g. <img onerror>) will still run.
This is what I'm using:
var $dom = $($.parseHTML(d));
$dom will be a jquery object with the elements found

Dynamically inserting javascript into HTML that uses document.write

I am currently loading a lightbox-style popup that loads its HTML from an XHR call. This content is then displayed in a 'modal' popup using element.innerHTML = content. This works like a charm.
In another section of this website I use a Flickr 'badge' (http://www.elliotswan.com/2006/08/06/custom-flickr-badge-api-documentation/) to load Flickr images dynamically. This is done by including a script tag that loads a Flickr JavaScript file, which in turn executes some document.write statements.
Both of them work perfectly when included in the HTML. Only when loading the flickr badge code inside the lightbox, no content is rendered at all. It seems that using innerHTML to write document.write statements is taking it a step too far, but I cannot find any clue in the javascript implementations (FF2&3, IE6&7) of this behavior.
Can anyone clarify if this should or shouldn't work? Thanks.

In general, script tags aren't executed when using innerHTML. In your case, this is good, because the document.write call would wipe out everything that's already in the page. However, that leaves you without whatever HTML document.write was supposed to add.
jQuery's HTML manipulation methods will execute scripts in HTML for you, the trick is then capturing the calls to document.write and getting the HTML in the proper place. If it's simple enough, then something like this will do:
var content = '';
document.write = function(s) {
    content += s;
};
// execute the script
$('#foo').html(markupWithScriptInIt);
$('#foo .whereverTheDocumentWriteContentGoes').html(content);
It gets complicated though. If the script is on another domain, it will be loaded asynchronously, so you'll have to wait until it's done to get the content. Also, what if it just writes the HTML into the middle of the fragment without a wrapper element that you can easily select? writeCapture.js (full disclosure: I wrote it) handles all of these problems. I'd recommend just using it, but at the very least you can look at the code to see how it handles everything.
EDIT: Here is a page demonstrating what sounds like the effect you want.
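The capture idea from the snippet above can be made a little safer by restoring the original document.write once the script has run. The following is a minimal sketch, not the writeCapture.js implementation: captureWrites is a hypothetical helper, and it only covers scripts that run synchronously inside the callback.

```javascript
// Hypothetical helper: temporarily replace doc.write, run a callback, and
// return everything that was written. Only synchronous writes are captured.
function captureWrites(doc, run) {
  const originalWrite = doc.write;
  const chunks = [];
  // document.write accepts several arguments and concatenates them.
  doc.write = function (...parts) {
    chunks.push(parts.join(''));
  };
  try {
    run();
  } finally {
    // Put the original method back even if the callback throws.
    doc.write = originalWrite;
  }
  return chunks.join('');
}

// Usage in a browser (assumes jQuery and the markup from the answer above):
//   var html = captureWrites(document, function () {
//     $('#foo').html(markupWithScriptInIt);
//   });
//   $('#foo .whereverTheDocumentWriteContentGoes').html(html);
```

Passing the document in as a parameter keeps the helper easy to try out with a stand-in object; scripts loaded asynchronously from another domain would still need the extra handling described above.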

I created a simple test page that illustrates the problem:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<title>Document Write Testcase</title>
</head>
<body>
<div id="container">
</div>
<div id="container2">
</div>
<script>
// This doesn't work!
var container = document.getElementById('container');
container.innerHTML = "<script type='text/javascript'>alert('foo');document.write('bar');<\/script>";
// This does!
var container2 = document.getElementById('container2');
var script = document.createElement("script");
script.type = 'text/javascript';
script.innerHTML = "alert('bar');document.write('foo');";
container2.appendChild(script);
</script>
</body>
</html>
This page alerts 'bar' and prints 'foo', while I expected it to also alert 'foo' and print 'bar'. But, unfortunately, since the script tag is part of a larger HTML page, I cannot single out that tag and append it like the example above. Well, I can, but that would require scanning innerHTML content for script tags, and replacing them in the string by placeholders, and then inserting them using the DOM. Sounds not that trivial.

Use document.writeln(content); instead of document.write(content).
However, the better method is using the concatenation of innerHTML, like this:
element.innerHTML += content;
Note that element.innerHTML = content; replaces the element's old content with the new content.
Using the += operator, as in element.innerHTML += content, appends your text after the old content instead (similar to what document.write does).

document.write is about as deprecated as they come. Thanks to the wonders of JavaScript, though, you can just assign your own function to the write method of the document object which uses innerHTML on an element of your choosing to append the supplied content.

Can I get some clarification first to make sure I get the problem?
document.write adds content at the parser's current insertion point, which is just after the <script> element that is executing when the call runs. So if document.write sits inside a function that is defined in one script but called from another, the output appears where the function is called, not where it is defined.
Therefore for this to work at all the Flickr document.write statements will need to be part of the content in element.innerHTML = content. Is this definitely the case?
You might quickly test if this should work at all by adding a single and simple document.write call in the content that is set as the innerHTML and see what this does:
<script>
var content = "<p>1st para</p><script>document.write('<p>2nd para</p>');<\/script>";
element.innerHTML = content;
</script>
If that works, the concept of document.write working in content set as the innerHTML of an element might just work.
My gut feeling is that it won't work, but it should be pretty straightforward to test the concept.

So you're using a DOM method to create a script element and append that to an existing element and this then causes the content of the appended script element to execute? That sounds good.
You say that the script tag is part of a larger HTML page and therefore cannot be singled out. Can you not give the script tag an ID and target it? I'm probably missing something obvious here.

In theory, yes, I can single out a script tag that way. The problem is that we potentially have dozens of situations where this occurs, so I am trying to find some cause or documentation of this behavior.
Also, the script tag does not seem to be a part of the DOM anymore after it gets loaded. In our environment, my container div remains empty, so I cannot fetch the script tag. It should work, though, because in my example above the script does not get executed, but is still part of the DOM.
