How to load and parse HTML without modifying its contents - javascript

There are many ways to parse and traverse HTML4 files, using many technologies, but I cannot find a suitable one that can save that DOM back to a file.
I want to be able to load an HTML file into a DOM, change one small thing (e.g. an attribute's value), save the DOM to file again and when diffing the source file and the created file, I want them to be completely identical, except that small change.
This kind of task is absolutely no problem when working with XML and suitable XML libraries, but when it comes to HTML there are several issues: whitespace such as indentation or line breaks gets lost or inserted, self-closing start tags (such as <link...>) emerge as <link.../>, and/or the content of CDATA sections (e.g. between <script> and </script>) gets wrapped in <![CDATA[ ]]>. These things are critical in my case.
Which way can I go to load, traverse, manipulate and save HTML without the drawbacks described above, most importantly without whitespace text nodes to be altered?
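Since most DOM serializers normalize markup on output, one pragmatic workaround is to skip reserialization entirely: locate the value in the raw text and patch it in place, leaving every other byte untouched. A minimal JavaScript sketch (the function name and regex are mine, and it assumes the attribute is double-quoted and appears once):

```javascript
// Change the value of a single attribute in raw HTML text without
// reserializing the document, so every other byte survives a diff.
function replaceAttribute(html, tagName, attrName, newValue) {
  const pattern = new RegExp(
    '(<' + tagName + '\\b[^>]*?\\s' + attrName + '=")([^"]*)(")'
  );
  return html.replace(pattern, '$1' + newValue + '$3');
}

const source = '<html>\n  <body>\n    <img src="old.png" alt="x">\n  </body>\n</html>\n';
const edited = replaceAttribute(source, 'img', 'src', 'new.png');
// Only the attribute value differs; whitespace and tags are untouched.
```

This sidesteps the parser entirely, so it only works for edits you can express as local text substitutions, but the diff guarantee is absolute.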

There is a comparison of HTML parsers you could consult. If you want to get really serious, leave out the GUI and go headless; there is an SO example using PhantomJS.

I am going with the HTML Agility Pack.
Loading and saving does not alter anything other than the invalid parts.

Related

PHP - safely allow users to save html

I'm writing a website that allows users to paste in HTML for their blogs.
The HTML they paste in will then get saved to a file, and this is what will be read and changed when the user makes changes. The file will almost be a complete web page, so it will have all your normal tags: head, body, divs, and so on.
This means it should allow almost all html and css tags apart from anything that could cause a security breach. So it essentially needs to strip php tags, certain style tags, and html/javascript script tags.
I looked into the strip_tags function but I'd rather not use that because:
a) it removes html comments which I'd rather keep and
b) it would be a lot of work specifying all the tags that it needs to ignore, considering I want it to ignore vastly more tags than I want it to strip.
My guess is that this is something regex-esque using preg_replace?
I'd like to add; I'm newly aware of XSS attacks through CSS too so any ideas/thoughts on how I could block certain css style tags out would be wonderful :)
Any ideas on what I could do?
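The question asks about PHP, but the filtering logic is language-agnostic; here it is sketched in JavaScript to match the rest of this thread. The regexes and function name are illustrative, and a real deployment should prefer a dedicated sanitizer library, since regex filters are notoriously bypassable:

```javascript
// Blocklist sketch: strip the dangerous constructs the question lists
// (script blocks, PHP tags, inline event handlers) while leaving all
// other markup, including HTML comments, untouched.
function stripDangerous(html) {
  return html
    .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '')   // script blocks
    .replace(/<\?php[\s\S]*?\?>/gi, '')                      // PHP tags
    .replace(/\son\w+\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)/gi, '');// onclick= etc.
}

const input = '<!-- note --><div onclick="evil()">hi</div><script>bad()</script>';
stripDangerous(input); // '<!-- note --><div>hi</div>'
```

Note that, unlike strip_tags, this keeps the comment intact and never has to enumerate the tags you want to allow.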

Serialization of the full page DOM. Can I get at the JS code that is loaded up, or must I AJAX it separately?

I have a bug I'm trying to track down, and it is very difficult to do so because of the complexity of the web app. There are many frames, and many instances of Javascript code that is embedded into the HTML in different ways.
The thing that needs to be fixed is a sub-page created with showModalDialog (so you already know it's going to be a disaster). I am hoping I can find a way to serialize as much of the DOM as possible within this dialog page's context, so that I can capture the same content both when the bug is present and when it is not, in hopes of detecting missing/extra/different JavaScript, which would become apparent by pumping the result through a diff.
I tried jQuery(document).children().html(). This gets a little bit of the way there (it's able to serialize one of the outer <script> tags!) but does not include the contents of the iframe (most of the page content is about 3 iframe/frame levels deep).
I do have a custom script, which I'm very glad I made, that can walk down into the frame hierarchy recursively. I imagine I can use .html() in conjunction with that to obtain my "serialization", which I can then manually check against what the web inspector tells me.
Perhaps there exists some flag I can give to html() to get it to recurse into the iframes/frames?
The real question, though, is about how to get a dump of all the JS code that is loaded in this particular page context. Because of the significant server-side component of this situation, javascript resources can be entirely dynamic and therefore should also be checked for differences. How would I go about (in JS on the client) extracting the raw contents of a <script src='path'> tag to place into the serialization? I can work around this by manually intercepting these resources but it would be nice if everything can go into one thing for use with the diff.
Is there no way to do this other than by separately re-requesting those JS resources (not from script tags) with ajax?
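As a partial sketch of the first half of that (collecting the script URLs so each can be re-requested and appended to the dump): since the frame dumps are plain strings, the src attributes can be pulled out of the serialized markup directly. In a live page you could read document.scripts instead, but this also works on the per-frame dumps. The function name and regex are mine:

```javascript
// Collect the src URLs of external <script> tags from an HTML string, so
// each one can then be re-requested (e.g. via XMLHttpRequest) and its raw
// contents appended to the serialization before diffing.
function externalScriptUrls(html) {
  const urls = [];
  const re = /<script\b[^>]*\ssrc\s*=\s*("([^"]*)"|'([^']*)')/gi;
  let m;
  while ((m = re.exec(html)) !== null) {
    urls.push(m[2] !== undefined ? m[2] : m[3]);
  }
  return urls;
}

externalScriptUrls('<script src="a.js"></script><script>inline()</script>');
// → ["a.js"]
```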

removing unused DOM elements for performance

I'm writing a single page application. When the page is initially served, it contains many DOM elements that contain json strings that I'm injecting into the page.
When the page loads on the client, the first thing that happens is that these DOM elements are parsed from json to javascript objects and then they're never used again.
Would there be a performance benefit in deleting them from the DOM and reducing its size? I haven't found any conclusive data on this aspect. For info, these elements are about 500K in size.
Thanks for your suggestions.
Would there be a performance benefit in deleting them from the DOM and reducing its size?
In terms of the performance of the DOM, generally, the less you interact with your DOM, the better.
Remember that your selectors traverse the DOM, so inefficient selectors will perform poorly no matter what. But if you select elements by ID rather than by class or attribute wherever possible, you can squeeze good performance out of the page, regardless of whether you've deleted those extraneous markup items.
When the page is initially served, it contains many DOM elements that contain json strings that I'm injecting into the page.... For info, these elements are about 500K in size.
In terms of page load, your performance is already suffering from having to transfer all of that data. If you can find a way to transfer only the JSON, it can only help. Deleting after the fact wouldn't solve that particular problem, though.
Independent of whether you pull the JSON from the DOM, you might consider this technique for including your JSON objects:
<script type='text/json' id='whatever'>
{ "some": "json" }
</script>
Browsers will completely ignore the contents of those scripts (except for embedded strings that look like </script>, which, it occurs to me, a JSON serializer might want to deal with somehow), so you don't risk running into problems with embedded ampersands and the like. That's probably a performance win in and of itself: the browser likely has an easier time looking for the end of the <script> block than it does scanning through stuff it treats as actual content. You can get the content verbatim with .innerHTML.
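The one wrinkle noted above is a literal </script> inside the JSON. A common serializer-side fix, sketched below (the function name is mine), exploits the fact that JSON accepts \/ as an escape for /, so the value round-trips unchanged:

```javascript
// JSON permits "\/" as an escape for "/", so a serializer can defuse any
// embedded "</script>" before writing the block into the page.
function embedSafe(value) {
  return JSON.stringify(value).replace(/<\//g, '<\\/');
}

const json = embedSafe({ html: '</script><b>oops</b>' });
// json no longer contains a literal "</script>", but parses back intact:
JSON.parse(json).html === '</script><b>oops</b>'; // true
```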

Parsing HTML Response in jQuery

My page needs to grab a specific div from another page to display within a div on the current page. It's a perfect job for $.load, except that the HTML source I'm pulling from is not necessarily well-formed and seems to have occasional tag errors, and IE just plain won't use it in that case. So I've changed it to a $.get to grab the HTML source of the page as a string. Passing that to $ to parse it as a DOM has the same problem in IE as $.load, so I can't do that. I need a way to parse the HTML string to find the contents of my div#information, but not the rest of the page after its </div>. (PS: My div#information contains various other divs and elements.)
EDIT: Also if anyone has a fix for jQuery's $.load not being able to parse response HTML in IE, I'd love to hear that too.
If the resource you are trying to load is under your control, your implementation spec is poorly optimized. You don't want to ask your server for an entire page of content when you only really need a small piece of that content.
What you'll want to do is isolate the content you want, and have the server return only what you need.
As a side note, since you are aware that you have malformed HTML, you should probably bite the bullet and validate your markup. That will save you some trouble (like this) in the future.
Finally, if you truly cannot optimize this process, my guess is that you are creating an inconsistency because some element in the parsed HTML has the same ID as an element on your current page. Identical ID's are invalid and lead to many cross-browser JavaScript problems.
Working strictly with strings, you could use a regular expression to find the contents of the id="information" tag, without ever parsing the document as HTML.
I'd also try the $.load parameter that accepts an HTML selector:
$('#results').load('/MySite/Url #Information');
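If the selector form of .load still chokes on the malformed markup in IE, the string-only route suggested above can be sketched with a nesting counter rather than a single regex, which copes better with the nested divs the question mentions (the function name is illustrative):

```javascript
// Pull out the inner HTML of <div id="information">...</div> from a raw
// HTML string by counting div nesting, without handing the (possibly
// malformed) rest of the page to the browser's parser.
function extractDiv(html, id) {
  const open = new RegExp('<div\\b[^>]*\\bid\\s*=\\s*["\']' + id + '["\'][^>]*>', 'i');
  const start = html.search(open);
  if (start === -1) return null;
  const bodyStart = start + html.match(open)[0].length;
  const tag = /<\/?div\b[^>]*>/gi;
  tag.lastIndex = bodyStart;          // resume scanning after the open tag
  let depth = 1, m;
  while ((m = tag.exec(html)) !== null) {
    depth += m[0][1] === '/' ? -1 : 1; // close tags decrement, opens increment
    if (depth === 0) return html.slice(bodyStart, m.index);
  }
  return null; // unbalanced markup
}

extractDiv('<body><div id="information"><div>inner</div></div><p>rest</p></body>', 'information');
// → '<div>inner</div>'
```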

Disable JavaScript in iframe/div

I am making a small HTML page editor. The editor loads a file into an iframe. From there, it could add, modify, or delete the elements on the page with new attributes, styles, etc. The problem with this, is that JavaScript (and/or other programming languages) can completely modify the page when it loads, before you start editing the elements. So when you save, it won't save the original markup, but the modified page + your changes.
So, I need some way to disable the JavaScript in the iframe, or somehow remove all the JavaScript before it starts modifying the page. (I figure I'll have to end up parsing the file for PHP, but that shouldn't be too hard.) I considered writing a script to loop through all the elements, removing any <script> tags, onclick's, onfocus's, onmouseover's, etc. But that would be a real pain.
Does anyone know of an easier way to get rid of JavaScript from running inside an iframe?
UPDATE: unless I've missed something, I believe there is no way to simply 'disable JavaScript.' Please correct me if I'm wrong. But, I guess the only way to do it would be to parse out any script tags and JavaScript events (click, mouseover, etc) from a requested page string.
HTML5 introduces the sandbox attribute on the iframe that, if empty, disables JavaScript loading and execution.
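For example (the filename is illustrative):

```html
<!-- Scripts inside page.html will not run; adding tokens such as
     allow-forms re-enables individual capabilities without allow-scripts. -->
<iframe sandbox src="page.html"></iframe>
```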
Yes, your update is correct. You must assume that any input you receive from the user may contain malicious elements. Thus, you must validate on the server before accepting their input.
You can try the same solution adopted by CKEditor, of which you have a demo here.
By switching from RTE mode to source view mode, you can enter some JavaScript and see the result: the JS node is replaced with a safely encoded string.
If you are in view source mode, by entering some JS line like:
<script type="text/javascript">
// comment
alert('Ciao');
</script>
you will see it rendered this way when going back to rich text editor mode:
<!--{cke_protected}%3Cscript%20type%3D%22text%2Fjavascript%22%3E%0D%0A%2F%2F%20comment%0D%0Aalert('Ciao')%3B%0D%0A%3C%2Fscript%3E-->
I think it is one of the easiest and most effective ways, since the RegExp to parse JS nodes is not complex.
An example:
var pattern = /<script(\s+(\w+\s*=\s*("|')[\s\S]*?\3)\s*)*\s*(\/>|>[\s\S]*?<\/script\s*>)/gi;
var match = HTMLString.match(pattern); // array containing all occurrences found (note the g flag, and [\s\S] so multi-line script bodies match)
(Of course, to replace the script node you should use the replace() method).
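Putting that together, with the global flag and [\s\S] so multi-line script bodies also match, the replacement is a one-liner:

```javascript
// Blank out every script node in one pass over the HTML string.
var pattern = /<script(\s+(\w+\s*=\s*("|')[\s\S]*?\3)\s*)*\s*(\/>|>[\s\S]*?<\/script\s*>)/gi;
var cleaned = '<p>a</p><script>\nalert(1);\n</script><p>b</p>'.replace(pattern, '');
// cleaned === '<p>a</p><p>b</p>'
```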
Regards.
You can set a Content Security Policy of script-src 'self' in the response header. CSP will block any scripts other than those from your own website, preventing scripts in the iframe from changing elements on your page.
You can get more information here: http://www.html5rocks.com/en/tutorials/security/content-security-policy/
