PHP - safely allow users to save html - javascript

I'm writing a website that allows users to paste in HTML for their blogs.
The HTML they paste in will then get saved to a file, and this is what will be read and changed when the user makes changes. The file will be almost like a complete web page, so it will have all the normal tags: head, body, divs, etc.
This means it should allow almost all HTML and CSS apart from anything that could cause a security breach. So it essentially needs to strip PHP tags, certain style tags, and HTML/JavaScript script tags.
I looked into the strip_tags function but I'd rather not use it because:
a) it removes HTML comments, which I'd rather keep, and
b) it would be a lot of work specifying all the tags it needs to ignore, considering I want it to ignore vastly more tags than I want it to strip.
My guess is that this is something regex-esque using preg_replace?
I'd like to add: I'm newly aware of XSS attacks through CSS too, so any ideas/thoughts on how I could block certain CSS out would be wonderful :)
Any ideas on what I could do?

Related

How to load and parse HTML without modifying its contents

There are many ways to parse and traverse HTML4 files using many technologies. But I cannot find a suitable one to save that DOM to a file again.
I want to be able to load an HTML file into a DOM, change one small thing (e.g. an attribute's value), and save the DOM to a file again. When diffing the source file and the created file, I want them to be completely identical except for that small change.
This kind of task is no problem at all when working with XML and suitable XML libraries, but with HTML there are several issues: whitespace such as indentation or line breaks gets lost or is inserted, self-closing start tags (such as <link...>) emerge as <link.../>, and/or the content of CDATA sections (e.g. between <script> and </script>) is wrapped in <![CDATA[ ]]>. These things are critical in my case.
How can I load, traverse, manipulate and save HTML without the drawbacks described above, most importantly without whitespace text nodes being altered?
comparison
If you want to get really serious leave out the GUI and go headless, SO example with Phantom
I am going with the HTML Agility Pack.
Loading and saving does not change anything other than invalid parts.

Access surrounding HTML elements through JavaScript with no context

What I want is to be able to work on context-free HTML elements surrounding 'entry points' (like <script> tags or events) using JavaScript. I have some tight restrictions on what I can do.
Summary
1. I have multiple user-generated blocks of HTML which need to be processed on their own, as they load.
2. The content of these blocks can contain similar blocks, with similar behaviour.
3. The content will also need to be duplicated once, possibly restarting execution of the scripts mentioned in 2.
4. These blocks are generated from static templates and cannot (initially) contain unique identifying data, like IDs or random attribute values.
5. These blocks can also be generated through AJAX, which is out of my control. They will need to take care of themselves as they appear, without relying on any order of execution.
Background
This is for a forum software, where certain BBCode tags, say [tag]content[/tag] are replaced with fixed HTML. I have no access to any server-side scripting, so the replacements are context-independent, i.e. always the same.
For example, [tag]{content}[/tag] would turn into something like:
<span ...>
...
{content}
...
<!-- script entry point -->
</span>
I need to do some client-side processing around the time when the data is loaded.
I cannot change the requirements of the problem. The end product is code for generating tabs, like:
[tabspace]
[tab]content 1[/tab]
[tab]
content 2, and
nested tabs:
[tabspace]...[/tabspace]
[/tab]
[/tabspace]
The content that is output by the script consists of the "tab buttons" themselves, which would link to their respective content.
Restrictions
I cannot use IDs. Everything must be or start as a context-free replacement of the template.
[tag]s can be nested.
The script needs to duplicate some of the content, parts of which can be similar scripts and so on. This messes up scripts which are already running on the outer pairs of tags. I have a bottom-up solution to solve this, but it relies on starting scripts at any depth without worrying about similar surroundings. (another way to phrase it)
These tags can be loaded dynamically in any post on the page, at any time, using AJAX.
There can be scripts at any point in the template, to make the problem easier.
What I tried
Using JS to dynamically output a <span id='something random'></span> as it loads, search for the given ID and use its location to find the surrounding elements. It doesn't work when I load pages dynamically and when tags are nested.
Give a class to the surrounding element and find the last element of its kind. This doesn't work because we may have updates in the middle of the page, after it's loaded.
Solutions which I'd rather avoid
Using <img src='bogus' onerror='script entry point'/>, I can run a script and access surrounding tags using this. But I'd rather not use broken links and errors to solve a simple and pertinent problem.

How to create an independent HTML block?

I want to know if there is some way to create an independent HTML block.
To explain further:
My problem is that I have a webpage where I allow some users to add content (which may contain HTML and CSS).
I allow them to add their content inside a certain block, but sometimes their content is not clean code and may contain DIVs with no closing tag, or even closing DIV tags with no opening DIV.
This sometimes distorts my page completely.
Is there any way to display their content independently from my parent div, so that my div is displayed properly first and then the content inside it is displayed?
I'm sorry for the long message.
Thanks for any attempt to help.
sometimes their content may not be clean code , and may contain some
DIVS with no end , Or even some DIV end with no starting DIV This
sometimes distort my page completely
The easiest solution for you is going to be to add the submitted content to your page inside an <iframe>. That way, it doesn't matter if the submitted HTML is invalid.
If you have to worry about users possibly submitting malicious content (such as JavaScript), the problem becomes much harder: you need to sanitize the HTML. I can't tell you how to do this without knowing what server-side language you're using.
My problem is that I have a webpage in which I allow some users can add content (may contain HTML & CSS ) I allow them to add their content inside a certain block , but sometimes their content may not be clean code , and may contain some DIVS with no end , Or even some DIV end with no starting DIV This sometimes distort my page completely
If that is the problem you are trying to solve, then having some markup to say a chunk of code was independent wouldn't help: They might include the "End of independent section" code in the HTML.
If you want to put the code in a page, you need to parse it, sanitise it (using a whitelist) to remove anything potentially harmful and then generate clean markup from the DOM.
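The parse, whitelist and regenerate steps described above might look roughly like the sketch below. It assumes a real HTML parser has already turned the submitted markup into a simple tree; the node shape ({ tag, attrs, children }), the ALLOWED table and every function name here are hypothetical, chosen only for illustration, and a production site should rely on a maintained sanitizer library instead.

```javascript
// Hypothetical whitelist: tag name -> attributes allowed on that tag.
const ALLOWED = {
  div: ["class"],
  p: ["class"],
  b: [],
  i: [],
  a: ["href", "title"],
};

// Accept only URLs with an explicitly safe prefix (rejects javascript:, data:, ...).
function isSafeUrl(value) {
  return /^(https?:|mailto:|\/|#)/i.test(String(value).trim());
}

// Escape text so it can never be interpreted as markup.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Assumed node shape: a string (text node) or { tag, attrs, children }.
// Clean markup is generated from the tree; the original source is never reused.
function serializeClean(node) {
  if (typeof node === "string") return escapeHtml(node);
  const tag = String(node.tag).toLowerCase();
  const children = (node.children || []).map(serializeClean).join("");
  if (!Object.prototype.hasOwnProperty.call(ALLOWED, tag)) {
    // script/style are dropped with their content; other unknown tags
    // are unwrapped, keeping only their cleaned children.
    return tag === "script" || tag === "style" ? "" : children;
  }
  const attrs = Object.entries(node.attrs || {})
    .filter(([name]) => ALLOWED[tag].includes(name.toLowerCase()))
    .filter(([name, value]) => name.toLowerCase() !== "href" || isSafeUrl(value))
    .map(([name, value]) => ` ${name.toLowerCase()}="${escapeHtml(value)}"`)
    .join("");
  return `<${tag}${attrs}>${children}</${tag}>`;
}
```

Because the output is rebuilt from the tree, unclosed or stray tags in the submission cannot leak into the surrounding page: every element that is emitted gets both its opening and closing tag.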
You could use static iframes.
Check this out: http://www.samisite.com/test-csb2nf/id43.htm
The safest way is to restrict the tags they can submit, and validate/sanitize those that they do, similar to the way we can use markup on here.
Having unchecked HTML injected into your page is asking for trouble.
Failing that, good old iframe will do the trick.
Okay, I believe there is something you can do, but it may take some time. You can use a parser to go through the user's HTML, get the tags and their content, and recreate the HTML so that it is clean.
But since there are a lot of tags that can be used, or even invented tags, you can limit the tags that users are allowed to use in their HTML. Show a legend with the acceptable tags.
There are some pretty good HTML parsers for PHP, but they may break on very bad HTML, which is why I suggest you just recreate it from the parse with a limited subset of acceptable tags.
I know it's a difficult and time-consuming solution, but this is what I have in mind.

Are HTML-only widgets secure?

I would like to display some boxes on random pages through a browser plug-in. The content of these boxes is also random.
Is a simple check to remove scripts from said boxes enough to offer a user a safe experience?
Do I have to put the boxes in iframes?
Do I have to strip additional code from the HTML? (Is removing 'script' tags enough?)
Do you know of some library that can do that automatically?
Do I have to put the boxes in iframes?
Yes or no, depending on your definition of safe.
That will not stop the scripts from initiating downloads of malware, redirecting the user to a phishing page, or XSRFing a poorly designed site the user is currently logged into.
Is a simple check to remove scripts from said boxes enough to offer a user a safe experience?
No. There are many ways to embed scripts, and simple checks rarely get it right. For example, scripts can be embedded in links, CSS, SVG, data: URLs, etc.
Don't roll your own HTML sanitizer.
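To make the point concrete, here is a small sketch of the kind of "simple check" that goes wrong. stripScriptTags and the sample payloads are made up for illustration; the filter is deliberately naive and is shown only to demonstrate what it misses.

```javascript
// Deliberately naive filter: removes only <script>...</script> blocks.
function stripScriptTags(html) {
  return html.replace(/<script\b[\s\S]*?<\/script\s*>/gi, "");
}

// Payloads with no <script> tag at all that a browser may still execute.
const payloads = [
  '<img src="x" onerror="alert(1)">',        // event-handler attribute
  '<a href="javascript:alert(1)">click</a>', // script inside a link
  '<svg onload="alert(1)"></svg>',           // SVG with an event handler
];

for (const payload of payloads) {
  const survived = stripScriptTags(payload) === payload;
  console.log(survived ? "untouched:" : "changed:", payload);
}

// A single pass can even assemble a new script tag out of the leftovers.
const nested = "<scr<script></script>ipt>alert(1)</scr<script></script>ipt>";
console.log(stripScriptTags(nested)); // prints "<script>alert(1)</script>"
```

All three payloads pass through unchanged, and the last input comes out of the filter as a working script tag.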
Directly relevant to your question about safe HTML widgets though is sandboxed JavaScript. See
http://code.google.com/p/google-caja/wiki/CorkboardDemo
No, plain HTML can still be malicious. An <iframe> could be used to load a drive-by exploit from any website. An <img> tag could be used to exploit a GET-based Cross-Site Request Forgery (CSRF) vulnerability. A POST-based CSRF exploit would require one line of JavaScript or some user interaction.
Removing JavaScript from HTML is far more complex than just removing script tags. HTML Purifier is a large, whitelist-based library and is the best method of removing JavaScript, but it's not perfect.
That all depends on where the content is coming from and what kind of content it is.
For example, if the content is just text from your site, you might want to filter out HTML, just in case.

Disable JavaScript in iframe/div

I am making a small HTML page editor. The editor loads a file into an iframe. From there, it can add, modify, or delete the elements on the page with new attributes, styles, etc. The problem with this is that JavaScript (and/or other programming languages) can completely modify the page when it loads, before you start editing the elements. So when you save, it won't save the original markup, but the modified page plus your changes.
So, I need some way to disable the JavaScript on the iframe, or somehow remove all the JavaScript before the JavaScript starts modifying the page. (I figure I'll have to end up parsing the file for PHP, but that shouldn't be too hard) I considered writing a script to loop through all the elements, removing any tags, onclick's, onfocus's, onmouseover's, etc. But that would be a real pain.
Does anyone know of an easier way to stop JavaScript from running inside an iframe?
UPDATE: unless I've missed something, I believe there is no way to simply 'disable JavaScript.' Please correct me if I'm wrong. But, I guess the only way to do it would be to parse out any script tags and JavaScript events (click, mouseover, etc) from a requested page string.
HTML5 introduces the sandbox attribute on the iframe that, if empty, disables JavaScript loading and execution.
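As a rough sketch of how that could be used, the helper below builds the markup for such an iframe. The function names are hypothetical; sandbox and srcdoc are standard iframe attributes, and srcdoc is used here only so the example is self-contained (a src URL works the same way).

```javascript
// Escape a string for use inside a double-quoted HTML attribute.
function escapeAttribute(value) {
  return String(value).replace(/&/g, "&amp;").replace(/"/g, "&quot;");
}

// Build an <iframe> whose empty sandbox attribute applies every restriction,
// including "no script execution". srcdoc carries the untrusted markup inline.
function sandboxedIframe(untrustedHtml) {
  return `<iframe sandbox srcdoc="${escapeAttribute(untrustedHtml)}"></iframe>`;
}

console.log(sandboxedIframe('<p onclick="x()">a & b</p>'));
```

One design point to be aware of: an empty sandbox also gives the framed document a unique origin, so the parent page cannot reach into it. For an editor that has to modify the framed DOM, sandbox="allow-same-origin" (without allow-scripts) is likely the variant to try, since scripts stay disabled while the origin is kept.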
Yes, your update is correct. You must assume that any input you receive from the user may contain malicious elements. Thus, you must validate on the server before accepting their input.
You can try the same solution adopted by CKEditor, of which you have a demo here.
By switching from RTE mode to view source mode, you can enter some JavaScript and see the result, which is a replacement of the JS node in a safely encoded string.
If you are in view source mode, by entering some JS line like:
<script type="text/javascript">
// comment
alert('Ciao');
</script>
you will see it rendered this way when going back to rich text editor mode:
<!--{cke_protected}%3Cscript%20type%3D%22text%2Fjavascript%22%3E%0D%0A%2F%2F%20comment%0D%0Aalert('Ciao')%3B%0D%0A%3C%2Fscript%3E-->
I think it is one of the easiest and most effective ways, since the RegExp to parse JS nodes is not complex.
An example:
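// [\s\S] also matches line breaks (a plain . does not); g finds every occurrence, i ignores case
var pattern = /<script(\s+(\w+\s*=\s*("|')[\s\S]*?\3)\s*)*\s*(\/>|>[\s\S]*?<\/script\s*>)/gi;
var match = HTMLString.match(pattern); // array of all script blocks found, or null if there are none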
(Of course, to replace the script node you should use the replace() method).
Regards.
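Putting the replace() step together with the encoded form shown above gives something like the following sketch. It only imitates the output format from the example; it is not CKEditor's actual implementation, and protectScripts, restoreScripts and the simplified pattern are assumptions made for illustration.

```javascript
// Simplified pattern for whole <script>...</script> blocks. [\s\S] spans line
// breaks; it assumes no attribute value contains a ">" character.
const SCRIPT_BLOCK = /<script\b[^>]*>[\s\S]*?<\/script\s*>/gi;

// Replace each script block with an inert comment holding its URL-encoded source.
function protectScripts(html) {
  return html.replace(
    SCRIPT_BLOCK,
    (block) => "<!--{cke_protected}" + encodeURIComponent(block) + "-->"
  );
}

// Reverse step (hypothetical helper): turn the comments back into script blocks,
// for example when saving the file.
function restoreScripts(html) {
  return html.replace(
    /<!--\{cke_protected\}([\s\S]*?)-->/g,
    (_, encoded) => decodeURIComponent(encoded)
  );
}
```

Since encodeURIComponent encodes ">" as %3E, the encoded text can never contain "-->" and close the comment early. The usual caveat applies: a regular expression only catches script elements, not event-handler attributes or javascript: URLs.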
You can set a Content Security Policy of script-src 'self' in a response header. CSP will block any scripts other than those from our own website. This way we can prevent scripts in the iframe from changing the elements in our page.
You can get more information here: http://www.html5rocks.com/en/tutorials/security/content-security-policy/
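A minimal sketch of sending that header, assuming the page is served by a Node.js process using the built-in http module (the question does not say what the server is, and servePageWithCsp is a made-up name):

```javascript
// Hypothetical request handler: serves the page being edited together with
// a Content-Security-Policy response header.
function servePageWithCsp(req, res, pageHtml) {
  res.writeHead(200, {
    "Content-Type": "text/html; charset=utf-8",
    // 'self' allows only script files from the same origin; inline scripts and
    // inline event handlers are blocked. "script-src 'none'" blocks all scripts.
    "Content-Security-Policy": "script-src 'self'",
  });
  res.end(pageHtml);
}
```

It could be wired up with something like http.createServer((req, res) => servePageWithCsp(req, res, html)).listen(8080). The policy applies to the document it is sent with, so it has to accompany the page loaded in the iframe rather than only the editor page around it.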
