I have a web page with control, that render user's HTML markup.
I want remove all JS calls (and CSS, I guess) to prevent users from injecting malware code. Replacing all script tags and all onclick with others handlers seems to be a bad idea, so questin is about the best solution for this XSS problem in .Net world.
I'd strongly suggest not going down the regex route (You can't parse HTML with Regex), and consider something like HTMLAgilityPack.
This would allow you to remove all script elements, as well as remove all event handlers from elements regardless of how they're set up.
The alternative is to escape all HTML input, and then manually parse the particular tags you're interested in.
<b>Hello</b>
Becomes
<b>Hello</>
And you can then match <(b|i|u|p|em|othertagsgohere)>(.+?)</$1> so that it will only match tags with no attributes on them of the types that you're interested in and. But ultimately I think the HTMLAgiltiyPack route is the better one.
Related
I am learning about XSS (for ethical purposes), and I was wondering how to execute some JavaScript code without using <script> tags. This is within the HTML tag:
"The search term" <p> *JavaScript here* </p> "returned no results"
For some reason, the script tags are not working.
Try putting in different types of strings with special characters and look if any of these get encoded or outputed. (I personaly use '';!--"<XSS>=&{()})
Now you have three options:
Inside a HTML Tag: The <> won't matter, because you are already inside a HTML Tag. You can look if this Tag supports Events and use some kind of onload=alert(1) or other event. If <> is allowed, you can break out and create your own tag '><img src=0 onerror=alert(1)>
Outside of HTML Tag: the <> are important. With these you can open a new Tag and the whole world is below your feet (or so...)
Inside Javascript: Well...if you can break out of a string with '", then you can basically write ';alert(1)
Craft your XSS accordingly to your encoded characters and the surrounding of where the string get's outputed
<XSS> disappears entirely: the application uses some kind of strip_tags . If you are outside of a HTML Tag and no HTML Tags are whitelisted, I unfortunatly don't know any method to achieve an XSS.
Crafting your own payload
There are various methods to achieve this and too much to name them all.
Look on these two sites, which have a lot of the methods and concept to construct your own.
It comes down to: What the page allows to go through.
https://www.owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet#XSS_Locator_.28short.29
https://html5sec.org/
You can use the onclick attribute that is presented in HTML elements so you can create something like this:
"The search term" <p> Click me to see the JavaScript work! </p> "returned no results"
Now when clicking on the element the JavaScript will be executed.
Another one was mentioned at: https://stackoverflow.com/a/53430230/895245
asdf
Works on Chromium 81.
More important perhaps is the question of how to sanitize against it, see e.g.:
How to prevent XSS with HTML/PHP?
How to sanitize HTML code in Java to prevent XSS attacks?
Is using jquery parseHTML to remove script tags enough to prevent XSS attacks?
I want to know if there is some way to create an independent HTML block .
For more explanation :
My problem is that I have a webpage in which I allow some users can add content (may contain HTML & CSS )
I allow them to add their content inside a certain block , but sometimes their content may not be clean code , and may contain some DIVS with no end , Or even some DIV end with no starting DIV
This sometimes distort my page completely
Is there any way to make their content displayed independently from my parent div , so that my div is first displayed well , and then the content inside it is displayed ?
I'm sorry for long message .
Thanks for any trial to help
sometimes their content may not be clean code , and may contain some
DIVS with no end , Or even some DIV end with no starting DIV This
sometimes distort my page completely
The easiest solution for you is going to be to add the submitted content to your page inside an <iframe>. That way, it doesn't matter if the submitted HTML is invalid.
If you have to worry about users possibly submitting malicious content (such as JavaScript), the problem becomes much harder: you need to sanitize the HTML. I can't tell you how to do this without knowing what server-side language you're using.
My problem is that I have a webpage in which I allow some users can add content (may contain HTML & CSS ) I allow them to add their content inside a certain block , but sometimes their content may not be clean code , and may contain some DIVS with no end , Or even some DIV end with no starting DIV This sometimes distort my page completely
If that is the problem you are trying to solve, then having some markup to say a chunk of code was independent wouldn't help: They might include the "End of independent section" code in the HTML.
If you want to put the code in a page, you need to parse it, sanitise it (using a whitelist) to remove anything potentially harmful and then generate clean markup from the DOM.
you could use Static iframes.
check this out http://www.samisite.com/test-csb2nf/id43.htm
The safest way is to restrict the tags they can submit, and validate/sanitize those that they do, similar to the way we can use markup on here.
Having unchecked HTML injected into your page is asking for trouble.
Failing that, good old iframe will do the trick.
Okay, i belive there is something you can do, but it can require some time. You can use a parser to go through the users html, and get the tags and their content, and recreate the html making it clean.
But, as there are a lot of tags that can be used, or even invented tags, than you can limit the tags that the user are able to use in their html. You put a legend with the acceptable tags.
There are some pretty good html parsers for php, but they may break for some very bad html code, so this is why i suggest you just recreate it based on the parsing with a limited subset of acceptable tags.
I know it's a difficult/time consuming solution, but this is what i have in mind
My page needs to grab a specific div from another page to display within a div on the current page. It's a perfect job for $.load, except that the HTML source I'm pulling from is not necessarily well-formed, and seems to have occasional tag errors, and IE just plain won't use it in that case. So I've changed it to a $.get to grab the HTML source of the page as a string. Passing it to $ to parse it as a DOM has the same problem in IE as $.load, so I can't do that. I need a way to parse the HTML string to find the contents of my div#information, but not the rest of the page after its </div>. (PS My div#information contains various other div's and elements.)
EDIT: Also if anyone has a fix for jQuery's $.load not being able to parse response HTML in IE, I'd love to hear that too.
If the resource you are trying to load is under your control, your implementation spec is poorly optimized. You don't want to ask your server for an entire page of content when you only really need a small piece of that content.
What you'll want to do is isolate the content you want, and have the server return only what you need.
As a side note, since you are aware that you have malformed HTML, you should probably bite the bullet and validate your markup. That will save you some trouble (like this) in the future.
Finally, if you truly cannot optimize this process, my guess is that you are creating an inconsistency because some element in the parsed HTML has the same ID as an element on your current page. Identical ID's are invalid and lead to many cross-browser JavaScript problems.
Strictly with strings you could use a regular expression to find the id="information" tag contents. Never parse it as html.
I'd try the $.load parameter that accepts a html selector as well
$('#results').load('/MySite/Url #Information');
I am making a small HTML page editor. The editor loads a file into an iframe. From there, it could add, modify, or delete the elements on the page with new attributes, styles, etc. The problem with this, is that JavaScript (and/or other programming languages) can completely modify the page when it loads, before you start editing the elements. So when you save, it won't save the original markup, but the modified page + your changes.
So, I need some way to disable the JavaScript on the iframe, or somehow remove all the JavaScript before the JavaScript starts modifying the page. (I figure I'll have to end up parsing the file for PHP, but that shouldn't be too hard) I considered writing a script to loop through all the elements, removing any tags, onclick's, onfocus's, onmouseover's, etc. But that would be a real pain.
Does anyone know of an easier way to get rid of JavaScript from running inside an iframe?
UPDATE: unless I've missed something, I believe there is no way to simply 'disable JavaScript.' Please correct me if I'm wrong. But, I guess the only way to do it would be to parse out any script tags and JavaScript events (click, mouseover, etc) from a requested page string.
HTML5 introduces the sandbox attribute on the iframe that, if empty, disables JavaScript loading and execution.
Yes, your update is correct. You must assume that any input you receive from the user may contain malicious elements. Thus, you must validate on the server before accepting their input.
You can try the same solution adopted by CKEditor, of which you have a demo here.
By switching from RTE mode to view source mode, you can enter some JavaScript and see the result, which is a replacement of the JS node in a safely encoded string.
If you are in view source mode, by entering some JS line like:
<script type="text/javascript">
// comment
alert('Ciao');
</script>
you will see it rendered this way when going back to rich text editor mode:
<!--{cke_protected}%3Cscript%20type%3D%22text%2Fjavascript%22%3E%0D%0A%2F%2F%20comment%0D%0Aalert('Ciao')%3B%0D%0A%3C%2Fscript%3E-->
I think it is one of the easiest and effective way, since the RegExp to parse JS nodes is not complex.
An example:
var pattern = /<script(\s+(\w+\s*=\s*("|').*?\3)\s*)*\s*(\/>|>.*?<\/script\s*>)/;
var match = HTMLString.match(pattern); // array containing the occurrences found
(Of course, to replace the script node you should use the replace() method).
Regards.
You can set Content Security Policy as script-src 'self' in header. CSP will block any scripts other than from our own website. This way we can prevent any scripts from iframe changing the elements in our page.
You can get more information from here http://www.html5rocks.com/en/tutorials/security/content-security-policy/
In the 1990s, there was a fashion to put Javascript code directly into <a> href attributes, like this:
Press me!
And then suddenly I stopped to see it. They were all replaced by things like:
Press me!
For a link whose sole purpose is to trigger Javascript code, and has no real href target, why is it encouraged to use the onclick property instead of the href property?
The execution context is different, to see this, try these links instead:
Press me! <!-- result: undefined -->
Press me! <!-- result: A -->
javascript: is executed in the global context, not as a method of the element, which is usually want you want. In most cases you're doing something with or in relation to the element you acted on, better to execute it in that context.
Also, it's just much cleaner, though I wouldn't use in-line script at all. Check out any framework for handling these things in a much cleaner way. Example in jQuery:
$('a').click(function() { alert(this.tagName); });
Actually, both methods are considered obsolete. Developers are instead encouraged to separate all JavaScript in an external JS file in order to separate logic and code from genuine markup
http://www.alistapart.com/articles/behavioralseparation
http://en.wikipedia.org/wiki/Unobtrusive_JavaScript
The reason for this is that it creates code that is easier to maintain and debug, and it also promotes web standards and accessibility. Think of it like this: Looking at your example, what if you had hundreds of links like that on a page and needed to change out the alert behavior for some other function using external JS references, you'd only need to change a single event binding in one JS file as opposed to copying and pasting a bunch of code over and over again or doing a find-and-replace.
Couple of reasons:
Bad code practice:
The HREF tag is to indicate that there is a hyperlink reference to another location. By using the same tag for a javascript function which is not actually taking the user anywhere is bad programming practice.
SEO problems:
I think web crawlers use the HREF tag to crawl throughout the web site & link all the connected parts. By putting in javascript, we break this functionality.
Breaks accessibility:
I think some screen readers will not be able to execute the javascript & might not know how to deal with the javascript while they expect a hyperlink. User will expect to see a link in the browser status bar on hover of the link while they will see a string like: "javascript:" which might confuse them etc.
You are still in 1990's:
The mainstream advice is to have your javascript in a seperate file & not mingle with the HTML of the page as was done in 1990's.
HTH.
I open lots of links in new tabs - only to see javascript:void(0). So you annoy me, as well as yourself (because Google will see the same thing).
Another reason (also mentioned by others) is that different languages should be separated into different documents. Why? Well,
Mixed languages aren't well supported
by most IDEs and validators.
Embedding CSS and JS into HTML pages
(or anything else for that matter)
pretty much destroys opportunities to
have the embedded language checked for correctness
statically. Sometimes, the embedding language as well.
(A PHP or ASP document isn't valid HTML.)
You don't want syntax
errors or inconsistencies to show up
only at runtime.
Another reason is to have a cleaner separation between
the kinds of things you need to
specify: HTML for content, CSS for
layout, JS usually for more layout
and look-and-feel. These don't map
one to one: you usually want to apply
layout to whole categories of
content elements (hence CSS) and look and feel as well
(hence jQuery). They may be changed at different
times that the content elements are changed (in fact
the content is often generated on the fly) and by
different people. So it makes sense to keep them in
separate documents as well.
Using the javascript: protocol affects accessibility, and also hurts how SEO friendly your page is.
Take note that HTML stands for Hypter Text something something... Hyper Text denotes text with links and references in it, which is what an anchor element <a> is used for.
When you use the javascript: 'protocol' you're misusing the anchor element. Since you're misusing the <a> element, things like the Google Bot and the Jaws Screen reader will have trouble 'understanding' your page, since they don't care much about your JS but care plenty about the Hyper Text ML, taking special note of the anchor hrefs.
It also affects the usability of your page when a user who does not have JavaScript enabled visits your page; you're breaking the expected functionality and behavior of links for those users. It will look like a link, but it won't act like a link because it uses the javascript protocol.
You might think "but how many people have JavaScript disabled nowadays?" but I like to phrase that idea more along the lines of "How many potential customers am I willing to turn away just because of a checkbox in their browser settings?"
It boils down to how href is an HTML attribute, and as such it belongs to your site's information, not its behavior. The JavaScript defines the behavior, but your never want it to interfere with the data/information. The epitome of this idea would be the external JavaScript file; not using onclick as an attribute, but instead as an event handler in your JavaScript file.
Short Answer: Inline Javascript is bad for the reasons that inline CSS is bad.
The worst problem is probably that it breaks expected functionality.
For example, as others has pointed out, open in new window/tab = dead link = annoyed/confused users.
I always try to use onclick instead, and add something to the URL-hash of the page to indicate the desired function to trigger and add a check at pageload to check the hash and trigger the function.
This way you get the same behavior for clicks, new tab/window and even bookmarked/sent links, and things don't get to wacky if JS is off.
In other words, something like this (very simplified):
For the link:
onclick = "doStuff()"
href = "#dostuff"
For the page:
onLoad = if(hash="dostuff") doStuff();
Also, as long as we're talking about deprecation and semantics, it's probably worth pointing out that '</a>' doesn't mean 'clickable' - it means 'anchor,' and implies a link to another page. So it would make sense to use that tag to switch to a different 'view' in your application, but not to perform a computation. The fact that you don't have a URL in your href attribute should be a sign that you shouldn't be using an anchor tag.
You can, alternately, assign a click event action to nearly any html element - maybe an <h1>, an <img>, or a <p> would be more appropriate? At any rate, as other people have mentioned, add another attribute (an 'id' perhaps) that javascript can use as a 'hook' (document.getElementById) to get to the element and assign an onclick. That way you can keep your content (HTML) presentation (CSS) and interactivity (JavaScript) separated. And the world won't end.
I typically have a landing page called "EnableJavascript.htm" that has a big message on it saying "Javascript must be enabled for this feature to work". And then I setup my anchor tags like this...
<a href="EnableJavascript.htm" onclick="funcName(); return false;">
This way, the anchor has a legitimate destination that will get overwritten by your Javascript functionality whenever possible. This will degrade gracefully. Although, now a days, I generally build web sites with complete functionality before I decide to sprinkle some Javascript into the mix (which all together eliminates the need for anchors like this).
Using onclick attribute directly in the markup is a whole other topic, but I would recommend an unobtrusive approach with a library like jQuery.
I think it has to do with what the user sees in the status bar. Typically applications should be built for failover in case javascript isn't enabled however this isn't always the case.
With all the spamming that is going on people are getting smarter and when an email looks 'phishy' more and more people are looking at the status bar to see where the link will actually take them.
Remember to add 'return false;' to the end of your link so the page doesn't jump to the top on the user (unless that's the behaviour you are looking for).