I have a web page with a control that renders users' HTML markup.
I want to remove all JS calls (and CSS, I guess) to prevent users from injecting malicious code. Replacing all script tags and all onclick attributes with other handlers seems to be a bad idea, so the question is: what is the best solution to this XSS problem in the .NET world?
I'd strongly suggest not going down the regex route (you can't parse HTML with regex), and would consider something like HTMLAgilityPack instead.
This would allow you to remove all script elements, as well as remove all event handlers from elements regardless of how they're set up.
The alternative is to escape all HTML input, and then manually parse the particular tags you're interested in.
<b>Hello</b>
Becomes
&lt;b&gt;Hello&lt;/b&gt;
And you can then match &lt;(b|i|u|p|em|othertagsgohere)&gt;(.+?)&lt;/\1&gt; so that it only matches attribute-free tags of the types you're interested in. But ultimately I think the HTMLAgilityPack route is the better one.
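A minimal sketch of the escape-then-whitelist idea, for illustration only. The question is about .NET; JavaScript is used here just to show the two steps. The function names (escapeHtml, restoreAllowedTags, sanitize) and the tag list are assumptions, not part of any library.

```javascript
// Tags that may be restored after escaping. Attributes are never allowed.
var ALLOWED_TAGS = "b|i|u|p|em";

// Step 1: escape everything so no user markup survives as markup.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Step 2: turn escaped, attribute-free, properly closed tags back into markup.
function restoreAllowedTags(escaped) {
  var pattern = new RegExp(
    "&lt;(" + ALLOWED_TAGS + ")&gt;([\\s\\S]+?)&lt;/\\1&gt;", "g");
  var previous;
  do {
    // Repeat until nothing changes so nested tags are restored as well.
    previous = escaped;
    escaped = escaped.replace(pattern, "<$1>$2</$1>");
  } while (escaped !== previous);
  return escaped;
}

function sanitize(userHtml) {
  return restoreAllowedTags(escapeHtml(userHtml));
}
```

Anything that is not an exact, attribute-free tag from the list (script elements, event-handler attributes, unclosed tags) stays escaped and is displayed as text.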
I'm creating a Firefox add-on that adds an onclick event to a specific button (an "input" element).
The button is on http://example.com/welcome#_pg=compose
but when I open the page, the following error occurs:
TypeError: document.querySelector("#send_top") is null
#send_top is the ID of the button I want to modify, so the button is not being found.
This error occurs because http://example.com/welcome and http://example.com/welcome#_pg=compose are completely different pages.
In this case, the add-on seems to load http://example.com/welcome, but that page has no button with the ID 'send_top'.
When the #_pg=compose anchor is added, the button is loaded by JavaScript.
How can I load http://example.com/welcome#_pg=compose to modify the button?
Three thoughts to help you debug this:
To correctly match the URL, consider using a regular expression instead of the page-match syntax. This might allow you to react to the anchors in a more predictable way.
I've found that when using content scripts with pages that are heavily modified by JS, you can run into timing issues. A hacky workaround is to look for the element you want and, if it isn't there, do a setTimeout for 100 milliseconds or so and then re-check. Ugly, yes, but it worked for some example code I wrote against the new Twitter UI.
You can use the unsafeWindow variable in your content script to directly access the page's window object. This object will contain any changes JS has made to the page and is not proxied. You should use unsafeWindow with great caution, however, as its use represents a possible security problem. In particular, you should never trust any data coming from unsafeWindow, ever.
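The re-check workaround from the second point could look roughly like this. It is only a sketch: waitFor and its options (delayMs, maxRetries, schedule) are made-up names rather than part of any add-on API, and the retry limit is an addition so the loop cannot run forever.

```javascript
// Poll until `find()` returns something truthy, then pass it to `onFound`.
function waitFor(find, onFound, options) {
  var opts = options || {};
  var delayMs = opts.delayMs || 100;          // pause between checks
  var retriesLeft = opts.maxRetries || 50;    // re-checks allowed after the first
  var schedule = opts.schedule || setTimeout; // replaceable for testing

  function check() {
    var result = find();
    if (result) {
      onFound(result);
    } else if (retriesLeft-- > 0) {
      schedule(check, delayMs);
    }
    // Otherwise give up silently: the element never appeared.
  }
  check();
}

// Possible use in a content script, with the selector from the question
// (onSendClick is a hypothetical handler):
// waitFor(
//   function () { return document.querySelector("#send_top"); },
//   function (button) { button.addEventListener("click", onSendClick, false); }
// );
```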
I would like to display some boxes on random pages through a browser plug-in. The content of these boxes is also random.
Is a simple check to remove scripts from said boxes enough to offer a user a safe experience?
Do I have to put the boxes in iframes?
Do I have to strip additional code from the HTML? (Is removing 'script' tags enough?)
Do you know of some library that can do that automatically?
Do I have to put the boxes in iframes?
Yes or no, depending on your definition of safe.
That will not stop the scripts from initiating downloads of malware, redirecting the user to a phishing page, or XSRFing a poorly designed site the user is currently logged into.
Is a simple check to remove scripts from said boxes enough to offer a user a safe experience?
No. There are many ways to embed scripts, and simple checks rarely get it right. For example, scripts can be embedded in links, CSS, SVG, data: URLs, etc.
Don't roll your own HTML sanitizer.
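To illustrate why simple checks fall short, here is a deliberately naive script-tag stripper together with three inputs that defeat it. The function name naiveStripScripts is made up for this example; it shows what not to rely on.

```javascript
// A "simple check": remove <script>...</script> blocks with one regex pass.
function naiveStripScripts(html) {
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, "");
}

var eventHandler = '<img src="x" onerror="alert(1)">';
var scriptUrl = '<a href="javascript:alert(1)">click</a>';
var nested = '<scr<script></script>ipt>alert(1)</scr<script></script>ipt>';

// No script tag at all, so both pass through unchanged and still run code.
var afterEventHandler = naiveStripScripts(eventHandler);
var afterScriptUrl = naiveStripScripts(scriptUrl);

// Removing the inner tags reassembles a working outer script tag:
// the result is '<script>alert(1)</script>'.
var afterNested = naiveStripScripts(nested);
```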
Directly relevant to your question about safe HTML widgets though is sandboxed JavaScript. See
http://code.google.com/p/google-caja/wiki/CorkboardDemo
No, plain HTML can still be malicious. An <iframe> could be used to load a drive-by exploit from any website. An <img> tag could be used to exploit a GET-based Cross-Site Request Forgery (CSRF) vulnerability. A POST-based CSRF exploit would require one line of JavaScript or some user interaction.
Removing JavaScript from HTML is far more complex than just removing script tags. HTMLPurifier is comprised of hundreds of regular expressions and it's the best method of removing JavaScript, but it's not perfect.
That all depends on where the content is coming from and what kind of content it is.
For example, if the content is just text from your site, you might want to filter out HTML, just in case.
In response to a click event on the page, I fetch a block of HTML and script via Ajax. I am able to take the script element and append it to the head element; however, WebKit-based browsers do not treat it as script (i.e. I cannot invoke a function declared in the appended script).
Using the Chrome Developer Tools I can see that my script node is indeed there, but it shows up differently than a script block that is not added dynamically: a non-dynamic script has a text child node, and I cannot figure out a way to duplicate this for the dynamic script.
Any ideas or better ways to be doing this? The driving force is there is potentially a lot of html and script that would never be needed unless a user clicks on a particular tab, in which case the relevant content (and script) would be loaded. Thanks!
You could try using jQuery... it provides a method called .getScript that will load the JavaScript dynamically in the proper way. And it works fine in all well known browsers.
How about calling eval() on the content you receive from the server? Of course, you have to cut off the <script> and </script> parts.
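A rough sketch of that idea, assuming the fragment comes from your own server and is trusted (eval runs whatever it is given). The names splitScripts and runScripts are made up, and the regex extraction is a simplification that only handles plain inline script blocks.

```javascript
// Split a trusted HTML fragment into its markup and its inline script bodies.
function splitScripts(fragment) {
  var bodies = [];
  var markup = fragment.replace(
    /<script\b[^>]*>([\s\S]*?)<\/script>/gi,
    function (match, body) {
      bodies.push(body);
      return ""; // drop the script block from the markup
    });
  return { markup: markup, scripts: bodies };
}

// Evaluate each script body in the global scope.
function runScripts(bodies) {
  for (var i = 0; i < bodies.length; i++) {
    // Indirect eval: functions declared in the body become globals,
    // so they can be invoked later from anywhere on the page.
    (0, eval)(bodies[i]);
  }
}

// Possible use after the Ajax call returns (container is a hypothetical element):
// var parts = splitScripts(responseText);
// container.innerHTML = parts.markup;
// runScripts(parts.scripts);
```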
If you're using a library like jQuery just use the built-in methods for doing this.
Otherwise you'd need to append it to the document rather than the head like this:
document.write("<scr" + "ipt type=\"text/javascript\" src=\"http://ajax.googleapis.com/ajax/libs/jquery/1.3.2/jquery.min.js\"></scr" + "ipt>");
In all honesty, I have no idea why the script tag is cut like that, but a lot of examples do that so there's probably a good reason.
You'll also need to account for the fact that loading the script might take quite a while, so after you've appended this to the body you should set up a timer that checks if the script is loaded. This can be achieved with a simple typeof check on any global variable the script exports.
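One caveat with document.write: when it is called after the page has finished loading, it replaces the whole document, so it is not suitable inside a click handler. An alternative, not taken from the answer above, is to create the script element through the DOM and rely on its load event instead of a timer (in current browsers). The name loadScript and the optional doc parameter are assumptions made for this sketch.

```javascript
// Load an external script and call `done(error)` once it has run or failed.
// `doc` defaults to the page's document; it is a parameter only so the
// function can be exercised without a browser.
function loadScript(src, done, doc) {
  doc = doc || document;
  var script = doc.createElement("script");
  script.onload = function () { done(null); };
  script.onerror = function () { done(new Error("Failed to load " + src)); };
  script.src = src;
  // Appending a DOM-created script element starts the download;
  // the script executes once it arrives.
  (doc.head || doc.body).appendChild(script);
  return script;
}

// Possible use when a tab is first shown (file and function names are examples):
// loadScript("tab-reports.js", function (err) {
//   if (!err) { initReportsTab(); }
// });
```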
Or you could just do an eval() on the actual javascript body, but there might be some caveats.
Generally speaking though, I'd leave this kind of thing up to the browser cache and just load the javascript on the page that your tabs are on. Just try not to use any onload events, but rather call whatever initializers you need when the tab is displayed.
In the 1990s, there was a fashion to put Javascript code directly into <a> href attributes, like this:
<a href="javascript:alert('Boo!')">Press me!</a>
And then suddenly I stopped seeing it. They were all replaced by things like:
<a href="#" onclick="alert('Boo!')">Press me!</a>
For a link whose sole purpose is to trigger Javascript code, and has no real href target, why is it encouraged to use the onclick property instead of the href property?
The execution context is different. To see this, try these links instead:
<a href="javascript:alert(this.tagName)">Press me!</a> <!-- result: undefined -->
<a href="#" onclick="alert(this.tagName)">Press me!</a> <!-- result: A -->
javascript: is executed in the global context, not as a method of the element, and the latter is usually what you want. In most cases you're doing something with or in relation to the element you acted on, so it is better to execute the code in that context.
Also, it's just much cleaner, though I wouldn't use in-line script at all. Check out any framework for handling these things in a much cleaner way. Example in jQuery:
$('a').click(function() { alert(this.tagName); });
Actually, both methods are considered obsolete. Developers are instead encouraged to move all JavaScript into an external JS file in order to separate logic and code from genuine markup:
http://www.alistapart.com/articles/behavioralseparation
http://en.wikipedia.org/wiki/Unobtrusive_JavaScript
The reason for this is that it creates code that is easier to maintain and debug, and it also promotes web standards and accessibility. Think of it like this: looking at your example, what if you had hundreds of links like that on a page and needed to swap the alert behavior for some other function? Using external JS references, you'd only need to change a single event binding in one JS file, as opposed to copying and pasting a bunch of code over and over again or doing a find-and-replace.
Couple of reasons:
Bad code practice:
The href attribute indicates a hyperlink reference to another location. Using the same attribute for a JavaScript function that does not actually take the user anywhere is bad programming practice.
SEO problems:
I think web crawlers use the href attribute to crawl through the web site and link all the connected parts. By putting in JavaScript, we break this functionality.
Breaks accessibility:
I think some screen readers will not be able to execute the JavaScript and might not know how to deal with it when they expect a hyperlink. Users expect to see a link in the browser status bar when hovering over the link, but will instead see a string like "javascript:", which might confuse them.
You are still in the 1990s:
The mainstream advice is to keep your JavaScript in a separate file and not mingle it with the HTML of the page, as was done in the 1990s.
HTH.
I open lots of links in new tabs - only to see javascript:void(0). So you annoy me, as well as yourself (because Google will see the same thing).
Another reason (also mentioned by others) is that different languages should be separated into different documents. Why? Well,
Mixed languages aren't well supported by most IDEs and validators. Embedding CSS and JS into HTML pages (or anything else for that matter) pretty much destroys opportunities to have the embedded language checked for correctness statically. Sometimes the same goes for the embedding language as well. (A PHP or ASP document isn't valid HTML.) You don't want syntax errors or inconsistencies to show up only at runtime.
Another reason is to have a cleaner separation between the kinds of things you need to specify: HTML for content, CSS for layout, JS usually for more layout and look-and-feel. These don't map one to one: you usually want to apply layout to whole categories of content elements (hence CSS) and look and feel as well (hence jQuery). They may be changed at different times than the content elements are changed (in fact the content is often generated on the fly) and by different people. So it makes sense to keep them in separate documents as well.
Using the javascript: protocol affects accessibility, and also hurts how SEO friendly your page is.
Take note that HTML stands for HyperText Markup Language. Hypertext denotes text with links and references in it, which is what an anchor element <a> is used for.
When you use the javascript: 'protocol' you're misusing the anchor element. Since you're misusing the <a> element, things like the Google Bot and the Jaws Screen reader will have trouble 'understanding' your page, since they don't care much about your JS but care plenty about the Hyper Text ML, taking special note of the anchor hrefs.
It also affects the usability of your page when a user who does not have JavaScript enabled visits your page; you're breaking the expected functionality and behavior of links for those users. It will look like a link, but it won't act like a link because it uses the javascript protocol.
You might think "but how many people have JavaScript disabled nowadays?" but I like to phrase that idea more along the lines of "How many potential customers am I willing to turn away just because of a checkbox in their browser settings?"
It boils down to this: href is an HTML attribute, and as such it belongs to your site's information, not its behavior. The JavaScript defines the behavior, but you never want it to interfere with the data/information. The epitome of this idea is the external JavaScript file: not using onclick as an attribute, but instead attaching an event handler in your JavaScript file.
Short Answer: Inline Javascript is bad for the reasons that inline CSS is bad.
The worst problem is probably that it breaks expected functionality.
For example, as others have pointed out, open in new window/tab = dead link = annoyed/confused users.
I always try to use onclick instead, add something to the URL hash of the page to indicate the desired function to trigger, and add a check at page load that reads the hash and triggers the function.
This way you get the same behavior for clicks, new tabs/windows and even bookmarked/sent links, and things don't get too wacky if JS is off.
In other words, something like this (very simplified):
For the link:
onclick="doStuff()"
href="#dostuff"
For the page:
window.onload = function () { if (window.location.hash === "#dostuff") doStuff(); };
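With more than one such link, the page-load check can be generalized into a small lookup. This is only a sketch; runActionForHash and the shape of the actions object are invented for the example.

```javascript
// Run the function registered for the given URL hash, if there is one.
// `actions` maps hash names (without the leading "#") to functions.
function runActionForHash(hash, actions) {
  var name = String(hash || "").replace(/^#/, "");
  // hasOwnProperty guards against inherited names such as "toString".
  if (Object.prototype.hasOwnProperty.call(actions, name)) {
    actions[name]();
    return true;
  }
  return false;
}

// Possible use on the page, so clicks, new tabs and bookmarks all end up
// in the same place (doStuff is the function from the example above):
// window.onload = function () {
//   runActionForHash(window.location.hash, { dostuff: doStuff });
// };
```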
Also, as long as we're talking about deprecation and semantics, it's probably worth pointing out that '<a>' doesn't mean 'clickable'. It means 'anchor', and implies a link to another page. So it would make sense to use that tag to switch to a different 'view' in your application, but not to perform a computation. The fact that you don't have a URL in your href attribute should be a sign that you shouldn't be using an anchor tag.
You can, alternatively, assign a click event action to nearly any HTML element. Maybe an <h1>, an <img>, or a <p> would be more appropriate? At any rate, as other people have mentioned, add another attribute (an 'id' perhaps) that JavaScript can use as a 'hook' (document.getElementById) to get to the element and assign an onclick. That way you can keep your content (HTML), presentation (CSS), and interactivity (JavaScript) separated. And the world won't end.
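A small sketch of the 'hook' approach, kept in an external JS file. The helper name bindClick and the id used in the usage note are placeholders; the document is passed in as a parameter only so the helper can be tried without a browser.

```javascript
// Look up an element by its id and assign a click handler to it.
// Returns false when no element with that id exists.
function bindClick(doc, id, handler) {
  var element = doc.getElementById(id);
  if (!element) {
    return false;
  }
  element.onclick = handler;
  return true;
}

// Possible use once the page has loaded (id and handler are examples):
// window.onload = function () {
//   bindClick(document, "recalculate", function () {
//     // perform the computation here
//     return false; // suppress any default action of the element
//   });
// };
```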
I typically have a landing page called "EnableJavascript.htm" that has a big message on it saying "Javascript must be enabled for this feature to work". And then I setup my anchor tags like this...
<a href="EnableJavascript.htm" onclick="funcName(); return false;">
This way, the anchor has a legitimate destination that gets overridden by your JavaScript functionality whenever possible. This will degrade gracefully. Although, nowadays, I generally build web sites with complete functionality before I decide to sprinkle some JavaScript into the mix (which altogether eliminates the need for anchors like this).
Using onclick attribute directly in the markup is a whole other topic, but I would recommend an unobtrusive approach with a library like jQuery.
I think it has to do with what the user sees in the status bar. Typically, applications should be built to fall back gracefully in case JavaScript isn't enabled; however, this isn't always the case.
With all the spamming that is going on people are getting smarter and when an email looks 'phishy' more and more people are looking at the status bar to see where the link will actually take them.
Remember to add 'return false;' to the end of your onclick handler so the page doesn't jump to the top on the user (unless that's the behaviour you are looking for).