How do online HTML editors prevent bugs from invalid user code?

How do online HTML editors prevent bugs from invalid user code? - javascript

I have built a browser-based HTML/Markdown code editor, that seems to be working great for the most part. I have live rendering previews as the user updates code, etc. I already sanitize any <script> tags, but other elements such as <div> or <style> are allowed.
The problem is if the user saves the document with invalid HTML (i.e., unclosed <style> tags, etc.), upon reloading the document, the whole site will be inoperable due to my HTML elements being "eaten" by the unclosed tag in the user code.
Question: Is there a solid strategy to render user code in an isolated container such that errors inside that container do not bleed out into the rest of the page?
I am using Javascript with React. This seems like a use case for iFrames, but I have had it beaten into me that iFrames are never a good idea, and there is always a better way to do what you want to do.

Iframes are the perfect solution for problems like this. Over using them can be a problem on a page, but in this case it is the right tool for the job.
Also not sure why that was beaten into you, the frame tag is always bad, but iframes used sparingly are useful.

Related

Inserting user provided content into a document--validating HTML string, insertAdjacentHTML and iframe usage?

If I want to accept HTML built by a user of an extension, and not from a web source, and display it within an existing extension document, is there an alternative to using an iframe?
For example, if a user provides mathematical expresions using MathML and they are to be displayed in the current web page, and the user may add a <div> or <p> tag inaccurately and have incomplete HTML, how can it be added to the page without corrupting the layout of the page, apart from an iframe?
Does insertAdjacentHTML really accomplish this? This MDN article seems to imply so, where it reads, "It does not reparse the element it is being used on, and thus it does not corrupt the existing elements inside that element."
Or, is there a way to validate the HTML string before inserting into the DOM, such as DOMParser?
Also, for users that are knowledgeable in HTML, CSS, JS and can construct a small interactive document rather than just an expression, to be dislayed within the page, is an iframe the only option? The user provided code will be stored in indexedDB and rendered only on the user's machine and within this extension tool. So, something similar to a snippet on stackoverflow. I have this working in an iframe now but the user could add about a dozen of these to the page at any one time and I wondered if there is a better way of accomplishing this, regarding memory usage and in general.
Thank you.

Show incorrect html on page that doesn't influence the rest of the page

I have got some detail content pages on my site where I don't have the complete control over the html content that is displayed in a certain div. Now when the content of the external resource contains invalid html, like having no ending my navigation in the right-bar is also italic. I don't want to use iframes, like ebay, and there is probably other ways to fix this. Hope on an answer.
<html>
<body>
<div id="page">
<div id="content">[content of external resource]</div>
<div id="right-bar">[My navigation]</div>
</div>
</body>
</html>
A simplified structure of my page is above.

I hate to tell people to use Tables when they aren't necessary, but in this case, I feel that it could solve your problem.
The issue that you are facing is one of the browsers well-formed html check, so, you could have some browsers that work as you hope, and others that work the way that bothers you, as each rendering engine is going to perform it's own well-formed html check flavor.
If you wrap it inside a td, then I don't think that it will be able to bleed styling the way that you are seeing. Just a thought. The reason that a td container is going to help more than the div container that you are currently using is the following: Since you are wrapping their stuff in a div, and they are most likely wrapping their own stuff in divs, the browser doesn't know where the mistake is at. It doesn't know where the missing div tag should be inserted. So essentially, div in div in div creates problems for the well-formed html check, as it is not sure which of the tag you forgot. However, div in div in td, that is more distinct. If the td open and closes, then it knows that the missing tags belong to a smaller group of possible elements. In other words, you are making it easier on the well-formed check to do it's job by wrapping it inside different tag types.
This makes sense to me. I hope that I have explained it ok.

I think that there is no other way than using iframes. If you don't use an iframe to host the external content means that the external content will be included in the DOM structure of your page, so unless you parse all the external content code to check all the possible things that should affect your page (and this could be a madness), you will never be sure that your page will be safe from collateral effects coming from the external code.
And even using iframes, you should parse anyway the external content to look for script tags, to prevent any undesired javascript code been executing inside your page.

Are HTML-only widgets secure?

I would like to display some boxes on random pages through a browser plug-in. The content of these boxes is also random.
Is a simple check to remove scripts from said boxes enough to offer a user a safe experience?
Do I have to put the boxes in iframes?
Do I have strip off additional code from HTML? (is removing 'script' tags enough?)
Do you know of some library that can do that automatically?

Do I have to put the boxes in iframes?
Yes or no, depending on your definition of safe.
That will not stop the scripts from initiating downloads of malware, redirecting the user to a phishing page, XSRFing a poorly designed site the user is currently logged into.
Is a simple check to remove scripts from said boxes enough to offer a user a safe experience?
No. There are many ways to embed scripts, and simple checks rarely get it right. For example, scripts can be embedded in links, CSS, SVG, data: URLs, etc.
Don't roll your own HTML sanitizer.
Directly relevant to your question about safe HTML widgets though is sandboxed JavaScript. See
http://code.google.com/p/google-caja/wiki/CorkboardDemo

No, plane HTML can still be malicious. An <iframe> could be used to load a drive-by-exploit from any website. an <img> tag could be used to exploit a GET based Cross-Site Request Forgery(CSRF) vulnerability. A POST based CSRF exploit would require one line of javascript or some user interaction.
Removing javascript form html is far more complex than just removing script tags. HTMLPurier is comprised of hundreds of regular expressions and its the best method of removing javascript, but its not perfect.

That all depends on from where the content is coming from and what kind of content it is.
For example, if the content is just text from your site, you might want to filter out HTML, just in case.

Why is it bad practice to use links with the javascript: "protocol"?

In the 1990s, there was a fashion to put Javascript code directly into <a> href attributes, like this:
Press me!
And then suddenly I stopped to see it. They were all replaced by things like:
Press me!
For a link whose sole purpose is to trigger Javascript code, and has no real href target, why is it encouraged to use the onclick property instead of the href property?

The execution context is different, to see this, try these links instead:
Press me! <!-- result: undefined -->
Press me! <!-- result: A -->
javascript: is executed in the global context, not as a method of the element, which is usually want you want. In most cases you're doing something with or in relation to the element you acted on, better to execute it in that context.
Also, it's just much cleaner, though I wouldn't use in-line script at all. Check out any framework for handling these things in a much cleaner way. Example in jQuery:
$('a').click(function() { alert(this.tagName); });

Actually, both methods are considered obsolete. Developers are instead encouraged to separate all JavaScript in an external JS file in order to separate logic and code from genuine markup
http://www.alistapart.com/articles/behavioralseparation
http://en.wikipedia.org/wiki/Unobtrusive_JavaScript
The reason for this is that it creates code that is easier to maintain and debug, and it also promotes web standards and accessibility. Think of it like this: Looking at your example, what if you had hundreds of links like that on a page and needed to change out the alert behavior for some other function using external JS references, you'd only need to change a single event binding in one JS file as opposed to copying and pasting a bunch of code over and over again or doing a find-and-replace.

Couple of reasons:
Bad code practice:
The HREF tag is to indicate that there is a hyperlink reference to another location. By using the same tag for a javascript function which is not actually taking the user anywhere is bad programming practice.
SEO problems:
I think web crawlers use the HREF tag to crawl throughout the web site & link all the connected parts. By putting in javascript, we break this functionality.
Breaks accessibility:
I think some screen readers will not be able to execute the javascript & might not know how to deal with the javascript while they expect a hyperlink. User will expect to see a link in the browser status bar on hover of the link while they will see a string like: "javascript:" which might confuse them etc.
You are still in 1990's:
The mainstream advice is to have your javascript in a seperate file & not mingle with the HTML of the page as was done in 1990's.
HTH.

I open lots of links in new tabs - only to see javascript:void(0). So you annoy me, as well as yourself (because Google will see the same thing).
Another reason (also mentioned by others) is that different languages should be separated into different documents. Why? Well,
Mixed languages aren't well supported
by most IDEs and validators.
Embedding CSS and JS into HTML pages
(or anything else for that matter)
pretty much destroys opportunities to
have the embedded language checked for correctness
statically. Sometimes, the embedding language as well.
(A PHP or ASP document isn't valid HTML.)
You don't want syntax
errors or inconsistencies to show up
only at runtime.
Another reason is to have a cleaner separation between
the kinds of things you need to
specify: HTML for content, CSS for
layout, JS usually for more layout
and look-and-feel. These don't map
one to one: you usually want to apply
layout to whole categories of
content elements (hence CSS) and look and feel as well
(hence jQuery). They may be changed at different
times that the content elements are changed (in fact
the content is often generated on the fly) and by
different people. So it makes sense to keep them in
separate documents as well.

Using the javascript: protocol affects accessibility, and also hurts how SEO friendly your page is.
Take note that HTML stands for Hypter Text something something... Hyper Text denotes text with links and references in it, which is what an anchor element <a> is used for.
When you use the javascript: 'protocol' you're misusing the anchor element. Since you're misusing the <a> element, things like the Google Bot and the Jaws Screen reader will have trouble 'understanding' your page, since they don't care much about your JS but care plenty about the Hyper Text ML, taking special note of the anchor hrefs.
It also affects the usability of your page when a user who does not have JavaScript enabled visits your page; you're breaking the expected functionality and behavior of links for those users. It will look like a link, but it won't act like a link because it uses the javascript protocol.
You might think "but how many people have JavaScript disabled nowadays?" but I like to phrase that idea more along the lines of "How many potential customers am I willing to turn away just because of a checkbox in their browser settings?"
It boils down to how href is an HTML attribute, and as such it belongs to your site's information, not its behavior. The JavaScript defines the behavior, but your never want it to interfere with the data/information. The epitome of this idea would be the external JavaScript file; not using onclick as an attribute, but instead as an event handler in your JavaScript file.

Short Answer: Inline Javascript is bad for the reasons that inline CSS is bad.

The worst problem is probably that it breaks expected functionality.
For example, as others has pointed out, open in new window/tab = dead link = annoyed/confused users.
I always try to use onclick instead, and add something to the URL-hash of the page to indicate the desired function to trigger and add a check at pageload to check the hash and trigger the function.
This way you get the same behavior for clicks, new tab/window and even bookmarked/sent links, and things don't get to wacky if JS is off.
In other words, something like this (very simplified):
For the link:
onclick = "doStuff()"
href = "#dostuff"
For the page:
onLoad = if(hash="dostuff") doStuff();

Also, as long as we're talking about deprecation and semantics, it's probably worth pointing out that '</a>' doesn't mean 'clickable' - it means 'anchor,' and implies a link to another page. So it would make sense to use that tag to switch to a different 'view' in your application, but not to perform a computation. The fact that you don't have a URL in your href attribute should be a sign that you shouldn't be using an anchor tag.
You can, alternately, assign a click event action to nearly any html element - maybe an <h1>, an <img>, or a <p> would be more appropriate? At any rate, as other people have mentioned, add another attribute (an 'id' perhaps) that javascript can use as a 'hook' (document.getElementById) to get to the element and assign an onclick. That way you can keep your content (HTML) presentation (CSS) and interactivity (JavaScript) separated. And the world won't end.

I typically have a landing page called "EnableJavascript.htm" that has a big message on it saying "Javascript must be enabled for this feature to work". And then I setup my anchor tags like this...
<a href="EnableJavascript.htm" onclick="funcName(); return false;">
This way, the anchor has a legitimate destination that will get overwritten by your Javascript functionality whenever possible. This will degrade gracefully. Although, now a days, I generally build web sites with complete functionality before I decide to sprinkle some Javascript into the mix (which all together eliminates the need for anchors like this).
Using onclick attribute directly in the markup is a whole other topic, but I would recommend an unobtrusive approach with a library like jQuery.

I think it has to do with what the user sees in the status bar. Typically applications should be built for failover in case javascript isn't enabled however this isn't always the case.
With all the spamming that is going on people are getting smarter and when an email looks 'phishy' more and more people are looking at the status bar to see where the link will actually take them.
Remember to add 'return false;' to the end of your link so the page doesn't jump to the top on the user (unless that's the behaviour you are looking for).

Noscript Tag, JavaScript Disabled Warning and Google Penalty

I have been using a noscript tag to show a warning when users have JavaScript disabled or are using script blocking plugins like Noscript. The website will not function properly if JavaScript is disabled and users may not figure out why it is not working without the warning.
After the latest Google algorithm shuffle, I have seen the daily traffic drop to about 1/3 of what it was in the previous months. I have also seen pages that were ranking #1 or #2 in the SERPS drop out of the results. After doing some investigating in webmaster tools, I noticed that "JavaScript" is listed as #16 in the keywords section. This makes no sense because the site has nothing to do with JavaScript and the only place that word appears is in the text between the noscript tags.
It seems that Google is now including and indexing the content between the noscript tags. I don't believe that this was happening before. The warning is three sentences. I'd imagine that having the same three sentences appearing at the top of every single page on the site could have a damaging effect on the SEO.
Do you think this could be causing a problem with SEO? And, is there any other method to provide a warning to users who have JavaScript disabled in a way that won't be indexed or read by search engines?

Put the <noscript> content at the end of your HTML, and then use CSS to position it at the top of the browser window. Google will no longer consider it important.
Stack Overflow itself uses this technique - do a View Source on this page and you'll see a "works best with JavaScript" warning near the end of the HTML, which appears at the top of the page when you switch off JavaScript.

<noscript> is not meant for meaningless warnings like:
<noscript>
Oh, no! You don't have JavaScript enabled! If you don't enable JS, you're doomed. [Long explanation about how to enable JS in every browser ever made]
</noscript>
It's meant for you to provide as much content as you can, along with a polite mention that enabling JS will provide access to certain extra features. You'll find that basically every popular site follows this guideline.

I don't think using <noscript> is a good idea. I've heard that it is ineffective when the client is behind a JavaScript-blocking firewall - if the client's browser has JavaScript enabled the <noscript> tag won't activate, because, as far as the browser's concerned, JavaScript is fully operable within the document...
A better method IMO, is to have all would-be 'noscript' content hidden by JavaScript.
Here's a very basic example:
...
<body>
<script>
document.body.className += ' js-enabled';
</script>
<div id="noscript">
Welcome... here's some content...
</div>
And within your StyleSheet:
body.js-enabled #noscript { display: none; }
More info:
Replacing <noscript> with accessible, unobtrusive DOM/JavaScript
Reasons to avoid NOSCRIPT

Somebody on another forum mentioned using an image for the warning. The way I see it, this would have three benefits:
There wouldn't be any irrelevant text for search engines to index.
The code to display a single image is less bulky than a text warning (which gets loaded on every page).
Tracking could be implemented to determine how many times the image is called, to give an idea of how many visitors have JavaScript disabled or blocked.
If you combine this with something like the non-noscript technique mentioned by J-P, it seems to be the best possible solution.

Just wanted to post an interesting tidbit related to this. For a site of mine I have ended up doing something similar to what stack overflow uses, but with the addition of a "find out more" link as my users are not as technical as this site.
The interesting part is that following advice of people aboce, my solution ditched the noscript tag, instead opting to hide the message divs with javascript. But I found that if firefox is waiting for its master password, this hiding of the message is interupted, so I think I will go back to noscript.

If you choose a solution based on replacing the div content (if js is enabled, then the div content gets updated) rather than using a noscript tag, be careful about how google views this practice:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=66353
I'm not sure google will consider it deceptive, but it's something to consider and research further. Here's another stackoverflow post about this: noscript google snapshot, the safe way

Develop Reference

JavaScript is the programming language of the Web.