Google custom search (or alternative) in XHTML

Google's JavaScript API makes use of document.write and is therefore not usable in XHTML.
Does anyone know a workaround to get the custom search working in XHTML? Or is there a working alternative?

Are you actually serving your XHTML as XML (application/xhtml+xml)? If not, you don't have to worry about it yet. document.write will still work in text/html mode, though it is certainly poor practice in general.
If you really are serving native XHTML... well, I suspect you may run into more problems than just document.write, as there are a fair few things that can trip scripting up when it's not expecting to be run in XHTML. But you can hack around the problem by sabotaging document.write.
The simplest method would be something like:
document.write = function(s) {
    document.getElementById('placetoputwrittenstuff').innerHTML = s;
};
However, you would need more messing around if the script tried to write <script> tags (setting them through innerHTML doesn't execute them; you would have to pick them out with getElementsByTagName and run each one manually), or wrote partial bits of elements across different calls to write (in which case you'd have to collect the strings and glue them together once it has finished).
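For illustration, a slightly fuller sketch along those lines: buffer everything that gets written, then inject the collected markup and run any script tags by hand once the third-party code has finished. The flushWrites name is invented for this example, and the target element id is reused from the snippet above.

var writeBuffer = [];

document.write = function(s) {
    writeBuffer.push(s); // fragments may be partial tags, so just collect them
};

function flushWrites() {
    var target = document.getElementById('placetoputwrittenstuff');
    target.innerHTML = writeBuffer.join(''); // glue the fragments together

    // Scripts inserted via innerHTML are not executed, so run them manually.
    var scripts = target.getElementsByTagName('script');
    for (var i = 0; i < scripts.length; i++) {
        if (scripts[i].src) {
            var external = document.createElement('script');
            external.src = scripts[i].src;
            document.body.appendChild(external); // re-insert to trigger loading
        } else {
            eval(scripts[i].text); // run inline script content
        }
    }
}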

Javascript inlined between script tags vs src DATA URI UTF-8 with percent-encoding

I build mobile-first and I use tiny frameworks (under 10 kB), which I inline in index.html to save on HTTP requests.
I've been looking for days now, and it seems like everyone else who inlines JavaScript does it like this:
<script>UGLIFIED JAVASCRIPT</script>
I do it like this:
<script src="data:application/javascript;utf8, UGLIFIED PERCENT-ENCODED JAVASCRIPT"></script>
You may say percent encoding will make the file much larger, but it actually doesn't, because of the way gzip works: it replaces repetition, and it doesn't matter whether the repeated phrase is <div> or %3Cdiv%3E.
My question is- are there any potential advantages of my approach?
PS. One of my ideas was having the browser cache the file-like data-URI elements, but I don't know if this makes sense, since I would then also have to find a way to prevent parts of index.html from loading. Unless I could use the cached elements elsewhere - that would have its use cases too. Thoughts?
First, if your site isn't an SPA, inlining your shared scripts (regardless of method) means you're loading them on every page, negating the value of the browser cache.
Second, the trip across the wire may be similar for encoded vs. unencoded script, but the more important metric is the time it takes for the JavaScript to be parsed and compiled. URL decoding isn't free, and while I don't think it's going to matter much in the grand scheme of things, I see no reason why it would actually be faster to load than plain script within the tag.
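If you want to sanity-check the gzip claim from the question, a quick sketch in Node.js (using the built-in zlib module) makes the comparison easy; the repeated markup here is just a stand-in for real minified code:

const zlib = require('zlib');

const plain = '<div>hello</div>'.repeat(1000);
const encoded = encodeURIComponent(plain);

// gzip finds the repetition in both forms, so the compressed sizes are close.
console.log('plain:  ', plain.length, '->', zlib.gzipSync(plain).length);
console.log('encoded:', encoded.length, '->', zlib.gzipSync(encoded).length);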

Selector interpreted as HTML on my website. Can an attacker easily exploit this?

I have a website that is only accessible via https.
It does not load any content from other sources. So all content is on the local webserver.
Using the Retire.js Chrome plugin I get a warning that the jQuery 1.8.3 I included is vulnerable to 'Selector interpreted as HTML'
(jQuery bug 11290)
I am trying to motivate a quick upgrade, but I need more concrete information to justify the upgrade to the powers that be.
My questions are:
Given the above, should I be worried?
Can this result in an XSS-type attack?
What the bug is telling you is that jQuery may mis-identify a selector containing a < as being an HTML fragment instead, and try to parse and create the relevant elements.
So the vulnerability, such as it is, is that a cleverly-crafted selector, if then passed into jQuery, could define a script tag that then executes arbitrary script code in the context of the page, potentially taking private information from the page and sending it to someone with malicious (or merely prurient) intent.
This is largely only useful if User A can write a selector that will later be given to jQuery in User B's session, letting User A steal information from User B's page. (It really doesn't matter if a user can "trick" jQuery this way on their own page; they can do far worse things from the console, or with "save as".)
So: If nothing in your code lets users provide selectors that will be saved and then retrieved by other users and passed to jQuery, I wouldn't be all that worried. If it does (with or without the fix for the bug), I'd examine those selector strings really carefully. I say "with or without the fix" because if you don't filter what the users type at all, they could still just provide an HTML fragment whose first non-whitespace character is <, which would still cause jQuery to parse it as an HTML fragment.
As the author of Retire.js, let me shed some light on this. There are two weaknesses in older versions of jQuery, but they are not vulnerabilities by themselves; it depends on how jQuery is used. Two examples abusing the bugs are shown here: research.insecurelabs.org/jquery/test/
The two examples are:
$("#<img src=x onerror=...>")
and
$("element[attribute='<img src=x onerror=...>']")
Typically this becomes a problem if you do something like:
$(location.hash)
This was a fairly common pattern on many web sites when single-page web sites started to appear.
So this becomes a problem if and only if you put untrusted user data inside the jQuery selector function.
And yes, the end result is XSS, if the site is in fact vulnerable. HTTPS will not protect you against these kinds of flaws.
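For instance, the safer version of the location.hash pattern treats the hash as data rather than as a selector. A minimal sketch (the .show() call is just a placeholder for whatever you actually do with the element):

// Dangerous on old jQuery: a hash like '#<img src=x onerror=...>' executes.
// $(location.hash);

// Safer: look the element up by id, so the string is never parsed as HTML.
var id = location.hash.slice(1); // strip the leading '#'
var target = document.getElementById(id);
if (target) {
    $(target).show();
}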

What are the arguments against the inclusion of server side scripting in JavaScript code blocks?

I've been arguing for some time against embedding server-side tags in JavaScript code, but was put on the spot today by a developer who seemed unconvinced.
The code in question was a legacy ASP application, although this is largely unimportant as it could equally apply to ASP.NET or PHP (for example).
The example in question revolved around the use of a constant that they had defined in server-side code.
' VBScript
Const MY_CONST = 1
If sMyVbVar = MY_CONST Then
    ' Do Something
End If

// JavaScript
if (sMyJsVar === "<%= MY_CONST %>") {
    // Do Something
}
My standard arguments against this are:
Script injection: The server-side tag could include code that can break the JavaScript code
Unit testing. Harder to isolate units of code for testing
Code Separation : We should keep web page technologies apart as much as possible.
The reason for doing this was so that the developer did not have to define the constant in two places. They reasoned that, as it was a value they controlled, it wasn't subject to script injection. This reduced my justification for (1) to "We're trying to keep the standards simple, and defining exception cases would confuse people".
The unit testing and code separation arguments did not hold water either, as the page itself was a horrible amalgam of HTML, JavaScript, ASP.NET, CSS, XML... you name it, it was there. No code that was ever going to be included in this page could possibly be unit tested.
So I found myself feeling like a bit of a pedant insisting that the code was changed, given the circumstances.
Are there any further arguments that might support my reasoning, or am I, in fact being a bit pedantic in this insistence?
Script injection: The server-side tag could include code that can break the JavaScript code
So write the code properly and make sure that values are correctly escaped when introduced into the JavaScript context. If your framework doesn't include a JavaScript "quoter" tool (hint: the JSON support is probably all you need), write one.
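The principle, sketched in JavaScript for brevity (on the server you would use your framework's JSON encoder): JSON-encoding the value yields a correctly quoted string literal, and one extra replace defangs < so the result can't close an inline <script> block. The userValue name is invented for the example.

// A value that would break out of a script block if interpolated naively:
var value = "O'Reilly </script><script>alert(1)</script>";

// JSON.stringify produces a valid, correctly escaped string literal;
// the replace additionally escapes '<' so '</script>' can't appear verbatim.
var literal = JSON.stringify(value).replace(/</g, '\\u003c');

console.log('var userValue = ' + literal + ';'); // safe to emit into a page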
Unit testing. Harder to isolate units of code for testing
This is a good point, but if it's necessary for the server to drop things into the page for code to use, then it's necessary. I mean, there are times when this simply has to be done. A good way to do it is for the page to contain some sort of minimal block of data. Thus the server-munged JavaScript on the page really isn't "code" to be tested, it's just data. The real client code included from .js files can find the data and use it.
Thus, the page may contain:
<script>
(function(window) {
    window['pageData'] = {
        companyName: '<%= company.name %>',
        // etc
    };
})(this);
</script>
Now your nicely-encapsulated pure JavaScript code in ".js" files just has to check for window.pageData, and it's good to go.
Code Separation : We should keep web page technologies apart as much as possible.
Agreed, but it's simply a fact that sometimes server-side data needs to drive client-side behavior. To create hidden DOM nodes solely for the purpose of storing data and satisfying your rules is itself a pretty ugly practice.
Coding rules and aesthetics are Good Things. However, one should be pragmatic and take everything in perspective. It's important to remember that the context of such rules is not always a Perfect Divine Creation, and in the case of HTML, CSS, and JavaScript I think that fact is glaringly clear. In such an imperfect environment, hard-line rules can force you into unnecessary work and code that's actually harder to maintain.
edit: oh, here's something else I just thought of; sort of a compromise. A "trick" popularized (in part) by the jQuery gang with their "micro-template" facility (apologies to the web genius who actually hit upon this first) is to use <script> tags that are "neutered":
<script id='pageData' type='text/plain'>
{
    "companyName": "<%= company.name %>",
    "accountType": "<%= user.primaryAccount.type %>"
}
</script>
Now the browser itself will not even execute that script - the "type" attribute isn't something it understands as being code, so it just ignores it. However, browsers do make the content of such scripts available, so your code can find the script by "id" value and then, via some safe JSON library or a native browser API if available, parse the notation and extract what it needs. The values still have to be properly quoted etc, but you're somewhat safer from XSS holes because it's being parsed as JSON and not as "live" full-blown JavaScript.
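The reading side is short. A minimal sketch, assuming the pageData id from the snippet above and native JSON support (older browsers would need a JSON library):

// Find the neutered script block and parse its contents as JSON.
var raw = document.getElementById('pageData').textContent;
var pageData = JSON.parse(raw); // throws if the server emitted invalid JSON
console.log(pageData.companyName);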
The reason for doing this was so that the developer did not have to define the constant in two places.
To me, this is a better argument than any argument you can make against it. It is the DRY principle. And it greatly enhances code maintainability.
Every style guide or rule taken to an extreme leads to an anti-pattern. In this case your insistence on separation of technologies breaks the DRY principle and can potentially make the code harder to maintain. Even DRY itself, taken to an extreme, can lead to an anti-pattern: softcoding.
Code maintainability is a fine balance. Style guides are there to help maintain that balance. But you have to know when those very guides help and when they themselves become a problem.
Note that in the example you have given, the code would not break syntax highlighting or parsing (even Stack Overflow highlights it correctly), so the IDE argument does not hold, since the IDE can still parse that code correctly.
It simply gets unreadable. You have to take a closer look to tell the different languages apart, and if JavaScript and the mixed-in language use the same variable names, things get even worse. This is especially hard for people who have to read other people's code.
Many IDEs also have problems with heavily mixed documents, which can lead to the loss of auto-completion, proper syntax highlighting, and so on.
It makes the code less reusable. Think of a JavaScript function that does a common task, like echoing an array of things. If you separate the JavaScript logic from the data it iterates over, you can use the same function all over your application, and changes to that function only have to be made once. If the data is mixed into the JavaScript output loop, you will probably end up repeating the JavaScript code just because the mixed-in language has an additional if-statement before each loop.
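To illustrate, a sketch of that separation (the function and element names are invented for the example): the loop lives in one reusable function, and the server only supplies the data, e.g. via the pageData pattern shown earlier.

// Reusable: iterates over whatever array it is given.
function renderList(items, targetId) {
    var html = '';
    for (var i = 0; i < items.length; i++) {
        html += '<li>' + items[i] + '</li>';
    }
    document.getElementById(targetId).innerHTML = '<ul>' + html + '</ul>';
}

// The data could come from the server; the loop itself never changes.
renderList(['apples', 'pears'], 'fruitList');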

javascript in the <body>

I used to believe that you should not insert JavaScript blocks
<script language="javascript">
<!--
//-->
</script>
into the body part of a (what, HTML, XHTML?) document, but rather into the head.
But is that still true?
Script in the body (as opposed to links to external files) is like putting CSS in the head: people are moving toward separating them, so that markup and logic stay apart for clarity and ease of maintenance.
I'm a big fan of the document-ready feature of jQuery... but that's just personal preference. A DOM loader is really the only way to guarantee that loading is going to be identical across the various browsers. Thanks, Microsoft!
I say use common sense... it's not worth creating another file for a single line of code, or even two. If we all went to the extremes that best practices sometimes ask of us, we'd all be nuts... or at least more nuts than we are now.
But is that still true?
I'm not sure it ever was. Are you thinking about <style> CSS elements? Because those are illegal in the body.
But it is usually the better choice to put JavaScript code in the head or a separate script file, and wrap it in a document-ready or onload event.
However, in-body JavaScript can have its place, for example when embedding external scripts that document.write() stuff into the document. Top-modern, bleeding-edge Google Analytics relies on a <script> segment being inserted at the very end of the body.
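For example, a minimal sketch of the head-based approach using a DOM-ready event (the 'greeting' element is invented for this example):

// In the <head> or an external .js file: wait until the DOM is parsed
// before touching elements in the body.
document.addEventListener('DOMContentLoaded', function() {
    // Anything in the body is safe to access from here on.
    document.getElementById('greeting').textContent = 'Hello!';
});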
But is that still true?
It is a matter of good/best practice. HTML and JavaScript should be kept separate. This is even known as unobtrusive JavaScript.
More info at wikipedia:
Unobtrusive JavaScript
Although this is good practice, you can still put JavaScript in any part of the page; you should just avoid doing so as much as possible.
Some advocate that JavaScript should only go at the end of the page; for example, they say it is better in terms of SEO (Search Engine Optimization) as well as performance, as noted by @David Dorward in his comment.
According to Yahoo, for the best performance it's recommended to put any script tags at the end of your document, just before the closing body tag:
http://developer.yahoo.com/performance/rules.html
Google suggests using a deferred method to load scripts:
http://code.google.com/speed/page-speed/docs/payload.html#DeferLoadingJS
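The deferred method described there boils down to injecting the script element only after the page's load event fires; roughly like this sketch (the 'deferred.js' filename is a placeholder):

// Inject the script after the load event, so it never blocks rendering.
window.addEventListener('load', function() {
    var el = document.createElement('script');
    el.src = 'deferred.js';
    document.body.appendChild(el);
});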
But they should almost always be script calls to an external .js file. There are very few occasions where it's better to have the .js embedded on the page.
It's sometimes discouraged because a script that tries to access elements in the body (i.e. forms, fields, etc.) may run before those elements are available, since they only appear once that part of the body has been parsed. However, it's valid and actually a very common practice.

Is there an Open Source Python library for sanitizing HTML and removing all Javascript?

I want to write a web application that allows users to enter any HTML that can occur inside a <div> element. This HTML will then end up being displayed to other users, so I want to make sure that the site doesn't open people up to XSS attacks.
Is there a nice library in Python that will clean out all the event handler attributes, <script> elements and other Javascript cruft from HTML or a DOM tree?
I am intending to use Beautiful Soup to regularize the HTML to make sure it doesn't contain unclosed tags and such. But, as far as I can tell, it has no pre-packaged way to strip all Javascript.
If there is a nice library in some other language, that might also work, but I would really prefer Python.
I've done a bunch of Google searching and hunted around on pypi, but haven't been able to find anything obvious.
Related: Sanitising user input using Python
As Klaus mentions, the clear consensus in the community is to use BeautifulSoup for these tasks:
import BeautifulSoup

soup = BeautifulSoup.BeautifulSoup(html)
for script_elt in soup.findAll('script'):
    script_elt.extract()  # remove the <script> element from the tree
html = str(soup)
A whitelist approach to allowed tags, attributes, and their values is the only reliable way. Take a look at Recipe 496942: Cross-site scripting (XSS) defense.
What is wrong with existing markup languages such as the one used on this very site?
You could use BeautifulSoup. It allows you to traverse the markup structure fairly easily, even if it's not well-formed. I don't know that there's something made to order that works only on script tags.
I would honestly look at using something like bbcode or some other alternative markup with it.
Eric,
Have you thought about using a SAX-type parser for the HTML? I'm really not sure, though, that it would ignore the event attributes properly. It would also be a bit harder to construct than using something like Beautiful Soup, and handling syntax errors may be a problem with SAX as well.
What I like to do in situations like this is to construct Python objects (subclassed from an XML_Element class) from the parsed HTML, then remove any undesired objects from the tree, and finally re-serialize the objects back to HTML. It's not all that hard in Python.
Regards,
