JavaScript / Lit-element: safe way to parse HTML

I'm getting some HTML from my server that I want to put into my page, but I want it to be sanitized first (just in case). I'm not quite sure how to do this.
So far I've tried:
<div .innerHTML="${body}"></div>
That should parse it as HTML, but I'm not 100% sure it's the best way.
I have also looked at online sanitizers but haven't been able to find any that fit my project (a Lit-element web component).
Is there a better way to parse HTML, and if so, how?

Take a look at the DOMParser interface API:
// append() accepts individual nodes (or strings), not an HTMLCollection, so spread the parsed children
document.getElementById('my-target').append(...new DOMParser().parseFromString(data, 'text/html').body.children);

It's not clear whether you want to render the HTML as HTML or as text.
I seem to remember that lit-html does some things behind the scenes to produce secure templates, but surprisingly I cannot find official content to back up that statement. Others have asked about this before.
In that GitHub issue we can see that Mike Samuel mentions that what you're doing is not secure. You can trust Mike Samuel: he's worked in the security field at Google, and I had the privilege of listening to one of his talks on the topic two years ago.
Here's a quick example:
<p .innerHTML=${'<button onclick="alert(42)">click</button>'}></p>
This renders a button which produces an alert when you click it. In this example the JavaScript code is harmless, but it's not hard to imagine something far more dangerous.
By contrast, the following simply renders the code as a string. The content is escaped and therefore totally harmless:
<p>${'<button onclick="alert(42)">click</button>'}</p>
In fact, similar to React's dangerouslySetInnerHTML attribute, you need to "opt out" of secure templating via lit-html's unsafeHTML directive:
Renders the argument as HTML, rather than text.
Note, this is unsafe to use with any user-provided input that hasn't been sanitized or escaped, as it may lead to cross-site-scripting vulnerabilities.
<p>${unsafeHTML('<button onclick="alert(42)">click</button>')}</p>
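For completeness, here is what that opt-out looks like in a full template. This is a minimal sketch assuming lit-html 1.x and its standard directive module path:
import { html, render } from 'lit-html';
import { unsafeHTML } from 'lit-html/directives/unsafe-html.js';

// Only do this with markup you trust or have sanitized yourself.
const view = (markup) => html`<p>${unsafeHTML(markup)}</p>`;
render(view('<em>server-provided snippet</em>'), document.body);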
About DOMParser#parseFromString
In this introductory article about Trusted Types we can see that this method is a known XSS sink.
Sure, it won't execute <script> blocks, but it won't sanitise the string for you either. You are still at risk of XSS here:
<p .innerHTML="${(new DOMParser()).parseFromString('<button onclick="alert(42)">click</button>','text/html').body.innerHTML}"></p>
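If you do want to accept HTML from your server and neutralize it before insertion, run it through a sanitizer first; a vetted library such as DOMPurify is the robust choice. Below is a deliberately minimal hand-rolled sketch just to show the idea. It assumes the only dangerous constructs are script elements, on* event-handler attributes, and javascript: URLs, which is not a complete threat model:
function naiveSanitize(dirty) {
  const doc = new DOMParser().parseFromString(dirty, 'text/html');
  // Drop <script> elements entirely.
  doc.querySelectorAll('script').forEach((el) => el.remove());
  // Strip inline event handlers and javascript: URLs from every element.
  doc.querySelectorAll('*').forEach((el) => {
    for (const attr of [...el.attributes]) {
      if (attr.name.toLowerCase().startsWith('on') || /^\s*javascript:/i.test(attr.value)) {
        el.removeAttribute(attr.name);
      }
    }
  });
  return doc.body.innerHTML;
}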

Related

Selector interpreted as HTML on my website. Can an attacker easily exploit this?

I have a website that is only accessible via HTTPS.
It does not load any content from other sources, so all content is on the local webserver.
Using the Retire.js Chrome plugin I get a warning that the jquery 1.8.3 I included is vulnerable to 'Selector interpreted as HTML'
(jQuery bug 11290)
I am trying to push for a quick upgrade, but I need more concrete information to motivate the upgrade to the powers that be.
My questions are:
Given the above, should I be worried?
Can this result in an XSS-type attack?
What the bug is telling you is that jQuery may mis-identify a selector containing a < as being an HTML fragment instead, and try to parse and create the relevant elements.
So the vulnerability, such as it is, is that a cleverly-crafted selector, if then passed into jQuery, could define a script tag that then executes arbitrary script code in the context of the page, potentially taking private information from the page and sending it to someone with malicious (or merely prurient) intent.
This is largely only useful if User A can write a selector that will later be given to jQuery in User B's session, letting User A steal information from User B's page. (It really doesn't matter if a user can trick jQuery this way on their own page; they can do far worse things from the console, or with "save as".)
So: If nothing in your code lets users provide selectors that will be saved and then retrieved by other users and passed to jQuery, I wouldn't be all that worried. If it does (with or without the fix for the bug), I'd examine those selector strings really carefully. I say "with or without the fix" because even with it, if you don't filter what users type at all, they could still just provide an HTML fragment whose first non-whitespace character is <, which would still cause jQuery to parse it as an HTML fragment.
As the author of Retire.js let me shed some light on this. There are two weaknesses in older versions of jQuery, but those are not vulnerabilities by themselves. It depends on how jQuery is used. Two examples abusing the bugs are shown here: research.insecurelabs.org/jquery/test/
The two examples are:
$("#<img src=x onerror=...>")
and
$("element[attribute='<img src=x onerror=...>'")
Typically this becomes a problem if you do something like:
$(location.hash)
This was a fairly common pattern on many web sites when single-page web apps started to appear.
So this becomes a problem if and only if you put untrusted user data inside the jQuery selector function.
And yes, the end result is XSS, if the site is in fact vulnerable. HTTPS will not protect you against these kinds of flaws.
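If you cannot upgrade right away, a stopgap is to validate untrusted input before it ever reaches the selector function. A minimal sketch, assuming the hash is only ever meant to name an element id:
// https://example.com/page#<img src=x onerror=alert(1)> would otherwise be
// parsed as an HTML fragment by vulnerable jQuery versions.
var hash = location.hash;
if (/^#[\w-]+$/.test(hash)) {
  $(hash).show(); // safe: the selector can only be a plain id fragment
}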

Pass innerText value of a page to Chrome extension without opening the page

I've been looking all over for this, and I think the problem is that I inherently suck at programming or scripting of any sort, and I don't know the right words to use...
Basically: I want to make a Chrome extension that reads the innerText value from the ticketing system at the place I work. As an example...
<span class="infomsg">Tickets Found [<span id="tickets_count">5</span>]</span>
The goal would be for the extension to display the text "5" over the icon.
What's the best way to do this? I've tried configuring the background.html page with an iframe whose source is the URL that shows the ticket count, but then I run into the cross-domain scripting issue. document.getElementById("tickets_count").innerHTML can't be pointed at a specified URL, as near as I've found.
I'm sure I haven't described it very well at all - totally floundering here, to be honest...let me know what I can clarify, and I'll edit my post.
Thanks!
It depends on whether the page you're looking at is static (e.g. the server sends you HTML with this information already in it) or dynamic (e.g. some JavaScript on the page requests additional information and then adds this to the page).
If it's static, you can use XHR to request the page and find the string you need in the "raw" HTML response. You can't use getElementById in that case - you'll need to find a way to find the string yourself.
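For the static case, a minimal background-page sketch; the ticketing URL and the surrounding markup here are assumptions, and the URL would need to be listed in the extension's host permissions:
var xhr = new XMLHttpRequest();
xhr.open('GET', 'https://tickets.example.com/queue'); // hypothetical URL
xhr.onload = function () {
  // getElementById can't run against a raw string, so pull the count out
  // of the HTML with a pattern match instead.
  var match = xhr.responseText.match(/id="tickets_count">(\d+)</);
  if (match) {
    chrome.browserAction.setBadgeText({ text: match[1] });
  }
};
xhr.send();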
If it's dynamic, that won't work. An iframe-in-the-background approach is valid, but you can't access the contents of the iframe. Instead, you should inject a content script into that page and request the information you need.
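For the dynamic case, the content script reads the value once the page has rendered it and messages it to the extension, roughly like this (match patterns and permissions omitted):
// content script, injected into the ticketing page:
var el = document.getElementById('tickets_count');
if (el) {
  chrome.runtime.sendMessage({ ticketCount: el.innerText });
}

// background page: display the count over the icon.
chrome.runtime.onMessage.addListener(function (message) {
  if (message.ticketCount) {
    chrome.browserAction.setBadgeText({ text: String(message.ticketCount) });
  }
});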
I understand it's a broad answer - but your question is also quite broad.

Loading external content so that it is searchable by Google for SEO purposes

I'm working on a project where we'd like to load external content onto a customers site. The main requirements are that we'd like the customer to have as simple of an include as possible (like a one-line link similar to Doubleclick) and would preferably not have to be involved in any server-side language. The two proposed ways of doing this were an iframe or loading a javascript file that document.write's out the content.
We looked more at the latter since it seemed to offer more reliable legibility and simplicity for the end user - a single line of JavaScript. We have been hit with the reality that this will be indexed unpredictably by Google. I have read most of the posts on this topic regarding JavaScript and indexing (for example http://www.seroundtable.com/google-ajax-execute-15169.html, https://twitter.com/mattcutts/status/131425949597179904). Currently we have (for example):
<html>
<body>
<div class='main-container'>
<script src='http://www.other.com/page.js'></script>
</div>
</body>
</html>
and
// at http://www.other.com/page.js
document.write('blue fish and green grass');
but it looks like Google indexes this type of content only sometimes, based on 'Fetch as Google' in Google's Webmaster Tools. Since it sometimes works, I know this kind of indexing is possible. More specifically, if we isolate our content to something like the above and remove extraneous content, it gets indexed every time (as opposed to the EXACT SAME JavaScript in a regular customer HTML page). If we have our content in a customer's HTML file, it doesn't seem to get indexed.
What would be a better option to ensure that Google has indexed the content (remote isn't any better)? Ideas I have tried / come across would be to load a remote file in for example PHP, something like:
echo file_get_contents('http://www.other.com/page');
This is obviously blocking but possibly not a deal-breaker.
Given the above requirements, would there be any other solution?
thx
This is a common problem and I've created a JS plugin that you can use to solve this.
Url: https://github.com/kubrickology/Logical-escaped_fragment
Make sure to use its __init() function instead of the standard DOM-ready functions, so you know for sure that Google is able to index the content.

Angular $sce vs HTML in external locale files

A question regarding ng-bind-html whilst upgrading an Angular app from 1.0.8 to 1.2.8:
I have locale strings stored in files named en_GB.json, fr_FR.json, etc. So far, I have allowed the use of HTML within the locale strings to allow the team writing the localized content to apply basic styling or adding inline anchor tags. This would result in the following example JSON:
{
  "changesLater": "<strong>Don't forget</strong> that you can always make changes later.",
  "errorEmailExists": "That email address already exists, please sign in to continue."
}
When using these strings with ng-bind-html="myStr", I understand that I now need to use $sce.trustAsHtml(myStr). I could even write a filter, as suggested in this StackOverflow answer, which would let me use ng-bind-html="myStr | unsafe".
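That filter is only a few lines. Here is a sketch of the version from the linked answer; the module name is an assumption:
angular.module('myApp').filter('unsafe', ['$sce', function ($sce) {
  return function (value) {
    // Mark the string as trusted HTML so ng-bind-html will render it.
    return $sce.trustAsHtml(value);
  };
}]);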
Questions:
By doing something like this, is my app now insecure? And if so, how might an attacker exploit this?
I can understand potential exploits if the source of the displayed HTML string was a user (ie. blog post-style comments that will be displayed to other users), but would my app really be at risk if I'm only displaying HTML from a JSON file hosted on the same domain?
Is there any other way I should be looking to achieve the marking-up of externally loaded content strings in an angular app?
You are not making your app any less secure. You were already inserting HTML in your page with the old method of ng-bind-html-unsafe. You are still doing the same thing, except now you have to explicitly trust the source of the HTML rather than just specifying that part of your template can output raw HTML. Requiring the use of $sce makes it harder to accidentally accept raw HTML from an untrusted source - in the old method where you only declared the trust in the template, bad input might make its way into your model in ways you didn't think of.
If the content comes from your domain, or a domain you control, then you're safe - at least as safe as you can be. If someone is somehow able to hijack the payload of a response from your own domain, then your security is already all manner of screwed. Note, however, you should definitely not ever call $sce.trustAsHtml on content that comes from a domain that isn't yours.
Apart from maintainability concerns, I don't see anything wrong with the way you're doing it. Having a ton of HTML live in a JSON file is maybe not ideal, but as long as the markup is reasonably semantic and not too dense, I think it's fine. If the markup becomes significantly more complex, I'd consider splitting it into separate angular template files or directives as needed, rather than trying to manage a bunch of markup wrapped in JSON strings.

Is there an Open Source Python library for sanitizing HTML and removing all Javascript?

I want to write a web application that allows users to enter any HTML that can occur inside a <div> element. This HTML will then end up being displayed to other users, so I want to make sure that the site doesn't open people up to XSS attacks.
Is there a nice library in Python that will clean out all the event handler attributes, <script> elements and other Javascript cruft from HTML or a DOM tree?
I am intending to use Beautiful Soup to regularize the HTML to make sure it doesn't contain unclosed tags and such. But, as far as I can tell, it has no pre-packaged way to strip all Javascript.
If there is a nice library in some other language, that might also work, but I would really prefer Python.
I've done a bunch of Google searching and hunted around on pypi, but haven't been able to find anything obvious.
Related: Sanitising user input using Python
As Klaus mentions, the clear consensus in the community is to use BeautifulSoup for these tasks:
import BeautifulSoup  # BeautifulSoup 3 module; with bs4 use: from bs4 import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(html)
for script_elt in soup.findAll('script'):
    script_elt.extract()  # remove each <script> element from the tree
html = str(soup)
A whitelist approach to the allowed tags, attributes, and their values is the only reliable way. Take a look at Recipe 496942: Cross-site scripting (XSS) defense.
What is wrong with existing markup languages such as the one used on this very site?
You could use BeautifulSoup. It allows you to traverse the markup structure fairly easily, even if it's not well-formed. I don't know that there's something made to order that works only on script tags.
I would honestly look at using something like bbcode or some other alternative markup with it.
Eric,
Have you thought about using a 'SAX'-type parser for the HTML? I'm really not sure that it would ignore the events properly, though. It would also be a bit harder to construct than using something like Beautiful Soup, and handling syntax errors may be a problem with SAX as well.
What I like to do in situations like this is to construct Python objects (subclassed from an XML_Element class) from the parsed HTML, then remove any undesired objects from the tree, and finally re-serialize the objects back to HTML. It's not all that hard in Python.
Regards,
