How to detect JavaScript in HTML in Node.js and stop JS rendering

Is there a way to detect whether an HTML file contains JavaScript? And can we stop the JavaScript in that HTML from being rendered, in Node.js?
I know we can stop the HTML rendering altogether by changing the response content-type from text/html to text/plain, but I'm trying to find a way to stop only the JS.
Kindly let me know if it's even possible. Thanks.

I'm guessing you're sending the file to a browser from Node.js (you talked about changing the content type header).
To do this, you'll need to:
1. Parse the file with an HTML parser (there are a few available for Node.js). Be sure it's one that normalizes input, so that (for instance), xxx is normalized to .... (Thanks Quentin for emphasizing that!)
2. Using the resulting document model, remove:
   - any script elements
   - any onxyz attributes (onclick, onmouseover) on elements
     For instance, <div onclick="..." should be changed to <div ....
   - any URL attributes (like href on a elements) that use the javascript: scheme
     For instance, <a href="javascript:codeHere()" should be changed to <a href="#" or similar (if you remove href entirely, that works too, but the link will no longer automatically be a tab stop etc.).
     (This is where normalization in the parser is important.)
3. Serialize the resulting document model to HTML and send it to the browser.
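As a rough sketch of the removal step, assuming the parser has already produced a tree: the node shape ({ tag, attrs, children }), the URL_ATTRS list, and the function names below are made up for illustration and are not the API of any particular parser. A real parser's document model would need its own traversal, and the list of URL attributes here is deliberately short.

```javascript
// Hypothetical node shape: text nodes are strings, elements are
// { tag: string, attrs: { [name]: value }, children: Node[] }.

// Attributes that commonly hold URLs (illustrative, not exhaustive).
const URL_ATTRS = new Set(["href", "src", "action", "formaction", "xlink:href"]);

// True if a URL attribute value would be treated as a javascript: URL.
// URL parsing ignores leading control characters/spaces and drops tabs and
// newlines, so those are removed before checking the scheme.
function isJavaScriptUrl(value) {
  const cleaned = value.replace(/[\t\n\r]/g, "").replace(/^[\u0000-\u0020]+/, "");
  return /^javascript:/i.test(cleaned);
}

// Returns a sanitized copy of the node, or null if the node must be dropped.
function stripScripts(node) {
  if (typeof node === "string") return node; // text node: keep as-is
  if (node.tag.toLowerCase() === "script") return null; // drop script elements

  const attrs = {};
  for (const [name, value] of Object.entries(node.attrs || {})) {
    const lower = name.toLowerCase();
    if (lower.startsWith("on")) continue; // onclick, onmouseover, ...
    if (URL_ATTRS.has(lower) && isJavaScriptUrl(String(value))) {
      attrs[name] = "#"; // keep the attribute so a link stays a tab stop
      continue;
    }
    attrs[name] = value;
  }

  const children = (node.children || [])
    .map(stripScripts)
    .filter((child) => child !== null);
  return { tag: node.tag, attrs, children };
}

// Usage with a hand-built tree standing in for parser output.
const tree = {
  tag: "div",
  attrs: { onclick: "steal()", class: "box" },
  children: [
    { tag: "script", attrs: {}, children: ["alert(1)"] },
    { tag: "a", attrs: { href: " JaVa\tScript:codeHere()" }, children: ["link"] },
  ],
};
const clean = stripScripts(tree);
console.log(JSON.stringify(clean));
```

The href check runs on a cleaned-up copy of the value because mixed case, leading spaces, and embedded tabs do not stop a browser from treating the URL as javascript:.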

Related

How to understand and view parsing of HTML-embedded JavaScript?

I want to learn more about XSS, but I can't seem to find good resources on how HTML-embedded JavaScript, like the below code snippet, is parsed.
How can I view in the browser, how this code is parsed? I.e. how many rounds of parsing are performed, how each round transforms the input (e.g. decoding) etc.
<!DOCTYPE html>
<html>
<body>
<button type="button" onclick="setTimeout(() => alert(1), 1000)">Click this!</button>
</body>
</html>
After HTML parsing is performed, which decodes HTML-encoded entities, what does the program look like? Does HTML parsing also mess with the onclick attribute?
Your question is fairly broad but seems to generally relate to how browsers render web pages.
Your browser requests the web document and parses it as it arrives: the HTML parser tokenizes the markup and builds the Document Object Model (DOM), a tree of objects representing the document. The DOM is the interface between JavaScript and the HTML document, and scripts use it to change the page dynamically.
When the parser reaches your button, it creates the element with the attributes as written, after decoding any character references (such as &quot;) in their values; apart from that decoding, HTML parsing leaves the onclick text alone. The onclick value is stored as a string and compiled as the body of an event handler function, which the browser calls with the event object when a click on the button is dispatched.
Please update your question if you require clarification on something specific or not addressed.
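To make the two stages concrete, here is a small simulation that runs outside the browser. It is only a sketch under stated assumptions: decodeCharacterReferences and compileHandler are hypothetical helper names, the decoder knows just five named references, and alert is passed in as a parameter because Node.js has no such global. A browser does roughly the same two things with an onclick value: decode character references while parsing the HTML, then treat the resulting string as a function body that receives the event.

```javascript
// Stage 1 (HTML parser): decode character references in the attribute value.
// Only a few named references are handled; a real parser knows the full table.
const NAMED_REFS = { amp: "&", lt: "<", gt: ">", quot: '"', apos: "'" };

function decodeCharacterReferences(text) {
  return text.replace(/&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, (whole, ref) => {
    if (ref[0] === "#") {
      const isHex = ref[1] === "x" || ref[1] === "X";
      const code = parseInt(ref.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      return code <= 0x10ffff ? String.fromCodePoint(code) : whole;
    }
    // Unknown names are left untouched.
    return Object.prototype.hasOwnProperty.call(NAMED_REFS, ref) ? NAMED_REFS[ref] : whole;
  });
}

// Stage 2 (script engine): the decoded string becomes the body of a function
// that receives the event. `alert` is a parameter only because this sketch
// runs outside a browser; in a page it is a global.
function compileHandler(decodedSource) {
  return new Function("event", "alert", decodedSource);
}

// The attribute value exactly as it would appear in the HTML source.
const rawAttribute = "alert(&quot;1 &lt; 2 is &#x74;rue&quot;)";
const decoded = decodeCharacterReferences(rawAttribute);

// Simulate a click: call the handler with a fake event and a recording alert.
const messages = [];
compileHandler(decoded)({ type: "click" }, (msg) => messages.push(msg));

console.log(decoded);
console.log(messages);
```

In a real page, the Elements panel of the browser's developer tools shows the attribute value after stage 1, which is the text the script engine sees.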

HTML attributes that can contain javascript

I'm looking for a simple list of all the html attributes that can contain javascript that will automatically run when an action is performed. I know this will differ between browsers and versions but I'd rather be safer than sorry. I currently know of the following javascript attributes: onload, onclick, onchange, onmouseover, onmouseout, onmousedown, and onmouseup
Backstory:
I'm getting a full html document from an untrusted source and I want to strip all javascript that could run from the original html document, so I'm removing all script tags as well as any attributes that could hold javascript before it's displayed in an iframe. For this implementation there is no server side processing and no way of sandboxing the code, since I need to run javascript that is being added locally after all of the original javascript is removed.
There are two places where Javascript can be used in HTML attributes:
Any onEVENT attribute. I suggest just treating any attribute that begins with on as an event binding, and strip them all out.
Any attribute that can contain a URI, such as href and src. If the URI uses the javascript: scheme, it will be executed as Javascript. A complete list is in
COMPLETE list of HTML tag attributes which have a URL value?
http://www.w3.org/TR/html401/interact/scripts.html#h-18.2.3
Scroll down to 18.2.3 Intrinsic events
I've had a similar requirement in a project. Don't forget to strip script elements, as well.
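For the no-server case in the question, both rules can be applied straight to DOM elements. The following is a minimal sketch, not a vetted sanitizer: stripScriptAttributes and fakeElement are hypothetical names, URL_ATTRIBUTES is a shortened list, and the function relies only on element.attributes and element.removeAttribute, which real DOM elements provide. The usage part runs against a plain-object stand-in so the snippet also works outside a browser.

```javascript
// Shortened, illustrative list; the linked question has the complete one.
const URL_ATTRIBUTES = ["href", "src", "action", "formaction", "poster", "data"];

// True if the value would be parsed as a javascript: URL (leading
// whitespace/control characters and embedded tabs/newlines are ignored).
function usesJavaScriptScheme(value) {
  const cleaned = value.replace(/[\t\n\r]/g, "").replace(/^[\u0000-\u0020]+/, "");
  return /^javascript:/i.test(cleaned);
}

// Removes script-carrying attributes from one element; returns their names.
function stripScriptAttributes(element) {
  const removed = [];
  // Copy first: removing attributes while iterating a live list skips entries.
  for (const { name, value } of Array.from(element.attributes)) {
    const lower = name.toLowerCase();
    const isEventHandler = lower.startsWith("on");
    const isScriptUrl = URL_ATTRIBUTES.includes(lower) && usesJavaScriptScheme(value);
    if (isEventHandler || isScriptUrl) {
      element.removeAttribute(name);
      removed.push(name);
    }
  }
  return removed;
}

// Stand-in for a DOM element, only for running the sketch without a browser.
function fakeElement(attrs) {
  return {
    attributes: Object.entries(attrs).map(([name, value]) => ({ name, value })),
    removeAttribute(name) {
      this.attributes = this.attributes.filter((attr) => attr.name !== name);
    },
  };
}

const el = fakeElement({
  id: "x",
  onmouseover: "evil()",
  src: "javascript:evil()",
  href: "/ok",
});
const removed = stripScriptAttributes(el);
console.log(removed, el.attributes.map((attr) => attr.name));
```

In a page, this would be called for every element of the parsed untrusted document (for example over the result of querySelectorAll("*")), after the script elements themselves have been removed.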

Parsing HTML Response in jQuery

My page needs to grab a specific div from another page to display within a div on the current page. It's a perfect job for $.load, except that the HTML source I'm pulling from is not necessarily well-formed, and seems to have occasional tag errors, and IE just plain won't use it in that case. So I've changed it to a $.get to grab the HTML source of the page as a string. Passing it to $ to parse it as a DOM has the same problem in IE as $.load, so I can't do that. I need a way to parse the HTML string to find the contents of my div#information, but not the rest of the page after its </div>. (PS My div#information contains various other div's and elements.)
EDIT: Also if anyone has a fix for jQuery's $.load not being able to parse response HTML in IE, I'd love to hear that too.
If the resource you are trying to load is under your control, your implementation spec is poorly optimized. You don't want to ask your server for an entire page of content when you only really need a small piece of that content.
What you'll want to do is isolate the content you want, and have the server return only what you need.
As a side note, since you are aware that you have malformed HTML, you should probably bite the bullet and validate your markup. That will save you some trouble (like this) in the future.
Finally, if you truly cannot optimize this process, my guess is that you are creating an inconsistency because some element in the parsed HTML has the same ID as an element on your current page. Duplicate IDs are invalid and lead to many cross-browser JavaScript problems.
Strictly with strings you could use a regular expression to find the id="information" tag contents. Never parse it as html.
I'd also try the form of $.load that accepts a selector after the URL:
$('#results').load('/MySite/Url #information');
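If you do go the strings-only route, a single regular expression struggles with the nested divs mentioned in the question, so a depth counter helps. The sketch below is one way to do it, with a hypothetical function name (extractDivContents); it assumes the text "<div" never appears inside comments, scripts, or attribute values, and that the div tags inside the target are balanced, which malformed markup may violate.

```javascript
// Returns the inner HTML of <div id="..."> as a string, or null if the div
// is missing or its closing tag cannot be found.
function extractDivContents(html, id) {
  // Escape the id so it is matched literally inside the regular expression.
  const escapedId = id.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Case-insensitive for the tag name (this also relaxes the id match).
  const opener = new RegExp(
    "<div\\b[^>]*\\sid\\s*=\\s*([\"'])" + escapedId + "\\1[^>]*>",
    "i"
  );
  const start = opener.exec(html);
  if (!start) return null;

  const contentStart = start.index + start[0].length;
  const divTag = /<(\/?)div\b[^>]*>/gi; // any opening or closing div tag
  divTag.lastIndex = contentStart;

  let depth = 1; // we are inside the target div
  let match;
  while ((match = divTag.exec(html)) !== null) {
    depth += match[1] ? -1 : 1; // closing tag lowers depth, opening raises it
    if (depth === 0) return html.slice(contentStart, match.index);
  }
  return null; // ran out of input before the target div was closed
}

const page =
  '<body><div id="other">x</div>' +
  '<div id="information"><div class="a">one</div><p>two</p></div>' +
  "<div>after</div></body>";
console.log(extractDivContents(page, "information"));
```

The extracted string can then be inserted into the current page on its own, so the rest of the fetched page is never handed to the browser's parser.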

When objects (e.g. an img tag) are created by JavaScript, they don't have the closing tag. How do I make them W3C valid?

Image tags (<img src="" alt="" />), line break tags (<br />), and horizontal rule tags (<hr />) have slashes at the end to mark them as self-closing. However, when these elements are created by JavaScript and I look at the source, they don't have the slashes, which makes them invalid by W3C standards.
How can I get over this problem?
(I use javascript Prototype library)
How are you looking at ‘the source’? JavaScript-created elements don't appear in ‘View Source’. Are you talking about innerHTML?
If so then what you are getting is the web browser's serialisation of the DOM nodes in the document. In a browser the HTML markup of a page is not the definitive store for document state. The document is stored as a load of Node objects; when these objects are serialised back to markup, that markup may not look much like the original HTML page markup that was parsed to get the document.
So regardless of which of:
<img src="x" alt="y"/>
<img src="x" alt="y">
<img alt = "y" src="x">
img= document.createElement('img'); img.src= 'x'; img.alt= 'y';
you use to create an HTMLImageElement node, when you serialise it using innerHTML the browser will typically generate the same HTML markup.
If the browser is in native-XML mode (i.e. because the page was served as application/xhtml+xml), then the innerHTML value certainly will contain self-closing syntax like <img/>. (You might also see other XML stuff like namespaces too, in some browsers.)
However if, as is more likely today, you're serving ‘HTML-compatible XHTML’ under the media type text/html, the browser isn't actually using XHTML at all, so you'll get old-school-HTML when you access innerHTML and you won't see the self-closing /> form.
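To illustrate why the slash disappears, here is a toy serializer. It is not the browser's algorithm, just a sketch of the idea: the node shape ({ tag, attrs, children }) and the function names are invented for the example, and the escaping is simplified. The point is that the output is generated from node objects, so nothing about how the original markup was written (slash, attribute spacing, quoting style) survives.

```javascript
// Elements that never get an end tag when serialized as HTML.
const VOID_ELEMENTS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img",
  "input", "link", "meta", "source", "track", "wbr",
]);

// Simplified escaping for attribute values and text content.
function escapeAttribute(value) {
  return value.replace(/&/g, "&amp;").replace(/"/g, "&quot;");
}
function escapeText(value) {
  return value.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Builds HTML-syntax markup from a node tree (strings are text nodes).
function serializeAsHtml(node) {
  if (typeof node === "string") return escapeText(node);
  const attrs = Object.entries(node.attrs)
    .map(([name, value]) => ` ${name}="${escapeAttribute(value)}"`)
    .join("");
  const open = `<${node.tag}${attrs}>`;
  if (VOID_ELEMENTS.has(node.tag)) return open; // no "/>" and no end tag
  return open + node.children.map(serializeAsHtml).join("") + `</${node.tag}>`;
}

// The same img node, however it was originally created.
const img = { tag: "img", attrs: { src: "x", alt: "y" }, children: [] };
const paragraph = { tag: "p", attrs: {}, children: ["a < b", img] };
console.log(serializeAsHtml(paragraph));
```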
There is no problem. W3C markup standards only refer to the markup in the original file; any changes after that are made directly to the DOM (not your source code), which is also a W3C standard. This will in no way affect the standards compliance of your website.
To go into further detail, HTML and XHTML are only different ways of building the DOM (Document Object Model), which is best described as a large tree structure of nodes which is used to describe a web page. Once the DOM has been built, the language used to build it is irrelevant. You could even build the DOM using pure JavaScript if you wished.
It never matters! I used to conform to every W3C standard, but it turned out to be a stupid thing to do!
Just conform to the safe ones that let you write cross-browser code; your case is definitely not one of them, since it's never an issue and causes no problems.

Disable JavaScript in iframe/div

I am making a small HTML page editor. The editor loads a file into an iframe. From there, it could add, modify, or delete the elements on the page with new attributes, styles, etc. The problem with this, is that JavaScript (and/or other programming languages) can completely modify the page when it loads, before you start editing the elements. So when you save, it won't save the original markup, but the modified page + your changes.
So, I need some way to disable the JavaScript on the iframe, or somehow remove all the JavaScript before the JavaScript starts modifying the page. (I figure I'll have to end up parsing the file for PHP, but that shouldn't be too hard) I considered writing a script to loop through all the elements, removing any tags, onclick's, onfocus's, onmouseover's, etc. But that would be a real pain.
Does anyone know of an easier way to get rid of JavaScript from running inside an iframe?
UPDATE: unless I've missed something, I believe there is no way to simply 'disable JavaScript.' Please correct me if I'm wrong. But, I guess the only way to do it would be to parse out any script tags and JavaScript events (click, mouseover, etc) from a requested page string.
HTML5 introduces the sandbox attribute on the iframe element; if its value is empty (or otherwise lacks the allow-scripts token), JavaScript is not executed in the framed document.
Yes, your update is correct. You must assume that any input you receive from the user may contain malicious elements. Thus, you must validate on the server before accepting their input.
You can try the same solution adopted by CKEditor, of which you have a demo here.
By switching from RTE mode to view source mode, you can enter some JavaScript and see the result, which is a replacement of the JS node in a safely encoded string.
If you are in view source mode, by entering some JS line like:
<script type="text/javascript">
// comment
alert('Ciao');
</script>
you will see it rendered this way when going back to rich text editor mode:
<!--{cke_protected}%3Cscript%20type%3D%22text%2Fjavascript%22%3E%0D%0A%2F%2F%20comment%0D%0Aalert('Ciao')%3B%0D%0A%3C%2Fscript%3E-->
I think it is one of the easiest and most effective ways, since the RegExp to parse JS nodes is not complex.
An example:
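// Matches <script ...>...</script> blocks. [\s\S] lets the body span several
// lines (a plain . stops at newlines), and the g flag finds every occurrence.
var pattern = /<script(\s+(\w+\s*=\s*("|').*?\3)\s*)*\s*(\/>|>[\s\S]*?<\/script\s*>)/gi;
var match = HTMLString.match(pattern); // array containing the occurrences found, or null if there are none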
(Of course, to replace the script node you should use the replace() method).
Regards.
You can send a Content-Security-Policy header with the value script-src 'self'. CSP will then block any scripts other than those from your own website. This way you can prevent scripts from the iframe from changing the elements in your page.
You can get more information from here http://www.html5rocks.com/en/tutorials/security/content-security-policy/
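If the page is served from Node.js, attaching that header could look roughly like the sketch below. It uses only the built-in http module; sendHtmlWithCsp and handleRequest are hypothetical names, the HTML body is a placeholder, and the port in the commented-out listen call is arbitrary. Note that script-src 'self' without 'unsafe-inline' also stops inline script blocks and inline handlers such as onclick from running, which is usually what you want for untrusted markup.

```javascript
const http = require("http");

// Policy from the answer: only scripts loaded from the page's own origin run.
const CONTENT_SECURITY_POLICY = "script-src 'self'";

// Sends the given HTML with the CSP header attached.
function sendHtmlWithCsp(res, html) {
  res.setHeader("Content-Type", "text/html; charset=utf-8");
  res.setHeader("Content-Security-Policy", CONTENT_SECURITY_POLICY);
  res.end(html);
}

// Placeholder handler: a real one would read the requested file instead.
function handleRequest(req, res) {
  sendHtmlWithCsp(res, '<button onclick="alert(1)">Click this!</button>');
}

const server = http.createServer(handleRequest);
// server.listen(3000); // uncomment to serve; the port number is arbitrary
```

A browser that honors the header should render the button but refuse to run its onclick handler, and report the violation in its console.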
