Finding comments in HTML - javascript

I have an HTML file and within it there may be Javascript, PHP and all this stuff people may or may not put into their HTML file.
I want to extract all comments from this html file.
I can point out two problems in doing this:
What is a comment in one language may not be a comment in another.
In JavaScript, the remainder of a line is commented out with the // marker. But URLs also contain //, so I may well eliminate parts of URLs if I simply replace // and the remainder of the line with nothing.
So this is not a trivial problem.
Is there anywhere some solution for this already available?
Has anybody already done this?

Problem 2: Isn't every URL quoted, as in "www.url.com" or 'www.url.com', when you write it in either language? I'm not sure. If that's the case, then all you have to do is parse the code and check whether a quote mark precedes the slashes to know whether it's a real URL or just a comment.
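A minimal sketch of that quote-tracking idea, assuming the input is plain JavaScript source (not a whole HTML file). The function name findJsComments is made up for illustration. It skips over quoted strings so that a // inside a quoted URL is not mistaken for a comment. It does not handle regex literals, template literal interpolation, or unquoted URLs, so treat it as a starting point only.

```javascript
// Hypothetical helper: collect // and /* */ comments from JavaScript source,
// skipping anything inside quoted strings.
function findJsComments(src) {
  const comments = [];
  let i = 0;
  while (i < src.length) {
    const ch = src[i];
    const next = src[i + 1];
    if (ch === '"' || ch === "'" || ch === "`") {
      // Skip to the matching closing quote, honoring backslash escapes.
      i++;
      while (i < src.length && src[i] !== ch) {
        if (src[i] === "\\") i++;
        i++;
      }
      i++;
    } else if (ch === "/" && next === "/") {
      // Line comment: runs to the end of the line.
      let end = src.indexOf("\n", i);
      if (end === -1) end = src.length;
      comments.push(src.slice(i, end));
      i = end;
    } else if (ch === "/" && next === "*") {
      // Block comment: runs to the closing */ (or end of input if unterminated).
      let end = src.indexOf("*/", i + 2);
      end = end === -1 ? src.length : end + 2;
      comments.push(src.slice(i, end));
      i = end;
    } else {
      i++;
    }
  }
  return comments;
}

const sample = 'var url = "http://example.com"; // real comment\n/* block */ var x = 1;';
console.log(findJsComments(sample));
```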

Look into parser generators like ANTLR which has grammars for many languages and write a nesting parser to reliably find comments. Regular expressions aren't going to help you if accuracy is important. Even then, it won't be 100% accurate.
Consider
Problem 3, text that looks like a comment in a language is not always a comment in that language.
<textarea><!-- not a comment --></textarea>
<script>var re = /[/*]not a comment[*/]/, str = "//not a comment";</script>
Problem 4, a comment embedded in a language may not obviously be a comment.
<button onclick="// this is a comment//
notAComment()">
Problem 5, what is a comment may depend on how the browser is configured.
<noscript><!-- </noscript> Whether this is a comment depends on whether JS is turned on -->
<!--[if IE 8]>This is a comment, except on IE 8<![endif]-->
I had to solve this problem partially for contextual templating systems that elide comments from source code to prevent leaking software implementation details.
https://github.com/mikesamuel/html-contextual-autoescaper-java/blob/master/src/tests/com/google/autoesc/HTMLEscapingWriterTest.java#L1146 shows a testcase where a comment is identified in JavaScript, and later testcases show comments identified in CSS and HTML. You may be able to adapt that code to find comments. It will not handle comments in PHP code sections.

It seems from your wording that you are considering an approach based on regular expressions. Doing that on the whole file is a pain. Try using tools to separate the interesting text from the uninteresting text first, then work on what is left in your sieve according to your keep/discard criteria. Have a look at HTML::Tree and TreeBuilder; they could be very useful for dealing with the HTML markup.

I would convert the HTML file into a character array and parse it. You can detect key strings like "<", "--" ,"www", "http", as you move forward and either skip or delete those segments.
The start/end indices will have to be identified properly, which is a challenge but you will have full power.
There are also other ways to simplify the process if performance is not a problem. For example, all tags can be grabbed with XML::Twig and the string can be parsed to detect JS comments.

Related

JavaScript - Is filtering '<' good enough to secure HTML before displaying?

In JavaScript, is there any known string that can cause mischief if we filter out all 'less than' ('<') characters then display the result as HTML?
var str = GetDangerousString().toString();
var secure = str.replace(/</g, '');
$('#safe').html(secure); // or
document.getElementById('safe').innerHTML = secure;
This question addresses sanitizing ID's in particular. I'm looking for a general HTML string. Ideal answer is the simplest working example of a string that would inject code or other potentially dangerous elements.
That's not enough for sure... You need to HTML encode any HTML you embed in your pages that you want to be editable by an end user. Otherwise, you need to sanitize it.
You can find out more at the OWASP site.
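As a rough sketch of what "HTML encode" means here: replace the characters that are significant in HTML with entities before inserting the text. The escapeHtml name is made up for illustration, and this only covers element content and quoted attribute values; it is not sufficient for text placed inside a script block, a style block, or a URL.

```javascript
// Hypothetical helper: encode the characters that HTML treats specially.
function escapeHtml(text) {
  const entities = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;",
  };
  return String(text).replace(/[&<>"']/g, (ch) => entities[ch]);
}

console.log(escapeHtml('<img src=x onerror="alert(1)">'));
```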
EDIT: In response to your comment, I'm not 100% sure. It sounds like double encoding will get you in some cases if you're not careful.
https://www.owasp.org/index.php/Double_Encoding
For example, this string from that page is supposed to demonstrate an exploit that hides the "<" character:
%253Cscript%253Ealert('XSS')%253C%252Fscript%253E
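To see why that string slips past a filter that only removes "<", here is a small sketch that decodes it twice with the standard decodeURIComponent function. The "<" only appears after the second decode, so the risk arises if something downstream decodes the value again after the filter has run.

```javascript
const payload = "%253Cscript%253Ealert('XSS')%253C%252Fscript%253E";

// First decode turns %25 into %, leaving ordinary percent-encoding behind.
const once = decodeURIComponent(payload);
// Second decode turns %3C, %3E and %2F into <, > and /.
const twice = decodeURIComponent(once);

console.log(payload.includes("<")); // no "<" in the raw string
console.log(once.includes("<"));    // still none after one decode
console.log(twice);                 // the markup appears after two decodes
```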
Also, the character "<" can be encoded lots of different ways in HTML, as suggested by this table:
https://www.owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet#Character_escape_sequences
So to me, that's the thing to be careful of - the fact that there may be exploitable cases that are hard to understand, but may leave you open.
But back to your original question - can you give me an example of HTML that renders as HTML that doesn't contain the "<" character? I'm trying to understand what HTML you want users to be able to use that would be in an "id".
Also, if your site is small, if you're open to rewriting parts of it (specifically how you use javascript in your pages), then you could consider using Content Security Policies to protect your users from XSS. This works in most modern browsers, and would protect lots of your users from XSS attacks if you were to take this step.

Writing a Parser for javascript code

I want to extract JavaScript code and find out whether it contains any dynamic tag creation, like document.createElement('script');. I have tried to do this with regular expressions, but they restrict me to only some formats, so I thought of writing a JavaScript parser that extracts all the keywords, strings and functions from the JavaScript code.
In general there is no way to know if a given line of code will ever run, you would need to solve the halting problem.
If you restrict your analysis to just finding occurrences of a function call, you don't make much progress. Naive methods will still be easy to trick: if you just regex match for document.createElement, you will not be able to match something as simple as document["create" + "Element"]. In general you would need to not only parse the code but evaluate it as well to get around this. And to be sure that you can evaluate the code, you would again need to solve the halting problem.
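A tiny illustration of that point, with made-up sample strings: the naive pattern finds the direct call but misses the same call written with bracket notation and string concatenation.

```javascript
// Naive detector: looks for the literal text "document.createElement".
const naivePattern = /document\.createElement/;

const direct = "document.createElement('script');";
const obfuscated = 'document["create" + "Element"]("script");';

console.log(naivePattern.test(direct));     // the direct form is found
console.log(naivePattern.test(obfuscated)); // the obfuscated form is missed
```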
Maybe you should try using Burrito
Well the first rule is never use regex for big things like this, or DOM, or ... . You have to parse it by tokens. The good news is that you don't have to write your own. There are a few JS to JS parsers.
UglifyJS
narcissus
Esprima
ZeParser
They may be a bit hard to work with, but it is still better to use them. There are other projects that use these parsers, such as burrito or code surgeon, so you can have a look at their source code and see how they use them.
But there is bad news too: people can still outsmart other people, let alone the parsers and the code they write. At the very least you need to evaluate the code with some execution-time variables and see whether it tries to access the DOM.

Fulltext search ignoring comments

I want fulltext search for my JavaScript code, but I'm usually not interested in matches from the comments.
How can I have fulltext search ignoring any commented match? Such a feature would increase my productivity as a programmer.
Also, how can I do the opposite: search within the comments only?
(I'm currently using Text Mate, but happy to change.)
See our Source Code Search Engine (SCSE). This tool indexes your code base using the language structure to guide the indexing; it can do so for many languages including JavaScript. Search queries are then stated in terms of abstract language tokens, e.g., to find identifiers involving the string "tax" multiplied by some constant, you'd write:
I=*tax* '*' N
This will search all indexed languages only for identifiers (in each language) followed by a '*' token, followed by some kind of number. Because the tool understands language structure, it isn't confused by whitespace, formatting or intervening comments. Because it understands comments, you can search inside just comments (say, for authors):
C=*Author*
Given a query, the SCSE finds all the hits across the code base (possibly millions of lines) and offers these as a set of choices; clicking on a choice pulls up the file with the hit in the middle, outlined where the match occurs.
If you insist on searching just raw text, the SCSE provides grep-style searches. If you have only a small set of files, this is still pretty fast. If you have a big set of files, this is a lot slower than language-structure based searches. In both cases, grep like searches get you more hits, usually at the cost of false positives (e.g., finding "tax" in a comment, or finding a variable named "Authorization_code"). But at least you have the choice.
While this doesn't operate from inside an editor, you can launch your editor (for most editors) on a file once you've found the hit you want.
Use UltraEdit. It supports full text search that ignores comments, as well as search within comments only.
How about NetBeans way (Find Symbol in the Navigate Menu),
It searches all variables,functions,objects etc.
Or you could customize JSLint if you want to integrate it in a web application or something like that.
I personally use Notepad++, which is a great free code editor. It seems you need an editor supporting regular expression search (in one or many files). If you know regex you can build powerful searches, such as inside or outside JavaScript comments. The work will be to build the right expression and test it on one file containing all the different cases, to be sure it will not miss things during a real search. Or maybe you can google for 'javascript comments regular expression' or something similar.
Then you must have a look at the Notepad++ plugins; one is 'RegEx Helper', which helps with building regular expressions.
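If you do go the regular-expression route, one common trick is to match string literals and comments in the same pattern, so that a // inside a string is consumed as a string before it can be mistaken for a comment. Below is a rough sketch of that idea; the names splitCodeAndComments and searchLines are made up for illustration. It blanks out the part you are not interested in (keeping newlines so line numbers stay aligned) and then does a plain text search. String contents count as code here. Regex literals and template literals are not handled, so expect some false matches on real code.

```javascript
// Group 1: a quoted string. Group 2: a line or block comment.
const STRING_OR_COMMENT =
  /("(?:\\.|[^"\\\n])*"|'(?:\\.|[^'\\\n])*')|(\/\/[^\n]*|\/\*[\s\S]*?\*\/)/g;

// Replace every character except newlines with a space.
const blank = (s) => s.replace(/[^\n]/g, " ");

// Hypothetical helper: returns two same-length views of the source,
// one with comments blanked out and one with everything else blanked out.
function splitCodeAndComments(src) {
  let code = "";
  let comments = "";
  let last = 0;
  for (const m of src.matchAll(STRING_OR_COMMENT)) {
    const between = src.slice(last, m.index);
    code += between;
    comments += blank(between);
    if (m[2] !== undefined) {
      code += blank(m[0]);
      comments += m[0];
    } else {
      code += m[0];
      comments += blank(m[0]);
    }
    last = m.index + m[0].length;
  }
  const tail = src.slice(last);
  return { code: code + tail, comments: comments + blank(tail) };
}

// Hypothetical helper: 1-based line numbers where term occurs in the chosen view.
function searchLines(src, term, where = "code") {
  const hits = [];
  splitCodeAndComments(src)[where].split("\n").forEach((line, idx) => {
    if (line.includes(term)) hits.push(idx + 1);
  });
  return hits;
}

const source = 'var tax = 1; // tax rate\nvar s = "// tax";\n/* tax\n notes */\nvar total = tax * 2;';
console.log(searchLines(source, "tax", "code"));
console.log(searchLines(source, "tax", "comments"));
```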

Syntax / Logical checker In Javascript?

I'm building a solution for a client which allows them to create very basic code,
now i've done some basic syntax validation but I'm stuck at variable verification.
I know JSLint does this using Javascript and i was wondering if anyone knew of a good way to do this.
So for example say the user wrote the code
moose = "barry"
base = 0
if(moose == "barry"){base += 100}
Then i'm trying to find a way to clarify that the "if" expression is in the correct syntax, if the variable moose has been initialized etc etc
but I want to do this without scanning character by character,
the code is a mini language built just for this application so is very very basic and doesn't need to manage memory or anything like that.
I had thought about splitting first by Carriage Return and then by Space but there is nothing to say the user won't write something like moose="barry" or if(moose=="barry")
and there is nothing to say the user won't keep the result of a condition inline.
Obviously compilers and interpreters do this on a much more extensive scale but i'm not sure if they do do it character by character and if they do how have they optimized?
(The other option is to send it back to PHP to process, which would relieve the browser of the responsibility.)
Any suggestions?
Thanks
The use case is limited, and the syntax will never be extended in this case. The language is a simple scripting language that lets the client create a unique cost based on their users' input. The end result will be processed by PHP regardless, to ensure the calculation can't be adjusted by the end user and to ensure there is some consistency.
So for example, say there is a base cost of £1.00
and there is a field on the form called "Additional Cost", the language will allow them manipulate the base cost relative to the "additional cost" field.
So
base = 1;
if(additional > 100 && additional < 150){base += 50}
elseif(additional == 150){base *= 150}
else{base += additional;}
This is a basic example of how the language would be used.
Thank you for all your answers,
I've investigated a parser, and creating one would be far more complex than is required.
Having run several tests with thousands of lines of code, I found that character-by-character processing takes only a few seconds, even on a single-core P4 with 512 MB of memory (which is far less than the customer uses).
I've decided to build a PHP based syntax checker which will check the information and convert the variables etc into valid PHP code whilst it's checking it (so that it's ready to be called later without recompilation) using this instead of javascript this seems more appropriate and will allow for more complex code to arise without hindering the validation process
It's only taken an hour and I have code which is able to check the validity of an if statement and isn't confused by nested if's, spaces or odd expressions, there is very little left to be checked whereas a parser and full blown scripting language would have taken a lot longer
You've all given me a lot to think about and i've rated relevant answers thank you
If you really want to do this — and by that I mean if you really want your software to work properly and predictably, without a bunch of weird "don't do this" special cases — you're going to have to write a real parser for your language. Once you have that, you can transform any program in your language into a data structure. With that data structure you'll be able to conduct all sorts of analyses of the code, including procedures that at least used to be called use-definition and definition-use chain analysis.
If you concoct a "programming language" that enables some scripting in an application, then no matter how trivial you think it is, somebody will eventually write a shockingly large program with it.
I don't know of any readily-available parser generators that generate JavaScript parsers. Recursive descent parsers are not too hard to write, but they can get ugly to maintain and they make it a little difficult to extend the syntax (esp. if you're not very experienced crafting the original version).
You might want to look at JS/CC, which is a parser generator, written in JavaScript, that generates a parser for a grammar. You will need to figure out how to describe your language using BNF or EBNF. Also, JS/CC has its own syntax (which is somewhat close to actual BNF/EBNF) for specifying the grammar. Given the grammar, JS/CC will generate a parser for it.
Your other option, as Pointy said, is to write your own lexer and recursive-descent parser from scratch. Once you have a BNF/EBNF, it's not that hard. I recently wrote a parser from an EBNF in JavaScript (the grammar was pretty simple, so it wasn't that hard to write; YMMV).
To address your comments about it being "client specific". I will also add my own experience here. If you're providing a scripting language and a scripting environment, there is no better route than an actual parser.
Handling special cases through a bunch of if-elses is going to be horribly painful and a maintenance nightmare. When I was a freshman in college, I tried to write my own language. This was before I knew anything about recursive-descent parsers, or just parsers in general. I figured out by myself that code can be broken down into tokens. From there, I wrote an extremely unwieldy parser using a bunch of if-elses, and also splitting the tokens by spaces and other characters (exactly what you described). The end result was terrible.
Once I read about recursive-descent parsers, I wrote a grammar for my language and easily created a parser in a tenth of the time it took me to write my original parser. Seriously, if you want to save yourself a lot of pain, write an actual parser. If you go down your current route, you're going to be fixing issues forever. You're going to have to handle cases where people put a space in the wrong place, or perhaps they have one too many (or one too few) spaces. The only other alternative is to provide an extremely rigid structure (i.e., you must have exactly x spaces following this statement), which is liable to make your scripting environment extremely unattractive. An actual parser will automatically fix all these problems.
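To give an idea of the size of such a parser, here is a rough sketch of a tokenizer plus recursive-descent checker for a language shaped like the examples in the question. The grammar is my guess from those examples (assignments, if/elseif/else with braces, simple expressions), and the names tokenize and checkProgram are made up. It only validates syntax and reports variables used before assignment; it ignores operator precedence because it does not evaluate anything.

```javascript
// Split the source into number, string, identifier and operator tokens.
function tokenize(src) {
  const re = /\s*(?:(\d+(?:\.\d+)?)|("(?:[^"\\]|\\.)*")|([A-Za-z_]\w*)|(==|!=|<=|>=|&&|\|\||[-+*\/]=|[-+*\/=<>(){};]))/y;
  const tokens = [];
  let pos = 0;
  while (src.slice(pos).trim() !== "") {
    re.lastIndex = pos;
    const m = re.exec(src);
    if (!m) throw new SyntaxError("Unexpected character near offset " + pos);
    if (m[1] !== undefined) tokens.push({ type: "num", value: m[1] });
    else if (m[2] !== undefined) tokens.push({ type: "str", value: m[2] });
    else if (m[3] !== undefined) tokens.push({ type: "id", value: m[3] });
    else tokens.push({ type: "op", value: m[4] });
    pos = re.lastIndex;
  }
  return tokens;
}

// Returns a list of "used before assignment" messages; throws SyntaxError on bad syntax.
// knownVars lists names defined outside the script (e.g. form fields); this is an assumption.
function checkProgram(src, knownVars = []) {
  const tokens = tokenize(src);
  const defined = new Set(knownVars);
  const errors = [];
  let i = 0;
  const isOp = (v) => i < tokens.length && tokens[i].type === "op" && tokens[i].value === v;
  const isKw = (v) => i < tokens.length && tokens[i].type === "id" && tokens[i].value === v;
  function expect(v) {
    if (!isOp(v)) throw new SyntaxError('Expected "' + v + '" at token ' + i);
    i++;
  }
  function primary() {
    const t = tokens[i++];
    if (!t) throw new SyntaxError("Unexpected end of input");
    if (t.type === "num" || t.type === "str") return;
    if (t.type === "id") {
      if (!defined.has(t.value)) errors.push('"' + t.value + '" used before assignment');
      return;
    }
    if (t.value === "(") { expr(); expect(")"); return; }
    throw new SyntaxError('Unexpected "' + t.value + '"');
  }
  // One grammar level: operand (operator operand)*
  function chain(operand, ops) {
    operand();
    while (i < tokens.length && tokens[i].type === "op" && ops.includes(tokens[i].value)) {
      i++;
      operand();
    }
  }
  const term = () => chain(primary, ["+", "-", "*", "/"]);
  const comparison = () => chain(term, ["==", "!=", "<", ">", "<=", ">="]);
  const expr = () => chain(comparison, ["&&", "||"]);
  function condition() { expect("("); expr(); expect(")"); }
  function block() {
    expect("{");
    while (!isOp("}")) statement();
    expect("}");
  }
  function statement() {
    const t = tokens[i++];
    if (!t || t.type !== "id") throw new SyntaxError("Expected a statement at token " + (i - 1));
    if (t.value === "if") {
      condition(); block();
      while (isKw("elseif")) { i++; condition(); block(); }
      if (isKw("else")) { i++; block(); }
      return;
    }
    const op = tokens[i++];
    if (!op || op.type !== "op" || !["=", "+=", "-=", "*=", "/="].includes(op.value)) {
      throw new SyntaxError('Expected an assignment after "' + t.value + '"');
    }
    if (op.value !== "=" && !defined.has(t.value)) errors.push('"' + t.value + '" used before assignment');
    expr();
    defined.add(t.value);
    if (isOp(";")) i++; // the trailing semicolon is optional
  }
  while (i < tokens.length) statement();
  return errors;
}

console.log(checkProgram('moose = "barry"\nbase = 0\nif(moose == "barry"){base += 100}'));
```

Spacing no longer matters with this approach: moose="barry" and moose = "barry" produce the same tokens.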
Javascript has a function 'eval'.
var code = 'alert(1);';
eval(code);
This will show an alert. You can use 'eval' to execute basic code.

Parsing Custom JavaScript Annotations

While implementing a large JavaScript application with a lot of scripts, it's become necessary to put together a build script. Since JavaScript labels are ubiquitous, I've decided to use them as annotations for a custom script collator. So far, I'm just employing the use statement, like this:
use: com.example.Class;
However, I want to support an 'optional quotes' syntax, so the following would be parsed correctly as well
use: 'com.example.Class';
I'm currently using this pattern to parse the first form:
/\s*use:\s*(\S+);\s*/g
The '\S+' gloms all characters between the annotation name declaration and the terminating semicolon. What rule can I write to substitute for \S+ that will return an annotation value without quotes, no matter whether it was quoted to begin with? I can do it in two steps, but I want to do it in one.
Thanks- I know I've put this a little awkwardly
Edit 1.
I've been able to use this, but IMHO it's a mess. Any more elegant solutions? (By the way, this one will parse ALL label names.)
/\s*([a-z]+):\s*(?:['])([a-zA-Z0-9_.]+)(?:['])|([a-zA-Z0-9_.]+);/g
Edit 2.
The logic is the same, but expressed a little more succinctly. However, it poses a problem, as it seems to pull in all sorts of JavaScript code as well.
/\s*([a-z]+):\s*'([\w_\.]+)'|([\w_\.]+);/g
Ok -this seemed to do it. Hope someone can improve on it.
/\s*([a-z]+): *('[\w_\/\.]+'|[\w_\/\.]+);/g
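One possible way to get the unquoted value in a single step is to capture the optional opening quote and require the same thing again with a backreference, so the value itself lands in its own group. This is only a sketch: the findUses name is made up, it handles the use label only, and it assumes each annotation starts on its own line.

```javascript
// Group 1: optional opening quote. Group 2: the value. \1 requires the matching closing quote.
const USE_ANNOTATION = /^\s*use:\s*(['"]?)([\w.\/]+)\1\s*;/gm;

// Hypothetical helper: return the annotation values with any quotes removed.
function findUses(src) {
  return Array.from(src.matchAll(USE_ANNOTATION), (m) => m[2]);
}

console.log(findUses("use: com.example.Class;\nuse: 'com.example.Other';"));
```

Because the closing quote must match the opening one, a line such as use: 'com.example.Broken; is not matched at all.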
