Writing a parser for JavaScript code

I want to extract JavaScript code and find out whether it contains any dynamic tag creation, such as document.createElement('script'). I tried to do this with regular expressions, but they only catch some formats, so I thought of writing a JavaScript parser that extracts all the keywords, strings and functions from the code.

In general there is no way to know whether a given line of code will ever run; you would need to solve the halting problem.
If you restrict your analysis to just finding occurrences of a function call, you still don't make much progress. Naive methods are easy to trick: if you just regex-match for document.createElement, you will not match something as simple as document["create" + "Element"]. In general you would need not only to parse the code but to evaluate it as well to get around this. And to be sure that the evaluation finishes, you would again need to solve the halting problem.
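To make the point concrete, here is a minimal sketch of such a naive check. The function name naiveDetect and the two sample strings are made up for illustration; the second sample uses the computed property name from the answer above.

```javascript
// Naive detector: a plain regex search for the literal text.
function naiveDetect(source) {
  return /document\.createElement\s*\(/.test(source);
}

const direct = "var s = document.createElement('script');";
const computed = "var s = document['create' + 'Element']('script');";

console.log(naiveDetect(direct));   // true
console.log(naiveDetect(computed)); // false, although both create a script element at run time
```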

Maybe you should try using Burrito

Well, the first rule is: never use regex for big things like this, or for the DOM, or ... You have to parse the code into tokens. The good news is that you don't have to write your own parser. There are a few JavaScript parsers written in JavaScript:
UglifyJS
narcissus
Esprima
ZeParser
They may be a bit hard to work with, but it is still better to use them. Other projects, such as burrito or code surgeon, are built on them, so you can look at their source code and see how they use them.
But there is bad news too: people can still outsmart other people, let alone parsers and the code they write. At the very least you need to evaluate the code with some execution-time variables and see whether it tries to access the DOM.
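One way to read "evaluate the code with some execution-time variables" is to run it against a stand-in document object that records what gets called. The sketch below is only an illustration under that assumption: findCreatedTags and fakeDocument are hypothetical names, and the approach is not a sandbox, since the code under test still runs with access to every other global (and can reach the real DOM through window.document in a browser).

```javascript
// Run code against a fake `document` that records createElement calls.
// This is NOT a sandbox: only use it on code you are willing to execute.
function findCreatedTags(source) {
  const created = [];
  const fakeDocument = {
    createElement(tagName) {
      created.push(String(tagName).toLowerCase());
      return {}; // minimal stand-in for an element
    },
  };
  try {
    // The parameter named `document` shadows any global of the same name.
    new Function("document", source)(fakeDocument);
  } catch (err) {
    // The stub is incomplete, so code may throw after the call we care about.
  }
  return created;
}

console.log(findCreatedTags("document['create' + 'Element']('SCRIPT');"));
```

Because the code is actually executed, the computed property name is caught here, but only on the path that happens to run.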

Related

Comma Operator to Semicolons

I have a chunk of javascript that has many comma operators, for example
"i".toString(), "e".toString(), "a".toString();
Is there a way with JavaScript to convert these to semicolons?
"i".toString(); "e".toString(); "a".toString();
This might seem like a cop-out answer... but I'd suggest against trying it. Doing any kind of string manipulation to change it would be virtually impossible. In addition to function definition argument lists, you'd also need to skip text in string literals, regex literals, function calls, array literals, object literals, variable declarations... maybe even more. Regex can't handle it, and neither can toggling a flag on and off as you see keywords.
If you want to actually convert these, you really have to parse the code and figure out which commas are the comma operator. Moreover, there are cases where the comma's presence is relevant, because the value of the comma expression is used:
var a = (10, 20);
is not the same as
var a = 10; 20;
for example.
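A small sketch of why the comma matters when its value is used. Note that in a var declaration a bare comma separates declarators, so the comma operator needs parentheses there; the helper name parses is made up for this illustration.

```javascript
// The comma operator evaluates both operands and yields the last one.
var a = (10, 20);   // a is 20
var b = 10; 20;     // b is 10; the 20 is a separate expression statement

console.log(a, b);  // 20 10

// Without parentheses the comma separates declarators, so a name must follow it.
function parses(source) {
  try { new Function(source); return true; } catch (e) { return false; }
}
console.log(parses("var c = 10, 20;"));   // false
console.log(parses("var c = (10, 20);")); // true
```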
So I really don't think you should try it. But if you do want to, I'd start by searching for a javascript parser (or writing one, it isn't super hard, but it'd probably take the better part of a day and might still be buggy). I'm pretty sure the more advanced minifiers like Google's include a parser, maybe their source will help.
Then, you parse it to find the actual comma expressions. If the return value is used, leave it alone. If not, go ahead and replace them with expression statements, then regenerate the source code string. You could go ahead and format it based on scope indentation at this time too. It might end up looking pretty good. It'll just be a fair chunk of work.
Here's a parser library written in JS: http://esprima.org/ (thanks to @torazaburo for this comment)

Finding comments in HTML

I have an HTML file and within it there may be Javascript, PHP and all this stuff people may or may not put into their HTML file.
I want to extract all comments from this html file.
I can point out two problems in doing this:
What is a comment in one language may not be a comment in another.
In JavaScript, the remainder of a line is commented out using the // marker. But URLs also contain //, so I may well eliminate parts of URLs if I just replace // and the remainder of the line with nothing.
So this is not a trivial problem.
Is there anywhere some solution for this already available?
Has anybody already done this?
Problem 2: Isn't every URL quoted, as either "www.url.com" or 'www.url.com', when you write it in either language? I'm not sure. If that's the case, then all you have to do is parse the code and check whether a quote mark precedes the slashes to know whether it's a real URL or just a comment.
Look into parser generators like ANTLR which has grammars for many languages and write a nesting parser to reliably find comments. Regular expressions aren't going to help you if accuracy is important. Even then, it won't be 100% accurate.
Consider
Problem 3, a comment in a language is not always a comment in a language.
<textarea><!-- not a comment --></textarea>
<script>var re = /[/*]not a comment[*/]/, str = "//not a comment";</script>
Problem 4, a comment embedded in a language may not obviously be a comment.
<button onclick="// this is a comment//
notAComment()">
Problem 5, what is a comment may depend on how the browser is configured.
<noscript><!-- </noscript> Whether this is a comment depends on whether JS is turned on -->
<!--[if IE 8]>This is a comment, except on IE 8<![endif]-->
I had to solve this problem partially for contextual templating systems that elide comments from source code to prevent leaking software implementation details.
https://github.com/mikesamuel/html-contextual-autoescaper-java/blob/master/src/tests/com/google/autoesc/HTMLEscapingWriterTest.java#L1146 shows a testcase where a comment is identified in JavaScript, and later testcases show comments identified in CSS and HTML. You may be able to adapt that code to find comments. It will not handle comments in PHP code sections.
It seems from your wording that you are considering an approach based on regular expressions. That is painful to do on the whole file; try using some tools to highlight or discard interesting or uninteresting text first, and then work on what is left in your sieve according to the keep/discard criteria. Have a look at HTML::Tree and TreeBuilder; they could be very useful for dealing with the HTML markup.
I would convert the HTML file into a character array and parse it. You can detect key strings like "<", "--" ,"www", "http", as you move forward and either skip or delete those segments.
The start/end indices will have to be identified properly, which is a challenge but you will have full power.
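The character-by-character idea can be sketched for the JavaScript portion alone. This is a simplified illustration, not a complete solution: findJsComments is a hypothetical name, the sample URL is a placeholder, and regex literals and template literals are deliberately not handled, so the /[/*]not a comment[*/]/ case shown earlier would still be misread.

```javascript
// Scan JavaScript source one character at a time and collect comments,
// skipping over string literals so that "//" inside a quoted URL is ignored.
// Assumption: regex literals and template literals are NOT handled here.
function findJsComments(src) {
  const comments = [];
  let i = 0;
  while (i < src.length) {
    const ch = src[i];
    const next = src[i + 1];
    if (ch === '"' || ch === "'") {
      // Skip to the matching quote, honouring backslash escapes.
      i++;
      while (i < src.length && src[i] !== ch) {
        if (src[i] === "\\") i++;
        i++;
      }
      i++; // step past the closing quote
    } else if (ch === "/" && next === "/") {
      let end = src.indexOf("\n", i);
      if (end === -1) end = src.length;
      comments.push({ start: i, end: end, text: src.slice(i, end) });
      i = end;
    } else if (ch === "/" && next === "*") {
      let end = src.indexOf("*/", i + 2);
      end = end === -1 ? src.length : end + 2;
      comments.push({ start: i, end: end, text: src.slice(i, end) });
      i = end;
    } else {
      i++;
    }
  }
  return comments;
}

const sample = 'var url = "http://example.com"; // real comment\n/* block */ var x = 1;';
console.log(findJsComments(sample).map(c => c.text));
```

The recorded start and end indices are the ones a caller would use to skip or delete each segment.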
There are also other ways to simplify the process if performance is not a problem. For example, all tags can be grabbed with XML::Twig and the string can be parsed to detect JS comments.

Is this safe? eval or new Function for simple arithmetic expression

I have heard so many bad things about eval that I've never even tried to use it. However today I have a situation where it seems to be the right answer.
I need a script that can do simple calculations by combining variables. For example, if value=5 and max=8, I want to evaluate value*100/max. Both the values and the formulas will be retrieved from external sources, which is why I am concerned with eval.
I have set up a jsfiddle demo with some sample code:
http://jsfiddle.net/6yzgA/
The values are converted to numbers using parseFloat, so I believe I'm pretty safe here. The characters in the formula are matched against this regular expression:
regex=/[^0-9\.+-\/*<>!=&()]/, // allows numbers (including decimal), operations, comparison
My questions:
Does my regex filter protect me from any attack?
Is there any reason to use eval vs. new Function in this case?
Is there another, safer way to evaluate formulas?
Since you aren't sending anything to your server, or using anything on anyone else's system, the worst that can happen is that the user crashes his own browser, nothing more. There is nothing unsafe about using eval here, since everything happens user-side.
Escaping and preventing anything on the client side doesn't make sense at all. The user can alter any piece of JS code and run it just as easily as I can change the jsfiddle you posted. Trust me, it's just that simple, and you cannot rely on client-side security.
If you remember to escape input fields on the server-side it's nothing to be worried about. There are plenty of functions for that by default, depending on which language you're using.
If the user wants to type in <script>haxx(l33t);</script>, let him do it. Just remember to escape special characters so you'll have &lt;script&gt;haxx(l33t);&lt;/script&gt;.
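On the third question (a safer way to evaluate formulas), one option is to skip eval entirely and parse the formula yourself. The following is a rough sketch under stated assumptions, not a drop-in replacement for the fiddle: evaluateFormula is a hypothetical name, and it covers only numbers, named variables, + - * / and parentheses, not the comparison operators that the regex in the question allows.

```javascript
// Evaluate an arithmetic formula without eval: tokenize, then parse with
// recursive descent. Supports numbers, variable names, + - * / and parentheses.
function evaluateFormula(formula, vars) {
  const tokens = formula.match(/\d+(?:\.\d+)?|[A-Za-z_]\w*|[-+*\/()]|\S/g) || [];
  let pos = 0;

  function peek() { return tokens[pos]; }
  function take() { return tokens[pos++]; }

  // expression := term (('+' | '-') term)*
  function expression() {
    let value = term();
    while (peek() === "+" || peek() === "-") {
      const op = take();
      const right = term();
      value = op === "+" ? value + right : value - right;
    }
    return value;
  }

  // term := factor (('*' | '/') factor)*
  function term() {
    let value = factor();
    while (peek() === "*" || peek() === "/") {
      const op = take();
      const right = factor();
      value = op === "*" ? value * right : value / right;
    }
    return value;
  }

  // factor := number | name | '-' factor | '(' expression ')'
  function factor() {
    const tok = take();
    if (tok === undefined) throw new SyntaxError("Unexpected end of formula");
    if (/^\d/.test(tok)) return parseFloat(tok);
    if (/^[A-Za-z_]/.test(tok)) {
      if (!Object.prototype.hasOwnProperty.call(vars, tok)) {
        throw new ReferenceError("Unknown variable: " + tok);
      }
      return Number(vars[tok]);
    }
    if (tok === "-") return -factor();
    if (tok === "(") {
      const value = expression();
      if (take() !== ")") throw new SyntaxError("Expected ')'");
      return value;
    }
    throw new SyntaxError("Unexpected token: " + tok);
  }

  const result = expression();
  if (pos < tokens.length) throw new SyntaxError("Unexpected token: " + peek());
  return result;
}

console.log(evaluateFormula("value*100/max", { value: 5, max: 8 })); // 62.5
```

Because names are only ever looked up in the vars object, a formula such as alert(1) is rejected with an error instead of being executed.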

Syntax / Logical checker In Javascript?

I'm building a solution for a client which allows them to create very basic code.
I've done some basic syntax validation, but I'm stuck at variable verification.
I know JSLint does this using JavaScript, and I was wondering if anyone knew of a good way to do this.
So for example say the user wrote the code
moose = "barry"
base = 0
if(moose == "barry"){base += 100}
Then I'm trying to find a way to verify that the "if" expression has the correct syntax, that the variable moose has been initialized, etc.,
but I want to do this without scanning character by character.
The code is a mini language built just for this application, so it is very basic and doesn't need to manage memory or anything like that.
I had thought about splitting first by carriage return and then by space, but there is nothing to say the user won't write something like moose="barry" or if(moose=="barry"),
and there is nothing to say the user won't keep the result of a condition inline.
Obviously compilers and interpreters do this on a much more extensive scale, but I'm not sure whether they do it character by character, and if they do, how they have optimized it.
(The other option is to send it back to PHP to process, which would relieve the browser of the responsibility.)
Any suggestions?
Thanks
The use case is limited and the syntax will never be extended. The language is a simple scripting language that lets the client create a unique cost based on their users' input. The end result will be processed by PHP regardless, to ensure the calculation can't be adjusted by the end user and to ensure there is some consistency.
So for example, say there is a base cost of £1.00
and there is a field on the form called "Additional Cost"; the language will allow them to manipulate the base cost relative to the "additional cost" field.
So
base = 1;
if(additional > 100 && additional < 150){base += 50}
elseif(additional == 150){base *= 150}
else{base += additional;}
This is a basic example of how the language would be used.
Thank you for all your answers.
I've investigated a parser, and creating one would be far more complex than is required.
I ran several tests with thousands of lines of code and found that character-by-character processing takes only a few seconds, even on a single-core P4 with 512 MB of memory (which is far less than the customer uses).
I've decided to build a PHP-based syntax checker which will check the input and convert the variables etc. into valid PHP code while it's checking (so that it's ready to be called later without recompilation). Using this instead of JavaScript seems more appropriate and will allow more complex code to arise without hindering the validation process.
It's only taken an hour, and I have code which can check the validity of an if statement and isn't confused by nested ifs, spaces or odd expressions. There is very little left to be checked, whereas a parser and a full-blown scripting language would have taken a lot longer.
You've all given me a lot to think about and I've rated the relevant answers. Thank you.
If you really want to do this — and by that I mean if you really want your software to work properly and predictably, without a bunch of weird "don't do this" special cases — you're going to have to write a real parser for your language. Once you have that, you can transform any program in your language into a data structure. With that data structure you'll be able to conduct all sorts of analyses of the code, including procedures that at least used to be called use-definition and definition-use chain analysis.
If you concoct a "programming language" that enables some scripting in an application, then no matter how trivial you think it is, somebody will eventually write a shockingly large program with it.
I don't know of any readily-available parser generators that generate JavaScript parsers. Recursive descent parsers are not too hard to write, but they can get ugly to maintain and they make it a little difficult to extend the syntax (esp. if you're not very experienced crafting the original version).
You might want to look at JS/CC, which is a parser generator that generates a parser for a grammar, in JavaScript. You will need to figure out how to describe your language using BNF or EBNF. Also, JS/CC has its own syntax (which is somewhat close to actual BNF/EBNF) for specifying the grammar. Given the grammar, JS/CC will generate a parser for it.
Your other option, as Pointy said, is to write your own lexer and recursive-descent parser from scratch. Once you have a BNF/EBNF, it's not that hard. I recently wrote a parser from an EBNF in JavaScript (the grammar was pretty simple, so it wasn't that hard to write one; YMMV).
To address your comments about it being "client specific", I will also add my own experience here. If you're providing a scripting language and a scripting environment, there is no better route than an actual parser.
Handling special cases through a bunch of if-elses is going to be horribly painful and a maintenance nightmare. When I was a freshman in college, I tried to write my own language. This was before I knew anything about recursive-descent parsers, or just parsers in general. I figured out by myself that code can be broken down into tokens. From there, I wrote an extremely unwieldy parser using a bunch of if-elses, and also splitting the tokens by spaces and other characters (exactly what you described). The end result was terrible.
Once I read about recursive-descent parsers, I wrote a grammar for my language and easily created a parser in a tenth of the time it took me to write my original parser. Seriously, if you want to save yourself a lot of pain, write an actual parser. If you go down your current route, you're going to be fixing issues forever. You're going to have to handle cases where people put a space in the wrong place, or perhaps have one too many (or one too few) spaces. The only other alternative is to provide an extremely rigid structure (i.e., you must have exactly x spaces following this statement), which is liable to make your scripting environment extremely unattractive. An actual parser will automatically fix all these problems.
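As a rough illustration of the first step (the lexer), the sketch below tokenizes the asker's sample code with a single regular expression, so that moose="barry" and moose = "barry" produce the same tokens, and then does a simple linear check for names that are read before they are assigned. The names tokenize, findUninitialized and KEYWORDS are invented for this example, the keyword list is guessed from the samples in the question, and the check ignores control flow, so it is far short of real use-definition analysis. Characters the pattern does not recognise are silently skipped.

```javascript
// Split mini-language source into tokens with one regex, so spacing does not matter.
function tokenize(source) {
  const pattern = /"[^"]*"|\d+(?:\.\d+)?|[A-Za-z_]\w*|==|!=|<=|>=|&&|\|\||[-+*\/]=|[-+*\/=<>(){};]/g;
  return source.match(pattern) || [];
}

// Assumption: these are the only keywords, based on the samples in the question.
const KEYWORDS = new Set(["if", "elseif", "else"]);

// Report names that are read before any plain assignment to them.
// A name followed by `=` counts as a definition; compound assignments
// such as `+=` count as a read, because they need the old value.
function findUninitialized(source, predefined) {
  const tokens = tokenize(source);
  const defined = new Set(predefined || []);
  const problems = [];
  tokens.forEach(function (tok, i) {
    if (!/^[A-Za-z_]/.test(tok) || KEYWORDS.has(tok)) return;
    if (tokens[i + 1] === "=") {
      defined.add(tok);
    } else if (!defined.has(tok) && !problems.includes(tok)) {
      problems.push(tok);
    }
  });
  return problems;
}

const program = 'moose="barry"\nbase=0\nif(moose=="barry"){base+=100}';
console.log(tokenize('if(moose=="barry")'));
console.log(findUninitialized(program));
console.log(findUninitialized("if(additional > 100){base += 50}", ["additional"]));
```

A recursive-descent parser would consume the same token list, one function per grammar rule, instead of scanning it linearly.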
Javascript has a function 'eval'.
var code = 'alert(1);';
eval(code);
It will show alert. You can use 'eval' to execute basic code.

Parsing Custom JavaScript Annotations

While implementing a large JavaScript application with a lot of scripts, it's become necessary to put together a build script. Since JavaScript labels are ubiquitous, I've decided to use them as annotations for a custom script collator. So far, I'm just employing the use statement, like this:
use: com.example.Class;
However, I want to support an 'optional quotes' syntax, so the following would be parsed correctly as well
use: 'com.example.Class';
I'm currently using this pattern to parse the first form:
/\s*use:\s*(\S+);\s*/g
The '\S+' gloms all characters between the annotation name declaration and the terminating semicolon. What rule can I write to substitute for \S+ that will return an annotation value without quotes, whether or not it was quoted to begin with? I can do it in two steps, but I want to do it in one.
Thanks. I know I've put this a little awkwardly.
Edit 1.
I've been able to use this, but IMHO it's a mess. Any more elegant solutions? (By the way, this one will parse ALL label names.)
/\s*([a-z]+):\s*(?:['])([a-zA-Z0-9_.]+)(?:['])|([a-zA-Z0-9_.]+);/g
Edit 2.
The logic is the same, but expressed a little more succinctly. However, it poses a problem, as it seems to pull in all sorts of JavaScript code as well.
/\s*([a-z]+):\s*'([\w_\.]+)'|([\w_\.]+);/g
OK, this seemed to do it. Hope someone can improve on it.
/\s*([a-z]+): *('[\w_\/\.]+'|[\w_\/\.]+);/g
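One possible one-step variant, offered as a sketch rather than a tested build tool: capture an optional opening quote in its own group and require the same character again with a backreference, so the value group never includes the quotes. USE_PATTERN and findUses are hypothetical names, com.example.Other is a made-up class name, and the pattern is restricted to the use label for brevity.

```javascript
// Group 1 captures an optional opening quote, group 2 the value,
// and \1 requires the same quote (or nothing) before the semicolon.
const USE_PATTERN = /\buse:\s*(['"]?)([\w.\/]+)\1\s*;/g;

function findUses(source) {
  const values = [];
  let match;
  USE_PATTERN.lastIndex = 0; // reset, since the /g regex object is reused across calls
  while ((match = USE_PATTERN.exec(source)) !== null) {
    values.push(match[2]);
  }
  return values;
}

console.log(findUses("use: com.example.Class;\nuse: 'com.example.Other';"));
```

A value with an opening quote but no matching closing quote is not matched at all, which may or may not be what a build script wants.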
