Is JavaScript/ECMAScript a Regular Language?

I've tried a few Google searches, and I cannot come up with any articles or previous questions that address this. The reason is a minor dispute I'm having with someone about using input validation to reject possible XSS. I know for a fact that HTML isn't a regular language, but I can't make that argument quite as strongly for JavaScript.
I checked this link:
http://www.dlsi.ua.es/~mlf/nnafmc/pbook/node46.html
And I've come up with this: since tags in HTML can be nested to arbitrary depth, that's an intuitive notion as to why HTML is NOT a regular language. By extension, since you can nest blocks of JavaScript code with {} to arbitrary depth, JavaScript too is NOT a regular language.
I'd like to see a more formal presentation either for or against this informal proposition. Or maybe even a discussion about regex extensions in programming languages that perhaps make it possible to do this kind of thing without writing a parser.

Indeed, JavaScript is not a regular language, which can be proved from the fact that braces must be balanced, as your intuition suggested.
A useful tool for demonstrating that languages are not regular is the Pumping Lemma. You can use it to demonstrate that if JavaScript were regular, some sequence like
function(){ function() { function(){ ... function () {
(in which the { are not matched) could be repeated any number of times when surrounded by a certain prefix and suffix, which obviously contradicts the fact that curly braces must be matched.
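To make the pumping argument concrete, here is a small sketch that is not part of the original answer. It uses the string of p opening braces followed by p closing braces (nested empty blocks, which is a valid program), tries every split the Pumping Lemma allows, and checks that pumping always breaks the balance. The names isBalanced and pumpingSurvives are made up for this illustration, and brace balance is only used as a necessary condition for validity on strings that contain nothing but braces.

```javascript
// Returns true when every "}" closes an earlier "{" and nothing stays open.
// Assumption: the input contains only braces (no strings, comments or regexes).
function isBalanced(src) {
  let depth = 0;
  for (const ch of src) {
    if (ch === "{") depth++;
    else if (ch === "}" && --depth < 0) return false;
  }
  return depth === 0;
}

// For w = "{" repeated p times + "}" repeated p times, try every split
// w = xyz with |xy| <= p and |y| >= 1, repeat y `times` times, and report
// whether any pumped string is still balanced.
function pumpingSurvives(p, times = 2) {
  const w = "{".repeat(p) + "}".repeat(p);
  for (let xyLen = 1; xyLen <= p; xyLen++) {
    for (let yLen = 1; yLen <= xyLen; yLen++) {
      const x = w.slice(0, xyLen - yLen);
      const y = w.slice(xyLen - yLen, xyLen);
      const z = w.slice(xyLen);
      if (isBalanced(x + y.repeat(times) + z)) return true;
    }
  }
  return false;
}

console.log(pumpingSurvives(5));    // pumping up
console.log(pumpingSurvives(5, 0)); // pumping down
```

Because |xy| <= p forces y to consist only of "{", every pumped string has a different number of opening and closing braces. A regular language would have to contain those pumped strings too, which is the contradiction.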

Related

Lexers which require parsing information

I am currently writing a little programming language and have come across a problem. In JavaScript template literals, we can embed arbitrary expressions, like:
let a = `hello ${ { a: 10, b: 15 } } world`
To properly lex the snippet above, the lexer needs to understand bracket matching (essentially parsing), as it can't just assume the first } is the end of the embedded expression. How do lexers idiomatically solve this problem? One way is to check for proper bracket matching instead of treating brackets as simple operators, but I am not sure it is the best way. Looking into the code of some JavaScript lexers was not very helpful either.
The ECMAScript standard specifies (as a theoretical model) different "goal symbols" which are used in different syntactic contexts. Templated strings are one of the contexts with specific goal symbols.
That means that you need a lexical scanner which can switch states. Whether it does so itself, by duplicating part of the work of the parser, or as a result of a syntax action depends on the precise structure of the parsing architecture. You'll find implementations corresponding to both possibilities.
Putting this kind of logic into the parser is easier with predictive (top-down) parsers, such as a recursive descent parser. You could insert a call to the lexer's interface for changing states immediately after recognising the token which triggers the state change (backtick in your templated literal example). Or you could write the lexer interface so the "get a token" function also takes a lexical state argument; then your parser can maintain the lexical state. In effect, this last option is equivalent to using multiple lexical scanners, one for each state, which is also an attractive option, but it depends on separating the lexical scanner from the mechanism for reading input. (Personally, I favour this separation, but it's rarely discussed or implemented.)
Alternatively, you can use a bottom-up parser. In that case you need to be careful with synchronization between the parser's lookahead mechanism and the lexical scanner, since the interface between a bottom-up parser and a lexical scanner always allows the parser to read at least one token ahead, and it's possible that the state change needed to be done before that token was scanned. There are ways to handle this synchronization issue, but it's common for bottom-up parsers to put simple lexical state transitions into the lexer in order to avoid this issue. This necessarily involves a little duplication of effort between scanner and parser, but counting braces is not so complicated.
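As a rough illustration of the brace-counting approach (not taken from any real ECMAScript implementation), here is a sketch of a scanner for a single template literal. The function name scanTemplate and the shape of the returned parts are made up for this example, and it deliberately ignores strings, comments and nested template literals inside the embedded expression, all of which a real lexer would have to handle.

```javascript
// Splits the source of one template literal into text and expression parts.
// Assumption: braces inside ${...} appear only as real brackets, never inside
// strings, comments or nested templates.
function scanTemplate(src) {
  if (src[0] !== "`") throw new SyntaxError("expected opening backtick");
  const parts = [];
  let text = "";
  let i = 1;
  while (i < src.length) {
    const ch = src[i];
    if (ch === "`") {
      // Closing backtick: the literal is complete.
      parts.push({ type: "text", value: text });
      return parts;
    }
    if (ch === "\\") {
      // Keep escape sequences raw and skip the escaped character.
      text += src.slice(i, i + 2);
      i += 2;
      continue;
    }
    if (ch === "$" && src[i + 1] === "{") {
      // State switch: from "template text" to "embedded expression".
      parts.push({ type: "text", value: text });
      text = "";
      i += 2;
      let depth = 1; // the "${" itself counts as one open brace
      let expr = "";
      while (i < src.length) {
        const c = src[i];
        if (c === "{") depth++;
        else if (c === "}" && --depth === 0) break;
        expr += c;
        i++;
      }
      if (depth !== 0) throw new SyntaxError("unterminated ${ in template");
      parts.push({ type: "expr", value: expr });
      i++; // skip the "}" that closed the expression
      continue;
    }
    text += ch;
    i++;
  }
  throw new SyntaxError("unterminated template literal");
}

console.log(scanTemplate("`hello ${ { a: 10, b: 15 } } world`"));
```

In a full lexer the expression part would be handed back to the normal token scanner rather than collected as a string; the depth counter is the only piece of "parsing" the lexer has to duplicate.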
If you're trying to use ECMAScript parsers as a source of inspiration, you need to be aware that there are many other complications with ECMAScript, particularly automatic semicolon insertion, which also involve coordination between lexer and parser. Solving those may impose other constraints on the overall architecture, and certainly makes the resulting parser code harder to read and understand.

RegExp for parsing a Math Expression?

Hey I've written a fractal-generating program in JavaScript and HTML5 (here's the link), which was about a 2 year process including all the research I did on Complex math and fractal equations, and I was looking to update the interface, since it is quite intimidating for people to look at. While looking through the code I noticed that some of my old techniques for going about doing things were very inefficient, such as my Complex.parseFunction.
I'm looking for a way to use RegExp to parse components of the expression such as functions, operators, and variables, as well as implementing the proper order of operations for the expression. An example below might demonstrate what I mean:
//the first example parses an expression with two variables and outputs to string
console.log(Complex.parseFunction("i*-sinh(C-Z^2)", ["Z","C"], false))
"Complex.I.mult(Complex.neg(Complex.sinh(C.sub(Z.cPow(new Complex(2,0,2,0))))))"
//the second example parses the same expression but outputs to function
console.log(Complex.parseFunction("i*-sinh(C-Z^2)", ["Z","C"], true))
function(Z,C){
return Complex.I.mult(Complex.neg(Complex.sinh(C.sub(Z.cPow(new Complex(2,0,2,0))))));
}
I know how to handle RegExp using String.prototype.replace and all that, all I need is the RegExp itself. Please note that it should be able to tell the difference between the subtraction operator (e.g. "C-Z^2") and the negative function (e.g. "i*-(Z^2+C)") by noting whether it is directly after a variable or an operator respectively.
While you can use regular expressions as part of an expression parser, for example to break out tokens, regular expressions do not have the computational power to parse properly nested mathematical expressions. That is essentially one of the core results of computing theory (finite-state automata vs. pushdown automata). You probably want to look at something like recursive-descent or LR parsing.
I also wouldn't worry too much about the efficiency of parsing an expression provided you only do it once. Given all of the other math you are doing, I doubt it is material.
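The split suggested in this answer (a regular expression for tokens, recursive descent for structure) might look roughly like the sketch below. It is an illustration only: tokenizeMath and parseMath are hypothetical names, the grammar is an assumption, and the output is a prefix-notation string rather than the Complex method calls from the question, so that precedence and the difference between subtraction and negation are easy to see.

```javascript
// One alternative per token kind; the final (\S) catches anything unexpected.
const MATH_TOKEN = /([A-Za-z]+)|(\d+(?:\.\d+)?)|([-+*\/^()])|(\S)/g;

function tokenizeMath(src) {
  const tokens = [];
  for (const m of src.matchAll(MATH_TOKEN)) {
    if (m[4]) throw new SyntaxError("unexpected character: " + m[4]);
    tokens.push(m[0]);
  }
  return tokens;
}

// Assumed grammar:
//   expr  := term (("+" | "-") term)*
//   term  := unary (("*" | "/") unary)*
//   unary := "-" unary | power
//   power := atom ("^" unary)?
//   atom  := number | name | name "(" expr ")" | "(" expr ")"
function parseMath(src) {
  const tokens = tokenizeMath(src);
  let pos = 0;
  const peek = () => tokens[pos];
  const next = () => tokens[pos++];
  const expect = (t) => {
    if (next() !== t) throw new SyntaxError("expected " + t);
  };

  function expr() {
    let left = term();
    while (peek() === "+" || peek() === "-") {
      const op = next();
      left = `(${op} ${left} ${term()})`;
    }
    return left;
  }
  function term() {
    let left = unary();
    while (peek() === "*" || peek() === "/") {
      const op = next();
      left = `(${op} ${left} ${unary()})`;
    }
    return left;
  }
  function unary() {
    // A "-" where an operand is expected is negation, not subtraction.
    if (peek() === "-") {
      next();
      return `(neg ${unary()})`;
    }
    return power();
  }
  function power() {
    const base = atom();
    if (peek() === "^") {
      next();
      return `(^ ${base} ${unary()})`;
    }
    return base;
  }
  function atom() {
    const t = next();
    if (t === undefined) throw new SyntaxError("unexpected end of input");
    if (t === "(") {
      const inner = expr();
      expect(")");
      return inner;
    }
    if (/^[A-Za-z]/.test(t)) {
      if (peek() !== "(") return t; // plain variable or constant
      next();
      const arg = expr();
      expect(")");
      return `(${t} ${arg})`; // function call
    }
    if (/^\d/.test(t)) return t;
    throw new SyntaxError("unexpected token: " + t);
  }

  const result = expr();
  if (pos < tokens.length) throw new SyntaxError("unexpected token: " + peek());
  return result;
}

console.log(parseMath("i*-sinh(C-Z^2)"));
```

The unary rule is what separates "C-Z^2" from "i*-(Z^2+C)": a minus is only treated as negation when the parser is expecting an operand. Emitting Complex.neg(...) style strings instead would only change the template strings, not the structure.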

Is this safe? eval or new Function for simple arithmetic expression

I have heard so many bad things about eval that I've never even tried to use it. However today I have a situation where it seems to be the right answer.
I need a script that can do simple calculations by combining variables. For example, if value=5 and max=8, I want to evaluate value*100/max. Both the values and the formulas will be retrieved from external sources, which is why I am concerned with eval.
I have set up a jsfiddle demo with some sample code:
http://jsfiddle.net/6yzgA/
The values are converted to numbers using parseFloat, so I believe I'm pretty safe here. The characters in the formula are matched against this regular expression:
regex=/[^0-9\.+-\/*<>!=&()]/, // allows numbers (including decimal), operations, comparison
My questions:
Does my regex filter protect me from any attack?
Is there any reason to use eval vs. new Function in this case?
Is there another, safer way to evaluate formulas?
Since you aren't sending anything to your server, or using anything on anyone else's system, the worst that can happen is that the user crashes his own browser, nothing more. There is nothing unsafe about using eval here, since everything happens user-side.
Escaping and preventing anything on the client-side doesn't make sense at all. The user can alter any piece of JS code and run it just as easily as I can change the jsfiddle you posted. Trust me, it's just that simple, and you cannot rely on client-side security.
If you remember to escape input fields on the server-side it's nothing to be worried about. There are plenty of functions for that by default, depending on which language you're using.
If the user wants to type in <script>haxx(l33t);</script> - let him do it. Just remember to escape special characters so you'll have &lt;script&gt;haxx(l33t);&lt;/script&gt;.
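For illustration, the escaping step could look like the sketch below. The answer recommends doing this on the server with whatever built-in function your language provides; this JavaScript version (with the made-up name escapeHtml) is only meant to show what "escape special characters" means for text placed into HTML content.

```javascript
// Replaces the characters that are special in HTML text and attribute values
// with their entity equivalents. Hypothetical helper, not a library function.
function escapeHtml(text) {
  const entities = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;",
  };
  return String(text).replace(/[&<>"']/g, (ch) => entities[ch]);
}

console.log(escapeHtml("<script>haxx(l33t);</script>"));
```

The escaped string is displayed by the browser as the literal text the user typed instead of being interpreted as a script element.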

Is it possible to do validation with regular expressions in a localized web application environment?

Is it possible to do client side validation in a localized web application environment?
I've only seen regular expressions written in English, can they be written for other languages? Would the regular expressions have to be changed based on the language chosen by an end user or is it possible to use just 1?
Are there any tools/frameworks to help with this?
The previous answer was good, but it's not clear to me that it answered the question. For that matter, I don't really understand the question. If you're asking whether JavaScript regular expressions are independent of language, then the answer is yes, they are just looking at characters in a string. But obviously the things you're looking for with those regular expressions (words, numbers, phone numbers, dates, etc.) would presumably vary with language and locale. So you may be able to construct a universal regex that works to validate all phone numbers, for example, but it's unlikely, and in any case there may be cases where a valid number in one context is invalid in another. You're better off creating language-specific regular expressions for validation, just as you would create language-specific strings. Does that answer your question?
No. Please do not confuse validation with well-formedness. The former is a measure of conformity to a grammar definition and the latter is a measure of conformity to a syntax requirement. Even if your regex were so extremely awesome as to account for all well-formedness conditions, it is absent the context of structured definitions where the structure is recursive and reflective.
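A minimal sketch of keeping validation patterns per locale, in the same way localized strings are kept, is shown below. The table DECIMAL_PATTERNS, the function isValidDecimal and the two patterns are assumptions made for this example; they only cover plain decimals without grouping separators.

```javascript
// Hypothetical per-locale patterns for a plain decimal number.
const DECIMAL_PATTERNS = {
  "en-US": /^\d+(?:\.\d+)?$/, // decimal point, e.g. 3.14
  "de-DE": /^\d+(?:,\d+)?$/, // decimal comma, e.g. 3,14
};

// Looks up the pattern for the user's locale and tests the input against it.
function isValidDecimal(input, locale) {
  const pattern = DECIMAL_PATTERNS[locale];
  if (!pattern) throw new RangeError("no pattern for locale: " + locale);
  return pattern.test(input);
}

console.log(isValidDecimal("3,14", "de-DE"));
console.log(isValidDecimal("3,14", "en-US"));
```

The same input is valid in one locale and invalid in the other, which is why a single universal pattern tends not to work.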

Syntax / Logical checker In Javascript?

I'm building a solution for a client which allows them to create very basic code.
I've done some basic syntax validation, but I'm stuck at variable verification.
I know JSLint does this using JavaScript, and I was wondering if anyone knew of a good way to do this.
So for example say the user wrote the code
moose = "barry"
base = 0
if(moose == "barry"){base += 100}
Then I'm trying to find a way to verify that the "if" expression has the correct syntax, whether the variable moose has been initialized, and so on,
but I want to do this without scanning character by character.
The code is a mini language built just for this application, so it is very basic and doesn't need to manage memory or anything like that.
I had thought about splitting first by carriage return and then by space, but there is nothing to say the user won't write something like moose="barry" or if(moose=="barry"),
and there is nothing to say the user won't keep the result of a condition inline.
Obviously compilers and interpreters do this on a much more extensive scale, but I'm not sure whether they do it character by character and, if they do, how they have optimized it.
(The other option is to send it back to PHP to process, which would relieve the browser of the responsibility.)
Any suggestions?
Thanks
The use case is limited and the syntax will never be extended in this case. The language is a simple scripting language that lets the client create a unique cost based on their users' input. The end result will be processed by PHP regardless, to ensure the calculation can't be adjusted by the end user and to ensure there is some consistency.
So for example, say there is a base cost of £1.00
and there is a field on the form called "Additional Cost", the language will allow them manipulate the base cost relative to the "additional cost" field.
So
base = 1;
if(additional > 100 && additional < 150){base += 50}
elseif(additional == 150){base *= 150}
else{base += additional;}
This is a basic example of how the language would be used.
Thank you for all your answers,
I've investigated a parser, and creating one would be far more complex than is required.
I ran several tests with thousands of lines of code and found that processing character by character takes only a few seconds, even on a single-core P4 with 512 MB of memory (which is far less than the customer uses).
I've decided to build a PHP-based syntax checker which will check the input and convert the variables etc. into valid PHP code while it's checking (so that it's ready to be called later without recompilation). Using this instead of JavaScript seems more appropriate and will allow for more complex code to arise without hindering the validation process.
It's only taken an hour and I have code which is able to check the validity of an if statement and isn't confused by nested ifs, spaces or odd expressions. There is very little left to be checked, whereas a parser and a full-blown scripting language would have taken a lot longer.
You've all given me a lot to think about and I've rated relevant answers, thank you.
If you really want to do this — and by that I mean if you really want your software to work properly and predictably, without a bunch of weird "don't do this" special cases — you're going to have to write a real parser for your language. Once you have that, you can transform any program in your language into a data structure. With that data structure you'll be able to conduct all sorts of analyses of the code, including procedures that at least used to be called use-definition and definition-use chain analysis.
If you concoct a "programming language" that enables some scripting in an application, then no matter how trivial you think it is, somebody will eventually write a shockingly large program with it.
I don't know of any readily-available parser generators that generate JavaScript parsers. Recursive descent parsers are not too hard to write, but they can get ugly to maintain and they make it a little difficult to extend the syntax (esp. if you're not very experienced crafting the original version).
You might want to look at JS/CC, which is a parser generator, written in JavaScript, that generates a parser for a grammar. You will need to figure out how to describe your language using BNF or EBNF. Also, JS/CC has its own syntax (which is somewhat close to actual BNF/EBNF) for specifying the grammar. Given the grammar, JS/CC will generate a parser for it.
Your other option, as Pointy said, is to write your own lexer and recursive-descent parser from scratch. Once you have a BNF/EBNF, it's not that hard. I recently wrote a parser from an EBNF in JavaScript (the grammar was pretty simple, so it wasn't that hard to write one; YMMV).
To address your comments about it being "client specific", I will also add my own experience here. If you're providing a scripting language and a scripting environment, there is no better route than an actual parser.
Handling special cases through a bunch of if-elses is going to be horribly painful and a maintenance nightmare. When I was a freshman in college, I tried to write my own language. This was before I knew anything about recursive-descent parsers, or just parsers in general. I figured out by myself that code can be broken down into tokens. From there, I wrote an extremely unwieldy parser using a bunch of if-elses, and also splitting the tokens by spaces and other characters (exactly what you described). The end result was terrible.
Once I read about recursive-descent parsers, I wrote a grammar for my language and easily created a parser in a tenth of the time it took me to write my original parser. Seriously, if you want to save yourself a lot of pain, write an actual parser. If you go down your current route, you're going to be fixing issues forever. You're going to have to handle cases where people put the space in the wrong place, or perhaps they have one too many (or one too few) spaces. The only other alternative is to provide an extremely rigid structure (i.e., you must have exactly x number of spaces following this statement), which is liable to make your scripting environment extremely unattractive. An actual parser will automatically fix all these problems.
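To give an idea of the size of such a parser, here is a sketch of a recursive-descent checker for the mini language from the question. Everything in it is an assumption made for illustration: the names tokenizeMini and checkProgram, the grammar (assignments, if/elseif/else with braces, flat binary expressions), and the simplification that a variable assigned inside any branch counts as defined afterwards.

```javascript
// Token kinds: number, string, identifier, operator; (\S) catches bad input.
const MINI_TOKEN =
  /(\d+(?:\.\d+)?)|("[^"]*")|([A-Za-z_]\w*)|(==|!=|<=|>=|&&|\|\||[-+*\/]=|[-+*\/<>=(){};])|(\S)/g;
const MINI_BINARY = new Set(["+", "-", "*", "/", "==", "!=", "<", ">", "<=", ">=", "&&", "||"]);
const MINI_ASSIGN = ["=", "+=", "-=", "*=", "/="];

function tokenizeMini(src) {
  const tokens = [];
  for (const m of src.matchAll(MINI_TOKEN)) {
    if (m[5]) throw new SyntaxError("unexpected character: " + m[5]);
    const type = m[1] ? "num" : m[2] ? "str" : m[3] ? "id" : "op";
    tokens.push({ type, text: m[0] });
  }
  return tokens;
}

// Throws SyntaxError on malformed code; returns a list of
// "used before assignment" messages. `inputs` names form fields
// that exist before the script runs.
function checkProgram(src, inputs = []) {
  const tokens = tokenizeMini(src);
  const defined = new Set(inputs);
  const errors = [];
  let pos = 0;
  const peek = () => (tokens[pos] ? tokens[pos].text : undefined);
  const expect = (text) => {
    if (peek() !== text) throw new SyntaxError(`expected "${text}" but found "${peek()}"`);
    pos++;
  };
  const use = (name) => {
    if (!defined.has(name)) errors.push(`"${name}" used before assignment`);
  };

  function primary() {
    const t = tokens[pos++];
    if (!t) throw new SyntaxError("unexpected end of input");
    if (t.text === "(") {
      expression();
      expect(")");
    } else if (t.type === "id") {
      use(t.text);
    } else if (t.type !== "num" && t.type !== "str") {
      throw new SyntaxError(`unexpected "${t.text}"`);
    }
  }
  function expression() {
    primary();
    while (MINI_BINARY.has(peek())) {
      pos++;
      primary();
    }
  }
  function condition() {
    expect("(");
    expression();
    expect(")");
  }
  function block() {
    expect("{");
    while (peek() !== "}") statement();
    expect("}");
  }
  function statement() {
    const t = tokens[pos++];
    if (!t) throw new SyntaxError("unexpected end of input");
    if (t.text === "if") {
      condition();
      block();
      while (peek() === "elseif") {
        pos++;
        condition();
        block();
      }
      if (peek() === "else") {
        pos++;
        block();
      }
      return;
    }
    if (t.type !== "id") throw new SyntaxError(`unexpected "${t.text}"`);
    const op = peek();
    if (!MINI_ASSIGN.includes(op)) throw new SyntaxError(`expected assignment after "${t.text}"`);
    pos++;
    if (op !== "=") use(t.text); // "+=" and friends read the old value
    expression();
    defined.add(t.text);
    if (peek() === ";") pos++; // semicolons are optional
  }

  while (pos < tokens.length) statement();
  return errors;
}

console.log(checkProgram('moose = "barry"\nbase = 0\nif(moose == "barry"){base += 100}'));
```

Whitespace never matters here because the tokenizer discards it, so moose="barry" and if(moose=="barry") are handled the same way as their spaced-out versions.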
JavaScript has a function 'eval'.
var code = 'alert(1);';
eval(code);
This will show an alert. You can use 'eval' to execute basic code.
