lightweight javascript to javascript parser

lightweight javascript to javascript parser - javascript

How would I go about writing a lightweight javascript to javascript parser. Something simple that can convert some snippets of code.
I would like to basically make the internal scope objects in functions public.
So something like this
var outer = 42;
window.addEventListener('load', function() {
var inner = 42;
function magic() {
var in_magic = inner + outer;
console.log(in_magic);
}
magic();
}, false);
Would compile to
__Scope__.set('outer', 42);
__Scope__.set('console', console);
window.addEventListener('load', constructScopeWrapper(__Scope__, function(__Scope__) {
__Scope__.set('inner', 42);
__Scope__.set('magic',constructScopeWrapper(__Scope__, function _magic(__Scope__) {
__Scope__.set('in_magic', __Scope__.get('inner') + __Scope__.get('outer'));
__Scope__.get('console').log(__Scope__.get('in_magic'));
}));
__Scope__.get('magic')();
}), false);
Demonstation Example
Motivation behind this is to serialize the state of functions and closures and keep them synchronized across different machines (client, server, multiple servers). For this I would need a representation of [[Scope]]
Questions:
Can I do this kind of compiler without writing a full JavaScript -> (slightly different) JavaScript compiler?
How would I go about writing such a compiler?
Can I re-use existing js -> js compilers?

I don't think your task is easy or short given that you want to access and restore all the program state. One of the issues is that you might have to capture the program state at any moment during a computation, right? That means the example as shown isn't quite right; that captures state sort of before execution of that code (except that you've precomputed the sum that initializes magic, and that won't happen before the code runs for the original JavaScript). I assume you might want to capture the state at any instant during execution.
The way you've stated your problem, is you want a JavaScript parser in JavaScript.
I assume you are imagining that your existing JavaScript code J, includes such a JavaScript parser and whatever else is necessary to generate your resulting code G, and that when J starts up it feeds copies of itself to G, manufacturing the serialization code S and somehow loading that up.
(I think G is pretty big and hoary if it can handle all of Javascript)
So your JavaScript image contains J, big G, S and does an expensive operation (feed J to G) when it starts up.
What I think might serve you better is a tool G that processes your original JavaScript code J offline, and generates program state/closure serialization code S (to save and restore that state) that can be added to/replace J for execution. J+S are sent to the client, who never sees G or its execution. This decouples the generation of S from the runtime execution of J, saving on client execution time and space.
In this case, you want a tool that will make generation of such code S easiest. A pure JavaScript parser is a start but isn't likely enough; you'll need symbol table support to know which function code is connected a function call F(...), and which variable definition in which scope corresponds to assignments or accesses to a variable V. You may need to actually modify your original code J to insert points of access where the program state can be captured. You may need flow analysis to find out where some values went. Insisting all of this in JavaScript narrows your range of solutions.
For these tasks, you will likely find a program transformation tool useful. Such tools contain parsers for the langauge of interest, build ASTs representing the program, enable the construction of identifier-to-definition maps ("symbol tables"), can carry out modifications to the ASTs representing insertion of access points, or synthesis of ASTs representing your demonstration example, and then regenerate valid JavaScript code containing the modified J and the additions S.
Of all the program transformation systems that I know about (which includes all the ones at the Wikipedia site), none are implemented in JavaScript.
Our DMS Software Reengineering Toolkit is such a program transformation system offering all the features I just described. (Yes, its big and hoary; it has to be to handle the complexities of real computer languages). It has a JavaScript front end that contains a complete JavaScript parser to ASTs, and the machinery to regenerate JavaScript code from modified or synthesized ASTs. (Also big and hoary; good thing that hoary + hoary is still just hoary). Should it be useful, DMS also provides support for building control and dataflow analysis.

If you want something with a simple interface, you could try node-burrito: https://github.com/substack/node-burrito
It generates an AST using the uglify-js parser and then recursively walks the nodes. All you have to do is give a single callback which tests each node. You can alter the ones you need to change, and it outputs the resulting code.

I'd try to look for an existing parser to modify. Perhaps you could adapt JSLint/JSHint?

There is a problem with the rewriting above, you're not hoisting the initialization of magic to the top of the scope.
There's a number of projects out there that parse JavaScript.
Crock's Pratt parser which works well on JavaScript that fits within "The good parts" and less well on other JS.
The es-lab parser based on ometa which handles the full grammar including a lot of corner cases that Crock's parser misses. It may not perform as well as Crock's.
narcissus parser and evaluator. I don't have much experience with this.
There are also a number of high-quality lexers for JavaScript that let you manipulate JS at the token level. This can be tougher than it sounds though since JavaScript is not lexically regular, and predicting semicolon insertion is difficult without a full parse.
My es5-lexer is a carefully constructed and efficient lexer for EcmaScript 5 that provides the ability to tokenize JavaScript. It is heuristic where JavaScript's grammar is not lexically regular but the heuristic is very good and it provides a means to transform a token stream so that an interpreter is guaranteed to interpret it the way the lexer interpreted the tokens so if you don't trust your input, you can still be sure that the interpretation underlying the security transformations is sound even if not correct according to the spec for some bizarre inputs.

Your problem seams to be in same family of problems as what is solved with the JS Opfuscators and JS Compressors -- they as well as you need to be able to parse and reformat the JS to an equivalent script;
There was a good discussion on obfuscators here and the possible solution to your problem could be to leverage the parse and generator part from one of the FOSS versions.
One callout, your example code does not take into account the scopes of the variables you want to set/get and that will eventually become a problem that you will have to solve.
Addition
Given the scope problem for closure defined functions, you are probably unlikely to be able to solve this problem as a static parsing problem, as the scope variables outside the closure will have to be imported/exported to resolve/save and re-instantiate scope. Hence you may need to dig into the evaluation engine itself, and perhaps get the V8 engine and make a hack to the interpreter itself -- that is assuming that you do not need this to be generic cross all script engines and that you can tie it down to a single implementation which you control.

Related

Lexers which require parsing information

I am currently writing a little programming language and have come across a problem. In javascript template literals, we can embed arbitrary expressions, like:
let a = `hello ${ { a: 10, b: 15 } } world`
To properly lex the above-given snippet, the lexer needs to understand bracket matching(parsing essentially) as it can't just assume the first } to be the end of the embedded expression. How do lexers idiomatically solve this problem? One way is to check for proper bracket matching instead of treating them as simple operators, but I am not sure it is the best way. Looking into the code of some javascript lexers also was not very helpful.

The ECMAScript standard specifies (as a theoretical model) different "goal symbols" which are used in different syntactic contexts. Templated strings are one of the contexts with specific goal symbols.
That means that you need a lexical scanner which can switch states. Whether it does so itself, by duplicating part of the work of the parser, or as a result of a syntax action depends on the precise structure of the parsing architecture. You'll find implementations corresponding to both possibilities.
Putting this kind of logic into the parser is easier with predictive (top-down) parsers, such as a recursive descent parser. You could insert a call to the lexer's interface for changing states immediately after recognising the token which triggers the state change (backtick in your templated literal example). Or you could write the lexer interface so the "get a token" function also takes a lexical state argument; then your parser can maintain the lexical state. In effect, this last option is equivalent to using multiple lexical scanners, one for each state, which is also an attractive option, but it depends on separating the lexical scanner from the mechanism for reading input. (Personally, I favour this separation, but it's rarely discussed or implemented.)
Alternatively, you can use a bottom-up parser. In that case you need to be careful with synchronization between the parser's lookahead mechanism and the lexer scanner, since the interface between a bottom-up parser and a lexical scanner always allows the parser to read at least one token ahead, and it's possible that the state change needed to be done before that token was scanned. There are ways to handle this synchronization issue, but it's common for bottom-up parsers to put simple lexical state transitions into the lexer in order to avoid this issue. This necessarily involves a little duplication of effort between scanner and parser but counting braces is not so complicated.
If you're trying to use ECMAScript parsers as a source of inspiration, you need to be aware that there are many other complications with ECMAScript, particularly automatic semicolon insertion, which also involve coordination between lexer and parser. Solving those may impose other constraints on the overall architecture, and certainly makes the resulting parser code harder to read and understand.

Declaring variables using single or multiple var statements. Performance wise

(Question moved from Webmasters.)
Someone once told me that it's better for Javascript performance to use one var statement rather then multiple.
I.e.
// Version A. Allegedly Faster
var a = 1,
b = 2,
c = 3;
// Version B. Allegedly Slower
var a = 1;
var b = 2;
var c = 3;
The reasoning behind this was along the lines of: For every var statement, Javascript will start allocating memory and then stop at the semicolon. Whereas, if you have only one var statement, many JS implementations will optimize it and allocate space for all variables in the same call. Thus making things go faster.
However, when googling to confirm this I only find rants about how some consider the second example to be simpler from a maintenance point of view. And some disagree.
This JSPerf test sais there is no difference.
So my question: From a performance perspective, is there any reason version A or B would be better?
(You do save a few bytes on not writing the "var" declaration, but that is in delivery of data to a browser. This question includes server side JS)

First of all, your question is not answerable without specifying what JavaScript implementation is concerned. Performance is not the concern of the language and anyone could write a conforming JavaScript implementation where separate var statements would be slower because performance characteristics are not specified in the language specification.
You mentioned server side so I'm going to assume V8, which has 2 different compilers for JavaScript.
The first compiler would parse your source code into an AST and then walk the tree's nodes while spitting machine instructions for each tree node. So no, the syntactic form of your var statements could have not affected the performance here.
The second compiler is much smarter and takes its time to analyze the code through multiple different representations (not just AST) in order to generate optimized code. Since the dumb compiler can already see that the codes have the same semantics, so can the smart one, so there is again no difference.
Syntactic differences where semantics are exactly the same will not in general have any performance difference unless there is a bug. But this is a trivial case so I would find it very hard to believe any world class JavaScript implementation failing here.

C interpreter written in javascript

Is there any C interpreter written in javascript or java ?
I don't need a full interpreter but I need to be able to do a step by step execution of the program and being able to see the values of variables, the stack...all that in a web interface.
The idea is to help C beginners by showing them the step by step execution of the program.
We are using GWT to build the interface so if something exists in Java we should be able to use it.
I can modify it to suit my needs but if I can avoid to write the parser / abstract-syntax tree walker / stack manipulation... that would be great.
Edit :
To be clear I don't want to simulate the complete C because some programs can be really tricky.
By step I mean a basic operation such as : expression evaluation, affectation, function call.
The C I want to simulate will contains : variables, for, while, functions, arrays, pointers, maths functions.
No goto, string functions, ctypes.h, setjmp.h... (at least for now).
Here is a prototype : http://www.di.ens.fr/~fevrier/war/simu.html
In this example we have manually converted the C code to a javascript representation but it's limited (expressions such as a == 2 || a = 1 are not handled) and is limited to programs manually converted.
We have a our disposal a C compiler on a remote server so we can check if the code is correct (and doesn't have any undefined behavior). The parsing / AST construction can also be done remotely (so any language) but the AST walking needs to be in javascript in order to run on the client side.

There's a C grammar available for antlr that you can use to generate a C parser in Java, and possibly JavaScript too.

There is em-scripten which converts LLVM languages to JS a little hacking on it and you may be able to produce a C interperter.

felixh's JSCPP project provides a C++ interpreter in Javascript, though with some limitations.
https://github.com/felixhao28/JSCPP
So an example program might look like this:
var JSCPP = require('JSCPP');
var launcher = JSCPP.launcher;
var code = 'int main(){int a;cin>>a;cout<<a;return 0;}';
var input = '4321';
var exitcode = launcher.run(code, input);
console.info('program exited with code ' + exitcode);
As of March 2015 this is under active development, so while it's usable there are still areas where it may continue to expand. Check the documentation for limitations. It looks like you can use it as a straight C interpreter with limited library support for now with no further issues.

I don't know of any C interpreters written in JavaScript, but here is a discussion of available C interpreters:
Is there an interpreter for C?
You might do better to look for any sort of virtual machine that runs on top of JavaScript, and then see if you can find a C compiler that emits the proper machine code for the VM. A likely one would seem to be LLVM; if you can find a JavaScript VM that can run LLVM, then you will be in great shape.
I did a few Google searches and found Emscripten, which translates C code into JavaScript directly! Perhaps you can do something with this:
https://github.com/kripken/emscripten/wiki
Perhaps you can modify Emscripten to emit a "sequence point" after each compiled line of C, and then you can make your simulated environment single-step from sequence point to sequence point.
I believe Emscripten is implementing LLVM, so it may actually have virtual registers; if so it might be ideal for your purposes.

I know you specified C code, but you might want to consider a JavaScript emulation of a simpler language. In particular, please consider FORTH.
FORTH runs on an extremely simple virtual machine. In FORTH there are two stacks, a data stack and a flow-of-control stack (called the "return" stack); plus some global memory. Originally FORTH was a 16-bit language but there are plenty of 32-bit FORTH implementations out there now.
Because FORTH code is sort of "close to the machine" it is easy to understand how it all works when you see it working. I learned FORTH before I learned C, and I found it to be a valuable learning experience.
There are several FORTH interpreters available in JavaScript already. The FORTH virtual machine is so simple, it doesn't take very long to implement it!
You could even then get a C-to-FORTH translator and let the students watch the FORTH virtual machine interpret compiled C code.
I consider this answer a long shot for you, so I'll stop writing here. If you are in fact interested in the idea, comment below and ask for more details and I will be happy to share them. It's been a long time since I wrote any FORTH code but I still remember it fondly, and I'd be happy to talk more about FORTH.
EDIT: Despite this answer being downvoted to a negative score, I am going to leave it up here. A simulation for educational purposes is IMHO more valuable if the simulation is simple and easy to understand. The simple stack-based virtual machine for FORTH is very simple, yet you could compile C code to run on it. (In the 80's there was even a CPU chip made that had FORTH instructions as its native machine code.) And, as I said, I personally studied FORTH when I was a complete beginner and it helped me to understand assembly language and C.
The question has no accepted answer, now over two years after it was asked. It could be that Loïc Février didn't find any suitable JavaScript interpreter. As I said, there already exist several JavaScript interpreters for the FORTH virtual machine. Therefore, this answer is a practical one.

C is a compiled language, not an interpreted language, and has features like pointers which JS completely doesn't support, so interpreting C in Javascript doesn't really make sense

GWT reduce compiled javascript size

I have found that the size of the compiled JavaScript grows faster than I had expected. Adding a few lines of Java code to my project can increase the script size in several Kbs.
At the moment my compiled project weights 1Mb. I'm not using any external libraries except for those for MVP (Activities & Places) , testing (JUnit) and logging.
I would like to know if there are any coding practices/recommendations to keep the compiled script as small as possible. I'm not refering to code splitting, but to coding techniques or patterns that can make the compiled JavaScript effectively smaller.
Many thanks

GWT uses a "pay as you go" design philosophy, and since you're not allowed to use reflection the compiler can statically prove (on a method-by-method basis) that a section of code is "reachable", and eliminate those that are not. For example, if you never use the remove() method on ArrayList, then that code does not get included in the resulting JavaScript.
If you are seeing several kilobyte jumps with the addition of just a few lines, it probably means that you've introduced the use of a new type (and possibly one that depends on other new types) that you had not yet been using. It might also mean that you've made a change to send this new type "over the wire" back to the server, in which case a GWT generator had to include JavaScript for marshaling that type, and any new types that are reachable via its "has-a" and "is-a" references.
So if it were me, I would begin there: when you catch a 2-line change making a multi-kilobyte increase, start by looking at the types and asking whether it is a type that you have used before, and whether you're sending a new type over the wire, and whether that type also depends on other types under the hood.
One final thought: in Ray Ryan's 2009 presentation at Google I/O he mentioned a superstition that he had picked up from the GWT compiler team, where they recommended against using generic types (I'm not speaking of Java Generics here, but rather supertypes) as RPC arguments & return values. In particular, instead of having your RPC call take or return a Map, have it take or return a HashMap instead. The belief is that the GWT generator can then narrow the amount of serialization code that it has to create at compile time (because it could, for example, refrain from generating serialization code for a TreeMap).
I hope this helps.

GWT creates different output versions for each supported browser, so when you say the project size is 1MB are you then referring to the combined size of these ? (each browser only download's the one it actually needs).
I have tried to experiment with the generated output when using various inheritance/class/generics constructs. Unfortunately the extra complexity introduced far outweighs the small size improvements gained (when fx. dumping generics).
I have been on some large GWT projects (+50.000 lines) and have found that code obfuscating coupled with turning on compression on the web server to be the simplest most effective way to minimize the downloads. If this does not shrink the code enough, then look into GWT's compilation report which you can use to pinpoint potential problematic classes and places to insert code splitting.

Syntax / Logical checker In Javascript?

I'm building a solution for a client which allows them to create very basic code,
now i've done some basic syntax validation but I'm stuck at variable verification.
I know JSLint does this using Javascript and i was wondering if anyone knew of a good way to do this.
So for example say the user wrote the code
moose = "barry"
base = 0
if(moose == "barry"){base += 100}
Then i'm trying to find a way to clarify that the "if" expression is in the correct syntax, if the variable moose has been initialized etc etc
but I want to do this without scanning character by character,
the code is a mini language built just for this application so is very very basic and doesn't need to manage memory or anything like that.
I had thought about splitting first by Carriage Return and then by Space but there is nothing to say the user won't write something like moose="barry" or if(moose=="barry")
and there is nothing to say the user won't keep the result of a condition inline.
Obviously compilers and interpreters do this on a much more extensive scale but i'm not sure if they do do it character by character and if they do how have they optimized?
(Other option is I could send it back to PHP to process which would then releave the browser of responsibility)
Any suggestions?
Thanks
The use case is limited, the syntax will never be extended in this case, the language is a simple scripted language to enable the client to create a unique cost based on their users input the end result will be processed by PHP regardless to ensure the calculation can't be adjusted by the end user and to ensure there is some consistency.
So for example, say there is a base cost of £1.00
and there is a field on the form called "Additional Cost", the language will allow them manipulate the base cost relative to the "additional cost" field.
So
base = 1;
if(additional > 100 && additional < 150){base += 50}
elseif(additional == 150){base *= 150}
else{base += additional;}
This is a basic example of how the language would be used.
Thank you for all your answers,
I've investigated a parser and creating one would be far more complex than is required
having run several tests with 1000's of lines of code and found that character by character it only takes a few seconds to process even on a single core P4 with 512mb of memory (which is far less than the customer uses)
I've decided to build a PHP based syntax checker which will check the information and convert the variables etc into valid PHP code whilst it's checking it (so that it's ready to be called later without recompilation) using this instead of javascript this seems more appropriate and will allow for more complex code to arise without hindering the validation process
It's only taken an hour and I have code which is able to check the validity of an if statement and isn't confused by nested if's, spaces or odd expressions, there is very little left to be checked whereas a parser and full blown scripting language would have taken a lot longer
You've all given me a lot to think about and i've rated relevant answers thank you

If you really want to do this — and by that I mean if you really want your software to work properly and predictably, without a bunch of weird "don't do this" special cases — you're going to have to write a real parser for your language. Once you have that, you can transform any program in your language into a data structure. With that data structure you'll be able to conduct all sorts of analyses of the code, including procedures that at least used to be called use-definition and definition-use chain analysis.
If you concoct a "programming language" that enables some scripting in an application, then no matter how trivial you think it is, somebody will eventually write a shockingly large program with it.
I don't know of any readily-available parser generators that generate JavaScript parsers. Recursive descent parsers are not too hard to write, but they can get ugly to maintain and they make it a little difficult to extend the syntax (esp. if you're not very experienced crafting the original version).

You might want to look at JS/CC which is a parser generator that generates a parser for a grammer, in Javascript. You will need to figure out how to describe your language using a BNF and EBNF. Also, JS/CC has its own syntax (which is somewhat close to actual BNF/EBNF) for specifying the grammar. Given the grammer, JS/CC will generate a parser for that grammar.
Your other option, as Pointy said, is to write your own lexer and recursive-descent parser from scratch. Once you have a BNF/EBNF, it's not that hard. I recently wrote a parser from an EBNF in Javascript (the grammar was pretty simple so it wasn't that hard to write one YMMV).
To address your comments about it being "client specific". I will also add my own experience here. If you're providing a scripting language and a scripting environment, there is no better route than an actual parser.
Handling special cases through a bunch of if-elses is going to be horribly painful and a maintenance nightmare. When I was a freshman in college, I tried to write my own language. This was before I knew anything about recursive-descent parsers, or just parsers in general. I figured out by myself that code can be broken down into tokens. From there, I wrote an extremely unwieldy parser using a bunch of if-elses, and also splitting the tokens by spaces and other characters (exactly what you described). The end result was terrible.
Once I read about recursive-descent parsers, I wrote a grammar for my language and easily created a parser in a 10th of the time it took me to write my original parser. Seriously, if you want to save yourself a lot of pain, write an actual parser. If you go down your current route, you're going to be fixing issues forever. You're going to have to handle cases where people put the space in the wrong place, or perhaps they have one too many (or one too little) spaces. The only other alternative is to provide an extremely rigid structure (i.e, you must have exactly x number of spaces following this statement) which is liable to make your scripting environment extremely unattractive. An actual parser will automatically fix all these problems.

Javascript has a function 'eval'.
var code = 'alert(1);';
eval(code);
It will show alert. You can use 'eval' to execute basic code.

Develop Reference

JavaScript is the programming language of the Web.