Parse JavaScript and keep track of all variables and their values

I was watching Bret Victor's talk "Inventing on Principle" the other night and decided to try and build the real time JavaScript editor he demoed. You can see it in action at 18:05 when he implements binary search.
It doesn't look like he ever released such an editor, but regardless, I thought I could learn a lot building one like it.
Here's what I have so far
What it can currently do:
Keep track of variables and their values (if assigned as literals)
Print them on the same line on the right
Show parsing errors
I'm using Electron and Angular to build the app, so it's a desktop app for OSX, but written in JavaScript and HTML.
For parsing, I'm using Acorn. So far it's a fantastic parser, but it's really hard to actually run the code after it's been parsed. Permitting only literal assignments such as var x = 1 is doable, but things get really complex fast once you try to do stuff as simple as var x = 1 + 2, due to how Acorn structures the parsed result.
I don't want to just eval the whole thing, since it could be dangerous and there are probably better ways to do it.
Ideally, I could find a safe way to evaluate the code on the left and keep track of all the variables somehow. Unfortunately, my research indicates that there is no access to private variables in JavaScript, so I'm hoping I can count on fellow developers' ingenuity to help me with this. Any hints on how to do this better/easier than with Acorn would be greatly appreciated.
If you need it, my code base is here: https://github.com/dchacke/nasherai

Try sandbox for safe evaluation of strings.
var s = new Sandbox()
s.run( '1 + 1 + " apples"', function( output ) {
  // output.result == "2 apples"
})
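For the var x = 1 + 2 case from the question, one alternative to running the code is to walk the parsed tree directly. Acorn produces ESTree-style nodes (Literal, Identifier, BinaryExpression, VariableDeclaration), so a small recursive evaluator can handle simple arithmetic. This is only a minimal sketch: evaluate and trackVariables are hypothetical helper names, only four operators and top-level var declarations are covered, and the AST is written by hand here (without position fields) so the snippet runs without Acorn installed.

```javascript
// Evaluates a small subset of ESTree expression nodes against known variables.
function evaluate(node, scope) {
  switch (node.type) {
    case 'Literal':
      return node.value;
    case 'Identifier':
      if (!scope.has(node.name)) throw new ReferenceError(node.name + ' is not defined');
      return scope.get(node.name);
    case 'BinaryExpression': {
      const left = evaluate(node.left, scope);
      const right = evaluate(node.right, scope);
      switch (node.operator) {
        case '+': return left + right;
        case '-': return left - right;
        case '*': return left * right;
        case '/': return left / right;
        default: throw new Error('Unsupported operator: ' + node.operator);
      }
    }
    default:
      throw new Error('Unsupported node type: ' + node.type);
  }
}

// Walks top-level `var` declarations and records each variable's value.
function trackVariables(program) {
  const scope = new Map();
  for (const statement of program.body) {
    if (statement.type !== 'VariableDeclaration') continue;
    for (const declarator of statement.declarations) {
      const value = declarator.init ? evaluate(declarator.init, scope) : undefined;
      scope.set(declarator.id.name, value);
    }
  }
  return scope;
}

// Hand-written AST for: var x = 1 + 2; var y = x * 2;
const program = {
  type: 'Program',
  body: [
    {
      type: 'VariableDeclaration',
      declarations: [{
        type: 'VariableDeclarator',
        id: { type: 'Identifier', name: 'x' },
        init: {
          type: 'BinaryExpression',
          operator: '+',
          left: { type: 'Literal', value: 1 },
          right: { type: 'Literal', value: 2 },
        },
      }],
    },
    {
      type: 'VariableDeclaration',
      declarations: [{
        type: 'VariableDeclarator',
        id: { type: 'Identifier', name: 'y' },
        init: {
          type: 'BinaryExpression',
          operator: '*',
          left: { type: 'Identifier', name: 'x' },
          right: { type: 'Literal', value: 2 },
        },
      }],
    },
  ],
};

const tracked = trackVariables(program);
```

Because nothing is executed, unsupported constructs simply throw instead of running arbitrary code. Function calls, loops, and closures would each need their own case.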

Related

Why someone would use a hexadecimal approach in javascript?

I apologize because this may be a bizarre question, but I'm pretty confused. Would anyone know why someone would use what I believe to be "a hexadecimal approach" to JavaScript? For example, I see someone naming variables like
_0x2f8b, _0xcb6545,_0x893751, _0x1d2177, etc
Why would anyone ever do this? Also, I see code like
'to\x20bypass\x20this\x20link'
as well as hexadecimals for numbers such as 725392
0xb1190
So, how would anyone even get this kind of naming convention and why would they ever want to use this?

Code like this has been, 99% of the time, automatically mangled/obfuscated from the original source code, in an attempt to make it more difficult to reverse engineer.
For example, if you start with
const foo = 'bar';
const somethingElse = 5;
you might use an obfuscation tool to come up with
var _0x2f8b, _0xcb6545;
_0x2f8b = 'bar';
_0xcb6545 = 5;
and serve that to clients. Reading 200 lines of obfuscated code is a lot harder than reading 200 lines of the original source code.
"as well as hexadecimals for numbers such as 725392"
Same thing: it's easier for a human to make sense of 725392 (which may be a magic number important for the application) than 0xb1190.
This isn't something that would be present in source code.
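To confirm that the obfuscated forms are only a different spelling of the same values, the literals from the question can be compared directly. A small sketch (the variable names are made up for illustration):

```javascript
// \x20 is the hex escape for a space character.
const escaped = 'to\x20bypass\x20this\x20link';
const plain = 'to bypass this link';
const sameString = escaped === plain;

// 0xb1190 is the hexadecimal spelling of 725392.
const sameNumber = 0xb1190 === 725392;

// Converting back and forth by hand:
const asHex = (725392).toString(16);
const asDecimal = parseInt('b1190', 16);
```

The engine sees identical values either way; only the human reader is slowed down.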

Native options for obscuring/encrypting a string?

As an exercise, I've been working on replicating this game. In case it becomes inaccessible, the premise of the game is to take a quote that's been scrambled by swapping pairs of letters (eg replace A with M and vice versa), and unscramble it to its original arrangement.
As I'm studying this game, I realize it's almost trivial to extract the solution from the source - there are any number of breakpoints you can place to access it. I've been trying to come up with a way to obscure the string in a way that it isn't immediately accessible, and the only thing I can think of is some kind of native obscuring function before the quote even has a chance to land in a variable. Something like this:
var litmus, quotes = [
"String One",
"String Two",
....
"String n",
];
litmus = obscureString(quotes[Math.floor(Math.random()*(n-1))]);
This way the user can't summon up the raw quote, or even the random integer that was used - they're gone by the time the breakpoint hits.
My question is this: is there any kind of native function that would fit the role of obscureString() in the above example, even loosely? I'm aware JavaScript doesn't have any native encryption/hash methods, and any libraries that provide that functionality just provide a chance to drop a breakpoint. Thus, I'm hoping someone here can come up with a creative way to natively obscure a string, if it's even possible in JS.

Been crunching on it for a while, and I found a very makeshift solution.
The only native (read: non-user-corruptible) transformation/hash function I was able to find was window.btoa. It does exactly what I need, in letting me obscure a string before the user ever has a chance to get their hands on it. The problem, however, is that it has a counterpart window.atob, whose only purpose is to reverse the process.
To solve that, I was able to neutralize window.atob with the following line of code, essentially making window.btoa a one-way trip:
window.atob = function(f){ return f; };
Don't make a habit of this.
This is horrific practice, and I feel dirty for writing it. It's passable in this case because my application is small, self-contained, and won't ever need to rely on that function elsewhere - but I can't in good conscience recommend this as a general solution. Many browsers won't even let you override native functions in the first place.
Just wanted to post the answer in case someone found themselves in a similar situation needing a similar answer - this may be the closest we can get to a one-way native hash function for now.
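One caveat worth illustrating: Base64 is an encoding rather than a hash, so replacing window.atob does not make the output of btoa one-way. A minimal sketch, assuming standard Base64 with = padding; decodeBase64 is a hypothetical helper that never calls atob:

```javascript
const ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Decodes standard Base64 by hand, six bits per input character.
function decodeBase64(encoded) {
  const clean = encoded.replace(/=+$/, ''); // padding carries no data
  let bits = 0;
  let bitCount = 0;
  let out = '';
  for (const ch of clean) {
    const value = ALPHABET.indexOf(ch);
    if (value === -1) throw new Error('Invalid base64 character: ' + ch);
    bits = (bits << 6) | value;
    bitCount += 6;
    if (bitCount >= 8) {
      bitCount -= 8;
      out += String.fromCharCode((bits >> bitCount) & 0xff);
      bits &= (1 << bitCount) - 1; // keep only the bits not yet emitted
    }
  }
  return out;
}

// "U3RyaW5nIE9uZQ==" is what btoa("String One") produces.
const recovered = decodeBase64('U3RyaW5nIE9uZQ==');
```

Anyone who can read the obscured value at a breakpoint can paste it into a decoder like this, so the approach only deters casual inspection.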

lightweight javascript to javascript parser

How would I go about writing a lightweight javascript to javascript parser. Something simple that can convert some snippets of code.
I would like to basically make the internal scope objects in functions public.
So something like this
var outer = 42;
window.addEventListener('load', function() {
    var inner = 42;
    function magic() {
        var in_magic = inner + outer;
        console.log(in_magic);
    }
    magic();
}, false);
Would compile to
__Scope__.set('outer', 42);
__Scope__.set('console', console);
window.addEventListener('load', constructScopeWrapper(__Scope__, function(__Scope__) {
    __Scope__.set('inner', 42);
    __Scope__.set('magic',constructScopeWrapper(__Scope__, function _magic(__Scope__) {
        __Scope__.set('in_magic', __Scope__.get('inner') + __Scope__.get('outer'));
        __Scope__.get('console').log(__Scope__.get('in_magic'));
    }));
    __Scope__.get('magic')();
}), false);
Demonstration Example
Motivation behind this is to serialize the state of functions and closures and keep them synchronized across different machines (client, server, multiple servers). For this I would need a representation of [[Scope]]
Questions:
Can I do this kind of compiler without writing a full JavaScript -> (slightly different) JavaScript compiler?
How would I go about writing such a compiler?
Can I re-use existing js -> js compilers?
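To make the compiled output above concrete, here is one way the runtime side could look. This is a sketch under assumptions, not the asker's actual code: the Scope class and the body of constructScopeWrapper are hypothetical, set always declares in the current scope (as in the example), and get walks outward through enclosing scopes.

```javascript
class Scope {
  constructor(parent = null) {
    this.parent = parent;
    this.vars = new Map();
  }

  // Declares (or overwrites) a variable in this scope.
  set(name, value) {
    this.vars.set(name, value);
  }

  // Looks a name up here first, then in each enclosing scope.
  get(name) {
    for (let scope = this; scope !== null; scope = scope.parent) {
      if (scope.vars.has(name)) return scope.vars.get(name);
    }
    throw new ReferenceError(name + ' is not defined');
  }

  // Plain-object copy of this scope's own variables, e.g. for serialization.
  snapshot() {
    return Object.fromEntries(this.vars);
  }
}

// Returns a function that gets a fresh child scope on every call.
function constructScopeWrapper(parentScope, fn) {
  return function (...args) {
    const scope = new Scope(parentScope);
    return fn.call(this, scope, ...args);
  };
}

// Usage, mirroring the example without the DOM event:
const globalScope = new Scope();
globalScope.set('outer', 42);

const onLoad = constructScopeWrapper(globalScope, function (scope) {
  scope.set('inner', 42);
  scope.set('magic', constructScopeWrapper(scope, function (scope) {
    scope.set('in_magic', scope.get('inner') + scope.get('outer'));
    return scope.get('in_magic');
  }));
  return scope.get('magic')();
});

const result = onLoad();
```

Assigning to a variable that lives in an outer scope would need a separate operation that finds the owning scope first; this sketch leaves that out.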

I don't think your task is easy or short given that you want to access and restore all the program state. One of the issues is that you might have to capture the program state at any moment during a computation, right? That means the example as shown isn't quite right; that captures state sort of before execution of that code (except that you've precomputed the sum that initializes magic, and that won't happen before the code runs for the original JavaScript). I assume you might want to capture the state at any instant during execution.
The way you've stated your problem, is you want a JavaScript parser in JavaScript.
I assume you are imagining that your existing JavaScript code J, includes such a JavaScript parser and whatever else is necessary to generate your resulting code G, and that when J starts up it feeds copies of itself to G, manufacturing the serialization code S and somehow loading that up.
(I think G is pretty big and hoary if it can handle all of Javascript)
So your JavaScript image contains J, big G, S and does an expensive operation (feed J to G) when it starts up.
What I think might serve you better is a tool G that processes your original JavaScript code J offline, and generates program state/closure serialization code S (to save and restore that state) that can be added to/replace J for execution. J+S are sent to the client, who never sees G or its execution. This decouples the generation of S from the runtime execution of J, saving on client execution time and space.
In this case, you want a tool that will make generation of such code S easiest. A pure JavaScript parser is a start but isn't likely enough; you'll need symbol table support to know which function code is connected to a function call F(...), and which variable definition in which scope corresponds to assignments or accesses to a variable V. You may need to actually modify your original code J to insert points of access where the program state can be captured. You may need flow analysis to find out where some values went. Insisting that all of this be done in JavaScript narrows your range of solutions.
For these tasks, you will likely find a program transformation tool useful. Such tools contain parsers for the language of interest, build ASTs representing the program, enable the construction of identifier-to-definition maps ("symbol tables"), can carry out modifications to the ASTs representing insertion of access points, or synthesis of ASTs representing your demonstration example, and then regenerate valid JavaScript code containing the modified J and the additions S.
Of all the program transformation systems that I know about (which includes all the ones at the Wikipedia site), none are implemented in JavaScript.
Our DMS Software Reengineering Toolkit is such a program transformation system offering all the features I just described. (Yes, it's big and hoary; it has to be to handle the complexities of real computer languages). It has a JavaScript front end that contains a complete JavaScript parser to ASTs, and the machinery to regenerate JavaScript code from modified or synthesized ASTs. (Also big and hoary; good thing that hoary + hoary is still just hoary). Should it be useful, DMS also provides support for building control and dataflow analysis.

If you want something with a simple interface, you could try node-burrito: https://github.com/substack/node-burrito
It generates an AST using the uglify-js parser and then recursively walks the nodes. All you have to do is give a single callback which tests each node. You can alter the ones you need to change, and it outputs the resulting code.

I'd try to look for an existing parser to modify. Perhaps you could adapt JSLint/JSHint?

There is a problem with the rewriting above: you're not hoisting the initialization of magic to the top of the scope.
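The hoisting point matters because a function declaration is usable before the line where it appears, while a value stored at runtime (as a set call would do) is not. A small sketch of the difference, using made-up function names:

```javascript
function withDeclaration() {
  // Allowed: the declaration below is hoisted together with its body.
  const result = magic();
  function magic() {
    return 'called before its declaration';
  }
  return result;
}

function withAssignment() {
  try {
    // `var magic` is hoisted, but its value is still undefined here.
    return magic();
  } catch (error) {
    return error instanceof TypeError;
  }
  var magic = function () {
    return 'never reached';
  };
}

const declarationResult = withDeclaration();
const assignmentFailed = withAssignment();
```

A rewriter that turns function magic() { ... } into a set call at the original position changes behavior for any code that calls magic earlier in the same scope, so the set call would have to be moved to the top.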
There's a number of projects out there that parse JavaScript.
Crock's Pratt parser which works well on JavaScript that fits within "The good parts" and less well on other JS.
The es-lab parser based on ometa which handles the full grammar including a lot of corner cases that Crock's parser misses. It may not perform as well as Crock's.
narcissus parser and evaluator. I don't have much experience with this.
There are also a number of high-quality lexers for JavaScript that let you manipulate JS at the token level. This can be tougher than it sounds though since JavaScript is not lexically regular, and predicting semicolon insertion is difficult without a full parse.
My es5-lexer is a carefully constructed and efficient lexer for EcmaScript 5 that provides the ability to tokenize JavaScript. It is heuristic where JavaScript's grammar is not lexically regular but the heuristic is very good and it provides a means to transform a token stream so that an interpreter is guaranteed to interpret it the way the lexer interpreted the tokens so if you don't trust your input, you can still be sure that the interpretation underlying the security transformations is sound even if not correct according to the spec for some bizarre inputs.

Your problem seems to be in the same family of problems as the one solved by JS obfuscators and JS compressors: like you, they need to be able to parse JS and reformat it into an equivalent script.
There was a good discussion on obfuscators here and the possible solution to your problem could be to leverage the parse and generator part from one of the FOSS versions.
One callout, your example code does not take into account the scopes of the variables you want to set/get and that will eventually become a problem that you will have to solve.
Addition
Given the scope problem for closure-defined functions, you are unlikely to be able to solve this as a static parsing problem, as the scope variables outside the closure will have to be imported/exported to resolve/save and re-instantiate scope. Hence you may need to dig into the evaluation engine itself, and perhaps take the V8 engine and hack the interpreter. That assumes you do not need this to be generic across all script engines and that you can tie it down to a single implementation which you control.

C interpreter written in javascript

Is there any C interpreter written in javascript or java ?
I don't need a full interpreter but I need to be able to do a step by step execution of the program and being able to see the values of variables, the stack...all that in a web interface.
The idea is to help C beginners by showing them the step by step execution of the program.
We are using GWT to build the interface so if something exists in Java we should be able to use it.
I can modify it to suit my needs but if I can avoid to write the parser / abstract-syntax tree walker / stack manipulation... that would be great.
Edit :
To be clear I don't want to simulate the complete C because some programs can be really tricky.
By step I mean a basic operation such as : expression evaluation, affectation, function call.
The C I want to simulate will contains : variables, for, while, functions, arrays, pointers, maths functions.
No goto, string functions, ctypes.h, setjmp.h... (at least for now).
Here is a prototype : http://www.di.ens.fr/~fevrier/war/simu.html
In this example we have manually converted the C code to a javascript representation but it's limited (expressions such as a == 2 || a = 1 are not handled) and is limited to programs manually converted.
We have at our disposal a C compiler on a remote server so we can check if the code is correct (and doesn't have any undefined behavior). The parsing / AST construction can also be done remotely (so any language) but the AST walking needs to be in javascript in order to run on the client side.
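Since the AST can be built remotely and only the walking has to happen in the browser, the client-side part can be sketched as a generator that pauses after each basic operation. Everything here is an assumption for illustration: the node shapes (num, var, bin, assign, while, block) are invented, only a few operators are handled, and a step is taken to mean one assignment.

```javascript
// Evaluates an expression node against the current variables.
function evaluate(node, env) {
  switch (node.type) {
    case 'num':
      return node.value;
    case 'var':
      if (!Object.prototype.hasOwnProperty.call(env, node.name)) {
        throw new Error('Undefined variable: ' + node.name);
      }
      return env[node.name];
    case 'bin': {
      const left = evaluate(node.left, env);
      const right = evaluate(node.right, env);
      switch (node.op) {
        case '+': return left + right;
        case '-': return left - right;
        case '*': return left * right;
        case '<': return left < right ? 1 : 0; // C-style truth values
        default: throw new Error('Unsupported operator: ' + node.op);
      }
    }
    default:
      throw new Error('Not an expression: ' + node.type);
  }
}

// Runs a statement node, yielding a snapshot after every assignment.
function* run(node, env) {
  switch (node.type) {
    case 'block':
      for (const statement of node.body) yield* run(statement, env);
      break;
    case 'assign':
      env[node.name] = evaluate(node.value, env);
      yield { step: 'assign ' + node.name, env: { ...env } };
      break;
    case 'while':
      while (evaluate(node.test, env) !== 0) yield* run(node.body, env);
      break;
    default:
      throw new Error('Not a statement: ' + node.type);
  }
}

const num = (value) => ({ type: 'num', value });
const ref = (name) => ({ type: 'var', name });
const bin = (op, left, right) => ({ type: 'bin', op, left, right });
const assign = (name, value) => ({ type: 'assign', name, value });

// AST for: i = 0; sum = 0; while (i < 3) { sum = sum + i; i = i + 1; }
const program = {
  type: 'block',
  body: [
    assign('i', num(0)),
    assign('sum', num(0)),
    {
      type: 'while',
      test: bin('<', ref('i'), num(3)),
      body: {
        type: 'block',
        body: [
          assign('sum', bin('+', ref('sum'), ref('i'))),
          assign('i', bin('+', ref('i'), num(1))),
        ],
      },
    },
  ],
};

const steps = [...run(program, {})];
```

In a web interface, a "next step" button would call next() on the generator once per click instead of collecting all steps at once, and render the yielded env snapshot. Pointers, arrays, and a call stack would need a memory model on top of this.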

There's a C grammar available for antlr that you can use to generate a C parser in Java, and possibly JavaScript too.

There is Emscripten, which converts LLVM languages to JS. With a little hacking on it, you may be able to produce a C interpreter.

felixh's JSCPP project provides a C++ interpreter in Javascript, though with some limitations.
https://github.com/felixhao28/JSCPP
So an example program might look like this:
var JSCPP = require('JSCPP');
var launcher = JSCPP.launcher;
var code = 'int main(){int a;cin>>a;cout<<a;return 0;}';
var input = '4321';
var exitcode = launcher.run(code, input);
console.info('program exited with code ' + exitcode);
As of March 2015 this is under active development, so while it's usable there are still areas where it may continue to expand. Check the documentation for limitations. It looks like you can use it as a straight C interpreter with limited library support for now with no further issues.

I don't know of any C interpreters written in JavaScript, but here is a discussion of available C interpreters:
Is there an interpreter for C?
You might do better to look for any sort of virtual machine that runs on top of JavaScript, and then see if you can find a C compiler that emits the proper machine code for the VM. A likely one would seem to be LLVM; if you can find a JavaScript VM that can run LLVM, then you will be in great shape.
I did a few Google searches and found Emscripten, which translates C code into JavaScript directly! Perhaps you can do something with this:
https://github.com/kripken/emscripten/wiki
Perhaps you can modify Emscripten to emit a "sequence point" after each compiled line of C, and then you can make your simulated environment single-step from sequence point to sequence point.
I believe Emscripten is implementing LLVM, so it may actually have virtual registers; if so it might be ideal for your purposes.

I know you specified C code, but you might want to consider a JavaScript emulation of a simpler language. In particular, please consider FORTH.
FORTH runs on an extremely simple virtual machine. In FORTH there are two stacks, a data stack and a flow-of-control stack (called the "return" stack); plus some global memory. Originally FORTH was a 16-bit language but there are plenty of 32-bit FORTH implementations out there now.
Because FORTH code is sort of "close to the machine" it is easy to understand how it all works when you see it working. I learned FORTH before I learned C, and I found it to be a valuable learning experience.
There are several FORTH interpreters available in JavaScript already. The FORTH virtual machine is so simple, it doesn't take very long to implement it!
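To give a sense of how little is needed, here is a toy data-stack evaluator. It is a sketch only and not a real FORTH: runForth is a made-up name, there is no return stack, no memory, and no way to define new words; it handles integers plus a handful of built-in words.

```javascript
function runForth(source) {
  const stack = [];
  const output = [];

  const pop = () => {
    if (stack.length === 0) throw new Error('Stack underflow');
    return stack.pop();
  };

  // Built-in words; each one manipulates the data stack.
  const words = new Map([
    ['+', () => { const b = pop(); const a = pop(); stack.push(a + b); }],
    ['-', () => { const b = pop(); const a = pop(); stack.push(a - b); }],
    ['*', () => { const b = pop(); const a = pop(); stack.push(a * b); }],
    ['dup', () => { const a = pop(); stack.push(a, a); }],
    ['swap', () => { const b = pop(); const a = pop(); stack.push(b, a); }],
    ['drop', () => { pop(); }],
    ['.', () => { output.push(pop()); }], // "." prints the top of the stack
  ]);

  const tokens = source.trim().split(/\s+/).filter(Boolean);
  for (const token of tokens) {
    const word = words.get(token.toLowerCase());
    if (word) {
      word();
    } else if (/^-?\d+$/.test(token)) {
      stack.push(Number(token));
    } else {
      throw new Error('Unknown word: ' + token);
    }
  }
  return { stack, output };
}

// (2 + 3) squared, then print it.
const forthResult = runForth('2 3 + dup * .');
```

Showing the stack array after each token is all a teaching interface would need to visualize execution.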
You could even then get a C-to-FORTH translator and let the students watch the FORTH virtual machine interpret compiled C code.
I consider this answer a long shot for you, so I'll stop writing here. If you are in fact interested in the idea, comment below and ask for more details and I will be happy to share them. It's been a long time since I wrote any FORTH code but I still remember it fondly, and I'd be happy to talk more about FORTH.
EDIT: Despite this answer being downvoted to a negative score, I am going to leave it up here. A simulation for educational purposes is IMHO more valuable if the simulation is simple and easy to understand. The simple stack-based virtual machine for FORTH is very simple, yet you could compile C code to run on it. (In the 80's there was even a CPU chip made that had FORTH instructions as its native machine code.) And, as I said, I personally studied FORTH when I was a complete beginner and it helped me to understand assembly language and C.
The question has no accepted answer, now over two years after it was asked. It could be that Loïc Février didn't find any suitable JavaScript interpreter. As I said, there already exist several JavaScript interpreters for the FORTH virtual machine. Therefore, this answer is a practical one.

C is a compiled language, not an interpreted language, and has features like pointers which JS completely doesn't support, so interpreting C in Javascript doesn't really make sense

Syntax / Logical checker In Javascript?

I'm building a solution for a client which allows them to create very basic code,
now i've done some basic syntax validation but I'm stuck at variable verification.
I know JSLint does this using Javascript and i was wondering if anyone knew of a good way to do this.
So for example say the user wrote the code
moose = "barry"
base = 0
if(moose == "barry"){base += 100}
Then i'm trying to find a way to clarify that the "if" expression is in the correct syntax, if the variable moose has been initialized etc etc
but I want to do this without scanning character by character,
the code is a mini language built just for this application so is very very basic and doesn't need to manage memory or anything like that.
I had thought about splitting first by Carriage Return and then by Space but there is nothing to say the user won't write something like moose="barry" or if(moose=="barry")
and there is nothing to say the user won't keep the result of a condition inline.
Obviously compilers and interpreters do this on a much more extensive scale but i'm not sure if they do do it character by character and if they do how have they optimized?
(The other option is that I could send it back to PHP to process, which would then relieve the browser of responsibility)
Any suggestions?
Thanks
The use case is limited, the syntax will never be extended in this case, the language is a simple scripted language to enable the client to create a unique cost based on their users input the end result will be processed by PHP regardless to ensure the calculation can't be adjusted by the end user and to ensure there is some consistency.
So for example, say there is a base cost of £1.00
and there is a field on the form called "Additional Cost", the language will allow them manipulate the base cost relative to the "additional cost" field.
So
base = 1;
if(additional > 100 && additional < 150){base += 50}
elseif(additional == 150){base *= 150}
else{base += additional;}
This is a basic example of how the language would be used.
Thank you for all your answers,
I've investigated a parser, and creating one would be far more complex than is required.
Having run several tests with thousands of lines of code, I found that character-by-character processing takes only a few seconds, even on a single-core P4 with 512 MB of memory (which is far less than the customer uses).
I've decided to build a PHP based syntax checker which will check the information and convert the variables etc into valid PHP code whilst it's checking it (so that it's ready to be called later without recompilation) using this instead of javascript this seems more appropriate and will allow for more complex code to arise without hindering the validation process
It's only taken an hour and I have code which is able to check the validity of an if statement and isn't confused by nested if's, spaces or odd expressions, there is very little left to be checked whereas a parser and full blown scripting language would have taken a lot longer
You've all given me a lot to think about and i've rated relevant answers thank you

If you really want to do this — and by that I mean if you really want your software to work properly and predictably, without a bunch of weird "don't do this" special cases — you're going to have to write a real parser for your language. Once you have that, you can transform any program in your language into a data structure. With that data structure you'll be able to conduct all sorts of analyses of the code, including procedures that at least used to be called use-definition and definition-use chain analysis.
If you concoct a "programming language" that enables some scripting in an application, then no matter how trivial you think it is, somebody will eventually write a shockingly large program with it.
I don't know of any readily-available parser generators that generate JavaScript parsers. Recursive descent parsers are not too hard to write, but they can get ugly to maintain and they make it a little difficult to extend the syntax (esp. if you're not very experienced crafting the original version).
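The first stage of a real parser, turning text into tokens, already removes the whitespace worry from the question: moose="barry" and moose = "barry" produce the same tokens. A minimal sketch using a sticky regular expression; the token categories and the operator set are guesses based on the examples in the question, not a definition of the actual language.

```javascript
function tokenize(source) {
  // Groups: 1 = number, 2 = string contents, 3 = name, 4 = operator/punctuation.
  const pattern = /\s*(?:(\d+(?:\.\d+)?)|"([^"]*)"|([A-Za-z_]\w*)|(==|\+=|-=|\*=|&&|\|\||[=<>(){};+\-*\/]))/y;
  const tokens = [];
  const end = source.trimEnd().length;
  let pos = 0;
  while (pos < end) {
    pattern.lastIndex = pos; // sticky flag: match must start exactly here
    const match = pattern.exec(source);
    if (!match) throw new SyntaxError('Unexpected character at position ' + pos);
    if (match[1] !== undefined) tokens.push({ type: 'number', value: Number(match[1]) });
    else if (match[2] !== undefined) tokens.push({ type: 'string', value: match[2] });
    else if (match[3] !== undefined) tokens.push({ type: 'name', value: match[3] });
    else tokens.push({ type: 'op', value: match[4] });
    pos = pattern.lastIndex;
  }
  return tokens;
}

const compact = tokenize('if(moose=="barry"){base+=100}');
const spaced = tokenize('if ( moose == "barry" ) { base += 100 }');
```

Both calls return the same eleven tokens, so later stages never see whitespace at all. Keywords such as if come out as name tokens here; a parser would tell them apart from variables.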

You might want to look at JS/CC, which is a parser generator that generates a parser for a grammar, in JavaScript. You will need to figure out how to describe your language using BNF or EBNF. Also, JS/CC has its own syntax (which is somewhat close to actual BNF/EBNF) for specifying the grammar. Given the grammar, JS/CC will generate a parser for it.
Your other option, as Pointy said, is to write your own lexer and recursive-descent parser from scratch. Once you have a BNF/EBNF, it's not that hard. I recently wrote a parser from an EBNF in Javascript (the grammar was pretty simple so it wasn't that hard to write one YMMV).
To address your comments about it being "client specific". I will also add my own experience here. If you're providing a scripting language and a scripting environment, there is no better route than an actual parser.
Handling special cases through a bunch of if-elses is going to be horribly painful and a maintenance nightmare. When I was a freshman in college, I tried to write my own language. This was before I knew anything about recursive-descent parsers, or just parsers in general. I figured out by myself that code can be broken down into tokens. From there, I wrote an extremely unwieldy parser using a bunch of if-elses, and also splitting the tokens by spaces and other characters (exactly what you described). The end result was terrible.
Once I read about recursive-descent parsers, I wrote a grammar for my language and easily created a parser in a tenth of the time it took me to write my original parser. Seriously, if you want to save yourself a lot of pain, write an actual parser. If you go down your current route, you're going to be fixing issues forever. You're going to have to handle cases where people put the space in the wrong place, or perhaps they have one too many (or one too few) spaces. The only other alternative is to provide an extremely rigid structure (i.e., you must have exactly x number of spaces following this statement), which is liable to make your scripting environment extremely unattractive. An actual parser will automatically fix all these problems.
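As a rough illustration of how small such a parser can be for the language in the question, here is a recursive-descent checker that validates syntax and reports variables used before assignment. The grammar is inferred from the question's examples (assignments with =, += and *=, if/elseif/else blocks, comparisons, && and ||), and checkProgram, TOKEN and the predefined parameter are hypothetical names. Two simplifications to be aware of: characters the token pattern does not recognise are silently skipped, and a variable assigned inside any branch counts as defined afterwards.

```javascript
const TOKEN = /\d+(?:\.\d+)?|"[^"]*"|[A-Za-z_]\w*|==|\+=|\*=|&&|\|\||[=<>(){};+\-*\/]/g;

function checkProgram(source, predefined = []) {
  const tokens = source.match(TOKEN) || [];
  const defined = new Set(predefined);
  let pos = 0;

  const peek = () => tokens[pos];
  const next = () => tokens[pos++];
  const expect = (value) => {
    if (next() !== value) throw new SyntaxError(`Expected "${value}" near token ${pos}`);
  };

  function factor() {
    const token = next();
    if (token === undefined) throw new SyntaxError('Unexpected end of input');
    if (token === '(') { condition(); expect(')'); return; }
    if (/^\d/.test(token) || token.startsWith('"')) return; // number or string literal
    if (/^[A-Za-z_]/.test(token)) {
      if (!defined.has(token)) throw new ReferenceError(`"${token}" is used before it is assigned`);
      return;
    }
    throw new SyntaxError(`Unexpected token "${token}"`);
  }

  // One precedence level: operand (operator operand)*
  const level = (operators, operand) => () => {
    operand();
    while (operators.includes(peek())) { next(); operand(); }
  };
  const term = level(['*', '/'], factor);
  const expr = level(['+', '-'], term);
  const comparison = level(['==', '<', '>'], expr);
  const condition = level(['&&', '||'], comparison);

  function block() {
    expect('{');
    while (peek() !== '}') {
      if (peek() === undefined) throw new SyntaxError('Missing "}"');
      statement();
    }
    next(); // consume "}"
  }

  function statement() {
    const token = next();
    if (token === 'if') {
      expect('('); condition(); expect(')'); block();
      while (peek() === 'elseif') { next(); expect('('); condition(); expect(')'); block(); }
      if (peek() === 'else') { next(); block(); }
      return;
    }
    if (!/^[A-Za-z_]/.test(token)) throw new SyntaxError(`Unexpected token "${token}"`);
    const operator = next();
    if (!['=', '+=', '*='].includes(operator)) throw new SyntaxError(`Expected assignment after "${token}"`);
    if (operator !== '=' && !defined.has(token)) throw new ReferenceError(`"${token}" is used before it is assigned`);
    expr();
    defined.add(token);
    if (peek() === ';') next(); // semicolons are optional
  }

  while (pos < tokens.length) statement();
  return [...defined];
}

const costScript = [
  'base = 1;',
  'if(additional > 100 && additional < 150){base += 50}',
  'elseif(additional == 150){base *= 150}',
  'else{base += additional;}',
].join('\n');

const definedNames = checkProgram(costScript, ['additional']);
```

Each grammar rule maps to one small function, which is what keeps this style easy to extend: a new statement type is a new branch in statement, and a new operator is one more entry in a level list.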

Javascript has a function 'eval'.
var code = 'alert(1);';
eval(code);
It will show alert. You can use 'eval' to execute basic code.
