Static Parsing: Tell if Two Javascript Functions are the Same

I am looking for a way, using static analysis of two JavaScript functions, to tell if they are the same. Let me define multiple definitions of "the same".
Level 1: The functions are the same except for possible different whitespace, e.g. TABS, CR, LF and SPACES.
Level 2: The functions may have different whitespace, as in Level 1, but may also have different variable names.
Level 3: ???
For level one, I think I could just remove all (non-literal, which may be tough) whitespace from each string containing the two JS function definitions, and then compare the strings.
For level two, I think I would need to use something like SpiderMonkey's parser to generate two parse trees, and then write a comparer which walks the trees and allows variables to have different names.
[Edit] Williham makes a good point below. I do mean identical. Now, I'm looking for some practical strategies, particularly with regard to using parse trees.

Reedit:
To expound on my suggestion for determining identical functions, the following flow can be suggested:
Level 1: Remove any whitespace that is not part of a string literal, insert newlines after each {, ; and }, and compare. If equal, the functions are identical; if not:
Level 2: Move all variable declarations and assignments that don't depend on the state of other variables defined in the same scope to the start of the scope they are declared in (or, if you don't want to actually parse the JS, to the start of the enclosing braces), and order them by line length, treating all variable names as being 4 characters long and falling back to alphabetical ordering (ignoring variable names) in case of tied lengths. Reorder all collections in alphabetical order, and rename all variables vSNN, where v is literal, S is the number of nested braces and NN is the order in which the variable was encountered.
Compare; if equal, the functions are identical; if not:
Level 3: Replace all string literals with "sNN", where " and s are literal, and NN is the order in which the string was encountered. Compare; if equal, the functions are identical; if not:
Level 4: Normalize the names of any functions known to be the same by using the name of the function with the highest priority according to alphabetical order (in the example below, any calls to p_strlen() would be replaced with c_strlen()). Repeat re-orderings as per level 1 if necessary. Compare; if equal, the functions are identical; if not, the functions are almost certainly not identical.
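As a rough illustration of Level 1 only, here is a minimal sketch that drops whitespace outside string literals and compares the results. The function names are made up for this example. It assumes the source contains no comments, regex literals or template interpolations, and it skips the newline-insertion step, which only matters for readable diffs. Note that removing all whitespace also glues tokens together (return x becomes returnx), so a real tokenizer would be safer.

```javascript
// Hypothetical helper: remove whitespace that is not inside a string literal.
function stripWhitespaceOutsideStrings(source) {
  let result = "";
  let quote = null; // quote character of the literal we are inside, or null
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (quote !== null) {
      result += ch;
      if (ch === "\\" && i + 1 < source.length) {
        result += source[++i]; // copy the escaped character untouched
      } else if (ch === quote) {
        quote = null; // end of the string literal
      }
    } else if (ch === '"' || ch === "'" || ch === "`") {
      quote = ch;
      result += ch;
    } else if (!/\s/.test(ch)) {
      result += ch;
    }
  }
  return result;
}

// Level 1 comparison: identical once non-literal whitespace is ignored.
function identicalAtLevel1(sourceA, sourceB) {
  return stripWhitespaceOutsideStrings(sourceA) === stripWhitespaceOutsideStrings(sourceB);
}
```

Whitespace inside a string literal still counts, so two functions that differ only in the spacing of a literal are reported as different.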
Original answer:
I think you'll find that you mean "identical", not "the same".
The difference, as you'll find, is critical:
Two functions are identical if, following some manner of normalization, (removing non-literal whitespace, renaming and reordering variables to a normalized order, replacing string literals with placeholders, …) they compare to literally equal.
Two functions are the same if, when called for any given input value they give the same return value. Consider, in the general case, a programming language which has counted, zero-terminated strings (hybrid Pascal/C strings, if you will). A function p_strlen(str) might look at the character count of the string and return that. A function c_strlen(str) might count the number of characters in the string and return that.
While these functions certainly won't be identical, they will be the same: For any given (valid) input value they will give the same value.
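To make the distinction concrete, here is a small sketch in JavaScript. The counted, zero-terminated string is modelled by a made-up object shape (a stored count plus an array of characters ending in "\0"); makeString, pStrlen and cStrlen are names invented for this illustration.

```javascript
// Hypothetical hybrid Pascal/C string: stored length plus a "\0" terminator.
function makeString(text) {
  return { count: text.length, chars: text.split("").concat("\0") };
}

// Pascal style: trust the stored character count.
function pStrlen(str) {
  return str.count;
}

// C style: walk the characters until the terminator is found.
function cStrlen(str) {
  let length = 0;
  while (str.chars[length] !== "\0") {
    length++;
  }
  return length;
}
```

For any valid input (one that does not itself contain "\0") both functions return the same number, yet no amount of whitespace or renaming normalization will make their source text equal.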
My point is:
Determining whether two functions are identical (what you seem to want to achieve) is a (moderately) trivial problem, done as you describe.
Determining whether two functions are truly the same (what you might actually want to achieve) is non-trivial; in fact, it's downright Hard, probably related to the Halting Problem, and not something that can be done with static analysis.
Edit: Of course, functions that are identical are also the same; but in a highly specific and rarely useful way for complete analysis.

Your approach for level 1 seems reasonable.
For level 2, how about doing some rudimentary variable substitution on each function and then applying the approach for level 1? Start at the beginning and rename each variable declaration you encounter to var1, var2, ... varX.
This does not help if the functions declare variables in different orders... var i and var j may be used the same way in both functions but are declared in different orders. Then you are probably left doing a comparison of parse trees like you mention.
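A rough sketch of that substitution, using regular expressions rather than a parser. The function name is invented here, and the sketch assumes one variable per var/let/const keyword; it ignores function parameters and will also touch matching words inside string literals.

```javascript
// Hypothetical helper: rename declared variables to var1, var2, ... in declaration order.
function renameDeclaredVariables(source) {
  const renamed = new Map();
  const declaration = /\b(?:var|let|const)\s+([A-Za-z_$][\w$]*)/g;
  for (const match of source.matchAll(declaration)) {
    if (!renamed.has(match[1])) {
      renamed.set(match[1], "var" + (renamed.size + 1));
    }
  }
  // Replace in a single pass so a new name is never renamed a second time.
  // The lookbehind skips property names (after a dot) and the tail of numbers like 1e5.
  const identifier = /(?<![\w$.])[A-Za-z_$][\w$]*/g;
  return source.replace(identifier, (name) => (renamed.has(name) ? renamed.get(name) : name));
}
```

Two functions that differ only in variable names then normalize to the same text, while the declaration-order problem mentioned above shows up as a mismatch: var i; var j; use(i, j) and var j; var i; use(i, j) normalize differently.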

See my company's (Semantic Designs) Smart Differencer tool. This family of tools parses source code according to compiler-level-detail grammar for the language of interest (in your case, JavaScript), builds ASTs, and then compares the ASTs (which effectively ignores whitespace and comments). Literal values are normalized, so it doesn't matter how they are "spelled"; 10E17 has the same normalized value as 1E18.
If the two trees are the same, it will tell you "no differences". If they differ by a consistent renaming of an identifier, the tool will tell you the consistent renaming and the block in which it occurs. Other differences are reported as language-element (identifier, statement, block, class,...) insertions, deletions, copies, or moves. The goal is to report the small set of deltas that plausibly explain the differences. You can see examples for a number of languages at the web site.
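To give a feel for what "the same up to a consistent renaming" means on trees, here is a minimal sketch. It is not how the Smart Differencer works internally; it simply walks two hand-built, ESTree-like object trees and keeps a two-way map of identifier names. The function name and the node shapes are assumptions for illustration, and every Identifier node is treated as renameable, which a real tool would restrict to local variables.

```javascript
// Hypothetical comparer: true if the trees match under a one-to-one identifier renaming.
function sameUpToRenaming(a, b, forward = new Map(), backward = new Map()) {
  if (Array.isArray(a) || Array.isArray(b)) {
    return Array.isArray(a) && Array.isArray(b) && a.length === b.length &&
      a.every((child, i) => sameUpToRenaming(child, b[i], forward, backward));
  }
  if (a === null || b === null || typeof a !== "object" || typeof b !== "object") {
    return a === b; // literals, operators and other leaf values must match exactly
  }
  if (a.type !== b.type) return false;
  if (a.type === "Identifier") {
    if (!forward.has(a.name) && !backward.has(b.name)) {
      forward.set(a.name, b.name); // first sighting: record the pairing both ways
      backward.set(b.name, a.name);
      return true;
    }
    return forward.get(a.name) === b.name && backward.get(b.name) === a.name;
  }
  const keysA = Object.keys(a).sort();
  const keysB = Object.keys(b).sort();
  return keysA.length === keysB.length &&
    keysA.every((key, i) => key === keysB[i] && sameUpToRenaming(a[key], b[key], forward, backward));
}

// Tiny hand-built trees standing in for parser output.
const id = (name) => ({ type: "Identifier", name });
const plus = (left, right) => ({ type: "BinaryExpression", operator: "+", left, right });
```

With these helpers, x + y matches a + b, but x + y does not match a + a, because y would have to map to a name already taken by x.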
You can't in practice go much beyond this; to determine if two functions compute the same answer, in principle you have to solve the halting problem. You might be able to detect where two language elements that are elements of a commutative list, can be commuted without effect; we're working on this. You might be able to apply normalization rewrites to canonicalize certain forms (e.g., map all multiple declarations into a sequence of lexically sorted single declarations). You might be able to convert the source code into its equivalent set of dataflows, and do a graph isomorphism match (the Programmer's Apprentice from MIT back in the 1980's proposed to do this, but I don't think they ever got there).
All of these are more work to do than you might expect.

Related

Is there a way to match only top level parentheses with regex?

With Javascript, suppose I have a string like (1)(((2)(3))4), can I get a regex to match just (1) and (((2)(3))4), or do I need to do something more complicated?
Ideally the regex would return ["((2)(3))","4"] if you searched ((2)(3))4. Actually that's really a requirement. The point is to group things into the chunks that need to be worked on first, like the way parentheses work in math.

No, there is no way to match only top level parentheses with regex
Looking only at the top level doesn't make the problem easier than general "parsing" of recursive structures. (See this relevant popular SO question with a great answer).
Here's a simple intuitive reason why Regex can't parse arbitrary levels of nesting:
To keep track of the level of nesting, one must count. If one wants to be able to keep track of an arbitrary level of nesting, one needs an arbitrarily large number while running the program.
But regular expressions are exactly those that can be implemented by DFAs, that is, deterministic finite automata. These have only a finite number of states. Thus they can't keep track of an arbitrarily large number.
This argument works also for your specific concern of being only interested in the top level parentheses.
To recognize the top level parentheses, you must keep track of arbitrary nesting preceding any one of them:
((((..arbitrarily deep nesting...))))((.....)).......()......
^toplevel ^^ ^ ^^
So yes, you need something more powerful than regex.
While if you are very pragmatic, for your concrete application it might be okay to say that you won't encounter any nesting deeper than, say, 1000 (and so you might be willing to go with regex), it's also a very practical fact that any regex recognizing a nesting level of more than 2 is basically unreadable.
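The "something more powerful" does not have to be much: a single counter is enough. Here is a minimal sketch (splitTopLevel is a name invented for this example) that scans the string once, tracks the nesting depth, and cuts a chunk every time the depth returns to zero. Text outside any parentheses becomes its own chunk.

```javascript
// Hypothetical helper: split a string into top-level parenthesized groups and the text between them.
function splitTopLevel(text) {
  const chunks = [];
  let depth = 0;
  let start = 0; // index where the current chunk begins
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch === "(") {
      if (depth === 0) {
        if (i > start) chunks.push(text.slice(start, i)); // plain text before the group
        start = i;
      }
      depth++;
    } else if (ch === ")") {
      depth--;
      if (depth < 0) throw new SyntaxError("Unbalanced ')' at index " + i);
      if (depth === 0) {
        chunks.push(text.slice(start, i + 1)); // a complete top-level group
        start = i + 1;
      }
    }
  }
  if (depth !== 0) throw new SyntaxError("Unbalanced '('");
  if (start < text.length) chunks.push(text.slice(start));
  return chunks;
}

console.log(splitTopLevel("(1)(((2)(3))4)")); // [ '(1)', '(((2)(3))4)' ]
console.log(splitTopLevel("((2)(3))4"));      // [ '((2)(3))', '4' ]
```

To work from the outside in, strip the outer parentheses from a chunk and call the function again on what is left.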

Well, here is one way to do it. As Jo So pointed out, you can't really do it in javascript with indefinite amounts of recursion, but you can make something arbitrarily recursive pretty easily. I'm not sure how the performance scales though.
First I figured out that you need recursion. Then I realized that you can just make your regex 'recursive' by just copying and pasting recursively, like so (using curly braces for clarity):
Starting regex
Finds stuff in brackets that isn't itself brackets.
/{([^{}])*}/g
Then copy and paste the whole regex inside itself! (I spaced it out so you can see where it was pasted in.) So now it is basically like a( x | a( x )b )b
/{([^{}] | {([^{}])*} )*}/g
That will get you one level of recursion and you can continue ad nauseam in this fashion and actually double the amount of recursions each time:
//matches {4{3{2{1}}}}
/{([^{}]|{([^{}]|{([^{}]|{([^{}])*})*})*})*}/g
//matches {8{7{6{5{4{3{2{1}}}}}}}}
/{([^{}]|{([^{}]|{([^{}]|{([^{}]|{([^{}]|{([^{}]|{([^{}]|{([^{}])*})*})*})*})*})*})*})*}/g
Finally I just add |[^{}]+ on the end of the expression to match stuff that is completely outside of brackets. Crazy, but it works for my needs. I feel like there is probably some clever way to combine this concept with a recursive function in order to get a truly recursive matcher, but I can't think of it now.
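The copy-and-paste step can be left to a loop. The sketch below builds the pattern text for a chosen number of levels by wrapping the previous pattern one level at a time. The name nestedBracePattern is invented for this example, and two details differ from the hand-written versions above: the braces are escaped and the groups are non-capturing. It is still a finite-depth matcher, not true recursion.

```javascript
// Hypothetical helper: pattern source matching braces nested up to `levels` deep.
function nestedBracePattern(levels) {
  let pattern = "\\{[^{}]*\\}"; // one level: braces containing no braces
  for (let level = 1; level < levels; level++) {
    // Allow either a non-brace character or a complete group of the previous depth.
    pattern = "\\{(?:[^{}]|" + pattern + ")*\\}";
  }
  return pattern;
}

const fourLevels = new RegExp(nestedBracePattern(4), "g");
console.log("{4{3{2{1}}}}".match(fourLevels)); // [ '{4{3{2{1}}}}' ]
```

The pattern text grows by a fixed amount for each level added, so even a depth of a few dozen stays manageable to generate, if not to read.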

If you can be sure that the parentheses are balanced (I'm sure there are other resources out there that can answer that question for you if required) and if by "top-level" you're happy to find local as well as global maxima then all you need to do is find any content that starts with an open bracket and closes with a close-bracket, with no intermediate open-bracket between the two:
I think the following should do that for you and helpfully group any "top-level" content:
\(([^\(]*?)\)
That content may not all be at the same "level", but if you think of the nested brackets as describing the branching of a tree, the regex will return to you the leaves. If you pre-process your text to be wrapped in parentheses to start with, and the earlier assumptions are met, you can guarantee always getting at least one "leaf".

regex for matching finite-depth nested strings -- slow, crashy behavior

I was writing some regexes in my text editor (Sublime) today in an attempt to quickly find specific segments of source code, and it required getting a little creative because sometimes the function call might contain more function calls. For example I was looking for jQuery selectors:
$("div[class='should_be_using_dot_notation']");
$(escapeJQSelector("[name='crazy{"+getName(object)+"}']"));
I don't consider it unreasonable to expect one of my favorite powertools (regex) to help me do this sort of searching, but it's clear that the expression required to parse the second bit of code there will be somewhat complex as there are two levels of nested parens.
I am sufficiently versed in the theory to know that this sort of parsing is exactly what a context-free grammar parser is for, and that building out a regex is likely to suck up more memory and time (perhaps in an exponential rather than O(n^3) fashion). However I am not expecting to see that sort of feature available in my text editor or web browser any time soon, and I just wanted to squeak by with a big nasty regex.
Starting from this (This matches zero levels of nested parens, and no trivial empty ones):
\$\([^)(]+?\)
Here's what the one-level nested parens one I came up with looks like:
\$\(((\([^)(]*\))|[^)(])+?\)
Breaking it down:
\$\( begin text
( groups the contents of the $() call
(\( groups a level 1 nested pair of parens
[^)(]* only accept a valid pair of parens (it shall contain anything but parens)
\)) close level 1 nesting
| contents also can be
[^)(] anything else that also is not made of parens
)+? not sure if this should be plus or star or if can be greedy (the contents are made up of either a level 1 paren group or any other character)
\) end
This worked great! But I need one more level of nesting.
I started typing up the two-level nested expression in my editor and it began to pause for 2-3 seconds at a time when I put in *'s.
So I gave up on that and moved to regextester.com, and before very long at all, the entire browser tab was frozen.
My question is two-fold.
What's a good way of constructing an arbitrary-level regex? Is this something that only human pattern-recognition can ever hope to achieve? It seems to me that I can get a good deal of intuition for how to go about making the regex capable of matching two levels of nesting based on the similarities between the first two. I think this could just be distilled down into a few "guidelines".
Why does regex parsing on non-enormous regexes block or freeze for so long?
I understand the O(n) linear time is for n where n is length of input to run the regex over (i.e. my test strings). But in a system where it recompiles the regex each time I type a new character into it, what would cause it to freeze up? Is this necessarily a bug in the regex code (I hope not, I thought the Javascript regex impl was pretty solid)? Part of my reasoning moving to a different regex tester from my editor was that I'd no longer be running it (on each keypress) over all ~2000 lines of source code, but it did not prevent the whole environment from locking up as I edited my regex. It would make sense if each character changed in the regex would correspond to some simple transformation in the DFA that represents that expression. But this appears not to be the case. If there are certain exponential time or space consequences to adding a star in a regex, it could explain this super-slow-to-update behavior.
Meanwhile I'll just go work out the next higher nested regexes by hand and copy them into the fields once I'm ready to test them...

Um. Okay, so nobody wants to write the answer, but basically the answer here is
Backtracking
It can cause exponential runtime when you do certain non-greedy things.
The answer to the first part of my question:
The two-nested expression is as follows:
\$\(((\(((\([^)(]*\))|[^)(])*\))|[^)(])*\)
The transformation to make the next nested expression is to replace instances of [^)(]* with ((\([^)(]*\))|[^)(])*, or, as a meta-regex (where the replace-with section does not need escaping):
s/\[^\)\(\]\*/((\([^)(]*\))|[^)(])*/
This is conceptually straightforward: In the expression matching N levels of nesting, if we replace the part that forbids more nesting with something that matches one more level of nesting then we get the expression for N+1 levels of nesting!
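The same transformation can be applied in code instead of by hand. This is a small sketch with invented function names; it uses split/join on the literal text rather than a regex replace so nothing needs escaping, and it starts from the zero-nesting pattern with * in place of the original +?.

```javascript
const NO_NESTING = String.raw`[^)(]*`;                 // the part that forbids more nesting
const ONE_MORE_LEVEL = String.raw`((\([^)(]*\))|[^)(])*`; // the part that allows one more level

// Hypothetical helper: turn the N-level pattern text into the N+1-level pattern text.
function addNestingLevel(pattern) {
  return pattern.split(NO_NESTING).join(ONE_MORE_LEVEL);
}

// Hypothetical helper: pattern text for a $() call containing up to `levels` nested paren pairs.
function selectorCallPattern(levels) {
  let pattern = String.raw`\$\([^)(]*\)`;
  for (let i = 0; i < levels; i++) {
    pattern = addNestingLevel(pattern);
  }
  return pattern;
}

const twoLevels = new RegExp(selectorCallPattern(2));
const line = `$(escapeJQSelector("[name='crazy{"+getName(object)+"}']"));`;
console.log(twoLevels.test(line)); // true
```

selectorCallPattern(2) produces exactly the two-nested expression shown above, and each further call to addNestingLevel adds one more level.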

To match an arbitrary number of nested (), with only one pair on each level of nesting, you could use the following, changing 2 to whatever number of nested () you require
/(?:\([^)(]*){2}(?:[^)(]*\)){2}/
To avoid excessive backtracking you want to avoid using nested quantifiers, particularly when the sub-pattern on both sides of an inner alternation is capable of matching the same substring.

Use of letters for doing matrix math in Javascript

I'm doing a course in Quantum Computation. In it, we represent possible actions, or operators, by matrices. I've been looking into creating a webpage for solving these maths problems.
It is also a small challenge for myself in order to freshen up my JS.
I've been looking at various options, like Sylvester, MathJax and MathML.
Problem: However, none of the above appear to give functionality for using letters throughout my computation.
For instance, in Quantum Computation we often multiply a matrix containing unknowns alpha and beta with other matrices.
This is the sort of math I need to do:
http://i.stack.imgur.com/vH9Dk.gif
Ideally, I'd write this in the style of:
M=[[a],[b]], which of course, I cannot. Further, I'd be able to multiply to get "2*a" etc.
Any suggestions?

As suggested in the comments on the question, you could use strings. Then you just have to write your own matrix-matrix multiplication routine which will understand the difference between an entry containing a string and an entry containing a number.
However, as soon as you do more than one of these, you'll end up with expressions as well as variables and numbers. So we can generalise this to make every element be an expression. This is the beginnings of a symbolic algebra system as @High Performance Mark pointed out.
In javascript, I would guess that you want a set of expression objects, each implementing an interface including a method that returns whether the expression is determined or not yet. The gnarly bit is simplifying the resulting expressions to resolve the values of the variables.
Alternatively, do a bit more maths beforehand; move the variables out of the equations, and then let the code do the calculation.
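Here is a minimal sketch of the string-based idea, assuming every matrix entry is either a number or a string holding a symbol or expression. All function names are made up for this example. It only folds away multiplications by 0 and 1 and additions of 0; it does no real algebraic simplification, which is the gnarly part mentioned above.

```javascript
const isNumber = (term) => typeof term === "number";
// Parenthesize sums so that 2 times "a + b" becomes "2*(a + b)".
const wrap = (term) => (typeof term === "string" && term.includes("+") ? "(" + term + ")" : String(term));

function mulTerms(a, b) {
  if (a === 0 || b === 0) return 0;
  if (a === 1) return b;
  if (b === 1) return a;
  if (isNumber(a) && isNumber(b)) return a * b;
  return wrap(a) + "*" + wrap(b);
}

function addTerms(a, b) {
  if (a === 0) return b;
  if (b === 0) return a;
  if (isNumber(a) && isNumber(b)) return a + b;
  return a + " + " + b;
}

// Standard row-by-column product, using the symbolic add and multiply above.
function multiplyMatrices(A, B) {
  return A.map((row) =>
    B[0].map((_, col) =>
      row.reduce((sum, entry, k) => addTerms(sum, mulTerms(entry, B[k][col])), 0)));
}

console.log(multiplyMatrices([[2, 0], [0, 2]], [["a"], ["b"]])); // [ [ '2*a' ], [ '2*b' ] ]
```

The column vector is written as [["a"], ["b"]], which is as close as plain JavaScript gets to the M=[[a],[b]] notation from the question.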

JS Object navigation-- when to use object.sub and object["sub"]?

Looking at this from a point of semantics, there's two ways to navigate objects in JS, you can either use the dot operator or work through them like it's a nested array.
When is it appropriate to use one operator over the other? Why shouldn't I just use the method with square brackets all the time (it seems more powerful)? They both seem easy to read, but the dot operator looks like it's limited in that it cannot be provided with variable names of nested objects, or work through arrays.

The main reasons for using []
You can't access properties with the names of keywords via the dot notation (at least not in < ES5)
You can use it to access properties given by a string
The reasons for using .
Always using [] makes syntax highlighting, code completion etc. pretty much useless
Even with automatic insertion of the closing counterparts, [' is a hell of a lot slower to type (especially on non-US keyboard layouts)
I mean, just imagine writing Chat['users']['getList']()['sort']()['each'].
Rule of thumb: Use . wherever possible, and fall back to [] when there's no other way.
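A short illustration of that rule of thumb. The object and the getPath helper are invented for this example.

```javascript
const chat = { users: { count: 3, "display-name": "Ann" } };

// Dot notation: the property name is a fixed, valid identifier.
const viaDot = chat.users.count;

// Brackets: the property name is only known at run time...
const key = "count";
const viaBrackets = chat["users"][key];

// ...or is not a valid identifier at all.
const oddName = chat.users["display-name"];

// Hypothetical helper: follow a list of keys through nested objects.
function getPath(object, keys) {
  return keys.reduce((current, k) => current[k], object);
}

console.log(viaDot === viaBrackets);           // true
console.log(getPath(chat, ["users", "count"])); // 3
```

Both forms reach the same property; brackets are only needed for the two cases the dot cannot express.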

Creating a Basic Formula Editor in JavaScript

I'm working on creating a basic RPG game engine prototype using JavaScript and canvas. I'm still working out some design specs on paper, and I've hit a bit of a problem I'm not quite sure how to tackle.
I will have a Character object that will have an array of Attribute objects. Attributes will look something like this:
function Attribute(name, value) {
    this.name = name;
    this.value = value;
    // ...
}
A Character will also have "skills" that are calculated off attributes. A skills value can also be determined by a formula entered by the user. A legit formula would look something like this:
((#attribute1Name + (#attribute2Name / 2) * 5)
where any text following the # sign represents the name of an attribute belonging to that character. The formula will be entered into a text field as a string.
What I'm having a problem with is understanding the proper way to parse and evaluate this formula. Initially, my plan was to do a simple replace on the attribute names and eval the expression (if invalid, the eval would fail). However, this presents a problem as it would allow for JavaScript injection into the field. I'm assuming I'll need some kind of FSM similar to an infix calculator to solve this, but I'm a little rusty on my computation theory (thanks corporate world!). I'm really not asking for someone to just hand me the code so much as I'd like to get your input on what is the best solution to this problem?
EDIT: Thanks for the responses. Unfortunately life has kept me busy and I haven't tried a solution yet. Will update when I get a result (good or bad).

Different idea, hence a separate suggestion:
eval() works fine, and there's no need to re-invent the wheel.
Assuming that there's only a small and fixed number of variables in your formula language, it would be sufficient to scan your way through the expression and verify that everything you encounter is either a parenthesis, an operator or one of your variable names. I don't think there would be any way to assemble those pieces into a piece of code that could have malicious side effects on eval.
So:
Scan the expression to verify that it draws from just a very limited vocabulary.
Let eval() work it out.
Probably the compromise with the least amount of work and code while bringing risk down to (near?) 0. At worst, a misuser could tack parentheses on a variable name in an attempt to execute the variable.
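A minimal sketch of the scan-then-eval idea. The function name, the token rules (#name, unsigned decimal numbers, + - * / and parentheses) and the plain attributes object are all assumptions for illustration. Anything outside that vocabulary is rejected before eval() ever sees the text, and attribute values are substituted as parenthesized numbers.

```javascript
// Hypothetical helper: validate a formula against a tiny vocabulary, then let eval() compute it.
function evaluateFormula(formula, attributes) {
  // One token per match: #attributeName, a number, or a single operator/parenthesis.
  const tokenPattern = /\s*(#[A-Za-z_]\w*|\d+(?:\.\d+)?|[-+*\/()])\s*/y;
  const parts = [];
  let position = 0;
  while (position < formula.length) {
    tokenPattern.lastIndex = position;
    const match = tokenPattern.exec(formula);
    if (match === null) {
      throw new SyntaxError("Unexpected input at index " + position);
    }
    const token = match[1];
    if (token.startsWith("#")) {
      const name = token.slice(1);
      if (!Object.prototype.hasOwnProperty.call(attributes, name)) {
        throw new ReferenceError("Unknown attribute: " + name);
      }
      const value = attributes[name];
      if (typeof value !== "number" || !Number.isFinite(value)) {
        throw new TypeError("Attribute " + name + " is not a finite number");
      }
      parts.push("(" + value + ")"); // parentheses keep negative values intact
    } else {
      parts.push(token);
    }
    position = tokenPattern.lastIndex;
  }
  // Only numbers, operators and parentheses remain at this point.
  return eval(parts.join(" "));
}

console.log(evaluateFormula("((#strength + (#agility / 2)) * 5)", { strength: 10, agility: 4 })); // 60
```

A malformed but harmless formula such as "1 +" still fails inside eval() with a SyntaxError, so the caller should be prepared to catch that.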

I think instead of letting them put the whole formula in, you could have select tags that have operations and values, and let them choose.
ie. a set of tags with attribute-operation-number:
<select> <select> <input type="text">
#attribute1Name1 + (check if input is number)
#attribute1Name2 -
#attribute1Name3 *
#attribute1Name4 /
etc.

There is a really simple solution: Just enter a normal JavaScript formula (i.e. as if you were writing a method for your object) and use this to reference the object you're working on.
To change this when evaluating the method use apply() or call() (see this answer).
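A tiny sketch of what that could look like, assuming the character is a plain object with numeric properties (the object and the formula text are made up). Note that this compiles and runs whatever the user typed, so it does nothing about the injection concern raised in the question.

```javascript
const character = { strength: 10, agility: 4 };

// The formula as the user would type it, written like the body of a method.
const formulaText = "this.strength + this.agility / 2";

// Compile it once, then bind `this` to the character at call time.
const skill = new Function("return (" + formulaText + ");");
const skillValue = skill.call(character);

console.log(skillValue); // 12
```

The same compiled function can be reused for other characters by passing a different object to call().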

I recently wrote a similar application. I probably invested far too much work, but I went the whole 9 yards and wrote both a scanner and a parser.
The scanner converted the text into a series of tokens; tokens are simple objects consisting of token type and value. For the punctuation marks, value = character, for numbers the values would be integers corresponding to the numeric value of the number, and for variables it would be (a reference to) a variable object, where that variable would be sitting in a list of objects having a name. Same variable object = same variable, natch.
The parser was a simple brute force recursive descent parser. Here's the code.
My parser does logic expressions, with AND/OR taking the place of +/-, but I think you can see the idea. There are several levels of expressions, and each tries to assemble as much of itself as it can, and calls to lower levels for parsing nested constructs. When done, my parser has generated a single Node containing a tree structure that represents the expression.
In your program, I guess you could just store that Node, as its structure will essentially represent the formula for its evaluation.
Given all that work, though, I'd understand just how tempting it would be to just cave in and use eval!
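For the arithmetic formulas in the question, a recursive descent parser can stay quite small. The following is only a sketch of the general technique and not the parser described above: the function names, the node shapes and the grammar (numbers, #attribute names, + - * / and parentheses, no unary minus) are assumptions for illustration.

```javascript
// Hypothetical parser: text -> expression tree, one function per precedence level.
function parseFormula(text) {
  const tokenPattern = /\d+(?:\.\d+)?|#[A-Za-z_]\w*|[-+*\/()]/g;
  if (text.replace(tokenPattern, "").trim() !== "") {
    throw new SyntaxError("Unexpected characters in formula");
  }
  const tokens = text.match(tokenPattern) || [];
  let index = 0;

  function parseExpression() { // term (("+" | "-") term)*
    let node = parseTerm();
    while (tokens[index] === "+" || tokens[index] === "-") {
      const operator = tokens[index++];
      node = { type: "binary", operator, left: node, right: parseTerm() };
    }
    return node;
  }
  function parseTerm() { // factor (("*" | "/") factor)*
    let node = parseFactor();
    while (tokens[index] === "*" || tokens[index] === "/") {
      const operator = tokens[index++];
      node = { type: "binary", operator, left: node, right: parseFactor() };
    }
    return node;
  }
  function parseFactor() { // number | #attribute | "(" expression ")"
    const token = tokens[index++];
    if (token === undefined) throw new SyntaxError("Unexpected end of formula");
    if (token === "(") {
      const node = parseExpression();
      if (tokens[index++] !== ")") throw new SyntaxError("Expected ')'");
      return node;
    }
    if (token.startsWith("#")) return { type: "attribute", name: token.slice(1) };
    if (/^\d/.test(token)) return { type: "number", value: Number(token) };
    throw new SyntaxError("Unexpected token " + token);
  }

  const tree = parseExpression();
  if (index < tokens.length) throw new SyntaxError("Unexpected token " + tokens[index]);
  return tree;
}

// Walk the stored tree to compute a value for one character's attributes.
function evaluateTree(node, attributes) {
  if (node.type === "number") return node.value;
  if (node.type === "attribute") {
    if (!Object.prototype.hasOwnProperty.call(attributes, node.name)) {
      throw new ReferenceError("Unknown attribute: " + node.name);
    }
    return attributes[node.name];
  }
  const left = evaluateTree(node.left, attributes);
  const right = evaluateTree(node.right, attributes);
  switch (node.operator) {
    case "+": return left + right;
    case "-": return left - right;
    case "*": return left * right;
    default: return left / right; // the only remaining operator is "/"
  }
}
```

The tree returned by parseFormula can be stored with the skill and re-evaluated whenever an attribute changes, without parsing the text again.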

I'm fascinated by the task of getting this done by the simplest means possible.
Here's another approach:
Convert infix to postfix;
use a very simple stack-based calculator to evaluate the resulting expression.
The rationale here being, once you get rid of the complication of "* before +" and parentheses, the remaining calculation is very straightforward.
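The two steps can be sketched as follows. The conversion uses the shunting-yard approach, which is one common way to do it and not necessarily what was meant above. The sketch assumes the formula has already been split into tokens, with attribute names replaced by their numeric values; the function names are invented for this example.

```javascript
const PRECEDENCE = new Map([["+", 1], ["-", 1], ["*", 2], ["/", 2]]);

// Step 1: infix tokens -> postfix tokens (all four operators are left-associative).
function toPostfix(tokens) {
  const output = [];
  const operators = [];
  for (const token of tokens) {
    if (typeof token === "number") {
      output.push(token);
    } else if (token === "(") {
      operators.push(token);
    } else if (token === ")") {
      while (operators.length > 0 && operators[operators.length - 1] !== "(") {
        output.push(operators.pop());
      }
      if (operators.pop() !== "(") throw new SyntaxError("Unbalanced ')'");
    } else if (PRECEDENCE.has(token)) {
      // Pop operators that bind at least as tightly; "(" has no precedence and stops the loop.
      while (operators.length > 0 &&
             PRECEDENCE.get(operators[operators.length - 1]) >= PRECEDENCE.get(token)) {
        output.push(operators.pop());
      }
      operators.push(token);
    } else {
      throw new SyntaxError("Unknown token: " + token);
    }
  }
  while (operators.length > 0) {
    const operator = operators.pop();
    if (operator === "(") throw new SyntaxError("Unbalanced '('");
    output.push(operator);
  }
  return output;
}

// Step 2: evaluate the postfix tokens with a stack.
function evaluatePostfix(postfix) {
  const stack = [];
  for (const token of postfix) {
    if (typeof token === "number") {
      stack.push(token);
      continue;
    }
    if (stack.length < 2) throw new SyntaxError("Missing operand for " + token);
    const right = stack.pop();
    const left = stack.pop();
    if (token === "+") stack.push(left + right);
    else if (token === "-") stack.push(left - right);
    else if (token === "*") stack.push(left * right);
    else stack.push(left / right);
  }
  if (stack.length !== 1) throw new SyntaxError("Malformed expression");
  return stack[0];
}

console.log(evaluatePostfix(toPostfix([2, "+", 3, "*", 4]))); // 14
```

Once the expression is in postfix form, neither operator precedence nor parentheses appear in the evaluation loop at all.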

You could look at running the user-defined code in a sandbox to prevent attacks:
Is It Possible to Sandbox JavaScript Running In the Browser?
