I'm implementing a pretty-printer for a JavaScript AST and I wanted to ask if someone is aware of a "proper" algorithm to automatically parenthesize expressions with minimal parentheses based on operator precedence and associativity. I haven't found any useful material on the google.
What seems obvious is that an operator whose parent has a higher precedence should be parenthesized, e.g.:
(x + y) * z // x + y has lower precedence
However, there are also some operators which are not associative, in which case parentheses are still are needed, e.g.:
x - (y - z) // both operators have the same precedence
I'm wondering what would be the best rule for this latter case. Whether it's sufficient to say that for division and subtraction, the rhs sub-expression should be parenthesized if it has less than or equal precedence.
I stumbled on your question in search of the answer myself. While I haven't found a canonical algorithm, I have found that, like you say, operator precedence alone is not enough to minimally parenthesize expressions. I took a shot at writing a JavaScript pretty printer in Haskell, though I found it tedious to write a robust parser so I changed the concrete syntax: https://gist.github.com/kputnam/5625856
In addition to precedence, you must take operator associativity into account. Binary operations like / and - are parsed as left associative. However, assignment =, exponentiation ^, and equality == are right associative. This means the expression Div (Div a b) c can be written a / b / c without parentheses, but Exp (Exp a b) c must be parenthesized as (a ^ b) ^ c.
Your intuition is correct: for left-associative operators, if the left operand's expression binds less tightly than its parent, it should be parenthesized. If the right operand's expression binds as tightly or less tightly than its parent, it should be parenthesized. So Div (Div a b) (Div c d) wouldn't require parentheses around the left subexpression, but the right subexpression would: a / b / (c / d).
Next, unary operators, specifically operators which can either be binary or unary, like negation and subtraction -, coercion and addition +, etc might need to be handled on a case-by-case basis. For example Sub a (Neg b) should be printed as a - (-b), even though unary negation binds more tightly than subtraction. I guess it depends on your parser, a - -b may not be ambiguous, just ugly.
I'm not sure how unary operators which can be both prefix and postfix should work. In expressions like ++ (a ++) and (++ a) ++, one of the operators must bind more tightly than the other, or ++ a ++ would be ambiguous. But I suspect even if parentheses aren't needed in one of those, for the sake of readability, you may want to add parentheses anyway.
It depends on the rules for the specific grammar. I think you have it right for operators with different precedence, and right for subtraction and division.
Exponentiation, however, is often treated differently, in that its right hand operand is evaluated first. So you need
(a ** b) ** c
when c is the right child of the root.
Which way the parenthesization goes is determined by what the grammar rules define. If your grammar is of the form of
exp = sub1exp ;
exp = sub1exp op exp ;
sub1exp = sub1exp ;
sub1exp = sub1exp op1 sub2exp ;
sub2exp = sub3exp ;
sub2exp = sub3exp op2 sub2exp ;
sub3exp = ....
subNexp = '(' exp ')' ;
with op1 and op2 being non-associative, then you want to parenthesize the right subtree of op1 if the subtree root is also op1, and you want to parenthesize the left subtree of op2 if the left subtree has root op2.
There is a generic approach to pretty printing expressions with minimal parentheses. Begin by defining an unambiguous grammar for your expression language which encodes precedence and associativity rules. For example, say I have a language with three binary operators (*, +, #) and a unary operator (~), then my grammar might look like
E -> E0
E0 -> E1 '+' E0 (+ right associative, lowest precedence)
E0 -> E1
E1 -> E1 '*' E2 (* left associative; # non-associative; same precedence)
E1 -> E2 '#' E2
E1 -> E2
E2 -> '~' E2 (~ binds the tightest)
E2 -> E3
E3 -> Num (atomic expressions are numbers and parenthesized expressions)
E3 -> '(' E0 ')'
Parse trees for the grammar contain all necessary (and unnecessary) parentheses, and it is impossible to construct a parse tree whose flattening results in an ambiguous expression. For example, there is no parse tree for the string
1 # 2 # 3
because '#' is non-associative and always requires parentheses. On the other hand, the string
1 # (2 # 3)
has parse tree
E(E0(E1( E2(E3(Num(1)))
'#'
E2(E3( '('
E0(E1(E2(E3(Num(2)))
'#'
E2(E3(Num(3)))))
')')))
The problem is thus reduced to the problem of coercing an abstract syntax tree to a parse tree. The minimal number of parentheses is obtained by avoiding coercing an AST node to an atomic expression whenever possible. This is easy to do in a systematic way:
Maintain a pair consisting of a pointer to the current node in the AST and the current production being expanded. Initialize the pair with the root AST node and the 'E' production. In each case for the possible forms of the AST node, expand the grammar as much as necessary to encode the AST node. This will leave an unexpanded grammar production for each AST subtree. Apply the method recursively on each (subtree, production) pair.
For example, if the AST is (* (+ 1 2) 3), then proceed as follows:
expand[ (* (+ 1 2) 3); E ] --> E( E0( E1( expand[(+ 1 2) ; E1]
'*'
expand[3 ; E2] ) ) )
expand[ (+ 1 2) ; E1 ] --> E1(E2(E3( '('
E0( expand[ 1 ; E1 ]
'+'
expand[ 2 ; E0 ] )
')' )))
...
The algorithm can of course be implemented in a much less explicit way, but the method can be used to guide an implementation without going insane :).
Related
I'm working on a expression parser made in Jison, which supports basic things like arithmetics, comparisons etc. I want to allow chained comparisons like 1 < a < 10 and x == y != z. I've already implemented the logic needed to compare multiple values, but I'm strugling with the grammar – Jison keeps grouping the comparisons like (1 < a) < 10 or x == (y != z) and I can't make it recognize the whole thing as one relation.
This is roughly the grammar I have:
expressions = e EOF
e = Number
| e + e
| e - e
| Relation %prec '=='
| ...
Relation = e RelationalOperator Relation %prec 'CHAINED'
| e RelationalOperator Relation %prec 'NONCHAINED'
RelationalOperator = '==' | '!=' | ...
(Sorry, I don't know the actual Bison syntax, I use JSON. Here's the entire source.)
The operator precedence is roughly: NONCHAINED, ==, CHAINED, + and -.
I have an action set up on e → Relation, so I need that Relation to match the whole chained comparison, not only a part of it. I tried many things, including tweaking the precedence and changing the right-recursive e RelationalOperator Relation to a left-recursive Relation RelationalOperator e, but nothing worked so far. Either the parser matches only the smallest Relation possible, or it warns me that the grammar is ambiguous.
If you decided to experiment with the program, cloning it and running these commands will get you started:
git checkout develop
yarn
yarn test
There are basically two relatively easy solutions to this problem:
Use a cascading grammar instead of precedence declarations.
This makes it relatively easy to write a grammar for chained comparison, and does not really complicate the grammar for binary operators nor for tight-binding unary operators.
You'll find examples of cascading grammars all over the place, including most programming languages. A reasonably complete example is seen in this grammar for C expressions (just look at the grammar up to constant_expression:).
One of the advantages of cascading grammars is that they let you group operators at the same precedence level into a single non-terminal, as you try to do with comparison operators and as the linked C grammar does with assignment operators. That doesn't work with precedence declarations because precedence can't "see through" a unit production; the actual token has to be visibly part of the rule with declared precedence.
Another advantage is that if you have specific parsing needs for chained operators, you can just write the rule for the chained operators accordingly; you don't have to worry about it interfering with the rest of the grammar.
However, cascading grammars don't really get unary operators right, unless the unary operators are all at the top of the precedence hierarchy. This can be seen in Python, which uses a cascading grammar and has several unary operators low in the precedence hierarchy, such as the not operator, leading to the following oddity:
>>> if False == not True: print("All is well")
File "<stdin>", line 1
if False == not True: print("All is well")
^
SyntaxError: invalid syntax
That's a syntax error because == has higher precedence than not. The cascading grammar only allows an expression to appear as the operand of an operator with lower precedence than any operator in the expression, which means that the expression not True cannot be the operand of ==. (The precedence ordering allows not a == b to be grouped as not (a == b).) That prohibition is arguably ridiculous, since there is no other possible interpretation of False == not True other than False == (not True), and the fact that the precedence ordering forbids the only possible interpretation makes the only possible interpretation a syntax error. This doesn't happen with precedence declarations, because the precedence declaration is only used if there is more than one possible parse (that is, if there is really an ambiguity).
Your grammar puts not at the top of the precedence hierarchy, although it should really share that level with unary minus rather than being above unary minus [Note 1]. So that's not an impediment to using a cascading grammar. However, I see that you also want to implement an if … then … else operator, which is syntactically a low-precedence prefix operator. So if you wanted 4 + if x then 0 else 1 to have the value 5 when x is false (rather than being a syntax error), the cascading grammar would be problematic. You might not care about this, and if you don't, that's probably the way to go.
Stick with precedence declarations and handle the chained comparison as an exception in the semantic action.
This will allow the simplest possible grammar, but it will complicate your actions a bit. To implement it, you'll want to implement the comparison operators as left-associative, and then you'll need to be able to distinguish in the semantic actions between a comparison (which is a list of expressions and comparison operators) from any other expression (which is a string). The semantic action for a comparison operator needs to either extend or create the list, depending on whether the left-hand operand is a list or a string. The semantic action for any other operator (including parenthetic grouping) and for the right-hand operand in a comparison needs to check if it has received a list, and if so compile it into a string.
Whichever of those two options you choose, you'll probably want to fix the various precedence errors in the existing grammar, some of which were already present in your upstream source (like the unary minus / not confusion mentioned above). These include:
Exponentiation is configured as left-associative, whereas it is almost universally considered a right-associative operator. Many languages also make it higher precedence than unary minus, as well, since -a2 is pretty well always read as the negative of a squared rather than the square of minus a (which would just be a squared).
I suppose you are going to ditch the ternary operator ?: in favour of your if … then … else operator. But if you leave ?: in the grammar, you should make it right associative, as it is in every language other than PHP. (And the associativity in PHP is generally recognised as a design error. See this summary.)
The not in operator is actually two token, not and in, and not has quite high precedence. And that's how it will be parsed by your grammar, with the result that 4 + 3 in (7, 8) evaluates to true (because it was grouped as (4 + 3) in (7, 8)), while 4 + 3 not in (7, 8) evaluates rather surprisingly to 5, having been grouped as 4 + (3 not in (7, 8)).
Notes
If you used a cascading precedence grammar, you'd see that only one of - not 0 and not - 0 is parseable. Of course, both are probably type violations, but that's not something the syntax should concern itself with.
The interpreter built by my university, codeboot.org, offers step-by-step execution for an expression. As a result, I was able to see how the program reads an arithmetic expression. And this is where I start to confus.
For example, this expression: 10-5+(7+2)/3
We always say that we should calculate the expression in the parenthesis, as a result, this is what the order that I expect
7+2=9, 9/3=3, 10-5=5, 5+3=8
However, what the interpreter executes is completely different.
10-5=5, 7+2=9, 9/3=3, 5+3=8
Even though the result is the same, but why would it calculate 10-5 first? and what happens with "we have to calculate whatever is in the parenthesis first"? This makes me really confusing
I would like to know if this is the right behavior or not that the interpreter always goes from left to right and calculate whatever it can calculate first. Instead of jumping right into the () as we would expect
"Do the parentheses first" is not a rule in JS. And "go from left to right" isn't really a rule either. E.g. consider 1 + 4 * 6. Strict left-to-right would result in
1+4 = 5, 5*6 = 30
and that's not what JS does.
Instead, JS parses your expression into an expression tree, and then evaluates it starting at the root of the tree. (Strictly speaking, a JS implementation implementation isn't required to build a tree, but it's required to give the same results as if it did.)
For instance, your example expression 10-5+(7+2)/3 would result in a tree roughly like this:
AdditiveExpression:
AdditiveExpression
AdditiveExpression
MultiplicativeExpression
... NumericLiteral 10
- -
MultiplicativeExpression
... NumericLiteral 5
+ +
MultiplicativeExpression
MultiplicativeExpression
... ParenthesizeExpression
( (
Expression
... AdditiveExpression 7+2
) )
MultiplicativeOperator /
ExponentiationExpression
... NumericLiteral 3
where:
I've used indentation to convey nesting;
I've used "..." when I've left out lots of intermediate derivations; and
I haven't bothered to give the full sub-tree for "7+2".
(I couldn't find a way to get codeboot.org to show its parse tree. If there is some way, or if you use some other tool to show an expression's parse tree, note that it may not look exactly as above, but it should be similar enough that it will give the same behavior.)
To evaluate the expression, it starts at the root, an AdditiveExpression whose children are:
another AdditiveExpression (for 10-5),
the + token, and
a MultiplicativeExpression (for (7+2)/3).
The rule is to
(a) evaluate the left operand, then
(b) evaluate the right, then
(c) perform the addition on the results.
So that's why (a) 10-5 => 5 is the first thing your interpreter calculates.
Next is to (b) evaluate the MultiplicativeExpression for (7+2)/3. The rule here is similar, so we need to:
(b1) evaluate the left operand (the MultiplicativeExpression for (7+2)), then
(b2) evaluate the right operand (the ExponentiationExpression for 3), then
(b3) perform the operation indicated by the MultiplicativeOperator /.
So (b1) 7+2 => 9 is the next thing,
then (b2) 3 => 3,
then (b3) 9/3 => 3.
We're now finished step (b), so we proceed to (c) 5+3 => 8.
This matches the series of calculations that your interpreter performs.
I've been coding for years and suddenly stuck to some simple thing about operators precedence in case of increment/decrement operators.
According to https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Operator_Precedence
the postfix increment/decrement has higher priority than the prefix one.
So, I expect that in expression
x = --a + a++;
the increment will be calculated first and only after that the decrement.
But, in tests this expression calculates left-to-right like that operators have the same priority. And as result
a=1;x = --a + a++ equals to 0 instead of 2.
Ok. Assuming that prefix/postfix operators have the same precedence, I try to reorder it with parentheses:
a=1;x = --a + ( a++ )
But again, the result will be 0 and not 2 as I expected.
Can someone explain that please? why parentheses here do not affect anything? How can I see that postfix has higher precedence than prefix?
In the expression, evaluation proceeds like this:
--a is evaluated. The variable a is decremented, and the value of --a is therefore 0.
a++ is evaluated. The value of a is obtained, and then a is incremented. The value of a++ is therefore 0.
The + operation is performed, and the result is 0.
The final value of a is 1.
Because --a and a++ are on either side of the lower-precedence + operator, the difference in precedence between pre-decrement and post-increment doesn't matter; the + operator evaluates the left-hand subexpression before it evaluates the right-hand subexpression.
Operator precedence is not the same thing as evaluation order.
Operator precedence tells you that
f() + g() * h()
is parsed as
f() + (g() * h())
(because * has higher precedence than +), but not which function is called first. That is controlled by evaluation order, which in JavaScript is always left-to-right.
Parentheses only override precedence (i.e. they affect how subexpressions are grouped), not order of evaluation:
(f() + g()) * h()
performs addition before multiplication, but in all cases f is called first and h last.
In your example
--a + a++
the relative precedence of prefix -- and postfix ++ doesn't matter because they're not attached to the same operand. Infix + has much lower precedence, so this expression parses as
(--a) + (a++)
As always, JS expressions are evaluated from left to right, so --a is done first.
If you had written
--a++
that would have been parsed as
--(a++)
(and not (--a)++) because postfix ++ has higher precedence, but the difference doesn't matter here because either version is an error: You can't increment/decrement the result of another increment/decrement operation.
However, in the operator precedence table you can see that all prefix operators have the same precedence, so we can show an alternative:
// ! has the same precedence as prefix ++
!a++
is valid code because it parses as !(a++) due to postfix ++ having higher precedence than !. If it didn't, it would be interpreted as (!a)++, which is an error.
Well, it strange, but looks like increment/decrement behaves in expressions like a function call, not like an operator.
I could not find any documentation regarding order of function evaluations in expression, but looks like it ignores any precedence rules. So strange expression like that:
console.log(1) + console.log(2)* ( console.log(3) + console.log(4))
will show 1234 in console. That the only explanation I could found why parentheses do not affect to order of evaluation of dec/inc in expressions
Executing it in the browser console it says SyntaxError: Unexpected token **.
Trying it in node:
> -1**2
...
...
...
...^C
I thought this is an arithmetic expression where ** is the power operator. There is no such issue with other operators.
Strangely, typing */ on the second line triggers the execution:
> -1**2
... */
-1**2
^^
SyntaxError: Unexpected token **
What is happening here?
Executing it in the browser console says SyntaxError: Unexpected token **.
Because that's the spec. Designed that way to avoid confusion about whether it's the square of the negation of one (i.e. (-1) ** 2), or the negation of the square of one (i.e. -(1 ** 2)). This design was the result of extensive discussion of operator precedence, and examination of how this is handled in other languages, and finally the decision was made to avoid unexpected behavior by making this a syntax error.
From the documentation on MDN:
In JavaScript, it is impossible to write an ambiguous exponentiation expression, i.e. you cannot put a unary operator (+/-/~/!/delete/void/typeof) immediately before the base number.
The reason is also explained in that same text:
In most languages like PHP and Python and others that have an exponentiation operator (typically ^ or **), the exponentiation operator is defined to have a higher precedence than unary operators such as unary + and unary -, but there are a few exceptions. For example, in Bash the ** operator is defined to have a lower precedence than unary operators.
So to avoid confusion it was decided that the code must remove the ambiguity and explicitly put the parentheses:
(-1)**2
or:
-(1**2)
As a side note, the binary - is not treated that way -- having lower precedence -- and so the last expression has the same result as this valid expression:
0-1**2
Exponentiation Precedence in Other Programming Languages
As already affirmed in above quote, most programming languages that have an infix exponentiation operator, give a higher precedence to that operator than to the unary minus.
Here are some other examples of programming languages that give a higher precedence to the unary minus operator:
bc
VBScript
AppleScript
COBOL
Rexx
Orc
According to Wikipedia, the modulo operator (remainder of integer division) on n should yield a result between 0 and n-1.
This is indeed the case in python:
print(-1%5) # outputs 4
In Ruby:
puts -1%5 # outputs 4
In Haskell
main = putStrLn $ show $ mod (-1) 5
But inJavascript:
console.log(-1%5);
The result is -1 !!
Why is that ? is there logic behind this design decision ?
While -1 is equivalent to 4 modulo 5, negative values makes it harder to use as indices.
For example:
arr[x%5]
would always be a valid expression in python and Ruby when the array length is more than 5, but in the Javascript, this is an exception waiting to happen.
If by "modulus" you understand the remainder of the Euclidean division as common in mathematics, the % operator is not the modulo operator. It rather is called the remainder operator, whose result always has the same sign as the dividend. (In Haskell, the rem function does this).
This behaviour is pretty common amongst programming languages, notably also in C and Java from which the arithmetic conventions of JavaScript were inspired.
Note that while in most languages, ‘%’ is a remainder operator, in some (e.g. Python, Perl) it is a modulo operator. For two values of the same sign, the two are equivalent, but when the dividend and divisor are of different signs, they give different results. To obtain a modulo in JavaScript, in place of a % n, use ((a % n ) + n ) % n
see
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Remainder