I'm working on a expression parser made in Jison, which supports basic things like arithmetics, comparisons etc. I want to allow chained comparisons like 1 < a < 10 and x == y != z. I've already implemented the logic needed to compare multiple values, but I'm strugling with the grammar – Jison keeps grouping the comparisons like (1 < a) < 10 or x == (y != z) and I can't make it recognize the whole thing as one relation.
This is roughly the grammar I have:
expressions = e EOF
e = Number
| e + e
| e - e
| Relation %prec '=='
| ...
Relation = e RelationalOperator Relation %prec 'CHAINED'
| e RelationalOperator Relation %prec 'NONCHAINED'
RelationalOperator = '==' | '!=' | ...
(Sorry, I don't know the actual Bison syntax, I use JSON. Here's the entire source.)
The operator precedence is roughly: NONCHAINED, ==, CHAINED, + and -.
I have an action set up on e → Relation, so I need that Relation to match the whole chained comparison, not only a part of it. I tried many things, including tweaking the precedence and changing the right-recursive e RelationalOperator Relation to a left-recursive Relation RelationalOperator e, but nothing worked so far. Either the parser matches only the smallest Relation possible, or it warns me that the grammar is ambiguous.
If you decided to experiment with the program, cloning it and running these commands will get you started:
git checkout develop
yarn
yarn test
There are basically two relatively easy solutions to this problem:
Use a cascading grammar instead of precedence declarations.
This makes it relatively easy to write a grammar for chained comparison, and does not really complicate the grammar for binary operators nor for tight-binding unary operators.
You'll find examples of cascading grammars all over the place, including most programming languages. A reasonably complete example is seen in this grammar for C expressions (just look at the grammar up to constant_expression:).
One of the advantages of cascading grammars is that they let you group operators at the same precedence level into a single non-terminal, as you try to do with comparison operators and as the linked C grammar does with assignment operators. That doesn't work with precedence declarations because precedence can't "see through" a unit production; the actual token has to be visibly part of the rule with declared precedence.
Another advantage is that if you have specific parsing needs for chained operators, you can just write the rule for the chained operators accordingly; you don't have to worry about it interfering with the rest of the grammar.
However, cascading grammars don't really get unary operators right, unless the unary operators are all at the top of the precedence hierarchy. This can be seen in Python, which uses a cascading grammar and has several unary operators low in the precedence hierarchy, such as the not operator, leading to the following oddity:
>>> if False == not True: print("All is well")
File "<stdin>", line 1
if False == not True: print("All is well")
^
SyntaxError: invalid syntax
That's a syntax error because == has higher precedence than not. The cascading grammar only allows an expression to appear as the operand of an operator with lower precedence than any operator in the expression, which means that the expression not True cannot be the operand of ==. (The precedence ordering allows not a == b to be grouped as not (a == b).) That prohibition is arguably ridiculous, since there is no other possible interpretation of False == not True other than False == (not True), and the fact that the precedence ordering forbids the only possible interpretation makes the only possible interpretation a syntax error. This doesn't happen with precedence declarations, because the precedence declaration is only used if there is more than one possible parse (that is, if there is really an ambiguity).
Your grammar puts not at the top of the precedence hierarchy, although it should really share that level with unary minus rather than being above unary minus [Note 1]. So that's not an impediment to using a cascading grammar. However, I see that you also want to implement an if … then … else operator, which is syntactically a low-precedence prefix operator. So if you wanted 4 + if x then 0 else 1 to have the value 5 when x is false (rather than being a syntax error), the cascading grammar would be problematic. You might not care about this, and if you don't, that's probably the way to go.
Stick with precedence declarations and handle the chained comparison as an exception in the semantic action.
This will allow the simplest possible grammar, but it will complicate your actions a bit. To implement it, you'll want to implement the comparison operators as left-associative, and then you'll need to be able to distinguish in the semantic actions between a comparison (which is a list of expressions and comparison operators) from any other expression (which is a string). The semantic action for a comparison operator needs to either extend or create the list, depending on whether the left-hand operand is a list or a string. The semantic action for any other operator (including parenthetic grouping) and for the right-hand operand in a comparison needs to check if it has received a list, and if so compile it into a string.
Whichever of those two options you choose, you'll probably want to fix the various precedence errors in the existing grammar, some of which were already present in your upstream source (like the unary minus / not confusion mentioned above). These include:
Exponentiation is configured as left-associative, whereas it is almost universally considered a right-associative operator. Many languages also make it higher precedence than unary minus, as well, since -a2 is pretty well always read as the negative of a squared rather than the square of minus a (which would just be a squared).
I suppose you are going to ditch the ternary operator ?: in favour of your if … then … else operator. But if you leave ?: in the grammar, you should make it right associative, as it is in every language other than PHP. (And the associativity in PHP is generally recognised as a design error. See this summary.)
The not in operator is actually two token, not and in, and not has quite high precedence. And that's how it will be parsed by your grammar, with the result that 4 + 3 in (7, 8) evaluates to true (because it was grouped as (4 + 3) in (7, 8)), while 4 + 3 not in (7, 8) evaluates rather surprisingly to 5, having been grouped as 4 + (3 not in (7, 8)).
Notes
If you used a cascading precedence grammar, you'd see that only one of - not 0 and not - 0 is parseable. Of course, both are probably type violations, but that's not something the syntax should concern itself with.
Related
I am trying to make a web-app in JavaScript that converts arithmetic expressions to i486-compatible assembly. You can see it here:
http://flatassembler.000webhostapp.com/compiler.html
I have tried to make it able to deal with expressions that contain the incremental and decremental operators ("--" and "++"). Now it appears to correctly deal with expressions such as:
c++
However, in a response to an expression such as:
c--
the web-app responds with:
Tokenizer error: Unable to assign the type to the operator '-' (whether it's unary or binary).
The error message seems quite self-explanatory. Namely, I made the tokenizer assign to the "-" operator a type (unary or binary) and put the parentheses where they are needed, so that the parser can deal with the expressions such as:
10 * -2
And now, because of that, I am not able to implement the decremental operator. I've been thinking about this for days, and I can't decide what to even try. Do you have any ideas?
Please note that the web-app now correctly deals with the expressions such as:
a - -b
The way that this works in all existing languages (that I know of anyway) that have these operators, is that -- is a single token. So when you see a -, you check whether the very next character is another -. If it is, you generate a -- token (consuming both - characters). If it isn't, you generate a - token (leaving the next character in the buffer).
Then in the parser an l-expression, followed by -- token becomes a postfix decrement expression and -- followed by an l-expression becomes a prefix decrement expression. A -- token in any other position is a syntax error.
This means that spaces between -s matter: --x is a prefix decrement (or a syntax error if the language doesn't allow prefix increment and decrement), - -x is a double negative that cancels out to just x.
I should also note that in languages where postfix increment/decrement is an expression, it evaluates to the original value of the operand, not the incremented value. So if x starts out as 5, the value of x++ should be 5 and afterwards the value of x should be 6. So your current code does not actually correctly implement postfix ++ (or at least not in a way consistent with other languages). Also x++ + y++ currently produces a syntax error, so it doesn't seem like it's really supported at all.
Executing it in the browser console it says SyntaxError: Unexpected token **.
Trying it in node:
> -1**2
...
...
...
...^C
I thought this is an arithmetic expression where ** is the power operator. There is no such issue with other operators.
Strangely, typing */ on the second line triggers the execution:
> -1**2
... */
-1**2
^^
SyntaxError: Unexpected token **
What is happening here?
Executing it in the browser console says SyntaxError: Unexpected token **.
Because that's the spec. Designed that way to avoid confusion about whether it's the square of the negation of one (i.e. (-1) ** 2), or the negation of the square of one (i.e. -(1 ** 2)). This design was the result of extensive discussion of operator precedence, and examination of how this is handled in other languages, and finally the decision was made to avoid unexpected behavior by making this a syntax error.
From the documentation on MDN:
In JavaScript, it is impossible to write an ambiguous exponentiation expression, i.e. you cannot put a unary operator (+/-/~/!/delete/void/typeof) immediately before the base number.
The reason is also explained in that same text:
In most languages like PHP and Python and others that have an exponentiation operator (typically ^ or **), the exponentiation operator is defined to have a higher precedence than unary operators such as unary + and unary -, but there are a few exceptions. For example, in Bash the ** operator is defined to have a lower precedence than unary operators.
So to avoid confusion it was decided that the code must remove the ambiguity and explicitly put the parentheses:
(-1)**2
or:
-(1**2)
As a side note, the binary - is not treated that way -- having lower precedence -- and so the last expression has the same result as this valid expression:
0-1**2
Exponentiation Precedence in Other Programming Languages
As already affirmed in above quote, most programming languages that have an infix exponentiation operator, give a higher precedence to that operator than to the unary minus.
Here are some other examples of programming languages that give a higher precedence to the unary minus operator:
bc
VBScript
AppleScript
COBOL
Rexx
Orc
I have gone through similar questions and answers on StackOverflow and found this:
parseInt("123hui")
returns 123
Number("123hui")
returns NaN
As, parseInt() parses up to the first non-digit and returns whatever it had parsed and Number() tries to convert the entire string into a number, why unlikely behaviour in case of parseInt('') and Number('').
I feel ideally parseInt should return NaNjust like it does with Number("123hui")
Now my next question:
As 0 == '' returns true I believe it interprets like 0 == Number('') which is true. So does the compiler really treat it like 0 == Number('') and not like 0 == parseInt('') or am I missing some points?
The difference is due in part to Number() making use of additional logic for type coercion. Included in the rules it follows for that is:
A StringNumericLiteral that is empty or contains only white space is converted to +0.
Whereas parseInt() is defined to simply find and evaluate numeric characters in the input, based on the given or detected radix. And, it was defined to expect at least one valid character.
13) If S contains a code unit that is not a radix-R digit, let Z be the substring of S consisting of all code units before the first such code unit; otherwise, let Z be S.
14) If Z is empty, return NaN.
Note: 'S' is the input string after any leading whitespace is removed.
As 0=='' returns true I believe it interprets like 0==Number('') [...]
The rules that == uses are defined as Abstract Equality.
And, you're right about the coercion/conversion that's used. The relevant step is #6:
If Type(x) is Number and Type(y) is String,
return the result of the comparison x == ToNumber(y).
To answer your question about 0==''returning true :
Below is the comparison of a number and string:
The Equals Operator (==)
Type (x) Type(y) Result
-------------------------------------------
x and y are the same type Strict Equality (===) Algorithm
Number String x == toNumber(y)
and toNumber does the following to a string argument:
toNumber:
Argument type Result
------------------------
String In effect evaluates Number(string)
“abc” -> NaN
“123” -> 123
Number('') returns 0. So that leaves you with 0==0 which is evaluated using Strict Equality (===) Algorithm
The Strict Equals Operator (===)
Type values Result
----------------------------------------------------------
Number x same value as y true
(but not NaN)
You can find the complete list # javascriptweblog.wordpress.com - truth-equality-and-javascript.
parseInt("") is NaN because the standard says so even if +"" is 0 instead (also simply because the standard says so, implying for example that "" == 0).
Don't look for logic in this because there's no deep profound logic, just history.
You are in my opinion making a BIG mistake... the sooner you correct it the better will be for your programming life with Javascript. The mistake is that you are assuming that every choice made in programming languages and every technical detail about them is logical. This is simply not true.
Especially for Javascript.
Please remeber that Javascript was "designed" in a rush and, just because of fate, it became extremely popular overnight. This forced the community to standardize it before any serious thought to the details and therefore it was basically "frozen" in its current sad state before any serious testing on the field.
There are parts that are so bad they aren't even funny (e.g. with statement or the == equality operator that is so broken that serious js IDEs warn about any use of it: you get things like A==B, B==C and A!=C even using just normal values and without any "special" value like null, undefined, NaN or empty strings "" and not because of precision problems).
Nonsense special cases are everywhere in Javascript and trying to put them in a logical frame is, unfortunately, a wasted effort. Just learn its oddities by reading a lot and enjoy the fantastic runtime environment it provides (this is where Javascript really shines... browsers and their JIT are a truly impressive piece of technology: you can write a few lines and get real useful software running on a gajillion of different computing devices).
The official standard where all oddities are enumerated is quite hard to read because aims to be very accurate, and unfortunately the rules it has to specify are really complex.
Moreover as the language gains more features the rules will get even more and more complex: for example what is for ES5 just another weird "special" case (e.g. ToPrimitive operation behavior for Date objects) becomes a "normal" case in ES6 (where ToPrimitive can be customized).
Not sure if this "normalization" is something to be happy about... the real problem is the frozen starting point and there are no easy solutions now (if you don't want to throw away all existing javascript code, that is).
The normal path for a language is starting clean and nice and symmetric and small. Then when facing real world problems the language gains (is infected by) some ugly parts (because the world is ugly and asymmetrical).
Javascript is like that. Except that it didn't start nice and clean and moreover there was no time to polish it before throwing it in the game.
I'm implementing a pretty-printer for a JavaScript AST and I wanted to ask if someone is aware of a "proper" algorithm to automatically parenthesize expressions with minimal parentheses based on operator precedence and associativity. I haven't found any useful material on the google.
What seems obvious is that an operator whose parent has a higher precedence should be parenthesized, e.g.:
(x + y) * z // x + y has lower precedence
However, there are also some operators which are not associative, in which case parentheses are still are needed, e.g.:
x - (y - z) // both operators have the same precedence
I'm wondering what would be the best rule for this latter case. Whether it's sufficient to say that for division and subtraction, the rhs sub-expression should be parenthesized if it has less than or equal precedence.
I stumbled on your question in search of the answer myself. While I haven't found a canonical algorithm, I have found that, like you say, operator precedence alone is not enough to minimally parenthesize expressions. I took a shot at writing a JavaScript pretty printer in Haskell, though I found it tedious to write a robust parser so I changed the concrete syntax: https://gist.github.com/kputnam/5625856
In addition to precedence, you must take operator associativity into account. Binary operations like / and - are parsed as left associative. However, assignment =, exponentiation ^, and equality == are right associative. This means the expression Div (Div a b) c can be written a / b / c without parentheses, but Exp (Exp a b) c must be parenthesized as (a ^ b) ^ c.
Your intuition is correct: for left-associative operators, if the left operand's expression binds less tightly than its parent, it should be parenthesized. If the right operand's expression binds as tightly or less tightly than its parent, it should be parenthesized. So Div (Div a b) (Div c d) wouldn't require parentheses around the left subexpression, but the right subexpression would: a / b / (c / d).
Next, unary operators, specifically operators which can either be binary or unary, like negation and subtraction -, coercion and addition +, etc might need to be handled on a case-by-case basis. For example Sub a (Neg b) should be printed as a - (-b), even though unary negation binds more tightly than subtraction. I guess it depends on your parser, a - -b may not be ambiguous, just ugly.
I'm not sure how unary operators which can be both prefix and postfix should work. In expressions like ++ (a ++) and (++ a) ++, one of the operators must bind more tightly than the other, or ++ a ++ would be ambiguous. But I suspect even if parentheses aren't needed in one of those, for the sake of readability, you may want to add parentheses anyway.
It depends on the rules for the specific grammar. I think you have it right for operators with different precedence, and right for subtraction and division.
Exponentiation, however, is often treated differently, in that its right hand operand is evaluated first. So you need
(a ** b) ** c
when c is the right child of the root.
Which way the parenthesization goes is determined by what the grammar rules define. If your grammar is of the form of
exp = sub1exp ;
exp = sub1exp op exp ;
sub1exp = sub1exp ;
sub1exp = sub1exp op1 sub2exp ;
sub2exp = sub3exp ;
sub2exp = sub3exp op2 sub2exp ;
sub3exp = ....
subNexp = '(' exp ')' ;
with op1 and op2 being non-associative, then you want to parenthesize the right subtree of op1 if the subtree root is also op1, and you want to parenthesize the left subtree of op2 if the left subtree has root op2.
There is a generic approach to pretty printing expressions with minimal parentheses. Begin by defining an unambiguous grammar for your expression language which encodes precedence and associativity rules. For example, say I have a language with three binary operators (*, +, #) and a unary operator (~), then my grammar might look like
E -> E0
E0 -> E1 '+' E0 (+ right associative, lowest precedence)
E0 -> E1
E1 -> E1 '*' E2 (* left associative; # non-associative; same precedence)
E1 -> E2 '#' E2
E1 -> E2
E2 -> '~' E2 (~ binds the tightest)
E2 -> E3
E3 -> Num (atomic expressions are numbers and parenthesized expressions)
E3 -> '(' E0 ')'
Parse trees for the grammar contain all necessary (and unnecessary) parentheses, and it is impossible to construct a parse tree whose flattening results in an ambiguous expression. For example, there is no parse tree for the string
1 # 2 # 3
because '#' is non-associative and always requires parentheses. On the other hand, the string
1 # (2 # 3)
has parse tree
E(E0(E1( E2(E3(Num(1)))
'#'
E2(E3( '('
E0(E1(E2(E3(Num(2)))
'#'
E2(E3(Num(3)))))
')')))
The problem is thus reduced to the problem of coercing an abstract syntax tree to a parse tree. The minimal number of parentheses is obtained by avoiding coercing an AST node to an atomic expression whenever possible. This is easy to do in a systematic way:
Maintain a pair consisting of a pointer to the current node in the AST and the current production being expanded. Initialize the pair with the root AST node and the 'E' production. In each case for the possible forms of the AST node, expand the grammar as much as necessary to encode the AST node. This will leave an unexpanded grammar production for each AST subtree. Apply the method recursively on each (subtree, production) pair.
For example, if the AST is (* (+ 1 2) 3), then proceed as follows:
expand[ (* (+ 1 2) 3); E ] --> E( E0( E1( expand[(+ 1 2) ; E1]
'*'
expand[3 ; E2] ) ) )
expand[ (+ 1 2) ; E1 ] --> E1(E2(E3( '('
E0( expand[ 1 ; E1 ]
'+'
expand[ 2 ; E0 ] )
')' )))
...
The algorithm can of course be implemented in a much less explicit way, but the method can be used to guide an implementation without going insane :).
I put cross-browser split into my code and run it through jsHint and got Unexpected use of '>>>' in lines:
limit = limit === undef ?
-1 >>> 0 : // Math.pow(2, 32) - 1
limit >>> 0; // ToUint32(limit)
the same is when I put it in one line and also put expressions into parentheses
is this error? How can I fix it?
You can disable the error (well, they really should call it a warning) by turning off the "When bitwise operators are used" option (the top option listed here — oddly, the documentation doesn't mention all of the bitwise operators it relates to); if you do, the code above doesn't produce an error.
Here's the rationale for warning about the use of bitwise operators from the original JSLint tool (JSHint is a friendlier version of JSLint with more options to turn off "errors" that are purely style):
Bitwise Operators
JavaScript does not have an integer type, but it does have bitwise operators. The bitwise operators convert their operands from floating point to integers and back, so they are not as efficient as in C or other languages. They are rarely useful in browser applications. The similarity to the logical operators can mask some programming errors. The bitwise option allows the use of these operators: << >> >>> ~ & |.