How to parse expressions with the decremental operator (e.g. "c--")? - javascript

I am trying to make a web-app in JavaScript that converts arithmetic expressions to i486-compatible assembly. You can see it here:
http://flatassembler.000webhostapp.com/compiler.html
I have tried to make it able to deal with expressions that contain the incremental and decremental operators ("--" and "++"). Now it appears to correctly deal with expressions such as:
c++
However, in a response to an expression such as:
c--
the web-app responds with:
Tokenizer error: Unable to assign the type to the operator '-' (whether it's unary or binary).
The error message seems quite self-explanatory. Namely, I made the tokenizer assign to the "-" operator a type (unary or binary) and put the parentheses where they are needed, so that the parser can deal with the expressions such as:
10 * -2
And now, because of that, I am not able to implement the decremental operator. I've been thinking about this for days, and I can't decide what to even try. Do you have any ideas?
Please note that the web-app now correctly deals with the expressions such as:
a - -b

The way that this works in all existing languages (that I know of anyway) that have these operators, is that -- is a single token. So when you see a -, you check whether the very next character is another -. If it is, you generate a -- token (consuming both - characters). If it isn't, you generate a - token (leaving the next character in the buffer).
Then in the parser an l-expression, followed by -- token becomes a postfix decrement expression and -- followed by an l-expression becomes a prefix decrement expression. A -- token in any other position is a syntax error.
This means that spaces between -s matter: --x is a prefix decrement (or a syntax error if the language doesn't allow prefix increment and decrement), - -x is a double negative that cancels out to just x.
I should also note that in languages where postfix increment/decrement is an expression, it evaluates to the original value of the operand, not the incremented value. So if x starts out as 5, the value of x++ should be 5 and afterwards the value of x should be 6. So your current code does not actually correctly implement postfix ++ (or at least not in a way consistent with other languages). Also x++ + y++ currently produces a syntax error, so it doesn't seem like it's really supported at all.

Related

How to make a chained comparison in Jison (or Bison)

I'm working on a expression parser made in Jison, which supports basic things like arithmetics, comparisons etc. I want to allow chained comparisons like 1 < a < 10 and x == y != z. I've already implemented the logic needed to compare multiple values, but I'm strugling with the grammar – Jison keeps grouping the comparisons like (1 < a) < 10 or x == (y != z) and I can't make it recognize the whole thing as one relation.
This is roughly the grammar I have:
expressions = e EOF
e = Number
| e + e
| e - e
| Relation %prec '=='
| ...
Relation = e RelationalOperator Relation %prec 'CHAINED'
| e RelationalOperator Relation %prec 'NONCHAINED'
RelationalOperator = '==' | '!=' | ...
(Sorry, I don't know the actual Bison syntax, I use JSON. Here's the entire source.)
The operator precedence is roughly: NONCHAINED, ==, CHAINED, + and -.
I have an action set up on e → Relation, so I need that Relation to match the whole chained comparison, not only a part of it. I tried many things, including tweaking the precedence and changing the right-recursive e RelationalOperator Relation to a left-recursive Relation RelationalOperator e, but nothing worked so far. Either the parser matches only the smallest Relation possible, or it warns me that the grammar is ambiguous.
If you decided to experiment with the program, cloning it and running these commands will get you started:
git checkout develop
yarn
yarn test
There are basically two relatively easy solutions to this problem:
Use a cascading grammar instead of precedence declarations.
This makes it relatively easy to write a grammar for chained comparison, and does not really complicate the grammar for binary operators nor for tight-binding unary operators.
You'll find examples of cascading grammars all over the place, including most programming languages. A reasonably complete example is seen in this grammar for C expressions (just look at the grammar up to constant_expression:).
One of the advantages of cascading grammars is that they let you group operators at the same precedence level into a single non-terminal, as you try to do with comparison operators and as the linked C grammar does with assignment operators. That doesn't work with precedence declarations because precedence can't "see through" a unit production; the actual token has to be visibly part of the rule with declared precedence.
Another advantage is that if you have specific parsing needs for chained operators, you can just write the rule for the chained operators accordingly; you don't have to worry about it interfering with the rest of the grammar.
However, cascading grammars don't really get unary operators right, unless the unary operators are all at the top of the precedence hierarchy. This can be seen in Python, which uses a cascading grammar and has several unary operators low in the precedence hierarchy, such as the not operator, leading to the following oddity:
>>> if False == not True: print("All is well")
File "<stdin>", line 1
if False == not True: print("All is well")
^
SyntaxError: invalid syntax
That's a syntax error because == has higher precedence than not. The cascading grammar only allows an expression to appear as the operand of an operator with lower precedence than any operator in the expression, which means that the expression not True cannot be the operand of ==. (The precedence ordering allows not a == b to be grouped as not (a == b).) That prohibition is arguably ridiculous, since there is no other possible interpretation of False == not True other than False == (not True), and the fact that the precedence ordering forbids the only possible interpretation makes the only possible interpretation a syntax error. This doesn't happen with precedence declarations, because the precedence declaration is only used if there is more than one possible parse (that is, if there is really an ambiguity).
Your grammar puts not at the top of the precedence hierarchy, although it should really share that level with unary minus rather than being above unary minus [Note 1]. So that's not an impediment to using a cascading grammar. However, I see that you also want to implement an if … then … else operator, which is syntactically a low-precedence prefix operator. So if you wanted 4 + if x then 0 else 1 to have the value 5 when x is false (rather than being a syntax error), the cascading grammar would be problematic. You might not care about this, and if you don't, that's probably the way to go.
Stick with precedence declarations and handle the chained comparison as an exception in the semantic action.
This will allow the simplest possible grammar, but it will complicate your actions a bit. To implement it, you'll want to implement the comparison operators as left-associative, and then you'll need to be able to distinguish in the semantic actions between a comparison (which is a list of expressions and comparison operators) from any other expression (which is a string). The semantic action for a comparison operator needs to either extend or create the list, depending on whether the left-hand operand is a list or a string. The semantic action for any other operator (including parenthetic grouping) and for the right-hand operand in a comparison needs to check if it has received a list, and if so compile it into a string.
Whichever of those two options you choose, you'll probably want to fix the various precedence errors in the existing grammar, some of which were already present in your upstream source (like the unary minus / not confusion mentioned above). These include:
Exponentiation is configured as left-associative, whereas it is almost universally considered a right-associative operator. Many languages also make it higher precedence than unary minus, as well, since -a2 is pretty well always read as the negative of a squared rather than the square of minus a (which would just be a squared).
I suppose you are going to ditch the ternary operator ?: in favour of your if … then … else operator. But if you leave ?: in the grammar, you should make it right associative, as it is in every language other than PHP. (And the associativity in PHP is generally recognised as a design error. See this summary.)
The not in operator is actually two token, not and in, and not has quite high precedence. And that's how it will be parsed by your grammar, with the result that 4 + 3 in (7, 8) evaluates to true (because it was grouped as (4 + 3) in (7, 8)), while 4 + 3 not in (7, 8) evaluates rather surprisingly to 5, having been grouped as 4 + (3 not in (7, 8)).
Notes
If you used a cascading precedence grammar, you'd see that only one of - not 0 and not - 0 is parseable. Of course, both are probably type violations, but that's not something the syntax should concern itself with.

Why is -1**2 a syntax error in JavaScript?

Executing it in the browser console it says SyntaxError: Unexpected token **.
Trying it in node:
> -1**2
...
...
...
...^C
I thought this is an arithmetic expression where ** is the power operator. There is no such issue with other operators.
Strangely, typing */ on the second line triggers the execution:
> -1**2
... */
-1**2
^^
SyntaxError: Unexpected token **
What is happening here?
Executing it in the browser console says SyntaxError: Unexpected token **.
Because that's the spec. Designed that way to avoid confusion about whether it's the square of the negation of one (i.e. (-1) ** 2), or the negation of the square of one (i.e. -(1 ** 2)). This design was the result of extensive discussion of operator precedence, and examination of how this is handled in other languages, and finally the decision was made to avoid unexpected behavior by making this a syntax error.
From the documentation on MDN:
In JavaScript, it is impossible to write an ambiguous exponentiation expression, i.e. you cannot put a unary operator (+/-/~/!/delete/void/typeof) immediately before the base number.
The reason is also explained in that same text:
In most languages like PHP and Python and others that have an exponentiation operator (typically ^ or **), the exponentiation operator is defined to have a higher precedence than unary operators such as unary + and unary -, but there are a few exceptions. For example, in Bash the ** operator is defined to have a lower precedence than unary operators.
So to avoid confusion it was decided that the code must remove the ambiguity and explicitly put the parentheses:
(-1)**2
or:
-(1**2)
As a side note, the binary - is not treated that way -- having lower precedence -- and so the last expression has the same result as this valid expression:
0-1**2
Exponentiation Precedence in Other Programming Languages
As already affirmed in above quote, most programming languages that have an infix exponentiation operator, give a higher precedence to that operator than to the unary minus.
Here are some other examples of programming languages that give a higher precedence to the unary minus operator:
bc
VBScript
AppleScript
COBOL
Rexx
Orc

Why does Number('') returns 0 whereas parseInt('') returns NaN

I have gone through similar questions and answers on StackOverflow and found this:
parseInt("123hui")
returns 123
Number("123hui")
returns NaN
As, parseInt() parses up to the first non-digit and returns whatever it had parsed and Number() tries to convert the entire string into a number, why unlikely behaviour in case of parseInt('') and Number('').
I feel ideally parseInt should return NaNjust like it does with Number("123hui")
Now my next question:
As 0 == '' returns true I believe it interprets like 0 == Number('') which is true. So does the compiler really treat it like 0 == Number('') and not like 0 == parseInt('') or am I missing some points?
The difference is due in part to Number() making use of additional logic for type coercion. Included in the rules it follows for that is:
A StringNumericLiteral that is empty or contains only white space is converted to +0.
Whereas parseInt() is defined to simply find and evaluate numeric characters in the input, based on the given or detected radix. And, it was defined to expect at least one valid character.
13) If S contains a code unit that is not a radix-R digit, let Z be the substring of S consisting of all code units before the first such code unit; otherwise, let Z be S.
14) If Z is empty, return NaN.
Note: 'S' is the input string after any leading whitespace is removed.
As 0=='' returns true I believe it interprets like 0==Number('') [...]
The rules that == uses are defined as Abstract Equality.
And, you're right about the coercion/conversion that's used. The relevant step is #6:
If Type(x) is Number and Type(y) is String,
return the result of the comparison x == ToNumber(y).
To answer your question about 0==''returning true :
Below is the comparison of a number and string:
The Equals Operator (==)
Type (x) Type(y) Result
-------------------------------------------
x and y are the same type Strict Equality (===) Algorithm
Number String x == toNumber(y)
and toNumber does the following to a string argument:
toNumber:
Argument type Result
------------------------
String In effect evaluates Number(string)
“abc” -> NaN
“123” -> 123
Number('') returns 0. So that leaves you with 0==0 which is evaluated using Strict Equality (===) Algorithm
The Strict Equals Operator (===)
Type values Result
----------------------------------------------------------
Number x same value as y true
(but not NaN)
You can find the complete list # javascriptweblog.wordpress.com - truth-equality-and-javascript.
parseInt("") is NaN because the standard says so even if +"" is 0 instead (also simply because the standard says so, implying for example that "" == 0).
Don't look for logic in this because there's no deep profound logic, just history.
You are in my opinion making a BIG mistake... the sooner you correct it the better will be for your programming life with Javascript. The mistake is that you are assuming that every choice made in programming languages and every technical detail about them is logical. This is simply not true.
Especially for Javascript.
Please remeber that Javascript was "designed" in a rush and, just because of fate, it became extremely popular overnight. This forced the community to standardize it before any serious thought to the details and therefore it was basically "frozen" in its current sad state before any serious testing on the field.
There are parts that are so bad they aren't even funny (e.g. with statement or the == equality operator that is so broken that serious js IDEs warn about any use of it: you get things like A==B, B==C and A!=C even using just normal values and without any "special" value like null, undefined, NaN or empty strings "" and not because of precision problems).
Nonsense special cases are everywhere in Javascript and trying to put them in a logical frame is, unfortunately, a wasted effort. Just learn its oddities by reading a lot and enjoy the fantastic runtime environment it provides (this is where Javascript really shines... browsers and their JIT are a truly impressive piece of technology: you can write a few lines and get real useful software running on a gajillion of different computing devices).
The official standard where all oddities are enumerated is quite hard to read because aims to be very accurate, and unfortunately the rules it has to specify are really complex.
Moreover as the language gains more features the rules will get even more and more complex: for example what is for ES5 just another weird "special" case (e.g. ToPrimitive operation behavior for Date objects) becomes a "normal" case in ES6 (where ToPrimitive can be customized).
Not sure if this "normalization" is something to be happy about... the real problem is the frozen starting point and there are no easy solutions now (if you don't want to throw away all existing javascript code, that is).
The normal path for a language is starting clean and nice and symmetric and small. Then when facing real world problems the language gains (is infected by) some ugly parts (because the world is ugly and asymmetrical).
Javascript is like that. Except that it didn't start nice and clean and moreover there was no time to polish it before throwing it in the game.

How is Javascript trying to interpret this illegal statement?

There was a mistake in some Javascript code where the decimal 0.04 was declared as 0.0.4 like the following:
var x = 0.0.4;
When this was run in Firefox the error given was:
SyntaxError: missing ; before statement
IE stated:
Expected ';'
And Chrome stated:
Uncaught SyntaxError: Unexpected number
I understand that 0.0.4 is not a number or a literal but how is Javascript trying to inerpret the statement? Why exactly does it think there is a missing ; ?
Chrome
Chrome sees the first part of the assignment as a floating point number 0.0, for which you want to access property at key 0, which is not allowed in JavaScript (you cannot use dot notation to reference numeric keys). In other words, you can split the assignment to the following equivalent:
var x = 0.0
x.0 // Throws unexpected number
In comparison, this is valid:
var x = 0.0.toString()
In my opinion, this is the most logical implementation of the parser with an error that makes most sense.
Firefox
Firefox sees the first part of the assignment as the floating point number 0.0, but gets confused about the .0 part - it does not recognise this as a property access and instead thinks you want to start a new statement (.4, which is a shorthand for 0.4) - thus it tells you you missed a semicolon, which is used to terminate statements.
It still correctly interprets the statement when you use a valid property accessor (a string):
var x = 0.0.toString()
Somewhat related is also this SO question which discusses similar behaviour.
Depending on the parser (browser) the way it handles a line of code may be different. My thoughts are that when IE and Firefox see a number they only "expect" a single decimal for a number and then numerical characters after it until either a space or a ';' occurs.
If another character type is encountered the parser would throw an exception that it was looking for the end of the number - hence the Expected ';'.
In the realm of how Chrome handled it, it seems like that it can tell it was supposed to be a number, but it was given in an invalid format.
The messages from IE and Firefox make perfect sense. Parse the number part of the statement as 0.0, perform the assignment x = 0.0, then expect an end of statement. If the character following isn't a semi-colon or newline (implicit end of statement), throw an error saying we expect one.
Chrome's is a little more cryptic, but still makes sense with a little thought. It has performed the assignment as with FF and IE (x = 0.0) but has interpreted the dot as a dot-operator rather than a decimal. (x = 0.0.toString() would be perfectly valid.) Rather than stating what it expected, it's stating what it saw, an "unexpected number", which was the 4.
It's unclear whether IE and FF first parsed the .4 as a number or if the unexpected character was simply a dot.

Javascript eval - obfuscation?

I came across some eval code:
eval('[+!+[]+!+[]+!+[]+!+[]+!+[]]');
This code equals the integer 5.
What is this type of thing called? I've tried searching the web but I can't seem to figure out what this is referred to. I find this very interesting and would like to know where/how one learns how to print different things instead of just the integer 5. Letters, symbols and etc. Since I can't pin point a pattern in that code I've had 0 success taking from and adding to it to make different results.
Is this some type of obfuscation?
This type of obfuscation, eval() aside, is known as Non-alphanumeric obfuscation.
To be completely Non-alphanumeric, the eval would have to be performed by Array constructor prototypes functions and subscript notation:
[]["sort"]["constructor"]("string to be evaled");
These strings are then converted to non-alphanumeric form.
AFAIK, it was first proposed by Yosuke Hosogawa around 2009.
If you want to see it in action see this tool: http://www.jsfuck.com/
It is not considered a good type of obfuscation because it is easy to reverse back to the original source code, without even having to run the code (statically). Plus, it increases the file size tremendously.
But its an interesting form of obfuscation that explores JavaScript type coercion. To learn more about it I recommend this presentation, slide 33:
http://www.slideshare.net/auditmark/owasp-eu-tour-2013-lisbon-pedro-fortuna-protecting-java-script-source-code-using-obfuscation
That's called Non-alphanumeric JavaScript and it's possible because of JavaScript type coercion capabilities. There are actually some ways to call eval/Function without having to use alpha characters:
[]["filter"]["constructor"]('[+!+[]+!+[]+!+[]+!+[]+!+[]]')()
After you replace strings "filter" and "constructor" by non-alphanumeric representations you have your full non-alphanumeric JavaScript.
If you want to play around with this, there is a site where you can do it: http://www.jsfuck.com/.
Check this https://github.com/aemkei/jsfuck/blob/master/jsfuck.js for more examples like the following:
'a': '(false+"")[1]',
'b': '(Function("return{}")()+"")[2]',
'c': '([]["filter"]+"")[3]',
...
To get the value as 5, the expression should have been like this
+[!+[] + !+[] + !+[] + !+[] + !+[]]
Let us analyze the common elements first. !+[].
An empty array literal in JavaScript is considered as Falsy.
+ operator applied to an array literal, will try convert it to a number and since it is empty, JavaScript will evaluate it to 0.
! operator converts 0 to true.
So,
console.log(!+[]);
would print true. Now, the expression can be reduced like this
+[true + true + true + true + true]
Since true is treated as 1 in arithmetic expressions, the actual expression becomes
+[ 5 ]
The + operator tries to convert the array ([ 5 ]) to a number and which results in 5. That is why you are getting 5.
I don't know of any term used to describe this type of code, aside from "abusing eval()".
I find this very interesting and would like to know where/how one
learns how to print different things instead of just the integer 5.
Letters, symbols and etc. Since I can't pin point a pattern in that
code I've had 0 success taking from and adding to it to make different
results.
This part I can at least partially answer. The eval() you pasted relies heavily on Javascript's strange type coercion rules. There are a lot of pages on the web describing various strange consequences of the coercion rules, which you can easily search for. But I can't find any reference on type coercion with the specific goal of getting "surprising" output from things like eval(), unless you count this video by Destroy All Software (the Javascript part starts at 1:20). Most of them understandably focus on how to avoid strange bugs in your code. For your purpose, I believe the most useful things I know of are:
The ! operator will convert anything to a boolean.
The unary + operator will convert anything to a number.
The binary + operator will coerce its arguments to either numbers or strings before adding or concatenating them. Normally it will only go with strings if one or the other argument is already a string, but there are exceptions.
The bitwise operators output integers, regardless of input.
Converting to numbers is complicated. Booleans will go to 0 or 1. Strings will attempt to use parseInt(), or produce NaN if that fails. I forget the rest.
Converting an object to a string or number will invoke its "toString" or "toValue" method respectively, if one exists.
You get the idea. thefourtheye's answer walks through exactly how these rules apply to the example you gave. I'm not even going to try summarizing what happens when Dates, Functions, and Regexps are involved.
Is this some type of obfuscation?
Normally you'd simply get obfuscation for free as part of minification, so I have no idea why someone would write that in real production code (assuming that's where you found it).

Categories

Resources