Parsing CalculiX

by Stéphane Graham-Lengrand

The goal of this TD is to write a parser for CalculiX, the language for which we implemented an interpreter in TD12. In that TD, we gave you a parser so that you could test your code. Today, it is your turn to write the parser, reading a sequence of tokens recognized by a lexical analyzer (lexer) and constructing expressions according to a grammar.

Preparation

The setup for this TD is identical to TD12. Create a new project TD14, download td_14.zip and unzip it in the project directory. Add the two libraries in the directory lib to the project ("Java Build Path -> Add JARs").

The package CalculiX contains a number of classes, of which the following are mainly of interest in this TD:

The class ExpressionFactory that you have implemented in the previous TD enables the parser to construct expressions.
The class Token represents lexical tokens recognized by the lexer. Each token has a symbol, a location, and an optional associated value (either a String or an Integer). We will use tokens through four methods: symbol() returns the symbol, asInteger() and asString() return the value (if any) converted to an int or a String, and syntaxError(s) constructs and returns an exception with an error message that indicates the location of the token together with the string s.
```
public class Token {
  int symbol(){...}
  int asInteger(){...} // returns content as an int
  String asString(){...} // returns content as a String
  SyntaxError syntaxError(String s){...} // constructs a SyntaxError
}
```
Note that what we call 'symbol' is in fact implemented as an int, following the correspondence defined in the file Symbols.java.
The class MyLexer recognizes and returns a sequence of tokens from a CalculiX program; we will use a method peek() to read the next token (without consuming it), a method consumeToken() that both reads and consumes a token, and a method consumeToken(int symbol) that consumes a token and verifies that its symbol matches the expected value (otherwise, it throws an exception of class SyntaxError).
```
public class MyLexer {
  Token peek() {...}
  Token consumeToken() {...}
  Token consumeToken(int expected) throws SyntaxError {...}
}
```
The incomplete class MyParser where you will implement the parser.
The class Main for testing our parser.

In TD12, we described the CalculiX language (up to Boolean expressions) by using the following grammar:

Integers	`i ::= ... | -1 | 0 | 1 | ...`
Integer Operators	`op ::= + | - | * | /`
Integer Comparators	`co ::= < | <= | ==`
Booleans	`b ::= true | false`
Boolean Connectives	`bop ::= && | ||`
Constants	`c ::= i | b`
Expressions	`e ::= c | e op e | e co e | e bop e | if e then e else e`

In the lecture last week, you saw a generic method for turning a grammar such as the one above into a non-directional and non-deterministic parser. This afternoon, we implement a (more efficient) directional and deterministic parser for the CalculiX language by adapting the above grammar to our needs. We do this step by step.

Implementation

Integers

The first grammar that we shall parse only accepts integer constants:

Integers	`i ::= ... | -1 | 0 | 1 | ...`
Constants	`c ::= i`
Expressions	`e ::= c`

The class MyParser contains a field lexer, an expression factory factory, a constructor for initializing these fields and the following parse() function:

  Expression parse () throws SyntaxError
  {
    Expression e = consumeExpression();
    Token t = lexer.peek();
    switch (t.symbol()) {
    case EOF:
      return e;
    default:
      throw t.syntaxError ("expected EOF");
    }
  }

It parses an expression e, and then uses the peek function to look at the next available token without consuming it. If that token carries the symbol EOF, parse returns the parsed expression e; otherwise it throws a syntax error located at that token. Since consumeExpression currently always returns null and consumes nothing, parse accepts only the empty input for now.

In the class MyParser, write a new method Expression consumeConstant() that

consumes a token t,
verifies that t contains an integer (that is, that t.symbol() returns INTEGER),
extracts its contents by calling t.asInteger(),
constructs an integer expression by calling factory.buildConstant, and
returns the resulting expression.

If the token is not an integer, throw a syntax error built with t.syntaxError (as in parse above).
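The steps above can be sketched as follows. This is only an illustration under stated assumptions: the nested Token class, the array-backed token source, the symbol values and the IllegalStateException are hypothetical stand-ins for the handout's Token, MyLexer, Symbols.java and SyntaxError, and the method returns a plain int where the real method would return the expression built by factory.buildConstant (whose exact signature is not shown in this handout).

```java
class ConstantSketch {
    // Hypothetical symbol codes; the real ones live in Symbols.java.
    static final int INTEGER = 0, EOF = 1;

    // Stand-in for the handout's Token: a symbol plus its text.
    static final class Token {
        private final int symbol;
        private final String text;
        Token(int symbol, String text) { this.symbol = symbol; this.text = text; }
        int symbol() { return symbol; }
        int asInteger() { return Integer.parseInt(text); }
        // Stand-in for syntaxError(s): an unchecked exception instead of SyntaxError.
        IllegalStateException syntaxError(String s) {
            return new IllegalStateException("at '" + text + "': " + s);
        }
    }

    // Stand-in for the lexer: tokens are handed over in an array.
    private final Token[] tokens;
    private int pos = 0;
    ConstantSketch(Token... tokens) { this.tokens = tokens; }
    Token consumeToken() { return tokens[pos++]; }

    // c ::= i
    int consumeConstant() {
        Token t = consumeToken();               // consume a token
        if (t.symbol() != INTEGER) {            // verify that it is an integer
            throw t.syntaxError("expected an integer");
        }
        return t.asInteger();                   // real code: build and return an expression
    }

    public static void main(String[] args) {
        ConstantSketch p = new ConstantSketch(new Token(INTEGER, "7"), new Token(EOF, "#"));
        System.out.println(p.consumeConstant()); // prints 7
    }
}
```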

Now, modify the function consumeExpression to call consumeConstant as follows:

  Expression consumeExpression () throws SyntaxError
  {
    return consumeConstant();
  }

Our parser now implements a grammar that accepts integer constants but nothing else.

Test your code by running the Main class which contains a sequence of tests. The line

int j = 0;

instructs the main method to perform the first j tests on integer expressions (initially, 0). If your code works correctly, the first 2 integer tests should produce

Parsed "0" as 0
Evaluated as 0, expecting 0
Parsed "7" as 7
Evaluated as 7, expecting 7

Submit MyParser.java

Additive Expressions

We now extend the grammar to accept additions and subtractions. The grammar from TD12 suggests implementing the following grammar in our parser:

Additive Expressions	`ae ::= c | ae + ae | ae - ae`
Expressions	`e ::= ae`

Unfortunately, this grammar is ambiguous (the word c + c - c can be accepted with two different production trees).
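To see why the ambiguity matters, here is a small worked example. It uses the word c - c - c rather than c + c - c: both can be derived with two different production trees, but with two subtractions the two trees also evaluate to different integers. The class and method names are purely illustrative.

```java
class AmbiguitySketch {
    // First production tree: the left subexpression is built first, (a - b) - c.
    static int groupLeft(int a, int b, int c) { return (a - b) - c; }

    // Second production tree: the right subexpression is built first, a - (b - c).
    static int groupRight(int a, int b, int c) { return a - (b - c); }

    public static void main(String[] args) {
        System.out.println(groupLeft(3, 2, 1));  // prints 0
        System.out.println(groupRight(3, 2, 1)); // prints 2
    }
}
```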

In order to write our deterministic parser, we use a different grammar to define the same language:

Additive Expression Tail	`aet ::= + c aet | - c aet | ε`
Additive Expressions	`ae ::= c aet`
Expressions	`e ::= ae`

An expression is now defined as an additive expression, which is in turn defined as a constant followed by an additive expression tail. The tail may be empty (denoted ε), or may consist of a sequence of expressions of the form +/- c +/- c +/- c ...

To parse this grammar, we define an oracle that tells us which decision to make when we face a choice between several productions. This oracle can be written as the following two-dimensional table, where each column corresponds to a non-terminal symbol that we are trying to parse, and each row corresponds to the first token to be consumed (a terminal symbol):

	ae	aet	c
i	c aet		i
+		+ c aet
-		- c aet
#		ε

In words:

If we encounter an INTEGER symbol (denoted i):
- if we are trying to parse an additive expression, we start by parsing a constant (c) and then we parse an additive expression tail (aet);
- if we are parsing a constant, we consume the INTEGER symbol as in the previous exercise.
If we encounter the operator symbols PLUS or MINUS and we are trying to parse an expression tail, we consume the terminal symbol PLUS or MINUS, then we parse a constant, and finally another expression tail.
If we encounter the EOF symbol (#) and we are trying to parse an expression tail, then that expression tail is just ε and we have finished.
In all other cases, where the table cell is blank, we are in a situation of syntax error.

Implement columns (ae) and (aet) of this oracle as two functions Expression consumeAdditiveExpression() and Expression consumeAdditiveExpressionTail(Expression head). Note that consumeAdditiveExpressionTail takes as its argument the expression that was parsed before entering the tail. Then, modify consumeExpression to call consumeAdditiveExpression instead of consumeConstant.

WARNINGS!

Remember to first look at the token using lexer.peek() and only consume it (using lexer.consumeToken()) if it is part of the current non-terminal. For example, in an additive expression, if you see an INTEGER as the first available token, you should not consume it right away but leave that job to consumeConstant.
Your functions directly produce expressions (i.e. Abstract Syntax Trees) without explicitly building production trees. No matter the shape of the production trees that the grammar specifies, we impose that the additive expressions that you produce are left-associative; in other words make sure that parsing 3 - 2 - 1 produces the expression corresponding to (3 - 2) - 1.
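One possible way to obtain left-associativity is to use the head argument as an accumulator: each time the tail consumes an operator and a constant, it combines them with head and passes the result on to the next tail. The sketch below illustrates this idea under simplifying assumptions that are not part of the TD code: tokens are whitespace-separated strings (with "#" appended as EOF) instead of Token objects, integers are non-negative, errors are IllegalStateExceptions, and "expressions" are fully parenthesized strings rather than objects built by ExpressionFactory.

```java
import java.util.ArrayList;
import java.util.List;

class AdditiveSketch {
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    AdditiveSketch(String input) {
        for (String t : input.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        tokens.add("#"); // EOF marker, written # as in the oracle table
    }

    String peek() { return tokens.get(pos); }
    String consumeToken() { return tokens.get(pos++); }

    // c ::= i
    String consumeConstant() {
        String t = consumeToken();
        if (!t.matches("\\d+")) {
            throw new IllegalStateException("expected an integer, found " + t);
        }
        return t;
    }

    // ae ::= c aet   (only the row i of column ae is filled)
    String consumeAdditiveExpression() {
        if (!peek().matches("\\d+")) {
            throw new IllegalStateException("expected an integer, found " + peek());
        }
        String head = consumeConstant(); // the constant consumes its own token
        return consumeAdditiveExpressionTail(head);
    }

    // aet ::= + c aet | - c aet | ε
    String consumeAdditiveExpressionTail(String head) {
        String op = peek();
        switch (op) {
            case "+":
            case "-": {
                consumeToken();
                String right = consumeConstant();
                // Combine with head BEFORE parsing the rest: this is what
                // makes the result left-associative.
                return consumeAdditiveExpressionTail("(" + head + " " + op + " " + right + ")");
            }
            case "#":
                return head; // ε: nothing to consume
            default:
                throw new IllegalStateException("unexpected token " + op);
        }
    }

    static String parse(String input) {
        return new AdditiveSketch(input).consumeAdditiveExpression();
    }

    public static void main(String[] args) {
        System.out.println(parse("3 - 2 - 1")); // prints ((3 - 2) - 1)
    }
}
```

Building the combined expression only after the recursive call returns would instead yield the right-nested shape 3 - (2 - 1).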

For those who have time, have a look at this comment.

Test your code by running Main; check which tests now succeed and which fail.

Submit MyParser.java

Parenthetical Expressions

We extend the grammar to include parenthetical expressions ( e ) which allow us to write more precise expressions. For example, 3 - 2 - 1 is different from 3 - (2 - 1).

We introduce a new non-terminal called atomic expressions that include both constants and parenthetical expressions. Additive expressions are now written in terms of atomic expressions, not constants.

Atomic Expressions	`a ::= c | ( e )`
Additive Expression Tail	`aet ::= + a aet | - a aet | ε`
Additive Expressions	`ae ::= a aet`
Expressions	`e ::= ae`

To parse this grammar, we extend the oracle as follows:

	ae	aet	a	c
i	a aet		c	i
(	a aet		(e)
)		ε
+		+ a aet
-		- a aet
#		ε

In particular, when we encounter a LPAREN symbol, we are parsing an atomic expression, and when we encounter a RPAREN symbol we finish parsing an additive expression tail.

Implement Expression consumeAtomicExpression(), and modify Expression consumeAdditiveExpression() and Expression consumeAdditiveExpressionTail(Expression head) to implement the above oracle.
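A minimal sketch of this extended oracle is given below, again under assumptions that differ from the TD code: tokens are whitespace-separated strings with "#" as EOF, integers are non-negative, errors are IllegalStateExceptions, and the functions evaluate directly to an int instead of building expressions with ExpressionFactory. The point to notice is that the tail returns without consuming anything when it sees ")" or "#", leaving the closing parenthesis to the atomic expression that opened it.

```java
import java.util.ArrayList;
import java.util.List;

class AtomicSketch {
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    AtomicSketch(String input) {
        for (String t : input.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        tokens.add("#"); // EOF marker
    }

    String peek() { return tokens.get(pos); }
    String consumeToken() { return tokens.get(pos++); }

    // Mirrors consumeToken(int symbol): consume and check the expected token.
    void consumeToken(String expected) {
        String t = consumeToken();
        if (!t.equals(expected)) {
            throw new IllegalStateException("expected " + expected + ", found " + t);
        }
    }

    // e ::= ae
    int consumeExpression() { return consumeAdditiveExpression(); }

    // a ::= c | ( e )
    int consumeAtomicExpression() {
        String t = peek();
        if (t.equals("(")) {
            consumeToken("(");
            int inner = consumeExpression();
            consumeToken(")");
            return inner;
        }
        if (t.matches("\\d+")) {
            return Integer.parseInt(consumeToken());
        }
        throw new IllegalStateException("expected an integer or (, found " + t);
    }

    // ae ::= a aet
    int consumeAdditiveExpression() {
        return consumeAdditiveExpressionTail(consumeAtomicExpression());
    }

    // aet ::= + a aet | - a aet | ε
    int consumeAdditiveExpressionTail(int head) {
        switch (peek()) {
            case "+": {
                consumeToken();
                return consumeAdditiveExpressionTail(head + consumeAtomicExpression());
            }
            case "-": {
                consumeToken();
                return consumeAdditiveExpressionTail(head - consumeAtomicExpression());
            }
            case ")":
            case "#":
                return head; // ε: leave the token for the caller
            default:
                throw new IllegalStateException("unexpected token " + peek());
        }
    }

    static int parseAndEvaluate(String input) {
        AtomicSketch p = new AtomicSketch(input);
        int value = p.consumeExpression();
        if (!p.peek().equals("#")) {
            throw new IllegalStateException("expected EOF, found " + p.peek());
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(parseAndEvaluate("3 - 2 - 1"));     // prints 0
        System.out.println(parseAndEvaluate("3 - ( 2 - 1 )")); // prints 2
    }
}
```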

Test your code with Main and submit MyParser.java

Multiplicative Expressions

We extend the grammar to include operators for multiplication and division which have higher precedence than addition and subtraction. To represent this precedence, we create a new non-terminal symbol for multiplicative expressions as follows:

Multiplicative Expression Tail	`met ::= * a met | / a met | ε`
Multiplicative Expressions	`me ::= a met`
Additive Expression Tail	`aet ::= + me aet | - me aet | ε`
Additive Expressions	`ae ::= me aet`

Design an oracle for this grammar and implement it as two new functions Expression consumeMultiplicativeExpression() and Expression consumeMultiplicativeExpressionTail(Expression head). Modify the other functions accordingly.

Again, make sure that the multiplicative expressions that you produce are left-associative, i.e. make sure that parsing 12 / 4 / 2 produces the expression corresponding to (12 / 4) / 2.
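The following sketch shows one way the two layers might fit together; it is an illustration rather than the expected solution. It assumes whitespace-separated string tokens with "#" as EOF, non-negative integers, IllegalStateException in place of SyntaxError, and fully parenthesized strings in place of expressions built by ExpressionFactory. The oracle it encodes for met (ε on +, -, ) and #) is one reading of the grammar above.

```java
import java.util.ArrayList;
import java.util.List;

class MultiplicativeSketch {
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    MultiplicativeSketch(String input) {
        for (String t : input.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        tokens.add("#"); // EOF marker
    }

    String peek() { return tokens.get(pos); }
    String consumeToken() { return tokens.get(pos++); }

    private IllegalStateException error(String expected) {
        return new IllegalStateException("expected " + expected + ", found " + peek());
    }

    // a ::= c | ( e ), with e ::= ae
    String consumeAtomicExpression() {
        String t = peek();
        if (t.matches("\\d+")) return consumeToken();
        if (!t.equals("(")) throw error("an integer or (");
        consumeToken();
        String inner = consumeAdditiveExpression();
        if (!peek().equals(")")) throw error(")");
        consumeToken();
        return inner;
    }

    // me ::= a met
    String consumeMultiplicativeExpression() {
        return consumeMultiplicativeExpressionTail(consumeAtomicExpression());
    }

    // met ::= * a met | / a met | ε
    String consumeMultiplicativeExpressionTail(String head) {
        String op = peek();
        switch (op) {
            case "*":
            case "/": {
                consumeToken();
                String right = consumeAtomicExpression();
                return consumeMultiplicativeExpressionTail("(" + head + " " + op + " " + right + ")");
            }
            case "+": case "-": case ")": case "#":
                return head; // ε: the additive layer or the caller takes over
            default:
                throw error("an operator, ) or EOF");
        }
    }

    // ae ::= me aet
    String consumeAdditiveExpression() {
        return consumeAdditiveExpressionTail(consumeMultiplicativeExpression());
    }

    // aet ::= + me aet | - me aet | ε
    String consumeAdditiveExpressionTail(String head) {
        String op = peek();
        switch (op) {
            case "+":
            case "-": {
                consumeToken();
                String right = consumeMultiplicativeExpression();
                return consumeAdditiveExpressionTail("(" + head + " " + op + " " + right + ")");
            }
            case ")": case "#":
                return head;
            default:
                throw error("+, -, ) or EOF");
        }
    }

    static String parse(String input) {
        MultiplicativeSketch p = new MultiplicativeSketch(input);
        String e = p.consumeAdditiveExpression();
        if (!p.peek().equals("#")) throw p.error("EOF");
        return e;
    }

    public static void main(String[] args) {
        System.out.println(parse("1 + 2 * 3"));  // prints (1 + (2 * 3))
        System.out.println(parse("12 / 4 / 2")); // prints ((12 / 4) / 2)
    }
}
```

Because the additive tail asks for a whole multiplicative expression on the right of + or -, any * or / is grouped before the addition is built, which is how the layering encodes precedence.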

Test your code with Main and submit MyParser.java

Comparisons and Boolean Expressions

We extend the syntax with the boolean constants true and false, comparison operators (<, <=, ==), and boolean connectives (&&, ||). All comparison operators have the same precedence and they have lower precedence than addition and subtraction. Disjunction (||) has lower precedence than conjunction (&&) which has lower precedence than comparison.

Extend the grammar to accept this extended syntax; design an oracle for it and implement it by adding the corresponding parsing functions to MyParser; adapt the consumeExpression function to start parsing the constructs with lowest precedence.

Again, test your code with Main: you can test some boolean expressions by changing the value of variable k in the main method.

Submit MyParser.java

If Then Else

Finally, we extend the syntax with the if then else construct, which has the lowest precedence of all.

Extend the grammar to accept this construct; design an oracle for it and implement it by adding the corresponding parsing function to MyParser; adapt consumeExpression accordingly.

Test your code with Main and submit MyParser.java
