Lecture 3: Syntactic analysis ("parsing")

The course Compilers and interpreters | Lectures: 1 2 3 4 5 6 7 8 9 10 11 12

These are my own notes, made for use during the lecture, and they are approximately what I will be saying there. They may be brief, incomplete and hard to understand, not to mention in the wrong language, and they do not replace the lecture or the book, but there is no reason to keep them secret if someone wants to look at them.

Today: How to write a parser by hand

ALSU-07 section 2.4 and 2.5
The two simple "compilers" from ASU-86: Program 2.5, and Program 2.9
Grammatiktransformationer (also available in English: Grammar Transformations)
(The old edition of the book: ASU-86 2.4, 2.5, 2.9)

2.4 Parsing

So, how does the parser build the parse tree?
Or rather: how does the parser "navigate through the grammar", guided by the source tokens it sees, in such a way that it could build a parse tree?

Top-down: simpler to build (by hand)
Bottom-up: harder to build (so we use tools such as Yacc), can handle a larger class of grammars

Top-down parsing (2.4.1)

(We use the example from the old edition of the book, since it is better.)

Example: Types in Pascal.

simple -> integer
simple -> char
simple -> num dotdot num
type -> simple
type -> ^ id
type -> array [ simple ] of type

Examples:

integer
1..100
^kundpost
array [ integer ] of char
array [ 20..40 ] of 0..3
array [ 1..5 ] of array [ 1..3 ] of 2..5

Example source program:
array [ 1 .. 10 ] of integer

Tokens:
array [ num dotdot num ] of integer

Top-down parsing, starting with a node for the non-terminal type.
Which of the productions (=rules) should we use? (Answer: type -> array ....)
Why? (Answer: The only production that starts with array.)

"Lookahead symbol" = "current token" = Sw: "aktuell token"

ASU-86 fig 2.15:

Steps in top-down construction of a parse tree

ASU-86 fig 2.16:

Top-down parsing while scanning the input from left to right

Predictive parsing (2.4.2)

Note about backtracking: If it turns out that we have chosen the wrong production, we have to backtrack. But in this case, we know we have the right production, since it is the only one that starts with array.

If there is always just one possible production to choose: predictive parsing, no backtracking

Recursive-descent parsing

Predictive parsing = predictive recursive-descent parsing

Recursive-descent parsing = the parser is a program with a procedure (in C: "function") for each non-terminal.

void simple() {
  if (lookahead == integer)
    match(integer);
  else if (lookahead == char)
    match(char);
  else if (lookahead == num) {
    match(num); match(dotdot); match(num);
  }  
  else
    error();
}

match() checks that the lookahead symbol (=current token) is the expected one, and gets the next token (by calling the lexical analyzer).

When we write the function type(), it is not enough to see which tokens occur as the first token in the productions for type (^ and array). Since simple can occur as the first thing in a type, we must also see which tokens occur as the first token in productions for simple.

void type() {
  if (lookahead == integer || lookahead == char || lookahead == num)
    simple();
  else if (lookahead == ^) {
    match(^); match(id);
  }
  else if (lookahead == array) {
    match(array); match([); simple(); match(]); match(of); type();
  }
  else 
    error();
}

FIRST(x) = all possible first tokens in strings that match x.

FIRST(simple) = the set { integer, char, num }
FIRST(^ id) = the set { ^ }
FIRST(array [ simple ] of type) = the set { array }
FIRST(type) = the set { ^, array } + the set FIRST(simple) = the set { ^, array, integer, char, num }

Recursive-descent without backtracking: if the grammar contains two rules...

NT -> something
NT -> somethingelse

...then FIRST(something) and FIRST(somethingelse) must be disjoint (=no common elements).
You may have to rewrite your grammar to achieve that! (We will discuss left factoring, in Swedish vänsterfaktorisering, later.)
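
The disjointness condition can be checked mechanically. A minimal sketch, assuming each token is represented by one bit so that a FIRST set fits in an unsigned integer; TokenSet, first_sets_disjoint and the T_ names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* One bit per token, so a set of tokens fits in an unsigned int. */
enum {
    T_INTEGER = 1 << 0, T_CHAR = 1 << 1, T_NUM = 1 << 2,
    T_CARET = 1 << 3, T_ARRAY = 1 << 4
};

typedef unsigned TokenSet;

/* Takes the FIRST sets of the n right-hand sides of one non-terminal.
   Returns 1 if no token occurs in more than one of them, else 0. */
int first_sets_disjoint(const TokenSet *first, size_t n) {
    assert(first != NULL);
    TokenSet seen = 0;
    for (size_t i = 0; i < n; i++) {
        if (seen & first[i])
            return 0;          /* this token would predict two productions */
        seen |= first[i];
    }
    return 1;
}
```

For the three productions for type, the sets are { T_INTEGER | T_CHAR | T_NUM, T_CARET, T_ARRAY }, and the function returns 1. If we added a (hypothetical) production type -> num, its FIRST set { num } would overlap with FIRST(simple), and the function would return 0.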

When to use empty-productions (2.4.3)

Empty productions (productions that let a non-terminal match the empty string) are sometimes needed:

stmt -> begin opt_stmts end
opt_stmts -> stmt_list | empty

Use opt_stmts -> empty if nothing else matches opt_stmts, which will be when lookahead is end.

Designing a predictive parser (2.4.4)

Syntax-directed translation scheme -> predictive parser:

Write the grammar (unambiguous!)
Rewrite the grammar to remove FIRST() conflicts and left recursion
Construct a predictive parser, ignoring the semantic rules/actions.
Insert the semantic actions.

Left-recursion (2.4.5)

Left-recursive grammar:

expr -> expr + term

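Implementation of a predictive, recursive-descent parser (which does not work: expr() calls itself before any token has been consumed, so it recurses forever):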

void expr() {
  if (lookahead == num || lookahead == id)
    expr();
  else
...

Eliminate left recursion! Rewrite this left-recursive production:

A -> A x | y

(A, x and y are just placeholders. A stands for some non-terminal, while x and y stand for sequences of both terminals and non-terminals. The book uses Greek letters: A -> A α | β)

into these, which are not left-recursive:

A -> y R
R -> x R | empty

(We can show that both these grammars describe the same language, namely one y followed by zero, one or more x, by systematically expanding the start symbol A in both grammars.)

Example: rewrite this left-recursive production:

expr -> expr + term | term

into these, which are not left-recursive:

expr -> term rest
rest -> + term rest | empty

A = expr
x = + term
y = term
R = rest

First version: An expression is either a term, or we already have an expression (which may contain one or more terms) and we create a new expression by adding a plus sign and a term.

Second version: An expression consists of a term, followed by a rest. This rest can either be empty (so the expression only consisted of that single term), or it consists of a plus sign and a term, followed by yet another rest (which may be empty or can contain one or more terms).

But note this!
The new, rewritten grammar contains different productions than the original did. Because of that, we'll obviously get a different parse tree for a particular expression! Among other things, the + operator has now become right associative instead of left associative! (Draw the different parse trees for term + term + term, if we pretend that term is a terminal.)

But, as we will see in the next section: if you have a grammar with embedded semantic actions (that is, a so-called syntax-directed translation scheme), and you include the semantic actions in the grammar transformations, they will be performed in the same order.

2.5 A translator for simple expressions

(We use the C program from the old book, ASU-86, instead of the Java program from the new one, ALSU-07. The source code is here.)

Ex: 2, 2+3, 2-3, 2-3+4+5-6

ASU-86 Fig 2.13/2.19 translation scheme for the program in 2.5:

expr -> expr + term { print('+') }
expr -> expr - term { print('-') }
expr -> term
term -> 0 { print('0') }
term -> 1 { print('1') }
term -> 2 { print('2') }
...
term -> 9 { print('9') }

Remember that term -> term + term was ambiguous (Sw: tvetydig), so we transformed the grammar.
We also saw how to handle operator associativity and precedence.

Rewrite the grammar to eliminate left recursion.
Important: Handle the semantic actions as part of the grammar, when you rewrite it using the left-recursion elimination technique from above, to get the prints done at the right places!

expr -> term rest
rest -> + term { print('+') } rest
rest -> - term { print('-') } rest
rest -> empty
term -> 0 { print('0') }
term -> 1 { print('1') }
term -> 2 { print('2') }
...
term -> 9 { print('9') }

The interesting parts of the program:

void term () {
  if (isdigit(lookahead)) {
    putchar(lookahead);
    match(lookahead);
  }
  else
    error();
}

void rest() {
  if (lookahead == '+') {
    match('+'); term(); putchar('+'); rest();
  }
  else if (lookahead == '-') {
    match('-'); term(); putchar('-'); rest();
  }
  else
    ;  /* rest -> empty: match nothing */
}

void expr() {
  term(); rest();
}

Full source code: 2.5

2.5.4 Simplifying the Translator

Skip this section. Tail recursion will be eliminated automatically by a good, modern C compiler.

The rest is about Program 2.9.

2.6 Lexical analysis

Skip. Just know that lexan() (called scan in the new Java program) gets the next token. It returns a token type (NUM, DIV etc), and places a lexical value in the variable tokenval.

For example, when the source being read contains fnord, a call to lexan() will return 259 (which is the value ID, meaning that we have found an identifier), and put (for example) 16 in the variable tokenval, meaning that the identifier that was found is identifier number 16 in the symbol table.
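
A much simplified sketch of such a function, to show the interplay between the return value and tokenval. It is not the code in lexer.c. Assumptions: the source is a string, ID = 259 is taken from the example above while the values of NUM and DONE are guesses, and the made-up helper intern() stands in for the symbol table. Since the table starts out empty here, the first identifier gets number 0 and not 16.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum { NUM = 256, ID = 259, DONE = 260 };  /* ID as in the notes; NUM and DONE assumed */

int tokenval;                    /* lexical value of the most recent token */

static const char *src;          /* the part of the source not yet read */
static char names[32][16];       /* hypothetical mini symbol table */
static int n_names;

/* Hypothetical stand-in for the symbol table: returns the number of
   name, adding it to the table if it has not been seen before. */
static int intern(const char *name) {
    for (int i = 0; i < n_names; i++)
        if (strcmp(names[i], name) == 0)
            return i;
    assert(n_names < 32);        /* the sketch has no other overflow handling */
    strcpy(names[n_names], name);
    return n_names++;
}

void lexan_init(const char *text) {
    assert(text != NULL);
    src = text;
    n_names = 0;
}

/* Returns the type of the next token, and sets tokenval. */
int lexan(void) {
    while (*src == ' ' || *src == '\t' || *src == '\n')
        src++;                                   /* skip white space */
    if (*src == '\0')
        return DONE;
    if (isdigit((unsigned char)*src)) {          /* a number */
        tokenval = 0;
        while (isdigit((unsigned char)*src))
            tokenval = tokenval * 10 + (*src++ - '0');
        return NUM;
    }
    if (isalpha((unsigned char)*src)) {          /* an identifier */
        char name[16];
        size_t len = 0;
        while (isalnum((unsigned char)*src)) {
            if (len < sizeof name - 1)
                name[len++] = *src;              /* overlong names are truncated */
            src++;
        }
        name[len] = '\0';
        tokenval = intern(name);
        return ID;
    }
    tokenval = -1;                               /* no lexical value */
    return (unsigned char)*src++;                /* the character is its own token type */
}
```

After lexan_init("fnord + 42"), three calls to lexan() return ID (with tokenval 0), '+' and NUM (with tokenval 42), and a fourth call returns DONE.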

Source code: lexer.c

2.7 Incorporating a symbol table

Skip. Just know two functions: insert() and lookup() (called put and get, respectively, in the new Java program)
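
A minimal sketch of what two such functions could look like, using a plain array with linear search. This is not the code in symbol.c: the table size, the name length limit, the struct layout and the convention that index 0 means "not found" are all assumptions made for this example.

```c
#include <assert.h>
#include <string.h>

#define SYMMAX 100    /* assumed table size */
#define NAMEMAX 32    /* assumed maximum name length, including the '\0' */

struct entry {
    char name[NAMEMAX];
    int token;        /* token type, for example the value of ID */
};

static struct entry symtable[SYMMAX];
static int lastentry = 0;   /* index 0 is never used, so 0 can mean "not found" */

/* Returns the index of name in the table, or 0 if it is not there. */
int lookup(const char *name) {
    assert(name != NULL);
    for (int i = lastentry; i > 0; i--)
        if (strcmp(symtable[i].name, name) == 0)
            return i;
    return 0;
}

/* Adds name with the given token type.
   Returns its index, or 0 if the name or the table is too large. */
int insert(const char *name, int token) {
    assert(name != NULL);
    if (lastentry + 1 >= SYMMAX || strlen(name) >= NAMEMAX)
        return 0;
    lastentry++;
    strcpy(symtable[lastentry].name, name);
    symtable[lastentry].token = token;
    return lastentry;
}
```

A lexical analyzer would typically call lookup() first, and insert() only if lookup() returned 0. Keywords such as div can be inserted before parsing starts, with their own token type, so that they are found by lookup() and never treated as identifiers.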

Source code: symbol.c

2.8 Intermediate Code Generation

Skip for now.

The "2.9" program

Source code: 2.9

Remember how stack machines work:

If the next token is a number: push the number on the stack
If the next token is an operator: pop, pop, calculate, push the result on the stack

Thomas Padron-McCarthy (thomas.padron-mccarthy@oru.se), September 13, 2022