Compilers and Interpreters (Kompilatorer och interpretatorer): Lecture 2

Note: This is an outline of what I intend to say in the lecture. It is not a definition of the course content, and it does not replace the textbook.

Today: More about syntax analysis ("parsing"),
Aho et al, sections 2.1 - 2.4

But first, what remained from lecture 1:

1.5 The Grouping of Phases

Front end (almost = analysis). Independent of the target machine. Connected to the source language.
Back end (almost = synthesis). Dependent on the target machine. (Sort of) independent of the source language.
Pass. One reading (and writing) of the source program (in some form). Usually contains several phases (scanning, parsing...).

ASU p 21, Reducing the number of passes:

memory usage
how to connect the parser and the scanner
backpatching

More that remained from lecture 1:

1.6 Compiler-Construction Tools

Parser generators (ex: Yacc, Bison)
Scanner generators (ex: Lex, Flex)
Data flow analysis (Swedish: dataflödesanalys) - what is that? For optimization. Ex: Slicing.
Code-generator generators (Sw: kodgeneratorgeneratorer)

ASU p 22: "Compiler-compiler" = a complete system for compiler building. But! "Yacc" = "Yet Another Compiler-Compiler" is a parser generator.

2.1 Overview

A compiler that translates infix to postfix:

Infix notation	Postfix notation
2 + 3	2 3 +
2 + 3 * 4	2 3 4 * +
2 * 3 + 4	2 3 * 4 +
2 * (3 + 4)	2 3 4 + *

Source and target as text.

Postfix: Stack machine. Easy to write an interpreter.

Push numbers onto the top of the stack.
+: Pop the two top numbers, add, and push the sum.
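
To show how little is needed, here is a minimal sketch of such a stack-machine interpreter in Python. The function name eval_postfix and the choice of space-separated tokens are my own assumptions for the illustration, and only +, - and * are handled.

```python
def eval_postfix(text):
    """Evaluate a postfix expression whose tokens are separated by spaces."""
    stack = []
    for token in text.split():
        if token.isdigit():
            stack.append(int(token))      # push numbers onto the top of the stack
        else:
            right = stack.pop()           # the topmost number is the right operand
            left = stack.pop()
            if token == "+":
                stack.append(left + right)
            elif token == "-":
                stack.append(left - right)
            elif token == "*":
                stack.append(left * right)
            else:
                raise ValueError(f"unknown operator: {token!r}")
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return stack[0]
```

For example, eval_postfix("2 3 4 * +") gives 14 and eval_postfix("2 3 * 4 +") gives 10, matching the infix forms in the table.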

The "2.5" program: simple grammar (Sw: "grammatik") (only + and -), simple parser, very simple scanner (one character = one token).
The "2.9" program: more advanced grammar (identifiers, *, /, mod, div), therefore a more complex parser, a "real" scanner.

2.2 Syntax definition

Example: the if statement in C. An instance:

if (a == b)
  printf("Same!\n");
else
  printf("Not same!\n");

This, as you know, is the syntax for the if statement:

if ( some expression ) some statement else some other statement

A rule that could be part of a context-free grammar (Sw: kontextfri grammatik) for C:

statement -> if ( expression ) statement else statement
statement -> if ( expression ) statement
statement -> { statement-list } (what have we forgotten?)
...

"Context-free": a production "X -> ..." can always be used to replace X with "...", no matter what the rest of the program (that is, the context, Sw: kontext, omgivning) looks like.

A context-free grammar has four parts:

A set of terminals (Sw: terminaler) = terminal symbols = tokens
A set of non-terminals (Sw: icke-terminaler) = non-terminal symbols (compound grammatical constructs)
A set of productions (Sw: produktioner) = rules: non-terminal -> tokens/non-terminals. Each production is a production for the non-terminal on its left-hand side.
A start symbol (Sw: startsymbol): one of the non-terminals, designated as the starting point

Other concepts:

String (Sw: sträng) = a sequence of tokens
ε = the empty string (Sw: tomma strängen)
Language (Sw: språk) = the set of all strings that can be derived from the start symbol (using the productions in the grammar), Sw: mängden av alla strängar som kan härledas från startsymbolen (med hjälp av produktionerna i grammatiken).

Example 2.1 (p. 27)

The language contains strings such as 7+3, 7+3-4+6 and 3 (but not 17, -3 or 2*2).

digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
list -> digit
list -> list + digit
list -> list - digit

The same grammar, with the three list productions combined using |:

digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
list -> digit | list + digit | list - digit

Try 9-5+2, reducing it step by step to the start symbol (each arrow below reads "is reduced to"):
9 -> digit -> list.
5 -> digit.
9-5 -> list - digit -> list
2 -> digit.
9-5+2 -> list + digit -> list
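
The same reductions can be done mechanically. Below is a small sketch of a recognizer for this language; the name is_list is my own, and it assumes one character per token, as in the example.

```python
DIGITS = set("0123456789")

def is_list(text):
    """Return True if text can be reduced to the start symbol 'list'."""
    if not text or text[0] not in DIGITS:
        return False                      # must begin with: digit, reduced to list
    rest = text[1:]
    while rest:
        # Each further step reduces "list + digit" or "list - digit" to list.
        if len(rest) < 2 or rest[0] not in "+-" or rest[1] not in DIGITS:
            return False
        rest = rest[2:]
    return True
```

With this, is_list("7+3-4+6") and is_list("3") are true, while is_list("17"), is_list("-3") and is_list("2*2") are false.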

ASU fig 2.2, the parse tree (= concrete syntax tree) and the syntax tree (= abstract syntax tree):

Parse tree for 9-5+2

The start symbol in the root (Sw: rot).
A token (or the empty string) as each leaf (Sw: löv).
Non-terminals in the inner nodes (Sw: de inre noderna).
The children of each inner node are the right-hand side of a production!
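
As an illustration, here is one possible way to represent such a parse tree in Python. The representation (nested tuples with the non-terminal first, plain strings as token leaves) and the names parse_list and leaves are assumptions made for this sketch. It assumes the input is already a valid string in the language.

```python
def parse_list(text):
    """Build the parse tree for single digits joined by + or -."""
    tree = ("list", ("digit", text[0]))              # list -> digit
    for i in range(1, len(text), 2):
        op, digit = text[i], text[i + 1]
        tree = ("list", tree, op, ("digit", digit))  # list -> list op digit
    return tree

def leaves(node):
    """Read the tokens off the leaves, from left to right."""
    if isinstance(node, str):
        return node                                  # a token leaf
    return "".join(leaves(child) for child in node[1:])
```

The root of parse_list("9-5+2") is labelled list, the start symbol, and leaves(...) of that tree gives back "9-5+2".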

Why list + digit etc? Asymmetrical and ugly? Why not just list + list, like this:

digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
string -> digit | string + string | string - string

ASU fig 2.3 (slide!):

Two parse trees for 9-5+2
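
To see why the ambiguity matters, this sketch tries every operator as the root of the tree, which is exactly the freedom that "string + string" and "string - string" give, and collects the value of each resulting tree. The function name all_values is hypothetical, and one character per token is assumed.

```python
def all_values(tokens):
    """Return the value of every parse tree the ambiguous grammar allows."""
    if len(tokens) == 1:
        return [int(tokens[0])]                      # a single digit is a leaf
    values = []
    for i in range(1, len(tokens), 2):               # choose the operator at the root
        for left in all_values(tokens[:i]):
            for right in all_values(tokens[i + 1:]):
                values.append(left + right if tokens[i] == "+" else left - right)
    return values
```

For "9-5+2" there are two trees: 9-(5+2) with value 2, and (9-5)+2 with value 6.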

Operator associativity (Sw: Operatorassociativitet)

ASU fig 2.4:

Parse trees for left- and right-associative operators

Use a grammar like above, that is,
list -> list + digit
for left-associative (Sw: vänsterassociativa) operators. Use a grammar like
list -> digit + list
for right-associative (Sw: högerassociativa) operators, for example:

right -> letter = right | letter
letter -> a | b | c | ... | z
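
The shape of the production decides the shape of the tree. As a sketch, the two functions below follow the two kinds of productions and make the grouping visible with parentheses. The names group_left and group_right are my own, the base case "a single letter or digit" is assumed, and the input is assumed to be valid.

```python
def group_right(text):
    """Follows right -> letter = right: the recursion is on the right."""
    if len(text) == 1:
        return text                                  # base case: a single letter
    return "(" + text[0] + "=" + group_right(text[2:]) + ")"

def group_left(text):
    """Follows list -> list + digit | list - digit: the recursion is on the left."""
    if len(text) == 1:
        return text                                  # base case: a single digit
    return "(" + group_left(text[:-2]) + text[-2] + text[-1] + ")"
```

So group_right("a=b=c") gives "(a=(b=c))", while group_left("9-5+2") gives "((9-5)+2)".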

Operator precedence (Sw: operatorprioritet, operatorprecedens)

9 + 5 * 2 = 9 + (5 * 2), not (9 + 5) * 2.
"*" has higher precedence than "+".

Express this in the grammar:

factor -> digit | ( expr )
term -> term * factor | term / factor | factor
expr -> expr + term | expr - term | term
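
A sketch of how the three levels give the right precedence: one Python function per non-terminal, where the left-recursive productions are handled with loops. The name evaluate is an assumption, tokens are single characters, and there is no error handling, so it expects well-formed input.

```python
def evaluate(text):
    """Evaluate single digits with + - * / and parentheses."""
    pos = 0

    def factor():                         # factor -> digit | ( expr )
        nonlocal pos
        if text[pos] == "(":
            pos += 1
            value = expr()
            pos += 1                      # skip the closing ")"
            return value
        value = int(text[pos])
        pos += 1
        return value

    def term():                           # term -> term * factor | term / factor | factor
        nonlocal pos
        value = factor()
        while pos < len(text) and text[pos] in "*/":
            op = text[pos]
            pos += 1
            right = factor()
            value = value * right if op == "*" else value / right
        return value

    def expr():                           # expr -> expr + term | expr - term | term
        nonlocal pos
        value = term()
        while pos < len(text) and text[pos] in "+-":
            op = text[pos]
            pos += 1
            right = term()
            value = value + right if op == "+" else value - right
        return value

    return expr()
```

Since expr can only combine complete terms, the multiplication in "9+5*2" is finished before the addition: evaluate("9+5*2") gives 19, and evaluate("(9+5)*2") gives 28.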

2.3 Syntax-directed translation

We do not just build the parse tree for a certain construct, but also keep track of attributes of each subtree. An attribute can be any data that we want to keep track of in the tree, such as the data type of an expression, or even generated code.

Two different types:

Syntax-directed definition = just rules
(Syntax-directed) translation scheme = more procedural

Syntax-directed definitions

Grammar + what to do for each production

Syntax-directed definition = context-free grammar, plus a semantic rule (Sw: semantisk regel) for each production, that specifies how to calculate values of attributes. Example:

Production	Semantic rule
term -> 0	term.output := " 0"
term -> 1	term.output := " 1"
term -> 2	term.output := " 2"
...	...
expr -> expr₁ + term₁	expr.output := expr₁.output + term₁.output + " +"
...	...
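
A sketch of how such rules can be evaluated once the parse tree exists. The tree representation (nested tuples) and the function name output are assumptions for this illustration. The rule for expr -> term, which simply copies the attribute, is also assumed here, since the table above elides it.

```python
def output(node):
    """Compute the attribute 'output' for a parse-tree node, bottom-up."""
    if node[0] == "term":                 # term -> 0 | 1 | ... | 9
        return " " + node[1]
    if len(node) == 2:                    # expr -> term (assumed: copy the attribute)
        return output(node[1])
    _, expr1, op, term1 = node            # expr -> expr1 op term1
    return output(expr1) + output(term1) + " " + op

# The parse tree for 9-5+2, written out by hand.
tree = ("expr",
        ("expr", ("expr", ("term", "9")), "-", ("term", "5")),
        "+",
        ("term", "2"))
```

Here output(tree) is " 9 5 - 2 +". Note that the function only computes attributes in a tree that already exists.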

ASU fig 2.6:

Attribute values at nodes in a parse tree

But a syntax-directed definition says nothing about how the parser should build the parse tree! Just the grammar, and what to do when we have found which production to use.

(Syntax-directed) translation schemes

Grammar + what to do for each production embedded in the grammar: in the right-hand side of productions

Translation scheme = context-free grammar, plus semantic actions (Sw: semantiska aktioner, semantiska åtgärder) embedded in the productions, that specify what to do. Example:

expr -> expr₁ + term₁ { print("+"); }

Generates postfix!

Or, with the action somewhere in the middle:

rest -> + term₁ { print("+"); } rest₁

The semantic actions are put in the parse tree, just like the "real" parts. ASU fig 2.12:

An extra leaf is constructed for a semantic action

ASU fig 2.14:

Actions translating 9-5+2 into 95-2+
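
A sketch of the idea that actions are extra leaves, performed during a depth-first, left-to-right walk of the tree. The tuple representation, the helper names term and expr, and collecting the output in a list instead of printing are all assumptions for the illustration. The action for term (print the digit) is assumed as well.

```python
def run_actions(node, out):
    """Depth-first, left-to-right walk that performs each action leaf it meets."""
    if isinstance(node, str):
        return                            # an ordinary token leaf: nothing to do
    if node[0] == "action":
        out.append(node[1])               # stands in for { print(...) }
        return
    for child in node[1:]:
        run_actions(child, out)

def term(d):
    return ("term", d, ("action", d))                 # term -> d { print(d) }

def expr(left, op, right):
    return ("expr", left, op, right, ("action", op))  # expr -> expr1 op term { print(op) }

# The tree for 9-5+2, with the action leaves included.
tree = expr(expr(("expr", term("9")), "-", term("5")), "+", term("2"))
out = []
run_actions(tree, out)
result = "".join(out)
```

After the walk, result is "95-2+".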

2.4 Parsing

So, how does the parser build the parse tree?
Or rather: how does the parser "navigate through the grammar", guided by the source tokens it sees, in such a way that it could build a parse tree?

Top-down: simpler to build (by hand)
Bottom-up: harder to build (so we use tools such as Yacc), can handle a larger class of grammars


Recursive-descent parsing = the parser is a program with a procedure (in C: "function") for each non-terminal

current token, lookahead symbol

backtracking

Predictive parsing

Predictive parsing = a form of recursive-descent, with no backtracking

FIRST(some-nonterminal)

Designing a predictive parser
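
A minimal sketch of a predictive parser that translates to postfix, with one method per non-terminal and a single lookahead symbol. It assumes the grammar expr -> term rest, rest -> + term rest | - term rest | ε, term -> digit, with one character per token. The class and method names are my own.

```python
class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.out = []

    @property
    def lookahead(self):
        """The current token, or None at the end of the input."""
        return self.text[self.pos] if self.pos < len(self.text) else None

    def match(self, expected):
        if self.lookahead != expected:
            raise SyntaxError(f"expected {expected!r}, saw {self.lookahead!r}")
        self.pos += 1

    def expr(self):                       # expr -> term rest
        self.term()
        self.rest()

    def rest(self):                       # rest -> + term rest | - term rest | ε
        if self.lookahead in ("+", "-"):  # the lookahead selects the production
            op = self.lookahead
            self.match(op)
            self.term()
            self.out.append(op)           # the embedded action { print(op) }
            self.rest()
        # otherwise: use the ε-production and consume nothing

    def term(self):                       # term -> digit
        if self.lookahead is not None and self.lookahead.isdigit():
            self.out.append(self.lookahead)
            self.match(self.lookahead)
        else:
            raise SyntaxError(f"expected a digit, saw {self.lookahead!r}")

def to_postfix(text):
    parser = Parser(text)
    parser.expr()
    if parser.lookahead is not None:
        raise SyntaxError(f"unexpected {parser.lookahead!r}")
    return "".join(parser.out)
```

No backtracking is needed: in rest, the lookahead symbol alone decides which alternative to use. to_postfix("9-5+2") gives "95-2+".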

Left-recursion
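
A procedure for a left-recursive production such as list -> list + digit would call itself before consuming any input, and loop forever. The standard rewrite replaces A -> A α | β with A -> β R and R -> α R | ε. Below is a sketch of that rewrite for immediate left recursion only; the function name and the representation (each alternative is a list of symbols, with the empty list for ε) are assumptions.

```python
def remove_left_recursion(name, alternatives, new_name):
    """Rewrite the productions for 'name' without immediate left recursion."""
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == name]
    others = [alt for alt in alternatives if not alt or alt[0] != name]
    if not recursive:
        return {name: alternatives}       # nothing to rewrite
    return {
        name: [alt + [new_name] for alt in others],                # A -> β R
        new_name: [alt + [new_name] for alt in recursive] + [[]],  # R -> α R | ε
    }

grammar = remove_left_recursion(
    "list",
    [["list", "+", "digit"], ["list", "-", "digit"], ["digit"]],
    "rest",
)
```

Here grammar becomes list -> digit rest and rest -> + digit rest | - digit rest | ε, which a predictive parser can handle.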

ASU fig 2.15:

Steps in top-down construction of a parse tree

ASU fig 2.16:

Top-down parsing while scanning the input from left to right

Thomas Padron-McCarthy (Thomas.Padron-McCarthy@tech.oru.se) January 22, 2003