Lecture 2: Syntax and grammars

The course Compilers and interpreters | Lectures: 1 2 3 4 5 6 7 8 9 10 11 12

These lecture notes are my own notes, written for use during the lecture, and they are approximately what I will be saying. They may be brief, incomplete and hard to understand, not to mention in the wrong language, and they do not replace the lecture or the book, but there is no reason to keep them secret if someone wants to look at them.

Today:
More about syntactical analysis, "parsing" (Swedish: syntaktisk analys, parsning)

Grammatiker för datorspråk (in Swedish; also available in English: Grammars for computer languages)
ALSU-07 sections 2.1-2.3
(Old book: ASU-86 2.1-2.3)

Nothing today about how parsing is actually done. That will be next time.

2.1 Overview

We will be using C when doing the lab exercises, so we use the C examples from the old course book ASU-86, instead of the Java examples from the new course book ALSU-07. Besides, the old examples are pedagogically much better.

We will build a simple compiler that translates from infix notation of expressions to postfix. Examples of expressions that we want to translate:

Infix notation	Value	Postfix notation	Prefix notation	Function notation	LISP
2 + 3	5	2 3 +	+ 2 3	plus(2, 3)	(plus 2 3)
2 + 3 * 4	14	2 3 4 * +	+ 2 * 3 4	plus(2, times(3, 4))	(plus 2 (times 3 4))
2 * 3 + 4	10	2 3 * 4 +	+ * 2 3 4	plus(times(2, 3), 4)	(plus (times 2 3) 4)
2 * (3 + 4)	14	2 3 4 + *	* 2 + 3 4	times(2, plus(3, 4))	(times 2 (plus 3 4))

Source and target as text.

Postfix: Stack machine. Easy to write an interpreter.

Push numbers onto the top of the stack.
+: Pop the two top numbers, add, and push the sum.

The "2.5" program from ASU-86: simple grammar (Sw: "grammatik") (only + and -), simple parser, very simple scanner (one character = one token).
The "2.9" program from ASU-86: more advanced grammar (identifiers, *, /, mod, div), therefore a more complex parser, a "real" scanner.

2.2 Syntax definition

Example: the if statement in C. An instance:

if (a == b)
  printf("Same!\n");
else
  printf("Not same!\n");

This, as you know, is the syntax for the if statement:

if ( some expression ) some statement else some other statement

A rule that could be part of a context-free grammar (Sw: kontextfri grammatik) for C:

statement -> if ( expression ) statement else statement
statement -> if ( expression ) statement
statement -> { statement-list } (forgot what?)
...

"Context-free": a production "X -> ..." can always be used to replace X with "...", no matter what the rest of the program (that is, the context, Sw: kontext, omgivning) looks like.

A set of terminals (Sw: terminaler) = terminal symbols = tokens
A set of non-terminals (Sw: icke-terminaler) = non-terminal symbols (compound grammatical constructs)
A set of productions (Sw: produktioner) = rules: non-terminal -> tokens/non-terminals. A production is for the non-terminal to the left.
A designation of which non-terminal is the start symbol (Sw: startsymbolen)

Other concepts:

String (Sw: sträng) = a sequence of tokens
ε = The empty string (Sw: tomma strängen)
Language (Sw: språk) = the set of all strings that can be derived from the start symbol (using the productions in the grammar), Sw: mängden av alla strängar som kan härledas från startsymbolen (med hjälp av produktionerna i grammatiken).

Example 2.1 (p. 43)

Single-digit numbers, plus, minus:
7+3, 7+3-4+6, 3 (but not 17, -3 or 2*2)

digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
list -> digit
list -> list + digit
list -> list - digit

digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
list -> digit | list + digit | list - digit

Try 9-5+2.
9 -> digit -> list.
5 -> digit.
9-5 -> list - digit -> list
2 -> digit.
9-5+2 -> list + digit -> list

ALSU-07 fig 2.5 = ASU-86 fig 2.2, the parse tree (= concrete syntax tree) and the syntax tree (= abstract syntax tree):

Parse tree for 9-5+2

The start symbol in the root (Sw: rot).
A token (or the empty string) as each leaf (Sw: löv).
Non-terminals in the inner (or "interior") nodes (Sw: de inre noderna).
The children of each inner node are the right-hand side of a production!

Why list + digit etc? Asymmetrical and ugly? Why not just list + list, like this:

digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
string -> digit | string + string | string - string

ALSU-07 fig 2.6 = ASU-86 fig 2.3 (slide!):

Two parse trees for 9-5+2

Operator associativity (Sw: Operatorassociativitet)

ALSU-07 fig 2.7 = ASU-86 fig 2.4:

Parse trees for left- and right-associative operators

Use a grammar like above, that is,
list -> list + digit
for left-associative (Sw: vänsterassociativa) operators. Use a grammar like
list -> digit + list
for right-associative (Sw: högerassociativa) operators, for example:

right -> letter = right
letter -> a | b | c | ... | z

Operator precedence (Sw: Operatorprioritet, operatorprecedens)

9 + 5 * 2 = 9 + (5 * 2), not (9 + 5) * 2.
"*" has higher precedence than "+".

Express this in the grammar:

factor -> digit | ( expr )
term -> term * factor | term / factor | factor
expr -> expr + term | expr - term | term

2.3 Syntax-directed translation

Not just the parse tree for a certain construct, but also keep track of attributes of each subtree. An attribute can be any data that we want to keep track of in the tree, such as the data type of an expression, or even generated code.

Two different types:

Syntax-directed definition (Sw: syntax-styrd definition) = just rules
(Syntax-directed) translation scheme (Sw: syntax-styrt översättningsschema) = more procedural

Syntax-directed definitions

Syntax-directed definition = syntax-styrd definition

Grammar + what to do for each production

Syntax-directed definition = context-free grammar, plus a semantic rule (Sw: semantisk regel) for each production, that specifies how to calculate values of attributes. Example:

Production	Semantic rule
term -> 0	term.postfixcode = "0"
term -> 1	term.postfixcode = "1"
term -> 2	term.postfixcode = "2"
...	...
expr -> expr₁ + term	expr.postfixcode = expr₁.postfixcode + term.postfixcode + " +"
...	...

ALSU-07 fig 2.9 = ASU-86 fig 2.6, an annotated (= with attribute values) parse tree:

A syntax-directed definition says nothing about how the parser should build the parse tree! Just the grammar, and what to do when the tree is finished (or at least, when we have found which production to use).

The syntax-directed definition doesn't specify in which order the semantic rules should be performed, except of course that the values used in the rules must be available. (For example, expr₁.postfixcode and term.postfixcode must be calculated before we can concatenate them into expr.postfixcode.)

(Syntax-directed) translation schemes

Syntax-directed translation scheme = syntax-styrt översättningsschema

Grammar + what to do for each production embedded in the grammar: in the right-hand side of productions

Syntax-directed translation scheme = context-free grammar, plus semantic actions (Sw: semantiska aktioner, semantiska åtgärder) embedded in the productions, that specify what to do. Example:

expr -> expr₁ + term { print("+"); }

Generates postfix!

Or, with the action somewhere in the middle:

rest -> + term { print("+"); } rest₁

The semantic actions are put in the parse tree, just like the "real" parts.
ALSU-07 fig 2.13 = ASU-86 fig 2.12:

An extra leaf is constructed for a semantic action

ALSU-07 fig 2.14 = ASU-86 fig 2.14:

Actions translating 9-5+2 into 95-2+

As with a syntax-directed definition, a syntax-directed translation scheme says nothing about how the parser should build the parse tree! Just the grammar, and what to do when the tree is finished.

But we don't actually have to build the tree. Just perform the semantic actions as the tree is (not) built!

In contrast to a syntax-directed definition (which was a table with rules), the semantic actions in a syntax-directed translation scheme must be performed in the right order. They are program segments, such as print statements and variable assignments, and they will of course fail if they are performed in an incorrect order!

2.4 Parsing

See the next lecture.


Thomas Padron-McCarthy (thomas.padron-mccarthy@oru.se), September 6, 2022