Föreläsning 2: Syntax och grammatiker

Kursen Kompilatorer och interpretatorer | Föreläsningar: 1 2 3 4 5 6 7 8 9 10 11 12

Det här är ungefär vad jag tänker säga på föreläsningen. Använd det för förberedelser, repetition och ledning. Det är inte en definition av kursinnehållet, och det ersätter inte kursboken.

Idag:
Mer om syntaktisk analys, "parsning" (engelska: syntactical analysis, parsing)

Grammatiker för datorspråk (also available in English, courtesy of Google Translate)
ALSU-07 avsnitt 2.1-2.3
(Gamla boken: ASU-86 2.1-2.3)
(Början på KP kapitel 3)

Än så länge inget om hur parsningen faktiskt går till. Det blir nästa gång.

2.1 Overview

Vi kommer att använda C på labbarna, och därför använder vi C-exemplen från den gamla kursboken ASU-86, i stället för Java-exemplen från den nya kursboken ALSU-07. Dessutom är de gamla exemplen pedagogiskt bättre.

Vi ska bygga en enkel kompilator som översätter från infix till postfix. Exempel på uttryck som ska kunna översättas:

Infix-notation	Värde	Postfix-notation	Prefix-notation	Funktionsnotation	LISP
2 + 3	5	2 3 +	+ 2 3	plus(2, 3)	(plus 2 3)
2 + 3 * 4	14	2 3 4 * +	+ 2 * 3 4	plus(2, times(3, 4))	(plus 2 (times 3 4))
2 * 3 + 4	10	2 3 * 4 +	+ * 2 3 4	plus(times(2, 3), 4)	(plus (times 2 3) 4)
2 * (3 + 4)	14	2 3 4 + *	* 2 + 3 4	times(2, plus(3, 4))	(times 2 (plus 3 4))

Source and target as text.

Postfix: Stack machine. Easy to write an interpreter.

Push numbers onto the top of the stack.
+: Pop the two top numbers, add, and push the sum.

The "2.5" program from ASU-86: simple grammar (Sw: "grammatik") (only + and -), simple parser, very simple scanner (one character = one token).
The "2.9" program from ASU-86: more advanced grammar (identifiers, *, /, mod, div), therefore a more complex parser, a "real" scanner.

2.2 Syntax definition

Example: the if statement in C. An instance:

if (a == b)
  printf("Same!\n");
else
  printf("Not same!\n");

This, as you know, is the syntax for the if statement:

if ( some expression ) some statement else some other statement

A rule that could be part of a context-free grammar (Sw: kontextfri grammatik) for C:

statement -> if ( expression ) statement else statement
statement -> if ( expression ) statement
statement -> { statement-list } (forgot what?)
...

"Context-free": a production "X -> ..." can always be used to replace X with "...", no matter what the rest of the program (that is, the context, Sw: kontext, omgivning) looks like.

A set of terminals (Sw: terminaler) = terminal symbols = tokens
A set of non-terminals (Sw: icke-terminaler) = non-terminal symbols (compound grammatical constructs)
A set of productions (Sw: produktioner) = rules: non-terminal -> tokens/non-terminals. A production is for the non-terminal to the left.
What is the start symbol (Sw: startsymbolen)

Other concepts:

String (Sw: sträng) = a sequence of tokens
{E-symbol} = The empty string (Sw: tomma strängen)
Language (Sw: språk) = the set of all strings that can be derived from the start symbol (using the productions in the grammar), Sw: mängden av alla strängar som kan härledas från startsymbolen (med hjälp av produktionerna i grammatiken).

Example 2.1 (p. 43)

Ensiffriga tal, plus, minus:
7+3, 7+3-4+6, 3 (but not 17, -3 or 2*2)

digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
list -> digit
list -> list + digit
list -> list - digit

digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
list -> digit | list + digit | list - digit

Try 9-5+2.
9 -> digit -> list.
5 -> digit.
9-5 -> list - digit -> list
2 -> digit.
9-5+2 -> list + digit -> list

ALSU-87 fig 2.5 = ASU-86 fig 2.2, the parse tree (= concrete syntax tree) and the syntax tree (= abstract syntax tree):

Parse tree for 9-5+2

The start symbol in the root (Sw: rot).
A token (or the empty string) as each leaf (Sw: löv).
Non-terminals in the inner (or "interior") nodes (Sw: de inre noderna).
The children of each inner node is the right-hand side of a production!

Why list + digit etc? Asymmetrical and ugly? Why not just list + list, like this:

digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
string -> digit | string + string | string - string

ALSU-07 fig 2.6 = ASU-86 fig 2.3 (slide!):

Two parse trees for 9-5+2

Operator associativity (Sw: Operatorassociativitet)

ALSU-07 fig 2.7 = ASU-86 fig 2.4:

Parse trees for left- and right-associative operators

Use a grammar like above, that is,
list -> list + digit
for left-associative (Sw: vänsterassociativa) operators. Use a grammar like
list -> digit + list
for right-associative (Sw: högerassociativa) operators, for example:

right -> letter = right
letter -> a | b | c | ... | z

Operator precedence (Sw: Operatorarprioritet, operatorarprecedens)

9 + 5 * 2 = 9 + (5 * 2), not (9 + 5) * 2.
"*" has higher precedence than "+".

Express this in the grammar:

factor -> digit | ( expr )
term -> term * factor | term / factor | factor
expr -> expr + term | expr - term | term

2.3 Syntax-directed translation

Not just the parse tree for a certain construct, but also keep track of attributes of each subtree. An attribute can be any data that we want to keep track of in the tree, such as the data type of an expression, or even generated code.

Two different types:

Syntax-directed definition (syntax-styrd definition) = just rules
(Syntax-directed) translation scheme (syntax-styrt översättningsschema) = more procedural

Syntax-directed definitions

Syntax-directed definition = syntax-styrd definition

Grammar + what to do for each production

Syntax-directed definition = context-free grammar, plus a semantic rule (Sw: semantisk regel) for each production, that specifies how to calculate values of attributes. Example:

Production	Semantic rule
term -> 0	term.postfixcode -> "0"
term -> 1	term.postfixcode -> "1"
term -> 2	term.postfixcode -> "2"
...	...
expr -> expr₁ + term	expr.postfixcode -> expr₁.postfixcode + term.postfixcode + " +"
...	...

ALSU-07 fig 2.9 = ASU-86 fig 2.6, an annotated (= with attribute values) parse tree:

A syntax-directed definition says nothing about how the parser should build the parse tree! Just the grammar, and what to do when the tree is finished (or at least, when we have found which production to use).

Den syntax-styrda definitionen säger inte heller något om i vilken ordning de semantisk reglerna ska "köras", förutom förstås att de värden som används i reglerna måste finnas tillgängliga. (Så man måste till exempel räkna ut expr₁.postfixcode och term.postfixcode innan man slår ihop dem till expr.postfixcode.)

(Syntax-directed) translations schemes

Syntax-directed translation scheme = syntax-styrt översättningsschema

Grammar + what to do for each production embedded in the grammar: in the right-hand side of productions

Syntax-directed definition = context-free grammar, plus semantic actions (Sw: semantiska aktioner, semantiska åtgärder) for each production, that specifies what to do. Example:

expr -> expr₁ + term { print("+"); }

Generates postfix!

Or, with the action somewhere in the middle:

rest -> + term { print("+"); } rest₁

The semantic actions are put in the parse tree, just like the "real" parts.
ALSU-07 fig 2.13 = ASU-86 fig 2.12:

An extra leaf is constructed for a semantic action

ALSU-07 fig 2.14 = ASU-86 fig 2.14:

Actions translating 9-5+2 into 95-2+

As with a syntax-directed definition, a syntax-directed scheme says nothing about how the parser should build the parse tree! Just the grammar, and what to do when the tree is finished.

But we don't actually have to build the tree. Just perform the semantic actions as the tree is (not) built!

Till skillnad från med en syntax-styrd definition (som var en tabell med regler) så måste de semantiska aktionerna i ett syntax-styrt översättningsschema utföras i rätt ordning. De är programsnuttar (t ex print-satser och variabeltilldelningar), och det blir förstås fel om de görs i fel ordning!

2.4 Parsing

Se nästa föreläsning.

Kursen Kompilatorer och interpretatorer | Föreläsningar: 1 2 3 4 5 6 7 8 9 10 11 12

Thomas Padron-McCarthy (Thomas.Padron-McCarthy@oru.se) 8 september 2018