Grammars for computer languages

By Thomas Padron-McCarthy (email: Thomas.Padron-McCarthy@oru.se ). Last change: September 4, 2017. This translation: September 10, 2018.

A computer language is a language that is designed with the intension that it should be read and understood by computers. It may be a programming language like C ++, Java or Visual Basic, but it can also be a much easier language, such as describes what to write in a configuration file, or what can be entered in an entry box.

A grammar for a language describes the language, and describes how to put together sentences and other constructions in that language.

Computer languages and natural languages

A computer language, such as C++ or HTML, is intended to be read and understood by computers.

There are also natural languages like Swedish and English. The natural languages tend to be much more free than the computer language, so that you for example can change the word order. The boundary between right and wrong is more fluid in natural languages.

Because computers are much stupider than people, one must be much more careful when writing something that a computer should read and understand. A person can often understand an opinion even though there is a comma in the wrong place. But if you type a comma in the wrong place in a program, then your computer will mostly do either nothing at all, or something completely different from what you thought.

Grammar

A grammar for a language describes the language, and specifies how to combine words and other symbols in the language into sentences and other constructions.

The grammar for the language English says, for example, that one can make a sentence for example by first writing a noun and then a verb, and finish with a period. That rule could be written as follows:


sentence -> noun verb .

You can thus take any noun, followed by a any verb, and a period, and this forms a sentence.

Because natural languages like English are quite free, it is difficult or impossible to write a complete and correct grammar that really does describe everything you can say in English.

It is much easier to write a grammar for a computer language. Computer languages are much simpler than natural languages, and moreover, they are much more precisely specified. There are no (well, almost none) doubtful cases, but if you write something in a computer language like Java, it's either right or wrong. Not as in English, where you can talk quite a long time about whether an expression is right or wrong, and in the end it may prove to be a matter of taste.

A simple language: greetings

We think we should write a grammar for languages where you can express the following four different greetings:

hello
good morning
good afternoon
good evening

Here's what the grammar might look like:


 
greeting -> hi | good morning | good afternoon | good evening

A greeting can be either hello, good morning, good afternoon or good evening. The symbol "|" means "or". The symbol "->" can be read as "can be replaced with" or "can be expanded to".

Here is another grammar that describes the exact same language:



greeting -> hi | good time_specification
time_specification -> morning | afternoon | evening

A greeting is either hello or the word good, followed by a time_specification, and the time_specification can be either morning, afternoon or evening.

Instead of using the "or" sign "|" to show that a time_specification can be one of several different things, you can write several different rules:


 
greeting -> Hi
greeting -> good time_specification
time_specification -> morning
time_specification -> afternoon
time_specification -> evening

"Terminals", "Non-Terminals" and "Productions"

A terminal is something that you write in the language itself. In the greeting language above we have five different terminals, namely the five words hello, good morning, afternoon and evening. Here we write terminals in bold.

In grammar, we also use the names greetings and time_specification. They do not appear in the greeting language, but only in the grammar. They are called non-terminals, and we write them in italics.

A "terminal" is something that is at the end of something. At the end of a bus line there is a bus terminal. There the bus stops, and the passengers leave. If we have a non-terminal, such as greeting, we can continue and replace it, according to the rules of grammar, but when we arrive at a terminal, such as tomorrow, we can not replace it with anything else, and we have to stop there.

A production is a rule in the grammar, which states that a certain non-terminal can be replaced with something else. For example, this is a production that says that non-terminal greetings can be replaced by the terminal good followed by the non-terminal time_specification:



greeting -> good time_specification

Also note that this is not one production, but three:

tidsspecifikation -> morgon | afternoon | evening

The "or" sign "|" is just a shorter way to write several different productions. Instead of the above, you can print the three productions like this:

time_specification -> morning
time_specification -> afternoon
time_specification -> evening

The start symbol

We assumed that you would write whole greetings, so that you can say good evening but not just evening. But in the grammar we had two non-terminals: greeting and time_specification. How do you know which one describes the whole expression?

When writing a grammar, it is not enough to specify which terminals, non-terminals and productions there are, but you must also indicate what kind of things you want to say in the language, for example, if you are saying complete greetings or just time specifications. To do this, we specify a start symbol.

For our greeting language, it is greeting that is the start symbol, not time_specification.

Grammar for the greeting language

As we saw, we were able to describe the greeting language with several different grammars, all of which describe the same language. We choose one of them, and to be complete we will calculate which terminals, non-terminals and productions are in the language and its grammar, and what is the start symbol. Like this:

Terminals:

hello
good
morning
afternoon
evening

Non-terminals:

greeting
time_specification

Productions:

greeting -> hi | good time_specification
time_specification -> morning | afternoon | evening

Start symbol:

greeting

What is a language?

The greeting language that we have described consists of the following four phrases, or sequences of terminals that are allowed in that language:

hello
good morning
good afternoon
good evening

The language thus contains four phrases. Other phrases, such as good night, are not allowed, and are not included in the language.

When we talk about a language we mean the set of all allowed sequences of terminals. This simple language contains four such sequences, but a language like C ++ contains of course many more - maybe even infinitely many.

How do you find out which language a grammar describes?

You can generate all allowed sequences of terminals in one language by starting from the start symbol, then gradually expanding all non-terminals. For example, take the grammar from the previous example, with the start symbol greeting and these productions:

greeting -> hi | good time_specification
time_specification -> morning | afternoon | evening

Start with the start symbol, greeting:

The figures have noot been translated yet, and still use the Swedish terminals and non-terminals!

Home icon

There are two ways to expand it:

Two possible expansions of the start symbol greeting

hello can not be expanded, so we will not go on, but the non-terminal time_specification can be expanded in three different ways:

Three possible expansions of non-terminal time_specification

Now we can go no further.

We have formed a tree, and the language, ie, which sequences of terminals are allowed, we find by looking at the leaves in the tree.

A grammar for an infinitely large language

A language may be infinitely large, in that it contains infinitely many allowed sequences:

sum -> 3 | sum + 3

If the sum is the start symbol, we can expand it as follows:

Expansion of an infinite language

Although the grammar of this language contains only two terminals, one non-terminal and two productions, the language contains infinitely many allowed sequences:

3
3 + 3
3 + 3 + 3
3 + 3 + 3 + 3
...

A grammar for sums

In the example above we were able to express sums, but unfortunately only sums of the number 3. We will try to write a grammar that can handle other sums, with numbers and plus signs, for example:

4 + 5
4 + 5 + 6
7
2 + 3 + 4 + 5 + 6

It might be surprising that 7 is counted as a sum, but we choose to include the case with a single number. When we then proceed with real math expressions, we should (of course!) allow you to write expressions that consist only of a number.

There are several ways to write such a grammar, but we choose to think like this: A sum is either a lone number, or you have a sum (which may be a lone number) where you add a new number at the end, so that there will be a new sum.

A first try:

sum -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | sum + 0 | sum + 1 | sum + 2 | sum + 3 | sum + 4 | sum + 5 | sum + 6 | sum + 7 | sum + 8 | sum + 9

The grammar works, but it is very long. We can make it shorter by introducing the non-terminal number:

number -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
sum -> digit | sum + digit

Alternatively, we can make a number into a terminal. Then it becomes the responsibility of the lexical analyzer (the "scanner") to recognize "0", "1", "2" and so forth, and to pass on to the syntactic analyzer (the "parser") that it found a number in the input.

Then the grammar becomes even shorter:

sum -> digit | sum + digit

Now it may be a bit clearer how we thought: A sum is either a lone number, or you have a sum (which may be a lone number), attaching a plus sign and a new number at the end, making it a new sum.

Lexical analysis and syntactic analysis

The first two phases of a compiler are called lexical analyzer (or "scanner") and syntactic analyzer (or "parser").

The source program to be compiled usually consists of text, ie a sequence of characters:

x = lower_coordinate + 14;

The scanner reads the source program text, and instead creates a sequence of terminals, or "tokens", which are passed on to the parser:

identifier, equals sign, identifier, plus sign, integer, semicolon

Note that the spaces disappeared because they have no meaning, they are just used to separate the terminals, and also note several characters (such as 1 and 4 ) can form a single terminal.

Normally, the scanner also sends information about which identifiers and integers it found, like this:

identifier (x), equals sign, identifier (lower_coordinate), plus sign, integer (14), semicolon

The parser reads this sequence of tokens, compares it to the grammar, and determines if there is an allowed sequence of tokens according to the grammar, and which of the productions in grammar are needed to generate this sequence.

When designing a compiler, one can sometimes choose whether to place a certain amount of work in the scanner or in the parser. For example, we saw above that you could have the parser rule a rule that says a "number" is 0 to 9, or you could let the scanner recognize that 0 to 9 are numbers.

Parsing and compilers

There are different methods of writing a parser, for example for being included in a compiler. You can write it by hand in the form of common program code in C, Java or a similar language, or you can use a so-called parser generator, such as Yacc. In this case, you don't need to write a program, but only a grammar for the language. The parser generator reads the file with the grammar, and then generates the program code (for example, in C or Java) that constitutes the parser.

Parse trees

When the parser reads the sequence of tokens that come from the scanner, and compares that sequence to the productions in grammar, it builds a so-called parse tree, sometimes called concrete syntax tree. The leaf nodes in the tree are tokens, and the inner nodes are non-terminals. The root node is the grammar's start symbol.

Parse tree

(A parenthesis: It's not really true that the parser necessarily builds a parse tree. Sometimes it doesn't build a tree, at least not as a data structure in memory. Instead, the parser goes through the productions in grammar in an order so that it could build a parse tree.)

The parse tree shows what the things (such as sums) consist of. For example, a sum can consist of two other sums. Therefore, instead of a tree you can draw it as boxes, with boxes inside boxes:

Parse tree as drawers inside drawers

One can simplify the parse tree by (1) removing the internal nodes that only have a single sub-node, and (2) moving operators to the node above where they replace the non-terminal:

To convert a parse tree to a syntax tree

Then a so-called syntax tree is formed, sometimes called an abstract syntax tree:

Syntax Trees

The simpler syntax contains all the information needed to calculate the expression (or run the program), but does not contain the information about the grammar structure found in the parse tree.

The same language, but with a worse grammar: ambiguity!

When we created the grammar for the language of sums, our thought process was this: A sum is either a lone number, or you have a sum (which may be a lone number), adding a new number at the end, making it a new sum.

Another way of thinking is this: A sum is either a lone number, or two sums that you add:

sum -> digit | sum + sum

This grammar describes the same language as the previous one, and thus generates the same set of allowed token sequences, but it has the disadvantage that it is ambiguous.

That a grammar is ambiguous means that the same sequence of tokens can be described by (at least) two different parse trees, according to the grammar. Note how 3 + 4 + 5 can be described in two ways:

Two different parse trees for 3 + 4 + 5

If you group the additions with parentheses, this corresponds to the two expressions (3+4)+5 and 3+(4+5) respectively.

For addition the result will be the same. Both sums yield the result 12, as an addition always gives the same result regardless of the order in which the terms are added. But if we had subtraction instead of addition, then the two expressions would have been (3-4)-5 and 3-(4-5). They give completely different results.

The opposite of ambiguous grammar is a unique grammar. A grammar for a computer language should be unambiguous, because an ambiguous grammar makes it possible for a source program to be interpreted in different ways, and it is difficult for the computer to know, with some sort of "common sense", which of the different interpretations was intended by the programmer.

A grammar for real expressions

We want to create a grammar for simple math expressions, starting with one that only accepts addition (plus) and multiplication (times). It should handle expressions such as:

1 + 2
2 * 3
1 + 2 * 3

How will 1 + 2 * 3 be interpreted? Well, it should be interpreted as 1 + (2 * 3), which becomes 7, and not like (1 + 2) * 3, which becomes 9. This is because multiplication should be performed before addition. We say that the multiplication operator has higher priority, sometimes called precedence, than the addition operator.

An initial attempt to write the grammar:

E -> number
E -> E + E
E -> E * E

"E" stands for "expression".

The same grammar can also be written as follows, if you want to write it more compactly:

E -> number | E + E | E * u

This grammar says that an expression can consist of just one number, or two other expressions that you add, or two other expressions that you multiply.

That's an ambiguous grammar (try it yourself!). Next attempt:

F -> number
T -> F * F
E -> T + T

"F" stands for "factor" and "T" stands for "term". "E" still stands for "expression". This grammar says that an expression consists of an addition of two terms. A term, in turn, consists of a multiplication of two factors. A factor, in the end, consists of a number.

It's better, but it only fits expressions on the form numbers * numbers + numbers * numbers.

We want to multiply arbitrarily many numbers into one term, and add many terms to an expression arbitrarily. Then both the term rule and the expression rule must handle it. We use what we learned earlier about how to combine sums by constantly adding + numbers to a sum.

F -> number
T -> F | T * F
E -> T | E + T

Now the grammar is correct, and it says the following:

A factor (F) consists of a single number.
A term (T) can either consist of a single factor, or you have a term (which may in turn be a single factor or several multiplied together), multiplying with a new factor at the end, so that it becomes a new term.
An expression (E) can either consist of a lone term, or you have an expression (which may in turn be a lone term or several added together) where you add a new term at the end so that it becomes a new expression.

Note:

Associativity: Operators + and * are both left associative. New terms and factors are added to the right.
Priority: The operator * has a higher priority than + because the right-hand side of the + production contains the non-terminal F, and the F-productions's right-hand side contains the operator *.

Yacc and Bison

In Yacc and Bison (which works almost exactly the same), you can write rules that look like above, and Yacc (or Bison) makes a parser for that grammar. Below we see a complete input file for Yacc and Bison that sets out the grammar of expressions we saw above. To show that non-terminals can be more than a single letter, F is called a factor, and so on.


%token number

%start expression

%%

factor : number ;
term : factor | term '*' factor ;
expression : term | expression '+' term ;

%%

You can also type the parser by hand, using maybe fifty lines of program code in a language like C or Java. (If you know how to do it. If you don't know how to do it, you might need a thousand lines, resulting in a program that only almost works.)