KOI: Exercise 4: Types of parsers, grammar transformations, bottom-up parsers, Bison

You can do this exercise in any size group you wish. Don't worry if you find some of the questions difficult to answer, since they may be intended to initiate a discussion rather than have a single correct answer.

Preparations

Before this exercise you should have:

Question 1

Here is a grammar. There are three terminals: a, b and c. There are three non-terminals: S, T and U. The start symbol is S.

S -> a b T | b a T
T -> a b U | a a U
U -> a b c

Below is the program badparser-1. It is an attempt to write a parser that implements the grammar above.

a) What is the language that this grammar defines?

b) Unfortunately, there is a problem with the parser below. Explain what the problem is.

c) We could use various programming tricks to fix the parser, but instead we will use a more general method, using a grammar transformation. What is the transformation we should use here?

d) Perform the grammar transformation, and show the new, transformed grammar.

e) Modify the program so the parser implements the new grammar. Verify that it can now correctly parse the language.

(Some helpful hints: The program only understands the tokens, or terminals, a, b and c. Finish the input with EOF, which in Linux is entered using CTRL-D on a separate line.a All other input, such as d, is ignored. You can compile and run the program if you want, but you don't have to.)

Download

// A simple (and wrong) parser: badparser-1.c

#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>

int lookahead;

void error(void) {
    printf("Syntax error: unexpected token '%c'\n", lookahead);
    exit(EXIT_FAILURE);
}

// A simple scanner.
// Returns one of the tokens 'a', 'b' or 'c', or EOF. Other input is ignored.
int scan(void) {
    int input;
    while ((input = getchar()) != 'a' && input != 'b' && input != 'c' && input != EOF)
        ;
    return input;
}

void match(int expected) {
    if (lookahead == expected)
        lookahead = scan();
    else
        error();
}

// These are the non-terminals, calling each other recursively
void S(void), T(void), U(void);

void S(void) {
    if (lookahead == 'a') {
        match('a'); match('b'); T();
    }
    else if (lookahead == 'b') {
        match('b'); match('a'); T();
    }
    else {
        error();
    }
}

void T(void) {
    if (lookahead == 'a') {
        match('a'); match('b'); U();
    }
    else if (lookahead == 'a') {
        match('a'); match('a'); U();
    }
    else {
        error();
    }
}

void U(void) {
    match('a'); match('b'); match('c');
}

int main(void) {
    printf("Badparser 1. Enter input. End with EOF (CTRL-D on Linux).\n");
    lookahead = scan(); // To get started, so we have somthing to compare to
    S();
    if (lookahead != EOF)
        error();
    printf("Done!\n");
    return EXIT_SUCCESS;
}

Question 2

Here is a grammar. There are three terminals: a, b and c. There are two non-terminals: S and T. The start symbol is S.

S -> a b T c
T -> T a | b

Below is the program badparser-2. It is an attempt to write a parser that implements the grammar above.

a) What is the language that this grammar defines?

b) Unfortunately, there is a problem with the parser below. Explain what the problem is.

c) We could use various programming tricks to fix the parser, but instead we will use a more general method, using a grammar transformation. What is the transformation we should use here?

d) Perform the grammar transformation, and show the new, transformed grammar.

e) Modify the program so the parser implements the new grammar. Verify that it can now correctly parse the language.

Download

// A simple (and wrong) parser: badparser-2.c

#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>

int lookahead;

void error(void) {
    printf("Syntax error: unexpected token '%c'\n", lookahead);
    exit(EXIT_FAILURE);
}

// A simple scanner.
// Returns one of the tokens 'a', 'b' or 'c', or EOF. Other input is ignored.
int scan(void) {
    int input;
    while ((input = getchar()) != 'a' && input != 'b' && input != 'c' && input != EOF)
        ;
    return input;
}

void match(int expected) {
    if (lookahead == expected)
        lookahead = scan();
    else
        error();
}

// These are the non-terminals, calling each other recursively
void S(void), T(void);

void S(void) {
    match('a'); match('b'); T(); match('c');
}

void T(void) {
    if (lookahead == 'b') {
        T(); match('a');
    }
    else if (lookahead == 'b') {
        match('b');
    }
    else {
        error();
    }
}

int main(void) {
    printf("Badparser 2. Enter input. End with EOF (CTRL-D on Linux).\n");
    lookahead = scan(); // To get started, so we have somthing to compare to
    S();
    if (lookahead != EOF)
        error();
    printf("Done!\n");
    return EXIT_SUCCESS;
}

Question 3

Here is a grammar. There are three terminals: a, b and c. There are two non-terminals: S and T. The start symbol is S.

S -> a b T c
T -> a b c | U
U -> a b c

Below is the program badparser-3. It is an attempt to write a parser that implements the grammar above.

a) What is the language that this grammar defines?

b) There is a problem with this grammar. Explain what the problem is.

c) Does the parser below correctly parse the language?

d) Try to fix the problem in the grammar, and show the new grammar.

e) Modify the program so the parser implements the new grammar. Verify that it can correctly parse the language.

Download

// A simple (but is it wrong?) parser: badparser-3.c

#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>

int lookahead;

void error(void) {
    printf("Syntax error: unexpected token '%c'\n", lookahead);
    exit(EXIT_FAILURE);
}

// A simple scanner.
// Returns one of the tokens 'a', 'b' or 'c', or EOF. Other input is ignored.
int scan(void) {
    int input;
    while ((input = getchar()) != 'a' && input != 'b' && input != 'c' && input != EOF)
        ;
    return input;
}

void match(int expected) {
    if (lookahead == expected)
        lookahead = scan();
    else
        error();
}

// These are the non-terminals, calling each other recursively
void S(void), T(void), U(void);

void S(void) {
    match('a'); match('b'); T(); match('c');
}

void T(void) {
    if (lookahead == 'a') {
        match('a'); match('b'); match('c');
    }
    else if (lookahead == 'a') {
        U();
    }
    else {
        error();
    }
}

void U(void) {
    match('a'); match('b'); match('c');
}

int main(void) {
    printf("Badparser 3. Enter input. End with EOF (CTRL-D on Linux).\n");
    lookahead = scan(); // To get started, so we have somthing to compare to
    S();
    if (lookahead != EOF)
        error();
    printf("Done!\n");
    return EXIT_SUCCESS;
}

Question 4

Here is a grammar. There are three terminals: a, b and c. There are three non-terminals: S, T and U. The start symbol is S.

S -> a T | a b T
T -> T a | c | U
U -> c

Below is the file bisongrammar.y, which is an input file for Bison (or Yacc). It is an attempt to use Bison to specify a parser that implements the grammar above.

You can run it through bison with the command bison bisongrammar.y, which creates the C file bisongrammar.tab.c, which can then be compiled as a normal C program.

a) What is the language that this grammar defines?

b) This grammar has each of the three problems that we saw in questions 1-3 above. Identify those problems in the grammar!

c) Instead of a hand-coded top-down parser as in question 1-3, we are now using Bison to create a bottom-up parser. Which of those problems are still problems, with the Bison-generated parser? Which problems are no longer an issue?

Download

// bisongrammar.y

%{
  #include <stdio.h>
  #include <ctype.h>
  #include <stdlib.h>
  extern int yylex(void);
  extern void yyerror(const char *message);
%}

%define parse.error verbose

%%

S : 'a' T  | 'a' 'b' T ;
T : T 'a' | 'c' | U ;
U : 'c' ;

%%

void yyerror(const char *message) {
    printf("The Bison-generated parser found an error: %s\n", message);
    exit(EXIT_FAILURE);
}

// A simple scanner.
// Returns one of the tokens 'a', 'b' or 'c', or EOF. Other input is ignored.
int yylex(void) {
    int input;
    while ((input = getchar()) != 'a' && input != 'b' && input != 'c' && input != EOF)
        ;
    return input;
}

int main(void) {
    printf("Bisongrammar. Enter input. End with EOF (CTRL-D on Linux).\n");
    yyparse();
    printf("Done!\n");
    return EXIT_SUCCESS;
}

Question 5

a) What do we mean by a top-down parser?

b) What do we mean by a bottom-up parser?

c) What do we mean by a recursive-descent parser?

d) What do we mean by a predictive parser?

Question 6

Here are some terms often used in more theoretical works about parsing. What do they mean?

a) LL
b) LL(k)
c) LL(1)
d) LR
e) LR(k)
f) LR(1)


Thomas Padron-McCarthy (thomas.padron-mccarthy@oru.se) September 22, 2021