KOI: Exercise 6: Lexical analysis, state machines

You can do this exercise in any size group you wish. Don't worry if you find some of the questions difficult to answer, since they may be intended to initiate a discussion rather than have a single correct answer.

Preparations

Before this exercise you should have:

Question 1

a) What is the difference between a state machine or automaton (in Swedish: "tillståndsmaskin", "automat") and a finite state machine (FSM) or finite automaton or finite state automaton (FSA) (in Swedish: "ändlig tillståndsmaskin", "finit automat")?

b) What is the difference between a deterministic finite state machine or deterministic finite automaton (DFA) (in Swedish: "deterministisk ändlig tillståndsmaskin", "deterministisk finit automat") and a non-deterministic finite state machine or non-deterministic finite automaton (in Swedish: "icke-deterministisk ändlig tillståndsmaskin", "icke-deterministisk finit automat")?

Question 2

The following program can be described using a simple deterministic finite state machine. Draw the machine!

Download

// Counts the number of words in the input
// A word is a sequence of letters

#include <stdio.h>
#include <ctype.h>

int main(void) {
    int c;
    int number_of_words = 0;
    int inside_a_word = 0;

    while ((c = getchar()) != EOF) {
        if (!isalpha(c)) {
            // This was not a letter
            inside_a_word = 0;
        }
        else if (!inside_a_word) {
            // This was a letter, AND we were not (yet) inside a word
            ++number_of_words;
            inside_a_word = 1;
        }
    }
    
    printf("Number of words: %d\n", number_of_words);
    return 0;
} // main

Question 3

The program above counts words, where a word is defined as a sequence of letters, separated by non-letters (or the beginning or the end of the input). If we want to allow a "word" to contain non-letters, such as the name Anna-Belle or the term mother in law, we can use quotes to quote a string of any characters except quote, so the following input would consist of ten words:

My "mother in law", who is called "Anna-Belle", is a "50-year-old" woman.

Your state machine from question 2 above probably has one state for being inside a word, and one state for being outside. Draw a new state machine that also has a state for being inside a word that is a quoted string.

Question 4

Here is a scenario:

Pumpa

Soon (when I write this) it will be Halloween. Children celebrate the day by going "trick-or-treating". They go from door to door in the neighborhood and ask "trick or treat". To document this "candy round", they need a special Halloween language. A complete input might look like this:

began candy round;
treat: three candies;
trick: made some spooky sounds;
treat: one kilo of chocolate;
treat: a carrot;
trick: set fire to the neighbors car;
went home;

The input describes a candy round. It starts with began candy round; and ends with went home;. In between, there are zero or more notes about trick or treat. Such a note begins with either trick or treat, a colon (:), one or more words that indicate what happened, and a semicolon (;).

It should be possible to write the input in free format, such as in most common programming languages, for example like this:

began candy

round ; treat : three candies ; trick
: made some          spooky
sounds ; treat : one kilo of chocolate
; treat
: a carrot ; trick : set fire
to the neighbors car ; went
home    ;
Here is a Bison parser input file for the Halloween language.

Download

%{
    #include <stdio.h>
    extern void yyerror(const char *);
    extern int yylex(void);

%}

%token BEGAN CANDY ROUND SEMICOLON TRICK TREAT COLON WORD WENT HOME
%define parse.error verbose
%start candy_round

%%

candy_round : BEGAN CANDY ROUND SEMICOLON trick_or_treat_list WENT HOME SEMICOLON
    ;

trick_or_treat_list: trick_or_treat trick_or_treat_list
          | /* empty */
          ;

trick_or_treat : TRICK COLON word_list SEMICOLON
	| TREAT COLON word_list SEMICOLON
        ;

word_list : WORD
         | WORD word_list
         ;

%%

int main() {
    printf("Enter a complete candy round. Finsih with EOF.\n");
    yyparse();
    printf("Done!\n");
    return 0;
}

void yyerror(const char *s) {
    printf("Invalid candy round! Error message: %s\n", s);
}

int yywrap(void) {
    return 1;
}
a) One of the terminals, that is, types of tokens, that are used in the Halloween language is word. That is simply one or more lower-case letters, a to z. (We only use the English alphabet.) Write a regular expression that describes what such a word can look like.

b) Which other terminals, except word, are used in the language?

c) Write a Flex specification (the ".l" input file) for a scanner for the language.

d) Build a complete working program and check that it understands the Halloween language.

If the Flex input file is called halloween.l and the Bison input file is called halloween.y, the complete program can be built using these commands:

flex halloween.l    
bison -d halloween.y
gcc -o halloween -Wall -Wextra halloween.tab.c lex.yy.c

Question 5

Write a hand-coded "ad hoc" scanner for the Halloween language. It should work the same as the Flex-generated scanner.

There are some solutions to some of the questions, but try to solve them yourself first.


Thomas Padron-McCarthy (thomas.padron-mccarthy@oru.se) November 10, 2023