Corrected version. 2002-03-22.

Mitthögskolan
ITE
Thomas Padron-McCarthy (tpm@nekotronic.se)






Exam in

Datateknik C, Databaser 5 poäng

Wednesday March 20, 2002

(This exam is available in a Swedish and an English version.)







Utilities: None.
Points: Maximum 43 points.
21 points required to pass.
Results: Will be posted on the course home page, no later than Wednesday April 3, 2002.
Walk-thru: Time for walk-thru will be posted along with the result.
Examiner and teacher on duty: Thomas Padron-McCarthy, telephone 070-7347013


In addition to the instructions on the cover of the test, please also:


GOOD LUCK!


Scenario for the problems:

The Swedish State Railroads, Statens Järnvägar, abbreviated SJ, has a database for ticket bookings. As a part of that database, there is a time table for all train routes. It contains train routes, train stations, how the routes stop at the stations, and trains.

A "route" is, for example, the train from Östersund to Sundsvall every weekday at 09:12. A "train" is an actual individual train, such as that train over there that just now leaves from Östersund to Sundsvall.

Here is an Entity-Relationship diagram:

And here are the tables:

Station(Snr, Name)
Stops(Route, Station, Number, Time);
Route(Rnr, Days)
Train(Tnr, Route, Date)

Example data:

Stops
Route Station Number Time
3 3 1 09:12
3 2 2 10:41
3 1 3 10:44
3 4 4 16:31
... ... ... ...
             Station
Snr Name
1 Sundsvalls Central
2 Sundsvalls Västra
3 Östersunds Central
4 Stockholms Central
... ...
 
Route
Rnr Days
1 Sun
2 Mon, Tue, Wed, Thu, Fri, Sat, Sun
3 Mon, Tue, Wed, Thu, Fri
4 Sun
... ...
             Train
Tnr Route Date
1 3 2002-03-20
2 3 2002-03-21
3 3 2002-03-22
... ... ...

Number of rows in the tables:






Problem 1 (2p)

There is a normalization problem with the table Route. Explain what the problem is, and why it is a problem.

Problem 2 (7p)

We want to answer to the following question:

Which trains stop at the station Sundsvalls Västra on the date 2002-03-20? The result should consist of the train number (Tnr).

We use this SQL query:

select Train.Tnr
from Station, Stops, Route, Train
where Station.Name = 'Sundsvalls Västra'
      and Station.Snr = Stops.Station
      and Stops.Route = Route.Rnr
      and Route.Rnr = Train.Route
      and Train.Date = '2002-03-20'

a) (1p) Do the systematic translation from the SQL query to the initial query tree, and draw the tree.

b) (1p) Write that initial query tree as a relational algebra expression (that is, not drawn as a tree).

c) (2p) A heuristic query optimizer transforms the query tree to a more efficient form. Draw the resulting tree!

d) (2p) Explain why this new tree is more efficient than the initial tree. Which improvements have been made? Why is the execution of the query (probably) faster now?

e) (1p) Which shortcomings does a heuristic query optimizer of this type have? By using a cost-based query optimizer, some of those shortcomings can be addressed. How?

Problem 3 (2p)

We want an answer to this question:

I want to go from Sundsvalls Västra to Stockholms Central on the date 2002-03-20. At which times can I leave?

We use this SQL query:

select firststop.Time
from Station as firststn, Station as laststn,
     Stops as firststop, Stops as laststop, Route, Train
where firststn.Name = 'Sundsvalls Västra'
      and firststn.Snr = firststop.Station
      and firststop.Route = Route.Rnr
      and laststn.Name = 'Stockholms Central'
      and laststn.Snr = laststop.Station
      and laststop.Route = Route.Rnr
      and Route.Rnr = Train.Route
      and Train.Date = '2002-03-20'
      and firststop.Number < laststop.Number
Formulate the SQL query as a relational algebra expression, in any way that gives a correct result. (If you want to, you can write it as a query tree, but that is not required.)

Problem 4 (5p)

Queries of the type shown in the two problems above are common in this database. (The station name and date of course varies.) We use a DBMS that stores its indexes as B+ trees.

a) (4p) Which indexes should we create?

b) (1p) Can the DBA do something else, besides crating indexes, to speed up the queries?

Problem 5 (5p)

We want to know the date of train number 4711, and run the following SQL query:
select Date
from Train
where Tnr = 4711
State reasonable assumptions concerning disk block size etc, and calculate how long the execution of the query will take, on an average.

a) with the table Train stored as an unsorted heap, and no indexes

b) with a B+-tree primary index on the attribute Tnr, and no other interesting storage structures

c) with a secondary index in the form of a hash table, on the attribute Tnr, and no other interesting storage structures

Problem 6 (2p)

SJ is expanding, and is taking over all train traffic in the world. This makes the database grow, and all four tables becomes one thousand times bigger: What happens now with the response times in a-c in the problem above?

Problem 7 (3p)

SJ wants to store maps of train routes and stations in their database.

a) (1p) Why do the usual index structures, such as B+ trees and hash tables, not work well with two-dimensional data, such as maps?

b) (2p) Give an example of a data structure that is suitable for two-dimensional data, and describe in short how it works.

Problem 8 (3p)

We want to know how many train routes that stop at each station, and we start by trying this SQL query:
select Station.Name, count(*)
from Station, Stops
where Station.Snr = Stops.Station
group by Station.Name
a) (1p) Translate this SQL query to a relational algebra expression.

b) (1p) There are some newly-built stations where no trains stop yet. Those stations are not present at all in the result of the query above. Write a new SQL query, where the new stations too are present in the result, with zero in the column for the number of train routes that stop.

c) (1p) Translate the new SQL query to a relational algebra expression.

Problem 9 (2p)

Many DBMSs provide a library of functions, which can be used to access a database from inside a program. Another alternative is ODBC. Which advantages and disadvantages do you get when using ODBC, instead of the DBMS's own function library?

Problem 10 (5p)

As mentioned above, SJ is taking over all train traffic in the world. Because of this, we add a Country column to the table Station:

Station
Snr Name Country
1 Sundsvalls Central Sweden
2 Sundsvalls Västra Sweden
2877 Trondheim Norway
4876 Notting Hill Gate UK
6777 New Delhi East India
... ... ...

We also want to distribute the database. Each country gets its own fragment of the Station table, using primary horizontal fragmentation based on the attribute Country:

We also want to fragment the table Stops, and store the rows for stops in each country. We use secondary horizontal fragmentation, based on the fragmentation of the table Station.

a) (1p) Give the fragmentation schema for the relation Stops

b) (0.5p) Give the reconstruction program for the relation Station

c) (0.5p) Give the reconstruction program for the relation Stops

d) (1p) We want to know the earliest time of day ever that a train stops at Sundsvalls Central in Sweden. We use this SQL query:

select min(Stops.Time)
from Station, Stops
where Station.Snr = Stops.Station
and Station.Name = "Sundsvalls Central"
and Station.Country = "Sweden"
Translate this to an efficient relational algebra expression, using the global relations.

e) (1p) Localize the above relational algebra expression by replacing the global relations with their reconstruction programs.

f) (1p) Simplify that relational algebra expression by removing parts that give empty results.

Problem 11 (4p)

Enforcing isolation in ACID transactions is most often done with locks.

a) (1p) Sometimes we use read- and write-locks, instead of plain binary locks. How do read- and write-locks work?

b) (1p) Why do we use read- and write-locks, instead of plain binary locks?

c) (2p) Why is locking more difficult in a distributed database than in a centralized database? How do you do it? Are there any special difficulties if you want read- and write-locks in a distributed database?

Problem 12 (3p)

In some systems, such as Microsoft ASP and Roxen WebServer, it is possible to embed SQL queries in a HTML web page. This is an example from Roxen WebServer, which shows such an embedded SQL query:
<table border="1">
<tr><th>Number</th> <th>Name</th></tr>
<emit source="sql" host="foodbase" query="select id, name
                                          from person order by id">
<tr><td>&_.id;</td> <td>&_.name;</td></tr>
</emit>
</table>
a) (1p) Explain, in short, what happens "behind the scenes" in the web server when it encounters an SQL query like this one.

b) (2p) It is convenient to be able to put SQL queries in the HTML code like this, but are there other advantages too, compared to the alternatives? And which are the alternatives?