These are suggested solutions. There may be other correct solutions. Some of these answers may be more exhaustive than required for full points on the exam.
a)
select Record.Name from Record, Artist
where Artist.Name = 'Carola' and Artist = Anr;

PROJECT[Record.Name] ( SELECT[Name = 'Carola'](Artist) JOIN[Artist = Anr] Record )

or, as another example:

PROJECT[Record.Name] ( SELECT[Name = 'Carola' and Artist = Anr] ( Record X Artist ) )
b)
select sum(Length) from Artist, Record, Song
where Anr = Artist and Rnr = Record
and Artist.Name = 'Carola' and Record.Name = 'Stranger';

F[sum(Length)] ( SELECT[Name = 'Carola'](Artist) JOIN[Artist = Anr] SELECT[Name = 'Stranger'](Record) JOIN[Rnr = Record] Song )
c)
select Artist.Name from Artist left outer join Record on Anr = Artist
where Rnr is null;

or

select Name from Artist
where not exists (select * from Record where Artist = Anr);

PROJECT[Anr, Artist.Name] ( SELECT[Rnr = null] ( Artist LEFT OUTER JOIN[Anr = Artist] Record ) )
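The three answers above can be sanity-checked on a toy instance. The sketch below uses SQLite via Python; the sample data is invented, and the schema is an assumption reconstructed from the queries (in particular, a Song table with a Length column and a Record foreign key).

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
create table Artist (Anr integer primary key, Name text, Country text);
create table Record (Rnr integer primary key, Name text, Artist integer);
create table Song   (Snr integer primary key, Name text, Length integer,
                     Record integer);
insert into Artist values (1, 'Carola', 'Sweden'), (2, 'Ace of Base', 'Sweden');
insert into Record values (10, 'Stranger', 1), (11, 'Happy Nation', 2);
insert into Song   values (100, 'Stranger', 180, 10), (101, 'Hunger', 200, 10);
""")

# a) Names of Carola's records.
a = db.execute("""select Record.Name from Record, Artist
                  where Artist.Name = 'Carola' and Artist = Anr""").fetchall()

# b) Total length of the songs on Carola's record 'Stranger'.
b = db.execute("""select sum(Length) from Artist, Record, Song
                  where Anr = Artist and Rnr = Record
                  and Artist.Name = 'Carola'
                  and Record.Name = 'Stranger'""").fetchone()

# c) Artists without records (add one first so the result is non-empty).
db.execute("insert into Artist values (3, 'Unknown', 'Norway')")
c = db.execute("""select Name from Artist
                  where not exists (select * from Record
                                    where Artist = Anr)""").fetchall()

print(a, b, c)   # [('Stranger',)] (380,) [('Unknown',)]
```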
The DBMS keeps track of all actions in the database, and whenever someone performs an action that has a trigger on it, for example when someone adds a row to the table Song, the appropriate rule (in that case the rule klu2) is executed by the DBMS, thus keeping the sum of lengths in the table Record up to date.

create trigger klu1 for Song after update as
    if old.Length != new.Length
    update Record set Length = Length - old.Length + new.Length
    where Rnr = new.Record;

create trigger klu2 for Song after insert as
    update Record set Length = Length + new.Length
    where Rnr = new.Record;

create trigger klu3 for Song after delete as
    update Record set Length = Length - old.Length
    where Rnr = old.Record;
(If you wonder what klu means, it stands for keep length updated.)
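As an illustration, the insert and delete rules can be tried in SQLite, whose trigger syntax differs slightly from the pseudo-syntax above; the table contents here are invented.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
create table Record (Rnr integer primary key, Name text, Length integer);
create table Song (Snr integer primary key, Length integer, Record integer);
insert into Record values (1, 'Stranger', 0);

-- klu2: keep Record.Length updated on insert.
create trigger klu2 after insert on Song begin
    update Record set Length = Length + new.Length where Rnr = new.Record;
end;

-- klu3: keep Record.Length updated on delete.
create trigger klu3 after delete on Song begin
    update Record set Length = Length - old.Length where Rnr = old.Record;
end;
""")
db.execute("insert into Song values (1, 180, 1)")
db.execute("insert into Song values (2, 200, 1)")
db.execute("delete from Song where Snr = 1")
total = db.execute("select Length from Record where Rnr = 1").fetchone()[0]
print(total)   # 200
```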
b)
c)
Smaller intermediate results, with fewer tuples and fewer attributes, and therefore less work, since the optimized query tree performs select and project operations as early as possible, before the more expensive operations such as join.
d)
With materialization, the entire result of each operation in the tree is created ("materialized") and stored somewhere (perhaps on disk). With pipelining, each row in the result is passed on to the next operation as soon as that row is available. With pipelining, the query tree can be executed as a nested loop.
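A minimal sketch of pipelining using Python generators: each operator yields one row at a time, so no intermediate result is materialized. The operator names and table data are invented for illustration.

```python
def scan(table):
    # Leaf of the query tree: produce rows one at a time.
    for row in table:
        yield row

def select(rows, pred):
    # Pass on only the rows that satisfy the predicate.
    for row in rows:
        if pred(row):
            yield row

def project(rows, cols):
    # Keep only the requested columns of each row.
    for row in rows:
        yield tuple(row[c] for c in cols)

artists = [(1, 'Carola', 'Sweden'), (2, 'Ace of Base', 'Sweden')]

# The whole tree runs as one nested loop, one row at a time.
plan = project(select(scan(artists), lambda r: r[1] == 'Carola'), (0, 1))
result = list(plan)
print(result)   # [(1, 'Carola')]
```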
e)
A heuristic query optimizer (of the type covered in this course) creates an execution plan by applying some general rules, such as "select operations should be performed as early as possible". It doesn't take into account different costs of using different algorithms for performing operations such as join and select. It doesn't take into account the size of the data, or the use of different physical storage structures (indexes).
A cost-based query optimizer, on the other hand, has a cost function, which is used to estimate the cost (typically measured in number of disk accesses) of an execution plan. To estimate the cost, the cost function must (and therefore does) take into account the actual or estimated size of the data, which storage structures exist, and also which algorithms will be used.
Choose two of nested-loop join, single-loop join, sort-merge join, and hash join (and perhaps partition hash join and hybrid hash join). See Elmasri/Navathe, 3rd Ed., Section 18.2.3 for descriptions.
b)
Example: In most cases, a single-loop join is much better than a nested-loop join. But this depends on the data and the indexes. There exist cases where a nested-loop join may actually be better.
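The contrast can be sketched in Python; here a hash table stands in for the index used by a single-loop join, and the data is invented.

```python
def nested_loop_join(R, S, r_col, s_col):
    # Compare every pair of rows: |R| * |S| comparisons.
    return [r + s for r in R for s in S if r[r_col] == s[s_col]]

def single_loop_join(R, S, r_col, s_col):
    # Loop over R only; look up matching S rows in an index on S.
    index = {}
    for s in S:
        index.setdefault(s[s_col], []).append(s)
    return [r + s for r in R for s in index.get(r[r_col], [])]

R = [(1, 'Carola'), (2, 'Ace of Base')]        # Artist(Anr, Name)
S = [(10, 'Stranger', 1)]                      # Record(Rnr, Name, Artist)
joined = single_loop_join(R, S, 0, 2)
assert joined == nested_loop_join(R, S, 0, 2)  # same result, different cost
print(joined)   # [(1, 'Carola', 10, 'Stranger', 1)]
```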
c)
For disk-based databases, the best join algorithm for a particular case is the one that requires the fewest disk block accesses.
a)
Any reasonable optimizer will first perform SELECT[Rnr = 17](Record), and then use the single row in that result to perform a JOIN with Artist. This join is in effect a SELECT[Anr = ...](Artist). So we calculate the times to perform two select operations, on key columns, on the two tables.
There are 1000 rows in the table Record.
Assume a block size of for example 8 kibibyte, i. e. 8192 bytes,
a random block access time of 10 ms,
and a sequential read time of 1 ms.
Also assume that Rnr is 4 bytes, Name is 42 bytes, and Artist is 4 bytes.
This gives a record size of 50 bytes, a blocking factor of 163,
and 7 blocks for the entire file.
On average, we will find the correct record after reading 4 blocks.
If the file is stored sequentially, it will take 13 ms (10 + (4-1)*1) to read it.
There are 100 rows in the table Artist.
Assume that Anr is 4 bytes, Name is 42 bytes, and Country is 4 bytes.
This gives (again) a record size of 50 bytes, and (again) a blocking factor of 163,
so the entire file fits in a single block.
It takes 10 ms to read.
Answer: 13 + 10 = 23 ms, or 0.023 seconds.
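The arithmetic above can be written out as follows; all sizes and times are the assumptions stated in the answer.

```python
import math

block_size = 8192            # bytes per disk block
random_access = 10           # ms for a random block access
sequential_read = 1          # ms for a sequential block read

record_size = 4 + 42 + 4                   # Rnr + Name + Artist = 50 bytes
bfr = block_size // record_size            # blocking factor: 163
blocks = math.ceil(1000 / bfr)             # 7 blocks for the Record file
avg_blocks = (blocks + 1) // 2             # 4 blocks read on average

t_record = random_access + (avg_blocks - 1) * sequential_read   # 13 ms
t_artist = random_access                   # Artist fits in one block: 10 ms
print(bfr, blocks, t_record, t_record + t_artist)   # 163 7 13 23
```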
b)
Now there are 10^6 rows in the Record table,
and 7519 blocks.
On average we need to read 3760 blocks, which takes 3.760 seconds.
(We ignore that the first read takes 10 ms instead of 1 ms.)
There are 10^5 rows in the Artist table,
and 614 blocks.
On average we need to read 307 blocks, which takes 0.307 seconds.
Answer: 3.760 + 0.307 ≈ 4.1 seconds.
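The same arithmetic written out for part b), using the block counts from above and 1 ms per sequentially read block:

```python
record_blocks = 7519               # blocks in the Record file
artist_blocks = 614                # blocks in the Artist file

avg_record = (record_blocks + 1) // 2      # 3760 blocks read on average
avg_artist = artist_blocks // 2            # 307 blocks read on average

t = (avg_record + avg_artist) / 1000       # 1 ms per block, in seconds
print(t)   # 4.067
```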
c)
Use indexes on Record(Rnr) and Artist(Anr).
We assume primary indexes on Record(Rnr) and on Artist(Anr).
First look at the B+-tree index on table Record.
Assume that a disk block pointer requires 4 bytes.
A B+-tree node has room for 1024 entries, each a key plus a pointer to a sub-tree (8192 / (4 + 4) = 1024).
Since only two thirds of each node is used, on average,
the average "effective order" of the tree is 682.
There are 7519 data blocks
(or more, if they are constructed dynamically using the index).
On the first index level, we need ceil(7519 / 682) = 12 index blocks.
A second level is enough to connect them,
so there are two levels in the index.
A lookup requires two index block reads, and a data block read,
giving a total of 30 ms.
The index for the Artist table turns out to be similar,
but there are only 614 data blocks.
A single index level is enough to index them.
A lookup requires one index block read, and a data block read,
giving a total of 20 ms.
Answer: 50 ms, or 0.05 seconds.
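The index arithmetic can be written out as follows; the node capacity and two-thirds occupancy are the assumptions stated above.

```python
import math

capacity = 8192 // (4 + 4)        # 1024 (key + pointer) entries per node
order = capacity * 2 // 3         # effective order ~682 at 2/3 occupancy

def index_levels(data_blocks, fanout):
    # Count B+-tree levels needed to index the given number of blocks.
    levels, blocks = 0, data_blocks
    while blocks > 1:
        blocks = math.ceil(blocks / fanout)
        levels += 1
    return levels

rec_levels = index_levels(7519, order)     # 2 levels for Record
art_levels = index_levels(614, order)      # 1 level for Artist

# Each lookup reads the index blocks plus one data block, at 10 ms each.
t = (rec_levels + 1) * 10 + (art_levels + 1) * 10
print(rec_levels, art_levels, t)   # 2 1 50
```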
a)
A CGI script is a separate program, which is started by the web server. The complete output from the CGI script becomes the HTML source code that is sent to the client, and this HTML source is then rendered by the browser and shown to the user. Since a CGI script is a complete, stand-alone program, it can connect to a database and retrieve data in the same way as any other program.
b)
One alternative is to use a web server that understands SQL statements embedded in web pages, sends them to the database server, and replaces each SQL statement with the result of that query before the HTML source is sent to the client. (Other alternatives, such as Java applets and other types of client-side programs, can also be considered.) CGI scripts are inefficient. Starting a new, separate program takes a long time on a computer. Also, if the CGI-generated web page contains output from a database, the CGI script must open a new connection and log in on the database server, and this has to be done every time the web page is requested. On the other hand, CGI scripts are portable, and easier to test.
ESQL ... ESQL sometimes requires using C and not C++. ESQL is standardized, but less portable than ODBC, and requires re-compilation and sometimes some modification, if you wish to switch to another DBMS. ESQL probably gives the simplest and (if you know ESQL) easiest to read source code. (The truth of that statement can be discussed at length.)
The DBMS's own function library ... The DBMS's own function library may be faster than ODBC, and may allow use of some special features that are not available through ODBC.
A, atomicity. Transactions are atomic, so either all of a transaction's parts are performed, or none of them, even in the face of aborts and system crashes. Without this property, half-finished transactions could remain in the database, for example withdrawing some money from a bank account but never depositing it in another.
C, consistency preserving. If the database is in a consistent state before the transaction, fulfilling all integrity constraints, it will be in a consistent state after the transaction. Without this property, transactions could cause the database to become inconsistent, and perhaps useless.
I, isolation. Concurrently executing transactions are isolated from each other, and intermediate or partial results from one transaction are not allowed to (unduly) affect another transaction. Without this property, transactions could interfere with each other, for example two transactions writing to the same table and ending up with a garbled mix of two sets of data.
D, durability. When a transaction has been committed, the changes it has made in the database should never disappear, even in the face of aborts and system crashes. Without this property, important changes could disappear from the database, for example with money vanishing from your bank account.
b)
Step | Transaction T1 | Transaction T2 |
---|---|---|
1 | Lock(A) | |
2 | Read(A) | |
3 | Lock(B) | |
4 | Read(B) | |
5 | A = A + B | |
6 | B = 0 | |
7 | | Lock(C) |
8 | | Read(C) |
9 | | B = C |
10 | | C = 0 |
11 | | Waiting to Lock(B)... |
12 | Write(B) | |
13 | Unlock(B) | |
14 | | ...Lock(B) |
15 | | Write(B) |
16 | Write(A) | |
17 | Unlock(A) | |
18 | | Write(C) |
19 | | Unlock(B) |
20 | | Unlock(C) |
The operations after T1's Unlock(B) may be performed in various orders.
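The lock table behind the schedule can be sketched as follows; this is a simplified model with exclusive locks only and no real threads, where a denied request just records that the transaction must wait.

```python
locks = {}          # item -> transaction currently holding its lock
waiting = []        # (transaction, item) pairs that had to wait

def lock(txn, item):
    # Grant the lock if the item is free (or already held by txn).
    holder = locks.get(item)
    if holder is None or holder == txn:
        locks[item] = txn
        return True
    waiting.append((txn, item))
    return False

def unlock(txn, item):
    if locks.get(item) == txn:
        del locks[item]

lock("T1", "A"); lock("T1", "B")     # steps 1 and 3
lock("T2", "C")                      # step 7
blocked = not lock("T2", "B")        # step 11: T1 still holds B, T2 waits
unlock("T1", "B")                    # step 13
granted = lock("T2", "B")            # step 14: T2 finally gets B
print(blocked, granted)   # True True
```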
c)
TS(T1) = 1. TS(T2) = 2.
Step | Transaction T1 | Transaction T2 | read_TS(A) | write_TS(A) | read_TS(B) | write_TS(B) | read_TS(C) | write_TS(C) |
---|---|---|---|---|---|---|---|---|
0 | | | 0 | 0 | 0 | 0 | 0 | 0 |
1 | Read(A) | | 1 | 0 | 0 | 0 | 0 | 0 |
2 | Read(B) | | 1 | 0 | 1 | 0 | 0 | 0 |
3 | A = A + B | | 1 | 0 | 1 | 0 | 0 | 0 |
4 | B = 0 | | 1 | 0 | 1 | 0 | 0 | 0 |
5 | | Read(C) | 1 | 0 | 1 | 0 | 2 | 0 |
6 | | B = C | 1 | 0 | 1 | 0 | 2 | 0 |
7 | | C = 0 | 1 | 0 | 1 | 0 | 2 | 0 |
8 | | Write(B) | 1 | 0 | 1 | 2 | 2 | 0 |
9 | Write(B) | | 1 | 0 | 1 | 2 | 2 | 0 |
10 | Write(A) | | 1 | 1 | 1 | 2 | 2 | 0 |
11 | | Write(C) | 1 | 1 | 1 | 2 | 2 | 2 |
Both transactions commit after having performed the operations in the given order. No rollback is required. (Operation number 9, T1's Write(B), is skipped according to Thomas's write rule.)
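The run can be replayed with a small sketch of basic timestamp ordering plus Thomas's write rule; it is simplified in that no aborts occur in this particular schedule, matching the table above.

```python
read_ts = {"A": 0, "B": 0, "C": 0}
write_ts = {"A": 0, "B": 0, "C": 0}

def read(ts, item):
    # A read is too late if a younger transaction has already written.
    if ts < write_ts[item]:
        return "abort"
    read_ts[item] = max(read_ts[item], ts)
    return "ok"

def write(ts, item):
    # A write is too late if a younger transaction has already read.
    if ts < read_ts[item]:
        return "abort"
    if ts < write_ts[item]:
        return "skip"        # Thomas's write rule: ignore the obsolete write
    write_ts[item] = ts
    return "ok"

# TS(T1) = 1, TS(T2) = 2; operations in the order of the table above.
ops = [read(1, "A"), read(1, "B"), read(2, "C"),
       write(2, "B"), write(1, "B"), write(1, "A"), write(2, "C")]
print(ops)   # ['ok', 'ok', 'ok', 'ok', 'skip', 'ok', 'ok']
```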
K-d tree, and how it divides the area: