Assignment 2: POSIX Threads, Linux kernel modules

Files

Lab materials

A Linux PC (or virtual machine). You can use a VirtualBox guest machine that can be downloaded from the course book's website, or your own Linux machine, or one of the Linux machines in T002 (the same kernel as in the 10th Edition of the book), or (for some of the tasks) the ThinLinc server.

Some of these programming assignments are described in the course book.

In the 9th Edition (the paper version), kernel modules are described in "Programming Projects" in chapter 2 (pages 94-98) and in project 2 in "Programming Projects" in chapter 3 (pages 156-158).

In the 10th Edition (the electronic version), kernel modules are described in "Programming Projects" in chapter 2 (there are no page numbers) and in project 2 and 3 in "Programming Projects" in chapter 3.

The 10th Edition of the book, and the accompanying virtual machine, uses Linux kernel version 4.4. (The machines in T002 have this kernel.) The 9th Edition of the book, and its virtual machine, uses an older kernel version, 3.16.

Task 1

The thread performance test program "thread-performance-test-2020.c" starts one or more threads that run in parallel, and measures how many flops (floating point operations per second) it performs, both in total and per thread.

Here is a graph showing its performance on a computer with an Intel Core i7-6850K CPU from 2016. It has 6 cores, with hyperthreading. As you can see, performance increases linearly with the number of threads up to 12 threads, but with 13 or more threads the increase stops. Question: Why is that?

(Click the graph for a larger version.)

Here is a graph showing the same program's performance on a Mac Mini from 2020, which has Apple's own M1 processor. It has 8 cores and no hyperthreading. Question: Can you explain these results, especially the "knee" on the curve at 4 threads? (Wikipedia has an article about the M1 processor: https://en.wikipedia.org/wiki/Apple_M1)

(Click the graph for a larger version.)

Question: The first computer above runs Linux, and the second (the Mac Mini) runs Mac OS X. Can you guess something, just from these two graphs, about how the schedulers in Linux and Mac OS X compare when it comes to handling large numbers of threads?

An added graph from June 28, 2021:

However, when I ran the same program on the same Mac Mini another time, I got the following graph. Questions: What do you think happened? What implications does it have for how you should run performance tests?

(Click the graph for a larger version.)

Here is a graph that shows the program's performance on the university's ThinLinc server (thinlinc.oru.se), which runs on an Intel Xeon CPU E5-2660 v3 from 2014. According to its specifications, this processor type has 10 cores, with hyperthreading. Question: Can you explain these results?

(Click it to big it!)

Run the same thread performance test program on your computer. Question: Which results do you get? Explain your results!

Task 2

Modify the single-threaded monster world program ("single-threaded-monster-world.c") so it starts one thread for each monster, and all the fighting (the function "one_monster_fights") is done in those threads.

Try to utilize the hardware! For example, don't use a single lock, which locks the entire world. That would mean that no matter how many CPU cores you have, only one monster fight can happen at the same time.

Start without locking, and see if you get incorrect results.

See also the "MONSTER-WORLD-README" file.

Possibly useful command:

gcc -Wall -Wextra -std=c11 -O3 multi-threaded-monster-world.c -lpthread -o multi-threaded-monster-world

With the following Makefile you can just type "make multi-threaded-monster-world":

CC = gcc
CFLAGS += -Wall -Wextra -std=c11 -O3
LDLIBS += -lpthread

Questions:

Which results do you get?
What are the times for the single-threaded version, the multi-threaded version without locking, and the multi-threaded version with locking?
How do these times compare?
Can you explain what happens?
Compare to the "thread-performance-test-2020.c" program! Do you see the same behaviour when you increase the number of threads? Can you explain what happens?

Task 3

Create a kernel module that just prints some distinctive messages to the system log when it is loaded and when it is unloaded. Verify with "dmesg" that it works.

If you are running this in a modern Linux (kernel version 4.4.0-20 or later) on a computer where Secure Boot is enabled, it will fail with the error message "Operation not permitted", and in the kernel messages (dmesg) it will print "insmod: Loading of unsigned module is restricted; see man kernel_lockdown.7". This is beacause kernel modules have to be signed. Either turn off Secure Boot (not recommended), run in a virtual machine without Secure Boot, or sign your module (after each recompilation). Here is a Stack Overflow answer that explains how to sign your kernel modues:

https://askubuntu.com/questions/760671/could-not-load-vboxdrv-after-upgrade-to-ubuntu-16-04-and-i-want-to-keep-secur/768310#768310

(They are signing a module called vboxdrv, so you need to replace that with the name of your module.)

Task 4

Create a kernel module that, when it is loaded, prints a list of all processes to the system log. Verify with "dmesg" that it works.

There are macros to loop through processes, and they might be different between kernel 3 and 4.

An update: Note that in later kernels, the macro for_each_process has been moved to the #include file <linux/sched/signal.h>

Task 5 (Optional, counts toward a higher grade than 3)

Modify the kernel module from task 3 so that, when it is loaded, it prints a list of all processes, and for each process lists all threads in that process.

Run the threaded program from task 1 or task 2 to verify that your module works.

Task 6 (Optional, counts toward a higher grade than 3)

Let's say you run the multi-threaded monster world program you developed in task 2 with ten monsters, and therefore ten threads. Let's also say that you run it on a computer where the hardware actually can execute ten threads in parallel. (That is, the computer must have at least ten CPU cores, or five cores with hyperthreading.) Optimistically, one could hope that the multi-threaded program would be ten times faster than the single-threaded one, since there are ten cores working in parallel instead of just one.

Change the multi-threaded program so it actually is as fast as we hope! With five monsters, it should be (at least appoximately) five times faster, with ten monsters, ten times faster, and with twenty monsters, twenty times faster, assuming we run it on a computer that can handle that many threads in parallel.

This is not an easy task. You might need to significantly re-structure how the program works.

Report

Write down and hand in a short summary (2-4 pages) of what you have done. Don't forget to answer the questions above. Mail it (as a PDF or text file) to the teacher. All source code for your programs must be attached. It is good if you can also demonstrate your results at one of the scheduled lab times.

Thomas Padron-McCarthy (thomas.padron-mccarthy@oru.se), 13 januari 2022