The Very Basics of Garbage Collection

By Thomas Padron-McCarthy (e-mail: Thomas.Padron-McCarthy@oru.se). Latest change: October 15, 2021.

This is a short introduction to automatic garbage collection. Garbage collection is the process of automatically determining what memory a program is no longer using, and making it available for other use. It is performed by a subsystem called a garbage collector, which can be part of the runtime system of a particular programming language, or an add-on library. Garbage collection can, but does not have to, be assisted by the compiler, the hardware, or the OS.

Dynamically allocated memory

Let's start with a small example program in C++:

#include <iostream>

struct Link {
  int number;
  Link* next;
};

Link* first = NULL;
Link* last = NULL;

void append(int number) {
  Link* p = new Link;
  p->number = number;
  p->next = NULL;
  if (first == NULL)
    first = last = p;
  else {
    last->next = p;
    last = p;
  }
}

int main() {
  append(17);
  append(4711);
  append(42);
}

This program builds a linked list. It allocates three chunks of memory, for three records in the list. When the program is run, at the end of the program its data structures will look like this:

Data at the end of the program

The variables first and last are static storage, also called statically allocated storage. (But many people use the word "memory" instead of "storage".) These variable exist during the entire life of the program, and can't be allocated or freed by the programmer.

The procedure append contains two local variables: the regular local variable p, and the parameter variable number. These variables are allocated on the stack each time the program enters the procedure. When program execution leaves the procedure, the procedure's local variables are popped from the stack, and disappear. (This isn't always true. Many modern compilers put small variables like p and number in fast registers in the processor itself, and not in regular memory.)

The three records in the linked list, containing the numbers 17, 4711 and 42, are dynamic storage, also called dynamically allocated storage. They are allocated using C++'s operator new. When we don't need them any more, we can use C++'s operator delete to "free" or "deallocate" them, allowing the memory they occupied to be used for other purposes. (A C programmer would use malloc and free instead.)

Note that dynamic memory allocated with new (or malloc) can be placed anywhere in the computer's memory. In the figure above it looks like the memory chunks for the records are placed immediately after each other, but there is no such guarantee. It could just as well look like this instead:

Another possibility

Manual memory management

When the C++ program's execution is finished, all memory used by the program is returned to the operating system. (Well, this depends on your operating system, and on other things, but it is true for a normal program running on a Unix or Windows system.) You don't need to write code in your program to free memory.

But in a program that runs for a long time, you might have to free memory that is no longer being used. If you don't, the computer's memory will fill up with unused data.

If you are a C or C++ programmer, you are probably familiar with how to free memory. If we no longer need the linked list in the example program, we can free it using C++'s operator delete:

  Link* p = first;
  while (p != NULL) {
    Link* tbd = p;
    p = p->next;
    delete tbd;
  }

This is called freeing or de-allocating memory, and after this manual cleanup the three memory chunks that we allocated have been freed and returned to the program's pool of unused memory. If we call new at a later time, some of the freed memory may be reused.

Small warning: We never set the variables first and last to NULL, or some other valid pointer, after freeing the records in the linked list. So now they point to memory that is no longer in use. They have become dangling pointers.

Dangling pointers

That the memory is no longer in use means that we don't know anything about what is at the end of those pointers. The old data may still be there, at least for a while. The memory may have been re-used (by another call to new), so now the pointers point at some completely different data structures. Or the memory may have been returned to the operating system, and (as far as this program knows) it no longer exists at all.

Garbage

Suppose that, at the end of the program, instead of freeing the records in the linked list with delete, we perform this assignment:
  first = last;

Now our data structures look like this:

Data after adding the statement first = last;

The two records with content 17 and 4711 have become unreachable. There is no way that the program can find them again. (Except by scanning through the program's entire data address space.) If we are using explicit memory management with delete, they can never be freed. They have turned into garbage.

From real life we know that unless garbage is collected, it just lies there rotting in the street for ever. It doesn't work any better inside a computer. If we want to use the memory that the unreachable records occupy for something else, the garbage has to be collected.

We could have freed the records first, before changing the first pointer. In this case it would have been easy. We could use a slightly modified version of the loop with delete that we saw above. But manual memory management, with explicit freeing of memory, can be very difficult.

Why is manual memory management hard?

It might seem simple to just free the allocated data objects as soon as they are no longer needed. But how do we know when that is?

In a large program, the same data object might be used in many places. For example, a subsystem (such as a class, a procedure or a library) in a program allocates a data object, and stores a pointer to the newly allocated object. This could be the user interface, which lets the user enter data, and then creates a data object representing those data. Perhaps another subsystem needs to do something with the data, so the first subsystem passes a pointer to the object to the other subsystem, which stores the pointer in a variable. The second subsystem might then pass the pointer on to a third subsystem.

Subsystems and data

So how do we know when the object is no longer needed? No single subsystem can decide to free the object, because other subsystems might still need it. It is possible to solve this problem, but it can be difficult. It is very easy to make mistakes, and either free the memory too soon (and cause all sorts of errors when a subsystem tries to use the deleted object), or free the memory too late or not at all (causing a memory leak).

And what about the second data object in the figure? Even if the only subsystem that contains a pointer to that object is the third subsystem, this subsystem still can't free the object when it doesn't need it any more. Another subsystem might still need it, and reach it through the pointer inside the first object.

Automatic garbage collection

The garbage collector should garbage collect any data that the program won't be using any more. Therefore the garbage collector should determine which data the program will make use of in the future, the so-called live data. How can the garbage collector do this?

The simple answer is that it can't, at least not in the general case. ("In the general case" means "always".) The garbage collector would have to make a complete analysis of the program, including all future input to the program. This is of course impossible. Instead, we use the concept of reachable data, which is data that the program can find. Reachable data consists of two parts:

In our C++ example above, the root set consists of the global variables first and last, and also, when we are executing the procedure append, the local variable p and the parameter variable number.

Any data item that can't be reached be following pointers from the root set, and therefore isn't part of the reachable data, is garbage, and can be recycled at any time.

A simple garbage collection algorithm: Mark and sweep

"Mark and sweep" is the basic, and simplest, garbage collection algorithm. To use this algorithm, you need three things:
  1. You need an extra bit in each data object, or perhaps a place somewhere else in memory where you allocate one bit for each data object that exists in the program's memory. The bits are used to mark if a data object is used (that is, reachable) or not used (that is, garbage).
  2. You need a way to go through all existing data objects, no matter if they can be reached from the program or not. (This may seem like a contradiction, but remember that the garbage collector is, at least in some sense, part of the runtime system, and it has access to system resources that the program doesn't.)
  3. You need to know the root set.

The algorithm consists of three steps:

  1. Start by clearing all the marker bits, marking every data item as not used.
  2. The marking phase. Go through the root set, marking all data items there as used. For each pointer found in the root set, follow that pointer to whatever data object it points to, and mark that object as used. Continue recursively, following all pointer from each found data object, and mark all new data objects that you find.
  3. The sweep phase. Now, all reachable data objects are marked as used. Go through all data items that exist in the program's memory. Those that have not been marked are garbage, and can be recycled. Exactly how the recycling is done varies between different system, but a common way is to use a pointer in each free data object and link them together in a free list.

Using the C++ program and its linked list as an example, we can illustrate the mark-sweep algorithm by first adding a small flag to each data object. We start by clearing all the flags.

A flag on each data object

(Note: In this example, I've assumed that even the pointer variables first and last are data objects that can be garbage collected, and that they have a bit for the used flag. This need not be the case.)

Then we mark as used all data objects in the root set, which, as we have mentioned earlier, consists of the variables first and last:

After marking the root set

We go through all data that can be reached from any item in the root set, in this case the record with the content 42. (If there had been any pointers from that record on to other objects, we would of course have marked those objects too, and then followed any pointers from them on to other objects, and so on.)

After marking all reachable objects

Now all reachable data has been marked. We go through all data objects, and those that have not been marked can safely be deleted, and their memory recycled.

After freeing unmarked objects

Another example, with slightly more complicated data. This picture shows the data after the marking phase has completed:

Another example of mark and sweep: after marking

Some comments about this example:

Almost garbage collection: Reference counting

Another simple algorithm for reclaiming unused memory is reference counting.

You need an extra field in each of the data objects that should be managed by the algorithm, and in that field you store the number of pointers that currently point to the object. When the counter reaches 0, no pointers point to the object, which means that it is garbage and the memory it uses can be reclaimed.

Reference counting

(It is also possible to allocate the fields somewhere else, and not in the objects themselves.)

We now perform this statement:

  first = last;

There is now one less pointer pointing to the record with content 17, and one more pointer pointing to the record with content 42. A garbage collector like mark-and-sweep can be run at any time, and will free all objects that can't be reached from the root set. With reference counting, the counter field must be updated every time a pointer is changed. So, after the assignment, the data will look like this, with the 17-record's field decremented and the 42-record's field incremented:

After the statement first = last;

The reference counting system sees that the record with content 17 has a reference count of 0, so it can be freed:

After freeing the first record

The freed record contained a pointer to the next record. When we removed that pointer, we had to change the reference counter in the pointed-to record. It has now reached 0, so this record too can be freed:

After freeing the second record

In the basic reference counting algorithm, all this work is performed immediately after the assignment statement.

Reference counting is sometimes considered as a form of garbage collection, but sometimes the term "garbage collection" is reserved for algorithms like mark-and-sweep that can be run at any time.

The FAQ referenced below explains referenced counting like this:

For each "object" managed by the collector, maintain a count of pointers referencing that object. Whenever a pointer is assigned to, as in "P := Q", the referent of Q has its reference count incremented, and the former referent of P has its reference count decremented. If the reference count ever falls to zero, then it is known that there are no pointers to the object, and it may be reclaimed. The converse is not true -- some unreferenced objects may not have a reference count of zero, because of cycles. Furthermore, the fields in which reference counts are stored are typically small, so it is also possible that they will overflow, in which case they "stick" at their maximum value -- again, the object cannot be reclaimed. However, in practice this is rare (unless the reference count field is very small) and it is also possible to confine reference-counting to data structures that will not participate in cycles (for examples, strings, or other pointer-free data).

In a multi-threaded system, if threads are preemptively scheduled, reference counting requires additional care to ensure that the updates to reference counts are atomic. This is straightforward using hardware primitives such as fetch-and-add, compare-and-swap, or load-linked/store-conditional, but it is definitely necessary to pay attention to this detail.

Advanced garbage collection algorithms

We'll look at two useful variations of the basic mark-and-sweep algorithm, generational garbage collection and conservative garbage collection.

Generational collection

This part is copied from the FAQ referenced below.

Based on the observation that most objects have short lifetimes, it is useful to restrict garbage collection to the most recently allocated objects.

A generational collector maintains several "generations" of objects. Newly created objects are all put in the "youngest" generation, and when the space allocated for that generation is full, the collector will use the root set (and any pointers from older generations to the youngest generation -- see below) to reclaim dead objects from the youngest generation only, leaving the "older" generations untouched. Objects that survive after several (perhaps just one) collection of the youngest generation are "promoted" to the next older generation, and when that generation becomes full, it and all the generations younger than it are collected.

The difficulty with generational collectors is the identification of pointers from older generations into younger ones. An observation is that such references are uncommon in most programs: new objects typically point to older objects; this is clearly true in pure functional languages where assignment is prohibited, but is common for many programs and languages. Nonetheless, such references do exist, and a collector must handle them. When a pointer to a younger generation object is written into an old object, the pointer (or just the old object) is recorded so it can be scanned when the younger generation is collected.

Conservative collection

This part is copied from the FAQ referenced below.

Conservative garbage collection makes use of the observation that if you are not relocating (copying) objects, then you need not be quite so certain about exactly what is a pointer. It suffices to scan the root set (and objects) for any pointer-like bit patterns, and treat those pointer-like bit patterns as actual pointers while marking. The result is that the "true" reachable objects are all found, along with a few others. This works surprisingly well in practice (if a few additional tricks are added) and often permits the use of garbage collection with oblivious source code and compilers.

Conservative collection is very important because it permits the use of garbage collection with programs that were not written with garbage collection in mind. That is, it simplifies the use of garbage collection with existing (non-garbage-collected) libraries of code.

Garbage Collection and Threads

Modern programs often contain multiple threads, sharing the same memory. Some programming tasks are easier to solve using threads, and threads are required to utilize the full performance of modern multi-core processors. Garbage collection is more difficult with threads, since memory is shared between the threads and all access to data needs to be synchronized between the threads. Typically there is one thread dedicated to garbage collection, and to avoid "stop-the-world" collection, which would pause all the other threads, some form of incremental collection is adapted for use in the threaded environment.

Myths

Some of the most important concepts

Garbage collection. In Swedish: skräpsamling, but the English term is often used, and is sometimes abbreviated to garbning. (Note that garbage collection, in the real-world and non-computer meaning, is better translated with sophämtning in Swedish.)

Garbage collector. In Swedish: skräpsamlare, but the English term is often used, and is sometimes abbreviated to garb.

Live data. In Swedish: levande data. Data that will be used again by the running program.

Dead data. In Swedish: döda data. Data that will not be used again by the running program. Each data item is either alive or dead, but ut it can be hard, and sometimes impossible, to know which.

Reachable data. In Swedish: nåbara data. Data that is accessible by following pointers from the root set.

Unreachable data or garbage. In Swedish: onåbara data or skräp. Data that isn't accessible by following pointers from the root set. Unreachable data are guaraneed to also be dead, since if they can't be reached by the program, they can't be used. Each data item is either reachable or garbage, and it is the garbage collector's job to determine which. All garbage is dead data but all dead data is not necessarily garbage.

Mark and sweep or mark-sweep. The English terms are used in Swedish too.

Free list. In Swedish: fri lista or frilista.

Reference. In Swedish: referens. In this context: a pointer to a data object. Note that a "reference" in C++ (such as the variable r defined by int& r = i;) is something else.

Reference counting. In Swedish: referensräkning.

Generational garbage collection. In Swedish: generationsskräpsamling or generationsgarbning.

Conservative garbage collection. In Swedish: konservativ skräpsamling or konservativ garbning.

Data object or data item. In Swedish: dataobjekt. Used in this text to mean the contents of an independently allocatable and freeable area in memory. Doesn't necessarily have anything at all to do with object-oriented programming.

Storage. A synonym for memory. In Swedish: lagringsutrymme or minne.

Still interested?