This is a short introduction to automatic garbage collection. Garbage collection is the process of automatically determining what memory a program is no longer using, and making it available for other use. It is performed by a subsystem called a garbage collector, which can be part of the runtime system of a particular programming language, or an add-on library. Garbage collection can, but does not have to, be assisted by the compiler, the hardware, or the OS.
#include <iostream> struct Link { int number; Link* next; }; Link* first = NULL; Link* last = NULL; void append(int number) { Link* p = new Link; p->number = number; p->next = NULL; if (first == NULL) first = last = p; else { last->next = p; last = p; } } int main() { append(17); append(4711); append(42); }
This program builds a linked list. It allocates three chunks of memory, for three records in the list. When the program is run, at the end of the program its data structures will look like this:
The variables first and last are static storage, also called statically allocated storage. (But many people use the word "memory" instead of "storage".) These variable exist during the entire life of the program, and can't be allocated or freed by the programmer.
The procedure append contains two local variables: the regular local variable p, and the parameter variable number. These variables are allocated on the stack each time the program enters the procedure. When program execution leaves the procedure, the procedure's local variables are popped from the stack, and disappear. (This isn't always true. Many modern compilers put small variables like p and number in fast registers in the processor itself, and not in regular memory.)
The three records in the linked list, containing the numbers 17, 4711 and 42, are dynamic storage, also called dynamically allocated storage. They are allocated using C++'s operator new. When we don't need them any more, we can use C++'s operator delete to "free" or "deallocate" them, allowing the memory they occupied to be used for other purposes. (A C programmer would use malloc and free instead.)
Note that dynamic memory allocated with new (or malloc) can be placed anywhere in the computer's memory. In the figure above it looks like the memory chunks for the records are placed immediately after each other, but there is no such guarantee. It could just as well look like this instead:
But in a program that runs for a long time, you might have to free memory that is no longer being used. If you don't, the computer's memory will fill up with unused data.
If you are a C or C++ programmer, you are probably familiar with how to free memory. If we no longer need the linked list in the example program, we can free it using C++'s operator delete:
Link* p = first; while (p != NULL) { Link* tbd = p; p = p->next; delete tbd; }
This is called freeing or de-allocating memory, and after this manual cleanup the three memory chunks that we allocated have been freed and returned to the program's pool of unused memory. If we call new at a later time, some of the freed memory may be reused.
Small warning: We never set the variables first and last to NULL, or some other valid pointer, after freeing the records in the linked list. So now they point to memory that is no longer in use. They have become dangling pointers.
That the memory is no longer in use means that we don't know anything about what is at the end of those pointers. The old data may still be there, at least for a while. The memory may have been re-used (by another call to new), so now the pointers point at some completely different data structures. Or the memory may have been returned to the operating system, and (as far as this program knows) it no longer exists at all.
first = last;
Now our data structures look like this:
The two records with content 17 and 4711 have become unreachable. There is no way that the program can find them again. (Except by scanning through the program's entire data address space.) If we are using explicit memory management with delete, they can never be freed. They have turned into garbage.
From real life we know that unless garbage is collected, it just lies there rotting in the street for ever. It doesn't work any better inside a computer. If we want to use the memory that the unreachable records occupy for something else, the garbage has to be collected.
We could have freed the records first, before changing the first pointer. In this case it would have been easy. We could use a slightly modified version of the loop with delete that we saw above. But manual memory management, with explicit freeing of memory, can be very difficult.
In a large program, the same data object might be used in many places. For example, a subsystem (such as a class, a procedure or a library) in a program allocates a data object, and stores a pointer to the newly allocated object. This could be the user interface, which lets the user enter data, and then creates a data object representing those data. Perhaps another subsystem needs to do something with the data, so the first subsystem passes a pointer to the object to the other subsystem, which stores the pointer in a variable. The second subsystem might then pass the pointer on to a third subsystem.
So how do we know when the object is no longer needed? No single subsystem can decide to free the object, because other subsystems might still need it. It is possible to solve this problem, but it can be difficult. It is very easy to make mistakes, and either free the memory too soon (and cause all sorts of errors when a subsystem tries to use the deleted object), or free the memory too late or not at all (causing a memory leak).
And what about the second data object in the figure? Even if the only subsystem that contains a pointer to that object is the third subsystem, this subsystem still can't free the object when it doesn't need it any more. Another subsystem might still need it, and reach it through the pointer inside the first object.
The simple answer is that it can't, at least not in the general case. ("In the general case" means "always".) The garbage collector would have to make a complete analysis of the program, including all future input to the program. This is of course impossible. Instead, we use the concept of reachable data, which is data that the program can find. Reachable data consists of two parts:
In our C++ example above, the root set consists of the global variables first and last, and also, when we are executing the procedure append, the local variable p and the parameter variable number.
Any data item that can't be reached be following pointers from the root set, and therefore isn't part of the reachable data, is garbage, and can be recycled at any time.
The algorithm consists of three steps:
Using the C++ program and its linked list as an example, we can illustrate the mark-sweep algorithm by first adding a small flag to each data object. We start by clearing all the flags.
(Note: In this example, I've assumed that even the pointer variables first and last are data objects that can be garbage collected, and that they have a bit for the used flag. This need not be the case.)
Then we mark as used all data objects in the root set, which, as we have mentioned earlier, consists of the variables first and last:
We go through all data that can be reached from any item in the root set, in this case the record with the content 42. (If there had been any pointers from that record on to other objects, we would of course have marked those objects too, and then followed any pointers from them on to other objects, and so on.)
Now all reachable data has been marked. We go through all data objects, and those that have not been marked can safely be deleted, and their memory recycled.
Another example, with slightly more complicated data. This picture shows the data after the marking phase has completed:
Some comments about this example:
You need an extra field in each of the data objects that should be managed by the algorithm, and in that field you store the number of pointers that currently point to the object. When the counter reaches 0, no pointers point to the object, which means that it is garbage and the memory it uses can be reclaimed.
(It is also possible to allocate the fields somewhere else, and not in the objects themselves.)
We now perform this statement:
first = last;
There is now one less pointer pointing to the record with content 17, and one more pointer pointing to the record with content 42. A garbage collector like mark-and-sweep can be run at any time, and will free all objects that can't be reached from the root set. With reference counting, the counter field must be updated every time a pointer is changed. So, after the assignment, the data will look like this, with the 17-record's field decremented and the 42-record's field incremented:
The reference counting system sees that the record with content 17 has a reference count of 0, so it can be freed:
The freed record contained a pointer to the next record. When we removed that pointer, we had to change the reference counter in the pointed-to record. It has now reached 0, so this record too can be freed:
In the basic reference counting algorithm, all this work is performed immediately after the assignment statement.
Reference counting is sometimes considered as a form of garbage collection, but sometimes the term "garbage collection" is reserved for algorithms like mark-and-sweep that can be run at any time.
The FAQ referenced below explains referenced counting like this:
For each "object" managed by the collector, maintain a count of pointers referencing that object. Whenever a pointer is assigned to, as in "P := Q", the referent of Q has its reference count incremented, and the former referent of P has its reference count decremented. If the reference count ever falls to zero, then it is known that there are no pointers to the object, and it may be reclaimed. The converse is not true -- some unreferenced objects may not have a reference count of zero, because of cycles. Furthermore, the fields in which reference counts are stored are typically small, so it is also possible that they will overflow, in which case they "stick" at their maximum value -- again, the object cannot be reclaimed. However, in practice this is rare (unless the reference count field is very small) and it is also possible to confine reference-counting to data structures that will not participate in cycles (for examples, strings, or other pointer-free data).
In a multi-threaded system, if threads are preemptively scheduled, reference counting requires additional care to ensure that the updates to reference counts are atomic. This is straightforward using hardware primitives such as fetch-and-add, compare-and-swap, or load-linked/store-conditional, but it is definitely necessary to pay attention to this detail.
Based on the observation that most objects have short lifetimes, it is useful to restrict garbage collection to the most recently allocated objects.
A generational collector maintains several "generations" of objects. Newly created objects are all put in the "youngest" generation, and when the space allocated for that generation is full, the collector will use the root set (and any pointers from older generations to the youngest generation -- see below) to reclaim dead objects from the youngest generation only, leaving the "older" generations untouched. Objects that survive after several (perhaps just one) collection of the youngest generation are "promoted" to the next older generation, and when that generation becomes full, it and all the generations younger than it are collected.
The difficulty with generational collectors is the identification of pointers from older generations into younger ones. An observation is that such references are uncommon in most programs: new objects typically point to older objects; this is clearly true in pure functional languages where assignment is prohibited, but is common for many programs and languages. Nonetheless, such references do exist, and a collector must handle them. When a pointer to a younger generation object is written into an old object, the pointer (or just the old object) is recorded so it can be scanned when the younger generation is collected.
Conservative garbage collection makes use of the observation that if you are not relocating (copying) objects, then you need not be quite so certain about exactly what is a pointer. It suffices to scan the root set (and objects) for any pointer-like bit patterns, and treat those pointer-like bit patterns as actual pointers while marking. The result is that the "true" reachable objects are all found, along with a few others. This works surprisingly well in practice (if a few additional tricks are added) and often permits the use of garbage collection with oblivious source code and compilers.
Conservative collection is very important because it permits the use of garbage collection with programs that were not written with garbage collection in mind. That is, it simplifies the use of garbage collection with existing (non-garbage-collected) libraries of code.
Garbage collector. In Swedish: skräpsamlare, but the English term is often used, and is sometimes abbreviated to garb.
Reachable data. In Swedish: nåbara data. Data that is accessible by following pointers from the root set.
Garbage. In Swedish: skräp. Data that is not reachable.
Mark and sweep or mark-sweep. The English terms are used in Swedish too.
Free list. In Swedish: fri lista or frilista.
Reference. In Swedish: referens. In this context: a pointer to a data object. Note that a "reference" in C++ (such as the variable r defined by int& r = i;) is something else.
Reference counting. In Swedish: referensräkning.
Generational garbage collection. In Swedish: generationsgarbning.
Conservative garbage collection. In Swedish: konservativ garbning.
Data object or data item. In Swedish: dataobjekt. Used in this text to mean the contents of an independently allocatable and freeable area in memory. Doesn't necessarily have anything at all to do with object-oriented programming.
Storage. A synonym for memory. In Swedish: lagringsutrymme or minne.