Mark-compact garbage collection for oldest generation, and other improvements
Monday, November 16, 2009
Factor now uses a mark-sweep-compact garbage collector for the oldest generation (known as tenured space), in place of the copying collector that it used before. This reduces memory usage. The mark-sweep collector used for the code heap has also been improved. It now shares code with the data heap collector and uses the same compaction algorithm.
Mark-sweep-compact garbage collection for tenured space
Mark phase
During the mark phase, the garbage collector computes the set of reachable objects by starting from the “roots”: global variables and objects referenced from runtime stacks. When an object is visited, the mark phase checks whether it has already been marked. If it hasn’t, it is marked, and any objects it refers to are then visited in turn; if it has already been marked, nothing is done. As described, the algorithm is recursive, which is problematic: the recursion depth is bounded only by the depth of the object graph, so a deeply-nested structure could overflow the C stack. There are two approaches to turning it into an iterative algorithm: either objects yet to be visited are pushed on a mark stack that a top-level loop drains, or a more complicated scheme known as “pointer inversion” is used. I decided against pointer inversion, since the mark stack approach is simpler and I have yet to observe the mark stack grow beyond 128KB or so anyway. I use an std::vector for the mark stack and it works well enough. The mark stack and the loop that drains it are implemented in full_collector.cpp.
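To make the mark-stack approach concrete, here is a minimal sketch using a toy object model. For brevity the mark flag lives in the object itself here, although as described below the real collector keeps mark bits in a side bitmap; all names are illustrative, not the VM’s actual API.

```cpp
#include <vector>

// Toy object model for illustration: a mark flag plus pointer slots.
struct object {
    bool marked = false;
    std::vector<object *> slots;   // outgoing references
};

static std::vector<object *> mark_stack;

// Mark an object and queue it for scanning, unless already marked.
static void visit(object *obj) {
    if (obj != nullptr && !obj->marked) {
        obj->marked = true;
        mark_stack.push_back(obj);
    }
}

// The iterative replacement for recursion: a top-level loop that
// drains the mark stack until no unvisited objects remain.
static void mark_from_roots(const std::vector<object *> &roots) {
    for (object *root : roots)
        visit(root);
    while (!mark_stack.empty()) {
        object *obj = mark_stack.back();
        mark_stack.pop_back();
        for (object *child : obj->slots)
            visit(child);
    }
}
```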
There are also several approaches to representing the set of currently-marked objects. The two most common are storing a mark bit in each object’s header, and storing mark bits in a bitmap off to the side. I chose the latter approach, since it speeds up the sweep and compact phases of the garbage collector and doesn’t require changing the object header layout. Each bit in the mark bitmap corresponds to 16 bytes of heap space. For reasons that will become clear, when an object is marked in the mark bitmap, every bit corresponding to space taken up by the object is set, not just the first one. The mark bitmap is implemented in mark_bits.hpp.
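Here is a sketch of that layout, assuming 64-bit words for the bitmap; the class shape and names are illustrative rather than the actual mark_bits.hpp interface.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Side mark bitmap: one bit per 16 bytes of heap. Marking sets every
// bit an object covers, not just the first, so that free regions can
// later be recovered as runs of clear bits.
struct mark_bitmap {
    uintptr_t heap_base;
    std::vector<uint64_t> bits;

    size_t bit_index(uintptr_t addr) const {
        return (addr - heap_base) / 16;
    }

    void set_bit(size_t i) {
        bits[i / 64] |= uint64_t(1) << (i % 64);
    }

    bool test_bit(size_t i) const {
        return (bits[i / 64] >> (i % 64)) & 1;
    }

    // Mark the full extent [addr, addr + size) of an object
    // (size is assumed nonzero).
    void mark_object(uintptr_t addr, size_t size) {
        size_t last = bit_index(addr + size - 1);
        for (size_t i = bit_index(addr); i <= last; i++)
            set_bit(i);
    }
};
```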
Sweep phase
Once the mark phase is complete, the mark bitmap holds an up-to-date picture of which regions of the heap correspond to reachable objects and which are free. The sweep phase begins by clearing the free list, and then computes a new one by traversing the mark bitmap: for every contiguous range of clear bits, a new entry is added to the free list. This is why I use a mark bitmap instead of mark bits in object headers; with mark bits in headers, the sweep phase would have to traverse the entire heap rather than just the mark bitmap, which has significantly worse locality. The sweep phase is implemented by the free_list_allocator::sweep() method in free_list_allocator.hpp.
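A sketch of that traversal, under the same illustrative bitmap layout as above (one bit per 16 bytes); free_block and the return-a-vector interface are stand-ins for the VM’s free list allocator.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct free_block { uintptr_t start; size_t size; };

static bool test_bit(const std::vector<uint64_t> &bits, size_t i) {
    return (bits[i / 64] >> (i % 64)) & 1;
}

// Sweep: scan the mark bitmap and turn every maximal run of clear
// bits into a single free-list entry. The heap itself is never read.
std::vector<free_block> sweep(const std::vector<uint64_t> &bits,
                              size_t total_bits, uintptr_t heap_base) {
    std::vector<free_block> free_list;
    size_t i = 0;
    while (i < total_bits) {
        if (test_bit(bits, i)) { i++; continue; }         // live data
        size_t start = i;
        while (i < total_bits && !test_bit(bits, i)) i++; // clear run
        free_list.push_back({heap_base + start * 16, (i - start) * 16});
    }
    return free_list;
}
```

A production version scans a whole 64-bit word at a time, using the bit-scan helper described next to skip over runs quickly.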
The key to an efficient implementation of the sweep algorithm is the log2() function in bitwise_hacks.hpp; it is used to find the first set bit in a cell. I use the BSR instruction on x86 and cntlzw on PowerPC.
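Here is a portable sketch of such a helper, computing the index of the highest set bit of a nonzero cell; with GCC or Clang the builtin compiles down to a single bit-scan instruction, and the fallback loop is only for compilers without it.

```cpp
#include <cstdint>

// log2 of a nonzero 64-bit cell: the index of its highest set bit.
// GCC/Clang lower the builtin to BSR (or LZCNT) on x86 and a
// count-leading-zeros instruction on PowerPC.
inline unsigned log2_cell(uint64_t x) {
#if defined(__GNUC__) || defined(__clang__)
    return 63 - unsigned(__builtin_clzll(x));
#else
    unsigned n = 0;
    while (x >>= 1)
        n++;
    return n;
#endif
}
```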
The sweep phase also needs to update the object start offset map used by the card marking write barrier. When collecting a young generation, the garbage collector scans the set of marked cards, and it needs to know where the first object in each card is so that it can properly identify pointers. This information is maintained by the object_start_map class defined in object_start_map.cpp. If an object that happens to be the first object in a card was deallocated by the sweep phase, the object start map must be updated to point at a subsequent object in that card. This is done by the object_start_map::update_for_sweep() method.
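As an illustration of the idea, here is a sketch of such a map; the card size, sentinel value, and byte-per-card encoding are assumptions for the example, not the VM’s actual layout.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// One byte per card, holding the offset (in 16-byte units) of the
// first object starting in that card; 0xFF means no object starts here.
const size_t card_size = 256;          // illustrative
const uint8_t no_object_start = 0xFF;

struct object_start_map {
    uintptr_t heap_base;
    std::vector<uint8_t> starts;

    void record_object_start(uintptr_t addr) {
        size_t card = (addr - heap_base) / card_size;
        uint8_t offset = uint8_t(((addr - heap_base) % card_size) / 16);
        starts[card] = std::min(starts[card], offset);
    }

    // After a sweep, forget the old entries and re-record a start for
    // every surviving object; the real update_for_sweep() derives the
    // survivors from the mark bitmap instead of taking a list.
    void update_for_sweep(const std::vector<uintptr_t> &live_objects) {
        std::fill(starts.begin(), starts.end(), no_object_start);
        for (uintptr_t addr : live_objects)
            record_object_start(addr);
    }
};
```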
Compact phase
The compact phase is optional; it only runs if the garbage collector determines that there is too much fragmentation, or if the user explicitly requests it. The compact phase does not require that the sweep phase has run, only the mark phase; like the sweep phase, it relies on the mark bitmap having been computed. Whereas the sweep phase identifies free blocks and adds them to the free list, the compact phase slides live objects toward the start of the heap so that all free space ends up at the end, in a single block. The compact phase has two steps. The first step computes a forwarding map: a data structure that can tell you the final destination of every heap block. It is easy to see that the final destination of a block is determined by the number of set bits in the mark bitmap that precede it, since each preceding set bit represents 16 bytes of live data that will be packed in front of it. Since the forwarding map is consulted frequently (once for every pointer in every object that is live during compaction), it is important that lookups be as fast as possible. The forwarding map should also be space-efficient. This completely rules out using a hashtable (with many small objects in the heap, it would grow to be almost as big as the heap itself) or simply scanning the bitmap and counting bits on every lookup (which would make compaction an O(n^2) algorithm). The correct solution is very simple, and well-known in language implementation circles, but I wasn’t aware of it until I studied the Clozure Common Lisp garbage collector.
You count the bits set in every group of 32 (or 64) bits in the mark bitmap, building an array of cumulative sums as you go. Then, to count the number of bits set up to a given position, you look up the pre-computed population count for the nearest 32 (or 64) bit boundary and manually compute the population count for the rest. This gives you a forwarding map with O(1) lookup time. The algorithm relies on a fast population count; I used the standard CPU-independent technique in the popcount() function of bitwise_hacks.hpp.
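Here is a sketch of that scheme over a 64-bit-word mark bitmap; popcount64 below is the standard CPU-independent technique, while the class shape and names are illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Standard CPU-independent population count (SWAR bit counting).
static unsigned popcount64(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return unsigned((x * 0x0101010101010101ULL) >> 56);
}

// Forwarding map: cumulative popcounts over the mark bitmap give the
// number of live 16-byte units preceding any bit, in O(1) per lookup.
struct forwarding_map {
    const std::vector<uint64_t> &bits;  // the mark bitmap
    std::vector<size_t> before;         // set bits strictly before word i

    explicit forwarding_map(const std::vector<uint64_t> &b) : bits(b) {
        before.resize(bits.size());
        size_t sum = 0;
        for (size_t i = 0; i < bits.size(); i++) {
            before[i] = sum;
            sum += popcount64(bits[i]);
        }
    }

    // Number of marked bits preceding bit index i: the precomputed
    // count at the word boundary, plus a popcount of the remainder.
    size_t bits_before(size_t i) const {
        uint64_t low = bits[i / 64] & ((uint64_t(1) << (i % 64)) - 1);
        return before[i / 64] + popcount64(low);
    }

    // Destination of the object whose first mark bit is i: everything
    // live before it slides to the front, 16 bytes per set bit.
    uintptr_t forward(uintptr_t heap_base, size_t i) const {
        return heap_base + bits_before(i) * 16;
    }
};
```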
Nowadays, as of SSE 4.2, x86 CPUs even include a POPCNT instruction, but since compaction spends most of its time in memmove(), I didn’t investigate whether it would offer a speedup. It would require a CPUID check at runtime, and the fallback would still need to be there for pre-Intel i7 CPUs and PowerPC, so it didn’t seem worth the extra complexity to me. Once the forwarding map has been computed, objects can be moved up and the pointers they contain updated in one pass. A mark-compact cycle takes roughly twice as long as a mark-sweep, which is why I elected not to perform a mark-compact on every full collection. Compacting on every full collection would lead to a simpler implementation (no sweep phase and no free list; allocation in tenured space would be done by bumping a pointer, just as with a copying collector), but the performance penalty didn’t seem worth the minor code size reduction to me.
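To make the single pass concrete, here is a sketch that slides each live object to its destination and rewrites its pointer slots as it goes; live_object, forward() (the forwarding-map lookup sketched above), and each_slot() are assumed helpers, since iterating objects and their slots depends on the VM’s heap layout.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

// Assumed helpers, not the VM's real API: forward() maps an old
// address to its new one via the forwarding map, and each_slot()
// visits the pointer fields of the object now living at an address.
struct live_object { uintptr_t addr; size_t size; };
uintptr_t forward(uintptr_t old_addr);
void each_slot(uintptr_t obj, size_t size,
               const std::function<void(uintptr_t *)> &visit);

// One pass over the live objects, in address order: slide each object
// down to its destination, then rewrite the pointers it contains.
// Address order guarantees every destination is at or below its
// source, so no unmoved data is ever overwritten.
void compact(const std::vector<live_object> &objects) {
    for (const live_object &obj : objects) {
        uintptr_t dest = forward(obj.addr);
        std::memmove((void *)dest, (void *)obj.addr, obj.size);
        each_slot(dest, obj.size, [](uintptr_t *slot) {
            if (*slot != 0)
                *slot = forward(*slot);
        });
    }
}
```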
Code heap compaction
Code heap compaction has been in Factor for a while, in the form of the save-image-and-exit word. It used an inefficient multi-pass algorithm (known as the “LISP-2” compaction algorithm); however, since it only ran right before Factor exited, this didn’t matter too much. Now, code heap compaction can happen at any time as a result of heap fragmentation, and it uses the same efficient algorithm as tenured space compaction. Compaction moves code around, and doing this at a time other than right before the VM exits creates a few complications:
- Return addresses in the callstack need to be updated.
- If an inline cache misses, a new inline cache is compiled, and the call site for the old cache is patched. Inline cache construction allocates memory and can trigger a GC, which may in turn trigger a code heap compaction; if this occurs, the return address passed into the inline cache miss stub may have moved, and the code to patch the call site needs to be aware of this.
- If a Factor callback is passed into C code, then moving code around in the code heap may invalidate the callback, and the garbage collector has no way to update any function pointers that C libraries might be referencing.
The solution to the first problem was straightforward and involved adding some additional code to the code heap compaction pass. The second problem is trickier. I added a new code_root smart pointer class, defined in code_roots.hpp, to complement the data_root smart pointer (see my blog post about moving the VM to C++ for details about that). The inline cache compiler wraps the return address in a code_root to ensure that if the code block containing it is moved by any allocations, the return address can be updated properly. I solved the last problem with a layer of indirection (that’s how all problems are solved in CS, right?). Callbacks are still allocated in the code heap, but the function pointer passed to C actually points into a separate “callback heap” where every block consists of a single jump instruction and nothing else. When a code heap compaction occurs, code blocks in the code heap might be moved around, and all jumps in the callback heap are updated accordingly. Blocks within the callback heap are never moved around (or even deallocated, since that isn’t safe either).
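To illustrate the indirection, here is a sketch of an x86-64-style callback stub; the instruction encoding is illustrative (it assumes the target is within a 32-bit displacement), and a real implementation must also allocate stubs in executable memory and flush the instruction cache after patching.

```cpp
#include <cstdint>
#include <cstring>

// A callback-heap entry: one fixed-address stub containing a single
// "jmp rel32" to the real compiled callback. C code holds the stub's
// address forever; compaction only rewrites the displacement.
struct callback_stub {
    uint8_t code[8];   // E9 xx xx xx xx, plus padding

    void retarget(uintptr_t target) {
        // rel32 is measured from the end of the 5-byte jmp instruction;
        // assumes the target is within +/- 2GB of the stub.
        int32_t rel = int32_t(target - ((uintptr_t)code + 5));
        code[0] = 0xE9;                       // jmp rel32 opcode
        std::memcpy(&code[1], &rel, sizeof rel);
    }
};

// After a code heap compaction, every stub is retargeted at its
// callback's new address, e.g.:
//   for (each stub s in the callback heap) s.retarget(new_address(s));
```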
New object alignment and tagging scheme
The last major change I wanted to discuss is that objects are now aligned on 16-byte boundaries, rather than 8-byte boundaries. This wastes more memory, but I’ve already offset most of the increase with some space optimizations, with more to follow. There are several benefits to this new scheme. First of all, the binary payload of a byte array now begins on a 16-byte boundary, which allows SIMD intrinsics to use aligned access instructions, which are much faster. Second, it simplifies the machine code for inline caches and megamorphic method dispatch: since the low 4 bits of every pointer are now clear, all built-in VM types fit in the pointer tag itself, so the logic to get a class descriptor for an object in a dispatch stub is very simple. Here is pseudo-code for the assembly that gets generated:
```
cell tag = ptr & 15;
if(tag == tuple_tag)
    tag = ptr[cell - tuple_tag];
... dispatch on tag ...
```