Switching call stacks on different platforms

Sunday, April 4, 2010

User-space implementations of coroutines and green threads need to be able to switch the CPU’s call stack pointer between different memory regions. Since this is inherently CPU- and OS-specific, I’ll limit my discussion to CPUs and platforms that Factor supports.

System APIs for switching call stacks

Windows has an API for creating and switching between contexts, which it calls “fibers”. The main functions to look at are CreateFiber() and SwitchToFiber(). On Windows, by far the easiest way to switch call stacks is to just use fibers.
Most Unix systems have a set of functions operating on ucontexts, such as makecontext(), swapcontext() and so on. On some systems, these APIs are poorly implemented and documented.
The C standard library functions setjmp() and longjmp() can be (ab)used to switch contexts. The jmpbuf structure stored to by setjmp() and read from by longjmp() contains a snapshot of all registers, including the call stack pointer. Once you’ve allocated your own call stack, you can capture a jmp_buf with setjmp(), change the call stack pointer, and then resume execution with longjmp(). As far as I know, this is completely undocumented and unsupported with every major C compiler.
The most direct way is to write some assembly to switch the relevant registers directly. Paradoxically, this approach is the most portable out of all of these, because while it is CPU-specific, the details are pretty much the same across most OSes, Windows being the exception. Switching call stacks on Windows is a bit more involved than just changing the ESP register, and requires updating some Windows-specific in-memory structures. However, it is still easy enough to do directly, if you do not wish to use the fiber API. I will describe the details at the end of this post.

High-level libraries for switching call stacks

A couple of existing libraries implement high-level, cross-platform abstractions using some combination of the above mechanisms.

libcoroutine uses fibers on Windows, and ucontext and longjmp on Unix.
libcoro uses a handful of Unix-specific APIs if they’re available, or falls back onto some hand-coded assembly routines. The latter works on Windows.

NetBSD/x86 limitation

There is no realiable way to switch call stacks on NetBSD/x86, because of a limitation in the implementation of libpthread. Even if your program doesn’t use pthreads, it might link with a library that does, such as Glib. The pthread library replaces various C standard library functions with its own versions, and since this includes some extremely common calls such as malloc(), almost nothing will work as a result.

The reason is that the NetBSD pthread library uses a silly trick to implement thread-local storage, one which unfortunately assumes that every native thread has exactly one call stack. The trick is to always allocate thread call stacks on multiples of the maximum call stack size. Then, each thread’s unique ID can just be the result of masking the call stack pointer by the maximum call stack size.

Windows-specific concerns

So far, I’ve only worked out the details of switching call stacks on 32-bit Windows. I’ll talk about 64-bit Windows in another post.

Windows has a mechanism known as Structured Exception Handling, which is a sort of scoped exception handling mechanism with kernel support. Windows uses SEH to deliver processor traps to user-space applications (illegal memory access, division by zero, etc). Some C++ compilers also use SEH to implement C++ exceptions.

On Windows, simply changing the ESP register to point at another memory region is insufficient because structured exception handling must know the current call stack’s beginning and end, as well as a pointer to the innermost exception handler record.

These three pieces of information are stored in a per-thread information block, known as the TIB. The fiber switching APIs update the TIB for you, which is why Microsoft recommends their use.

The TIB

The TIB is defined by the NT_TIB struct in winnt.h:

typedef struct _NT_TIB {
    struct _EXCEPTION_REGISTRATION_RECORD *ExceptionList;
    PVOID StackBase;
    PVOID StackLimit;
    PVOID SubSystemTib;
    _ANONYMOUS_UNION union {
        PVOID FiberData;
        DWORD Version;
    } DUMMYUNIONNAME;
    PVOID ArbitraryUserPointer;
    struct _NT_TIB *Self;
} NT_TIB,*PNT_TIB;

The relevant fields that must be updated when switching call stacks are ExceptionList, StackBase and StackLimit.

Note that since the x86 call stack grows down, StackLimit is the start of the call stack’s memory region and StackBase is the end.

Structured exception handlers

Structured exception handlers are stored in a linked list on the call stack, with the head of the list pointed at by the ExceptionList field of the TIB:

struct _EXCEPTION_REGISTRATION_RECORD
{
     PEXCEPTION_REGISTRATION_RECORD Next;
     PEXCEPTION_DISPOSITION Handler;
} EXCEPTION_REGISTRATION_RECORD, *PEXCEPTION_REGISTRATION_RECORD;

The exception list should be saved and restored when switching between call stacks, and a new call stack should begin with an empty list (ExceptionList set to NULL).

If the ExceptionList field of the TIB or the Next field of an exception record point outside the call stack, then the handler in question will not be called at all.

Accessing the TIB

On x86-32, the current thread’s TIB is stored starting at address 0 in the segment pointed at by the FS segment register. The Self field always points at the struct itself. Since Windows uses flat addressing, the segment storing the TIB begins somewhere in the linear 32-bit address space, so an assembly routine to fetch the TIB and return a pointer to it in EAX looks like this:

fetch_tib:
    mov eax,[fs:24]
    ret

This assembly routine can then be called from C code, which can manipulate the TIB like any other structure.

Sample code for updating Windows-specific structures

The basis/cpu/x86/32/winnt/bootstrap.factor file defines assembly subroutines for updating the TIB when switching call stacks. These routines are invoked by the x86-32 non-optimizing compiler backend: