Replacing GNU assembler with Factor code

Monday, January 25, 2010

Lately a major goal of mine has been to get the Factor VM to be as close as possible to standard C++, and to compile it with at least one non-gcc compiler, namely Microsoft’s Windows SDK.

I’ve already eliminated usages of gcc-specific features from the C++ source (register variables mostly) and blogged about it.

The remaining hurdle was that several of the VM’s low-level subroutines were written directly in GNU assembler, and not C++. Microsoft’s toolchain uses a different assembler syntax, and I don’t fancy writing the same routines twice, especially since the code in question is already x86-specific. Instead, I’ve decided to eliminate assembly code from the VM altogether. Factor’s compiler infrastructure has perfectly good assembler DSLs for x86 and PowerPC written in Factor itself already.

Essentially, I rewrite the GNU assembler source files in Factor itself. The individual assembler routines have been replaced by new functionality added to both the non-optimizing and optimizing compiler backends. This avoids the issue of the GNU assembler -vs- Microsoft assembler syntax entirely. Now all assembly in the implementation is consistently written in the same postfix Factor assembler syntax.

The non-optimizing compiler already supports “sub-primitives”, words whose definition consists of assembler opcodes, inlined at each call site. I added new sub-primitives to these files to replace some of the VM’s assembly routines:

A few entry points are now generated by the optimizing compiler, too. The optimizing compiler has complicated machinery for generating arbitrary machine code. I extended this with a Factor language feature similar to C’s inline assembly, where Factor’s assembler DSL is used to generate arbitrary assembly within a word’s compiled definition. This is more flexible than the non-optimizing compiler’s sub-primitives, and has wide applications beyond the immediate task of replacing a few GNU assembler routines with Factor code.

Factor platform ABI

Before jumping into a discussion of the various assembly routines in the VM, it is important to understand the Factor calling convention first, and how it differs from C.

Factor’s machine language calling convention in a nutshell:

Two registers are reserved, for the data and retain stacks, respectively.
Both registers point into an array of tagged pointers. Quotations and words pass and receive parameters as tagged pointers on the data stack.
On PowerPC and x86-64, an additional register is reserved for the current VM instance.
The call stack register (ESP on x86, r1 on PowerPC) is used like in C, with call frames stored in a contiguous fashion.
Call frames must have a bit of metadata so that the garbage collector can mark code blocks that are referenced via return addresses. This ensures that currently-executing code is not deallocated, even if no other references to it remain.
This GC meta-data consists of three things: the stack frame’s size in bytes, a pointer to the start of a compiled code block, and a return address inside that code block. Since every frame records its size and the next frame immediately follows, the garbage collector can trace and update all return addresses accurately.
A call frame can have arbitrary size, but the garbage collector does not inspect the additional payload; its can be any blob of binary data at all. The optimizing compiler generates large call frames in a handful of rare situations when a scratch area is needed for raw data.
Quotations must be called with a pointer to the quotation object in a distinguished register (even on x86-32, where the C ABI does not use registers at all). Remaining registers do not have to be preserved, and can be used for any purpose in the compiled code.
Tail calls to compiled words must load the program counter into a special register (EBX on x86-32). This allows polymorphic inline caches at tail call sites to patch the call address if the cache misses. A non-tail call PIC can look at the return address on the call stack, but for a space-saving tail-call, this is not available, so to make inline caching work in all cases, tail calls have to pass this address directly. The only compiled blocks that read the value of this register on entry are tail call PIC miss stubs.
All other calls to compiled words are made without any registers having defined contents at all, so effectively all registers that are not reserved for a specific purpose are volatile.
The call stack pointer must be suitably aligned so that SIMD code can spill vector data to the call frame. This is already the case in the C ABI on all platforms except non-Mac 32-bit x86.

Replacing the `c_to_factor()` entry point

There are two situations where C code needs to jump into Factor; when Factor is starting up, and when C functions invoke function pointers generated by alien-callback.

There used to be a c_to_factor() function defined in a GNU assembly source file as part of the VM, that would take care of translating from the C ABI to the Factor ABI. C++ code can call assembly routines that obey the C calling convention directly.

Now that the special assembly entry point is gone from the VM, a valid question to ask is how is it even possible to switch ABIs and jump out into Factor-land, without stepping outside the bounds of C++, by writing some inline assembler at least. It seems like an impossible dilemma.

The Factor VM ensures that the transition stub that C code uses to call into Factor is generated seemingly out of thin air. It turns out the only unportable C++ feature you really need when bootstrapping a JIT is the ability to cast a data pointer into a function pointer, and call the result.

The new factor_vm::c_to_factor() method, called on VM startup, looks for a function pointer in a member variable named factor_vm::c_to_factor_func. Initially, the value is NULL, and if this is the case, it dynamically generates the entry point and then calls the brand-new function pointer:

void factor_vm::c_to_factor(cell quot)
{
    /* First time this is called, wrap the c-to-factor sub-primitive inside
    of a callback stub, which saves and restores non-volatile registers
    as per platform ABI conventions, so that the Factor compiler can treat
    all registers as volatile */
    if(!c_to_factor_func)
    {
        tagged<word> c_to_factor_word(special_objects[C_TO_FACTOR_WORD]);
        code_block *c_to_factor_block = callbacks->add(c_to_factor_word.value(),0);
        c_to_factor_func = (c_to_factor_func_type)c_to_factor_block->entry_point();
    }

    c_to_factor_func(quot);
}

All machine code generated by the Factor compiler is stored in the code heap, where blocks of code can move. But c_to_factor() needs a stable function pointer to make the initial jump out of C and into Factor. As I briefly mentioned in a blog post about mark sweep compact garbage collection, Factor has a separate callback heap for allocating unmovable function pointers intended to be passed to C.

This callback heap is used for the initial startup entry point too, as well as callbacks generated by alien-callback..

As mentioned in the comment, the callback stub now takes care of saving
and restoring non-volatile registers, as well as aligning the stack frame. You can see how callback stubs are defined with the Factor assembler by grepping for callback-stub in the non-optimizing compiler backend.

The new callback stub covers part of what the old assembly c_to_factor() entry point did. The remaining component is calling the quotation itself, and this is now done by a special word c-to-factor.

The c-to-factor word loads the data stack and retain stack pointers and jumps to the quotation’s compiled definition. Grep for c-to-factor-impl in the non-optimizing compiler backend.

In effect, by abusing the non-optimizing compiler’s support for “subprimitives”, machine code for the C-to-Factor entry point can be generated by the VM itself.

Callbacks generated by alien-callback in the optimizing compiler used to contain a call to the c_to_factor() assembly routine. The equivalent machine code is now generated directly by the optimizing compiler.

Replacing the `throw_impl()` entry point

When Factor code throws an error, a continuation is popped off the catch stack, and resumed. When the VM needs to throw an error, it has to go through the same motions, but perform a non-local return to unwind any C++ stack frames first, before it can jump back into Factor and resume another continuation.

There used to be an assembly routine named throw_impl() which would take a quotation and a new value for the stack pointer.

This is now handled in a similar manner to c_to_factor(). The unwind-native-frames word in kernel.private is another one of those very special sub-primitives that uses the C calling convention for receiving parameters. It reloads the data and retain stack registers, and changes the call stack pointer to the given parameter. The call is coming in from C++ code, and the contents of these registers are not guaranteed since they play no special role in the C ABI. Grep for unwind-native-frames in the non-optimizing compiler backend.

Replacing the `lazy_jit_compile_impl()` and `set_callstack()`entry points

As I discussed in my most recent blog post, on the implementation of closures in Factor, quotation compilation is deferred, and initially all quotations point to the same shared entry point. This shared entry point used to be an assembly routine in the VM. It is now the lazy-jit-compile sub-primitive.

The set-callstack primitive predates the existence of sub-primitives, so it was implemented as an assembly routine in the VM for historical reasons.

These two entry points are never called directly from C++ code in the VM, so unlike c_to_factor() and throw_impl(), there is no C++ code to fish out generated code from a special word and turn it into a function pointer.

Inline assembly with the `alien-assembly` combinator

I added a new word, alien-assembly. In the same way as alien-invoke, it generates code which marshals Factor data into C values, and passes them according to the C calling convention; but where alien-invoke would generate a
subroutine call, alien-assembly just calls the quotation at
compile time instead, no questions asked. The quotation can emit
any machine code it desires, but the result has to obey the C calling
convention.

Here is an example: a horribly unportable way to add two floating-point numbers that only works on x86.64:

: add ( a b -- c )
    double { double double } "cdecl"
    [ XMM0 XMM1 ADDPD ]
    alien-assembly ;

1.5 2.0 add .
3.5

The important thing is that unlike assembly code in the VM, using this feature the same assembly code will work regardless of whether the Factor VM was compiled with gcc or the Microsoft toolchain!

Replacing FPU state entry points

The remaining VM assembly routines were used to save and restore FPU state used for advanced floating point features. The math.floats.env vocabulary would call these assembly routines as if they were ordinary C functions, using alien-invoke. After the refactoring, the optimizing compiler now generates code for these routines using alien-assembly instead. A hook dispatching on CPU type is used to pick the implementation for the current CPU.

See these files for details:

On PowerPC, using non-gcc compilers is not a goal, so these routines remain in the VM, and the vm/cpu-ppc.S source file still exists. It contains these FPU routines along with one other entry point for flushing the instruction cache (this is a no-op on x86).

Compiling Factor with the Windows SDK

With this refactoring out of the way, the Factor VM can now be compiled using the Microsoft toolchain, in addition to Cygwin and Mingw. The primary benefit of using the Microsoft toolchain is that it has allowed me to revive the 64-bit Windows support.

Last time I managed to get gcc working successfully on Win64, the Factor VM was written in C. The switch to C++ killed the Win64 port, because the 64-bit Windows Mingw port is a huge pain to install correctly. I never got gcc to produce a working executable after the C++ rewrite of the VM, and as a result we haven’t had a binary package on this platform since April 2009.

Microsoft’s free (as in beer) Windows SDK includes command-line compilers for x86-32 and x86-64, together with various tools, such as a linker, and nmake, a program similar to Unix make. Factor now ships with an Nmakefile that uses the SDK tools to build the VM:

nmake /f nmakefile

After fixing some minor compile errors, the Windows SDK produced a working Win64 executable. After updating the FFI code a little, I quickly got it passing all of the compiler tests. As a result of this work, Factor binaries for 64-bit Windows will be available again soon.