Replacing GNU assembler with Factor code
Monday, January 25, 2010
Lately a major goal of mine has been to get the Factor VM to be as close as possible to standard C++, and to compile it with at least one non-gcc compiler, namely Microsoft’s Windows SDK.
I’ve already eliminated usages of gcc-specific features from the C++ source (register variables mostly) and blogged about it.
The remaining hurdle was that several of the VM’s low-level subroutines were written directly in GNU assembler, and not C++. Microsoft’s toolchain uses a different assembler syntax, and I don’t fancy writing the same routines twice, especially since the code in question is already x86-specific. Instead, I’ve decided to eliminate assembly code from the VM altogether. Factor’s compiler infrastructure has perfectly good assembler DSLs for x86 and PowerPC written in Factor itself already.
Essentially, I rewrite the GNU assembler source files in Factor itself. The individual assembler routines have been replaced by new functionality added to both the non-optimizing and optimizing compiler backends. This avoids the issue of the GNU assembler -vs- Microsoft assembler syntax entirely. Now all assembly in the implementation is consistently written in the same postfix Factor assembler syntax.
The non-optimizing compiler already supports “sub-primitives”, words whose definition consists of assembler opcodes, inlined at each call site. I added new sub-primitives to these files to replace some of the VM’s assembly routines:
- basis/cpu/x86/bootstrap.factor
- basis/cpu/x86/32/bootstrap.factor
- basis/cpu/x86/64/bootstrap.factor
- basis/cpu/ppc/bootstrap.factor
A few entry points are now generated by the optimizing compiler, too. The optimizing compiler has complicated machinery for generating arbitrary machine code. I extended this with a Factor language feature similar to C’s inline assembly, where Factor’s assembler DSL is used to generate arbitrary assembly within a word’s compiled definition. This is more flexible than the non-optimizing compiler’s sub-primitives, and has wide applications beyond the immediate task of replacing a few GNU assembler routines with Factor code.
Factor platform ABI
Before jumping into a discussion of the various assembly routines in the VM, it is important to understand the Factor calling convention first, and how it differs from C.
Factor’s machine language calling convention in a nutshell:
- Two registers are reserved, for the data and retain stacks, respectively.
- Both registers point into an array of tagged pointers. Quotations and words pass and receive parameters as tagged pointers on the data stack.
- On PowerPC and x86-64, an additional register is reserved for the current VM instance.
- The call stack register (ESP on x86, r1 on PowerPC) is used like in C, with call frames stored in a contiguous fashion.
- Call frames must have a bit of metadata so that the garbage collector can mark code blocks that are referenced via return addresses. This ensures that currently-executing code is not deallocated, even if no other references to it remain.
- This GC meta-data consists of three things: the stack frame’s size in bytes, a pointer to the start of a compiled code block, and a return address inside that code block. Since every frame records its size and the next frame immediately follows, the garbage collector can trace and update all return addresses accurately.
- A call frame can have arbitrary size, but the garbage collector does not inspect the additional payload; its can be any blob of binary data at all. The optimizing compiler generates large call frames in a handful of rare situations when a scratch area is needed for raw data.
- Quotations must be called with a pointer to the quotation object in a distinguished register (even on x86-32, where the C ABI does not use registers at all). Remaining registers do not have to be preserved, and can be used for any purpose in the compiled code.
- Tail calls to compiled words must load the program counter into a special register (EBX on x86-32). This allows polymorphic inline caches at tail call sites to patch the call address if the cache misses. A non-tail call PIC can look at the return address on the call stack, but for a space-saving tail-call, this is not available, so to make inline caching work in all cases, tail calls have to pass this address directly. The only compiled blocks that read the value of this register on entry are tail call PIC miss stubs.
- All other calls to compiled words are made without any registers having defined contents at all, so effectively all registers that are not reserved for a specific purpose are volatile.
- The call stack pointer must be suitably aligned so that SIMD code can spill vector data to the call frame. This is already the case in the C ABI on all platforms except non-Mac 32-bit x86.
Replacing the c_to_factor()
entry point
There are two situations where C code needs to jump into Factor; when
Factor is starting up, and when C functions invoke function pointers
generated by alien-callback
.
There used to be a c_to_factor()
function defined in a GNU assembly
source file as part of the VM, that would take care of translating from
the C ABI to the Factor ABI. C++ code can call assembly routines that
obey the C calling convention directly.
Now that the special assembly entry point is gone from the VM, a valid question to ask is how is it even possible to switch ABIs and jump out into Factor-land, without stepping outside the bounds of C++, by writing some inline assembler at least. It seems like an impossible dilemma.
The Factor VM ensures that the transition stub that C code uses to call into Factor is generated seemingly out of thin air. It turns out the only unportable C++ feature you really need when bootstrapping a JIT is the ability to cast a data pointer into a function pointer, and call the result.
The new factor_vm::c_to_factor()
method, called on VM startup, looks
for a function pointer in a member variable named
factor_vm::c_to_factor_func
. Initially, the value is NULL, and if this
is the case, it dynamically generates the entry point and then calls the
brand-new function pointer:
void factor_vm::c_to_factor(cell quot)
{
/* First time this is called, wrap the c-to-factor sub-primitive inside
of a callback stub, which saves and restores non-volatile registers
as per platform ABI conventions, so that the Factor compiler can treat
all registers as volatile */
if(!c_to_factor_func)
{
tagged<word> c_to_factor_word(special_objects[C_TO_FACTOR_WORD]);
code_block *c_to_factor_block = callbacks->add(c_to_factor_word.value(),0);
c_to_factor_func = (c_to_factor_func_type)c_to_factor_block->entry_point();
}
c_to_factor_func(quot);
}
All machine code generated by the Factor compiler is stored in the code
heap, where blocks of code can move. But c_to_factor()
needs a stable
function pointer to make the initial jump out of C and into Factor. As I
briefly mentioned in a blog post about mark sweep compact garbage
collection,
Factor has a separate callback heap for allocating unmovable function
pointers intended to be passed to C.
This callback heap is used for the initial startup entry point too, as
well as callbacks generated by alien-callback
..
As mentioned in the comment, the callback stub now takes care of
saving
and restoring non-volatile registers, as well as aligning the stack
frame. You can see how callback stubs are defined with the Factor
assembler by grepping for callback-stub
in the non-optimizing compiler
backend.
The new callback stub covers part of what the old assembly
c_to_factor()
entry point did. The remaining component is calling the
quotation itself, and this is now done by a special word
c-to-factor
.
The c-to-factor
word loads the data stack and retain stack pointers
and jumps to the quotation’s compiled definition. Grep for
c-to-factor-impl
in the non-optimizing compiler backend.
In effect, by abusing the non-optimizing compiler’s support for “subprimitives”, machine code for the C-to-Factor entry point can be generated by the VM itself.
Callbacks generated by alien-callback
in the optimizing compiler used
to contain a call to the c_to_factor()
assembly routine. The
equivalent machine code is now generated directly by the optimizing
compiler.
Replacing the throw_impl()
entry point
When Factor code throws an error, a continuation is popped off the catch stack, and resumed. When the VM needs to throw an error, it has to go through the same motions, but perform a non-local return to unwind any C++ stack frames first, before it can jump back into Factor and resume another continuation.
There used to be an assembly routine named throw_impl()
which would
take a quotation and a new value for the stack pointer.
This is now handled in a similar manner to c_to_factor()
. The
unwind-native-frames
word in kernel.private
is another one of those
very special sub-primitives that uses the C calling convention for
receiving parameters. It reloads the data and retain stack registers,
and changes the call stack pointer to the given parameter. The call is
coming in from C++ code, and the contents of these registers are not
guaranteed since they play no special role in the C ABI. Grep for
unwind-native-frames
in the non-optimizing compiler backend.
Replacing the lazy_jit_compile_impl()
and set_callstack()
entry points
As I discussed in my most recent blog post, on the implementation of
closures in
Factor,
quotation compilation is deferred, and initially all quotations point to
the same shared entry point. This shared entry point used to be an
assembly routine in the VM. It is now the lazy-jit-compile
sub-primitive.
The set-callstack
primitive predates the existence of sub-primitives,
so it was implemented as an assembly routine in the VM for historical
reasons.
These two entry points are never called directly from C++ code in the
VM, so unlike c_to_factor()
and throw_impl()
, there is no C++ code
to fish out generated code from a special word and turn it into a
function pointer.
Inline assembly with the alien-assembly
combinator
I added a new word, alien-assembly
. In the same way as alien-invoke
,
it generates code which marshals Factor data into C values, and passes
them according to the C calling convention; but where alien-invoke
would generate a
subroutine call, alien-assembly
just calls the quotation at
compile time instead, no questions asked. The quotation can emit
any machine code it desires, but the result has to obey the C calling
convention.
Here is an example: a horribly unportable way to add two floating-point numbers that only works on x86.64:
: add ( a b -- c )
double { double double } "cdecl"
[ XMM0 XMM1 ADDPD ]
alien-assembly ;
1.5 2.0 add .
3.5
The important thing is that unlike assembly code in the VM, using this feature the same assembly code will work regardless of whether the Factor VM was compiled with gcc or the Microsoft toolchain!
Replacing FPU state entry points
The remaining VM assembly routines were used to save and restore FPU
state used for advanced floating point
features.
The math.floats.env
vocabulary would call these assembly routines as
if they were ordinary C functions, using alien-invoke
. After the
refactoring, the optimizing compiler now generates code for these
routines using alien-assembly
instead. A hook dispatching on CPU type
is used to pick the implementation for the current CPU.
See these files for details:
On PowerPC, using non-gcc compilers is not a goal, so these routines remain in the VM, and the vm/cpu-ppc.S source file still exists. It contains these FPU routines along with one other entry point for flushing the instruction cache (this is a no-op on x86).
Compiling Factor with the Windows SDK
With this refactoring out of the way, the Factor VM can now be compiled using the Microsoft toolchain, in addition to Cygwin and Mingw. The primary benefit of using the Microsoft toolchain is that it has allowed me to revive the 64-bit Windows support.
Last time I managed to get gcc working successfully on Win64, the Factor VM was written in C. The switch to C++ killed the Win64 port, because the 64-bit Windows Mingw port is a huge pain to install correctly. I never got gcc to produce a working executable after the C++ rewrite of the VM, and as a result we haven’t had a binary package on this platform since April 2009.
Microsoft’s free (as in beer) Windows
SDK
includes command-line compilers for x86-32 and x86-64, together with
various tools, such as a linker, and nmake
, a program similar to Unix
make
. Factor now ships with an
Nmakefile
that uses the SDK tools to build the VM:
nmake /f nmakefile
After fixing some minor compile errors, the Windows SDK produced a working Win64 executable. After updating the FFI code a little, I quickly got it passing all of the compiler tests. As a result of this work, Factor binaries for 64-bit Windows will be available again soon.