A look at the new compiler design
Friday, April 28, 2006
The higher levels of the compiler did not change at all. The stack effect inference module spits out a dataflow intermediate representation (IR), and the dataflow optimizer rewrites it. The most important dataflow IR node types are:
#push - push literals on the data stack
#shuffle - permute the elements of the data or call stack
#call - call a word
#label - an inlined recursive block (loop, etc)
#if - conditional with two child nodes
#dispatch - jump table with multiple nodes; jumps to the node indexed by a number on the data stack
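As an illustrative sketch only (the class names and fields below are hypothetical, not Factor's actual implementation), the node types could be modeled like this in Python:

```python
# Hypothetical Python model of the dataflow IR node types described above;
# names and fields are illustrative, not Factor's actual implementation.
from dataclasses import dataclass

@dataclass
class Push:        # push literals on the data stack
    literals: list

@dataclass
class Shuffle:     # permute the elements of the data or call stack
    permutation: list

@dataclass
class Call:        # call a word
    word: str

@dataclass
class Label:       # an inlined recursive block (loop, etc.)
    body: list

@dataclass
class If:          # conditional with two child nodes
    true_branch: list
    false_branch: list

@dataclass
class Dispatch:    # jump table; branch chosen by a number on the data stack
    branches: list
```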
What is different is that now, machine code is generated directly from
optimized dataflow IR, instead of being converted to linear IR and
having another optimization pass applied. Paradoxically, the new
approach generates better code, and the key is making use of CPU
registers in an intelligent way.
The simplest optimization performed by the generator is an old one; the
defunct linear IR optimizer did it too. The idea is that inside runs of
generated code between calls to runtime functions, we can merge multiple
stack height change instructions into one final one at the end. As long
as no runtime functions which inspect the stack are called, the
temporary inconsistency does not create any issues, and saves
instructions.
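The idea can be sketched as a simple peephole pass over a hypothetical instruction stream (the instruction names here are made up for illustration):

```python
# Sketch of merging stack height changes: consecutive ("height", delta)
# adjustments are accumulated and flushed as a single instruction just
# before a runtime call sees the stack (hypothetical instruction names).
def merge_height_changes(ops):
    out, pending = [], 0
    for op in ops:
        if op[0] == "height":
            pending += op[1]          # defer the adjustment
        else:
            if pending:
                out.append(("height", pending))
                pending = 0
            out.append(op)            # runtime call sees a consistent stack
    if pending:
        out.append(("height", pending))
    return out

# Two +1 adjustments merge into one +2 before the call:
# merge_height_changes([("height", 1), ("height", 1), ("call", "f")])
# → [("height", 2), ("call", "f")]
```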
When the generator encounters a #push node, it allocates enough registers to hold the literals being pushed, and loads the literals into the registers. Then it makes a note that the top n stack elements are actually in registers now. This information is recorded on a phantom stack. Note that the dataflow optimizer already kills dead literal loads, so there is no loss of efficiency from immediately loading the literals into registers without checking usage first.
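A minimal sketch of #push handling (the register names and instruction mnemonics are assumptions for illustration, not Factor's actual code generator):

```python
# Sketch of #push handling: literals go straight into freshly allocated
# registers, and the phantom stack records where they live; no stack
# memory is written yet. Names are illustrative only.
def compile_push(literals, free_regs, phantom, code):
    for lit in literals:
        reg = free_regs.pop()                   # allocate a register
        code.append(("load-literal", reg, lit))
        phantom.append(reg)                     # top stack item lives here now

code, phantom, free_regs = [], [], ["EAX", "ECX"]
compile_push([2], free_regs, phantom, code)
# phantom now notes that the top stack item is in a register
```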
#shuffle nodes are similar. If the phantom stack contains sufficient register assignments for the shuffle to take place, then the phantom stack is simply rearranged. Otherwise, we add special placeholder “location” objects to the phantom stack, and perform the shuffle. For example, suppose the compiler comes across a swap, but the phantom stack is empty, meaning no stack items are in registers. Then the phantom stack ends up looking like this:
{ T{ ds-loc f 0 } T{ ds-loc f 1 } }
Since locations are indexed with the top of the stack starting from zero, this means that the “real” top of the stack is the “current” item underneath the top, and vice versa. Note that #shuffle nodes do not allocate registers or load values; they perform a purely formal operation. This is because in many cases, values which would be loaded by a shuffle are not actually used. For example, consider a shuffle operation which leaves certain locations fixed, such as over ( x y -- x y x ). The value y is part of the shuffle and is never loaded. It would be inefficient to allocate a register for y and load it.
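This formal rearrangement can be sketched as follows, assuming the phantom stack is a Python list with the top of the stack last (tuples stand in for Factor's ds-loc location objects; the permutation encoding is an assumption for illustration):

```python
# Sketch of a purely formal #shuffle on the phantom stack (top-last list).
# ("ds-loc", n) is a placeholder for "real" stack item at depth n from
# the top. perm[i] gives, for output position i (0 = new top), the input
# depth it came from (0 = old top). No loads are generated.
def ds_loc(n):
    return ("ds-loc", n)

def phantom_shuffle(phantom, n_in, perm):
    phantom = list(phantom)
    # Fill in placeholders for stack items not yet tracked in registers.
    while len(phantom) < n_in:
        phantom.insert(0, ds_loc(len(phantom)))
    top_first = phantom[-n_in:][::-1]      # inputs indexed by depth, 0 = top
    result = [top_first[d] for d in perm]  # outputs, 0 = new top
    return phantom[:-n_in] + result[::-1]

# swap with an empty phantom stack, as in the example above:
# phantom_shuffle([], 2, [1, 0]) → [("ds-loc", 0), ("ds-loc", 1)]
```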
Any node which ends up calling into the runtime, or calling another compiled word, “commits” the phantom stacks. What this does is look at the phantom stack, and write back any values stored in registers to the stack. It also shuffles stack elements if any location objects are present, taking care to skip locations which end up back at their original position, since those require no code. So in effect, a #shuffle node does not generate any code until the next time the phantom stack is “committed”, and a #push only allocates some registers without writing the stack.
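Committing can be sketched like this (an assumed model, continuing the top-last list representation; instruction names are illustrative):

```python
# Sketch of committing the phantom stack: register values are written back
# to their stack slots, and location placeholders become stack-to-stack
# moves, skipping any location already at its final position.
def commit(phantom):
    code = []
    n = len(phantom)
    for i, item in enumerate(phantom):       # list is top-last
        depth = n - 1 - i                    # final depth from the top
        if isinstance(item, tuple) and item[0] == "ds-loc":
            if item[1] != depth:             # already in place: no code
                code.append(("stack-move", item[1], depth))
        else:
            code.append(("stack-store", item, depth))
    return code

# A real implementation must route overlapping stack-to-stack moves
# through registers; this sketch only records which moves are needed.
```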
As I have described it so far, the optimizer only takes care of merging consecutive #push and #shuffle nodes. While this is a useful optimization, the old compiler design made it this far without any problems.
The real improvement is in the handling of intrinsics. In the old compiler, a handful of words, such as set-slot, fixnum+ and so on, were compiled inline, as hand-tuned assembly segments. This is still being done, but the framework is far more intelligent, and the idea is that if an intrinsic is compiled with all its inputs as registers on the phantom stack, significant effort can be saved in loading/saving elements from the stack. For example, consider the following snippet:
dup cons? [ ... ] [ ... ] if
This is a typical type check. The dataflow optimizer inlines cons?:
dup tag 2 eq? [ ... ] [ ... ] if
Now during code generation, the dup modifies the phantom stack to note that the top two “real” stack items are actually both the “current” top stack item. When the #call node for tag is encountered, the compiler notes that the tag word has an associated compiler intrinsic, which requests one input and produces one output. Now the compiler first checks the phantom stack to see if the input can be loaded without generating any stack traffic. In this case, the phantom stack holds a location object, so a load instruction does need to be generated. Then the tag intrinsic is generated – it is a single instruction which masks off everything but the tag bits of the address given. Next up is the literal 2. A register is allocated, the integer 2 is loaded in, and a note of this is made on the phantom stack. The eq? followed by a conditional is handled as one operation; and this time, both inputs are on the phantom stack, as registers – the return value of tag, and the literal 2.
So the eq? does not generate any stack traffic at all. Also note that in this example, dup and 2 together increase the stack height by 2, while eq? decreases it by 1 and the conditional pops another value (the return value of eq?). So no stack height change instruction is generated, since the stack is restored to its original height before any runtime calls are made. Indeed, here is the generated x86 assembly for the above example:
MOV EAX,[ESI]
MOV ECX,16 -- the fixnum 2, with a tag, is encoded as the integer 16
CMP EAX,ECX
JNE false-branch
...
JMP end
: false-branch
...
: end
...
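The input-fetching step for an intrinsic can be sketched like so (hypothetical names; real register allocation and instruction selection are more involved): an input already held in a register is used as-is, while a location placeholder forces exactly one load.

```python
# Sketch: fetch n intrinsic inputs from the phantom stack (top-last list).
# Registers are used directly; ("ds-loc", d) placeholders each generate
# one stack load. Names are illustrative, not Factor's implementation.
def fetch_inputs(phantom, n, free_regs, code):
    inputs = []
    for _ in range(n):
        item = phantom.pop()                   # take the phantom top
        if isinstance(item, tuple) and item[0] == "ds-loc":
            reg = free_regs.pop()
            code.append(("stack-load", reg, item[1]))
            item = reg
        inputs.append(item)
    return inputs

# In the example above: tag's single input, found as a location, costs one
# load; eq?'s two inputs, both already in registers, cost none.
```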
Now, I admit that the old linear IR optimizer would, after considerable effort and some rather ugly hand-coded heuristics, squeeze the same machine code out of the above type-check snippet. However, the code was considerably more complicated, and it did not deal with registers and stack locations in such full generality. Indeed, the old design would not manage to deal with the following:
over tag over tag eq? [ ... ] [ ... ] if
Both stack shuffle words would generate code, and the calls to tag would incur stack load/store overhead. The new design handles the above just fine; it loads the top two stack items into registers, masks off everything but the tag bits, compares them and branches.
Many other far more complicated examples generate good code with the new
design, especially on PowerPC where there is a large number of registers
available.