Faster generic arithmetic and integer overflow checking
Saturday, November 29, 2008
While the Factor compiler does a pretty good job of inferring types and eliminating dispatch where possible, runtime polymorphism still ends up being used a lot. Integer arithmetic presents an additional challenge. Factor supports arbitrary-precision integers, and it tries to make this support transparent. However, the flipside of this is that even if the compiler can figure out that all of the values in an expression are integers, it may not be able to eliminate the overflow check, depending on whether or not the ranges of the values are known. Because of this, making runtime polymorphism efficient is still important. In the last few days, I worked on making integer arithmetic more efficient.
Consider a word like +. If compile-time type information is not
available, this word has to look at the types of the top two stack
items, and take appropriate action depending on whether they are
fixnums, bignums, ratios, floats, or complex numbers. Until now, this
dispatch was implemented by two indirect jumps. The type of the first
object is used to key a dispatch table, where each entry then looks at
the type of the second object and keys into a second-level dispatch
table; the entries in the latter are the actual methods implementing the
arithmetic operation.
In Factor, the generic word system generates Factor code for performing
method dispatch. If you look at the definition of a word such as + in
an older build, you’ll see this two-level jump table pattern,
implemented with Factor’s dispatch combinator:
( scratchpad ) \ + def>> .
[
    over tag {
        [
            dup tag {
                [ fixnum=>+ ]
                [ [ >bignum ] dip bignum=>+ ]
                [ \ + no-math-method ]
                [ \ + no-math-method ]
                [ ratio=>+ ]
                [ [ >float ] dip float=>+ ]
                [ complex=>+ ]
                [ \ + no-math-method ]
            } dispatch
        ]
        [
            dup tag {
                [ >bignum bignum=>+ ]
                [ bignum=>+ ]
                [ \ + no-math-method ]
                [ \ + no-math-method ]
                [ ratio=>+ ]
                [ [ >float ] dip float=>+ ]
                [ complex=>+ ]
                [ \ + no-math-method ]
            } dispatch
        ]
        [ \ + no-math-method ]
        [ \ + no-math-method ]
        [
            dup tag {
                [ ratio=>+ ]
                [ ratio=>+ ]
                [ \ + no-math-method ]
                [ \ + no-math-method ]
                [ ratio=>+ ]
                [ [ >float ] dip float=>+ ]
                [ complex=>+ ]
                [ \ + no-math-method ]
            } dispatch
        ]
        [
            dup tag {
                [ >float float=>+ ]
                [ >float float=>+ ]
                [ \ + no-math-method ]
                [ \ + no-math-method ]
                [ >float float=>+ ]
                [ float=>+ ]
                [ complex=>+ ]
                [ \ + no-math-method ]
            } dispatch
        ]
        [
            dup tag {
                [ complex=>+ ]
                [ complex=>+ ]
                [ \ + no-math-method ]
                [ \ + no-math-method ]
                [ complex=>+ ]
                [ complex=>+ ]
                [ complex=>+ ]
                [ \ + no-math-method ]
            } dispatch
        ]
        [ \ + no-math-method ]
    } dispatch
]
The dispatch combinator compiles down to an indirect jump, and
indirect jumps are expensive on modern CPUs. Also, note that this
dispatch scheme did not favor any single numeric type over another.
However, in practice, operations on bignums and ratios will always be
slow, and dispatch overhead is neglegible there; whereas code that works
with floating point numbers and complex numbers can only be fast if the
compiler can infer types and unbox values statically, in which case
there is no runtime dispatch at all. So performance of runtime
arithmetic dispatch on types other than fixnums is largely irrelevant in
Factor. Indeed, the overwhelming majority of calls to generic arithmetic
operations involve fixnums, so some kind of trick to avoid two indirect
jumps in this common case is needed.
There are many solutions, but the one I settled on is pretty much the simplest one. In Factor, all values are tagged pointers (except for unboxed floats and such, which the compiler uses, but they cannot be “observed” in their unboxed state from the outside). A tagged pointer is a value where the low 3 bits encode a type, and the remaining bits encode a value. If the type tag is 0, the value is interpreted as an integer; otherwise, it is a pointer. All values in the heap are aligned on 8 byte boundaries, so if we mask off the 3 tag bits, we get a valid pointer.
So if we have two values, stored in registers x and y, then both
have a tag of 0 if their inclusive bitwise or has a tag of zero. That
is, we can check if both are fixnums with a single conditional.
Here is the new Factor code generated for the dispatch performed by
+:
( scratchpad ) \ + def>> .
[
    both-fixnums?
    [ { fixnum fixnum } declare fixnum=>+ ] [
        over tag {
            [
                dup tag {
                    [ { fixnum fixnum } declare fixnum=>+ ]
                    [
                        { fixnum bignum } declare
                        [ >bignum ] dip bignum=>+
                    ]
                    [
                        { fixnum tuple } declare
                        \ + no-math-method
                    ]
                    [ \ + no-math-method ]
                    [ { fixnum ratio } declare ratio=>+ ]
                    [
                        { fixnum float } declare [ >float ] dip
                        float=>+
                    ]
                    [ { fixnum complex } declare complex=>+ ]
                    [
                        { fixnum POSTPONE: f } declare
                        \ + no-math-method
                    ]
                } dispatch
            ]
            [
                dup tag {
                    [
                        { bignum fixnum } declare
                        >bignum bignum=>+
                    ]
                    [ { bignum bignum } declare bignum=>+ ]
                    [
                        { bignum tuple } declare
                        \ + no-math-method
                    ]
                    [ \ + no-math-method ]
                    [ { bignum ratio } declare ratio=>+ ]
                    [
                        { bignum float } declare [ >float ] dip
                        float=>+
                    ]
                    [ { bignum complex } declare complex=>+ ]
                    [
                        { bignum POSTPONE: f } declare
                        \ + no-math-method
                    ]
                } dispatch
            ]
            [ \ + no-math-method ]
            [ \ + no-math-method ]
            [
                dup tag {
                    [ { ratio fixnum } declare ratio=>+ ]
                    [ { ratio bignum } declare ratio=>+ ]
                    [
                        { ratio tuple } declare
                        \ + no-math-method
                    ]
                    [ \ + no-math-method ]
                    [ { ratio ratio } declare ratio=>+ ]
                    [
                        { ratio float } declare [ >float ] dip
                        float=>+
                    ]
                    [ { ratio complex } declare complex=>+ ]
                    [
                        { ratio POSTPONE: f } declare
                        \ + no-math-method
                    ]
                } dispatch
            ]
            [
                dup tag {
                    [
                        { float fixnum } declare
                        >float float=>+
                    ]
                    [
                        { float bignum } declare
                        >float float=>+
                    ]
                    [
                        { float tuple } declare
                        \ + no-math-method
                    ]
                    [ \ + no-math-method ]
                    [ { float ratio } declare >float float=>+ ]
                    [ { float float } declare float=>+ ]
                    [ { float complex } declare complex=>+ ]
                    [
                        { float POSTPONE: f } declare
                        \ + no-math-method
                    ]
                } dispatch
            ]
            [
                dup tag {
                    [ { complex fixnum } declare complex=>+ ]
                    [ { complex bignum } declare complex=>+ ]
                    [
                        { complex tuple } declare
                        \ + no-math-method
                    ]
                    [ \ + no-math-method ]
                    [ { complex ratio } declare complex=>+ ]
                    [ { complex float } declare complex=>+ ]
                    [ { complex complex } declare complex=>+ ]
                    [
                        { complex POSTPONE: f } declare
                        \ + no-math-method
                    ]
                } dispatch
            ]
            [ \ + no-math-method ]
        } dispatch
    ] if
]
Notice that it is essentially the same as the old version, except the
two-level jump table has been wrapped in a conditional. If
both-fixnums? outputs true, we go directly to the fixnum case;
otherwise we do the two-level dispatch.
Since both-fixnums? needs to bitwise OR pointers together, I had no
choice but to make it a VM primitive. The non-optimizing compiler
inlines a fixed code sequence for calls of both-fixnums?, and the
optimizing compiler expands it into a series of low-level IR
operations:
( scratchpad ) [ both-fixnums? [ "Fixnums!" ] [ "At least one is not a fixnum" ] if ] test-mr mr.
=== word: ( gensym ), label: ( gensym )
_label 0 
##prologue 
_label 1 
##peek V int-regs 692262 D 0 
##peek V int-regs 692263 D 1 
##copy V int-regs 692264 V int-regs 692262 
##or V int-regs 692264 V int-regs 692264 V int-regs 692263 
##copy V int-regs 692265 V int-regs 692264 
##and-imm V int-regs 692265 V int-regs 692265 7 
_compare-imm-branch 3 V int-regs 692265 0 cc/= 
_label 2 
##inc-d 1 
##load-indirect V int-regs 692269 "Fixnums!" 
##replace V int-regs 692269 D 0 
##epilogue 
##return 
_label 3 
##inc-d 1 
##load-indirect V int-regs 692270 "At least one is not a fixnum" 
##replace V int-regs 692270 D 0 
_label 4 
##epilogue 
##return 
So the machine code generated for a generic arithmetic word such as +
now begins as follows:
    mov    (%r14),%rax
    mov    -0x8(%r14),%rcx
    or     %rcx,%rax
    and    $0x7,%rax
    cmp    $0x0,%rax
    jne    slow_dispatch
    ... handle fixnum+fixnum here ...
    ret
slow_dispatch:
    ... do table dispatch here ...
We load two values from the data stack, or them together, mask off the low 3 bits, compare with zero, and jump. Indeed, those readers who know the ways of x86 assembly will recognize that there is a shorter code sequence which can accomplish this same task:
    mov    (%r14),%rax
    mov    -0x8(%r14),%rcx
    or     %rcx,%rax
    test   $0x7,%rax
    jnz    slow_dispatch
    ... handle fixnum+fixnum here ...
    ret
slow_dispatch:
    ... do table dispatch here ...
Once I figure out a nice way to incorporate the test instruction into
my code generator and value numbering passes, I will switch things up
and eliminate the unnecessary instruction there.
This takes care of making dispatch faster. The next challenge was
speeding up integer overflow handling. If the optimizing compiler’s
high-level optimizer can prove that a call to + has fixnum inputs, and
in addition, the result will not overflow – either because the ranges
of the input values are constrained, or the output is passed to bitand
– it will replace it with a call to the fixnum+fast primitive. The
low-level optimizer turns a call to fixnum+fast into a single
low-level IR instruction, and this becomes a single machine instruction.
Very efficient.
However, if the high-level optimizer cannot prove that the result will
not overflow, but it does know the inputs are fixnums, it replaces the
call to + with a call to fixnum+. This still avoids dispatch
overhead, however, until now, the low-level optimizer did not do
anything special for fixnum+; it would compile it as a subroutine call
of a VM function, which added two integers, checking for overflow, in C.
This was slow, because C does not have a way to test the overflow flag
of the CPU.
I made the low-level optimizer aware of the fixnum+ word; it converts
it into a low-level instruction which can then participate in value
numbering and register allocation. The CPU-specific backend is then
responsible for generating an optimal machine code sequence for
fixnum+. On x86 and PowerPC, this can use the overflow flag in the
CPU, which is a more efficient than trying to simulate the same in C. If
the overflow flag is set after the addition, fixnum+ performs a
subroutine call into the VM, however there is no call and the arithmetic
is done inline in the common case, where there is no overflow.
Here is an example of machine code emitted for fixnum+:
    sub    $0x8,%r14
    mov    (%r14),%rax
    mov    0x8(%r14),%rcx
    add    %rcx,%rax
    jno    no_overflow
    ... handle overflow ...
no_overflow:
    mov    %rax,(%r14)
    retq   
I omitted the lengthy setup for the subroutine call into the VM, and
register assignments will vary depending on the placement of fixnum+
in the code.
The results were very interesting. Some benchmarks, such as spectral-norm and mandelbrot, showed no improvement, because the compiler eliminates all the dispatch and overflow checks from the inner loops anyway. Other benchmarks showed an impressive gain. This is good because it means that I didn’t waste my time with implementing the above, but it is also bad because it shows that the high-level optimizer could probably do a better job of inferring types in some of these benchmarks!
Here are the results; I’m running each benchmark 5 times and measuring the runtime in milliseconds.
| Benchmark | Before | After | 
|---|---|---|
| fasta | 9492 8985 8975 9310 9219 | 7841 7818 8115 7807 7777 | 
| reverse-complement | 5815 5611 5628 5645 5645 | 5176 5152 5177 5173 5289 | 
| binary-trees | 2911 2867 2870 2890 2894 | 2163 2142 2186 2168 2136 | 
| fib2 | 113 112 114 112 112 | 82 83 84 85 85 | 
| fib3 | 301 302 301 300 301 | 260 261 259 260 260 | 
| project-euler.134 | 427 427 427 426 424 | 314 313 315 318 313 |