Exciting bug
Sunday, February 11, 2007
I just fixed a bug in Factor which had me stumped for several hours. The
problem was that on PowerPC, Factor would crash on startup if the data
heap was 2 gigabytes or larger. Preliminary investigation showed that
this was only a problem if the library was compiled; bootstrapping with
-no-compile first yielded an image which worked fine with a 2gb data
heap. And indeed, compiling something as simple as the + word would
cause a crash. Now + is called a lot, so if it is miscompiled Factor
doesn’t survive long enough to read another line of input in the
listener. So instead I used a trick to compile + but put the compiled
definition in another word. Testing didn’t reveal any problems, though:
1 2 blah .
3
1 3 blah .
4
... etc. With various data types, everything worked.
However, as soon as I swapped in the definition of blah for +, Factor
would instantly crash. Some further investigation revealed this:
0 0 blah 0 = .
t
0 0 blah 0 > .
t
0 0 blah class .
bignum
So this is why my earlier testing didn’t reveal the problem. The sum of
0 with 0 became a corrupted bignum, which is apparently larger than 0.
But indeed, 0 plus 0 should be a fixnum and not overflow to a bignum at
all, much less a corrupted one.
I looked at the disassembly for + specialized to fixnum arguments; my
first suspicion was some kind of signed/unsigned integer issue in the
VM, but the assembly generated was identical with a 2gb heap or a
typical 64mb one:
0x0516d020: lwz r3,-4(r14)
0x0516d024: lwz r4,0(r14)
0x0516d028: mtxer r0
0x0516d02c: addo. r5,r4,r3
0x0516d030: bns- 0x516d094
... code to handle overflow, if there is one ...
0x0516d094: stw r5,-4(r14)
0x0516d098: addi r14,r14,-4
Single-stepping through this code in gdb revealed that with a 2gb heap, the overflow path was always taken, even when there was no overflow. The exact same code behaved correctly with the default heap size.
Then it hit me like a ton of bricks. The intention of this instruction
is to store a zero in the XER register:
0x0516d028: mtxer r0
But it actually stores the value of the register r0! I’m sure I knew
this when I was writing the code, but for some reason I didn’t notice I
had made a mistake. Well, some two years (!) later, the bug manifests
itself.
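To make the failure concrete, here is a rough Python model of the mtxer / addo. / bns- sequence shown in the disassembly. It is only a sketch: the function names are made up for illustration, XER is reduced to its summary overflow (SO) and overflow (OV) bits, and the operands are treated as plain 32-bit words, ignoring Factor's fixnum tagging.

```python
MASK32 = 0xFFFFFFFF
XER_SO = 0x80000000  # summary overflow: most significant bit of XER, sticky
XER_OV = 0x40000000  # overflow: reflects only the latest "o"-form instruction


def to_signed(x):
    """Interpret a 32-bit word as a two's complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def addo_dot(a, b, xer):
    """Simplified model of `addo.`: returns (result, new_xer, cr0_so)."""
    result = (a + b) & MASK32
    exact = to_signed(a) + to_signed(b)
    overflow = not (-(1 << 31) <= exact < (1 << 31))
    if overflow:
        xer |= XER_OV | XER_SO    # SO stays set until XER is rewritten
    else:
        xer &= ~XER_OV & MASK32   # OV is cleared, SO is left untouched
    return result, xer, bool(xer & XER_SO)


def fixnum_add_path(a, b, r0):
    """Which path the compiled + takes, given whatever happens to be in r0."""
    xer = r0 & MASK32             # mtxer r0 copies r0 into XER, it does not zero it
    _, _, cr0_so = addo_dot(a, b, xer)
    # bns- jumps to the fast path only when the SO bit is clear.
    return "overflow" if cr0_so else "fast"
```

With r0 equal to zero, or to a low address such as 0x0516d098, adding 0 and 0 stays on the fast path. With any r0 whose top bit is set, the same addition lands in the overflow handler even though nothing overflowed.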
Why didn’t this bug appear earlier? The reason is simple. The Factor
code generator doesn’t use r0 as a general purpose scratch register,
because it’s not really general purpose; some instructions assume an
operand of r0 means a literal zero (perhaps I thought mtxer behaves like
this). So the code generator only ever uses r0 to store return addresses
in the subroutine prologue/epilogue sequences. So when + was called
with a pair of fixnums, some random return address was being stuffed
into XER; not zero as intended! But amazingly, everything worked, until
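I thought to test Factor with a larger data heap. The reason it worked
is also simple: the code heap is mapped in directly after the data heap,
so with a small data heap the return addresses have their most
significant bit clear. That bit is the summary overflow (SO) bit in XER.
With a 2gb data heap, the code heap ends up at a high enough address
that storing a return address into XER sets the SO bit, so the following
bns instruction never takes its branch and execution always falls
through to the overflow handler.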
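The link between heap size and the bug comes down to one bit of the return address. The Python sketch below illustrates it, assuming a made-up base address for the data heap (the real mapping address is not given here) and hypothetical helper names. The point is that any code heap placed after 2 gigabytes of data heap starts at or above 0x80000000, which is exactly the summary overflow bit of XER.

```python
MB = 1 << 20
XER_SO = 0x80000000  # summary overflow bit, the top bit of the 32-bit XER


def code_heap_start(data_heap_base, data_heap_size):
    """The code heap is mapped directly after the data heap."""
    return data_heap_base + data_heap_size


def sets_summary_overflow(return_address):
    """True if copying this address into XER would turn on the SO bit."""
    return bool(return_address & XER_SO)


# Hypothetical base address, chosen only for illustration.
base = 0x01000000

small = code_heap_start(base, 64 * MB)    # code heap after a 64mb data heap
large = code_heap_start(base, 2048 * MB)  # code heap after a 2gb data heap

print(hex(small), sets_summary_overflow(small))
print(hex(large), sets_summary_overflow(large))
```

Because the base address is never negative, the result for the 2gb case does not depend on the particular base chosen, as long as the sum still fits in 32 bits.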
What an embarrassing bug! I hope the compiler implementation Gods can forgive me for forgetting to initialize a register.