Exciting bug

Sunday, February 11, 2007

I just fixed a bug in Factor which had me stumped for several hours. The problem was that on PowerPC, Factor would crash on startup if the data heap was 2 gigabytes or larger. Preliminary investigation showed that this was only a problem if the library was compiled; bootstrapping with -no-compile first yielded an image which worked fine with a 2gb data heap. And indeed, compiling something as simple as the + word would cause a crash. Now + is called a lot, so if it is miscompiled, Factor doesn’t survive long enough to read another line of input in the listener. So instead I used a trick to compile + but put the compiled definition in another word. Testing didn’t reveal any problems, though:
1 2 blah .
3
1 3 blah .
4
... etc, with various data types, everything worked

However, as soon as I swapped in the definition of blah for +, Factor would instantly crash. Some further investigation revealed this:

0 0 blah 0 = .
t
0 0 blah 0 > .
t
0 0 blah class .
bignum
So this is why my earlier testing didn’t reveal the problem. The sum of 0 with 0 became a corrupted bignum, which is apparently larger than 0. But indeed, 0 plus 0 should be a fixnum and not overflow to a bignum at all, much less a corrupted one.

I looked at the disassembly for + specialized to fixnum arguments; my first suspicion was some kind of signed/unsigned integer issue in the VM, but the assembly generated was identical with a 2gb heap or a typical 64mb one:

0x0516d020:     lwz     r3,-4(r14)
0x0516d024:     lwz     r4,0(r14)
0x0516d028:     mtxer   r0
0x0516d02c:     addo.   r5,r4,r3
0x0516d030:     bns-    0x516d094
... code to handle overflow, if there is one ...
0x0516d094:     stw     r5,-4(r14)
0x0516d098:     addi    r14,r14,-4

Single-stepping through this code in gdb revealed that with a 2gb heap, the overflow branch was always taken, even if there was no overflow. The exact same code behaved correctly with the default heap size.

Then it hit me like a ton of bricks. The intention of this instruction is to store a zero in the XER register:
0x0516d028: mtxer r0
But it actually stores the value of the register r0! I’m sure I knew this when I was writing the code, but for some reason I didn’t notice I had made a mistake. Well, some two years (!) later, the bug manifests itself.
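To make the failure concrete, here is a rough Python model of the three instructions involved (mtxer, addo., bns). It is only a sketch: it ignores the other XER fields, and the r0 values in the usage lines are made up for illustration. What it relies on is that the summary overflow (SO) flag is the most significant bit of XER, that it is sticky, and that bns branches only when SO is clear.

```python
MASK32 = 0xFFFFFFFF
XER_SO = 0x80000000  # summary overflow: sticky, most significant bit of XER
XER_OV = 0x40000000  # overflow flag for the most recent overflow-enabled op


def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    return x - (1 << 32) if x & 0x80000000 else x


def add_with_overflow_check(r0, r3, r4):
    """Model of: mtxer r0 ; addo. r5,r4,r3 ; bns- <store result>.

    Returns (r5, took_overflow_path).
    """
    xer = r0 & MASK32                      # mtxer r0 copies r0; it does not zero XER
    total = to_signed(r4) + to_signed(r3)
    r5 = total & MASK32
    if -(1 << 31) <= total < (1 << 31):
        xer &= ~XER_OV                     # no overflow: OV cleared, SO left alone
    else:
        xer |= XER_OV | XER_SO             # overflow: sets OV and the sticky SO
    summary_overflow = bool(xer & XER_SO)  # addo. copies SO into CR0
    # bns jumps to the "store result" code only when SO is clear; otherwise
    # execution falls through into the overflow handler.
    return r5, summary_overflow


# Illustrative r0 values (assumptions, not taken from a real run):
print(add_with_overflow_check(0x00000000, 0, 0))  # (0, False): XER really zeroed
print(add_with_overflow_check(0x0516D0A0, 0, 0))  # (0, False): top bit of r0 clear
print(add_with_overflow_check(0x8516D0A0, 0, 0))  # (0, True): top bit of r0 set
```

In this model the sum itself is always right; only the decision about whether to run the overflow handler depends on what happened to be in r0.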

Why didn’t this bug appear earlier? The reason is simple. The Factor code generator doesn’t use r0 as a general-purpose scratch register, because it’s not really general purpose; some instructions assume an operand of r0 means a literal zero (perhaps I thought mtxer behaves like this). So the code generator only ever uses r0 to store return addresses in the subroutine prologue/epilogue sequences. So when + was called with a pair of fixnums, some random return address was being stuffed into XER; not zero as intended! But amazingly, everything worked, until I thought to test Factor with a larger data heap. The reason it worked is also simple: the code heap is mapped in directly after the data heap, so with a small data heap the return addresses were low and the overflow bit in XER stayed clear. Once the data heap was large enough to push the code heap sufficiently high, storing a return address into XER had the effect of enabling the overflow bit, so the following bns instruction never took its branch and execution always fell through into the overflow handler.
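The dependence on heap size can be sketched in a few lines of Python. The base address and code offset below are hypothetical, picked only so the small-heap case lands near the addresses in the disassembly above; the one assumption carried over from the post is that the code heap starts right where the data heap ends.

```python
SO_BIT = 0x80000000  # most significant bit of a 32-bit word, XER's SO flag


def return_address_sets_so(data_heap_base, data_heap_size, code_offset):
    """True if a return address inside the code heap has its top bit set."""
    code_heap_base = data_heap_base + data_heap_size  # code heap follows data heap
    return_address = (code_heap_base + code_offset) & 0xFFFFFFFF
    return bool(return_address & SO_BIT)


BASE = 0x01000000    # hypothetical data heap base
OFFSET = 0x16D0A0    # hypothetical offset of a return address in the code heap

print(return_address_sets_so(BASE, 64 << 20, OFFSET))  # False: 64 MB data heap
print(return_address_sets_so(BASE, 2 << 30, OFFSET))   # True: 2 GB data heap
```

Under that layout a data heap of 2 GB or more puts every code address at or above 0x80000000 whatever the base is, which matches the threshold at which the crash appeared.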

What an embarrassing bug! I hope the compiler implementation Gods can forgive me for forgetting to initialize a register.