Improved value numbering, branch splitting, and faster overflow checks
Friday, July 17, 2009
Hot on the heels of the improved register allocator, I’ve added some new optimizations to the low-level optimizer and made existing optimizations stronger.
Improved value numbering
Local value numbering now detects more congruences among values computed by low-level IR instructions. All the changes were either in the rewrite or simplify steps.
- Some additional arithmetic identities. For instance, the result of adding or subtracting 0 to a value is congruent to the value itself. The high-level optimizer implements the same optimization on Factor’s generic arithmetic words, but sometimes low-level optimizations introduce additional redundancies.
- An `##add` instruction where one of the two operands is the result of a `##load-immediate` is now converted into an `##add-imm`, saving a register. A similar identity holds for other commutative operations, such as `##mul`, `##and`, `##or`, `##xor`, and `##compare` with a condition code of `cc=`.
- Constant folding. Arithmetic instructions where both operands are constants are replaced by a `##load-immediate` containing the result. A special case is a subtraction where both operands are congruent; this is replaced by a load of 0. Again, the high-level optimizer performs constant folding also, but low-level optimizations sometimes introduce redundancies or expose opportunities for optimization which were not apparent in high-level IR.
- Reassociation. If an `##add-imm` instruction’s first input is the result of another `##add-imm`, then a new `##add-imm` is made which computes the same result with a single addition. Algebraically, this corresponds to the identity `(x + a) + b = x + (a + b)`, where `a` and `b` are known at compile time. Similar transformations are made for other operations, such as `##sub`, `##mul`, `##and`, `##or` and `##xor`.
- Constant branch folding. If both inputs to a `##compare` or `##compare-branch` are constant, or if both are congruent to the same value, then the result of the comparison can be computed at compile time, and one of the two successor blocks can be deleted.
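To make the flavor of these simplify rules concrete, here is a minimal sketch in Python (a hypothetical model, not Factor’s actual implementation) covering the identity, folding, and reassociation cases above; expressions are tuples like `("add", x, y)` and constants are plain integers:

```python
# Hypothetical sketch of the "simplify" step of local value numbering.
# Operands are either ints (compile-time constants) or opaque values.
def simplify(op, x, y):
    """Apply algebraic identities; return a simplified expression."""
    xc, yc = isinstance(x, int), isinstance(y, int)
    if op == "add":
        if xc and yc:
            return x + y                      # constant folding
        if yc and y == 0:
            return x                          # x + 0 => x
        if xc and x == 0:
            return y                          # 0 + x => y
        # reassociation: (x + a) + b => x + (a + b)
        if isinstance(x, tuple) and x[0] == "add" \
                and isinstance(x[2], int) and yc:
            return ("add", x[1], x[2] + y)
    if op == "sub":
        if x == y:
            return 0                          # x - x => 0, even for unknown x
        if xc and yc:
            return x - y                      # constant folding
    return (op, x, y)                         # no rule applied
```

The real pass works over value numbers rather than expression trees, but the identities it applies have the same shape.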
While the high-level optimizer performs constant folding and unreachable
code elimination as part of the
SCCP
pass, low-level optimizations which are done before value numbering can
expose optimization opportunities which were invisible in high-level IR.
For example, consider the following code:
[ { vector } declare [ length ] [ length ] bi < ]
Using the `optimized.` word, we can see what the high-level optimizer transforms this into:
[ dup >R 3 slot R> 3 slot fixnum< ]
The high-level optimizer only attempts to reason about immutable slots, but a vector’s length is mutable since a vector may have elements added to it, and so the high-level optimizer cannot detect that the two inputs to `fixnum<` are equal. After conversion into low-level IR, the code becomes the following sequence of SSA instructions:
_label 0 f
##prologue f
_label 1 f
##peek V int-regs 1 D 0 f
##copy V int-regs 4 V int-regs 1 f
##slot-imm V int-regs 5 V int-regs 4 3 7 f
##copy V int-regs 8 V int-regs 1 f
##slot-imm V int-regs 9 V int-regs 8 3 7 f
##copy V int-regs 10 V int-regs 5 f
##copy V int-regs 11 V int-regs 9 f
##compare V int-regs 12 V int-regs 10 V int-regs 11 cc< V int-regs 13 f
##replace V int-regs 12 D 0 f
##epilogue f
##return f
_spill-counts f f
Now, the low-level alias analysis optimization is able to detect that the second `##slot-imm` is redundant, and it transforms the code into the following:
_label 0 f
##prologue f
_label 1 f
##peek V int-regs 1 D 0 f
##copy V int-regs 4 V int-regs 1 f
##slot-imm V int-regs 5 V int-regs 4 3 7 f
##copy V int-regs 9 V int-regs 5 f
##copy V int-regs 10 V int-regs 5 f
##copy V int-regs 11 V int-regs 9 f
##compare V int-regs 12 V int-regs 10 V int-regs 11 cc< V int-regs 13 f
##replace V int-regs 12 D 0 f
##epilogue f
##return f
_spill-counts f f
Now, notice that both inputs to the `##compare` instruction (vreg #10 and vreg #11) are copies of the same value, vreg #5. Value numbering detects this congruence, since copy propagation is just a special case of value numbering; it then simplifies the comparison down to a constant load of `f`, since for any integer `x`, we have `x < x => false`. After dead code elimination, the result is the following:
_label 0 f
##prologue f
_label 1 f
##load-immediate V int-regs 12 5 f
##replace V int-regs 12 D 0 f
##epilogue f
##return f
_spill-counts f f
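The way copy propagation falls out of value numbering can be sketched as follows (again a hypothetical Python model, not Factor’s code): a `##copy` gives the destination the same value number as its source, so both inputs to the compare end up congruent and the comparison folds to a constant:

```python
# Hypothetical model: value numbering sees through copies, so a
# comparison of two copies of the same vreg folds to a constant load.
def value_number(instructions):
    vn = {}                                   # vreg -> value number
    counter = iter(range(10**6))
    def number(v):
        if v not in vn:
            vn[v] = next(counter)             # fresh number for a new value
        return vn[v]
    out = []
    for ins in instructions:
        if ins[0] == "copy":                  # ("copy", dst, src)
            vn[ins[1]] = number(ins[2])       # dst is congruent to src
            out.append(ins)
        elif ins[0] == "compare<":            # ("compare<", dst, a, b)
            if number(ins[2]) == number(ins[3]):
                # x < x is always false: replace with a constant load of f
                out.append(("load-immediate", ins[1], "f"))
            else:
                vn[ins[1]] = next(counter)
                out.append(ins)
        else:
            out.append(ins)
    return out
```

Running this model on the copies of vreg #5 above replaces the compare with a constant load, just as the real pass does.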
While no programmer would explicitly write such redundant code, it comes up after inlining. For example, the `push` word, which adds an element at the end of a sequence, is implemented as follows:
: push ( elt seq -- ) [ length ] [ set-nth ] bi ;
HINTS: push { vector } { sbuf } ;
The hints tell the compiler to compile specialized versions of this word for vectors and string buffers. While the generic version works on any sequence, the hinted versions are used when the input types match, and the result can be better performance, since generic words (in this case, `length` and `set-nth`) are inlined.
Now, after a bunch of code is inlined, a piece of code comes up which
compares the insertion index against the vector’s length; however, the
insertion index is the length in this case, and so value numbering is
now able to optimize out a conditional which could not be optimized out
before. Of course, I could have just written a specialized version of `push` for vectors (or even built it into the VM), but I’d rather implement general optimizations and write my code in a high-level style.
I’d like to thank Doug Coleman for implementing some of the above improvements.
Branch splitting
Roughly speaking, branch splitting is the following transform:
[ A [ B ] [ C ] if D ] => [ A [ B D ] [ C D ] if ]
It is only run if D is a small amount of code. Branch splitting runs before value numbering, so value numbering’s constant branch folding can help simplify the resulting control flow. Branch splitting is implemented in the `compiler.cfg.branch-splitting` vocabulary.
For example, suppose we have the following code:
[ [ t ] [ f ] if [ 1 ] [ 2 ] if ]
The high-level optimizer is unable to simplify this further, since it
does not work on a control flow graph. The low level optimizer builds a
control flow graph with the following shape:
The basic block in the middle is a candidate for splitting since it has
two predecessors, and it is short. After splitting, the result looks
like so:
Now, at this point, the only thing that this has achieved is eliminating
one unconditional jump at the expense of code size. However, after stack
analysis and value numbering run, some redundant work is eliminated:
One example of real code that benefits from branch splitting is code that calls the `=` word followed by a conditional. The `=` word is inlined, and it has a conditional inside of it; some branches of the conditional push the constants `t` and `f`, and others call non-inline words. The result is that the branches which push a constant boolean can jump directly to the destination block without the overhead of storing and testing a boolean value.
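The transform itself is simple enough to sketch on a toy CFG (a hypothetical Python model with invented block names, not the actual Factor implementation): a small join block is copied into each of its predecessors, which later lets constant branch folding simplify each copy independently.

```python
# Hypothetical CFG model: cfg maps a block name to its instruction
# list, succ maps a block name to its successor list. Branch splitting
# copies a small join block into each of its predecessors.
def split_branches(cfg, succ, block, max_size=3):
    preds = [b for b in cfg if block in succ[b]]
    if len(preds) >= 2 and len(cfg[block]) <= max_size:
        for p in preds:
            cfg[p] = cfg[p] + cfg[block]              # inline the block's code
            succ[p] = [s for s in succ[p] if s != block] + succ[block]
        del cfg[block], succ[block]                   # the join block is gone
    return cfg, succ
```

After splitting, a predecessor that pushes a constant boolean contains its own copy of the test, which value numbering can then fold.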
Once again thanks to Doug Coleman for collaborating with me on the implementation of branch splitting.
Block joining
Block joining is a simple optimization which merges basic blocks connected by a single control flow edge. Various low-level optimizations leave the control flow graph with a large number of empty or very small blocks with no conditional control flow between them. Block joining helps improve the effectiveness of local optimizations by creating larger basic blocks.
In the above example for branch splitting, block joining will eliminate the four empty basic blocks that remain after optimization.
Block joining is implemented in the `compiler.cfg.block-joining` vocabulary.
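Using the same toy CFG representation as above, the merge condition can be sketched like this (a hypothetical Python model, not the actual vocabulary): two blocks are joined when the edge between them is the first block’s only outgoing edge and the second block’s only incoming edge.

```python
# Hypothetical sketch of block joining: if block a has exactly one
# successor b, and a is b's only predecessor, merge b into a. Repeat
# until no more merges apply, producing larger basic blocks.
def join_blocks(cfg, succ):
    changed = True
    while changed:
        changed = False
        for a in list(cfg):
            if a in cfg and len(succ[a]) == 1:
                b = succ[a][0]
                preds_of_b = [p for p in cfg if b in succ[p]]
                if b in cfg and b != a and preds_of_b == [a]:
                    cfg[a] = cfg[a] + cfg[b]     # append b's code to a
                    succ[a] = succ[b]            # a inherits b's successors
                    del cfg[b], succ[b]
                    changed = True
    return cfg, succ
```

A straight-line chain of tiny blocks collapses into a single block, which is exactly what local value numbering wants to see.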
Faster overflowing integer arithmetic intrinsics
In the lucky cases where the compiler can eliminate overflow checks, the various math operations become single instructions in the low-level IR which eventually map directly to machine instructions. In the general case, however, an overflow check has to be generated. In the event of overflow, a bignum is allocated, which may in turn involve running the garbage collector, so it is quite an expensive operation compared to a single machine arithmetic operation.
Previously, the overflowing fixnum operations were represented in low-level IR as single, indivisible units, and just like subroutine calls and returns, they were “sync points” which meant that all values in registers had to be saved to the data and retain stacks before, and reloaded after. This is because of how these operations were implemented; they would perform the arithmetic, do an overflow check, and in the event of overflow, they would invoke a subroutine which handled the overflow. Keeping registers live across this operation was not sound in the event of overflow, since the subroutine call would clobber them.
No longer. Now, an overflowing fixnum operation breaks down into several nodes in the control flow graph, and the arithmetic part, the overflow check and the bignum allocation are represented separately. In particular, the code to save registers to the stack and reload them after is only generated in the overflow case, so in the event of no overflow, which happens much more frequently, execution can “fall through” and continue using the same registers as before.
The code that generates low-level instructions and control flow graph nodes for overflowing fixnum operations is defined in the `compiler.cfg.intrinsics.fixnum` vocabulary.
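The shape of the generated code can be modeled roughly as follows (a Python sketch assuming a 61-bit fixnum payload for illustration; the actual tagged-integer range depends on the platform). The point is that the fast path is just the add plus a range test, and only the rare overflow path pays for register spilling and bignum allocation:

```python
# Hypothetical model of an overflow-checked fixnum addition. The fast
# path falls through with registers intact; only the overflow path
# "spills" and calls into the runtime to box a bignum.
FIXNUM_MIN, FIXNUM_MAX = -(2**61), 2**61 - 1   # assumed tagged-int range

def add_with_overflow_check(x, y):
    result = x + y
    if FIXNUM_MIN <= result <= FIXNUM_MAX:
        return ("fixnum", result)    # fall through: no sync point needed
    # overflow case: save registers, call the runtime, allocate a bignum
    return ("bignum", result)
```

Previously the whole operation was one indivisible sync point; now only the second branch is.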
Faster integer shifts
This last optimization is a very minor one, but it made a difference on benchmarks. Previously, shifts with a constant shift count and no overflow check would compile to a machine instruction, and all other forms of shifts would invoke subroutine calls. Now, machine instructions are generated in the case where the shift count is unknown but the result is still known to be small enough not to require an overflow check.
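The reasoning behind that case can be sketched as follows (a hypothetical Python model with an assumed 61-bit fixnum payload, not Factor’s actual analysis): even when the exact shift count is unknown, a known bound on the operand’s magnitude and on the shift count is enough to prove the result fits in a fixnum.

```python
# Hypothetical sketch: a left shift needs no overflow check when the
# operand's known bit width plus the maximum possible shift count
# still fits inside a fixnum.
FIXNUM_BITS = 61   # assumed payload width of a tagged fixnum

def shift_needs_overflow_check(value_bits, max_shift):
    """value_bits: known upper bound on the operand's bit width;
    max_shift: known upper bound on the shift count."""
    return value_bits + max_shift > FIXNUM_BITS
```

When this predicate is false, the compiler can emit a bare machine shift even though the shift count is only known at run time.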
Benchmark results
Here are the results of running Factor’s benchmark suite on two builds: one from 31st May 2009, before I started working on the current set of low-level optimizer improvements (which includes the register allocator I blogged about previously), and one from today.
| Benchmark | Before | After |
| --- | --- | --- |
| benchmark.backtrack | 1.767561 | 1.330641 |
| benchmark.base64 | 1.997951 | 1.738677 |
| benchmark.beust1 | 2.765257 | 2.461088 |
| benchmark.beust2 | 3.584958 | 1.694427 |
| benchmark.binary-search | 1.55002 | 1.574595 |
| benchmark.binary-trees | 1.845798 | 1.733355 |
| benchmark.bootstrap1 | 10.860492 | 11.447687 |
| benchmark.dawes | 0.229999 | 0.161726 |
| benchmark.dispatch1 | 2.015653 | 2.119268 |
| benchmark.dispatch2 | 1.817941 | 1.216618 |
| benchmark.dispatch3 | 2.568987 | 1.899128 |
| benchmark.dispatch4 | 2.319587 | 2.032182 |
| benchmark.dispatch5 | 2.346744 | 1.614045 |
| benchmark.empty-loop-0 | 0.146716 | 0.12589 |
| benchmark.empty-loop-1 | 0.430314 | 0.342426 |
| benchmark.empty-loop-2 | 0.429012 | 0.342097 |
| benchmark.euler150 | 16.901451 | 15.288867 |
| benchmark.euler186 | 8.805435 | 7.920478 |
| benchmark.fannkuch | 3.202698 | 2.964037 |
| benchmark.fasta | 5.52608 | 4.934112 |
| benchmark.gc0 | 2.15066 | 1.993158 |
| benchmark.gc1 | 4.984841 | 4.961272 |
| benchmark.gc2 | 3.327706 | 3.265462 |
| benchmark.iteration | 3.736756 | 3.30438 |
| benchmark.javascript | 9.79904 | 9.164517 |
| benchmark.knucleotide | 0.282296 | 0.251879 |
| benchmark.mandel | 0.125304 | 0.123945 |
| benchmark.md5 | 0.946516 | 0.85062 |
| benchmark.nbody | 3.982774 | 3.349595 |
| benchmark.nested-empty-loop-1 | 0.116351 | 0.135936 |
| benchmark.nested-empty-loop-2 | 0.692668 | 0.438932 |
| benchmark.nsieve | 0.714772 | 0.698262 |
| benchmark.nsieve-bits | 1.451828 | 0.907247 |
| benchmark.nsieve-bytes | 0.312481 | 0.300053 |
| benchmark.partial-sums | 1.205072 | 1.221245 |
| benchmark.pidigits | 16.088346 | 16.159176 |
| benchmark.random | 2.574773 | 2.706893 |
| benchmark.raytracer | 3.481714 | 2.914116 |
| benchmark.recursive | 5.964279 | 3.215277 |
| benchmark.regex-dna | 0.132406 | 0.093095 |
| benchmark.reverse-complement | 3.811822 | 3.257535 |
| benchmark.ring | 1.756481 | 1.79823 |
| benchmark.sha1 | 2.267648 | 1.473887 |
| benchmark.sockets | 8.794280 | 8.783398 |
| benchmark.sort | 0.421628 | 0.363383 |
| benchmark.spectral-norm | 3.830249 | 4.036353 |
| benchmark.stack | 2.086594 | 1.014408 |
| benchmark.sum-file | 0.528061 | 0.422194 |
| benchmark.tuple-arrays | 0.127335 | 0.103421 |
| benchmark.typecheck1 | 0.876559 | 0.672344 |
| benchmark.typecheck2 | 0.878561 | 0.671624 |
| benchmark.typecheck3 | 0.86596 | 0.670099 |
| benchmark.ui-panes | 0.426701 | 0.372301 |
| benchmark.xml | 2.351934 | 2.187999 |
There are a couple of regressions I need to look into but for the most part the results look pretty good. I also expect further gains to come with additional optimizations that I plan on implementing.
Next steps
I’m going to keep working on the code generator for another little while. First of all, compile time has increased, so that needs to improve. Next, I’m going to implement better global optimization. At this point, values are stored in registers between basic blocks, unlike with the old code generator; however, loop variables are still stored on the stack between iterations because the analysis does not handle back edges yet. Fixing this, and making float unboxing work globally, is my next step. After that, I plan on adding support for unboxed word-size integers (32 or 64 bits, depending on the platform) and some intrinsics for SSE2 vector operations on Intel CPUs. Next, the register allocator needs improved coalescing logic, and it also needs to support fixed register assignments, as found in some x86 instructions which take implicit operands. Finally, I have a few optimizations I want to add to the high-level optimizer.