Struct arrays benchmark revisited: trig function calls are slow in Java, but without them Factor is still 3x faster

Saturday, August 29, 2009

My struct arrays benchmark generated a fair amount of discussion on reddit, with some people disputing the benchmark’s validity. Certainly I’m not claiming that Factor is faster than Java HotSpot in general (on most tasks it is slower, sometimes much more so), however I think the benchmark legitimately demonstrates the performance advantage of value semantics over reference semantics.

A few people pointed out that the Java version of the benchmark was spending a lot of time in trigonometric functions, and that Java computes sin and cosine “by hand” rather than using x87 FPU instructions. However, the same is also true of the Factor implementation, its just that none of the Java advocates bothered to check. I don’t use x87 instructions either, and call into libc for sin and cos, just like Java. Indeed, Factor’s sin and cos are even more heavyweight than Java’s, because they also support complex numbers; Factor’s compiler converts them to their real-valued equivalents if it can prove the inputs are real.

Despite this, I modified the benchmark to not call trig functions, and instead just initialize each point as (n+1,n+2,(n+3)*(n+3)/2). Factor is still faster by roughly the same margin as before:

Java	829ms (best run out of 8)
Factor	284ms

A few people also mentioned that by only running the test 8 times I wasn’t giving HotSpot enough time to “warm up”. However, the benchmark does not make any polymorphic calls in the inner loops, so there’s really no “warm up” needed at all; and indeed I got the best time on the third iteration.

I improved Factor’s code generation for trigonometric function calls. Trig function calls are treated like instructions by the low-level optimizer now, which means they participate in value numbering and do not split basic blocks. Instead, the register allocator spills all live registers at the call site at the very end of code generation. This strategy does not work in general for all FFI calls, because the case where the FFI calls invokes a Factor callback needs additional runtime support. However, neither sin nor cos do this. After implementing this, I noticed that my struct-arrays benchmark was calling sin() twice on the same input, and value numbering was folding the second call away. Neat optimization to get “for free”.

This code generation change improves performance of both the struct-arrays and partial-sums benchmarks:

Benchmark	Before	After
struct-arrays	1483ms	749ms
partial-sums	1233ms	938ms

Here is the disassembly for benchmark.struct-arrays with this optimization performed, and this is the patch.

For what its worth, Java HotSpot runs the partial-sums benchmark in 1293ms. Hopefully HotSpot’s trig function call performance will receive some attention at some point.