Factor Language Blog

Performance comparison between Factor and Java on a contrived benchmark

Friday, August 28, 2009

Recently, Joe Groff has been working on struct classes, with the aim of completely replacing Factor’s existing support for C structure types. The interesting thing about Joe’s struct classes is that unlike the clunky old FFI struct support which operated on byte arrays, struct classes really are Factor classes; instances look and feel like Factor objects, there’s literal syntax and prettyprinter support, and structure fields are accessed using accessor words that look exactly like Factor tuple slot accessors.

So whereas the old C structure support didn’t have much use outside of the FFI, the new struct classes become a useful and type-safe abstraction in themselves. Tuples are composed of dynamically-typed slots which reference Factor values, and structs are composed of scalar data stored in contiguous memory. This makes them useful in code that otherwise does not use the FFI.

What makes struct classes even more useful in performance-sensitive code is the related struct-arrays vocabulary. This presents a sequence of struct values, stored in contiguous memory, as a Factor sequence. This allows ordinary high-level sequence operations and combinators (map, filter, change-each, …) to be used on scalar data with a very efficient in-memory representation.

It is interesting that Factor’s core is rather high-level and dynamically typed, but various libraries built off to the side using meta-programming facilities implement features useful for systems programming, such as specialized array types, allowing binary pointerless data to be represented and manipulated efficiently. This is unprecedented in dynamic languages, where the usual approach is to farm out performance-intensive work to another language.

The performance difference between, say, an array of pointers to boxed floats and a raw array of floats cannot be overstated. Further levels of structure make the difference even more dramatic: an array of pointers to objects, where each object has three fields pointing to boxed floats (what a mouthful), is considerably less efficient to work with than an array of structures with three float fields each. In the former case, a great many objects are allocated and pointers traversed while working with the data; in the latter, all of the data is stored in one contiguous memory block that appears as a simple byte array to the garbage collector.
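In Java terms, the first comparison looks like this (a minimal illustration of the two layouts; the sketch is mine, not code from the post):

class BoxedVsRaw {
    public static void main(String[] args) {
        int n = 1000000;

        // An array of pointers to boxed floats: n references, each to a
        // separately heap-allocated java.lang.Float that the GC must trace.
        Float[] boxed = new Float[n];
        for (int i = 0; i < n; i++) boxed[i] = (float) i;

        // A raw array of floats: one contiguous block of 4-byte values that
        // the GC sees as a single object with no interior pointers.
        float[] raw = new float[n];
        for (int i = 0; i < n; i++) raw[i] = i;
    }
}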

Dynamic languages with advanced JITs, such as the recent JavaScript implementations, hit a performance barrier imposed by their in-memory object representations. Even Java, which has a reputation for being very fast as far as managed, GC’d languages go, suffers here. Java can pack scalar data into a single instance (so if an object has three float fields, the float data is stored directly in the object); however, it does not offer arrays of objects where the objects themselves are stored directly in the array. This negatively impacts performance, as you will see.
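To make that concrete, here is a minimal Java sketch (again my own illustration, not code from the post): the three floats of a Point are stored inline in each instance, but an array of Points is an array of references, and recovering a contiguous layout means flattening the fields by hand.

class Layouts {
    // The three floats are stored inline in each Point instance...
    static class Point { float x, y, z; }

    public static void main(String[] args) {
        int n = 1000;

        // ...but an array of Points is an array of references: n separate
        // heap objects, plus a pointer indirection on every element access.
        Point[] objects = new Point[n];
        for (int i = 0; i < n; i++) objects[i] = new Point();

        // The usual workaround is to flatten the fields by hand into one
        // contiguous primitive array (x at 3*i, y at 3*i+1, z at 3*i+2);
        // this is the layout that Factor's struct-arrays provide while still
        // supporting ordinary sequence operations.
        float[] flat = new float[3 * n];
        flat[3 * 7 + 1] = 2.5f; // set the y component of point 7
    }
}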

There is a buzzword that describes this approach of dealing with data: value semantics. If you find some whiny C++ programmers, soon enough one of them will mumble something about value semantics, and with good reason: because C++ offers value semantics, it often has a performance edge over other programming languages. While production-ready Java and C++ implementations both implement a similar set of optimizations, language semantics contribute to C++ being faster for a lot of code. Of course, C++ is low-level and has unsafe pointers; however, as Factor demonstrates, you can have a high-level managed language that still provides support for value semantics in a safe way.

I decided to whip up a simple benchmark. Here is what it does:

  • It works on points, which are triplets of single-precision floats, (x,y,z)
  • First, the benchmark creates a list of 5000000 points for i=0..4999999, where the ith point is (sin(i),cos(i)*3,sin(i)*sin(i)/2).
  • Then, each point is normalized; the x, y, and z components are divided by sqrt(x*x+y*y+z*z).
  • Finally, the maximum x, y, and z components are found, for all points, and this is printed out.

Note that in-place operations and re-using temporary objects are allowed.

Here is the code:
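In place of the full listings, here is a minimal Java sketch reconstructed from the benchmark description above; the class name silly matches the invocation given below, but the timing loop and output format are my assumptions rather than the original code:

// silly.java -- reconstructed from the benchmark description; the
// original listing may differ in details.
public class silly {
    static final int N = 5000000;

    static class Point { float x, y, z; }

    static void benchmark(Point[] points, Point max) {
        // The ith point is (sin(i), cos(i)*3, sin(i)*sin(i)/2).
        for (int i = 0; i < N; i++) {
            Point p = points[i];
            double s = Math.sin(i);
            p.x = (float) s;
            p.y = (float) (Math.cos(i) * 3.0);
            p.z = (float) (s * s / 2.0);
        }
        // Normalize each point in place.
        for (int i = 0; i < N; i++) {
            Point p = points[i];
            float norm = (float) Math.sqrt(p.x * p.x + p.y * p.y + p.z * p.z);
            p.x /= norm;
            p.y /= norm;
            p.z /= norm;
        }
        // Find the maximum x, y, and z components over all points.
        max.x = max.y = max.z = Float.NEGATIVE_INFINITY;
        for (int i = 0; i < N; i++) {
            Point p = points[i];
            if (p.x > max.x) max.x = p.x;
            if (p.y > max.y) max.y = p.y;
            if (p.z > max.z) max.z = p.z;
        }
    }

    public static void main(String[] args) {
        // Re-using temporary objects is allowed, so allocate everything once.
        Point[] points = new Point[N];
        for (int i = 0; i < N; i++) points[i] = new Point();
        Point max = new Point();

        // Run the benchmark 8 times and print the best time, giving the
        // HotSpot Server JIT a chance to warm up.
        long best = Long.MAX_VALUE;
        for (int run = 0; run < 8; run++) {
            long start = System.currentTimeMillis();
            benchmark(points, max);
            long elapsed = System.currentTimeMillis() - start;
            if (elapsed < best) best = elapsed;
        }
        System.out.println(max.x + " " + max.y + " " + max.z);
        System.out.println(best + " ms");
    }
}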

The Factor version is shorter, and has more blank lines.

Note that the Factor version is intended to be run as a vocabulary from the Factor environment, using the time word, as follows:

[ "benchmark.struct-arrays" run ] time

Run it a few times so that the data heap can grow to a stable size.

The Java code is self-contained; run it with

java -Xms512m -server silly

The Java version runs the benchmark 8 times, and prints the best time; this gives the HotSpot Server JIT a chance to ‘warm up’.

The JVM shipped with Mac OS X 10.5 (build 1.5.0_19-b02-304) ran the benchmark in 4.6 seconds, and the Factor version ran in 2.2 seconds using a Factor build from earlier this evening. I then made a couple of further improvements to the Factor compiler, bringing the runtime of the Factor version of the benchmark down to 1.4 seconds in the latest source tree:

  • Added intrinsics for min and max so that when they are applied to values known to be floating point at compile-time, the SSE2 instructions MINSD and MAXSD are used.
  • Added some optimizations to unbox intermediate <displaced-alien> values constructed by the struct array code. A displaced alien is a value representing a pointer and an offset; it is a heap-allocated object with two slots. This is how struct classes internally implement the backing store for objects which are stored ‘in the middle’ of a byte array; a rough Java analogy is sketched below.
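A displaced alien plays the same role that an explicit (buffer, offset) pair would play in Java. Roughly (the analogy and the DisplacedView class are mine; Factor's actual mechanism is the displaced alien described above):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class DisplacedView {
    public static void main(String[] args) {
        // One contiguous backing store holding 100 points of 12 bytes each
        // (three 4-byte floats per point).
        ByteBuffer backing =
            ByteBuffer.allocate(12 * 100).order(ByteOrder.nativeOrder());

        // A "displaced" reference to the 5th point: the base buffer plus a
        // byte offset, instead of a separately allocated point object.
        int offset = 12 * 5;
        backing.putFloat(offset, 1.0f);               // write the x field
        backing.putFloat(offset + 4, 2.0f);           // write the y field
        System.out.println(backing.getFloat(offset)); // read x back
    }
}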

The Factor code is several layers of abstraction removed from the low-level machine code that is generated in the end. It takes the following optimizations to get good performance out of it:

  • All words declared inline are inlined; in particular, all higher-order functions are inlined.
  • Literal quotations (written in square brackets) are inlined at their call sites.
  • Macros, such as sum-outputs, are expanded.
  • Sequence words, struct field accessors, and generic math operations are all replaced with lower-level type-specific equivalents using type inference; in many cases these operations, such as adding floats, map directly to machine instructions.
  • The incrementing counter used to initialize the points is converted into a fixed-precision integer, because interval and def-use analysis determine that this is safe.
  • Escape analysis determines that various temporary objects can be stack-allocated:
    • closures created by various combinators, such as tri-curry, each, and so on
    • the struct array object created with <struct-array>
    • the struct class instances created inside the many calls to each
    • the struct created at the end to store the maximum value
  • Of the remaining memory allocations, those that create small objects are completely inlined, and do not call into the VM at all.
  • Stack analysis eliminates most of the runtime data stack manipulation in favor of keeping values in CPU registers.
  • Representation analysis figures out that all of the intermediate float values can be unboxed and stored in floating-point registers, except for those that live across the sin/cos calls.
  • While the sin and cos functions result in calls into libc, sqrt is open-coded as an SSE2 instruction.
  • Value numbering eliminates redundant pointer arithmetic and the boxing/unboxing of pointer values.
  • GC checks are open-coded and inlined at every basic block which performs an allocation; the GC check compares the nursery pointer against the limit, and in the fast case, where the nursery is not full, no subroutine call is made and no registers need to be saved and restored.
  • The coalescing pass eliminates unnecessary register-to-register copy instructions that arise from SSA form, and also ensures that in most cases the result of an arithmetic instruction is assigned the same register as its first operand; this helps x86 instruction selection.
  • The register allocator assigns virtual registers to real CPU registers; in this case, there is no spilling at all on x86-64.

So what does this say about language expressiveness? Any dynamic language that a) offers the same set of metaprogramming tools as Factor and b) has a reasonably advanced JIT could do the same. On the other hand, working value semantics into something like Java is pretty much impossible. I invite you to implement this benchmark in your favorite language and share your timings.