Comparing Factor's performance against V8, LuaJIT, SBCL, and CPython
Saturday, May 29, 2010
Together with Daniel Ehrenberg and Joe Groff, I’m writing a paper about Factor for DLS2010. We would appreciate feedback about the draft version of the paper. As part of the paper we include a performance comparison between Factor, V8, LuaJIT, SBCL, and CPython. The performance comparison consists of some benchmarks from The Computer Language Benchmarks Game. I’m posting the results here first, in case there’s something really stupid here.
Language implementations
Factor and V8 were built from their respective repositories. SBCL is version 1.0.38. LuaJIT is version 2.0.0beta4. CPython is version 3.1.2. All language implementations were built as 64-bit binaries and run on a 2.4 GHz Intel Core 2 Duo.
Benchmark implementations
Factor implementations of the benchmarks can be found in our source repository:
Implementations for the other languages can be found in the benchmarks game CVS repository:
In order to make the reverse complement benchmark work with SBCL on Mac OS X, I had to apply this patch; I don’t understand why:
--- bench/revcomp/revcomp.sbcl 9 Feb 2007 17:17:26 -0000 1.4
+++ bench/revcomp/revcomp.sbcl 29 May 2010 08:32:19 -0000
@@ -26,8 +26,7 @@
(defun main ()
(declare (optimize (speed 3) (safety 0)))
- (with-open-file (in "/dev/stdin" :element-type +ub+)
- (with-open-file (out "/dev/stdout" :element-type +ub+ :direction :output :if-exists :append)
+ (let ((in sb-sys:*stdin*) (out sb-sys:*stdout*))
(let ((i-buf (make-array +buffer-size+ :element-type +ub+))
(o-buf (make-array +buffer-size+ :element-type +ub+))
(chunks nil))
@@ -72,4 +71,4 @@
(setf start 0)
(go read-chunk))))
end-of-input
- (flush-chunks)))))))
+ (flush-chunks))))))
Running the benchmarks
I used Factor’s deploy tool to generate minimal images for the Factor benchmarks, and then ran them from the command line:
./factor -e='USE: tools.deploy "benchmark.nbody-simd" deploy'
time benchmark.nbody-simd.app/Contents/MacOS/benchmark.nbody-simd
For the scripting language implementations (LuaJIT and V8) I ran the scripts from the command line:
time ./d8 ~/perf/shootout/bench/nbody/nbody.javascript -- 1000000
time ./src/luajit ~/perf/shootout/bench/nbody/nbody.lua-2.lua 1000000
For SBCL, I did what the shootout does, and compiled each file into a new core:
ln -s ~/perf/shootout/bench/nbody/nbody.sbcl .
cat > nbody.sbcl_compile <<EOF
(proclaim '(optimize (speed 3) (safety 0) (debug 0) (compilation-speed 0) (space 0)))
(handler-bind ((sb-ext:defconstant-uneql (lambda (c) (abort c))))
(load (compile-file "nbody.sbcl" )))
(save-lisp-and-die "nbody.core" :purify t)
EOF
sbcl --userinit /dev/null --load nbody.sbcl_compile
cat > nbody.sbcl_run <<EOF
(proclaim '(optimize (speed 3) (safety 0) (debug 0) (compilation-speed 0) (space 0)))
(main) (quit)
EOF
time sbcl --dynamic-space-size 500 --noinform --core nbody.core --userinit /dev/null --load nbody.sbcl_run 1000000
For CPython, I precompiled each script into bytecode first:
python3.1 -OO -c "from py_compile import compile; compile('nbody.python3-4.py')"
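As a side note on what that one-liner does: `py_compile.compile` byte-compiles the script to a `.pyc` file so the startup cost of parsing and compiling isn’t charged to the timed run. A minimal sketch of the behavior, using a throwaway stand-in script rather than the real benchmark source (note that Python 3.1 wrote the `.pyc` next to the source, while 3.2+ places it under `__pycache__`):

```python
import os
import py_compile
import tempfile

# Write a tiny stand-in script (the real benchmark source is nbody.python3-4.py).
src = os.path.join(tempfile.mkdtemp(), "nbody_demo.py")
with open(src, "w") as f:
    f.write("print('hello')\n")

# compile() byte-compiles the script and returns the path to the .pyc
# (on 3.2+; Python 3.1 placed the file next to the source instead).
pyc = py_compile.compile(src)
print(pyc, os.path.exists(pyc))
```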
Benchmark results
All running times are wall-clock times from the Unix time command. I ran each benchmark 5 times and used the best result.
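The best-of-5 procedure can be sketched as a small wrapper; this is an illustrative harness of my own, not the exact commands used for the post (it uses modern Python and a placeholder command in place of a real benchmark binary):

```python
import subprocess
import time

def best_of(cmd, runs=5):
    """Run cmd `runs` times and return the best wall-clock time in seconds."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)   # one timed run of the benchmark
        times.append(time.perf_counter() - start)
    return min(times)                     # best result, as in the tables below

# Substitute a real benchmark invocation for the trivial `true` command here.
print("%.3fs" % best_of(["true"]))
```

Taking the minimum rather than the mean filters out noise from other processes and cold caches, which is the usual convention for this kind of micro-benchmarking.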
| Benchmark | Factor | LuaJIT | SBCL | V8 | CPython |
|---|---|---|---|---|---|
| fasta | 2.597s | 1.689s | 2.105s | 3.948s | 35.234s |
| reverse-complement | 2.377s | 1.764s | 2.955s | 3.884s | 1.669s |
| nbody | 0.393s | 0.604s | 0.402s | 4.569s | 37.086s |
| binary-trees | 1.764s | 6.295s | 1.349s | 2.119s | 19.886s |
| spectral-norm | 1.377s | 1.358s | 2.229s | 12.227s | 1m44.675s |
| regex-dna | 0.990s | N/A | 0.973s | 0.166s | 0.874s |
| knucleotide | 1.820s | 0.573s | 0.766s | 1.876s | 1.805s |
Benchmark analysis
Some notes on the results:
- There is no Lua implementation of the regex-dna benchmark.
- Some of the SBCL benchmark implementations can make use of multiple cores if SBCL is compiled with thread support. However, by default, thread support seems to be disabled on Mac OS X. None of the other language implementations being tested have native thread support, so this is a single-core performance test.
- Factor’s string manipulation still needs work. The fasta, knucleotide and reverse-complement benchmarks are not as fast as they should be.
- The binary-trees benchmark is a measure of how fast objects can be allocated, and how fast the garbage collector can reclaim dead objects. LuaJIT loses big here, perhaps because it lacks generational garbage collection, and because Lua’s tables are an inefficient object representation.
- The regex-dna benchmark is a measure of how efficient the regular expression implementation is in the language. V8 wins here, because it uses Google’s heavily-optimized Irregexp library.
- Factor beats the other implementations on the nbody benchmark because it is able to make use of SIMD.
- For some reason SBCL is slower than the others on spectral-norm. It should be generating the same code.
- The benchmarks exercise too few language features. Any benchmark that relies on native-sized integer arithmetic (for example, an implementation of the SHA-1 algorithm) would shine on SBCL and suffer on all the others. Similarly, any benchmark that requires packed binary data support would shine on Factor and suffer on all the others. However, the benchmarks in the shootout consist mostly of scalar floating-point code and text manipulation.
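To make the binary-trees point above concrete, the benchmark’s inner workload is essentially building and walking short-lived complete binary trees, so almost all of its time goes to allocation and garbage collection. A stripped-down sketch of that workload (not the shootout source, and using plain tuples as nodes):

```python
def make_tree(depth):
    # Allocate a complete binary tree of the given depth; every node is
    # a fresh 2-tuple, so this is almost pure allocation work.
    if depth == 0:
        return (None, None)
    return (make_tree(depth - 1), make_tree(depth - 1))

def check_tree(node):
    # Walk the tree and count nodes, so the allocations can't be
    # optimized away.
    left, right = node
    if left is None:
        return 1
    return 1 + check_tree(left) + check_tree(right)

# The benchmark repeatedly builds and discards trees like this, which is
# exactly the short-lived-object workload a generational collector
# handles well.
print(check_tree(make_tree(10)))  # prints 2047, i.e. 2**11 - 1 nodes
```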
Conclusions
Factor’s performance is coming along nicely. I’d like to submit Factor to the computer language shootout soon. Before doing that, we need a Debian package, and the deploy tool needs to be easier to use from the command line.