Out-Fibbing CPython with the Plush Interpreter

In the last post I talked about Plush, the toy programming language with actor-based parallelism I've been tinkering with. The implementation is still immature, but it's reached a point where I can write fun programs that produce 2D/3D graphics and parallelize things over multiple CPU cores. Something I'd like to try soon, for the fun of it, is to animate a spinning cube with software rendering (rasterization). Before I get to that though, I'd like to optimize my interpreter a little bit so that I'm not leaving too much performance on the table, which brings us to today's topic.

My PhD advisor seemed to have an obsession with the recursive Fibonacci microbenchmark, or just fib for short. It was almost like a ritual. Whenever he brought up a compiler or language implementation, one of his first goals was to get to the point where he could run this tiny program, and if you were working on a compiler yourself, he would ask you how fast it could run fib. Sometimes he would also ask you to compare your compiler's performance against his Scheme compiler on this microbenchmark, and you couldn't tell if he was making some kind of joke, trying to compete with you, or both.

fun fib(n)
{
    if (n < 2)
        return n;

    return fib(n - 1) + fib(n - 2);
}

It's well-known that microbenchmarks are not representative of the overall performance of language implementations. There are many reasons for that, the most obvious being that they tend to overemphasize certain aspects of your implementation while completely disregarding others. In this case, the fib microbenchmark heavily emphasizes how fast you can execute function calls. This is not a bad thing to measure though, as function calls are central to modern programming languages, and they can also be a significant source of overhead.

Testing out the Plush interpreter, I found that I could run fib(38) in 9.10 seconds with a release build. That's over 126 million function calls, so about 14 million function calls per second. Just for fun I decided to compare it against Python 3.13.5 (released June 2025) and that ran in 5.70s, meaning Plush was about 60% slower. Ouch. CPython also uses a token-threaded interpreter, so in theory this is not an unfair comparison. This situation could not be allowed to stand, so I decided to look at whether it was possible, with a few simple tweaks, to get to the point where we could outperform CPython on fib.

I started by dumping the bytecode instruction sequence for fib, which is 18 instructions long:

# fib
get_arg { idx: 0 }
push { val: Int64(2) }
lt
if_false { target_ofs: 2 }
get_arg { idx: 0 }
ret
get_arg { idx: 0 }
push { val: Int64(1) }
sub
push { val: Fun(FunId(256)) }
call { argc: 1 }
get_arg { idx: 0 }
push { val: Int64(2) }
sub
push { val: Fun(FunId(256)) }
call { argc: 1 }
add
ret

Like CPython, the Plush interpreter is a stack-based bytecode VM. The get_arg instruction pushes the value of a function argument onto the stack, push places a constant on the stack, lt performs a less-than comparison, and so on. The first thing that stood out to me is that we push a constant representing a function right before calling it with the call instruction. In this case, we know at compile time which function we're calling, so it seems easy to combine the push and the call into some kind of call_direct instruction that embeds the callee directly in the bytecode instruction. I made that change, which saves two bytecode instructions (one push per call site), and it got me from 9.10s to 8.44s.
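
To make this concrete, here's a minimal sketch in Rust (the language the interpreter is written in) of what such a fusion pass could look like. The Insn variants and the fuse_direct_calls helper are hypothetical stand-ins, not Plush's actual types, and a real pass would also have to remap branch targets like if_false, since fusing changes instruction offsets:

#[derive(Clone, Debug, PartialEq)]
enum Insn {
    Push { val: i64 },
    PushFun { fun_id: u32 },
    Call { argc: u32 },
    CallDirect { fun_id: u32, argc: u32 },
    Ret,
}

fn fuse_direct_calls(code: &[Insn]) -> Vec<Insn> {
    let mut out = Vec::with_capacity(code.len());
    let mut i = 0;
    while i < code.len() {
        match (&code[i], code.get(i + 1)) {
            // The callee is a constant known at compile time: fold it
            // into the call itself and skip both original instructions.
            (Insn::PushFun { fun_id }, Some(Insn::Call { argc })) => {
                out.push(Insn::CallDirect { fun_id: *fun_id, argc: *argc });
                i += 2;
            }
            (insn, _) => {
                out.push(insn.clone());
                i += 1;
            }
        }
    }
    out
}

fn main() {
    let code = vec![
        Insn::Push { val: 1 },
        Insn::PushFun { fun_id: 256 },
        Insn::Call { argc: 1 },
        Insn::Ret,
    ];
    // The push/call pair collapses into a single call_direct.
    println!("{:?}", fuse_direct_calls(&code));
}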

Some of you may be wondering why this is faster. We have fewer bytecode instructions, but we're really doing the same computation, the same work we were doing before. The important thing to know is that much of the performance loss incurred by an interpreter comes from the overhead of dispatching instructions. In general, bigger instructions that do more work are better for interpreter performance because they allow us to reduce dispatch overhead. Bigger instructions also give the compiler a bigger chunk of code to optimize in one piece, which can lead to more effective optimizations.
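
To illustrate where that dispatch cost lives, here's a toy stack-machine loop in Rust. The Insn type and run function are made up for this example, not Plush's actual interpreter code; the point is that every instruction, however trivial, pays for one fetch and one branch, so fewer, bigger instructions means paying that toll fewer times:

#[derive(Clone, Copy)]
enum Insn {
    Push(i64),
    Add,
    Halt,
}

fn run(code: &[Insn]) -> i64 {
    let mut stack: Vec<i64> = Vec::new();
    let mut pc = 0;
    loop {
        // This fetch + match is the dispatch overhead: it's paid once
        // per instruction, no matter how little work the instruction does.
        match code[pc] {
            Insn::Push(v) => stack.push(v),
            Insn::Add => {
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(a + b);
            }
            Insn::Halt => return stack.pop().unwrap(),
        }
        pc += 1;
    }
}

fn main() {
    let code = [Insn::Push(40), Insn::Push(2), Insn::Add, Insn::Halt];
    println!("{}", run(&code)); // prints 42
}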

For the next step I pulled out the Linux perf tool, a profiler, to get a better look at the main sources of overhead in my interpreter. Plush uses a system where it lazily compiles functions into bytecode. That is, when you run a script, everything gets parsed into an Abstract Syntax Tree (AST), but functions only get compiled into bytecode when you first call them. One of the main reasons to do this is that in practice, as much as 90% of the code in a program may never be run, so it saves time and memory to avoid compiling code that isn't executed. Another reason is that the longer you wait to compile a function, the more information you have available to potentially help you optimize it.
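
Sketched in Rust, with hypothetical Vm, entry_pc and compile_from_ast names rather than Plush's real internals, the compile-on-first-call scheme could look something like this:

use std::collections::HashMap;

type FunId = u32;

struct Vm {
    // Bytecode for every function compiled so far, in one flat buffer.
    code: Vec<u8>,
    // Maps a function to its entry PC, once it has been compiled.
    compiled: HashMap<FunId, usize>,
}

impl Vm {
    // Return the entry PC for `fun`, compiling it from its AST on the
    // first call and caching the result for every call after that.
    fn entry_pc(&mut self, fun: FunId) -> usize {
        if let Some(&pc) = self.compiled.get(&fun) {
            return pc;
        }
        let pc = self.code.len();
        self.compile_from_ast(fun);
        self.compiled.insert(fun, pc);
        pc
    }

    fn compile_from_ast(&mut self, _fun: FunId) {
        // Walk the function's AST and append bytecode to self.code; elided.
    }
}

fn main() {
    let mut vm = Vm { code: Vec::new(), compiled: HashMap::new() };
    // The first call compiles; the second hits the cache.
    assert_eq!(vm.entry_pc(256), vm.entry_pc(256));
}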

The call instruction, when invoked for a function, will look up this function in a hash table to see if it's already been compiled, and if so, at which offset or Program Counter (PC) it should jump to. As I suspected, this hash lookup adds quite a bit of overhead; I just didn't realize how much. I implemented a new instruction, call_pc, which calls a function that has already been compiled into bytecode. I then made it so that call_direct triggers the compilation of a function if necessary, and then overwrites itself with a call_pc, which doesn't need the hash lookup anymore. Patching instructions like this is a classic optimization technique sometimes referred to as self-modifying code. Amazingly, introducing call_pc took us all the way down to 5.13s, so removing that hash lookup with code patching had a huge impact. This is already significantly faster than CPython's 5.70s, but can we do better? We certainly can.
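
Here's a rough sketch of the patching idea in Rust. The instruction names match the ones above, but the encoding and the resolve_or_compile helper are invented for illustration:

#[derive(Clone, Debug)]
enum Insn {
    // Knows which function to call, but must look up (and possibly
    // trigger) its compiled entry point through a hash table.
    CallDirect { fun_id: u32, argc: u32 },
    // Knows the callee's entry PC directly: no hash lookup at all.
    CallPc { entry_pc: usize, argc: u32 },
}

fn exec_call(code: &mut [Insn], pc: usize) -> usize {
    match code[pc].clone() {
        Insn::CallDirect { fun_id, argc } => {
            // Slow path, taken once: resolve (compiling if needed)...
            let entry_pc = resolve_or_compile(fun_id);
            // ...then overwrite this instruction so every future
            // execution takes the fast path. Classic self-modifying code.
            code[pc] = Insn::CallPc { entry_pc, argc };
            entry_pc
        }
        // Fast path, taken on every call after the first.
        Insn::CallPc { entry_pc, .. } => entry_pc,
    }
}

fn resolve_or_compile(fun_id: u32) -> usize {
    // Hash lookup + lazy compilation, as described above; stubbed here.
    fun_id as usize
}

fn main() {
    let mut code = vec![Insn::CallDirect { fun_id: 256, argc: 1 }];
    exec_call(&mut code, 0);
    println!("{:?}", code[0]); // now CallPc { entry_pc: 256, argc: 1 }
}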

Looking back at the bytecode sequence above, there's another obvious optimization we can make. The program computes n - 1 and n - 2, and each of those expressions takes 3 bytecode instructions (get_arg, push, sub). Something we can easily do is create a new instruction, add_i64, which adds an integer constant to a value without needing to push the constant separately from the add or sub operation. In this case we'll be adding negative integers. Obviously, we could play all sorts of games with fusing different instructions together, and in the limit we could implement most of the benchmark as a single instruction, which would feel like cheating, but adding and subtracting integer constants is a fairly common operation, so this optimization should generalize pretty well to other code. This takes us from 5.13s to 4.67s.
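
Here's a minimal sketch of this fusion as a Rust peephole pass, again with invented Insn variants. Note that subtracting a constant becomes adding its negation, which is why a single add_i64 instruction covers both cases:

#[derive(Clone, Debug, PartialEq)]
enum Insn {
    Push(i64),
    Add,
    Sub,
    AddI64(i64),
}

fn fuse_const_arith(code: &[Insn]) -> Vec<Insn> {
    let mut out: Vec<Insn> = Vec::with_capacity(code.len());
    for insn in code {
        // Check whether the previous emitted instruction pushed a constant.
        let prev_const = match out.last() {
            Some(Insn::Push(k)) => Some(*k),
            _ => None,
        };
        match (prev_const, insn) {
            // push k; add  =>  add_i64 k
            (Some(k), Insn::Add) => {
                out.pop();
                out.push(Insn::AddI64(k));
            }
            // push k; sub  =>  add_i64 -k
            (Some(k), Insn::Sub) => {
                out.pop();
                out.push(Insn::AddI64(-k));
            }
            _ => out.push(insn.clone()),
        }
    }
    out
}

fn main() {
    // The "- 1" from fib's n - 1 (the get_arg before it is elided).
    let code = vec![Insn::Push(1), Insn::Sub];
    assert_eq!(fuse_const_arith(&code), vec![Insn::AddI64(-1)]);
}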

I figured I could squeeze out a tiny bit more. My add_i64 implementation would first pop the top of the stack, add the constant, and then push the new value back on top. However, it's more efficient to modify the top of the stack in place, without popping and pushing at all. You would think the Rust compiler would be smart enough to optimize the pop and the push together, but it seems that it isn't. Modifying the top of the stack in place brings us from 4.67s down to 4.57s. Neat! With just a little bit of focused effort, we've made the Plush interpreter just shy of twice as fast on this microbenchmark.
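
Here's the difference as a pair of hypothetical Rust handlers; the real stack holds tagged Plush values rather than raw i64s, but the shape is the same:

// Before: pop the operand, then push the result. The pop decrements the
// length and the push increments it again (with a capacity check).
fn add_i64_pop_push(stack: &mut Vec<i64>, imm: i64) {
    let v = stack.pop().unwrap();
    stack.push(v + imm);
}

// After: mutate the top slot where it already lives. No length updates,
// no capacity check.
fn add_i64_in_place(stack: &mut [i64], imm: i64) {
    *stack.last_mut().unwrap() += imm;
}

fn main() {
    let mut stack = vec![38];
    add_i64_pop_push(&mut stack, -1);
    add_i64_in_place(&mut stack, -1);
    println!("{:?}", stack); // [36]
}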

Out of curiosity, I decided to see how much these optimizations improved the performance of my parallel raytracer program and found... absolutely zero impact. This is a bit surprising, given that the raytracer executes many function calls for each pixel. However, I did start this post by stating that microbenchmarks can overemphasize certain aspects of an implementation while disregarding others. The issue here is that there are probably other (even worse) inefficiencies that affect the raytracer but aren't represented in fib. Stay tuned for the next episode to find out what could be so inefficient that it renders all this optimization work invisible!