In the last episode, I wrote about a few simple optimizations that I implemented in order to increase the performance of the recursive Fibonacci microbenchmark running in the Plush interpreter. A few relatively small changes made it possible to double the speed of that microbenchmark, and most importantly, outperform the CPython interpreter. Unfortunately, more benchmarking revealed that these changes had approximately zero impact on the speed of my raytracer program, which is a bummer, because I had a soft goal of making this simple raytracer fast enough that it could render a sphere at real-time frame rates. I find this goal amusing because the idea of a raytracer running in an interpreted language producing real-time frame rates would have seemed ludicrous when I was a teenager in the early 2000s, but on a modern laptop, it seems very much achievable!
It might feel surprising that the performance of the raytracer benchmark remains unchanged after we doubled the performance of the Fibonacci microbenchmark. What this suggests is that there are other inefficiencies hiding in the Plush VM that are significant enough to simply dwarf the effect of the optimizations we already made. The good news is that major inefficiencies are often not so hard to address: a few big ones might just require us to find a different way of implementing things. It's much harder to improve performance when you're faced with a long tail of many slightly inefficient things.
A good first step to identify inefficiencies is to pull out a profiler. In this case I used the perf tool on my Linux desktop, which is pretty easy to use. Rust makes it possible to build a release binary with debug symbols, which we can use for profiling purposes. The raytracer program allocates many intermediate Vec3 objects as part of sphere intersection calculations, and so I would have thought that one of the main sources of overhead was GC allocations. However, what the profile showed is that the running time is still dominated by Rust hash table accesses. The interpreter is still doing a lot of string hashing when doing method calls and class field accesses, and this is very much killing performance.
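For reference, here's roughly what that setup looks like. Keeping debug symbols in optimized builds is a one-line Cargo change, and the perf invocation is the standard one (the exact flags here are just my illustrative defaults, not necessarily what was used for this post):

```toml
# Cargo.toml: keep debug symbols in release builds so perf can
# attribute samples back to function names
[profile.release]
debug = true
```

```sh
# record a profile of the raytracer, then browse the hottest functions
perf record --call-graph dwarf target/release/plush examples/raytracer.psh
perf report
```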
I started by benchmarking the raytracer in release mode using a single thread on my MacBook Air M1 so I could have a baseline against which to compare changes:
```
cargo build --release && time target/release/plush examples/raytracer.psh
Single thread render time: 700ms
```
I then went on to implement an inline cache on the get_field instruction. I started with this instruction because most programs do more object field reads than writes. In Plush, objects have an associated class id, an integer representing the class they were instantiated from, and also a flat array of slots containing values associated with each class field (instance variable). The way the inline cache works is pretty simple: it stores a class id and a slot index directly on the instruction. When we execute the instruction, we check if the class id matches what we previously saw, and if so, we use the cached slot index to access the field on the object. This is our fast path. If the class id doesn't match, we do the expensive hash lookup to get the slot index for that class, and then update the cache. In practice, most field accesses are monomorphic, meaning a given access site only ever sees objects of the same class, in which case the hash lookup only needs to be done once for the entire execution of the program. This simple optimization brings the rendering time down from 700ms to 410ms, which is already a big difference!
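To make the mechanism concrete, here's a minimal sketch of the fast path/slow path split. The types and names are hypothetical and simplified, not Plush's actual internals:

```rust
use std::collections::HashMap;

// Field name -> slot index, one map per class id; resolving a field
// through this table is the expensive hash lookup we want to avoid
type ClassTable = HashMap<u32, HashMap<String, usize>>;

#[derive(Clone, Copy)]
enum Value {
    Nil,
    Int64(i64),
}

struct Object {
    class_id: u32,
    slots: Vec<Value>, // flat array of field values
}

struct GetField {
    field_name: String,
    // Inline cache: the class id we last saw, and the slot index the
    // field name resolved to for that class. The cached class id starts
    // out as a sentinel value (e.g. u32::MAX) that no real class uses.
    cached_class_id: u32,
    cached_slot_idx: usize,
}

impl GetField {
    fn eval(&mut self, obj: &Object, classes: &ClassTable) -> Value {
        // Fast path: same class as last time, reuse the cached slot index
        if obj.class_id == self.cached_class_id {
            return obj.slots[self.cached_slot_idx];
        }

        // Slow path: hash lookup to resolve the slot index for this
        // class, then update the cache for subsequent executions
        let slot_idx = classes[&obj.class_id][&self.field_name];
        self.cached_class_id = obj.class_id;
        self.cached_slot_idx = slot_idx;
        obj.slots[slot_idx]
    }
}
```

For a monomorphic access site, the slow path runs once, and every later execution is just an integer comparison plus an indexed load.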
After implementing the inline cache for get_field, I did the same thing for set_field. This brought the rendering time down from 410ms to 348ms. The difference is noticeable but not as marked, which confirms the empirical wisdom that, like most programs, the raytracer does more instance variable reads than writes.
At this point I did more profiling and was a bit surprised to realize that I had forgotten another big source of inefficiency: when we instantiate a class, we have to call its constructor. We know the id of the class we're instantiating when the AST gets compiled into bytecode instructions. However, the Plush interpreter still does a hash lookup to access said class and find out how many slots to allocate for each new object. It also does more hash lookups to locate the constructor function to call. This is again super inefficient. In theory, we could resolve the number of slots when generating the bytecode and avoid needing an inline cache at all. However, we can't easily determine the address of the constructor function while compiling the bytecode. The simple reason is that the constructor doesn't get compiled into bytecode until it's first called, so we don't know which bytecode address (also known as a PC, or program counter) to jump to until we compile that function, and we're already in the middle of compiling another function.
Since we can't resolve the PC of the constructor while compiling the new instruction, I opted for a simple solution. Compiling to bytecode produces a new instruction that only contains a class id. When this instruction is first executed, it triggers the compilation of the constructor for this class, and then it replaces itself with a new_known_ctor instruction which stores both the number of slots for this class and the PC of the compiled constructor. The patched instruction executes much faster. This optimization brings down the execution time of the raytracer from 348ms to 296ms.
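Here's a rough sketch of what this self-patching scheme can look like. Again, the names and layouts are illustrative, not the actual Plush implementation:

```rust
use std::collections::HashMap;

enum Insn {
    // Emitted by the compiler: only the class id is known at this point
    New { class_id: u32 },
    // Patched-in replacement: slot count and constructor PC baked in
    NewKnownCtor { num_slots: usize, ctor_pc: usize },
}

struct Class {
    num_slots: usize,
    // Bytecode address of the compiled constructor; None until the
    // class is instantiated for the first time
    ctor_pc: Option<usize>,
}

struct Vm {
    insns: Vec<Insn>,
    classes: HashMap<u32, Class>,
}

impl Vm {
    // Slow path for the generic new instruction, taken at most once per
    // instruction: resolve the class, lazily compile the constructor,
    // then patch the instruction stream in place
    fn exec_new(&mut self, pc: usize, class_id: u32) {
        let class = self.classes.get_mut(&class_id).unwrap();
        let ctor_pc = *class.ctor_pc.get_or_insert_with(|| {
            // ...compile the constructor to bytecode here...
            0 // stub: would return the PC of the compiled constructor
        });
        let num_slots = class.num_slots;

        // Every later execution of this instruction takes the fast
        // path with no hash lookups at all
        self.insns[pc] = Insn::NewKnownCtor { num_slots, ctor_pc };
        // ...then allocate num_slots slots and jump to ctor_pc as usual...
    }
}
```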
At this point I did another profiling pass and realized that there was yet more hashing overhead hiding in the program. In the last post I talked about optimizing the call instruction, but that instruction is only used for calling global functions. Method calls use a call_method instruction, which I had not yet optimized. Here we can simply pull the inline caching trick again, and this finally brings the execution time down to 196ms.
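The cache here has the same shape as the get_field one, except it maps a class id to the method's bytecode address (PC) rather than to a field's slot index. A compressed sketch, again with hypothetical names:

```rust
use std::collections::HashMap;

// Method name -> bytecode address, one map per class id
type MethodTable = HashMap<u32, HashMap<String, usize>>;

struct CallMethod {
    method_name: String,
    // Inline cache: last seen class id (sentinel-initialized) and the
    // PC the method name resolved to for that class
    cached_class_id: u32,
    cached_pc: usize,
}

impl CallMethod {
    fn resolve(&mut self, receiver_class_id: u32, methods: &MethodTable) -> usize {
        // Fast path: a monomorphic call site never hashes again
        if receiver_class_id == self.cached_class_id {
            return self.cached_pc;
        }

        // Slow path: hash lookups to resolve the method, then cache it
        let pc = methods[&receiver_class_id][&self.method_name];
        self.cached_class_id = receiver_class_id;
        self.cached_pc = pc;
        pc
    }
}
```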
The four optimizations I talked about in this post are all variants on inline caching and code patching. Funnily enough, they all served to do basically the same thing, which is to remove hash lookup overhead. The overall impact is much bigger than I had originally anticipated, though; I was very surprised by how much overhead it was possible to shave off. The optimized interpreter can now run the raytracer program more than 3.5 times faster than before when using a single thread (700ms down to 196ms). Using 16 threads on my Ryzen desktop, it can render a single frame in just 17 milliseconds, oh yeah!
A fun thing I wanted to mention is that since the last post, an open source contributor has helped me write a simple software rasterizer in Plush. I was then able to optimize this code with the help of Gemini Code CLI, and the final result is a 3D spinning cube that renders at over 650 frames per second at a 600x600 resolution on my MacBook Air M1. We're slightly "cheating" in the sense that it's only this fast because the polygons are flat-shaded; texturing would make it much slower, but I was still impressed with the final performance. In another experiment, I was able to animate a basic city skyline with 2200 polygons in real time. This means that Plush would be fast enough to implement a full-on 3D game such as a simple flight simulator, a racing game, or a first-person shooter (but only as long as it's flat-shaded).
I'd like to conclude by saying that I've very much reached my primary objective with Plush, which was to have a lot of fun coding. I'd also like to invite readers to contribute a more fun raytracer example. For my tests, I'm only raytracing a sphere, but it could be fun to render a scene with other shapes, reflections, etc. You might even be able to vibe code this. Something else I'd love to see is a recreation of the famous Amiga Ball animation in Plush. If you'd like to contribute fun example programs, check out the GitHub repo. Bug reports are also welcome.