Zeta's JITterpreter

June 12th, 2017

About six weeks ago, I made ZetaVM open source and announced it on this blog. This is a compiler/VM project that I had been quietly working on for about 18 months. The project now has 273 stars on GitHub. This is both exciting and scary, because so much of what I want to do with this project is not yet built. I really want to show this project to people, but I also find myself scared that people may come, see how immature/incomplete the project is at this stage, and never come back. In any case the project is open source now, and I think it would be good for me to write about it and explain some of the design ideas behind it, if only to document it for current and potential future collaborators.

One of the main goals of ZetaVM is to be very approachable, and enable more people to create their own programming languages, or to experiment with language design. With that goal in mind, I've designed a textual, stack-based bytecode format that resembles JSON, but allows for cycles (objects can reference one-another). Functions, basic blocks, and instructions in this bytecode format are all described as a graph of JS-like objects. This is very user-friendly. Anyone can write a Python script that outputs code in this format and run the output in ZetaVM. It's also very powerful, in a LISP-y code-is-data kind of way: you can generate and introspect bytecode at run time, it's just made of plain objects and values that the VM can manipulate. ZetaVM has first-class bytecode.

The downside of all this, as you can imagine, is that it inherently has to be absolutely dog slow. Or does it? The first version of the Zeta interpreter traversed the graph of bytecode objects the naive way, and it was indeed dog slow. I've since written a new interpreter which removes the overhead of the object-based bytecode by dynamically translating it to an internal representation (think dynamic binary translation). The internal IR is compact and flat, executing it involves no pointer hopping.

The new interpreter generates code on the fly, which means it is, by definition, a Just-In-Time (JIT) compiler. It's architecture is based on Basic Block Versioning (BBV), a compilation technique I developed during my PhD (talks and papers on the topic are linked here if you're curious). BBV has the nice property that it generates code lazily, and the generated code naturally ends up compact and fairly linear. This is not done yet, but BBV also makes it possible to specialize code to eliminate dynamic type checks very effectively, and to perform various optimizations on the fly.

You might be wondering why I'm bothering with an interpreter, instead of just writing a JIT that generates machine code. One of the motivating factors is that Zeta is still at a very early stage, and I think that an interpreter is a better choice for prototyping things. Another factor is that it occurred to me that I could potentially make Zeta more portable by having the interpreter do much of the compilation and optimization work. The interpreter can do type-specialization, inlining and various code simplifications.

The interpreter will be designed in such a way that the internal IR it produces is optimized and in many ways very close to machine code. It should then be possible to add a thin JIT layer on top to generate actual machine code. The resulting JIT will hopefully be much simpler and easier to maintain than if one were compiling directly from the raw object-based bytecode format. Another benefit of this design is that all of the optimizations that the interpreter perform will not be tied in with the specifics of x86 or other architectures, they will remain portable.

At the moment, the new interpreter is at a point where it lazily compiles code into the flat internal format, but performs no other optimization. This was enough to get a 7x performance improvement over the naive interpreter, but the current system is still quite a bit below the performance level of the Python interpreter, and there is definite room for improvement. Some of the first optimizations I would like to introduce are the elimination of redundant branching instructions, and the use of inline caching to speed up function calls.

The elimination of redundant branches is fairly easy to do with BBV. Code is lazily compiled, and appended linearly into a buffer. When generating code for a branch, if the block we are jumping to is just about to be compiled, then the branch is redundant. BBV will naturally tend to generate code that flows linearly along hot code paths and branches out-of-line for infrequent code paths. That is, the default path often comes right next and requires no branching.

Inline caching is a classic technique that was pioneered by the Smalltalk and SELF VMs. It's used, among other things, to eliminate dynamic property lookups when polymorphic function calls are performed (see this excellent blog post for more information). Currently, ZetaVM, when performing a function call, needs to read multiple properties on function objects. For instance, it needs to find out how many arguments the function has, and what is its entry basic block. These property lookups are dynamic, and relatively slow. The end result is that the call instruction is very slow compared to other instructions.

Most call instructions will end up always calling the same instruction. Hence, dynamic overhead on function calls can largely be eliminated by caching the identity of the function being called by a given call instruction. That is, one can cache the number of arguments and entry basic block associated with a function object the first time a call instruction is run, and then reuse this information when calling the function again, provided we keep calling the same function. This information will be cached in line, in the instruction stream, right after the call instruction opcode, hence the name inline caching. I anticipate that with inline caching, function calls in Zeta can be made several times faster.