A Simple JIT Compiler

In the last few days I've been prototyping a basic Just-In-Time (JIT) compiler for Higgs. I decided to start with a very simple design which is sometimes referred to as a tracelet JIT. This is simply a compiler which compiles one basic block at a time, basic blocks being defined as single-entry, single-exit sequences of instructions. This compiler is lazy, meaning it doesn't compile unexecuted blocks. It also only compiles "hot" blocks that have been executed a few hundred times, the idea being to avoid wasting time compiling code that only runs once or very few times (e.g.: initialization code). Another advantage of delaying compilation somewhat is to allow time for profiling statistics to be gathered.

I have to admit that I was very excited to actually begin generating machine code (if you can believe that). It took me a few days to port over the x86 assembler I wrote in JavaScript for Tachyon to D, along with all associated unit tests. Once this was done, I was able to get a very basic tracelet JIT working in less than a day. This first version simply compiled sequences of calls to functions implementing interpreter instructions. This provided small speedups on the order of 5%. I improved on this by adding JIT code generation support for some basic arithmetic, bitwise and branching operations which provided speedups of up to 30%. The next step, in my mind, was to support all branching operations so that tracelets could jump directly into each other without ever needing to fall back to the interpreter. However, I ran into a problem.

The Higgs interpreter was designed with simplicity and maintainability in mind, a choice which has paid off thus far. Unfortunately, this means that some operations aren't built with pure speed in mind, and one problematic case is the call instruction. The call instruction builds the stack frame for callees. This means it needs to make several conditional checks and push the right number of arguments and local variable slots, which isn't the same for every callee. I started trying to implement the full dynamic extent of calls in the JIT compiler. I stopped once I realized how big and messy the generated code would be. Since every runtime primitive is self-hosted in Higgs, there are calls everywhere, which would assuredly mean the generated machine code would be huge (bad for the instruction cache).

I decided to try an idea just for the fun of it. Most of the calls in the small benchmarks I was testing were global function calls. For such calls, it's quite easy to guess correctly who the callee would be. I chose to implement a mechanism which tries to guess the callee and has an optimized call path specifically for the said function. If the callee isn't correctly guessed, the interpreter's call instruction logic is used instead. Amazingly, such a simple optimization (two hours of work) yielded a total performance improvement of about 500% over the interpreter on most of the benchmarks I tried.

Speedups over the interpreter
Speedups over the interpreter

The numbers above show the speedups I measured on some of the SunSpider benchmarks, going from very simple to medium-sized programs. If you're familiar with these benchmarks, you might feel that these times are ridiculously slow either way. You should keep in mind that this is still very preliminary work (I did say the JIT was a prototype) and that these values were measured using the linux time utility, and so include initialization time. Higgs is also self-hosted, meaning that it lazily compiles parts of its own standard library when you run a program. As such, Higgs will probably always be at a disadvantage when it comes to very short running programs.

At this point, the tracelet JIT compiler is about 800 lines of D code (excluding the x86-64 assembler). It's very much a prototype and a platform for me to experiment with, meaning I will probably refactor the design in many ways. I'm already thinking of several improvements I can make over the course of the next week which should bring more significant speedups. The main thing I will be doing is trying to extend tracelets in a trace-like manner, which should tremendously help with loops.

As I said earlier, I designed the Higgs interpreter to be simple and easy to maintain. This decision was partly motivated by my belief that interpreter speed should be largely irrelevant in a system where a JIT compiler is also available. I could have tried hard to make the Higgs interpreter faster, at the cost of tasteless kludges and largely increased code complexity, but I don't think it would have been worth it, especially now that I've shown that 4 days of work can yield a simple JIT compiler producing 500% speedups. Next week, I'd like to have at least 1000% speedups over the interpreter, which I believe is easily achievable.