Over a month ago, I received reviews for the basic block versioning paper I submitted to ECOOP 2015. The reviews were again mixed, ranging from great to terrible. After 3 previous rejections at other conferences, I’d lowered my expectations and thought we were most likely in for a fourth refusal. To be honest, I was somewhat angry. The worst review we had received was unfair and harsh. Whoever had written it seemed to have a fairly poor understanding of JIT compilation, but that didn’t stop them from aggressively criticizing our work.
Fortunately for me, ECOOP offers a chance for submitters to respond to peer reviews, and I was able to channel my anger into motivation to write a very strong and detailed rebuttal. I still didn’t think our paper would make it, but at the very least, I thought, I gave it my all. I was pleasantly surprised when I learned earlier this month that the paper was accepted! Two reviewers increased the score they gave us, tipping the balance in our favor.
This is an important turning point for me, because it means the key contribution of my research work (the idea of basic block versioning) is getting published. I’m finally getting some official recognition from my academic peers for the past four years of hard work I poured into my research. Getting at least one publication from my PhD work was a personal goal I’d set for myself.
Now that this goal is reached, this paper offers me something to build upon. Publishing about additional contributions from my PhD work should be a fair bit easier now that I have my foot in the academic door. I’m already working on a new paper about typed object shapes, and I’ll be able to cite the ECOOP paper. I’ll be able to present further contributions as improvements over the work I already published, instead of trying to explain them in isolation.
And hey, I get to go give a talk in Prague! Sometimes I think being an academic isn’t all that bad.
In December, I submitted a paper about my work on basic block versioning to ECOOP 2015. Basic block versioning (BBV for short) is a technique to specialize machine code in order to eliminate type checks. It works by generating type-specialized versions of basic blocks on the fly, based on the types observed during the execution of a program. It’s a way to type-specialize code without doing type analysis per se. The technique, as described in the ECOOP paper, works at the intraprocedural level (inside function bodies) and only deals with primitive types (e.g. integer, string, object, boolean). My goal with this paper was to explain basic block versioning in its simplest possible form, and to avoid complicating the explanation by delving into the details of various enhancements to the technique.
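To make this more concrete, here’s a minimal sketch in D (the language Higgs is implemented in) of the core versioning lookup. The names and data structures (TypeTag, BasicBlock, compileVersion) are purely illustrative, not Higgs’ actual internals:

```d
// Minimal sketch of lazy basic block versioning (illustrative only)
import std.stdio : writefln;

enum TypeTag { UNKNOWN, INT32, FLOAT64, STRING, OBJECT, BOOL }

alias Context = immutable(TypeTag)[]; // known tag of each live value
alias CodePtr = void*;                // entry point of a compiled version

struct BasicBlock { string name; /* IR instructions would live here */ }

// One compiled version per (block, type context) pair
CodePtr[Context][BasicBlock*] versionMap;

// Get a version of `block` specialized for the types in `ctx`,
// compiling it on the fly the first time this context is seen
CodePtr getBlockVersion(BasicBlock* block, Context ctx)
{
    if (auto versions = block in versionMap)
        if (auto code = ctx in *versions)
            return *code; // reuse an existing version

    auto code = compileVersion(block, ctx);
    versionMap[block][ctx] = code;
    return code;
}

CodePtr compileVersion(BasicBlock* block, Context ctx)
{
    // A real JIT would emit machine code here, omitting type tag
    // tests for any value whose tag is already known from ctx
    writefln("compiling %s for context %s", block.name, ctx);
    return null;
}
```

The important property is that versions are created lazily, driven by the types that actually occur at run time, so code is only duplicated for the type combinations a program really exercises.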
I still don’t know if the paper will be accepted or not, but in the meantime, my research has been advancing. I extended BBV to perform overflow check elimination on loop increments, which turned out to be a relatively trivial extension to make. I also implemented what I refer to as typed shapes, a mechanism to version code based on object shapes and property types, which I feel meshes quite well with BBV. Most recently, I’ve been working on extending BBV to work interprocedurally, that is, allowing type information to cross function call boundaries. I see this last addition as the missing piece of the puzzle which completes the BBV approach and makes it a viable, complete alternative to traditional fixed-point interprocedural type analyses.
My implementation of interprocedural BBV can be divided into at least four components:
- Generating type-specialized function entry points (passing types to callees)
- Generating type-specialized call continuations (passing return types back to callers)
- Propagating a shapes-changed “dirty” flag to call continuations (shape mutation/aliasing info)
- Threading the global object shape through function calls
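To give a rough idea of the first two components, here’s an illustrative sketch building on the earlier snippet (again, hypothetical structure, not my actual implementation):

```d
// Sketch of type-specialized function entry points (illustrative only)
struct Func
{
    BasicBlock entryBlock;
    CodePtr[Context] entryPoints; // specialized entries, keyed on argument types
}

// At a call site where the argument types are known, the caller requests
// an entry point specialized for those types, so the callee starts
// executing with that type information already established
CodePtr getEntryPoint(Func* fn, Context argTypes)
{
    if (auto code = argTypes in fn.entryPoints)
        return *code;
    auto code = compileVersion(&fn.entryBlock, argTypes);
    fn.entryPoints[argTypes] = code;
    return code;
}
```

Call continuations work symmetrically: when compiling a return, the callee requests a continuation block in the caller specialized for the type of the returned value, so the caller resumes with the return type already known instead of testing it.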
So far, I have the first two parts working on my development branch, with all tests passing and all benchmarks working. The results are quite positive: I’ve been able to achieve an average reduction of 98% in the number of type tag tests over 26 benchmarks. On many of the simpler benchmarks, essentially no type tests are executed during the entire run.
The results so far are impressive enough that I believe BBV may actually be eliminating more type tests than what is achievable with traditional type analyses. This is assuming that such analyses eliminate redundant type tests without versioning or duplicating code. I intend to explore this further as part of my PhD dissertation. I would like to be able to show that BBV is not only fast but actually more capable than iterative fixed-point type analyses.
A few months ago, I was pleased to receive an e-mail from professor Jan Vitek (chair of the ACM SIGPLAN) inquiring about the status of my basic block versioning work. He had been on the committee of a conference to which my advisor and I had submitted a paper (which was unfortunately rejected). Following some discussion, Jan Vitek invited me to visit Northeastern University to meet his research group and give a talk about my PhD work.
This Monday, I flew to Boston to visit the Programming Research Laboratory. I had the opportunity to exchange ideas with several PhD students as well as established researchers in the PL and compiler research fields, including Jan Vitek himself, Matthias Felleisen and Olin Shivers. Topics discussed were quite broad, ranging from graduate studies to PL design to data-parallel programming to compiler verification to actors and JIT compiler implementation.
The talk itself went quite well and elicited much participation. I have to say that the act of creating and organizing the slides for the talk was pretty helpful for me. I’ve been able to come up with better examples to explain basic block versioning and the way incremental code generation is implemented in Higgs. It also gave me an opportunity to talk about my recent work on typed shapes. The general feeling I’ve been having lately is that I explain my research much more effectively when giving talks than when describing it in papers.
I exchanged ideas with Jan Vitek regarding hybrid type analyses for optimizing dynamic languages (he is working on an optimizing VM for the R language). Jan Vitek also gave me precious feedback on how to improve my basic block versioning paper. I’m currently working on an updated submission for the upcoming ECOOP conference. Regardless of whether the new paper gets in or not, it will be much improved with additional data to support our claims.
My hosts were very well organized, which made the whole experience quite pleasant. A meeting schedule had been planned for me and a hotel room reserved when I arrived. I got a little scared Tuesday evening when my taxi was over 30 minutes late, putting me in serious danger of missing the last flight back to Montreal, but I was lucky enough to run into Carla E. Brodley, dean of CCIS, who offered to share an Uber ride with her. My only regret regarding this short trip is that I did not get to try Boston clam chowder. Maybe next time.
This Monday, I met with Marc Feeley, my PhD advisor. I showed him graphs comparing the performance of Higgs to that of the TraceMonkey compiler. Reviewers at conferences we’ve previously submitted papers to have been very skeptical, and pressed us multiple times to produce some kind of comparison of basic block versioning against tracing. We believe that faring well against TraceMonkey could increase our odds of getting our work published.
So far, we’re beating TraceMonkey on about half of our benchmark set. There are, however, cases where we’re not doing well at all. The above graph compares speedups produced by Higgs relative to V8 baseline and TraceMonkey. The scale is logarithmic and higher bars are better. If you look at the last bars at the far right, you’ll notice that they’re actually negative. That’s because on that benchmark we’re about 100 times slower than TraceMonkey (ouch). The scale of this graph could be fixed so that the bars no longer show up as negative, but Marc rightly pointed out that reviewers may latch onto instances like this. Does this benchmark exhibit some pathological case that basic block versioning can’t handle?
What Went Wrong?
The reason my for-in loop was slow is that I never bothered to optimize it. With such complex semantics, it seemed easier to take the “comfortable” approach. I implemented the loop with a closure that gets generated when iteration begins. This closure is a generator that returns a different property name on each call. It hides all of the iteration state and complex logic from the outside. What also added to the slowness is that this closure made multiple calls to the D part of my runtime to traverse object shapes. I wrote a microbenchmark to measure the speed of the for-in loop itself and sure enough, it was over 100 times slower than that of V8 and TraceMonkey.
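In D terms, the old design boiled down to something like this (a simplified sketch; the real closure also made repeated runtime calls to walk the shape tree):

```d
// Sketch of the slow closure-based iterator (illustrative only)
string delegate() makePropIter(string[] propNames)
{
    size_t idx = 0; // iteration state captured by the closure
    return delegate string()
    {
        if (idx >= propNames.length)
            return null;           // signals the end of iteration
        return propNames[idx++];   // one property name per call
    };
}
```

A closure allocation per traversal plus an indirect call per property adds up quickly in a hot loop.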
Making it Fast
My advisor convinced me that I should address this performance problem, and to do that, I had to make for-in loops faster. It was clear that I should get rid of the closure created on each traversal and eliminate the D runtime calls. I had to remove the slow things, but what would I replace them with? For the closure, the answer was fairly obvious: initialize the variables I need before the loop, and thread them through manually as appropriate. The bigger issue was how I would go about enumerating all the properties in an object without calling into my D runtime code, and without traversing the shape tree structure at all, if possible.
I knew that V8 and TraceMonkey had fast for-in loops, so surely they had some kind of trick that I hadn’t thought of. My eureka moment came when I realized that for a given object shape (a.k.a. hidden class), the properties enumerated are always the same. Aha! There’s an easy solution then: a list of the properties to be enumerated can be generated per shape. To avoid D runtime calls, this list can take the form of a JS array that can be directly accessed by my JS runtime. Furthermore, it can be lazily generated the first time a for-in loop touches a given shape, and cached for future reuse. Clever, and not too difficult to implement.
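Here’s an illustrative sketch of the idea. The Shape class and names here are made up for the example; in Higgs, the cached list is a JS array so that the JS runtime can access it without calling into D:

```d
// Sketch of per-shape caching of enumerable property lists (illustrative)
class Shape
{
    string propName;        // property added by this shape node
    Shape parent;           // previous shape in the shape tree
    string[] enumPropCache; // lazily built list of enumerable properties
}

// Build the list of property names to enumerate for a given shape.
// The shape tree is walked once; later for-in loops over the same
// shape just reuse the cached list.
string[] getEnumProps(Shape shape)
{
    if (shape.enumPropCache !is null)
        return shape.enumPropCache;

    string[] props;
    for (auto s = shape; s !is null; s = s.parent)
        if (s.propName !is null)
            props = s.propName ~ props; // keep property definition order

    shape.enumPropCache = props;
    return props;
}
```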
| Implementation | Time |
| --- | --- |
| Higgs (old for-in) | >500s |
| Higgs (new for-in) | 2.91s |
| V8 3.29.66 (baseline) | 4.98s |
| V8 3.29.66 (Crankshaft) | 4.95s |
After a few iterations of tweaking, I was quite pleased with the performance results I got (see above). Higgs went from having a horribly slow for-in loop to having the fastest of all. Quite possibly, the precached table trick I came up with is something other engines don’t have in their implementation.
The Benchmark Game
The faster for-in loop satisfied my original goal: the fasta benchmark now runs several times faster on Higgs, and the performance we get is no longer abhorrent. However, we still aren’t beating the other engines on that benchmark; we’re about 3 times slower now. There’s another detail that hinders performance in this specific benchmark: in the body of the for-in loops, properties are being accessed with dynamic keys that are only known at execution time. Higgs currently deals with that by doing a D runtime call. Clearly, again, the other engines have some better way of handling this particular case than I do.
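I can only guess at what they do, but one plausible mechanism is a small global cache mapping (shape, key) pairs to property slot indices, consulted before falling back to the slow runtime lookup. A hypothetical sketch, reusing the Shape class from the earlier snippet:

```d
// Hypothetical (shape, key) -> slot cache for dynamic property keys
struct PropCacheEntry { Shape shape; string key; uint slotIdx; }

enum CACHE_SIZE = 256;
PropCacheEntry[CACHE_SIZE] propCache;

// Stand-in for the slow D runtime lookup
uint slowLookup(Shape shape, string key) { return 0; }

uint getSlotIdx(Shape shape, string key)
{
    auto h = (cast(size_t)cast(void*)shape ^ hashOf(key)) % CACHE_SIZE;
    auto entry = &propCache[h];

    if (entry.shape is shape && entry.key == key)
        return entry.slotIdx;          // hit: no runtime call needed

    auto idx = slowLookup(shape, key); // miss: take the slow path once
    *entry = PropCacheEntry(shape, key, idx);
    return idx;
}
```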
I’ve been working hard over the last two weeks to try and fix every obvious performance issue I could find. Among other things, I implemented ropes to make string concatenation faster, and now I’ve made for-in loops faster. In the end though, each optimization I make is just a drop in the bucket. To compete effectively on every benchmark, there are many individual things to optimize: everything must be made fast. The people at Google and Mozilla have had years to optimize their engines specifically so that they would perform well on the well-known sets of benchmarks. They’ve gone as far as implementing memoization of sine values based on the knowledge that one particular benchmark calls the sine function with the same angle values repeatedly.
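For the curious, the rope idea mentioned above boils down to making concatenation allocate one small node instead of copying characters. A minimal illustrative sketch:

```d
// Sketch of ropes for O(1) string concatenation (illustrative only)
class Rope
{
    Rope left, right; // set for concatenation nodes
    string str;       // set for leaf nodes
    size_t length;

    this(string s) { str = s; length = s.length; }
    this(Rope l, Rope r) { left = l; right = r; length = l.length + r.length; }
}

// Concatenation allocates one node and copies nothing
Rope concat(Rope a, Rope b) { return new Rope(a, b); }

// The flat string is only built when the characters are actually needed
string flatten(Rope r)
{
    if (r.str !is null)
        return r.str;
    return flatten(r.left) ~ flatten(r.right);
}
```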
Over a week ago, I received the unfortunate news that my basic block versioning paper was rejected for the third time. We had received five “weak accept” scores, but at the last minute, after the rebuttal period was over, one of the reviewers decided to change his vote and give us a “reject” rating. I won’t lie, even though I knew the paper could be rejected, I found that I had a hard time getting out of bed the next day. For a little while, my motivation was completely crushed. I decided to put the latest version of the paper on arXiv.org. I feel that it’s useful to timestamp my progress, and it seems that sharing the results of my research is more important than getting into academic conferences.
The graph above shows a comparison of the relative execution time of code compiled by Higgs vs TraceMonkey (lower is better). Compilation times are excluded from this comparison because of issues mentioned in a previous blog post. The 26 benchmarks are from the SunSpider and V8 suites, but I’ve excluded regular expression benchmarks since Higgs doesn’t have a regular expression compiler (only an interpreter), and my research is not specifically about regular expressions.
Higgs clearly doesn’t beat TraceMonkey across the board, but it does perform better on half of the above benchmarks, sometimes by a wide margin. What seems clear after some analysis is that TraceMonkey does very well on benchmarks with short, obvious loops, but poorly on benchmarks with more complex control flow. This should at least highlight the fact that basic block versioning is not the same thing as tracing, and can, in fact, deal with situations for which tracing is less suited.
Fortunately for me, TraceMonkey is not a moving target in terms of performance, and this week, I’ve been working on helping Higgs catch up. I’ve added string optimizations such as ropes and precaching of one-character strings. I’ve also been working on eliminating bottlenecks in the runtime and standard libraries. Tomorrow, I’ll be optimizing the inefficient way in which Higgs handles for-in loops (property enumerations), which are extensively used in the fasta benchmark.
It’s very motivating to have a goal. You can be sure that the next paper I submit will be showing Higgs beating TraceMonkey on most of these benchmarks. I’m going to keep improving the performance, keep improving our results, and keep amassing more and more evidence that basic block versioning is a worthwhile JIT compiler architecture. I’m going to push Higgs to the limit.
I recently wrote about performance issues I ran into with Higgs. Compilation times are currently unacceptably slow, and I tracked the cause down to performance problems in D’s garbage collector. Someone familiar with the GC internals explained that the GC scales poorly, with total collection time increasing quadratically as a function of the amount of memory allocated. Despite this poor GC scaling being an obvious flaw, I received criticism from people telling me that my compiler was simply allocating “too many objects”, that I was doing something wrong, and that I should be implementing my own custom memory allocation strategies because garbage collectors were always going to be slow.
Despite the naysayers, Andrei Alexandrescu, co-creator of D, announced that he would be rewriting D’s GC to fix these performance problems, and his announcement was well received in the D community. Unfortunately for me, this GC rewrite doesn’t seem to be a top priority, and I’m not sure how many months it will take before a new collector replaces the current one. It seems conceivable that it could take over a year, or that it might happen after I finish my PhD studies. In a sense, the naysayers are right: with the current GC situation, I have no choice but to circumvent the GC to get acceptable performance.
I decided to do some research into alternative memory management strategies. I first started by looking at the way it’s done in Google’s V8 JIT. Their strategy is a system of memory “zones”. These appear to be large memory arrays in which a pointer is bumped to allocate memory, which is very fast. They generate a function’s Abstract Syntax Tree (AST) and its Intermediate Representation (IR), allocating all node objects into a zone, and then discard all of these objects at once when they are no longer needed. The cost to deallocate an entire zone is constant. This strategy is clever, but it isn’t sufficient for Higgs. I could see myself using something like this for my AST and IR (and maybe I will), but there are other objects in the system which need more fine-grained deallocation.
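Here’s a minimal sketch of the zone idea as I understand it (illustrative; not V8’s actual code):

```d
// Sketch of a zone / bump-pointer allocator (illustrative only)
import core.stdc.stdlib : malloc, free;

struct Zone
{
    ubyte* block;    // one large backing block
    size_t capacity;
    size_t used;     // bump pointer offset

    this(size_t cap)
    {
        block = cast(ubyte*)malloc(cap);
        capacity = cap;
    }

    // Allocation is just a pointer bump: no headers, no free lists
    void* alloc(size_t size)
    {
        size = (size + 15) & ~size_t(15); // keep returned pointers aligned
        assert(used + size <= capacity);  // a real zone would chain new blocks
        auto ptr = block + used;
        used += size;
        return ptr;
    }

    // Discarding everything allocated in the zone is one cheap operation
    void reset() { used = 0; }

    // Release the backing memory itself
    void freeAll() { free(block); block = null; used = capacity = 0; }
}
```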
Quite possibly, I’ll end up implementing object pools with free lists. This wouldn’t be quite as blazingly fast as zones, but it would result in less memory waste, and it would be usable for all of the often-allocated kinds of objects in Higgs. The folks on the D forum were kind enough to point me to useful resources on how to manually allocate class instances in D.
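A minimal sketch of what such a pool might look like (illustrative; a real implementation would carve slots out of malloc’d slabs instead of falling back on GC allocation as this one does):

```d
// Sketch of an object pool with a free list (illustrative only)
struct Pool(T)
{
    static assert(T.sizeof >= (void*).sizeof,
                  "T must be able to hold a free-list pointer");

    static struct FreeNode { FreeNode* next; }
    FreeNode* freeList;

    T* alloc()
    {
        if (freeList !is null)
        {
            auto node = freeList;      // reuse a previously freed slot
            freeList = node.next;
            auto obj = cast(T*)node;
            *obj = T.init;             // reset to a clean initial state
            return obj;
        }
        return new T();                // pool empty: grab fresh memory
    }

    void release(T* obj)
    {
        auto node = cast(FreeNode*)obj; // the dead object's memory
        node.next = freeList;           // becomes a free-list node
        freeList = node;
    }
}
```

Allocation and deallocation are both just a couple of pointer operations, and the memory for frequently allocated objects gets recycled instead of churning through the GC.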
Out of curiosity, before seriously considering implementing an object pool, I decided to try and benchmark the D garbage collector. I knew there was a performance issue, but I hadn’t precisely quantified it. I was curious to see how much time would be needed to allocate a few million objects, and so I wrote a test program along the lines of the following sketch:
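```d
// Sketch of the benchmark: allocate millions of small GC-managed
// objects and keep them all reachable, then report the time taken
import std.stdio;
import std.datetime.stopwatch : StopWatch, AutoStart;

class Node
{
    Node next;
    long payload;
}

void main()
{
    enum numObjs = 10_000_000;
    auto sw = StopWatch(AutoStart.yes);

    // Keeping every object live makes the heap keep growing, which is
    // exactly the case where the GC's poor scaling hurts the most
    Node head = null;
    foreach (i; 0 .. numObjs)
    {
        auto node = new Node();
        node.next = head;
        head = node;
    }

    writefln("%s allocations in %s ms", numObjs, sw.peek.total!"msecs");
}
```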
The results, timed on my Core i7 3.4GHz with 16GB of RAM, confirmed the problem: total run time grew much faster than linearly as the number of objects allocated increased.
It’s a sad admission to make, but I’ve always found reading books tedious and uncomfortable. Paper books cause me eye strain and I can never seem to find a comfortable position to hold them in. Computer screens aren’t much better: I have a hard time reading multiple pages in one sitting, and reading a book one page at a time is rather ineffective. I’m not sure if I have some kind of low-grade attention deficit, or if it’s a side-effect of programming and using the internet for so many years, but I prefer for written information to come in bite-sized pieces.
I’ve always loved science fiction, but until recently, my only exposure to it came from movies and TV shows. This changed after I acquired a smartphone (my beloved Nexus 4). I found out that there were several free audiobook playback apps in the Google Play store and that many of the greatest science fiction classics were available in audiobook form. I decided to give this a try and listened to Neuromancer, the ultimate cyberpunk classic. I was instantly hooked.
Since discovering audiobooks just a few months ago, I’ve “read”:
- Burning Chrome
- Dune Messiah
- Children of Dune
- The Positronic Man
- The Last Theorem
- I, Robot
- Buddha’s Brain
- Robot Dreams
Listening to audiobooks is a relaxing experience, a way to take my mind off of things while having a story narrated to me, a way to take a break. What’s even better is that this entertainment doesn’t take time away from productive activities at all. I can listen to audiobooks while I’m walking somewhere, on the subway or on the bus, while I’m cooking alone, running errands, in a waiting room or even while brushing my teeth. I just have to whip out my phone and earbuds and press play.
The audiobook app I use (Smart Audiobook Player) is nice enough to automatically stop playback if the earbuds get unplugged, and to go back some number of seconds when playback is resumed. This means that if I run into someone I want to talk to, I can simply unplug the earbuds and stop playback without missing anything. The earbuds I bought ($10 Panasonics) provide good enough sound isolation to use on the subway or in other loud environments.
I implied earlier that I have a hard time staying focused on paper books. I’m one of those people who will be reading a book and eventually realize they’re rereading the same sentence for the 5th time. You might think that staying focused on audiobooks is just as hard, but I find it a little easier, and I’ve been getting better at it over time.
In a way, listening to audiobooks can be seen as a form of mindfulness practice (a form of meditation). If I let my mind run away in thoughts, I lose track of the story and I might want to rewind a bit to try and pay more attention, but I find myself needing to do this less and less. What maybe makes this less tedious than with paper books is that there seems to be less effort involved in listening to something than in reading.
I’m currently listening to The Fall of Hyperion and enjoying it very much so far. I plan to listen to many more audiobooks. It’s come to the point where when I’m between books, I miss having something to listen to. Fortunately, there are many great science-fiction classics out there which have been highly acclaimed and are available in audiobook format.
If you’re a computer geek and you’re looking for something that doesn’t take itself too seriously, I would highly recommend collections of short stories by Isaac Asimov, such as I, Robot. I found them amusing and relaxing to listen to. For rich and complex fictional universes, you might want to check out Dune, Neuromancer or Hyperion. Scott Brick, narrator of the Dune audiobook I listened to, is an excellent voice actor who even makes convincing female voices. His Wikipedia page lists several other books he has narrated.