
Visit to Northeastern University

A few months ago, I was pleased to receive an e-mail from Professor Jan Vitek (chair of the ACM SIGPLAN) inquiring about the status of my basic block versioning work. He had been on the committee of a conference to which my advisor and I had submitted a paper (which was unfortunately rejected). Following some discussion, Jan Vitek invited me to visit Northeastern University to meet his research group and give a talk about my PhD work.

This Monday, I flew to Boston to visit the Programming Research Laboratory. I had the opportunity to exchange ideas with several PhD students as well as established researchers in the PL and compiler research fields, including Jan Vitek himself, Matthias Felleisen and Olin Shivers. Topics discussed were quite broad, ranging from graduate studies to PL design to data-parallel programming to compiler verification to actors and JIT compiler implementation.

The talk itself went quite well and elicited much participation. I have to say that the act of creating and organizing the slides for the talk was pretty helpful for me. I’ve been able to come up with better examples to explain basic block versioning and the way incremental code generation is implemented in Higgs. It also gave me an opportunity to talk about my recent work on typed shapes. The general feeling I’ve been having lately is that I explain my research much more effectively when giving talks than when describing it in papers.

I exchanged ideas with Jan Vitek regarding hybrid type analyses for optimizing dynamic languages (he is working on an optimizing VM for the R language). He also gave me valuable feedback on how to improve my basic block versioning paper. I’m currently working on an updated submission for the upcoming ECOOP conference. Regardless of whether the new paper gets in or not, it will be much improved, with additional data to support our claims.

My hosts were very well organized, which made the whole experience quite pleasant. A meeting schedule had been planned for me and a hotel room reserved when I arrived. I got a little scared Tuesday evening when my taxi was over 30 minutes late, putting me in serious danger of missing the last flight back to Montreal, but I was lucky enough to run into Carla E. Brodley, dean of CCIS, who offered to share an Uber ride with her. My only regret regarding this short trip is that I did not get to try Boston clam chowder. Maybe next time.


The Fastest For-In Loops in the Whole Wide Web

This Monday, I met with Marc Feeley, my PhD advisor. I showed him graphs comparing the performance of Higgs to that of the TraceMonkey compiler. Reviewers at conferences we’ve previously submitted papers to have been very skeptical, and pressed us multiple times to produce some kind of comparison of basic block versioning against tracing. We believe that faring well against TraceMonkey could increase our odds of getting our work published.

[Graph: speedups of Higgs relative to V8 baseline and TraceMonkey, log scale]

So far, we’re beating TraceMonkey on about half of our benchmark set. There are, however, cases where we’re not doing well at all. The above graph compares speedups produced by Higgs relative to V8 baseline and TraceMonkey. The scale is logarithmic and higher bars are better. If you look at the last bars at the far right, you’ll notice that they’re actually negative. That’s because our performance on that benchmark is about 100 times slower than that of TraceMonkey (ouch). The scale of this graph could be fixed so that the bars no longer show up as negative, but Marc rightly pointed out that reviewers may latch onto instances like this. Does this benchmark exhibit some pathological case that basic block versioning can’t handle?

What Went Wrong?

The poorly-performing benchmark is string-fasta, from the SunSpider suite. A little bit of digging revealed what was going wrong. The benchmark uses the JavaScript for-in loop to iterate through object keys. The for-in loop uses reflection to iterate through the property names of an object and all of its parents. The semantics of this simple-looking construct are annoyingly complex. It deals with arrays and strings in addition to objects. If a property is deleted during iteration and hasn’t been listed yet, it must never be listed. Properties of child objects must “shadow” parent properties with the same name, so that the same property name is never listed twice.

The reason my for-in loop was slow is that I never bothered to optimize it. With such complex semantics, it seemed easier to take the “comfortable” approach. I implemented the loop with a closure that gets generated when iteration begins. This closure is a generator that returns a different property name on each call. It hides all of the iteration state and complex logic from the outside. What also added to the slowness is that this closure made multiple calls to the D part of my runtime to traverse object shapes. I wrote a microbenchmark to measure the speed of the for-in loop itself and sure enough, it was over 100 times slower than that of V8 and TraceMonkey.

Making it Fast

My advisor convinced me that I should address this performance problem, and to do that, I had to make for-in loops faster. It was clear that I should get rid of the closure created on each traversal and eliminate the D runtime calls. I had to remove the slow parts, but what would I replace them with? For the closure, the answer was fairly obvious: initialize the variables I need before the loop, and thread them through manually as appropriate. The bigger issue was how I would go about enumerating all the properties of an object without calling into my D runtime code, and without traversing the shape tree structure at all, if possible.

I knew that V8 and TraceMonkey had fast for-in loops, so they must have some kind of trick that I hadn’t thought of. My eureka moment came when I realized that for a given object shape (aka hidden class), the properties enumerated are always the same. Aha! There’s an easy solution then: a list of the properties to be enumerated can be generated per shape. To avoid D runtime calls, this list can take the form of a JS array that can be directly accessed by my JS runtime. Furthermore, it can be lazily generated the first time a for-in loop touches a given shape, and cached for future reuse. Clever, and not too difficult to implement.
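Here’s a minimal sketch of the idea, written in D for concreteness. The names (Shape, parent, propName, enumerable, getEnumProps) are stand-ins of my own for illustration, not the actual Higgs definitions:

import std.algorithm.searching : canFind;

// Hypothetical stand-in for a node in the shape tree (not Higgs's real class)
class Shape
{
    Shape parent;       // parent shape in the shape tree
    string propName;    // property this shape node defines
    bool enumerable;    // whether the property shows up in for-in
}

// Per-shape cache of enumerable property names, built lazily
string[][Shape] enumCache;

string[] getEnumProps(Shape shape)
{
    // Reuse the list if a for-in loop already touched this shape
    if (auto cached = shape in enumCache)
        return *cached;

    // Otherwise, walk the shape chain once and record the property names,
    // skipping shadowed duplicates (a real implementation would also need to
    // preserve enumeration order and handle the prototype chain)
    string[] names;
    for (auto s = shape; s !is null; s = s.parent)
        if (s.enumerable && !names.canFind(s.propName))
            names ~= s.propName;

    enumCache[shape] = names;
    return names;
}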

Higgs (old for-in) >500s
Higgs (new for-in) 2.91s
V8 3.29.66 (baseline) 4.98s
V8 3.29.66 (Crankshaft) 4.95s
TraceMonkey 4.12s

After a few iterations of tweaking, I was quite pleased with the performance results I got (see above). Higgs went from having a horribly slow for-in loop to having the fastest of all. Quite possibly, the precached table trick I came up with is something other engines don’t have in their implementation.

The Benchmark Game

The faster for-in loop satisfied my original goal. The fasta benchmark now runs several times faster on Higgs, and the performance we get is no longer abhorrent. However, we still aren’t beating the other engines on that benchmark; we’re about three times slower now. There’s another detail that hinders performance in this specific benchmark: in the body of the for-in loops, properties are being accessed with dynamic keys that are only known at execution time. Higgs currently deals with that by doing a D runtime call. Clearly, again, the other engines have some better way of handling this particular case than I do.

I’ve been working hard for the last two weeks to try and fix every obvious performance issue I could find. Among other things, I implemented ropes to make string concatenation faster, and now I’ve made for-in loops faster. In the end though, each optimization I make is just a drop in the bucket. To compete effectively on every benchmark, there are many individual things to optimize; everything must be made fast. The people at Google and Mozilla have had years to optimize their engines specifically so that they would perform well on the well-known benchmark suites. They’ve gone as far as implementing memoization of sine values based on the knowledge that one benchmark repeatedly calls the sine function with the same angle values.

What I wish conference reviewers would understand is just how high they’re raising the bar by asking little me to compare the performance of Higgs to that of commercial implementations. More than two years of work went into my JIT compiler, and I’m just barely starting to have performance numbers I can be proud of. You’ll notice though that I’m not comparing the performance of Higgs on benchmarks to that of IonMonkey, Crankshaft or TurboFan. I simply don’t think I can match the tens of man-years of (over)engineering that went into them. It would take me several years to get anywhere close, and by then, V8 would be even faster than it is now. I’ve got the fastest for-in loop for the moment, but it takes a lot more than that to have a competitive JavaScript implementation.

Pushing Higgs to the Limit

Over a week ago, I received the unfortunate news that my basic block versioning paper was rejected for the third time. We had received five “weak accept” scores, but at the last minute, after the rebuttal period was over, one of the reviewers decided to change his vote and give us a “reject” rating. I won’t lie, even though I knew the paper could be rejected, I found that I had a hard time getting out of bed the next day. For a little while, my motivation was completely crushed. I decided to put the latest version of the paper on arXiv.org. I feel that it’s useful to timestamp my progress, and it seems that sharing the results of my research is more important than getting into academic conferences.

Fortunately, my motivation is beginning to return, in part because I’ve had some ideas on how to make the next version of the paper even better. Conference reviewers criticized us for not discussing compilation times, and raised the issue that perhaps basic block versioning could drastically increase compilation times. I think it will be easy enough to simply include a graph showing that this is not the case in practice, and avoid receiving this criticism ever again. Another thing that has been brought up multiple times is the request for a comparison of basic block versioning against trace compilation. I personally don’t want to dedicate the next year to building a JavaScript tracing JIT to compare Higgs against, but I figured that maybe I could make a convincing case some other way.

I decided to ask around on Twitter and found the last working version of the TraceMonkey JIT, which was once part of Mozilla’s JavaScript engine. Numerous papers have been published about this JIT compiler, so it’s relatively well known in the academic community. Hence, I figured that if Higgs could compare favorably to TraceMonkey in terms of performance, that may give us more credibility. Performing well against a well-known commercial JIT compiler will hopefully provide the psychological impact needed for conference reviewers to take us more seriously.

[Graph: relative execution time of code compiled by Higgs vs. TraceMonkey]

The graph above shows a comparison of the relative execution time of code compiled by Higgs vs TraceMonkey (lower is better). Compilation times are excluded from this comparison because of issues mentioned in a previous blog post. The 26 benchmarks are from the SunSpider and V8 suites, but I’ve excluded regular expression benchmarks since Higgs doesn’t have a regular expression compiler (only an interpreter), and my research is not specifically about regular expressions.

Higgs clearly doesn’t beat TraceMonkey across the board, but it does perform better on half of the above benchmarks, sometimes by a wide margin. What seems clear after some analysis is that TraceMonkey does very well on benchmarks with short, obvious loops, but poorly on benchmarks with more complex control flow. This should at least highlight the fact that basic block versioning is not the same thing as tracing, and can, in fact, deal with situations for which tracing is less suited.

Fortunately for me, TraceMonkey is not a moving target in terms of performance, and this week, I’ve been working on helping Higgs catch up. I’ve added string optimizations such as ropes and precaching of one-character strings. I’ve also been working on eliminating bottlenecks in the runtime and standard libraries. Tomorrow, I’ll be optimizing the inefficient way in which Higgs handles for-in loops (property enumerations), which are extensively used in the fasta benchmark.

It’s very motivating to have a goal. You can be sure that the next paper I submit will be showing Higgs beating TraceMonkey on most of these benchmarks. I’m going to keep improving the performance, keep improving our results, and keep amassing more and more evidence that basic block versioning is a worthwhile JIT compiler architecture. I’m going to push Higgs to the limit.

Circumventing the D Garbage Collector

I recently wrote about performance issues I ran into with Higgs. Compilation times are currently unacceptably slow, and I tracked the cause down to performance problems in D’s garbage collector. Someone familiar with the GC internals explained that the GC scales poorly, with total collection time increasing quadratically as a function of the amount of memory allocated. Despite this poor GC scaling being an obvious flaw, I received criticism from people telling me that my compiler was simply allocating “too many objects”, that I was doing something wrong, and that I should be implementing my own custom memory allocation strategies because garbage collectors were always going to be slow.

Despite the naysayers, Andrei Alexandrescu, co-creator of D, announced that he would be rewriting D’s GC to fix these performance problems, and his announcement was well received in the D community. Unfortunately for me, this GC rewrite doesn’t seem to be a top priority, and I’m not sure how many months it will take before a new collector replaces the current one. It seems conceivable that it could take over a year, or that it might happen after I finish my PhD studies. In a sense, the naysayers are right: with the current GC situation, I have no choice but to circumvent the GC to get acceptable performance.

I decided to do some research into alternative memory management strategies. I first started by looking at the way it’s done in Google’s V8 JIT. Their strategy is a system of memory “zones”. These appear to be large memory arrays in which a pointer is bumped to allocate memory, which is very fast. They generate a function’s Abstract Syntax Tree (AST) and its Intermediate Representation (IR), allocating all node objects into a zone, and then discard all of these objects at once when they are no longer needed. The cost to deallocate an entire zone at once is constant. This strategy is clever, but it isn’t sufficient for Higgs. I could see myself using something like this for my AST and IR (and maybe I will), but there are other objects in the system which need more fine-grained deallocation.
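Here’s a rough sketch of what such a bump-pointer zone could look like, written as my own simplified D version (not V8’s actual code):

import core.stdc.stdlib : malloc, free;

// A "zone": one large block of memory, allocated out of by bumping a pointer,
// and discarded all at once (simplified sketch, details assumed)
struct Zone
{
    ubyte* block;   // start of the backing memory
    size_t size;    // total capacity in bytes
    size_t used;    // bump pointer offset

    this(size_t size)
    {
        this.block = cast(ubyte*) malloc(size);
        this.size = size;
        this.used = 0;
    }

    // Allocation is just a pointer bump, with no per-object bookkeeping
    void* alloc(size_t n)
    {
        n = (n + 15) & ~cast(size_t) 15;   // keep allocations 16-byte aligned
        assert(used + n <= size, "zone exhausted");
        void* p = block + used;
        used += n;
        return p;
    }

    // Discarding every object in the zone is O(1), no matter how many there are
    void reset() { used = 0; }
    void release() { free(block); block = null; }
}

All the AST and IR nodes for one function could be allocated into a single zone and thrown away together once they are no longer needed.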

Quite possibly, I’ll end up implementing object pools with free lists. This wouldn’t be quite as blazingly fast as zones, but it would result in less memory waste, and it would be usable for all of the often-allocated kinds of objects in Higgs. The folks on the D forum were kind enough to point me to useful resources on how to manually allocate class instances in D.
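For illustration, a minimal version of such a pool might look like this (again a sketch under my own assumptions, not what Higgs actually does):

// A simple object pool with a free list (sketch). Recycled instances are
// chained through their `next` field and reused before allocating new ones.
class Node
{
    Node next;   // doubles as the free-list link while the object is pooled
    int value;
}

struct NodePool
{
    Node freeList;

    Node get(int value)
    {
        Node n;
        if (freeList !is null)
        {
            n = freeList;           // pop a recycled instance off the free list
            freeList = n.next;
        }
        else
        {
            // Fresh objects still come from `new` here; a fully manual pool
            // would allocate and construct instances itself, as described in
            // the resources the D forum folks pointed me to
            n = new Node;
        }
        n.next = null;
        n.value = value;
        return n;
    }

    // Return an object to the pool instead of leaving it for the GC
    void put(Node n)
    {
        n.next = freeList;
        freeList = n;
    }
}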

Out of curiosity, before seriously considering implementing an object pool, I decided to try and benchmark the D garbage collector. I knew there was a performance issue, but I hadn’t precisely quantified it. I was curious to see how much time would be needed to allocate a few million objects, and so I wrote the following test program:

[Code listing: membench test program]
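In rough outline, the program does something like the following (a reconstruction of the general idea rather than the exact listing; the details are assumed): allocate millions of small node objects through the GC, keep them reachable, and time how long that takes.

import std.stdio : writefln;
import std.conv : to;
import std.datetime.stopwatch : StopWatch, AutoStart;

class Node
{
    Node next;
    size_t value;
}

void main(string[] args)
{
    // Number of node objects to allocate (e.g. 8 million)
    size_t numNodes = (args.length > 1) ? args[1].to!size_t : 8_000_000;

    auto sw = StopWatch(AutoStart.yes);

    Node head = null;
    foreach (i; 0 .. numNodes)
    {
        auto n = new Node;   // every allocation goes through the D GC
        n.value = i;
        n.next = head;       // keep all nodes reachable so none are collected
        head = n;
    }

    writefln("allocated %s nodes in %s ms", numNodes, sw.peek.total!"msecs");
}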

The results, timed on my Core i7 3.4GHz with 16GB of RAM, were as follows:

[Graph: GC time as a function of the number of allocations]

The curve above is not perfectly quadratic, but if anyone doubted that the D GC has a performance problem, I think that this should be proof enough, with 8 million node allocations taking over a minute on a very recent CPU. In comparison, Higgs itself is able to execute a similar JavaScript benchmark with 10 million objects allocated and initialized in less than 10 seconds.

Audiobooks are Awesome

It’s a sad admission to make, but I’ve always found reading books tedious and uncomfortable. Paper books cause me eye strain and I can never seem to find a comfortable position to hold them in. Computer screens aren’t much better: I have a hard time reading multiple pages in one sitting, and reading a book one page at a time is rather ineffective. I’m not sure if I have some kind of low-grade attention deficit, or if it’s a side-effect of programming and using the internet for so many years, but I prefer for written information to come in bite-sized pieces.

I’ve always loved science fiction, but until recently, my only exposure to it came from movies and TV shows. This changed recently after I acquired a smartphone (my beloved Nexus 4). I found out that there were several free audiobook playback apps in the Google Play store and that many of the greatest science fiction classics were available in audiobook form. I decided to give this a try and listened to Neuromancer, the ultimate cyberpunk classic. I was instantly hooked.

Since discovering audiobooks just a few months ago, I’ve “read”:

  • Neuromancer
  • Burning Chrome
  • Dune
  • Dune Messiah
  • Children of Dune
  • Foundation
  • The Positronic Man
  • The Last Theorem
  • I, Robot
  • Buddha’s Brain
  • Robot Dreams
  • Hyperion

Listening to audiobooks is a relaxing experience, a way to take my mind off of things while having a story narrated to me, a way to take a break. What’s even better is that this entertainment doesn’t take time away from productive activities at all. I can listen to audiobooks while I’m walking somewhere, on the subway or on the bus, while I’m cooking alone, running errands, in a waiting room or even while brushing my teeth. I just have to whip out my phone and earbuds and press play.

The audiobook app I use (Smart Audiobook Player) is nice enough to automatically stop playback if the earbuds get unplugged, and to go back some number of seconds when playback is resumed. This means that if I run into someone I want to talk to, I can simply unplug the earbuds and stop playback without missing anything. The earbuds I bought ($10 Panasonics) provide good enough sound isolation to use on the subway or in other loud environments.

I implied earlier that I have a hard time staying focused on paper books. I’m one of those people who will be reading a book and eventually realize that I’ve been rereading the same sentence for the fifth time. You might think that staying focused on audiobooks is just as hard, but I find it a little easier, and I’ve been getting better at it with practice.

In a way, listening to audiobooks can be seen as a form of mindfulness practice (a form of meditation). If I let my mind run away in thoughts, I lose track of the story and I might want to rewind a bit to try and pay more attention, but I find myself needing to do this less and less. What maybe makes this less tedious than with paper books is that there seems to be less effort involved in listening to something than in reading.

I’m currently listening to The Fall of Hyperion and enjoying it very much so far. I plan to listen to many more audiobooks. It’s come to the point where when I’m between books, I miss having something to listen to. Fortunately, there are many great science-fiction classics out there which have been highly acclaimed and are available in audiobook format.

If you’re a computer geek and you’re looking for something that doesn’t take itself too seriously, I would highly recommend collections of short stories by Isaac Asimov, such as I, Robot. I found them amusing and relaxing to listen to. For rich and complex fictional universes, you might want to check out Dune, Neuromancer or Hyperion. Scott Brick, narrator of the Dune audiobook I listened to, is an excellent voice actor who even makes convincing female voices. His Wikipedia page lists several other books he has narrated.

Faster than V8 *


There’s an asterisk next to the title because in most cases, Higgs doesn’t beat the performance of Google’s V8, quite the opposite. I’ve mentioned in my post about D’s garbage collector performance that Higgs suffers from very slow compilation times. Higgs also lacks the sophisticated inlining capabilities of V8, along with a plethora of micro-optimizations needed to perform well on widely-used benchmarks. Hence, I was both surprised and pleased when I found out that Higgs performs better than V8 on a microbenchmark.

The microbenchmark in question is a simple loop incrementing a global counter variable from 0 to one billion:

// Loop to 1B
// Note: i is a global variable
for (i = 0; i < 1000000000; ++i)
{
}

On my Core i7 laptop running 64-bit linux, I get the following performance results (also replicated on an i7 desktop):

time d8 loop-global-incr.js

real 0m2.686s
user 0m2.672s
sys 0m0.008s

time ./higgs loop-global-incr.js

real 0m2.415s
user 0m2.368s
sys 0m0.040s

This might seem quite pointless as it’s a trivial piece of code, not at all representative of the complexities of a real program. As I said earlier, on real benchmarks, V8 usually wins by a wide margin. Still, I do think that this is interesting. Why? Because in this simple instance, I’m able to generate better quality machine code. Despite its complex multi-tiered architecture, type-feedback and type analysis capabilities, V8 actually does a fairly poor job of optimizing this loop, whereas the machine code generated by Higgs is fairly tight.

Optimizing this simple loop is actually nontrivial. In JavaScript, global variables are properties of the global object. As such, these properties could turn out to be accessors (getters and setter methods). Global properties can also be deleted, in which case reading them should throw an exception. There is also the issue that the addition and less-than operators can operate on any type without throwing an error. Hence, to generate effective machine code, you have to prove that the global variable “i” is not an accessor, that it won’t be deleted and that it will remain an integer. If you can’t prove those facts, then the machine code you generate will contain redundant type checks.

The optimized machine code generated by V8 is quite bulky and cryptic:

— Optimized code —
optimization_id = 0
source_position = 0
kind = OPTIMIZED_FUNCTION
stack_slots = 2
Instructions (size = 346)
0xc579e565ae0 0 55 push rbp
0xc579e565ae1 1 4889e5 REX.W movq rbp,rsp
0xc579e565ae4 4 56 push rsi
0xc579e565ae5 5 57 push rdi
0xc579e565ae6 6 4883ec10 REX.W subq rsp,0x10
0xc579e565aea 10 488b45f8 REX.W movq rax,[rbp-0x8]
0xc579e565aee 14 488945e0 REX.W movq [rbp-0x20],rax
0xc579e565af2 18 488bf0 REX.W movq rsi,rax
0xc579e565af5 21 493ba540070000 REX.W cmpq rsp,[r13+0x740]
0xc579e565afc 28 7305 jnc 35 (0xc579e565b03)
0xc579e565afe 30 e8bdd5fcff call StackCheck (0xc579e5330c0)
0xc579e565b03 35 33c0 xorl rax,rax
0xc579e565b05 37 48bb681c51f000110000 REX.W movq rbx,0x1100f0511c68 ;; property cell
0xc579e565b0f 47 4d8b55b0 REX.W movq r10,[r13-0x50]
0xc579e565b13 51 4c3913 REX.W cmpq [rbx],r10
0xc579e565b16 54 0f84f6000000 jz 306 (0xc579e565c12)
0xc579e565b1c 60 488903 REX.W movq [rbx],rax
0xc579e565b1f 63 e911000000 jmp 85 (0xc579e565b35)
0xc579e565b24 68 4883ec08 REX.W subq rsp,0x8
0xc579e565b28 72 488b45f8 REX.W movq rax,[rbp-0x8]
0xc579e565b2c 76 488b5d10 REX.W movq rbx,[rbp+0x10]
0xc579e565b30 80 e908000000 jmp 93 (0xc579e565b3d)
0xc579e565b35 85 488b5d10 REX.W movq rbx,[rbp+0x10]
0xc579e565b39 89 488b45e0 REX.W movq rax,[rbp-0x20]
0xc579e565b3d 93 48ba681c51f000110000 REX.W movq rdx,0x1100f0511c68 ;; property cell ptr
0xc579e565b47 103 488b12 REX.W movq rdx,[rdx] ;; read counter from global
0xc579e565b4a 106 f6c201 testb rdx,0x1 ;; type tag test
0xc579e565b4d 109 0f8553000000 jnz 198 (0xc579e565ba6)
0xc579e565b53 115 48c1ea20 REX.W shrq rdx,32 ;; unboxing counter
0xc579e565b57 119 81fa00ca9a3b cmpl rdx,0x3b9aca00
0xc579e565b5d 125 0f8d32000000 jge 181 (0xc579e565b95)
0xc579e565b63 131 493ba540070000 REX.W cmpq rsp,[r13+0x740] ;; ???
0xc579e565b6a 138 0f8261000000 jc 241 (0xc579e565bd1); jump on carry
0xc579e565b70 144 83c201 addl rdx,0x1 ;; counter increment
0xc579e565b73 147 8bca movl rcx,rdx ;; redundant move
0xc579e565b75 149 48c1e120 REX.W shlq rcx,32 ;; boxing counter
0xc579e565b79 153 48ba681c51f000110000 REX.W movq rdx,0x1100f0511c68 ;; property cell ptr
0xc579e565b83 163 4d8b55b0 REX.W movq r10,[r13-0x50] ;; ???
0xc579e565b87 167 4c3912 REX.W cmpq [rdx],r10 ;; ???
0xc579e565b8a 170 0f8487000000 jz 311 (0xc579e565c17) ;; ???
0xc579e565b90 176 48890a REX.W movq [rdx],rcx ;; write counter to global
0xc579e565b93 179 eba8 jmp 93 (0xc579e565b3d) ;; jump to loop header
0xc579e565b95 181 48b82141d0c48b060000 REX.W movq rax,0x68bc4d04121
0xc579e565b9f 191 488be5 REX.W movq rsp,rbp
0xc579e565ba2 194 5d pop rbp
0xc579e565ba3 195 c20800 ret 0x8
0xc579e565ba6 198 4d8b5500 REX.W movq r10,[r13+0x0]
0xc579e565baa 202 4c3952ff REX.W cmpq [rdx-0x1],r10
0xc579e565bae 206 751a jnz 234 (0xc579e565bca)
0xc579e565bb0 208 f20f104207 movsd xmm0,[rdx+0x7]
0xc579e565bb5 213 f20f2cd0 cvttsd2sil rdx,xmm0
0xc579e565bb9 217 0f57c9 xorps xmm1,xmm1
0xc579e565bbc 220 f20f2aca cvtsi2sd xmm1,rdx
0xc579e565bc0 224 660f2ec1 ucomisd xmm0,xmm1
0xc579e565bc4 228 7504 jnz 234 (0xc579e565bca)
0xc579e565bc6 230 7a02 jpe 234 (0xc579e565bca)
0xc579e565bc8 232 eb8d jmp 119 (0xc579e565b57)
0xc579e565bca 234 e86304caff call 0xc579e206032 ;; deoptimization bailout
0xc579e565bcf 239 eb86 jmp 119 (0xc579e565b57)
0xc579e565bd1 241 50 push rax
0xc579e565bd2 242 51 push rcx
0xc579e565bd3 243 52 push rdx
0xc579e565bd4 244 53 push rbx
0xc579e565bd5 245 56 push rsi
0xc579e565bd6 246 57 push rdi
0xc579e565bd7 247 4150 push r8
0xc579e565bd9 249 4151 push r9
0xc579e565bdb 251 4153 push r11
0xc579e565bdd 253 4156 push r14
0xc579e565bdf 255 4157 push r15
0xc579e565be1 257 488d6424d8 REX.W leaq rsp,[rsp-0x28]
0xc579e565be6 262 488b75f8 REX.W movq rsi,[rbp-0x8]
0xc579e565bea 266 33c0 xorl rax,rax
0xc579e565bec 268 498d9d451eb1fe REX.W leaq rbx,[r13-0x14ee1bb]
0xc579e565bf3 275 e82806faff call 0xc579e506220 ;; CEntryStub
0xc579e565bf8 280 488d642428 REX.W leaq rsp,[rsp+0x28]
0xc579e565bfd 285 415f pop r15
0xc579e565bff 287 415e pop r14
0xc579e565c01 289 415b pop r11
0xc579e565c03 291 4159 pop r9
0xc579e565c05 293 4158 pop r8
0xc579e565c07 295 5f pop rdi
0xc579e565c08 296 5e pop rsi
0xc579e565c09 297 5b pop rbx
0xc579e565c0a 298 5a pop rdx
0xc579e565c0b 299 59 pop rcx
0xc579e565c0c 300 58 pop rax
0xc579e565c0d 301 e95effffff jmp 144 (0xc579e565b70)
0xc579e565c12 306 e8f303caff call 0xc579e20600a ;; deoptimization bailout
0xc579e565c17 311 e80c04caff call 0xc579e206028 ;; deoptimization bailout

At the start, there’s a long sequence of instructions which seems to be a function entry prelude. This is not part of the loop body. There are many instructions in this machine code which are never executed. You can see, for example, that there is floating-point code. That code was generated in case an overflow occurs, but this can’t happen here. There are also deoptimization bailouts (calls into the JIT compiler) which are never needed. This code is unnecessary, but won’t affect performance. The instructions I’ve annotated inside the loop body are the ones which do affect performance but could potentially be eliminated. Essentially, in each loop iteration, the counter variable is unboxed (using a shift instruction), its type tag is tested and then it is boxed again.

In contrast, Higgs, using basic block versioning, is able to eliminate the type test from the loop body. Higgs also uses an unusual scheme where type tags are stored separately from value words. Because of this, Higgs does not need to box and unbox object property values. Since the type of the counter variable is known to always be an integer in this case, type tags don’t need to be touched at all inside the loop body. Furthermore, in Higgs, code is lazily generated, so there is no floating-point or bailout code. There are no long sequences of instructions that are never executed.
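To make that more concrete, here is my rough mental model of a split tag/word layout, sketched in D (the names and details are mine, not the actual Higgs definitions):

// Sketch of a representation where type tags live out-of-band from value words
enum Tag : ubyte { INT32, FLOAT64, REFPTR, CONST }

struct ValueSlots
{
    ulong[] words;   // payload words: raw integers, double bits, pointers...
    Tag[]   tags;    // one tag per slot, stored separately

    // When versioning has already proven the type (e.g. int32), reads and
    // writes can touch only the word, with no boxing or unboxing, and the
    // tag check or tag write can be skipped entirely
    void setInt(size_t idx, int v)
    {
        words[idx] = cast(uint) v;   // store the 32-bit payload
        tags[idx]  = Tag.INT32;
    }

    int getInt(size_t idx)
    {
        assert(tags[idx] == Tag.INT32);
        return cast(int) words[idx];
    }
}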

The machine code produced by Higgs for the loop body is as follows:

for_test(19337):
; $3 = shape_get_def $1, “i” => capture_cont(19386)
mov rdx, 140451224576896; rdx = shape pointer, unused

capture_cont(19386):
; $8 = shape_get_prop $1, $3
mov r10, [qword rsi + 2616];

; $0 = lt_i32 $6, 1000000000
cmp r10d, 1000000000;
jge branch_for_exit(1933A);
nop5;

for_body(19338):
; $4 = shape_get_def $1, “i” => capture_cont(193A5)
mov r8, 140451224576896; r8 = shape pointer, unused

; $11 = shape_get_prop $1, $4
mov rdi, [qword rsi + 2616];

; $17 = add_i32_ovf $9, 1 => call_merge(193FF), if_false(19410)
add edi, 1;
mov eax, 9795; eax = block idx, for eventual bailout
jo branch_if_false(19410); overflow test

; $14 = shape_get_def $1, “i” => capture_cont(19429)
mov r9, 140451224576896; r9 = shape pointer, unused

; shape_set_prop $1, “i”, $13
mov [qword rsi + 2616], rdi;

; jump => for_test(19337)
jmp for_test(19337);

The machine code above is far from optimal. It contains redundant moves and overflow checks which are not yet optimized away. I’m also fully aware that V8 is a moving target in terms of performance: the V8 team is already working on a better optimizing compiler, and they will no doubt improve upon the current situation. Still, it’s nice to see that all the work I’ve put into basic block versioning is paying off. In the general case, Higgs doesn’t beat V8, and things are likely to remain that way, but in this specific instance, Higgs is able to generate better quality machine code than a commercial compiler which has tens of man-years of engineering behind it. It’s a small victory, but I find it very encouraging nevertheless.

The tests I performed were done with V8 version 3.29.66 (current SVN trunk revision), and the current version of my development branch. If someone has a recent version of the SpiderMonkey shell built, I’d be curious to see the generated code and performance results for this microbenchmark. I wouldn’t be surprised if SpiderMonkey did better than V8 and Higgs, considering it sports a powerful hybrid type inference engine.

Making Waves: D’s GC Problem Followup

Front Page News

[Screenshot: the post at the top of reddit’s r/programming]

Two days ago, late in the evening, I wrote a blog post about performance issues I’d attributed to D’s garbage collector. I didn’t expect this blog post to draw much attention. I was hoping that some of the people working on the D runtime library might see it, but I didn’t publicize it any further than tweeting the link. My goal was only to draw some amount of attention to the problem; to provide some numbers so I could quantify the GC overhead a bit.

Someone decided the post was interesting enough to submit it to reddit’s r/programming as well as Hacker News. It’s also been discussed on the D forums. In total, about 18000 people ended up viewing it, and over 160 comments were posted. My main conclusion is that controversial topics are a good way to draw attention. I’ve succeeded in my goal of drawing attention to the problem, on a much bigger scale than I’d intended: now, everyone knows about it. I sincerely apologize for the bad publicity I may have caused D, but I’m glad that the problem has been recognized and that something will be done about it.

Andrei Alexandrescu, co-creator of D, posted on reddit to say that he wants to rewrite the GC based on his own allocator design. Dicebot of Sociomantic has also said that they are working on porting their concurrent garbage collector to the current version of D. I don’t know how these two efforts will combine, but I’m sure that something good will come of it. A high performance language needs a high performance runtime, and with a brand new GC, the D programming language will be much more competitive.

Insightful Details

People with knowledge of the D GC internals ended up posting on the reddit thread and providing insight into why the D garbage collector is currently inefficient.

thedeemon wrote the following:

The main issue from the post is quite simple: if you allocate a lot of small objects, D runtime will place them inside chunks (called pools) of fixed size (like 4 MB) that it requested from the OS. Each time a pool is filled up it will do a GC cycle and only if there’s still no room for new object it will allocate a new pool. That means number of GC cycles in this situation is proportional to number of objects you allocate, while each GC cycle time is proportional to the managed heap size. It leads to quadratic complexity and awful slowness. If runtime allocated pools not of fixed size but adapted their size to the allocation rate, this scenario could be significantly faster.

deadalnix added this:

There are many other problems. Amongst others:
– D’s GC rely on a global lock in which it goes through every time.
– D’s GC scan many location even when they cannot contain pointers, which do not make any sense.
– D’s GC do not take advantage of D’s type system, when other experiment like OCaml’s GC have shown that there is great benefice going down that road.

To summarize the issues outlined, the D GC:

  1. Has potentially quadratic complexity with respect to the number of allocated objects.
  2. Uses a global lock while traversing objects.
  3. Can only use one thread to perform collections.
  4. Does not take advantage of the D type system, unnecessarily scans non-pointer data.
  5. Does not use lazy sweeping, a common mark & sweep GC optimization.

Fortunately, the worst issue (quadratic growth) can be fixed without much effort.
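To put rough numbers on that quadratic growth, here is a back-of-envelope calculation using the simplified model from thedeemon’s explanation above (one full collection each time a fixed-size pool fills up, each collection scanning everything allocated so far). The pool size and heap sizes are just assumed round numbers:

import std.stdio : writefln;

void main()
{
    enum poolSizeMB = 4;   // fixed pool size, as described in the quote (simplified)

    foreach (heapMB; [64, 128, 256, 512])
    {
        auto pools = heapMB / poolSizeMB;
        // Collections scan 1, then 2, ... then `pools` pools: a quadratic sum
        auto totalPoolScans = pools * (pools + 1) / 2;
        writefln("heap %4d MB -> ~%6d pool-scans of GC work", heapMB, totalPoolScans);
    }
}

Doubling the amount allocated roughly quadruples the total scanning work under this model, which matches the pathological scaling discussed below.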

For those who are curious, or may want to help, here is a link to the source code of the D GC. If you’re interested in contributing and helping to fix this problem, you should head to the D forums and get in touch with the community, or try contacting Andrei directly. Even if you don’t have time to contribute, your feedback may be useful in the design of a better GC.

Dismissive Criticism

Unfortunately, not everyone was welcoming of the discussion I brought forward. Criticism mostly seemed to come from developers who are fundamentally opposed to the use of garbage collection, and view the whole concept in a very negative light. These people put forth the same dismissive and uninformed line of reasoning, which can be summarized as follows:

  1. Garbage collectors are inherently slow.
  2. Since Higgs suffers from GC overhead, it must be doing “too many” allocations.
  3. Because of #1, it’s pointless to improve the D GC. It will always be slow.
  4. It’s the responsibility of the application writer to avoid using the GC.

I’ll say this: one of the main reasons I chose D over C++ was that D has a garbage collector. I had written a JIT compiler in C++ previously and I found it tedious to manage memory allocations manually. Compilers are inherently allocation-heavy, because they must represent and transform symbolic expression trees, graphs and a large amount of metadata. Having to take memory allocation and release into account every time you wish to transform such a tree or graph is highly error-prone.

To avoid using the GC in Higgs, I’d basically have to implement my own specialized memory managers (my own allocation and garbage collection) for each of the data structures used in the compiler. Saying that I shouldn’t use the GC is basically denying any responsibility the D GC might have, and shifting the burden onto me, the application developer, to implement my own specialized memory management for the many use cases where I need dynamic memory allocation.

In my previous blog post, I’ve demonstrated two things:

  1. In a normal execution of Higgs, the GC may take more than 75% of the total execution time.
  2. Simple tweaking of GC parameters can reduce the GC overhead tremendously, producing a 3x speedup.

What’s clear is that the D garbage collector scales very poorly. In fact, I’d say that the scaling is often pathological. Twice the allocations may mean the total collection time quadruples: a quadratic growth. This means that there’s a ceiling on the number of objects the GC can be expected to reasonably manage. There’s a ceiling on the scale of applications using the GC. If you go beyond this ceiling, your application will slow down to a crawl. You want to write a multithreaded web server in D? You’re probably going to have to manage memory manually. Java doesn’t suffer from this weakness, so why should D?

Higgs is a program written in what I believe to be a reasonably efficient style. There aren’t many layers of abstraction in the implementation. It doesn’t create what I would deem to be an unreasonable number of allocations for what it does. I’m sure that if I put forth the engineering effort, I could reduce my reliance on the GC and improve performance. In fact, I already have. However, improving the D GC is something that will clearly benefit all users of the language. A future update to the D runtime might improve the performance of some D programs by an order of magnitude. Working towards fixing poor GC performance is much more helpful than blaming those who are affected by it.
