The Fastest For-In Loops in the Whole Wide Web

November 14th, 2014

This Monday, I met with Marc Feeley, my PhD advisor. I showed him graphs comparing the performance of Higgs to that of the TraceMonkey compiler. Reviewers at conferences we've previously submitted papers to have been very skeptical, and pressed us multiple times to produce some kind of comparison of basic block versioning against tracing. We believe that faring well against TraceMonkey could increase our odds of getting our work published.

Speedup vs V8 baseline and TraceMonkey on a logarithmic scale (higher is better)

So far, we're beating TraceMonkey on about half of our benchmark set. There are, however, cases where we're not doing well at all. The above graph compares speedups produced by Higgs relative to V8 baseline and TraceMonkey. The scale is logarithmic and higher bars are better. If you look at the last bars at the far right, you'll notice that they're actually negative. That's because our performance on that benchmark is about 100 times slower than that of TraceMonkey (ouch). The scale of this graph could be fixed so that the bars no longer show up as negative, but Marc rightly pointed out that reviewers may latch on instances like this. Does this benchmark exhibit some pathological case that basic block versioning can't handle?

What Went Wrong?

The poorly-performing benchmark is string-fasta, from the SunSpider suite. A little bit of digging revealed what was going wrong. The benchmark uses the JavaScript for-in loop to iterate through object keys. The for-in loop uses reflection to iterate through the property names of an object and all of its parents. The semantics of this simple-looking construct are annoyingly complex. It deals with arrays and strings in addition to objects. If a property is deleted during iteration and hasn't been listed yet, it must never be listed. Properties of child objects must "shadow" parent properties with the same name so that duplicate property names won't show up twice.

The reason my for-in loop was slow is that I never bothered to optimize it. With such complex semantics, it seemed easier to take the "comfortable" approach. I implemented the loop with a closure that gets generated when iteration begins. This closure is a generator that returns a different property name on each call. It hides all of the iteration state and complex logic from the outside. What also added to the slowness is that this closure made multiple calls to the D part of my runtime to traverse object shapes. I wrote a microbenchmark to measure the speed of the for-in loop itself and sure enough, it was over 100 times slower than that of V8 and TraceMonkey.

Making it Fast

My advisor convinced me that I should address this performance problem, and to do that, I had to make for-in loops faster. It was obvious that I should probably get rid of the closure created on each traversal and eliminate the D runtime calls. I had to remove the slow things, but what would I replace them with? For the closure, it was fairly obvious: initialize the variables I need before the loop, and thread them through manually as appropriate. The bigger issue was how I would go about enumerating all the properties in an object without calling into my D runtime code, and without traversing the shape tree structure at all, if possible.

I knew that V8 and TraceMonkey had fast for-in loops, so surely, they must have found a way to make them fast. They must have some kind of trick that I hadn't thought of. My eureka moment came when I realized that for a given object shape (aka hidden class), the properties enumerated are always the same. Aha! There's an easy solution then. A list of the properties to be enumerated over can be generated per-shape. To avoid D runtime calls, this list can take the form of a JS array that can be directly accessed by my JS runtime. Furthermore, it can be lazily generated the first time a for-in loop touches a given shape, and cached for future reuse. Clever, and not too difficult to implement.

Higgs (old for-in)	>500s
Higgs (new for-in)	2.91s
V8 3.29.66 (baseline)	4.98s
V8 3.29.66 (Crankshaft)	4.95s
TraceMonkey	4.12s

After a few iterations of tweaking, I was quite pleased with the performance results I got (see above). Higgs went from having a horribly slow for-in loop to having the fastest of all. Quite possibly, the precached table trick I came up with is something other engines don't have in their implementation.

The Benchmark Game

The faster for-in loop satisfied my original goal. The fasta benchmark now runs several times faster on Higgs. The performance we get is no longer abhorrent. However, we still aren't beating the other engines on that benchmark, we're about 3 times slower now. There's another detail that hinders performance in this specific benchmark. In the body of the for-in loops, properties are being accessed with dynamic keys that are only known at execution time. Higgs currently deals with that by doing a D runtime call. Clearly, again, the other engines have some better way of handling this particular case than I do.

I've been working hard in the last two weeks to try and fix every obvious performance issue I could find. Among other things, I implemented ropes to make string concatenation faster. Now, I made for-in loops faster. In the end though, each optimization I make is just a drop in the bucket. To compete effectively on every benchmark, there are many individual things to optimize. Everything must be made fast. The people at Google and Mozilla have had years to optimize their engines specifically so that they would perform well on the well-known sets of benchmarks. They've gone as far as implementing memoization of sine values based on the knowledge that this one benchmark calls the sine functions with the same angle values repeatedly.

What I wish conference reviewers would understand is just how high they're raising the bar by asking little me to compare the performance of Higgs to that of commercial implementations. More than two years of work went into my JIT compiler, and I'm just barely starting to have performance numbers I can be proud of. You'll notice though that I'm not comparing the performance of Higgs on benchmarks to that of IonMonkey, Crankshaft or TurboFan. I simply don't think I can match the tens of man-years of (over)engineering and that went into it. It would take me several years to get anywhere close, and by then, V8 would be even faster than it is now. I've got the fastest for-in loop for the moment, but it takes a lot more than that to have a competitive JavaScript implementation.

Enjoyed this post? Share it with your online community!
Share to Hacker News