The HyperText Markup Language (HTML) was invented by Physicist Tim Berners-Lee of CERN, and its first publicly available specification surfaced on the internet in 1991. It was meant intended as a simple format for people to share enriched static documents or pages that could refer to one another through hyperlinks. The original design was relatively simple, with just 18 types of elements. As of today, the HTML specification in PDF format is 1156 pages long.
As of today, Chromium is about 17 million lines of code, comments excluded. The sheer complexity of HTML+JS+CSS is enormous. This is a problem because it means there are many areas where implementations can differ. What I mean is that your webpage is increasingly likely to break or not behave the same on different web browsers. The fact that HTML is a fast-moving target and a “living standard” certainly won’t help. What recently dawned on me is that HTML too complex for any individual or corporation to ever implement a new web browser from scratch.
A few years ago, I started playing with new HTML APIs to produce audio output. I was one of the first play with Mozilla’s Audio Data API. This API allowed you to generate raw audio samples and send them to an audio output, among other things. Unfortunately, Chrome never implemented this API. They opted instead to create the competing Web Audio API. Since Web Audio has the backing of the W3C, it became obvious it would win out, and the Web Data API would die. It took about 3 years for Firefox to finally implement Web Audio. I’m still not sure if Safari and IE support it properly.
The victory of Web Audio over Audio Data frustrated me a bit, because I really thought the Mozilla’s Audio Data API had the right approach: it was simple and low-level. You generated samples and wrote them out for output. The Web Audio API, in comparison, is essentially a modular synthesizer, with many different types of audio processing nodes you can interconnect together. It’s way more complex than the Audio Data API. The Web Audio API is most likely implemented in C++ and interfaced with JS through several dozen types of HTML DOM nodes.
The Web Audio API is stupid for two simple reasons. The first is that you really could implement your own modular synthesizer in JS (please refrain from telling me that JS is too slow for that as I’ve already implemented one). Letting people implement their own modular synthesizer as part of a library is more flexible and more portable. The second reason Web Audio is stupid is that implementing many kinds of pre-made synthesizer nodes in C++ and interfacing them through the DOM almost guarantees that Web Audio will not sound the same on different web browsers.
In my opinion, the problem with HTML is the growing complexity and feature redundancy. The web is moving very fast, too fast maybe, probably in part because it’s a hugely competitive market. Everyone wants HTML to support the latest and coolest feature their need in the way they like best. Design by committee ensures that bloat accumulates. I think that one day, HTML is bound to crumble upon its own weight, and something more minimalistic will take its place. Something designed from the ground up for web applications instead of static webpages. The app stores used on cellphones may be a hint of things to come.
Intelligence is a hard thing to define. The running joke in AI circles that every time a computer manages to do something we previously thought required human-level intelligence, we raise the bar and define this something to be what intelligence is not. A good example of this is the game of chess. Not long ago, many thought that computers could never equal chess grandmasters. They believed being a chess grandmaster required a powerful intuition, a kind of magic that only humans could ever possess. In 1997, the efforts of the Deep Blue team at IBM demonstrated that this belief was quite mistaken.
Today, chess-playing computers are usually not thought of as intelligent machines. We tend to believe intelligence is something only humans are capable of. We see human intelligence as fundamentally different from what animals and computers do. Animals may not be able to build computers or write poems, but a quick examination of the world around us reveals that many animals have emotions, complex social dynamics, puzzle-solving abilities, the capability to fashion and use tools as well as some degree of self-awareness.
I don’t view intelligence as exclusively human. Fundamentally, I believe that intelligence is the capability of a system (biological or computerized) to dynamically adapt to its environment. One thing that greatly enhances the survivability of lifeforms is to be able to dynamically adapt their behavior in response to changing environmental conditions. Humans are an extreme example of this: our brains and nervous systems allow us see with our eyes and interpret images, to move very rapidly, to understand our environment and devise elaborate plans going far into the future.
I believe that intelligence exists on a spectrum, or in varying degrees. Cats have a fairly powerful understanding of their environment too. They understand their own movements very well. They can predict the behavior of physical objects and their own body as a part of the physical world. Bees can see flowers, avoid obstacles and navigate based on light polarization patterns in the sky. Plants change the direction of their growth so they can more efficiently capture sunlight. Jellyfish have no brains, but they have a network of nerves which controls their swimming according to factors such as the day/night cycle.
At an even lower level, paramecium are unicellular organisms which can move around water using ciliate. Being just one cell, they have no neurons. Nevertheless, they can sense chemicals in the water surrounding them, detect obstacles, eat smaller prey organisms, reproduce sexually and have even shown some ability to learn. What is obvious is that even though paramecium have no brains or neurons, they contain chemical and genetic machinery that implements some kind of algorithmic process which changes their behavior based on environmental conditions.
You might be surprised to find out that some viruses are able to navigate inside host cells and express different genes (change their behavior) based on where they are inside these host cells. Viruses are thought of as non-lifeforms by many scientists, but they are definitely able to adapt. In fact, without even getting into dynamic changes in the expression of viral DNA, one could say that the fast mutation of viruses constitutes a form of adaptation in itself. Viruses reproduce so fast and in such large numbers, that they are able to do a brute-force search for new attack vectors on host organisms.
At the root of it all, natural selection is an optimization process that helps lifeforms adapt to their environment. One could argue that natural selection, a force of nature, is a kind of intelligence in and of itself. There is no intelligent design, but nature finds a way to make things more efficient through trial and error. I would argue that in a way, natural selection is a process that has been optimizing itself to help life spread and adapt more effectively. In the beginning, there were only unicellular organisms, and they reproduced through mutation only. Mutations are a very slow and inefficient way to adapt. Then came sexual reproduction. This allowed genes from multiple viable individuals to be recombined, increasing rate at which viable genetic combinations can be explored, thus increasing adaptability.
Multicellular organisms arose. This gave rise to inter-cellular communication and nervous systems which allowed rapid movements. Nervous systems gave rise to centralized brains which could perform computations of ever-increasing complexity. Eventually, organisms came which had such complete models of their environment that they understood their own existence and wondered about the origin of life itself. Today, humans are intelligent enough that they are beginning to understand their own DNA. Soon, humans will be using their intelligence to make conscious choices about their own evolutionary destiny. This will allow us, as a species, to make evolutionary leaps that no other species ever has.
Over a month ago, I received reviews for the basic block versioning paper I submitted to ECOOP 2015. The reviews were again mixed, ranging from great to terrible. After 3 previous rejections at other conferences, I’d lowered my expectations and thought we were most likely in for a fourth refusal. To be honest, I was somewhat angry. The worst review we had received was unfair and harsh. Whoever had written it seemed to have a fairly poor understanding of JIT compilation, but that didn’t stop them from aggressively criticizing our work.
Fortunately for me, ECOOP offers a chance for submitters to respond to peer reviews, and I was able to channel my anger into motivation to write a very strong and detailed rebuttal. I still didn’t think our paper would make it, but at the very least, I thought, I gave it my all. I was pleasantly surprised when I learned earlier this month that the paper was accepted! Two reviewers increased the score they gave us, tipping the balance in our favor.
This is an important turning point for me, because it means the key contribution of my research work (the idea of basic block versioning) is getting published. I’m finally getting some official recognition from my academic peers for the past four years of hard work I poured into my research. Getting at least one publication from my PhD work was a personal goal I’d set for myself.
Now that this goal is reached, this paper offers me something to build upon. Publishing about additional contributions from my PhD work should be a fair bit easier now that I have my foot in the academic door. I’m already working on a new paper about typed object shapes, and I’ll be able to cite the ECOOP paper. I’ll be able to present further contributions as improvements over the work I already published, instead of trying to explain them in isolation.
And hey, I get to go give a talk in Prague! Sometimes I think being an academic isn’t all that bad.
In December, I submitted a paper about my work on basic block versioning to ECOOP 2015. Basic block versioning (BBV for short) is a technique to specialize machine code in order to eliminate type checks. It works by generating type-specialized versions of basic blocks on the fly based on the types observed during the execution of a program. It’s a way to type-specialize code without doing type analysis per-se. The technique, as described in the ECOOP paper, works at the intraprocedural level (inside function bodies) and only deals with primitive types (i.e. integer, string, object, boolean). My goal with this paper was to explain basic block versioning in its simplest possible form and avoid complicating the explanation by having to delve into the details of various enhancements to the technique.
I still don’t know if the paper will be accepted or not, but in the meantime, my research has been advancing. I extended BBV to perform overflow check elimination on loop increments, which turned out to be a relatively trivial extension to make. I also implemented what I refer to as typed shapes, which is a mechanism to version code based on object shapes and property types that I feel meshes quite well with BBV. Most recently, I’ve been working on extending BBV to work interprocedurally. That is, allowing type information to cross function call boundaries. I see this last addition as the missing piece of the puzzle which completes the BBV approach and makes it into a viable complete alternative to traditional fixed point interprocedural type analyses.
My implementation of interprocedural BBV can be divided into at least four components:
- Generating type-specialized function entry points (passing types to callees)
- Type-specialized call continuations (passing return types back to callers)
- Shapes-changed “dirty” flag propagation to call continuations (shape mutation/aliasing info)
- Threading the global object shape through function calls
So far, I have the first two parts working on my development branch with all tests passing and all benchmarks working. The results are quite positive so far. I’ve been able to achieve a reduction in the number of type tag tests of 98% on average over 26 benchmarks. On many of the simpler benchmarks, there are essentially no type tests executed during the entire execution.
The results so far are impressive enough that I believe BBV may actually be eliminating more type tests than what is achievable with traditional type analyses. This is assuming that such analyses eliminate redundant type tests without versioning or duplicating code. I intend to explore this further as part of my PhD dissertation. I would like to be able to show that BBV is not only fast but actually more capable than iterative fixed-point type analyses.
A few months ago, I was pleased to receive an e-mail from professor Jan Vitek (chair of the ACM SIGPLAN) inquiring about the status my basic block versioning work. He had been on the committee of a conference to which my advisor and I had submitted a paper (which was unfortunately rejected). Following some discussion, Jan Vitek invited me to visit Northeastern University to meet his research group and give a talk about my PhD work.
This Monday, I flew to Boston to visit the Programming Research Laboratory. I had the opportunity to exchange ideas with several PhD students as well as established researchers in the PL and compiler research fields, including Jan Vitek himself, Matthias Felleisen and Olin Shivers. Topics discussed were quite broad, ranging from graduate studies to PL design to data-parallel programming to compiler verfication to actors and JIT compiler imlementation.
The talk itself went quite well and elicited much participation. I have to say that the act of creating and organizing the slides for the talk was pretty helpful for me. I’ve been able to come up with better examples to explain basic block versioning and the way incremental code generation is implemented in Higgs. It also gave me an opportunity to talk about my recent work on typed shapes. The general feeling I’ve been having lately is I explain my research much more effectively when giving talks than when describing it in papers.
I exchanged ideas with Jan Vitek regarding hybrid type analyses for optimizing dynamic languages (he is working on an optimizing VM for the R language). Jan Vitek also gave me precious feedback on how to improve my basic block versioning paper. I’m currently working on an updated submission for the upcoming ECOOP conference. Regardless of whether the new paper gets in or not, it will be much improved with additional data to support our claims.
My hosts were very well-organized which made the whole experience quite pleasant. A meeting schedule had been planned for me and a hotel room reserved when I arrived. I got a little scared Tuesday evening when my taxi was over 30 minutes late, putting me in serious danger of missing the last flight back to Montreal, but I was lucky enough to run into Carla E. Brodley, dean of CCIS, who offered me to share an uber ride with her. My only regret regarding this short trip is that I did not get to try Boston clam chowder. Maybe next time.
This Monday, I met with Marc Feeley, my PhD advisor. I showed him graphs comparing the performance of Higgs to that of the TraceMonkey compiler. Reviewers at conferences we’ve previously submitted papers to have been very skeptical, and pressed us multiple times to produce some kind of comparison of basic block versioning against tracing. We believe that faring well against TraceMonkey could increase our odds of getting our work published.
So far, we’re beating TraceMonkey on about half of our benchmark set. There are, however, cases where we’re not doing well at all. The above graph compares speedups produced by Higgs relative to V8 baseline and TraceMonkey. The scale is logarithmic and higher bars are better. If you look at the last bars at the far right, you’ll notice that they’re actually negative. That’s because our performance on that benchmark is about 100 times slower than that of TraceMonkey (ouch). The scale of this graph could be fixed so that the bars no longer show up as negative, but Marc rightly pointed out that reviewers may latch on instances like this. Does this benchmark exhibit some pathological case that basic block versioning can’t handle?
What Went Wrong?
The reason my for-in loop was slow is that I never bothered to optimize it. With such complex semantics, it seemed easier to take the “comfortable” approach. I implemented the loop with a closure that gets generated when iteration begins. This closure is a generator that returns a different property name on each call. It hides all of the iteration state and complex logic from the outside. What also added to the slowness is that this closure made multiple calls to the D part of my runtime to traverse object shapes. I wrote a microbenchmark to measure the speed of the for-in loop itself and sure enough, it was over 100 times slower than that of V8 and TraceMonkey.
Making it Fast
My advisor convinced me that I should address this performance problem, and to do that, I had to make for-in loops faster. It was obvious that I should probably get rid of the closure created on each traversal and eliminate the D runtime calls. I had to remove the slow things, but what would I replace them with? For the closure, it was fairly obvious: initialize the variables I need before the loop, and thread them through manually as appropriate. The bigger issue was how I would go about enumerating all the properties in an object without calling into my D runtime code, and without traversing the shape tree structure at all, if possible.
I knew that V8 and TraceMonkey had fast for-in loops, so surely, they must have found a way to make them fast. They must have some kind of trick that I hadn’t thought of. My eureka moment came when I realized that for a given object shape (aka hidden class), the properties enumerated are always the same. Aha! There’s an easy solution then. A list of the properties to be enumerated over can be generated per-shape. To avoid D runtime calls, this list can take the form of a JS array that can be directly accessed by my JS runtime. Furthermore, it can be lazily generated the first time a for-in loop touches a given shape, and cached for future reuse. Clever, and not too difficult to implement.
|Higgs (old for-in)||>500s|
|Higgs (new for-in)||2.91s|
|V8 3.29.66 (baseline)||4.98s|
|V8 3.29.66 (Crankshaft)||4.95s|
After a few iterations of tweaking, I was quite pleased with the performance results I got (see above). Higgs went from having a horribly slow for-in loop to having the fastest of all. Quite possibly, the precached table trick I came up with is something other engines don’t have in their implementation.
The Benchmark Game
The faster for-in loop satisfied my original goal. The fasta benchmark now runs several times faster on Higgs. The performance we get is no longer abhorrent. However, we still aren’t beating the other engines on that benchmark, we’re about 3 times slower now. There’s another detail that hinders performance in this specific benchmark. In the body of the for-in loops, properties are being accessed with dynamic keys that are only known at execution time. Higgs currently deals with that by doing a D runtime call. Clearly, again, the other engines have some better way of handling this particular case than I do.
I’ve been working hard in the last two weeks to try and fix every obvious performance issue I could find. Among other things, I implemented ropes to make string concatenation faster. Now, I made for-in loops faster. In the end though, each optimization I make is just a drop in the bucket. To compete effectively on every benchmark, there are many individual things to optimize. Everything must be made fast. The people at Google and Mozilla have had years to optimize their engines specifically so that they would perform well on the well-known sets of benchmarks. They’ve gone as far as implementing memoization of sine values based on the knowledge that this one benchmark calls the sine functions with the same angle values repeatedly.
Over a week ago, I received the unfortunate news that my basic block versioning paper was rejected for the third time. We had received five “weak accept” scores, but at the last minute, after the rebuttal period was over, one of the reviewers decided to change his vote and give us a “reject” rating. I won’t lie, even though I knew the paper could be rejected, I found that I had a hard time getting out of bed the next day. For a little while, my motivation was completely crushed. I decided to put the latest version of the paper on arXiv.org. I feel that it’s useful to timestamp my progress, and it seems that sharing the results of my research is more important than getting into academic conferences.
The graph above shows a comparison of the relative execution time of code compiled by Higgs vs TraceMonkey (lower is better). Compilation times are excluded from this comparison because of issues mentioned in a previous blog post. The 26 benchmarks are from the SunSpider and V8 suites, but I’ve excluded regular expression benchmarks since Higgs doesn’t have a regular expression compiler (only an interpreter), and my research is not specifically about regular expressions.
Higgs clearly doesn’t beat TraceMonkey across the board, but it does perform better on half of the above benchmarks, sometimes by a wide margin. What seems clear after some analysis is that TraceMonkey does very well in benchmarks with short, obvious loops, but poorly in benchmarks with more complex control-flow. This should at least highlight the fact that basic block versioning is not the same thing as tracing, and can, in fact, deal with situations to which tracing is less suitable.
Fortunately for me, TraceMonkey is not a moving target in terms of performance, and this week, I’ve been working on helping Higgs catch up. I’ve added string optimizations such as ropes and precaching of one-character strings. I’ve also been working on eliminating bottlenecks in the runtime and standard libraries. Tomorrow, I’ll be optimizing the inefficient way in which Higgs handles for-in loops (property enumerations), which are extensively used in the fasta benchmark.
It’s very motivating to have a goal. You can be sure that the next paper I submit will be showing Higgs beating TraceMonkey on most of these benchmarks. I’m going to keep improving the performance, keep improving our results, and keep amassing more and more evidence that basic block versioning is a worthwhile JIT compiler architecture. I’m going to push Higgs to the limit.