Two days ago, late in the evening, I wrote a blog post about performance issues I'd attributed to D's garbage collector. I didn't expect this blog post to draw much attention. I was hoping that some of the people working on the D runtime library might see it, but I didn't publicize it any further than tweeting the link. My goal was only to draw some attention to the problem, and to provide some numbers quantifying the GC overhead.
Someone decided the post was interesting enough to submit it to reddit's r/programming as well as Hacker News. It's also been discussed on the D forums. In total, about 18000 people ended up viewing it, and over 160 comments were posted. My main conclusion is that controversial topics are a good way to draw attention. I've succeeded in my goal of drawing attention to the problem on a much bigger scale than I'd wanted: now, everyone knows about it. I sincerely apologize for the bad publicity I may have caused D, but I'm glad that the problem has been recognized and that something will be done about it.
Andrei Alexandrescu, co-creator of D, posted on reddit to say that he wants to rewrite the GC based on his own allocator design. Dicebot of Sociomantic has also said that they are working on porting their concurrent garbage collector to the current version of D. I don't know how these two efforts will combine, but I'm sure that something good will come of it. A high performance language needs a high performance runtime, and with a brand new GC, the D programming language will be much more competitive.
People with knowledge of the D GC internals ended up posting on the reddit thread and providing insight into why the D garbage collector is currently inefficient.
thedeemon wrote the following:
The main issue from the post is quite simple: if you allocate a lot of small objects, D runtime will place them inside chunks (called pools) of fixed size (like 4 MB) that it requested from the OS. Each time a pool is filled up it will do a GC cycle and only if there's still no room for new object it will allocate a new pool. That means number of GC cycles in this situation is proportional to number of objects you allocate, while each GC cycle time is proportional to the managed heap size. It leads to quadratic complexity and awful slowness. If runtime allocated pools not of fixed size but adapted their size to the allocation rate, this scenario could be significantly faster.
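To make the behavior thedeemon describes concrete, here is a minimal timing sketch of my own (not part of the comment, and not code from Higgs): allocate N small objects while keeping them reachable, then 2N, and compare the wall-clock times. If the collection cost grows quadratically with the number of objects, doubling the count should roughly quadruple the total time.

```d
import std.stdio;
import core.time : Duration;
import std.datetime.stopwatch : StopWatch, AutoStart;

class Node
{
    Node next;
    long payload;
}

Duration allocMany(size_t count)
{
    auto sw = StopWatch(AutoStart.yes);
    Node head;
    foreach (i; 0 .. count)
    {
        auto n = new Node;   // one small GC allocation per iteration
        n.next = head;       // keep everything reachable so the heap keeps growing
        head = n;
    }
    return sw.peek();
}

void main()
{
    writeln("1M objects: ", allocMany(1_000_000));
    writeln("2M objects: ", allocMany(2_000_000));
}
```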
deadalnix added this:
There are many other problems. Among others:
D's GC rely on a global lock in which it goes through every time.
D's GC scan many location even when they cannot contain pointers, which do not make any sense.
D's GC does not take advantage of D's type system, when other experiment like OCaml's GC have shown that there is great benefit going down that road.
To summarize the issues outlined, the D GC:
Has potentially quadratic complexity with respect to the number of allocated objects.
Uses a global lock while traversing objects.
Can only use one thread to perform collections.
Does not take advantage of the D type system, and so unnecessarily scans non-pointer data (see the sketch after this list).
Does not use lazy sweeping, a common mark & sweep GC optimization.
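On the type system point: D's GC API does already expose a coarse, per-allocation hint, the NO_SCAN block attribute, which tells the collector that a block contains no pointers and never needs to be scanned. The criticism is that this kind of type information isn't exploited automatically or at a finer granularity. A small illustration of the existing attribute (my own sketch, not runtime code):

```d
import core.memory : GC;

void main()
{
    enum size = 64 * 1024 * 1024;

    // 64 MB of raw bytes can never hold GC pointers; flagging the block as
    // NO_SCAN keeps the mark phase from having to walk through all of it.
    auto p = cast(ubyte*) GC.malloc(size, GC.BlkAttr.NO_SCAN);
    p[0 .. size] = 0;
}
```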
Fortunately, the worst issue (quadratic growth) can be fixed without much effort.
For those who are curious, or may want to help, here is a link to the source code of the D GC. If you're interested in contributing and helping to fix this problem, you should head to the D forums and get in touch with the community, or try contacting Andrei directly. Even if you don't have time to contribute, your feedback may be useful in the design of a better GC.
Unfortunately, not everyone was welcoming of the discussion I brought forward. Criticism mostly seemed to come from developers who are fundamentally opposed to the use of garbage collection, and view the whole concept in a very negative light. These people all put forth the same dismissive and uninformed line of reasoning, which can be summarized as follows:
1. Garbage collectors are inherently slow.
2. Since Higgs suffers from GC overhead, Higgs must be doing "too many" allocations.
3. Because of #1, it's pointless to improve the D GC. It will always be slow.
4. It's the responsibility of the application writer to avoid using the GC.
I'll say this: one of the main reasons I chose D over C++ was that D has a garbage collector. I had written a JIT compiler in C++ previously, and I found it tedious to manage memory allocations manually. Compilers are inherently allocation-heavy, because they must represent and transform symbolic expression trees, graphs, and large amounts of metadata. Having to account for memory allocation and release every time you wish to transform such a tree or graph is highly error-prone.
To avoid using the GC in Higgs, I'd basically have to implement my own specialized memory managers (my own allocation and garbage collection) for each of the data structures used in the compiler. Saying that I shouldn't use the GC denies any responsibility the D GC might have, and shifts the burden onto me, the application developer, to implement specialized memory management for every use case where I need dynamic allocation, and there are many such use cases.
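For a sense of what that burden looks like, here is a hypothetical sketch (not code from Higgs) of the simplest possible specialized manager: a free list for a single node type, where every transformation that drops a node must remember to hand it back.

```d
import core.stdc.stdlib : malloc;

struct ExprNode
{
    ExprNode* left;
    ExprNode* right;
    int op;
}

struct NodePool
{
    private ExprNode* freeList;

    ExprNode* alloc()
    {
        ExprNode* n;
        if (freeList !is null)
        {
            n = freeList;
            freeList = n.left;   // the left field doubles as the free-list link
        }
        else
        {
            n = cast(ExprNode*) malloc(ExprNode.sizeof);
        }
        *n = ExprNode.init;
        return n;
    }

    // Every transformation that drops a node has to remember to call this;
    // that bookkeeping is exactly what a GC normally takes care of.
    void release(ExprNode* n)
    {
        n.left = freeList;
        freeList = n;
    }
}

void main()
{
    NodePool pool;
    auto a = pool.alloc();
    auto b = pool.alloc();
    a.left = b;

    // Dropping the subtree now requires explicit bookkeeping:
    pool.release(b);
    pool.release(a);
}
```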
In my previous blog post, I demonstrated two things:
In a normal execution of Higgs, the GC may take more than 75% of the total execution time.
Simple tweaking of GC parameters can reduce the GC overhead tremendously, producing a 3x speedup.
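As an illustration of the kind of knob involved (this is not necessarily the exact tweak from the previous post), D's core.memory API lets an application reserve GC heap space up front, which reduces how many collection cycles fire while the heap is growing:

```d
import core.memory : GC;

void main()
{
    // Ask the runtime for ~512 MB up front, so early allocations don't
    // repeatedly fill a small pool and trigger collection after collection.
    GC.reserve(512 * 1024 * 1024);

    // ... allocation-heavy work would go here ...
}
```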
What's clear is that the D garbage collector scales very poorly. In fact, I'd say that the scaling is often pathological. Twice the allocations may mean the total collection time quadruples: a quadratic growth. This means that there's a ceiling on the number of objects the GC can be expected to reasonably manage. There's a ceiling on the scale of applications using the GC. If you go beyond this ceiling, your application will slow down to a crawl. You want to write a multithreaded web server in D? You're probably going to have to manage memory manually. Java doesn't suffer from this weakness, so why should D?
Higgs is a program written in what I believe to be a reasonably efficient style. There aren't many layers of abstraction in the implementation. It doesn't create what I would deem to be an unreasonable number of allocations for what it does. I'm sure that if I put forth the engineering effort, I could reduce my reliance on the GC and improve performance. In fact, I already have. However, improving the D GC is something that will clearly benefit all users of the language. A future compiler update might improve the performance of some D programs by an order of magnitude. Working towards fixing poor GC performance is much more helpful than blaming those who are affected by it.