Skip to content

Faster than V8 *

September 17, 2014

There’s an asterisk next to the title because in most cases, Higgs doesn’t beat the performance of Google’s V8, quite the opposite. I’ve mentioned in my post about D’s garbage collector performance that Higgs suffers from very slow compilation times. Higgs also lacks the sophisticated inlining capabilities of V8, along with a plethora of micro-optimizations needed to perform well on widely-used benchmarks. Hence, I was both surprised and pleased when I found out that Higgs performs better than V8 on a microbenchmark.

The microbenchmark in question is a simple loop incrementing a global counter variable from 0 to one billion:

// Loop to 1B
// Note: i is a global variable
for (i = 0; i < 1000000000; ++i)
{
}

On my Core i7 laptop running 64-bit linux, I get the following performance results (also replicated on an i7 desktop):

time d8 loop-global-incr.js

real 0m2.686s
user 0m2.672s
sys 0m0.008s

time ./higgs loop-global-incr.js

real 0m2.415s
user 0m2.368s
sys 0m0.040s

This might seem quite pointless as it’s a trivial piece of code, not at all representative of the complexities of a real program. As I said earlier, on real benchmarks, V8 usually wins by a wide margin. Still, I do think that this is interesting. Why? Because in this simple instance, I’m able to generate better quality machine code. Despite its complex multi-tiered architecture, type-feedback and type analysis capabilities, V8 actually does a fairly poor job of optimizing this loop, whereas the machine code generated by Higgs is fairly tight.

Optimizing this simple loop is actually nontrivial. In JavaScript, global variables are properties of the global object. As such, these properties could turn out to be accessors (getters and setter methods). Global properties can also be deleted, in which case reading them should throw an exception. There is also the issue that the addition and less-than operators can operate on any type without throwing an error. Hence, to generate effective machine code, you have to prove that the global variable “i” is not an accessor, that it won’t be deleted and that it will remain an integer. If you can’t prove those facts, then the machine code you generate will contain redundant type checks.

The optimized machine code generated by V8 is quite bulky and cryptic:

— Optimized code —
optimization_id = 0
source_position = 0
kind = OPTIMIZED_FUNCTION
stack_slots = 2
Instructions (size = 346)
0xc579e565ae0 0 55 push rbp
0xc579e565ae1 1 4889e5 REX.W movq rbp,rsp
0xc579e565ae4 4 56 push rsi
0xc579e565ae5 5 57 push rdi
0xc579e565ae6 6 4883ec10 REX.W subq rsp,0x10
0xc579e565aea 10 488b45f8 REX.W movq rax,[rbp-0x8]
0xc579e565aee 14 488945e0 REX.W movq [rbp-0x20],rax
0xc579e565af2 18 488bf0 REX.W movq rsi,rax
0xc579e565af5 21 493ba540070000 REX.W cmpq rsp,[r13+0x740]
0xc579e565afc 28 7305 jnc 35 (0xc579e565b03)
0xc579e565afe 30 e8bdd5fcff call StackCheck (0xc579e5330c0)
0xc579e565b03 35 33c0 xorl rax,rax
0xc579e565b05 37 48bb681c51f000110000 REX.W movq rbx,0x1100f0511c68 ;; property cell
0xc579e565b0f 47 4d8b55b0 REX.W movq r10,[r13-0x50]
0xc579e565b13 51 4c3913 REX.W cmpq [rbx],r10
0xc579e565b16 54 0f84f6000000 jz 306 (0xc579e565c12)
0xc579e565b1c 60 488903 REX.W movq [rbx],rax
0xc579e565b1f 63 e911000000 jmp 85 (0xc579e565b35)
0xc579e565b24 68 4883ec08 REX.W subq rsp,0x8
0xc579e565b28 72 488b45f8 REX.W movq rax,[rbp-0x8]
0xc579e565b2c 76 488b5d10 REX.W movq rbx,[rbp+0x10]
0xc579e565b30 80 e908000000 jmp 93 (0xc579e565b3d)
0xc579e565b35 85 488b5d10 REX.W movq rbx,[rbp+0x10]
0xc579e565b39 89 488b45e0 REX.W movq rax,[rbp-0x20]
0xc579e565b3d 93 48ba681c51f000110000 REX.W movq rdx,0x1100f0511c68 ;; property cell ptr
0xc579e565b47 103 488b12 REX.W movq rdx,[rdx] ;; read counter from global
0xc579e565b4a 106 f6c201 testb rdx,0x1 ;; type tag test
0xc579e565b4d 109 0f8553000000 jnz 198 (0xc579e565ba6)
0xc579e565b53 115 48c1ea20 REX.W shrq rdx,32 ;; unboxing counter
0xc579e565b57 119 81fa00ca9a3b cmpl rdx,0x3b9aca00
0xc579e565b5d 125 0f8d32000000 jge 181 (0xc579e565b95)
0xc579e565b63 131 493ba540070000 REX.W cmpq rsp,[r13+0x740] ;; ???
0xc579e565b6a 138 0f8261000000 jc 241 (0xc579e565bd1); jump on carry
0xc579e565b70 144 83c201 addl rdx,0x1 ;; counter increment
0xc579e565b73 147 8bca movl rcx,rdx ;; redundant move
0xc579e565b75 149 48c1e120 REX.W shlq rcx,32 ;; boxing counter
0xc579e565b79 153 48ba681c51f000110000 REX.W movq rdx,0x1100f0511c68 ;; property cell ptr
0xc579e565b83 163 4d8b55b0 REX.W movq r10,[r13-0x50] ;; ???
0xc579e565b87 167 4c3912 REX.W cmpq [rdx],r10 ;; ???
0xc579e565b8a 170 0f8487000000 jz 311 (0xc579e565c17) ;; ???
0xc579e565b90 176 48890a REX.W movq [rdx],rcx ;; write counter to globa
0xc579e565b93 179 eba8 jmp 93 (0xc579e565b3d) ;; jump to loop header
0xc579e565b95 181 48b82141d0c48b060000 REX.W movq rax,0x68bc4d04121
0xc579e565b9f 191 488be5 REX.W movq rsp,rbp
0xc579e565ba2 194 5d pop rbp
0xc579e565ba3 195 c20800 ret 0x8
0xc579e565ba6 198 4d8b5500 REX.W movq r10,[r13+0x0]
0xc579e565baa 202 4c3952ff REX.W cmpq [rdx-0x1],r10
0xc579e565bae 206 751a jnz 234 (0xc579e565bca)
0xc579e565bb0 208 f20f104207 movsd xmm0,[rdx+0x7]
0xc579e565bb5 213 f20f2cd0 cvttsd2sil rdx,xmm0
0xc579e565bb9 217 0f57c9 xorps xmm1,xmm1
0xc579e565bbc 220 f20f2aca cvtsi2sd xmm1,rdx
0xc579e565bc0 224 660f2ec1 ucomisd xmm0,xmm1
0xc579e565bc4 228 7504 jnz 234 (0xc579e565bca)
0xc579e565bc6 230 7a02 jpe 234 (0xc579e565bca)
0xc579e565bc8 232 eb8d jmp 119 (0xc579e565b57)
0xc579e565bca 234 e86304caff call 0xc579e206032 ;; deoptimization bailout
0xc579e565bcf 239 eb86 jmp 119 (0xc579e565b57)
0xc579e565bd1 241 50 push rax
0xc579e565bd2 242 51 push rcx
0xc579e565bd3 243 52 push rdx
0xc579e565bd4 244 53 push rbx
0xc579e565bd5 245 56 push rsi
0xc579e565bd6 246 57 push rdi
0xc579e565bd7 247 4150 push r8
0xc579e565bd9 249 4151 push r9
0xc579e565bdb 251 4153 push r11
0xc579e565bdd 253 4156 push r14
0xc579e565bdf 255 4157 push r15
0xc579e565be1 257 488d6424d8 REX.W leaq rsp,[rsp-0x28]
0xc579e565be6 262 488b75f8 REX.W movq rsi,[rbp-0x8]
0xc579e565bea 266 33c0 xorl rax,rax
0xc579e565bec 268 498d9d451eb1fe REX.W leaq rbx,[r13-0x14ee1bb]
0xc579e565bf3 275 e82806faff call 0xc579e506220 ;; CEntryStub
0xc579e565bf8 280 488d642428 REX.W leaq rsp,[rsp+0x28]
0xc579e565bfd 285 415f pop r15
0xc579e565bff 287 415e pop r14
0xc579e565c01 289 415b pop r11
0xc579e565c03 291 4159 pop r9
0xc579e565c05 293 4158 pop r8
0xc579e565c07 295 5f pop rdi
0xc579e565c08 296 5e pop rsi
0xc579e565c09 297 5b pop rbx
0xc579e565c0a 298 5a pop rdx
0xc579e565c0b 299 59 pop rcx
0xc579e565c0c 300 58 pop rax
0xc579e565c0d 301 e95effffff jmp 144 (0xc579e565b70)
0xc579e565c12 306 e8f303caff call 0xc579e20600a ;; deoptimization bailout
0xc579e565c17 311 e80c04caff call 0xc579e206028 ;; deoptimization bailout

At the start, there’s a long sequence of instruction which seems to be a function entry prelude. This is not part of the loop body. There are many instructions in this machine code which are never executed. You can see, for example, that there is floating-point code. That code was generated in case an overflow might occur, but this can’t happen here. There are also deoptimization bailouts (calls into the JIT compiler) which are never needed. This code is unnecessary, but won’t affect performance. I’ve highlighted in bold the machine instructions inside the loop body which do affect performance but could potentially be eliminated. Essentially, in each loop iteration, the counter variable is unboxed (using a shift instruction), its type tag is tested and then it is boxed again.

In contrast, Higgs, using basic block versioning, is able to eliminate the type test from the loop body. Higgs also uses an unusual scheme where type tags are stored separately from value words. Because of this, Higgs does not need to box and unbox object property values. Since the type of the counter variable is known to always be an integer in this case, type tags don’t need to be touched at all inside the loop body. Furthermore, in Higgs, code is lazily generated, so there is no floating-point or bailout code. There are no long sequences of instructions that are never executed.

The machine code produced by Higgs for the loop body is as follows:

for_test(19337):
; $3 = shape_get_def $1, “i” => capture_cont(19386)
mov rdx, 140451224576896; rdx = shape pointer, unused

capture_cont(19386):
; $8 = shape_get_prop $1, $3
mov r10, [qword rsi + 2616];

; $0 = lt_i32 $6, 1000000000
cmp r10d, 1000000000;
jge branch_for_exit(1933A);
nop5;

for_body(19338):
; $4 = shape_get_def $1, “i” => capture_cont(193A5)
mov r8, 140451224576896; rdx = shape pointer, unused

; $11 = shape_get_prop $1, $4
mov rdi, [qword rsi + 2616];

; $17 = add_i32_ovf $9, 1 => call_merge(193FF), if_false(19410)
add edi, 1;
mov eax, 9795; eax = block idx, for eventual bailout
jo branch_if_false(19410); overflow test

; $14 = shape_get_def $1, “i” => capture_cont(19429)
mov r9, 140451224576896; rdx = shape pointer, unused

; shape_set_prop $1, “i”, $13
mov [qword rsi + 2616], rdi;

; jump => for_test(19337)
jmp for_test(19337);

The machine code above is far from optimal. It contains redundant moves and overflow checks which are not yet optimized away. I’m also fully aware that V8 is a moving target in terms of performance: the V8 team is already working on a better optimizing compiler, and they will no doubt improve upon the current situation. Still, it’s nice to see that all the work I’ve put into basic block versioning is paying off. In the general case, Higgs doesn’t beat V8, and things are likely to remain that way, but in this specific instance, Higgs is able to generate better quality machine code than a commercial compiler which has tens of man-years of engineering behind it. It’s a small victory, but I find it very encouraging nevertheless.

The tests I performed were done with V8 version 3.29.66 (current SVN trunk revision), and the current version of my development branch. If someone has a recent version of the SpiderMonkey shell built, I’d be curious to see the generated code and performance results for this microbenchmark. I wouldn’t be surprised if SpiderMonkey did better than V8 and Higgs, considering it sports a powerful hybrid type inference engine.

Advertisements
8 Comments
  1. Instructions on 131-138 are “interrupt check”, needed to interrupt the loop from outside. V8 injects them in every loop to support stuff like “pause running script”.

    163-167 is a “hole check” – needed to support deletion of deletable global variables – arguably V8 could figure out that there is no way this variable could be deleted in this loop, but in the Hydrogen IR this check is not “explicified”, but hidden inside the store and thus not eliminated. It will go away if you declare the loop iteration variable with var. You don’t need a new shiny compiler to solve this issue btw… and as soon as you have a call inside this loop this check will come back even if we could eliminate it for an empty loop, so it’s questionable how much that actually matters for the real world code.

    In some sense this hole check, boxing and unboxing are all actually historical artifacts from the times when V8 did not have a system of code dependencies and type tracking which it has now. Today V8’s runtime system actually should know that ‘i’ is an Int32 variable – so it could just load and store it unboxed just like you do, unfortuntely optimizing compiler does not utilize this knowledge at all. [Maybe new compiler eventually will, but again one does not need a new compiler to achieve this].

    It sound ironic but having “tens of man years” in your luggage is precisely what leads to various small unoptimalities like this accumulating here and there.

    Finally I am looking forward to either Higgs or V8 (whoever gets there first) compiling this loop down to i = 1000000000, anything other than that is redundant :)

    • Thanks for your insight. I’m sure the V8 team can get rid of those inefficiencies, and maybe they already have in the bleeding edge branch in fact (I tested against the latest trunk revision). There is no doubt that overall, V8 is the more competent compiler, and I have a lot of respect for what the V8 team has achieved. I chose the post title with humor in mind.

      I’m just happy because it took me a while to even be able to generate machine code of this level of quality. Good enough to be somewhat competitive. It still needs to be improved, and it will be, but there are some nice tricks in there I’m quite proud of. Consider, for example, that the Higgs global object is an object just like every other in the system, but my loop has no inline caches.

      • [somehow I don’t get any mail notifications about the replies :-(]

        maybe they already have in the bleeding edge branch

        I don’t think anybody is looking at this at the moment, so these “inefficiencies” will be there for a bit. I am daring to put “ineffeciencies” in quotes because I have serious doubt that this particular issue really matters in the real world code.

        Consider, for example, that the Higgs global object is an object just like every other in the system.

        Yep, I noticed that and I am curious how it will scale in practice (beyond benchmarks). It definitely simplifies a lot of things and works perfectly when all variables are declared ahead of time. In V8 it’s kinda “non-normal” object based on the logic that people tend to gradually add global variables as the program runs – and this will cause too much polymorphism which will polute all the sites that use global variables. Under this assumption it seems preferable to treat each global variable in separation (that’s where V8’s property cells are coming from).

        but my loop has no inline caches

        Well, if optimizing compiler would leave inline cache in this loop that would be quite unoptimal :)

        I’m just happy because it took me a while to even be able to generate machine code of this level of quality.

        I think it’s a well deserved happiness. It’s a nice quality code after all: I especially appreciate that compiler knows the type of the value stored in i.

        • It’s a tradeoff. The property cells are easy to scale but complicate some aspects of the code where the global object is a special case. My approach is more uniform but requires dynamically generating new code paths if/when the global object shape changes. I think with a fast incremental compiler, this is possible. Probably, the number of sites which access global properties and witness multiple global object shapes and require optimized code are not that many. Most of the code is not hot and can use some generic code path that never gets optimized.

          My ultimate vision for Higgs would be to have a garbage collector and compactor for machine code. Property accesses would be profiled before generating optimized code paths. New code could be generated, reordered, and dead code compacted without recompiling entire functions.

    • I just posted a message, but I don’t know where it went… in any case, I asked about where is the optimization expected to end. As Vyacheslav Egorov said, the loop could be optimized to a single variable assignment, considering a global variable.

      • It’s fair to optimize away the loop. As you said, it would be semantically equivalent, but it seems JS JITs don’t typically do that. My guess is that it’s not usually worth it to run this kind of analysis on every loop. You need to be able to prove a number of things in order to prove that this optimization is valid. Among other things, you need to guarantee that the global variable doesn’t have accessor methods and that it won’t be deleted, which, according to Vyacheslav, V8 isn’t yet able to do.

        • that it won’t be deleted, which, according to Vyacheslav, V8 isn’t yet able to do.

          That’s not exactly what I said. I just said that IR is not “explicifying” the check which verifies that the property was not deleted.

          Removing the check could been done in several ways, two simplest ones:

          make it explicit and allow LICM to hoist it out of the loop, as loop has not interfering side-effects (note: LICM is already well capable of figuring that there are no interfering side-effects, i.e. that nobody can delete the property because there are no calls in the loop);
          don’t emit the check at all and instead add optimized code to the cell’s dependency group, this code will be deoptimized when somebody deletes the property.

          As you said, it would be semantically equivalent, but it seems JS JITs don’t typically do that. My guess is that it’s not usually worth it to run this kind of analysis on every loop.

          I actually think there is a very important (I will even dare say fundamental) optimization here that is neglected by the modern JS VM. It’s called “store sinking” which works very nicely after load forwarding… Essentially what we want to do is first to forward all stores so that

          B0:
          v0 = 0
          v1 = 1000000000
          StoreGlobal "i", v0
          goto B1
          
          B1:
          v2 = LoadGlobal "i"
          if v2 < v1 then B2 else B3
          
          B2:
          v3 = v2 + 1
          StoreGlobal "i", v3
          goto B1
          
          B3:
            // done
          

          becomes

          B0:
          v0 = 0
          v1 = 1000000000
          StoreGlobal "i", v0
          goto B1
          
          B1:
          v4 = phi(v0, v3)
          if v4 < v1 then B2 else B3
          
          B2:
          v3 = v2 + 1
          StoreGlobal "i", v3
          goto B1
          
          B3:
            // done
          

          An then we sink the store out of the loop because this store reaches no loads in this loop, this also makes first store dead.

          B0:
          v0 = 0
          v1 = 1000000000
          goto B1
          
          B1:
          v4 = phi(v0, v3)
          if v4 < v1 then B2 else B3
          
          B2:
          v3 = v2 + 1
          goto B1
          
          B3:
            StoreGlobal "i", v4
            // done
          

          What is left is to replace v4 in B3 with v1 (can be done with either ad-hoc loop variable recognition or a combination of SCCP and range analysis) and then kill B1/B2 as subgraph with no side-effects/uses which is also rather trivial.

          I would like to point out that all of the optimizations used here are rather fundamental and in my opinion should be run over any highlevel code to squeeze away as much abstractions as possible – it’s just that JS VMs somehow neglect to implemented them.

  2. Congrats on reaching this milestone! I’ve been following your research for awhile. Although I still can’t grasp most of the technical strategies used, it’s a shining example of the kind of low-level yet highly flexible systems you can write in D lang. For those looking to learn more about the language, see http://dlang.org.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: