Why language interpreters and virtual machines are slow

6 points by abnercoimbre


olliej

Re jit vs interpreter loop: assuming a highly optimized interpreter, the loop itself is generally significantly less than 10% of the overhead, depending on the IR granularity and what the rest of the overhead looks like.

The bulk of the JIT performance win (assuming essentially a template jit) comes from removing the overheads the interpreter needs just to actually perform the instruction - indirect loads, branches/conditions determined by the instruction's operands, etc. E.g. consider this basic example from JSC:

$ jsc -d
>>> function f(a, b) { return a + b; }
<snip dump info for executing the above>
>>> f(1,2)
<snip dump of making the call>
f#EVgtVm:[0x1138543a0->0x113829900, NoneFunctionCall, 9]: 3 instructions (0 16-bit instructions, 0 32-bit instructions, 0 instructions with metadata); 9 bytes (0 metadata bytes); 3 parameter(s); 6 callee register(s); 5 variable(s); scope at loc4
bb#1
Predecessors: [ ]
[   0] enter              
[   1] add                dst:loc5, lhs:arg1, rhs:arg2, profileIndex:0, operandTypes:OperandTypes(126, 126)
[   7] ret                value:loc5
Successors: [ ]

For the interpreter this evaluation requires (roughly):

- reading the add instruction and its operand indices out of the bytecode stream via the instruction pointer
- indirect loads of the lhs/rhs values out of the register file
- type checks and branches on the operand types
- an indirect store of the result into loc5
- advancing the pc and dispatching to the next instruction

So in the interpreter we have a pile of indirect and dependent loads[1], and may have runtime branches determined by instruction parameters. In the jitted code we lose the reads through the instruction pointer, the register indices all become constants, and most of the branches determined by instruction parameters disappear because the behaviour is known at compile time; and because the register stores/loads now have constant offsets you get load/store forwarding within and between instructions. All of these are responsible for a much larger proportion of the interpreter-to-template-jit performance win than the dispatch loop itself.
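To make that concrete, here's a rough C sketch (purely illustrative, not JSC's actual implementation; the bytecode layout and value representation are made up) of the work behind that single add, versus what a template jit can emit once the operand indices are compile-time constants:

    #include <stdint.h>

    typedef struct { uint8_t opcode, dst, lhs, rhs; } Instr;  /* hypothetical bytecode layout */
    typedef struct { int is_int32; int32_t i32; } Value;      /* real engines NaN-box etc. */

    /* Interpreter: all of this work is redone on every execution of the instruction. */
    void interp_add(const Instr *pc, Value *regs)
    {
        uint8_t dst = pc->dst;            /* read operand indices out of the bytecode stream */
        uint8_t lhs = pc->lhs;
        uint8_t rhs = pc->rhs;
        Value a = regs[lhs];              /* indirect, data-dependent loads from the register file */
        Value b = regs[rhs];
        if (a.is_int32 && b.is_int32) {   /* runtime branch on the operand types */
            regs[dst].i32 = a.i32 + b.i32;
            regs[dst].is_int32 = 1;
        } else {
            /* fall into the slow JS add path (doubles, strings, ToPrimitive, ...) */
        }
        /* ...then advance pc and dispatch to the handler for the next instruction */
    }

    /* Template jit: dst/lhs/rhs were constants when the code was emitted, so the machine
       code for "add dst:loc5, lhs:arg1, rhs:arg2" is morally just this (a real baseline
       jit still type-checks, but there's no decode and no dispatch): */
    void jitted_add(Value *regs)
    {
        regs[5].i32 = regs[1].i32 + regs[2].i32;  /* constant frame offsets, no decode, no dispatch */
    }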

Obviously once the optimizing tiers kick in, the relationship between the interpreter instructions and the generated code disappears, so a performance comparison there isn't relevant (from the pov of things like interpreter loop overheads, etc.).

[1] The add does type checks on entry, but it will swap the generic add instruction for one that prioritizes the types seen, so those checks are generally well predicted. This happens more or less immediately, and before any kind of jit compilation, so the bizarreness of JS add isn't super expensive.
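A made-up sketch of that swap (not JSC's actual mechanism or opcode names): the generic handler checks the operand types once, then rewrites its own opcode in place so later executions go straight to the specialized handler:

    /* Reusing the hypothetical Instr/Value types from the sketch above. */
    enum { OP_ADD_GENERIC, OP_ADD_INT32 };

    void op_add_generic(Instr *pc, Value *regs)
    {
        Value a = regs[pc->lhs], b = regs[pc->rhs];
        if (a.is_int32 && b.is_int32) {
            pc->opcode = OP_ADD_INT32;        /* in-place bytecode rewrite */
            regs[pc->dst].i32 = a.i32 + b.i32;
            regs[pc->dst].is_int32 = 1;
        } else {
            /* full JS add semantics (ToPrimitive, string concatenation, ...) */
        }
    }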

xnacly

Not having code blocks in monospace is so confusing to me.

The mentioned overhead of VMs can also be optimised away by moving the execution strategy nearer to the CPU, like wasm3 does with what it calls Tightly Chained Operations: every operation shares the signature pc_t pc, u64 *sp, u8 *mem, reg_t r0, f64 fp0, so the core interpreter state is passed in registers, essentially the same way the x86 calling convention passes arguments. It also uses tail calls for every operator implementation, which omits the dispatch loop entirely and probably pleases the branch predictor.
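Roughly, the idea looks like this (a simplified sketch, not wasm3's actual source; the names mirror the signature above, and it relies on the compiler turning the trailing calls into real tail calls, e.g. clang's musttail attribute or -O2 sibling-call optimisation, otherwise every op grows the native stack):

    #include <stdint.h>

    typedef uint64_t u64;
    typedef uint8_t  u8;
    typedef int64_t  reg_t;
    typedef double   f64;

    typedef union CodeSlot CodeSlot;
    typedef const CodeSlot *pc_t;
    /* Every operation shares one signature, so pc/sp/mem/r0/fp0 stay pinned in the
       same argument registers across the whole chain of calls. */
    typedef int (*op_t)(pc_t pc, u64 *sp, u8 *mem, reg_t r0, f64 fp0);
    union CodeSlot { op_t op; int64_t imm; };

    /* Dispatch = tail-call the next operation straight out of the code stream;
       there is no central interpreter loop. */
    #define NEXT_OP() return pc->op(pc + 1, sp, mem, r0, fp0)

    static int op_const_i64(pc_t pc, u64 *sp, u8 *mem, reg_t r0, f64 fp0)
    {
        *sp++ = (u64)r0;                  /* spill the cached top-of-stack */
        r0 = (reg_t)(pc++)->imm;          /* immediate operand lives inline in the code stream */
        NEXT_OP();
    }

    static int op_add_i64(pc_t pc, u64 *sp, u8 *mem, reg_t r0, f64 fp0)
    {
        r0 = (reg_t)*--sp + r0;           /* one operand on the value stack, the other cached in r0 */
        NEXT_OP();
    }

    static int op_end(pc_t pc, u64 *sp, u8 *mem, reg_t r0, f64 fp0)
    {
        *sp = (u64)r0;                    /* leave the result on the value stack and unwind */
        return 0;
    }

Starting it is just calling the first op: for { const 2; const 3; add; end } you'd build CodeSlot code[] = { {op_const_i64}, {.imm = 2}, {op_const_i64}, {.imm = 3}, {op_add_i64}, {op_end} }; and call code[0].op(code + 1, stack, NULL, 0, 0.0); with a u64 stack[] scratch array. Each handler ends in its own indirect jump to the next handler, so the branch predictor gets many reasonably predictable sites instead of one shared, hard-to-predict dispatch branch.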

A VM can also just be register based, and even fully stackless.

In regards to performance, I'd say things like JSCore with its OSR and four tiered execution modes (the FTL, "faster than light", jit is still a mystery to me) and V8 with its fully fine-tuned pipeline are pretty much top of the line for any language implementation, be it compiled, interpreted or a hybrid.