JIT: so you want to be faster than an interpreter on modern CPUs…
21 points by sjamaan
When running a short sequence of operations in a loop, the jumps will be far more predictable, making the branch predictor’s job easier, and thus improving the speed of the interpreter.
This used to be true, and has become accepted common knowledge, but it turns out branch predictors have improved, and can mostly handle interpreters just fine!
Jump-threading is still a win, but for different reasons (fewer instructions, and especially fewer branches), and the effect is smaller.
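To make that concrete, here is a minimal sketch of jump-threaded dispatch using GCC/Clang's computed-goto (labels-as-values) extension; the opcodes and bytecode are invented for illustration. Each handler ends with its own indirect jump, so the predictor sees one jump site per opcode instead of the single shared one a switch loop gives it:

    /* Minimal sketch of jump-threaded dispatch using the GCC/Clang
       "labels as values" extension. Opcodes and bytecode are invented;
       a real interpreter has many more handlers. */
    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    static void run(const int *code) {
        /* One table entry per opcode; every handler does its own
           indirect jump instead of looping back to a shared switch. */
        static const void *dispatch[] = { &&op_push, &&op_add, &&op_print, &&op_halt };
        int stack[64], *sp = stack;
        const int *ip = code;

    #define NEXT() goto *dispatch[*ip++]

        NEXT();
    op_push:  *sp++ = *ip++;            NEXT();
    op_add:   sp--; sp[-1] += sp[0];    NEXT();
    op_print: printf("%d\n", sp[-1]);   NEXT();
    op_halt:  return;
    #undef NEXT
    }

    int main(void) {
        const int code[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        run(code);  /* prints 5 */
        return 0;
    }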
I recently wrote about the ITTAGE branch predictor, which is the main innovation behind this change.
That is an enlightening post, thanks for writing it!
A fun thing about a tail-call interpreter (as I wrote previously) is the way it is (in effect) fettling the compiler’s optimizer. It gives the compiler only one opcode to chew on at a time, so the codegen is less likely to make suboptimal choices due to things like poor code layout or the register allocator getting overwhelmed with too many variables. And it forces the hot interpreter state to remain in registers instead of hoping for the best. It’s kind of amusing that nowadays writing an interpreter is about using the compiler well, not so much about using the CPU well :-)
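For the curious, here is roughly what such a handler looks like, as a minimal sketch assuming Clang's musttail attribute; the opcode names and encoding are invented, and real tail-call interpreters (CPython's new one, for instance) differ in detail. The hot VM state (instruction pointer, stack pointer) travels through the tail call's arguments, which is what keeps it in registers:

    /* Minimal sketch of a tail-call-threaded handler, assuming Clang's
       musttail attribute; opcodes, encoding, and handler names are invented.
       The hot state (ip, sp) stays in argument registers instead of being
       spilled to a struct between opcodes. */
    #include <stdint.h>
    #include <stdio.h>

    typedef void (*Handler)(const uint8_t *ip, int64_t *sp);
    extern const Handler dispatch[256];

    /* Dispatch = load the next handler and tail-call it. */
    #define DISPATCH(ip, sp) \
        __attribute__((musttail)) return dispatch[*(ip)]((ip), (sp))

    static void op_push(const uint8_t *ip, int64_t *sp) {
        *sp++ = ip[1];                     /* one-byte immediate */
        ip += 2;
        DISPATCH(ip, sp);
    }

    static void op_add(const uint8_t *ip, int64_t *sp) {
        sp--;
        sp[-1] += sp[0];
        ip += 1;
        DISPATCH(ip, sp);
    }

    static void op_halt(const uint8_t *ip, int64_t *sp) {
        (void)ip;
        printf("result: %lld\n", (long long)sp[-1]);
    }

    const Handler dispatch[256] = { [0] = op_push, [1] = op_add, [2] = op_halt };

    int main(void) {
        const uint8_t code[] = { 0, 2, 0, 3, 1, 2 };   /* push 2; push 3; add; halt */
        int64_t stack[16];
        dispatch[code[0]](code, stack);                /* prints "result: 5" */
        return 0;
    }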
It also makes it possible to extend the VM into a hybrid JIT/interpreter -- for instance, by generating superinstructions (functions implementing multiple instructions) on the fly, so that instead of threading only between tiny tail-called functions, you'd thread through a mix of tiny ones and larger ones that fuse several of them (whose bodies can be direct-called or inlined, letting the compiler do better analysis).
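As a rough illustration (a fragment meant to slot into the tail-call sketch above, with the same invented encoding; DISPATCH and the dispatch table are the ones defined there), a superinstruction is just another handler that does the work of a common sequence before dispatching again:

    /* Fused "push imm; push imm; add": one dispatch instead of three.
       Uses the DISPATCH macro and dispatch table from the sketch above. */
    static void op_push_push_add(const uint8_t *ip, int64_t *sp) {
        *sp++ = (int64_t)ip[1] + (int64_t)ip[2];   /* net effect of the three ops */
        ip += 3;                                   /* its opcode + two immediates */
        DISPATCH(ip, sp);
    }

A hybrid setup would generate (or select) handlers like this at runtime and patch the bytecode to use them; it's written statically here only to show the shape.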
The biggest issue here is that you must keep the number of opcodes within (unspecified) limits: too many opcodes could make the compiler's job far worse.
A good compiler will hopefully handle this well enough (if not, I'd consider this a bug worth reporting). On the CPU side, the interpreter code should fit into the L1 ICache and the hot parts should fit into the uop cache.
On JIT compilation vs. interpretation: JIT compilation gives two key benefits over interpretation:
Elimination of interpreter dispatch. This can be substantial for bytecodes with many small opcodes; for larger opcode handlers (like here or in Python), this overhead gets smaller. Combining common sequences of ops into a macro-op can help as well. Template-based code generation ("copy-and-patch") eliminates the interpreter dispatch cost.
Keeping values in registers across operations. Memory accesses are comparatively expensive, even with store-to-load forwarding and memory renaming in modern CPUs. Basic register allocation already gives huge performance improvements. Template-based code generation cannot really address this.
There's also the point of optimizations, but in principle, many optimizations can be implemented on bytecode as well.
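As a toy example of that last point, here is a constant-folding pass that runs purely over bytecode, with an invented encoding (one-byte opcodes, PUSH carrying a one-byte immediate); no machine code is generated anywhere:

    /* Toy constant-folding pass over bytecode: rewrite "PUSH a; PUSH b; ADD"
       into "PUSH (a+b)". Encoding is invented; this is a sketch of doing an
       optimization at the bytecode level rather than in a JIT. */
    #include <stdint.h>
    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_HALT };

    static size_t fold_constants(const uint8_t *in, size_t n, uint8_t *out) {
        size_t i = 0, o = 0;
        while (i < n) {
            if (in[i] == OP_PUSH && i + 4 < n &&
                in[i + 2] == OP_PUSH && in[i + 4] == OP_ADD) {
                out[o++] = OP_PUSH;
                out[o++] = (uint8_t)(in[i + 1] + in[i + 3]);   /* fold a+b */
                i += 5;
            } else if (in[i] == OP_PUSH) {
                out[o++] = in[i];
                out[o++] = in[i + 1];
                i += 2;
            } else {
                out[o++] = in[i++];
            }
        }
        return o;
    }

    int main(void) {
        const uint8_t code[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT };
        uint8_t opt[sizeof code];
        size_t n = fold_constants(code, sizeof code, opt);
        printf("%zu bytes -> %zu bytes\n", sizeof code, n);    /* 6 -> 3 */
        return 0;
    }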
(This is an excessively pedantic reply, but I have an excuse!)
Keeping values in registers across operations.
A tail-call interpreter (like CPython’s new one) uses the tail call arguments for manual register allocation. I wrote about this in more detail previously.
Template-based code generation cannot really address this.
I would expect a template JIT to do about as well as a tail-call interpreter, with the big caveat that the manually assigned register variables are the hot language runtime state, and a template JIT has far fewer hot variables than an interpreter.
But I think you meant keeping user-level values in registers, not the language runtime state, and you are right about that :-) I thought a pedantic reply was warranted because it’s easy for a reader to get confused about whether we’re talking about how an interpreter is translated to machine code by its compiler, vs how the user’s program is translated to machine code by a JIT, especially when both kinds of machine code need to call language runtime support functions that depend on those hot register variables.
But I think you meant keeping user-level values in registers, not the language runtime state
Yes, that's what I meant. Didn't expect other interpretations (no pun intended) were possible, so thanks for pointing this out!
Keeping values in registers across operations. Memory accesses are comparatively expensive, even with store-to-load forwarding and memory renaming in modern CPUs. Basic register allocation already gives huge performance improvements. Template-based code generation cannot really address this.
Couldn't the same be achieved with an interpreter, provided it is register-based instead of stack-based?
The "registers" of a register-based interpreter are just an array in memory; the register operand of a bytecode op is the array index.
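Concretely, something like this (invented four-byte encoding): the add still does two loads and a store to that array, so what the register form saves is the extra LOAD/STORE opcodes and their dispatches, not the memory traffic itself.

    /* The "registers" of a register VM are slots in a frame-local array, so
       a three-address ADD still reads and writes memory.
       Encoding (invented): ADD dst, src1, src2 as four bytes. */
    #include <stdint.h>
    #include <stdio.h>

    static void handle_add(int64_t *regs, const uint8_t *ip) {
        regs[ip[1]] = regs[ip[2]] + regs[ip[3]];   /* memory, not machine registers */
    }

    int main(void) {
        int64_t regs[8] = { 0, 2, 3 };                 /* r1 = 2, r2 = 3 */
        const uint8_t add_r0_r1_r2[] = { /*ADD*/ 0, 0, 1, 2 };
        handle_add(regs, add_r0_r1_r2);
        printf("r0 = %lld\n", (long long)regs[0]);     /* r0 = 5 */
        return 0;
    }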
So it doesn't really make a difference compared to a stack-based interpreter?
When I was with the Faster CPython team at MS, we experimented with this idea and it didn't really make much difference. In some cases, it made things a bit worse.
That’s curious. The received wisdom is that a register VM can be faster than a stack VM because it requires fewer opcodes to faff around with the stack (no LOAD_x opcodes). I wonder why Python doesn’t benefit as much as (say) Lua. Could it be due to the way local variables are represented? A Lua VM instruction uses a small integer to directly address local variables, constants, or temporary values; was that not feasible in Python?
Here's the conclusion of the discussion: https://github.com/faster-cpython/ideas/issues/485#issuecomment-1402781915
For locals and whatnot, it already uses something similar to what Lua does; the performance improvements weren't as pronounced as we thought they would be. (It has things like STORE_FAST and LOAD_FAST that store values in fixed stack slots, so it's already a bit of a stack/register hybrid. It also has LOAD_FAST_LOAD_FAST, which loads two values at once from local storage.)
The main issue is that pretty much every value in Python is boxed, so it has to go through methods that unbox them, perform the work, and then box the result back up. With quickening and type specialization, this overhead has been reduced quite a bit (it no longer has to look up which method adds two integers, for instance; it can have a specific instruction that adds two integers), but it's still fairly heavyweight compared to something that, for instance, uses NaN boxing.
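For contrast, here is a minimal sketch of NaN boxing; the tag layout is invented, and real VMs (LuaJIT, JavaScriptCore) use their own schemes and also have to canonicalize genuine NaN doubles. The point is that doubles and small integers live entirely inside the 64-bit value, so no heap object or refcount is touched for them.

    /* Minimal NaN-boxing sketch: a 64-bit Value is either a plain double or
       a quiet NaN whose low bits carry a tag and a payload (here: a 32-bit
       integer). Tag layout is invented for illustration. */
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    typedef uint64_t Value;

    #define QNAN    UINT64_C(0x7ff8000000000000)
    #define TAG_INT UINT64_C(0x0001000000000000)     /* bit 48 marks "boxed int" */

    static Value  box_double(double d)  { Value v; memcpy(&v, &d, sizeof v); return v; }
    static double unbox_double(Value v) { double d; memcpy(&d, &v, sizeof d); return d; }
    static int    is_double(Value v)    { return (v & QNAN) != QNAN; }

    static Value   box_int(int32_t i) { return QNAN | TAG_INT | (uint32_t)i; }
    static int32_t unbox_int(Value v) { return (int32_t)(v & UINT64_C(0xffffffff)); }
    static int     is_int(Value v)    { return (v & (QNAN | TAG_INT)) == (QNAN | TAG_INT); }

    int main(void) {
        Value a = box_int(41), b = box_double(1.5);
        printf("%d %f %d %d\n", unbox_int(a), unbox_double(b), is_int(a), is_double(b));
        return 0;
    }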
Reference counting also accounts for a good amount of memory bandwidth and branches (even if well predicted) when decrementing refs.
There are many other factors that essentially make the stack-vs-register difference not that pronounced at the moment. As things improve, I'm sure costs will shift and a register-based VM will be the next low-hanging fruit. I haven't been paying too much attention to the developments there, as I have since moved on to other pastures, so there might even be more recent discussions somewhere with fresher data that's better than my unreliable memories :)
I’m curious about what kinds of real world database queries would benefit from a JIT. Query performance is usually dominated by I/O, and the CPU-intensive parts are usually native code doing things like sorting and hashing.
Yes, they do benefit. Warm pages are typically in memory and intermediate results are (almost) never written to disk. For OLAP workloads, this is almost always beneficial (even for small data sizes), but OLTP workloads can benefit too. Many industry databases (e.g., SAP HANA, Amazon Redshift, SQL Server's Hekaton, Salesforce Hyper, CedarDB) use query compilation to get substantial throughput improvements.
Disclosure: I work in TUM's database group, whose head published the original work on query compilation.
This old paper about Sawzall (an old filter language that Google used to use heavily) is also a good read: https://static.googleusercontent.com/media/research.google.com/sv//archive/sawzall-sciprog.pdf