Where are all the low-level JITs?
67 points by MatejKafka
In theory, a good JIT should, over a sufficiently long run time, outperform a native AOT-compiled binary, as it has more information about the code, processed data, and actual hardware. Similarly, a mix of manual memory management with GC should be potentially faster than a purely manual approach, due to a more efficient heap structure and allocator.
However, most state-of-the-art JITs & runtimes today are instead used to make somewhat performance-hostile high-level languages run acceptably (e.g., JVM, CLR, V8,...), rather than attempting to compete with AOT compilers for lower-level languages (C++/Rust/Zig/...).
This leads me to wonder: where are all the low-level runtime-based languages? Were there any serious attempts at developing a lower-level, runtime-targeted language with the explicit goal of beating existing native languages on throughput? If not, why not?
My current hypothesis is that for sufficiently hand-optimized code in a low-level language, there are just not that many optimizations a JIT can do compared to a modern AOT compiler to make it worth it, but as someone mostly unfamiliar with JIT and runtime internals (and pretty familiar with AOT compiler internals), I'm looking for more qualified opinions. :)
One I know of is SynthesisOS, by Alexia Massalin, written in 1992. You can think of it as a JITted kernel, where one can construct custom system calls at runtime. It's written in 68030 assembly code, and it could run SunOS executables faster than SunOS could on the same hardware. If you have time, her thesis (warning: PDF link) makes for fascinating reading. Unfortunately, finding more information, or even the implementation, is difficult if not downright impossible.
I've heard of SynthesisOS before, and got curious ... and the impressive recall of Gemini pieced together a story. (I think I've Googled it before, but simple searches did not produce a story)
She joined a company/startup called MicroUnity (founded in 1988). This company didn't succeed in the marketplace, but managed to produce valuable intellectual property
https://en.wikipedia.org/wiki/MicroUnity
They eventually won a $300 million settlement from Intel for patent infringement
https://www.susmangodfrey.com/attorneys/max-l-tribble/
For example, Tribble represented his client MicroUnity in a hard-fought patent infringement action against Intel and forced Intel to pay his client $300 million to license the MicroUnity patents. In follow-up cases, Tribble represented MicroUnity in cases against Apple, Samsung, AMD, and Texas Instruments—each of which also paid to license the MicroUnity patents.
And also performed what Gemini calls a "sweep" of the industry. Their strategy was to prove in court that NOBODY could build a smartphone without infringing on their patents. So they sued EVERYONE, and secured licensing deals. That's kinda crazy
This seems to be poorly covered? Not widely known?
https://www.eetimes.com/microunity-sues-mobile-giants-for-patent-infringement/
https://www.zdnet.com/article/chip-start-ups-big-payoff-comes-in-at-last/
Chip start-up's big payoff comes in, at last
MicroUnity gets a $300 million settlement from Intel, 17 years after it was tipped as one of the most promising ventures in Silicon Valley.
"If you want to talk about innovative, they were off the charts. They were looking at making major changes to the microarchitecture and fabrication techniques. They just had too many radical ideas in one package," Gwennap said. "But from a technology standpoint, they had some really innovative stuff."
She has filed and been awarded patents as recently as 2017
https://patents.google.com/?inventor=alexia+massalin&oq=alexia+massalin
This reminds me a bit of Mill Computing - https://millcomputing.com/
I don't think they were close to a product, but they survived for a long time based on producing intellectual property
The patents can be very valuable, so investors will continue to support the company
But yes there is no code for SynthesisOS
The patents can be very valuable
Producing IP but no actual product just sounds like patent trolling, tbh. I could see how a company like that falls into obscurity.
Well, I think the intention matters ... if you intended to build a product or not
Intellectual Ventures by Myhrvold (former CTO of Microsoft) was explicitly created to file patents, and had no products
MicroUnity definitely tried to create products ... but they were too early, innovated on too many dimensions at once, or both
I think it's not well known because, in Silicon Valley, people don't like/respect people who file patent lawsuits. So MicroUnity didn't do much publicity -- they filed the lawsuits relatively quietly, and "won" essentially all of them. (If you lose most/all of your lawsuits, that is a sign you are a patent troll)
But winning lawsuits / licensing deals against all those huge companies is a pretty unique and unusual story.
If you lose most/all of your lawsuits, that is a sign you are a patent troll
The successful patent trolls usually get the companies they are suing to settle out-of-court, before it even goes to trial. The successful patent troll companies also have a deep enough bank account that they are OK with this process dragging on for years.
Well, they managed to get ~22 companies including Apple, Samsung, AMD, etc. to license their IP
Those companies have the deepest pockets in the world, and are experienced and adept at fighting off patent trolls
If you're a patent troll, it would be wise to shake down someone a bit weaker than that :-) Companies like Apple have more money, time, and expertise than you ...
My current hypothesis is that for sufficiently hand-optimized code in a low-level language, there are just not that many optimizations a JIT can do compared to a modern AOT compiler to make it worth it
There’s also profile guided optimization as a middle ground that is AOT compilation but taking runtime information into account during code generation. It is supported by rustc, for example.
So the benefit of a JIT compiler would be that the profile would be automatically recalculated on every execution. The downside is that you’d be paying for the complexity of the JIT compiler and the overhead of always running what is essentially an instrumented executable in production.
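For concreteness, the instrument / run / recompile cycle with clang looks roughly like this (rustc's -Cprofile-generate / -Cprofile-use flags are analogous); the toy program and file names are made up:

// hot.cpp -- toy workload; the profile records which way the branch usually goes,
// so the optimized rebuild can lay out the cold path out of line.
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
    long long sum = 0;
    long long n = argc > 1 ? std::atoll(argv[1]) : 100000000LL;
    for (long long i = 0; i < n; ++i) {
        if (i % 1000 == 0) sum -= i;   // cold path
        else               sum += i;   // hot path
    }
    std::printf("%lld\n", sum);
}

// 1. Instrumented build:  clang++ -O2 -fprofile-instr-generate hot.cpp -o hot
// 2. Representative run:  LLVM_PROFILE_FILE=hot.profraw ./hot
// 3. Merge raw profiles:  llvm-profdata merge -o hot.profdata hot.profraw
// 4. Optimized rebuild:   clang++ -O2 -fprofile-instr-use=hot.profdata hot.cpp -o hot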
On top of profile-guided compilation there is also profile-guided linking: https://github.com/llvm/llvm-project/tree/main/bolt. When Facebook first announced it, they saw a 2-15% improvement in performance (source).
I found it very surprising that you could do enough at the linking step, where you have lost any intermediate representation, to get such perf benefits. But in practice, it's more like JITing. You get to see instructions across translation units, which the compiler doesn't, and with the profiling information you get a much clearer understanding of the hot spots, so you can optimize the layout accordingly.
This comes from a paper ages ago that showed that a load of compiler 'optimisation' papers were actually measuring noise. They demonstrated that compiling with -ffunction-sections and -fdata-sections and then randomising the order had (I think) a ±15% performance delta on vaguely modern systems due to cache aliasing, branch predictor aliasing, TLB misses, and so on. A load of papers were showing a 10% speedup from their optimisation, but if you randomised the layout of the before and after versions then the speedup wasn't statistically significant: they were just showing a speedup due to different code layout, not the result of their transform. It also explained why some seemingly promising techniques appeared to result in a slowdown on SPEC. As I recall, they later released a tool to do these measurements automatically given two compiler versions (it might have been a different group), but I've rarely seen it used in more recent papers (I would reject any optimisation paper that doesn't do it or similar, but that's why I don't review for compiler conferences).
Relaying from IRC from aw1621107:
Hello all! I saw this comment (https://lobste.rs/c/ipltrd) by david_chisnall and I think I know the paper/tool that is mentioned in the first part of the comment (Stabilizer https://github.com/ccurtsinger/stabilizer by Charlie Curtsinger and Emery Berger from University of Massachusetts Amherst, paper at https://people.cs.umass.edu/~emery/pubs/stabilizer-asplos13.pdf)
Yup, that's the one! Thank you aw1621107! Someone give that alphanumeric string an account!
Shouldn't there be big wins possible by systematically exploiting that variance and optimizing the code layout for performance rather than treat it as some kind of noise that disturbs the measurements of the "real" optimizations?
Yes, iiuc that's one of the things that PGO tends to do. ie measure what code/data tends to be related to each other, and rearrange the binary to put them close together in cache. (Oh, this is mentioned in a sibling comment too.)
Are you talking about link-time optimisation (LTO) or traditional linking but with the layout guided by a profile? Most of your comment makes it sound like the latter, but this sentence makes it sound like the former:
You get to see instructions across translation units, which the compiler doesn't
This is the advantage link-time optimisation has, but it doesn't really have anything to do with linking, it's just an optimiser that happens to run at link time. Benefits from smarter layout guided by a profile aren't really related to the linker seeing instructions across translation units AFAIU.
The second link explains it’s somewhere in between:
BOLT rearranges code inside functions based on their execution profile. Then the body of the function can be split based on how frequently the code is executed. Once this is done, the final step is to perform an optimal layout of the hot chunks of code depending on the call graph profile.
I just wish there were sufficiently smart PGO compilers that would tell me where I could get better performance if I loosened some of the constraints. Or where I've narrowly missed vectorization. Obviously not across the whole codebase, but in the hot spots.
And sometimes I wish the compiler drilled down to the hot spot and tried various degrees of loop unrolling to find the best one instead of me having to do it by hand. Especially for vectorization.
I guess that at some point we will have to move onto more high level languages that communicate intent and constraints instead of high-level operations. And have the compiler figure out construction methods and their costs.
As in: "Zip these two vectors, not necessarily materialized, take mag^2 and materialize. They have same length, power of two, at least 512."
1. This can be done via superoptimization, but I am unaware of techniques to keep the high-level information alongside the assembly.
2. As I understand it, an efficient and exact algorithm based on profiling information would need a timing model and processor simulator, depending on how accurate the result should be. But I might be wrong, and simpler approaches might work well.
3. Constraint-based programming is very common for hardware design, where time steps are defined by clock cycles. What would this intent constrain besides optimization time? Does this only cover native execution, or also a simulated CPU+cache model (static, simulation, hardware), like how hardware design works?
Ad 2., LLVM already has timing models for popular uarches that it uses for instruction ordering, and even exposes them in llvm-mca. Intel used to provide a similar tool, can't remember the name off the top of my head.
What would this intent constrain besides optimization time?
I think floating point operation ordering matters, for instance. And sometimes you can guarantee e.g. no NaNs are present, so the compiler wouldn't have to ensure correctness in their presence, which might enable some otherwise unsafe transformations. The ability to omit length precondition checks is also nice. Or to indicate you already have correct alignment.
In some cases you might be reading & writing shared memory but not actually require perfect coherency. I've been doing a lighting sim on an MCU with two cores and the collisions on cache access were so infrequent that I've just traded super rare corrupted pixels for a decent speedup.
Ah, so you want to use assumptions on:
1. Ordering constraints for floats (similar to atomic ordering constraints).
2. Assumptions on values (ranges) can be done via __builtin_assume(isfinite(x)); see https://stackoverflow.com/a/79708202. Or what is missing there?
3. There is LLVM memref.assume_alignment. Or what is missing?
4. You can read+write shared memory in a racy way and annotate your intention with __attribute__((no_sanitize("thread"))) or __builtin_allow_sanitize_check("thread"). Or what do you think is missing?
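For the racy-framebuffer case specifically, relaxed atomics already express "racy on purpose, I accept stale values" without undefined behaviour, and on most targets they compile down to plain loads and stores. A minimal sketch:

#include <atomic>
#include <cstdint>

// Pixels shared between the two cores. Relaxed operations add no ordering or
// locking; readers may simply observe a slightly stale value, never a torn one.
std::atomic<std::uint16_t> framebuffer[320 * 240];

void write_pixel(int idx, std::uint16_t rgb565) {
    framebuffer[idx].store(rgb565, std::memory_order_relaxed);
}

std::uint16_t read_pixel(int idx) {
    return framebuffer[idx].load(std::memory_order_relaxed);
}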
I'm aware of PGO and occasionally use it, but it has various issues that a proper JIT could alleviate:
The DX is not great: The instrumentation typically has a significant overhead, so you can't really profile in production. Therefore, you have to set up a test scenario that roughly matches production use (and keep it up to date), run it for each build to gather the profile, and then rebuild with it (since profiles afaik don't translate very well between source code changes). The tooling itself for PGO is also not great; you have to set up most of it yourself. As a result, while JITs are ubiquitous and everyone in high-level languages uses them whether they're aware of it or not, PGO is still somewhat uncommon for AOT-compiled software.
PGO cannot adapt to runtime data: In your profiled scenario, maybe you exercised two different code paths that call a single function with different types of data, but the actual user only ever uses one of those; the JIT can specialize for it, and still bail out and later re-optimize the function when the second code path is invoked (see the sketch after this list).
You still can't target specific hardware and dynamic library versions: A JIT knows the exact uarch, how much cache you have, whether you're memory-constrained. Somewhat separately, it can also JIT FFI calls to remove translation overhead that's hard to avoid in AOT compilation.
PGO cannot "experiment" with compilation: Not sure if this is utilized today, but in theory, a JIT could try to recompile a function multiple times with various block and instruction ordering, levels of inlining, enabling different optimizations,... to effectively "brute-force" the optimal compilation for that function on the specific target hardware.
My instinctual answer to the initial question is "context." A static compiler for a language like Go or C has vastly more context than you typically do at runtime, and the compilers tend to do an exceptional job. So if you were to compile to some non-machine code, it needs to be something that can be compiled into similar machine code quickly but also retains enough context to make micro-optimization decisions that end up providing more benefit than the cost of it all.
It might be an interesting exercise to show a program and dataset that PGO does a great job with, and then another dataset that PGO does a worse job with. I'm a fan and I've used it on projects, but I don't know that I've ever seen a solid 10% improvement from using it, and I don't know if I've ever seen it make something worse (other than build complexity). The hardware guys have also been working on branch predictors and caches and things to help with all of this stuff too.

You should be able to measure what you'd expect for your second point: I'm thinking a little program that does something with a hot loop, like a quicksort or zstd or something, and then 2 datasets and 2 PGO builds that optimize for each dataset. I suspect it's not hard to see a difference, but I suspect it's probably challenging to see a big difference. Maybe I'm wrong, and we're due for a compiler revolution, but a lot of energy has been put into compilers and modern ones are really good.
There’s a lot of runtime overhead in having a JIT at all. Code generation, remapping pages as executable, keeping fine grained stats of which control flow paths are taken, etc. A JIT also has to coexist with a regular interpreter since otherwise startup times are terrible.
So it makes sense to me that a JIT is a win only when performance without one is much, much slower, as in interpreted languages.
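To put the mechanical part of that overhead in concrete terms, the bare minimum a JIT does on Linux/x86-64 looks roughly like this (hand-assembled bytes for a function that just returns 42; error handling omitted):

#include <cstdio>
#include <cstring>
#include <sys/mman.h>

int main() {
    // mov eax, 42 ; ret  -- hand-assembled x86-64
    unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

    // 1. Get a writable page, 2. copy the generated code in,
    // 3. flip it to read+execute, 4. call it through a function pointer.
    void* page = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    std::memcpy(page, code, sizeof code);
    mprotect(page, 4096, PROT_READ | PROT_EXEC);

    auto fn = reinterpret_cast<int (*)()>(page);
    std::printf("%d\n", fn());   // prints 42

    munmap(page, 4096);
}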
The main use case I had in mind was a long-running server, where you expect to keep the process running for weeks at a time, so wasting some time during periods of lower load to optimize everything is not that much of an issue, if you can get better performance as a result.
The second option would be to cache the JIT state between process invocations, as Android afaik does. Azul also does something similar.
This isn’t exactly what you are asking for, but dynamic translation such as https://dynamorio.org/ or https://www.qemu.org/ or https://valgrind.org/ are roughly speaking low-level JITs. However they typically aim to provide functionality that the hardware doesn’t, without too much overhead. But there have been reports of programs running faster under dynamic translation when the source and target instruction sets were the same and no extra fancy features were introduced.
WRT why not, the main optimization that JITs can do and AOT compilers can’t is dynamically discovering opportunities for monomorphization. What we do instead is profile-guided link-time optimization. So I suppose the question could be reframed as, do programs written in low-level languages have enough dynamic calls that it would be profitable to reoptimize at runtime?
DynamoRIO in particular is probably something to look further into, since it came out of the HP Dynamo project, which was designed with the goal of speeding up arbitrary low-level programs by leveraging dynamic recompilation and runtime information.
It just also was a massive failure, mostly because doing that for blackbox binaries like they cared about is very difficult because you've already destroyed all memory aliasing information. You really want something more like holyjit (or just wasm) where you keep around the IR that resulted in your original binary, so that it can be recompiled by the JIT.
For most cross-arch emulators a JIT is basically a requirement, as you can't just interpret individual instructions for any console past the 5th gen (and even that is iffy); there's just not enough compute to do that. PCSX2, RPCS3, Dolphin, Melon, Yuzu, ... all use JITs (although the latter can avoid JITs when running some / many games on some ARM64 targets)
WASM sounds like what you're talking about? The JVM and CLR are not compiling Java or C#, they're compiling an already mostly compiled bytecode. Only fully dynamic, raw source based languages have JITs that go from source to execution - the nature of these languages means that the initial source->bytecode conversion is much, much faster than is possible for languages like Rust or C++.
WASM really is what you're looking for: it is a low level bytecode format that languages like C, C++, rust, etc can target - it's literally part of the llvm toolchain.
JIT compiling a language like Rust or C++ from source means spending as much time - and memory - as a debug build of that code, in the best case, but given you're trying to use a JIT for performance, you're looking at at least as much time as a release build, and the even greater amounts of memory that requires.
e.g you don't JIT Rust, C++, ... you would JIT a lower level bytecode you have compiled the higher level language to.
In addition to WASM I believe there are a few other JITs that try to do similar things, but they lack the engineering backing of the big JS engines, which are where the fastest WASM JITs are.
[edit: I forgot the bigger thing - JITs create a very large attack surface. The reason JITs are safe in non-C/C++/Rust languages is that those languages enforce runtime safety semantics which prevent the JITs themselves from being attacked; the languages you're referencing have no memory safety in the sense that matters here, and even Rust allows unsafe operations, which undermine the security of the JIT. WASM supports these higher-level languages by ensuring that they are still fundamentally running in a VM, just without a built-in object model]
JVM bytecode isn't "mostly compiled", my understanding is that it's more or less just a binary representation of Java. It's why Java decompilers are so exceptionally good. So by going JVM bytecode -> native instead of source -> native, the JVM is really just saving on some parsing work.
I believe C# is the same.
This is actually a change as Java systems have evolved. Earlier versions of javac did quite a lot of optimisation on the bytecode. Then newer versions of the JRE ended up spending a lot of time undoing them to get the code back into an easier shape to optimise. Now, javac does only a handful of trivial optimisations (e.g. constant folding). This was a problem for earlier versions of Android, which did a fairly straightforward translation from Java (stack-based) bytecode to Dalvik (register-based) bytecode and then interpreted that. Android went through a bunch of AoT (install-time) and JIT compilers and now, I think, uses a caching JIT, so it now prefers the not-very-optimised javac output.
The early .NET marketing played up the idea of doing install-time compilation a lot. This is quite a nice idea because you can have binaries that are tailored to your specific CPU model (not just a single distribution format for Alpha and x86 Windows, but also one where a Pentium II and a K6 could have different tuning in the binary). Unfortunately, it turned out to be much too slow. It's not uncommon for a complex desktop app to need at least tens of minutes of CPU time to compile, and hours are not unusual. Spending this time on every client is not great, and starting with an interpreter then moving hot paths to the JIT gives a more immediate experience.
For iOS, Apple moved to doing this in the App Store. You upload LLVM IR and the first time anyone downloads it for a specific model of iOS device their system will produce an optimised version for that specific device class. It then caches it, so you don't do the duplicated work that .NET originally proposed. I don't know if this uses profiling information. We explored the possibility of building something for FreeBSD where we'd do some very lightweight statistical profiling in the kernel and upload it to the package build machines so that they could aggregate it and do profile-guided rebuilds. We never shipped it because we couldn't come up with a solution to avoid people uploading malicious profiling traces to make everyone slow. Apple has a solution to that because their secure boot chain is signed by them and gives remote attestation of devices.
For a while, about 20 years ago, I played with a variant I called 'just too late compilation', where an interactive system (one where you modified the code as the program ran) would do interpreting and some mild JITing, but would then do AoT-style compilation when the program exited so the next time you ran it you'd get a fast and optimised version of the end state of your interactive development model (you could then keep modifying it, and the bits you changed would be a bit less fast).
The early .NET marketing played up the idea of doing install-time compilation a lot. This is quite a nice idea because you can have binaries that are tailored to your specific CPU model (not just a single distribution format for Alpha and x86 Windows, but also one where a Pentium II and a K6 could have different tuning in the binary). Unfortunately, it turned out to be much too slow. It's not uncommon for a complex desktop app to need at least tens of minutes of CPU time to compile, and hours are not unusual. Spending this time on every client is not great, and starting with an interpreter then moving hot paths to the JIT gives a more immediate experience.
This is not at all true. In fact, it's closer to the opposite of what you said.
First, the product you're referring to here is the .NET desktop framework, i.e. the version of .NET that ships in Windows. Since the beginning of .NET it was widely recognized that startup was a big problem for JITed languages -- the bytecode is not directly executable so you have to spend time to JIT it. There are two main ways to address this problem: (1) interpret the bytecode so execution can start immediately, or (2) precompile the bytecode to native code before it runs (e.g., at install time).
.NET has always relied heavily on (2). In .NET Framework this process is called NGEN and it has been in Windows since .NET 1.0. It runs on install of the .NET Framework and it precompiles the entire framework. As to why it runs on install -- it's mostly not about producing highly optimized code for different architectures. .NET interoperates with a large number of different native APIs and platform features in Windows, and this interaction can result in subtly different generated code for the target platform. Combined with the huge number of Windows machines and huge number of Windows configurations, that means that the best strategy to compile code for every possible client configuration was to compile the code on the machine at install time.
This process can take a while -- hours or maybe a day -- but this is mostly because the NGEN process runs in the background at low priority, it has to compile the entire .NET Framework, and the whole thing isn't well optimized for finishing as quickly as possible. However, this wasn't about apps. You certainly could queue your app for NGEN but this is a relatively rare thing and would produce significant results only for very large apps, since the startup path of most applications is heavily dominated by code in the framework, not the application. For various reasons it also requires admin permissions, so there are also plenty of reasons not to do it.
Now, to switch over to modern cross-platform .NET (.NET 5+), we have a replacement technology that we call "ReadyToRun" or "crossgen". In contrast to NGEN, it's specifically designed to remove those platform dependencies and intermodule dependencies that required code to be compiled on the target machine. Nowadays we ship with the framework precompiled for all the targets, so the benefit is provided for free to every .NET application. It's still an essential part of the startup story -- .NET is much slower to start without precompilation.
Finally, you mention interpretation. .NET has never had an interpreter. Interpretation and precompilation serve many of the same purposes and we went the precompilation route, so there wasn't a lot of reason to build it. This is in contrast to Java, which has a smaller precompilation story and a much larger interpreter story. The other reason for this, aside from path dependence, is that it's much harder to build a good interpreter for .NET IL vs Java bytecode. Reified generics, value types, and specialization mean that the .NET IL is much more complicated, so building a good interpreter requires a lot more work. Moreover, offering generic specialization as a feature causes .NET users to write code that depends on it for performance, and interpreting that code can sometimes mean slowdowns so large that the entire performance class of the application changes.
.NET Framework was in fact very simple: it had no tiering (only one JIT optimization mode), plus precompilation. There were no optimizations for "on-stack replacement" where a particular method could be recompiled on the fly into a more optimized form.
This is no longer true for modern .NET -- we have multiple JIT tiers and regularly recompile and replace code -- but precompilation is still a critical part of our startup performance story.
Re install-time compilation, isn't this exactly what Android currently does to Dalvik bytecode? IIUC when you install a program it interprets it or maybe does a quick-and-dirty compilation so you can use it immediately, while in the background it compiles it to target-specific native code. Do you know anything about why this seems to work for Android and not for 2000's-era Windows?
(After doing some wikipedia-crawling it appears that the Android Runtime is the component responsible for this now, replacing the Dalvik VM, and the pipeline is more complicated than I thought. It doesn't necessarily AoT the entire program to native code, just portions of it, and it will also try to JIT some of the code as the program is running. I suspect I could happily spend years digging into the details of how it works.)
Re install-time compilation, isn't this exactly what Android currently does to Dalvik bytecode?
I am not completely up to date here, but I believe they moved away from this. This is what the 'optimising system performance' thing was. I believe they now do JIT compilation and cache bits of JIT state so that the second time you run a program the hot bits are already compiled (and the cold bits can be interpreted or JIT'd if desirable).
Do you know anything about why this seems to work for Android and not for 2000's-era Windows?
The ratio between program size and CPU power has changed. A typical Android phone has at least a quad-core CPU that runs at around 2 GHz and is at least as fast as a Pentium III, clock-for-clock, probably faster. My computer in 2000 was a single-core 550 MHz Pentium III. Desktop apps remain a lot more complex than most mobile ones. And Android apps tend to offload performance-critical parts (and parts that need to be shared between Android and iOS or Android and desktop versions) to native code. Again, it's been ages since I looked, but back then all of the top 100 most-popular Android apps included NDK code. So if your performance-critical parts are shipped as native binaries, the install-time compiler isn't stressed as much. The main reason for doing it on Android was not performance relative to a JIT, it was doing all of the JIT-like work at a time when you were on battery (they'd redo it overnight periodically while the phone was charging too).
my understanding is that it's more or less just a binary representation of Java.
This isn't true. The JVM has a stack-based instruction set, just like WASM. Unlike WASM, though, JVM bytecode doesn't require managing memory (as you have a GC) and it understands the concept of objects and inheritance, which simplifies that aspect of compilation.
Here's an example of Java Bytecode:
Compiled from "SimpleAlgorithm.java"
public final class SimpleAlgorithm {
public SimpleAlgorithm();
Code:
0: aload_0
1: invokespecial #1 // Method java/lang/Object."<init>":()V
4: return
public static void main(java.lang.String[]);
Code:
0: aload_0
1: arraylength
2: ifle 12
5: aload_0
6: iconst_0
7: aaload
8: astore_1
9: goto 17
12: iconst_1
13: invokestatic #2 // Method java/lang/System.exit:(I)V
16: return
17: aload_1
18: invokestatic #3 // Method java/lang/Integer.parseInt:(Ljava/lang/String;)I
21: istore_2
22: new #4 // class DivisorPrinter
25: dup
26: iload_2
27: invokespecial #5 // Method DivisorPrinter."<init>":(I)V
30: astore_3
31: aload_3
32: invokevirtual #6 // Method DivisorPrinter.print:()V
35: return
A simple compiler generating a stack-based bytecode at the level of objects and method calls means the output for most statements is pretty close to a postorder traversal of the syntax tree, which is easily rendered into concrete syntax. Then recognize some consistent patterns like the gotos generated by if statements, and hey! — a (primitive) decompiler!
I haven't explored WASM much, but my understanding was that the JIT compilation is pretty rudimentary, right? That is, the WASM runtime assumes that the bytecode was already optimized by a full AOT compiler and it "just" needs to translate it to native code with local optimizations, without inlining, vectorization, devirtualization and similar larger-scale transformations.
I guess the real question is what optimizations you think the JITs are doing that would meaningfully benefit lower level languages, enough to compensate for the extremely large costs of runtime codegen.
The entire model of WASM is designed around ahead of time compilation because its express purpose is safely running languages like C and C++ which have no significant runtime object model and the operations of these lower level languages are inherently unsafe. The major JITs do perform a lot of optimizations during lowering - bounds check reduction/removal, register allocation, load propagation, etc but the whole point is that it exists to support those lower level languages, while being safe, so the assumption is the AOT compiler has managed the performance aspects.
The way JITs for compiled languages are able to perform reasonably is by precompiling to a bytecode; the requirement for those bytecode languages is that they are verifiably safe, and that's no different than wasm. The difference is that wasm is designed for languages like C++ and Rust, and so allows "arbitrary" pointers, but those pointers are entirely within the confines of the wasm VM.
If you wanted to make a jit for rust or c++, the first step would be making your ahead of time compiler compile those down to a compact bytecode, presumably cross platform, that includes enough information about the object model to allow you to execute the code safely - if you don't care about safety you could skimp on some of that, but to benefit from the kinds of optimizations things like the JVM do you'll need to carry in runtime information that rust and c++ generally don't include.
Realistically, for languages like C or C++, or to an even greater extreme Rust, the optimizations a JIT can do that meaningfully help performance, and that an AOT compiler cannot do, are fairly minimal - devirtualization is the biggest one, and that's only possible by generating code that is slower than it might otherwise need to be, in order to re-virtualise later if needed. If you don't need to worry about dynamically loading libraries that may require revirtualisation, then profile-guided optimization gets you more or less everything the JVM or CLR can do, with higher performance everywhere else as well.
The optimizations in JS engines are simply irrelevant to C, C++, etc., and the optimizations performed by the JVM or CLR are mostly able to be performed ahead of time by compilers for those lower level languages, because the optimizations being done in the JVM and CLR are the lowering from a safe IR to the inherently unsafe instructions of the ISA - for instance, converting the array-index IR instructions into pointer arithmetic.
Perhaps GraalVM is what you are looking for
But GraalVM is still a JVM, so you have to target Java bytecode, which is fairly high-level and won't let you express some of the operations common in native languages, right? Or did I misunderstand the project?
Where are all the low-level JITs?
Baked into the very circuitry of the microchip you're running your code on (register file allocation, speculative execution, branch prediction, etc.)
This is only a somewhat tongue-in-cheek response: many of the supposed advantages that JIT-ed code has over AOT code vanish when you've got a sufficiently smart CPU, as most of us do nowadays.
Julia uses LLVM at runtime to generate specialized code as a sort of JIT. It does this for flexibility and dynamism rather than exceeding AOT performance though (it typically matches AOT performance instead). Sometimes this flexibility makes it quicker to implement more ambitious algorithms which can have better performance but that’s sort of a separate consideration.
My suspicion is that, in practice (though not in theory), a JIT is heavily constrained by compile time, and with that constraint it can't compete with AOT. HotSpot already gets "it's slow to warm up" complaints, and a hypothetical AOT-competitive or AOT-beating JIT would be even slower to warm up.
The way JavaScript engines solve this is by having multiple levels. You have the interpreter which starts running the code immediately, while a fast, naïve, mostly-non-optimising first stage compiler compiles the program, then once that's done, it spends more time compiling the program with increasingly aggressive optimisations, prioritising the hottest functions. It still takes time to properly warm up completely, but you get pretty good performance pretty quickly.
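As a toy illustration of the tier-up mechanism (all names made up, and a real engine swaps in machine code rather than a function pointer): count invocations in the cheap tier and promote the function once it's hot.

#include <cstdio>
#include <functional>

// Tier 0: "interpreter" -- always available, slow.
long long sum_interpreted(long long n) {
    long long s = 0;
    for (long long i = 0; i < n; ++i) s += i;   // imagine bytecode dispatch here
    return s;
}

// Higher tier: "optimizing compiler" output -- closed form, available later.
long long sum_optimized(long long n) { return n * (n - 1) / 2; }

std::function<long long(long long)> sum_impl = sum_interpreted;
int call_count = 0;

long long sum(long long n) {
    if (++call_count == 1000)       // hotness threshold reached:
        sum_impl = sum_optimized;   // promote to the faster tier
    return sum_impl(n);
}

int main() {
    long long total = 0;
    for (int i = 0; i < 2000; ++i) total += sum(10000);
    std::printf("%lld\n", total);
}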
Copy-pasting my comment from https://lobste.rs/c/epkw3z:
The main use case I had in mind was a long-running server, where you expect to keep the process running for weeks at a time, so wasting some time during periods of lower load to optimize everything is not that much of an issue, if you can get better performance as a result.
The second option would be to cache the JIT state between process invocations, as Android afaik does. Azul also does something similar.
To my infra/sysops eyes, this looks like a nightmare: inconsistent performance, a penalty on deployment that can take at least minutes. For anything where performance is critical I'd need to allocate resources to make it fast enough on start, which would then be overprovisioned after it warms up. Deployments or restarts would need careful choreography. Caching the JIT state would not help when deploying a new version, would be tricky to use when scaling up (when I add a new replica, how do I initialize its JIT state? From one of the existing replicas?). And the state cache smells like very interesting failure modes, too interesting to debug when paged at 3AM. Reproducing performance issues in this setup also sounds like a lot of fun.
I'll take AOT-compiled binary that behaves consistently at 80% or 60% performance of the hypothetical JIT without magic runtime state or startup penalty, thank you very much.
The Azul JVM does something very close to what I suggested (caching JIT-compiled Java bytecode between runs, using LLVM as a JIT compiler, with very expensive warmup), and it seems to be pretty popular in server space.
As for deploying updates, you'll typically do that outside of peak load, where higher resource usage is less of a problem; some of the cached JIT state for the previous version could be reused even for the new version (for unchanged functions, which will be the majority), and unless you have a breaking change, you could most likely stage the rollout to avoid a single spike.
Depends on the operational context, and performance gain from JIT, I guess (OTOH, it's kinda balanced: if JIT benefit is small, then penalty for initially running uncompiled code is also small, so it's just JIT overhead). Feels similar to a relational DB: on a cold restart (like an upgrade) it will underperform until it can run ANALYZE. I learned to plan Postgres maintenance for low traffic window to account for that.
But, at least in my projects, Postgres maintenance doesn't happen more frequently than once a month. Application servers are under constant development and sometimes are deployed multiple times a day. With this kind of startup penalty we'd need to switch to a different release cycle and deal with the fact that when new code is deployed to prod, very few engineers are present (low traffic window is outside working hours). On top of that, releasing a quick fix for an incident (an incident won't wait for a low traffic window) wouldn't be as easy.
Yeah, I just realized that it's not exactly a "you expect the process running for weeks at a time" service you had in mind. If it was actually weeks, like it is for the DB, it would be more acceptable.
The best way to do that with existing technology is to use some of the non-invasive profiling tools. Intel has a processor-trace thing that can generate a stream of every branch source and target. There's tooling to turn that into LLVM or GCC profiling data, so it would be moderately easy to turn this on when load is high but not quite 100% (you don't want to run it when load is low because that's not the work profile when performance matters). The overhead while generating the trace is pretty small, but it then generates an interrupt when the trace buffer is full and you have to store it somewhere. You could quite easily generate smallish (< 1 GiB) traces periodically as samples over a day or two, then crunch them during periods of low load and rebuild the service using PGO. Then just replace the binaries and restart.
Perhaps not what you are looking for, but a lot of GPU APIs jit IRs to machine code at runtime. Old style graphics APIs even jit source code at runtime.
Afaik, that's not really a JIT, but AOT compilation just before the code is submitted to the GPU, right? This still means that it can optimize for the specific hardware it's executed on, but there's no feedback loop.
To me the lack is mostly historical:
WebAssembly is the main candidate currently, but wasm isn't running the world yet either. The JVM and CLR are kinda mis-designed; they weren't supposed to be performance-hostile, but they ended up that way in practice as Moore's Law bogged down and "high performance" became all about dealing with caches and expensive pointer operations. There was also a false association between "compiles to native code" and "inconvenient low-level language"; there were precisely zero mainstream languages that compiled to native code besides C and C++, for no particularly good reason. Go, Rust and friends showed that wasn't necessary, so a lot of interest and development in the late 2010s and early 2020s turned towards "why don't we just AOT compile to rockin' native code?" and we're slowly re-exploring the limitations of that approach.
Oh, it is also far far easier to design, build and compile for a performance-hostile high-level VM like JVM and CLR. Dynamic dispatch and "everything is a pointer" representations (or "most things are a pointer", for CLR) smooths over a huge multitude of problems.
I do think there is a niche for such things though. I think an Electron-ish desktop app framework built around webassembly and a good JIT (with cached machine code) would be pretty useful in the current desktop ecosystem, especially if you include something along the lines of Flatpak that also handles dependency management and sandboxing.
This isn't really what you're asking for, but there are notable cases of "JITs that operate on low level code" in emulators. For example rpcs3 basically lifts the game's machine code and dumps it into llvm. It tries to do it at first boot to avoid stuttering, though, so it's really a half way point between AOT and JAOT. And of course it's not a programming target.
yeah, this! most emulators use jitting in some form, and the JIT is crucial for their performance. QEMU is an actual JIT, it lifts guest instructions into its IR (called tcg), which is then optimized and turned into host instructions. It even does a bit of static analysis of every translated basic block to be able to optimize more (it has a coarse range domain and a known bits abstract domain).
Another JIT along similar lines is Valgrind. The architecture is pretty similar: it lifts instructions to its IR, instruments the IR, then produces machine code again. The motivation is different of course. QEMU does this to emulate other processors, Valgrind does this to instrument the running code (e.g. for profiling or for finding bugs).
However, most state-of-the-art JITs & runtimes today are instead used to make somewhat performance-hostile high-level languages run acceptably (e.g., JVM, CLR, V8,...), rather than attempting to compete with AOT compilers for lower-level languages (C++/Rust/Zig/...).
JavaScript is a performance-hostile language, but Java and C# both are performance-friendly. Generally, what slows down Java and C# isn't the language, but rather the fact that most developers allocate 17 billion objects instead of 0 (the number they'd dynamically allocate if they were decent C++ developers). One of my engineering teams ported Java code to C++ (keeping the same allocation-happy logic) and it was 3x slower, because C++ memory management (new) is much slower than Java's Hotspot compiler's arena ("slab allocator") implementation.
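To make the allocator point concrete: the JVM's allocation fast path (a thread-local allocation buffer) is conceptually just bump-pointer allocation, which you can of course also do by hand in C++. A rough sketch of the idea, not production code:

#include <cstddef>
#include <cstdlib>
#include <new>

// Bump-pointer arena: allocation is a pointer increment, and "deallocation"
// is resetting or freeing the whole region at once -- the same idea as a
// JVM thread-local allocation buffer or a per-frame game arena.
class Arena {
    char*       base_;
    std::size_t used_ = 0;
    std::size_t cap_;
public:
    explicit Arena(std::size_t cap)
        : base_(static_cast<char*>(std::malloc(cap))), cap_(cap) {}
    ~Arena() { std::free(base_); }

    void* alloc(std::size_t n, std::size_t align = alignof(std::max_align_t)) {
        std::size_t p = (used_ + align - 1) & ~(align - 1);  // round up to alignment
        if (p + n > cap_) return nullptr;                    // out of arena space
        used_ = p + n;
        return base_ + p;
    }
    void reset() { used_ = 0; }                              // drop everything at once
};

struct Vec3 { float x, y, z; };

int main() {
    Arena arena(1 << 20);
    for (int i = 0; i < 1000; ++i)
        new (arena.alloc(sizeof(Vec3), alignof(Vec3))) Vec3{1.f, 2.f, float(i)};
    arena.reset();   // all 1000 objects gone in O(1)
}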
At any rate, your statement is an example of "begging the question", i.e. a logical fallacy.
This leads me to wonder: where are all the low-level runtime-based languages? Were there any serious attempts at developing a lower-level, runtime-targeted language with the explicit goal of beating existing native languages on throughput? If not, why not?
Memory has a cost. The overhead of a JIT is quite literally orders and orders of magnitude for many smaller programs. I can write a simple program in C, and it runs in a few 10s of kilobytes of RAM in total. The same thing with a JIT might take 16MB or more.
And look at the "benchmark games" website, where it measures the speed of things by how long something runs, i.e. until termination is complete. So anyone attempting to use a JIT (let alone a profile-guided JIT) gets laughed out of town. No, it's not logical, but that is the reality that we live with.
But there are low-level JITs. Cranelift is an example for WASM and it's pretty good. Cliff Click (author of HotSpot) is working on a JIT for his compiler book project ("simple") and separately for his FP language ("aa"). There are a number of JITs being built on top of LLVM, too, although I haven't played around with any of them (except one by Azul for Java).
but Java and C# both are performance-friendly.
I somewhat disagree with that statement. You can write high-performance software in both Java and C#, but idiomatic Java and C# are fairly performance-hostile (lots of indirection, bad memory locality, APIs designed for convenience over raw performance,...). I do write low-level C# fairly frequently, and it involves ignoring significant parts of the standard library, especially on older .NET versions. Java is even worse, given that it still does not have value types, so even something as trivial as ArrayList<Vec3> needs to be hand-rolled using arrays unless you want to have double pointer dereference on each read and also waste a lot of memory.
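In C++ terms, the layout today's Java forces on an ArrayList<Vec3> is the second of the two below; a small illustration (Vec3 is made up):

#include <vector>

struct Vec3 { float x, y, z; };

// What std::vector<Vec3> gives you: one contiguous block of 12-byte values,
// so iteration streams through memory and vectorizes nicely.
float sum_x_flat(const std::vector<Vec3>& v) {
    float s = 0;
    for (const Vec3& p : v) s += p.x;
    return s;
}

// What ArrayList<Vec3> means on today's JVM: an array of references, each
// element a separate heap object -- every access is an extra dereference
// and a likely cache miss, plus per-object header overhead.
float sum_x_boxed(const std::vector<const Vec3*>& v) {
    float s = 0;
    for (const Vec3* p : v) s += p->x;
    return s;
}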
As for the rest of your comment, yes, there are obviously situations where JITs are not the right option. There's also a huge number of applications where no one will notice if a process takes 16 MB more RAM, especially if it leads to better performance.
Is there a reason you're ignoring profile-guided optimization in ahead-of-time compilers? It usually generates relatively modest speed-ups and it has significant operational challenges, but it's a lot more applicable in the traditional domains of lower-level systems languages.
Not exactly the same, but related and hopefully interesting: Nvidia's ARM CPUs (Tegra, I think Tegra K1) included a CPU-level JIT technology, compiling (and recompiling using guided profiling) the ARM ISA to the underlying uOPs. The CPU would literally allocate something like 64MB and run an invisible OS-like thingy to facilitate it. It is not very well documented publicly; the best I could find ATM: https://hothardware.com/news/nvidias-64bit-tegra-k1-the-ghost-of-transmeta-rides-again
Like some people already answered there are JITs for lower-level, compiled languages (for instance in LLVM). However I think the optimal language for a JIT is a simple dynamic language, for two reasons:
To have a fast tracing JIT you need a fast interpreter. This is often the most problematic part: being able to run the code fast enough while collecting the data necessary to trace. A good example of this is LuaJIT.
Dynamism enables the JIT to optimize things by using data types based on actual code usage at runtime. For instance, say your code takes a floating point number but in practice you always use it with small integers at least in some context, the JIT can specialize for this. In theory you can do this with a statically typed language (I think HotSpot does something like this) but it makes more sense to just start from something relatively dynamic.
LuaJIT
If anyone is interested in what's possible with JIT, they should really check out LuaJIT. It is an amazing piece of engineering, with outstanding performance.
I recently tried LuaJIT and it was slower than the Lua interpreter for my workload. I'm not sure it's a great showcase for JITs in general.
Ah, maybe my information is out of date. A decade ago, the LuaJIT interpreter (written in assembly) was significantly faster than the PUC-Rio C-based implementation, and JIT could boost performance from there. I know that the PUC-Rio guys were working on improvements, but I haven't kept up-to-date on the latest developments.
It’s not quite a JIT because it requires a separate build pass but Profile Guided Optimization does exist for statically compiled languages and can show significant improvements.
PGO was already mentioned in another comment with longer discussion: https://lobste.rs/c/55dioz
I think the very pithy answer is: theory and practice are the same. In theory.
The theoretical (and benchmark) benefits of JITs have been difficult to translate into practice at anything like the theoretical promise. On real-world code, the benefits tend to be less than promised, sometimes a lot less, and the costs tend to be significant.
I wrote about this phenomenon in 2015: Jitterdämmerung.
As others have said, for statically compiled languages you already have a lot of information at compile time, and a JIT cannot add too much value at runtime.
I was thinking about what kinds of optimizations you could actually do at runtime: maybe only some constant folding, or slight partial evaluation for functions that are always called with the same parameters, but that may cost too much memory to keep around and not give enough benefit.
Another piece of information that may be more valuable is array (and tensor) sizes; I think Mojo is doing something in this space. There are optimizations like loop reorderings that can be done "cache-efficiently" only when the loop sizes are known (like this crazy cwise project does).
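A poor man's version of that size specialization is already expressible by hand, which also shows what a JIT could generate automatically for whatever sizes it actually observes at runtime. A rough sketch with made-up kernel names:

#include <cstddef>

// Fixed-size kernel: N is a compile-time constant, so the loop can be fully
// unrolled and vectorized with no trip-count checks.
template <std::size_t N>
void axpy_fixed(float a, const float* x, float* y) {
    for (std::size_t i = 0; i < N; ++i) y[i] += a * x[i];
}

// Generic kernel for everything else.
void axpy_any(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) y[i] += a * x[i];
}

void axpy(float a, const float* x, float* y, std::size_t n) {
    switch (n) {                 // sizes specialized ahead of time; a JIT would
        case 256:  return axpy_fixed<256>(a, x, y);    // pick these from the
        case 1024: return axpy_fixed<1024>(a, x, y);   // observed profile instead
        default:   return axpy_any(a, x, y, n);
    }
}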
I don't have a complete answer, but my guess is that it's a combination of the following:
This page is absolutely related: https://bitbashing.io/gc-for-systems-programmers.html
In particular, we learn from it that the Linux kernel does have a GC hidden inside: RCU
That (otherwise decent) article omitted a major drawback of deferred reclamation (GC): elevated memory usage. Little’s Law applies to GC just as it does to queues: the amount of memory wasted by garbage in steady state is directly proportional to the latency of reclamation. Memory usage is a major consideration for a lot of “systems software”, so it makes sense that systems programmers tend to be skeptical of deferred reclamation in general. That’s why deferred reclamation strategies like RCU, epochs, or hazard pointers tend to be used only in niche scenarios like lock-free data structures or highly concurrent and latency-sensitive readers.
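To put rough numbers on that: Little's Law says L = λW, so a service allocating short-lived objects at λ = 100 MB/s with a reclamation latency of W = 2 s carries about L = 200 MB of dead-but-unreclaimed memory at any instant, and halving the reclamation latency halves that overhead.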
RCU, and similar algorithms necessary to make lock-free data structures work were a major motivation for asking the question. :)
I will go a bit further: low-level (closer to hardware) does not automatically mean that performance is critical. Imagine e.g. a device driver - it does 1) some initialization / setup / handshake at the beginning, plus maybe some runtime configuration, and 2) passes data through (e.g. frames from a camera or audio samples). Only for 2) is the performance critical, but the logic there is usually trivial (copy one buffer to another). On the other hand, for 1) the performance does not matter at all – even the slowest JIT (or even a plain interpreter) would be fast enough to configure the device, but the logic is relatively complex and error-prone. And in many cases, you do not even need a garbage collector – region-based memory management is a simpler and good-enough solution for temporary tasks.
There is more Unix-nature in one line of shell script than there is in ten thousand lines of C.
OS kernels can benefit from a "scripting language" in the same way that user-space benefits from shell (over C).