mimalloc: A new, high-performance, scalable memory allocator for the modern era
23 points by eatonphil
It doesn't look like there's much new there. The tricks described in the post for making the fast path really fast look like a mix of the ones that snmalloc adopted from mimalloc and the ones mimalloc adopted from snmalloc 5+ years ago.
My favourite one that we copied from mimalloc that avoided a branch on the fast path was to statically initialise the thread-local block pointer to an immutable empty allocator (an empty free list in mimalloc). This means that the check that takes you off the fast path is the check for there being memory available in the local cache to allocate. After you fall off the fast path, you do the TLS-needs-initialising-with-proper-mutable-state check, because there one extra branch is very cheap: you're already on a code path that's hit 0.1% of the time or less. Obvious in hindsight.
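A minimal sketch of that trick, assuming a single fixed block size and a plain `malloc` refill. All names here (`local_cache_t`, `empty_cache`, `alloc_fast`, `alloc_slow`, `slow_path_calls`) are made up for illustration; this is not the actual mimalloc or snmalloc code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical types and names, for illustration only. */
typedef struct block { struct block *next; } block_t;
typedef struct local_cache { block_t *free_list; } local_cache_t;

enum { BLOCK_SIZE = 64, REFILL_COUNT = 8 };

/* Shared, immutable, always-empty cache. Every thread starts out pointing
 * here, so the fast path never needs an "is TLS initialised?" branch. */
static const local_cache_t empty_cache = { NULL };

static _Thread_local local_cache_t real_cache;
/* The cast is safe only because an empty free list is never written to. */
static _Thread_local local_cache_t *tls_cache = (local_cache_t *)&empty_cache;

static int slow_path_calls; /* demo counter, not part of the technique */

static void *alloc_slow(void) {
    slow_path_calls++;
    /* We only get here when the current free list is empty. */
    assert(tls_cache->free_list == NULL);
    /* The lazy-initialisation check lives here, on the rare path. */
    if (tls_cache == &empty_cache)
        tls_cache = &real_cache;
    /* Refill: carve a fresh chunk into fixed-size blocks. */
    char *chunk = malloc((size_t)BLOCK_SIZE * REFILL_COUNT);
    if (chunk == NULL)
        return NULL;
    /* Block 0 goes to the caller; blocks 1..N-1 go onto the free list. */
    for (int i = REFILL_COUNT - 1; i >= 1; i--) {
        block_t *b = (block_t *)(chunk + (size_t)i * BLOCK_SIZE);
        b->next = tls_cache->free_list;
        tls_cache->free_list = b;
    }
    return chunk;
}

static inline void *alloc_fast(void) {
    local_cache_t *c = tls_cache;
    block_t *b = c->free_list;
    if (b != NULL) {            /* the only branch on the fast path */
        c->free_list = b->next;
        return b;
    }
    return alloc_slow();        /* empty list or uninitialised TLS */
}
```

The first call on a thread sees the empty list, falls into `alloc_slow`, swaps in the mutable per-thread cache and refills it. After that, `alloc_fast` pops from the free list until it runs dry again, so "uninitialised" and "out of memory in the local cache" share one check.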
Speaking of snmalloc, do you know how it stacks up to mimalloc? Mimalloc seems to be a lot more popular, but I don't know if that's due to superiority in the common cases or like, random chance.
The two share a lot of ideas (both adopted ideas from the other). The main difference is that mimalloc treats slabs as the first-class objects and has freeing threads do remote frees, whereas snmalloc has a notion of an allocator as an object that manages a set of slabs for a thread and batches up remote objects to be sent back to the allocator that owns them.
In terms of performance, they tend to be close, one better for one workload the other for another, but not much overall difference.
In terms of flexibility, mimalloc is a malloc implementation. Snmalloc is a toolbox of reusable allocator components that you can use to implement a variety of different allocators tuned for different purposes, which includes a malloc implementation as an example use case. You couldn’t easily use mimalloc to do the kind of thing that we did in the Verona process-based sandboxing code to allow the parent process to cheaply allocate memory in the child’s space and the child to free it (I wish someone would do this for a WAsm runtime).
This is a bit weird. From the “new” in the title I thought this would be an article from several years ago, but no, it’s dated last month. It says mimalloc was designed in 2020 but my link log says 2019. It’s amusing that they mention Google’s tcmalloc and FreeBSD’s / Facebook’s jemalloc, but not ~david_chisnall’s / Microsoft’s other open source malloc snmalloc. And why are they censoring the benchmark?
it is relatively small (~12K lines)
LOL the original paper bragged about being only ~3K LOC, funny how simplicity never survives first contact with production.
That's about the size of snmalloc, but snmalloc is written as a toolbox for building different allocators, with a couple of reference designs that give malloc APIs in a couple of places in the design space (one more tuned for security, one for performance), and includes implementations of memcpy and memmove (again, as building blocks for creating performance-tuned versions for different targets) that can do automatic bounds checks for heap objects.
Errr.. a note - this started as being an example of how supporting multiple platforms can very quickly swell a code base, but devolved into memory lane for me.
It also depends on how much comes in from supporting multiple platforms. If you remember the original blink post about how much code they removed from webkit/blink, a lot of people talked about it as if it was an amazing bit of engineering that they were prevented from doing due to being constrained by being in webkit. I believe they talked about deleting multiple millions of lines of code.
That is technically true.
One thing they removed was JavaScriptCore of course - it would be easier to think of this change as trading the line count of jsc with v8
But the rest of what they removed:
Support here is important: things like the core ui components, text entry, interaction with platform pasteboards are all platform native (things that do need to be handled directly within webkit have interfaces in the platform abstraction layer that ensure that each platform has a correct - for that platform - implementation of the behavior). But I think Qt alone - the most mature port - was in the order of a million or so LoC, to give some sense of scale.
Removing all of these ports means removing all of the build systems: Xcode, visual studio, make, make files (of course at this point webkit also had all of chrome's build system files).
The API layer is a big one. Chromium/blink's API surface is outside of webkit/blink, and doesn't provide a platform native ui element, so you essentially have to make electron style apps - doing better requires a large amount of engineering. The WebKit API makes the UI library's WebView element the API; the thread separation, sandboxing, etc all occur underneath the API, so you can just have an arbitrary WebView in any place of any app, with no more impact than any other ui element in your UI library. This one is actually a major difference: Gecko and Blink are both set up as essentially "the browser process is the app". The "the browser interface is an entirely standard UI element" model was a core design decision that defined the design of webkit (I don't know the history - that's before my time, but I would not be surprised if the original selection of khtml rather than gecko is because khtml was designed with that model).
The big impact there of course is that chromium had its entire threading+sandboxing layer, and webkit had its own as well, which was functionally doing the same thing, so it would have been insane to keep that around :D
One other note (as I was primarily working on JSC at the time) is that removing JSC also meant removing JSC's native support for a large number of architectures:
The things both supported:
JSC's full pipeline (the native interpreter, base line JIT, and the optimizing DFG JIT) also supported
[I think ibm had an out of tree Power backend but I'm not sure?]
JSC did have PowerPC support at some point but I think it had already been removed before the webkit/blink split.
JSC also has a last leg of support: it always had an interpreter, which is written in a pseudo-assembly with per-architecture backends, but also a C backend to support everything else.
In terms of platforms, I think/assume Symbian and blackberry are gone at this point, and JSC has dropped at least SH4 and MIPS (I think it hit points where it just wasn't maintained, and changes to JSC started requiring significant additional support in the architecture layer, which just isn't possible to do if there are no architecture experts).
Of course webkit itself also got to remove code, but at a much smaller scale:
But you can imagine the code reduction difference of "removing support for 11 platforms" vs removing support for most of one. Most, because again of the retention of the skia backend, but again, webkit's platform abstraction layer is designed specifically to support multiple backends, so the complexity cost of an additional graphics library (or the networking library, font library, ui element drawing, etc) is very low.
Out of curiosity I just checked what platforms are still supported - a bunch of them have been removed now. The absolute dominance of chrome means that for most of the previously supported libraries it just was not practical or valuable to maintain their port, so they ended up having to be removed (it's a lot of work, especially for the smaller projects with few people, and with chrome now there's just no point in doing so).
The only non-apple platforms that are still complete ports at this point are WebKitGTK, WPE (apparently the web platform for embedded?), the playstation (???), and windows. Qt, Wx, EFL all seem to be gone.
Similarly JSC has dropped a bunch of architectures (everything else now uses the generic C interpreter backend - which works on every architecture - which is a fast interpreter, but not at the same level as the architecture specific interpreter, and of course comparing the performance to the JITs is just silly), it only supports the boring big architectures now:
So that's kind of sad - I remember working with folk from those other platforms (especially the UI toolkits), and they were really great, and it was really great watching and helping them while they implemented their platform layer. It was also great for webkit architecturally as it exposed weaknesses and gaps in the platform abstraction.
Anyway, this is a giant trip down memory lane for me, and is obviously far more text than was warranted for the purpose of the brief example I had intended it to be :D
SH4 (SuperH) - even more bizarre than MIPS, I think this is a chip used in things like car stereos?
Don't forget the Sega Dreamcast (and its Arcade counterpart, the NAOMI).
oh wow, I simply didn't know that - I guess when that was first added to JSC it was a little after the dreamcast era :D
I've had mimalloc in production for several months as part of a tool for processing large sets of near-realtime netflow data. I can't go into a lot of detail, but the difference between mimalloc and jemalloc was pretty noticeable, at least in that a lot more memory is getting released over time. As always, YMMV, but I've been happy with it.
Snmalloc and mimalloc really shine in multithreaded producer-consumer workloads.
Jemalloc is a thread-caching allocator. If thread A is allocating and B is freeing, A allocates a big chunk of memory to populate its cache and then does fast-path allocations from it. B frees the memory to its local cache until the cache is too big. Then it returns memory to the global pool. A can exhaust its local cache and have to go back to the global pool to refill it.
With a message-passing allocator, B passes the memory back to A, and A reuses it. This involves less global synchronisation and avoids the cases of A exhausting its pool frequently.
We’ve seen some multithreaded transaction-processing workloads get close to a 2x overall speedup moving from jemalloc to snmalloc.
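A rough sketch of the message-passing idea, assuming one lock-free list of returned blocks per owning allocator. The names (`owner_alloc_t`, `remote_free`, `owner_alloc`) are hypothetical, and this is much simpler than what snmalloc really does (no batching, no size classes); it only shows B handing blocks straight back to A without touching a global pool.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical types and names, for illustration only. */
typedef struct rblock { struct rblock *next; } rblock_t;

typedef struct owner_alloc {
    rblock_t *local_free;             /* touched only by the owning thread */
    _Atomic(rblock_t *) remote_free;  /* other threads push freed blocks here */
} owner_alloc_t;

static void owner_init(owner_alloc_t *a) {
    a->local_free = NULL;
    atomic_init(&a->remote_free, (rblock_t *)NULL);
}

/* Called by thread B: push the block onto owner A's remote list.
 * Synchronisation is per-owner; no global pool or lock is involved. */
static void remote_free(owner_alloc_t *owner, void *p) {
    rblock_t *b = p;
    rblock_t *head = atomic_load_explicit(&owner->remote_free,
                                          memory_order_relaxed);
    do {
        b->next = head; /* head is refreshed by a failed compare-exchange */
    } while (!atomic_compare_exchange_weak_explicit(
                 &owner->remote_free, &head, b,
                 memory_order_release, memory_order_relaxed));
}

/* Called by thread A: when the local list is empty, take everything that
 * other threads have sent back in a single atomic exchange. */
static void *owner_alloc(owner_alloc_t *a) {
    if (a->local_free == NULL)
        a->local_free = atomic_exchange_explicit(
            &a->remote_free, (rblock_t *)NULL, memory_order_acquire);
    rblock_t *b = a->local_free;
    if (b != NULL)
        a->local_free = b->next;
    return b; /* NULL here means "go and get fresh memory" */
}
```

In a producer-consumer setup, A keeps calling `owner_alloc` and B keeps calling `remote_free` on A's allocator, so the same blocks cycle between the two threads and A rarely has to refill from anywhere else.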
This has been the default allocator in Chimera Linux for a few years now (since 2022). It has worked well in that time.
I switched iocaine from jemalloc to mimalloc a few releases ago, and on x86-64, it's been working well, no complaints. On aarch64 however, the combination of mimalloc & cranelift & a rather large Roto script reliably leads to crashes. I ended up disabling mimalloc on aarch64, and using the standard allocator instead. Less performant, higher memory use, but doesn't crash, and that wins.