mimalloc: A new, high-performance, scalable memory allocator for the modern era

23 points by eatonphil

It doesn't look like there's much new there. The tricks described in the post for making the fast path really fast look like a mix of the ones that snmalloc adopted from mimalloc and the ones mimalloc adopted from snmalloc 5+ years ago.

My favourite one that we copied from mimalloc, which avoided a branch on the fast path, was to statically initialise the thread-local pointer to point at an immutable empty allocator (an empty free list in mimalloc). This means that the check that takes you off the fast path is the check for there being memory available in the local cache to allocate. After you fall off the fast path, you do the TLS-needs-initialising-with-proper-mutable-state check, because there one extra branch is very cheap: you're already on a code path that's hit 0.1% of the time or less. Obvious in hindsight.
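
A minimal sketch of that trick in C, assuming a single size class and a plain singly linked free list. Every name here (`local_alloc`, `tls_alloc`, `sketch_alloc`, `slow_path_hits`) is hypothetical and not taken from either allocator; it only shows where the two checks end up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names throughout; this is not mimalloc or snmalloc code. */
typedef struct free_node { struct free_node *next; } free_node;
typedef struct local_alloc { free_node *free_list; } local_alloc;

enum { OBJ_SIZE = 64, REFILL_COUNT = 8 };

/* Shared, never-written allocator whose free list is permanently empty. */
static const local_alloc empty_alloc = { NULL };

/* Each thread's real, mutable state; unused until the first slow path. */
static _Thread_local local_alloc real_alloc;

/* Statically initialised, so the fast path never asks "is TLS set up?". */
static _Thread_local local_alloc *tls_alloc = (local_alloc *)&empty_alloc;

/* Counts slow-path entries so the behaviour is observable. */
static _Thread_local int slow_path_hits;

static void *alloc_slow(void) {
    slow_path_hits++;
    /* We only get here when the free list is empty. */
    assert(tls_alloc->free_list == NULL);
    /* The initialisation check lives here, off the fast path. */
    if (tls_alloc == &empty_alloc) {
        tls_alloc = &real_alloc;
    }
    /* Refill: carve one chunk into fixed-size objects. The chunk is never
       returned to the system in this sketch. */
    char *chunk = malloc((size_t)OBJ_SIZE * REFILL_COUNT);
    if (chunk == NULL) {
        return NULL;
    }
    for (int i = 1; i < REFILL_COUNT; i++) {
        free_node *n = (free_node *)(chunk + (size_t)i * OBJ_SIZE);
        n->next = tls_alloc->free_list;
        tls_alloc->free_list = n;
    }
    return chunk; /* object 0 is handed out directly */
}

void *sketch_alloc(void) {
    local_alloc *a = tls_alloc;
    free_node *n = a->free_list;
    if (n != NULL) {          /* the only branch on the fast path */
        a->free_list = n->next;
        return n;
    }
    return alloc_slow();      /* empty cache or uninitialised TLS */
}
```

The empty allocator is never written to: its free list is always `NULL`, so any thread still pointing at it goes straight to `alloc_slow`. With `REFILL_COUNT` set to 8, the first call takes the slow path once and the next seven calls stay on the fast path.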

lonjil

Speaking of snmalloc, do you know how it stacks up against mimalloc? Mimalloc seems to be a lot more popular, but I don't know if that's due to superiority in the common cases or, like, random chance.
- david_chisnall
  
  The two share a lot of ideas (both adopted ideas from the other). The main difference is that mimalloc treats slabs as the first-class objects and has freeing threads do remote frees, whereas snmalloc has a notion of an allocator as an object that manages a set of slabs for a thread and batches up remote objects to be sent back to the allocator that owns them.
  
  In terms of performance, they tend to be close: one is better for one workload, the other for another, but there's not much overall difference.
  
  In terms of flexibility, mimalloc is a malloc implementation. Snmalloc is a toolbox of reusable allocator components that you can use to implement a variety of different allocators tuned for different purposes, which includes a malloc implementation as an example use case. You couldn’t easily use mimalloc to do the kind of thing that we did in the Verona process-based sandboxing code to allow the parent process to cheaply allocate memory in the child’s space and the child to free it (I wish someone would do this for a WAsm runtime).
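  
  The slab-centric side of that comparison can be sketched roughly as follows: a freeing thread that does not own the slab pushes the object straight onto an atomic list on the slab, and the owner collects that list when its own list runs dry. This is a simplified illustration with hypothetical names (`slab`, `slab_free_remote`, `slab_alloc`), not mimalloc's actual data structures.

  ```c
  #include <assert.h>
  #include <stdatomic.h>
  #include <stddef.h>

  /* Hypothetical types; a simplified sketch, not mimalloc's real layout. */
  typedef struct block { struct block *next; } block;

  typedef struct slab {
      block *local_free;             /* touched only by the owning thread */
      _Atomic(block *) remote_free;  /* pushed to by any other thread */
  } slab;

  void slab_init(slab *s) {
      s->local_free = NULL;
      atomic_init(&s->remote_free, (block *)NULL);
  }

  /* Free on the owning thread: plain pointer writes, no atomics. */
  void slab_free_local(slab *s, void *p) {
      assert(p != NULL);
      block *b = p;
      b->next = s->local_free;
      s->local_free = b;
  }

  /* Free on a non-owning thread: one CAS loop per object, directly on the slab. */
  void slab_free_remote(slab *s, void *p) {
      assert(p != NULL);
      block *b = p;
      block *old = atomic_load_explicit(&s->remote_free, memory_order_relaxed);
      do {
          b->next = old;
      } while (!atomic_compare_exchange_weak_explicit(
          &s->remote_free, &old, b,
          memory_order_release, memory_order_relaxed));
  }

  /* Owner-side allocation: take the whole remote list when the local one is empty. */
  void *slab_alloc(slab *s) {
      if (s->local_free == NULL) {
          s->local_free = atomic_exchange_explicit(
              &s->remote_free, (block *)NULL, memory_order_acquire);
      }
      block *b = s->local_free;
      if (b == NULL) {
          return NULL;
      }
      s->local_free = b->next;
      return b;
  }
  ```

  Because other threads only ever push and the owner only ever takes the entire list with a single exchange, this shape avoids the ABA problem that a general lock-free pop would have to deal with.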
fanf

This is a bit weird. From the “new” in the title I thought this would be an article from several years ago, but no, it’s dated last month. It says mimalloc was designed in 2020 but my link log says 2019. It’s amusing that they mention Google’s tcmalloc and FreeBSD’s / Facebook’s jemalloc, but not ~david_chisnall’s / Microsoft’s other open source malloc snmalloc. And why are they censoring the benchmark?
tobin_baker

it is relatively small (~12K lines)

LOL the original paper bragged about being only ~3K LOC, funny how simplicity never survives first contact with production.
- david_chisnall
  
  That's about the size of snmalloc, but snmalloc is written as a toolbox for building different allocators, with a couple of reference designs that give malloc APIs in a couple of places in the design space (one more tuned for security, one for performance), and includes implementations of memcpy and memmove (again, as building blocks for creating performance-tuned versions for different targets) that can do automatic bounds checks for heap objects.
- olliej
  Errr.. a note - this started as being an example of how supporting multiple platforms can very quickly swell a code base, but devolved into memory lane for me.
  
  It also depends on how much comes in from supporting multiple platforms. If you remember the original blink post about how much code they removed from webkit/blink, a lot of people talked about it as if it was an amazing bit of engineering that they were prevented from doing due to being constrained by being in webkit. I believe they talked about deleting multiple millions of lines of code.
  
  That is technically true.
  
  One thing they removed was JavaScriptCore of course - it would be easier to think of this change as trading the line count of jsc with v8
  
  But the rest of what they removed:
  
  The large amount of code and complexity that was introduced to support two completely different JS engines (and the bindings code gen for one version). The two engine requirement added so much complexity to the code in the DOM implementation.
  
  Support for:
  
  Native Windows (this applies to macOS below as well: blink uses what is essentially its own OS, with its own networking stack, its own graphics stack, font rendering, its own complete UI library, .... that's why all electron apps are such absolute garbage)
  
  Native macOS
  
  Gtk+
  
  Qt
  
  EFL
  
  "Linux" - this isn't a full platform, the UI Libraries above are doing the bulk of the work, but because webkit handles the multi process work as part of the WebView, it obviously needs a per-underlying OS implementation of that
  
  WxWindows
  
  WinCE
  
  iOS
  
  Symbian/Nokia -- though to be fair I'm not even sure why webkit had those still? The platform abstraction layer was reasonably robust so it's possible that retaining it just didn't carry any cost? I really can't recall
  
  Blackberry
  
  Support here is important: things like the core ui components, text entry, interaction with platform pasteboards are all platform native (things that do need to be handled directly within webkit have interfaces in the platform abstraction layer that ensure that each platform has a correct - for that platform - implementation of the behavior). But I think Qt alone - the most mature port - was on the order of a million or so LoC, to give some sense of scale.
  
  Removing all of these ports means removing all of the build systems: Xcode, visual studio, make, make files (of course at this point webkit also had all of chrome's build system files).
  
  The API layer is a big one. Chromium/blink's API surface is outside of webkit/blink, and doesn't provide a platform native ui element, so you essentially have to make electron style apps - doing better requires a large amount of engineering. The WebKit API makes the UI library's WebView element the API; the thread separation, sandboxing, etc all occur underneath the API, so you can just have an arbitrary WebView in any place of any app, with no more impact than any other ui element in your UI library. This one is actually a major difference: Gecko and Blink are both set up as essentially "the browser process is the app". The "the browser interface is an entirely standard UI element" model was a core design decision that defined the design of webkit (I don't know the history - that's before my time, but I would not be surprised if the original selection of khtml rather than gecko is because khtml was designed with that model).
  
  The big impact there of course is that chromium had its own entire threading+sandboxing layer, and webkit had its own as well, that is functionally doing the same thing, so it would be insane to keep that around :D
  
  One other note (as I was primarily working on JSC at the time) is that removing JSC also meant removing JSC's native support for a large number of architectures:
  
  The things both supported:
  
  x86
  
  x86-64
  
  ARMv7 (the 32bit architecture)
  
  MIPS (which still seems bizarre to me) - I think MIPS only had the baseline JIT (not the optimizing DFG JIT), whereas I think MIPS (the company) provided a full implementation of MIPS in V8, but I could be wrong?
  
  JSC's full pipeline (the native interpreter, baseline JIT, and the optimizing DFG JIT) also supported
  
  ARM's Thumb2
  
  ARM64
  
  SH4 (SuperH) - even more bizarre than MIPS, I think this is a chip used in things like car stereos?
  
  [I think ibm had an out of tree Power backend but I'm not sure?]
  
  JSC did have PowerPC support at some point but I think it had already been removed before the webkit/blink split.
  
  JSC also has a last leg of support: it always had an interpreter, which is written in a pseudo-assembly with per-architecture backends, but also a C backend to support everything else.
  
  In terms of platforms, I think/assume Symbian and blackberry are gone at this point, and JSC has dropped at least SH4 and MIPS (I think it hit points where it just wasn't maintained, and changes to JSC started requiring significant additional support in the architecture layer which just isn't possible to do if there are no architecture experts)
  
  Of course webkit itself also got to remove code, but at a much smaller scale:
  
  Support for V8 in the DOM bindings and similar parts of the web api implementations - this really was the single biggest win: the complexity of supporting two JS engines (and their respective completely different GCs) cannot be overstated; it required huge amounts of invasive functionality across pretty much everything in webkit.
  
  Support for much of the chrome/blink platform, but not Skia, because I think one of the above platforms did actually use Skia rather than their native graphics subsystem, which is entirely fair: Skia is a really good graphics library, and it supports everything you need in a browser, and especially for the smaller ui libraries (I think Wx did this) it would be really really hard to get a graphics library to the level of functionality and performance of Skia.
  
  The chromium build system (.gyp or .gn files, I can't recall what it was at that particular point in history - I think this was early days of ninja, so I can't recall if all the build files were ninja at that point)
  
  But you can imagine the code reduction difference of "removing support for 11 platforms" vs removing support for most of one. "Most" because, again, of the retention of the skia backend; but webkit's platform abstraction layer is designed specifically to support multiple backends, so an additional graphics library (or networking library, font library, ui element drawing, etc) has very low complexity overhead.
  
  Out of curiosity I just checked what platforms are still supported - a bunch of them have been removed now. The absolute dominance of chrome means that for most of the previously supported libraries it just was not practical or valuable to maintain their port, which means they just ended up having to be removed (it's a lot of work, especially for smaller projects with few people, and with chrome now there's just no point in doing so).
  
  The only non-apple platforms that are still complete ports at this point are WebKitGTK, WPE (apparently the web platform for embedded?), the playstation (???), and windows. Qt, Wx, EFL all seem to be gone.
  
  Similarly JSC has dropped a bunch of architectures (everything else now uses the generic C interpreter backend - which works on every architecture - which is a fast interpreter, but not at the same level as the architecture specific interpreter, and of course comparing the performance to the JITs is just silly), it only supports the boring big architectures now:
  
  arm64
  
  x86_64
  
  risc-v (the 64bit one)
  
  The 32bit arms: armv7 and thumb2
  
  x86
  
  So that's kind of sad - I remember working with folk from those other platforms (especially the UI toolkits), and they were really great, and it was really great watching and helping them while they implemented their platform layer. It was also great for webkit architecturally as it exposed weaknesses and gaps in the platform abstraction.
  
  Anyway, this is a giant trip down memory lane for me, and is obviously far more text than was warranted for the purpose of the brief example I had intended it to be :D
  - FedericoSchonborn
    
    SH4 (SuperH) - even more bizarre than MIPS, I think this is a chip used in things like car stereos?
    
    Don't forget the Sega Dreamcast (and its Arcade counterpart, the NAOMI).
    
    olliej
    
    oh wow, I simply didn't know that - I guess when that was first added to JSC it was a little after the dreamcast era :D
    
    lgerbarg
    
    I think it is also used in some Carrier Infinity thermostats.
  - seabre
    
    I've had mimalloc in production for several months as part of a tool for processing large sets of near-realtime netflow data. I can't go into a lot of detail, but the difference between mimalloc and jemalloc was pretty noticeable, at least in that a lot more memory is getting released over time. As always, YMMV, but I've been happy with it.
    
    david_chisnall
    
    Snmalloc and mimalloc really shine in multithreaded producer-consumer workloads.
    
    Jemalloc is a thread-caching allocator. If thread A is allocating and B is freeing, A allocates a big chunk of memory to populate its cache and then does fast-path allocations from it. B frees the memory to its local cache until the cache is too big. Then it returns memory to the global pool. A can exhaust its local cache and have to go back to the global pool to refill it.
    
    With a message-passing allocator, B passes the memory back to A, and A reuses it. This involves less global synchronisation and avoids the cases of A exhausting its pool frequently.
    
    We’ve seen some multithreaded transaction-processing workloads get close to a 2x overall speedup moving from jemalloc to snmalloc.
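    
    A rough sketch of the message-passing idea from this comment, assuming one owning allocator (thread A) and one freeing thread (B) that collects objects privately and sends them home in a batch. The names (`allocator`, `remote_batch`, `batch_send`, `owner_alloc`) are hypothetical and the structure is far simpler than snmalloc's real message queues; the point is only that B touches shared state once per batch rather than once per object, and never goes through a global pool.

    ```c
    #include <assert.h>
    #include <stdatomic.h>
    #include <stddef.h>

    /* Hypothetical types; a simplified sketch, not snmalloc's real structures. */
    typedef struct node { struct node *next; } node;

    typedef struct allocator {
        node *free_list;        /* owner-only, no synchronisation */
        _Atomic(node *) inbox;  /* objects sent home by other threads */
    } allocator;

    /* A freeing thread's private batch of objects owned by one allocator. */
    typedef struct remote_batch {
        allocator *owner;
        node *head, *tail;
        size_t count;
    } remote_batch;

    void allocator_init(allocator *a) {
        a->free_list = NULL;
        atomic_init(&a->inbox, (node *)NULL);
    }

    void batch_init(remote_batch *b, allocator *owner) {
        b->owner = owner;
        b->head = b->tail = NULL;
        b->count = 0;
    }

    /* Thread B: remember the object locally; no atomics, no shared writes. */
    void batch_add(remote_batch *b, void *obj) {
        node *n = obj;
        n->next = b->head;
        if (b->head == NULL) {
            b->tail = n;
        }
        b->head = n;
        b->count++;
    }

    /* Thread B: hand the whole batch to its owner with a single CAS loop. */
    void batch_send(remote_batch *b) {
        if (b->head == NULL) {
            return;
        }
        assert(b->tail != NULL);
        node *old = atomic_load_explicit(&b->owner->inbox, memory_order_relaxed);
        do {
            b->tail->next = old;
        } while (!atomic_compare_exchange_weak_explicit(
            &b->owner->inbox, &old, b->head,
            memory_order_release, memory_order_relaxed));
        b->head = b->tail = NULL;
        b->count = 0;
    }

    /* Thread A: return an object it owns to its own free list. */
    void owner_free(allocator *a, void *obj) {
        node *n = obj;
        n->next = a->free_list;
        a->free_list = n;
    }

    /* Thread A: allocate, pulling returned objects in when the local list is empty. */
    void *owner_alloc(allocator *a) {
        if (a->free_list == NULL) {
            a->free_list = atomic_exchange_explicit(
                &a->inbox, (node *)NULL, memory_order_acquire);
        }
        node *n = a->free_list;
        if (n == NULL) {
            return NULL;
        }
        a->free_list = n->next;
        return n;
    }
    ```

    In a producer-consumer workload, A keeps calling `owner_alloc` and B keeps calling `batch_add`, with an occasional `batch_send`. The memory cycles between the two threads directly, which is the behaviour contrasted with the thread-caching design above.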
  - wezm
    
    This has been the default allocator in Chimera Linux for a few years now (since 2022). It has worked well in that time.
  - algernon
    
    I've switched iocaine from jemalloc to mimalloc a few releases ago, and on x86-64, it's been working well, no complaints. On aarch64 however, the combination of mimalloc & cranelift & a rather large Roto script reliably leads to crashes. I ended up disabling mimalloc on aarch64, and using the standard allocator instead. Less performant, higher memory use, but doesn't crash, and that wins.
    
    david_chisnall
    
    We test snmalloc quite a bit on AArch64, so might be worth trying.