smalloc: a simple memory allocator
35 points by asb
You can make it even simpler if you design the allocator interface carefully:
https://codeberg.org/ziglang/zig/src/commit/0.15.2/lib/std/heap/SmpAllocator.zig
This is it. Less than 200 lines in 1 file, and it's competitive with glibc.
Main difference vs malloc/free is the user is required to track length and alignment, which is generally information statically known by the type system.
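Roughly, such an interface looks like the sketch below; the names are made up for illustration and just backed by aligned_alloc here, they're not the actual Zig or smalloc API:

    #include <stddef.h>
    #include <stdlib.h>

    /* Hypothetical sized/aligned interface: the caller passes size and
     * alignment on free as well, so the allocator needs no per-allocation
     * header to remember them. Illustrative only. */
    void *alloc_aligned(size_t size, size_t align) {
        /* Stand-in backed by aligned_alloc; a real allocator would map
         * (size, align) straight to a size class. Assumes align is a power
         * of two; aligned_alloc wants size to be a multiple of it. */
        size_t rounded = (size + align - 1) & ~(align - 1);
        return aligned_alloc(align, rounded);
    }

    void dealloc_aligned(void *ptr, size_t size, size_t align) {
        (void)size; (void)align;  /* a real allocator would use these to find the size class */
        free(ptr);
    }

    struct widget { double x[4]; };

    int main(void) {
        /* The call site statically knows both values from the type. */
        struct widget *w = alloc_aligned(sizeof *w, _Alignof(struct widget));
        dealloc_aligned(w, sizeof *w, _Alignof(struct widget));
    }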
C23 added free_sized and free_aligned_sized, which seems to be a step in this direction.
Unfortunately, all allocations still need to be compatible with free, so you can't really use this for much afaict. That is, we need to somehow track allocation sizes even if the client only calls free_aligned_sized and aligned_alloc, because they might call free in the future :(
This is presumably why glibc just implements these functions as free wrappers.
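For reference, the C23 signatures and a wrapper of the kind described above (the my_ prefix is only there to avoid colliding with the real library symbols):

    #include <stddef.h>
    #include <stdlib.h>

    /* C23 declares in <stdlib.h>:
     *   void free_sized(void *ptr, size_t size);
     *   void free_aligned_sized(void *ptr, size_t alignment, size_t size);
     * A wrapper implementation just discards the extra arguments, because
     * every allocation must still be freeable by plain free(). */
    void my_free_sized(void *ptr, size_t size) {
        (void)size;
        free(ptr);
    }

    void my_free_aligned_sized(void *ptr, size_t alignment, size_t size) {
        (void)alignment; (void)size;
        free(ptr);
    }

    int main(void) {
        void *p = malloc(64);
        my_free_sized(p, 64);
    }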
That's right. You have to fully commit to the alternative interface, breaking backwards compatibility, in order to get the benefits. Unfortunately, C will never be able to do this.
I think the idea is not to be competitive with glibc malloc but to be in a different class entirely. After all, glibc malloc already exists and is available to -gnu targets.
Another allocator that exploits 64-bit virtual address space is scalloc:
https://ckirsch.github.io/publications/conferences/OOPSLA15-Scalloc.pdf https://github.com/cksystemsgroup/scalloc
My favorite 64-bit allocator hack is probably https://github.com/GJDuck/GC. Aside from the specific way it slices up the address space, conservative GC is generally also a lot more practical with 64-bit addresses since the probability of accidental aliasing between integers and valid GC pointers is highly diminished.
Virtual memory addresses are a free and near-limitless resource. Use that big idea by reserving a huge swathe of virtual memory addresses that you will use only sparsely. When allocating memory, this sparseness enables you to efficiently find an unoccupied space big enough to hold the request.
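Concretely, the technique the quoted paragraph describes looks something like this on Linux (a minimal sketch of the general idea, not smalloc's actual code):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        /* Reserve 1 TiB of address space with no backing memory. */
        size_t reserve = (size_t)1 << 40;
        char *base = mmap(NULL, reserve, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        /* Later, commit a 4 KiB page somewhere sparse inside the reservation. */
        char *page = base + ((size_t)123 << 12);
        if (mprotect(page, 4096, PROT_READ | PROT_WRITE) != 0) { perror("mprotect"); return 1; }
        page[0] = 42;   /* physical memory is only allocated on first touch */

        munmap(base, reserve);
        return 0;
    }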
As a categorical statement this IMO needs a few caveats in case people get the wrong idea. On x86, for example, you generally want to strive for chunks of 2 MB of consecutively mapped virtual memory even if you're using small pages. Why? Suppose you have sparse 4 KB allocations separated by at least 2 MB. Then mapping each such data page requires a full leaf page of the multi-level page table, so you're essentially paying a RAM overhead of 100%, whereas the PTE overhead in the 2 MB clustered case is only 1/512 < 0.2%. It won't show up in your process's RSS, but on Linux it's accounted for in VmPTE.
There are other good reasons to keep things clustered beyond just the page level, e.g. when you miss the TLB you'd like the hardware page table walk to make good use of page table cache lines, and a 64-byte cache line at the leaf level of the page table contains information about an aligned group of 8x4 KB pages.
The same phenomenon is repeated at different scales, so if you have sparse 4 KB mapped pages separated by 1 GB then you need to pay for a page at the next higher level of the page table just to keep that data page mapped, and the RAM overhead for a mapped page is now 200%. But because of the geometric scaling as you go up the radix tree, the absolute overhead from the higher levels is generally insignificant, so you should mainly worry about the leaf level.
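Concretely, one way to get that clustering is to carve the heap into 2 MiB aligned chunks and subdivide those, which also lets transparent huge pages kick in. A rough Linux sketch (illustration only, not from smalloc):

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>

    /* Hand out 2 MiB aligned chunks: a fully used chunk needs exactly one
     * leaf page-table page (512 PTEs), and the kernel can back it with a
     * single huge page. */
    #define CHUNK (2u << 20)

    static void *alloc_chunk_2m(void) {
        /* Over-reserve, then trim the unaligned head and tail to get alignment. */
        size_t len = (size_t)CHUNK * 2;
        char *raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (raw == MAP_FAILED) return NULL;
        uintptr_t start = ((uintptr_t)raw + CHUNK - 1) & ~(uintptr_t)(CHUNK - 1);
        char *chunk = (char *)start;
        if (chunk > raw)                munmap(raw, chunk - raw);
        if (chunk + CHUNK < raw + len)  munmap(chunk + CHUNK, (raw + len) - (chunk + CHUNK));
    #ifdef MADV_HUGEPAGE
        madvise(chunk, CHUNK, MADV_HUGEPAGE);  /* let THP use one 2 MiB mapping */
    #endif
        return chunk;
    }

    int main(void) {
        char *c = alloc_chunk_2m();
        if (!c) { perror("mmap"); return 1; }
        c[0] = 1;
        return 0;
    }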
Yeah, performance claims for an allocator without mentioning how the design interacts with page table walks and huge (2-4 MB) pages, or better, higher-order sizes like 1 GB pages, make me think the author is being myopic about the impact of memory allocators on a program's performance.
Other state-of-the-art allocators have similar limitations; for example, rpmalloc tops out at 8 MiB allocations according to https://github.com/mjansson/rpmalloc?tab=readme-ov-file, although I suspect they fall back to mmap or the system allocator in that case rather than immediately failing, so instead of an immediate failure you get a drop in performance and possibly different failures later.
I found that surprising. snmalloc has two strategies.
For small allocations, we allocate a chunk (I think the current default is 16 KiB) that we use for sizeclass allocations (i.e. allocating small objects of the same size, rounded up from the requested size). The chunk map is a big array (lazily allocated by mmap, or explicitly as new regions are used on Windows or other silly systems that don't do overcommit) and we also allocate a metadata object to describe the per-sizeclass state and put that in the chunk map. This allows us to de-commit the chunk, but then reuse its address space for the same size class later (or for a different one).
Chunks are allocated by the range allocator, which is a nested set of allocators that do splitting / caching / [de]committing of large ranges, with something like mmap as the back end. For objects that are large, we allocate them directly from the buddy allocators. The chunk map entries for these are either the size or a marker telling us where the start is. When you free a large allocation, we may decommit the pages immediately (depending on OS-specific policy; if you have MADV_FREE we typically do, for Windows and Linux it's more complicated) and then feed it back into the chunk allocator, which may do coalescing and may then decide to decommit if we didn't earlier (again, depending on platform-specific policy).
This means that we can handle any number of allocations of any size, limited only by available address space.
This also means that we can very cheaply go from an address to the start or end of an allocation, which is what we use to enable lightweight bounds checks.
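A stripped-down sketch of that chunk-map lookup (field names and sizes here are invented; snmalloc's real metadata is richer than this):

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>

    /* One small entry per 16 KiB chunk of address space, indexed directly
     * by pointer. */
    #define CHUNK_SHIFT 14                        /* 16 KiB chunks */
    #define VA_BITS     47                        /* user address space covered by the map */

    struct chunk_meta { uint32_t sizeclass; uint32_t run_offset; };

    static struct chunk_meta *chunk_map;          /* lazily committed by the kernel */

    static struct chunk_meta *meta_for(const void *p) {
        return &chunk_map[(uintptr_t)p >> CHUNK_SHIFT];
    }

    int main(void) {
        /* Reserve the whole map up front; pages are only committed when touched. */
        size_t entries = (size_t)1 << (VA_BITS - CHUNK_SHIFT);
        chunk_map = mmap(NULL, entries * sizeof(struct chunk_meta),
                         PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (chunk_map == MAP_FAILED) { perror("mmap"); return 1; }

        int x;                                    /* any user-space address can be looked up */
        meta_for(&x)->sizeclass = 7;
        printf("sizeclass at %p: %u\n", (void *)&x, (unsigned)meta_for(&x)->sizeclass);
        return 0;
    }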
smalloc doesn't have any features for hardening your process against exploitation of memory management bugs.
This is a perfectly valid design choice, but claiming superior performance to glibc malloc, which does provide this functionality, seems a little unfair.
glibc malloc is in a tough spot. It's one of those rare codebases that's been read more than written. Unfortunately, the readers are mostly CTF players, very few of whom think to contribute patches. See how2heap, a compendium of glibc malloc exploitation primitives, many of which could be mitigated pretty easily.
Combined with the usual old project staleness and inertia, plus the exposure of internal allocator details through mallopt, mallinfo, and GLIBC_TUNABLES, old design choices tend to stick around for way too long.
Recently, a patch series went around that removed fastbins from the allocator, which is a significant simplification. Hopefully we see more of this going forward.
nice to see mimalloc given a shoutout
it's a very similar approach (sharded free lists), but with fancier machinery for concurrent use
The description in the README makes me think the GLOBAL_THREAD_NUM scheme will break if you are frequently creating and destroying threads. I don't see any mention of reusing earlier numbers.
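One common fix, sketched below with invented names (this is not smalloc's code), is to treat thread numbers as a small pool of slots that get reclaimed when a thread exits, so frequent create/destroy cycles reuse the same indices:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define MAX_THREAD_SLOTS 64

    static atomic_bool slot_used[MAX_THREAD_SLOTS];

    static int claim_slot(void) {
        for (int i = 0; i < MAX_THREAD_SLOTS; i++) {
            _Bool expected = 0;
            if (atomic_compare_exchange_strong(&slot_used[i], &expected, 1))
                return i;                 /* reuses the lowest slot freed by a dead thread */
        }
        return -1;                        /* pool exhausted; fall back to some shared slot */
    }

    static void release_slot(int i) {
        if (i >= 0) atomic_store(&slot_used[i], 0);
    }

    static void *worker(void *arg) {
        (void)arg;
        int slot = claim_slot();
        /* ... use slot to index per-thread free lists ... */
        release_slot(slot);
        return NULL;
    }

    int main(void) {
        /* Even with many short-lived threads, slot numbers stay bounded. */
        for (int n = 0; n < 1000; n++) {
            pthread_t t;
            pthread_create(&t, NULL, worker, NULL);
            pthread_join(&t, NULL);
        }
        puts("done");
        return 0;
    }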