vm.overcommit_memory=2 is always the right setting
45 points by FedericoSchonborn
I have rarely read an article that I disagreed so strongly with.
Windows doesn’t overcommit. It has a policy of never making promises it can’t keep. And this is how I ended up with a machine with 128 GiB of RAM, 60 GiB free, and memory allocation failing.
Failing locally is nice for debugging but is absolutely not how you design resilient systems. You build resilient systems by ensuring that you can handle failure and that’s much easier at a higher level. Do you handle malloc failure gracefully? At every call site? In every library that you call? If those libraries allocate memory, every single API that may allocate memory now has to be able to report failure. Do you gracefully handle that?
Kernels and embedded systems have to, and it is the biggest reason why embedded development is hard. Handling allocation failure is really hard because you can't allocate memory to clean up. Often, the only thing that complex stateful code can do in case of allocation failure is exit the program. Indeed, a load of classic UNIX software called malloc via a macro that called exit on allocation failure, because handling the failure in the general case is too hard.
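For illustration, a minimal sketch of that classic pattern (often spelled xmalloc; the name and message here are illustrative, not any particular project's code), written as a function rather than a macro:

    #include <stdio.h>
    #include <stdlib.h>

    /* Classic "die on failure" allocation wrapper: if malloc cannot satisfy
     * the request, print a diagnostic and exit rather than try to unwind
     * complex state with no memory left to do it in. */
    static void *xmalloc(size_t size)
    {
        void *p = malloc(size);
        if (p == NULL) {
            fprintf(stderr, "out of memory (requested %zu bytes)\n", size);
            exit(EXIT_FAILURE);
        }
        return p;
    }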
Without overcommit, you massively increase the probability that malloc will fail. And resilient system design is all about probabilities because even the underlying hardware may fail. And you can’t handle those errors locally. The Erlang philosophy is the one that lets you build systems with ludicrously high uptime: allow local failure, recover at the system level.
You have two choices:
Option 1: Invest a huge amount of engineering effort in making sure that every call to an allocation function gracefully handles failure and propagates this failure to the caller. Run in a mode where this kind of failure is likely. Waste a load of RAM.
Option 2: Ensure that your programs persist any important data regularly so that they don’t lose data on crash. Do best-effort soft limiting to further reduce the likelihood of failure. Build process monitoring to restart processes that exhaust memory and crash.
For software running on platforms with memory measured in GiBs, I will choose option 2 every time because we have a huge body of experimental evidence that it leads to more reliable systems.
Note: I maintain a platform that chooses option 1 because option 2 is unavailable on hardware where RAM is measured in tens to hundreds of KiBs. We put a lot of effort into API design to propagate allocation failures up and pass heap quotas down. We can make pretty good use of the available memory. But I absolutely would not build software for computers with three or more orders of magnitude more memory like this.
You build resilient systems by ensuring that you can handle failure and that’s much easier at a higher level.
Yeah, but Linux doesn't do that, either. An intermittent but persistent problem on my Linux desktop is running out of memory, usually 'cause I have left a few memory-hungry Firefox tabs open for like three weeks and then start a large game or something. 32 GB of RAM, no swap (I know, I know), and what happens on OOM? Does it kill the largest program? Nope. Does it kill the newest or least-important-looking program or even some program at random? Nope.
It stutters, grinds, and hiccups to a halt, and never starts again. The only way to unwedge it I've found so far is a hard reboot. The palliative solution is to reboot my computer a couple times a week. I'm literally back to the same workflow as using Windows XP in 2005.
Now, I haven't even started to try to solve this problem, because in reality it happens like twice a season. It's running Void Linux with a slightly kooky set of software, so I have no idea what the actual problem is. I don't care. It's a complicated problem with lots of different programs that don't want to cooperate with each other at all. But I would take literally any other failure mode over the one I have.
Linux’s virtual memory system has never coped well under memory pressure with its default settings. It tends to accumulate too many dirty pages, so when things get tight it has to spend ages writing random pages all over the filesystem so that it can allocate fresh pages. This also affects write-heavy IO: the kernel allows itself to get horribly backlogged and doesn’t impose enough backpressure on writes, then later everything grinds to a halt when it discovers it needs to catch up with all the deferred work. It’s a kind of buffer bloat effect.
Ugh.
I was recently fighting with my Ubuntu system trying to do some builds (bitbake / Yocto) and had to really turn down the number of tasks and threads below what my desktop system should be able to handle.
A big part of my problem wasn't the kernel OOM killer, it was the Gnome OOM killer. This is far more conservative, and looks at statistics like how long it takes a process to have an allocation request fulfilled. My system, at the time, was using about half the RAM for processes, the other half for buffer cache. But it seemed to be the situation you described, where too many dirty pages needed to be written out.
So if it takes too long to fulfill an allocation request, yeah, let's just kill my build process instead of letting the system gracefully slow down.
The Gnome OOM killer can be disabled, and I had started looking into other tweaks and bypasses. You can, for example, just escape the Gnome OOM killer by starting a screen / tmux session outside of Gnome's purview (ssh into localhost, for example).
Ultimately, I just turned down the tasks/threads and also ran builds on an older system which had 72 GB of RAM.
And swap? That might as well not exist, I guess.
I was very frustrated with the situation at the time, because there was nothing actually wrong. It wasn't as if a process had a bad memory leak, and all the available memory was used up (and the swap). The out-of-the-box tuning is set to start killing processes before the system can actually slow down.
Fortunately, the MADV_COLD/MADV_PAGEOUT flags to madvise(2) now let applications hint at, or force, swap-out of known-cold anonymous pages.
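A minimal sketch of that hinting, assuming a Linux 5.4+ kernel, a reasonably recent glibc, and a page-aligned region the caller owns (older kernels return -1/EINVAL):

    #define _GNU_SOURCE
    #include <sys/mman.h>

    /* MADV_COLD marks the pages as good reclaim candidates (a hint);
     * MADV_PAGEOUT asks the kernel to reclaim them now, swapping out
     * anonymous pages. */
    static int hint_cold(void *addr, size_t len)
    {
        return madvise(addr, len, MADV_COLD);
    }

    static int force_pageout(void *addr, size_t len)
    {
        return madvise(addr, len, MADV_PAGEOUT);
    }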
And this is how I ended up with a machine with 128 GiB of RAM, 60 GiB free, and memory allocation failing.
Is something trying to allocate 60 GB + 1B, or by ‘60 [GB]’ free do you mean ‘60 GB allocated but unused’? Neither of those seems like a problem: either something is trying to allocate more memory than is available, which ought to be an error, or the memory has actually been allocated and isn’t really free, even if the owner isn’t using it right now. Or is there another alternative?
Do you handle malloc failure gracefully? At every call site? In every library that you call? If those libraries allocate memory, every single API that may allocate memory now has to be able to report failure. Do you gracefully handle that?
Yes. If a function may encounter an error then it needs to handle or report that error to its caller, and the caller must handle or report that error to its caller. malloc may return NULL; if so, then one has to handle it or report the failure to one's caller.
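A minimal sketch of that discipline in C, with hypothetical widget names standing in for real code:

    #include <stdlib.h>
    #include <string.h>

    struct widget { char *name; };

    /* Returns NULL on allocation failure; the caller must check the
     * result and propagate the error to its own caller in turn. */
    struct widget *widget_create(const char *name)
    {
        struct widget *w = malloc(sizeof *w);
        if (w == NULL)
            return NULL;
        w->name = strdup(name);
        if (w->name == NULL) {   /* undo partial state, then report */
            free(w);
            return NULL;
        }
        return w;
    }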
Option 1: Invest a huge amount of engineering effort in making sure that every call to an allocation function gracefully handles failure and propagates this failure to the caller. Run in a mode where this kind of failure is likely. Waste a load of RAM.
That sounds like the correct engineering choice to me. Every program runs and terminates cleanly. RAM is cheaper than an unreliable system.
Option 2: Ensure that your programs persist any important data regularly so that they don’t lose data on crash. Do best-effort soft limiting to further reduce the likelihood of failure. Build process monitoring to restart processes that exhaust memory and crash.
Of course, ensuring that programs persist important data is also engineering effort, often doesn’t get properly done, and unreliability has its own cost.
One should probably still have process monitoring and management with Option 1.
I feel a lot of the problem is that programmers want to treat memory as though it’s an infinite resource, but it’s not.
RAM is cheaper than an unreliable system.
The systems that consume the most RAM are cheap to let fail occasionally, and have many more reliability issues coming from non-overcommit sources. Caches. Browsers. Games. Batch runs so long you need checkpointing anyway.
Windows doesn’t overcommit. It has a policy of never making promises it can’t keep.
That sounds like such an obviously, tautologically good thing that I feel (though I admit I can't logically demonstrate it) that the problem must be in Windows's execution of that policy.
I guess the flip side of this analogy is that a lot of software allocates more memory than it actually needs. Sometimes this is just done out of convenience because the developer "knows" Linux will overcommit, but there are other more legitimate reasons to do so as well, like dynamic space-time tradeoff heuristics.
The best reason for overcommit is that virtual memory is so cheap on 64-bit systems. Not exploiting that fact is just leaving efficiency on the table.
In Windows, you can over-reserve address space, but you can't commit more than RAM + swap.
I don't think this is much of an issue if you're writing programs for Windows. But if you port from Linux, where the program assumes the OS lets you overcommit and does lazy initialization for you, it can be an issue.
Suppose you have a server that's serving multiple API calls at once. One of them turns out to trigger unbounded memory allocation, resulting in your server running out of memory and then dying. Now your retry/resilience mechanism amplifies the problem, because you have no way of knowing which query is the problem, so you'll retry the killer query too!
I work on a database that does handle allocation failure for every single allocating call and the complexity is overwhelming.
It isn’t mentioned explicitly in the article, but setting the value to 2 disables overcommit.
Has the title changed since you posted this comment or is this a joke going over my head?
Neither. The title wasn't changed and it's not a joke. Linux sysctls are just that weird.
I wouldn't call it weird, tbh: 0 is kind of an "auto" option, which is a nice default; 1 is "on"; and the next free value, 2, is "off" (I don't recall any sysctl with a possible negative value).
https://www.kernel.org/doc/Documentation/vm/overcommit-accounting
I know those, I still think it's weird. Why is this shoehorned into integers at all? sysctls can contain strings (for an example, see kernel.modprobe). Use your words!
Yep. It was somewhat justifiable 40 years ago to just use integer flags.
But this is 2025, let's use words instead.
I'm not quite at the point where I'm going to rip out every boolean in my programs and replace them with enums, but I'm leaning in that direction.
In compiled languages there's no cost, so I've always used enums over bools where possible; it makes things easier to read.
From the redis FAQs:
The Redis background saving schema relies on the copy-on-write semantic of the fork system call in modern operating systems: Redis forks (creates a child process) that is an exact copy of the parent. The child process dumps the DB on disk and finally exits. In theory the child should use as much memory as the parent being a copy, but actually thanks to the copy-on-write semantic implemented by most modern operating systems the parent and child process will share the common memory pages. A page will be duplicated only when it changes in the child or in the parent. Since in theory all the pages may change while the child process is saving, Linux can't tell in advance how much memory the child will take, so if the overcommit_memory setting is set to zero the fork will fail unless there is as much free RAM as required to really duplicate all the parent memory pages. If you have a Redis dataset of 3 GB and just 2 GB of free memory it will fail.
Maybe it's reasonable to require that the system have enough headroom to handle the worst case here, but I suspect it depends on the size of your database?
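For concreteness, a stripped-down sketch of the fork-and-dump pattern the FAQ describes (save_to_disk is a hypothetical stand-in, not Redis code):

    #include <stdlib.h>
    #include <unistd.h>

    /* Background save via fork: the child gets a copy-on-write snapshot of
     * the parent's heap, writes it out, and exits, while the parent keeps
     * serving requests. Only pages modified after the fork are physically
     * copied. Under strict accounting (overcommit_memory=2), the fork()
     * itself fails unless free RAM + swap can cover the whole parent. */
    static int background_save(void (*save_to_disk)(void))
    {
        pid_t pid = fork();
        if (pid < 0)
            return -1;           /* e.g. ENOMEM under strict accounting */
        if (pid == 0) {          /* child: dump the snapshot and exit */
            save_to_disk();
            _exit(EXIT_SUCCESS);
        }
        return 0;                /* parent: reap the child with waitpid() later */
    }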
Since in theory all the pages may change while the child process is saving, Linux can't tell in advance how much memory the child will take, so if the overcommit_memory setting is set to zero the fork will fail unless there is as much free RAM as required to really duplicate all the parent memory pages.
This is really the problem with disabling overcommit: it breaks the usefulness of the CoW memory when forking.
This explanation also highlights that Redis isn’t “poorly made” and just ignores malloc errors. It affects a specific feature that depends on fork.
Fork shouldn't exist. Redis is taking advantage of a misfeature of Unix, so "poorly made", while an exaggeration, is not entirely unfitting. The correct (and efficient) way to do what Redis is doing is to implement CoW at the application layer, with threads.
The correct way is to reimplement something the OS can already do? And you’d still have to allocate memory for that.
That's a last resort. The first option is probably to use posix_spawn() instead. It's not as flexible as fork()+exec(), but it avoids the COW problem (see the sketch below). Or vfork(), if you're able to use it correctly (with extreme caution).
Or in Redis' case, there are a milliard other ways to implement a transaction – try to choose the least flawed one first.
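A minimal sketch of the posix_spawn() route for the plain fork()+exec() case (the program and arguments are placeholders):

    #include <spawn.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>

    extern char **environ;

    /* Start a child program without duplicating the parent's address space,
     * so no large copy-on-write reservation is needed for the spawn. */
    static int run_child(void)
    {
        pid_t pid;
        char *argv[] = { "/bin/echo", "hello", NULL };

        int err = posix_spawn(&pid, "/bin/echo", NULL, NULL, argv, environ);
        if (err != 0) {
            fprintf(stderr, "posix_spawn: %s\n", strerror(err));
            return -1;
        }
        return waitpid(pid, NULL, 0) < 0 ? -1 : 0;
    }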
I’m not sure if that’s clear but Redis isn’t executing a different program. It’s using fork specifically because of the COW semantics. The newly forked process is then writing out the database to the disk in the background.
Ok, then I think we have to use the Linux clone() syscall with a preallocated stack. COW is obviously flawed if we want to handle allocation failure upfront.
illumos systems do not overcommit memory, and while we occasionally have to deal with poorly architected software where overcommit is load bearing (like Redis, as mentioned in the article) I vastly prefer this behaviour to the OOM killer world view.
What happens on illumos when a process with a big allocation forks and then writes over the COW memory? You’re gonna need more memory without any explicit allocation that could have failed.
You can't fork if you can't get a reservation for all of the memory that you could end up duplicating through COW.
This is indeed a problem in cases like, say, a machine that has 8GB of RAM and runs a large process like a 6GB JVM, which then wants to spawn even a tiny child process and cannot fork because there is no room. But the way we fix that is through replacing classic fork() stuff with interfaces like posix_spawn(3C) where you straight up create a new child without a copy of whatever previous address space exists, and thus not requiring even a temporary lease on double the memory, etc.
OK, but you realize that's a tradeoff, right? It means that you cannot use copy-on-write as an optimization to save memory, like in prefork servers. Less of an issue today with most things using threads, I guess.
Yes, I do realise! It's still possible to use COW the way you are describing, but you have to have available swap space to cover a potential run on the bank for pages of physical memory. I believe modern languages and runtimes are mostly moving away from using fork() in that way anyhow, because there are other challenges beyond just memory use created in the process (what to do about threads and locks during a fork, handling file descriptors and other resources, interactions with signals, etc).
Also things that use a zygote model to rapidly create child processes from a pre-initialised environment that might change some of that state, but probably don't.
When I was running Solaris boxes 25ish years ago, the solution was to give them huge swap partitions which provided space for large reservations needed for fork and COW. The swap was rarely used. These servers were running things like Apache and Exim which fork a lot, and of course shell scripts do too.
I believe the same is the case for Windows, and it looks like Redis does not exist for Windows. (Their solution is of course just to tell people to run it in a VM.)
Yeah. I mean, at the end of the day you can kind of paper over the lack of overcommit by adding swap to the system. Your reservation for memory usage can be backed either by physical memory, or by available space in the swap partition. If you don't actually end up touching enough pages to overflow physical memory then that isn't a performance problem -- and if you do, there is at least somewhere to put the pages we promised you could have.
To put this in perspective: At Microsoft, they gave me a desktop with 128 GiB of RAM. To be able to actually use all of it (i.e. not have things die due to allocation failure, in spite of some RAM sitting unused), I had to allocate 512 GiB of swap space.
I'm not sure what's going on in the Windows world, but it seems like the root problem there is probably applications making reservations vastly bigger than what they actually then intend to use -- a behaviour which is certainly encouraged by more haphazard allocation strategies on certain popular systems!
Playing heavily modded Skyrim is one of my pastimes. It's not available for other PC operating systems. It requires hard-allocating 40+ GiB of swap, or it just won't start. It never uses all that allocation.
It's just a bad OS design not to overcommit.
That sounds like a bad program design to overallocate, not a bad OS design.
I agree, but I think the counterargument is: programmers are (on average) generally bad at their jobs/working under time pressure/etc., and therefore programs exist which a) people want to use but b) are badly designed.
A lot of these comment threads are about what POSIX provides, how the Linux allocator handles low memory conditions, etc. But APIs and implementations can theoretically be fixed. If you strip all that away I think the very core tradeoff here is: in the face of a world that will generate badly designed programs, do you want to prioritize (short-term) user happiness, or system correctness?
Based on that framing, I suspect the answer here may also be different based on whether you're on a desktop or on a server.
If you strip all that away I think the very core tradeoff here is: in the face of a world that will generate badly designed programs, do you want to prioritize (short-term) user happiness, or system correctness?
There are several problems with this framing:
Failure locality is a good thing to have, sure. But you can't just look at the upsides, you have to weigh them against the downsides. There's no free lunch, everything is tradeoffs.
In the case of a dedicated redis server, disabling overcommit means needing to actually buy about twice as much RAM. In the current economy, that's not a no-brainer.
How do sparse allocations work without overcommit? It's very common to allocate massive address spaces rather than "actual memory". GHC Haskell programs allocate 2TB (IIRC), Go programs allocate 1G, Deno allocates 34G, WebKit allocates 70G, Firefox sometimes (wasm?) allocates 20G… Virtual memory is intended to allow this, and depriving userspace of that ability would be silly.
That does not require overcommit. On a non-overcommit operating system like Windows which makes a distinction between reserving and committing virtual memory ranges, you can reserve a multi-terabyte range and then commit/decommit subranges as you go. The commit/decommit, which happens by calling VirtualAlloc to set/clear the MEM_COMMIT flag, is what increments/decrements the commit charge. That call is what can fail if the request exceeds the commit limit.
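A minimal Win32 C sketch of that reserve/commit split (the sizes are illustrative):

    #include <windows.h>

    /* Reserve a large address range without charging commit, then commit
     * only the subranges actually needed. Only the MEM_COMMIT call counts
     * against the system commit limit, and it is the call that can fail. */
    static void *reserve_then_commit(void)
    {
        SIZE_T reserve_size = (SIZE_T)1 << 40;   /* 1 TiB of address space */
        void *base = VirtualAlloc(NULL, reserve_size, MEM_RESERVE, PAGE_NOACCESS);
        if (base == NULL)
            return NULL;

        /* Commit the first 64 KiB; this is what charges commit. */
        if (VirtualAlloc(base, 64 * 1024, MEM_COMMIT, PAGE_READWRITE) == NULL) {
            VirtualFree(base, 0, MEM_RELEASE);
            return NULL;
        }
        return base;   /* give pages back later with VirtualFree(p, len, MEM_DECOMMIT) */
    }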
You can implement your own user space version of overcommit on Windows by using a structured or vectored exception handler to commit on demand, which has pros and cons compared to the Linux approach. An advantage is that you still have process-level non-overcommit semantics (no OOM killer) and you can design the policy: the granularity of on-demand commit growth, how to respond to overcommit in the exception handler, etc. And structured exception handling (unlike POSIX-style signal handlers) makes it easy to scope that policy to a particular dynamic code extent like a function call. But a disadvantage is that if you want page-at-a-time on-demand commit granularity then the overhead of the user space version is much worse.
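A rough sketch of that user-space approach using a vectored exception handler; this is illustrative only, and a real implementation would need more careful address validation, thread safety, and a deliberate policy for commit failure:

    #include <windows.h>

    static char  *g_base;                   /* start of our reserved range */
    static SIZE_T g_size;                   /* size of the reservation     */

    /* On an access violation inside our reservation, commit the faulting
     * page and resume; otherwise let the search continue. */
    static LONG CALLBACK commit_on_demand(PEXCEPTION_POINTERS info)
    {
        if (info->ExceptionRecord->ExceptionCode != EXCEPTION_ACCESS_VIOLATION)
            return EXCEPTION_CONTINUE_SEARCH;

        char *addr = (char *)info->ExceptionRecord->ExceptionInformation[1];
        if (addr < g_base || addr >= g_base + g_size)
            return EXCEPTION_CONTINUE_SEARCH;

        if (VirtualAlloc(addr, 4096, MEM_COMMIT, PAGE_READWRITE) == NULL)
            return EXCEPTION_CONTINUE_SEARCH;   /* commit limit hit: a policy choice */

        return EXCEPTION_CONTINUE_EXECUTION;    /* retry the faulting instruction */
    }

    static void install(void)
    {
        g_size = (SIZE_T)1 << 32;               /* 4 GiB reservation, illustrative */
        g_base = VirtualAlloc(NULL, g_size, MEM_RESERVE, PAGE_NOACCESS);
        AddVectoredExceptionHandler(1, commit_on_demand);
    }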
All of those programs work on Windows, which doesn't have overcommit. You can reserve the address spaces just fine on Windows with VirtualAlloc. Committing the pages is a separate step (with another call to VirtualAlloc), there is no need to actually do the commit before you intend to use the memory. So the only win by committing to it at once would be saving that call to VirtualAlloc.
That's true on Windows; however if you set vm.overcommit_memory=2 on Linux, as suggested here, you won't get that behavior.
POSIX does not support this two-step process of reserving an address space and then committing portions of it; mmap is either just the first (with commit being implicit) or both. And Linux doesn’t have MADV_SPACEAVAIL (assuming that even does something useful, which I would not guarantee), nor does any OS that I know of have a way of requiring it.
Linux has MAP_NORESERVE for this. I generally find that POSIX is woefully inadequate for anything involving low-level memory management. Most of snmalloc’s platform abstraction layer is papering over all of the non-POSIX extensions to mmap, madvise, and so on that are necessary to get good performance.
That said, the Windows APIs are much nicer here. You can do reserve+commit in a single call with the right flag, but the two operations are explicit. The things that are missing are a userspace handle for the kernel object that represents the reservation and the ability to coalesce reservation objects (in POSIX, you can mmap two adjacent regions and then munmap them in a single call or, with madvise, discard the contents. In Windows, none of the memory-manager calls can span multiple reservations).
Oh, and you can trick the Windows memory manager into doing overcommit by creating a file and using MapViewOfFile with FILE_MAP_COPY for each allocation. Creating a 2MiB file and mapping it repeatedly gives you overcommit on Windows. You can also do it entirely in userspace with VEH and userspace fault handlers too, for extra fun.
Linux has MAP_NORESERVE for this.
In mode 2, the MAP_NORESERVE flag is ignored.
MAP_NORESERVE is only relevant in the default heuristic overcommit mode (0), and the docs specify that
the default check is very weak
Ouch. That makes overcommit=2 completely unusable for anything more complex than a toy malloc.
Nah, you can still mmap as much as you want with PROT_NONE and mprotect(PROT_READ|PROT_WRITE) the regions you want to use later.
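A minimal sketch of that approach: the PROT_NONE reservation is not charged against the commit limit even in mode 2, and the later mprotect() is the call that can fail with ENOMEM:

    #define _DEFAULT_SOURCE
    #include <sys/mman.h>

    /* Reserve address space without committing it. */
    static void *reserve(size_t size)
    {
        void *base = mmap(NULL, size, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return base == MAP_FAILED ? NULL : base;
    }

    /* "Commit" a subrange by making it writable; this is where strict
     * accounting (overcommit_memory=2) can refuse with ENOMEM. */
    static int commit(void *addr, size_t len)
    {
        return mprotect(addr, len, PROT_READ | PROT_WRITE);
    }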
I think this is probably right in theory, but in practice even with sufficient swap, you can not do a whole lot on a modern Linux system with overcommit turned off.
I don't entirely agree with the failure locality argument. With no OOM killer, an entirely well behaved process using minimal memory may die by being the unlucky winner to trip over the limit. The OOM killer tries to maintain failure locality in the sense of killing the process most likely to be badly behaved.
Very much agree with the sentiment here, although I still think that overcommit is a good default for users. I am surprised that there is no kernel API for allocating committed memory other than dancing all over the memory to force the kernel to do so (unless the article just neglected to mention it).
You can use the mmap flag MAP_POPULATE, which will pre-fault the mapping and thus ensure that all of it is in memory, at least to start with (I think it can still be moved to swap unless you lock it with mlock); see the sketch after this comment.
Although that might just get you OOM-killed if you try to allocate too much, instead of getting an actual failure, so it's not too great.
There is also calloc, which 0 initializes the allocation, but I don't think the implementation ensures that everything must be able to fit in memory.
https://man.archlinux.org/man/mmap.2
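A minimal sketch of the MAP_POPULATE route mentioned above (Linux-specific; as noted, the pre-faulted pages can still be swapped out later unless mlock'd):

    #define _DEFAULT_SOURCE
    #include <sys/mman.h>

    /* Allocate and pre-fault in one step: MAP_POPULATE touches the whole
     * mapping at mmap() time, so the pages are resident from the start. */
    static void *alloc_populated(size_t size)
    {
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }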
There is also calloc, which 0 initializes the allocation, but I don't think the implementation ensures that everything must be able to fit in memory.
You need to be a bit careful with that. A lot of calloc implementations special case large allocations. The kernel guarantees that fresh memory is zeroed to prevent information leaks, so if you're asking for a large buffer it's often more efficient to just return the result of mmap, or use a madvise call that guarantees that it will zero the data.
I have to wonder: in your experience, how common is it for software to allocate large blocks of memory and then never touch most of those blocks (or, at least, most of the pages in those blocks)? The philosophy behind overcommit seems to presume that the answer is "very often". I can imagine it being the case with, say, vector-like data structures where an average of 25% of the allocation is going unused at any one moment.
Overcommit is incredibly important in high-performance computing, and for sparse data structures in general. On the other hand, there are cases where we want allocations to fail immediately. What we need is a way to explicitly request overcommit vs non-overcommit on a per-allocation basis, rather than this being a systemwide setting.
Ultimately, that's what MAP_NORESERVE is for: creating an allocation not backed by a guarantee of pages, where you may eventually be hit with a SIGBUS when the cupboard is bare.
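A minimal sketch of that per-allocation opt-in (ignored under vm.overcommit_memory=2, as the sibling thread notes; the comment above mentions SIGBUS, while mmap(2) documents SIGSEGV on a failed first touch, and in practice the OOM killer may fire instead):

    #define _DEFAULT_SOURCE
    #include <sys/mman.h>

    /* Per-allocation overcommit: MAP_NORESERVE skips the swap-space
     * reservation, so the mapping succeeds even if RAM + swap could not
     * back all of it; the cost is a possible fault on first touch when
     * memory really has run out. */
    static void *sparse_alloc(size_t size)
    {
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }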
With the significant drawback that, on systems which actually honour MAP_NORESERVE, you need massive swap, because most software does not use the flag.
This argument kind of reminds me of people who claim that you should always just buy lots of physical RAM and turn off swap completely. Such a stance reveals a deep misunderstanding of the purpose of swap.
Other than hibernation, what are you using swap for?
Putting infrequently used memory pages there so there is more space in main memory for buffering files and more frequently used pages. See https://chrisdown.name/2018/01/02/in-defence-of-swap.html
That would be good. In practice, though, I find my RAM is never full even when you count all the buffer/cache the kernel is doing. So the swap sits unused...
Like others, I don't believe in the contention of this article. I have both general issues and a specific issue on Linux.
Unix kernel interfaces, especially fork(), are not designed in a way that cooperates well with strict memory allocation. A program that forks may in theory modify 100% of its writable memory, but in practice most fork()s do not even come close. This creates an API situation where fork() heavy workloads create completely artificial memory pressure and failed memory allocations unless you over-provision RAM (possibly vastly).
We frequently trade memory for speed all through computing. Over-use and over-allocation of memory is a pervasive practice at multiple levels of the stack (all the way down to CPUs), and in the face of this practice requiring all of this allocated memory to be backed by RAM is wasteful. If you want the long argument, see this.
Many Linux systems in particular run with significantly or vastly more committed address space than used and active memory. You can check this on your own Linux systems by looking at /proc/meminfo and comparing Committed_AS to other values, such as 'Active'; see the sketch after this comment. Requiring all of that committed address space to be backed by RAM would require a lot more RAM in your systems or result in a lot of programs failing their allocations. And if you had that RAM, a lot of it would be inactive and probably unused, wasted.
One reason that committed RAM will be wasted is that in practice, it is extremely challenging for kernels to reliably know how much memory can be freed up on demand. When kernels can't be sure how much memory they can free up if programs require it, they have less room for using committed but not active RAM and more of your RAM goes to waste.
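A small sketch of the /proc/meminfo comparison described above (a grep would do the same; the chosen fields are just one reasonable set):

    #include <stdio.h>
    #include <string.h>

    /* Print the commit-accounting lines from /proc/meminfo next to the
     * "actually in use" ones, so Committed_AS can be compared with Active. */
    int main(void)
    {
        char line[256];
        FILE *f = fopen("/proc/meminfo", "r");
        if (f == NULL)
            return 1;
        while (fgets(line, sizeof line, f)) {
            if (strncmp(line, "MemTotal", 8) == 0 ||
                strncmp(line, "Committed_AS", 12) == 0 ||
                strncmp(line, "CommitLimit", 11) == 0 ||
                strncmp(line, "Active", 6) == 0)   /* also matches Active(anon/file) */
                fputs(line, stdout);
        }
        fclose(f);
        return 0;
    }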
I think everywhere you wrote “RAM” in your comment it would have been more correct to write “swap”. Unix systems without overcommit don’t need to reserve real RAM for memory allocations, they can allocate virtual memory. Allocations can succeed as long as there’s enough space to spill to disk when too many anonymous pages are dirty.
Overcommit is just oversubscription for memory. Seems most of the arguments here apply generically to any form of resource oversubscription. And I don’t see how you can design efficient multitenant systems without oversubscription.
My Mom worked in the restaurant industry, and they do overcommits on reservations because the restaurants know that a certain percentage won't show up. How much they overcommit probably comes down to historical data to get a decent baseline for how much to overcommit. That's easy enough for a restaurant, but can a program query for allocations vs. usage?
Same with the airline industry, overbooking is universal. Linux does apply some (possibly dubious) heuristics in the default configuration to avoid “over-overcommit”.