In defense of lock poisoning in Rust
71 points by claytonwramsey
I think poisoning is a bad solution to a very real problem.
Poisoning is persistent. Without implementing a specific mechanism to fix/restore/reset the state, the lock remains useless forever. The application can easily get permanently stuck in a livelock-like state, where it keeps tripping over poisoned locks and systematically aborting requests/operations/threads, while error handling that limits the "blast radius" keeps it from crashing hard enough to restart the whole process.
Poisoning is detected too late, after the state has already been lost. An outside observer can't know what really happened; the stack has already been unwound, so recovery strategies are very limited.
Compare it to something like a Drop guard used in the critical section. The guard can fix/restore the state right away, before the lock is even unlocked, while the necessary data may still be available! The critical section can still unexpectedly panic, but it doesn't have to leave a bad state behind. You can also "pre-poop your pants": when entering the critical section, swap the state with some tolerable temporary placeholder. If the critical section gets interrupted you'll have some data loss, but not corruption.
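A minimal sketch of that placeholder-swap idea, with made-up names and a Vec standing in for the real state:

    use std::mem;
    use std::sync::Mutex;

    // Sketch of the "swap in a placeholder" idea: a panic mid-way leaves a
    // valid (if empty) value behind instead of a half-mutated one.
    fn rebuild_index(lock: &Mutex<Vec<String>>, entries: &[String]) {
        // Recover from poison here; the placeholder pattern is what we rely on.
        let mut guard = lock.lock().unwrap_or_else(|e| e.into_inner());
        let mut index = mem::take(&mut *guard); // leave an empty Vec as placeholder
        index.clear();
        index.extend(entries.iter().cloned()); // the part that may panic
        *guard = index; // write back only once the new state is complete
    }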
Instead of poisoning that can cause stable failure states, I'd prefer locks that automatically reset potentially-corrupt state (replacing it with some default value). Or just abort the whole process instead of poisoning — at least a process watchdog or crash reporter will restart it and restore some working state, instead of leaving a live application in a vegetative state.
Poison<T> doesn't prevent corruption, only contains it a bit, and adds extra work for handling the poisoning and recovery. There are more specific patterns that can prevent corruption and reduce the faff (native defer, lock.with(critical_section_callback, reset_state_callback), temporarily getting owned values from &mut, etc.)
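For the lock.with(...) shape in particular, here is a rough sketch of what such a helper could look like on top of today's std Mutex. The name with_reset and the ignore-poison recovery are my assumptions, not an existing API:

    use std::panic::{catch_unwind, resume_unwind, AssertUnwindSafe};
    use std::sync::Mutex;

    // Run the critical section; if it panics, run a reset callback on the
    // guarded state *before* releasing the lock, then continue unwinding.
    fn with_reset<T, R>(
        lock: &Mutex<T>,
        critical_section: impl FnOnce(&mut T) -> R,
        reset_state: impl FnOnce(&mut T),
    ) -> R {
        let mut guard = lock.lock().unwrap_or_else(|e| e.into_inner());
        match catch_unwind(AssertUnwindSafe(|| critical_section(&mut guard))) {
            Ok(result) => result,
            Err(payload) => {
                reset_state(&mut guard); // repair/replace the state first
                drop(guard);             // release the lock, not poisoned
                resume_unwind(payload)   // then continue unwinding
            }
        }
    }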
From the article:
Note that PoisonError, and poisoning more generally, is purely advisory: you can retrieve the data underneath, and even clear the poison bit in Rust 1.77 and above.
The crucial state that has been lost is the temporary working state inside the critical section. Code that lacks that state can’t make sense of what’s protected by the lock if the invariants were broken when the panic happened.
I was responding to this part of kornel’s comment:
Poisoning is persistent. Without implementing a specific mechanism to fix/restore/reset the state, the lock remains useless forever.
Instead of poisoning that can cause stable failure states, I'd prefer locks that automatically reset potentially-corrupt state (replacing it with some default value).
I think software transactional memory is an approach to do what you suggest! If I understand correctly, you want to use mutexes to contain data which are operated on atomically, where panicking while the mutex is held rolls back the transaction.
Or just abort the whole process instead of poisoning — at least a process watchdog or crash reporter will restart it and restore some working state, instead of leaving a live application in a vegetative state.
Erlang gets away with it because actors contain their blast radius. Even if an actor crashes, it can't mutate the heap of another actor, so there are limits to state corruption. This is much harder to do when programs can mutate a shared heap. However, I think Verse is trying to address this.
I think the proposal is a bit simpler and more flexible than STM. The requirement is for some code that runs in the panic handler before the mutex is unlocked. This could be used as a primitive to implement something like STM, if the panic handler undid all mutations to the state protected by the mutex, but it can also just put things in a well-defined error state, release the lock, and continue.
Personally, I like the notion of removing as much default behavior as possible; one of the specific things I like about Rust is how variables have no default values, requiring that the developer reason out ahead of declaration when and where those variables are going to be used - and it also means that, for instance, seeing a variable with a value of "0" or "null" or "nil" or whatever is meaningful, rather than "but was this actually set on purpose or is this a dangling default" like I'm used to from other codebases.
So in this case, I'd be all for removing default behavior on mutexes and forcing the same kind of preemptive reasoning: where is this mutex required and what am I using it for? And likewise, having a universal "poison this thing if something happens" interface could make a number of other interesting patterns available that I think could lead to safer code.
Although I really like the pattern of "don't provide a default -- make people think about this" in many cases, there are several reasons I don't think that makes sense here:
Isn't it possible today to create a wrapper around a mutex which automatically clears the poison bit every time the mutex is accessed? If so, we could call it something like "non-poisoning mutex".
In other words, I believe we already have the necessary tools to give a developer control over whether a given mutex supports poisoning or not. If we were to switch to non-poisoning mutexes, it wouldn't work the other way around: I don't know how to build a poisoning mutex from a non-poisoning one.
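Something along these lines should work, assuming Rust 1.77+ for clear_poison; the wrapper name and shape are just illustrative:

    use std::sync::{Mutex, MutexGuard};

    // A wrapper that ignores poisoning: every lock attempt recovers the
    // guard from a PoisonError and clears the poison flag again.
    struct NonPoisoningMutex<T>(Mutex<T>);

    impl<T> NonPoisoningMutex<T> {
        fn new(value: T) -> Self {
            Self(Mutex::new(value))
        }

        fn lock(&self) -> MutexGuard<'_, T> {
            match self.0.lock() {
                Ok(guard) => guard,
                Err(poisoned) => {
                    self.0.clear_poison(); // reset the flag for later callers
                    poisoned.into_inner()  // hand back the guard anyway
                }
            }
        }
    }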
I can see an argument for non-poisoning read locks, as by definition a read lock is not mutating state (noting that a read operation that computes and caches a result requires acquiring a write lock, so yay for deadlock hazards :D)
But a write lock can't be considered sound if unpoisoned, at least not by default: you'd have to be exceptionally certain that your writing operation could never panic or fail midway through any operation that mutates the relevant data structures, and it's hard to see how you could make such a claim given that by definition such a panic is occurring in an unexpected location?
I would say, most of the time I don't need poisoning, but sometimes I do. The fact that it exists by default right now acts as a prompt for me to consider if there's any rescuing to be done or not.
I could see an obvious compromise here by offering two variants of Mutex, analogous to e.g. blocking and non-blocking channels, so you can choose at construction time which behaviour you want.
Declaring it up-front could also bring the opportunity to declare rescuing mechanisms up-front. Right now you have to deal with rescuing everywhere you lock a Mutex, which could be a lot of different places.
Something else that I have found odd is that rescue logic mostly depends on the operation that panicked while holding the Mutex, but has to happen on the next attempt to lock it, which doesn't give you a lot of info about the state the contained data might be in. I feel it would be more useful to define a rescue strategy on the former, along the lines of "if this panicked with the lock, just roll back the value" or whatever is needed to regain consistency.
I would say, most of the time I don't need poisoning, but sometimes I do.
This has been mentioned a few times in various discussions on the topic, but I don't understand the argument. How is this different from array bounds checks, for example? Most of the time you don't need them. They exist for the times that you do. Like bounds checks, the cost is presumably non-zero but small.
The fact that it exists by default right now acts as a prompt for me to consider if there's any rescuing to be done or not.
This is a separate issue. If you feel it's unnecessary friction to consider this at all, there's the proposal to panic from lock() instead of requiring an unwrap(). Then you get no friction, but also the safety from poisoning.
I could see an obvious compromise here by offering two variants of Mutex, analogous to e.g. blocking and non-blocking channels, so you can choose at construction time which behaviour you want.
I believe that is the plan and that part is not controversial. The question is which one std::sync::Mutex will be. That's the one that would be used by existing code (after an edition change, so it won't be a silent behavior change). It's also the one that people who don't know about this question will presumably reach for.
The fact that it exists by default right now acts as a prompt for me to consider if there's any rescuing to be done or not.
I think I wasn't clear enough here, I consider this something good. The power of defaults is strong, and I think it's good to nudge people in the direction of safety.
In terms of the ergonomics though, if you actually don't care about poisoning at the moment, you can't just unwrap the lock; you still need to match on the poison error as well and unwrap that to resume operation. When I say I don't care about poisoning I don't mean I want to halt my program if a lock gets poisoned, I just don't want to deal with repairing the data. I might just live with the fact that the data might be inconsistent.
I saw the plans for splitting Mutex, and I think that's not a bad idea, it achieves those ergonomics for when keeping the program running is more important than data consistency.
The current libs-api thinking is that making poison explicit would be better for those that want it. E.g. Mutex<Poison<T>> where Mutex itself is non-poisoning. Which is similar to how other rusty APIs are composable rather than having multiple specific types (the poster child for this is Arc<Mutex<T>>).
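Very roughly, and with the caveat that the actual libs-api design isn't settled, the opt-in wrapper could be sketched in user space like this (names are hypothetical; a real version would also implement Deref/DerefMut on the guard):

    use std::sync::{Mutex, MutexGuard};

    struct Poison<T> {
        poisoned: bool,
        value: T,
    }

    struct PoisonGuard<'a, T>(MutexGuard<'a, Poison<T>>);

    impl<T> Drop for PoisonGuard<'_, T> {
        fn drop(&mut self) {
            // Dropped while unwinding from a panic: flag the value.
            if std::thread::panicking() {
                self.0.poisoned = true;
            }
        }
    }

    fn lock_checked<T>(lock: &Mutex<Poison<T>>) -> Option<PoisonGuard<'_, T>> {
        // The outer Mutex itself never poisons in this sketch.
        let inner = lock.lock().unwrap_or_else(|e| e.into_inner());
        if inner.poisoned { None } else { Some(PoisonGuard(inner)) }
    }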
I wish Rust had a panic_safe block that the compiler would reject if anything inside it could panic. Then a mutex could require locking only within that block, and the poisoning issue would disappear.
Minor nit and completely beside the point but...
Even something as simple as a println! can panic.
println! is not simple. Or rather, it's simple to use but it's doing a lot and, more to the point, explicitly panics on I/O errors. Which I'd argue you rarely want in production code outside of tests. E.g. closing the other end of a pipe is common and should not cause your program to crash with an error. Even with other I/O errors you'll likely want to exit more gracefully with a better error message.
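For reference, the usual way around that is to write to the stdout handle explicitly and propagate the error rather than letting println! panic, roughly:

    use std::io::Write;

    // println! panics if writing to stdout fails (e.g. the other end of a
    // pipe was closed); writeln! against the handle returns an io::Result.
    fn print_line(msg: &str) -> std::io::Result<()> {
        writeln!(std::io::stdout(), "{msg}")
    }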
True, I've definitely been known to go around patching Rust binaries to set the SIGPIPE disposition to SIG_DFL =)
It's kind of funny how people conflate ease or difficulty of interface with the simplicity or complexity of the implementation, and it goes both ways: many vector related functions have complicated interfaces but the implementations are actually trivial.
Lock poisoning has been a mis-design and has caused me lots of pain before when combined with panic handlers in web handlers. Over time parts of your stack start to break because some global lock ends up poisoned and then panics, but only on some routes.
I only had frustrations with that feature and not once did it actually help me with anything. Sometimes well intended is just bad :)
As I outlined in the post, we have had no end of actual bugs related to unexpected cancellations in critical sections.
What was the thing that panicked inside the lock to poison it in the first place? Was it not a concern?
What was the thing that panicked inside the lock to poison it in the first place?
It's quite common that locks are held for quite a while and that what panics is not even related to the particular lock. As for why things panic: at scale, stuff just panics. An out-of-bounds access alone will do that. Sure, you should fix that, but you don't want a panic to take down your entire service, which is why you catch the panics in a landing pad.
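For anyone unfamiliar with the term, a landing pad here is just a catch_unwind around the per-request work, along these lines (the handler signature and error mapping are made up for illustration):

    use std::panic::{catch_unwind, AssertUnwindSafe};

    // One panicking request becomes an error response instead of taking
    // the whole process down with it.
    fn run_handler(handle_request: impl FnOnce() -> String) -> Result<String, &'static str> {
        catch_unwind(AssertUnwindSafe(handle_request))
            .map_err(|_| "internal server error: handler panicked")
    }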
Was it not a concern?
I can genuinely say that poisoning has never helped me, but it has caused me great pain. I am not sure what code you have to write where the poisoning actually helps you in practice.
but you don't want a panic to take down your entire service
Oxide Computer, where @sunshowers works, seem to disagree. They use panic = "abort" in release builds of their control plane.
Which is a way to run things, but then you program very defensively. I would argue very few companies operate like oxide does. The risk of a panic in a high throughput web service is very high because of the collateral damage. We have also seen this just recently cause a massive incident at Cloudflare where a panic on a critical path caused a service to fail completely.
Isolating failures for many web services is important. Either you need to avoid most panics and resort to error passing, or you need a landing pad.
This is restating some of the OP but just to be on the same page: I come from a background where we separate runtime failures into programmer errors and operational errors -- the same as what the OP calls unexpected and expected errors, respectively. Operational errors are handleable and in Rust they're generally handled with Result and error propagation. Panicking is for programmer errors.

The key point about programmer errors is that as the programmer, you cannot reliably reason about the behavior of the program after it's hit a programmer error. You may think you get better availability by swallowing them, but you're rolling the dice because you don't really know what the blast radius was. By construction: this is a thing that you thought couldn't happen and didn't write code to handle.

Here's the most concrete example from my career: in a [2013-era] Node.js program I worked on, folks installed a global uncaughtException handler for exactly the reason you said: to keep the program running in the face of whatever strange thing came up. One day, the program threw an exception while processing a database result while hanging onto a database connection with a session open that had row locks held in the database. If the program had simply crashed on the uncaught exception, this would have been a non-issue. The process exits, the file descriptor is closed, the TCP socket is shut down, the database cleans up the session, and the lock gets dropped. Instead, everything just barreled along with this lock held and tons of queries started backing up. The whole application ground to a halt and required operator intervention to correct.

The problem here is that while, as an industry, we have done a lot of work to make the Unix process a meaningful fault domain, it's much harder to construct meaningful fault domains within a process such that you can reason about the rest of the program.
Now, you could make a credible case that with Rust, it's possible to create smaller fault domains. I'm not convinced, but it's at least a lot harder for one thread to randomly corrupt data in another one or for state (like the database session above) to not get cleaned up on unwind or whatever.
But then we get to the second point: with Rust, you just don't really need to do any of that. At least for me, Rust has almost completely eliminated these implicit fatal failures (invalid memory access, bounds check violations, division by zero, etc.). (Some of this depends on how much you lean into it. If you write code using as and array indexes, you can still have these; if you switch to try_from() and iterators, etc., then all these implicit failures become explicit ones.)

What's left are the explicit failures: calls to panic!, unwrap(), expect(), etc. We do use those for things that really shouldn't be possible (e.g., I just checked the length of this collection, it was non-zero, and then I called iter().next()). But I almost never see those fire. And if a case is reasonably possible, even if it "shouldn't" happen (e.g., got garbage data from the database), we just ... handle those cases. It's usually not hard -- define some useful error types and use ? instead of .unwrap().

In the end (and I do feel kind of crazy saying this -- I know how it sounds), it just hasn't been that hard to have the program not panic. (I'm not saying Rust eliminates all bugs. Programs still have bugs. But they mostly only panic because someone wrote code that explicitly panics.) Of course, we have a restarter and we have redundancy so that when it happens, the system survives. But it's not like stuff's panicking all the time in production.
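A tiny illustration of that implicit-vs-explicit point (hypothetical helpers, not code from the post):

    // Implicit failure points: `as` silently wraps and indexing panics.
    fn nth_implicit(values: &[u64], n: i64) -> u64 {
        values[n as usize]
    }

    // The same logic with every failure made explicit and propagated.
    fn nth_explicit(values: &[u64], n: i64) -> Result<u64, String> {
        let idx = usize::try_from(n).map_err(|e| e.to_string())?;
        values
            .get(idx)
            .copied()
            .ok_or_else(|| format!("index {idx} out of bounds"))
    }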
I can't say how many companies operate like this, but I can say we're definitely not unique among companies seeking to build highly reliable software in a variety of languages (I've been working this way in C, JavaScript, and Rust).
To add to that: another downside of some sort of global "handle error and log it" hook is that it results in the error being ignored, especially when errors are frequent. For example, at GitLab our Sentry instance had something like half a million errors stored at any given time. While many were timeout related errors that shouldn't be logged in the first place (as there's often nothing that can be done), there were also many errors that likely would've been resolved a long time ago had they actually caused the program to terminate.
The problem here is that while, as an industry, we have done a lot of work to make the Unix process a meaningful fault domain, it's much harder to construct meaningful fault domains within a process such that you can reason about the rest of the program.
I've found that any normal (fallible) function, taking some input and returning either an output or an error, gives you exactly this kind of "fault domain" model, in a way that's composable and easy to reason about. The control flow is just what's written on the page, and the fault domain is just the function call.
Exceptions, in contrast, mean you end up with one visible (happy path) control flow on the page, and another invisible (sad path) control flow hiding behind every function call and return. With exceptions the "fault domain" is enormous, as far up the stack that the exception is able to bubble. Much harder to reason about.
This is restating some of the OP but just to be on the same page: I come from a background where we separate runtime failures into programmer errors and operational errors -- the same as what the OP calls unexpected and expected errors, respectively. Operational errors are handleable and in Rust they're generally handled with Result and error propagation. Panicking is for programmer errors.
We are in agreement here.
The key point about programmer errors is that as the programmer, you cannot reliably reason about the behavior of the program after it's hit a programmer error. You may think you get better availability by swallowing them, but you're rolling the dice because you don't really know what the blast radius was.
Panics should be rare, but they can be forced. If you operate a web service an uncaught panic will not just terminate one request, it will take down the entire process. The blast radius of this can be severe. That brings a lot of risk to operating a stable service.
But then we get to the second point: with Rust, you just don't really need to do any of that. At least for me, Rust has almost completely eliminated these implicit fatal failures (invalid memory access, bounds check violations, division by zero, etc.). (Some of this depends on how much you lean into it. If you write code using as and array indexes, you can still have these; if you switch to try_from() and iterators, etc., then all these implicit failures become explicit ones.)
Panics in Rust web services are quite common. In my experience, a common occurrence is people working with data and indexing into collections with untrusted user input. These panics are often not caught until someone fuzzes the codebase, but the impact they can have on service stability when high concurrency is involved is significant.
I'm quite confident that if you let me look at the codebase of a Rust web service that uses std mutexes (paired with catch_unwind), I can find some misbehavior with poisoning that severely degrades the service without crashing it completely, creating the precise situation you worried about with Node.
I am not sure what code you have to write where the poisoning actually helps you in practice.
Imagine you have some state like this:
struct Server {
state: Arc<RwLock<State>>,
}
struct State {
foo: Vec<Foo>,
bar: Vec<Bar>,
}
And you have some methods like this (this is contrived, bear with me)
impl Server {
fn add_entry(&self) {
let mut state = self.state.write().unwrap();
state.foo.push(Foo::new());
state.bar.push(Bar::new());
}
fn use_state(&self, f: impl Fn(&Foo, &Bar)) {
let state = self.state.read().unwrap();
        for index in 0..state.foo.len() {
            f(&state.foo[index], &state.bar[index]);
}
}
}
use_state has a precondition that foo and bar have the same number of elements. If Bar::new panics then that precondition is violated. Without poisoning, you would rewrite use_state to assert!(state.foo.len() == state.bar.len()), but with poisoning you don't need to explicitly assert on all your preconditions for State whenever acquiring a lock for it.
Of course you can rewrite this code such that lock poisoning is useless, but I'd also say that you can rewrite your locking code such that critical sections are small enough that if a panic really occurs while that lock is held it is meaningful.
In ACID terms, poisoning the lock asserts that the guarded state is consistent. Whether or not it's required depends on what your updates are. Personally I think it's a bit heavy handed but since much of Rust defaults to "do the safe thing" it's better than the alternative (for that, there's parking_lot)
I know how all of this works, but I can also tell you that at no point did poisoning ever solve anything for me. This is also in no way a problem that is unique to Rust.
Serious question: how is it different from, say, bounds checks?
A slice is equipped to ensure memory safety by checking that an index is less than its length, whereas a Mutex cannot be equipped to ensure correctness for an arbitrary T it knows nothing about, used in ways it also knows nothing about.
but you don't want a panic to take down your entire service
Why not? Just crash and restart. If you don't want to restart all the time then panic less.
That turns into a problem once someone finds invalid input to your web service and starts to DoS you. It's particularly problematic if you also manage to end up with a persisted DoS. You can do all of that as an additional layer, but unless you run serverless functions with perfect isolation and can pay for that, you need to catch panics to limit the impact on your infrastructure.
So use a language where you can recover from invalid input without panicking. If you're using Rust for the performance that's great, but the downside is that if you panic your application ends up in an inconsistent state (such as, but not limited to, violated lock conditions). If you don't like that then don't unwrap! (etc.) or use a language where the (moral) equivalent of unwrap just raises an error with well-defined semantics.
some global lock ends up poisoned and panics
If it didn't need to panic, it could be recovered. If it did need to panic, it's good that it did.
Were they your locks or someone else's (library)? Perhaps the author didn't know that a panic was advisory and not propagating it was okay.
Were they your locks or someone else's (library)? Perhaps the author didn't know that a panic was advisory and not propagating it was okay.
Rust does not make it particularly easy to work with a poisoned lock, in particular because there is no way to reset it. Often you cannot swap it out. So the only option is to use into_inner on the mutex error and ignore the poisoning that way. In practice, from my experience, people just unwrap(). I have seen poisoning be an issue in my own code, coworkers' code and others'. Gradually that became less of an issue because for the most part people moved to parking_lot, which does not have poisoning.
https://doc.rust-lang.org/std/sync/struct.Mutex.html#method.clear_poison
Seems to have been introduced some time ago, so it should nowadays be possible to do work with poison in a much nicer manner.
Today I learned. But it doesn't really help, because how do you clear it in practice? The mutexes are usually hidden somewhere inside a structure that doesn't necessarily expose them. And when you are handling it yourself already, calling into_inner is easier than clearing the bit.
I’m not sure how that actually simplifies any case I currently handle differently.
Well, I'd assume that in the case where you don't really care about poison, you'd check for poison, log the finding somewhere (so that you can later see that there's something potentially concerning here), and clear the poison before taking the lock. I at least assume panicking within a "critical section" is something you want to eliminate in the long run, either by shortening the period the lock is held, by moving fallible prep work outside the lock if that is what panics, or by changing panicky code paths into checked ones.
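Concretely, that log-and-clear step could be as small as this (assumes Rust 1.77+ for clear_poison; the logging is just a placeholder):

    use std::sync::Mutex;

    fn lock_ignoring_poison<T>(shared: &Mutex<T>) -> std::sync::MutexGuard<'_, T> {
        if shared.is_poisoned() {
            eprintln!("warning: lock was poisoned by an earlier panic; clearing");
            shared.clear_poison();
        }
        // Even after clearing there is a race with another panicking thread,
        // so still recover from a PoisonError instead of unwrapping.
        shared.lock().unwrap_or_else(|e| e.into_inner())
    }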
Though, if you absolutely don't care about panics in the critical section (e.g. you've checked that they don't cause issues and specifically want to use exceptions instead of Result for performance purposes), then I guess just clear the poison unconditionally before taking the lock?
Though, if you absolutely don't care about panics in the critical section (e.g. you've checked that they don't cause issues and specifically want to use exceptions instead of Result for performance purposes), then I guess just clear the poison unconditionally before taking the lock?
Or, just use a non panicking lock. If you really care about panics, wrap it in something that sets a panic bit on unwind like the proposed Poison<T>. For when you need it, opt into it.
Yup, at this point we're back to the traditional arguments over defaults and if they matter, and opt-in versus opt-out safety.