> Poisoning: Panic Safety in Mutexes This is one of the biggest design flaws in ...

LegionMammal978 · 2025-11-24T18:35:06 1764009306

> It's always either an `unwrap` (and we know how well that can go [2])

If a mutex has been poisoned, then something must have already panicked, likely in some other thread, so you're already in trouble at that point. It's fine to panic in a critical section if something's horribly wrong, the problem comes with blindly continuing after a panic in other threads that operate on the same data. In general, you're unlikely to know what that panic was, so you have no clue if the shared data might be incompletely modified or otherwise logically corrupted.

In general, unless I were being careful to maintain fault boundaries between threads or tasks (the archetypical example being an HTTP server handling independent requests), I'd want a panic in one thread to cascade into stopping the program as soon as possible. I wouldn't want to swallow it up and keep using the same data like nothing's wrong.

kprotty · 2025-11-24T19:07:37 1764011257

> so you have no clue if the shared data might be incompletely modified or otherwise logically corrupted.

One can make a panic wrapper type if they care. It's what the stdlib Mutex currently does:

MutexGuard checks if it's panicking during drop using `std::thread::panicking()`, and if so, sets a bool on the Mutex. The next acquirer checks for that bool and knows the state may be corrupted. No need to bake this into the Mutex itself.
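A rough sketch of such a wrapper, for illustration only. The names `Poison`, `PoisonGuard`, `Poisoned` and `guard()` are made up here; this is neither the standard library's implementation nor any planned API. Note that std's `Mutex` still poisons on its own today, so the demo strips that layer with `into_inner()` to show the wrapper's flag.

```rust
use std::ops::{Deref, DerefMut};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical wrapper: remembers whether a guard was dropped mid-panic.
pub struct Poison<T> {
    poisoned: bool,
    value: T,
}

/// Hypothetical error returned once the flag is set.
#[derive(Debug)]
pub struct Poisoned;

pub struct PoisonGuard<'a, T> {
    inner: &'a mut Poison<T>,
    // True if the thread was already unwinding when the guard was created.
    was_panicking: bool,
}

impl<T> Poison<T> {
    pub fn new(value: T) -> Self {
        Poison { poisoned: false, value }
    }

    /// Hands out access unless an earlier guard was dropped during a panic.
    pub fn guard(&mut self) -> Result<PoisonGuard<'_, T>, Poisoned> {
        if self.poisoned {
            return Err(Poisoned);
        }
        Ok(PoisonGuard { inner: self, was_panicking: thread::panicking() })
    }
}

impl<T> Deref for PoisonGuard<'_, T> {
    type Target = T;
    fn deref(&self) -> &T {
        &self.inner.value
    }
}

impl<T> DerefMut for PoisonGuard<'_, T> {
    fn deref_mut(&mut self) -> &mut T {
        &mut self.inner.value
    }
}

impl<T> Drop for PoisonGuard<'_, T> {
    fn drop(&mut self) {
        // Set the flag only if the panic started while this guard was alive.
        if !self.was_panicking && thread::panicking() {
            self.inner.poisoned = true;
        }
    }
}

fn update_then_panic(shared: &Mutex<Poison<Vec<i32>>>) {
    let mut lock = shared.lock().unwrap();
    let mut data = lock.guard().unwrap();
    data.push(4);
    panic!("simulated failure mid-update");
}

fn wrapper_detects_panic() -> bool {
    let shared = Arc::new(Mutex::new(Poison::new(vec![1, 2, 3])));
    let worker = Arc::clone(&shared);
    let joined = thread::spawn(move || update_then_panic(&worker)).join();
    // std's Mutex is poisoned as well today; strip that layer to inspect ours.
    let mut lock = shared.lock().unwrap_or_else(|e| e.into_inner());
    let rejected = lock.guard().is_err();
    joined.is_err() && rejected
}

fn main() {
    assert!(wrapper_detects_panic());
}
```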

LegionMammal978 · 2025-11-24T20:05:10 1764014710

My point is that "blindly continuing" is not a great default if you "don't care". If you continue, then you first have to be aware that a multithreaded program can and will continue after a panic in the first place (most people don't think about panics at all), and you also have to know the state of the data after every possible panic, if any. Overall, you have to be quite careful if you want to continue properly, without risking downstream bugs.

The design with a verbose ".lock().unwrap()" and no easy opt-out is unfortunate, but conceptually, I see poisoning as a perfectly acceptable default for people who don't spend all their time musing over panics and their possible causes and effects.

kouteiheika · 2025-11-24T18:59:30 1764010770

> If a mutex has been poisoned, then something must have already panicked, likely in some other thread, so you're already in trouble at that point.

I find that in the majority of cases you're essentially dealing with one of two cases:

1) Your critical sections are tiny and you know you can't panic, in which case dealing with poisoning is just useless busywork.

2) You use a Mutex to get around Rust's "shared xor mutable" requirement. That is, you just want to temporarily grab a mutable reference and modify an object, but you don't have any particular atomicity requirements. In this case panicking is no different than if you would panic on a single thread while modifying an object through a plain old `&mut`. Here too dealing with poisoning is just useless busywork.

> I'd want a panic in one thread to cascade into stopping the program as soon as possible.

Sure, but you don't need mutex poisoning for this.

LegionMammal978 · 2025-11-24T20:23:06 1764015786

> 1) Your critical sections are tiny and you know you can't panic, in which case dealing with poisoning is just useless busywork.

Many people underestimate how many things can panic in corner cases. I've found quite a few unsafe functions in various crates that were unsound due to integer-overflow panics that the author hadn't noticed. Knowing for a fact that your operation cannot panic is the exception rather than the rule, and while it's unfortunate that the std Mutex doesn't accommodate non-poisoning mutexes, I see poisoning as a reasonable default.

(If Mutex::lock() unwrapped the error automatically, then very few people would even think about the "useless busywork" of the poison bit. For a similar example, the future types generated for async functions contain panic statements in case they are polled after completion, and no one complains about those.)

> 2) You use a Mutex to get around Rust's "shared xor mutable" requirement. That is, you just want to temporarily grab a mutable reference and modify an object, but you don't have any particular atomicity requirements.

Then I'd stick to a RefCell. Unless it's a static variable in a single-threaded program, in which case I usually just write some short wrapper functions if I find the manipulation too tedious.

JoshTriplett · 2025-11-24T19:24:50 1764012290

We're currently working on separating poison from mutexes, such that the default mutexes won't have poisoning (no more `.lock().unwrap()`), and if you want poisoning you can use something like `Mutex<Poison<T>>`.

kouteiheika · 2025-11-24T19:53:13 1764013993

Yeah, I'm looking forward to it!

While we're at it, another thing that'd be nice to get rid of is `AssertUnwindSafe`, which I find even more pointless.

JoshTriplett · 2025-11-25T04:07:02 1764043622

Speaking only for myself (though several other people have expressed the same sentiment), I wish we could get rid of unwinding. That would be a massive challenge to do while preserving capabilities people care about, such as the ability to handle panics in http request handlers without exiting. I think it would be possible, though.

questioner8216 · 2025-11-25T04:33:37 1764045217

That sounds really interesting, whether it is done in Rust, some Rust 2.0, or a successor or experimental language. I do not know whether it is possible, though. If one does not unwind, what should actually happen instead? How would for instance partial computations, and resources on the stack, be handled? Some partial or constrained unwinding? I have not given it a lot of thought, though. How do languages without exceptions handle it? How does C handle it? Error codes all the way? Maybe something with arenas or regions?

I do not have a good grasp on panics in Rust, but panics in Rust being able to either unwind or abort dependent on configuration, seems complex, and that design happened for historical reasons, from what I have read elsewhere.

JoshTriplett · 2025-11-25T17:03:04 1764090184

Vague sketch: imagine if we had scoped panic hooks, unhooked via RAII. So, for use cases that today use unwinding for cleanup (e.g. "switch the terminal back out of curses mode"), you do that cleanup in a panic hook instead.

The hard use case to handle without unwinding is an HTTP server that wants to allow for panics in a request handler without panicking the entire process. Unwinding is a janky way to handle that, and creates issues in code that doesn't expect unwinding (e.g. half-modified states), and poisoning in particular seems likely to cascade and bring down other parts of the process anyway if some needed resource gets poisoned. But we need a reasonable alternative to propose for that use case, in order to seriously evaluate eliminating unwinding.

questioner8216 · 2025-11-25T17:25:08 1764091508

I am not sure that I understand what scoped panic hooks would or might look like. Are they maybe similar to something like try-catch-finally in Java? Would the language force the programmer to include them in certain cases somehow?

Suppose a request handler is, at some point in time, 7 nested calls deep. Calls no. 2 and no. 6 hold resources and partial computations that need cleanup somehow and somewhere, and call no. 7 panics. I wonder what the code would look like in the different calls, what would happen and when, what the compiler would require, and what other relevant code would look like.

JoshTriplett · 2025-11-25T22:27:24 1764109644

For the simple case, suppose that you're writing a TUI application that takes over the terminal. When it exits, even by panic, you want to clean up the terminal state so the user doesn't have to blindly type "reset".

Today, people sometimes do that by using `panic = "unwind"`, and writing a `catch_unwind` around their program, and using that to essentially implement a "finally" block. Or, they do it by having an RAII type that cleans up on `Drop`, and then they count on unwinding to ensure their `Drop` gets called even on panic. (That leaves aside the issue that something called from `Drop` is not allowed to fail or panic itself.) The question is, how would you do that without unwinding?
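For illustration, a minimal sketch of the two approaches combined: an RAII guard whose `Drop` runs during unwinding, with `catch_unwind` at the top level. `TerminalGuard` and the `TERMINAL_RESTORED` flag are stand-ins invented for this example (a real TUI would restore the terminal instead of flipping a flag), and it assumes the default `panic = "unwind"`.

```rust
use std::panic;
use std::sync::atomic::{AtomicBool, Ordering};

// Stand-in for real terminal state, so the effect of cleanup is observable.
static TERMINAL_RESTORED: AtomicBool = AtomicBool::new(false);

/// Hypothetical RAII type: cleans up when dropped, including during unwinding.
struct TerminalGuard;

impl Drop for TerminalGuard {
    fn drop(&mut self) {
        // A real application would leave raw/curses mode here.
        TERMINAL_RESTORED.store(true, Ordering::SeqCst);
    }
}

fn run_app() {
    let _guard = TerminalGuard;
    panic!("simulated bug in the UI loop");
}

/// Returns true if the app panicked and the cleanup still ran.
fn run_with_cleanup() -> bool {
    // catch_unwind acts as the outer "finally": the panic stops here.
    let outcome = panic::catch_unwind(run_app);
    outcome.is_err() && TERMINAL_RESTORED.load(Ordering::SeqCst)
}

fn main() {
    assert!(run_with_cleanup());
}
```

With `panic = "abort"` neither mechanism gets a chance to run, which is the gap the rest of this comment is about.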

We have a panic hook mechanism, where on panic the standard library will call a user-supplied function. However, there is only one panic hook; if you set it, it replaces the old hook. If you have only one cleanup to do, that works fine. For more than one, you can follow the semantic of having your panic hook call the previous hook, but that does not allow unregistering hooks out of order; it only really works if you register a panic hook once for the whole program and never unregister it (e.g. "here's the hook for cleaning up tracing", "here's the hook for cleaning up the terminal state").
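The "call the previous hook" pattern can be sketched with `std::panic::take_hook` and `std::panic::set_hook`. The counter and the function names are made up for the example; a real hook would do actual cleanup.

```rust
use std::panic;
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts how many cleanup hooks ran, so the chaining is observable.
static CLEANUPS_RUN: AtomicUsize = AtomicUsize::new(0);

/// Installs a hook that does its own cleanup and then calls whatever
/// hook was installed before it.
fn install_cleanup_hook() {
    let previous = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        CLEANUPS_RUN.fetch_add(1, Ordering::SeqCst);
        previous(info);
    }));
}

fn boom() {
    panic!("simulated failure");
}

fn chained_hooks_demo() -> usize {
    install_cleanup_hook(); // e.g. "clean up tracing"
    install_cleanup_hook(); // e.g. "clean up the terminal state"
    let _ = panic::catch_unwind(boom);
    CLEANUPS_RUN.load(Ordering::SeqCst)
}

fn main() {
    assert_eq!(chained_hooks_demo(), 2);
}
```

Each hook owns its predecessor, so there is no way to remove one from the middle of the chain, which is the limitation described above.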

Suppose, instead, we had a mechanism that allowed registering arbitrary panic hooks, and unregistering them when no longer needed, in any order. Then, we could do RAII-style resource handling: you could have a `CursesTerminal` type, which is responsible for cleaning up the terminal, and it cleans up the terminal on `Drop` and on panic. To do the latter, it would register a panic hook, and deregister that hook on `Drop`.
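No such mechanism exists in the standard library. Purely as an illustration, the idea can be approximated today with one global hook that consults a registry. Everything here (`ScopedHook`, `REGISTRY`, the newest-first ordering) is a hypothetical design for this sketch, not a proposal from the thread, and it has obvious gaps: the registry is process-wide rather than per-thread, and a cleanup that panics or touches the registry would abort or deadlock.

```rust
use std::collections::BTreeMap;
use std::panic;
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::{Mutex, Once};

type Cleanup = Box<dyn Fn() + Send + 'static>;

// Hypothetical process-wide registry of cleanups, keyed by registration order.
static REGISTRY: Mutex<BTreeMap<u64, Cleanup>> = Mutex::new(BTreeMap::new());
static NEXT_ID: AtomicU64 = AtomicU64::new(0);
static INSTALL: Once = Once::new();

/// Hypothetical RAII handle: the cleanup stays registered until this is dropped.
pub struct ScopedHook {
    id: u64,
}

impl ScopedHook {
    pub fn register(cleanup: impl Fn() + Send + 'static) -> Self {
        // Install the single real panic hook the first time anything registers.
        INSTALL.call_once(|| {
            let previous = panic::take_hook();
            panic::set_hook(Box::new(move |info| {
                let hooks = REGISTRY.lock().unwrap_or_else(|e| e.into_inner());
                // Newest registration first, mirroring destructor order.
                for cleanup in hooks.values().rev() {
                    cleanup();
                }
                drop(hooks);
                previous(info);
            }));
        });
        let id = NEXT_ID.fetch_add(1, Ordering::Relaxed);
        REGISTRY
            .lock()
            .unwrap_or_else(|e| e.into_inner())
            .insert(id, Box::new(cleanup));
        ScopedHook { id }
    }
}

impl Drop for ScopedHook {
    fn drop(&mut self) {
        // Deregistration works in any order because entries are keyed by id.
        REGISTRY
            .lock()
            .unwrap_or_else(|e| e.into_inner())
            .remove(&self.id);
    }
}

static A_RAN: AtomicBool = AtomicBool::new(false);
static B_RAN: AtomicBool = AtomicBool::new(false);

fn boom() {
    panic!("simulated failure");
}

fn scoped_hooks_demo() -> (bool, bool) {
    let a = ScopedHook::register(|| A_RAN.store(true, Ordering::SeqCst));
    let b = ScopedHook::register(|| B_RAN.store(true, Ordering::SeqCst));
    drop(a); // deregister the older hook first, i.e. out of order
    let _ = panic::catch_unwind(boom);
    drop(b);
    (A_RAN.load(Ordering::SeqCst), B_RAN.load(Ordering::SeqCst))
}

fn main() {
    assert_eq!(scoped_hooks_demo(), (false, true));
}
```

The demo still uses `catch_unwind` only so that it can observe the result; the hook itself runs before any unwinding starts.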

With such a mechanism, panic hooks could replace anything that uses `catch_unwind` to do cleanup before going on to exit the program. That wouldn't fully solve the problem of doing cleanup and then swallowing the panic and continuing, but it'd be a useful component for that.

Rusky · 2025-11-27T23:03:54 1764284634

> Suppose, instead, we had a mechanism that allowed registering arbitrary panic hooks, and unregistering them when no longer needed, in any order. Then, we could do RAII-style resource handling: you could have a `CursesTerminal` type, which is responsible for cleaning up the terminal, and it cleans up the terminal on `Drop` and on panic. To do the latter, it would register a panic hook, and deregister that hook on `Drop`.

This doesn't get rid of unwinding at all- it's an inefficient reimplementation of it. There's a reason language implementations have switched away from having the main execution path register and unregister destructors and finally blocks, to storing them in a side table and recovering them at the time of the throw.

JoshTriplett · 2025-11-28T19:10:29 1764357029

The difference is that unwinding unwinds code that isn't necessarily prepared for it, rather than only code that explicitly wants it.

And I would expect to turn it into an efficient solution, in part by doing the "store in a side table" approach for hook registration.

Rusky · 2025-11-29T07:52:10 1764402730

Giving special treatment to code that "explicitly wants" to handle unwinding means two things:

* You have to know when an API can unwind, and you have to make it an error to unwind when the caller isn't expecting it. If this is done statically, you are getting into effect annotation territory. If this is done dynamically, you are essentially just injecting drop bombs into code that doesn't expect unwinding. Either way, you are multiplying complexity for generic code. (Not to mention you have to invent a whole new set of idioms for panic-free code.)

* You still have to be able to clean up the resources held by a caller that does expect unwinding. So all your vocabulary/glue/library code (the stuff that can't just assume panic=abort) still needs these "scoped panic hooks" in all the same places it has any level of panic awareness in Drop today.

So for anyone to actually benefit from this, they have to be writing panic-free code with whatever new static or dynamic tools come with this, and they have to be narrowly scoped and purpose-specific enough that they could essentially already today afford panic=abort. Who is this even for?

JoshTriplett · 2025-12-02T16:35:25 1764693325

To be very explicit about something: these are all vague design handwaves, and until they become not only concrete but sufficiently clear to handle use cases people have, they're not going to go anywhere. They're vague ideas we're thinking about. Right now, panic unwind isn't going anywhere.

questioner8216 · 2025-11-26T01:55:26 1764122126

I have not given it much thought, but it would primarily be for the subset of Rust programs that do not need zero-cost abstractions as much, right? Since, even in the case of no panics, one would be paying at runtime for registering panic hooks, if I understand correctly.

JoshTriplett · 2025-11-26T15:48:40 1764172120

I can imagine ways to reduce that cost substantially. And the cost would be a key input into the design, since it's important to optimize for the success path and not have the success path pay cost for the failure path.

sunshowers · 2025-11-25T05:50:32 1764049832

(If unwinding goes away then, sure, mutex poisoning becomes moot.)

sunshowers · 2025-11-24T20:55:22 1764017722

I'm very disappointed at this. The path of least resistance ought to be the right thing to do.

JoshTriplett · 2025-11-25T01:22:24 1764033744

In the entire history of the standard library, we have never once seen a single report of anyone attempting to recover from poison.

exDM69 · 2025-11-25T12:22:08 1764073328

I've used recovering from poisoned state in impl Drop in quite a few places.

In my case it's usually waiting for the GPU to finish some asynchronous work that's been spun up by CPU threads that may have panicked while holding the lock. This is necessary to avoid freeing resources that the GPU may still be using.

I usually prefix this with `if !std::thread::panicking() {}`, so I don't end up waiting (possibly forever) if I'm already cleaning up after a panic.

JoshTriplett · 2025-11-25T23:27:14 1764113234

Thank you for mentioning this; I'd be really interested in hearing more about this, and seeing some examples.

exDM69 · 2025-11-26T08:23:33 1764145413

Hi, I don't have public examples to share but I can give an explanation of a simple scenario.

I have a container of resources, e.g. textures. When the GPU wants to use them, the CPU will lease them until a point of time in the future denoted by a value (u64) of a GPU timeline semaphore. The handle and value of the semaphore are added to a list guarded by a mutex. Then GPU work is kicked off and the GPU will increment the semaphore to that value when done.

In the Drop implementation of the container, we need to wait until all semaphores reach their respective value before freeing resources, and do so even if some thread panicked while holding the lock guarding the list. This is where I use .unwrap_or_else to get the list from the poison value.
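A rough reconstruction of that scenario, since no public code was shared. All names (`TimelineSemaphore`, `Lease`, `TexturePool`) are invented, and the "semaphore" is just an atomic counter standing in for a real GPU timeline semaphore.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

// Counts completed waits so the demo can observe that Drop did its job.
static WAITS_COMPLETED: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for a GPU timeline semaphore (hypothetical).
struct TimelineSemaphore {
    value: AtomicU64,
}

impl TimelineSemaphore {
    fn wait_until(&self, target: u64) {
        while self.value.load(Ordering::Acquire) < target {
            thread::yield_now();
        }
    }
}

struct Lease {
    semaphore: Arc<TimelineSemaphore>,
    target: u64,
}

struct TexturePool {
    leases: Mutex<Vec<Lease>>,
}

impl Drop for TexturePool {
    fn drop(&mut self) {
        // Skip the (possibly endless) wait if we are already unwinding.
        if thread::panicking() {
            return;
        }
        // Take the list even if another thread panicked while holding the lock.
        let leases = self
            .leases
            .lock()
            .unwrap_or_else(|poison| poison.into_inner());
        for lease in leases.iter() {
            lease.semaphore.wait_until(lease.target);
            WAITS_COMPLETED.fetch_add(1, Ordering::SeqCst);
        }
        // Only now would it be safe to free the GPU resources.
    }
}

fn lease_then_panic(pool: &TexturePool, semaphore: Arc<TimelineSemaphore>) {
    let mut leases = pool.leases.lock().unwrap();
    leases.push(Lease { semaphore, target: 1 });
    panic!("simulated failure while holding the lease list lock");
}

fn drop_waits_despite_poison() -> (bool, usize) {
    let semaphore = Arc::new(TimelineSemaphore { value: AtomicU64::new(0) });
    let pool = Arc::new(TexturePool { leases: Mutex::new(Vec::new()) });

    let worker_pool = Arc::clone(&pool);
    let worker_sem = Arc::clone(&semaphore);
    let _ = thread::spawn(move || lease_then_panic(&worker_pool, worker_sem)).join();

    // The "GPU" finishes its work.
    semaphore.value.store(1, Ordering::Release);

    let poisoned = pool.leases.is_poisoned();
    drop(pool); // last owner: Drop runs here and waits on every lease
    (poisoned, WAITS_COMPLETED.load(Ordering::SeqCst))
}

fn main() {
    assert_eq!(drop_waits_despite_poison(), (true, 1));
}
```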

It's not infeasible to try to catch any errors and propagate them when the lock is grabbed. But this is mostly for OOM and asserts that are not expected to fire. The ergonomics would be worse if the "lease" function were fallible.

This said, I would not object to poisoning being made optional.

sunshowers · 2025-11-25T02:28:18 1764037698

Oh, I don't think recovery from poison is why poisoning is good. The reason poisoning is good is that at the moment you've acquired a lock on a mutex, you should be able to assume that the invariants guarded by the mutex are upheld (and panic if not).

simonask · 2025-11-25T09:47:05 1764064025

Mutex doesn't promise to uphold any more invariants than `&mut T` does. If the state can be corrupted by a panic while holding `&mut T`, I don't think there's any good reason to expect that obtaining it through `MutexGuard` should make any difference.

Panic propagation is typically handled much better at thread `join()` boundaries.
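As a small illustration of propagation at a join boundary: `JoinHandle::join` returns the panic payload as an `Err`, and `std::panic::resume_unwind` re-raises it in the joining thread. The `worker` function is made up for the example.

```rust
use std::panic;
use std::thread;

/// Example worker that panics on overflow.
fn worker(input: u32) -> u32 {
    input.checked_mul(2).expect("overflow in worker")
}

/// Runs the worker on another thread and re-raises its panic, if any,
/// in the calling thread.
fn run(input: u32) -> u32 {
    let handle = thread::spawn(move || worker(input));
    match handle.join() {
        Ok(value) => value,
        // The payload of the worker's panic continues unwinding here.
        Err(payload) => panic::resume_unwind(payload),
    }
}

fn main() {
    assert_eq!(run(21), 42);
    assert!(panic::catch_unwind(|| run(u32::MAX)).is_err());
}
```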

zozbot234 · 2025-11-25T10:11:02 1764065462

A panic in single-threaded, non-parallel code will either terminate the program or be recovered cleanly, so the potential for side effects to be silently observed in a way that breaks invariants is unique to Mutex<>. This is the reason for mutex poisoning.

simonask · 2025-11-25T14:29:16 1764080956

I fail to see that there is any material difference. Whether you catch-unwind within a single thread or in a separate thread such that the panic can be resumed on join makes zero difference.

Heck, you can have Drop impls observing the state while unwinding.

A true panic-safe data structure requires serious thought, and mutex poisoning does nothing here - it is neither necessary nor sufficient.

dap · 2025-11-26T04:14:17 1764130457

This is a false dichotomy. Not every technique needs to work in all cases in order to be useful.

This seems analogous to arguing that because seat belts don't save the lives of all people involved in car crashes, and they're kind of annoying, then they shouldn't be factory-standard.

simonask · 2025-11-26T07:34:39 1764142479

This is a case of a feature that is actively harmful for the things it tries to prevent, because it increases the risk in practice of panics "spreading" throughout a system, even after the programmer thought she had finished handling it, and because it gives a false impression of what kind of guarantee you actually have.

JoshTriplett · 2025-11-25T22:29:32 1764109772

This is exactly the problem. Poison is enough to be painful but not enough to fully solve the problem.

> Heck, you can have Drop impls observing the state while unwinding.

Yeah, this is really painful and regularly forgotten. And one reason it'd be nice to not have unwinding.

sunshowers · 2025-11-25T18:01:57 1764093717

I understand what you mean, but what you're saying has not been true for me in practice. Mutexes absolutely are used to uphold invariants in a way that &mut T is much less often.

There's something to be said here about what I've sometimes called the cancellation blast radius. The issues with cancellation happen when the data corruption/invariant violation is externally visible (if the corrupt data is torn down, who cares.) Mutexes make data corruption externally visible very often.

simonask · 2025-11-25T18:42:08 1764096128

In projects I've worked on, this just hasn't been the case. Mutexes, especially in Rust, can grant you a `&mut T` when what you have is `&Mutex<T>`, and that's it - failing to uphold invariants in the API surface of `T` is a bug whether or not it lives inside a mutex.

Lots of data structures need to care about panic-safety. Inserting a node in a tree must leave the tree in a valid state if allocating memory for the new node fails, for example. All of that is completely orthogonal to whether or not the data structure is also observable from multiple threads behind a mutex, and I would argue especially in the case of mutex, whose purpose it is to make an object usable from multiple threads as-if they had ownership.

sunshowers · 2025-11-26T01:40:14 1764121214

Acknowledging that panic safety is a real issue with data structures that mutex poisoning does not solve, I don't think we're going to agree on anything else here, unfortunately. We probably have entirely different experiences writing software -- mutex poisoning is very valuable in higher-level code.

dap · 2025-11-25T16:16:51 1764087411

That’s not surprising to me, but it’s not much of an argument for changing the default to be less safe. Most people want poisoning to propagate fatal errors and avoid reading corrupted data, not to recover from panics.

Edit: isn’t that an argument not to change the default? If people were recovering from poison a lot and that was painful, that’s one thing. But if people aren’t doing that, why is this a problem?

JoshTriplett · 2025-11-25T22:28:06 1764109686

Because right now everyone writes `.lock().unwrap()` everywhere without really thinking about it, and it just makes Mutex more painful to work with.

sunshowers · 2025-11-26T01:36:22 1764120982

If the issue is that everyone has to write an extra unwrap, then a good step would be to make lock panic automatically in the 2027 edition, and add a lock_or_poison method for the current behavior. But I think removing poisoning altogether from the default mutex, such that it silently unlocks on panic, would be very bad. The silent-unlock behavior is terrible with async cancellations and terrible with panics.

dap · 2025-11-26T04:10:41 1764130241

You seem to keep making the implicit assumption that because people are using `unwrap()`, they must not care about the poisoning behavior. I really don't understand where this assumption is coming from. I explicitly want to propagate panics from contexts that hold locks to contexts that take locks. The way to write that is `lock().unwrap()`. I get that some people might write `lock().unwrap()` not because they care about propagating panics, but because they don't care either way and it's easy. But why are you assuming that that's most people?

JoshTriplett · 2025-11-26T06:36:47 1764139007

https://news.ycombinator.com/item?id=46051602

I'm suggesting that the balance of pain to benefit is not working out enough to inflict it on everyone by default. I'm not suggesting it has no value, just not enough to be worth it.

dap · 2025-11-26T21:08:28 1764191308

I hear that, but it feels kind of empty because I haven't seen much discussion of that cost/benefit analysis (both of poisoning itself and of the change to the default behavior, which has its own costs and benefits).

I take it as uncontroversial that an important function of Mutexes is to ensure that invariants about data are maintained when the data is modified and that very bad things can happen when a program's data invariants are violated at runtime and the program doesn't notice. Maybe folks disagree about whether a program should always panic when invariants are violated at runtime (though there's certainly plenty of precedent in Rust itself for doing this, like with array bounds checking). Probably the bigger question mark is that panicking with a Mutex held doesn't necessarily mean an invariant is violated. But it does mean that the mechanism for ensuring the invariant has itself failed. I can see different choices about what to do here. For myself, the event itself is so rare and the impact of getting an invariant wrong so high that I absolutely do want to panic -- the false positive rate is just too small to matter.

halper · 2025-11-25T07:52:56 1764057176

Is that not because there is not much to do, and therefore people use .unwrap() — because crashing is actually quite sane?

Correctness trumps ergonomics, and the default should definitely be poisoning/panicking unless handled. There could definitely be an optional poison-eating mutex, but I argue the current Mutex does the right thing.

nixpulvis · 2025-11-25T01:35:05 1764034505

Excited to hear this.

sunshowers · 2025-11-24T18:47:58 1764010078

To the contrary, the projects I've been part of have had no end of issues related to being cancelled in the middle of a critical section [1]. I consider poisoning to be table stakes for a mutex.

[1] https://sunshowers.io/posts/cancelling-async-rust/#the-pain-...

kouteiheika · 2025-11-24T19:05:58 1764011158

Well, I mean, if you've made the unfortunate decision to hold a Mutex across await points...?

This is completely banned in all of my projects. I have a 100k+ LOC project running in production, that is heavily async and with pervasive usage of threads and mutexes, and I never had a problem, precisely because I never hold a mutex across an await point. Hell, I don't even use async mutexes - I just use normal synchronous parking lot mutexes (since I find the async ones somewhat pointless). I just never hold them across await points.

sunshowers · 2025-11-24T20:59:51 1764017991

As I said in the article, we avoid Tokio mutexes entirely for the exact reason that being cancelled in the middle of a critical section is bad. In Rust, there are two sources of cancellations in the middle of a critical section: async cancellations and panics. Ergo, panicking in the middle of a critical section is also bad, and mutexes ought to detect that and mark their internal state as corrupted as a result.

kouteiheika · 2025-11-24T21:20:37 1764019237

> Ergo, panicking in the middle of a critical section is also bad, and mutexes ought to detect that and mark their internal state as corrupted as a result.

I fundamentally disagree with this. Panicking in the middle of an operation that is supposed to be atomic is bad. If it's not supposed to be atomic then it's totally fine, just as panicking when you hold a plain old `&mut` is fine. Not every use of a `Mutex` is protecting an atomic operation that depends on not being cancelled for its correctness, and even for those situations where you do it's a better idea to prove that a panic cannot happen (if possible) or gracefully handle the panic.

I really don't see a point of mutex poisoning in most cases. You can either safely panic while you're holding a mutex (because your code doesn't care about atomicity), or you simply write your code in such a way that it's still correct even if you panic (e.g. if you temporarily `.take()` something in your critical section then you write a wrapper which restores it on `Drop` in case of a panic). The only thing poisoning achieves is to accidentally give you denial-of-service CVEs, and is actively harmful when it comes to producing reliable software.
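One possible shape for the "restore on `Drop`" wrapper mentioned above, as a sketch. `State`, `ScratchLoan` and the functions around them are invented for the example. Since std's `Mutex` still poisons today, the demo reads the state back through `into_inner()`.

```rust
use std::sync::Mutex;
use std::thread;

struct State {
    scratch: Option<Vec<u8>>,
    total: usize,
}

impl State {
    fn absorb(&mut self, bytes: &[u8]) {
        assert!(!bytes.is_empty(), "empty input");
        self.total += bytes.len();
    }
}

/// Hypothetical wrapper: takes `scratch` out of the state and puts it back
/// when dropped, whether the scope ends normally or by unwinding.
struct ScratchLoan<'a> {
    state: &'a mut State,
    scratch: Option<Vec<u8>>,
}

impl<'a> ScratchLoan<'a> {
    fn new(state: &'a mut State) -> Self {
        let scratch = state.scratch.take();
        ScratchLoan { state, scratch }
    }
}

impl Drop for ScratchLoan<'_> {
    fn drop(&mut self) {
        self.state.scratch = self.scratch.take();
    }
}

fn process(state: &mut State, input: &[u8]) {
    let mut loan = ScratchLoan::new(state);
    let buf = loan.scratch.as_mut().expect("scratch buffer missing");
    buf.clear();
    buf.extend_from_slice(input);
    // May panic; the loan's Drop restores `scratch` either way.
    loan.state.absorb(buf.as_slice());
}

fn locked_process(shared: &Mutex<State>, input: &[u8]) {
    let mut guard = shared.lock().unwrap();
    process(&mut guard, input);
}

/// Returns (panicked and scratch was restored, total after the panic).
fn restore_demo() -> (bool, usize) {
    let shared = Mutex::new(State { scratch: Some(Vec::new()), total: 0 });
    // Empty input makes `absorb` panic inside the critical section.
    let outcome = thread::scope(|s| s.spawn(|| locked_process(&shared, &[])).join());
    let state = shared.lock().unwrap_or_else(|e| e.into_inner());
    let restored = outcome.is_err() && state.scratch.is_some();
    (restored, state.total)
}

fn main() {
    assert_eq!(restore_demo(), (true, 0));
}
```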

sunshowers · 2025-11-25T00:36:16 1764030976

I've written many production Rust services and programs over the years, both sync and async, and in my experience, by far the most common use of mutexes is to temporarily violate invariants that are otherwise upheld while the mutex is unlocked (which I think is what you mean by "atomic"). In some cases invariants can be restored, but in many cases they simply cannot.

Panicking while in the middle of a non-mutex-related &mut T is theoretically bad as well, but in my experience, &mut T temporary invariant violations don't happen nearly as often as corruption of mutex-guarded data.

conradludgate · 2025-11-24T22:52:09 1764024729

You might not think you need atomicity, but some function you call that takes in a `&mut T` might actually expect it

saagarjha · 2025-11-25T11:51:46 1764071506

If you’re not looking to scope out an atomic section, why are you taking the lock?

exDM69 · 2025-11-25T11:36:44 1764070604

Worth noting that this is not `std::sync::Mutex` or `parking_lot::Mutex` as discussed in the article, but `tokio::sync::Mutex` in cancellable async code.

sunshowers · 2025-11-25T18:01:19 1764093679

Correct yes, but my point was about cancellation in the middle of a critical section more generally.

exDM69 · 2025-11-25T11:29:24 1764070164

I disagree, lock poisoning is a good way of improving correctness of concurrent code in case of fatal errors. As demonstrated by the benchmarks in this article, it's not very expensive for typical use cases.

In 99% of the cases where one thread has panic'd while holding a lock, you want to panic the thread that attempts to grab the lock. The contents of anything inside the lock is very much undefined and continuing will lead to unpredictable results. So most of the time you just want:

    let guard = mutex.lock().expect("poisoned");

The last 1% is when you want to clean up something even if a panic has occurred. This is usually in an `impl Drop` situation. It's not much more verbose either, just:

    let guard = mutex.lock().unwrap_or_else(|poison| poison.into_inner());

What is painful is trying to propagate the poison value as an error using `?`. In that case you're probably better off using a match expression, because the usual `.into()` will not play nicely with common error handling crates (thiserror, anyhow), or you need to implement `From` manually for your error types and drop the contents of the poison error before propagating.
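A sketch of the manual `From` route, using only the standard library. `CounterError` and the helper functions are invented for the example. The conversion throws away the guard carried inside `PoisonError`, which is what makes the resulting error `'static`.

```rust
use std::fmt;
use std::sync::{Mutex, PoisonError};
use std::thread;

#[derive(Debug, PartialEq)]
enum CounterError {
    LockPoisoned,
}

impl fmt::Display for CounterError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "counter lock was poisoned")
    }
}

impl std::error::Error for CounterError {}

// Discard the guard inside the PoisonError so the error borrows nothing.
impl<T> From<PoisonError<T>> for CounterError {
    fn from(_: PoisonError<T>) -> Self {
        CounterError::LockPoisoned
    }
}

fn increment(counter: &Mutex<u64>) -> Result<u64, CounterError> {
    let mut guard = counter.lock()?; // `?` goes through the From impl above
    *guard += 1;
    Ok(*guard)
}

fn hold_and_panic(counter: &Mutex<u64>) {
    let _guard = counter.lock().unwrap();
    panic!("simulated failure while holding the lock");
}

/// Poisons the mutex by panicking on a scoped thread that holds the lock.
fn poison(counter: &Mutex<u64>) {
    let _ = thread::scope(|s| s.spawn(|| hold_and_panic(counter)).join());
}

fn main() {
    let counter = Mutex::new(0);
    assert_eq!(increment(&counter), Ok(1));
    poison(&counter);
    assert_eq!(increment(&counter), Err(CounterError::LockPoisoned));
}
```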

This might be the case for long running server processes where you have n:m threading with long running threads and want to keep processing other requests even if one request fails. Although in that case you probably want (or your framework provides) some kind of robustness with `catch_unwind` that will log the errors, respond with HTTP 500 or whatever and then resume. Because that's needed to catch panics from non-mutex related code.

lmm · 2025-11-25T02:37:35 1764038255

> poisoning a Mutex is a very convenient avenue for a denial-of-service attack, since a poisoned Mutex will just completely brick a given critical section.

There's a tension between making DoS hard and avoiding RCE vulnerabilities, since the way to avoid an unplanned/bad code state becoming an RCE vulnerability is to crash as quickly and thoroughly as possible when you get into that state.

thayne · 2025-11-24T18:19:17 1764008357

There are cases where it is useful.

I had a case where if the mutex was poisoned it was possible to reset the lock to a safe state (by writing a new value to the locked content).

Or you may want to drop some resource or restart some operation instead of panicking if it is poisoned.

But I agree that the default behavior should be that the user doesn't have to worry about it.
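For illustration, resetting a poisoned lock can be sketched with `PoisonError::into_inner` and `Mutex::clear_poison` (the latter needs a reasonably recent Rust). The queue type and the "reset to empty" policy are assumptions made for this example, not details from the case described above.

```rust
use std::sync::{Mutex, MutexGuard};
use std::thread;

/// Locks the queue; if it was poisoned, overwrites the contents with a
/// known-good value and clears the poison flag.
fn lock_or_reset(queue: &Mutex<Vec<u32>>) -> MutexGuard<'_, Vec<u32>> {
    queue.lock().unwrap_or_else(|poisoned| {
        let mut guard = poisoned.into_inner();
        guard.clear(); // assumed safe state for this example: an empty queue
        queue.clear_poison();
        guard
    })
}

fn push_then_panic(queue: &Mutex<Vec<u32>>) {
    let mut guard = queue.lock().unwrap();
    guard.push(99);
    panic!("simulated failure while holding the lock");
}

/// Returns (poisoned before reset, poisoned after reset, length after reset).
fn reset_demo() -> (bool, bool, usize) {
    let queue = Mutex::new(vec![1, 2, 3]);
    let _ = thread::scope(|s| s.spawn(|| push_then_panic(&queue)).join());
    let was_poisoned = queue.is_poisoned();
    let len = lock_or_reset(&queue).len();
    (was_poisoned, queue.is_poisoned(), len)
}

fn main() {
    assert_eq!(reset_demo(), (true, false, 0));
}
```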

questioner8216 · 2025-11-25T02:18:30 1764037110

Questions for anyone who is an expert on poisoning in Rust:

Is it safe to ignore poisoned mutexes if and only if the relevant pieces of code are unwind-safe, similar to exception safety in C++? As in, if a panic happens, the relevant pieces of code handle the unwinding safely, thus data is not corrupted, and thus ignoring the poison is fine?

juliangmp · 2025-11-25T10:30:18 1764066618

I've dug into this topic in the past and my takeaway for this entire thing was “cool idea, but don't use it in practice”. I.e. just unwrap the lock call's result. If a worker thread panics you should assume your application's done for. Some people even recommend setting panic=abort for release builds, in which case you won't even be able to catch those panics to begin with.

I mean, think about the actual use cases here. One of my threads just panicked. Does it make sense to continue running the application? And if you answer yes, this is an error condition that can occur, then it shouldn't panic to begin with and should instead handle errors gracefully, leaving the mutex unpoisoned.