What does it take to ship Rust in safety-critical?
33 points by eBPF
I’m a bit surprised the focus is on async rather than lower-level bits that don’t seem to align with safety-critical systems; e.g., preventing allocations and other operations that have unpredictable latency. Am I overblowing the importance of this?
I want a no_panic that's either a subset of no_std, or ideally orthogonal to it. I've seen it done by checking object files for banned symbols, but it would be nice to have it supported at the language/library level.
We have a homegrown no_panic in our codebase that requires everything to be built with LTO in CI: we assert that the final binaries don't have a panic handler, and that only works with aggressive inlining and dead code pruning. And it's not 100% reliable: we still have to annotate functions with #[trust_me_this_doesnt_panic]. As our codebase has grown, the cost of doing LTO in CI has exploded.
I wish Rust had far fewer features, but no_panic can only really be done well at the language level, and I wish Rust had it. (Of course, listening to lots of folks like me is how you end up with a big language).
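The object-file check mentioned upthread can be sketched in a few lines. This is a toy version, assuming you've already dumped the symbol table to text with something like `nm -C target/release/firmware`; the banned symbol names below are the usual suspects, but they vary by toolchain version:

```rust
// Toy post-link no_panic check: scan a saved symbol table for panic
// machinery that LTO and dead-code pruning should have eliminated.
fn panic_symbols_present(symbols: &str) -> bool {
    // Assumption: these demangled names cover the panic entry points;
    // adjust for your toolchain.
    const BANNED: &[&str] = &["rust_begin_unwind", "core::panicking", "rust_panic"];
    symbols.lines().any(|line| BANNED.iter().any(|b| line.contains(b)))
}

fn main() {
    let dirty = "0000 T main\n0000 T core::panicking::panic_fmt";
    let clean = "0000 T main\n0000 T firmware::tick";
    assert!(panic_symbols_present(dirty));
    assert!(!panic_symbols_present(clean));
    println!("ok");
}
```

In CI you'd fail the build when this returns true; a language-level no_panic would make the whole dance unnecessary.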
Agreed. At work we write everything in Rust. We don't use async in our firmware, and I try to stay away from it even as we go up the stack.
Our code isn't "safety critical" by the OP's definition because there's no external standard that we're required to adhere to. But it runs in outer space, so all the same practices apply.
I'm sad about async Rust. It took on a ton of complexity to be able to work at the lowest levels of the stack: polled futures, swappable runtimes, pin/unpin, cancellation safety. And the lowest levels of the stack are the places where I'm least likely to use async. In return we get the pleasure of multiple different async runtimes, and mutually incompatible async libraries (e.g. you can't use a crossbeam channel with tokio's select, yet only crossbeam supports async unbuffered channels).
This quote from Graydon's blog comes to mind:
most abstractions come with costs and tradeoffs, and I would have traded lots and lots of small constant performance costs for simpler or more robust versions of many abstractions.
I'd rather have a simpler concurrency story with a built-in runtime, at the cost of not using async in my firmware. Which it turns out I don't do anyway.
I don't write timing-critical software, but it is in a medical device, so it is considered safety-critical (though the specific device I work on has quite harmless failure modes compared to what most people think of with medical devices).
I'd urge you to take another look at embedded async if you haven't in a while; the Embassy framework is a pleasure to use, and in testing I found most executors can get to latencies of a few microseconds with very little jitter. I quite prefer embedded async to standard library async, in fact.
I definitely want to take a look at some point, especially for my own stuff.
At work, all our code is timing critical. We poll all hardware (no interrupts), most of the code is a main loop with a series of nested step functions. We run all the code every tick, even if it's not relevant to the current state of the system (we just discard outputs we don't care about).
I'm skeptical that we could get the same guarantees with async Rust. It would be cool to be proven wrong.
I also don't know how typical our requirements are for folks doing embedded or safety critical work. Maybe we're in the minority!
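For anyone curious what that structure looks like, here's a rough, invented sketch of the run-everything-every-tick approach (all names, types, and numbers are made up):

```rust
// Toy polled main loop: every step function runs every tick, no interrupts,
// and outputs that don't apply to the current mode are simply discarded.
#[derive(Clone, Copy, PartialEq)]
enum Mode { Idle, Active }

struct Inputs { sensor: u16 }
struct Outputs { actuator: u16, telemetry: u16 }

fn read_hardware(t: u32) -> Inputs {
    // Stand-in for polling registers; never blocks.
    Inputs { sensor: (t % 10) as u16 }
}

fn step_control(inp: &Inputs) -> u16 { inp.sensor * 2 }
fn step_telemetry(inp: &Inputs) -> u16 { inp.sensor + 100 }

fn tick(mode: Mode, t: u32) -> Outputs {
    let inp = read_hardware(t);
    // Both steps run unconditionally so timing is the same every tick...
    let actuator = step_control(&inp);
    let telemetry = step_telemetry(&inp);
    // ...and the irrelevant output is thrown away instead of skipped.
    Outputs {
        actuator: if mode == Mode::Active { actuator } else { 0 },
        telemetry,
    }
}

fn main() {
    let out = tick(Mode::Idle, 3);
    assert_eq!(out.actuator, 0); // discarded while idle
    assert_eq!(out.telemetry, 103);
    let out = tick(Mode::Active, 3);
    assert_eq!(out.actuator, 6);
    println!("ok");
}
```

The payoff is that the worst-case tick time is also the typical tick time, which is the property that's hard to argue for with a work-stealing async executor.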
There's also some folks who don't use async in embedded, Oxide wrote a little embedded "not an RTOS" for example https://hubris.oxide.computer/
Actually, Oxide would like to support async, as stated in the Hubris FAQ. The main blocker seems to be their debugger supporting it.
This is a little bit different than what I meant; hubris itself is synchronous. Writing a task using async/await wouldn’t change that fact.
That is really interesting. Normally in timing-critical code, interrupts see a lot of use; I haven't heard of doing it by polling. When you say you poll all hardware in the main loop, I assume you don't mean you're blocking while waiting on e.g. I2C, you're just polling to see if there is new data, right?
I'd rather have a simpler concurrency story with a built-in runtime, at the cost of not using async in my firmware. Which it turns out I don't do anyway.
The situation of async in safety critical would not improve if Rust had a default runtime. The reasons to use asynchronicity in safety-critical systems, and the patterns involved, differ quite a bit from what e.g. a tokio user would use async for.
The situation of async in safety critical would not improve if Rust had a default runtime.
Oh yeah I agree. But I think attempting to support async in these contexts was a mistake. It's not worth the cost.
If embedded async (which is very connected to safety/timing critical async) wasn't a design requirement, Rust could have had a nicer async story: e.g. a default runtime would get rid of the "mutually incompatible async crates" problem. Push-based futures would make cancellation safety much easier. Heap allocating all futures would get rid of the need for Pin. Etc.
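To illustrate the Pin point: once a future lives in a Box it has a stable address for its entire life, so a runtime can poll it without the caller ever juggling pinning. A toy blocking executor along those lines (the no-op waker is hand-rolled so the sketch has no dependencies; recent std has `Waker::noop()` for this):

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Minimal no-op waker: never actually wakes anything.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

// Toy executor over a heap-allocated future: the Box gives the future a
// stable address, which is exactly the guarantee Pin exists to encode for
// futures kept on the stack. Busy-polls, so it only suits toy examples.
fn block_on<T>(mut fut: Pin<Box<dyn Future<Output = T>>>) -> T {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}

fn main() {
    let answer = block_on(Box::pin(async { 40 + 2 }));
    assert_eq!(answer, 42);
    println!("ok");
}
```

Embedded async rules out exactly this "just box it" move, which is why Pin had to exist in the first place.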
Isn't there very little if not zero allocation in this domain in the first place?
Rust’s async was specifically designed to do zero allocations! It’s one of the reasons for its complexity.
Ensuring that you don't blow up the stack is generally a bigger deal than dynamic allocations because of that.
... which always lets me chuckle because the stack is a form of dynamic allocation. A better term would be that heap allocations are forbidden (because timing guarantees are not feasible).
Another flexibility is that one can do heap allocation in some initial setup phase. You just need a convincing way to show that no heap allocations happen during regular execution.
... which always lets me chuckle because the stack is a form of dynamic allocation. A better term would be that heap allocations are forbidden (because timing guarantees are not feasible).
Safety standards actually appreciate that and consider the stack dynamic allocation.
Another flexibility is that one can do heap allocation in some initial setup phase. You just need a convincing way to show that no heap allocations happen during regular execution.
Yep, it's not whether you call malloc somewhere, it's much more a statement on memory lifetime.
The flight system standards actually have a whole section about mitigation against dynamic memory hazards, so it's not like it's never done.
Another flexibility is that one can do heap allocation in some initial setup phase. You just need a convincing way to show that no heap allocations happen during regular execution.
This seems simple in Rust: split the project into two crates: outer "init" that calls into inner "core", and make the "core" crate no_std. Then you can enforce that the core crate doesn't use alloc (not sure off the top of my head if there's a switch/lint for that, but even if the language doesn't cooperate, a link-time or post-link check would ensure there's no allocation in that crate or any of its dependencies)
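A minimal sketch of that shape, with the core shown as a module so the example is self-contained (in a real project it would be its own crate with `#![no_std]` at the top, so any allocation in it fails to compile; all names here are invented):

```rust
mod core_logic {
    // In the real core crate this file would begin with `#![no_std]` and
    // have no `alloc` dependency. It only ever borrows memory it is given.
    pub struct Engine<'a> {
        scratch: &'a mut [u8],
    }

    impl<'a> Engine<'a> {
        pub fn new(scratch: &'a mut [u8]) -> Self {
            Engine { scratch }
        }

        // One tick of work using only the pre-allocated scratch space.
        pub fn tick(&mut self, input: u8) -> u8 {
            self.scratch[0] = self.scratch[0].wrapping_add(input);
            self.scratch[0]
        }
    }
}

fn main() {
    // "init" crate: the only heap allocation, done once before entering
    // the regular-execution loop.
    let mut scratch = vec![0u8; 4096];
    let mut engine = core_logic::Engine::new(&mut scratch);
    assert_eq!(engine.tick(2), 2);
    assert_eq!(engine.tick(3), 5);
    println!("ok");
}
```

The lifetime on `Engine` is what makes the claim checkable: the core can't allocate, and it can't outlive the buffer init handed it.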
... which always lets me chuckle because the stack is a form of dynamic allocation.
If you have no recursion or indirect calls then you can bound the stack depth statically from the program text (e.g. StackAnalyzer, which I believe does a path-sensitive analysis to get a tighter bound than you'd get by bounding each function separately). [1] Interrupt handlers throw a wrench into that and have to be handled carefully, but they're tricky for reasons beyond just the stack analysis.
Nested unboxed Futures in Rust are effectively a call stack for a static call graph wrapped into a data structure. You can also take advantage of static call graphs for efficient compilation of programs when targeting an ISA like 6502: LLVM-MOS 6502 Backend: Having a Blast in the Past.
[1] Of course you can also do static analysis to verify safety properties related to dynamic allocation, but the stack depth is much more tractable to analyze in the absence of recursion/indirect calls/interrupt handlers. Indirect calls can be handled by conservatively analyzing the possible call targets, but the more conservative, the looser the estimate (and it might be so conservative it looks like there's a potential cycle in the call graph when there's not).
In theory if you have no recursion then you can effectively treat every function's stack-allocated variables as globals and just assign them memory addresses ahead of time, and if you can prove that two functions can never be in the call stack at the same time they can even share their "stack". I wonder if anyone's ever done that.
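The depth-bounding half of that is a small walk over the static call graph. A toy sketch with made-up function names and frame sizes (a real tool would pull frame sizes from the compiler, e.g. LLVM's stack-size metadata, and would first verify the graph is acyclic):

```rust
use std::collections::HashMap;

// Worst-case stack bound for a call graph with no recursion and no
// indirect calls: a function's bound is its own frame plus the deepest
// bound among its callees. Assumes the graph is acyclic, per the above.
fn worst_case(f: &str, calls: &HashMap<&str, Vec<&str>>, frame: &HashMap<&str, u32>) -> u32 {
    let deepest = calls
        .get(f)
        .into_iter()
        .flatten()
        .map(|callee| worst_case(callee, calls, frame))
        .max()
        .unwrap_or(0);
    frame[f] + deepest
}

fn main() {
    let calls: HashMap<_, _> = [
        ("main", vec!["read_sensor", "update_state"]),
        ("read_sensor", vec!["spi_xfer"]),
        ("update_state", vec![]),
        ("spi_xfer", vec![]),
    ]
    .into_iter()
    .collect();
    let frame: HashMap<_, _> =
        [("main", 64u32), ("read_sensor", 32), ("update_state", 128), ("spi_xfer", 16)]
            .into_iter()
            .collect();
    // main -> read_sensor -> spi_xfer = 64 + 32 + 16 = 112;
    // main -> update_state = 64 + 128 = 192, so the bound is 192.
    assert_eq!(worst_case("main", &calls, &frame), 192);
    println!("worst-case stack: 192 bytes");
}
```

The "share a frame slot" idea from the comment above is then a coloring problem over the same graph: two functions can reuse addresses whenever no path puts them on the stack simultaneously.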
I'm super happy that someone not just tried to actually talk to people who write that kind of software, but made a very deliberate effort to organise their statements and come up with nuanced, solid, actionable calls. I hope this is the sort of thing that's going to become a little more common in the Rust community. Things have been a little better in the last couple of years or so (see e.g. Ferrocene Systems' work) but by and large I've found that much of the Rust community has a lot of weird ideas about what's important for safety-critical systems and what isn't.
I initially wrote a much longer post about it but I don't know if it would've added any real value. I just want to acknowledge the really good parts here that I think might mark a real, substantial change in how safety-critical systems are viewed by the Rust community:
(Edit: no -- there is one part in my longer post that I do want to mention, even in condensed form, because I think it's a common misconception.
Safety-critical industries basically run on risk management. Correctness is, of course, the most critical feature of any design, and you want to ensure it every step of the way. But you also have to demonstrate that, every step of the way, you've acknowledged and worked to mitigate risk factors in your design and implementation.
Stability by flintstone tools is an engineering choice by proxy. It's not something you do out of a conviction that older toolchains are better, more stable, and more bug-free. It's what you do when the compiler vendor doesn't give you the means to manage the upgrade risk -- i.e. when you have no way to say what an upgrade changes and what it doesn't change.)
On a different note: I hope I'm wrong, but I harbor a secret (well, I guess it's not very secret now...) conviction that async is the Algol 68 of Rust. I get its appeal in a safety-critical environment (I was actually just playing with Embassy and I like how clean it is). But even defining its semantics to an acceptable level is daunting. And once that's done, we're still stuck with something that's not strictly a language-level thing, but depends on a runtime that's tightly bound to the language. I hope it succeeds, and I don't know enough about compiler and runtime engineering to realistically say if it can or can't, but my entirely uninformed take is that there are structural problems that make it intractable in the current Rust landscape.
Things have been a little better in the last couple of years or so (see e.g. Ferrocene Systems' work) but by and large I've found that much of the Rust community has a lot of weird ideas about what's important for safety-critical systems and what isn't.
(Ferrous Systems, the product is Ferrocene) :D
I mean, it's a back and forth. There's things in the Rust community that make our work very easy, because sometimes it has a big taste for rigor. But there's requirements from safety that some people just don't understand. Statements like the one on the blog make our work a ton easier.
(Ferrous Systems, the product is Ferrocene) :D
That's what happens when I type out things over lunch :-D
There's things in the Rust community that make our work very easy, because sometimes it has a big taste for rigor. But there's requirements from safety that some people just don't understand.
That matches my experience as well. I appreciate the taste for rigor and correctness, but I've had difficulty explaining a lot of things about the general dynamics of how safety-critical software is written and what kind of things we need in order to trust software that we use and build.
Hey everybody, doing these interviews was a nice experience. Some of the takes might be a bit raw; sorry if they come off as somewhat entitled (I heard this feedback elsewhere).
I tried to frame these quotes and experiences within the overall scope of "going forward how can we in safety-critical improve our own story through collaboration with the Rust Project?"
--
If you're interested in working a little more closely on shaping the story of Rust in safety-critical, come on over to the Safety-Critical Rust Consortium's GitHub repo and join up.
We're also working on Safety-Critical Rust Coding Guidelines that, don't tell anybody, I want to expand to cover many other use cases over time (e.g. game engines, server frameworks, and so on) with some tuning of how pedantic you'd like to be.