What Now? Handling Errors in Large Systems

23 points by aminom

ssokolow

Yes, I know a panic isn’t necessarily a crash, but it’s close enough for our purposes here. If you’d like to explain the difference to me, feel free.

Before async Rust, it was a deliberate design choice that a panic in Rust only crashed the current thread. Now, with async Rust, it's convention that HTTP frameworks like actix-web catch panics so they only crash that specific request (i.e., the boundary beyond which inconsistent state shouldn't be observable).
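
To make the two boundaries concrete, here is a minimal std-only sketch. It is not how actix-web or any other framework actually implements this; `handle_request`, `serve`, and `worker_panicked` are hypothetical names made up for illustration, and it assumes the default `panic = "unwind"` setting (with `panic = "abort"` nothing can be caught).

```rust
use std::panic;
use std::thread;

// Hypothetical request handler: panics on one specific path to simulate a bug.
fn handle_request(path: &str) -> String {
    if path == "/boom" {
        panic!("handler bug on {}", path);
    }
    format!("200 OK {}", path)
}

// Per-request boundary: a panic inside the handler is caught here and
// turned into an error response, so only this one request fails.
fn serve(path: &str) -> String {
    match panic::catch_unwind(|| handle_request(path)) {
        Ok(body) => body,
        Err(_) => "500 Internal Server Error".to_string(),
    }
}

// Per-thread boundary: a panic in a spawned thread unwinds only that
// thread; the parent observes it as an Err from join().
fn worker_panicked() -> bool {
    thread::spawn(|| -> u32 { panic!("worker bug") })
        .join()
        .is_err()
}

fn main() {
    println!("{}", serve("/ok"));
    println!("{}", serve("/boom"));
    println!("worker panicked: {}", worker_panicked());
    println!("main thread is still running");
}
```

The default panic hook still prints each panic message to stderr, so the failures are not silent; they just stop at the boundary instead of taking the whole process down.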

raoulmillais

I never want any part of my system to fail silently or with just a warning. I want it to die, and die fast. I’m mostly working in systems where the infra and services are in place to reconstruct what happened, though. The requirements of a database cluster and of applications/services are totally different. I don’t really control the db cluster code, so what I want is for it to appear to run consistently (even if it’s internally making different decisions about what should cause the process to crash). I have worked on db clusters, though, and even there I still think a crashed pod is better than a warning that (maybe, hopefully?) got logged and that the metrics dashboard was set up to turn into an alert or graph. It all depends on three questions: Will the crash negatively affect the users? Can we recover? Do we need to investigate? Crashing forces you not to ignore those things.

oliverpool

STPA seems to propose a more systematic approach to this kind of problem: https://entropicthoughts.com/aws-dynamodb-outage-stpa