Preventing outages with pkill's new --require-handler flag

59 points by cdown

jclulow

I feel like the log rotation example is a pretty egregious case of papering over one bug by creating a more insidious bug that will be much harder to diagnose. To wit: if you thought a process was going to reopen its log files on SIGHUP and engineers remove the handler by mistake, pkill -H will ignore this problem and allow the process to continue writing to the now rotated log file. It won’t be logging to the new log file, which will presumably then not contain any evidence that this has occurred, and it might corrupt or screw up any subsequent attempts to compress or archive the rotated log that it still has open.

It seems far preferable to have the process terminate unexpectedly, have the service management system restart it and log that it’s done it, and be able to actually detect this new bug ASAP and just get it fixed.

cdown

It doesn’t paper over the problem – pkill -H will return exit code 1 when it finds no matching process with the relevant handler, which makes the postrotate step fail, which means logrotate will show as failed. This should trigger alerts in any reasonable monitoring setup.

Instead of the service crashing unexpectedly and having data loss/corruption, you now instead get logrotate failing with a specific, traceable error, and the engineer being alerted to fix the missing handler. Instead of having an uncontrolled outage, you get some slightly suboptimal but highly controlled behaviour. That’s hugely preferable to taking down your service.
- jclulow
  Instead of the service crashing unexpectedly and having data loss/corruption
  
  I’m not really sure why a process being interrupted and restarted should result in data loss or corruption, especially when you have to consider and defend against unanticipated interruptions in general.
  
  I’m also still sceptical that this is a good trade-off, not least of which because of:
  
  why didn’t this (the accidentally missing signal handler) get caught during testing (I appreciate you can miss things during development and subsequent code review)?
  
  even if it didn’t get caught during testing, why didn’t it subsequently get caught during the (presumed, at FAANG scale!) canary roll-out of the new bits, etc?
  - cdown
    
    This isn’t about a single process restart, it’s about the cascading effects when many servers terminate simultaneously. When a signal like SIGHUP unexpectedly terminates hundreds or thousands of processes across your fleet within minutes, something like this happens:
    
    Replicas suddenly need to promote primaries en masse
    
    Connection pools to these services collapse
    
    Cache coherency breaks down
    
    …and eventually you don’t have enough capacity to run your service, and you get data loss :-)
    
    We can tolerate single or multiple servers going down, that’s not the problem. The problem is when you distribute such termination across the system. Even with redundancy, when you lose too many servers at once, you exceed the fault tolerance threshold of distributed systems. A single server restart is fine, but thousands within minutes is catastrophic.
    
    As for testing and canaries, as I remember it was missed in testing because the team reviewed their codebase for signal senders they controlled, but this particular one was in a configuration owned by another team which was hard to enumerate. The canary deployment ran for 12 hours, but this particular logrotate configuration only fired weekly. Many operational patterns simply don’t manifest during short canary windows, especially maintenance operations like log rotation, certificate rotation, etc. Of course, there are things that can be done better, and we do those as part of followups, but there are also the more systemic footguns to try and mitigate.
    
    Strong testing and canaries are great, but they’ll never catch 100% of issues, especially those that cross team boundaries or have long periodicity. For critical infrastructure, having safety mechanisms that convert uncontrolled failures into controlled ones is just good, defensive engineering.
- muvlon
  
  Comment removed by author
heftig

Notably, this flag is broken in the current procps release (4.0.5): https://gitlab.com/procps-ng/procps/-/issues/369
- cdown
  
  …now that looks like a search and replace gone wrong :-)
  
  Looks like Craig has already fixed it for the next release, thanks for pointing it out. For any distro maintainers who want to patch it, https://gitlab.com/procps-ng/procps/-/merge_requests/247 should hopefully apply cleanly.
dijit

I’ve been hit by similar “bugs” countless times and it’s one of those thorny edges you learn the hard way by being a sysadmin long enough. Not quite a rite of passage but similar, stops you being a cowboy.

fwiw, you can check the active signal handlers in procfs.

/proc/<PID>/status

Requires a minute understanding of signals;

man 7 signal
oger
So what is the long-term approach to find and eliminate these senders of unwanted signals? Currently I see these possibilities:
- a) run the bpftrace script (from the article) forever, to catch cases where a process receives a signal it cannot handle. This approach looks a bit hacky and ugly to me, at least for a production environment. But I’m not experienced with BPF; maybe it’s not a big problem in the end?
- b) don’t use || true in the pkill call, and instead log somewhere that the command did not find any receiver.
- c) don’t use the -H flag at all, but build all processes so that they gracefully handle unexpected terminations (and then have the monitoring system alert about such cases).
- d) ignore this problem, and accept that over time more and more receiver-less pkill calls will accumulate in various scripts. This seems like a path towards https://how.complexsystems.fail/#4 .
How would you solve this?

My preference would be towards approach c); but maybe Facebook has uptime requirements that preclude this.
- cdown
  The reality is that in large scale environments like the one we have, we end up using a mix of approaches rather than a single one.
  
  eBPF is generally pretty cheap. At Meta, we typically have 100+ eBPF programs running on servers during nominal state handling monitoring, security, profiling, etc. The performance impact with eBPF is minimal as long as you don’t attach to highly expensive probes (either because they are incredibly hot, or because they are slow, like uprobes). For long-term production monitoring one likely implements such things directly as an eBPF program rather than using bpftrace, but the concept is more or less the same. That said, it’s hard to know in an automated way whether such terminations are incorrect or not – SIGHUP is still used in some places for its original meaning, for example – so this requires some amount of human review by someone familiar with the system. With the number of processes and servers we have there’s also a significant storage cost, so this data would also end up being heavily sampled. Aside from that, part of the challenge is making the metrics available to people and making them aware of it rather than being purely a technical problem.
  
  In terms of option c), we do push for standardisation where possible. Most services call a function called init_facebook() during startup, which sets sensible defaults based on the language and environment. In some cases that includes setting signals with a default terminal disposition to SIG_IGN. That said, this doesn’t cover everything – many lower level parts of the stack need to be more opinionated than that, so don’t use init_facebook(), and the state of language support and what exactly init_facebook() does depends a lot on the context. For example, we have some languages which are really only used for one area (like Erlang) which have a very different level of support than more widespread languages (like C++). Some of those languages (like Java) also make extensive use of signals internally, which makes this difficult to achieve universally in a sensible way.
  
  Another problem with option c) is that if there’s no signal handler, there’s no way to gracefully handle most of these things. For example, when services like LogDevice or a database like MySQL receives an unhandled terminal signal, the process terminates immediately without any opportunity to flush in-progress writes, close client connections properly, complete transactions, update replicas, release locks, write to errors logs, etc. You can mitigate some of these with external handlers, but even then there is a more pervasive problem: assuming this comes from log rotation or similar, even if you splay them out in time, having this termination happen widely is likely to trigger a huge number of primary promotions. When you multiply this across hundreds of thousands of machines, the operational cost of passively allowing termination and monitoring becomes prohibitive compared to mitigating the issue in the first place. This is especially true for core infrastructure services where even brief unavailability can cascade into widespread issues. That’s one reason why we can’t generally just afford to monitor, but have to provide some proactive mitigations up front.
  
  For option b), we use logrotate less and less frequently nowadays in favour of other approaches, but yes, we do try to log failures for things like this rather than just papering over them (in whatever form that may take depending on the context). The bigger problem is having that this data is available even surface in teams’ minds when they are doing such cleanup work.
  
  Things like this mean that we tend to combine elements of multiple approaches for production readiness issues like this. In this case, we are:
  
  Using pkill -H as a safety net when setting up the signal senders in the first place (so if they are missed later, it’s less dangerous)
  
  Monitoring and providing data around signal use through eBPF
  
  Gradually improving process resilience and IPC strategy to not use signals where possible (but this is not without its own risks, since such work can inadvertently trigger exactly the kinds of problems listed in the post)
  
  In general, eliminating signal-based IPC in services is what we want to do, but for various reasons that’s not always practical and as such we will always need some amount of proactive mitigating strategy like pkill -H.
  
  (I also think the || true example I put in the post is unhelpful, I’ve removed that and rephrased. Thanks!)
- donio
  
  The long term approach is to have robust process and PID tracking and being in charge of your management scripts so that you can be sure that your automation never sends unwanted signals to random processes in the first place. This is quite achievable and has been for many years. Ducttaping around it is like trying to work around broken scripts deleting your config files instead of fixing the scripts.
  
  PID recycling race conditions could still be a problem in theory but running with a sufficiently high kernel.pid_max should make this very unlikely.
  - LeahNeukirchen
    
    And likewise, means of logging that don’t need external triggers for logrotation.
  - jclulow
    
    There’s also really no reason to be signalling individual processes with something as course as pkill anyway, in most cases. The illumos Service Management Facility (SMF) uses contracts, an inescapable kernel facility, to group together the process tree of related processes, and you (or more commonly the service management libraries and tools) can send signals to those contracts. I’m sure systemd and similar software have similar constructs.
    
    In the case of sending signals as part of interfacing with a process, you should be running your services in separate unprivileged user accounts anyway, and if you run the pkill or whatever else as those users they will only have permission to send signals to related processes as well.
    
    cdown
    
    Your response addresses a different problem than what the article is going over :-)
    
    The issue here isn’t about signals being sent to the wrong processes, it’s about what happens when a signal handler is removed but signals continue to be sent to the correct process. In the LogDevice example, the problem wasn’t that SIGHUP was sent to the wrong process – it was sent to exactly the right process, but that process no longer had a handler for it.
    
    While using contracts/cgroups/systemd units to manage process groups is good practice, and running services as unprivileged users provides important security benefits, neither addresses the fundamental issue that signals like SIGHUP have dangerous default behaviors that kick in when handlers are removed. You can still have exactly the same problem with ReloadSignal= in systemd or anything else.
  - cdown
    
    We’re not ignoring bugs :-) I don’t think anyone who had such a mentality would even get past the screening interviews. That is to say, of course we’d love to remove all such mistakes from our codebase, but in any sufficiently complex codebase or infrastructure there are bugs, and it’s part of our job to think about how to mitigate those. That’s not duct taping, that’s being defensive.
    
    Just to give an example from our production environment, but the same is true at almost all FAANG companies – our environment includes millions of machines running thousands of services maintained by hundreds of engineering teams. We have:
    
    Decades of historical scripts and configurations built up over time
    
    Various third-party libraries and tools, many with their own quirks and signal behaviors
    
    Many, many programming languages
    
    Different teams owning different parts of the stack
    
    Constant change as systems evolve and new features deploy
    
    In such an environment manually auditing really isn’t something that can keep up, and without that, it’s very hard to know “is this unhandled SIGHUP ok? is it not ok?” because it requires domain expertise and nuance.
    
    Unlike explicit operations like file deletion, signals can change meaning when code is refactored or handlers are removed, which is much more subtle. In many cases it’s very hard to know abstractly whether such code is correct without deep expertise in the subject area, so we need defense in depth.
    
    As such we:
    
    Continue to add strategies to avoid such issues in the first place (lints, static analysis, BPF monitoring)
    
    Use pkill -H as a safety net
    
    Gradually replace ad-hoc signal usage with more explicit IPC mechanisms where feasible
    
    I think we are all abstractly in favour of doing things “properly” where possible, nobody is against that :-) But having pragmatic safety mechanisms delivers more vastly reliable systems than pursuing theoretical purity alone in complex distributed systems.
    
    grayhatter
    
    We’re not ignoring bugs :-) I don’t think anyone who had such a mentality would even get past the screening interviews
    
    We must have been sitting in very different sev reviews :D
    
    I think we are all abstractly in favour of doing things “properly” where possible, nobody is against that :-) But having pragmatic safety mechanisms delivers more vastly reliable systems than pursuing theoretical purity alone in complex distributed systems.
    
    IFF it’s maintainable. I’m very often of the opinion that instrumentation is a good thing. It’s even more useful in a system where it’s less a problem that the left hand doesn’t know what the right hand is doing… the more accurate description is that the left hand doesn’t know the right exists.
    
    Is this a good feature that will fix real issues? Without a doubt! Is it worth the maintenance overhead? That I’m less convinced of. You already mentioned it, a scary number of people using the title software engineer don’t know signals exist, let alone how they work. And to really evaluate this, you have to understand the esoteric parts too. I can’t grade the exact temperature on this take, but: signals will never behave robustly. Which includes the implication that it’d be better to replace that infra with something that can be made more stable.
    
    I get the sense most of the objections stem from the fact this feels like a treatment, and not a definitive cure. That said, I don’t keep a “definitive cure” in my first aid kit :D
    
    cdown
    
    I’m not seeing the additional maintenance burden. The code using signals already exists, this just makes it safer. We’re not encouraging new signal usage, quite the opposite. The definitive cure is “don’t use signals” as is mentioned in the article (use varlink, socket communication, or similar), but that’s not practicable in all cases.
    
    grayhatter
    
    I’m abusing “maintenance cost” here as somewhat of a hand-wavy stand in for “opportunity cost” (which is a meme I don’t fully buy into), and for “sunk cost” which might prolong the necessity of a definitive solution that fixes everything all at once, forever. Which is just that same “abstract ideal” that can be paralyzing to good engineering if you spend too long staring at it.
    
    forestj
    
    It strikes me that if the LogDevice should never have all of its instances restarted at once, The problem is not just the signal that caused it to be restarted. The problem appeared because this action was scheduled for all instances at the exact same time… “Normal” Linux stuff like cron/logrotate being mixed in with your fancy cluster thing.
    
    I’ve noticed a pattern where distributed systems tend to distance themselves from the Linux OS as much as possible and don’t interact with it unless they have to… I always wondered if it was deliberate or just people reinventing the wheel because they can. Probably a little bit of both, but it seems to work pretty well.
    
    Linux has terrible usability; practically no affordances. So I’m not surprised that someone was unaware of something that it was doing.
    
    cdown
    
    They weren’t restarted simultaneously, log rotation was splayed out over the course of an hour. The issue is that in stateful systems, even staggered restarts can overwhelm failover mechanisms, because failover isn’t instantaneous. When nodes go down faster than new primaries can be promoted and set up, the system eventually loses too much capacity to handle its traffic.
    
    okapi
    
    Is procps-ng strictly Linux only? It always feels a bit hacky to me when system interfaces entail digging around in /proc. Is the same information available somehow on other OSes like Solaris or the BSDs. I’d have thought the option might be appropriate for kill too. While there is /bin/kill, what actually gets used is builtin to the shell so does need to be implemented portably.
    
    cdown
    
    While there is /bin/kill, what actually gets used is builtin to the shell so does need to be implemented portably.
    
    Even more confusingly, there is not only one /bin/kill on Linux :-) Just as a start, it could either come from util-linux, or coreutils, but there are even more options. I have implemented this in util-linux kill, too – see here.
    
    Is procps-ng strictly Linux only? It always feels a bit hacky to me when system interfaces entail digging around in /proc.
    
    You’re right that this approach relies on Linux’s /proc filesystem, which is one of the non-portable parts of the Linux ecosystem. That said, it’s not really a hack, it’s part of the official kernel API. We are very careful about maintaining compatibility in /proc interfaces precisely because userspace applications depend on them and their format. The format of these files is part of the kernel’s contract with userspace applications.
    
    Implementing it broadly in a shell would be vastly more difficult because it has to be operating system agnostic, but not all operating systems expose this information.
    
    The challenge of cross-platform deployment is real, but I’d rather have better tools on one platform than not having them on any platform.