When was the last time you broke production and how?

50 points by freddyb


Inspired by this post on senior engineers telling junior engineers about their mistakes, I am asking you all to share your stories or even existing posts about a situation where you royally fucked something up.

(Bonus points for every story that is turned into a blog post just for this thread)

icefox

Ooh, war stories. Uh, I’ve worked in R&D for a long time so “production” isn’t quite a thing that happens as much, but I vividly recall a field test a couple years ago where we were having our control system fly a quarter-million-dollar cargo drone around. I was the test engineer sitting on the ground with a laptop and making sure everything was operating smoothly, and we were doing a test where our control system would spot and avoid another drone that was going to come too close to it. Turns out this particular piece of hardware had performance problems that weren’t apparent until the thing was actually flying around and working hard, and I didn’t realize that was happening in our previous test flights; it seemed laggy sometimes, but I assumed that was just the network connection to it.

So the intruder drone flew close to ours, ours spotted it while still far away and started avoiding it, but did so by flying straight into a dense line of trees nearby. Its collision-avoidance spotted the trees and made it stop-and-hover and ask for a human to help… but because of the hardware problems the thing’s CPU was so overloaded that the emergency-stop-and-hover message didn’t actually get to the flight controller for a good 5-10 seconds. Far, far too late to stop it from flying into a tree.

Fortunately our safety pilot had a good paranoid finger on the manual-takeover switch, and stopped the drone before it blindly committed suicide, not knowing that it was going somewhere unsafe. I didn’t even figure out what was happening or how close it’d been until I’d downloaded the data off of the drone and replayed it in simulation.

The culprit? After a month of tearing the thing apart and putting it back together again, turns out the thermal paste we normally used for all our systems had stopped being manufactured, and we’d used a different brand. When flying around outside under the hot sun and working hard, the inside of the system had gotten hot enough that the thermal paste melted and mostly oozed out from under the CPU’s heat sink, despite being supposedly rated to handle the heat. So the CPU overheated and throttled itself down to like 20% clock speed in an effort to not cook itself, and couldn’t keep up with crunching all the sensor data.