When was the last time you broke production and how?
50 points by freddyb
Inspired by this post on senior engineers telling junior engineers about their mistakes, I am asking you all to share your stories or even existing posts about a situation where you royally fucked something up.
(Bonus points for every story that is turned into a blog post just for this thread)
Ooh, war stories. Uh, I’ve worked in R&D for a long time so “production” isn’t quite a thing that happens as much, but I vividly recall a field test a couple years ago where we were having our control system fly a quarter-million-dollar cargo drone around. I was the test engineer sitting on the ground with a laptop and making sure everything was operating smoothly, and we were doing a test where our control system would spot and avoid another drone that was going to come too close to it. Turns out this particular piece of hardware had performance problems that weren’t apparent until the thing was actually flying around and working hard, and I didn’t realize that was happening in our previous test flights; it seemed laggy sometimes, but I thought it was just the network connecting to it being laggy sometimes.
So the intruder drone flew close to ours, ours spotted it while still far away and started avoiding it, but did so by flying straight into a dense line of trees nearby. Its collision-avoidance spotted the trees and made it stop-and-hover and ask for a human to help… but because of the hardware problems the thing’s CPU was so overloaded that the emergency-stop-and-hover message didn’t actually get to the flight controller for a good 5-10 seconds. Far, far too late to stop it from flying into a tree.
Fortunately our safety pilot had a good paranoid finger on the manual-takeover switch, and stopped the drone from blindly committing suicide because it didn’t know that it was going somewhere unsafe. I didn’t even figure out what was happening or how close it’d been until I’d downloaded the data off of the drone and replayed it in simulation.
The culprit? After a month of tearing the thing apart and putting it back together again, turns out the thermal paste we normally used for all our systems had stopped being manufactured, and we’d used a different brand. When flying around outside under the hot sun and working hard, the inside of the system had gotten hot enough that the thermal paste melted and mostly oozed out from under the CPU’s heat sink, despite being supposedly rated to handle the heat. So the CPU overheated and throttled itself down to like 20% clock speed in an effort to not cook itself, and couldn’t keep up with crunching all the sensor data.
I had enough CPU but forgot that +z is down in a NED frame and definitely commanded a climb to -8m AGL and it did exactly what I asked :D
The existence of coordinate frames is the best proof that there is no god.
I mean there is a way to derive the third axis from the first two in a Cartesian frame… but it requires people knowing their right hand from their left :P https://en.wikipedia.org/wiki/Right-hand_rule
The problem is more that moving in and processing information from 3D space involves dealing with lots of coordinate frames, and building a coherent picture requires making all the transformations between them correct. At a minimum you will have coordinate frames for “the world”, “my robot’s center of mass”, and “where my sensors are”, and you will probably end up with far more. So you get a big complicated skeletal-animation-like tree of transforms and any mistakes anywhere make Bad Things Happen. As described.
Some people are really good at it; they just have whatever mental wiring makes it easy for them to go from one transform to the next in their head. The rest of us get wrist damage from trying to step through the chain of transforms one right-hand-rule at a time while debugging transform issues. My best coworker 3D printed a bunch of little right-hand-axis indicators, each a couple inches high and labelled with large X/Y/Z, and having a couple of them lying around is a really good debugging tool.
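If it helps to see the chain written out, here’s a toy sketch (entirely made-up numbers, NED-style world frame with +z down, using numpy) of composing world → body → sensor transforms as 4x4 homogeneous matrices:

import numpy as np

# Build a 4x4 homogeneous transform from a yaw rotation and a translation.
def transform(yaw_deg, translation):
    c, s = np.cos(np.radians(yaw_deg)), np.sin(np.radians(yaw_deg))
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0], [s, c, 0], [0, 0, 1]]  # rotation about z
    T[:3, 3] = translation
    return T

world_T_body = transform(90.0, [10.0, 5.0, -2.0])   # NED world: -2 in z means 2m *up*
body_T_sensor = transform(0.0, [0.3, 0.0, 0.1])     # sensor mounted 0.3m forward of the CoM

# Compose left-to-right, then map a point the sensor sees into the world frame.
world_T_sensor = world_T_body @ body_T_sensor
point_in_sensor = np.array([1.0, 0.0, 0.0, 1.0])    # homogeneous coordinates
print(world_T_sensor @ point_in_sensor)

Get one matrix transposed or one sign flipped and everything downstream is silently wrong, which is exactly how you end up commanding a “climb” to -8m AGL.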
“That time I broke Wikipedia, but only for vandals” was the first time I broke something in production. Proud to have spent years after that making it basically impossible to break again in that way.
Also, this incident is the only time I helped take down basically all of Wikipedia. We deployed a buggy change that inadvertently caused a regex deep in MediaWiki to become invalid, and PHP’s preg_replace() helpfully returned null instead of throwing an error, which got cast into an empty string and used as the article’s text, whoops. Worst part was this all got cached at multiple levels, so then we needed to write more code to purge the cached entries from that time range.
I immediately told Ori that it was broken and his reaction was along the lines of: “You didn’t test it??”
the reverse of
If ‘Tested-by:’ is set to yourself, it will be removed. No one will believe you.
One of our SaaS customers’ environment was running slowly and we noticed that the audit log table had millions of rows. That seemed a likely culprit, since almost every transaction writes to that table (and it was unnecessarily filling up the database anyway). Our terms of service say we’ll provide 90 days of audit data, so I fire up a mariadb client and run delete from audit where time_entered < ONE_YEAR_AGO. This was taking a while, so I just let it go.
What I didn’t realize is that that query had locked the entire audit table and the environment was effectively read only. The health checks were fine - they don’t write to the audit table - so no alarms were going off. But all actual work was failing. By the time the customer informed us what was going on, the query had been running for hours. Cancelling it triggered a rollback - which of course wouldn’t release the lock until complete and would presumably take just as long to complete as the query had run so far (or longer - who knows?)
In the end, we restored the mariadb server from a snapshot that had been taken a little after the original query started (since nothing was working while that table was locked, there would be no data loss). Now we have a script that deletes a couple thousand rows at a time, instead of trying to do it in one query.
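For the curious, the replacement is roughly this shape (driver choice, connection details, retention cutoff, and batch size here are all illustrative, not our real script):

import time
import pymysql

BATCH_SIZE = 2000  # a couple thousand rows per statement

conn = pymysql.connect(host="db.internal", user="ops", password="...",
                       database="app", autocommit=True)
try:
    with conn.cursor() as cur:
        while True:
            # Each small DELETE only locks the rows it touches, so normal
            # writes to the audit table can proceed between batches.
            deleted = cur.execute(
                "DELETE FROM audit "
                "WHERE time_entered < NOW() - INTERVAL 90 DAY "
                "LIMIT %s", (BATCH_SIZE,))
            if deleted == 0:
                break
            time.sleep(0.5)  # give other transactions room to breathe
finally:
    conn.close()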
This is interesting, and I’m asking in all honesty, as we are in a similar predicament. Wouldn’t this be a case for something like creating a temp table and then swapping? I mean something like:
Create table tmp as select * from audit where time_entered > ONE_YEAR_AGO;
Rename table audit to old_audit;
Rename table tmp to audit;
Or is there a reason not to? I remember reading something about deleting massive amounts of rows here but I can’t find it anymore
Consider table partitioning by date range, or a query planning hint to not take locks. Dropping an old partition is quick, unlike deleting rows one by one. e.g. in postgres https://www.postgresql.org/docs/current/ddl-partitioning.html or sql server https://learn.microsoft.com/en-us/sql/relational-databases/partitions/create-partitioned-tables-and-indexes?view=sql-server-ver17
The irritating thing is that partitions have to be declared in advance; it’s not like a good timeseries db, where you can expect each day/week/month to be its own thing.
(We’re about to run out of partitions in prod at my company. I have a slack timer set to remind the DBA team in a few months. Hope it fires!)
In 2020, I tried to remove capabilities from a privileged UI context in Firefox. This context can request arbitrary URLs (ignoring the Same-Origin Policy) and do a lot of dangerous stuff. It has wide access to user data and settings. I removed everything I could, but also made my block more permissive for tests in CI, because they would often do complicated stuff with these APIs.
Anyway, at some point I had a successful CI run that blocked A LOT of stuff. Including downloading things with privileges. Basically everything that wasn’t part of the browser build. In the end, I could block all access to http and https URL schemes. And it was green in CI. Perfect!
At some point, I requested reviews and got it landed in our repository.
Next thing I know is Firefox Nightly doesn’t have favicons and the update check stops working…
Why was this not caught in tests? Well, because I made it more permissive.
And this is why you should (almost) never write code that behaves differently in production than in tests.
Just last week I broke a data pipeline by trying to make test and production more similar.
There were several special cases and if statements for test mode. I refactored the whole thing so that there were just a few lines at the beginning: theoretically, test mode would just receive a (much) smaller dataset but still execute all the same code. I happened to miss one case, so testing worked fine but production would crash.
Why didn’t I catch this? Because I only tested in test mode…
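The shape of the refactor, roughly (all names made up):

import os

# Decide the dataset size up front, then run exactly the same pipeline code
# in both modes, with no scattered test-mode branches further down.
TEST_MODE = os.environ.get("PIPELINE_TEST_MODE") == "1"

def load_records(limit=None):
    source = range(1_000_000)  # stand-in for the real data source
    return list(source)[:limit] if limit is not None else list(source)

def run_pipeline(records):
    return [r * 2 for r in records]  # stand-in for the real processing

records = load_records(limit=100 if TEST_MODE else None)
results = run_pipeline(records)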
For me, production is “the developer pipeline”, and as a matter of fact I broke it last week. I made a change to the entry point of our build scripts that caused the exit code to always be 0. As you can imagine, this made build and test failures invisible to CI, so by the time a user noticed something was wrong… main already had a bunch of breakages checked in. This was an interesting incident because “it was not very impactful for users, but of high severity to the codebase”. It was painful to revert and get back to healthy.
Since then, I’ve added “tests for the tests” that use a completely different code path to validate the foundational behavior of the build tool. But we are still discussing in the postmortem what to do about it at a higher level, because the way CI interacts with the build could also suffer from similar bugs. (E.g. we will likely add a background job that tries to submit a known-broken code change and expects the CI validations to catch the issue. And we will also likely add some extra sanity checks to cross-reference exit codes against other failure signals in the logs and the like.)
Not a very exciting incident but you asked about “last time” ;-P
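Roughly the shape of the “tests for the tests”, with the build command and the broken-fixture target being hypothetical names:

import subprocess
import sys

def run_build(args):
    # Thin entry point whose only job is to never swallow the exit code.
    return subprocess.run(["./build.sh", *args]).returncode

def self_check():
    # A deliberately broken target must fail; if it comes back 0, the
    # wrapper (or the CI glue around it) is eating failures again.
    assert run_build(["--target", "known-broken-fixture"]) != 0

if __name__ == "__main__":
    sys.exit(run_build(sys.argv[1:]))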
Nice. There’s something very Gödel-adjacent about this. A proof system cannot prove its own consistency, and a test system cannot be used to test its own correctness.
i was working a cloud engineering job at $MEGACORP, and it was black Friday. we have this huge wall of monitors & at the time, it was powered by graphite, which i was the subject matter expert for.
our CTO was pointing at one of the traffic graphs, surrounded by his posse of corpo yes men, and the storage backend completely froze - the graphs all turned red, and displayed a “NO DATA FOUND” message. it must have looked like our entire system just went down.
the director of our org was a few seats from me, and i met his eyes - he looked at me, as if to say “jes, please, fix this right now” - i swear his eyebrows nearly reached the ceiling.
long story short, i hard powered the server off and turned it back on, and the graphs all came back up.
I realise I’m drifting off topic a little here, but one team I worked on was diagnosing a production issue and one of my colleagues had the good sense to look at resource usage graphs.
He happened to notice that the memory usage for one server looked odd, and so shared its graph with the rest of us, but hadn’t noticed that the graph also looked vaguely like a cartoon penis and couldn’t understand why we were giggling so much.
Maybe I should be given a pass because I’m an EM, but I don’t think I’ve ever broken production yet at $CURRENT_JOB. That’s why I’m pushing more and more risky code to production since I feel I have to use this accumulated error budget somehow.
It’s been some years now, but I did an oopsie in our manufacturing system and it went unnoticed for some days.
In the meantime, we produced over 500 devices which shared a single MAC address, as the system allocated a new one after successful testing, but still wrote the one allocated at system startup into the devices.
This means we had to recall all of these devices and update their MAC addresses to the one actually assigned.
I broke production briefly this week! Our backend had a code change that needed a change to the Docker container configuration, with a mutual dependency between the two: rolling either one out alone would break things. I kicked off the code deployment (automated in GitHub Actions) and had the command to apply the config change ready to go in a terminal window. I was going to wait until the deployment was just about to actually fire up the backend with the new code version before hitting Enter on the config command. There’d be a brief outage but it’d be well within our allowed tolerances.
So of course, that’s exactly when one of my coworkers pinged me on Slack with a quick question, which I was foolish enough to read and start answering. No problem, I thought: the release process still has a few minutes to go before it’s to the point where I have to change the config.
As such things often do, the quick question led to a not-so-quick discussion and my novelty-craving ADHD brain switched focus to it to the exclusion of the boring “wait patiently until the release is far enough along” task.
Luckily, I was snapped out of it by the production outage alert. I ran the config command and all was well.
And yeah, of course the whole process should have been automated or we should have made the code and config changes backward compatible. A nice reminder that taking shortcuts on that kind of stuff can bite you sometimes.
Perhaps not as severe as the other stories shared here, but it was kinda ridiculous.
Some time ago I broke a customer’s production website by SSH’ing into the server. No, really. The website worked fine, I connected via SSH because I wanted to test whether our deployment pipeline worked or whatever, and all of a sudden the site was blank. I closed the connection and the site worked again.
Long story short, the server wasn’t ours; we were using one of those hosting panel services, and on a relatively low tier at that. I think that that tier had an extremely low process limit. When I connected to the server, sshd forked and seemingly occupied the last “free space”, so the web server couldn’t do the same anymore.
At least, I’m pretty sure that that was the case. I checked the pipeline, reported my findings and, as is the case with these things sometimes, never heard back apart from a “great, thanks for letting us know”.
I was deploying co-curricular course selections software, custom-written for my school. The manually-renewed TLS certs they gave me a few months ago expired, and this time around I forgot to append the CA and intermediate certificates when converting their weird MS-format certificates (securely stored in a Microsoft Teams folder) to PEM. My browser had the entire school’s certificate chain cached, so nothing broke for me, but when course selections started in production, many people complained about TLS certificate issues.
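A quick check along these lines would have caught it, since a fresh default SSL context has nothing cached and the handshake fails unless the server actually sends the intermediates (the hostname is a placeholder):

import socket
import ssl

HOST = "courses.example.edu"

ctx = ssl.create_default_context()  # system roots only, no cached chain
try:
    with socket.create_connection((HOST, 443), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
            print("chain OK:", tls.getpeercert()["subject"])
except ssl.SSLCertVerificationError as exc:
    print("incomplete chain or bad cert:", exc)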
Also, this should probably be tagged as devops instead of practices?
I think you can screw everything up if you just try hard enough. So this thread is practiced inclusion :-)
Not that long ago, during a high-traffic day due to some politics and the stock markets bouncing around, I accidentally throttled an entire production Kubernetes cluster’s network throughput from around 100Gb/s to ~5Gb/s. An incident was declared once I realized I had accidentally applied a Cilium network policy to production rather than to my load-testing cluster.
But (aside from why I could apply this at all, which I’ll get to later) I had set enableDefaultDeny to false, which means any network flows not matching the policy won’t be dropped by default, but allowed through. What we saw in fact was the eBPF map pressure skyrocket to 100% and start to overflow the conntrack map. I deleted the network policy and traffic returned to usual. Learning #1: we ought to up the map memory limits and tune the map limits for such network policies in future.
Then we began to ask, why the f*#k was I able to apply this to production by mistake through an incorrect kubectl context?
Well… Turns out our Teleport setup for Kubernetes was using the system:masters group, but with permissions such as create/delete/update denied. That denial, we discovered, only applies to well-known resources (such as pods, secrets, configmaps, etc.); any verb operating on an unknown apiVersion would fall through to the default system:masters role. Learning #2: Teleport for Kubernetes isn’t magic, and requires methodical management of the RBAC roles and bindings underneath it, especially for custom resource definitions.
We spent some weeks re-evaluating the entire Kubernetes RBAC setup, and also workshopped improvements to our pull request process to ensure that things like using the most privileged group as the base for a role don’t get shipped to production.
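One small guard against the wrong-kubectl-context half of this is a thin wrapper along these lines (the context names are made up):

import subprocess
import sys

PROTECTED_CONTEXTS = {"prod-us-east", "prod-eu-west"}  # hypothetical names

def current_context():
    return subprocess.run(
        ["kubectl", "config", "current-context"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

if __name__ == "__main__":
    ctx = current_context()
    if ctx in PROTECTED_CONTEXTS:
        sys.exit(f"Refusing to run against protected context {ctx!r}; "
                 "switch to a non-production context first.")
    # Otherwise pass everything straight through to kubectl.
    sys.exit(subprocess.run(["kubectl", *sys.argv[1:]]).returncode)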
My most recent mess up was also one of the dumbest I’ve been a part of in a long time.
CTO wants an LLM-powered internal chat bot and the idea is that we’re powering the thing with a vector database. So the chat bot is going to be the entry point for employees, initially at corporate and then at the factories and retail stores around the world. Cool, got it.
So we go through the process of building connectors to all the internal sources of truth, chunking them, adding them in, etc. We’re testing it and quickly realize “oh crap RBAC isn’t going to work”. These documents are coming from a bunch of sources and it’s impossible for me to know if they should keep the permissions they had before or get new ones without someone looking at each one.
We get so obsessed with testing the RBAC and making sure that dangerous docs aren’t going to get exposed to the wrong users that I mentally lose track of which vector database is the real one. We had made a fake one full of random garbage.
Launch day comes, test site is working well. I am 100% confident that I have not made a mistake until everyone in Teams starting making laughing emojis at the launch post. My bot is answering their questions with (thankfully not offensive) but just sorta random garbage. Screenshots flood in of the bot responding to questions like “what is my vacation policy” with Star Trek quotes.
Screenshots flood in of the bot responding to questions like “what is my vacation policy” with Star Trek quotes.
“To boldly go where no man has gone before” is a valid answer, no?
I don’t remember some of the details, but the gist of it was: the CI build for the prod containers failed for unrelated reasons, so I ran the build locally and deployed those. However, because I forgot to add the flag to use prod configs, the login endpoint would redirect to localhost instead of the production URL, which naturally broke login.
Fortunately, it was an easy fix to rerun the whole CI process and redeploy correctly-configured containers.
Yesterday, actually. It’s funny because someone else’s code actually broke production, but because I was the one rolling it out, and also because I didn’t strictly follow the release process (tagging commit authors to check canary logs), it seems I was responsible for it.
I did the same thing last week - deploying someone else’s code that broke production. In this case, the existing release process was followed, but the other developer and the people doing change management didn’t know there was another system that had to be updated in lock-step with this one. I should have known, though.
At least it was easy and fast to roll back.
This didn’t bring down production per se, but it halted progress for all application updates for a while.
At some point we developed this pattern of exposing secrets as env vars at build time, using helm charts. This makes deployment more complicated and makes it a pain to add secrets, so it was decided we should move toward grabbing secrets from the secrets manager directly in code.
I diligently changed our Flask app to do exactly that with FLASK_APP_SECRET_KEY, for a start. Well, it turns out we have multiple buckets for different levels of secrets, e.g. things the app needs access to vs things only async jobs need. And the access that the production app has is more limited than what the helm charts have at build time. So in local/test/CI everything was fine, but once it hit production the app failed to start.
This happened at the end of a long week, and I must have forgotten to check our CD dashboard because the app had entered a crash loop. Because the app was failing to start, Kubernetes refused to “progress” new builds, and I had effectively blocked all updates to the app. Kubernetes actually kept the app running for four days on stale pods before anyone noticed!
Usually I dislike Kubernetes but I was grateful for it that day–and the three previous days, retroactively. :P
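For illustration, the fetch-at-startup change was conceptually something like this (AWS Secrets Manager via boto3 and the error message are assumptions, not our exact setup):

import boto3
from botocore.exceptions import ClientError

def load_secret_key():
    client = boto3.client("secretsmanager")
    try:
        resp = client.get_secret_value(SecretId="FLASK_APP_SECRET_KEY")
    except ClientError as exc:
        # Fail loudly at startup with a readable error instead of letting the
        # app crash-loop on a mysterious permissions problem.
        raise RuntimeError(
            "Could not read FLASK_APP_SECRET_KEY from the secrets manager; "
            "check the production app's access to that bucket of secrets"
        ) from exc
    return resp["SecretString"]

The except branch is the part that mattered here: the production app simply didn’t have access to that bucket of secrets, and a clear startup error would have surfaced that long before anyone looked at the CD dashboard.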
I’ll share a recent, unimpressive one. :)
Yesterday. We were making a small tweak to a configuration file with strong types. We ran the syntax checker, as we always do, and it was good so we pushed it. We discovered that one of our plugins does not do proper type checking when accessing a dictionary with an enum.
Code example:
SOME_CONFIG[SomeEnum.InvalidOption] = "VALUE"; // should throw a syntax check error; it did not
We were alerted by dozens of users within minutes of the change that things weren’t working. Thanks to verbose error messages, we found the cause in minutes, and then spent half an hour writing a bug report for our plugin vendor.
Would I change anything else? Probably not. We just know to be careful with our enums until the vendor fixes it. We only do small tweaks like this live on production; big changes are done in test first.
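For what it’s worth, a rough Python analogue of the check we expected the plugin to do; Python’s Enum fails loudly on a nonexistent member (and a static checker like mypy flags it too):

from enum import Enum

class SomeEnum(Enum):
    VALID_OPTION = "valid_option"

SOME_CONFIG: dict[SomeEnum, str] = {}

try:
    SOME_CONFIG[SomeEnum.InvalidOption] = "VALUE"  # no such member
except AttributeError as exc:
    print("caught at runtime:", exc)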
Less than six months ago, I added another commit with a summary matching the regex [Ff][Uu][Cc][Kk] [Hh][Aa][Mm][Ll] after yet again thinking that I could make a small change to a Rails app that uses HAML for all HTML and XML rendering.
I added several commits in this style during a conference in 2020, and people were actively using the website during the conference. Our release process takes at least 10 minutes, so every error meant the affected endpoints were broken for at least 10 minutes.
I shared this story with a coworker so it’s already typed, and I’ll nest it here because it’s a story of how something went right: I thought little of it at the time, but so many people were worried.
I had two people watching over my shoulder as I prepped an in-place license key file swap on eight production machines for a government project 15 years ago. They watched my every move because during this period, an outage of more than one machine for five minutes would get labeled a “SERVICE DEGRADATION,” and an outage of more than three machines meant “SERVICE OUTAGE” and triggered contractual clauses that cost ~$1,000 per minute, or so someone told me.
What did I do? Something like this, in a script; I showed them the script before executing it and compared hashes.
# copy the new key off the USB stick, with a dated name, next to the live one
cp /mnt/usb/license_key.txt /opt/vivisimo/velocity/latest/license_key_20100901.txt
# swap the key in place; if the symlink fails, put the old key back
(mv license_key.txt OLDlicense_key.txt && ln -s license_key_20100901.txt license_key.txt) || mv OLDlicense_key.txt license_key.txt
/etc/init.d/velocity restart
# confirm the new key is accepted before discarding the old one
/opt/vivisimo/velocity/latest/velocity --check-key && rm OLDlicense_key.txt
The license key expired at noon, and I started this ~40-minute process—allocating five minutes per machine, eight machines—at around 11:15 a.m. If anything went wrong, there would be a $1,000/min outage in 45 minutes.
“Hey this kid’s pretty young, does he know what he’s doing?”
Motherf##$%er, despite being 25, I’m one of four people in the world who’s allowed to touch your systems, and the other three are dealing with their weird shit and have families and couldn’t drive to Washington DC from Pittsburgh at 8 p.m. to be onsite by 9 a.m., because your team dragged their feet paying our company the $500,000 we’re owed to renew your license!
…but the last laugh was ours because none of the other three would have thought to do a license audit then. We upsold them another $350,000 two weeks later because their usage was significantly greater than that for which they’d paid. That same team also did highly formalized code reviews that cost $3,600 per meeting. American tax dollars at work!
Quite recently, actually. I did a few queries against the prod db to ensure that an Integer in a Java entity class couldn’t be null.
I suppose those of you who are in Java land can see where it went wrong.
Of course there was a path that would give a null there. So I broke prod.
Thankfully, it’s a 1-to-1 replacement of an older system, and part of that is that the system can self-heal: if it sees enough unhandled exceptions, it swaps back to the old system.
To the user who repeatedly hit the button until the system worked again, I’m sorry. But you did the right thing.
I was migrating an EKS cluster to using the AWS Load Balancer Controller, which allows you to provision load balancers from your Kubernetes cluster. You can do that without AWS LBC but that method is deprecated, and they don’t let you migrate existing load balancers in-place, so I had to delete and recreate the load balancers that process the entire company’s network traffic.
I came up with a plan to migrate without downtime: create a new set of load balancers pointing to the same Pods (an istio-gateway, which is just a fancy Envoy server that handles all our traffic before reaching our in-house developed microservices), update DNS to point to the new load balancer IPs, wait for client DNS TTLs to expire, then delete the old load balancers.
Creating and deleting load balancers involves creating and deleting Kubernetes Services (of type: LoadBalancer). So when I went to delete the old load balancers, I hesitated and realized I could just disable the Service instead by updating its label selectors to not point to any backend pods. That way if something broke, I could instantly restore the label selector to fix whatever broke, which was safer than deleting the service outright (which would have returned the load balancer IPs to amazon’s pool, never to be seen again).
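Concretely, the disable/restore step was along these lines (shown here with the Kubernetes Python client purely for illustration; names and namespace are made up):

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def disable_service(name, namespace):
    # Point the selector at a label no pod carries: traffic drains, but the
    # Service (and its cloud load balancer IPs) stays allocated.
    patch = {"spec": {"selector": {"app": "intentionally-disabled"}}}
    v1.patch_namespaced_service(name, namespace, patch)

def restore_service(name, namespace):
    patch = {"spec": {"selector": {"app": "istio-gateway"}}}
    v1.patch_namespaced_service(name, namespace, patch)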
So I disabled the service and waited a few hours. Literally as my finger was hovering over the button to permanently delete the Service, someone notices customers complaining of a broken application.
Turns out that we had one app trying to talk to another app within the EKS cluster in a non-standard way. Typically you can just talk to any service using DNS records that Kubernetes creates automatically which look like <service>.<namespace>.svc.cluster.local.
But instead, this app sent traffic to the istio-gateway Service IP directly. I can’t remember why exactly, but there was a good reason. Maybe because the app was prevented from using the auto-created DNS records since it used a custom DNS resolver, not sure. Anyway, somewhere in the code of this failing application, we had hardcoded the Service IP address of the istio-gateway. This is a big no-no because Service IPs are ephemeral. If your cluster restarts, the IPs could be reassigned.
So as soon as I disabled the Service in question, connections to the hardcoded IP address started getting refused.
Thankfully I was able to restore the Service’s label selector right away to fix the traffic, but it did cause a several hour outage for that app.
We still have the hardcoded service IP address, I just updated it to point to the new Service attached to the new load balancers. There was really no other option. This will be a nice surprise the next time Amazon decides to make us migrate load balancers again!
This was around 2018. I don’t remember the technical detail, but I remember the people. A change I made to a recently inherited system started filling a DynamoDB table with bad data. I knew SQL world well, but this was not that. My whole business was front end focused and new to cloud things like document databases. The effects were bad enough that the product needed to be taken offline, but it still had to be back by prime time in the customer market. So I was up until 3 in the morning with a very generous teammate, both of us studying how the product and underlying services actually worked, in order to correct it. A client rode along the whole time. I recall he was patient and ultimately impressed with us for sticking to the problem until it was solved.
My other big prod takedown memory was indeed SQL, in maybe 2006. I wasn’t that long out of college, and an update I made live in production (we sure did) set an entire database column to one value, ruining some 25 million rows. Well kids, even if you can’t be bothered to run a transaction, at least write the where clause first and change the select keyword to update last. At the time, that kind of habitual poka-yoke was called safe technique… I remember thinking it was odd the query was taking so long, and then came the “oh shit” feeling. I went and told my senior, who said ok, thanks for raising it right away, we’ll handle it. The client shut that department of their business down for the day and loaded yesterday’s backup. ⌘Z. Everyone was grateful for the transparency, and the relationship continued for many more years at least.
In both of these cases I was surrounded by others who had fucked it all up before and owned it. So just start by owning it; they’ll have your back. I’m grateful for being able to make mistakes and simply learn from them.
This was not production and it was a while ago, so it doesn’t really answer the question but… I was working on an autoscaling thing for AWS EC2 instances and was trying to debug/troubleshoot the EC2 instance bootstrap which was doing something weird. The details escape me.
Anyways, this process involved “triggering my code, watching the newly-provisioned nodes do something weird, deleting them from the AWS console, repeat”. Did this over and over and over again. At some point I hit the back button on my browser, so when I clicked “select all -> delete” on the console, instead of just deleting the EC2 nodes I was troubleshooting I deleted…… every single EC2 instance in our dev environment.
(uh whoops)
Fortunately a) this was dev, not prod, and b) we had everything in terraform, so bringing it all back up again wasn’t too painful, but that was the day that I started advocating for read-only-access always for everybody on the AWS console.
After GoComics had a big site refresh on April 1st, I had to update my Garfield-posting Discord bot: because I lacked a test for the date, and because of a slightly overeager redirect from GoComics, the bot began to post yesterday’s comic as the one for today.
https://git.sr.ht/~erk/lasagna/tree/master/item/announcements/2025-04-03-post-mortem.md