Impromptu disaster recovery
12 points by Picnoir
Disappointed to see such blind trust in LLMs from someone I’d been looking up to.
To be honest, the tool is wrong here. I don’t think it’s unreasonable to expect that an edit across multiple files won’t merge them all into the first argument. If anything, the fact that the langle mangle assumed that’s how the tool works should be a sign that the tool’s behavior is the thing in the wrong here.
Yes, but I’d argue that due to automation bias, if they had found the tool themselves instead of asking an LLM, they would’ve been more likely to actually look at the output.
I’m not saying there wasn’t an automation-bias-driven logical error made in the process of that migration; I’m saying that the tool having surprising behaviour is really the root cause of the failure.
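Whatever the tool in question was, the usual defence against surprising edit behaviour is to run it against a scratch copy and diff before overwriting anything. A rough sketch, with a hypothetical edit-tool standing in for the real command:

    # hypothetical "edit-tool"; the point is the dry run + diff, not the tool itself
    cp -r manifests/ /tmp/manifests-edited/
    edit-tool 's/old-registry/new-registry/' /tmp/manifests-edited/*.yaml
    # inspect what actually changed before touching the real files
    diff -ru manifests/ /tmp/manifests-edited/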
I’m not saying this to smugly criticize the author (whose content I read regularly and really enjoy). Running k8s/k3s for the sake of education is kind of worth it, and in this case it’s an opportunity to have something to write about on your very popular blog (not trying to be snarky, it’s just true - there isn’t much value in a blogpost about how you did absolutely nothing about something).
But I think it highlights very well why, for personal deployments, just running your stuff on a single (or a few) dedicated server can carry you very far, even to low-key business tier.
It just takes a lot of effort and knowledge to manage a distributed system. Lots of people who aren’t as smart/driven/experienced, or just don’t have the time, try to offload it to cloud tech that charges fat $$$ to pretend the problem is gone, but it’s still there. And when the app actually scales up a bit, those price tags start to sting.
A single server is just … simple. Yes, you get worse availability and worse latencies for international users, but in practice the extra complexity fails surprisingly often and needs to be managed. A single-point-of-failure hardware server tends to work just fine for years. Yes, you need backups, and there’s a chance a disk failure will turn into a DR event, but … that’s OK from time to time? I’m in California, my tiny server is in Germany, and I really don’t care about the 200ms latency. If the web app is written well and client-side caching was taken care of, it works snappily and feels instant anyway. Deployments? A single nixos-rebuild switch. Something is off? Switch back to the previous version, easy. I could migrate everything to a new server in an hour, two tops. Simple monitoring will send me a text message when something is down (I actually don’t have it set up, I don’t care that much right now).
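For context, that deploy/rollback loop really is just a couple of commands on such a setup. A sketch, assuming a flake-based config; the attribute name and hostname here are made up:

    # build locally, then push and activate on the remote machine
    nixos-rebuild switch --flake .#myserver --target-host root@myserver.example.org
    # something is off? activate the previous generation again
    ssh root@myserver.example.org nixos-rebuild switch --rollback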
Even when scaling out, I would just add a load balancer and/or replicate the db/storage, and keep everything (except backups) in the same datacenter. Yeah, datacenters sometimes burn down etc. But GitHub, Slack etc. - very important apps - go down for hours, sometimes close to days, and everyone keeps using them, so your low-key app will be fine too.
I think NixOS, with its reproducible and atomic upgrades / configuration switches, is the key ingredient that tips the scale (at least for me) toward riding a non-distributed NixOS-based deployment for as long as pragmatically possible.
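The mechanism behind those atomic switches is just system generations: every switch adds a new numbered one, and old ones stick around until you garbage-collect them, so there’s always something to fall back to. Roughly:

    # every nixos-rebuild switch creates a new, numbered system generation
    sudo nix-env --list-generations --profile /nix/var/nix/profiles/system
    # old generations stay selectable from the boot menu until cleaned up
    sudo nix-collect-garbage --delete-older-than 30d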
Thanks for this, it was nice to see a behind-the-scenes walkthrough of your infrastructure and the pains of recovery from self-inflicted injury.
I think I have a half-finished blog post about recovering a MySQL-backed Wordpress installation that accidentally had its filesystem disappear (something with LXD clones of LXCs, and then deleting the parent). That was a fun (for some values of fun) 30 minutes of messing around with the output of lsof and trying to piece everything together. Though the underlying LXC was gone, the MySQL DB was still running and still had open files, so I managed to copy all the DB files and make everything work again. …maybe I’ll get around to doing a proper writeup someday.
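For anyone who lands in the same spot: as long as the process still holds the deleted files open, they stay reachable through /proc. The trick looks roughly like this (the fd number is just an example, yours will differ, and ideally you quiesce writes before copying):

    # list files mysqld still has open that are gone from the filesystem
    lsof -p "$(pidof mysqld)" | grep deleted
    # each one is still readable via its file descriptor, e.g. fd 5
    cp "/proc/$(pidof mysqld)/fd/5" /root/recovered/ibdata1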