Being a SysAdmin is hard

14 points by lilac


strongoose

If you're finding Tailscale unreliable, you could try something more limited, like a point-to-point WireGuard VPN between your public webserver and the backend. Or something else from the awesome-tunneling list.
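For illustration, a point-to-point link only needs a few lines per side. A minimal sketch, assuming placeholder keys, a 10.0.0.0/24 tunnel subnet, and a hypothetical webserver.example.com endpoint:

    # /etc/wireguard/wg0.conf on the public webserver (placeholder values)
    [Interface]
    PrivateKey = <webserver-private-key>
    Address = 10.0.0.1/24
    ListenPort = 51820

    [Peer]
    # the backend
    PublicKey = <backend-public-key>
    AllowedIPs = 10.0.0.2/32

    # /etc/wireguard/wg0.conf on the backend (placeholder values)
    [Interface]
    PrivateKey = <backend-private-key>
    Address = 10.0.0.2/24

    [Peer]
    # the public webserver
    PublicKey = <webserver-public-key>
    Endpoint = webserver.example.com:51820
    AllowedIPs = 10.0.0.1/32
    # keep the tunnel alive through NAT from the backend side
    PersistentKeepalive = 25

Bring each side up with wg-quick up wg0; note the Address lines are wg-quick conveniences rather than core WireGuard settings.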

It's true though, sysadmin is hard.

Vaelatern

It's curious to me that the Tailscale container would have crashed. I wonder what conditions make it crash, because Tailscale has been really bulletproof where I've set it up.

Being operationally minded, however, does not come naturally to most people, and it often takes a trial by fire to actually learn.

markerz

Even big enterprises deal with mysterious problems via silly non-solutions like restarting or re-deploying. The main requirement for paying customers is to minimize downtime; little blips are fine. Focus on what's important to you. The biggest example: every company keeps deploying through holiday shutdowns, but with no-op commits, because we don't know how the software behaves if we don't deploy every day. Haha!

Restarting Tailscale on crash should be the default. All my containers use the “unless-stopped” restart policy.
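In Docker Compose terms that is one line per service. A minimal sketch (the service definition is illustrative; a real Tailscale container also needs auth and network settings omitted here):

    # docker-compose.yml (illustrative)
    services:
      tailscale:
        image: tailscale/tailscale:latest
        # restart after a crash, but not after an explicit `docker stop`
        restart: unless-stopped

The same policy is available to plain docker run via --restart unless-stopped.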

I love Tailscale, but I’ve had repeated trouble with the client over the last few months where it would be “connected” but not passing any data. Restarting often got me going again. Just automate it and move on until it’s a bigger problem.

But yes, sysadmin is hard. I encourage my colleagues to try self-hosting because it teaches you a lot, but I would never critically depend on my own services, because they bring too many surprises and too much stress.

jstoja

At the risk of being seen as too negative, I think this is kind of “unacceptable” for your (paying) customers.

The first paid plan says “24/7 support”, so the service being down for 23 hours because you were playing video games doesn’t sound very good. At least you are being honest about it, and that’s a mark of trust.

If those issues are so impactful, you should at least do a retrospective of what could be done so they don’t happen anymore. To me, yes, the root cause is technical, but many things can be solved before that:

  1. The first incident is you NOT KNOWING it was down; this should be remediated.
  2. The second is you KNOWING BUT IGNORING it. This should be solved too.

I bet that after the long day, you preferred gaming because you are burning out. Solve those issues before you really do burn out.

For example, you could set up automatic restarts on crashes, use a push-notification service that pings you multiple times if you don’t look at your email, etc.
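As one sketch of the notification side, a cron job can probe the site and push through a service like ntfy.sh (the URL and topic name here are placeholders, and the health endpoint is an assumption):

    # crontab entry (illustrative): probe the site every 5 minutes and push an alert through ntfy.sh if the check fails
    */5 * * * * curl -fsS --max-time 10 https://example.com/ >/dev/null || curl -s -d "site appears down" ntfy.sh/my-placeholder-topic

curl -f makes HTTP error statuses count as failures, so a 500 from a half-up service also triggers the alert.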

Sysadmin is hard, but it’s even harder when you don’t try to solve the right parts first.

lilac

Author's note here: if you go clicking around the rest of the site, you might see some weird satirical references to some "Radiant" thingamajig. Those are there because the site is based on a template. I only made enough changes for it to look nice "above the fold", so that I could have a prettier landing page than the Forgejo default. It happened to come with a built-in blog, so I decided to write a little about these recent experiences, and figured I'd put it here rather than on my personal blog.

geekodour

I had Tailscale crash on me due to some DERP changes/downtime (at least, that was one thing I noticed). I think a fix for this could be checking whether the container is running, with something like systemd restarts. I have set up my services that way for now, and they've been running fine for over 1.5 years so far. But yeah, it does seem fragile sometimes. Since I use it as an internal bridge between internal services, I should probably set up some self-hosted VPN for that instead.
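A minimal sketch of the systemd-restart approach, assuming the stock tailscaled unit (the drop-in path is standard, the values illustrative):

    # /etc/systemd/system/tailscaled.service.d/override.conf
    [Service]
    # restart the daemon automatically if it exits abnormally
    Restart=on-failure
    RestartSec=5s

    # then reload and restart:
    #   systemctl daemon-reload && systemctl restart tailscaled

The same Restart= lines work in any unit that wraps a container via docker run or podman.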

weberc2

Reminds me of my last trip to Paris, when my blog, which was replicated across 3 Raspberry Pis, crashed and all my servers became unreachable because my cat knocked the entire Pi stack off the desk, disconnecting the power cord. Everything except power and network was redundant, so I was surprised when I tried to troubleshoot remotely and found everything down.