Being a SysAdmin is hard

14 points by lilac


strongoose

If you're finding Tailscale unreliable, you could try something more limited, like a point-to-point WireGuard VPN between your public webserver and the backend. Or something else from the awesome-tunneling list.
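For illustration, a point-to-point link only needs a few lines per side. A minimal sketch, assuming placeholder keys, a 10.0.0.0/24 tunnel subnet, and a hypothetical webserver.example.com endpoint:

    # /etc/wireguard/wg0.conf on the public webserver (placeholder values)
    [Interface]
    PrivateKey = <webserver-private-key>
    Address = 10.0.0.1/24
    ListenPort = 51820

    [Peer]
    # the backend
    PublicKey = <backend-public-key>
    AllowedIPs = 10.0.0.2/32

    # /etc/wireguard/wg0.conf on the backend (placeholder values)
    [Interface]
    PrivateKey = <backend-private-key>
    Address = 10.0.0.2/24

    [Peer]
    # the public webserver
    PublicKey = <webserver-public-key>
    Endpoint = webserver.example.com:51820
    AllowedIPs = 10.0.0.1/32
    # keep the tunnel alive through NAT from the backend side
    PersistentKeepalive = 25

Bring each side up with wg-quick up wg0; note the Address lines are wg-quick conveniences rather than core WireGuard settings.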

It's true though, sysadmin is hard.

Vaelatern

It's curious to me that the Tailscale container would have crashed. I wonder what conditions make it crash, because Tailscale has been really bulletproof where I've set it up.

Being operationally minded, however, does not come naturally to most people, and it often takes a trial by fire to actually learn.

markerz

Even big enterprises deal with mysterious problems via silly non-solutions like restarting or re-deploying. The main requirement for paying customers is to minimize downtime; little blips are fine. Focus on what's important to you. The biggest example: every company keeps deploying through holiday shutdowns, but with no-op commits, because we don't know how the software behaves if we don't deploy every day. Haha!

Restarting Tailscale on crash should be the default. All my containers use the “unless-stopped” restart policy.
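In Docker Compose terms that is one line per service. A minimal sketch (the service definition is illustrative; a real Tailscale container also needs auth and network settings omitted here):

    # docker-compose.yml (illustrative)
    services:
      tailscale:
        image: tailscale/tailscale:latest
        # restart after a crash, but not after an explicit `docker stop`
        restart: unless-stopped

The same policy is available to plain docker run via --restart unless-stopped.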

I love Tailscale, but I’ve had repeated trouble with the client over the last few months where it would be “connected” but not passing any data. Restarting often got me going again. Just automate it and move on until it’s a bigger problem.

But yes, sysadmin is hard. I encourage my colleagues to try self-hosting because it teaches you a lot, but I would never critically depend on my own services, because they bring too many surprises and too much stress.

jstoja

At the risk of being seen as too negative, I think this is kind of “unacceptable” for your (paying) customers.

The first paid plan says “24/7 support”, so the service being down for 23 hours because you were playing video games doesn’t sound very good. At least you are being honest about it, and that’s a mark of trust.

If those issues are so impactful, you should at least do a retrospective of what could be done so they don’t happen anymore. To me, yes, the root cause is technical, but many things can be solved before that:

  1. The first incident is you NOT KNOWING it was down; this should be remediated.
  2. The second is you KNOWING BUT IGNORING it. This should be solved too.

I bet that after the long day, you preferred gaming because you are burning out. Solve those issues before you really do burn out.

For example, you could set up automatic restarts on crashes, use a push-notification service that pings you multiple times if you don’t look at your email, etc.
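As one sketch of the notification side, a cron job can probe the site and push through a service like ntfy.sh (the URL and topic name here are placeholders, and the health endpoint is an assumption):

    # crontab entry (illustrative): probe the site every 5 minutes and push an alert through ntfy.sh if the check fails
    */5 * * * * curl -fsS --max-time 10 https://example.com/ >/dev/null || curl -s -d "site appears down" ntfy.sh/my-placeholder-topic

curl -f makes HTTP error statuses count as failures, so a 500 from a half-up service also triggers the alert.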

Sysadmin is hard, but it’s even harder when you don’t try to solve the right parts first.

lilac

Author's note here: if you go clicking around the rest of the site, you might see some weird satirical references to some "Radiant" thingamajig. Those are there because the site is based on a template. I only made enough changes for it to look nice "above the fold", so that I could have a prettier landing page than the Forgejo default. It happened to come with a built-in blog, so I decided to write a little about these recent experiences, and figured I'd put it here rather than on my personal blog.

geekodour

I had Tailscale crash on me due to some DERP changes/downtime (at least, that was one thing I noticed). I think a fix for this could be checking whether the container is running, with something like systemd restarts. I have set up my services that way for now, and they've been running fine for over 1.5 years so far. But yeah, it does seem fragile sometimes. Since I use it as an internal bridge between internal services, I should probably set up some self-hosted VPN for that instead.
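A minimal sketch of the systemd-restart approach, assuming the stock tailscaled unit (the drop-in path is standard, the values illustrative):

    # /etc/systemd/system/tailscaled.service.d/override.conf
    [Service]
    # restart the daemon automatically if it exits abnormally
    Restart=on-failure
    RestartSec=5s

    # then reload and restart:
    #   systemctl daemon-reload && systemctl restart tailscaled

The same Restart= lines work in any unit that wraps a container via docker run or podman.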

weberc2

Reminds me of my last trip to Paris, when my blog, which was replicated across 3 Raspberry Pis, crashed and all my servers became unreachable because my cat knocked the entire Pi stack off the desk, disconnecting the power cord. Everything except power and network was redundant, so I was surprised when I tried to troubleshoot remotely and found everything down.