A Caddy Cert Expired Because systemd-resolved Was Selectively Broken
30 points by robalex
Sadly we never find out why systemd-resolved is dropping NXDOMAIN responses.
It was a well-written blog post that built up suspense, so it's disappointing not to see the root cause. I'd say it's not even established that systemd-resolved is the broken part.
I noticed resolved dropping NXDOMAINs multiple times already, but never bothered to investigate. Might this be the final push?
I've seen systemd-resolved do weird things with DNSSEC-enabled domains before. Perhaps the circumstances in which I saw that weirdness match this case, but I don't have my notes from debugging it. I've learned not to trust systemd-resolved (or dnsmasq) at all and always replace it with good old Unbound.
This domain isn’t signed and the article says systemd-resolved’s DNSSEC validator was turned off.
But I seem to have found a bug: takeonme.org is hosted by Cloudflare, and although the authoritative servers return NXDOMAIN for most query types, they return NODATA for DNSKEY. But I would be surprised if that’s relevant to this article’s issue.
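For anyone unsure of the distinction, here is a minimal sketch of how the two cases differ on the wire, assuming you already have the raw bytes of a DNS response. The function name `classify_response` is made up for illustration. It only reads the 12-byte header, so it cannot tell a true NODATA from a referral or from a CNAME chain that ends without data.

```python
import struct

RCODE_NOERROR = 0
RCODE_NXDOMAIN = 3


def classify_response(response: bytes) -> str:
    """Classify a raw DNS response using only its 12-byte header.

    Hypothetical helper for illustration, not part of any resolver.
    """
    if len(response) < 12:
        raise ValueError("DNS header is 12 bytes; response is too short")
    # Header layout: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT (all 16-bit).
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!6H", response[:12])
    rcode = flags & 0x000F  # low four bits of the flags word
    if rcode == RCODE_NXDOMAIN:
        return "NXDOMAIN"  # the name does not exist at all
    if rcode == RCODE_NOERROR and ancount == 0:
        # Name exists but has no records of the requested type.
        # Caveat: a referral also looks like this from the header alone.
        return "NODATA"
    if rcode == RCODE_NOERROR:
        return "ANSWER"
    return f"RCODE {rcode}"
```

So the DNSKEY oddity above would show up as `NODATA` where the other query types give `NXDOMAIN`.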
Ah, I misread then, I thought I read that DNSSEC was in play.
I still don't trust systemd-resolved. :-)
There is now a follow-up project on my whiteboard
I learned I need a whiteboard in my home lab.
Regarding the staging fallback: Caddy will not use a certificate retrieved on staging, it is only used as a way to check if the challenge is solvable, without being hindered by the rate-limiting of LE prod. Once staging is successful, Caddy retries against prod immediately.
Regarding the monitoring: a soon-to-expire certificate should trigger an Uptime-Kuma alert if configured correctly ([ ] Certificate Expiry Notification).
I started removing systemd-resolved from my Linux machines. Too much troubleshooting complexity. I don't need a third or fourth way to cache DNS between my ISP, router, and apps. What is the point of it? Didn't ask for it.
The post suggests using log-based alerts to check whether TLS certificate renewal is working. I suggest that you'll get more bang for your buck by alerting when a certificate has less than a week left before it expires. It'll catch the same problem, and also problems like "certbot renewed successfully but didn't manage to install the new certificate" or "caddy didn't pick the new certificate up in a timely fashion".
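A rough sketch of that kind of check, using only Python's standard library. The function names and the 7-day default are my own choices for illustration, not anything from the post. Note that it inspects the certificate the server actually presents, which is what makes it catch the "renewed but not installed" cases.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter string of the certificate the server presents.

    With the default context, an already-expired or otherwise invalid
    certificate raises ssl.SSLCertVerificationError during the handshake;
    a monitoring job should treat that as an alert too.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_left(not_after: str, now: datetime) -> float:
    """Days between `now` (timezone-aware) and the notAfter timestamp."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_alert(not_after: str, now: datetime, threshold_days: float = 7) -> bool:
    """True when the certificate expires in fewer than `threshold_days` days."""
    return days_left(not_after, now) < threshold_days


# Example use from a cron job (hostname is a placeholder):
#   not_after = fetch_not_after("example.com")
#   if needs_alert(not_after, datetime.now(timezone.utc)):
#       ...send a notification...
```

Keep the threshold well below the interval at which your ACME client normally renews, otherwise the alert fires on every healthy certificate.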
Good job having site monitoring that caught the invalid cert in prod tho. Could have been worse.
Why would you want an init system to handle DNS resolution? That thing is a huge pile of junk, it even tries to replace sudo through the run0 gimmick.
systemd-resolved is not part of the init system though. It is a DNS resolver daemon that just happens to be developed as part of the systemd suite of software.
unfortunately you can say this until you're blue in the face and it'll still go ignored by people who simply have an axe to grind against what they think systemd is. i have begun operating under the assumption that the people who do this do similar things to cars and believe that all toyota models, including tgr, are actually just one big toyota.