A Caddy Cert Expired Because systemd-resolved Was Selectively Broken
24 points by robalex
Sadly we never find out why systemd-resolved is dropping NXDOMAIN responses.
It was a well-written blog post that built up suspense, so it's disappointing not to see the root cause. I'd say it's not even established that systemd-resolved is the broken component.
I noticed resolved dropping NXDOMAINs multiple times already, but never bothered to investigate. Might this be the final push?
I've seen systemd-resolved do weird things with DNSSEC-enabled domains before. Perhaps the weirdness I saw matches this case, but I no longer have my notes from debugging it. I've learned not to trust systemd-resolved (or dnsmasq) at all and always replace it with good old Unbound.
This domain isn’t signed and the article says systemd-resolved’s DNSSEC validator was turned off.
But I seem to have found a bug: takeonme.org is hosted by Cloudflare, and although the authoritative servers return NXDOMAIN for most query types, they return NODATA for DNSKEY. But I would be surprised if that’s relevant to this article’s issue.
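For anyone unsure about the NXDOMAIN/NODATA distinction: NXDOMAIN is RCODE 3 (the name does not exist), while NODATA is not a separate code but a NOERROR response (RCODE 0) with an empty answer section. A minimal sketch that classifies a raw DNS response from its 12-byte header follows; the function name `classify_response` is my own, and it deliberately ignores referrals, which also show up as NOERROR with no answers.

```python
import struct

def classify_response(packet: bytes) -> str:
    """Classify a raw DNS response using only its 12-byte header."""
    if len(packet) < 12:
        raise ValueError("DNS header is 12 bytes; packet is too short")
    # Header layout: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT (all 16-bit).
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!6H", packet[:12])
    rcode = flags & 0x000F  # low four bits of the flags word
    if rcode == 3:
        return "NXDOMAIN"   # the name itself does not exist
    if rcode == 0:
        # NOERROR with no answers: name exists, but not for this type.
        # Simplification: a referral would be classified as NODATA too.
        return "NODATA" if ancount == 0 else "ANSWER"
    return f"RCODE {rcode}"
```

Feeding it the response bytes for a DNSKEY query versus an A query on the same name would make a mismatch like the one described above visible.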
Ah, I misread then, I thought I read that DNSSEC was in play.
I still don't trust systemd-resolved. :-)
Regarding the staging fallback: Caddy will not use a certificate retrieved from staging. Staging is only used to check whether the challenge is solvable, without being hindered by the rate limits of LE prod. Once staging succeeds, Caddy retries against prod immediately.
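The control flow described in this comment can be sketched roughly as below. This is an illustration of the comment's description only, not Caddy's actual code: `issue_prod` and `issue_staging` are hypothetical callables that either return a certificate or raise, and the retry timing and backoff that a real client would have are left out.

```python
def obtain_certificate(issue_prod, issue_staging):
    """Try production first; use staging only as a solvability probe."""
    try:
        return issue_prod()
    except Exception:
        pass  # production failed; fall through to the staging probe

    # The staging result is deliberately discarded. If the challenge is
    # still unsolvable, this call raises and no certificate is returned.
    issue_staging()

    # Staging succeeded, so the challenge is solvable: retry production.
    return issue_prod()
```

The point of the design is that repeated failures land on staging rather than eating into the production rate limits.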
Regarding the monitoring: a soon-to-expire certificate should trigger an Uptime-Kuma alert if configured correctly ([ ] Certificate Expiry Notification).
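If you want an independent check alongside Uptime-Kuma, a small script using only the Python standard library can report how many days a served certificate has left. This is a minimal sketch: the function names and the 5-second timeout are my own choices, and since the default context verifies the chain, an already expired certificate makes the handshake raise instead of returning a date.

```python
import socket
import ssl
import time

def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter timestamp."""
    # not_after uses the format returned by getpeercert(),
    # e.g. "Jan  5 09:34:43 2018 GMT".
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400

def fetch_not_after(host, port=443, timeout=5.0):
    """Return the notAfter field of the certificate served by host."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Run from cron, something like `days_until_expiry(fetch_not_after("example.com")) < 14` could trigger an alert well before renewal failures turn into an outage.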
I started removing systemd-resolved from my Linux machines. It adds too much troubleshooting complexity. I don't need a third or fourth DNS cache between my ISP, router, and apps. What is the point of it? I didn't ask for it.