The dangers of SSL certificates
17 points by typesanitizer
Any aspect of your operation is “dangerous” if it doesn't often go wrong. I am not a fan of the operational burden of ever decreasing certificate lifetimes, but to call them a “fundamentally dangerous technology” is extreme, imo. It is an obvious thing to monitor and document and you have (at least for now) a few weeks from renewal failure to big problems. If you can't tell your renewals are failing, you can either invest a few hours in setting up monitoring, or set a reminder to check your certificate expiry every couple weeks.
More annoying than it needs to be? Maybe you could argue that. But “fundamentally dangerous” it simply is not.
I think we should perhaps be worrying a bit more about what we'd do if Let's Encrypt disappeared overnight, but that's another story.
Calling SSL certificates dangerous because they expire is like calling locks dangerous because they can lock you out. Even ignoring all the automation that is possible today, at the end of the day it comes down to a calendar reminder. If that's too much to handle perhaps the service is not all that important anyway.
Domain names also expire if not renewed before a certain date. So they're just as dangerous, and in just the same way, as certificates.
locks are dangerous because they lock you out though.
It’s one of the first things that is brought up in sysadmin textbooks. Fail-safe (open), or fail-secure (closed): massively depends on what the problem space is.
The classic example given is an exit door. Sure, you want it to lock, but it's actually more important that it will always open.
The danger of locks locking people in is the reason for a great number of workplace safety standards:
https://en.wikipedia.org/wiki/Triangle_Shirtwaist_Factory_fire
So the full error was:
ERROR: Error computing the main repository mapping:
Error accessing registry https://bcr.bazel.build/:
Failed to fetch registry file https://bcr.bazel.build/modules/platforms/0.0.7/MODULE.bazel:
PKIX path validation failed: java.security.cert.CertPathValidatorException:
validity check failed
There is a "fundamentally dangerous technology" here, but it's not the certificate, it's the central registry.
The "Error accessing registry" could have been due to:
If bcr.bazel.build breaks your workflow you should be asking why your build tool has a hard dependency on someone else's computer, not complaining about certificates.
I had more than my fair share of TLS-related headaches so I sympathise with the post.
For my own websites I have a Gatus instance periodically checking that the services are up and that the TLS certificates still have a long enough lifetime. Don’t trust the renewal process to always happen flawlessly and don’t even trust it to report issues.
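The kind of check described above can be sketched with nothing but the Python standard library. This is not how Gatus does it; it is a minimal illustration, and the function names `fetch_not_after` and `days_remaining` are made up for this example. Note that with a default (verifying) context the handshake itself fails once the certificate has expired, so a real monitor would want to treat a connection error as an alert too.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect to host:port and return the served certificate's notAfter string.

    Uses a verifying context, so an already-expired or otherwise invalid
    certificate raises ssl.SSLError instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_remaining(not_after: str, now: Optional[float] = None) -> float:
    """Days between `now` (epoch seconds, default: current time) and notAfter.

    `not_after` uses the format returned by getpeercert(), for example
    "Jan  5 09:34:43 2030 GMT". The result is negative once expired.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / 86400


# Example use (needs network access; the hostname is only a placeholder):
#   print(days_remaining(fetch_not_after("example.com")))
```

Running something like this from cron or a scheduled job, independently of the renewal job, is what keeps the check useful when the renewal process fails silently.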
The graceful degradation idea is interesting. I wonder if TLS certificates should have multiple expiration dates:
For web browsing, the second would make things a bit slower but the third would be ignored. For build tools like Bazel, the third would print a message saying ‘TLS certificate is still in use after documented renewal date’ and then anyone using the tool would get a warning that might prompt some action.
As I understand the format (in as much as I understand anything related to ASN.1) you can add arbitrary additional fields to TLS certificates, but some signing infrastructure (in particular, Let’s Encrypt) strips ones it doesn’t know about from the CSR, so perhaps this would need a bit of coordination to deploy.
TLS certificates already have all those dates, called notAfter, notBefore, and the third is a matter of how renewal is configured. But ACME renewal is becoming dynamic so that certificates are automatically replaced sooner than planned if that becomes necessary, which means a static date in the cert would be incorrect.
There were several things that went wrong in this case: their DNS was inconsistent with their service configuration; errors from the cert renewal job were lost; the service’s health checks didn’t report problems with the DNS or the expiry time; the people responsible for the service lacked the knowledge and control to fix it easily. None of these are really specific to TLS certificates (there are plenty of other periodic maintenance jobs that can fail and cause an outage weeks later) and judging by the failure rate it isn’t clear that renewal is dangerous enough to need extra complication to make it more robust. Functional error reports and monitoring are better places to focus attention.
AFAIK the notAfter date is supposed to be a hard limit.
The point is to have soft limits to stop clients from suddenly failing hard.
I'd like some early warning signs. For example, clients could start failing a couple of days earlier with a random probability. This way users would see errors related to the impending expiration doom, but still be able to retry requests to get through.
It doesn't even have to be an explicit field in the certificate, but we'd need a broad agreement that certificates are supposed to be renewed X days before notAfter, or that clients are allowed to use certificates in some degraded noisy way X days after notAfter.
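To make the random early failure idea concrete, here is a rough sketch of what a client-side policy could look like. No TLS client does this today as far as I know; the two-day window, the linear ramp, and the names `early_failure_probability` and `should_reject` are all assumptions picked for illustration.

```python
import random
from typing import Callable

# Assumed warning window: "a couple of days" before notAfter.
WINDOW_SECONDS = 2 * 86400


def early_failure_probability(seconds_left: float,
                              window_seconds: float = WINDOW_SECONDS) -> float:
    """Probability of rejecting a connection, given seconds until notAfter.

    0.0 outside the window, rising linearly to 1.0 at notAfter (and after it).
    """
    if seconds_left <= 0:
        return 1.0
    if seconds_left >= window_seconds:
        return 0.0
    return 1.0 - seconds_left / window_seconds


def should_reject(seconds_left: float,
                  window_seconds: float = WINDOW_SECONDS,
                  rng: Callable[[], float] = random.random) -> bool:
    """Randomly decide whether this connection attempt fails early.

    `rng` returns a float in [0, 1); it is injectable so the policy
    can be tested deterministically.
    """
    return rng() < early_failure_probability(seconds_left, window_seconds)
```

With this shape, a retry usually gets through early in the window and the failure rate climbs smoothly toward the hard limit instead of jumping from zero to total.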
It would be much worse to deal with certificate expiry by randomly making every site’s users and tech support people do manual work, and especially bad to depend on useful bug reports from nontechnical users. Vague complaints of intermittent low-probability failures are notoriously difficult to diagnose.
If you want an early warning, make sure your automated monitoring is set up to give you one. You shouldn’t expect a useful signal from your users.
This is a "just don't have the problem" solution! Of course I'll know about certificates that I correctly set up monitoring for. But I'd like some more graceful failure for the ones that I forget to monitor, or when my monitoring fails and I haven't set up monitoring of my monitoring.
Even if clients soft-failed a few days before, users of the service would not care, or would assume the operators already know, until it fails hard.
Any important online presence should have a proper SysAdmin / SysEng (team) taking care of general operations, who can also put monitoring in place to warn (21 days) and alert (14 days) before a certificate expires, leaving enough time to fix it. But sadly, with all this cloud native stuff, companies think that everything can be done by the developers, who mostly don’t have much operational experience.
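The warn/alert split above maps onto a very small piece of logic. A minimal sketch, assuming the remaining lifetime in days is already known; the function name `classify` and the extra "expired" level are additions for illustration, only the 21 and 14 day thresholds come from the comment above.

```python
def classify(days_left: float,
             warn_days: float = 21,
             alert_days: float = 14) -> str:
    """Map remaining certificate lifetime (in days) to a monitoring level."""
    if days_left <= 0:
        return "expired"  # hypothetical extra level: already past notAfter
    if days_left <= alert_days:
        return "alert"    # page someone: renewal should have happened by now
    if days_left <= warn_days:
        return "warn"     # non-urgent notice: renewal is overdue or failing
    return "ok"
```

The gap between the two thresholds is the point: a week of warnings before anyone gets paged, and two more weeks before anything actually breaks.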
This assumes that the users and maintainers of the system are entirely disjoint groups with no communication. For something like a Bazel system, I’d expect that there is a large overlap and if a tool starts saying ‘it’s fine for now, but it looks as if the certificate should have been renewed and it hasn’t been’ then that wouldn’t prevent the builds from working but would let someone who at least knows the right people know.
The reality is that SSL certificates are a fundamentally dangerous technology
I think we differ in the meaning of the word "reality" here. I, for one, side with the dictionary.
Congratulations on getting that important sentence, which states an uncomfortable fact, through moderation here at lobsters: https://surfingcomplexity.blog/2025/12/27/the-dangers-of-ssl-certificates/#:~:text=SSL%20certificates%20are%20a%20fundamentally%20dangerous%20technology