It's NOT always DNS
37 points by untitaker
Sometimes it’s BGP…
Even this one is similar. Many of the "BGP outages" at big companies are from automation bugs, not anything particular to BGP.
(I developed network operation platforms in the past for similar companies)
Sometimes they interact with "DNS outages", too. For instance, tying DNS records to the existence of an advertised VIP: if automation breaks connectivity and causes the VIP to be withdrawn, it might very well look like a DNS problem.
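To make that concrete, here is a rough, hypothetical sketch of that kind of automation (all names and addresses are made up for illustration): a health checker that publishes a DNS record while the VIP answers and withdraws it when the VIP stops being reachable. When the connectivity automation breaks, the record disappears and the incident looks like DNS.

    import socket
    import time

    VIP = "192.0.2.10"          # the advertised virtual IP (example address)
    RECORD = "api.example.com"  # hypothetical record tied to the VIP

    def vip_is_reachable(ip, port=443, timeout=2.0):
        # Crude reachability check: can we open a TCP connection to the VIP?
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                return True
        except OSError:
            return False

    def publish_record(name, ip):
        # In real automation this would call your DNS provider's API.
        print(f"publish {name} -> {ip}")

    def withdraw_record(name, ip):
        # Likewise a provider API call; shown as a print for the sketch.
        print(f"withdraw {name} -> {ip}")

    while True:
        # If the routing automation breaks and the VIP stops answering, this
        # loop pulls the DNS record too, and the visible symptom is "DNS".
        if vip_is_reachable(VIP):
            publish_record(RECORD, VIP)
        else:
            withdraw_record(RECORD, VIP)
        time.sleep(30)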
If you choose to back your DNS with an Excel spreadsheet, or BIND, or a key-value store, and that mechanism breaks?
It was DNS.
Perhaps it was your implementation of DNS and the protocol itself is fine, but the problem, in your environment, is still your DNS.
This is beside the point. When the author mentions KV-stores he is not talking about implementations of DNS, but about what would replace DNS. People dunk on DNS for failure modes that are not caused by design flaws, and which would be shared by any successor of DNS-the-protocol.
The meme of "it was DNS" to me does not ever consider replacing DNS. Most of the time it's a "get good at DNS" caution.
I never interpreted the meme as meaning 'DNS is bad', it was more 'DNS is one of the absolutely foundational things and it's complicated, if it goes wrong, things will break and the failures may manifest at any level of the stack, so it's not obvious what caused them. You will spend ages debugging the failures assuming that they're related to the thing that you observe breaking, and then discover that it was a DNS misconfiguration / failure and that's why a load of seemingly unrelated things stopped'. As @fanf says, this also works if you substitute BGP for DNS. It's a cautionary piece of advice about not assuming that the problem comes from near the top of the stack just because the observed failure is near the top.
I think it's less that DNS is particularly complicated* and more that it's just the highest thing we've standardised on as an industry. Everything lower-level we call a netops problem. Everything higher is such an ad-hoc house of cards that it can't "always" be DNS; in aggregate it is usually the house of cards, not DNS. So I've also always been annoyed by the mantra of "it's always DNS", because it's like people are applauding their own inability to reuse anything else.
And I guess also because people often say "it's always DNS" when the network is down. Shouldn't that be "It's always IP"?
*I haven't thought about this hard, but I think DNS is less complicated than BGP or HTTP. And most of the complexity in DNS is only relevant for authoritative servers talking to each other.
I think DNS is less complicated than BGP or HTTP. And most of the complexity in DNS is only relevant for authoritative servers talking to each other.
DNS is simpler than HTTP as a protocol, but it’s more complicated as a distributed system, and harder to debug.
The DNS is distributed in two dimensions: there’s the query resolution path, and the distributed namespace. HTTP has the first, but relies on the DNS for the second.
The query path is more complicated for the DNS: you have the stub resolver, possibly a system cache, maybe something on your NAT gateway, your ISP’s recursive servers, and finally the authoritative servers. Many of these hops are not under the control of the client nor the server, nor either of their operators.
There’s a lot more caching in the DNS than in HTTP, and much less control over how it behaves. A lot of the difficulty of debugging the DNS is because you have to understand how your queries are affected by temporal skew.
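A quick way to see that skew is to ask a few resolvers for the same name and compare the answers and remaining TTLs. A minimal sketch, assuming the dnspython library is installed; the resolver addresses are just well-known public ones:

    import dns.exception
    import dns.resolver

    NAME = "example.com"
    RESOLVERS = {
        "system":     None,        # whatever /etc/resolv.conf points at
        "google":     "8.8.8.8",
        "cloudflare": "1.1.1.1",
    }

    for label, ip in RESOLVERS.items():
        res = dns.resolver.Resolver(configure=(ip is None))
        if ip is not None:
            res.nameservers = [ip]
        try:
            answer = res.resolve(NAME, "A")
            addrs = sorted(r.address for r in answer)
            # The TTL printed here is the TTL as seen through that resolver's
            # cache, so different resolvers will usually show different values.
            print(f"{label:10} ttl={answer.rrset.ttl:<6} {addrs}")
        except dns.exception.DNSException as e:
            print(f"{label:10} failed: {e!r}")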
Error reporting through the DNS resolution chain is fairly terrible. Often all you get is a timeout, and you have to go and run the resolver algorithm by hand to work out what went wrong. (There’s an effort to add richer error codes to the DNS protocol; we’ll see how much that helps.)
An HTTP request crosses fewer servers, fewer org boundaries, and fewer caches than a DNS request.
Probably the most complicated part of the DNS is how resolvers traverse the namespace and discover the right auth servers. There’s a lot of subtlety in how NS records and glue work, and a design error in the DNS (mirroring records across a zone cut instead of using different types) makes it much harder than it could be.
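For anyone curious what "running the resolver algorithm by hand" looks like, here is a deliberately simplified sketch that starts at a root server and follows NS referrals using glue addresses from the additional section. It assumes dnspython, hard-codes one root server address, cheats by using the system resolver for out-of-bailiwick NS names, and ignores CNAMEs, IPv6, retries and DNSSEC, so treat it as a teaching aid rather than a resolver:

    import dns.message
    import dns.query
    import dns.rcode
    import dns.rdatatype
    import dns.resolver

    def iterate(qname, server="198.41.0.4"):  # a.root-servers.net
        while True:
            # EDNS0 so the referral (NS + glue) is less likely to be truncated.
            query = dns.message.make_query(qname, dns.rdatatype.A, use_edns=0)
            response = dns.query.udp(query, server, timeout=3)
            if response.answer:
                for rrset in response.answer:
                    print("ANSWER:", rrset)
                return
            ns_rrsets = [r for r in response.authority
                         if r.rdtype == dns.rdatatype.NS]
            if not ns_rrsets:
                print("no answer and no referral, rcode:",
                      dns.rcode.to_text(response.rcode()))
                return
            # Prefer a glue address from the additional section; if the NS
            # names are out of bailiwick there is no glue, and a real resolver
            # would have to resolve the NS names itself (here we cheat and use
            # the system resolver for that step).
            glue_a = [r for r in response.additional
                      if r.rdtype == dns.rdatatype.A]
            if glue_a:
                next_server = glue_a[0][0].address
            else:
                ns_name = ns_rrsets[0][0].target
                next_server = next(iter(dns.resolver.resolve(ns_name, "A"))).address
            print(f"referred from {server} to {next_server}")
            server = next_server

    iterate("www.example.com")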
Authoritative servers talking to each other is pretty simple in comparison. I’ve almost never needed to care about the details of NOTIFY/AXFR/IXFR. (Tho TSIG is a bit fiddly.)
Then there’s the whole business of registering domains, tho arguably that’s EPP not DNS :-)
Yeah. There are a couple of other aspects to it always being DNS:
The first symptom of a network outage is usually DNS lookup failure.
DNS-related cockups have large blast radius, and big outages are the memorable ones.
If someone sends you some bad news by email, it was SMTP.
Edit: ok, looking back I think this comes across quippier than I meant it to. I think a somewhat more appropriate example would be that if you put incorrect data into your database and then bad things happened, you would not say "it's always SQL". As I understand it, that's effectively what happened in the Amazon incident.
Although I like the article and its content, and tend to believe that simplifications are always the root of all evil, I find the writing style pedantic, condescending, and completely contrary to the practices that Feynman was so committed to implementing in order to learn and spread information.
we can do better
If you are incorrectly blaming DNS, your logging is probably insufficient and your error messages are misleading.
In most cases we deal with much more complex systems, but we can still find some inspiration in programs like wget:
$ wget http://such-DNS-record.does.not.exist
--2025-10-28 14:08:04-- http://such-dns-record.does.not.exist/
Resolving such-dns-record.does.not.exist (such-dns-record.does.not.exist)... failed: Name or service not known.
wget: unable to resolve host address ‘such-dns-record.does.not.exist’
$ http_proxy=http://localhost:8888 wget http://such-DNS-record.does.not.exist
--2025-10-28 14:11:39-- http://such-dns-record.does.not.exist/
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:8888... connected.
Proxy request sent, awaiting response... No data received.
Retrying. ...
$ wget http://example.com/this-file-does-not-exist
--2025-10-28 14:08:32-- http://example.com/this-file-does-not-exist
Resolving example.com (example.com)... 23.215.0.136, 23.220.75.232, 23.192.228.80, ...
Connecting to example.com (example.com)|23.215.0.136|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2025-10-28 14:08:33 ERROR 404: Not Found.
This shows how detailed logs and error messages should be.
In the first case, we were unable to resolve the hostname, so we did not try to connect at all. The only unclear part is which DNS servers were used, but we can check the /etc/resolv.conf file. We either have the wrong hostname, the wrong DNS server configured, or a DNS server that does not (yet) know about the hostname.
In the second case, we know that we tried to connect through a proxy (we know it from the log and error messages – the environment variable might be hidden somewhere), we know its hostname, its IP address and its port. So we look for the cause of the problem at the given proxy server.
In the third case, we know the IP addresses resolved from the hostname, we know which one was chosen to connect to, we know the port and the protocol. And we know that the server returned a particular error message, so we look for the cause of the problem at the given server (or check whether our URL is correct).
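The same discipline is easy to apply in your own code: report which step failed (name resolution, the connection, or the HTTP layer) instead of a generic "request failed". A small Python sketch using only the standard library, with the URLs borrowed from the wget examples above:

    import socket
    import urllib.error
    import urllib.request

    def fetch(url):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                print(f"OK {resp.status} from {url}")
        except urllib.error.HTTPError as e:
            # We resolved the name and connected; the server answered with an error.
            print(f"server returned HTTP {e.code} for {url}")
        except urllib.error.URLError as e:
            if isinstance(e.reason, socket.gaierror):
                # getaddrinfo failed: this really is a name-resolution problem.
                print(f"could not resolve host for {url}: {e.reason}")
            else:
                # Name resolved, but the connection itself failed.
                print(f"could not connect for {url}: {e.reason}")

    fetch("http://such-dns-record.does.not.exist/")
    fetch("http://example.com/this-file-does-not-exist")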
P.S. This may apply to any situation where you are unable to find the cause of the problem.
P.P.S. DNS is not trivial, but its complexity is rarely the cause of the problem.
like 8.8.8.8 and 1.1.1.1 disagreeing about resolving a domain because of DNSSEC
What does this mean? Why would Google and Cloudflare resolve names differently?
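Not a definitive answer, but one way to investigate that kind of disagreement is to query both resolvers directly, once normally and once with the CD (checking disabled) bit set: a validating resolver answers SERVFAIL when DNSSEC validation fails, yet will usually still hand back data when validation is switched off. A rough sketch with dnspython; the domain name below is just a placeholder for whatever you are debugging:

    import dns.flags
    import dns.message
    import dns.query
    import dns.rcode

    NAME = "dnssec-failed.example"   # placeholder: substitute the domain in question
    for resolver in ("8.8.8.8", "1.1.1.1"):
        for checking_disabled in (False, True):
            q = dns.message.make_query(NAME, "A", want_dnssec=True)
            if checking_disabled:
                q.flags |= dns.flags.CD   # ask the resolver to skip validation
            r = dns.query.udp(q, resolver, timeout=3)
            print(resolver,
                  "CD" if checking_disabled else "--",
                  dns.rcode.to_text(r.rcode()))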