It's amazing how often finding the obvious cause of a problem only mitigates it, and you end up having to solve it 2 or 3 more times in the following weeks.
In this case, there were NUMEROUS suboptimal or outright broken DNS configurations, but none of them mattered until the volume reached a tipping point, and, suddenly, ALL of them came into play. Fixing one overflowed into the next, which overflowed into the next.
The real lesson of this incident is how to handle incidents effectively. The author cites practices that I use every time I manage one:
* Centralize - 1 tracking doc that describes the issue, timeline, what's been tested, who owns the incident. Have 1 group chat, 1 'team' (virtual or in person). Get an incident commander to drive the group.
* Create a list of hypotheses and work through them one at a time.
* Use data, not assumptions to prove or disprove your hypotheses.
* Gather as much data as you can, but don't let a particular suspicious graph lead you into a rabbit hole. Keep gathering data.
If you don't do the above, you are guaranteed to have a mess, to repeat yourself over and over, and to waste time.
I don't know of a non-Kubernetes situation off the top of my head where this would be an issue, but I definitely learned some new things about DNS resolution on Linux by reading the article, and so I'll think to look for similar scenarios in the future.
You generally only need to do a DNS lookup at app startup, and then you're done. So node-local DNS is overhead and complexity you rarely need. In this case it was present, and I wonder whether that extra complexity was one of the reasons for the outage.
Sounds more like a kubernetes problem than a dns problem.
I hate coredns. Everything running inside of a kubernetes cluster should just be querying the kubernetes endpoints API for these IPs directly and using the node's DNS servers for external hosts.
If I restart my DB, will the database service host env var also be updated? Does restarting a DB or changing its IP also imply a restart of all of the services that need access to it?
One red flag that stood out for me is where the blog says the team considered all apps to be the same and hadn’t looked at any of their logs, only infrastructure stuff.
When they looked they saw all apps were not the same, and it was only a few kinds of apps that were affected.
When a big incident hits, you need people drilling down not just across; and hopefully people who know the actual apps in question.
Maybe this was DevOps people too far into the ops side and not as much on the dev?
Had a similar problem at work a while ago. One service was unable to connect to another occasionally. The Splunk logs said it was a TLS connection problem. After an unsuccessful attempt at reproducing the problem locally, it eventually dawned on me it might be Kubernetes DNS. And by changing temporarily to not using DNS for connecting to that host, we confirmed that indeed it was Kubernetes DNS.
Did you actually query the DNS from the container to verify DNS was returning an incorrect record in response to the query? I ask because I've seen similar behavior and it turned out the service was only doing DNS lookups at startup and then cached the record indefinitely (or until restarted), regardless of the TTL on the record. Unfortunately some software and libraries don't respond well to even occasional DNS changes.
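For what it's worth, a quick way to check from inside the container is to ask the system resolver directly, which sidesteps any record the app cached in-process at startup (a minimal Python sketch; the hostname is whatever the app connects to):

```python
import socket

def current_a_records(hostname):
    """Ask the system resolver right now, bypassing any cache the
    application itself keeps in-process."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET)
    return sorted({info[4][0] for info in infos})

# Compare this against the address the app is actually connecting to
# (e.g. from ss/netstat output) to spot a stale in-process cache.
print(current_a_records("localhost"))
```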
About 15 years ago I worked with a vendor that didn’t realize their web service was ignoring TTLs. I think this was the Java 5 days. We had changed an IP on our end and they kept trying to connect to the wrong one for a webhook. It took weeks of sending tcpdump logs back and forth to convince them. They finally restarted their app.
In our case the problem is kind of the opposite, as far as we could tell.
The TTL is 2 seconds, but the app and the service always deploy together and always run on the same node as one another. So when we deploy, both the app and the service land on a new node, where both will run.
But because the TTL is so low, every new connection (traffic is pretty low for this particular app, unlike some other apps in our cluster) is pretty certain to do another DNS lookup. And about 10% of the time we were getting a connection error which boiled down to DNS.
So to confirm it was the problem, we changed it to not do a DNS lookup for now, since the app and the service currently always share a node.
But soon we are changing things around and they will no longer be guaranteed to run on same node nor will they deploy together.
So I still need to come up with something that lets us do DNS lookups but not have the problem we've been having.
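One shape that could take (purely a sketch, with an illustrative class name and a made-up 30-second window, none of which is from the thread): resolve through DNS as usual, but hold on to the last known-good address and serve it when a lookup fails, so a transient resolution error doesn't turn into a connection error:

```python
import socket
import time

class CachingResolver:
    """Resolve via DNS, but fall back to the last known-good address
    when a lookup fails (hypothetical sketch; timings are illustrative)."""

    def __init__(self, ttl=30.0):
        self.ttl = ttl      # serve cached answers this long before re-resolving
        self.cache = {}     # host -> (address, resolved_at)

    def resolve(self, host):
        addr, ts = self.cache.get(host, (None, 0.0))
        if addr is not None and time.time() - ts < self.ttl:
            return addr     # fresh enough, skip the lookup entirely
        try:
            addr = socket.getaddrinfo(host, None)[0][4][0]
            self.cache[host] = (addr, time.time())
        except socket.gaierror:
            if addr is None:
                raise       # no known-good address to fall back on
            # lookup failed: keep serving the stale address rather than erroring
        return addr
```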
This post reminds me about a similar but different K8s Turned-Out-To-Be-DNS problem we had recently. We published a write-up about it but never got around to submitting it here before now: https://hackertimes.com/item?id=38740112
> To avoid ndots issues, the most straightforward solution is to have at least five dots in our hostname. Fluent-bit is one of the biggest abusers of the DNS requests. ... As it now has five dots in the domain, it doesn’t trigger local search anymore.
But it wasn't DNS. DNS didn't break. The protocol didn't break. Not even issues with the CoreDNS or dnsmasq implementations.
The culprit was ndots (why did Kubernetes arbitrarily choose five?) and the general way that Kubernetes (ab)uses DNS.
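The ndots mechanics are roughly this (a Python sketch of the resolver's search-list expansion; the `ns` namespace in the default search list is a placeholder, the real list comes from the pod's /etc/resolv.conf):

```python
def candidate_queries(name,
                      search=("ns.svc.cluster.local",  # "ns" is a placeholder namespace
                              "svc.cluster.local",
                              "cluster.local"),
                      ndots=5):
    """Mimic resolv.conf search behavior: a name with fewer than `ndots`
    dots is tried with each search domain appended before being tried
    as-is, which multiplies the query volume."""
    if name.endswith("."):
        return [name]                             # fully qualified: one query
    expanded = [f"{name}.{d}" for d in search]
    if name.count(".") >= ndots:
        return [name] + expanded                  # enough dots: try as-is first
    return expanded + [name]                      # otherwise: search list first

# A typical external name walks the whole search list before resolving:
print(candidate_queries("logs.example.com"))
```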
I always chuckle at the "It's not DNS...it was DNS" line because in my experience, the problem is usually actually DHCP.
I'm struggling with a problem where a VM is supposed to get an IP address from the host, but it takes forever to do so. The host is telling me it has assigned an IP, but the VM says it hasn't. It can take anywhere from 10-60 minutes for the VM to actually get the IP that the host has assigned.
DNS is for friendly names; friendly to humans using web browsers. Using DNS for machine to machine communication is not essential complexity. Every chance I get I eliminate DNS from internal infrastructure and a whole lot of things get a lot better. If you naively keep forward/reverse DNS resolutions on in different parts of the stack, you end up with a shitstorm of DNS lookup requests at even a moderate scale infrastructure. Then bad things tend to happen.
DNS is more than just pretty names, it allows for a hierarchy that holds meaning. It is way more than just friendly for humans. TBH I would posit that having everything as IP literals would cause more human errors than not. You need to keep a context of all subnetting in your mind, which is not feasible in many networks.
What are you using instead? Hard coded IPs? Or have you built your own lookup service?
If you have decent TTLs dns doesn’t result in a shitstorm of lookups, nor does it require anything more powerful than a raspberry pi to respond to them
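Back-of-envelope, with made-up numbers:

```python
# Hypothetical fleet: 5,000 clients, each honoring a 60 s TTL.
clients = 5000
ttl_seconds = 60

# In steady state each client re-resolves about once per TTL window.
qps = clients / ttl_seconds
print(f"{qps:.1f} queries/sec")  # trivial load for any small box
```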
Yes, it is much easier to build and scale a general config service that can serve keys (like names in dns) and associated config snippets with versioning and expiry/go-live timestamps etc. Such a service enables us to build on top of it things like service discovery, failover, draining, cutover, weighted load-balancing etc. It is much easier to also control/orchestrate/audit changes to key-configs in globally consistent transactional manner and guarantee these changes will be instantly visible to every client or will be deterministically spread-out/staggered etc. It is also much easier to do interpolation of config variables arranged in a hierarchical class namespace. All this makes it a lot more powerful building block for large scale infra services than dns ever could and it has none of the drawbacks of dns.
> If you have decent TTLs dns doesn’t result in a shitstorm of lookups, nor does it require anything more powerful than a raspberry pi to respond to them
This applies equally to any kind of lookup service you use. It's not a distinguishing feature of DNS.
The distinguishing features of DNS are that it's a global, highly regulated, key-value storage with only eventual consistency that may take days to reach. (It has probably never been consistent in practice.) None of those features are desirable for your internal server configuration.
Opposite take: I consider IPs appearing anywhere except the dhcpd configuration and the DNS zone files (or their database equivalents) to be a bug.
IPs are opaque and meaningless. Maybe you can keep in your head that “.2 is the database, .3 is the web server, .4 is the redis, .5 is the other api, .6 is the other database”, but I can’t and wouldn’t even if I could.
It's not like appserver1234.internal is significantly less opaque and meaningless. Either way you probably want a control panel somewhere that can give you extra information about a node.
Not if your hostnames look like that. There is the potential in DNS for semantic hierarchy, though, if you choose to take advantage of it, that is not available in IP addressing.
Now what happens when you want to add information to the name? Do you go through and update all your existing records? Having DNS names just be opaque IDs that point to an entry in a DB (which can be a TXT record) is usually a lot better.
It's the only thing you can guarantee will never change. The name points to exactly that server.
heh. no, don't use IPs either. You use well-known service names and use a dedicated service discovery mechanism to reach your service nodes in a resilient and scalable manner.
oh, k8s and DNS...
Spent a lot of hours trying to debug a bug and it was "k8s DNS would eventually expose pods through DNS, but it could take 30 seconds" (or time till the pod becomes ready + 30 seconds, because coredns caches negative DNS responses).
I feel that caching all DNS responses for 30 seconds is not always the solution for all kinds of usage patterns... Ah, generic solutions are for generic problems (which are usually not your problems).
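For what it's worth, CoreDNS's `cache` plugin can cap negative answers separately from positive ones, so the 30-second default isn't forced on you. A Corefile fragment along these lines (capacities and TTLs are illustrative):

```
.:53 {
    kubernetes cluster.local in-addr.arpa ip6.arpa
    cache 30 {
        denial 9984 5   # keep negative (NXDOMAIN) answers at most 5 s instead of 30
    }
    forward . /etc/resolv.conf
}
```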
I'm not sure if it was a prank or mistake, but someone recently set up a machine for me and they fat fingered the IP on the primary DNS server, so everything "worked" but was super slow due to the primary lookup silently timing out.
I barked up the wrong tree for a while and then a more senior guy immediately found the issue. Anyways, now I grok this headline and have a new prank in my kit.
The latest thing I had with DNS is that a client and server were communicating with EDNS packet sizes greater than 4096, but an intermediate caching server couldn't handle it, and I'd get intermittent resolution failures when the intermediate server landed on one host. Fortunately I was just able to boost the size limit.
Great writeup, and having had a lot of issues with fluentd buffer overflows over the years, it absolutely tickled me that that was the main clue that led to the discovery of the issue.
It's a riff on the "It's never/always DNS" meme, pointing out the common gap between expectation and reality about when the issue you're facing is due to DNS.
So it's a double cancellation that brings it back to the original phrase? To show that they didn't take it seriously because it isn't actually true, but then they found it was true in this case, or at least it felt true if all other cases were ignored? Like it was sort of true but not completely, since another problem might have had a non-DNS issue at its source?
I feel like I see what they're saying but I'm still confused at what's getting communicated. Just "sometimes DNS can actually be a source of problems?"
It's probably really only "funny" if you're familiar with the meme. In that way, it's like many inside jokes. You can't really logic it out. It's like, when someone explains a joke to you, you can now understand why it's funny, but you can't put yourself back in that place where the joke would hit you with the intended impact. Don't worry about it.
You can "math" out the grammar. Treat it as an equality and use the "double negatives cancel" rule to flip the "not" modifying the 2 "is('s)" in that sentence and the title can be rewritten such that:
"It’s not always DNS — unless it is. "
Becomes :
"It’s always DNS — unless it isn't."
Ultimately they can both get interpreted as something like "It's DNS except if it isn't DNS." "DNS (NOT equal) (NOT DNS)" even. Not a super surprising statement.
So wording the statement either way has the same meaning; however, the way the author worded the article title matches the chronological order of the troubleshooting events (at first it seemed not to be DNS, but later it turned out it actually was).