It's amazing how often finding the obvious cause of a problem only mitigates it, and you end up having to solve it 2 or 3 more times in the following weeks.
In this case, there were NUMEROUS suboptimal or outright broken DNS configurations, but none of them mattered until the volume reached a tipping point, and, suddenly, ALL of them came into play. Fixing one overflowed into the next, which overflowed into the next.
The real lesson of this incident is how to handle incidents effectively. The author cites practices that I use every time I manage one:
* Centralize - 1 tracking doc that describes the issue, timeline, what's been tested, who owns the incident. Have 1 group chat, 1 'team' (virtual or in person). Get an incident commander to drive the group.
* Create a list of hypotheses and work through them one at a time.
* Use data, not assumptions to prove or disprove your hypotheses.
* Gather as much data as you can, but don't let a particular suspicious graph lead you into a rabbit hole. Keep gathering data.
If you don't do the above, you are guaranteed to have a mess, to repeat yourself over and over, and to waste time.
I don't know of a non-Kubernetes situation off the top of my head where this would be an issue, but I definitely learned some new things about DNS resolution on Linux by reading the article, and so I'll think to look for similar scenarios in the future.
You generally only need to do a DNS lookup at app startup, and then you're done. So node-local DNS is overhead and complexity you rarely need. In this case it was present, and I wonder whether that extra complexity was one of the reasons for the outage.
Sounds more like a kubernetes problem than a dns problem.
I hate coredns. Everything running inside of a kubernetes cluster should just be querying the kubernetes endpoints API for these IPs directly and using the node's DNS servers for external hosts.
If I restart my DB, will the database service host env var also be updated? Does restarting a DB or changing its IP also imply a restart of all of the services that need access to it?
One red flag that stood out for me is where the blog says the team considered all apps to be the same and hadn’t looked at any of their logs, only infrastructure stuff.
When they looked they saw all apps were not the same, and it was only a few kinds of apps that were affected.
When a big incident hits, you need people drilling down not just across; and hopefully people who know the actual apps in question.
Maybe this was DevOps people too far into the ops side and not as much on the dev?
Had a similar problem at work a while ago. One service was unable to connect to another occasionally. The Splunk logs said it was a TLS connection problem. After an unsuccessful attempt at reproducing the problem locally, it eventually dawned on me it might be Kubernetes DNS. And by changing temporarily to not using DNS for connecting to that host, we confirmed that indeed it was Kubernetes DNS.
Did you actually query the DNS from the container to verify DNS was returning an incorrect record in response to the query? I ask because I've seen similar behavior and it turned out the service was only doing DNS lookups at startup and then cached the record indefinitely (or until restarted), regardless of the TTL on the record. Unfortunately some software and libraries don't respond well to even occasional DNS changes.
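For what it's worth, a quick way to check from inside the container is to ask the system resolver directly, which sidesteps any record the app cached in-process at startup (a minimal Python sketch; the hostname is whatever the app connects to):

```python
import socket

def current_a_records(hostname):
    """Ask the system resolver right now, bypassing any cache the
    application itself keeps in-process."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET)
    return sorted({info[4][0] for info in infos})

# Compare this against the address the app is actually connecting to
# (e.g. from ss/netstat output) to spot a stale in-process cache.
print(current_a_records("localhost"))
```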
About 15 years ago I worked with a vendor that didn’t realize their web service was ignoring TTLs. I think this was the Java 5 days. We had changed an IP on our end and they kept trying to connect to the wrong one for a webhook. It took weeks of sending tcpdump logs back and forth to convince them. They finally restarted their app.
In our case the problem is kind of the opposite, as far as we could tell.
The TTL is 2 seconds, but the app and the service always deploy together and always run on the same node as one another. So when we deploy, both the app and the service land on a new node, where both will run.
But because the TTL is so low, every new connection (traffic is pretty low for this particular app, unlike some other apps in our cluster) is pretty certain to do another DNS lookup. And about 10% of the time we were getting a connection error which boiled down to DNS.
So to confirm it was the problem, we changed it to not do a DNS lookup for now, since the app and the service currently always share a node.
But soon we are changing things around and they will no longer be guaranteed to run on same node nor will they deploy together.
So I still need to come up with something that lets us do DNS lookups but not have the problem we've been having.
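One shape that could take (purely a sketch, with an illustrative class name and a made-up 30-second window, none of which is from the thread): resolve through DNS as usual, but hold on to the last known-good address and serve it when a lookup fails, so a transient resolution error doesn't turn into a connection error:

```python
import socket
import time

class CachingResolver:
    """Resolve via DNS, but fall back to the last known-good address
    when a lookup fails (hypothetical sketch; timings are illustrative)."""

    def __init__(self, ttl=30.0):
        self.ttl = ttl      # serve cached answers this long before re-resolving
        self.cache = {}     # host -> (address, resolved_at)

    def resolve(self, host):
        addr, ts = self.cache.get(host, (None, 0.0))
        if addr is not None and time.time() - ts < self.ttl:
            return addr     # fresh enough, skip the lookup entirely
        try:
            addr = socket.getaddrinfo(host, None)[0][4][0]
            self.cache[host] = (addr, time.time())
        except socket.gaierror:
            if addr is None:
                raise       # no known-good address to fall back on
            # lookup failed: keep serving the stale address rather than erroring
        return addr
```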
This post reminds me about a similar but different K8s Turned-Out-To-Be-DNS problem we had recently. We published a write-up about it but never got around to submitting it here before now: https://hackertimes.com/item?id=38740112
> To avoid ndots issues, the most straightforward solution is to have at least five dots in our hostname. Fluent-bit is one of the biggest abusers of the DNS requests. ... As it now has five dots in the domain, it doesn’t trigger local search anymore.
But it wasn't DNS. DNS didn't break. The protocol didn't break. Not even issues with the CoreDNS or dnsmasq implementations.
The culprit was ndots (why did Kubernetes arbitrarily choose five?) and the general way that Kubernetes (ab)uses DNS.
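The ndots mechanics are roughly this (a Python sketch of the resolver's search-list expansion; the `ns` namespace in the default search list is a placeholder, the real list comes from the pod's /etc/resolv.conf):

```python
def candidate_queries(name,
                      search=("ns.svc.cluster.local",  # "ns" is a placeholder namespace
                              "svc.cluster.local",
                              "cluster.local"),
                      ndots=5):
    """Mimic resolv.conf search behavior: a name with fewer than `ndots`
    dots is tried with each search domain appended before being tried
    as-is, which multiplies the query volume."""
    if name.endswith("."):
        return [name]                             # fully qualified: one query
    expanded = [f"{name}.{d}" for d in search]
    if name.count(".") >= ndots:
        return [name] + expanded                  # enough dots: try as-is first
    return expanded + [name]                      # otherwise: search list first

# A typical external name walks the whole search list before resolving:
print(candidate_queries("logs.example.com"))
```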
I always chuckle at the "It's not DNS...it was DNS" line because in my experience, the problem is usually actually DHCP.
I'm struggling with a problem where a VM is supposed to get an IP address from the host, but it takes forever to do so. The host is telling me it has assigned an IP, but the VM says it hasn't. It can take anywhere from 10-60 minutes for the VM to actually get the IP that the host has assigned.
DNS is for friendly names; friendly to humans using web browsers. Using DNS for machine to machine communication is not essential complexity. Every chance I get I eliminate DNS from internal infrastructure and a whole lot of things get a lot better. If you naively keep forward/reverse DNS resolutions on in different parts of the stack, you end up with a shitstorm of DNS lookup requests at even a moderate scale infrastructure. Then bad things tend to happen.
DNS is more than just pretty names, it allows for a hierarchy that holds meaning. It is way more than just friendly for humans. TBH I would posit that having everything as IP literals would cause more human errors than not. You need to keep a context of all subnetting in your mind, which is not feasible in many networks.
What are you using instead? Hard coded IPs? Or have you built your own lookup service?
If you have decent TTLs dns doesn’t result in a shitstorm of lookups, nor does it require anything more powerful than a raspberry pi to respond to them
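Back-of-envelope, with made-up numbers:

```python
# Hypothetical fleet: 5,000 clients, each honoring a 60 s TTL.
clients = 5000
ttl_seconds = 60

# In steady state each client re-resolves about once per TTL window.
qps = clients / ttl_seconds
print(f"{qps:.1f} queries/sec")  # trivial load for any small box
```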
Yes, it is much easier to build and scale a general config service that can serve keys (like names in dns) and associated config snippets with versioning and expiry/go-live timestamps etc. Such a service enables us to build on top of it things like service discovery, failover, draining, cutover, weighted load-balancing etc. It is much easier to also control/orchestrate/audit changes to key-configs in globally consistent transactional manner and guarantee these changes will be instantly visible to every client or will be deterministically spread-out/staggered etc. It is also much easier to do interpolation of config variables arranged in a hierarchical class namespace. All this makes it a lot more powerful building block for large scale infra services than dns ever could and it has none of the drawbacks of dns.
> If you have decent TTLs dns doesn’t result in a shitstorm of lookups, nor does it require anything more powerful than a raspberry pi to respond to them
This applies equally to any kind of lookup service you use. It's not a distinguishing feature of DNS.
The distinguishing features of DNS are that it's a global, highly regulated, key-value storage with only eventual consistency that may take days to reach. (It has probably never been consistent in practice.) None of those features are desirable for your internal server configuration.
Opposite take: I consider IPs appearing anywhere except the dhcpd configuration and the DNS zone files (or their database equivalents) to be a bug.
IPs are opaque and meaningless. Maybe you can keep in your head that “.2 is the database, .3 is the web server, .4 is the redis, .5 is the other api, .6 is the other database”, but I can’t and wouldn’t even if I could.
It's not like appserver1234.internal is significantly less opaque and meaningless. Either way you probably want a control panel somewhere that can give you extra information about a node.
Not if your hostnames look like that. There is the potential in DNS for semantic hierarchy, though, if you choose to take advantage of it, that is not available in IP addressing.
Now what happens when you want to add information to the name? Do you go through and update all your existing records? Having DNS names just be opaque IDs that point to an entry in a DB (which can be a TXT record) is usually a lot better.
It's the only thing you can guarantee will never change. The name points to exactly that server.
heh. no, don't use IPs either. You use well-known service names and use a dedicated service discovery mechanism to reach your service nodes in a resilient and scalable manner.
oh, k8s and DNS...
Spent a lot of hours trying to debug a bug and it was "k8s DNS would eventually expose pods through DNS, but it could take 30 seconds" (or time till the pod becomes ready + 30 seconds, because coredns caches negative DNS responses).
I feel that caching all DNS responses for 30 seconds is not always the solution for all kinds of usage patterns... Ah, generic solutions are for generic problems (which are usually not your problems).
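For what it's worth, CoreDNS's `cache` plugin can cap negative answers separately from positive ones, so the 30-second default isn't forced on you. A Corefile fragment along these lines (capacities and TTLs are illustrative):

```
.:53 {
    kubernetes cluster.local in-addr.arpa ip6.arpa
    cache 30 {
        denial 9984 5   # keep negative (NXDOMAIN) answers at most 5 s instead of 30
    }
    forward . /etc/resolv.conf
}
```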
I'm not sure if it was a prank or mistake, but someone recently set up a machine for me and they fat fingered the IP on the primary DNS server, so everything "worked" but was super slow due to the primary lookup silently timing out.
I barked up the wrong tree for a while and then a more senior guy immediately found the issue. Anyways, now I grok this headline and have a new prank in my kit.
The latest thing I had with DNS is that a client and server were communicating with EDNS packet sizes greater than 4096, but an intermediate caching server couldn't handle it, and I'd get intermittent resolution failures when the intermediate server landed on one host. Fortunately I was just able to boost the size limit.
Great writeup, and having had a lot of issues with fluentd buffer overflows over the years, it absolutely tickled me that that was the main clue that led to the discovery of the issue.
It's a riff on the "It's never/always DNS" meme, pointing out the common gap between expectation and reality about when the issue you're facing is due to DNS.
So it's a double cancellation that brings it back to the original phrase? To show that they didn't take it seriously because it isn't actually true, but then they found it was true in this case, or at least it felt true if all other cases were ignored? Like it was sort of true but not completely, since another problem might have had a non-DNS issue at its source?
I feel like I see what they're saying but I'm still confused at what's getting communicated. Just "sometimes DNS can actually be a source of problems?"
It's probably really only "funny" if you're familiar with the meme. In that way, it's like many inside jokes. You can't really logic it out. It's like, when someone explains a joke to you, you can now understand why it's funny, but you can't put yourself back in that place where the joke would hit you with the intended impact. Don't worry about it.
You can "math" out the grammar. Treat it as an equality and use the "double negatives cancel" rule to flip the "not" modifying the 2 "is('s)" in that sentence and the title can be rewritten such that:
"It’s not always DNS — unless it is. "
Becomes :
"It’s always DNS — unless it isn't."
Ultimately they can both get interpreted as something like "It's DNS except if it isn't DNS." "DNS (NOT equal) (NOT DNS)" even. Not a super surprising statement.
So wording the statement either way has the same meaning; however, the way the author worded the article title matches the chronological order of the troubleshooting events (at first it seemed not to be DNS, but later it turned out it actually was).