It's not always DNS, unless it is (medium.com/adevinta-tech-blog)
115 points by fauria on Dec 22, 2023 | 73 comments


It's amazing how often finding the obvious cause of a problem only mitigates it, and you end up having to solve it 2 or 3 more times in the following weeks.

In this case, there were NUMEROUS suboptimal or misconfigurations of DNS but none of them mattered until the volume reached a tipping point, and, suddenly, ALL of them came into play. Fixing one overflowed into the next, which overflowed into the next.


The real learning from this incident is how to handle incidents effectively. The author cites practices that I use every time I manage one:

* Centralize - 1 tracking doc that describes the issue, timeline, what's been tested, who owns the incident. Have 1 group chat, 1 'team' (virtual or in person). Get an incident commander to drive the group.

* Create a list of hypotheses and work through them one at a time.

* Use data, not assumptions to prove or disprove your hypotheses.

* Gather as much data as you can, but don't let a particular suspicious graph lead you into a rabbit hole. Keep gathering data.

If you don't do the above you are guaranteed to have a mess, have to repeat yourself over and over and waste time.


I can’t read the blog but PagerDuty provides a good standard for handling incidents: https://response.pagerduty.com/


Making it sound like a DNS problem is clickbaity. It is a CoreDNS/kube-dns problem.

And yes, in the k8s world, DNS fails more often than you think.


I don't know of a non-Kubernetes situation off the top of my head where this would be an issue, but I definitely learned some new things about DNS resolution on Linux by reading the article, and so I'll think to look for similar scenarios in the future.


I don't know why node local dns isn't installed by default on a vanilla k8s setup. Seems like it would reduce a lot of headache.


Because you generally only need to do a DNS lookup at app startup and then you are done. So node-local DNS is overhead and complexity you rarely need. In this case, it was present, and I wonder if this level of complexity was one of the reasons for the outage.


Sounds more like a kubernetes problem than a dns problem.

I hate coredns. Everything running inside of a kubernetes cluster should just be querying the kubernetes endpoints api for these IPs directly and using the node dnsservers for external hosts.


> Everything running inside of a kubernetes cluster should just be querying the kubernetes endpoints api for these IPs directly

Wouldn't this put a huge load on the apiserver? Not to mention it's incompatible with software not designed for Kubernetes.


> querying the kubernetes endpoints api for these IPs directly

Isn't DNS built explicitly for this?

This might fix problems accidentally, at the cost of a k8s dependency.


The options for that are:

* dnsPolicy: Default

* enableServiceLinks: true (the default)

Then you can use MYDATABASENAME_SERVICE_HOST from the environment and there is no CoreDNS in the path at all.
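A minimal sketch of reading those service-link variables, assuming a hypothetical service named `my-db` (the env var names follow the Kubernetes uppercase-with-underscores convention; the values below are stand-ins for what Kubernetes would inject):

```python
import os

def service_addr(name: str) -> tuple[str, int]:
    """Find a Kubernetes service via the injected service-link
    environment variables instead of DNS. `name` is the service name;
    env vars uppercase it and turn dashes into underscores."""
    prefix = name.upper().replace("-", "_")
    host = os.environ[f"{prefix}_SERVICE_HOST"]   # the service's ClusterIP
    port = int(os.environ[f"{prefix}_SERVICE_PORT"])
    return host, port

# Stand-ins for the env vars Kubernetes would inject for a "my-db" service:
os.environ["MY_DB_SERVICE_HOST"] = "10.96.0.12"  # hypothetical ClusterIP
os.environ["MY_DB_SERVICE_PORT"] = "5432"
print(service_addr("my-db"))  # -> ('10.96.0.12', 5432)
```

Since the ClusterIP is stable for the lifetime of the Service, reading it once at startup is enough, and no resolver is ever consulted.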


If I restart my DB, will the database service host env var also be updated? Would restarting a DB or changing the IP of a DB also imply a restart of all of the services that need access to the DB?


The service's ClusterIP will not change.

Unless you want to also get rid of kube-proxy, in addition to CoreDNS; in that case you don't get ClusterIPs for services.

To be honest kube-proxy and CoreDNS are probably the only components I haven't had problems with on my cluster.


One red flag that stood out for me is where the blog says the team considered all apps to be the same and hadn’t looked at any of their logs, only infrastructure stuff.

When they looked they saw all apps were not the same, and it was only a few kinds of apps that were affected.

When a big incident hits, you need people drilling down not just across; and hopefully people who know the actual apps in question.

Maybe this was DevOps people too far into the ops side and not as much on the dev?


Had a similar problem at work a while ago. One service was unable to connect to another occasionally. The Splunk logs said it was a TLS connection problem. After an unsuccessful attempt at reproducing the problem locally, it eventually dawned on me it might be Kubernetes DNS. And by changing temporarily to not using DNS for connecting to that host, we confirmed that indeed it was Kubernetes DNS.


Did you actually query the DNS from the container to verify DNS was returning an incorrect record in response to the query? I ask because I've seen similar behavior and it turned out the service was only doing DNS lookups at startup and then cached the record indefinitely (or until restarted), regardless of the TTL on the record. Unfortunately some software and libraries don't respond well to even occasional DNS changes.


About 15 years ago I worked with a vendor that didn’t realize their web service was ignoring TTLs. I think this was the Java 5 days. We had changed an IP on our end and they kept trying to connect to the wrong one for a webhook. It took weeks of sending tcpdump logs back and forth to convince them. They finally restarted their app.


In our case the problem is kind of the opposite, as far as we could tell.

The TTL is 2 seconds, and the app and the service always deploy together and always run on the same node as one another. So when we deploy, both the app and the service land on a new node, where both will run.

But because the TTL is so low, every new connection (traffic is pretty low for this particular app, unlike some other apps in our cluster) is pretty certain to do another DNS lookup. And about 10% of the time we were getting a connection error which boiled down to DNS.

So to confirm it was the problem we changed it to not do DNS lookup for now, since it’s as of now always same node for app and service.

But soon we are changing things around and they will no longer be guaranteed to run on same node nor will they deploy together.

So I still need to come up with something that lets us do DNS lookups but not have the problem we’ve been having.
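One hedged option (a sketch, not the commenter's actual fix): wrap lookups in a small client-side cache that enforces a floor TTL and falls back to a stale answer on transient failures, so the 2-second server-side TTL no longer forces a lookup per connection. `fake_resolve` here is a stand-in for a real resolver call:

```python
import time

class CachingResolver:
    """Client-side DNS cache with a minimum TTL and stale-on-error
    fallback. `resolve` is any callable mapping name -> address."""
    def __init__(self, resolve, min_ttl=30.0):
        self._resolve = resolve
        self._min_ttl = min_ttl
        self._cache = {}  # name -> (address, expiry)

    def lookup(self, name):
        cached = self._cache.get(name)
        now = time.monotonic()
        if cached and cached[1] > now:
            return cached[0]          # fresh enough: skip the lookup
        try:
            addr = self._resolve(name)
        except OSError:
            if cached:
                return cached[0]      # transient failure: serve stale
            raise
        self._cache[name] = (addr, now + self._min_ttl)
        return addr

calls = []
def fake_resolve(name):              # stand-in for a real DNS query
    calls.append(name)
    return "10.0.0.5"

r = CachingResolver(fake_resolve, min_ttl=30.0)
print(r.lookup("svc.example"), r.lookup("svc.example"), len(calls))
# -> 10.0.0.5 10.0.0.5 1  (second lookup served from cache)
```

The trade-off is that the app may use an address for up to `min_ttl` seconds after the record changes, which is fine if connections retry on failure.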


ugh what a terrible bug especially in the cloud age.


This post reminds me about a similar but different K8s Turned-Out-To-Be-DNS problem we had recently. We published a write-up about it but never got around to submitting it here before now: https://hackertimes.com/item?id=38740112


> To avoid ndots issues, the most straightforward solution is to have at least five dots in our hostname. Fluent-bit is one of the biggest abusers of the DNS requests. ... As it now has five dots in the domain, it doesn’t trigger local search anymore.

But it wasn't DNS. DNS didn't break. The protocol didn't break. Not even issues with the CoreDNS or dnsmasq implementations.

The culprit was ndots (why did Kubernetes arbitrarily choose five dots?) and the general way that Kubernetes (ab)uses DNS.
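For illustration, a rough Python model of the resolv.conf search-list behavior being described (simplified: real resolvers read `ndots` and `search` from resolv.conf, and each candidate is typically queried twice, once for A and once for AAAA):

```python
def candidate_queries(name, search_domains, ndots=5):
    """Mimic how a resolver with `ndots:5` expands a name: if the name
    has fewer than `ndots` dots, every search domain is tried first and
    the name is queried as-is only as a last resort."""
    if name.endswith("."):           # fully qualified: no search list
        return [name]
    expanded = [f"{name}.{d}" for d in search_domains]
    as_is = [name]
    if name.count(".") >= ndots:
        return as_is + expanded      # "enough" dots: try as-is first
    return expanded + as_is

# A typical Kubernetes pod search list:
search = ["default.svc.cluster.local", "svc.cluster.local",
          "cluster.local"]
# An external name with only 2 dots triggers three doomed cluster
# lookups before the real one:
print(candidate_queries("api.example.com", search))
```

This is why adding dots (or a trailing dot) to a frequently resolved external hostname sidesteps the search-list expansion entirely.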



Wow that link blows safari on iOS up real hard.


I had to remove the fragment from the URL to load the page.


I always chuckle at the "It's not DNS...it was DNS" line because in my experience, the problem is usually actually DHCP.

I'm struggling with a problem where a VM is supposed to get an IP address from the host, but it takes forever to do so. The host is telling me it has assigned an IP, but the VM says it hasn't. It can take anywhere from 10-60 minutes for the VM to actually get the IP that the host has assigned.


Does the VM get the address via DHCP, or something else?

I'd first answer that, then answer how the VM is supposed to get configured, then figure out how to break down/instrument the steps along the way.


VM gets the address via DHCP, but the host is supposed to assign it, not some external DHCP server.


>> 60

And the default lease time is 60 minutes?


DNS is for friendly names; friendly to humans using web browsers. Using DNS for machine to machine communication is not essential complexity. Every chance I get I eliminate DNS from internal infrastructure and a whole lot of things get a lot better. If you naively keep forward/reverse DNS resolutions on in different parts of the stack, you end up with a shitstorm of DNS lookup requests at even a moderate scale infrastructure. Then bad things tend to happen.


DNS is more than just pretty names, it allows for a hierarchy that holds meaning. It is way more than just friendly for humans. TBH I would posit that having everything as IP literals would cause more human errors than not. You need to keep a context of all subnetting in your mind, which is not feasible in many networks.


Yeah, how exactly are things like TLS supposed to work without using DNS?


What are you using instead? Hard coded IPs? Or have you built your own lookup service?

If you have decent TTLs dns doesn’t result in a shitstorm of lookups, nor does it require anything more powerful than a raspberry pi to respond to them


Yes, it is much easier to build and scale a general config service that can serve keys (like names in dns) and associated config snippets with versioning and expiry/go-live timestamps etc. Such a service enables us to build on top of it things like service discovery, failover, draining, cutover, weighted load-balancing etc. It is much easier to also control/orchestrate/audit changes to key-configs in globally consistent transactional manner and guarantee these changes will be instantly visible to every client or will be deterministically spread-out/staggered etc. It is also much easier to do interpolation of config variables arranged in a hierarchical class namespace. All this makes it a lot more powerful building block for large scale infra services than dns ever could and it has none of the drawbacks of dns.


Now we just need a kubernetes cluster with coredns to deploy eureka!

    http://eureka-0.eureka.default.svc.cluster.local:eureka:8761/eureka
    http://eureka-1.eureka.default.svc.cluster.local:eureka:8761/eureka
    http://eureka-2.eureka.default.svc.cluster.local:eureka:8761/eureka


> If you have decent TTLs dns doesn’t result in a shitstorm of lookups, nor does it require anything more powerful than a raspberry pi to respond to them

This applies equally to any kind of lookup service you use. It's not a distinguishing feature of DNS.

The distinguishing features of DNS are that it's a global, highly regulated, key-value store with only eventual consistency that may take days to settle. (It has probably never been consistent in practice.) None of those features are desirable for your internal server configuration.


/etc/hosts ?


Opposite take: I consider IPs appearing anywhere except the dhcpd configuration and the DNS zone files (or their database equivalents) to be a bug.

IPs are opaque and meaningless. Maybe you can keep in your head that “.2 is the database, .3 is the web server, .4 is the redis, .5 is the other api, .6 is the other database”, but I can’t and wouldn’t even if I could.

DNS is rarely the problem.


>IPs are opaque and meaningless.

Get a sheet of paper. Draw a line down the middle. Put the IP in one column. Put the thing on it on the other side. Tape to wall, or put in binder.

>...but I can’t and wouldn’t even if I could.

>DNS is rarely the problem.

People that can't even be arsed to remember where their bits are on the other hand...


It's not like appserver1234.internal is significantly less opaque and meaningless. Either way you probably want a control panel somewhere that can give you extra information about a node.


Not if your hostnames look like that. There is the potential in DNS for semantic hierarchy, though, if you choose to take advantage of it, that is not available in IP addressing.


db-server-4x-16g.cluster1.fqdn

contains a lot of information outside of an external lookup.


this doesn't scale – what happens when you want to add another piece of information? how do you change the schema for all the names?


Cnames, reverse records, TXT records, the sky is the limit


No, what happens when you want to add information to the name? Do you go through and update all your existing records? Having DNS names just be opaque IDs that point to an entry in a db (which can be a TXT record) is usually a lot better.

It's the only thing you can guarantee will never change. The name points to exactly that server.


DNS offers TXT records...


What happens when “db-server-4x-16g.cluster1.fqdn” stops hosting a database?

I’m all for DNS instead of IP’s, but we need to stop encoding too much information into names…


Delete it.


This is the funniest convo I've heard about DNS in a long time.

Makes me wonder what GP does with the plate when he finishes his dinner.


Pets = ceramic plate

Cattle = paper plate

I'll let the reader make the connections


heh. no, don't use IPs either. You use well-known service names and use a dedicated service discovery mechanism to reach your service nodes in a resilient and scalable manner.


Some kind of name service could return various values for your different domains. Sounds like a great idea.


You joke but Consul is usually better at being DNS than actual DNS for a lot of use-cases.


DNS allows for graceful failover and balancing, using standard and platform-independent tools.

(You don't want L3 failover, trust me when I say it won't work as you expect.)


You’re thinking of FNS


It's always DNS. And if it's not DNS, it's certificates.

99.9% of the time.


This afternoon I fixed up a certificate by fixing DNS.


oh, k8s and DNS... Spent a lot of hours trying to debug a bug, and it was "k8s DNS would eventually expose pods through DNS, but it could take 30 seconds" (or time till pod becomes ready + 30 seconds, because coredns caches negative DNS responses).

I am feeling that caching all DNS responses for 30 seconds is not always the solution for all kinds of usage patterns... Ah, generic solutions are for generic problems (which are usually not your problems).
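If the negative caching is the pain point, CoreDNS's `cache` plugin does allow the NXDOMAIN TTL to be tuned separately from positive answers. A sketch of a Corefile fragment (the capacities and TTLs are illustrative, not recommendations):

```
cache 30 {
    success 9984 30
    denial  9984 5    # keep NXDOMAIN answers for only 5s, not 30s
}
```

Shortening only the `denial` TTL keeps the load-reducing benefit of positive caching while letting newly ready pods become resolvable much sooner.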


I'm not sure if it was a prank or mistake, but someone recently set up a machine for me and they fat fingered the IP on the primary DNS server, so everything "worked" but was super slow due to the primary lookup silently timing out.

I barked up the wrong tree for a while and then a more senior guy immediately found the issue. Anyways, now I grok this headline and have a new prank in my kit.


The latest thing I had with DNS is that a client and server were communicating with EDNS packet sizes greater than 4096, but an intermediate caching server couldn't handle it, and I'd get intermittent resolution failures when the intermediate server landed on one host. Fortunately I was just able to boost it.


I’ve come to check DNS first nowadays. It’s the equivalent of checking if it’s plugged in at this point for me.


Do you have node local DNS setup? https://kubernetes.io/docs/tasks/administer-cluster/nodeloca...

Might have been a quicker, easier "fix".


This reminds me of an experience from two decades ago. https://0xfe.blogspot.com/2023/12/the-firewall-guy.html

It's not always the Firewall -- unless it is :-)


Great writeup, and having had a lot of issues with fluentd buffer overflows over the years, it absolutely tickled me that that was the main clue that led to the discovery of the issue.


Anyone else have difficulty parsing that headline and making sense of it?


It's a riff on the "It's never/always DNS" meme, pointing out the common inconsistency between reality and expectations of when the issue you are facing is due to DNS.

https://www.cyberciti.biz/humour/a-haiku-about-dns/


So it's a double cancellation that brings it back to the original phrase? To show that they didn't take it seriously because it isn't actually true, but then they found it was true in this case, or at least it felt true if all other cases were ignored? Like it was sorta true but not completely, since another problem might have had a non-DNS issue at its source?

I feel like I see what they're saying but I'm still confused at what's getting communicated. Just "sometimes DNS can actually be a source of problems?"


From many years ago on the #osspodcast

Episode 184 – It’s DNS. It’s always DNS

https://opensourcesecurity.io/2020/02/24/episode-184-its-dns...

If you look at a lot of outages and incidents, DNS is a common problem.


So a better headline might have been "It is, indeed, always DNS?"


Except that would be factually incorrect.

The title of the article doesn't convey what I would call useful information, but at least it checks out.


The article begins by explaining exactly that.


I read the article but the title is still difficult for me grammatically.


It's probably really only "funny" if you're familiar with the meme. In that way, it's like many inside jokes. You can't really logic it out. It's like, when someone explains a joke to you, you can now understand why it's funny, but you can't put yourself back in that place where the joke would hit you with the intended impact. Don't worry about it.


You can "math" out the grammar. Treat it as an equality and use the "double negatives cancel" rule to flip the "not" modifying the 2 "is('s)" in that sentence and the title can be rewritten such that:

"It’s not always DNS — unless it is. "

Becomes :

"It’s always DNS — unless it isn't."

Ultimately they can both get interpreted as something like "It's DNS except if it isn't DNS." "DNS (NOT equal) (NOT DNS)" even. Not a super surprising statement.

So wording that statement either way has the same meaning; however, the way the author worded it for the article title matches the chronological order of troubleshooting events (at first it seemed to not be DNS, but later it turned out it actually was).



