Break before make, abstractions, and sleazy ISPs (rachelbythebay.com)
120 points by zdw on Oct 6, 2019 | 30 comments


> If you're using things which do macro expansion or anything else that involves you writing format A and it generating format B which actually does the work (or worse yet, gets turned into format C), you really owe it to yourself to run a 'diff' on the before-and-after versions of the output from the tool BEFORE it goes and takes any action on your behalf.

Seriously. This is a huge issue with Helm templating Kubernetes resource definitions. When you make a PR on your infrastructure repository, there can be many changes under the hood in a Helm chart that are invisible from the changes to a values file.

We had a rule at my last job that the diff of the actual rendered resource files had to be included in the PR for it to be approved, because we were bitten by exactly this.
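A sketch of how that rule can be mechanized (chart path, release name, and values file are all illustrative):

```shell
# Render the fully expanded manifests for the base branch and the PR branch,
# then diff them so the "invisible" chart changes become visible in review.
git worktree add /tmp/base origin/main          # base-branch checkout
helm template my-release /tmp/base/chart -f /tmp/base/values.yaml > /tmp/before.yaml
helm template my-release ./chart -f ./values.yaml > /tmp/after.yaml
diff -u /tmp/before.yaml /tmp/after.yaml        # paste this diff into the PR
git worktree remove /tmp/base
```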


We're probably going to be moving to a K8s thing, and this is what I'm most scared of, really. This kind of macro expansion is (to me) a huge red flag wrt. understandability, audits, etc. It's really just an indicator of a lack of composability. Yeah, you can try to add it, but that doesn't really work (see Ansible).

Ideally, I'd want something like Propellor (with static type checking) to actually have a little confidence in my changes. We're not using Propellor because it requires/installs GHC on the target machines -- I wish there was a GHC-to-bash compiler...


I recently moved to Brisbane, Australia. The apartment building I rent in was advertised as NBN (public fibre) connected.

Turns out, it’s vendor-locked to a specific ISP, iiNet, which wouldn’t be a problem if they offered static IPs, but they don’t. It’s kind of funny because another subsidiary of their parent company TPG, Internode, does offer static IPs for vendor-locked FTTB, but it wasn’t offered in the building.

It gets weird for various other reasons (a convoluted process for unblocking inbound TCP 25/80/443, static IPs).

But the larger point is that ISPs largely control the internet and frequently do scummy things. In Australia it’s quickly becoming a very select vendor market, and monopolising behaviour is occurring to the point where there are very few decent ‘neutral’ ISPs left.

Here there are laws to prevent complete monopolisation of media, but I don’t think the same exists for network comms, and I fear that this place is going to quickly end up owned by a few select companies that will effectively be able to do what they want, without intervention.

Maybe this sounds grim but I needed the rant


See this bullshit from ISPs is what something like DNS over HTTPS would help with.

It helps the average user who is using their own internet connection or public internet connections.

Probably 90% of users will never be on a corporate network where they have control over their own browser configuration.

But all the detractors are like "I lose control over my network"

Whilst seemingly not having the skills to block all public DoH providers in their network. Not even going into DPI or MITM style web security products.

DNS isn't security. It's just an address book.


> See this bullshit from ISPs is what something like DNS over HTTPS would help with.

I’ll start by saying I’ve set up DoH on my home router just for fun.

> Whilst seemingly not having the skills to block all public DoH providers in their network.

Now with my admin hat on - maintaining a list is an unnecessary burden. It’s a matter of sensibility not skill. Time is money after all.

DoT (DNS-over-TLS) probably should have won, since it uses its own port (853), which makes it easily manageable, does the same thing, and is more mature. The “but privacy” argument for DoH looks like a red herring.
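The manageability difference is easy to see in firewall terms (a sketch only; rule syntax is Linux iptables, and the resolver addresses are just well-known examples):

```shell
# DoT announces itself by port: one rule blocks it network-wide.
iptables -A FORWARD -p tcp --dport 853 -j REJECT

# DoH is ordinary HTTPS on 443, so you're reduced to maintaining a
# blocklist of every public DoH resolver you can enumerate:
iptables -A FORWARD -p tcp -d 1.1.1.1 --dport 443 -j REJECT
iptables -A FORWARD -p tcp -d 8.8.8.8 --dport 443 -j REJECT
# ...and hoping the list never goes stale.
```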


I don't get the relevance of the ISP ad page. Wouldn't it be a similar problem if any DNS server just cached the NXDOMAIN for too long? Seems to me the problem is either that the ISP's DNS server is using a higher TTL than specified, or that the user specified a higher TTL than necessary.


I think the relevance is that because the ISP is incentivized to serve their ad page instead of the correct response, they're monkeying with the proper operation of DNS and caching. Or, even more likely, the service the ISP uses to serve ads on "missing" domains is doing this on their behalf.


The whole situation is bizarre, and I'm surprised any effect was noticed at all. You had to be unlucky enough that this ISP's recursive resolver cache expired in the 1-2 seconds you were serving an NXDOMAIN. And then your NXDOMAIN TTL had to be set far enough in the future to cause a problem. One possibility is that the ISP ignores TTLs, setting its negative ones higher than the SOA settings and the others lower. I think the more likely scenario is weird caching -- either because of geopolitical boundaries or propagation issues on the service provider's side.


Before doing the switchover they might have lowered the TTL to something like 5s, which greatly increases the chance that the TTL in the resolver cache would expire during the switchover. And then the ISP probably set a longer-than-normal TTL on the record it inserted.


Some time ago I’m pretty sure TTLs weren’t respected much at all, even on legitimate requests. I dunno how it is now, though.


NXDOMAIN, unlike SERVFAIL, is cached for a TTL derived from the zone's SOA record. So yeah, it seems like this person is complaining about something that would've gone wrong anyway.
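For reference, RFC 2308 derives the negative-caching TTL from the zone's SOA record -- the smaller of the SOA's own TTL and its "minimum" field (names and values below are illustrative):

```
example.com.  3600  IN  SOA  ns1.example.com. hostmaster.example.com. (
                 2019100601  ; serial
                 7200        ; refresh
                 900         ; retry
                 1209600     ; expire
                 300 )       ; minimum: caps NXDOMAIN caching at 5 minutes here
```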


No, what he's complaining about is that the ad-laden DNS server provided by the company had a longer TTL than what was published, which led to numerous complaints. I've actually seen this happen before with Google DNS, where one of their servers would randomly choke on our DNS settings because of something obscure we had set. It took us weeks to get things fixed because it was only a very small subset, but anyone using Google DNS had intermittent problems that whole time. We've also seen local ISPs cache temporary statuses for far, far longer than the record's TTL. This is definitely something that happens.


The sleaze will slime you, and it's hard to escape it, regardless of whether you use Infrastructure as Code or do things by hand.

Even by hand, it would be easy to do destroy/create -- even when update would have been better -- and think that a few seconds of downtime won't do much harm.


This is the thing where if you mistyped an address the ISP would present a “helpful” search (usually full of ads) to you instead of letting the application deal with it appropriately?


By the way, you can have multiple A records on a domain name. So you should add the .2 address before removing the .1.

Not that this excuses the crummy ISPs
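The make-before-break sequence, sketched as zone-file states (names, addresses, and TTLs are illustrative):

```
; 1. Well before the change: lower the TTL so caches turn over quickly.
www.example.com.  60  IN  A  192.0.2.1

; 2. During the migration: publish both addresses (make before break).
www.example.com.  60  IN  A  192.0.2.1
www.example.com.  60  IN  A  192.0.2.2

; 3. Once traffic on .2 looks healthy: drop .1 and restore the TTL.
www.example.com.  3600  IN  A  192.0.2.2
```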


The article is about automating away that change to the point where you don't have control over the order of operations, or the intermediate state(s) of the system.


I see the point, but I consider this more an indictment of ISPs than complexity or IaC or more abstractions.

Even if I saw that the update was modeled as delete/insert, I probably would have okayed it.

ISPs doing shitty things... I mean, honestly, it's hard to believe that's even legal.


We use "Infrastructure as Code thing[s]" too, but no way would we use, say, Terraform to execute a change to a public-facing DNS record without confirming whether it was an update or a destroy/create operation. I'm not shaming someone who does, or did, and I love Rachel by the Bay, but there seemed to be a wee bit of snark coming through with respect to all the "magic" layers of stuff that make things work in the cloud.

I don't know if it's the old "the cloud is just someone else's computer" thing that you often hear, but honestly I wish we'd get over it. Cloud computing has been transformative, and there are lots of businesses that are able to exist because of the efficiencies derived from these platforms. I don't think there's much question that cloud deployments can be done correctly and managed well, and after all there is always someone upstream whose competence you rely on.
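One way to mechanize that confirmation (a sketch: Terraform's human-readable plan marks destroy/create pairs with "-/+" and the phrase "must be replaced", which is what the function scans for; the usage commands are illustrative):

```shell
# plan_guard: scan saved `terraform show` output and fail if any resource
# would be destroyed and recreated rather than updated in place.
plan_guard() {
    if grep -qE '^ *-/\+|must be replaced' "$1"; then
        echo "plan contains a destroy/create -- manual review required" >&2
        return 1
    fi
    echo "plan is update-only"
}

# Intended use:
#   terraform plan -out=tfplan && terraform show tfplan > plan.txt
#   plan_guard plan.txt && terraform apply tfplan
```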


Something I've learned over the past 3 years or so is that when you have Infrastructure as Code, you get Infrastructure by Coders. This is incredibly empowering and useful, but sometimes little details sneak into the system because the folks writing the code don't have any experience managing the systems that have been so neatly abstracted. Or, as could be the case here, they choose to simplify the interface by making a number of policy decisions by default... such as make-before-break when revising DNS A records.


> when you have Infrastructure as Code, you get Infrastructure by Coders

This.


Raise your hand if you've been bitten by Salt's weak typing. I have!


Remember when Verisign decided to try this on the .com/.net TLDs (15ish years ago)?

Any good stories about stuff that broke?


I feel the article should mention DNSSEC, which can guarantee the non-existence of DNS records by signing every response. Of course this relies on the end user having a DNSSEC-aware stub resolver.


It also relies on the zones being looked up actually being signed with DNSSEC, but virtually none of them are. After 25 years of standardization effort there is practically no DNSSEC deployment among popular Internet sites or in the US. The protocol is moribund; it's not worth configuring.


OpenDNS used to do this a few years ago.


It's very common to hit this on public WiFi hotspots. All DNS queries lead to some sign-in page. APIs have to detect that.
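The usual detection trick is to fetch a probe URL whose body is known in advance and compare (a sketch; the probe URL and expected token are illustrative -- real clients use endpoints like the OS vendors' connectivity-check URLs):

```shell
# If a known-content probe page comes back different, some middlebox
# (captive portal, ISP ad server) is rewriting DNS or HTTP.
check_connectivity() {
    expected="success"
    body="$1"    # in practice: body=$(curl -fsS http://probe.example.net/check.txt)
    if [ "$body" = "$expected" ]; then
        echo "online"
    else
        echo "captive portal or hijack detected"
    fi
}
```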


Typically, in my experience, WiFi portals will hijack HTTP traffic, but not DNS requests.

DNS will either be blocked until you're signed in, or actually resolve correctly even prior to login.

Otherwise, the incorrect DNS record could still be cached even after signing in.

There's even a tool for routing all traffic over DNS queries, with a specialized resolver on the other end: https://code.kryo.se/iodine/


Why does changing a DNS record take a few seconds? Shouldn't it just be some milliseconds as the packet is sent and ack'd?


> I keep asking if people do this on purpose as a job security gambit.

People are just trying to get stuff done. Come on.


What is her beef with infrastructure as code, and what is her suggested alternative?



