> If your app is down, it's only because you don't care about the availabili...

NyxWulf · on June 30, 2012

If you operate a service where uptime is critical, two of everything isn't enough. You have to ensure they will be separated geographically and logically so that when things like this happen, you don't lose your systems.

We operate systems that sit on the pages of the top e-commerce companies in the world. We have 10 separate segments of clusters, operating in four AZs in East, three AZs in West-1, and two AZs in West-2. When this outage happened, the servers that were impacted in East were removed from our DNS, and within 9 minutes the impact of this event on their sites was eliminated.
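To make the DNS removal step concrete, here is a minimal Python sketch of choosing which addresses stay published once some zones are marked unhealthy. The zone names, the documentation-range IP addresses, and the `healthy_records` helper are all hypothetical, and the fallback when every zone looks down is an assumption rather than anything described in the comment.

```python
from typing import Dict, List


def healthy_records(endpoints: Dict[str, List[str]],
                    healthy: Dict[str, bool]) -> List[str]:
    """Return the IPs to keep in DNS: those in zones currently marked healthy.

    `endpoints` maps a zone name to its server IPs; `healthy` maps a zone
    name to the latest health verdict. Unknown zones count as unhealthy.
    """
    records = [
        ip
        for zone, ips in endpoints.items()
        if healthy.get(zone, False)
        for ip in ips
    ]
    # Assumption: if every zone looks unhealthy, the monitoring is more
    # likely wrong than the whole fleet, so publish everything rather than
    # an empty record set.
    if not records:
        records = [ip for ips in endpoints.values() for ip in ips]
    return sorted(records)


if __name__ == "__main__":
    endpoints = {
        "east-1a": ["192.0.2.10"],
        "west-1a": ["198.51.100.10"],
        "west-2a": ["203.0.113.10"],
    }
    status = {"east-1a": False, "west-1a": True, "west-2a": True}
    print(healthy_records(endpoints, status))
```

How quickly clients actually stop hitting the removed addresses still depends on record TTLs and resolver behavior, which this sketch does not model.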

paulsutter · on June 30, 2012

Thanks for the most constructive post in this whole discussion. Hacker News is a much better place to talk about solutions than just whining.

At Quantcast we have physical servers in 14 cities. We use anycast to achieve site failovers in 6 seconds. Downtime for us would impact millions of websites, so we don't have downtime.

NyxWulf · on June 30, 2012

Our next step is automating our failover. We currently use DNS Made Easy, and while you can do that with them, it's tougher. We are switching to Dyn, and their Layer 2 failure detection will put our failover in that 6 second range.

rdl · on July 1, 2012

DNS-based failover in practice will not give you 6 second failover, unlike BGP. Too much of DNS is broken and out of your control.

(The trend for regular ISPs is probably improving, except that mobile/carrier DNS is often particularly broken. It would be interesting to do monthly surveys of this.)

donavanm · on June 30, 2012

Can you clarify? I assume you mean that you withdraw announcements and you see BGP convergence in about 6 seconds?

paulsutter · on June 30, 2012

Exactly. We announce different subnets for each continent or region, and all the sites within that region announce the same subnet. When one site ceases announcing the shared subnet, BGP usually converges in just more than 5 seconds. I was astonished the first time I saw it. It really makes you appreciate the solid engineering behind the core routers.

It's nontrivial to determine exactly when to drop the announcement. And be careful, because if you are too eager to drop the announcement, you may do it in more than one site at a time.
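One way to picture the "too eager" problem is the Python sketch below. It assumes a single coordinator that sees health checks for every site (real deployments may not have one), and the `WithdrawPolicy` class, its thresholds, and the site names are hypothetical. A site is withdrawn only after several consecutive failed checks, and never if that would leave fewer than a minimum number of sites announcing.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class WithdrawPolicy:
    fail_threshold: int = 3   # consecutive failed checks before withdrawing
    min_announcing: int = 1   # never drop below this many announcing sites
    failures: Dict[str, int] = field(default_factory=dict)
    withdrawn: Set[str] = field(default_factory=set)

    def observe(self, site: str, check_ok: bool, all_sites: Set[str]) -> bool:
        """Record one health check; return True if the site should announce."""
        if check_ok:
            # A passing check resets the counter and restores the announcement.
            self.failures[site] = 0
            self.withdrawn.discard(site)
            return True
        self.failures[site] = self.failures.get(site, 0) + 1
        if site in self.withdrawn:
            return False
        announcing = len(all_sites - self.withdrawn)
        # Withdraw only when the failure streak is long enough AND enough
        # other sites would remain to carry the shared subnet.
        if (self.failures[site] >= self.fail_threshold
                and announcing > self.min_announcing):
            self.withdrawn.add(site)
            return False
        return True


if __name__ == "__main__":
    sites = {"site-a", "site-b"}
    policy = WithdrawPolicy(fail_threshold=2)
    for site in ["site-a", "site-a", "site-b", "site-b"]:
        print(site, policy.observe(site, False, sites))
```

The guard trades correctness at one site for availability of the region: a failing site keeps announcing if it is the last one left, on the assumption that degraded service beats an unreachable subnet.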

At first we used DNS with short timeouts, but those timeouts are only advisory and are ignored by some implementations. We would see most traffic tail off within 10 minutes for a one minute timeout, but it took many hours for all the traffic to migrate over to the new DNS. The folklore on using less than one minute for a DNS timeout is that a huge percentage of implementations ignore sub-minute timeouts. Funny how much of the Internet's operation is passed along as folklore and not really known for sure.
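The long tail described above can be illustrated with a toy Python model in which each resolver enforces its own minimum cache time regardless of the advertised TTL. The floor values are invented for illustration, not measurements, and the model assumes the worst case where every resolver refreshed its cache just before the change.

```python
from typing import Sequence


def effective_ttl(advertised_ttl: int, resolver_floor: int) -> int:
    """TTL a resolver actually honors if it clamps to its own minimum."""
    return max(advertised_ttl, resolver_floor)


def fraction_migrated(advertised_ttl: int,
                      resolver_floors: Sequence[int],
                      elapsed: int) -> float:
    """Share of resolvers that have re-queried `elapsed` seconds after a change.

    Worst case: each resolver cached the old record right before the change,
    so it moves over only once its effective TTL has fully expired.
    """
    migrated = sum(
        1 for floor in resolver_floors
        if effective_ttl(advertised_ttl, floor) <= elapsed
    )
    return migrated / len(resolver_floors)


if __name__ == "__main__":
    # Hypothetical mix: two compliant resolvers, the rest clamp to 1 minute,
    # 5 minutes, and 1 hour respectively.
    floors = [0, 0, 60, 300, 3600]
    for seconds in (60, 600, 3600):
        print(seconds, fraction_migrated(30, floors, seconds))
```

Under these assumptions, lowering the advertised TTL below a resolver's floor changes nothing for that resolver, which is consistent with the folklore about sub-minute timeouts.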

Thanks for asking. Hacker News should be about sharing best practices and making the Internet a more reliable place.

meskyanichi · on June 30, 2012

^ This. You'll do it if you care enough or if it's absolutely required. Everyone knows in advance that this kind of shit can and will happen. Also, if people don't plan on hosting in multiple areas/AZs, I wonder why the hell anyone would even consider overpriced cloud technology when you can get practically 5 powerful dedicated servers for the same price, each more performant than a shitty VM on EC2. That said, if you have 5 dedicated servers, why even bother with EC2? It's more expensive in every way. "Pay more as you grow" is actually extremely expensive for what you actually get. Infinite scalability? Please. When people think about scalability, they think about adding a few gigs of RAM to their VM with a little more I/O throughput (e.g. migrating from VM1 to VM2) for hundreds of dollars, and it's still shitty VM performance compared to raw metal. Instead, why not spend that money on a few good dedicated boxes with 96-128 GB+ of RAM and a bunch of true (not virtual) CPU cores? Then you're done for a while, and for the same price. Hardware is dirt cheap these days.

The only useful/sane use case I can see for Amazon EC2 would be services like Heroku, where they need to be able to automatically manage a truckload of VMs as their rapidly growing infrastructure, unless you want to do it yourself, which I imagine is quite a headache unless you work closely with someone like Amazon or Rackspace.

donavanm · on June 30, 2012

The "scalability" thing isn't about adding some RAM or getting a larger proc. It's about adding a few dozen (or hundred) instances in minutes. Or getting hosts turned up in 7 different regions. Anyone can do that right now with AWS; let me know how your Equinox negotiations go for the next month.

Yes white boxes are cheap. Site negotiations, design, procurement, networking, operations, and maintenance are expensive in dollars and time. Personally I run "a bunch" of physical sites across the globe. It would be waaaay easier to be able to turn up rackspace/aws/google instances as needed.

meskyanichi · on July 1, 2012

> The "scalability" thing isn't about adding some ram or getting a larger proc.

You'd be surprised how many people that actually use EC2 think it is.

> Yes white boxes are cheap. Site negotiations, design, procurement, networking, operations, and maintenance are expensive in dollars and time.

It's called planning ahead of time. If not, then here's a suggestion: Use EC2 until you set it up and migrate, if you cannot wait that is.

All in all, I don't mind whether people use EC2 for whatever reason. Just stating my opinion. I agree, of course, that in terms of "convenience" it has the upper hand. Not having to wait for boxes to be added to data centers, being able to spin up boxes in multiple regions through a single company/console. Maybe your use case does justify using EC2. Many other people's clearly do not (hence all the whining because of all the downtime, which they wouldn't have had if they deployed to multiple AZs/Regions).

dsl · on June 30, 2012

> let me know how your Equinox negotiations go for the next month.

How do cloud services compare to a gym membership? Are you implying you can't get out of your AWS contract?

donavanm · on July 1, 2012

Sigh, I blame auto correct. See https://en.wikipedia.org/wiki/Equinix

lallysingh · on June 30, 2012

If you want reasonable uptime, you don't have any single points of failure. So 2 of everything.
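As a back-of-the-envelope check on "2 of everything", the Python sketch below computes the availability of N redundant copies under the strong assumption that they fail independently. The 99% figure is illustrative, not taken from the thread, and the `redundant_availability` helper is hypothetical.

```python
def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components, each up `single` of the time.

    Assumes failures are fully independent: the system is down only when
    every copy is down at once, so unavailability is (1 - single) ** copies.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, redundant_availability(0.99, n))
```

The independence assumption is exactly what breaks when both copies share a data center or availability zone, since a single event can then take out both at once.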