Except that, almost without exception, the one or two machines will eventually fall over. Ideally you can engineer your way around this ahead of time - but not always. Fundamentally relying on a few specific things (or people) will always be an existential risk to a big firm. Absolutely agree re: start small - but the problem with “scale out” is a lack of good tooling, not a fundamental philosophical one.
It is a philosophical one when you design around scaling out at a high rate. You incur significant additional complexity in many cases along with increased overhead.
It's fallacious to think that relying on "n" things, where n is large, is strictly safer than relying on 3 things: the significant complexity and overhead that come with large "n" eat into that safety margin.
For web applications (which I suspect the majority of HN readers work on) then sure, but plenty of realtime or safety critical applications are perfectly ok with three-way redundancy.
Actually, this makes a lot of sense. Reasoning about a single machine is just way simpler and keeps the full power of a modern transactional database at your fingertips. Backups keep taking longer and disaster recovery isn't as speedy anymore, but we're running on Postgres internally as well, and I'd scale that server up as far as possible - even slightly beyond the point of linear cost growth, and that's pretty huge these days - before even thinking about alternatives.
Plenty of services can deal with X hours of downtime when a single machine fails for values of X that are longer than it takes to restore to a new machine from backups.
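Quick back-of-the-envelope for that in Python, with made-up numbers - the question is just whether restoring to a new machine fits inside the downtime you can tolerate:

    # Hypothetical numbers: does restore-from-backup fit the downtime budget?
    backup_size_gb = 200            # compressed base backup
    restore_throughput_mb_s = 150   # measured copy + replay rate onto a spare box
    downtime_budget_hours = 6

    restore_hours = (backup_size_gb * 1024 / restore_throughput_mb_s) / 3600
    print(f"estimated restore: {restore_hours:.1f} h, budget: {downtime_budget_hours} h")
    # ~0.4 h here, so a plain "restore to a new machine" plan clears the budget easily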
I'd like to add that a server being down for 6 hours is still well worth it if, over the months or years of its uptime, it has saved an uncountable number of hours of computation and complexity.
Heck, even a machine like that being down for a week is usually still worth it.
I do not agree, and it is not my experience. Mind you, I've always worked in small/mid-sized businesses (50-300 employees) where basically every service has someone needing it for their daily work. Sure, they may live without it for some time, but you will make their lives more miserable.
And anyway, if you already have everything in place to completely rebuild every SPOF machine from scratch in a few hours, go the extra mile and make it an active/passive cluster, even a manually switched one, and bring the downtime down to minutes.
A small amount of work over a long period of time (i.e. setting up a redundant system) may be worse than losing a large amount of work in a short period of time.
Single machines just don't fail that often. I managed a database server for an internal tool and the machine failed once in about 10 years. It was commodity hardware, so I just restored the backups to a spare workstation and it was back up in less than 2 hours. 15 people used this service, and they could still get some work done without it, so less than 30 person-hours of productivity were lost. If I had spent 30 hours getting failover &c. working for this system over that 10-year period, it would have cost the company more hours than the failure did.
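Same arithmetic in Python, with the numbers from my case (back-of-the-envelope, obviously):

    users = 15               # people relying on the internal tool
    outage_hours = 2         # time to restore backups onto a spare workstation
    failures = 1             # observed over ~10 years

    lost_person_hours = users * outage_hours * failures   # = 30, and that's an upper bound
    print(lost_person_hours)
    # Any failover tooling costing more than ~30 hours of admin time over the
    # same decade is a net loss for the company.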
Totally. We could replace our GPU stack with who knows how many CPUs to hit the same 20ms SLAs, and we'll just pretend data transfer overhead doesn't exist ;-)
More seriously, we're adding multi-node stuff for isolation and multi-GPU for performance. Both are quite different... and useful!
The solution is often a warm standby that can take over immediately. During normal operation you avoid the distributed overhead of a fully load-balanced system, and you only pay a small price in the very exceptional failure case.
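A minimal sketch of that idea for Postgres, assuming streaming replication is already set up and psycopg2 is installed - the hostnames and thresholds are made up, and a real deployment would fence the old primary before promoting:

    import time
    import psycopg2

    PRIMARY_DSN = "host=db-primary dbname=app user=monitor connect_timeout=3"
    STANDBY_DSN = "host=db-standby dbname=app user=monitor connect_timeout=3"

    def primary_alive() -> bool:
        try:
            with psycopg2.connect(PRIMARY_DSN) as conn, conn.cursor() as cur:
                cur.execute("SELECT 1")
                return True
        except psycopg2.OperationalError:
            return False

    misses = 0
    while True:
        misses = 0 if primary_alive() else misses + 1
        if misses >= 3:  # a few consecutive misses, not a single blip
            with psycopg2.connect(STANDBY_DSN) as conn, conn.cursor() as cur:
                cur.execute("SELECT pg_promote()")  # PostgreSQL 12+
            break
        time.sleep(10)

The standby just replays WAL and sits idle until the rare failure, which is the whole appeal compared to paying coordination costs on every request.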