Except that, almost without exception, the one or two machines will eventually fall over. Ideally you can engineer your way around this ahead of time - but not always. Fundamentally relying on a few specific things (or people) will always be an existential risk to a big firm. Absolutely agree re: start small - but the problem with “scale out” is a lack of good tooling, not a fundamental philosophical one.
It is a philosophical one when you design around scaling out at a high rate. You incur significant additional complexity in many cases along with increased overhead.
It's fallacious to think that relying on "n" things, where n is large, is strictly safer than relying on 3 things: the significant complexity and overhead that come with large "n" eat into that safety margin.
For web applications (which I suspect the majority of HN readers work on) then sure, but plenty of realtime or safety critical applications are perfectly ok with three-way redundancy.
Actually, this makes a lot of sense. Reasoning about a single machine is just way simpler and keeps the full power of a modern transactional database at your fingertips. Backups keep taking longer and disaster recovery isn't as speedy anymore, but we're running on Postgres internally as well, and I'd scale that server up as far as possible - even slightly beyond the point of linear cost growth, and that's pretty huge these days - before even thinking about alternatives.
Plenty of services can deal with X hours of downtime when a single machine fails for values of X that are longer than it takes to restore to a new machine from backups.
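Quick back-of-the-envelope for that in Python, with made-up numbers - the question is just whether restoring to a new machine fits inside the downtime you can tolerate:

    # Hypothetical numbers: does restore-from-backup fit the downtime budget?
    backup_size_gb = 200            # compressed base backup
    restore_throughput_mb_s = 150   # measured copy + replay rate onto a spare box
    downtime_budget_hours = 6

    restore_hours = (backup_size_gb * 1024 / restore_throughput_mb_s) / 3600
    print(f"estimated restore: {restore_hours:.1f} h, budget: {downtime_budget_hours} h")
    # ~0.4 h here, so a plain "restore to a new machine" plan clears the budget easily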
I'd like to add that a server being down for 6 hours is still well worth it if, over the months or years of its uptime, it has saved an uncountable number of hours of computation and complexity.
Heck, even a machine like that being down for a week is usually still worth it.
I do not agree, and it is not my experience. Mind you, I've always worked in small/mid-sized businesses (50-300 employees) where basically every service has someone needing it for their daily work. Sure, they may live without it for some time, but you will make their lives more miserable.
And anyway, if you already have everything in place to completely rebuild every SPOF machine from scratch in a few hours, go the extra mile and make it an active/passive cluster, even a manually switched one, and bring the downtime down to minutes.
A small amount of work over a long period of time (i.e. setting up a redundant system) may be worse than losing a large amount of work in a short period of time.
Single machines just don't fail that often. I managed a database server for an internal tool and the machine failed once in about 10 years. It was commodity hardware, so I just restored the backups to a spare workstation and it was back up in less than 2 hours. 15 people used this service, and they could still get some work done without it, so less than 30 person-hours of productivity were lost. If I had spent 30 hours getting failover &c. working for this system over that 10-year period, it would have cost the company more hours than the failure did.
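Same arithmetic in Python, with the numbers from my case (back-of-the-envelope, obviously):

    users = 15               # people relying on the internal tool
    outage_hours = 2         # time to restore backups onto a spare workstation
    failures = 1             # observed over ~10 years

    lost_person_hours = users * outage_hours * failures   # = 30, and that's an upper bound
    print(lost_person_hours)
    # Any failover tooling costing more than ~30 hours of admin time over the
    # same decade is a net loss for the company.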
Totally. We could replace our GPU stack with who knows how many CPUs to hit the same 20ms SLAs, and we'll just pretend data transfer overhead doesn't exist ;-)
More seriously, we're adding multi-node stuff for isolation and multi-GPU for performance. Both are quite different... and useful!
The solution is often a warm standby that can take over immediately. During normal operation you avoid the distributed overhead of a fully load-balanced system, and you only pay a small price in the very exceptional failure case.
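A minimal sketch of that idea for Postgres, assuming streaming replication is already set up and psycopg2 is installed - the hostnames and thresholds are made up, and a real deployment would fence the old primary before promoting:

    import time
    import psycopg2

    PRIMARY_DSN = "host=db-primary dbname=app user=monitor connect_timeout=3"
    STANDBY_DSN = "host=db-standby dbname=app user=monitor connect_timeout=3"

    def primary_alive() -> bool:
        try:
            with psycopg2.connect(PRIMARY_DSN) as conn, conn.cursor() as cur:
                cur.execute("SELECT 1")
                return True
        except psycopg2.OperationalError:
            return False

    misses = 0
    while True:
        misses = 0 if primary_alive() else misses + 1
        if misses >= 3:  # a few consecutive misses, not a single blip
            with psycopg2.connect(STANDBY_DSN) as conn, conn.cursor() as cur:
                cur.execute("SELECT pg_promote()")  # PostgreSQL 12+
            break
        time.sleep(10)

The standby just replays WAL and sits idle until the rare failure, which is the whole appeal compared to paying coordination costs on every request.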