
> Can anyone explain why status pages are so difficult.

What is an outage? When does an outage reach sufficient scale that updating the status page is the right thing to do?

I used to work for AWS, and now work for another cloud provider.

One thing that's hard to communicate is the sheer scale that these services operate at, what that means architecturally, and how they tend to break.

Outages, even just slight degradation, that occur at whole-service scale are very rare. From my experience there, I would argue that most incidents affect fewer than 10% of any given service's customers. Whether an incident gets noticed depends in part on who falls within that percentage.

What is very often the case is that a subset of customers gets impacted to some degree during any given incident. That can be as little as a single percent of customers or less, and still be an incident that has all hands on deck and the service's entire management chain aware and involved.

At what percentage do you draw the line and say "Yes, we need this percentage of our customers to be affected before we post a green-i" (AWS terminology for the first stage of failure notification)?

How do you communicate that effectively to customers, in a way that doesn't suggest your service is unreliable when it really isn't?

The moment you post a green-i or above, customers start blaming you and your service for problems with their infrastructure that are not caused by it. If you're looking to use a service and go look at the status history and see it filled with green-i or similar, are you likely to trust it? No. Even if those green-i's were for impacts on a limited subset of customers.

AWS wrestled with this a bunch about 5-6 years ago. There were no end of discussions during the weekly ops meetings with senior leadership, directors and engineers across the company. Everyone wants to do the right thing and make sure customers get an accurate picture about the health of the service, without giving the wrong impression.

In the end they opted to move towards personal notifications for outages, and to build tooling that helps services quickly identify which customers are affected by a particular incident and gives those customers personalised status pages, which can be far more accurate than any generalised status page.
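
To make the idea concrete, here is a rough sketch of what that kind of tooling does. It is not how AWS built it; every name below (Incident, Resource, the "cell" partitioning) is hypothetical and only there for illustration. The point is that once you can map an incident's blast radius onto customer resources, you can answer "is this customer affected?" individually instead of painting the whole service one colour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    service: str
    region: str
    cells: frozenset  # hypothetical: internal partitions hit by the incident


@dataclass(frozen=True)
class Resource:
    customer_id: str
    service: str
    region: str
    cell: str  # hypothetical: the partition this resource lives in


def affected_customers(incident, resources):
    """IDs of customers with at least one resource inside the blast radius."""
    return {
        r.customer_id
        for r in resources
        if r.service == incident.service
        and r.region == incident.region
        and r.cell in incident.cells
    }


def impact_fraction(incident, resources):
    """Share of the service's customers in that region touched by the incident."""
    in_scope = {
        r.customer_id
        for r in resources
        if r.service == incident.service and r.region == incident.region
    }
    if not in_scope:
        return 0.0
    return len(affected_customers(incident, resources)) / len(in_scope)


def personal_status(customer_id, incident, resources):
    """One line for a single customer's personalised status page."""
    if customer_id in affected_customers(incident, resources):
        return f"{incident.service} ({incident.region}): you are affected"
    return f"{incident.service} ({incident.region}): no impact on your resources"
```

A general status page can only show something like impact_fraction for everyone, while personal_status gives each customer a direct answer about their own resources.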



Exactly this. I work for a cloud provider and there has been a ton of push in the last year or so to develop customer communication teams and involve them at the first inkling of an outage. We can identify the subset of customers affected and contact them directly. Just publicly saying there’s an outage would cause much more chaos.


Posting percentages instead of green/red would fix all of these, no?


Not really. People will automatically assume they were in that impacted percentage and that what was happening with their stuff was entirely AWS's fault.



