
No way at all? FB has a lot of servers and a lot of users. They have opportunities for quality practices that few other organizations get.

Here's a simple example:

Split the user population into 365* groups. Assign each group to a day of the year. Test new code only on the group whose day has come up. Follow them until they stop having problems. Now you can deploy that to a month's worth of groups. All good? Deploy to everyone.

Yes, that means that you can't have more than 365 changes in simultaneous development. Tough.

*Yes, yes, leap years. Take a day off from deploying.
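
A minimal sketch of the scheme above, in Python. Everything named here is hypothetical: the hash-based assignment, the stage names ("canary", "month", "everyone"), and the choice to treat day 366 of a leap year as the day off are assumptions for illustration, not anything FB is known to do.

```python
import hashlib
from datetime import date
from typing import Optional, Set

NUM_GROUPS = 365  # one group per day of a non-leap year


def group_for_user(user_id: str) -> int:
    """Map a user to a stable group in 1..365 via a hash of the user id."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_GROUPS + 1


def canary_group(today: date) -> Optional[int]:
    """Group whose day has come up, or None on day 366 (no deploy)."""
    day_of_year = today.timetuple().tm_yday
    return day_of_year if day_of_year <= NUM_GROUPS else None


def groups_for_stage(stage: str, canary: int) -> Set[int]:
    """Groups that receive the new code at each rollout stage."""
    if stage == "canary":
        return {canary}
    if stage == "month":
        # 30 consecutive groups starting at the canary, wrapping past 365.
        return {(canary - 1 + i) % NUM_GROUPS + 1 for i in range(30)}
    if stage == "everyone":
        return set(range(1, NUM_GROUPS + 1))
    raise ValueError(f"unknown stage: {stage}")


def sees_new_code(user_id: str, stage: str, canary: int) -> bool:
    """True if this user should be served the new code at this stage."""
    return group_for_user(user_id) in groups_for_stage(stage, canary)
```

Hashing the user id keeps each user in the same group across releases, so "follow them until they stop having problems" always refers to the same people.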



They graph the number of incidents. Even if a bad release impacts only 0.3% of the user base, it's still an incident.

They have to investigate it, revert or fix the bad code and start the deployment process again.


Ack. Also, the system introduces a new problem: if you are deploying on a weekend, most devs are not around to help solve the problem. Thus, the outages would be longer.


That is usually how all companies at that scale release code (at least the ones that I know of).

Only it's not 365 groups, because at the size of FB that would be several million people per group.


They might already be doing A/B testing (or in your example A1/A2/.../A365). It's not clear what the definition of an "incident" is. At my workplace, even if a bad code push affects, say, 0.1% of users, it would be classified as an incident.


That's deploying on weekends, isn't it?



