
I'm not sure I agree with "Deploying in such a way that all your servers are not running the same codebase is obviously bad." I have a lot of experience in large scale systems (although this incident with 8 machines does not qualify) and I would say there is _always_ a period of transition where versions X and Y are online in production simultaneously. How can it be otherwise? You'd need scheduled downtime to do it any other way.

I think the main problem here is nobody at this company pushed back on this stupid development plan of reusing a flag for a different purpose. There's no excuse for that (or maybe there is, they had run out of fields in some fixed-width message format or something dumb like that). Also apparently the use of the flag was not tied hermetically to the binary in production; when they rolled back the binary the flag was still there but it meant something different to the old software.

The correct way to roll this type of change out is for the new input (the "flag" in this case) to be totally inert for the old version of the software, and for the new version to have a config file or command line argument that disables it. So _first_ you start sending this new feature in the input, which is meaningless and ignored by the existing software, and then you roll out the new software to maybe 1% of your fleet, and see if it works. Then roll it out to maybe 10% and leave it that way for a week. Insist that your developers have created a way to cross-check the correctness of the feature in the 10% test fleet (structured logging etc). If it looks good roll it to 100%. You now have three ways to disable it: turn it off in the input stream, turn it off in the new software with the config or argument, or roll back the software.
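A minimal sketch of the double-gated flag idea described above. All names here (the env var, the `new_flag` key, `handle_order`) are hypothetical, invented for illustration, not taken from any real trading system:

```python
import os

# Local kill switch: even if the input stream carries the new flag,
# this config/env toggle must also be on for the new code path to run.
FEATURE_ENABLED = os.environ.get("ENABLE_NEW_FEATURE", "0") == "1"

def handle_order(order: dict) -> str:
    # Old software simply ignores keys it does not understand, so
    # sending "new_flag" before the new binary is deployed is harmless.
    if order.get("new_flag") and FEATURE_ENABLED:
        return "new-path"
    return "old-path"

# With the env var unset, the new flag is inert:
print(handle_order({"new_flag": True}))  # old-path
print(handle_order({"id": 1}))           # old-path
```

The point is that the feature can then be disabled three independent ways: drop the flag from the input, unset the local toggle, or roll back the binary, and none of those combinations reuses a flag to mean something different.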

Doesn't look like these guys really knew what they were doing.



"How can it be otherwise? You'd need scheduled downtime to do it any other way."

The trading floor is only open a few hours every day, and the functionality being rolled out required the markets to be open. Furthermore, since the changes were all for new functionality, they rolled it out in stages days ahead of time (good move).


> How can it be otherwise? You'd need scheduled downtime to do it any other way.

Roll out the code in advance, and have the production machines switch to it at a defined, synchronized time?

I mean, imagine you only have one production machine. If you're willing to admit that you can have it switch from version X to version Y with no downtime, then synchronization is the only barrier to doing the same on n machines. Why would you need scheduled downtime?
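One way to sketch that synchronized switch, assuming the servers have reasonably well-synchronized clocks (e.g. via NTP) and that both versions are already staged on every machine; the names and cutover time here are purely illustrative:

```python
from datetime import datetime, timezone

# Pre-agreed cutover moment, configured on every server in advance.
CUTOVER = datetime(2024, 1, 7, 0, 0, tzinfo=timezone.utc)

def active_version(now: datetime) -> str:
    # Every server makes the same decision from the same wall clock,
    # so all n machines flip from X to Y together, with no downtime.
    return "Y" if now >= CUTOVER else "X"

print(active_version(datetime(2024, 1, 6, 23, 59, tzinfo=timezone.utc)))  # X
print(active_version(datetime(2024, 1, 7, 0, 1, tzinfo=timezone.utc)))    # Y
```

In practice clock skew means the flip is only as sharp as your time sync, which is the granularity question the reply below gets at.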


Synchronization is non-trivial, but the question is mostly how fine-grained you need the synchronization to be. E.g., if you are doing a live upgrade using e.g. Erlang or Nginx, you can sort of decide when new requests will be served by the new code, but existing processes and in-flight requests may linger on old code until much later.

But there's at least 30 minutes of downtime per week per market (usually per day), and the vast majority of those downtimes coincide over the weekend - so this discussion is moot and the solution needlessly complex. If you can afford the downtime, switch at midnight GMT between Saturday and Sunday, when all markets are closed.


You're right, yet nearly everybody operates on a daily restart schedule. I guess it's more intuitive, but it's wrong. Still, nobody ever got fired for bouncing systems once a day!


You're right. I understand a slow rollout; I was referring to the fact that they thought they had deployed to all servers but hadn't.

If your plan is for the software to be on all servers, it needs to be on all servers.



