Hacker Timesnew | past | comments | ask | show | jobs | submitlogin

All kinds of badness here. Bug #2 really reduces the level of comfort I would have with using ZooKeeper as a tool.

First of all, the default Java policy of terminating the thread, instead of the process, when a runtime exception is not handled is fully boneheaded and the first thing you should always do in a server program is to set a default uncaught exception handler which kills the program. Much better to flame out spectacularly than to limp along with your fingers crossed hoping for the best, as this bug amply demonstrates.

On the heels of that, there's this: "Unfortunately, that means the heartbeat mechanisms would continue to run as well, deceiving the followers into thinking that the leader is healthy." Major rookie mistake here; the heartbeat should be generated by the same code (e.g. polling loop) which does the actual work, or should be conditioned on the progress of such work. There's no indication that ZooKeeper is bad enough to have a separate thread whose only responsibility is to periodically generate the heartbeat (a shockingly common implementation choice), but it is clearly not monitoring the health of the program effectively.

Suffering a kernel level bug is outside the control of a program, but this demonstrates a lack of diligence or experience in applying the appropriate safety mechanisms to construct a properly functioning component of a distributed system.



There's no indication that ZooKeeper is bad enough to have a separate thread whose only responsibility is to periodically generate the heartbeat (a shockingly common implementation choice), but it is clearly not monitoring the health of the program effectively.

I spent a few minutes digging around in the source and found this:

http://svn.apache.org/viewvc/zookeeper/trunk/src/java/main/o...

Around line 1100 is the entry point of the thread that sends requests, and it's also what generates heartbeats, so it looks like if the receiver thread dies it'll just keep going...


> this demonstrates a lack of diligence or experience in applying the appropriate safety mechanisms to construct a properly functioning component of a distributed system

That's every experience I've had with ZooKeeper – gratuitous complexity, complicated setup and deployment and too many failure modes where it silently stops working without logging anything helpful. For a system like this simplicity is a necessity, not a quaint affection.


> gratuitous complexity, complicated setup and deployment and too many failure modes where it silently stops working without logging anything helpful

Well, it is a Java program…


Error detection is unfortunately similar if not identical to the halting problem for most cases, and heartbeating is not an exception.

For each system there is a single state which exhibits correct behaviour and an infinite set of states which produce incorrect behaviours.

Detecting an error correctly in theory would therefore have to require testing of all possible code paths and inputs.

This is of course impractical, so we try to find a middle ground, but this middle ground is in my experience far to simplistic to be of any real use in all but the simplest failure cases (IMO).

For this reason, I prefer to spend more effort in proactive error prevention than reactive. Time spent improving stability of the product generally has a better payoff than adding fault detection and recovery, which IMO should only be used as a belt-and-braces approach of returning the system to a known state. But there is always another type of failure which your error detection cannot detect, and so you should never rely on it.


I believe 100% in proactive error prevention as a means of building robust programs but, as you say, there are failures that cannot be handled this way; when the kernel fails (as in this case) or when the hardware fails (e.g. spontaneous bit flip error in the memory when not using ECC/ECM) there is no way to handle this. Given that this is the case, you must also make the system robust, and one element of that is being disciplined about communicating program state via mechanisms such as heartbeats. This is not a magic bullet, but produces much better results than a lackadaisical approach. I think that as much or more effort should be spent on system robustness as program robustness because much of the error handling code I've seen at the program logic level is overly complicated and under-tested; when in doubt, call abort() and let the overall system sort things out (and design your system so that this approach works, check out Netflix's Chaos Monkey).




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: