October 2014 Archives

Ben and I were discussing this thread, so we ended up writing a reply together.

This is an interesting discussion indeed, and we appreciate your interest in
learning more about the implementation of zookeeper. Here is another attempt to
explain you the differences between our algorithm and Paxos.

The Paxos Multi-Decree protocol basically consists of running parallel
instances of the Synod protocol (you probably know this one, but just for
completeness the paper is called "The Part-time parliament", and can be found
on Lamport's web page). The original Synod protocol, which is what we are
calling Paxos, proceeds in three phases, just like a three-phase commit
protocol. Our protocol instead, has only two phases, just like a two-phase
commit protocol. Of course, for Paxos, we can ignore the first phase in runs in
which we have a single proposer as we can run phase 1 for multiple instances at
a time, which is what Ben called previously Multi-Paxos, I believe. The trick
with skipping phase 1 is to deal with leader switching.

However, if we have a run with multiple proposers, operating simultaneously or
not, then we have to run phase 1 at least for the instances that haven't been
committed. The ZooKeeper protocol does not. The reason why we don't is
twofold. First, we assume FIFO channels. (FIFO meaning if a packet is received
from the channel all previously sent messages will have been delivered. If a
packet is lost in the channel, all subsequent packets will be lost. TCP is a
FIFO channel.) Paxos doesn't assume such a channel, and it is a rather
practical assumption that simplifies the protocol a lot. Second, there can be
at most one leader (proposer) at any time, and we guarantee this by making sure
that a quorum of replicas recognize the leader as a leader by committing to an
epoch change. This change in epoch also allows us to get unique zxids since the
epoch forms part of the zxid. Followers (they both acceptors and learners in
the Paxos terminology) have a FIFO
channel to a single leader, so that can only be a single active leader.

As a result, we can skip the phase 1 of Paxos completely, and also during
recovery we can skip all the uncommitted zxids of the epoch of the previous
leader. Since messages can be received out of order and even lost with Paxos,
it is possible to have gaps in the sequence of instances, and these instances
have to be decided when a new proposer arises. The conclusion is that by making
stronger assumptions for the system, we are able to use a simpler algorithm
that works truly in two phases.

One difference we find interesting is that Paxos embeds recovery into the
protocol. According to the algorithm, a new proposer just has to start one
phase 1 for each instance that it believes hasn't been committed yet. If such
an instance has been committed, then there is no problem as the value can't
change once it is committed. With the ZooKeeper protocol, we have to run an
auxiliary protocol to make sure that new leaders are up to date with respect to
operations that have been committed, but because of the FIFO assumption, we
know that the replica with the latest transaction id has the latest committed
state. Again, strengthening the assumptions for the system enable a simpler

Oh and don't get distracted by the leader election algorithm. Our protocol
assumes there is one, but it's not part of the broadcast protocol. The leader
election algorithm can easily change. There are actually two different ones in
the sources, and one of them doesn't even use notifications.

-Flavio and Ben


参考 Thoughts On Zookeeper 


| No Comments | No TrackBacks


| 2 Comments | No TrackBacks

国庆给MacMini加装了块SSD, 直接重装了系统,用Mac Homebrew安装各种GNU工具ing

托GFVV的福, 不但login.live.com和yahoo.com的SSL被MITM了, 连homebrew去sf.net和github.com下载源码都会被干扰SSL握手


homebrew使用curl,  所以直接给curl加proxy就行了

curl --socks5-hostname ip.cn查看结果是linode日本了...

在~/.curlrc中设置 socks5 = ""就搞定了


Homebrew should not ignore curlrc

Use curl behind SOCKS5 proxy