Does ZooKeeper includes Paxos ?

| No Comments | No TrackBacks

Ben and I were discussing this thread, so we ended up writing a reply together.

This is an interesting discussion indeed, and we appreciate your interest in
learning more about the implementation of zookeeper. Here is another attempt to
explain you the differences between our algorithm and Paxos.

The Paxos Multi-Decree protocol basically consists of running parallel
instances of the Synod protocol (you probably know this one, but just for
completeness the paper is called "The Part-time parliament", and can be found
on Lamport's web page). The original Synod protocol, which is what we are
calling Paxos, proceeds in three phases, just like a three-phase commit
protocol. Our protocol instead, has only two phases, just like a two-phase
commit protocol. Of course, for Paxos, we can ignore the first phase in runs in
which we have a single proposer as we can run phase 1 for multiple instances at
a time, which is what Ben called previously Multi-Paxos, I believe. The trick
with skipping phase 1 is to deal with leader switching.

However, if we have a run with multiple proposers, operating simultaneously or
not, then we have to run phase 1 at least for the instances that haven't been
committed. The ZooKeeper protocol does not. The reason why we don't is
twofold. First, we assume FIFO channels. (FIFO meaning if a packet is received
from the channel all previously sent messages will have been delivered. If a
packet is lost in the channel, all subsequent packets will be lost. TCP is a
FIFO channel.) Paxos doesn't assume such a channel, and it is a rather
practical assumption that simplifies the protocol a lot. Second, there can be
at most one leader (proposer) at any time, and we guarantee this by making sure
that a quorum of replicas recognize the leader as a leader by committing to an
epoch change. This change in epoch also allows us to get unique zxids since the
epoch forms part of the zxid. Followers (they both acceptors and learners in
the Paxos terminology) have a FIFO
channel to a single leader, so that can only be a single active leader.

As a result, we can skip the phase 1 of Paxos completely, and also during
recovery we can skip all the uncommitted zxids of the epoch of the previous
leader. Since messages can be received out of order and even lost with Paxos,
it is possible to have gaps in the sequence of instances, and these instances
have to be decided when a new proposer arises. The conclusion is that by making
stronger assumptions for the system, we are able to use a simpler algorithm
that works truly in two phases.

One difference we find interesting is that Paxos embeds recovery into the
protocol. According to the algorithm, a new proposer just has to start one
phase 1 for each instance that it believes hasn't been committed yet. If such
an instance has been committed, then there is no problem as the value can't
change once it is committed. With the ZooKeeper protocol, we have to run an
auxiliary protocol to make sure that new leaders are up to date with respect to
operations that have been committed, but because of the FIFO assumption, we
know that the replica with the latest transaction id has the latest committed
state. Again, strengthening the assumptions for the system enable a simpler

Oh and don't get distracted by the leader election algorithm. Our protocol
assumes there is one, but it's not part of the broadcast protocol. The leader
election algorithm can easily change. There are actually two different ones in
the sources, and one of them doesn't even use notifications.

-Flavio and Ben

参考 Thoughts On Zookeeper 

No TrackBacks

TrackBack URL:

Leave a comment

About this Entry

This page contains a single entry by suchasplus published on October 23, 2014 11:53 PM.

在centos上启用epel was the previous entry in this blog.

关于decorator之类的语法糖的必要性 is the next entry in this blog.

Find recent content on the main index or look in the archives to find all content.


OpenID accepted here Learn more about OpenID
Powered by Movable Type 5.2.7