The Taobao search team's blog, http://www.searchtb.com/2011/01/zookeeper-research.html, says that Paxos is the soul of ZooKeeper.

Another article is even titled “ZooKeeper Fully Explained: Paxos as the Soul” and treats Paxos as the foundation of ZooKeeper:


Paxos is a message-passing consensus algorithm proposed by Leslie Lamport in 1990. The problem it solves is how a distributed system reaches agreement on a single value (a decision).

The Part-Time Parliament and Paxos Made Simple describe the execution of the Paxos algorithm as follows:

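  1. prepare phase:
    1. The proposer chooses a proposal number n and sends a prepare request to a majority of the acceptors;
    2. When an acceptor receives a prepare request whose number n is greater than that of every prepare request it has already responded to, it replies to the proposer with the last proposal it accepted (if any) and promises not to accept any proposal numbered less than n;
  2. accept phase:
    1. Once a proposer has received responses to its prepare request from a majority of the acceptors, it enters the accept phase. It sends an accept request to the acceptors that responded to the prepare request, containing the number n and the value determined by P2c (if P2c yields no previously accepted value, the proposer is free to choose the value).
    2. An acceptor accepts an accept request upon receiving it, provided that doing so does not violate a promise it has made to another proposer.
  3. learn phase:
    1. Once the acceptors have reached agreement, the agreed result must be communicated to all learners.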
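
To make the phases above concrete, here is a minimal in-memory sketch of a single Paxos round. It is an illustration only, not code from ZooKeeper or from any Paxos library: the names `Acceptor`, `on_prepare`, `on_accept` and `propose` are made up for this example, messages are plain method calls, and the learn phase is reduced to returning the chosen value.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised_n: int = -1                         # highest prepare number answered so far
    accepted: Optional[Tuple[int, str]] = None   # (number, value) of the last accepted proposal

    def on_prepare(self, n: int):
        """Prepare phase: promise only if n beats every earlier prepare."""
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted)
        return ("reject", None)

    def on_accept(self, n: int, value: str) -> bool:
        """Accept phase: accept unless that would break an earlier promise."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors: List[Acceptor], n: int, my_value: str) -> Optional[str]:
    """Run one proposer round; return the chosen value, or None if the round fails."""
    majority = len(acceptors) // 2 + 1

    # Prepare phase: send prepare(n) and collect the promises.
    promised = []
    for acceptor in acceptors:
        kind, previous = acceptor.on_prepare(n)
        if kind == "promise":
            promised.append((acceptor, previous))
    if len(promised) < majority:
        return None

    # Use the value of the highest-numbered proposal already accepted, if any;
    # otherwise the proposer is free to use its own value.
    prior = [previous for _, previous in promised if previous is not None]
    value = max(prior, key=lambda p: p[0])[1] if prior else my_value

    # Accept phase: send accept(n, value) to the acceptors that promised.
    acks = sum(1 for acceptor, _ in promised if acceptor.on_accept(n, value))
    return value if acks >= majority else None
```

With three fresh acceptors, a first round numbered 1 gets its own value chosen, a later round numbered 2 is forced to re-propose that same value, and a stale round numbered 1 is rejected outright.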


This article, http://csrd.aliapp.com/?p=162&replytocom=782, uses "Paxos implementation" directly in its title and says that ZooKeeper uses the Paxos algorithm (mainly Fast Paxos) when electing a leader.



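FastLeaderElection is not Paxos at all, nor is it an implementation of Fast Paxos. The FastLeaderElection source code bears little resemblance to the Paxos papers.

Paxos and Fast Paxos also have a leader election problem of their own.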





There is a very common misunderstanding that the leader election algorithm in ZooKeeper is Paxos or Fast Paxos. It is neither; please consider the following facts:

  1. There is no concept of a proposal number in ZooKeeper's leader election, yet the proposal number is a key concept in Paxos. Some people think the epoch is the proposal number, but different followers may produce proposals with the same epoch, which is not allowed in Paxos.
  2. Fast Paxos requires at least 3t + 1 acceptors, where t is the number of servers that are allowed to fail [3]. This conflicts with the fact that a ZooKeeper cluster with 3 servers works well even if one server has failed.
  3. The leader election algorithm must ensure P1. However, Paxos does not provide such a guarantee.
  4. In Paxos, a leader is also required to achieve progress. There are some similarities between the leader in Paxos and the leader in ZooKeeper. Even if more than one server believes it is the leader, consistency is preserved both in ZooKeeper and in Paxos. This is very clearly discussed in [1] and [2].
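
The arithmetic behind point 2 is easy to check. The sketch below takes the 3t + 1 bound quoted above at face value and compares it with ordinary majority quorums (n >= 2t + 1); the function names are made up for this illustration.

```python
def tolerated_failures_majority(n: int) -> int:
    """Largest t such that n >= 2t + 1 (majority quorums)."""
    return (n - 1) // 2


def tolerated_failures_3t_plus_1(n: int) -> int:
    """Largest t such that n >= 3t + 1 (the bound quoted above for Fast Paxos)."""
    return (n - 1) // 3


# Print, for a few ensemble sizes, how many failures each bound allows.
for n in (3, 4, 5, 7):
    print(n, tolerated_failures_majority(n), tolerated_failures_3t_plus_1(n))
```

For n = 3 the majority bound tolerates one failure while the 3t + 1 bound tolerates none, which is the mismatch the list points out.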



Our protocol, instead, has only two phases, just like a two-phase
commit protocol. Of course, for Paxos, we can ignore the first phase in runs in
which we have a single proposer, as we can run phase 1 for multiple instances at
a time, which is what Ben previously called Multi-Paxos, I believe. The trick
with skipping phase 1 is to deal with leader switching.


We made a few interesting observations about Paxos when contrasting it to Zab, like problems you could run into if you just implemented Paxos alone. Not that Paxos is broken or anything, just that in our setting, there were some properties it was not giving us. Some people still like to map Zab to Paxos, and they are not completely off, but the way we see it, Zab matches a service like ZooKeeper well.



We use an algorithm that shares some of the characteristics of Paxos, but that combines transaction logging needed for consensus with write-ahead logging needed for data tree recovery to enable an efficient implementation.





there can be at most one leader (proposer) at any time, and we guarantee this by making sure 
that a quorum of replicas recognize the leader as a leader by committing to an 
epoch change. This change in epoch also allows us to get unique zxids since the 
epoch forms part of the zxid. 
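
A small sketch of why putting the epoch into the zxid yields unique, ordered ids. The 32-bit epoch / 32-bit counter split is how the 64-bit zxid is usually described, but treat the exact widths and the helper names below as assumptions of this example rather than as ZooKeeper's API.

```python
COUNTER_BITS = 32                        # assumed width of the per-epoch counter
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def make_zxid(epoch: int, counter: int) -> int:
    """Pack the epoch into the high bits and the counter into the low bits."""
    return (epoch << COUNTER_BITS) | (counter & COUNTER_MASK)


def epoch_of(zxid: int) -> int:
    """Recover the epoch from the high bits."""
    return zxid >> COUNTER_BITS


def counter_of(zxid: int) -> int:
    """Recover the per-epoch counter from the low bits."""
    return zxid & COUNTER_MASK
```

Because the epoch sits in the high bits, any zxid issued by a newer leader compares greater than every zxid from an older epoch, regardless of the counters.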

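Each server has an id, and each committed transaction it receives has a zxid. The transaction number increases as data is updated, and every server id is distinct. The server with the largest zxid is elected leader first; if the zxids cannot decide it, the server with the largest server id becomes leader.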
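
The ordering rule just described can be sketched as a comparison on (zxid, server id) pairs. This only illustrates the tie-breaking rule from the paragraph above; the `Vote` type and `pick_leader` are hypothetical names, and the real FastLeaderElection additionally exchanges votes between servers and waits for a quorum to agree.

```python
from typing import Iterable, NamedTuple


class Vote(NamedTuple):
    zxid: int        # last transaction id seen by the candidate
    server_id: int   # unique id of the candidate server


def pick_leader(votes: Iterable[Vote]) -> Vote:
    """Largest zxid wins; equal zxids are broken by the largest server id."""
    return max(votes, key=lambda v: (v.zxid, v.server_id))
```

For example, among votes `(zxid=5, id=1)`, `(zxid=7, id=2)` and `(zxid=7, id=3)`, server 3 wins because it ties on the highest zxid and has the larger id.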




With leader election covered, let's look at how data is synchronized. Synchronization uses the Zab protocol (ZooKeeper Atomic Broadcast), which resembles two-phase commit:

  1. The leader sends a PROPOSAL message, p, to all followers.
  2. Upon receiving p, a follower responds to the leader with an ACK, informing the leader that it has accepted the proposal.
  3. Upon receiving acknowledgments from a quorum (the quorum includes the leader itself), the leader sends a message informing the followers to COMMIT it.
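
The three steps above can be sketched in a few lines. This is a toy, single-process model under my own assumptions (no network, no recovery, a made-up `alive` flag to simulate a crashed follower); the class and method names are not ZooKeeper's.

```python
from typing import List


class Follower:
    def __init__(self, alive: bool = True) -> None:
        self.alive = alive
        self.log: List[int] = []        # proposals accepted, in arrival order
        self.committed: List[int] = []  # proposals committed, in order

    def on_proposal(self, zxid: int) -> bool:
        """Step 2: log the proposal and ACK it (a dead follower never answers)."""
        if not self.alive:
            return False
        self.log.append(zxid)
        return True

    def on_commit(self, zxid: int) -> None:
        """Step 3, follower side: apply the COMMIT."""
        if self.alive:
            self.committed.append(zxid)


class Leader:
    def __init__(self, followers: List[Follower]) -> None:
        self.followers = followers
        self.committed: List[int] = []

    def broadcast(self, zxid: int) -> bool:
        """Propose zxid; commit it only if a quorum (leader included) ACKs."""
        ensemble_size = len(self.followers) + 1
        quorum = ensemble_size // 2 + 1
        acks = 1  # the leader counts itself
        for follower in self.followers:      # step 1: send PROPOSAL
            if follower.on_proposal(zxid):
                acks += 1
        if acks < quorum:
            return False
        self.committed.append(zxid)
        for follower in self.followers:      # step 3: send COMMIT
            follower.on_commit(zxid)
        return True
```

With one live and one dead follower the leader still reaches a quorum of 2 out of 3 and commits; with both followers dead the proposal is not committed.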


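Guarantee: for two transactions T and T′ broadcast in that order, T must be committed before T′ takes effect. If one server commits T and T′, every other server must also commit T before T′.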
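
One way to read this guarantee is that the commit histories of any two servers must be prefixes of one another. That reading, and the helper name `prefix_consistent`, are my own assumptions; the sketch simply checks the property on two lists of committed zxids.

```python
from typing import Sequence


def prefix_consistent(a: Sequence[int], b: Sequence[int]) -> bool:
    """True if the shorter commit history is a prefix of the longer one."""
    shorter, longer = sorted((list(a), list(b)), key=len)
    return longer[:len(shorter)] == shorter
```

A server that is merely behind (`[1, 2]` versus `[1, 2, 3]`) passes the check, while one that reordered or skipped a transaction (`[2, 1]` or `[1, 3]`) does not.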


To handle leader crashes and avoid the transaction confusion that multiple leaders would cause, the Zab algorithm guarantees:





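The third role in zk is the observer; the only difference between an observer and a follower is that the observer has no vote. It mainly exists (1) to scale read requests and (2) to support multi-datacenter deployments: the leader and followers are deployed in the central datacenter, observers in the other datacenters, and configuration reads go to the local datacenter first.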




