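ZooKeeper learning series, part 4: the relationship between the Paxos algorithm and ZooKeeper
I. Origin of the question
A post on the Taobao search team's blog, http://www.searchtb.com/2011/01/zookeeper-research.html, says that Paxos is the soul of ZooKeeper.
Another article goes further and is titled "Zookeeper fully explained: Paxos as the soul", treating Paxos as the foundation of ZooKeeper: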
“
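Google's Chubby and Apache's ZooKeeper are both implemented on the basis of its theory. Paxos is even regarded as the only distributed consensus algorithm to date, with all other algorithms being refinements or simplifications of Paxos. One point needs mentioning: Paxos has a precondition, namely that there is no Byzantine generals problem. In other words, Paxos only holds in a trusted computing environment, one that cannot be corrupted by intrusion.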
”
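Paxos is a message-passing consensus algorithm proposed by Leslie Lamport in 1990. The problem it solves is how a distributed system reaches agreement on a single value (a decision).
"The Part-Time Parliament" and "Paxos Made Simple" describe a run of the algorithm as follows: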
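- Prepare phase:
- A proposer chooses a proposal number n and sends a prepare request to a majority of the acceptors;
- When an acceptor receives a prepare message whose number is greater than that of every prepare message it has already answered, it replies to the proposer with the proposal it last accepted and promises not to accept any proposal numbered less than n;
- Accept phase:
- Once a proposer has received replies to its prepare request from a majority of acceptors, it enters the accept phase. It sends an accept request to the acceptors that replied, containing the number n and the value determined by P2c, that is, the value of the highest-numbered proposal reported in the replies (if the replies report no accepted value, the proposer is free to choose the value).
- An acceptor accepts an accept request on receipt, provided that doing so does not break a promise it has made to another proposer.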
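The acceptor's side of the two phases above can be sketched in a few lines of Java. This is only an in-memory illustration under my own assumptions: the class name SketchAcceptor, its methods and the String value type are made up for this post, and a real acceptor would have to persist its state before replying.

```java
/** In-memory sketch of a single-decree Paxos acceptor (hypothetical names, no persistence). */
final class SketchAcceptor {
    /** Reply to a prepare request: whether a promise was made, plus the last accepted proposal. */
    static final class Promise {
        final boolean ok;
        final long acceptedN;        // -1 if nothing has been accepted yet
        final String acceptedValue;  // null if nothing has been accepted yet

        Promise(boolean ok, long acceptedN, String acceptedValue) {
            this.ok = ok;
            this.acceptedN = acceptedN;
            this.acceptedValue = acceptedValue;
        }
    }

    private long promisedN = -1;         // highest proposal number promised so far
    private long acceptedN = -1;         // number of the last accepted proposal
    private String acceptedValue = null; // value of the last accepted proposal

    /** Prepare phase: promise only if n is higher than every prepare answered before. */
    Promise onPrepare(long n) {
        if (n > promisedN) {
            promisedN = n;
            return new Promise(true, acceptedN, acceptedValue);
        }
        return new Promise(false, acceptedN, acceptedValue);
    }

    /** Accept phase: accept unless that would break a promise to a higher-numbered prepare. */
    boolean onAccept(long n, String value) {
        if (n >= promisedN) {
            promisedN = n;
            acceptedN = n;
            acceptedValue = value;
            return true;
        }
        return false;
    }
}
```

A proposer that gets Promise replies from a majority would pick the acceptedValue with the largest acceptedN among them, or its own value if none was reported, and then call onAccept on those acceptors.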
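Wikipedia describes two phases, whereas the paper speaks of three, and it assumes from the start a proposer that acts as the leader. Searching further, I found that someone has spelled out the third phase (http://www.wuzesheng.com/?p=2724):
3. Learn phase:
- Once the acceptors have reached agreement, the agreed result must be announced to all learners.
ZooKeeper uses org.apache.zookeeper.server.quorum.FastLeaderElection as its default election algorithm.
This article, http://csrd.aliapp.com/?p=162&replytocom=782, puts "Paxos implementation" right in its title and says that ZooKeeper uses Paxos (mainly Fast Paxos) when electing a leader.
By chance I noticed someone rebutting this in the comments below it: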
魏講文:“
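FastLeaderElection is not Paxos at all, nor is it an implementation of Fast Paxos. The FastLeaderElection source code is a long way from the Paxos papers.
Paxos and Fast Paxos also have a leader election problem of their own.
For ZooKeeper, FastLeaderElection merely plays the part that leader election plays in Paxos.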
”
II. Checking the sources
Fine, time to look things up and read the source code.
First, 魏講文's rebuttal:
There is a very common misunderstanding that the leader election algorithm in zookeeper is paxos or fast paxos. The leader election algorithm is not paxos or fast paxos, please consider the following facts:
- There is no the concept of proposal number in the leader election in zookeeper. the proposal number is a key concept to paxos. Some one think the epoch is the proposal number, but different followers may produce proposal with the same epoch which is not allowed in paxos.
- Fast paxos requires at least 3t + 1 acceptors, where t is the number of servers which are allowed to fail [3]. This is conflict with the fact that a zookeeper cluster with 3 servers works well even if one server failed.
- The leader election algorithm must make sure P1. However paxos does provide such guarantee.
- In paxos, a leader is also required to achieve progress. There are some similarities between the leader in paxos and the leader in zookeeper. Even if more than one servers believe they are the leader, the consistency is preserved both in zookeeper and in paxos. this is very clearly discussed in [1] and [2].
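The second point is easy to check with a little arithmetic. Taking the 3t + 1 bound from the quote at face value, and using the usual majority rule for comparison, a tiny Java sketch (QuorumMath and its method names are mine, purely for illustration) gives:

```java
/** Fault-tolerance arithmetic for the two quorum rules discussed above (hypothetical helper). */
final class QuorumMath {
    /** Majority quorums: n servers tolerate floor((n - 1) / 2) crashed servers. */
    static int majorityTolerated(int n) {
        return (n - 1) / 2;
    }

    /** The bound quoted for Fast Paxos, n >= 3t + 1, gives t <= floor((n - 1) / 3). */
    static int fastPaxosTolerated(int n) {
        return (n - 1) / 3;
    }
}
```

For n = 3 the majority rule tolerates one failure, which matches how a three-server ZooKeeper cluster behaves, while the quoted Fast Paxos bound tolerates none and would need four servers to survive a single failure.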
Next, three comparisons made by the ZooKeeper authors themselves:
1) Mailing list
Our protocol instead, has only two phases, just like a two-phase
commit protocol. Of course, for Paxos, we can ignore the first phase in runs in
which we have a single proposer as we can run phase 1 for multiple instances at
a time, which is what Ben called previously Multi-Paxos, I believe. The trick
with skipping phase 1 is to deal with leader switching.
2) Interview about the book
We made a few interesting observations about Paxos when contrasting it to Zab, like problems you could run into if you just implemented Paxos alone. Not that Paxos is broken or anything, just that in our setting, there were some properties it was not giving us. Some people still like to map Zab to Paxos, and they are not completely off, but the way we see it, Zab matches a service like ZooKeeper well.
So ZooKeeper's distributed consensus algorithm has a name of its own: Zab.
3) Paper
We use an algorithm that shares some of the characteristics of Paxos, but that combines transaction logging needed for consensus with write-ahead logging needed for data tree recovery to enable an efficient implementation.
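III. Analysis of leader election
As I understand it, Paxos cannot be what runs the election itself. Whether it is electing a president in Paxos or electing a leader in ZooKeeper, an election is a different matter from putting forward a proposal and getting a majority to agree. Only with a single elected leader can conflicting concurrent proposals from multiple proposers be avoided.
Who acts as the proposer and issues proposals? Settling that is the precondition for Paxos to make progress. Issuing a proposal and getting a majority of followers to agree, on the other hand, can be done with a Paxos-like algorithm to achieve consistency. ZooKeeper uses the Zab algorithm for this.
ZooKeeper's leader election strategy: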
there can be at most one leader (proposer) at any time, and we guarantee this by making sure
that a quorum of replicas recognize the leader as a leader by committing to an
epoch change. This change in epoch also allows us to get unique zxids since the
epoch forms part of the zxid.
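Every server has an id, and every committed transaction it receives carries a zxid. The transaction number increases as the data is updated, while the server ids are all distinct. The server with the largest zxid is chosen as leader first; if the zxids tie, the server with the largest server id is chosen.
The zxid contains an epoch number. The epoch identifies the period during which one server acts as leader, and it is incremented each time a new leader emerges.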
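To make "the epoch forms part of the zxid" concrete: ZooKeeper keeps the epoch in the high 32 bits of the 64-bit zxid and a per-epoch counter in the low 32 bits. The Java below is a sketch of that layout; the class ZxidSketch and its method names are my own illustration, not copied from the ZooKeeper source.

```java
/** Sketch of the zxid layout: epoch in the high 32 bits, counter in the low 32 bits. */
final class ZxidSketch {
    /** Packs an epoch and a per-epoch counter into one 64-bit zxid. */
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    /** Extracts the epoch (high 32 bits). */
    static long epochOf(long zxid) {
        return zxid >> 32;
    }

    /** Extracts the counter (low 32 bits). */
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }
}
```

With this layout, any zxid issued in a later epoch compares greater than every zxid from an earlier epoch, so comparing zxids as plain long values already puts the epoch first.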
Now for the code:
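What follows is a minimal Java sketch of the ordering described above, not the actual FastLeaderElection source, which also handles notification exchange and election rounds. The class VoteOrderSketch and the method beats are hypothetical names of mine.

```java
/** Sketch of the vote ordering described above: larger zxid wins, server id breaks ties. */
final class VoteOrderSketch {
    /**
     * Returns true if the candidate (newZxid, newSid) should replace
     * the vote currently held, (curZxid, curSid).
     */
    static boolean beats(long newZxid, long newSid, long curZxid, long curSid) {
        if (newZxid != curZxid) {
            return newZxid > curZxid; // the more up-to-date server wins
        }
        return newSid > curSid;       // tie on zxid: the larger server id wins
    }
}
```

Each server would start by voting for itself and switch its vote whenever a received vote beats the one it currently holds. Since the server ids are distinct, the ordering is total and the servers converge on the same candidate.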
IV. How ZooKeeper updates data
With leader election covered, let us look at how data is synchronized. Synchronization uses the Zab protocol (ZooKeeper Atomic Broadcast), which resembles two-phase commit:
- The leader sends a PROPOSAL message, p, to all followers.
- Upon receiving p, a follower responds to the leader with an ACK, informing the leader that it has accepted the proposal.
- Upon receiving acknowledgments from a quorum (the quorum includes the leader itself), the leader sends a message informing the followers to COMMIT it.
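The leader's bookkeeping for the three steps above can be sketched as follows. This is a simplified Java illustration under my own assumptions: ProposalTracker and its members are hypothetical names, one instance tracks a single proposal, and the network and the ordering of proposals are left out entirely.

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch of how a leader might count ACKs for one proposal (hypothetical names). */
final class ProposalTracker {
    private final int ensembleSize;                 // number of voting servers, leader included
    private final Set<Long> acks = new HashSet<>(); // server ids that have acknowledged
    private boolean committed = false;

    ProposalTracker(int ensembleSize, long leaderSid) {
        this.ensembleSize = ensembleSize;
        acks.add(leaderSid); // the quorum includes the leader itself
    }

    /** Records an ACK; returns true exactly once, when a quorum is first reached. */
    boolean onAck(long sid) {
        acks.add(sid); // a duplicate ACK from the same server is counted once
        if (!committed && acks.size() > ensembleSize / 2) {
            committed = true;
            return true; // the caller would now send COMMIT to the followers
        }
        return false;
    }

    boolean isCommitted() {
        return committed;
    }
}
```

In a five-server ensemble the leader's own vote plus two follower ACKs make a quorum of three, so the COMMIT can go out while the remaining two followers are still catching up.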
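The difference from Paxos is that the leader sends each proposal to all followers rather than only to a majority. As the third step shows, though, the leader commits as soon as a quorum has acknowledged the proposal; it does not wait for every follower.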
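Guarantee: for two transactions T and Tʹ broadcast in that order, T must be committed before Tʹ takes effect. If one server commits T and Tʹ, then every other server must also commit T before Tʹ.
V. Detecting leader failure
To cope with leader crashes and to prevent multiple leaders from scrambling the transactions, the Zab algorithm guarantees:
1. Before broadcasting new transactions, a new leader must commit every transaction that was committed during previous epochs.
2. At no time do two leaders both have enough supporters to form a quorum.
The initial state of a new leader must be agreed to by a majority of the servers.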
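VI. Observer
The observer is the third role in ZooKeeper. It differs from a follower only in that it has no vote. It exists mainly to (1) let read requests scale, and (2) support multi-datacenter deployment: the leader and followers run in the central datacenter, observers run in the other datacenters, and configuration reads go to the local datacenter first.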
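VII. Summary
In my view ZooKeeper can only be said to be influenced by Paxos: the roles are divided similarly and proposals are passed in a similar way, but the implementation is simpler and more direct. Leader election ranks candidates by zxid and then by vote id (server id). Data synchronization uses a protocol resembling two-phase commit, in which the leader commits a transaction and updates state once a quorum has acknowledged it. The observer role exists to increase system throughput and to support cross-datacenter deployment.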
References
[1] Reed, B., & Junqueira, F. P. (2008). A simple totally ordered broadcast protocol. Second Workshop on Large-Scale Distributed Systems and Middleware (LADIS 2008). Yorktown Heights, NY: ACM. ISBN: 978-1-60558-296-2.
[2] Lamport, L. Paxos made simple. ACM SIGACT News 32, 4 (Dec. 2001), 18-25.
[3] F. Junqueira, Y. Mao, and K. Marzullo. Classic paxos vs. fast paxos: caveat emptor. In Proceedings of the 3rd USENIX/IEEE/IFIP Workshop on Hot Topics in System Dependability (HotDep'07). Citeseer, 2007.
[4] O'Reilly. ZooKeeper: Distributed Process Coordination. 2013.
[5] http://agapple.iteye.com/blog/1184023 (notes on using ZooKeeper in a project)