
Consul Study Notes (3): High Availability

I. Background

In production there are many backend services. When Consul is chosen for service governance, all of them register with Consul, so if Consul goes down the whole platform's business is affected. To keep the business stable, Consul must keep serving without downtime, i.e. it must be highly available. Per the official documentation, that means deploying Consul as a cluster (multiple Consul instances).

II. Approach

The previous article, Consul Study Notes (2): Production Deployment, already followed the official cluster deployment scheme: 2 consul servers and 2 clients, with web services registering with the clients, which in turn register them with the servers. That layout has a flaw: with only 2 servers the Raft quorum is also 2, so if either consul server instance goes down the whole cluster becomes unavailable. It is not highly available.

According to the official documentation, the fault tolerance of a consul server cluster (the number of instances that may fail) is as follows:
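
(The table below restates the Deployment Table from the official Consul documentation; the quorum is floor(N/2) + 1, and the failure tolerance is N minus the quorum.)

Servers    Quorum Size    Failure Tolerance
1          1              0
2          2              0
3          2              1
5          3              2
7          4              3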

As noted in the previous article, when building a consul server cluster one instance must be given -bootstrap-expect (or -bootstrap) so that it starts in bootstrap mode. The reasons this is required:

  • An instance started in bootstrap mode can elect itself leader; without one, no initial leader can be chosen and the cluster cannot form.
  • While a cluster is being formed, there must be one and only one consul server instance in bootstrap mode.
  • The two points above hold only when bootstrap_expect is unset or set to 1. If the number of consul servers to deploy is known in advance, every consul server can instead set bootstrap_expect to that number (see the sketch after this list).
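
For example, a minimal sketch of the bootstrap_expect approach for a cluster known to have 3 servers (only the relevant fields are shown; datacenter, data_dir, bind_addr and so on are as in the configs below). With this on every server, no instance needs -bootstrap, and the first election is triggered automatically once 3 servers have joined:

{
    "server": true,
    "bootstrap_expect": 3,
    "retry_join": ["192.168.149.128:8301", "192.168.149.129:8301", "192.168.149.130:8301"]
}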

The officially recommended deployment is:

Scale: 3 to 5 consul servers per datacenter, with any number of consul clients. The bound on server count balances fault tolerance against performance: as more servers are added, consensus progresses more slowly, since every write must be replicated to a larger quorum.

Based on the above, the initial cluster in this article is: 3 consul servers and 2 clients, with the web service registered on the clients.

consul server: 192.168.149.128, 192.168.149.130, 192.168.149.129

consul client: 192.168.149.128, 192.168.149.130

Both clients register the web service.

192.168.149.129 is the leader by default.

III. Deployment

The deployment procedure mostly follows Consul Study Notes (2): Production Deployment; this article gives only the configuration, the resulting behavior, and the problems encountered along the way.

1. Configuration

1) consul server

192.168.149.128 (joins the consul server on 129 via retry_join to form the cluster):

{
    "datacenter": "test-datacenter1",
    "data_dir": "/opt/consul/consul/data/",
    "log_file": "/opt/consul/consul/log/",
    "log_level": "INFO",
    "bind_addr": "192.168.149.128",
    "client_addr": "0.0.0.0",
    "node_name": "consul server1",
    "ui": true,
    "server": true,
    "retry_join": ["192.168.149.129:8301"]
}

192.168.149.129 (bootstrap_expect of 1 makes it the leader by default):

{
    "datacenter": "test-datacenter1",
    "data_dir": "/opt/consul/consul/data/",
    "log_file": "/opt/consul/consul/log/",
    "log_level": "INFO",
    "bind_addr": "192.168.149.129",
    "client_addr": "0.0.0.0",
    "node_name": "consul server2",
    "ui": true,
    "server": true,
    "bootstrap_expect": 1
}

192.168.149.130 (also joins the consul server on 129 via retry_join):

{
    "datacenter": "test-datacenter1",
    "data_dir": "/opt/consul/consul/data/",
    "log_file": "/opt/consul/consul/log/",
    "log_level": "INFO",
    "bind_addr": "192.168.149.130",
    "client_addr": "0.0.0.0",
    "node_name": "consul server3",
    "ui": true,
    "server": true,
    "retry_join": ["192.168.149.129:8301"]
}
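
Once all three servers are up, Raft membership can be verified from any server with consul operator raft list-peers. Illustrative output (the node IDs are taken from the logs in section IV; the exact columns vary by Consul version):

./consul operator raft list-peers
Node            ID                                    Address               State     Voter  RaftProtocol
consul server2  c92b2e81-c03e-f6d7-9eaf-7d5000141052  192.168.149.129:8300  leader    true   3
consul server1  a5c8f427-d4f7-571f-1057-1ec5d1981713  192.168.149.128:8300  follower  true   3
consul server3  105bc18f-aba5-a62c-ae2d-e5b2df9c007d  192.168.149.130:8300  follower  true   3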

2) consul client

Each client shares its host with a consul server, so the client's ports (dns, http, serf_lan, serf_wan) are remapped to avoid clashing with the server's defaults.

192.168.149.128

{
    "datacenter": "test-datacenter1",
    "data_dir": "/opt/consul/client/data/",
    "log_file": "/opt/consul/client/log/",
    "log_level": "INFO",
    "bind_addr": "192.168.149.128",
    "client_addr": "0.0.0.0",
    "node_name": "consul client1on128",
    "retry_join": ["192.168.149.128:8301"],
    "ports": {
        "dns": 8703,
        "http": 8700,
        "serf_wan": 8702,
        "serf_lan": 8701,
        "server":8704
    }
}

192.168.149.130

{
    "datacenter": "test-datacenter1",
    "data_dir": "/opt/consul/client/data/",
    "log_file": "/opt/consul/client/log/",
    "log_level": "INFO",
    "bind_addr": "192.168.149.130",
    "client_addr": "0.0.0.0",
    "node_name": "consul client1on130",
    "retry_join": ["192.168.149.129:8301"],
    "ports": {
        "dns": 8703,
        "http": 8700,
        "serf_wan": 8702,
        "serf_lan": 8701
    }
}

2. Results

Start all the consul servers and consul clients with ./consul agent -config-dir=conf.

Open the UI on 128, 129, and 130 to inspect the nodes.

The UI shows that 129 has been elected leader.

1) Scenario: the server on 129 goes down

Press Ctrl+C in the terminal where 129 was started to stop it, then check the node status in the UI on 128.

Now 128 is elected leader. The consul client on 130 originally joined through 129, yet even with 129 down, the consul server on 128 still shows the 130 client as alive. So when a node joins the cluster through any one consul server instance, every consul server instance learns of that client: membership is propagated across the cluster by gossip rather than tied to the join target.

2) Restoring the consul server on 129

129 rejoins the cluster automatically; the leader does not change.

3) In practice we may need to install consul servers in bulk and have all the instances form a cluster automatically. From the above, when bootstrap_expect is not used, exactly one of the consul servers forming the cluster must be started in bootstrap mode. For this scenario the deployment plan is:

a) All the bulk-installed consul servers start in bootstrap mode;

b) After each instance starts, a script queries a separate service for other running consul instances. If any exist, the script stops the current instance, restarts it in non-bootstrap mode, and joins it to the existing instances via the retry_join setting, i.e. the startup config gains the following (a sketch of the script follows below):

"retry_join": ["x.x.x.x:xx", "x.x.x.x:xx"]    (x.x.x.x is another instance's IP, xx its serf_lan port; the equivalent CLI flag is -retry-join)
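
A hedged shell sketch of step b). INVENTORY_URL stands for the query service mentioned above and is assumed to return one ip:serf_lan_port per line for consul servers already running; the config path and the use of jq are likewise assumptions to be adapted to the real environment:

#!/bin/sh
# INVENTORY_URL and CONF are assumptions, not part of the original deployment.
CONF=/opt/consul/consul/conf/server.json
PEERS=$(curl -fs "$INVENTORY_URL/consul-servers")
if [ -n "$PEERS" ]; then
    # Other instances already exist: leave gracefully, drop bootstrap mode,
    # then restart with retry_join pointing at the known instances.
    consul leave
    jq --argjson peers "$(printf '%s\n' $PEERS | jq -R . | jq -s .)" \
        'del(.bootstrap, .bootstrap_expect) | .retry_join = $peers' \
        "$CONF" > "$CONF.tmp" && mv "$CONF.tmp" "$CONF"
    consul agent -config-dir="$(dirname "$CONF")" &
fi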

Verified: services registered on a client are eventually synchronized to every consul server, and all server and client nodes are visible in any consul server's UI.
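
For example, querying the catalog on any server (the servers keep the default HTTP port 8500) returns the same service list:

curl http://192.168.149.128:8500/v1/catalog/services
curl http://192.168.149.129:8500/v1/catalog/services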

3. Service registration

In the deployment above, a microservice is deployed on each of 128 and 130, registering with the consul client on its own host; opening any consul web UI then shows the registered service entries.
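
As a hedged sketch of what that registration might look like (the service name web comes from this series; the port and payload details are assumptions), the microservice on 128 can register through the local client's HTTP port 8700:

curl -X PUT --data '{"Name": "web", "Port": 8080}' \
    http://127.0.0.1:8700/v1/agent/service/register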

IV. Problems

1. In the HA failure test, stopping 2 of the 3 server instances with Ctrl+C produces a different effect at each step

(1) Stopping the server on 129

The consul server on 128 logs the following:

2020-07-20T08:51:05.860-0400 [DEBUG] agent.server.memberlist.lan: memberlist: Failed ping: consul server2 (timeout reached)
2020-07-20T08:51:06.360-0400 [INFO] agent.server.memberlist.lan: memberlist: Suspect consul server2 has failed, no acks received
2020-07-20T08:51:06.528-0400 [DEBUG] agent.server.memberlist.wan: memberlist: Initiating push/pull sync with: consul server3.test-datacenter1 192.168.149.130:8302
2020-07-20T08:51:08.491-0400 [DEBUG] agent.server.raft: accepted connection: local-address=192.168.149.128:8300 remote-address=192.168.149.130:34859
2020-07-20T08:51:08.492-0400 [WARN] agent.server.raft: rejecting vote request since we have a leader: from=192.168.149.130:8300 leader=192.168.149.129:8300
2020-07-20T08:51:08.575-0400 [INFO] agent.server.serf.lan: serf: EventMemberFailed: consul server2 192.168.149.129
2020-07-20T08:51:08.601-0400 [INFO] agent.server: Removing LAN server: server="consul server2 (Addr: tcp/192.168.149.129:8300) (DC: test-datacenter1)"
2020-07-20T08:51:08.613-0400 [INFO] agent.server.memberlist.lan: memberlist: Marking consul server2 as failed, suspect timeout reached (2 peer confirmations)
2020-07-20T08:51:08.860-0400 [DEBUG] agent.server.memberlist.lan: memberlist: Failed ping: consul server2 (timeout reached)
2020-07-20T08:51:09.491-0400 [DEBUG] agent.server.memberlist.lan: memberlist: Initiating push/pull sync with: consul client1on130 192.168.149.130:8701
2020-07-20T08:51:10.359-0400 [INFO] agent.server.memberlist.lan: memberlist: Suspect consul server2 has failed, no acks received
2020-07-20T08:51:13.441-0400 [WARN] agent.server.raft: heartbeat timeout reached, starting election: last-leader=192.168.149.129:8300
2020-07-20T08:51:13.441-0400 [INFO] agent.server.raft: entering candidate state: node="Node at 192.168.149.128:8300 [Candidate]" term=3
2020-07-20T08:51:13.474-0400 [WARN] agent.server.raft: unable to get address for sever, using fallback address: id=c92b2e81-c03e-f6d7-9eaf-7d5000141052 fallback=192.168.149.129:8300 error="Could not find address for server id c92b2e81-c03e-f6d7-9eaf-7d5000141052"
2020-07-20T08:51:13.475-0400 [DEBUG] agent.server.raft: votes: needed=2
2020-07-20T08:51:13.475-0400 [DEBUG] agent.server.raft: vote granted: from=a5c8f427-d4f7-571f-1057-1ec5d1981713 term=3 tally=1
2020-07-20T08:51:13.475-0400 [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter c92b2e81-c03e-f6d7-9eaf-7d5000141052 192.168.149.129:8300}" error="dial tcp 192.168.149.128:0->192.168.149.129:8300: connect: connection refused"
2020-07-20T08:51:16.014-0400 [DEBUG] agent.server.memberlist.lan: memberlist: Stream connection from=192.168.149.128:56926
2020-07-20T08:51:16.975-0400 [DEBUG] agent.server.raft: lost leadership because received a requestVote with a newer term
2020-07-20T08:51:16.981-0400 [INFO] agent.server.raft: entering follower state: follower="Node at 192.168.149.128:8300 [Follower]" leader=
2020-07-20T08:51:16.996-0400 [DEBUG] agent.server.raft: accepted connection: local-address=192.168.149.128:8300 remote-address=192.168.149.130:33381
2020-07-20T08:51:17.132-0400 [DEBUG] agent.server.serf.lan: serf: messageUserEventType: consul:new-leader
2020-07-20T08:51:17.132-0400 [INFO] agent.server: New leader elected: payload="consul server3"

The log shows 129 being removed while a new election takes place; 130 is elected the new leader.

This is Consul's consensus rules at work: with 3 consul servers the quorum is floor(3/2) + 1 = 2, so one server may fail and the remaining two continue to operate.

Running ./consul members on 128 now shows the member status:
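
Illustrative output (node names and addresses as configured above; the Build and Protocol columns depend on the Consul version, and 1.8.0 here is only a guess based on the log dates):

Node                 Address               Status  Type    Build  Protocol  DC
consul server1       192.168.149.128:8301  alive   server  1.8.0  2         test-datacenter1
consul server2       192.168.149.129:8301  left    server  1.8.0  2         test-datacenter1
consul server3       192.168.149.130:8301  alive   server  1.8.0  2         test-datacenter1
consul client1on128  192.168.149.128:8701  alive   client  1.8.0  2         test-datacenter1
consul client1on130  192.168.149.130:8701  alive   client  1.8.0  2         test-datacenter1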

192.168.149.129 is clearly in the left state (it departed the cluster gracefully).

(2) Then stopping 130 as well

The 128 consul server log shows:

2020-07-20T08:56:36.860-0400 [DEBUG] agent.server.memberlist.lan: memberlist: Failed ping: consul server3 (timeout reached)
2020-07-20T08:56:37.361-0400 [INFO]  agent.server.memberlist.lan: memberlist: Suspect consul server3 has failed, no acks received
2020-07-20T08:56:39.561-0400 [DEBUG] agent.server.memberlist.lan: memberlist: Initiating push/pull sync with: consul client1on130 192.168.149.130:8701
2020-07-20T08:56:39.864-0400 [DEBUG] agent.server.memberlist.lan: memberlist: Failed ping: consul server3 (timeout reached)
2020-07-20T08:56:40.363-0400 [INFO]  agent.server.memberlist.lan: memberlist: Suspect consul server3 has failed, no acks received
2020-07-20T08:56:41.362-0400 [INFO]  agent.server.memberlist.lan: memberlist: Marking consul server3 as failed, suspect timeout reached (2 peer confirmations)
2020-07-20T08:56:41.362-0400 [INFO]  agent.server.serf.lan: serf: EventMemberFailed: consul server3 192.168.149.130
2020-07-20T08:56:41.363-0400 [INFO]  agent.server: Removing LAN server: server="consul server3 (Addr: tcp/192.168.149.130:8300) (DC: test-datacenter1)"
2020-07-20T08:56:41.862-0400 [DEBUG] agent.server.memberlist.lan: memberlist: Failed ping: consul server3 (timeout reached)
2020-07-20T08:56:43.362-0400 [INFO]  agent.server.memberlist.lan: memberlist: Suspect consul server3 has failed, no acks received
2020-07-20T08:56:45.165-0400 [WARN]  agent.server.raft: heartbeat timeout reached, starting election: last-leader=192.168.149.130:8300
2020-07-20T08:56:45.165-0400 [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.149.128:8300 [Candidate]" term=5
2020-07-20T08:56:45.172-0400 [DEBUG] agent.server.raft: votes: needed=2
2020-07-20T08:56:45.173-0400 [DEBUG] agent.server.raft: vote granted: from=a5c8f427-d4f7-571f-1057-1ec5d1981713 term=5 tally=1
2020-07-20T08:56:45.173-0400 [WARN]  agent.server.raft: unable to get address for sever, using fallback address: id=105bc18f-aba5-a62c-ae2d-e5b2df9c007d fallback=192.168.149.130:8300 error="Could not find address for server id 105bc18f-aba5-a62c-ae2d-e5b2df9c007d"
2020-07-20T08:56:45.173-0400 [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 105bc18f-aba5-a62c-ae2d-e5b2df9c007d 192.168.149.130:8300}" error=EOF
2020-07-20T08:56:45.362-0400 [DEBUG] agent.server.memberlist.wan: memberlist: Failed ping: consul server3.test-datacenter1 (timeout reached)
2020-07-20T08:56:46.090-0400 [DEBUG] agent.server.memberlist.lan: memberlist: Stream connection from=192.168.149.128:57068
2020-07-20T08:56:47.361-0400 [INFO]  agent.server.memberlist.wan: memberlist: Suspect consul server3.test-datacenter1 has failed, no acks received
2020-07-20T08:56:47.485-0400 [INFO]  agent.server.serf.lan: serf: attempting reconnect to consul server3 192.168.149.130:8301
2020-07-20T08:56:47.490-0400 [DEBUG] agent.server.memberlist.lan: memberlist: Failed to join 192.168.149.130: dial tcp 192.168.149.130:8301: connect: connection refused
2020-07-20T08:56:50.258-0400 [WARN]  agent.server.raft: Election timeout reached, restarting election
2020-07-20T08:56:50.258-0400 [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.149.128:8300 [Candidate]" term=6
2020-07-20T08:56:50.261-0400 [DEBUG] agent.server.raft: votes: needed=2
2020-07-20T08:56:50.261-0400 [DEBUG] agent.server.raft: vote granted: from=a5c8f427-d4f7-571f-1057-1ec5d1981713 term=6 tally=1
2020-07-20T08:56:50.261-0400 [WARN]  agent.server.raft: unable to get address for sever, using fallback address: id=105bc18f-aba5-a62c-ae2d-e5b2df9c007d fallback=192.168.149.130:8300 error="Could not find address for server id 105bc18f-aba5-a62c-ae2d-e5b2df9c007d"
2020-07-20T08:56:50.261-0400 [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 105bc18f-aba5-a62c-ae2d-e5b2df9c007d 192.168.149.130:8300}" error="dial tcp 192.168.149.128:0->192.168.149.130:8300: connect: connection refused"

Although 130 has been removed from the LAN gossip pool, it is still a Voter in the Raft peer set. So when 128 enters the candidate state it still needs 2 votes: one it casts for itself, and the other must come from 130. Since 130 is down and cannot cast a valid vote, 128 can never become leader.

Checking member status again with the members command:

This time 130 is only marked as failed (not left); if the consul server on 130 is started again, the election completes successfully.

But we usually want the last remaining consul server to keep working normally, so 130 must be stopped with the leave command, which removes it from the Raft peer set.
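
That is, stop 130 with a graceful leave instead of Ctrl+C, via either the CLI or the agent HTTP API:

./consul leave
# or equivalently, through the local agent's HTTP API:
curl -X PUT http://127.0.0.1:8500/v1/agent/leave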

The 128 consul server log then shows:

2020-07-20T09:27:12.153-0400 [DEBUG] agent.server.serf.lan: serf: messageLeaveType: consul server3
2020-07-20T09:27:12.161-0400 [DEBUG] agent.server.serf.lan: serf: messageLeaveType: consul server3
2020-07-20T09:27:12.163-0400 [DEBUG] agent.server.serf.lan: serf: messageLeaveType: consul server3
2020-07-20T09:27:12.556-0400 [INFO]  agent.server.serf.lan: serf: EventMemberLeave: consul server3 192.168.149.130
2020-07-20T09:27:12.556-0400 [INFO]  agent.server: Removing LAN server: server="consul server3 (Addr: tcp/192.168.149.130:8300) (DC: test-datacenter1)"
2020-07-20T09:27:12.557-0400 [INFO]  agent.server: removing server by ID: id=105bc18f-aba5-a62c-ae2d-e5b2df9c007d
2020-07-20T09:27:12.557-0400 [INFO]  agent.server.raft: updating configuration: command=RemoveServer server-id=105bc18f-aba5-a62c-ae2d-e5b2df9c007d server-addr= servers="[{Suffrage:Voter ID:a5c8f427-d4f7-571f-1057-1ec5d1981713 Address:192.168.149.128:8300}]"
2020-07-20T09:27:12.564-0400 [WARN]  agent.server.raft: unable to get address for sever, using fallback address: id=105bc18f-aba5-a62c-ae2d-e5b2df9c007d fallback=192.168.149.130:8300 error="Could not find address for server id 105bc18f-aba5-a62c-ae2d-e5b2df9c007d"
2020-07-20T09:27:12.564-0400 [INFO]  agent.server.raft: removed peer, stopping replication: peer=105bc18f-aba5-a62c-ae2d-e5b2df9c007d last-index=359
2020-07-20T09:27:12.564-0400 [INFO]  agent.server: deregistering member: member="consul server3" reason=left
2020-07-20T09:27:12.564-0400 [INFO]  agent.server.raft: aborting pipeline replication: peer="{Voter 105bc18f-aba5-a62c-ae2d-e5b2df9c007d 192.168.149.130:8300}"

And the UI shows:

It follows that once the number of healthy consul server instances has dropped to the Quorum Size in the Deployment Table, any further server must exit via leave. That removes the consul server from the peer set cleanly, the Quorum Size shrinks along with the peer set, and the cluster remains usable.

2. Making consul servers form a cluster automatically at startup

There are two ways:

1) At least one of the consul servers is started with bootstrap=true or bootstrap_expect=1;

2) All of the consul servers are started with the same bootstrap_expect value, equal to the number of servers (as in the sketch in section II).