etcd Backup and Restore in a K8S Cluster [repost]
etcd is an extremely important service in a Kubernetes cluster: it stores all of the cluster's data, such as the state of Namespaces, Pods, Services, routes, and so on. If the etcd cluster suffers a disaster or loses its data, recovery of the k8s cluster is directly affected. Backing up the etcd data to provide a disaster-recovery path for the Kubernetes cluster is therefore essential.
1. etcd Cluster Backup
The etcdctl command differs somewhat between etcd versions, but the usage is largely the same; here the backup is taken as a snapshot with snapshot save. A few points to note:
- The backup only needs to be run on one node of the etcd cluster.
- The etcd v3 API is used here, because starting with k8s 1.13, k8s no longer supports etcd v2, i.e. all k8s cluster data lives in the v3 etcd store. Consequently only data written through the v3 API is backed up; data written through the v2 API is not.
- This example uses a binary-deployed k8s v1.18.6 + Calico container environment ("ETCDCTL_API=3 etcdctl" in the commands below is equivalent to "etcdctl").
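Before backing up, it can be worth confirming which etcdctl binary and API version are in play; this is just a sanity check, not part of the original procedure (the binary path is the one used later in this article):
# ETCDCTL_API=3 /opt/k8s/bin/etcdctl version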
1) Check the etcd data directories:
etcd data directory
[root@k8s-master01 ~]# cat /opt/k8s/bin/environment.sh |grep "ETCD_DATA_DIR="
export ETCD_DATA_DIR="/data/k8s/etcd/data"
etcd WAL directory
[root@k8s-master01 ~]# cat /opt/k8s/bin/environment.sh |grep "ETCD_WAL_DIR="
export ETCD_WAL_DIR="/data/k8s/etcd/wal"
[root@k8s-master01 ~]# ls /data/k8s/etcd/data/
member
[root@k8s-master01 ~]# ls /data/k8s/etcd/data/member/
snap
[root@k8s-master01 ~]# ls /data/k8s/etcd/wal/
0000000000000000-0000000000000000.wal  0.tmp
2) Run the etcd cluster data backup. The backup is executed on one node of the etcd cluster, and the backup file is then copied to the other nodes. First create the backup directory on every etcd node:
# mkdir -p /data/etcd_backup_dir
Run the backup on one of the etcd cluster nodes (here on k8s-master01):
[root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/cert/ca.pem --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --endpoints=https://172.16.60.231:2379 snapshot save /data/etcd_backup_dir/etcd-snapshot-`date +%Y%m%d`.db
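Optionally, the integrity of the snapshot file can be checked before distributing it; etcdctl's snapshot status subcommand prints the hash, revision, total key count and size of a local snapshot file (the file name below assumes the snapshot was taken on 2020-08-20, as in the rest of this article):
# ETCDCTL_API=3 etcdctl snapshot status /data/etcd_backup_dir/etcd-snapshot-20200820.db --write-out=table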
Copy the snapshot file to the other two etcd nodes:
[root@k8s-master01 ~]# rsync -e "ssh -p22" -avpgolr /data/etcd_backup_dir/etcd-snapshot-20200820.db root@k8s-master02:/data/etcd_backup_dir/
[root@k8s-master01 ~]# rsync -e "ssh -p22" -avpgolr /data/etcd_backup_dir/etcd-snapshot-20200820.db root@k8s-master03:/data/etcd_backup_dir/
The etcd backup command above on the k8s-master01 node can be put into a script and combined with crontab for scheduled backups:
[root@k8s-master01 ~]# cat /data/etcd_backup_dir/etcd_backup.sh
#!/usr/bin/bash
date
CACERT="/etc/kubernetes/cert/ca.pem"
CERT="/etc/etcd/cert/etcd.pem"
KEY="/etc/etcd/cert/etcd-key.pem"
ENDPOINTS="172.16.60.231:2379"
ETCDCTL_API=3 /opt/k8s/bin/etcdctl \
--cacert="${CACERT}" --cert="${CERT}" --key="${KEY}" \
--endpoints=${ENDPOINTS} \
snapshot save /data/etcd_backup_dir/etcd-snapshot-`date +%Y%m%d`.db
# keep backups for 30 days
find /data/etcd_backup_dir/ -name "*.db" -mtime +30 -exec rm -f {} \;
# sync to the other two etcd nodes
/bin/rsync -e "ssh -p5522" -avpgolr --delete /data/etcd_backup_dir/ root@k8s-master02:/data/etcd_backup_dir/
/bin/rsync -e "ssh -p5522" -avpgolr --delete /data/etcd_backup_dir/ root@k8s-master03:/data/etcd_backup_dir/
Set up a crontab entry to run the backup every day at 05:00:
[root@k8s-master01 ~]# chmod 755 /data/etcd_backup_dir/etcd_backup.sh
[root@k8s-master01 ~]# crontab -l
# etcd cluster data backup
0 5 * * * /bin/bash -x /data/etcd_backup_dir/etcd_backup.sh > /dev/null 2>&1
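Before relying on the cron job, it is sensible to run the script once by hand and confirm that a fresh snapshot file shows up in the backup directory (a simple manual check, not part of the original article):
# /bin/bash -x /data/etcd_backup_dir/etcd_backup.sh
# ls -lh /data/etcd_backup_dir/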
2. etcd Cluster Restore
The etcd cluster backup only needs to be performed on one of the etcd nodes, with the backup file then copied to the other nodes.
The etcd cluster restore, however, must be performed on every etcd node!
1) Simulate etcd cluster data loss
Delete the data on all three etcd cluster nodes (or simply delete the data directory):
# rm -rf /data/k8s/etcd/data/*
Check the k8s cluster status:
[root@k8s-master01 ~]# kubectl get cs
NAME                 STATUS      MESSAGE                                                                                             ERROR
etcd-2               Unhealthy   Get https://172.16.60.233:2379/health: dial tcp 172.16.60.233:2379: connect: connection refused
etcd-1               Unhealthy   Get https://172.16.60.232:2379/health: dial tcp 172.16.60.232:2379: connect: connection refused
etcd-0               Unhealthy   Get https://172.16.60.231:2379/health: dial tcp 172.16.60.231:2379: connect: connection refused
scheduler            Healthy     ok
controller-manager   Healthy     ok
Since the etcd service on all three nodes is still running at this point, checking again after a short while shows the cluster status back to normal:
[root@k8s-master01 ~]# kubectl get cs
NAME                 STATUS    MESSAGE              ERROR
controller-manager   Healthy   ok
scheduler            Healthy   ok
etcd-0               Healthy   {"health":"true"}
etcd-2               Healthy   {"health":"true"}
etcd-1               Healthy   {"health":"true"}
[root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --cacert=/etc/kubernetes/cert/ca.pem endpoint health
https://172.16.60.231:2379 is healthy: successfully committed proposal: took = 9.918673ms
https://172.16.60.233:2379 is healthy: successfully committed proposal: took = 10.985279ms
https://172.16.60.232:2379 is healthy: successfully committed proposal: took = 13.422545ms
[root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --cacert=/etc/kubernetes/cert/ca.pem member list --write-out=table
+------------------+---------+------------+----------------------------+----------------------------+------------+
|        ID        | STATUS  |    NAME    |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+------------+----------------------------+----------------------------+------------+
| 1d1d7edbba38c293 | started | k8s-etcd03 | https://172.16.60.233:2380 | https://172.16.60.233:2379 |      false |
| 4c0cfad24e92e45f | started | k8s-etcd02 | https://172.16.60.232:2380 | https://172.16.60.232:2379 |      false |
| 79cf4f0a8c3da54b | started | k8s-etcd01 | https://172.16.60.231:2380 | https://172.16.60.231:2379 |      false |
+------------------+---------+------------+----------------------------+----------------------------+------------+
As seen above, none of the three etcd nodes reports itself as leader, i.e. the cluster has not elected a leader. The etcd service on all three nodes therefore needs to be restarted:
# systemctl restart etcd
After the restart, checking again shows that the etcd cluster has successfully elected a leader and the cluster status is normal!
[root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl -w table --cacert=/etc/kubernetes/cert/ca.pem --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" endpoint status
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://172.16.60.231:2379 | 79cf4f0a8c3da54b |   3.4.9 |  1.6 MB |      true |      false |         5 |      24658 |              24658 |        |
| https://172.16.60.232:2379 | 4c0cfad24e92e45f |   3.4.9 |  1.6 MB |     false |      false |         5 |      24658 |              24658 |        |
| https://172.16.60.233:2379 | 1d1d7edbba38c293 |   3.4.9 |  1.7 MB |     false |      false |         5 |      24658 |              24658 |        |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
However, the k8s cluster data is in fact gone: the Pods and other resources in the namespaces no longer exist. This is where the etcd cluster backup, i.e. the snapshot file taken above, is needed for recovery.
[root@k8s-master01 ~]# kubectl get ns
NAME              STATUS   AGE
default           Active   9m47s
kube-node-lease   Active   9m39s
kube-public       Active   9m39s
kube-system       Active   9m47s
[root@k8s-master01 ~]# kubectl get pods -n kube-system
No resources found in kube-system namespace.
[root@k8s-master01 ~]# kubectl get pods --all-namespaces
No resources found
2) Restore the etcd cluster data, i.e. the kubernetes cluster data. Before restoring the etcd data, first stop the kube-apiserver service on all master nodes, then the etcd service on all etcd nodes:
# systemctl stop kube-apiserver
# systemctl stop etcd
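If passwordless SSH between the nodes is available (the rsync steps above suggest it is), the services can be stopped from k8s-master01 with a small loop like the sketch below; the host names are the ones used in this article, and SSH port 22 is an assumption:
# stop kube-apiserver on all master nodes first, then etcd on all etcd nodes
for node in k8s-master01 k8s-master02 k8s-master03; do ssh -p22 root@${node} "systemctl stop kube-apiserver"; done
for node in k8s-master01 k8s-master02 k8s-master03; do ssh -p22 root@${node} "systemctl stop etcd"; done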
Special note: before restoring the etcd cluster data, be sure to delete the old data and wal working directories on all etcd nodes first, i.e. the /data/k8s/etcd/data and /data/k8s/etcd/wal directories here; otherwise the restore may fail (the restore command reports that the data directory already exists).
# rm -rf /data/k8s/etcd/data && rm -rf /data/k8s/etcd/wal
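A slightly safer variant (my own suggestion, not from the original article) is to move the old directories aside instead of deleting them, so they can still be examined if the restore goes wrong:
# mv /data/k8s/etcd/data /data/k8s/etcd/data.bak-$(date +%Y%m%d)
# mv /data/k8s/etcd/wal /data/k8s/etcd/wal.bak-$(date +%Y%m%d)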
Run the restore operation on every etcd node (a consolidated helper-script sketch follows the three command blocks below):
172.16.60.231 node
-------------------------------------------------------
ETCDCTL_API=3 etcdctl \
--name=k8s-etcd01 \
--endpoints="https://172.16.60.231:2379" \
--cert=/etc/etcd/cert/etcd.pem \
--key=/etc/etcd/cert/etcd-key.pem \
--cacert=/etc/kubernetes/cert/ca.pem \
--initial-cluster-token=etcd-cluster-0 \
--initial-advertise-peer-urls=https://172.16.60.231:2380 \
--initial-cluster=k8s-etcd01=https://172.16.60.231:2380,k8s-etcd02=https://172.16.60.232:2380,k8s-etcd03=https://172.16.60.233:2380 \
--data-dir=/data/k8s/etcd/data \
--wal-dir=/data/k8s/etcd/wal \
snapshot restore /data/etcd_backup_dir/etcd-snapshot-20200820.db

172.16.60.232 node
-------------------------------------------------------
ETCDCTL_API=3 etcdctl \
--name=k8s-etcd02 \
--endpoints="https://172.16.60.232:2379" \
--cert=/etc/etcd/cert/etcd.pem \
--key=/etc/etcd/cert/etcd-key.pem \
--cacert=/etc/kubernetes/cert/ca.pem \
--initial-cluster-token=etcd-cluster-0 \
--initial-advertise-peer-urls=https://172.16.60.232:2380 \
--initial-cluster=k8s-etcd01=https://172.16.60.231:2380,k8s-etcd02=https://172.16.60.232:2380,k8s-etcd03=https://172.16.60.233:2380 \
--data-dir=/data/k8s/etcd/data \
--wal-dir=/data/k8s/etcd/wal \
snapshot restore /data/etcd_backup_dir/etcd-snapshot-20200820.db

172.16.60.233 node
-------------------------------------------------------
ETCDCTL_API=3 etcdctl \
--name=k8s-etcd03 \
--endpoints="https://172.16.60.233:2379" \
--cert=/etc/etcd/cert/etcd.pem \
--key=/etc/etcd/cert/etcd-key.pem \
--cacert=/etc/kubernetes/cert/ca.pem \
--initial-cluster-token=etcd-cluster-0 \
--initial-advertise-peer-urls=https://172.16.60.233:2380 \
--initial-cluster=k8s-etcd01=https://172.16.60.231:2380,k8s-etcd02=https://172.16.60.232:2380,k8s-etcd03=https://172.16.60.233:2380 \
--data-dir=/data/k8s/etcd/data \
--wal-dir=/data/k8s/etcd/wal \
snapshot restore /data/etcd_backup_dir/etcd-snapshot-20200820.db
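The three blocks above differ only in the node name and its IP address, so on each node they can be driven by one small helper script; this is only a sketch (the script name and its two positional arguments are my own, everything else comes from the commands above):
#!/usr/bin/bash
# usage: bash etcd_restore.sh <node-name> <node-ip>   (hypothetical script, run locally on the node being restored)
# e.g. on the first node: bash etcd_restore.sh k8s-etcd01 172.16.60.231
NODE_NAME=$1
NODE_IP=$2
SNAPSHOT=/data/etcd_backup_dir/etcd-snapshot-20200820.db
ETCDCTL_API=3 etcdctl \
--name=${NODE_NAME} \
--endpoints="https://${NODE_IP}:2379" \
--cert=/etc/etcd/cert/etcd.pem \
--key=/etc/etcd/cert/etcd-key.pem \
--cacert=/etc/kubernetes/cert/ca.pem \
--initial-cluster-token=etcd-cluster-0 \
--initial-advertise-peer-urls=https://${NODE_IP}:2380 \
--initial-cluster=k8s-etcd01=https://172.16.60.231:2380,k8s-etcd02=https://172.16.60.232:2380,k8s-etcd03=https://172.16.60.233:2380 \
--data-dir=/data/k8s/etcd/data \
--wal-dir=/data/k8s/etcd/wal \
snapshot restore ${SNAPSHOT}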
Then start the etcd service on all etcd nodes in turn:
# systemctl start etcd
# systemctl status etcd
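If any node fails to come back, its startup logs can be inspected via systemd (assuming etcd runs as a systemd unit, as in the commands above):
# journalctl -u etcd --no-pager -n 50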
Check the etcd cluster status (as shown below, the etcd cluster has successfully elected a leader):
[root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --cacert=/etc/kubernetes/cert/ca.pem endpoint health
https://172.16.60.232:2379 is healthy: successfully committed proposal: took = 12.837393ms
https://172.16.60.233:2379 is healthy: successfully committed proposal: took = 13.306671ms
https://172.16.60.231:2379 is healthy: successfully committed proposal: took = 13.602805ms
[root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl -w table --cacert=/etc/kubernetes/cert/ca.pem --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" endpoint status
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://172.16.60.231:2379 | 79cf4f0a8c3da54b |   3.4.9 |  9.0 MB |     false |      false |         2 |         13 |                 13 |        |
| https://172.16.60.232:2379 | 4c0cfad24e92e45f |   3.4.9 |  9.0 MB |      true |      false |         2 |         13 |                 13 |        |
| https://172.16.60.233:2379 | 5f70664d346a6ebd |   3.4.9 |  9.0 MB |     false |      false |         2 |         13 |                 13 |        |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Then start the kube-apiserver service on all master nodes in turn:
# systemctl start kube-apiserver
# systemctl status kube-apiserver
Check the kubernetes cluster status:
[root@k8s-master01 ~]# kubectl get cs
NAME                 STATUS      MESSAGE                                  ERROR
controller-manager   Healthy     ok
scheduler            Healthy     ok
etcd-2               Unhealthy   HTTP probe failed with statuscode: 503
etcd-1               Unhealthy   HTTP probe failed with statuscode: 503
etcd-0               Unhealthy   HTTP probe failed with statuscode: 503

Since the etcd service has only just been restarted, refresh the status a few times and it returns to normal:
[root@k8s-master01 ~]# kubectl get cs
NAME                 STATUS    MESSAGE              ERROR
controller-manager   Healthy   ok
scheduler            Healthy   ok
etcd-2               Healthy   {"health":"true"}
etcd-0               Healthy   {"health":"true"}
etcd-1               Healthy   {"health":"true"}
Check the kubernetes resources:
[root@k8s-master01 ~]# kubectl get ns
NAME              STATUS   AGE
default           Active   7d4h
kevin             Active   5d18h
kube-node-lease   Active   7d4h
kube-public       Active   7d4h
kube-system       Active   7d4h
[root@k8s-master01 ~]# kubectl get pods --all-namespaces
NAMESPACE     NAME                                       READY   STATUS              RESTARTS   AGE
default       dnsutils-ds-22q87                          0/1     ContainerCreating   171        7d3h
default       dnsutils-ds-bp8tm                          0/1     ContainerCreating   138        5d18h
default       dnsutils-ds-bzzqg                          0/1     ContainerCreating   138        5d18h
default       dnsutils-ds-jcvng                          1/1     Running             171        7d3h
default       dnsutils-ds-xrl2x                          0/1     ContainerCreating   138        5d18h
default       dnsutils-ds-zjg5l                          1/1     Running             0          7d3h
default       kevin-t-84cdd49d65-ck47f                   0/1     ContainerCreating   0          2d2h
default       nginx-ds-98rm2                             1/1     Running             2          7d3h
default       nginx-ds-bbx68                             1/1     Running             0          7d3h
default       nginx-ds-kfctv                             0/1     ContainerCreating   1          5d18h
default       nginx-ds-mdcd9                             0/1     ContainerCreating   1          5d18h
default       nginx-ds-ngqcm                             1/1     Running             0          7d3h
default       nginx-ds-tpcxs                             0/1     ContainerCreating   1          5d18h
kevin         nginx-ingress-controller-797ffb479-vrq6w   0/1     ContainerCreating   0          5d18h
kevin         test-nginx-7d4f96b486-qd4fl                0/1     ContainerCreating   0          2d1h
kevin         test-nginx-7d4f96b486-qfddd                0/1     Running             0          2d1h
kube-system   calico-kube-controllers-578894d4cd-9rp4c   1/1     Running             1          7d3h
kube-system   calico-node-d7wq8                          0/1     PodInitializing     1          7d3h
3. Summary
Backing up a Kubernetes cluster is essentially backing up the etcd cluster. When restoring, the key is the overall order: stop kube-apiserver --> stop etcd --> restore the data --> start etcd --> start kube-apiserver. Special notes:
- When backing up the etcd cluster, only one etcd node's data needs to be backed up, and the backup is then synced to the other nodes.
- When restoring the etcd data, that single node's backup file is restored on every node (see the condensed sketch below).
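Condensed into command form, the recovery order described above looks roughly like this (a recap of the steps in section 2; each command runs on the nodes indicated in brackets):
# systemctl stop kube-apiserver        (on every master node)
# systemctl stop etcd                  (on every etcd node)
# rm -rf /data/k8s/etcd/data /data/k8s/etcd/wal        (on every etcd node)
# ETCDCTL_API=3 etcdctl ... snapshot restore /data/etcd_backup_dir/etcd-snapshot-20200820.db   (on every etcd node, with that node's --name and peer URLs as shown above)
# systemctl start etcd                 (on every etcd node)
# systemctl start kube-apiserver       (on every master node)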
Reposted from:
K8S cluster disaster-recovery environment deployment - 散盡浮華 - 博客园 (cnblogs)
https://www.cnblogs.com/kevingrace/p/14616824.html