Kubernetes master node outage recovery notes: MountVolume.SetUp failed for volume "kube-dns-config"
阿新 • Published: 2019-01-01
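This morning I found that a Kubernetes cluster that had been running fine had stopped working: the dashboard would not load, and docker ps on the master node showed no running containers. Restarting kubelet brought things back briefly, then the cluster became unavailable again. After restarting and watching it several times, I found that etcd kept restarting and eventually failed, which caused the other components to fail one after another.
The etcd log was as follows: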
2018-02-06 02:25:24.564269 I | etcdmain: etcd Version: 3.0.17
2018-02-06 02:25:24.564531 I | etcdmain: Git SHA: cc198e2
2018-02-06 02:25:24.564544 I | etcdmain: Go Version: go1.6.4
2018-02-06 02:25:24.564554 I | etcdmain: Go OS/Arch: linux/amd64
2018-02-06 02:25:24.564563 I | etcdmain: setting maximum number of CPUs to 8, total number of available CPUs is 8
2018-02-06 02:25:24.564636 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2018-02-06 02:25:24.565021 I | etcdmain: listening for peers on http://localhost:2380
2018-02-06 02:25:24.565114 I | etcdmain: listening for client requests on 127.0.0.1:2379
2018-02-06 02:25:24.568360 I | etcdserver: recovered store from snapshot at index 5900621
2018-02-06 02:25:24.568399 I | etcdserver: name = default
2018-02-06 02:25:24.568406 I | etcdserver: data dir = /var/lib/etcd
2018-02-06 02:25:24.568413 I | etcdserver: member dir = /var/lib/etcd/member
2018-02-06 02:25:24.568418 I | etcdserver: heartbeat = 100ms
2018-02-06 02:25:24.568423 I | etcdserver: election = 1000ms
2018-02-06 02:25:24.568428 I | etcdserver: snapshot count = 10000
2018-02-06 02:25:24.568464 I | etcdserver: advertise client URLs = http://127.0.0.1:2379
2018-02-06 02:25:24.760641 I | etcdserver: restarting member 8e9e05c52164694d in cluster cdf818194e3a8c32 at commit index 5904480
2018-02-06 02:25:24.760850 I | raft: 8e9e05c52164694d became follower at term 12
2018-02-06 02:25:24.760904 I | raft: newRaft 8e9e05c52164694d [peers: [8e9e05c52164694d], term: 12, commit: 5904480, applied: 5900621, lastindex: 5904480, lastterm: 12]
2018-02-06 02:25:24.761062 I | api: enabled capabilities for version 3.0
2018-02-06 02:25:24.761111 I | membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32 from store
2018-02-06 02:25:24.761125 I | membership: set the cluster version to 3.0 from store
2018-02-06 02:25:24.829180 I | mvcc: restore compact to 2314830
2018-02-06 02:25:24.975979 I | etcdserver: starting server... [version: 3.0.17, cluster version: 3.0]
2018-02-06 02:25:25.661519 I | raft: 8e9e05c52164694d is starting a new election at term 12
2018-02-06 02:25:25.661581 I | raft: 8e9e05c52164694d became candidate at term 13
2018-02-06 02:25:25.661596 I | raft: 8e9e05c52164694d received vote from 8e9e05c52164694d at term 13
2018-02-06 02:25:25.661620 I | raft: 8e9e05c52164694d became leader at term 13
2018-02-06 02:25:25.661645 I | raft: raft.node: 8e9e05c52164694d elected leader 8e9e05c52164694d at term 13
2018-02-06 02:25:25.663336 I | etcdserver: published {Name:default ClientURLs:[http://127.0.0.1:2379]} to cluster cdf818194e3a8c32
2018-02-06 02:25:25.663379 I | etcdmain: ready to serve client requests
2018-02-06 02:25:25.663797 N | etcdmain: serving insecure client requests on 127.0.0.1:2379, this is strongly discouraged!
2018-02-06 02:25:37.403730 W | etcdserver: apply entries took too long [25.315155ms for 1 entries]
2018-02-06 02:25:37.403752 W | etcdserver: avoid queries with large range/delete range!
2018-02-06 02:25:49.930163 N | osutil: received terminated signal, shutting down...
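The last line of the log is the telling one: etcd did not crash on its own, it was sent a termination signal. A minimal sketch of how one might check a saved log for that pattern is shown here. The function name etcd_exit_reason and the idea of saving the log to a file first (for example with docker logs redirected to a file) are assumptions for illustration, not something the original post did.

```shell
# Hypothetical helper: classify how a saved etcd log ended.
# $1 is the path to a log file captured beforehand.
etcd_exit_reason() {
  if grep -q 'osutil: received terminated signal' "$1"; then
    # etcd was asked to stop from outside (e.g. by whatever manages the container)
    echo "terminated-by-signal"
  elif grep -qE ' (E|C) \| ' "$1"; then
    # an error (E) or critical (C) level line is present in the log
    echo "error-logged"
  else
    echo "unknown"
  fi
}
```

If this prints terminated-by-signal, the thing to investigate is whatever is stopping etcd rather than etcd itself, which matches how this incident turned out.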
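I looked through a lot of material, and most of it blamed the etcd version. But the cluster had been working fine with that version until now, so that was clearly not the problem.
After a lot of effort, the cause turned out to be that the master node was running out of disk space: usage of / had reached 92%.
After deleting some files, disk usage dropped to 70%. I restarted kubelet, and etcd no longer kept restarting and exiting, but kube-dns still would not start.
The error shown was: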
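The post does not say why 92% was enough to take etcd down. One plausible explanation, which is an assumption and not confirmed by the post, is kubelet's disk-pressure eviction: its default hard threshold is nodefs.available<10%, and 92% usage leaves less than that. Either way, checking disk usage early would have shortened the search. A minimal sketch follows; the function names and the default limit of 90 are illustrative choices.

```shell
# Print the used percentage (integer, no % sign) of the filesystem holding $1.
# df -P forces the POSIX one-line-per-filesystem layout; column 5 is capacity.
disk_used_pct() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Warn when usage reaches a limit (default 90, an assumed value).
# Returns 1 on warning so it can be used in scripts or cron jobs.
check_disk() {
  path="$1"
  limit="${2:-90}"
  used=$(disk_used_pct "$path")
  if [ "$used" -ge "$limit" ]; then
    echo "WARN: $path is ${used}% full (limit ${limit}%)"
    return 1
  fi
  echo "OK: $path is ${used}% full"
}
```

Running check_disk / on the master node reports the usage of the root filesystem; du -sh on suspect directories can then narrow down what is taking the space.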
Events:
FirstSeen  LastSeen  Count  From  SubObjectPath  Type  Reason  Message
---------  --------  -----  ----  -------------  --------  ------  -------
1h  35m  36  kubelet, inm-bj-vip-ms04  Warning  FailedMount  Unable to mount volumes for pod "kube-dns-2425271678-bgzzp_kube-system(d7381d67-fb52-11e7-a2f6-0050568b6632)": timeout expired waiting for volumes to attach/mount for pod "kube-system"/"kube-dns-2425271678-bgzzp". list of unattached/unmounted volumes=[kube-dns-config]
1h  35m  36  kubelet, inm-bj-vip-ms04  Warning  FailedSync  Error syncing pod
1h  34m  48  kubelet, inm-bj-vip-ms04  Warning  FailedMount  MountVolume.SetUp failed for volume "kube-dns-config" : stat /var/lib/kubelet/pods/d7381d67-fb52-11e7-a2f6-0050568b6632/volumes/kubernetes.io~configmap/kube-dns-config: no such file or directory
30m  30m  1  kubelet, inm-bj-vip-ms04  Normal  SuccessfulMountVolume  MountVolume.SetUp succeeded for volume "kube-dns-token-3hdnc"
28m  10m  9  kubelet, inm-bj-vip-ms04  Warning  FailedMount  Unable to mount volumes for pod "kube-dns-2425271678-bgzzp_kube-system(d7381d67-fb52-11e7-a2f6-0050568b6632)": timeout expired waiting for volumes to attach/mount for pod "kube-system"/"kube-dns-2425271678-bgzzp". list of unattached/unmounted volumes=[kube-dns-config]
28m  10m  9  kubelet, inm-bj-vip-ms04  Warning  FailedSync  Error syncing pod
30m  10m  18  kubelet, inm-bj-vip-ms04  Warning  FailedMount  MountVolume.SetUp failed for volume "kube-dns-config" : stat /var/lib/kubelet/pods/d7381d67-fb52-11e7-a2f6-0050568b6632/volumes/kubernetes.io~configmap/kube-dns-config: no such file or directory
10m  10m  1  kubelet, inm-bj-vip-ms04  Normal  SuccessfulMountVolume  MountVolume.SetUp succeeded for volume "kube-dns-token-3hdnc"
10m  9m  6  kubelet, inm-bj-vip-ms04  Warning  FailedMount  MountVolume.SetUp failed for volume "kube-dns-config" : stat /var/lib/kubelet/pods/d7381d67-fb52-11e7-a2f6-0050568b6632/volumes/kubernetes.io~configmap/kube-dns-config: no such file or directory
9m  9m  1  kubelet, inm-bj-vip-ms04  Normal  SuccessfulMountVolume  MountVolume.SetUp succeeded for volume "kube-dns-token-3hdnc"
7m  5m  2  kubelet, inm-bj-vip-ms04  Warning  FailedMount  Unable to mount volumes for pod "kube-dns-2425271678-bgzzp_kube-system(d7381d67-fb52-11e7-a2f6-0050568b6632)": timeout expired waiting for volumes to attach/mount for pod "kube-system"/"kube-dns-2425271678-bgzzp". list of unattached/unmounted volumes=[kube-dns-config]
7m  5m  2  kubelet, inm-bj-vip-ms04  Warning  FailedSync  Error syncing pod
9m  3m  11  kubelet, inm-bj-vip-ms04  Warning  FailedMount  MountVolume.SetUp failed for volume "kube-dns-config" : stat /var/lib/kubelet/pods/d7381d67-fb52-11e7-a2f6-0050568b6632/volumes/kubernetes.io~configmap/kube-dns-config: no such file or directory
1m  1m  1  kubelet, inm-bj-vip-ms04  Normal  SuccessfulMountVolume  MountVolume.SetUp succeeded for volume "kube-dns-token-3hdnc"
1m  50s  8  kubelet, inm-bj-vip-ms04  Warning  FailedMount  MountVolume.SetUp failed for volume "kube-dns-config" : stat /var/lib/kubelet/pods/d7381d67-fb52-11e7-a2f6-0050568b6632/volumes/kubernetes.io~configmap/kube-dns-config: no such file or directory
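The useful detail in these events is the directory name under /var/lib/kubelet/pods, which is the pod's UID. As a rough sketch, the UID can be pulled out of the event text instead of being copied by hand; the function name extract_pod_uids is made up for this example.

```shell
# Read event text on stdin and print each distinct pod UID that appears
# as a directory under /var/lib/kubelet/pods.
extract_pod_uids() {
  grep -oE '/var/lib/kubelet/pods/[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}' \
    | sed 's|.*/||' \
    | sort -u
}
```

For example, piping the output of kubectl describe pod kube-dns-2425271678-bgzzp -n kube-system into extract_pod_uids should print the UID seen in the events above.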
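It looked like a volume mount problem, but reinstalling the network made no difference ^~^|||
In the end I went for the heavy-handed fix: stop kubelet, remove all containers, stop docker, then go into /var/lib/kubelet/pods and move the kube-dns pod's directory out of the way. The directory is named after the pod's UID, which appears in the error log; here it is d7381d67-fb52-11e7-a2f6-0050568b6632. After restarting docker and kubelet, the DNS service recovered!
For reference, here are the commands I ran, in order, to recover DNS: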
# systemctl stop kubelet
# docker kill $(docker ps -a -q)
# docker rm $(docker ps -a -q)
# systemctl stop docker
# cd /var/lib/kubelet/pods
# mv d7381d67-fb52-11e7-a2f6-0050568b6632 /tmp
# systemctl start docker
# systemctl start kubelet
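Since this sequence removes every container on the node, it may be worth reviewing the exact commands before running them. The sketch below only prints the same sequence for a given pod UID and runs nothing itself. The function name print_recovery_plan and the UID format check are additions for illustration, and the cd and mv steps are folded into a single mv with an absolute path.

```shell
# Print (do not run) the recovery commands for one pod UID.
# $1: pod UID; $2: kubelet pods directory (defaults to /var/lib/kubelet/pods).
print_recovery_plan() {
  pod_uid="$1"
  pods_dir="${2:-/var/lib/kubelet/pods}"
  # Refuse anything that is not UID-shaped, so a typo cannot point mv
  # at the wrong directory.
  if ! printf '%s\n' "$pod_uid" \
      | grep -qE '^[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}$'; then
    echo "error: '$pod_uid' does not look like a pod UID" >&2
    return 1
  fi
  # \$ keeps the command substitutions literal in the printed plan.
  cat <<EOF
systemctl stop kubelet
docker kill \$(docker ps -a -q)
docker rm \$(docker ps -a -q)
systemctl stop docker
mv $pods_dir/$pod_uid /tmp/
systemctl start docker
systemctl start kubelet
EOF
}
```

Calling print_recovery_plan d7381d67-fb52-11e7-a2f6-0050568b6632 prints the seven commands; once they look right, they can be run by hand as root. Moving the directory to /tmp rather than deleting it keeps a copy in case something needs to be inspected afterwards.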