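Analyzing Pod Status When a Node Fails in a Kubernetes Cluster
Abstract: Node NotReady is a common occurrence in Kubernetes clusters, so we need to understand how the Pods of each workload type behave when it happens. This article only summarizes the observed behavior and does not walk through the underlying logic, because that logic is mainly the Node Controller's. I have written a four-part blog series on the Node Controller that you can refer to.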
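Pod status changes when the kubelet process fails
Suppose a node is running pods and we stop its kubelet process. Will the pods on that node be killed? Will they be recreated on other nodes?
Conclusion:
(1) The Node status changes to NotReady.
(2) Pod status does not change during the first 5 minutes. After 5 minutes: DaemonSet Pods change to NodeLost; Deployment, StatefulSet, and Static Pods first change to NodeLost and then immediately to Unknown. Deployment pods are recreated, but if the Deployment's node selector pins it to the node whose kubelet was stopped, the recreated pods stay in Pending. Static Pods and StatefulSet Pods remain in Unknown.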
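Pod behavior after the kubelet recovers
If the kubelet comes back up 10 minutes later, what happens to the node and the pods?
Conclusion:
(1) The Node status changes to Ready.
(2) DaemonSet pods are not recreated; the old pods go straight back to Running.
(3) For a Deployment, the old Pod on the node whose kubelet was stopped is deleted. (The likely reason: the old Pod's status changes in the cluster, but at that point the Deployment already has enough Pod instances, so the old Pod is removed.)
(4) StatefulSet Pods are recreated.
(5) Static Pods are not restarted, but their reported running time is reset to 0 when the kubelet comes up.
To spell out the StatefulSet case: after the kubelet stops, the StatefulSet pod becomes NodeLost and then Unknown, but it is not restarted. It is recreated only after the kubelet comes back up.
As for Static Pods, they do not appear to be restarted when the kubelet restarts, yet the running time shown when querying their status in the cluster does change.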
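Why StatefulSet Pods are not recreated when a Node fails
After a Node goes down, the StatefulSet Pods are not rebuilt. Why?
In the node controller we found that, for every pod except DaemonSet pods, it calls the delete pod API to delete the pod.
However, calling the delete pod API does not remove the pod object from apiserver/etcd. It only sets the pod's deletionTimestamp, marking the pod for deletion. The actual removal is performed by the kubelet: once it has gracefully terminated the pod, the kubelet deletes the pod object for real. Only then does the statefulset controller notice that a replica is missing and recreate the pod.
In this case, though, the kubelet is down and cannot communicate with the master, so the Pod object is never deleted from etcd. If the Pod object could be deleted, the Pod could be rebuilt on another Node.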
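The two-phase deletion described above can be sketched as a small simulation. This is only an illustration under stated assumptions: Store, Pod, RequestDelete, and KubeletConfirm are hypothetical names invented here, and the in-memory map stands in for apiserver/etcd. It is not the real apiserver or kubelet code.

```go
package main

import (
	"fmt"
	"time"
)

// Pod is a hypothetical, stripped-down stand-in for a stored pod object.
type Pod struct {
	Name              string
	DeletionTimestamp *time.Time // nil until a delete has been requested
}

// Store is a hypothetical in-memory stand-in for apiserver/etcd.
type Store struct {
	pods map[string]*Pod
}

func NewStore() *Store { return &Store{pods: map[string]*Pod{}} }

func (s *Store) Add(name string) { s.pods[name] = &Pod{Name: name} }

func (s *Store) Exists(name string) bool {
	_, ok := s.pods[name]
	return ok
}

// RequestDelete models phase 1. A graceful delete only marks the pod by
// setting its deletionTimestamp; a grace period of 0 removes it at once.
func (s *Store) RequestDelete(name string, gracePeriodSeconds int64) {
	pod, ok := s.pods[name]
	if !ok {
		return
	}
	if gracePeriodSeconds == 0 {
		delete(s.pods, name)
		return
	}
	now := time.Now()
	pod.DeletionTimestamp = &now
}

// KubeletConfirm models phase 2. Only a live kubelet can finish the
// deletion of a pod that has already been marked for deletion.
func (s *Store) KubeletConfirm(name string, kubeletAlive bool) {
	pod, ok := s.pods[name]
	if !ok || pod.DeletionTimestamp == nil || !kubeletAlive {
		return
	}
	delete(s.pods, name)
}

func main() {
	s := NewStore()
	s.Add("web-0")

	// The node controller asks for a graceful delete while the kubelet is down.
	s.RequestDelete("web-0", 30)
	s.KubeletConfirm("web-0", false)
	fmt.Println("kubelet down, object still stored:", s.Exists("web-0")) // true

	// Once the kubelet is back, it completes the deletion.
	s.KubeletConfirm("web-0", true)
	fmt.Println("kubelet back, object still stored:", s.Exists("web-0")) // false
}
```

While the object is still stored, a controller that keys replicas by name sees no missing replica, which matches the observation that nothing is recreated until the kubelet returns.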
Also note that the statefulset controller only deletes a Pod for which isFailed is true, whereas the Pods here are in the Unknown state.
// delete and recreate failed pods
if isFailed(replicas[i]) {
	ssc.recorder.Eventf(set, v1.EventTypeWarning, "RecreatingFailedPod",
		"StatefulSetPlus %s/%s is recreating failed Pod %s",
		set.Namespace,
		set.Name,
		replicas[i].Name)
	if err := ssc.podControl.DeleteStatefulPlusPod(set, replicas[i]); err != nil {
		return &status, err
	}
	if getPodRevision(replicas[i]) == currentRevision.Name {
		status.CurrentReplicas--
	}
	if getPodRevision(replicas[i]) == updateRevision.Name {
		status.UpdatedReplicas--
	}
	status.Replicas--
	replicas[i] = newVersionedStatefulSetPlusPod(
		currentSet,
		updateSet,
		currentRevision.Name,
		updateRevision.Name,
		i)
}
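To make the isFailed condition concrete, here is a minimal, self-contained sketch. It assumes isFailed is a plain check for phase Failed, which is how the upstream StatefulSet controller defines it as far as I know; the StatefulSetPlus variant shown above may differ. Pod, PodPhase, and podsToRecreate are simplified stand-ins written for this example, not the real v1.Pod types.

```go
package main

import "fmt"

// PodPhase mirrors the string values of the pod phase field.
type PodPhase string

const (
	PodRunning PodPhase = "Running"
	PodFailed  PodPhase = "Failed"
	PodUnknown PodPhase = "Unknown"
)

// Pod is a hypothetical, minimal stand-in for v1.Pod.
type Pod struct {
	Name  string
	Phase PodPhase
}

// isFailed is assumed to be a plain phase comparison (see the lead-in).
func isFailed(p Pod) bool {
	return p.Phase == PodFailed
}

// podsToRecreate returns the names of the replicas that a loop guarded by
// isFailed would delete and recreate.
func podsToRecreate(replicas []Pod) []string {
	var names []string
	for _, p := range replicas {
		if isFailed(p) {
			names = append(names, p.Name)
		}
	}
	return names
}

func main() {
	replicas := []Pod{
		{Name: "web-0", Phase: PodRunning},
		{Name: "web-1", Phase: PodUnknown}, // pod on the node whose kubelet is down
		{Name: "web-2", Phase: PodFailed},
	}
	fmt.Println(podsToRecreate(replicas)) // prints [web-2]
}
```

Under that assumption, the Unknown pod web-1 never enters the delete-and-recreate branch, which is consistent with the behavior observed above.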
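Improving StatefulSet Pod behavior
So, to protect stateful (Non-Quorum) applications when a node fails, the following behavior should be added:
- Monitor whether the node's network, kubelet process, operating system, and so on are abnormal, and treat each case differently.
- For example, if the network has failed and the Pod can no longer serve requests, run
  kubectl delete pod <pod-name> --force --grace-period=0
  to force-delete the pod from etcd.
- After the force deletion, the statefulset controller automatically recreates the pod on another Node.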
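The steps above could be sketched as a small decision helper. Everything here is hypothetical: NodeFailure, Action, decide, and forceDeleteArgs are names made up for illustration, and the policy for non-network failures (investigate rather than force-delete) is an assumption of this sketch, not something the list prescribes. The only external detail relied on is the kubectl delete command with its --namespace, --force, and --grace-period flags.

```go
package main

import (
	"fmt"
	"strings"
)

// NodeFailure is a hypothetical classification of what went wrong on a node.
type NodeFailure int

const (
	FailureNone NodeFailure = iota
	FailureNetwork
	FailureKubelet
	FailureOS
)

// Action is what a hypothetical remediation component would do next.
type Action string

const (
	ActionNone        Action = "none"
	ActionForceDelete Action = "force-delete"
	ActionInvestigate Action = "investigate"
)

// decide applies the rule from the text: a network failure that leaves the
// pod unable to serve leads to a force delete. Assumption: every other
// failure is only flagged, because with a kubelet-only failure the
// containers may still be running on the node.
func decide(f NodeFailure, podServing bool) Action {
	switch {
	case f == FailureNone:
		return ActionNone
	case f == FailureNetwork && !podServing:
		return ActionForceDelete
	default:
		return ActionInvestigate
	}
}

// forceDeleteArgs builds the kubectl arguments for a force delete.
func forceDeleteArgs(namespace, pod string) []string {
	return []string{
		"delete", "pod", pod,
		"--namespace", namespace,
		"--force", "--grace-period=0",
	}
}

func main() {
	if decide(FailureNetwork, false) == ActionForceDelete {
		// Print the command instead of executing it.
		fmt.Println("kubectl " + strings.Join(forceDeleteArgs("default", "web-0"), " "))
	}
	fmt.Println(decide(FailureKubelet, false))
}
```

The sketch only prints the command. Whether to execute it automatically is a policy choice, since force-deleting a stateful pod whose containers are still running can leave two instances with the same identity.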
Alternatively, a blunter approach is to give up on GracePeriodSeconds altogether: if the StatefulSet Pod's GracePeriodSeconds is nil or 0, the object is deleted from etcd directly.
// BeforeDelete tests whether the object can be gracefully deleted.
// If graceful is set, the object should be gracefully deleted. If gracefulPending
// is set, the object has already been gracefully deleted (and the provided grace
// period is longer than the time to deletion). An error is returned if the
// condition cannot be checked or the gracePeriodSeconds is invalid. The options
// argument may be updated with default values if graceful is true. Second place
// where we set deletionTimestamp is pkg/registry/generic/registry/store.go.
// This function is responsible for setting deletionTimestamp during gracefulDeletion,
// other one for cascading deletions.
func BeforeDelete(strategy RESTDeleteStrategy, ctx context.Context, obj runtime.Object, options *metav1.DeleteOptions) (graceful, gracefulPending bool, err error) {
	objectMeta, gvk, kerr := objectMetaAndKind(strategy, obj)
	if kerr != nil {
		return false, false, kerr
	}
	if errs := validation.ValidateDeleteOptions(options); len(errs) > 0 {
		return false, false, errors.NewInvalid(schema.GroupKind{Group: metav1.GroupName, Kind: "DeleteOptions"}, "", errs)
	}
	// Checking the Preconditions here to fail early. They'll be enforced later on when we actually do the deletion, too.
	if options.Preconditions != nil && options.Preconditions.UID != nil && *options.Preconditions.UID != objectMeta.GetUID() {
		return false, false, errors.NewConflict(schema.GroupResource{Group: gvk.Group, Resource: gvk.Kind}, objectMeta.GetName(), fmt.Errorf("the UID in the precondition (%s) does not match the UID in record (%s). The object might have been deleted and then recreated", *options.Preconditions.UID, objectMeta.GetUID()))
	}
	gracefulStrategy, ok := strategy.(RESTGracefulDeleteStrategy)
	if !ok {
		// If we're not deleting gracefully there's no point in updating Generation, as we won't update
		// the object before deleting it.
		return false, false, nil
	}
	// if the object is already being deleted, no need to update generation.
	if objectMeta.GetDeletionTimestamp() != nil {
		// if we are already being deleted, we may only shorten the deletion grace period
		// this means the object was gracefully deleted previously but deletionGracePeriodSeconds was not set,
		// so we force deletion immediately
		// IMPORTANT:
		// The deletion operation happens in two phases.
		// 1. Update to set DeletionGracePeriodSeconds and DeletionTimestamp
		// 2. Delete the object from storage.
		// If the update succeeds, but the delete fails (network error, internal storage error, etc.),
		// a resource was previously left in a state that was non-recoverable. We
		// check if the existing stored resource has a grace period as 0 and if so
		// attempt to delete immediately in order to recover from this scenario.
		if objectMeta.GetDeletionGracePeriodSeconds() == nil || *objectMeta.GetDeletionGracePeriodSeconds() == 0 {
			return false, false, nil
		}
...