The Kubernetes eviction mechanism

阿新 • Published: 2018-11-21

Eviction means driving something out: when a node becomes unhealthy, Kubernetes has mechanisms to evict the Pods running on it. The same concept exists in OpenStack's nova component.

Kubernetes currently has two eviction mechanisms, implemented by kube-controller-manager and by the kubelet respectively.

1. Eviction implemented by kube-controller-manager

kube-controller-manager is made up of multiple controllers; the eviction functionality is mainly implemented by the node controller.

kube-controller-manager provides the following startup flags to control eviction:

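pod-eviction-timeout: how long a node must have been down before eviction starts and the Pods on that node are evicted. Defaults to 5min.
node-eviction-rate: the eviction rate, implemented with a token-bucket rate limiter. Defaults to 0.1, i.e. 0.1 nodes per second. Note that this is the rate at which nodes are evicted, not Pods; it amounts to emptying one node every 10s.
secondary-node-eviction-rate: the secondary eviction rate. When too many nodes in the cluster are down, the eviction rate is lowered accordingly. Defaults to 0.01.
unhealthy-zone-threshold: the unhealthy-zone threshold, which determines when the secondary eviction rate kicks in. Defaults to 0.55, i.e. a zone is considered unhealthy once 55% or more of its nodes are down.
large-cluster-size-threshold: the large-cluster threshold. A zone with more nodes than this threshold is treated as a large cluster. When 55% or more of a large cluster's nodes are down, the eviction rate is lowered to 0.01; for a small cluster the rate drops straight to 0.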
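The "0.1 nodes per second is one node every 10s" arithmetic can be checked with a few lines of Go. This is only a sketch of the conversion; the helper name evictionInterval is made up for illustration and is not part of kube-controller-manager.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// evictionInterval (hypothetical helper) converts an eviction rate in nodes
// per second into the average time between two node evictions. A rate of 0
// or less means evictions are paused, reported as ok == false.
func evictionInterval(nodesPerSecond float64) (interval time.Duration, ok bool) {
	if nodesPerSecond <= 0 {
		return 0, false
	}
	return time.Duration(math.Round(float64(time.Second) / nodesPerSecond)), true
}

func main() {
	// 0.1 and 0.01 are the default primary and secondary rates quoted above.
	for _, rate := range []float64{0.1, 0.01, 0} {
		if d, ok := evictionInterval(rate); ok {
			fmt.Printf("rate %.2f: one node every %v\n", rate, d)
		} else {
			fmt.Printf("rate %.2f: eviction paused\n", rate)
		}
	}
}
```

With the secondary rate of 0.01 the same arithmetic gives one node every 100 seconds.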

The node controller code lives mainly under the pkg/controller/node directory.

1.1 zone

To control eviction, Kubernetes groups nodes into zones, mainly by putting the following labels on the nodes:

failure-domain.beta.kubernetes.io/zone
failure-domain.beta.kubernetes.io/region

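A zone's name is built from the zone and region labels above. Two nodes are in the same zone only if both their zone labels and their region labels match; otherwise they are in different zones. If both labels are empty, the node belongs to the default zone.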
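As a rough illustration of that rule, the sketch below derives a zone key from a node's labels. The function name zoneKey and the "/" separator are assumptions made for this example; the real node controller uses its own helper and key format.

```go
package main

import "fmt"

const (
	zoneLabel   = "failure-domain.beta.kubernetes.io/zone"
	regionLabel = "failure-domain.beta.kubernetes.io/region"
)

// zoneKey (hypothetical) combines the region and zone labels into one key.
// Nodes that carry neither label share the empty key, which stands in for
// the default zone. The "/" separator is an arbitrary choice for this sketch.
func zoneKey(labels map[string]string) string {
	region, zone := labels[regionLabel], labels[zoneLabel]
	if region == "" && zone == "" {
		return ""
	}
	return region + "/" + zone
}

func main() {
	a := map[string]string{regionLabel: "r1", zoneLabel: "z1"}
	b := map[string]string{regionLabel: "r1", zoneLabel: "z1"}
	c := map[string]string{regionLabel: "r2", zoneLabel: "z1"}
	fmt.Println(zoneKey(a) == zoneKey(b)) // same region and zone
	fmt.Println(zoneKey(a) == zoneKey(c)) // same zone label, different region
	fmt.Println(zoneKey(nil) == "")       // no labels: default zone
}
```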

A zone can be in one of four states:

stateInitial
stateNormal
stateFullDisruption
statePartialDisruption

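The initial state is easy to understand: when a node has just joined the cluster and its zone has only just been discovered, the zone's state is stateInitial. This lasts only a very short time. The remaining states are decided by the following function: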

func (nc *NodeController) ComputeZoneState(nodeReadyConditions []*v1.NodeCondition) (int, zoneState) {
    readyNodes := 0
    notReadyNodes := 0
    for i := range nodeReadyConditions {
        if nodeReadyConditions[i] != nil && nodeReadyConditions[i].Status == v1.ConditionTrue {
            readyNodes++
        } else {
            notReadyNodes++
        }
    }       
    switch {    
    case readyNodes == 0 && notReadyNodes > 0:
        return notReadyNodes, stateFullDisruption
    case notReadyNodes > 2 && float32(notReadyNodes)/float32(notReadyNodes+readyNodes) >= nc.unhealthyZoneThreshold:
        return notReadyNodes, statePartialDisruption
    default:
        return notReadyNodes, stateNormal
    }
}

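Note that the counts here cover the nodes of a single zone, not the whole cluster. When the zone has 0 ready nodes and more than 0 notReady nodes, every node is considered down, so the state is stateFullDisruption. When there are more than two notReady nodes and the notReady share is at least unhealthyZoneThreshold (0.55), the state is statePartialDisruption. In every other case the state is stateNormal.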
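A few concrete numbers make the rule easier to follow. The sketch below re-implements the same decision on plain counts so it can be run on its own; classifyZone and its string results are illustrative names, not the controller's API.

```go
package main

import "fmt"

// classifyZone applies the same decision as ComputeZoneState to the number
// of ready and not-ready nodes in one zone. The returned strings are labels
// chosen for this sketch.
func classifyZone(ready, notReady int, threshold float32) string {
	switch {
	case ready == 0 && notReady > 0:
		return "FullDisruption"
	case notReady > 2 && float32(notReady)/float32(ready+notReady) >= threshold:
		return "PartialDisruption"
	default:
		return "Normal"
	}
}

func main() {
	const threshold = 0.55 // default unhealthy-zone-threshold
	fmt.Println(classifyZone(0, 3, threshold)) // every node is down
	fmt.Println(classifyZone(4, 6, threshold)) // 6/10 = 0.6, above the threshold
	fmt.Println(classifyZone(1, 2, threshold)) // 2/3 is above 0.55, but only 2 nodes are down
	fmt.Println(classifyZone(9, 1, threshold)) // a single failure in a healthy zone
}
```

The third case is the interesting one: the "more than two notReady nodes" guard keeps a very small zone from being marked as partially disrupted.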

How do these four states affect the eviction rate? See the following function:

func (nc *NodeController) setLimiterInZone(zone string, zoneSize int, state zoneState) {
    switch state {
    case stateNormal:
        if nc.useTaintBasedEvictions {
            nc.zoneNotReadyOrUnreachableTainer[zone].SwapLimiter(nc.evictionLimiterQPS)
        } else {
            nc.zonePodEvictor[zone].SwapLimiter(nc.evictionLimiterQPS)
        }    
    case statePartialDisruption:
        if nc.useTaintBasedEvictions {
            nc.zoneNotReadyOrUnreachableTainer[zone].SwapLimiter(
                nc.enterPartialDisruptionFunc(zoneSize))
        } else {
            nc.zonePodEvictor[zone].SwapLimiter(
                nc.enterPartialDisruptionFunc(zoneSize))
        }    
    case stateFullDisruption:
        if nc.useTaintBasedEvictions {
            nc.zoneNotReadyOrUnreachableTainer[zone].SwapLimiter(
                nc.enterFullDisruptionFunc(zoneSize))
        } else {
            nc.zonePodEvictor[zone].SwapLimiter(
                nc.enterFullDisruptionFunc(zoneSize))
        }    
    }    
}

Here enterPartialDisruptionFunc is the function ReducedQPSFunc:
func (nc *NodeController) ReducedQPSFunc(nodeNum int) float32 {
    if int32(nodeNum) > nc.largeClusterThreshold {
        return nc.secondaryEvictionLimiterQPS
    }
    return 0
} 

and enterFullDisruptionFunc is the function HealthyQPSFunc:
func (nc *NodeController) HealthyQPSFunc(nodeNum int) float32 {
    return nc.evictionLimiterQPS 
}

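In other words, if the zone state is normal, the rate is 0.1; if the zone state is FullDisruption, the rate is also 0.1; if the zone is in PartialDisruption, the rate is 0.01 for a large cluster and drops straight to 0 for a small one.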
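The state-to-rate mapping can be condensed into one small function. This is a standalone sketch: the limits struct and qpsFor are invented for the example, and the largeClusterThreshold of 50 used in main is an assumed value.

```go
package main

import "fmt"

type zoneState int

const (
	stateNormal zoneState = iota
	statePartialDisruption
	stateFullDisruption
)

// limits (hypothetical) bundles the three settings that decide the rate.
type limits struct {
	evictionQPS           float32 // node-eviction-rate
	secondaryEvictionQPS  float32 // secondary-node-eviction-rate
	largeClusterThreshold int     // large-cluster-size-threshold
}

// qpsFor returns the eviction rate for a zone in the given state.
func (l limits) qpsFor(state zoneState, zoneSize int) float32 {
	if state == statePartialDisruption {
		if zoneSize > l.largeClusterThreshold {
			return l.secondaryEvictionQPS
		}
		return 0 // small zone: stop evicting altogether
	}
	// Normal and FullDisruption both use the primary rate.
	return l.evictionQPS
}

func main() {
	// 0.1 and 0.01 are the defaults quoted in the text; 50 is assumed here.
	l := limits{evictionQPS: 0.1, secondaryEvictionQPS: 0.01, largeClusterThreshold: 50}
	fmt.Println(l.qpsFor(stateNormal, 10))
	fmt.Println(l.qpsFor(stateFullDisruption, 10))
	fmt.Println(l.qpsFor(statePartialDisruption, 100))
	fmt.Println(l.qpsFor(statePartialDisruption, 10))
}
```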

1.2 Two different eviction methods

The node controller currently has two different eviction methods: taint-based eviction and the traditional method.

if nc.useTaintBasedEvictions {

    go wait.Until(nc.doTaintingPass, nodeEvictionPeriod, wait.NeverStop)
} else {
    go wait.Until(nc.doEvictionPass, nodeEvictionPeriod, wait.NeverStop)
}

nodeEvictionPeriod is 100ms, so doEvictionPass or doTaintingPass runs every 100ms.

1.2.1 The traditional eviction method

zonePodEvictor has type map[string]*RateLimitedTimedQueue: every zone has its own queue with a rate limiter attached. The queue holds the unready nodes whose Pods need to be evicted.
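To give a feel for how such a queue behaves, here is a toy rate-limited queue driven by a token bucket. It is a deliberately simplified stand-in, not the real RateLimitedTimedQueue: time is passed in as plain seconds, and the callback cannot ask for a retry.

```go
package main

import "fmt"

// rateLimitedQueue is a toy FIFO of node names guarded by a token bucket.
type rateLimitedQueue struct {
	qps    float64 // tokens added per second
	burst  float64 // bucket capacity
	tokens float64 // tokens currently available
	last   float64 // time of the last refill, in seconds
	nodes  []string
}

func newRateLimitedQueue(qps, burst float64) *rateLimitedQueue {
	return &rateLimitedQueue{qps: qps, burst: burst, tokens: burst}
}

func (q *rateLimitedQueue) Add(node string) { q.nodes = append(q.nodes, node) }

// Try refills the bucket for the time elapsed up to now, then hands queued
// nodes to fn, spending one token per node, until tokens or nodes run out.
// It returns the nodes that were processed in this call.
func (q *rateLimitedQueue) Try(now float64, fn func(node string)) []string {
	q.tokens += (now - q.last) * q.qps
	if q.tokens > q.burst {
		q.tokens = q.burst
	}
	q.last = now
	var processed []string
	for len(q.nodes) > 0 && q.tokens >= 1 {
		node := q.nodes[0]
		q.nodes = q.nodes[1:]
		q.tokens--
		fn(node)
		processed = append(processed, node)
	}
	return processed
}

func main() {
	q := newRateLimitedQueue(0.1, 1) // 0.1 nodes per second, burst of 1
	for _, n := range []string{"node-a", "node-b", "node-c"} {
		q.Add(n)
	}
	evict := func(node string) { fmt.Println("evicting pods on", node) }
	for _, now := range []float64{0, 5, 10} {
		fmt.Printf("t=%vs processed %d node(s)\n", now, len(q.Try(now, evict)))
	}
}
```

Calling Try every 100ms, as the node controller does, is cheap: most calls find no token and return without touching the queue.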

func (nc *NodeController) doEvictionPass() {
    nc.evictorLock.Lock()
    defer nc.evictorLock.Unlock()
    for k := range nc.zonePodEvictor {
        nc.zonePodEvictor[k].Try(func(value TimedValue) (bool, time.Duration) {
            node, err := nc.nodeLister.Get(value.Value)
            ...
            nodeUid, _ := value.UID.(string)
            remaining, err := deletePods(nc.kubeClient, nc.recorder, value.Value, nodeUid, nc.daemonSetStore)
            ...
            if remaining {
                glog.Infof("Pods awaiting deletion due to NodeController eviction")
            }
            return true, 0
        })  
    }
}

deletePods is the main function that evicts the Pods of a node:

func deletePods(kubeClient clientset.Interface, recorder record.EventRecorder, nodeName, nodeUID string, daemonStore extensionslisters.DaemonSetLister) (bool, error) {
    remaining := false
    selector := fields.OneTermEqualSelector(api.PodHostField, nodeName).String()
    options := metav1.ListOptions{FieldSelector: selector}
    pods, err := kubeClient.Core().Pods(metav1.NamespaceAll).List(options)
    var updateErrList []error
    
    ...

    for _, pod := range pods.Items {
        // Defensive check, also needed for tests.
        if pod.Spec.NodeName != nodeName {
            continue
        }

        // Set the Pod's termination reason.
        if _, err = setPodTerminationReason(kubeClient, &pod, nodeName); err != nil {
            if errors.IsConflict(err) {
                updateErrList = append(updateErrList,
                    fmt.Errorf("update status failed for pod %q: %v", format.Pod(&pod), err))
                continue
            }
        }
        // The Pod is already being deleted; skip it.
        if pod.DeletionGracePeriodSeconds != nil {
            remaining = true
            continue
        }
        // If the Pod is managed by a DaemonSet, skip it.
        _, err := daemonStore.GetPodDaemonSets(&pod)
        if err == nil {
            continue
        }
        if err := kubeClient.Core().Pods(pod.Namespace).Delete(pod.Name, nil); err != nil {
            return false, err
        }
        remaining = true
    }
    ...
    return remaining, nil
}

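Under the hood this simply deletes the Pods on the node. Pods managed by a DaemonSet are skipped, because even if they were deleted, the DaemonSet would recreate them on the same node.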
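The selection logic inside deletePods can be isolated from the API calls. The sketch below uses a made-up pod struct with two booleans in place of the real v1.Pod fields and the DaemonSet lookup, and only decides which Pods would be deleted.

```go
package main

import "fmt"

// pod is a stripped-down stand-in for v1.Pod, for illustration only.
type pod struct {
	name        string
	nodeName    string
	terminating bool // a deletion grace period is already set
	daemonSet   bool // the Pod is managed by a DaemonSet
}

// podsToDelete returns the names of the Pods on nodeName that would be
// deleted, and whether any Pods are still awaiting deletion afterwards.
func podsToDelete(pods []pod, nodeName string) (names []string, remaining bool) {
	for _, p := range pods {
		switch {
		case p.nodeName != nodeName:
			continue // defensive check: not on this node
		case p.terminating:
			remaining = true // already on its way out
		case p.daemonSet:
			// left alone: the DaemonSet would recreate it on this node anyway
		default:
			names = append(names, p.name)
			remaining = true
		}
	}
	return names, remaining
}

func main() {
	pods := []pod{
		{name: "web", nodeName: "node-1"},
		{name: "old", nodeName: "node-1", terminating: true},
		{name: "log-agent", nodeName: "node-1", daemonSet: true},
		{name: "other", nodeName: "node-2"},
	}
	names, remaining := podsToDelete(pods, "node-1")
	fmt.Println(names, remaining)
}
```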

1.2.2 The taint mechanism

The taint mechanism is still experimental and is disabled by default. To enable it, set --feature-gates=TaintBasedEvictions=true on all components.

When a node's status is unready, it gets the taint node.alpha.kubernetes.io/notReady.
When a node's status is unknown, it gets the taint node.alpha.kubernetes.io/unreachable.
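The mapping from node status to taint is small enough to write down directly. In this sketch the status is the value of the node's Ready condition as a plain string, and taintForReadyStatus is a name made up for the example.

```go
package main

import "fmt"

const (
	taintNotReady    = "node.alpha.kubernetes.io/notReady"
	taintUnreachable = "node.alpha.kubernetes.io/unreachable"
)

// taintForReadyStatus (hypothetical) maps the status of a node's Ready
// condition ("True", "False" or "Unknown") to the taint key described above.
// The second result is false when no taint applies.
func taintForReadyStatus(status string) (key string, tainted bool) {
	switch status {
	case "False":
		return taintNotReady, true
	case "Unknown":
		return taintUnreachable, true
	default:
		return "", false
	}
}

func main() {
	for _, status := range []string{"True", "False", "Unknown"} {
		key, tainted := taintForReadyStatus(status)
		fmt.Printf("Ready=%s tainted=%v key=%q\n", status, tainted, key)
	}
}
```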

Once a taint has been applied, there has to be a controller that acts on it:

// Run starts NoExecuteTaintManager which will run in loop until `stopCh` is closed.
func (tc *NoExecuteTaintManager) Run(stopCh <-chan struct{}) {   
    go func(stopCh <-chan struct{}) {
        for {            
            item, shutdown := tc.nodeUpdateQueue.Get()
            if shutdown {
                break    
            }            
            nodeUpdate := item.(*nodeUpdateItem) 
            select {     
            case <-stopCh:            
                break    
            case tc.nodeUpdateChannel <- nodeUpdate:
            }
        }
    }(stopCh)            

    go func(stopCh <-chan struct{}) {
        for {
            item, shutdown := tc.podUpdateQueue.Get()
            if shutdown {
                break
            }
            podUpdate := item.(*podUpdateItem)
            select {     
            case <-stopCh:            
                break
            case tc.podUpdateChannel <- podUpdate:
            }
        }
    }(stopCh)
    
    for {                
        select {         
        case <-stopCh:   
            break        
        case nodeUpdate := <-tc.nodeUpdateChannel:
            tc.handleNodeUpdate(nodeUpdate)
        case podUpdate := <-tc.podUpdateChannel: 
            // If we found a Pod update we need to empty Node queue first.
        priority:
            for {        
                select {
                case nodeUpdate := <-tc.nodeUpdateChannel:
                    tc.handleNodeUpdate(nodeUpdate)
                default:
                    break priority            
                }
            }
            // After Node queue is emptied we process podUpdate.
            tc.handlePodUpdate(podUpdate)
        }
    }
}

func deletePodHandler(c clientset.Interface, emitEventFunc func(types.NamespacedName)) func(args *WorkArgs) error {
    return func(args *WorkArgs) error {
        ns := args.NamespacedName.Namespace
        name := args.NamespacedName.Name
        glog.V(0).Infof("NoExecuteTaintManager is deleting Pod: %v", args.NamespacedName.String())
        if emitEventFunc != nil { 
            emitEventFunc(args.NamespacedName)
        }
        var err error    
        for i := 0; i < retries; i++ {
            err = c.Core().Pods(ns).Delete(name, &metav1.DeleteOptions{})
            if err == nil {           
                break
            }
            time.Sleep(10 * time.Millisecond)
        }
        return err
    }                    
}

In essence, the Pods on the node to be evicted are still just added to a deletion queue.

2. The kubelet eviction mechanism

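The eviction mechanism in kube-controller-manager is coarse-grained: it evicts all Pods on a node. The kubelet's is fine-grained: it evicts only some of the Pods on the node, and which ones are chosen depends on the Pod QoS mechanism discussed previously.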

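The kubelet eviction mechanism only kicks in when the node is short of memory or disk; its purpose is to reclaim resources on the node. As mentioned before, the oom-killer can also reclaim resources, so why is eviction needed as well? Because after the oom-killer kills a Pod, if the Pod's RestartPolicy is Always, the kubelet will start that Pod on the same node again after a while. kubelet eviction, by contrast, removes the Pod from the node.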

The kubelet provides the following flags to control eviction:

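eviction-hard: a set of thresholds, e.g. memory.available<1Gi, meaning that a pod eviction is triggered immediately once the node's available memory falls below 1Gi.
eviction-max-pod-grace-period: the maximum grace period allowed when terminating Pods for a soft eviction.
eviction-minimum-reclaim: the minimum amount of resources that each eviction must reclaim.
eviction-pressure-transition-period: defaults to 5 minutes; how long the kubelet waits before leaving a pressure condition. When a threshold is crossed, the node is marked as under memory pressure or disk pressure and pod eviction begins.
eviction-soft: the counterpart of eviction-hard, also a set of thresholds, e.g. memory.available<1.5Gi. It does not trigger a pod eviction immediately. Instead the kubelet waits for eviction-soft-grace-period, and only if the soft threshold is still crossed after that time does it trigger a pod eviction.
eviction-soft-grace-period: the grace period for each soft threshold, e.g. memory.available=1m30s (90 seconds).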

2.1 Core code

The core of kubelet eviction is the code below; synchronize is the central function.

func (m *managerImpl) Start(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, podCleanedUpFunc PodCleanedUpFunc, nodeProvider NodeProvider, monitoringInterval time.Duration) {
    // start the eviction manager monitoring
    go func() {
        for {
            if evictedPods := m.synchronize(diskInfoProvider, podFunc, nodeProvider); evictedPods != nil {
                glog.Infof("eviction manager: pods %s evicted, waiting for pod to be cleaned up", format.Pods(evictedPods))
                m.waitForPodsCleanup(podCleanedUpFunc, evictedPods)
            } else {
                time.Sleep(monitoringInterval)
            }       
        }       
    }()     
}

2.2 When the eviction conditions are checked

There are currently two mechanisms that check the conditions for triggering eviction:

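Periodic checks: the synchronize call shown above sits in a for loop with a monitoringInterval of 10s, so the trigger conditions are checked every 10s.
cgroup notifications: when memory drops below a threshold, the cgroup notifies the kubelet, which then runs synchronize. The kernel notifies user space through an eventfd.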
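The two triggers can be pictured as a single select loop over a timer channel and a notification channel. This is only a conceptual sketch with made-up names; the kubelet code shown in this section calls synchronize from the notifier callback rather than through a shared loop.

```go
package main

import (
	"fmt"
	"time"
)

// runTriggers calls sync whenever the timer fires or a threshold
// notification arrives, until stop is closed.
func runTriggers(tick <-chan time.Time, notify <-chan struct{}, stop <-chan struct{}, sync func(reason string)) {
	for {
		select {
		case <-stop:
			return
		case <-tick:
			sync("timer")
		case <-notify:
			sync("notification")
		}
	}
}

// demoTriggers feeds runTriggers by hand (one notification, then one timer
// tick) and returns the reasons sync was called with, in order.
func demoTriggers() []string {
	tick := make(chan time.Time)
	notify := make(chan struct{})
	stop := make(chan struct{})
	done := make(chan struct{})
	var reasons []string
	go func() {
		defer close(done)
		runTriggers(tick, notify, stop, func(r string) { reasons = append(reasons, r) })
	}()
	notify <- struct{}{} // stands in for the kernel signalling the eventfd
	tick <- time.Now()   // stands in for the 10s monitoring interval elapsing
	close(stop)
	<-done
	return reasons
}

func main() {
	fmt.Println(demoTriggers())
}
```

In a real loop the tick channel would come from something like time.NewTicker(10 * time.Second); the hand-fed channels are only there to make the demo deterministic.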

if m.config.KernelMemcgNotification && !m.notifiersInitialized { 
        glog.Infof("eviction manager attempting to integrate with kernel memcg notification api")
        m.notifiersInitialized = true
        // start soft memory notification
        err = startMemoryThresholdNotifier(m.config.Thresholds, observations, false, func(desc string) {
            glog.Infof("soft memory eviction threshold crossed at %s", desc)
            // TODO wait grace period for soft memory limit
            m.synchronize(diskInfoProvider, podFunc, nodeProvider)
        })
        if err != nil {  
            glog.Warningf("eviction manager: failed to create hard memory threshold notifier: %v", err)
        }
        // start hard memory notification
        err = startMemoryThresholdNotifier(m.config.Thresholds, observations, true, func(desc string) { 
            glog.Infof("hard memory eviction threshold crossed at %s", desc)
            m.synchronize(diskInfoProvider, podFunc, nodeProvider)
        })
        if err != nil {
            glog.Warningf("eviction manager: failed to create soft memory threshold notifier: %v", err)
        }
    }

2.3 Reclaiming resources

kubelet eviction mainly reclaims two kinds of resources, memory and disk:

Disk reclamation: mainly by deleting terminated containers and unused images.
Memory reclamation: mainly by terminating running Pods.

2.4 How QoS affects eviction

The eviction manager collects everything running on the node and then ranks the Pods according to a sorting algorithm. Here is how they are ranked for memory:

// rankMemoryPressure orders the input pods for eviction in response to memory pressure.
func rankMemoryPressure(pods []*v1.Pod, stats statsFunc) {
    orderedBy(qosComparator, memory(stats)).Sort(pods)
}

// qosComparator compares pods by QoS (BestEffort < Burstable < Guaranteed)
func qosComparator(p1, p2 *v1.Pod) int {
    qosP1 := v1qos.GetPodQOS(p1)
    qosP2 := v1qos.GetPodQOS(p2)
    // its a tie
    if qosP1 == qosP2 {
        return 0
    }   
    // if p1 is best effort, we know p2 is burstable or guaranteed
    if qosP1 == v1.PodQOSBestEffort {
        return -1
    }
    // we know p1 and p2 are not besteffort, so if p1 is burstable, p2 must be guaranteed
    if qosP1 == v1.PodQOSBurstable {
        if qosP2 == v1.PodQOSGuaranteed {  
            return -1
        }
        return 1
    }
    // ok, p1 must be guaranteed.
    return 1
}

The qosComparator function shows that Pods are ordered as follows:

PodQOSBestEffort < PodQOSBurstable < PodQOSGuaranteed

That is, BestEffort Pods are reclaimed first, then Burstable Pods, and Guaranteed Pods only last.
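A small runnable version of this ordering helps to see the tie-break at work. The sketch sorts by QoS class first and, as a simplifying assumption, breaks ties by raw memory usage (largest first); the pod struct and function name are invented for the example.

```go
package main

import (
	"fmt"
	"sort"
)

type qosClass int

// Declared in eviction order: lower values are evicted first.
const (
	bestEffort qosClass = iota
	burstable
	guaranteed
)

// pod is a stand-in for v1.Pod plus its memory stats, for illustration only.
type pod struct {
	name     string
	qos      qosClass
	memBytes int64
}

// rankForMemoryPressure sorts pods in place so that the first element is the
// first eviction candidate: QoS class first, then memory usage, largest first.
func rankForMemoryPressure(pods []pod) {
	sort.SliceStable(pods, func(i, j int) bool {
		if pods[i].qos != pods[j].qos {
			return pods[i].qos < pods[j].qos
		}
		return pods[i].memBytes > pods[j].memBytes
	})
}

func main() {
	pods := []pod{
		{"db", guaranteed, 4 << 30},
		{"cache", burstable, 2 << 30},
		{"batch-small", bestEffort, 100 << 20},
		{"batch-big", bestEffort, 900 << 20},
	}
	rankForMemoryPressure(pods)
	for i, p := range pods {
		fmt.Printf("%d. %s\n", i+1, p.name)
	}
}
```

Even though db uses the most memory, it is ranked last because its QoS class is Guaranteed.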

2.5 The essence of eviction

// we kill at most a single pod during each eviction interval
    for i := range activePods {
        pod := activePods[i]      
        // If the pod is marked as critical and static, and support for critical pod annotations is enabled,
        // do not evict such pods. Static pods are not re-admitted after evictions.
        // https://github.com/kubernetes/kubernetes/issues/40573 has more details.
        if utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) &&
            kubelettypes.IsCriticalPod(pod) && kubepod.IsStaticPod(pod) {
            continue     
        }
        status := v1.PodStatus{   
            Phase:   v1.PodFailed,    
            Message: fmt.Sprintf(message, resourceToReclaim),
            Reason:  reason,          
        }
        // record that we are evicting the pod
        m.recorder.Eventf(pod, v1.EventTypeWarning, reason, fmt.Sprintf(message, resourceToReclaim))
        gracePeriodOverride := int64(0)
        if softEviction {
            gracePeriodOverride = m.config.MaxPodGracePeriodSeconds
        }
        // this is a blocking call and should only return when the pod and its containers are killed.
        err := m.killPodFunc(pod, status, &gracePeriodOverride)
        if err != nil {  
            glog.Warningf("eviction manager: error while evicting pod %s: %v", format.Pod(pod), err)
        }                
        return []*v1.Pod{pod}     
    }

At most one Pod is reclaimed per pass.

For a hard eviction, the Pod's deletion grace period is set to 0, i.e. it is deleted immediately; otherwise it is set to MaxPodGracePeriodSeconds. killPodFunc is then called to delete the Pod.
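Both rules, "one Pod per pass" and "grace period depends on hard or soft", fit in a few lines. The sketch below uses an invented pod struct and leaves out the feature-gate check and the actual kill call.

```go
package main

import "fmt"

// pod is a minimal stand-in for v1.Pod, for illustration only.
type pod struct {
	name           string
	criticalStatic bool // critical static Pods are never evicted
}

// pickVictim returns the first Pod in ranked order that may be evicted.
// At most one Pod is chosen per pass.
func pickVictim(ranked []pod) (victim pod, found bool) {
	for _, p := range ranked {
		if p.criticalStatic {
			continue
		}
		return p, true
	}
	return pod{}, false
}

// gracePeriodSeconds returns the grace period override for the kill: 0 for a
// hard eviction, the configured maximum for a soft one.
func gracePeriodSeconds(softEviction bool, maxPodGracePeriodSeconds int64) int64 {
	if softEviction {
		return maxPodGracePeriodSeconds
	}
	return 0
}

func main() {
	ranked := []pod{{"kube-proxy", true}, {"batch", false}, {"web", false}}
	if victim, ok := pickVictim(ranked); ok {
		fmt.Println("evict", victim.name, "hard grace:", gracePeriodSeconds(false, 30))
		fmt.Println("evict", victim.name, "soft grace:", gracePeriodSeconds(true, 30))
	}
}
```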

3. Notes

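Note that when Kubernetes evicts a Pod, the eviction itself does not recreate it. To have the Pod recreated you need a mechanism such as a ReplicationController, ReplicaSet or Deployment. In other words, if you create a bare Pod directly and Kubernetes evicts it, the Pod is simply deleted and never rebuilt. With a ReplicationController or similar, the controller notices that one Pod is missing and creates a replacement.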
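The difference between a bare Pod and a controller-managed one can be modelled in a few lines. This is a toy model with invented helpers, not controller code: evict removes a Pod, and reconcile tops the set back up to the desired replica count.

```go
package main

import "fmt"

// evict returns pods without the named Pod.
func evict(pods []string, name string) []string {
	var kept []string
	for _, p := range pods {
		if p != name {
			kept = append(kept, p)
		}
	}
	return kept
}

// reconcile mimics what a ReplicaSet-like controller does: it creates new
// Pods (named by newName) until the desired replica count is reached.
func reconcile(pods []string, desired int, newName func() string) []string {
	for len(pods) < desired {
		pods = append(pods, newName())
	}
	return pods
}

func main() {
	// A bare Pod has no controller, so nothing runs reconcile for it.
	bare := evict([]string{"solo"}, "solo")
	fmt.Println("bare pod after eviction:", bare)

	// A managed set is topped back up to 2 replicas after the eviction.
	managed := evict([]string{"web-1", "web-2"}, "web-1")
	managed = reconcile(managed, 2, func() string { return "web-3" })
	fmt.Println("managed pods after eviction:", managed)
}
```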

Reposted from: http://licyhust.com/%E5%AE%B9%E5%99%A8%E6%8A%80%E6%9C%AF/2017/10/24/eviction/
