The Kubernetes eviction mechanism
Eviction means driving workloads away: when a node misbehaves, Kubernetes has mechanisms to evict the Pods running on that node. (Eviction also exists in OpenStack's nova component.)
Kubernetes currently has two eviction mechanisms, implemented by kube-controller-manager and kubelet respectively.
1. Eviction implemented by kube-controller-manager
kube-controller-manager is composed of multiple controllers; the eviction logic lives mainly in the node controller.
kube-controller-manager exposes the following startup flags to control eviction:
- pod-eviction-timeout: how long a node must stay down before eviction kicks in and the Pods on the dead node are evicted; default 5m
- node-eviction-rate: the eviction rate, i.e. the rate at which nodes (not Pods) are drained, implemented with a token-bucket rate limiter; default 0.1, meaning 0.1 nodes per second, which works out to draining one node every 10s
- secondary-node-eviction-rate: the secondary eviction rate; when too many nodes in the cluster are down, the eviction rate is lowered to this value; default 0.01
- unhealthy-zone-threshold: the unhealthy-zone threshold, which decides when the secondary eviction rate takes effect; default 0.55, i.e. a zone is considered unhealthy once more than 55% of its nodes are down
- large-cluster-size-threshold: the large-cluster threshold; a zone with more nodes than this is considered a large cluster. When more than 55% of a large cluster's nodes are down, the eviction rate drops to 0.01; for a small cluster it drops straight to 0
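The token-bucket arithmetic behind these rates can be sketched with a tiny integer-based limiter (a simplified stand-in for the client-go flowcontrol limiter the node controller actually uses; the type and function names here are made up for illustration):

```go
package main

import "fmt"

// bucket is a minimal integer token bucket. Tokens are tracked in
// millitokens so a fractional rate such as 0.1 QPS stays exact.
type bucket struct {
	refillPerSec int // millitokens gained each simulated second
	burst        int // cap, in millitokens
	tokens       int
}

// step advances one simulated second, then reports whether one node
// eviction may proceed (costing 1000 millitokens = one whole token).
func (b *bucket) step() bool {
	b.tokens += b.refillPerSec
	if b.tokens > b.burst {
		b.tokens = b.burst
	}
	if b.tokens >= 1000 {
		b.tokens -= 1000
		return true
	}
	return false
}

// evictionsIn simulates a window of n seconds at the given refill rate.
func evictionsIn(n, refillPerSec int) int {
	b := &bucket{refillPerSec: refillPerSec, burst: 1000}
	evicted := 0
	for i := 0; i < n; i++ {
		if b.step() {
			evicted++
		}
	}
	return evicted
}

func main() {
	// --node-eviction-rate=0.1 corresponds to 100 millitokens/second.
	fmt.Println(evictionsIn(60, 100)) // 6: one node every 10 seconds
}
```

At 0.1 QPS the bucket only accumulates a whole token every 10 seconds, which is exactly the "one node every 10s" behaviour described above.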
The node controller code lives mainly under pkg/controller/node.
1.1 zone
To scope eviction, Kubernetes divides nodes into zones, mainly via node labels:
- failure-domain.beta.kubernetes.io/zone
- failure-domain.beta.kubernetes.io/region
A zone's name is derived from the zone and region labels above: two nodes whose zone and region labels both match belong to the same zone; otherwise they are in different zones. If both labels are empty, the node falls into the default zone.
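A minimal sketch of how a zone key could be derived from those two labels (the helper name and the separator are assumptions for illustration; the real helper in the Kubernetes tree differs in detail):

```go
package main

import "fmt"

// zoneKey mirrors the rule described above: nodes agreeing on both
// region and zone labels share a key, and nodes with neither label
// fall into a single default zone (represented here by "").
func zoneKey(labels map[string]string) string {
	region := labels["failure-domain.beta.kubernetes.io/region"]
	zone := labels["failure-domain.beta.kubernetes.io/zone"]
	if region == "" && zone == "" {
		return "" // both labels empty: the default zone
	}
	// "/" is an arbitrary separator chosen for this sketch.
	return region + "/" + zone
}

func main() {
	a := map[string]string{
		"failure-domain.beta.kubernetes.io/region": "us-east-1",
		"failure-domain.beta.kubernetes.io/zone":   "us-east-1a",
	}
	b := map[string]string{
		"failure-domain.beta.kubernetes.io/region": "us-east-1",
		"failure-domain.beta.kubernetes.io/zone":   "us-east-1b",
	}
	fmt.Println(zoneKey(a) == zoneKey(b)) // false: same region, different zones
	fmt.Println(zoneKey(nil) == "")       // true: no labels means the default zone
}
```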
A zone can be in one of four states:
- stateInitial
- stateNormal
- stateFullDisruption
- statePartialDisruption
The initial state is straightforward: when a node has just joined the cluster and its zone has just been discovered, the zone is in stateInitial, a very short-lived state. The remaining states are decided by the following function:
```go
func (nc *NodeController) ComputeZoneState(nodeReadyConditions []*v1.NodeCondition) (int, zoneState) {
	readyNodes := 0
	notReadyNodes := 0
	for i := range nodeReadyConditions {
		if nodeReadyConditions[i] != nil && nodeReadyConditions[i].Status == v1.ConditionTrue {
			readyNodes++
		} else {
			notReadyNodes++
		}
	}
	switch {
	case readyNodes == 0 && notReadyNodes > 0:
		return notReadyNodes, stateFullDisruption
	case notReadyNodes > 2 && float32(notReadyNodes)/float32(notReadyNodes+readyNodes) >= nc.unhealthyZoneThreshold:
		return notReadyNodes, statePartialDisruption
	default:
		return notReadyNodes, stateNormal
	}
}
```
Note that this counts node states within a single zone, not across all zones. When a zone has zero ready nodes and more than zero notReady nodes, every node is considered down, so the state is stateFullDisruption; when more than two nodes are notReady and the notReady fraction reaches unhealthyZoneThreshold (0.55 by default), the state is statePartialDisruption; otherwise it is stateNormal.
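The switch above can be restated as a self-contained function over per-zone counts, which makes the thresholds easy to check (names simplified from the original):

```go
package main

import "fmt"

type zoneState string

const (
	stateNormal            zoneState = "Normal"
	stateFullDisruption    zoneState = "FullDisruption"
	statePartialDisruption zoneState = "PartialDisruption"
)

// computeZoneState restates ComputeZoneState's switch, taking the
// ready/notReady counts directly instead of NodeCondition slices.
func computeZoneState(readyNodes, notReadyNodes int, unhealthyZoneThreshold float64) zoneState {
	switch {
	case readyNodes == 0 && notReadyNodes > 0:
		return stateFullDisruption
	case notReadyNodes > 2 &&
		float64(notReadyNodes)/float64(notReadyNodes+readyNodes) >= unhealthyZoneThreshold:
		return statePartialDisruption
	default:
		return stateNormal
	}
}

func main() {
	fmt.Println(computeZoneState(0, 5, 0.55)) // FullDisruption: nothing is ready
	fmt.Println(computeZoneState(2, 3, 0.55)) // PartialDisruption: 3/5 = 60% >= 55%
	fmt.Println(computeZoneState(8, 2, 0.55)) // Normal: only 2 notReady (needs > 2)
}
```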
How do these four states affect the eviction rate? See the following function:
```go
func (nc *NodeController) setLimiterInZone(zone string, zoneSize int, state zoneState) {
	switch state {
	case stateNormal:
		if nc.useTaintBasedEvictions {
			nc.zoneNotReadyOrUnreachableTainer[zone].SwapLimiter(nc.evictionLimiterQPS)
		} else {
			nc.zonePodEvictor[zone].SwapLimiter(nc.evictionLimiterQPS)
		}
	case statePartialDisruption:
		if nc.useTaintBasedEvictions {
			nc.zoneNotReadyOrUnreachableTainer[zone].SwapLimiter(
				nc.enterPartialDisruptionFunc(zoneSize))
		} else {
			nc.zonePodEvictor[zone].SwapLimiter(
				nc.enterPartialDisruptionFunc(zoneSize))
		}
	case stateFullDisruption:
		if nc.useTaintBasedEvictions {
			nc.zoneNotReadyOrUnreachableTainer[zone].SwapLimiter(
				nc.enterFullDisruptionFunc(zoneSize))
		} else {
			nc.zonePodEvictor[zone].SwapLimiter(
				nc.enterFullDisruptionFunc(zoneSize))
		}
	}
}
```
Here enterPartialDisruptionFunc is the function ReducedQPSFunc:
```go
func (nc *NodeController) ReducedQPSFunc(nodeNum int) float32 {
	if int32(nodeNum) > nc.largeClusterThreshold {
		return nc.secondaryEvictionLimiterQPS
	}
	return 0
}
```
And enterFullDisruptionFunc is the function HealthyQPSFunc:
```go
func (nc *NodeController) HealthyQPSFunc(nodeNum int) float32 {
	return nc.evictionLimiterQPS
}
```
So if a zone's state is normal, the rate is 0.1; if it is FullDisruption, the rate is also 0.1; if it is PartialDisruption, the rate is 0.01 for a large cluster and drops straight to 0 for a small one.
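Putting setLimiterInZone, ReducedQPSFunc and HealthyQPSFunc together, the rate chosen for a zone reduces to one small function. This is a sketch using the default flag values (50 being the documented default for large-cluster-size-threshold):

```go
package main

import "fmt"

type zoneState string

const (
	stateNormal            zoneState = "Normal"
	stateFullDisruption    zoneState = "FullDisruption"
	statePartialDisruption zoneState = "PartialDisruption"
)

// Defaults mirroring the kube-controller-manager flags discussed above.
const (
	evictionLimiterQPS          float32 = 0.1
	secondaryEvictionLimiterQPS float32 = 0.01
	largeClusterThreshold               = 50
)

// evictionQPS condenses the limiter-swapping logic: the eviction rate a
// zone gets, given its state and node count.
func evictionQPS(state zoneState, zoneSize int) float32 {
	switch state {
	case statePartialDisruption:
		if zoneSize > largeClusterThreshold {
			return secondaryEvictionLimiterQPS // large cluster: slow down
		}
		return 0 // small cluster: stop evicting entirely
	default: // stateNormal and stateFullDisruption both use the full rate
		return evictionLimiterQPS
	}
}

func main() {
	fmt.Println(evictionQPS(stateNormal, 100))            // 0.1
	fmt.Println(evictionQPS(statePartialDisruption, 100)) // 0.01
	fmt.Println(evictionQPS(statePartialDisruption, 10))  // 0
}
```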
1.2 Two different eviction methods
The node controller currently supports two eviction methods: taint-based, and the traditional approach:
```go
if nc.useTaintBasedEvictions {
	go wait.Until(nc.doTaintingPass, nodeEvictionPeriod, wait.NeverStop)
} else {
	go wait.Until(nc.doEvictionPass, nodeEvictionPeriod, wait.NeverStop)
}
```
Here nodeEvictionPeriod is 100ms, i.e. doEvictionPass or doTaintingPass runs every 100ms.
1.2.1 The traditional eviction method
zonePodEvictor has type map[string]*RateLimitedTimedQueue: each zone gets its own rate-limited queue holding the unready nodes whose Pods need to be evicted.
```go
func (nc *NodeController) doEvictionPass() {
	nc.evictorLock.Lock()
	defer nc.evictorLock.Unlock()
	for k := range nc.zonePodEvictor {
		nc.zonePodEvictor[k].Try(func(value TimedValue) (bool, time.Duration) {
			node, err := nc.nodeLister.Get(value.Value)
			...
			nodeUid, _ := value.UID.(string)
			remaining, err := deletePods(nc.kubeClient, nc.recorder, value.Value, nodeUid, nc.daemonSetStore)
			...
			if remaining {
				glog.Infof("Pods awaiting deletion due to NodeController eviction")
			}
			return true, 0
		})
	}
}
```
deletePods is the main function that drains a node:
```go
func deletePods(kubeClient clientset.Interface, recorder record.EventRecorder, nodeName, nodeUID string, daemonStore extensionslisters.DaemonSetLister) (bool, error) {
	remaining := false
	selector := fields.OneTermEqualSelector(api.PodHostField, nodeName).String()
	options := metav1.ListOptions{FieldSelector: selector}
	pods, err := kubeClient.Core().Pods(metav1.NamespaceAll).List(options)
	var updateErrList []error
	...
	for _, pod := range pods.Items {
		// Defensive check, also needed for tests.
		if pod.Spec.NodeName != nodeName {
			continue
		}
		// set the Pod's termination reason
		if _, err = setPodTerminationReason(kubeClient, &pod, nodeName); err != nil {
			if errors.IsConflict(err) {
				updateErrList = append(updateErrList,
					fmt.Errorf("update status failed for pod %q: %v", format.Pod(&pod), err))
				continue
			}
		}
		// the Pod is already being deleted; skip it
		if pod.DeletionGracePeriodSeconds != nil {
			remaining = true
			continue
		}
		// if the Pod is managed by a DaemonSet, skip it
		_, err := daemonStore.GetPodDaemonSets(&pod)
		if err == nil {
			continue
		}
		if err := kubeClient.Core().Pods(pod.Namespace).Delete(pod.Name, nil); err != nil {
			return false, err
		}
		remaining = true
	}
	...
	return remaining, nil
}
```
Under the hood this just deletes the Pods on the node. Pods managed by a DaemonSet are skipped, because even if deleted, the DaemonSet would recreate them on the same node.
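The filtering logic of that loop can be distilled into a standalone sketch (the struct and its fields are hypothetical stand-ins for the v1.Pod fields and the DaemonSet lookup used above):

```go
package main

import "fmt"

// podToEvict carries only the facts the eviction decision looks at.
type podToEvict struct {
	name        string
	terminating bool // stands for pod.DeletionGracePeriodSeconds != nil
	daemonSet   bool // stands for daemonStore.GetPodDaemonSets succeeding
}

// selectPodsToDelete mirrors the loop in deletePods: terminating pods
// are skipped but counted as remaining, DaemonSet pods are skipped
// entirely, and everything else is deleted and counted as remaining.
func selectPodsToDelete(pods []podToEvict) (toDelete []string, remaining bool) {
	for _, p := range pods {
		if p.terminating {
			remaining = true
			continue
		}
		if p.daemonSet {
			continue
		}
		toDelete = append(toDelete, p.name)
		remaining = true
	}
	return toDelete, remaining
}

func main() {
	pods := []podToEvict{
		{name: "web-1"},
		{name: "fluentd-x", daemonSet: true},
		{name: "job-2", terminating: true},
	}
	del, rem := selectPodsToDelete(pods)
	fmt.Println(del, rem) // [web-1] true
}
```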
1.2.2 The taint mechanism
The taint mechanism is still experimental and disabled by default; to enable it, set --feature-gates=TaintBasedEvictions=true on all components.
When a node's state is unready, it is tainted with node.alpha.kubernetes.io/notReady; when its state is unknown, it is tainted with node.alpha.kubernetes.io/unreachable.
Once the taints are applied, a corresponding controller processes them:
```go
// Run starts NoExecuteTaintManager which will run in loop until `stopCh` is closed.
func (tc *NoExecuteTaintManager) Run(stopCh <-chan struct{}) {
	go func(stopCh <-chan struct{}) {
		for {
			item, shutdown := tc.nodeUpdateQueue.Get()
			if shutdown {
				break
			}
			nodeUpdate := item.(*nodeUpdateItem)
			select {
			case <-stopCh:
				break
			case tc.nodeUpdateChannel <- nodeUpdate:
			}
		}
	}(stopCh)
	go func(stopCh <-chan struct{}) {
		for {
			item, shutdown := tc.podUpdateQueue.Get()
			if shutdown {
				break
			}
			podUpdate := item.(*podUpdateItem)
			select {
			case <-stopCh:
				break
			case tc.podUpdateChannel <- podUpdate:
			}
		}
	}(stopCh)
	for {
		select {
		case <-stopCh:
			break
		case nodeUpdate := <-tc.nodeUpdateChannel:
			tc.handleNodeUpdate(nodeUpdate)
		case podUpdate := <-tc.podUpdateChannel:
			// If we found a Pod update we need to empty Node queue first.
		priority:
			for {
				select {
				case nodeUpdate := <-tc.nodeUpdateChannel:
					tc.handleNodeUpdate(nodeUpdate)
				default:
					break priority
				}
			}
			// After Node queue is emptied we process podUpdate.
			tc.handlePodUpdate(podUpdate)
		}
	}
}
```
```go
func deletePodHandler(c clientset.Interface, emitEventFunc func(types.NamespacedName)) func(args *WorkArgs) error {
	return func(args *WorkArgs) error {
		ns := args.NamespacedName.Namespace
		name := args.NamespacedName.Name
		glog.V(0).Infof("NoExecuteTaintManager is deleting Pod: %v", args.NamespacedName.String())
		if emitEventFunc != nil {
			emitEventFunc(args.NamespacedName)
		}
		var err error
		for i := 0; i < retries; i++ {
			err = c.Core().Pods(ns).Delete(name, &metav1.DeleteOptions{})
			if err == nil {
				break
			}
			time.Sleep(10 * time.Millisecond)
		}
		return err
	}
}
```
Essentially this still just puts the Pods on the node being evicted onto a deletion queue.
2. The kubelet eviction mechanism
kube-controller-manager's eviction is coarse-grained: it evicts every Pod on a node. The kubelet's is fine-grained: it evicts only some of the Pods on a node, and which Pods get evicted depends on the Pod QoS mechanism covered earlier.
kubelet eviction only kicks in when the node is under memory or disk pressure; its goal is to reclaim node resources. As mentioned before, the kubelet also has the oom-killer to reclaim resources, so why is eviction still needed? Because after the oom-killer kills a Pod whose RestartPolicy is Always, the kubelet will restart that Pod on the same node after a while, whereas kubelet eviction removes the Pod from the node entirely.
The kubelet exposes the following flags to control eviction:
- eviction-hard: a set of thresholds, e.g. memory.available<1Gi; when the node's available memory drops below 1Gi, a Pod eviction is triggered immediately
- eviction-max-pod-grace-period: the grace period for terminating a Pod during a soft eviction
- eviction-minimum-reclaim: the minimum amount of resources each eviction must reclaim
- eviction-pressure-transition-period: how long the kubelet waits before transitioning out of a pressure condition; default 5 minutes. When a threshold is crossed, the node is marked with a memory pressure or disk pressure condition and Pod eviction begins
- eviction-soft: the counterpart of eviction-hard, also a set of thresholds, e.g. memory.available<1.5Gi. A soft threshold does not trigger eviction immediately; instead the kubelet waits for eviction-soft-grace-period, and only if the threshold is still exceeded afterwards does it trigger a Pod eviction
- eviction-soft-grace-period: default 90 seconds
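To make the threshold syntax concrete, here is a simplified parser for entries such as memory.available<1Gi. This is a sketch only: the real kubelet parses values with resource.Quantity and also accepts percentages and more operators.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseThreshold splits one --eviction-hard / --eviction-soft entry,
// e.g. "memory.available<1Gi", into a signal name and a byte count.
func parseThreshold(s string) (signal string, bytes int64, err error) {
	parts := strings.SplitN(s, "<", 2)
	if len(parts) != 2 {
		return "", 0, fmt.Errorf("threshold %q must contain '<'", s)
	}
	signal = parts[0]
	val := parts[1]
	mult := int64(1)
	switch {
	case strings.HasSuffix(val, "Gi"):
		mult, val = 1<<30, strings.TrimSuffix(val, "Gi")
	case strings.HasSuffix(val, "Mi"):
		mult, val = 1<<20, strings.TrimSuffix(val, "Mi")
	case strings.HasSuffix(val, "Ki"):
		mult, val = 1<<10, strings.TrimSuffix(val, "Ki")
	}
	n, err := strconv.ParseInt(val, 10, 64)
	if err != nil {
		return "", 0, err
	}
	return signal, n * mult, nil
}

func main() {
	sig, b, _ := parseThreshold("memory.available<1Gi")
	fmt.Println(sig, b) // memory.available 1073741824
}
```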
2.1 Core code
The core of kubelet eviction is the loop below; synchronize is the key function:
```go
func (m *managerImpl) Start(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, podCleanedUpFunc PodCleanedUpFunc, nodeProvider NodeProvider, monitoringInterval time.Duration) {
	// start the eviction manager monitoring
	go func() {
		for {
			if evictedPods := m.synchronize(diskInfoProvider, podFunc, nodeProvider); evictedPods != nil {
				glog.Infof("eviction manager: pods %s evicted, waiting for pod to be cleaned up", format.Pods(evictedPods))
				m.waitForPodsCleanup(podCleanedUpFunc, evictedPods)
			} else {
				time.Sleep(monitoringInterval)
			}
		}
	}()
}
```
2.2 When the eviction conditions are checked
There are currently two mechanisms for checking whether eviction should trigger:
- Periodic checks: synchronize above runs inside a for loop with monitoringInterval set to 10s, so the trigger conditions are checked every 10s
- cgroup notification: when available memory drops below the threshold, the cgroup notifies the kubelet to run synchronize; the kernel signals user space via eventfd
```go
if m.config.KernelMemcgNotification && !m.notifiersInitialized {
	glog.Infof("eviction manager attempting to integrate with kernel memcg notification api")
	m.notifiersInitialized = true
	// start soft memory notification
	err = startMemoryThresholdNotifier(m.config.Thresholds, observations, false, func(desc string) {
		glog.Infof("soft memory eviction threshold crossed at %s", desc)
		// TODO wait grace period for soft memory limit
		m.synchronize(diskInfoProvider, podFunc, nodeProvider)
	})
	if err != nil {
		glog.Warningf("eviction manager: failed to create hard memory threshold notifier: %v", err)
	}
	// start hard memory notification
	err = startMemoryThresholdNotifier(m.config.Thresholds, observations, true, func(desc string) {
		glog.Infof("hard memory eviction threshold crossed at %s", desc)
		m.synchronize(diskInfoProvider, podFunc, nodeProvider)
	})
	if err != nil {
		glog.Warningf("eviction manager: failed to create soft memory threshold notifier: %v", err)
	}
}
```
2.3 Resource reclamation
kubelet eviction reclaims two kinds of resources, memory and disk:
- Disk: mainly by deleting terminated containers and unused images
- Memory: mainly by killing running Pods
2.4 How QoS affects eviction
The eviction manager collects all the Pods on the node, then ranks them with a sorting algorithm. Here is how Pods are ranked under memory pressure:
```go
// rankMemoryPressure orders the input pods for eviction in response to memory pressure.
func rankMemoryPressure(pods []*v1.Pod, stats statsFunc) {
	orderedBy(qosComparator, memory(stats)).Sort(pods)
}

// qosComparator compares pods by QoS (BestEffort < Burstable < Guaranteed)
func qosComparator(p1, p2 *v1.Pod) int {
	qosP1 := v1qos.GetPodQOS(p1)
	qosP2 := v1qos.GetPodQOS(p2)
	// its a tie
	if qosP1 == qosP2 {
		return 0
	}
	// if p1 is best effort, we know p2 is burstable or guaranteed
	if qosP1 == v1.PodQOSBestEffort {
		return -1
	}
	// we know p1 and p2 are not besteffort, so if p1 is burstable, p2 must be guaranteed
	if qosP1 == v1.PodQOSBurstable {
		if qosP2 == v1.PodQOSGuaranteed {
			return -1
		}
		return 1
	}
	// ok, p1 must be guaranteed.
	return 1
}
```
The qosComparator function shows that Pods are ordered as
PodQOSBestEffort < PodQOSBurstable < PodQOSGuaranteed
i.e. BestEffort Pods are reclaimed first, then Burstable, and Guaranteed Pods last.
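The resulting order can be demonstrated with a self-contained re-implementation of the comparator over plain strings (the secondary per-pod memory-usage key from rankMemoryPressure is omitted in this sketch):

```go
package main

import (
	"fmt"
	"sort"
)

// qosPod is a minimal stand-in for a Pod: just a name and its QoS class.
type qosPod struct {
	name string
	qos  string // "BestEffort", "Burstable" or "Guaranteed"
}

// qosRank reproduces qosComparator's ordering:
// BestEffort < Burstable < Guaranteed.
func qosRank(qos string) int {
	switch qos {
	case "BestEffort":
		return 0
	case "Burstable":
		return 1
	default: // Guaranteed
		return 2
	}
}

// evictionOrder sorts pods by the primary key rankMemoryPressure uses
// and returns their names in eviction order.
func evictionOrder(pods []qosPod) []string {
	sort.SliceStable(pods, func(i, j int) bool {
		return qosRank(pods[i].qos) < qosRank(pods[j].qos)
	})
	names := make([]string, len(pods))
	for i, p := range pods {
		names[i] = p.name
	}
	return names
}

func main() {
	pods := []qosPod{{"db", "Guaranteed"}, {"batch", "BestEffort"}, {"web", "Burstable"}}
	fmt.Println(evictionOrder(pods)) // [batch web db]
}
```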
2.5 What eviction actually does
```go
// we kill at most a single pod during each eviction interval
for i := range activePods {
	pod := activePods[i]
	// If the pod is marked as critical and static, and support for critical pod annotations is enabled,
	// do not evict such pods. Static pods are not re-admitted after evictions.
	// https://github.com/kubernetes/kubernetes/issues/40573 has more details.
	if utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) &&
		kubelettypes.IsCriticalPod(pod) && kubepod.IsStaticPod(pod) {
		continue
	}
	status := v1.PodStatus{
		Phase:   v1.PodFailed,
		Message: fmt.Sprintf(message, resourceToReclaim),
		Reason:  reason,
	}
	// record that we are evicting the pod
	m.recorder.Eventf(pod, v1.EventTypeWarning, reason, fmt.Sprintf(message, resourceToReclaim))
	gracePeriodOverride := int64(0)
	if softEviction {
		gracePeriodOverride = m.config.MaxPodGracePeriodSeconds
	}
	// this is a blocking call and should only return when the pod and its containers are killed.
	err := m.killPodFunc(pod, status, &gracePeriodOverride)
	if err != nil {
		glog.Warningf("eviction manager: error while evicting pod %s: %v", format.Pod(pod), err)
	}
	return []*v1.Pod{pod}
}
```
At most one Pod is reclaimed per eviction interval.
For a hard eviction, gracePeriodOverride stays 0, i.e. the Pod is killed immediately; for a soft eviction it is set to MaxPodGracePeriodSeconds. killPodFunc is then called to kill the Pod.
3. Caveats
Note that when Kubernetes evicts a Pod, it does not recreate it. To have the Pod recreated, you need a mechanism such as a ReplicationController, ReplicaSet, or Deployment. In other words, if you create a bare Pod and it gets evicted, the Pod is simply deleted and never rebuilt; with a ReplicationController or similar controller, the controller notices a Pod is missing and creates a replacement.
Translated from: http://licyhust.com/%E5%AE%B9%E5%99%A8%E6%8A%80%E6%9C%AF/2017/10/24/eviction/