
kube-scheduler Source Code Analysis (3): Preemption Scheduling

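kube-scheduler overview

kube-scheduler is one of the core Kubernetes components. It is responsible for scheduling pods: using its scheduling algorithms (predicates for filtering, priorities for scoring), it assigns each unscheduled pod to the most suitable node.

kube-scheduler architecture diagram

The rough structure and processing flow of kube-scheduler are shown in the diagram below. kube-scheduler lists and watches pods, nodes, and other objects. Through informers it puts unscheduled pods into the pending-pod queue and builds the scheduler cache (used to fetch nodes and other objects quickly). The core scheduling logic lives in sched.scheduleOne: it takes one pod from the pending-pod queue, runs the predicate and priority algorithms, and picks the best node. If all of these steps succeed, it updates the cache and runs the bind operation asynchronously, which sets the pod's nodeName field. If scheduling fails, it enters the preemption logic. That completes the scheduling of one pod.

kube-scheduler preemption overview

The priority and preemption mechanism answers the question of what to do when a pod fails to schedule.

Normally, when a pod fails to schedule, it is set aside in the Pending state, and the scheduler retries it only after the pod is updated or the cluster state changes.

Sometimes, however, we want to rank pods by priority. When a high-priority pod fails to schedule, it is not set aside. Instead it pushes some lower-priority pods off a node, so that the high-priority pod gets scheduled first.

For details on pod priority, see: https://kubernetes.io/zh/docs/concepts/scheduling-eviction/pod-priority-preemption/

Preemption is always triggered by a high-priority pod failing to schedule. We call that pod the preemptor, and the pods that are preempted the victims.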

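PDB overview

PDB stands for PodDisruptionBudget. It is a Kubernetes object that limits how many replicas of an application (managed by a Deployment, StatefulSet, or similar controller) may be unavailable at the same time because of voluntary disruptions, which in effect guarantees a minimum number of available replicas.

For details, see:
https://kubernetes.io/zh/docs/concepts/workloads/pods/disruptions/
https://kubernetes.io/zh/docs/tasks/run-application/configure-pdb/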
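
Preemption consults PDBs when it chooses victims, so it helps to see what "evicting this pod would violate a PDB" can mean in code. The following is a minimal sketch, not the scheduler's implementation. The Pod and PDB types, the map-based label selector, and the DisruptionsAllowed field are simplified assumptions made for illustration.

```go
package main

import "fmt"

// Pod and PDB are simplified, hypothetical stand-ins for the real API types.
type Pod struct {
	Name      string
	Namespace string
	Labels    map[string]string
}

type PDB struct {
	Namespace          string
	MatchLabels        map[string]string // simplified selector: every pair must match
	DisruptionsAllowed int               // remaining voluntary disruptions
}

// matches reports whether every selector label is present on the pod.
func matches(selector, labels map[string]string) bool {
	if len(selector) == 0 {
		return false // assumption of this sketch: an empty selector matches nothing
	}
	for k, want := range selector {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// splitByPDB partitions pods into those whose eviction would violate a PDB
// (a matching budget has no disruptions left) and those whose eviction would not.
func splitByPDB(pods []Pod, pdbs []PDB) (violating, nonViolating []Pod) {
	for _, p := range pods {
		violated := false
		for _, b := range pdbs {
			if b.Namespace == p.Namespace && matches(b.MatchLabels, p.Labels) && b.DisruptionsAllowed <= 0 {
				violated = true
				break
			}
		}
		if violated {
			violating = append(violating, p)
		} else {
			nonViolating = append(nonViolating, p)
		}
	}
	return violating, nonViolating
}

func main() {
	pods := []Pod{
		{Name: "web-0", Namespace: "default", Labels: map[string]string{"app": "web"}},
		{Name: "batch-0", Namespace: "default", Labels: map[string]string{"app": "batch"}},
	}
	pdbs := []PDB{{Namespace: "default", MatchLabels: map[string]string{"app": "web"}, DisruptionsAllowed: 0}}
	violating, nonViolating := splitByPDB(pods, pdbs)
	fmt.Println(len(violating), len(nonViolating)) // prints 1 1
}
```

In this sketch web-0 is protected by a budget with no disruptions left, so evicting it counts as a violation, while batch-0 is not covered by any budget.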

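Enabling and disabling preemption

Preemption is enabled by default in kube-scheduler.

In Kubernetes 1.15+, if the NonPreemptingPriority feature gate is enabled (kube-scheduler flag --feature-gates=NonPreemptingPriority=true), a PriorityClass can set preemptionPolicy: Never. Pods of that PriorityClass will then skip the preemption logic after a scheduling failure.

In addition, in Kubernetes 1.11+, preemption can be turned off through the kube-scheduler configuration file (note: it cannot be set through command-line flags).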

apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
...
disablePreemption: true

The configuration file is specified with the kube-scheduler flag --config.

kube-scheduler flag reference: https://kubernetes.io/zh/docs/reference/command-line-tools-reference/kube-scheduler/
kube-scheduler configuration file reference: https://kubernetes.io/zh/docs/reference/scheduling/config/

The analysis of kube-scheduler is split into three parts:
(1) kube-scheduler initialization and startup;
(2) kube-scheduler core processing logic;
(3) kube-scheduler preemption logic.

This article covers the preemption logic.

3. kube-scheduler preemption logic analysis

Based on tag v1.17.4

https://github.com/kubernetes/kubernetes/releases/tag/v1.17.4

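Entry point: scheduleOne

We take the scheduleOne method as the entry point for analyzing preemption in kube-scheduler, and focus only on its preemption-related logic:
(1) Call sched.Algorithm.Schedule to schedule the pod;
(2) If scheduling fails, check sched.DisablePreemption to see whether preemption has been disabled;
(3) If preemption is not disabled, call sched.preempt to run the preemption logic.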

// pkg/scheduler/scheduler.go
func (sched *Scheduler) scheduleOne(ctx context.Context) {
    ...
	// schedule the pod
	scheduleResult, err := sched.Algorithm.Schedule(schedulingCycleCtx, state, pod)
	if err != nil {
		...
		if fitError, ok := err.(*core.FitError); ok {
			// check whether preemption has been disabled
			if sched.DisablePreemption {
				klog.V(3).Infof("Pod priority feature is not enabled or preemption is disabled by scheduler configuration." +
					" No preemption is performed.")
			} else {
				// preemption logic
				preemptionStartTime := time.Now()
				sched.preempt(schedulingCycleCtx, state, fwk, pod, fitError)
				...
			}
			...
	}
	...

sched.preempt

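sched.preempt contains the preemption handling logic of kube-scheduler. Its main steps:
(1) Call sched.Algorithm.Preempt to simulate preemption. It returns the node on which the pod can preempt, the list of pods to be preempted, and the list of pods whose NominatedNodeName must be cleared;
(2) Call sched.podPreemptor.setNominatedNodeName, which asks the apiserver to set the chosen node's name in the pod's NominatedNodeName field. The pod then re-enters the pending-pod queue and waits to be scheduled again;
(3) Iterate over the victims and ask the apiserver to delete each pod;
(4) Iterate over the pods whose NominatedNodeName must be cleared and ask the apiserver to update each pod, clearing its NominatedNodeName field.

Note: the preemption logic does not immediately place the pod that failed scheduling onto the node. Based on the simulated result, it deletes the victims to free resources, and leaves the pod to be handled in a later scheduling cycle.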

// pkg/scheduler/scheduler.go
func (sched *Scheduler) preempt(ctx context.Context, state *framework.CycleState, fwk framework.Framework, preemptor *v1.Pod, scheduleErr error) (string, error) {
	...
	// (1) simulate preemption; returns the node to preempt on, the victims, and the pods whose NominatedNodeName must be cleared
	node, victims, nominatedPodsToClear, err := sched.Algorithm.Preempt(ctx, state, preemptor, scheduleErr)
	if err != nil {
		klog.Errorf("Error preempting victims to make room for %v/%v: %v", preemptor.Namespace, preemptor.Name, err)
		return "", err
	}
	var nodeName = ""
	if node != nil {
		nodeName = node.Name
		
		sched.SchedulingQueue.UpdateNominatedPodForNode(preemptor, nodeName)

		// (2) ask the apiserver to set the chosen node name in the pod's NominatedNodeName field; the pod then re-enters the pending-pod queue and waits to be scheduled again
		err = sched.podPreemptor.setNominatedNodeName(preemptor, nodeName)
		if err != nil {
			klog.Errorf("Error in preemption process. Cannot set 'NominatedPod' on pod %v/%v: %v", preemptor.Namespace, preemptor.Name, err)
			sched.SchedulingQueue.DeleteNominatedPodIfExists(preemptor)
			return "", err
		}

		// (3) iterate over the victims and ask the apiserver to delete each pod
		for _, victim := range victims {
			if err := sched.podPreemptor.deletePod(victim); err != nil {
				klog.Errorf("Error preempting pod %v/%v: %v", victim.Namespace, victim.Name, err)
				return "", err
			}
			// If the victim is a WaitingPod, send a reject message to the PermitPlugin
			if waitingPod := fwk.GetWaitingPod(victim.UID); waitingPod != nil {
				waitingPod.Reject("preempted")
			}
			sched.Recorder.Eventf(victim, preemptor, v1.EventTypeNormal, "Preempted", "Preempting", "Preempted by %v/%v on node %v", preemptor.Namespace, preemptor.Name, nodeName)

		}
		metrics.PreemptionVictims.Observe(float64(len(victims)))
	}
	// (4) iterate over the pods whose NominatedNodeName must be cleared and ask the apiserver to update each pod, clearing the field
	for _, p := range nominatedPodsToClear {
		rErr := sched.podPreemptor.removeNominatedNodeName(p)
		if rErr != nil {
			klog.Errorf("Cannot remove 'NominatedPod' field of pod: %v", rErr)
			// We do not return as this error is not critical.
		}
	}
	return nodeName, err
}
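
The order of API calls in sched.preempt matters: nominate first, then delete victims, then clear stale nominations. The following is a minimal sketch of that ordering under stated assumptions. The apiClient interface, the recorder fake, and applyPreemption are hypothetical names invented for illustration and are not part of kube-scheduler.

```go
package main

import "fmt"

// apiClient is a hypothetical interface standing in for the scheduler's
// calls to the API server.
type apiClient interface {
	SetNominatedNode(pod, node string) error
	DeletePod(pod string) error
	ClearNominatedNode(pod string) error
}

// recorder is a fake client that only records the calls it receives.
type recorder struct{ calls []string }

func (r *recorder) SetNominatedNode(pod, node string) error {
	r.calls = append(r.calls, "nominate "+pod+" -> "+node)
	return nil
}

func (r *recorder) DeletePod(pod string) error {
	r.calls = append(r.calls, "delete "+pod)
	return nil
}

func (r *recorder) ClearNominatedNode(pod string) error {
	r.calls = append(r.calls, "clear "+pod)
	return nil
}

// applyPreemption applies the result of a simulated preemption. It never binds
// the preemptor: the pod is only nominated and is scheduled in a later cycle.
func applyPreemption(c apiClient, preemptor, node string, victims, toClear []string) error {
	if node != "" {
		if err := c.SetNominatedNode(preemptor, node); err != nil {
			return err
		}
		for _, v := range victims {
			if err := c.DeletePod(v); err != nil {
				return err
			}
		}
	}
	for _, p := range toClear {
		// A failure here is treated as non-critical, as in the listing above.
		_ = c.ClearNominatedNode(p)
	}
	return nil
}

func main() {
	r := &recorder{}
	_ = applyPreemption(r, "high", "node-1", []string{"low-a", "low-b"}, []string{"mid"})
	for _, call := range r.calls {
		fmt.Println(call)
	}
}
```

When no node is returned by the simulation, the sketch still clears stale nominations, which matches the case where Preempt returns the preemptor itself in the list of pods to clear.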

sched.Algorithm.Preempt

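sched.Algorithm.Preempt simulates preemption and returns the node on which the pod can preempt, the list of victims, and the list of pods whose NominatedNodeName must be cleared. Its main logic:
(1) Call nodesWherePreemptionMightHelp to get the nodes that failed the predicates but might become schedulable after some pods are removed;
(2) Fetch the PodDisruptionBudget objects, which are used later when selecting candidate nodes (for PodDisruptionBudget usage, see the links in the PDB overview);
(3) Call g.selectNodesForPreemption to select the nodes that can be preempted, together with the minimal set of victims on each node;
(4) Iterate over the scheduler extenders (a webhook-style extension mechanism of kube-scheduler), run each extender's preemption logic, and filter the candidate nodes accordingly;
(5) Call pickOneNodeForPreemption to pick one node from the candidates;
(6) Call g.getLowerPriorityNominatedPods to get the pods on the chosen node that have a non-empty NominatedNodeName and a lower priority than the preemptor.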

// pkg/scheduler/core/generic_scheduler.go
func (g *genericScheduler) Preempt(ctx context.Context, state *framework.CycleState, pod *v1.Pod, scheduleErr error) (*v1.Node, []*v1.Pod, []*v1.Pod, error) {
	// Scheduler may return various types of errors. Consider preemption only if
	// the error is of type FitError.
	fitError, ok := scheduleErr.(*FitError)
	if !ok || fitError == nil {
		return nil, nil, nil, nil
	}
	if !podEligibleToPreemptOthers(pod, g.nodeInfoSnapshot.NodeInfoMap, g.enableNonPreempting) {
		klog.V(5).Infof("Pod %v/%v is not eligible for more preemption.", pod.Namespace, pod.Name)
		return nil, nil, nil, nil
	}
	if len(g.nodeInfoSnapshot.NodeInfoMap) == 0 {
		return nil, nil, nil, ErrNoNodesAvailable
	}
	// (1) get the nodes that failed the predicates but might become schedulable after some pods are removed
	potentialNodes := nodesWherePreemptionMightHelp(g.nodeInfoSnapshot.NodeInfoMap, fitError)
	if len(potentialNodes) == 0 {
		klog.V(3).Infof("Preemption will not help schedule pod %v/%v on any node.", pod.Namespace, pod.Name)
		// In this case, we should clean-up any existing nominated node name of the pod.
		return nil, nil, []*v1.Pod{pod}, nil
	}
	var (
		pdbs []*policy.PodDisruptionBudget
		err  error
	)
	// (2) fetch the PodDisruptionBudget objects, used later when selecting candidate nodes
	if g.pdbLister != nil {
		pdbs, err = g.pdbLister.List(labels.Everything())
		if err != nil {
			return nil, nil, nil, err
		}
	}
	// (3) select the nodes that can be preempted
	nodeToVictims, err := g.selectNodesForPreemption(ctx, state, pod, potentialNodes, pdbs)
	if err != nil {
		return nil, nil, nil, err
	}

	// (4) iterate over the scheduler extenders (a webhook-style extension mechanism), run their preemption logic, and filter the candidate nodes accordingly
	// We will only check nodeToVictims with extenders that support preemption.
	// Extenders which do not support preemption may later prevent preemptor from being scheduled on the nominated
	// node. In that case, scheduler will find a different host for the preemptor in subsequent scheduling cycles.
	nodeToVictims, err = g.processPreemptionWithExtenders(pod, nodeToVictims)
	if err != nil {
		return nil, nil, nil, err
	}

	// (5) pick one node from the candidates
	candidateNode := pickOneNodeForPreemption(nodeToVictims)
	if candidateNode == nil {
		return nil, nil, nil, nil
	}

	// (6) get the pods on the chosen node that have a non-empty NominatedNodeName and a lower priority than the preemptor
	// Lower priority pods nominated to run on this node, may no longer fit on
	// this node. So, we should remove their nomination. Removing their
	// nomination updates these pods and moves them to the active queue. It
	// lets scheduler find another place for them.
	nominatedPods := g.getLowerPriorityNominatedPods(pod, candidateNode.Name)
	if nodeInfo, ok := g.nodeInfoSnapshot.NodeInfoMap[candidateNode.Name]; ok {
		return nodeInfo.Node(), nodeToVictims[candidateNode].Pods, nominatedPods, nil
	}

	return nil, nil, nil, fmt.Errorf(
		"preemption failed: the target node %s has been deleted from scheduler cache",
		candidateNode.Name)
}

3.1 nodesWherePreemptionMightHelp

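nodesWherePreemptionMightHelp returns the nodes that failed the predicates but might become schedulable after some pods are removed.

How does it decide whether removing pods from a failed node might help? The main logic is in predicates.UnresolvablePredicateExists.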

// pkg/scheduler/core/generic_scheduler.go
func nodesWherePreemptionMightHelp(nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo, fitErr *FitError) []*v1.Node {
	potentialNodes := []*v1.Node{}
	for name, node := range nodeNameToInfo {
		if fitErr.FilteredNodesStatuses[name].Code() == framework.UnschedulableAndUnresolvable {
			continue
		}
		failedPredicates := fitErr.FailedPredicates[name]

		// If we assume that scheduler looks at all nodes and populates the failedPredicateMap
		// (which is the case today), the !found case should never happen, but we'd prefer
		// to rely less on such assumptions in the code when checking does not impose
		// significant overhead.
		// Also, we currently assume all failures returned by extender as resolvable.
		if predicates.UnresolvablePredicateExists(failedPredicates) == nil {
			klog.V(3).Infof("Node %v is a potential node for preemption.", name)
			potentialNodes = append(potentialNodes, node.Node())
		}
	}
	return potentialNodes
}

3.1.1 predicates.UnresolvablePredicateExists

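If none of the reasons a node failed the predicates appears in unresolvablePredicateFailureErrors, then removing some pods from that node might make it schedulable.

unresolvablePredicateFailureErrors includes node selector mismatch, pod affinity rules not matched, untolerated taints, node NotReady, node under memory pressure, and so on.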

// pkg/scheduler/algorithm/predicates/error.go
var unresolvablePredicateFailureErrors = map[PredicateFailureReason]struct{}{
	ErrNodeSelectorNotMatch:      {},
	ErrPodAffinityRulesNotMatch:  {},
	ErrPodNotMatchHostName:       {},
	ErrTaintsTolerationsNotMatch: {},
	ErrNodeLabelPresenceViolated: {},
	// Node conditions won't change when scheduler simulates removal of preemption victims.
	// So, it is pointless to try nodes that have not been able to host the pod due to node
	// conditions. These include ErrNodeNotReady, ErrNodeUnderPIDPressure, ErrNodeUnderMemoryPressure, ....
	ErrNodeNotReady:            {},
	ErrNodeNetworkUnavailable:  {},
	ErrNodeUnderDiskPressure:   {},
	ErrNodeUnderPIDPressure:    {},
	ErrNodeUnderMemoryPressure: {},
	ErrNodeUnschedulable:       {},
	ErrNodeUnknownCondition:    {},
	ErrVolumeZoneConflict:      {},
	ErrVolumeNodeConflict:      {},
	ErrVolumeBindConflict:      {},
}

// UnresolvablePredicateExists checks if there is at least one unresolvable predicate failure reason, if true
// returns the first one in the list.
func UnresolvablePredicateExists(reasons []PredicateFailureReason) PredicateFailureReason {
	for _, r := range reasons {
		if _, ok := unresolvablePredicateFailureErrors[r]; ok {
			return r
		}
	}
	return nil
}
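
The filtering idea above can be tried in isolation. The following is a minimal sketch, assuming failure reasons are plain strings. The constant names and the preemptionMightHelp helper are hypothetical, and InsufficientCPU is an assumed example of a failure that evicting pods can fix.

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical failure reasons for this sketch. The first two are named after
// entries in the unresolvable list above; InsufficientCPU is an assumed
// example of a failure that evicting pods can resolve.
const (
	NodeSelectorNotMatch      = "NodeSelectorNotMatch"
	TaintsTolerationsNotMatch = "TaintsTolerationsNotMatch"
	InsufficientCPU           = "InsufficientCPU"
)

// unresolvable holds the reasons that removing pods cannot fix.
var unresolvable = map[string]struct{}{
	NodeSelectorNotMatch:      {},
	TaintsTolerationsNotMatch: {},
}

// preemptionMightHelp keeps the nodes whose failure reasons are all resolvable.
func preemptionMightHelp(failures map[string][]string) []string {
	var nodes []string
	for node, reasons := range failures {
		resolvable := true
		for _, r := range reasons {
			if _, bad := unresolvable[r]; bad {
				resolvable = false
				break
			}
		}
		if resolvable {
			nodes = append(nodes, node)
		}
	}
	sort.Strings(nodes) // map iteration order is random; sort for stable output
	return nodes
}

func main() {
	failures := map[string][]string{
		"node-1": {InsufficientCPU},
		"node-2": {InsufficientCPU, NodeSelectorNotMatch},
		"node-3": {TaintsTolerationsNotMatch},
	}
	fmt.Println(preemptionMightHelp(failures)) // prints [node-1]
}
```

A single unresolvable reason is enough to drop a node, which is why node-2 is excluded even though it also failed for a resolvable reason.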

3.2 g.selectNodesForPreemption

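g.selectNodesForPreemption finds the nodes that can be preempted and returns the minimal set of victims on each node. Its main logic:
(1) Define the checkNode function, which mainly calls g.selectVictimsOnNode. That method reports whether a node is suitable for preemption and returns the minimal set of victims on it, together with the number of victims whose eviction would violate a PDB;
(2) Start 16 goroutines that call checkNode concurrently, checking each candidate node for suitability.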

// pkg/scheduler/core/generic_scheduler.go
// selectNodesForPreemption finds all the nodes with possible victims for
// preemption in parallel.
func (g *genericScheduler) selectNodesForPreemption(
	ctx context.Context,
	state *framework.CycleState,
	pod *v1.Pod,
	potentialNodes []*v1.Node,
	pdbs []*policy.PodDisruptionBudget,
) (map[*v1.Node]*extenderv1.Victims, error) {
	nodeToVictims := map[*v1.Node]*extenderv1.Victims{}
	var resultLock sync.Mutex

	// (1) define the checkNode function
	// We can use the same metadata producer for all nodes.
	meta := g.predicateMetaProducer(pod, g.nodeInfoSnapshot)
	checkNode := func(i int) {
		nodeName := potentialNodes[i].Name
		if g.nodeInfoSnapshot.NodeInfoMap[nodeName] == nil {
			return
		}
		nodeInfoCopy := g.nodeInfoSnapshot.NodeInfoMap[nodeName].Clone()
		var metaCopy predicates.Metadata
		if meta != nil {
			metaCopy = meta.ShallowCopy()
		}
		stateCopy := state.Clone()
		stateCopy.Write(migration.PredicatesStateKey, &migration.PredicatesStateData{Reference: metaCopy})
		// call g.selectVictimsOnNode, which reports whether the node is suitable for preemption and returns the minimal set of victims on it and the number of PDB-violating victims
		pods, numPDBViolations, fits := g.selectVictimsOnNode(ctx, stateCopy, pod, metaCopy, nodeInfoCopy, pdbs)
		if fits {
			resultLock.Lock()
			victims := extenderv1.Victims{
				Pods:             pods,
				NumPDBViolations: int64(numPDBViolations),
			}
			nodeToVictims[potentialNodes[i]] = &victims
			resultLock.Unlock()
		}
	}
	// (2) start 16 goroutines that call checkNode concurrently to check each candidate node
	workqueue.ParallelizeUntil(context.TODO(), 16, len(potentialNodes), checkNode)
	return nodeToVictims, nil
}
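
The fan-out pattern used here (a fixed number of workers, a per-index work function, and a mutex around the shared result map) can be sketched with the standard library alone. This is an illustration, not the workqueue.ParallelizeUntil implementation. The parallelize and selectNodes helpers and the checkFunc type are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// parallelize runs work(i) for every i in [0, pieces) using at most
// `workers` goroutines, and returns when all pieces are done.
func parallelize(workers, pieces int, work func(i int)) {
	if pieces < workers {
		workers = pieces
	}
	indexes := make(chan int, pieces)
	for i := 0; i < pieces; i++ {
		indexes <- i
	}
	close(indexes)

	var wg sync.WaitGroup
	wg.Add(workers)
	for w := 0; w < workers; w++ {
		go func() {
			defer wg.Done()
			for i := range indexes {
				work(i)
			}
		}()
	}
	wg.Wait()
}

// checkFunc is a hypothetical per-node check that returns the victims on a
// node and whether the node is suitable for preemption.
type checkFunc func(node string) (victims []string, fits bool)

// selectNodes checks all nodes concurrently and collects the suitable ones.
func selectNodes(nodes []string, check checkFunc) map[string][]string {
	result := map[string][]string{}
	var mu sync.Mutex // guards result, which is written from several goroutines
	parallelize(16, len(nodes), func(i int) {
		victims, fits := check(nodes[i])
		if !fits {
			return
		}
		mu.Lock()
		result[nodes[i]] = victims
		mu.Unlock()
	})
	return result
}

func main() {
	nodes := []string{"node-1", "node-2", "node-3"}
	result := selectNodes(nodes, func(node string) ([]string, bool) {
		return []string{"low-on-" + node}, node != "node-2"
	})
	fmt.Println(len(result)) // prints 2
}
```

Each worker operates on its own index, so only the write to the shared map needs the lock. The listing above has the same shape: checkNode works on cloned node info and state, and locks only around nodeToVictims.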

3.2.1 g.selectVictimsOnNode

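g.selectVictimsOnNode decides whether a node is suitable for preemption and returns the minimal set of victims on it, together with the number of victims whose eviction would violate a PDB.

Main logic:
(1) First, assume that all pods on the node with lower priority than the preemptor are removed, then run the predicates to see whether the preemptor fits on the node. If it still does not fit, the node is not suitable for preemption and the method returns immediately;
(2) Sort the lower-priority pods by importance, with the highest priority first;
(3) Split the sorted list into two lists: pods whose eviction would violate a PDB, and pods whose eviction would not;
(4) Walk the PDB-violating list first and try to reprieve each pod: add it back to the node and run the predicates. If the preemptor still fits, the pod is spared. Otherwise the pod is removed again, added to the victim list, and counted as a PDB violation;
(5) Walk the non-violating list in the same way, sparing every pod that can be added back without making the preemptor unschedulable;
(6) Return the victim list, the number of PDB-violating victims, and true to indicate that the node is suitable for preemption.

Note: removal here is only simulated, to test whether the preemptor would fit. The victims are actually deleted later, after the node to preempt on has been chosen.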

// pkg/scheduler/core/generic_scheduler.go
func (g *genericScheduler) selectVictimsOnNode(
	ctx context.Context,
	state *framework.CycleState,
	pod *v1.Pod,
	meta predicates.Metadata,
	nodeInfo *schedulernodeinfo.NodeInfo,
	pdbs []*policy.PodDisruptionBudget,
) ([]*v1.Pod, int, bool) {
	var potentialVictims []*v1.Pod

	removePod := func(rp *v1.Pod) error {
		if err := nodeInfo.RemovePod(rp); err != nil {
			return err
		}
		if meta != nil {
			if err := meta.RemovePod(rp, nodeInfo.Node()); err != nil {
				return err
			}
		}
		status := g.framework.RunPreFilterExtensionRemovePod(ctx, state, pod, rp, nodeInfo)
		if !status.IsSuccess() {
			return status.AsError()
		}
		return nil
	}
	addPod := func(ap *v1.Pod) error {
		nodeInfo.AddPod(ap)
		if meta != nil {
			if err := meta.AddPod(ap, nodeInfo.Node()); err != nil {
				return err
			}
		}
		status := g.framework.RunPreFilterExtensionAddPod(ctx, state, pod, ap, nodeInfo)
		if !status.IsSuccess() {
			return status.AsError()
		}
		return nil
	}
	// (1) first, assume all pods with lower priority than the preemptor are removed from the node, then run the predicates; if the preemptor still does not fit, the node is not suitable for preemption
	// As the first step, remove all the lower priority pods from the node and
	// check if the given pod can be scheduled.
	podPriority := podutil.GetPodPriority(pod)
	for _, p := range nodeInfo.Pods() {
		if podutil.GetPodPriority(p) < podPriority {
			potentialVictims = append(potentialVictims, p)
			if err := removePod(p); err != nil {
				return nil, 0, false
			}
		}
	}
	// If the new pod does not fit after removing all the lower priority pods,
	// we are almost done and this node is not suitable for preemption. The only
	// condition that we could check is if the "pod" is failing to schedule due to
	// inter-pod affinity to one or more victims, but we have decided not to
	// support this case for performance reasons. Having affinity to lower
	// priority pods is not a recommended configuration anyway.
	if fits, _, _, err := g.podFitsOnNode(ctx, state, pod, meta, nodeInfo, false); !fits {
		if err != nil {
			klog.Warningf("Encountered error while selecting victims on node %v: %v", nodeInfo.Node().Name, err)
		}

		return nil, 0, false
	}
	var victims []*v1.Pod
	numViolatingVictim := 0
	// (2) sort the lower-priority pods by importance, highest priority first
	sort.Slice(potentialVictims, func(i, j int) bool { return util.MoreImportantPod(potentialVictims[i], potentialVictims[j]) })
	// Try to reprieve as many pods as possible. We first try to reprieve the PDB
	// violating victims and then other non-violating ones. In both cases, we start
	// from the highest priority victims.
	// (3) split the sorted list into PDB-violating and non-violating pods
	violatingVictims, nonViolatingVictims := filterPodsWithPDBViolation(potentialVictims, pdbs)
	reprievePod := func(p *v1.Pod) (bool, error) {
		if err := addPod(p); err != nil {
			return false, err
		}
		fits, _, _, _ := g.podFitsOnNode(ctx, state, pod, meta, nodeInfo, false)
		if !fits {
			if err := removePod(p); err != nil {
				return false, err
			}
			victims = append(victims, p)
			klog.V(5).Infof("Pod %v/%v is a potential preemption victim on node %v.", p.Namespace, p.Name, nodeInfo.Node().Name)
		}
		return fits, nil
	}
	// (4) try to reprieve the PDB-violating pods first: add each pod back and run the predicates; if the preemptor no longer fits, remove the pod again, record it as a victim, and count the PDB violation
	for _, p := range violatingVictims {
		if fits, err := reprievePod(p); err != nil {
			klog.Warningf("Failed to reprieve pod %q: %v", p.Name, err)
			return nil, 0, false
		} else if !fits {
			numViolatingVictim++
		}
	}
	// (5) then try to reprieve the non-violating pods in the same way
	// Now we try to reprieve non-violating victims.
	for _, p := range nonViolatingVictims {
		if _, err := reprievePod(p); err != nil {
			klog.Warningf("Failed to reprieve pod %q: %v", p.Name, err)
			return nil, 0, false
		}
	}
	// (6) return the victims, the number of PDB-violating victims, and true: the node is suitable for preemption
	return victims, numViolatingVictim, true
}
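
The remove-all-then-reprieve approach is easier to follow on a toy model. The following is a minimal sketch, assuming CPU is the only resource and ignoring the PDB ordering. The Pod type and selectVictims function are hypothetical simplifications, not the scheduler's code.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a simplified, hypothetical pod with a priority and a CPU request.
type Pod struct {
	Name     string
	Priority int32
	CPU      int64 // requested millicores
}

// selectVictims returns the pods that must be evicted from a node with the
// given CPU capacity so that the preemptor fits, and whether the node is
// usable at all. Removal is simulated by adjusting the `used` counter.
func selectVictims(capacity int64, running []Pod, preemptor Pod) ([]Pod, bool) {
	var used int64
	var potential []Pod
	for _, p := range running {
		if p.Priority < preemptor.Priority {
			potential = append(potential, p) // simulated removal
		} else {
			used += p.CPU
		}
	}
	fits := func() bool { return used+preemptor.CPU <= capacity }
	if !fits() {
		return nil, false // even evicting every lower-priority pod does not help
	}

	// Reprieve in order of importance: highest priority first.
	sort.SliceStable(potential, func(i, j int) bool {
		return potential[i].Priority > potential[j].Priority
	})
	var victims []Pod
	for _, p := range potential {
		used += p.CPU // add the pod back
		if !fits() {
			used -= p.CPU // it cannot stay: remove it again
			victims = append(victims, p)
		}
	}
	return victims, true
}

func main() {
	running := []Pod{
		{Name: "sys", Priority: 1000, CPU: 1000},
		{Name: "a", Priority: 10, CPU: 1500},
		{Name: "b", Priority: 5, CPU: 1000},
		{Name: "c", Priority: 1, CPU: 1000},
	}
	victims, ok := selectVictims(4000, running, Pod{Name: "high", Priority: 100, CPU: 1500})
	fmt.Println(ok)
	for _, v := range victims {
		fmt.Println(v.Name)
	}
}
```

In this example pod a can be added back without crowding out the preemptor, so it is spared, while b and c become victims. Starting from the highest-priority pods means the most important lower-priority pods are the ones most likely to survive.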

3.3 pickOneNodeForPreemption

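pickOneNodeForPreemption picks one node from the candidates. It applies the following rules in order until one of them selects a single node:
(1) If a node needs no victims at all, return it immediately;
(2) Choose the node(s) with the fewest PDB-violating victims;
(3) Choose the node(s) whose highest-priority victim has the lowest priority;
(4) Sum the priorities of all victims on each node and choose the node(s) with the smallest sum;
(5) Choose the node(s) with the fewest victims;
(6) Choose the node whose earliest-started victim (among its highest-priority victims) started most recently;
(7) Return the node chosen by the rules above.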

// pkg/scheduler/core/generic_scheduler.go
func pickOneNodeForPreemption(nodesToVictims map[*v1.Node]*extenderv1.Victims) *v1.Node {
	if len(nodesToVictims) == 0 {
		return nil
	}
	minNumPDBViolatingPods := int64(math.MaxInt32)
	var minNodes1 []*v1.Node
	lenNodes1 := 0
	for node, victims := range nodesToVictims {
		// (1) a node with no victims is returned immediately
		if len(victims.Pods) == 0 {
			// We found a node that doesn't need any preemption. Return it!
			// This should happen rarely when one or more pods are terminated between
			// the time that scheduler tries to schedule the pod and the time that
			// preemption logic tries to find nodes for preemption.
			return node
		}
		// (2) node(s) with the fewest PDB-violating victims
		numPDBViolatingPods := victims.NumPDBViolations
		if numPDBViolatingPods < minNumPDBViolatingPods {
			minNumPDBViolatingPods = numPDBViolatingPods
			minNodes1 = nil
			lenNodes1 = 0
		}
		if numPDBViolatingPods == minNumPDBViolatingPods {
			minNodes1 = append(minNodes1, node)
			lenNodes1++
		}
	}
	if lenNodes1 == 1 {
		return minNodes1[0]
	}

	// (3) node(s) whose highest-priority victim has the lowest priority
	// There are more than one node with minimum number PDB violating pods. Find
	// the one with minimum highest priority victim.
	minHighestPriority := int32(math.MaxInt32)
	var minNodes2 = make([]*v1.Node, lenNodes1)
	lenNodes2 := 0
	for i := 0; i < lenNodes1; i++ {
		node := minNodes1[i]
		victims := nodesToVictims[node]
		// highestPodPriority is the highest priority among the victims on this node.
		highestPodPriority := podutil.GetPodPriority(victims.Pods[0])
		if highestPodPriority < minHighestPriority {
			minHighestPriority = highestPodPriority
			lenNodes2 = 0
		}
		if highestPodPriority == minHighestPriority {
			minNodes2[lenNodes2] = node
			lenNodes2++
		}
	}
	if lenNodes2 == 1 {
		return minNodes2[0]
	}

	// (4) sum the priorities of all victims on each node and keep the node(s) with the smallest sum
	// There are a few nodes with minimum highest priority victim. Find the
	// smallest sum of priorities.
	minSumPriorities := int64(math.MaxInt64)
	lenNodes1 = 0
	for i := 0; i < lenNodes2; i++ {
		var sumPriorities int64
		node := minNodes2[i]
		for _, pod := range nodesToVictims[node].Pods {
			// We add MaxInt32+1 to all priorities to make all of them >= 0. This is
			// needed so that a node with a few pods with negative priority is not
			// picked over a node with a smaller number of pods with the same negative
			// priority (and similar scenarios).
			sumPriorities += int64(podutil.GetPodPriority(pod)) + int64(math.MaxInt32+1)
		}
		if sumPriorities < minSumPriorities {
			minSumPriorities = sumPriorities
			lenNodes1 = 0
		}
		if sumPriorities == minSumPriorities {
			minNodes1[lenNodes1] = node
			lenNodes1++
		}
	}
	if lenNodes1 == 1 {
		return minNodes1[0]
	}

	// (5) node(s) with the fewest victims
	// There are a few nodes with minimum highest priority victim and sum of priorities.
	// Find one with the minimum number of pods.
	minNumPods := math.MaxInt32
	lenNodes2 = 0
	for i := 0; i < lenNodes1; i++ {
		node := minNodes1[i]
		numPods := len(nodesToVictims[node].Pods)
		if numPods < minNumPods {
			minNumPods = numPods
			lenNodes2 = 0
		}
		if numPods == minNumPods {
			minNodes2[lenNodes2] = node
			lenNodes2++
		}
	}
	if lenNodes2 == 1 {
		return minNodes2[0]
	}

	// (6) node whose earliest-started victim started most recently
	// There are a few nodes with same number of pods.
	// Find the node that satisfies latest(earliestStartTime(all highest-priority pods on node))
	latestStartTime := util.GetEarliestPodStartTime(nodesToVictims[minNodes2[0]])
	if latestStartTime == nil {
		// If the earliest start time of all pods on the 1st node is nil, just return it,
		// which is not expected to happen.
		klog.Errorf("earliestStartTime is nil for node %s. Should not reach here.", minNodes2[0])
		return minNodes2[0]
	}
	nodeToReturn := minNodes2[0]
	for i := 1; i < lenNodes2; i++ {
		node := minNodes2[i]
		// Get earliest start time of all pods on the current node.
		earliestStartTimeOnNode := util.GetEarliestPodStartTime(nodesToVictims[node])
		if earliestStartTimeOnNode == nil {
			klog.Errorf("earliestStartTime is nil for node %s. Should not reach here.", node)
			continue
		}
		if earliestStartTimeOnNode.After(latestStartTime.Time) {
			latestStartTime = earliestStartTimeOnNode
			nodeToReturn = node
		}
	}

	// (7) return the node chosen by the rules above
	return nodeToReturn
}
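
The tie-breaking cascade can be expressed compactly as a list of key functions, each keeping only the nodes with the minimum key. The following is a minimal sketch under simplifying assumptions: the Victims type, its EarliestStart field (a Unix timestamp), and the keepMin and pickNode helpers are hypothetical, and node names are sorted first so the result is deterministic.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Victims is a simplified, hypothetical summary of the victims on one node.
type Victims struct {
	Priorities    []int32 // victim priorities, highest first
	PDBViolations int     // victims whose eviction would violate a PDB
	EarliestStart int64   // Unix start time of the earliest-started victim
}

// keepMin returns the nodes that share the minimum key. nodes must be non-empty.
func keepMin(nodes []string, key func(string) int64) []string {
	best := key(nodes[0])
	out := []string{nodes[0]}
	for _, n := range nodes[1:] {
		k := key(n)
		switch {
		case k < best:
			best, out = k, []string{n}
		case k == best:
			out = append(out, n)
		}
	}
	return out
}

// pickNode applies the tie-breaking rules in order until one node remains.
func pickNode(c map[string]Victims) string {
	if len(c) == 0 {
		return ""
	}
	names := make([]string, 0, len(c))
	for n, v := range c {
		if len(v.Priorities) == 0 {
			return n // no preemption needed on this node
		}
		names = append(names, n)
	}
	sort.Strings(names)

	keys := []func(string) int64{
		func(n string) int64 { return int64(c[n].PDBViolations) },
		func(n string) int64 { return int64(c[n].Priorities[0]) },
		func(n string) int64 {
			var sum int64
			for _, p := range c[n].Priorities {
				// Shift priorities to be non-negative, as in the listing above.
				sum += int64(p) + math.MaxInt32 + 1
			}
			return sum
		},
		func(n string) int64 { return int64(len(c[n].Priorities)) },
		func(n string) int64 { return -c[n].EarliestStart }, // latest start wins
	}
	for _, key := range keys {
		names = keepMin(names, key)
		if len(names) == 1 {
			break
		}
	}
	return names[0]
}

func main() {
	candidates := map[string]Victims{
		"node-1": {Priorities: []int32{10, 5}, PDBViolations: 1, EarliestStart: 100},
		"node-2": {Priorities: []int32{10, 5}, PDBViolations: 0, EarliestStart: 100},
		"node-3": {Priorities: []int32{8, 8}, PDBViolations: 0, EarliestStart: 100},
	}
	fmt.Println(pickNode(candidates)) // prints node-3
}
```

Here node-1 is dropped for having a PDB violation, and node-3 wins over node-2 because its highest-priority victim (8) is lower than node-2's (10). The original function unrolls the same cascade into explicit loops and reuses two slices to avoid allocations.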

Summary

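kube-scheduler preemption flowchart

The flowchart below shows the core steps of the kube-scheduler preemption logic. Before the preemption logic starts, the scheduler first checks whether preemption is enabled.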