How the Kubernetes Scheduler Works

阿新 • Published: 2019-01-17

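This article walks through the Kubernetes Scheduler's algorithm and how it works, focusing on the Predicates (filtering) and Priorities (scoring) steps and on the Default Policies that ship as the default configuration. A follow-up post will go through the Kubernetes Scheduler source code to look at the implementation details and at how to develop a custom Policy.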

Introduction to the Scheduler and Its Algorithm

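The Kubernetes Scheduler is a component of the Kubernetes Master. It is usually deployed on the same node as the API Server and the Controller Manager, and together they form the three core components of the Master.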

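In one sentence, the Scheduler takes Pods whose PodSpec.NodeName is empty, one at a time, runs each through two steps, Predicates (filtering) and Priorities (scoring), and picks the most suitable Node as that Pod's destination.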

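Expanding these two steps gives a description of the Scheduler's algorithm:

Predicates: using the configured Predicates Policies (by default, the set of default predicates policies defined in DefaultProvider), filter out the Nodes that do not satisfy those Policies. The remaining Nodes are the input to the Priorities step.
Priorities: using the configured Priorities Policies (by default, the set of default priorities policies defined in DefaultProvider), score and rank the Nodes that passed filtering. The Node with the highest score is the best fit, and the Pod is bound to it.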

If several Nodes tie for the highest score after ranking, the scheduler picks one of them at random as the target Node.
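
The two steps and the random tie-break can be sketched in a few lines. This is a minimal illustration in Python, not the scheduler's actual Go code: the `schedule` function, the dict-based pod and node shapes, and the two toy policies are all assumptions made for the example.

```python
import random

def schedule(pod, nodes, predicates, priorities, rng=random):
    """Pick a node name for pod (hypothetical helper, not a Kubernetes API).

    predicates: list of functions (pod, node) -> bool
    priorities: list of (function (pod, node) -> score in 0..10, weight) pairs
    """
    # Predicates step: keep only the nodes that pass every predicate.
    feasible = [n for n in nodes if all(p(pod, n) for p in predicates)]
    if not feasible:
        raise RuntimeError("no node fits the pod")

    # Priorities step: weighted sum of the per-policy scores for each node.
    scores = {n["name"]: sum(w * f(pod, n) for f, w in priorities) for n in feasible}

    # Nodes tied for the top score are chosen among at random.
    best = max(scores.values())
    return rng.choice(sorted(name for name, s in scores.items() if s == best))

# Toy data: node "c" has no free CPU and is filtered out; "b" scores highest.
nodes = [{"name": "a", "free_cpu": 2}, {"name": "b", "free_cpu": 8}, {"name": "c", "free_cpu": 0}]
pod = {"cpu": 1}
has_cpu = lambda pod, node: node["free_cpu"] >= pod["cpu"]
more_free = lambda pod, node: min(10, node["free_cpu"])
print(schedule(pod, nodes, [has_cpu], [(more_free, 1)]))  # b
```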

The scheduling algorithm itself is therefore very simple; what matters is the logic inside the Policies. The next sections look at the Predicates and Priorities Policies that Kubernetes provides.

Predicates and Priorities Policies

Predicates Policies

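Predicates Policies are what the Scheduler uses to filter for Nodes that satisfy the defined conditions. Concurrently (with at most 16 goroutines), it runs every Predicates Policy against each Node to check whether the Node satisfies all of the configured Predicates Policies. A Node that fails any one Policy is eliminated immediately.

Note: the number of concurrent goroutines equals the number of Nodes, capped at 16, and is controlled by a queue.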
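
The capped concurrency described above can be imitated with a thread pool. The sketch below is an analogy in Python rather than the scheduler's goroutine and queue code; `filter_nodes`, `MAX_WORKERS`, and the node dicts are names assumed for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 16  # the cap mentioned in the text

def filter_nodes(pod, nodes, predicates):
    """Return the nodes that pass every predicate, checking nodes concurrently."""
    def fits(node):
        # all() stops at the first failing predicate, so the node is rejected at once.
        return all(pred(pod, node) for pred in predicates)

    if not nodes:
        return []
    # One worker per node, but never more than MAX_WORKERS.
    workers = min(MAX_WORKERS, len(nodes))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fits, nodes))  # map preserves input order
    return [node for node, ok in zip(nodes, results) if ok]
```

A predicate here is any function taking `(pod, node)` and returning a boolean, so a host-port check could be written as `lambda pod, node: pod["host_port"] not in node["used_ports"]`.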

Kubernetes provides the Predicates Policies defined below. You can pass --policy-config-file as a kube-scheduler startup argument to specify the set of Policies to apply, for example:

{
"kind" : "Policy",
"apiVersion" : "v1",
"predicates" : [
    {"name" : "PodFitsPorts"},
    {"name" : "PodFitsResources"},
    {"name" : "NoDiskConflict"},
    {"name" : "NoVolumeZoneConflict"},
    {"name" : "MatchNodeSelector"},
    {"name" : "HostName"}
    ],
"priorities" : [
    ...
    ]
}

NoDiskConflict: Evaluate if a pod can fit due to the volumes it requests, and those that are already mounted. Currently supported volumes are: AWS EBS, GCE PD, ISCSI and Ceph RBD. Only Persistent Volume Claims for those supported types are checked. Persistent Volumes added directly to pods are not evaluated and are not constrained by this policy.
NoVolumeZoneConflict: Evaluate if the volumes a pod requests are available on the node, given the Zone restrictions.
PodFitsResources: Check if the free resource (CPU and Memory) meets the requirement of the Pod. The free resource is measured by the capacity minus the sum of requests of all Pods on the node. To learn more about the resource QoS in Kubernetes, please check QoS proposal.
PodFitsHostPorts: Check if any HostPort required by the Pod is already occupied on the node.
HostName: Filter out all nodes except the one specified in the PodSpec’s NodeName field.
MatchNodeSelector: Check if the labels of the node match the labels specified in the Pod’s nodeSelector field and, as of Kubernetes v1.2, also match the scheduler.alpha.kubernetes.io/affinity pod annotation if present. See here for more details on both.
MaxEBSVolumeCount: Ensure that the number of attached ElasticBlockStore volumes does not exceed a maximum value (by default, 39, since Amazon recommends a maximum of 40 with one of those 40 reserved for the root volume – see Amazon’s documentation). The maximum value can be controlled by setting the KUBE_MAX_PD_VOLS environment variable.
MaxGCEPDVolumeCount: Ensure that the number of attached GCE PersistentDisk volumes does not exceed a maximum value (by default, 16, which is the maximum GCE allows – see GCE’s documentation). The maximum value can be controlled by setting the KUBE_MAX_PD_VOLS environment variable.
CheckNodeMemoryPressure: Check if a pod can be scheduled on a node reporting memory pressure condition. Currently, no BestEffort pod should be placed on a node under memory pressure, as it would be automatically evicted by the kubelet.
CheckNodeDiskPressure: Check if a pod can be scheduled on a node reporting disk pressure condition. Currently, no pods should be placed on a node under disk pressure as it gets automatically evicted by kubelet.
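
To make the PodFitsResources rule concrete (free resource = capacity minus the sum of requests of the Pods already on the node), here is a small Python sketch. The function name, the dict layout, and the units (CPU in millicores, memory in bytes) are assumptions for the example, not the scheduler's data structures.

```python
def pod_fits_resources(pod_request, node_capacity, running_requests):
    """True if the node's free CPU and memory cover the pod's request.

    pod_request and node_capacity are dicts with "cpu" and "memory" keys;
    running_requests is a list of such dicts, one per pod already on the node.
    """
    for resource in ("cpu", "memory"):
        used = sum(r[resource] for r in running_requests)
        free = node_capacity[resource] - used  # capacity minus existing requests
        if pod_request[resource] > free:
            return False
    return True

GiB = 1024 ** 3
capacity = {"cpu": 4000, "memory": 8 * GiB}
running = [{"cpu": 1000, "memory": 2 * GiB}, {"cpu": 2000, "memory": 4 * GiB}]
print(pod_fits_resources({"cpu": 1000, "memory": 2 * GiB}, capacity, running))  # True
print(pod_fits_resources({"cpu": 1500, "memory": 1 * GiB}, capacity, running))  # False
```

Note that the check uses requests, not actual usage, which matches the description above.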

The default DefaultProvider selects the following Predicates Policies:

NoVolumeZoneConflict
MaxEBSVolumeCount
MaxGCEPDVolumeCount
MatchInterPodAffinity

Note: fit is determined by inter-pod affinity. AffinityAnnotationKey represents the key of affinity data (json serialized) in the Annotations of a Pod.

AffinityAnnotationKey string = "scheduler.alpha.kubernetes.io/affinity"
NoDiskConflict
GeneralPredicates
- PodFitsResources
  - pod, in number
  - cpu, in cores
  - memory, in bytes
  - alpha.kubernetes.io/nvidia-gpu, in devices. As of v1.4, each node supports at most 1 GPU
- PodFitsHost
- PodFitsHostPorts
- PodSelectorMatches
PodToleratesNodeTaints
CheckNodeMemoryPressure
CheckNodeDiskPressure

Priorities Policies

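Nodes that pass predicate filtering move on to the Priorities step. Scoring runs concurrently: each priority policy gets its own goroutine, and that goroutine, following the policy's implementation, walks all of the filtered Nodes and scores each one. Each Policy gives each Node a score from 0 to 10, where 0 is the lowest and 10 the highest. Once the goroutines for all policies have finished, each Node's per-policy scores are multiplied by the configured weight of the corresponding priorities policy and summed to give the Node's final score.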

finalScoreNodeA = (weight1 * priorityFunc1) + (weight2 * priorityFunc2)
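
As a worked instance of the formula, the snippet below computes the weighted sum for two nodes. The per-policy scores are invented numbers chosen only to show the arithmetic, and `final_score` is a hypothetical helper.

```python
def final_score(policy_scores, weights):
    """Weighted sum of one node's per-policy scores (each in 0..10)."""
    return sum(weights[name] * score for name, score in policy_scores.items())

weights = {"LeastRequestedPriority": 1, "BalancedResourceAllocation": 1}
node_a = {"LeastRequestedPriority": 7, "BalancedResourceAllocation": 9}
node_b = {"LeastRequestedPriority": 8, "BalancedResourceAllocation": 5}

print(final_score(node_a, weights))  # 16
print(final_score(node_b, weights))  # 13
```

With equal weights Node A wins even though Node B has the better LeastRequestedPriority score; raising that policy's weight to 3 would give 30 against 29 and still favor Node A, while a weight of 5 would flip the result (44 against 45).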

Note: the number of concurrent goroutines equals the number of Nodes, capped at 16, and is controlled by a queue.

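A thought: if no Node satisfies the conditions after filtering, the scheduler returns a FailedPredicates error right away and never enters the Prioritizing phase, which is reasonable. However, if exactly one Node passes filtering, Prioritizing is still triggered and follows the same path as it would for multiple Nodes. In that case the scheduler could simply return that Node as the final scheduling result without running the whole scoring process.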
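
The shortcut suggested here could look like the following Python sketch. It illustrates the proposed optimization only and does not describe what the scheduler does; `pick_node` and `score_node` are hypothetical names.

```python
def pick_node(feasible, score_node):
    """Choose among nodes that already passed the predicates.

    score_node: function node -> weighted final score.
    """
    if not feasible:
        raise RuntimeError("no node passed the predicates")
    if len(feasible) == 1:
        # Only one candidate: return it without running any scoring.
        return feasible[0]
    # max() keeps the first of several tied nodes; a real scheduler
    # would break such ties at random.
    return max(feasible, key=score_node)
```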

If several Nodes tie for the highest score after ranking, the scheduler picks one of them at random as the target Node.

Kubernetes provides the Priorities Policies defined below. You can pass --policy-config-file as a kube-scheduler startup argument to specify the set of Policies to apply, for example:

{
"kind" : "Policy",
"apiVersion" : "v1",
"predicates" : [
    ...
    ],
"priorities" : [
    {"name" : "LeastRequestedPriority", "weight" : 1},
    {"name" : "BalancedResourceAllocation", "weight" : 1},
    {"name" : "ServiceSpreadingPriority", "weight" : 1},
    {"name" : "EqualPriority", "weight" : 1}
    ]
}

LeastRequestedPriority: The node is prioritized based on the fraction of the node that would be free if the new Pod were scheduled onto the node. (In other words, (capacity - sum of requests of all Pods already on the node - request of Pod that is being scheduled) / capacity). CPU and memory are equally weighted. The node with the highest free fraction is the most preferred. Note that this priority function has the effect of spreading Pods across the nodes with respect to resource consumption.
BalancedResourceAllocation: This priority function tries to put the Pod on a node such that the CPU and Memory utilization rate is balanced after the Pod is deployed.
SelectorSpreadPriority: Spread Pods by minimizing the number of Pods belonging to the same service, replication controller, or replica set on the same node. If zone information is present on the nodes, the priority will be adjusted so that pods are spread across zones and nodes.
CalculateAntiAffinityPriority: Spread Pods by minimizing the number of Pods belonging to the same service on nodes with the same value for a particular label.
ImageLocalityPriority: Nodes are prioritized based on locality of images requested by a pod. Nodes with larger size of already-installed packages required by the pod will be preferred over nodes with no already-installed packages required by the pod or a small total size of already-installed packages required by the pod.
NodeAffinityPriority: (Kubernetes v1.2) Implements preferredDuringSchedulingIgnoredDuringExecution node affinity; see here for more details.
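
The LeastRequestedPriority description translates into a short calculation. The sketch below follows the stated fraction, (capacity - requests already on the node - request of the new Pod) / capacity, and averages CPU and memory with equal weight. Mapping the fraction onto the 0-10 scale by multiplying by 10 and truncating is an assumption of this example, as are the function names and the dict fields.

```python
def free_fraction(capacity, already_requested, pod_request):
    """Fraction of a resource left free if the pod were placed on the node."""
    if capacity == 0:
        return 0.0
    free = capacity - already_requested - pod_request
    return max(free, 0) / capacity

def least_requested_score(node, pod):
    """Score 0..10: average of the CPU and memory free fractions, scaled by 10."""
    cpu = free_fraction(node["cpu_capacity"], node["cpu_requested"], pod["cpu"])
    mem = free_fraction(node["mem_capacity"], node["mem_requested"], pod["mem"])
    return int(10 * (cpu + mem) / 2)

node = {"cpu_capacity": 4000, "cpu_requested": 1000, "mem_capacity": 8, "mem_requested": 2}
pod = {"cpu": 1000, "mem": 2}
print(least_requested_score(node, pod))  # 5
```

Here half of the CPU and half of the memory would remain free after placement, so both fractions are 0.5 and the score is 5.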

The default DefaultProvider selects the following Priorities Policies:

SelectorSpreadPriority, default weight 1
InterPodAffinityPriority, default weight 1
- pods should be placed in the same topological domain (e.g. same node, same rack, same zone, same power domain, etc.) as some other pods, or, conversely, should not be placed in the same topological domain as some other pods.
- AffinityAnnotationKey represents the key of affinity data (json serialized) in the Annotations of a Pod.
scheduler.alpha.kubernetes.io/affinity="..."
LeastRequestedPriority, default weight 1
BalancedResourceAllocation, default weight 1
NodePreferAvoidPodsPriority, default weight 10000

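Note: this weight is deliberately very large (10000). If the Node's score for this policy is non-zero, its weighted final score will be very high; if the score is 0, the Node falls so far behind the high scorers that it is bound to be eliminated. In detail:

If the Node's Annotation does not set the key-value:

scheduler.alpha.kubernetes.io/preferAvoidPods="..."

then the Node scores 10 for this policy; with the weight of 10000, the Node's final score is at least 100,000.

If the Node's Annotation does set

scheduler.alpha.kubernetes.io/preferAvoidPods="..."

and the Pod's Controller is a ReplicationController or ReplicaSet, then the Node scores 0 for this policy, which puts it far below any Node that does not have the Annotation. In other words, such a Node is certain to lose as long as some feasible Node lacks the Annotation.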
NodeAffinityPriority, default weight 1
TaintTolerationPriority, default weight 1
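
The effect of the 10000 weight on NodePreferAvoidPodsPriority is easy to check with the default weights listed above. In this Python sketch the per-policy scores are invented extremes and `total` is a hypothetical helper: one node has no preferAvoidPods annotation but scores 0 on everything else, the other carries the annotation but scores a perfect 10 on every other policy.

```python
DEFAULT_WEIGHTS = {
    "SelectorSpreadPriority": 1,
    "InterPodAffinityPriority": 1,
    "LeastRequestedPriority": 1,
    "BalancedResourceAllocation": 1,
    "NodePreferAvoidPodsPriority": 10000,
    "NodeAffinityPriority": 1,
    "TaintTolerationPriority": 1,
}

def total(scores):
    """Weighted final score; policies missing from scores count as 0."""
    return sum(w * scores.get(name, 0) for name, w in DEFAULT_WEIGHTS.items())

# No annotation: 10 on NodePreferAvoidPodsPriority, 0 on everything else.
plain = {"NodePreferAvoidPodsPriority": 10}

# Annotated: 0 on NodePreferAvoidPodsPriority, 10 on the six other policies.
avoided = {name: 10 for name in DEFAULT_WEIGHTS}
avoided["NodePreferAvoidPodsPriority"] = 0

print(total(plain))    # 100000
print(total(avoided))  # 60
```

Even in the best case the six weight-1 policies contribute at most 60 points, so they cannot make up for a 0 on the weight-10000 policy.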

Scheduler Algorithm Flowchart

[Figure: scheduler algorithm flowchart; image not available]

Summary

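The Kubernetes scheduler's job is to schedule each pod onto the most suitable Node.
The scheduling process has two steps: Predicates (filtering) and Priorities (scoring).
The default scheduling policy is DefaultProvider; the policies it includes are listed above.
You can use the kube-scheduler startup argument --policy-config-file to point to a custom JSON file that assembles your own Predicates and Priorities policies in the format shown.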
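
Before handing a policy file to kube-scheduler it can help to sanity-check it. The Python sketch below parses a small policy document and lists what it enables. The `summarize_policy` helper is hypothetical and not part of Kubernetes, and the sample policy only reuses names that appear in this article.

```python
import json

POLICY = """
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {"name": "PodFitsResources"},
    {"name": "MatchNodeSelector"}
  ],
  "priorities": [
    {"name": "LeastRequestedPriority", "weight": 1},
    {"name": "BalancedResourceAllocation", "weight": 1}
  ]
}
"""

def summarize_policy(text):
    """Return (predicate names, {priority name: weight}) from a policy JSON string."""
    policy = json.loads(text)  # raises ValueError on malformed JSON
    if policy.get("kind") != "Policy":
        raise ValueError("not a scheduler Policy document")
    predicates = [p["name"] for p in policy.get("predicates", [])]
    weights = {p["name"]: p["weight"] for p in policy.get("priorities", [])}
    return predicates, weights

predicates, weights = summarize_policy(POLICY)
print(predicates)
print(weights)
```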
