How the Kubernetes Scheduler Works

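This article walks through the algorithm and principles behind the Kubernetes Scheduler. It focuses on the two steps of scheduling, Predicates (filtering) and Priorities (scoring), and describes the Default Policies that are configured out of the box. In my next post I will go through the Kubernetes Scheduler source code to look at the implementation details and at how to develop a Policy of your own.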


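Kubernetes Scheduler is a component of the Kubernetes Master. It is usually deployed on the same node as the API Server and the Controller Manager, and together they make up the three core components of the Master.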



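The scheduler places each Pod in two steps:

  • Predicates (filtering): using the configured Predicates Policies (by default, the default predicates policies defined in DefaultProvider), filter out the Nodes that do not satisfy these Policies. The remaining Nodes are the input to the scoring step.
  • Priorities (scoring): using the configured Priorities Policies (by default, the default priorities policies defined in DefaultProvider), score and rank the Nodes that passed filtering. The Node with the highest score is the best fit, and the Pod is bound to it.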


The scheduling algorithm itself is therefore very simple; what matters is the logic inside these Policies. Let's look at the Predicates and Priorities Policies that Kubernetes provides.
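The two steps can be sketched in a few lines of Python. This is only an illustration of the filter-then-score flow described above, not the scheduler's Go source. The `schedule` function, the dict-based `Node` and `Pod` types, and the two example policies (`fits_cpu`, `most_free_cpu`) are hypothetical names introduced for this sketch.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-ins: nodes and pods are plain dicts of integer attributes.
Node = Dict[str, int]
Pod = Dict[str, int]
Predicate = Callable[[Pod, Node], bool]
Priority = Callable[[Pod, Node], int]  # expected to return a score in 0..10


def schedule(pod: Pod, nodes: Dict[str, Node],
             predicates: List[Predicate],
             priorities: List[Tuple[Priority, int]]) -> Optional[str]:
    """Return the name of the best node for the pod, or None if none fits."""
    # Step 1 (predicates): keep only the nodes that pass every predicate.
    feasible = {name: node for name, node in nodes.items()
                if all(pred(pod, node) for pred in predicates)}
    if not feasible:
        return None

    # Step 2 (priorities): weighted sum of the per-policy scores.
    def total(node: Node) -> int:
        return sum(weight * score(pod, node) for score, weight in priorities)

    return max(feasible, key=lambda name: total(feasible[name]))


# Example policies, assumptions made up for illustration only.
def fits_cpu(pod: Pod, node: Node) -> bool:
    return node["free_cpu"] >= pod["cpu"]


def most_free_cpu(pod: Pod, node: Node) -> int:
    return min(10, node["free_cpu"] - pod["cpu"])
```

Calling `schedule({"cpu": 2}, nodes, [fits_cpu], [(most_free_cpu, 1)])` first drops every node with less than 2 free CPUs and then returns the name of the remaining node with the highest weighted score.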

Predicates and Priorities Policies

Predicates Policies

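Predicates Policies are what the Scheduler uses to filter out the Nodes that meet the defined conditions. The filtering runs concurrently (at most 16 goroutines): each Node is checked against all configured Predicates Policies in turn, and a Node that fails even one Policy is eliminated immediately.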

Note: the number of concurrent goroutines equals the total number of Nodes, capped at 16 and controlled by a queue.
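The capped concurrency can be approximated as follows. This is a rough Python sketch in which threads stand in for goroutines and a thread pool stands in for the queue; `filter_nodes` and `MAX_WORKERS` are hypothetical names, and the real scheduler is written in Go.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

MAX_WORKERS = 16  # the cap mentioned in the text


def filter_nodes(pod: dict, nodes: Dict[str, dict],
                 predicates: List[Callable[[dict, dict], bool]]) -> List[str]:
    """Return the names of the nodes that satisfy every predicate."""
    if not nodes:
        return []

    def fits(name: str) -> bool:
        # all() stops at the first failing predicate, so a node is
        # rejected as soon as one policy is not satisfied.
        return all(pred(pod, nodes[name]) for pred in predicates)

    names = list(nodes)
    # One worker per node, but never more than MAX_WORKERS.
    workers = min(MAX_WORKERS, len(names))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fits, names))  # map() preserves input order
    return [name for name, ok in zip(names, results) if ok]
```

With 40 nodes the pool still uses only 16 workers; with 3 nodes it uses 3.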

Kubernetes provides the Predicates Policies listed below. To choose which Policies apply, start kube-scheduler with the --policy-config-file flag pointing at a file such as the following (the priorities list is left empty here for brevity):

{
"kind" : "Policy",
"apiVersion" : "v1",
"predicates" : [
    {"name" : "PodFitsPorts"},
    {"name" : "PodFitsResources"},
    {"name" : "NoDiskConflict"},
    {"name" : "NoVolumeZoneConflict"},
    {"name" : "MatchNodeSelector"},
    {"name" : "HostName"}
    ],
"priorities" : [
    ]
}

  1. NoDiskConflict: Evaluate if a pod can fit due to the volumes it requests, and those that are already mounted. Currently supported volumes are: AWS EBS, GCE PD, ISCSI and Ceph RBD. Only Persistent Volume Claims for those supported types are checked. Persistent Volumes added directly to pods are not evaluated and are not constrained by this policy.

  2. NoVolumeZoneConflict: Evaluate if the volumes a pod requests are available on the node, given the Zone restrictions.

  3. PodFitsResources: Check if the free resource (CPU and Memory) meets the requirement of the Pod. The free resource is measured by the capacity minus the sum of requests of all Pods on the node. To learn more about the resource QoS in Kubernetes, please check QoS proposal.

  4. PodFitsHostPorts: Check if any HostPort required by the Pod is already occupied on the node.

  5. HostName: Filter out all nodes except the one specified in the PodSpec’s NodeName field.

  6. MatchNodeSelector: Check if the labels of the node match the labels specified in the Pod’s nodeSelector field and, as of Kubernetes v1.2, also match the scheduler.alpha.kubernetes.io/affinity pod annotation if present. See here for more details on both.

  7. MaxEBSVolumeCount: Ensure that the number of attached ElasticBlockStore volumes does not exceed a maximum value (by default, 39, since Amazon recommends a maximum of 40 with one of those 40 reserved for the root volume – see Amazon’s documentation). The maximum value can be controlled by setting the KUBE_MAX_PD_VOLS environment variable.

  8. MaxGCEPDVolumeCount: Ensure that the number of attached GCE PersistentDisk volumes does not exceed a maximum value (by default, 16, which is the maximum GCE allows – see GCE’s documentation). The maximum value can be controlled by setting the KUBE_MAX_PD_VOLS environment variable.

  9. CheckNodeMemoryPressure: Check if a pod can be scheduled on a node reporting memory pressure condition. Currently, no BestEffort pod should be placed on a node under memory pressure, as it would be automatically evicted by the kubelet.

  10. CheckNodeDiskPressure: Check if a pod can be scheduled on a node reporting disk pressure condition. Currently, no pods should be placed on a node under disk pressure, as they would be automatically evicted by the kubelet.
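To make a few of these checks concrete, here is a minimal Python sketch of the logic behind PodFitsResources, PodFitsHostPorts, and the KUBE_MAX_PD_VOLS override, written from the descriptions above. The function names and dict layouts are hypothetical, and the handling of invalid environment values is an assumption; the actual predicates live in the scheduler's Go code.

```python
import os
from typing import Dict, Iterable

DEFAULT_MAX_EBS_VOLUMES = 39  # default quoted in the list above


def pod_fits_resources(pod_request: Dict[str, int],
                       capacity: Dict[str, int],
                       existing_requests: Iterable[Dict[str, int]]) -> bool:
    """Free resource = capacity - sum of requests of pods already on the node."""
    existing = list(existing_requests)
    for resource in ("cpu", "memory"):
        used = sum(req.get(resource, 0) for req in existing)
        free = capacity[resource] - used
        if pod_request.get(resource, 0) > free:
            return False
    return True


def pod_fits_host_ports(wanted: Iterable[int], in_use: Iterable[int]) -> bool:
    """True when none of the pod's HostPorts is already occupied on the node."""
    return set(wanted).isdisjoint(in_use)


def max_volume_count(default: int = DEFAULT_MAX_EBS_VOLUMES) -> int:
    """Read the limit from KUBE_MAX_PD_VOLS, falling back to the default."""
    raw = os.environ.get("KUBE_MAX_PD_VOLS", "")
    # Assumption: values that are not positive integers are ignored.
    return int(raw) if raw.isdecimal() and int(raw) > 0 else default
```

For example, on a node with 4 cores where existing pods request 3, a pod requesting 1 core fits and a pod requesting 2 does not.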

The default DefaultProvider selects the following Predicates Policies:

  1. NoVolumeZoneConflict
  2. MaxEBSVolumeCount
  3. MaxGCEPDVolumeCount
  4. MatchInterPodAffinity

    Note: fit is determined by inter-pod affinity. AffinityAnnotationKey represents the key of affinity data (JSON serialized) in the Annotations of a Pod.

    AffinityAnnotationKey string = "scheduler.alpha.kubernetes.io/affinity"

  5. NoDiskConflict
  6. GeneralPredicates
    • PodFitsResources
      • pod, in number
      • cpu, in cores
      • memory, in bytes
      • alpha.kubernetes.io/nvidia-gpu, in devices. As of v1.4, each node supports at most 1 GPU.
    • PodFitsHost
    • PodFitsHostPorts
    • PodSelectorMatches
  7. PodToleratesNodeTaints
  8. CheckNodeMemoryPressure
  9. CheckNodeDiskPressure

Priorities Policies

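The Nodes that survive predicate filtering move on to the priorities step. Here the scheduler scores them concurrently: it starts goroutines, and each goroutine applies the corresponding Policy's implementation to all of the filtered Nodes. Every Policy gives every Node a score from 0 to 10, where 0 is the lowest and 10 the highest. Once the goroutines for all Policies have finished, the scheduler computes each Node's final score as the weighted sum of its per-Policy scores, using the weight configured for each Priorities Policy.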

finalScoreNodeA = (weight1 * priorityFunc1) + (weight2 * priorityFunc2)

Note: here too, the number of concurrent goroutines equals the total number of Nodes, capped at 16 and controlled by a queue.
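The weighted sum in the formula above can be worked through with a short Python sketch. The function names and the sample scores are made up for illustration; only the rule "final score = sum of weight times score, with each score in 0 to 10" comes from the text.

```python
from typing import Dict, List, Tuple

# Each entry: (policy name, weight, {node name: score in 0..10}).
PolicyScores = Tuple[str, int, Dict[str, int]]


def final_scores(policies: List[PolicyScores]) -> Dict[str, int]:
    """Sum weight * score over all policies, per node."""
    totals: Dict[str, int] = {}
    for _name, weight, scores in policies:
        for node, score in scores.items():
            if not 0 <= score <= 10:
                raise ValueError(f"score {score} for {node} is outside 0-10")
            totals[node] = totals.get(node, 0) + weight * score
    return totals


def best_node(policies: List[PolicyScores]) -> str:
    """Return the node with the highest final score."""
    totals = final_scores(policies)
    return max(totals, key=totals.get)
```

With `priorityFunc1` scoring Node A 7 and Node B 4 at weight 1, and `priorityFunc2` scoring A 5 and B 9 at weight 2, the totals are 17 for A and 22 for B, so B is chosen.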



Kubernetes provides the Priorities Policies listed below. To choose which Policies apply, start kube-scheduler with the --policy-config-file flag pointing at a file such as the following (the predicates list is left empty here for brevity):

{
"kind" : "Policy",
"apiVersion" : "v1",
"predicates" : [
    ],
"priorities" : [
    {"name" : "LeastRequestedPriority", "weight" : 1},
    {"name" : "BalancedResourceAllocation", "weight" : 1},
    {"name" : "ServiceSpreadingPriority", "weight" : 1},
    {"name" : "EqualPriority", "weight" : 1}
    ]
}

  • LeastRequestedPriority: The node is prioritized based on the fraction of the node that would be free if the new Pod were scheduled onto the node. (In other words, (capacity - sum of requests of all Pods already on the node - request of Pod that is being scheduled) / capacity). CPU and memory are equally weighted. The node with the highest free fraction is the most preferred. Note that this priority function has the effect of spreading Pods across the nodes with respect to resource consumption.
  • BalancedResourceAllocation: This priority function tries to put the Pod on a node such that the CPU and Memory utilization rate is balanced after the Pod is deployed.
  • SelectorSpreadPriority: Spread Pods by minimizing the number of Pods belonging to the same service, replication controller, or replica set on the same node. If zone information is present on the nodes, the priority will be adjusted so that pods are spread across zones and nodes.
  • CalculateAntiAffinityPriority: Spread Pods by minimizing the number of Pods belonging to the same service on nodes with the same value for a particular label.
  • ImageLocalityPriority: Nodes are prioritized based on locality of images requested by a pod. Nodes with larger size of already-installed packages required by the pod will be preferred over nodes with no already-installed packages required by the pod or a small total size of already-installed packages required by the pod.
  • NodeAffinityPriority: (Kubernetes v1.2) Implements preferredDuringSchedulingIgnoredDuringExecution node affinity; see here for more details.
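As a worked example of the first two entries, the following Python sketch computes a LeastRequestedPriority-style and a BalancedResourceAllocation-style score. The scaling of the free fraction to the 0 to 10 range and the formula `10 - 10 * |cpu_fraction - mem_fraction|` are assumptions chosen to match the descriptions above, not code taken from the scheduler. In each function, `requested` means the requests of the Pods already on the node plus the request of the Pod being scheduled.

```python
def least_requested_score(requested: float, capacity: float) -> float:
    """Free fraction after placing the pod, scaled to 0..10."""
    if capacity <= 0 or requested > capacity:
        return 0.0
    return (capacity - requested) * 10 / capacity


def least_requested_priority(cpu_req: float, cpu_cap: float,
                             mem_req: float, mem_cap: float) -> float:
    """CPU and memory are equally weighted, so take the mean of both scores."""
    return (least_requested_score(cpu_req, cpu_cap)
            + least_requested_score(mem_req, mem_cap)) / 2


def balanced_resource_allocation(cpu_req: float, cpu_cap: float,
                                 mem_req: float, mem_cap: float) -> float:
    """Higher when CPU and memory utilization are close to each other.

    Assumption: score = 10 - 10 * |cpu_fraction - mem_fraction|,
    and 0 when either resource would be fully used.
    """
    cpu_fraction = cpu_req / cpu_cap
    mem_fraction = mem_req / mem_cap
    if cpu_fraction >= 1 or mem_fraction >= 1:
        return 0.0
    return 10 - abs(cpu_fraction - mem_fraction) * 10
```

A node that would end up with 2 of 4 cores and 2 of 8 GiB requested scores (5 + 7.5) / 2 = 6.25 under the first function, while a node at 50% CPU and 50% memory gets the full 10 under the second.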

The default DefaultProvider selects the following Priorities Policies:

  1. SelectorSpreadPriority, default weight 1
  2. InterPodAffinityPriority, default weight 1

    • Pods should be placed in the same topological domain (e.g. same node, same rack, same zone, same power domain, etc.) as some other pods, or, conversely, should not be placed in the same topological domain as some other pods.
    • AffinityAnnotationKey represents the key of affinity data (JSON serialized) in the Annotations of a Pod.

  3. LeastRequestedPriority, default weight 1
  4. BalancedResourceAllocation, default weight 1
  5. NodePreferAvoidPodsPriority, default weight 10000
  6. NodeAffinityPriority, default weight 1
  7. TaintTolerationPriority, default weight 1




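  • The job of the Kubernetes scheduler is to schedule each Pod onto the most suitable Node.
  • Scheduling takes two steps: Predicates (filtering) and Priorities (scoring).
  • The default scheduling policy set is DefaultProvider; the Policies it contains are listed above.
  • You can start kube-scheduler with the --policy-config-file flag to point it at a custom JSON file in which you assemble your own Predicates and Priorities Policies, following the format shown above.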
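If you prefer to generate such a file rather than write it by hand, a small helper along these lines may be enough. This is a sketch: `build_policy` is a hypothetical name, and the check that weights are positive is an assumption of this example. The output follows the same "kind", "apiVersion", "predicates", "priorities" layout as the examples above.

```python
import json
from typing import Dict, List


def build_policy(predicates: List[str], priorities: Dict[str, int]) -> str:
    """Return the JSON text of a Policy with the given predicates and weights."""
    for name, weight in priorities.items():
        # Assumption of this sketch: a weight must be a positive integer.
        if weight <= 0:
            raise ValueError(f"weight for {name} must be positive")
    policy = {
        "kind": "Policy",
        "apiVersion": "v1",
        "predicates": [{"name": name} for name in predicates],
        "priorities": [{"name": name, "weight": weight}
                       for name, weight in priorities.items()],
    }
    return json.dumps(policy, indent=4)
```

Write the returned text to a file and pass that file's path to kube-scheduler through --policy-config-file.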


