A Deep Dive into the Kubernetes Scheduler and Scheduling Framework Core Source Code
阿新 • Published: 2021-01-10
The core implementation of the kube-scheduler lives under pkg/scheduler:
algorithmprovider: registration and lookup of scheduling algorithms; its core data structure is a map-like registry
apis: interfaces for the resource versions used in the cluster, i.e. apiVersion- and type-related content
core: the scheduler instance's core data structures and interfaces, plus the external extender mechanism
framework: defines the scheduler's internal extension mechanism (the scheduling framework)
internal: internal data structures that the scheduler instance depends on
metrics: metrics instrumentation
profile: a framework-based scheduler configuration used to drive the whole scheduling framework
testing: test code
util: general-purpose utilities
The Scheduler struct is defined in pkg/scheduler/scheduler.go.
The scheduling queue interface, SchedulingQueue, is defined in pkg/scheduler/internal/queue/scheduling_queue.go.
AssignedPodAdded, AssignedPodUpdated, and MoveAllToActiveOrBackoffQueue all call movePodsToActiveOrBackoffQueue underneath; they are registered as callbacks for resource updates (Pod, Node, and so on), so that Pods which previously could not be scheduled get another chance to be retried when a resource changes.
PriorityQueue is the concrete implementation of this interface.
Its core data structure consists of three queues, with higher-priority Pods placed at the front.
(1) activeQ: the queue that holds all Pods waiting to be scheduled.
It is implemented as a heap by default; its elements are ordered by comparing Pod priority and then Pod creation time.
Once kube-scheduler sees that a Pod's nodeName is empty, it treats the Pod as unscheduled and puts it into the scheduling queue.
(2) podBackoffQ: the queue that holds Pods whose scheduling attempts have failed and that are waiting out a backoff period.
The corresponding definitions are shown below:
type Scheduler struct {
    SchedulerCache  internalcache.Cache
    Algorithm       core.ScheduleAlgorithm
    NextPod         func() *framework.QueuedPodInfo
    Error           func(*framework.QueuedPodInfo, error) // default handler for scheduling failures
    StopEverything  <-chan struct{}
    SchedulingQueue internalqueue.SchedulingQueue // queue of Pods waiting to be scheduled
    Profiles        profile.Map                   // scheduler profiles (configuration)
    client          clientset.Interface
}

type SchedulingQueue interface {
    framework.PodNominator
    Add(pod *v1.Pod) error
    AddUnschedulableIfNotPresent(pod *framework.QueuedPodInfo, podSchedulingCycle int64) error
    SchedulingCycle() int64
    Pop() (*framework.QueuedPodInfo, error)
    Update(oldPod, newPod *v1.Pod) error
    Delete(pod *v1.Pod) error
    MoveAllToActiveOrBackoffQueue(event string)
    AssignedPodAdded(pod *v1.Pod)
    AssignedPodUpdated(pod *v1.Pod)
    PendingPods() []*v1.Pod
    Close()
    NumUnschedulablePods() int // number of Pods that could not be scheduled
    Run()
}

type PriorityQueue struct {
    framework.PodNominator // the nomination result (the Pod-to-Node mapping)
    stop  chan struct{}    // channel used by the outside world to stop the queue
    clock util.Clock
    podInitialBackoffDuration time.Duration // initial wait before a backoff Pod is rescheduled
    podMaxBackoffDuration     time.Duration // maximum wait before a backoff Pod is rescheduled
    lock sync.RWMutex
    cond sync.Cond // blocks Pop() under concurrency until an element is available
    activeQ        *heap.Heap
    podBackoffQ    *heap.Heap
    unschedulableQ *UnschedulablePodsMap
    schedulingCycle  int64 // counter, incremented each time a Pod is popped
    moveRequestCycle int64
    closed bool
}
func (p *PriorityQueue) podsCompareBackoffCompleted(podInfo1, podInfo2 interface{}) bool {
    pInfo1 := podInfo1.(*framework.QueuedPodInfo)
    pInfo2 := podInfo2.(*framework.QueuedPodInfo)
    bo1 := p.getBackoffTime(pInfo1)
    bo2 := p.getBackoffTime(pInfo2)
    return bo1.Before(bo2)
}

// getBackoffTime returns the time at which podInfo completes its backoff
func (p *PriorityQueue) getBackoffTime(podInfo *framework.QueuedPodInfo) time.Time {
    duration := p.calculateBackoffDuration(podInfo)
    backoffTime := podInfo.Timestamp.Add(duration)
    return backoffTime
}

// calculateBackoffDuration computes the backoff duration, doubling it for
// every previous attempt and capping it at podMaxBackoffDuration
func (p *PriorityQueue) calculateBackoffDuration(podInfo *framework.QueuedPodInfo) time.Duration {
    duration := p.podInitialBackoffDuration
    for i := 1; i < podInfo.Attempts; i++ {
        duration = duration * 2
        if duration > p.podMaxBackoffDuration {
            return p.podMaxBackoffDuration
        }
    }
    return duration
}
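To illustrate the resulting backoff curve, here is a minimal standalone sketch of the same doubling-with-cap logic. The 1s initial and 10s maximum durations used below correspond to the upstream defaults for podInitialBackoffSeconds and podMaxBackoffSeconds; adjust them for your own configuration.

package main

import (
    "fmt"
    "time"
)

// backoffDuration mirrors calculateBackoffDuration above: start at the
// initial duration, double once per previous attempt, cap at max.
func backoffDuration(attempts int, initial, max time.Duration) time.Duration {
    d := initial
    for i := 1; i < attempts; i++ {
        d *= 2
        if d > max {
            return max
        }
    }
    return d
}

func main() {
    for attempts := 1; attempts <= 6; attempts++ {
        // With 1s/10s this prints 1s, 2s, 4s, 8s, 10s, 10s.
        fmt.Printf("attempt %d -> backoff %v\n", attempts, backoffDuration(attempts, 1*time.Second, 10*time.Second))
    }
}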
(3) unschedulableQ: in fact a map that holds the Pods that cannot be scheduled for the time being.

type UnschedulablePodsMap struct {
    podInfoMap map[string]*framework.QueuedPodInfo
    keyFunc    func(*v1.Pod) string
    metricRecorder metrics.MetricRecorder // incremented whenever a Pod is added to or removed from the map
}

// constructor
func newUnschedulablePodsMap(metricRecorder metrics.MetricRecorder) *UnschedulablePodsMap {
    return &UnschedulablePodsMap{
        podInfoMap:     make(map[string]*framework.QueuedPodInfo),
        keyFunc:        util.GetPodFullName,
        metricRecorder: metricRecorder,
    }
}
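The map is keyed by keyFunc, which here is util.GetPodFullName. The sketch below paraphrases that helper (check pkg/scheduler/util for the real implementation): the key is simply the Pod name joined with its namespace.

package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// getPodFullName paraphrases util.GetPodFullName: the unschedulable map is
// keyed by "<pod name>_<namespace>"; an underscore is a safe separator
// because it cannot appear in a DNS-subdomain-formatted Pod name.
func getPodFullName(pod *v1.Pod) string {
    return pod.Name + "_" + pod.Namespace
}

func main() {
    pod := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "nginx-0", Namespace: "default"}}
    fmt.Println(getPodFullName(pod)) // prints "nginx-0_default"
}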
The function that creates a new Scheduler:

func New(client clientset.Interface,
    informerFactory informers.SharedInformerFactory,
    recorderFactory profile.RecorderFactory,
    stopCh <-chan struct{},
    opts ...Option) (*Scheduler, error) {

    stopEverything := stopCh
    if stopEverything == nil {
        stopEverything = wait.NeverStop
    }

    options := defaultSchedulerOptions // default scheduler options, including the default algorithmSourceProvider
    for _, opt := range opts {
        opt(&options)
    }

    schedulerCache := internalcache.New(30*time.Second, stopEverything) // initialize the scheduler cache

    registry := frameworkplugins.NewInTreeRegistry() // registry is a map from plugin name to plugin factory
    if err := registry.Merge(options.frameworkOutOfTreeRegistry); err != nil {
        return nil, err
    }

    snapshot := internalcache.NewEmptySnapshot()

    configurator := &Configurator{ // build a Configurator instance from the options
        client:                   client,
        recorderFactory:          recorderFactory,
        informerFactory:          informerFactory,
        schedulerCache:           schedulerCache,
        StopEverything:           stopEverything,
        percentageOfNodesToScore: options.percentageOfNodesToScore,
        podInitialBackoffSeconds: options.podInitialBackoffSeconds,
        podMaxBackoffSeconds:     options.podMaxBackoffSeconds,
        profiles:                 append([]schedulerapi.KubeSchedulerProfile(nil), options.profiles...),
        registry:                 registry,
        nodeInfoSnapshot:         snapshot,
        extenders:                options.extenders,
        frameworkCapturer:        options.frameworkCapturer,
    }

    metrics.Register()

    var sched *Scheduler
    source := options.schedulerAlgorithmSource
    switch {
    case source.Provider != nil:
        // Create the config from a named algorithm provider.
        sc, err := configurator.createFromProvider(*source.Provider)
        if err != nil {
            return nil, fmt.Errorf("couldn't create scheduler using provider %q: %v", *source.Provider, err)
        }
        sched = sc
    case source.Policy != nil:
        // Create the config from a user specified policy source.
        policy := &schedulerapi.Policy{}
        switch {
        case source.Policy.File != nil:
            if err := initPolicyFromFile(source.Policy.File.Path, policy); err != nil {
                return nil, err
            }
        case source.Policy.ConfigMap != nil:
            if err := initPolicyFromConfigMap(client, source.Policy.ConfigMap, policy); err != nil {
                return nil, err
            }
        }
        // Set extenders on the configurator now that we've decoded the policy
        // In this case, c.extenders should be nil since we're using a policy (and therefore not componentconfig,
        // which would have set extenders in the above instantiation of Configurator from CC options)
        configurator.extenders = policy.Extenders
        sc, err := configurator.createFromConfig(*policy)
        if err != nil {
            return nil, fmt.Errorf("couldn't create scheduler from policy: %v", err)
        }
        sched = sc
    default:
        return nil, fmt.Errorf("unsupported algorithm source: %v", source)
    }
    // Additional tweaks to the config produced by the configurator.
    sched.StopEverything = stopEverything
    sched.client = client

    addAllEventHandlers(sched, informerFactory)
    return sched, nil
}

addAllEventHandlers registers event listeners for all relevant resource objects. For example, when a new Pod is created whose spec.nodeName is empty and whose phase is Pending, kube-scheduler watches that creation event.

The overall flow of kube-scheduler is as follows:

(1) Cobra command-line argument parsing
The options.NewOptions function initializes the default configuration of each module, for example the HTTP or HTTPS services.
The options.Validate function checks that the configuration parameters are legal and usable.
At startup, a configuration file can be passed to kube-scheduler with --config <filename>.
For a scheduler started with the default configuration, --write-config-to can dump that default configuration to a given file.
apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
algorithmSource:
  provider: DefaultProvider
percentageOfNodesToScore: 0
schedulerName: default-scheduler
bindTimeoutSeconds: 600
clientConnection:
  acceptContentTypes: ""
  burst: 100
  contentType: application/vnd.kubernetes.protobuf
  kubeconfig: ""
  qps: 50
disablePreemption: false
enableContentionProfiling: false
enableProfiling: false
hardPodAffinitySymmetricWeight: 1
healthzBindAddress: 0.0.0.0:10251
leaderElection:
  leaderElect: true
  leaseDuration: 15s
  lockObjectName: kube-scheduler
  lockObjectNamespace: kube-system
  renewDeadline: 10s
  resourceLock: endpoints
  retryPeriod: 2s
metricsBindAddress: 0.0.0.0:10251
profiles:
  - schedulerName: default-scheduler
  - schedulerName: no-scoring-scheduler
    plugins:
      preScore:
        disabled:
          - name: '*'
      score:
        disabled:
          - name: '*'

algorithmSource: the algorithm source, i.e. the scheduler policy configuration (the format of the filter/scorer configuration); three sources are currently supported: Provider (DefaultProvider, which prefers spreading, and ClusterAutoscalerProvider, which prefers packing), file, and configMap.
percentageOfNodesToScore: controls the Node sampling size.
schedulerName: the scheduler name; the default is default-scheduler.
bindTimeoutSeconds: timeout of the Bind phase.
clientConnection: parameters for talking to kube-apiserver; for example, contentType is the serialization protocol used with kube-apiserver, specified here as protobuf.
disablePreemption: disables preemption.
hardPodAffinitySymmetricWeight: configures the weight given to PodAffinity and NodeAffinity.
profiles: more than one profile can be defined; a Pod selects the scheduler (profile) to use via spec.schedulerName, the default being default-scheduler (a minimal example of selecting the second profile follows below).
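To target the no-scoring-scheduler profile defined above, a workload only needs to set spec.schedulerName. A minimal sketch using client-go follows; the kubeconfig path, namespace, Pod name, and image are made up for illustration.

package main

import (
    "context"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Build a client from a local kubeconfig (path assumed for illustration).
    config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
    if err != nil {
        panic(err)
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        panic(err)
    }

    pod := &v1.Pod{
        ObjectMeta: metav1.ObjectMeta{Name: "demo", Namespace: "default"},
        Spec: v1.PodSpec{
            // Selects the second profile from the configuration above;
            // leaving this empty means "default-scheduler".
            SchedulerName: "no-scoring-scheduler",
            Containers: []v1.Container{
                {Name: "app", Image: "nginx"},
            },
        },
    }
    if _, err := clientset.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
        panic(err)
    }
}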
The cc object (the runtime configuration of the kube-scheduler component) is then passed to the Run function in cmd/kube-scheduler/app/server.go. Run defines the startup logic of the kube-scheduler component and is a long-running process that does not exit on its own.

(1) Configz registration

if cz, err := configz.New("componentconfig"); err == nil {
    cz.Set(cc.ComponentConfig)
} else {
    return fmt.Errorf("unable to register configz: %s", err)
}

(2) Start the EventBroadcaster event manager:

cc.EventBroadcaster.StartRecordingToSink(ctx.Done())

(3) Start the HTTP services

/healthz: used for health checks

var checks []healthz.HealthChecker // set up the health checks
if cc.ComponentConfig.LeaderElection.LeaderElect {
    checks = append(checks, cc.LeaderElection.WatchDog)
}
if cc.InsecureServing != nil {
    separateMetrics := cc.InsecureMetricsServing != nil
    handler := buildHandlerChain(newHealthzHandler(&cc.ComponentConfig, separateMetrics, checks...), nil, nil)
    if err := cc.InsecureServing.Serve(handler, 0, ctx.Done()); err != nil {
        return fmt.Errorf("failed to start healthz server: %v", err)
    }
}

/metrics: exposes monitoring metrics, typically scraped by Prometheus

if cc.InsecureMetricsServing != nil {
    handler := buildHandlerChain(newMetricsHandler(&cc.ComponentConfig), nil, nil)
    if err := cc.InsecureMetricsServing.Serve(handler, 0, ctx.Done()); err != nil {
        return fmt.Errorf("failed to start metrics server: %v", err)
    }
}

(4) Start the HTTPS service

if cc.SecureServing != nil {
    handler := buildHandlerChain(newHealthzHandler(&cc.ComponentConfig, false, checks...), cc.Authentication.Authenticator, cc.Authorization.Authorizer)
    // TODO: handle stoppedCh returned by c.SecureServing.Serve
    if _, err := cc.SecureServing.Serve(handler, 0, ctx.Done()); err != nil {
        // fail early for secure handlers, removing the old error loop from above
        return fmt.Errorf("failed to start secure server: %v", err)
    }
}

(5) Instantiate all Informers and start them, including those for Pod, Node, PV, PVC, SC, CSINode, PDB, RC, RS, Service, STS, and Deployment
cc.InformerFactory.Start(ctx.Done())
cc.InformerFactory.WaitForCacheSync(ctx.Done()) // wait for all started Informers to sync their data to the local cache

(6) Take part in leader election:

if cc.LeaderElection != nil { // leader election is enabled
    cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
        OnStartedLeading: func(ctx context.Context) {
            close(waitingForLeader)
            sched.Run(ctx)
        },
        OnStoppedLeading: func() {
            klog.Fatalf("leaderelection lost")
        },
    }
    leaderElector, err := leaderelection.NewLeaderElector(*cc.LeaderElection) // instantiate the LeaderElector object
    if err != nil {
        return fmt.Errorf("couldn't create leader elector: %v", err)
    }
    leaderElector.Run(ctx) // call Run() in client-go's tools/leaderelection/leaderelection.go to take part in leader election
    return fmt.Errorf("lost lease")
}

LeaderCallbacks defines two callbacks:
OnStartedLeading is called once the current node has won the election; it runs the main logic of the kube-scheduler component.
OnStoppedLeading is called when the current node loses leadership; it terminates the current kube-scheduler instance.

(7) Run the scheduler via sched.Run.
sched.Run(ctx)其執行邏輯為:
func (sched *Scheduler) Run(ctx context.Context) { sched.SchedulingQueue.Run() wait.UntilWithContext(ctx, sched.scheduleOne, 0) sched.SchedulingQueue.Close() }首先呼叫了pkg/scheduler/internal/queue/scheduling_queue.go中PriorityQueue的Run方法:
func (p *PriorityQueue) Run() {
    go wait.Until(p.flushBackoffQCompleted, 1.0*time.Second, p.stop)
    go wait.Until(p.flushUnschedulableQLeftover, 30*time.Second, p.stop)
}

Its logic is:
Every 1 second, check whether any Pod in podBackoffQ can be moved into activeQ.
Every 30 seconds, check whether any Pod in unschedulableQ can be moved into activeQ (by default a Pod qualifies once it has been waiting for more than 60 seconds).

Run then enters sched.scheduleOne, the main scheduling logic of kube-scheduler. It is driven by the wait.Until timer, which keeps calling sched.scheduleOne; only when the sched.config.StopEverything channel is closed does the timer stop and the loop exit.

kube-scheduler first pops a Pod waiting to be scheduled from activeQ and fetches the relevant Node data from the NodeCache.
The NodeCache is organized with zoneIndex on one axis (Nodes are grouped by zone, so the Nodes handed out are spread across zones) and nodeIndex on the other.
In the Filter phase, each time a node is popped for filtering, zoneIndex advances by one position and a Node is taken from that zone's node list (if the current zone has no Nodes left, the next zone is used); after the Node is taken, nodeIndex also advances by one position.
The sampling ratio then determines whether enough Nodes have passed Filter: once the number of feasible Nodes reaches the configured sampling size, filtering stops.
The sampling ratio is configured via percentageOfNodesToScore (0-100).
When the cluster has fewer than 50 schedulable Nodes, the scheduler still checks all of them.
If no ratio is set, the default percentage shrinks as the number of Nodes grows (down to a minimum of 5%); a sketch of this computation follows.
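The sampling size is computed by the numFeasibleNodesToFind helper in pkg/scheduler/core/generic_scheduler.go. The standalone sketch below paraphrases its logic; the constants are the ones used around the 1.19/1.20 releases, and in particular the floor below which all Nodes are always checked is a version-dependent constant, so verify the values in the release you are reading.

package main

import "fmt"

// Constants as found in generic_scheduler.go around the 1.19/1.20 releases;
// check the exact values in your version.
const (
    minFeasibleNodesToFind           = 100 // below this many Nodes, every Node is checked
    minFeasibleNodesPercentageToFind = 5   // the adaptive percentage never drops below 5%
)

// numFeasibleNodesToFind paraphrases the upstream helper: it returns how many
// feasible Nodes the Filter phase should collect before it stops searching.
func numFeasibleNodesToFind(percentageOfNodesToScore, numAllNodes int32) int32 {
    if numAllNodes < minFeasibleNodesToFind || percentageOfNodesToScore >= 100 {
        return numAllNodes
    }
    adaptivePercentage := percentageOfNodesToScore
    if adaptivePercentage <= 0 {
        // Not set by the user: start at 50% and shrink as the cluster grows.
        adaptivePercentage = 50 - numAllNodes/125
        if adaptivePercentage < minFeasibleNodesPercentageToFind {
            adaptivePercentage = minFeasibleNodesPercentageToFind
        }
    }
    numNodes := numAllNodes * adaptivePercentage / 100
    if numNodes < minFeasibleNodesToFind {
        return minFeasibleNodesToFind
    }
    return numNodes
}

func main() {
    for _, n := range []int32{100, 500, 1000, 5000} {
        // With percentageOfNodesToScore left at 0 this prints 100, 230, 420, 500.
        fmt.Printf("%d nodes -> stop after %d feasible nodes\n", n, numFeasibleNodesToFind(0, n))
    }
}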
The Scheduling Framework is a pluggable architecture that defines a rich set of extension point interfaces in the existing scheduling flow.
Developers can implement the interfaces of these extension points as plugins and thereby integrate their own scheduling logic into the Scheduling Framework.
The built-in plugins are registered in pkg/scheduler/algorithmprovider/registry.go:

func getDefaultConfig() *schedulerapi.Plugins {
    return &schedulerapi.Plugins{
        QueueSort: &schedulerapi.PluginSet{
            Enabled: []schedulerapi.Plugin{
                {Name: queuesort.Name},
            },
        },
        PreFilter: &schedulerapi.PluginSet{
            Enabled: []schedulerapi.Plugin{
                {Name: noderesources.FitName},
                {Name: nodeports.Name},
                {Name: podtopologyspread.Name},
                {Name: interpodaffinity.Name},
                {Name: volumebinding.Name},
            },
        },
        Filter: &schedulerapi.PluginSet{
            Enabled: []schedulerapi.Plugin{
                {Name: nodeunschedulable.Name},
                {Name: nodename.Name},
                {Name: tainttoleration.Name},
                {Name: nodeaffinity.Name},
                {Name: nodeports.Name},
                {Name: noderesources.FitName},
                {Name: volumerestrictions.Name},
                {Name: nodevolumelimits.EBSName},
                {Name: nodevolumelimits.GCEPDName},
                {Name: nodevolumelimits.CSIName},
                {Name: nodevolumelimits.AzureDiskName},
                {Name: volumebinding.Name},
                {Name: volumezone.Name},
                {Name: podtopologyspread.Name},
                {Name: interpodaffinity.Name},
            },
        },
        PostFilter: &schedulerapi.PluginSet{
            Enabled: []schedulerapi.Plugin{
                {Name: defaultpreemption.Name},
            },
        },
        PreScore: &schedulerapi.PluginSet{
            Enabled: []schedulerapi.Plugin{
                {Name: interpodaffinity.Name},
                {Name: podtopologyspread.Name},
                {Name: tainttoleration.Name},
                {Name: nodeaffinity.Name},
            },
        },
        Score: &schedulerapi.PluginSet{
            Enabled: []schedulerapi.Plugin{
                {Name: noderesources.BalancedAllocationName, Weight: 1},
                {Name: imagelocality.Name, Weight: 1},
                {Name: interpodaffinity.Name, Weight: 1},
                {Name: noderesources.LeastAllocatedName, Weight: 1},
                {Name: nodeaffinity.Name, Weight: 1},
                {Name: nodepreferavoidpods.Name, Weight: 10000},
                {Name: podtopologyspread.Name, Weight: 2},
                {Name: tainttoleration.Name, Weight: 1},
            },
        },
        Reserve: &schedulerapi.PluginSet{
            Enabled: []schedulerapi.Plugin{
                {Name: volumebinding.Name},
            },
        },
        PreBind: &schedulerapi.PluginSet{
            Enabled: []schedulerapi.Plugin{
                {Name: volumebinding.Name},
            },
        },
        Bind: &schedulerapi.PluginSet{
            Enabled: []schedulerapi.Plugin{
                {Name: defaultbinder.Name},
            },
        },
    }
}

When the scheduling flow reaches an extension point, the Scheduling Framework invokes the plugins registered there, and their results influence the scheduling decision.
The core scheduling flow lives in pkg/scheduler/core/generic_scheduler.go:
func (g *genericScheduler) Schedule(ctx context.Context, fwk framework.Framework, state *framework.CycleState, pod *v1.Pod) (result ScheduleResult, err error) {
    trace := utiltrace.New("Scheduling", utiltrace.Field{Key: "namespace", Value: pod.Namespace}, utiltrace.Field{Key: "name", Value: pod.Name})
    defer trace.LogIfLong(100 * time.Millisecond)

    if err := g.snapshot(); err != nil {
        return result, err
    }
    trace.Step("Snapshotting scheduler cache and node infos done")

    if g.nodeInfoSnapshot.NumNodes() == 0 {
        return result, ErrNoNodesAvailable
    }

    feasibleNodes, filteredNodesStatuses, err := g.findNodesThatFitPod(ctx, fwk, state, pod)
    if err != nil {
        return result, err
    }
    trace.Step("Computing predicates done")

    if len(feasibleNodes) == 0 {
        return result, &FitError{
            Pod:                   pod,
            NumAllNodes:           g.nodeInfoSnapshot.NumNodes(),
            FilteredNodesStatuses: filteredNodesStatuses,
        }
    }

    // When only one node after predicate, just use it.
    if len(feasibleNodes) == 1 {
        return ScheduleResult{
            SuggestedHost:  feasibleNodes[0].Name,
            EvaluatedNodes: 1 + len(filteredNodesStatuses),
            FeasibleNodes:  1,
        }, nil
    }

    priorityList, err := g.prioritizeNodes(ctx, fwk, state, pod, feasibleNodes)
    if err != nil {
        return result, err
    }

    host, err := g.selectHost(priorityList)
    trace.Step("Prioritizing done")

    return ScheduleResult{
        SuggestedHost:  host,
        EvaluatedNodes: len(feasibleNodes) + len(filteredNodesStatuses),
        FeasibleNodes:  len(feasibleNodes),
    }, err
}

Below is the full Scheduling Framework flow; gray plugins are not enabled by default.

1. scheduling cycle
The scheduling cycle is the core of the scheduling flow: it makes the scheduling decision and picks exactly one node.
The scheduling cycle runs synchronously; there is only one scheduling cycle at a time, so it is thread-safe.

(1) QueueSort
QueueSortPlugin sorts the Pods in the scheduling queue. The interface defines a single function, Less, used for comparison when heap-sorting the Pods waiting to be scheduled:
type QueueSortPlugin interface {
    Plugin
    Less(*PodInfo, *PodInfo) bool
}

Only one comparison function can be active at a time, so only one QueueSort plugin may be enabled; if two are enabled, the scheduler reports an error and exits at startup.
The default comparison function compares priority first and then the timestamp:
type PrioritySort struct{}

func (pl *PrioritySort) Less(pInfo1, pInfo2 *framework.QueuedPodInfo) bool {
    p1 := pod.GetPodPriority(pInfo1.Pod)
    p2 := pod.GetPodPriority(pInfo2.Pod)
    return (p1 > p2) || (p1 == p2 && pInfo1.Timestamp.Before(pInfo2.Timestamp))
}

func GetPodPriority(pod *v1.Pod) int32 {
    if pod.Spec.Priority != nil {
        return *pod.Spec.Priority
    }
    return 0
}

In the predicate (filtering) stage, all PreFilter plugins run first; only if every PreFilter plugin returns success does the flow enter the Filter stage, otherwise the Pod is rejected and this scheduling attempt is marked as failed. All Filter plugins then run (in parallel across Nodes); a Node is filtered out as soon as any single Filter plugin considers it unsuitable for the Pod.
To improve efficiency, the execution order can be configured, so users can put strategies that filter out large numbers of nodes (for example a NodeSelector-style Filter) first, reducing how often the later Filter strategies have to run.

(2) PreFilter
PreFilter is pre-processing that happens before the main scheduling flow starts; it can enrich Pod information and check preconditions that the cluster or the Pod must satisfy. The PreFilter and Filter interfaces look roughly like the sketch below, and the plugins of each extension point are listed after it.
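For reference, the two interfaces involved here look roughly as follows (paraphrased from the framework package, pkg/scheduler/framework/v1alpha1 in the 1.19/1.20 code base this article follows; check the exact signatures in your version). Types such as CycleState, Status, and NodeInfo live in the same framework package.

type PreFilterPlugin interface {
    Plugin
    // PreFilter runs once per scheduling cycle, before the per-Node Filter calls.
    PreFilter(ctx context.Context, state *CycleState, p *v1.Pod) *Status
    // PreFilterExtensions exposes optional incremental AddPod/RemovePod hooks
    // used by preemption; it may return nil.
    PreFilterExtensions() PreFilterExtensions
}

type FilterPlugin interface {
    Plugin
    // Filter is called once per candidate Node; returning nil (Success) keeps
    // the Node, any other status removes it from the feasible set.
    Filter(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status
}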
The PreFilter plugins:
- NodeResourcesFit
- NodePorts
- podtopologyspread
- InterPodAffinity
- volumebinding
- ServiceAffinity
(3) Filter
- nodeunschedulable
- NodeResourcesFit
- nodename
- nodeports
- nodeaffinity
- volumerestrictions
- tainttoleration
- NodeVolumeLimits, EBSLimits, GCEPDLimits, AzureDiskLimits, CinderVolume
- volumebinding
- volumezone
- podtopologyspread
- interpodaffinity
- NodeLabel
- ServiceAffinity
(4) PostFilter
- DefaultPreemption: when a high-priority Pod cannot find a suitable Node, the Preempt (preemption) algorithm is executed.
(5) PreScore
- SelectorSpread
- interpodaffinity
- podtopologyspread
- tainttoleration
- NodeResourceLimits
(6) Score
- SelectorSpread
- NodeResourcesBalancedAllocation (BalancedResourceAllocation): scores Nodes by how balanced their resource usage would be, i.e. a low fragmentation rate
- NodeResourcesLeastAllocated: prefers spreading (Nodes with fewer allocated resources score higher)
- interpodaffinity
- nodeaffinity
- nodepreferavoidpods
- podtopologyspread
- tainttoleration
- NodeResourcesMostAllocated: prefers packing (Nodes with more allocated resources score higher)
- RequestedToCapacityRatio: scores by a user-specified requested-to-capacity ratio
- NodeResourceLimits
- NodeLabel
- ServiceAffinity
(7) Reserve
- volumebinding
2. binding cycle
The binding cycle binds the Pod to the Node chosen by the scheduling cycle; unlike the scheduling cycle, it may run asynchronously.

(1) PreBind
- volumebinding
(2) Bind
- defaultbinder
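Finally, to make the plugin mechanism concrete, here is a minimal sketch of an out-of-tree Filter plugin and how it would be registered. The plugin name (NodePrefixFilter) and its rule are made up for illustration, and the import path and factory signature (framework.Handle here, FrameworkHandle in older releases) differ between Kubernetes versions, so treat it as a template rather than the canonical API.

package main

import (
    "context"
    "strings"

    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/kubernetes/cmd/kube-scheduler/app"
    framework "k8s.io/kubernetes/pkg/scheduler/framework/v1alpha1" // pkg/scheduler/framework in newer releases
)

// NodePrefixFilter is a toy Filter plugin: it only admits Nodes whose name
// starts with a fixed prefix. Name and logic are purely illustrative.
type NodePrefixFilter struct{}

const Name = "NodePrefixFilter"

func (pl *NodePrefixFilter) Name() string { return Name }

// Filter runs at the Filter extension point: returning nil means the Node
// passes; an Unschedulable status removes it from consideration.
func (pl *NodePrefixFilter) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    node := nodeInfo.Node()
    if node == nil {
        return framework.NewStatus(framework.Error, "node not found")
    }
    if strings.HasPrefix(node.Name, "edge-") {
        return nil
    }
    return framework.NewStatus(framework.Unschedulable, "node name does not start with edge-")
}

// New is the plugin factory handed to the scheduler at registration time.
func New(_ runtime.Object, _ framework.Handle) (framework.Plugin, error) {
    return &NodePrefixFilter{}, nil
}

func main() {
    // Out-of-tree plugins are compiled into their own scheduler binary and
    // registered through app.WithPlugin, then enabled via a profile in the
    // KubeSchedulerConfiguration.
    command := app.NewSchedulerCommand(app.WithPlugin(Name, New))
    if err := command.Execute(); err != nil {
        panic(err)
    }
}

The resulting binary runs alongside (or instead of) the default kube-scheduler, and the plugin still has to be enabled under plugins.filter in a KubeSchedulerConfiguration profile, just like the profiles section shown earlier.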