kube-scheduler Scheduling Framework Source Code Study, Part 1

阿新 • Published: 2021-07-02

queueSort extension point
preFilter extension point

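queueSort extension point

Overview

This extension point has one job: comparing the scheduling priority of two pods.

Exactly one plugin can implement this extension point. The default implementation is the PrioritySort plugin described below.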

PrioritySort

pkg/scheduler/framework/plugins/queuesort/priority_sort.go

// Less is the function used by the activeQ heap algorithm to sort pods.
// It sorts pods based on their priority. When priorities are equal, it uses
// PodQueueInfo.timestamp.
func (pl *PrioritySort) Less(pInfo1, pInfo2 *framework.QueuedPodInfo) bool {
	p1 := corev1helpers.PodPriority(pInfo1.Pod)
	p2 := corev1helpers.PodPriority(pInfo2.Pod)
	return (p1 > p2) || (p1 == p2 && pInfo1.Timestamp.Before(pInfo2.Timestamp))
}

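Core logic: the Less method compares the scheduling priority of two pods and returns true if pod1 should be scheduled ahead of pod2.
The comparison first looks at the priority in the podSpec: a higher priority value means a higher scheduling priority. If the priorities are equal, the pod with the earlier timestamp sorts first.
The priority field in the podSpec is resolved from the PriorityClass object associated with the pod before the pod reaches the scheduler.
Only one queueSort plugin can be enabled in a scheduler framework; the code uses the first one by default.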

Loading process

pkg/scheduler/factory.go

Configurator.create()

Data flow: config -> profile -> framework.Framework -> QueueSortFunc() -> framework.LessFunc
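
To make the last two hops of this flow concrete, here is a rough sketch of how a framework object could hand out its comparison function. All names (`podInfo`, `lessFunc`, `queueSortPlugin`, `frameworkImpl`, `newFramework`, `queueSortFunc`) are hypothetical simplifications chosen for illustration, and the "exactly one plugin" check is an assumption based on the rule stated earlier, not a copy of the scheduler's code:

```go
package main

import (
	"errors"
	"fmt"
)

// podInfo is a hypothetical, simplified pod record.
type podInfo struct {
	Name     string
	Priority int32
}

// lessFunc plays the role of framework.LessFunc in this sketch.
type lessFunc func(a, b *podInfo) bool

// queueSortPlugin plays the role of a queueSort plugin.
type queueSortPlugin interface {
	Name() string
	Less(a, b *podInfo) bool
}

// prioritySort is a toy plugin that compares priorities only.
type prioritySort struct{}

func (prioritySort) Name() string            { return "PrioritySort" }
func (prioritySort) Less(a, b *podInfo) bool { return a.Priority > b.Priority }

// frameworkImpl holds the queueSort plugins enabled by a profile.
type frameworkImpl struct {
	queueSortPlugins []queueSortPlugin
}

// newFramework rejects configurations that do not enable exactly one
// queueSort plugin (assumed behaviour for this sketch).
func newFramework(plugins []queueSortPlugin) (*frameworkImpl, error) {
	if len(plugins) != 1 {
		return nil, errors.New("exactly one queueSort plugin must be enabled")
	}
	return &frameworkImpl{queueSortPlugins: plugins}, nil
}

// queueSortFunc returns the Less method of the first (and only) plugin,
// which the scheduling queue then uses as its comparison function.
func (f *frameworkImpl) queueSortFunc() lessFunc {
	return f.queueSortPlugins[0].Less
}

func main() {
	fw, err := newFramework([]queueSortPlugin{prioritySort{}})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	less := fw.queueSortFunc()
	fmt.Println(less(&podInfo{Name: "a", Priority: 10}, &podInfo{Name: "b", Priority: 1}))
}
```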

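preFilter extension point

Overview

For every plugin in the scheduling framework, handling this extension point follows roughly the same pattern:

Key inputs: cycleState, pod
- cycleState is the storage shared by all plugins. It wraps a map[string]StateData that can be read and written safely through the Read and Write methods.
- pod is the Pod struct from the v1 API group, from which the pod's information can be obtained.
Output: nil in the normal case; on failure, a status such as Error or Unschedulable is returned through framework.NewStatus().
Core operation: each plugin defines its own unique key and its own StateData type, packs whatever data later stages need into the StateData, and writes it into the cycleState map.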
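
The pattern can be sketched in a few lines of standalone Go. This is an illustrative simplification, not the framework's code: `cycleState`, `stateData`, `preFilterStateKey`, `preFilterState`, `preFilter` and `filter` are all hypothetical names, and the real StateData interface carries more than the empty interface used here.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// stateData is a simplified stand-in for framework.StateData.
type stateData interface{}

// cycleState is a hypothetical, simplified per-scheduling-cycle store:
// a map guarded by a read/write lock.
type cycleState struct {
	mu      sync.RWMutex
	storage map[string]stateData
}

func newCycleState() *cycleState {
	return &cycleState{storage: map[string]stateData{}}
}

// Write stores val under key.
func (c *cycleState) Write(key string, val stateData) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.storage[key] = val
}

var errNotFound = errors.New("state not found")

// Read returns the value stored under key, or errNotFound.
func (c *cycleState) Read(key string) (stateData, error) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.storage[key]
	if !ok {
		return nil, errNotFound
	}
	return v, nil
}

// Each plugin owns a unique key and its own state type.
const preFilterStateKey = "PreFilterExamplePlugin"

type preFilterState struct {
	milliCPU int64
}

// preFilter computes what later stages need and stores it once per cycle.
func preFilter(state *cycleState, requestedMilliCPU int64) {
	state.Write(preFilterStateKey, &preFilterState{milliCPU: requestedMilliCPU})
}

// filter reads the precomputed state back for a single node.
func filter(state *cycleState, nodeFreeMilliCPU int64) (bool, error) {
	v, err := state.Read(preFilterStateKey)
	if err != nil {
		return false, err
	}
	s, ok := v.(*preFilterState)
	if !ok {
		return false, fmt.Errorf("unexpected state type %T", v)
	}
	return s.milliCPU <= nodeFreeMilliCPU, nil
}

func main() {
	state := newCycleState()
	preFilter(state, 500)
	fits, err := filter(state, 1000)
	fmt.Println(fits, err)
}
```

The point of the split is that the per-pod work happens once in preFilter, while the per-node stage only reads the cached result.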

The following sections cover how individual plugins implement this extension point.

NodeResourcesFit

pkg/scheduler/framework/plugins/noderesources/fit.go

func computePodResourceRequest(pod *v1.Pod) *preFilterState {
	result := &preFilterState{}
	for _, container := range pod.Spec.Containers {
		result.Add(container.Resources.Requests)
	}

	// take max_resource(sum_pod, any_init_container)
	for _, container := range pod.Spec.InitContainers {
		result.SetMaxResource(container.Resources.Requests)
	}

	// If Overhead is being utilized, add to the total requests for the pod
	if pod.Spec.Overhead != nil && utilfeature.DefaultFeatureGate.Enabled(features.PodOverhead) {
		result.Add(pod.Spec.Overhead)
	}

	return result
}

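Core logic: for a pod entering the scheduling process, compute its total resource request in preparation for the later stages.
Regular containers run concurrently, so their requests for each resource type are summed. Init containers start one after another, so for each resource type the result is the larger of that sum and the largest single init container request.
Besides the requests of all containers in a pod, the scheduler can also account for the extra resource overhead of the pod itself, outside its containers (see Pod Overhead for details). The function therefore ends by checking whether the PodOverhead feature gate is enabled and whether the pod has an overhead field; if both hold, that overhead is added to the total.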
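
A small worked example may help with the arithmetic. The sketch below assumes a simplified `resources` struct with only two resource types and a hypothetical `podRequest` helper; the feature-gate check is left out and overhead is applied whenever it is non-nil.

```go
package main

import "fmt"

// resources is a hypothetical, simplified resource vector.
type resources struct {
	MilliCPU  int64
	MemoryMiB int64
}

// add accumulates o into r, resource by resource.
func (r *resources) add(o resources) {
	r.MilliCPU += o.MilliCPU
	r.MemoryMiB += o.MemoryMiB
}

// setMax raises each field of r to at least the matching field of o.
func (r *resources) setMax(o resources) {
	if o.MilliCPU > r.MilliCPU {
		r.MilliCPU = o.MilliCPU
	}
	if o.MemoryMiB > r.MemoryMiB {
		r.MemoryMiB = o.MemoryMiB
	}
}

// podRequest follows the three steps described above: sum the regular
// containers, take the max against every init container, add overhead.
func podRequest(containers, initContainers []resources, overhead *resources) resources {
	var total resources
	for _, c := range containers {
		total.add(c)
	}
	for _, c := range initContainers {
		total.setMax(c)
	}
	if overhead != nil {
		total.add(*overhead)
	}
	return total
}

func main() {
	containers := []resources{
		{MilliCPU: 500, MemoryMiB: 256},
		{MilliCPU: 250, MemoryMiB: 128},
	}
	initContainers := []resources{{MilliCPU: 1000, MemoryMiB: 64}}
	overhead := &resources{MilliCPU: 100, MemoryMiB: 32}

	// CPU: max(500+250, 1000) + 100 = 1100; memory: max(256+128, 64) + 32 = 416.
	fmt.Printf("%+v\n", podRequest(containers, initContainers, overhead))
}
```

Note that the maximum is taken per resource type: here CPU is dominated by the init container while memory is dominated by the regular containers.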

NodePorts

pkg/scheduler/framework/plugins/nodeports/node_ports.go

// getContainerPorts returns the used host ports of Pods: if 'port' was used, a 'port:true' pair
// will be in the result; but it does not resolve port conflict.
func getContainerPorts(pods ...*v1.Pod) []*v1.ContainerPort {
	ports := []*v1.ContainerPort{}
	for _, pod := range pods {
		for j := range pod.Spec.Containers {
			container := &pod.Spec.Containers[j]
			for k := range container.Ports {
				ports = append(ports, &container.Ports[k])
			}
		}
	}
	return ports
}

Core logic: iterate over every port of every container in the pod and flatten them into a single slice.
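
The same flattening can be reproduced with stand-in types to see what the result looks like. In this sketch `containerPort`, `container`, `pod` and `flattenPorts` are hypothetical simplifications of the v1 API types and of getContainerPorts; indexing into the slices (rather than ranging over values) means the returned pointers refer to the original elements instead of copies.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the v1 API types.
type containerPort struct {
	HostPort int32
	Protocol string
}

type container struct {
	Ports []containerPort
}

type pod struct {
	Containers []container
}

// flattenPorts collects pointers to every port of every container of the
// given pods into one slice, preserving declaration order.
func flattenPorts(pods ...*pod) []*containerPort {
	ports := []*containerPort{}
	for _, p := range pods {
		for j := range p.Containers {
			c := &p.Containers[j]
			for k := range c.Ports {
				ports = append(ports, &c.Ports[k])
			}
		}
	}
	return ports
}

func main() {
	p := &pod{Containers: []container{
		{Ports: []containerPort{{HostPort: 8080, Protocol: "TCP"}, {HostPort: 8443, Protocol: "TCP"}}},
		{Ports: []containerPort{{HostPort: 53, Protocol: "UDP"}}},
	}}
	for _, port := range flattenPorts(p) {
		fmt.Println(port.HostPort, port.Protocol)
	}
}
```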

PodTopologySpread

pkg/scheduler/framework/plugins/podtopologyspread/filtering.go

For this plugin, the extension point mainly processes the pod's topology spread information; omitted here.

InterPodAffinity

pkg/scheduler/framework/plugins/interpodaffinity/filtering.go

For this plugin, the extension point mainly processes the pod's affinity and anti-affinity information; omitted here.

VolumeBinding

pkg/scheduler/framework/plugins/volumebinding/volume_binding.go

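Core logic: iterate over the volumes in the podSpec and check whether any PVC is used. If none is, this extension point has nothing to do and is skipped; if there are PVCs, their information is resolved.
The resolved PVCs fall into three states: bound, tobind, and unboundImmediate. unboundImmediate is an abnormal state: if any such PVC exists, the error message pod has unbound immediate PersistentVolumeClaims is returned directly and the pod is put back into the scheduling queue.
Finally, a map named podVolumesByNode is initialized; it is used later in the filter extension point.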
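
The three-way split can be illustrated with a toy classifier. Everything here is a hypothetical simplification: the `claim` struct, the `classify` function, and in particular the classification criteria (a non-empty volume name means bound, a delayed-binding flag means the claim can still be bound during scheduling) are assumptions made for this sketch rather than the plugin's actual logic.

```go
package main

import (
	"errors"
	"fmt"
)

// claim is a hypothetical, simplified view of a PVC.
type claim struct {
	Name         string
	VolumeName   string // assumed non-empty once the claim is bound to a PV
	DelayBinding bool   // assumed true when binding waits for pod scheduling
}

var errUnboundImmediate = errors.New("pod has unbound immediate PersistentVolumeClaims")

// classify sorts claims into bound and to-bind groups and fails as soon as
// it meets an unbound claim that should already have been bound.
func classify(claims []claim) (bound, toBind []claim, err error) {
	for _, c := range claims {
		switch {
		case c.VolumeName != "":
			bound = append(bound, c)
		case c.DelayBinding:
			toBind = append(toBind, c)
		default:
			return nil, nil, errUnboundImmediate
		}
	}
	return bound, toBind, nil
}

func main() {
	claims := []claim{
		{Name: "data", VolumeName: "pv-1"},
		{Name: "cache", DelayBinding: true},
	}
	bound, toBind, err := classify(claims)
	fmt.Println(len(bound), len(toBind), err)

	_, _, err = classify([]claim{{Name: "broken"}})
	fmt.Println(err)
}
```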

NodeAffinity

pkg/scheduler/framework/plugins/nodeaffinity/node_affinity.go

func GetRequiredNodeAffinity(pod *v1.Pod) RequiredNodeAffinity {
	var selector labels.Selector
	if len(pod.Spec.NodeSelector) > 0 {
		selector = labels.SelectorFromSet(pod.Spec.NodeSelector)
	}
	// Use LazyErrorNodeSelector for backwards compatibility of parsing errors.
	var affinity *LazyErrorNodeSelector
	if pod.Spec.Affinity != nil &&
		pod.Spec.Affinity.NodeAffinity != nil &&
		pod.Spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution != nil {
		affinity = NewLazyErrorNodeSelector(pod.Spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution)
	}
	return RequiredNodeAffinity{labelSelector: selector, nodeSelector: affinity}
}

Core logic: parse the nodeSelector and nodeAffinity fields of the podSpec and store the result.
Only nodeAffinity of type RequiredDuringSchedulingIgnoredDuringExecution is handled here.
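
To show how the two stored pieces are meant to combine, here is a rough matching sketch. The types `requirement`, `term` and `requiredNodeAffinity` and the `match` method are hypothetical simplifications: only an "In"-style requirement is modelled, the nodeSelector and the affinity must both match, terms are ORed, and requirements within a term are ANDed.

```go
package main

import "fmt"

// requirement is a simplified "key In values" match expression.
type requirement struct {
	Key    string
	Values []string
}

// term is a list of requirements that must all hold.
type term []requirement

// requiredNodeAffinity is a hypothetical, simplified version of the parsed state.
type requiredNodeAffinity struct {
	labelSelector map[string]string // from pod.Spec.NodeSelector
	terms         []term            // from the required node affinity; nil if unset
}

func contains(values []string, s string) bool {
	for _, v := range values {
		if v == s {
			return true
		}
	}
	return false
}

// matches reports whether every requirement of the term holds for the labels.
// An empty term matches nothing in this sketch.
func (t term) matches(labels map[string]string) bool {
	if len(t) == 0 {
		return false
	}
	for _, req := range t {
		got, ok := labels[req.Key]
		if !ok || !contains(req.Values, got) {
			return false
		}
	}
	return true
}

// match requires the nodeSelector to match and, if affinity terms are set,
// at least one term to match.
func (r requiredNodeAffinity) match(labels map[string]string) bool {
	for k, want := range r.labelSelector {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	if r.terms == nil {
		return true
	}
	for _, t := range r.terms {
		if t.matches(labels) {
			return true
		}
	}
	return false
}

func main() {
	affinity := requiredNodeAffinity{
		labelSelector: map[string]string{"disktype": "ssd"},
		terms:         []term{{{Key: "zone", Values: []string{"a", "b"}}}},
	}
	fmt.Println(affinity.match(map[string]string{"disktype": "ssd", "zone": "a"})) // true
	fmt.Println(affinity.match(map[string]string{"disktype": "ssd", "zone": "c"})) // false
}
```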
