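[Source Code Analysis - Kubernetes] 9. Affinity

Special Topic: Affinity Scheduling (Author - XiaoYang)

Introduction

Before I had read and understood the scheduler source code, configuring affinity left me with several open questions, largely because the official and third-party documentation does not explain the details clearly. For example:

  1. What matching operation sits underneath the affinity operator "In"? Is it a regular-expression match? And how are "Gt" and "Lt" implemented?

  2. Every document I could find lists only three topologyKey values for pod affinity:
    kubernetes.io/hostname
    failure-domain.beta.kubernetes.io/zone
    failure-domain.beta.kubernetes.io/region
    Why? Are these three keys really the only ones supported? Can a custom key be used?

  3. What is the difference between Node affinity and Pod affinity? What does Pod affinity actually match against, and what is its internal logic?

Perhaps you have had similar questions. Once the code logic is clear, the behavior becomes explicit and no "tacit knowledge" remains. I hope this article helps clear up these points about Kubernetes affinity.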

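Constrained Scheduling

Before diving into the source code, here is some background on Kubernetes scheduling that makes the affinity logic easier to follow:

  1. The purpose of affinity is to let users place pods on specific Nodes as needed. I call this "constrained scheduling".
  2. Three kinds of constraints are commonly used:
  • NodeSelector / NodeName
    Node label selector and "nodeName" matching
  • Affinity (Node/Pod/Service)
  • Taint / Toleration
  3. This article covers affinity, which comes in three types: Node, Pod, and Service affinity. The two tables below map each type to the policy that implements it in the predicate (filtering) stage and the priority (scoring) stage; each is analyzed in detail later: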
Predicate-stage policy | Pod.Spec field | Category | Order
MatchNodeSelectorPred | NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution | Node | 6
MatchInterPodAffinityPred | PodAffinity.RequiredDuringSchedulingIgnoredDuringExecution, PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution | Pod | 22
CheckServiceAffinityPred | - | Service | 12

Priority-stage policy | Pod.Spec field | Default weight
InterPodAffinityPriority | PodAffinity.PreferredDuringSchedulingIgnoredDuringExecution | 1
NodeAffinityPriority | NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution | 1

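Labels.selector: the label selector

The label selector is the most basic building block used by the affinity code; both nodeAffinity and podAffinity depend on it. When you define a pod in a Deployment YAML file and configure its affinity, you must specify match expressions, and those expressions are ultimately evaluated against the labels of a Node or a pod. That evaluation is performed by labels.Selector, which is shared by all of the affinity code. This lowest-level matching logic is therefore analyzed first, so that the source analysis that follows is easier to understand.

Definition of the labels.Selector interface. The key method is Matches():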

!FILENAME vendor/k8s.io/apimachinery/pkg/labels/selector.go:36

type Selector interface {
	Matches(Labels) bool  
	Empty() bool
	String() string
	Add(r ...Requirement) Selector
	Requirements() (requirements Requirements, selectable bool)
	DeepCopySelector() Selector
}

On the calling side, functions such as the following call labels.NewSelector() to instantiate a labels.Selector object and return it.

func LabelSelectorAsSelector(ps *LabelSelector) (labels.Selector, error) {
  ...
	selector := labels.NewSelector()   
  ...
}

func NodeSelectorRequirementsAsSelector(nsm []v1.NodeSelectorRequirement) (labels.Selector, error) {
	...
	selector := labels.NewSelector() 
	...
	}

func TopologySelectorRequirementsAsSelector(tsm []v1.TopologySelectorLabelRequirement) (labels.Selector, error) {
  ...
	selector := labels.NewSelector()  
  ...
}

NewSelector() returns an internalSelector, and internalSelector is simply a slice of Requirement values.

!FILENAME vendor/k8s.io/apimachinery/pkg/labels/selector.go:79

func NewSelector() Selector {
	return internalSelector(nil)
}

type internalSelector []Requirement

The Matches() method of internalSelector iterates over its requirements and calls Requirement.Matches() on each one; every requirement must match.

!FILENAME vendor/k8s.io/apimachinery/pkg/labels/selector.go:340

func (lsel internalSelector) Matches(l Labels) bool {
	for ix := range lsel {
		// lsel[ix] is a Requirement; the first one that fails makes the whole selector fail
		if matches := lsel[ix].Matches(l); !matches {
			return false
		}
	}
	return true
}

Next, the Requirement struct (key, operator, values). This is the affinity match expression that you write in the configuration.

!FILENAME vendor/k8s.io/apimachinery/pkg/labels/selector.go:114

type Requirement struct {
	key      string
	operator selection.Operator
	// In huge majority of cases we have at most one value here.
	// It is generally faster to operate on a single-element slice
	// than on a single-element map, so we have a slice here.
	strValues []string
}

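Requirement.Matches() is where the expression is actually evaluated: depending on the operator, it checks the key and values against the labels and returns whether they match.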

!FILENAME vendor/k8s.io/apimachinery/pkg/labels/selector.go:192

func (r *Requirement) Matches(ls Labels) bool {
	switch r.operator {
	case selection.In, selection.Equals, selection.DoubleEquals:
		if !ls.Has(r.key) {                       //IN
			return false
		}
		return r.hasValue(ls.Get(r.key))
	case selection.NotIn, selection.NotEquals:   //NotIn
		if !ls.Has(r.key) {
			return true
		}
		return !r.hasValue(ls.Get(r.key))        
	case selection.Exists:                       //Exists
		return ls.Has(r.key)
	case selection.DoesNotExist:                 //NotExists
		return !ls.Has(r.key)
	case selection.GreaterThan, selection.LessThan: // Gt, Lt
		if !ls.Has(r.key) {
			return false
		}
		lsValue, err := strconv.ParseInt(ls.Get(r.key), 10, 64) // the label value must be a string that parses as a base-10 integer
		if err != nil {
			klog.V(10).Infof("ParseInt failed for value %+v in label %+v, %+v", ls.Get(r.key), ls, err)
			return false
		}

		// There should be only one strValue in r.strValues, and can be converted to a integer.
		if len(r.strValues) != 1 {
			klog.V(10).Infof("Invalid values count %+v of requirement %#v, for 'Gt', 'Lt' operators, exactly one value is required", len(r.strValues), r)
			return false
		}

		var rValue int64
		for i := range r.strValues {
			rValue, err = strconv.ParseInt(r.strValues[i], 10, 64)
			if err != nil {
				klog.V(10).Infof("ParseInt failed for value %+v in requirement %#v, for 'Gt', 'Lt' operators, the value must be an integer", r.strValues[i], r)
				return false
			}
		}
		return (r.operator == selection.GreaterThan && lsValue > rValue) || (r.operator == selection.LessThan && lsValue < rValue)
	default:
		return false
	}
}
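
The first question from the introduction can be checked with a small standalone sketch. It does not use the Kubernetes packages: the `requirement` type and its methods are hypothetical stand-ins, and `hasValue` is assumed to be a plain string-equality scan, since its body is not shown in the excerpt above. Under those assumptions, "In" is exact string comparison rather than a regular expression, and "Gt"/"Lt" compare integers parsed from the strings.

```go
package main

import (
	"fmt"
	"strconv"
)

// requirement is a simplified, hypothetical stand-in for labels.Requirement.
type requirement struct {
	key      string
	operator string // "In", "NotIn", "Gt" or "Lt" (other operators omitted)
	values   []string
}

// hasValue assumes a plain string-equality scan, with no pattern matching.
func (r requirement) hasValue(v string) bool {
	for _, want := range r.values {
		if want == v {
			return true
		}
	}
	return false
}

// matches follows the same branches as Requirement.Matches in the excerpt.
func (r requirement) matches(labels map[string]string) bool {
	got, ok := labels[r.key]
	switch r.operator {
	case "In":
		return ok && r.hasValue(got)
	case "NotIn":
		return !ok || !r.hasValue(got)
	case "Gt", "Lt":
		if !ok || len(r.values) != 1 {
			return false
		}
		lhs, errL := strconv.ParseInt(got, 10, 64)
		rhs, errR := strconv.ParseInt(r.values[0], 10, 64)
		if errL != nil || errR != nil {
			return false // non-numeric strings never match Gt/Lt
		}
		if r.operator == "Gt" {
			return lhs > rhs
		}
		return lhs < rhs
	default:
		return false
	}
}

func main() {
	node := map[string]string{"zone": "prd-zone-A", "cpu-count": "16"}

	exact := requirement{key: "zone", operator: "In", values: []string{"prd-zone-A", "prd-zone-B"}}
	pattern := requirement{key: "zone", operator: "In", values: []string{"prd-zone-.*"}}
	bigger := requirement{key: "cpu-count", operator: "Gt", values: []string{"8"}}
	notNumeric := requirement{key: "zone", operator: "Gt", values: []string{"8"}}

	fmt.Println(exact.matches(node))      // true: exact string equality
	fmt.Println(pattern.matches(node))    // false: the value is not treated as a regex
	fmt.Println(bigger.matches(node))     // true: 16 > 8 after ParseInt
	fmt.Println(notNumeric.matches(node)) // false: "prd-zone-A" is not an integer
}
```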

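Note:
Besides the label selector there are also NodeSelector, FieldsSelector, PropertySelector, and others. They are mostly similar implementations of a selector interface with essentially the same logic, and they are pointed out where they appear in the source analysis below.

Node Affinity

Node affinity basics:

Sample YAML configuration: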

---
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:    # pod instances are deployed in prd-zone-A or prd-zone-B
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/prd-zone-name
            operator: In
            values:
            - prd-zone-A
            - prd-zone-B
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: securityZone
            operator: In
            values:
            - BussinssZone
  containers:
  - name: with-node-affinity
    image: gcr.io/google_containers/pause:2.0

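Node affinity predicate policy: MatchNodeSelectorPred

Policy description:

Selects the Nodes whose labels match the NodeSelector and NodeAffinity defined on the pod being scheduled.

Applicable NodeAffinity field:

NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution

Predicate source analysis:

  1. Policy registration: defaults.init() registers a predicate under predicates.MatchNodeSelectorPred; its policy function is PodMatchNodeSelector().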

!FILENAME pkg/scheduler/algorithmprovider/defaults/defaults.go:78

func init() {
  ...
factory.RegisterFitPredicate(predicates.MatchNodeSelectorPred, predicates.PodMatchNodeSelector)
  ...
}
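
  2. Policy function: PodMatchNodeSelector()

It fetches the target Node and calls podMatchesNodeSelectorAndAffinityTerms() to match the pod being scheduled against that node. It returns true on a match; otherwise it returns false together with the failure reason.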

!FILENAME pkg/scheduler/algorithm/predicates/predicates.go:853

func PodMatchNodeSelector(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
	// Get the node being checked
	node := nodeInfo.Node()
	if node == nil {
		return false, nil, fmt.Errorf("node not found")
	}
	// Key sub-function.
	// Arguments: the pod being scheduled and the node fetched above (the node being checked)
	if podMatchesNodeSelectorAndAffinityTerms(pod, node) {
		return true, nil, nil
	}
	return false, []algorithm.PredicateFailureReason{ErrNodeSelectorNotMatch}, nil
}

podMatchesNodeSelectorAndAffinityTerms()

Checks the "required" conditions defined by NodeSelector and NodeAffinity.

!FILENAME pkg/scheduler/algorithm/predicates/predicates.go:807

func podMatchesNodeSelectorAndAffinityTerms(pod *v1.Pod, node *v1.Node) bool {
	// If NodeSelector is set, the Node's labels must satisfy every key/value pair it defines.
	if len(pod.Spec.NodeSelector) > 0 {
		selector := labels.SelectorFromSet(pod.Spec.NodeSelector)
		if !selector.Matches(labels.Set(node.Labels)) {
			return false
		}
	}
	// If NodeAffinity is set, match it via nodeMatchesNodeSelectorTerms() (analyzed in detail below).
	nodeAffinityMatches := true
	affinity := pod.Spec.Affinity
	if affinity != nil && affinity.NodeAffinity != nil {
		nodeAffinity := affinity.NodeAffinity
		if nodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution == nil {
			return true
		}

		if nodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution != nil {
			nodeSelectorTerms := nodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms
			klog.V(10).Infof("Match for RequiredDuringSchedulingIgnoredDuringExecution node selector terms %+v", nodeSelectorTerms)
      
			// Key function: nodeMatchesNodeSelectorTerms()
			nodeAffinityMatches = nodeAffinityMatches && nodeMatchesNodeSelectorTerms(node, nodeSelectorTerms)
		}

	}
	return nodeAffinityMatches
}

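  • If both NodeSelector and NodeAffinity.Required... are configured, both must be satisfied.

  • If NodeSelector fails, the result is false immediately and NodeAffinity is not evaluated.

  • If multiple NodeSelectorTerms are specified, the node only needs to satisfy one of them.

  • If multiple MatchExpressions are specified within a term, the node must satisfy all of them.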
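
The four rules above amount to "nodeSelector AND (term OR term OR ...)", where each term is an AND over its expressions. The following is a minimal standalone sketch of that evaluation order. The `expr` and `term` types and the `nodeFits` function are hypothetical simplifications (only the "In" operator is modeled, and a nil terms slice stands for "no required node affinity"); they are not the scheduler's own types.

```go
package main

import "fmt"

// expr is a hypothetical, simplified match expression supporting only "In".
type expr struct {
	key    string
	values []string
}

func (e expr) matches(labels map[string]string) bool {
	got, ok := labels[e.key]
	if !ok {
		return false
	}
	for _, v := range e.values {
		if v == got {
			return true
		}
	}
	return false
}

// term stands in for one NodeSelectorTerm: its expressions are ANDed.
type term []expr

func (t term) matches(labels map[string]string) bool {
	if len(t) == 0 {
		return false // an empty term selects nothing
	}
	for _, e := range t {
		if !e.matches(labels) {
			return false
		}
	}
	return true
}

// nodeFits combines nodeSelector (all pairs required) with the required
// node affinity terms (ORed). A nil terms slice means "no required affinity".
func nodeFits(nodeSelector map[string]string, terms []term, labels map[string]string) bool {
	for k, want := range nodeSelector {
		if got, ok := labels[k]; !ok || got != want {
			return false // nodeSelector failed, affinity is not evaluated
		}
	}
	if terms == nil {
		return true
	}
	for _, t := range terms {
		if t.matches(labels) {
			return true // one matching term is enough
		}
	}
	return false
}

func main() {
	terms := []term{
		{{key: "zone", values: []string{"a"}}, {key: "disk", values: []string{"ssd"}}},
		{{key: "zone", values: []string{"b"}}},
	}
	fmt.Println(nodeFits(nil, terms, map[string]string{"zone": "a", "disk": "hdd"}))               // false
	fmt.Println(nodeFits(nil, terms, map[string]string{"zone": "b", "disk": "hdd"}))               // true
	fmt.Println(nodeFits(map[string]string{"gpu": "true"}, terms, map[string]string{"zone": "b"})) // false
}
```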

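nodeMatchesNodeSelectorTerms()
Calls v1helper.MatchNodeSelectorTerms() to check whether the node satisfies the required conditions defined by the NodeSelectorTerms.
Each term can carry two kinds of conditions (matchExpressions / matchFields):
- "requiredDuringSchedulingIgnoredDuringExecution.matchExpressions": checked against the node's labels (key and value)
- "requiredDuringSchedulingIgnoredDuringExecution.matchFields": checked against the value of a node field rather than a label key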

!FILENAME pkg/scheduler/algorithm/predicates/predicates.go:797

func nodeMatchesNodeSelectorTerms(node *v1.Node, nodeSelectorTerms []v1.NodeSelectorTerm) bool {
	nodeFields := map[string]string{}
	// Collect the fields of the node being checked
	for k, f := range algorithm.NodeFieldSelectorKeys {
		nodeFields[k] = f(node)
	}
	// Call v1helper.MatchNodeSelectorTerms()
	// Arguments: nodeSelectorTerms  required terms from the affinity configuration
	//            labels             labels of the node being checked
	//            fields             fields of the node being checked
	return v1helper.MatchNodeSelectorTerms(nodeSelectorTerms, labels.Set(node.Labels), fields.Set(nodeFields))
}

!FILENAME pkg/apis/core/v1/helper/helpers.go:302

func MatchNodeSelectorTerms( nodeSelectorTerms []v1.NodeSelectorTerm,
	nodeLabels labels.Set, nodeFields fields.Set,) bool {
	for _, req := range nodeSelectorTerms {
		// nil or empty term selects no objects
		if len(req.MatchExpressions) == 0 && len(req.MatchFields) == 0 {
			continue
		}
		// Match the MatchExpressions (label) conditions                               ①
		if len(req.MatchExpressions) != 0 {
			labelSelector, err := NodeSelectorRequirementsAsSelector(req.MatchExpressions)
			if err != nil || !labelSelector.Matches(nodeLabels) {
				continue
			}
		}
		// Match the MatchFields (field) conditions                                    ②
		if len(req.MatchFields) != 0 {
			fieldSelector, err := NodeSelectorRequirementsAsFieldSelector(req.MatchFields)
			if err != nil || !fieldSelector.Matches(nodeFields) {
				continue
			}
		}
		return true
	}
	return false
}

NodeSelectorRequirementsAsSelector()
Converts the expressions configured under "requiredDuringSchedulingIgnoredDuringExecution.matchExpressions" into a labels.Selector instance. (labels.Selector is analyzed in the Labels.selector section near the start of this article.)

!FILENAME pkg/apis/core/v1/helper/helpers.go:222

func NodeSelectorRequirementsAsSelector(nsm []v1.NodeSelectorRequirement) (labels.Selector, error) {
	if len(nsm) == 0 {
		return labels.Nothing(), nil
	}
	selector := labels.NewSelector()
	for _, expr := range nsm {
		var op selection.Operator
		switch expr.Operator {
		case v1.NodeSelectorOpIn:
			op = selection.In
		case v1.NodeSelectorOpNotIn:
			op = selection.NotIn
		case v1.NodeSelectorOpExists:
			op = selection.Exists
		case v1.NodeSelectorOpDoesNotExist:
			op = selection.DoesNotExist
		case v1.NodeSelectorOpGt:
			op = selection.GreaterThan
		case v1.NodeSelectorOpLt:
			op = selection.LessThan
		default:
			return nil, fmt.Errorf("%q is not a valid node selector operator", expr.Operator)
		}
		// The three key elements of an expression: expr.Key, op, expr.Values
		r, err := labels.NewRequirement(expr.Key, op, expr.Values)
		if err != nil {
			return nil, err
		}
		selector = selector.Add(*r)
	}
	return selector, nil
}

NodeSelectorRequirementsAsFieldSelector()
Converts the expressions configured under "requiredDuringSchedulingIgnoredDuringExecution.matchFields" into a fields.Selector instance.

!FILENAME pkg/apis/core/v1/helper/helpers.go:256

func NodeSelectorRequirementsAsFieldSelector(nsm []v1.NodeSelectorRequirement) (fields.Selector, error) {
	if len(nsm) == 0 {
		return fields.Nothing(), nil
	}

	selectors := []fields.Selector{}
	for _, expr := range nsm {
		switch expr.Operator {
		case v1.NodeSelectorOpIn:
			if len(expr.Values) != 1 {
				return nil, fmt.Errorf("unexpected number of value (%d) for node field selector operator %q",
					len(expr.Values), expr.Operator)
			}
			selectors = append(selectors, fields.OneTermEqualSelector(expr.Key, expr.Values[0]))

		case v1.NodeSelectorOpNotIn:
			if len(expr.Values) != 1 {
				return nil, fmt.Errorf("unexpected number of value (%d) for node field selector operator %q",
					len(expr.Values), expr.Operator)
			}
			selectors = append(selectors, fields.OneTermNotEqualSelector(expr.Key, expr.Values[0]))

		default:
			return nil, fmt.Errorf("%q is not a valid node field selector operator", expr.Operator)
		}
	}

	return fields.AndSelectors(selectors...), nil
}
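
The conversion rules for matchFields are stricter than those for matchExpressions: only In and NotIn are accepted, each with exactly one value, and the resulting terms are ANDed. The sketch below restates those rules without the Kubernetes packages. The `fieldReq` and `fieldMatcher` types and the `asFieldMatcher` function are hypothetical, and the field key "metadata.name" is an assumption chosen for illustration.

```go
package main

import "fmt"

// fieldReq is a hypothetical, simplified NodeSelectorRequirement for matchFields.
type fieldReq struct {
	key      string
	operator string
	values   []string
}

// fieldMatcher reports whether a node's fields satisfy a selector.
type fieldMatcher func(fields map[string]string) bool

// asFieldMatcher mirrors the conversion rules in the excerpt: only In and
// NotIn are accepted, each with exactly one value, and all terms are ANDed.
func asFieldMatcher(reqs []fieldReq) (fieldMatcher, error) {
	if len(reqs) == 0 {
		return func(map[string]string) bool { return false }, nil // selects nothing
	}
	var terms []fieldMatcher
	for _, r := range reqs {
		if r.operator != "In" && r.operator != "NotIn" {
			return nil, fmt.Errorf("%q is not a valid node field selector operator", r.operator)
		}
		if len(r.values) != 1 {
			return nil, fmt.Errorf("unexpected number of values (%d) for operator %q", len(r.values), r.operator)
		}
		key, want, negate := r.key, r.values[0], r.operator == "NotIn"
		terms = append(terms, func(fields map[string]string) bool {
			// In: field == want; NotIn: field != want
			return (fields[key] == want) != negate
		})
	}
	return func(fields map[string]string) bool {
		for _, t := range terms {
			if !t(fields) {
				return false
			}
		}
		return true
	}, nil
}

func main() {
	fields := map[string]string{"metadata.name": "node-1"}

	in, _ := asFieldMatcher([]fieldReq{{"metadata.name", "In", []string{"node-1"}}})
	notIn, _ := asFieldMatcher([]fieldReq{{"metadata.name", "NotIn", []string{"node-1"}}})
	_, errTwo := asFieldMatcher([]fieldReq{{"metadata.name", "In", []string{"node-1", "node-2"}}})
	_, errGt := asFieldMatcher([]fieldReq{{"metadata.name", "Gt", []string{"1"}}})

	fmt.Println(in(fields), notIn(fields))   // true false
	fmt.Println(errTwo != nil, errGt != nil) // true true
}
```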

  3. Key data structures
    Definitions of the NodeSelector-related structs

!FILENAME vendor/k8s.io/api/core/v1/types.go:2436

type NodeSelector struct {
	NodeSelectorTerms []NodeSelectorTerm `json:"nodeSelectorTerms" protobuf:"bytes,1,rep,name=nodeSelectorTerms"`
}

type NodeSelectorTerm struct {
	MatchExpressions []NodeSelectorRequirement `json:"matchExpressions,omitempty" protobuf:"bytes,1,rep,name=matchExpressions"`
	MatchFields []NodeSelectorRequirement `json:"matchFields,omitempty" protobuf:"bytes,2,rep,name=matchFields"`
}

type NodeSelectorRequirement struct {
	Key string `json:"key" protobuf:"bytes,1,opt,name=key"`
	Operator NodeSelectorOperator `json:"operator" protobuf:"bytes,2,opt,name=operator,casttype=NodeSelectorOperator"`
	Values []string `json:"values,omitempty" protobuf:"bytes,3,rep,name=values"`
}

type NodeSelectorOperator string
const (
	NodeSelectorOpIn           NodeSelectorOperator = "In"
	NodeSelectorOpNotIn        NodeSelectorOperator = "NotIn"
	NodeSelectorOpExists       NodeSelectorOperator = "Exists"
	NodeSelectorOpDoesNotExist NodeSelectorOperator = "DoesNotExist"
	NodeSelectorOpGt           NodeSelectorOperator = "Gt"
	NodeSelectorOpLt           NodeSelectorOperator = "Lt"
)

Struct definitions of the fields.Selector implementations (matching on value)

!FILENAME vendor/k8s.io/apimachinery/pkg/fields/selector.go:78

type hasTerm struct {
	field, value string
}

func (t *hasTerm) Matches(ls Fields) bool {
	return ls.Get(t.field) == t.value
}

type notHasTerm struct {
	field, value string
}

func (t *notHasTerm) Matches(ls Fields) bool {
	return ls.Get(t.field) != t.value
}

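Node affinity priority policy: NodeAffinityPriority

Policy description:

Matches the candidate Nodes against the affinity conditions configured on the pod being scheduled and scores them.

Applicable NodeAffinity field:

NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution

Priority source analysis:

  1. Policy registration: defaultPriorities() registers a priority named "NodeAffinityPriority" together with its two functions, Map and Reduce:

    • CalculateNodeAffinityPriorityMap(): the map step. Matches each candidate Node against the preferred terms and accumulates a weighted score for it.
    • CalculateNodeAffinityPriorityReduce(): the reduce step. Rescales the scores into the range 0 to 10.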

!FILENAME pkg/scheduler/algorithmprovider/defaults/defaults.go:266

// k8s.io/kubernetes/pkg/scheduler/algorithmprovider/defaults/defaults.go

func defaultPriorities() sets.String {
  ...
  
factory.RegisterPriorityFunction2("NodeAffinityPriority", priorities.CalculateNodeAffinityPriorityMap, priorities.CalculateNodeAffinityPriorityReduce, 1),
  
  ...
}
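
  2. Policy functions:

    Map step: CalculateNodeAffinityPriorityMap()
    Iterates over the terms defined in affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution, converts each term into a selector (labels.Selector), and matches it against the candidate Node's labels. Each time a term matches, its Weight is added to a running total. The final total is returned as that Node's score.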

!FILENAME pkg/scheduler/algorithm/priorities/node_affinity.go:34

func CalculateNodeAffinityPriorityMap(pod *v1.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error) {
	// Get the node being checked
  node := nodeInfo.Node()
	if node == nil {
		return schedulerapi.HostPriority{}, fmt.Errorf("node not found")
	}

	// Default to the Affinity configured in the pod Spec
	affinity := pod.Spec.Affinity
	if priorityMeta, ok := meta.(*priorityMetadata); ok {
		// We were able to parse metadata, use affinity from there.
		affinity = priorityMeta.affinity
	}

	var count int32
	if affinity != nil && affinity.NodeAffinity != nil && affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution != nil {
		// Iterate over the preferred terms defined in PreferredDuringSchedulingIgnoredDuringExecution
		for i := range affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution {
			preferredSchedulingTerm := &affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution[i]
			if preferredSchedulingTerm.Weight == 0 { // note: a term configured with weight 0 is skipped entirely
				continue
			}

			// TODO: Avoid computing it for all nodes if this becomes a performance problem.
			// Build a labels.Selector from the term's MatchExpressions.
			nodeSelector, err := v1helper.NodeSelectorRequirementsAsSelector(preferredSchedulingTerm.Preference.MatchExpressions)
			if err != nil {
				return schedulerapi.HostPriority{}, err
			}
			if nodeSelector.Matches(labels.Set(node.Labels)) {
				count += preferredSchedulingTerm.Weight
			}
		}
	}
	// Return the Node's score
	return schedulerapi.HostPriority{
		Host:  node.Name,
		Score: int(count),
	}, nil
}

Here we again meet NodeSelectorRequirementsAsSelector(), which was analyzed in the predicate section above.
It returns a labels.Selector, and selector.Matches() then checks whether node.Labels satisfy the conditions.
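
The map step boils down to "sum the weights of the preferred terms that the node satisfies". Below is a minimal standalone sketch of that accumulation. The `preferredTerm` type and `rawScore` function are hypothetical simplifications in which each preference is a single "In" expression; the label keys and the second term are made up for illustration.

```go
package main

import "fmt"

// preferredTerm is a hypothetical, simplified PreferredSchedulingTerm whose
// preference is a single "In" expression.
type preferredTerm struct {
	weight int32
	key    string
	values []string
}

// rawScore sums the weights of all preferred terms the node's labels satisfy.
func rawScore(terms []preferredTerm, labels map[string]string) int32 {
	var count int32
	for _, t := range terms {
		if t.weight == 0 {
			continue // zero-weight terms are skipped
		}
		got, ok := labels[t.key]
		if !ok {
			continue
		}
		for _, v := range t.values {
			if v == got {
				count += t.weight
				break
			}
		}
	}
	return count
}

func main() {
	terms := []preferredTerm{
		{weight: 1, key: "securityZone", values: []string{"BussinssZone"}},
		{weight: 5, key: "disktype", values: []string{"ssd"}},
	}
	nodeA := map[string]string{"securityZone": "BussinssZone", "disktype": "ssd"}
	nodeB := map[string]string{"disktype": "ssd"}
	nodeC := map[string]string{"disktype": "hdd"}

	fmt.Println(rawScore(terms, nodeA), rawScore(terms, nodeB), rawScore(terms, nodeC)) // 6 5 0
}
```

These raw totals are not yet on the 0 to 10 scale; that is the job of the reduce step.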

Reduce step: CalculateNodeAffinityPriorityReduce()

Rescales each node's final score into the range 0 to 10.

The code uses NormalizeReduce() with MaxPriority set to 10 and reverse set to false.

!FILENAME pkg/scheduler/algorithm/priorities/node_affinity.go:77

const	MaxPriority = 10
var CalculateNodeAffinityPriorityReduce = NormalizeReduce(schedulerapi.MaxPriority, false)

NormalizeReduce()

  • The resulting score lies between 0 and MaxPriority
  • When reverse is true, final score = MaxPriority - normalized score

!FILENAME pkg/scheduler/algorithm/priorities/reduce.go:29

func NormalizeReduce(maxPriority int, reverse bool) algorithm.PriorityReduceFunction {
	return func(
		_ *v1.Pod,
		_ interface{},
		_ map[string]*schedulercache.NodeInfo,
		result schedulerapi.HostPriorityList) error {

		var maxCount int
		// Find the highest score
		for i := range result {
			if result[i].Score > maxCount {
				maxCount = result[i].Score
			}
		}
		// If the highest score is 0: when reverse is true, set every score to maxPriority; otherwise leave the scores unchanged
		if maxCount == 0 {
			if reverse {
				for i := range result {
					result[i].Score = maxPriority
				}
			}
			return nil
		}
		// normalized score = maxPriority * original score / highest score
		// if reverse is true: maxPriority - normalized score
		for i := range result {
			score := result[i].Score

			score = maxPriority * score / maxCount
			if reverse {
				score = maxPriority - score
			}

			result[i].Score = score
		}
		return nil
	}
}
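
A short worked example makes the integer arithmetic visible. The `normalize` function below is a standalone sketch of the same steps on a plain slice of ints (it copies its input instead of mutating a HostPriorityList); the input scores are made up.

```go
package main

import "fmt"

// normalize is a standalone sketch of the reduce step on plain ints.
func normalize(scores []int, maxPriority int, reverse bool) []int {
	out := make([]int, len(scores))
	copy(out, scores)

	maxCount := 0
	for _, s := range out {
		if s > maxCount {
			maxCount = s
		}
	}
	if maxCount == 0 {
		if reverse {
			for i := range out {
				out[i] = maxPriority
			}
		}
		return out
	}
	for i, s := range out {
		s = maxPriority * s / maxCount // integer division truncates
		if reverse {
			s = maxPriority - s
		}
		out[i] = s
	}
	return out
}

func main() {
	fmt.Println(normalize([]int{6, 5, 0}, 10, false)) // [10 8 0]
	fmt.Println(normalize([]int{6, 5, 0}, 10, true))  // [0 2 10]
	fmt.Println(normalize([]int{0, 0}, 10, false))    // [0 0]
}
```

With raw scores 6, 5 and 0, the highest score maps to 10, and 5 maps to 10 * 5 / 6 = 8 because the division truncates.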

Pod Affinity

Pod affinity basics:

Sample YAML configuration:

---
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: affinity
  labels:
    app: affinity
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: affinity
        role: lab-web
    spec:
      containers:
      - name: nginx
        image: nginx:1.9.0
        ports:
        - containerPort: 80
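          name: nginx-web-lab
      affinity:                     # for high availability, the three pods should be spread across different Nodes
        podAntiAffinity: 
          requiredDuringSchedulingIgnoredDuringExecution:  
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - affinity          # must match this Deployment's own pod label (app: affinity) for the replicas to repel each other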
            topologyKey: kubernetes.io/hostname

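Pod affinity predicate policy: MatchInterPodAffinityPred

Policy description:

For the pod being scheduled, the affinity and anti-affinity configuration is used to find the matching target Pods. The TopologyKey values (the topologyKey defined in the pod's affinity term) of the Nodes running those Pods are then compared one by one against the candidate Nodes.

Applicable PodAffinity fields:
PodAffinity.RequiredDuringSchedulingIgnoredDuringExecution
PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution

Predicate source analysis:

  1. Policy registration: defaultPredicates() registers a predicate named "MatchInterPodAffinity".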

!FILENAME pkg/scheduler/algorithmprovider/defaults/defaults.go:143

func defaultPredicates() sets.String {
  ...
  
factory.RegisterFitPredicateFactory(
			predicates.MatchInterPodAffinityPred,
			func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
				return predicates.NewPodAffinityPredicate(args.NodeInfo, args.PodLister)
			},
		),
  
  ...
}

  2. Policy function: checker.InterPodAffinityMatches()
    The function is returned by NewPodAffinityPredicate(), which instantiates a PodAffinityChecker.

!FILENAME pkg/scheduler/algorithm/predicates/predicates.go:1138

type PodAffinityChecker struct {
	info      NodeInfo
	podLister algorithm.PodLister
}

func NewPodAffinityPredicate(info NodeInfo, podLister algorithm.PodLister) algorithm.FitPredicate {
	checker := &PodAffinityChecker{
		info:      info,
		podLister: podLister,
	}
	return checker.InterPodAffinityMatches // return the policy function
}

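InterPodAffinityMatches()
Checks whether a pod can be scheduled onto a particular Node, given the pod affinity and anti-affinity configuration.

  1. satisfiesExistingPodsAntiAffinity(): placing the pod must not violate the anti-affinity rules of pods that already exist.
  2. satisfiesPodsAffinityAntiAffinity(): the pod's own affinity and anti-affinity rules must be satisfied.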

!FILENAME pkg/scheduler/algorithm/predicates/predicates.go:1155

func (c *PodAffinityChecker) InterPodAffinityMatches(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
	node := nodeInfo.Node()
	if node == nil {
		return false, nil, fmt.Errorf("node not found")
	}     
                                          //①
	if failedPredicates, error := c.satisfiesExistingPodsAntiAffinity(pod, meta, nodeInfo); failedPredicates != nil {
		failedPredicates := append([]algorithm.PredicateFailureReason{ErrPodAffinityNotMatch}, failedPredicates)
		return false, failedPredicates, error
	}

	// Now check if <pod> requirements will be satisfied on this node.
	affinity := pod.Spec.Affinity
	if affinity == nil || (affinity.PodAffinity == nil && affinity.PodAntiAffinity == nil) {
		return true, nil, nil
	}   
                                         //② 
	if failedPredicates, error := c.satisfiesPodsAffinityAntiAffinity(pod, meta, nodeInfo, affinity); failedPredicates != nil {
		failedPredicates := append([]algorithm.PredicateFailureReason{ErrPodAffinityNotMatch}, failedPredicates)
		return false, failedPredicates, error
	}

	return true, nil, nil
}

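① satisfiesExistingPodsAntiAffinity()
Checks whether scheduling the pod onto the target node would violate anti-affinity rules defined by other pods.
That is, if some existing Pod defines an anti-affinity term that matches the pod being scheduled, and the target Node falls inside the topology domain covered by that term, the Node must not be added to the list of candidate Nodes.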

!FILENAME pkg/scheduler/algorithm/predicates/predicates.go:1293

func (c *PodAffinityChecker) satisfiesExistingPodsAntiAffinity(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (algorithm.PredicateFailureReason, error) {
	node := nodeInfo.Node()
	if node == nil {
		return ErrExistingPodsAntiAffinityRulesNotMatch, fmt.Errorf("Node is nil")
	}
	var topologyMaps *topologyPairsMaps
	// If precomputed metadata exists, take topologyPairsAntiAffinityPodsMap from it directly
	if predicateMeta, ok := meta.(*predicateMetadata); ok {
		topologyMaps = predicateMeta.topologyPairsAntiAffinityPodsMap
	} else {
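		// No precomputed metadata: fall back to listing pods.
		// nodeInfo.Filter drops only pods whose nodeName equals this node's name but which are not in nodeInfo;
		// pods running on other Nodes are kept.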
		filteredPods, err := c.podLister.FilteredList(nodeInfo.Filter, labels.Everything())
		if err != nil {
			errMessage := fmt.Sprintf("Failed to get all pods, %+v", err)
			klog.Error(errMessage)
			return ErrExistingPodsAntiAffinityRulesNotMatch, errors.New(errMessage)
		}
		// Build the topologyMaps of existing Pods whose anti-affinity terms match the pod being scheduled
		if topologyMaps, err = c.getMatchingAntiAffinityTopologyPairsOfPods(pod, filteredPods); err != nil {
			errMessage := fmt.Sprintf("Failed to get all terms that pod %+v matches, err: %+v", podName(pod), err)
			klog.Error(errMessage)
			return ErrExistingPodsAntiAffinityRulesNotMatch, errors.New(errMessage)
		}
	}

	// Treat each node label as a topology pair (key/value) and check whether it appears among the anti-affinity topology pairs.
	for topologyKey, topologyValue := range node.Labels {
		if topologyMaps.topologyPairToPods[topologyPair{key: topologyKey, value: topologyValue}] != nil {
			klog.V(10).Infof("Cannot schedule pod %+v onto node %v", podName(pod), node.Name)
			return ErrExistingPodsAntiAffinityRulesNotMatch, nil
		}
	}
	return nil, nil
}

getMatchingAntiAffinityTopologyPairsOfPods()
Builds the topologyMaps of existing Pods whose anti-affinity terms match the pod being scheduled.

!FILENAME pkg/scheduler/algorithm/predicates/predicates.go:1270

func (c *PodAffinityChecker) getMatchingAntiAffinityTopologyPairsOfPods(pod *v1.Pod, existingPods []*v1.Pod) (*topologyPairsMaps, error) {
	topologyMaps := newTopologyPairsMaps()
	// Iterate over all existing Pods and look up the Node each one runs on
	for _, existingPod := range existingPods {
		existingPodNode, err := c.info.GetNodeInfo(existingPod.Spec.NodeName)
		if err != nil {
			if apierrors.IsNotFound(err) {
				klog.Errorf("Node not found, %v", existingPod.Spec.NodeName)
				continue
			}
			return nil, err
		}
		// Compute the topology pairs from the pod being scheduled, the existing pod, and that pod's Node (fetched above).
		// getMatchingAntiAffinityTopologyPairsOfPod() is described below
		existingPodTopologyMaps, err := getMatchingAntiAffinityTopologyPairsOfPod(pod, existingPod, existingPodNode)
		if err != nil {
			return nil, err
		}
		topologyMaps.appendMaps(existingPodTopologyMaps)
	}
	return topologyMaps, nil
}

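// 1) If existingPod defines no anti-affinity, return immediately.
// 2) If it does, check whether any of its anti-affinity terms matches the pod being scheduled.
//    If one matches, record the term's TopologyKey together with the Node's value for that key.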
func getMatchingAntiAffinityTopologyPairsOfPod(newPod *v1.Pod, existingPod *v1.Pod, node *v1.Node) (*topologyPairsMaps, error) {
	affinity := existingPod.Spec.Affinity
	if affinity == nil || affinity.PodAntiAffinity == nil {
		return nil, nil
	}

	topologyMaps := newTopologyPairsMaps()
	for _, term := range GetPodAntiAffinityTerms(affinity.PodAntiAffinity) {
		namespaces := priorityutil.GetNamespacesFromPodAffinityTerm(existingPod, &term)
		selector, err := metav1.LabelSelectorAsSelector(term.LabelSelector)
		if err != nil {
			return nil, err
		}
		if priorityutil.PodMatchesTermsNamespaceAndSelector(newPod, namespaces, selector) {
			if topologyValue, ok := node.Labels[term.TopologyKey]; ok {
				pair := topologyPair{key: term.TopologyKey, value: topologyValue}
				topologyMaps.addTopologyPair(pair, existingPod)
			}
		}
	}
	return topologyMaps, nil
}
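
Putting the two functions above together also answers the second question from the introduction: the topologyKey is just a lookup into the node's labels, so under this logic any label key present on the nodes can serve as one. The sketch below is a standalone simplification; the `pod`, `antiTerm`, `blockedPairs` and `nodeAllowed` names are hypothetical, the selector is reduced to a plain label map, terms are assumed to apply only to the owning pod's namespace, and "rack" is a made-up custom label key.

```go
package main

import "fmt"

type topologyPair struct{ key, value string }

// antiTerm is a hypothetical, simplified anti-affinity term: the selector is
// a plain label map and the term applies to the owning pod's namespace.
type antiTerm struct {
	matchLabels map[string]string
	topologyKey string
}

type pod struct {
	namespace, nodeName string
	labels              map[string]string
	antiAffinity        []antiTerm
}

// selects reports whether labels contain every pair in selector.
func selects(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// blockedPairs collects the topology pairs that existing pods forbid for newPod.
func blockedPairs(newPod pod, existing []pod, nodeLabels map[string]map[string]string) map[topologyPair]bool {
	pairs := map[topologyPair]bool{}
	for _, ep := range existing {
		for _, t := range ep.antiAffinity {
			if ep.namespace != newPod.namespace || !selects(t.matchLabels, newPod.labels) {
				continue
			}
			// The pair is (topologyKey, value of that label on the existing pod's node).
			if v, ok := nodeLabels[ep.nodeName][t.topologyKey]; ok {
				pairs[topologyPair{t.topologyKey, v}] = true
			}
		}
	}
	return pairs
}

// nodeAllowed rejects a node if any of its labels is a blocked topology pair.
func nodeAllowed(labels map[string]string, pairs map[topologyPair]bool) bool {
	for k, v := range labels {
		if pairs[topologyPair{k, v}] {
			return false
		}
	}
	return true
}

func main() {
	nodes := map[string]map[string]string{
		"node-1": {"kubernetes.io/hostname": "node-1", "rack": "r1"},
		"node-2": {"kubernetes.io/hostname": "node-2", "rack": "r1"},
		"node-3": {"kubernetes.io/hostname": "node-3", "rack": "r2"},
	}
	existing := []pod{{
		namespace: "default", nodeName: "node-1",
		labels:       map[string]string{"app": "db"},
		antiAffinity: []antiTerm{{matchLabels: map[string]string{"app": "web"}, topologyKey: "rack"}},
	}}
	newPod := pod{namespace: "default", labels: map[string]string{"app": "web"}}

	pairs := blockedPairs(newPod, existing, nodes)
	fmt.Println(nodeAllowed(nodes["node-2"], pairs), nodeAllowed(nodes["node-3"], pairs)) // false true
}
```

Here node-2 is rejected even though the existing pod runs on node-1, because both nodes carry rack=r1 and the term's topology domain is the rack.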

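② satisfiesPodsAffinityAntiAffinity()
Checks the pod's own affinity and anti-affinity rules.
First look at the code structure. I split it into two parts, the if{} branch and the else{} branch, depending on whether precomputed predicate metadata is available.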

!FILENAME pkg/scheduler/algorithm/predicates/predicates.go:1367

func (c *PodAffinityChecker) satisfiesPodsAffinityAntiAffinity(pod *v1.Pod,
	meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo,
	affinity *v1.Affinity) (algorithm.PredicateFailureReason, error) {
	node := nodeInfo.Node()
	if node == nil {
		return ErrPodAffinityRulesNotMatch, fmt.Errorf("Node is nil")
	}
	if predicateMeta, ok := meta.(*predicateMetadata); ok {
	  ...    //partI
	} else { 
    ...    //partII  
	}
	return nil, nil
}

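partI if{...}

  • This branch is used when precomputed metadata is available; otherwise execution goes to else{...}.
  • Get all of the pod's AffinityTerms. If any exist, use the metadata to check whether the node's topologyKey and value for every term are present in metadata.topologyPairsPotentialAffinityPods (the topology pairs of pods that potentially match the affinity terms).
  • Get all of the pod's AntiAffinityTerms. If any exist, use the metadata to check whether the node's topologyKey and value for any one term are present in metadata.topologyPairsPotentialAntiAffinityPods (the topology pairs of pods that potentially match the anti-affinity terms).
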
	if predicateMeta, ok := meta.(*predicateMetadata); ok {
		// Check all affinity terms.
		topologyPairsPotentialAffinityPods := predicateMeta.topologyPairsPotentialAffinityPods
		if affinityTerms := GetPodAffinityTerms(affinity.PodAffinity); len(affinityTerms) > 0 {
			matchExists := c.nodeMatchesAllTopologyTerms(pod, topologyPairsPotentialAffinityPods, nodeInfo, affinityTerms)
      
			if !matchExists {
				if !(len(topologyPairsPotentialAffinityPods.topologyPairToPods) == 0 && targetPodMatchesAffinityOfPod(pod, pod)) {
					klog.V(10).Infof("Cannot schedule pod %+v onto node %v, because of PodAffinity",
						podName(pod), node.Name)
					return ErrPodAffinityRulesNotMatch, nil
				}
			}
		}

		// Check all anti-affinity terms.
		topologyPairsPotentialAntiAffinityPods := predicateMeta.topologyPairsPotentialAntiAffinityPods
		if antiAffinityTerms := GetPodAntiAffinityTerms(affinity.PodAntiAffinity); len(antiAffinityTerms) > 0 {
			matchExists := c.nodeMatchesAnyTopologyTerm(pod, topologyPairsPotentialAntiAffinityPods, nodeInfo, antiAffinityTerms)
			if matchExists {
				klog.V(10).Infof("Cannot schedule pod %+v onto node %v, because of PodAntiAffinity",
					podName(pod), node.Name)
				return ErrPodAntiAffinityRulesNotMatch, nil
			}
		}
 }

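The sub-functions used inside the if{…} branch are analyzed below, in the order in which they appear in the code:

GetPodAffinityTerms()
If the pod defines hard (required) podAffinity rules, returns all of the required terms.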

!FILENAME pkg/scheduler/algorithm/predicates/predicates.go:1217

func GetPodAffinityTerms(podAffinity *v1.PodAffinity) (terms []v1.PodAffinityTerm) {
	if podAffinity != nil {
		if len(podAffinity.RequiredDuringSchedulingIgnoredDuringExecution) != 0 {
			terms = podAffinity.RequiredDuringSchedulingIgnoredDuringExecution
		}
	}
	return terms
}

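nodeMatchesAllTopologyTerms()
Checks whether the target Node matches the topology value of every affinity term.

!FILENAME pkg/scheduler/algorithm/predicates/predicates.go:1336

// The target Node must carry the TopologyKey of every affinity term, and its value must equal the value of
// that key on some Node running a Pod matched by the affinity expression.
// Note: in this branch the topologyPairs were precomputed in the metadata.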
func (c *PodAffinityChecker) nodeMatchesAllTopologyTerms(pod *v1.Pod, topologyPairs *topologyPairsMaps, nodeInfo *schedulercache.NodeInfo, terms []v1.PodAffinityTerm) bool {
	node := nodeInfo.Node()
	for _, term := range terms {
		// Check whether the target node has a label whose key is the term's TopologyKey, and read its value.
		// Build a topologyPair from that key and value.
		// Look the pair up in metadata.topologyPairsPotentialAffinityPods (the topology pairs of
		// potentially matching pods) to decide whether the target node matches.
		if topologyValue, ok := node.Labels[term.TopologyKey]; ok {
			pair := topologyPair{key: term.TopologyKey, value: topologyValue}
			if _, ok := topologyPairs.topologyPairToPods[pair]; !ok {
				return false // one unsatisfied term is enough to return false
			}
		} else {
			return false
		}
	}
	return true
}

// Definition of topologyPairsMaps
type topologyPairsMaps struct {
    topologyPairToPods    map[topologyPair]podSet
    podToTopologyPairs    map[string]topologyPairSet
}

targetPodMatchesAffinityOfPod()
Using pod's affinity terms, checks whether the target pod's namespace qualifies and whether its labels match the label selector expressions.

!FILENAME pkg/scheduler/algorithm/predicates/metadata.go:498

func targetPodMatchesAffinityOfPod(pod, targetPod *v1.Pod) bool {
	affinity := pod.Spec.Affinity
	if affinity == nil || affinity.PodAffinity == nil {
		return false
	}
	affinityProperties, err := getAffinityTermProperties(pod, GetPodAffinityTerms(affinity.PodAffinity))   // ① 
	if err != nil {
		klog.Errorf("error in getting affinity properties of Pod %v", pod.Name)
		return false
	}                                          // ② 
	return podMatchesAllAffinityTermProperties(targetPod, affinityProperties)
}

// ① Collect the namespaces and selector defined by each affinity term and
//    return a slice of affinityTermProperties, one {namespaces, selector} entry per term.
func getAffinityTermProperties(pod *v1.Pod, terms []v1.PodAffinityTerm) (properties []*affinityTermProperties, err error) {
	if terms == nil {
		return properties, nil
	}

	for _, term := range terms {
		namespaces := priorityutil.GetNamespacesFromPodAffinityTerm(pod, &term)
		// Build a labels.Selector from the affinity term
		selector, err := metav1.LabelSelectorAsSelector(term.LabelSelector) 
		if err != nil {
			return nil, err
		}
		// Append the namespaces and selector
		properties = append(properties, &affinityTermProperties{namespaces: namespaces, selector: selector})
	}
	return properties, nil
}
// Returns the set of namespaces (if the term specifies none, the namespace of the pod being scheduled is used).
func GetNamespacesFromPodAffinityTerm(pod *v1.Pod, podAffinityTerm *v1.PodAffinityTerm) sets.String {
	names := sets.String{}
	if len(podAffinityTerm.Namespaces) == 0 {
		names.Insert(pod.Namespace)
	} else {
		names.Insert(podAffinityTerm.Namespaces...)
	}
	return names
}

// ② Iterate over every {namespaces, selector} entry in properties and call PodMatchesTermsNamespaceAndSelector() on each.
func podMatchesAllAffinityTermProperties(pod *v1.Pod, properties []*affinityTermProperties) bool {
	if len(properties) == 0 {
		return false
	}
	for _, property := range properties {
		if !priorityutil.PodMatchesTermsNamespaceAndSelector(pod, property.namespaces, property.selector) {
			return false
		}
	}
	return true
}
//  Checks namespace membership and whether the labels match the selector.
//  - If pod.Namespace is not in the given namespace set, return false; otherwise go on to the label match.
//  - If pod.Labels do not match the selector, return false; otherwise return true.
func PodMatchesTermsNamespaceAndSelector(pod *v1.Pod, namespaces sets.String, selector labels.Selector) bool {
	if !namespaces.Has(pod.Namespace) {
		return false
	}
	if !selector.Matches(labels.Set(pod.Labels)) {
		return false
	}
	return true
}
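
To make the two-step check above concrete, here is a minimal sketch that does not import Kubernetes. The types `pod` and `termProps` and the equality-only label match are assumptions made for illustration; the real code works on `v1.Pod` and a full `labels.Selector`.

```go
package main

import "fmt"

// pod is a stripped-down, hypothetical stand-in for v1.Pod.
type pod struct {
	Namespace string
	Labels    map[string]string
}

// termProps plays the role of affinityTermProperties: a namespace set plus
// an equality-only selector (an assumption; the real selector is richer).
type termProps struct {
	Namespaces  map[string]bool
	MatchLabels map[string]string
}

// matchesTerm checks the namespace first, then the labels.
func matchesTerm(p pod, t termProps) bool {
	if !t.Namespaces[p.Namespace] {
		return false
	}
	for k, want := range t.MatchLabels {
		if got, ok := p.Labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// matchesAllTerms returns false for an empty list and requires every term to match.
func matchesAllTerms(p pod, terms []termProps) bool {
	if len(terms) == 0 {
		return false
	}
	for _, t := range terms {
		if !matchesTerm(p, t) {
			return false
		}
	}
	return true
}

func main() {
	terms := []termProps{{
		Namespaces:  map[string]bool{"default": true},
		MatchLabels: map[string]string{"app": "web"},
	}}
	web := pod{Namespace: "default", Labels: map[string]string{"app": "web", "tier": "fe"}}
	other := pod{Namespace: "kube-system", Labels: map[string]string{"app": "web"}}
	fmt.Println(matchesAllTerms(web, terms), matchesAllTerms(other, terms), matchesAllTerms(web, nil))
}
```

Note that a pod with the right labels in the wrong namespace is rejected before any label comparison happens.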

GetPodAntiAffinityTerms()
Returns all required (hard) terms from the pod's anti-affinity configuration.

!FILENAME pkg/scheduler/algorithm/predicates/predicates.go:1231

func GetPodAntiAffinityTerms(podAntiAffinity *v1.PodAntiAffinity) (terms []v1.PodAffinityTerm) {
	if podAntiAffinity != nil {
		if len(podAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution) != 0 {
			terms = podAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution
		}
	}
	return terms
}

nodeMatchesAnyTopologyTerm()
Checks whether the target Node has a topology value that matches any of the anti-affinity terms.

!FILENAME pkg/scheduler/algorithm/predicates/predicates.go:1353

//  Returns true as soon as the Node matches the TopologyKey of any one anti-affinity term.
//  The logic mirrors nodeMatchesAllTopologyTerms(), except that a single match is enough.
func (c *PodAffinityChecker) nodeMatchesAnyTopologyTerm(pod *v1.Pod, topologyPairs *topologyPairsMaps, nodeInfo *schedulercache.NodeInfo, terms []v1.PodAffinityTerm) bool {
	node := nodeInfo.Node()
	for _, term := range terms {
		if topologyValue, ok := node.Labels[term.TopologyKey]; ok {
			pair := topologyPair{key: term.TopologyKey, value: topologyValue}
			if _, ok := topologyPairs.topologyPairToPods[pair]; ok {
				return true // one matching term is enough
			}
		}
	}
	return false
}
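
The "any term" lookup above can be sketched without the scheduler types. In the sketch below the set of topology pairs is a plain map and the terms are reduced to their topology keys; both simplifications are assumptions for illustration only.

```go
package main

import "fmt"

// topologyPair is a (label key, label value) pair, as in the scheduler code.
type topologyPair struct {
	key   string
	value string
}

// nodeMatchesAnyTerm reports whether the node carries, for at least one of the
// given topology keys, a value that already appears in pairs.
func nodeMatchesAnyTerm(nodeLabels map[string]string, pairs map[topologyPair]bool, topologyKeys []string) bool {
	for _, key := range topologyKeys {
		if value, ok := nodeLabels[key]; ok {
			if pairs[topologyPair{key: key, value: value}] {
				return true
			}
		}
	}
	return false
}

func main() {
	// Pods that match the anti-affinity terms already run in zone z1.
	pairs := map[topologyPair]bool{
		topologyPair{key: "zone", value: "z1"}: true,
	}
	inZ1 := map[string]string{"zone": "z1"}
	inZ2 := map[string]string{"zone": "z2"}
	fmt.Println(nodeMatchesAnyTerm(inZ1, pairs, []string{"zone"}))
	fmt.Println(nodeMatchesAnyTerm(inZ2, pairs, []string{"zone"}))
}
```

A node in zone z1 is therefore rejected by the anti-affinity check, while a node in zone z2 passes it.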

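Part II: else {...}

  • If there is no precomputed metadata, get the list of pods that pass the given pod filter.
  • Get all affinity terms. If any exist, match each "target pod" against the namespaces and label selector expressions defined by PodAffinity. On a full match, take the node the target pod runs on and compare its value for the topologyKey (the one specified in the affinity term) with the "candidate Node"'s value for the same topologyKey.
  • Likewise, get all anti-affinity terms. If any exist, match each "target pod" against the namespaces and label selector expressions defined by PodAntiAffinity. On a full match, take the node the target pod runs on and compare its value for the topologyKey (the one specified in the anti-affinity term) with the "candidate Node"'s value for the same topologyKey.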
else { 
  // We don't have precomputed metadata. We have to follow a slow path to check affinity terms.
		filteredPods, err := c.podLister.FilteredList(nodeInfo.Filter, labels.Everything())
		if err != nil {
			return ErrPodAffinityRulesNotMatch, err
		}

		// Get the required terms defined by the affinity and anti-affinity configuration.
		affinityTerms := GetPodAffinityTerms(affinity.PodAffinity)
		antiAffinityTerms := GetPodAntiAffinityTerms(affinity.PodAntiAffinity)
   
		matchFound, termsSelectorMatchFound := false, false
		for _, targetPod := range filteredPods {
			// For each target pod, check all affinity terms.
			if !matchFound && len(affinityTerms) > 0 {
				// podMatchesPodAffinityTerms() matches the target pod against the namespaces and label selector expressions (detailed later).
				affTermsMatch, termsSelectorMatch, err := c.podMatchesPodAffinityTerms(pod, targetPod, nodeInfo, affinityTerms)
				if err != nil {
					errMessage := fmt.Sprintf("Cannot schedule pod %+v onto node %v, because of PodAffinity, err: %v", podName(pod), node.Name, err)
					klog.Error(errMessage)
					return ErrPodAffinityRulesNotMatch, errors.New(errMessage)
				}
				if termsSelectorMatch {
					termsSelectorMatchFound = true
				}
				if affTermsMatch {
					matchFound = true
				}
			}

			// As above, check all anti-affinity terms against the target pod.
			if len(antiAffinityTerms) > 0 {
				antiAffTermsMatch, _, err := c.podMatchesPodAffinityTerms(pod, targetPod, nodeInfo, antiAffinityTerms)
				if err != nil || antiAffTermsMatch {
					klog.V(10).Infof("Cannot schedule pod %+v onto node %v, because of PodAntiAffinityTerm, err: %v",
						podName(pod), node.Name, err)
					return ErrPodAntiAffinityRulesNotMatch, nil
				}
			}
		}

		if !matchFound && len(affinityTerms) > 0 {
			if termsSelectorMatchFound {
				klog.V(10).Infof("Cannot schedule pod %+v onto node %v, because of PodAffinity",
					podName(pod), node.Name)
				return ErrPodAffinityRulesNotMatch, nil
			}
			// Check if pod matches its own affinity properties (namespace and label selector).
			if !targetPodMatchesAffinityOfPod(pod, pod) {
				klog.V(10).Infof("Cannot schedule pod %+v onto node %v, because of PodAffinity",
					podName(pod), node.Name)
				return ErrPodAffinityRulesNotMatch, nil
			}
		}
	}

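The helper functions used inside else{…} are analyzed below.

podMatchesPodAffinityTerms()
Matches the target pod against all namespaces and label selector expressions defined by the affinity terms. On a full match, it looks up the node the target pod runs on and compares that node's value for the topologyKey (the one specified in the affinity term) with the candidate Node's value for the same key.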

!FILENAME pkg/scheduler/algorithm/predicates/predicates.go:1189

func (c *PodAffinityChecker) podMatchesPodAffinityTerms(pod, targetPod *v1.Pod, nodeInfo *schedulercache.NodeInfo, terms []v1.PodAffinityTerm) (bool, bool, error) {
	if len(terms) == 0 {
		return false, false, fmt.Errorf("terms array is empty")
	}
	// Build the list of {namespaces, selector} properties.
	props, err := getAffinityTermProperties(pod, terms)
	if err != nil {
		return false, false, err
	}
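	// Check whether the target pod matches every {namespaces, selector} entry defined by the affinity terms; if not, return false.
	// If it matches, get the node the pod runs on (the "target Node"),
	// then compare the "target Node"'s value for the topologyKey (the one given in the affinity term) with the value on the "candidate Node".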
	if !podMatchesAllAffinityTermProperties(targetPod, props) {
		return false, false, nil
	}
	// Namespace and selector of the terms have matched. Now we check topology of the terms.
	targetPodNode, err := c.info.GetNodeInfo(targetPod.Spec.NodeName)
	if err != nil {
		return false, false, err
	}
	for _, term := range terms {
		if len(term.TopologyKey) == 0 {
			return false, false, fmt.Errorf("empty topologyKey is not allowed except for PreferredDuringScheduling pod anti-affinity")
		}
		if !priorityutil.NodesHaveSameTopologyKey(nodeInfo.Node(), targetPodNode, term.TopologyKey) {
			return false, true, nil
		}
	}
	return true, true, nil
}

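priorityutil.NodesHaveSameTopologyKey()
This is the code that actually performs the topologyKey comparison.
It shows that, as far as this comparison is concerned, the topologyKey set in a Deployment's YAML can be a custom key: the function simply looks the key up in both nodes' labels and compares the values.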

!FILENAME pkg/scheduler/algorithm/priorities/util/topologies.go:53

// Reports whether both nodes carry the same value for topologyKey.
func NodesHaveSameTopologyKey(nodeA, nodeB *v1.Node, topologyKey string) bool {
	if len(topologyKey) == 0 {
		return false
	}

	if nodeA.Labels == nil || nodeB.Labels == nil {
		return false
	}

	nodeALabel, okA := nodeA.Labels[topologyKey]   // value of the node label that serves as the topology key
	nodeBLabel, okB := nodeB.Labels[topologyKey]

	// If found label in both nodes, check the label
	if okB && okA {                                  
		return nodeALabel == nodeBLabel             // compare the two values
	}

	return false
}
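
A small sketch of the same comparison on plain label maps shows why a custom key works here. The label key `example.com/rack` is a made-up example, and `sameTopology` is a hypothetical helper, not a scheduler function.

```go
package main

import "fmt"

// sameTopology mirrors the comparison above on plain label maps:
// both nodes must carry the key, and the values must be equal.
func sameTopology(a, b map[string]string, topologyKey string) bool {
	if topologyKey == "" || a == nil || b == nil {
		return false
	}
	va, okA := a[topologyKey]
	vb, okB := b[topologyKey]
	return okA && okB && va == vb
}

func main() {
	n1 := map[string]string{"kubernetes.io/hostname": "n1", "example.com/rack": "r1"}
	n2 := map[string]string{"kubernetes.io/hostname": "n2", "example.com/rack": "r1"}
	n3 := map[string]string{"kubernetes.io/hostname": "n3"}

	fmt.Println(sameTopology(n1, n2, "example.com/rack"))       // same rack
	fmt.Println(sameTopology(n1, n2, "kubernetes.io/hostname")) // different hosts
	fmt.Println(sameTopology(n1, n3, "example.com/rack"))       // n3 lacks the label
}
```

Nothing in the comparison is tied to the three well-known keys; any label key present on both nodes with equal values counts as the same topology domain.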

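Pod affinity priority policy: InterPodAffinityPriority

Policy description:
All candidate Nodes are processed concurrently. The pods on each node are checked against the affinity and anti-affinity of the pod being scheduled: an affinity match adds to a node's count, an anti-affinity match subtracts from it, and finally a score is computed for every Node.

Applicable PodAffinity configuration items:
PodAffinity.PreferredDuringSchedulingIgnoredDuringExecution
PodAntiAffinity.PreferredDuringSchedulingIgnoredDuringExecution

Priority policy source code analysis:

  1. Policy registration: defaultPriorities() registers a priority policy named "InterPodAffinityPriority".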

!FILENAME pkg/scheduler/algorithmprovider/defaults/defaults.go:145

// k8s.io/kubernetes/pkg/scheduler/algorithmprovider/defaults/defaults.go
func defaultPriorities() sets.String {
  ...
  
	factory.RegisterPriorityConfigFactory(
			"InterPodAffinityPriority",
			factory.PriorityConfigFactory{
				Function: func(args factory.PluginFactoryArgs) algorithm.PriorityFunction {
					return priorities.NewInterPodAffinityPriority(args.NodeInfo, args.NodeLister, args.PodLister, args.HardPodAffinitySymmetricWeight)
				},
				Weight: 1,
			},
		),

  ...
}
  2. Policy Func: interPodAffinity.CalculateInterPodAffinityPriority()
    NewInterPodAffinityPriority() instantiates an InterPodAffinity object and returns its CalculateInterPodAffinityPriority() method as the policy Func.

!FILENAME pkg/scheduler/algorithm/priorities/interpod_affinity.go:45

func NewInterPodAffinityPriority(
	info predicates.NodeInfo,
	nodeLister algorithm.NodeLister,
	podLister algorithm.PodLister,
	hardPodAffinityWeight int32) algorithm.PriorityFunction {
	interPodAffinity := &InterPodAffinity{
		info:                  info,
		nodeLister:            nodeLister,
		podLister:             podLister,
		hardPodAffinityWeight: hardPodAffinityWeight,
	}
	return interPodAffinity.CalculateInterPodAffinityPriority
}

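CalculateInterPodAffinityPriority()
Matches the terms of the pod affinity configuration, processes all target nodes concurrently, and accumulates an affinity weight score for each target node.
Let us first look at its code structure: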

  • processPod := func(existingPod *v1.Pod) error {… pm.processTerms()}
  • processNode := func(i int) {…}
  • workqueue.ParallelizeUntil(context.TODO(), 16, len(allNodeNames), processNode)
  • fScore = float64(schedulerapi.MaxPriority) * ((pm.counts[node.Name] - minCount) / (maxCount - minCount))

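Understanding this code requires a few definitions:
pod: the "pod to be scheduled"
hasAffinityConstraints: whether the "pod to be scheduled" defines affinity
hasAntiAffinityConstraints: whether the "pod to be scheduled" defines anti-affinity
existingPod: an "affinity target pod" being processed
existingPodNode: the node running this "affinity target pod", i.e. the "target Node"
existingHasAffinityConstraints: whether the "affinity target pod" has affinity constraints
existingHasAntiAffinityConstraints: whether the "affinity target pod" has anti-affinity constraints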

!FILENAME pkg/scheduler/algorithm/priorities/interpod_affinity.go:119

func (ipa *InterPodAffinity) CalculateInterPodAffinityPriority(pod *v1.Pod, nodeNameToInfo map[string]*schedulercache.NodeInfo, nodes []*v1.Node) (schedulerapi.HostPriorityList, error) {
	affinity := pod.Spec.Affinity
	// Whether the "pod to be scheduled" has affinity / anti-affinity constraints.
	hasAffinityConstraints := affinity != nil && affinity.PodAffinity != nil
	hasAntiAffinityConstraints := affinity != nil && affinity.PodAntiAffinity != nil

	allNodeNames := make([]string, 0, len(nodeNameToInfo))
	for name := range nodeNameToInfo {
		allNodeNames = append(allNodeNames, name)
	}
	var maxCount float64
	var minCount float64

	pm := newPodAffinityPriorityMap(nodes)
  
	// processPod() holds the main logic that accumulates affinity and anti-affinity weights.   ②
	// It calls the term-processing method processTerms().
	processPod := func(existingPod *v1.Pod) error {     
		...
		// Affinity check logic.                                                   ①
       pm.processTerms(terms, pod, existingPod, existingPodNode, 1)
    ...
	}
	// processNode() decides, based on whether affinity is configured, which pods to pass to processPod().   ③
	processNode := func(i int) {  
		    ...
					if err := processPod(existingPod); err != nil {
						pm.setError(err)
					}
        ...
	}
	// Run processNode() concurrently with 16 workers.
	workqueue.ParallelizeUntil(context.TODO(), 16, len(allNodeNames), processNode)
   
  ...
	for _, node := range nodes {
		if pm.counts[node.Name] > maxCount {
			maxCount = pm.counts[node.Name]
		}
		if pm.counts[node.Name] < minCount {
			minCount = pm.counts[node.Name]
		}
	}
	result := make(schedulerapi.HostPriorityList, 0, len(nodes))
	for _, node := range nodes {
		fScore := float64(0)
		if (maxCount - minCount) > 0 {           // reduce step: compute fScore     ④
			fScore = float64(schedulerapi.MaxPriority) * ((pm.counts[node.Name] - minCount) / (maxCount - minCount))
		}
		result = append(result, schedulerapi.HostPriority{
			Host:  node.Name,
			Score: int(fScore),
		})
	}
	return result, nil
}

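processTerms()
Given a pod and its affinity configuration (podAffinityTerm), a pod to check, and the node running that pod, examine every candidate node and accumulate weight for each node according to the result.
The flow is:

  1. Check whether the namespace of the "pod to check" is among the namespaces of the term (or equals the namespace of the "given pod" if the term lists none);

  2. Check whether the labels of the "pod to check" match the label selector of the "given pod"'s podAffinityTerm;

  3. If both conditions hold, iterate over all candidate nodes and compare each one's TopologyKey value with that of the node running the "pod to check"; when they are equal, add the weight.

    How to read this:

    Steps 1 and 2 find the pods, within the selected namespaces, that satisfy the podAffinityTerm configured on the pod;

    Step 3 takes the topologyKey value and matches it against the candidate nodes.

    This makes the inner meaning and logic of pod affinity matching clear.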

!FILENAME pkg/scheduler/algorithm/priorities/interpod_affinity.go:107

func (p *podAffinityPriorityMap) processTerms(terms []v1.WeightedPodAffinityTerm, podDefiningAffinityTerm, podToCheck *v1.Pod, fixedNode *v1.Node, multiplier int) {
	for i := range terms {
		term := &terms[i]
		p.processTerm(&term.PodAffinityTerm, podDefiningAffinityTerm, podToCheck, fixedNode, float64(term.Weight*int32(multiplier)))
	}
}

func (p *podAffinityPriorityMap) processTerm(term *v1.PodAffinityTerm, podDefiningAffinityTerm, podToCheck *v1.Pod, fixedNode *v1.Node, weight float64) {
	// Get the namespaces (affinityTerm.Namespaces, or pod.Namespace if none are given).
	// Build a selector object from the podAffinityTerm (see the labelSelector section at the start of this article).
	namespaces := priorityutil.GetNamespacesFromPodAffinityTerm(podDefiningAffinityTerm, term)
	selector, err := metav1.LabelSelectorAsSelector(term.LabelSelector) // label selector
	if err != nil {
		p.setError(err)
		return
	}
	// Check whether the namespace and labels of the "pod to check" match.
	match := priorityutil.PodMatchesTermsNamespaceAndSelector(podToCheck, namespaces, selector)
	if match {
		func() {
			p.Lock()
			defer p.Unlock()
			for _, node := range p.nodes {
				// Compare the TopologyKey value of the node running the checked pod with that of every node under consideration; if equal, add the weight.
				if priorityutil.NodesHaveSameTopologyKey(node, fixedNode, term.TopologyKey) {
					p.counts[node.Name] += weight
				}
			}
		}()
	}
}
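
The weight accumulation in processTerm() can be tried out with a minimal sketch. The `node` type, the `addWeight` helper, and the weights 50 and 20 are assumptions chosen for illustration; the real code also takes a lock because nodes are processed in parallel.

```go
package main

import "fmt"

// node is a hypothetical stand-in for v1.Node.
type node struct {
	Name   string
	Labels map[string]string
}

// addWeight adds weight to every candidate node that shares the topology
// value of fixed, the node running the pod that matched the term.
func addWeight(counts map[string]float64, candidates []node, fixed node, topologyKey string, weight float64) {
	fixedValue, ok := fixed.Labels[topologyKey]
	if !ok {
		return
	}
	for _, n := range candidates {
		if v, ok := n.Labels[topologyKey]; ok && v == fixedValue {
			counts[n.Name] += weight
		}
	}
}

func main() {
	a := node{Name: "a", Labels: map[string]string{"zone": "z1"}}
	b := node{Name: "b", Labels: map[string]string{"zone": "z1"}}
	c := node{Name: "c", Labels: map[string]string{"zone": "z2"}}
	candidates := []node{a, b, c}
	counts := map[string]float64{}

	// A pod on node a matched a preferred affinity term with weight 50.
	addWeight(counts, candidates, a, "zone", 50)
	// A pod on node c matched a preferred anti-affinity term with weight 20
	// (multiplier -1).
	addWeight(counts, candidates, c, "zone", -20)

	fmt.Println(counts["a"], counts["b"], counts["c"])
}
```

The point to notice is that node b gains weight although the matching pod runs on node a: the weight goes to every node in the same topology domain.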

GetNamespacesFromPodAffinityTerm()
Returns the set of namespaces (if the term specifies none, the namespace of the pod that defines the term is used).

!FILENAME pkg/scheduler/algorithm/priorities/util/topologies.go:28

func GetNamespacesFromPodAffinityTerm(pod *v1.Pod, podAffinityTerm *v1.PodAffinityTerm) sets.String {
	names := sets.String{}
	if len(podAffinityTerm.Namespaces) == 0 {
		names.Insert(pod.Namespace)
	} else {
		names.Insert(podAffinityTerm.Namespaces...)
	}
	return names
}

PodMatchesTermsNamespaceAndSelector()
Checks the namespace and whether the pod's labels match the labels.Selector.

!FILENAME pkg/scheduler/algorithm/priorities/util/topologies.go:40

func PodMatchesTermsNamespaceAndSelector(pod *v1.Pod, namespaces sets.String, selector labels.Selector) bool {
	if !namespaces.Has(pod.Namespace) {
		return false
	}

	if !selector.Matches(labels.Set(pod.Labels)) {
		return false
	}
	return true
}

② processPod(): the layer that handles affinity and anti-affinity; it calls processTerms() to perform the checks and accumulate the weights.

!FILENAME pkg/scheduler/algorithm/priorities/interpod_affinity.go:136

	processPod := func(existingPod *v1.Pod) error {
		existingPodNode, err := ipa.info.GetNodeInfo(existingPod.Spec.NodeName)
		if err != nil {
			if apierrors.IsNotFound(err) {
				klog.Errorf("Node not found, %v", existingPod.Spec.NodeName)
				return nil
			}
			return err
		}
		existingPodAffinity := existingPod.Spec.Affinity
		existingHasAffinityConstraints := existingPodAffinity != nil && existingPodAffinity.PodAffinity != nil
		existingHasAntiAffinityConstraints := existingPodAffinity != nil && existingPodAffinity.PodAntiAffinity != nil
		// If the "pod to be scheduled" has affinity constraints, run processTerms() against the "affinity target pod" and its node; on a match the term weight is added (multiplier 1).
		if hasAffinityConstraints {
			terms := affinity.PodAffinity.PreferredDuringSchedulingIgnoredDuringExecution
			pm.processTerms(terms, pod, existingPod, existingPodNode, 1)
		}
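		// If the "pod to be scheduled" has anti-affinity constraints, run processTerms() against the "affinity target pod" and its node; on a match the term weight is subtracted (multiplier -1).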
		if hasAntiAffinityConstraints {
			terms := affinity.PodAntiAffinity.PreferredDuringSchedulingIgnoredDuringExecution
			pm.processTerms(terms, pod, existingPod, existingPodNode, -1)
		}
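		// If the "affinity target pod" has affinity constraints, run the check in the reverse direction against the "pod to be scheduled". Its required terms add hardPodAffinityWeight on a match; its preferred terms add the term weight (multiplier 1).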
		if existingHasAffinityConstraints {
			if ipa.hardPodAffinityWeight > 0 {
				terms := existingPodAffinity.PodAffinity.RequiredDuringSchedulingIgnoredDuringExecution
				for _, term := range terms {
					pm.processTerm(&term, existingPod, pod, existingPodNode, float64(ipa.hardPodAffinityWeight))
				}
			}
			terms := existingPodAffinity.PodAffinity.PreferredDuringSchedulingIgnoredDuringExecution
			pm.processTerms(terms, existingPod, pod, existingPodNode, 1)
		}
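		// If the "affinity target pod" has anti-affinity constraints, run the check in the reverse direction against the "pod to be scheduled"; on a match the term weight is subtracted (multiplier -1).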
		if existingHasAntiAffinityConstraints {
			terms := existingPodAffinity.PodAntiAffinity.PreferredDuringSchedulingIgnoredDuringExecution
			pm.processTerms(terms, existingPod, pod, existingPodNode, -1)
		}
		return nil
	}

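③ processNode(): if the "pod to be scheduled" defines affinity or anti-affinity, every pod on the node is processed; otherwise only the pods on the node that themselves define affinity are processed.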

!FILENAME pkg/scheduler/algorithm/priorities/interpod_affinity.go:193

	processNode := func(i int) {
		nodeInfo := nodeNameToInfo[allNodeNames[i]]
		if nodeInfo.Node() != nil {
			if hasAffinityConstraints || hasAntiAffinityConstraints {
				// We need to process all the nodes.
				for _, existingPod := range nodeInfo.Pods() {
					if err := processPod(existingPod); err != nil {
						pm.setError(err)
					}
				}
			} else {
				for _, existingPod := range nodeInfo.PodsWithAffinity() {
					if err := processPod(existingPod); err != nil {
						pm.setError(err)
					}
				}
			}
		}
	}

④ The final score (fScore) formula:

// 10 * (accumulated weight of the node - minimum accumulated weight) / (maximum accumulated weight - minimum accumulated weight)
fScore = float64(schedulerapi.MaxPriority) * ((pm.counts[node.Name] - minCount) / (maxCount - minCount))

const (
	// MaxPriority defines the max priority value.
	MaxPriority = 10
)
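
The normalization can be checked with a few numbers. This is a minimal sketch on plain maps; the function name `normalize` and the sample counts are assumptions. As in the source, maxCount and minCount both start at 0, so when every count is positive the minimum stays 0.

```go
package main

import "fmt"

const maxPriority = 10 // same value as schedulerapi.MaxPriority above

// normalize maps the accumulated weights onto the range 0..maxPriority.
func normalize(counts map[string]float64, names []string) map[string]int {
	var maxCount, minCount float64 // both start at 0, as in the source
	for _, n := range names {
		if counts[n] > maxCount {
			maxCount = counts[n]
		}
		if counts[n] < minCount {
			minCount = counts[n]
		}
	}
	result := make(map[string]int, len(names))
	for _, n := range names {
		fScore := float64(0)
		if maxCount-minCount > 0 {
			fScore = float64(maxPriority) * ((counts[n] - minCount) / (maxCount - minCount))
		}
		result[n] = int(fScore) // truncates toward zero
	}
	return result
}

func main() {
	names := []string{"a", "b", "c"}
	scores := normalize(map[string]float64{"a": 50, "b": 0, "c": -20}, names)
	fmt.Println(scores["a"], scores["b"], scores["c"])
}
```

With counts 50, 0 and -20 the range is 70, so node a receives the full 10, node c receives 0, and node b receives int(10 * 20/70) = 2.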

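Service affinity

The default scheduler code does not register this predicate; only the implementation exists. I could not find usage examples even through Google or Baidu, so the configuration usage is not covered here; only the source code is analyzed in detail below.

Translation of the use-case comment in the code:
If the first Pod of a service is scheduled onto Nodes labeled "region=foo", all subsequent Pods of that service are scheduled onto Nodes labeled "region=foo" as well.

Service affinity predicate: checkServiceAffinity

NewServiceAffinityPredicate() creates a ServiceAffinity object and returns the two Funcs the predicate needs:

  • affinity.checkServiceAffinity: based on the predicate metadata, checks whether the Node satisfies service affinity for the pod being scheduled.

  • affinity.serviceAffinityMetadataProducer: based on the pod in the predicate metadata, gets the services and the list of pods in the same namespace, for use during the affinity check.

Both Funcs are described in detail below.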

!FILENAME pkg/scheduler/algorithm/predicates/predicates.go:955

func NewServiceAffinityPredicate(podLister algorithm.PodLister, serviceLister algorithm.ServiceLister, nodeInfo NodeInfo, labels []string) (algorithm.FitPredicate, PredicateMetadataProducer) {
	affinity := &ServiceAffinity{
		podLister:     podLister,
		serviceLister: serviceLister,
		nodeInfo:      nodeInfo,
		labels:        labels,
	}
	return affinity.checkServiceAffinity, affinity.serviceAffinityMetadataProducer
}

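affinity.serviceAffinityMetadataProducer()
Input: predicateMetadata
Output: services and pods (stored on the metadata)

  1. Look up the services for the pod in the predicate metadata.
  2. Get all pods that match the labels of the pod in the predicate metadata, then keep only those in the same namespace.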

!FILENAME pkg/scheduler/algorithm/predicates/predicates.go:934

func (s *ServiceAffinity) serviceAffinityMetadataProducer(pm *predicateMetadata) {
	if pm.pod == nil {
		klog.Errorf("Cannot precompute service affinity, a pod is required to calculate service affinity.")
		return
	}
	pm.serviceAffinityInUse = true
	var errSvc, errList error
	// 1. Look up the services for the pod in the predicate metadata.
	pm.serviceAffinityMatchingPodServices, errSvc = s.serviceLister.GetPodServices(pm.pod)
	// 2. Get all pods that match the labels of the pod in the predicate metadata.
	selector := CreateSelectorFromLabels(pm.pod.Labels)
	allMatches, errList := s.podLister.List(selector)

	// In the future maybe we will return them as part of the function.
	if errSvc != nil || errList != nil {
		klog.Errorf("Some Error were found while precomputing svc affinity: \nservices:%v , \npods:%v", errSvc, errList)
	}
	// 3. Keep only the pods in the same namespace.
	pm.serviceAffinityMatchingPodList = FilterPodsByNamespace(allMatches, pm.pod.Namespace)
}


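affinity.checkServiceAffinity()
Based on the precomputed metadata, checks whether the Node satisfies service affinity for the pod being scheduled.

The final affinity labels:

Final affinityLabels = (A ∩ B) + (B ∩ C), matched against node.Labels   // ∩ denotes intersection

A: the NodeSelector of the pod being scheduled
B: the service affinity label keys configured for the predicate (s.labels)
C: the labels of the selected affinity target Node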

!FILENAME pkg/scheduler/algorithm/predicates/predicates.go:992

func (s *ServiceAffinity) checkServiceAffinity(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
	var services []*v1.Service
	var pods []*v1.Pod
	if pm, ok := meta.(*predicateMetadata); ok && (pm.serviceAffinityMatchingPodList != nil || pm.serviceAffinityMatchingPodServices != nil) {
		services = pm.serviceAffinityMatchingPodServices
		pods = pm.serviceAffinityMatchingPodList
	} else {
		// Make the predicate resilient in case metadata is missing.
		pm = &predicateMetadata{pod: pod}
		s.serviceAffinityMetadataProducer(pm)
		pods, services = pm.serviceAffinityMatchingPodList, pm.serviceAffinityMatchingPodServices
	}
	// Drop the pods that claim to be on this Node (nodeInfo) but whose podKey is not among the Node's pods.   ①
	filteredPods := nodeInfo.FilterOutPods(pods)
	node := nodeInfo.Node()
	if node == nil {
		return false, nil, fmt.Errorf("node not found")
	}
	// affinityLabels = (A ∩ B)
	// A: the NodeSelector of the pod being scheduled  B: the configured affinity label keys     ②
	affinityLabels := FindLabelsInSet(s.labels, labels.Set(pod.Spec.NodeSelector))
	// Step 1: If we don't have all constraints, introspect nodes to find the missing constraints.
	if len(s.labels) > len(affinityLabels) {
		if len(services) > 0 {
			if len(filteredPods) > 0 {
				// The "selected affinity Node":
				// look up the node that runs the first pod in filteredPods.
				nodeWithAffinityLabels, err := s.nodeInfo.GetNodeInfo(filteredPods[0].Spec.NodeName)
				if err != nil {
					return false, nil, err
				}
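				// Input: the intersection labels, the service affinity label keys, the labels of the selected affinity Node.
				// affinityLabels = affinityLabels + (B ∩ C)
				// B: the service affinity label keys  C: the labels of the selected affinity Node      ③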
				AddUnsetLabelsToMap(affinityLabels, s.labels, labels.Set(nodeWithAffinityLabels.Labels))
			}
		}
	}

	// Final match: affinityLabels against the labels of the node being checked.        ④
	if CreateSelectorFromLabels(affinityLabels).Matches(labels.Set(node.Labels)) {
		return true, nil, nil
	}
	return false, []algorithm.PredicateFailureReason{ErrServiceAffinityViolated}, nil
}

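① FilterOutPods()
Drops the pods that claim to be on this Node (nodeInfo) but whose podKey is not found among the Node's pods.
filteredPods = pods not on this Node + pods on this Node whose podKey is found in the Node's pod list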

!FILENAME pkg/scheduler/cache/node_info.go:656

func (n *NodeInfo) FilterOutPods(pods []*v1.Pod) []*v1.Pod {
	// Get the Node object.
	node := n.Node()
	if node == nil {
		return pods
	}
	filtered := make([]*v1.Pod, 0, len(pods))
	for _, p := range pods {
		// If the (affinity-matched) pod's Spec.NodeName differs from this node's name (the pod is not on this Node), keep it in filtered.
		if p.Spec.NodeName != node.Name {
			filtered = append(filtered, p)
			continue
		}
		// The pod claims to be on this Node: get its podKey (pod.UID),
		// then walk the pods recorded on this Node and compare podKeys;
		// if one is equal, keep the pod in filtered.
		podKey, _ := GetPodKey(p)
		for _, np := range n.Pods() {
			npodkey, _ := GetPodKey(np)
			if npodkey == podKey {
				filtered = append(filtered, p)
				break
			}
		}
	}
	return filtered
}
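
The filtering rule is easier to see on a tiny data set. The sketch below uses a hypothetical `pod` struct with only a UID and a node name, and passes the node's pod list explicitly instead of reading it from a NodeInfo.

```go
package main

import "fmt"

// pod is a hypothetical stand-in for v1.Pod; UID plays the role of the podKey.
type pod struct {
	UID      string
	NodeName string
}

// filterOutPods keeps pods that are not on nodeName, plus pods that claim to
// be on nodeName and are really present in podsOnNode.
func filterOutPods(nodeName string, podsOnNode []pod, pods []pod) []pod {
	filtered := make([]pod, 0, len(pods))
	for _, p := range pods {
		if p.NodeName != nodeName {
			filtered = append(filtered, p)
			continue
		}
		for _, np := range podsOnNode {
			if np.UID == p.UID {
				filtered = append(filtered, p)
				break
			}
		}
	}
	return filtered
}

func main() {
	onNode := []pod{{UID: "u1", NodeName: "n1"}}
	candidates := []pod{
		{UID: "u1", NodeName: "n1"}, // on n1 and known to the node: kept
		{UID: "u2", NodeName: "n1"}, // claims n1 but unknown to the node: dropped
		{UID: "u3", NodeName: "n2"}, // on another node: kept
	}
	for _, p := range filterOutPods("n1", onNode, candidates) {
		fmt.Println(p.UID)
	}
}
```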

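② FindLabelsInSet()
Argument 1: (B) the configured affinity label keys
Argument 2: (A) the NodeSelector of the pod being scheduled
For each affinity label key that is present in the NodeSelector, keep the key and its value: the intersection (A ∩ B).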

!FILENAME pkg/scheduler/algorithm/predicates/utils.go:26

func FindLabelsInSet(labelsToKeep []string, selector labels.Set) map[string]string {
	aL := make(map[string]string)
	for _, l := range labelsToKeep {
		if selector.Has(l) {
			aL[l] = selector.Get(l)
		}
	}
	return aL
}

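③ AddUnsetLabelsToMap()
Argument 1: (N) the intersection labels computed by FindLabelsInSet()
Argument 2: (B) the configured affinity label keys
Argument 3: (C) the labels on the "selected affinity Node"
For each affinity label key that is not yet in N but is present on the "selected affinity Node", store the key and its value in N: (B ∩ C) => N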

!FILENAME pkg/scheduler/algorithm/predicates/utils.go:37

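// Input: the intersection labels, the service affinity label keys, the labels of the selected affinity Node.
// Fills in: the intersection (B ∩ C), where B is the service affinity label keys and C is the labels of the selected affinity Node.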
func AddUnsetLabelsToMap(aL map[string]string, labelsToAdd []string, labelSet labels.Set) {
	for _, l := range labelsToAdd {
		// Already set: do nothing.
		if _, exists := aL[l]; exists {
			continue
		}
		// Otherwise take the value from the intersection C ∩ B.
		if labelSet.Has(l) {
			aL[l] = labelSet.Get(l)
		}
	}
}

④ CreateSelectorFromLabels() returns a labels.Selector object, on which Matches() is then called.

!FILENAME pkg/scheduler/algorithm/predicates/utils.go:62

func CreateSelectorFromLabels(aL map[string]string) labels.Selector {
	if aL == nil || len(aL) == 0 {
		return labels.Everything()
	}
	return labels.Set(aL).AsSelector()
}
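
Putting the helpers together, the rule Final affinityLabels = (A ∩ B) + (B ∩ C) can be walked through with plain maps. This is a minimal sketch: the function names are lower-cased local copies of the logic above, `matchesAll` is a hypothetical replacement for the equality selector, and the label values are invented.

```go
package main

import "fmt"

// findLabelsInSet computes (A ∩ B): configured keys that the NodeSelector sets.
func findLabelsInSet(keys []string, selector map[string]string) map[string]string {
	aL := make(map[string]string)
	for _, k := range keys {
		if v, ok := selector[k]; ok {
			aL[k] = v
		}
	}
	return aL
}

// addUnsetLabels fills the keys still missing in aL from labelSet: (B ∩ C).
func addUnsetLabels(aL map[string]string, keys []string, labelSet map[string]string) {
	for _, k := range keys {
		if _, exists := aL[k]; exists {
			continue
		}
		if v, ok := labelSet[k]; ok {
			aL[k] = v
		}
	}
}

// matchesAll reports whether nodeLabels contains every required pair;
// an empty requirement matches everything.
func matchesAll(required, nodeLabels map[string]string) bool {
	for k, v := range required {
		if got, ok := nodeLabels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

func main() {
	keys := []string{"region", "zone"}                            // B
	nodeSelector := map[string]string{"region": "foo"}            // A
	firstPodNode := map[string]string{"region": "foo", "zone": "z1"} // C

	affinityLabels := findLabelsInSet(keys, nodeSelector)
	addUnsetLabels(affinityLabels, keys, firstPodNode)
	fmt.Println(affinityLabels["region"], affinityLabels["zone"])

	fmt.Println(matchesAll(affinityLabels, map[string]string{"region": "foo", "zone": "z1"}))
	fmt.Println(matchesAll(affinityLabels, map[string]string{"region": "foo", "zone": "z2"}))
}
```

The value for "region" comes from the pod's own NodeSelector, while "zone" is inherited from the node that already runs a pod of the same service, which is what keeps the later pods of a service in the same zone.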

End

Reference links:

gitbook:https://farmer-hutao.github.io/k8s-source-code-analysis/
github:https://hub.fastgit.org/daniel-hutao/k8s-source-code-analysis