Pod Affinity in the Kubernetes (K8s) Container Orchestration System
In the previous post we looked at how the NetworkPolicy resource works in Kubernetes; for a refresher, see https://www.cnblogs.com/qiuhom-1874/p/14227660.html. Today we will talk about pod scheduling strategies.
Kubernetes has a very important component, kube-scheduler. Its job is to watch the API server for pod resources whose nodeName field is empty; an empty nodeName means the pod has not been scheduled yet. For each such pod, kube-scheduler evaluates the cluster's nodes against the properties defined in the pod spec, picks the node best suited to run the pod, fills that node's name into the pod's nodeName field, and writes the pod back to the API server. The API server then notifies the kubelet on the named node; the kubelet reads the pod definition from the API server and, based on the attributes in the manifest, calls the local docker engine to start the pod, then reports the pod's status back to the API server, which persists it in etcd. Throughout this process, kube-scheduler's role is to schedule pods and report the result to the API server. Which raises the question: how does kube-scheduler decide which of the many nodes is the best one to run a given pod?
The scheduler places pods according to its scheduling algorithms, and different algorithms use different criteria and produce different results. When the scheduler finds an unscheduled pod on the API server, it first runs every node in the cluster through the predicate functions and eliminates the nodes that cannot run the pod; this is the predicate phase (Predicate). The surviving nodes move on to the priority phase (Priority), in which each priority function scores every node; the scores from all priority functions are summed per node, and the node with the highest total wins. If several nodes tie for the highest score, the scheduler picks one of them at random; this last step is the selection phase (Select). In short, scheduling goes through three stages: the predicate stage filters out and discards nodes that cannot run the pod; the priority stage scores the remaining nodes and finds the highest scorers; the selection stage picks one node, at random if there is a tie, as the node that will ultimately run the pod. The overall flow is roughly as illustrated in the figure below.
Note: the predicate phase is a one-vote veto: if any single predicate function rejects a node, that node is eliminated immediately. The nodes that pass all predicates enter the priority phase, where each priority function scores every node and the per-node scores are summed. Finally the scheduler picks the node with the highest total as the scheduling result; if several nodes share the highest score, one of them is chosen at random, and the result is reported back to the API server.
Factors that influence scheduling
NodeName: nodeName is the most direct way to influence pod placement. As noted above, the scheduler decides whether a pod still needs scheduling by checking whether its nodeName field is empty; if the user explicitly sets nodeName in the pod manifest, the scheduler skips the pod entirely, because a non-empty nodeName means the pod is already considered scheduled. This is effectively a way to bind a pod to a specific node by hand.
NodeSelector: compared with nodeName, nodeSelector is a looser mechanism, but it is still an important scheduling input. If a pod defines a nodeSelector, only nodes carrying the matching labels are eligible to run the pod; if no node satisfies the selector, the pod stays in the Pending state.
Node Affinity: node affinity expresses a pod's affinity for nodes, that is, which nodes the pod prefers (or prefers not) to run on. Its scheduling logic is finer-grained than nodeName and nodeSelector.
Pod Affinity: pod affinity expresses affinity between pods, that is, which pod or pods a given pod prefers to be co-located with. The opposite, a pod preferring not to be placed with certain pods, is called pod anti-affinity. "Co-located" means being in the same location as the other pod, where a location can be defined per host, per zone, and so on; when expressing that pods should or should not be together, the definition of "location" (the topology) therefore matters a great deal, because it is the yardstick for deciding where the pod may run.
Taints and tolerations: a taint is set on a node, and tolerations describe a pod's tolerance of node taints. A pod that tolerates a node's taints can run on that node; a pod that does not cannot. Scheduling here combines the taints present on the nodes with the pod's tolerations of those taints (a minimal toleration sketch follows below).
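Taints and tolerations are not demonstrated in the examples that follow, so here is a minimal sketch of what a toleration looks like; the taint key node-type and value dedicated are made up purely for illustration and do not exist in the cluster used in this article:

apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod-toleration          # hypothetical name, for illustration only
spec:
  tolerations:
  - key: "node-type"                  # assumed taint key; a matching taint could be set with: kubectl taint node <node> node-type=dedicated:NoSchedule
    operator: "Equal"
    value: "dedicated"
    effect: "NoSchedule"              # tolerates only NoSchedule taints with this key/value
  containers:
  - name: nginx
    image: nginx:1.14-alpine
    imagePullPolicy: IfNotPresent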
Example: scheduling with nodeName
[root@master01 ~]# cat pod-demo.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
spec:
  nodeName: node01.k8s.org
  containers:
  - name: nginx
    image: nginx:1.14-alpine
    imagePullPolicy: IfNotPresent
    ports:
    - name: http
      containerPort: 80
[root@master01 ~]#
Note: nodeName directly pins the pod to a specific node, bypassing the default scheduler entirely; the manifest above runs nginx-pod on node01.k8s.org.
Apply the manifest
[root@master01 ~]# kubectl apply -f pod-demo.yaml
pod/nginx-pod created
[root@master01 ~]# kubectl get pods -o wide
NAME        READY   STATUS    RESTARTS   AGE   IP            NODE             NOMINATED NODE   READINESS GATES
nginx-pod   1/1     Running   0          10s   10.244.1.28   node01.k8s.org   <none>           <none>
[root@master01 ~]#
Note: the pod is indeed running on the node we specified by hand.
Example: scheduling with nodeSelector
[root@master01 ~]# cat pod-demo-nodeselector.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod-nodeselector
spec:
  nodeSelector:
    disktype: ssd
  containers:
  - name: nginx
    image: nginx:1.14-alpine
    imagePullPolicy: IfNotPresent
    ports:
    - name: http
      containerPort: 80
[root@master01 ~]#
Note: nodeSelector matches against node labels; a node carrying the specified label can host the pod, and a node without it cannot. If no node qualifies, the pod stays Pending until some node acquires the required label, at which point it is scheduled there.
Apply the manifest
[root@master01 ~]# kubectl apply -f pod-demo-nodeselector.yaml
pod/nginx-pod-nodeselector created
[root@master01 ~]# kubectl get pods -o wide
NAME                     READY   STATUS    RESTARTS   AGE     IP            NODE             NOMINATED NODE   READINESS GATES
nginx-pod                1/1     Running   0          9m38s   10.244.1.28   node01.k8s.org   <none>           <none>
nginx-pod-nodeselector   0/1     Pending   0          16s     <none>        <none>           <none>           <none>
[root@master01 ~]#
Note: the pod stays in Pending because no node in the cluster carries the label required by the node selector.
Verification: label node02 accordingly and see whether the pod gets scheduled onto node02.
[root@master01 ~]# kubectl get nodes --show-labels
NAME               STATUS   ROLES                  AGE   VERSION   LABELS
master01.k8s.org   Ready    control-plane,master   29d   v1.20.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=master01.k8s.org,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=
node01.k8s.org     Ready    <none>                 29d   v1.20.0   app=nginx-1.14-alpine,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node01.k8s.org,kubernetes.io/os=linux
node02.k8s.org     Ready    <none>                 29d   v1.20.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node02.k8s.org,kubernetes.io/os=linux
node03.k8s.org     Ready    <none>                 29d   v1.20.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node03.k8s.org,kubernetes.io/os=linux
node04.k8s.org     Ready    <none>                 19d   v1.20.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node04.k8s.org,kubernetes.io/os=linux
[root@master01 ~]# kubectl label node node02.k8s.org disktype=ssd
node/node02.k8s.org labeled
[root@master01 ~]# kubectl get nodes --show-labels
NAME               STATUS   ROLES                  AGE   VERSION   LABELS
master01.k8s.org   Ready    control-plane,master   29d   v1.20.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=master01.k8s.org,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=
node01.k8s.org     Ready    <none>                 29d   v1.20.0   app=nginx-1.14-alpine,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node01.k8s.org,kubernetes.io/os=linux
node02.k8s.org     Ready    <none>                 29d   v1.20.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,disktype=ssd,kubernetes.io/arch=amd64,kubernetes.io/hostname=node02.k8s.org,kubernetes.io/os=linux
node03.k8s.org     Ready    <none>                 29d   v1.20.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node03.k8s.org,kubernetes.io/os=linux
node04.k8s.org     Ready    <none>                 19d   v1.20.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node04.k8s.org,kubernetes.io/os=linux
[root@master01 ~]# kubectl get pods -o wide
NAME                     READY   STATUS    RESTARTS   AGE     IP            NODE             NOMINATED NODE   READINESS GATES
nginx-pod                1/1     Running   0          12m     10.244.1.28   node01.k8s.org   <none>           <none>
nginx-pod-nodeselector   1/1     Running   0          3m26s   10.244.2.18   node02.k8s.org   <none>           <none>
[root@master01 ~]#
Note: once node02 is labeled disktype=ssd, the pod is scheduled onto node02.
Example: scheduling with nodeAffinity (under affinity)
[root@master01 ~]# cat pod-demo-affinity-nodeaffinity.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod-nodeaffinity
spec:
  containers:
  - name: nginx
    image: nginx:1.14-alpine
    imagePullPolicy: IfNotPresent
    ports:
    - name: http
      containerPort: 80
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: foo
            operator: Exists
            values: []
        - matchExpressions:
          - key: disktype
            operator: Exists
            values: []
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 10
        preference:
          matchExpressions:
          - key: foo
            operator: Exists
            values: []
      - weight: 2
        preference:
          matchExpressions:
          - key: disktype
            operator: Exists
            values: []
[root@master01 ~]#
Note: nodeAffinity supports two kinds of constraints. The first is a hard constraint, defined with the requiredDuringSchedulingIgnoredDuringExecution field. That field is an object with a single field, nodeSelectorTerms, which is a list; each term can use matchExpressions to write label-matching expressions against node labels (supported operators are In, NotIn, Exists, DoesNotExist, Gt and Lt; Gt and Lt compare the label value numerically, Exists and DoesNotExist test whether the label key is present, and In and NotIn test whether the label value is in a given set), or matchFields to match node fields. A hard constraint means a pod can only be scheduled to a node that satisfies the node label expressions or node field selectors; if no node satisfies them, the pod stays pending. The second kind is a soft constraint, defined with preferredDuringSchedulingIgnoredDuringExecution, which is a list; each entry has a weight, which the scheduler adds to a node's total score when the node matches, and a preference, which defines the matching condition for that soft rule. When the hard constraint is matched by more than one node, the soft constraints act as a tiebreaker: nodes matched by a soft rule get the rule's weight added to their score; if the weights and conditions still do not single out one highest-scoring node, the default mechanism picks one of the top-scoring nodes at random; if they do, that node is the result. In short, when used together the soft constraints help the hard constraint choose among the candidate nodes. When only soft constraints are used, the pod is preferentially scheduled to a node matching the higher-weighted condition; with equal weights the scheduler falls back to its default rules and picks one of the highest-scoring nodes. The manifest above says: as a hard constraint, the pod may only run on a node that has a label with key foo or a label with key disktype; if no node matches, the pod is not scheduled and stays Pending. Among nodes that pass the hard constraint, a node with a foo label gets 10 added to its score and a node with a disktype label gets 2, so under the soft constraints the pod leans toward nodes labeled with key foo. Note also that there is no "node anti-affinity" field; to express anti-affinity toward nodes, use the NotIn or DoesNotExist operators (a minimal sketch follows at the end of this example).
Apply the resource manifest
[root@master01 ~]# kubectl get nodes -L foo,disktype
NAME               STATUS   ROLES                  AGE   VERSION   FOO   DISKTYPE
master01.k8s.org   Ready    control-plane,master   29d   v1.20.0
node01.k8s.org     Ready    <none>                 29d   v1.20.0
node02.k8s.org     Ready    <none>                 29d   v1.20.0         ssd
node03.k8s.org     Ready    <none>                 29d   v1.20.0
node04.k8s.org     Ready    <none>                 19d   v1.20.0
[root@master01 ~]# kubectl apply -f pod-demo-affinity-nodeaffinity.yaml
pod/nginx-pod-nodeaffinity created
[root@master01 ~]# kubectl get pods -o wide
NAME                     READY   STATUS    RESTARTS   AGE    IP            NODE             NOMINATED NODE   READINESS GATES
nginx-pod                1/1     Running   0          122m   10.244.1.28   node01.k8s.org   <none>           <none>
nginx-pod-nodeaffinity   1/1     Running   0          7s     10.244.2.22   node02.k8s.org   <none>           <none>
nginx-pod-nodeselector   1/1     Running   0          113m   10.244.2.18   node02.k8s.org   <none>           <none>
[root@master01 ~]#
Note: after applying the manifest the pod is scheduled to node02, because node02 carries a label with key disktype, which satisfies the pod's hard constraint.
Verification: delete the pod and the disktype label on node02, re-apply the manifest, and see how the pod is scheduled.
[root@master01 ~]# kubectl delete -f pod-demo-affinity-nodeaffinity.yaml
pod "nginx-pod-nodeaffinity" deleted
[root@master01 ~]# kubectl label node node02.k8s.org disktype-
node/node02.k8s.org labeled
[root@master01 ~]# kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
nginx-pod                1/1     Running   0          127m
nginx-pod-nodeselector   1/1     Running   0          118m
[root@master01 ~]# kubectl get node -L foo,disktype
NAME               STATUS   ROLES                  AGE   VERSION   FOO   DISKTYPE
master01.k8s.org   Ready    control-plane,master   29d   v1.20.0
node01.k8s.org     Ready    <none>                 29d   v1.20.0
node02.k8s.org     Ready    <none>                 29d   v1.20.0
node03.k8s.org     Ready    <none>                 29d   v1.20.0
node04.k8s.org     Ready    <none>                 19d   v1.20.0
[root@master01 ~]# kubectl apply -f pod-demo-affinity-nodeaffinity.yaml
pod/nginx-pod-nodeaffinity created
[root@master01 ~]# kubectl get pods -o wide
NAME                     READY   STATUS    RESTARTS   AGE    IP            NODE             NOMINATED NODE   READINESS GATES
nginx-pod                1/1     Running   0          128m   10.244.1.28   node01.k8s.org   <none>           <none>
nginx-pod-nodeaffinity   0/1     Pending   0          9s     <none>        <none>           <none>           <none>
nginx-pod-nodeselector   1/1     Running   0          118m   10.244.2.18   node02.k8s.org   <none>           <none>
[root@master01 ~]#
Note: after the old pod and the label on node02 are removed and the manifest is re-applied, the pod stays Pending, because no node in the cluster satisfies the pod's hard constraint, so it cannot be scheduled.
Verification: delete the pod, label node01 with key foo and node03 with key disktype, re-apply the manifest, and see how the pod is scheduled.
[root@master01 ~]# kubectl delete -f pod-demo-affinity-nodeaffinity.yaml
pod "nginx-pod-nodeaffinity" deleted
[root@master01 ~]# kubectl label node node01.k8s.org foo=bar
node/node01.k8s.org labeled
[root@master01 ~]# kubectl label node node03.k8s.org disktype=ssd
node/node03.k8s.org labeled
[root@master01 ~]# kubectl get nodes -L foo,disktype
NAME               STATUS   ROLES                  AGE   VERSION   FOO   DISKTYPE
master01.k8s.org   Ready    control-plane,master   29d   v1.20.0
node01.k8s.org     Ready    <none>                 29d   v1.20.0   bar
node02.k8s.org     Ready    <none>                 29d   v1.20.0
node03.k8s.org     Ready    <none>                 29d   v1.20.0         ssd
node04.k8s.org     Ready    <none>                 19d   v1.20.0
[root@master01 ~]# kubectl apply -f pod-demo-affinity-nodeaffinity.yaml
pod/nginx-pod-nodeaffinity created
[root@master01 ~]# kubectl get pods -o wide
NAME                     READY   STATUS    RESTARTS   AGE    IP            NODE             NOMINATED NODE   READINESS GATES
nginx-pod                1/1     Running   0          132m   10.244.1.28   node01.k8s.org   <none>           <none>
nginx-pod-nodeaffinity   1/1     Running   0          5s     10.244.1.29   node01.k8s.org   <none>           <none>
nginx-pod-nodeselector   1/1     Running   0          123m   10.244.2.18   node02.k8s.org   <none>           <none>
[root@master01 ~]#
Note: when the hard constraint is matched by more than one node, the pod is preferentially scheduled to the node matching the higher-weighted soft constraint; in other words, when the hard constraint alone cannot decide, the soft rule with the larger weight wins.
Verification: remove the label from node01 and see whether the pod is evicted or rescheduled to another node.
[root@master01 ~]# kubectl get nodes -L foo,disktype
NAME               STATUS   ROLES                  AGE   VERSION   FOO   DISKTYPE
master01.k8s.org   Ready    control-plane,master   29d   v1.20.0
node01.k8s.org     Ready    <none>                 29d   v1.20.0   bar
node02.k8s.org     Ready    <none>                 29d   v1.20.0
node03.k8s.org     Ready    <none>                 29d   v1.20.0         ssd
node04.k8s.org     Ready    <none>                 19d   v1.20.0
[root@master01 ~]# kubectl label node node01.k8s.org foo-
node/node01.k8s.org labeled
[root@master01 ~]# kubectl get nodes -L foo,disktype
NAME               STATUS   ROLES                  AGE   VERSION   FOO   DISKTYPE
master01.k8s.org   Ready    control-plane,master   29d   v1.20.0
node01.k8s.org     Ready    <none>                 29d   v1.20.0
node02.k8s.org     Ready    <none>                 29d   v1.20.0
node03.k8s.org     Ready    <none>                 29d   v1.20.0         ssd
node04.k8s.org     Ready    <none>                 19d   v1.20.0
[root@master01 ~]# kubectl get pods -o wide
NAME                     READY   STATUS    RESTARTS   AGE    IP            NODE             NOMINATED NODE   READINESS GATES
nginx-pod                1/1     Running   0          145m   10.244.1.28   node01.k8s.org   <none>           <none>
nginx-pod-nodeaffinity   1/1     Running   0          12m    10.244.1.29   node01.k8s.org   <none>           <none>
nginx-pod-nodeselector   1/1     Running   0          135m   10.244.2.18   node02.k8s.org   <none>           <none>
[root@master01 ~]#
Note: once the pod is running, it is neither evicted nor rescheduled even if the node later stops satisfying the pod's hard constraint. Node affinity only takes effect at scheduling time; after scheduling has completed, the pod is not removed or rescheduled when the node no longer matches. In other words, nodeAffinity cannot undo a placement decision that has already been made (that is what the "IgnoredDuringExecution" part of the field name refers to).
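As mentioned earlier, nodeAffinity has no dedicated anti-affinity field. Here is a minimal sketch (not one of the manifests used in this article; the pod name is made up for illustration) of using the NotIn operator to keep a pod off nodes labeled disktype=ssd:

apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod-node-antiaffinity    # hypothetical name, for illustration only
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: NotIn            # matches nodes whose disktype label is absent or not "ssd"
            values: ["ssd"]
  containers:
  - name: nginx
    image: nginx:1.14-alpine
    imagePullPolicy: IfNotPresent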
How node affinity rules take effect
1. When nodeAffinity and nodeSelector are used together, the relationship between them is AND: a node must satisfy both before it is eligible to run the pod.
Example: defining a pod scheduling policy with both nodeAffinity and nodeSelector
[root@master01 ~]# cat pod-demo-affinity-nodesector.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod-nodeaffinity-nodeselector
spec:
  containers:
  - name: nginx
    image: nginx:1.14-alpine
    imagePullPolicy: IfNotPresent
    ports:
    - name: http
      containerPort: 80
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: foo
            operator: Exists
            values: []
  nodeSelector:
    disktype: ssd
[root@master01 ~]#
Note: this manifest means the pod may only run on a node that both carries a node label with key foo and carries the node label disktype=ssd.
Apply the manifest
[root@master01 ~]# kubectl get nodes -L foo,disktype
NAME               STATUS   ROLES                  AGE   VERSION   FOO   DISKTYPE
master01.k8s.org   Ready    control-plane,master   29d   v1.20.0
node01.k8s.org     Ready    <none>                 29d   v1.20.0
node02.k8s.org     Ready    <none>                 29d   v1.20.0
node03.k8s.org     Ready    <none>                 29d   v1.20.0         ssd
node04.k8s.org     Ready    <none>                 19d   v1.20.0
[root@master01 ~]# kubectl apply -f pod-demo-affinity-nodesector.yaml
pod/nginx-pod-nodeaffinity-nodeselector created
[root@master01 ~]# kubectl get pods -o wide
NAME                                  READY   STATUS    RESTARTS   AGE    IP            NODE             NOMINATED NODE   READINESS GATES
nginx-pod                             1/1     Running   0          168m   10.244.1.28   node01.k8s.org   <none>           <none>
nginx-pod-nodeaffinity                1/1     Running   0          35m    10.244.1.29   node01.k8s.org   <none>           <none>
nginx-pod-nodeaffinity-nodeselector   0/1     Pending   0          7s     <none>        <none>           <none>           <none>
nginx-pod-nodeselector                1/1     Running   0          159m   10.244.2.18   node02.k8s.org   <none>           <none>
[root@master01 ~]#
Note: after creation the pod stays Pending, because no node carries both a label with key foo and the label disktype=ssd, so the pod cannot be scheduled and remains suspended.
2. When a single nodeAffinity specifies multiple nodeSelectorTerms, the terms are ORed: each matchExpressions list defines its own matching condition, and satisfying any one of them is enough.
[root@master01 ~]# cat pod-demo-affinity2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod-nodeaffinity2
spec:
  containers:
  - name: nginx
    image: nginx:1.14-alpine
    imagePullPolicy: IfNotPresent
    ports:
    - name: http
      containerPort: 80
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: foo
            operator: Exists
            values: []
        - matchExpressions:
          - key: disktype
            operator: Exists
            values: []
[root@master01 ~]#
Note: this manifest means the pod may run on nodes that carry a node label with key foo or a node label with key disktype.
Apply the manifest
[root@master01 ~]# kubectl get nodes -L foo,disktype
NAME               STATUS   ROLES                  AGE   VERSION   FOO   DISKTYPE
master01.k8s.org   Ready    control-plane,master   29d   v1.20.0
node01.k8s.org     Ready    <none>                 29d   v1.20.0
node02.k8s.org     Ready    <none>                 29d   v1.20.0
node03.k8s.org     Ready    <none>                 29d   v1.20.0         ssd
node04.k8s.org     Ready    <none>                 19d   v1.20.0
[root@master01 ~]# kubectl apply -f pod-demo-affinity2.yaml
pod/nginx-pod-nodeaffinity2 created
[root@master01 ~]# kubectl get pods -o wide
NAME                                  READY   STATUS    RESTARTS   AGE    IP            NODE             NOMINATED NODE   READINESS GATES
nginx-pod                             1/1     Running   0          179m   10.244.1.28   node01.k8s.org   <none>           <none>
nginx-pod-nodeaffinity                1/1     Running   0          46m    10.244.1.29   node01.k8s.org   <none>           <none>
nginx-pod-nodeaffinity-nodeselector   0/1     Pending   0          10m    <none>        <none>           <none>           <none>
nginx-pod-nodeaffinity2               1/1     Running   0          6s     10.244.3.21   node03.k8s.org   <none>           <none>
nginx-pod-nodeselector                1/1     Running   0          169m   10.244.2.18   node02.k8s.org   <none>           <none>
[root@master01 ~]#
Note: the pod is scheduled to node03, because node03 satisfies the "label key foo or label key disktype" condition (it carries disktype=ssd).
3. Within a single matchExpressions list, multiple expressions are ANDed: each key entry defines its own condition and all of them must be satisfied.
Example: specifying multiple conditions within one matchExpressions
[root@master01 ~]# cat pod-demo-affinity3.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod-nodeaffinity3
spec:
  containers:
  - name: nginx
    image: nginx:1.14-alpine
    imagePullPolicy: IfNotPresent
    ports:
    - name: http
      containerPort: 80
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: foo
            operator: Exists
            values: []
          - key: disktype
            operator: Exists
            values: []
[root@master01 ~]#
Note: this manifest means the pod may only run on a node that carries both a label with key foo and a label with key disktype.
Apply the manifest
[root@master01 ~]# kubectl get nodes -L foo,disktype
NAME               STATUS   ROLES                  AGE   VERSION   FOO   DISKTYPE
master01.k8s.org   Ready    control-plane,master   29d   v1.20.0
node01.k8s.org     Ready    <none>                 29d   v1.20.0
node02.k8s.org     Ready    <none>                 29d   v1.20.0
node03.k8s.org     Ready    <none>                 29d   v1.20.0         ssd
node04.k8s.org     Ready    <none>                 19d   v1.20.0
[root@master01 ~]# kubectl apply -f pod-demo-affinity3.yaml
pod/nginx-pod-nodeaffinity3 created
[root@master01 ~]# kubectl get pods -o wide
NAME                                  READY   STATUS    RESTARTS   AGE     IP            NODE             NOMINATED NODE   READINESS GATES
nginx-pod                             1/1     Running   0          3h8m    10.244.1.28   node01.k8s.org   <none>           <none>
nginx-pod-nodeaffinity                1/1     Running   0          56m     10.244.1.29   node01.k8s.org   <none>           <none>
nginx-pod-nodeaffinity-nodeselector   0/1     Pending   0          20m     <none>        <none>           <none>           <none>
nginx-pod-nodeaffinity2               1/1     Running   0          9m38s   10.244.3.21   node03.k8s.org   <none>           <none>
nginx-pod-nodeaffinity3               0/1     Pending   0          7s      <none>        <none>           <none>           <none>
nginx-pod-nodeselector                1/1     Running   0          179m    10.244.2.18   node02.k8s.org   <none>           <none>
[root@master01 ~]#
Note: after creation the pod stays Pending, because no node carries both a label with key foo and a label with key disktype.
Pod affinity works much the same way as node affinity: it also comes in hard and soft variants with the same logic. If a hard rule is defined, the soft rules only help pick among the nodes that satisfy it, and if no node satisfies the hard rule the pod is left pending; if only soft rules are used, the pod preferentially runs on the node matching the higher-weighted rule, and if no node matches any soft rule the default scheduling rules pick the highest-scoring node.
Example: a hard pod-affinity constraint (podAffinity, required)
[root@master01 ~]# cat require-podaffinity.yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity-1
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - {key: app, operator: In, values: ["nginx"]}
        topologyKey: kubernetes.io/hostname
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
[root@master01 ~]#
Note: this is how a hard podAffinity constraint is written. Pod affinity is defined under spec.affinity with the podAffinity field; requiredDuringSchedulingIgnoredDuringExecution defines the hard constraint and is a list, where labelSelector selects the labels of the pods we want to be co-located with, and topologyKey names the node label key that defines what counts as one "location". The manifest above says the hard constraint for running this myapp pod is that the target node (location = hostname) must already run a pod labeled app=nginx; in other words, myapp runs on whichever node hosts a pod labeled app=nginx, and if no such pod exists this pod also stays Pending.
Apply the manifest
[root@master01 ~]# kubectl get pods -L app -o wide
NAME        READY   STATUS    RESTARTS   AGE     IP            NODE             NOMINATED NODE   READINESS GATES   APP
nginx-pod   1/1     Running   0          8m25s   10.244.4.25   node04.k8s.org   <none>           <none>            nginx
[root@master01 ~]# kubectl apply -f require-podaffinity.yaml
pod/with-pod-affinity-1 created
[root@master01 ~]# kubectl get pods -L app -o wide
NAME                  READY   STATUS    RESTARTS   AGE     IP            NODE             NOMINATED NODE   READINESS GATES   APP
nginx-pod             1/1     Running   0          8m43s   10.244.4.25   node04.k8s.org   <none>           <none>            nginx
with-pod-affinity-1   1/1     Running   0          6s      10.244.4.26   node04.k8s.org   <none>           <none>
[root@master01 ~]#
Note: the pod runs on node04 because that node already hosts a pod labeled app=nginx, which satisfies the hard podAffinity constraint.
Verification: delete both pods, re-apply the manifest, and see whether the pod can still run.
[root@master01 ~]# kubectl delete all --all
pod "nginx-pod" deleted
pod "with-pod-affinity-1" deleted
service "kubernetes" deleted
[root@master01 ~]# kubectl apply -f require-podaffinity.yaml
pod/with-pod-affinity-1 created
[root@master01 ~]# kubectl get pods -o wide
NAME                  READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE   READINESS GATES
with-pod-affinity-1   0/1     Pending   0          8s    <none>   <none>   <none>           <none>
[root@master01 ~]#
Note: the pod stays Pending because no node is running a pod labeled app=nginx, so the hard podAffinity constraint cannot be satisfied.
Example: a soft pod-affinity constraint (podAffinity, preferred)
[root@master01 ~]# cat prefernece-podaffinity.yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity-2
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - {key: app, operator: In, values: ["db"]}
          topologyKey: rack
      - weight: 20
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - {key: app, operator: In, values: ["db"]}
          topologyKey: zone
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
[root@master01 ~]#
Note: a soft podAffinity constraint is defined with preferredDuringSchedulingIgnoredDuringExecution; weight is the score added to a node's total when that node satisfies the rule. The manifest above says: with location defined by the node label key rack, a node whose location already runs a pod labeled app=db gets 80 added to its score; with location defined by the node label key zone, such a node gets 20 added; if no node satisfies either rule, the default scheduling rules apply.
Apply the manifest
[root@master01 ~]# kubectl get node -L rack,zone
NAME               STATUS   ROLES                  AGE   VERSION   RACK   ZONE
master01.k8s.org   Ready    control-plane,master   30d   v1.20.0
node01.k8s.org     Ready    <none>                 30d   v1.20.0
node02.k8s.org     Ready    <none>                 30d   v1.20.0
node03.k8s.org     Ready    <none>                 30d   v1.20.0
node04.k8s.org     Ready    <none>                 20d   v1.20.0
[root@master01 ~]# kubectl get pods -o wide -L app
NAME                  READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE   READINESS GATES   APP
with-pod-affinity-1   0/1     Pending   0          22m   <none>   <none>   <none>           <none>
[root@master01 ~]# kubectl apply -f prefernece-podaffinity.yaml
pod/with-pod-affinity-2 created
[root@master01 ~]# kubectl get pods -o wide -L app
NAME                  READY   STATUS    RESTARTS   AGE   IP            NODE             NOMINATED NODE   READINESS GATES   APP
with-pod-affinity-1   0/1     Pending   0          22m   <none>        <none>           <none>           <none>
with-pod-affinity-2   1/1     Running   0          6s    10.244.4.28   node04.k8s.org   <none>           <none>
[root@master01 ~]#
Note: the pod runs normally and lands on node04. In this case the placement did not come from the soft constraints but from the default scheduling rules, because no node satisfies either soft rule.
Verification: delete the pod, label node01 with a rack label and node03 with a zone label, run the pod again, and see how it is scheduled.
[root@master01 ~]# kubectl delete -f prefernece-podaffinity.yaml
pod "with-pod-affinity-2" deleted
[root@master01 ~]# kubectl label node node01.k8s.org rack=group1
node/node01.k8s.org labeled
[root@master01 ~]# kubectl label node node03.k8s.org zone=group2
node/node03.k8s.org labeled
[root@master01 ~]# kubectl get node -L rack,zone
NAME               STATUS   ROLES                  AGE   VERSION   RACK     ZONE
master01.k8s.org   Ready    control-plane,master   30d   v1.20.0
node01.k8s.org     Ready    <none>                 30d   v1.20.0   group1
node02.k8s.org     Ready    <none>                 30d   v1.20.0
node03.k8s.org     Ready    <none>                 30d   v1.20.0            group2
node04.k8s.org     Ready    <none>                 20d   v1.20.0
[root@master01 ~]# kubectl apply -f prefernece-podaffinity.yaml
pod/with-pod-affinity-2 created
[root@master01 ~]# kubectl get pods -o wide
NAME                  READY   STATUS    RESTARTS   AGE   IP            NODE             NOMINATED NODE   READINESS GATES
with-pod-affinity-1   0/1     Pending   0          27m   <none>        <none>           <none>           <none>
with-pod-affinity-2   1/1     Running   0          9s    10.244.4.29   node04.k8s.org   <none>           <none>
[root@master01 ~]#
Note: the pod is still scheduled to node04, which shows that the topology labels on the nodes do not change the result by themselves; the soft rules also require a pod labeled app=db to be running in that location, and none exists yet.
Verification: delete the pod, create a pod labeled app=db on each of node01 and node03, re-apply the manifest, and see how the pod is scheduled.
[root@master01 ~]# kubectl delete -f prefernece-podaffinity.yaml
pod "with-pod-affinity-2" deleted
[root@master01 ~]# cat pod-demo.yaml
apiVersion: v1
kind: Pod
metadata:
  name: redis-pod1
  labels:
    app: db
spec:
  nodeSelector:
    rack: group1
  containers:
  - name: redis
    image: redis:4-alpine
    imagePullPolicy: IfNotPresent
    ports:
    - name: redis
      containerPort: 6379
---
apiVersion: v1
kind: Pod
metadata:
  name: redis-pod2
  labels:
    app: db
spec:
  nodeSelector:
    zone: group2
  containers:
  - name: redis
    image: redis:4-alpine
    imagePullPolicy: IfNotPresent
    ports:
    - name: redis
      containerPort: 6379
[root@master01 ~]# kubectl apply -f pod-demo.yaml
pod/redis-pod1 created
pod/redis-pod2 created
[root@master01 ~]# kubectl get pods -L app -o wide
NAME                  READY   STATUS    RESTARTS   AGE   IP            NODE             NOMINATED NODE   READINESS GATES   APP
redis-pod1            1/1     Running   0          34s   10.244.1.35   node01.k8s.org   <none>           <none>            db
redis-pod2            1/1     Running   0          34s   10.244.3.24   node03.k8s.org   <none>           <none>            db
with-pod-affinity-1   0/1     Pending   0          34m   <none>        <none>           <none>           <none>
[root@master01 ~]# kubectl apply -f prefernece-podaffinity.yaml
pod/with-pod-affinity-2 created
[root@master01 ~]# kubectl get pods -L app -o wide
NAME                  READY   STATUS    RESTARTS   AGE   IP            NODE             NOMINATED NODE   READINESS GATES   APP
redis-pod1            1/1     Running   0          52s   10.244.1.35   node01.k8s.org   <none>           <none>            db
redis-pod2            1/1     Running   0          52s   10.244.3.24   node03.k8s.org   <none>           <none>            db
with-pod-affinity-1   0/1     Pending   0          35m   <none>        <none>           <none>           <none>
with-pod-affinity-2   1/1     Running   0          9s    10.244.1.36   node01.k8s.org   <none>           <none>
[root@master01 ~]#
Note: the pod runs on node01, because node01 hosts a pod labeled app=db and also carries a node label with key rack, which satisfies the weight-80 soft rule, so the pod leans toward node01.
Example: combining hard and soft podAffinity constraints
[root@master01 ~]# cat require-preference-podaffinity.yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity-3
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - {key: app, operator: In, values: ["db"]}
        topologyKey: kubernetes.io/hostname
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - {key: app, operator: In, values: ["db"]}
          topologyKey: rack
      - weight: 20
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - {key: app, operator: In, values: ["db"]}
          topologyKey: zone
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
[root@master01 ~]#
Note: this manifest says the pod must run on a node that already hosts a pod labeled app=db; if no node does, the pod is left pending. If several nodes satisfy the hard constraint, the soft rules break the tie: a qualifying node that also carries a node label with key rack gets 80 added to its score, and one carrying a node label with key zone gets 20.
Apply the manifest
[root@master01 ~]# kubectl get pods -o wide -L app
NAME                  READY   STATUS    RESTARTS   AGE   IP            NODE             NOMINATED NODE   READINESS GATES   APP
redis-pod1            1/1     Running   0          13m   10.244.1.35   node01.k8s.org   <none>           <none>            db
redis-pod2            1/1     Running   0          13m   10.244.3.24   node03.k8s.org   <none>           <none>            db
with-pod-affinity-1   0/1     Pending   0          48m   <none>        <none>           <none>           <none>
with-pod-affinity-2   1/1     Running   0          13m   10.244.1.36   node01.k8s.org   <none>           <none>
[root@master01 ~]# kubectl apply -f require-preference-podaffinity.yaml
pod/with-pod-affinity-3 created
[root@master01 ~]# kubectl get pods -o wide -L app
NAME                  READY   STATUS    RESTARTS   AGE   IP            NODE             NOMINATED NODE   READINESS GATES   APP
redis-pod1            1/1     Running   0          14m   10.244.1.35   node01.k8s.org   <none>           <none>            db
redis-pod2            1/1     Running   0          14m   10.244.3.24   node03.k8s.org   <none>           <none>            db
with-pod-affinity-1   0/1     Pending   0          48m   <none>        <none>           <none>           <none>
with-pod-affinity-2   1/1     Running   0          13m   10.244.1.36   node01.k8s.org   <none>           <none>
with-pod-affinity-3   1/1     Running   0          6s    10.244.1.37   node01.k8s.org   <none>           <none>
[root@master01 ~]#
Note: the pod is scheduled to node01, because that node satisfies the hard constraint and also matches the highest-weighted soft rule.
Verification: delete the pods above, re-apply the manifest, and see whether the pod still runs.
[root@master01 ~]# kubectl delete all --all
pod "redis-pod1" deleted
pod "redis-pod2" deleted
pod "with-pod-affinity-1" deleted
pod "with-pod-affinity-2" deleted
pod "with-pod-affinity-3" deleted
service "kubernetes" deleted
[root@master01 ~]# kubectl apply -f require-preference-podaffinity.yaml
pod/with-pod-affinity-3 created
[root@master01 ~]# kubectl get pods -o wide
NAME                  READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE   READINESS GATES
with-pod-affinity-3   0/1     Pending   0          5s    <none>   <none>   <none>           <none>
[root@master01 ~]#
Note: the newly created pod is Pending because no node satisfies its hard constraint, so it cannot be scheduled and stays suspended.
Example: scheduling with podAntiAffinity
[root@master01 ~]# cat require-preference-podantiaffinity.yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity-4
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - {key: app, operator: In, values: ["db"]}
        topologyKey: kubernetes.io/hostname
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - {key: app, operator: In, values: ["db"]}
          topologyKey: rack
      - weight: 20
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - {key: app, operator: In, values: ["db"]}
          topologyKey: zone
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
[root@master01 ~]#
Note: podAntiAffinity is written the same way as podAffinity but with the opposite logic: podAffinity places a pod onto the nodes (locations) that satisfy the conditions, while podAntiAffinity keeps a pod away from them. The manifest above says the pod must not run on any node that already hosts a pod labeled app=db; in addition, the soft rules steer it away from nodes that share a rack or a zone (as defined by those node label keys) with a pod labeled app=db. If every node violates the hard rule, the pod can only stay pending; if only soft rules were used, the pod would still be scheduled, just preferring the nodes penalized the least.
Apply the manifest
[root@master01 ~]# kubectl get pods -o wide
NAME                  READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE   READINESS GATES
with-pod-affinity-3   0/1     Pending   0          22m   <none>   <none>   <none>           <none>
[root@master01 ~]# kubectl apply -f require-preference-podantiaffinity.yaml
pod/with-pod-affinity-4 created
[root@master01 ~]# kubectl get pods -o wide
NAME                  READY   STATUS    RESTARTS   AGE   IP            NODE             NOMINATED NODE   READINESS GATES
with-pod-affinity-3   0/1     Pending   0          22m   <none>        <none>           <none>           <none>
with-pod-affinity-4   1/1     Running   0          6s    10.244.4.30   node04.k8s.org   <none>           <none>
[root@master01 ~]# kubectl get node -L rack,zone
NAME               STATUS   ROLES                  AGE   VERSION   RACK     ZONE
master01.k8s.org   Ready    control-plane,master   30d   v1.20.0
node01.k8s.org     Ready    <none>                 30d   v1.20.0   group1
node02.k8s.org     Ready    <none>                 30d   v1.20.0
node03.k8s.org     Ready    <none>                 30d   v1.20.0            group2
node04.k8s.org     Ready    <none>                 20d   v1.20.0
[root@master01 ~]#
Note: the pod is scheduled to node04, because node04 matches none of the conditions above (no app=db pod, no rack or zone label); node02 would have been an equally valid choice.
Verification: delete the pod above, run a pod labeled app=db on each of the four nodes, re-apply the manifest, and see how the pod is scheduled.
[root@master01 ~]# kubectl delete all --all
pod "with-pod-affinity-3" deleted
pod "with-pod-affinity-4" deleted
service "kubernetes" deleted
[root@master01 ~]# cat pod-demo.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: redis-ds
  labels:
    app: db
spec:
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
      - name: redis
        image: redis:4-alpine
        ports:
        - name: redis
          containerPort: 6379
[root@master01 ~]# kubectl apply -f pod-demo.yaml
daemonset.apps/redis-ds created
[root@master01 ~]# kubectl get pods -L app -o wide
NAME             READY   STATUS    RESTARTS   AGE   IP            NODE             NOMINATED NODE   READINESS GATES   APP
redis-ds-4bnmv   1/1     Running   0          44s   10.244.2.26   node02.k8s.org   <none>           <none>            db
redis-ds-c2h77   1/1     Running   0          44s   10.244.1.38   node01.k8s.org   <none>           <none>            db
redis-ds-mbxcd   1/1     Running   0          44s   10.244.4.32   node04.k8s.org   <none>           <none>            db
redis-ds-r2kxv   1/1     Running   0          44s   10.244.3.25   node03.k8s.org   <none>           <none>            db
[root@master01 ~]# kubectl apply -f require-preference-podantiaffinity.yaml
pod/with-pod-affinity-5 created
[root@master01 ~]# kubectl get pods -o wide -L app
NAME                  READY   STATUS    RESTARTS   AGE     IP            NODE             NOMINATED NODE   READINESS GATES   APP
redis-ds-4bnmv        1/1     Running   0          2m29s   10.244.2.26   node02.k8s.org   <none>           <none>            db
redis-ds-c2h77        1/1     Running   0          2m29s   10.244.1.38   node01.k8s.org   <none>           <none>            db
redis-ds-mbxcd        1/1     Running   0          2m29s   10.244.4.32   node04.k8s.org   <none>           <none>            db
redis-ds-r2kxv        1/1     Running   0          2m29s   10.244.3.25   node03.k8s.org   <none>           <none>            db
with-pod-affinity-5   0/1     Pending   0          9s      <none>        <none>           <none>           <none>
[root@master01 ~]#
Note: the pod has nowhere to run and stays Pending, because every node now triggers the hard anti-affinity rule that excludes it.
To sum up the verifications above: whether it is pod-to-node affinity or pod-to-pod affinity, once a hard rule is defined the pod will only ever run on a node that satisfies it, and if no node does, the pod stays pending. If only soft rules are defined, the pod preferentially runs on the node matching the highest-weighted soft condition, and if no node matches any soft rule the default scheduling policy takes over and the highest-scoring node wins. Anti-affinity follows the same logic, except that matching a hard or soft rule keeps the pod off the node rather than attracting it. One more caution: with pod-to-pod affinity, if the cluster has many nodes, the rules should not be made too fine-grained; keep the topology granularity moderate, because overly fine rules make node filtering during scheduling far more expensive and drag down the performance of the whole cluster. In large clusters it is therefore advisable to prefer node affinity.
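To close, here is a minimal sketch of the coarse-grained pattern the summary points toward: a preferred podAntiAffinity rule keyed on kubernetes.io/hostname spreads replicas across nodes without leaving pods Pending when spreading is impossible. The Deployment name web and the pod label app=web are hypothetical and do not come from the cluster used above:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                            # hypothetical workload, for illustration only
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:    # soft rule: prefer spreading, never block scheduling
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - {key: app, operator: In, values: ["web"]}
              topologyKey: kubernetes.io/hostname             # one "location" per node, i.e. spread replicas per host
      containers:
      - name: nginx
        image: nginx:1.14-alpine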