8. Deep Dive into k8s: Resource Control via QoS and Eviction, with Source Code Analysis
Published: 2020-08-30
> Please credit the source when reposting. This article was published on luozhiyun's blog: https://www.luozhiyun.com. The source code version analyzed is [1.19](https://github.com/kubernetes/kubernetes/tree/release-1.19).
![83980769_p0_master1200](https://img.luozhiyun.com/20200829221848.jpg)
Another weekend, a good time to sit down and quietly savor some source code. This article covers resource reclamation, so it will be fairly long. We will see what k8s does when resources run short, what it takes into account when reclaiming them, and why our pods sometimes get killed seemingly out of nowhere.
## limit&request
In k8s, CPU and memory resources are constrained mainly through limits and requests, defined in the YAML file as follows:
```
spec.containers[].resources.limits.cpu
spec.containers[].resources.limits.memory
spec.containers[].resources.requests.cpu
spec.containers[].resources.requests.memory
```
During scheduling, kube-scheduler computes only against the requests values; what actually caps resource usage is limits.
Here is an example adapted from the official docs:
```yaml
apiVersion: v1
kind: Pod
metadata:
name: cpu-demo
namespace: cpu-example
spec:
containers:
- name: cpu-demo-ctr
image: vish/stress
resources:
limits:
cpu: "1"
requests:
cpu: "0.5"
args:
- -cpus
- "2"
```
In this example, the args parameter sets cpus to 2, meaning the container will try to stress 2 CPUs. But our limit is 1 and the request is 0.5.
After creating this pod, running kubectl top to inspect resource usage shows that CPU usage does not exceed 1:
```
NAME CPU(cores) MEMORY(bytes)
cpu-demo 974m
```
This shows the pod's CPU is capped at 1 core; even though the container wants more, it cannot get it.
When a container does not specify a request, the request defaults to the same value as the limit.
## The QoS Model and Eviction
Now let's look at the different QoS classes that arise from different combinations of requests and limits.
Kubernetes has three QoS classes:
1. `Guaranteed`: every container in the pod has `limit` equal to `request` (and non-zero) for every resource;
2. `Burstable`: the pod does not meet the Guaranteed criteria, but at least one container sets requests or limits;
3. `BestEffort`: the pod sets neither requests nor limits;
When the host runs low on resources, the kubelet evicts pods (i.e., reclaims resources) in QoS order: BestEffort > Burstable > Guaranteed.
Eviction comes in two modes, Soft and Hard. Soft eviction lets you configure a grace period for the eviction and waits that user-configured grace period before actually evicting; Hard eviction acts immediately.
So when does eviction happen? We can configure thresholds for eviction. For instance, with a memory eviction hard threshold of 100M, once the machine's available memory drops below 100M, the kubelet ranks all pods on the machine by QoS class and memory usage, and evicts the top-ranked pods to free enough memory.
A threshold is defined as `[eviction-signal][operator][quantity]`.
**eviction-signal**
According to the official docs, the eviction signals are as follows:
| Eviction Signal | Description |
| ------------------ | ------------------------------------------------------------ |
| memory.available | memory.available := node.status.capacity[memory] - node.stats.memory.workingSet |
| nodefs.available | nodefs.available := node.stats.fs.available |
| nodefs.inodesFree | nodefs.inodesFree := node.stats.fs.inodesFree |
| imagefs.available | imagefs.available := node.stats.runtime.imagefs.available |
| imagefs.inodesFree | imagefs.inodesFree := node.stats.runtime.imagefs.inodesFree |
nodefs and imagefs refer to two filesystem partitions:
nodefs: the filesystem the kubelet uses for volumes, daemon logs, and so on.
imagefs: the filesystem the container runtime uses for images and container writable layers.
**operator**
The desired relational operator, such as "<".
**quantity**
The threshold amount, either an absolute quantity such as 1Gi, or a percentage such as 10%.
If the kubelet cannot reclaim memory before the node experiences a system OOM, the oom_killer computes an oom_score for each container based on the percentage of node memory it uses, and kills the container with the highest score.
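Putting the format together, thresholds are configured through kubelet flags. A sketch using the standard flag names (the values here are only illustrative):

```shell
# hard thresholds: evict as soon as either signal crosses its threshold
kubelet --eviction-hard=memory.available<100Mi,nodefs.available<10%

# a soft threshold that only triggers after the condition has held for 1m30s
kubelet --eviction-soft=memory.available<300Mi \
        --eviction-soft-grace-period=memory.available=1m30s
```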
## QoS Source Code Analysis
The QoS code lives under pkg/apis/core/v1/helper/qos/:
**qos#GetPodQOS**
```go
// pkg/apis/core/v1/helper/qos/qos.go
func GetPodQOS(pod *v1.Pod) v1.PodQOSClass {
requests := v1.ResourceList{}
limits := v1.ResourceList{}
zeroQuantity := resource.MustParse("0")
isGuaranteed := true
allContainers := []v1.Container{}
	// collect both the regular and the init containers
allContainers = append(allContainers, pod.Spec.Containers...)
allContainers = append(allContainers, pod.Spec.InitContainers...)
	// iterate over all containers
for _, container := range allContainers {
// process requests
		// accumulate the cpu and memory values from requests
for name, quantity := range container.Resources.Requests {
if !isSupportedQoSComputeResource(name) {
continue
}
if quantity.Cmp(zeroQuantity) == 1 {
delta := quantity.DeepCopy()
if _, exists := requests[name]; !exists {
requests[name] = delta
} else {
delta.Add(requests[name])
requests[name] = delta
}
}
}
// process limits
qosLimitsFound := sets.NewString()
		// accumulate the cpu and memory values from limits
for name, quantity := range container.Resources.Limits {
if !isSupportedQoSComputeResource(name) {
continue
}
if quantity.Cmp(zeroQuantity) == 1 {
qosLimitsFound.Insert(string(name))
delta := quantity.DeepCopy()
if _, exists := limits[name]; !exists {
limits[name] = delta
} else {
delta.Add(limits[name])
limits[name] = delta
}
}
}
		// if limits do not set both cpu and memory, the pod cannot be Guaranteed
if !qosLimitsFound.HasAll(string(v1.ResourceMemory), string(v1.ResourceCPU)) {
isGuaranteed = false
}
}
	// neither requests nor limits are set anywhere: BestEffort
if len(requests) == 0 && len(limits) == 0 {
return v1.PodQOSBestEffort
}
// Check is requests match limits for all resources.
if isGuaranteed {
for name, req := range requests {
if lim, exists := limits[name]; !exists || lim.Cmp(req) != 0 {
isGuaranteed = false
break
}
}
}
	// limits and requests are both fully set and equal: Guaranteed
if isGuaranteed &&
len(requests) == len(limits) {
return v1.PodQOSGuaranteed
}
return v1.PodQOSBurstable
}
```
The code is commented above, so I won't go through it again; it is quite straightforward.
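To make the three rules concrete, here is a simplified, self-contained sketch (my own toy version, not the kubelet code: resources are plain integers instead of resource.Quantity, and only cpu/memory are modeled):

```go
package main

import "fmt"

// res holds cpu (millicores) and memory (bytes) amounts; 0 means "unset".
type res struct{ cpu, mem int64 }

type container struct{ requests, limits res }

// classify mirrors the decision flow of GetPodQOS: sum requests and limits
// across containers, then apply the three rules in order.
func classify(containers []container) string {
	var reqCPU, reqMem, limCPU, limMem int64
	allLimitsSet := true
	for _, c := range containers {
		reqCPU += c.requests.cpu
		reqMem += c.requests.mem
		limCPU += c.limits.cpu
		limMem += c.limits.mem
		// a container missing either cpu or memory limit rules out Guaranteed
		if c.limits.cpu == 0 || c.limits.mem == 0 {
			allLimitsSet = false
		}
	}
	// nothing set at all: BestEffort
	if reqCPU == 0 && reqMem == 0 && limCPU == 0 && limMem == 0 {
		return "BestEffort"
	}
	// every limit set and requests match limits: Guaranteed
	if allLimitsSet && reqCPU == limCPU && reqMem == limMem {
		return "Guaranteed"
	}
	return "Burstable"
}

func main() {
	guaranteed := []container{{requests: res{500, 1 << 30}, limits: res{500, 1 << 30}}}
	burstable := []container{{requests: res{250, 0}, limits: res{500, 1 << 30}}}
	bestEffort := []container{{}}
	fmt.Println(classify(guaranteed), classify(burstable), classify(bestEffort))
}
```

Note that, like the real function, this operates on the spec as given; in practice API defaulting has already copied limits into missing requests before GetPodQOS runs.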
Next is the QoS OOM scoring mechanism: each pod is given a score to decide which pods should be killed first; the higher the score, the more likely the pod is to be killed.
**policy**
```go
// pkg/kubelet/qos/policy.go
// the higher the score, the more likely the process is to be killed
const (
// KubeletOOMScoreAdj is the OOM score adjustment for Kubelet
KubeletOOMScoreAdj int = -999
// KubeProxyOOMScoreAdj is the OOM score adjustment for kube-proxy
KubeProxyOOMScoreAdj int = -999
guaranteedOOMScoreAdj int = -998
besteffortOOMScoreAdj int = 1000
)
```
**policy#GetContainerOOMScoreAdjust**
```go
// pkg/kubelet/qos/policy.go
func GetContainerOOMScoreAdjust(pod *v1.Pod, container *v1.Container, memoryCapacity int64) int {
	// static pods, mirror pods, and critical pods go straight to guaranteedOOMScoreAdj
if types.IsCriticalPod(pod) {
// Critical pods should be the last to get killed.
return guaranteedOOMScoreAdj
}
	// look up the pod's QoS class; only Guaranteed and BestEffort are handled here
switch v1qos.GetPodQOS(pod) {
case v1.PodQOSGuaranteed:
// Guaranteed containers should be the last to get killed.
return guaranteedOOMScoreAdj
case v1.PodQOSBestEffort:
return besteffortOOMScoreAdj
}
memoryRequest := container.Resources.Requests.Memory().Value()
	// the less memory the container requests, the higher the score
oomScoreAdjust := 1000 - (1000*memoryRequest)/memoryCapacity
	// this floor guarantees a burstable pod always scores above guaranteed ones
if int(oomScoreAdjust) < (1000 + guaranteedOOMScoreAdj) {
return (1000 + guaranteedOOMScoreAdj)
}
if int(oomScoreAdjust) == besteffortOOMScoreAdj {
return int(oomScoreAdjust - 1)
}
return int(oomScoreAdjust)
}
```
This method scores the different kinds of pods. Static pods, mirror pods, and critical pods are directly given the guaranteedOOMScoreAdj score.
It then calls GetPodQOS to get the pod's QoS class. If the pod is burstable, its score is derived from the memory it requests: the less memory requested, the higher the score. If the score would drop below 1000 + guaranteedOOMScoreAdj, i.e. 2, it is clamped to 2 so that a burstable pod never scores lower than a guaranteed one.
## Eviction Manager Source Code Analysis
When the kubelet instantiates its Kubelet object, it calls `eviction.NewManager` to create an evictionManager. Then, when the kubelet starts working in its Run method, it spawns a goroutine that executes updateRuntimeUp every 5s.
In updateRuntimeUp, once the runtime is confirmed to have started successfully, initializeRuntimeDependentModules is called to initialize the runtime-dependent modules.
initializeRuntimeDependentModules then calls the evictionManager's Start method to start it.
The code is below; the detailed kubelet flow is left for a later article:
```go
func NewMainKubelet(...){
...
evictionManager, evictionAdmitHandler := eviction.NewManager(klet.resourceAnalyzer, evictionConfig, killPodNow(klet.podWorkers, kubeDeps.Recorder), klet.podManager.GetMirrorPodByPod, klet.imageManager, klet.containerGC, kubeDeps.Recorder, nodeRef, klet.clock, etcHostsPathFunc)
klet.evictionManager = evictionManager
...
}
func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
...
go wait.Until(kl.updateRuntimeUp, 5*time.Second, wait.NeverStop)
...
}
func (kl *Kubelet) updateRuntimeUp() {
...
kl.oneTimeInitializer.Do(kl.initializeRuntimeDependentModules)
...
}
func (kl *Kubelet) initializeRuntimeDependentModules() {
...
kl.evictionManager.Start(kl.StatsProvider, kl.GetActivePods, kl.podResourcesAreReclaimed, evictionMonitoringPeriod)
...
}
```
Now let's go to pkg/kubelet/eviction/eviction_manager.go and look at how the Start method implements eviction.
**managerImpl#Start**
```go
// start a control loop to monitor and respond to low-resource conditions
func (m *managerImpl) Start(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, podCleanedUpFunc PodCleanedUpFunc, monitoringInterval time.Duration) {
thresholdHandler := func(message string) {
klog.Infof(message)
m.synchronize(diskInfoProvider, podFunc)
}
	// whether to use kernel memcg notifications
if m.config.KernelMemcgNotification {
for _, threshold := range m.config.Thresholds {
if threshold.Signal == evictionapi.SignalMemoryAvailable || threshold.Signal == evictionapi.SignalAllocatableMemoryAvailable {
notifier, err := NewMemoryThresholdNotifier(threshold, m.config.PodCgroupRoot, &CgroupNotifierFactory{}, thresholdHandler)
if err != nil {
klog.Warningf("eviction manager: failed to create memory threshold notifier: %v", err)
} else {
go notifier.Start()
m.thresholdNotifiers = append(m.thresholdNotifiers, notifier)
}
}
}
}
// start the eviction manager monitoring
	// start a goroutine whose for loop runs synchronize once every monitoringInterval (10s)
go func() {
for {
			// synchronize is the main eviction control loop; it returns the evicted pods, or nil
if evictedPods := m.synchronize(diskInfoProvider, podFunc); evictedPods != nil {
klog.Infof("eviction manager: pods %s evicted, waiting for pod to be cleaned up", format.Pods(evictedPods))
m.waitForPodsCleanup(podCleanedUpFunc, evictedPods)
} else {
time.Sleep(monitoringInterval)
}
}
}()
}
```
The synchronize method below is quite long, so bear with me:
**managerImpl#synchronize**
1. Register the ranking functions for the different eviction signals described above, as well as the node-level resource reclaim functions
```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
...
if m.dedicatedImageFs == nil {
hasImageFs, ok := diskInfoProvider.HasDedicatedImageFs()
if ok != nil {
return nil
}
m.dedicatedImageFs = &hasImageFs
		// register the ranking function for each eviction signal
m.signalToRankFunc = buildSignalToRankFunc(hasImageFs)
		// register node-level reclaim functions; e.g. imagefs.available maps to deleting unused containers and images
m.signalToNodeReclaimFuncs = buildSignalToNodeReclaimFuncs(m.imageGC, m.containerGC, hasImageFs)
}
...
}
```
Let's look at the implementation of buildSignalToRankFunc:
```go
func buildSignalToRankFunc(withImageFs bool) map[evictionapi.Signal]rankFunc {
signalToRankFunc := map[evictionapi.Signal]rankFunc{
evictionapi.SignalMemoryAvailable: rankMemoryPressure,
evictionapi.SignalAllocatableMemoryAvailable: rankMemoryPressure,
evictionapi.SignalPIDAvailable: rankPIDPressure,
}
if withImageFs {
signalToRankFunc[evictionapi.SignalNodeFsAvailable] = rankDiskPressureFunc([]fsStatsType{fsStatsLogs, fsStatsLocalVolumeSource}, v1.ResourceEphemeralStorage)
signalToRankFunc[evictionapi.SignalNodeFsInodesFree] = rankDiskPressureFunc([]fsStatsType{fsStatsLogs, fsStatsLocalVolumeSource}, resourceInodes)
signalToRankFunc[evictionapi.SignalImageFsAvailable] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot}, v1.ResourceEphemeralStorage)
signalToRankFunc[evictionapi.SignalImageFsInodesFree] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot}, resourceInodes)
} else {
signalToRankFunc[evictionapi.SignalNodeFsAvailable] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot, fsStatsLogs, fsStatsLocalVolumeSource}, v1.ResourceEphemeralStorage)
signalToRankFunc[evictionapi.SignalNodeFsInodesFree] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot, fsStatsLogs, fsStatsLocalVolumeSource}, resourceInodes)
signalToRankFunc[evictionapi.SignalImageFsAvailable] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot, fsStatsLogs, fsStatsLocalVolumeSource}, v1.ResourceEphemeralStorage)
signalToRankFunc[evictionapi.SignalImageFsInodesFree] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot, fsStatsLogs, fsStatsLocalVolumeSource}, resourceInodes)
}
return signalToRankFunc
}
```
This method puts the ranking function for each eviction signal, such as MemoryAvailable, NodeFsAvailable, and ImageFsAvailable, into a map and returns it.
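As a rough illustration of what such a rank function does, here is a simplified memory-pressure ordering (my own sketch: the real rankMemoryPressure sorts by whether usage exceeds the request, then by pod priority, then by usage; priority is omitted here):

```go
package main

import (
	"fmt"
	"sort"
)

type podStat struct {
	name    string
	request int64 // memory request in bytes
	usage   int64 // working-set bytes actually in use
}

// rankMemoryPressureSketch is a simplified take on rankMemoryPressure:
// pods using more than they requested are evicted first, and within each
// group the heavier consumer goes first.
func rankMemoryPressureSketch(pods []podStat) {
	sort.SliceStable(pods, func(i, j int) bool {
		iExceeds := pods[i].usage > pods[i].request
		jExceeds := pods[j].usage > pods[j].request
		if iExceeds != jExceeds {
			return iExceeds // pods exceeding their request sort to the front
		}
		return pods[i].usage > pods[j].usage // then by usage, descending
	})
}

func main() {
	pods := []podStat{
		{"within-big", 2 << 30, 1 << 30},        // under its request
		{"exceeds-small", 100 << 20, 200 << 20}, // over its request
		{"exceeds-big", 100 << 20, 500 << 20},   // over by more
	}
	rankMemoryPressureSketch(pods)
	for _, p := range pods {
		fmt.Println(p.name)
	}
}
```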
2. Get all active pods and the overall stats
```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
...
	// get the currently active pods
activePods := podFunc()
updateStats := true
	// get an overall picture of the node, i.e. nodeStats and podStats
summary, err := m.summaryProvider.Get(updateStats)
if err != nil {
klog.Errorf("eviction manager: failed to get summary stats: %v", err)
return nil
}
	// if the notifiers have not been refreshed for more than 10s, update them
if m.clock.Since(m.thresholdsLastUpdated) > notifierRefreshInterval {
m.thresholdsLastUpdated = m.clock.Now()
for _, notifier := range m.thresholdNotifiers {
if err := notifier.UpdateThreshold(summary); err != nil {
klog.Warningf("eviction manager: failed to update %s: %v", notifier.Description(), err)
}
}
}
...
}
```
3. Build the corresponding statistics from the summary into an observations object
```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
...
	// build the statistics from the summary into observations, e.g. SignalMemoryAvailable, SignalNodeFsAvailable, etc.
observations, statsFunc := makeSignalObservations(summary)
...
}
```
Here is an excerpt from **makeSignalObservations**:
```go
func makeSignalObservations(summary *statsapi.Summary) (signalObservations, statsFunc) {
...
if memory := summary.Node.Memory; memory != nil && memory.AvailableBytes != nil && memory.WorkingSetBytes != nil {
result[evictionapi.SignalMemoryAvailable] = signalObservation{
available: resource.NewQuantity(int64(*memory.AvailableBytes), resource.BinarySI),
capacity: resource.NewQuantity(int64(*memory.AvailableBytes+*memory.WorkingSetBytes), resource.BinarySI),
time: memory.Time,
}
}
...
}
```
This method packages the resource usage from the summary into the result map, keyed by eviction signal.
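The relationship between the fields can be sketched with plain byte counts (a toy version of the memory.available entry above; the names are mine):

```go
package main

import "fmt"

// signalObservationSketch mirrors the memory.available entry built in
// makeSignalObservations, using plain byte counts instead of resource.Quantity.
type signalObservationSketch struct {
	available int64
	capacity  int64
}

func makeMemoryObservation(availableBytes, workingSetBytes int64) signalObservationSketch {
	return signalObservationSketch{
		available: availableBytes,
		// capacity is reconstructed as what is free plus what is in active use
		capacity: availableBytes + workingSetBytes,
	}
}

func main() {
	// e.g. 2Gi free alongside a 6Gi working set yields an 8Gi capacity observation
	obs := makeMemoryObservation(2<<30, 6<<30)
	fmt.Println(obs.available, obs.capacity)
}
```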
4. Use the observations to determine which thresholds have been crossed
```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
...
	// determine from the observations which thresholds have been crossed, then return them
thresholds = thresholdsMet(thresholds, observations, false)
if len(m.thresholdsMet) > 0 {
		// minimum eviction reclaim policy
thresholdsNotYetResolved := thresholdsMet(m.thresholdsMet, observations, true)
thresholds = mergeThresholds(thresholds, thresholdsNotYetResolved)
}
...
}
```
**thresholdsMet**
```go
func thresholdsMet(thresholds []evictionapi.Threshold, observations signalObservations, enforceMinReclaim bool) []evictionapi.Threshold {
results := []evictionapi.Threshold{}
for i := range thresholds {
threshold := thresholds[i]
observed, found := observations[threshold.Signal]
if !found {
klog.Warningf("eviction manager: no observation found for eviction signal %v", threshold.Signal)
continue
}
thresholdMet := false
		// resolve the threshold into a resource quantity based on capacity
quantity := evictionapi.GetThresholdQuantity(threshold.Value, observed.capacity)
		// minimum eviction reclaim policy, see: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#minimum-eviction-reclaim
if enforceMinReclaim && threshold.MinReclaim != nil {
quantity.Add(*evictionapi.GetThresholdQuantity(*threshold.MinReclaim, observed.capacity))
}
		// Cmp returns 1 if quantity is greater than observed.available
thresholdResult := quantity.Cmp(*observed.available)
		// check the operator
switch threshold.Operator {
		// for the "<" operator, the threshold is met when thresholdResult is greater than 0
case evictionapi.OpLessThan:
thresholdMet = thresholdResult > 0
}
		// being appended to results means the threshold has been crossed
if thresholdMet {
results = append(results, threshold)
}
}
return results
}
```
thresholdsMet iterates over the thresholds and, for each, pulls the resource situation for its eviction signal out of the observations. Since, as described above, a threshold can be set either as an absolute value like 1Gi or as a percentage, GetThresholdQuantity is called to resolve it into a quantity.
It then applies the minimum eviction reclaim policy to decide whether the amount to reclaim should be raised; for details see: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#minimum-eviction-reclaim.
Finally, quantity is compared against available; if the threshold has been crossed, it is added to the results slice.
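The comparison logic can be condensed into a toy function over raw byte counts (my own sketch; the real code works on resource.Quantity values):

```go
package main

import "fmt"

// thresholdMetSketch reproduces the core comparison in thresholdsMet for a
// "<" threshold: convert the threshold (a percentage of capacity here) into
// a quantity, optionally add min-reclaim, then check whether available has
// fallen below it.
func thresholdMetSketch(availableBytes, capacityBytes int64, percentage float64, minReclaimBytes int64) bool {
	// GetThresholdQuantity: a percentage threshold is resolved against capacity
	quantity := int64(percentage * float64(capacityBytes))
	// minimum-eviction-reclaim raises the bar so we reclaim a bit extra
	quantity += minReclaimBytes
	return availableBytes < quantity
}

func main() {
	gib := int64(1 << 30)
	// 10% of 10Gi = 1Gi; 512Mi available is below that, so the threshold is met
	fmt.Println(thresholdMetSketch(512*(1<<20), 10*gib, 0.10, 0))
	// 2Gi available is above 1Gi, so it is not met...
	fmt.Println(thresholdMetSketch(2*gib, 10*gib, 0.10, 0))
	// ...unless a 1.5Gi min-reclaim is still outstanding
	fmt.Println(thresholdMetSketch(2*gib, 10*gib, 0.10, 3*gib/2))
}
```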
5. Record the first time each eviction signal was observed, and map the eviction signals to the corresponding node conditions
```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
...
now := m.clock.Now()
	// record the time each eviction signal was first observed; use now if there is none
thresholdsFirstObservedAt := thresholdsFirstObservedAt(thresholds, m.thresholdsFirstObservedAt, now)
// the set of node conditions that are triggered by currently observed thresholds
	// the kubelet maps each eviction signal to a corresponding node condition
nodeConditions := nodeConditions(thresholds)
if len(nodeConditions) > 0 {
klog.V(3).Infof("eviction manager: node conditions - observed: %v", nodeConditions)
}
...
}
```
**nodeConditions**
```go
func nodeConditions(thresholds []evictionapi.Threshold) []v1.NodeConditionType {
results := []v1.NodeConditionType{}
for _, threshold := range thresholds {
if nodeCondition, found := signalToNodeCondition[threshold.Signal]; found {
			// check whether results already contains nodeCondition
if !hasNodeCondition(results, nodeCondition) {
results = append(results, nodeCondition)
}
}
}
return results
}
```
The nodeConditions method maps the thresholds to node conditions using signalToNodeCondition, which is defined as:
```go
signalToNodeCondition = map[evictionapi.Signal]v1.NodeConditionType{}
signalToNodeCondition[evictionapi.SignalMemoryAvailable] = v1.NodeMemoryPressure
signalToNodeCondition[evictionapi.SignalAllocatableMemoryAvailable] = v1.NodeMemoryPressure
signalToNodeCondition[evictionapi.SignalImageFsAvailable] = v1.NodeDiskPressure
signalToNodeCondition[evictionapi.SignalNodeFsAvailable] = v1.NodeDiskPressure
signalToNodeCondition[evictionapi.SignalImageFsInodesFree] = v1.NodeDiskPressure
signalToNodeCondition[evictionapi.SignalNodeFsInodesFree] = v1.NodeDiskPressure
signalToNodeCondition[evictionapi.SignalPIDAvailable] = v1.NodePIDPressure
```
In other words, the eviction signals are mapped to MemoryPressure or DiskPressure, summarized in this table:
| Node Condition | Eviction Signal | Description |
| -------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
| MemoryPressure | memory.available | Available memory on the node has satisfied an eviction threshold |
| DiskPressure | nodefs.available, nodefs.inodesFree, imagefs.available, or imagefs.inodesFree | Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold |
6. Merge this round's node conditions with the last observed ones, keeping the most recent timestamps
```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
...
	// merge this round's node conditions with the last observed ones, keeping the most recent
nodeConditionsLastObservedAt := nodeConditionsLastObservedAt(nodeConditions, m.nodeConditionsLastObservedAt, now)
...
}
```
7. Prevent the node conditions from flapping when resources oscillate around the threshold
```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
...
	// the PressureTransitionPeriod parameter defaults to 5 minutes
	// it prevents the node condition from flapping when resources oscillate around the threshold
	// see: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#oscillation-of-node-conditions
nodeConditions = nodeConditionsObservedSince(nodeConditionsLastObservedAt, m.config.PressureTransitionPeriod, now)
if len(nodeConditions) > 0 {
klog.V(3).Infof("eviction manager: node conditions - transition period not met: %v", nodeConditions)
}
...
}
```
**nodeConditionsObservedSince**
```go
func nodeConditionsObservedSince(observedAt nodeConditionsObservedAt, period time.Duration, now time.Time) []v1.NodeConditionType {
results := []v1.NodeConditionType{}
for nodeCondition, at := range observedAt {
duration := now.Sub(at)
if duration < period {
results = append(results, nodeCondition)
}
}
return results
}
```
Conditions whose last observation is more than 5 minutes old are filtered out.
8. Apply the eviction-soft grace period
```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
...
	// apply the eviction-soft-grace-period; only thresholds met longer than their grace period stay in the set
thresholds = thresholdsMetGracePeriod(thresholdsFirstObservedAt, now)
...
}
```
**thresholdsMetGracePeriod**
```go
func thresholdsMetGracePeriod(observedAt thresholdsObservedAt, now time.Time) []evictionapi.Threshold {
results := []evictionapi.Threshold{}
for threshold, at := range observedAt {
duration := now.Sub(at)
		// soft eviction thresholds must stay met for their whole grace period before triggering
if duration < threshold.GracePeriod {
klog.V(2).Infof("eviction manager: eviction criteria not yet met for %v, duration: %v", formatThreshold(threshold), duration)
continue
}
results = append(results, threshold)
}
return results
}
```
9. Update internal state, then compare and refresh
```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
...
// update internal state
m.Lock()
m.nodeConditions = nodeConditions
m.thresholdsFirstObservedAt = thresholdsFirstObservedAt
m.nodeConditionsLastObservedAt = nodeConditionsLastObservedAt
m.thresholdsMet = thresholds
	// check against the last round whether the threshold stats need updating
thresholds = thresholdsUpdatedStats(thresholds, observations, m.lastObservations)
debugLogThresholdsWithObservation("thresholds - updated stats", thresholds, observations)
	// store this round's observations as the last observations
m.lastObservations = observations
m.Unlock()
...
}
```
10. Sort, then find the first threshold to relieve, along with its resource
```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
...
	// if the eviction signal set is empty, this round ends here
if len(thresholds) == 0 {
klog.V(3).Infof("eviction manager: no resources are starved")
return nil
}
	// sort, then take the first element of the thresholds set
sort.Sort(byEvictionPriority(thresholds))
thresholdToReclaim, resourceToReclaim, foundAny := getReclaimableThreshold(thresholds)
if !foundAny {
return nil
}
...
}
```
**getReclaimableThreshold**
```go
func getReclaimableThreshold(thresholds []evictionapi.Threshold) (evictionapi.Threshold, v1.ResourceName, bool) {
	// iterate over thresholds and map each eviction signal to its resource
for _, thresholdToReclaim := range thresholds {
if resourceToReclaim, ok := signalToResource[thresholdToReclaim.Signal]; ok {
return thresholdToReclaim, resourceToReclaim, true
}
klog.V(3).Infof("eviction manager: threshold %s was crossed, but reclaim is not implemented for this threshold.", thresholdToReclaim.Signal)
}
return evictionapi.Threshold{}, "", false
}
```
Here is the definition of signalToResource:
```go
signalToResource = map[evictionapi.Signal]v1.ResourceName{}
signalToResource[evictionapi.SignalMemoryAvailable] = v1.ResourceMemory
signalToResource[evictionapi.SignalAllocatableMemoryAvailable] = v1.ResourceMemory
signalToResource[evictionapi.SignalImageFsAvailable] = v1.ResourceEphemeralStorage
signalToResource[evictionapi.SignalImageFsInodesFree] = resourceInodes
signalToResource[evictionapi.SignalNodeFsAvailable] = v1.ResourceEphemeralStorage
signalToResource[evictionapi.SignalNodeFsInodesFree] = resourceInodes
signalToResource[evictionapi.SignalPIDAvailable] = resourcePids
```
signalToResource groups the eviction signals into memory, ephemeral-storage, inodes, and pids.
11. Reclaim node-level resources
```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
...
	// reclaim node-level resources
if m.reclaimNodeLevelResources(thresholdToReclaim.Signal, resourceToReclaim) {
klog.Infof("eviction manager: able to reduce %v pressure without evicting pods.", resourceToReclaim)
return nil
}
...
}
```
**reclaimNodeLevelResources**
```go
func (m *managerImpl) reclaimNodeLevelResources(signalToReclaim evictionapi.Signal, resourceToReclaim v1.ResourceName) bool {
	// invoke the functions registered in buildSignalToNodeReclaimFuncs
nodeReclaimFuncs := m.signalToNodeReclaimFuncs[signalToReclaim]
for _, nodeReclaimFunc := range nodeReclaimFuncs {
		// delete unused images, or pods and containers that are already dead
if err := nodeReclaimFunc(); err != nil {
klog.Warningf("eviction manager: unexpected error when attempting to reduce %v pressure: %v", resourceToReclaim, err)
}
}
	// after reclaiming, re-check resource usage; if no threshold is crossed anymore, stop here
if len(nodeReclaimFuncs) > 0 {
summary, err := m.summaryProvider.Get(true)
if err != nil {
klog.Errorf("eviction manager: failed to get summary stats after resource reclaim: %v", err)
return false
}
observations, _ := makeSignalObservations(summary)
debugLogObservations("observations after resource reclaim", observations)
thresholds := thresholdsMet(m.config.Thresholds, observations, false)
debugLogThresholdsWithObservation("thresholds after resource reclaim - ignoring grace period", thresholds, observations)
if len(thresholds) == 0 {
return true
}
}
return false
}
```
First, the functions that free the resource for the signal in question are looked up in signalToNodeReclaimFuncs; they were registered in buildSignalToNodeReclaimFuncs above, e.g.:
```
nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers, imageGC.DeleteUnusedImages}
```
These functions invoke the corresponding GC logic, deleting unused containers and unused images to free resources.
It then checks whether the threshold is still exceeded after reclaiming; if not, the round simply ends.
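The reclaim-then-recheck shape can be sketched like this (a toy version: the GC functions and the pressure check are hypothetical stand-ins, not the real containerGC/imageGC APIs):

```go
package main

import "fmt"

// reclaimNodeLevelSketch mirrors the shape of reclaimNodeLevelResources:
// run every registered reclaim function, re-measure, and report whether the
// pressure is gone (true means no pod needs to be evicted this round).
func reclaimNodeLevelSketch(reclaimFuncs []func() error, stillUnderPressure func() bool) bool {
	for _, f := range reclaimFuncs {
		if err := f(); err != nil {
			// a failing reclaim func is logged and skipped, not fatal
			fmt.Println("reclaim error:", err)
		}
	}
	if len(reclaimFuncs) > 0 && !stillUnderPressure() {
		return true
	}
	return false
}

func main() {
	freed := int64(0)
	// hypothetical stand-ins for deleting unused containers and unused images
	gcContainers := func() error { freed += 1 << 30; return nil }
	gcImages := func() error { freed += 2 << 30; return nil }
	// pressure is considered resolved once at least 2Gi has been freed
	ok := reclaimNodeLevelSketch([]func() error{gcContainers, gcImages}, func() bool { return freed < 2<<30 })
	fmt.Println(ok)
}
```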
12. Look up the appropriate ranking function and sort
```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
...
	// fetch the eviction signal's ranking function, registered in buildSignalToRankFunc above
rank, ok := m.signalToRankFunc[thresholdToReclaim.Signal]
if !ok {
klog.Errorf("eviction manager: no ranking function for signal %s", thresholdToReclaim.Signal)
return nil
}
	// if there are no active pods, return immediately
if len(activePods) == 0 {
klog.Errorf("eviction manager: eviction thresholds have been met, but no pods are active to evict")
return nil
}
	// sort the pods by the pressured resource
rank(activePods, statsFunc)
...
}
```
13. Evict the sorted pods and return
```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
...
for i := range activePods {
pod := activePods[i]
gracePeriodOverride := int64(0)
if !isHardEvictionThreshold(thresholdToReclaim) {
gracePeriodOverride = m.config.MaxPodGracePeriodSeconds
}
message, annotations := evictionMessage(resourceToReclaim, pod, statsFunc)
//kill pod
if m.evictPod(pod, gracePeriodOverride, message, annotations) {
metrics.Evictions.WithLabelValues(string(thresholdToReclaim.Signal)).Inc()
return []*v1.Pod{pod}
}
}
...
}
```
As soon as one pod has been evicted, the method returns.
That wraps up the analysis of the eviction manager.
## Summary
This article explained how resource control works. The limit and request settings influence the priority with which a pod gets evicted, so setting sensible limits and requests makes your pods less likely to be killed. Through the source we saw that limits and requests determine the QoS class and thus the OOM score, which in turn affects the order in which pods get killed.
We then walked through the source to see how thresholds are configured and under what conditions pods get killed when resources run short, which took up most of the article. The source also shows how much k8s takes into account during eviction: how oscillation of node conditions is handled, which kinds of resources are reclaimed first, how minimum-reclaim is implemented, and so on.
## Reference
https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/
https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/
https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/
https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/
https://zhuanlan.zhihu.com/p/38359775
https://cloud.tencent.com/developer/article/1097431
https://developer.aliyun.com/article/679216
https://my.oschina.net/jxcdwangtao/blog