Kubernetes PodGC Controller Source Code Analysis

阿新 • Published: 2018-11-21

Author: [email protected]

PodGC Controller Configuration

Only two kube-controller-manager flags affect the PodGC Controller:

flag	default value	comments
--controllers stringSlice	*	The list of controllers to enable. podgc can be enabled or disabled here; it is in the enabled list by default.
--terminated-pod-gc-threshold int32	12500	Number of terminated pods that can exist before the terminated pod garbage collector starts deleting terminated pods. If <= 0, the terminated pod garbage collector is disabled. (default 12500)
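
In a --controllers list, `*` enables every on-by-default controller, `foo` enables the controller named foo, and `-foo` disables it. The following is a minimal sketch of that decision together with the `<= 0` rule for the threshold. The names isControllerEnabled and gcTerminatedEnabled are hypothetical helpers written for this illustration, not the kube-controller-manager functions, and controllers that are off by default are not modelled.

```go
package main

import "fmt"

// isControllerEnabled is an illustrative helper (hypothetical name) that
// decides whether a controller is enabled by a --controllers style list.
// An explicit "name" or "-name" entry wins; otherwise "*" enables it.
func isControllerEnabled(name string, controllers []string) bool {
	hasStar := false
	for _, c := range controllers {
		switch c {
		case name:
			return true
		case "-" + name:
			return false
		case "*":
			hasStar = true
		}
	}
	return hasStar
}

// gcTerminatedEnabled mirrors the documented rule for
// --terminated-pod-gc-threshold: a value <= 0 disables the terminated pod pass.
func gcTerminatedEnabled(threshold int) bool {
	return threshold > 0
}

func main() {
	fmt.Println(isControllerEnabled("podgc", []string{"*"}))           // true
	fmt.Println(isControllerEnabled("podgc", []string{"*", "-podgc"})) // false
	fmt.Println(gcTerminatedEnabled(12500))                            // true
	fmt.Println(gcTerminatedEnabled(0))                                // false
}
```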

PodGC Controller Entry Point

The PodGC Controller is started when kube-controller-manager runs. When CMServer runs, it invokes StartControllers, which iterates over the pre-registered, enabled controllers and starts them one by one.

cmd/kube-controller-manager/app/controllermanager.go:180

func Run(s *options.CMServer) error {
	...
	err := StartControllers(newControllerInitializers(), s, rootClientBuilder, clientBuilder, stop)
	...
}

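newControllerInitializers registers the regular controllers together with their start functions. I call them regular because some controllers are not registered here, for example the very important service controller and node controller; I refer to those as non-regular controllers.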

func newControllerInitializers() map[string]InitFunc {
	controllers := map[string]InitFunc{}
	controllers["endpoint"] = startEndpointController
	...
	controllers["podgc"] = startPodGCController
	...

	return controllers
}

So CMServer ultimately invokes startPodGCController to start the PodGC Controller.
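
The registration pattern can be sketched with the standard library alone: a map from controller name to start function, and a loop that starts every enabled entry. This is an illustration under stated assumptions, not the kube-controller-manager code; initFunc, newInitializers and startAll are hypothetical names, and the start functions are placeholders.

```go
package main

import (
	"fmt"
	"sort"
)

// initFunc is a stand-in for the InitFunc type: it starts one controller and
// reports whether it was started.
type initFunc func(stop <-chan struct{}) (bool, error)

// newInitializers plays the role of newControllerInitializers: it maps a
// controller name to its start function. The bodies here are placeholders.
func newInitializers() map[string]initFunc {
	started := func(stop <-chan struct{}) (bool, error) { return true, nil }
	return map[string]initFunc{
		"endpoint": started,
		"podgc":    started,
	}
}

// startAll walks the registry, skips controllers rejected by the enabled
// predicate, and returns the names that were started, in sorted order.
func startAll(inits map[string]initFunc, enabled func(string) bool, stop <-chan struct{}) ([]string, error) {
	var names []string
	for name, start := range inits {
		if !enabled(name) {
			continue
		}
		ok, err := start(stop)
		if err != nil {
			return nil, fmt.Errorf("starting %q: %v", name, err)
		}
		if ok {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names, nil
}

func main() {
	stop := make(chan struct{})
	defer close(stop)
	names, err := startAll(newInitializers(), func(string) bool { return true }, stop)
	fmt.Println(names, err) // [endpoint podgc] <nil>
}
```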

cmd/kube-controller-manager/app/core.go:66

func startPodGCController(ctx ControllerContext) (bool, error) {
	go podgc.NewPodGC(
		ctx.ClientBuilder.ClientOrDie("pod-garbage-collector"),
		ctx.InformerFactory.Core().V1().Pods(),
		int(ctx.Options.TerminatedPodGCThreshold),
	).Run(ctx.Stop)
	return true, nil
}

startPodGCController is simple: it starts a goroutine that creates the PodGC controller and runs it.

Creating the PodGC Controller

Let's first look at the definition of PodGCController.

pkg/controller/podgc/gc_controller.go:44

type PodGCController struct {
	kubeClient clientset.Interface

	podLister       corelisters.PodLister
	podListerSynced cache.InformerSynced

	deletePod              func(namespace, name string) error
	terminatedPodThreshold int
}

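kubeClient: the client used to talk to the API server.
podLister: lists pods from the informer's local cache.
podListerSynced: reports whether the pod informer cache has synced.
deletePod: the function that calls the API server to delete a pod.
terminatedPodThreshold: the value of --terminated-pod-gc-threshold, 12500 by default.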

pkg/controller/podgc/gc_controller.go:54

func NewPodGC(kubeClient clientset.Interface, podInformer coreinformers.PodInformer, terminatedPodThreshold int) *PodGCController {
	if kubeClient != nil && kubeClient.Core().RESTClient().GetRateLimiter() != nil {
		metrics.RegisterMetricAndTrackRateLimiterUsage("gc_controller", kubeClient.Core().RESTClient().GetRateLimiter())
	}
	gcc := &PodGCController{
		kubeClient:             kubeClient,
		terminatedPodThreshold: terminatedPodThreshold,
		deletePod: func(namespace, name string) error {
			glog.Infof("PodGC is force deleting Pod: %v:%v", namespace, name)
			return kubeClient.Core().Pods(namespace).Delete(name, metav1.NewDeleteOptions(0))
		},
	}

	gcc.podLister = podInformer.Lister()
	gcc.podListerSynced = podInformer.Informer().HasSynced

	return gcc
}

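Creating the PodGC Controller only assigns the fields of PodGCController. Note the metav1.NewDeleteOptions(0) argument in the deletePod definition: it sets the grace period to 0 seconds, so the pod is deleted immediately.

Running the PodGC Controller

Once the PodGC Controller has been created, its Run method is called to start it.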
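
One consequence of keeping deletePod as a function field is that the delete call can be swapped out without touching the rest of the controller. The sketch below shows the idea with a recording stand-in instead of an API server call. podDeleter and newRecordingDeleter are hypothetical names introduced for this illustration.

```go
package main

import "fmt"

// podDeleter is a cut-down stand-in for PodGCController that keeps only the
// injected delete function.
type podDeleter struct {
	deletePod func(namespace, name string) error
}

// newRecordingDeleter returns a podDeleter whose deletePod only records the
// keys it was asked to delete, instead of calling an API server.
func newRecordingDeleter(deleted *[]string) *podDeleter {
	return &podDeleter{
		deletePod: func(namespace, name string) error {
			*deleted = append(*deleted, namespace+"/"+name)
			return nil
		},
	}
}

func main() {
	var deleted []string
	d := newRecordingDeleter(&deleted)
	if err := d.deletePod("default", "job-1"); err != nil {
		fmt.Println("delete failed:", err)
	}
	fmt.Println(deleted) // [default/job-1]
}
```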

pkg/controller/podgc/gc_controller.go:73

func (gcc *PodGCController) Run(stop <-chan struct{}) {
	if !cache.WaitForCacheSync(stop, gcc.podListerSynced) {
		utilruntime.HandleError(fmt.Errorf("timed out waiting for caches to sync"))
		return
	}

	go wait.Until(gcc.gc, gcCheckPeriod, stop)
	<-stop
}

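Run first checks every 100ms whether the PodLister has synced, and keeps checking until it has synced or stop is received.
It then starts a goroutine that runs gcc.gc; after each gcc.gc pass finishes it waits 20s (gcCheckPeriod) and runs gcc.gc again, until the stop signal is received.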
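
The run, wait, repeat behaviour described above can be approximated with the standard library. This is a simplified sketch, not the wait.Until implementation: runUntil is a hypothetical name, and the demo uses a 10ms period instead of 20s so that it finishes quickly.

```go
package main

import (
	"fmt"
	"time"
)

// runUntil calls f, waits for period, and repeats until stop is closed.
// The period is measured from the end of one call to the start of the next.
func runUntil(f func(), period time.Duration, stop <-chan struct{}) {
	for {
		// Do not start another pass once stop has been closed.
		select {
		case <-stop:
			return
		default:
		}
		f()
		select {
		case <-stop:
			return
		case <-time.After(period):
		}
	}
}

func main() {
	stop := make(chan struct{})
	runs := 0
	runUntil(func() {
		runs++
		if runs == 3 {
			close(stop) // ask the loop to stop after the third pass
		}
	}, 10*time.Millisecond, stop)
	fmt.Println("passes:", runs) // passes: 3
}
```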

pkg/controller/podgc/gc_controller.go:83

func (gcc *PodGCController) gc() {
	pods, err := gcc.podLister.List(labels.Everything())
	if err != nil {
		glog.Errorf("Error while listing all Pods: %v", err)
		return
	}
	if gcc.terminatedPodThreshold > 0 {
		gcc.gcTerminated(pods)
	}
	gcc.gcOrphaned(pods)
	gcc.gcUnscheduledTerminating(pods)
}

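gcc.gc is where the pod collection logic lives:

List all pods from the PodLister (no filtering).
If terminatedPodThreshold is greater than 0, call gcc.gcTerminated(pods) to collect terminated pods in excess of the threshold.
Call gcc.gcOrphaned(pods) to collect orphaned pods.
Call gcc.gcUnscheduledTerminating(pods) to collect unscheduled terminating pods.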

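Notes:

gcTerminated, gcOrphaned and gcUnscheduledTerminating run serially, one after another.
Within gcTerminated, the deletions of pods over the threshold run in parallel. gcTerminated waits on a sync.WaitGroup until all of those deletions have finished before it returns, and only then does gcOrphaned start.
gcOrphaned and gcUnscheduledTerminating each delete their pods serially.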
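
The difference between the two deletion styles can be sketched as follows: a fan-out that blocks on a sync.WaitGroup until every delete has returned, and a plain loop. deleteAllParallel, deleteAllSerial and errRefused are hypothetical names for this illustration; counting failures is an addition made here so the result is observable, whereas the controller hands errors to utilruntime.HandleError.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

// errRefused is a placeholder error used by the demo delete function.
var errRefused = errors.New("delete refused")

// deleteAllParallel starts one goroutine per key and returns only after every
// call to del has finished. It returns how many deletions failed.
func deleteAllParallel(keys []string, del func(key string) error) int {
	var wg sync.WaitGroup
	var failed int32
	for _, key := range keys {
		wg.Add(1)
		go func(key string) {
			defer wg.Done()
			if err := del(key); err != nil {
				atomic.AddInt32(&failed, 1)
			}
		}(key)
	}
	wg.Wait()
	return int(atomic.LoadInt32(&failed))
}

// deleteAllSerial deletes one key at a time, in order.
func deleteAllSerial(keys []string, del func(key string) error) int {
	failed := 0
	for _, key := range keys {
		if err := del(key); err != nil {
			failed++
		}
	}
	return failed
}

func main() {
	var calls int32
	del := func(key string) error {
		atomic.AddInt32(&calls, 1)
		if key == "ns/bad" {
			return errRefused
		}
		return nil
	}
	keys := []string{"ns/a", "ns/bad", "ns/c"}
	fmt.Println(deleteAllParallel(keys, del), deleteAllSerial(keys, del)) // 1 1
	fmt.Println(atomic.LoadInt32(&calls))                                // 6
}
```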

Collecting Terminated Pods

func (gcc *PodGCController) gcTerminated(pods []*v1.Pod) {
	terminatedPods := []*v1.Pod{}
	for _, pod := range pods {
		if isPodTerminated(pod) {
			terminatedPods = append(terminatedPods, pod)
		}
	}

	terminatedPodCount := len(terminatedPods)
	sort.Sort(byCreationTimestamp(terminatedPods))

	deleteCount := terminatedPodCount - gcc.terminatedPodThreshold

	if deleteCount > terminatedPodCount {
		deleteCount = terminatedPodCount
	}
	if deleteCount > 0 {
		glog.Infof("garbage collecting %v pods", deleteCount)
	}

	var wait sync.WaitGroup
	for i := 0; i < deleteCount; i++ {
		wait.Add(1)
		go func(namespace string, name string) {
			defer wait.Done()
			if err := gcc.deletePod(namespace, name); err != nil {
				// ignore not founds
				defer utilruntime.HandleError(err)
			}
		}(terminatedPods[i].Namespace, terminatedPods[i].Name)
	}
	wait.Wait()
}

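Iterate over all pods and keep the terminated ones (pods whose Pod.Status.Phase is not Pending, Running, or Unknown).
Sort them by creation timestamp and compute deleteCount, the number of terminated pods in excess of terminatedPodThreshold.
Start deleteCount goroutines that call gcc.deletePod (which invokes the API server) in parallel to delete the corresponding pods immediately.

Collecting Pods Bound to Nodes That No Longer Exist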
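
The selection step can be illustrated in isolation. The sketch below uses a simplified pod struct rather than v1.Pod and assumes the sort puts the oldest pod first, so that the oldest terminated pods are the ones picked. pod, isTerminated and pickForDeletion are hypothetical names, and CreatedUnix stands in for the creation timestamp.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a simplified stand-in for v1.Pod with only the fields this sketch needs.
type pod struct {
	Name        string
	Phase       string
	CreatedUnix int64 // creation time as Unix seconds (assumption for brevity)
}

// isTerminated treats every phase other than Pending, Running and Unknown as terminated.
func isTerminated(p pod) bool {
	return p.Phase != "Pending" && p.Phase != "Running" && p.Phase != "Unknown"
}

// pickForDeletion returns the oldest terminated pods that exceed threshold.
func pickForDeletion(pods []pod, threshold int) []pod {
	var terminated []pod
	for _, p := range pods {
		if isTerminated(p) {
			terminated = append(terminated, p)
		}
	}
	sort.Slice(terminated, func(i, j int) bool {
		return terminated[i].CreatedUnix < terminated[j].CreatedUnix
	})
	n := len(terminated) - threshold
	if n <= 0 {
		return nil
	}
	return terminated[:n]
}

func main() {
	pods := []pod{
		{Name: "new", Phase: "Succeeded", CreatedUnix: 300},
		{Name: "live", Phase: "Running", CreatedUnix: 100},
		{Name: "old", Phase: "Failed", CreatedUnix: 200},
	}
	for _, p := range pickForDeletion(pods, 1) {
		fmt.Println(p.Name) // old
	}
}
```

With two terminated pods and a threshold of 1, only the older one is selected; the running pod is never a candidate regardless of its age.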

// gcOrphaned deletes pods that are bound to nodes that don't exist.
func (gcc *PodGCController) gcOrphaned(pods []*v1.Pod) {
	glog.V(4).Infof("GC'ing orphaned")
	// We want to get list of Nodes from the etcd, to make sure that it's as fresh as possible.
	nodes, err := gcc.kubeClient.Core().Nodes().List(metav1.ListOptions{})
	if err != nil {
		return
	}
	nodeNames := sets.NewString()
	for i := range nodes.Items {
		nodeNames.Insert(nodes.Items[i].Name)
	}

	for _, pod := range pods {
		if pod.Spec.NodeName == "" {
			continue
		}
		if nodeNames.Has(pod.Spec.NodeName) {
			continue
		}
		glog.V(2).Infof("Found orphaned Pod %v assigned to the Node %v. Deleting.", pod.Name, pod.Spec.NodeName)
		if err := gcc.deletePod(pod.Namespace, pod.Name); err != nil {
			utilruntime.HandleError(err)
		} else {
			glog.V(0).Infof("Forced deletion of orphaned Pod %s succeeded", pod.Name)
		}
	}
}

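gcOrphaned deletes pods whose bound node no longer exists.

Call the API server to list all nodes.
Iterate over all pods; if a pod's NodeName is non-empty and not among the nodes just listed, call gcc.deletePod for it, one pod at a time.

Collecting Unscheduled Terminating Pods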
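
The orphan check is a set-membership test. A minimal sketch, using a plain map as the set and a simplified pod struct; pod and orphaned are hypothetical names, and the node list is passed in rather than fetched from an API server.

```go
package main

import "fmt"

// pod is a simplified stand-in for v1.Pod.
type pod struct {
	Namespace string
	Name      string
	NodeName  string // empty while the pod is unscheduled
}

// orphaned returns the pods bound to a node name that is not in nodes.
func orphaned(pods []pod, nodes []string) []pod {
	known := make(map[string]struct{}, len(nodes))
	for _, n := range nodes {
		known[n] = struct{}{}
	}
	var out []pod
	for _, p := range pods {
		if p.NodeName == "" {
			continue // unscheduled pods are not orphans
		}
		if _, ok := known[p.NodeName]; ok {
			continue // the node still exists
		}
		out = append(out, p)
	}
	return out
}

func main() {
	pods := []pod{
		{Namespace: "default", Name: "a", NodeName: "node-1"},
		{Namespace: "default", Name: "b", NodeName: "node-gone"},
		{Namespace: "default", Name: "c"},
	}
	for _, p := range orphaned(pods, []string{"node-1", "node-2"}) {
		fmt.Println(p.Namespace + "/" + p.Name) // default/b
	}
}
```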

pkg/controller/podgc/gc_controller.go:167

// gcUnscheduledTerminating deletes pods that are terminating and haven't been scheduled to a particular node.
func (gcc *PodGCController) gcUnscheduledTerminating(pods []*v1.Pod) {
	glog.V(4).Infof("GC'ing unscheduled pods which are terminating.")

	for _, pod := range pods {
		if pod.DeletionTimestamp == nil || len(pod.Spec.NodeName) > 0 {
			continue
		}

		glog.V(2).Infof("Found unscheduled terminating Pod %v not assigned to any Node. Deleting.", pod.Name)
		if err := gcc.deletePod(pod.Namespace, pod.Name); err != nil {
			utilruntime.HandleError(err)
		} else {
			glog.V(0).Infof("Forced deletion of unscheduled terminating Pod %s succeeded", pod.Name)
		}
	}
}

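gcUnscheduledTerminating deletes pods that are terminating and have not been scheduled to a node.

Iterate over all pods and select those that are terminating (pod.DeletionTimestamp != nil) and unscheduled (pod.Spec.NodeName is empty).
Call gcc.deletePod for each of them, one at a time.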
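
The filter reduces to a two-condition predicate. The sketch below restates it over a simplified pod struct; pod, unixTime and isUnscheduledTerminating are hypothetical names, and unixTime exists only to build a timestamp pointer for the demo.

```go
package main

import (
	"fmt"
	"time"
)

// pod is a simplified stand-in for v1.Pod.
type pod struct {
	Name              string
	NodeName          string
	DeletionTimestamp *time.Time // nil until deletion has been requested
}

// unixTime returns a pointer to the time at sec seconds since the Unix epoch.
func unixTime(sec int64) *time.Time {
	t := time.Unix(sec, 0)
	return &t
}

// isUnscheduledTerminating reports whether deletion has been requested for a
// pod that was never bound to a node.
func isUnscheduledTerminating(p pod) bool {
	return p.DeletionTimestamp != nil && p.NodeName == ""
}

func main() {
	pods := []pod{
		{Name: "pending"},
		{Name: "stuck", DeletionTimestamp: unixTime(1542758400)},
		{Name: "draining", NodeName: "node-1", DeletionTimestamp: unixTime(1542758400)},
	}
	for _, p := range pods {
		if isUnscheduledTerminating(p) {
			fmt.Println(p.Name) // stuck
		}
	}
}
```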

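Summary

The PodGC Controller is one of the controllers Kubernetes starts by default. It runs in the background on the master and performs a pod GC pass with a 20s wait between passes.

--controllers switches the PodGC Controller on or off.
--terminated-pod-gc-threshold sets the threshold for gcTerminated.
The PodGC Controller runs the following three GC passes serially:
- Collect terminated pods in excess of the threshold (pods whose Pod.Status.Phase is not Pending, Running, or Unknown).
- Collect pods whose bound node no longer exists (is not in etcd).
- Collect pods that are terminating and have not been scheduled to a node.

Reposted from: https://cloud.tencent.com/developer/article/1097279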
