k8s endpoints controller分析

阿新 • • 發佈：2021-11-14

endpoints controller是kube-controller-manager元件中眾多控制器中的一個，是 endpoints 資源的控制器，其通過對service、pod 2種資源的監聽，當這2種資源發生變化時會觸發 endpoints controller對相應的endpoints資源進行調諧操作，完成endpoints的建立更新

k8s endpoints controller分析

endpoints controller簡介

endpoints controller是kube-controller-manager元件中眾多控制器中的一個，是 endpoints 資源物件的控制器，其通過對service、pod 2種資源的監聽，當這2種資源發生變化時會觸發 endpoints controller 對相應的endpoints資源進行調諧操作，從而完成endpoints物件的新建、更新、刪除等操作。

endpoints controller架構圖

endpoints controller的大致組成和處理流程如下圖，endpoints controller對pod、service物件註冊了event handler，當有事件時，會watch到然後將對應的service物件放入到queue中，然後syncService

方法為endpoints controller調諧endpoints物件的核心處理邏輯所在，從queue中取出service物件，再查詢相應的pod物件列表，然後對endpoints物件做調諧處理。

service物件簡介

Service 是對一組提供相同功能的 Pods 的抽象，併為它們提供一個統一的入口。藉助 Service，應用可以方便的實現服務發現與負載均衡，並實現應用的零宕機升級。Service 通過標籤來選取服務後端，這些匹配標籤的 Pod IP 和埠列表組成 endpoints，由 kube-proxy 負責將服務 IP 負載均衡到這些 endpoints 上。

service的四種類型如下。

（1）ClusterIP

型別為ClusterIP的service，這個service有一個Cluster IP，其實就一個VIP，具體實現原理依靠kubeproxy元件，通過iptables或是ipvs實現。該型別的service 只能在叢集內訪問。

client訪問Cluster IP，通過iptables或ipvs規則轉到Real Server（endpoints），從而達到負載均衡的效果。

Headless Service

特殊的ClusterIP，通過指定 Cluster IP（spec.clusterIP）的值為 "None" 來建立 Headless Service。

使用場景一：自主選擇權，client自行決定使用哪個Real Server，可以通過查詢DNS來獲取Real Server的資訊。

使用場景二：Headless Service的對應的每一個Endpoints，即每一個Pod，都會有對應的DNS域名，這樣Pod之間就可以通過域名互相訪問（該用法常用於statefulset）。

（2）NodePort

在 ClusterIP 基礎上為 Service在每臺機器上繫結一個埠，這樣就可以通過<NodeIP>:NodePort來訪問該服務。在叢集內，NodePort 服務仍然像之前的 ClusterIP 服務一樣訪問。

（3）LoadBalancer

在 NodePort 的基礎上，藉助 cloud provider 建立一個外部的負載均衡器，並將請求轉發到 <NodeIP>:NodePort。

（4）ExternalName

將服務通過 DNS CNAME 記錄方式轉發到指定的域名。

apiVersion: v1
kind: Service
metadata:
  name: baidu-service
  namespace: test
spec:
  type: ExternalName
  externalName: www.baidu.com

endpoints物件簡介

endpoints中指定了需要連線的服務IP和埠，可以認為endpoints定義了service的backend後端。當訪問service時，實際上是會將請求負載均衡到endpoints定義的服務IP與埠上面去。

另外，endpoints物件與同名稱的service物件相關聯。

endpoints controller分析將分為兩大塊進行，分別是：
（1）endpoints controller初始化與啟動分析；
（2）endpoints controller處理邏輯分析。

1.endpoints controller初始化與啟動分析

基於tag v1.17.4

https://github.com/kubernetes/kubernetes/releases/tag/v1.17.4

直接看到startEndpointController函式，作為endpoints controller啟動分析的入口。

startEndpointController

在startEndpointController函式中啟動了一個goroutine，先是呼叫了endpointcontroller的NewEndpointController方法初始化endpoints controller，接著呼叫Run方法啟動endpoints controller。

// cmd/kube-controller-manager/app/core.go
func startEndpointController(ctx ControllerContext) (http.Handler, bool, error) {
	go endpointcontroller.NewEndpointController(
		ctx.InformerFactory.Core().V1().Pods(),
		ctx.InformerFactory.Core().V1().Services(),
		ctx.InformerFactory.Core().V1().Endpoints(),
		ctx.ClientBuilder.ClientOrDie("endpoint-controller"),
		ctx.ComponentConfig.EndpointController.EndpointUpdatesBatchPeriod.Duration,
	).Run(int(ctx.ComponentConfig.EndpointController.ConcurrentEndpointSyncs), ctx.Stop)
	return nil, true, nil
}

1.1 NewEndpointController

先來看到endpoints controller的初始化方法NewEndpointController。

從NewEndpointController函式程式碼中可以看到，endpoints controller註冊了三個informer，分別是podInformer、serviceInformer與endpointsInformer，以及註冊了service與pod物件的EventHandler，也即對這2個物件的event進行監聽，把event放入事件佇列，由endpoints controller的核心處理方法做做處理。

// pkg/controller/endpoint/endpoints_controller.go
func NewEndpointController(podInformer coreinformers.PodInformer, serviceInformer coreinformers.ServiceInformer,
	endpointsInformer coreinformers.EndpointsInformer, client clientset.Interface, endpointUpdatesBatchPeriod time.Duration) *EndpointController {
	broadcaster := record.NewBroadcaster()
	broadcaster.StartLogging(klog.Infof)
	broadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: client.CoreV1().Events("")})
	recorder := broadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "endpoint-controller"})

	if client != nil && client.CoreV1().RESTClient().GetRateLimiter() != nil {
		ratelimiter.RegisterMetricAndTrackRateLimiterUsage("endpoint_controller", client.CoreV1().RESTClient().GetRateLimiter())
	}
	e := &EndpointController{
		client:           client,
		queue:            workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "endpoint"),
		workerLoopPeriod: time.Second,
	}

	serviceInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: e.onServiceUpdate,
		UpdateFunc: func(old, cur interface{}) {
			e.onServiceUpdate(cur)
		},
		DeleteFunc: e.onServiceDelete,
	})
	e.serviceLister = serviceInformer.Lister()
	e.servicesSynced = serviceInformer.Informer().HasSynced

	podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    e.addPod,
		UpdateFunc: e.updatePod,
		DeleteFunc: e.deletePod,
	})
	e.podLister = podInformer.Lister()
	e.podsSynced = podInformer.Informer().HasSynced

	e.endpointsLister = endpointsInformer.Lister()
	e.endpointsSynced = endpointsInformer.Informer().HasSynced

	e.triggerTimeTracker = endpointutil.NewTriggerTimeTracker()
	e.eventBroadcaster = broadcaster
	e.eventRecorder = recorder

	e.endpointUpdatesBatchPeriod = endpointUpdatesBatchPeriod

	e.serviceSelectorCache = endpointutil.NewServiceSelectorCache()

	return e
}

1.2 Run

主要看到for迴圈處，根據workers的值（來源於kcm啟動引數concurrent-endpoint-syncs配置），啟動相應數量的goroutine，跑e.worker方法。

// pkg/controller/endpoint/endpoints_controller.go
func (e *EndpointController) Run(workers int, stopCh <-chan struct{}) {
	defer utilruntime.HandleCrash()
	defer e.queue.ShutDown()

	klog.Infof("Starting endpoint controller")
	defer klog.Infof("Shutting down endpoint controller")

	if !cache.WaitForNamedCacheSync("endpoint", stopCh, e.podsSynced, e.servicesSynced, e.endpointsSynced) {
		return
	}

	for i := 0; i < workers; i++ {
		go wait.Until(e.worker, e.workerLoopPeriod, stopCh)
	}

	go func() {
		defer utilruntime.HandleCrash()
		e.checkLeftoverEndpoints()
	}()

	<-stopCh
}

1.2.1 worker

直接看到processNextWorkItem方法，從佇列queue中取出一個key，然後呼叫e.syncService方法對該key做處理，e.syncService方法也即endpoints controller的核心處理方法，後面會做詳細分析。

// pkg/controller/endpoint/endpoints_controller.go
func (e *EndpointController) worker() {
	for e.processNextWorkItem() {
	}
}

func (e *EndpointController) processNextWorkItem() bool {
	eKey, quit := e.queue.Get()
	if quit {
		return false
	}
	defer e.queue.Done(eKey)

	err := e.syncService(eKey.(string))
	e.handleErr(err, eKey)

	return true
}

2.endpoints controller核心處理分析

基於tag v1.17.4

https://github.com/kubernetes/kubernetes/releases/tag/v1.17.4

直接看到syncService方法，作為endpoints controller核心處理分析的入口。

2.1 核心處理邏輯-syncService

主要邏輯：
（1）獲取service物件，當查詢不到該service物件時，刪除同名endpoints物件；
（2）當service物件的.Spec.Selector為空時，不存在對應的endpoints物件，直接返回；
（3）根據service物件的.Spec.Selector，查詢與service物件匹配的pod列表；
（4）查詢service的annotations中是否配置了TolerateUnreadyEndpoints，代表允許為unready的pod也建立endpoints，該配置將會影響下面對endpoints物件的subsets資訊的計算；
（5）遍歷service物件匹配的pod列表，找出合適的pod，計算endpoints的subsets資訊；
遍歷pod列表時如何計算出subsets？
（5.1）過濾掉pod ip為空的pod；
（5.2）當TolerateUnreadyEndpoints配置為false且pod的deletetimestamp不為空時，過濾掉該pod；
（5.3）當service沒有ports配置，且ClusterIP為None時，為headless service，呼叫addEndpointSubset函式計算subsets，計算出來的subsets中的ports資訊為空；
（5.4）當service有ports配置，遍歷ports配置，迴圈呼叫addEndpointSubset函式計算subsets（addEndpointSubset函式在後面會展開分析）。
（6）獲取endpoints物件；
（7）判斷現存endpoints物件與調諧中重新計算出來的的endpoints物件的subsets與labels是否一致，一致則無需更新，直接返回；
（8）當endpoints物件不存在時新建endpoints物件，當endpoints物件存在時更新endpoints物件。

func (e *EndpointController) syncService(key string) error {
	startTime := time.Now()
	defer func() {
		klog.V(4).Infof("Finished syncing service %q endpoints. (%v)", key, time.Since(startTime))
	}()

	namespace, name, err := cache.SplitMetaNamespaceKey(key)
	if err != nil {
		return err
	}
	service, err := e.serviceLister.Services(namespace).Get(name)
	if err != nil {
		if !errors.IsNotFound(err) {
			return err
		}

		// Delete the corresponding endpoint, as the service has been deleted.
		// TODO: Please note that this will delete an endpoint when a
		// service is deleted. However, if we're down at the time when
		// the service is deleted, we will miss that deletion, so this
		// doesn't completely solve the problem. See #6877.
		err = e.client.CoreV1().Endpoints(namespace).Delete(name, nil)
		if err != nil && !errors.IsNotFound(err) {
			return err
		}
		e.triggerTimeTracker.DeleteService(namespace, name)
		return nil
	}

	if service.Spec.Selector == nil {
		// services without a selector receive no endpoints from this controller;
		// these services will receive the endpoints that are created out-of-band via the REST API.
		return nil
	}

	klog.V(5).Infof("About to update endpoints for service %q", key)
	pods, err := e.podLister.Pods(service.Namespace).List(labels.Set(service.Spec.Selector).AsSelectorPreValidated())
	if err != nil {
		// Since we're getting stuff from a local cache, it is
		// basically impossible to get this error.
		return err
	}

	// If the user specified the older (deprecated) annotation, we have to respect it.
	tolerateUnreadyEndpoints := service.Spec.PublishNotReadyAddresses
	if v, ok := service.Annotations[TolerateUnreadyEndpointsAnnotation]; ok {
		b, err := strconv.ParseBool(v)
		if err == nil {
			tolerateUnreadyEndpoints = b
		} else {
			utilruntime.HandleError(fmt.Errorf("Failed to parse annotation %v: %v", TolerateUnreadyEndpointsAnnotation, err))
		}
	}

	// We call ComputeEndpointLastChangeTriggerTime here to make sure that the
	// state of the trigger time tracker gets updated even if the sync turns out
	// to be no-op and we don't update the endpoints object.
	endpointsLastChangeTriggerTime := e.triggerTimeTracker.
		ComputeEndpointLastChangeTriggerTime(namespace, service, pods)

	subsets := []v1.EndpointSubset{}
	var totalReadyEps int
	var totalNotReadyEps int

	for _, pod := range pods {
		if len(pod.Status.PodIP) == 0 {
			klog.V(5).Infof("Failed to find an IP for pod %s/%s", pod.Namespace, pod.Name)
			continue
		}
		if !tolerateUnreadyEndpoints && pod.DeletionTimestamp != nil {
			klog.V(5).Infof("Pod is being deleted %s/%s", pod.Namespace, pod.Name)
			continue
		}

		ep, err := podToEndpointAddressForService(service, pod)
		if err != nil {
			// this will happen, if the cluster runs with some nodes configured as dual stack and some as not
			// such as the case of an upgrade..
			klog.V(2).Infof("failed to find endpoint for service:%v with ClusterIP:%v on pod:%v with error:%v", service.Name, service.Spec.ClusterIP, pod.Name, err)
			continue
		}

		epa := *ep
		if endpointutil.ShouldSetHostname(pod, service) {
			epa.Hostname = pod.Spec.Hostname
		}

		// Allow headless service not to have ports.
		if len(service.Spec.Ports) == 0 {
			if service.Spec.ClusterIP == api.ClusterIPNone {
				subsets, totalReadyEps, totalNotReadyEps = addEndpointSubset(subsets, pod, epa, nil, tolerateUnreadyEndpoints)
				// No need to repack subsets for headless service without ports.
			}
		} else {
			for i := range service.Spec.Ports {
				servicePort := &service.Spec.Ports[i]

				portName := servicePort.Name
				portProto := servicePort.Protocol
				portNum, err := podutil.FindPort(pod, servicePort)
				if err != nil {
					klog.V(4).Infof("Failed to find port for service %s/%s: %v", service.Namespace, service.Name, err)
					continue
				}

				var readyEps, notReadyEps int
				epp := &v1.EndpointPort{Name: portName, Port: int32(portNum), Protocol: portProto}
				subsets, readyEps, notReadyEps = addEndpointSubset(subsets, pod, epa, epp, tolerateUnreadyEndpoints)
				totalReadyEps = totalReadyEps + readyEps
				totalNotReadyEps = totalNotReadyEps + notReadyEps
			}
		}
	}
	subsets = endpoints.RepackSubsets(subsets)

	// See if there's actually an update here.
	currentEndpoints, err := e.endpointsLister.Endpoints(service.Namespace).Get(service.Name)
	if err != nil {
		if errors.IsNotFound(err) {
			currentEndpoints = &v1.Endpoints{
				ObjectMeta: metav1.ObjectMeta{
					Name:   service.Name,
					Labels: service.Labels,
				},
			}
		} else {
			return err
		}
	}

	createEndpoints := len(currentEndpoints.ResourceVersion) == 0

	if !createEndpoints &&
		apiequality.Semantic.DeepEqual(currentEndpoints.Subsets, subsets) &&
		apiequality.Semantic.DeepEqual(currentEndpoints.Labels, service.Labels) {
		klog.V(5).Infof("endpoints are equal for %s/%s, skipping update", service.Namespace, service.Name)
		return nil
	}
	newEndpoints := currentEndpoints.DeepCopy()
	newEndpoints.Subsets = subsets
	newEndpoints.Labels = service.Labels
	if newEndpoints.Annotations == nil {
		newEndpoints.Annotations = make(map[string]string)
	}

	if !endpointsLastChangeTriggerTime.IsZero() {
		newEndpoints.Annotations[v1.EndpointsLastChangeTriggerTime] =
			endpointsLastChangeTriggerTime.Format(time.RFC3339Nano)
	} else { // No new trigger time, clear the annotation.
		delete(newEndpoints.Annotations, v1.EndpointsLastChangeTriggerTime)
	}

	if newEndpoints.Labels == nil {
		newEndpoints.Labels = make(map[string]string)
	}

	if !helper.IsServiceIPSet(service) {
		newEndpoints.Labels = utillabels.CloneAndAddLabel(newEndpoints.Labels, v1.IsHeadlessService, "")
	} else {
		newEndpoints.Labels = utillabels.CloneAndRemoveLabel(newEndpoints.Labels, v1.IsHeadlessService)
	}

	klog.V(4).Infof("Update endpoints for %v/%v, ready: %d not ready: %d", service.Namespace, service.Name, totalReadyEps, totalNotReadyEps)
	if createEndpoints {
		// No previous endpoints, create them
		_, err = e.client.CoreV1().Endpoints(service.Namespace).Create(newEndpoints)
	} else {
		// Pre-existing
		_, err = e.client.CoreV1().Endpoints(service.Namespace).Update(newEndpoints)
	}
	if err != nil {
		if createEndpoints && errors.IsForbidden(err) {
			// A request is forbidden primarily for two reasons:
			// 1. namespace is terminating, endpoint creation is not allowed by default.
			// 2. policy is misconfigured, in which case no service would function anywhere.
			// Given the frequency of 1, we log at a lower level.
			klog.V(5).Infof("Forbidden from creating endpoints: %v", err)

			// If the namespace is terminating, creates will continue to fail. Simply drop the item.
			if errors.HasStatusCause(err, v1.NamespaceTerminatingCause) {
				return nil
			}
		}

		if createEndpoints {
			e.eventRecorder.Eventf(newEndpoints, v1.EventTypeWarning, "FailedToCreateEndpoint", "Failed to create endpoint for service %v/%v: %v", service.Namespace, service.Name, err)
		} else {
			e.eventRecorder.Eventf(newEndpoints, v1.EventTypeWarning, "FailedToUpdateEndpoint", "Failed to update endpoint %v/%v: %v", service.Namespace, service.Name, err)
		}

		return err
	}
	return nil
}

2.1.1 addEndpointSubset

下面來展開分析下計算service物件subsets資訊的函式addEndpointSubset，計算出的subsets包括了Address（ReadyAddresses）與NotReadyAddresses。

主要邏輯：
（1）當配置了tolerateUnreadyEndpoints且為true時，或pod處於ready狀態時，將計算進subsets中的Addresses；
（2）當配置了tolerateUnreadyEndpoints且為false或沒有配置時，或pod不處於ready狀態時，呼叫shouldPodBeInEndpoints函式，返回true時將計算進subsets中的NotReadyAddresses。
（2.1）當pod.Spec.RestartPolicy為Never，Pod Status.Phase不為Failed/Successed時，將計算進subsets中的NotReadyAddresses；
（2.2）當pod.Spec.RestartPolicy為OnFailure, Pod Status.Phase不為Successed時，Pod對應的EndpointAddress也會被加入到NotReadyAddresses中；
（2.3）其他情況下，將計算進subsets中的NotReadyAddresses。

// pkg/controller/endpoint/endpoints_controller.go
func addEndpointSubset(subsets []v1.EndpointSubset, pod *v1.Pod, epa v1.EndpointAddress,
	epp *v1.EndpointPort, tolerateUnreadyEndpoints bool) ([]v1.EndpointSubset, int, int) {
	var readyEps int
	var notReadyEps int
	ports := []v1.EndpointPort{}
	if epp != nil {
		ports = append(ports, *epp)
	}
	if tolerateUnreadyEndpoints || podutil.IsPodReady(pod) {
		subsets = append(subsets, v1.EndpointSubset{
			Addresses: []v1.EndpointAddress{epa},
			Ports:     ports,
		})
		readyEps++
	} else if shouldPodBeInEndpoints(pod) {
		klog.V(5).Infof("Pod is out of service: %s/%s", pod.Namespace, pod.Name)
		subsets = append(subsets, v1.EndpointSubset{
			NotReadyAddresses: []v1.EndpointAddress{epa},
			Ports:             ports,
		})
		notReadyEps++
	}
	return subsets, readyEps, notReadyEps
}

func shouldPodBeInEndpoints(pod *v1.Pod) bool {
	switch pod.Spec.RestartPolicy {
	case v1.RestartPolicyNever:
		return pod.Status.Phase != v1.PodFailed && pod.Status.Phase != v1.PodSucceeded
	case v1.RestartPolicyOnFailure:
		return pod.Status.Phase != v1.PodSucceeded
	default:
		return true
	}
}

IsPodReady

當在pod的.status.conditions中，type為Ready的status屬性值為True時，IsPodReady返回true。

// pkg/api/v1/pod/util.go
// IsPodReady returns true if a pod is ready; false otherwise.
func IsPodReady(pod *v1.Pod) bool {
	return IsPodReadyConditionTrue(pod.Status)
}

// GetPodReadyCondition extracts the pod ready condition from the given status and returns that.
// Returns nil if the condition is not present.
func GetPodReadyCondition(status v1.PodStatus) *v1.PodCondition {
	_, condition := GetPodCondition(&status, v1.PodReady)
	return condition
}

總結

endpoints controller架構圖

endpoints controller的大致組成和處理流程如下圖，endpoints controller對pod、service物件註冊了event handler，當有事件時，會watch到然後將對應的service物件放入到queue中，然後syncService方法為endpoints controller調諧endpoints物件的核心處理邏輯所在，從queue中取出service物件，再查詢相應的pod物件列表，然後對endpoints物件做調諧處理。

endpoints controller核心處理邏輯

endpoints controller的核心處理邏輯是獲取service物件，當service不存在時刪除同名endpoints物件，當存在時，根據service物件所關聯的pod列表，計算出endpoints物件的最新subsets資訊，然後新建或更新endpoints物件。

k8s endpoints controller分析