k8s endpoints controller分析
k8s endpoints controller分析
endpoints controller簡介
endpoints controller是kube-controller-manager元件中眾多控制器中的一個,是 endpoints 資源物件的控制器,其通過對service、pod 2種資源的監聽,當這2種資源發生變化時會觸發 endpoints controller 對相應的endpoints資源進行調諧操作,從而完成endpoints物件的新建、更新、刪除等操作。
endpoints controller架構圖
endpoints controller的大致組成和處理流程如下圖,endpoints controller對pod、service物件註冊了event handler,當有事件時,會watch到然後將對應的service物件放入到queue中,然後syncService
service物件簡介
Service 是對一組提供相同功能的 Pods 的抽象,併為它們提供一個統一的入口。藉助 Service,應用可以方便的實現服務發現與負載均衡,並實現應用的零宕機升級。Service 通過標籤來選取服務後端,這些匹配標籤的 Pod IP 和埠列表組成 endpoints,由 kube-proxy 負責將服務 IP 負載均衡到這些 endpoints 上。
service的四種類型如下。
(1)ClusterIP
型別為ClusterIP的service,這個service有一個Cluster IP,其實就一個VIP,具體實現原理依靠kubeproxy元件,通過iptables或是ipvs實現。該型別的service 只能在叢集內訪問。
client訪問Cluster IP,通過iptables或ipvs規則轉到Real Server(endpoints),從而達到負載均衡的效果。
Headless Service
特殊的ClusterIP,通過指定 Cluster IP(spec.clusterIP)的值為 "None" 來建立 Headless Service。
使用場景一:自主選擇權,client自行決定使用哪個Real Server,可以通過查詢DNS來獲取Real Server的資訊。
使用場景二:Headless Service的對應的每一個Endpoints,即每一個Pod,都會有對應的DNS域名,這樣Pod之間就可以通過域名互相訪問(該用法常用於statefulset)。
(2)NodePort
在 ClusterIP 基礎上為 Service在每臺機器上繫結一個埠,這樣就可以通過<NodeIP>:NodePort
來訪問該服務。在叢集內,NodePort 服務仍然像之前的 ClusterIP 服務一樣訪問。
(3)LoadBalancer
在 NodePort 的基礎上,藉助 cloud provider 建立一個外部的負載均衡器,並將請求轉發到 <NodeIP>:NodePort
。
(4)ExternalName
將服務通過 DNS CNAME 記錄方式轉發到指定的域名。
apiVersion: v1
kind: Service
metadata:
name: baidu-service
namespace: test
spec:
type: ExternalName
externalName: www.baidu.com
endpoints物件簡介
endpoints中指定了需要連線的服務IP和埠,可以認為endpoints定義了service的backend後端。當訪問service時,實際上是會將請求負載均衡到endpoints定義的服務IP與埠上面去。
另外,endpoints物件與同名稱的service物件相關聯。
endpoints controller分析將分為兩大塊進行,分別是:
(1)endpoints controller初始化與啟動分析;
(2)endpoints controller處理邏輯分析。
1.endpoints controller初始化與啟動分析
基於tag v1.17.4
https://github.com/kubernetes/kubernetes/releases/tag/v1.17.4
直接看到startEndpointController函式,作為endpoints controller啟動分析的入口。
startEndpointController
在startEndpointController
函式中啟動了一個goroutine,先是呼叫了endpointcontroller的NewEndpointController
方法初始化endpoints controller,接著呼叫Run
方法啟動endpoints controller。
// cmd/kube-controller-manager/app/core.go
func startEndpointController(ctx ControllerContext) (http.Handler, bool, error) {
go endpointcontroller.NewEndpointController(
ctx.InformerFactory.Core().V1().Pods(),
ctx.InformerFactory.Core().V1().Services(),
ctx.InformerFactory.Core().V1().Endpoints(),
ctx.ClientBuilder.ClientOrDie("endpoint-controller"),
ctx.ComponentConfig.EndpointController.EndpointUpdatesBatchPeriod.Duration,
).Run(int(ctx.ComponentConfig.EndpointController.ConcurrentEndpointSyncs), ctx.Stop)
return nil, true, nil
}
1.1 NewEndpointController
先來看到endpoints controller的初始化方法NewEndpointController
。
從NewEndpointController
函式程式碼中可以看到,endpoints controller註冊了三個informer,分別是podInformer、serviceInformer與endpointsInformer,以及註冊了service與pod物件的EventHandler,也即對這2個物件的event進行監聽,把event放入事件佇列,由endpoints controller的核心處理方法做做處理。
// pkg/controller/endpoint/endpoints_controller.go
func NewEndpointController(podInformer coreinformers.PodInformer, serviceInformer coreinformers.ServiceInformer,
endpointsInformer coreinformers.EndpointsInformer, client clientset.Interface, endpointUpdatesBatchPeriod time.Duration) *EndpointController {
broadcaster := record.NewBroadcaster()
broadcaster.StartLogging(klog.Infof)
broadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: client.CoreV1().Events("")})
recorder := broadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "endpoint-controller"})
if client != nil && client.CoreV1().RESTClient().GetRateLimiter() != nil {
ratelimiter.RegisterMetricAndTrackRateLimiterUsage("endpoint_controller", client.CoreV1().RESTClient().GetRateLimiter())
}
e := &EndpointController{
client: client,
queue: workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "endpoint"),
workerLoopPeriod: time.Second,
}
serviceInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: e.onServiceUpdate,
UpdateFunc: func(old, cur interface{}) {
e.onServiceUpdate(cur)
},
DeleteFunc: e.onServiceDelete,
})
e.serviceLister = serviceInformer.Lister()
e.servicesSynced = serviceInformer.Informer().HasSynced
podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: e.addPod,
UpdateFunc: e.updatePod,
DeleteFunc: e.deletePod,
})
e.podLister = podInformer.Lister()
e.podsSynced = podInformer.Informer().HasSynced
e.endpointsLister = endpointsInformer.Lister()
e.endpointsSynced = endpointsInformer.Informer().HasSynced
e.triggerTimeTracker = endpointutil.NewTriggerTimeTracker()
e.eventBroadcaster = broadcaster
e.eventRecorder = recorder
e.endpointUpdatesBatchPeriod = endpointUpdatesBatchPeriod
e.serviceSelectorCache = endpointutil.NewServiceSelectorCache()
return e
}
1.2 Run
主要看到for迴圈處,根據workers的值(來源於kcm啟動引數concurrent-endpoint-syncs
配置),啟動相應數量的goroutine,跑e.worker
方法。
// pkg/controller/endpoint/endpoints_controller.go
func (e *EndpointController) Run(workers int, stopCh <-chan struct{}) {
defer utilruntime.HandleCrash()
defer e.queue.ShutDown()
klog.Infof("Starting endpoint controller")
defer klog.Infof("Shutting down endpoint controller")
if !cache.WaitForNamedCacheSync("endpoint", stopCh, e.podsSynced, e.servicesSynced, e.endpointsSynced) {
return
}
for i := 0; i < workers; i++ {
go wait.Until(e.worker, e.workerLoopPeriod, stopCh)
}
go func() {
defer utilruntime.HandleCrash()
e.checkLeftoverEndpoints()
}()
<-stopCh
}
1.2.1 worker
直接看到processNextWorkItem
方法,從佇列queue中取出一個key,然後呼叫e.syncService
方法對該key做處理,e.syncService
方法也即endpoints controller的核心處理方法,後面會做詳細分析。
// pkg/controller/endpoint/endpoints_controller.go
func (e *EndpointController) worker() {
for e.processNextWorkItem() {
}
}
func (e *EndpointController) processNextWorkItem() bool {
eKey, quit := e.queue.Get()
if quit {
return false
}
defer e.queue.Done(eKey)
err := e.syncService(eKey.(string))
e.handleErr(err, eKey)
return true
}
2.endpoints controller核心處理分析
基於tag v1.17.4
https://github.com/kubernetes/kubernetes/releases/tag/v1.17.4
直接看到syncService方法,作為endpoints controller核心處理分析的入口。
2.1 核心處理邏輯-syncService
主要邏輯:
(1)獲取service物件,當查詢不到該service物件時,刪除同名endpoints物件;
(2)當service物件的.Spec.Selector
為空時,不存在對應的endpoints物件,直接返回;
(3)根據service物件的.Spec.Selector
,查詢與service物件匹配的pod列表;
(4)查詢service的annotations中是否配置了TolerateUnreadyEndpoints
,代表允許為unready的pod也建立endpoints,該配置將會影響下面對endpoints物件的subsets資訊的計算;
(5)遍歷service物件匹配的pod列表,找出合適的pod,計算endpoints的subsets資訊;
遍歷pod列表時如何計算出subsets?
(5.1)過濾掉pod ip為空的pod;
(5.2)當TolerateUnreadyEndpoints
配置為false且pod的deletetimestamp不為空時,過濾掉該pod;
(5.3)當service沒有ports配置,且ClusterIP為None時,為headless service,呼叫addEndpointSubset函式計算subsets,計算出來的subsets中的ports資訊為空;
(5.4)當service有ports配置,遍歷ports配置,迴圈呼叫addEndpointSubset函式計算subsets(addEndpointSubset函式在後面會展開分析)。
(6)獲取endpoints物件;
(7)判斷現存endpoints物件與調諧中重新計算出來的的endpoints物件的subsets與labels是否一致,一致則無需更新,直接返回;
(8)當endpoints物件不存在時新建endpoints物件,當endpoints物件存在時更新endpoints物件。
func (e *EndpointController) syncService(key string) error {
startTime := time.Now()
defer func() {
klog.V(4).Infof("Finished syncing service %q endpoints. (%v)", key, time.Since(startTime))
}()
namespace, name, err := cache.SplitMetaNamespaceKey(key)
if err != nil {
return err
}
service, err := e.serviceLister.Services(namespace).Get(name)
if err != nil {
if !errors.IsNotFound(err) {
return err
}
// Delete the corresponding endpoint, as the service has been deleted.
// TODO: Please note that this will delete an endpoint when a
// service is deleted. However, if we're down at the time when
// the service is deleted, we will miss that deletion, so this
// doesn't completely solve the problem. See #6877.
err = e.client.CoreV1().Endpoints(namespace).Delete(name, nil)
if err != nil && !errors.IsNotFound(err) {
return err
}
e.triggerTimeTracker.DeleteService(namespace, name)
return nil
}
if service.Spec.Selector == nil {
// services without a selector receive no endpoints from this controller;
// these services will receive the endpoints that are created out-of-band via the REST API.
return nil
}
klog.V(5).Infof("About to update endpoints for service %q", key)
pods, err := e.podLister.Pods(service.Namespace).List(labels.Set(service.Spec.Selector).AsSelectorPreValidated())
if err != nil {
// Since we're getting stuff from a local cache, it is
// basically impossible to get this error.
return err
}
// If the user specified the older (deprecated) annotation, we have to respect it.
tolerateUnreadyEndpoints := service.Spec.PublishNotReadyAddresses
if v, ok := service.Annotations[TolerateUnreadyEndpointsAnnotation]; ok {
b, err := strconv.ParseBool(v)
if err == nil {
tolerateUnreadyEndpoints = b
} else {
utilruntime.HandleError(fmt.Errorf("Failed to parse annotation %v: %v", TolerateUnreadyEndpointsAnnotation, err))
}
}
// We call ComputeEndpointLastChangeTriggerTime here to make sure that the
// state of the trigger time tracker gets updated even if the sync turns out
// to be no-op and we don't update the endpoints object.
endpointsLastChangeTriggerTime := e.triggerTimeTracker.
ComputeEndpointLastChangeTriggerTime(namespace, service, pods)
subsets := []v1.EndpointSubset{}
var totalReadyEps int
var totalNotReadyEps int
for _, pod := range pods {
if len(pod.Status.PodIP) == 0 {
klog.V(5).Infof("Failed to find an IP for pod %s/%s", pod.Namespace, pod.Name)
continue
}
if !tolerateUnreadyEndpoints && pod.DeletionTimestamp != nil {
klog.V(5).Infof("Pod is being deleted %s/%s", pod.Namespace, pod.Name)
continue
}
ep, err := podToEndpointAddressForService(service, pod)
if err != nil {
// this will happen, if the cluster runs with some nodes configured as dual stack and some as not
// such as the case of an upgrade..
klog.V(2).Infof("failed to find endpoint for service:%v with ClusterIP:%v on pod:%v with error:%v", service.Name, service.Spec.ClusterIP, pod.Name, err)
continue
}
epa := *ep
if endpointutil.ShouldSetHostname(pod, service) {
epa.Hostname = pod.Spec.Hostname
}
// Allow headless service not to have ports.
if len(service.Spec.Ports) == 0 {
if service.Spec.ClusterIP == api.ClusterIPNone {
subsets, totalReadyEps, totalNotReadyEps = addEndpointSubset(subsets, pod, epa, nil, tolerateUnreadyEndpoints)
// No need to repack subsets for headless service without ports.
}
} else {
for i := range service.Spec.Ports {
servicePort := &service.Spec.Ports[i]
portName := servicePort.Name
portProto := servicePort.Protocol
portNum, err := podutil.FindPort(pod, servicePort)
if err != nil {
klog.V(4).Infof("Failed to find port for service %s/%s: %v", service.Namespace, service.Name, err)
continue
}
var readyEps, notReadyEps int
epp := &v1.EndpointPort{Name: portName, Port: int32(portNum), Protocol: portProto}
subsets, readyEps, notReadyEps = addEndpointSubset(subsets, pod, epa, epp, tolerateUnreadyEndpoints)
totalReadyEps = totalReadyEps + readyEps
totalNotReadyEps = totalNotReadyEps + notReadyEps
}
}
}
subsets = endpoints.RepackSubsets(subsets)
// See if there's actually an update here.
currentEndpoints, err := e.endpointsLister.Endpoints(service.Namespace).Get(service.Name)
if err != nil {
if errors.IsNotFound(err) {
currentEndpoints = &v1.Endpoints{
ObjectMeta: metav1.ObjectMeta{
Name: service.Name,
Labels: service.Labels,
},
}
} else {
return err
}
}
createEndpoints := len(currentEndpoints.ResourceVersion) == 0
if !createEndpoints &&
apiequality.Semantic.DeepEqual(currentEndpoints.Subsets, subsets) &&
apiequality.Semantic.DeepEqual(currentEndpoints.Labels, service.Labels) {
klog.V(5).Infof("endpoints are equal for %s/%s, skipping update", service.Namespace, service.Name)
return nil
}
newEndpoints := currentEndpoints.DeepCopy()
newEndpoints.Subsets = subsets
newEndpoints.Labels = service.Labels
if newEndpoints.Annotations == nil {
newEndpoints.Annotations = make(map[string]string)
}
if !endpointsLastChangeTriggerTime.IsZero() {
newEndpoints.Annotations[v1.EndpointsLastChangeTriggerTime] =
endpointsLastChangeTriggerTime.Format(time.RFC3339Nano)
} else { // No new trigger time, clear the annotation.
delete(newEndpoints.Annotations, v1.EndpointsLastChangeTriggerTime)
}
if newEndpoints.Labels == nil {
newEndpoints.Labels = make(map[string]string)
}
if !helper.IsServiceIPSet(service) {
newEndpoints.Labels = utillabels.CloneAndAddLabel(newEndpoints.Labels, v1.IsHeadlessService, "")
} else {
newEndpoints.Labels = utillabels.CloneAndRemoveLabel(newEndpoints.Labels, v1.IsHeadlessService)
}
klog.V(4).Infof("Update endpoints for %v/%v, ready: %d not ready: %d", service.Namespace, service.Name, totalReadyEps, totalNotReadyEps)
if createEndpoints {
// No previous endpoints, create them
_, err = e.client.CoreV1().Endpoints(service.Namespace).Create(newEndpoints)
} else {
// Pre-existing
_, err = e.client.CoreV1().Endpoints(service.Namespace).Update(newEndpoints)
}
if err != nil {
if createEndpoints && errors.IsForbidden(err) {
// A request is forbidden primarily for two reasons:
// 1. namespace is terminating, endpoint creation is not allowed by default.
// 2. policy is misconfigured, in which case no service would function anywhere.
// Given the frequency of 1, we log at a lower level.
klog.V(5).Infof("Forbidden from creating endpoints: %v", err)
// If the namespace is terminating, creates will continue to fail. Simply drop the item.
if errors.HasStatusCause(err, v1.NamespaceTerminatingCause) {
return nil
}
}
if createEndpoints {
e.eventRecorder.Eventf(newEndpoints, v1.EventTypeWarning, "FailedToCreateEndpoint", "Failed to create endpoint for service %v/%v: %v", service.Namespace, service.Name, err)
} else {
e.eventRecorder.Eventf(newEndpoints, v1.EventTypeWarning, "FailedToUpdateEndpoint", "Failed to update endpoint %v/%v: %v", service.Namespace, service.Name, err)
}
return err
}
return nil
}
2.1.1 addEndpointSubset
下面來展開分析下計算service物件subsets資訊的函式addEndpointSubset,計算出的subsets包括了Address(ReadyAddresses)與NotReadyAddresses。
主要邏輯:
(1)當配置了tolerateUnreadyEndpoints且為true時,或pod處於ready狀態時,將計算進subsets中的Addresses;
(2)當配置了tolerateUnreadyEndpoints且為false或沒有配置時,或pod不處於ready狀態時,呼叫shouldPodBeInEndpoints函式,返回true時將計算進subsets中的NotReadyAddresses。
(2.1)當pod.Spec.RestartPolicy為Never,Pod Status.Phase不為Failed/Successed時,將計算進subsets中的NotReadyAddresses;
(2.2)當pod.Spec.RestartPolicy為OnFailure, Pod Status.Phase不為Successed時,Pod對應的EndpointAddress也會被加入到NotReadyAddresses中;
(2.3)其他情況下,將計算進subsets中的NotReadyAddresses。
// pkg/controller/endpoint/endpoints_controller.go
func addEndpointSubset(subsets []v1.EndpointSubset, pod *v1.Pod, epa v1.EndpointAddress,
epp *v1.EndpointPort, tolerateUnreadyEndpoints bool) ([]v1.EndpointSubset, int, int) {
var readyEps int
var notReadyEps int
ports := []v1.EndpointPort{}
if epp != nil {
ports = append(ports, *epp)
}
if tolerateUnreadyEndpoints || podutil.IsPodReady(pod) {
subsets = append(subsets, v1.EndpointSubset{
Addresses: []v1.EndpointAddress{epa},
Ports: ports,
})
readyEps++
} else if shouldPodBeInEndpoints(pod) {
klog.V(5).Infof("Pod is out of service: %s/%s", pod.Namespace, pod.Name)
subsets = append(subsets, v1.EndpointSubset{
NotReadyAddresses: []v1.EndpointAddress{epa},
Ports: ports,
})
notReadyEps++
}
return subsets, readyEps, notReadyEps
}
func shouldPodBeInEndpoints(pod *v1.Pod) bool {
switch pod.Spec.RestartPolicy {
case v1.RestartPolicyNever:
return pod.Status.Phase != v1.PodFailed && pod.Status.Phase != v1.PodSucceeded
case v1.RestartPolicyOnFailure:
return pod.Status.Phase != v1.PodSucceeded
default:
return true
}
}
IsPodReady
當在pod的.status.conditions中,type為Ready的status屬性值為True時,IsPodReady返回true。
// pkg/api/v1/pod/util.go
// IsPodReady returns true if a pod is ready; false otherwise.
func IsPodReady(pod *v1.Pod) bool {
return IsPodReadyConditionTrue(pod.Status)
}
// GetPodReadyCondition extracts the pod ready condition from the given status and returns that.
// Returns nil if the condition is not present.
func GetPodReadyCondition(status v1.PodStatus) *v1.PodCondition {
_, condition := GetPodCondition(&status, v1.PodReady)
return condition
}
總結
endpoints controller架構圖
endpoints controller的大致組成和處理流程如下圖,endpoints controller對pod、service物件註冊了event handler,當有事件時,會watch到然後將對應的service物件放入到queue中,然後syncService
方法為endpoints controller調諧endpoints物件的核心處理邏輯所在,從queue中取出service物件,再查詢相應的pod物件列表,然後對endpoints物件做調諧處理。
endpoints controller核心處理邏輯
endpoints controller的核心處理邏輯是獲取service物件,當service不存在時刪除同名endpoints物件,當存在時,根據service物件所關聯的pod列表,計算出endpoints物件的最新subsets資訊,然後新建或更新endpoints物件。