【kubernetes/k8s source code analysis】kubelet source code analysis: starting containers
The work mostly amounts to calling into the runtime, which here defaults to docker.
0. Data flow
NewMainKubelet (cmd/kubelet/app/server.go) ->
NewKubeGenericRuntimeManager (pkg/kubelet/kuberuntime/kuberuntime_manager.go) ->
syncPod (pkg/kubelet/kubelet.go) ->
SyncPod (pkg/kubelet/kuberuntime/kuberuntime_manager.go)
1. Data structures
1.1 The ContainerManager interface
Manages the containers running on the host; the interface definition is fairly self-explanatory. A small usage sketch follows the interface.
// Manages the containers running on a machine.
type ContainerManager interface {
    // Runs the container manager's housekeeping.
    // - Ensures that the Docker daemon is in a container.
    // - Creates the system container where all non-containerized processes run.
    Start(*v1.Node, ActivePodsFunc, config.SourcesReady, status.PodStatusProvider, internalapi.RuntimeService) error

    // SystemCgroupsLimit returns resources allocated to system cgroups in the machine.
    // These cgroups include the system and Kubernetes services.
    SystemCgroupsLimit() v1.ResourceList

    // GetNodeConfig returns a NodeConfig that is being used by the container manager.
    GetNodeConfig() NodeConfig

    // Status returns internal Status.
    Status() Status

    // NewPodContainerManager is a factory method which returns a podContainerManager object
    // Returns a noop implementation if qos cgroup hierarchy is not enabled
    NewPodContainerManager() PodContainerManager

    // GetMountedSubsystems returns the mounted cgroup subsystems on the node
    GetMountedSubsystems() *CgroupSubsystems

    // GetQOSContainersInfo returns the names of top level QoS containers
    GetQOSContainersInfo() QOSContainersInfo

    // GetNodeAllocatableReservation returns the amount of compute resources that have to be reserved from scheduling.
    GetNodeAllocatableReservation() v1.ResourceList

    // GetCapacity returns the amount of compute resources tracked by container manager available on the node.
    GetCapacity() v1.ResourceList

    // GetDevicePluginResourceCapacity returns the node capacity (amount of total device plugin resources),
    // node allocatable (amount of total healthy resources reported by device plugin),
    // and inactive device plugin resources previously registered on the node.
    GetDevicePluginResourceCapacity() (v1.ResourceList, v1.ResourceList, []string)

    // UpdateQOSCgroups performs housekeeping updates to ensure that the top
    // level QoS containers have their desired state in a thread-safe way
    UpdateQOSCgroups() error

    // GetResources returns RunContainerOptions with devices, mounts, and env fields populated for
    // extended resources required by container.
    GetResources(pod *v1.Pod, container *v1.Container) (*kubecontainer.RunContainerOptions, error)

    // UpdatePluginResources calls Allocate of device plugin handler for potential
    // requests for device plugin resources, and returns an error if fails.
    // Otherwise, it updates allocatableResource in nodeInfo if necessary,
    // to make sure it is at least equal to the pod's requested capacity for
    // any registered device plugin resource
    UpdatePluginResources(*schedulercache.NodeInfo, *lifecycle.PodAdmitAttributes) error

    InternalContainerLifecycle() InternalContainerLifecycle

    // GetPodCgroupRoot returns the cgroup which contains all pods.
    GetPodCgroupRoot() string

    // GetPluginRegistrationHandler returns a plugin registration handler
    // The pluginwatcher's Handlers allow to have a single module for handling
    // registration.
    GetPluginRegistrationHandler() pluginwatcher.PluginHandler
}
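To make the interface concrete, here is a minimal, self-contained sketch (not kubelet's real code; the mini types below are illustrative stand-ins for v1.ResourceList and ContainerManager) of how a caller could combine GetCapacity and GetNodeAllocatableReservation to derive node allocatable resources:

package main

import "fmt"

// miniResourceList stands in for v1.ResourceList (resource name -> quantity, here milli-CPU / bytes).
type miniResourceList map[string]int64

// miniContainerManager mirrors just the two ContainerManager methods used below.
type miniContainerManager interface {
    GetCapacity() miniResourceList
    GetNodeAllocatableReservation() miniResourceList
}

// allocatable derives what remains schedulable: capacity minus the reservation, floored at zero.
func allocatable(cm miniContainerManager) miniResourceList {
    out := miniResourceList{}
    reserved := cm.GetNodeAllocatableReservation()
    for name, capacity := range cm.GetCapacity() {
        v := capacity - reserved[name]
        if v < 0 {
            v = 0
        }
        out[name] = v
    }
    return out
}

type fakeCM struct{}

func (fakeCM) GetCapacity() miniResourceList {
    return miniResourceList{"cpu": 4000, "memory": 8 << 30} // 4 cores, 8 GiB
}
func (fakeCM) GetNodeAllocatableReservation() miniResourceList {
    return miniResourceList{"cpu": 500, "memory": 1 << 30} // kube/system reserved
}

func main() {
    fmt.Println(allocatable(fakeCM{})) // map[cpu:3500 memory:7516192768]
}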
2. The NewMainKubelet function
Runtime initialization calls the NewKubeGenericRuntimeManager function:
runtime, err := kuberuntime.NewKubeGenericRuntimeManager(
    kubecontainer.FilterEventRecorder(kubeDeps.Recorder),
    klet.livenessManager,
    seccompProfileRoot,
    containerRefManager,
    machineInfo,
    klet,
    kubeDeps.OSInterface,
    klet,
    httpClient,
    imageBackOff,
    kubeCfg.SerializeImagePulls,
    float32(kubeCfg.RegistryPullQPS),
    int(kubeCfg.RegistryBurst),
    kubeCfg.CPUCFSQuota,
    kubeCfg.CPUCFSQuotaPeriod,
    runtimeService,
    imageService,
    kubeDeps.ContainerManager.InternalContainerLifecycle(),
    legacyLogProvider,
    klet.runtimeClassManager,
)
2.1 The returned runtime is assigned to containerRuntime (and the related fields below), i.e. it implements the corresponding interfaces; a small sketch of this pattern follows the assignments.
klet.containerRuntime = runtime
klet.streamingRuntime = runtime
klet.runner = runtime
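Having one concrete value satisfy several small role interfaces is a standard Go pattern; kubeGenericRuntimeManager implements (roughly, in this era) kubecontainer.Runtime, kubecontainer.StreamingRuntime and kubecontainer.ContainerCommandRunner, which is why it can be assigned to all three fields above. A minimal, self-contained sketch of the pattern, with purely illustrative names:

package main

import "fmt"

// Three small role interfaces, analogous to containerRuntime / streamingRuntime / runner above.
type runtime interface{ SyncPod(podName string) }
type streamingRuntime interface{ GetExec(podName, cmd string) string }
type commandRunner interface{ RunInContainer(podName, cmd string) ([]byte, error) }

// manager plays all three roles, like kubeGenericRuntimeManager.
type manager struct{}

func (m *manager) SyncPod(podName string)             { fmt.Println("sync", podName) }
func (m *manager) GetExec(podName, cmd string) string { return "streaming-url-for-" + podName }
func (m *manager) RunInContainer(podName, cmd string) ([]byte, error) {
    return []byte("output of " + cmd), nil
}

func main() {
    m := &manager{}
    // The same value is assigned to fields of three different interface types,
    // mirroring klet.containerRuntime / klet.streamingRuntime / klet.runner = runtime.
    var rt runtime = m
    var sr streamingRuntime = m
    var cr commandRunner = m
    rt.SyncPod("nginx")
    fmt.Println(sr.GetExec("nginx", "sh"))
    out, _ := cr.RunInContainer("nginx", "ls")
    fmt.Println(string(out))
}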
3. The syncPod function
Path: pkg/kubelet/kubelet.go
After a long stretch of initialization, syncPod finally calls the runtime's SyncPod. Since klet.containerRuntime = runtime, the implementation invoked here is kubeGenericRuntimeManager's SyncPod:
// Call the container runtime's SyncPod callback
result := kl.containerRuntime.SyncPod(pod, apiPodStatus, podStatus, pullSecrets, kl.backOff)
kl.reasonCache.Update(pod.UID, result)
if err := result.Error(); err != nil {
// Do not return error if the only failures were pods in backoff
for _, r := range result.SyncResults {
if r.Error != kubecontainer.ErrCrashLoopBackOff && r.Error != images.ErrImagePullBackOff {
// Do not record an event here, as we keep all event logging for sync pod failures
// local to container runtime so we get better errors
return err
}
}
return nil
}
4. The SyncPod function
Path: pkg/kubelet/kuberuntime/kuberuntime_manager.go
// SyncPod syncs the running pod into the desired pod by executing following steps:
//
// 1. Compute sandbox and container changes.
// 2. Kill pod sandbox if necessary.
// 3. Kill any containers that should not be running.
// 4. Create sandbox if necessary.
// 5. Create init containers.
// 6. Create normal containers.
4.1 Step 1: Compute sandbox and container changes
Determines which containers have to be created and which have to be killed: containers to kill go into podContainerChanges.ContainersToKill, containers to create go into podContainerChanges.ContainersToStart. A sketch of the returned struct follows the excerpt below.
podContainerChanges := m.computePodActions(pod, podStatus)
glog.V(3).Infof("computePodActions got %+v for pod %q", podContainerChanges, format.Pod(pod))
if podContainerChanges.CreateSandbox {
ref, err := ref.GetReference(legacyscheme.Scheme, pod)
if err != nil {
glog.Errorf("Couldn't make a ref to pod %q: '%v'", format.Pod(pod), err)
}
if podContainerChanges.SandboxID != "" {
m.recorder.Eventf(ref, v1.EventTypeNormal, events.SandboxChanged, "Pod sandbox changed, it will be killed and re-created.")
} else {
glog.V(4).Infof("SyncPod received new pod %q, will create a sandbox for it", format.Pod(pod))
}
}
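The podContainerChanges value returned by computePodActions is a podActions struct. The sketch below follows the field names in pkg/kubelet/kuberuntime/kuberuntime_manager.go of this era, with the field types simplified so it stands alone (the real code uses *v1.Container and kubecontainer.ContainerID):

package kuberuntimesketch

// containerToKillInfo mirrors the per-container kill record.
type containerToKillInfo struct {
    name    string // name of the container to stop
    message string // human-readable reason it is being stopped
}

// containerRef is a simplified stand-in for *v1.Container.
type containerRef struct{ Name string }

// podActions is what computePodActions hands back to SyncPod.
type podActions struct {
    KillPod                  bool                           // kill the whole pod (e.g. the sandbox must change)
    CreateSandbox            bool                           // a new sandbox has to be created
    SandboxID                string                         // ID of the sandbox to keep or kill
    Attempt                  uint32                         // sandbox creation attempt number
    NextInitContainerToStart *containerRef                  // next init container to run; nil once init is done
    ContainersToStart        []int                          // indexes into pod.Spec.Containers to (re)start
    ContainersToKill         map[string]containerToKillInfo // container ID -> kill info
}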
4.2 Step 2: Kill the pod if the sandbox has changed
When the sandbox has changed, the whole pod has to be killed and recreated.
if podContainerChanges.KillPod {
if !podContainerChanges.CreateSandbox {
glog.V(4).Infof("Stopping PodSandbox for %q because all other containers are dead.", format.Pod(pod))
} else {
glog.V(4).Infof("Stopping PodSandbox for %q, will start new one", format.Pod(pod))
}
killResult := m.killPodWithSyncResult(pod, kubecontainer.ConvertPodStatusToRunningPod(m.runtimeName, podStatus), nil)
result.AddPodSyncResult(killResult)
if killResult.Error() != nil {
glog.Errorf("killPodWithSyncResult failed: %v", killResult.Error())
return
}
if podContainerChanges.CreateSandbox {
m.purgeInitContainers(pod, podStatus)
}
}
4.3 Step 3: kill any running containers in this pod which are not to keep
Kill any containers in this pod that should no longer be running.
// Step 3: kill any running containers in this pod which are not to keep.
for containerID, containerInfo := range podContainerChanges.ContainersToKill {
glog.V(3).Infof("Killing unwanted container %q(id=%q) for pod %q", containerInfo.name, containerID, format.Pod(pod))
killContainerResult := kubecontainer.NewSyncResult(kubecontainer.KillContainer, containerInfo.name)
result.AddSyncResult(killContainerResult)
if err := m.killContainer(pod, containerID, containerInfo.name, containerInfo.message, nil); err != nil {
killContainerResult.Fail(kubecontainer.ErrKillContainer, err.Error())
glog.Errorf("killContainer %q(id=%q) for pod %q failed: %v", containerInfo.name, containerID, format.Pod(pod), err)
return
}
}
4.4 Step 4: Create a sandbox for the pod if necessary
Creates the sandbox (the pod's infrastructure container) by calling createPodSandbox, which generates the pod sandbox config, creates the pod log directory, and calls m.runtimeService.RunPodSandbox. A minimal sketch of that flow follows the excerpt below.
// Step 4: Create a sandbox for the pod if necessary.
podSandboxID := podContainerChanges.SandboxID
if podContainerChanges.CreateSandbox {
var msg string
var err error
glog.V(4).Infof("Creating sandbox for pod %q", format.Pod(pod))
createSandboxResult := kubecontainer.NewSyncResult(kubecontainer.CreatePodSandbox, format.Pod(pod))
result.AddSyncResult(createSandboxResult)
podSandboxID, msg, err = m.createPodSandbox(pod, podContainerChanges.Attempt)
if err != nil {
createSandboxResult.Fail(kubecontainer.ErrCreatePodSandbox, msg)
glog.Errorf("createPodSandbox for pod %q failed: %v", format.Pod(pod), err)
ref, referr := ref.GetReference(legacyscheme.Scheme, pod)
if referr != nil {
glog.Errorf("Couldn't make a ref to pod %q: '%v'", format.Pod(pod), referr)
}
m.recorder.Eventf(ref, v1.EventTypeWarning, events.FailedCreatePodSandBox, "Failed create pod sandbox: %v", err)
return
}
glog.V(4).Infof("Created PodSandbox %q for pod %q", podSandboxID, format.Pod(pod))
podSandboxStatus, err := m.runtimeService.PodSandboxStatus(podSandboxID)
if err != nil {
ref, referr := ref.GetReference(legacyscheme.Scheme, pod)
if referr != nil {
glog.Errorf("Couldn't make a ref to pod %q: '%v'", format.Pod(pod), referr)
}
m.recorder.Eventf(ref, v1.EventTypeWarning, events.FailedStatusPodSandBox, "Unable to get pod sandbox status: %v", err)
glog.Errorf("Failed to get pod sandbox status: %v; Skipping pod %q", err, format.Pod(pod))
result.Fail(err)
return
}
// If we ever allow updating a pod from non-host-network to
// host-network, we may use a stale IP.
if !kubecontainer.IsHostNetworkPod(pod) {
// Overwrite the podIP passed in the pod status, since we just started the pod sandbox.
podIP = m.determinePodSandboxIP(pod.Namespace, pod.Name, podSandboxStatus)
glog.V(4).Infof("Determined the ip %q for pod %q after sandbox changed", podIP, format.Pod(pod))
}
}
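As a rough mental model of createPodSandbox, here is a minimal, self-contained sketch of the three steps described above (build the sandbox config, create the pod log directory, call RunPodSandbox). The mini types and the /var/log/pods/<uid> layout are simplifications, not kubelet's real code:

package sandboxsketch

import (
    "fmt"
    "os"
    "path/filepath"
)

// miniSandboxConfig stands in for the CRI PodSandboxConfig.
type miniSandboxConfig struct {
    Name, Namespace, UID string
    Attempt              uint32
    LogDirectory         string
}

// miniRuntime models the single CRI call used here (RunPodSandbox).
type miniRuntime interface {
    RunPodSandbox(cfg *miniSandboxConfig) (sandboxID string, err error)
}

// createPodSandboxSketch does the three things described above: build the sandbox
// config, ensure the pod log directory exists, then ask the runtime to run the sandbox.
func createPodSandboxSketch(rt miniRuntime, name, namespace, uid string, attempt uint32) (string, error) {
    cfg := &miniSandboxConfig{
        Name:      name,
        Namespace: namespace,
        UID:       uid,
        Attempt:   attempt,
        // per-pod log directory; /var/log/pods/<uid> is the layout of this era
        LogDirectory: filepath.Join("/var/log/pods", uid),
    }
    if err := os.MkdirAll(cfg.LogDirectory, 0755); err != nil {
        return "", fmt.Errorf("create pod log dir: %v", err)
    }
    return rt.RunPodSandbox(cfg) // in kubelet this is a gRPC call to the CRI runtime
}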
4.5 Step 5: start the init container
Init containers do initialization work for the pod; they are run one at a time, to completion, before the regular containers are started.
// Step 5: start the init container.
if container := podContainerChanges.NextInitContainerToStart; container != nil {
// Start the next init container.
startContainerResult := kubecontainer.NewSyncResult(kubecontainer.StartContainer, container.Name)
result.AddSyncResult(startContainerResult)
isInBackOff, msg, err := m.doBackOff(pod, container, podStatus, backOff)
if isInBackOff {
startContainerResult.Fail(err, msg)
glog.V(4).Infof("Backing Off restarting init container %+v in pod %v", container, format.Pod(pod))
return
}
glog.V(4).Infof("Creating init container %+v in pod %v", container, format.Pod(pod))
if msg, err := m.startContainer(podSandboxID, podSandboxConfig, container, pod, podStatus, pullSecrets, podIP, kubecontainer.ContainerTypeInit); err != nil {
startContainerResult.Fail(err, msg)
utilruntime.HandleError(fmt.Errorf("init container start failed: %v: %s", err, msg))
return
}
// Successfully started the container; clear the entry in the failure
glog.V(4).Infof("Completed init container %q for pod %q", container.Name, format.Pod(pod))
}
4.6 Step 6: start containers in podContainerChanges.ContainersToStart
This is where the container processes are actually started, via startContainer (covered in section 5). Like step 5, every (re)start is gated through doBackOff; a sketch of that backoff idea follows the excerpt.
// Step 6: start containers in podContainerChanges.ContainersToStart.
for _, idx := range podContainerChanges.ContainersToStart {
container := &pod.Spec.Containers[idx]
startContainerResult := kubecontainer.NewSyncResult(kubecontainer.StartContainer, container.Name)
result.AddSyncResult(startContainerResult)
isInBackOff, msg, err := m.doBackOff(pod, container, podStatus, backOff)
if isInBackOff {
startContainerResult.Fail(err, msg)
glog.V(4).Infof("Backing Off restarting container %+v in pod %v", container, format.Pod(pod))
continue
}
glog.V(4).Infof("Creating container %+v in pod %v", container, format.Pod(pod))
if msg, err := m.startContainer(podSandboxID, podSandboxConfig, container, pod, podStatus, pullSecrets, podIP, kubecontainer.ContainerTypeRegular); err != nil {
startContainerResult.Fail(err, msg)
// known errors that are logged in other places are logged at higher levels here to avoid
// repetitive log spam
switch {
case err == images.ErrImagePullBackOff:
glog.V(3).Infof("container start failed: %v: %s", err, msg)
default:
utilruntime.HandleError(fmt.Errorf("container start failed: %v: %s", err, msg))
}
continue
}
}
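Both step 5 and step 6 gate container (re)starts through doBackOff, which consults a per-container backoff (a flowcontrol.Backoff) so a crashing container is not restarted in a hot loop. A minimal, self-contained sketch of that idea; this is illustrative only, not the real flowcontrol package:

package main

import (
    "fmt"
    "time"
)

type backoffEntry struct {
    lastUpdate time.Time
    delay      time.Duration
}

// miniBackoff doubles the delay on every failed start, capped at max.
type miniBackoff struct {
    entries   map[string]*backoffEntry
    base, max time.Duration
}

func newMiniBackoff(base, max time.Duration) *miniBackoff {
    return &miniBackoff{entries: map[string]*backoffEntry{}, base: base, max: max}
}

// Next records a failed start for key and grows its delay.
func (b *miniBackoff) Next(key string, now time.Time) {
    e, ok := b.entries[key]
    if !ok {
        b.entries[key] = &backoffEntry{lastUpdate: now, delay: b.base}
        return
    }
    e.delay *= 2
    if e.delay > b.max {
        e.delay = b.max
    }
    e.lastUpdate = now
}

// IsInBackOffSince reports whether a start attempt at time since should be skipped.
func (b *miniBackoff) IsInBackOffSince(key string, since time.Time) bool {
    e, ok := b.entries[key]
    if !ok {
        return false
    }
    return since.Sub(e.lastUpdate) < e.delay
}

func main() {
    bo := newMiniBackoff(10*time.Second, 5*time.Minute)
    now := time.Now()
    bo.Next("default_mypod_app", now) // first failure: 10s delay
    bo.Next("default_mypod_app", now) // second failure: 20s delay
    fmt.Println(bo.IsInBackOffSince("default_mypod_app", now.Add(15*time.Second))) // true, still backing off
}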
5. The startContainer function
Path: pkg/kubelet/kuberuntime/kuberuntime_container.go
5.1 Step 1: pull the image
Self-explanatory: pull the image. A sketch of the imagePullPolicy decision follows the excerpt.
// Step 1: pull the image.
imageRef, msg, err := m.imagePuller.EnsureImageExists(pod, container, pullSecrets)
if err != nil {
m.recordContainerEvent(pod, container, "", v1.EventTypeWarning, events.FailedToCreateContainer, "Error: %v", grpc.ErrorDesc(err))
return msg, err
}
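The pull decision inside EnsureImageExists follows the container's imagePullPolicy. A minimal sketch of that rule (the names below are illustrative, not kubelet's own helpers):

package main

import "fmt"

type PullPolicy string

const (
    PullAlways       PullPolicy = "Always"
    PullNever        PullPolicy = "Never"
    PullIfNotPresent PullPolicy = "IfNotPresent"
)

// shouldPullImage reports whether the image must be pulled before creating the container.
func shouldPullImage(policy PullPolicy, imagePresent bool) bool {
    switch policy {
    case PullAlways:
        return true
    case PullNever:
        return false
    default: // IfNotPresent
        return !imagePresent
    }
}

func main() {
    fmt.Println(shouldPullImage(PullIfNotPresent, true)) // false: reuse the local image
    fmt.Println(shouldPullImage(PullAlways, true))       // true: always re-pull
}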
5.2 Step 2: create the container
Creates the container; this ultimately goes out over a gRPC connection to the CRI API, roughly the equivalent of docker create. A sketch of that RPC follows the excerpt.
// Step 2: create the container.
ref, err := kubecontainer.GenerateContainerRef(pod, container)
if err != nil {
glog.Errorf("Can't make a ref to pod %q, container %v: %v", format.Pod(pod), container.Name, err)
}
glog.V(4).Infof("Generating ref for container %s: %#v", container.Name, ref)
// For a new container, the RestartCount should be 0
restartCount := 0
containerStatus := podStatus.FindContainerStatusByName(container.Name)
if containerStatus != nil {
restartCount = containerStatus.RestartCount + 1
}
containerConfig, cleanupAction, err := m.generateContainerConfig(container, pod, restartCount, podIP, imageRef, containerType)
if cleanupAction != nil {
defer cleanupAction()
}
if err != nil {
m.recordContainerEvent(pod, container, "", v1.EventTypeWarning, events.FailedToCreateContainer, "Error: %v", grpc.ErrorDesc(err))
return grpc.ErrorDesc(err), ErrCreateContainerConfig
}
containerID, err := m.runtimeService.CreateContainer(podSandboxID, containerConfig, podSandboxConfig)
if err != nil {
m.recordContainerEvent(pod, container, containerID, v1.EventTypeWarning, events.FailedToCreateContainer, "Error: %v", grpc.ErrorDesc(err))
return grpc.ErrorDesc(err), ErrCreateContainer
}
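For reference, m.runtimeService here is kubelet's remote runtime service, and CreateContainer is a CRI RPC. A hedged sketch of that call, assuming the v1alpha2 CRI API vendored in this era (k8s.io/kubernetes/pkg/kubelet/apis/cri/runtime/v1alpha2; in newer releases it lives in k8s.io/cri-api):

package crisketch

import (
    "context"
    "time"

    runtimeapi "k8s.io/kubernetes/pkg/kubelet/apis/cri/runtime/v1alpha2"
)

// createContainer issues the CRI CreateContainer RPC and returns the new container ID,
// which is roughly what m.runtimeService.CreateContainer boils down to.
func createContainer(client runtimeapi.RuntimeServiceClient, sandboxID string,
    config *runtimeapi.ContainerConfig, sandboxConfig *runtimeapi.PodSandboxConfig) (string, error) {

    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
    defer cancel()

    resp, err := client.CreateContainer(ctx, &runtimeapi.CreateContainerRequest{
        PodSandboxId:  sandboxID,
        Config:        config,
        SandboxConfig: sandboxConfig,
    })
    if err != nil {
        return "", err
    }
    return resp.ContainerId, nil
}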
5.3 Step 3: start the container
Roughly the equivalent of docker start.
// Step 3: start the container.
err = m.runtimeService.StartContainer(containerID)
if err != nil {
m.recordContainerEvent(pod, container, containerID, v1.EventTypeWarning, events.FailedToStartContainer, "Error: %v", grpc.ErrorDesc(err))
return grpc.ErrorDesc(err), kubecontainer.ErrRunContainer
}
m.recordContainerEvent(pod, container, containerID, v1.EventTypeNormal, events.StartedContainer, "Started container")
// Symlink container logs to the legacy container log location for cluster logging
// support.
// TODO(random-liu): Remove this after cluster logging supports CRI container log path.
containerMeta := containerConfig.GetMetadata()
sandboxMeta := podSandboxConfig.GetMetadata()
legacySymlink := legacyLogSymlink(containerID, containerMeta.Name, sandboxMeta.Name,
sandboxMeta.Namespace)
containerLog := filepath.Join(podSandboxConfig.LogDirectory, containerConfig.LogPath)
// only create legacy symlink if containerLog path exists (or the error is not IsNotExist).
// Because if containerLog path does not exist, only dangling legacySymlink is created.
// This dangling legacySymlink is later removed by container gc, so it does not make sense
// to create it in the first place. it happens when journald logging driver is used with docker.
if _, err := m.osInterface.Stat(containerLog); !os.IsNotExist(err) {
if err := m.osInterface.Symlink(containerLog, legacySymlink); err != nil {
glog.Errorf("Failed to create legacy symbolic link %q to container %q log %q: %v",
legacySymlink, containerID, containerLog, err)
}
}
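The legacy symlink exists so that cluster logging agents that still scrape /var/log/containers keep working with CRI log paths. A hedged sketch of how that symlink name is assembled; the exact layout is reconstructed from the helpers referenced above and may differ slightly across versions:

package logsketch

import (
    "fmt"
    "path/filepath"
)

// legacyContainerLogsDir is where pre-CRI cluster logging agents expect container logs.
const legacyContainerLogsDir = "/var/log/containers"

// legacyLogSymlinkSketch builds the /var/log/containers/... name that the real
// containerLog file (under the pod's CRI log directory) is symlinked to.
func legacyLogSymlinkSketch(containerID, containerName, podName, podNamespace string) string {
    podFullName := podName + "_" + podNamespace // like kubecontainer.BuildPodFullName
    return filepath.Join(legacyContainerLogsDir,
        fmt.Sprintf("%s_%s-%s.log", podFullName, containerName, containerID))
}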