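Virtualization Principles: Xen CPU Virtualization
Chapter 3 CPU Virtualization
3.1 Basic Xen mechanisms and services
To achieve paravirtualization, the VMM must provide a set of mechanisms. The requirements they have to meet are:
- When a computer boots, it reads the BIOS to obtain physical information such as memory size and disk parameters. Under virtualization there is no BIOS, so the VMM must emulate this functionality.
- Both the VMM and the Guest OS run in protected mode, so a mechanism for sharing information in protected mode is needed.
- The VMM runs at the highest privilege level (ring 0), while the Guest OS runs at a lower privilege level. The guest kernel therefore cannot execute certain privileged instructions, and the VMM must provide an interface for carrying them out.
- The VMM needs to notify VMs of events, so an event notification mechanism is required.
- Linux provides communication mechanisms between processes; virtual machines likewise need a secure and efficient way to communicate with each other.
Xen provides the following mechanisms to meet these requirements.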
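3.1.1 Start info page
The start info page contains the information the kernel needs to boot. It is a start_info data structure, defined in /xen/include/public/xen.h. It includes the number of memory pages allocated to the domain, the machine frame number of the XenStore communication page, the machine address of the shared info page, and so on.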
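As a rough illustration of how a guest might consume these fields, the sketch below uses a trimmed stand-in structure. The name toy_start_info, the helper functions, and the 4 KiB page size are assumptions made for this example; the real start_info in xen.h has more fields and a different layout.

```c
#include <assert.h>
#include <stdio.h>

/* Trimmed, illustrative stand-in for start_info; NOT the real layout. */
struct toy_start_info {
    unsigned long nr_pages;     /* memory pages allocated to the domain */
    unsigned long shared_info;  /* address of the shared info page */
    unsigned long store_mfn;    /* machine frame number of the XenStore page */
};

#define TOY_PAGE_SIZE 4096UL    /* assumed page size for this sketch */

/* Memory size in KiB implied by the page count. */
unsigned long toy_domain_mem_kib(const struct toy_start_info *si)
{
    assert(si != NULL);
    return si->nr_pages * (TOY_PAGE_SIZE / 1024UL);
}

/* Print the fields a guest kernel would look at early in boot. */
void toy_print_start_info(const struct toy_start_info *si)
{
    assert(si != NULL);
    printf("pages=%lu (%lu KiB) shared_info=%#lx store_mfn=%#lx\n",
           si->nr_pages, toy_domain_mem_kib(si),
           si->shared_info, si->store_mfn);
}
```

A guest kernel would receive a pointer to the real structure at boot; here the caller simply fills in a toy_start_info by hand and passes it to the helpers.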
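3.1.2 Shared info page
The start info page matters only when a domain boots or is restored, whereas the shared info page is used for as long as the system runs. Its structure is shared_info.
The shared info page mainly holds information about VCPUs and virtual machine state, including VCPU state, time information, and virtual interrupt state. Both Xen and the Guest OS can access it, so it can be used to share information between them.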
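3.1.3 Hypercalls
Hypercalls give the Guest OS a way to carry out privileged operations. In Linux, the kernel provides system calls through a software interrupt instruction (int 80H). Hypercalls are also implemented through a software interrupt, using interrupt number 0x82.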
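The dispatch idea behind a trap-based call interface can be sketched in user space as a table of handlers indexed by call number. Everything here (the toy_ names, the table size, the error code, the timer handler) is hypothetical and only models the lookup step; it does not reproduce Xen's actual hypercall table or its trap handling.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_HYPERCALLS 8
#define TOY_ENOSYS 38   /* illustrative "not implemented" error code */

typedef long (*toy_hypercall_fn)(unsigned long arg);

/* Handler table indexed by call number; empty slots are NULL. */
static toy_hypercall_fn toy_hypercall_table[TOY_NR_HYPERCALLS];

/* State touched by the example handler below. */
static unsigned long toy_timer_deadline;

void toy_register_hypercall(unsigned int nr, toy_hypercall_fn fn)
{
    assert(nr < TOY_NR_HYPERCALLS);
    toy_hypercall_table[nr] = fn;
}

/* What the trap handler would do after the guest raises the interrupt:
 * validate the call number, then jump to the privileged implementation. */
long toy_do_hypercall(unsigned int nr, unsigned long arg)
{
    if (nr >= TOY_NR_HYPERCALLS || toy_hypercall_table[nr] == NULL)
        return -TOY_ENOSYS;
    return toy_hypercall_table[nr](arg);
}

/* Example privileged operation: record a timer deadline. */
long toy_hc_set_timer(unsigned long deadline)
{
    toy_timer_deadline = deadline;
    return 0;
}
```

The guest never calls the handler directly; it only names a call number, which lets the privileged side reject anything it does not recognize.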
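3.1.4 Event channels
Event channels provide the event notification mechanism between Xen and domains. Virtual machine interrupts are also delivered through event channels.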
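One way to picture event delivery is as a pair of bitmaps, one bit per port: a pending bit set by the sender and a mask bit controlled by the receiver. The sketch below is a single-threaded toy model under that assumption; the structure, sizes, and function names are invented for illustration, and real code would need atomic bit operations.

```c
#include <assert.h>
#include <limits.h>

#define TOY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define TOY_NR_WORDS 4
#define TOY_NR_PORTS (TOY_NR_WORDS * TOY_BITS_PER_LONG)

struct toy_evtchn_state {
    unsigned long pending[TOY_NR_WORDS]; /* one bit per port: event raised */
    unsigned long mask[TOY_NR_WORDS];    /* one bit per port: 1 = suppressed */
    unsigned long pending_sel;           /* one bit per word of pending[] */
};

/* Sender side: mark a port pending; flag its word unless the port is masked. */
void toy_evtchn_send(struct toy_evtchn_state *s, unsigned int port)
{
    assert(port < TOY_NR_PORTS);
    unsigned int word = port / TOY_BITS_PER_LONG;
    unsigned long bit = 1UL << (port % TOY_BITS_PER_LONG);

    s->pending[word] |= bit;
    if (!(s->mask[word] & bit))
        s->pending_sel |= 1UL << word;
}

/* Receiver side: return the lowest pending, unmasked port and clear its
 * pending bit, or -1 (and reset the selector) when nothing is deliverable. */
int toy_evtchn_next(struct toy_evtchn_state *s)
{
    for (unsigned int word = 0; word < TOY_NR_WORDS; word++) {
        unsigned long ready = s->pending[word] & ~s->mask[word];
        if (!ready)
            continue;
        for (unsigned int b = 0; b < TOY_BITS_PER_LONG; b++) {
            if (ready & (1UL << b)) {
                s->pending[word] &= ~(1UL << b);
                return (int)(word * TOY_BITS_PER_LONG + b);
            }
        }
    }
    s->pending_sel = 0;
    return -1;
}
```

A masked port keeps its pending bit, so the event is not lost; it is simply not reported until the receiver clears the mask.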
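3.1.5 Grant tables
Grant tables provide a shared-memory mechanism between domains. Unlike ordinary shared memory, a page can be accessed only after the domain that owns it has granted permission, which is where the name comes from.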
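The authorization step can be modelled with a small table owned by the granting domain: each entry names the domain allowed in, the frame being shared, and whether access is read-only. The following is a simplified sketch with hypothetical names and flag values, not the real grant-table entry format or its hypercall interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_GRANT_ENTRIES  8
#define TOY_GRANT_VALID    (1u << 0)  /* entry is in use (illustrative flag) */
#define TOY_GRANT_READONLY (1u << 1)  /* grantee may only read (illustrative) */

struct toy_grant_entry {
    uint16_t flags;
    uint16_t domid;   /* domain that is allowed to access the frame */
    uint32_t frame;   /* frame number being shared */
};

struct toy_grant_table {
    uint16_t owner;   /* domain that owns the shared memory */
    struct toy_grant_entry entries[TOY_GRANT_ENTRIES];
};

/* Owner side: publish a grant and return its reference, or -1 if full. */
int toy_grant_access(struct toy_grant_table *t, uint16_t grantee,
                     uint32_t frame, bool readonly)
{
    assert(t != NULL);
    for (int ref = 0; ref < TOY_GRANT_ENTRIES; ref++) {
        struct toy_grant_entry *e = &t->entries[ref];
        if (!(e->flags & TOY_GRANT_VALID)) {
            e->domid = grantee;
            e->frame = frame;
            e->flags = (uint16_t)(TOY_GRANT_VALID |
                                  (readonly ? TOY_GRANT_READONLY : 0u));
            return ref;
        }
    }
    return -1;
}

/* Checking side: may domain domid use grant ref for the requested access? */
bool toy_grant_check(const struct toy_grant_table *t, int ref,
                     uint16_t domid, bool want_write)
{
    assert(t != NULL);
    if (ref < 0 || ref >= TOY_GRANT_ENTRIES)
        return false;
    const struct toy_grant_entry *e = &t->entries[ref];
    if (!(e->flags & TOY_GRANT_VALID) || e->domid != domid)
        return false;
    if (want_write && (e->flags & TOY_GRANT_READONLY))
        return false;
    return true;
}
```

The point of the model is that the owner decides who gets in and how: a domain that was never named in an entry, or that asks for write access to a read-only grant, is refused.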
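3.1.6 XenStore and XenBus
XenStore is similar to the Windows registry. It stores each VM's configuration, information about front-end and back-end devices, virtual machine state, and so on. XenStore is a high-level communication mechanism built on two low-level ones: shared pages and event channels. It exposes a hierarchical directory, similar to the directory tree in Linux, through which clients can list directories and read and write values.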
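The hierarchical key-value idea can be sketched with a tiny in-memory store keyed by path strings. This is a toy model only: the toy_xs names, the fixed-size table, and the example paths are assumptions for illustration, and the sketch ignores the shared-page transport, permissions, and transactions of the real service.

```c
#include <assert.h>
#include <string.h>

#define TOY_XS_MAX_NODES 16
#define TOY_XS_MAX_LEN   64

struct toy_xs_node {
    char path[TOY_XS_MAX_LEN];
    char value[TOY_XS_MAX_LEN];
    int  used;
};

struct toy_xs {
    struct toy_xs_node nodes[TOY_XS_MAX_NODES];
};

/* Create or overwrite the value at path. Returns 0, or -1 if the store is full. */
int toy_xs_write(struct toy_xs *xs, const char *path, const char *value)
{
    assert(strlen(path) < TOY_XS_MAX_LEN && strlen(value) < TOY_XS_MAX_LEN);
    struct toy_xs_node *free_slot = NULL;
    for (int i = 0; i < TOY_XS_MAX_NODES; i++) {
        struct toy_xs_node *n = &xs->nodes[i];
        if (n->used && strcmp(n->path, path) == 0) {
            strcpy(n->value, value);
            return 0;
        }
        if (!n->used && free_slot == NULL)
            free_slot = n;
    }
    if (free_slot == NULL)
        return -1;
    strcpy(free_slot->path, path);
    strcpy(free_slot->value, value);
    free_slot->used = 1;
    return 0;
}

/* Return the value stored at path, or NULL if there is none. */
const char *toy_xs_read(const struct toy_xs *xs, const char *path)
{
    for (int i = 0; i < TOY_XS_MAX_NODES; i++)
        if (xs->nodes[i].used && strcmp(xs->nodes[i].path, path) == 0)
            return xs->nodes[i].value;
    return NULL;
}

/* Count the entries stored anywhere below directory dir. */
int toy_xs_count_under(const struct toy_xs *xs, const char *dir)
{
    size_t len = strlen(dir);
    int count = 0;
    for (int i = 0; i < TOY_XS_MAX_NODES; i++) {
        const struct toy_xs_node *n = &xs->nodes[i];
        if (n->used && strncmp(n->path, dir, len) == 0 && n->path[len] == '/')
            count++;
    }
    return count;
}
```

Storing full paths in a flat table keeps the sketch short; a real implementation would more likely keep an actual tree so that directory listings do not require a scan.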
XenBus can be regarded as a virtual bus. It emulates a concrete physical bus; later chapters discuss XenBus in detail.
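3.2 Virtualization data structures
As discussed earlier, the VMM uses VCPUs to keep VMs isolated from one another and to schedule virtual machines. The VMM defines several data structures for these tasks; the four most important are:
- vcpu: stores the basic information of a VCPU.
- arch_vcpu: stores the architecture-specific state of a VCPU.
- vcpu_guest_context: stores the registers and descriptor-table information saved across a context switch.
- vcpu_info: the per-VCPU information in the shared info page that is visible to the Guest OS.
3.2.1 The vcpu structure
The vcpu structure stores the basic information of a VCPU and embeds an arch_vcpu structure (the arch member). The basic information includes the CPU ID, scheduling-related information, and VCPU state.
Listing 2-1 The vcpu structure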
struct vcpu
{
    int              vcpu_id;
    int              processor;
    vcpu_info_t     *vcpu_info;
    struct domain   *domain;
    struct vcpu     *next_in_list;
    uint64_t         periodic_period;
    uint64_t         periodic_last_event;
    struct timer     periodic_timer;
    struct timer     singleshot_timer;
    struct timer     poll_timer;    /* timeout for SCHEDOP_poll */
    void            *sched_priv;    /* scheduler-specific data */
    struct vcpu_runstate_info runstate;
    /* Has the FPU been initialised? */
    bool_t           fpu_initialised;
    /* Has the FPU been used since it was last saved? */
    bool_t           fpu_dirtied;
    /* Is this VCPU polling any event channels (SCHEDOP_poll)? */
    bool_t           is_polling;
    /* Initialization completed for this VCPU? */
    bool_t           is_initialised;
    /* Currently running on a CPU? */
    bool_t           is_running;
    /* NMI callback pending for this VCPU? */
    bool_t           nmi_pending;
    /* Avoid NMI reentry by allowing NMIs to be masked for short periods. */
    bool_t           nmi_masked;
    /* Require shutdown to be deferred for some asynchronous operation? */
    bool_t           defer_shutdown;
    /* VCPU is paused following shutdown request (d->is_shutting_down)? */
    bool_t           paused_for_shutdown;
    unsigned long    pause_flags;
    atomic_t         pause_count;
    u16              virq_to_evtchn[NR_VIRQS];
    /* Bitmask of CPUs on which this VCPU may run. */
    cpumask_t        cpu_affinity;
    unsigned long    nmi_addr;      /* NMI callback address. */
    /* Bitmask of CPUs which are holding onto this VCPU's state. */
    cpumask_t        vcpu_dirty_cpumask;
    struct arch_vcpu arch;
};
Each domain can own multiple VCPUs, which are chained into a singly linked list through the next_in_list member. Walking this list visits every VCPU in the domain.
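The list traversal can be sketched with two cut-down structures. Only vcpu_id and next_in_list come from the listing; the toy_ structure names, the vcpu_list head field, and the helper functions are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-ins: only the fields needed to walk the list. */
struct toy_vcpu {
    int              vcpu_id;
    struct toy_vcpu *next_in_list;
};

struct toy_domain {
    struct toy_vcpu *vcpu_list;   /* hypothetical head pointer */
};

/* Append v to the end of the domain's VCPU list. */
void toy_domain_add_vcpu(struct toy_domain *d, struct toy_vcpu *v)
{
    assert(d != NULL && v != NULL);
    v->next_in_list = NULL;
    struct toy_vcpu **link = &d->vcpu_list;
    while (*link != NULL)
        link = &(*link)->next_in_list;
    *link = v;
}

/* Walk the list once and count the VCPUs in the domain. */
int toy_domain_count_vcpus(const struct toy_domain *d)
{
    int n = 0;
    for (const struct toy_vcpu *v = d->vcpu_list; v != NULL; v = v->next_in_list)
        n++;
    return n;
}
```

Any per-VCPU operation on a domain (pausing, waking, collecting statistics) follows the same loop shape as toy_domain_count_vcpus.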
The runstate member records the VCPU's current state and the time it has spent in each state.
Listing 2-2 VCPU run state
struct vcpu_runstate_info {
    /* VCPU's current state (RUNSTATE_*). */
    int      state;
    /* When was current state entered (system time, ns)? */
    uint64_t state_entry_time;
    /*
     * Time spent in each RUNSTATE_* (ns). The sum of these times is
     * guaranteed not to drift from system time.
     */
    uint64_t time[4];
};
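As the listing shows, time is an array of four elements, one per VCPU state: running, runnable, blocked, and offline. A running VCPU is currently executing on a physical CPU; a runnable VCPU is ready to run but has not yet been assigned a physical CPU; a blocked VCPU is waiting for some resource before it can run; an offline VCPU is neither runnable nor blocked, for example because it has been paused.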
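The per-state accounting can be sketched as follows: on every state change, the time spent in the old state is added to its slot in time[] before the new state is recorded. The structure mirrors the three fields in the listing; the toy_ names and the transition helper are assumptions for illustration, not Xen's own update routine.

```c
#include <assert.h>
#include <stdint.h>

enum {
    TOY_RUNSTATE_RUNNING  = 0,
    TOY_RUNSTATE_RUNNABLE = 1,
    TOY_RUNSTATE_BLOCKED  = 2,
    TOY_RUNSTATE_OFFLINE  = 3,
    TOY_NR_RUNSTATES      = 4
};

struct toy_runstate {
    int      state;                    /* current state */
    uint64_t state_entry_time;         /* when the current state was entered (ns) */
    uint64_t time[TOY_NR_RUNSTATES];   /* total time spent in each state (ns) */
};

/* Charge the elapsed time to the old state, then enter the new one. */
void toy_runstate_change(struct toy_runstate *r, int new_state, uint64_t now)
{
    assert(new_state >= 0 && new_state < TOY_NR_RUNSTATES);
    assert(now >= r->state_entry_time);
    r->time[r->state] += now - r->state_entry_time;
    r->state = new_state;
    r->state_entry_time = now;
}
```

Because every nanosecond between two transitions is charged to exactly one state, the four counters always add up to the time elapsed since accounting began.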
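3.2.2 arch_vcpu
The arch_vcpu structure depends on the physical CPU: the information it stores is specific to the CPU architecture. It typically includes function pointers used in VCPU scheduling, stack information, and context-switch state. Each CPU architecture has its own arch_vcpu; the x86 version is shown below.
Listing 2-3 arch_vcpu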
struct arch_vcpu
{
    /* Needs 16-byte alignment for FXSAVE/FXRSTOR. */
    struct vcpu_guest_context guest_context
    __attribute__((__aligned__(16)));
    struct pae_l3_cache pae_l3_cache;
    unsigned long      flags; /* TF_ */
    void (*schedule_tail) (struct vcpu *);
    void (*ctxt_switch_from) (struct vcpu *);
    void (*ctxt_switch_to) (struct vcpu *);
    /* Bounce information for propagating an exception to guest OS. */
    struct trap_bounce trap_bounce;
    /* I/O-port access bitmap. */
    XEN_GUEST_HANDLE(uint8_t) iobmp; /* Guest kernel virtual address of the bitmap. */
    int iobmp_limit;  /* Number of ports represented in the bitmap. */
    int iopl;         /* Current IOPL for this VCPU. */
    struct desc_struct int80_desc;
    /* Virtual Machine Extensions */
    struct hvm_vcpu hvm_vcpu;
    l1_pgentry_t *perdomain_ptes;
    pagetable_t guest_table;            /* (MFN) guest notion of cr3 */
    /* guest_table holds a ref to the page, and also a type-count unless
     * shadow refcounts are in use */
    pagetable_t shadow_table[4];        /* (MFN) shadow(s) of guest */
    pagetable_t monitor_table;          /* (MFN) hypervisor PT (for HVM) */
    unsigned long cr3;                  /* (MA) value to install in HW CR3 */
    /* Current LDT details. */
    unsigned long shadow_ldt_mapcnt;
    struct paging_vcpu paging;
} __cacheline_aligned;
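The guest_context member holds the register state saved and restored on a context switch, together with descriptor information such as the GDT and LDT.
The function pointers ctxt_switch_from and ctxt_switch_to are called when switching between VCPUs; Xen's paravirtualized and fully virtualized guests each have their own implementations.
shadow_table holds information related to shadow page tables.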
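The role of ctxt_switch_from and ctxt_switch_to can be illustrated with a small model in which each VCPU carries its own pair of hooks, chosen when the VCPU is set up. Only the two hook names come from the listing; the toy_ types, the PV and HVM stub functions, and the last_hook field are invented for this sketch and do no real register saving.

```c
#include <assert.h>
#include <stddef.h>

/* Records which hook ran last, so the effect of a switch is observable. */
enum toy_hook {
    TOY_HOOK_NONE, TOY_PV_SAVE, TOY_PV_LOAD, TOY_HVM_SAVE, TOY_HVM_LOAD
};

struct toy_vcpu;

struct toy_arch_vcpu {
    void (*ctxt_switch_from)(struct toy_vcpu *);
    void (*ctxt_switch_to)(struct toy_vcpu *);
};

struct toy_vcpu {
    int                  vcpu_id;
    enum toy_hook        last_hook;
    struct toy_arch_vcpu arch;
};

/* Stub implementations standing in for the two guest types. */
static void toy_pv_from(struct toy_vcpu *v)  { v->last_hook = TOY_PV_SAVE; }
static void toy_pv_to(struct toy_vcpu *v)    { v->last_hook = TOY_PV_LOAD; }
static void toy_hvm_from(struct toy_vcpu *v) { v->last_hook = TOY_HVM_SAVE; }
static void toy_hvm_to(struct toy_vcpu *v)   { v->last_hook = TOY_HVM_LOAD; }

/* Pick the hook pair once, according to the kind of guest. */
void toy_vcpu_init(struct toy_vcpu *v, int id, int is_hvm)
{
    assert(v != NULL);
    v->vcpu_id = id;
    v->last_hook = TOY_HOOK_NONE;
    v->arch.ctxt_switch_from = is_hvm ? toy_hvm_from : toy_pv_from;
    v->arch.ctxt_switch_to   = is_hvm ? toy_hvm_to   : toy_pv_to;
}

/* Generic switch path: it never needs to know which guest type it handles. */
void toy_context_switch(struct toy_vcpu *prev, struct toy_vcpu *next)
{
    assert(prev != NULL && next != NULL);
    if (prev == next)
        return;
    prev->arch.ctxt_switch_from(prev);
    next->arch.ctxt_switch_to(next);
}
```

Keeping the hooks in the per-VCPU structure means the common switch code stays identical for both guest types; only the functions behind the pointers differ.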
3.2.3 vcpu_guest_context
Listing 2-4 vcpu_guest_context
struct vcpu_guest_context {
    /* FPU registers come first so they can be aligned for FXSAVE/FXRSTOR. */
    struct { char x[512]; } fpu_ctxt;       /* User-level FPU registers */
#define VGCF_I387_VALID                (1<<0)
#define VGCF_IN_KERNEL                 (1<<2)
#define _VGCF_i387_valid               0
#define VGCF_i387_valid                (1<<_VGCF_i387_valid)
#define _VGCF_in_kernel                2
#define VGCF_in_kernel                 (1<<_VGCF_in_kernel)
#define _VGCF_failsafe_disables_events 3
#define VGCF_failsafe_disables_events  (1<<_VGCF_failsafe_disables_events)
#define _VGCF_syscall_disables_events  4
#define VGCF_syscall_disables_events   (1<<_VGCF_syscall_disables_events)
#define _VGCF_online                   5
#define VGCF_online                    (1<<_VGCF_online)
    unsigned long flags;                    /* VGCF_* flags */
    struct cpu_user_regs user_regs;         /* User-level CPU registers */
    struct trap_info trap_ctxt[256];        /* Virtual IDT */
    unsigned long ldt_base, ldt_ents;       /* LDT (linear address, # ents) */
    unsigned long gdt_frames[16], gdt_ents; /* GDT (machine frames, # ents) */
    unsigned long kernel_ss, kernel_sp;     /* Virtual TSS (only SS1/SP1) */
    /* NB. User pagetable on x86/64 is placed in ctrlreg[1]. */
    unsigned long ctrlreg[8];               /* CR0-CR7 (control registers) */
    unsigned long debugreg[8];              /* DB0-DB7 (debug registers) */
#ifdef __i386__
    unsigned long event_callback_cs;        /* CS:EIP of event callback */
    unsigned long event_callback_eip;
    unsigned long failsafe_callback_cs;     /* CS:EIP of failsafe callback */
    unsigned long failsafe_callback_eip;
#else
    unsigned long event_callback_eip;
    unsigned long failsafe_callback_eip;
#ifdef __XEN__
    union {
        unsigned long syscall_callback_eip;
        struct {
            unsigned int event_callback_cs;    /* compat CS of event cb */
            unsigned int failsafe_callback_cs; /* compat CS of failsafe cb */
        };
    };
#else
    unsigned long syscall_callback_eip;
#endif
#endif
    unsigned long vm_assist;                /* VMASST_TYPE_* bitmap */
#ifdef __x86_64__
    /* Segment base addresses. */
    uint64_t fs_base;
    uint64_t gs_base_kernel;
    uint64_t gs_base_user;
#endif
};
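As the listing shows, this structure stores the values of control registers CR0 to CR7 (ctrlreg), the EIP addresses at which execution re-enters the guest (such as event_callback_eip and failsafe_callback_eip), and the GDT, LDT, and virtual TSS information.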
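The flags member is a bitmap of VGCF_* values. A small sketch of how such flags are typically set and tested is shown below; the two bit values are taken from the listing, while the one-field toy_guest_context structure and the helper functions are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Bit values as given in the vcpu_guest_context listing. */
#define VGCF_in_kernel (1<<2)
#define VGCF_online    (1<<5)

/* One-field stand-in for the real structure. */
struct toy_guest_context {
    unsigned long flags;   /* VGCF_* flags */
};

/* Set or clear the online flag without disturbing the other bits. */
void toy_ctx_set_online(struct toy_guest_context *c, int online)
{
    assert(c != NULL);
    if (online)
        c->flags |= VGCF_online;
    else
        c->flags &= ~(unsigned long)VGCF_online;
}

int toy_ctx_is_online(const struct toy_guest_context *c)
{
    return (c->flags & VGCF_online) != 0;
}

int toy_ctx_in_kernel(const struct toy_guest_context *c)
{
    return (c->flags & VGCF_in_kernel) != 0;
}
```

Each flag occupies its own bit, so independent properties of the context can be combined in a single word and queried separately.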
3.2.4 vcpu_info
Listing 2-5 vcpu_info
struct vcpu_info {
    /*
     * 'evtchn_upcall_pending' is written non-zero by Xen to indicate
     * a pending notification for a particular VCPU. It is then cleared
     * by the guest OS /before/ checking for pending work, thus avoiding
     * a set-and-check race. Note that the mask is only accessed by Xen
     * on the CPU that is currently hosting the VCPU. This means that the
     * pending and mask flags can be updated by the guest without special
     * synchronisation (i.e., no need for the x86 LOCK prefix).
     * This may seem suboptimal because if the pending flag is set by
     * a different CPU then an IPI may be scheduled even when the mask
     * is set. However, note:
     *  1. The task of 'interrupt holdoff' is covered by the per-event-
     *     channel mask bits. A 'noisy' event that is continually being
     *     triggered can be masked at source at this very precise
     *     granularity.
     *  2. The main purpose of the per-VCPU mask is therefore to restrict
     *     reentrant execution: whether for concurrency control, or to
     *     prevent unbounded stack usage. Whatever the purpose, we expect
     *     that the mask will be asserted only for short periods at a time,
     *     and so the likelihood of a 'spurious' IPI is suitably small.
     * The mask is read before making an event upcall to the guest: a
     * non-zero mask therefore guarantees that the VCPU will not receive
     * an upcall activation. The mask is cleared when the VCPU requests
     * to block: this avoids wakeup-waiting races.
     */
    uint8_t evtchn_upcall_pending;
    uint8_t evtchn_upcall_mask;
    unsigned long evtchn_pending_sel;
    struct arch_vcpu_info arch;
    struct vcpu_time_info time;
}; /* 64 bytes (x86) */
vcpu_info lives in the shared info page, so the Guest OS can access it. It holds event-channel state and system time information.
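The pending/mask handshake described in the structure's comment can be sketched as a single-threaded model. The three field names come from the listing; the toy_ structure and the two functions are hypothetical, and the plain loads and stores used here would need atomic operations in real code.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Only the three event-related fields of vcpu_info are modelled. */
struct toy_vcpu_info {
    uint8_t       evtchn_upcall_pending;
    uint8_t       evtchn_upcall_mask;
    unsigned long evtchn_pending_sel;
};

/* Hypervisor side: record the notification and report whether an upcall
 * may be delivered right now (it may not while the mask is non-zero). */
int toy_notify(struct toy_vcpu_info *vi, unsigned int sel_bit)
{
    assert(sel_bit < sizeof(unsigned long) * CHAR_BIT);
    vi->evtchn_pending_sel |= 1UL << sel_bit;
    vi->evtchn_upcall_pending = 1;
    return vi->evtchn_upcall_mask == 0;
}

/* Guest side: clear the pending flag BEFORE looking for work, then take
 * the selector bits. A notification arriving later sets the flag again,
 * so it cannot be lost. */
unsigned long toy_guest_collect(struct toy_vcpu_info *vi)
{
    vi->evtchn_upcall_pending = 0;
    unsigned long sel = vi->evtchn_pending_sel;
    vi->evtchn_pending_sel = 0;
    return sel;
}
```

While the mask is set, notifications still accumulate in evtchn_pending_sel; they are simply picked up the next time the guest collects its pending work.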
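3.3 VCPU creation and scheduling
VCPUs and domains are inseparable: when a domain is created, VCPUs must be allocated for it as well.
Every VCPU is scheduled by the scheduler. This section first examines how VCPUs are created and initialized, then how the scheduler schedules them.
3.3.1 VCPU creation and initialization
During Xen's initialization, init_idle_domain creates a domain and allocates a VCPU for it. From here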