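Understanding Linux Kernel Design (13): Memory Management and the Process Address Space
[Copyright notice: please respect original work. When reposting, keep the source: blog.csdn.net/shallnet. This article is for study and exchange only; do not use it for commercial purposes.]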
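A process address space consists of the virtual memory that the process can address. The Linux virtual address space spans 0 to 4 GB (note: this section uses 32-bit systems as the example throughout). The Linux kernel divides these 4 GB into two parts. The highest 1 GB (virtual addresses 0xC0000000 to 0xFFFFFFFF) is used by the kernel and is called "kernel space".
The lower 3 GB (virtual addresses 0x00000000 to 0xBFFFFFFF) is used by the individual processes and is called "user space".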
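The 3 GB / 1 GB split described above can be sketched as a simple address classifier. This is a minimal user-space illustration, not kernel code: the constant SKETCH_PAGE_OFFSET and both function names are hypothetical, and the boundary 0xC0000000 is taken from the 32-bit layout given in the text.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed boundary: first kernel-space address in the 32-bit 3G/1G split. */
#define SKETCH_PAGE_OFFSET 0xC0000000u

/* The boundary sits exactly 3 GB above address 0 (checked at compile time, C11). */
static_assert(SKETCH_PAGE_OFFSET == 3u * 1024u * 1024u * 1024u,
              "user space is 3 GB in this layout");

/* Returns 1 if addr lies in kernel space (0xC0000000..0xFFFFFFFF), else 0. */
int is_kernel_address(uint32_t addr)
{
    return addr >= SKETCH_PAGE_OFFSET;
}

/* Returns 1 if addr lies in user space (0x00000000..0xBFFFFFFF), else 0. */
int is_user_address(uint32_t addr)
{
    return !is_kernel_address(addr);
}
```

For example, is_user_address(0x08048000u) is true for the executable's code segment address that appears later in the maps output, while is_kernel_address(0xC0000000u) is true for the first kernel-space address.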
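Because every process can enter the kernel through system calls, the Linux kernel is shared by all processes in the system. From the point of view of an individual process, therefore, each process has 4 GB of virtual address space.
A process may only access legal addresses. If a process accesses an address outside its legal address space, the kernel terminates the process and reports a "segmentation fault".
Where are the legal regions of virtual memory? Let us first look at how a process's virtual address space is laid out:
The stack is placed at the top of the virtual address space, while the data segment and the code segment sit at the bottom. The hole in between is the space that can be allocated dynamically while the process runs. It holds mappings of kernel address space content, dynamically requested address space, the code or data of shared libraries, and so on.
Within the virtual address space, only addresses that have been mapped to physical storage are legal. Each legal piece of the address space corresponds to a separate virtual memory area (VMA), and the address space of a process is made up of these memory areas.
The kernel describes the whole address space of a process with struct mm_struct: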
struct mm_struct {
    struct vm_area_struct * mmap;        /* list of VMAs */
    struct rb_root mm_rb;
    struct vm_area_struct * mmap_cache;  /* last find_vma result */
    unsigned long (*get_unmapped_area) (struct file *filp,
                unsigned long addr, unsigned long len,
                unsigned long pgoff, unsigned long flags);
    void (*unmap_area) (struct mm_struct *mm, unsigned long addr);
    unsigned long mmap_base;             /* base of mmap area */
    unsigned long task_size;             /* size of task vm space */
    unsigned long cached_hole_size;      /* if non-zero, the largest hole below free_area_cache */
    unsigned long free_area_cache;       /* first hole of size cached_hole_size or larger */
    pgd_t * pgd;
    atomic_t mm_users;                   /* How many users with user space? */
    atomic_t mm_count;                   /* How many references to "struct mm_struct" (users count as 1) */
    int map_count;                       /* number of VMAs */
    struct rw_semaphore mmap_sem;
    spinlock_t page_table_lock;          /* Protects page tables and some counters */

    struct list_head mmlist;             /* List of maybe swapped mm's. These are globally strung
                                          * together off init_mm.mmlist, and are protected
                                          * by mmlist_lock
                                          */

    /* Special counters, in some configurations protected by the
     * page_table_lock, in other configurations by being atomic.
     */
    mm_counter_t _file_rss;
    mm_counter_t _anon_rss;

    unsigned long hiwater_rss;           /* High-watermark of RSS usage */
    unsigned long hiwater_vm;            /* High-water virtual memory usage */

    unsigned long total_vm, locked_vm, shared_vm, exec_vm;
    unsigned long stack_vm, reserved_vm, def_flags, nr_ptes;
    unsigned long start_code, end_code, start_data, end_data;
    unsigned long start_brk, brk, start_stack;
    unsigned long arg_start, arg_end, env_start, env_end;

    unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */

    struct linux_binfmt *binfmt;

    cpumask_t cpu_vm_mask;

    /* Architecture-specific MM context */
    mm_context_t context;

    /* Swap token stuff */
    /*
     * Last value of global fault stamp as seen by this process.
     * In other words, this value gives an indication of how long
     * it has been since this task got the token.
     * Look at mm/thrash.c
     */
    unsigned int faultstamp;
    unsigned int token_priority;
    unsigned int last_interval;

    unsigned long flags;                 /* Must use atomic bitops to access the bits */

    struct core_state *core_state;       /* coredumping support */
#ifdef CONFIG_AIO
    spinlock_t ioctx_lock;
    struct hlist_head ioctx_list;
#endif
#ifdef CONFIG_MM_OWNER
    /*
     * "owner" points to a task that is regarded as the canonical
     * user/owner of this mm. All of the following must be true in
     * order for it to be changed:
     *
     * current == mm->owner
     * current->mm != mm
     * new_owner->mm == mm
     * new_owner->alloc_lock is held
     */
    struct task_struct *owner;
#endif

#ifdef CONFIG_PROC_FS
    /* store ref to file /proc/<pid>/exe symlink points to */
    struct file *exe_file;
    unsigned long num_exe_file_vmas;
#endif
#ifdef CONFIG_MMU_NOTIFIER
    struct mmu_notifier_mm *mmu_notifier_mm;
#endif
};
/*
 * This struct defines a memory VMM memory area. There is one of these
 * per VM-area/task. A VM area is any part of the process virtual memory
 * space that has a special rule for the page-fault handlers (ie a shared
 * library, the executable area etc).
 */
struct vm_area_struct {
    struct mm_struct * vm_mm;       /* The address space we belong to. */
    unsigned long vm_start;         /* Our start address within vm_mm. */
    unsigned long vm_end;           /* The first byte after our end address
                                       within vm_mm. */

    /* linked list of VM areas per task, sorted by address */
    struct vm_area_struct *vm_next;

    pgprot_t vm_page_prot;          /* Access permissions of this VMA. */
    unsigned long vm_flags;         /* Flags, see mm.h. */

    struct rb_node vm_rb;

    /*
     * For areas with an address space and backing store,
     * linkage into the address_space->i_mmap prio tree, or
     * linkage to the list of like vmas hanging off its node, or
     * linkage of vma in the address_space->i_mmap_nonlinear list.
     */
    union {
        struct {
            struct list_head list;
            void *parent;           /* aligns with prio_tree_node parent */
            struct vm_area_struct *head;
        } vm_set;

        struct raw_prio_tree_node prio_tree_node;
    } shared;

    /*
     * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
     * list, after a COW of one of the file pages. A MAP_SHARED vma
     * can only be in the i_mmap tree. An anonymous MAP_PRIVATE, stack
     * or brk vma (with NULL file) can only be in an anon_vma list.
     */
    struct list_head anon_vma_node; /* Serialized by anon_vma->lock */
    struct anon_vma *anon_vma;      /* Serialized by page_table_lock */

    /* Function pointers to deal with this struct. */
    const struct vm_operations_struct *vm_ops;

    /* Information about our backing store: */
    unsigned long vm_pgoff;         /* Offset (within vm_file) in PAGE_SIZE
                                       units, *not* PAGE_CACHE_SIZE */
    struct file * vm_file;          /* File we map to (can be NULL). */
    void * vm_private_data;         /* was vm_pte (shared mem) */
    unsigned long vm_truncate_count;/* truncate_count or restart_addr */

#ifndef CONFIG_MMU
    struct vm_region *vm_region;    /* NOMMU mapping region */
#endif
#ifdef CONFIG_NUMA
    struct mempolicy *vm_policy;    /* NUMA policy for the VMA */
#endif
};
The vm_area_struct structure describes a single memory range over a contiguous interval of the process address space. Every memory area is represented by one such structure, and the structures of a process are chained into a linked list sorted by address. In the definition above the link is the vm_next pointer; there is no back pointer, so the list is singly linked. Besides the list, Linux also organizes the vm_area_struct objects in the red-black tree mm_rb. With this tree, Linux can quickly locate the area that contains a given virtual address.
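The vm_mm member points to the address space structure (mm_struct) that the area belongs to. So when two different processes map the same file into their own address spaces, each of them has its own vm_area_struct to identify its memory area. Two threads that share an address space have only one vm_area_struct for the area, because they use the same process address space.
You can use the cat /proc/PID/maps command or the pmap command to view the address space of a given process and the memory areas it contains.
Take the process with PID 17192 on the author's system as an example.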
# cat /proc/17192/maps    // show all memory areas in this process's address space
001e3000-00201000 r-xp 00000000 fd:00 789547     /lib/ld-2.12.so
00201000-00202000 r--p 0001d000 fd:00 789547     /lib/ld-2.12.so
00202000-00203000 rw-p 0001e000 fd:00 789547     /lib/ld-2.12.so
00209000-00399000 r-xp 00000000 fd:00 789548     /lib/libc-2.12.so
00399000-0039a000 ---p 00190000 fd:00 789548     /lib/libc-2.12.so
0039a000-0039c000 r--p 00190000 fd:00 789548     /lib/libc-2.12.so
0039c000-0039d000 rw-p 00192000 fd:00 789548     /lib/libc-2.12.so
0039d000-003a0000 rw-p 00000000 00:00 0
08048000-08049000 r-xp 00000000 fd:00 1191771    /home/allen/Myprojects/blog/conn_user_kernel/test/a.out
08049000-0804a000 rw-p 00000000 fd:00 1191771    /home/allen/Myprojects/blog/conn_user_kernel/test/a.out
b7755000-b7756000 rw-p 00000000 00:00 0
b776d000-b776e000 rw-p 00000000 00:00 0
b776e000-b776f000 r-xp 00000000 00:00 0          [vdso]
bfc9f000-bfcb4000 rw-p 00000000 00:00 0          [stack]
#
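Each line of /proc/PID/maps has a fixed layout: address range, permissions, file offset, device, inode, and an optional path. As a minimal user-space sketch, a line can be parsed with sscanf(), and the size in KB that pmap prints can be derived from the address range. The struct maps_entry type and both function names are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one parsed line of /proc/PID/maps. */
struct maps_entry {
    unsigned long start, end;      /* [start, end) virtual address range */
    unsigned long offset;          /* offset into the backing file */
    unsigned long inode;           /* 0 for anonymous areas */
    unsigned int dev_major, dev_minor;
    char perms[5];                 /* e.g. "r-xp" */
    char path[256];                /* empty string if the line has no path */
};

/* Parses one maps line. Returns 0 on success, -1 if the line is malformed. */
int parse_maps_line(const char *line, struct maps_entry *e)
{
    assert(line != NULL && e != NULL);
    memset(e, 0, sizeof(*e));
    int n = sscanf(line, "%lx-%lx %4s %lx %x:%x %lu %255s",
                   &e->start, &e->end, e->perms, &e->offset,
                   &e->dev_major, &e->dev_minor, &e->inode, e->path);
    /* The path is optional, so 7 converted fields are enough. */
    return n >= 7 ? 0 : -1;
}

/* Size of the area in KB, the unit used in the pmap listing. */
unsigned long maps_entry_kb(const struct maps_entry *e)
{
    return (e->end - e->start) / 1024;
}
```

Applied to the first line of the listing above (001e3000-00201000), maps_entry_kb() yields 120, which agrees with the 120K that pmap reports for the same area.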
# pmap 17192
17192:   ./a.out
001e3000    120K r-x--  /lib/ld-2.12.so      // this line and the next two: code, data, and bss segments of the dynamic linker ld.so
00201000      4K r----  /lib/ld-2.12.so
00202000      4K rw---  /lib/ld-2.12.so
00209000   1600K r-x--  /lib/libc-2.12.so    // this line and the following ones: code, data, and bss segments of libc.so in the C library
00399000      4K -----  /lib/libc-2.12.so
0039a000      8K r----  /lib/libc-2.12.so
0039c000      4K rw---  /lib/libc-2.12.so
0039d000     12K rw---    [ anon ]
08048000      4K r-x--  /home/allen/Myprojects/blog/conn_user_kernel/test/a.out    // code segment of the executable
08049000      4K rw---  /home/allen/Myprojects/blog/conn_user_kernel/test/a.out    // data segment of the executable
b7755000      4K rw---    [ anon ]
b776d000      4K rw---    [ anon ]
b776e000      4K r-x--    [ anon ]
bfc9f000     84K rw---    [ stack ]          // stack segment
 total     1860K
The vm_ops field of vm_area_struct points to the table of operations associated with the memory area. The kernel manipulates a VMA through the methods in this table. The table is represented by struct vm_operations_struct, defined in <include/linux/mm.h>:
/*
 * These are the virtual MM functions - opening of an area, closing and
 * unmapping it (needed to keep files on disk up-to-date etc), pointer
 * to the functions called when a no-page or a wp-page exception occurs.
 */
struct vm_operations_struct {
    void (*open)(struct vm_area_struct * area);   // called when the memory area is added to an address space
    void (*close)(struct vm_area_struct * area);  // called when the memory area is removed from an address space
    int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);  // called by the page fault handler when a page that is not present in physical memory is accessed

    /* notification that a previously read-only page is about to become
     * writable, if an error is returned it will cause a SIGBUS */
    int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);

    /* called by access_process_vm when get_user_pages() fails, typically
     * for use by special VMAs that can switch between memory and hardware
     */
    int (*access)(struct vm_area_struct *vma, unsigned long addr,
                  void *buf, int len, int write);
#ifdef CONFIG_NUMA
    ......
#endif
};
Inside the kernel, given a virtual address that belongs to some process, it is often necessary to find the area the address falls in and the corresponding vm_area_struct. This is done by find_vma(), which performs the search on the red-black tree.
The function is defined in <mm/mmap.c>:
/* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
{
    struct vm_area_struct *vma = NULL;

    if (mm) {
        /* First check the most recently used memory area to see whether the cached VMA contains the requested address. */
        /* (The cache hit rate is close to 35%.) */
        vma = mm->mmap_cache;
        // If the cached VMA does not contain the address, search the red-black tree.
        if (!(vma && vma->vm_end > addr && vma->vm_start <= addr)) {
            struct rb_node * rb_node;

            rb_node = mm->mm_rb.rb_node;
            vma = NULL;

            while (rb_node) {
                struct vm_area_struct * vma_tmp;

                vma_tmp = rb_entry(rb_node,
                        struct vm_area_struct, vm_rb);

                if (vma_tmp->vm_end > addr) {
                    vma = vma_tmp;
                    if (vma_tmp->vm_start <= addr)
                        break;
                    rb_node = rb_node->rb_left;
                } else
                    rb_node = rb_node->rb_right;
            }
            if (vma)
                mm->mmap_cache = vma;
        }
    }
    return vma;
}
When a program's image starts to run, the executable image must be loaded into the virtual address space of the process. If the process uses any shared library, that library must be loaded into the process's virtual address space as well.
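The lookup rule of find_vma() (return the first area whose vm_end is greater than addr, which may or may not contain addr) is easy to misread, so here is a minimal user-space sketch of the same rule. It is an illustration under stated assumptions, not the kernel's implementation: it replaces the red-black tree with a binary search over an array sorted by address, and struct sketch_vma and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for vm_area_struct. */
struct sketch_vma {
    unsigned long vm_start;   /* first address of the area */
    unsigned long vm_end;     /* first byte after the area */
};

/*
 * Returns the first area with addr < vm_end, or NULL if there is none.
 * 'areas' must be sorted by address and must not overlap.
 */
const struct sketch_vma *sketch_find_vma(const struct sketch_vma *areas,
                                         size_t n, unsigned long addr)
{
    const struct sketch_vma *found = NULL;
    size_t lo = 0, hi = n;

    assert(areas != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (areas[mid].vm_end > addr) {
            found = &areas[mid];          /* candidate; a lower one may exist */
            if (areas[mid].vm_start <= addr)
                break;                    /* addr lies inside this area */
            hi = mid;                     /* corresponds to going left in the tree */
        } else {
            lo = mid + 1;                 /* corresponds to going right in the tree */
        }
    }
    return found;
}

/* An address is legal only if the area found actually contains it. */
int sketch_addr_is_mapped(const struct sketch_vma *areas, size_t n,
                          unsigned long addr)
{
    const struct sketch_vma *vma = sketch_find_vma(areas, n, addr);

    return vma != NULL && vma->vm_start <= addr;
}
```

Note that for an address in a hole between two areas, the lookup returns the next higher area rather than NULL, so a caller still has to compare vm_start with the address, as sketch_addr_is_mapped() does.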
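In fact, Linux does not load the image into physical memory up front. Instead, the executable file is merely linked into the virtual address space of the process. As the program runs, the parts that it references are loaded into physical memory by the operating system. This way of linking an image into the process address space is called "memory mapping".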
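From user space, the same idea is available through the POSIX mmap() call. The following is a minimal sketch, assuming a POSIX system with a writable temporary directory: it writes a string to a temporary file, maps the file into the address space, and reads the content back through the mapping instead of through read(). The function name mmap_roundtrip is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Writes 'text' to a temporary file, maps the file, and compares the
 * mapped bytes with 'text'. Returns 1 if they match, 0 if they differ,
 * and -1 if any step fails.
 */
int mmap_roundtrip(const char *text)
{
    size_t len = strlen(text);
    assert(len > 0);                       /* mmap() rejects a zero length */

    FILE *f = tmpfile();
    if (f == NULL)
        return -1;
    if (fwrite(text, 1, len, f) != len || fflush(f) != 0) {
        fclose(f);
        return -1;
    }

    /* Link the file into the address space; pages are loaded on access. */
    void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fileno(f), 0);
    if (p == MAP_FAILED) {
        fclose(f);
        return -1;
    }

    int same = memcmp(p, text, len) == 0;  /* this access touches the mapped page */
    munmap(p, len);
    fclose(f);
    return same;
}
```

While such a mapping exists, it shows up as one more line in /proc/PID/maps of the calling process, in the same way as the shared libraries in the listing earlier.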
static inline unsigned long do_mmap(struct file *file, unsigned long addr,
        unsigned long len, unsigned long prot,
        unsigned long flag, unsigned long offset)
{
    unsigned long ret = -EINVAL;
    if ((offset + PAGE_ALIGN(len)) < offset)
        goto out;
    if (!(offset & ~PAGE_MASK))
        ret = do_mmap_pgoff(file, addr, len, prot, flag, offset >> PAGE_SHIFT);
out:
    return ret;
}
The do_mmap() function adds a new address interval to the address space of the process.
It is defined in <include/linux/mm.h>.
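Meaning of the parameters:
offset: the offset within the file. A file is not necessarily mapped all at once; often only part of it is mapped, and offset gives the starting position of that part.
len: the length of the part of the file to be mapped.
addr: an address in the virtual address space; the search for a free virtual area starts from this address.
prot: the access permissions of the pages in this virtual area. Possible flags are PROT_READ, PROT_WRITE, PROT_EXEC, and PROT_NONE. The first three have the same meaning as the flags VM_READ, VM_WRITE, and VM_EXEC. PROT_NONE means that the process has none of these three permissions.
flag: the other flags of the virtual area.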
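The correspondence between the PROT_* flags and the VM_* flags described above can be sketched as a small translation function. This is a user-space illustration only: the SKETCH_VM_* constants are local stand-ins that assume the kernel's VM_READ, VM_WRITE, and VM_EXEC occupy the three lowest bits, and sketch_prot_to_vm_flags is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <sys/mman.h>

/* Local stand-ins for the kernel's VM_READ, VM_WRITE, VM_EXEC (assumed values). */
#define SKETCH_VM_READ  0x1ul
#define SKETCH_VM_WRITE 0x2ul
#define SKETCH_VM_EXEC  0x4ul

/* Translates mmap() protection bits into the matching VMA permission flags. */
unsigned long sketch_prot_to_vm_flags(int prot)
{
    unsigned long flags = 0;

    /* Only the three permission bits discussed in the text are handled. */
    assert((prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC)) == 0);

    if (prot & PROT_READ)
        flags |= SKETCH_VM_READ;
    if (prot & PROT_WRITE)
        flags |= SKETCH_VM_WRITE;
    if (prot & PROT_EXEC)
        flags |= SKETCH_VM_EXEC;
    return flags;   /* PROT_NONE sets no bit, so the result is 0 */
}
```

For instance, a mapping requested with PROT_READ | PROT_EXEC ends up with the read and execute bits, which is what the r-x permission column shows for code segments in the maps output.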
do_mmap() calls do_mmap_pgoff(), which does the main work of memory mapping. That function is fairly long; see the <mm/mmap.c> file for its implementation.
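When a process accesses a page that is not yet in physical memory, the kernel must load it into physical memory from the disk image or from the swap file (if the page had been swapped out). This is the demand paging mechanism.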