Debugging a Memory Overrun in a Kernel Module
This post records the analysis of a kernel-module memory overrun that caused crashes, shared here for discussion.
The failures all occurred during system boot.
Exception 1:
[ 6.854984] BUG: Bad page state in process khelper pfn:6db6d9addc07010f
[ 6.862471] page:ffff880821883b48 flags:ffff880821885e00 count:562584288 mapcount:-30711 mapping:ffff8808218857c0 index:ffff8808218854a0
[ 6.876156] Pid: 132, comm: khelper Not tainted 2.6.32.41- #17
[ 6.885093] Call Trace:
[ 6.887837] [<ffffffff810ba056>] bad_page+0x115/0x130
[ 6.893571] [<ffffffff810bba56>] get_page_from_freelist+0x45c/0x6d4
[ 6.900669] [<ffffffff810cb6f3>] ? __inc_zone_state+0x56/0x66
[ 6.907183] [<ffffffff810bbee6>] __alloc_pages_nodemask+0x152/0x8b8
[ 6.914284] [<ffffffff810bbee6>] ? __alloc_pages_nodemask+0x152/0x8b8
[ 6.921577] [<ffffffff810e221a>] alloc_pages_current+0x96/0x9f
[ 6.928191] [<ffffffff810e556d>] alloc_slab_page+0x1a/0x25
[ 6.934404] [<ffffffff810e55c3>] new_slab+0x4b/0x1ca
[ 6.940040] [<ffffffff810e60c8>] __slab_alloc+0x1c3/0x366
[ 6.946166] [<ffffffff811c221d>] ? selinux_cred_prepare+0x1a/0x30
[ 6.953081] [<ffffffff810e5719>] ? new_slab+0x1a1/0x1ca
[ 6.958999] [<ffffffff810e7718>] __kmalloc_track_caller+0xb0/0xff
[ 6.965901] [<ffffffff81069ec4>] ? prepare_creds+0x1a/0xa4
[ 6.972116] [<ffffffff811c221d>] ? selinux_cred_prepare+0x1a/0x30
[ 6.979026] [<ffffffff810c8947>] kmemdup+0x1b/0x31
[ 6.984476] [<ffffffff811c221d>] selinux_cred_prepare+0x1a/0x30
[ 6.991185] [<ffffffff811bd02d>] security_prepare_creds+0x11/0x13
[ 6.998094] [<ffffffff81069f38>] prepare_creds+0x8e/0xa4
[ 7.004121] [<ffffffff8106a5e6>] copy_creds+0x74/0x186
[ 7.009963] [<ffffffff810488bb>] copy_process+0x261/0x110b
[ 7.016185] [<ffffffff8105fec1>] ? __call_usermodehelper+0x0/0x6a
[ 7.023079] [<ffffffff81049cb2>] do_fork+0x158/0x2fa
[ 7.028725] [<ffffffff8101211a>] ? __cycles_2_ns+0x11/0x3d
[ 7.034943] [<ffffffff81012212>] ? native_sched_clock+0x3b/0x3d
[ 7.041649] [<ffffffff8105fec1>] ? __call_usermodehelper+0x0/0x6a
[ 7.048543] [<ffffffff8100ce22>] kernel_thread+0x82/0xe0
[ 7.054566] [<ffffffff8105fec1>] ? __call_usermodehelper+0x0/0x6a
[ 7.061482] [<ffffffff8105fcd8>] ? ____call_usermodehelper+0x0/0x118
[ 7.068666] [<ffffffff8100ce80>] ? child_rip+0x0/0x20
[ 7.074407] [<ffffffff8105ff0c>] ? __call_usermodehelper+0x4b/0x6a
[ 7.081403] [<ffffffff8106168e>] worker_thread+0x14e/0x1f8
[ 7.087627] [<ffffffff810651a3>] ? autoremove_wake_function+0x0/0x38
[ 7.094818] [<ffffffff81061540>] ? worker_thread+0x0/0x1f8
[ 7.101039] [<ffffffff81064f69>] kthread+0x69/0x71
[ 7.106488] [<ffffffff8100ce8a>] child_rip+0xa/0x20
[ 7.112027] [<ffffffff81064f00>] ? kthread+0x0/0x71
[ 7.117564] [<ffffffff8100ce80>] ? child_rip+0x0/0x20
Looking at just the first two lines: mapcount is negative, which is clearly wrong. The code that prints those two lines is:
printk(KERN_ALERT
       "page:%p flags:%p count:%d mapcount:%d mapping:%p index:%lx\n",
       page, (void *)page->flags, page_count(page),
       page_mapcount(page), page->mapping, page->index);
At first the page value 0xffff880821883b48 looked odd, so I assumed it was an invalid value. That cost a detour: the initial judgment was simply that the page pointer was wrong.
As far as this one corruption goes, the direct cause was that the struct page taken from the buddy system was bad.
A struct page can come either from the per-CPU hot/cold cache list (pcp) or from the zone->free_area lists.
Two possibilities came to mind:
1) If the next field of one of those lists' list_head is corrupted at runtime, the first page pointer taken off the list is invalid.
The pcp list is a per-CPU variable, typically allocated from bootmem;
the free_area lists are likewise allocated from bootmem.
2) Or, at some point, the lru.next field inside a page struct is corrupted; once that page is removed, the list head ends up with list_head.next = the invalid lru.next,
and the next time a free page is taken from the buddy system, an invalid page value is fetched.
The page structs live in the node_map array, which is also allocated from bootmem.
So, based on exception 1, the judgment was that a bootmem region was being overwritten.
One question remained: is 0xffff880821883b48 itself a normal value? On a healthy system, starting from some page's lru and following lru->next,
I printed every page address on the list. On any one list, most pages had addresses of the form 0xffffea000fbab8c8, and exactly one 'page' had an address like
0xffff880821883b48. On reflection, that 'page' is really the zone->free_area->list_head or pcp->list_head
head pointer: both heads are allocated in the dynamic-memory region, and since lru is a circular doubly linked list, it is no surprise that the walk eventually reaches the head.
So 0xffff880821883b48 is a normal value; what was abnormal was mistakenly taking the list head as a page.
Exception 2:
[ 6.839518] BUG: unable to handle kernel NULL pointer dereference at 0000000000000006
[ 6.858841] IP: [<ffffffff810bb84a>] get_page_from_freelist+0x306/0x6d4
[ 6.866234] PGD 0
[ 6.868481] Oops: 0002 [#1] PREEMPT SMP
[ 6.872895] last sysfs file:
[ 6.876201] CPU 19
[ 6.878551] Modules linked in:
[ 6.881971] Pid: 139, comm: khelper Not tainted 2.6.32.41 #1 To be filled by O.E.M.
[ 6.893024] RIP: 0010:[<ffffffff810bb84a>] [<ffffffff810bb84a>] get_page_from_freelist+0x306/0x6d4
[ 6.903125] RSP: 0018:ffff880821addac0 EFLAGS: 00010092
[ 6.909046] RAX: 0000000000000006 RBX: ffff88047e4e98f8 RCX: ffff88047e4e9920
[ 6.916991] RDX: ffff88047e4e9960 RSI: 0000000000000001 RDI: 0000000000000001
[ 6.924944] RBP: ffff880821addb90 R08: 000000000037e329 R09: ffff8800000140c0
[ 6.932898] R10: 0000000000000000 R11: dead000000200200 R12: dead000000100100
[ 6.940843] R13: 0000000000000246 R14: ffff8800000140c0 R15: ffff88047e4e9900
[ 6.948798] FS: 0000000000000000(0000) GS:ffff880028360000(0000) knlGS:0000000000000000
[ 6.957818] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 6.964223] CR2: 0000000000000006 CR3: 0000000001001000 CR4: 00000000000406e0
[ 6.972177] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 6.980123] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 6.988077] Process khelper (pid: 139, threadinfo ffff880821adc000, task ffff880821a39f40)
[ 6.997290] Stack:
[ 6.999531] ffff880821addaf0 ffffffff810b944a ffffea0000000041 0000000000000002
[ 7.007622] <0> 0000000200000246 ffff880000015140 0000000021addbd0 0000000000000000
[ 7.016220] <0> 00000040000212d0 ffff880000015148 ffff88047e4e98f8 0000000200000008
[ 7.025027] Call Trace:
[ 7.027759] [<ffffffff810b944a>] ? prep_compound_page+0x45/0x66
[ 7.034459] [<ffffffff810bbe30>] __alloc_pages_nodemask+0x152/0x8b8
[ 7.041546] [<ffffffff810e229a>] alloc_pages_current+0x96/0x9f
[ 7.048146] [<ffffffff810e55ed>] alloc_slab_page+0x1a/0x25
[ 7.054361] [<ffffffff810e5643>] new_slab+0x4b/0x1ca
[ 7.059995] [<ffffffff810e6148>] __slab_alloc+0x1c3/0x366
[ 7.066111] [<ffffffff8106a0b0>] ? prepare_exec_creds+0x1b/0xb3
[ 7.072809] [<ffffffff810e3b8a>] ? bit_spin_lock+0x17/0x6c
[ 7.079022] [<ffffffff8106a0b0>] ? prepare_exec_creds+0x1b/0xb3
[ 7.085719] [<ffffffff810e6433>] kmem_cache_alloc+0x57/0xd8
[ 7.092029] [<ffffffff8106a0b0>] prepare_exec_creds+0x1b/0xb3
[ 7.098533] [<ffffffff810f1275>] prepare_bprm_creds+0x30/0x53
[ 7.105038] [<ffffffff810f2648>] do_execve+0x80/0x2e0
[ 7.110769] [<ffffffff8100a5c3>] sys_execve+0x3e/0x58
[ 7.116498] [<ffffffff8105fea5>] ? __call_usermodehelper+0x0/0x6a
[ 7.123391] [<ffffffff8100cf08>] kernel_execve+0x68/0xd0
[ 7.129411] [<ffffffff8105fea5>] ? __call_usermodehelper+0x0/0x6a
[ 7.136303] [<ffffffff8105fdc9>] ? ____call_usermodehelper+0x10d/0x118
[ 7.143677] [<ffffffff8100ce8a>] child_rip+0xa/0x20
[ 7.149213] [<ffffffff8105fea5>] ? __call_usermodehelper+0x0/0x6a
[ 7.156104] [<ffffffff8105fcbc>] ? ____call_usermodehelper+0x0/0x118
[ 7.163286] [<ffffffff8100ce80>] ? child_rip+0x0/0x20
[ 7.169011] Code: 00 00 00 ad de 4c 89 65 80 49 bc 00 01 10 00 00 00 ad de 48 8b 4d 80 48 8b 5d 80 48 83 c1 28 48 8b 53 28 48 8b 41 08 48 89 42 08 <48> 89 10 4c 89 59 08 4c 89 63 28 41 ff 0f e9 99 00 00 00 f7 85
[ 7.190840] RIP [<ffffffff810bb84a>] get_page_from_freelist+0x306/0x6d4
[ 7.198325] RSP <ffff880821addac0>
[ 7.202212] CR2: 0000000000000006
[ 7.205903] ---[ end trace 4eaa2a86a8e2da22 ]---
From the RIP, the faulting location is 0xffffffff810bb8fc; the surrounding disassembly is:
ffffffff810bb8d0: mov $0xdead000000200200,%r11
ffffffff810bb8d7:
ffffffff810bb8da: mov %r12,-0x80(%rbp)
ffffffff810bb8de: mov $0xdead000000100100,%r12
ffffffff810bb8e5:
ffffffff810bb8e8: mov -0x80(%rbp),%rcx
ffffffff810bb8ec: mov -0x80(%rbp),%rbx
ffffffff810bb8f0: add $0x28,%rcx
ffffffff810bb8f4: mov 0x28(%rbx),%rdx
ffffffff810bb8f8: mov 0x8(%rcx),%rax
ffffffff810bb8fc: mov %rax,0x8(%rdx)
ffffffff810bb900: mov %rdx,(%rax)
ffffffff810bb903: mov %r11,0x8(%rcx)
ffffffff810bb907: mov %r12,0x28(%rbx)
From the assembly alone it is hard to tell which C statement this corresponds to,
but there is a clue: 0xffffffff810bb8d0 loads the immediate 0xdead000000200200 into r11,
and 0xffffffff810bb8de loads 0xdead000000100100 into r12. Those two magic values are LIST_POISON2 and LIST_POISON1.
Together with the stores at 0xffffffff810bb903 and 0xffffffff810bb907, we can infer that a list_del is being done here, because:
static inline void list_del(struct list_head *entry)
{
	__list_del(entry->prev, entry->next);
	entry->next = LIST_POISON1;
	entry->prev = LIST_POISON2;
}
From that we can infer that this disassembly corresponds to the following C:
ffffffff810bb8d0: mov $0xdead000000200200,%r11 ; LIST_POISON2
ffffffff810bb8d7:
ffffffff810bb8da: mov %r12,-0x80(%rbp)
ffffffff810bb8de: mov $0xdead000000100100,%r12 ; LIST_POISON1
ffffffff810bb8e5:
ffffffff810bb8e8: mov -0x80(%rbp),%rcx ; rcx = page
ffffffff810bb8ec: mov -0x80(%rbp),%rbx ; rbx = page
ffffffff810bb8f0: add $0x28,%rcx ; rcx = &page->lru
ffffffff810bb8f4: mov 0x28(%rbx),%rdx ; rdx = page->lru.next
ffffffff810bb8f8: mov 0x8(%rcx),%rax ; rax = page->lru.prev
ffffffff810bb8fc: mov %rax,0x8(%rdx) ; page->lru.next->prev = page->lru.prev
ffffffff810bb900: mov %rdx,(%rax) ; page->lru.prev->next = page->lru.next
ffffffff810bb903: mov %r11,0x8(%rcx) ; page->lru.prev = LIST_POISON2
ffffffff810bb907: mov %r12,0x28(%rbx) ; page->lru.next = LIST_POISON1
So the direct cause of the crash: a usable page had been found, and removing it from the lru went wrong. That is, list_del(&page->lru) blew up because page->lru.prev held an invalid pointer (here 0x6: note RAX = CR2 = 6, and the faulting bytes <48> 89 10 decode to mov %rdx,(%rax), the prev->next store).
Putting exceptions 1 and 2 together, a corrupted lru list was very likely the problem, so enabling CONFIG_DEBUG_LIST to try to reproduce it was the natural next step.
Looking back, though: in exception 1, page->flags was ffff880821885e00, a kernel-pointer-looking value, which very likely points to a stack overflow.
Checking our own module turned up a function declaring a fixed-size local array of more than 8 KB! On x86_64 the kernel stack is only two pages, 8 KB,
so this is guaranteed to overflow. But the kernel had CONFIG_DEBUG_STACKOVERFLOW enabled; why did it not catch this?
Because the kernel is built with CONFIG_NO_HZ, the system was still booting, and CONFIG_DEBUG_STACKOVERFLOW only checks from the interrupt path,
the overflow check can lag behind the overflow. More importantly, the kernel's stack-overflow check reads:
WARN_ONCE(regs->sp >= curbase &&
	  regs->sp <= curbase + THREAD_SIZE &&
	  regs->sp <  curbase + sizeof(struct thread_info) +
		      sizeof(struct pt_regs) + 128,
	  "do_IRQ: %s near stack overflow (cur:%Lx,sp:%lx)\n",
	  current->comm, curbase, regs->sp);
If a single allocation pushes the stack so far down that sp drops below curbase, past thread_info entirely, the code above cannot detect it.
In the newer 3.10 kernel the logic has been fixed; roughly:
if (regs->sp >= curbase &&
    regs->sp <= curbase + THREAD_SIZE &&
    regs->sp >= curbase + sizeof(struct thread_info) +
		sizeof(struct pt_regs) + 128)
	return;		/* stack OK */
warn(xxx);
After changing the module's array to a kmalloc'ed buffer, burn-in testing never reproduced the fault again.