[Kernel_exception2] data abort Unable to handle kernel paging request
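1. Overview:
Data abort is one of the more common types of KE (kernel exception). Every address that software
uses is a virtual address, which the MMU translates to a physical address through multi-level page
tables. When something goes wrong and the virtual address can no longer be resolved to its physical
address, the kernel reports the corresponding BUG and the system reboots. The address may have been
modified by another process or code path, or a hardware fault may have flipped a bit in it, so that
it can no longer be accessed.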
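To make the translation step more concrete, here is a minimal sketch of how a 32-bit virtual address splits into page-table indices. It assumes the three-level ARM LPAE layout with 4 KiB pages (2-bit pgd index, 9-bit pmd index, 9-bit pte index, 12-bit page offset); other configurations split the address differently, and the names va_parts and split_va are made up for this illustration.

```c
#include <stdint.h>

/* Hypothetical helper type: the pieces of a 32-bit virtual address under
 * an ASSUMED ARM LPAE layout with 4 KiB pages. */
struct va_parts {
    uint32_t pgd_index;   /* bits 31..30: first-level table index  */
    uint32_t pmd_index;   /* bits 29..21: second-level table index */
    uint32_t pte_index;   /* bits 20..12: third-level table index  */
    uint32_t page_offset; /* bits 11..0 : byte offset in the page  */
};

/* Split a virtual address into the indices used at each walk level. */
static struct va_parts split_va(uint32_t va)
{
    struct va_parts p;

    p.pgd_index   = va >> 30;
    p.pmd_index   = (va >> 21) & 0x1ffu;
    p.pte_index   = (va >> 12) & 0x1ffu;
    p.page_offset = va & 0xfffu;
    return p;
}
```

Under these assumptions, split_va(0x4156106c) gives pgd index 1, pmd index 10, pte index 353 and page offset 0x6c. An empty entry at any level (for example a *pmd of 00000000 in the log) means the walk cannot complete, and the access ends in a data abort.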
2. Cases:
(1) KE caused by a hardware bit flip:
The kernel log at the time of the crash is as follows:
[20512.223175] -(3)[30488:kworker/u8:2]Unable to handle kernel paging request at virtual address 4156106c
[20512.223201] -(3)[30488:kworker/u8:2]pgd = c0003000
[20512.223207] [4156106c] *pgd=80000040005003, *pmd=00000000
[20512.223223] -(3)[30488:kworker/u8:2]Internal error: Oops: 205 [#1] PREEMPT SMP ARM
[20512.223230] -(3)[30488:kworker/u8:2]Kernel Offset: disabled
[20513.223253] -(3)[30488:kworker/u8:2]PC is at set_task_cpu+0xd8/0x23c
[20513.223262] -(3)[30488:kworker/u8:2]LR is at walt_fixup_busy_time+0x1f0/0x4ac
[20513.223268] -(3)[30488:kworker/u8:2]pc : [<c02596f0>] lr : [<c028f46c>] psr: 60070093
Loading the matching vmlinux symbol file into GDB resolves the stack as follows:
(gdb) bt
#0 0xc02596f0 in set_task_rq (cpu=<optimized out>, p=<optimized out>)
    at /home/buildsrv-108/jenkins/workspace/PY_UNIFIED_VERSION_BUILD/code/kernel-4.9/kernel/sched/sched.h:1061
#1 __set_task_cpu (cpu=<optimized out>, p=<optimized out>)
    at /home/buildsrv-108/jenkins/workspace/PY_UNIFIED_VERSION_BUILD/code/kernel-4.9/kernel/sched/sched.h:1084
#2 set_task_cpu (p=0xdbcd4000, new_cpu=0)
    at /home/buildsrv-108/jenkins/workspace/PY_UNIFIED_VERSION_BUILD/code/kernel-4.9/kernel/sched/core.c:1314
#3 0xc025a648 in try_to_wake_up (p=0xdbcd4000, state=<optimized out>, wake_flags=<optimized out>)
    at /home/buildsrv-108/jenkins/workspace/PY_UNIFIED_VERSION_BUILD/code/kernel-4.9/kernel/sched/core.c:2214
#4 0xc025a914 in wake_up_process (p=<optimized out>)
    at /home/buildsrv-108/jenkins/workspace/PY_UNIFIED_VERSION_BUILD/code/kernel-4.9/kernel/sched/core.c:2294
#5 0xc0240bdc in wake_up_worker (pool=<optimized out>)
    at /home/buildsrv-108/jenkins/workspace/PY_UNIFIED_VERSION_BUILD/code/kernel-4.9/kernel/workqueue.c:837
#6 process_one_work (worker=0xdbea5080, work=0xd5c5b434)
    at /home/buildsrv-108/jenkins/workspace/PY_UNIFIED_VERSION_BUILD/code/kernel-4.9/kernel/workqueue.c:2076
#7 0xc0241998 in worker_thread (__worker=0xdbea5080)
    at /home/buildsrv-108/jenkins/workspace/PY_UNIFIED_VERSION_BUILD/code/kernel-4.9/kernel/workqueue.c:2225
The corresponding set_task_cpu code is as follows:
void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
{
    ......
    if (task_cpu(p) != new_cpu) {
        if (p->sched_class->migrate_task_rq)
            p->sched_class->migrate_task_rq(p);
        p->se.nr_migrations++;
        perf_event_task_migrate(p);
        walt_fixup_busy_time(p, new_cpu);
    }
    __set_task_cpu(p, new_cpu);
}
The registers and disassembly for the corresponding frame are as follows:
(gdb) f 3
#3 0xc025a648 in try_to_wake_up (p=0xdbcd4000, state=<optimized out>, wake_flags=<optimized out>)
at /home/buildsrv-108/jenkins/workspace/PY_UNIFIED_VERSION_BUILD/code/kernel-4.9/kernel/sched/core.c:2214
2214 in /home/buildsrv-108/jenkins/workspace/PY_UNIFIED_VERSION_BUILD/code/kernel-4.9/kernel/sched/core.c
(gdb) i reg
r0 0xdbcd4080 3687661696
r1 0xdf79c900 3749300480
r2 0x4094d 264525
r3 0x4094d 264525
r4 0xdbcd4000 3687661568
r5 0xdbcd4644 3687663172
r6 0xc1404548 3242214728
(gdb) disas
Dump of assembler code for function try_to_wake_up:
0xc025a62c <+552>: beq 0xc025a648 <try_to_wake_up+580>
0xc025a630 <+556>: ldr r3, [r11, #-56] ; 0x38
0xc025a634 <+560>: mov r1, r10
0xc025a638 <+564>: mov r0, r4 // copy the value of r4 into r0
0xc025a63c <+568>: orr r3, r3, #4
0xc025a640 <+572>: str r3, [r11, #-56] ; 0x38
0xc025a644 <+576>: bl 0xc0259618 <set_task_cpu> // branch into set_task_cpu
=> 0xc025a648 <+580>: movw r3, #17828 ; 0x45a4
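The disassembly above shows that r0 should equal r4 (the value of p in the code, 0xdbcd4000),
because r4 is copied into r0 right before the call to set_task_cpu. In the register dump, however,
r0 is 0xdbcd4080: bit 7 (0x80) has flipped to 1, so the address being accessed changed from
dbcd4000 to dbcd4080. This points to a hardware bit flip as the cause of the KE. If the problem
reproduces often, it can be confirmed by cross-testing, that is, swapping the CPU/memory between a
failing device and a known-good one.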
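The comparison of the two register values can be sketched in a few lines of C. This is only an illustration of the check, not something taken from the kernel: XOR the expected and the observed value, then test whether exactly one bit differs. The function names are hypothetical.

```c
#include <stdint.h>

/* Bits that differ between the expected and the observed value. */
static uint32_t diff_bits(uint32_t expected, uint32_t observed)
{
    return expected ^ observed;
}

/* Returns 1 when exactly one bit differs, which is the typical
 * signature of a single hardware bit flip; 0 otherwise. */
static int is_single_bit_flip(uint32_t expected, uint32_t observed)
{
    uint32_t d = expected ^ observed;

    /* d & (d - 1) clears the lowest set bit; the result is zero
     * only when d had exactly one bit set. */
    return d != 0 && (d & (d - 1)) == 0;
}

/* Index of the lowest differing bit, or -1 if the values are equal. */
static int flipped_bit_index(uint32_t expected, uint32_t observed)
{
    uint32_t d = expected ^ observed;
    int i;

    for (i = 0; i < 32; i++)
        if (d & (UINT32_C(1) << i))
            return i;
    return -1;
}
```

With the values from this case, diff_bits(0xdbcd4000, 0xdbcd4080) is 0x80, is_single_bit_flip returns 1, and flipped_bit_index returns 7. A difference spread over many bits would instead suggest a software overwrite rather than a flipped bit.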
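(2) KE caused by memory corruption:
Memory corruption here means that the memory about to be used has already been illegally
overwritten from somewhere else, typically through an array out-of-bounds write or a use after
free. Below is a concrete example; the kernel log printed the following: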
[ 192.960966] (0)[1410:Signal Catcher]Unable to handle kernel paging request at virtual address 880646e1
[ 192.960998] (0)[1410:Signal Catcher]pgd = d06f4000
[ 192.961013] [880646e1] *pgd=00000000
[ 193.961221] -(0)[1410:Signal Catcher]PC is at find_vma+0x54/0x80
[ 193.961233] -(0)[1410:Signal Catcher]LR is at 0xd18ac3d8
Loading vmlinux into GDB resolves the following stack:
(gdb) bt
#0 find_vma (mm=0xdab73180, addr=3040309248) at /home/buildsrv-96/jenkins/workspace/UNIFIED_VERSION_BUILD-2/code/kernel-3.18/mm/mmap.c:2099
#1 0xc01171f8 in __do_page_fault (tsk=<optimized out>, flags=<optimized out>, fsr=<optimized out>, addr=<optimized out>, mm=<optimized out>)
at /home/buildsrv-96/jenkins/workspace/UNIFIED_VERSION_BUILD-2/code/kernel-3.18/arch/arm/mm/fault.c:232
#2 do_page_fault (addr=0, fsr=3040309248, regs=0xd0001fb0)
at /home/buildsrv-96/jenkins/workspace/UNIFIED_VERSION_BUILD-2/code/kernel-3.18/arch/arm/mm/fault.c:314
#3 0xc01003dc in do_DataAbort (addr=0, fsr=23, regs=0xd0001fb0)
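The addr = 3040309248 (0xb5377000) in frame 0 already looks odd, since an abnormal addr like this is not normally expected here. The analysis continues below: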
(gdb) f 2
#2 do_page_fault (addr=0, fsr=3040309248, regs=0xd0001fb0)
at /home/buildsrv-96/jenkins/workspace/UNIFIED_VERSION_BUILD-2/code/kernel-3.18/arch/arm/mm/fault.c:314
314 in /home/buildsrv-96/jenkins/workspace/UNIFIED_VERSION_BUILD-2/code/kernel-3.18/arch/arm/mm/fault.c
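After switching to frame 2, addr shows as 0, even though addr is supposed to be passed down the
call chain unchanged (frame 0 shows 3040309248). This suggests that addr may have been corrupted,
and the disassembly below makes that quite clear as well: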
(gdb) disas
Dump of assembler code for function do_page_fault:
0xc0117130 <+0>: mov r12, sp
0xc0117134 <+4>: push {r4, r5, r6, r7, r8, r9, r10, r11, r12, lr, pc}
0xc0117138 <+8>: sub r11, r12, #4
0xc01171f0 <+192>: mov r0, r5
0xc01171f4 <+196>: bl 0xc0222e1c <find_vma>
=> 0xc01171f8 <+200>: subs r9, r0, #0 //r9 = r0 - 0=0
0xc01171fc <+204>: beq 0xc01173b8 <do_page_fault+648>
0xc0117200 <+208>: ldr r3, [r9]
0xc0117204 <+212>: cmp r8, r3
0xc0117208 <+216>: bcc 0xc0117390 <do_page_fault+608>
(gdb) i reg
r0 0x0 0
r1 0xb5377000 3040309248
r2 0xff000b2c 4278192940
r3 0x880646ed 2282112749
r4 0xd0001fb0 3489669040
r5 0xdab73180 3669438848
r6 0xd18ac100 3515531520
r7 0x17 23
r8 0xb5377000 3040309248
r9 0xb5377000 3040309248
r10 0xdab731b8 3669438904
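According to the disassembly above, r9 should be 0, but the dump shows 0xb5377000 = 3040309248, so
this value was probably corrupted, which led to the problem. Memory corruption issues like this
need the slub or kasan debug mechanisms enabled for debugging; when a corruption occurs, they flag
the location that was overwritten. For the detailed method, see the blog post
記憶體管理三 核心記憶體檢測KASAN (Memory Management 3: Kernel Memory Checking with KASAN).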
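The idea behind flagging the overwritten location can be shown with a small user-space sketch of a red zone: guard bytes with a known pattern sit right after the object, and a later check reports the first guard byte that changed. This is a simplified illustration only; it is not the kernel's slub or kasan implementation (kasan in particular works differently), and the names, sizes and pattern below are all hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define OBJ_SIZE  8     /* usable bytes of the hypothetical object */
#define ZONE_SIZE 4     /* guard bytes placed right after it       */
#define RED_BYTE  0xcc  /* illustrative guard pattern              */

/* The object and its red zone live in one array, so the overrunning
 * write in this demo stays inside memory that the demo owns. */
struct guarded_obj {
    unsigned char bytes[OBJ_SIZE + ZONE_SIZE];
};

/* Zero the object and fill the red zone with the guard pattern. */
static void guarded_init(struct guarded_obj *g)
{
    memset(g->bytes, 0, OBJ_SIZE);
    memset(g->bytes + OBJ_SIZE, RED_BYTE, ZONE_SIZE);
}

/* Buggy writer: it never checks len against OBJ_SIZE. The clamp to
 * the whole array only keeps this demo itself well-defined. */
static void buggy_write(struct guarded_obj *g, const void *src, size_t len)
{
    if (len > sizeof(g->bytes))
        len = sizeof(g->bytes);
    memcpy(g->bytes, src, len);
}

/* Offset (from the start of the object) of the first damaged guard
 * byte, or -1 if the red zone is intact. */
static int first_corrupt_offset(const struct guarded_obj *g)
{
    size_t i;

    for (i = OBJ_SIZE; i < OBJ_SIZE + ZONE_SIZE; i++)
        if (g->bytes[i] != RED_BYTE)
            return (int)i;
    return -1;
}
```

After guarded_init, writing 10 bytes with buggy_write makes first_corrupt_offset return 8, the first byte past the object, while a 3-byte write leaves the result at -1. The point of such checks is that the overrun is caught near the writer instead of surfacing much later as an unrelated data abort.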
Author: frank_zyp
This article is not copyrighted; reposting is welcome.