[Kernel_exception2] data abort Unable to handle kernel paging request

1. Overview:

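    Data abort is one of the more common types of kernel exception (KE). It occurs because the addresses software works with are virtual addresses, which the MMU translates to physical addresses through a multi-level page table mapping. When something goes wrong and a virtual address can no longer be resolved to its physical address, the kernel reports the corresponding BUG and the system restarts. The memory involved may have been touched by another process or code path, or a hardware problem may have flipped a bit in the address so that it can no longer be accessed.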

2. Case studies:

(1) KE caused by a hardware bit flip:

    The kernel log contains the following:

[20512.223175] -(3)[30488:kworker/u8:2]Unable to handle kernel paging request at virtual address 4156106c
[20512.223201] -(3)[30488:kworker/u8:2]pgd = c0003000
[20512.223207] [4156106c] *pgd=80000040005003, *pmd=00000000
[20512.223223] -(3)[30488:kworker/u8:2]Internal error: Oops: 205 [#1] PREEMPT SMP ARM
[20512.223230] -(3)[30488:kworker/u8:2]Kernel Offset: disabled
[20513.223253] -(3)[30488:kworker/u8:2]PC is at set_task_cpu+0xd8/0x23c
[20513.223262] -(3)[30488:kworker/u8:2]LR is at walt_fixup_busy_time+0x1f0/0x4ac
[20513.223268] -(3)[30488:kworker/u8:2]pc : [<c02596f0>]    lr : [<c028f46c>]    psr: 60070093
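
    To make the page-table walk behind the "*pgd=..., *pmd=00000000" lines concrete, here is a minimal sketch that splits a 32-bit virtual address into its table indices. It assumes the ARM LPAE three-level layout with 4 KiB pages (the 64-bit-wide pgd entry in the log suggests LPAE, but that is an inference, not something the log states). The names split_va and struct va_split are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout (ARM LPAE, 4 KiB pages):
 *   PGD index = VA[31:30], PMD index = VA[29:21],
 *   PTE index = VA[20:12], page offset = VA[11:0]. */
#define PGDIR_SHIFT    30
#define PMD_SHIFT      21
#define PAGE_SHIFT     12
#define PTRS_PER_TABLE 512u   /* entries in one PMD or PTE table */

/* Each lower level consumes 9 address bits under this assumption. */
static_assert(PTRS_PER_TABLE == (1u << (PMD_SHIFT - PAGE_SHIFT)),
              "9 index bits per level");

/* Hypothetical helper type: the pieces of one virtual address. */
struct va_split {
    unsigned pgd;     /* first-level index  */
    unsigned pmd;     /* second-level index */
    unsigned pte;     /* third-level index  */
    unsigned offset;  /* byte offset inside the page */
};

static struct va_split split_va(uint32_t va)
{
    struct va_split s;

    s.pgd    = va >> PGDIR_SHIFT;
    s.pmd    = (va >> PMD_SHIFT) & (PTRS_PER_TABLE - 1);
    s.pte    = (va >> PAGE_SHIFT) & (PTRS_PER_TABLE - 1);
    s.offset = va & ((1u << PAGE_SHIFT) - 1);
    return s;
}
```

    Under these assumptions, split_va(0x4156106c) gives PGD index 1, PMD index 10, PTE index 353 and offset 0x06c. A zero entry at the PMD level, as printed in the log, means the walk stops there because nothing is mapped for that address.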

    Loading the matching vmlinux symbol file in GDB gives the following backtrace:

(gdb) bt
#0  0xc02596f0 in set_task_rq (cpu=<optimized out>, p=<optimized out>)
    at /home/buildsrv-108/jenkins/workspace/PY_UNIFIED_VERSION_BUILD/code/kernel-4.9/kernel/sched/sched.h:1061
#1  __set_task_cpu (cpu=<optimized out>, p=<optimized out>)
    at /home/buildsrv-108/jenkins/workspace/PY_UNIFIED_VERSION_BUILD/code/kernel-4.9/kernel/sched/sched.h:1084
#2  set_task_cpu (p=0xdbcd4000, new_cpu=0) at /home/buildsrv-108/jenkins/workspace/PY_UNIFIED_VERSION_BUILD/code/kernel-4.9/kernel/sched/core.c:1314
#3  0xc025a648 in try_to_wake_up (p=0xdbcd4000, state=<optimized out>, wake_flags=<optimized out>)
    at /home/buildsrv-108/jenkins/workspace/PY_UNIFIED_VERSION_BUILD/code/kernel-4.9/kernel/sched/core.c:2214
#4  0xc025a914 in wake_up_process (p=<optimized out>)
    at /home/buildsrv-108/jenkins/workspace/PY_UNIFIED_VERSION_BUILD/code/kernel-4.9/kernel/sched/core.c:2294
#5  0xc0240bdc in wake_up_worker (pool=<optimized out>)
    at /home/buildsrv-108/jenkins/workspace/PY_UNIFIED_VERSION_BUILD/code/kernel-4.9/kernel/workqueue.c:837
#6  process_one_work (worker=0xdbea5080, work=0xd5c5b434)
    at /home/buildsrv-108/jenkins/workspace/PY_UNIFIED_VERSION_BUILD/code/kernel-4.9/kernel/workqueue.c:2076
#7  0xc0241998 in worker_thread (__worker=0xdbea5080)
    at /home/buildsrv-108/jenkins/workspace/PY_UNIFIED_VERSION_BUILD/code/kernel-4.9/kernel/workqueue.c:2225

    The corresponding set_task_cpu code is:

void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
{
    ......
	if (task_cpu(p) != new_cpu) {
		if (p->sched_class->migrate_task_rq)
			p->sched_class->migrate_task_rq(p);
		p->se.nr_migrations++;
		perf_event_task_migrate(p);

		walt_fixup_busy_time(p, new_cpu);
	}

	__set_task_cpu(p, new_cpu);
}

    The disassembly of the relevant frame is:

(gdb) f 3
#3  0xc025a648 in try_to_wake_up (p=0xdbcd4000, state=<optimized out>, wake_flags=<optimized out>)
    at /home/buildsrv-108/jenkins/workspace/PY_UNIFIED_VERSION_BUILD/code/kernel-4.9/kernel/sched/core.c:2214
2214	in /home/buildsrv-108/jenkins/workspace/PY_UNIFIED_VERSION_BUILD/code/kernel-4.9/kernel/sched/core.c

(gdb) i reg
r0             0xdbcd4080	3687661696
r1             0xdf79c900	3749300480
r2             0x4094d	264525
r3             0x4094d	264525
r4             0xdbcd4000	3687661568
r5             0xdbcd4644	3687663172
r6             0xc1404548	3242214728

(gdb) disas
Dump of assembler code for function try_to_wake_up:

   0xc025a62c <+552>:    beq    0xc025a648 <try_to_wake_up+580>
   0xc025a630 <+556>:    ldr    r3, [r11, #-56]    ; 0x38
   0xc025a634 <+560>:    mov    r1, r10
   0xc025a638 <+564>:    mov    r0, r4      // copy the value of r4 into r0
   0xc025a63c <+568>:    orr    r3, r3, #4
   0xc025a640 <+572>:    str    r3, [r11, #-56]    ; 0x38
   0xc025a644 <+576>:    bl    0xc0259618 <set_task_cpu> // call set_task_cpu
=> 0xc025a648 <+580>:    movw    r3, #17828    ; 0x45a4

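    The assembly above shows that r0 should hold the same value as r4 (that is, p in the C code), but in practice bit 7 of r0 (mask 0x80) has flipped to 1, which changes the address being accessed: dbcd4000 -> dbcd4080. This points to a hardware bit flip as the cause of the KE. If the problem reproduces fairly often, it can be confirmed by cross-swapping the CPU/memory between devices.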
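
    The comparison between the expected and the observed register value can be automated. The sketch below XORs the two values and reports the bit position only when exactly one bit differs; single_bit_flip is a hypothetical helper name, not something from the kernel or GDB.

```c
#include <assert.h>
#include <stdint.h>

/* Return the index (0 = least significant) of the single bit in which
 * expected and observed differ, or -1 if they are equal or differ in
 * more than one bit. */
static int single_bit_flip(uint32_t expected, uint32_t observed)
{
    uint32_t diff = expected ^ observed;   /* set bits mark differences */
    int bit = 0;

    /* diff & (diff - 1) clears the lowest set bit; a non-zero result
     * means more than one bit was set. */
    if (diff == 0 || (diff & (diff - 1)) != 0)
        return -1;

    while ((diff >> bit) != 1u)
        bit++;

    assert(bit >= 0 && bit < 32);
    return bit;
}
```

    For the values in this case, single_bit_flip(0xdbcd4000, 0xdbcd4080) returns 7. A result of -1 for a corrupted value would suggest that more than one bit changed, which fits a single-bit hardware flip less well and is worth checking against software corruption.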

 

(2) KE caused by memory corruption:

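    Memory corruption here means that the memory about to be used has already been illegally taken over by some other code, for example through an out-of-bounds array write or a use-after-free. Below is a concrete example. The information printed in the kernel log is: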

[  192.960966]  (0)[1410:Signal Catcher]Unable to handle kernel paging request at virtual address 880646e1
[  192.960998]  (0)[1410:Signal Catcher]pgd = d06f4000
[  192.961013] [880646e1] *pgd=00000000
[  193.961221] -(0)[1410:Signal Catcher]PC is at find_vma+0x54/0x80
[  193.961233] -(0)[1410:Signal Catcher]LR is at 0xd18ac3d8

    Loading vmlinux in GDB yields the following backtrace:

(gdb) bt
#0  find_vma (mm=0xdab73180, addr=3040309248) at /home/buildsrv-96/jenkins/workspace/UNIFIED_VERSION_BUILD-2/code/kernel-3.18/mm/mmap.c:2099
#1  0xc01171f8 in __do_page_fault (tsk=<optimized out>, flags=<optimized out>, fsr=<optimized out>, addr=<optimized out>, mm=<optimized out>)
    at /home/buildsrv-96/jenkins/workspace/UNIFIED_VERSION_BUILD-2/code/kernel-3.18/arch/arm/mm/fault.c:232
#2  do_page_fault (addr=0, fsr=3040309248, regs=0xd0001fb0)
    at /home/buildsrv-96/jenkins/workspace/UNIFIED_VERSION_BUILD-2/code/kernel-3.18/arch/arm/mm/fault.c:314
#3  0xc01003dc in do_DataAbort (addr=0, fsr=23, regs=0xd0001fb0)

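    The value addr = 3040309248 (0xb5377000) in frame 0 immediately looks odd, since an abnormal addr like this normally does not appear. Continuing the analysis: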

(gdb) f 2
#2  do_page_fault (addr=0, fsr=3040309248, regs=0xd0001fb0)
    at /home/buildsrv-96/jenkins/workspace/UNIFIED_VERSION_BUILD-2/code/kernel-3.18/arch/arm/mm/fault.c:314
314	in /home/buildsrv-96/jenkins/workspace/UNIFIED_VERSION_BUILD-2/code/kernel-3.18/arch/arm/mm/fault.c

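    Switching to frame 2 shows addr = 0, even though addr is not modified as it is passed from one function to the next. This suggests that addr may have been corrupted, and the assembly below shows the corruption clearly as well: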

(gdb) disas
Dump of assembler code for function do_page_fault:
   0xc0117130 <+0>:	mov	r12, sp
   0xc0117134 <+4>:	push	{r4, r5, r6, r7, r8, r9, r10, r11, r12, lr, pc}
   0xc0117138 <+8>:	sub	r11, r12, #4
   0xc01171f0 <+192>:	mov	r0, r5
   0xc01171f4 <+196>:	bl	0xc0222e1c <find_vma>
=> 0xc01171f8 <+200>:	subs	r9, r0, #0   //r9 = r0 - 0=0
   0xc01171fc <+204>:	beq	0xc01173b8 <do_page_fault+648>
   0xc0117200 <+208>:	ldr	r3, [r9]
   0xc0117204 <+212>:	cmp	r8, r3
   0xc0117208 <+216>:	bcc	0xc0117390 <do_page_fault+608>

(gdb) i reg
r0             0x0	0
r1             0xb5377000	3040309248
r2             0xff000b2c	4278192940
r3             0x880646ed	2282112749
r4             0xd0001fb0	3489669040
r5             0xdab73180	3669438848
r6             0xd18ac100	3515531520
r7             0x17	23
r8             0xb5377000	3040309248
r9             0xb5377000	3040309248
r10            0xdab731b8	3669438904

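    In the assembly above r9 should be 0, but the register dump shows 0xb5377000 = 3040309248, so this value was probably overwritten, which led to the problem. To debug memory corruption issues like this one, enable the SLUB debug or KASAN mechanisms; when a corruption occurs they can mark the location that was overwritten. For details, see the blog post "Memory Management 3: Kernel Memory Detection with KASAN" (記憶體管理三 核心記憶體檢測KASAN).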
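
    As a rough illustration of how a debug allocator can point at the overwritten location, here is a simplified user-space sketch of the red-zone idea: guard bytes are placed after the usable payload and checked later. This is not the kernel's SLUB or KASAN implementation. The sizes, the guard byte and all names (guarded_obj, guarded_init, buggy_write, guarded_check) are assumptions made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAYLOAD_SIZE 16u
#define REDZONE_SIZE 8u
#define REDZONE_BYTE 0xccu   /* arbitrary guard pattern for this sketch */

/* One "object": usable payload followed by guard bytes. */
struct guarded_obj {
    unsigned char bytes[PAYLOAD_SIZE + REDZONE_SIZE];
};

/* Zero the payload and fill the red zone with the guard pattern. */
static void guarded_init(struct guarded_obj *o)
{
    memset(o->bytes, 0, PAYLOAD_SIZE);
    memset(o->bytes + PAYLOAD_SIZE, REDZONE_BYTE, REDZONE_SIZE);
}

/* Stands in for buggy code: the offset is not checked against
 * PAYLOAD_SIZE. Only the total buffer size is enforced, so the
 * sketch itself stays well defined. */
static void buggy_write(struct guarded_obj *o, size_t off, unsigned char v)
{
    assert(off < sizeof(o->bytes));
    o->bytes[off] = v;
}

/* Return the offset of the first damaged guard byte, or -1 if the
 * red zone is intact. */
static long guarded_check(const struct guarded_obj *o)
{
    size_t i;

    for (i = PAYLOAD_SIZE; i < sizeof(o->bytes); i++)
        if (o->bytes[i] != REDZONE_BYTE)
            return (long)i;
    return -1;
}
```

    After guarded_init, a write at offset 3 leaves guarded_check at -1, while a write at offset 18 makes it return 18, the exact byte that was overrun. The kernel mechanisms named above do the equivalent checking inside the allocator or through compiler instrumentation, so no manual calls like these are needed there.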

 

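Author: frank_zyp
Your support is the greatest encouragement to the author. Thank you for reading carefully.
This article is not subject to any copyright claim; reposting is welcome.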