阿新 • Published: 2020-05-09
# A Kernel Crash Investigation Log
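The system runs CentOS with a `3.10.0` kernel. The kernel crashed under normal operation; fortunately, a vmcore was generated.
## Preparing the debugging environment
1. crash
2. Kernel debuginfo RPMs; both downloaded packages must match the release of the kernel that crashed (here `3.10.0-327.el7.x86_64`)
- kernel-debuginfo-common-x86_64-3.10.0-327.el7.x86_64.rpm
- kernel-debuginfo-3.10.0-327.el7.x86_64.rpm
The packages were downloaded from the following mirror, which was reasonably fast: https://mirrors.ocf.berkeley.edu/centos-debuginfo/7/x86_64/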
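Since the two debuginfo packages must match the kernel release exactly, their file names can be derived from the release string instead of being typed by hand. The following is a minimal sketch, assuming the release string has the form printed by `uname -r`; the helper name `debuginfo_rpms` is hypothetical and not part of any tool.

```sh
#!/bin/sh
# Hypothetical helper: print the two debuginfo rpm file names for a kernel release.
# $1: kernel release as printed by `uname -r`, e.g. 3.10.0-327.el7.x86_64
debuginfo_rpms() {
    release=$1
    arch=${release##*.}    # text after the last dot, e.g. x86_64
    printf 'kernel-debuginfo-common-%s-%s.rpm\n' "$arch" "$release"
    printf 'kernel-debuginfo-%s.rpm\n' "$release"
}

debuginfo_rpms "3.10.0-327.el7.x86_64"
```

For the release used in this post, the helper prints the same two file names as listed above. When the vmcore comes from a different machine, the release string has to be taken from that machine rather than from the local `uname -r`.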
## Investigation
With the generated vmcore at hand:
1. Start crash
```sh
[zzz@localhost kernel-debug]# crash ../vmcore /usr/lib/debug/lib/modules/3.10.0-327.el7.x86_64/vmlinux
```
2. The banner shows the direct cause: a NULL pointer dereference at address 0000000000000008
```sh
KERNEL: /usr/lib/debug/lib/modules/3.10.0-327.el7.x86_64/vmlinux
DUMPFILE: ../vmcore [PARTIAL DUMP]
CPUS: 8
DATE: Wed Apr 29 19:40:42 2020
UPTIME: 335 days, 01:46:01
LOAD AVERAGE: 23.98, 26.19, 15.75
TASKS: 688
NODENAME: localhost.localdomain
RELEASE: 3.10.0-327.el7.x86_64
VERSION: #1 SMP Thu Nov 19 22:10:57 UTC 2015
MACHINE: x86_64 (3408 Mhz)
MEMORY: 15.6 GB
PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000008"
PID: 76
COMMAND: "kswapd0"
TASK: ffff88044beba280 [THREAD_INFO: ffff88044bef0000]
CPU: 2
STATE: TASK_RUNNING (PANIC)
```
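The address in the PANIC line is 0x8 rather than 0. A small non-zero address like this typically means a member offset was added to a NULL base pointer. The sketch below is only a rough heuristic under that assumption: it treats any address inside the first 4096-byte page as "near NULL". The helper name `near_null` is hypothetical, and the address is compared as a string so that large kernel addresses do not overflow shell arithmetic.

```sh
#!/bin/sh
# Hypothetical helper: classify a fault address given as hex digits (no 0x prefix).
# Assumption: anything below 0x1000 (the first 4 KiB page) counts as near-NULL.
near_null() {
    hex=$(printf '%s' "$1" | sed 's/^0*//')    # strip leading zeros
    if [ "${#hex}" -le 3 ]; then               # at most 3 hex digits => value < 0x1000
        echo "near-NULL, offset 0x${hex:-0}"
    else
        echo "not near NULL"
    fi
}

near_null 0000000000000008
near_null ffff88044beba280
```

The first call uses the address from the PANIC line, the second the TASK address from the same banner as a counterexample.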
3. Inspect the backtrace to see roughly where in the code the problem occurred
```sh
crash> bt
PID: 76 TASK: ffff88044beba280 CPU: 2 COMMAND: "kswapd0"
#0 [ffff88044bef3610] machine_kexec at ffffffff81051beb
#1 [ffff88044bef3670] crash_kexec at ffffffff810f2542
#2 [ffff88044bef3740] oops_end at ffffffff8163e1a8
#3 [ffff88044bef3768] no_context at ffffffff8162e2b8
#4 [ffff88044bef37b8] __bad_area_nosemaphore at ffffffff8162e34e
#5 [ffff88044bef3800] bad_area_nosemaphore at ffffffff8162e4b8
#6 [ffff88044bef3810] __do_page_fault at ffffffff81640fce
#7 [ffff88044bef3868] do_page_fault at ffffffff81641113
#8 [ffff88044bef3890] page_fault at ffffffff8163d408
[exception RIP: down_read_trylock+9]
RIP: ffffffff810aa989 RSP: ffff88044bef3940 RFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff8801b4f9ff80 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000008
RBP: ffff88044bef3940 R8: 0000000000000000 R9: 0000000000017bc0
R10: ffff880465fd8000 R11: 0000000000000000 R12: ffff8801b4f9ff81
R13: ffffea00047dfbc0 R14: 0000000000000008 R15: 0000000000000001
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff88044bef3948] page_lock_anon_vma_read at ffffffff811a2e65
#10 [ffff88044bef3978] page_referenced at ffffffff811a30e7
#11 [ffff88044bef39f0] shrink_page_list at ffffffff8117d264
#12 [ffff88044bef3b28] shrink_inactive_list at ffffffff8117df3a
#13 [ffff88044bef3bf0] shrink_lruvec at ffffffff8117ea05
#14 [ffff88044bef3cf0] shrink_zone at ffffffff8117ee66
#15 [ffff88044bef3d48] balance_pgdat at ffffffff8118010c
#16 [ffff88044bef3e20] kswapd at ffffffff811803d3
#17 [ffff88044bef3ec8] kthread at ffffffff810a5aef
#18 [ffff88044bef3f50] ret_from_fork at ffffffff81645858
```
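In the backtrace above, frames #0 through #8 are the fault and crash-dump handling path; the interesting part is the `[exception RIP: ...]` block and the frames below it, which show what kswapd0 was doing when it faulted. The following sketch pulls out those two pieces of information, assuming the `bt` output has been saved as plain text; the helper name `fault_site` is hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: read crash `bt` output on stdin and print the faulting
# symbol+offset and the first frame after the exception block (the caller).
fault_site() {
    awk '
        /\[exception RIP: / {
            sub(/.*\[exception RIP: /, "")   # drop everything up to the symbol
            sub(/\].*/, "")                  # drop the closing bracket
            print "fault:  " $0
            seen = 1
            next
        }
        # First numbered frame after the exception block is the caller.
        seen && $1 ~ /^#[0-9]+$/ { print "caller: " $3; exit }
    '
}

# Lines copied from the backtrace above.
fault_site <<'EOF'
#8 [ffff88044bef3890] page_fault at ffffffff8163d408
[exception RIP: down_read_trylock+9]
RIP: ffffffff810aa989 RSP: ffff88044bef3940 RFLAGS: 00010202
#9 [ffff88044bef3948] page_lock_anon_vma_read at ffffffff811a2e65
#10 [ffff88044bef3978] page_referenced at ffffffff811a30e7
EOF
```

For this dump the helper reports `down_read_trylock+9` as the fault site and `page_lock_anon_vma_read` as its caller. Note also that in the register dump RDI, which carries the first integer argument on x86_64, holds 0x8, the same value as the address in the PANIC line.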
The exception was raised inside `down_read_trylock`, where the bad access triggered a page fault. First, disassemble the instruction at the address held in RIP (ffffffff810aa989):
```sh
crash> dis -l ffffffff810aa989
/usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/arch/x86/include/asm/rwsem.h: 83
0xffffffff8