Linux Kernel BUG:soft lockup CPU問題(二)
阿新 • • 發佈:2018-12-12
產品大使用者場景下執行一段時間即報“Aug 7 19:19:58 localhost kernel: NMI watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [xxxx:80779]" 錯誤,核心日誌呼叫棧顯示是在獲取鎖時失敗,該鎖是讀寫鎖,使用read_lock或者write_lock獲取鎖,鎖使用情況如下所示,如圖所示鎖用在桶深256的HASH表下掛連結串列插入刪除時的競爭保護。剛開始按照使用者態經驗加鎖時鎖不可用不是被掛起睡眠嗎?等待鎖可用再被喚醒,怎麼會一直佔有CPU20多秒,通過分析核心讀寫鎖函式發現核心讀寫鎖類似spin_lock死等,所以應該是該鎖被其他人長時間佔用,到了呼叫棧所示的呼叫時一直無法獲取鎖導致的,死等20多秒超時導致該CPU的watchdog發現(watchdog是IRQ所以不會因為死等鎖而不能排程),同時可以確定使用者態死鎖不會導致soft lockup。
結合產品核心程式碼進而分析可能原因: 1) 多個核心鎖的PV操作順序不當導致死鎖; 2) HASH鏈上的節點太多,導致獲取鎖的任務處理時間超過20s;3)某任務獲取鎖之後睡眠(同spin_lock後不能睡眠)4) 其他原因死鎖
分析程式碼排除1,增加除錯手段檢視所有HASH鏈的長度排除2,分析程式碼發現操作HASH表時,獲取鎖後有kmalloc操作,但是引數2是GFP_ATOMIC不會睡眠(引數GFP_KERNEL可能導致睡眠),陷入困境。想到尋求核心自帶debug特性輔助,發現lockdep。lockdep需要修改核心預設配置重編譯核心,如何配置使能lockdep見點選開啟連結
dmesg 輸出:
[ 4773.122536] ================================= [ 4773.122539] [ INFO: inconsistent lock state ] [ 4773.122542] 3.10.101 #9 Tainted: GF O [ 4773.122544] --------------------------------- [ 4773.122547] inconsistent {IN-SOFTIRQ-R} -> {SOFTIRQ-ON-W} usage. [ 4773.122551] xxx/17184 [HC0[0]:SC0[0]:HE1:SE1] takes: [ 4773.122554] ( [ 4773.122566] {IN-SOFTIRQ-R} state was registered at: [ 4773.122568] [<ffffffff810bf9bb>] __lock_acquire+0x5ab/0xc30 [ 4773.122576] [<ffffffff810c00f0>] lock_acquire+0xb0/0x160 [ 4773.122580] [<ffffffff815cf7d9>] _raw_read_lock+0x39/0x70 [ 4773.122586] [<ffffffffa08a891a>] xxxx [ 4773.122590] [<ffffffffa08ab2c0>] xxxx [ 4773.122595] [<ffffffff81519f09>] udp_queue_rcv_skb+0x1b9/0x570 [ 4773.122600] [<ffffffff8151a911>] __udp4_lib_rcv+0x181/0xa20 [ 4773.122604] [<ffffffff8151b1ca>] udp_rcv+0x1a/0x20 [ 4773.122609] [<ffffffff814e8512>] ip_local_deliver_finish+0xe2/0x300 [ 4773.122615] [<ffffffff814e8ef7>] ip_local_deliver+0x47/0x80 [ 4773.122619] [<ffffffff814e8890>] ip_rcv_finish+0x160/0x630 [ 4773.122624] [<ffffffff814e9159>] ip_rcv+0x229/0x370 [ 4773.122628] [<ffffffff814af0a6>] __netif_receive_skb_core+0x6c6/0x8a0 [ 4773.122634] [<ffffffff814af298>] __netif_receive_skb+0x18/0x60 [ 4773.122639] [<ffffffff814af363>] netif_receive_skb+0x83/0xf0 [ 4773.122644] [<ffffffff814afe80>] napi_gro_receive+0x80/0xb0 [ 4773.122649] [<ffffffffa004e0b4>] rtl8169_poll+0x174/0x68f [r8169] [ 4773.122668] [<ffffffff814af76a>] net_rx_action+0x15a/0x2a0 [ 4773.122673] [<ffffffff81063e57>] __do_softirq+0x107/0x2a0 [ 4773.122678] [<ffffffff815d9e3c>] call_softirq+0x1c/0x30 [ 4773.122683] [<ffffffff810156ed>] do_softirq+0x8d/0xc0 [ 4773.122688] [<ffffffff810641cd>] irq_exit+0xdd/0xf0 [ 4773.122692] [<ffffffff815da766>] do_IRQ+0x56/0xc0 [ 4773.122697] [<ffffffff815d01f2>] ret_from_intr+0x0/0x1a [ 4773.122701] [<ffffffff8109378b>] try_to_wake_up+0x1cb/0x2e0 [ 4773.122707] [<ffffffff810938b5>] wake_up_process+0x15/0x20 [ 4773.122711] [<ffffffff81237c91>] do_msgsnd+0x401/0x430 [ 4773.122717] [<ffffffff81237d0d>] SyS_msgsnd+0x4d/0x60 [ 4773.122721] [<ffffffff815d8ba0>] tracesys+0xdd/0xe2 [ 4773.122726] irq event stamp: 12549 [ 4773.122728] hardirqs last enabled at (12549): [<ffffffff81142105>] user_exit+0x55/0xb0 [ 4773.122732] hardirqs last disabled at (12548): [<ffffffff811420eb>] user_exit+0x3b/0xb0 [ 4773.122736] softirqs last enabled at (12468): [<ffffffff8149b463>] release_sock+0x173/0x1d0 [ 4773.122741] softirqs last disabled at (12464): [<ffffffff815cf478>] _raw_spin_lock_bh+0x18/0x70 [ 4773.122746] other info that might help us debug this: [ 4773.122748] Possible unsafe locking scenario: [ 4773.122751] CPU7 [ 4773.122752] ---- [ 4773.122754] lock(&xxx); [ 4773.122757] <Interrupt> [ 4773.122759] lock(&xxx); [ 4773.122762] *** DEADLOCK ***