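Analyzing and resolving a Linux kernel crash
I. Problem scenario and environment
System environment:
Red Hat Enterprise Linux 6.4, kernel 2.6.32-358
Problem:
An iptables rule with NFQUEUE as its target was added to the mangle table. When an HTTP request matched the rule, the machine rebooted immediately. This happened twice, sporadically, but could not be reproduced afterwards on the machines that had rebooted.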
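II. Investigation
1. Checked messages, the kernel log, and dmesg: nothing abnormal was recorded.
2. Checked the load before the reboot: CPU, memory, disk I/O, and network I/O were all normal.
3. The reboots only started once NFQUEUE was used as the target, which pointed at the system itself, and more specifically at the NFQUEUE path in iptables. NFQUEUE hands packets from the kernel to user space for processing, so the kernel side of that path became the prime suspect.
4. The server console might have shown useful output at the moment of the reboot, but the server sits in the customer's data center, so checking it was impractical.
5. Running last to view the reboot history showed something unexpected: the entry for the NFQUEUE-triggered reboot was marked crash, meaning the system had crashed and then rebooted. That pins the cause on a system or kernel crash.
6. Linux systems commonly have kdump installed and configured by default, so when the kernel crashes, kdump captures the memory contents from just before the crash and writes a dump file named vmcore into a date-stamped directory under /var/crash/. The crash utility can then analyze the vmcore file to obtain details of the crash.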
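Step 6 only works if kdump was actually armed before the crash. The following is a minimal sketch of how that could be checked on a live system, assuming the standard /proc/cmdline and /sys/kernel/kexec_crash_loaded kernel interfaces are present. The function names are hypothetical helpers written for this article, not part of kdump.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not part of kdump).

# Succeeds if the kernel command line reserves memory for a crash kernel.
# $1 = file to inspect (defaults to the running kernel's command line)
crashkernel_reserved() {
    cmdline_file="${1:-/proc/cmdline}"
    grep -q 'crashkernel=' "$cmdline_file"
}

# Succeeds if a crash kernel is currently loaded and ready to capture a dump.
# $1 = state file to inspect (defaults to the sysfs flag)
kdump_loaded() {
    state_file="${1:-/sys/kernel/kexec_crash_loaded}"
    [ "$(cat "$state_file" 2>/dev/null)" = "1" ]
}

# Usage on a live system:
#   crashkernel_reserved && kdump_loaded && echo "kdump is armed"
```

If either check fails, no vmcore will be written on a panic, which would explain a missing dump directory under /var/crash/.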
III. Analyzing the vmcore file
1. Install the debuginfo package that matches the kernel:
# yum install kernel-debuginfo-2.6.32-358.el6.x86_64
2. Analyze the vmcore with the crash command shipped with the system:
# crash /usr/lib/debug/lib/modules/2.6.32-358.el6.x86_64/vmlinux vmcore

crash 7.1.0-6.el6
Copyright (C) 2002-2014  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: kernel version inconsistency between vmlinux and dumpfile

      KERNEL: vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 40
        DATE: Tue Oct 31 11:53:41 2017
      UPTIME: 342 days, 12:15:26
LOAD AVERAGE: 0.00, 0.02, 0.00
       TASKS: 1050
    NODENAME: web_yp_49_202.mobileztgame
     RELEASE: 2.6.32-358.el6.x86_64
     VERSION: #1 SMP Tue Jan 29 11:47:41 EST 2013
     MACHINE: x86_64  (2499 Mhz)
      MEMORY: 128 GB
       PANIC: "BUG: unable to handle kernel NULL pointer dereference at (null)"
         PID: 0
     COMMAND: "swapper"
        TASK: ffff882069324080  (1 of 40)  [THREAD_INFO: ffff881068896000]
         CPU: 5
       STATE: TASK_RUNNING (PANIC)
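The PANIC line in the crash output gives the reason for the crash: the kernel dereferenced a NULL pointer.
The bt command shows the stack backtrace at the time of the crash.
Its output is as follows: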
crash> bt
PID: 0      TASK: ffff882069324080  CPU: 5   COMMAND: "swapper"
 #0 [ffff8800618a3750] machine_kexec at ffffffff81035b7b
 #1 [ffff8800618a37b0] crash_kexec at ffffffff810c0db2
 #2 [ffff8800618a3880] oops_end at ffffffff815111d0
 #3 [ffff8800618a38b0] no_context at ffffffff81046bfb
 #4 [ffff8800618a3900] __bad_area_nosemaphore at ffffffff81046e85
 #5 [ffff8800618a3950] bad_area_nosemaphore at ffffffff81046f53
 #6 [ffff8800618a3960] __do_page_fault at ffffffff810476b1
 #7 [ffff8800618a3a80] do_page_fault at ffffffff8151311e
 #8 [ffff8800618a3ab0] page_fault at ffffffff815104d5
    [exception RIP: nf_queue+152]
    RIP: ffffffff81475718  RSP: ffff8800618a3b60  RFLAGS: 00010207
    RAX: 0000000000000020  RBX: 0000000000000000  RCX: ffff8810638a3c00
    RDX: 0000000000000002  RSI: ffff880959189980  RDI: 0000000000000000
    RBP: ffff8800618a3bd0   R8: 0000000000021773   R9: 0000000000000001
    R10: 000000000000000e  R11: 0000000000000006  R12: ffff880959189980
    R13: 0000000000000000  R14: ffffffff8147e8b0  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff8800618a3bd8] nf_hook_slow at ffffffff81474800
#10 [ffff8800618a3c58] ip_rcv at ffffffff8147ef54
#11 [ffff8800618a3c98] __netif_receive_skb at ffffffff8144819b
#12 [ffff8800618a3cf8] netif_receive_skb at ffffffff8144a578
#13 [ffff8800618a3d38] napi_skb_finish at ffffffff8144a680
#14 [ffff8800618a3d58] napi_gro_receive at ffffffff8144cc29
#15 [ffff8800618a3d78] ixgbe_poll at ffffffffa015e44c [ixgbe]
#16 [ffff8800618a3e68] net_rx_action at ffffffff8144cd43
#17 [ffff8800618a3ec8] __do_softirq at ffffffff81076fb1
#18 [ffff8800618a3f38] call_softirq at ffffffff8100c1cc
#19 [ffff8800618a3f50] do_softirq at ffffffff8100de05
#20 [ffff8800618a3f70] irq_exit at ffffffff81076d95
#21 [ffff8800618a3f80] do_IRQ at ffffffff81516c95
--- <IRQ stack> ---
#22 [ffff881068897db8] ret_from_intr at ffffffff8100b9d3
    [exception RIP: intel_idle+222]
    RIP: ffffffff812d37ae  RSP: ffff881068897e68  RFLAGS: 00000206
    RAX: 0000000000000000  RBX: ffff881068897ed8  RCX: 0000000000000000
    RDX: 00000000000e3cb1  RSI: 0000000000000000  RDI: 00000000379d13ba
    RBP: ffffffff8100b9ce   R8: 0000000000000004   R9: 0000000000000050
    R10: 0069229e5ea9dbfa  R11: 0000000000000000  R12: ffff8800618b15a0
    R13: 0000000000000000  R14: 0069229c2b297a40  R15: ffff8800618b16a0
    ORIG_RAX: ffffffffffffff62  CS: 0010  SS: 0018
#23 [ffff881068897ee0] cpuidle_idle_call at ffffffff81414ef7
#24 [ffff881068897f00] cpu_idle at ffffffff81009fc6
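Reading the bt output from the bottom up gives the kernel call chain that led to the crash: a NIC interrupt (ixgbe_poll) delivered a packet through ip_rcv and nf_hook_slow into nf_queue. The exception frame shows the faulting instruction pointer (RIP) at nf_queue+152, address ffffffff81475718. The dis command disassembles that address: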
crash> dis -l ffffffff81475718
/usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/net/netfilter/nf_queue.c: 221
0xffffffff81475718 <nf_queue+152>:      mov    (%rbx),%r12
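RBX is 0 in the register dump above, so this mov reads from address 0. The instruction maps to line 221 of nf_queue.c, which locates the faulting code: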
# vim /usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/net/netfilter/nf_queue.c +221
215         segs = skb_gso_segment(skb, 0);
216         kfree_skb(skb);
217         if (IS_ERR(segs))
218                 return 1;
219
220         do {
221                 struct sk_buff *nskb = segs->next;
222
223                 segs->next = NULL;
224                 if (!__nf_queue(segs, elem, pf, hook, indev, outdev, okfn,
225                                 queuenum))
226                         kfree_skb(segs);
227                 segs = nskb;
228         } while (segs);
229         return 1;
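The documentation comment on the skb_gso_segment function (shown next) says it may return NULL in some cases. IS_ERR() only matches encoded error pointers, so a NULL return passes the check on line 217, and segs->next on line 221 then dereferences a NULL pointer and crashes the kernel. Since GSO is involved, it should be possible to work around the problem by adjusting the NIC's offload settings: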
# vim /usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/net/core/dev.c +1728
1728 /**
1729  *      skb_gso_segment - Perform segmentation on skb.
1730  *      @skb: buffer to segment
1731  *      @features: features for the output path (see dev->features)
1732  *
1733  *      This function segments the given skb and returns a list of segments.
1734  *
1735  *      It may return NULL if the skb requires no segmentation.  This is
1736  *      only possible when GSO is used for verifying header integrity.
1737  */
1738 struct sk_buff *skb_gso_segment(struct sk_buff *skb, int features)
1739 {
1740         struct sk_buff *segs = ERR_PTR(-EPROTONOSUPPORT);
1741         struct packet_type *ptype;
1742         __be16 type = skb->protocol;
1743         int err;
The corresponding patch, found online:
https://patchwork.kernel.org/patch/6615071/
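The interactive steps in this section can also be run as a batch, which is handy when the same analysis has to be repeated on several dumps. Below is a minimal sketch, assuming the crash utility's -i option (read commands from a file) and its log, bt, dis, and quit commands. The helper name write_crash_cmds and the /tmp paths are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: write a crash command file for non-interactive use.
# $1 = output file, $2 = faulting address to disassemble
write_crash_cmds() {
    cat > "$1" <<EOF
log
bt
dis -l $2
quit
EOF
}

# Usage (vmlinux path and address taken from the session above):
#   write_crash_cmds /tmp/crash.cmds ffffffff81475718
#   crash -i /tmp/crash.cmds \
#       /usr/lib/debug/lib/modules/2.6.32-358.el6.x86_64/vmlinux vmcore > report.txt
```

The log command prints the kernel ring buffer saved in the dump, which usually contains the oops text that never reached the on-disk logs.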
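IV. Reproducing the problem
1. The first attempt to reproduce the problem used the following request, which did not trigger the crash: curl "t.test.com"
2. After reading up on TSO/GSO/LRO/GRO, it seemed likely that the request was too small to trigger packet aggregation and resegmentation, so the bug was never hit. Sending a larger request with the following URL reproduced the problem: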
# curl "t.test.com/v2/user-manage/css/bootstrap.min.css?test1=sdfsfsdfsdfa&test2_id=2234234234234234234&test_id=50129009890098&test_token=1670056402|_80_m_lxxj1298|1493196793|c726299f2d03b8462764bacf20e2395f|sdfsdfdsfsdffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffsdfsdfsdfdsfsdfhgjgjghjghjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjfhjgjfghjfjfhjjjjjjjjjjjjjjjjjjjjjfffffadfsfsdfsdfsdfsdfsdfdsfdssdfsdfsdfsdfsdfsdf"
The relevant ipset and iptables rules:
# ipset create lee hash:ip hashsize 819200 maxelem 100000 timeout 300
# ipset add lee 1.1.1.1 timeout 300
# iptables -t mangle -I PREROUTING -p tcp -m multiport --dports 80,443 -m set --match-set lee src -m string --string t.test.com --algo kmp --from 0 --to 1480 -j NFQUEUE
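The long query string in step 2 is only padding to enlarge the request. A minimal sketch for generating a request of a chosen size follows. The helper name padded_url and the pad query parameter are hypothetical, and the size needed to trigger the crash is not known from this article, so treat the number as something to experiment with.

```shell
#!/bin/sh
# Hypothetical helper: print a URL with an n-byte padding query parameter.
# $1 = base URL, $2 = number of padding bytes
padded_url() {
    base="$1"
    n="$2"
    # Produce n 'a' characters by translating n NUL bytes.
    pad=$(head -c "$n" /dev/zero | tr '\0' 'a')
    printf '%s?pad=%s\n' "$base" "$pad"
}

# Usage (only on a test machine where a crash is acceptable):
#   curl "$(padded_url t.test.com/v2/user-manage/css/bootstrap.min.css 600)"
```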
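V. Conclusion
This is a Linux kernel bug: nf_queue dereferences the NULL pointer that skb_gso_segment can return.
VI. Solutions
1. Upgrade the kernel. Judging from the patch and the source code, kernels after 3.0 should have this fixed; the 3.10 kernel source already contains the fix.
2. Use DROP instead of NFQUEUE as the target of the iptables rule (this is the recommended approach).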
3. Adjust the NIC's offload settings. Disabling LRO turned out to stop the reboots. The command:
# ethtool -K eth0 lro off
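A brief introduction to LRO:
Linux 2.6.24 added LRO (Large Receive Offload) support for IPv4 TCP. LRO aggregates multiple TCP segments into a single skb and later hands them to the upper network stack as one large packet, which reduces the per-skb processing cost in the stack and improves the system's TCP receive performance. All of this requires support from the NIC driver.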
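To confirm that an offload change actually took effect, the feature list printed by ethtool -k can be inspected. The sketch below assumes the "name: state" line format of that output. offload_state is a hypothetical helper that reads the listing on stdin and prints the state of one feature.

```shell
#!/bin/sh
# Hypothetical helper: print "on" or "off" for one feature from `ethtool -k`
# output supplied on stdin. Prints nothing if the feature is not listed.
# $1 = feature name exactly as printed, e.g. large-receive-offload
offload_state() {
    awk -v feature="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)              # drop leading indentation
            n = index(line, ":")                  # split at the first colon
            if (n && substr(line, 1, n - 1) == feature) {
                split(substr(line, n + 1), parts, " ")
                print parts[1]                    # "on"/"off", ignoring "[fixed]"
                exit
            }
        }'
}

# Usage:
#   ethtool -k eth0 | offload_state large-receive-offload
#   ethtool -k eth0 | offload_state generic-receive-offload
```

Checking generic-receive-offload alongside LRO is worthwhile here, since napi_gro_receive appears in the crash backtrace.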
VII. References
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/kernel_crash_dump_guide/sect-crash-running-the-utility
https://patchwork.kernel.org/patch/6615071/
https://www.ibm.com/developerworks/cn/linux/l-cn-network-pt/index.html
This article comes from the "佳" blog. Please keep this source link when reposting: http://leejia.blog.51cto.com/4356849/1978729