Analysis of a system crash caused by memory MCE errors
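Today the server crashed because of a memory problem. Analysis with the mcelog tool showed "Error overflow" on memory reads (the memory is ECC and the decoded error was corrected, but errors were arriving faster than they could be logged). This is probably a memory hardware fault; if it happens again, the memory will have to be replaced.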
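Final cause: a hardware fault, most likely the motherboard. Because this is a production server, the motherboard and the memory were replaced at the same time to keep planned downtime to a minimum.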
# more /var/log/messages
Oct 31 14:19:36 pingu_fd kernel: sbridge: HANDLING MCE MEMORY ERROR
Oct 31 14:19:36 pingu_fd kernel: CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010092
Oct 31 14:19:36 pingu_fd kernel: TSC 0 ADDR 428fc8840 MISC 204808e886 PROCESSOR 0:206d6 TIME 1383200376 SOCKET 0 APIC 0
Oct 31 14:19:36 pingu_fd kernel: sbridge: HANDLING MCE MEMORY ERROR
Oct 31 14:19:36 pingu_fd kernel: CPU 0: Machine Check Exception: 0 Bank 10: 8800004800800092
Oct 31 14:19:36 pingu_fd kernel: TSC 0 ADDR 0 MISC 4900030243025000 PROCESSOR 0:206d6 TIME 1383200376 SOCKET 0 APIC 0
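When /var/log/messages is large, it can help to pull out just the bank number and raw status value of each event before decoding them. The following is a minimal Python sketch, assuming the "Machine Check Exception" line has the same layout as in the excerpt above; the names MCE_LINE and find_mce_events are made up for this illustration and are not part of mcelog or the kernel.

```python
import re

# Pattern for the kernel's "Machine Check Exception" line, as it appears
# in the excerpt above. Other kernel versions may format it differently.
MCE_LINE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check Exception: \S+ "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]{16})"
)


def find_mce_events(log_text):
    """Return a list of (cpu, bank, status) tuples found in log_text."""
    events = []
    for match in MCE_LINE.finditer(log_text):
        events.append(
            (
                int(match.group("cpu")),
                int(match.group("bank")),
                int(match.group("status"), 16),  # status is printed in hex
            )
        )
    return events


# Two lines copied from the log excerpt above.
sample = """\
Oct 31 14:19:36 pingu_fd kernel: CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010092
Oct 31 14:19:36 pingu_fd kernel: CPU 0: Machine Check Exception: 0 Bank 10: 8800004800800092
"""

for cpu, bank, status in find_mce_events(sample):
    print(f"cpu={cpu} bank={bank} status={status:#018x}")
```

Because the pattern is searched for anywhere in the text rather than anchored to the start of a line, it should also cope with log lines that have been run together.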
Decoding the messages with mcelog gives the following:
# mcelog sandybridge-ep --ascii < mcelog-manu.txt
sbridge: HANDLING MCE MEMORY ERROR
Hardware event. This is not a software error.
CPU 0 BANK 5
MISC 244076f686 ADDR 1a6bca040
TIME 1383200376 Thu Oct 31 14:19:36 2013
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNEL2_ERR
Transaction: Memory read error
STATUS cc0000c000010092 MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 45
SOCKET 0 APIC 0
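To see where the "Error overflow" and "Corrected error" lines come from, the STATUS value can be decoded by hand. The sketch below is only an illustration, not how mcelog itself is implemented: it assumes the architectural IA32_MCi_STATUS bit layout from the Intel Software Developer's Manual (VAL bit 63, OVER bit 62, UC bit 61, EN bit 60, MISCV bit 59, ADDRV bit 58, PCC bit 57) and the memory controller error code form 000F 0000 1MMM CCCC. The names decode_mci_status and MEM_TRANSACTIONS are hypothetical.

```python
# MMM sub-field of a memory controller error code (000F 0000 1MMM CCCC).
MEM_TRANSACTIONS = {
    0: "generic",
    1: "read",
    2: "write",
    3: "address/command",
    4: "scrub",
}


def decode_mci_status(status):
    """Decode the architectural flag bits of a 64-bit MCi_STATUS value."""
    decoded = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),      # an earlier error was still in the bank
        "uncorrected": bool((status >> 61) & 1),   # clear means the error was corrected
        "enabled": bool((status >> 60) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "context_corrupt": bool((status >> 57) & 1),
    }
    code = status & 0xFFFF  # MCA error code, bits 15:0
    decoded["mca_code"] = code
    # Ignore the filtering bit F (bit 12) when testing for the
    # memory controller error pattern.
    if (code & 0xEF80) == 0x0080:
        decoded["transaction"] = MEM_TRANSACTIONS.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        decoded["channel"] = None if channel == 0xF else channel  # 0xF: not specified
    return decoded


# STATUS value taken from the mcelog output above.
for name, value in decode_mci_status(0xCC0000C000010092).items():
    print(f"{name}: {value}")
```

For this value the sketch reports the overflow bit set, the uncorrected bit clear, and a read transaction on channel 2, which agrees with the "Error overflow", "Corrected error" and RD_CHANNEL2_ERR lines printed by mcelog.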