
Troubleshooting an Unexpected MySQL OOM

We received an alert that MySQL was down. At that point the monitoring showed the load was already quite high and the server could no longer be logged into. The load during the incident looked like this:

Besides the system-level metrics, the MySQL-level monitoring graphs had a gap during the incident. By the time the load came back down, MySQL had been killed by the OOM killer and automatically restarted. The MySQL error log contained the following:

2020-08-01T09:14:11.778918+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4271ms. The settings might not be optimal. (flushed=386 and evicted=119, during the time.)

This basically means the database was receiving so many writes that the buffer pool filled up with dirty pages. That triggers the page cleaner to flush dirty pages, and because there were far more dirty pages than usual, the page cleaner needed much longer than its intended 1000ms loop (4271ms here) to flush them.
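As a quick sanity check after such an incident, you can track how close the buffer pool is to being saturated with dirty pages. Below is a minimal sketch, not from the original article: it reads the `Innodb_buffer_pool_pages_*` counters from `SHOW GLOBAL STATUS`; the `pymysql` driver and the connection parameters are assumptions.

```python
def dirty_page_ratio(status):
    """Percentage of buffer pool pages that are dirty.

    status: dict mapping SHOW GLOBAL STATUS variable names to values.
    """
    dirty = int(status["Innodb_buffer_pool_pages_dirty"])
    total = int(status["Innodb_buffer_pool_pages_total"])
    return 100.0 * dirty / total


def fetch_buffer_pool_status(host="127.0.0.1", port=3306, user="xxx", passwd="xxx"):
    # pymysql is assumed to be installed; any DB-API driver works the same way.
    import pymysql
    conn = pymysql.connect(host=host, port=port, user=user, passwd=passwd)
    try:
        with conn.cursor() as cur:
            # rows come back as (Variable_name, Value) pairs
            cur.execute("SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages%'")
            return dict(cur.fetchall())
    finally:
        conn.close()
```

If the ratio stays pinned near `innodb_max_dirty_pages_pct`, the page cleaner cannot keep up with the incoming write load.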

Since the MySQL-level monitoring has a gap for exactly the window when the load was abnormally high, there are no metrics to look at. So how do we investigate? The only lead left is the binlog. Let's analyze it and look for large transactions:

```shell
mysqlbinlog mysql-bin.005298 | grep "GTID$(printf '\t')last_committed" -B 1 \
  | egrep -E '^# at|^#20' | awk '{print $1,$2,$3}' | sed 's/server//' \
  | sed 'N;s/\n/ /' \
  | awk 'NR==1 {tmp=$1} NR>1 {print $4,$NF,($3-tmp);tmp=$3}' \
  | sort -k 3 -n -r | head -n 20
```
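The one-liner works by pairing each transaction's GTID header timestamp with the gap between consecutive `# at <pos>` byte offsets, which approximates the transaction's size in the binlog. The same idea can be expressed in Python; this is a hedged sketch that operates on `mysqlbinlog` text output, with the line layout assumed from the command above:

```python
import re


def transaction_sizes(binlog_text):
    """Return [(timestamp, size_bytes), ...] per transaction.

    Assumes each GTID event header line looks like
    '#200801  8:56:44 server id ... GTID\tlast_committed=...'
    and is preceded by an '# at <pos>' offset line, as in the
    mysqlbinlog output filtered by the one-liner above.
    """
    events = []  # (timestamp, byte offset where the transaction's GTID event starts)
    last_at = None
    for line in binlog_text.splitlines():
        m = re.match(r'^# at (\d+)$', line)
        if m:
            last_at = int(m.group(1))
        elif line.startswith('#20') and 'GTID' in line and last_at is not None:
            ts = ' '.join(line.split()[:2]).lstrip('#')
            events.append((ts, last_at))
    # a transaction spans from its GTID offset to the next transaction's offset
    return [(events[i][0], events[i + 1][1] - events[i][1])
            for i in range(len(events) - 1)]
```

Sorting the result by size then gives the same "top N largest transactions" view as `sort -k 3 -n -r | head`.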

Around 8:56:44 the largest transaction comes in at about 100MB, which is quite large. Continuing to parse the binlog within that time range shows a wide-ranging update of data. That alone is not very intuitive, though, so let's write some code to push the DML trend into ES for analysis. The code is fairly simple:

```python
#!/usr/bin/python
#coding: utf8
""" Track the DML trend over a time window and analyze it in ES
Usage:
python dml2es.py | logstash -f /etc/logstash/conf.d/t_binlog.conf

"""
import json
import os
import sys
import logging
from datetime import datetime

from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.event import QueryEvent, RotateEvent
from pymysqlreplication.row_event import (
    WriteRowsEvent,
    UpdateRowsEvent,
    DeleteRowsEvent,
)


def init_log():
    # get the root logger
    mylogger = logging.getLogger()
    mylogger.setLevel(logging.DEBUG)
    # create a file handler and attach a formatter to it
    verbose_log_file = "/tmp/%s.verbose.log" % (os.path.basename(sys.argv[0]))
    fh = logging.FileHandler(verbose_log_file)
    fh.setLevel(logging.DEBUG)
    fh.setFormatter(logging.Formatter(
        fmt="%(asctime)s %(filename)s:%(lineno)d %(levelname)s %(message)s",
        datefmt="%a %d %b %Y %H:%M:%S"
    ))
    # create a stream handler and attach a formatter to it
    sh = logging.StreamHandler(sys.stdout)
    sh.setLevel(logging.INFO)
    sh.setFormatter(logging.Formatter(
        fmt="%(asctime)s %(levelname)-8s %(message)s",
        datefmt="%H:%M:%S"
    ))
    # add both handlers to the logger
    mylogger.addHandler(fh)
    mylogger.addHandler(sh)
    return mylogger


if __name__ == "__main__":
    logger = init_log()
    mysql_settings = {
        'host': '127.0.0.1',
        'port': 3306,
        'user': 'xxx',
        'passwd': 'xxx'
    }
    binlog_file = 'mysql-bin.005298'
    start_position = 123
    stop_position = 1031510925
    # no need for blocking
    stream = BinLogStreamReader(
        connection_settings=mysql_settings,
        server_id=9999,
        log_file=binlog_file,
        log_pos=start_position,
        resume_stream=True,
        blocking=False
    )
    for binlogevent in stream:
        data = {}
        if binlogevent.packet.log_pos >= stop_position:
            break
        # Emitting "@timestamp": binlogevent.timestamp directly,
        # i.e. "@timestamp" => 1582624830, makes logstash warn:
        # [WARN ] 2020-02-25 18:00:31.313 [[main]<stdin] Event - Unrecognized @timestamp value type=class org.jruby.RubyFixnum
        # hence the conversion below.
        # datetime.fromtimestamp parses the value as a datetime in the OS's local timezone.
        if isinstance(binlogevent, WriteRowsEvent):
            data = {
                "schema": binlogevent.schema,
                "table": binlogevent.table,
                "dmltype": "insert",
                "@timestamp": datetime.fromtimestamp(binlogevent.timestamp).strftime('%Y-%m-%dT%H:%M:%S'),
            }
            print(json.dumps(data))
        elif isinstance(binlogevent, DeleteRowsEvent):
            data = {
                "schema": binlogevent.schema,
                "table": binlogevent.table,
                "dmltype": "delete",
                "@timestamp": datetime.fromtimestamp(binlogevent.timestamp).strftime('%Y-%m-%dT%H:%M:%S'),
            }
            print(json.dumps(data))
        elif isinstance(binlogevent, UpdateRowsEvent):
            data = {
                "schema": binlogevent.schema,
                "table": binlogevent.table,
                "dmltype": "update",
                "@timestamp": datetime.fromtimestamp(binlogevent.timestamp).strftime('%Y-%m-%dT%H:%M:%S'),
            }
            print(json.dumps(data))
        elif isinstance(binlogevent, RotateEvent):
            if binlogevent.packet.log_pos <= 1655:
                logger.debug(u"next_binlog:" + binlogevent.next_binlog)
    stream.close()
```

In Kibana, narrowing the time window down to second granularity, we can look at the DML trend over this period:
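The Kibana view boils down to a per-second `date_histogram` on `@timestamp`, split by `dmltype`. A hedged sketch of the equivalent query body follows; the field names match what the script above emits, while the index name and time range in the usage note are illustrative assumptions:

```python
def dml_trend_query(start, end, interval="1s"):
    """Build an Elasticsearch query body: per-second DML counts, split by dmltype.

    start/end: ISO timestamps bounding the window of interest.
    """
    return {
        "size": 0,
        "query": {"range": {"@timestamp": {"gte": start, "lte": end}}},
        "aggs": {
            "per_second": {
                # one bucket per second across the window
                "date_histogram": {"field": "@timestamp",
                                   "fixed_interval": interval},
                # within each second, count events per DML type
                "aggs": {"by_dml": {"terms": {"field": "dmltype.keyword"}}},
            }
        },
    }
```

With `elasticsearch-py` this would be run as something like `es.search(index="t_binlog", body=dml_trend_query("2020-08-01T08:56:00", "2020-08-01T08:58:00"))`, assuming logstash wrote the events to an index named `t_binlog`.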

This matches the result we got from analyzing the binlog directly: a burst of updates at exactly that point in time.