Zabbix housekeeper processes more than 75% busy
Root cause analysis
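To keep the database from growing without bound, Zabbix has a mechanism that automatically deletes historical data: the housekeeper. When it purges history data frequently, MySQL performance can degrade, and this alert fires.
Zabbix normally monitors the Zabbix server itself. As shown below, we can look at the "Zabbix server: Utilization of housekeeper internal processes, in %" graph, which shows that the housekeeper processes suddenly became busy at 10:32.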
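A screenshot covering a longer time range is shown below:
For further analysis, we need to look at MySQL, starting with the slow query log. First confirm that it is enabled and where it is written: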
mysql> show variables like '%slow_query%';
+---------------------+-------------------------------------+
| Variable_name       | Value                               |
+---------------------+-------------------------------------+
| slow_query_log      | ON                                  |
| slow_query_log_file | /mysql_data/mysql/xxxx-slow.log     |
+---------------------+-------------------------------------+
2 rows in set (0.01 sec)
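# Note: unless configured otherwise (see the MySQL log_timestamps system variable), timestamps in the slow query log default to UTC, so they differ from UTC+8 local time.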
# Time: 2020-08-26T02:34:56.354162Z
# User@Host: zabbix[zabbix] @ localhost [] Id: 345463
# Query_time: 13.832335 Lock_time: 0.000088 Rows_sent: 0 Rows_examined: 5000
SET timestamp=1598409282;
delete from history where itemid=37078 limit 5000;
# Time: 2020-08-26T02:35:00.377783Z
# User@Host: zabbix[zabbix] @ localhost [] Id: 345463
# Query_time: 4.023518 Lock_time: 0.000126 Rows_sent: 0 Rows_examined: 5000
SET timestamp=1598409296;
delete from history where itemid=37079 limit 5000;
# Time: 2020-08-26T02:35:36.848120Z
# User@Host: zabbix[zabbix] @ localhost [] Id: 345463
# Query_time: 21.513432 Lock_time: 0.000094 Rows_sent: 0 Rows_examined: 5000
SET timestamp=1598409315;
delete from history where itemid=37099 limit 5000;
# Time: 2020-08-26T02:35:46.705206Z
# User@Host: zabbix[zabbix] @ localhost [] Id: 345463
# Query_time: 9.856468 Lock_time: 0.000124 Rows_sent: 0 Rows_examined: 5000
SET timestamp=1598409336;
delete from history where itemid=37100 limit 5000;
# Time: 2020-08-26T02:36:43.856421Z
# User@Host: zabbix[zabbix] @ localhost [] Id: 345463
# Query_time: 38.186585 Lock_time: 0.000039 Rows_sent: 0 Rows_examined: 5000
SET timestamp=1598409365;
delete from history where itemid=38789 limit 5000;
# Time: 2020-08-26T02:36:59.432174Z
# User@Host: zabbix[zabbix] @ localhost [127.0.0.1] Id: 345563
# Query_time: 8.542213 Lock_time: 0.000084 Rows_sent: 20 Rows_examined: 7298
SET timestamp=1598409410;
SELECT DISTINCT e.eventid,e.clock,e.ns,e.objectid,e.acknowledged,er1.r_eventid FROM events e LEFT JOIN event_recovery er1 ON er1.eventid=e.eventid WHERE e.source='0' AND e.object='0' AND e.objectid=26811 AND e.eventid<='3437835' AND e.value=1 ORDER BY e.eventid DESC LIMIT 20;
# Time: 2020-08-26T02:37:02.317422Z
# User@Host: zabbix[zabbix] @ localhost [] Id: 345463
# Query_time: 18.460853 Lock_time: 0.000101 Rows_sent: 0 Rows_examined: 5000
SET timestamp=1598409403;
delete from history where itemid=38790 limit 5000;
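Because the slow query log is in UTC while the Zabbix graph uses local time, it helps to convert the "# Time:" headers before comparing the two. The following is a minimal sketch, assuming the server's local time zone is UTC+8 as described above; the function name is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: local time is UTC+8, as stated in the text.
LOCAL_TZ = timezone(timedelta(hours=8))


def slow_log_time_to_local(line):
    """Convert a '# Time: <UTC timestamp>Z' slow-log header line to UTC+8."""
    stamp = line.split("Time:", 1)[1].strip()
    utc_time = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return utc_time.replace(tzinfo=timezone.utc).astimezone(LOCAL_TZ)


local = slow_log_time_to_local("# Time: 2020-08-26T02:34:56.354162Z")
print(local.strftime("%Y-%m-%d %H:%M:%S"))  # 2020-08-26 10:34:56
```

The first slow delete therefore lands at 10:34 local time, which lines up with the housekeeper becoming busy at about 10:32 on the graph.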
In addition, Zabbix Server also writes slow SQL queries to zabbix_server.log, as shown below.
# grep "slow query" zabbix_server.log
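The logs show that MySQL is busy deleting data from the history and history_uint tables. In this case the housekeeper suddenly became busy because I had deleted the template "Zabbix template for Microsoft SQL Server" with the Clear option checked, which forced Zabbix Server to delete a large amount of data. That was only the trigger, though; the more important cause is that the history table itself had grown very large. You can check the size of these tables with the query below.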
-- List the tables in the zabbix schema by size (largest last).
-- TABLE_ROWS is only an estimate for InnoDB tables.
SELECT TABLE_SCHEMA
     , TABLE_NAME
     , (DATA_LENGTH/1024/1024)                AS DATA_SIZE_MB
     , (INDEX_LENGTH/1024/1024)               AS INDEX_SIZE_MB
     , ((DATA_LENGTH+INDEX_LENGTH)/1024/1024) AS TABLE_SIZE_MB
     , TABLE_ROWS
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'zabbix'
ORDER BY TABLE_SIZE_MB ASC;
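Solution
If this alert only appears briefly, it can generally be ignored. If it keeps firing, adjust the HousekeepingFrequency and MaxHousekeeperDelete parameters.
In Zabbix 5.x, HousekeepingFrequency defaults to 1, meaning the housekeeper runs once an hour, and MaxHousekeeperDelete defaults to 5000, meaning at most 5000 rows are deleted per task in each housekeeping cycle. The relevant parts of the configuration file are shown below: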
# grep -C 1 HousekeepingFrequency /etc/zabbix/zabbix_server.conf
### Option: HousekeepingFrequency
# How often Zabbix will perform housekeeping procedure (in hours).
# Housekeeping is removing outdated information from the database.
# To prevent Housekeeper from being overloaded, no more than 4 times HousekeepingFrequency
# hours of outdated information are deleted in one housekeeping cycle, for each item.
# To lower load on server startup housekeeping is postponed for 30 minutes after server start.
# With HousekeepingFrequency=0 the housekeeper can be only executed using the runtime control option.
# In this case the period of outdated information deleted in one housekeeping cycle is 4 times the
--
# Default:
# HousekeepingFrequency=1
# grep -C 1 MaxHousekeeperDelete /etc/zabbix/zabbix_server.conf
### Option: MaxHousekeeperDelete
# The table "housekeeper" contains "tasks" for housekeeping procedure in the format:
# [housekeeperid], [tablename], [field], [value].
# No more than 'MaxHousekeeperDelete' rows (corresponding to [tablename], [field], [value])
# will be deleted per one task in one housekeeping cycle.
--
# Default:
# MaxHousekeeperDelete=5000
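MaxHousekeeperDelete has been supported since Zabbix 1.8.2. Setting it to 0 removes the limit on the number of rows deleted, which is not recommended. Note also that it only takes effect when deleting history and trend data for items that have already been deleted. The usual fix is to lengthen the interval between housekeeper runs and increase the amount of data deleted per run. There is no standard answer for the right values; they have to be found by testing against your actual workload.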
# Run the housekeeper every 6 hours
HousekeepingFrequency=6
# Delete at most 10000 rows per task per cycle
MaxHousekeeperDelete=10000
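To see how the two parameters interact, here is a rough back-of-the-envelope sketch. It relies only on the rule quoted in the configuration comments (no more than MaxHousekeeperDelete rows are deleted per task in one housekeeping cycle) and ignores the 30-minute startup delay and the time each delete takes. The row count is a hypothetical figure, not a measurement from this system.

```python
import math


def hours_to_clear(rows, max_delete, frequency_hours):
    """Rough hours until one housekeeper task has removed `rows` rows."""
    if max_delete == 0:
        # 0 means no limit, so a single cycle is enough.
        cycles = 1
    else:
        cycles = math.ceil(rows / max_delete)
    return cycles * frequency_hours


ROWS = 1_000_000  # hypothetical history rows left behind by one deleted item

print(hours_to_clear(ROWS, 5000, 1))   # defaults: 200 cycles, 1 hour apart
print(hours_to_clear(ROWS, 10000, 6))  # values above: 100 cycles, 6 hours apart
```

Under these assumptions, lengthening the interval reduces how often the housekeeper is busy but can also stretch out the total time needed to clear a backlog, which is one more reason to test the values rather than guess.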
In this case, after raising MaxHousekeeperDelete to 100000, the delete operations actually became much slower, as shown below:
836378:20200826:161213.441 slow query: 773.254950 sec, "delete from history where itemid=45251 limit 100000"
836378:20200826:162435.978 slow query: 742.537260 sec, "delete from history where itemid=46694 limit 100000"
836378:20200826:163329.011 slow query: 532.932137 sec, "delete from history where itemid=51313 limit 100000"
836378:20200826:163842.539 slow query: 313.528311 sec, "delete from history where itemid=52664 limit 100000"
With MaxHousekeeperDelete set to 10000, the deletes were noticeably faster. It is therefore worth testing several values and comparing the results.
943980:20200826:233157.246 slow query: 5.393617 sec, "delete from history where itemid=37769 limit 10000"
943980:20200826:233202.914 slow query: 5.667551 sec, "delete from history where itemid=38407 limit 10000"
943980:20200826:233208.044 slow query: 5.129767 sec, "delete from history where itemid=41283 limit 10000"
943980:20200826:233217.462 slow query: 7.011403 sec, "delete from history where itemid=37770 limit 10000"
943980:20200826:233222.516 slow query: 5.053935 sec, "delete from history where itemid=38408 limit 10000"
943980:20200826:233227.286 slow query: 4.769753 sec, "delete from history where itemid=41284 limit 10000"
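When comparing settings like this, it can help to summarize the log rather than eyeball it. The sketch below is one possible approach, not part of Zabbix: it parses "slow query" lines in the format shown above and reports the mean duration per LIMIT value. The sample lines are copied from the log excerpts above; the regular expression and function name are illustrative assumptions. Only statements slow enough to be logged are counted, so the means are biased toward the slow cases.

```python
import re
from collections import defaultdict

# Matches the housekeeper delete statements in zabbix_server.log, e.g.
# ... slow query: 5.393617 sec, "delete from history where itemid=37769 limit 10000"
PATTERN = re.compile(
    r'slow query: ([\d.]+) sec, "delete from (\w+) where itemid=\d+ limit (\d+)"'
)

SAMPLE = """\
836378:20200826:161213.441 slow query: 773.254950 sec, "delete from history where itemid=45251 limit 100000"
836378:20200826:163842.539 slow query: 313.528311 sec, "delete from history where itemid=52664 limit 100000"
943980:20200826:233157.246 slow query: 5.393617 sec, "delete from history where itemid=37769 limit 10000"
943980:20200826:233227.286 slow query: 4.769753 sec, "delete from history where itemid=41284 limit 10000"
"""


def mean_seconds_by_limit(log_text):
    """Return {limit: mean duration in seconds} for logged slow deletes."""
    durations = defaultdict(list)
    for match in PATTERN.finditer(log_text):
        seconds, _table, limit = match.groups()
        durations[int(limit)].append(float(seconds))
    return {limit: sum(v) / len(v) for limit, v in durations.items()}


for limit, mean in sorted(mean_seconds_by_limit(SAMPLE).items()):
    # Upper bound: assumes every statement deleted the full LIMIT rows.
    print(f"limit={limit}: mean {mean:.1f} s, at most {limit / mean:.0f} rows/s")
```

To run it against a real log, replace SAMPLE with the contents of zabbix_server.log.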
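There are other approaches as well, such as shortening the history retention period or partitioning large tables like history, which can avoid this alert or make it less likely. In my experience, once a table like history has grown very large, tuning the parameters above has little visible effect. Partitioning the table or manually purging historical data is what makes a real difference.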