SLS Machine Learning Best Practices: Log Clustering + Anomaly Alerting
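0. Article series links
- Introduction to SLS Machine Learning (01): Time Series Statistical Modeling
- Introduction to SLS Machine Learning (02): Time Series Clustering Modeling
- Introduction to SLS Machine Learning (03): Time Series Anomaly Detection Modeling
- Introduction to SLS Machine Learning (04): Rule Pattern Mining
- Introduction to SLS Machine Learning (05): Time Series Forecasting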
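1. What tools do we have at hand?
Digging more value out of logs has always been our team's focus. On top of the existing real-time log query, SLS has this year rounded out the following capabilities in the DevOps area:
- Context query
- Real-time Tail and intelligent clustering, to make troubleshooting more efficient
- A variety of anomaly detection and forecasting functions for time series data, for smarter inspection and prediction
- Visualization of data analysis results
- Powerful alert configuration and notification, with webhook calls to trigger follow-up actions
Today we focus on how intelligent log clustering and anomaly alerting can work together to detect anomalies and raise alerts more effectively.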
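2. Platform experiment
2.1 Experiment data
We use a raw Sys Log dataset with the log clustering service enabled. The screenshot of its status is as follows:
By adjusting the size of red box 1 in the screenshot below, you can change the result shown in red box 2. The finest-grained patterns, however, do not change. In other words, sub-pattern results are stable and unique, and we can use a sub-pattern's Signature to find the corresponding raw log entries.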
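2.2 Generating time series information for a sub-pattern
Suppose we want to monitor this sub-pattern:
msg:vm-111932.tc su: pam_unix(*:session): session closed for user root
The corresponding signature_id: __log_signature__: 1814836459146662485
With the raw logs matching this pattern in hand, we can look at a histogram of their count along the time axis:
The chart above shows that logs of this pattern are not evenly distributed, and some time intervals contain no logs at all. Counting directly per time window gives the following time series: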
__log_signature__: 1814836459146662485 |
select
date_trunc('minute', __time__) as time,
COUNT(*) as num
from log GROUP BY time order by time ASC limit 10000
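To make the effect of the per-minute count concrete, here is a minimal Python sketch of the same bucketing idea, outside SLS. The function name `count_per_minute` and the sample timestamps are made up for illustration; only minutes that actually contain logs show up in the result.

```python
from collections import Counter

def count_per_minute(timestamps):
    """Count entries per minute bucket (bucket key = start of the minute, Unix seconds)."""
    counts = Counter(ts - ts % 60 for ts in timestamps)
    return sorted(counts.items())

# Hypothetical log timestamps; no entry falls into the minute starting at 60.
sample = [0, 5, 59, 120, 130, 185]
print(count_per_minute(sample))  # [(0, 3), (120, 2), (180, 1)]
```

The minute starting at 60 is simply absent from the output, which is the kind of gap the query result shows.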
In the resulting chart the time axis is not continuous, so we need to fill in the missing points of this time series.
__log_signature__: 1814836459146662485 |
select
time_series(time, '1m', '%Y-%m-%d %H:%i:%s', '0') as time,
avg(num) as num
from (
select
__time__ - __time__ % 60 as time,
COUNT(*) as num
from log GROUP BY time order by time desc )
GROUP by time order by time ASC limit 10000
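The gap-filling step can be sketched in plain Python as follows. This is an illustration of the idea only, not how `time_series` is implemented: the helper `fill_missing_minutes` is a hypothetical name, and as a simplifying assumption it fills only between the first and last observed bucket.

```python
def fill_missing_minutes(counts, step=60, fill=0):
    """counts: dict {bucket_start: count}. Return a gap-free list of (bucket_start, count)."""
    if not counts:
        return []
    start, end = min(counts), max(counts)
    # Walk every bucket in the range and substitute `fill` where nothing was observed.
    return [(t, counts.get(t, fill)) for t in range(start, end + step, step)]

sparse = {0: 3, 120: 2, 180: 1}  # made-up per-minute counts with one missing minute
print(fill_missing_minutes(sparse))  # [(0, 3), (60, 0), (120, 2), (180, 1)]
```

Filling with 0 matches the '0' argument used in the query above: a minute without logs is treated as a count of zero rather than as unknown.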
2.3 Running anomaly detection on the time series
Use the time series anomaly detection function: ts_predicate_arma
__log_signature__: 1814836459146662485 |
select
ts_predicate_arma(to_unixtime(time), num, 5, 1, 1, 1, 'avg')
from (
select
time_series(time, '1m', '%Y-%m-%d %H:%i:%s', '0') as time,
avg(num) as num
from (
select
__time__ - __time__ % 60 as time,
COUNT(*) as num
from log GROUP BY time order by time desc )
GROUP by time order by time ASC ) limit 10000
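For readers who want an intuition for what such a detector produces, the sketch below uses a much simpler stand-in technique: a rolling mean with a band of k standard deviations. It is explicitly not ARMA and not the algorithm behind `ts_predicate_arma`. The function name `band_detect`, the parameters `window` and `k`, and the 0/1 value used for `prob` are all assumptions; only the row layout (unixtime, src, pred, up, lower, prob) follows the column names used later in this article.

```python
from statistics import mean, pstdev

def band_detect(series, window=5, k=3.0):
    """series: list of (unixtime, value).

    Returns rows [unixtime, src, pred, up, lower, prob], using a rolling-mean
    band as a stand-in for a real forecasting model.
    """
    rows = []
    for i, (ts, value) in enumerate(series):
        history = [v for _, v in series[max(0, i - window):i]]
        if len(history) < window:
            # Not enough history yet: echo the value and never flag it.
            rows.append([ts, value, value, value, value, 0.0])
            continue
        pred = mean(history)
        spread = k * pstdev(history)
        up, lower = pred + spread, pred - spread
        # Simplification: prob is 1.0 outside the band, 0.0 inside it.
        prob = 1.0 if value > up or value < lower else 0.0
        rows.append([ts, value, pred, up, lower, prob])
    return rows

# Made-up series: a flat count of 2 per minute followed by a spike.
series = [(i * 60, 2) for i in range(6)] + [(360, 50)]
for row in band_detect(series):
    print(row)
```

Only the last point falls outside the band built from the preceding five minutes, so only that row gets a non-zero prob.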
2.4 How to configure the alert
- Unpack the result of the machine learning function
__log_signature__: 1814836459146662485 |
select
t1[1] as unixtime, t1[2] as src, t1[3] as pred, t1[4] as up, t1[5] as lower, t1[6] as prob
from (
select
ts_predicate_arma(to_unixtime(time), num, 5, 1, 1, 1, 'avg') as res
from (
select
time_series(time, '1m', '%Y-%m-%d %H:%i:%s', '0') as time,
avg(num) as num
from (
select
__time__ - __time__ % 60 as time,
COUNT(*) as num
from log GROUP BY time order by time desc )
GROUP by time order by time ASC )) , unnest(res) as t(t1)
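The unpacking step turns one array of six-element rows into one named record per row. A rough Python equivalent, assuming the result is available as a list of lists, might look like this. The `Point` type and the sample values are made up; the field names mirror the aliases in the query above. Note that SQL array indexes start at 1 while Python starts at 0.

```python
from typing import NamedTuple

class Point(NamedTuple):
    unixtime: float
    src: float    # observed value
    pred: float   # predicted value
    up: float     # upper bound
    lower: float  # lower bound
    prob: float   # anomaly score

def unpack(res):
    """Turn a list of 6-element rows into named records (like unnest plus t1[1]..t1[6])."""
    return [Point(*row) for row in res]

# Made-up rows standing in for the function output.
res = [
    [1546300800, 3.0, 2.8, 5.1, 0.5, 0.0],
    [1546300860, 9.0, 3.0, 5.3, 0.7, 1.0],
]
points = unpack(res)
print(points[1].src, points[1].prob)  # 9.0 1.0
```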
- Alert on the results of the last two minutes
__log_signature__: 1814836459146662485 |
select
unixtime, src, pred, up, lower, prob
from (
select
t1[1] as unixtime, t1[2] as src, t1[3] as pred, t1[4] as up, t1[5] as lower, t1[6] as prob
from (
select
ts_predicate_arma(to_unixtime(time), num, 5, 1, 1, 1, 'avg') as res
from (
select
time_series(time, '1m', '%Y-%m-%d %H:%i:%s', '0') as time,
avg(num) as num
from (
select
__time__ - __time__ % 60 as time, COUNT(*) as num
from log GROUP BY time order by time desc )
GROUP by time order by time ASC )) , unnest(res) as t(t1) )
where is_nan(src) = false order by unixtime desc limit 2
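The filter at the end of this query keeps only rows with a real observed value and then takes the two newest ones. A minimal Python sketch of that selection, assuming rows are dictionaries with `unixtime` and `src` keys (the helper name `latest_observed` and the sample rows are hypothetical):

```python
import math

def latest_observed(rows, n=2):
    """Keep rows whose src is not NaN, sort newest first, and return the first n."""
    observed = [r for r in rows if not math.isnan(r["src"])]
    observed.sort(key=lambda r: r["unixtime"], reverse=True)
    return observed[:n]

rows = [
    {"unixtime": 0, "src": 5.0},
    {"unixtime": 60, "src": 1.0},
    {"unixtime": 120, "src": 2.0},
    {"unixtime": 180, "src": float("nan")},  # a row without an observed value
]
print([r["unixtime"] for r in latest_observed(rows)])  # [120, 60]
```

Dropping the NaN rows first matters: otherwise a row with no observation could take one of the two slots that the alert looks at.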
- Alert on rising points, and set a fallback strategy
__log_signature__: 1814836459146662485 |
select
sum(prob) as sumProb, max(src) as srcMax, max(up) as upMax
from (
select
unixtime, src, pred, up, lower, prob
from (
select
t1[1] as unixtime, t1[2] as src, t1[3] as pred, t1[4] as up, t1[5] as lower, t1[6] as prob
from (
select
ts_predicate_arma(to_unixtime(time), num, 5, 1, 1, 1, 'avg') as res
from (
select
time_series(time, '1m', '%Y-%m-%d %H:%i:%s', '0') as time, avg(num) as num
from (
select
__time__ - __time__ % 60 as time, COUNT(*) as num
from log GROUP BY time order by time desc )
GROUP by time order by time ASC )) , unnest(res) as t(t1) )
where is_nan(src) = false order by unixtime desc limit 2 )
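The query reduces the last two points to three numbers: sumProb, srcMax and upMax. How these are combined into a trigger condition is not spelled out in the text, so the rule below is purely a hypothetical illustration: the thresholds `prob_threshold` and `fallback_src` and the function names are assumptions, not the author's settings.

```python
def summarize(rows):
    """Aggregate the selected rows the same way the query does."""
    return {
        "sumProb": sum(r["prob"] for r in rows),
        "srcMax": max(r["src"] for r in rows),
        "upMax": max(r["up"] for r in rows),
    }

def should_alert(summary, prob_threshold=1.0, fallback_src=100.0):
    """Hypothetical rule, not taken from the article.

    Rising point: enough anomaly score AND the observed peak exceeds the upper bound.
    Fallback: the observed peak exceeds a fixed ceiling regardless of the model.
    """
    rising = summary["sumProb"] >= prob_threshold and summary["srcMax"] > summary["upMax"]
    fallback = summary["srcMax"] >= fallback_src
    return rising or fallback

last_two = [  # made-up values for the last two minutes
    {"src": 9.0, "up": 5.0, "prob": 1.0},
    {"src": 3.0, "up": 5.0, "prob": 0.0},
]
print(should_alert(summarize(last_two)))  # True
```

Comparing the observed peak with the upper bound restricts the alert to upward deviations, and the fixed ceiling acts as a safety net when the model's band has drifted upward along with the data.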
The specific alert settings are as follows:
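3. A word from our sponsor
3.1 Advanced log topics
Demos of the various Log Service features are available here: Log Service overview and demos.
For more advanced log topics, see the Log Service learning path.
Author: 悟冥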