A First Look at Prometheus Alerting (Part 1)
阿新 · Published 2019-01-29
Alerting rules
global:
  scrape_interval: 15s
  evaluation_interval: 15s   # evaluate alerting rules every 15 seconds
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]   # where alerts are pushed, normally the AlertManager address
rule_files:
  - "test_rules.yml"   # the alerting rule file
scrape_configs:
  - job_name: 'node'   # user-defined job name for this scrape target
    static_configs:
      - targets: ['localhost:9100']
  - job_name: 'CDG-MS'
    honor_labels: true
    metrics_path: '/prometheus'
    static_configs:
      - targets: ['localhost:8089']
    relabel_configs:
      - target_label: env
        replacement: dev
  - job_name: 'eureka'
    file_sd_configs:
      - files:
          - "/app/enmonster/basic/prometheus/prometheus-2.2.1.linux-amd64/eureka.json"
    relabel_configs:
      - source_labels: [__job_name__]
        regex: (.*)
        target_label: job
        replacement: ${1}
      - target_label: env
        replacement: dev
As the configuration above shows, we can point Prometheus at a file of alerting rules:
groups:
  - name: example   # name of this rule group
    rules:
      - alert: InstanceDown   # fires when a target's metrics cannot be scraped
        expr: up == 0
        for: 1m   # the condition must hold for 1 minute before the alert is sent to AlertManager
        labels:
          serverity: page   # custom label (spelling kept as in the original setup)
        annotations:
          summary: "Instance {{ $labels.instance }} down"   # custom summary
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."   # custom description
The rule above is a very common one: it detects whether an application instance is DOWN.
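The same file can hold any number of rules over arbitrary PromQL expressions. As an illustration only (not part of the original setup; the metric names assume a recent node_exporter), a rule that fires when available memory stays below 10% for five minutes might look like:

```yaml
groups:
  - name: example
    rules:
      - alert: LowMemory   # hypothetical rule, for illustration
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10
        for: 5m            # must hold for 5 minutes before firing
        labels:
          serverity: page
        annotations:
          summary: "Low memory on {{ $labels.instance }}"
          description: "{{ $labels.instance }} has had less than 10% memory available for 5 minutes."
```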
Prometheus must be started with the flag below; otherwise the configuration cannot be reloaded without a full restart:
./prometheus --config.file=prometheus.yml --web.enable-lifecycle
With --web.enable-lifecycle enabled, a running Prometheus reloads its configuration on an HTTP POST to the /-/reload endpoint, e.g. curl -X POST http://localhost:9090/-/reload
Custom alert notifications
Modify the prometheus.yml configuration file:
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:17201"]   # address that alert notifications are pushed to
When an alert needs to be delivered, the configuration above makes Prometheus push it to the service at localhost:17201. The push works as follows:
Endpoint: /api/v1/alerts
Sample handler:
// Prometheus POSTs the alert list as a JSON array to this endpoint
@PostMapping("/api/v1/alerts")
public String alert(@RequestBody String body) {
    log.info("/api/v1/alerts = {}", body);
    return "success";
}
Request body structure:
[{
"labels": {
"alertname": "InstanceDown",
"env": "dev",
"instance": "10.208.204.46:19999",
"job": "RMS-MS",
"serverity": "page"
},
"annotations": {
"description": "10.208.204.46:19999 of job RMS-MS has been down for more than 5 minutes.",
"summary": "Instance 10.208.204.46:19999 down"
},
"startsAt": "2018-06-19T17:07:54.140071559+08:00",
"endsAt": "0001-01-01T00:00:00Z",
"generatorURL": "http://localhost.localdomain:9090/graph?g0.expr=up+==+0&g0.tab=1"
},
{
"labels": {
"alertname": "InstanceDown",
"env": "dev",
"instance": "192.168.164.1:18093",
"job": "RMS-MS",
"serverity": "page"
},
"annotations": {
"description": "192.168.164.1:18093 of job RMS-MS has been down for more than 5 minutes.",
"summary": "Instance 192.168.164.1:18093 down"
},
"startsAt": "2018-06-19T17:07:54.140071559+08:00",
"endsAt": "0001-01-01T00:00:00Z",
"generatorURL": "http://localhost.localdomain:9090/graph?g0.expr=up+==+0&g0.tab=1"
}
]
If several instances of job RMS-MS go down, Prometheus sends a payload like the one above to localhost:17201/api/v1/alerts,
and we can build our own alert notifications on top of that data.
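A custom notifier usually needs to aggregate these alerts itself before notifying anyone, e.g. by the job label. A minimal sketch in plain Java (the label maps stand in for the parsed JSON alerts; JSON parsing itself is omitted, and the class name is hypothetical):

```java
import java.util.*;
import java.util.stream.*;

// Sketch of the grouping step a custom receiver would have to implement
// itself (AlertManager provides this out of the box, see below).
public class AlertGrouper {
    // Each alert is represented here simply by its label map.
    static Map<String, List<Map<String, String>>> groupByJob(List<Map<String, String>> alerts) {
        // Bucket the alerts by the value of their "job" label.
        return alerts.stream()
                .collect(Collectors.groupingBy(a -> a.getOrDefault("job", "unknown")));
    }

    public static void main(String[] args) {
        List<Map<String, String>> alerts = List.of(
                Map.of("alertname", "InstanceDown", "job", "RMS-MS", "instance", "10.208.204.46:19999"),
                Map.of("alertname", "InstanceDown", "job", "RMS-MS", "instance", "192.168.164.1:18093"),
                Map.of("alertname", "InstanceDown", "job", "node", "instance", "localhost:9100"));
        Map<String, List<Map<String, String>>> grouped = groupByJob(alerts);
        System.out.println(grouped.get("RMS-MS").size()); // 2
        System.out.println(grouped.get("node").size());   // 1
    }
}
```

With the alerts bucketed this way, one notification per job can be sent instead of one per instance.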
AlertManager
AlertManager is Prometheus's own alerting component. When an alert fires, Prometheus pushes it to AlertManager, which applies its own routing rules and then sends out the notification. A sample alertmanager.yml:
global:
  resolve_timeout: 5m
route:
  group_by: ['job']
  group_wait: 30s       # wait before sending the first notification for a new group
  group_interval: 5m    # alerts of the same group arriving within 5 minutes are aggregated into one notification
  repeat_interval: 1s   # interval before re-sending an unresolved alert (1s for demo purposes; production values are usually hours)
  receiver: 'webhook'   # default receiver
receivers:
  - name: 'webhook'
    webhook_configs:
      - url: 'http://10.208.204.46:17210/test/alert2'   # receiving endpoint
The webhook receives a payload with the following structure:
{
"receiver": "webhook",
"status": "firing",
"alerts": [{
"status": "firing",
"labels": {
"alertname": "InstanceDown",
"env": "dev",
"instance": "10.208.204.46:19999",
"job": "RMS-MS",
"serverity": "page"
},
"annotations": {
"description": "10.208.204.46:19999 of job RMS-MS has been down for more than 5 minutes.",
"summary": "Instance 10.208.204.46:19999 down"
},
"startsAt": "2018-06-19T17:25:54.143824172+08:00",
"endsAt": "0001-01-01T00:00:00Z",
"generatorURL": "http://localhost.localdomain:9090/graph?g0.expr=up+==+0&g0.tab=1"
},
{
"status": "firing",
"labels": {
"alertname": "InstanceDown",
"env": "dev",
"instance": "192.168.164.1:18093",
"job": "RMS-MS",
"serverity": "page"
},
"annotations": {
"description": "192.168.164.1:18093 of job RMS-MS has been down for more than 5 minutes.",
"summary": "Instance 192.168.164.1:18093 down"
},
"startsAt": "2018-06-19T17:25:54.143824172+08:00",
"endsAt": "0001-01-01T00:00:00Z",
"generatorURL": "http://localhost.localdomain:9090/graph?g0.expr=up+==+0&g0.tab=1"
}
],
"groupLabels": {
"job": "RMS-MS"
},
"commonLabels": {
"alertname": "InstanceDown",
"env": "dev",
"job": "RMS-MS",
"serverity": "page"
},
"commonAnnotations": {},
"externalURL": "http://localhost.localdomain:9093",
"version": "4",
"groupKey": "{}:{job=\"RMS-MS\"}"
}
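Instead of logging the raw body, the payload above can be bound to typed objects on the receiving side. A minimal sketch of DTOs mirroring the JSON keys shown (the class names are hypothetical; with Spring's default Jackson binding the controller could accept `@RequestBody WebhookMessage msg` directly):

```java
import java.util.List;
import java.util.Map;

// DTOs mirroring the AlertManager webhook payload shown above.
// Field names match the JSON keys so Jackson can bind them without annotations.
public class WebhookMessage {
    public String receiver;
    public String status;                 // "firing" or "resolved"
    public List<Alert> alerts;
    public Map<String, String> groupLabels;
    public Map<String, String> commonLabels;
    public Map<String, String> commonAnnotations;
    public String externalURL;
    public String version;
    public String groupKey;

    public static class Alert {
        public String status;
        public Map<String, String> labels;
        public Map<String, String> annotations;
        public String startsAt;
        public String endsAt;
        public String generatorURL;
    }
}
```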
If all three machines in a cluster go DOWN, AlertManager aggregates the three alerts into a single notification and sends it to the webhook endpoint.
Comparison
Feature | AlertManager | Custom alerting
---|---|---
Grouping | Batches alerts in the same group into one summarized notification | Must be built in-house
Inhibition | Once an alert has fired, suppresses repeated alerts for errors caused by it | Must be built in-house
Silences | A simple mechanism to mute alerts for a specific time window | Must be built in-house
Drawbacks | Written in Go rather than Java, so digging into its internals takes effort | High development cost; crude in early iterations
Strengths | Mature, proven technology | -
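To make the inhibition row above concrete, here is a sketch of an inhibit rule in alertmanager.yml (the severity values are assumptions, not from the original setup): while a critical alert is firing, warning-level alerts that share the same alertname and instance are suppressed. Silences, by contrast, are created at runtime through the AlertManager UI or amtool rather than in the config file.

```yaml
inhibit_rules:
  - source_match:
      severity: 'critical'   # if an alert with these labels is firing...
    target_match:
      severity: 'warning'    # ...suppress alerts matching these labels...
    equal: ['alertname', 'instance']   # ...whenever these label values are identical
```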
The recommendation is to use AlertManager as the first gate for alert notifications, then forward them to our own service via the webhook.