prometheus (4): the Alertmanager alerting plugin
The alert handling flow works as follows:
1. Prometheus Server scrapes the HTTP endpoints exposed on the monitored targets (assume an endpoint A here), collecting metrics at the interval defined by 'scrape_interval' in the Prometheus configuration.
2. When endpoint A becomes unavailable, the server keeps retrying the scrape until "scrape_timeout" elapses; the target's state is then marked "DOWN".
3. Prometheus also evaluates the alerting rules periodically, at the interval set by "evaluation_interval" (default 1m). When an evaluation runs and finds endpoint A DOWN, i.e. up == 0 is true, the alert becomes active, enters the "PENDING" state, and its activation time is recorded.
4. On the next rule-evaluation cycle, if up == 0 is still true, Prometheus checks whether the alert has been active longer than the 'for' duration in the rule. If not, it simply moves on to the next evaluation cycle; if it has, the alert's state changes to "FIRING" and Prometheus calls the Alertmanager API to deliver the alert data.
5. When Alertmanager receives the alert, it assigns it to a group, then waits for the configured "group_wait" interval before sending the first notification for that group.
6. New alerts belonging to the same alert group may arrive during that wait. If a notification for the group has already been sent successfully, the next notification goes out only after "group_interval" has passed. With e-mail notifications, for example, all alerts in one group are batched into a single message.
7. If the alerts in a group stay unchanged and were delivered successfully, the same alert mail is resent only after 'repeat_interval' has elapsed. If the previous delivery failed, the case falls back to step 6 and delivery is retried after group_interval.
Who an alert is ultimately sent to, under which matching conditions a receiver is chosen, and at what frequency notifications go out are all configured through Alertmanager's route rules.
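As a minimal sketch of steps 3 and 4, the rule below (the same pattern as the InstanceDown rule in the rules.yml later in this article) stays PENDING while up == 0 holds and only becomes FIRING once the 'for' duration has elapsed:

groups:
- name: instance-health
  rules:
  - alert: InstanceDown
    expr: up == 0        # true while the target is DOWN
    for: 1m              # stay PENDING for 1m before turning FIRING
    labels:
      severity: warning
    annotations:
      description: "{{ $labels.instance }} has been unreachable for 1 minute"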
The Alertmanager configuration file
kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: monitor-sa
data:
  alertmanager.yml: |-
    global:
      resolve_timeout: 1m                  # how long to wait before resolving an alert
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: '*****@163.com'
      smtp_auth_username: '138****'
      smtp_auth_password: '****GRMBHNBOY'  # the mailbox's authorization code
      smtp_require_tls: false
    route:                                 # alert dispatch policy
      group_by: [alertname]                # label(s) used to group alerts
      group_wait: 10s                      # wait so alerts raised within this window are sent together as one group
      group_interval: 10s                  # interval between notifications for different groups
      repeat_interval: 10m                 # interval before a repeated alert is sent again
      receiver: default-receiver           # default alert receiver
    receivers:                             # alert receivers
    - name: 'default-receiver'
      email_configs:
      - to: '******@qq.com'
        send_resolved: true
      - to: '******@qq.com'
        send_resolved: true
Notes on the Alertmanager configuration:
- smtp_smarthost: 'smtp.163.com:25': the SMTP server address and port of the 163 mailbox.
- smtp_from: '[email protected]': the mailbox that alerts are sent from.
- smtp_auth_username: '15011572657': the SMTP authentication user of the sending mailbox (not the mailbox display name).
- smtp_auth_password: 'BGWHYUOSOOHWEUJM': the sending mailbox's authorization code, not its login password; substitute your own, or sending will fail.
- to: '[email protected]' (under email_configs): the destination mailbox (a QQ mailbox here); use your own address, and it should not be the same mailbox as smtp_from.
- route: sets the alert dispatch policy.
- group_by: [alertname]: Alertmanager groups alerts according to this label.
- group_wait: 10s: the grouping wait time; after an alert arrives, wait 10s so that alerts of the same group are sent together.
- group_interval: 10s: the interval between sending notifications for successive groups.
- repeat_interval: 10m: how long before an identical alert is resent, which reduces duplicate mail; the default is 1h.
- receiver: default-receiver: defines who receives the alerts.
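To apply the ConfigMap and sanity-check the embedded configuration, something like the following can be used. This is a sketch: it assumes the manifest above is saved as alertmanager-cm.yaml (the file name used later in this article) and that amtool from the Alertmanager release is installed locally.

kubectl apply -f alertmanager-cm.yaml
# extract the embedded alertmanager.yml and validate its syntax
kubectl -n monitor-sa get configmap alertmanager -o jsonpath='{.data.alertmanager\.yml}' > /tmp/alertmanager.yml
amtool check-config /tmp/alertmanager.yml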
Installing Prometheus + Alertmanager

The Prometheus + Alertmanager configuration file:
kind: ConfigMap
apiVersion: v1
metadata:
  labels:
    app: prometheus
  name: prometheus-config
  namespace: monitor-sa
data:
  prometheus.yml: |
    rule_files:
    - /etc/prometheus/rules.yml
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["localhost:9093"]
    global:
      scrape_interval: 15s
      scrape_timeout: 10s
      evaluation_interval: 1m
    scrape_configs:
    - job_name: 'kubernetes-node'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
    - job_name: 'kubernetes-node-cadvisor'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    - job_name: 'kubernetes-apiserver'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name
    - job_name: 'kubernetes-schedule'
      scrape_interval: 5s
      static_configs:
      - targets: ['172.17.166.217:10251','172.17.166.218:10251','172.17.166.219:10251']
    - job_name: 'kubernetes-controller-manager'
      scrape_interval: 5s
      static_configs:
      - targets: ['172.17.166.217:10252','172.17.166.218:10252','172.17.166.219:10252']
    - job_name: 'kubernetes-kube-proxy'
      scrape_interval: 5s
      static_configs:
      - targets: ['172.17.166.219:10249','172.17.27.255:10249','172.17.27.248:10249','172.17.4.79:10249']
    - job_name: 'pushgateway'
      scrape_interval: 5s
      static_configs:
      - targets: ['172.17.166.217:9091']
      honor_labels: true
    - job_name: 'kubernetes-etcd'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/ca.pem
        cert_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/kubernetes.pem
        key_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/kubernetes-key.pem
      scrape_interval: 5s
      static_configs:
      - targets: ['172.17.166.219:2379','172.17.4.79:2379','172.17.27.255:2379','172.17.27.248:2379']
  rules.yml: |
    groups:
    - name: example
      rules:
      - alert: KubeProxyCPUUsageAbove80Percent
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-kube-proxy"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{ $labels.job }} component on {{ $labels.instance }} exceeds 80%"
      - alert: KubeProxyCPUUsageAbove90Percent
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-kube-proxy"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{ $labels.job }} component on {{ $labels.instance }} exceeds 90%"
      - alert: SchedulerCPUUsageAbove80Percent
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-schedule"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{ $labels.job }} component on {{ $labels.instance }} exceeds 80%"
      - alert: SchedulerCPUUsageAbove90Percent
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-schedule"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{ $labels.job }} component on {{ $labels.instance }} exceeds 90%"
      - alert: ControllerManagerCPUUsageAbove80Percent
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-controller-manager"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{ $labels.job }} component on {{ $labels.instance }} exceeds 80%"
      - alert: ControllerManagerCPUUsageAbove90Percent
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-controller-manager"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{ $labels.job }} component on {{ $labels.instance }} exceeds 90%"
      - alert: ApiserverCPUUsageAbove80Percent
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-apiserver"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{ $labels.job }} component on {{ $labels.instance }} exceeds 80%"
      - alert: ApiserverCPUUsageAbove90Percent
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-apiserver"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{ $labels.job }} component on {{ $labels.instance }} exceeds 90%"
      - alert: EtcdCPUUsageAbove80Percent
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-etcd"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{ $labels.job }} component on {{ $labels.instance }} exceeds 80%"
      - alert: EtcdCPUUsageAbove90Percent
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-etcd"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{ $labels.job }} component on {{ $labels.instance }} exceeds 90%"
      - alert: KubeStateMetricsCPUUsageAbove80Percent
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-state-metrics"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{ $labels.k8s_app }} component on {{ $labels.instance }} exceeds 80%"
          value: "{{ $value }}%"
          threshold: "80%"
      - alert: KubeStateMetricsCPUUsageAbove90Percent
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-state-metrics"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{ $labels.k8s_app }} component on {{ $labels.instance }} exceeds 90%"
          value: "{{ $value }}%"
          threshold: "90%"
      - alert: CorednsCPUUsageAbove80Percent
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-dns"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{ $labels.k8s_app }} component on {{ $labels.instance }} exceeds 80%"
          value: "{{ $value }}%"
          threshold: "80%"
      - alert: CorednsCPUUsageAbove90Percent
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-dns"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{ $labels.k8s_app }} component on {{ $labels.instance }} exceeds 90%"
          value: "{{ $value }}%"
          threshold: "90%"
      - alert: KubeProxyOpenFDsAbove600
        expr: process_open_fds{job=~"kubernetes-kube-proxy"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "The {{ $labels.job }} component on {{ $labels.instance }} has more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: KubeProxyOpenFDsAbove1000
        expr: process_open_fds{job=~"kubernetes-kube-proxy"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "The {{ $labels.job }} component on {{ $labels.instance }} has more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: SchedulerOpenFDsAbove600
        expr: process_open_fds{job=~"kubernetes-schedule"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "The {{ $labels.job }} component on {{ $labels.instance }} has more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: SchedulerOpenFDsAbove1000
        expr: process_open_fds{job=~"kubernetes-schedule"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "The {{ $labels.job }} component on {{ $labels.instance }} has more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: ControllerManagerOpenFDsAbove600
        expr: process_open_fds{job=~"kubernetes-controller-manager"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "The {{ $labels.job }} component on {{ $labels.instance }} has more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: ControllerManagerOpenFDsAbove1000
        expr: process_open_fds{job=~"kubernetes-controller-manager"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "The {{ $labels.job }} component on {{ $labels.instance }} has more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: ApiserverOpenFDsAbove600
        expr: process_open_fds{job=~"kubernetes-apiserver"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "The {{ $labels.job }} component on {{ $labels.instance }} has more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: ApiserverOpenFDsAbove1000
        expr: process_open_fds{job=~"kubernetes-apiserver"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "The {{ $labels.job }} component on {{ $labels.instance }} has more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: EtcdOpenFDsAbove600
        expr: process_open_fds{job=~"kubernetes-etcd"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "The {{ $labels.job }} component on {{ $labels.instance }} has more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: EtcdOpenFDsAbove1000
        expr: process_open_fds{job=~"kubernetes-etcd"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "The {{ $labels.job }} component on {{ $labels.instance }} has more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: CorednsOpenFDsAbove600
        expr: process_open_fds{k8s_app=~"kube-dns"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "Plugin {{ $labels.k8s_app }} ({{ $labels.instance }}): more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: CorednsOpenFDsAbove1000
        expr: process_open_fds{k8s_app=~"kube-dns"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "Plugin {{ $labels.k8s_app }} ({{ $labels.instance }}): more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: KubeProxyVirtualMemoryHigh
        expr: process_virtual_memory_bytes{job=~"kubernetes-kube-proxy"} > 6000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "Component {{ $labels.job }} ({{ $labels.instance }}): virtual memory usage exceeds 6G"
          value: "{{ $value }}"
      - alert: SchedulerVirtualMemoryHigh
        expr: process_virtual_memory_bytes{job=~"kubernetes-schedule"} > 6000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "Component {{ $labels.job }} ({{ $labels.instance }}): virtual memory usage exceeds 6G"
          value: "{{ $value }}"
      - alert: ControllerManagerVirtualMemoryHigh
        expr: process_virtual_memory_bytes{job=~"kubernetes-controller-manager"} > 6000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "Component {{ $labels.job }} ({{ $labels.instance }}): virtual memory usage exceeds 6G"
          value: "{{ $value }}"
      - alert: ApiserverVirtualMemoryHigh
        expr: process_virtual_memory_bytes{job=~"kubernetes-apiserver"} > 6000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "Component {{ $labels.job }} ({{ $labels.instance }}): virtual memory usage exceeds 6G"
          value: "{{ $value }}"
      - alert: EtcdVirtualMemoryHigh
        expr: (process_virtual_memory_bytes{job=~"kubernetes-etcd"}) / 10 > 6000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "Component {{ $labels.job }} ({{ $labels.instance }}): virtual memory usage exceeds 6G"
          value: "{{ $value }}"
      - alert: KubeDnsVirtualMemoryHigh
        expr: process_virtual_memory_bytes{k8s_app=~"kube-dns"} > 6000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "Plugin {{ $labels.k8s_app }} ({{ $labels.instance }}): virtual memory usage exceeds 6G"
          value: "{{ $value }}"
      - alert: HttpRequestsAvg
        expr: sum(rate(rest_client_requests_total{job=~"kubernetes-kube-proxy|kubernetes-kubelet|kubernetes-schedule|kubernetes-controller-manager|kubernetes-apiserver"}[1m])) > 1000
        for: 2s
        labels:
          team: admin
        annotations:
          description: "Component {{ $labels.job }} ({{ $labels.instance }}): TPS exceeds 1000"
          value: "{{ $value }}"
          threshold: "1000"
      - alert: Pod_restarts
        expr: kube_pod_container_status_restarts_total{namespace=~"kube-system|default|monitor-sa"} > 0
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "Container {{ $labels.container }} of pod {{ $labels.pod }} in namespace {{ $labels.namespace }} was restarted (metric collected from {{ $labels.instance }})"
          value: "{{ $value }}"
          threshold: "0"
      - alert: Pod_waiting
        expr: kube_pod_container_status_waiting_reason{namespace=~"kube-system|default"} == 1
        for: 2s
        labels:
          team: admin
        annotations:
          description: "Namespace {{ $labels.namespace }} ({{ $labels.instance }}): container {{ $labels.container }} of pod {{ $labels.pod }} is stuck waiting at startup"
          value: "{{ $value }}"
          threshold: "1"
      - alert: Pod_terminated
        expr: kube_pod_container_status_terminated_reason{namespace=~"kube-system|default|monitor-sa"} == 1
        for: 2s
        labels:
          team: admin
        annotations:
          description: "Namespace {{ $labels.namespace }} ({{ $labels.instance }}): container {{ $labels.container }} of pod {{ $labels.pod }} has been terminated"
          value: "{{ $value }}"
          threshold: "1"
      - alert: Etcd_leader
        expr: etcd_server_has_leader{job="kubernetes-etcd"} == 0
        for: 2s
        labels:
          team: admin
        annotations:
          description: "Component {{ $labels.job }} ({{ $labels.instance }}): currently has no leader"
          value: "{{ $value }}"
          threshold: "0"
      - alert: Etcd_leader_changes
        expr: rate(etcd_server_leader_changes_seen_total{job="kubernetes-etcd"}[1m]) > 0
        for: 2s
        labels:
          team: admin
        annotations:
          description: "Component {{ $labels.job }} ({{ $labels.instance }}): the leader has changed"
          value: "{{ $value }}"
          threshold: "0"
      - alert: Etcd_failed
        expr: rate(etcd_server_proposals_failed_total{job="kubernetes-etcd"}[1m]) > 0
        for: 2s
        labels:
          team: admin
        annotations:
          description: "Component {{ $labels.job }} ({{ $labels.instance }}): proposals are failing"
          value: "{{ $value }}"
          threshold: "0"
      - alert: Etcd_db_total_size
        expr: etcd_debugging_mvcc_db_total_size_in_bytes{job="kubernetes-etcd"} > 10000000000
        for: 2s
        labels:
          team: admin
        annotations:
          description: "Component {{ $labels.job }} ({{ $labels.instance }}): db size exceeds 10G"
          value: "{{ $value }}"
          threshold: "10G"
      - alert: Endpoint_ready
        expr: kube_endpoint_address_not_ready{namespace=~"kube-system|default"} == 1
        for: 2s
        labels:
          team: admin
        annotations:
          description: "Namespace {{ $labels.namespace }} ({{ $labels.instance }}): endpoint {{ $labels.endpoint }} is unavailable"
          value: "{{ $value }}"
          threshold: "1"
    - name: physical-node-status-alerts
      rules:
      - alert: NodeCPUUsageHigh
        expr: 100 - avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }}: CPU usage too high"
          description: "CPU usage on {{ $labels.instance }} exceeds 90% (current: {{ $value }}); please investigate"
      - alert: NodeMemoryUsageHigh
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }}: memory usage too high"
          description: "Memory usage on {{ $labels.instance }} exceeds 90% (current: {{ $value }}); please investigate"
      - alert: InstanceDown
        expr: up == 0
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }}: server down"
          description: "{{ $labels.instance }}: server has been unreachable for more than 2 minutes"
      - alert: NodeDiskIOHigh
        expr: 100 - (avg(irate(node_disk_io_time_seconds_total[1m])) by(instance) * 100) > 6000000
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.mountpoint }}: disk IO usage too high!"
          description: "Disk IO on {{ $labels.mountpoint }} above 60% (current: {{ $value }})"
      - alert: NodeInboundBandwidthHigh
        expr: ((sum(rate(node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.mountpoint }}: inbound network bandwidth too high!"
          description: "Inbound bandwidth on {{ $labels.mountpoint }} has stayed above 100M for 5 minutes (RX bandwidth usage: {{ $value }})"
      - alert: NodeOutboundBandwidthHigh
        expr: ((sum(rate(node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.mountpoint }}: outbound network bandwidth too high!"
          description: "Outbound bandwidth on {{ $labels.mountpoint }} has stayed above 100M for 5 minutes (TX bandwidth usage: {{ $value }})"
      - alert: NodeTCPSessionsHigh
        expr: node_netstat_Tcp_CurrEstab > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.mountpoint }}: too many ESTABLISHED TCP sessions!"
          description: "{{ $labels.mountpoint }}: more than 1000 TCP connections in the ESTABLISHED state (current: {{ $value }})"
      - alert: NodeDiskUsageAbove80Percent
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.mountpoint }}: disk partition usage too high!"
          description: "Partition {{ $labels.mountpoint }} is more than 80% used (current: {{ $value }}%)"

prometheus-alertmanager-cfg.yaml
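A rules file this long is easy to break with an indentation slip, so it is worth validating before reloading Prometheus. A small sketch, assuming promtool from the Prometheus release is on the PATH and the ConfigMap above has already been applied:

# pull the embedded rules file out of the ConfigMap and check it
kubectl -n monitor-sa get configmap prometheus-config -o jsonpath='{.data.rules\.yml}' > /tmp/rules.yml
promtool check rules /tmp/rules.yml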
Commonly used alerting metrics:
- process_cpu_seconds_total: total CPU time consumed per target (a counter; use rate() to get its rate of change over a time window)
- process_open_fds: open file descriptors per target (each connection normally holds one descriptor, so this roughly tracks the connection count)
- process_virtual_memory_bytes: virtual memory used per target
- rest_client_requests_total: requests per target, i.e. TPS (the number of requests within a given period, in other words throughput)
- kube_pod_container_status_restarts_total: pod restart status
- kube_pod_container_status_waiting_reason: pod startup problems (a pod container stuck in the waiting state)
- kube_pod_container_status_terminated_reason: pod terminated status
- etcd_server_leader_changes_seen_total: etcd leader changes seen (the leader has been re-elected)
- etcd_server_proposals_failed_total: total number of failed etcd proposals
- etcd_debugging_mvcc_db_total_size_in_bytes: disk space used by the etcd database (be careful with unit conversion between what etcd reports and what Prometheus displays)
- etcd_server_has_leader: whether the etcd member currently sees a leader; 0 means no leader, which typically indicates more than half the cluster is down
- kube_endpoint_address_not_ready: an endpoint address in the cluster is not ready
- node_cpu_seconds_total: physical node CPU time
- node_memory_MemTotal_bytes: physical node total memory
- up == 0: indicates that a service is down
- node_disk_io_time_seconds_total: physical node I/O usage
- node_network_receive_bytes_total: inbound network traffic
- node_network_transmit_bytes_total: outbound network traffic
- node_netstat_Tcp_CurrEstab: established TCP sessions on the node
- node_filesystem_free_bytes: free disk space on the node
- node_filesystem_size_bytes: total disk size; dividing used space by the total and multiplying by 100 yields the current usage percentage
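As a sketch of the first and last points, the following PromQL expressions (reusing the label matchers from rules.yml above) can be pasted into the Prometheus expression browser:

# per-component CPU usage (%): a counter turned into a rate over 1m
rate(process_cpu_seconds_total{job="kubernetes-kube-proxy"}[1m]) * 100

# per-mountpoint disk usage (%): used / total * 100
100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100)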
Deploying Prometheus + Alertmanager
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-server
  namespace: monitor-sa
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
      component: server
    #matchExpressions:
    #- {key: app, operator: In, values: [prometheus]}
    #- {key: component, operator: In, values: [server]}
  template:
    metadata:
      labels:
        app: prometheus
        component: server
      annotations:
        prometheus.io/scrape: 'false'
    spec:
      #nodeName: node1
      serviceAccountName: monitor
      containers:
      - name: prometheus
        image: 172.17.166.217/kubenetes/prometheus:v2.2.1
        #imagePullPolicy: IfNotPresent
        command:
        - "/bin/prometheus"
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"
        - "--storage.tsdb.retention=24h"
        - "--web.enable-lifecycle"
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/prometheus
          name: prometheus-config
        - mountPath: /prometheus/
          name: prometheus-storage-volume
        - name: k8s-certs
          mountPath: /var/run/secrets/kubernetes.io/k8s-certs/etcd/
      - name: alertmanager
        image: 172.17.166.217/kubenetes/alertmanager:v0.14.0
        #imagePullPolicy: IfNotPresent
        args:
        - "--config.file=/etc/alertmanager/alertmanager.yml"
        - "--log.level=debug"
        ports:
        - containerPort: 9093
          protocol: TCP
          name: alertmanager
        volumeMounts:
        - name: alertmanager-config
          mountPath: /etc/alertmanager
        - name: alertmanager-storage
          mountPath: /alertmanager
        - name: localtime
          mountPath: /etc/localtime
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config
      - name: prometheus-storage-volume
        hostPath:
          path: /data
          type: Directory
      - name: k8s-certs
        secret:
          secretName: etcd-certs
      - name: alertmanager-config
        configMap:
          name: alertmanager
      - name: alertmanager-storage
        hostPath:
          path: /data/alertmanager
          type: DirectoryOrCreate
      - name: localtime
        hostPath:
          path: /usr/share/zoneinfo/Asia/Shanghai

prometheus+alertmanager-deploy.yaml
---
apiVersion: v1
kind: Service
metadata:
  labels:
    name: prometheuss
    kubernetes.io/cluster-service: 'true'
  name: prometheuss
  namespace: monitor-sa
spec:
  ports:
  - name: prometheus
    #nodePort: 30066
    port: 9090
    protocol: TCP
    targetPort: 9090
  selector:
    app: prometheus
  sessionAffinity: None
  #type: NodePort

prometheus-svc.yaml
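With the manifests saved under the file names shown after each block (and the Alertmanager ConfigMap saved as alertmanager-cm.yaml, matching the backup step later in this article), deploying and verifying might look like this:

kubectl apply -f alertmanager-cm.yaml
kubectl apply -f prometheus-alertmanager-cfg.yaml
kubectl apply -f prometheus+alertmanager-deploy.yaml
kubectl apply -f prometheus-svc.yaml
# one pod with two containers (prometheus + alertmanager) should become Running
kubectl get pods -n monitor-sa -o wide
kubectl get svc -n monitor-sa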
The kubernetes-kube-proxy targets above only work because of one extra step: kube-proxy's metrics port 10249 listens on 127.0.0.1 by default and has to be changed to listen on the node address. Modify it as follows; for production clusters it is less risky to make this change when Kubernetes is first installed:
kubectl edit configmap kube-proxy -n kube-system
Change the metricsBindAddress field so that it reads metricsBindAddress: 0.0.0.0:10249
Then restart the kube-proxy pods:
[root@xianchaomaster1]# kubectl get pods -n kube-system | grep kube-proxy |awk '{print $1}' | xargs kubectl delete pods -n kube-system
[root@xianchaomaster1]# ss -antulp |grep :10249
The port should now be listening on all node interfaces:
tcp LISTEN 0 128 [::]:10249 [::]:*
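For reference, the same change can be made non-interactively. This is only a sketch; it assumes the kubeadm default metricsBindAddress: "" appears verbatim in the ConfigMap, so check yours first:

kubectl -n kube-system get configmap kube-proxy -o yaml | \
  sed 's/metricsBindAddress: ""/metricsBindAddress: "0.0.0.0:10249"/' | \
  kubectl apply -f -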
In the Prometheus web UI, click Status -> Targets to see all scrape targets.
Click Alerts to see the alert list.
Expand the ControllerManagerCPUUsageAbove90Percent alert to see its details.
FIRING means Prometheus has already delivered the alert to Alertmanager; a corresponding alert is visible in the Alertmanager UI.
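Alert state can also be inspected from the expression browser: Prometheus exposes a built-in ALERTS time series for every pending or firing alert, for example:

# all alerts currently firing
ALERTS{alertstate="firing"}
# one specific alert, whether pending or firing
ALERTS{alertname="InstanceDown"}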
Log in to the Alertmanager web UI: open 192.168.40.180:30066 in a browser and the Alertmanager page is displayed.
Configuring Alertmanager to send alerts to DingTalk
1. Create a DingTalk robot. In the desktop DingTalk client, create a group and add a custom robot to it, following these guides:
https://ding-doc.dingtalk.com/doc#/serverapi2/qf2nxq
https://developers.dingtalk.com/document/app/custom-robot-access
The robot used here was created via: group settings --> smart group assistant --> add robot --> custom --> add, with:
Robot name: test
Receiving group: DingTalk alert test
Security settings: custom keyword: cluster1
Click Finish and the "test" alert robot is created. To look up its webhook afterwards, click the smart group assistant, then click the test robot to open its settings, which show:
Robot name: test
Receiving group: DingTalk alert test
Message push: enabled
webhook: https://oapi.dingtalk.com/robot/send?access_token=8a53475677339a11cec453c608543c3d85ea73b330ea70c4b2de96a0839cbb90
Security settings: custom keyword: cluster1

2. Install the DingTalk webhook plugin on the Kubernetes control node xianchaomaster1:

tar zxvf prometheus-webhook-dingtalk-0.3.0.linux-amd64.tar.gz

(The prometheus-webhook-dingtalk-0.3.0.linux-amd64.tar.gz tarball is available on Baidu netdisk at https://pan.baidu.com/s/1_HtVZsItq2KsYvOlkIP9DQ, extraction code: d59o)

cd prometheus-webhook-dingtalk-0.3.0.linux-amd64

Start the DingTalk alert plugin:

nohup ./prometheus-webhook-dingtalk --web.listen-address="0.0.0.0:8060" --ding.profile="cluster1=https://oapi.dingtalk.com/robot/send?access_token=8a53475677339a11cec453c608543c3d85ea73b330ea70c4b2de96a0839cbb90" &

Back up the original alertmanager-cm.yaml:

cp alertmanager-cm.yaml alertmanager-cm.yaml.bak

Then generate a new alertmanager-cm.yaml:
cat >alertmanager-cm.yaml <<EOF
kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: monitor-sa
data:
  alertmanager.yml: |-
    global:
      resolve_timeout: 1m
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: '[email protected]'
      smtp_auth_username: '1501157****'
      smtp_auth_password: 'BGWHYUOSOOHWEUJM'
      smtp_require_tls: false
    route:
      group_by: [alertname]
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 10m
      receiver: cluster1
    receivers:
    - name: cluster1
      webhook_configs:
      - url: 'http://192.168.40.180:8060/dingtalk/cluster1/send'
        send_resolved: true
EOF

alertmanager-dd.yaml
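Before reloading Alertmanager, the plugin can be tested end to end by POSTing a hand-built, Alertmanager-style payload at it. A sketch (the message must contain the robot's custom keyword cluster1 or DingTalk's keyword check will reject it; exact field requirements may vary with the plugin version):

curl -s -H "Content-Type: application/json" \
  -d '{"version":"4","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"TestAlert"},"annotations":{"description":"cluster1: test message"}}]}' \
  http://192.168.40.180:8060/dingtalk/cluster1/send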
Configuring Alertmanager to send alerts to WeChat Work
1. Register a WeChat Work (enterprise WeChat) account. Log in at https://work.weixin.qq.com/, open application management, and create an application named wechat. After it is created, the console shows:
AgentId:1000003
Secret:Ov5SWq_JqrolsOj6dD4Jg9qaMu1TTaDzVTCrXHcjlFs
2. Modify alertmanager-cm.yaml:

global:
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: '[email protected]'
  smtp_auth_username: '15011572657'
  smtp_auth_password: 'BGWHYUOSOOHWEUJM'
  smtp_require_tls: false
route:
  group_by: [alertname]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 3m
  receiver: "prometheus"
receivers:
- name: 'prometheus'
  wechat_configs:
  - corp_id: wwa82df90a693abb15
    to_user: '@all'
    agent_id: 1000003
    api_secret: Ov5SWq_JqrolsOj6dD4Jg9qaMu1TTaDzVTCrXHcjlFs

Parameter notes:
- api_secret: from the WeChat Work console under "企業應用" --> the custom application (wechat here) --> Secret.
- corp_id: under "我的企業" --> CorpID (at the bottom of the page).
- agent_id: under "企業應用" --> the custom application (wechat here) --> AgentId.
- The receiver referenced in route must match a name defined under receivers (here 'prometheus').
- to_user: '@all' sends the alert to everyone in the application's scope.
Configuring a custom alert template
cat template_wechat.tmpl
{{ define "wechat.default.message" }}
{{ range .Alerts }}
========start==========
Alerting program: node_exporter
Alert name: {{ .Labels.alertname }}
Affected host: {{ .Labels.instance }}
Alert summary: {{ .Annotations.summary }}
Alert details: {{ .Annotations.description }}
========end==========
{{ end }}
{{ end }}
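For Alertmanager to pick this template up, two things matter: wechat_configs renders the wechat.default.message template by default, so the define above overrides the built-in text; and the file has to be listed under the top-level templates key in alertmanager.yml. A sketch, assuming the template file is mounted at /etc/alertmanager/ alongside the config:

templates:
- '/etc/alertmanager/template_wechat.tmpl'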
Routing different alerts to different groups
routes:
- match_re:
    service: ^(foo1|foo2|baz)$
  receiver: team-X-mails
  routes:
  - match:
      severity: critical
    receiver: team-X-pager
- match:
    service: files
  receiver: team-Y-mails
  routes:
  - match:
      severity: critical
    receiver: team-Y-pager
- match:
    service: database
  receiver: team-DB-pager
  # Also group alerts by affected database.
  group_by: [alertname, cluster, database]
  routes:
  - match:
      owner: team-X
    receiver: team-X-pager
    continue: true
  - match:
      owner: team-Y
    receiver: team-Y-pager
Field by field, a route configuration breaks down as follows:
- global: configures mail servers, URLs, WeChat credentials, and so on.
- route: configures the routing tree.
  - receiver: selects a receiver from the receivers section (defined at the same level as route).
  - group_by: [] takes label keys; alerts sharing the same values for these keys end up in the same group (study the label values your rules produce).
  - continue: false controls whether, after this node matches, the alert keeps being matched against subsequent route nodes.
  - match: [labelname: labelvalue, ...] matches an alert against exact label values (whether every listed label must match before the route fires is untested by the author).
  - match_re: [labelname: regex] is the same, but matches labels with regular expressions.
  - group_wait: 30s is the in-group wait: how long to wait after a group's first alert before sending, so that same-group alerts go out together; the default is 30s.
  - group_interval: 5m is how long to wait before notifying about new alerts added to a group that has already been notified; the default is 5m. Only set this once you are sure a group maps to a single business concern; a poor grouping can delay alerts.
  - repeat_interval: 4h is how long to wait before repeating a notification that was already sent and has no new alerts; the default is 4h. Choose the interval according to the alert's severity.
  - routes: holds child route nodes, each configured with the same fields as the top-level route.
For example:
route:
  receiver: 'default-receiver'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]
  routes:
  - receiver: 'database-pager'
    group_wait: 10s
    match_re:
      service: mysql|cassandra
  - receiver: 'frontend-pager'
    group_by: [product, environment]
    match:
      team: frontend
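A routing tree like this can be dry-run without sending anything: newer Alertmanager releases ship amtool with a routes test subcommand that prints the receiver a given label set would reach (a sketch; the v0.14.0 image deployed above predates this subcommand):

# which receiver would an alert with these labels be routed to?
amtool config routes test --config.file=alertmanager.yml service=mysql severity=critical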