Deploying Enterprise-Grade Monitoring with Prometheus + Grafana (Part 1)

阿新 · Published: 2022-04-17

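1. Preface

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Since its inception in 2012, many companies and organizations have adopted it, and the project has a very active developer and user community. Prometheus joined the Cloud Native Computing Foundation in 2016 as its second hosted project, after Kubernetes.

An exporter is a component that collects monitoring data and exposes it in the format Prometheus expects, giving Prometheus an interface to scrape.

An exporter exposes its collection endpoint to the Prometheus server as an HTTP service; the Prometheus server obtains the monitoring data by requesting that endpoint. Different exporters cover different workloads.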

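Prometheus       An open-source systems monitoring and alerting framework, inspired by Google's Borgmon monitoring system

AlertManager     Handles alerts sent by client applications such as the Prometheus server. It deduplicates and groups alerts, routes them to the correct receiver integration, and also handles silencing and inhibition

Node_Exporter    An exporter that collects resource metrics from a node; deploy it on every node that Prometheus monitors

PushGateway      A push gateway that accepts metrics pushed by nodes and exposes them to the Prometheus server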

Website: https://prometheus.io

Documentation: https://prometheus.io/docs/introduction/overview/

Downloads for all Prometheus components: https://prometheus.io/download/

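2. Introduction to Prometheus

2.1 Prometheus features

1. A multi-dimensional data model (time series identified by a metric name and key-value pairs)

2. PromQL, a flexible query and aggregation language

3. Local on-disk storage with no dependency on distributed storage; each server node is autonomous (remote storage can be attached through remote read/write)

4. Time series collection through a pull model over HTTP

5. Push mode through the Pushgateway, an optional intermediary component

6. Targets discovered through dynamic service discovery or static configuration

7. Support for multiple modes of graphing and dashboarding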

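2.2 Prometheus components

1. The Prometheus server, which scrapes and stores time series data

2. Client libraries for instrumenting application code

3. A push gateway for supporting short-lived jobs

4. Special-purpose exporters for services such as HAProxy, StatsD, and Graphite

5. An alertmanager to handle alerts

6. Various support tools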

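2.3 Prometheus architecture

2.4 When to use Prometheus

Prometheus works well for recording any purely numeric time series. It fits both machine-centric monitoring and the monitoring of highly dynamic service-oriented architectures. In a world of microservices, its support for multi-dimensional data collection and querying is a particular strength.

Prometheus is designed for reliability, to be the system you turn to during an outage so that you can diagnose problems quickly. Each Prometheus server is standalone and does not depend on network storage or other remote services, so it remains usable when other parts of your infrastructure are broken.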

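2.5 Prometheus concepts

Data model:

Prometheus stores all data as time series: streams of timestamped values that belong to the same metric and the same set of labels (key-value pairs).

Metrics and labels:

Every time series is uniquely identified by its metric name and a set of labels (key-value pairs).

The metric name describes the general feature of the system being measured (for example, http_requests_total is the total number of HTTP requests received). It may contain ASCII letters, digits, underscores, and colons, and must match the regular expression [a-zA-Z_:][a-zA-Z0-9_:]*.

Note: colons are reserved for user-defined recording rules and should not be used by exporters.

Labels give Prometheus its multi-dimensional data model: for the same metric name, any combination of labels identifies a particular dimensional instance of that metric (for example, all HTTP requests that used the POST method on the /api/tracks handler). The query language filters and aggregates along these dimensions. Changing any label value, including adding or removing a label, creates a new time series.

Label names may contain ASCII letters, digits, and underscores, and must match the regular expression [a-zA-Z_][a-zA-Z0-9_]*. Label names that begin with a double underscore (__) are reserved for internal use.

Label values may contain any Unicode characters. A label with an empty value is treated as if the label did not exist.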
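The naming rules above can be checked mechanically. The following is a minimal Python sketch, not part of Prometheus itself: the function names (is_valid_metric_name, is_valid_label_name, series_id) are made up for illustration, and only the two regular expressions and the empty-label rule come from the text.

```python
import re

# Patterns quoted from the text above; fullmatch() applies them to the whole string.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_metric_name(name: str) -> bool:
    """True if the name matches the metric-name pattern."""
    return METRIC_NAME_RE.fullmatch(name) is not None


def is_valid_label_name(name: str, allow_reserved: bool = False) -> bool:
    """True if the name matches the label-name pattern.

    Names starting with "__" are reserved for internal use and are
    rejected unless allow_reserved is set.
    """
    if LABEL_NAME_RE.fullmatch(name) is None:
        return False
    return allow_reserved or not name.startswith("__")


def series_id(metric: str, labels: dict) -> tuple:
    """Hypothetical helper: the identity of a series is its name plus label set.

    Labels with an empty value are dropped, because an empty label is
    treated as if the label did not exist.
    """
    if not is_valid_metric_name(metric):
        raise ValueError(f"invalid metric name: {metric!r}")
    for label in labels:
        if not is_valid_label_name(label):
            raise ValueError(f"invalid label name: {label!r}")
    return metric, tuple(sorted((k, v) for k, v in labels.items() if v != ""))
```

With this sketch, series_id("api_http_requests_total", {"method": "POST"}) and the same call with an extra empty-valued label produce the same identity, while changing "POST" to "GET" produces a different one.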

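Notation:

Given a metric name and a set of labels, a time series is usually identified with the following notation:

<metric name>{<label name>=<label value>, ...}

For example, a time series with the metric name api_http_requests_total and the labels method="POST" and handler="/messages" is written as:

api_http_requests_total{method="POST", handler="/messages"}

This is the same notation that OpenTSDB uses.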

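Metric types:

Counter             A value that only increases monotonically or resets to zero on restart. Use it for the number of requests served, tasks completed, errors seen, and so on

Gauge               A value that can go up and down arbitrarily. Use it for measurements such as temperature or current memory usage

Histogram           Samples observations (typically request durations or response sizes) and counts them in configurable buckets; it also provides a sum of all observed values

                        Cumulative counters for the observation buckets: <basename>_bucket{le="<upper inclusive bound>"}
                        Total sum of all observed values: <basename>_sum
                        Count of observed events: <basename>_count

Summary             Samples observations (typically request durations or response sizes), provides a count of observations and a sum of all observed values, and calculates configurable quantiles over a sliding time window
                        Streaming φ-quantiles (0 ≤ φ ≤ 1) of observed events: <basename>{quantile="φ"}
                        Total sum of all observed values: <basename>_sum
                        Count of observed events: <basename>_count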
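To make the histogram series concrete, here is a small Python sketch of how observations turn into cumulative _bucket, _sum, and _count series. It is an illustration only, not the code of any Prometheus client library; the Histogram class and the metric name req_seconds are made up for the example.

```python
import math
from bisect import bisect_left


class Histogram:
    """Toy histogram that mimics the series a Prometheus histogram exposes."""

    def __init__(self, basename, upper_bounds):
        self.basename = basename
        bounds = sorted(upper_bounds)
        # A +Inf bucket always exists so that every observation lands somewhere.
        if not bounds or not math.isinf(bounds[-1]):
            bounds.append(math.inf)
        self.bounds = bounds
        self.counts = [0] * len(bounds)  # per-bucket, not yet cumulative
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        # First bucket whose upper bound is >= value ("le" is inclusive).
        index = bisect_left(self.bounds, value)
        self.counts[index] += 1
        self.total += value
        self.count += 1

    def samples(self):
        """Return the exposed series as a dict of series name -> value."""
        out = {}
        running = 0
        for bound, n in zip(self.bounds, self.counts):
            running += n  # buckets are cumulative
            le = "+Inf" if math.isinf(bound) else f"{bound:g}"
            out[f'{self.basename}_bucket{{le="{le}"}}'] = running
        out[f"{self.basename}_sum"] = self.total
        out[f"{self.basename}_count"] = self.count
        return out
```

Because the buckets are cumulative, the le="+Inf" bucket always equals _count, which is a quick consistency check when reading real histogram output.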

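Instances and jobs:

In Prometheus, an endpoint that can be scraped is called an instance, and usually corresponds to a single process. A collection of instances with the same purpose (for example, a process replicated for scalability or reliability) is called a job.

When Prometheus scrapes a target, it automatically attaches some labels to the scraped time series to identify the target:

job: the name of the job the target belongs to

instance: the <host>:<port> part of the target's URL

If either of these labels is already present in the scraped data, the behavior depends on the honor_labels configuration option.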
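The conflict handling controlled by honor_labels (documented in the scrape configuration section of this article) can be sketched in a few lines of Python. This is a simplified model written for illustration, not Prometheus source code; merge_labels and the sample label values are hypothetical.

```python
def merge_labels(scraped: dict, server_side: dict, honor_labels: bool = False) -> dict:
    """Combine labels found in scraped data with labels attached by the server.

    honor_labels=True  -> on conflict, keep the scraped value and ignore
                          the server-side label.
    honor_labels=False -> on conflict, rename the scraped label to
                          exported_<name> and attach the server-side label.
    """
    merged = dict(scraped)
    for name, value in server_side.items():
        if name in scraped:  # conflict
            if honor_labels:
                continue
            merged["exported_" + name] = scraped[name]
        merged[name] = value
    return merged


# Made-up example: a sample that already carries its own job/instance labels.
scraped = {"job": "batch", "instance": "worker-1", "code": "200"}
server_side = {"job": "pushgateway", "instance": "10.0.0.10:9091"}

print(merge_labels(scraped, server_side, honor_labels=False))
print(merge_labels(scraped, server_side, honor_labels=True))
```

The sketch ignores edge cases such as empty label values and later relabeling steps.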

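For every scrape of an instance, Prometheus stores a sample in each of the following time series (a sample is one value at one point in time in a series):

up{job="<job-name>", instance="<instance-id>"}: 1 if the instance is healthy (reachable), 0 otherwise

scrape_duration_seconds{job="<job-name>", instance="<instance-id>"}: how long the scrape took

scrape_samples_post_metric_relabeling{job="<job-name>", instance="<instance-id>"}: the number of samples remaining after metric relabeling

scrape_samples_scraped{job="<job-name>", instance="<instance-id>"}: the number of samples the target exposed

The up time series is especially useful for monitoring instance availability.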

3. Deploying and configuring Prometheus

OS           IP address   Software       Node
CentOS 7.6   10.0.0.10    prometheus     master
CentOS 7.6   10.0.0.11    Alertmanager   node1
CentOS 7.6   10.0.0.12    Grafana        node2

3.1 Download Prometheus

# Official download page: https://prometheus.io/download/
wget https://github.com/prometheus/prometheus/releases/download/v2.28.1/prometheus-2.28.1.linux-amd64.tar.gz

# Extract
tar -xf prometheus-2.28.1.linux-amd64.tar.gz -C /usr/local/

# Rename
cd /usr/local/ && mv prometheus-2.28.1.linux-amd64  prometheus

# Create a symlink
ln -s /usr/local/prometheus/prometheus   /usr/bin/prometheus

3.2 Manage Prometheus with systemd

# Create the prometheus user
useradd -M -s /sbin/nologin prometheus

# Fix ownership
chown -R prometheus:prometheus /usr/local/prometheus 

# Edit the prometheus.service file
vim /usr/lib/systemd/system/prometheus.service

Example unit file 1: short and simple

[Unit]
Description=Prometheus
Documentation=https://prometheus.io/
After=network.target

[Service]
WorkingDirectory=/usr/local/prometheus/
ExecStart=/usr/local/prometheus/prometheus
ExecReload=/bin/kill -HUP $MAINPID
# Stop with SIGTERM rather than SIGKILL so Prometheus can shut down its storage cleanly
ExecStop=/bin/kill -TERM $MAINPID
Type=simple
KillMode=control-group
Restart=on-failure
RestartSec=15s

[Install]
WantedBy=multi-user.target

Example unit file 2: more detailed

[Unit]
Description=Prometheus
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
Environment="GOMAXPROCS=4"
User=prometheus
Group=prometheus
ExecReload=/bin/kill -HUP $MAINPID
# --web.enable-lifecycle enables hot reload through the HTTP API.
# Comments must stay outside the ExecStart continuation lines: the systemd version
# shipped with CentOS 7 does not skip a comment placed between them.
ExecStart=/usr/local/prometheus/prometheus \
  --config.file=/usr/local/prometheus/prometheus.yml \
  --storage.tsdb.path=/usr/local/prometheus/data \
  --storage.tsdb.retention.time=30d \
  --web.console.libraries=/usr/local/prometheus/console_libraries \
  --web.console.templates=/usr/local/prometheus/consoles \
  --web.listen-address=0.0.0.0:9090 \
  --web.read-timeout=5m \
  --web.max-connections=30 \
  --query.max-concurrency=50 \
  --query.timeout=2m \
  --web.enable-lifecycle
PrivateTmp=true
PrivateDevices=true
ProtectHome=true
NoNewPrivileges=true
LimitNOFILE=infinity
ReadWriteDirectories=/usr/local/prometheus
ProtectSystem=full

SyslogIdentifier=prometheus
Restart=always

[Install]
WantedBy=multi-user.target

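3.3 Start Prometheus

3.3.1 Start Prometheus from the binary

# By default Prometheus looks for prometheus.yml and creates data/ in the current directory,
# so change into the installation directory first
cd /usr/local/prometheus
# Run in the foreground
/usr/local/prometheus/prometheus 
# Run in the background, option 1
/usr/local/prometheus/prometheus &
# Run in the background, option 2
nohup /usr/local/prometheus/prometheus &>/var/log/prometheus.log  &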

3.3.2 Start Prometheus with systemd

# Reload the systemd unit files
systemctl daemon-reload
# Enable at boot and start prometheus
systemctl enable prometheus && systemctl start prometheus
# Check the listening port
netstat -lntp | grep prometheus
tcp6       0      0 :::9090                 :::*                    LISTEN      43742/prometheus
# Check the service status
[root@prometheus-server opt]# systemctl status prometheus
● prometheus.service - Prometheus
   Loaded: loaded (/usr/lib/systemd/system/prometheus.service; enabled; vendor preset: disabled)
   Active: active (running) since 二 2021-07-27 09:57:58 CST; 21min ago
     Docs: https://prometheus.io/
 Main PID: 1134 (prometheus)
    Tasks: 10
   Memory: 107.0M
   CGroup: /system.slice/prometheus.service
           └─1134 /usr/local/prometheus/prometheus

Open hostip:9090 in a browser. Prometheus is now deployed; the next step is to configure it.

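4. The Prometheus configuration file in detail

The Prometheus configuration file, prometheus.yml, is divided into the following main blocks:

Global settings             global

Alerting settings           alerting

Rule files                  rule_files

Scrape settings             scrape_configs

Remote read/write settings  remote_read, remote_write

4.1 Global settings `global`

global specifies parameters that are valid in all other configuration contexts. They also serve as defaults for the other configuration sections.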

global:
  # How frequently to scrape targets by default
  [ scrape_interval: <duration> | default = 1m ]

  # How long until a scrape request times out
  [ scrape_timeout: <duration> | default = 10s ]

  # How frequently to evaluate rules
  [ evaluation_interval: <duration> | default = 1m ]

  # Labels added to any time series or alert when communicating with
  # external systems (federation, remote storage, Alertmanager)
  external_labels:
    [ <labelname>: <labelvalue> ... ]

  # File to which PromQL queries are logged
  [ query_log_file: <string> ]

4.2 Alerting settings `alerting`

alerting specifies settings related to the Alertmanager.

alerting:
  alert_relabel_configs:
    [ - <relabel_config> ... ]
  alertmanagers:
    [ - <alertmanager_config> ... ]

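4.3 Rule files `rule_files`

rule_files specifies where Prometheus loads rules from; recording rules and alerting rules are read from every matching file. No rules are defined at this point.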

rule_files:
  [ - <filepath_glob> ... ]

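4.4 Scrape settings `scrape_configs`

scrape_configs specifies which resources Prometheus monitors. The default configuration scrapes the time series of Prometheus itself from http://hostIP:9090/metrics.

A scrape_config section specifies a set of targets and parameters that describe how to scrape them. In the general case, one scrape configuration specifies a single job. In advanced configurations, this may change.

Targets can be configured statically with the static_configs parameter or discovered dynamically with one of the supported service discovery mechanisms.

In addition, relabel_configs can modify any target and its labels before the scrape.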

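# Each entry under scrape_configs has the following form:
job_name: <job_name>

# How frequently to scrape targets from this job
[ scrape_interval: <duration> | default = <global_config.scrape_interval> ]

# Per-scrape timeout for this job
[ scrape_timeout: <duration> | default = <global_config.scrape_timeout> ]

# The HTTP path from which to fetch metrics
[ metrics_path: <path> | default = /metrics ]

# honor_labels controls how Prometheus handles conflicts between labels already present in the scraped data and labels that Prometheus would attach server-side ("job" and "instance" labels, manually configured target labels, and labels generated by service discovery)
# If honor_labels is "true", conflicts are resolved by keeping the label values from the scraped data and ignoring the conflicting server-side labels
# If honor_labels is "false", conflicts are resolved by renaming the conflicting labels in the scraped data to "exported_<original-label>" (for example "exported_instance", "exported_job") and then attaching the server-side labels
# Note that globally configured "external_labels" are not affected by this setting. In communication with external systems they are applied only when a time series does not already have the given label, and are ignored otherwise
[ honor_labels: <boolean> | default = false ]

# honor_timestamps controls whether Prometheus respects the timestamps present in scraped data
# If honor_timestamps is "true", the timestamps of the metrics exposed by the target are used
# If honor_timestamps is "false", the timestamps of the metrics exposed by the target are ignored
[ honor_timestamps: <boolean> | default = true ]

# The protocol scheme used for requests
[ scheme: <scheme> | default = http ]

# Optional HTTP URL parameters
params:
  [ <string>: [<string>, ...] ]

# Sets the Authorization header on every scrape request with the configured username and password; password and password_file are mutually exclusive
basic_auth:
  [ username: <string> ]
  [ password: <secret> ]
  [ password_file: <string> ]

# Sets the Authorization header on every scrape request with the configured bearer token; mutually exclusive with bearer_token_file
[ bearer_token: <secret> ]

# Sets the Authorization header on every scrape request with the bearer token read from the configured file; mutually exclusive with bearer_token
[ bearer_token_file: /path/to/bearer/token/file ]

# TLS settings for the scrape request
tls_config:
  [ <tls_config> ]

# Optional proxy URL
[ proxy_url: <string> ]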

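# List of Azure service discovery configurations
azure_sd_configs:
  [ - <azure_sd_config> ... ]

# List of Consul service discovery configurations
consul_sd_configs:
  [ - <consul_sd_config> ... ]

# List of DNS service discovery configurations
dns_sd_configs:
  [ - <dns_sd_config> ... ]

# List of EC2 service discovery configurations
ec2_sd_configs:
  [ - <ec2_sd_config> ... ]

# List of OpenStack service discovery configurations
openstack_sd_configs:
  [ - <openstack_sd_config> ... ]

# List of file service discovery configurations
file_sd_configs:
  [ - <file_sd_config> ... ]

# List of GCE service discovery configurations
gce_sd_configs:
  [ - <gce_sd_config> ... ]

# List of Kubernetes service discovery configurations
kubernetes_sd_configs:
  [ - <kubernetes_sd_config> ... ]

# List of Marathon service discovery configurations
marathon_sd_configs:
  [ - <marathon_sd_config> ... ]

# List of AirBnB's Nerve service discovery configurations
nerve_sd_configs:
  [ - <nerve_sd_config> ... ]

# List of Zookeeper Serverset service discovery configurations
serverset_sd_configs:
  [ - <serverset_sd_config> ... ]

# List of Triton service discovery configurations
triton_sd_configs:
  [ - <triton_sd_config> ... ]

# List of statically configured targets
static_configs:
  [ - <static_config> ... ]

# List of target relabel configurations
relabel_configs:
  [ - <relabel_config> ... ]

# List of metric relabel configurations
metric_relabel_configs:
  [ - <relabel_config> ... ]

# Per-scrape limit on the number of samples accepted
# If more than this number of samples remain after metric relabeling, the entire scrape is treated as failed. 0 means no limit
[ sample_limit: <int> | default = 0 ]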

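4.5 Remote read/write settings `remote_read/remote_write`

remote_read/remote_write decouple the data store from Prometheus. They are not configured in this article.

# Settings related to the remote write feature
remote_write:
  [ - <remote_write> ... ]

# Settings related to the remote read feature
remote_read:
  [ - <remote_read> ... ]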

4.6 A simple configuration example

vim  /usr/local/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['10.0.0.10:9090']
# Check the configuration file
# /usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml

Checking /usr/local/prometheus/prometheus.yml
  SUCCESS: 0 rule files found # SUCCESS means the configuration file is valid

[Extension] Service discovery configuration example

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  - job_name: 'nodes'
    # static_configs is replaced with file-based service discovery
    file_sd_configs:
    - files:
      - "nodes/*.yml"

nodes/allnodes.yml (the path is relative to the directory that contains prometheus.yml)

- targets:
  - "10.0.0.10:9100"
  - "10.0.0.11:9100"
  - "10.0.0.12:9100"
  labels:
    app: nodes
    name: mynode
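File-based service discovery also accepts JSON files, which makes target lists easy to generate from a script. The sketch below is one possible way to do that in Python. It assumes the glob in file_sd_configs is changed to "nodes/*.json"; build_target_groups and write_file_sd are hypothetical helper names, and the output path is only an example.

```python
import json
import os
import tempfile


def build_target_groups(targets, labels):
    """Return the file_sd structure: a list of {"targets": [...], "labels": {...}}."""
    return [{"targets": list(targets), "labels": dict(labels)}]


def write_file_sd(path, groups):
    """Write the target groups to path atomically.

    The data goes to a temporary file in the same directory and is then
    renamed into place, so Prometheus never reads a half-written file.
    The ".tmp" suffix keeps the temporary file out of a "*.json" glob.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(groups, fh, indent=2)
    # mkstemp creates the file with mode 0600; make it readable for the
    # prometheus user in case the script runs as a different user.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)
```

A call such as write_file_sd("/usr/local/prometheus/nodes/allnodes.json", build_target_groups(["10.0.0.10:9100", "10.0.0.11:9100", "10.0.0.12:9100"], {"app": "nodes", "name": "mynode"})) would produce the JSON equivalent of the YAML file shown above. Prometheus picks up changes to discovered files without a restart.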

5. Node_exporter deployment

5.1 Download node_exporter

# Download
wget https://github.com/prometheus/node_exporter/releases/download/v1.2.0/node_exporter-1.2.0.linux-amd64.tar.gz

# Extract
tar -xf node_exporter-1.2.0.linux-amd64.tar.gz -C /usr/local/

# Rename
cd /usr/local/ && mv node_exporter-1.2.0.linux-amd64  node_exporter

# Create a symlink
ln -s /usr/local/node_exporter/node_exporter   /usr/bin/node_exporter

# Manage node_exporter with systemd
# useradd -M -s /sbin/nologin prometheus              # skip this step if the user already exists
chown -R prometheus:prometheus /usr/local/node_exporter
# Edit the node_exporter.service file
vim /usr/lib/systemd/system/node_exporter.service

Example 1: short and simple

[Unit]
Description=Node_exporter
Documentation=https://github.com/prometheus/node_exporter/
After=network.target

[Service]
WorkingDirectory=/usr/local/node_exporter/
ExecStart=/usr/local/node_exporter/node_exporter
ExecReload=/bin/kill -HUP $MAINPID
ExecStop=/bin/kill -KILL $MAINPID
Type=simple
KillMode=control-group
Restart=on-failure
RestartSec=15s

[Install]
WantedBy=multi-user.target

Example 2: more detailed

[Unit]
Description=Node_exporter
After=network.target

[Service]
User=prometheus
Group=prometheus
ExecStart=/usr/local/node_exporter/node_exporter \
  --web.listen-address=0.0.0.0:9100 \
  --web.telemetry-path=/metrics \
  --log.level=info \
  --log.format=logfmt
ExecReload=/bin/kill -HUP $MAINPID
ExecStop=/bin/kill -KILL $MAINPID
Type=simple
KillMode=control-group
Restart=always
RestartSec=15s

[Install]
WantedBy=multi-user.target

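5.2 Start node_exporter

5.2.1 Start from the binary

# Run in the foreground
node_exporter 

# Run in the background, option 1
node_exporter &

# Run in the background, option 2
nohup node_exporter &> /var/log/node_exporter.log &

# Run in the background with flags (the log directory must exist first)
mkdir -p /var/log/node_exporter
nohup node_exporter  --web.listen-address=0.0.0.0:9100  --web.telemetry-path=/metrics \
&> /var/log/node_exporter/node_exporter.log &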

5.2.2 Start with systemd

# Reload the systemd unit files
systemctl daemon-reload

# Enable at boot and start node_exporter
systemctl enable node_exporter && systemctl start node_exporter

# Check the listening port
netstat -lntp | grep node_exporter
tcp6       0      0 :::9100                 :::*                    LISTEN      2725/node_exporter

# Check the service status
# systemctl status node_exporter
● node_exporter.service - node_exporter
   Loaded: loaded (/usr/lib/systemd/system/node_exporter.service; enabled; vendor preset: disabled)
   Active: active (running) since 二 2021-07-27 11:10:04 CST; 6s ago
     Docs: https://github.com/prometheus/node_exporter/
 Main PID: 83850 (node_exporter)
    Tasks: 6
   Memory: 28.0M
   CGroup: /system.slice/node_exporter.service
           └─83850 /usr/local/node_exporter/node_exporter

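Open hostip:9100 in a browser.
node_exporter lists the metrics that Prometheus can scrape, including the various system metrics further down in the output (prefixed with node_). To view these metrics along with their help and type information: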

curl http://localhost:9100/metrics | grep 'node_'
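The output of that curl command is plain text, one sample per line. As a rough illustration of the format, the following Python sketch parses such text into (name, labels, value) tuples. It is a simplified parser written for this article, not an official client: it does not unescape label values, ignores timestamps, and parse_metrics is a hypothetical name.

```python
import re

# name, optional {labels}, value, optional timestamp
LINE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text, prefix="node_"):
    """Return [(name, labels, value), ...] for samples whose name starts with prefix."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        if not name.startswith(prefix):
            continue
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Made-up sample in the exposition format.
SAMPLE = """\
# HELP node_load15 15m load average.
# TYPE node_load15 gauge
node_load15 0.05
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
go_goroutines 8
"""

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

The go_goroutines line is skipped because it lacks the node_ prefix, which mirrors what the grep in the command above does.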

5.3 Configure scrape_configs

Once node_exporter is running, Prometheus still has to be configured before it can read the node_exporter metrics.

vim /usr/local/prometheus/prometheus.yml                # edit the scrape_configs section
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['10.0.0.10:9090']

  - job_name: 'nodes'
    static_configs:
    - targets: ['10.0.0.10:9100','10.0.0.11:9100','10.0.0.12:9100']
# Restart Prometheus, or reload its configuration file
systemctl restart prometheus
# or
systemctl reload prometheus

Check the node status
Open the Prometheus page and go to Status → Targets.
The node_exporter targets deployed earlier show the state UP, which means they are running normally.
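The same information is available from the Prometheus HTTP API at /api/v1/targets, which is convenient for scripted checks. The sketch below assumes the server from this article at http://10.0.0.10:9090; fetch_targets, summarize_targets, and unhealthy are hypothetical helper names, and only the fields used here (activeTargets, labels, health) are read from the response.

```python
import json
from urllib.request import urlopen


def fetch_targets(base_url, timeout=5.0):
    """GET <base_url>/api/v1/targets and return the decoded JSON body."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


def summarize_targets(payload):
    """Map (job, instance) -> health ("up", "down", or "unknown")."""
    summary = {}
    for target in payload["data"]["activeTargets"]:
        labels = target.get("labels", {})
        key = (labels.get("job", ""), labels.get("instance", ""))
        summary[key] = target.get("health", "unknown")
    return summary


def unhealthy(summary):
    """Return the sorted list of (job, instance) pairs that are not up."""
    return sorted(key for key, health in summary.items() if health != "up")
```

A typical use would be unhealthy(summarize_targets(fetch_targets("http://10.0.0.10:9090"))), which returns an empty list when every target is UP.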

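The deployed node_exporter collects basic system information about the host it runs on. For example, the node_load15 metric shows the 15-minute load average.

node_exporter deployment is now complete.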

[Extension] Configuring a username and password for node_exporter

# Install httpd-tools
yum install -y httpd-tools

# Generate the password hash
htpasswd -nBC 10 "" | tr -d ':\n'   # press Enter, then type the password (for example, 111111)
$2y$10$SpFQBSWkvNboPXm/YaxwZOUo1WDi86QGSpf1ZfXJHyZmrK9RVWXX6

# Create web-config.yml in the node_exporter installation directory
basic_auth_users:
  # username: the bcrypt hash generated above
  mynode: $2y$10$SpFQBSWkvNboPXm/YaxwZOUo1WDi86QGSpf1ZfXJHyZmrK9RVWXX6
# Start node_exporter
node_exporter --web.config=/usr/local/node_exporter/web-config.yml 

# The systemd unit file looks like this
[Unit]
Description=Node_exporter
Documentation=https://github.com/prometheus/node_exporter/
After=network.target

[Service]
WorkingDirectory=/usr/local/node_exporter/
# The start command now passes the location of the web.config file
ExecStart=/usr/local/node_exporter/node_exporter  --web.config=/usr/local/node_exporter/web-config.yml 
ExecReload=/bin/kill -HUP $MAINPID
ExecStop=/bin/kill -KILL $MAINPID
Type=simple
KillMode=control-group
Restart=on-failure
RestartSec=15s

[Install]
WantedBy=multi-user.target

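Open ip:9100 again: a username and password are now required.
At this point Prometheus can no longer scrape node_exporter.

Update the Prometheus configuration file as follows:

vim /usr/local/prometheus/prometheus.yml                # edit the scrape_configs section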
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['10.0.0.10:9090']

  - job_name: 'nodes'
    # Add the basic-auth credentials
    basic_auth:
       # The same username and plaintext password you enter in the browser (not the bcrypt hash)
       username: mynode
       password: '111111'
    static_configs:
    - targets: ['10.0.0.10:9100','10.0.0.11:9100','10.0.0.12:9100']
# Restart Prometheus, or reload its configuration file
systemctl restart prometheus
# or
systemctl reload prometheus
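With basic_auth configured, each scrape request carries an Authorization header built from the plaintext username and password, which node_exporter then checks against the bcrypt hash in web-config.yml. The Python sketch below shows how that header is formed, using the example credentials from this article; basic_auth_header and build_scrape_request are illustrative names, and the request is only constructed, not sent.

```python
import base64
from urllib.request import Request


def basic_auth_header(username: str, password: str) -> str:
    """Return the HTTP Basic Authorization header value for the credentials."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def build_scrape_request(target: str, username: str, password: str,
                         path: str = "/metrics") -> Request:
    """Build (but do not send) an authenticated GET request for a target."""
    request = Request(f"http://{target}{path}")
    request.add_header("Authorization", basic_auth_header(username, password))
    return request


request = build_scrape_request("10.0.0.11:9100", "mynode", "111111")
print(request.full_url)
print(request.get_header("Authorization"))
```

Passing the request to urllib.request.urlopen would perform the scrape manually. Basic auth only encodes the credentials, so anyone who can observe plain HTTP traffic can read them; the TLS settings of the exporter and of tls_config address that.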

6. AlertManager deployment

6.1 Download alertmanager

# Download
wget  https://github.com/prometheus/alertmanager/releases/download/v0.22.2/alertmanager-0.22.2.linux-amd64.tar.gz

# Extract
tar -xf alertmanager-0.22.2.linux-amd64.tar.gz -C /usr/local/

# Rename
cd /usr/local/ && mv alertmanager-0.22.2.linux-amd64  alertmanager

# Create symlinks
ln -s /usr/local/alertmanager/alertmanager  /usr/bin/alertmanager
ln -s /usr/local/alertmanager/amtool /usr/bin/amtool

# Manage alertmanager with systemd
useradd -M -s /sbin/nologin prometheus              # skip if the user already exists
chown -R prometheus:prometheus /usr/local/alertmanager
# Edit the alertmanager.service file
vim /usr/lib/systemd/system/alertmanager.service

Example 1: short and simple

[Unit]
Description=Alertmanager
Documentation=https://github.com/prometheus/alertmanager/releases/
After=network.target

[Service]
WorkingDirectory=/usr/local/alertmanager/
ExecStart=/usr/local/alertmanager/alertmanager
ExecReload=/bin/kill -HUP $MAINPID
ExecStop=/bin/kill -KILL $MAINPID
Type=simple
KillMode=control-group
Restart=on-failure
RestartSec=15s

[Install]
WantedBy=multi-user.target

Example 2: more detailed

[Unit]
Description=Alertmanager
Documentation=https://github.com/prometheus/alertmanager/releases/
After=network.target

[Service]
User=prometheus
Group=prometheus
ExecStart=/usr/local/alertmanager/alertmanager \
  --config.file=/usr/local/alertmanager/alertmanager.yml \
  --storage.path=/usr/local/alertmanager/data \
  --web.listen-address=0.0.0.0:9093 \
  --cluster.listen-address=0.0.0.0:9094 \
  --log.level=info \
  --log.format=logfmt
ExecReload=/bin/kill -HUP $MAINPID
ExecStop=/bin/kill -KILL $MAINPID 
Type=simple
KillMode=control-group
Restart=always
RestartSec=15s

[Install]
WantedBy=multi-user.target

Check the configuration file

[root@localhost alertmanager]# amtool check-config alertmanager.yml 
Checking 'alertmanager.yml'  SUCCESS
Found:
 - global config
 - route
 - 1 inhibit rules
 - 1 receivers
 - 0 templates

6.2 Start alertmanager

6.2.1 Start from the binary

# Run in the foreground
alertmanager
# Run in the background, option 1
alertmanager &
# Run in the background, option 2 (the log directory must exist first)
mkdir -p /var/log/alertmanager
nohup alertmanager &> /var/log/alertmanager/alertmanager.log &
# Run in the background with flags
nohup alertmanager  --config.file=/usr/local/alertmanager/alertmanager.yml \
  --storage.path=/usr/local/alertmanager/data   --web.listen-address=0.0.0.0:9093 \
  --cluster.listen-address=0.0.0.0:9094  &> /var/log/alertmanager/alertmanager.log   &

6.2.2 Start with systemd

# Reload the systemd unit files
systemctl daemon-reload
# Enable at boot and start alertmanager
systemctl enable alertmanager && systemctl start alertmanager
# Check the listening ports
netstat -lntp | grep alertmanager
tcp6       0      0 :::9093                 :::*                    LISTEN      89558/alertmanager  
tcp6       0      0 :::9094                 :::*                    LISTEN      89558/alertmanager

# Check the service status
[root@localhost alertmanager]# systemctl status alertmanager
● alertmanager.service - Alertmanager
   Loaded: loaded (/usr/lib/systemd/system/alertmanager.service; enabled; vendor preset: disabled)
   Active: active (running) since 二 2021-07-27 14:49:47 CST; 7s ago
     Docs: https://github.com/prometheus/alertmanager/releases/
 Main PID: 89558 (alertmanager)
    Tasks: 9
   Memory: 17.4M
   CGroup: /system.slice/alertmanager.service
           └─89558 /usr/local/alertmanager/alertmanager

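Open hostIP:9093 in a browser. The page is empty because Prometheus has not been configured to send alerts yet.

6.3 Configure alerting

Once alertmanager is running, Prometheus still has to be configured before it can send alerts through alertmanager.

vim /usr/local/prometheus/prometheus.yml                # edit the alerting section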
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - "10.0.0.11:9093"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['10.0.0.10:9090']

  - job_name: 'nodes'
    static_configs:
    - targets: ['10.0.0.10:9100','10.0.0.11:9100','10.0.0.12:9100']

  - job_name: 'alertmanager'
    static_configs:
    - targets: ['10.0.0.11:9093']
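# Restart Prometheus, or reload its configuration file
systemctl restart prometheus
# or
systemctl reload prometheus

Open the Prometheus page and go to Status → Targets.
The alertmanager target deployed earlier shows the state UP, which means it is running normally.

alertmanager deployment is now complete. It still needs notification routing and notification receivers to be configured.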

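6.4 alertmanager configuration

alertmanager is configured through command-line flags and a configuration file. The command-line flags set immutable system parameters, while the configuration file defines inhibition rules, notification routing, and notification receivers.

The alertmanager configuration file, alertmanager.yml, is divided into the following main blocks:

Global settings         global

Notification templates  templates

Routing                 route

Receivers               receivers

Inhibition rules        inhibit_rules

6.4.1 Global settings `global`

global specifies parameters that are valid in all other configuration contexts. They also serve as defaults for the other configuration sections.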

global:
  # The default SMTP From header field
  [ smtp_from: <tmpl_string> ]

  # The default SMTP smarthost used for sending email, including the port number
  # The port is usually 25, or 587 for SMTP over TLS
  # Example: smtp.example.org:587
  [ smtp_smarthost: <string> ]

  # The default hostname to identify to the SMTP server
  [ smtp_hello: <string> | default = "localhost" ]

  # SMTP Auth using CRAM-MD5, LOGIN and PLAIN. If empty, Alertmanager does not authenticate to the SMTP server
  [ smtp_auth_username: <string> ]

  # SMTP Auth using LOGIN and PLAIN.
  [ smtp_auth_password: <secret> ]

  # SMTP Auth using PLAIN.
  [ smtp_auth_identity: <string> ]

  # SMTP Auth using CRAM-MD5.
  [ smtp_auth_secret: <secret> ]

  # The default SMTP TLS requirement
  # Note that Go does not support unencrypted connections to remote SMTP endpoints
  [ smtp_require_tls: <bool> | default = true ]

  # The API URL to use for Slack notifications
  [ slack_api_url: <secret> ]
  [ victorops_api_key: <secret> ]
  [ victorops_api_url: <string> | default = "https://alert.victorops.com/integrations/generic/20131114/alert/" ]
  [ pagerduty_url: <string> | default = "https://events.pagerduty.com/v2/enqueue" ]
  [ opsgenie_api_key: <secret> ]
  [ opsgenie_api_url: <string> | default = "https://api.opsgenie.com/" ]
  [ wechat_api_url: <string> | default = "https://qyapi.weixin.qq.com/cgi-bin/" ]
  [ wechat_api_secret: <secret> ]
  [ wechat_api_corp_id: <string> ]

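  # The default HTTP client configuration
  [ http_config: <http_config> ]

  # ResolveTimeout is the default value alertmanager uses if an alert does not include EndsAt; after this time passes, the alert can be declared resolved if it has not been updated
  # This has no effect on alerts from Prometheus, because they always include EndsAt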
  [ resolve_timeout: <duration> | default = 5m ]

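6.4.2 Notification templates `templates`

templates lists the files from which custom notification template definitions are read. The last path component may use a wildcard matcher, for example templates/*.tmpl.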

templates:
  [ - <filepath> ... ]

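6.4.3 Routing `route`

route defines a node in the routing tree and its children. Optional configuration parameters that are not set are inherited from the parent node.

Every alert enters the routing tree at the configured top-level route, which must match all alerts (that is, it has no configured matchers). The alert then traverses the child nodes. If continue is false, traversal stops after the first matching child; if continue is true, the alert continues matching against subsequent siblings. If an alert matches none of a node's children (no child matches, or there are no children), it is handled according to the configuration of the current node.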

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
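The traversal rules described above can be modeled in a few lines. The following Python sketch is a simplified illustration of routing-tree matching, not Alertmanager's implementation: it supports only exact-match label matchers, and the Route class and the receiver names db-team and pager are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str = ""                          # empty: inherit from the parent
    match: dict = field(default_factory=dict)   # exact label matchers
    continue_: bool = False                     # keep matching later siblings?
    routes: list = field(default_factory=list)  # child routes


def _matches(route: Route, labels: dict) -> bool:
    # A missing label is treated like a label with an empty value.
    return all(labels.get(k, "") == v for k, v in route.match.items())


def resolve(route: Route, labels: dict, inherited: str = "") -> list:
    """Return the receivers an alert with these labels is delivered to."""
    if not _matches(route, labels):
        return []
    receiver = route.receiver or inherited
    found = []
    for child in route.routes:
        hits = resolve(child, labels, receiver)
        found.extend(hits)
        if hits and not child.continue_:
            break  # stop at the first matching child unless it says continue
    # No child matched: the current node handles the alert.
    return found or [receiver]


root = Route(
    receiver="web.hook",
    routes=[
        Route(receiver="db-team", match={"service": "mysql"}, continue_=True),
        Route(receiver="pager", match={"severity": "critical"}),
        Route(match={"team": "ops"}),  # inherits web.hook from the root
    ],
)

print(resolve(root, {"alertname": "HighLoad"}))
print(resolve(root, {"service": "mysql", "severity": "critical"}))
```

The first call falls through to the top-level receiver, while the second reaches two receivers because the first matching child sets continue.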

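6.4.4 Receivers `receivers`

receivers is a named configuration of one or more notification integrations. Custom notification integrations are best implemented through the webhook receiver.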

receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
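For testing, something has to listen on the webhook URL. The sketch below is a minimal receiver using only the Python standard library, assuming the URL http://127.0.0.1:5001/ from the example configuration. It reads just a few fields of the JSON body that Alertmanager posts (alerts, status, labels); summarize, WebhookHandler, and serve are names chosen for this illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize(payload: dict) -> list:
    """Turn a webhook payload into one human-readable line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        lines.append("[{}] {} on {}".format(
            alert.get("status", "unknown"),
            labels.get("alertname", "?"),
            labels.get("instance", "?"),
        ))
    return lines


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_response(400)
            self.end_headers()
            return
        for line in summarize(payload):
            print(line)
        # Any 2xx response tells Alertmanager the notification was accepted.
        self.send_response(200)
        self.end_headers()


def serve(host: str = "127.0.0.1", port: int = 5001) -> None:
    """Block forever, handling webhook notifications."""
    HTTPServer((host, port), WebhookHandler).serve_forever()
```

Calling serve() starts the listener. A real integration would forward the summary to a chat or ticketing system instead of printing it.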

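6.4.5 Inhibition rules `inhibit_rules`

An inhibition rule mutes an alert (the target) that matches one set of matchers while another alert (the source) that matches a second set of matchers exists. The target and source alerts must have the same values for every label name in the equal list.

Semantically, a missing label and a label with an empty value are the same. Therefore, if all the label names listed in equal are missing from both the source and the target alert, the inhibition rule applies.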

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
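The rule above can be read as a small predicate over label sets. The following Python sketch models it under simplifying assumptions: only exact-match matchers, alerts represented as plain dicts of labels, and is_inhibited is a hypothetical function name. The sample alerts are made up.

```python
def _matches(alert: dict, matchers: dict) -> bool:
    # A missing label is equivalent to a label with an empty value.
    return all(alert.get(name, "") == value for name, value in matchers.items())


def is_inhibited(target: dict, firing: list, rule: dict) -> bool:
    """True if some firing source alert mutes the target under this rule."""
    if not _matches(target, rule["target_match"]):
        return False
    for source in firing:
        if source is target:
            continue
        if not _matches(source, rule["source_match"]):
            continue
        # Every label named in "equal" must have the same value on both alerts.
        if all(source.get(n, "") == target.get(n, "") for n in rule["equal"]):
            return True
    return False


RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}

critical = {"alertname": "NodeDown", "instance": "10.0.0.11:9100", "severity": "critical"}
warning = {"alertname": "NodeDown", "instance": "10.0.0.11:9100", "severity": "warning"}
other = {"alertname": "NodeDown", "instance": "10.0.0.12:9100", "severity": "warning"}

print(is_inhibited(warning, [critical, warning, other], RULE))
print(is_inhibited(other, [critical, warning, other], RULE))
```

Neither alert carries a dev label, so that entry in equal compares empty with empty and does not block the inhibition; the second warning stays active because its instance differs.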

6.4.6 Default configuration example

vim /usr/local/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

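7. Grafana deployment

Grafana is an open-source application with a backend written in Go, used mainly to visualize large volumes of metric data. It is a widely used tool for displaying time-series data in infrastructure and application analytics, and it supports most of the commonly used time-series databases.

Website: https://grafana.com

7.1 Install Grafana

7.1.1 Install with yum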

vim /etc/yum.repos.d/grafana.repo
[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
# Install Grafana
yum makecache fast -y
yum install -y initscripts urw-fonts wget
yum install -y grafana

7.1.2 Install from a binary package

# Download the rpm package
wget https://dl.grafana.com/oss/release/grafana-8.0.6-1.x86_64.rpm
# Install it; the result is the same as installing from the yum repository
yum install -y grafana-8.0.6-1.x86_64.rpm

# Alternatively, download the binary tarball
wget https://dl.grafana.com/oss/release/grafana-8.0.6.linux-amd64.tar.gz
# Extract
tar -xf grafana-8.0.6.linux-amd64.tar.gz -C /usr/local/
# Rename
cd /usr/local/  && mv grafana-8.0.6  grafana

# Create symlinks
ln -s /usr/local/grafana/bin/grafana-server /usr/bin/grafana-server
ln -s /usr/local/grafana/bin/grafana-cli /usr/bin/grafana-cli

# Create the systemd unit file
vim /usr/lib/systemd/system/grafana-server.service
[Unit]
Description=Grafana
Documentation=https://grafana.com/
After=network.target

[Service]
WorkingDirectory=/usr/local/grafana/
ExecStart=/usr/local/grafana/bin/grafana-server
ExecReload=/bin/kill -HUP $MAINPID
ExecStop=/bin/kill -KILL $MAINPID
Type=simple
KillMode=control-group
Restart=on-failure
RestartSec=15s

[Install]
WantedBy=multi-user.target

7.1.3 Deploy Grafana with Docker

mkdir /opt/grafana && chmod 777 /opt/grafana

docker pull grafana/grafana

docker run -d -p 3000:3000 --name=grafana -v /opt/grafana:/var/lib/grafana grafana/grafana

docker exec -it grafana grafana-cli plugins install alexanderzobnin-zabbix-app              # install the Zabbix plugin

docker restart grafana

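The initial username and password are admin and admin; change the password after the first login.

Configuration file inside the Grafana container: /etc/grafana/grafana.ini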

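7.2 Start Grafana

7.2.1 Start from the binary

# Run in the foreground
grafana-server

# Run in the background, option 1
grafana-server &

# Run in the background, option 2 (the log directory must exist first)
mkdir -p /var/log/grafana
nohup grafana-server &>/var/log/grafana/grafana.log  & 

# Run in the background with flags
nohup grafana-server -config "/usr/local/grafana/conf/defaults.ini"  &>/var/log/grafana/grafana.log  &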

7.2.2 Start with systemd

# Reload the systemd unit files
systemctl daemon-reload

# Enable at boot and start Grafana
systemctl enable grafana-server && systemctl start grafana-server

# Check the listening port
netstat -lntp | grep 3000
tcp6       0      0 :::3000                 :::*                    LISTEN      3303/grafana-server 

# Check the service status
[root@localhost grafana]# systemctl status grafana-server
● grafana-server.service - Grafana
   Loaded: loaded (/usr/lib/systemd/system/grafana-server.service; disabled; vendor preset: disabled)
   Active: active (running) since 二 2021-07-27 15:43:06 CST; 10s ago
     Docs: https://grafana.com/
 Main PID: 3303 (grafana-server)
   CGroup: /system.slice/grafana-server.service
           └─3303 /usr/local/grafana/bin/grafana-server

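Open ip:3000 in a browser. The initial username and password are admin and admin; change the password after the first login.
Grafana deployment is now complete.

Binary installation: the Grafana configuration file is /usr/local/grafana/conf/defaults.ini

yum installation: the Grafana configuration file is /etc/grafana/grafana.ini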

7.3 Using Grafana

7.3.1 Add the Prometheus data source

Go to Configuration → Data Sources → Prometheus → Select, enter http://ip:9090 as the URL, and save.
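The same data source can be created without the UI through Grafana's HTTP API (POST /api/datasources). The Python sketch below only builds the request; it assumes Grafana at http://10.0.0.12:3000 with the initial admin/admin credentials and Prometheus at http://10.0.0.10:9090, as in this article's host table. datasource_payload and build_request are illustrative names.

```python
import base64
import json
from urllib.request import Request


def datasource_payload(name: str, prometheus_url: str, default: bool = True) -> dict:
    """JSON body describing a Prometheus data source."""
    return {
        "name": name,
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",   # Grafana's backend queries Prometheus on behalf of the browser
        "isDefault": default,
    }


def build_request(grafana_url: str, user: str, password: str, payload: dict) -> Request:
    """Build (but do not send) the authenticated POST request."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


request = build_request(
    "http://10.0.0.12:3000", "admin", "admin",
    datasource_payload("Prometheus", "http://10.0.0.10:9090"),
)
print(request.get_method(), request.full_url)
```

Sending it with urllib.request.urlopen(request) would create the data source. Change the default password first and substitute the new credentials.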

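7.3.2 Import a dashboard

Official dashboard templates: https://grafana.com/grafana/dashboards

Choose the top-ranked Chinese-language template, "1 Node Exporter for Prometheus Dashboard CN v20200628"; its template ID is 8919.

Go to Manage → Import, enter the template ID, and import it.
Give the dashboard a name of your choice and select Prometheus as the data source.
The Prometheus + Grafana deployment is now complete.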

7.3.3 Install plugins [optional]

All Grafana plugins: https://grafana.com/grafana/plugins?orderBy=weight&direction=asc

grafana-cli plugins install alexanderzobnin-zabbix-app              # Zabbix plugin

grafana-cli plugins install grafana-clock-panel                     # clock

grafana-cli plugins install grafana-piechart-panel                  # pie chart

grafana-cli plugins install novalabs-annotations-panel              # annotations

grafana-cli plugins install farski-blendstat-panel                  # blended stat

grafana-cli plugins install yesoreyeram-boomtable-panel             # multi-table

grafana-cli plugins install yesoreyeram-boomtheme-panel             # multi-theme

grafana-cli plugins install jeanbaptistewatenberg-percent-panel     # percentage

grafana-cli plugins install corpglory-progresslist-panel            # progress list

grafana-cli plugins install mxswat-separator-panel                  # separator |

grafana-cli plugins install aidanmountford-html-panel               # HTML

Restart Grafana after installing plugins: systemctl restart grafana-server.

Monitoring whether a host is alive

When Grafana is combined with Zabbix, Zabbix's agent.ping item does not reliably show whether a host is down. Use icmpping[<target>,<packets>,<interval>,<size>,<timeout>] to monitor liveness instead: it returns 1 if the host is alive and 0 otherwise.