
Prometheus Monitoring in Practice

Deploying Prometheus

Use Prometheus + Grafana to monitor service targets such as servers and databases (MySQL, MongoDB, etc.).

Preparation

Downloads

#  Prometheus Server
https://prometheus.io/download/

wget -c https://github.com/prometheus/prometheus/releases/download/v2.20.0/prometheus-2.20.0.linux-amd64.tar.gz &

# Alertmanager (alert notification management component)
wget -c https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz &

# exporter components
wget -c https://github.com/prometheus/consul_exporter/releases/download/v0.7.1/consul_exporter-0.7.1.linux-amd64.tar.gz &
wget -c https://github.com/prometheus/mysqld_exporter/releases/download/v0.12.1/mysqld_exporter-0.12.1.linux-amd64.tar.gz &
wget -c https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-amd64.tar.gz &

Installing Prometheus

Prometheus can be installed from the traditional binary package or via Docker.

Binary package installation

mkdir -p /ups/app/monitor/
# Unpack
tar -xf prometheus-*.linux-amd64.tar.gz -C /ups/app/monitor/
# Rename the directory and create a version-independent symlink
cd /ups/app/monitor/
mv prometheus-*.linux-amd64 prometheus-2.20.0
ln -s prometheus-2.20.0 prometheus

# Create the directory layout
mkdir -p prometheus/{bin,logs,config/rules,data}
cd prometheus/config && mkdir -p targets/{node,redis,postgresql,mysql}
# Create the service user
# groupadd -g 2000 prometheus
useradd -r -M -c "Prometheus Server" -d /ups/app/monitor/ -s /sbin/nologin prometheus
# Change ownership
chown -R prometheus.prometheus /ups/app/monitor/prometheus-2.20.0
# Reorganize the directory structure
cd /ups/app/monitor/prometheus
mv prometheus promtool tsdb bin/
mv prometheus.yml config/
Service startup flags
[root@progs prometheus]# ./bin/prometheus --help
usage: prometheus [<flags>]

The Prometheus monitoring server

Flags:
  -h, --help                     Show context-sensitive help (also try --help-long and --help-man).
      --version                  Show application version.
      --config.file="prometheus.yml"  
                                 Prometheus configuration file path.
      --web.listen-address="0.0.0.0:9090"  
                                 Address to listen on for UI, API, and telemetry.
      --web.read-timeout=5m      Maximum duration before timing out read of the request, and closing idle connections.
      --web.max-connections=512  Maximum number of simultaneous connections.
      --web.external-url=<URL>   The URL under which Prometheus is externally reachable (for example, if Prometheus is served via a reverse
                                 proxy). Used for generating relative and absolute links back to Prometheus itself. If the URL has a path
                                 portion, it will be used to prefix all HTTP endpoints served by Prometheus. If omitted, relevant URL
                                 components will be derived automatically.
      --web.route-prefix=<path>  Prefix for the internal routes of web endpoints. Defaults to path of --web.external-url.
      --web.user-assets=<path>   Path to static asset directory, available at /user.
      --web.enable-lifecycle     Enable shutdown and reload via HTTP request.
      --web.enable-admin-api     Enable API endpoints for admin control actions.
      --web.console.templates="consoles"  
                                 Path to the console template directory, available at /consoles.
      --web.console.libraries="console_libraries"  
                                 Path to the console library directory.
      --web.page-title="Prometheus Time Series Collection and Processing Server"  
                                 Document title of Prometheus instance.
      --web.cors.origin=".*"     Regex for CORS origin. It is fully anchored. Example: 'https?://(domain1|domain2)\.com'
      --storage.tsdb.path="data/"  
                                 Base path for metrics storage.
      --storage.tsdb.retention=STORAGE.TSDB.RETENTION  
                                 [DEPRECATED] How long to retain samples in storage. This flag has been deprecated, use
                                 "storage.tsdb.retention.time" instead.
      --storage.tsdb.retention.time=STORAGE.TSDB.RETENTION.TIME  
                                 How long to retain samples in storage. When this flag is set it overrides "storage.tsdb.retention". If neither
                                 this flag nor "storage.tsdb.retention" nor "storage.tsdb.retention.size" is set, the retention time defaults
                                 to 15d. Units Supported: y, w, d, h, m, s, ms.
      --storage.tsdb.retention.size=STORAGE.TSDB.RETENTION.SIZE  
                                 [EXPERIMENTAL] Maximum number of bytes that can be stored for blocks. Units supported: KB, MB, GB, TB, PB.
                                 This flag is experimental and can be changed in future releases.
      --storage.tsdb.no-lockfile  
                                 Do not create lockfile in data directory.
      --storage.tsdb.allow-overlapping-blocks  
                                 [EXPERIMENTAL] Allow overlapping blocks, which in turn enables vertical compaction and vertical query merge.
      --storage.tsdb.wal-compression  
                                 Compress the tsdb WAL.
      --storage.remote.flush-deadline=<duration>  
                                 How long to wait flushing sample on shutdown or config reload.
      --storage.remote.read-sample-limit=5e7  
                                 Maximum overall number of samples to return via the remote read interface, in a single query. 0 means no
                                 limit. This limit is ignored for streamed response types.
      --storage.remote.read-concurrent-limit=10  
                                 Maximum number of concurrent remote read calls. 0 means no limit.
      --storage.remote.read-max-bytes-in-frame=1048576  
                                 Maximum number of bytes in a single frame for streaming remote read response types before marshalling. Note
                                 that client might have limit on frame size as well. 1MB as recommended by protobuf by default.
      --rules.alert.for-outage-tolerance=1h  
                                 Max time to tolerate prometheus outage for restoring "for" state of alert.
      --rules.alert.for-grace-period=10m  
                                 Minimum duration between alert and restored "for" state. This is maintained only for alerts with configured
                                 "for" time greater than grace period.
      --rules.alert.resend-delay=1m  
                                 Minimum amount of time to wait before resending an alert to Alertmanager.
      --alertmanager.notification-queue-capacity=10000  
                                 The capacity of the queue for pending Alertmanager notifications.
      --alertmanager.timeout=10s  
                                 Timeout for sending alerts to Alertmanager.
      --query.lookback-delta=5m  The maximum lookback duration for retrieving metrics during expression evaluations and federation.
      --query.timeout=2m         Maximum time a query may take before being aborted.
      --query.max-concurrency=20  
                                 Maximum number of queries executed concurrently.
      --query.max-samples=50000000  
                                 Maximum number of samples a single query can load into memory. Note that queries will fail if they try to load
                                 more samples than this into memory, so this also limits the number of samples a query can return.
      --log.level=info           Only log messages with the given severity or above. One of: [debug, info, warn, error]
      --log.format=logfmt        Output format of log messages. One of: [logfmt, json]
Configure the systemd service
# Create the systemd unit
cat > /usr/lib/systemd/system/prometheus.service <<-EOF
[Unit]
Description=https://prometheus.io
After=network.target
#After=postgresql.service mariadb.service mysql.service
Wants=network-online.target

[Service]
User=prometheus
Group=prometheus

Type=simple

WorkingDirectory=/ups/app/monitor/prometheus/
# RuntimeDirectory=prometheus
# RuntimeDirectoryMode=0750
ExecStart=/ups/app/monitor/prometheus/bin/prometheus \
    --config.file=/ups/app/monitor/prometheus/config/prometheus.yml \
    --storage.tsdb.retention=30d \
    --storage.tsdb.path="/ups/app/monitor/prometheus/data/" \
    --web.console.templates=/ups/app/monitor/prometheus/consoles \
    --web.console.libraries=/ups/app/monitor/prometheus/console_libraries \
    --web.enable-lifecycle --web.enable-admin-api \
    --web.listen-address=:9090 
Restart=on-failure
# Sets open_files_limit
LimitNOFILE=10000
TimeoutStopSec=20

StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=prometheus

[Install]
WantedBy=multi-user.target
EOF
Redirect the service log to a dedicated file via rsyslog
cat > /etc/rsyslog.d/prometheus.conf <<-EOF
if \$programname == 'prometheus' then /ups/app/monitor/prometheus/logs/prometheusd.log
& stop
EOF
Configuration file

vi /ups/app/monitor/prometheus/config/prometheus.yml

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - progs:9093  # port 9093 of the running Alertmanager node

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/alert_node.yml"
  - "rules/alert_mysql.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
    - targets: ['localhost:9100']
    relabel_configs:
    - action: replace
      source_labels: ['__address__']  ## source label
      regex: (.*):(.*)                ## regex matched against the __address__ value
      replacement: $1                 ## reference to the captured group
      target_label: HOSTNAME          ## new label named HOSTNAME

  - job_name: 'MySQL'
    static_configs:
    - targets: ['localhost:9104']
    relabel_configs:
    - action: replace
      source_labels: ['__address__']  ## source label
      regex: (.*):(.*)                ## regex matched against the __address__ value
      replacement: $1                 ## reference to the captured group
      target_label: instance          ## new label named instance
Check the configuration file
cd /ups/app/monitor/prometheus
./bin/promtool check config config/prometheus.yml
Start the service
# Start in the foreground
./bin/prometheus --config.file=config/prometheus.yml
or
# Manage via systemd
systemctl daemon-reload

systemctl enable prometheus.service
systemctl start  prometheus.service
systemctl stop   prometheus.service
systemctl status prometheus.service
Reloading the Prometheus service

With the --web.enable-lifecycle startup flag, the configuration can be reloaded without restarting the service:

curl -X POST http://localhost:9090/-/reload
Verification
# Check the runtime environment by printing the version
./bin/prometheus --version

lsof -i :9090

# Open the web UI, default port 9090
http://192.168.10.181:9090

Docker installation

Install Docker
yum -y install docker
Run Prometheus as a container
Install from the Quay.io or Docker Hub image registries.
$ docker run --name prometheus -d -p 127.0.0.1:9090:9090 quay.io/prometheus/prometheus

# Start with a bind-mounted prometheus.yml
docker run \
    -p 9090:9090 \
    -v /tmp/prometheus.yml:/etc/prometheus/prometheus.yml \
    prom/prometheus

# Mount an entire configuration directory as a volume
docker run \
    -p 9090:9090 \
    -v /path/to/config:/etc/prometheus \
    prom/prometheus

Install via a Dockerfile
FROM prom/prometheus
ADD prometheus.yml /etc/prometheus/

# Build and run the image
docker build -t my-prometheus .
docker run -p 9090:9090 my-prometheus
Managing Prometheus with Docker
# List running containers
docker ps


Run docker start prometheus to start the service.

Run docker stats prometheus to view Prometheus resource usage.

Run docker stop prometheus to stop the service.

Configuration

When Prometheus starts, the --config.file flag specifies the configuration file to load; the default is prometheus.yml.

The configuration file can define global, alerting, rule_files, scrape_configs, remote_write, remote_read and other sections.

Global configuration

global holds the global defaults and mainly contains four properties (a minimal example follows the list):

  • scrape_interval: the default interval at which targets are scraped.
  • scrape_timeout: the timeout for scraping a single target.
  • evaluation_interval: the interval at which rules are evaluated.
  • external_labels: extra labels attached to scraped data before it is stored.
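A minimal global block might look like the following; the cluster and region label values are purely illustrative:

global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    cluster: demo
    region: dc1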

Alerting configuration

Alertmanager can be configured with -alertmanager.* command-line flags, but that approach is inflexible: it cannot be reloaded dynamically, nor can alert attributes be defined dynamically.

The alerting section addresses this and manages Alertmanager more flexibly. It mainly contains two parameters (a sketch follows the list):

  • alert_relabel_configs: rules for dynamically rewriting alert labels.
  • alertmanagers: configuration for dynamically discovering Alertmanager instances.
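A sketch of the section, assuming the Alertmanager on progs:9093 used elsewhere in this article; the dropped label name is illustrative:

alerting:
  alert_relabel_configs:
    - action: labeldrop        # drop a noisy label before alerts are routed
      regex: replica
  alertmanagers:
    - static_configs:
        - targets: ['progs:9093']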

Rule files

rule_files lists the rule files to load; multiple files as well as glob patterns are supported:

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

Scrape configuration

scrape_configs defines the endpoints to scrape. Each scrape configuration mainly contains the following parameters (an illustrative job follows the list):

  • job_name: name of the job.
  • honor_labels: resolves label conflicts in scraped data; when true the scraped labels win, otherwise the server-side labels win.
  • params: HTTP query parameters sent with each scrape request.
  • scrape_interval: scrape interval.
  • scrape_timeout: scrape timeout.
  • metrics_path: metrics path on the target.
  • scheme: protocol used for the scrape.
  • sample_limit: limit on the number of samples accepted per scrape; if a scrape exceeds it, that scrape is discarded and not stored. The default of 0 means no limit.
  • relabel_configs: relabeling rules applied to targets before scraping.
  • metric_relabel_configs: relabeling rules applied to scraped metrics before storage.
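As an illustration only (not part of the deployment described in this article), a job combining several of these parameters might look like this; the job name, target, and query parameter are hypothetical:

  - job_name: 'app_example'              # illustrative job
    scheme: http
    metrics_path: /metrics
    scrape_interval: 30s
    scrape_timeout: 10s
    honor_labels: true
    sample_limit: 10000
    params:
      debug: ['false']                   # sent as ?debug=false on every scrape
    static_configs:
      - targets: ['localhost:8080']
    metric_relabel_configs:
      - action: drop                     # drop noisy Go runtime series before storage
        source_labels: [__name__]
        regex: 'go_gc_.*'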

Remote write storage

remote_write configures writable remote storage and mainly contains the following parameters:

  • url: endpoint URL.
  • remote_timeout: request timeout.
  • write_relabel_configs: relabeling applied to scraped data before it is sent to remote storage.

Note: remote_write is experimental; use it with caution.

Remote read storage

remote_read configures readable remote storage and mainly contains the following parameters:

  • url: endpoint URL.
  • remote_timeout: request timeout.

Note: remote_read is experimental; use it with caution.
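A sketch of both sections, assuming a hypothetical remote storage endpoint:

remote_write:
  - url: "http://remote-storage.example.com/api/v1/write"
    remote_timeout: 30s
    write_relabel_configs:
      - action: drop                   # do not ship Go runtime series to remote storage
        source_labels: [__name__]
        regex: 'go_.*'

remote_read:
  - url: "http://remote-storage.example.com/api/v1/read"
    remote_timeout: 1m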

Service discovery

One of the most important concepts in the Prometheus configuration is the target, i.e. the data source. Targets can be configured statically or discovered dynamically, roughly in the following categories (a Consul-based sketch follows the list):

  • static_configs: static targets
  • dns_sd_configs: DNS service discovery
  • file_sd_configs: file-based service discovery
  • consul_sd_configs: Consul service discovery
  • serverset_sd_configs: Serverset service discovery
  • nerve_sd_configs: Nerve service discovery
  • marathon_sd_configs: Marathon service discovery
  • kubernetes_sd_configs: Kubernetes service discovery
  • gce_sd_configs: GCE service discovery
  • ec2_sd_configs: EC2 service discovery
  • openstack_sd_configs: OpenStack service discovery
  • azure_sd_configs: Azure service discovery
  • triton_sd_configs: Triton service discovery
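file_sd_configs is used extensively later in this article; for contrast, a minimal Consul-based job might look like this (the Consul address is illustrative):

  - job_name: 'consul_services'
    consul_sd_configs:
      - server: 'localhost:8500'
        services: []          # an empty list means all registered services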

Sample configuration

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # By default, scrape targets every 15 seconds.

rule_files:
  - "rules/node.rules"

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    scrape_interval: 8s
    static_configs:
      - targets: ['127.0.0.1:9100', '127.0.0.12:9100']

  - job_name: 'mysqld'
    static_configs:
      - targets: ['127.0.0.1:9104']
      
  - job_name: 'memcached'
    static_configs:
      - targets: ['127.0.0.1:9150']

Deploying Grafana

Grafana is the web visualization component.

Downloads

# Grafana package
https://grafana.com/grafana/download
# grafana-dashboards package (Percona dashboards)
https://github.com/percona/grafana-dashboards/releases


# Standalone Linux Binaries(64 Bit)SHA256: b6cbc04505edb712f206228261d0ea5ab7e9c03e9f77d0d36930886c861366ed
wget https://dl.grafana.com/oss/release/grafana-7.1.1.linux-amd64.tar.gz
tar -xf grafana-7.1.1.linux-amd64.tar.gz

Installation

Binary package installation

mkdir -p /ups/app/monitor/
# Unpack
tar -xf grafana-*.linux-amd64.tar.gz -C /ups/app/monitor/

# Rename the directory
cd /ups/app/monitor/
mv grafana-7.1.1 grafana
mkdir -p /ups/app/monitor/grafana/logs

# Create the service user
# groupadd -g 2001 grafana
useradd -r -d /ups/app/monitor/grafana -c "Grafana Server" -M -s /sbin/nologin grafana

# Change ownership
chown -R grafana.grafana /ups/app/monitor/grafana
Configure the systemd service
# Create the systemd unit
cat > /usr/lib/systemd/system/grafana.service <<-EOF
[Unit]
Description=Grafana instance
Documentation=http://docs.grafana.org
Wants=network-online.target
After=network-online.target
#After=postgresql-12.service mysql3308.service mysql.service

[Service]
# EnvironmentFile=/etc/sysconfig/grafana-server
User=grafana
Group=grafana
Type=notify
Restart=on-failure
WorkingDirectory=/ups/app/monitor/grafana
RuntimeDirectory=grafana
RuntimeDirectoryMode=0750

# ExecStart=/ups/app/monitor/grafana/bin/grafana-server                               \
#                             --config=\${CONF_FILE}                                   \
#                             --pidfile=\${PID_FILE_DIR}/grafana-server.pid            \
#                             --packaging=rpm                                         \
#                             cfg:default.paths.logs=\${LOG_DIR}                       \
#                             cfg:default.paths.data=\${DATA_DIR}                      \
#                             cfg:default.paths.plugins=\${PLUGINS_DIR}                \
#                             cfg:default.paths.provisioning=\${PROVISIONING_CFG_DIR}  

ExecStart=/ups/app/monitor/grafana/bin/grafana-server
LimitNOFILE=10000
TimeoutStopSec=20

#StandardOutput=syslog
#StandardError=syslog
#SyslogIdentifier=grafana

[Install]
WantedBy=multi-user.target
EOF
Redirect logs to a dedicated file via rsyslog
cat > /etc/rsyslog.d/grafana.conf <<-EOF
if \$programname == 'grafana' then /ups/app/monitor/grafana/logs/grafana.log
& stop
EOF
Start the service
# Start in the foreground
/ups/app/monitor/grafana/bin/grafana-server &
or
# Manage via systemd
systemctl daemon-reload

systemctl enable  grafana.service
systemctl start   grafana.service
systemctl stop    grafana.service
systemctl restart grafana.service
systemctl status  grafana.service

Docker installation

docker run -d --name=grafana -p 3000:3000 grafana/grafana

Verification

# Open the web UI, default port 3000 (default account/password: admin/admin)
http://192.168.10.181:3000

Configuration files

Paths

  • Default configuration: $WORKING_DIR/conf/defaults.ini
  • Custom configuration: $WORKING_DIR/conf/custom.ini
  • The --config flag overrides the custom configuration file path (a sketch of custom.ini follows the list):
    • ./grafana-server --config /custom/config.ini --homepath /custom/homepath cfg:default.paths.logs=/custom/path
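A minimal custom.ini overriding the HTTP port and the data, log, and plugin paths might look like the following; the values are illustrative, and for the tarball install the file is read from conf/custom.ini under the Grafana home directory:

[server]
http_port = 3000

[paths]
data = /ups/app/monitor/grafana/data
logs = /ups/app/monitor/grafana/logs
plugins = /ups/app/monitor/grafana/data/plugins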

Adding plugins

Syntax

[root@progs bin]# ./grafana-cli --help
NAME:
   Grafana CLI - A new cli application

USAGE:
   grafana-cli [global options] command [command options] [arguments...]

VERSION:
   7.1.1

AUTHOR:
   Grafana Project <[email protected]>

COMMANDS:
   plugins  Manage plugins for grafana
   admin    Grafana admin commands
   help, h  Shows a list of commands or help for one command

GLOBAL OPTIONS:
   --pluginsDir value       Path to the Grafana plugin directory (default: "/var/lib/grafana/plugins") [$GF_PLUGIN_DIR]
   --repo value             URL to the plugin repository (default: "https://grafana.com/api/plugins") [$GF_PLUGIN_REPO]
   --pluginUrl value        Full url to the plugin zip file instead of downloading the plugin from grafana.com/api [$GF_PLUGIN_URL]
   --insecure               Skip TLS verification (insecure) (default: false)
   --debug                  Enable debug logging (default: false)
   --configOverrides value  Configuration options to override defaults as a string. e.g. cfg:default.paths.log=/dev/null
   --homepath value         Path to Grafana install/home path, defaults to working directory
   --config value           Path to config file
   --help, -h               show help (default: false)
   --version, -v            print the version (default: false)

# List the plugins available in the remote repository
grafana-cli plugins list-remote

id: abhisant-druid-datasource version: 0.0.5
id: agenty-flowcharting-panel version: 0.9.0
id: aidanmountford-html-panel version: 0.0.1
id: akumuli-datasource version: 1.3.11
id: alexanderzobnin-zabbix-app version: 3.12.4
id: alexandra-trackmap-panel version: 1.2.5
id: andig-darksky-datasource version: 1.0.1
id: aquaqanalytics-kdbadaptor-datasource version: 1.0.1
id: ayoungprogrammer-finance-datasource version: 1.0.0
id: belugacdn-app version: 1.2.0
id: bessler-pictureit-panel version: 1.0.0
id: blackmirror1-singlestat-math-panel version: 1.1.7
id: blackmirror1-statusbygroup-panel version: 1.1.1
id: bosun-app version: 0.0.28
id: briangann-datatable-panel version: 1.0.2
id: briangann-gauge-panel version: 0.0.6
id: btplc-alarm-box-panel version: 1.0.8
id: btplc-peak-report-panel version: 0.2.4
id: btplc-status-dot-panel version: 0.2.4
id: btplc-trend-box-panel version: 0.1.9
id: camptocamp-prometheus-alertmanager-datasource version: 0.0.8
id: citilogics-geoloop-panel version: 1.1.1
id: cloudflare-app version: 0.1.4
id: cloudspout-button-panel version: 7.0.3
id: cognitedata-datasource version: 2.0.0
id: corpglory-progresslist-panel version: 1.0.5
id: dalmatinerdb-datasource version: 1.0.5
id: dalvany-image-panel version: 2.1.1
id: ddurieux-glpi-app version: 1.3.0
id: devicehive-devicehive-datasource version: 2.0.1
id: devopsprodigy-kubegraf-app version: 1.4.2
id: digiapulssi-breadcrumb-panel version: 1.1.6
id: digiapulssi-organisations-panel version: 1.3.0
id: digrich-bubblechart-panel version: 1.1.0
id: doitintl-bigquery-datasource version: 1.0.8
id: farski-blendstat-panel version: 1.0.2
id: fastweb-openfalcon-datasource version: 1.0.0
id: fatcloud-windrose-panel version: 0.7.0
id: fetzerch-sunandmoon-datasource version: 0.1.6
id: flant-statusmap-panel version: 0.2.0
id: foursquare-clouderamanager-datasource version: 0.9.2
id: fzakaria-simple-annotations-datasource version: 1.0.0
id: gnocchixyz-gnocchi-datasource version: 1.7.0
id: goshposh-metaqueries-datasource version: 0.0.3
id: grafana-azure-data-explorer-datasource version: 2.1.0
id: grafana-azure-monitor-datasource version: 0.3.0
id: grafana-clock-panel version: 1.1.1
id: grafana-googlesheets-datasource version: 1.0.0
id: grafana-image-renderer version: 2.0.0
id: grafana-influxdb-08-datasource version: 1.0.2
id: grafana-influxdb-flux-datasource version: 7.0.0
id: grafana-kairosdb-datasource version: 3.0.1
id: grafana-kubernetes-app version: 1.0.1
id: grafana-piechart-panel version: 1.5.0
id: grafana-polystat-panel version: 1.2.0
id: grafana-simple-json-datasource version: 1.4.0
id: grafana-strava-datasource version: 1.1.1
id: grafana-worldmap-panel version: 0.3.2
id: gretamosa-topology-panel version: 1.0.0
id: gridprotectionalliance-openhistorian-datasource version: 1.0.2
id: gridprotectionalliance-osisoftpi-datasource version: 1.0.4
id: hawkular-datasource version: 1.1.1
id: ibm-apm-datasource version: 0.9.0
id: instana-datasource version: 2.7.3
id: jasonlashua-prtg-datasource version: 4.0.3
id: jdbranham-diagram-panel version: 1.6.2
id: jeanbaptistewatenberg-percent-panel version: 1.0.6
id: kentik-app version: 1.3.4
id: larona-epict-panel version: 1.2.2
id: linksmart-hds-datasource version: 1.0.1
id: linksmart-sensorthings-datasource version: 1.3.0
id: logzio-datasource version: 5.0.0
id: macropower-analytics-panel version: 1.0.0
id: magnesium-wordcloud-panel version: 1.0.0
id: marcuscalidus-svg-panel version: 0.3.3
id: marcusolsson-hourly-heatmap-panel version: 0.4.1
id: marcusolsson-treemap-panel version: 0.2.0
id: michaeldmoore-annunciator-panel version: 1.0.5
id: michaeldmoore-multistat-panel version: 1.4.1
id: monasca-datasource version: 1.0.0
id: monitoringartist-monitoringart-datasource version: 1.0.0
id: moogsoft-aiops-app version: 8.0.0
id: mtanda-google-calendar-datasource version: 1.0.4
id: mtanda-heatmap-epoch-panel version: 0.1.7
id: mtanda-histogram-panel version: 0.1.6
id: mxswat-separator-panel version: 1.0.0
id: natel-discrete-panel version: 0.1.0
id: natel-influx-admin-panel version: 0.0.5
id: natel-plotly-panel version: 0.0.6
id: natel-usgs-datasource version: 0.0.2
id: neocat-cal-heatmap-panel version: 0.0.3
id: novalabs-annotations-panel version: 0.0.1
id: ns1-app version: 0.0.7
id: ntop-ntopng-datasource version: 1.0.0
id: opennms-helm-app version: 5.0.1
id: ovh-warp10-datasource version: 2.2.0
id: paytm-kapacitor-datasource version: 0.1.2
id: percona-percona-app version: 1.0.0
id: petrslavotinek-carpetplot-panel version: 0.1.1
id: pierosavi-imageit-panel version: 0.1.3
id: pr0ps-trackmap-panel version: 2.1.0
id: praj-ams-datasource version: 1.2.0
id: pue-solr-datasource version: 1.0.2
id: quasardb-datasource version: 3.8.2
id: rackerlabs-blueflood-datasource version: 0.0.2
id: radensolutions-netxms-datasource version: 1.2.2
id: raintank-snap-app version: 0.0.5
id: raintank-worldping-app version: 1.2.7
id: redis-datasource version: 1.1.2
id: ryantxu-ajax-panel version: 0.0.7-dev
id: ryantxu-annolist-panel version: 0.0.1
id: satellogic-3d-globe-panel version: 0.1.0
id: savantly-heatmap-panel version: 0.2.0
id: sbueringer-consul-datasource version: 0.1.5
id: scadavis-synoptic-panel version: 1.0.4
id: sidewinder-datasource version: 0.2.0
id: simpod-json-datasource version: 0.2.0
id: skydive-datasource version: 1.2.0
id: smartmakers-trafficlight-panel version: 1.0.0
id: sni-pnp-datasource version: 1.0.5
id: sni-thruk-datasource version: 1.0.3
id: snuids-radar-panel version: 1.4.4
id: snuids-trafficlights-panel version: 1.4.5
id: spotify-heroic-datasource version: 0.0.1
id: stagemonitor-elasticsearch-app version: 0.83.2
id: udoprog-heroic-datasource version: 0.1.0
id: vertamedia-clickhouse-datasource version: 2.0.2
id: vertica-grafana-datasource version: 0.1.0
id: vonage-status-panel version: 1.0.9
id: voxter-app version: 0.0.1
id: xginn8-pagerduty-datasource version: 0.2.1
id: yesoreyeram-boomtable-panel version: 1.3.0
id: yesoreyeram-boomtheme-panel version: 0.1.0
id: zuburqan-parity-report-panel version: 1.2.1

Install plugins

Install into the specified plugin directory:

./grafana-cli --pluginsDir /ups/app/monitor/grafana/data/plugins plugins install grafana-piechart-panel 
./grafana-cli --pluginsDir /ups/app/monitor/grafana/data/plugins plugins install grafana-polystat-panel
./grafana-cli --pluginsDir /ups/app/monitor/grafana/data/plugins plugins install digiapulssi-breadcrumb-panel 

Screenshots of the installation process (omitted)

Confirm the result

./bin/grafana-cli plugins ls

Importing dashboards

Import files through the web UI

Or provision a dashboard path on the backend

# 1. Unpack
unzip -qo grafana-dashboards-2.9.0.zip
cd grafana-dashboards-2.9.0
cp -r dashboards /ups/app/monitor/grafana/grafana-dashboards

# 2. Create the mysqld_export.yml provider file
cat > /ups/app/monitor/grafana/conf/provisioning/dashboards/mysqld_export.yml <<-EOF
apiVersion: 1

providers:
  - name: 'mysqld_exporter'
    orgId: 1
    folder: ''
    type: file
    options:
      path: /ups/app/monitor/grafana/grafana-dashboards
EOF

# 3. Restart the grafana service

Configure the Prometheus data source
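The data source can be added in the Grafana web UI (Configuration -> Data Sources -> Add data source -> Prometheus), or provisioned from a file. A minimal sketch of file-based provisioning; the file name is illustrative:

cat > /ups/app/monitor/grafana/conf/provisioning/datasources/prometheus.yml <<-EOF
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://192.168.10.181:9090
    isDefault: true
EOF

Restart Grafana afterwards so the provisioned data source is picked up.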

Exporters

In Prometheus, the programs responsible for reporting metrics are collectively called exporters, and different exporters cover different workloads.

Software

Host monitoring (node_exporter)

Deployment

Binary installation
Install the software
# Create the service user
#groupadd -g 2000 prometheus
useradd -r -M -c "Prometheus agent" -d /ups/app/monitor/ -s /sbin/nologin prometheus

# Unpack
mkdir -p /ups/app/monitor/
tar -xf node_exporter-*.linux-amd64.tar.gz -C /ups/app/monitor/ --no-same-owner

# Rename the directory
cd /ups/app/monitor/
mv node_exporter-*.linux-amd64 node_exporter

# Change ownership
# chown -R prometheus.prometheus /ups/app/monitor/node_exporter
Configure the systemd service
# Create the systemd unit
cat > /usr/lib/systemd/system/node_exporter.service <<-EOF
[Unit]
Description=node exporter
Documentation=https://prometheus.io
After=network.target

[Service]
#User=prometheus
#Group=prometheus
Restart=on-failure
ExecStart=/ups/app/monitor/node_exporter/node_exporter --web.listen-address=:9100
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=node_exporter

[Install]
WantedBy=multi-user.target
EOF
  • Redirect logs to a dedicated file

    •   cat > /etc/rsyslog.d/node_exporter.conf <<-EOF
        if \$programname == 'node_exporter' then /ups/app/monitor/node_exporter/node.log
        & stop
        EOF
      
Start the service
# Manage via systemd
systemctl daemon-reload
systemctl restart node_exporter.service
systemctl status node_exporter.service

or

# Or start the agent directly
cd /ups/app/monitor/node_exporter
./node_exporter &
Docker installation
docker run -d -p 9100:9100 \
  -v "/proc:/host/proc:ro" \
  -v "/sys:/host/sys:ro" \
  -v "/:/rootfs:ro" \
  --net="host" \
  quay.io/prometheus/node-exporter \
    --path.procfs=/host/proc \
    --path.sysfs=/host/sys \
    --collector.filesystem.ignored-mount-points="^/(sys|proc|dev|host|etc)($|/)"

Adding the targets to Prometheus

Centralized exporter configuration
  • Edit the Prometheus configuration file

Have Prometheus scrape node_exporter. Open prometheus.yml and append the following job to scrape_configs (file_sd_configs keeps the target list in a separate file):

# Append to prometheus.yml
cat >> /ups/app/monitor/prometheus/config/prometheus.yml <<-'EOF'

  - job_name: 'node_exporter'
    scrape_interval: 1s
    file_sd_configs:
      - files:
        - targets/node/nodes-instances.json
        refresh_interval: 10s
    relabel_configs:
    - action: replace
      source_labels: ['__address__']
      regex: (.*):(.*)
      replacement: $1
      target_label: hostname
    - action: labeldrop
      regex: __meta_filepath
EOF
  • Configure the host list JSON file

vi /ups/app/monitor/prometheus/config/targets/node/nodes-instances.json

[
  {
    "targets": [ "192.168.10.181:9100","192.168.10.182:9100", "192.168.10.190:9100","192.168.10.191:9100","192.168.10.192:9100"]
  }
]
Per-target exporter configuration

Each monitored host gets its own target file.

  • Edit the Prometheus configuration file
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - progs:9093  # port 9093 of the running Alertmanager node

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/alert_node.yml"
  - "rules/alert_mysql.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    scrape_interval: 1s
    file_sd_configs:
      - files:
        - targets/node/*.yml
        refresh_interval: 10s
    relabel_configs:
    - action: replace
      source_labels: ['__address__']
      regex: (.*):(.*)
      replacement: $1
      target_label: hostname
    - action: labeldrop
      regex: __meta_filepath
  • Configure the per-host instance files
vi /ups/app/monitor/prometheus/config/targets/node/nodes1-instances.yml
[
  {
    "targets": ["192.168.10.181:9100"],
    "labels": { }
  }
]

vi /ups/app/monitor/prometheus/config/targets/node/nodes2-instances.yml
[
  {
    "targets": ["192.168.10.182:9100"],
    "labels": { }
  }
]
Restart Prometheus to load the configuration
# Check the configuration file
./bin/promtool check config config/prometheus.yml
# Restart the service
systemctl restart prometheus

Access

Open http://IP:9100/metrics in a browser.
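Or check from the command line (the address below is the node used in this article):

curl -s http://192.168.10.181:9100/metrics | head -n 20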

Collectors

Collectors enabled by default
Name | Description | OS
arp | ARP statistics from /proc/net/arp | Linux
conntrack | conntrack statistics from /proc/sys/net/netfilter/ | Linux
cpu | CPU statistics | Darwin, Dragonfly, FreeBSD, Linux
diskstats | Disk I/O statistics from /proc/diskstats | Linux
edac | Error detection and correction statistics | Linux
entropy | Available kernel entropy | Linux
exec | Execution statistics | Dragonfly, FreeBSD
filefd | File descriptor statistics from /proc/sys/fs/file-nr | Linux
filesystem | Filesystem statistics such as used disk space | Darwin, Dragonfly, FreeBSD, Linux, OpenBSD
hwmon | Hardware monitor and sensor data from /sys/class/hwmon/ | Linux
infiniband | Network statistics from the InfiniBand configuration | Linux
loadavg | System load | Darwin, Dragonfly, FreeBSD, Linux, NetBSD, OpenBSD, Solaris
mdadm | Device statistics from /proc/mdstat | Linux
meminfo | Memory statistics | Darwin, Dragonfly, FreeBSD, Linux
netdev | Network interface traffic in bytes | Darwin, Dragonfly, FreeBSD, Linux, OpenBSD
netstat | Network statistics from /proc/net/netstat, equivalent to netstat -s | Linux
sockstat | Socket statistics from /proc/net/sockstat | Linux
stat | Various statistics from /proc/stat, including boot time, forks, and interrupts | Linux
textfile | Metrics read from local text files in the directory given by --collector.textfile.directory (see the example after this table) | any
time | Current system time | any
uname | System information via the uname system call | any
vmstat | Statistics from /proc/vmstat | Linux
wifi | WiFi device statistics | Linux
xfs | XFS runtime statistics | Linux (kernel 4.4+)
zfs | ZFS performance statistics | Linux
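For example, the textfile collector can expose custom metrics written by scripts or cron jobs. A minimal sketch; the directory, file, and metric names are illustrative:

mkdir -p /ups/app/monitor/node_exporter/textfile
echo 'node_backup_last_success_timestamp_seconds 1596184800' \
  > /ups/app/monitor/node_exporter/textfile/backup.prom
./node_exporter --collector.textfile.directory=/ups/app/monitor/node_exporter/textfile &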
Collectors disabled by default
Name | Description | OS
bonding | Number of configured and active bonded network interfaces | Linux
buddyinfo | Memory fragmentation statistics from /proc/buddyinfo | Linux
devstat | Device statistics | Dragonfly, FreeBSD
drbd | Distributed Replicated Block Device (DRBD) statistics | Linux
interrupts | More detailed interrupt statistics | Linux, OpenBSD
ipvs | IPVS status from /proc/net/ip_vs and statistics from /proc/net/ip_vs_stats | Linux
ksmd | Kernel and system statistics from /sys/kernel/mm/ksm | Linux
logind | Session statistics from logind | Linux
meminfo_numa | Memory statistics from /proc/meminfo_numa | Linux
mountstats | Filesystem statistics from /proc/self/mountstats, including NFS client statistics | Linux
nfs | NFS statistics from /proc/net/rpc/nfs, equivalent to nfsstat -c | Linux
qdisc | Queueing discipline statistics | Linux
runit | runit status | any
supervisord | supervisord status | any
systemd | Unit and system state from systemd | Linux
tcpstat | TCP connection state from /proc/net/tcp and /proc/net/tcp6 | Linux

Monitoring MySQL

Install mysqld_exporter on the MySQL database server.

Install the exporter

# Create the service user
# groupadd -g 2000 prometheus
useradd -u 2000 -M -c "Prometheus agent" -s /sbin/nologin prometheus

# Unpack
mkdir -p /ups/app/monitor/
tar -xf mysqld_exporter-0.12.1.linux-amd64.tar.gz -C /ups/app/monitor/

# Rename the directory
cd /ups/app/monitor/
mv mysqld_exporter-0.12.1.linux-amd64 mysqld_exporter

# Change ownership
chown -R prometheus.prometheus /ups/app/monitor/mysqld_exporter
Create the MySQL monitoring user

Create the user on the MySQL instance to be monitored:

CREATE USER 'monitor'@'localhost' IDENTIFIED BY 'monitor';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'monitor'@'localhost';
CREATE USER 'monitor'@'192.168.10.%' IDENTIFIED BY 'monitor';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'monitor'@'192.168.10.%';
flush privileges;
Configure the client credentials file
cat > /ups/app/monitor/mysqld_exporter/.my.cnf <<EOF
[client]
user=monitor
password=monitor
port=3308
socket=/ups/app/mysql/mysql3308/logs/mysql3308.sock
host=progs
EOF

chmod 400 /ups/app/monitor/mysqld_exporter/.my.cnf
chown prometheus:prometheus /ups/app/monitor/mysqld_exporter/.my.cnf
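Optionally verify that the credentials file works before starting the exporter (assuming the mysql client is installed on this host):

mysql --defaults-file=/ups/app/monitor/mysqld_exporter/.my.cnf -e "SELECT VERSION();"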
Configure the systemd service
# Create the systemd unit
cat > /usr/lib/systemd/system/mysql_exporter.service <<-EOF
[Unit]
Description=mysqld exporter
Documentation=https://prometheus.io
After=network.target
After=postgresql-12.service mysql3308.service mysql.service

[Service]
Restart=on-failure
# ExecStart=/ups/app/monitor/mysqld_exporter/mysqld_exporter --config.my-cnf=/ups/app/monitor/mysqld_exporter/.my.cnf

ExecStart=/ups/app/monitor/mysqld_exporter/mysqld_exporter \
            --config.my-cnf=/ups/app/monitor/mysqld_exporter/.my.cnf \
            --collect.info_schema.innodb_tablespaces \
            --collect.info_schema.innodb_metrics  \
            --collect.perf_schema.tableiowaits \
            --collect.perf_schema.indexiowaits \
            --collect.perf_schema.tablelocks \
            --collect.engine_innodb_status \
            --collect.perf_schema.file_events \
            --collect.binlog_size \
            --collect.info_schema.clientstats \
            --collect.perf_schema.eventswaits

StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=mysqld_exporter

[Install]
WantedBy=multi-user.target
EOF
  • Redirect logs to a dedicated file

    •   cat > /etc/rsyslog.d/mysqld_exporter.conf <<-EOF
        if \$programname == 'mysqld_exporter' then /ups/app/monitor/mysqld_exporter/node.log
        & stop
        EOF
      
Start the service
# Manage via systemd
systemctl daemon-reload
systemctl restart mysql_exporter.service
systemctl status mysql_exporter.service

or

# Or start the exporter directly
./mysqld_exporter --config.my-cnf=/ups/app/monitor/mysqld_exporter/.my.cnf

# Default port: 9104
lsof -i :9104
netstat -tnlp|grep ':9104'
Verification

http://192.168.10.181:9104/metrics

Add the target to Prometheus (on the Prometheus server)

# Append to prometheus.yml
cat >> /ups/app/monitor/prometheus/config/prometheus.yml <<-EOF

  - job_name: 'MySQL'
    static_configs:
    - targets: ['progs:9104','192.168.10.181:9104']

EOF

Restart Prometheus
# Check the configuration file
./bin/promtool check config config/prometheus.yml
# Restart the service
systemctl restart prometheus
Verification
http://192.168.10.181:9090/targets

Monitoring PostgreSQL

Deployment

Download
wget -c https://github.com/wrouesnel/postgres_exporter/releases/download/v0.8.0/postgres_exporter_v0.8.0_linux-amd64.tar.gz
Installation
Binary package installation
  • Unpack
tar -xf postgres_exporter_v0.8.0_linux-amd64.tar.gz -C /ups/app/monitor
cd /ups/app/monitor
mv postgres_exporter* postgres_exporter
  • Configure the systemd service
# Create the systemd unit
cat > /usr/lib/systemd/system/postgres_exporter.service <<-EOF
[Unit]
Description=PostgreSQL Exporter
Documentation=https://github.com/wrouesnel/postgres_exporter
After=network.target

[Service]
User=postgres
Group=postgres
Restart=on-failure
# The DSN can also be a URI, e.g. postgresql://postgres:postgres@localhost:5432/postgres?sslmode=disable
Environment="DATA_SOURCE_NAME=user=postgres passfile=/home/postgres/.pgpass host=192.168.10.181 port=5432 sslmode=prefer"
ExecStart=/ups/app/monitor/postgres_exporter/postgres_exporter --web.listen-address=:9187 --extend.query-path=/ups/app/monitor/postgres_exporter/queries.yaml
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=postgres_exporter

[Install]
WantedBy=multi-user.target
EOF
  • Configure the custom query file

vi /ups/app/monitor/postgres_exporter/queries.yaml

pg_replication:
  query: "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) as lag"
  master: true
  metrics:
    - lag:
        usage: "GAUGE"
        description: "Replication lag behind master in seconds"

pg_postmaster:
  query: "SELECT pg_postmaster_start_time as start_time_seconds from pg_postmaster_start_time()"
  master: true
  metrics:
    - start_time_seconds:
        usage: "GAUGE"
        description: "Time at which postmaster started"

pg_stat_user_tables:
  query: "SELECT current_database() datname, schemaname, relname, seq_scan, seq_tup_read, idx_scan, idx_tup_fetch, n_tup_ins, n_tup_upd, n_tup_del, n_tup_hot_upd, n_live_tup, n_dead_tup, n_mod_since_analyze, COALESCE(last_vacuum, '1970-01-01Z'), COALESCE(last_vacuum, '1970-01-01Z') as last_vacuum, COALESCE(last_autovacuum, '1970-01-01Z') as last_autovacuum, COALESCE(last_analyze, '1970-01-01Z') as last_analyze, COALESCE(last_autoanalyze, '1970-01-01Z') as last_autoanalyze, vacuum_count, autovacuum_count, analyze_count, autoanalyze_count FROM pg_stat_user_tables"
  metrics:
    - datname:
        usage: "LABEL"
        description: "Name of current database"
    - schemaname:
        usage: "LABEL"
        description: "Name of the schema that this table is in"
    - relname:
        usage: "LABEL"
        description: "Name of this table"
    - seq_scan:
        usage: "COUNTER"
        description: "Number of sequential scans initiated on this table"
    - seq_tup_read:
        usage: "COUNTER"
        description: "Number of live rows fetched by sequential scans"
    - idx_scan:
        usage: "COUNTER"
        description: "Number of index scans initiated on this table"
    - idx_tup_fetch:
        usage: "COUNTER"
        description: "Number of live rows fetched by index scans"
    - n_tup_ins:
        usage: "COUNTER"
        description: "Number of rows inserted"
    - n_tup_upd:
        usage: "COUNTER"
        description: "Number of rows updated"
    - n_tup_del:
        usage: "COUNTER"
        description: "Number of rows deleted"
    - n_tup_hot_upd:
        usage: "COUNTER"
        description: "Number of rows HOT updated (i.e., with no separate index update required)"
    - n_live_tup:
        usage: "GAUGE"
        description: "Estimated number of live rows"
    - n_dead_tup:
        usage: "GAUGE"
        description: "Estimated number of dead rows"
    - n_mod_since_analyze:
        usage: "GAUGE"
        description: "Estimated number of rows changed since last analyze"
    - last_vacuum:
        usage: "GAUGE"
        description: "Last time at which this table was manually vacuumed (not counting VACUUM FULL)"
    - last_autovacuum:
        usage: "GAUGE"
        description: "Last time at which this table was vacuumed by the autovacuum daemon"
    - last_analyze:
        usage: "GAUGE"
        description: "Last time at which this table was manually analyzed"
    - last_autoanalyze:
        usage: "GAUGE"
        description: "Last time at which this table was analyzed by the autovacuum daemon"
    - vacuum_count:
        usage: "COUNTER"
        description: "Number of times this table has been manually vacuumed (not counting VACUUM FULL)"
    - autovacuum_count:
        usage: "COUNTER"
        description: "Number of times this table has been vacuumed by the autovacuum daemon"
    - analyze_count:
        usage: "COUNTER"
        description: "Number of times this table has been manually analyzed"
    - autoanalyze_count:
        usage: "COUNTER"
        description: "Number of times this table has been analyzed by the autovacuum daemon"

pg_statio_user_tables:
  query: "SELECT current_database() datname, schemaname, relname, heap_blks_read, heap_blks_hit, idx_blks_read, idx_blks_hit, toast_blks_read, toast_blks_hit, tidx_blks_read, tidx_blks_hit FROM pg_statio_user_tables"
  metrics:
    - datname:
        usage: "LABEL"
        description: "Name of current database"
    - schemaname:
        usage: "LABEL"
        description: "Name of the schema that this table is in"
    - relname:
        usage: "LABEL"
        description: "Name of this table"
    - heap_blks_read:
        usage: "COUNTER"
        description: "Number of disk blocks read from this table"
    - heap_blks_hit:
        usage: "COUNTER"
        description: "Number of buffer hits in this table"
    - idx_blks_read:
        usage: "COUNTER"
        description: "Number of disk blocks read from all indexes on this table"
    - idx_blks_hit:
        usage: "COUNTER"
        description: "Number of buffer hits in all indexes on this table"
    - toast_blks_read:
        usage: "COUNTER"
        description: "Number of disk blocks read from this table's TOAST table (if any)"
    - toast_blks_hit:
        usage: "COUNTER"
        description: "Number of buffer hits in this table's TOAST table (if any)"
    - tidx_blks_read:
        usage: "COUNTER"
        description: "Number of disk blocks read from this table's TOAST table indexes (if any)"
    - tidx_blks_hit:
        usage: "COUNTER"
        description: "Number of buffer hits in this table's TOAST table indexes (if any)"
        
pg_database:
  query: "SELECT pg_database.datname, pg_database_size(pg_database.datname) as size FROM pg_database"
  master: true
  cache_seconds: 30
  metrics:
    - datname:
        usage: "LABEL"
        description: "Name of the database"
    - size_bytes:
        usage: "GAUGE"
        description: "Disk space used by the database"

pg_stat_statements:
  query: "SELECT t2.rolname, t3.datname, queryid, calls, total_time / 1000 as total_time_seconds, min_time / 1000 as min_time_seconds, max_time / 1000 as max_time_seconds, mean_time / 1000 as mean_time_seconds, stddev_time / 1000 as stddev_time_seconds, rows, shared_blks_hit, shared_blks_read, shared_blks_dirtied, shared_blks_written, local_blks_hit, local_blks_read, local_blks_dirtied, local_blks_written, temp_blks_read, temp_blks_written, blk_read_time / 1000 as blk_read_time_seconds, blk_write_time / 1000 as blk_write_time_seconds FROM pg_stat_statements t1 join pg_roles t2 on (t1.userid=t2.oid) join pg_database t3 on (t1.dbid=t3.oid)"
  master: true
  metrics:
    - rolname:
        usage: "LABEL"
        description: "Name of user"
    - datname:
        usage: "LABEL"
        description: "Name of database"
    - queryid:
        usage: "LABEL"
        description: "Query ID"
    - calls:
        usage: "COUNTER"
        description: "Number of times executed"
    - total_time_seconds:
        usage: "COUNTER"
        description: "Total time spent in the statement, in milliseconds"
    - min_time_seconds:
        usage: "GAUGE"
        description: "Minimum time spent in the statement, in milliseconds"
    - max_time_seconds:
        usage: "GAUGE"
        description: "Maximum time spent in the statement, in milliseconds"
    - mean_time_seconds:
        usage: "GAUGE"
        description: "Mean time spent in the statement, in milliseconds"
    - stddev_time_seconds:
        usage: "GAUGE"
        description: "Population standard deviation of time spent in the statement, in milliseconds"
    - rows:
        usage: "COUNTER"
        description: "Total number of rows retrieved or affected by the statement"
    - shared_blks_hit:
        usage: "COUNTER"
        description: "Total number of shared block cache hits by the statement"
    - shared_blks_read:
        usage: "COUNTER"
        description: "Total number of shared blocks read by the statement"
    - shared_blks_dirtied:
        usage: "COUNTER"
        description: "Total number of shared blocks dirtied by the statement"
    - shared_blks_written:
        usage: "COUNTER"
        description: "Total number of shared blocks written by the statement"
    - local_blks_hit:
        usage: "COUNTER"
        description: "Total number of local block cache hits by the statement"
    - local_blks_read:
        usage: "COUNTER"
        description: "Total number of local blocks read by the statement"
    - local_blks_dirtied:
        usage: "COUNTER"
        description: "Total number of local blocks dirtied by the statement"
    - local_blks_written:
        usage: "COUNTER"
        description: "Total number of local blocks written by the statement"
    - temp_blks_read:
        usage: "COUNTER"
        description: "Total number of temp blocks read by the statement"
    - temp_blks_written:
        usage: "COUNTER"
        description: "Total number of temp blocks written by the statement"
    - blk_read_time_seconds:
        usage: "COUNTER"
        description: "Total time the statement spent reading blocks, in milliseconds (if track_io_timing is enabled, otherwise zero)"
    - blk_write_time_seconds:
        usage: "COUNTER"
        description: "Total time the statement spent writing blocks, in milliseconds (if track_io_timing is enabled, otherwise zero)"
  • Redirect logs to a dedicated file
cat > /etc/rsyslog.d/postgres_exporter.conf <<-EOF
if \$programname == 'postgres_exporter' then /ups/app/monitor/postgres_exporter/exporter.log
& stop
EOF
  • Start the service
# Manage via systemd
systemctl daemon-reload
systemctl restart postgres_exporter.service
systemctl status postgres_exporter.service


# Or start from the command line -- DSN format: postgresql://postgres:password@localhost:5432/postgres
export DATA_SOURCE_NAME="postgresql://postgres:postgres@localhost:5432/postgres?sslmode=disable"
export PG_EXPORTER_EXTEND_QUERY_PATH="/ups/app/monitor/postgres_exporter/queries.yaml"
./postgres_exporter &
Docker installation
docker run --net=host -e DATA_SOURCE_NAME="postgresql://postgres:password@localhost:5432/postgres?sslmode=disable" wrouesnel/postgres_exporter

Adding the target to Prometheus

Add the following job to the Prometheus configuration:

  - job_name: 'postgres_exporter'
    scrape_interval: 1s
    file_sd_configs:
      - files:
        - targets/postgresql/*.yml
    relabel_configs:
      - action: replace
        source_labels: ['__address__']
        regex: (.*):(.*):(.*)
        replacement: $2
        target_label: hostip
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.10.181:9187
Alert rules file

vi rules/alert_pg.yml

---
groups:
  - name: PostgreSQL
    rules:
    - alert: PostgreSQLMaxConnectionsReached
      expr: sum(pg_stat_activity_count) by (instance) > sum(pg_settings_max_connections) by (instance)
      for: 1m
      labels:
        severity: email
      annotations:
        summary: "{{ $labels.instance }} has maxed out Postgres connections."
        description: "{{ $labels.instance }} is exceeding the currently configured maximum Postgres connection limit (current value: {{ $value }}s). Services may be degraded - please take immediate action (you probably need to increase max_connections in the Docker image and re-deploy."

    - alert: PostgreSQLHighConnections
      expr: sum(pg_stat_activity_count) by (instance) > sum(pg_settings_max_connections * 0.8) by (instance)
      for: 10m
      labels:
        severity: email
      annotations:
        summary: "{{ $labels.instance }} is over 80% of max Postgres connections."
        description: "{{ $labels.instance }} is exceeding 80% of the currently configured maximum Postgres connection limit (current value: {{ $value }}s). Please check utilization graphs and confirm if this is normal service growth, abuse or an otherwise temporary condition or if new resources need to be provisioned (or the limits increased, which is mostly likely)."

    - alert: PostgreSQLDown
      expr: pg_up != 1
      for: 1m
      labels:
        severity: email
      annotations:
        summary: "PostgreSQL is not processing queries: {{ $labels.instance }}"
        description: "{{ $labels.instance }} is rejecting query requests from the exporter, and thus probably not allowing DNS requests to work either. User services should not be effected provided at least 1 node is still alive."

    - alert: PostgreSQLSlowQueries
      expr: avg(rate(pg_stat_activity_max_tx_duration{datname!~"template.*"}[2m])) by (datname) > 2 * 60
      for: 2m
      labels:
        severity: email
      annotations:
        summary: "PostgreSQL high number of slow on {{ $labels.cluster }} for database {{ $labels.datname }} "
        description: "PostgreSQL high number of slow queries {{ $labels.cluster }} for database {{ $labels.datname }} with a value of {{ $value }} "

    - alert: PostgreSQLQPS
      expr: avg(irate(pg_stat_database_xact_commit{datname!~"template.*"}[5m]) + irate(pg_stat_database_xact_rollback{datname!~"template.*"}[5m])) by (datname) > 10000
      for: 5m
      labels:
        severity: email
      annotations:
        summary: "PostgreSQL high number of queries per second {{ $labels.cluster }} for database {{ $labels.datname }}"
        description: "PostgreSQL high number of queries per second on {{ $labels.cluster }} for database {{ $labels.datname }} with a value of {{ $value }}"

    - alert: PostgreSQLCacheHitRatio
      expr: avg(rate(pg_stat_database_blks_hit{datname!~"template.*"}[5m]) / (rate(pg_stat_database_blks_hit{datname!~"template.*"}[5m]) + rate(pg_stat_database_blks_read{datname!~"template.*"}[5m]))) by (datname) < 0.98
      for: 5m
      labels:
        severity: email
      annotations:
        summary: "PostgreSQL low cache hit rate on {{ $labels.cluster }} for database {{ $labels.datname }}"
        description: "PostgreSQL low on cache hit rate on {{ $labels.cluster }} for database {{ $labels.datname }} with a value of {{ $value }}"
Privileges required to collect metrics as a non-superuser
DATA_SOURCE_NAME=postgresql://postgres_exporter:password@localhost:5432/postgres?sslmode=disable
-- To use IF statements, hence to be able to check if the user exists before
-- attempting creation, we need to switch to procedural SQL (PL/pgSQL)
-- instead of standard SQL.
-- More: https://www.postgresql.org/docs/9.3/plpgsql-overview.html
-- To preserve compatibility with <9.0, DO blocks are not used; instead,
-- a function is created and dropped.
CREATE OR REPLACE FUNCTION __tmp_create_user() returns void as $$
BEGIN
  IF NOT EXISTS (
          SELECT                       -- SELECT list can stay empty for this
          FROM   pg_catalog.pg_user
          WHERE  usename = 'postgres_exporter') THEN
    CREATE USER postgres_exporter;
  END IF;
END;
$$ language plpgsql;

SELECT __tmp_create_user();
DROP FUNCTION __tmp_create_user();

ALTER USER postgres_exporter WITH PASSWORD 'password';
ALTER USER postgres_exporter SET SEARCH_PATH TO postgres_exporter,pg_catalog;

-- If deploying as non-superuser (for example in AWS RDS), uncomment the GRANT
-- line below and replace <MASTER_USER> with your root user.
-- GRANT postgres_exporter TO <MASTER_USER>;
CREATE SCHEMA IF NOT EXISTS postgres_exporter;
GRANT USAGE ON SCHEMA postgres_exporter TO postgres_exporter;
GRANT CONNECT ON DATABASE postgres TO postgres_exporter;

CREATE OR REPLACE FUNCTION get_pg_stat_activity() RETURNS SETOF pg_stat_activity AS
$$ SELECT * FROM pg_catalog.pg_stat_activity; $$
LANGUAGE sql
VOLATILE
SECURITY DEFINER;

CREATE OR REPLACE VIEW postgres_exporter.pg_stat_activity
AS
  SELECT * from get_pg_stat_activity();

GRANT SELECT ON postgres_exporter.pg_stat_activity TO postgres_exporter;

CREATE OR REPLACE FUNCTION get_pg_stat_replication() RETURNS SETOF pg_stat_replication AS
$$ SELECT * FROM pg_catalog.pg_stat_replication; $$
LANGUAGE sql
VOLATILE
SECURITY DEFINER;

CREATE OR REPLACE VIEW postgres_exporter.pg_stat_replication
AS
  SELECT * FROM get_pg_stat_replication();

GRANT SELECT ON postgres_exporter.pg_stat_replication TO postgres_exporter;
Reload the configuration
curl -X POST http://localhost:9090/-/reload

Monitoring Redis

Deployment

Download
wget -c https://github.com/oliver006/redis_exporter/releases/download/v1.9.0/redis_exporter-v1.9.0.linux-amd64.tar.gz
Installation
Binary package installation
  • Unpack
tar -xf redis_exporter-v1.9.0.linux-amd64.tar.gz -C /ups/app/monitor/
cd /ups/app/monitor/
mv redis_exporter-* redis_exporter

  • Configure the systemd service
# Create the systemd unit
cat > /usr/lib/systemd/system/redis_exporter.service <<-EOF
[Unit]
Description=Redis Exporter
Documentation=https://github.com/oliver006/redis_exporter
After=network.target

[Service]
#User=prometheus
#Group=prometheus
Restart=on-failure
ExecStart=/ups/app/monitor/redis_exporter/redis_exporter -redis-only-metrics --web.listen-address=:9121
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=redis_exporter

[Install]
WantedBy=multi-user.target
EOF
  • Redirect logs to a dedicated file
cat > /etc/rsyslog.d/redis_exporter.conf <<-EOF
if \$programname == 'redis_exporter' then /ups/app/monitor/redis_exporter/exporter.log
& stop
EOF
  • Start the service
# Manage via systemd
systemctl daemon-reload
systemctl restart redis_exporter.service
systemctl status redis_exporter.service


# Or start from the command line
cd /ups/app/monitor/redis_exporter
./redis_exporter &
Docker installation
docker run -d --name redis_exporter -p 9121:9121 oliver006/redis_exporter

Adding the targets to Prometheus

Configure prometheus.yml

Add the Redis scrape jobs.

  • Centralized configuration
scrape_configs:
  - job_name: 'redis_exporter'
    file_sd_configs:
      - files:
        - targets/redis/redis-instances.json
    metrics_path: /scrape
    relabel_configs:
      - action: replace
        source_labels: ['__address__']
        regex: (.*):(.*):(.*)
        replacement: $2
        target_label: hostip
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.10.181:9121

  ## config for scraping the exporter itself
  - job_name: 'redis_exporter_single'
    static_configs:
      - targets:
        - 192.168.10.181:9121

Configure the Redis server JSON file

vi targets/redis/redis-instances.json

[
  {
    "targets": [ "redis://192.168.10.181:6379", "redis://192.168.10.151:6379"],
    "labels": { }
  }
]

URI format with a password: redis://:<<PASSWORD>>@<<HOSTNAME>>:<<PORT>>

  • Per-target configuration
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - progs:9093  # port 9093 of the running Alertmanager node

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/alert_node.yml"
  - "rules/alert_mysql.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    scrape_interval: 1s
    file_sd_configs:
      - files:
        - targets/node/*.yml
        refresh_interval: 10s
    relabel_configs:
    - action: replace
      source_labels: ['__address__']
      regex: (.*):(.*)
      replacement: $1
      target_label: hostname
    - action: labeldrop
      regex: __meta_filepath

  - job_name: 'redis_exporter'
    scrape_interval: 1s
    file_sd_configs:
      - files:
        - targets/redis/*.yml
    metrics_path: /scrape
    relabel_configs:
      - action: replace
        source_labels: ['__address__']
        regex: (.*):(.*):(.*)
        replacement: $2
        target_label: hostip
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.10.181:9121

Configure the Redis server target files

vi targets/redis/redis1_exporter.yml
[
  {
    "targets": [ "redis://192.168.10.181:6379"],
    "labels": { }
  }
]

vi targets/redis/redis2_exporter.yml
[
  {
    "targets": [ "redis://192.168.10.151:6379"],
    "labels": { }
  }
]
Restart Prometheus to load the configuration
# Check the configuration file
./bin/promtool check config config/prometheus.yml
# Restart the service
systemctl restart prometheus

Alerting components

Alerting in Prometheus consists of two parts:

  • The Prometheus server evaluates the configured alerting rules and sends the resulting alerts to Alertmanager.
  • Alertmanager processes the received alerts: deduplication, noise reduction, grouping, and policy-based routing of notifications.

The main steps to use the alerting service are:

  • Download and configure Alertmanager.
  • Point Prometheus at Alertmanager via the alerting section of prometheus.yml (older 1.x releases used the -alertmanager.url flag).
  • Define alerting rules on the Prometheus server.

Install Alertmanager

Binary installation

mkdir -p /ups/app/monitor/
# Unpack
tar -xf alertmanager-0.21.0.linux-amd64.tar.gz -C /ups/app/monitor/ --no-same-owner
cd /ups/app/monitor/
mv alertmanager-0.21.0.linux-amd64/ alertmanager

# Create the service user
# groupadd -g 2000 prometheus
useradd -r -M -s /sbin/nologin -d /ups/app/monitor/alertmanager -c "Prometheus agent" prometheus

# Create directories
cd /ups/app/monitor/
mkdir -p alertmanager/{bin,logs,config,data}
cd alertmanager
mv alertmanager.yml config/
mv alertmanager amtool bin/

# Change ownership
chown -R prometheus.prometheus /ups/app/monitor/alertmanager
Configure the systemd service
# Create the systemd unit
cat > /usr/lib/systemd/system/alertmanager.service <<-EOF
[Unit]
Description=alertmanager
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/ups/app/monitor/alertmanager/bin/alertmanager \
        --config.file=/ups/app/monitor/alertmanager/config/alertmanager.yml \
        --web.listen-address=192.168.10.181:9093 \
        --cluster.listen-address=0.0.0.0:8001 \
        --storage.path=/ups/app/monitor/alertmanager/data \
        --log.level=info
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
EOF
Basic configuration

cat /ups/app/monitor/alertmanager/config/alertmanager.yml

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
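Before starting or restarting, the configuration can be validated with amtool, which ships in the same archive:

./bin/amtool check-config /ups/app/monitor/alertmanager/config/alertmanager.yml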
Start the service
# Manage via systemd
systemctl daemon-reload
systemctl enable alertmanager.service 
systemctl start alertmanager.service 
systemctl status alertmanager

Example

Receiving alerts through WeChat Work (企業微信)

Preparation
  • Register a WeChat Work (enterprise WeChat) account.
  • Create a third-party application: click the Create Application button and fill in the application details.
Detailed configuration
Prometheus configuration

vi /ups/app/monitor/prometheus/config/prometheus.yml

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
    - targets: ['localhost:9100']

rules.yml configuration

cat > /ups/app/monitor/prometheus/config/rules.yml <<-'EOF'
groups:
- name: node
  rules:
  - alert: server_status
    expr: up{job="node"} == 0
    for: 15s
    annotations:
      summary: "Instance {{ $labels.instance }} is down"
EOF
Alertmanager configuration
cat > /ups/app/monitor/alertmanager/config/alertmanager.yml <<-EOF
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'wechat'
receivers:
- name: 'wechat'
  wechat_configs:
  - corp_id: 'ww9e5158867cf67d24'
    to_party: '1'
    agent_id: '1000002'
    api_secret: 'eRDqnTEOtlk2DtPiaxOA2w5fFyNhpIPkdQU-6Ty94cI'
EOF

Parameter descriptions (a quick end-to-end test follows the list):

  • corp_id: the unique ID of the WeChat Work account, found under "My Company".
  • to_party: the department (group) that should receive the alerts.
  • agent_id: the ID of the third-party enterprise application, shown on the application's detail page.
  • api_secret: the secret of the third-party enterprise application, shown on the application's detail page.
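To verify the notification pipeline end to end, a test alert can be pushed directly to Alertmanager; a sketch using the (deprecated but still available) v1 API, with illustrative label values:

curl -XPOST http://localhost:9093/api/v1/alerts -H "Content-Type: application/json" \
  -d '[{"labels":{"alertname":"ManualTest","severity":"email"},"annotations":{"summary":"manual test alert"}}]'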

Appendix

References