Prometheus Monitoring in Practice
Deploying Prometheus
Use Prometheus + Grafana to monitor targets such as servers and databases (MySQL, MongoDB, and so on).
Prerequisites
Software downloads
# Prometheus Server: https://prometheus.io/download/
wget -c https://github.com/prometheus/prometheus/releases/download/v2.20.0/prometheus-2.20.0.linux-amd64.tar.gz &
# Alertmanager (alert routing and notification)
wget -c https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz &
# Exporters
wget -c https://github.com/prometheus/consul_exporter/releases/download/v0.7.1/consul_exporter-0.7.1.linux-amd64.tar.gz &
wget -c https://github.com/prometheus/mysqld_exporter/releases/download/v0.12.1/mysqld_exporter-0.12.1.linux-amd64.tar.gz &
wget -c https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-amd64.tar.gz &
Installing Prometheus
Two options: a traditional binary tarball, or Docker.
Binary installation
mkdir -p /ups/app/monitor/
# Unpack
tar -xf prometheus-*.linux-amd64.tar.gz -C /ups/app/monitor/
# Keep a versioned directory and point a symlink at it
cd /ups/app/monitor/
mv prometheus-2.20.0.linux-amd64 prometheus-2.20.0
ln -s prometheus-2.20.0 prometheus
# Create the directory layout
mkdir -p prometheus/{bin,logs,config/rules,data}
cd prometheus/config && mkdir -p targets/{node,redis,postgresql,mysql}
# Create the service user
# groupadd -g 2000 prometheus
useradd -r -M -c "Prometheus Server" -d /ups/app/monitor/ -s /sbin/nologin prometheus
# Fix ownership
chown -R prometheus.prometheus /ups/app/monitor/prometheus-2.20.0
# Restructure the directory
cd /ups/app/monitor/prometheus
mv prometheus promtool tsdb bin/
mv prometheus.yml config/
Server startup flags
[root@progs prometheus]# ./bin/prometheus --help
usage: prometheus [<flags>]

The Prometheus monitoring server

Flags:
  -h, --help                 Show context-sensitive help (also try --help-long and --help-man).
      --version              Show application version.
      --config.file="prometheus.yml"
                             Prometheus configuration file path.
      --web.listen-address="0.0.0.0:9090"
                             Address to listen on for UI, API, and telemetry.
      --web.read-timeout=5m  Maximum duration before timing out read of the request, and closing idle connections.
      --web.max-connections=512
                             Maximum number of simultaneous connections.
      --web.external-url=<URL>
                             The URL under which Prometheus is externally reachable (for example, if Prometheus is served via a reverse proxy). Used for generating relative and absolute links back to Prometheus itself. If the URL has a path portion, it will be used to prefix all HTTP endpoints served by Prometheus. If omitted, relevant URL components will be derived automatically.
      --web.route-prefix=<path>
                             Prefix for the internal routes of web endpoints. Defaults to path of --web.external-url.
      --web.user-assets=<path>
                             Path to static asset directory, available at /user.
      --web.enable-lifecycle
                             Enable shutdown and reload via HTTP request.
      --web.enable-admin-api
                             Enable API endpoints for admin control actions.
      --web.console.templates="consoles"
                             Path to the console template directory, available at /consoles.
      --web.console.libraries="console_libraries"
                             Path to the console library directory.
      --web.page-title="Prometheus Time Series Collection and Processing Server"
                             Document title of Prometheus instance.
      --web.cors.origin=".*"
                             Regex for CORS origin. It is fully anchored. Example: 'https?://(domain1|domain2)\.com'
      --storage.tsdb.path="data/"
                             Base path for metrics storage.
      --storage.tsdb.retention=STORAGE.TSDB.RETENTION
                             [DEPRECATED] How long to retain samples in storage. This flag has been deprecated, use "storage.tsdb.retention.time" instead.
      --storage.tsdb.retention.time=STORAGE.TSDB.RETENTION.TIME
                             How long to retain samples in storage. When this flag is set it overrides "storage.tsdb.retention". If neither this flag nor "storage.tsdb.retention" nor "storage.tsdb.retention.size" is set, the retention time defaults to 15d. Units supported: y, w, d, h, m, s, ms.
      --storage.tsdb.retention.size=STORAGE.TSDB.RETENTION.SIZE
                             [EXPERIMENTAL] Maximum number of bytes that can be stored for blocks. Units supported: KB, MB, GB, TB, PB. This flag is experimental and can be changed in future releases.
      --storage.tsdb.no-lockfile
                             Do not create lockfile in data directory.
      --storage.tsdb.allow-overlapping-blocks
                             [EXPERIMENTAL] Allow overlapping blocks, which in turn enables vertical compaction and vertical query merge.
      --storage.tsdb.wal-compression
                             Compress the tsdb WAL.
      --storage.remote.flush-deadline=<duration>
                             How long to wait flushing sample on shutdown or config reload.
      --storage.remote.read-sample-limit=5e7
                             Maximum overall number of samples to return via the remote read interface, in a single query. 0 means no limit. This limit is ignored for streamed response types.
      --storage.remote.read-concurrent-limit=10
                             Maximum number of concurrent remote read calls. 0 means no limit.
      --storage.remote.read-max-bytes-in-frame=1048576
                             Maximum number of bytes in a single frame for streaming remote read response types before marshalling. Note that client might have limit on frame size as well. 1MB as recommended by protobuf by default.
      --rules.alert.for-outage-tolerance=1h
                             Max time to tolerate prometheus outage for restoring "for" state of alert.
      --rules.alert.for-grace-period=10m
                             Minimum duration between alert and restored "for" state. This is maintained only for alerts with configured "for" time greater than grace period.
      --rules.alert.resend-delay=1m
                             Minimum amount of time to wait before resending an alert to Alertmanager.
      --alertmanager.notification-queue-capacity=10000
                             The capacity of the queue for pending Alertmanager notifications.
      --alertmanager.timeout=10s
                             Timeout for sending alerts to Alertmanager.
      --query.lookback-delta=5m
                             The maximum lookback duration for retrieving metrics during expression evaluations and federation.
      --query.timeout=2m     Maximum time a query may take before being aborted.
      --query.max-concurrency=20
                             Maximum number of queries executed concurrently.
      --query.max-samples=50000000
                             Maximum number of samples a single query can load into memory. Note that queries will fail if they try to load more samples than this into memory, so this also limits the number of samples a query can return.
      --log.level=info       Only log messages with the given severity or above. One of: [debug, info, warn, error]
      --log.format=logfmt    Output format of log messages. One of: [logfmt, json]
Configure the systemd service
# Create the unit file
cat > /usr/lib/systemd/system/prometheus.service <<-EOF
[Unit]
Description=https://prometheus.io
After=network.target
#After=postgresql.service mariadb.service mysql.service
Wants=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
WorkingDirectory=/ups/app/monitor/prometheus/
# RuntimeDirectory=prometheus
# RuntimeDirectoryMode=0750
ExecStart=/ups/app/monitor/prometheus/bin/prometheus \
--config.file=/ups/app/monitor/prometheus/config/prometheus.yml \
--storage.tsdb.retention.time=30d \
--storage.tsdb.path="/ups/app/monitor/prometheus/data/" \
--web.console.templates=/ups/app/monitor/prometheus/consoles \
--web.console.libraries=/ups/app/monitor/prometheus/console_libraries \
--web.enable-lifecycle --web.enable-admin-api \
--web.listen-address=:9090
Restart=on-failure
# Sets open_files_limit
LimitNOFILE=10000
TimeoutStopSec=20
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=prometheus
[Install]
WantedBy=multi-user.target
EOF
Redirect service logs to a dedicated file
cat > /etc/rsyslog.d/prometheus.conf <<-EOF
if \$programname == 'prometheus' then /ups/app/monitor/prometheus/logs/prometheusd.log
& stop
EOF
Edit the configuration file
vi /ups/app/monitor/prometheus/config/prometheus.yml
# my global config
global:
scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
# scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- progs:9093 # port 9093 on the running Alertmanager node
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
- "rules/alert_node.yml"
- "rules/alert_mysql.yml"
# - "first_rules.yml"
# - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
# The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
- job_name: 'prometheus'
# metrics_path defaults to '/metrics'
# scheme defaults to 'http'.
static_configs:
- targets: ['localhost:9090']
- job_name: 'node'
static_configs:
- targets: ['localhost:9100']
relabel_configs:
- action: replace
source_labels: ['__address__'] ## source label
regex: (.*):(.*) ## regex matched against the __address__ value
replacement: $1 ## reference to the first capture group
target_label: HOSTNAME ## write the result into a new label named HOSTNAME
- job_name: 'MySQL'
static_configs:
- targets: ['localhost:9104']
relabel_configs:
- action: replace
source_labels: ['__address__'] ## source label
regex: (.*):(.*) ## regex matched against the __address__ value
replacement: $1 ## reference to the first capture group
target_label: instance ## write the result into a new label named instance
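The relabel rule above is just a regex capture and replace. The same transformation can be sanity-checked from the shell (illustrative command only, not part of Prometheus):

```shell
# Simulate the relabel rule: the regex (.*):(.*) splits __address__ into
# host and port, and replacement $1 keeps only the host part.
addr="progs:9104"
host=$(echo "$addr" | sed -E 's/^(.*):(.*)$/\1/')
echo "$host"
```

Running this prints the address with the port stripped, which is exactly what ends up in the `instance` label.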
Validate the configuration file
cd /ups/app/monitor/prometheus
./bin/promtool check config config/prometheus.yml
Starting the service
# Start in the foreground
./bin/prometheus --config.file=config/prometheus.yml
or
# Manage via systemd
systemctl daemon-reload
systemctl enable prometheus.service
systemctl start prometheus.service
systemctl stop prometheus.service
systemctl status prometheus.service
Reloading the Prometheus configuration
With the --web.enable-lifecycle startup flag set,
the configuration can be reloaded without stopping the service:
curl -X POST http://localhost:9090/-/reload
Verification
# Check that the binary runs in this environment
./bin/prometheus --version
lsof -i :9090
# Open the web UI (default port 9090)
http://192.168.10.181:9090
Docker installation
Install Docker
yum -y install docker
Run Prometheus as a container
Images are available from both Quay.io and Docker Hub:
$ docker run --name prometheus -d -p 127.0.0.1:9090:9090 quay.io/prometheus/prometheus
# Start with a bind-mounted prometheus.yml
docker run \
-p 9090:9090 \
-v /tmp/prometheus.yml:/etc/prometheus/prometheus.yml \
prom/prometheus
# Mount an entire config directory as a volume
docker run \
-p 9090:9090 \
-v /path/to/config:/etc/prometheus \
prom/prometheus
Installation via a Dockerfile
FROM prom/prometheus
ADD prometheus.yml /etc/prometheus/
# Build and run the image
docker build -t my-prometheus .
docker run -p 9090:9090 my-prometheus
Managing Prometheus with Docker
# List running containers
docker ps
Run docker start prometheus to start the service
Run docker stats prometheus to check the container's resource usage
Run docker stop prometheus to stop the service
Configuration
At startup, Prometheus reads the configuration file named by the --config.file flag (default: prometheus.yml).
The configuration file can define the global, alerting, rule_files, scrape_configs, remote_write, remote_read, and other sections.
Global configuration
global holds the defaults shared by every job. Its four main settings are:
- scrape_interval: the default interval between scrapes of a target.
- scrape_timeout: the timeout for a single scrape.
- evaluation_interval: how often rules are evaluated.
- external_labels: extra labels attached to scraped data before it is stored or forwarded.
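The four global settings can be combined as follows (values here are illustrative, not recommendations):

```yaml
global:
  scrape_interval: 30s      # default interval between scrapes of each target
  scrape_timeout: 10s       # per-scrape timeout; keep it below scrape_interval
  evaluation_interval: 30s  # how often recording/alerting rules are evaluated
  external_labels:
    cluster: demo           # example label attached when federating or remote-writing
```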
Alerting configuration
Alertmanager can also be configured with -alertmanager.xxx command-line flags, but that approach is inflexible: it supports neither dynamic reloads nor dynamically defined alert attributes. The alerting section solves this and manages Alertmanager through two settings:
- alert_relabel_configs: rules for rewriting alert labels before alerts are sent.
- alertmanagers: how Alertmanager instances are discovered (statically or dynamically).
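A minimal sketch combining both settings (the labeldrop rule is an illustrative example; `progs:9093` matches the Alertmanager address used elsewhere in this guide):

```yaml
alerting:
  alert_relabel_configs:
    - action: labeldrop     # example: strip a noisy label before alerts are sent
      regex: replica
  alertmanagers:
    - static_configs:
        - targets: ['progs:9093']
```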
Rule configuration
rule_files
lists the rule files to load; it accepts multiple entries and glob patterns.
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
# - "first_rules.yml"
# - "second_rules.yml"
Scrape configuration
scrape_configs defines the targets to pull data from. Each scrape config supports, among others, the following parameters:
- job_name: the job name.
- honor_labels: resolves label conflicts in scraped data; when true the scraped labels win, otherwise the server-side labels win.
- params: URL query parameters sent with each scrape request.
- scrape_interval: the scrape interval.
- scrape_timeout: the scrape timeout.
- metrics_path: the metrics path on the target.
- scheme: the protocol used to scrape.
- sample_limit: per-scrape sample limit; scrapes exceeding it are rejected and not stored. The default of 0 means no limit.
- relabel_configs: relabeling applied to targets before scraping.
- metric_relabel_configs: relabeling applied to scraped metrics before storage.
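The parameters above fit together like this (all values illustrative; the `module` parameter is a hypothetical example of passing query parameters to an exporter):

```yaml
scrape_configs:
  - job_name: 'example'
    honor_labels: false
    scrape_interval: 30s
    scrape_timeout: 10s
    metrics_path: /metrics
    scheme: http
    sample_limit: 0            # 0 = unlimited
    params:
      module: [http_2xx]       # sent as URL query parameters on each scrape
    static_configs:
      - targets: ['localhost:9090']
```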
Remote write
remote_write
configures writable remote storage. Its main parameters are:
- url: the endpoint URL.
- remote_timeout: the request timeout.
- write_relabel_configs: relabeling applied to scraped samples before they are sent to remote storage.
Note: remote_write is still experimental; use with care.
Remote read
remote_read
configures readable remote storage. Its main parameters are:
- url: the endpoint URL.
- remote_timeout: the request timeout.
Note: remote_read is still experimental; use with care.
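A sketch of both sections together (the endpoint URLs are hypothetical; the drop rule is an illustrative example of write relabeling):

```yaml
remote_write:
  - url: http://remote-storage:9201/write   # hypothetical endpoint
    remote_timeout: 30s
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_.*'                      # example: drop Go runtime metrics before sending
        action: drop
remote_read:
  - url: http://remote-storage:9201/read    # hypothetical endpoint
    remote_timeout: 1m
```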
Service discovery
The most important concept in a Prometheus configuration is the data source, the target. Targets can be configured statically or discovered dynamically; the mechanisms fall roughly into these categories:
- static_configs: static targets
- dns_sd_configs: DNS service discovery
- file_sd_configs: file-based service discovery
- consul_sd_configs: Consul service discovery
- serverset_sd_configs: Serverset service discovery
- nerve_sd_configs: Nerve service discovery
- marathon_sd_configs: Marathon service discovery
- kubernetes_sd_configs: Kubernetes service discovery
- gce_sd_configs: GCE service discovery
- ec2_sd_configs: EC2 service discovery
- openstack_sd_configs: OpenStack service discovery
- azure_sd_configs: Azure service discovery
- triton_sd_configs: Triton service discovery
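Two of these mechanisms side by side, as a sketch (the Consul address assumes a local agent on its default port; paths are illustrative):

```yaml
scrape_configs:
  - job_name: 'file-sd-example'
    file_sd_configs:
      - files: ['targets/node/*.json']  # relative to the Prometheus working directory
        refresh_interval: 1m
  - job_name: 'consul-sd-example'       # assumes a local Consul agent
    consul_sd_configs:
      - server: 'localhost:8500'
        services: []                    # empty list = discover all services
```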
Sample configuration
global:
scrape_interval: 15s # By default, scrape targets every 15 seconds.
evaluation_interval: 15s # By default, evaluate rules every 15 seconds.
rule_files:
- "rules/node.rules"
scrape_configs:
- job_name: 'prometheus'
scrape_interval: 5s
static_configs:
- targets: ['localhost:9090']
- job_name: 'node'
scrape_interval: 8s
static_configs:
- targets: ['127.0.0.1:9100', '127.0.0.12:9100']
- job_name: 'mysqld'
static_configs:
- targets: ['127.0.0.1:9104']
- job_name: 'memcached'
static_configs:
- targets: ['127.0.0.1:9150']
Deploying Grafana
Grafana is a web-based visualization tool.
Download locations
# Grafana packages
https://grafana.com/grafana/download
# Percona grafana-dashboards bundle
https://github.com/percona/grafana-dashboards/releases
# Standalone Linux binaries (64-bit), SHA256: b6cbc04505edb712f206228261d0ea5ab7e9c03e9f77d0d36930886c861366ed
wget https://dl.grafana.com/oss/release/grafana-7.1.1.linux-amd64.tar.gz
tar -xf grafana-7.1.1.linux-amd64.tar.gz
Installation
Binary installation
mkdir -p /ups/app/monitor/
# Unpack
tar -xf grafana-*.linux-amd64.tar.gz -C /ups/app/monitor/
# Rename the directory (version must match the downloaded tarball, 7.1.1 here)
cd /ups/app/monitor/
mv grafana-7.1.1 grafana
mkdir -p /ups/app/monitor/grafana/logs
# Create the service user
# groupadd -g 2001 grafana
useradd -r -d /ups/app/monitor/grafana -c "Grafana Server" -M -s /sbin/nologin grafana
# Fix ownership
chown -R grafana.grafana /ups/app/monitor/grafana
Configure the systemd service
# Create the unit file
cat > /usr/lib/systemd/system/grafana.service <<-EOF
[Unit]
Description=Grafana instance
Documentation=http://docs.grafana.org
Wants=network-online.target
After=network-online.target
#After=After=postgresql-12.service mysql3308.service mysql.service
[Service]
# EnvironmentFile=/etc/sysconfig/grafana-server
User=grafana
Group=grafana
Type=notify
Restart=on-failure
WorkingDirectory=/ups/app/monitor/grafana
RuntimeDirectory=grafana
RuntimeDirectoryMode=0750
# ExecStart=/ups/app/monitor/grafana/bin/grafana-server \
# --config=\${CONF_FILE} \
# --pidfile=\${PID_FILE_DIR}/grafana-server.pid \
# --packaging=rpm \
# cfg:default.paths.logs=\${LOG_DIR} \
# cfg:default.paths.data=\${DATA_DIR} \
# cfg:default.paths.plugins=\${PLUGINS_DIR} \
# cfg:default.paths.provisioning=\${PROVISIONING_CFG_DIR}
ExecStart=/ups/app/monitor/grafana/bin/grafana-server
LimitNOFILE=10000
TimeoutStopSec=20
#StandardOutput=syslog
#StandardError=syslog
#SyslogIdentifier=grafana
[Install]
WantedBy=multi-user.target
EOF
Redirect logs to a dedicated file (optional: the syslog lines in the unit file above are commented out)
cat > /etc/rsyslog.d/grafana.conf <<-EOF
if \$programname == 'grafana' then /ups/app/monitor/grafana/logs/grafana.log
& stop
EOF
Starting the service
# Start in the foreground
/ups/app/monitor/grafana/bin/grafana-server &
or
# Manage via systemd
systemctl daemon-reload
systemctl enable grafana.service
systemctl start grafana.service
systemctl stop grafana.service
systemctl restart grafana.service
systemctl status grafana.service
Docker installation
docker run -d --name=grafana -p 3000:3000 grafana/grafana
Verification
# Open the web UI (default port 3000; default credentials admin/admin)
http://192.168.10.181:3000
Configuration files
Paths
- Defaults: $WORKING_DIR/conf/defaults.ini
- Custom overrides: $WORKING_DIR/conf/custom.ini
- The --config flag points the server at an alternative config file:
./grafana-server --config /custom/config.ini --homepath /custom/homepath cfg:default.paths.logs=/custom/path
Adding plugins
Syntax
[root@progs bin]# ./grafana-cli --help
NAME:
Grafana CLI - A new cli application
USAGE:
grafana-cli [global options] command [command options] [arguments...]
VERSION:
7.1.1
AUTHOR:
Grafana Project <[email protected]>
COMMANDS:
plugins Manage plugins for grafana
admin Grafana admin commands
help, h Shows a list of commands or help for one command
GLOBAL OPTIONS:
--pluginsDir value Path to the Grafana plugin directory (default: "/var/lib/grafana/plugins") [$GF_PLUGIN_DIR]
--repo value URL to the plugin repository (default: "https://grafana.com/api/plugins") [$GF_PLUGIN_REPO]
--pluginUrl value Full url to the plugin zip file instead of downloading the plugin from grafana.com/api [$GF_PLUGIN_URL]
--insecure Skip TLS verification (insecure) (default: false)
--debug Enable debug logging (default: false)
--configOverrides value Configuration options to override defaults as a string. e.g. cfg:default.paths.log=/dev/null
--homepath value Path to Grafana install/home path, defaults to working directory
--config value Path to config file
--help, -h show help (default: false)
--version, -v print the version (default: false)
# List plugins available in the repository
grafana-cli plugins list-remote
id: abhisant-druid-datasource version: 0.0.5
id: agenty-flowcharting-panel version: 0.9.0
id: aidanmountford-html-panel version: 0.0.1
id: akumuli-datasource version: 1.3.11
id: alexanderzobnin-zabbix-app version: 3.12.4
id: alexandra-trackmap-panel version: 1.2.5
id: andig-darksky-datasource version: 1.0.1
id: aquaqanalytics-kdbadaptor-datasource version: 1.0.1
id: ayoungprogrammer-finance-datasource version: 1.0.0
id: belugacdn-app version: 1.2.0
id: bessler-pictureit-panel version: 1.0.0
id: blackmirror1-singlestat-math-panel version: 1.1.7
id: blackmirror1-statusbygroup-panel version: 1.1.1
id: bosun-app version: 0.0.28
id: briangann-datatable-panel version: 1.0.2
id: briangann-gauge-panel version: 0.0.6
id: btplc-alarm-box-panel version: 1.0.8
id: btplc-peak-report-panel version: 0.2.4
id: btplc-status-dot-panel version: 0.2.4
id: btplc-trend-box-panel version: 0.1.9
id: camptocamp-prometheus-alertmanager-datasource version: 0.0.8
id: citilogics-geoloop-panel version: 1.1.1
id: cloudflare-app version: 0.1.4
id: cloudspout-button-panel version: 7.0.3
id: cognitedata-datasource version: 2.0.0
id: corpglory-progresslist-panel version: 1.0.5
id: dalmatinerdb-datasource version: 1.0.5
id: dalvany-image-panel version: 2.1.1
id: ddurieux-glpi-app version: 1.3.0
id: devicehive-devicehive-datasource version: 2.0.1
id: devopsprodigy-kubegraf-app version: 1.4.2
id: digiapulssi-breadcrumb-panel version: 1.1.6
id: digiapulssi-organisations-panel version: 1.3.0
id: digrich-bubblechart-panel version: 1.1.0
id: doitintl-bigquery-datasource version: 1.0.8
id: farski-blendstat-panel version: 1.0.2
id: fastweb-openfalcon-datasource version: 1.0.0
id: fatcloud-windrose-panel version: 0.7.0
id: fetzerch-sunandmoon-datasource version: 0.1.6
id: flant-statusmap-panel version: 0.2.0
id: foursquare-clouderamanager-datasource version: 0.9.2
id: fzakaria-simple-annotations-datasource version: 1.0.0
id: gnocchixyz-gnocchi-datasource version: 1.7.0
id: goshposh-metaqueries-datasource version: 0.0.3
id: grafana-azure-data-explorer-datasource version: 2.1.0
id: grafana-azure-monitor-datasource version: 0.3.0
id: grafana-clock-panel version: 1.1.1
id: grafana-googlesheets-datasource version: 1.0.0
id: grafana-image-renderer version: 2.0.0
id: grafana-influxdb-08-datasource version: 1.0.2
id: grafana-influxdb-flux-datasource version: 7.0.0
id: grafana-kairosdb-datasource version: 3.0.1
id: grafana-kubernetes-app version: 1.0.1
id: grafana-piechart-panel version: 1.5.0
id: grafana-polystat-panel version: 1.2.0
id: grafana-simple-json-datasource version: 1.4.0
id: grafana-strava-datasource version: 1.1.1
id: grafana-worldmap-panel version: 0.3.2
id: gretamosa-topology-panel version: 1.0.0
id: gridprotectionalliance-openhistorian-datasource version: 1.0.2
id: gridprotectionalliance-osisoftpi-datasource version: 1.0.4
id: hawkular-datasource version: 1.1.1
id: ibm-apm-datasource version: 0.9.0
id: instana-datasource version: 2.7.3
id: jasonlashua-prtg-datasource version: 4.0.3
id: jdbranham-diagram-panel version: 1.6.2
id: jeanbaptistewatenberg-percent-panel version: 1.0.6
id: kentik-app version: 1.3.4
id: larona-epict-panel version: 1.2.2
id: linksmart-hds-datasource version: 1.0.1
id: linksmart-sensorthings-datasource version: 1.3.0
id: logzio-datasource version: 5.0.0
id: macropower-analytics-panel version: 1.0.0
id: magnesium-wordcloud-panel version: 1.0.0
id: marcuscalidus-svg-panel version: 0.3.3
id: marcusolsson-hourly-heatmap-panel version: 0.4.1
id: marcusolsson-treemap-panel version: 0.2.0
id: michaeldmoore-annunciator-panel version: 1.0.5
id: michaeldmoore-multistat-panel version: 1.4.1
id: monasca-datasource version: 1.0.0
id: monitoringartist-monitoringart-datasource version: 1.0.0
id: moogsoft-aiops-app version: 8.0.0
id: mtanda-google-calendar-datasource version: 1.0.4
id: mtanda-heatmap-epoch-panel version: 0.1.7
id: mtanda-histogram-panel version: 0.1.6
id: mxswat-separator-panel version: 1.0.0
id: natel-discrete-panel version: 0.1.0
id: natel-influx-admin-panel version: 0.0.5
id: natel-plotly-panel version: 0.0.6
id: natel-usgs-datasource version: 0.0.2
id: neocat-cal-heatmap-panel version: 0.0.3
id: novalabs-annotations-panel version: 0.0.1
id: ns1-app version: 0.0.7
id: ntop-ntopng-datasource version: 1.0.0
id: opennms-helm-app version: 5.0.1
id: ovh-warp10-datasource version: 2.2.0
id: paytm-kapacitor-datasource version: 0.1.2
id: percona-percona-app version: 1.0.0
id: petrslavotinek-carpetplot-panel version: 0.1.1
id: pierosavi-imageit-panel version: 0.1.3
id: pr0ps-trackmap-panel version: 2.1.0
id: praj-ams-datasource version: 1.2.0
id: pue-solr-datasource version: 1.0.2
id: quasardb-datasource version: 3.8.2
id: rackerlabs-blueflood-datasource version: 0.0.2
id: radensolutions-netxms-datasource version: 1.2.2
id: raintank-snap-app version: 0.0.5
id: raintank-worldping-app version: 1.2.7
id: redis-datasource version: 1.1.2
id: ryantxu-ajax-panel version: 0.0.7-dev
id: ryantxu-annolist-panel version: 0.0.1
id: satellogic-3d-globe-panel version: 0.1.0
id: savantly-heatmap-panel version: 0.2.0
id: sbueringer-consul-datasource version: 0.1.5
id: scadavis-synoptic-panel version: 1.0.4
id: sidewinder-datasource version: 0.2.0
id: simpod-json-datasource version: 0.2.0
id: skydive-datasource version: 1.2.0
id: smartmakers-trafficlight-panel version: 1.0.0
id: sni-pnp-datasource version: 1.0.5
id: sni-thruk-datasource version: 1.0.3
id: snuids-radar-panel version: 1.4.4
id: snuids-trafficlights-panel version: 1.4.5
id: spotify-heroic-datasource version: 0.0.1
id: stagemonitor-elasticsearch-app version: 0.83.2
id: udoprog-heroic-datasource version: 0.1.0
id: vertamedia-clickhouse-datasource version: 2.0.2
id: vertica-grafana-datasource version: 0.1.0
id: vonage-status-panel version: 1.0.9
id: voxter-app version: 0.0.1
id: xginn8-pagerduty-datasource version: 0.2.1
id: yesoreyeram-boomtable-panel version: 1.3.0
id: yesoreyeram-boomtheme-panel version: 0.1.0
id: zuburqan-parity-report-panel version: 1.2.1
Installing plugins
Install into a specific plugin directory (omit --pluginsDir to use the default):
./grafana-cli --pluginsDir /ups/app/monitor/grafana/data/plugins plugins install grafana-piechart-panel
./grafana-cli --pluginsDir /ups/app/monitor/grafana/data/plugins plugins install grafana-polystat-panel
./grafana-cli --pluginsDir /ups/app/monitor/grafana/data/plugins plugins install digiapulssi-breadcrumb-panel
Verify the result
./bin/grafana-cli plugins ls
Importing dashboards
Import files through the web UI, or
provision a dashboard path from the backend:
# 1. Unpack
unzip -qo grafana-dashboards-2.9.0.zip
cd grafana-dashboards-2.9.0
cp -r dashboards /ups/app/monitor/grafana/grafana-dashboards
# 2. Create the mysqld_export.yml provider file
cat > /ups/app/monitor/grafana/conf/provisioning/dashboards/mysqld_export.yml <<-EOF
apiVersion: 1
providers:
- name: 'mysqld_exporter'
orgId: 1
folder: ''
type: file
options:
path: /ups/app/monitor/grafana/grafana-dashboards
EOF
# 3. Restart the grafana service
Configure the Prometheus data source
Exporters
In Prometheus, the programs that report metrics are collectively called exporters; each exporter covers a different kind of target.
Host monitoring (node_exporter)
Deployment
Binary installation
Install the software
# Create the service user
#groupadd -g 2000 prometheus
useradd -r -M -c "Prometheus agent" -d /ups/app/monitor/ -s /sbin/nologin prometheus
# Unpack
mkdir -p /ups/app/monitor/
tar -xf node_exporter-*.linux-amd64.tar.gz -C /ups/app/monitor/ --no-same-owner
# Rename the directory
cd /ups/app/monitor/
mv node_exporter-*.linux-amd64 node_exporter
# Fix ownership
# chown -R prometheus.prometheus /ups/app/monitor/node_exporter
Configure the systemd service
# Create the unit file
cat > /usr/lib/systemd/system/node_exporter.service <<-EOF
[Unit]
Description=node exporter
Documentation=https://prometheus.io
After=network.target
[Service]
#User=prometheus
#Group=prometheus
Restart=on-failure
ExecStart=/ups/app/monitor/node_exporter/node_exporter --web.listen-address=:9100
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=node_exporter
[Install]
WantedBy=multi-user.target
EOF
Redirect logs to a dedicated file:

cat > /etc/rsyslog.d/node_exporter.conf <<-EOF
if \$programname == 'node_exporter' then /ups/app/monitor/node_exporter/node.log
& stop
EOF
Starting the service
# Via systemd
systemctl daemon-reload
systemctl restart node_exporter.service
systemctl status node_exporter.service
or
# Start the client in the foreground
cd /ups/app/monitor/node_exporter
./node_exporter &
Docker installation
docker run -d -p 9100:9100 \
-v "/proc:/host/proc:ro" \
-v "/sys:/host/sys:ro" \
-v "/:/rootfs:ro" \
--net="host" \
quay.io/prometheus/node-exporter \
--path.procfs=/host/proc \
--path.sysfs=/host/sys \
--collector.filesystem.ignored-mount-points="^/(sys|proc|dev|host|etc)($|/)"
Adding nodes to Prometheus
Centralized exporter configuration
Edit the Prometheus configuration file
Pull node_exporter data through a file-based service discovery job. Open prometheus.yml and append the following under scrape_configs:
# Append to prometheus.yml
cat >> /ups/app/monitor/prometheus/config/prometheus.yml <<-EOF
- job_name: 'node_exporter'
scrape_interval: 1s
file_sd_configs:
- files:
- targets/node/nodes-instances.json
refresh_interval: 10s
relabel_configs:
- action: replace
source_labels: ['__address__']
regex: (.*):(.*)
replacement: $1
target_label: hostname
- action: labeldrop
regex: __meta_filepath
EOF
Create the JSON file listing the node hosts
vi /ups/app/monitor/prometheus/config/targets/node/nodes-instances.json
[
{
"targets": [ "192.168.10.181:9100","192.168.10.182:9100", "192.168.10.190:9100","192.168.10.191:9100","192.168.10.192:9100"]
}
]
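Since target files are read on a refresh interval rather than at startup, it helps to validate the JSON before relying on it. A quick sketch (writes to /tmp; python3 is only one of many possible validators):

```shell
# Write a sample target file and check that it parses as JSON.
cat > /tmp/nodes-instances.json <<'EOF'
[
  { "targets": [ "192.168.10.181:9100", "192.168.10.182:9100" ] }
]
EOF
python3 -m json.tool /tmp/nodes-instances.json > /dev/null && echo "JSON OK"
```

If the file is malformed, `json.tool` exits non-zero and prints the parse error, which is far easier to spot than a silently empty target list in the Prometheus UI.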
Per-instance exporter configuration
Each monitored object gets its own target file
Edit the Prometheus configuration file
# my global config
global:
scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
# scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- progs:9093 # port 9093 on the running Alertmanager node
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
- "rules/alert_node.yml"
- "rules/alert_mysql.yml"
# - "first_rules.yml"
# - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
# The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
- job_name: 'prometheus'
# metrics_path defaults to '/metrics'
# scheme defaults to 'http'.
static_configs:
- targets: ['localhost:9090']
- job_name: 'node_exporter'
scrape_interval: 1s
file_sd_configs:
- files:
- targets/node/*.yml
refresh_interval: 10s
relabel_configs:
- action: replace
source_labels: ['__address__']
regex: (.*):(.*)
replacement: $1
target_label: hostname
- action: labeldrop
regex: __meta_filepath
Create one instance file per host
vi /ups/app/monitor/prometheus/config/targets/node/nodes1-instances.yml
[
{
"targets": ["192.168.10.181:9100"],
"labels": { }
}
]
vi /ups/app/monitor/prometheus/config/targets/node/nodes2-instances.yml
[
{
"targets": ["192.168.10.182:9100"],
"labels": { }
}
]
Restart prometheus to pick up the configuration
# Validate the configuration file
./bin/promtool check config config/prometheus.yml
# Restart the service
systemctl restart prometheus
Access
Browse to http://IP:9100/metrics on each node
Collectors
Enabled by default

Name | Description | OS
---|---|---
arp | ARP statistics from /proc/net/arp | Linux
conntrack | conntrack statistics from /proc/sys/net/netfilter/ | Linux
cpu | CPU statistics | Darwin, Dragonfly, FreeBSD, Linux
diskstats | Disk I/O statistics from /proc/diskstats | Linux
edac | Error detection and correction statistics | Linux
entropy | Available kernel entropy | Linux
exec | Execution statistics | Dragonfly, FreeBSD
filefd | File descriptor statistics from /proc/sys/fs/file-nr | Linux
filesystem | Filesystem statistics, such as disk space used | Darwin, Dragonfly, FreeBSD, Linux, OpenBSD
hwmon | Monitor/sensor data from /sys/class/hwmon/ | Linux
infiniband | Network statistics from the InfiniBand configuration | Linux
loadavg | System load | Darwin, Dragonfly, FreeBSD, Linux, NetBSD, OpenBSD, Solaris
mdadm | Device statistics from /proc/mdstat | Linux
meminfo | Memory statistics | Darwin, Dragonfly, FreeBSD, Linux
netdev | Network interface traffic, in bytes | Darwin, Dragonfly, FreeBSD, Linux, OpenBSD
netstat | Network statistics from /proc/net/netstat, equivalent to netstat -s | Linux
sockstat | Socket statistics from /proc/net/sockstat | Linux
stat | Miscellaneous statistics from /proc/stat: boot time, forks, interrupts, etc. | Linux
textfile | Metrics from local text files in the directory given by --collector.textfile.directory | any
time | Current system time | any
uname | System information via the uname syscall | any
vmstat | Statistics from /proc/vmstat | Linux
wifi | WiFi device statistics | Linux
xfs | XFS runtime statistics | Linux (kernel 4.4+)
zfs | ZFS performance statistics | Linux
Disabled by default

Name | Description | OS
---|---|---
bonding | Number of configured and active bonded NICs | Linux
buddyinfo | Memory fragmentation statistics from /proc/buddyinfo | Linux
devstat | Device statistics | Dragonfly, FreeBSD
drbd | Distributed Replicated Block Device (DRBD) statistics | Linux
interrupts | More detailed interrupt statistics | Linux, OpenBSD
ipvs | IPVS status from /proc/net/ip_vs and statistics from /proc/net/ip_vs_stats | Linux
ksmd | Kernel and system statistics from /sys/kernel/mm/ksm | Linux
logind | Session statistics from logind | Linux
meminfo_numa | Memory statistics from /proc/meminfo_numa | Linux
mountstats | Filesystem statistics from /proc/self/mountstats, including NFS client statistics | Linux
nfs | NFS statistics from /proc/net/rpc/nfs, equivalent to nfsstat -c | Linux
qdisc | Queueing discipline statistics | Linux
runit | runit status | any
supervisord | supervisord status | any
systemd | Service and system status from systemd | Linux
tcpstat | TCP connection status from /proc/net/tcp and /proc/net/tcp6 | Linux
Monitoring MySQL
Install mysqld_exporter on the MySQL server
Install the exporter
# Create the service user
# groupadd -g 2000 prometheus
useradd -u 2000 -M -c "Prometheus agent" -s /sbin/nologin prometheus
# Unpack
mkdir -p /ups/app/monitor/
tar -xf mysqld_exporter-0.12.1.linux-amd64.tar.gz -C /ups/app/monitor/
# Rename the directory
cd /ups/app/monitor/
mv mysqld_exporter-0.12.1.linux-amd64 mysqld_exporter
# Fix ownership
chown -R prometheus.prometheus /ups/app/monitor/mysqld_exporter
Create a MySQL monitoring user
On each MySQL instance to be monitored:
CREATE USER 'monitor'@'localhost' IDENTIFIED BY 'monitor';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'monitor'@'localhost';
CREATE USER 'monitor'@'192.168.10.%' IDENTIFIED BY 'monitor';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'monitor'@'192.168.10.%';
flush privileges;
Configure the client credentials file
cat > /ups/app/monitor/mysqld_exporter/.my.cnf <<EOF
[client]
user=monitor
password=monitor
port=3308
socket=/ups/app/mysql/mysql3308/logs/mysql3308.sock
host=progs
EOF
chmod 400 /ups/app/monitor/mysqld_exporter/.my.cnf
chown prometheus:prometheus /ups/app/monitor/mysqld_exporter/.my.cnf
Configure the systemd service
# Create the unit file
cat > /usr/lib/systemd/system/mysql_exporter.service <<-EOF
[Unit]
Description=mysqld exporter
Documentation=https://prometheus.io
After=network.target
After=postgresql-12.service mysql3308.service mysql.service
[Service]
Restart=on-failure
# ExecStart=/ups/app/monitor/mysqld_exporter/mysqld_exporter --config.my-cnf=/ups/app/monitor/mysqld_exporter/.my.cnf
ExecStart=/ups/app/monitor/mysqld_exporter/mysqld_exporter \
--config.my-cnf=/ups/app/monitor/mysqld_exporter/.my.cnf \
--collect.info_schema.innodb_tablespaces \
--collect.info_schema.innodb_metrics \
--collect.perf_schema.tableiowaits \
--collect.perf_schema.indexiowaits \
--collect.perf_schema.tablelocks \
--collect.engine_innodb_status \
--collect.perf_schema.file_events \
--collect.binlog_size \
--collect.info_schema.clientstats \
--collect.perf_schema.eventswaits
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=mysqld_exporter
[Install]
WantedBy=multi-user.target
EOF
Redirect logs to a dedicated file:

cat > /etc/rsyslog.d/mysqld_exporter.conf <<-EOF
if \$programname == 'mysqld_exporter' then /ups/app/monitor/mysqld_exporter/mysqld.log
& stop
EOF
Starting the service
# Via systemd
systemctl daemon-reload
systemctl restart mysql_exporter.service
systemctl status mysql_exporter.service
or
# Start the client in the foreground
./mysqld_exporter --config.my-cnf=/ups/app/monitor/mysqld_exporter/.my.cnf
# Default port: 9104
lsof -i :9104
netstat -tnlp|grep ':9104'
Verification
http://192.168.10.181:9104/metrics
Register with Prometheus (on the Prometheus server)
# Append to prometheus.yml
cat >> /ups/app/monitor/prometheus/config/prometheus.yml <<-EOF
- job_name: 'MySQL'
static_configs:
- targets: ['progs:9104','192.168.10.181:9104']
EOF
Restart prometheus
# Validate the configuration file
./bin/promtool check config config/prometheus.yml
# Restart the service
systemctl restart prometheus
Verification
http://192.168.10.181:9090/targets
監控PostgreSQL
軟體部署
下載地址
wget -c https://github.com/wrouesnel/postgres_exporter/releases/download/v0.8.0/postgres_exporter_v0.8.0_linux-amd64.tar.gz
Installation
Binary package installation
- Extract
tar -xf postgres_exporter_v0.8.0_linux-amd64.tar.gz -C /ups/app/monitor
cd /ups/app/monitor && mv postgres_exporter* postgres_exporter
- Configure the service
# Create the service unit file
cat > /usr/lib/systemd/system/postgres_exporter.service <<-EOF
[Unit]
Description=PostgreSQL Exporter
Documentation=https://github.com/wrouesnel/postgres_exporter
After=network.target
[Service]
User=postgres
Group=postgres
Restart=on-failure
# Alternative DSN form: DATA_SOURCE_NAME=postgresql://postgres:postgres@localhost:5432/postgres?sslmode=disable
Environment="DATA_SOURCE_NAME=user=postgres passfile=/home/postgres/.pgpass host=192.168.10.181 port=5432 sslmode=prefer"
ExecStart=/ups/app/monitor/postgres_exporter/postgres_exporter --web.listen-address=:9187 --extend.query-path=/ups/app/monitor/postgres_exporter/queries.yaml
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=postgres_exporter
[Install]
WantedBy=multi-user.target
EOF
- Configure the custom queries file
vi /ups/app/monitor/postgres_exporter/queries.yaml
pg_replication:
  query: "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) as lag"
  master: true
  metrics:
    - lag:
        usage: "GAUGE"
        description: "Replication lag behind master in seconds"
pg_postmaster:
  query: "SELECT pg_postmaster_start_time as start_time_seconds from pg_postmaster_start_time()"
  master: true
  metrics:
    - start_time_seconds:
        usage: "GAUGE"
        description: "Time at which postmaster started"
pg_stat_user_tables:
  query: "SELECT current_database() datname, schemaname, relname, seq_scan, seq_tup_read, idx_scan, idx_tup_fetch, n_tup_ins, n_tup_upd, n_tup_del, n_tup_hot_upd, n_live_tup, n_dead_tup, n_mod_since_analyze, COALESCE(last_vacuum, '1970-01-01Z') as last_vacuum, COALESCE(last_autovacuum, '1970-01-01Z') as last_autovacuum, COALESCE(last_analyze, '1970-01-01Z') as last_analyze, COALESCE(last_autoanalyze, '1970-01-01Z') as last_autoanalyze, vacuum_count, autovacuum_count, analyze_count, autoanalyze_count FROM pg_stat_user_tables"
  metrics:
    - datname:
        usage: "LABEL"
        description: "Name of current database"
    - schemaname:
        usage: "LABEL"
        description: "Name of the schema that this table is in"
    - relname:
        usage: "LABEL"
        description: "Name of this table"
    - seq_scan:
        usage: "COUNTER"
        description: "Number of sequential scans initiated on this table"
    - seq_tup_read:
        usage: "COUNTER"
        description: "Number of live rows fetched by sequential scans"
    - idx_scan:
        usage: "COUNTER"
        description: "Number of index scans initiated on this table"
    - idx_tup_fetch:
        usage: "COUNTER"
        description: "Number of live rows fetched by index scans"
    - n_tup_ins:
        usage: "COUNTER"
        description: "Number of rows inserted"
    - n_tup_upd:
        usage: "COUNTER"
        description: "Number of rows updated"
    - n_tup_del:
        usage: "COUNTER"
        description: "Number of rows deleted"
    - n_tup_hot_upd:
        usage: "COUNTER"
        description: "Number of rows HOT updated (i.e., with no separate index update required)"
    - n_live_tup:
        usage: "GAUGE"
        description: "Estimated number of live rows"
    - n_dead_tup:
        usage: "GAUGE"
        description: "Estimated number of dead rows"
    - n_mod_since_analyze:
        usage: "GAUGE"
        description: "Estimated number of rows changed since last analyze"
    - last_vacuum:
        usage: "GAUGE"
        description: "Last time at which this table was manually vacuumed (not counting VACUUM FULL)"
    - last_autovacuum:
        usage: "GAUGE"
        description: "Last time at which this table was vacuumed by the autovacuum daemon"
    - last_analyze:
        usage: "GAUGE"
        description: "Last time at which this table was manually analyzed"
    - last_autoanalyze:
        usage: "GAUGE"
        description: "Last time at which this table was analyzed by the autovacuum daemon"
    - vacuum_count:
        usage: "COUNTER"
        description: "Number of times this table has been manually vacuumed (not counting VACUUM FULL)"
    - autovacuum_count:
        usage: "COUNTER"
        description: "Number of times this table has been vacuumed by the autovacuum daemon"
    - analyze_count:
        usage: "COUNTER"
        description: "Number of times this table has been manually analyzed"
    - autoanalyze_count:
        usage: "COUNTER"
        description: "Number of times this table has been analyzed by the autovacuum daemon"
pg_statio_user_tables:
  query: "SELECT current_database() datname, schemaname, relname, heap_blks_read, heap_blks_hit, idx_blks_read, idx_blks_hit, toast_blks_read, toast_blks_hit, tidx_blks_read, tidx_blks_hit FROM pg_statio_user_tables"
  metrics:
    - datname:
        usage: "LABEL"
        description: "Name of current database"
    - schemaname:
        usage: "LABEL"
        description: "Name of the schema that this table is in"
    - relname:
        usage: "LABEL"
        description: "Name of this table"
    - heap_blks_read:
        usage: "COUNTER"
        description: "Number of disk blocks read from this table"
    - heap_blks_hit:
        usage: "COUNTER"
        description: "Number of buffer hits in this table"
    - idx_blks_read:
        usage: "COUNTER"
        description: "Number of disk blocks read from all indexes on this table"
    - idx_blks_hit:
        usage: "COUNTER"
        description: "Number of buffer hits in all indexes on this table"
    - toast_blks_read:
        usage: "COUNTER"
        description: "Number of disk blocks read from this table's TOAST table (if any)"
    - toast_blks_hit:
        usage: "COUNTER"
        description: "Number of buffer hits in this table's TOAST table (if any)"
    - tidx_blks_read:
        usage: "COUNTER"
        description: "Number of disk blocks read from this table's TOAST table indexes (if any)"
    - tidx_blks_hit:
        usage: "COUNTER"
        description: "Number of buffer hits in this table's TOAST table indexes (if any)"
pg_database:
  query: "SELECT pg_database.datname, pg_database_size(pg_database.datname) as size_bytes FROM pg_database"
  master: true
  cache_seconds: 30
  metrics:
    - datname:
        usage: "LABEL"
        description: "Name of the database"
    - size_bytes:
        usage: "GAUGE"
        description: "Disk space used by the database"
pg_stat_statements:
  query: "SELECT t2.rolname, t3.datname, queryid, calls, total_time / 1000 as total_time_seconds, min_time / 1000 as min_time_seconds, max_time / 1000 as max_time_seconds, mean_time / 1000 as mean_time_seconds, stddev_time / 1000 as stddev_time_seconds, rows, shared_blks_hit, shared_blks_read, shared_blks_dirtied, shared_blks_written, local_blks_hit, local_blks_read, local_blks_dirtied, local_blks_written, temp_blks_read, temp_blks_written, blk_read_time / 1000 as blk_read_time_seconds, blk_write_time / 1000 as blk_write_time_seconds FROM pg_stat_statements t1 join pg_roles t2 on (t1.userid=t2.oid) join pg_database t3 on (t1.dbid=t3.oid)"
  master: true
  metrics:
    - rolname:
        usage: "LABEL"
        description: "Name of user"
    - datname:
        usage: "LABEL"
        description: "Name of database"
    - queryid:
        usage: "LABEL"
        description: "Query ID"
    - calls:
        usage: "COUNTER"
        description: "Number of times executed"
    - total_time_seconds:
        usage: "COUNTER"
        description: "Total time spent in the statement, in seconds"
    - min_time_seconds:
        usage: "GAUGE"
        description: "Minimum time spent in the statement, in seconds"
    - max_time_seconds:
        usage: "GAUGE"
        description: "Maximum time spent in the statement, in seconds"
    - mean_time_seconds:
        usage: "GAUGE"
        description: "Mean time spent in the statement, in seconds"
    - stddev_time_seconds:
        usage: "GAUGE"
        description: "Population standard deviation of time spent in the statement, in seconds"
    - rows:
        usage: "COUNTER"
        description: "Total number of rows retrieved or affected by the statement"
    - shared_blks_hit:
        usage: "COUNTER"
        description: "Total number of shared block cache hits by the statement"
    - shared_blks_read:
        usage: "COUNTER"
        description: "Total number of shared blocks read by the statement"
    - shared_blks_dirtied:
        usage: "COUNTER"
        description: "Total number of shared blocks dirtied by the statement"
    - shared_blks_written:
        usage: "COUNTER"
        description: "Total number of shared blocks written by the statement"
    - local_blks_hit:
        usage: "COUNTER"
        description: "Total number of local block cache hits by the statement"
    - local_blks_read:
        usage: "COUNTER"
        description: "Total number of local blocks read by the statement"
    - local_blks_dirtied:
        usage: "COUNTER"
        description: "Total number of local blocks dirtied by the statement"
    - local_blks_written:
        usage: "COUNTER"
        description: "Total number of local blocks written by the statement"
    - temp_blks_read:
        usage: "COUNTER"
        description: "Total number of temp blocks read by the statement"
    - temp_blks_written:
        usage: "COUNTER"
        description: "Total number of temp blocks written by the statement"
    - blk_read_time_seconds:
        usage: "COUNTER"
        description: "Total time the statement spent reading blocks, in seconds (if track_io_timing is enabled, otherwise zero)"
    - blk_write_time_seconds:
        usage: "COUNTER"
        description: "Total time the statement spent writing blocks, in seconds (if track_io_timing is enabled, otherwise zero)"
- Redirect logs to a dedicated file
cat > /etc/rsyslog.d/postgres_exporter.conf <<-EOF
if \$programname == 'postgres_exporter' then /ups/app/monitor/postgres_exporter/exporter.log
& stop
EOF
- Start the service
# Start the service via systemd
systemctl daemon-reload
systemctl restart postgres_exporter.service
systemctl status postgres_exporter.service
# Start the client from the command line -- DSN form: postgresql://postgres:password@localhost:5432/postgres
export DATA_SOURCE_NAME="postgresql://postgres:postgres@localhost:5432/postgres?sslmode=disable"
export PG_EXPORTER_EXTEND_QUERY_PATH="/ups/app/monitor/postgres_exporter/queries.yaml"
./postgres_exporter &
Docker installation
docker run --net=host -e DATA_SOURCE_NAME="postgresql://postgres:password@localhost:5432/postgres?sslmode=disable" wrouesnel/postgres_exporter
Add to Prometheus monitoring
Add a scrape job to the Prometheus configuration file
  - job_name: 'postgres_exporter'
    scrape_interval: 1s
    file_sd_configs:
      - files:
          - targets/postgresql/*.yml
    relabel_configs:
      - action: replace
        source_labels: ['__address__']
        regex: (.*):(.*):(.*)
        replacement: $2
        target_label: hostip
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.10.181:9187
Alert rules file
vi rules/alert_pg.yml
---
groups:
  - name: PostgreSQL
    rules:
      - alert: PostgreSQLMaxConnectionsReached
        expr: sum(pg_stat_activity_count) by (instance) > sum(pg_settings_max_connections) by (instance)
        for: 1m
        labels:
          severity: email
        annotations:
          summary: "{{ $labels.instance }} has maxed out Postgres connections."
          description: "{{ $labels.instance }} is exceeding the currently configured maximum Postgres connection limit (current value: {{ $value }}). Services may be degraded - please take immediate action (you probably need to increase max_connections and re-deploy)."
      - alert: PostgreSQLHighConnections
        expr: sum(pg_stat_activity_count) by (instance) > sum(pg_settings_max_connections * 0.8) by (instance)
        for: 10m
        labels:
          severity: email
        annotations:
          summary: "{{ $labels.instance }} is over 80% of max Postgres connections."
          description: "{{ $labels.instance }} is exceeding 80% of the currently configured maximum Postgres connection limit (current value: {{ $value }}). Please check utilization graphs and confirm whether this is normal service growth, abuse, or a temporary condition, or whether new resources need to be provisioned (or the limits increased, which is most likely)."
      - alert: PostgreSQLDown
        expr: pg_up != 1
        for: 1m
        labels:
          severity: email
        annotations:
          summary: "PostgreSQL is not processing queries: {{ $labels.instance }}"
          description: "{{ $labels.instance }} is rejecting query requests from the exporter, and thus probably not allowing DNS requests to work either. User services should not be affected provided at least 1 node is still alive."
      - alert: PostgreSQLSlowQueries
        expr: avg(rate(pg_stat_activity_max_tx_duration{datname!~"template.*"}[2m])) by (datname) > 2 * 60
        for: 2m
        labels:
          severity: email
        annotations:
          summary: "PostgreSQL high number of slow queries on {{ $labels.cluster }} for database {{ $labels.datname }}"
          description: "PostgreSQL high number of slow queries {{ $labels.cluster }} for database {{ $labels.datname }} with a value of {{ $value }}"
      - alert: PostgreSQLQPS
        expr: avg(irate(pg_stat_database_xact_commit{datname!~"template.*"}[5m]) + irate(pg_stat_database_xact_rollback{datname!~"template.*"}[5m])) by (datname) > 10000
        for: 5m
        labels:
          severity: email
        annotations:
          summary: "PostgreSQL high number of queries per second {{ $labels.cluster }} for database {{ $labels.datname }}"
          description: "PostgreSQL high number of queries per second on {{ $labels.cluster }} for database {{ $labels.datname }} with a value of {{ $value }}"
      - alert: PostgreSQLCacheHitRatio
        expr: avg(rate(pg_stat_database_blks_hit{datname!~"template.*"}[5m]) / (rate(pg_stat_database_blks_hit{datname!~"template.*"}[5m]) + rate(pg_stat_database_blks_read{datname!~"template.*"}[5m]))) by (datname) < 0.98
        for: 5m
        labels:
          severity: email
        annotations:
          summary: "PostgreSQL low cache hit rate on {{ $labels.cluster }} for database {{ $labels.datname }}"
          description: "PostgreSQL low on cache hit rate on {{ $labels.cluster }} for database {{ $labels.datname }} with a value of {{ $value }}"
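The 0.98 threshold in PostgreSQLCacheHitRatio can be sanity-checked by hand. A minimal sketch with made-up counter increases over a 5-minute window (the numbers are hypothetical):

```shell
# Hypothetical increases of pg_stat_database_blks_hit / blks_read over 5m
blks_hit=9700
blks_read=300
verdict=$(awk -v hit="$blks_hit" -v read="$blks_read" 'BEGIN {
  ratio = hit / (hit + read)   # same ratio the alert expression computes
  printf "%.2f %s", ratio, (ratio < 0.98 ? "ALERT" : "ok")
}')
echo "$verdict"   # -> 0.97 ALERT
```

With 9700 hits against 300 disk reads the ratio is 0.97, just below the threshold, so the rule would fire after the 5m `for:` window.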
Privileges required to collect metrics as a non-superuser
DATA_SOURCE_NAME=postgresql://postgres_exporter:password@localhost:5432/postgres?sslmode=disable
-- To use IF statements, hence to be able to check if the user exists before
-- attempting creation, we need to switch to procedural SQL (PL/pgSQL)
-- instead of standard SQL.
-- More: https://www.postgresql.org/docs/9.3/plpgsql-overview.html
-- To preserve compatibility with <9.0, DO blocks are not used; instead,
-- a function is created and dropped.
CREATE OR REPLACE FUNCTION __tmp_create_user() returns void as $$
BEGIN
  IF NOT EXISTS (
          SELECT               -- SELECT list can stay empty for this
          FROM   pg_catalog.pg_user
          WHERE  usename = 'postgres_exporter') THEN
    CREATE USER postgres_exporter;
  END IF;
END;
$$ language plpgsql;
SELECT __tmp_create_user();
DROP FUNCTION __tmp_create_user();
ALTER USER postgres_exporter WITH PASSWORD 'password';
ALTER USER postgres_exporter SET SEARCH_PATH TO postgres_exporter,pg_catalog;
-- If deploying as non-superuser (for example in AWS RDS), uncomment the GRANT
-- line below and replace <MASTER_USER> with your root user.
-- GRANT postgres_exporter TO <MASTER_USER>;
CREATE SCHEMA IF NOT EXISTS postgres_exporter;
GRANT USAGE ON SCHEMA postgres_exporter TO postgres_exporter;
GRANT CONNECT ON DATABASE postgres TO postgres_exporter;
CREATE OR REPLACE FUNCTION get_pg_stat_activity() RETURNS SETOF pg_stat_activity AS
$$ SELECT * FROM pg_catalog.pg_stat_activity; $$
LANGUAGE sql
VOLATILE
SECURITY DEFINER;
CREATE OR REPLACE VIEW postgres_exporter.pg_stat_activity
AS
SELECT * from get_pg_stat_activity();
GRANT SELECT ON postgres_exporter.pg_stat_activity TO postgres_exporter;
CREATE OR REPLACE FUNCTION get_pg_stat_replication() RETURNS SETOF pg_stat_replication AS
$$ SELECT * FROM pg_catalog.pg_stat_replication; $$
LANGUAGE sql
VOLATILE
SECURITY DEFINER;
CREATE OR REPLACE VIEW postgres_exporter.pg_stat_replication
AS
SELECT * FROM get_pg_stat_replication();
GRANT SELECT ON postgres_exporter.pg_stat_replication TO postgres_exporter;
Reload the configuration
# Hot reload (requires Prometheus to have been started with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
Monitoring Redis
Software deployment
Download
wget -c https://github.com/oliver006/redis_exporter/releases/download/v1.9.0/redis_exporter-v1.9.0.linux-amd64.tar.gz
Installation
Binary package installation
- Extract
tar -xf redis_exporter-v1.9.0.linux-amd64.tar.gz -C /ups/app/monitor/
cd /ups/app/monitor/ && mv redis_exporter-* redis_exporter
- Configure the service
# Create the service unit file
cat > /usr/lib/systemd/system/redis_exporter.service <<-EOF
[Unit]
Description=Redis Exporter
Documentation=https://github.com/oliver006/redis_exporter
After=network.target
[Service]
#User=prometheus
#Group=prometheus
Restart=on-failure
ExecStart=/ups/app/monitor/redis_exporter/redis_exporter -redis-only-metrics --web.listen-address=:9121
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=redis_exporter
[Install]
WantedBy=multi-user.target
EOF
- Redirect logs to a dedicated file
cat > /etc/rsyslog.d/redis_exporter.conf <<-EOF
if \$programname == 'redis_exporter' then /ups/app/monitor/redis_exporter/exporter.log
& stop
EOF
- Start the service
# Start the service via systemd
systemctl daemon-reload
systemctl restart redis_exporter.service
systemctl status redis_exporter.service
# Start the client from the command line
cd /ups/app/monitor/redis_exporter
./redis_exporter &
Docker installation
docker run -d --name redis_exporter -p 9121:9121 oliver006/redis_exporter
Add to Prometheus monitoring
Configure the prometheus.yml file
Add a Redis scrape job
- Centralized configuration
scrape_configs:
  - job_name: 'redis_exporter'
    file_sd_configs:
      - files:
          - targets/redis/redis-instances.json
    metrics_path: /scrape
    relabel_configs:
      - action: replace
        source_labels: ['__address__']
        regex: (.*):(.*):(.*)
        replacement: $2
        target_label: hostip
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.10.181:9121
  ## config for scraping the exporter itself
  - job_name: 'redis_exporter_single'
    static_configs:
      - targets:
          - 192.168.10.181:9121
Configure the Redis targets JSON file
vi targets/redis/redis-instances.json
[
  {
    "targets": [ "redis://192.168.10.181:6379", "redis://192.168.10.151:6379" ],
    "labels": { }
  }
]
URI format with a password: redis://:<<PASSWORD>>@<<HOSTNAME>>:<<PORT>>
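The `(.*):(.*):(.*)` relabel regex above carves a `redis://host:port` target into three capture groups. Bash's ERE matcher (the same regex class relabel_configs uses, fully anchored) can show what the second group, used for the hostip label, actually holds; note the scheme's `//` stays in the capture:

```shell
target='redis://192.168.10.181:6379'
# The string has exactly two colons, so the split is forced:
# $1=redis  $2=//192.168.10.181  $3=6379
if [[ $target =~ ^(.*):(.*):(.*)$ ]]; then
  echo "hostip label value: ${BASH_REMATCH[2]}"
fi
```

If you want a clean IP in hostip, a regex like `.*//(.*):(.*)` with replacement `$1` would be one option; the configuration shown here keeps the leading slashes.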
- Standalone configuration
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - progs:9093 # port 9093 of the running Alertmanager node
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/alert_node.yml"
  - "rules/alert_mysql.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node_exporter'
    scrape_interval: 1s
    file_sd_configs:
      - files:
          - targets/node/*.yml
        refresh_interval: 10s
    relabel_configs:
      - action: replace
        source_labels: ['__address__']
        regex: (.*):(.*)
        replacement: $1
        target_label: hostname
      - action: labeldrop
        regex: __meta_filepath
  - job_name: 'redis_exporter'
    scrape_interval: 1s
    file_sd_configs:
      - files:
          - targets/redis/*.yml
    metrics_path: /scrape
    relabel_configs:
      - action: replace
        source_labels: ['__address__']
        regex: (.*):(.*):(.*)
        replacement: $2
        target_label: hostip
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.10.181:9121
Configure the Redis target files
vi targets/redis/redis1_exporter.yml
[
  {
    "targets": [ "redis://192.168.10.181:6379" ],
    "labels": { }
  }
]
vi targets/redis/redis2_exporter.yml
[
  {
    "targets": [ "redis://192.168.10.151:6379" ],
    "labels": { }
  }
]
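JSON is a subset of YAML, so Prometheus's file-based service discovery will accept JSON content in these .yml files. It is still worth syntax-checking a target file before reloading; a small sketch, writing to /tmp only for the demonstration:

```shell
# Write a sample file_sd target file
cat > /tmp/redis1_exporter.yml <<'EOF'
[
  {
    "targets": [ "redis://192.168.10.181:6379" ],
    "labels": { }
  }
]
EOF
# json.tool exits non-zero on a syntax error (a missing comma, stray quote, etc.)
python3 -m json.tool < /tmp/redis1_exporter.yml > /dev/null && echo "valid"
```

A malformed target file does not crash Prometheus, but the targets in it silently disappear, so a pre-flight check saves debugging time.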
Restart Prometheus to load the configuration
# Check the configuration file, then restart
./bin/promtool check config config/prometheus.yml
# Restart the service
systemctl restart prometheus
Alerting components
In Prometheus, alerting is split into two parts:
- The Prometheus server evaluates the configured alerting rules and sends firing alerts to Alertmanager.
- Alertmanager processes the alerts it receives: deduplication, silencing, grouping, and routing notifications according to its policies.
The main steps to use the alerting service are:
- Download and configure Alertmanager.
- Point Prometheus at Alertmanager (the alerting: section of prometheus.yml; legacy 1.x versions used the -alertmanager.url flag) so the two can communicate.
- Define alerting rules on the Prometheus server.
Install the Alertmanager software
Binary installation
mkdir -p /ups/app/monitor/
# Extract (version matches the 0.21.0 tarball downloaded earlier)
tar -xf alertmanager-0.21.0.linux-amd64.tar.gz -C /ups/app/monitor/ --no-same-owner
cd /ups/app/monitor/
mv alertmanager-0.21.0.linux-amd64/ alertmanager
# Create the user
# groupadd -g 2000 prometheus
useradd -r -M -s /sbin/nologin -d /ups/app/monitor/alertmanager -c "Prometheus agent" prometheus
# Create directories
cd /ups/app/monitor/
mkdir -p alertmanager/{bin,logs,config,data}
cd alertmanager
mv alertmanager.yml config/
mv alertmanager amtool bin/
# Change directory ownership
chown -R prometheus.prometheus /ups/app/monitor/alertmanager
Configure the service
# Create the service unit file
cat > /usr/lib/systemd/system/alertmanager.service <<-EOF
[Unit]
Description=alertmanager
Documentation=https://prometheus.io/
After=network.target
[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/ups/app/monitor/alertmanager/bin/alertmanager \
--config.file=/ups/app/monitor/alertmanager/config/alertmanager.yml \
--web.listen-address=192.168.10.181:9093 \
--cluster.listen-address=0.0.0.0:8001 \
--storage.path=/ups/app/monitor/alertmanager/data \
--log.level=info
Restart=on-failure
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
EOF
Basic configuration
cat /ups/app/monitor/alertmanager/config/alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
Start the service
# Load the unit and start the service
systemctl daemon-reload
systemctl enable alertmanager.service
systemctl start alertmanager.service
systemctl status alertmanager
Example
Receiving alerts via WeChat Work
Preparation
- Register a WeChat Work account.
- Create a third-party application: click the "Create Application" button and fill in the application details.
Detailed configuration
prometheus configuration
vi /ups/app/monitor/prometheus/config/prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
rules.yml configuration
# The delimiter is quoted ('EOF') so the shell does not expand {{ $labels.instance }}
cat > /ups/app/monitor/prometheus/config/rules.yml <<-'EOF'
groups:
  - name: node
    rules:
      - alert: server_status
        expr: up{job="node"} == 0
        for: 15s
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
EOF
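A shell pitfall with heredocs like the one above: if the delimiter is unquoted, the shell expands `$labels` before the file is written, silently corrupting the alert template. A quick demonstration:

```shell
unset labels
# Unquoted delimiter: $labels is expanded (to nothing here) and the template breaks
broken=$(cat <<EOF
summary: "{{ $labels.instance }} is down"
EOF
)
# Quoted delimiter: the text is written literally, as Prometheus expects
intact=$(cat <<'EOF'
summary: "{{ $labels.instance }} is down"
EOF
)
echo "$broken"
echo "$intact"
```

The first form prints `{{ .instance }}`, which Prometheus's templating will not resolve; always quote the delimiter (or escape the `$`) when a heredoc contains alert templates.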
alertmanager configuration
cat > /ups/app/monitor/alertmanager/config/alertmanager.yml <<-EOF
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'wechat'
receivers:
  - name: 'wechat'
    wechat_configs:
      - corp_id: 'ww9e5158867cf67d24'
        to_party: '1'
        agent_id: '1000002'
        api_secret: 'eRDqnTEOtlk2DtPiaxOA2w5fFyNhpIPkdQU-6Ty94cI'
EOF
Parameter notes:
- corp_id: the unique ID of the WeChat Work account; found under "My Company".
- to_party: the department (group) to send alerts to.
- agent_id: the ID of the third-party application; shown on the application's detail page.
- api_secret: the secret of the third-party application; shown on the application's detail page.