使用 Prometheus 和 Grafana 監控 Spark 應用

阿新 • • 發佈：2018-12-13

文章目錄

背景
實現

技術方案
採集資料寫入資料庫
dashboard 配置

效果
相關檔案
參考

背景

每個開發者都想了解自己任務執行時的狀態，便於調優及排錯，Spark 提供的 webui 已經提供了很多資訊，使用者可以從上面瞭解到任務的 shuffle，任務執行等資訊，但是執行時 Executor JVM 的狀態對使用者來說是個黑盒，在應用記憶體不足報錯時，初級使用者可能不瞭解程式究竟是 Driver 還是 Executor 記憶體不足，從而也無法正確的去調整引數。

Spark 的度量系統提供了相關資料，我們需要做的只是將其採集並展示。

實現

技術方案

後端儲存使用 Prometheus，類似的時序資料庫還有 influxDB/opentsdb 等。
前端展示使用的 Grafana，也可以使用 Graphite 或者自己繪圖。

這套方案最大的好處就是所有的元件都是開箱即用。

在叢集規模較大的情況下，建議可以先將指標採集到 kafka，然後再消費寫入資料庫。這樣做對採集和資料庫進行了解耦，還能在一定程度上能提高吞吐量，並且只需要實現一個 Kafka Sink，不需要對每個資料庫進行適配。建議使用現成輪子：jvm-profiler

版本資訊：
grafana-5.2.4
graphite_exporter-0.3.0
prometheus-2.3.2

採集資料寫入資料庫

spark 預設沒有 Prometheus Sink ，這時候一般需要去自己實現一個，例如 spark-metrics。

其實 prometheus 還提供了一個外掛（graphite_exporter），可以將 Graphite metrics 進行轉化並寫入 Prometheus （本文的方式），spark 是自帶 Graphite Sink 的，這下省事了，只需要配置一把就可以生效了。

/path/to/spark/conf/metrics.properties

*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=<metrics_hostname>
*.sink.graphite.port=<metrics_port>
*.sink.graphite.period=5
*.sink.graphite.unit=seconds

driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource

提交時記得使用 --files /path/to/spark/conf/metrics.properties 引數將配置檔案分發到所有的 Executor，否則將採集不到相應的 executor 資料。

啟動應用後，如果採集成功，將在 http://<metrics_hostname>:<metrics_port>/metrics 頁面中看到相應的資訊。

例如：

# HELP application_1533838659288_1030_driver_CodeGenerator_compilationTime_count Graphite metric application_1533838659288_1030.driver.CodeGenerator.compilationTime.count
# TYPE application_1533838659288_1030_driver_CodeGenerator_compilationTime_count gauge
application_1533838659288_1030_driver_CodeGenerator_compilationTime_count 2

原生的 Graphite 資料可以通過對映檔案轉化為有 label 維度的 Prometheus 資料。
例如：

mappings:
- match: '*.*.jvm.*.*'
  name: jvm_memory_usage
  labels:
    application: $1
    executor_id: $2
    mem_type: $3
    qty: $4

上述檔案會將資料轉化成 metric name 為 jvm_memory_usage，label 為 application，executor_id，mem_type，qty 的格式。

application_1533838659288_1030_1_jvm_heap_usage -> jvm_memory_usage{application="application_1533838659288_1030",executor_id="driver",mem_type="heap",qty="usage"}

啟動 graphite_exporter 時載入配置檔案
./graphite_exporter --graphite.mapping-config=graphite_exporter_mapping

配置 Prometheus 從 graphite_exporter 獲取資料
/path/to/prometheus/prometheus.yml

scrape_configs:
  - job_name: 'spark'
    static_configs:
    - targets: ['localhost:9108']

dashboard 配置

增加 Prometheus 資料來源
這裡寫圖片描述

將 application label 加入 Variables 用於篩選不同的應用
這裡寫圖片描述

配置相應的圖表
這裡寫圖片描述

效果

這裡寫圖片描述

參考

Monitoring Spark on Hadoop with Prometheus and Grafana

使用 Prometheus 和 Grafana 監控 Spark 應用

文章目錄

背景

實現

技術方案

採集資料寫入資料庫

dashboard 配置

效果

相關檔案

參考

使用 Prometheus 和 Grafana 監控 Spark 應用

使用Prometheus和Grafana監控服務

基於Prometheus和Grafana進行效能監控_Kubernetes中文社群

基於Prometheus和Grafana的監控平臺 - 環境搭建

基於Prometheus和Grafana的監控平臺 - 運維告警

【Springboot】用Prometheus+Grafana監控Springboot應用

基於Prometheus和Grafana打造業務監控看板

incubator-dolphinscheduler 如何在不寫任何新程式碼的情況下，能快速接入到prometheus和grafana中進行監控

[Istio] 外部訪問Istio自帶的Prometheus和Grafana

Docker整合Prometheus、Grafana監控Mysql

Docker整合Prometheus、Grafana監控MongoDB

Docker整合Prometheus、Grafana監控RabbitMQ

Telegraf和Grafana監控多平臺上的SQL Server

Telegraf和Grafana監控多平臺上的SQL Server-自定義監控資料收集

kubernetes1.15極速部署prometheus和grafana

部署 Prometheus 和 Grafana 到 k8s

.NetCore下使用Prometheus實現系統監控和警報（三）整合Grafana

.NetCore下使用Prometheus實現系統監控和警報（六）進階Grafana集成自定義收集指標

zabbix企業應用：通過SNMP和iDRAC監控DELL服務器硬件

graphite+grafana監控openstack和ceph

使用 Prometheus 和 Grafana 監控 Spark 應用

文章目錄

背景

實現

技術方案

採集資料寫入資料庫

dashboard 配置

效果

相關檔案

參考

相關推薦