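Kafka Monitoring in Practice (jmxtrans+InfluxDB+Grafana)
I. Preface
Since last week I have been looking for a good Kafka monitoring tool. I tried KafkaOffsetMonitor, Burrow, kafka-monitor, and Kafka-Manager. Each has strengths and weaknesses, which I won't go into here; see their Git repositories for details. They mostly monitor topic writes and reads, and none of them reports on the cluster as a whole: partitions, latency, memory usage, and so on. Then I came across jmxtrans, a collector that gathers data from Java applications over JMX and can write to Graphite, StatsD, Ganglia, InfluxDB, and other backends. Our existing monitoring already stores data in InfluxDB and displays it with Grafana, so this article walks through a complete jmxtrans+InfluxDB+Grafana solution for monitoring Kafka. It needs no extra development work and uses only stock components.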
II. Environment
1. Roles
a. 10.10.10.10 InfluxDB
b. 10.10.10.100 Grafana
c. 10.10.30.69 jmxtrans
d. Kafka cluster
10.10.20.14 node1
10.10.20.15 node2
10.10.20.16 node3
10.10.20.17 node4
2. Software versions
influxdb-1.2.4-1.x86_64
grafana-4.1.1-1484211277.x86_64
jmxtrans-266.rpm
kafka_2.10-0.9.0.0.jar.asc
3. Architecture diagram
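III. Configuration planning
1. jmxtrans can be deployed on every Kafka node or on a single machine. I chose a single machine because my cluster is small and this keeps the configuration files in one place. For a larger cluster, consider deploying it on each node.
2. The jmxtrans configuration files are split into global metrics (per Kafka node) and topic metrics. Global metrics use one file per node, named like base_10.10.20.14.json. Topic metrics use one file per topic on each node, named like falcon_monitor_us_17.json.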
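The naming rule above can be sketched in a few lines of Python. This is only an illustration: the helper names are made up, and the "last octet of the broker IP" suffix is inferred from the example file names rather than stated as a rule.

```python
def base_config_name(broker_ip: str) -> str:
    """File name for a broker's global-metrics config, e.g. base_10.10.20.14.json."""
    return f"base_{broker_ip}.json"


def topic_config_name(topic: str, broker_ip: str) -> str:
    """File name for one topic on one broker.

    Assumption: the numeric suffix is the last octet of the broker IP.
    """
    last_octet = broker_ip.rsplit(".", 1)[-1]
    return f"{topic}_{last_octet}.json"


brokers = ["10.10.20.14", "10.10.20.15", "10.10.20.16", "10.10.20.17"]
for ip in brokers:
    print(base_config_name(ip), topic_config_name("falcon_monitor_us", ip))
```

With four brokers and N topics this gives 4 base files plus 4 x N topic files, which is why keeping them all on one jmxtrans host is convenient for a small cluster.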
IV. Metrics
1. Global metrics
Inbound traffic per second
"obj" : "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec" "attr" : [ "Count" ] "resultAlias":"BytesInPerSec" "tags" : {"application" : "BytesInPerSec"}
Outbound traffic per second
"obj" : "kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec" "attr" : [ "Count" ] "resultAlias":"BytesOutPerSec" "tags" : {"application" : "BytesOutPerSec"}
Rejected traffic per second
"obj" : "kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec" "attr" : [ "Count" ] "resultAlias":"BytesRejectedPerSec" "tags" : {"application" : "BytesRejectedPerSec"}
Total messages written per second
"obj" : "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec" "attr" : [ "Count" ] "resultAlias":"MessagesInPerSec" "tags" : {"application" : "MessagesInPerSec"}
FetchFollower requests per second
"obj" : "kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchFollower" "attr" : [ "Count" ] "resultAlias":"RequestsPerSec" "tags" : {"request" : "FetchFollower"}
FetchConsumer requests per second
"obj" : "kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchConsumer" "attr" : [ "Count" ] "resultAlias":"RequestsPerSec" "tags" : {"request" : "FetchConsumer"}
Produce requests per second
"obj" : "kafka.network:type=RequestMetrics,name=RequestsPerSec,request=Produce" "attr" : [ "Count" ] "resultAlias":"RequestsPerSec" "tags" : {"request" : "Produce"}
Memory usage
"obj" : "java.lang:type=Memory" "attr" : [ "HeapMemoryUsage", "NonHeapMemoryUsage" ] "resultAlias":"MemoryUsage" "tags" : {"application" : "MemoryUsage"}
GC time and count
"obj" : "java.lang:type=GarbageCollector,name=*" "attr" : [ "CollectionCount","CollectionTime" ] "resultAlias":"GC" "tags" : {"application" : "GC"}
Thread usage
"obj" : "java.lang:type=Threading" "attr" : [ "PeakThreadCount","ThreadCount" ] "resultAlias":"Thread" "tags" : {"application" : "Thread"}
Maximum number of messages by which a follower replica lags behind its leader
"obj" : "kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica" "attr" : [ "Value" ] "resultAlias":"ReplicaFetcherManager" "tags" : {"application" : "MaxLag"}
Number of partitions on this broker
"obj" : "kafka.server:type=ReplicaManager,name=PartitionCount" "attr" : [ "Value" ] "resultAlias":"ReplicaManager" "tags" : {"application" : "PartitionCount"}
Number of under-replicated partitions (partitions whose replicas have not caught up)
"obj" : "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions" "attr" : [ "Value" ] "resultAlias":"ReplicaManager" "tags" : {"application" : "UnderReplicatedPartitions"}
Number of leader replicas
"obj" : "kafka.server:type=ReplicaManager,name=LeaderCount" "attr" : [ "Value" ] "resultAlias":"ReplicaManager" "tags" : {"application" : "LeaderCount"}
Total time spent on a FetchConsumer request
"obj" : "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer" "attr" : [ "Count","Max" ] "resultAlias":"TotalTimeMs" "tags" : {"application" : "FetchConsumer"}
Total time spent on a FetchFollower request
"obj" : "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower" "attr" : [ "Count","Max" ] "resultAlias":"TotalTimeMs" "tags" : {"application" : "FetchFollower"}
Total time spent on a Produce request
"obj" : "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce" "attr" : [ "Count","Max" ] "resultAlias":"TotalTimeMs" "tags" : {"application" : "Produce"}
2. Topic metrics
Inbound traffic per second for falcon_monitor_us
"obj" : "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=falcon_monitor_us" "attr" : [ "Count" ] "resultAlias":"falcon_monitor_us" "tags" : {"application" : "BytesInPerSec"}
Outbound traffic per second for falcon_monitor_us
"obj" : "kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec,topic=falcon_monitor_us" "attr" : [ "Count" ] "resultAlias":"falcon_monitor_us" "tags" : {"application" : "BytesOutPerSec"}
Messages written per second to falcon_monitor_us
"obj" : "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=falcon_monitor_us" "attr" : [ "Count" ] "resultAlias":"falcon_monitor_us" "tags" : {"application" : "MessagesInPerSec"}
Latest offset of each partition of falcon_monitor_us
"obj" : "kafka.log:type=Log,name=LogEndOffset,topic=falcon_monitor_us,partition=*" "attr" : [ "Value" ] "resultAlias":"falcon_monitor_us" "tags" : {"application" : "LogEndOffset"}
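Notes:
1. Parameters
"obj" is the JMX ObjectName, that is, the metric we want to monitor.
"attr" is an attribute of that ObjectName; think of it as the value of the metric we want to monitor.
"resultAlias" is the metric name; in InfluxDB it becomes the measurement name.
"tags" maps to InfluxDB tags. They distinguish different metrics stored in the same measurement, and we use them when building graphs in Grafana. I recommend tagging every metric.
2. For global monitoring, each metric (or group of related metrics) maps to one measurement, and all Kafka nodes write the same metric into the same measurement. For topic monitoring, all Kafka nodes write a topic's metrics into a single measurement named after the topic.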
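To show how the four parameters fit together, here is a minimal Python sketch that assembles one entry of the "queries" array. The function name build_query is hypothetical; the writer class, URL, credentials, and database name are simply the values used in the configuration files later in this article, where "tags" sits inside the InfluxDB output writer rather than next to "obj".

```python
import json

# Output writer class used by the configuration files in this article.
INFLUX_WRITER = "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory"


def build_query(obj, attr, result_alias, tags,
                url="http://10.10.10.10:8086/", username="root",
                password="root", database="jmxDB"):
    """Return one entry for the "queries" array of a jmxtrans config.

    obj          -> JMX ObjectName to read
    attr         -> attributes of that ObjectName to collect
    result_alias -> measurement name in InfluxDB
    tags         -> InfluxDB tags attached by the output writer
    """
    return {
        "obj": obj,
        "attr": list(attr),
        "resultAlias": result_alias,
        "outputWriters": [{
            "@class": INFLUX_WRITER,
            "url": url,
            "username": username,
            "password": password,
            "database": database,
            "tags": dict(tags),
        }],
    }


query = build_query(
    "kafka.server:type=ReplicaManager,name=LeaderCount",
    ["Value"], "ReplicaManager", {"application": "LeaderCount"},
)
print(json.dumps(query, indent=2))
```

Because PartitionCount, UnderReplicatedPartitions, and LeaderCount all share the ReplicaManager measurement, the "application" tag is what lets a Grafana query pick one of them.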
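V. Installation
1. Kafka
I won't cover installing the Kafka cluster in detail. The important part is how Kafka is started: because we collect its monitoring data over JMX, the JMX port must be enabled at startup, like this:
cd /data/kafka/bin/
JMX_PORT=9999 nohup ./kafka-server-start.sh ../config/server.properties >/dev/null 2>&1 &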
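After restarting the brokers this way, it can help to confirm from the jmxtrans host that port 9999 is listening on each node. The following is a small sketch using only the Python standard library; the function names are made up for illustration. A successful TCP connection only shows that something is listening on the port. It does not prove that a full JMX session works, since JMX over RMI may also use a second port.

```python
import socket


def jmx_port_open(host: str, port: int = 9999, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report_brokers(brokers, port: int = 9999) -> None:
    """Print one line per broker saying whether its JMX port accepts connections."""
    for broker in brokers:
        state = "reachable" if jmx_port_open(broker, port) else "NOT reachable"
        print(f"{broker}:{port} {state}")
```

Calling report_brokers(["10.10.20.14", "10.10.20.15", "10.10.20.16", "10.10.20.17"]) from 10.10.30.69 would check the four nodes used in this article.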
2. InfluxDB
yum -y install influxdb    ## install
/etc/init.d/influxdb start    ## start the service
# influx
Connected to http://localhost:8086 version 1.3.2
InfluxDB shell version: 1.3.2
> CREATE USER "root" WITH PASSWORD '123456' WITH ALL PRIVILEGES    ## add an account
>
3. Grafana
yum -y install grafana    ## install
/etc/init.d/grafana-server start    ## start the service
4. jmxtrans
wget http://central.maven.org/maven2/org/jmxtrans/jmxtrans/266/jmxtrans-266.rpm
rpm -ivh jmxtrans-266.rpm    ## install
/etc/init.d/jmxtrans start    ## start
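VI. Configuration
This section covers how to write the jmxtrans collection configuration files and what to watch for when building graphs in Grafana. Kafka and InfluxDB configuration is not covered.
1. jmxtrans
a. By default jmxtrans reads the configuration files under /var/lib/jmxtrans, so all the files for collecting Kafka metrics go in that directory. This is how I name them: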
# ll
total 96
-rw-r--r-- 1 root root 1657 Aug 18 17:03 article-feedback-10min-json_14.json
-rw-r--r-- 1 root root 1657 Aug 18 17:03 article-feedback-10min-json_15.json
-rw-r--r-- 1 root root 1657 Aug 18 17:04 article-feedback-10min-json_16.json
-rw-r--r-- 1 root root 1657 Aug 18 17:04 article-feedback-10min-json_17.json
-rw-r--r-- 1 root root 8430 Aug 22 08:24 base_10.10.20.14.json
-rw-r--r-- 1 root root 8431 Aug 22 08:24 base_10.10.20.15.json
-rw-r--r-- 1 root root 8431 Aug 22 08:25 base_10.10.20.16.json
-rw-r--r-- 1 root root 8431 Aug 22 08:25 base_10.10.20.17.json
-rw-r--r-- 1 root root 2027 Aug 21 16:19 falcon_monitor_us_14.json
-rw-r--r-- 1 root root 2027 Aug 21 16:20 falcon_monitor_us_15.json
-rw-r--r-- 1 root root 2484 Aug 21 20:58 falcon_monitor_us_16.json
-rw-r--r-- 1 root root 2027 Aug 21 16:20 falcon_monitor_us_17.json
-rw-r--r-- 1 root root 2147 Aug 21 17:43 highgmp-articles-through-primary_14.json
-rw-r--r-- 1 root root 2147 Aug 21 17:46 highgmp-articles-through-primary_15.json
-rw-r--r-- 1 root root 2147 Aug 21 17:46 highgmp-articles-through-primary_16.json
-rw-r--r-- 1 root root 2147 Aug 21 17:47 highgmp-articles-through-primary_17.json
# pwd
/var/lib/jmxtrans
b. Global monitoring configuration file, using 10.10.20.14 as the example:
# cat base_10.10.20.14.json
{
  "servers" : [ {
    "port" : "9999",
    "host" : "10.10.20.14",
    "queries" : [ {
      "obj" : "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec",
      "attr" : [ "Count", "OneMinuteRate" ],
      "resultAlias" : "BytesInPerSec",
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
        "url" : "http://10.10.10.10:8086/",
        "username" : "root",
        "password" : "root",
        "database" : "jmxDB",
        "tags" : {"application" : "BytesInPerSec"}
      } ]
    }, {
      "obj" : "kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec",
      "attr" : [ "Count", "OneMinuteRate" ],
      "resultAlias" : "BytesOutPerSec",
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
        "url" : "http://10.10.10.10:8086/",
        "username" : "root",
        "password" : "root",
        "database" : "jmxDB",
        "tags" : {"application" : "BytesOutPerSec"}
      } ]
    }, {
      "obj" : "kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec",
      "attr" : [ "Count", "OneMinuteRate" ],
      "resultAlias" : "BytesRejectedPerSec",
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
        "url" : "http://10.10.10.10:8086/",
        "username" : "root",
        "password" : "root",
        "database" : "jmxDB",
        "tags" : {"application" : "BytesRejectedPerSec"}
      } ]
    }, {
      "obj" : "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec",
      "attr" : [ "Count", "OneMinuteRate" ],
      "resultAlias" : "MessagesInPerSec",
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
        "url" : "http://10.10.10.10:8086/",
        "username" : "root",
        "password" : "root",
        "database" : "jmxDB",
        "tags" : {"application" : "MessagesInPerSec"}
      } ]
    }, {
      "obj" : "kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchConsumer",
      "attr" : [ "Count" ],
      "resultAlias" : "RequestsPerSec",
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
        "url" : "http://10.10.10.10:8086/",
        "username" : "root",
        "password" : "root",
        "database" : "jmxDB",
        "tags" : {"request" : "FetchConsumer"}
      } ]
    }, {
      "obj" : "kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchFollower",
      "attr" : [ "Count" ],
      "resultAlias" : "RequestsPerSec",
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
        "url" : "http://10.10.10.10:8086/",
        "username" : "root",
        "password" : "root",
        "database" : "jmxDB",
        "tags" : {"request" : "FetchFollower"}
      } ]
    }, {
      "obj" : "kafka.network:type=RequestMetrics,name=RequestsPerSec,request=Produce",
      "attr" : [ "Count" ],
      "resultAlias" : "RequestsPerSec",
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
        "url" : "http://10.10.10.10:8086/",
        "username" : "root",
        "password" : "root",
        "database" : "jmxDB",
        "tags" : {"request" : "Produce"}
      } ]
    }, {
      "obj" : "java.lang:type=Memory",
      "attr" : [ "HeapMemoryUsage", "NonHeapMemoryUsage" ],
      "resultAlias" : "MemoryUsage",
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
        "url" : "http://10.10.10.10:8086/",
        "username" : "root",
        "password" : "root",
        "database" : "jmxDB",
        "tags" : {"application" : "MemoryUsage"}
      } ]
    }, {
      "obj" : "java.lang:type=GarbageCollector,name=*",
      "attr" : [ "CollectionCount", "CollectionTime" ],
      "resultAlias" : "GC",
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
        "url" : "http://10.10.10.10:8086/",
        "username" : "root",
        "password" : "root",
        "database" : "jmxDB",
        "tags" : {"application" : "GC"}
      } ]
    }, {
      "obj" : "java.lang:type=Threading",
      "attr" : [ "PeakThreadCount", "ThreadCount" ],
      "resultAlias" : "Thread",
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
        "url" : "http://10.10.10.10:8086/",
        "username" : "root",
        "password" : "root",
        "database" : "jmxDB",
        "tags" : {"application" : "Thread"}
      } ]
    }, {
      "obj" : "kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica",
      "attr" : [ "Value" ],
      "resultAlias" : "ReplicaFetcherManager",
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
        "url" : "http://10.10.10.10:8086/",
        "username" : "root",
        "password" : "root",
        "database" : "jmxDB",
        "tags" : {"application" : "MaxLag"}
      } ]
    }, {
      "obj" : "kafka.server:type=ReplicaManager,name=PartitionCount",
      "attr" : [ "Value" ],
      "resultAlias" : "ReplicaManager",
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
        "url" : "http://10.10.10.10:8086/",
        "username" : "root",
        "password" : "root",
        "database" : "jmxDB",
        "tags" : {"application" : "PartitionCount"}
      } ]
    }, {
      "obj" : "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions",
      "attr" : [ "Value" ],
      "resultAlias" : "ReplicaManager",
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
        "url" : "http://10.10.10.10:8086/",
        "username" : "root",
        "password" : "root",
        "database" : "jmxDB",
        "tags" : {"application" : "UnderReplicatedPartitions"}
      } ]
    }, {
      "obj" : "kafka.server:type=ReplicaManager,name=LeaderCount",
      "attr" : [ "Value" ],
      "resultAlias" : "ReplicaManager",
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
        "url" : "http://10.10.10.10:8086/",
        "username" : "root",
        "password" : "root",
        "database" : "jmxDB",
        "tags" : {"application" : "LeaderCount"}
      } ]
    }, {
      "obj" : "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer",
      "attr" : [ "Count", "Max" ],
      "resultAlias" : "TotalTimeMs",
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
        "url" : "http://10.10.10.10:8086/",
        "username" : "root",
        "password" : "root",
        "database" : "jmxDB",
        "tags" : {"application" : "FetchConsumer"}
      } ]
    }, {
      "obj" : "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower",
      "attr" : [ "Count", "Max" ],
      "resultAlias" : "TotalTimeMs",
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
        "url" : "http://10.10.10.10:8086/",
        "username" : "root",
        "password" : "root",
        "database" : "jmxDB",
        "tags" : {"application" : "FetchFollower"}
      } ]
    }, {
      "obj" : "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce",
      "attr" : [ "Count", "Max" ],
      "resultAlias" : "TotalTimeMs",
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
        "url" : "http://10.10.10.10:8086/",
        "username" : "root",
        "password" : "root",
        "database" : "jmxDB",
        "tags" : {"application" : "Produce"}
      } ]
    }, {
      "obj" : "kafka.server:type=ReplicaManager,name=IsrShrinksPerSec",
      "attr" : [ "Count" ],
      "resultAlias" : "ReplicaManager",
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
        "url" : "http://10.10.10.10:8086/",
        "username" : "root",
        "password" : "root",
        "database" : "jmxDB",
        "tags" : {"application" : "IsrShrinksPerSec"}
      } ]
    } ]
  } ]
}
c. Topic monitoring configuration file, using falcon_monitor_us on node 10.10.20.14 as the example:
# cat falcon_monitor_us_14.json
{
  "servers" : [ {
    "port" : "9999",
    "host" : "10.10.20.14",
    "queries" : [ {
      "obj" : "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=falcon_monitor_us",
      "attr" : [ "Count" ],
      "resultAlias" : "falcon_monitor_us",
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
        "url" : "http://10.10.10.10:8086/",
        "username" : "root",
        "password" : "root",
        "database" : "jmxDB",
        "tags" : {"application" : "BytesInPerSec"}
      } ]
    }, {
      "obj" : "kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec,topic=falcon_monitor_us",
      "attr" : [ "Count" ],
      "resultAlias" : "falcon_monitor_us",
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
        "url" : "http://10.10.10.10:8086/",
        "username" : "root",
        "password" : "root",
        "database" : "jmxDB",
        "tags" : {"application" : "BytesOutPerSec"}
      } ]
    }, {
      "obj" : "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=falcon_monitor_us",
      "attr" : [ "Count" ],
      "resultAlias" : "falcon_monitor_us",
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
        "url" : "http://10.10.10.10:8086/",
        "username" : "root",
        "password" : "root",
        "database" : "jmxDB",
        "tags" : {"application" : "MessagesInPerSec"}
      } ]
    }, {
      "obj" : "kafka.log:type=Log,name=LogEndOffset,topic=falcon_monitor_us,partition=*",
      "attr" : [ "Value" ],
      "resultAlias" : "falcon_monitor_us",
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
        "url" : "http://10.10.10.10:8086/",
        "username" : "root",
        "password" : "root",
        "database" : "jmxDB",
        "tags" : {"application" : "LogEndOffset"}
      } ]
    } ]
  } ]
}
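Since the topic files differ only in host and topic name, one file can serve as a template for the rest. The sketch below is one possible way to do that in Python, not part of jmxtrans itself: the function name retarget is hypothetical, and the template is a trimmed-down stand-in for falcon_monitor_us_14.json with the output writers left out for brevity.

```python
import copy
import json


def retarget(config: dict, host: str, old_topic: str, new_topic: str) -> dict:
    """Copy a topic config, pointing it at another broker and/or topic."""
    result = copy.deepcopy(config)  # leave the template untouched
    for server in result["servers"]:
        server["host"] = host
        for query in server["queries"]:
            # Swap only the exact topic=<name> property of the ObjectName.
            parts = query["obj"].split(",")
            parts = [f"topic={new_topic}" if p == f"topic={old_topic}" else p
                     for p in parts]
            query["obj"] = ",".join(parts)
            if query["resultAlias"] == old_topic:
                query["resultAlias"] = new_topic
    return result


# Trimmed-down stand-in for falcon_monitor_us_14.json (output writers omitted).
template = {
    "servers": [{
        "port": "9999",
        "host": "10.10.20.14",
        "queries": [{
            "obj": "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=falcon_monitor_us",
            "attr": ["Count"],
            "resultAlias": "falcon_monitor_us",
        }],
    }]
}

clone = retarget(template, "10.10.20.15", "falcon_monitor_us",
                 "article-feedback-10min-json")
print(json.dumps(clone, indent=2))
```

In practice the template would be loaded with json.load from an existing file and each clone written back with json.dump under the naming rule from Section III.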
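2. Grafana configuration
a. Add a data source
Url, Database, User, and Password must match the values in the jmxtrans collection configuration files. Then click Save&Test; if it reports success, the data source is working.
b. Create a dashboard, then configure a graph for each metric in it.
c. Key points
1. For metrics collected as Count, Grafana has to do some arithmetic to produce the value we actually want. Take BytesInPerSec: the collected value is a cumulative total, so to get traffic per second we compute (current sample - previous sample) / 60, because jmxtrans collects once a minute. The settings are as follows:
Because we collect once a minute, set both group by and derivative to 1 minute. Because we want traffic per second, divide by 60 in math.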
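The arithmetic behind this panel setting can be illustrated in Python. The sample values below are invented, and the handling of a negative delta is my own choice for the sketch: the Count attribute lives in the broker's memory, so it starts again from zero after a restart, and a plain subtraction would then produce a large negative rate.

```python
def per_second_rates(counts, interval_seconds=60):
    """Turn successive cumulative Count samples into per-second rates.

    Returns one rate per pair of neighbouring samples. A negative delta
    (the counter restarted from zero) yields None instead of a rate.
    """
    rates = []
    for previous, current in zip(counts, counts[1:]):
        delta = current - previous
        rates.append(delta / interval_seconds if delta >= 0 else None)
    return rates


# Hypothetical BytesInPerSec Count samples taken one minute apart.
samples = [1_200_000, 1_260_000, 1_350_000, 30_000]
print(per_second_rates(samples))  # [1000.0, 1500.0, None]
```

In InfluxQL terms the panel described above corresponds roughly to derivative(mean("Count"), 1m) / 60 grouped by time(1m), assuming jmxtrans stores the attribute in a field named Count. InfluxQL also offers non_negative_derivative, which drops negative results in a similar way.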
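2. Choosing the Y-axis unit, for example a traffic unit, a time unit, or no unit for messages per second. One example of each follows.
Traffic unit: click the graph, choose "Edit" to open the edit page, switch to the Axes tab, then Unit -> data (Metric) -> bytes
Time unit: click the graph, choose "Edit" to open the edit page, switch to the Axes tab, then Unit -> time -> milliseconds (ms)
Raw value with no unit: click the graph, choose "Edit" to open the edit page, switch to the Axes tab, then Unit -> none -> none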
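VII. Lessons learned
1. Working out which Kafka metrics JMX exposes, and what type each value has, took many detours. I searched Google and Baidu and found lists other people had compiled, then tried them one by one. Many did not work, either because they were written incorrectly or because the syntax differs between versions. In the end I found jconsole, which can connect to a local or remote JMX port and shows every metric being collected. Install the JDK on Windows and you will find the tool in its bin directory.
2. On consumer lag: the official documentation describes metrics with type=consumer-fetch-manager-metrics, but I could not find them anywhere when connecting with jconsole. If anyone using this monitoring setup can explain why, I would appreciate the help. The official list of monitoring metrics is here: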
http://kafka.apache.org/documentation/#monitoring
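Names copied from jconsole, as in item 1 above, follow the JMX ObjectName form domain:key=value,key=value, which is exactly what goes into "obj". As a quick sanity check before pasting a name into a configuration file, a small Python helper can split it apart. This is a sketch with a made-up function name, and it handles only plain, unquoted values like the ones used in this article.

```python
from typing import Dict, Tuple


def parse_object_name(obj: str) -> Tuple[str, Dict[str, str]]:
    """Split 'domain:key=value,key=value' into (domain, properties).

    Handles only plain, unquoted values.
    """
    domain, _, props = obj.partition(":")
    properties = {}
    for pair in props.split(","):
        key, _, value = pair.partition("=")
        properties[key] = value
    return domain, properties


domain, props = parse_object_name(
    "kafka.log:type=Log,name=LogEndOffset,topic=falcon_monitor_us,partition=*"
)
print(domain)
print(props)
```

A missing "name" or "type" key in the result is a quick hint that the string was copied incompletely.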
This article is from the "屌絲運維男" blog. Please do not repost.