
Prometheus Operator: A More Elegant Monitoring Tool for Kubernetes


[TOC]

1. Introduction to Kubernetes Operators

With Kubernetes, managing and scaling web applications, mobile backends, and API services has become fairly straightforward. These applications are generally stateless, so basic Kubernetes API objects such as Deployment can scale them and recover from failures without any extra operations.

Managing stateful applications such as databases, caches, or monitoring systems, however, is a real challenge. These systems require domain-specific knowledge to scale and upgrade correctly, and to reconfigure effectively when data is lost or unavailable. We want that application-specific operational expertise encoded into software, so that complex applications can be run and managed correctly on top of Kubernetes.

An Operator is software that extends the Kubernetes API using the TPR (Third Party Resource, since upgraded to CRD) mechanism, embedding application-specific knowledge so that users can create, configure, and manage applications. Like Kubernetes built-in resources, an Operator manages not a single application instance but multiple instances across the cluster.

2. Introduction to Prometheus Operator

The Prometheus Operator for Kubernetes provides easy monitoring definitions for Kubernetes services and for the deployment and management of Prometheus instances.

Once installed, the Prometheus Operator provides the following features:

  • Create/Destroy: Easily launch a Prometheus instance for a specific application or team in a Kubernetes namespace using the Operator.
  • Simple Configuration: Configure the fundamentals of Prometheus, such as versions, persistence, retention policies, and replicas, as native Kubernetes resources.
  • Target Services via Labels: Automatically generate monitoring target configurations based on familiar Kubernetes label queries; no need to learn a Prometheus-specific configuration language.
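To illustrate the label-based targeting, a minimal hypothetical ServiceMonitor might look like the following; `example-app`, the `team` label, and the `web` port are placeholder names, but the structure follows the monitoring.coreos.com/v1 API:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app            # hypothetical name
  labels:
    team: frontend             # placeholder label
spec:
  selector:
    matchLabels:
      app: example-app         # scrape every Service carrying this label
  endpoints:
  - port: web                  # named Service port to scrape
    interval: 30s
```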

The Prometheus Operator architecture is shown below:

(Figure: Prometheus Operator architecture)

Each component in the architecture above runs in the Kubernetes cluster as a different kind of resource, and each plays its own role:

Operator: The Operator resource deploys and manages the Prometheus Server according to custom resources (Custom Resource Definitions / CRDs), and watches those custom resources for change events to react accordingly; it is the control center of the whole system.
Prometheus: The Prometheus resource declaratively describes the desired state of a Prometheus deployment.
Prometheus Server: The Prometheus Server cluster that the Operator deploys according to what the Prometheus custom resource defines; these custom resources can be seen as StatefulSets resources for managing the Prometheus Server cluster.
ServiceMonitor: ServiceMonitor is also a custom resource; it describes a list of targets monitored by Prometheus. It selects the corresponding Service endpoints via labels, and the Prometheus Server scrapes metrics from the selected Services.
Service: The Service resource corresponds to the metrics-exposing Pods in the Kubernetes cluster; a ServiceMonitor selects it so that the Prometheus Server can scrape it. Simply put, it is the object Prometheus monitors, such as a Node Exporter Service or a MySQL Exporter Service.
Alertmanager: Alertmanager is also a custom resource type; the Operator deploys an Alertmanager cluster according to its description.
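How these pieces fit together can be sketched with a minimal, hypothetical Prometheus custom resource; all names and label values below are placeholders:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example                # placeholder name
spec:
  replicas: 2                  # the Operator materializes this as a StatefulSet
  serviceMonitorSelector:      # pick up every ServiceMonitor with this label
    matchLabels:
      release: example
  alerting:
    alertmanagers:             # send alerts to this Alertmanager cluster
    - namespace: monitoring
      name: alertmanager-example
      port: web
```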

3. Deploying Prometheus Operator

Environment

  • Kubernetes version: 1.12, installed via kubeadm
  • helm version: v2.11.0

We install with Helm, modifying the prometheus-operator helm chart to fit actual usage. The chart bundles Grafana and the exporters for monitoring Kubernetes. Note that I configured Grafana to store its data in MySQL; the details are covered in another article, "Deploying Prometheus and Grafana with Helm to Monitor Kubernetes".
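As a rough sketch of that Grafana-on-MySQL setup: the keys under `grafana.ini` follow Grafana's standard `[database]` settings, the host assumes the in-cluster `monitoring-mysql-mysql` Service, and the database name and credentials are placeholders (the exact values.yaml layout may differ by chart version):

```yaml
grafana:
  grafana.ini:
    database:
      type: mysql
      host: monitoring-mysql-mysql:3306   # MySQL Service inside the cluster
      name: grafana                       # placeholder database name
      user: grafana                       # placeholder credentials
      password: xxxxxx
```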

cd helm/prometheus-operator/
helm install --name prometheus-operator --namespace monitoring -f values.yaml ./

To use the Prometheus Operator more flexibly, adding custom monitoring targets is essential. Here we use ceph-exporter as an example.

The following section of values.yaml is what adds the monitoring via a ServiceMonitor:

serviceMonitor:
  enabled: true  # enable monitoring
  # on what port are the metrics exposed by ceph-exporter
  exporterPort: 9128
  # for apps that are deployed outside of the cluster, list their addresses here
  endpoints: []
  # Are we talking http or https?
  scheme: http
  # service selector label key to target ceph exporter pods
  serviceSelectorLabelKey: app
  # default rules are in templates/ceph-exporter.rules.yaml
  prometheusRules: {}
  # Custom labels to be added to the ServiceMonitor
  # In testing, adding the prometheus-operator release label to the
  # ServiceMonitor is enough for it to be monitored normally
  additionalServiceMonitorLabels:
    release: prometheus-operator
  # Custom labels to be added to the Prometheus Rules CRD
  additionalRulesLabels: {}

The most important parameter is additionalServiceMonitorLabels: in testing, a ServiceMonitor must carry a label that the Prometheus Operator already has before its targets are successfully monitored.
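The reason the release label is the one that works can be seen in the Prometheus custom resource the chart renders: its serviceMonitorSelector selects on that label. A sketch of the relevant excerpt (the exact rendered structure may differ slightly between chart versions):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-operator-prometheus
spec:
  serviceMonitorSelector:
    matchLabels:
      release: prometheus-operator   # only ServiceMonitors carrying this label are scraped
```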

# kubectl get servicemonitor -n monitoring ceph-exporter -o yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  creationTimestamp: 2018-10-30T06:51:12Z
  generation: 1
  labels:
    app: ceph-exporter
    chart: ceph-exporter-0.1.0
    heritage: Tiller
    prometheus: ceph-exporter
    release: prometheus-operator
  name: ceph-exporter
  namespace: monitoring
  resourceVersion: "13937459"
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/monitoring/servicemonitors/ceph-exporter
  uid: 30569173-dc10-11e8-bcf3-000c293d66a5
spec:
  endpoints:
  - interval: 30s
    port: http
  namespaceSelector:
    matchNames:
    - monitoring
  selector:
    matchLabels:
      app: ceph-exporter
      release: ceph-exporter
# kubectl get pod -n monitoring prometheus-operator-operator-7459848949-8dddt -o yaml | more
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: 2018-10-30T00:39:37Z
  generateName: prometheus-operator-operator-7459848949-
  labels:
    app: prometheus-operator-operator
    chart: prometheus-operator-0.1.6
    heritage: Tiller
    pod-template-hash: "7459848949"
    release: prometheus-operator

Key points:

  • The ServiceMonitor's labels must include at least one that matches the labels on the prometheus-operator Pod;
  • The ServiceMonitor's spec parameters must be set correctly;
  • The Service must be reachable by Prometheus, with all endpoints healthy;
  • When you hit problems, enable debug logging for the Prometheus Operator and for Prometheus. The logs carry little other information, but the Operator's debug log lists the ServiceMonitors it currently sees, so you can confirm whether the one you installed is being matched.
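A sketch of turning on that debug logging: the Prometheus custom resource exposes a `logLevel` field, while the operator itself takes a command-line flag (the exact flag name, `--log-level`, is an assumption for this chart version):

```yaml
# Prometheus custom resource: raise Prometheus's own log verbosity
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-operator-prometheus
spec:
  logLevel: debug
# For the operator itself, add to the Deployment's container args
# (flag name assumed): --log-level=debug
```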

After the installation succeeds, inspect the related resources:

# kubectl get service,servicemonitor,ep -n monitoring
NAME                                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/alertmanager-operated                          ClusterIP   None             <none>        9093/TCP,6783/TCP   12d
service/ceph-exporter                                  ClusterIP   10.100.57.62     <none>        9128/TCP            46h
service/monitoring-mysql-mysql                         ClusterIP   10.108.93.155    <none>        3306/TCP            42d
service/prometheus-operated                            ClusterIP   None             <none>        9090/TCP            12d
service/prometheus-operator-alertmanager               ClusterIP   10.98.42.209     <none>        9093/TCP            6d19h
service/prometheus-operator-grafana                    ClusterIP   10.103.100.150   <none>        80/TCP              6d19h
service/prometheus-operator-kube-state-metrics         ClusterIP   10.110.76.250    <none>        8080/TCP            6d19h
service/prometheus-operator-operator                   ClusterIP   None             <none>        8080/TCP            6d19h
service/prometheus-operator-prometheus                 ClusterIP   10.111.24.83     <none>        9090/TCP            6d19h
service/prometheus-operator-prometheus-node-exporter   ClusterIP   10.97.126.74     <none>        9100/TCP            6d19h

NAME                                                                               AGE
servicemonitor.monitoring.coreos.com/ceph-exporter                                 1d
servicemonitor.monitoring.coreos.com/prometheus-operator                           8d
servicemonitor.monitoring.coreos.com/prometheus-operator-alertmanager              6d
servicemonitor.monitoring.coreos.com/prometheus-operator-apiserver                 6d
servicemonitor.monitoring.coreos.com/prometheus-operator-coredns                   6d
servicemonitor.monitoring.coreos.com/prometheus-operator-kube-controller-manager   6d
servicemonitor.monitoring.coreos.com/prometheus-operator-kube-etcd                 6d
servicemonitor.monitoring.coreos.com/prometheus-operator-kube-scheduler            6d
servicemonitor.monitoring.coreos.com/prometheus-operator-kube-state-metrics        6d
servicemonitor.monitoring.coreos.com/prometheus-operator-kubelet                   6d
servicemonitor.monitoring.coreos.com/prometheus-operator-node-exporter             6d
servicemonitor.monitoring.coreos.com/prometheus-operator-operator                  6d
servicemonitor.monitoring.coreos.com/prometheus-operator-prometheus                6d

NAME                                                     ENDPOINTS                                                                 AGE
endpoints/alertmanager-operated                          10.244.6.174:9093,10.244.6.174:6783                                       12d
endpoints/ceph-exporter                                  10.244.2.59:9128                                                          46h
endpoints/monitoring-mysql-mysql                         10.244.6.171:3306                                                         42d
endpoints/prometheus-operated                            10.244.2.60:9090,10.244.6.175:9090                                        12d
endpoints/prometheus-operator-alertmanager               10.244.6.174:9093                                                         6d19h
endpoints/prometheus-operator-grafana                    10.244.6.106:3000                                                         6d19h
endpoints/prometheus-operator-kube-state-metrics         10.244.2.163:8080                                                         6d19h
endpoints/prometheus-operator-operator                   10.244.6.113:8080                                                         6d19h
endpoints/prometheus-operator-prometheus                 10.244.2.60:9090,10.244.6.175:9090                                        6d19h
endpoints/prometheus-operator-prometheus-node-exporter   192.168.105.92:9100,192.168.105.93:9100,192.168.105.94:9100 + 4 more...   6d19h

4. Adding Grafana dashboards

The _dashboards directory in the prometheus-operator chart above contains dashboards I have modified; they are fairly comprehensive. Import them manually through the Grafana UI, and you can then edit them freely afterwards, which is very convenient in practice. If instead you place the dashboard JSON files in the dashboards directory and install them with Helm, the installed dashboards cannot be edited directly in Grafana, which is rather awkward to work with.

5. Adding alerts to Alertmanager

Add a PrometheusRule; the following is an example:

# kubectl get prometheusrule -n monitoring ceph-exporter -o yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: 2018-10-30T06:51:12Z
  generation: 1
  labels:
    app: prometheus
    chart: ceph-exporter-0.1.0
    heritage: Tiller
    prometheus: ceph-exporter
    release: ceph-exporter
  name: ceph-exporter
  namespace: monitoring
  resourceVersion: "13965150"
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/monitoring/prometheusrules/ceph-exporter
  uid: 30543ec9-dc10-11e8-bcf3-000c293d66a5
spec:
  groups:
  - name: ceph-exporter.rules
    rules:
    - alert: Ceph
      annotations:
        description: There is no running ceph exporter.
        summary: Ceph exporter is down
      expr: absent(up{job="ceph-exporter"} == 1)
      for: 5m
      labels:
        severity: critical

The default rules for monitoring Kubernetes are already numerous and comprehensive; you can adjust prometheus-operator/templates/all-prometheus-rules.yaml yourself.

Alerting rules and routing can be modified in the alertmanager: section of values.yaml, shown below:

  config:
    global:
      resolve_timeout: 5m
      # The smarthost and SMTP sender used for mail notifications.
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'xxxxxx@163.com'
      smtp_auth_username: 'xxxxxx@163.com'
      smtp_auth_password: 'xxxxxx'
      # The API URL to use for Slack notifications.
      slack_api_url: 'https://hooks.slack.com/services/some/api/token'
    route:
      group_by: ["job", "alertname"]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'noemail'
      routes:
      - match:
          severity: critical
        receiver: critical_email_alert
      - match_re:
          alertname: "^KubeJob*"
        receiver: default_email

    receivers:
      - name: 'default_email'
        email_configs:
        - to: 'xxxxxx@163.com'
          send_resolved: true

      - name: 'critical_email_alert'
        email_configs:
        - to: 'xxxxxx@163.com'
          send_resolved: true

      - name: 'noemail'
        email_configs:
        - to: 'xxxxxx@163.com'
          send_resolved: false

  ## Alertmanager template files to format alerts
  ## ref: https://prometheus.io/docs/alerting/notifications/
  ##      https://prometheus.io/docs/alerting/notification_examples/
  ##
  templateFiles:
    template_1.tmpl: |-
      {{ define "cluster" }}{{ .ExternalURL | reReplaceAll ".*alertmanager\\.(.*)" "$1" }}{{ end }}

      {{ define "slack.k8s.text" }}
      {{- $root := . -}}
      {{ range .Alerts }}
       *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
       *Cluster:*  {{ template "cluster" $root }}
       *Description:* {{ .Annotations.description }}
       *Graph:* <{{ .GeneratorURL }}|:chart_with_upwards_trend:>
       *Runbook:* <{{ .Annotations.runbook }}|:spiral_note_pad:>
       *Details:*
         {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
         {{ end }}
      {{ end }}
      {{ end }}

6. Summary

With the Prometheus Operator, simply defining ServiceMonitor and PrometheusRule resources dynamically adjusts the Prometheus and Alertmanager configuration. This fits Kubernetes operational habits much more naturally and makes Kubernetes monitoring more elegant.
