Monitoring MySQL and Redis Services on Kubernetes with Prometheus

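In a Kubernetes environment, monitoring critical services such as MySQL and Redis with Prometheus gives you a clear view of system performance and health. This article explains how to deploy the MySQLd Exporter and configure Prometheus to collect its metrics, covering request rates, aborted connections, and service availability. It then walks through the setup and alerting rules for Redis monitoring, and finally instruments the Tornado API application so that API latency and availability are monitored as well.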

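Monitoring MySQL with Prometheus

When you run MySQL on Kubernetes, monitoring the database's performance and health is essential. Prometheus is a popular monitoring system that can collect MySQL metrics. This section shows how to monitor MySQL with Prometheus and the MySQLd Exporter.

How the MySQLd Exporter works

The MySQLd Exporter is a Prometheus exporter that connects to a MySQL server and queries its status to gather metrics. It then exposes those metrics over HTTP so that the Prometheus server can scrape and analyze them.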
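
To make the scrape step more concrete, the following Python sketch parses a hand-written sample of the text format an exporter serves. The sample values and help text are made up for illustration, and the parser is deliberately simplified: it ignores timestamps and escaping and keeps each series (name plus labels) as an opaque string.

```python
# Hypothetical sample of what an exporter might return when scraped.
SAMPLE = """\
# HELP mysql_up Whether the MySQL server is up.
# TYPE mysql_up gauge
mysql_up 1
# TYPE mysql_global_status_slow_queries untyped
mysql_global_status_slow_queries 42
mysql_global_status_commands_total{command="select"} 1027
"""


def parse_exposition(text):
    """Parse simple 'series value' lines into a dict keyed by series."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field on the line.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


metrics = parse_exposition(SAMPLE)
print(metrics["mysql_up"])
```

Prometheus does this parsing itself on every scrape; the sketch only shows that the exporter's output is plain text with one sample per line.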

Deploying the MySQLd Exporter on Kubernetes

To deploy the MySQLd Exporter on Kubernetes, we add a sidecar container to the MySQL Deployment. This container runs the MySQLd Exporter and is configured to connect to the MySQL server.

Configuring the MySQLd Exporter container

- image: prom/mysqld-exporter:latest
  name: tornado-db-exp
  args:
    - --collect.info_schema.innodb_metrics
    - --collect.info_schema.userstats
    - --collect.perf_schema.eventsstatements
    - --collect.perf_schema.indexiowaits
    - --collect.perf_schema.tableiowaits
  env:
    - name: DATA_SOURCE_NAME
      value: "tornado-db-exp:anotherstrongpassword@(tornado-db:3306)/"
  ports:
    - containerPort: 9104
      name: tornado-db-exp

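Configuring MySQL user privileges

For the MySQLd Exporter to connect to the MySQL server and collect metrics, we create a MySQL user with limited privileges.

CREATE USER 'tornado-db-exp'@'localhost' IDENTIFIED BY 'anotherstrongpassword';
-- The account in each GRANT must match the one created above, host part included.
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'tornado-db-exp'@'localhost';
GRANT SELECT ON performance_schema.* TO 'tornado-db-exp'@'localhost';

Explanation:

CREATE USER statement: creates a new MySQL user, tornado-db-exp, and sets its password. The 'localhost' host part only matches connections that MySQL sees as local; if the exporter connects through the tornado-db Service address, as the DATA_SOURCE_NAME above does, the host part may need to be adjusted.
GRANT statement: grants tornado-db-exp the PROCESS, REPLICATION CLIENT, and SELECT privileges so that it can collect MySQL metrics.
SELECT privilege: grants tornado-db-exp SELECT on the performance_schema database so that it can collect query performance data.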

Configuring Prometheus to collect MySQL metrics

To collect MySQL metrics in Prometheus, we define a Service through which Prometheus can scrape them.

Configuring the Service

apiVersion: v1
kind: Service
metadata:
  name: tornado-db
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '9104'
spec:
  selector:
    app: tornado-db
  type: ClusterIP
  ports:
    - port: 3306
      name: tornado-db
    - port: 9104
      name: tornado-db-exp

Configuring relabelling for the Prometheus Kubernetes endpoints job

relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2

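Explanation:

prometheus.io/scrape annotation: tells Prometheus whether to scrape metrics from this Service.
prometheus.io/port annotation: tells Prometheus which port to scrape.
relabel_configs: the first rule keeps only targets whose scrape annotation is true; the second rewrites the target's __address__ so that it uses the annotated port, here the exporter's port 9104 rather than MySQL's 3306.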
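
The address rewrite can be easier to follow outside Prometheus. The Python sketch below mimics the replace rule under two assumptions that match Prometheus's documented behaviour: the source label values are joined with ";" and the regex must match the whole joined string. The IP address and the function name are hypothetical.

```python
import re

# Regex taken from the relabel rule. Prometheus anchors it at both ends,
# which re.fullmatch reproduces here.
RELABEL_REGEX = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address, port_annotation):
    """Mimic the replace action for __address__ and the port annotation."""
    joined = f"{address};{port_annotation}"  # source labels joined with ';'
    match = RELABEL_REGEX.fullmatch(joined)
    if match is None:
        return address  # no match: the target label is left unchanged
    # Equivalent of the replacement string $1:$2.
    return f"{match.group(1)}:{match.group(2)}"


print(rewrite_address("10.0.0.5:3306", "9104"))
```

With the hypothetical address above, the discovered MySQL port is dropped and the exporter port from the annotation takes its place. A target without the port annotation does not match and keeps its original address.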

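Creating MySQL monitoring rules

We can create rules that watch MySQL's performance and health. For example, a rule can alert on the rate of slow queries.

MySQL slow query alerting rule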

- alert: MySQLHighSlowQueryRate
  expr: rate(mysql_global_status_slow_queries[2m]) > 5
  labels:
    severity: warning
  annotations:
    summary: MySQL Slow query rate is exceeded on {{ $labels.instance }} for {{ $labels.kubernetes_name }}

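Explanation:

alert: names the alerting rule, which fires when the slow-query rate exceeds the threshold.
expr: the rule's expression. rate() computes the per-second average increase of the slow-query counter over the last 2 minutes, so the alert fires above 5 slow queries per second.
labels: labels attached to the alert, including its severity.
annotations: annotations attached to the alert, including a summary.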
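
As a rough illustration of what the expression measures, here is a simplified Python sketch of a per-second rate over a window of counter samples. It is not Prometheus's actual algorithm: rate() also handles counter resets and extrapolates to the window edges, both of which are omitted here. The sample values are hypothetical.

```python
def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) sample.

    Simplification: no counter-reset handling and no extrapolation to the
    window boundaries.
    """
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


# Hypothetical slow-query counter samples across a 2-minute window.
window = [(0, 1000), (60, 1400), (120, 1720)]
rate = simple_rate(window)

# The alert condition compares the per-second rate against the threshold of 5.
print(rate, rate > 5)
```

Here the counter grew by 720 in 120 seconds, which is 6 slow queries per second and therefore above the threshold.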

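With this configuration in place, we can monitor MySQL's performance and health and spot potential problems early.

Monitoring stack - Tornado

Recording MySQL request rates

When monitoring MySQL, we also want to know its request rate. The following rules record it: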

- record: mysql:write_requests:rate2m
  expr: sum(rate(mysql_global_status_commands_total{command=~"insert|update|delete"}[2m])) without (command)
- record: mysql:select_requests:rate2m
  expr: sum(rate(mysql_global_status_commands_total{command="select"}[2m]))
- record: mysql:total_requests:rate2m
  expr: rate(mysql_global_status_commands_total[2m])
- record: mysql:top5_statements:rate5m
  expr: topk(5, sum by (schema,digest_text) (rate(mysql_perf_schema_events_statements_total[5m])))

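Explanation:

Write request rate: takes the mysql_global_status_commands_total metric, selects the insert, update, and delete commands, and sums their per-second rates over the last 2 minutes with the command label removed.
Read request rate: likewise, the per-second rate of select commands over the last 2 minutes.
Total request rate: the per-second rate over the last 2 minutes for every command. As written, the expression has no sum(), so it yields one series per command rather than a single total.
Most frequent SQL statements: uses the topk aggregation operator to find the five statements with the highest rate over the last 5 minutes, grouped by schema and digest_text.

These rules help us understand the load on the MySQL server and how it behaves.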
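
The two aggregations used in these rules can be sketched in Python as follows. The rates, instance names, and function names are hypothetical, and the series are reduced to two labels for brevity; in PromQL, "without (command)" keeps every label except command.

```python
from collections import defaultdict

# Hypothetical per-second rates, keyed by (instance, command).
rates = {
    ("db-0:9104", "insert"): 12.0,
    ("db-0:9104", "update"): 5.5,
    ("db-0:9104", "delete"): 0.5,
    ("db-0:9104", "select"): 80.0,
    ("db-1:9104", "insert"): 3.0,
}


def sum_without_command(series, commands):
    """Sum the selected commands, dropping the command label per instance."""
    totals = defaultdict(float)
    for (instance, command), value in series.items():
        if command in commands:
            totals[instance] += value
    return dict(totals)


def topk(k, series):
    """Return the k series with the largest values, largest first."""
    return sorted(series.items(), key=lambda item: item[1], reverse=True)[:k]


writes = sum_without_command(rates, {"insert", "update", "delete"})
print(writes)
print(topk(2, rates))
```

Recording these results under names such as mysql:write_requests:rate2m means dashboards and alerts can query the precomputed series instead of repeating the aggregation.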

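Connections and aborted connections

Beyond request rates, it is also important to watch connections and aborted connections. The relevant rules are: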

- alert: MySQLAbortedConnectionsHigh
  expr: rate(mysql_global_status_aborted_connects[2m]) > 5
  labels:
    severity: warning
  annotations:
    summary: MySQL aborted connection rate is too high on {{ $labels.instance }} for {{ $labels.kubernetes_name }}
- record: mysql:connection:rate2m
  expr: rate(mysql_global_status_connections[2m])

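Explanation:

Aborted connections alert: fires when the rate of aborted connection attempts exceeds 5 per second over the last 2 minutes.
Connection rate recording: records the per-second connection rate so that connection usage can be tracked over time.

MySQL service availability

We also need to monitor the availability of the MySQL service. The relevant alerting rules are: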

- alert: TornadoDBServerDown
  expr: mysql_up{kubernetes_name="tornado-db"} == 0
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: MySQL server {{ $labels.instance }} is down!
- alert: TornadoDBServerGone
  expr: absent(mysql_up{kubernetes_name="tornado-db"})
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: No Tornado DB servers are reporting!

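Explanation:

Server down alert: a mysql_up value of 0 means the exporter cannot get a response from the MySQL server. If this persists for 10 minutes, a critical alert fires.
Server gone alert: absent() returns a result only when no mysql_up series exists for tornado-db at all, for example when the exporter or its target has disappeared. If this persists for 10 minutes, a critical alert fires.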
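
The two rules cover different failure modes, which a small Python sketch can make explicit. It assumes the latest mysql_up values are available as a mapping from instance to value, and it ignores the 10-minute "for" clause; the function name and instance names are hypothetical.

```python
def evaluate_availability(series):
    """Return the availability alerts whose condition currently holds.

    `series` maps an instance name to its latest mysql_up value.
    """
    alerts = []
    if not series:
        # absent(mysql_up{...}): there is no series to compare against 0.
        alerts.append("TornadoDBServerGone")
    elif any(value == 0 for value in series.values()):
        # mysql_up{...} == 0: the exporter reports but cannot reach MySQL.
        alerts.append("TornadoDBServerDown")
    return alerts


print(evaluate_availability({"10.0.0.5:9104": 0}))
print(evaluate_availability({}))
```

A comparison such as "== 0" can never match a series that does not exist, which is why the second rule is needed alongside the first.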

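Redis monitoring

As with MySQL, a Prometheus exporter is available for Redis; the deployment below uses the oliver006/redis_exporter image. The configuration and rules follow.

Redis Kubernetes Deployment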

apiVersion: apps/v1  # apps/v1beta2 is no longer served as of Kubernetes 1.16
kind: Deployment
...
- name: redis-exporter
  image: oliver006/redis_exporter:latest
  env:
    - name: REDIS_ADDR
      value: redis://tornado-redis:6379
    - name: REDIS_PASSWORD
      value: tornadoapi
  ports:
    - containerPort: 9121

Redis Kubernetes Service

apiVersion: v1
kind: Service
metadata:
  name: tornado-redis
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '9121'
spec:
  selector:
    app: tornado-redis
  ports:
    - port: 6379
      name: redis
    - port: 9121
      name: redis-exporter
  clusterIP: None

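Redis alerting rules

- alert: TornadoRedisCacheMissesHigh
  expr: redis_keyspace_misses_total / (redis_keyspace_hits_total + redis_keyspace_misses_total) > 0.8
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: Redis Server {{ $labels.instance }} cache miss ratio is high.
- alert: RedisRejectedConnectionsHigh
  expr: avg(redis_rejected_connections_total) by (addr) > 10
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Redis instance {{ $labels.addr }} may be hitting the maxclients limit."

Explanation:

Cache miss alert: fires a warning when misses account for more than 80% of keyspace lookups for 10 minutes. The numerator is the miss counter, so the ratio matches the alert's name.
Rejected connections alert: fires a warning when the average number of rejected connections per addr stays above 10 for 10 minutes. Both expressions use the raw counters, so they reflect totals since the counters were last reset.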
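
For reference, the ratio that the cache miss alert is named after is misses divided by all keyspace lookups. A minimal Python sketch with hypothetical counter values follows. One difference from PromQL is labeled in the code: dividing by zero in PromQL yields NaN, which never satisfies the comparison, while this sketch returns 0.0 for the same "not firing" effect.

```python
def miss_ratio(hits, misses):
    """Fraction of keyspace lookups that were misses."""
    total = hits + misses
    if total == 0:
        # Assumption for this sketch: treat "no lookups yet" as a ratio of 0.
        return 0.0
    return misses / total


# Hypothetical counter values read from the exporter.
ratio = miss_ratio(hits=100, misses=900)
print(ratio, ratio > 0.8)
```

With 900 misses out of 1000 lookups the ratio is 0.9, which is above the 0.8 threshold used in the rule.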

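With these monitoring and alerting rules, we can track the state of MySQL and Redis in the Tornado stack and respond promptly when problems appear.

Monitoring the Tornado service stack

Redis availability alerts

To keep the Redis service stable, we set up two key alerting rules:

Redis server down alert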

- alert: TornadoRedisServerDown
  expr: redis_up{kubernetes_name="tornado-redis"} == 0
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: Redis server {{ $labels.instance }} is down!

This alert fires once the Redis server has been down for 10 minutes, which suggests a serious problem with that server.

Redis server gone alert

- alert: TornadoRedisServerGone
  expr: absent(redis_up{kubernetes_name="tornado-redis"})
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: No Tornado Redis servers are reporting!
    description: Werner Heisenberg says - the Tornado Redis server being gone is a certainty.

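This alert fires when no Redis server has reported for 10 minutes, which suggests the server may be gone entirely.

Monitoring the Tornado API service

The Tornado API is a Clojure application that uses Ring and runs on the JVM. It has a single API endpoint for buying and selling items. We use iapetos, a Clojure wrapper for the Prometheus client, to instrument the application.

Adding the iapetos dependency

First, add the iapetos dependency to the project.clj file: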

(defproject tornado-api-prometheus "0.1.0-SNAPSHOT"
  :description "Example Clojure REST service"
  :url "http://artofmonitoring.com"
  :dependencies [[org.clojure/clojure "1.8.0"]
                 ...
                 [iapetos "0.1.8"]
                 [io.prometheus/simpleclient_hotspot "0.0.4"]]
  :plugins [[lein-ring "0.7.3"]]
  ...)

This adds the iapetos and Prometheus simpleclient_hotspot dependencies; the latter is used to export JVM metrics.

Initializing the metrics registry

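;; Assumed namespace aliases (not shown in the original listing):
;;   [iapetos.core :as prometheus]
;;   [iapetos.collector.jvm :as jvm]
;;   [iapetos.collector.ring :as ring]
(defonce registry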
  (-> (prometheus/collector-registry)
      (jvm/initialize)
      (ring/initialize)
      (prometheus/register
        (prometheus/counter :tornado/item-get)
        (prometheus/counter :tornado/item-bought)
        (prometheus/counter :tornado/item-sold)
        (prometheus/counter :tornado/update-item)
        (prometheus/gauge :tornado/up))))

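This creates a metrics registry named registry, initializes the JVM and Ring collectors, and registers five custom metrics: four counters and one gauge.

Adding metric descriptions and calling metrics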

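(prometheus/counter :tornado/item-bought
  {:description "Total number of items bought"})

This attaches a description (the metric's help text, not a label) to the item-bought counter.

The metrics are then updated from the API methods, for example: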

(defn buy-item [item]
  (let [id (uuid)]
    (sql/db-do-commands db-config
      (let [item (assoc item "id" id)]
        (sql/insert! db-config :items item)
        (prometheus/inc (registry :tornado/item-bought))))
    (wcar* (car/ping)
           (car/set id (item "title")))
    (get-item id)))

This increments the item-bought counter each time an item is bought.

Exporting metrics

(def app
  (-> (handler/api app-routes)
      (middleware/wrap-json-body)
      (middleware/wrap-json-response)
      (ring/wrap-metrics registry {:path "/metrics"})))

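This enables the /metrics path, which exports the metrics defined above.

Tornado Prometheus configuration

Our Clojure exporter is exposed as an endpoint, so no dedicated scrape job is needed for it; this presumably relies on its Service carrying the same prometheus.io annotations used earlier. We get several kinds of metrics: JVM metrics, Ring HTTP metrics, and the application's own metrics.

Creating a latency recording rule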

- record: tornado:request_latency_seconds:avg
  expr: http_request_latency_seconds_sum{status="200"} / http_request_latency_seconds_count{status="200"}

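This creates a new metric, tornado:request_latency_seconds:avg, which is the average latency of requests that returned HTTP status 200. Because it divides the cumulative sum by the cumulative count, it is an average over all requests observed so far rather than over a recent window.

Creating a high-latency alert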

- alert: TornadoRequestLatencyHigh
  expr: histogram_quantile(0.9, rate(http_request_latency_seconds_bucket{kubernetes_name="tornado-api"}[5m])) > 0.05
  for: 10m
  labels:
    severity: warning
  annotations:
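    summary: API server {{ $labels.instance }} latency is over 0.05 seconds.

The histogram_quantile function estimates the 90th-percentile latency from the bucket rates over the last 5 minutes. The alert fires if that value stays above 0.05 seconds for 10 minutes.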
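
To show roughly how a quantile is estimated from histogram buckets, here is a simplified Python sketch of the interpolation idea. It assumes cumulative buckets sorted by upper bound and ending with +Inf, a lowest bucket that starts at 0, and linear interpolation inside the bucket that contains the target rank. The bucket boundaries and counts are hypothetical, and the real function works on per-second bucket rates and handles more edge cases.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, with math.inf as the last upper bound.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only supports 0 < q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical latency buckets in seconds: (upper bound, cumulative count).
latency_buckets = [(0.01, 50), (0.05, 80), (0.1, 95), (math.inf, 100)]
p90 = histogram_quantile(0.9, latency_buckets)
print(p90, p90 > 0.05)
```

In this example the 90th percentile lands inside the 0.05 to 0.1 second bucket, so the alert condition would hold. The estimate is only as precise as the bucket boundaries, which is worth keeping in mind when choosing a threshold such as 0.05.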

玄貓

A technology enthusiast focused on sharing insights on software development, cloud technologies, and AI applications.