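Building a Cassandra Monitoring Stack on Linux: Metrics + Prometheus + Grafana
1. Cassandra Metrics Overview
1.1 Metric Types and Sources
Cassandra exposes its metrics through the Dropwizard Metrics library (formerly Codahale), and every metric is exported as a JMX MBean.
Classification:
Namespace: org.apache.cassandra.metrics:type=...
Main categories:
- Compaction: PendingTasks, BytesCompacted
- ThreadPools: Pending / Dropped
New in 5.0: TrieMemtable metrics, SAI (Storage-Attached Index) statistics, UCS (Unified Compaction Strategy) sub-strategies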
1.2 Ways to View Metrics
- nodetool proxyhistograms / tpstats (partial coverage only)
- Prometheus + Grafana (recommended for production)
2. JMX to Prometheus Exporter Configuration
Cassandra has no built-in Prometheus endpoint, so the JMX Exporter is needed (running in javaagent mode).
2.1 Download the JMX Exporter
Version 1.0.1 is used here (check the project's releases for anything newer): https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/1.0.1/jmx_prometheus_javaagent-1.0.1.jar
cd /opt
wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/1.0.1/jmx_prometheus_javaagent-1.0.1.jar
2.2 Configure cassandra-env.sh
Add the following JVM options to /opt/cassandra/conf/cassandra-env.sh:
# JMX Exporter: serve metrics on port 9404 using the rules in prometheus.yaml
JVM_OPTS="$JVM_OPTS -javaagent:/opt/jmx_prometheus_javaagent-1.0.1.jar=9404:/opt/cassandra/conf/prometheus.yaml"
# Optional: authentication for remote JMX access (recommended in production)
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.authenticate=true"
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.password.file=/etc/cassandra/jmxremote.password"
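2.3 Create the prometheus.yaml Configuration File
/opt/cassandra/conf/prometheus.yaml (rules for filtering and renaming):
lowercaseOutputName: true
lowercaseOutputLabelNames: true
# Newer exporter releases name these includeObjectNames / excludeObjectNames.
whitelistObjectNames: ["org.apache.cassandra.metrics:*"]
blacklistObjectNames: []
rules:
  - pattern: 'org.apache.cassandra.metrics<type=(\w+), name=(\w+)><>Value: (\d+)'
    name: cassandra_$1_$2
    value: $3
    labels: {}
    help: "Cassandra metric $1 $2"
    type: GAUGE
  # The attribute ($3) is part of the name so that Count, Mean, OneMinuteRate
  # and 99thPercentile do not collide on one series. No type is set because
  # only Count is a counter; the other attributes are point-in-time values.
  - pattern: 'org.apache.cassandra.metrics<type=ClientRequest, scope=(\w+), name=(\w+)><>(Count|Mean|OneMinuteRate|99thPercentile): (.*)'
    name: cassandra_client_request_$2_$1_$3
    value: $4
    help: "Cassandra client request $2 ($3) for $1"
  # Compaction rule
  - pattern: 'org.apache.cassandra.metrics<type=Compaction, name=(.*)><>(Count|Value): (.*)'
    name: cassandra_compaction_$1
    value: $3
  # Add further rules to filter out noisy metrics as needed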
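Before restarting Cassandra, it can help to check offline what name a rule will produce. The sketch below is only an approximation of the exporter's matching, written with Python's `re` module rather than the Java regex engine the exporter uses. It assumes the exporter presents each attribute as `domain<properties><>attribute: value`, searches patterns unanchored, and lowercases the result when lowercaseOutputName is true. `RULES`, `expand`, and `apply_rules` are illustrative names, not part of the exporter.

```python
import re

# (pattern, name template, value template) in the style of the exporter rules.
RULES = [
    (r"org.apache.cassandra.metrics<type=(\w+), name=(\w+)><>Value: (\d+)",
     "cassandra_$1_$2", "$3"),
    (r"org.apache.cassandra.metrics<type=Compaction, name=(.*)><>(Count|Value): (.*)",
     "cassandra_compaction_$1", "$3"),
]


def expand(template, match):
    """Replace $1, $2, ... in a template with the regex capture groups."""
    return re.sub(r"\$(\d+)", lambda ref: match.group(int(ref.group(1))), template)


def apply_rules(bean_line, rules=RULES, lowercase=True):
    """Return (metric_name, value) from the first matching rule, else None."""
    for pattern, name_template, value_template in rules:
        match = re.search(pattern, bean_line)  # patterns are searched unanchored
        if match is None:
            continue
        name = expand(name_template, match)
        if lowercase:  # mirrors lowercaseOutputName: true
            name = name.lower()
        return name, float(expand(value_template, match))
    return None


pending = "org.apache.cassandra.metrics<type=Compaction, name=PendingTasks><>Value: 3"
scoped = "org.apache.cassandra.metrics<type=ClientRequest, scope=Read, name=Latency><>Count: 42"
print(apply_rules(pending))  # ('cassandra_compaction_pendingtasks', 3.0)
print(apply_rules(scoped))   # None: neither rule here handles the scope= property
```

Note that the rule only lowercases the MBean name, so PendingTasks becomes `pendingtasks` with no underscore. Dashboards and alerts have to use whatever names the rules actually emit.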
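- Port 9404: the exporter listens here and serves /metrics
Verify:
curl http://localhost:9404/metrics | grep cassandra
Production: restrict port 9404 at the firewall so that only Prometheus can reach it.
4.x compatibility: the same configuration applies.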
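The curl check can also be scripted. The following is a minimal sketch that parses the Prometheus text format and keeps only the Cassandra series. It assumes each sample line has the form `series value` with no trailing timestamp. `parse_exposition` and the sample text are illustrative.

```python
def parse_exposition(text, prefix="cassandra_"):
    """Return {series: value} for sample lines whose name starts with prefix."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        series, _, value = line.rpartition(" ")  # the value is the last field
        if series.startswith(prefix):
            samples[series] = float(value)
    return samples


# Hypothetical excerpt of what the exporter might return.
SAMPLE = """\
# HELP cassandra_compaction_pendingtasks Cassandra metric Compaction PendingTasks
# TYPE cassandra_compaction_pendingtasks gauge
cassandra_compaction_pendingtasks 3.0
jvm_threads_current 87.0
"""

print(parse_exposition(SAMPLE))

# Against a live node, the text would come from the exporter instead, e.g.:
#   from urllib.request import urlopen
#   text = urlopen("http://localhost:9404/metrics", timeout=5).read().decode()
```

An empty result with the exporter running usually means the rules or the object-name filter matched nothing.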
3. Prometheus Configuration and Deployment
3.1 Installing Prometheus (single host or Kubernetes)
Prometheus 2.50 or later is recommended.
3.2 Scrape Configuration in prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'cassandra'
    static_configs:
      - targets:
          - '10.0.0.10:9404'
          - '10.0.0.20:9404'
          - '10.0.0.30:9404'
          # list every node
    metrics_path: /metrics
    scheme: http
  # Optionally add node_exporter to monitor the OS
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
- Large clusters: replace static targets with service discovery (for example consul_sd_configs)
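Where Consul is not available, file-based discovery (file_sd_configs) is a simpler alternative to the Consul approach named above. The sketch below generates the JSON target file that Prometheus would watch. `build_file_sd` is an illustrative helper, and the output file name and the way the host list is obtained are left to the reader.

```python
import json


def build_file_sd(hosts, port=9404, labels=None):
    """Build a file_sd document: a list of target groups."""
    group = {"targets": [f"{host}:{port}" for host in hosts]}
    if labels:
        group["labels"] = dict(labels)
    return [group]


document = build_file_sd(["10.0.0.10", "10.0.0.20", "10.0.0.30"])
print(json.dumps(document, indent=2))
```

The printed JSON would be written to a file that the cassandra job references under file_sd_configs. Prometheus picks up changes to that file without a restart.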
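3.3 Verification
Prometheus UI → Targets: the cassandra job should show UP
Example query: cassandra_compaction_pending_tasks (the exact metric name depends on the rules defined in prometheus.yaml)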
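The same check can be made through the Prometheus HTTP API rather than the UI. This sketch builds an instant-query URL for `/api/v1/query` and extracts one value per instance from the JSON response. The base URL `http://localhost:9090` is an assumption (the default Prometheus port), and the response shown is a hand-written sample, not real output.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, promql):
    """URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def extract_values(response_body):
    """Map instance label -> sample value from an instant-query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    return {
        item["metric"].get("instance", ""): float(item["value"][1])
        for item in payload["data"]["result"]
    }


# Hand-written sample response for up{job="cassandra"}.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"instance": "10.0.0.10:9404", "job": "cassandra"},
         "value": [1700000000.0, "1"]},
        {"metric": {"instance": "10.0.0.20:9404", "job": "cassandra"},
         "value": [1700000000.0, "0"]},
    ]},
})

print(build_query_url("http://localhost:9090", 'up{job="cassandra"}'))
down = [inst for inst, val in extract_values(SAMPLE_RESPONSE).items() if val == 0]
print(down)  # instances currently reported as down
```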
4. Building Grafana Dashboards
4.1 Grafana Installation and Data Source
Add Prometheus as a data source.
4.2 Recommended Dashboards to Import (as of 2025)
Official and community dashboard IDs (Grafana.com):
- Official Cassandra Overview: ID 1599 (updated to support 5.0 Trie metrics)
- Detailed Cassandra Monitoring: ID 5336 (community, includes UCS)
- Node Exporter Full: ID 1860 (OS)
To import: Grafana → Import → enter the ID
4.3 Customizing Key Panels
- Cluster overview: mirrors nodetool status (number of Up nodes, Load, Owns skew)
- Read/write latency: cassandra_client_request_latency_99th_percentile
- Compaction: pending_tasks, bytes_compacted
- Storage: live_disk_space_used, total_disk_space
5.0-only panel: trie_memtable_offheap_used
5. Key Metrics Explained (by category)
| Metric | | Notes |
|---|---|---|
| cassandra_client_request_latency_99th_percentile | | P99 latency; above 100 ms, investigate compaction and GC |
| cassandra_client_request_range_slice_count | | |
| cassandra_compaction_pending_tasks | | Above 100, compaction is badly behind; tune concurrent_compactors |
| cassandra_compaction_completed_tasks | | |
| cassandra_storage_total_hints_count | | |
| cassandra_table_live_disk_space_used | | |
| cassandra_thread_pools_pending_tasks | | |
| cassandra_dropped_message_dropped | | |
| cassandra_cache_key_hits / misses | | |
| cassandra_memtable_off_heap_used | | |
Production monitoring scripts: use PromQL to query the trend of these metrics over time.
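As one way to turn a queried series into a trend, the sketch below fits a least-squares line through the `[timestamp, "value"]` pairs that a Prometheus range query (`/api/v1/query_range`) returns for a single series. `slope_per_second` is an illustrative helper, and the sample points are invented.

```python
def slope_per_second(values):
    """Least-squares slope of [[timestamp, "value"], ...] pairs, per second."""
    points = [(float(ts), float(val)) for ts, val in values]
    if len(points) < 2:
        raise ValueError("need at least two samples to estimate a trend")
    mean_t = sum(t for t, _ in points) / len(points)
    mean_v = sum(v for _, v in points) / len(points)
    spread = sum((t - mean_t) ** 2 for t, _ in points)
    if spread == 0:
        raise ValueError("all samples share one timestamp")
    return sum((t - mean_t) * (v - mean_v) for t, v in points) / spread


# Invented samples: pending compactions growing by 6 per minute.
series = [[0, "10"], [60, "16"], [120, "22"]]
growth_per_minute = slope_per_second(series) * 60
print(f"pending tasks are changing by {growth_per_minute:.1f} per minute")
```

A persistently positive slope on pending compactions is often a more useful signal than any single reading, since short spikes after a flush are normal.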
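6. Example Alerting Rules (Alertmanager)
prometheus/rules.yml:
groups:
  - name: cassandra.alerts
    rules:
      - alert: CassandraNodeDown
        expr: up{job="cassandra"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
      - alert: CassandraHighLatency
        # Cassandra reports latency in microseconds: 100000 µs = 100 ms
        expr: cassandra_client_request_latency_99th_percentile > 100000
        for: 10m
        labels:
          severity: warning
      - alert: CompactionBacklog
        expr: cassandra_compaction_pending_tasks > 100
        for: 15m
        labels:
          severity: warning
      - alert: DiskSpaceHigh
        expr: (node_filesystem_free_bytes / node_filesystem_size_bytes) < 0.2
        for: 5m
        labels:
          severity: critical
      - alert: GCLongPause
        # Average seconds per GC over 5m; the JMX Exporter's JVM metrics use
        # the jvm_gc_collection_seconds name (adjust if yours differs)
        expr: rate(jvm_gc_collection_seconds_sum[5m]) / rate(jvm_gc_collection_seconds_count[5m]) > 2
        labels:
          severity: warning
Route notifications to Slack, WeCom, or Feishu.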
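Chat tools generally expect their own message format, so a small adapter usually sits between Alertmanager's generic webhook receiver and the chat webhook. The sketch below covers only the formatting step. It assumes the standard Alertmanager webhook payload, which carries an `alerts` list with `status`, `labels`, and `annotations`. `format_alerts` and the sample payload are illustrative, and posting the text to a specific chat service is not shown.

```python
def format_alerts(payload):
    """Render an Alertmanager webhook payload as one text line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        # Fall back to the alert name when no summary annotation is set.
        summary = annotations.get("summary", labels.get("alertname", "unknown alert"))
        lines.append(
            f"[{alert.get('status', 'unknown').upper()}] "
            f"{labels.get('severity', 'none')}: {summary} "
            f"({labels.get('instance', 'n/a')})"
        )
    return "\n".join(lines)


# Hand-written sample in the shape Alertmanager sends to webhook receivers.
SAMPLE_PAYLOAD = {
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "CassandraNodeDown", "severity": "critical",
                   "instance": "10.0.0.10:9404"},
        "annotations": {"summary": "Node 10.0.0.10:9404 is down"},
    }],
}

print(format_alerts(SAMPLE_PAYLOAD))
# [FIRING] critical: Node 10.0.0.10:9404 is down (10.0.0.10:9404)
```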
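7. Production Best Practices and Common Pitfalls
- Metric noise: use the rules in prometheus.yaml to filter out metrics you do not need
- High cardinality: avoid adding too many labels (instance is usually enough)
- Large clusters: use Thanos or VictoriaMetrics for long-term storage
- Version compatibility: 4.x works with the existing exporter rules; for 5.0, add Trie rules
Common pitfalls:
- Exporter not started → curl on port 9404 gets no response → check cassandra-env.sh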
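A quick way to tell a missing exporter apart from a slow or misconfigured one is to test whether anything is listening on the port at all. This is a minimal sketch using a plain TCP connect. `exporter_port_open` is an illustrative name, and the default host and port are assumptions taken from the configuration above.

```python
import socket


def exporter_port_open(host="localhost", port=9404, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


if not exporter_port_open():
    print("nothing is listening on 9404: check the -javaagent line in cassandra-env.sh")
```

If the port is closed, the agent most likely never loaded. A wrong jar path or a malformed prometheus.yaml typically shows up in the Cassandra startup log.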