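A monitoring system is like a round-the-clock health monitor fitted to your servers. I have seen too many teams install whatever monitoring tool was at hand early in a project just to tick the box. Then, when a server actually fails, they find that the metrics are incomplete or the alerts arrive too late, and all they can do is stare at a service that is down. For small and mid-sized companies, Prometheus plus Grafana is about as close to a gold-standard pairing as it gets.
Prometheus handles metric collection and storage, like a tireless data collector. It pulls metrics from its targets instead of waiting for them to be pushed, so a failed scrape is itself visible (the target's `up` metric drops to 0) rather than data silently going missing as it can in a push-based setup. I especially like its multi-dimensional data model, and PromQL is powerful enough to let you analyze monitoring data much as you would with SQL. Once, when a server's CPU suddenly spiked, I used the histogram_quantile function to trace the cause to a single API endpoint whose response time had degraded. The whole investigation took less than five minutes.
Grafana is the visualization specialist. It supports a very wide range of panel types, from basic line charts to heatmaps and all sorts of gauges. What pleased me most is its native Prometheus support: once the data source is configured, you can query with PromQL directly. Last year, for an e-commerce client, I used Grafana variables to build dashboards that filter dynamically by data center and by service. When the client's CTO saw them, he cancelled the plan to buy other monitoring tools.
This stack is a particularly good fit for: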
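Prometheus is not resource-hungry, but for production I recommend at least 2 CPU cores and 4 GB of RAM. Last year a client ran Prometheus on a 1-core, 2 GB machine, and once the metric volume grew it crashed with OOM errors regularly. For storage, the Prometheus documentation puts the on-disk cost at roughly 1 to 2 bytes per sample after compression, so I estimate with 1.5 bytes. Keep in mind that one metric name usually fans out into many time series (one per label combination), so size against the series count. Scraping 1,000 series every 15 seconds and keeping them for 30 days needs roughly:
1,000 series × 4 samples/min × 1,440 min/day × 30 days × 1.5 bytes ≈ 260 MB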
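The same estimate can be wrapped in a small helper so it is easy to rerun with other numbers. This is a minimal sketch: the function names are made up for illustration, and the 1.5 bytes per sample default is an assumption based on the 1 to 2 bytes per sample figure in the Prometheus storage documentation, so measure your own server before trusting the result.

```python
def estimate_tsdb_bytes(series, scrape_interval_s, retention_days, bytes_per_sample=1.5):
    """Rough on-disk size of a Prometheus TSDB, in bytes.

    bytes_per_sample is an assumption; real values depend on how
    compressible your data is.
    """
    samples_per_series = retention_days * 86_400 / scrape_interval_s
    return series * samples_per_series * bytes_per_sample


def human(n_bytes):
    """Format a byte count using decimal units (1 GB = 10**9 bytes)."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n_bytes < 1000 or unit == "TB":
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1000


# 1,000 series, 15-second scrape interval, 30 days of retention
print(human(estimate_tsdb_bytes(1000, 15, 30)))
```

Multiplying the series count by 1,000 (one million series) scales the result linearly, which is a quick way to see when a single disk stops being enough.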
In real projects I usually:
Plan your ports before going to production to avoid confusion later. This is the port list I normally use:
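| Service | Default port | Production recommendation | Notes |
|---|---|---|---|
| Prometheus | 9090 | 9090 | Best left unchanged for compatibility |
| node_exporter | 9100 | 19100 | Avoids clashing with application ports |
| mysqld_exporter | 9104 | 19104 | Same as above |
| Grafana | 3000 | 3000 | Keep the default for the front-end port |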
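A special reminder: if your servers sit behind a security group (Alibaba Cloud ECS, for example), open these ports in advance. I once handled an incident in the early hours and found that monitoring data had been missing for a whole day. The cause turned out to be a security group that did not allow port 9100.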
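A quick reachability check run from the Prometheus host can catch a blocked port before it costs a day of data. The sketch below uses only the standard library. The port map copies the default ports from the table above, and the function names are illustrative assumptions, not part of any monitoring tool.

```python
import socket

# Default ports from the table above; adjust if you use the 19xxx variants.
DEFAULT_PORTS = {
    "prometheus": 9090,
    "node_exporter": 9100,
    "mysqld_exporter": 9104,
    "grafana": 3000,
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, DNS failure
        return False


def closed_ports(host, ports=DEFAULT_PORTS, timeout=2.0):
    """Return the names of the services whose port could not be reached."""
    return sorted(name for name, port in ports.items()
                  if not port_is_open(host, port, timeout))
```

Calling `closed_ports("node1")` from the Prometheus server returns an empty list when everything is reachable. A TCP connect only proves that the port is open, not that the exporter is serving valid metrics.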
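Never run these services as root! I learned this the hard way: a compromised node_exporter once led to the whole server being taken over. The right approach is: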
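# Create dedicated groups and users without a login shell
groupadd prometheus
groupadd grafana
useradd -g prometheus -s /bin/false prometheus
useradd -g grafana -s /bin/false grafana
# Set directory ownership (the grafana group must exist for the second chown to work)
chown -R prometheus:prometheus /usr/local/prometheus
chown -R grafana:grafana /usr/local/grafana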
For MySQL monitoring, create a read-only account instead of using root directly:
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'ComplexPassword123!';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
Download the latest stable release from the official site (2.46.0 at the time of writing):
wget https://github.com/prometheus/prometheus/releases/download/v2.46.0/prometheus-2.46.0.linux-amd64.tar.gz
tar xzvf prometheus-*.tar.gz -C /usr/local/
mv /usr/local/prometheus-* /usr/local/prometheus
In production I strongly recommend managing the service with systemd. This is the unit template I normally use:
# /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
ExecStart=/usr/local/prometheus/prometheus \
  --config.file=/usr/local/prometheus/prometheus.yml \
  --storage.tsdb.path=/data/prometheus \
  --web.enable-lifecycle \
  --storage.tsdb.retention.time=60d \
  --storage.tsdb.retention.size=100GB \
  --web.listen-address=0.0.0.0:9090
Restart=always
RestartSec=30s

[Install]
WantedBy=multi-user.target
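Key parameters:
- --web.enable-lifecycle: enables the HTTP endpoints for reloading the configuration and shutting down
- --storage.tsdb.retention.time: keeps data for 60 days
- --storage.tsdb.retention.size: caps disk usage at 100 GB
Load and start the service:
systemctl daemon-reload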
systemctl enable --now prometheus
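With --web.enable-lifecycle set, Prometheus accepts a POST to /-/reload to re-read its configuration without a restart. Here is a minimal sketch of that call using only the standard library. The function names and the default base URL are assumptions for illustration.

```python
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Build the POST request that asks Prometheus to reload its config."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url="http://localhost:9090", timeout=10):
    """Send the reload request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError on a non-2xx reply, for example
    when the new configuration file fails to load.
    """
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status
```

Validating the file first with `promtool check config` is a sensible habit, since a rejected reload leaves the old configuration running.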
The default configuration file needs adjusting for your environment. These are the sections to focus on:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

rule_files:
  - 'alert.rules'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'nodes'
    static_configs:
      - targets: ['node1:9100', 'node2:9100']
        labels:
          env: 'production'

  - job_name: 'mysql'
    params:
      # collector names match the exporter's --collect.* flags
      collect[]:
        - global_status
        - info_schema.innodb_metrics
    static_configs:
      - targets: ['db-master:9104']
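One constraint in the global block is easy to trip over: Prometheus refuses to load a configuration in which scrape_timeout is greater than scrape_interval. The check below is a rough sketch of that rule in Python. The helper names are illustrative, and the duration parser covers only the common unit suffixes.

```python
import re

_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def parse_duration(text):
    """Convert a Prometheus-style duration such as '15s' or '1h30m' to seconds."""
    parts = re.findall(r"(\d+)(ms|[smhdwy])", text)
    # Reject input with leftovers the pattern did not consume, e.g. '1.5s'.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"not a valid duration: {text!r}")
    return sum(int(n) * _UNITS[u] for n, u in parts)


def check_scrape_timing(scrape_interval, scrape_timeout):
    """Return an error message if the timeout exceeds the interval, else None."""
    if parse_duration(scrape_timeout) > parse_duration(scrape_interval):
        return (f"scrape_timeout {scrape_timeout} is greater than "
                f"scrape_interval {scrape_interval}")
    return None
```

For the values in the configuration above, `check_scrape_timing("15s", "10s")` returns None. In practice `promtool check config` performs this and many other checks, so treat the sketch as an explanation of the rule, not a replacement for it.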
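Practical tips:
By default Prometheus cuts a new block every 2 hours and later compacts older blocks into larger ones in the background. On a heavily loaded instance you can tune this. Note that setting both block durations to 2h stops blocks from being merged, which is also the setting a Thanos sidecar expects:
--storage.tsdb.max-block-duration=2h \
--storage.tsdb.min-block-duration=2h \
--storage.tsdb.wal-compression \
--storage.tsdb.retention.time=60d
If the data volume is large (more than one million time series), I recommend: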
node_exporter is the core of basic server monitoring. Installation steps:
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xzvf node_exporter-*.tar.gz
mv node_exporter-*/node_exporter /usr/local/bin/
In production I recommend enabling these collectors:
[Service]
ExecStart=/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.processes \
  --collector.tcpstat \
  --collector.netdev \
  --collector.filesystem \
  --collector.meminfo \
  --collector.cpu
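To sanity-check what the exporter actually exposes, it helps to read a few samples from its /metrics output by hand. The sketch below parses label-free lines of the text exposition format and derives memory usage from two gauges published by the meminfo collector. The parser is deliberately simplified and the function names are illustrative assumptions.

```python
def parse_samples(exposition_text):
    """Parse label-free samples from Prometheus text format into a dict."""
    samples = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments and series that carry labels.
        if not line or line.startswith("#") or "{" in line:
            continue
        name, value = line.split()[:2]
        samples[name] = float(value)
    return samples


def memory_used_percent(samples):
    """Memory in use as a percentage, based on MemAvailable."""
    total = samples["node_memory_MemTotal_bytes"]
    available = samples["node_memory_MemAvailable_bytes"]
    return (1 - available / total) * 100
```

The equivalent PromQL expression is `(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100`, which is usually the better place to do this once Prometheus is scraping the host.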
How to read the key metrics:
MySQL monitoring calls for particular care with privileges. Create a dedicated account:
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'StrongPassword!';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
Example .my.cnf configuration file:
[client]
user=exporter
password=StrongPassword!
host=127.0.0.1
port=3306
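Because this file holds a password in plain text, it should be readable only by the account that runs the exporter. The following sketch writes the same [client] section with mode 0600. The function name is hypothetical, and it assumes a POSIX system and a password without characters that need quoting in a MySQL option file.

```python
import configparser
import os


def write_client_cnf(path, user, password, host="127.0.0.1", port=3306):
    """Write a MySQL client option file that only its owner can read."""
    cfg = configparser.ConfigParser(interpolation=None)  # keep '%' literal
    cfg["client"] = {
        "user": user,
        "password": password,
        "host": host,
        "port": str(port),
    }
    # Create the file with 0600 from the start so it is never world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as fh:
        cfg.write(fh, space_around_delimiters=False)
    os.chmod(path, 0o600)  # also tightens a file that already existed
```

After writing the file, change its owner to the exporter's service account so that the process can still read it.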
Recommended startup flags:
ExecStart=/usr/local/bin/mysqld_exporter \
  --config.my-cnf=/etc/.my.cnf \
  --collect.global_status \
  --collect.info_schema.processlist \
  --collect.info_schema.innodb_metrics \
  --collect.info_schema.tables \
  --collect.perf_schema.eventswaits
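Key items to monitor:
Grafana Enterprise adds features such as reporting on top of the open-source edition (alert management is available in both). To install it:
wget https://dl.grafana.com/enterprise/release/grafana-enterprise-10.1.5.linux-amd64.tar.gz
tar xzvf grafana-enterprise-*.tar.gz -C /usr/local/
# Rename the extracted directory so that it matches the /usr/local/grafana paths used in the rest of this article
mv /usr/local/grafana-* /usr/local/grafana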
HTTPS is mandatory in production. Edit grafana.ini:
[server]
protocol = https
http_port = 3000
domain = grafana.yourcompany.com
cert_file = /path/to/cert.pem
cert_key = /path/to/key.pem
Example systemd service configuration:
[Service]
Environment="GF_SECURITY_ADMIN_PASSWORD=ComplexPassword123!"
Environment="GF_PATHS_CONFIG=/usr/local/grafana/conf/grafana.ini"
Environment="GF_PATHS_DATA=/var/lib/grafana"
Environment="GF_PATHS_LOGS=/var/log/grafana"
ExecStart=/usr/local/grafana/bin/grafana-server \
  --homepath=/usr/local/grafana \
  --config=/usr/local/grafana/conf/grafana.ini
Points to watch when adding the Prometheus data source:
I recommend importing these official dashboards:
How to import:
Create an alert.rules file for Prometheus:
groups:
  - name: host
    rules:
      - alert: HighCPU
        # rate() instead of irate(): the Prometheus docs recommend rate() for alerting, because irate() only looks at the last two samples
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"
      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value }}% space left on /"
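The HighCPU expression is easier to trust once the arithmetic is spelled out. The sketch below redoes it on two snapshots of the per-core idle counters and then imitates the `for:` clause. It is a simplification for illustration only: the real rate function also handles counter resets and extrapolation, and the function names are made up.

```python
def cpu_usage_percent(idle_start, idle_end, window_s):
    """CPU usage % from per-core idle-seconds counters taken window_s apart.

    idle_start and idle_end map a core id to the value of
    node_cpu_seconds_total{mode="idle"} at the two points in time.
    """
    # Per-core idle rate: idle seconds gained per second of wall-clock time.
    idle_rates = [(idle_end[core] - idle_start[core]) / window_s for core in idle_start]
    # avg by(instance) over the cores, then invert to get the busy share.
    return 100 - (sum(idle_rates) / len(idle_rates)) * 100


def should_fire(usage_history, threshold=80.0):
    """Imitate `for:`: fire only if every evaluation in the hold period breached."""
    return bool(usage_history) and all(u > threshold for u in usage_history)
```

With two cores that were idle for 75 s and 150 s of a 300 s window, the idle rates are 0.25 and 0.5, so usage comes out at 62.5% and the alert stays silent. A single evaluation below the threshold inside the 10-minute hold period resets the pending state.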
Suggested alert severity levels:
Once you are monitoring more than 100 targets, adjust these parameters:
--storage.tsdb.max-block-duration=2h \
--storage.tsdb.min-block-duration=2h \
--storage.tsdb.wal-compression \
--query.max-concurrency=20 \
--query.timeout=2m
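Monitor the health of Prometheus itself:
Problem 1: Prometheus runs out of memory
Fix:
Problem 2: Grafana panels show "No data"
Checks:
Problem 3: Alerts do not fire
Troubleshooting:
For business-critical monitoring, deploy multiple instances: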
                  +-----------------+
                  |  Load Balancer  |
                  +--------+--------+
                           |
            +--------------+--------------+
            |                             |
 +----------v----------+       +----------v----------+
 | Prometheus Server 1 |       | Prometheus Server 2 |
 |      (shard 1)      |       |      (shard 2)      |
 +----------+----------+       +----------+----------+
            |                             |
            +--------------+--------------+
                           |
                  +--------v--------+
                  |     Thanos      |
                  | (Query/Gateway) |
                  +-----------------+
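One way to decide which shard scrapes which target is a stable hash of the target address, taken modulo the number of shards. The sketch below shows the idea in Python. It is similar in spirit to the hashmod relabel action that Prometheus offers, but it is an illustrative assumption and is not guaranteed to produce the same assignment as Prometheus would.

```python
import hashlib


def shard_for(target, shard_count):
    """Map a scrape target to a shard index in [0, shard_count)."""
    digest = hashlib.md5(target.encode("utf-8")).digest()
    # Use the last 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[-8:], "big") % shard_count


def split_targets(targets, shard_count=2):
    """Group targets by shard; the result is stable across runs and hosts."""
    shards = [[] for _ in range(shard_count)]
    for target in sorted(targets):
        shards[shard_for(target, shard_count)].append(target)
    return shards
```

Because the assignment depends only on the target string, every shard can compute it independently without coordination. The trade-off is that changing the shard count moves many targets to a different shard.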
Key implementation points:
First deploy metrics-server:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
Then install kube-state-metrics:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-state-metrics prometheus-community/kube-state-metrics
Install the Prometheus Operator with Helm, using the kube-prometheus-stack chart:
helm install prometheus prometheus-community/kube-prometheus-stack \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set grafana.sidecar.dashboards.enabled=true
Example of the key CRD configuration:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-monitor
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: web
      interval: 30s
      path: /metrics
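Add annotations to the Deployment's pod template so that its pods are discovered automatically. These annotations are a convention, not a built-in feature: they only take effect when a scrape job's relabel rules read them.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  # selector and containers omitted for brevity
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"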
Use Prometheus relabel_configs to handle more complex label logic:
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # Keep only pods annotated with prometheus.io/scrape: "true"
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    # Rewrite the scrape address to use the port from the prometheus.io/port annotation
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__
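The address-rewrite rule is the part people most often get wrong, so it is worth tracing by hand. Prometheus joins the source label values with ";" and matches the regex against the whole string. The sketch below reproduces that behavior with Python's re module; the function name and sample addresses are illustrative.

```python
import re

# Same pattern as in the relabel rule above.
ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address, port_annotation):
    """Apply the __address__ rewrite to one pod.

    Returns the address unchanged when the regex does not match,
    which is what a `replace` action does.
    """
    joined = f"{address};{port_annotation}"  # source labels joined with ';'
    match = ADDRESS_RE.fullmatch(joined)     # the regex is anchored at both ends
    if match is None:
        return address
    return f"{match.group(1)}:{match.group(2)}"  # replacement: $1:$2


print(rewrite_address("10.0.0.5:9102", "8080"))
print(rewrite_address("10.0.0.5", "8080"))
```

Both calls yield 10.0.0.5:8080: the first group captures the host, the optional non-capturing group discards any existing port, and the second group supplies the annotated port. A pod without the port annotation keeps its discovered address.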
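Options for backing up Prometheus data:
# Take a TSDB snapshot through the admin API (Prometheus must be started with --web.enable-admin-api)
curl -XPOST http://prometheus:9090/api/v1/admin/tsdb/snapshot
# The promtool subcommand is "create-blocks-from openmetrics"; it builds TSDB blocks from an OpenMetrics text file, not from a data directory
promtool tsdb create-blocks-from openmetrics <input file> /backup/prometheus
# Sync the snapshots directory instead of the live TSDB directory, whose files change while the sync runs
aws s3 sync /data/prometheus/snapshots/ s3://my-backup/prometheus/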
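The snapshot endpoint replies with a small JSON document that names the new snapshot, and the snapshot itself is written under the snapshots subdirectory of the TSDB data directory. The following sketch turns that reply into a path a backup job can copy. The function name is hypothetical, and the default data directory is the one used in the systemd unit earlier.

```python
import json
import os


def snapshot_dir(response_body, data_dir="/data/prometheus"):
    """Return the on-disk directory of a snapshot from the admin API's reply."""
    reply = json.loads(response_body)
    if reply.get("status") != "success":
        raise RuntimeError(f"snapshot failed: {reply.get('error', 'unknown error')}")
    return os.path.join(data_dir, "snapshots", reply["data"]["name"])
```

A backup script can pass the resulting directory to its copy or upload step and delete the snapshot afterwards, since snapshots are not removed automatically.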
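Steps for a safe upgrade:
A special reminder: when upgrading from 2.x to 3.x, be aware that changes to the WAL format may make startup slower.
Run these maintenance tasks regularly: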
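# WARNING: as written this deletes every dashboard the search returns; narrow the search (for example with the query parameter) before running it
# The delete endpoint takes a UID, so select .uid (not .id) and use jq -r to strip the quotes
curl -s -XGET http://grafana:3000/api/search | jq -r '.[] | select(.type=="dash-db") | .uid' | xargs -I{} curl -XDELETE http://grafana:3000/api/dashboards/uid/{}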
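Since the delete endpoint is keyed by UID and a bulk delete is hard to undo, it can be safer to build and review the list first. The sketch below filters a parsed /api/search response and produces the delete URLs without calling them. The `title_prefix` filter and both function names are hypothetical additions for illustration.

```python
def dashboard_uids(search_results, title_prefix=None):
    """Pick dashboard UIDs out of a parsed Grafana /api/search response."""
    uids = []
    for item in search_results:
        if item.get("type") != "dash-db":
            continue  # skip folders
        if title_prefix is not None and not item.get("title", "").startswith(title_prefix):
            continue  # keep only dashboards whose title marks them as disposable
        uids.append(item["uid"])
    return uids


def delete_urls(base_url, uids):
    """Build the DELETE URLs for review; nothing is sent."""
    return [f"{base_url.rstrip('/')}/api/dashboards/uid/{uid}" for uid in uids]
```

Printing the URLs and checking them by eye before issuing any DELETE request gives a dry-run step that the one-line pipeline lacks.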
groups:
  - name: recording_rules
    rules:
      - record: instance:node_cpu:avg_rate5m
        expr: avg by(instance)(rate(node_cpu_seconds_total[5m]))
count by(alertname)(ALERTS{alertstate="firing"})