Building a Cloud-Native Monitoring System: A Hands-On Guide to Prometheus + Grafana

硅谷IT胖子

1. Monitoring Architecture Overview

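In distributed systems and microservice architectures, monitoring is key to keeping systems stable. The stack made up of Node Exporter, Prometheus, Grafana, and Alertmanager has become the de facto standard in the cloud-native world. I have deployed it in several production environments; its biggest strengths are the loose coupling between components and how easily it can be extended.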

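Node Exporter collects host-level metrics such as CPU, memory, disk, and network usage. Prometheus acts as the time-series database and alerting engine: it periodically pulls (scrapes) metrics from each exporter, stores them, and evaluates alert rules. Grafana provides the visualization layer, and Alertmanager handles deduplication, grouping, and routing of alerts.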
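
To make the deduplication and grouping step more concrete, here is a minimal Python sketch of the idea. It is not Alertmanager's actual implementation: the function name group_alerts, the sample IP addresses, and the DiskFull alert are made up for illustration, and real Alertmanager also handles timing (group_wait, repeat intervals), silences, and inhibition.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Drop alerts with an identical label set, then bucket the rest by the group_by labels."""
    seen = set()
    groups = defaultdict(list)
    for labels in alerts:
        # An alert is identified by its full label set.
        fingerprint = tuple(sorted(labels.items()))
        if fingerprint in seen:
            continue  # same label set seen before: treat as a duplicate
        seen.add(fingerprint)
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical incoming alerts; the second one duplicates the first.
alerts = [
    {"alertname": "HighCPUUsage", "instance": "10.0.0.1:9100", "severity": "warning"},
    {"alertname": "HighCPUUsage", "instance": "10.0.0.1:9100", "severity": "warning"},
    {"alertname": "HighCPUUsage", "instance": "10.0.0.2:9100", "severity": "warning"},
    {"alertname": "DiskFull", "instance": "10.0.0.1:9100", "severity": "critical"},
]

grouped = group_alerts(alerts, group_by=["alertname"])
for key, members in grouped.items():
    print(key, "->", len(members), "alert(s)")
```

Grouping by alertname means one notification can cover every host that hits the same condition, instead of one message per host.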

2. Environment Preparation and Component Deployment

2.1 Docker environment setup

Docker 20.10 or later is recommended, with docker-compose installed. These are the environment checks I usually run:

# Check the Docker version
docker version --format '{{.Server.Version}}'

# Create a dedicated network (avoid the default bridge)
docker network create monitor-net

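Note: in production, set resource limits for each component individually so that the monitoring stack itself does not affect business workloads.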

2.2 Node Exporter deployment

Node Exporter needs access to host system information, so it runs in host network mode:

version: '3'
services:
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    network_mode: host
    pid: host
    restart: unless-stopped
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
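      # The host root is mounted at /rootfs, so tell the exporter where to find it
      - '--path.rootfs=/rootfs'
      # mount-points-exclude replaces the deprecated ignored-mount-points flag
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'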

After deployment, http://<host-IP>:9100/metrics should return the raw metrics.
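
The /metrics endpoint returns plain text in the Prometheus exposition format. As a rough illustration of what those lines contain, the following Python sketch parses a small sample into (name, labels, value) tuples. The sample values are invented, parse_metrics is a hypothetical helper, and the regular expressions cover only the common cases (no exemplars, and any trailing timestamp is ignored).

```python
import re

# metric_name{label="value",...} value   (labels are optional)
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_PAIR = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse exposition-format text into a list of (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # not a sample line this simplified parser understands
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_PAIR.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Invented sample resembling Node Exporter output.
SAMPLE = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
node_cpu_seconds_total{cpu="0",mode="user"} 234.5
node_load1 0.42
"""

samples = parse_metrics(SAMPLE)
for name, labels, value in samples:
    print(name, labels, value)
```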

3. Prometheus Core Configuration

3.1 The configuration file in detail

Create the prometheus.yml configuration file:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'node'
    static_configs:
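      # Node Exporter runs with network_mode: host, so it is not attached to
      # monitor-net and the name node-exporter will not resolve from the
      # Prometheus container. Replace <host-IP> with the host's address.
      - targets: ['<host-IP>:9100']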
    metrics_path: '/metrics'
    
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

3.2 Docker deployment command

docker run -d \
  --name=prometheus \
  --network=monitor-net \
  -p 9090:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus:latest \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=15d
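
Once the container is running, it is worth confirming that every scrape target is actually up. The Python sketch below reads Prometheus's /api/v1/targets endpoint and lists targets whose health is not "up". The base URL assumes the 9090 port mapping from the command above, the helper names are hypothetical, and SAMPLE_PAYLOAD is an invented, trimmed-down response used so the filtering logic can be tried without a running server.

```python
import json
from urllib.request import urlopen


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the target list from a running Prometheus (assumed reachable at base_url)."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as response:
        return json.load(response)


def unhealthy_targets(payload):
    """Return (job, scrape URL, last error) for every active target that is not up."""
    return [
        (target["labels"].get("job", ""), target["scrapeUrl"], target.get("lastError", ""))
        for target in payload["data"]["activeTargets"]
        if target["health"] != "up"
    ]


# Invented example response containing only the fields used above.
SAMPLE_PAYLOAD = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "prometheus", "instance": "localhost:9090"},
                "scrapeUrl": "http://localhost:9090/metrics",
                "health": "up",
                "lastError": "",
            },
            {
                "labels": {"job": "node", "instance": "node-exporter:9100"},
                "scrapeUrl": "http://node-exporter:9100/metrics",
                "health": "down",
                "lastError": "no such host",
            },
        ]
    },
}

problems = unhealthy_targets(SAMPLE_PAYLOAD)
for job, url, error in problems:
    print(f"{job}: {url} is down ({error})")
```

Against a live server, unhealthy_targets(fetch_targets()) performs the same check; the same information is shown on the Status > Targets page of the Prometheus web UI.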

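Tip: size the TSDB retention period to your disk capacity. The docker run command for Prometheus keeps 15 days of data (--storage.tsdb.retention.time=15d); for production, at least 30 days (--storage.tsdb.retention.time=30d) is recommended.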
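
For a rough feel of how retention translates into disk usage, the Prometheus storage documentation gives the rule of thumb: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample on average. The Python sketch below applies that formula. The figure of 1000 active series is an assumption chosen for illustration, not a measurement, and 2 bytes per sample is the pessimistic end of the range.

```python
def estimate_tsdb_bytes(active_series, scrape_interval_s, retention_days, bytes_per_sample=2):
    """Rule-of-thumb disk estimate: retention seconds x samples per second x bytes per sample."""
    retention_seconds = retention_days * 24 * 60 * 60
    samples_per_second = active_series / scrape_interval_s
    return retention_seconds * samples_per_second * bytes_per_sample


# Assumed workload: 1000 active series scraped every 15s (the scrape_interval used in this guide).
for days in (15, 30):
    size = estimate_tsdb_bytes(active_series=1000, scrape_interval_s=15, retention_days=days)
    print(f"{days}d retention: ~{size / 1024**2:.0f} MiB")
```

The estimate scales linearly, so doubling either the retention period or the number of series doubles the space required.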

4. Grafana Visualization Setup

4.1 Basic deployment

docker run -d \
  --name=grafana \
  --network=monitor-net \
  -p 3000:3000 \
  grafana/grafana:latest

After the first login:

Add a Prometheus data source (URL: http://prometheus:9090)
Import the Node Exporter dashboard (ID: 8919)

4.2 Dashboard tuning tips

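Use $__rate_interval instead of a fixed time window
Set threshold markers on key metrics
Add Annotations to correlate alert events with the graphs
Configure Variables to switch between environments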
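
On the first tip: Grafana's documentation describes $__rate_interval as max($__interval + scrape interval, 4 × scrape interval), which guarantees that the window passed to rate() always spans several samples. The Python sketch below reproduces that arithmetic; the function name is hypothetical and the 15-second default matches the scrape_interval used in this guide.

```python
def rate_interval(panel_interval_s, scrape_interval_s=15):
    """Window size Grafana would substitute for $__rate_interval, per its documented formula."""
    return max(panel_interval_s + scrape_interval_s, 4 * scrape_interval_s)


# Zoomed in (small $__interval): the window is clamped to 4 scrape intervals.
zoomed_in = rate_interval(panel_interval_s=10)
# Zoomed out (large $__interval): the window grows with the panel resolution.
zoomed_out = rate_interval(panel_interval_s=120)
print(zoomed_in, zoomed_out)
```

A fixed window such as [1m] can fall below the lower bound if the scrape interval is later raised, leaving rate() with too few samples; $__rate_interval adapts automatically, provided the data source's scrape interval setting matches prometheus.yml.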

5. Alert Management with Alertmanager

5.1 Alert rule configuration

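Add alert.rules.yml to Prometheus. For the rules to load, the file must be mounted into the Prometheus container and listed under rule_files in prometheus.yml: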

groups:
- name: host-alerts
  rules:
  - alert: HighCPUUsage
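    # rate() is used instead of irate(): irate() looks only at the last two samples, so it is too spiky for alerting
    expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80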
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage is {{ $value }}%"
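
To show what the HighCPUUsage rule computes, the Python sketch below redoes the arithmetic by hand: per-core idle rate over the window, averaged across cores, subtracted from 100, followed by the "for: 10m" check. It is a simplified model with invented counter values and hypothetical function names; the real rate() also handles counter resets and extrapolation, and Prometheus tracks pending and firing state per label set.

```python
def cpu_usage_percent(idle_start, idle_end, window_seconds):
    """100 - avg(per-core idle rate) * 100, mirroring the structure of the alert expression."""
    rates = [(idle_end[core] - idle_start[core]) / window_seconds for core in idle_start]
    return 100 - (sum(rates) / len(rates)) * 100


def is_firing(evaluations, threshold=80.0, for_seconds=600):
    """Return True if the condition has held continuously for at least for_seconds
    as of the last evaluation. evaluations is a time-ordered list of (timestamp_s, value)."""
    pending_since = None
    firing = False
    for timestamp, value in evaluations:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp  # condition just became true: start the clock
            firing = timestamp - pending_since >= for_seconds
        else:
            pending_since = None  # any dip below the threshold resets the clock
            firing = False
    return firing


# Invented idle-counter readings for two cores, taken 300 seconds (5m) apart.
usage = cpu_usage_percent(
    idle_start={"0": 1000.0, "1": 2000.0},
    idle_end={"0": 1060.0, "1": 2030.0},
    window_seconds=300,
)
print(f"CPU usage: {usage:.1f}%")

# Evaluations every 15s for 10 minutes, all above the threshold.
steady = [(t, 85.0) for t in range(0, 601, 15)]
# Same series, but with one dip at t=300 that resets the pending period.
dipped = [(t, 50.0 if t == 300 else 85.0) for t in range(0, 601, 15)]
print(is_firing(steady), is_firing(dipped))
```

The second series illustrates why "for" reduces noise: a single evaluation below the threshold restarts the 10-minute wait.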

5.2 Alertmanager deployment

version: '3'
services:
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    networks:
      - monitor-net
    ports:
      - 9093:9093
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml