K8s Monitoring: Prometheus + Grafana Deployment and Optimization Guide (代码聚汇网)

Wong Kosheng

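1. Why deploy a monitoring system in K8s?

In a containerized environment, traditional monitoring can no longer keep up with dynamic, elastic infrastructure. Pods in a Kubernetes cluster can be rescheduled onto a different node at any time, and service instances scale out and in with load. The monitoring setup therefore has to discover its targets automatically and collect metrics from them as they come and go.

The combination of Prometheus and Grafana handles this well:

Prometheus acts as the time-series database and alerting engine, and uses service discovery to find monitoring targets in K8s automatically
Grafana provides data visualization and dashboards
Both support Kubernetes natively and are easy to deploy and maintain

I have run this combination in production for more than three years, monitoring over 200 microservices. Below is the complete deployment plan, along with the pitfalls I hit along the way.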

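2. Preparation before deployment

2.1 Environment checklist

Before installing, confirm that your K8s cluster meets the following conditions:

Kubernetes version ≥ 1.16 (1.20+ recommended)
A StorageClass is configured for persistent storage
The cluster has at least 2 available nodes
kubectl is configured correctly and can manage the cluster
Helm 3 is installed (we will deploy with Helm charts)

Important: in production, consider dedicating nodes to the monitoring components so that the monitoring system cannot affect the stability of business workloads.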
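The checklist can be turned into a quick preflight script. The following is a minimal sketch, not a complete validator: the function name preflight_check is made up for illustration, it assumes kubectl and helm are on the PATH, and it deliberately skips the Kubernetes version check because the output format of kubectl version differs between releases.

```shell
# Preflight sketch for the checklist above; thresholds mirror the list.
preflight_check() {
  local failed=0 ready sc

  # kubectl must be able to reach the API server.
  if kubectl cluster-info >/dev/null 2>&1; then
    echo "OK   kubectl can reach the cluster"
  else
    echo "FAIL kubectl cannot reach the cluster"; failed=1
  fi

  # Count nodes whose STATUS column is exactly "Ready".
  ready=$(kubectl get nodes --no-headers 2>/dev/null |
    awk '$2 == "Ready" {n++} END {print n+0}')
  if [ "$ready" -ge 2 ]; then
    echo "OK   $ready Ready nodes"
  else
    echo "FAIL only $ready Ready node(s), need at least 2"; failed=1
  fi

  # At least one StorageClass must exist for the persistent volumes.
  sc=$(kubectl get storageclass --no-headers 2>/dev/null | awk 'END {print NR}')
  if [ "$sc" -ge 1 ]; then
    echo "OK   $sc StorageClass(es) found"
  else
    echo "FAIL no StorageClass found"; failed=1
  fi

  # With --short, Helm 3 prints a version string that starts with "v3.".
  if helm version --short 2>/dev/null | grep -q '^v3\.'; then
    echo "OK   Helm 3 is installed"
  else
    echo "FAIL Helm 3 not found"; failed=1
  fi

  return "$failed"
}
```

Run preflight_check before installing; a non-zero exit status means at least one item needs attention.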

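2.2 Namespace planning

Create a dedicated namespace for the monitoring components:

kubectl create namespace monitoring

This keeps the components easy to manage and lets you control access with RBAC.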
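As an illustration of the RBAC point, read-only access to the namespace could be granted roughly like this. The role name monitoring-viewer, the function name, and the user passed in are hypothetical; the resource list is an assumption to adapt to your needs.

```shell
# Sketch: give one user read-only access to the monitoring namespace.
# "monitoring-viewer" is a hypothetical role name chosen for this example.
grant_monitoring_readonly() {
  local user="$1"

  # Role limited to read verbs on a few core resources.
  kubectl create role monitoring-viewer \
    --namespace monitoring \
    --verb=get,list,watch \
    --resource=pods,services,configmaps

  # Bind the role to the given user inside the same namespace.
  kubectl create rolebinding "monitoring-viewer-${user}" \
    --namespace monitoring \
    --role=monitoring-viewer \
    --user="${user}"
}
```

For example, grant_monitoring_readonly alice. The role only needs to be created once; for additional users, kubectl create role reports that it already exists and only the rolebinding step matters.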

3. Deploying Prometheus with Helm

3.1 Add the Prometheus Helm repository

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

3.2 Customize values.yaml

Create a prometheus-values.yaml file:

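# Basic settings, resource limits, and persistent storage.
# All of these sit under a single "server" key: a YAML mapping cannot
# contain the same key twice, and this chart reads the scrape and
# evaluation intervals from server.global.
server:
  global:
    scrape_interval: 15s
    evaluation_interval: 15s

  # Resource limits
  resources:
    limits:
      cpu: 1000m
      memory: 2Gi
    requests:
      cpu: 500m
      memory: 1Gi

  # Persistent storage
  persistentVolume:
    enabled: true
    size: 50Gi
    storageClass: "standard"

# Automatic discovery of K8s services.
# Note: a scrape_configs list set here replaces the chart's default jobs,
# because Helm replaces lists instead of merging them. To add jobs on top
# of the defaults, use extraScrapeConfigs instead.
serverFiles:
  prometheus.yml:
    scrape_configs:
    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        # Skips certificate verification; fine for a first test, but in
        # production prefer verifying against the cluster CA with
        # ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      # Keep only the "kubernetes" service in the default namespace (the API server)
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https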

3.3 Run the install command

helm install prometheus prometheus-community/prometheus \
  --namespace monitoring \
  --values prometheus-values.yaml \
  --version 15.5.3

Once the installation finishes, test it with a port-forward:

kubectl port-forward svc/prometheus-server 9090:80 -n monitoring

Then open http://localhost:9090 in a browser; you should see the Prometheus web UI.
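Beyond opening the UI, the same port-forward can be used to check from the command line that Prometheus is ready and that its targets are healthy. This is a rough sketch that assumes the port-forward is still running on localhost:9090; the function names are made up, and the grep-based counting is a shortcut that relies on the compact JSON Prometheus returns (a JSON tool such as jq would be more robust).

```shell
# Assumes the port-forward from the previous step is still running.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Succeeds once Prometheus reports it is ready to serve traffic.
prom_ready() {
  curl -fsS "${PROM_URL}/-/ready" >/dev/null
}

# Print how many active targets are in each health state (up, down, unknown).
prom_target_health() {
  curl -fsS "${PROM_URL}/api/v1/targets?state=active" |
    grep -o '"health":"[a-z]*"' | sort | uniq -c
}
```

Calling prom_ready && prom_target_health prints one line per health state; any count next to "down" is worth a look on the Status → Targets page.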

4. Deploying Grafana

4.1 Add the Grafana Helm repository

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

4.2 Customize the Grafana configuration

Create grafana-values.yaml:

# Admin credentials (the password below is a placeholder; replace it before installing)
adminUser: admin
adminPassword: "YourSecurePassword123!"

# Persistent storage
persistence:
  enabled: true
  size: 10Gi
  storageClassName: standard

# Resource limits
resources:
  limits:
    cpu: 500m
    memory: 1Gi
  requests:
    cpu: 200m
    memory: 512Mi

# Provision the Prometheus data source automatically
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
    - name: Prometheus
      type: prometheus
      url: http://prometheus-server.monitoring.svc.cluster.local
      access: proxy
      isDefault: true

4.3 Install Grafana

helm install grafana grafana/grafana \
  --namespace monitoring \
  --values grafana-values.yaml \
  --version 6.26.0

Access Grafana:

kubectl port-forward svc/grafana 3000:80 -n monitoring

Open http://localhost:3000 in a browser and log in with the username and password you configured.
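If the login fails, it helps to confirm what the chart actually stored and whether Grafana itself is healthy. The sketch below assumes the release is named grafana as above, so the chart-managed Secret is also called grafana, and that the port-forward on localhost:3000 is running; the function names are hypothetical.

```shell
# Read the admin password back from the Secret created by the chart.
grafana_admin_password() {
  kubectl get secret grafana --namespace monitoring \
    -o jsonpath='{.data.admin-password}' | base64 --decode
}

# Succeeds when Grafana's health endpoint reports a working database.
grafana_healthy() {
  curl -fsS "http://localhost:3000/api/health" | grep -q '"database": *"ok"'
}
```

Running grafana_healthy && grafana_admin_password prints the stored password only when the instance is up.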

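5. Configuring dashboards

5.1 Import a K8s cluster dashboard

The dashboard library on grafana.com hosts a solid Kubernetes cluster monitoring dashboard, ID 3119. To import it:

After logging in to Grafana, click "+" → "Import" in the left sidebar
Enter 3119 and click Load
Select the Prometheus data source and click Import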
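The manual import can also be done declaratively, so the dashboard survives a reinstall. The sketch below uses the dashboardProviders and dashboards keys of the Grafana chart; the extra file name grafana-dashboards-values.yaml, the function name, and the dashboard key k8s-cluster are choices made for this example. No revision is pinned here; check the dashboard's page on grafana.com and add a revision field if you want a specific one.

```shell
# Sketch: provision dashboard 3119 through the Grafana chart values.
provision_dashboard_3119() {
  # Extra values file (hypothetical name) layered on top of grafana-values.yaml.
  cat > grafana-dashboards-values.yaml <<'EOF'
dashboardProviders:
  dashboardproviders.yaml:
    apiVersion: 1
    providers:
    - name: default
      orgId: 1
      folder: ""
      type: file
      disableDeletion: false
      editable: true
      options:
        path: /var/lib/grafana/dashboards/default
dashboards:
  default:
    k8s-cluster:
      gnetId: 3119
      datasource: Prometheus
EOF

  # Re-render the release with both values files.
  helm upgrade grafana grafana/grafana \
    --namespace monitoring \
    --values grafana-values.yaml \
    --values grafana-dashboards-values.yaml \
    --version 6.26.0
}
```

The datasource name must match the data source provisioned in grafana-values.yaml, which is Prometheus in this guide.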

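5.2 Custom alerting rules

An example alerting rule for Prometheus. With the prometheus-community/prometheus chart, rules go under serverFiles.alerting_rules.yml in prometheus-values.yaml:

# prometheus-values.yaml (merge into the existing serverFiles key)
serverFiles:
  alerting_rules.yml:
    groups:
    - name: Kubernetes Pod Alert
      rules:
      - alert: PodCrashLooping
        # restarts_total is a counter that only grows, so alert on its
        # recent increase rather than on its absolute value
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has restarted about {{ $value }} times in the last 15 minutes."

Then update the Prometheus configuration with a Helm upgrade. A standalone ConfigMap created with kubectl is not mounted by the chart, so the rules have to go through the values file:

helm upgrade prometheus prometheus-community/prometheus \
  --namespace monitoring \
  --values prometheus-values.yaml \
  --version 15.5.3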
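Whichever way the rules are delivered, it is worth confirming that Prometheus actually loaded them. A minimal sketch, assuming a port-forward to Prometheus on localhost:9090 and relying on the compact JSON returned by the rules API (the function name is made up):

```shell
# Succeeds if an alerting rule with the given name is loaded in Prometheus.
alert_rule_loaded() {
  local name="$1"
  curl -fsS "${PROM_URL:-http://localhost:9090}/api/v1/rules?type=alert" |
    grep -q "\"name\":\"${name}\""
}
```

For example, alert_rule_loaded PodCrashLooping && echo "rule is active". If the check fails, the rule file never reached the Prometheus server and the Alerts page in the UI will not list it either.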

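6. Production optimization recommendations

6.1 High availability

For production, run Prometheus and Alertmanager with more than one replica. With the prometheus-community/prometheus chart used above, the replica settings live under alertmanager and server:

# Add to prometheus-values.yaml (merge into the existing server key)
alertmanager:
  enabled: true
  replicaCount: 2
  statefulSet:
    enabled: true

server:
  replicaCount: 2
  statefulSet:
    enabled: true

Each Prometheus replica scrapes the same targets independently and keeps its own copy of the data. If you deploy with kube-prometheus-stack (the Prometheus Operator) instead, the equivalent settings are prometheus.prometheusSpec.replicas: 2 and prometheus.prometheusSpec.replicaExternalLabelName: "prometheus_replica".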

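6.2 Long-term storage

Prometheus local storage is not suited to keeping data long term; you can integrate Thanos or VictoriaMetrics:

# Thanos integration example.
# These keys are read by kube-prometheus-stack (Prometheus Operator);
# the plain prometheus chart installed above does not use them.
prometheus:
  prometheusSpec:
    thanos:
      image: quay.io/thanos/thanos:v0.24.0
      objectStorageConfig:
        key: thanos.yaml
        name: thanos-objstore-config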
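The objectStorageConfig above points at a Secret named thanos-objstore-config with a key thanos.yaml, which has to exist before the sidecar starts. A sketch of creating it for an S3-compatible bucket follows; the bucket name, endpoint, and credentials are placeholders, and the function name is made up for this example.

```shell
# Sketch: create the Secret referenced by objectStorageConfig.
create_thanos_objstore_secret() {
  # Thanos object storage configuration; every value below is a placeholder.
  cat > thanos.yaml <<'EOF'
type: S3
config:
  bucket: example-thanos-bucket
  endpoint: s3.example.com
  access_key: REPLACE_ME
  secret_key: REPLACE_ME
EOF

  # The Secret key (thanos.yaml) must match objectStorageConfig.key.
  kubectl create secret generic thanos-objstore-config \
    --namespace monitoring \
    --from-file=thanos.yaml=thanos.yaml
}
```

Keep the generated thanos.yaml out of version control, since it contains the bucket credentials.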

6.3 Security hardening

Enable TLS on the Ingress
Restrict access with network policies
Rotate the admin password regularly
Enable audit logging
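To make the network policy item concrete, here is a sketch that lets only Grafana Pods reach the Prometheus server. The Pod labels are assumptions based on what these chart versions typically apply; verify them with kubectl get pods -n monitoring --show-labels before using it. The policy name and function name are made up, and enforcement requires a CNI plugin that supports NetworkPolicy.

```shell
# Sketch: allow ingress to Prometheus only from Grafana Pods.
apply_grafana_to_prometheus_policy() {
  kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-server-ingress
  namespace: monitoring
spec:
  # Assumed labels of the Prometheus server Pods.
  podSelector:
    matchLabels:
      app: prometheus
      component: server
  policyTypes:
  - Ingress
  ingress:
  - from:
    # Assumed label of the Grafana Pods in the same namespace.
    - podSelector:
        matchLabels:
          app.kubernetes.io/name: grafana
    ports:
    - protocol: TCP
      port: 9090
EOF
}
```

Once applied, every other in-cluster client of the Prometheus Pods is blocked, so add further from entries for anything else that needs to query it.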

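7. Troubleshooting common problems

7.1 Prometheus cannot scrape metrics

Possible causes:

The ServiceAccount lacks permissions
A network policy blocks access
The target service does not expose a metrics port

Steps to check:

# Check the ServiceAccount binding (ClusterRoleBindings are cluster-scoped, so no namespace flag is needed)
kubectl describe clusterrolebinding prometheus-server

# Check network connectivity to Prometheus from a throwaway Pod
kubectl run -it --rm debug --image=busybox --restart=Never -- wget -qO- http://prometheus-server.monitoring.svc.cluster.local/-/healthy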
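The permission check can be made more direct by asking the API server what the Prometheus ServiceAccount is allowed to do. This sketch assumes the ServiceAccount is named prometheus-server in the monitoring namespace, which is what a release called prometheus normally produces; the function name and the resource list are illustrative.

```shell
# Sketch: verify that the Prometheus ServiceAccount can list the resources
# that Kubernetes service discovery depends on.
check_prometheus_rbac() {
  local sa="system:serviceaccount:monitoring:prometheus-server"
  local resource failed=0

  for resource in nodes pods services endpoints; do
    # "kubectl auth can-i" exits non-zero when the answer is "no".
    if kubectl auth can-i list "$resource" --as="$sa" --all-namespaces >/dev/null 2>&1; then
      echo "OK   list $resource"
    else
      echo "FAIL list $resource"; failed=1
    fi
  done

  return "$failed"
}
```

Any FAIL line points at a missing rule in the ClusterRole bound to the ServiceAccount. Impersonating with --as requires that your own user is allowed to impersonate ServiceAccounts.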

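7.2 Grafana cannot connect to Prometheus

Things to check:

Confirm the Prometheus service name and namespace are correct
Check network policies
Verify the Grafana data source configuration

# Test the connection from inside the Grafana Pod
# (if the image does not ship curl, try wget -qO- with the same URL)
kubectl exec -it deploy/grafana -n monitoring -- curl -v http://prometheus-server.monitoring.svc.cluster.local/-/healthy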

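7.3 Persistent storage problems

If data is lost after a Pod restart:

Check the PVC status
Confirm the StorageClass is configured correctly
Verify that the PV was created

kubectl get pvc -n monitoring
kubectl describe pvc grafana -n monitoring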

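8. Maintenance recommendations

Check storage usage regularly
Set a Prometheus data retention policy
Monitor the monitoring system itself
Update the Helm chart versions regularly
Back up the configuration of key Grafana dashboards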
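Storage checks and retention are linked: the Prometheus documentation gives a rough sizing rule of retention time in seconds × ingested samples per second × bytes per sample, with 1 to 2 bytes per sample being typical. The sketch below applies that rule; the function name is made up, and the sample rate in the usage example is an illustrative figure, not a measurement from this setup.

```shell
# Rough disk estimate: retention_seconds * samples_per_second * bytes_per_sample.
# Arguments: retention in days, ingested samples per second,
# bytes per sample (defaults to 2, the pessimistic end of the usual range).
estimate_tsdb_bytes() {
  local retention_days="$1" samples_per_sec="$2" bytes_per_sample="${3:-2}"
  echo $(( retention_days * 86400 * samples_per_sec * bytes_per_sample ))
}
```

For instance, estimate_tsdb_bytes 15 10000 prints 25920000000, or about 24 GiB, which fits within the 50Gi volume configured earlier. The real ingestion rate can be read from the query rate(prometheus_tsdb_head_samples_appended_total[5m]), and with this chart the retention period is set through server.retention (for example "15d") in prometheus-values.yaml.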

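This monitoring setup has run stably in our production environment for years and covers monitoring needs from development and testing through production. Resource settings and architecture can be adjusted to match the scale of your workload.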