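1. Why deploy a monitoring system on EKS?
When you run applications on a Kubernetes cluster, monitoring matters as much as the instrument panel does on an aircraft. I have seen too many teams deploy to EKS and only learn about a problem when users complain. Prometheus plus Grafana gives you a real-time view of the cluster, from node resource usage down to Pod status.
This setup is a particularly good fit for:
- Online business systems that must run reliably 24/7
- Microservice architectures where you need metrics from every service to follow a problem across service boundaries
- Teams migrating from a traditional monitoring stack (such as Zabbix) to cloud-native monitoring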
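2. Preparing the environment
2.1 Checking the basic EKS cluster configuration
First make sure your EKS cluster is ready:
    aws eks --region us-west-2 update-kubeconfig --name my-cluster
    kubectl get nodes # confirm that every node is Ready
Note: for production, run at least 3 worker nodes spread across multiple AZs. I once lost monitoring data when a single AZ failed.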
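To check the multi-AZ recommendation rather than eyeball it, the node list can be summarized per zone. The helper below is a minimal sketch, assuming the nodes carry the standard topology.kubernetes.io/zone label; the function name count_ready_per_zone is made up for this example.

```shell
# Summarize Ready nodes per availability zone.
# Input (stdin): output of
#   kubectl get nodes --no-headers -L topology.kubernetes.io/zone
# where column 2 is STATUS and the last column is the zone label.
count_ready_per_zone() {
  awk '$2 == "Ready" { zones[$NF]++ }
       END { for (z in zones) print z, zones[z] }' | sort
}

# Usage against a live cluster:
#   kubectl get nodes --no-headers -L topology.kubernetes.io/zone | count_ready_per_zone
```

Each output line is a zone followed by its number of Ready nodes; a zone that is missing from the output has no usable node.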
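2.2 Required IAM permissions
Prometheus needs permission to call AWS resource APIs, such as ec2:DescribeInstances for EC2 service discovery. Create the following IAM policy:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "ec2:DescribeInstances",
            "ec2:DescribeVolumes",
            "ec2:DescribeTags"
          ],
          "Resource": "*"
        }
      ]
    }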
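The policy document still has to be registered in IAM. A minimal sketch with the AWS CLI, assuming the JSON above is saved as prometheus-policy.json and that PrometheusEC2ReadOnly is an acceptable policy name (both names are placeholders). How the policy is then attached to Prometheus (node role or IAM roles for service accounts) depends on your cluster setup and is not shown.

```shell
# Register the policy document in IAM and print the ARN of the new policy.
# $1: policy name, $2: path to the JSON policy document
create_prometheus_policy() {
  aws iam create-policy \
    --policy-name "$1" \
    --policy-document "file://$2" \
    --query 'Policy.Arn' \
    --output text
}

# Usage (needs AWS credentials that allow iam:CreatePolicy):
#   create_prometheus_policy PrometheusEC2ReadOnly prometheus-policy.json
```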
3. Deploying Prometheus with Helm
3.1 Adding the Prometheus Helm repository
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update
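3.2 Customizing values.yaml
Create a custom-values.yaml file. The keys follow the layout of the kube-prometheus-stack chart installed in 3.3, where Prometheus settings live under prometheus.prometheusSpec. The server.* keys of the standalone prometheus chart are not read by this chart.
    alertmanager:
      enabled: false # disable alerting for the first deployment
    prometheus:
      prometheusSpec:
        storageSpec:
          volumeClaimTemplate:
            spec:
              storageClassName: gp2
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 50Gi # at least 100Gi recommended for production
        resources:
          limits:
            cpu: 2
            memory: 4Gi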
3.3 Running the Helm install
    helm install prometheus prometheus-community/kube-prometheus-stack \
      -n monitoring \
      --create-namespace \
      -f custom-values.yaml
After the install finishes, check the Pod status:
    kubectl -n monitoring get pods
    # expect Pods for the Prometheus operator, Prometheus itself, Grafana, kube-state-metrics and node-exporter
4. Configuring and tuning Grafana
4.1 Accessing the Grafana console
Get the admin password:
    kubectl get secret -n monitoring prometheus-grafana \
      -o jsonpath="{.data.admin-password}" | base64 --decode
Forward the port to your machine, then open http://localhost:3000:
    kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
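4.2 Importing ready-made dashboards
grafana.com hosts several community Kubernetes dashboards that can be imported by ID:
- 3119: cluster-wide overview
- 6417: detailed Pod monitoring
- 315: node resource usage
To import one:
- Log in to Grafana and click "+" → Import
- Enter the dashboard ID
- Select the Prometheus data source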
4.3 Configuring persistent storage
Add Grafana persistence to custom-values.yaml:
    grafana:
      persistence:
        enabled: true
        size: 10Gi
        storageClassName: gp2
Then run the upgrade:
    helm upgrade prometheus prometheus-community/kube-prometheus-stack \
      -n monitoring \
      -f custom-values.yaml
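5. Key settings for production
5.1 Prometheus data retention
Set this in custom-values.yaml. Whichever limit is reached first triggers deletion, so keep retentionSize below the volume size to leave room for the WAL:
    prometheus:
      prometheusSpec:
        retention: 15d # adjust to your storage capacity
        retentionSize: "40GiB" # kept below a 50Gi volume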
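A rough way to choose the retention and volume sizes is the capacity formula from the Prometheus storage documentation: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. The helper below is a sketch of that arithmetic; the function name and the sample rate in the usage line are illustrative assumptions.

```shell
# Estimate the disk space (GiB, rounded up) needed for a retention window.
# $1: retention in days, $2: ingested samples per second, $3: bytes per sample
estimate_tsdb_gib() {
  bytes=$(( $1 * 86400 * $2 * $3 ))
  gib=$(( 1024 * 1024 * 1024 ))
  echo $(( (bytes + gib - 1) / gib ))
}

# Example: 15 days at 10,000 samples/s and 2 bytes per sample
estimate_tsdb_gib 15 10000 2
```

The example prints 25, so a 50Gi volume leaves headroom at that ingestion rate. The real rate can be read from Prometheus itself with rate(prometheus_tsdb_head_samples_appended_total[5m]).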
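5.2 Resource limits and scaling
The Prometheus resource managed by the operator has no autoscaling block, and an HPA would not spread the load anyway because every replica scrapes the same targets. Set resource requests and a fixed replica count instead, and use shards once a single instance can no longer hold all targets:
    prometheus:
      prometheusSpec:
        resources:
          requests:
            cpu: 1
            memory: 2Gi
        replicas: 2 # identical replicas for availability
        # shards: 2 # optional: split the targets across shard groups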
5.3 High availability across AZs
Run at least two replicas and force them into different zones. In kube-prometheus-stack, podAntiAffinity takes "soft" or "hard" and the topology key is set separately:
    prometheus:
      prometheusSpec:
        replicas: 2
        replicaExternalLabelName: "prometheus_replica"
        podAntiAffinity: "hard"
        podAntiAffinityTopologyKey: "topology.kubernetes.io/zone"
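6. Troubleshooting common problems
6.1 Prometheus runs out of storage
Symptom: monitoring data is lost frequently
Fix:
- Check the PVC and its capacity:
    kubectl -n monitoring get pvc
- Expand the PVC. The StorageClass must have allowVolumeExpansion: true. Use the exact name printed by the previous command; it starts with prometheus-prometheus-kube-prometheus-prometheus-db:
    kubectl -n monitoring edit pvc <pvc-name>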
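Instead of editing the PVC interactively, the new size can be applied with kubectl patch. A minimal sketch, assuming the StorageClass allows expansion; the helper names are made up, and <pvc-name> stays a placeholder for the name shown by kubectl get pvc.

```shell
# Build the JSON merge patch that requests a new PVC size.
# $1: new size, e.g. 100Gi
pvc_resize_patch() {
  printf '{"spec":{"resources":{"requests":{"storage":"%s"}}}}\n' "$1"
}

# Apply the patch. $1: namespace, $2: PVC name, $3: new size
resize_pvc() {
  kubectl -n "$1" patch pvc "$2" --type merge -p "$(pvc_resize_patch "$3")"
}

# Usage:
#   kubectl get storageclass gp2 -o jsonpath='{.allowVolumeExpansion}'   # should print true
#   resize_pvc monitoring <pvc-name> 100Gi
```

Keep the storage value in custom-values.yaml in line with the new size so that later upgrades describe the same volume.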
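6.2 Grafana login fails
Possible causes:
- The password was reset
- Stale browser cache
To reset the password, recreate the secret:
    kubectl -n monitoring delete secret prometheus-grafana
    helm upgrade prometheus prometheus-community/kube-prometheus-stack \
      -n monitoring \
      -f custom-values.yaml
If Grafana persistence is enabled (4.3), the password already stored in Grafana's database is not overwritten by the new secret. In that case reset it inside the Pod:
    kubectl -n monitoring exec deploy/prometheus-grafana -c grafana -- \
      grafana-cli admin reset-admin-password '<new-password>'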
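6.3 Missing monitoring data
Checklist:
- Confirm that the ServiceMonitor exists:
    kubectl -n monitoring get servicemonitors
- Check that the targets on the Prometheus Targets page are up
- Read the Prometheus logs:
    kubectl -n monitoring logs -l app.kubernetes.io/name=prometheus -c prometheus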
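The Targets page check can also be scripted through the Prometheus HTTP API. This sketch assumes Prometheus has been port-forwarded to localhost:9090 (with the release name used here, kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-prometheus 9090); the function name is illustrative.

```shell
# List scrape targets that Prometheus currently reports as down.
# $1: Prometheus base URL, e.g. http://localhost:9090
list_down_targets() {
  curl -sG "$1/api/v1/query" --data-urlencode 'query=up == 0'
}

# Usage (after port-forwarding Prometheus to localhost:9090):
#   list_down_targets http://localhost:9090
```

An empty result list in the JSON response means every discovered target is up; a target that is missing entirely points back at the ServiceMonitor or its labels.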
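7. Advanced configuration
7.1 Custom metrics
Example ServiceMonitor. With the chart defaults, Prometheus only picks up ServiceMonitors labeled with the Helm release name, hence the release label. The selector matches Services in the ServiceMonitor's own namespace unless you add a namespaceSelector.
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: myapp-monitor
      namespace: monitoring
      labels:
        release: prometheus
    spec:
      selector:
        matchLabels:
          app: myapp
      endpoints:
        - port: web
          interval: 30s
          path: /metrics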
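A ServiceMonitor selects Services, not Pods, and its port field refers to a named Service port. The sketch below prints a Service that the example above would match; the port number 8080 and the helper name render_service are assumptions for illustration.

```shell
# Print a Service manifest that a ServiceMonitor can select.
# $1: app label and service name, $2: port name, $3: port number
render_service() {
  cat <<EOF
apiVersion: v1
kind: Service
metadata:
  name: $1
  namespace: monitoring
  labels:
    app: $1
spec:
  selector:
    app: $1
  ports:
    - name: $2
      port: $3
      targetPort: $3
EOF
}

# Usage:
#   render_service myapp web 8080 | kubectl apply -f -
```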
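7.2 Integrating with CloudWatch
Install cloudwatch-exporter. The IAM role is attached through the service account annotation (IRSA), while the region and the metric list belong in the chart's config value. Confirm the key names for your chart version with helm show values prometheus-community/prometheus-cloudwatch-exporter before installing:
    helm install cloudwatch-exporter prometheus-community/prometheus-cloudwatch-exporter \
      -n monitoring \
      --set-string 'serviceAccount.annotations.eks\.amazonaws\.com/role-arn=arn:aws:iam::123456789012:role/MonitoringRole' \
      -f cloudwatch-values.yaml # hypothetical file holding the config block with region: us-west-2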
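7.3 Alerting rules
Example CPU alert rule. It uses rate() rather than irate() so that a single spike between two scrapes does not dominate the 5-minute window, and the release label lets the chart's default rule selector pick it up. Alerts are only routed once Alertmanager is enabled.
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: node-alerts
      namespace: monitoring
      labels:
        release: prometheus
    spec:
      groups:
        - name: node.rules
          rules:
            - alert: HighNodeCPU
              expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
              for: 10m
              labels:
                severity: warning
              annotations:
                summary: "High CPU usage on {{ $labels.instance }}"
                description: "{{ $labels.instance }} CPU usage is {{ $value }}%"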
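To see what the expression evaluates to: the per-second rate of node_cpu_seconds_total{mode="idle"} is the fraction of each second a CPU spent idle, the average over all CPUs of an instance gives the idle fraction of the node, and 100 minus that value times 100 is the busy percentage. A small worked example with made-up idle fractions for a 4-CPU node:

```shell
# Reproduce the alert arithmetic: 100 - avg(idle fraction) * 100.
# Arguments: per-CPU idle fractions (0..1), one per CPU.
cpu_busy_percent() {
  printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.1f\n", 100 - (sum / NR) * 100 }'
}

# Four CPUs that were idle 10%, 20%, 15% and 15% of the time
cpu_busy_percent 0.10 0.20 0.15 0.15
```

The example prints 85.0, which is above the threshold of 80, so the alert would become pending and fire if the condition holds for 10 minutes.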
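8. Performance tuning in practice
8.1 Optimizing Prometheus queries
- Precompute frequent queries with recording rules (place the group under spec.groups of a PrometheusRule):
    groups:
      - name: example.rules
        rules:
          - record: job:http_inprogress_requests:sum
            expr: sum by (job) (http_inprogress_requests)
- Enable the query log to find slow queries. A bare file name is written to an emptyDir that the operator mounts at /var/log/prometheus; a full path requires you to mount a writable volume yourself:
    prometheus:
      prometheusSpec:
        queryLogFile: query.log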
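8.2 Long-term storage
Integrate with Thanos. The Secret thanos-objstore-config must exist in the monitoring namespace and hold the object storage configuration under the key thanos.yaml. Recent chart versions nest these two keys under objectStorageConfig.existingSecret, so check the values of the version you install.
    prometheus:
      prometheusSpec:
        thanos:
          image: quay.io/thanos/thanos:v0.28.0
          objectStorageConfig:
            key: thanos.yaml
            name: thanos-objstore-config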
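8.3 Tuning resource usage
Rules of thumb to watch:
- Keep Prometheus memory usage below 80% of its limit
- Do not scrape more often than every 15s in high-load environments
- Keep each Prometheus instance below roughly 1,000 scrape targets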
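9. Security hardening
9.1 Enabling LDAP authentication in Grafana
Configure the LDAP integration. The ConfigMap grafana-ldap-config must already exist and contain the ldap.toml key:
    grafana:
      env:
        GF_AUTH_LDAP_ENABLED: "true"
        GF_AUTH_LDAP_CONFIG_FILE: "/etc/grafana/ldap.toml"
      extraConfigmapMounts:
        - name: ldap-config
          mountPath: /etc/grafana/ldap.toml
          subPath: ldap.toml
          configMap: grafana-ldap-config
          readOnly: true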
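9.2 Network isolation for Prometheus
Create a NetworkPolicy that only lets Grafana reach Prometheus on port 9090. It is enforced only if the cluster runs a policy engine, for example the network policy support of the Amazon VPC CNI or Calico:
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: prometheus-allow-only-grafana
      namespace: monitoring
    spec:
      podSelector:
        matchLabels:
          app.kubernetes.io/name: prometheus
      policyTypes: ["Ingress"]
      ingress:
        - from:
            - podSelector:
                matchLabels:
                  app.kubernetes.io/name: grafana
          ports:
            - protocol: TCP
              port: 9090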
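9.3 Regular configuration backups
Use the kube-backup tool. The stable Helm repository was archived in 2020, so verify that this chart and its values are still available from a maintained source before relying on it. Also note that rules created through the operator are stored as PrometheusRule objects, not ConfigMaps.
    helm install kube-backup \
      --set schedules.prometheus-rules.schedule="0 3 * * *" \
      --set schedules.prometheus-rules.type=configmap \
      --set schedules.prometheus-rules.namespace=monitoring \
      --set schedules.prometheus-rules.labelSelector="app=kube-prometheus-stack" \
      stable/kube-backup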
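If a dedicated backup chart is not available, the rule and monitor objects can be exported with kubectl alone. A minimal sketch, assuming kubectl access to the monitoring namespace; the function name and the output directory are placeholders, and scheduling (for example a cron entry at 03:00) is left out.

```shell
# Export PrometheusRule and ServiceMonitor objects of a namespace to a dated YAML file.
# $1: namespace, $2: output directory
backup_monitoring_config() {
  out="$2/monitoring-config-$(date +%Y%m%d).yaml"
  mkdir -p "$2" &&
    kubectl -n "$1" get prometheusrules,servicemonitors -o yaml > "$out" &&
    echo "$out"
}

# Usage:
#   backup_monitoring_config monitoring ./backups
```

The function prints the path of the file it wrote, which makes it easy to hand the result to a copy or upload step.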
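10. Cost control
10.1 Choosing a storage type
Pick the storage class per workload. The example keeps gp2; if your cluster has a gp3 StorageClass (it requires the EBS CSI driver), gp3 is usually cheaper per GiB:
    prometheus:
      prometheusSpec:
        storageSpec:
          volumeClaimTemplate:
            spec:
              storageClassName: gp2
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 50Gi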
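10.2 Adjusting the scrape frequency
Tier the scrape interval by business importance. serviceMonitorSelector only decides which ServiceMonitors Prometheus loads and cannot carry an interval, so set the interval on the endpoints of each ServiceMonitor and use a label such as monitoring-level to mark the tier. The two fragments below show only the relevant fields:
    # ServiceMonitor for a critical service
    metadata:
      labels:
        monitoring-level: critical
    spec:
      endpoints:
        - port: web
          interval: 15s
    ---
    # ServiceMonitor for a normal service
    metadata:
      labels:
        monitoring-level: normal
    spec:
      endpoints:
        - port: web
          interval: 60s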
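The effect of the interval on data volume is simple arithmetic: each series produces 86,400 / interval samples per day. The helper below is an illustrative sketch; the function name and the series count of 1,000 are assumptions.

```shell
# Samples produced per day by a number of series at a given scrape interval.
# $1: scrape interval in seconds, $2: number of series
samples_per_day() {
  echo $(( 86400 / $1 * $2 ))
}

samples_per_day 15 1000   # critical tier
samples_per_day 60 1000   # normal tier
```

This prints 5760000 and 1440000: moving a service from the 15s tier to the 60s tier cuts its sample volume, and with it the storage it needs, by a factor of four.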
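10.3 Cleaning up old data automatically
Prometheus itself only deletes data according to retention and retentionSize (5.1); it does not downsample. Per-resolution retention is a feature of the Thanos compactor and is configured through its command-line flags, for example as container args (the two paths are placeholders):
    args:
      - compact
      - --wait
      - --data-dir=/var/thanos/compact
      - --objstore.config-file=/etc/thanos/thanos.yaml
      - --retention.resolution-raw=7d
      - --retention.resolution-5m=30d
      - --retention.resolution-1h=1y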