In the container-orchestration world, etcd is the core data store of Kubernetes, holding the state of the entire cluster. Losing or corrupting etcd data can mean anything from a service outage to a fully paralyzed cluster. While operating production K8s clusters I once went through a 12-hour outage caused by an etcd failure, and that experience pushed me to build a highly reliable backup-and-restore scheme.
The core of this toolset is a dual-protection mechanism of "scheduled snapshots + incremental WAL logs":
```bash
# Typical backup directory layout
/backups/
├── daily/                  # Daily full snapshots
│   ├── 20230501.db
│   └── 20230502.db
└── wal/                    # Archived WAL logs
    ├── 20230501_0001.wal
    └── 20230502_0001.wal
```
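The incremental half of this protection boils down to copying new WAL segments into the archive. A minimal sketch, assuming etcd keeps its WAL under `<data-dir>/member/wal` and the archive layout shown above (the `archive_wal` helper and its paths are illustrative, not part of the original tooling):

```bash
# Sketch: archive new etcd WAL segments alongside the daily snapshots.
archive_wal() {
  src="$1"; archive="$2"
  mkdir -p "${archive}"
  for seg in "${src}"/*.wal; do
    [ -e "${seg}" ] || continue                       # glob matched nothing
    dest="${archive}/$(date +%Y%m%d)_$(basename "${seg}")"
    [ -e "${dest}" ] || cp "${seg}" "${dest}"         # skip segments already archived
  done
}

# Example invocation (paths assumed):
# archive_wal /var/lib/etcd/member/wal /backups/wal
```

Running it from cron every few minutes keeps the archive close to the live log without touching etcd itself.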
These design considerations are reflected in the backup script below:
```bash
#!/bin/bash
# etcd-backup.sh
ETCD_CTL="/usr/local/bin/etcdctl"
ENDPOINTS="https://127.0.0.1:2379"
CERT_DIR="/etc/kubernetes/pki/etcd"
BACKUP_DIR="/backups/$(date +%Y%m%d)"

mkdir -p "${BACKUP_DIR}"

# Take a snapshot with TLS client authentication
${ETCD_CTL} --endpoints="${ENDPOINTS}" \
  --cacert="${CERT_DIR}/ca.crt" \
  --cert="${CERT_DIR}/server.crt" \
  --key="${CERT_DIR}/server.key" \
  snapshot save "${BACKUP_DIR}/snapshot.db"

# Compress and encrypt the backup file
tar czf - "${BACKUP_DIR}" | openssl enc -aes-256-cbc -salt -out "${BACKUP_DIR}.tar.gz.enc"
```
Key parameters:

- `--endpoints`: configure several etcd node addresses for higher reliability
- `snapshot save`: triggers a consistent snapshot, guaranteeing data integrity
- openssl encryption: the password should be managed through KMS or Vault
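To avoid openssl's interactive passphrase prompt, the secret can be read from a file that a KMS or Vault agent keeps in place. A sketch under that assumption (the helper names and passphrase path are mine; `-pass file:` and `-pbkdf2` are standard `openssl enc` options):

```bash
# Sketch: encrypt/decrypt a backup with a passphrase file delivered by KMS/Vault.
encrypt_backup() {
  src="$1"; out="$2"; passfile="$3"
  # -pbkdf2 selects a proper key-derivation function; -pass file: avoids a prompt
  openssl enc -aes-256-cbc -salt -pbkdf2 -pass "file:${passfile}" -in "${src}" -out "${out}"
}

decrypt_backup() {
  src="$1"; out="$2"; passfile="$3"
  openssl enc -d -aes-256-cbc -pbkdf2 -pass "file:${passfile}" -in "${src}" -out "${out}"
}
```

The passphrase file should live on a tmpfs with tight permissions and be rotated with the secret backend.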
When a single etcd node fails:
```bash
# 1. Stop the failed node's service
systemctl stop etcd

# 2. Decrypt and unpack the backup file
openssl enc -d -aes-256-cbc -in backup.tar.gz.enc | tar xz -C /restore

# 3. Restore the data
ETCDCTL_API=3 etcdctl --data-dir /var/lib/etcd-new \
  snapshot restore /restore/snapshot.db

# 4. Swap in the new data directory and restart
mv /var/lib/etcd /var/lib/etcd-bak
mv /var/lib/etcd-new /var/lib/etcd
systemctl start etcd
```
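A restarted member is not necessarily healthy the instant systemd returns. A small retry helper, hedged as a sketch (the `wait_for` function is mine; the health check it wraps is the standard `etcdctl endpoint health`):

```bash
# Sketch: retry a command up to N times with a fixed delay between attempts.
wait_for() {
  attempts="$1"; delay="$2"; shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    "$@" && return 0                 # success: stop retrying
    i=$((i + 1))
    sleep "${delay}"
  done
  return 1                           # exhausted all attempts
}

# Example (endpoint and cert flags as in the backup script):
# wait_for 30 2 etcdctl --endpoints=https://127.0.0.1:2379 endpoint health
```

Gating the next automation step on this check avoids acting on a member that is still replaying its WAL.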
For a cluster-level failure, pay special attention to:
```yaml
# Key flags in /etc/kubernetes/manifests/etcd.yaml
- --initial-cluster-state=new
- --initial-cluster-token=etcd-recovery
```
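For a full cluster rebuild, every member must be restored from the same snapshot with matching `--initial-cluster` flags. A sketch that prints the per-node restore command (node names and peer URLs are placeholders for your topology; the `etcdctl snapshot restore` flags themselves are standard):

```bash
# Sketch: emit the restore command for each member of a 3-node cluster.
NODES="etcd-1 etcd-2 etcd-3"
CLUSTER=""
for n in ${NODES}; do
  CLUSTER="${CLUSTER}${CLUSTER:+,}${n}=https://${n}:2380"
done

for n in ${NODES}; do
  cat <<EOF
# run on ${n}:
etcdctl snapshot restore /restore/snapshot.db \\
  --name ${n} \\
  --initial-cluster ${CLUSTER} \\
  --initial-cluster-token etcd-recovery \\
  --initial-advertise-peer-urls https://${n}:2380 \\
  --data-dir /var/lib/etcd-new
EOF
done
```

Because the token and member list are identical everywhere, the restored members form one new cluster instead of three singletons.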
After the restore, confirm the members' status:

```bash
etcdctl endpoint status --write-out=table
```
Add the following to /etc/systemd/system/etcd.service:
```ini
[Service]
Environment="ETCD_SNAPSHOT_COUNT=10000"       # committed transactions before an internal snapshot is triggered
Environment="ETCD_HEARTBEAT_INTERVAL=500"     # heartbeat interval (ms)
Environment="ETCD_ELECTION_TIMEOUT=2500"      # election timeout (ms)
```
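The heartbeat and election values should not be tuned independently: etcd's own defaults (100ms/1000ms) keep the election timeout at 10x the heartbeat, and the values above sit at 5x. A quick sanity check before restarting (the 5x floor here is a rule-of-thumb assumption, not an etcd-enforced limit):

```bash
# Sketch: sanity-check tuning values before applying them.
HEARTBEAT_MS=500
ELECTION_MS=2500
if [ "${ELECTION_MS}" -ge "$((HEARTBEAT_MS * 5))" ]; then
  echo "ok: election timeout is $((ELECTION_MS / HEARTBEAT_MS))x the heartbeat"
else
  echo "warning: election timeout too close to heartbeat interval"
fi
```

After editing the unit file, apply it with `systemctl daemon-reload` followed by a rolling restart of the members.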
Prometheus should monitor these key metrics:
```yaml
- job_name: 'etcd'
  metrics_path: '/metrics'
  static_configs:
    - targets: ['etcd-1:2379', 'etcd-2:2379']
  metric_relabel_configs:
    - source_labels: [__name__]
      # trailing .* keeps the histogram's _bucket/_sum/_count series
      regex: 'etcd_disk_backend_commit_duration_seconds.*|etcd_server_leader_changes_seen_total'
      action: keep
```
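One pitfall worth knowing: Prometheus anchors relabel regexes to the whole metric name, so a pattern without a trailing wildcard silently drops histogram sub-series like `_bucket`. `grep -Ex` mimics that full-match behavior against a few illustrative series names:

```bash
# Sample series names as etcd exposes them (illustrative list).
names="etcd_disk_backend_commit_duration_seconds_bucket
etcd_disk_backend_commit_duration_seconds_sum
etcd_disk_backend_commit_duration_seconds_count
etcd_server_leader_changes_seen_total"

# Anchored match without a wildcard: only the plain counter survives.
echo "${names}" | grep -Ecx 'etcd_disk_backend_commit_duration_seconds|etcd_server_leader_changes_seen_total'   # -> 1

# With a trailing wildcard, the histogram sub-series are kept as well.
echo "${names}" | grep -Ecx 'etcd_disk_backend_commit_duration_seconds.*|etcd_server_leader_changes_seen_total' # -> 4
```

If the commit-duration panels in your dashboard come up empty, this anchoring is the first thing to check.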
Symptom: `etcdctl snapshot save` fails with "context deadline exceeded"

Resolution steps:

1. Check whether the etcd process is CPU- or I/O-bound:

```bash
pidstat -p $(pgrep etcd) -u -d 1 3
```

2. Raise the client-side timeout:

```bash
etcdctl --command-timeout=300s snapshot save ...
```

3. Free disk space by pruning backups older than seven days:

```bash
find /backups -type f -mtime +7 -exec rm {} \;
```
Symptom: data is inconsistent between nodes, with "request cluster ID mismatch" errors
Resolution: restore every affected member from the same snapshot and bootstrap with a fresh `--initial-cluster-token`, so nodes still carrying the old cluster ID cannot rejoin.

To automate all of this, create a scheduled backup task:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
spec:
  schedule: "0 */6 * * *"     # every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: bitnami/etcd:3.5
              command:
                - /bin/sh
                - -c
                - |
                  etcdctl --endpoints=https://${ETCD_POD_IP}:2379 \
                    --cert=/certs/server.crt \
                    --key=/certs/server.key \
                    --cacert=/certs/ca.crt \
                    snapshot save /backups/snapshot-$(date +%s).db
          restartPolicy: OnFailure
```
Automatically verify data integrity after each backup:
```bash
# Verify that the snapshot is readable
etcdctl --write-out=table snapshot status snapshot.db

# Record the file's hash
sha256sum snapshot.db > snapshot.db.sha256

# Test-restore into a scratch directory
etcdctl snapshot restore snapshot.db --data-dir /tmp/etcd-verify
```
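The recorded hash only pays off if it is checked again before a restore. A small sketch (the `verify_checksum` helper is mine; it relies on the `.sha256` file written by the step above):

```bash
# Sketch: verify a snapshot against its recorded checksum before restoring.
verify_checksum() {
  dir="$1"; file="$2"
  # cd first so the relative filename inside the .sha256 file resolves
  ( cd "${dir}" && sha256sum -c "${file}.sha256" --status )
}

# Example: verify_checksum /backups/20230501 snapshot.db && echo "safe to restore"
```

A nonzero exit here means the backup was corrupted in transit or at rest, and the restore should fall back to an older snapshot.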
This scheme has been refined over three years and proven in dozens of production clusters. I recommend that operations teams run a full disaster-recovery drill at least once a quarter, so that when a real failure hits they can respond quickly.