A Practical Guide to AWS EKS as a Managed Kubernetes Solution - 代码聚汇网

葱切成葱花

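1. Why choose EKS as your managed Kubernetes solution

After years of hands-on work in container orchestration, I increasingly prefer to run production environments on mature cloud services. AWS EKS (Elastic Kubernetes Service), the official AWS-managed Kubernetes service, has several advantages over a self-built cluster that are hard to replace:

No control plane operations: the control plane is fully managed by AWS, which automatically handles low-level work such as etcd backups and API server scaling. Our team used to spend 30% of its effort maintaining the control plane of self-built clusters; that work is now gone entirely.
Native integration with the AWS ecosystem: ELB load balancers are wired to Ingress resources, IAM roles bind directly to ServiceAccounts, and CloudWatch collects logs. Last week one of our customers needed to connect to RDS, and binding an IAM role to the Pod got the authentication configured in just 15 minutes.
Hybrid-cloud friendly: EKS Anywhere lets you run a fully compatible Kubernetes environment in an on-premises data center. Last year we helped a financial institution adopt a hybrid EKS deployment that met its data-localization requirements while keeping the development environment consistent.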

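EKS has its limits, though. The most obvious is the cost structure: every cluster incurs a fixed control plane fee of about $73 per month ($0.10 per hour), which is poor value for small test clusters. I usually advise customers to use EKS in development environments, while production needs to be evaluated against the actual workload.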
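To make the fixed fee concrete, here is a minimal sketch of the arithmetic behind the $73 figure. The hourly rate of 0.10 USD and the 730-hour month are assumptions, and `monthly_control_plane_cost` is a hypothetical helper name, so check the current EKS pricing page before budgeting.

```shell
# Estimate the fixed monthly control plane fee for one or more EKS clusters.
# Assumptions: 0.10 USD per cluster-hour and a 730-hour month.
# $1 = hourly rate, $2 = hours per month, $3 = number of clusters (default 1)
monthly_control_plane_cost() {
  awk -v r="$1" -v h="$2" -v c="${3:-1}" 'BEGIN { printf "%.2f\n", r * h * c }'
}

monthly_control_plane_cost 0.10 730      # a single cluster
monthly_control_plane_cost 0.10 730 3    # e.g. separate dev, staging and prod clusters
```

The fee scales with the number of clusters rather than with load, which is why consolidating small test environments into namespaces of one cluster is often cheaper than giving each its own cluster.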

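2. Preparation: account permissions and toolchain setup

2.1 Fine-grained IAM permissions

Before creating an EKS cluster, make sure the operating account has sufficient permissions. I strongly recommend not using AdministratorAccess directly, and instead granting least privilege through a custom policy. The template below has been validated in production and covers the EKS and IAM calls; creating the cluster through eksctl additionally requires CloudFormation and EC2 permissions:

{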
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "eks:CreateCluster",
        "eks:DescribeCluster",
        "iam:CreateServiceLinkedRole"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::<ACCOUNT_ID>:role/eksServiceRole"
    }
  ]
}

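Key point: the policy must include iam:CreateServiceLinkedRole; otherwise cluster creation stalls while waiting for the IAM role to become active. Last year this cost our team two hours of troubleshooting.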
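The policy template carries an `<ACCOUNT_ID>` placeholder that has to be filled in before the policy can be created. The following is only a sketch of one way to do that: `render_policy` is a hypothetical helper, and the account ID `123456789012` is made up for the demo.

```shell
# Hypothetical helper: replace the <ACCOUNT_ID> placeholder in a policy template.
# $1 = 12-digit AWS account ID, $2 = path to the template file
render_policy() {
  case "$1" in
    ""|*[!0-9]*) echo "account ID must be numeric" >&2; return 1 ;;
  esac
  if [ "${#1}" -ne 12 ]; then
    echo "account ID must have 12 digits" >&2
    return 1
  fi
  sed "s/<ACCOUNT_ID>/$1/g" "$2"
}

# Demo with a made-up account ID and just the iam:PassRole resource line.
tpl="$(mktemp)"
printf '%s\n' '"Resource": "arn:aws:iam::<ACCOUNT_ID>:role/eksServiceRole"' > "$tpl"
render_policy 123456789012 "$tpl"
```

Validating the account ID up front avoids creating a policy whose iam:PassRole resource silently points at a role that does not exist.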

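2.2 Installing the local toolchain

Unlike plain EC2 operations, managing EKS requires a specific set of command-line tools:

AWS CLI v2: be sure to install the latest version, since older versions may lack support for the eks commands

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install --update

eksctl: the officially recommended cluster management tool

curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin

kubectl: the version needs to match the Kubernetes version of the EKS cluster (within one minor version)

curl -O https://s3.us-west-2.amazonaws.com/amazon-eks/1.24.7/2022-10-31/bin/linux/amd64/kubectl
chmod +x kubectl && sudo mv kubectl /usr/local/bin/

Verify the toolchain (--client keeps kubectl from trying to reach a cluster that does not exist yet):

aws --version | head -n1
eksctl version
kubectl version --short --client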
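Kubernetes supports kubectl within one minor version (older or newer) of the API server, so "matching" versions can be checked mechanically. Below is a small sketch of that check; `minor_of` and `within_skew` are hypothetical helper names, and the version strings are passed in by hand rather than read from a live cluster.

```shell
# Extract the minor number from a version such as "1.24.7" or "v1.24".
minor_of() {
  echo "$1" | sed 's/^v//' | cut -d. -f2
}

# Succeed when the kubectl version ($1) is within one minor version
# of the cluster version ($2). Assumes both share the same major version.
within_skew() {
  delta=$(( $(minor_of "$1") - $(minor_of "$2") ))
  [ "$delta" -ge -1 ] && [ "$delta" -le 1 ]
}

if within_skew "1.24.7" "1.24"; then
  echo "kubectl 1.24.7 vs cluster 1.24: ok"
fi
if ! within_skew "1.27.0" "1.24"; then
  echo "kubectl 1.27.0 vs cluster 1.24: outside the supported skew"
fi
```

A check like this is handy in CI images, where a stale kubectl binary tends to go unnoticed until a cluster upgrade.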

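3. Creating a cluster: from zero to production ready

3.1 Creating a basic cluster

The simplest command creates a cluster with a single node group of three nodes:

eksctl create cluster \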
  --name production \
  --version 1.24 \
  --region us-west-2 \
  --nodegroup-name ng-1 \
  --node-type t3.medium \
  --nodes 3

For real production environments, however, I recommend the config-file approach. This template has been validated across more than 20 cluster deployments:

# cluster-config.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: production-cluster
  region: ap-northeast-1
  version: "1.24"

vpc:
  cidr: 10.10.0.0/16

# spot and instanceTypes are only supported on managed node groups
managedNodeGroups:
  - name: ng-spot
    instanceTypes: ["t3.large", "t3a.large"]
    spot: true
    desiredCapacity: 2
    privateNetworking: true
    labels: { env: production, workload: general }

nodeGroups:
  - name: ng-on-demand
    instanceType: m5.xlarge
    minSize: 1
    maxSize: 3
    volumeSize: 100
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"

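Key parameters:

spot: true uses Spot instances to reduce cost (suited to non-critical workloads)
privateNetworking places the nodes in private subnets
volumeSize sets the size of each node's root EBS volume (in GiB)
tags marks the node group's Auto Scaling group so that Cluster Autoscaler can discover it

Create the cluster:

eksctl create cluster -f cluster-config.yaml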

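3.2 Network architecture in depth

By default eksctl creates a new dedicated VPC for the cluster, but for production I recommend using an existing VPC and controlling the network topology precisely:

vpc: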
  id: vpc-0123456789
  subnets:
    private:
      ap-northeast-1a: { id: subnet-1111111111 }
      ap-northeast-1c: { id: subnet-2222222222 }
    public:
      ap-northeast-1a: { id: subnet-3333333333 }
      ap-northeast-1c: { id: subnet-4444444444 }

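Networking recommendations:

Deploy worker nodes in private subnets
Reserve the public subnets for the Ingress Controller
Make sure a NAT gateway is available (for egress traffic)
Minimum security group rules:
- All traffic between nodes (for the Pod network; DNS also needs UDP 53)
- HTTPS (port 443) between nodes and the control plane, plus port 10250 from the control plane to the kubelet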
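Because the VPC CNI gives every Pod an IP address from the node's subnet, subnet sizing deserves a quick sanity check when planning the topology. The sketch below only does the arithmetic: AWS reserves 5 addresses in every subnet, and the /19 prefix is an illustrative choice for carving up a /16 such as the 10.10.0.0/16 CIDR used in the template, not something the original configuration specifies.

```shell
# Usable IPv4 addresses in an AWS subnet of the given prefix length:
# 2^(32 - prefix) minus the 5 addresses AWS reserves per subnet.
usable_ips() {
  echo $(( (1 << (32 - $1)) - 5 ))
}

# Number of equally sized subnets ($2 = subnet prefix) that fit in a VPC ($1 = VPC prefix).
subnets_in_vpc() {
  echo $(( 1 << ($2 - $1) ))
}

usable_ips 19          # addresses available to nodes and Pods in one /19 subnet
subnets_in_vpc 16 19   # how many /19 subnets fit in a /16 VPC
```

If the private subnets of an existing VPC are much smaller than this (a /24 leaves only 251 usable addresses), Pod scheduling can fail from IP exhaustion long before the nodes run out of CPU or memory.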

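4. Cluster components and production hardening

4.1 Installing core add-ons

EKS does not install the following key components by default, so they must be deployed manually:

Cluster Autoscaler (the commands below prepare its IAM role and ServiceAccount; AmazonEKSClusterAutoscalerPolicy is a customer-managed policy that must already exist in your account, and the autoscaler Deployment itself is installed separately):

eksctl utils associate-iam-oidc-provider \
  --cluster production-cluster \
  --approve

eksctl create iamserviceaccount \
  --name cluster-autoscaler \
  --namespace kube-system \
  --cluster production-cluster \
  --attach-policy-arn arn:aws:iam::<ACCOUNT_ID>:policy/AmazonEKSClusterAutoscalerPolicy \
  --approve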
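The policy attached to the cluster-autoscaler ServiceAccount has to exist in IAM before eksctl can attach it. As a rough sketch, the script below writes a policy document with the Auto Scaling permissions Cluster Autoscaler typically needs; the exact action list is my assumption based on the upstream Cluster Autoscaler documentation, so review it against the version you deploy. The script only writes the file and leaves the actual `aws iam create-policy` call as a manual step.

```shell
# Write a candidate IAM policy document for Cluster Autoscaler to a temp file.
# The action list is an assumption; tighten "Resource" for production use.
policy_file="$(mktemp)"
cat > "$policy_file" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeTags",
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup",
        "ec2:DescribeLaunchTemplateVersions"
      ],
      "Resource": "*"
    }
  ]
}
EOF
echo "Policy written to $policy_file"

# Next step, run manually with credentials that allow iam:CreatePolicy:
#   aws iam create-policy \
#     --policy-name AmazonEKSClusterAutoscalerPolicy \
#     --policy-document "file://$policy_file"
```

Keeping the policy in a file also makes it easy to put under version control alongside the cluster config.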

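AWS Load Balancer Controller (the successor to the legacy ALB Ingress Controller; serviceAccount.create=false assumes that an aws-load-balancer-controller ServiceAccount with the required IAM role already exists in kube-system):

helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=production-cluster \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller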

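4.2 Security hardening

Pod Security Policy (PSP) and its alternative (PSP was removed in Kubernetes 1.25, so this manifest applies only to clusters on 1.24 or earlier; newer clusters use Pod Security Admission instead):

kubectl apply -f https://raw.githubusercontent.com/aws/containers-roadmap/master/preview/psp/eks-psp.yaml

Image scanning:

helm repo add aquasec https://aquasecurity.github.io/helm-charts/
helm install trivy aquasec/trivy \
  --set trivy.ignoreUnfixed=true

Network policy engine:

kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/master/config/master/calico-operator.yaml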

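5. Day-to-day operations and troubleshooting

5.1 Common failure scenarios

Nodes cannot join the cluster:

Check that the node security group allows communication with the control plane (TCP 443)

Inspect the kubelet logs on the node:

journalctl -u kubelet --no-pager

Verify that the node IAM role includes AmazonEKSWorkerNodePolicy (nodes normally also need AmazonEC2ContainerRegistryReadOnly to pull images)

Pod networking problems:

Check the VPC CNI logs (the aws-node DaemonSet; Calico, if installed, logs separately):

kubectl logs -n kube-system -l k8s-app=aws-node

Verify the VPC CNI version:

kubectl describe daemonset -n kube-system aws-node | grep Image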

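5.2 Performance tuning tips

DNS query optimization:

# edit the coredns ConfigMap in kube-system
data:
  Corefile: |
    .:53 {
        errors
        # keep the kubernetes plugin, otherwise in-cluster Service names stop resolving
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
        }
        cache 30
        loop
        reload
        ready
        forward . /etc/resolv.conf
    }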

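VPC CNI tuning (prefix delegation raises the number of Pod IPs available per node and requires Nitro-based instance types):

kubectl set env ds aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true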
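To see what prefix delegation buys, it helps to compare the Pod limit per node in both modes. The sketch below uses the commonly cited VPC CNI formulas; the ENI and per-ENI address counts for t3.large (3 and 12) are taken from the EC2 instance limits as I remember them, so treat them as assumptions and look up your own instance type. `max_pods` is a hypothetical helper name.

```shell
# Max Pods per node with the VPC CNI:
#   secondary-IP mode:  ENIs * (IPs per ENI - 1) + 2
#   prefix delegation:  ENIs * (IPs per ENI - 1) * 16 + 2   (each slot holds a /28 prefix)
# $1 = number of ENIs, $2 = IPv4 addresses per ENI, $3 = "prefix" to enable delegation
max_pods() {
  enis=$1
  ips_per_eni=$2
  mode=${3:-secondary}
  slots=$(( enis * (ips_per_eni - 1) ))
  if [ "$mode" = "prefix" ]; then
    slots=$(( slots * 16 ))
  fi
  echo $(( slots + 2 ))
}

# t3.large, assumed limits: 3 ENIs with 12 IPv4 addresses each
max_pods 3 12            # secondary-IP mode
max_pods 3 12 prefix     # prefix delegation, before any kubelet cap
```

The prefix-delegation number is a theoretical ceiling; in practice the kubelet's max-pods setting still applies, and AWS guidance suggests capping smaller instance types at around 110 Pods.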

Enable control plane logging:

aws eks update-cluster-config \
  --name production-cluster \
  --logging '{"clusterLogging":[{"types":["api","audit","authenticator"],"enabled":true}]}'

6. Cost control and monitoring

6.1 Cost optimization strategies

Mixed deployment with Spot instances:

managedNodeGroups:
  - name: ng-spot
    instanceTypes: ["t3.large", "t3a.large"]
    spot: true
    labels: { "lifecycle": "spot" }

Vertical Pod Autoscaler:

helm install vpa vpa \
  --repo https://charts.fairwinds.com/stable \
  --set recommender.extraArgs.v=4

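6.2 Monitoring integration

Prometheus + Grafana:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/prometheus \
  --set server.persistentVolume.storageClass=gp2

AWS-native monitoring (Container Insights; the manifest contains {{cluster_name}} and {{region_name}} placeholders that must be substituted before it is applied):

curl -s https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluentd-quickstart.yaml | sed "s/{{cluster_name}}/production-cluster/;s/{{region_name}}/ap-northeast-1/" | kubectl apply -f -

In a recent customer project, this monitoring setup cut our mean incident response time from 47 minutes to 8 minutes. The Pod-level metrics from CloudWatch Container Insights in particular helped us quickly pin down a memory leak in one Deployment.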