The first time I touched Calico's network policies, the YAML configuration completely lost me. It only clicked later that a NetworkPolicy is essentially a "firewall rule" inside Kubernetes, one that specifically governs traffic between Pods. Picture the access-control system of an apartment building: the NetworkPolicy is the guard who decides who may enter which room.
Let's start with the simplest policy template:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: basic-allow
spec:
  podSelector:
    matchLabels:
      role: frontend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: backend
```
This policy does three things:

- selects Pods labeled `role=frontend` as its target;
- allows inbound traffic from Pods labeled `role=backend`;
- declares `Ingress` in `policyTypes`, so only inbound traffic is controlled.

A common beginner mistake in practice is forgetting the `policyTypes` field. I once spent two hours debugging in production only to discover the policy was silently ineffective because this field was missing. Remember: explicitly declare whether you are controlling Ingress (inbound) traffic, Egress (outbound) traffic, or both.
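One habit that sidesteps the missing-`policyTypes` trap is starting every namespace from an explicit default-deny baseline and layering allow rules on top. A minimal sketch (the policy name is illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all   # illustrative name
spec:
  podSelector: {}          # empty selector = every Pod in the namespace
  policyTypes:             # both directions declared explicitly
  - Ingress
  - Egress
```

With no `ingress` or `egress` rule lists present, all traffic in both directions is denied for the selected Pods.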
Real-world scenarios usually require combining several conditions, for example:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: advanced-demo
spec:
  podSelector:
    matchLabels:
      app: payment
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          env: monitoring
    - podSelector:
        matchLabels:
          component: prometheus
  egress:
  - to:
    - ipBlock:
        cidr: 10.0.0.0/8
    ports:
    - protocol: TCP
      port: 443
```
Here is a gotcha: when `namespaceSelector` and `podSelector` appear inside the same `from` item, they are ANDed; to get an OR, split them into separate `from` items. Last year my company hit exactly this misconfiguration, and the monitoring system stopped collecting data.
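The two spellings look almost identical in YAML, so a side-by-side sketch helps (labels reuse the example above):

```yaml
ingress:
# AND: one from item -- the peer must live in an env=monitoring namespace
# AND carry the label component=prometheus
- from:
  - namespaceSelector:
      matchLabels:
        env: monitoring
    podSelector:
      matchLabels:
        component: prometheus
# OR: two from items -- either condition alone is enough
- from:
  - namespaceSelector:
      matchLabels:
        env: monitoring
  - podSelector:
      matchLabels:
        component: prometheus
```

The only difference is the extra `-` before `podSelector`; that single dash turns an AND into an OR.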
Calico's GlobalNetworkPolicy lets you define cluster-wide security policies. This is especially useful in the financial industry, for example to force every database Pod in the cluster to be isolated:
```yaml
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: db-isolation
spec:
  selector: app == 'database'
  ingress:
  - action: Allow
    protocol: TCP
    source:
      selector: app == 'app-server'
    destination:
      ports: [5432]
  egress:
  - action: Deny
```
egress:
- action: Deny
Note that a GlobalNetworkPolicy can take precedence over a namespaced NetworkPolicy: Calico evaluates all policies according to their `order` field. Once, while debugging in a test environment, a local policy stubbornly refused to take effect; it turned out a global policy was matching the traffic first.
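Precedence is controlled with the numeric `order` field: lower values are evaluated first, and policies without an `order` sort last. A sketch of pinning the isolation policy early in the evaluation sequence (the name and order value are illustrative):

```yaml
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: db-isolation-ordered   # illustrative name
spec:
  order: 100                   # evaluated before any policy with order > 100
  selector: app == 'database'
  ingress:
  - action: Allow
    protocol: TCP
    source:
      selector: app == 'app-server'
```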
Calico's bundled calicoctl tool can show how policies are actually taking effect:
```bash
# List all workload endpoints and their status
calicoctl get workloadendpoint -o wide

# Inspect the rules attached to a profile
calicoctl get profile <PROFILE_NAME> -o yaml
```
A more intuitive approach is Calico's Flow Visualizer. During one bout of network jitter, this tool is how we discovered that a policy was unexpectedly blocking health-check traffic. Installation:
```bash
kubectl apply -f https://docs.projectcalico.org/manifests/flow-visualizer.yaml
```
Policy not taking effect:

- Check Felix logs: `kubectl logs -l k8s-app=calico-node -c calico-node`
- Confirm Calico's CRDs are installed: `kubectl get crd networkpolicies.crd.projectcalico.org`

Legitimate traffic being blocked by mistake:
```bash
# Temporarily allow all traffic while debugging
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: temp-allow-all
  namespace: debug
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - {}
  egress:
  - {}
EOF
```
DNS resolution failures:

This is one of the most common problems. You must make sure traffic to kube-dns is allowed:
```yaml
- ports:
  - protocol: UDP
    port: 53
```
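That fragment belongs inside an egress rule; a complete sketch that also allows the TCP fallback DNS uses for large responses (the policy name is illustrative, and the namespace label assumes a recent Kubernetes version that sets `kubernetes.io/metadata.name`):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns          # illustrative name
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP        # DNS falls back to TCP for responses over 512 bytes
      port: 53
```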
Our team has distilled a "three-layer defense" model from practice.
A typical configuration example:
```yaml
# Infrastructure layer: deny all cross-namespace traffic
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: default-deny-cross-ns
spec:
  selector: all()
  types:
  - Ingress
  - Egress
  ingress:
  - action: Allow
    source:
      selector: projectcalico.org/namespace in {'default', 'kube-system'}
  egress:
  - action: Allow
    destination:
      selector: projectcalico.org/namespace in {'default', 'kube-system'}
```
Avoid negative selectors such as `not in` or `!=`; they degrade performance. We once cut a financial client's network latency from 15ms to 3ms by consolidating policies. Policy performance can be monitored with:
```bash
watch -n 1 "calicoctl get heps -o wide | awk '{print \$1,\$5}'"
```
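As an illustration of swapping a negative selector for a positive one (the `env` label values here are assumptions):

```yaml
# Slow: a negative selector forces Calico to re-evaluate every endpoint
selector: env not in {'dev', 'staging'}

# Faster: label production workloads explicitly and match the positive label
selector: env == 'prod'
```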
Deep integration between Calico and Istio enables layer-7 protection:
```yaml
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: istio-mtls
spec:
  selector: app == 'payment'
  ingress:
  - action: Allow
    source:
      serviceAccounts:
        names: ["frontend"]
      namespaceSelector: env == 'prod'
    http:
      methods: ["GET", "POST"]
      paths:
      - prefix: "/api/v1/"
```
Calico's Tigera components support behavior-driven dynamic policy: a StagedNetworkPolicy lets you preview a policy's effect on live traffic before enforcing it.
Configuration example:
```yaml
apiVersion: projectcalico.org/v3
kind: StagedNetworkPolicy
metadata:
  name: adaptive-policy
spec:
  stagedAction: Set          # staged: evaluated and reported, not enforced
  selector: app == 'critical'
  ingress:
  - action: Pass
    source:
      selector: app == 'frontend'
```
Restricting Pods so they can only reach specific external services:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: egress-control
spec:
  podSelector:
    matchLabels:
      role: external-api
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 203.0.113.0/24
    ports:
    - protocol: TCP
      port: 443
```
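If a slice of that range must remain unreachable, `ipBlock` also accepts an `except` list; a sketch of the same egress rule (the excluded subnet below is an assumption for illustration):

```yaml
  egress:
  - to:
    - ipBlock:
        cidr: 203.0.113.0/24
        except:
        - 203.0.113.128/25   # assumed restricted subnet, stays blocked
    ports:
    - protocol: TCP
      port: 443
```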
Protecting the nodes themselves from attack:
```yaml
apiVersion: projectcalico.org/v3
kind: HostEndpoint
metadata:
  name: node1-eth0
  labels:
    environment: production
spec:
  interfaceName: eth0
  node: k8s-node-1
  expectedIPs:
  - 192.168.0.100
```
Paired with a host endpoint policy:
```yaml
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: host-protection
spec:
  selector: environment == 'production'
  ingress:
  - action: Allow
    protocol: TCP
    destination:
      ports: [22, 6443]
  - action: Deny
```
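Before applying host endpoint policies, note that Felix maintains failsafe ports that stay reachable even under a Deny rule, precisely so a bad policy cannot lock you out of a node. They can be inspected or tuned on the FelixConfiguration; a sketch showing two entries (the choice of ports here is an illustration, not the full default list):

```yaml
apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  name: default
spec:
  failsafeInboundHostPorts:   # connections to these ports always succeed
  - protocol: TCP
    port: 22                  # keep SSH reachable even under a Deny policy
  - protocol: TCP
    port: 6443                # kube-apiserver
```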
The automated workflow our team uses, as a `.gitlab-ci.yml` example:
```yaml
policy-test:
  stage: test
  image: calicoctl
  script:
    - calicoctl apply -f policies/ --dry-run
    - kubectl apply -f policies/ --dry-run=server
```
Use Kustomize to manage per-environment policy differences:
```text
base/
├── network-policy.yaml
overlays/
├── dev/
│   └── kustomization.yaml
└── prod/
    └── kustomization.yaml
```
Production policies go through an additional approval flow:
```yaml
# prod/kustomization.yaml
resources:
- ../../base
patchesStrategicMerge:
- extra-restrictions.yaml
```
Enabling verbose logging:
```bash
calicoctl patch felixconfiguration default \
  --patch='{"spec": {"logSeverityScreen": "Debug"}}'
```
A configuration example for feeding a SIEM system:
```yaml
apiVersion: projectcalico.org/v3
kind: LogConfiguration
metadata:
  name: siem-integration
spec:
  logLevel: Info
  filePath: /var/log/calico/audit.log
  syslog:
    severity: Warning
    endpoint: 10.0.0.100:514
```
Core metrics Prometheus should scrape:

- `felix_active_policies`: number of active policies
- `felix_active_selectors`: number of selectors
- `felix_ipset_calls`: frequency of ipset operations

Grafana dashboard suggestions:
Export the policy configuration regularly:
```bash
calicoctl get networkpolicy --all-namespaces -o yaml > policies_$(date +%F).yaml
calicoctl get globalnetworkpolicy -o yaml > global_policies_$(date +%F).yaml
```
When a bad policy takes down a service:
```bash
# 1. Locate the suspect policy
calicoctl get networkpolicy -o wide | grep <problematic-namespace>

# 2. Disable it without deleting it
kubectl annotate networkpolicy <policy-name> \
  projectcalico.org/disable=true -n <namespace>

# 3. Roll back to a known-good export
calicoctl apply -f policies_2023-01-01.yaml
```
Once the cluster passes roughly 500 nodes, consider:
```yaml
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  nodeToNodeMeshEnabled: false
  asNumber: 64512
```
```bash
calicoctl patch felixconfiguration default \
  --patch='{"spec": {"routeRefreshInterval": "120s"}}'
```
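Disabling the node-to-node mesh only works if every node can still learn routes from somewhere, typically a pair of route reflectors. A minimal BGPPeer sketch (the `route-reflector` label on the reflector nodes is an assumption):

```yaml
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: peer-to-rrs                         # illustrative name
spec:
  nodeSelector: all()                       # every node peers with the reflectors
  peerSelector: route-reflector == 'true'   # assumed label on RR nodes
```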
Benchmark testing surfaced several findings. Among the optimization suggestions:

- avoid `not in` selectors

Performance test command:
```bash
kubectl run -it --rm --restart=Never netperf --image=networkstatic/netperf \
  --command -- /bin/bash -c "curl -sSL bench.sh | bash"
```