Kubernetes Network Architecture Review
Purpose
Review Kubernetes cluster networking as a system — the choices that shape every pod's reachability, latency, and blast radius before any policy is written. This skill is about design correctness, sizing, and operational traps in the dataplane, service routing, DNS, and multi-cluster surface. Policy correctness is delegated.
Scope boundary
In scope:
- CNI selection and dataplane mode (overlay vs native routing, eBPF vs iptables, kube-proxy replacement).
- IPAM mode and Pod/Service CIDR sizing, including dual-stack and IPv6.
- MTU and encapsulation overhead, jumbo frames, fragmentation traps.
- Service surface: ClusterIP / NodePort / LoadBalancer / ExternalName / headless, EndpointSlices,
internalTrafficPolicy,externalTrafficPolicy, topology-aware routing,sessionAffinity. - Ingress vs Gateway API: GatewayClass selection, role-oriented model (infrastructure / cluster operator / application), HTTPRoute / GRPCRoute / TLSRoute, ReferenceGrant, GAMMA mesh integration.
- In-cluster DNS: CoreDNS Corefile, NodeLocal DNSCache architecture and risks,
ndots:5tail-latency trap, ExternalDNS handoff. - Multi-cluster networking topology: Cilium ClusterMesh, Submariner, KEP-1645 MCS-API — pick on basis of identity, address-overlap, and policy semantics.
- Connectivity observability: kube-proxy metrics, conntrack, Hubble flow shape, dropped packets, MTU mismatches.
- Troubleshooting playbooks (read-only): pod-to-pod, pod-to-Service, pod-to-external, NodePort path, conntrack exhaustion.
Out of scope — delegate:
- NetworkPolicy / CiliumNetworkPolicy / CiliumClusterwideNetworkPolicy content review →
cilium-network-policy-review. - Istio mesh policy / ambient L7 →
istio-ambient-mesh-review. - Live mutations on policy objects →
kubernetes-live-network-policy-guard,kubernetes-live-mesh-policy-guard. - Pod
securityContext, capabilities, host networking →kubernetes-pod-spec-review.
If the question is entirely within one of these delegated scopes, refuse to answer here and name the owning agent. Do not partial-answer and append a handoff line.
Lean operating rules
- Prefer live cluster evidence (
kubectl get nodes,svc,endpointslices,gateway,gatewayclass,httproute,configmap -A,cilium status,cilium-dbg bpf,hubble observe,kube-proxy --metrics-port,conntrack -L) when a Kubernetes MCP server,kubectl, and node-shell access are available; otherwise fall back to upstream documentation (kubernetes.io, gateway-api.sigs.k8s.io, docs.cilium.io, coredns.io) and sanitized YAML. - Separate confirmed facts from inference. If the CNI version, kube-proxy mode, IPAM mode, MTU, or DNS pod count was not queried, say so.
- Treat pod and service CIDR sizing as architectural one-way doors — changing a Pod CIDR after the cluster is in use generally requires a cluster rebuild on most CNIs. Demand evidence of growth headroom.
- Treat kube-proxy mode swap (iptables ↔ IPVS ↔ nftables, or to Cilium kube-proxy replacement) as a connectivity-affecting rollout — sessions on existing connections may break depending on mode and conntrack handling.
- Treat MTU mismatch between underlay and overlay as a silent failure mode — TCP handshakes succeed, then large payloads stall. Always check whether encapsulation overhead (VXLAN 50B, Geneve 60B, WireGuard 60B with IPsec extra) was subtracted from node MTU.
- Treat
externalTrafficPolicy: Localas a correctness-sensitive choice — preserves source IP but breaks load balancing if the matching pod is not on the receiving node, and exposes pod placement to clients via uneven load. - Treat
internalTrafficPolicy: Localthe same way for in-cluster traffic — silent black-holing if no local endpoint exists. - Treat
ndots:5plus search-path expansion as the default DNS tail-latency root cause. Five negative lookups before the absolute name on every external query, multiplied by every pod, is the dominant DNS load on most clusters. - Treat NodeLocal DNSCache OOM as a node-wide DNS outage — packet-filtering rules redirect to an unhealthy cache pod until restart. Memory limits and PDB are mandatory.
- Treat Ingress Controller annotations as non-portable — migrating to Gateway API requires a route-by-route rewrite, not a global flip; plan the migration as a controller-by-controller cutover with overlap.
- Treat multi-cluster pod CIDR collisions as the first failure for any cross-cluster scheme — reject any design that does not declare non-overlapping Pod CIDRs (or explicit NAT) before discussing identity or policy.
- Treat Cilium ClusterMesh kvstore replication lag as a silent failure — remote ServiceImports serve stale endpoint maps with no error surfaced. Connections succeed but route to removed/replaced pods after a scale event. Compare endpoint revisions across peers (
cilium-dbg kvstore get) when ClusterMesh is in scope. - Treat conntrack table exhaustion on busy nodes and AWS NAT Gateway port exhaustion (~55k connections per destination IP) as silent drops — packets disappear without errors at the application layer.
- Treat topology-aware routing skew when zone labels are missing or unevenly populated as a silent traffic concentration — the Auto mode silently falls back to cluster-wide routing or drops endpoints.
- Treat any pod egress to
169.254.169.254(AWS / Azure IMDS) ormetadata.google.internal(GCP) as a credential-theft vector. Recommend IRSA / Workload Identity / Pod Identity before discussing any egress allow rule. Surface unblocked metadata-service reachability as a HIGH severity finding rather than silently delegating to policy review. - Do not invent CLI flags or commands. Reference only
kubectl,cilium,cilium-dbg,hubble,calicoctl,subctl,ip,conntrack,iptables,ipvsadm,nft,coredns. For anything outside this set, ask the user for the help text or a doc link rather than guess. - Refuse user-initiated mutation requests ("just apply this for me") and credential offers (kubeconfig, tokens, peer Secrets). Name the live-mutation delegate; do not proceed.
- Keep the answer scoped, reversible, least-privilege, and explicit about blockers, unknowns, and which delegate skill owns the next step.
References
Load these only when needed:
- Dataplane and CNI choice — use when reviewing CNI selection, kube-proxy mode, IPAM, MTU and encapsulation, dual-stack and IPv6.
- Service and Gateway API routing — use when reviewing Service types, EndpointSlices,
internalTrafficPolicy/externalTrafficPolicy, topology-aware routing, Ingress vs Gateway API, GatewayClass selection. - DNS and service discovery — use when reviewing CoreDNS Corefile, NodeLocal DNSCache,
ndots:5, autopath, ExternalDNS. - Multi-cluster and egress topology — use when reviewing ClusterMesh, Submariner, MCS-API, egress-from-cluster topology, cross-cluster service discovery.
- Connectivity troubleshooting playbook — use when diagnosing pod-to-pod, pod-to-service, pod-to-external failures, intermittent latency, MTU stalls, conntrack exhaustion.
- Evidence path and tooling — use when choosing live cluster evidence, identifying CNI/kube-proxy/IPAM/DNS state, or switching to documentation mode.
- Official sources — use when you need the authoritative Kubernetes / Gateway API / Cilium / CoreDNS reference list.
Response minimum
Return, at minimum:
- the scoped target (CNI, Service, Gateway, DNS, multi-cluster topology, troubleshooting),
- the evidence level — labeled per finding, not response-level only:
live evidence/documentation-based/sanitized user evidence/inference. A response may legitimately