Kubernetes & Helm: Production Infrastructure
Create, review, and architect Kubernetes infrastructure - from raw manifests to Helm charts to multi-cluster strategy. The goal is production-ready, security-hardened, cost-aware infrastructure that a team can maintain.
Target versions (May 2026): Kubernetes 1.34-1.36 (1.36.0 "Haru" released April 22, 2026). Upstream Kubernetes has no LTS; community support per minor is ~14 months, so 1.32 (released December 2024) reached upstream EOL ~February 2026. Managed vendor extended support (AKS/EKS) carries 1.32 patches roughly 2 more years - attribute it to the platform, not upstream. Helm 4.2.0, Helm 3.21.x (parallel v3 maintenance, security fixes until Nov 2026).
This skill covers four domains depending on context:
- Manifests - raw YAML for Deployments, Services, Gateway API routes, ConfigMaps, Secrets, PVCs
- Helm - Helm 4 chart scaffolding, OCI registries, templating, multi-environment values
- Architecture - cluster topology, GitOps, security layers, observability, cost, DR
- Compliance - PCI-DSS 4.0 controls, CDE isolation, audit logging, supply chain
When to use
- Creating or reviewing Kubernetes manifests (Deployment, Service, StatefulSet, Job, HTTPRoute, etc.)
- Scaffolding new Helm charts or improving existing ones
- Designing cluster topology, GitOps strategy, or multi-tenancy
- Implementing security contexts, network policies, RBAC, admission control
- Setting up multi-environment deployments (dev/staging/prod)
- Reviewing infrastructure for production or compliance readiness
- Planning observability, cost optimization, or disaster recovery
- PCI-DSS 4.0 compliance for fintech/payment K8s workloads
When NOT to use
- Configuring CI/CD pipelines (use ci-cd)
- Docker/container image optimization (use docker)
- Security audits of application code (use security-audit)
- Provisioning the cluster itself via IaC (use terraform)
- Database engine configuration running on K8s (use databases)
- Broad read-only cluster health checks, status reports, and post-maintenance diagnostics (use cluster-health)
AI Self-Check
This skill runs inside an AI agent. AI tools consistently produce the same K8s security mistakes. Before returning any generated manifest, verify against this list:
- Security context present on every pod AND every container (not just one level)
- runAsNonRoot: true, readOnlyRootFilesystem: true, allowPrivilegeEscalation: false, drop: ["ALL"]
- Resource requests AND limits set (AI almost never includes these unprompted)
- Image tag is pinned (not :latest, not omitted). Prefer SHA256 digest for production.
- No hardcoded secrets in env vars, ConfigMaps, or Helm values
- Namespace specified explicitly (not relying on context default)
- NetworkPolicy included or mentioned (AI almost never generates these alongside deployments)
- No privileged: true or hostNetwork: true unless explicitly requested and justified
- seccompProfile: { type: RuntimeDefault } present (often forgotten)
- Using Gateway API HTTPRoute for new external access, not legacy Ingress
- Liveness and readiness probes defined: every container has at least a readiness probe
- Kube context verified before any kubectl/helm/argocd command
- Requester is authorized for cluster/admin changes, especially in shared chats. If the request comes from a non-admin participant, stop and ask the authorized owner for approval before kubectl, Helm, ArgoCD, or GitOps edits.
- No auto-sync to production without approval gate
- Current source checked: dated versions, CLI flags, API names, and support windows are verified against primary docs before repeating them
- Hidden state identified: local config, credentials, caches, contexts, branches, cluster targets, or previous runs are made explicit before acting
- Verification is real: final checks exercise the actual runtime, parser, service, or integration point instead of only linting prose or happy paths
- Routing overlap checked: overlapping skills, trigger terms, and "When NOT to use" boundaries are checked before returning guidance
- Spec claims verified: claims about tool behavior, output contracts, or repo conventions are checked against current docs, scripts, or skill files
- API versions checked: manifests, Helm templates, and Gateway resources match the target cluster version
- Cluster context verified: namespace, context, and kubeconfig identity are shown before mutating commands
- kube-proxy mode checked on 1.35+ clusters: IPVS mode is deprecated in 1.35 (removal targeted for a future release); recommend nftables mode for new clusters and flag IPVS in reviews
Run generated manifests through kube-score, kubelinter, or checkov when available.
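As a rough illustration of what a manifest passing the checklist above might look like, here is a minimal sketch. Every name in it (example-api, example-prod, example-gateway, the registry host, the image tag, port 8080, the probe paths, and the UID 10001) is a hypothetical placeholder rather than anything this skill prescribes, and the resource figures are illustrative only; real values should come from measured workload behavior.

```yaml
# Sketch only: all names, the image reference, ports, probe paths, and
# resource numbers below are assumed placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api
  namespace: example-prod            # explicit namespace, not the context default
  labels:
    app.kubernetes.io/name: example-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: example-api
  template:
    metadata:
      labels:
        app.kubernetes.io/name: example-api
    spec:
      securityContext:               # pod-level security context
        runAsNonRoot: true
        runAsUser: 10001
        runAsGroup: 10001
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: api
          # Pinned tag shown for readability; prefer an @sha256:<digest>
          # reference in production.
          image: registry.example.com/example-api:1.4.2
          ports:
            - name: http
              containerPort: 8080
          securityContext:           # container-level security context
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
          readinessProbe:
            httpGet:
              path: /healthz/ready
              port: http
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz/live
              port: http
            initialDelaySeconds: 10
            periodSeconds: 20
          volumeMounts:
            - name: tmp              # writable scratch space for a read-only root FS
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir: {}
---
# Allows ingress to the workload only from the (assumed) gateway namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: example-api-ingress
  namespace: example-prod
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: example-api
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: example-gateway
      ports:
        - protocol: TCP
          port: 8080
```

The NetworkPolicy only restricts ingress; an egress policy and a namespace-wide default-deny would normally accompany it, and enforcement depends on the cluster's CNI supporting NetworkPolicy.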
Performance
- Set requests and limits from measured workload behavior; missing requests damage scheduling and autoscaling.
- Use server-side dry-run and diff before apply; avoid repeated full-cluster renders during tight loops.
- Scope watches, logs, and kubectl get calls by namespace/labels in large clusters.
Best Practices
- Prefer declarative GitOps or reviewed manifests over live imperative changes for production.
- Back up CRDs and custom resources before upgrades or operator changes.
- Use policy gates for privileged pods, hostPath, broad RBAC, and mutable image tags.
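One possible way to implement a policy gate for privileged pods is the built-in ValidatingAdmissionPolicy API (admissionregistration.k8s.io/v1). The sketch below is an assumption about how such a gate could look, not the only option; tools such as Kyverno or OPA Gatekeeper are common alternatives. The policy name, binding name, and the namespace label policy.example.com/enforce are hypothetical.

```yaml
# Sketch only: resource names and the namespace label are assumed placeholders.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: deny-privileged-containers
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["pods"]
  validations:
    # Passes only when no regular container sets privileged: true.
    - expression: >-
        object.spec.containers.all(c,
          !has(c.securityContext) ||
          !has(c.securityContext.privileged) ||
          c.securityContext.privileged == false)
      message: "Privileged containers are not allowed."
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: deny-privileged-containers-binding
spec:
  policyName: deny-privileged-containers
  validationActions: ["Deny"]
  matchResources:
    namespaceSelector:
      matchLabels:
        policy.example.com/enforce: "true"
```

This sketch checks only spec.containers; a production policy would likely also cover initContainers and ephemeralContainers, plus the other gates listed above (hostPath, broad RBAC, mutable image tags).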
Workflow
Step 1: Determine the domain
Based on the request:
- "Create a deployment/service/manifest" -> Manifests
- "Create a Helm chart" / "package for deployment" -> Helm
- "Design the cluster" / "how should we structure" -> Architecture
- "Make this PCI compliant" / "fintech" -> Compliance
- "Review this manifest/chart" -> Apply production checklist + critical rules + AI self-check
Most real tasks blend domains. Work bottom-up: get the manifests right, then template them, then plan the deployment.
Step 2: Gather requirements
Before writing YAML, determine:
- Workload type: stateless (Deployment) vs stateful (StatefulSet) vs batch (Job/CronJob)
- Container image and pinned tag or SHA256 digest
- Ports exposed (container port, service port, protocol)
- Config: env vars, config files, secrets
- Storage: ephemeral (emptyDir) vs persistent (PVC) with access mode and size
- Resources: CPU/memory requests and limits
- Health: startup, liveness, and readiness probe endpoints
- Access: internal-only (ClusterIP) vs external (Gateway API HTTPRoute / LoadBalancer)
- Scale: replicas, HPA thresholds, pod disruption budget
- Compliance: PCI-DSS scope? CDE workload? Regulated environment?
- Sidecars: logging, security, or proxy sidecars? Use native sidecars (GA in 1.33)
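Two of the decisions above, external access and sidecars, map onto fairly specific API shapes. The following is a minimal sketch under assumed names: the Gateway example-gateway, hostname api.example.com, Service example-api, the image references, and the UID are all hypothetical. A native sidecar is declared as an init container with restartPolicy: Always.

```yaml
# Sketch only: gateway, hostname, service, and image names are assumed placeholders.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-api
  namespace: example-prod
spec:
  parentRefs:
    - name: example-gateway          # an existing Gateway, assumed to be managed elsewhere
      namespace: example-gateway
  hostnames:
    - api.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: example-api          # ClusterIP Service in the same namespace
          port: 80
---
apiVersion: v1
kind: Pod
metadata:
  name: example-with-sidecar
  namespace: example-prod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 10001
    seccompProfile:
      type: RuntimeDefault
  initContainers:
    - name: log-shipper
      image: registry.example.com/log-shipper:2.0.1
      restartPolicy: Always          # marks this init container as a native sidecar
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      resources:
        requests: { cpu: 25m, memory: 32Mi }
        limits: { cpu: 100m, memory: 64Mi }
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
          readOnly: true
  containers:
    - name: app
      image: registry.example.com/example-api:1.4.2
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      resources:
        requests: { cpu: 100m, memory: 128Mi }
        limits: { cpu: 500m, memory: 256Mi }
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
  volumes:
    - name: logs
      emptyDir: {}
```

The sidecar starts before the app container and keeps running alongside it, which is the main reason to prefer this form over a second entry under containers. Probes are omitted from the Pod here for brevity; in practice the same pattern would sit inside a Deployment's pod template with probes defined.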
Step 3: Build
Follow the domain-specific section below. Always apply the production checklist (Step 4) and AI self-check before finishing.
Step 4: Validate
# Always verify kube context first
kubectl config current-context
# Manifests
kubectl apply -f <manifest> --dry-run=server # Server-side validation
kube-score score <manifest> # Best practice scoring
checkov -d . --framework kubernetes # Security/compliance scan
# Helm 4
helm lint <chart>/ # Lint chart
helm template <release> <chart>/ # Render templates locally
helm template <release> <chart>/ -f values-prod.yaml # With env overlay
helm install <release> <chart>/ --dry-run=server --debug # Server-side dry run (needs cluster)
Step 5: GitOps-managed emergency or scaling changes
When changing a live workload managed by ArgoCD, Flux, or another reconciler, read references/gitops-emergency-changes.md; live kubectl scale, kubectl patch, or manual apply may be reverted unless the desired state changes too.
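To make the revert behavior concrete, here is a hedged sketch of an Argo CD Application for a production workload. The repository URL, project, path, and names are hypothetical. With no syncPolicy.automated block, syncs happen only when triggered after review; with automated self-heal enabled, a live kubectl scale would be reset to the replica count stored in Git.

```yaml
# Sketch only: repo URL, project, path, and names are assumed placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-api-prod
  namespace: argocd
spec:
  project: example-project
  source:
    repoURL: https://git.example.com/platform/deployments.git
    targetRevision: main
    path: apps/example-api/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: example-prod
  # No "automated" block: production syncs are manual and act as the approval gate.
  # If the following were enabled instead, live drift (for example a manual
  # kubectl scale) would be reverted to the state in Git:
  #   syncPolicy:
  #     automated:
  #       selfHeal: true
  #       prune: true
  syncPolicy:
    syncOptions:
      - CreateNamespace=false
```

Even with manual sync, a live change leaves the Application OutOfSync and will be overwritten on the next sync, so the durable fix is still to commit the new desired state.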
Manifests
Read references/manifest-templates.md for complete, copy-pasteable YAML templ