Kubernetes & Helm: Production Infrastructure

Create, review, and architect Kubernetes infrastructure - from raw manifests to Helm charts to multi-cluster strategy. The goal is production-ready, security-hardened, cost-aware infrastructure that a team can maintain.

Target versions (May 2026): Kubernetes 1.34-1.36 (1.36.0 "Haru" released April 22, 2026). Upstream Kubernetes has no LTS; community support per minor is ~14 months, so 1.32 (released December 2024) reached upstream EOL around February 2026. Managed vendor extended support (AKS/EKS) carries 1.32 patches roughly 2 more years - attribute it to the platform, not upstream. Helm 4.2.0, Helm 3.21.x (parallel v3 maintenance, security fixes until Nov 2026).

This skill covers four domains depending on context:

Manifests - raw YAML for Deployments, Services, Gateway API routes, ConfigMaps, Secrets, PVCs
Helm - Helm 4 chart scaffolding, OCI registries, templating, multi-environment values
Architecture - cluster topology, GitOps, security layers, observability, cost, DR
Compliance - PCI-DSS 4.0 controls, CDE isolation, audit logging, supply chain

When to use

Creating or reviewing Kubernetes manifests (Deployment, Service, StatefulSet, Job, HTTPRoute, etc.)
Scaffolding new Helm charts or improving existing ones
Designing cluster topology, GitOps strategy, or multi-tenancy
Implementing security contexts, network policies, RBAC, admission control
Setting up multi-environment deployments (dev/staging/prod)
Reviewing infrastructure for production or compliance readiness
Planning observability, cost optimization, or disaster recovery
PCI-DSS 4.0 compliance for fintech/payment K8s workloads

When NOT to use

Configuring CI/CD pipelines (use ci-cd)
Docker/container image optimization (use docker)
Security audits of application code (use security-audit)
Provisioning the cluster itself via IaC (use terraform)
Database engine configuration running on K8s (use databases)
Broad read-only cluster health checks, status reports, and post-maintenance diagnostics (use cluster-health)

AI Self-Check

This skill runs inside an AI agent. AI tools consistently produce the same K8s security mistakes. Before returning any generated manifest, verify against this list:

Run generated manifests through kube-score, kube-linter, or checkov when available.

Performance

Set requests and limits from measured workload behavior; without requests the scheduler cannot place pods reliably and the HPA cannot compute utilization targets.
Use server-side dry-run and diff before apply; avoid repeated full-cluster renders during tight loops.
Scope watches, logs, and kubectl get calls by namespace/labels in large clusters.
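As an illustration of why requests matter for autoscaling, here is a minimal HorizontalPodAutoscaler sketch. The Deployment name `web`, the replica bounds, and the 70% target are hypothetical placeholders, not recommendations.

```yaml
# Sketch only: names and thresholds are hypothetical.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                      # hypothetical workload
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization        # percentage of the containers' CPU *requests*
          averageUtilization: 70   # placeholder; derive from measured behavior
```

Because a Utilization target is computed relative to requests, a pod whose containers declare no CPU request gives the autoscaler nothing to measure against.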

Best Practices

Prefer declarative GitOps or reviewed manifests over live imperative changes for production.
Back up CRDs and custom resources before upgrades or operator changes.
Use policy gates for privileged pods, hostPath, broad RBAC, and mutable image tags.
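One possible policy gate is the built-in Pod Security Admission controller, configured through namespace labels. The sketch below assumes a hypothetical namespace named `payments`. It covers privileged pods and hostPath volumes only; broad RBAC and mutable image tags would need a separate mechanism, such as a policy engine or a ValidatingAdmissionPolicy.

```yaml
# Sketch only: the namespace name is hypothetical.
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    # Reject pods that violate the "restricted" Pod Security Standard.
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    # Also surface violations in audit logs and as client-side warnings.
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```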

Workflow

Step 1: Determine the domain

Based on the request:

"Create a deployment/service/manifest" -> Manifests
"Create a Helm chart" / "package for deployment" -> Helm
"Design the cluster" / "how should we structure" -> Architecture
"Make this PCI compliant" / "fintech" -> Compliance
"Review this manifest/chart" -> Apply production checklist + critical rules + AI self-check

Most real tasks blend domains. Work bottom-up: get the manifests right, then template them, then plan the deployment.

Step 2: Gather requirements

Before writing YAML, determine:

Workload type: stateless (Deployment) vs stateful (StatefulSet) vs batch (Job/CronJob)
Container image and pinned tag or SHA256 digest
Ports exposed (container port, service port, protocol)
Config: env vars, config files, secrets
Storage: ephemeral (emptyDir) vs persistent (PVC) with access mode and size
Resources: CPU/memory requests and limits
Health: startup, liveness, and readiness probe endpoints
Access: internal-only (ClusterIP) vs external (Gateway API HTTPRoute / LoadBalancer)
Scale: replicas, HPA thresholds, pod disruption budget
Compliance: PCI-DSS scope? CDE workload? Regulated environment?
Sidecars: logging, security, or proxy sidecars? Use native sidecars (GA in 1.33)
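To show how these requirements map onto a manifest, here is a minimal sketch of a stateless workload with a native sidecar and a pod disruption budget. Every name, image, port, path, and size is a hypothetical placeholder; it is not one of the templates in the reference files.

```yaml
# Sketch only: all names, images, ports, paths, and sizes are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  labels:
    app.kubernetes.io/name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: web
  template:
    metadata:
      labels:
        app.kubernetes.io/name: web
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001
        seccompProfile:
          type: RuntimeDefault
      initContainers:
        - name: log-shipper
          image: registry.example.com/log-shipper:1.4.2
          restartPolicy: Always          # marks this init container as a native sidecar
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          resources:
            requests: {cpu: 50m, memory: 64Mi}
            limits: {cpu: 100m, memory: 128Mi}
      containers:
        - name: web
          image: registry.example.com/web:1.8.3   # pinned tag; a @sha256 digest is stricter
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
          envFrom:
            - configMapRef:
                name: web-config         # assumed to exist
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
          resources:
            requests: {cpu: 250m, memory: 256Mi}
            limits: {cpu: "1", memory: 512Mi}
          startupProbe:
            httpGet: {path: /healthz, port: http}
            failureThreshold: 30
            periodSeconds: 2
          livenessProbe:
            httpGet: {path: /healthz, port: http}
            periodSeconds: 10
          readinessProbe:
            httpGet: {path: /ready, port: http}
            periodSeconds: 5
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir: {}                   # ephemeral scratch space
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: web
```

The sidecar sits under initContainers with restartPolicy: Always, so it starts before the main container and keeps running alongside it.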

Step 3: Build

Follow the domain-specific section below. Always apply the production checklist (Step 4) and AI self-check before finishing.

Step 4: Validate

# Always verify kube context first
kubectl config current-context

# Manifests
kubectl apply -f <manifest> --dry-run=server    # Server-side validation
kube-score score <manifest>                     # Best practice scoring
checkov -d . --framework kubernetes             # Security/compliance scan

# Helm 4
helm lint <chart>/                              # Lint chart
helm template <release> <chart>/               # Render templates locally
helm template <release> <chart>/ -f values-prod.yaml  # With env overlay
helm install <release> <chart>/ --dry-run=server --debug  # Server-side dry run (needs cluster)
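For reference, the values-prod.yaml overlay passed with -f above might look like the sketch below. The keys assume a chart that follows the default `helm create` scaffold; a real chart may name them differently, and all values shown are placeholders.

```yaml
# values-prod.yaml (hypothetical overlay; keys assume the `helm create` scaffold)
replicaCount: 3
image:
  repository: registry.example.com/web   # hypothetical registry
  tag: "1.8.3"                           # pin explicitly; avoid mutable tags
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: "1"
    memory: 512Mi
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
```

Values files given with -f are merged over the chart's default values.yaml, so the overlay only needs the keys that differ per environment.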

Step 5: GitOps-managed emergency or scaling changes

When changing a live workload managed by ArgoCD, Flux, or another reconciler, read references/gitops-emergency-changes.md; live kubectl scale, kubectl patch, or manual apply may be reverted unless the desired state changes too.
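To illustrate why a live change can be reverted, here is a sketch of an Argo CD Application with automated self-healing enabled. The repository URL, path, and names are hypothetical, and this is not taken from the referenced file.

```yaml
# Sketch only: repo URL, path, and names are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/deploy.git
    targetRevision: main
    path: apps/web/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: web
  syncPolicy:
    automated:
      prune: true
      selfHeal: true   # live drift (e.g. a manual kubectl scale) is reconciled back to Git
```

With selfHeal on, a manual replica change lasts only until the next reconciliation, so the durable fix is a commit to the tracked path.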

Manifests

Read references/manifest-templates.md for complete, copy-pasteable YAML templates.
