Cluster Health

Run read-only Kubernetes health checks and report cluster status with evidence. This skill works without private overlays: it requires either an explicit kube context or a confirmed current context. Local users may add gitignored overlays under protected/ for aliases and environment-specific checks.

When to use

User asks to check cluster health, status, diagnostics, node status, or post-maintenance state
Verifying cluster-wide symptoms after upgrades, reboots, Helm changes, GitOps syncs, or incidents
Gathering read-only evidence across nodes, workloads, events, ingress, storage, logs, and policy
Producing a short traffic-light report from Kubernetes and related observability signals

When NOT to use

Writing or reviewing Kubernetes manifests - use kubernetes
Writing Helm charts, Kustomize overlays, or IaC - use kubernetes or terraform
Changing resources, restarting pods, deleting objects, or applying fixes - ask for explicit escalation
Debugging one application deeply after the broad sweep identifies it - use the relevant domain skill

AI Self-Check

Before running checks or reporting results, verify:

Performance

Start with cluster-wide signals before loading symptom-specific references.
Bound logs, events, and object listings by namespace, time window, or selectors.
Prefer summarized evidence over dumping raw Kubernetes output into context.
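
As an illustration of the bounding rule above, a helper can refuse to build a log query unless a context, namespace, selector, and time window are all supplied. This is a minimal sketch, not part of the skill: `bounded_logs_argv` and its default tail of 200 lines are hypothetical, and the commands actually run should still come from the reference files.

```python
def bounded_logs_argv(context, namespace, selector, since="2h", tail=200):
    """Build a kubectl logs argv bounded by namespace, selector, time, and line count.

    Hypothetical helper for illustration; the defaults are assumptions.
    """
    if not context:
        raise ValueError("an explicit kube context is required")
    if not namespace or not selector:
        raise ValueError("namespace and selector are required to bound the query")
    return [
        "kubectl", "--context", context,
        "logs", "--namespace", namespace,
        "--selector", selector,
        f"--since={since}", f"--tail={tail}",
    ]
```

Building the argument list as data, rather than a shell string, keeps the bounds visible and easy to check before anything is executed.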

Best Practices

Treat the current kube context as hidden state until it is explicitly named.
Separate health evidence from remediation; fixes require a separate escalation.
Report permission gaps and missing CRDs as diagnostic findings, not silent skips.
Run only the commands the reference files define. A monitoring context invites improvisation; resist it. When a check you want is not listed, write it as a suggested follow-up instead of guessing a service name, namespace, or path that may not exist.
Do not read a metric's status without knowing what the metric measures. The reference files state what each signal does and does NOT represent; misreading a percentage or a stale value produces a confidently wrong report.

Cluster Registry

This public skill has no built-in private cluster registry.

Users may create local-only overlays under skills/cluster-health/protected/ for private lab, homelab, work, or customer cluster details. The directory is gitignored by this collection. If it exists in the installed skill, read it while using this skill. A user can ask their agent to create or update these files.

Suggested local layout:

protected/
  registry.md            # aliases, kube contexts, CWD patterns, profile mappings
  private-patterns.txt   # terms that must never appear in public files
  <cluster-or-env>.md    # local namespaces, runbooks, dashboards, thresholds

If protected/registry.md exists, read it first and use its alias, context, CWD pattern, and reference mappings.
If the registry maps the target to protected/<cluster-or-env>.md, read that profile before running checks.
If no protected registry exists, require an explicit kube context or ask before using the current context.
Never guess a cluster from a vague request.
Never print protected registry contents in public reports unless the user asks for those exact details.
Treat gitignored status as local privacy, not encryption. Do not put protected overlays in shared logs, issues, PR comments, or public reports.
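
One way to apply the rules above before sharing a report is to scan the draft for terms listed in protected/private-patterns.txt. The sketch below assumes that file holds one literal term per line, with blank lines and `#` comments ignored; that format is an assumption, and `load_terms` and `find_private_terms` are hypothetical names.

```python
def load_terms(lines):
    """Parse private terms, assuming one literal term per line (assumed format)."""
    terms = []
    for line in lines:
        term = line.strip()
        if term and not term.startswith("#"):
            terms.append(term)
    return terms


def find_private_terms(report_text, terms):
    """Return the private terms that occur in report_text, case-insensitively."""
    lowered = report_text.lower()
    return [term for term in terms if term.lower() in lowered]
```

A non-empty result would mean the report still contains protected details and should be redacted before it leaves the local machine.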

Usage

cluster-health [context-or-alias] [timewindow]

context-or-alias is a kube context, current-context confirmation, or protected overlay alias.
timewindow defaults to 2h; use bounded values such as 30m, 1h, 2h, 6h, or 24h.
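
The timewindow argument could be validated along these lines. This is a sketch under stated assumptions: only minute and hour suffixes are accepted, and 24h is treated as the upper bound because it is the largest example value listed; `parse_timewindow` is a hypothetical name.

```python
import re
from datetime import timedelta

MAX_WINDOW = timedelta(hours=24)  # assumption: largest listed example is the cap


def parse_timewindow(value=None):
    """Parse values such as '30m' or '6h'; default to 2h when omitted."""
    if value is None:
        value = "2h"
    match = re.fullmatch(r"(\d+)([mh])", value)
    if not match:
        raise ValueError(f"unsupported time window: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    window = timedelta(minutes=amount) if unit == "m" else timedelta(hours=amount)
    if window <= timedelta(0) or window > MAX_WINDOW:
        raise ValueError(f"time window out of bounds: {value!r}")
    return window
```

Rejecting unknown or oversized values early keeps every later log and event query bounded.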

Workflow

Step 1: Resolve target

If a protected registry maps the request or current directory to an alias, use that mapping. If no mapping exists, require an explicit kube context or ask whether to use kubectl config current-context.
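
The resolution order described in this step might be sketched as follows. The function name, the dictionary shape of the registry, and the returned source labels are all hypothetical; matching on current-directory patterns is left out for brevity.

```python
def resolve_target(requested=None, registry=None,
                   current_context=None, current_context_confirmed=False):
    """Return (kube_context, source) or raise when the target is ambiguous.

    registry is assumed to be a dict of alias -> kube context.
    """
    registry = registry or {}
    if requested and requested in registry:
        return registry[requested], "registry-alias"
    if requested:
        return requested, "explicit-context"
    if current_context and current_context_confirmed:
        return current_context, "confirmed-current-context"
    raise LookupError(
        "no target resolved: ask for a kube context or confirm the current one"
    )
```

Raising instead of falling back silently reflects the rule that a cluster is never guessed from a vague request.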

Step 2: Confirm read-only scope

State the context and time window before running commands. Do not run mutation commands as part of this skill.
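
A simple guard can make the read-only scope explicit by checking the kubectl verb against an allowlist before a command is run. The allowlist below is an illustrative assumption, not a list defined by this skill, and the flag handling covers only a few common global flags.

```python
# Assumed allowlist of read-only kubectl verbs, for illustration only.
READ_ONLY_VERBS = {"get", "describe", "logs", "top", "api-resources", "version"}

# Global flags that consume the following token as their value.
VALUE_FLAGS = {"--context", "--namespace", "-n", "--kubeconfig"}


def kubectl_verb(argv):
    """Return the first non-flag token after 'kubectl', or None."""
    tokens = argv[1:] if argv and argv[0] == "kubectl" else list(argv)
    skip_next = False
    for token in tokens:
        if skip_next:
            skip_next = False
            continue
        if token in VALUE_FLAGS:
            skip_next = True
            continue
        if token.startswith("-"):
            continue
        return token
    return None


def assert_read_only(argv):
    """Raise PermissionError unless the command uses an allowlisted verb."""
    verb = kubectl_verb(argv)
    if verb not in READ_ONLY_VERBS:
        raise PermissionError(f"verb {verb!r} is outside the read-only scope")
    return argv
```

An allowlist fails closed: any verb not named, including ones with both read and write subcommands such as rollout, is rejected until someone decides it is safe.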

Step 3: Run the generic sweep

Start with the cluster-wide checks in references/kubernetes-core.md, then load additional references based on the symptom:

networking or certificate symptoms -> references/networking-ingress.md
release or reconciliation symptoms -> references/helm-gitops.md
pending pods or volume symptoms -> references/storage.md
noisy errors or alert symptoms -> references/monitoring-logs.md
policy, RBAC, or image-risk symptoms -> references/security.md
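
The symptom routing above can be expressed as a lookup table. The file paths come from the list; the exact symptom keywords and the `references_for` helper are hypothetical simplifications.

```python
CORE_REFERENCE = "references/kubernetes-core.md"

# Keywords are illustrative; paths mirror the mapping in this step.
SYMPTOM_REFERENCES = {
    "networking": "references/networking-ingress.md",
    "certificate": "references/networking-ingress.md",
    "release": "references/helm-gitops.md",
    "reconciliation": "references/helm-gitops.md",
    "pending pods": "references/storage.md",
    "volume": "references/storage.md",
    "noisy errors": "references/monitoring-logs.md",
    "alert": "references/monitoring-logs.md",
    "policy": "references/security.md",
    "rbac": "references/security.md",
    "image-risk": "references/security.md",
}


def references_for(symptoms):
    """Return the core reference first, then each matching reference once."""
    refs = [CORE_REFERENCE]
    for symptom in symptoms:
        ref = SYMPTOM_REFERENCES.get(symptom.strip().lower())
        if ref and ref not in refs:
            refs.append(ref)
    return refs
```

Unrecognized symptoms add nothing, so the sweep always starts from the cluster-wide checks and only grows when a known symptom is present.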

Step 4: Classify findings

Use GREEN for healthy signals, YELLOW for degraded or ambiguous state, and RED for user-visible outage, data-risk, or control-plane risk. Distinguish transient rollout noise from persistent degradation.
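
One plausible way to roll individual findings into an overall status is "worst finding wins". That aggregation rule is an assumption, as is treating an empty set of findings as YELLOW on the grounds that no evidence is ambiguous. The sketch does not model the transient versus persistent distinction.

```python
from enum import IntEnum


class Status(IntEnum):
    GREEN = 0
    YELLOW = 1
    RED = 2


def overall_status(findings):
    """Return the most severe status among findings (assumed aggregation rule)."""
    findings = list(findings)
    if not findings:
        # Assumption: absence of evidence is ambiguous, not healthy.
        return Status.YELLOW
    return max(findings)
```

For example, `overall_status([Status.GREEN, Status.YELLOW])` yields `Status.YELLOW`, so a single degraded area is never hidden by healthy ones.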

Step 5: Report

Return a concise report:

# Cluster Health Report - <context> (<timewindow>, YYYY-MM-DD HH:MM)

## Summary
- STATUS: GREEN|YELLOW|RED
- Scope: <contexts, namespaces, time window>
- Key findings: <short bullets>

## Evidence
- <area>: <command or source> -> <observed signal>

## Next Actions
- <read-only follow-up or explicit escalation request>
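
The template above could be filled in programmatically. In this sketch the function name and parameters are hypothetical, and rendering key findings as nested bullets is one interpretation of "short bullets".

```python
from datetime import datetime

VALID_STATUSES = {"GREEN", "YELLOW", "RED"}


def render_report(context, timewindow, status, scope,
                  findings, evidence, next_actions, now=None):
    """Render the report; evidence is a list of (area, source, signal) tuples."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    now = now or datetime.now()
    lines = [
        f"# Cluster Health Report - {context} ({timewindow}, {now:%Y-%m-%d %H:%M})",
        "",
        "## Summary",
        f"- STATUS: {status}",
        f"- Scope: {scope}",
        "- Key findings:",
    ]
    lines += [f"  - {finding}" for finding in findings]
    lines += ["", "## Evidence"]
    lines += [f"- {area}: {source} -> {signal}" for area, source, signal in evidence]
    lines += ["", "## Next Actions"]
    lines += [f"- {action}" for action in next_actions]
    return "\n".join(lines)
```

Keeping evidence as structured tuples until the last step makes it easier to summarize signals instead of pasting raw output.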

Reference Files

references/kubernetes-core.md - nodes, workloads, events, namespaces, and resource pressure
references/helm-gitops.md - Helm releases, GitOps controllers, and reconciliation state
references/networking-ingress.md - services, ingress, load balancers, DNS, and certificates
references/storage.md - PVs, PVCs, CSI drivers, storage classes, and volume attachment
references/monitoring-logs.md - alerts, metrics availability, log triage, and noisy namespaces
references/security.md - read-only checks for RBAC, secrets exposure signals, image risk, and policy engines

Output Contract

See skills/_shared/output-contract.md for the full contract.
