Cluster Health
Run read-only Kubernetes health checks and report cluster status with evidence. This skill works without private overlays by requiring an explicit kube context or a confirmed current context. Local users may add gitignored protected overlays for aliases and environment-specific checks.
When to use
- User asks to check cluster health, status, diagnostics, node status, or post-maintenance state
- Verifying cluster-wide symptoms after upgrades, reboots, Helm changes, GitOps syncs, or incidents
- Gathering read-only evidence across nodes, workloads, events, ingress, storage, logs, and policy
- Producing a short traffic-light report from Kubernetes and related observability signals
When NOT to use
- Writing or reviewing Kubernetes manifests - use kubernetes
- Writing Helm charts, Kustomize overlays, or IaC - use kubernetes or terraform
- Changing resources, restarting pods, deleting objects, or applying fixes - ask for explicit escalation
- Debugging one application deeply after the broad sweep identifies it - use the relevant domain skill
AI Self-Check
Before running checks or reporting results, verify:
- Target context is explicit or the current context was confirmed
- Every kubectl command includes --context <context>
- Every helm command includes --kube-context <context>
- Commands are read-only: no apply, patch, delete, edit, rollout restart, scale, cordon, drain, or exec unless the user explicitly escalates
- Output is capped with head, tail, --since, --field-selector, or selectors
- Time window is bounded and stated in the report
- Protected registry contents are not printed unless the user asks for those exact details
- Findings include evidence, impact, and next action
- Current source checked: dated versions, CLI flags, API names, and support windows are verified against primary docs before repeating them
- Hidden state identified: local config, credentials, caches, contexts, branches, cluster targets, or previous runs are made explicit before acting
- Verification is real: final checks exercise the actual runtime, parser, service, or integration point instead of only linting prose or happy paths
- Routing overlap checked: overlapping skills, trigger terms, and "When NOT to use" boundaries are checked before returning guidance
- Spec claims verified: claims about tool behavior, output contracts, or repo conventions are checked against current docs, scripts, or skill files
- Cluster target explicit: kubeconfig context, namespace, and environment are named before any query
- Read-only posture kept: health checks do not mutate resources or restart workloads unless the user explicitly escalates
- No improvisation: only the read-only commands in the reference files were run; missing coverage was noted as a suggestion, not freelanced with guessed service names, paths, or flags
- Stderr is visible: diagnostic commands surface their failure reason instead of masking it with 2>/dev/null; a missing tool, permission gap, or wrong context is reported, not silently treated as a clean result
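Several of the checks above are mechanical enough to test before a command runs. The following is a minimal sketch, not part of the skill itself: check_command, MUTATING_VERBS, and CONTEXT_FLAG are hypothetical names, and the verb list is copied from the read-only item above. It assumes the command is already split into an argv list, and it only screens kubectl verbs; Helm mutations are not covered.

```python
# Hypothetical pre-flight guard; names are illustrative, not part of the skill.
MUTATING_VERBS = {"apply", "patch", "delete", "edit", "scale", "cordon", "drain", "exec"}
CONTEXT_FLAG = {"kubectl": "--context", "helm": "--kube-context"}


def check_command(argv, context):
    """Return a list of self-check violations for one kubectl or helm command."""
    problems = []
    tool = argv[0] if argv else ""
    flag = CONTEXT_FLAG.get(tool)
    if flag is None:
        return [f"unsupported tool: {tool!r}"]

    # Accept both "--context ctx" and "--context=ctx".
    has_context = f"{flag}={context}" in argv or any(
        a == flag and b == context for a, b in zip(argv, argv[1:])
    )
    if not has_context:
        problems.append(f"missing {flag} {context}")

    # The verb denylist from the self-check is applied to kubectl only.
    if tool == "kubectl":
        words = [a for a in argv[1:] if not a.startswith("-")]
        if any(w in MUTATING_VERBS for w in words):
            problems.append("mutating verb present")
        if "rollout" in words and "restart" in words:
            problems.append("rollout restart present")
    return problems
```

The check is deliberately conservative: any bare word that matches a denied verb is flagged, so a resource that happens to be named "delete" would be a false positive rather than a silent pass.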
Performance
- Start with cluster-wide signals before loading symptom-specific references.
- Bound logs, events, and object listings by namespace, time window, or selectors.
- Prefer summarized evidence over dumping raw Kubernetes output into context.
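As an illustration of summarized evidence, the sketch below collapses an event listing into a few counted lines instead of pasting the raw output. It assumes the input is the JSON text of a Kubernetes Event list (for example, output captured from a command the reference files already define) and relies only on the standard Event fields type, reason, count, and involvedObject.namespace. The function name summarize_warnings is hypothetical.

```python
import json
from collections import Counter


def summarize_warnings(events_json, top=5):
    """Collapse an Event list into 'Nx reason (namespace)' lines, most frequent first."""
    items = json.loads(events_json).get("items", [])
    counts = Counter()
    for ev in items:
        if ev.get("type") != "Warning":
            continue
        namespace = ev.get("involvedObject", {}).get("namespace", "-")
        # Events may carry a repeat count; treat a missing one as a single occurrence.
        counts[(ev.get("reason", "Unknown"), namespace)] += ev.get("count") or 1
    return [f"{n}x {reason} ({ns})" for (reason, ns), n in counts.most_common(top)]
```

A handful of lines such as these can go straight into the Evidence section of the report, with the full listing kept out of context.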
Best Practices
- Treat the current kube context as hidden state until it is explicitly named.
- Separate health evidence from remediation; fixes require a separate escalation.
- Report permission gaps and missing CRDs as diagnostic findings, not silent skips.
- Run only the commands the reference files define. A monitoring context invites improvisation; resist it. When a check you want is not listed, write it as a suggested follow-up instead of guessing a service name, namespace, or path that may not exist.
- Do not read a metric's status without knowing what the metric measures. The reference files state what each signal does and does NOT represent; misreading a percentage or a stale value produces a confidently wrong report.
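One way to keep permission gaps and missing tools visible is to wrap each command so that a failure becomes a finding instead of an empty result. This is a rough sketch under that assumption; run_check and the shape of the returned dictionary are hypothetical, and the wrapper is meant only for commands the reference files already define.

```python
import subprocess


def run_check(argv, timeout=30):
    """Run one read-only command and keep its failure reason visible."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except FileNotFoundError:
        return {"ok": False, "finding": f"tool not installed: {argv[0]}"}
    except subprocess.TimeoutExpired:
        return {"ok": False, "finding": f"timed out after {timeout}s"}
    if proc.returncode != 0:
        # Surface stderr as evidence (permission denied, unknown resource type, bad context).
        return {"ok": False, "finding": proc.stderr.strip() or f"exit code {proc.returncode}"}
    return {"ok": True, "output": proc.stdout}
```

Because stderr is captured rather than discarded, a forbidden request or an absent CRD shows up in the report as its own line of evidence.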
Cluster Registry
This public skill has no built-in private cluster registry.
Users may create local-only overlays under skills/cluster-health/protected/ for private lab,
homelab, work, or customer cluster details. The directory is gitignored by this collection. If it
exists in the installed skill, read it while using this skill. A user can ask their agent to create
or update these files.
Suggested local layout:
protected/
  registry.md            # aliases, kube contexts, CWD patterns, profile mappings
  private-patterns.txt   # terms that must never appear in public files
  <cluster-or-env>.md    # local namespaces, runbooks, dashboards, thresholds
- If protected/registry.md exists, read it first and use its alias, context, CWD pattern, and reference mappings.
- If the registry maps the target to protected/<cluster-or-env>.md, read that profile before running checks.
- If no protected registry exists, require an explicit kube context or ask before using the current context.
- Never guess a cluster from a vague request.
- Never print protected registry contents in public reports unless the user asks for those exact details.
- Treat gitignored as local privacy, not encryption. Do not put protected overlays in shared logs, issues, PR comments, or public reports.
Usage
cluster-health [context-or-alias] [timewindow]
- context-or-alias is a kube context, current-context confirmation, or protected overlay alias.
- timewindow defaults to 2h; use bounded values such as 30m, 1h, 2h, 6h, or 24h.
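The time window argument could be validated along these lines. This is a minimal sketch: parse_timewindow is a hypothetical helper, and treating 24h (the largest example value above) as a hard ceiling is an assumption rather than a documented limit.

```python
import re

DEFAULT_WINDOW = "2h"
MAX_MINUTES = 24 * 60  # assumption: 24h, the largest example value, is the ceiling


def parse_timewindow(value=None):
    """Return (window, minutes), or raise ValueError for malformed or unbounded input."""
    value = value or DEFAULT_WINDOW
    match = re.fullmatch(r"([1-9]\d*)([mh])", value)
    if not match:
        raise ValueError(f"time window must look like 30m or 2h, got {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    minutes = amount * (60 if unit == "h" else 1)
    if minutes > MAX_MINUTES:
        raise ValueError(f"time window {value!r} exceeds 24h")
    return value, minutes
```

For example, parse_timewindow() returns ("2h", 120), while "7d" or "48h" raises ValueError so the caller can ask for a bounded value.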
Workflow
Step 1: Resolve target
If a protected registry maps the request or current directory to an alias, use that mapping. If no
mapping exists, require an explicit kube context or ask whether to use kubectl config current-context.
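The resolution order can be sketched as follows. The format of protected/registry.md is not specified here, so this example assumes its aliases have already been parsed into a plain dictionary. resolve_context and its parameters are hypothetical, and CWD-pattern matching is left out.

```python
def resolve_context(requested, aliases, current_context=None, user_confirmed=False):
    """Pick a kube context without guessing; raise when the target is ambiguous."""
    if requested:
        # A registry alias wins; otherwise treat the value as an explicit context name.
        return aliases.get(requested, requested)
    if current_context and user_confirmed:
        return current_context
    raise LookupError("no explicit context: ask the user before using the current context")
```

The important property is the final branch: with no alias, no explicit context, and no confirmation, the function refuses rather than falling back to whatever context happens to be active.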
Step 2: Confirm read-only scope
State the context and time window before running commands. Do not run mutation commands as part of this skill.
Step 3: Run the generic sweep
Start with the cluster-wide checks in references/kubernetes-core.md, then load additional
references based on the symptom:
- networking or certificate symptoms -> references/networking-ingress.md
- release or reconciliation symptoms -> references/helm-gitops.md
- pending pods or volume symptoms -> references/storage.md
- noisy errors or alert symptoms -> references/monitoring-logs.md
- policy, RBAC, or image-risk symptoms -> references/security.md
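The routing above amounts to a small lookup table. In this sketch the file paths come from the list above, while the symptom keys and the references_for helper are hypothetical labels chosen for illustration.

```python
CORE = "references/kubernetes-core.md"

# Symptom keys are illustrative labels; the paths are the reference files listed above.
SYMPTOM_REFERENCES = {
    "networking": "references/networking-ingress.md",
    "certificate": "references/networking-ingress.md",
    "release": "references/helm-gitops.md",
    "reconciliation": "references/helm-gitops.md",
    "pending-pods": "references/storage.md",
    "volume": "references/storage.md",
    "noisy-errors": "references/monitoring-logs.md",
    "alert": "references/monitoring-logs.md",
    "policy": "references/security.md",
    "rbac": "references/security.md",
    "image-risk": "references/security.md",
}


def references_for(symptoms):
    """Return the core reference first, then each matching reference once, in order."""
    ordered = [CORE]
    for symptom in symptoms:
        ref = SYMPTOM_REFERENCES.get(symptom)
        if ref and ref not in ordered:
            ordered.append(ref)
    return ordered
```

Unknown symptoms are ignored rather than mapped to a guessed file, so the core sweep still runs and the gap can be noted as a suggested follow-up.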
Step 4: Classify findings
Use GREEN for healthy signals, YELLOW for degraded or ambiguous state, and RED for user-visible outage, data-risk, or control-plane risk. Distinguish transient rollout noise from persistent degradation.
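One possible way to roll per-area findings into a single status is to let the most severe one win. That rule is an assumption of this sketch, as is returning YELLOW when there are no findings at all (an empty sweep is treated as ambiguous rather than healthy). SEVERITY and overall_status are hypothetical names.

```python
SEVERITY = {"GREEN": 0, "YELLOW": 1, "RED": 2}


def overall_status(statuses):
    """Return the most severe status among the findings."""
    statuses = list(statuses)
    if not statuses:
        # Assumption: no evidence is ambiguous, so do not report GREEN.
        return "YELLOW"
    # An unknown label raises KeyError instead of being silently ranked.
    return max(statuses, key=SEVERITY.__getitem__)
```

Distinguishing transient rollout noise from persistent degradation still happens before this step, when each individual finding is assigned its color.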
Step 5: Report
Return a concise report:
# Cluster Health Report - <context> (<timewindow>, YYYY-MM-DD HH:MM)
## Summary
- STATUS: GREEN|YELLOW|RED
- Scope: <contexts, namespaces, time window>
- Key findings: <short bullets>
## Evidence
- <area>: <command or source> -> <observed signal>
## Next Actions
- <read-only follow-up or explicit escalation request>
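Filling the template can be sketched as a small formatter. render_report and its parameters are hypothetical; joining the key findings with semicolons is one reading of "short bullets" and can be swapped for nested list items.

```python
from datetime import datetime


def render_report(context, timewindow, status, scope, findings, evidence, next_actions, now=None):
    """Fill the report template; evidence is a list of (area, source, signal) tuples."""
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M")
    lines = [
        f"# Cluster Health Report - {context} ({timewindow}, {stamp})",
        "## Summary",
        f"- STATUS: {status}",
        f"- Scope: {scope}",
        f"- Key findings: {'; '.join(findings) or 'none'}",
        "## Evidence",
    ]
    lines += [f"- {area}: {source} -> {signal}" for area, source, signal in evidence]
    lines.append("## Next Actions")
    lines += [f"- {action}" for action in next_actions]
    return "\n".join(lines)
```

Passing the context and time window explicitly keeps both visible in the report header, which matches the requirement that the time window is stated in the report.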
Reference Files
- references/kubernetes-core.md - nodes, workloads, events, namespaces, and resource pressure
- references/helm-gitops.md - Helm releases, GitOps controllers, and reconciliation state
- references/networking-ingress.md - services, ingress, load balancers, DNS, and certificates
- references/storage.md - PVs, PVCs, CSI drivers, storage classes, and volume attachment
- references/monitoring-logs.md - alerts, metrics availability, log triage, and noisy namespaces
- references/security.md - read-only checks for RBAC, secrets exposure signals, image risk, and policy engines
Output Contract
See skills/_shared/output-contract.md for the full contract.