CI/CD Playbook Skill

Produce a complete, actionable CI/CD playbook for a service or team — covering everything a new engineer needs to understand, contribute to, and operate the pipeline safely.

A good playbook is not a diagram. It is a document that answers: what runs, when, why, who owns it, and what to do when it breaks.

Required Inputs

Ask for these if not already provided:

Service name and brief description
Tech stack — language, framework, containerisation (Docker, etc.)
Source control — GitHub / GitLab / Bitbucket, branching strategy
CI platform — GitHub Actions / CircleCI / Jenkins / BuildKite / other
CD platform / deployment target — Kubernetes, ECS, Lambda, Heroku, VMs, etc.
Environments — e.g. dev, staging, production (and any canary / feature environments)
Deployment frequency — how often does the team ship?
Any existing gates — manual approvals, smoke tests, feature flags
On-call setup — who's responsible during deploys?

Output Format

CI/CD Playbook: [Service Name]

Service: [Name] | Team: [Team name]
Last updated: [Date] | Owner: [Name / role]
Pipeline platform: [CI tool] → [CD tool / platform]

Overview

[2–3 sentences describing what this service does and why the CI/CD pipeline is structured the way it is. Include the deployment target and how frequently the team ships.]

Deployment frequency: [Multiple times per day / Daily / Weekly / On-demand]
Average pipeline duration: [X minutes]
Rollback time (p95): [X minutes]

Pipeline Stages

[Branch push]
    │
    ▼
[1. Build & Lint] ──fail──▶ ❌ Block PR
    │
    ▼
[2. Unit Tests] ──fail──▶ ❌ Block PR
    │
    ▼
[3. Integration Tests] ──fail──▶ ❌ Block PR
    │
    ▼
[4. Security Scan] ──fail──▶ ⚠️ [Block / Warn — specify]
    │
    ▼
[5. Build Artefact / Container Image]
    │
    ▼
[6. Deploy to Staging] ──fail──▶ ❌ Block promotion
    │
    ▼
[7. Smoke Tests (Staging)]
    │
    ▼
[8. Manual Approval Gate] ──(if required)
    │
    ▼
[9. Deploy to Production] ──fail──▶ 🔁 Auto-rollback (if configured)
    │
    ▼
[10. Post-deploy checks]

Stage Definitions

Stage 1 — Build & Lint

What runs: [Build command] + [Linter — e.g. ESLint, golangci-lint, flake8]
Trigger: Every commit to any branch
Blocking: Yes — PR cannot be merged if this fails
Typical duration: [X minutes]
Owner if it fails: PR author

Common failure causes:

[e.g. Missing dependency — run npm install locally before pushing]
[e.g. Lint rule violation — run npm run lint -- --fix to auto-fix most issues]

Stage 2 — Unit Tests

What runs: [Test command — e.g. npm test, go test ./..., pytest]
Coverage gate: [X]% minimum — pipeline fails below this threshold
Trigger: Every commit
Blocking: Yes
Typical duration: [X minutes]

Coverage report: [Where to find it — e.g. uploaded to Codecov, available in CI artifacts]
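
As an illustration only, the coverage gate can be thought of as a single comparison against the agreed minimum. The sketch below assumes line coverage and a hypothetical function name (coverage_gate); most CI tools and coverage plugins provide an equivalent built-in setting, which should be preferred where available.

```python
def coverage_gate(covered_lines: int, total_lines: int, minimum_percent: float) -> bool:
    """Return True when line coverage meets or exceeds the minimum threshold."""
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    percent = 100.0 * covered_lines / total_lines
    return percent >= minimum_percent


if __name__ == "__main__":
    # Hypothetical numbers: 842 of 1000 lines covered, 80% minimum.
    passed = coverage_gate(842, 1000, minimum_percent=80.0)
    print("coverage gate:", "pass" if passed else "fail")
```

Stating in the playbook whether the threshold is inclusive (as assumed here) avoids arguments when a build lands exactly on the limit.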

Stage 3 — Integration Tests

What runs: [Test suite description — e.g. "API integration tests against a test database using Docker Compose"]
Environment: [Ephemeral test environment / shared test DB / etc.]
Trigger: Every commit to main and feature branches targeting main
Blocking: Yes
Typical duration: [X minutes]

If slow: [e.g. "Integration tests can be skipped locally with SKIP_INTEGRATION=true — never skip in CI"]

Stage 4 — Security Scan

Tools: [e.g. Snyk, Trivy, OWASP Dependency Check, Semgrep]
What it checks: [Dependency vulnerabilities / SAST / secrets detection — list what applies]
Blocking on: Critical and High severity findings
Non-blocking on: Medium and Low (flagged, not blocking)
Trigger: Every commit to main

How to handle a flagged vulnerability:

Check whether a fix is available. If it is, upgrade the dependency
If no fix available, open a security ticket and add a suppression with justification
Never suppress without a ticket and owner
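
The severity policy and the suppression rule above could be expressed roughly as follows. This is a minimal sketch, not the output format of any particular scanner: the finding and suppression dictionaries, their keys, and the function names are all hypothetical.

```python
# Severities that fail the pipeline, per the policy in this stage.
BLOCKING_SEVERITIES = {"critical", "high"}


def evaluate_findings(findings):
    """Split findings into (blocking, flagged) lists by severity."""
    blocking = [f for f in findings if f["severity"].lower() in BLOCKING_SEVERITIES]
    flagged = [f for f in findings if f["severity"].lower() not in BLOCKING_SEVERITIES]
    return blocking, flagged


def suppression_is_valid(suppression):
    """A suppression is acceptable only with both a ticket and an owner."""
    return bool(suppression.get("ticket")) and bool(suppression.get("owner"))


if __name__ == "__main__":
    # Hypothetical scan results.
    findings = [
        {"id": "VULN-1", "severity": "High"},
        {"id": "VULN-2", "severity": "Low"},
    ]
    blocking, flagged = evaluate_findings(findings)
    print("blocking:", [f["id"] for f in blocking])
    print("flagged:", [f["id"] for f in flagged])
```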

Stage 5 — Build Artefact

What is produced: [Docker image / binary / zip — be specific]
Registry: [ECR / GCR / Docker Hub / Artifactory — URL]
Tagging convention: [service-name]:[git-sha] (also tagged :latest on main)
Trigger: Commits to main only (not feature branches)
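
A small sketch of the tagging convention, assuming the tags are computed in a build script before the image is pushed. The function name image_tags and the example values are hypothetical.

```python
from typing import List


def image_tags(service_name: str, git_sha: str, branch: str) -> List[str]:
    """Build the tag list: always service:sha, plus service:latest on main."""
    tags = [f"{service_name}:{git_sha}"]
    if branch == "main":
        tags.append(f"{service_name}:latest")
    return tags


if __name__ == "__main__":
    # Hypothetical service name and commit SHA.
    print(image_tags("payments-api", "3f2c1ab", "main"))
```

Deploying by the SHA tag rather than :latest keeps rollbacks unambiguous, because :latest moves with every build from main.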

Stage 6 — Deploy to Staging

Deployment method: [e.g. Helm upgrade / kubectl apply / ecs deploy / Terraform apply]
Staging URL: [URL]
Trigger: Automatic on successful artefact build from main
Who can deploy to staging: Any engineer (automatic)

Environment variables: Managed in [Vault / AWS SSM / GitHub Secrets / etc.]
Staging is not production: [Any differences in config, scale, or data — state them here]

Stage 7 — Smoke Tests (Staging)

What runs: [Description — e.g. "10 critical path tests covering login, core API endpoints, and payment flow"]
Tool: [e.g. Playwright / Postman / custom script]
Pass criteria: All smoke tests pass within [X seconds] timeout
Blocking: Yes — production deploy will not proceed if smoke tests fail

Smoke test suite location: [Link to test files or folder]

Stage 8 — Manual Approval Gate

Required for: [Production deploys / deploys affecting >X% of traffic / deploys to specific regions]
Who can approve: [e.g. Any engineer on the team / Lead engineer / On-call engineer]
Approval timeout: [e.g. 24 hours — auto-cancelled if no approval]
How to approve: [GitHub Actions approve step / Slack command / other — with link]

When to withhold approval:

Active incident in production
Deploy is outside the deployment window (see below)
On-call engineer has not been notified

Stage 9 — Deploy to Production

Deployment method: [Same as staging or different — specify]
Deployment window: [e.g. Monday–Thursday 09:00–16:00 UTC — no deploys on Fridays or before bank holidays]
Canary / progressive rollout: [Yes — X% initial traffic, full rollout after Y minutes / No — full deploy]
Deployment notifications: [Slack channel — #deployments]

Who is on-call during deploy: Deploying engineer is responsible until post-deploy checks pass.
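
If the deployment window is enforced automatically, the check might look like the sketch below. It hard-codes the example window from this stage (Monday to Thursday, 09:00 to 16:00 UTC) and treats the end time as exclusive, which is an assumption. It does not handle bank holidays; those would need a calendar source that this template does not specify.

```python
from datetime import datetime, time, timezone

# Example window from this playbook: Monday-Thursday, 09:00-16:00 UTC.
WINDOW_DAYS = {0, 1, 2, 3}  # datetime.weekday(): Monday is 0, Thursday is 3
WINDOW_START = time(9, 0)
WINDOW_END = time(16, 0)  # exclusive (assumption)


def in_deployment_window(moment: datetime) -> bool:
    """Return True if a timezone-aware moment falls inside the window."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    utc = moment.astimezone(timezone.utc)
    return utc.weekday() in WINDOW_DAYS and WINDOW_START <= utc.time() < WINDOW_END


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("deploy allowed now:", in_deployment_window(now))
```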

Stage 10 — Post-Deploy Checks

Automated checks (run for [X minutes] after deploy):

Error rate: <[X]% (baseline: [Y]%)
P99 latency: <[X]ms (baseline: [Y]ms)
[Key business metric]: within [X]% of baseline

Where to watch: [Datadog / Grafana / CloudWatch dashboard — link]

If a check fails: See Rollback Procedure below.
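
The automated checks above amount to comparing a few observed metrics against thresholds. The following sketch assumes the metrics have already been fetched from the monitoring system; the Thresholds class, the function name, and the sample numbers are hypothetical, and no monitoring API is modelled.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    max_error_rate_pct: float        # check passes when error rate is below this
    max_p99_latency_ms: float        # check passes when P99 latency is below this
    max_business_deviation_pct: float  # allowed deviation from baseline


def failed_checks(error_rate_pct, p99_latency_ms, business_metric,
                  business_baseline, thresholds):
    """Return the names of the post-deploy checks that failed."""
    if business_baseline == 0:
        raise ValueError("business_baseline must be non-zero")
    failures = []
    if error_rate_pct >= thresholds.max_error_rate_pct:
        failures.append("error_rate")
    if p99_latency_ms >= thresholds.max_p99_latency_ms:
        failures.append("p99_latency")
    deviation_pct = abs(business_metric - business_baseline) / abs(business_baseline) * 100.0
    if deviation_pct > thresholds.max_business_deviation_pct:
        failures.append("business_metric")
    return failures


if __name__ == "__main__":
    # Hypothetical thresholds and observations.
    limits = Thresholds(max_error_rate_pct=1.0, max_p99_latency_ms=500.0,
                        max_business_deviation_pct=10.0)
    print(failed_checks(0.2, 320.0, 95.0, 100.0, limits))
```

An empty list means every check passed; any non-empty result would hand over to the rollback procedure.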

Environments

Environment	Purpose	Deploy trigger	URL	Data
Dev	Local development	Manual	localhost	Seeded test data
Staging	Pre-production validation	Automatic (main)	[URL]	Anonymised prod copy
Production	Live traffic	Manual approval	[URL]	Live data

Branching Strategy

Model: [Trunk-based / GitFlow / GitHub Flow — describe briefly]

Branch	Purpose	Who merges	Deploy target
`main`	Production-ready code	PR + review	Staging → Production
`feature/*`	Feature development	Author	None (CI only)
`hotfix/*`	Critical production fixes	Lead engineer	Can bypass staging gate with approval

Hotfix process: [Describe when and how to use a hotfix branch — what level of incident justifies bypassing the standard process]

Rollback Procedure

Automated rollback: [Yes — triggered if post-deploy error rate exceeds [X]% / No — manual only]

Manual rollback steps:

# 1. Identify the last known good image tag
[command to list recent deployments]

# 2. Deploy the previous version
[deployment command with previous tag]

# 3. Confirm rollback is live
[smoke test command or health check URL]

# 4. Notify the team
[Slack command or template]
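
Step 1 depends on knowing which earlier release was healthy. As a rough sketch, assuming deployment history is available as a newest-first list of records with a tag and a health flag (a hypothetical structure, not the output of any specific tool), the last known good tag could be selected like this:

```python
def last_known_good(history, current_tag):
    """Return the newest healthy tag deployed before current_tag, or None.

    history is a newest-first list of dicts with 'tag' and 'healthy' keys.
    """
    seen_current = False
    for record in history:
        if record["tag"] == current_tag:
            seen_current = True
            continue
        # Only consider releases older than the current one.
        if seen_current and record["healthy"]:
            return record["tag"]
    return None


if __name__ == "__main__":
    # Hypothetical history, newest first.
    history = [
        {"tag": "c3d4e5f", "healthy": False},
        {"tag": "b2c3d4e", "healthy": False},
        {"tag": "a1b2c3d", "healthy": True},
    ]
    print(last_known_good(history, "c3d4e5f"))
```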

Rollback decision authority: Any engineer on-call can initiate a rollback withou
