Overview
Production is not a test environment. Every deployment is a live operation with real consequences — user impact, data integrity risks, and potential outages. This skill encodes the discipline senior engineers apply before, during, and after every production deployment.
The core rule: never deploy without a rollback plan you've verified can execute in under 5 minutes.
When to Use
- Before any deployment to a production or production-equivalent environment
- When reviewing deployment scripts or CI/CD pipelines
- When adding new services or infrastructure changes
Process
Step 1: Pre-Deployment Checklist
- All tests pass — CI is green on the exact commit being deployed. Not "mostly green."
- Migrations are backward-compatible — The old code must work with the new schema (for zero-downtime). New columns are nullable; columns aren't dropped until after full rollout.
- Feature flags configured — New features are behind flags, off by default.
- Rollback plan written — Document exactly how to rollback: which commands, which configs, estimated time.
- Deployment window confirmed — Low-traffic period? On-call engineer available?
- Stakeholders notified — Anyone affected by downtime or behavior change knows.
Verify: All 6 checklist items confirmed. Do not proceed if any is blocked.
Step 2: Staged Rollout
- Never deploy to 100% of traffic immediately. Use a staged rollout:
- Canary: 1–5% of traffic
- Staged: 10% → 25% → 50% → 100%
- Monitor key metrics at each stage for at least 15 minutes before expanding:
- Error rate (baseline vs. current)
- Latency p50, p95, p99
- Business metrics (conversion, orders, etc.)
- Define your abort threshold before starting: "If error rate exceeds X% or latency p99 exceeds Y ms, rollback immediately."
Verify: Rollout stages and abort thresholds are documented before deployment begins.
Step 3: Deploy
- Execute the deployment using your CI/CD pipeline (not manual commands).
- Monitor dashboards in real-time during the rollout.
- Keep communication channel open with on-call engineer.
- Do not perform any other changes during a deployment (no "quick fixes").
Verify: Deployment running via CI/CD, dashboards being monitored actively.
Step 4: Post-Deployment Verification
- Smoke tests pass on production.
- Key user journeys manually verified.
- Error rate within normal range (15 minutes post-deploy).
- No unexpected alerts triggered.
- Run post-deploy integration tests if available.
Verify: All post-deploy checks confirmed green. Deployment marked successful.
Step 5: Rollback (if needed)
- If any abort threshold is hit: rollback immediately, without debate.
- Execute the pre-written rollback plan.
- Verify rollback complete: service restored, error rate normalized.
- Write an incident report — even for near-misses.
Verify: Rollback completes in under 5 minutes. Service restored.
Common Rationalizations (and Rebuttals)
| Excuse | Rebuttal |
|---|---|
| "It works in staging" | Staging is not production. Different data, traffic, and configuration. |
| "It's just a small change" | Small changes cause the majority of outages. |
| "We don't have time for staged rollout" | You have even less time for an incident. |
| "I'll watch it for a few minutes" | 15 minutes minimum. Most production failures take time to materialize under load. |
| "We can rollback if needed" | Do you have a written, tested rollback plan? No? Then you can't. |
Red Flags
- Deploying directly to 100% without a staged rollout
- No rollback plan documented before deployment
- Deploying breaking schema changes without backward compatibility
- Running deployment from a local machine, not CI/CD
- Deploying during high-traffic periods without approval
- "I'll fix any issues after we deploy"
Verification
- All tests passing on exact commit being deployed
- Migrations are backward-compatible
- Rollback plan written and executable in <5 minutes
- Staged rollout plan with abort thresholds defined
- Post-deploy smoke tests passed
- Dashboards clean for 15 minutes post-deploy