Overview

Production is not a test environment. Every deployment is a live operation with real consequences — user impact, data integrity risks, and potential outages. This skill encodes the discipline senior engineers apply before, during, and after every production deployment.

The core rule: never deploy without a rollback plan you've verified can execute in under 5 minutes.

When to Use

Before any deployment to a production or production-equivalent environment
When reviewing deployment scripts or CI/CD pipelines
When adding new services or infrastructure changes

Process

Step 1: Pre-Deployment Checklist

All tests pass: CI is green on the exact commit being deployed, not "mostly green."
Migrations are backward-compatible: for zero-downtime deploys, the old code must keep working against the new schema. New columns are nullable, and columns are not dropped until the rollout is complete.
Feature flags configured: new features are behind flags, off by default.
Rollback plan written: document exactly how to roll back, including which commands, which configs, and the estimated time.
Deployment window confirmed: the deploy falls in a low-traffic period and an on-call engineer is available.
Stakeholders notified: anyone affected by downtime or a behavior change knows.
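
The backward-compatibility item can be illustrated with a small, self-contained sketch. It uses Python's built-in sqlite3 module and an in-memory database purely for demonstration; the users table, the locale column, and the two insert helpers are hypothetical stand-ins for an "old" and a "new" release, not part of any real schema.

```python
import sqlite3


def old_code_insert(conn, name):
    # "Old" release: only knows about the name column.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def new_code_insert(conn, name, locale):
    # "New" release: also writes the newly added column.
    conn.execute(
        "INSERT INTO users (name, locale) VALUES (?, ?)", (name, locale)
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
old_code_insert(conn, "ada")

# Migration: add the column as nullable, so inserts that omit it still succeed.
conn.execute("ALTER TABLE users ADD COLUMN locale TEXT")

old_code_insert(conn, "grace")        # old release keeps working mid-rollout
new_code_insert(conn, "linus", "fi")  # new release uses the new column

rows = conn.execute("SELECT name, locale FROM users ORDER BY id").fetchall()
print(rows)  # [('ada', None), ('grace', None), ('linus', 'fi')]
```

Had the column been added as NOT NULL without a default, the old release's insert would fail during the window in which both versions serve traffic, which is the situation this checklist item is meant to prevent.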

Verify: All 6 checklist items confirmed. Do not proceed if any is blocked.
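
One way to make this gate mechanical is a small check that refuses to proceed while any item is unconfirmed. The sketch below is illustrative only: the item names are hypothetical labels for the six checklist entries, and how each one gets confirmed is left to the team.

```python
# Hypothetical labels for the six pre-deployment checklist items.
CHECKLIST = (
    "tests_pass",
    "migrations_backward_compatible",
    "feature_flags_configured",
    "rollback_plan_written",
    "deployment_window_confirmed",
    "stakeholders_notified",
)


def blocked_items(confirmed):
    """Return the checklist items that are missing or not explicitly True."""
    return [item for item in CHECKLIST if confirmed.get(item) is not True]


def may_proceed(confirmed):
    """The deployment may proceed only when nothing is blocked."""
    return not blocked_items(confirmed)


status = {item: True for item in CHECKLIST}
status["rollback_plan_written"] = False

print(blocked_items(status))  # ['rollback_plan_written']
print(may_proceed(status))    # False
```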

Step 2: Staged Rollout

Never deploy to 100% of traffic immediately. Use a staged rollout:
- Canary: 1–5% of traffic
- Staged: 10% → 25% → 50% → 100%
Monitor key metrics at each stage for at least 15 minutes before expanding:
- Error rate (baseline vs. current)
- Latency p50, p95, p99
- Business metrics (conversion, orders, etc.)
Define your abort threshold before starting: "If error rate exceeds X% or latency p99 exceeds Y ms, roll back immediately."

Verify: Rollout stages and abort thresholds are documented before deployment begins.
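
The staged rollout and its abort rule can be sketched as a small loop. This is a minimal illustration under stated assumptions, not a real deployment tool: set_traffic and observe are hypothetical callbacks supplied by the caller (observe is expected to block for the monitoring window of at least 15 minutes before returning metrics), and the threshold values in the usage example are placeholders for your own X and Y.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbortThresholds:
    max_error_rate: float       # fraction of requests, e.g. 0.01 for 1%
    max_p99_latency_ms: float


@dataclass(frozen=True)
class StageMetrics:
    error_rate: float
    p99_latency_ms: float


# Percent of traffic per stage: a 5% canary, then the staged steps.
STAGES = [5, 10, 25, 50, 100]


def should_abort(metrics, thresholds):
    """True if either documented abort threshold is exceeded."""
    return (
        metrics.error_rate > thresholds.max_error_rate
        or metrics.p99_latency_ms > thresholds.max_p99_latency_ms
    )


def run_rollout(stages, thresholds, set_traffic, observe):
    """Advance stage by stage; return (completed, last_stage_reached)."""
    for percent in stages:
        set_traffic(percent)
        metrics = observe(percent)  # caller waits out the monitoring window here
        if should_abort(metrics, thresholds):
            return False, percent   # stop expanding; the caller rolls back
    return True, stages[-1]


# Usage with fake callbacks (placeholder numbers, for illustration only).
thresholds = AbortThresholds(max_error_rate=0.01, max_p99_latency_ms=800.0)
applied = []


def fake_observe(percent):
    # Pretend p99 latency degrades once 25% of traffic hits the new version.
    if percent >= 25:
        return StageMetrics(error_rate=0.002, p99_latency_ms=950.0)
    return StageMetrics(error_rate=0.002, p99_latency_ms=400.0)


completed, reached = run_rollout(STAGES, thresholds, applied.append, fake_observe)
print(completed, reached, applied)  # False 25 [5, 10, 25]
```

Because the thresholds are passed in as data, they have to exist before the rollout starts, which matches the requirement to define them up front rather than negotiate them mid-incident.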

Step 3: Deploy

Execute the deployment using your CI/CD pipeline (not manual commands).
Monitor dashboards in real time during the rollout.
Keep a communication channel open with the on-call engineer.
Do not perform any other changes during a deployment (no "quick fixes").

Verify: Deployment running via CI/CD, dashboards being monitored actively.

Step 4: Post-Deployment Verification

Smoke tests pass on production.
Key user journeys manually verified.
Error rate within normal range (15 minutes post-deploy).
No unexpected alerts triggered.
Run post-deploy integration tests if available.

Verify: All post-deploy checks confirmed green. Deployment marked successful.
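
The error-rate check can be expressed as a small function. The sketch below makes several assumptions that are not in the checklist itself: metrics arrive as one (requests, errors) pair per minute after the deploy, and "normal range" is interpreted as at most 1.5 times a known baseline rate. Both the input shape and the max_ratio default are hypothetical and should be replaced with whatever your monitoring actually provides.

```python
def error_rate_within_range(per_minute, baseline_rate, max_ratio=1.5, window=15):
    """Check the first `window` minutes after a deploy against a baseline.

    per_minute: sequence of (requests, errors) tuples, one per minute.
    Returns False when there is not yet enough data to decide.
    """
    if len(per_minute) < window:
        return False  # fewer than 15 minutes observed: not verified yet
    observed = per_minute[:window]
    requests = sum(r for r, _ in observed)
    errors = sum(e for _, e in observed)
    if requests == 0:
        return False  # no traffic means nothing was actually verified
    return errors / requests <= baseline_rate * max_ratio


healthy = [(1000, 2)] * 15   # 0.2% errors, equal to the baseline
degraded = [(1000, 10)] * 15  # 1% errors, five times the baseline

print(error_rate_within_range(healthy, baseline_rate=0.002))       # True
print(error_rate_within_range(degraded, baseline_rate=0.002))      # False
print(error_rate_within_range(healthy[:14], baseline_rate=0.002))  # False
```

Returning False on insufficient data is a deliberate choice in this sketch: a deploy that has only been watched for a few minutes is treated as unverified, not as healthy.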

Step 5: Rollback (if needed)

If any abort threshold is hit: roll back immediately, without debate.
Execute the pre-written rollback plan.
Verify rollback complete: service restored, error rate normalized.
Write an incident report — even for near-misses.

Verify: Rollback completes in under 5 minutes. Service restored.
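
To check the 5-minute budget during a rehearsal, the rollback can be wrapped in a timer. This is a sketch under assumptions: steps is a hypothetical list of callables that perform your pre-written rollback plan, is_healthy is a hypothetical probe for "service restored", and the clock parameter exists only so the example can run with a fake clock.

```python
import time

ROLLBACK_BUDGET_SECONDS = 5 * 60


def execute_rollback(steps, is_healthy, clock=time.monotonic):
    """Run each rollback step in order, then report timing and health."""
    start = clock()
    for step in steps:
        step()
    elapsed = clock() - start
    return {
        "elapsed_seconds": elapsed,
        "within_budget": elapsed < ROLLBACK_BUDGET_SECONDS,
        "healthy": is_healthy(),
    }


# Usage with stand-in steps and a fake clock that reports 120 seconds elapsed.
log = []
ticks = iter([0.0, 120.0])
result = execute_rollback(
    steps=[
        lambda: log.append("redeploy previous release"),
        lambda: log.append("restore previous config"),
    ],
    is_healthy=lambda: True,
    clock=lambda: next(ticks),
)
print(result)
# {'elapsed_seconds': 120.0, 'within_budget': True, 'healthy': True}
```

Running the same wrapper against a staging environment before the deployment is one way to turn "rollback plan written" into "rollback plan verified".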

Common Rationalizations (and Rebuttals)

Excuse	Rebuttal
"It works in staging"	Staging is not production. Different data, traffic, and configuration.
"It's just a small change"	Change size does not predict blast radius. One-line config changes have caused major outages.
"We don't have time for staged rollout"	You have even less time for an incident.
"I'll watch it for a few minutes"	15 minutes minimum. Most production failures take time to materialize under load.
"We can roll back if needed"	Do you have a written, tested rollback plan? No? Then you can't.

Red Flags

Deploying directly to 100% without a staged rollout
No rollback plan documented before deployment
Deploying breaking schema changes without backward compatibility
Running the deployment from a local machine instead of CI/CD
Deploying during high-traffic periods without approval
"I'll fix any issues after we deploy"

Verification

All tests passing on the exact commit being deployed
Migrations are backward-compatible
Rollback plan written and executable in <5 minutes
Staged rollout plan with abort thresholds defined
Post-deploy smoke tests passed
Dashboards clean for 15 minutes post-deploy
