Disaster Recovery Plan Skill
Produce a complete disaster recovery plan for a service or system — giving engineers, SREs, and on-call responders everything they need to recover from a disaster scenario in the shortest possible time. A good DR plan is tested regularly, has exact commands (not vague instructions), and makes RTO/RPO targets measurable so the team knows whether recovery succeeded.
Required Inputs
Ask for these if not already provided:
- Service name and what it does (business function and technical role)
- Criticality tier — business impact of extended downtime (e.g. Tier 1 = revenue-critical, Tier 2 = ops impact, Tier 3 = internal only)
- Current infrastructure setup — cloud provider, regions/zones, deployment model (Kubernetes, ECS, VMs, serverless)
- RPO/RTO requirements — Recovery Point Objective (how much data loss is acceptable) and Recovery Time Objective (how long the service can be down)
- Backup strategy — what is backed up, how often, where backups are stored, retention policy
- On-call contacts — names and contact details for the responder chain
Output Format
Disaster Recovery Plan: [Service Name]
Team: [Team name] | Tech lead: [Name]
Criticality tier: [Tier 1 / Tier 2 / Tier 3] | Last tested: [Date]
Next DR test: [Date] | Document owner: [Name]
Last updated: [Date] | Review cycle: Quarterly
Emergency? Skip to Section 3 — Failure Scenario Runbooks. Find the scenario that matches your situation and follow the steps exactly.
1. Recovery Targets
| Target | Value | Rationale |
|---|---|---|
| RPO (Recovery Point Objective) | [X minutes/hours] | [e.g. "Last committed transaction — database replication is synchronous"] |
| RTO (Recovery Time Objective) | [Y minutes/hours] | [e.g. "Revenue impact begins at 30 min; target recovery in 15 min"] |
| MTTR target (non-disaster) | [Z minutes] | [Operational incidents, not DR events] |
| Data retention (backups) | [N days/weeks] | [Compliance requirement or operational policy] |
| Backup frequency | [Every X hours] | [RPO-driven — backup interval must be ≤ RPO] |
What these mean in practice:
- If a database is corrupted, we can lose at most [X minutes] of transactions before the business impact is unacceptable.
- The service must be operational again within [Y minutes/hours] of declaring a DR event.
- If either target cannot be met, escalate to [Engineering Manager] immediately.
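The rule in the table above (backup interval must be ≤ RPO) can be checked mechanically when the plan is written or reviewed. The sketch below is a minimal illustration; the function name `backup_interval_meets_rpo` is hypothetical, and both arguments are assumed to be whole minutes.

```shell
#!/bin/sh
# Hypothetical helper (not part of any tool): verifies that the backup
# interval does not exceed the RPO. Both arguments are whole minutes.
backup_interval_meets_rpo() {
    interval_min=$1
    rpo_min=$2
    if [ "$interval_min" -le "$rpo_min" ]; then
        echo "OK: backup every ${interval_min} min is within the ${rpo_min} min RPO"
        return 0
    fi
    echo "FAIL: backup every ${interval_min} min exceeds the ${rpo_min} min RPO"
    return 1
}
```

For example, `backup_interval_meets_rpo 240 60` reports a failure, because a four-hour backup interval can lose more data than a one-hour RPO allows.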
2. Failure Scenario Inventory
| Scenario | Likelihood | Impact | RTO target | RPO target | Runbook |
|---|---|---|---|---|---|
| Single availability zone failure | Medium | [Partial / Full outage] | [15 min] | [0 — no data loss] | Section 3.1 |
| Full region failure | Low | Full outage | [60 min] | [5 min] | Section 3.2 |
| Database corruption / data loss | Low | Full outage | [90 min] | [RPO value] | Section 3.3 |
| Critical dependency outage | High | [Partial degradation] | [30 min] | [N/A] | Section 3.4 |
| Security breach / ransomware | Very low | Full outage + investigation | [4 hours] | [Last clean backup] | Section 3.5 |
| Accidental bulk data deletion | Low | Partial or full data loss | [60 min] | [RPO value] | Section 3.6 |
3. Failure Scenario Runbooks
3.1 Single Availability Zone Failure
Trigger: One AZ becomes unreachable — pods/instances in that zone stop responding.
Detection: PagerDuty alert [AlertName] fires, or cloud provider status page shows AZ degradation.
Expected RTO: [15 minutes] | Expected RPO: Zero (no data loss if multi-AZ replication is working)
Step 1 — Confirm the failure
# Check pod/instance health across zones
kubectl get pods -o wide -n [namespace] | grep -v Running
# Check which nodes are affected (-w matches Ready as a whole word, so NotReady nodes stay in the output)
kubectl get nodes -o wide | grep -vw Ready
# Verify cloud provider AZ status
# AWS: https://health.aws.amazon.com/health/status
# GCP: https://status.cloud.google.com
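To see at a glance whether the unhealthy nodes sit in one zone, the node listing can be grouped by zone label. This is a sketch under assumptions: the helper name `unhealthy_nodes_by_zone` is hypothetical, and it assumes nodes carry the standard `topology.kubernetes.io/zone` label so that the zone is the last column of the listing.

```shell
#!/bin/sh
# Hypothetical helper: reads node rows on stdin and prints
# "zone node status" for every node whose STATUS does not start with Ready.
# Expected input (assumption): the output of
#   kubectl get nodes --no-headers -L topology.kubernetes.io/zone
# where column 1 is the node name, column 2 the status, and the last
# column the zone label. If a node has no zone label, the last column
# will be the kubelet version instead.
unhealthy_nodes_by_zone() {
    awk '$2 !~ /^Ready/ { print $NF, $1, $2 }'
}
```

Usage: `kubectl get nodes --no-headers -L topology.kubernetes.io/zone | unhealthy_nodes_by_zone`. If every line printed shares the same zone, that supports the single-AZ diagnosis.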
Step 2 — Assess whether auto-recovery has occurred
# If using auto-scaling, check if replacement instances launched
kubectl get pods -n [namespace] --watch
# Check deployment replica count
kubectl get deployment [service-name] -n [namespace]
# Verify load balancer health checks are passing
[cloud provider CLI command to check target group health, e.g. for AWS: aws elbv2 describe-target-health --target-group-arn [target-group-arn]]
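Reading the replica count by eye is error-prone under pressure, so the comparison can be scripted. The sketch below assumes the ready and desired counts are fetched as a single `ready/desired` string; the helper name `replicas_recovered` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when a "ready/desired" string shows that
# every desired replica is ready. The string is assumed to come from:
#   kubectl get deployment [service-name] -n [namespace] \
#     -o jsonpath='{.status.readyReplicas}/{.spec.replicas}'
# kubectl omits readyReplicas when no replica is ready, which yields "/3";
# an empty side is therefore treated as 0.
replicas_recovered() {
    ready=${1%%/*}
    desired=${1##*/}
    ready=${ready:-0}
    desired=${desired:-0}
    [ "$desired" -gt 0 ] && [ "$ready" -ge "$desired" ]
}
```

If `replicas_recovered` fails a few minutes after the alert fired, auto-recovery has likely stalled and Step 3 applies.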
Step 3 — Force rescheduling if auto-recovery stalled
# Cordon the affected node so no new pods schedule on it
kubectl cordon [node-name]
# Drain the node — moves all pods to healthy nodes
kubectl drain [node-name] --ignore-daemonsets --delete-emptydir-data
# Verify pods have rescheduled successfully
kubectl get pods -o wide -n [namespace]
Step 4 — Verify service health
# Smoke test key endpoints
curl -s -o /dev/null -w "%{http_code}" https://[service-url]/health
curl -s -o /dev/null -w "%{http_code}" https://[service-url]/[critical-endpoint]
# Check error rate in monitoring
[dashboard link or query]
Recovery confirmed when: All pods are Running, health check returns 200, error rate is at baseline.
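A single successful curl can be a fluke while pods are still rescheduling, so it may help to poll the health endpoint until it answers. The following is a minimal sketch; `retry_until` and `health_is_200` are hypothetical helper names, and the attempt count and delay are assumptions to tune against the RTO.

```shell
#!/bin/sh
# Hypothetical helper: runs a command up to max_attempts times, sleeping
# delay_seconds between attempts. Returns 0 on the first success,
# 1 if every attempt fails.
retry_until() {
    max_attempts=$1
    delay_seconds=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay_seconds"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Hypothetical probe: succeeds only when the URL answers with HTTP 200
# within 5 seconds.
health_is_200() {
    [ "$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1")" = "200" ]
}
```

For example, `retry_until 10 6 health_is_200 https://[service-url]/health` polls for about one minute before giving up.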
3.2 Full Region Failure
Trigger: The primary region is entirely unavailable.
Detection: All service health checks failing, cloud provider status page confirms region-wide event.
Expected RTO: [60 minutes] | Expected RPO: [5 minutes, based on cross-region replication lag]
Step 1 — Confirm regional failure (5 minutes)
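# Confirm the primary region is unreachable (-c 4 stops after four probes; ICMP may be blocked, so also try the health URL)
ping -c 4 [primary-region-endpoint] || echo "Primary region unreachable"
# Check replication lag on the standby-region database
# e.g. for RDS, read the replica's ReplicaLag CloudWatch metric (reported in seconds)
[command to check replica lag, e.g. aws cloudwatch get-metric-statistics --namespace AWS/RDS --metric-name ReplicaLag --dimensions Name=DBInstanceIdentifier,Value=[dr-replica-identifier] --start-time [start] --end-time [end] --period 60 --statistics Maximum --region [dr-region]]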
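Once a lag figure is in hand, it should be compared against the RPO before the replica is promoted, since the lag at the time of failure bounds the data loss. A minimal sketch, assuming the lag is reported in seconds and the RPO is stated in whole minutes; the helper name `lag_within_rpo` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: compares measured replication lag (seconds) with
# the RPO (minutes). Any fractional part of the lag is truncated, since
# metric values may arrive as decimals such as 12.5.
lag_within_rpo() {
    lag_seconds=${1%%.*}
    rpo_minutes=$2
    rpo_seconds=$((rpo_minutes * 60))
    if [ "$lag_seconds" -le "$rpo_seconds" ]; then
        echo "within RPO: ${lag_seconds}s lag <= ${rpo_seconds}s"
        return 0
    fi
    echo "RPO BREACH: ${lag_seconds}s lag > ${rpo_seconds}s"
    return 1
}
```

A breach does not block the failover, but it should be stated in the incident channel so that the expected data loss is known up front.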
Step 2 — Declare DR event and notify (2 minutes)
Post to #incidents:
🔴 DR EVENT — [Service Name] — Region Failure
Primary region: [region] — UNREACHABLE
Activating failover to: [dr-region]
Incident commander: [Name]
Next update: 15 minutes
Page [Engineering Manager] and [CTO/VP Eng] via PagerDuty.
Step 3 — Promote DR database (10 minutes)
# AWS RDS — promote read replica to primary
aws rds promote-read-replica \
--db-instance-identifier [dr-replica-identifier] \
--region [dr-region]
# Wait for promotion to complete
aws rds wait db-instance-available \
--db-instance-identifier [dr-replica-identifier] \
--region [dr-region]
# Record the new database endpoint
aws rds describe-db-instances \
--db-instance-identifier [dr-replica-identifier] \
--region [dr-region] \
--query 'DBInstances[0].Endpoint.Address'
Step 4 — Deploy service in DR region (20 minutes)
# Update service configuration to point at DR database
kubectl set env deployment/[service-name] \
DATABASE_URL=[new-dr-database-url] \
-n [namespace] \
--context [dr-region-context]
# Scale up the DR deployment
kubectl scale deployment/[service-name] --replicas=[N] \
-n [namespace] \
--context [dr-region-context]
# Verify all pods are running
kubectl get pods -n [namespace] --context [dr-region-context]
Step 5 — Cut over DNS / load balancer (5 minutes)
# Update DNS to point to DR region load balancer
# AWS Route 53:
aws route53 change-resource-record-sets \
--hosted-zone-id [zone-id] \
--change-batch file://dr-failover-dns.json
# Verify DNS propagation (may take up to [TTL] seconds)
dig [service-domain] @8.8.8.8
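The `dr-failover-dns.json` file referenced above has to exist before the incident. One way to generate it is sketched below, assuming a plain CNAME record that points at the DR load balancer hostname. The function name `write_dns_change_batch` and the default TTL of 60 seconds are assumptions; a record at the zone apex or an alias record would need an `AliasTarget` block instead.

```shell
#!/bin/sh
# Hypothetical helper: prints a Route 53 change batch that upserts a CNAME
# for the service domain, pointing at the DR region load balancer.
# Arguments: record name, DR load balancer hostname, optional TTL (default 60).
write_dns_change_batch() {
    record_name=$1
    dr_lb_hostname=$2
    ttl=${3:-60}
    cat <<EOF
{
  "Comment": "DR failover: point ${record_name} at the DR region load balancer",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "${record_name}",
        "Type": "CNAME",
        "TTL": ${ttl},
        "ResourceRecords": [
          { "Value": "${dr_lb_hostname}" }
        ]
      }
    }
  ]
}
EOF
}
```

Usage: `write_dns_change_batch [service-domain] [dr-lb-hostname] > dr-failover-dns.json`. Generating and reviewing the file during a DR test, rather than during the event, keeps Step 5 within its five-minute budget.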
Step 6 — Verify end-to-end
# Full smoke test against DR endpoint (-f makes curl exit non-zero on HTTP 4xx/5xx responses)
curl -fsS https://[service-url]/health
[run automated smoke test suite if available]
Recovery confirmed when: DNS resolves to DR region, smoke tests pass, error rate is at baseline.
Post-failover actions (not urgent — after service is stable):
- Do not fail back to primary until root cause is confirmed resolved
- Document data loss window (check replication lag at time of failure)
- Begin post-incident review — see [incident-postmortem skill]
3.3 Database Corruption or Data Loss
Trigger: Data in the database is c