Disaster Recovery Plan Skill

Produce a complete disaster recovery plan for a service or system — giving engineers, SREs, and on-call responders everything they need to recover from a disaster scenario in the shortest possible time. A good DR plan is tested regularly, has exact commands (not vague instructions), and makes RTO/RPO targets measurable so the team knows whether recovery succeeded.

Required Inputs

Ask for these if not already provided:

Service name and what it does (business function and technical role)
Criticality tier — business impact of extended downtime (e.g. Tier 1 = revenue-critical, Tier 2 = ops impact, Tier 3 = internal only)
Current infrastructure setup — cloud provider, regions/zones, deployment model (Kubernetes, ECS, VMs, serverless)
RPO/RTO requirements: Recovery Point Objective (how much data loss is acceptable) and Recovery Time Objective (how long the service can be down)
Backup strategy — what is backed up, how often, where backups are stored, retention policy
On-call contacts — names and contact details for the responder chain

Output Format

Disaster Recovery Plan: [Service Name]

Team: [Team name] | Tech lead: [Name]
Criticality tier: [Tier 1 / Tier 2 / Tier 3] | Last tested: [Date]
Next DR test: [Date] | Document owner: [Name]
Last updated: [Date] | Review cycle: Quarterly

Emergency? Skip to Section 3 — Failure Scenario Runbooks. Find the scenario that matches your situation and follow the steps exactly.

1. Recovery Targets

Target	Value	Rationale
RPO (Recovery Point Objective)	[X minutes/hours]	[e.g. "Last committed transaction — database replication is synchronous"]
RTO (Recovery Time Objective)	[Y minutes/hours]	[e.g. "Revenue impact begins at 30 min; target recovery in 15 min"]
MTTR target (non-disaster)	[Z minutes]	[Operational incidents, not DR events]
Data retention (backups)	[N days/weeks]	[Compliance requirement or operational policy]
Backup frequency	[Every X hours]	[RPO-driven — backup interval must be ≤ RPO]

What these mean in practice:

If a database is corrupted, we can lose at most [X minutes] of transactions before the business impact is unacceptable.
The service must be operational again within [Y minutes/hours] of declaring a DR event.
If either target cannot be met, escalate to [Engineering Manager] immediately.
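
The rule in the table above, that the backup interval must not exceed the RPO, can be checked mechanically before the plan is signed off. The following is a minimal sketch, assuming both values are given in whole minutes; the function name backup_interval_ok is hypothetical and not part of any existing tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any tool): succeeds when the backup
# interval is less than or equal to the RPO. Both arguments are whole minutes.
backup_interval_ok() {
  local rpo_min="$1" interval_min="$2"
  if [ "$interval_min" -le "$rpo_min" ]; then
    echo "OK: backup every ${interval_min} min meets RPO of ${rpo_min} min"
  else
    echo "VIOLATION: backup every ${interval_min} min exceeds RPO of ${rpo_min} min"
    return 1
  fi
}

# Example: an hourly backup against a 15-minute RPO is reported as a violation.
# backup_interval_ok 15 60
```

A non-zero exit status makes the check usable in a CI job or a pre-review script, so a plan with an inconsistent backup schedule fails early.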

2. Failure Scenario Inventory

Scenario	Likelihood	Impact	RTO target	RPO target	Runbook
Single availability zone failure	Medium	[Partial / Full outage]	[15 min]	[0 — no data loss]	Section 3.1
Full region failure	Low	Full outage	[60 min]	[5 min]	Section 3.2
Database corruption / data loss	Low	Full outage	[90 min]	[RPO value]	Section 3.3
Critical dependency outage	High	[Partial degradation]	[30 min]	[N/A]	Section 3.4
Security breach / ransomware	Very low	Full outage + investigation	[4 hours]	[Last clean backup]	Section 3.5
Accidental bulk data deletion	Low	Partial or full data loss	[60 min]	[RPO value]	Section 3.6

3. Failure Scenario Runbooks

3.1 Single Availability Zone Failure

Trigger: One AZ becomes unreachable; pods/instances in that zone stop responding.
Detection: PagerDuty alert [AlertName] fires, or cloud provider status page shows AZ degradation.
Expected RTO: [15 minutes] | Expected RPO: Zero (no data loss if multi-AZ replication is working)

Step 1 — Confirm the failure

# Check pod/instance health across zones
kubectl get pods -o wide -n [namespace] | grep -v Running

# Check which nodes are affected
# (match " Ready " with surrounding spaces so that NotReady nodes are not filtered out)
kubectl get nodes -o wide | grep -v " Ready "

# Verify cloud provider AZ status
# AWS: https://health.aws.amazon.com/health/status
# GCP: https://status.cloud.google.com
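
When confirming which nodes are affected, comparing the STATUS column exactly is more reliable than substring matching, because the string NotReady contains Ready. The sketch below assumes the default column layout of kubectl get nodes (NAME first, STATUS second); the function name unhealthy_nodes is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get nodes --no-headers` output on stdin
# and prints the name and status of every node whose STATUS column is not
# exactly "Ready". This includes "NotReady" and "Ready,SchedulingDisabled".
unhealthy_nodes() {
  awk '$2 != "Ready" { print $1, $2 }'
}

# Example:
# kubectl get nodes --no-headers | unhealthy_nodes
```

Empty output means every node reports Ready and is schedulable.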

Step 2 — Assess whether auto-recovery has occurred

# If using auto-scaling, check if replacement instances launched
kubectl get pods -n [namespace] --watch

# Check deployment replica count
kubectl get deployment [service-name] -n [namespace]

# Verify load balancer health checks are passing
[cloud provider CLI command to check target group health]
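
To turn the replica count check into a yes/no answer, the READY column of kubectl get deployment (for example 2/3) can be parsed. This is a minimal sketch under the assumption that READY is the second column of the default output; replicas_ready is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: takes a READY value such as "2/3" and succeeds only
# when at least one replica is desired and all desired replicas are ready.
replicas_ready() {
  local ready="${1%%/*}"    # text before the slash
  local desired="${1##*/}"  # text after the slash
  [ "$desired" -gt 0 ] && [ "$ready" -eq "$desired" ]
}

# Example:
# replicas_ready "$(kubectl get deployment [service-name] -n [namespace] \
#   --no-headers | awk '{print $2}')" && echo "auto-recovery complete"
```

If the helper fails after a few minutes, auto-recovery has likely stalled and Step 3 applies. kubectl rollout status deployment/[service-name] -n [namespace] is an alternative that blocks until the deployment is fully rolled out.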

Step 3 — Force rescheduling if auto-recovery stalled

# Cordon the affected node so no new pods schedule on it
kubectl cordon [node-name]

# Drain the node — moves all pods to healthy nodes
kubectl drain [node-name] --ignore-daemonsets --delete-emptydir-data

# Verify pods have rescheduled successfully
kubectl get pods -o wide -n [namespace]

Step 4 — Verify service health

# Smoke test key endpoints
curl -s -o /dev/null -w "%{http_code}" https://[service-url]/health
curl -s -o /dev/null -w "%{http_code}" https://[service-url]/[critical-endpoint]

# Check error rate in monitoring
[dashboard link or query]

Recovery confirmed when: All pods are Running, health check returns 200, error rate is at baseline.
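
Because pods may take a short while to pass health checks after rescheduling, a single curl can give a misleading failure. The sketch below retries any command a fixed number of times; the name retry_until_ok and the attempt and delay values in the example are assumptions to adjust per service.

```shell
#!/usr/bin/env bash
# Hypothetical helper: runs a command up to max_attempts times, sleeping
# delay_s seconds between failed attempts. Succeeds as soon as the command
# exits with status 0.
retry_until_ok() {
  local max_attempts="$1" delay_s="$2"
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      echo "succeeded on attempt ${attempt}"
      return 0
    fi
    attempt=$((attempt + 1))
    if [ "$attempt" -le "$max_attempts" ]; then
      sleep "$delay_s"
    fi
  done
  echo "gave up after ${max_attempts} attempts"
  return 1
}

# Example: poll the health endpoint every 10 seconds, up to 30 times.
# curl -f makes HTTP responses of 400 and above count as failures.
# retry_until_ok 30 10 curl -sf -o /dev/null https://[service-url]/health
```

Keep the total wait (attempts multiplied by delay) well inside the RTO for the scenario, so a failing check leaves time to escalate.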

3.2 Full Region Failure

Trigger: The primary region is entirely unavailable.
Detection: All service health checks failing, cloud provider status page confirms region-wide event.
Expected RTO: [60 minutes] | Expected RPO: [5 minutes, based on cross-region replication lag]

Step 1 — Confirm regional failure (5 minutes)

# Confirm the primary region is unreachable
# (-c 3 stops after three probes; without it, ping on Linux runs until interrupted)
ping -c 3 [primary-region-endpoint] || echo "Primary region unreachable"

# Check replication lag on standby region database
[command to check replica lag, e.g. for RDS: read the ReplicaLag CloudWatch metric for [dr-replica-identifier] in [dr-region]; aws rds describe-db-instances reports instance status but not lag]
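
Once a lag figure is available, comparing it with the RPO shows whether the expected data loss window is acceptable before promotion. This is a sketch only: it assumes the lag is reported in seconds and the RPO in minutes, and lag_within_rpo is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compares an observed replication lag (seconds) with
# the RPO (minutes). The lag value comes from whichever replica-lag command
# the plan documents for this service.
lag_within_rpo() {
  local lag_s="$1" rpo_min="$2"
  local rpo_s=$((rpo_min * 60))
  if [ "$lag_s" -le "$rpo_s" ]; then
    echo "Lag ${lag_s}s is within RPO (${rpo_s}s)"
  else
    echo "Lag ${lag_s}s EXCEEDS RPO (${rpo_s}s): escalate"
    return 1
  fi
}

# Example: 120 seconds of lag against a 5-minute RPO.
# lag_within_rpo 120 5
```

Record the value that was checked; it is the starting point for documenting the data loss window after failover.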

Step 2 — Declare DR event and notify (2 minutes)

Post to #incidents:

🔴 DR EVENT — [Service Name] — Region Failure
Primary region: [region] — UNREACHABLE
Activating failover to: [dr-region]
Incident commander: [Name]
Next update: 15 minutes

Page [Engineering Manager] and [CTO/VP Eng] via PagerDuty.

Step 3 — Promote DR database (10 minutes)

# AWS RDS — promote read replica to primary
aws rds promote-read-replica \
  --db-instance-identifier [dr-replica-identifier] \
  --region [dr-region]

# Wait for promotion to complete
aws rds wait db-instance-available \
  --db-instance-identifier [dr-replica-identifier] \
  --region [dr-region]

# Record the new database endpoint
aws rds describe-db-instances \
  --db-instance-identifier [dr-replica-identifier] \
  --region [dr-region] \
  --query 'DBInstances[0].Endpoint.Address'

Step 4 — Deploy service in DR region (20 minutes)

# Update service configuration to point at DR database
kubectl set env deployment/[service-name] \
  DATABASE_URL=[new-dr-database-url] \
  -n [namespace] \
  --context [dr-region-context]

# Scale up the DR deployment
kubectl scale deployment/[service-name] --replicas=[N] \
  -n [namespace] \
  --context [dr-region-context]

# Verify all pods are running
kubectl get pods -n [namespace] --context [dr-region-context]

Step 5 — Cut over DNS / load balancer (5 minutes)

# Update DNS to point to DR region load balancer
# AWS Route 53:
aws route53 change-resource-record-sets \
  --hosted-zone-id [zone-id] \
  --change-batch file://dr-failover-dns.json

# Verify DNS propagation (may take up to [TTL] seconds)
dig [service-domain] @8.8.8.8
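
The change batch file dr-failover-dns.json is referenced above but its contents are not shown. The sketch below generates one plausible version: a single UPSERT of a CNAME record pointing at the DR load balancer. The function name, the record type, and the default TTL of 60 seconds are assumptions; a record at the zone apex cannot be a CNAME and would need a Route 53 alias record (AliasTarget) instead.

```shell
#!/usr/bin/env bash
# Hypothetical generator for a Route 53 change batch.
# Arguments: record name, DR load balancer hostname, optional TTL (seconds).
make_failover_batch() {
  local record_name="$1" dr_target="$2" ttl="${3:-60}"
  cat <<EOF
{
  "Comment": "DR failover: point ${record_name} at ${dr_target}",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "${record_name}",
        "Type": "CNAME",
        "TTL": ${ttl},
        "ResourceRecords": [{ "Value": "${dr_target}" }]
      }
    }
  ]
}
EOF
}

# Example (hostnames are placeholders):
# make_failover_batch api.example.com dr-lb.example.net 60 > dr-failover-dns.json
```

Generating and reviewing this file during DR tests, rather than during an incident, keeps Step 5 inside its time budget.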

Step 6 — Verify end-to-end

# Full smoke test against DR endpoint
curl -s https://[service-url]/health
[run automated smoke test suite if available]

Recovery confirmed when: DNS resolves to DR region, smoke tests pass, error rate is at baseline.
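
The per-step time budgets in this runbook should add up to less than the RTO, with headroom for the unbudgeted verification step. A small worked check, using the example figures from the steps above (5, 2, 10, 20, and 5 minutes against a 60-minute RTO); rto_headroom is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sums per-step time budgets (minutes) and compares the
# total with the RTO. Fails when the steps alone exceed the RTO.
# Usage: rto_headroom <rto_min> <step_min>...
rto_headroom() {
  local rto_min="$1"
  shift
  local total=0 step
  for step in "$@"; do
    total=$((total + step))
  done
  echo "steps=${total}min rto=${rto_min}min headroom=$((rto_min - total))min"
  [ "$total" -le "$rto_min" ]
}

rto_headroom 60 5 2 10 20 5
# prints: steps=42min rto=60min headroom=18min
```

With these example values, 18 minutes remain for end-to-end verification and DNS propagation. Re-run the check whenever a DR test shows that a step takes longer than budgeted.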

Post-failover actions (not urgent — after service is stable):

Do not fail back to primary until root cause is confirmed resolved
Document data loss window (check replication lag at time of failure)
Begin post-incident review — see [incident-postmortem skill]

3.3 Database Corruption or Data Loss

Trigger: Data in the database is c
