When this skill is activated, always start your first response with the 🧢 emoji.
Site Reliability Engineering
SRE is the discipline of applying software engineering to operations problems. It replaces ad-hoc ops work with principled systems: reliability targets backed by error budgets, toil replaced by automation, and incidents treated as system failures rather than human ones. This skill covers the full SRE lifecycle - from defining SLOs through capacity planning and progressive delivery - as practiced by teams operating production systems at scale. Designed for engineers moving from "keep the lights on" to systematic reliability ownership.
When to use this skill
Trigger this skill when the user:
- Needs to define or revise SLOs, SLIs, or SLAs for a service
- Is calculating or acting on an error budget
- Wants to identify, measure, or automate toil
- Is running or writing a postmortem
- Is designing or improving an on-call rotation
- Is forecasting capacity needs or planning a load test
- Is designing a rollout strategy (canary, blue/green, progressive)
Do NOT trigger this skill for:
- Pure infrastructure provisioning without a reliability framing (use a Docker/K8s skill)
- Application performance optimization without an SLO context (use a performance-engineering skill)
Key principles
- Embrace risk with error budgets - 100% reliability is neither achievable nor desirable. Every extra nine of availability comes at a cost: slower feature velocity, more complex systems, higher operational burden. An error budget makes the trade-off explicit: spend budget on risk-taking (deploys, experiments), save it when reliability is threatened.
- Eliminate toil - Toil is work that is manual, repetitive, automatable, reactive, and scales with service growth without producing lasting value. Every hour of toil is an hour not spent on reliability improvements. The goal is not zero toil (some is unavoidable) but continuous reduction.
- SLOs are the contract - SLOs align engineering and business on what reliability is worth. They prevent both over-engineering ("five nines or nothing") and under-investing ("it mostly works"). Write SLOs before writing on-call runbooks; the SLO defines what warrants waking someone up.
- Blameless postmortems - Systems fail, not people. Blaming individuals creates an environment where engineers hide problems and avoid risk. Blameless postmortems surface systemic issues and produce durable fixes. The goal is learning, not accountability theater.
- Automate yourself out of a job - The SRE charter is to automate operations work until the team's operational load is below 50% of their time. The remaining capacity is reserved for reliability engineering that makes the next incident less likely or less severe.
Core concepts
SLI / SLO / SLA hierarchy
SLA (Service Level Agreement)
- External contract with customers. Breach triggers penalties.
- Set conservatively: your internal SLO must be tighter than your SLA.
SLO (Service Level Objective)
- Internal target. Drives alerting, error budgets, and engineering decisions.
- Typically set 0.5 to 1 percentage point stricter than the SLA, leaving headroom before a contractual breach (e.g., SLA 99.0% -> SLO 99.5%).
SLI (Service Level Indicator)
- The actual measurement. A ratio: good events / total events.
- Example: (requests completing < 300ms) / (all requests)
Rule of thumb: Define one availability SLI and one latency SLI per user-facing service. Add correctness SLIs for data pipelines or financial systems.
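As a minimal sketch of the "good events / total events" ratio, the snippet below computes one availability SLI and one latency SLI from a list of request records. The `Request` type, the function names, and the choice to report 1.0 when there is no traffic are illustrative assumptions, not part of any particular monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical request record used only for this illustration."""
    status: int        # HTTP status code
    latency_ms: float  # server-side latency in milliseconds


def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that did not fail with a 5xx status."""
    if not requests:
        return 1.0  # assumption: no traffic counts as meeting the objective
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float = 300.0) -> float:
    """Fraction of requests that completed faster than threshold_ms."""
    if not requests:
        return 1.0
    good = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return good / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 450.0),  # succeeds, but too slow for the latency SLI
    Request(503, 80.0),   # fast, but a server error
    Request(404, 95.0),   # client error, counted as "good" for availability here
]
print(availability_sli(sample))  # 0.75
print(latency_sli(sample))       # 0.75
```

The two SLIs can disagree on which requests are "bad", which is why they are tracked separately.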
Error budget mechanics
Error budget = 1 - SLO target
99.9% SLO -> 0.1% budget -> 43.2 min per 30-day window (43.8 min per average calendar month)
99.5% SLO -> 0.5% budget -> 3.6 hours per 30-day window (3.65 hours per average calendar month)
Error rate = (bad events this window) / (total events this window)
Budget consumed = error rate / error budget (as a fraction of the total budget)
Budget remaining = 1 - budget consumed
Burn rate = observed error rate / allowed error rate. A burn rate of 1 means you are spending budget at exactly the expected pace. A burn rate of 14.4 on a 30-day window means the budget is gone in 50 hours.
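The arithmetic above can be sketched in a few lines. The function names are hypothetical, and the window length is an explicit parameter: a 30-day window yields 43.2 minutes for a 99.9% SLO, while an average calendar month (about 30.4 days) yields roughly 43.8 minutes.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget(slo: float) -> float:
    """Allowed error ratio, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo


def budget_minutes(slo: float, window_days: float = 30.0) -> float:
    """Minutes of full outage the budget allows over the window."""
    return error_budget(slo) * window_days * MINUTES_PER_DAY


def burn_rate(bad: int, total: int, slo: float) -> float:
    """Observed error rate divided by the allowed error rate."""
    if total == 0:
        return 0.0  # assumption: no traffic burns no budget
    return (bad / total) / error_budget(slo)


def hours_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Hours until a full budget is spent at a constant burn rate."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


print(round(budget_minutes(0.999), 1))           # 43.2
print(round(burn_rate(144, 10_000, 0.999), 1))   # 14.4
print(round(hours_to_exhaustion(14.4), 1))       # 50.0
```

Results are rounded for display because the underlying values are floating-point ratios.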
Budget policy (what to do when budget is threatened):
| Budget remaining | Action |
|---|---|
| > 50% | Normal feature velocity, deploys allowed |
| 25-50% | Review recent changes, increase monitoring |
| 10-25% | Freeze non-essential deploys, focus on stability |
| < 10% | Feature freeze, all hands on reliability work |
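One way to make the policy table executable is a small lookup on the fraction of budget remaining. This is a sketch, not a prescribed implementation: the function names are hypothetical, and the handling of exact boundary values (50%, 25%, 10%) is an assumption, since the table does not say which band they belong to.

```python
def budget_remaining(bad: int, total: int, slo: float) -> float:
    """Fraction of the error budget still unspent (can go negative)."""
    if total == 0:
        return 1.0  # assumption: no traffic means nothing was spent
    consumed = (bad / total) / (1.0 - slo)
    return 1.0 - consumed


def policy_action(remaining: float) -> str:
    """Map remaining budget (0.0-1.0) to the action from the policy table."""
    if remaining > 0.50:
        return "Normal feature velocity, deploys allowed"
    if remaining >= 0.25:  # assumption: exactly 25% and 50% fall in this band
        return "Review recent changes, increase monitoring"
    if remaining >= 0.10:  # assumption: exactly 10% falls in this band
        return "Freeze non-essential deploys, focus on stability"
    return "Feature freeze, all hands on reliability work"


# 40 bad requests out of 100,000 against a 99.9% SLO leaves about 60% of the budget
remaining = budget_remaining(bad=40, total=100_000, slo=0.999)
print(policy_action(remaining))  # Normal feature velocity, deploys allowed
```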
Toil definition
Toil has all of these properties - if even one is missing, it may be legitimate work:
- Manual: A human is in the loop doing repetitive keystrokes
- Repetitive: Done more than once with the same steps
- Automatable: A script or system could do it
- Reactive: Triggered by a system event, not proactive engineering
- No lasting value: Executing it does not improve the system; it just holds it steady
- Scales with load: More traffic, more toil (a danger sign)
Incident severity levels
| Severity | Customer impact | Response | Example |
|---|---|---|---|
| SEV1 | Complete outage or data loss | Immediate page, war room | Payment service down |
| SEV2 | Degraded core functionality | Page on-call | 20% of requests erroring |
| SEV3 | Minor degradation, workaround exists | Ticket, next business day | Slow dashboard loads |
| SEV4 | Cosmetic issue or internal tool | Backlog | Wrong label in admin UI |
On-call best practices
- Rotate weekly; never longer than two weeks without a break
- Guarantee engineers sleep: no P1 pages between 10pm-8am without escalation
- Track on-call load: pages per shift, time-to-ack, total hours interrupted
- Every on-call shift ends with a handoff: active incidents, lingering alerts, context
- Budget 20-30% of the next sprint for on-call follow-up work
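The on-call load metrics listed above (pages per shift, time-to-ack, hours interrupted) could be derived from page records along these lines. The `Page` record and its field names are assumptions for illustration; a real paging tool will expose its own schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass(frozen=True)
class Page:
    """Hypothetical page record; field names are illustrative only."""
    fired_at: datetime
    acked_at: datetime
    resolved_at: datetime


def shift_load(pages: list[Page]) -> dict[str, float]:
    """Summarize one on-call shift from its pages."""
    if not pages:
        return {"pages": 0, "median_time_to_ack_s": 0.0, "hours_interrupted": 0.0}
    ack_seconds = [(p.acked_at - p.fired_at).total_seconds() for p in pages]
    # Note: overlapping pages are counted twice in this simple sum.
    interrupted = sum((p.resolved_at - p.fired_at for p in pages), timedelta())
    return {
        "pages": len(pages),
        "median_time_to_ack_s": median(ack_seconds),
        "hours_interrupted": interrupted.total_seconds() / 3600,
    }


shift = [
    Page(datetime(2024, 1, 8, 2, 0), datetime(2024, 1, 8, 2, 4), datetime(2024, 1, 8, 2, 45)),
    Page(datetime(2024, 1, 8, 14, 0), datetime(2024, 1, 8, 14, 2), datetime(2024, 1, 8, 14, 15)),
]
print(shift_load(shift))
```

Comparing these numbers across shifts makes it easier to spot a rotation that is drifting toward unsustainable load.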
Common tasks
Define SLOs for a service
Step 1: Choose the right SLIs. Start from user journeys, not technical metrics.
| User journey | SLI type | Measurement |
|---|---|---|
| "Page loads fast" | Latency | requests_under_300ms / total_requests |
| "API calls succeed" | Availability | non_5xx_responses / total_responses |
| "Data is correct" | Correctness | correct_outputs / total_outputs |
| "Writes persist" | Durability | successful_writes_verified / total_writes |
Step 2: Set targets using historical data.
1. Pull 30 days of your current SLI measurements
2. Find your current actual performance (e.g., 99.85% availability)
3. Set SLO slightly below current performance (e.g., 99.7%)
4. Tighten over time as you improve reliability
Never set an SLO tighter than your best recent 30-day window without a corresponding reliability investment plan.
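The target-setting steps can be sketched as follows. The default headroom of 0.15 percentage points is an assumption taken from the example above (99.85% measured, 99.7% target), and all function names are hypothetical.

```python
def observed_sli(good: int, total: int) -> float:
    """Current performance over the measurement period."""
    if total <= 0:
        raise ValueError("need at least one event to measure an SLI")
    return good / total


def propose_slo(current: float, headroom: float = 0.0015) -> float:
    """Set the target slightly below current performance.

    headroom is an assumed default (0.15 percentage points).
    """
    return round(current - headroom, 4)


def needs_investment_plan(target: float, recent_window_slis: list[float]) -> bool:
    """True when the target is tighter than the best recent 30-day window."""
    return target > max(recent_window_slis)


current = observed_sli(good=998_500, total=1_000_000)
print(current)               # 0.9985
print(propose_slo(current))  # 0.997
print(needs_investment_plan(0.9995, [0.9985, 0.9991]))  # True
```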
Step 3: Choose the window. Rolling 30-day windows are standard. They smooth spikes but respond to sustained degradation. Avoid calendar month windows - they reset budgets on the 1st regardless of what happened on the 31st.
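A rolling window can be computed from per-day counts with a running sum, as in the sketch below. The input shape (a list of `(good, total)` tuples, oldest first) and the function name are assumptions for illustration; the example uses a 3-day window only to keep the numbers short.

```python
from collections import deque


def rolling_sli(daily_counts: list[tuple[int, int]], window_days: int = 30) -> list[float]:
    """SLI of the trailing window ending on each day.

    daily_counts holds (good, total) per day, oldest first.
    """
    window: deque[tuple[int, int]] = deque()
    good_sum = 0
    total_sum = 0
    result = []
    for good, total in daily_counts:
        window.append((good, total))
        good_sum += good
        total_sum += total
        if len(window) > window_days:
            # Drop the day that just fell out of the window.
            old_good, old_total = window.popleft()
            good_sum -= old_good
            total_sum -= old_total
        result.append(good_sum / total_sum if total_sum else 1.0)
    return result


days = [(100, 100), (90, 100), (100, 100), (100, 100), (100, 100)]
print([round(x, 4) for x in rolling_sli(days, window_days=3)])
# [1.0, 0.95, 0.9667, 0.9667, 1.0]
```

The bad day keeps weighing on the SLI until it ages out of the window, rather than disappearing at an arbitrary calendar boundary.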
Step 4: Define measurement exclusions. Planned maintenance, dependencies outside your control, and client errors (4xx) are typically excluded from SLI calculations.
Calculate and track error budgets
Burn rate alerting (recommended over threshold alerting):
Fast burn alert (page immediately):
Condition: burn_rate > 14.4 for 5 minutes
Meaning: At this rate, 30-day budget exhausted in ~50 hours
Severity: SEV2, page on-call
Slow burn alert (ticket, investigate):
Condition: burn_rate > 3 for 60 minutes
Meaning: Budget exhausted in ~10 days if trend continues
Severity: SEV3, create ticket
Budget depletion alert (SEV1 escalation trigger):
Condition: budget_remaining < 10%
Action: Feature freeze, reliability sprint
Multi-window alerting catches both fast spikes and slow degradation:
- 5-minute window: catches fast burns (major incident)
- 1-hour window: catches slow burns (creeping degradation)
- Both windows alerting together = high-confidence page
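The multi-window logic above can be sketched as a single evaluation function. The thresholds come from the alert definitions in this section; the `WindowStats` type, the function names, and the returned labels are hypothetical, and a real setup would express this in the alerting system's own rule language.

```python
from dataclasses import dataclass

FAST_BURN_THRESHOLD = 14.4  # evaluated over the 5-minute window
SLOW_BURN_THRESHOLD = 3.0   # evaluated over the 1-hour window


@dataclass(frozen=True)
class WindowStats:
    """Hypothetical bad/total event counts for one lookback window."""
    bad: int
    total: int


def burn_rate(stats: WindowStats, slo: float) -> float:
    """Observed error rate divided by the allowed error rate."""
    if stats.total == 0:
        return 0.0
    return (stats.bad / stats.total) / (1.0 - slo)


def evaluate(five_min: WindowStats, one_hour: WindowStats, slo: float) -> str:
    """Combine the fast and slow burn checks into one decision."""
    fast = burn_rate(five_min, slo) > FAST_BURN_THRESHOLD
    slow = burn_rate(one_hour, slo) > SLOW_BURN_THRESHOLD
    if fast and slow:
        return "page (high confidence)"
    if fast:
        return "page"
    if slow:
        return "ticket"
    return "ok"


# 2% errors over 5 minutes and ~0.42% over the hour, against a 99.9% SLO
print(evaluate(WindowStats(20, 1_000), WindowStats(50, 12_000), slo=0.999))
# page (high confidence)
```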
Budget depletion actions:
- Stop all non-essential deploys
- Pull toil-redu