Reliability Engineering for Cloud
Reliability is a property of a system, not of any individual component. A cloud service that meets its availability target is not necessarily built from components that individually meet their target — it is built from a system that absorbs component failures without exposing them to users. This skill covers the SRE toolkit (SLIs, SLOs, error budgets, runbooks, postmortems, chaos engineering) alongside NASA's Systems Engineering methodology for design reviews and verification, because cloud operations at serious scale have to borrow discipline from somewhere, and aerospace is the field that has been figuring this out the longest.
Agent affinity: gray (transaction processing reliability, ACID foundations), hamilton-cloud (SRE economics at AWS scale), lamport (formal safety arguments)
Concept IDs: cloud-se-phase-reviews, cloud-taid-verification, cloud-runbook-structure, cloud-procedure-execution, cloud-communication-loops
Service Level Indicators, Objectives, and Agreements
SLI (Service Level Indicator). A quantitative measure of some aspect of service behavior. Examples: fraction of requests that succeed, fraction of requests served in under 200 ms, bytes delivered, freshness of returned data.
SLO (Service Level Objective). A target for an SLI over a window. "99.9% of requests succeed over a 30-day window." SLOs are set internally by the team that operates the service.
SLA (Service Level Agreement). A contractual commitment, typically looser than the SLO, with financial consequences for violation. SLAs are customer-facing; SLOs are the internal discipline that prevents you from ever needing to pay out on an SLA.
Choosing SLIs
Good SLIs are:
- User-centric. Measure what users experience, not internal plumbing.
- Ratio-based. Expressed as "good events / total events" so they are interpretable across traffic levels.
- Implementable. Can be measured from existing data without instrumenting every code path.
Typical SLIs for a request/response service:
- Availability: fraction of non-5xx responses.
- Latency: fraction of responses under X ms.
- Quality: fraction of responses returning full results (not degraded).
Error Budgets
The error budget is the inverse of the SLO. If the SLO is 99.9% availability, the error budget is 0.1% — about 43 minutes per month of downtime.
Error budgets are the unit of negotiation between "ship features" and "improve reliability." When the error budget is being spent faster than planned, the response is to slow feature velocity and address reliability. When the error budget is being preserved, new features and risk-taking are encouraged. This removes the "SRE says no, developers say yes" political argument and replaces it with a shared metric.
Burn rate alerting. Alert when the error budget is being consumed faster than the window allows. A 2% burn over 1 hour (30 days of error budget consumed in 24 hours) is worth paging; a 0.1% burn over a week is worth a ticket.
Runbooks
A runbook is an executable procedure for a specific operational task: deploying a service, recovering from a known failure mode, running a database migration, rotating a credential. Good runbooks share structure:
- Title and scope. What this runbook does and does not cover.
- Prerequisites. State required before execution (access, data, approvals).
- Steps. Each step has an action, an expected output, a verification check, and a timeout.
- Rollback. What to do if the procedure fails partway.
- Escalation. Who to contact when the runbook doesn't match the situation.
- Last reviewed. Date and reviewer.
Runbooks that are out of date are worse than missing runbooks — they give false confidence. Schedule reviews.
Procedure Step Discipline
Every step is a contract: "if I do this, I expect to see that, within this time." When the observed result differs, the operator stops and escalates rather than continuing on autopilot. This is borrowed from aviation crew resource management and nuclear plant operations — fields where continuing on a procedure mismatch is what causes accidents.
Incident Response
An incident is an unplanned disruption that requires coordinated response. Structure:
Detection. Monitoring fires an alert. Ideal: alert fires before users notice. Acceptable: alert fires within seconds of the first user complaint.
Triage. Incident commander (IC) is designated. Severity is assessed. Initial scope is determined (which users, which regions, which services).
Mitigation. Stop the bleeding. Roll back the last change. Drain traffic from the affected region. Fail over to a backup. Mitigation is not a fix — it is a restoration of service so that the fix can be done calmly.
Resolution. The underlying cause is addressed and the system is returned to normal operation.
Postmortem. A blameless write-up of what happened, why, and what to do differently.
The Incident Commander Role
The IC does not fix the incident. The IC coordinates: assigns tasks, tracks who is doing what, communicates with stakeholders, decides when to declare the incident resolved. Without an IC, everyone tries to fix things simultaneously, communication breaks down, and mitigation takes hours longer than it should.
Blameless Postmortems
A postmortem is a report on an incident. It should answer:
- What happened? (Timeline of events.)
- What was the impact?
- What went well?
- What went poorly?
- Where did we get lucky?
- What are the action items, with owners and due dates?
"Blameless" means the narrative avoids assigning fault to individuals. The goal is not "Alice shouldn't have deployed on Friday" — the goal is "our deploy system should have caught this regression." People who fear blame hide their mistakes, and hidden mistakes are the ones that kill systems.
Chaos Engineering
Chaos engineering is the discipline of testing the system by deliberately introducing failures. Not to break things, but to verify that the designed-in resilience actually works.
Principles.
- Build a hypothesis about steady-state behavior ("the service stays available when one replica is killed").
- Vary real-world events (kill a replica, introduce latency, drop packets).
- Run in production ideally, or in a staging environment that is representative.
- Automate and run continuously.
- Minimize blast radius.
Tools: Netflix Chaos Monkey (kill random instances), Chaos Kong (kill regions), Gremlin, Chaos Mesh.
NASA Systems Engineering Phase Gates
NASA's Systems Engineering Handbook defines a lifecycle from Pre-Phase A (concept exploration) through Phase F (decommissioning), with formal reviews at phase transitions:
- MCR — Mission Concept Review. Is the mission needed? Feasible?
- SRR — System Requirements Review. Are the requirements complete and achievable?
- SDR — System Definition Review. Is the architecture defined to satisfy the requirements?
- PDR — Preliminary Design Review. Does the preliminary design meet the requirements?
- CDR — Critical Design Review. Is the detailed design ready for fabrication?
- SIR — System Integration Review. Are integration plans ready?
- ORR — Operational Readiness Review. Is the operations team ready to run the system?
- FRR — Flight Readiness Review. (Aerospace-specific; cloud analog: "go-live review.")
Each review has entrance criteria (you can't hold the review until X is true) and exit criteria (the review isn't closed until Y is agreed).
Mapping to Cloud
Not every cloud deployment needs all eight reviews. A new microservice might need MCR-lite (design doc), SRR-lite (requirements review), CDR-lite (code review), and ORR (runbook and on-call review). A new multi-region rollout with customer SLAs probably deserves the full set, adapted for cloud terminology. The point is not ceremony — it is that you cannot review what you have not defined,