Capacity Planning Skill
Produce a complete capacity planning document for a service. Capacity planning is not about predicting the future exactly — it is about understanding current headroom, modelling growth, and ensuring the team takes infrastructure action before a constraint becomes an incident.
A good capacity plan answers four questions: what is running out first, how long before it runs out, what it costs to fix, and who decides when to act.
Required Inputs
Ask for these if not already provided:
- Service name and description — what the service does and who depends on it
- Current traffic and usage metrics — requests per second (or per day), active users, data volume — whatever units are most natural for this service
- Current resource utilisation — CPU %, memory %, disk usage, connection pool utilisation, DB query throughput
- Growth rate or projections — historical growth rate, or known upcoming events (product launch, sales cycle, seasonal peak)
- Tech stack and infrastructure — cloud provider, compute type (VMs, containers, serverless), database, caching layer, CDN
- Cost constraints — current infrastructure spend, acceptable cost ceiling, or target cost per unit of traffic
Output Format
Capacity Plan: [Service Name]
Service: [Name] | Team: [Team name]
Author: [Name] | Last updated: [Date]
Planning horizon: [12 months, [Month Year] to [Month Year]]
Review cadence: [Quarterly]
1. Executive Summary
[3–5 sentences covering: current state, the most critical capacity constraint, the timeline before it becomes a risk, the recommended action, and the cost implication. Written for an engineering manager or VP who needs the key facts without reading the full document.]
Critical finding: [e.g. "The database connection pool will reach 90% utilisation within 6 weeks at current growth. Without action, this will cause request queueing and latency spikes under normal traffic."]
Recommended immediate action: [e.g. "Increase connection pool limit and add a read replica within the next 2 weeks."]
Estimated cost impact: [e.g. "Recommended changes add ~$[X]/month to infrastructure spend."]
2. Current Baseline
All metrics are 30-day averages unless noted. Date captured: [Date]
Traffic
| Metric | Value | Peak (7-day) | Notes |
|---|---|---|---|
| Requests per second (avg) | [X req/s] | [X req/s] | [Peak time / day of week] |
| Requests per day | [X M/day] | [X M/day] | — |
| Active users (DAU/MAU) | [X] / [X] | — | — |
| [Service-specific metric — e.g. jobs processed/hour] | [X] | [X] | — |
| [Service-specific metric — e.g. GB ingested/day] | [X GB] | [X GB] | — |
Compute
| Resource | Current utilisation | Instance type | Count | Notes |
|---|---|---|---|---|
| CPU (avg) | [X%] | [e.g. c5.2xlarge] | [X] | Peak: [X%] |
| Memory (avg) | [X%] | — | — | Peak: [X%] |
| Network egress | [X Mbps] | — | — | — |
| Container / pod count | [X] | [e.g. 2 vCPU / 4 GB] | — | Auto-scaling range: [X–Y] |
Database
| Resource | Current utilisation | Spec | Notes |
|---|---|---|---|
| CPU | [X%] | [e.g. db.r5.2xlarge] | Peak: [X%] |
| Memory | [X%] | [X GB RAM] | — |
| Storage used | [X GB] of [Y GB] ([Z%]) | [Y GB provisioned] | Growth: [~X GB/month] |
| IOPS (avg) | [X] of [Y provisioned] | [Y IOPS] | Peak: [X IOPS] |
| Connection pool | [X] of [Y max] ([Z%]) | Max connections: [Y] | [ORM pool size: X] |
| Query P99 latency | [X ms] | — | [Slowest query: X] |
| Read/write ratio | [X%] reads / [Y%] writes | — | — |
Cache
| Resource | Current utilisation | Spec | Notes |
|---|---|---|---|
| Memory used | [X GB] of [Y GB] ([Z%]) | [e.g. cache.r6g.large] | Eviction rate: [X%] |
| Hit rate | [X%] | — | Miss rate: [Y%] |
| Connections | [X] | Max: [Y] | — |
Storage / Object Store
| Resource | Current usage | Growth rate | Notes |
|---|---|---|---|
| [S3 / GCS / Blob] | [X GB / TB] | [~X GB/month] | [Lifecycle policies in place? Y/N] |
| Disk (if applicable) | [X GB] of [Y GB] | [~X GB/month] | [RAID / EBS type] |
Cost Baseline
| Component | Current monthly cost | % of total |
|---|---|---|
| Compute (app servers) | $[X] | [X%] |
| Database | $[X] | [X%] |
| Cache | $[X] | [X%] |
| Storage | $[X] | [X%] |
| CDN / bandwidth | $[X] | [X%] |
| Other ([describe]) | $[X] | [X%] |
| Total | $[X] | 100% |
Unit economics: $[X] per [1,000 requests / 1,000 users / GB processed]
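The unit economics figure can be derived from the cost baseline and the traffic baseline. The sketch below assumes cost is attributed evenly across all requests in a 30-day month; the function name and the sample figures are hypothetical and only illustrate the arithmetic.

```python
def cost_per_thousand_requests(
    monthly_cost_usd: float,
    requests_per_day: float,
    days_in_month: int = 30,
) -> float:
    """Return infrastructure cost in USD per 1,000 requests.

    Assumes every request carries an equal share of the monthly bill.
    """
    monthly_requests = requests_per_day * days_in_month
    if monthly_requests <= 0:
        raise ValueError("requests_per_day and days_in_month must be positive")
    return monthly_cost_usd / monthly_requests * 1000


# Hypothetical figures: $12,000/month total spend, 4M requests/day.
unit_cost = cost_per_thousand_requests(12_000, 4_000_000)
print(f"${unit_cost:.4f} per 1,000 requests")
```

Swap the denominator for users or GB processed when those units describe the service better than requests.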
3. Growth Projections
Assumptions
| Assumption | Value | Source | Confidence |
|---|---|---|---|
| Monthly traffic growth rate | [X%] | [Historical trend / product forecast] | [High / Medium / Low] |
| Seasonal peak factor | [+X% in [month(s)]] | [Last year's data / expected launch] | [High / Medium] |
| Upcoming events | [e.g. Marketing campaign — [Month], expected +[X]% traffic spike] | [Marketing plan] | [Medium] |
| User growth | [X new users/month] | [Sales pipeline / growth model] | [Medium] |
| Data growth | [X GB/month] | [Current trend] | [High] |
Traffic Forecast
| Timeframe | Req/s (avg) | Req/s (peak) | DAU | Data volume (cumulative) |
|---|---|---|---|---|
| Now (baseline) | [X] | [X] | [X] | [X GB/TB] |
| +3 months | [X] | [X] | [X] | [X GB/TB] |
| +6 months | [X] | [X] | [X] | [X GB/TB] |
| +12 months | [X] | [X] | [X] | [X GB/TB] |
Growth formula: [Baseline] × (1 + [monthly rate])^[months] + seasonal adjustment
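As a minimal sketch, the growth formula above can be applied to fill in the forecast table. The function name and sample values are hypothetical. The seasonal adjustment is treated here as an absolute uplift in the same units as the baseline, which is an assumption: if the seasonal factor is recorded as a percentage, convert it before passing it in.

```python
def project(
    baseline: float,
    monthly_rate: float,
    months: int,
    seasonal_adjustment: float = 0.0,
) -> float:
    """Compound the baseline by monthly_rate for `months`, then add the
    seasonal adjustment (assumed to be in the same units as baseline)."""
    return baseline * (1 + monthly_rate) ** months + seasonal_adjustment


# Hypothetical inputs: 200 req/s average today, 5% monthly growth.
baseline_rps = 200.0
monthly_rate = 0.05

for horizon in (0, 3, 6, 12):
    projected = project(baseline_rps, monthly_rate, horizon)
    print(f"+{horizon:>2} months: {projected:.1f} req/s")
```

Compounding matters over a 12-month horizon: at 5% per month, traffic grows by roughly 80% over a year rather than the 60% a linear estimate would suggest.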
Capacity Headroom Analysis
Given current utilisation and projected growth, when does each resource reach its safe ceiling?
| Resource | Current utilisation | Safe ceiling | Headroom remaining | Months to ceiling |
|---|---|---|---|---|
| App CPU | [X%] | 70% | [X%] | [X months] |
| App memory | [X%] | 80% | [X%] | [X months] |
| DB CPU | [X%] | 70% | [X%] | [X months] |
| DB storage | [X GB] of [Y GB] | 80% = [Z GB] | [X GB] | [X months] |
| DB IOPS | [X] of [Y] | 80% = [Z] | [X IOPS] | [X months] |
| DB connections | [X] of [Y] | 80% = [Z] | [X] | [X months] |
| Cache memory | [X GB] of [Y GB] | 75% = [Z GB] | [X GB] | [X months] |
| Storage (object) | [X TB] | No hard limit — cost trigger | — | [Cost trigger: $X/month] |
Red flags (resources hitting ceiling within 3 months):
- [Resource]: [current]% → ceiling in [X weeks] — Action required
- [Resource]: [current]% → ceiling in [X weeks] — Action required
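The "Months to ceiling" column and the red-flag list can be computed rather than estimated by eye. The sketch below makes two assumptions that should be checked per resource: utilisation that tracks traffic grows at the same compound monthly rate as traffic, and storage grows by a fixed amount per month. All names and numbers are hypothetical.

```python
import math


def months_to_ceiling(current: float, ceiling: float, monthly_rate: float) -> float:
    """Months until `current` reaches `ceiling` under compound growth.

    Solves current * (1 + rate)^n = ceiling for n.
    """
    if current <= 0:
        raise ValueError("current must be positive")
    if current >= ceiling:
        return 0.0  # already at or past the safe ceiling
    if monthly_rate <= 0:
        return math.inf  # no growth, ceiling is never reached
    return math.log(ceiling / current) / math.log(1 + monthly_rate)


def months_to_ceiling_linear(current: float, ceiling: float, growth_per_month: float) -> float:
    """Months until `current` reaches `ceiling` at a fixed growth per month."""
    if current >= ceiling:
        return 0.0
    if growth_per_month <= 0:
        return math.inf
    return (ceiling - current) / growth_per_month


def red_flags(months_by_resource: dict[str, float], horizon_months: float = 3.0) -> list[str]:
    """Return resources that reach their ceiling within the horizon."""
    return [name for name, m in months_by_resource.items() if m < horizon_months]


# Hypothetical baseline: 5% monthly traffic growth, 50 GB/month data growth.
resources = {
    "App CPU": months_to_ceiling(45.0, 70.0, 0.05),             # % utilisation
    "DB connections": months_to_ceiling(140.0, 160.0, 0.05),    # connections
    "DB storage": months_to_ceiling_linear(500.0, 800.0, 50.0), # GB
}
print(red_flags(resources))  # prints ['DB connections']
```

Resources that do not scale linearly with traffic, such as connection pools under retry storms, will hit their ceiling sooner than this model predicts, so treat the result as an upper bound on available time.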
4. Resource Requirements
Compute Requirements
| Timeframe | Required instances | Recommended instance type | Auto-scaling range | Notes |
|---|---|---|---|---|
| Now | [X] | [type] | [min: X, max: Y] | Current configuration |
| +3 months | [X] | [type] | [min: X, max: Y] | [Any instance type change needed?] |
| +6 months | [X] | [type or upgrade] | [min: X, max: Y] | [Consider [larger type / horizontal scale]] |
| +12 months | [X] | [type or upgrade] | [min: X, max: Y] | [State of horizontal vs vertical decision] |
Memory headroom target: Maintain ≥30% available memory at average load; ≥20% at peak.
CPU headroom target: Maintain ≥30% available CPU at average load; ≥15% at peak.
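The headroom targets translate into a required instance count once the per-instance capacity is known. The sketch below assumes that capacity has been measured, for example in a load test, as the request rate one instance sustains at 100% CPU, and that load is spread evenly across instances. The function name and figures are hypothetical.

```python
import math


def required_instances(
    peak_load: float,
    capacity_per_instance: float,
    target_utilisation: float,
) -> int:
    """Instances needed to serve peak_load while keeping each instance
    at or below target_utilisation (a fraction between 0 and 1)."""
    if not 0 < target_utilisation <= 1:
        raise ValueError("target_utilisation must be in (0, 1]")
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    usable_capacity = capacity_per_instance * target_utilisation
    return math.ceil(peak_load / usable_capacity)


# Hypothetical: 1,200 req/s projected peak, 150 req/s per instance at 100% CPU.
# A 15% CPU headroom target at peak means running at no more than 85%.
print(required_instances(1200, 150, 0.85))  # prints 10
```

Run the same calculation with the average-load target (70% utilisation for 30% headroom) and take the larger of the two results as the auto-scaling maximum.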
Database Requirements
| Timeframe | Instance type | Storage | IOPS | Read replica | Notes |
|---|---|---|---|---|---|
| Now | [type] | [X GB] | [X] | [Y/N] | Current |
| +3 months | [type] | [X GB] | [X] | [Y/N] | [Upgrade storage / IOPS] |
| +6 months | [type or upgrade] | [X GB] | [X] | Yes | [Read replica recommended by this point] |
| +12 months | [type] | [X GB] | [X] | [X replicas] | [Consider sharding / partitioning at this scale] |
Storage growth management:
- Current growth: [~X GB/month]
- Storage auto-scaling: [Enabled / Not enabled — enable by [date]]
- Archiving policy: [Records older than X months moved to [cold storage / archive tier]]
Cache Requirements
| Timeframe | Node type | Nodes | Memory | Notes |
|---|---|---|---|---|
| Now | [type] | [X] | [X GB] | Current |
| +6 months | [type] | [X] | [X GB] | [Scale out or upgrade] |
| +12 months | [type] | [X] | [X GB] | [Clu |