Capacity Planning Skill

Produce a complete capacity planning document for a service. Capacity planning is not about predicting the future exactly — it is about understanding current headroom, modelling growth, and ensuring the team takes infrastructure action before a constraint becomes an incident.

A good capacity plan answers: what is running out first, how long before it runs out, what does it cost to fix it, and who decides when to act.

Required Inputs

Ask for these if not already provided:

Service name and description — what the service does and who depends on it
Current traffic and usage metrics — requests per second (or per day), active users, data volume — whatever units are most natural for this service
Current resource utilisation — CPU %, memory %, disk usage, connection pool utilisation, DB query throughput
Growth rate or projections — historical growth rate, or known upcoming events (product launch, sales cycle, seasonal peak)
Tech stack and infrastructure — cloud provider, compute type (VMs, containers, serverless), database, caching layer, CDN
Cost constraints — current infrastructure spend, acceptable cost ceiling, or target cost per unit of traffic

Output Format

Capacity Plan: [Service Name]

Service: [Name] | Team: [Team name] Author: [Name] | Last updated: [Date] Planning horizon: [12 months — [Month Year] to [Month Year]] Review cadence: [Quarterly]

1. Executive Summary

[3–5 sentences covering: current state, the most critical capacity constraint, the timeline before it becomes a risk, the recommended action, and the cost implication. Written for an engineering manager or VP who needs the key facts without reading the full document.]

Critical finding: [e.g. "The database connection pool will reach 90% utilisation within 6 weeks at current growth. Without action, this will cause request queueing and latency spikes under normal traffic."]

Recommended immediate action: [e.g. "Increase connection pool limit and add a read replica within the next 2 weeks."]

Estimated cost impact: [e.g. "Recommended changes add ~$[X]/month to infrastructure spend."]

2. Current Baseline

All metrics are 30-day averages unless noted. Date captured: [Date]

Traffic

Metric	Value	Peak (7-day)	Notes
Requests per second (avg)	[X req/s]	[X req/s]	[Peak time / day of week]
Requests per day	[X M/day]	[X M/day]	—
Active users (DAU/MAU)	[X] / [X]	—	—
[Service-specific metric — e.g. jobs processed/hour]	[X]	[X]	—
[Service-specific metric — e.g. GB ingested/day]	[X GB]	[X GB]	—

Compute

Resource	Current utilisation	Instance type	Count	Notes
CPU (avg)	[X%]	[e.g. c5.2xlarge]	[X]	Peak: [X%]
Memory (avg)	[X%]	—	—	Peak: [X%]
Network egress	[X Mbps]	—	—	—
Container / pod count	[X]	[e.g. 2 vCPU / 4 GB]	—	Auto-scaling range: [X–Y]

Database

Resource	Current utilisation	Spec	Notes
CPU	[X%]	[e.g. db.r5.2xlarge]	Peak: [X%]
Memory	[X%]	[X GB RAM]	—
Storage used	[X GB] of [Y GB] ([Z%])	[X GB provisioned]	Growth: [~X GB/month]
IOPS (avg)	[X] of [Y provisioned]	[Y IOPS]	Peak: [X IOPS]
Connection pool	[X] of [Y max] ([Z%])	Max connections: [Y]	[ORM pool size: X]
Query P99 latency	[X ms]	—	[Slowest query: X]
Read/write ratio	[X%] reads / [Y%] writes	—	—

Cache

Resource	Current utilisation	Spec	Notes
Memory used	[X GB] of [Y GB] ([Z%])	[e.g. cache.r6g.large]	Eviction rate: [X%]
Hit rate	[X%]	—	Miss rate: [Y%]
Connections	[X]	Max: [Y]	—

Storage / Object Store

Resource	Current usage	Growth rate	Notes
[S3 / GCS / Blob]	[X GB / TB]	[~X GB/month]	[Lifecycle policies in place? Y/N]
Disk (if applicable)	[X GB] of [Y GB]	[~X GB/month]	[RAID / EBS type]

Cost Baseline

Component	Current monthly cost	% of total
Compute (app servers)	$[X]	[X%]
Database	$[X]	[X%]
Cache	$[X]	[X%]
Storage	$[X]	[X%]
CDN / bandwidth	$[X]	[X%]
Other ([describe])	$[X]	[X%]
Total	$[X]	100%

Unit economics: $[X] per [1,000 requests / 1,000 users / GB processed]

3. Growth Projections

Assumptions

Assumption	Value	Source	Confidence
Monthly traffic growth rate	[X%]	[Historical trend / product forecast]	[High / Medium / Low]
Seasonal peak factor	[+X% in [month(s)]]	[Last year's data / expected launch]	[High / Medium]
Upcoming events	[e.g. Marketing campaign — [Month], expected +[X]% traffic spike]	[Marketing plan]	[Medium]
User growth	[X new users/month]	[Sales pipeline / growth model]	[Medium]
Data growth	[X GB/month]	[Current trend]	[High]

Traffic Forecast

Timeframe	Req/s (avg)	Req/s (peak)	DAU	Data volume (cumulative)
Now (baseline)	[X]	[X]	[X]	[X GB/TB]
+3 months	[X]	[X]	[X]	[X GB/TB]
+6 months	[X]	[X]	[X]	[X GB/TB]
+12 months	[X]	[X]	[X]	[X GB/TB]

Growth formula: [Baseline] × (1 + [monthly rate])^[months] + seasonal adjustment

Capacity Headroom Analysis

When does each resource run out at current utilisation and projected growth?

Resource	Current utilisation	Safe ceiling	Headroom remaining	Months to ceiling
App CPU	[X%]	70%	[X%]	[X months]
App memory	[X%]	80%	[X%]	[X months]
DB CPU	[X%]	70%	[X%]	[X months]
DB storage	[X GB] of [Y GB]	80% = [Z GB]	[X GB]	[X months]
DB IOPS	[X] of [Y]	80% = [Z]	[X IOPS]	[X months]
DB connections	[X] of [Y]	80% = [Z]	[X]	[X months]
Cache memory	[X GB] of [Y GB]	75% = [Z GB]	[X GB]	[X months]
Storage (object)	[X TB]	No hard limit — cost trigger	—	[Cost trigger: $X/month]

Red flags (resources hitting ceiling within 3 months):

[Resource]: [current]% → ceiling in [X weeks] — Action required
[Resource]: [current]% → ceiling in [X weeks] — Action required

4. Resource Requirements

Compute Requirements

Timeframe	Required instances	Recommended instance type	Auto-scaling range	Notes
Now	[X]	[type]	[min: X, max: Y]	Current configuration
+3 months	[X]	[type]	[min: X, max: Y]	[Any instance type change needed?]
+6 months	[X]	[type or upgrade]	[min: X, max: Y]	[Consider [larger type / horizontal scale]]
+12 months	[X]	[type or upgrade]	[min: X, max: Y]	[State of horizontal vs vertical decision]

Memory headroom target: Maintain ≥30% available memory at average load; ≥20% at peak. CPU headroom target: Maintain ≥30% available CPU at average load; ≥15% at peak.

Database Requirements

Timeframe	Instance type	Storage	IOPS	Read replica	Notes
Now	[type]	[X GB]	[X]	[Y/N]	Current
+3 months	[type]	[X GB]	[X]	[Y/N]	[Upgrade storage / IOPS]
+6 months	[type or upgrade]	[X GB]	[X]	Yes	[Read replica recommended by this point]
+12 months	[type]	[X GB]	[X]	[X replicas]	[Consider sharding / partitioning at this scale]

Storage growth management:

Current growth: [~X GB/month]
Storage auto-scaling: [Enabled / Not enabled — enable by [date]]
Archiving policy: [Records older than X months moved to [cold storage / archive tier]]

Cache Requirements

Timeframe	Node type	Nodes	Memory	Notes
Now	[type]	[X]	[X GB]	Current
+6 months	[type]	[X]	[X GB]	[Scale out or upgrade]
+12 months	[type]	[X]	[X GB]	[Clu

capacity-planning

How to add

Drop this on your repo README

Related skills

internal-comms

babysit

do

smart-explore

Get new DevOps e Infra skills every Monday