System Design Framework

A structured approach to designing large-scale distributed systems. Apply these principles when architecting new services, reviewing system designs, estimating capacity, or preparing for system design discussions.

Core Principle

Start with requirements, not solutions. Every system design begins by clarifying what you are building, for whom, and at what scale. Jumping to architecture before understanding constraints produces over-engineered or under-engineered systems.

The foundation: Scalable systems are not invented from scratch -- they are assembled from well-understood building blocks (load balancers, caches, queues, databases, CDNs) connected by clear data flows. The skill lies in choosing the right blocks, sizing them correctly, and understanding the tradeoffs each choice introduces. A four-step process -- scope, high-level design, deep dive, wrap-up -- keeps the design focused and communicable.

Scoring

Goal: 10/10. When reviewing or creating system designs, rate them 0-10 based on adherence to the principles below. A 10/10 means the design clearly states requirements, includes back-of-the-envelope estimates, uses appropriate building blocks, addresses scaling and reliability, and acknowledges tradeoffs. Lower scores indicate gaps to address. Always provide the current score and specific improvements needed to reach 10/10.

The System Design Framework

Six areas for building reliable, scalable distributed systems:

1. The Four-Step Process

Core concept: Every system design follows four stages: (1) understand the problem and establish design scope, (2) propose a high-level design and get buy-in, (3) dive deep into critical components, (4) wrap up with tradeoffs and future improvements.

Why it works: Without a structured process, designs either stay too abstract or get lost in premature detail. The four-step approach ensures you invest time proportionally -- broad strokes first, depth where it matters.

Key insights:

Step 1 takes ~5-10 minutes: ask clarifying questions, list functional and non-functional requirements, agree on scale (DAU, QPS, storage)
Step 2 takes ~15-20 minutes: draw a high-level diagram with APIs, services, data stores, and data flow arrows
Step 3 takes ~15-20 minutes: pick the 2-3 components that are hardest or most critical and design them in detail
Step 4 takes ~5 minutes: summarize tradeoffs, identify bottlenecks, suggest future improvements
Never skip Step 1 -- ambiguity in scope leads to wasted design effort
Get explicit agreement on assumptions before proceeding

Code applications:

Context	Pattern	Example
New service kickoff	Write a one-page design doc with all four steps before coding	Requirements, API contract, data model, capacity estimate, then implementation
Architecture review	Walk reviewers through the four steps sequentially	Present scope, high-level diagram, deep-dive on the riskiest component, open questions
Incident postmortem	Trace the failure back through the four-step lens	Which requirement was missed? Which building block failed? What tradeoff bit us?

See: references/four-step-process.md

2. Back-of-the-Envelope Estimation

Core concept: Use powers of two, latency numbers, and simple arithmetic to estimate QPS, storage, bandwidth, and server count before committing to an architecture.

Why it works: Estimation prevents two failure modes: over-provisioning (wasting money) and under-provisioning (outages under load). A 2-minute calculation can save weeks of rework.

Key insights:

Know the powers of two: 2^10 ≈ 1 thousand (1 KB), 2^20 ≈ 1 million (1 MB), 2^30 ≈ 1 billion (1 GB), 2^40 ≈ 1 trillion (1 TB)
Memory read ~100 ns, SSD read ~100 us, disk seek ~10 ms, round-trip same datacenter ~0.5 ms, cross-continent ~150 ms
Availability nines: 99.9% = 8.77 hours downtime/year, 99.99% = 52.6 minutes/year
QPS estimation: DAU x average-actions-per-day / 86,400 seconds; peak QPS is typically 2-5x average
Storage estimation: records-per-day x record-size x retention-period
Always round aggressively -- the goal is order of magnitude, not precision
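The QPS and storage formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a capacity-planning tool: the function names are hypothetical, and the inputs (100M DAU, 5 actions per user, 500M records of 300 bytes kept for 365 days) are sample figures chosen only to exercise the arithmetic.

```python
SECONDS_PER_DAY = 86_400


def average_qps(dau: int, actions_per_user_per_day: float) -> float:
    """Average queries per second: DAU x actions per day / seconds per day."""
    return dau * actions_per_user_per_day / SECONDS_PER_DAY


def peak_qps(avg_qps: float, peak_factor: float = 2.0) -> float:
    """Peak QPS as a multiple of the average (the text suggests 2-5x)."""
    return avg_qps * peak_factor


def storage_bytes(records_per_day: int, record_size_bytes: int, retention_days: int) -> int:
    """Raw storage: records per day x record size x retention period."""
    return records_per_day * record_size_bytes * retention_days


if __name__ == "__main__":
    avg = average_qps(dau=100_000_000, actions_per_user_per_day=5)
    print(f"average QPS ~{avg:,.0f}")
    print(f"peak QPS at 5x ~{peak_qps(avg, 5):,.0f}")
    total = storage_bytes(500_000_000, 300, 365)
    print(f"storage ~{total / 1e12:.1f} TB/year")
```

The results are meant to be rounded hard before use: roughly 5,800 QPS average, about 30K at a 5x peak, and about 55 TB per year of raw storage, before replication or indexes.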

Code applications:

Context	Pattern	Example
Capacity planning	Estimate average QPS, then multiply by a peak factor	100M DAU x 5 actions / 86,400 = ~5,800 QPS avg, ~30K QPS peak at 5x
Storage budgeting	Estimate per-record size and multiply by volume and retention	500M tweets/day x 300 bytes x 365 days = ~55 TB/year
SLA definition	Convert availability nines to allowed downtime	Four nines (99.99%) = ~52 minutes downtime per year
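Converting availability nines into a downtime budget is a single multiplication. The sketch below assumes a 365.25-day year, which is what the 8.77 hours figure for 99.9% implies; the function name is hypothetical.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8,766 hours, averaging over leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Hours per year a service may be down while still meeting the target."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        hours = allowed_downtime_hours(target)
        print(f"{target}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

Each additional nine cuts the budget by a factor of ten, which is why the step from three nines to four usually forces changes in deployment and failover practice, not just more hardware.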

See: references/estimation-numbers.md

3. Building Blocks

Core concept: Scalable systems are assembled from a standard toolkit: DNS, CDN, load balancers, reverse proxies, application servers, caches, message queues, and consistent hashing.

Why it works: Each block solves a specific scaling or reliability problem. Knowing when and why to introduce each block prevents both premature complexity and avoidable bottlenecks.

Key insights:

DNS resolves domain names; CDN caches static assets at edge locations close to users
Load balancers distribute traffic -- L4 (transport layer, fast, simple) vs L7 (application layer, content-aware routing)
Caching layers: client-side, CDN, web server, application (e.g., Redis/Memcached), database query cache
Cache strategies: cache-aside (app manages), read-through (cache manages reads), write-through (cache manages writes synchronously), write-behind (cache writes asynchronously)
Message queues (Kafka, RabbitMQ, SQS) decouple producers from consumers, absorb traffic spikes, and enable async processing
Consistent hashing distributes keys across nodes with minimal redistribution when nodes are added or removed
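Of the cache strategies listed above, cache-aside is the one the application implements itself, so it is the easiest to show in code. The sketch below is illustrative only: an in-process dict stands in for Redis or Memcached, and the class name, the `load_from_db` and `write_to_db` callables, and the TTL value are all assumptions.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class CacheAside:
    """Cache-aside: the application, not the cache, talks to the database."""

    def __init__(
        self,
        load_from_db: Callable[[str], Optional[str]],
        write_to_db: Callable[[str, str], None],
        ttl_seconds: float = 60.0,
    ) -> None:
        self._load = load_from_db
        self._write = write_to_db
        self._ttl = ttl_seconds
        # key -> (expiry timestamp, value); a dict stands in for Redis here.
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is not None and time.monotonic() < entry[0]:
            return entry[1]  # cache hit
        # Miss or expired: read from the database and repopulate the cache.
        value = self._load(key)
        if value is None:
            self._entries.pop(key, None)
        else:
            self._entries[key] = (time.monotonic() + self._ttl, value)
        return value

    def update(self, key: str, value: str) -> None:
        # Write to the database first, then invalidate so the next read reloads.
        self._write(key, value)
        self._entries.pop(key, None)


if __name__ == "__main__":
    profiles = {"user:1": "Ada"}  # hypothetical stand-in for a database table
    cache = CacheAside(profiles.get, profiles.__setitem__, ttl_seconds=30)
    print(cache.get("user:1"))  # first read loads from the "database"
    cache.update("user:1", "Grace")
    print(cache.get("user:1"))  # reloaded after invalidation
```

Invalidating on write, instead of updating the cached value, keeps the write path simple at the cost of one extra database read afterwards; the TTL bounds how long a stale entry can survive if an invalidation is ever missed.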

Code applications:

Context	Pattern	Example
Read-heavy workload	Add cache-aside with Redis in front of the database	Cache user profiles with TTL; invalidate on write
Traffic spikes	Insert a message queue between API and workers	Enqueue image-resize jobs; workers pull at their own pace
Global users	Place a CDN in front of static assets	Serve JS/CSS/images from edge; origin only serves API
Uneven load	Use consistent hashing for shard assignment	Add a node and only ~1/n of the keys need to move (n = node count after the addition)
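The "only ~1/n keys move" property can be checked with a small hash ring. This is a minimal sketch under stated assumptions: the `HashRing` class, the node names, and the choice of 100 virtual nodes per server are all hypothetical, and SHA-256 is used only as a convenient, stable hash.

```python
import bisect
import hashlib
from typing import Dict, Iterable, List


def _hash(value: str) -> int:
    """Map a string to a stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent hash ring with virtual nodes."""

    def __init__(self, nodes: Iterable[str], vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._points: List[int] = []      # sorted ring positions
        self._owner: Dict[int, str] = {}  # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each node is placed at many positions to even out its share.
        for i in range(self._vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first position after the key, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


if __name__ == "__main__":
    keys = [f"key-{i}" for i in range(10_000)]
    ring = HashRing(["node-a", "node-b", "node-c"])
    before = {k: ring.node_for(k) for k in keys}
    ring.add_node("node-d")
    moved = sum(1 for k in keys if ring.node_for(k) != before[k])
    print(f"moved {moved / len(keys):.1%} of keys after adding a fourth node")
```

With four nodes after the addition, roughly a quarter of the keys should move, and every moved key lands on the new node. A plain `hash(key) % n` scheme would instead reshuffle most keys whenever n changes.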

See: references/building-blocks.md

4. Database Design and Scaling

Core concept: Choose SQL vs NoSQL based on data shape and access patterns. Scale vertically first, then horizontally (replication and sharding) once vertical limits are reached.

Why it works: The database is usually the first bottleneck. Understanding replication, sharding strategies, and denormalization tradeoffs lets you delay expensive re-architectures and plan growth deliberately.

Key insights:

Vertical scaling (bigger machine) is simpler but has a ceiling; horizontal scaling (more machines) is harder but nearly unlimited
Replication: leader-follower (one writer, many readers) for read-heavy; multi-leader for multi-region writes
Sharding strategies: hash-based (even distribution, hard range queries), range-based (efficient range queries, risk of hotspots), directory-based (flexible, extra lookup)
SQL when you need ACID transactions, complex joins, and a well-defined schema; NoSQL when you need flexible schema, horizontal scale, or very high write throughput
Denormalization trades storage and write complexity for faster reads -- use it when read performance is critical and data doesn't change frequently
Celebrity/hotspot problem: if one shard gets disproportionate traffic, add a secondary partition or cache layer
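The three sharding strategies above differ mainly in how a key is mapped to a shard. The sketch below shows one possible routing function for each; the function names, the split points, and the directory contents are hypothetical and chosen only for illustration.

```python
import bisect
import hashlib
from typing import Dict, List


def hash_shard(key: str, shard_count: int) -> int:
    """Hash-based: even spread, but adjacent keys land on unrelated shards."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def range_shard(key: str, split_points: List[str]) -> int:
    """Range-based: split_points[i] is the first key of shard i + 1 (sorted)."""
    return bisect.bisect_right(split_points, key)


def directory_shard(key: str, directory: Dict[str, int], default_shard: int) -> int:
    """Directory-based: an explicit lookup table, at the cost of an extra hop."""
    return directory.get(key, default_shard)


if __name__ == "__main__":
    split_points = ["h", "p"]  # shard 0: keys < "h", shard 1: "h".."o", shard 2: >= "p"
    for name in ("alice", "mallory", "zoe"):
        print(name, "hash ->", hash_shard(name, 3), "range ->", range_shard(name, split_points))
    # A hot key can be pinned to its own shard through the directory.
    print(directory_shard("celebrity", {"celebrity": 7}, default_shard=0))
```

Range routing keeps neighbouring keys together, which is what makes range scans cheap and hotspots likely. The directory variant shows one way to isolate a celebrity key without re-hashing everything else.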

Code applications:

Context	Pattern	Example
