Designing Data-Intensive Applications Framework

A principled approach to building reliable, scalable, and maintainable data systems. Apply these principles when choosing databases, designing schemas, architecting distributed systems, or reasoning about consistency and fault tolerance.

Core Principle

Data outlives code. Applications are rewritten, languages change, frameworks come and go -- but data and its structure persist for decades. Every architectural decision must prioritize the long-term correctness, durability, and evolvability of the data layer above all else.

The foundation: Most applications are data-intensive, not compute-intensive. The hard problems are the amount of data, its complexity, and the speed at which it changes. Understanding the trade-offs between consistency, availability, partition tolerance, latency, and throughput is what separates robust systems from fragile ones.

Scoring

Goal: 10/10. When reviewing or designing data architectures, rate them 0-10 based on adherence to the principles below. A 10/10 means deliberate trade-off choices for data models, storage engines, replication, partitioning, transactions, and processing pipelines; lower scores indicate accidental complexity or ignored failure modes. Always provide the current score and specific improvements needed to reach 10/10.

The DDIA Framework

Seven domains for reasoning about data-intensive systems:

1. Data Models and Query Languages

Core concept: The data model shapes how you think about the problem. Relational, document, and graph models each impose different constraints and enable different query patterns.

Why it works: Choosing the wrong data model forces application code to compensate for representational mismatch, adding accidental complexity that compounds over time.

Key insights:

Relational models excel at many-to-many relationships and ad-hoc queries
Document models excel at one-to-many relationships and data locality
Graph models excel at highly interconnected data with recursive traversals
Schema-on-write (relational) catches errors early; schema-on-read (document) offers flexibility
Polyglot persistence -- use different stores for different access patterns -- is often the right answer
Impedance mismatch between objects and relations is a real cost; document models reduce it for self-contained aggregates

Code applications:

Context	Pattern	Example
User profiles with nested data	Document model for self-contained aggregates	Store profile, addresses, and preferences in one MongoDB document
Social network connections	Graph model for relationship traversal	Neo4j Cypher query: `MATCH (a)-[:FOLLOWS*2]->(b)` for friend-of-friend
Financial ledger with joins	Relational model for referential integrity	PostgreSQL with foreign keys between accounts, transactions, and entries
Mixed access patterns	Polyglot persistence	PostgreSQL for transactions + Elasticsearch for full-text search + Redis for caching

See: references/data-models.md

2. Storage Engines

Core concept: Storage engines make a fundamental trade-off between read performance and write performance. Log-structured engines (LSM trees) optimize writes; page-oriented engines (B-trees) balance reads and writes.

Why it works: Understanding the internals of your database's storage engine lets you predict performance characteristics, choose appropriate indexes, and avoid pathological workloads.

Key insights:

LSM trees: append-only writes, periodic compaction, excellent write throughput, higher read amplification
B-trees: in-place updates, predictable read latency, write amplification from page splits
Write amplification means one logical write causes multiple physical writes -- critical for SSDs with limited write cycles
Column-oriented storage dramatically improves analytical query performance through compression and vectorized processing
In-memory databases are fast not because they avoid disk, but because they avoid encoding overhead

Code applications:

Context	Pattern	Example
High write throughput	LSM-tree engine	Cassandra or RocksDB for time-series ingestion at 100K+ writes/sec
Mixed read/write OLTP	B-tree engine	PostgreSQL B-tree indexes for transactional workloads with point lookups
Analytical queries on large datasets	Column-oriented storage	ClickHouse or Parquet files for scanning billions of rows with few columns
Low-latency caching	In-memory store	Redis for sub-millisecond lookups; Memcached for simple key-value caching

See: references/storage-engines.md

3. Replication

Core concept: Replication keeps copies of data on multiple machines for fault tolerance, scalability, and latency reduction. The core challenge is handling changes to replicated data consistently.

Why it works: Every replication strategy trades off between consistency, availability, and latency. Making this trade-off explicit prevents subtle data anomalies that surface only under load or failure.

Key insights:

Single-leader replication: simple, strong consistency possible, but the leader is a bottleneck and single point of failure
Multi-leader replication: better write availability across data centers, but conflict resolution is complex
Leaderless replication: highest availability, uses quorum reads/writes, but requires careful conflict handling
Replication lag causes read-your-writes violations, monotonic read violations, and causality violations
Synchronous replication guarantees durability but increases latency; asynchronous replication risks data loss on leader failure
CRDTs and last-writer-wins are conflict resolution strategies with very different correctness guarantees

Code applications:

Context	Pattern	Example
Read-heavy web app	Single-leader with read replicas	PostgreSQL primary + read replicas behind pgBouncer for read scaling
Multi-region writes	Multi-leader replication	CockroachDB or Spanner for geo-distributed writes with bounded staleness
Shopping cart availability	Leaderless with merge	DynamoDB with last-writer-wins or application-level merge for cart conflicts
Collaborative editing	CRDTs for conflict-free merging	Yjs or Automerge for real-time collaborative document editing

See: references/replication.md

4. Partitioning

Core concept: Partitioning (sharding) distributes data across multiple nodes so that each node handles a subset of the total data, enabling horizontal scaling beyond a single machine.

Why it works: Without partitioning, a single node becomes the bottleneck for storage capacity and throughput. Effective partitioning distributes load evenly and avoids hotspots.

Key insights:

Key-range partitioning supports efficient range scans but risks hotspots on sequential keys
Hash partitioning distributes load evenly but destroys sort order and makes range queries expensive
Secondary indexes can be partitioned locally (each partition has its own index) or globally (index partitioned separately)
Local secondary indexes require scatter-gather queries; global secondary indexes require cross-partition updates
Hotspots can occur even with hash partitioning if a single key is extremely popular (celebrity problem)
Rebalancing strategies: fixed number of partitions, dynamic splitting, or proportional to node count

Code applications:

Context	Pattern	Example
Time-series data	Key-range partitioning by time + source	Partition by `(sensor_id, date)` to avoid write hotspot on current day
User data at scale	Hash partitioning on user ID	Cassandra cons

ddia-systems

Cómo agregar

Pega en el README de tu repo

Skills relacionadas

xlsx

mem-search

weekly-digests

how-it-works

Recibe nuevas skills de Dados e Análise todos los lunes

Designing Data-Intensive Applications Framework

Core Principle

Scoring

The DDIA Framework

1. Data Models and Query Languages

2. Storage Engines

3. Replication

4. Partitioning

Comentarios · Sin comentarios