Data Architecture
Purpose
Guide architects and platform engineers through strategic data architecture decisions for modern cloud-native data platforms.
When to Use This Skill
Invoke this skill when:
- Designing a new data platform or modernizing legacy systems
- Choosing between data lake, data warehouse, or data lakehouse
- Deciding on data modeling approaches (dimensional, normalized, data vault, wide tables)
- Evaluating centralized vs data mesh architecture
- Selecting open table formats (Apache Iceberg, Delta Lake, Apache Hudi)
- Designing medallion architecture (bronze, silver, gold layers)
- Implementing data governance and cataloging
Core Concepts
1. Storage Paradigms
Three primary patterns for analytical data storage:
Data Lake: Centralized repository for raw data at scale
- Schema-on-read, cost-optimized ($0.02-0.03/GB/month)
- Use when: Diverse data sources, exploratory analytics, ML/AI training data
Data Warehouse: Structured repository optimized for BI
- Schema-on-write, ACID transactions, fast queries
- Use when: Known BI requirements, strong governance needed
Data Lakehouse: Hybrid combining lake flexibility with warehouse reliability
- Open table formats (Iceberg, Delta Lake), ACID on object storage
- Use when: Mixed BI + ML workloads, cost optimization (60-80% cheaper than warehouse)
Decision Framework:
- BI/Reporting only + Known queries → Data Warehouse
- ML/AI primary + Raw data needed → Data Lake or Lakehouse
- Mixed BI + ML + Cost optimization → Data Lakehouse (recommended)
- Exploratory/Unknown use cases → Data Lake
For detailed comparison, see references/storage-paradigms.md.
2. Data Modeling Approaches
Four primary modeling patterns:
Dimensional (Kimball): Star/snowflake schemas for BI
- Use when: Known query patterns, BI dashboards, trend analysis
Normalized (3NF): Eliminate redundancy for transactional systems
- Use when: OLTP systems, frequent updates, strong consistency
Data Vault 2.0: Flexible model with complete audit trail
- Use when: Compliance requirements, multiple sources, agile warehousing
Wide Tables: Denormalized, optimized for columnar storage
- Use when: ML feature stores, data science notebooks, high-performance dashboards
Decision Framework:
- Analytical (BI) + Known queries → Dimensional (Star Schema)
- Transactional (OLTP) → Normalized (3NF)
- Compliance/Audit → Data Vault 2.0
- Data Science/ML → Wide Tables
For detailed patterns, see references/modeling-approaches.md.
3. Data Mesh Principles
Decentralized architecture for large organizations (>500 people).
Four Core Principles:
- Domain-oriented decentralization
- Data as a product (SLAs, quality, documentation)
- Self-serve data infrastructure
- Federated computational governance
Readiness Assessment (Score 1-5 each):
- Domain clarity
- Team maturity
- Platform capability
- Governance maturity
- Scale need
- Organizational buy-in
Scoring: 24-30: Strong candidate | 18-23: Hybrid | 12-17: Build foundation first | 6-11: Centralized
Red Flags: Small org (<100 people), unclear domains, no platform team, weak governance
For full guide, see references/data-mesh-guide.md.
4. Medallion Architecture
Standard lakehouse pattern: Bronze (raw) → Silver (cleaned) → Gold (business-level)
Bronze Layer: Exact copy of source data, immutable, append-only
Silver Layer: Validated, deduplicated, typed data
Gold Layer: Business logic, aggregates, dimensional models, ML features
Data Quality by Layer:
- Bronze → Silver: Schema validation, type checks, deduplication
- Silver → Gold: Business rule validation, referential integrity
- Gold: Anomaly detection, statistical checks
For patterns, see references/medallion-pattern.md.
5. Open Table Formats
Enable ACID transactions on data lakes:
Apache Iceberg: Multi-engine, vendor-neutral (Context7: 79.7 score)
- Use when: Avoid vendor lock-in, multi-engine flexibility
Delta Lake: Databricks ecosystem, Spark-optimized
- Use when: Committed to Databricks
Apache Hudi: Optimized for CDC and frequent upserts
- Use when: CDC-heavy workloads
Recommendation: Apache Iceberg for new projects (vendor-neutral, broadest support)
For comparison, see references/table-formats.md.
6. Modern Data Stack
Standard Layers:
- Ingestion: Fivetran, Airbyte, Kafka
- Storage: Snowflake, Databricks, BigQuery
- Transformation: dbt (Context7: 87.0 score), Spark
- Orchestration: Airflow, Dagster, Prefect
- Visualization: Tableau, Looker, Power BI
- Governance: DataHub, Alation, Great Expectations
Tool Selection:
- Fivetran vs Airbyte: Pre-built connectors vs cost-sensitive
- Snowflake vs Databricks: BI-focused vs ML-focused
- dbt vs Spark: SQL-based vs large-scale processing
For detailed recommendations, see references/tool-recommendations.md and references/modern-data-stack.md.
7. Data Governance
Data Catalog: Searchable inventory (DataHub, Alation, Collibra)
Data Lineage: Track data flow (OpenLineage, Marquez)
Data Quality: Validation and testing (Great Expectations, Soda, dbt tests)
Access Control:
- RBAC: Role-based (sales_analyst role)
- ABAC: Attribute-based (row-level security)
- Column-level: Dynamic data masking for PII
For governance patterns, see references/governance-patterns.md.
Decision Frameworks
Framework 1: Storage Paradigm Selection
Step 1: Identify Primary Use Case
- BI/Reporting only → Data Warehouse
- ML/AI primary → Data Lake or Lakehouse
- Mixed BI + ML → Data Lakehouse
- Exploratory → Data Lake
Step 2: Evaluate Budget
- High budget, known queries → Data Warehouse
- Cost-sensitive, flexible → Data Lakehouse
Recommendation by Org Size:
- Startup (<50): Data Warehouse (simplicity)
- Growth (50-500): Data Lakehouse (balance)
- Enterprise (>500): Hybrid or unified Lakehouse
See references/decision-frameworks.md.
Framework 2: Data Modeling Approach
Decision Tree:
- Analytical (BI) workload → Dimensional or Wide Tables
- Transactional (OLTP) → Normalized (3NF)
- Compliance/Audit → Data Vault 2.0
- Data Science/ML → Wide Tables
See references/decision-frameworks.md.
Framework 3: Data Mesh Readiness
Use 6-factor assessment. Score interpretation:
- 24-30: Proceed with data mesh
- 18-23: Hybrid approach
- 12-17: Build foundation first
- 6-11: Centralized
See references/decision-frameworks.md.
Framework 4: Open Table Format Selection
Decision Tree:
- Multi-engine flexibility → Apache Iceberg
- Databricks ecosystem → Delta Lake
- Frequent upserts/CDC → Apache Hudi
Recommendation: Apache Iceberg for new projects
See references/decision-frameworks.md.
Common Scenarios
Startup Data Platform
Context: 50-person startup, PostgreSQL + MongoDB + Stripe
Recommendation:
- Storage: BigQuery or Snowflake
- Ingestion: Airbyte or Fivetran
- Transformation: dbt
- Orchestration: dbt Cloud
- Architecture: Simple data warehouse
Enterprise Modernization
Context: Legacy Oracle warehouse, need cloud migration
Recommendation:
- Storage: Data Lakehouse (Databricks or Snowflake with Iceberg)
- Strategy: Incremental migration with CDC
- Architecture: Medallion (bronze, silver, gold)
- Cost Savings: 60-80%
Data Mesh Assessment
Context: 200-pers