Crawl Data — Discover and Register Data Models
Scan a codebase directory for data model definitions and propose registry entries.
Arguments
$1— Path to scan (required). Absolute or relative path to the codebase to crawl.--domain <name>— Domain to assign discovered models to (optional, will ask if omitted).--write— Write proposed entries to registry immediately (default: preview only).
If no path is provided, ask the user which directory to scan.
Workflow
1. Discover Type-to-Folder Mapping
Read models/registry-mapping.yaml to find:
- The folder path for
data_conceptentries - The folder path for
data_aggregateentries - The folder path for
data_entityentries - The
_template.mdin each folder for frontmatter structure
Never hardcode paths. Always derive from the YAML.
2. Scan for Data Model Definitions
Search the target directory for data model files. Use these detection patterns:
SQL Schema files:
- Glob:
**/*.sql,**/migrations/**/*.sql,**/schema/**/*.sql - Content match:
CREATE TABLE,ALTER TABLE - Extract: table name, column names/types, constraints, foreign keys
Prisma models:
- Glob:
**/schema.prisma,**/*.prisma - Content match:
model <Name> { - Extract: model name, fields with types, relations (
@relation)
TypeORM / Sequelize / Drizzle (TypeScript ORMs):
- Grep for:
@Entity(),@Table,Model.init,pgTable(,mysqlTable( - Look in:
**/models/**,**/entities/**,**/schema/** - Extract: class/table name, decorated columns, relations
Pydantic / dataclass models (Python):
- Grep for:
class.*BaseModel,@dataclass,class.*Model.*models.Model(Django) - Look in:
**/models/**,**/schemas/**,**/domain/** - Extract: class name, field names/types, validators
TypeScript interfaces / types:
- Grep for:
export interface,export type.*=.*{ - Look in:
**/types/**,**/interfaces/**,**/models/** - Extract: interface/type name, property names/types
Protobuf messages:
- Glob:
**/*.proto - Content match:
message <Name> { - Extract: message name, fields with types
3. Extract Data Model Information
For each discovered model, extract:
| Field | Source |
|---|---|
name | Table/class/interface name, converted to Title Case |
description | JSDoc/docstring/comment above definition, or TBD |
entity_type | root if standalone, child if has foreign key to parent, value-object if embedded |
attributes | Column/field definitions with name, type, required status |
classification | pii if field names suggest personal data (email, phone, ssn, address), internal otherwise |
4. Build Data Hierarchy
Group discovered models into the three-level registry hierarchy:
-
Data Concept — High-level business concept (e.g., "Customer", "Order", "Payment")
- Inferred from: table name prefixes, module/folder grouping, foreign key clusters
- One concept per logical grouping
-
Data Aggregate — Bounded collection of entities (DDD aggregate root + children)
- Inferred from: root tables with child tables referencing them
- If unclear, each standalone table/model becomes its own aggregate
-
Data Entity — Individual table/model with attributes
- Direct mapping from discovered models
- Includes extracted attributes array
5. Check for Duplicates
Before proposing entries, check existing registry entries:
- List existing data_concept, data_aggregate, and data_entity entries
- Compare by name (case-insensitive)
- Flag potential duplicates
6. Present Findings
Show the user a summary:
**Data Model Discovery Results** — scanned: <path>
Found X data models:
**Proposed Hierarchy:**
📦 Customer (Data Concept)
└─ 🗃️ Customer Aggregate (Data Aggregate)
├─ 📝 Customer (root entity, 8 attributes)
├─ 📝 Customer Address (child entity, 5 attributes)
└─ 📝 Customer Preference (value-object, 3 attributes)
📦 Order (Data Concept)
└─ 🗃️ Order Aggregate (Data Aggregate)
├─ 📝 Order (root entity, 12 attributes)
└─ 📝 Order Line Item (child entity, 6 attributes)
**Duplicates:** (if any)
- "Customer" already exists in registry as data-concepts/customer.md
**Classification hints:**
- Customer → PII detected (email, phone fields)
- Order → internal
7. Write Entries (if --write or user confirms)
For each approved entry:
- Generate kebab-case filename from the model name
- Read the
_template.mdfor the target type - Fill in discovered fields:
name,description,status: draftclassificationfor data_conceptentity_typeandattributesfor data_entity- Parent relationships (
parent_data_concept,parent_data_aggregate)
- Write to the correct registry folder
- Report what was written
8. Post-Scan Report
**Written X entries:**
- data_concept: registry-v2/<path>/concept-name.md
- data_aggregate: registry-v2/<path>/aggregate-name.md
- data_entity: registry-v2/<path>/entity-name.md
**Next steps:**
1. Review and fill in TBD fields (owner, description)
2. Verify classification (PII, business-confidential, internal)
3. Wire component relationships: `owned_by_component`
4. Run `/validate` to check model consistency
Detection Priority
- SQL CREATE TABLE statements (highest confidence — explicit schema)
- Prisma models (high confidence — typed, complete)
- ORM entity decorators (high confidence — structured)
- Protobuf messages (high confidence — typed)
- Pydantic/dataclass models (medium confidence — may be DTOs not domain models)
- TypeScript interfaces (lower confidence — may be API shapes not data models)
Notes
- Always propose as
status: draft— never auto-promote to active - PII classification is a hint based on field names — human must verify
- If a model has no clear parent, make it both a data_concept and data_aggregate (flat hierarchy)
- Skip migration files — focus on current model state, not history
- Skip test fixtures and mock data files
- Large codebases: limit scan to first 100 models and suggest narrowing the path