CODEMAP — Codebase Navigation Index Generator
Generate hierarchical CODEMAP.md files that serve as a structured navigation index for large codebases. Each directory gets its own CODEMAP.md containing a simplified directory tree, domain-annotated file/subdirectory summaries, task-to-file routing guides, file-level dependency tables, and directory-level key exports with source location annotations. For extra-large source files (>1000 lines), generate a companion deep-analysis file with a feature-to-code index. Agent navigation follows a layered lazy-loading strategy: Task Guide first → Domain filter → drill down → batch-read only the source files actually needed.
Language Rule
All generated CODEMAP.md files and analysis.md files MUST be written in the same language as the user's request. If the user asks in Chinese, all summaries, purpose descriptions, function descriptions, and section headings (except code identifiers and file names) are written in Chinese. If the user asks in English, write in English. Code identifiers (ClassName, function_name, file paths) always retain their original form regardless of language.
Two Operating Modes
Use the AskUserQuestion tool to ask the user before doing anything else. Present all three questions in a single call:
Question 1 (header: "Mode", single-select):
- Learning (Recommended) — read-only study, no code changes expected
- Maintenance — active development, bugs/features/refactors
Question 2 (header: "Sub-agents", single-select):
- Yes, max 3 (Recommended) — parallel sub-agent generation with default limit of 3
- Yes, custom limit — parallel sub-agent generation, user specifies max count
- No — single-agent serial generation
Question 3 (header: "Ignore", single-select):
- Defaults only (Recommended) — use built-in ignore list + .gitignore
- Add custom patterns — user provides additional ignore patterns
Mode affects:
| Aspect | Learning | Maintenance |
|---|---|---|
| CODEMAP frontmatter | mode: learning | mode: maintenance, includes commit: <hash> |
| Update strategy | One-time generation, no updates | Incremental via git diff, regenerate changed dirs only |
| Content tone | May include brief design-intent notes | Concise, purely navigational |
| AGENTS.md clause | Declares CODEMAP existence + relaxed navigation rules (Domain as guidance, deps as reference, anti-speculation) | Additionally declares: strict constraint rules (Task Guide First, dependency gating, Two-Stage Read Protocol) + full update rules with decision tree |
Ignore Rules (Three Layers)
Apply in order, merge results:
Layer 1 — Built-in defaults:
node_modules/, .git/, dist/, build/, out/, target/,
__pycache__/, .venv/, venv/, env/, .env, .egg-info/,
*.pyc, *.pyo, *.min.js, *.min.css, *.map,
*.lock, package-lock.json, yarn.lock, pnpm-lock.yaml,
.DS_Store, Thumbs.db, *.log,
.idea/, .vscode/, .vs/, *.swp, *.swo,
coverage/, .nyc_output/, .pytest_cache/, .mypy_cache/,
*.so, *.dylib, *.dll, *.o, *.obj, *.exe,
*.png, *.jpg, *.jpeg, *.gif, *.ico, *.svg, *.bmp,
*.woff, *.woff2, *.ttf, *.eot
Layer 2 — Project .gitignore: If exists, read and merge patterns.
Layer 3 — User custom: From Q3 above.
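The three-layer merge can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the function names `merge_ignore_layers` and `is_ignored` are hypothetical, only a subset of the built-in list is shown, and the matching is deliberately simplified (it does not implement full .gitignore semantics such as negation or anchored paths).

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# A small subset of the Layer 1 defaults, for illustration only.
BUILTIN_IGNORES = ["node_modules/", ".git/", "dist/", "__pycache__/", "*.pyc", "*.lock", "*.png"]


def merge_ignore_layers(builtin, gitignore_lines, custom):
    """Merge the three layers in order, dropping blanks, comments, and duplicates."""
    merged = []
    for layer in (builtin, gitignore_lines, custom):
        for raw in layer:
            pattern = raw.strip()
            if pattern and not pattern.startswith("#") and pattern not in merged:
                merged.append(pattern)
    return merged


def is_ignored(rel_path, patterns):
    """Simplified matching (an assumption of this sketch):
    'name/' matches any directory component with that name;
    other patterns are fnmatch globs tested against the file's base name."""
    parts = PurePosixPath(rel_path).parts
    for pattern in patterns:
        if pattern.endswith("/"):
            if pattern.rstrip("/") in parts[:-1]:
                return True
        elif fnmatch(parts[-1], pattern):
            return True
    return False
```

For example, merging the defaults with a .gitignore containing `generated/` and a custom pattern `*.snap` yields one pattern list, which can then be applied to every path found during the Phase 1 scan.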
Generation Workflow
Phase 0: Global Context
- Read README.md (or README.rst, README.txt) at project root.
  - If no README, fall back to package.json / pyproject.toml / Cargo.toml / go.mod / pom.xml description fields + root file list.
  - If none of the above exist, synthesize from root file list + first-level subdirectory names. Reasonable functional guesses are acceptable when information is scarce, but always note "based on structure inference, subject to code verification" for any guessed content.
- Combine with the filtered directory topology from Phase 1 to produce a project global context summary. Keep it concise — summarize what the project does, its high-level architecture, and major functional domains. Do not pad with installation instructions, badges, or changelogs.
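The fallback order for finding a context source could look like the sketch below. The function name `pick_context_source` and the strategy labels it returns are hypothetical. Reading and summarizing the chosen files is left out.

```python
from pathlib import Path

README_NAMES = ["README.md", "README.rst", "README.txt"]
MANIFEST_NAMES = ["package.json", "pyproject.toml", "Cargo.toml", "go.mod", "pom.xml"]


def pick_context_source(root):
    """Return (strategy, names) following the Phase 0 fallback order."""
    root = Path(root)
    # First choice: a README at the project root.
    for name in README_NAMES:
        if (root / name).is_file():
            return "readme", [name]
    # Second choice: description fields from any manifest that exists.
    manifests = [name for name in MANIFEST_NAMES if (root / name).is_file()]
    if manifests:
        return "manifest", manifests
    # Last resort: root entries only. Anything derived from this is a guess
    # and must carry the "based on structure inference" note.
    return "structure-inference", sorted(p.name for p in root.iterdir())
```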
Phase 1: Scan, Filter, and Measure
- Run Glob + ls on the project root, applying all three ignore layers.
- Produce a filtered directory topology: for each first-level subdirectory, collect its source files recursively.
- Identify root-level loose files (files not inside any subdirectory).
- Collect metrics for sub-agent dispatch (run these commands after filtering):
  - Per first-level subdirectory: total file count, total code line count (wc -l or equivalent on all source files), total size on disk (du -sh or equivalent).
  - Project total: sum of the above.
  - These metrics determine sub-agent count and load balancing (see Phase 2).
- Identify large files: Flag any source file exceeding 1000 lines.
  - If the count of large files is ≤ 5: generate deep-analysis companion files for all of them automatically.
  - If the count is > 5: present the list of large files (with line counts) to the user and ask:
    Found N files exceeding 1000 lines. Generate deep-analysis files for:
    A) All of them
    B) Only the top 5 largest
    C) Let me pick which ones
    D) None — skip deep analysis
  - See "Large File Deep Analysis" section for the companion file format.
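As a rough equivalent of the `wc -l` and `du` measurements, the metrics and the large-file check can be sketched in Python. The names `collect_metrics` and `needs_user_choice` and the `source_suffixes` parameter are assumptions made for illustration, and ignore filtering is omitted for brevity (the sketch assumes it has already been applied).

```python
from pathlib import Path

LARGE_FILE_THRESHOLD = 1000  # lines, per the large-file rule above


def count_lines(path):
    """Count lines roughly like `wc -l`; binary mode avoids decoding errors."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def collect_metrics(root, source_suffixes=(".py", ".ts", ".go")):
    """Return per-subdirectory metrics plus (path, lines) for each large file."""
    root = Path(root)
    metrics, large_files = {}, []
    for subdir in sorted(p for p in root.iterdir() if p.is_dir()):
        files = lines = size = 0
        for path in subdir.rglob("*"):
            if path.is_file() and path.suffix in source_suffixes:
                n = count_lines(path)
                files += 1
                lines += n
                size += path.stat().st_size
                if n > LARGE_FILE_THRESHOLD:
                    large_files.append((path.relative_to(root).as_posix(), n))
        metrics[subdir.name] = {"files": files, "lines": lines, "bytes": size}
    return metrics, large_files


def needs_user_choice(large_files, auto_limit=5):
    """True when there are more than 5 large files and the user must choose."""
    return len(large_files) > auto_limit
```

The project totals are the sums of the per-directory values, and the same `metrics` mapping feeds the sub-agent count and load balancing in Phase 2.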
Phase 2: Sub-agent Dispatch (if enabled)
Determine actual sub-agent count K:
Let L = total code line count after filtering, S = total size on disk, N = user-specified max sub-agents.
| Condition | K |
|---|---|
| L <= 3,000 lines OR S <= 500 KB | 1 (not worth splitting) |
| 3,000 < L <= 15,000 OR 500 KB < S <= 3 MB | min(N, 2) |
| L > 15,000 OR S > 3 MB | N |
Evaluate the line-count and size thresholds independently. When they suggest different K values, use the larger K.
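One way to read the table together with the tie-break rule is to place each measurement in a tier separately and keep the larger resulting K. The sketch below assumes 1 KB = 1024 bytes and `max_agents >= 1`. The function names are hypothetical.

```python
def tier(value, low, high):
    """Map a measurement to tier 0, 1, or 2 using inclusive upper bounds."""
    if value <= low:
        return 0
    if value <= high:
        return 1
    return 2


def choose_agent_count(total_lines, total_bytes, max_agents):
    """Evaluate lines and size separately, then keep the larger K."""
    k_by_tier = [1, min(max_agents, 2), max_agents]
    line_tier = tier(total_lines, 3_000, 15_000)
    size_tier = tier(total_bytes, 500 * 1024, 3 * 1024 * 1024)  # assumes 1 KB = 1024 bytes
    return max(k_by_tier[line_tier], k_by_tier[size_tier])
```

Under this reading, a project with 2,000 lines but 4 MB on disk gets K = N, because the size tier wins over the line-count tier.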
Load balancing — Greedy Bin Packing by code line count:
- Sort first-level subdirectories by code line count descending.
- Maintain K bins (one per sub-agent), each tracking total line count.
- For each directory: assign to the bin with the smallest current total.
- Root-level loose files: assign them to the lightest bin, or let the main agent handle them if their total line count is <= 200.
The goal is to equalize code line count across sub-agents (not file count), since line count correlates more closely with actual reading and analysis effort.
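The greedy assignment described above can be sketched in a few lines. The function name `assign_directories` and the sample directory names are hypothetical, and root-level loose files are not handled here.

```python
def assign_directories(line_counts, k):
    """Greedy bin packing: largest directory first, always into the lightest bin."""
    bins = [{"dirs": [], "lines": 0} for _ in range(k)]
    # Sort first-level subdirectories by code line count, descending.
    ordered = sorted(line_counts.items(), key=lambda item: item[1], reverse=True)
    for name, lines in ordered:
        # Pick the bin with the smallest current total (first one on ties).
        lightest = min(bins, key=lambda b: b["lines"])
        lightest["dirs"].append(name)
        lightest["lines"] += lines
    return bins


sample = {"api": 9000, "core": 7000, "ui": 4000, "utils": 3000, "docs": 1000}
result = assign_directories(sample, 2)
# result[0] -> {"dirs": ["api", "utils"], "lines": 12000}
# result[1] -> {"dirs": ["core", "ui", "docs"], "lines": 12000}
```

The greedy heuristic does not guarantee an optimal split, but placing the largest directories first usually keeps the per-agent totals close.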
Each sub-agent receives:
- Project global context summary (compressed, from Phase 0)
- Ignore rules
- Output language (matching the user's request language)
- List of directories assigned to it (with paths)
- List of large files (>1000 lines) within its assigned directories that the user confirmed for deep analysis (from Phase 1 Step 5 selection)
- Instruction: for each assigned directory and all its subdirectories, read source files and generate CODEMAP.md files following the format spec below. Additionally:
- Analyze each file's imports to populate the Domain column, Cross-Dir Dependencies column, and File Dependencies table.
- Identify functional domains within the directory and create the Task Guide table mapping task types to target files.
- For .analysis.md files, generate the Feature Index mapping development intents to line ranges.
Each sub-agent produces:
- One CODEMAP.md per directory it is responsible for
- One <filename>.analysis.md per large file (>1000 lines) in its scope
Phase 2 (alternative): Serial Generation (if sub-agents disabled)
Main agent processes first-level subdirectories one by one in descending line-count order. For each directory: read all source files within it, generate CODEMAP.md for it and all its subdirectories. Generate deep-analysis files for any source file exceeding 1000 lines. All new fields (Domain, Cross-Dir Dependencies, File Dependencies, Task Guide, Feature Index) must be populated.
Phase 3: Root Assembly
- Main agent reads the CODEMAP.md of each first-level subdirectory (summary line + Task Guide + Dependencies sections, not full content).
- Genera