Batch Cohort Analysis Skill
You are assisting a medical researcher in generating multiple analysis scripts from a single validated methodology template, each differing only in the exposure/outcome variable combination. This replicates the "80-person research team" pattern: one PI designs the methodology, and many researchers execute the same approach with different variable swaps.
When to Use
- Researcher has a validated analysis template (e.g., from /replicate-study or /cross-national)
- Wants to explore multiple exposure → outcome combinations on the same database
- Goal: systematic variable-swap code generation + batch execution + result matrix
Inputs
- Database path(s): CSV/SAS data files (KNHANES, NHANES, NHIS, or any cleaned cohort)
- Methodology template: One of:
  - Path to a validated R/Python analysis script (from /replicate-study or /cross-national)
  - A paper type template name: nhis_cohort, cross_national, survey_weighted
  - A source paper to extract methodology from (falls back to /replicate-study Phase 1)
- Combination spec: A list of exposure/outcome pairs, provided as:
  - Inline list: exposures: [depression, obesity, smoking]; outcomes: [diabetes, hypertension, CVD]
  - CSV file with columns: exposure, outcome, (optional) subgroup_vars
  - "all" keyword: generates all pairwise combinations from the lists
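The "all" keyword in the combination spec can be read as a Cartesian product of the two lists. The sketch below is one possible way to expand it; the function name `expand_combinations` and the rule that drops pairs where exposure equals outcome are assumptions for illustration, not part of the skill definition.

```python
from itertools import product

def expand_combinations(exposures, outcomes, pairs="all"):
    """Return a list of (exposure, outcome) tuples.

    pairs="all" yields every pairwise combination of the two lists;
    otherwise `pairs` is taken as an explicit list of pairs.
    """
    if pairs == "all":
        combos = list(product(exposures, outcomes))
    else:
        combos = [tuple(p) for p in pairs]
    # Assumption: a variable is never analysed as its own outcome.
    return [(e, o) for e, o in combos if e != o]

exposures = ["depression", "obesity", "smoking"]
outcomes = ["diabetes", "hypertension", "CVD"]
combos = expand_combinations(exposures, outcomes)
```

With the inline lists from the example above, this produces nine pairs, ordered by exposure first and outcome second.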
Optional Inputs
- Covariate set: Fixed covariate list for all analyses (default: use template's set)
- Subgroup variables: Variables to stratify by (default: sex, age group)
- Output format: code_only (just scripts) | execute (run + collect results) | full (code + results + summary)
- Cross-national mode: If TRUE, generates paired scripts for both countries per combination
Workflow
Phase 1: Template Validation
- Read the methodology template (R script or paper type reference).
- Identify the slot variables (the parts that change per combination):
  - EXPOSURE_VAR: raw variable name in the database
  - EXPOSURE_LABEL: human-readable label for tables/figures
  - EXPOSURE_CODING: how to derive binary/categorical exposure
  - OUTCOME_VAR: raw variable name
  - OUTCOME_LABEL: human-readable label
  - OUTCOME_CODING: how to derive binary outcome
- Verify the template runs successfully on at least one combination before batch generation.
- Output: template summary with identified slots → user approval.
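As a rough illustration of slot identification, the sketch below scans a template for placeholders and reports which of the six slots are missing. The `{{NAME}}` placeholder syntax and the helper names are assumptions made for this example; a real template may mark its slots differently.

```python
import re

# Assumed placeholder syntax: {{SLOT_NAME}} in upper case.
SLOT_PATTERN = re.compile(r"\{\{([A-Z_]+)\}\}")

EXPECTED_SLOTS = {
    "EXPOSURE_VAR", "EXPOSURE_LABEL", "EXPOSURE_CODING",
    "OUTCOME_VAR", "OUTCOME_LABEL", "OUTCOME_CODING",
}

def find_slots(template_text):
    """Return the set of placeholder names used in the template."""
    return set(SLOT_PATTERN.findall(template_text))

def check_template(template_text):
    """Compare the placeholders found against the six expected slots."""
    found = find_slots(template_text)
    return {
        "missing": sorted(EXPECTED_SLOTS - found),
        "unexpected": sorted(found - EXPECTED_SLOTS),
    }

# Hypothetical template fragment, for illustration only.
template = """
# {{EXPOSURE_LABEL}} -> {{OUTCOME_LABEL}}
dat$exposure <- {{EXPOSURE_CODING}}
dat$outcome <- {{OUTCOME_CODING}}
"""
report = check_template(template)
```

A non-empty "missing" list is a reasonable cue to stop and ask the user before generating any scripts.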
Phase 2: Variable Specification
For each exposure and outcome in the combination spec:
- Look up the variable in the database:
  - KNHANES: check that the variable name exists in the CSV header
  - NHANES: check which table contains the variable (use codebook.csv if available)
  - NHIS: check claims code or variable name
- Define coding:
  - Binary: threshold or category mapping (e.g., HE_glu >= 126 → diabetes = 1)
  - Categorical: level definitions (e.g., smoking: current/former/never)
- Check covariate overlap: If the exposure IS one of the standard covariates, remove it from the adjustment set for that analysis (no self-adjustment).
- Output: combination matrix with all variable specifications.
| # | Exposure | Exposure Coding | Outcome | Outcome Coding | Covariates (adjusted) | Notes |
|---|----------|-----------------|---------|----------------|----------------------|-------|
| 1 | Depression (PHQ≥10) | BP_PHQ sum ≥10 | Diabetes | HE_glu≥126 or HbA1c≥6.5 or DE1_dg=1 | age,sex,edu,income,smoking,alcohol,obesity,CVD | — |
| 2 | Obesity (BMI≥25) | HE_obe ≥4 | Diabetes | same | age,sex,edu,income,smoking,alcohol,depression,CVD | obesity removed from covariates |
| ... | | | | | | |
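The covariate-overlap check can be sketched as a small filter that drops the exposure from the adjustment set and records a note for the matrix. The covariate names mirror the example rows above; the full list and the function name `adjust_covariates` are assumptions for illustration.

```python
# Assumed full covariate set, inferred from the two example rows above.
STANDARD_COVARIATES = [
    "age", "sex", "edu", "income", "smoking",
    "alcohol", "obesity", "depression", "CVD",
]

def adjust_covariates(exposure, covariates=STANDARD_COVARIATES):
    """Drop the exposure from the covariate list (no self-adjustment).

    Returns the kept covariates and a note for the Notes column.
    """
    kept = [c for c in covariates if c != exposure]
    note = f"{exposure} removed from covariates" if len(kept) < len(covariates) else ""
    return kept, note

kept, note = adjust_covariates("obesity")
row_covariates = ",".join(kept)
```

Related variables that are not literal name matches (for example BMI as exposure with obesity as a covariate) would need an explicit mapping, which this sketch does not attempt.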
Phase 3: Batch Code Generation
For each combination in the matrix:
- Clone the template script.
- Replace slot variables with the combination-specific values.
- Adjust covariates: Remove exposure variable from covariate list if present.
- Set output paths: Each combination gets its own results subdirectory.
- Generate a master runner script (run_all.R or run_all.sh) that:
  - Executes all N scripts sequentially (or in parallel via future/parallel)
  - Captures errors per script without stopping the batch
  - Logs execution time per analysis
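The clone-and-replace steps above might look roughly like the following. This is a sketch under stated assumptions: slots are written as `{{NAME}}` in the template, file names follow the numbered pattern used in the output tree, and `render_script` and `write_batch` are hypothetical helper names. The one-line R model in the demo is illustrative, not the validated template.

```python
import re
import tempfile
from pathlib import Path

UNFILLED = re.compile(r"\{\{[A-Z_]+\}\}")

def render_script(template_text, slots):
    """Fill every {{SLOT}} placeholder with its combination-specific value."""
    out = template_text
    for name, value in slots.items():
        out = out.replace("{{" + name + "}}", value)
    leftover = UNFILLED.findall(out)
    if leftover:
        raise ValueError(f"unfilled slots: {sorted(set(leftover))}")
    return out

def write_batch(template_text, combos, scripts_dir):
    """Write one numbered script per combination and return the paths."""
    scripts_dir = Path(scripts_dir)
    scripts_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i, slots in enumerate(combos, start=1):
        stem = f"{i:02d}_{slots['EXPOSURE_LABEL'].lower()}_{slots['OUTCOME_LABEL'].lower()}"
        path = scripts_dir / f"{stem}.R"
        path.write_text(render_script(template_text, slots), encoding="utf-8")
        written.append(path)
    return written

# Illustrative one-line template; a real template would be a full script.
template = "fit <- svyglm({{OUTCOME_VAR}} ~ {{EXPOSURE_VAR}}, design = des)\n"
combos = [
    {"EXPOSURE_VAR": "depression", "EXPOSURE_LABEL": "Depression",
     "OUTCOME_VAR": "diabetes", "OUTCOME_LABEL": "Diabetes"},
    {"EXPOSURE_VAR": "obesity", "EXPOSURE_LABEL": "Obesity",
     "OUTCOME_VAR": "diabetes", "OUTCOME_LABEL": "Diabetes"},
]
with tempfile.TemporaryDirectory() as tmp:
    paths = write_batch(template, combos, Path(tmp) / "scripts")
    names = [p.name for p in paths]
    first = paths[0].read_text(encoding="utf-8")
```

Raising on unfilled placeholders keeps a half-rendered script from reaching the batch run.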
Phase 4: Batch Execution (if execute or full mode)
- Run the master script.
- Collect results from each combination's output directory.
- Handle failures gracefully:
  - Log which combinations failed and why
  - Common failures: convergence issues, too few events, empty subgroups
  - Suggest fixes for failed combinations
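One way to run the batch without letting a single failure stop it is to launch each script as a subprocess and record status, timing, and the tail of stderr. The sketch below assumes each analysis is started from a command line (in practice something like `["Rscript", "scripts/01_depression_diabetes.R"]`); the demo substitutes short Python commands so it runs anywhere, and `run_batch` is a hypothetical name.

```python
import subprocess
import sys
import time

def run_batch(commands, timeout=3600):
    """Run each (label, argv) pair in turn; never abort on a failure."""
    results = []
    for label, argv in commands:
        start = time.perf_counter()
        try:
            proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
            status = "ok" if proc.returncode == 0 else "failed"
            # Keep only the tail of stderr for the failure log.
            error = proc.stderr.strip()[-500:] if status == "failed" else ""
        except (OSError, subprocess.TimeoutExpired) as exc:
            status, error = "failed", str(exc)
        results.append({
            "combination": label,
            "status": status,
            "seconds": round(time.perf_counter() - start, 2),
            "error": error,
        })
    return results

# Stand-in commands: the second one exits non-zero with a message on stderr.
results = run_batch([
    ("01_depression_diabetes", [sys.executable, "-c", "print('done')"]),
    ("02_obesity_diabetes", [sys.executable, "-c", "import sys; sys.exit('too few events')"]),
])
failed = [r for r in results if r["status"] == "failed"]
```

The `failed` list maps directly onto the failure log, with one row per combination and its error text.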
Phase 5: Summary Matrix
Aggregate all results into a single summary:
Main Results Matrix (summary_matrix.csv):
| Exposure | Outcome | N | Events | Model 1 OR (95% CI) | Model 2 OR (95% CI) | Model 3 OR (95% CI) | p-value | Significant |
|---|---|---|---|---|---|---|---|---|
| Depression | Diabetes | 5,811 | 487 | 2.14 (1.52–3.01) | 1.89 (1.33–2.69) | 1.36 (0.91–2.05) | 0.137 | No |
| Obesity | Diabetes | 5,811 | 487 | 3.45 (2.71–4.39) | 3.38 (2.65–4.32) | 3.12 (2.42–4.02) | <0.001 | Yes |
| ... |
Subgroup Summary (subgroup_matrix.csv): Same format, stratified by subgroup variables.
Heatmap (optional): Visual matrix of effect sizes × significance, exposure on Y-axis, outcome on X-axis.
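The aggregation step, together with the Bonferroni column that the Critical Rules call for, can be sketched as follows. The input rows, their numeric p-values, and the column names `significant` and `bonferroni_significant` are hypothetical and chosen only for illustration; real rows would be read from each combination's main_results.csv.

```python
import csv
import io

def build_summary(rows, alpha=0.05):
    """Add nominal and Bonferroni-corrected significance flags to each row."""
    n_tests = len(rows)
    threshold = alpha / n_tests if n_tests else alpha
    summary = []
    for r in rows:
        p = float(r["p_value"])
        summary.append({
            **r,
            "significant": "Yes" if p < alpha else "No",
            "bonferroni_significant": "Yes" if p < threshold else "No",
        })
    return summary, threshold

# Hypothetical per-combination results (made-up p-values).
rows = [
    {"exposure": "Depression", "outcome": "Diabetes", "p_value": "0.137"},
    {"exposure": "Obesity", "outcome": "Diabetes", "p_value": "0.0004"},
    {"exposure": "Smoking", "outcome": "Diabetes", "p_value": "0.03"},
]
summary, threshold = build_summary(rows)

# Serialise to CSV text; writing to summary_matrix.csv works the same way.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(summary[0].keys()))
writer.writeheader()
writer.writerows(summary)
csv_text = buffer.getvalue()
```

In this toy input the third row passes the nominal 0.05 threshold but not the corrected one, which is the kind of result the exploratory framing is meant to flag.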
Output Files
{working_dir}/batch_{timestamp}/
├── README.md — Batch run summary (N combinations, template used, date)
├── combination_matrix.csv — All exposure/outcome specs with coding
├── template/
│ └── base_template.R — The validated template (frozen copy)
├── scripts/
│ ├── 01_depression_diabetes.R
│ ├── 02_obesity_diabetes.R
│ ├── ...
│ └── run_all.R — Master execution script
├── results/
│ ├── 01_depression_diabetes/
│ │ ├── table1.csv
│ │ ├── main_results.csv
│ │ └── subgroup_results.csv
│ ├── 02_obesity_diabetes/
│ │ └── ...
│ └── ...
├── summary/
│ ├── summary_matrix.csv — Main results across all combinations
│ ├── subgroup_matrix.csv — Subgroup results across all combinations
│ ├── failed_runs.csv — Combinations that failed + error messages
│ └── heatmap.png — Optional effect size × significance visual
└── logs/
└── batch_execution.log — Timing + error log
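For the README entry, the data-file hash required by the reproducibility rule can be computed with the standard library. This is a minimal sketch: the temporary file stands in for a real cohort file, and the README line format is an assumption.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Stand-in data file; in practice this would be the cohort CSV/SAS file.
with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "cohort.csv"
    data_path.write_bytes(b"id,exposure,outcome\n1,0,1\n")
    data_hash = sha256_of_file(data_path)

readme_line = f"Data file SHA256: {data_hash}"
```

Recomputing the hash before a rerun and comparing it with the README value is a cheap check that the batch is using the same data.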
Critical Rules
- Never modify the core methodology across combinations — only swap exposure/outcome/covariates.
- Remove self-adjustment: If exposure = BMI, remove obesity from covariates. If exposure = education/income, remove the same variable from covariates. If outcome = MetS, consider removing obesity from covariates. Document all removals.
- Weighted analysis mandatory for KNHANES/NHANES/NHIS — inherited from template.
- Event count check: Before running, verify each outcome has ≥10 events per covariate (EPV rule). Flag underpowered combinations.
- Multiple comparisons: When generating >5 combinations, include a Bonferroni-corrected significance column in the summary matrix. Add a note about exploratory vs confirmatory framing.
- Reproducibility: Freeze the template version. Include a SHA256 hash of the data file in README.
- No p-hacking framing: The summary matrix is for hypothesis generation, not confirmation. State this explicitly in README and any manuscript output.
- Outcome definitions MUST include physician diagnosis: Diabetes = FPG≥126 OR HbA1c≥6.5 OR physician-diagnosed (KNHANES: DE1_dg=1, NHANES: DIQ010="Yes"). Hypertension = SBP≥140 OR DBP≥90 OR physician-diagnosed (KNHANES: DI1_dg=1, NHANES: BPQ020="Yes"). Lab-only definitions systematically overestimate exposure→outcome associations (validated: Joo 2026 replication showed US depression→DM wOR 1.92 without vs 1.54 with physician dx).
- Full covariate set is default: Alw