BioGeoBEARS Biogeographic Analysis
Overview
BioGeoBEARS (BioGeography with Bayesian and Likelihood Evolutionary Analysis in R Scripts) performs probabilistic inference of ancestral geographic ranges on phylogenetic trees. This skill helps set up complete biogeographic analyses by:
- Validating and reformatting input files (phylogenetic tree and geographic distribution data)
- Generating organized analysis folder structure
- Creating customized RMarkdown analysis scripts
- Guiding users through parameter selection and model choices
- Producing publication-ready visualizations
When to Use This Skill
Use this skill when users request:
- "Analyze biogeography on my phylogeny"
- "Reconstruct ancestral ranges for my species"
- "Run BioGeoBEARS analysis"
- "Which areas did my ancestors occupy?"
- "Test biogeographic models (DEC, DIVALIKE, BAYAREALIKE)"
The skill triggers when users mention phylogenetic biogeography, ancestral area reconstruction, or provide tree + distribution data.
Required Inputs
Users must provide:
- Phylogenetic tree (Newick format; .nwk, .tre, or .tree file)
  - Must be rooted
  - Tip labels will be matched to geography file
  - Branch lengths required
- Geographic distribution data (any tabular format)
  - Species names (matching tree tips)
  - Presence/absence data for different geographic areas
  - Can be CSV, TSV, Excel, or already in PHYLIP format
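For orientation, the sketch below builds the text of a PHYLIP-style geography file from a small dictionary. It assumes the layout used by the BioGeoBEARS example data: a header line with the number of taxa, a tab, the number of areas, and the area letters in parentheses, then one line per species with a tab and a 0/1 string. The function name, species names, and areas are made up for illustration.

```python
def format_geography(ranges, areas):
    """Return PHYLIP-style geography text (assumed layout, see lead-in).

    ranges: dict mapping species name -> set of occupied area letters
    areas:  ordered list of single-letter area codes
    """
    # Header: number of taxa, tab, number of areas, area letters in parentheses
    header = f"{len(ranges)}\t{len(areas)} ({' '.join(areas)})"
    lines = [header]
    for species, occupied in ranges.items():
        # One character per area, in the same order as the header
        code = "".join("1" if area in occupied else "0" for area in areas)
        lines.append(f"{species}\t{code}")
    return "\n".join(lines) + "\n"


# Hypothetical species and areas
example = format_geography(
    {"Genus_alpha": {"A"}, "Genus_beta": {"A", "C"}, "Genus_gamma": {"B", "C"}},
    areas=["A", "B", "C"],
)
print(example)
```

Each presence/absence string has exactly as many characters as there are areas, which is what the validation step later checks.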
Workflow
Step 1: Gather Information
When a user requests a BioGeoBEARS analysis, ask for:
- Input file paths:
  - "What is the path to your phylogenetic tree file?"
  - "What is the path to your geographic distribution file?"
- Analysis parameters (if not specified):
  - Maximum range size (how many areas can a species occupy simultaneously?)
  - Which models to compare (default: all six: DEC, DEC+J, DIVALIKE, DIVALIKE+J, BAYAREALIKE, BAYAREALIKE+J)
  - Output directory name (default: "biogeobears_analysis")
Use the AskUserQuestion tool to gather this information efficiently:
Example questions:
- "Maximum range size" - options based on number of areas (e.g., for 4 areas: "All 4 areas", "3 areas", "2 areas")
- "Models to compare" - options: "All 6 models (recommended)", "Only base models (DEC, DIVALIKE, BAYAREALIKE)", "Only +J models", "Custom selection"
- "Visualization type" - options: "Pie charts (show probabilities)", "Text labels (show most likely states)", "Both"
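When offering maximum range size options, it can help to show how the choice affects the size of the state space, since each possible range is one state in the model. The sketch below counts the ranges of up to `max_range_size` areas out of `n_areas`, optionally including the empty (null) range. The function name is hypothetical, and whether the null range is included is treated here as an assumption about the analysis settings.

```python
from math import comb


def count_states(n_areas, max_range_size, include_null_range=True):
    """Count geographic ranges containing at most max_range_size areas."""
    start = 0 if include_null_range else 1
    # Sum of "n choose k" over every allowed range size k
    return sum(comb(n_areas, k) for k in range(start, max_range_size + 1))


for limit in (2, 3, 4):
    print(f"4 areas, max range size {limit}: {count_states(4, limit)} states")
```

The count grows quickly with the number of areas, so a smaller maximum range size is one way to keep larger analyses tractable.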
Step 2: Validate and Prepare Input Files
Validate Tree File
Use the Read tool to check the tree file:
# In R, basic validation:
library(ape)
tr <- read.tree("path/to/tree.nwk")
print(paste("Tips:", length(tr$tip.label)))
print(paste("Rooted:", is.rooted(tr)))
print(paste("Has branch lengths:", !is.null(tr$edge.length)))  # Branch lengths are required
print(tr$tip.label)  # Check species names
Verify:
- File can be parsed as Newick
- Tree is rooted (if not, ask user which outgroup to use)
- Note the tip labels for geography file validation
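If R is not at hand, a rough pre-check of the Newick text can be sketched in Python. This is a simplified illustration, not a full Newick parser: it assumes unquoted labels and no bracketed comments. It treats the tree as rooted when exactly two subtrees attach to the root, which mirrors one of the conditions `is.rooted()` uses. Both function names are hypothetical.

```python
import re


def tip_labels(newick):
    """Tip labels follow '(' or ','; internal node labels follow ')'."""
    return re.findall(r"[(,]\s*([^\s(),:;]+)", newick)


def root_child_count(newick):
    """Count the subtrees attached directly to the root."""
    depth, children = 0, 0
    for ch in newick:
        if ch == "(":
            depth += 1
            if depth == 1:
                children = 1  # the first subtree starts here
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 1:
            children += 1  # each top-level comma starts another subtree
    return children


tree = "((sp_one:1,sp_two:1):1,sp_three:2);"
print(tip_labels(tree))
print("Looks rooted:", root_child_count(tree) == 2)
```

The R check remains the reference; this only gives an early list of tip names to compare against the geography file.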
Validate and Reformat Geography File
Use scripts/validate_geography_file.py to validate or reformat the geography file.
If the file is already in PHYLIP format (its first line starts with numbers):
python scripts/validate_geography_file.py path/to/geography.txt --validate --tree path/to/tree.nwk
This checks:
- Correct tab delimiters
- Species names match tree tips
- Binary codes are correct length
- No spaces in species names or binary codes
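The checks listed above can be sketched as follows. This is an illustration of the logic only, not the contents of `scripts/validate_geography_file.py`. The function name and message wording are hypothetical, and the header is assumed to hold the number of taxa followed by the number of areas.

```python
def check_geography(text, tree_tips):
    """Return a list of problems found in PHYLIP-style geography text."""
    problems = []
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_areas = int(lines[0].split()[1])  # header: n_taxa, n_areas, (area letters)
    seen = set()
    for ln in lines[1:]:
        parts = ln.split("\t")
        if len(parts) != 2:
            problems.append(f"expected exactly one tab: {ln!r}")
            continue
        name, code = parts
        seen.add(name)
        if " " in name or " " in code:
            problems.append(f"space in name or code: {ln!r}")
        if len(code) != n_areas or set(code) - {"0", "1"}:
            problems.append(f"bad range code for {name}: {code!r}")
    # Compare species names against the tree tips in both directions
    for tip in sorted(set(tree_tips) - seen):
        problems.append(f"tip missing from geography file: {tip}")
    for name in sorted(seen - set(tree_tips)):
        problems.append(f"species not in tree: {name}")
    return problems


good = "2\t2 (A B)\nsp_one\t10\nsp_two\t01\n"
print(check_geography(good, ["sp_one", "sp_two"]))
```

Collecting every problem instead of stopping at the first one lets the user fix the file in a single pass.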
If the file is in CSV/TSV format (needs reformatting):
python scripts/validate_geography_file.py path/to/distribution.csv --reformat -o geography.data --delimiter ","
Or for tab-delimited:
python scripts/validate_geography_file.py path/to/distribution.txt --reformat -o geography.data --delimiter tab
The script will:
- Detect area names from header row
- Convert presence/absence data to binary (handles "1", "present", "TRUE", etc.)
- Remove spaces from species names (replace with underscores)
- Create properly formatted PHYLIP file
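As a rough picture of the conversion, the sketch below turns CSV text into PHYLIP-style geography text. It is not the script's actual implementation: the function name is hypothetical, the set of values treated as "present" is an assumption, and the first column is assumed to hold species names with area names in the remaining header cells.

```python
import csv
import io

# Assumed set of values that count as presence (compared case-insensitively)
TRUTHY = {"1", "present", "true", "yes"}


def csv_to_geography(csv_text, delimiter=","):
    """Convert a presence/absence table to PHYLIP-style geography text."""
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    areas = [cell.strip() for cell in rows[0][1:]]  # area names from the header row
    body = []
    for row in rows[1:]:
        if not row or not row[0].strip():
            continue  # skip blank lines
        name = "_".join(row[0].split())  # replace spaces with underscores
        code = "".join(
            "1" if cell.strip().lower() in TRUTHY else "0" for cell in row[1:]
        )
        body.append(f"{name}\t{code}")
    header = f"{len(body)}\t{len(areas)} ({' '.join(areas)})"
    return "\n".join([header] + body) + "\n"


table = "species,A,B,C\nGenus alpha,1,0,present\nGenus beta,TRUE,absent,0\n"
print(csv_to_geography(table))
```

Any value outside the presence set is coded as absence here, so unexpected spellings are worth reviewing before trusting the output.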
Always validate the reformatted file before proceeding:
python scripts/validate_geography_file.py geography.data --validate --tree path/to/tree.nwk
Step 3: Set Up Analysis Folder Structure
Create an organized directory for the analysis:
biogeobears_analysis/
├── input/
│ ├── tree.nwk # Original or copied tree
│ ├── geography.data # Validated/reformatted geography file
│ └── original_data/ # Original input files
│ ├── original_tree.nwk
│ └── original_distribution.csv
├── scripts/
│ └── run_biogeobears.Rmd # Generated RMarkdown script
├── results/ # Created by analysis (output directory)
│ ├── [MODEL]_result.Rdata # Saved model results
│ └── plots/ # Visualization outputs
│ ├── [MODEL]_pie.pdf
│ └── [MODEL]_text.pdf
└── README.md # Analysis documentation
Create this structure programmatically:
mkdir -p biogeobears_analysis/input/original_data
mkdir -p biogeobears_analysis/scripts
mkdir -p biogeobears_analysis/results/plots
# Copy files
cp path/to/tree.nwk biogeobears_analysis/input/
cp geography.data biogeobears_analysis/input/
cp original_files biogeobears_analysis/input/original_data/
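The same layout can also be created from Python, which may be convenient when the file paths are already held in variables. This is a minimal sketch using only the standard library; the function name and its arguments are hypothetical.

```python
import shutil
from pathlib import Path


def set_up_analysis(root, tree_file, geog_file, originals=()):
    """Create the analysis folders and copy the input files into them."""
    root = Path(root)
    for sub in ("input/original_data", "scripts", "results/plots"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    # Working copies used by the analysis
    shutil.copy2(tree_file, root / "input" / Path(tree_file).name)
    shutil.copy2(geog_file, root / "input" / Path(geog_file).name)
    # Untouched originals, kept for reference
    for original in originals:
        shutil.copy2(original, root / "input" / "original_data" / Path(original).name)
    return root


# Example call (paths are placeholders):
# set_up_analysis("biogeobears_analysis", "path/to/tree.nwk", "geography.data")
```

Because `exist_ok=True` is set, rerunning the function on an existing directory adds missing folders without raising an error.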
Step 4: Generate RMarkdown Analysis Script
Use the template at scripts/biogeobears_analysis_template.Rmd and customize it with user parameters.
Copy and customize the template:
cp scripts/biogeobears_analysis_template.Rmd biogeobears_analysis/scripts/run_biogeobears.Rmd
Either edit the YAML header in the Rmd to match the user's settings, or pass the settings as params when rendering.
Example customization via R code:
# Edit YAML parameters programmatically or provide as params when rendering
rmarkdown::render(
  "biogeobears_analysis/scripts/run_biogeobears.Rmd",
  params = list(
    tree_file = "../input/tree.nwk",
    geog_file = "../input/geography.data",
    max_range_size = 4,
    models = "DEC,DEC+J,DIVALIKE,DIVALIKE+J,BAYAREALIKE,BAYAREALIKE+J",
    output_dir = "../results"
  ),
  output_file = "../results/biogeobears_report.html"
)
Or create a run script:
#!/bin/bash
# biogeobears_analysis/run_analysis.sh
# Run from scripts/ so the relative paths in params resolve correctly
cd "$(dirname "$0")/scripts" || exit 1
R -e "rmarkdown::render('run_biogeobears.Rmd', params = list(
  tree_file = '../input/tree.nwk',
  geog_file = '../input/geography.data',
  max_range_size = 4,
  models = 'DEC,DEC+J,DIVALIKE,DIVALIKE+J,BAYAREALIKE,BAYAREALIKE+J',
  output_dir = '../results'
), output_file = '../results/biogeobears_report.html')"
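When the user's parameters are collected programmatically, the render call can be assembled rather than hand-edited. The sketch below builds an `Rscript -e` command from a dictionary of parameters. The function name is hypothetical, and the parameter names are assumed to match those declared in the template's YAML header.

```python
import shlex


def render_command(params, rmd="run_biogeobears.Rmd",
                   output_file="../results/biogeobears_report.html"):
    """Build an Rscript command that renders the Rmd with the given params."""

    def r_literal(value):
        # Convert a Python value to R source text
        if isinstance(value, bool):
            return "TRUE" if value else "FALSE"
        if isinstance(value, (int, float)):
            return str(value)
        escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
        return f"'{escaped}'"

    args = ", ".join(f"{key} = {r_literal(val)}" for key, val in params.items())
    expr = (
        f"rmarkdown::render('{rmd}', params = list({args}), "
        f"output_file = '{output_file}')"
    )
    return ["Rscript", "-e", expr]


cmd = render_command({
    "tree_file": "../input/tree.nwk",
    "geog_file": "../input/geography.data",
    "max_range_size": 4,
    "models": "DEC,DEC+J",
    "output_dir": "../results",
})
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, cwd="biogeobears_analysis/scripts")` so that the relative paths resolve the same way as in the shell script.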
Step 5: Create README Documentation
Generate a README.md in the analysis directory explaining:
- What files are present
- How to run the analysis
- What parameters were used
- How to interpret results
Example:
# BioGeoBEARS Analysis
## Overview
Biogeographic analysis of [NUMBER] species across [NUMBER] geographic areas.
## Input Data
- **Tree**: `input/tree.nwk` ([NUMBER] tips)
- **Geography**: `input/geography.data` ([NUMBER] species × [NUMBER] areas)
- **Areas**: [A, B, C, ...]
## Parameters
- Maximum range size: [NUMBER]
- Models tested: [LIST]
## Running the Analysis
### Option 1: Using RMarkdown directly
```r
library(rmarkdown)
render("scripts/run_biogeobears.Rmd",
       output_file = "../results/biogeobears_report.html")
```
### Option 2: Using the run script
    bash run_analysis.sh
## Outputs
Results will be saved in `results/`:
- `biogeobears_report.html` - Full analysis report with visualizations
- `[MODEL]_result.Rdata` - Saved R objects for each model
- `plots/[MODEL]_pie.pdf` - Ancestral range reconstructions (pie charts)
- `plots/[MODEL]_text.pdf` - Ancestral range reconstructions (text labels)
## Interpreting Results
The HTML report includes:
- Model Comparison - AIC scores, AIC weights, best-fit model
- Parameter Estimates - Dispersal (d), extinction (e), founder-event (j) rates
- Likelihood Ratio Tests - Statistical comparisons of nested models