BioGeoBEARS Biogeographic Analysis

Overview

BioGeoBEARS (BioGeography with Bayesian and Likelihood Evolutionary Analysis in R Scripts) performs probabilistic inference of ancestral geographic ranges on phylogenetic trees. This skill helps set up complete biogeographic analyses by:

Validating and reformatting input files (phylogenetic tree and geographic distribution data)
Generating organized analysis folder structure
Creating customized RMarkdown analysis scripts
Guiding users through parameter selection and model choices
Producing publication-ready visualizations

When to Use This Skill

Use this skill when users request:

"Analyze biogeography on my phylogeny"
"Reconstruct ancestral ranges for my species"
"Run BioGeoBEARS analysis"
"Which areas did my ancestors occupy?"
"Test biogeographic models (DEC, DIVALIKE, BAYAREALIKE)"

The skill triggers when users mention phylogenetic biogeography or ancestral area reconstruction, or when they provide a tree together with distribution data.

Required Inputs

Users must provide:

Phylogenetic tree (Newick format, .nwk, .tre, or .tree file)
- Must be rooted
- Tip labels will be matched to geography file
- Branch lengths required
Geographic distribution data (any tabular format)
- Species names (matching tree tips)
- Presence/absence data for different geographic areas
- Can be CSV, TSV, Excel, or already in PHYLIP format

Workflow

Step 1: Gather Information

When a user requests a BioGeoBEARS analysis, ask for:

Input file paths:
- "What is the path to your phylogenetic tree file?"
- "What is the path to your geographic distribution file?"
Analysis parameters (if not specified):
- Maximum range size (how many areas can a species occupy simultaneously?)
- Which models to compare (default: all six - DEC, DEC+J, DIVALIKE, DIVALIKE+J, BAYAREALIKE, BAYAREALIKE+J)
- Output directory name (default: "biogeobears_analysis")

Use the AskUserQuestion tool to gather this information efficiently:

Example questions:
- "Maximum range size" - options based on number of areas (e.g., for 4 areas: "All 4 areas", "3 areas", "2 areas")
- "Models to compare" - options: "All 6 models (recommended)", "Only base models (DEC, DIVALIKE, BAYAREALIKE)", "Only +J models", "Custom selection"
- "Visualization type" - options: "Pie charts (show probabilities)", "Text labels (show most likely states)", "Both"
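When offering maximum range size options, it can help to show the user how the choice affects the size of the state space, since every combination of areas up to that size becomes a possible range. The sketch below is a minimal illustration, assuming the empty (null) range is counted as a state; `count_ranges` is a hypothetical helper, not part of BioGeoBEARS or of this skill's scripts.

```python
from math import comb


def count_ranges(n_areas, max_range_size):
    """Number of possible geographic ranges (states), including the empty range.

    Assumption: the null range counts as one state, so the total is the sum of
    C(n_areas, k) for k = 0 .. max_range_size.
    """
    if not 1 <= max_range_size <= n_areas:
        raise ValueError("max_range_size must be between 1 and n_areas")
    return sum(comb(n_areas, k) for k in range(max_range_size + 1))


# Compare the options offered for a 4-area analysis
for max_size in (2, 3, 4):
    print(f"4 areas, max range size {max_size}: {count_ranges(4, max_size)} states")
```

Larger state spaces make the likelihood calculations slower, so a smaller maximum range size is worth suggesting when no species occupies many areas at once.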

Step 2: Validate and Prepare Input Files

Validate Tree File

Use the Read tool to inspect the tree file, then run a basic validation in R:

# In R, basic validation:
library(ape)
tr <- read.tree("path/to/tree.nwk")  # Errors here mean the file is not valid Newick
print(paste("Tips:", length(tr$tip.label)))
print(paste("Rooted:", is.rooted(tr)))
print(paste("Has branch lengths:", !is.null(tr$edge.length)))
print(tr$tip.label)  # Check species names

Verify:

File can be parsed as Newick
Tree is rooted (if not, ask user which outgroup to use)
Branch lengths are present
Note the tip labels for geography file validation
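If R is not yet available, a rough pre-check of the same points can be done directly on the Newick text. The sketch below assumes a simple Newick string with unquoted labels and no bracketed comments; `newick_tip_labels` and `root_child_count` are hypothetical helpers, and the R check with ape remains the authoritative one.

```python
import re


def newick_tip_labels(newick):
    """Return tip labels from a simple, unquoted Newick string.

    A tip label is a name that directly follows "(" or ","; internal node
    labels follow ")" and are therefore skipped.
    """
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)


def root_child_count(newick):
    """Count the root's children by counting commas at parenthesis depth 1."""
    depth = 0
    children = 1
    for ch in newick:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 1:
            children += 1
    return children


tree_text = "((A:1,B:1):1,C:2);"
print("Tips:", newick_tip_labels(tree_text))
# A root with exactly two children is the usual sign of a rooted tree
print("Root children:", root_child_count(tree_text))
```

The extracted tip labels can then be compared with the species names in the geography file.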

Validate and Reformat Geography File

Use scripts/validate_geography_file.py to validate or reformat the geography file.

If the file is already in PHYLIP format (the first line starts with numbers):

python scripts/validate_geography_file.py path/to/geography.txt --validate --tree path/to/tree.nwk

This checks:

Correct tab delimiters
Species names match tree tips
Binary codes are correct length
No spaces in species names or binary codes
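To make these checks concrete, here is a minimal sketch of what such a validation might look like. It is not the code of `scripts/validate_geography_file.py`; `check_geography_text` is a hypothetical function, and the file layout is an assumption based on the BioGeoBEARS example files: a first line with the species count, the area count, and the area names in parentheses, followed by one `name<TAB>binary code` line per species.

```python
import re


def check_geography_text(text, tree_tips=None):
    """Return a list of problems found in a PHYLIP-style geography file."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return ["file is empty"]
    header = re.match(r"^\s*(\d+)\s+(\d+)", lines[0])
    if header is None:
        return ["header must start with <n_species> <n_areas>"]
    n_species, n_areas = int(header.group(1)), int(header.group(2))

    problems = []
    names = []
    for lineno, line in enumerate(lines[1:], start=2):
        fields = line.split("\t")
        if len(fields) != 2:
            problems.append(f"line {lineno}: expected name<TAB>code")
            continue
        name, code = fields
        if " " in name or " " in code:
            problems.append(f"line {lineno}: spaces in name or code")
        if len(code) != n_areas or set(code) - {"0", "1"}:
            problems.append(f"line {lineno}: code must be {n_areas} digits of 0/1")
        names.append(name)

    if len(names) != n_species:
        problems.append(f"header says {n_species} species, found {len(names)}")
    if tree_tips is not None:
        # Names present on only one side (tree or geography file)
        for name in sorted(set(tree_tips) ^ set(names)):
            problems.append(f"'{name}' is not in both tree and geography file")
    return problems


example = "3\t2 (A B)\nsp1\t10\nsp2\t01\nsp3\t11\n"
print(check_geography_text(example, tree_tips=["sp1", "sp2", "sp3"]))
```

Returning a list of messages instead of raising on the first error lets all problems be reported to the user in one pass.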

If the file is in CSV/TSV format (needs reformatting):

python scripts/validate_geography_file.py path/to/distribution.csv --reformat -o geography.data --delimiter ","

Or for tab-delimited:

python scripts/validate_geography_file.py path/to/distribution.txt --reformat -o geography.data --delimiter tab

The script will:

Detect area names from header row
Convert presence/absence data to binary (handles "1", "present", "TRUE", etc.)
Remove spaces from species names (replace with underscores)
Create properly formatted PHYLIP file
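The conversion steps listed above can be sketched as follows. This is an illustration only, not the implementation in `scripts/validate_geography_file.py`: `table_to_phylip` and the `TRUTHY` set are hypothetical, the input is assumed to have a header row with the species column first, and the output header layout follows the BioGeoBEARS example files.

```python
import csv
import io

# Assumed spellings that count as "present"; anything else becomes 0
TRUTHY = {"1", "present", "true", "yes"}


def table_to_phylip(table_text, delimiter=","):
    """Convert a presence/absence table to PHYLIP-style geography text."""
    rows = list(csv.reader(io.StringIO(table_text), delimiter=delimiter))
    rows = [r for r in rows if any(cell.strip() for cell in r)]  # drop blank rows
    areas = [a.strip() for a in rows[0][1:]]  # area names come from the header row

    out = []
    for row in rows[1:]:
        if len(row) != len(rows[0]):
            raise ValueError(f"row for {row[0]!r} has the wrong number of columns")
        name = "_".join(row[0].split())  # replace spaces with underscores
        code = "".join("1" if c.strip().lower() in TRUTHY else "0" for c in row[1:])
        out.append(f"{name}\t{code}")

    header = f"{len(out)}\t{len(areas)} ({' '.join(areas)})"
    return "\n".join([header] + out) + "\n"


table = "species,A,B\nHomo sapiens,1,0\nPan troglodytes,present,TRUE\n"
print(table_to_phylip(table), end="")
```

For tab-delimited input, the same function would be called with `delimiter="\t"`.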

Always validate the reformatted file before proceeding:

python scripts/validate_geography_file.py geography.data --validate --tree path/to/tree.nwk

Step 3: Set Up Analysis Folder Structure

Create an organized directory for the analysis:

biogeobears_analysis/
├── input/
│   ├── tree.nwk                 # Original or copied tree
│   ├── geography.data            # Validated/reformatted geography file
│   └── original_data/            # Original input files
│       ├── original_tree.nwk
│       └── original_distribution.csv
├── scripts/
│   └── run_biogeobears.Rmd       # Generated RMarkdown script
├── results/                      # Created by analysis (output directory)
│   ├── [MODEL]_result.Rdata      # Saved model results
│   └── plots/                    # Visualization outputs
│       ├── [MODEL]_pie.pdf
│       └── [MODEL]_text.pdf
└── README.md                     # Analysis documentation

Create this structure programmatically:

mkdir -p biogeobears_analysis/input/original_data
mkdir -p biogeobears_analysis/scripts
mkdir -p biogeobears_analysis/results/plots

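# Copy the analysis-ready files
cp path/to/tree.nwk biogeobears_analysis/input/
cp geography.data biogeobears_analysis/input/

# Keep untouched copies of the user's original inputs
cp path/to/tree.nwk biogeobears_analysis/input/original_data/original_tree.nwk
cp path/to/distribution.csv biogeobears_analysis/input/original_data/original_distribution.csv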

Step 4: Generate RMarkdown Analysis Script

Use the template at scripts/biogeobears_analysis_template.Rmd and customize it with user parameters.

Copy and customize the template:

cp scripts/biogeobears_analysis_template.Rmd biogeobears_analysis/scripts/run_biogeobears.Rmd

Either modify the YAML header in the Rmd or pass the user's specific settings as params when rendering.

Example customization via R code:

# Edit YAML parameters programmatically or provide as params when rendering
rmarkdown::render(
  "biogeobears_analysis/scripts/run_biogeobears.Rmd",
  params = list(
    tree_file = "../input/tree.nwk",
    geog_file = "../input/geography.data",
    max_range_size = 4,
    models = "DEC,DEC+J,DIVALIKE,DIVALIKE+J,BAYAREALIKE,BAYAREALIKE+J",
    output_dir = "../results"
  ),
  output_file = "../results/biogeobears_report.html"
)

Or create a run script:

#!/bin/bash
# biogeobears_analysis/run_analysis.sh
cd "$(dirname "$0")/scripts"

R -e "rmarkdown::render('run_biogeobears.Rmd', params = list(
  tree_file = '../input/tree.nwk',
  geog_file = '../input/geography.data',
  max_range_size = 4,
  models = 'DEC,DEC+J,DIVALIKE,DIVALIKE+J,BAYAREALIKE,BAYAREALIKE+J',
  output_dir = '../results'
), output_file = '../results/biogeobears_report.html')"

Step 5: Create README Documentation

Generate a README.md in the analysis directory explaining:

What files are present
How to run the analysis
What parameters were used
How to interpret results

Example:

# BioGeoBEARS Analysis

## Overview

Biogeographic analysis of [NUMBER] species across [NUMBER] geographic areas.

## Input Data

- **Tree**: `input/tree.nwk` ([NUMBER] tips)
- **Geography**: `input/geography.data` ([NUMBER] species × [NUMBER] areas)
- **Areas**: [A, B, C, ...]

## Parameters

- Maximum range size: [NUMBER]
- Models tested: [LIST]

## Running the Analysis

### Option 1: Using RMarkdown directly

```r
library(rmarkdown)
render("scripts/run_biogeobears.Rmd",
       output_file = "../results/biogeobears_report.html")
```

### Option 2: Using the run script

```bash
bash run_analysis.sh
```

## Outputs

Results will be saved in `results/`:

- `biogeobears_report.html` - Full analysis report with visualizations
- `[MODEL]_result.Rdata` - Saved R objects for each model
- `plots/[MODEL]_pie.pdf` - Ancestral range reconstructions (pie charts)
- `plots/[MODEL]_text.pdf` - Ancestral range reconstructions (text labels)

## Interpreting Results

The HTML report includes:

- **Model Comparison** - AIC scores, AIC weights, best-fit model
- **Parameter Estimates** - Dispersal rate (d), extinction rate (e), and founder-event speciation weight (j)
- **Likelihood Ratio Tests** - Statistical comparisons of nested models
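When explaining the model comparison to users, it may help to show how the reported quantities relate to each model's log-likelihood. The sketch below uses the standard formulas for AIC, Akaike weights, and a one-degree-of-freedom likelihood ratio test. The function names are hypothetical, the log-likelihoods are made-up illustrative numbers rather than results, and the report generated by the template may compute these values differently.

```python
from math import erfc, exp, sqrt


def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 lnL."""
    return 2 * n_params - 2 * log_lik


def akaike_weights(aic_scores):
    """Relative support for each model, normalized to sum to 1."""
    best = min(aic_scores.values())
    rel = {m: exp(-0.5 * (a - best)) for m, a in aic_scores.items()}
    total = sum(rel.values())
    return {m: r / total for m, r in rel.items()}


def lrt_p_value_df1(log_lik_null, log_lik_alt):
    """P-value of a likelihood ratio test with one extra free parameter.

    For a chi-squared distribution with 1 degree of freedom, the upper tail
    probability of x equals erfc(sqrt(x / 2)).
    """
    stat = max(0.0, 2 * (log_lik_alt - log_lik_null))
    return erfc(sqrt(stat / 2))


# Illustrative values only: (log-likelihood, number of free parameters)
fits = {"DEC": (-40.0, 2), "DEC+J": (-35.0, 3)}
scores = {m: aic(lnl, k) for m, (lnl, k) in fits.items()}
weights = akaike_weights(scores)
for model in fits:
    print(f"{model}: AIC = {scores[model]:.1f}, weight = {weights[model]:.3f}")
print(f"LRT p-value (DEC vs DEC+J): {lrt_p_value_df1(-40.0, -35.0):.4f}")
```

The base models have two free parameters (d and e), and each +J variant adds j as a third, which is why a base model and its +J counterpart can be compared with one degree of freedom.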
