PyDESeq2
Overview
PyDESeq2 is a Python implementation of DESeq2 for differential expression analysis with bulk RNA-seq data. Design and execute complete workflows from data loading through result interpretation, including single-factor and multi-factor designs, Wald tests with multiple testing correction, optional apeGLM shrinkage, and integration with pandas and AnnData.
When to Use This Skill
This skill should be used when:
- Analyzing bulk RNA-seq count data for differential expression
- Comparing gene expression between experimental conditions (e.g., treated vs control)
- Performing multi-factor designs accounting for batch effects or covariates
- Converting R-based DESeq2 workflows to Python
- Integrating differential expression analysis into Python-based pipelines
- Users mention "DESeq2", "differential expression", "RNA-seq analysis", or "PyDESeq2"
Quick Start Workflow
For users who want to perform a standard differential expression analysis:
import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
# 1. Load data
counts_df = pd.read_csv("counts.csv", index_col=0).T # Transpose to samples × genes
metadata = pd.read_csv("metadata.csv", index_col=0)
# 2. Filter low-count genes
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]
# 3. Initialize and fit DESeq2
dds = DeseqDataSet(
counts=counts_df,
metadata=metadata,
design="~condition",
refit_cooks=True
)
dds.deseq2()
# 4. Perform statistical testing
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
# 5. Access results
results = ds.results_df
significant = results[results.padj < 0.05]
print(f"Found {len(significant)} significant genes")
Core Workflow Steps
Step 1: Data Preparation
Input requirements:
- Count matrix: Samples × genes DataFrame with non-negative integer read counts
- Metadata: Samples × variables DataFrame with experimental factors
Common data loading patterns:
# From CSV (typical format: genes × samples, needs transpose)
counts_df = pd.read_csv("counts.csv", index_col=0).T
metadata = pd.read_csv("metadata.csv", index_col=0)
# From TSV
counts_df = pd.read_csv("counts.tsv", sep="\t", index_col=0).T
# From AnnData
import anndata as ad
adata = ad.read_h5ad("data.h5ad")
counts_df = pd.DataFrame(adata.X, index=adata.obs_names, columns=adata.var_names)
metadata = adata.obs
Data filtering:
# Remove low-count genes
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]
# Remove samples with missing metadata
samples_to_keep = ~metadata.condition.isna()
counts_df = counts_df.loc[samples_to_keep]
metadata = metadata.loc[samples_to_keep]
Step 2: Design Specification
The design formula specifies how gene expression is modeled.
Single-factor designs:
design = "~condition" # Simple two-group comparison
Multi-factor designs:
design = "~batch + condition" # Control for batch effects
design = "~age + condition" # Include continuous covariate
design = "~group + condition + group:condition" # Interaction effects
Design formula guidelines:
- Use Wilkinson formula notation (R-style)
- Put adjustment variables (e.g., batch) before the main variable of interest
- Ensure variables exist as columns in the metadata DataFrame
- Use appropriate data types (categorical for discrete variables)
Step 3: DESeq2 Fitting
Initialize the DeseqDataSet and run the complete pipeline:
from pydeseq2.dds import DeseqDataSet
dds = DeseqDataSet(
counts=counts_df,
metadata=metadata,
design="~condition",
refit_cooks=True, # Refit after removing outliers
n_cpus=1 # Parallel processing (adjust as needed)
)
# Run the complete DESeq2 pipeline
dds.deseq2()
What deseq2() does:
- Computes size factors (normalization)
- Fits genewise dispersions
- Fits dispersion trend curve
- Computes dispersion priors
- Fits MAP dispersions (shrinkage)
- Fits log fold changes
- Calculates Cook's distances (outlier detection)
- Refits if outliers detected (optional)
Step 4: Statistical Testing
Perform Wald tests to identify differentially expressed genes:
from pydeseq2.ds import DeseqStats
ds = DeseqStats(
dds,
contrast=["condition", "treated", "control"], # Test treated vs control
alpha=0.05, # Significance threshold
cooks_filter=True, # Filter outliers
independent_filter=True # Filter low-power tests
)
ds.summary()
Contrast specification:
- Format:
[variable, test_level, reference_level] - Example:
["condition", "treated", "control"]tests treated vs control - If
None, uses the last coefficient in the design
Result DataFrame columns:
baseMean: Mean normalized count across sampleslog2FoldChange: Log2 fold change between conditionslfcSE: Standard error of LFCstat: Wald test statisticpvalue: Raw p-valuepadj: Adjusted p-value (FDR-corrected via Benjamini-Hochberg)
Step 5: Optional LFC Shrinkage
Apply shrinkage to reduce noise in fold change estimates:
ds.lfc_shrink() # Applies apeGLM shrinkage
When to use LFC shrinkage:
- For visualization (volcano plots, heatmaps)
- For ranking genes by effect size
- When prioritizing genes for follow-up experiments
Important: Shrinkage affects only the log2FoldChange values, not the statistical test results (p-values remain unchanged). Use shrunk values for visualization but report unshrunken p-values for significance.
Step 6: Result Export
Save results and intermediate objects:
import pickle
# Export results as CSV
ds.results_df.to_csv("deseq2_results.csv")
# Save significant genes only
significant = ds.results_df[ds.results_df.padj < 0.05]
significant.to_csv("significant_genes.csv")
# Save DeseqDataSet for later use
with open("dds_result.pkl", "wb") as f:
pickle.dump(dds.to_picklable_anndata(), f)
Common Analysis Patterns
Two-Group Comparison
Standard case-control comparison:
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition")
dds.deseq2()
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
results = ds.results_df
significant = results[results.padj < 0.05]
Multiple Comparisons
Testing multiple treatment groups against control:
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition")
dds.deseq2()
treatments = ["treatment_A", "treatment_B", "treatment_C"]
all_results = {}
for treatment in treatments:
ds = DeseqStats(dds, contrast=["condition", treatment, "control"])
ds.summary()
all_results[treatment] = ds.results_df
sig_count = len(ds.results_df[ds.results_df.padj < 0.05])
print(f"{treatment}: {sig_count} significant genes")
Accounting for Batch Effects
Control for technical variation:
# Include batch in design
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~batch + condition")
dds.deseq2()
# Test condition while controlling for batch
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
Continuous Covariates
Include continuous variables like age or dosage:
# Ensure continuous variable is numeric
metadata["age"] = pd.to_numeric(metadata["age"])
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~age + condition")
dds.deseq2()
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
Using the Analysis Script
This skill includes a complete command-line script for standard analyses:
# Basic usage
python scripts/run_deseq2_analysis.py \
--counts counts.csv \
--metadata metadata.csv \
--design "~condition" \
--contrast condition treated control \
--output results/
# Wit