Bioinformatics Installer for ENCODE Data Analysis
Install all bioinformatics tools needed for ENCODE data analysis, organized by assay type. This skill provides ready-to-use conda environment definitions, R/Bioconductor install scripts, Python package lists, and Nextflow pipeline infrastructure setup. Every environment is version-pinned for reproducibility and tested against ENCODE uniform processing standards.
When to Use
- User wants to install bioinformatics tools needed for ENCODE data analysis
- User asks about "install tools", "conda environment", "setup bioinformatics", or "install HOMER/MACS2/deeptools"
- User needs pre-configured conda environments for specific assay pipelines (ChIP-seq, ATAC-seq, RNA-seq, etc.)
- User wants to install R/Bioconductor packages (DESeq2, Seurat, ChIPseeker) or Python packages (Scanpy, pysam)
- Example queries: "install tools for ChIP-seq analysis", "set up a conda environment for ATAC-seq", "install deeptools and bedtools"
Overview
ENCODE data analysis requires a broad ecosystem of tools spanning command-line aligners, peak callers, signal processors, statistical analysis frameworks in R, Python visualization and single-cell packages, and workflow engines. Setting up these tools correctly — with compatible versions, proper channel priorities, and no dependency conflicts — is a significant barrier for new users and a reproducibility concern for experienced analysts.
This skill solves that by providing:
- 7 assay-specific conda environments with pinned tool versions matching ENCODE pipeline standards
- R/Bioconductor install script covering 50+ packages across 8 categories
- Python install script for single-cell, Hi-C, and genomics packages
- Nextflow + container setup for pipeline execution on local, HPC, and cloud platforms
All environments use the same channel priority (conda-forge > bioconda > defaults) and are tested for cross-platform compatibility on Linux x86_64 and macOS (Intel + Apple Silicon where possible).
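As a rough illustration of the channel ordering and version pinning described above, the sketch below prints a minimal environment file. The environment name `encode-chipseq-sketch`, the helper `write_env_sketch`, and the choice of four packages are assumptions for illustration; the shipped YAML files may be laid out differently.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: print a minimal conda environment file that lists
# channels in the stated priority order and pins a few tool versions.
# The real chipseq-env.yml may differ; this layout is an assumption.
write_env_sketch() {
  cat <<'EOF'
name: encode-chipseq-sketch
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - bwa=0.7.17
  - samtools=1.19
  - macs2=2.2.9.1
  - bedtools=2.31.0
EOF
}

write_env_sketch
```

Redirecting the output to a file (for example `write_env_sketch > sketch-env.yml`) gives something that can be passed to `conda env create -f`. Channels are listed from highest to lowest priority.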
Quick Start
Install a complete environment for any assay type with a single command:
# ChIP-seq (histone or TF)
conda env create -f skills/bioinformatics-installer/environments/chipseq-env.yml
# ATAC-seq
conda env create -f skills/bioinformatics-installer/environments/atacseq-env.yml
# RNA-seq
conda env create -f skills/bioinformatics-installer/environments/rnaseq-env.yml
# Hi-C
conda env create -f skills/bioinformatics-installer/environments/hic-env.yml
# Whole-Genome Bisulfite Sequencing (WGBS)
conda env create -f skills/bioinformatics-installer/environments/wgbs-env.yml
# DNase-seq
conda env create -f skills/bioinformatics-installer/environments/dnaseseq-env.yml
# CUT&RUN / CUT&Tag
conda env create -f skills/bioinformatics-installer/environments/cutandrun-env.yml
Using mamba for faster solves (recommended):
mamba env create -f skills/bioinformatics-installer/environments/chipseq-env.yml
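If mamba may or may not be installed, a small wrapper can choose the solver at run time. This is a minimal sketch; `pick_solver` and `create_env` are hypothetical helper names, not part of the skill.

```bash
# Pick mamba when it is on PATH, otherwise fall back to conda.
pick_solver() {
  if command -v mamba >/dev/null 2>&1; then
    echo mamba
  else
    echo conda
  fi
}

# create_env is a hypothetical helper that creates an environment from a
# YAML file with whichever solver pick_solver selects.
create_env() {
  local env_file="$1"
  "$(pick_solver)" env create -f "$env_file"
}

# Usage:
#   create_env skills/bioinformatics-installer/environments/chipseq-env.yml
```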
Install R packages, Python packages, and Nextflow:
# All R/Bioconductor packages
Rscript skills/bioinformatics-installer/scripts/install-r-packages.R --all
# All Python packages
bash skills/bioinformatics-installer/scripts/install-python-packages.sh --all
# Nextflow + Docker
bash skills/bioinformatics-installer/scripts/install-nextflow.sh --docker
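After any of the installs above, it can help to confirm that the expected binaries are actually on `PATH`. The sketch below assumes the environment has already been activated (for example with `conda activate encode-chipseq`); `missing_tools` is a hypothetical helper, and the tool list is illustrative rather than the full environment manifest.

```bash
# Print every tool from the argument list that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Check a few binaries that the ChIP-seq environment is expected to provide.
missing="$(missing_tools bwa samtools macs2 bedtools bamCoverage)"
if [ -n "$missing" ]; then
  # Unquoted on purpose so the names print on one line.
  echo "Not on PATH:" $missing
else
  echo "All listed tools found"
fi
```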
Per-Assay Environments
ChIP-seq Environment (encode-chipseq)
For histone modification and transcription factor ChIP-seq processing following ENCODE uniform pipeline standards (Landt et al. 2012, ENCODE Consortium 2020).
| Tool | Version | Purpose |
|---|---|---|
| BWA-MEM | 0.7.17 | Read alignment to reference genome (Li & Durbin 2009) |
| samtools | 1.19 | BAM manipulation, sorting, indexing, flagstat (Li et al. 2009) |
| MACS2 | 2.2.9.1 | Peak calling for narrow (TF) and broad (histone) marks (Zhang et al. 2008) |
| Picard | 3.1.1 | Duplicate marking and library complexity metrics (Broad Institute) |
| phantompeakqualtools | 1.2.2 | Strand cross-correlation (NSC/RSC) quality metrics (Kharchenko et al. 2008) |
| IDR | 2.0.3 | Irreproducible Discovery Rate for replicate consistency (Li et al. 2011) |
| deeptools | 3.5.5 | Signal normalization (bamCoverage), fingerprint, correlation (Ramirez et al. 2016) |
| bedtools | 2.31.0 | Interval operations, blacklist filtering (Quinlan & Hall 2010) |
| FastQC | 0.12.1 | Raw read quality assessment (Andrews 2010) |
| Trim Galore | 0.6.10 | Adapter and quality trimming via Cutadapt (Krueger 2012) |
| MultiQC | 1.21 | Aggregate QC report across all pipeline stages (Ewels et al. 2016) |
| bedGraphToBigWig | — | Convert bedGraph signal to bigWig for genome browser viewing (Kent et al. 2010) |
Memory: The BWA index for GRCh38 requires ~5.5 GB RAM. Peak calling with MACS2 typically requires 4-8 GB. phantompeakqualtools loads the full BAM into memory, so its footprint grows with file size.
Environment file: environments/chipseq-env.yml
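The MACS2 row above distinguishes narrow (TF) from broad (histone) peak calling. The sketch below prints, without running, a basic `macs2 callpeak` command for either mode. The helper `macs2_cmd`, the BAM file names, and the sample names are placeholders, and the flags shown are a minimal set rather than the parameters of the ENCODE uniform pipeline.

```bash
# Print (not run) a MACS2 command for narrow or broad peak calling.
# Arguments: mode (narrow|broad), treatment BAM, control BAM, sample name.
macs2_cmd() {
  local mode="$1" treat="$2" control="$3" name="$4"
  # -g hs selects the human effective genome size preset.
  local cmd="macs2 callpeak -t $treat -c $control -f BAM -g hs -n $name"
  case "$mode" in
    narrow) echo "$cmd" ;;
    broad)  echo "$cmd --broad" ;;
    *)      echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}

macs2_cmd narrow tf_chip.bam input.bam tf_sample
macs2_cmd broad h3k27me3_chip.bam input.bam h3k27me3_sample
```

Printing the command first makes it easy to review or log before executing it inside the activated environment.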
ATAC-seq Environment (encode-atacseq)
For chromatin accessibility profiling via ATAC-seq following ENCODE standards (Buenrostro et al. 2013, Corces et al. 2017).
| Tool | Version | Purpose |
|---|---|---|
| Bowtie2 | 2.5.3 | Alignment (preferred over BWA for ATAC-seq short fragments) (Langmead & Salzberg 2012) |
| MACS2 | 2.2.9.1 | Peak calling with --nomodel --shift -100 --extsize 200 for ATAC (Zhang et al. 2008) |
| samtools | 1.19 | BAM manipulation, mitochondrial read filtering |
| Picard | 3.1.1 | Duplicate marking, insert size metrics |
| deeptools | 3.5.5 | alignmentSieve (Tn5 offset), bamCoverage (signal tracks), plotFingerprint |
| bedtools | 2.31.0 | Blacklist filtering, interval operations |
| FastQC | 0.12.1 | Raw read quality and adapter content assessment |
| Trim Galore | 0.6.10 | Adapter trimming (Nextera adapters for ATAC-seq) |
| MultiQC | 1.21 | Aggregate QC reporting |
Key ATAC-seq parameters: Tn5 transposase introduces a +4/-5 bp offset (+4 bp on the plus strand, -5 bp on the minus strand) that must be corrected. The fragment size distribution should show a nucleosomal ladder (sub-nucleosomal, mono-, di-, and tri-nucleosomal peaks). The TSS enrichment score should be >= 5 (GRCh38), >= 6 (hg19), or >= 10 (mm10) for data to fall in the acceptable range (ENCODE data standards).
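To make the Tn5 offset concrete, the sketch below applies it to BED6 read intervals with awk. It shifts the whole interval (+4 bp on the plus strand, -5 bp on the minus strand), which is one common convention and an assumption here; some workflows adjust only the 5' cut site. `tn5_shift` is a hypothetical helper name.

```bash
# Shift BED6 intervals by the Tn5 offset: +4 bp on "+" reads, -5 bp on "-" reads.
# Reads BED6 on stdin and writes tab-separated BED6 on stdout.
tn5_shift() {
  awk 'BEGIN { OFS = "\t" }
       $6 == "+" { $2 += 4; $3 += 4 }
       $6 == "-" { $2 -= 5; $3 -= 5 }
       { if ($2 < 0) $2 = 0; print }'   # clamp starts that fall below zero
}

# Two toy reads, one per strand.
printf 'chr1\t100\t150\tread1\t0\t+\nchr1\t200\t250\tread2\t0\t-\n' | tn5_shift
```

For BAM input, the deeptools `alignmentSieve` tool listed in the table offers an `--ATACshift` option that performs the correction on alignments directly.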
Environment file: environments/atacseq-env.yml
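The per-assembly TSS enrichment minimums given above can be encoded as a small lookup for use in QC scripts. This is a sketch only: `tss_min` and `tss_passes` are hypothetical helpers, and the score itself is assumed to come from a separate QC step.

```bash
# Minimum TSS enrichment per assembly, taken from the thresholds above.
tss_min() {
  case "$1" in
    GRCh38) echo 5 ;;
    hg19)   echo 6 ;;
    mm10)   echo 10 ;;
    *)      return 1 ;;
  esac
}

# tss_passes SCORE ASSEMBLY: exit 0 when SCORE meets the assembly minimum,
# exit 1 when it does not, exit 2 for an unknown assembly.
tss_passes() {
  local min
  min="$(tss_min "$2")" || return 2
  awk -v s="$1" -v m="$min" 'BEGIN { exit !(s + 0 >= m + 0) }'
}

tss_passes 7.2 GRCh38 && echo "GRCh38 7.2: pass"
tss_passes 7.2 mm10 || echo "mm10 7.2: below threshold"
```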
RNA-seq Environment (encode-rnaseq)
For gene expression quantification following ENCODE RNA-seq standards (Conesa et al. 2016, ENCODE Consortium 2020).
| Tool | Version | Purpose |
|---|---|---|
| STAR | 2.7.11b | Splice-aware alignment with 2-pass mapping (Dobin et al. 2013) |
| RSEM | 1.3.3 | Gene/transcript quantification with expectation-maximization (Li & Dewey 2011) |
| Kallisto | 0.50.1 | Pseudoalignment-based transcript quantification (Bray et al. 2016) |
| Salmon | 1.10.3 | Quasi-mapping transcript quantification with GC bias correction (Patro et al. 2017) |
| featureCounts (subread) | 2.0.6 | Gene-level read counting for count-based DE methods (Liao et al. 2014) |
| samtools | 1.19 | BAM handling, flagstat, idxstats |
| FastQC | 0.12.1 | Read quality assessment |
| Trim Galore | 0.6.10 | Adapter and quality trimming |
| MultiQC | 1.21 | Aggregate QC report |
| RSeQC | 5.0.3 | RNA-seq-specific QC: gene body coverage, read distribution, inner distance (Wang et al. 2012) |
Memory: STAR genome generation requires 32+ GB RAM for the human genome, and STAR alignment requires ~30 GB RAM. Kallisto and Salmon are memory-efficient alternatives (~4 GB).
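Given these memory figures, a preflight check can suggest which quantification route fits the machine. The sketch below is an assumption-laden illustration: `total_mem_gb` and `suggest_quantifier` are hypothetical helpers, the 32 GB cutoff is taken from the STAR figure above, and memory is rounded down to whole GB, so a nominal 32 GB machine may report 31.

```bash
# Report total system memory in whole GB (Linux /proc/meminfo or macOS sysctl).
total_mem_gb() {
  if [ -r /proc/meminfo ]; then
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo
  else
    sysctl -n hw.memsize | awk '{ printf "%d\n", $1 / 1024 / 1024 / 1024 }'
  fi
}

# Suggest a quantification route from available memory in GB.
suggest_quantifier() {
  local mem_gb="$1"
  if [ "$mem_gb" -ge 32 ]; then
    echo "STAR + RSEM"
  else
    echo "Salmon or Kallisto"
  fi
}

# Fall back to 0 GB if memory cannot be detected on this platform.
mem="$(total_mem_gb 2>/dev/null)"
suggest_quantifier "${mem:-0}"
```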
Environment file: environments/rnaseq-env.yml
Hi-C Environment (encode-hic)
For chromatin conformation capture processing following ENCODE Hi-C standards (Yardimci et al. 2019, Rao et al. 2014).
| Tool | Version | Purpose |
|---|---|---|
| BWA-MEM | 0.7.17 | Chimeric read alignment (each mate aligned independently) |
| pairtools | 1.0.3 | Parse, sort, deduplicate, filter contact pairs (Open2C) |
| cooler | 0.9.3 | Multi-resolution contact matrix storage and balancing (Abdennur & Mirny 2020) |
| Juicer | 2.20.00 | Contact matrix generation and HiCCUPS loop calling |