Bioinformatics Installer for ENCODE Data Analysis

Install all bioinformatics tools needed for ENCODE data analysis, organized by assay type. This skill provides ready-to-use conda environment definitions, R/Bioconductor install scripts, Python package lists, and Nextflow pipeline infrastructure setup. Every environment is version-pinned for reproducibility and tested against ENCODE uniform processing standards.

When to Use

User wants to install bioinformatics tools needed for ENCODE data analysis
User asks about "install tools", "conda environment", "setup bioinformatics", or "install HOMER/MACS2/deeptools"
User needs pre-configured conda environments for specific assay pipelines (ChIP-seq, ATAC-seq, RNA-seq, etc.)
User wants to install R/Bioconductor packages (DESeq2, Seurat, ChIPseeker) or Python packages (Scanpy, pysam)
Example queries: "install tools for ChIP-seq analysis", "set up a conda environment for ATAC-seq", "install deeptools and bedtools"

Overview

ENCODE data analysis requires a broad ecosystem of tools: command-line aligners, peak callers, signal processors, statistical analysis frameworks in R, Python visualization and single-cell packages, and workflow engines. Setting up these tools correctly (compatible versions, proper channel priorities, no dependency conflicts) is a significant barrier for new users and a reproducibility concern for experienced analysts.

This skill solves that by providing:

7 assay-specific conda environments with pinned tool versions matching ENCODE pipeline standards
R/Bioconductor install script covering 50+ packages across 8 categories
Python install script for single-cell, Hi-C, and genomics packages
Nextflow + container setup for pipeline execution on local, HPC, and cloud platforms

All environments use the same channel priority (conda-forge > bioconda > defaults) and are tested for cross-platform compatibility on Linux x86_64 and macOS (Intel + Apple Silicon where possible).

Quick Start

Install a complete environment for any assay type with a single command:

# ChIP-seq (histone or TF)
conda env create -f skills/bioinformatics-installer/environments/chipseq-env.yml

# ATAC-seq
conda env create -f skills/bioinformatics-installer/environments/atacseq-env.yml

# RNA-seq
conda env create -f skills/bioinformatics-installer/environments/rnaseq-env.yml

# Hi-C
conda env create -f skills/bioinformatics-installer/environments/hic-env.yml

# Whole-Genome Bisulfite Sequencing (WGBS)
conda env create -f skills/bioinformatics-installer/environments/wgbs-env.yml

# DNase-seq
conda env create -f skills/bioinformatics-installer/environments/dnaseseq-env.yml

# CUT&RUN / CUT&Tag
conda env create -f skills/bioinformatics-installer/environments/cutandrun-env.yml

Using mamba for faster solves (recommended):

mamba env create -f skills/bioinformatics-installer/environments/chipseq-env.yml
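The choice between conda and mamba can be automated. The following is a minimal sketch, not part of the skill itself: the helper name `build_create_command` and the `ASSAY_ENV_FILES` mapping are hypothetical, and the file paths are copied from the commands above. It only builds the command; running it is left to the caller.

```python
import shutil

# Paths copied from the Quick Start commands; the mapping keys are illustrative.
ENV_DIR = "skills/bioinformatics-installer/environments"
ASSAY_ENV_FILES = {
    "chipseq": "chipseq-env.yml",
    "atacseq": "atacseq-env.yml",
    "rnaseq": "rnaseq-env.yml",
    "hic": "hic-env.yml",
    "wgbs": "wgbs-env.yml",
    "dnaseseq": "dnaseseq-env.yml",
    "cutandrun": "cutandrun-env.yml",
}


def build_create_command(assay, which=shutil.which):
    """Return the argv list for creating an assay environment.

    Uses mamba when it is found on PATH and falls back to conda otherwise.
    `which` is injectable so the lookup can be replaced in tests.
    """
    if assay not in ASSAY_ENV_FILES:
        raise KeyError(f"unknown assay: {assay}")
    solver = "mamba" if which("mamba") else "conda"
    env_file = f"{ENV_DIR}/{ASSAY_ENV_FILES[assay]}"
    return [solver, "env", "create", "-f", env_file]


if __name__ == "__main__":
    # Print the command instead of executing it.
    print(" ".join(build_create_command("chipseq")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the printed command looks right.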

Install R and Python packages, plus Nextflow:

# All R/Bioconductor packages
Rscript skills/bioinformatics-installer/scripts/install-r-packages.R --all

# All Python packages
bash skills/bioinformatics-installer/scripts/install-python-packages.sh --all

# Nextflow + Docker
bash skills/bioinformatics-installer/scripts/install-nextflow.sh --docker

Per-Assay Environments

ChIP-seq Environment (`encode-chipseq`)

For histone modification and transcription factor ChIP-seq processing following ENCODE uniform pipeline standards (Landt et al. 2012, ENCODE Consortium 2020).

Tool	Version	Purpose
BWA-MEM	0.7.17	Read alignment to reference genome (Li & Durbin 2009)
samtools	1.19	BAM manipulation, sorting, indexing, flagstat (Li et al. 2009)
MACS2	2.2.9.1	Peak calling for narrow (TF) and broad (histone) marks (Zhang et al. 2008)
Picard	3.1.1	Duplicate marking and library complexity metrics (Broad Institute)
phantompeakqualtools	1.2.2	Strand cross-correlation (NSC/RSC) quality metrics (Kharchenko et al. 2008)
IDR	2.0.3	Irreproducible Discovery Rate for replicate consistency (Li et al. 2011)
deeptools	3.5.5	Signal normalization (bamCoverage), fingerprint, correlation (Ramirez et al. 2016)
bedtools	2.31.0	Interval operations, blacklist filtering (Quinlan & Hall 2010)
FastQC	0.12.1	Raw read quality assessment (Andrews 2010)
Trim Galore	0.6.10	Adapter and quality trimming via Cutadapt (Krueger 2012)
MultiQC	1.21	Aggregate QC report across all pipeline stages (Ewels et al. 2016)
bedGraphToBigWig	—	Convert bedGraph signal to bigWig for genome browser viewing (Kent et al. 2010)

Memory: The BWA index for GRCh38 requires ~5.5 GB RAM. Peak calling with MACS2 typically requires 4-8 GB. phantompeakqualtools loads the full BAM into memory, so its footprint grows with BAM size.

Environment file: environments/chipseq-env.yml

ATAC-seq Environment (`encode-atacseq`)

For chromatin accessibility profiling via ATAC-seq following ENCODE standards (Buenrostro et al. 2013, Corces et al. 2017).

Tool	Version	Purpose
Bowtie2	2.5.3	Alignment (preferred over BWA for ATAC-seq short fragments) (Langmead & Salzberg 2012)
MACS2	2.2.9.1	Peak calling with --nomodel --shift -100 --extsize 200 for ATAC (Zhang et al. 2008)
samtools	1.19	BAM manipulation, mitochondrial read filtering
Picard	3.1.1	Duplicate marking, insert size metrics
deeptools	3.5.5	alignmentSieve (Tn5 offset), bamCoverage (signal tracks), plotFingerprint
bedtools	2.31.0	Blacklist filtering, interval operations
FastQC	0.12.1	Raw read quality and adapter content assessment
Trim Galore	0.6.10	Adapter trimming (Nextera adapters for ATAC-seq)
MultiQC	1.21	Aggregate QC reporting
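The MACS2 settings listed in the table can be assembled into a full command. This is a sketch under assumptions: the helper name, the default genome size `hs`, the output directory, and the single-end `BAM` input format are illustrative choices, not settings confirmed by the skill. With `--shift -100 --extsize 200`, each read is moved 100 bp upstream and extended to 200 bp, which centers the window on the 5' cut site.

```python
import shlex


def macs2_atac_command(bam, name, genome_size="hs", outdir="peaks"):
    """Build an argv list for MACS2 peak calling with the ATAC-seq flags above."""
    return [
        "macs2", "callpeak",
        "-t", bam,            # treatment BAM
        "-f", "BAM",          # assumed input format
        "-g", genome_size,    # effective genome size ("hs" for human)
        "-n", name,           # prefix for output files
        "--outdir", outdir,
        "--nomodel",          # skip fragment-size model building
        "--shift", "-100",    # move reads 100 bp toward the 5' direction
        "--extsize", "200",   # then extend to 200 bp
    ]


if __name__ == "__main__":
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(macs2_atac_command("sample.bam", "sample")))
```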

Key ATAC-seq parameters: Tn5 transposase introduces a +4/-5 bp offset that must be corrected: reads on the forward strand are shifted by +4 bp and reads on the reverse strand by -5 bp. The fragment size distribution should show a nucleosomal ladder (sub-nucleosomal, mono-, di-, and tri-nucleosomal peaks). The TSS enrichment score should be >= 5 (GRCh38), >= 6 (hg19), or >= 10 (mm10) for high-quality data (ENCODE data standards).
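The offset correction and the TSS enrichment cutoffs can be expressed in a few lines. The sketch below is illustrative only: the function names are hypothetical, the position is taken to be the 5' end of the read in whatever coordinate system the caller uses, and the thresholds are the values quoted in the paragraph above. In practice the shift is applied by a tool such as deeptools `alignmentSieve`.

```python
# Offsets applied to the 5' end of each read, keyed by strand.
TN5_OFFSET = {"+": 4, "-": -5}

# Minimum TSS enrichment scores per genome assembly, as quoted above.
MIN_TSS_ENRICHMENT = {"GRCh38": 5.0, "hg19": 6.0, "mm10": 10.0}


def shift_five_prime(position, strand):
    """Return the Tn5-corrected 5' position for a read on the given strand."""
    return position + TN5_OFFSET[strand]


def passes_tss_enrichment(score, genome):
    """Check a TSS enrichment score against the cutoff for its assembly."""
    return score >= MIN_TSS_ENRICHMENT[genome]


if __name__ == "__main__":
    print(shift_five_prime(1000, "+"))           # forward-strand read
    print(shift_five_prime(1000, "-"))           # reverse-strand read
    print(passes_tss_enrichment(7.2, "GRCh38"))
```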

Environment file: environments/atacseq-env.yml

RNA-seq Environment (`encode-rnaseq`)

For gene expression quantification following ENCODE RNA-seq standards (Conesa et al. 2016, ENCODE Consortium 2020).

Tool	Version	Purpose
STAR	2.7.11b	Splice-aware alignment with 2-pass mapping (Dobin et al. 2013)
RSEM	1.3.3	Gene/transcript quantification with expectation-maximization (Li & Dewey 2011)
Kallisto	0.50.1	Pseudoalignment-based transcript quantification (Bray et al. 2016)
Salmon	1.10.3	Quasi-mapping transcript quantification with GC bias correction (Patro et al. 2017)
featureCounts (subread)	2.0.6	Gene-level read counting for count-based DE methods (Liao et al. 2014)
samtools	1.19	BAM handling, flagstat, idxstats
FastQC	0.12.1	Read quality assessment
Trim Galore	0.6.10	Adapter and quality trimming
MultiQC	1.21	Aggregate QC report
RSeQC	5.0.3	RNA-seq-specific QC: gene body coverage, read distribution, inner distance (Wang et al. 2012)

Memory: STAR genome generation requires 32+ GB RAM for the human genome, and STAR alignment requires ~30 GB RAM. Kallisto and Salmon are memory-efficient alternatives (~4 GB).
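These memory figures can serve as a rough pre-flight check before choosing a quantification route. The sketch below is a hypothetical helper, not part of the skill. It treats the numbers above as hard minimums, which is a simplification, and expects the caller to supply the machine's RAM in GB.

```python
# Approximate requirements in GB for the human genome, taken from the note above.
STAR_INDEX_MIN_GB = 32
STAR_ALIGN_MIN_GB = 30
LIGHTWEIGHT_MIN_GB = 4


def feasible_rnaseq_steps(total_ram_gb):
    """List the RNA-seq steps that should fit in the given amount of RAM."""
    steps = []
    if total_ram_gb >= STAR_INDEX_MIN_GB:
        steps.append("STAR genome generation")
    if total_ram_gb >= STAR_ALIGN_MIN_GB:
        steps.append("STAR alignment")
    if total_ram_gb >= LIGHTWEIGHT_MIN_GB:
        steps.append("Kallisto/Salmon quantification")
    return steps


if __name__ == "__main__":
    for ram in (16, 64):
        print(ram, feasible_rnaseq_steps(ram))
```

On a 16 GB laptop this points to Kallisto or Salmon, while a 64 GB workstation can run the full STAR workflow.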

Environment file: environments/rnaseq-env.yml

Hi-C Environment (`encode-hic`)

For chromatin conformation capture processing following ENCODE Hi-C standards (Yardimci et al. 2019, Rao et al. 2014).

Tool	Version	Purpose
BWA-MEM	0.7.17	Chimeric read alignment (each mate aligned independently)
pairtools	1.0.3	Parse, sort, deduplicate, filter contact pairs (Open2C)
cooler	0.9.3	Multi-resolution contact matrix storage and balancing (Abdennur & Mirny 2020)
Juicer	2.20.00	Contact matrix generation and HiCCUPS loop calling
