Download ENCODE Files
When to Use
- User wants to download ENCODE data files to their local machine
- User asks to "download", "get", or "fetch" ENCODE files
- User needs specific file formats (BED, FASTQ, BAM, bigWig) from experiments
- User wants to batch download files matching search criteria
- User needs to verify file integrity after download (MD5 checksums)
- User asks about organizing downloaded files by experiment or format
Help the user download ENCODE data files to their local machine.
Download Strategy
- Specific files by accession: Use encode_download_files with file accession IDs (e.g., "ENCFF635JIA").
- Batch download by criteria: Use encode_batch_download to search and download in one step.
  - Always start with dry_run=True (default) to preview what will be downloaded
  - Show the user the file count, total size, and file list
  - Only proceed with dry_run=False after user confirms
- Download organization options:
  - "flat": All files in one directory
  - "experiment": Organized by experiment accession (recommended)
  - "format": Organized by file format
  - "experiment_format": Organized by experiment, then format
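To make the four organization options concrete, here is a minimal sketch of where a file might land under each one. The directory layouts, the `destination_path` helper, and the experiment accession `ENCSR000AAA` are assumptions for illustration only; the actual tool may name or nest directories differently.

```python
from pathlib import Path


def destination_path(root, organize_by, experiment, file_format, filename):
    """Return an illustrative destination for one file.

    The layouts below are assumed, not taken from the tool's implementation.
    """
    root = Path(root)
    layouts = {
        "flat": root,
        "experiment": root / experiment,
        "format": root / file_format,
        "experiment_format": root / experiment / file_format,
    }
    if organize_by not in layouts:
        raise ValueError(f"unknown organization option: {organize_by!r}")
    return layouts[organize_by] / filename


if __name__ == "__main__":
    # ENCSR000AAA is a hypothetical experiment accession.
    for option in ("flat", "experiment", "format", "experiment_format"):
        print(option, destination_path(
            "downloads", option, "ENCSR000AAA", "bed", "ENCFF635JIA.bed.gz"))
```

Grouping by experiment keeps replicates and their derived files together, which is presumably why it is the recommended option.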
Important Notes
- All downloads include MD5 verification by default (verify_md5=True)
- Ask the user for a download directory if not specified
- Warn about large downloads (>1GB total or >50 files)
- Files already downloaded will be skipped (idempotent)
- For restricted files, credentials must be configured first via encode_manage_credentials
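For users who want to re-check a file by hand, MD5 verification can be sketched with Python's standard `hashlib`. This is not the tool's own implementation; `md5_of_file` and `md5_matches` are hypothetical helper names, and the expected checksum is assumed to come from the file's ENCODE metadata.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks so that
    multi-gigabyte files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(path, expected_md5):
    """Return True if the file's MD5 equals the expected checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

If `md5_matches` returns False, the file should be downloaded again rather than used.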
Pitfalls & Edge Cases
- Disk space: BAM files can be 5-50GB each; FASTQ files 1-20GB. Before any batch download, warn the user about the estimated total size from the dry_run preview. A single ChIP-seq experiment can produce 10-30GB of raw data files.
- MD5 verification failures: If MD5 verification fails, the file may be corrupted or incompletely downloaded. Always re-download rather than skipping verification. Never set verify_md5=False unless the user explicitly requests it and understands the risk.
- Downloading too much data: Users often request BAM files when they only need peak calls or signal tracks. Suggest preferred_default=True to get ENCODE's recommended files, or filter by output_type (e.g., "IDR thresholded peaks", "fold change over control") to avoid downloading raw data unnecessarily.
- Restricted/unreleased data: Files with status other than "released" may require ENCODE credentials. Use encode_manage_credentials(action="check") to verify credentials are configured before attempting to download restricted data.
- Mixed assemblies in batch download: Always specify the assembly filter (e.g., "GRCh38") in batch downloads. Without it, you may download files aligned to different genome assemblies (hg19, GRCh38, mm10), whose coordinates cannot be compared directly in downstream analysis.
- Timeout on large files: When downloading many files or very large files, encode_batch_download handles retries and concurrent downloads better than individual encode_download_files calls. The default limit of 100 files provides a safety cap.
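Several of these pitfalls can be caught by inspecting the dry-run preview before anything is downloaded. The sketch below assumes the preview can be reduced to a list of dicts with `file_size` (bytes) and `assembly` keys; that structure, the `check_preview` helper, and the decimal definition of a gigabyte are assumptions, not the tool's documented output.

```python
import shutil

GB = 10 ** 9  # decimal gigabyte; an assumption about how sizes are reported


def check_preview(files, download_dir=".", free_bytes=None,
                  max_bytes=1 * GB, max_files=50):
    """Return a list of warning strings for a dry-run preview.

    `files` is assumed to be a list of dicts with "file_size" and "assembly".
    """
    warnings = []
    total = sum(f.get("file_size", 0) for f in files)

    # Large-download threshold from the notes above: >1GB total or >50 files.
    if total > max_bytes or len(files) > max_files:
        warnings.append(
            f"Large download: {len(files)} files, {total / GB:.1f} GB total")

    # Compare the total against free space on the target volume.
    if free_bytes is None:
        free_bytes = shutil.disk_usage(download_dir).free
    if total > free_bytes:
        warnings.append(
            f"Not enough disk space: need {total / GB:.1f} GB, "
            f"{free_bytes / GB:.1f} GB free")

    # Flag previews that span more than one genome assembly.
    assemblies = sorted({f["assembly"] for f in files if f.get("assembly")})
    if len(assemblies) > 1:
        warnings.append("Mixed assemblies: " + ", ".join(assemblies))
    return warnings
```

Any warning returned here is a reason to pause and confirm with the user before rerunning with dry_run=False.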
File Type Guide
When users request "files" without specifying a type, use this priority to suggest the right output_type:
- Peak analysis: output_type="IDR thresholded peaks" (most stringent, recommended for ChIP-seq/ATAC-seq)
- Signal visualization: file_format="bigWig", output_type="fold change over control" (for genome browser tracks)
- Gene expression: output_type="gene quantifications" (for RNA-seq TPM/FPKM tables)
- Raw data reprocessing: file_format="fastq" (only when user needs to run their own pipeline)
- Quick defaults: preferred_default=True (ENCODE's recommended files for any experiment)
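The priority list above can be expressed as a small lookup from analysis goal to filter arguments. The goal labels (`"peaks"`, `"signal"`, and so on) and the `suggest_filters` helper are hypothetical; only the filter names and values come from the guide, and how they are passed to the download tools is an assumption.

```python
FILTERS_BY_GOAL = {
    "peaks": {"output_type": "IDR thresholded peaks"},
    "signal": {"file_format": "bigWig",
               "output_type": "fold change over control"},
    "expression": {"output_type": "gene quantifications"},
    "raw": {"file_format": "fastq"},
    "default": {"preferred_default": True},
}


def suggest_filters(goal=None, assembly=None):
    """Return suggested filters for a goal, falling back to ENCODE's
    preferred defaults when the goal is missing or unrecognized."""
    # Copy so callers cannot mutate the shared lookup table.
    filters = dict(FILTERS_BY_GOAL.get(goal, FILTERS_BY_GOAL["default"]))
    if assembly:
        filters["assembly"] = assembly
    return filters


if __name__ == "__main__":
    print(suggest_filters("signal", assembly="GRCh38"))
```

Falling back to preferred_default=True mirrors the "Quick defaults" entry: when the user has not said what they need, the curated files are the safest suggestion.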
What to Download for Each Analysis
| Analysis Goal | File Format | Output Type | Why This File |
|---|---|---|---|
| Peak locations (ChIP/ATAC) | bed narrowPeak | IDR thresholded peaks | Gold-standard replicated peaks passing irreproducibility threshold |
| Broad domain marks (H3K27me3) | bed broadPeak | replicated peaks | Broad marks need broadPeak format, not narrowPeak |
| Signal visualization | bigWig | fold change over control | Normalized signal track for genome browser display |
| Signal statistics | bigWig | signal p-value | Statistical significance of signal over background |
| Raw data reprocessing | fastq | reads | Starting from scratch with your own pipeline |
| Alignment inspection | bam | alignments | Check read mapping quality, fragment sizes, duplicates |
| Browser-compatible peaks | bigBed | peaks | UCSC/IGV-compatible binary peak format |
| Gene expression levels | tsv | gene quantifications | TPM/FPKM tables for RNA-seq differential expression |
| Transcript isoforms | tsv | transcript quantifications | Isoform-level expression for splicing analysis |
| 3D genome contacts | hic | contact matrix | Hi-C interaction matrices for loop/TAD calling |
| Methylation levels | bed | methylation state at CpG | Per-CpG methylation fractions for WGBS |
Assay-Specific Recommendations
| Assay | Primary Download | Secondary Download |
|---|---|---|
| Histone ChIP-seq | replicated peaks (bed) | fold change over control (bigWig) |
| TF ChIP-seq | IDR thresholded peaks (bed) | fold change over control (bigWig) |
| ATAC-seq | IDR thresholded peaks (bed) | fold change over control (bigWig) |
| DNase-seq | peaks (bed) | signal of unique reads (bigWig) |
| RNA-seq | gene quantifications (tsv) | signal of unique reads (bigWig) |
| WGBS | methylation state at CpG (bed) | signal (bigWig) |
| Hi-C | contact matrix (hic) | contact domains (bed) |
| CUT&RUN | peaks (bed) | fold change over control (bigWig) |
| CUT&Tag | peaks (bed) | fold change over control (bigWig) |
| eCLIP | peaks (bed) | signal of unique reads (bigWig) |
File Selection Priority
When multiple files exist for the same experiment, choose files in this priority order:
- preferred_default=True: ENCODE curators mark recommended files. Always prefer these when available. Use encode_list_files(experiment_accession="ENCSR...", preferred_default=True) to find them.
- Peak file hierarchy (in order of preference):
  - IDR thresholded peaks: replicated, irreproducibility-filtered (gold standard)
  - Optimal IDR thresholded peaks: the larger of the true-replicate and pooled-pseudoreplicate IDR peak sets
  - Conservative IDR thresholded peaks: peaks passing IDR between true replicates
  - Pseudoreplicated peaks: peaks from pooled pseudoreplicates
  - Replicated peaks: peaks found in both replicates (broad marks)
- Signal track hierarchy:
  - fold change over control: normalized signal, best for comparing across experiments
  - signal p-value: statistical significance of enrichment
  - signal of unique reads: uniquely mapped read signal
  - signal of all reads: includes multi-mapped reads (noisier)
- Assembly preference:
  - GRCh38 for human (current standard): use by default
  - hg19 for human (legacy): only if collaborators require it
  - mm10 for mouse (current standard)
  - Never mix assemblies within an analysis
- Replicate preference:
  - Replicated files (combined replicates) over single-replicate files
  - Biological replicates over technical replicates
  - Isogenic replication over anisogenic
- Status preference:
  - released: fully validated, use these
  - archived: older versions, avoid unless specifically needed
  - revoked: quality issues found, never use
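Part of this priority order can be sketched as a sort key over file records. The record fields (`preferred_default`, `output_type`, `status`) are assumed to resemble ENCODE file metadata, and `rank_key` and `pick_best` are hypothetical helpers; assembly and replicate preferences are left out to keep the example short.

```python
PEAK_ORDER = [
    "IDR thresholded peaks",
    "optimal IDR thresholded peaks",
    "conservative IDR thresholded peaks",
    "pseudoreplicated peaks",
    "replicated peaks",
]
SIGNAL_ORDER = [
    "fold change over control",
    "signal p-value",
    "signal of unique reads",
    "signal of all reads",
]
# Position within its own hierarchy; compared case-insensitively.
OUTPUT_RANK = {
    name.lower(): position
    for order in (PEAK_ORDER, SIGNAL_ORDER)
    for position, name in enumerate(order)
}
STATUS_RANK = {"released": 0, "archived": 1}


def rank_key(record):
    """Lower tuples sort first: preferred_default, then output type, then status."""
    output = str(record.get("output_type", "")).lower()
    return (
        0 if record.get("preferred_default") else 1,
        OUTPUT_RANK.get(output, len(PEAK_ORDER)),
        STATUS_RANK.get(record.get("status"), len(STATUS_RANK)),
    )


def pick_best(records):
    """Return the highest-priority record, never a revoked one."""
    usable = [r for r in records if r.get("status") != "revoked"]
    return min(usable, key=rank_key) if usable else None
```

Revoked files are excluded outright rather than ranked last, so they cannot be selected even when nothing else is available.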
Storage Estimates
Plan disk space before downloading. Use dry_run=True to get exact sizes for your query.
| File Type | Typical Size per File | 10 Experiments | 50 Experiments |
|---|---|---|---|
| BED peaks (narrowPeak) | 1-10 MB | 10-100 MB | 50-500 MB |
| BED peaks (broadPeak) | 5-50 MB | 50-500 MB | 250 MB - 2.5 GB |
| bigWig signal tracks | 200 MB - 2 GB | 2-20 GB | 10-100 GB |
| bigBed peaks | 1-20 MB | 10-200 MB | 50 MB - 1 GB |
| TSV quantifications | 5-50 MB | 50-500 MB |