polars-bio

Overview

polars-bio is a high-performance Python library for genomic interval operations and bioinformatics file I/O, built on Polars, Apache Arrow, and Apache DataFusion. It provides a DataFrame-centric API for interval arithmetic (overlap, nearest, merge, coverage, complement, subtract) and for reading and writing common bioinformatics formats (BED, VCF, BAM, CRAM, GFF/GTF, FASTA, FASTQ).

Key value propositions:

6-38x faster than bioframe on real-world genomic benchmarks
Streaming/out-of-core support for large genomes via DataFusion
Cloud-native file I/O (S3, GCS, Azure) with predicate pushdown
Two API styles: functional (pb.overlap(df1, df2)) and method-chaining (df1.lazy().pb.overlap(df2))
SQL interface for genomic data via DataFusion SQL engine

When to Use This Skill

Use this skill when:

Performing genomic interval operations (overlap, nearest, merge, coverage, complement, subtract)
Reading/writing bioinformatics file formats (BED, VCF, BAM, CRAM, GFF/GTF, FASTA, FASTQ)
Processing large genomic datasets that don't fit in memory (streaming mode)
Running SQL queries on genomic data files
Migrating from bioframe to a faster alternative
Computing read depth/pileup from BAM/CRAM files
Working with Polars DataFrames containing genomic intervals

Quick Start

Installation

Requires Python 3.11–3.14 (see PyPI).

uv pip install "polars-bio==0.31.0"

For pandas compatibility (pandas ≥3.0):

uv pip install "polars-bio[pandas]==0.31.0"

Basic Overlap Example

import polars as pl
import polars_bio as pb

# Create two interval DataFrames
df1 = pl.DataFrame({
    "chrom": ["chr1", "chr1", "chr1"],
    "start": [1, 5, 22],
    "end":   [6, 9, 30],
})

df2 = pl.DataFrame({
    "chrom": ["chr1", "chr1"],
    "start": [3, 25],
    "end":   [8, 28],
})

# Functional API (returns LazyFrame by default)
result = pb.overlap(df1, df2)
result_df = result.collect()

# Get a DataFrame directly
result_df = pb.overlap(df1, df2, output_type="polars.DataFrame")

# Method-chaining API (via .pb accessor on LazyFrame)
result = df1.lazy().pb.overlap(df2)
result_df = result.collect()
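
To make the expected matches concrete, the following pure-Python sketch enumerates which pairs from the two frames above overlap. It assumes the default 1-based coordinates are treated as closed intervals (both endpoints included) and uses a naive nested loop. It illustrates the semantics only and is not how polars-bio computes overlaps; the helper name `overlapping_pairs` is made up for this illustration.

```python
# Intervals as (chrom, start, end), copied from df1 and df2 above.
df1_rows = [("chr1", 1, 6), ("chr1", 5, 9), ("chr1", 22, 30)]
df2_rows = [("chr1", 3, 8), ("chr1", 25, 28)]


def overlapping_pairs(left, right):
    """Return every (left, right) pair that shares at least one position.

    Assumption: coordinates are closed on both ends (1-based convention).
    """
    pairs = []
    for a_chrom, a_start, a_end in left:
        for b_chrom, b_start, b_end in right:
            same_chrom = a_chrom == b_chrom
            # Closed intervals overlap when each starts no later than the other ends.
            if same_chrom and a_start <= b_end and b_start <= a_end:
                pairs.append(((a_chrom, a_start, a_end), (b_chrom, b_start, b_end)))
    return pairs


pairs = overlapping_pairs(df1_rows, df2_rows)
for a, b in pairs:
    print(a, "overlaps", b)
```

Under these assumptions each of the three df1 intervals matches exactly one df2 interval.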

Reading a BED File

import polars_bio as pb

# Eager read (loads entire file)
df = pb.read_bed("regions.bed")

# Lazy scan (streaming, for large files)
lf = pb.scan_bed("regions.bed")
result = lf.collect()
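
The difference between an eager read and a lazy scan can be pictured with plain Python: a list holds every record at once, whereas a generator yields records one at a time as the consumer asks for them. The sketch below is an analogy only. `read_bed_eager` and `scan_bed_lazy` are hypothetical helpers, not polars-bio functions, and the sample BED text is invented.

```python
import io

BED_TEXT = "chr1\t0\t100\nchr1\t150\t300\nchr2\t10\t50\n"


def parse_bed_line(line):
    """Split one BED line into (chrom, start, end) with integer coordinates."""
    chrom, start, end = line.rstrip("\n").split("\t")[:3]
    return chrom, int(start), int(end)


def read_bed_eager(handle):
    """Eager: materialize every record in memory before returning."""
    return [parse_bed_line(line) for line in handle if line.strip()]


def scan_bed_lazy(handle):
    """Lazy: yield records one at a time; nothing is parsed until iterated."""
    for line in handle:
        if line.strip():
            yield parse_bed_line(line)


all_rows = read_bed_eager(io.StringIO(BED_TEXT))

# The lazy version can filter without ever holding the whole file in memory.
chr1_rows = [row for row in scan_bed_lazy(io.StringIO(BED_TEXT)) if row[0] == "chr1"]
print(len(all_rows), len(chr1_rows))
```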

Core Capabilities

1. Genomic Interval Operations

polars-bio provides 8 core interval operations for genomic range arithmetic. All operations accept Polars DataFrames with chrom, start, end columns (configurable). All operations return a LazyFrame by default (use output_type="polars.DataFrame" for eager results).

Operations:

overlap / count_overlaps - Find or count overlapping intervals between two sets (overlap_output="left" returns df1-only hits since 0.30.0)
nearest - Find nearest intervals (with configurable k, overlap, distance params)
merge - Merge overlapping/bookended intervals within a set
cluster - Assign cluster IDs to overlapping intervals
coverage - Compute per-interval coverage counts (two-input operation)
complement - Find gaps between intervals within a genome
subtract - Remove portions of intervals that overlap another set
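
As a rough picture of what complement and subtract in the list above compute, here is a pure-Python sketch for a single chromosome. It assumes 1-based closed coordinates; the function names, signatures, and the `chrom_length` argument are illustrative and do not mirror the polars-bio API.

```python
def complement(intervals, chrom_length):
    """Return the gaps not covered by `intervals` on a chromosome of given length.

    Assumption: 1-based closed coordinates, single chromosome.
    """
    gaps = []
    next_free = 1  # first position not yet covered
    for start, end in sorted(intervals):
        if start > next_free:
            gaps.append((next_free, start - 1))
        next_free = max(next_free, end + 1)
    if next_free <= chrom_length:
        gaps.append((next_free, chrom_length))
    return gaps


def subtract(interval, blockers):
    """Return the parts of `interval` that no interval in `blockers` covers."""
    start, end = interval
    pieces = []
    cursor = start  # first position of `interval` not yet handled
    for b_start, b_end in sorted(blockers):
        if b_end < cursor or b_start > end:
            continue  # blocker lies entirely outside the remaining span
        if b_start > cursor:
            pieces.append((cursor, b_start - 1))
        cursor = max(cursor, b_end + 1)
    if cursor <= end:
        pieces.append((cursor, end))
    return pieces


gaps = complement([(10, 20), (30, 40)], chrom_length=100)
pieces = subtract((1, 50), [(10, 20), (30, 40)])
print(gaps)
print(pieces)
```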

Example:

import polars_bio as pb

# Find overlapping intervals (returns LazyFrame)
result = pb.overlap(df1, df2, suffixes=("_1", "_2"))

# Count overlaps per interval
counts = pb.count_overlaps(df1, df2)

# Merge overlapping intervals
merged = pb.merge(df1)

# Find nearest intervals
nearest = pb.nearest(df1, df2)

# Collect any LazyFrame result to DataFrame
result_df = result.collect()
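
For intuition about merge and cluster, a common way to compute both is a single sweep over sorted intervals. The sketch below is a plain-Python illustration, not the library's implementation. It assumes 1-based closed coordinates on one chromosome, so two intervals are treated as bookended when one ends at position n and the next starts at n + 1; `merge_with_clusters` is a hypothetical helper.

```python
def merge_with_clusters(intervals):
    """Sweep sorted intervals, merging those that overlap or touch.

    Returns (merged, cluster_ids) where cluster_ids[i] is the cluster of the
    i-th interval in sorted order. Assumption: 1-based closed coordinates on a
    single chromosome, so [1, 5] and [6, 10] count as bookended.
    """
    ordered = sorted(intervals)
    merged = []
    cluster_ids = []
    for start, end in ordered:
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the current cluster: extend its end.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # Gap before this interval: start a new cluster.
            merged.append((start, end))
        cluster_ids.append(len(merged) - 1)
    return merged, cluster_ids


merged_intervals, cluster_ids = merge_with_clusters([(1, 6), (5, 9), (22, 30), (10, 12)])
print(merged_intervals)
print(cluster_ids)
```

Here (10, 12) joins the first cluster only because it is bookended with (5, 9); whether adjacency counts as mergeable is a convention worth checking against the reference documentation.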

Reference: See references/interval_operations.md for detailed documentation on all operations, parameters, output schemas, and performance considerations.

2. Bioinformatics File I/O

Read and write common bioinformatics formats with read_*, scan_*, write_*, and sink_* functions. Supports cloud storage (S3, GCS, Azure) and compression (GZIP, BGZF).

Supported formats:

BED - Genomic intervals (read_bed, scan_bed, write_* via generic)
VCF - Genetic variants (read_vcf, scan_vcf, write_vcf, sink_vcf)
VCF Zarr - Analysis-ready Zarr stores (read_vcf_zarr, scan_vcf_zarr; local directory paths)
BAM - Aligned reads (read_bam, scan_bam, write_bam, sink_bam)
CRAM - Compressed alignments (read_cram, scan_cram, write_cram, sink_cram)
GFF - Gene annotations (read_gff, scan_gff)
GTF - Gene annotations (read_gtf, scan_gtf)
FASTA - Reference sequences (read_fasta, scan_fasta, write_fasta, sink_fasta)
FASTQ - Sequencing reads (read_fastq, scan_fastq, write_fastq, sink_fastq)
SAM - Text alignments (read_sam, scan_sam, write_sam, sink_sam)
Hi-C pairs - Chromatin contacts (read_pairs, scan_pairs)

Example:

import polars_bio as pb

# Read VCF file
variants = pb.read_vcf("samples.vcf.gz")

# Lazy scan BAM file (streaming)
alignments = pb.scan_bam("aligned.bam")

# Read GFF annotations
genes = pb.read_gff("annotations.gff3")

# Cloud storage (individual params, not a dict)
df = pb.read_bed("s3://bucket/regions.bed",
                 allow_anonymous=True)
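
On the compression formats mentioned above: BGZF stores data as a series of small gzip blocks, which is why generic gzip readers can usually decompress it. The standard-library sketch below imitates that layout with two concatenated gzip members. It does not write real BGZF (which adds per-block metadata), and the BED records are invented.

```python
import gzip
import io

# Two independently compressed gzip members, concatenated back to back.
# BGZF uses the same idea (many small gzip blocks in one file), with extra
# per-block metadata that this sketch does not reproduce.
block_1 = gzip.compress(b"chr1\t0\t100\n")
block_2 = gzip.compress(b"chr1\t150\t300\n")
blocked_file = io.BytesIO(block_1 + block_2)

# A standard gzip reader walks through every member in sequence.
with gzip.open(blocked_file, "rt") as handle:
    records = [line.rstrip("\n").split("\t") for line in handle]

print(records)
```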

Reference: See references/file_io.md for per-format column schemas, parameters, cloud storage options, and compression support.

3. SQL Data Processing

Register bioinformatics files as tables and query them with DataFusion SQL, so that SQL queries run directly on top of polars-bio's format-aware readers.

import polars as pl
import polars_bio as pb

# Register files as SQL tables (path first, name= keyword)
pb.register_vcf("samples.vcf.gz", name="variants")
pb.register_bed("target_regions.bed", name="regions")

# Query with SQL (returns LazyFrame)
result = pb.sql("SELECT chrom, start, end, ref, alt FROM variants WHERE qual > 30")
result_df = result.collect()

# Register a Polars DataFrame as a SQL table
pb.from_polars("my_intervals", df)
result = pb.sql("SELECT * FROM my_intervals WHERE chrom = 'chr1'").collect()
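
To show the shape of the rows such a query returns without needing any genomic files, the sketch below runs a similar filter against an in-memory table using Python's built-in sqlite3 module. SQLite stands in for DataFusion here purely for illustration: the table contents are invented, and quoting rules differ between engines (SQLite reserves END, so the coordinate columns are quoted).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "start" and "end" are quoted because END is a reserved word in SQLite.
conn.execute(
    'CREATE TABLE variants (chrom TEXT, "start" INTEGER, "end" INTEGER, '
    "ref TEXT, alt TEXT, qual REAL)"
)
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("chr1", 100, 100, "A", "G", 45.0),
        ("chr1", 250, 250, "C", "T", 12.5),
        ("chr2", 900, 901, "AT", "A", 60.0),
    ],
)

# Keep only variants whose quality exceeds 30, in a deterministic order.
rows = conn.execute(
    'SELECT chrom, "start", "end", ref, alt FROM variants '
    'WHERE qual > 30 ORDER BY chrom, "start"'
).fetchall()
print(rows)
conn.close()
```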

Reference: See references/sql_processing.md for register functions, SQL syntax, and examples.

4. Pileup Operations

Compute per-base read depth from BAM/CRAM files with CIGAR-aware depth calculation.

import polars_bio as pb

# Compute depth across a BAM file
depth_lf = pb.depth("aligned.bam")
depth_df = depth_lf.collect()

# With quality filter
depth_lf = pb.depth("aligned.bam", min_mapping_quality=20)
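
"CIGAR-aware" means that depth follows the alignment operations of each read rather than a simple start-to-end span. The pure-Python sketch below illustrates the idea under stated assumptions: M, = and X operations add depth, D and N advance along the reference without adding depth, and I, S, H and P do not consume the reference. How polars-bio treats deletions and skips is not specified here, so treat that rule as an assumption; `depth_from_reads` is a hypothetical helper.

```python
import re
from collections import Counter

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def depth_from_reads(reads):
    """Count how many reads cover each reference position.

    `reads` is a list of (pos, cigar) pairs with a 1-based leftmost position.
    Assumption: M, = and X add depth; D and N advance along the reference
    without adding depth; I, S, H and P do not consume the reference.
    """
    depth = Counter()
    for pos, cigar in reads:
        ref_pos = pos
        for length, op in CIGAR_TOKEN.findall(cigar):
            length = int(length)
            if op in "M=X":
                for offset in range(length):
                    depth[ref_pos + offset] += 1
                ref_pos += length
            elif op in "DN":
                ref_pos += length
            # I, S, H, P: no movement along the reference.
    return dict(sorted(depth.items()))


reads = [(1, "3M1D2M"), (2, "2S4M")]
depth_by_position = depth_from_reads(reads)
print(depth_by_position)
```

In this toy input, position 4 is deleted in the first read, so only the second read contributes to its depth.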

Reference: See references/pileup_operations.md for parameters and integration patterns.

Key Concepts

Coordinate Systems

polars-bio defaults to 1-based coordinates (genomic convention). This can be changed globally:

import polars_bio as pb

# Switch to 0-based half-open coordinates (default is 1-based / False)
pb.set_option("datafusion.bio.coordinate_system_zero_based", True)

# Switch back to 1-based (default)
pb.set_option("datafusion.bio.coordinate_system_zero_based", False)

I/O functions also accept use_zero_based to set coordinate metadata on the resulting DataFrame:

# Read BED with explicit 0-based metadata
df = pb.read_bed("regions.bed", use_zero_based=True)

Important: BED files are always 0-based half-open in the file format. polars-bio handles the conversion automatically when reading BED files. Coordinate metadata is attached to DataFrames by I/O functions and propagated through operations.
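
The arithmetic behind the two conventions is small enough to show directly. This sketch uses the standard conversion between 0-based half-open and 1-based closed intervals; the helper names are illustrative and are not part of polars-bio.

```python
def zero_to_one_based(start, end):
    """Convert 0-based half-open [start, end) to 1-based closed [start, end]."""
    return start + 1, end


def one_to_zero_based(start, end):
    """Convert 1-based closed [start, end] to 0-based half-open [start, end)."""
    return start - 1, end


# The first 100 bases of a chromosome as written in a BED file.
bed_start, bed_end = 0, 100
one_based = zero_to_one_based(bed_start, bed_end)

# Interval length is the same in both systems.
length_zero_based = bed_end - bed_start
length_one_based = one_based[1] - one_based[0] + 1
print(one_based, length_zero_based, length_one_based)
```

Only the start coordinate shifts; the end value is numerically identical in both systems, which is why mixing conventions tends to produce off-by-one errors at interval starts.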

Two API Styles

Functional API - standalone functions, explicit inputs:

result = pb.overlap(df1, df2, suffixes=("_1", "_2"))
merged = pb.merge(df)

Method-chaining API - via .pb accessor on LazyFrames (not DataFrames):

result = df1.lazy().pb.overlap(df2)
merged = df.lazy().pb.merge()
