Genomic Foundation Models: Nucleotide Transformers, HyenaDNA, and Evo
Tokenization Strategies
- Character-level (A, C, G, T, N): highest resolution, long sequences
- k-mer tokens (e.g., k=6): compressed representation; k=6 → 4,096-token vocab, k=8 → 65,536; use k≤6 for explicit k-mer tokenization
- BPE/subword: data-driven token units (used in some genomic LMs)
- Nucleotide Transformer: 6-mer tokens, non-overlapping (stride=6), 4096 k-mer vocab; ~L/6 tokens per sequence; loses single-nucleotide re
[Description truncated. See the full README on GitHub.]
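The vocabulary sizes and sequence lengths in the list above can be checked with a short sketch. It is illustrative only and not the tokenizer of any of the models named here: the helper names `kmer_vocab` and `kmer_tokenize` are hypothetical, the alphabet is assumed to be `ACGT` with no `N` handling or special tokens, and a trailing fragment shorter than k is simply dropped.

```python
from itertools import product


def kmer_vocab(k, alphabet="ACGT"):
    """Map every possible k-mer over the alphabet to an integer id."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}


def kmer_tokenize(seq, k, stride):
    """Split seq into k-mers, taking one window every `stride` bases.

    Assumption for this sketch: a trailing fragment shorter than k is dropped.
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


seq = "ACGTACGTACGTACGTAC"  # 18 bases, made-up example sequence

char_tokens = list(seq)                          # one token per base: L tokens
nonoverlap = kmer_tokenize(seq, k=6, stride=6)   # about L/k tokens
overlap = kmer_tokenize(seq, k=6, stride=1)      # L - k + 1 tokens

# Vocabulary grows as 4**k: 4096 for k=6, 65536 for k=8.
print(len(kmer_vocab(6)), len(kmer_vocab(8)))
# Token counts for the same 18-base sequence under each scheme.
print(len(char_tokens), len(nonoverlap), len(overlap))
```

With non-overlapping windows, a single-base substitution changes exactly one token and a single-base insertion shifts every downstream token, which is the cost of the shorter sequence length compared with character-level input.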