Geneformer and scGPT for Single-Cell Modeling
Architecture Overview
Geneformer (Theodoris et al., Nature 2023)
Pretrained on ~30M human single-cell transcriptomes.
Rank-value tokenization:
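- Rank the ~20,000 genes by expression within each cell (highest = rank 1); in Geneformer, each gene's expression is first scaled by its nonzero median across the pretraining corpus
- Keep the top-K genes (K=2048); using ranks instead of counts reduces sensitivity to dropout and other technical noise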
- Each gene → learned embedding vector; ordered sequence encodes cell state
- Raw count values never seen — only gene ordering
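
The tokenization steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the Geneformer tokenizer itself: the function name `rank_value_tokenize`, the toy `vocab`, and the optional `gene_medians` scaling argument are all hypothetical names chosen for this example, and skipping undetected (zero-count) genes is an assumption of the sketch.

```python
from typing import Dict, List, Optional


def rank_value_tokenize(
    expression: Dict[str, float],
    vocab: Dict[str, int],
    gene_medians: Optional[Dict[str, float]] = None,
    max_len: int = 2048,
) -> List[int]:
    """Turn one cell's expression profile into a rank-ordered list of token ids.

    Hypothetical helper for illustration only; not part of any library.
    """
    scored = []
    for gene, value in expression.items():
        # Assumption: undetected or out-of-vocabulary genes are left out entirely.
        if value <= 0 or gene not in vocab:
            continue
        # Optional per-gene scaling factor (defaults to 1.0, i.e. no scaling).
        scale = gene_medians.get(gene, 1.0) if gene_medians else 1.0
        scored.append((value / scale, gene))

    # Highest value first (rank 1); ties broken by gene name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))

    # Keep only the top max_len genes; the values themselves are discarded.
    return [vocab[gene] for _, gene in scored[:max_len]]


# Toy example with made-up expression values.
vocab = {"ACTB": 0, "CD3E": 1, "GAPDH": 2, "MS4A1": 3}
cell = {"GAPDH": 120.0, "CD3E": 8.0, "MS4A1": 0.0, "ACTB": 95.0}

tokens = rank_value_tokenize(cell, vocab, max_len=3)
print(tokens)  # [2, 0, 1]  (GAPDH, ACTB, CD3E)
```

Only the order of the returned ids carries information about the cell; the expression values are used for sorting and then dropped, which is what "only gene ordering" means in practice.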
Pretraining: Masked LM — predict maske
[Description truncated. See the full README on GitHub.]