SAELens: Sparse Autoencoders for Mechanistic Interpretability
SAELens is a library for training and analyzing Sparse Autoencoders (SAEs), a technique for decomposing polysemantic neural network activations into sparse, interpretable features. It builds on Anthropic's research on monosemanticity.
GitHub: jbloomAus/SAELens (1,100+ stars)
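To make the idea of "decomposing activations into sparse features" concrete, here is a minimal, dependency-free sketch of a sparse autoencoder forward pass. It is not the SAELens API: the class name TinySAE, its methods, and the l1_coeff parameter are hypothetical, and the ReLU encoder with an L1 sparsity penalty is assumed here as one common formulation. The weights are random and untrained, so the resulting features carry no meaning.

```python
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class TinySAE:
    """Hypothetical toy sparse autoencoder (illustration only, not SAELens)."""

    def __init__(self, d_in, d_sae, seed=0):
        rng = random.Random(seed)
        # The encoder maps d_in activations to a wider dictionary of d_sae features.
        self.W_enc = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_sae)]
        self.b_enc = [0.0] * d_sae
        # The decoder maps features back to the original activation space.
        self.W_dec = [[rng.gauss(0, 0.1) for _ in range(d_sae)] for _ in range(d_in)]
        self.b_dec = [0.0] * d_in

    def encode(self, x):
        pre = [p + b for p, b in zip(matvec(self.W_enc, x), self.b_enc)]
        # ReLU zeroes out negative pre-activations, so features are non-negative.
        return [max(0.0, p) for p in pre]

    def decode(self, f):
        return [r + b for r, b in zip(matvec(self.W_dec, f), self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        f = self.encode(x)
        x_hat = self.decode(f)
        # Reconstruction error: mean squared difference between input and output.
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        # Sparsity penalty: L1 norm of the features (already non-negative).
        l1 = sum(f)
        return mse + l1_coeff * l1


sae = TinySAE(d_in=4, d_sae=16)
x = [0.5, -1.0, 0.25, 2.0]  # stand-in for one activation vector
features = sae.encode(x)
reconstruction = sae.decode(features)
print(len(features), len(reconstruction), sae.loss(x))
```

Training would minimize this loss over many activation vectors, trading reconstruction accuracy against how few features are active at once.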
The Problem: Polysemanticity & Superposition
Individual neurons in neural networks are
[Description truncated. See the full README on GitHub.]