bm25
Ranked content search over any text corpus. One CLI, in-memory BM25 index per process, with a session-local disk cache so repeat invocations against the same corpus load in tens of milliseconds instead of rebuilding.
Setup
uv pip install --system --break-system-packages bm25s
Installation is sub-second on a warm uv cache, and bm25s is the only dependency.
Usage
BM25=/mnt/skills/user/bm25/scripts/bm25.py
# Local directory
python3 $BM25 ./repo 'csrf middleware'
# Multiple queries against the same in-memory index (build once, query many)
python3 $BM25 ./repo 'csrf middleware' 'session backend' 'queryset filter'
# GitHub repo fetched as a tarball (one HTTP call, no git clone)
python3 $BM25 'github.com/django/django' 'atomic transaction'
python3 $BM25 'github.com/django/django@stable/5.0.x' 'atomic transaction'
# Project knowledge or uploads
python3 $BM25 project 'RAG scaling laws'
python3 $BM25 uploads 'tax loss harvesting'
# Filters
python3 $BM25 ./repo 'auth flow' --exclude 'tests/*' --exclude '*/tests/*'
python3 $BM25 ./repo 'config' --include '*.py' --include '*.toml'
# Interactive (REPL — single corpus, many queries)
python3 $BM25 ./repo --interactive
# JSON output for piping
python3 $BM25 ./repo 'auth flow' --json
Corpus types
| Spec | Meaning |
|---|---|
| ./path or /abs/path | Local directory |
| uploads | /mnt/user-data/uploads/ |
| project | /mnt/project/ |
| github.com/owner/repo[@ref] | Tarball fetch via GitHub API (GH_TOKEN used if set) |
Options
| Option | Default | Description |
|---|---|---|
| --top-k N | 10 | Results per query |
| --include GLOB | (auto) | Repeatable. If set, only files matching one of these globs are indexed |
| --exclude GLOB | | Repeatable. Skip files matching these globs |
| --snippet-lines N | 3 | Lines of snippet context per hit (0 = none) |
| --max-file-bytes N | 2,000,000 | Skip files larger than this |
| --json | | Machine-readable output |
| --interactive / -i | | REPL mode for ad-hoc querying within one session |
| --stats | | Print discover + index timings as JSON |
| --no-cache | | Bypass the session-local index cache; build in-memory only |
With no --include, a default set of text/code extensions is indexed (Python,
JS/TS, Go, Rust, Markdown, JSON, YAML, etc.). Standard noise dirs are skipped
unconditionally: .git, node_modules, __pycache__, .venv, dist, etc.
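
The filtering rules above can be sketched roughly as follows. This is a minimal illustration, not the code in bm25.py: the function name `discover`, the extension set, the noise-dir set, and the use of `fnmatch` against the corpus-relative path are all assumptions made for the example.

```python
from fnmatch import fnmatch
from pathlib import Path

# Assumed subsets for illustration; the real defaults live in bm25.py.
DEFAULT_EXTS = {".py", ".js", ".ts", ".go", ".rs", ".md", ".json", ".yaml", ".yml"}
NOISE_DIRS = {".git", "node_modules", "__pycache__", ".venv", "dist"}


def discover(root, include=(), exclude=(), max_file_bytes=2_000_000):
    """Yield files under root that pass the noise-dir, glob, and size checks."""
    root = Path(root)
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        # Noise directories are skipped regardless of any include globs.
        if any(part in NOISE_DIRS for part in rel.parts[:-1]):
            continue
        rel_str = rel.as_posix()
        if any(fnmatch(rel_str, glob) for glob in exclude):
            continue
        if include:
            # Explicit include globs replace the default extension set.
            if not any(fnmatch(rel_str, glob) for glob in include):
                continue
        elif path.suffix.lower() not in DEFAULT_EXTS:
            continue
        if path.stat().st_size > max_file_bytes:
            continue
        yield path
```

With `fnmatch`, `*` also matches path separators, which is one plausible reason the usage examples pass both `tests/*` and `*/tests/*` to cover top-level and nested test directories.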
When to use bm25
| Question shape | Tool |
|---|---|
| "Find lines matching class.*Error" | grep / ripgrep |
| "Show me where parse_input is defined" | tree-sitting (find:/source:) |
| "Which files are about CSRF handling?" | bm25 |
| "Rank these docs by relevance to 'rate limiting strategies'" | bm25 |
| "What's the implementation of the atomic transaction context manager?" | bm25, then tree-sitting source: |
| "Find code by natural-language concept (in a code repo)" | searching-codebases (which has its own TF-IDF mode) |
The boundary with searching-codebases: that skill is code-specific (routes
between regex and TF-IDF, expands via tree-sitting AST). bm25 is the simpler
general-purpose tool — any corpus, no AST awareness, no routing. Prefer
searching-codebases for code; reach for bm25 when the corpus is mixed
(docs + code), non-code (notes, transcripts, PDFs converted to text), or when
you specifically want BM25's length-normalized scoring.
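
To make "length-normalized scoring" concrete, here is a small pure-Python sketch of the textbook Okapi BM25 formula. It is not how bm25s computes scores internally (that library has its own variants and a sparse index); the whitespace tokenizer, the `k1` and `b` values, and the function name `bm25_scores` are assumptions for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with textbook Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    avg_len = sum(len(tokens) for tokens in tokenized) / len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Length normalization: documents longer than average get a larger
        # denominator, so the same term frequency counts for less.
        norm = k1 * (1 - b + b * len(tokens) / avg_len)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (len(docs) - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores
```

Given a short and a long document that each mention a query term once, the short one scores higher, which is the property that tends to help on mixed corpora where file sizes vary widely.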
Design notes
- Session-local disk cache at /home/claude/.bm25-cache/<key>/. The key is a hash of (resolved_corpus_path, include_globs, exclude_globs, max_file_bytes), so any change to those inputs invalidates naturally. The first invocation builds and saves; subsequent invocations against the same corpus and filters load in tens of milliseconds. The cache lives in /home/claude, which is ephemeral, so it expires at the session boundary: same lifetime as the corpus state itself, no cross-session staleness. ~5–35 MB per cached index, depending on corpus size. --no-cache bypasses both load and save; it is useful only if you've mutated the corpus mid-session (rare) or want to confirm a rebuild matches.
- Reuse within a single invocation. The retriever stays in memory between queries in one process. Passing multiple queries positionally, or using --interactive, amortizes any rebuild cost across queries.
- No AST awareness. Chunking is per-file. For symbol-level results in code, combine with tree-sitting queries on the same paths.
- Tokenizer. Default bm25s.tokenize with stopwords disabled. Over a small Django sample, AST-derived token streams (identifiers/strings/comments only) gave near-identical rankings, so we don't bother with AST-based tokenization.
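
The cache-key idea in the first note can be sketched as below. The notes only say the key is a hash of the four inputs, and the architecture outline mentions a 16-hex sha256; the JSON serialization, the sorting of globs, and the helper name `cache_dir` are assumptions for this example and may not match bm25.py.

```python
import hashlib
import json
from pathlib import Path

CACHE_ROOT = Path("/home/claude/.bm25-cache")


def cache_key(corpus_path, include, exclude, max_file_bytes):
    """Return a 16-hex-character digest of the inputs that shape the index."""
    payload = json.dumps(
        {
            "corpus": str(Path(corpus_path).resolve()),
            # Assumption: glob order does not change which files match,
            # so globs are sorted to keep the key order-insensitive.
            "include": sorted(include),
            "exclude": sorted(exclude),
            "max_file_bytes": max_file_bytes,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def cache_dir(corpus_path, include=(), exclude=(), max_file_bytes=2_000_000):
    """Hypothetical helper: map the key to its directory under the cache root."""
    return CACHE_ROOT / cache_key(corpus_path, include, exclude, max_file_bytes)
```

Because a key like this covers the path and filters rather than file contents, editing files in place would not change it, which fits the note that --no-cache is the escape hatch after mutating the corpus mid-session.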
Output format
Default (human-readable):
QUERY: csrf middleware
----------------------------------------------------------------------
1. 5.51 django/core/checks/security/csrf.py
def _csrf_middleware():
return "django.middleware.csrf.CsrfViewMiddleware" in settings.MIDDLEWARE
2. 5.34 docs/howto/csrf.txt
...
--json produces {"query": ..., "results": [{"path", "score", "snippet"}, ...]}.
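
A downstream consumer of the JSON shape described above might look like this sketch. The payload is a hand-written sample reusing the paths and scores from the human-readable example, and `top_paths` is a hypothetical helper. How the CLI wraps output for multiple queries is not documented here, so the sketch assumes a single object for one query.

```python
import json

# Hypothetical payload in the documented shape; values are illustrative only.
raw = """{"query": "csrf middleware", "results": [
  {"path": "django/core/checks/security/csrf.py", "score": 5.51, "snippet": "def _csrf_middleware():"},
  {"path": "docs/howto/csrf.txt", "score": 5.34, "snippet": "..."}
]}"""


def top_paths(payload, min_score=0.0):
    """Return result paths at or above min_score, keeping ranked order."""
    data = json.loads(payload)
    return [r["path"] for r in data["results"] if r["score"] >= min_score]


print(top_paths(raw, min_score=5.4))
```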
Architecture
bm25.py CLI
├── resolve_corpus(spec) → local Path (downloads tarball if github.com/...)
├── cache_key(...) → 16-hex sha256 of inputs that determine the index
├── CorpusIndex.load(cache_dir) → returns cached index if present, else None
├── CorpusIndex.build(...) → walks files, tokenizes, indexes with bm25s
├── CorpusIndex.save(cache_dir) → persists to /home/claude/.bm25-cache/<key>/
├── query(q, k) → ranked (doc_idx, score) pairs
└── best_snippet(doc, q, lines) → pick line w/ most query-term hits + context
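
The best_snippet step in the outline above could work along these lines. This is a sketch under assumptions, not the actual implementation: the lowercase alphanumeric tokenizer, the tie-breaking rule, and the choice to place the window around the best line are all guesses at reasonable behavior.

```python
import re

TOKEN = re.compile(r"[a-z0-9]+")


def best_snippet(doc, query, lines=3):
    """Return up to `lines` lines around the line with the most query-term hits."""
    rows = doc.splitlines()
    if lines <= 0 or not rows:
        return ""
    terms = set(TOKEN.findall(query.lower()))

    def hits(row):
        # Underscores split tokens, so "_csrf_middleware" counts for both terms.
        return sum(1 for tok in TOKEN.findall(row.lower()) if tok in terms)

    # max() keeps the first maximum, so ties go to the earliest line.
    best = max(range(len(rows)), key=lambda i: hits(rows[i]))
    # Assumed windowing: start slightly before the best line, then take `lines` rows.
    start = max(0, best - (lines - 1) // 2)
    return "\n".join(rows[start:start + lines])
```

Counting hits per line is cheap and needs no positional index, which suits a tool that stores whole documents alongside the BM25 index for snippet rendering.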
Cache contents per directory:
- bm25/: bm25s.BM25.save() output (NumPy arrays + vocab)
- corpus.pkl: pickled {paths, docs} so we can render snippets without re-reading the source files
- manifest.json: corpus root, files count, built_at timestamp
No network beyond optional tarball fetch on github.com/... corpora. No
state outside /home/claude/, which is ephemeral.