Ebook Library Organizer
Why this skill exists
Most people with large ebook collections have the same problem: years of accumulated PDF/EPUB downloads from different sources, sitting in messy folders on a NAS, with no proper metadata, lots of duplicates, scanned books that aren't searchable, fragmented multi-PDF handbooks, and filenames that lie. They want to:
- Find a specific book in seconds
- Search inside their books (full-text)
- Browse by subject category, not by random folder
- Stop wasting disk space on duplicates
- Have a real catalog (titles, authors, covers) instead of garbled filenames
This skill provides the entire pipeline to fix all of that, safely. The framework is domain-agnostic — it works for fiction, science, law, medicine, business, cookbooks, anything. The default categorisation rules ship as an architecture/urban-design/philosophy/engineering preset (it's what the original author had) but you should swap that preset for your domain before running Phase 8. See scripts/category_rules.py and examples/category_rules_fiction_preset.py.
When to use
Trigger this skill aggressively when the user describes:
- A "messy", "huge", "unorganized" ebook / PDF / book / Calibre library
- Books on a NAS, network share, or external drive that they can't find anything in
- Scans that aren't searchable
- Duplicates across folders
- Calibre that broke or never finished importing
- Wanting to "categorize", "tag", "audit", or "consolidate" research papers
- Hundreds-to-thousands of PDFs with no metadata
Do NOT trigger for: ebooks already neatly tagged in a working Calibre library, single-book operations, reading apps (Kindle/Kobo setup), or non-book PDF organisation (invoices, contracts, etc.).
The pipeline at a high level
There are 9 phases, plus a manual backup (Phase 0) that comes first. Each is independently safe (dry-run first, log of every operation, restorable). Run them in order on first use. Resume mid-pipeline anytime by re-running a phase: every script is idempotent.
| Phase | What it does | Reversible via |
|---|---|---|
| 0 | Backup. Manual, by the user. | n/a |
| 1 | Triage: move non-books (CAD, audio, images, junk) out of the library | phase1_executed.csv |
| 2 | Dedup: SHA-1 content hash, keep one of each, quarantine the rest | phase2_executed.csv |
| 3 | Merge handbooks: combine chapter-PDF folders into single PDFs via qpdf | phase3_executed.csv |
| 4 | OCR: parallel ocrmypdf on scanned PDFs, copying via local SSD for speed | originals backed up to _recycle/pre-ocr/ |
| 4.5 | OCR quality audit: score each OCR'd book, re-OCR bad ones with aggressive flags | originals backed up to _recycle/pre-ocr-v2/ |
| 5 | Calibre import: bulk-import to a fresh Calibre library with folder-derived tags | Calibre's own database |
| 5b | Metadata fetch: extract ISBNs from page text, look up titles/authors/covers | per-book log |
| 8 | Hierarchical tags: apply category tree (architecture.history.modern etc.) to all books | per-book log |
| 9 | Category-folder export: copy each book ONCE into its primary category folder for filesystem browsing | none — read-only export |
Phase numbering is non-contiguous (4.5, no 6/7) because the original development session had a few branches we trimmed. Don't worry about it — follow the order in the table.
Required tools
Before starting, the user needs these installed. The references/prerequisites.md file has full install instructions per OS.
| Tool | What we use it for | Free? |
|---|---|---|
| Calibre | catalog database, GUI, calibredb CLI | yes (GPL) |
| Tesseract OCR | OCR engine | yes (Apache) |
| Ghostscript | PDF rasterisation for OCR | yes (AGPL personal) |
| qpdf | merging chapter PDFs | yes (Apache) |
| ocrmypdf (Python) | wraps Tesseract + adds searchable text layer | yes (MPL) |
| pypdf (Python) | reading PDF metadata + pages | yes (BSD) |
| pdfplumber (Python) | optional, alternative reader | yes (MIT) |
| isbnlib (Python) | optional ISBN validation | yes (LGPL) |
How to drive this skill
When you (Claude) help a user with this skill:
1. Capture intent and check prerequisites
Ask the user:
- Where is the messy library? (path)
- Where should the clean Calibre library go? (recommend local SSD, not NAS — Calibre over SMB has known case-sensitivity bugs)
- Where should the final category-folder export go? (the user's "browseable" view, can be on NAS)
- Have they got a backup? This is non-negotiable. If no backup, stop and help them make one before any phase runs.
Then run the prerequisites check:
python scripts/check_prerequisites.py
This reports which tools are missing and links to install commands.
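As a rough illustration of what such a check involves (not the actual contents of scripts/check_prerequisites.py), a minimal sketch can probe the PATH for each executable. The executable names below are assumptions and vary by OS; Ghostscript, for example, is usually `gswin64c` on Windows rather than `gs`.

```python
import shutil

# Assumed executable names; adjust per OS (e.g. "gswin64c" on Windows).
REQUIRED_TOOLS = {
    "calibredb": "Calibre",
    "tesseract": "Tesseract OCR",
    "gs": "Ghostscript",
    "qpdf": "qpdf",
    "ocrmypdf": "ocrmypdf",
}


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return display names of tools whose executable is not on PATH."""
    return [name for exe, name in tools.items() if which(exe) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All required tools found.")
```

The `which` parameter is only there so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.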
2. Phase 0: backup
If the user has no backup, walk them through making one. Options:
- Copy the source folder to a separate drive
- Use Windows File History / Time Machine
- rsync to another NAS
Don't proceed until backup is confirmed.
3. Phase 1: triage (dry-run, review, apply)
# Dry run — writes a CSV of every proposed move, moves NOTHING
python scripts/phase1_triage.py \
--source "<source library path>" \
--plan-csv plan/phase1_moves.csv
# User reviews plan/phase1_moves.csv in Excel/text editor
# Apply once approved
python scripts/phase1_triage.py \
--source "<source library path>" \
--plan-csv plan/phase1_moves.csv \
--apply \
--log-csv logs/phase1_executed.csv
What it identifies:
- Non-book file types (CAD .dwg/.eps, audio .mp3/.wav, images .jpg, Office docs, etc.)
- Broken Calibre artefacts at the library root (corrupted metadata.db, calibre_test_*.txt files)
- Junk OS files (.DS_Store, Thumbs.db)
- "Page-scan books": folders of sequentially-numbered JPGs that ARE the book (these are detected and left alone, because moving the JPGs would destroy the book)
- "Coherent project archives" — folders mixing CAD + PDFs + Word docs that should travel together (don't split them)
Everything is moved, not deleted, into <source>/_non_books/<category>/ or <source>/_recycle/ for review. Nothing is destroyed.
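The dry-run, apply, and restore pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the code in scripts/phase1_triage.py: the CSV columns (`src`, `dst`), the extension map, and all function names are hypothetical. Images are deliberately left out of the map, since a real triage would need the page-scan check before touching any JPG.

```python
import csv
import shutil
from pathlib import Path

# Hypothetical extension-to-category map (images omitted on purpose).
NON_BOOK_CATEGORIES = {".dwg": "cad", ".mp3": "audio", ".wav": "audio"}
SKIP_ROOTS = ("_non_books", "_recycle")


def build_plan(source: Path):
    """Dry run: return (src, dst) pairs without touching the disk."""
    plan = []
    for path in sorted(source.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(source)
        if rel.parts[0] in SKIP_ROOTS:
            continue  # already triaged, so re-runs propose nothing new
        category = NON_BOOK_CATEGORIES.get(path.suffix.lower())
        if category:
            plan.append((path, source / "_non_books" / category / rel))
    return plan


def write_plan(plan, plan_csv: Path):
    """Write the proposed moves to a CSV the user can review."""
    plan_csv.parent.mkdir(parents=True, exist_ok=True)
    with plan_csv.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["src", "dst"])
        writer.writerows((str(s), str(d)) for s, d in plan)


def apply_plan(plan_csv: Path, log_csv: Path):
    """Move each planned file and append the executed move to the log."""
    log_csv.parent.mkdir(parents=True, exist_ok=True)
    is_new_log = not log_csv.exists()
    with plan_csv.open(newline="", encoding="utf-8") as plan_fh, \
            log_csv.open("a", newline="", encoding="utf-8") as log_fh:
        log = csv.writer(log_fh)
        if is_new_log:
            log.writerow(["src", "dst"])
        for row in csv.DictReader(plan_fh):
            src, dst = Path(row["src"]), Path(row["dst"])
            if not src.exists():
                continue  # moved by an earlier run
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dst))
            log.writerow([str(src), str(dst)])


def undo(log_csv: Path):
    """Restore files by replaying the executed log in reverse."""
    with log_csv.open(newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))
    for row in reversed(rows):
        src, dst = Path(row["src"]), Path(row["dst"])
        if dst.exists() and not src.exists():
            src.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(dst), str(src))
```

Keeping the plan and the executed log as separate files means the log records only what actually happened, which is what a restore needs.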
4. Phase 2: dedup
python scripts/phase2_dedup.py \
--source "<source library path>" \
--plan-csv plan/phase2_moves.csv \
--summary-csv plan/phase2_summary.csv
# Review plan/phase2_summary.csv — one row per duplicate group (sortable in Excel)
python scripts/phase2_dedup.py \
--source "<source library path>" \
--plan-csv plan/phase2_moves.csv \
--apply --log-csv logs/phase2_executed.csv
Strategy: two-stage.
- Group all books by exact byte size (cheap)
- Within each size collision, SHA-1 hash the files
- Among files with identical hash, pick a winner using a scoring function: prefer files NOT in junk folders (Reading now, NewNeedcategorize), prefer descriptive filenames over ones with rip-site tags (nebks.com, bookfi, _alt), prefer shorter folder depth.
- Quarantine losers to _recycle/duplicates/<original-path>/
In our reference run this recovered 4.7 GB of pure duplicates from a 29 GB library.
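The size-then-hash strategy and winner selection can be sketched roughly like this. It is a minimal illustration, not the code in scripts/phase2_dedup.py: the function names are hypothetical, and the strict priority order (junk folder, then rip-site tag, then depth) is an assumption standing in for whatever weights the real scoring function uses.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

JUNK_FOLDERS = ("Reading now", "NewNeedcategorize")
RIP_TAGS = ("nebks.com", "bookfi", "_alt")


def sha1_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs never load fully into memory."""
    digest = hashlib.sha1()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def rank_key(path: Path):
    """Lower sorts first: the best candidate to keep has the smallest key."""
    in_junk = any(part in JUNK_FOLDERS for part in path.parts)
    has_rip_tag = any(tag in path.name.lower() for tag in RIP_TAGS)
    return (in_junk, has_rip_tag, len(path.parts), str(path))


def find_duplicates(paths):
    """Return a list of (winner, losers) for each group of identical files."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[path.stat().st_size].append(path)  # stage 1: cheap size check

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # unique size, cannot be a duplicate
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[sha1_of(path)].append(path)  # stage 2: content hash
        for dupes in by_hash.values():
            if len(dupes) > 1:
                ranked = sorted(dupes, key=rank_key)
                groups.append((ranked[0], ranked[1:]))
    return groups
```

Grouping by size first means only files that collide on size are ever read from disk, which matters when the library sits on a NAS.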
5. Phase 3: handbook merging
Many engineering handbooks (Mechanical Engineers', Perry's Chemical, Roark's Formulas, etc.) are distributed as one PDF per chapter — turning a single 1000-page book into 100 separate files. This phase detects those folders and uses qpdf --empty --pages to concatenate them into a single PDF.
python scripts/phase3_merge_handbooks.py \
--source "<source library path>" \
--handbook-base "<source>/Studies/All Handbooks/" \
--apply --log-csv logs/phase3_executed.csv
Chapter ordering: front matter (_fm, _pref, _toc, _how) first, then numbered chapters, then index (_ind*, _app*). The original fragment folder gets moved to _recycle/handbook_fragments/<HandbookName>/ after success.
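The ordering rule and the qpdf invocation can be sketched as below. This is an illustration, not the code in scripts/phase3_merge_handbooks.py. It assumes the marker is the last underscore-separated token of the filename (for example `hb_toc.pdf` or `hb_12.pdf`), and the function names are hypothetical. Within the back-matter group the sketch falls back to alphabetical order, since the relative order of index and appendix files is not specified above.

```python
import re
from pathlib import Path

FRONT = ("_fm", "_pref", "_toc", "_how")  # front matter, in this order
BACK = ("_ind", "_app")                   # prefixes for index / appendices


def chapter_sort_key(path):
    """Sort front matter first, numbered chapters next, back matter last."""
    stem = Path(path).stem.lower()
    token = "_" + stem.rsplit("_", 1)[-1]  # assumed: marker is the last token
    if token in FRONT:
        return (0, FRONT.index(token), stem)
    if token.startswith(BACK):
        return (2, 0, stem)
    digits = re.search(r"\d+", token)
    chapter = int(digits.group()) if digits else 0
    return (1, chapter, stem)  # numeric sort, so chapter 2 precedes chapter 10


def build_qpdf_command(chapter_pdfs, output_pdf):
    """Build the argument list for: qpdf --empty --pages <inputs> -- <output>."""
    ordered = sorted(chapter_pdfs, key=chapter_sort_key)
    return ["qpdf", "--empty", "--pages", *map(str, ordered), "--", str(output_pdf)]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; building it as a list rather than a shell string avoids quoting problems with folder names such as "All Handbooks".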
6. Phase 4: OCR (the long step)
# Step 1: scan to find which PDFs need OCR
python scripts/phase4_ocr_scan.py \
--sou