ebook-library-organizer

Use this skill whenever the user wants to clean up, organize, deduplicate, OCR, or catalog a messy ebook library — especially when they describe a large pile of PDF / EPUB / DJVU files on a NAS share or external drive with poor metadata, scanned books that aren't searchable, duplicate copies in multiple folders, broken Calibre imports, fragmented chapter PDFs, or filename garbage like "(Malestrom)

Author: Dr-P-Akbari · License: MIT

Ebook Library Organizer

Why this skill exists

Most people with large ebook collections have the same problem: years of accumulated PDF/EPUB downloads from different sources, sitting in messy folders on a NAS, with no proper metadata, lots of duplicates, scanned books that aren't searchable, fragmented multi-PDF handbooks, and filenames that lie. They want to:

  • Find a specific book in seconds
  • Search inside their books (full-text)
  • Browse by subject category, not by random folder
  • Stop wasting disk space on duplicates
  • Have a real catalog (titles, authors, covers) instead of garbled filenames

This skill provides the entire pipeline to fix all of that, safely. The framework is domain-agnostic — it works for fiction, science, law, medicine, business, cookbooks, anything. The default categorisation rules ship as an architecture/urban-design/philosophy/engineering preset (it's what the original author had) but you should swap that preset for your domain before running Phase 8. See scripts/category_rules.py and examples/category_rules_fiction_preset.py.

When to use

Trigger this skill aggressively when the user describes:

  • A "messy", "huge", "unorganized" ebook / PDF / book / Calibre library
  • Books on a NAS, network share, or external drive that they can't find anything in
  • Scans that aren't searchable
  • Duplicates across folders
  • Calibre that broke or never finished importing
  • Wanting to "categorize", "tag", "audit", or "consolidate" research papers
  • Hundreds-to-thousands of PDFs with no metadata

Do NOT trigger for: ebooks already neatly tagged in a working Calibre library, single-book operations, reading apps (Kindle/Kobo setup), or non-book PDF organisation (invoices, contracts, etc.).

The pipeline at a high level

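There are 9 scripted phases, preceded by a manual backup (Phase 0). Each phase is independently safe: it runs as a dry run first, logs every operation, and can be restored from that log. Run them in order on first use. You can resume mid-pipeline at any time by re-running a phase, because every script is idempotent.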

| Phase | What it does | Reversible via |
|---|---|---|
| 0 | Backup. Manual, by the user. | n/a |
| 1 | Triage: move non-books (CAD, audio, images, junk) out of the library | phase1_executed.csv |
| 2 | Dedup: SHA-1 content hash, keep one of each, quarantine the rest | phase2_executed.csv |
| 3 | Merge handbooks: combine chapter-PDF folders into single PDFs via qpdf | phase3_executed.csv |
| 4 | OCR: parallel ocrmypdf on scanned PDFs, copying via local SSD for speed | originals backed up to _recycle/pre-ocr/ |
| 4.5 | OCR quality audit: score each OCR'd book, re-OCR bad ones with aggressive flags | originals backed up to _recycle/pre-ocr-v2/ |
| 5 | Calibre import: bulk-import to a fresh Calibre library with folder-derived tags | Calibre's own database |
| 5b | Metadata fetch: extract ISBNs from page text, look up titles/authors/covers | per-book log |
| 8 | Hierarchical tags: apply category tree (architecture.history.modern etc.) to all books | per-book log |
| 9 | Category-folder export: copy each book ONCE into its primary category folder for filesystem browsing | none (read-only export) |

Phase numbering is non-contiguous (4.5, no 6/7) because the original development session had a few branches we trimmed. Don't worry about it — follow the order in the table.
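
The "Reversible via" column implies that an executed-moves log is enough to undo a phase. As a minimal sketch of what such a restore could look like, assuming the log has `src` and `dst` columns (hypothetical names; check the real CSV header before relying on this) and that `restore_from_log` is an illustrative helper rather than part of the skill:

```python
import csv
import shutil
from pathlib import Path


def restore_from_log(log_csv: str, dry_run: bool = True) -> list[tuple[str, str]]:
    """Undo logged moves, newest first.

    Assumes each row has 'src' (original location) and 'dst' (where the
    file was moved to). These column names are hypothetical.
    """
    with open(log_csv, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))

    restored = []
    for row in reversed(rows):
        original, current = Path(row["src"]), Path(row["dst"])
        # Skip rows already restored, and never overwrite an existing file.
        if not current.exists() or original.exists():
            continue
        if not dry_run:
            original.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(current), str(original))
        restored.append((str(current), str(original)))
    return restored
```

Calling it with `dry_run=True` first mirrors the pipeline's own plan-then-apply habit: the returned list shows what would move back without touching any file.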

Required tools

Before starting, the user needs these installed. The references/prerequisites.md file has full install instructions per OS.

| Tool | What we use it for | Free? |
|---|---|---|
| Calibre | catalog database, GUI, calibredb CLI | yes (GPL) |
| Tesseract OCR | OCR engine | yes (Apache) |
| Ghostscript | PDF rasterisation for OCR | yes (AGPL) |
| qpdf | merging chapter PDFs | yes (Apache) |
| ocrmypdf (Python) | wraps Tesseract + adds searchable text layer | yes (MPL) |
| pypdf (Python) | reading PDF metadata + pages | yes (BSD) |
| pdfplumber (Python) | optional, alternative reader | yes (MIT) |
| isbnlib (Python) | optional ISBN validation | yes (LGPL) |

How to drive this skill

When you (Claude) help a user with this skill:

1. Capture intent and check prerequisites

Ask the user:

  1. Where is the messy library? (path)
  2. Where should the clean Calibre library go? (recommend local SSD, not NAS — Calibre over SMB has known case-sensitivity bugs)
  3. Where should the final category-folder export go? (the user's "browseable" view, can be on NAS)
  4. Have they got a backup? This is non-negotiable. If no backup, stop and help them make one before any phase runs.

Then run the prerequisites check:

python scripts/check_prerequisites.py

This reports which tools are missing and links to install commands.

2. Phase 0: backup

If the user has no backup, walk them through making one. Options:

  • Copy the source folder to a separate drive
  • Use Windows File History / Time Machine
  • rsync to another NAS

Don't proceed until backup is confirmed.

3. Phase 1: triage (dry-run, review, apply)

# Dry run — writes a CSV of every proposed move, moves NOTHING
python scripts/phase1_triage.py \
  --source "<source library path>" \
  --plan-csv plan/phase1_moves.csv

# User reviews plan/phase1_moves.csv in Excel/text editor

# Apply once approved
python scripts/phase1_triage.py \
  --source "<source library path>" \
  --plan-csv plan/phase1_moves.csv \
  --apply \
  --log-csv logs/phase1_executed.csv

What it identifies:

  • Non-book file types (CAD .dwg/.eps, audio .mp3/.wav, images .jpg, Office docs, etc.)
  • Broken Calibre artefacts at the library root (corrupted metadata.db, calibre_test_*.txt files)
  • Junk OS files (.DS_Store, Thumbs.db)
  • "Page-scan books" — folders of sequentially-numbered JPGs that ARE the book (these are detected and left alone — moving the JPGs would destroy the book)
  • "Coherent project archives" — folders mixing CAD + PDFs + Word docs that should travel together (don't split them)

Everything is moved, not deleted, into <source>/_non_books/<category>/ or <source>/_recycle/ for review. Nothing is destroyed.
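
The triage rules above can be sketched as a small classifier. This is an illustration only, not the logic of scripts/phase1_triage.py: the extension map is a subset taken from the list above, and the thresholds in `is_page_scan_folder`, as well as the choice to send junk OS files to `_recycle`, are assumptions.

```python
import re
from pathlib import Path

# Illustrative subset of the file types named above; the real script's
# categories and extension list may differ.
NON_BOOK_TYPES = {
    ".dwg": "cad", ".eps": "cad",
    ".mp3": "audio", ".wav": "audio",
    ".jpg": "images",
}
JUNK_NAMES = {".DS_Store", "Thumbs.db"}


def is_page_scan_folder(names: list[str], min_pages: int = 10) -> bool:
    """True if the folder is mostly sequentially numbered JPGs.

    The thresholds (10 pages, 80% of files, 90% coverage) are assumptions.
    """
    numbers = sorted(
        int(m.group(1))
        for n in names
        if (m := re.search(r"(\d+)\.jpe?g$", n, re.IGNORECASE))
    )
    if len(numbers) < min_pages or len(numbers) < 0.8 * len(names):
        return False
    # Tolerate a few missing pages: the numbers must cover most of their span.
    return len(set(numbers)) >= 0.9 * (numbers[-1] - numbers[0] + 1)


def classify(path: str, sibling_names: list[str]) -> str:
    """Return 'keep' or a destination folder relative to the source root."""
    p = Path(path)
    if p.name in JUNK_NAMES:
        return "_recycle"
    category = NON_BOOK_TYPES.get(p.suffix.lower())
    if category is None:
        return "keep"
    if category == "images" and is_page_scan_folder(sibling_names):
        return "keep"  # the JPGs are the book; moving them would destroy it
    return f"_non_books/{category}"
```

A lone `photo.jpg` next to a PDF classifies as `_non_books/images`, while `page_003.jpg` inside a folder of `page_001.jpg` through `page_012.jpg` is kept in place.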

4. Phase 2: dedup

python scripts/phase2_dedup.py \
  --source "<source library path>" \
  --plan-csv plan/phase2_moves.csv \
  --summary-csv plan/phase2_summary.csv

# Review plan/phase2_summary.csv — one row per duplicate group (sortable in Excel)

python scripts/phase2_dedup.py \
  --source "<source library path>" \
  --plan-csv plan/phase2_moves.csv \
  --apply --log-csv logs/phase2_executed.csv

Strategy: two-stage duplicate detection (size, then hash), followed by winner selection and quarantine.

  1. Group all books by exact byte size (cheap)
  2. Within each size collision, SHA-1 hash the files
  3. Among files with identical hash, pick a winner using a scoring function: prefer files NOT in junk folders (Reading now, NewNeedcategorize), prefer descriptive filenames over ones with rip-site tags (nebks.com, bookfi, _alt), prefer shallower folder depth.
  4. Quarantine losers to _recycle/duplicates/<original-path>/

In our reference run this recovered 4.7 GB of pure duplicates from a 29 GB library.
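
The size-then-hash strategy and the winner scoring described above can be sketched as follows. This is a simplified illustration, not the code in scripts/phase2_dedup.py: the junk-folder names and rip-site tags come from the text, but the order of the scoring criteria and the helper names (`sha1_of`, `score`, `find_duplicates`) are assumptions.

```python
import hashlib
import os
from collections import defaultdict

RIP_TAGS = ("nebks.com", "bookfi", "_alt")           # named in the text above
JUNK_FOLDERS = ("Reading now", "NewNeedcategorize")  # named in the text above


def sha1_of(path: str, chunk: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large PDFs never load fully into RAM."""
    h = hashlib.sha1()
    with open(path, "rb") as fh:
        while block := fh.read(chunk):
            h.update(block)
    return h.hexdigest()


def score(path: str) -> tuple:
    """Higher tuples win. The priority order here is an assumption."""
    parts = path.replace("\\", "/").split("/")
    in_junk = any(p in JUNK_FOLDERS for p in parts[:-1])
    has_rip_tag = any(tag in parts[-1].lower() for tag in RIP_TAGS)
    return (not in_junk, not has_rip_tag, -len(parts))


def find_duplicates(paths: list[str]) -> list[tuple[str, list[str]]]:
    """Return (winner, losers) for every group of byte-identical files."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot be a duplicate, so skip hashing
        by_hash = defaultdict(list)
        for p in same_size:
            by_hash[sha1_of(p)].append(p)
        for dupes in by_hash.values():
            if len(dupes) > 1:
                ranked = sorted(dupes, key=score, reverse=True)
                groups.append((ranked[0], ranked[1:]))
    return groups
```

Grouping by size first means most files are never read at all, which matters when the library sits on a slow network share.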

5. Phase 3: handbook merging

Many engineering handbooks (Mechanical Engineers', Perry's Chemical, Roark's Formulas, etc.) are distributed as one PDF per chapter — turning a single 1000-page book into 100 separate files. This phase detects those folders and uses qpdf --empty --pages to concatenate them into a single PDF.

python scripts/phase3_merge_handbooks.py \
  --source "<source library path>" \
  --handbook-base "<source>/Studies/All Handbooks/" \
  --apply --log-csv logs/phase3_executed.csv

Chapter ordering: front matter (_fm, _pref, _toc, _how) first, then numbered chapters, then back matter (index and appendices: _ind*, _app*). After a successful merge, the original fragment folder is moved to _recycle/handbook_fragments/<HandbookName>/.
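
The ordering rule can be expressed as a sort key, with the sorted list then passed to qpdf. The sketch below is an assumption about how this might be done, not the code in scripts/phase3_merge_handbooks.py: it treats the last run of digits in a filename as the chapter number, keeps the listed marker order within front and back matter, and uses hypothetical helper names. The `qpdf --empty --pages ... -- out.pdf` form is qpdf's documented syntax for concatenation.

```python
import re
from pathlib import Path

FRONT = ("_fm", "_pref", "_toc", "_how")  # markers from the text above
BACK = ("_ind", "_app")                   # markers from the text above


def chapter_sort_key(filename: str) -> tuple:
    """Sort front matter first, numbered chapters next, back matter last."""
    stem = Path(filename).stem.lower()
    for rank, marker in enumerate(FRONT):
        if marker in stem:
            return (0, rank, stem)
    for rank, marker in enumerate(BACK):
        if marker in stem:
            return (2, rank, stem)
    # Assumption: the last run of digits in the name is the chapter number.
    digits = re.findall(r"\d+", stem)
    return (1, int(digits[-1]) if digits else 0, stem)


def build_qpdf_command(chapter_files: list[str], output_pdf: str) -> list[str]:
    """Build the argument list for a qpdf concatenation (does not run it)."""
    ordered = sorted(chapter_files, key=chapter_sort_key)
    return ["qpdf", "--empty", "--pages", *ordered, "--", output_pdf]
```

Comparing chapter numbers as integers keeps `ch2` ahead of `ch10`, which a plain alphabetical sort would get wrong. The resulting list can be handed to `subprocess.run(cmd, check=True)` once the order has been reviewed.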

6. Phase 4: OCR (the long step)

# Step 1: scan to find which PDFs need OCR
python scripts/phase4_ocr_scan.py \
  --sou

How to add

/plugin marketplace add Dr-P-Akbari/ebook-library-organizer

The exact command may vary depending on the repository. Check the README on GitHub.
