PDF Data Extraction
Extract text and structured data from PDF documents using a multi-backend approach with automatic fallback.
Overview
This skill provides PDF text extraction with 9 different backends, automatic GPU detection, and intelligent backend selection. The extraction system tries backends in order until one succeeds, producing markdown output optimized for further processing.
Quick Start Workflow
To extract text from PDFs:
- Single file extraction (installed CLI, recommended):
  extract-pdfs /path/to/document.pdf
  Output: Creates document.md in the same directory.
- Batch extraction (directory):
  extract-pdfs /path/to/pdfs/ /path/to/output/
  Output: Creates .md files for all PDFs in the output directory.
- Custom output file:
  extract-pdfs document.pdf output.md
- Specific backends:
  extract-pdfs document.pdf --backends markitdown pdfplumber
- List available backends:
  extract-pdfs --list-backends
  Output: Shows available backends and GPU status.
Alternative Execution Methods
If the extract-pdfs CLI isn't installed, install it first (recommended):
# Install as global UV tool (from repo root):
cd "${CLAUDE_PLUGIN_ROOT}/../.." && uv tool install --force --editable plugins/pdf-extractor
extract-pdfs --list-backends # verify
Or use these fallback methods without installing:
# uv run (recommended fallback — no install required):
uv run --project "${CLAUDE_PLUGIN_ROOT}" python -m pdf_extraction document.pdf
# Standalone script execution
python "${CLAUDE_PLUGIN_ROOT}/src/pdf_extraction/cli.py" document.pdf
Backend Selection Guide
Custom Backend Ordering
Specify backends in any order with --backends. The system tries each in order, stopping on first success:
# Tables first, then general extraction
extract-pdfs document.pdf --backends pdfplumber markitdown pdfminer
# Scanned documents: vision-based first
extract-pdfs scanned.pdf --backends marker docling markitdown
# Most permissive fallback order (handles problematic PDFs)
extract-pdfs document.pdf --backends pdfminer pypdf2 markitdown
# Single backend only (no fallback)
extract-pdfs document.pdf --backends markitdown
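The try-in-order behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not the plugin's actual code: the `extract_with_fallback` function, the `Backend` callable signature, the demo registry, and the choice to treat empty output as a failure are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical signature: a backend takes a PDF path and returns markdown text.
Backend = Callable[[str], str]


def extract_with_fallback(
    pdf_path: str,
    backend_names: List[str],
    registry: Dict[str, Backend],
) -> Dict[str, Optional[str]]:
    """Try each named backend in order and stop at the first non-empty result."""
    errors = []
    for name in backend_names:
        backend = registry.get(name)
        if backend is None:
            errors.append(f"{name}: not available")
            continue
        try:
            text = backend(pdf_path)
        except Exception as exc:  # any backend failure moves on to the next one
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():
            return {"backend_used": name, "text": text, "error": None}
        errors.append(f"{name}: empty output")  # assumption: empty counts as failure
    return {"backend_used": None, "text": None, "error": "; ".join(errors)}


# Stand-in backends for illustration only.
def failing_backend(path: str) -> str:
    raise RuntimeError("cannot parse")


def working_backend(path: str) -> str:
    return "# Title\n\nBody text"


demo_registry = {"pdfplumber": failing_backend, "markitdown": working_backend}
result = extract_with_fallback(
    "document.pdf", ["pdfplumber", "markitdown", "pdfminer"], demo_registry
)
print(result["backend_used"])  # markitdown
```

Collecting the per-backend errors rather than discarding them makes it easier to see why a given order failed when no backend succeeds.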
CPU-Only Systems (Default)
For systems without GPU, the recommended backend order:
- markitdown - Microsoft's lightweight converter (MIT, fast, no models)
- pdfplumber - Excellent for tables (MIT)
- pdfminer - Pure Python, reliable (MIT)
- pypdf2 - Basic extraction, always available (BSD-3)
GPU Systems
For systems with CUDA-enabled GPU:
- docling - IBM layout analysis (MIT, ~500MB models)
- marker - Vision-based, best for scanned docs (GPL-3.0, ~1GB models)
- Plus all CPU backends as fallback
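As a rough sketch of how the CPU and GPU orderings combine, the helper below builds a default backend list from a single flag. The function name `default_backend_order` and the two constants are hypothetical; the plugin's own selection is done by detect_gpu_availability(), whose internals are not shown here.

```python
from typing import List

# Orderings taken from the CPU-only and GPU lists in this guide.
CPU_BACKENDS = ["markitdown", "pdfplumber", "pdfminer", "pypdf2"]
GPU_BACKENDS = ["docling", "marker"]


def default_backend_order(has_gpu: bool) -> List[str]:
    """Put GPU backends first when CUDA is available, with CPU backends as fallback."""
    if has_gpu:
        return GPU_BACKENDS + CPU_BACKENDS
    return list(CPU_BACKENDS)


print(default_backend_order(has_gpu=False))
print(default_backend_order(has_gpu=True))
```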
Backend Comparison
| Backend | License | Models | Best For | Speed |
|---|---|---|---|---|
| markitdown | MIT | None | General text, forms | Fast |
| pdfplumber | MIT | None | Tables, structured data | Fast |
| pdfminer | MIT | None | Simple text documents | Fast |
| pypdf2 | BSD-3 | None | Basic extraction | Fast |
| docling | MIT | ~500MB | Layout analysis | Medium |
| marker | GPL-3.0 | ~1GB | Scanned documents | Slow |
| pymupdf4llm | AGPL-3.0 | None | LLM-optimized output | Fast |
| pdfbox | Apache-2.0 | None | Tables (Java-based) | Medium |
| pdftotext | System | None | Simple text (CLI) | Fast |
Backend Decision Matrix
| Document Type | Recommended Backend(s) | Why |
|---|---|---|
| Digital text PDF (default) | markitdown, pdfplumber | Fast, accurate |
| PDF with tables/invoices | pdfplumber, pdfbox | Best table structure |
| Complex layouts/columns | docling (GPU) | Layout analysis |
| Scanned documents/images | marker, docling (GPU) | OCR/vision required |
| Insurance policies/forms | markitdown, pdfplumber | Handles form fields |
| Academic papers | docling | Equations, figures |
| Maximum compatibility | pdfminer, pypdf2 | Fewest dependencies |
| Commercial use required | markitdown, pdfplumber | MIT license |
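The matrix above can be encoded as a lookup that builds the matching CLI invocation. This is an illustrative sketch only: the document-type keys and the `build_command` helper are invented for the example, and only the extract-pdfs command and its --backends flag come from this guide.

```python
import shlex
from typing import Dict, List

# Hypothetical labels for the rows of the decision matrix.
RECOMMENDED: Dict[str, List[str]] = {
    "digital_text": ["markitdown", "pdfplumber"],
    "tables": ["pdfplumber", "pdfbox"],
    "complex_layout": ["docling"],
    "scanned": ["marker", "docling"],
    "forms": ["markitdown", "pdfplumber"],
    "academic": ["docling"],
    "max_compatibility": ["pdfminer", "pypdf2"],
}


def build_command(pdf_path: str, doc_type: str) -> str:
    """Return a shell-safe extract-pdfs command for the given document type."""
    # Unknown types fall back to the default row (digital text PDF).
    backends = RECOMMENDED.get(doc_type, RECOMMENDED["digital_text"])
    return shlex.join(["extract-pdfs", pdf_path, "--backends", *backends])


print(build_command("scanned.pdf", "scanned"))
# extract-pdfs scanned.pdf --backends marker docling
```

Using shlex.join keeps paths with spaces correctly quoted when the command is pasted into a shell.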
Programmatic Usage
To use the extraction library directly in Python code:
from pdf_extraction import extract_single_pdf, pdf_to_txt, detect_gpu_availability
# Check available backends
gpu_info = detect_gpu_availability()
print(f"Recommended backends: {gpu_info['recommended_backends']}")
# Extract single file
result = extract_single_pdf(
    input_file='/path/to/document.pdf',
    output_file='/path/to/output.md',
    backends=['markitdown', 'pdfplumber']
)
if result['success']:
    print(f"Extracted with {result['backend_used']}")
    print(f"Quality metrics: {result['quality_metrics']}")
# Batch extract directory
output_files, metadata = pdf_to_txt(
    input_dir='/path/to/pdfs/',
    output_dir='/path/to/output/',
    resume=True,  # Skip already-extracted files
    return_metadata=True
)
Extraction Metadata
Every extraction returns metadata for quality assessment:
{
    'success': True,
    'backend_used': 'markitdown',
    'extraction_time_seconds': 2.5,
    'output_size_bytes': 15234,
    'quality_metrics': {
        'char_count': 15234,
        'line_count': 450,
        'word_count': 2800,
        'table_markers': 12,   # Count of | (tables)
        'has_structure': True  # Has markdown structure
    },
    'encrypted': False,
    'error': None
}
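To make the quality_metrics fields concrete, here is one way such metrics could be computed from extracted markdown. It is a sketch under assumptions: the `compute_quality_metrics` name and the rule used for has_structure (a heading, a list item, or any pipe character) are guesses for illustration, not the logic in utils.py.

```python
import re
from typing import Dict, Union


def compute_quality_metrics(markdown: str) -> Dict[str, Union[int, bool]]:
    """Derive simple size and structure metrics from extracted markdown text."""
    has_heading = re.search(r"^#{1,6} ", markdown, flags=re.MULTILINE) is not None
    has_list = re.search(r"^\s*[-*] ", markdown, flags=re.MULTILINE) is not None
    table_markers = markdown.count("|")  # raw count of pipe characters
    return {
        "char_count": len(markdown),
        "line_count": len(markdown.splitlines()),
        "word_count": len(markdown.split()),
        "table_markers": table_markers,
        # Assumed rule: any heading, list item, or table pipe counts as structure.
        "has_structure": has_heading or has_list or table_markers > 0,
    }


sample = "# Report\n\n| A | B |\n|---|---|\n| 1 | 2 |\n"
metrics = compute_quality_metrics(sample)
print(metrics)
```

A low char_count combined with has_structure set to False is a reasonable signal that a backend returned little usable text and another one should be tried.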
Handling Common Scenarios
Encrypted PDFs
The system detects encrypted PDFs and reports them:
if result['encrypted']:
    print("PDF is password-protected")
Encrypted PDFs cannot be extracted without the password.
Empty or Failed Extractions
When all backends fail:
- Check if PDF is encrypted
- Try with --backends pdfminer pypdf2 (most permissive)
- Check PDF isn't corrupted
- Consider OCR-based backends for scanned documents
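The checklist above can be expressed as a small triage helper over the result dictionary. This is a hypothetical sketch: `triage_failed_extraction` is not part of the library, and it relies only on the 'success' and 'encrypted' keys shown in the metadata example.

```python
from typing import Any, Dict, List


def triage_failed_extraction(result: Dict[str, Any]) -> List[str]:
    """Return suggested next steps, in order, for a failed extraction."""
    if result.get("success"):
        return []  # nothing to do
    if result.get("encrypted"):
        # Encrypted PDFs cannot be extracted without the password.
        return ["obtain the password or an unprotected copy of the PDF"]
    return [
        "retry with --backends pdfminer pypdf2 (most permissive)",
        "check that the PDF is not corrupted",
        "try an OCR-based backend such as marker or docling if the PDF is scanned",
    ]


failed = {"success": False, "encrypted": False, "error": "all backends failed"}
for step in triage_failed_extraction(failed):
    print("-", step)
```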
Resume Batch Processing
To continue interrupted batch extraction:
extract-pdfs /path/to/pdfs/ /path/to/output/
Resume is enabled by default (resume=True), so files that were already extracted are skipped.
To force re-extraction:
extract-pdfs /path/to/pdfs/ --no-resume
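The resume behavior amounts to filtering out PDFs whose output already exists. The sketch below assumes that "already extracted" means a .md file with the same stem is present in the output directory; the `pending_pdfs` helper is hypothetical and the real check may differ (for example, it might also validate file size).

```python
from pathlib import Path
from typing import List


def pending_pdfs(input_dir: Path, output_dir: Path, resume: bool = True) -> List[Path]:
    """List PDFs still to be processed, skipping ones with an existing .md output."""
    pdfs = sorted(input_dir.glob("*.pdf"))
    if not resume:
        return pdfs  # equivalent to --no-resume: process everything again
    return [p for p in pdfs if not (output_dir / f"{p.stem}.md").exists()]
```

A call such as `pending_pdfs(Path("/path/to/pdfs"), Path("/path/to/output"))` would then return only the files that an interrupted batch run had not reached.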
Tables and Structured Data
For PDFs with tables, prioritize:
extract-pdfs document.pdf --backends pdfplumber markitdown
The output will contain markdown tables when detected:
| Column1 | Column2 | Column3 |
|---------|---------|---------|
| Data | Data | Data |
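For downstream processing, markdown tables like the one above can be turned into dictionaries with a few lines of Python. This is a minimal sketch, assuming a single well-formed table with a header row followed by a separator row; the `parse_markdown_table` helper is illustrative and does not handle escaped pipes inside cells.

```python
from typing import Dict, List


def parse_markdown_table(lines: List[str]) -> List[Dict[str, str]]:
    """Convert the lines of one markdown table into a list of row dictionaries."""

    def split_row(line: str) -> List[str]:
        # Drop the outer pipes, then split the remaining text into cells.
        return [cell.strip() for cell in line.strip().strip("|").split("|")]

    rows = [split_row(line) for line in lines if line.strip().startswith("|")]
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator row
    return [dict(zip(header, row)) for row in body]


table = [
    "| Column1 | Column2 | Column3 |",
    "|---------|---------|---------|",
    "| Data    | Data    | Data    |",
]
print(parse_markdown_table(table))
```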
Module Structure Reference
Source Code Layout
Location: ${CLAUDE_PLUGIN_ROOT}/src/pdf_extraction/
| File | Purpose |
|---|---|
| __init__.py | Package exports (extract_single_pdf, pdf_to_txt, etc.) |
| __main__.py | Support for python -m pdf_extraction |
| cli.py | CLI entry point with argparse |
| backends.py | BackendExtractor base class + 9 backend implementations |
| extractors.py | extract_single_pdf(), pdf_to_txt() functions |
| utils.py | GPU detection, quality metrics, encryption check |
Key Classes and Functions
| Component | Location | Purpose |
|---|---|---|
| BackendExtractor | backends.py:35-123 | Base class with Template Method pattern |
| DoclingExtractor | backends.py:130-142 | IBM Docling backend (MIT, GPU) |
| MarkerExtractor | backends.py:145-158 | Vision-based marker backend (GPL-3.0, GPU) |
| MarkItDownExtractor | backends.py:161-173 | Microsoft MarkItDown (MIT, CPU) |
| PdfplumberExtractor | backends.py:244-253 | Table-focused extraction (MIT) |
| PdfminerExtractor | backends.py:219-226 | Pure Python fallback (MIT) |
| Pypdf2Extractor | backends.py:229-241 | Basic extraction, always available (BSD-3) |
| BACKEND_REGISTRY | backends.py:279-292 | Dict mapping backend names to factories |
| detect_gpu_availability() | utils.py:9-40 | Auto-detect GPU and recommend backends |
| extract_single_pdf() | extractors.py:13-80 | Extract one PDF with backend fall