PDF Conversion Router
Route every PDF conversion through a short analysis step before choosing tools or CLI flags.
The goal is not "extract the most text". The goal is:
- preserve structure
- preserve attachment between labels and values
- choose the most faithful output shape
- avoid noisy defaults when a better route exists
When to Use
- The user wants a PDF converted into another format.
- The requested output is .md, .html, .txt, .json, .docx, or structured notes.
- The PDF may be scanned, OCR-heavy, table-heavy, slide-based, medical, academic, or multi-column.
Core Rule
Never start with one fixed default pipeline.
Always:
- classify the PDF
- classify the target output
- choose the strongest route for that combination
- validate the result on representative sections
- if needed, retry with better settings before delivering
Heuristics are starting points, not guarantees.
Do not promote one flag combination into a universal default just because it worked well on one PDF. Prefer document-specific evidence over habit.
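The loop described above (try a route, validate, retry with better settings) can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: `Attempt`, `convert_with_retries`, and the `convert` and `validate` callables are hypothetical names introduced here, and the caller is assumed to supply the real conversion and validation logic.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Attempt:
    flags: List[str]      # flags tried for this attempt
    output: str           # converted text returned by the converter
    problems: List[str]   # validation findings; empty means accepted


def convert_with_retries(
    routes: Iterable[List[str]],
    convert: Callable[[List[str]], str],
    validate: Callable[[str], List[str]],
) -> Optional[Attempt]:
    """Try each candidate route in order and return the first that validates.

    If none validates, return the attempt with the fewest problems so the
    caller can decide on manual repair. Returns None when no route is given.
    """
    best: Optional[Attempt] = None
    for flags in routes:
        output = convert(flags)
        attempt = Attempt(list(flags), output, validate(output))
        if not attempt.problems:
            return attempt
        if best is None or len(attempt.problems) < len(best.problems):
            best = attempt
    return best
```

Passing the routes in order of preference keeps the document-specific evidence in charge: a later route is only used when validation shows the earlier one failed.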
Primary Engine Rule
Use opendataloader-pdf as the primary conversion engine for every PDF conversion task by default.
This skill should assume:
- opendataloader-pdf is always the first conversion attempt
- other tools are used to classify, validate, OCR, inspect, or support cleanup
- other extractors are not the default replacement for the main conversion route
Use other tools only for one of these reasons:
- quick classification of the PDF
- OCR preprocessing before conversion
- validation against layout-preserving text
- manual repair when the generated output is still noisy
- fallback only if opendataloader-pdf cannot produce a usable result
Step 1: Classify the Source PDF
Identify the document class as quickly as possible:
- Native digital PDF with selectable text
- OCR PDF with noisy text
- Image-only/scanned PDF
- Slide deck / presentation export
- Medical or lab report
- Table-heavy business/finance document
- Narrative report / letter / article
- Mixed layout document with diagrams, tables, and prose
Useful fast checks:
pdfinfo input.pdf
pdftotext -layout input.pdf -
If text is missing or very poor, treat OCR as required.
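One way to turn "missing or very poor" into a repeatable check is sketched below. It wraps the `pdftotext -layout` command shown above and scores the result. The function names and both thresholds (`min_chars_per_page`, `min_alnum_ratio`) are assumptions chosen for illustration and should be tuned against real documents.

```python
import subprocess


def extract_layout_text(pdf_path: str) -> str:
    """Run `pdftotext -layout <pdf> -` and return what it prints to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def needs_ocr(layout_text: str, min_chars_per_page: int = 200,
              min_alnum_ratio: float = 0.6) -> bool:
    """Guess whether the text layer is missing or too poor to trust.

    pdftotext ends each page with a form feed, so counting "\f" gives a
    rough page count. Both thresholds are illustrative assumptions.
    """
    visible = [c for c in layout_text if not c.isspace()]
    if not visible:
        return True  # no extractable text at all
    page_count = max(layout_text.count("\f"), 1)
    chars_per_page = len(visible) / page_count
    alnum_ratio = sum(c.isalnum() for c in visible) / len(visible)
    return chars_per_page < min_chars_per_page or alnum_ratio < min_alnum_ratio
```

A typical call would be `needs_ocr(extract_layout_text("input.pdf"))`. A `True` result is a hint to OCR first, not proof that the text layer is unusable.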
Document-Type Heuristics
Use these as default starting points:
- medical / lab report
  - markdown-with-html + --table-method cluster + --image-output off
- slide deck / PowerPoint export
  - markdown-with-html + --image-output off
  - add --table-method cluster only if the default route under-structures important tabular content
  - if tables are visually obvious but missing or badly fused, treat this as a detection problem, not a Markdown formatting problem
  - if the selected route already reconstructs a real table but clips leading characters at column boundaries, treat that as a boundary-splitting defect, not a missing-table failure
- narrative / article / letter
  - start with markdown or text
  - use markdown-with-html only if structure clearly matters
- table-heavy business / finance PDF
  - start with markdown-with-html
  - add --table-method cluster when rows or columns flatten
- scanned / image-heavy PDF
  - OCR first, then convert with opendataloader-pdf
- mixed-layout PDF
  - prefer markdown-with-html
  - validate one easy section and one hard section before accepting output
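The starting points above can be captured as a lookup table, which makes it harder for one habitual flag set to leak into every document class. The sketch below is illustrative only: the dictionary keys, the `Route` type, and `with_cluster_tables` are hypothetical names, and the output format chosen for scanned PDFs is an assumption because the heuristics only say to OCR first.

```python
from typing import Dict, List, NamedTuple


class Route(NamedTuple):
    output_format: str        # value passed to -f
    extra_flags: List[str]    # flags added on the first attempt
    ocr_first: bool = False   # run OCR before converting


# Keys are hypothetical labels for the document classes listed above.
STARTING_ROUTES: Dict[str, Route] = {
    "medical": Route("markdown-with-html",
                     ["--table-method", "cluster", "--image-output", "off"]),
    "slides": Route("markdown-with-html", ["--image-output", "off"]),
    "narrative": Route("markdown", []),
    "table_heavy": Route("markdown-with-html", []),
    # Assumption: the heuristics do not name a format for scanned PDFs.
    "scanned": Route("markdown-with-html", [], ocr_first=True),
    "mixed": Route("markdown-with-html", []),
}


def with_cluster_tables(route: Route) -> Route:
    """Return a retry route that adds --table-method cluster if it is missing."""
    if "--table-method" in route.extra_flags:
        return route
    return route._replace(
        extra_flags=route.extra_flags + ["--table-method", "cluster"]
    )
```

`with_cluster_tables` models the "add only when rows or columns flatten" escalation as a separate step, so the first attempt for slides and table-heavy documents stays unmodified.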
Step 2: Choose the Output Shape
Pick the output that best matches the document and the user's goal.
- markdown-with-html: Use by default when the user wants Markdown and fidelity matters. Prefer this for tables, medical reports, slides, mixed-layout PDFs, and anything likely to break in pure Markdown.
- markdown: Use only when clean plain Markdown matters more than layout fidelity.
- html: Use when visual structure matters more than LLM readability.
- text: Use for quick linear extraction, narrative documents, or when structure is unimportant.
- json: Use when downstream machine processing matters more than human readability.
- docx: Use when the user wants editable office output and layout reconstruction matters.
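As a rough sketch, the mapping from a requested extension to an output shape might look like the function below. `choose_output_shape` and its `fidelity_matters` parameter are hypothetical. "Structured notes" is deliberately left out, since it depends on the user's goal more than on a file extension.

```python
def choose_output_shape(requested: str, fidelity_matters: bool = True) -> str:
    """Map a requested extension or format name to an output shape."""
    key = requested.lower().lstrip(".")
    if key in ("md", "markdown"):
        # Markdown requests default to the higher-fidelity flavor.
        return "markdown-with-html" if fidelity_matters else "markdown"
    shapes = {
        "html": "html",
        "txt": "text",
        "text": "text",
        "json": "json",
        "docx": "docx",
    }
    if key not in shapes:
        raise ValueError(f"unsupported output request: {requested!r}")
    return shapes[key]
```

Raising on unknown requests, rather than silently falling back to Markdown, keeps the choice of output shape an explicit decision.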
Step 3: Choose the Extraction Route
For OpenDataLoader CLI
Use OpenDataLoader as the default route.
Preferred defaults:
- For Markdown output with fidelity priority: -f markdown-with-html
- For medical PDFs: add --table-method cluster
- For table-heavy PDFs: add --table-method cluster
- For slide decks: start without --table-method cluster
  - add it only after a structure check shows meaningful improvement
  - if a pseudo-table is already collapsed inside one detected row, changing only the Markdown flavor usually will not fix it
  - if the active engine build recovers the pseudo-table structure, prefer fixing residual boundary artifacts before escalating to hybrid/full mode
- For conversions where images are not requested: add --image-output off
- For slide decks, medical reports, and structure-sensitive PDFs: validate both that the command succeeded and that the rendered structure is correct
- For reports where exact values matter: validate key sections after conversion instead of trusting the first pass
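The preferred defaults above can be assembled into an argument list programmatically. The sketch below uses only the flags named in this section. `build_command` is a hypothetical helper, and passing the input path as a trailing positional argument is an assumption to check against the installed CLI's help output.

```python
from typing import List


def build_command(input_pdf: str,
                  output_format: str = "markdown-with-html",
                  cluster_tables: bool = False,
                  images_requested: bool = False) -> List[str]:
    """Build an opendataloader-pdf argument list from the defaults above."""
    cmd = ["opendataloader-pdf", "-f", output_format]
    if cluster_tables:
        cmd += ["--table-method", "cluster"]
    if not images_requested:
        cmd += ["--image-output", "off"]
    # Assumption: the input path is a trailing positional argument.
    cmd.append(input_pdf)
    return cmd
```

For example, `build_command("lab.pdf", cluster_tables=True)` yields the medical route, while `build_command("deck.pdf")` yields the slide-deck starting point without table clustering. The list can be passed to `subprocess.run` without a shell.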
For medical or lab PDFs
Default route:
opendataloader-pdf -f markdown-with-html --table-method cluster --image-output off
Then verify:
- main table headers
- attachment of value, unit, and reference range
- legends/comments separated from result rows
If a clinical table is flattened, compare against pdftotext -layout before accepting output.
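The attachment check can be partly automated. In the sketch below, the expected value, unit, and reference range for each exam are read from the `pdftotext -layout` output (or the PDF itself) and then looked up row by row in the converted output. `detached_results` is a hypothetical helper. It matches by plain substring, which is crude: exam names that contain one another, or values that are substrings of other values, will confuse it.

```python
import re
from typing import Dict, List

_TR = re.compile(r"<tr\b.*?</tr\s*>", re.IGNORECASE | re.DOTALL)


def detached_results(converted: str, expected: Dict[str, List[str]]) -> List[str]:
    """Return exam names whose expected tokens are not attached on one row.

    A row is either one <tr>...</tr> element or one line of non-table text
    (which covers Markdown pipe tables). A row that also mentions another
    expected exam is rejected, so a fused table body fails the check.
    """
    html_rows = [re.sub(r"<[^>]+>", " ", row) for row in _TR.findall(converted)]
    other_lines = _TR.sub("\n", converted).splitlines()
    rows = html_rows + other_lines

    failures = []
    for exam, tokens in expected.items():
        others = [name for name in expected if name != exam]
        attached = any(
            exam in row
            and all(token in row for token in tokens)
            and not any(other in row for other in others)
            for row in rows
        )
        if not attached:
            failures.append(exam)
    return failures
```

An empty result does not prove the table is correct. It only shows that the sampled exams kept their value, unit, and reference range together.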
For slide decks
Prefer:
opendataloader-pdf -f markdown-with-html --image-output off
Then check for:
- repeated footers
- page numbers
- diagram pseudo-tables
- orphan symbols and chart labels
If CLI output is still poor, do a cleanup pass tuned for slides instead of assuming the raw extract is final. If the slide contains obvious table-like blocks that are not detected as tables at all, prefer a same-engine retry with a stronger route such as hybrid/full mode before jumping to unrelated extractors. If the slide now produces a real table, validate the first column and header boundaries before assuming the table is fully correct.
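A slide-tuned cleanup pass for repeated footers and page numbers could start from something like the sketch below. It assumes the converted output has already been split into one string per slide. `strip_slide_noise`, the `min_share` threshold, and the page-number pattern are all illustrative assumptions. To stay conservative, only a trailing line is ever treated as a page number, so numeric content inside a slide is left alone.

```python
import re
from collections import Counter
from typing import List

PAGE_NUMBER = re.compile(
    r"^\s*(?:page\s+)?\d+(?:\s*(?:/|of)\s*\d+)?\s*$", re.IGNORECASE
)


def strip_slide_noise(pages: List[str], min_share: float = 0.6) -> List[str]:
    """Drop likely footers and trailing page numbers from per-slide text.

    A line counts as a footer when it appears on at least `min_share` of the
    pages. Decks with fewer than three pages keep repeated lines, since
    repetition there is weak evidence. Trailing blank lines are also dropped.
    """
    counts: Counter = Counter()
    for page in pages:
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    repeated = {
        line for line, n in counts.items()
        if len(pages) >= 3 and n / len(pages) >= min_share
    }

    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines() if line.strip() not in repeated]
        while kept and not kept[-1].strip():
            kept.pop()
        if kept and PAGE_NUMBER.match(kept[-1]):
            kept.pop()  # only the last line is treated as a page number
        cleaned.append("\n".join(kept))
    return cleaned
```

Diagram pseudo-tables and orphan chart labels are not handled here. Those usually need inspection of the rendered structure, not line-level filtering.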
For scanned PDFs
If the text layer is poor or absent:
- run OCR first
- then convert the OCR'd PDF with opendataloader-pdf
Prefer conservative reconstruction over aggressive guessing.
Step 4: Validation Gates
Before claiming success, inspect the output for the patterns most likely to break.
For medical PDFs:
- values attached to correct exam names
- units and reference ranges not merged into neighbors
- comments not merged into rows
For slides:
- bullets normalized
- footers/page numbers removed when they are noise
- diagrams not causing crashes
- remaining tables readable enough to follow
- first column labels not losing their first character at inferred column boundaries
- pseudo-table recovery not breaking row grouping or spilling labels into the next column
For table-heavy documents:
- no catastrophic row flattening
- headers preserved
- repeated empty separator rows minimized
- sparse or single-column tables not accidentally collapsed into prose
- table bodies not fused into a single HTML or Markdown row containing many logical records
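The fused-body case in the last item can be screened for mechanically when the output contains HTML tables. The sketch below uses Python's standard `html.parser`. `fused_tables` and `max_values_per_cell` are hypothetical, the first row of each table is assumed to be the header, and nested tables are not handled.

```python
from html.parser import HTMLParser
from typing import List


class _TableRows(HTMLParser):
    """Collect the cell texts of every <tr>, grouped per <table>."""

    def __init__(self) -> None:
        super().__init__()
        self.tables: List[List[List[str]]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self.tables[-1][-1].append("")
            self._in_cell = True
        elif tag == "br" and self._in_cell:
            self.tables[-1][-1][-1] += " "  # keep values on either side apart

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.tables[-1][-1][-1] += data


def fused_tables(html: str, max_values_per_cell: int = 3) -> List[int]:
    """Return indexes of tables whose body looks fused into a single row."""
    parser = _TableRows()
    parser.feed(html)
    parser.close()
    flagged = []
    for index, rows in enumerate(parser.tables):
        body = rows[1:]  # assumption: the first row is the header
        if len(body) == 1 and any(
            len(cell.split()) > max_values_per_cell for cell in body[0]
        ):
            flagged.append(index)
    return flagged
```

A flagged table is a candidate for a retry with a different table route. A legitimate one-row table with long text cells can also trip the heuristic, so the result should be confirmed by eye.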
For every document class:
- check the first representative section, not just the top of the file
- check one complex section, not only a simple section
- prefer document-level confidence over success on page 1
Red Flags
Treat these as signals that the current output is not ready:
- table rows flattened into long prose lines
- table header looks correct but the entire body is fused into one row with multi-value cells
- labels detached from values
- units or reference ranges drifting into adjacent rows
- repeated page footers or page numbers
- pseudo-tables with mostly empty cells
- legitimate sparse tables collapsed into paragraphs
- single-column tables flattened because they looked "too simple"
- stray symbols, b