docling-skill
Convert local documents into a stable source.* sidecar set for agent consumption. Treat this skill as the ingestion layer, not as ad hoc text extraction.
Preconditions
- If you use the relative command, run from the docling-skill repo root.
- Runtime: conda environment docling, or pip-installed docling-skill CLI.
- Always provide an explicit output directory.
Command
Conda environment:
conda run -n docling python \
-m docling_skill.cli \
"<input_path>" \
"<output_dir>"
Or if installed via pip:
docling-skill "<input_path>" "<output_dir>"
Optional flags:
--ocr-engine auto|tesseract|ocrmac|rapidocr
--ocr-lang <lang> # repeatable or comma-separated
--force-full-page-ocr
--no-ocr-remediation
Inputs:
- input_path: Absolute or repo-relative local document path. Supported inputs: pdf, docx, pptx, xls, xlsx, csv, html, txt, md, png, jpg, jpeg, tif, tiff, bmp, and webp.
- output_dir: Explicit directory where outputs should be written.
Legacy .doc and .ppt files are intentionally rejected. Save them as .docx/.pptx or PDF before ingestion.
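The input rules above can be checked before the CLI is invoked. The following is a minimal sketch, not part of the skill itself: build_command is a hypothetical helper name, and the only flags it emits are the ones listed under "Optional flags". It assumes the pip-installed docling-skill entry point is on PATH.

```python
from pathlib import Path

# Extensions listed under "Supported inputs" above.
SUPPORTED = {"pdf", "docx", "pptx", "xls", "xlsx", "csv", "html", "txt", "md",
             "png", "jpg", "jpeg", "tif", "tiff", "bmp", "webp"}
# Legacy formats that the skill rejects by design.
LEGACY = {"doc", "ppt"}
OCR_ENGINES = {"auto", "tesseract", "ocrmac", "rapidocr"}


def build_command(input_path, output_dir, ocr_engine=None, ocr_langs=()):
    """Hypothetical helper: return the argv list for the pip-installed CLI."""
    ext = Path(input_path).suffix.lower().lstrip(".")
    if ext in LEGACY:
        raise ValueError(f".{ext} is rejected; save as .docx/.pptx or PDF first")
    if ext not in SUPPORTED:
        raise ValueError(f"unsupported input type: .{ext}")
    argv = ["docling-skill", str(input_path), str(output_dir)]
    if ocr_engine is not None:
        if ocr_engine not in OCR_ENGINES:
            raise ValueError(f"unknown OCR engine: {ocr_engine}")
        argv += ["--ocr-engine", ocr_engine]
    for lang in ocr_langs:
        argv += ["--ocr-lang", lang]  # the flag is repeatable
    return argv
```

The returned list can be passed to subprocess.run(argv, check=True); building a list rather than a shell string avoids quoting problems with paths that contain spaces.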
Outputs
The extractor writes:
- source.md
- source.docling.json
- source.images.json
- source.manifest.json
- source.meta.json
Use source.manifest.json before consuming any other output.
Artifact roles:
- source.manifest.json: Quality risk, routing, remediation, preferred_agent_artifact, authoritative_artifact, available_artifacts, selected attempt metadata, and evidence signals.
- source.md: Default agent-readable Markdown. Image placeholders appear as [[image:picture-p3-0]]. Narrow CJK cleanup may be applied here for agent readability.
- source.docling.json: Authoritative structured Docling export from the same conversion result as source.md; use for recovery, machine-readable structure, or deeper inspection. It is not rewritten by the CJK Markdown cleanup.
- source.images.json: Extracted image sidecars with id, placeholder, page_no, bbox, mime_type, and base64 when image extraction is available.
- source.meta.json: Ingestion metadata only: job_id, input_type, source_title, source_url, source_attachment, author, published_at, extractor, pipeline_family, quality_status, quality_reasons, and char_count.
Do not add downstream knowledge fields such as tags, keywords, category, summary, or embeddings to source.meta.json.
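A caller may want to confirm that the sidecar set is complete before reading any of it. The sketch below assumes all five files land directly in the output directory; missing_artifacts is a hypothetical helper name, and manifest["available_artifacts"] remains the source of truth for what a given run actually produced.

```python
from pathlib import Path

# The five sidecars the extractor is documented to write.
EXPECTED = (
    "source.md",
    "source.docling.json",
    "source.images.json",
    "source.manifest.json",
    "source.meta.json",
)


def missing_artifacts(output_dir):
    """Hypothetical helper: list expected sidecars absent from output_dir."""
    out = Path(output_dir)
    return [name for name in EXPECTED if not (out / name).is_file()]
```

An empty list means the full source.* set is present; it says nothing about quality, which still has to be read from the manifest.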
Workflow Boundary
- docling-skill is the ingestion layer, not the full workflow.
- It emits source.* directly instead of <stem>.*.
- It does not do chunking. Chunking belongs to the shared normalize stage after ingestion.
- It does not emit knowledge-base semantic fields.
- It does not fetch remote URLs. Remote acquisition belongs to the fetcher/browser layer upstream.
Manifest Check
Read source.manifest.json before consuming source.md.
Minimum fields to inspect:
- manifest["quality"]["status"]
- manifest["quality"]["risk_level"]
- manifest["quality"]["reasons"]
- manifest["quality"]["warnings"]
- manifest["quality"]["signals"]
- manifest["quality"]["content_trust"]
- manifest["preferred_agent_artifact"]
- manifest["authoritative_artifact"]
- manifest["available_artifacts"]
- manifest["selected_attempt"]
Example inspection command:
python3 -c 'import json, pathlib; p = pathlib.Path("PATH_TO_MANIFEST"); m = json.loads(p.read_text(encoding="utf-8")); q = m["quality"]; print({"status": q["status"], "risk_level": q["risk_level"], "agent_ready": q["agent_ready"], "reasons": q["reasons"], "warnings": q["warnings"], "selected_attempt": m["selected_attempt"], "ocr_remediation_applied": m["ocr_remediation_applied"]})'
Decision Flow
- Resolve the input document path and an explicit output directory.
- Run the extractor.
- Read source.manifest.json before consuming source.md.
- Decide from manifest["quality"]:
  - good with risk_level: low: no hard failure was detected; use source.md as the primary text artifact.
  - good with risk_level: medium: source.md is default-usable, but check warnings and signals before relying on it.
  - salvaged: use source.md, but treat it as OCR-remediated and medium risk.
  - failed_for_agent: do not present it as clean ingestion; report the failure and the manifest reasons.
  - agent_ready: true means source.md is a default agent input; it does not prove semantic fidelity.
  - For Chinese-heavy output, inspect signals.text_normalization and signals.text_integrity for CJK glyph cleanup, bad replacement characters, and formula placeholders.
  - For PDFs with page warnings, inspect signals.page_coverage.failed_pages and signals.page_coverage.first_page_failed; long documents can be medium risk when only isolated pages failed.
  - For text-native inputs, good means minimum usable structure survived in Markdown; it is not just "the parse succeeded" or "the Markdown is non-empty."
  - For docx, html, and md, accept surviving paragraph/body structure, including concise body text, or preserved list structure when the list is the document's real content; txt stays looser.
- Treat manifest["preferred_agent_artifact"] as the default agent entrypoint. In this contract that is always source.md.
- Treat manifest["authoritative_artifact"] as the recovery/deep-inspection artifact. In this contract that is always source.docling.json.
- Check manifest["selected_attempt"] to see which attempt won. A remediation attempt can still end as failed_for_agent.
- If image analysis matters, resolve placeholders through source.images.json.
The automatic quality model is a risk screen, not a semantic audit. Low risk does not prove source fidelity or complete source-to-Markdown alignment.
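The status and risk rules in the decision flow can be sketched as a small routing function. This is an illustration under assumptions rather than the skill's own logic: decide and its return labels are hypothetical, and the status and risk_level strings are taken as written in this document.

```python
def decide(manifest):
    """Hypothetical helper: map manifest["quality"] to a coarse action label."""
    quality = manifest["quality"]
    status = quality["status"]
    risk = quality.get("risk_level")

    if status == "failed_for_agent":
        # Never present this as clean ingestion; surface the manifest reasons.
        return "report_failure"
    if status == "salvaged":
        # Usable, but OCR-remediated and medium risk.
        return "use_md_as_ocr_remediated"
    if status == "good" and risk == "low":
        return "use_md"
    if status == "good" and risk == "medium":
        # Default-usable; check warnings and signals before relying on it.
        return "use_md_after_checking_warnings"
    # Any combination not described above is left to a human or caller.
    return "inspect_manually"
```

Keeping an explicit fallback label avoids silently treating an unfamiliar status as usable. Even "use_md" only reflects the risk screen and is not evidence of semantic fidelity.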
Images
When analysis depends on a specific figure or chart:
- Find the placeholder in source.md, for example [[image:picture-p2-1]].
- Look up the matching entry in source.images.json by id or placeholder.
- Pass the corresponding base64 image through the current runtime's supported multimodal input path.
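The lookup steps above might be sketched as follows. The code assumes that source.images.json holds a JSON list of entries, and that an entry's placeholder field may contain either the full [[image:...]] token or the bare identifier; both are assumptions, so the lookup tries each form. resolve_images is a hypothetical helper name.

```python
import base64
import re

# Matches tokens such as [[image:picture-p2-1]] and captures the identifier.
PLACEHOLDER_RE = re.compile(r"\[\[image:([^\]\s]+)\]\]")


def resolve_images(markdown, image_entries):
    """Hypothetical helper: map each placeholder token in the Markdown
    to decoded image bytes, skipping entries without base64 data."""
    by_key = {}
    for entry in image_entries:
        for key in (entry.get("id"), entry.get("placeholder")):
            if key:
                by_key[key] = entry

    resolved = {}
    for match in PLACEHOLDER_RE.finditer(markdown):
        token, image_id = match.group(0), match.group(1)
        entry = by_key.get(token) or by_key.get(image_id)
        if entry and entry.get("base64"):
            resolved[token] = base64.b64decode(entry["base64"])
    return resolved
```

Placeholders without a matching entry, or whose entry has no base64 payload, are left out of the result so the caller can tell which figures are unavailable.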
Image handling notes:
- Embedded images in local PDFs are supported.
- Common local image files (png, jpg, jpeg, tif, tiff, bmp, webp) are supported through Docling's native image input.
- Image-only outputs with no usable OCR text should be treated as high risk when quality.status is failed_for_agent.
- Image extraction is not universal across all supported formats.
- HTML and webpage image capture should be owned by the fetcher/browser layer, not this ingestion step.
Spreadsheets
For xls, xlsx, and csv inputs:
- Treat source.md as a readable preview.
- Use source.docling.json as the required authoritative artifact when merged cells, multi-row headers, multiple sheets, table spans, or cell offsets matter.
- Check manifest["spreadsheet"] for source_format, sheet_count, table_count, merged_cell_count, has_merged_cells, and has_multi_sheet. normalized_from is conditional and appears only when a source format was normalized before ingestion, for example from xls to xlsx.
- Do not infer merged or nested table semantics from Markdown alone; Markdown may flatten or visually repeat merged values.
- Formula evaluation is not guaranteed; spreadsheets that depend on recalculation or contain stale cached formula values should be manually preprocessed into clean xlsx or csv before ingestion.
- Macro-enabled workbooks (xlsm), password-protected files, corrupt files, chart/image semantics, and unusually complex workbooks should be manually preprocessed into clean xlsx or csv before ingestion.
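One way to apply the spreadsheet guidance is to route on the manifest["spreadsheet"] fields. The sketch below is an assumption-laden illustration: needs_structured_artifact is a hypothetical helper, and it only covers merged cells and multiple sheets, because the listed fields do not expose multi-row headers, table spans, or cell offsets.

```python
def needs_structured_artifact(manifest):
    """Hypothetical helper: True when the spreadsheet block suggests the
    Markdown preview may have flattened structure that matters."""
    info = manifest.get("spreadsheet") or {}
    merged = bool(info.get("has_merged_cells")) or (info.get("merged_cell_count") or 0) > 0
    multi_sheet = bool(info.get("has_multi_sheet")) or (info.get("sheet_count") or 1) > 1
    return merged or multi_sheet
```

When this returns True, read source.docling.json instead of inferring table semantics from source.md. A False result does not rule out the structural cases the manifest fields cannot show.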
Example listing command for source.images.json entries:
python3 -c 'import json, pathlib; imgs = json.loads(pathlib.Path("PATH_TO_IMAGES_JSON").read_text(encoding="utf-8")); [print({"placeholder": img["placeholder"], "page_no": img["page_no"], "base64_len": len(img["base64"])}) for im