docling-skill

Convert local documents into a stable source.* sidecar set for agent consumption. Treat this skill as the ingestion layer, not as ad hoc text extraction.

Preconditions

If you use the conda command (python -m docling_skill.cli), run it from the docling-skill repo root.
Runtime: conda environment docling, or pip-installed docling-skill CLI.
Always provide an explicit output directory.

Command

Conda environment:

conda run -n docling python \
  -m docling_skill.cli \
  "<input_path>" \
  "<output_dir>"

Or if installed via pip:

docling-skill "<input_path>" "<output_dir>"

Optional flags:

--ocr-engine auto|tesseract|ocrmac|rapidocr
--ocr-lang <lang>          # repeatable or comma-separated
--force-full-page-ocr
--no-ocr-remediation

Inputs:

input_path: Absolute or repo-relative local document path. Supported inputs: pdf, docx, pptx, xls, xlsx, csv, html, txt, md, png, jpg, jpeg, tif, tiff, bmp, and webp.
output_dir: Explicit directory where outputs should be written.

Legacy .doc and .ppt files are intentionally rejected. Save them as .docx/.pptx or PDF before ingestion.
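
The input rules and optional flags above can be sketched as a small caller-side helper. This is a minimal sketch, not part of docling-skill: build_command and the SUPPORTED and LEGACY sets are hypothetical names, and placing the flags after the two positional arguments is an assumption. The extractor itself still rejects .doc and .ppt as noted above; the check here only fails earlier.

```python
from pathlib import Path

# Extensions copied from the "Supported inputs" list above.
SUPPORTED = {
    "pdf", "docx", "pptx", "xls", "xlsx", "csv", "html", "txt", "md",
    "png", "jpg", "jpeg", "tif", "tiff", "bmp", "webp",
}
LEGACY = {"doc", "ppt"}


def build_command(input_path, output_dir, ocr_engine=None, ocr_langs=(),
                  force_full_page_ocr=False, ocr_remediation=True):
    """Return an argv list for the pip-installed CLI (hypothetical helper)."""
    ext = Path(input_path).suffix.lower().lstrip(".")
    if ext in LEGACY:
        raise ValueError(f".{ext} is rejected; save as .docx/.pptx or PDF first")
    if ext not in SUPPORTED:
        raise ValueError(f"unsupported input type: {ext!r}")
    argv = ["docling-skill", str(input_path), str(output_dir)]
    if ocr_engine:
        argv += ["--ocr-engine", ocr_engine]
    if ocr_langs:
        # --ocr-lang accepts a comma-separated value per the flag list above.
        argv += ["--ocr-lang", ",".join(ocr_langs)]
    if force_full_page_ocr:
        argv.append("--force-full-page-ocr")
    if not ocr_remediation:
        argv.append("--no-ocr-remediation")
    return argv
```

The returned list can be passed to subprocess.run(argv, check=True) so that a non-zero exit raises instead of being silently ignored.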

Outputs

The extractor writes:

source.md
source.docling.json
source.images.json
source.manifest.json
source.meta.json

Use source.manifest.json before consuming any other output.
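
Before reading the manifest, a caller may want to confirm that the whole sidecar set was written. A minimal sketch, assuming all five files are always expected in the output directory; missing_outputs and EXPECTED_OUTPUTS are hypothetical names, not part of the CLI:

```python
from pathlib import Path

# File names copied from the output list above.
EXPECTED_OUTPUTS = (
    "source.md",
    "source.docling.json",
    "source.images.json",
    "source.manifest.json",
    "source.meta.json",
)


def missing_outputs(output_dir):
    """Return the expected source.* files that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_OUTPUTS if not (root / name).is_file()]
```

An empty list means every expected file exists; it says nothing about quality, which is still decided from source.manifest.json.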

Artifact roles:

source.manifest.json: Quality risk, routing, remediation, preferred_agent_artifact, authoritative_artifact, available_artifacts, selected attempt metadata, and evidence signals.
source.md: Default agent-readable Markdown. Image placeholders appear as [[image:picture-p3-0]]. Narrow CJK cleanup may be applied here for agent readability.
source.docling.json: Authoritative structured Docling export from the same conversion result as source.md; use for recovery, machine-readable structure, or deeper inspection. It is not rewritten by the CJK Markdown cleanup.
source.images.json: Extracted image sidecars with id, placeholder, page_no, bbox, mime_type, and base64 when image extraction is available.
source.meta.json: Ingestion metadata only: job_id, input_type, source_title, source_url, source_attachment, author, published_at, extractor, pipeline_family, quality_status, quality_reasons, and char_count.

Do not add downstream knowledge fields such as tags, keywords, category, summary, or embeddings to source.meta.json.
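
One way to enforce this rule downstream is to compare the keys of source.meta.json against the ingestion fields listed above. This is an illustrative sketch: unexpected_meta_keys and ALLOWED_META_KEYS are hypothetical names, and the allowed set is simply copied from the source.meta.json role description, so it would need updating if the contract adds fields.

```python
# Ingestion-only fields, copied from the source.meta.json description above.
ALLOWED_META_KEYS = {
    "job_id", "input_type", "source_title", "source_url",
    "source_attachment", "author", "published_at", "extractor",
    "pipeline_family", "quality_status", "quality_reasons", "char_count",
}


def unexpected_meta_keys(meta):
    """Return, sorted, any top-level keys that are not ingestion metadata."""
    return sorted(set(meta) - ALLOWED_META_KEYS)
```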

Workflow Boundary

docling-skill is the ingestion layer, not the full workflow.
It always names its outputs source.*; it does not derive <stem>.* names from the input file.
It does not do chunking. Chunking belongs to the shared normalize stage after ingestion.
It does not emit knowledge-base semantic fields.
It does not fetch remote URLs. Remote acquisition belongs to the fetcher/browser layer upstream.

Manifest Check

Read source.manifest.json before consuming source.md.

Minimum fields to inspect:

manifest["quality"]["status"]
manifest["quality"]["risk_level"]
manifest["quality"]["reasons"]
manifest["quality"]["warnings"]
manifest["quality"]["signals"]
manifest["quality"]["content_trust"]
manifest["preferred_agent_artifact"]
manifest["authoritative_artifact"]
manifest["available_artifacts"]
manifest["selected_attempt"]

python3 -c 'import json, pathlib; p = pathlib.Path("PATH_TO_MANIFEST"); m = json.loads(p.read_text(encoding="utf-8")); q = m["quality"]; print({"status": q["status"], "risk_level": q["risk_level"], "agent_ready": q["agent_ready"], "reasons": q["reasons"], "warnings": q["warnings"], "selected_attempt": m["selected_attempt"], "ocr_remediation_applied": m["ocr_remediation_applied"]})'

Decision Flow

Resolve the input document path and an explicit output directory.
Run the extractor.
Read source.manifest.json before consuming source.md.
Decide from manifest["quality"]:
- good with risk_level: low: no hard failure was detected; use source.md as the primary text artifact.
- good with risk_level: medium: source.md is default-usable, but check warnings and signals before relying on it.
- salvaged: use source.md, but treat it as OCR-remediated and medium risk.
- failed_for_agent: do not present it as clean ingestion; report the failure and the manifest reasons.
- agent_ready: true means source.md is a default agent input; it does not prove semantic fidelity.
- For Chinese-heavy output, inspect signals.text_normalization and signals.text_integrity for CJK glyph cleanup, bad replacement characters, and formula placeholders.
- For PDFs with page warnings, inspect signals.page_coverage.failed_pages and signals.page_coverage.first_page_failed; long documents can be medium risk when only isolated pages failed.
- For text-native inputs, good means minimum usable structure survived in Markdown; it is not just "the parse succeeded" or "the Markdown is non-empty."
- For docx, html, and md, accept surviving paragraph/body structure, including concise body text, or preserved list structure when the list is the document's real content; txt stays looser.
Treat manifest["preferred_agent_artifact"] as the default agent entrypoint. In this contract that is always source.md.
Treat manifest["authoritative_artifact"] as the recovery/deep-inspection artifact. In this contract that is always source.docling.json.
Check manifest["selected_attempt"] to see which attempt won. A remediation attempt can still end as failed_for_agent.
If image analysis matters, resolve placeholders through source.images.json.

The automatic quality model is a risk screen, not a semantic audit. Low risk does not prove source fidelity or complete source-to-Markdown alignment.
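
The status and risk branches of the decision flow can be sketched as a small routing function. This is only an illustration under stated assumptions: route and the returned action labels are hypothetical, and any status or risk combination not listed above falls through to manual inspection.

```python
def route(manifest):
    """Map manifest["quality"] to a coarse action label (labels are hypothetical)."""
    quality = manifest.get("quality", {})
    status = quality.get("status")
    risk = quality.get("risk_level")

    if status == "failed_for_agent":
        # Do not present as clean ingestion; surface quality["reasons"].
        return "report_failure"
    if status == "salvaged":
        # OCR-remediated output: usable, but treat as medium risk.
        return "use_with_caution"
    if status == "good" and risk == "low":
        return "use"
    if status == "good" and risk == "medium":
        # Usable by default; check warnings and signals first.
        return "use_with_caution"
    # Anything the decision flow does not cover is left to a human.
    return "inspect_manually"
```

Even a "use" result only means the risk screen found no hard failure; it is not evidence of semantic fidelity.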

Images

When analysis depends on a specific figure or chart:

Find the placeholder in source.md, for example [[image:picture-p2-1]].
Look up the matching entry in source.images.json by id or placeholder.
Pass the corresponding base64 image through the current runtime's supported multimodal input path.
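
The three steps above can be sketched as follows. This is a minimal sketch with assumptions: find_placeholders and resolve_image are hypothetical helpers, the entries argument is assumed to be a list of dicts carrying the fields named in the source.images.json description, and the id is assumed to match the text inside the placeholder. Matching on either id or placeholder hedges against that last assumption.

```python
import base64
import re

# Matches tokens such as [[image:picture-p2-1]] and captures the inner id.
PLACEHOLDER_RE = re.compile(r"\[\[image:([^\]]+)\]\]")


def find_placeholders(markdown):
    """Return the image ids referenced in the Markdown, in order of appearance."""
    return PLACEHOLDER_RE.findall(markdown)


def resolve_image(image_id, entries):
    """Return (mime_type, raw_bytes) for image_id, or None if no entry matches.

    raw_bytes is None when the entry has no base64 payload.
    """
    token = f"[[image:{image_id}]]"
    for entry in entries:
        if entry.get("id") == image_id or entry.get("placeholder") == token:
            payload = entry.get("base64")
            data = base64.b64decode(payload) if payload else None
            return entry.get("mime_type"), data
    return None
```

The decoded bytes, or the original base64 string, can then be handed to whatever multimodal input path the current runtime supports.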

Image handling notes:

Embedded images in local PDFs are supported.
Common local image files (png, jpg, jpeg, tif, tiff, bmp, webp) are supported through Docling's native image input.
Image-only outputs with no usable OCR text should be treated as high risk when quality.status is failed_for_agent.
Image extraction is not universal across all supported formats.
HTML and webpage image capture should be owned by the fetcher/browser layer, not this ingestion step.

Spreadsheets

For xls, xlsx, and csv inputs:

Treat source.md as a readable preview.
Use source.docling.json as the required authoritative artifact when merged cells, multi-row headers, multiple sheets, table spans, or cell offsets matter.
Check manifest["spreadsheet"] for source_format, sheet_count, table_count, merged_cell_count, has_merged_cells, and has_multi_sheet. normalized_from is conditional and appears only when a source format was normalized before ingestion, for example from xls to xlsx.
Do not infer merged or nested table semantics from Markdown alone; Markdown may flatten or visually repeat merged values.
Formula evaluation is not guaranteed; spreadsheets that depend on recalculation or contain stale cached formula values should be manually preprocessed into clean xlsx or csv before ingestion.
Macro-enabled workbooks (xlsm), password-protected files, corrupt files, chart/image semantics, and unusually complex workbooks should be manually preprocessed into clean xlsx or csv before ingestion.
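
The manifest["spreadsheet"] fields can drive the choice between the Markdown preview and the structured export. A minimal sketch, assuming the field names listed above; needs_structured_artifact is a hypothetical helper. It only covers merged cells and multiple sheets, because multi-row headers, table spans, and cell offsets are not exposed by the fields named here and still need a judgment call.

```python
def needs_structured_artifact(manifest):
    """Return True when the manifest suggests source.md may flatten structure."""
    info = manifest.get("spreadsheet") or {}
    return bool(
        info.get("has_merged_cells")
        or info.get("has_multi_sheet")
        or (info.get("merged_cell_count") or 0) > 0
        or (info.get("sheet_count") or 1) > 1
    )
```

When this returns True, read source.docling.json rather than inferring table semantics from source.md.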

Example command for listing image entries in source.images.json (assumes the file is a top-level list of entries):

python3 -c 'import json, pathlib; imgs = json.loads(pathlib.Path("PATH_TO_IMAGES_JSON").read_text(encoding="utf-8")); [print({"placeholder": img["placeholder"], "page_no": img["page_no"], "base64_len": len(img["base64"])}) for img in imgs]'