DOCX Template Translator

Core Idea

Treat the input file as the content source and the Word template as the formatting source. Do not expect pandoc or PDF import to infer template semantics. Build a project-specific Python postprocessor after inspecting the template and the converted body document.

Do not treat the bundled starter pipeline or a preset JSON file as a finished converter for institutional templates. For thesis/dissertation templates, you must create or patch a project-specific pipeline for the concrete template and source project before claiming success.

Workflow

Identify inputs:
- Source: .tex project, .pdf, .md, or an existing rough .docx.
- Template: required .docx.
- Output location and document metadata.
Inspect the template with the real CLI form:
- python scripts/inspect_docx_template.py template.docx --out template_report.json
Create a rough body .docx:
- LaTeX/Markdown: use pandoc when available.
- PDF: try Word COM import or pdf2docx; use a PDF as the source only when the original source is unavailable.
- Existing DOCX: use it as the rough body source.
Write or patch a project-specific Python pipeline:
- Start from scripts/adaptive_docx_pipeline.py.
- Copy it into the run/output directory or project workspace before patching; do not edit the bundled script in place for a one-off conversion.
- Decide, from the template inspection, which template paragraphs/tables/sections are reusable and which are sample placeholders to delete.
- Mark protected native-template regions before coding. For thesis templates, cover pages, English cover pages, originality/declaration pages, authorization pages, signatures, and their section breaks are protected by default until the first generated abstract/body marker.
- Replace or fill template front matter such as cover pages, declarations, abstracts, keywords, TOC placeholders, headers, footers, page numbering, and section breaks when the source provides those fields.
- In protected regions, replace text inside existing paragraphs/runs/tables without deleting and rebuilding the paragraph. Preserve paragraph styles, run fonts/sizes/bold, alignment, spacing, and page breaks unless the user explicitly asks to alter the template.
- Insert the rough body at the real body start or rebuild the document around the template parts. Do not blindly append the rough body to the end of the template.
- Copy template front matter if needed.
- Append rough body content while remapping DOCX relationships.
- Remap copied style IDs by visible style name before applying formatting; otherwise Heading 1/2/3 can silently become an unrelated template style when source and template style IDs collide.
- Remap styles to the template's real body, heading, caption, reference, and TOC styles.
- Scope global formatting passes to generated content only, for example with formatting_start_marker. Never run body-style remapping across native cover/declaration pages.
- Clean or rebuild section header/footer references when deleting sample template sections; stale back-matter headers such as 致谢 (Acknowledgements) must not appear on body pages.
- Add or repair figure/table captions, table borders, hyperlinks, bookmarks, citations, and page breaks.
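
The style-ID remapping step above can be sketched as follows. This is a minimal illustration using only the standard library, not the logic of scripts/adaptive_docx_pipeline.py; the function names are hypothetical, and it assumes you have already read word/styles.xml from both the rough body and the template.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def style_names_by_id(styles_xml):
    """Map each w:styleId to its lower-cased visible name (w:name/@w:val)."""
    root = ET.fromstring(styles_xml)
    mapping = {}
    for style in root.iter(f"{W}style"):
        style_id = style.get(f"{W}styleId")
        name = style.find(f"{W}name")
        if style_id and name is not None:
            mapping[style_id] = name.get(f"{W}val", "").lower()
    return mapping


def build_style_id_remap(source_styles_xml, template_styles_xml):
    """Return {source styleId: template styleId} for styles that share a visible name."""
    source = style_names_by_id(source_styles_xml)
    template_by_name = {
        name: style_id
        for style_id, name in style_names_by_id(template_styles_xml).items()
    }
    return {
        style_id: template_by_name[name]
        for style_id, name in source.items()
        if name in template_by_name
    }


def remap_paragraph_styles(body_root, remap):
    """Rewrite w:pStyle/@w:val in place; return how many references changed."""
    changed = 0
    for p_style in body_root.iter(f"{W}pStyle"):
        old = p_style.get(f"{W}val")
        new = remap.get(old)
        if new and new != old:
            p_style.set(f"{W}val", new)
            changed += 1
    return changed
```

Apply the remap to the rough body before it is merged into the template, so that a source ID such as Heading1 is resolved by name rather than landing on whatever the template happens to call Heading1. Run styles (w:rStyle) and table styles (w:tblStyle) would need the same treatment; they are omitted here for brevity.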
Finalize with Microsoft Word when available:
- Use scripts/finalize_word_docx.py to update fields/TOC and export a PDF preview.
Automated and visual verification:
- Run scripts/validate_docx_conversion.py final.docx --template template.docx --protected-until "中文摘要" --pdf final.pdf --out validation.json for placeholder, order, header, image, and table checks plus protected-front-matter format checks. For non-Zhengzhou templates, replace "中文摘要" (Chinese abstract) with the template's real first generated marker.
- Then run scripts/validate_docx_render.py final.docx --pdf final.pdf --out validation_render.json for render-level checks: TOC field presence, numId↔abstractNum consistency, multilevel heading format, reference-counter independence, body-header static-text leakage, and PDF field-error strings. The structural validator can return PASS while the document is visibly broken; the render validator is what catches "empty TOC", "chapters not auto-numbered", "references start at [47]", "body header still says 致谢", and "STYLEREF prints 错误!使用'开始'选项卡…".
- Use scripts/render_pdf_preview.py to inspect cover pages, abstracts, TOC, representative tables, figures, formulas, and references.
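
The two validator invocations above can be chained from a small driver. This is a sketch only: the command lines are copied from the workflow, but the helper names are hypothetical, and treating a non-zero exit code as a failed check is an assumption the validators' documentation should confirm. It also assumes the working directory contains scripts/.

```python
import subprocess
import sys


def validation_commands(final_docx, template_docx, final_pdf, protected_until):
    """Build the structural and render validator command lines as argument lists."""
    return [
        [sys.executable, "scripts/validate_docx_conversion.py", final_docx,
         "--template", template_docx,
         "--protected-until", protected_until,
         "--pdf", final_pdf,
         "--out", "validation.json"],
        [sys.executable, "scripts/validate_docx_render.py", final_docx,
         "--pdf", final_pdf,
         "--out", "validation_render.json"],
    ]


def run_validators(commands, runner=subprocess.run):
    """Run every validator, even after a failure, and return (script, returncode) pairs.

    Assumption: a non-zero return code means the check did not pass.
    """
    results = []
    for command in commands:
        completed = runner(command)
        results.append((command[1], completed.returncode))
    return results
```

Both validators run unconditionally because the structural check can pass while the render check fails. Typical use would be `run_validators(validation_commands("final.docx", "template.docx", "final.pdf", "中文摘要"))`, followed by reading the two JSON reports.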

Mandatory Quality Gate

Before reporting success, run an automated and visual QA pass. If any check fails, patch the project-specific pipeline and rerun; do not present the output as complete.

Confirm the rough body is not appended after a back-matter placeholder such as 致谢, Acknowledgements, 参考文献, or sample appendices.
Confirm template placeholder text is gone or intentionally preserved. Common failures include names like 李四, 王五, 张三, red formatting instructions, lorem ipsum, sample chapter headings, and template-only reference lists.
Confirm source metadata and source front matter replaced the template placeholders: title, author, advisor, major/department, date, Chinese abstract, English abstract, keywords, declarations when applicable.
Confirm protected front matter still matches the template's formatting. Content may change, but cover/declaration/signature pages must preserve paragraph styles, run-level fonts/sizes/bold, spacing, alignment, and page-break structure unless explicitly modified.
Confirm TOC entries point to the generated source chapters, not only to the template's sample chapters.
Confirm heading paragraphs are still heading styles after OOXML insertion; style ID collisions must not break TOC generation.
Confirm body pages use the intended body style and do not inherit the last template section's header/footer.
Confirm representative images, formulas, tables, captions, references, and citations survive the reconstruction.
Record failures in the run report with PASS/FAIL/PARTIAL wording and concrete evidence.
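
The placeholder check in this gate can be approximated with a short scan of the main document part. The sketch below is not the bundled validator: the function names and the default placeholder list are illustrative (the list is taken from the examples above), and it reads only word/document.xml, so headers, footers, and footnotes would need the same pass.

```python
import xml.etree.ElementTree as ET
import zipfile

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
DEFAULT_PLACEHOLDERS = ("李四", "王五", "张三", "lorem ipsum")


def paragraph_texts(document_xml):
    """Return the visible text of each w:p, joining all w:t runs inside it."""
    root = ET.fromstring(document_xml)
    return [
        "".join(t.text or "" for t in paragraph.iter(f"{W}t"))
        for paragraph in root.iter(f"{W}p")
    ]


def find_placeholders(docx, placeholders=DEFAULT_PLACEHOLDERS):
    """Return (paragraph index, placeholder, paragraph text) for every hit.

    `docx` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        texts = paragraph_texts(archive.read("word/document.xml"))
    hits = []
    for index, text in enumerate(texts):
        lowered = text.lower()
        for needle in placeholders:
            if needle.lower() in lowered:
                hits.append((index, needle, text))
    return hits
```

Text is joined per paragraph before matching because Word often splits a single name across several runs. An empty result is evidence for a PASS on this item only; red formatting instructions and sample headings still need the visual pass.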

Render-level Quality Gate (`validate_docx_render.py`)

The structural quality gate above checks counts and presence. It can return PASS while the rendered Word/PDF is visibly broken because pandoc-derived DOCX bodies often ship with a TOC paragraph that has no field, a Heading 1 style with no <w:numPr>, a numId rebound to a single-level abstract during reference repair, or a body section header whose static text is "致谢". Run validate_docx_render.py after validate_docx_conversion.py to catch those:

TOC field presence: the document must contain <w:fldChar w:fldCharType="begin"> plus a <w:instrText> holding a TOC instruction. If the field is absent, Word's "update fields" cannot populate a non-existent TOC. Use scripts/inject_toc_field.py to add one before finalization.
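
A minimal sketch of the TOC presence test, assuming the contents of word/document.xml have already been read. The function name is hypothetical and this is not the code of validate_docx_render.py; it only looks for a complex field whose instruction text starts with TOC.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def has_toc_field(document_xml):
    """Return True if the part has a field 'begin' marker and a TOC instruction."""
    root = ET.fromstring(document_xml)
    has_begin = any(
        field_char.get(f"{W}fldCharType") == "begin"
        for field_char in root.iter(f"{W}fldChar")
    )
    has_toc_instruction = any(
        (instruction.text or "").strip().upper().startswith("TOC")
        for instruction in root.iter(f"{W}instrText")
    )
    return has_begin and has_toc_instruction
```

A heading paragraph that merely says 目录 or "Contents" returns False here, which is the case pandoc-derived bodies tend to produce.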
numId ↔ abstractNum consistency: every (numId, ilvl) pair used by a paragraph or by a style's <w:numPr> must resolve to a defined <w:lvl ilvl=N> inside the bound abstract numbering. Missing levels silently fall back to level 0 — that is how 1.1 / 1.1.1 headings collapse to [1] after a reference repair re-points numId=1 at a single-level abstract.
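
The consistency rule can be sketched as a lookup from each numId to the levels its abstract numbering defines. The names below are hypothetical and the sketch simplifies in two ways that should be treated as assumptions: a missing w:ilvl is read as level 0, and w:lvlOverride entries inside w:num are ignored.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def levels_by_num_id(numbering_xml):
    """Map each numId to the set of ilvl values defined by its abstract numbering."""
    root = ET.fromstring(numbering_xml)
    abstract_levels = {
        abstract.get(f"{W}abstractNumId"): {
            int(level.get(f"{W}ilvl")) for level in abstract.findall(f"{W}lvl")
        }
        for abstract in root.findall(f"{W}abstractNum")
    }
    result = {}
    for num in root.findall(f"{W}num"):
        reference = num.find(f"{W}abstractNumId")
        if reference is not None:
            result[num.get(f"{W}numId")] = abstract_levels.get(
                reference.get(f"{W}val"), set()
            )
    return result


def unresolved_numbering_pairs(part_xml, numbering_xml):
    """Return sorted (numId, ilvl) pairs used in part_xml that resolve to no w:lvl."""
    levels = levels_by_num_id(numbering_xml)
    missing = set()
    for num_pr in ET.fromstring(part_xml).iter(f"{W}numPr"):
        num_id_element = num_pr.find(f"{W}numId")
        if num_id_element is None:
            continue
        num_id = num_id_element.get(f"{W}val")
        if num_id == "0":  # numId 0 switches numbering off
            continue
        ilvl_element = num_pr.find(f"{W}ilvl")
        ilvl = int(ilvl_element.get(f"{W}val")) if ilvl_element is not None else 0
        if ilvl not in levels.get(num_id, set()):
            missing.add((num_id, ilvl))
    return sorted(missing)
```

Because the function walks every w:numPr in the part it is given, it can be called once with word/document.xml and once with word/styles.xml to cover both paragraph-level and style-level bindings.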
Multilevel heading format: the abstract numbering bound to Heading 1 (whether at style level or via inline numPr on body H1 paragraphs) must have lvlText matching the user-supplied chapter prefix pattern (default 第%1章 or Chapter %1) at level 0 and a multilevel pattern (default contains both %1 and %2) at levels 1/2. Configure with --chapter-prefix-pattern and --multilevel-pattern for non-default templates.
Reference counter independence: any non-heading paragraph appearing after the last 参考文献 / References Heading 1 must not reuse a numId already used by Heading 1/2/3. This is the bug where 33 references render as [47]–[79] because their counter was shared with H2/H3 paragraphs upstream.
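
A rough sketch of the reference-counter check, under stated assumptions: the heading style IDs are passed in by the caller (the defaults below are illustrative, not the template's real IDs), and only inline w:numPr on paragraphs is inspected, so numbering inherited from a style definition is not seen. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def shared_reference_num_ids(
    document_xml,
    h1_style_id="Heading1",
    heading_style_ids=("Heading1", "Heading2", "Heading3"),
    reference_titles=("参考文献", "References"),
):
    """Return numIds used both by headings and by paragraphs after the last references H1."""
    paragraphs = list(ET.fromstring(document_xml).iter(f"{W}p"))

    def style_of(paragraph):
        element = paragraph.find(f"{W}pPr/{W}pStyle")
        return element.get(f"{W}val") if element is not None else None

    def num_id_of(paragraph):
        element = paragraph.find(f"{W}pPr/{W}numPr/{W}numId")
        return element.get(f"{W}val") if element is not None else None

    def text_of(paragraph):
        return "".join(t.text or "" for t in paragraph.iter(f"{W}t")).strip()

    last_reference_heading = None
    for index, paragraph in enumerate(paragraphs):
        if style_of(paragraph) == h1_style_id and text_of(paragraph) in reference_titles:
            last_reference_heading = index
    if last_reference_heading is None:
        return set()

    heading_ids = {
        num_id_of(p) for p in paragraphs if style_of(p) in heading_style_ids
    } - {None}
    reference_ids = {
        num_id_of(p)
        for p in paragraphs[last_reference_heading + 1:]
        if style_of(p) not in heading_style_ids
    } - {None}
    return heading_ids & reference_ids
```

A non-empty result means the reference list shares a counter with the headings and will continue their numbering instead of starting at [1].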
Body header is not a back-matter literal: for every body section that uses a `<w:he
