DOCX Template Translator
Core Idea
Treat the input file as the content source and the Word template as the formatting source. Do not expect pandoc or PDF import to infer template semantics. Build a project-specific Python postprocessor after inspecting the template and the converted body document.
Do not treat the bundled starter pipeline or a preset JSON file as a finished converter for institutional templates. For thesis/dissertation templates, you must create or patch a project-specific pipeline for the concrete template and source project before claiming success.
Workflow
- Identify inputs:
  - Source: `.tex` project, `.pdf`, `.md`, or an existing rough `.docx`.
  - Template: required `.docx`.
  - Output location and document metadata.
- Inspect the template with the real CLI form:
python scripts/inspect_docx_template.py template.docx --out template_report.json
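As a rough illustration of the kind of information a template inspection gathers, the sketch below lists style IDs and their visible names from `word/styles.xml` using only the standard library. It is not the bundled `inspect_docx_template.py`; the helper name `list_template_styles` and the shape of its return value are assumptions made for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's "{uri}" form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def list_template_styles(docx_path):
    """Return {style_id: (style_type, visible_name)} read from word/styles.xml.

    Hypothetical helper for illustration; docx_path may be a path or a
    binary file-like object.
    """
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/styles.xml"))
    styles = {}
    for style in root.iter(W + "style"):
        style_id = style.get(W + "styleId")
        name_el = style.find(W + "name")
        name = name_el.get(W + "val") if name_el is not None else ""
        styles[style_id] = (style.get(W + "type"), name)
    return styles
```

Listing IDs next to visible names makes it easier to spot templates whose style IDs are opaque (for example numeric) while the visible name is the familiar `heading 1`.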
- Create a rough body `.docx`:
  - LaTeX/Markdown: use pandoc when available.
  - PDF: try Word COM import or `pdf2docx`; prefer PDF only when the original source is unavailable.
  - Existing DOCX: use it as the rough body source.
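For the LaTeX/Markdown branch, a minimal sketch of the pandoc call might look like the following. The function names are hypothetical, and the command uses only pandoc's basic `input -o output` form, leaving pandoc to infer formats from the file extensions; any extra options a given project needs are not shown.

```python
import shutil
import subprocess
from pathlib import Path


def build_pandoc_command(source, rough_docx):
    """Build the argv list for converting a .tex or .md source to a rough DOCX."""
    source = Path(source)
    if source.suffix.lower() not in {".tex", ".md"}:
        raise ValueError(f"pandoc branch expects .tex or .md, got {source.suffix!r}")
    return ["pandoc", str(source), "-o", str(rough_docx)]


def make_rough_body(source, rough_docx):
    """Run pandoc if it is on PATH; raise so the caller can pick another route."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    subprocess.run(build_pandoc_command(source, rough_docx), check=True)
    return Path(rough_docx)
```

Keeping the command builder separate from the subprocess call lets the run report record exactly which command produced the rough body.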
- Write or patch a project-specific Python pipeline:
  - Start from `scripts/adaptive_docx_pipeline.py`.
  - Copy it into the run/output directory or project workspace before patching; do not edit the bundled script in place for a one-off conversion.
  - Decide, from the template inspection, which template paragraphs/tables/sections are reusable and which are sample placeholders to delete.
  - Mark protected native-template regions before coding. For thesis templates, cover pages, English cover pages, originality/declaration pages, authorization pages, signatures, and their section breaks are protected by default until the first generated abstract/body marker.
  - Replace or fill template front matter such as cover pages, declarations, abstracts, keywords, TOC placeholders, headers, footers, page numbering, and section breaks when the source provides those fields.
  - In protected regions, replace text inside existing paragraphs/runs/tables without deleting and rebuilding the paragraph. Preserve paragraph styles, run fonts/sizes/bold, alignment, spacing, and page breaks unless the user explicitly asks to alter the template.
  - Insert the rough body at the real body start or rebuild the document around the template parts. Do not blindly append the rough body to the end of the template.
  - Copy template front matter if needed.
  - Append rough body content while remapping DOCX relationships.
  - Remap copied style IDs by visible style name before applying formatting; otherwise `Heading 1/2/3` can silently become an unrelated template style when source and template style IDs collide.
  - Remap styles to the template's real body, heading, caption, reference, and TOC styles.
  - Scope global formatting passes to generated content only, for example with `formatting_start_marker`. Never run body-style remapping across native cover/declaration pages.
  - Clean or rebuild section header/footer references when deleting sample template sections; stale back-matter headers such as `致谢` must not appear on body pages.
  - Add or repair figure/table captions, table borders, hyperlinks, bookmarks, citations, and page breaks.
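The style-ID remapping step above can be sketched as follows. This is an illustrative outline rather than code from `adaptive_docx_pipeline.py`: the function names are hypothetical, the inputs are assumed to be plain `{style_id: visible_name}` dictionaries, and names are matched case-insensitively, which is a simplification.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def build_style_id_remap(source_styles, template_styles):
    """Map source style IDs to template style IDs via the visible style name.

    Returns (remap, unmatched): remap is {source_id: template_id}, unmatched
    lists source IDs whose visible name does not exist in the template.
    """
    template_by_name = {
        name.strip().lower(): sid for sid, name in template_styles.items()
    }
    remap, unmatched = {}, []
    for sid, name in source_styles.items():
        target = template_by_name.get(name.strip().lower())
        if target is None:
            unmatched.append(sid)
        else:
            remap[sid] = target
    return remap, unmatched


def remap_paragraph_styles(body_root, remap):
    """Rewrite w:pStyle values in place; return how many were changed.

    Each element is looked up once against its original value, so chained
    entries (A -> B and B -> C) cannot cascade.
    """
    changed = 0
    for p_style in body_root.iter(W + "pStyle"):
        old = p_style.get(W + "val")
        if old in remap and remap[old] != old:
            p_style.set(W + "val", remap[old])
            changed += 1
    return changed
```

In the collision case described above, a source style with ID `Heading1` and visible name `heading 1` is mapped to whatever ID the template gives its own `heading 1`, instead of landing on an unrelated template style that happens to reuse the ID `Heading1`. Unmatched IDs are returned rather than guessed so the pipeline can decide how to handle them.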
- Finalize with Microsoft Word when available:
  - Use `scripts/finalize_word_docx.py` to update fields/TOC and export a PDF preview.
- Automated and visual verification:
  - Use `scripts/validate_docx_conversion.py final.docx --template template.docx --protected-until "中 文 摘 要" --pdf final.pdf --out validation.json` for placeholder/order/header/image/table checks plus protected-front-matter format checks. Choose the real first generated marker for non-Zhengzhou templates.
  - Then run `scripts/validate_docx_render.py final.docx --pdf final.pdf --out validation_render.json` for render-level checks: TOC field presence, numId↔abstractNum consistency, multilevel heading format, reference-counter independence, body-header static-text leakage, and PDF field-error strings. The structural validator can return PASS while the document is visibly broken; the render validator is what catches "empty TOC", "chapters not auto-numbered", "references start at [47]", "body header still says 致谢", and "STYLEREF prints 错误!使用'开始'选项卡…".
  - Use `scripts/render_pdf_preview.py` to inspect cover pages, abstracts, TOC, representative tables, figures, formulas, and references.
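To make the "TOC field presence" check concrete, here is a minimal standard-library sketch of how such a test could be written. It is not the logic of `validate_docx_render.py`; `has_toc_field` is a hypothetical name, and the sketch assumes the field instruction sits in a single `w:instrText` element, so an instruction split across several runs would be missed.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def has_toc_field(docx_path):
    """Return True if word/document.xml holds a complex field starting with TOC."""
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    depth = 0  # number of currently open complex fields (fields can nest)
    for el in root.iter():  # document order
        if el.tag == W + "fldChar":
            kind = el.get(W + "fldCharType")
            if kind == "begin":
                depth += 1
            elif kind == "end":
                depth = max(0, depth - 1)
        elif el.tag == W + "instrText" and depth > 0:
            if (el.text or "").strip().upper().startswith("TOC"):
                return True
    return False
```

A paragraph that merely contains the word "Contents" with no field returns `False`, which is the case where updating fields in Word has nothing to populate.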
Mandatory Quality Gate
Before reporting success, run an automated and visual QA pass. If any check fails, patch the project-specific pipeline and rerun; do not present the output as complete.
- Confirm the rough body is not appended after a back-matter placeholder such as `致谢`, `Acknowledgements`, `参考文献`, or sample appendices.
- Confirm template placeholder text is gone or intentionally preserved. Common failures include names like `李四`, `王五`, `张三`, red formatting instructions, lorem ipsum, sample chapter headings, and template-only reference lists.
- Confirm source metadata and source front matter replaced the template placeholders: title, author, advisor, major/department, date, Chinese abstract, English abstract, keywords, declarations when applicable.
- Confirm protected front matter still matches the template's formatting. Content may change, but cover/declaration/signature pages must preserve paragraph styles, run-level fonts/sizes/bold, spacing, alignment, and page-break structure unless explicitly modified.
- Confirm TOC entries point to the generated source chapters, not only to the template's sample chapters.
- Confirm heading paragraphs are still heading styles after OOXML insertion; style ID collisions must not break TOC generation.
- Confirm body pages use the intended body style and do not inherit the last template section's header/footer.
- Confirm representative images, formulas, tables, captions, references, and citations survive the reconstruction.
- Record failures in the run report with PASS/FAIL/PARTIAL wording and concrete evidence.
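One of the checks above, leftover placeholder text, lends itself to a small automated scan. The sketch below is an assumption-laden illustration rather than the bundled validator: `find_placeholder_paragraphs` is a hypothetical helper, the default needle list only repeats the examples named in the checklist, and only `word/document.xml` is scanned (headers, footers, and footnotes live in other parts).

```python
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Example needles taken from the checklist; extend per template.
PLACEHOLDERS = ("李四", "王五", "张三", "lorem ipsum")


def find_placeholder_paragraphs(docx_path, placeholders=PLACEHOLDERS):
    """Return (paragraph_index, placeholder, paragraph_text) for every hit."""
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    hits = []
    for index, para in enumerate(root.iter(W + "p")):
        # Join all runs first: Word often splits one visible word across runs.
        text = "".join(t.text or "" for t in para.iter(W + "t"))
        lowered = text.lower()
        for needle in placeholders:
            if needle.lower() in lowered:
                hits.append((index, needle, text))
    return hits
```

The paragraph index and full text give the concrete evidence the run report asks for; a hit in a deliberately preserved paragraph can be recorded as intentional instead of FAIL.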
Render-level Quality Gate (validate_docx_render.py)
The structural quality gate above checks counts and presence. It can return
PASS while the rendered Word/PDF is visibly broken because pandoc-derived
DOCX bodies often ship with a TOC paragraph that has no field, a Heading 1
style with no <w:numPr>, a numId rebound to a single-level abstract during
reference repair, or a body section header whose static text is "致谢". Run
validate_docx_render.py after validate_docx_conversion.py to catch those:
- TOC field presence: `<w:fldChar w:fldCharType="begin">` plus `<w:instrText> TOC`. If absent, Word's "update fields" cannot populate a non-existent TOC. Use `scripts/inject_toc_field.py` to add one before finalization.
- numId ↔ abstractNum consistency: every `(numId, ilvl)` pair used by a paragraph or by a style's `<w:numPr>` must resolve to a defined `<w:lvl ilvl=N>` inside the bound abstract numbering. Missing levels silently fall back to level 0; that is how `1.1`/`1.1.1` headings collapse to `[1]` after a reference repair re-points `numId=1` at a single-level abstract.
- Multilevel heading format: the abstract numbering bound to Heading 1 (whether at style level or via inline numPr on body H1 paragraphs) must have `lvlText` matching the user-supplied chapter prefix pattern (default `第%1章` or `Chapter %1`) at level 0 and a multilevel pattern (default contains both `%1` and `%2`) at levels 1/2. Configure with `--chapter-prefix-pattern` and `--multilevel-pattern` for non-default templates.
- Reference counter independence: any non-heading paragraph appearing after the last `参考文献`/`References` Heading 1 must not reuse a `numId` already used by Heading 1/2/3. This is the bug where 33 references render as `[47]`–`[79]` because their counter was shared with H2/H3 paragraphs upstream.
- Body header is not a back-matter literal: for every body section that uses a `<w:he