hwp-pipeline
Read, convert, and edit Korean Hangul documents (.hwp / .hwpx) without Hancom Office.
What this skill does
Three capabilities in one skill:
- Read HWP: Convert
.hwpto JSON, Markdown, or HTML via@ohah/hwpjs(Node.js) - Convert HWP to HWPX: Transform
.hwpto editable.hwpxviahwp2hwpx(Java) - Edit HWPX: Create, modify, and template-fill
.hwpxdocuments viapython-hwpx
When to use
- "Convert this HWP file to Markdown"
- "Extract images from a Hangul document"
- "Fill in this government form template (.hwp)"
- "Replace placeholders in this HWPX document"
- "Create a new HWPX document with tables"
When not to use
- The file is
.docxor.pdf(use the appropriate tool) - OCR or scanned PDF recovery is needed
Prerequisites
| Capability | Requires |
|---|---|
| Read HWP (JSON/MD/HTML) | Node.js 18+ |
| Convert HWP to HWPX | Java 11+, Maven |
| Edit HWPX | Python 3.10+, python-hwpx, lxml |
Run setup.sh to install all dependencies automatically.
Quick decision tree
- Extract text from HWP → Use
@ohah/hwpjs(Section A) - Convert HWP to Markdown/HTML/JSON → Use
@ohah/hwpjs(Section A) - Edit or fill an HWP form → Convert to HWPX first (Section B), then edit (Section C)
- Extract text from HWPX → Use
scripts/text_extract.py(Section C) - Replace placeholders in HWPX → Use
scripts/zip_replace_all.py(Section C) - Create a new HWPX document → Use
python-hwpxdirectly (Section C)
Section A: Read HWP
Uses @ohah/hwpjs (Node.js CLI). Works on macOS, Linux, and Windows.
Install
npm install -g @ohah/hwpjs
export NODE_PATH="$(npm root -g)"
Convert to JSON
hwpjs to-json document.hwp -o output.json --pretty
Convert to Markdown
hwpjs to-markdown document.hwp -o output.md --include-images
# or save images separately:
hwpjs to-markdown document.hwp -o output.md --images-dir ./images
Convert to HTML
hwpjs to-html document.hwp -o output.html
Extract images only
hwpjs extract-images document.hwp -o ./images
Batch processing
hwpjs batch ./documents -o ./output --format json --recursive
Section B: Convert HWP to HWPX
Uses hwp2hwpx (Java, Apache 2.0 by neolord0). Converts binary .hwp to ZIP-based .hwpx so it can be edited with python-hwpx.
First-time setup
bash setup.sh
This clones hwp2hwpx, patches the compiler version, builds a fat JAR, and installs python-hwpx.
Convert
SKILL_DIR="$(cd "$(dirname "$0")" && pwd)" # or set to skill install path
java -jar "$SKILL_DIR/java/hwp2hwpx-fat.jar" input.hwp output.hwpx
If the fat JAR is not available, use the classpath method:
cd "$SKILL_DIR/java"
CP=".:target/hwp2hwpx-1.0.0.jar:$(find ~/.m2/repository/kr/dogfoot -name '*.jar' | tr '\n' ':')"
java -cp "$CP" Convert input.hwp output.hwpx
Limitations
hwpparser(Python) is NOT supported: it loses table structure during conversion.- Polaris Office cannot produce HWPX output.
Section C: Edit HWPX
Uses python-hwpx (MIT by airmang). Full API reference: references/api.md.
Install
pip install -U python-hwpx lxml
Extract text from HWPX
python3 scripts/text_extract.py input.hwpx
python3 scripts/text_extract.py input.hwpx --include-nested --format json
Create a new HWPX document
from hwpx import HwpxDocument
doc = HwpxDocument.new()
doc.add_paragraph("Document title")
table = doc.add_table(2, 2)
table.set_cell_text(0, 0, "Name")
table.set_cell_text(0, 1, "Value")
doc.save_to_path("output.hwpx")
See examples/01_create_and_save.py.
Replace placeholders (including table cells)
python3 scripts/zip_replace_all.py template.hwpx output.hwpx \
--replace "{name}=John" "{date}=2026-01-01" --auto-fix-ns
See examples/03_template_replace.py for the full pipeline.
Style-based search and replace
with HwpxDocument.open("input.hwpx") as doc:
doc.replace_text_in_runs("TODO", "DONE", text_color="#FF0000")
doc.save_to_path("output.hwpx")
Fix namespaces after ZIP-level edits
python3 scripts/fix_namespaces.py input.hwpx --inplace --backup
Workflow patterns
Pattern 1: Fill a government/institutional HWP form
input.hwp → [hwp2hwpx] → template.hwpx → [zip_replace_all.py] → filled.hwpx
- Convert HWP to HWPX:
java -jar java/hwp2hwpx-fat.jar form.hwp form.hwpx - Identify placeholders:
python3 scripts/text_extract.py form.hwpx --include-nested - Replace placeholders:
python3 scripts/zip_replace_all.py form.hwpx filled.hwpx --replace "{field}=value" --auto-fix-ns
Pattern 2: Extract and analyze HWP content
input.hwp → [hwpjs] → output.json or output.md
Pattern 3: Create a new structured HWPX
doc = HwpxDocument.new()
# Add paragraphs, tables, memos
doc.save_to_path("output.hwpx")
Unstable areas
set_header_text()andset_footer_text()may produce inconsistent layouts depending on the document version. If headers/footers cause problems, fix them in the template and only automate body, tables, and memos.- Always verify output files after automated processing.
Bundled resources
references/api.md— python-hwpx API signatures and usage notesscripts/text_extract.py— One-command text extraction CLIscripts/zip_replace_all.py— Global placeholder replacement including table cellsscripts/fix_namespaces.py— XML namespace normalization after ZIP-level editsexamples/01_create_and_save.py— Create a new document with paragraphs and tablesexamples/02_extract_and_inspect.py— Extract text and inspect document structureexamples/03_template_replace.py— Full pipeline: placeholder replacement + namespace fix
Done when
- Requested conversion output exists and is valid
- Template placeholders are replaced and verified
- Batch processing count matches input count
- Output file opens correctly (verify with text extraction)