RTL Document Translation Skill
Translate structured business documents to right-to-left (RTL) languages while maintaining pixel-perfect formatting, colors, table structures, and professional appearance.
When to Use This Skill
Invoke this skill when the user requests:
- Translating DOCX files to Arabic, Hebrew, Urdu, or other RTL languages
- Preserving exact document structure (tables, sections, formatting)
- Maintaining colors, backgrounds, and visual styling
- Converting business/financial documents to RTL formats
- Creating RTL versions that match English originals exactly
Do NOT use for:
- Simple text translation (use translation APIs directly)
- Creating new documents from scratch
- PDF-only workflows (this skill works with DOCX)
Core Methodology
1. Phased Approach (Critical)
Phase 1: Analysis → Phase 2: Translation Dictionary → Phase 3: Document Generation → Phase 4: Verification
Never skip directly to generation. Structure analysis prevents catastrophic errors like:
- Splitting multi-line cells into multiple rows
- Missing table dimensions
- Incorrect section orientations
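The analysis phase can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: `TableProfile`, `profile_table`, and `same_structure` are hypothetical names, and the table is assumed to be already extracted into rows of cell strings.

```python
from dataclasses import dataclass, field


@dataclass
class TableProfile:
    """Structure facts recorded in Phase 1 and checked again in Phase 4."""
    rows: int
    cols: int
    multiline_cells: list = field(default_factory=list)  # (row, col) pairs


def profile_table(grid):
    """Record dimensions and multi-line cells for a table given as rows of strings."""
    rows = len(grid)
    cols = max((len(row) for row in grid), default=0)
    multiline = [
        (r, c)
        for r, row in enumerate(grid)
        for c, text in enumerate(row)
        if "\n" in text  # a multi-line cell must stay ONE cell, not become extra rows
    ]
    return TableProfile(rows, cols, multiline)


def same_structure(source, generated):
    """Phase 4 check: the generated table must keep the source dimensions."""
    return (source.rows, source.cols) == (generated.rows, generated.cols)
```

Comparing the profile of the source table with the profile of the generated table catches the row-splitting error listed above before the document is delivered.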
2. RTL Formatting (3 Levels)
RTL documents require THREE distinct formatting levels:
Level 1 - Text Direction:
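# ParagraphFormat has no documented bidi property in python-docx,
# so add <w:bidi/> to the paragraph properties directly
# (requires: from docx.oxml import OxmlElement)
paragraph._p.get_or_add_pPr().append(OxmlElement('w:bidi'))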
run.font.rtl = True
run.font.complex_script = True
Level 2 - Text Alignment:
paragraph.alignment = WD_ALIGN_PARAGRAPH.RIGHT
Level 3 - Layout Direction:
For data/financial tables, keep columns in LEFT-TO-RIGHT order:
- Temporal sequences (Month 1, 2, 3...) progress L→R
- Row labels stay in same positions as English
- Only TEXT WITHIN cells is RTL
Example: Month headers should be:
[الشهر] [1] [2] [3] [4] ← Correct (columns L→R, text RTL)
[4] [3] [2] [1] [الشهر] ← Wrong (mirrored columns)
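The column rule above can be illustrated with a short sketch. `translate_header_row` is a hypothetical helper and the one-entry dictionary is illustrative only. The point is that each cell is translated at its original index and the list is never reversed.

```python
def translate_header_row(cells, translations):
    """Translate each cell's text; the column order is left untouched."""
    return [translations.get(text, text) for text in cells]


english = ["Month", "1", "2", "3", "4"]
arabic = translate_header_row(english, {"Month": "الشهر"})

# Column 0 still holds the label and the month numbers still run 1..4.
# Only the text inside each cell gets RTL formatting later.
```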
Implementation Patterns
Pattern 1: Background Color Detection
Problem: Simple attribute access fails
Solution: Use XML traversal
from docx.oxml.ns import qn
def get_cell_background(cell):
    """Reliably extract cell background color"""
    tc = cell._element
    tcPr = tc.tcPr if hasattr(tc, 'tcPr') and tc.tcPr is not None else None
    if tcPr is None:
        return None
    # CRITICAL: Use findall(), not direct attribute access
    shd_list = tcPr.findall(qn('w:shd'))
    for shd in shd_list:
        fill = shd.get(qn('w:fill'))
        if fill and fill != 'auto':
            return fill.upper()
    return None
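Why: tcPr.shading doesn't work consistently. XML traversal reads the cell's w:shd element directly. Note that it only sees shading set on the cell itself, not shading inherited from a table style.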
Pattern 2: Set Cell Background
from docx.oxml import OxmlElement
from docx.oxml.ns import qn

def set_cell_background(cell, rgb_hex):
    """Set cell background color (e.g., 'CC0029' for red)"""
    tc = cell._element
    tcPr = tc.get_or_add_tcPr()
    # Remove existing shading
    for shd in tcPr.findall(qn('w:shd')):
        tcPr.remove(shd)
    # Add new shading
    shd = OxmlElement('w:shd')
    shd.set(qn('w:fill'), rgb_hex)
    tcPr.append(shd)
Pattern 3: Quote Normalization
Problem: DOCX files often contain curly quotes (U+2018, U+2019, U+201C, U+201D) and non-standard spaces that break dictionary lookups
Solution: Multi-pass normalization
import re

def normalize_text(text):
    """Normalize quotes and unicode spaces for reliable matching"""
    # Convert curly quotes → straight quotes
    text = text.replace('\u201c', '"').replace('\u201d', '"')
    text = text.replace('\u2018', "'").replace('\u2019', "'")
    # Normalize unicode spaces → regular spaces
    text = re.sub(r'[\u2002\u2003\u2009\u200A\u00A0]+', ' ', text)
    return text.strip()
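One way to use this kind of normalization is to apply it to the dictionary keys once, at load time, so that both sides of every lookup pass through the same function. The sketch below is an assumption about how that could look, not part of the skill itself. `normalize_key` and `build_lookup` are hypothetical names, and `normalize_key` is a compact equivalent of the quote and space handling described above.

```python
import re

# Curly quotes mapped to their straight equivalents.
_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})
# En space, em space, thin space, hair space, no-break space.
_SPACES = re.compile("[\u2002\u2003\u2009\u200a\u00a0]+")


def normalize_key(text):
    """Straighten quotes, turn unicode spaces into regular spaces, strip the ends."""
    return _SPACES.sub(" ", text.translate(_QUOTES)).strip()


def build_lookup(translation_dict):
    """Return a copy of the dictionary whose keys are all normalized."""
    return {normalize_key(key): value for key, value in translation_dict.items()}


# Illustrative entry: the key was copied from a DOCX with curly quotes and a no-break space.
lookup = build_lookup({"\u201cGrand\u00a0Total\u201d": "الإجمالي"})
```

A lookup then becomes `lookup.get(normalize_key(cell_text), cell_text)`, which reduces the number of fallback passes needed at translation time.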
Pattern 4: Multi-Pass Translation Matching
Problem: Exact string matches fail due to whitespace variations, quotes, formatting
Solution: Progressive fallback strategy
import re

def translate_text(text, translation_dict):
    """Multi-pass translation with normalization fallbacks"""
    if not text or not text.strip():
        return text
    # Pass 1: Exact match
    if text in translation_dict:
        return translation_dict[text]
    # Pass 2: Stripped
    if text.strip() in translation_dict:
        return translation_dict[text.strip()]
    # Pass 3: Normalized quotes
    normalized_quotes = text.replace('\u201c', '"').replace('\u201d', '"')
    normalized_quotes = normalized_quotes.replace('\u2018', "'").replace('\u2019', "'")
    if normalized_quotes in translation_dict:
        return translation_dict[normalized_quotes]
    # Pass 4: Stripped + normalized
    if normalized_quotes.strip() in translation_dict:
        return translation_dict[normalized_quotes.strip()]
    # Pass 5: Unicode spaces
    cleaned = re.sub(r'[\u2002\u2003\u2009\u200A\u00A0]+', ' ', text).strip()
    if cleaned in translation_dict:
        return translation_dict[cleaned]
    # Pass 6: Combined (quotes + spaces)
    cleaned_quotes = re.sub(r'[\u2002\u2003\u2009\u200A\u00A0]+', ' ', normalized_quotes).strip()
    if cleaned_quotes in translation_dict:
        return translation_dict[cleaned_quotes]
    # Pass 7: Normalized whitespace (collapse multiple spaces)
    normalized_ws = ' '.join(text.split())
    if normalized_ws in translation_dict:
        return translation_dict[normalized_ws]
    # No match found - return as-is
    return text
Success Rate: 95%+ vs 60% with exact-match-only
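A match rate like the one quoted above can be measured on a given document with a small helper. This is a sketch under stated assumptions: `match_rate` is a hypothetical name, and the `translate` argument stands for any one-argument callable, such as a wrapper around the multi-pass function with the dictionary bound. The lambda in the demo is a simplified stand-in.

```python
def match_rate(texts, translate):
    """Return (rate, misses); a miss is a non-blank text that came back unchanged."""
    candidates = [t for t in texts if t and t.strip()]
    misses = [t for t in candidates if translate(t) == t]
    if not candidates:
        return 1.0, []
    return 1 - len(misses) / len(candidates), misses


# Demo with an illustrative one-entry dictionary and a stand-in translate callable.
demo_dict = {"Total": "الإجمالي"}
rate, misses = match_rate(
    ["Total", "Subtotal", "  "],
    lambda t: demo_dict.get(t.strip(), t),
)
```

Strings that legitimately stay the same after translation, such as plain numbers, are counted as misses here, so the reported rate is a lower bound. The `misses` list doubles as a to-do list for the translation dictionary.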
Pattern 5: RTL Cell Formatting
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
from docx.shared import Pt, RGBColor

def apply_rtl_to_cell(cell, arabic_text, font_size=10, bold=False, text_color=None):
    """Apply complete RTL formatting to table cell"""
    # Clear cell
    cell.text = ''
    # Add paragraph with Arabic text
    paragraph = cell.paragraphs[0]
    run = paragraph.add_run(arabic_text)
    # RTL text direction (Level 1)
    # ParagraphFormat has no documented bidi property, so add <w:bidi/> directly
    paragraph._p.get_or_add_pPr().append(OxmlElement('w:bidi'))
    run.font.rtl = True
    run.font.complex_script = True
    # Right alignment (Level 2)
    paragraph.alignment = WD_ALIGN_PARAGRAPH.RIGHT
    # Font settings
    run.font.name = 'Simplified Arabic'  # or 'Times New Roman' for formal docs
    run._element.rPr.rFonts.set(qn('w:ascii'), 'Simplified Arabic')
    run._element.rPr.rFonts.set(qn('w:hAnsi'), 'Simplified Arabic')
    run._element.rPr.rFonts.set(qn('w:cs'), 'Simplified Arabic')
    run.font.size = Pt(font_size)
    if bold:
        run.font.bold = True
    if text_color:
        run.font.color.rgb = RGBColor(*text_color)
    return cell
Pattern 6: Auto-Correct White Text on Dark Backgrounds
Problem: Text becomes invisible on dark backgrounds
Solution: Auto-detect and correct
def apply_colors_to_cell(cell, eng_cell, ar_text, font_size=10, bold=False):
    """Apply colors with auto-correction for visibility"""
    # Get background color
    bg_color = get_cell_background(eng_cell)
    # Get text color from English
    text_color = None
    if eng_cell.paragraphs and eng_cell.paragraphs[0].runs:
        for run in eng_cell.paragraphs[0].runs:
            if run.font.color and run.font.color.rgb:
                rgb = run.font.color.rgb
                text_color = (rgb[0], rgb[1], rgb[2])
                break
    # AUTO-CORRECTION: Set white text for dark backgrounds
    if bg_color and bg_color in ['CC0029', 'C00000', '000000']:  # Red/black
        text_color = (255, 255, 255)  # White
    # Apply formatting
    apply_rtl_to_cell(cell, ar_text, font_size, bold, text_color)
    # Set background
    if bg_color:
        set_cell_background(cell, bg_color)
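The hard-coded color list covers only the fills it names. A possible generalization, offered here as an assumption rather than as the skill's method, is to estimate brightness from the hex value. `is_dark` is a hypothetical helper and the 0.5 threshold is an arbitrary starting point to tune per document.

```python
def is_dark(hex_fill, threshold=0.5):
    """Return True when a 6-digit hex fill such as 'C00000' likely needs white text."""
    # Split 'RRGGBB' into three channels scaled to the range 0..1.
    r, g, b = (int(hex_fill[i:i + 2], 16) / 255 for i in (0, 2, 4))
    # Rec. 709 luma weights applied to gamma-encoded values: a rough estimate only.
    luma = 0.2126 * r + 0.7152 * g + 0.0722 * b
    return luma < threshold
```

With this helper, the membership test on the fixed list could become `if bg_color and is_dark(bg_color):`, which also catches dark blues and greens that the list does not mention.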
Pattern 7: Nested Table Content Extraction ⭐
Problem: cell.text property doesn't include text from nested tables within the cell. This causes cells with forms, checklists, or complex layouts to appear empty.
Detection:
if cell.tables:
    print(f"Cell contains {len(cell.tables)} nested table(s)")
Solution: Extract content from nested tables using cell.tables property
def extract_cell_content_with_nested_tables(cell):
"""
Extract all text from a cell, including text from nested tables.
Han