---
name: pdf-translate
description: Use when translating a PDF (especially OCR or image-based) to another language and producing faithful HTML output. Handles image extraction, layout preservation, contextual translation, and style-matched reconstruction. Triggers on "translate this PDF", "PDF en francais", "convert PDF to HTML translated", "traduire ce document".
---

# PDF Translate

Translate a PDF into another language and produce an HTML document that preserves the original layout, images, and visual style. Optimized for OCR/image-based PDFs where the text layer is unreliable.

## Pipeline

```dot
digraph pipeline {
  rankdir=LR;
  PDF [shape=folder];
  Images [shape=box, label="Page PNGs\n+ embedded images"];
  Analysis [shape=box, label="Claude Vision\nread + translate\n+ layout map"];
  HTML [shape=box, label="Faithful HTML\n/design-html"];
  QA [shape=diamond, label="Visual QA\nPDF vs HTML"];
  PDF -> Images [label="STEP 1"];
  Images -> Analysis [label="STEP 2-3"];
  Analysis -> HTML [label="STEP 4"];
  HTML -> QA [label="STEP 5"];
  QA -> Analysis [label="fix", style=dashed];
}
```

## STEP 0: Dependencies

Check before starting. Install what's missing.

```bash
# Option A: poppler (lighter)
command -v pdftoppm && echo "OK" || echo "INSTALL: sudo apt install poppler-utils"

# Option B: PyMuPDF (more powerful: extracts embedded images with coordinates)
python3 -c "import fitz; print('OK')" 2>/dev/null || echo "INSTALL: pip install pymupdf"
```

Prefer PyMuPDF if both available: it extracts embedded images + gives page dimensions.

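The preference rule above can be sketched in Python. This is a minimal illustration, not part of the skill itself: the function names `choose_backend` and `detect_backend` and the backend labels are hypothetical.

```python
import importlib.util
import shutil


def choose_backend(has_pymupdf: bool, has_poppler: bool) -> str:
    """Return the preferred rendering backend, or raise if none is installed."""
    if has_pymupdf:
        return "pymupdf"  # preferred: also extracts embedded images
    if has_poppler:
        return "poppler"  # fallback: page PNGs only
    raise RuntimeError("Install poppler-utils or pymupdf before starting")


def detect_backend() -> str:
    """Probe the local machine and apply the preference above."""
    has_pymupdf = importlib.util.find_spec("fitz") is not None
    has_poppler = shutil.which("pdftoppm") is not None
    return choose_backend(has_pymupdf, has_poppler)
```

Keeping the decision in a pure function separate from the probing makes the preference easy to check without either tool installed.
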
## STEP 1: PDF → Page Images + Embedded Assets

```bash
# Create working directory
mkdir -p pdf-translate-work/{pages,assets}

# Convert pages to high-res PNGs
pdftoppm -png -r 300 input.pdf pdf-translate-work/pages/page

# OR with PyMuPDF:
python3 -c "
import fitz
doc = fitz.open('input.pdf')
for i, page in enumerate(doc):
    pix = page.get_pixmap(dpi=300)
    pix.save(f'pdf-translate-work/pages/page-{i+1:03d}.png')
    for img_idx, img in enumerate(page.get_images(full=True)):
        xref = img[0]
        base = doc.extract_image(xref)
        with open(f'pdf-translate-work/assets/img-p{i+1}-{img_idx+1}.{base[\"ext\"]}', 'wb') as f:
            f.write(base['image'])
"
```

## STEP 2: First Pass — Style Analysis

Read page 1 (and optionally 2-3 more) with Claude Vision. Extract:

- **Typography**: font style (serif/sans), heading sizes, body size, weight
- **Colors**: background, text, accent, header colors
- **Layout**: single/multi column, margins, header/footer pattern
- **Spacing**: line height, paragraph gaps, section gaps
- **Special elements**: callout boxes, sidebars, tables, captions, footnotes

Output a style brief — this feeds into STEP 4.

## STEP 3: Page-by-Page Read + Translate

For each page image, use Claude Vision (Read tool on PNG):

1. **Read** the text content from the image (ignore OCR text layer)
2. **Map layout**: identify text blocks, headings, images, tables, their relative positions
3. **Translate** to target language preserving:
   - Register and tone (formal/informal/technical)
   - Technical terms (keep original in parentheses on first occurrence if ambiguous)
   - Sentence structure adapted to target language (not word-for-word)
4. **Note** image references: what each image shows, where it sits relative to text

### Cross-page context

Maintain a running glossary of translated terms across pages. If page 1 translates "stakeholder" as "partie prenante", every subsequent page must use the same term.

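One way to keep the glossary honest is a small in-memory map that remembers the first translation of each term and flags later pages that deviate. The sketch below assumes terms are matched case-insensitively; the `Glossary` class and its method names are hypothetical.

```python
from typing import Dict, Tuple


class Glossary:
    """Running source-term -> translation map shared across pages."""

    def __init__(self) -> None:
        self._terms: Dict[str, str] = {}

    def record(self, source: str, target: str) -> str:
        """Store the first translation seen for a term; return the canonical one."""
        key = source.strip().lower()
        return self._terms.setdefault(key, target)

    def conflicts(self, page_terms: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
        """Return {term: (canonical, proposed)} for entries that disagree."""
        found = {}
        for source, proposed in page_terms.items():
            canonical = self.record(source, proposed)
            if canonical != proposed:
                found[source] = (canonical, proposed)
        return found


glossary = Glossary()
glossary.record("stakeholder", "partie prenante")  # page 1
page_2_conflicts = glossary.conflicts({"Stakeholder": "intervenant"})  # page 2
```

Here `page_2_conflicts` reports that "Stakeholder" should have been rendered as "partie prenante", so the page can be corrected before it reaches STEP 4.
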
Output per page:

```markdown
## Page N
### Layout
[column structure, image positions]
### Content (translated)
[translated text with markdown structure]
### Images
- img-pN-1.png: [description], position: [top-right / inline / full-width]
```

## STEP 4: HTML Reconstruction

Invoke `/design-html` (or `/frontend-design`) with:

1. The style brief from STEP 2
2. All translated page content from STEP 3
3. Extracted images from `pdf-translate-work/assets/`

Requirements for the HTML:

- Single HTML file with inline CSS; embed images as base64 for a fully self-contained file, or use relative paths if the assets folder ships alongside it
- Match the feel of the original typography (use the closest web-safe or Google Font)
- Preserve column layout, spacing, and color scheme
- Place images at their original positions with proper sizing
- Print-friendly: `@media print` styles, with page breaks where the original had them
- Responsive: readable on screen, faithful in print

## STEP 5: Visual QA

Compare the original PDF and the translated HTML side by side:

1. Read a few pages of the original PDF (Read tool, pages parameter)
2. Take a screenshot of the HTML (if `/browse` is available)
3. Check: layout match, no missing content, images present, style fidelity
4. Fix discrepancies → iterate STEP 4

## Decision: OCR vs Native PDF

```dot
digraph ocr_check {
  Check [shape=diamond, label="Does PDF have\nreliable text layer?"];
  Native [shape=box, label="Can use marker\n+ Claude translate"];
  OCR [shape=box, label="Use Vision pipeline\n(this skill)"];
  Test [shape=box, label="Copy text from PDF.\nGarbled or missing?"];
  Check -> Test [label="unsure"];
  Test -> OCR [label="yes"];
  Test -> Native [label="no, text is clean"];
  Check -> Native [label="native PDF"];
  Check -> OCR [label="scanned/OCR"];
}
```

If the PDF has a clean text layer, `marker` (`pip install marker-pdf`) is faster. This skill's Vision pipeline is for when the text layer is unreliable.

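The "garbled or missing?" test can be roughed out as a heuristic over text already extracted from a page, for example with `pdftotext` or PyMuPDF's `page.get_text()`. This is only a sketch: the function name and both thresholds are assumptions chosen for illustration and should be tuned on real documents.

```python
def text_layer_looks_reliable(
    text: str, min_chars: int = 200, min_clean_ratio: float = 0.9
) -> bool:
    """Heuristic: True when the text is long enough and mostly ordinary characters."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False  # missing or near-empty text layer
    allowed_punctuation = ".,;:!?'\"()-%/"
    clean = sum(
        ch.isalnum() or ch.isspace() or ch in allowed_punctuation
        for ch in stripped
    )
    return clean / len(stripped) >= min_clean_ratio
```

A `False` result points to the Vision pipeline; a `True` result suggests the faster native route is worth trying, though a quick manual look at the extracted text remains the final check.
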
## Common Mistakes

| Mistake | Fix |
|---|---|
| Using OCR text layer from scanned PDF | Read page images with Vision instead |
| Translating page-by-page without glossary | Maintain cross-page term consistency |
| Generic HTML that doesn't match original style | Extract style brief first (STEP 2) |
| Word-for-word translation | Adapt sentence structure to target language |
| Forgetting `prefers-reduced-motion` or print styles | Include in HTML output |
| Images as decoration only | Preserve original placement and sizing |
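
Two rows of the table lend themselves to an automated check before the visual QA in STEP 5: missing print styles and missing images. The sketch below is a hypothetical helper, and it assumes images are linked by relative path; base64-embedded images would not be matched by file name.

```python
import re
from typing import List


def lint_translated_html(html: str, asset_names: List[str]) -> List[str]:
    """Return human-readable problems found in the generated HTML."""
    problems = []
    if not re.search(r"@media\s+print", html):
        problems.append("missing @media print styles")
    for name in asset_names:
        if name not in html:  # plain substring match on the file name
            problems.append(f"image not referenced: {name}")
    return problems
```

An empty list means both checks passed; it says nothing about layout or style fidelity, which still need the side-by-side comparison.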