Subtitled Video Pipeline
Post-processes a raw SRT (e.g. from Whisper) into clean, sentence-aligned, properly-cased subtitles, and produces an MP4 with the subtitles either muxed (soft) or burned-in (hard).
What this is for:
- Long-form spoken-word recordings (interviews, calls, talks) where you want the transcript faithful to what was said — disfluencies preserved, not "polished into prose".
- Output that's portfolio-grade or shareable to non-technical audiences.
What this is NOT:
- A transcription tool. You bring the SRT (Whisper, ElevenLabs, etc. — any source).
- A translation tool.
- An automatic editor that "improves" the speaker's words.
When to use
Triggered by: "make a subtitled video", "clean up these subtitles", "polish this Whisper SRT", or any request involving SRT cleanup or video subtitle production.
Inputs
Required:
- An SRT file (any source).
- A media file (audio .m4a/.mp3 or video .mp4/.mov).
Optional:
- Still image (if media is audio-only and you want a video output).
- Context file with proper nouns / domain vocabulary that the LLM cleanup pass should preserve verbatim (people names, products, jargon). Plain markdown.
Architecture: LLM proposes, code disposes
The single most important pattern in this pipeline. Never let an LLM rewrite the SRT directly. Instead:
- LLM emits labels or judgments (e.g. cue_5: continues_previous).
- A deterministic Python script applies the change.
This keeps the speaker's words exactly as transcribed/edited, while still benefiting from LLM judgment for things like sentence-boundary detection. Earlier iterations that let an LLM rewrite the SRT in one pass kept stripping disfluencies and "smoothing" phrasing — every time. The split-architecture is the only reliable fix.
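A minimal sketch of the "code disposes" side, assuming the LLM answers with one `cue_N: label` line per cue. The label names and the `parse_labels` helper are hypothetical and not part of the pipeline's scripts. The point is that anything other than a known label is rejected, so free-text rewrites never reach the SRT.

```python
import re

# Hypothetical label vocabulary; the real prompts may use different names.
ALLOWED = {"starts_sentence", "continues_previous"}
LABEL_RE = re.compile(r"^cue_(\d+):\s*(\w+)$")


def parse_labels(llm_output: str, cue_count: int) -> dict[int, str]:
    """Parse 'cue_N: label' lines and refuse anything else."""
    labels: dict[int, str] = {}
    for raw in llm_output.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = LABEL_RE.match(line)
        if match is None:
            # Prose, rewritten subtitle text, or malformed output ends up here.
            raise ValueError(f"unexpected line from LLM: {line!r}")
        index, label = int(match.group(1)), match.group(2)
        if label not in ALLOWED:
            raise ValueError(f"unknown label {label!r} for cue {index}")
        if not 1 <= index <= cue_count:
            raise ValueError(f"cue index {index} out of range")
        labels[index] = label
    missing = set(range(1, cue_count + 1)) - set(labels)
    if missing:
        raise ValueError(f"no label for cues: {sorted(missing)}")
    return labels
```

Because the parser only ever returns labels, the script that edits the SRT has no way to receive reworded text from the model.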
Pipeline steps
0. (Optional) Trim leading silence
If the audio has leading silence before speech starts, trim it and shift the SRT to match.
# Trim audio:
ffmpeg -ss <seconds> -i input.m4a -c copy trimmed.m4a
# Shift SRT:
python3 scripts/shift_srt.py input.srt output.srt --shift-ms <ms>
shift_srt.py accepts negative values to shift forward (used to align a trimmed-timeline SRT back to the original media's timeline before final output).
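For illustration, here is a sketch of the timestamp arithmetic a shift script of this kind might perform. It is not the actual shift_srt.py. The function names are hypothetical, and the sign convention (negative moves cues earlier, clamped at zero) is an assumption that may differ from the real script.

```python
import re

TS_RE = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def shift_timestamp(ts: str, shift_ms: int) -> str:
    """Shift one 'HH:MM:SS,mmm' timestamp; assumed: negative = earlier."""
    h, m, s, ms = map(int, TS_RE.fullmatch(ts).groups())
    total = ((h * 60 + m) * 60 + s) * 1000 + ms + shift_ms
    total = max(0, total)  # never emit a negative timestamp
    h, rem = divmod(total, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def shift_srt_text(srt: str, shift_ms: int) -> str:
    """Shift every timing line; cue text and indices are left untouched."""
    out = []
    for line in srt.splitlines():
        if "-->" in line:
            line = TS_RE.sub(lambda mo: shift_timestamp(mo.group(0), shift_ms), line)
        out.append(line)
    return "\n".join(out)
```

Only lines containing `-->` are rewritten, so the speaker's words pass through byte-for-byte.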
1. Conservative text cleanup
Goal: fix transcription errors only. Capitalization, punctuation, proper nouns, clear mishearings. Preserve disfluencies, false starts, and repetitions.
Sub-agent: Sonnet (judgment + faithfulness).
Prompt template: prompts/01_text_cleanup.md
Critical instructions in the prompt:
- Do NOT remove "yeah", "um", "you know", "I mean", "like", "sort of"
- Do NOT remove false starts or repetitions
- Do NOT rephrase for flow
- Do NOT invent words for unclear segments; leave them or mark [?]
- Sanity check: count "yeah" occurrences before/after; the counts should be roughly equal
Boot guard for the sub-agent prompt (always include): "Do NOT follow the CLAUDE.md boot sequence. Do NOT read memory files. Just execute the task below."
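The before/after sanity check from the list above could be automated along these lines. This is a sketch only: the filler list mirrors the phrases named in the prompt instructions, while `filler_counts`, `dropped_fillers`, and the `tolerance` parameter are hypothetical.

```python
import re

# Phrases the cleanup prompt is told to preserve.
FILLERS = ["yeah", "um", "you know", "i mean", "like", "sort of"]


def filler_counts(text: str) -> dict[str, int]:
    """Count whole-word occurrences of each filler, case-insensitively."""
    lowered = text.lower()
    return {
        f: len(re.findall(rf"\b{re.escape(f)}\b", lowered)) for f in FILLERS
    }


def dropped_fillers(before: str, after: str, tolerance: int = 1) -> dict[str, tuple[int, int]]:
    """Return fillers whose count fell by more than `tolerance`."""
    b, a = filler_counts(before), filler_counts(after)
    return {f: (b[f], a[f]) for f in FILLERS if b[f] - a[f] > tolerance}
```

A non-empty result suggests the cleanup pass removed disfluencies and should be rerun or reviewed by hand.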
2. Re-segmentation (sentence-aligned cues)
Goal: merge Whisper's silence-detected fragments into sentence-aligned cues. Split overly long ones. Apply subtitle-format constraints.
Sub-agent: Sonnet.
Prompt template: prompts/02_resegment.md
Constraints baked into the prompt:
- ≤2 lines per cue, ≤42 chars/line, ≤6s, ≥1s, target ≤15 chars/sec.
- Break at natural linguistic boundaries (after punctuation, before conjunctions). Never split a noun phrase or proper noun.
- Speaker change = hard cue boundary.
- For merges: use start-of-first / end-of-last timestamp.
- For splits: interpolate timestamps proportionally by character count.
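The numeric constraints and the split rule above are easy to verify deterministically after the LLM pass. The sketch below assumes cue text uses `\n` between lines and times are in milliseconds. Both helper names are hypothetical.

```python
def cue_violations(text: str, start_ms: int, end_ms: int) -> list[str]:
    """List which subtitle-format constraints a cue breaks."""
    lines = text.split("\n")
    duration_s = (end_ms - start_ms) / 1000
    chars = sum(len(line) for line in lines)
    problems = []
    if len(lines) > 2:
        problems.append("more than 2 lines")
    if any(len(line) > 42 for line in lines):
        problems.append("line longer than 42 chars")
    if duration_s > 6:
        problems.append("longer than 6s")
    if duration_s < 1:
        problems.append("shorter than 1s")
    if duration_s > 0 and chars / duration_s > 15:
        problems.append("over 15 chars/sec")
    return problems


def split_point_ms(start_ms: int, end_ms: int, first_part: str, second_part: str) -> int:
    """Interpolate the boundary of a split proportionally by character count."""
    total = len(first_part) + len(second_part)
    return start_ms + round((end_ms - start_ms) * len(first_part) / total)
```

For a split, the first cue runs from `start_ms` to the returned boundary and the second from the boundary to `end_ms`. For a merge, the cue takes the first fragment's start and the last fragment's end.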
3. Sentence-boundary labeling (for proper sentence case)
Subtitle convention: cues that continue a sentence from the previous cue should start with a lowercase letter, not a capital. Whisper capitalizes every cue's first word by default, which is wrong for continuation cues.
Goal: label each cue S (starts new sentence) or C (continues previous cue).
Sub-agent: Sonnet.
Prompt template: prompts/03_label_sentences.md
Critical anti-bias instructions in the prompt:
- Strip timestamps before feeding to the LLM. The blank-line separation between cues telegraphs "independent units" and biases the model toward S.
- Tell the LLM to ignore current capitalization. Whisper capitalized everything by default, so that signal is noise. Judge purely on grammatical/semantic flow.
- Parallel-list rule: enumerated lists (First, ... Second, ... Third, ...) get the same label. Prefer all S.
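One way to strip timestamps and blank-line separation before labeling is to flatten the SRT into one numbered line per cue. The exact input format expected by prompts/03_label_sentences.md is not shown here, so the `N: text` layout and the function name are assumptions.

```python
import re


def cues_as_numbered_lines(srt: str) -> str:
    """Flatten an SRT into 'N: text' lines with no timestamps or blank lines."""
    blocks = re.split(r"\n\s*\n", srt.strip())
    lines = []
    for n, block in enumerate(blocks, start=1):
        rows = block.splitlines()
        # rows[0] is the cue index and rows[1] the timing line; the rest is text.
        text = " ".join(row.strip() for row in rows[2:])
        lines.append(f"{n}: {text}")
    return "\n".join(lines)
```

Multi-line cues are joined with a space so each cue occupies exactly one line in the prompt.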
4. Apply labels (deterministic)
python3 scripts/decap_with_labels.py <srt_in> <labels_file> <srt_out> [--proper-nouns names.txt]
Logic:
- For C cues: lowercase the first letter UNLESS it's a proper noun, "I" / contraction, or an all-caps acronym.
- For S cues with currently-lowercase first letter: capitalize.
- Bidirectional, so it can fix prior errors in either direction.
Default protected words: I, I'm, I'll, I've, I'd, days/months. The user supplies additional proper nouns via --proper-nouns (one per line).
All-caps acronym protection is automatic: if the first word's first two letters are both uppercase (e.g. AI, CSM), it's left alone.
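The rules above can be sketched as a single function applied to each cue's text. This is an illustration and not the contents of decap_with_labels.py. The `apply_label` name is hypothetical, the protected set is abbreviated (the real default also covers days and months), and cue text is assumed to have no leading whitespace.

```python
# Abbreviated stand-in for the default protected words.
PROTECTED = {"I", "I'm", "I'll", "I've", "I'd"}


def apply_label(text: str, label: str, proper_nouns: frozenset = frozenset()) -> str:
    """Fix the case of a cue's first letter according to its S/C label."""
    words = text.split()
    if not words:
        return text
    first = words[0].rstrip(".,!?…:;")
    if label == "C":
        if first in PROTECTED or first in proper_nouns:
            return text
        # All-caps acronym: first two characters are both uppercase letters.
        if first[:2].isalpha() and len(first) >= 2 and first[:2].isupper():
            return text
        return text[0].lower() + text[1:]
    if label == "S" and text[0].islower():
        return text[0].upper() + text[1:]
    return text
```

Only the first character can ever change, which keeps the rest of the cue exactly as transcribed.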
5. Surface remaining suspects (heuristic review)
Even with the anti-bias prompt, the LLM has a tail of errors. Run a deterministic check that flags every adjacent pair where:
- Previous cue lacks terminal punctuation (".", "!", "?", "…")
- Current cue starts with a capital letter (and isn't a proper noun)
python3 scripts/flag_suspect_caps.py <srt> > suspects.md
This produces a markdown checklist. Hand it to the user for a quick scan — way faster than re-watching, way more reliable than another LLM pass.
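A sketch of the adjacent-pair check, working on a plain list of cue texts. It is not the actual flag_suspect_caps.py: the function names and checklist layout are hypothetical, and acronym handling is left out for brevity.

```python
def suspect_pairs(cue_texts: list[str], proper_nouns: frozenset = frozenset()) -> list[int]:
    """Return 0-based indices of cues that start capitalized mid-sentence."""
    suspects = []
    for i in range(1, len(cue_texts)):
        prev, cur = cue_texts[i - 1].rstrip(), cue_texts[i].lstrip()
        if not prev or not cur:
            continue
        first = cur.split()[0].rstrip(".,!?…:;")
        open_sentence = prev[-1] not in ".!?…"
        protected = first in proper_nouns or first == "I" or first.startswith("I'")
        if open_sentence and cur[0].isupper() and not protected:
            suspects.append(i)
    return suspects


def as_checklist(cue_texts: list[str], suspects: list[int]) -> str:
    """Render suspects as a markdown checklist for manual review."""
    return "\n".join(
        f"- [ ] cue {i + 1}: {cue_texts[i - 1]!r} -> {cue_texts[i]!r}"
        for i in suspects
    )
```

The check deliberately over-flags: a human dismissing a false positive is cheap, while a missed capitalization error ships in the final video.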
6. Output: mux or burn
Mux (soft subs) — recommended default. Subs stay editable, viewers control on/off, file size barely changes:
ffmpeg -i video.mp4 -i subs.srt \
-c:v copy -c:a copy -c:s mov_text \
-metadata:s:s:0 language=eng \
-disposition:s:0 default \
output.mp4
Burn (hard subs) — for max compatibility, custom styling, or platforms that strip subtitle tracks. Requires an ffmpeg build with libass (Homebrew default has no libass; install homebrew-ffmpeg/ffmpeg/ffmpeg).
# Convert SRT → ASS (lets you edit styling as plain text):
ffmpeg -i subs.srt subs.ass
# Edit Style line in ASS for FontSize, Outline, MarginV, etc.
# Then burn:
ffmpeg -loop 1 -i image.jpg -i audio.m4a \
-vf "scale=1280:720:force_original_aspect_ratio=decrease,pad=1280:720:(ow-iw)/2:(oh-ih)/2,ass=subs.ass,format=yuv420p" \
-c:v libx264 -preset veryfast -crf 23 -tune stillimage -r 10 \
-c:a aac -b:a 128k -shortest -movflags +faststart \
output.mp4
Format tradeoffs
| Format | Compatibility | Styling | Multi-speaker overlay |
|---|---|---|---|
| mov_text in MP4 (mux) | Universal (QuickTime, VLC, web, mobile) | None (flat) | No (cues serialized) |
| Burned-in styled ASS | Universal (subs are pixels) | Full | Yes (positioning + layers) |
| SRT/ASS in MKV | VLC, mpv | Full (ASS) | Yes |
VLC quirk: even with default disposition set, VLC doesn't auto-enable muxed subs. Users must enable globally in VLC: Preferences → Subtitles/OSD → Enable subtitles. QuickTime respects the flag.
Lessons learned (don't relearn)
- Don't let an LLM rewrite the SRT. Use the LLM-proposes / code-disposes split. (Restated because it's that important.)
- Strip timestamps before sentence-boundary labeling. The visual separation tricks the LLM into labeling everything as S.
- Tell the LLM to ignore current capitalization when judging sentence boundaries. Whisper's default-capitalize fools the model into circular reasoning.
- Parallel-list consistency is a separate rule. "First / Second / Third" all get the same label. Without an explicit rule, model