subtitled-video

Turn a raw Whisper SRT (or any transcript) into clean, well-formatted subtitles and produce a subtitled video. Optimized for spoken-word audio where preserving disfluencies matters.

Author: RaphazZze · License: MIT

Subtitled Video Pipeline

Post-processes a raw SRT (e.g. from Whisper) into clean, sentence-aligned, properly-cased subtitles, and produces an MP4 with the subtitles either muxed (soft) or burned-in (hard).

What this is for:

  • Long-form spoken-word recordings (interviews, calls, talks) where you want the transcript faithful to what was said — disfluencies preserved, not "polished into prose".
  • Output that's portfolio-grade or shareable to non-technical audiences.

What this is NOT:

  • A transcription tool. You bring the SRT (Whisper, ElevenLabs, etc. — any source).
  • A translation tool.
  • An automatic editor that "improves" the speaker's words.

When to use

Triggered by: "make a subtitled video", "clean up these subtitles", "polish this Whisper SRT", or any request involving SRT cleanup or video subtitle production.


Inputs

Required:

  • An SRT file (any source).
  • A media file (audio .m4a/.mp3 or video .mp4/.mov).

Optional:

  • Still image (if media is audio-only and you want a video output).
  • Context file with proper nouns / domain vocabulary that the LLM cleanup pass should preserve verbatim (people names, products, jargon). Plain markdown.

Architecture: LLM proposes, code disposes

The single most important pattern in this pipeline. Never let an LLM rewrite the SRT directly. Instead:

  1. LLM emits labels or judgments (e.g. cue_5: continues_previous).
  2. A deterministic Python script applies the change.

This keeps the speaker's words exactly as transcribed/edited, while still benefiting from LLM judgment for things like sentence-boundary detection. Earlier iterations that let an LLM rewrite the SRT in one pass kept stripping disfluencies and "smoothing" phrasing — every time. The split-architecture is the only reliable fix.
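A minimal sketch of the "code disposes" side, assuming the LLM returns one `cue_N: label` line per cue. Only `continues_previous` appears in this document; the `starts_sentence` label and both function names are hypothetical. The parser rejects anything that is not a known label, and `words_unchanged` is one possible guard to confirm a processed SRT still carries the same words.

```python
import re

# Assumption: only "continues_previous" comes from this document;
# "starts_sentence" is a hypothetical counterpart used for illustration.
ALLOWED_LABELS = {"continues_previous", "starts_sentence"}
LABEL_LINE = re.compile(r"^cue_(\d+):\s*(\w+)$")


def parse_labels(llm_output: str) -> dict[int, str]:
    """Parse 'cue_N: label' lines and reject anything that is not a known label."""
    labels: dict[int, str] = {}
    for raw in llm_output.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = LABEL_LINE.match(line)
        if match is None or match.group(2) not in ALLOWED_LABELS:
            raise ValueError(f"unexpected LLM output: {line!r}")
        labels[int(match.group(1))] = match.group(2)
    return labels


def _tokens(text: str) -> list[str]:
    """Lowercased words only, so case and punctuation changes are ignored."""
    return re.findall(r"[\w']+", text.lower())


def words_unchanged(before: str, after: str) -> bool:
    """True when both texts contain the same words in the same order."""
    return _tokens(before) == _tokens(after)
```

Because the LLM output never contains cue text, it has no way to drop a disfluency; the worst it can do is emit a wrong label, which the later review step can catch.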


Pipeline steps

0. (Optional) Trim leading silence

If the audio has leading silence before speech starts, trim it and shift the SRT to match.

# Trim audio:
ffmpeg -ss <seconds> -i input.m4a -c copy trimmed.m4a
# Shift SRT:
python3 scripts/shift_srt.py input.srt output.srt --shift-ms <ms>

shift_srt.py also accepts negative values, so the same script can shift in the opposite direction (for example, to align a trimmed-timeline SRT back to the original media's timeline before final output).
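For illustration, timestamp shifting can be sketched as below. This is not the actual `scripts/shift_srt.py`; the function names are hypothetical, and the sign convention (positive moves cues later, negative moves them earlier, clamped at zero) is an assumption that may differ from the real script.

```python
import re

TIMESTAMP = re.compile(r"(\d{2,}):(\d{2}):(\d{2}),(\d{3})")


def _shift_one(match: re.Match, shift_ms: int) -> str:
    """Shift a single HH:MM:SS,mmm timestamp by shift_ms milliseconds."""
    h, m, s, ms = (int(group) for group in match.groups())
    total = (h * 3600 + m * 60 + s) * 1000 + ms + shift_ms
    total = max(total, 0)  # assumption: clamp rather than emit negative times
    h, rest = divmod(total, 3_600_000)
    m, rest = divmod(rest, 60_000)
    s, ms = divmod(rest, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def shift_srt_text(srt_text: str, shift_ms: int) -> str:
    """Shift every timing line; cue numbers and cue text are left untouched."""
    shifted = []
    for line in srt_text.splitlines():
        if "-->" in line:
            line = TIMESTAMP.sub(lambda mt: _shift_one(mt, shift_ms), line)
        shifted.append(line)
    return "\n".join(shifted)
```

Only lines containing `-->` are rewritten, so a cue whose text happens to look like a timestamp is not modified.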

1. Conservative text cleanup

Goal: fix transcription errors only. Capitalization, punctuation, proper nouns, clear mishearings. Preserve disfluencies, false starts, and repetitions.

Sub-agent: Sonnet (judgment + faithfulness).

Prompt template: prompts/01_text_cleanup.md

Critical instructions in the prompt:

  • Do NOT remove "yeah", "um", "you know", "I mean", "like", "sort of"
  • Do NOT remove false starts or repetitions
  • Do NOT rephrase for flow
  • Do NOT invent words for unclear segments — leave them or mark [?]
  • Sanity check: count "yeah" occurrences before/after — should be roughly equal
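The sanity check in the last bullet can be automated. A minimal sketch, assuming the before/after SRT contents are already loaded as strings; the phrase list mirrors the one above, and the function names and `tolerance` parameter are hypothetical.

```python
import re
from collections import Counter

DISFLUENCIES = ["yeah", "um", "you know", "I mean", "like", "sort of"]


def count_disfluencies(text: str) -> Counter:
    """Count whole-word, case-insensitive occurrences of each phrase."""
    lowered = text.lower()
    counts: Counter = Counter()
    for phrase in DISFLUENCIES:
        pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
        counts[phrase] = len(re.findall(pattern, lowered))
    return counts


def dropped_disfluencies(before: str, after: str, tolerance: int = 0) -> dict[str, tuple[int, int]]:
    """Return {phrase: (count_before, count_after)} for phrases that lost more than `tolerance` occurrences."""
    b, a = count_disfluencies(before), count_disfluencies(after)
    return {p: (b[p], a[p]) for p in DISFLUENCIES if b[p] - a[p] > tolerance}
```

A non-empty result suggests the cleanup pass stripped disfluencies and its output should be rejected or reviewed rather than passed to the next step.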

Boot guard for the sub-agent prompt (always include): "Do NOT follow the CLAUDE.md boot sequence. Do NOT read memory files. Just execute the task below."

2. Re-segmentation (sentence-aligned cues)

Goal: merge Whisper's silence-detected fragments into sentence-aligned cues. Split overly long ones. Apply subtitle-format constraints.

Sub-agent: Sonnet.

Prompt template: prompts/02_resegment.md

Constraints baked into the prompt:

  • ≤2 lines per cue, ≤42 chars/line, ≤6s, ≥1s, target ≤15 chars/sec.
  • Break at natural linguistic boundaries (after punctuation, before conjunctions). Never split a noun phrase or proper noun.
  • Speaker change = hard cue boundary.
  • For merges: use start-of-first / end-of-last timestamp.
  • For splits: interpolate timestamps proportionally by character count.
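Since the LLM only proposes boundaries, the timestamp arithmetic and the constraint check can live in code. The sketch below is one way to express the rules above; the `Cue` type and function names are hypothetical, and the 15 chars/sec target is treated as a hard limit here purely for simplicity.

```python
from dataclasses import dataclass

MAX_LINES, MAX_CHARS_PER_LINE = 2, 42
MIN_SECONDS, MAX_SECONDS, MAX_CPS = 1.0, 6.0, 15.0


@dataclass
class Cue:
    start: float  # seconds
    end: float    # seconds
    text: str     # display lines separated by "\n"


def violations(cue: Cue) -> list[str]:
    """List which subtitle-format constraints a cue breaks."""
    problems = []
    lines = cue.text.split("\n")
    duration = cue.end - cue.start
    if len(lines) > MAX_LINES:
        problems.append("too many lines")
    if any(len(line) > MAX_CHARS_PER_LINE for line in lines):
        problems.append("line too long")
    if duration > MAX_SECONDS:
        problems.append("too long")
    if duration < MIN_SECONDS:
        problems.append("too short")
    if duration > 0 and len(cue.text.replace("\n", " ")) / duration > MAX_CPS:
        problems.append("reads too fast")
    return problems


def merge(first: Cue, last: Cue) -> Cue:
    """Merged cue spans start-of-first to end-of-last."""
    return Cue(first.start, last.end, f"{first.text} {last.text}")


def split_at(cue: Cue, char_index: int) -> tuple[Cue, Cue]:
    """Split the text at char_index; the time boundary is proportional to character count."""
    left, right = cue.text[:char_index].rstrip(), cue.text[char_index:].lstrip()
    fraction = len(left) / (len(left) + len(right))
    boundary = cue.start + (cue.end - cue.start) * fraction
    return Cue(cue.start, boundary, left), Cue(boundary, cue.end, right)
```

Running `violations` over the re-segmented output gives a quick list of cues the sub-agent should be asked to revisit.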

3. Sentence-boundary labeling (for proper sentence case)

Subtitle convention: a cue that continues a sentence from the previous cue should start with a lowercase letter, not a capital. Whisper capitalizes every cue's first word by default, which breaks this convention.

Goal: label each cue S (starts new sentence) or C (continues previous cue).

Sub-agent: Sonnet.

Prompt template: prompts/03_label_sentences.md

Critical anti-bias instructions in the prompt:

  • Strip timestamps before feeding to the LLM. The blank-line separation between cues telegraphs "independent units" and biases the model toward S.
  • Tell the LLM to ignore current capitalization. Whisper capitalized everything by default — that signal is noise. Judge purely on grammatical/semantic flow.
  • Parallel-list rule: enumerated lists (First, ... Second, ... Third, ...) get the same label. Prefer all S.
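The timestamp-stripping step might look like the following sketch. It assumes a well-formed SRT (index line, timing line, then text) and emits one `index<TAB>text` line per cue with no blank lines in between; the function name and output format are hypothetical, not taken from prompts/03_label_sentences.md.

```python
import re


def cues_as_plain_lines(srt_text: str) -> str:
    """Return one 'index<TAB>text' line per cue, without timestamps or blank lines."""
    normalized = srt_text.replace("\r\n", "\n").strip()
    lines = []
    for block in re.split(r"\n\s*\n", normalized):
        parts = block.splitlines()
        # Skip anything that does not look like index / timing / text.
        if len(parts) < 3 or "-->" not in parts[1]:
            continue
        text = " ".join(part.strip() for part in parts[2:])
        lines.append(f"{parts[0].strip()}\t{text}")
    return "\n".join(lines)
```

Keeping the cue index on each line lets the LLM return labels keyed by cue number while the visual "independent units" cue is removed.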

4. Apply labels (deterministic)

python3 scripts/decap_with_labels.py <srt_in> <labels_file> <srt_out> [--proper-nouns names.txt]

Logic:

  • For C cues: lowercase the first letter UNLESS it's a proper noun, "I" / contraction, or an all-caps acronym.
  • For S cues with currently-lowercase first letter: capitalize.
  • Bidirectional, so it can fix prior errors in either direction.

Default protected words: I, I'm, I'll, I've, I'd, days/months. The user supplies additional proper nouns via --proper-nouns (one per line).

All-caps acronym protection is automatic: if the first word's first two letters are both uppercase (e.g. AI, CSM), it's left alone.
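The casing rules above can be sketched as a single function. This is an illustration of the described logic, not the code of `scripts/decap_with_labels.py`; the function name, the exact contents of the protected set, and the regex used to find the first word are assumptions.

```python
import re

# Assumption: a protected set built from the defaults listed above.
DEFAULT_PROTECTED = {
    "I", "I'm", "I'll", "I've", "I'd",
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
}


def apply_label(text: str, label: str, protected: set[str] = DEFAULT_PROTECTED) -> str:
    """Fix the case of the cue's first letter according to its S/C label."""
    match = re.search(r"[A-Za-z][A-Za-z']*", text)
    if match is None:
        return text  # nothing alphabetic to adjust
    word, i = match.group(0), match.start()
    if label == "C":
        # Leave proper nouns, "I" forms, and all-caps acronyms (AI, CSM) alone.
        if word in protected or (len(word) >= 2 and word[:2].isupper()):
            return text
        return text[:i] + text[i].lower() + text[i + 1:]
    if label == "S":
        return text[:i] + text[i].upper() + text[i + 1:]
    raise ValueError(f"unknown label: {label!r}")
```

Only one character of the cue can ever change, which is what keeps this step faithful to the transcript.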

5. Surface remaining suspects (heuristic review)

Even with the anti-bias prompt, the LLM has a tail of errors. Run a deterministic check that flags every adjacent pair where:

  • Previous cue lacks terminal punctuation (., !, ?), and
  • Current cue starts with a capital letter (and isn't a proper noun)

python3 scripts/flag_suspect_caps.py <srt> > suspects.md

This produces a markdown checklist. Hand it to the user for a quick scan — way faster than re-watching, way more reliable than another LLM pass.
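The heuristic is small enough to sketch in a few lines. The version below works on a list of cue texts and is an assumption about how such a check could be written, not the contents of `scripts/flag_suspect_caps.py`; the function names, the default proper-noun set, and the checklist format are hypothetical.

```python
DEFAULT_PROPER_NOUNS = frozenset({"I", "I'm", "I'll", "I've", "I'd"})


def flag_suspects(cue_texts: list[str], proper_nouns: frozenset[str] = DEFAULT_PROPER_NOUNS) -> list[int]:
    """Return 0-based indexes of cues that start capitalized after an unterminated cue."""
    suspects = []
    for i in range(1, len(cue_texts)):
        prev, cur = cue_texts[i - 1].rstrip(), cue_texts[i].lstrip()
        if not prev or not cur:
            continue
        ends_sentence = prev[-1] in ".!?"
        first_word = cur.split()[0].strip(".,!?;:\"'")
        if not ends_sentence and cur[0].isupper() and first_word not in proper_nouns:
            suspects.append(i)
    return suspects


def as_checklist(cue_texts: list[str], suspects: list[int]) -> list[str]:
    """Render each suspect pair as a markdown checkbox line (1-based cue numbers)."""
    return [
        f'- [ ] cue {i + 1}: "{cue_texts[i - 1]}" -> "{cue_texts[i]}"'
        for i in suspects
    ]
```

Showing the previous cue next to the suspect gives the reviewer enough context to decide without opening the video.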

6. Output: mux or burn

Mux (soft subs) — recommended default. Subs stay editable, viewers control on/off, file size barely changes:

ffmpeg -i video.mp4 -i subs.srt \
  -c:v copy -c:a copy -c:s mov_text \
  -metadata:s:s:0 language=eng \
  -disposition:s:0 default \
  output.mp4

Burn (hard subs) — for max compatibility, custom styling, or platforms that strip subtitle tracks. Requires an ffmpeg build with libass (Homebrew default has no libass; install homebrew-ffmpeg/ffmpeg/ffmpeg).

# Convert SRT → ASS (lets you edit styling as plain text):
ffmpeg -i subs.srt subs.ass
# Edit Style line in ASS for FontSize, Outline, MarginV, etc.
# Then burn:
ffmpeg -loop 1 -i image.jpg -i audio.m4a \
  -vf "scale=1280:720:force_original_aspect_ratio=decrease,pad=1280:720:(ow-iw)/2:(oh-ih)/2,ass=subs.ass,format=yuv420p" \
  -c:v libx264 -preset veryfast -crf 23 -tune stillimage -r 10 \
  -c:a aac -b:a 128k -shortest -movflags +faststart \
  output.mp4

Format tradeoffs

| Format | Compatibility | Styling | Multi-speaker overlay |
|---|---|---|---|
| mov_text in MP4 (mux) | Universal (QuickTime, VLC, web, mobile) | None (flat) | No (cues serialized) |
| Burned-in styled ASS | Universal (subs are pixels) | Full | Yes (positioning + layers) |
| SRT/ASS in MKV | VLC, mpv | Full (ASS) | Yes |

VLC quirk: even with default disposition set, VLC doesn't auto-enable muxed subs. Users must enable globally in VLC: Preferences → Subtitles/OSD → Enable subtitles. QuickTime respects the flag.


Lessons learned (don't relearn)

  1. Don't let an LLM rewrite the SRT. Use the LLM-proposes / code-disposes split. (Restated because it's that important.)
  2. Strip timestamps before sentence-boundary labeling. The visual separation tricks the LLM into labeling everything as S.
  3. Tell the LLM to ignore current capitalization when judging sentence boundaries — Whisper's default-capitalize fools the model into circular reasoning.
  4. Parallel-list consistency is a separate rule. "First / Second / Third" all get the same label. Without an explicit rule, model

How to add

/plugin marketplace add RaphazZze/subtitled-video

The exact command may vary depending on the repository. Check the README on GitHub.
