TikTok Karaoke Captions
Burn TikTok-style karaoke captions and a persistent headline banner into any local video, fully offline on macOS Apple Silicon.
When to use
Trigger this skill when the user wants any of:
- 给视频加字幕 / add subtitles to a video
- TikTok 风格字幕 / karaoke captions / per-word highlight captions
- 顶部 headline 横幅 / persistent headline / video title overlay
- 用 Whisper 转录视频 / auto-transcribe a video
- 字幕烧入视频 / burn subtitles into a video
- 用脚本原文校准字幕 / forced-alignment subtitles from a known script
Quick reference
# 1. Pure auto-caption
python3 ~/Desktop/Repos/AI_Skills/tiktok-karaoke-captions/caption.py video.mp4
# 2. Script-aligned (zero typos)
python3 ~/Desktop/Repos/AI_Skills/tiktok-karaoke-captions/caption.py video.mp4 \
--script-file script.txt
# 3. Full TikTok package
python3 ~/Desktop/Repos/AI_Skills/tiktok-karaoke-captions/caption.py video.mp4 \
--script-file script.txt --headline "BLACK FRIDAY · 50% OFF"
# Common flags
--caption-mode classic # static line-level SRT instead of karaoke
--max-words-per-chunk 4 # 1–4 words per chunk (default 3)
--no-uppercase # keep original casing
--model small # lighter model (480 MB vs default medium 1.5 GB)
--language zh # Chinese audio
--srt-only # just generate SRT/ASS, no burn-in
--out-dir ./output # custom output dir
Outputs (in --out-dir or alongside the input)
<stem>.srt— line-level SRT (always written)<stem>.ass— karaoke ASS (tiktok mode only)<stem>.whisper.json— raw Whisper word timestamps (debug)<stem>-captioned.mp4— final video with text burned in
Mechanism
- Audio extract: ffmpeg → 16 kHz mono WAV
- Transcribe:
uvx --from mlx-whisper mlx_whisper(Apple Silicon native, ~10 sec for 15s video at medium model after warmup; auto-retries with bigger model if output looks broken) - Align: if
--script-filegiven,difflib.SequenceMatchermaps each script word → Whisper word timestamp (forced alignment, zero typos) - Chunk: split into 1–3 word chunks at sentence/comma boundaries
- ASS karaoke: each chunk → N events, current word highlighted yellow via
{\c&H0000FFFF&}…{\c}inline tags - Burn: single ffmpeg pass —
drawbox+drawtextfor headline pill +subtitlesfilter for ASS - ffmpeg fallback: if system ffmpeg lacks libass (Homebrew bottle), auto-uses
static-ffmpegfrom PyPI viauvx(no system changes)
Requirements
- macOS on Apple Silicon (mlx-whisper requirement)
uvinstalled (brew install uv) — only system dependency- ~1.8 GB disk for first-run downloads (whisper-medium model + deps + static-ffmpeg)
Optional: Deepgram cloud (faster + more reliable than local)
Set DEEPGRAM_API_KEY env var to enable Deepgram Nova-3 — when the key is set, it becomes the default primary backend (faster: ~2s vs ~10s, and more reliable than local mlx-whisper). Local Whisper is used as the fallback when Deepgram fails. Pass --prefer-local to flip this back to local-first. Free $200 credit at https://console.deepgram.com/signup. Cost ~$0.001 per 15s clip.
Bundled fonts (open-source, commercial OK)
- Archivo Black (OFL 1.1) — headline pill
- Roboto Black / Bold / Regular (Apache 2.0) — captions
See fonts/README.md for details and licenses.
CLI reference
Run caption.py --help for the full list of flags.