/watch — Claude watches a video
You don't have a video input; this skill gives you one. A Python script downloads the video, extracts frames as JPEGs, gets a timestamped transcript (native captions first, then Whisper API as fallback), and prints frame paths. You then Read each frame path to see the images and combine them with the transcript to answer the user.
Step 0 — Setup preflight (runs every /watch invocation, silent on success)
Python interpreter: every python3 ... command in this skill is for macOS/Linux. On Windows, substitute python — the python3 command on Windows is the Microsoft Store stub and will not run the script.
Before every /watch run, verify that dependencies and an API key are in place:
python3 "${CLAUDE_SKILL_DIR}/scripts/setup.py" --check
This is a <100ms lookup. On exit 0, the script emits nothing — proceed to Step 1 without comment. Do NOT announce "setup is complete" to the user — they don't need a status message on every turn. The only acceptable user-visible output from Step 0 is when remediation is required.
On non-zero exit, follow the table:
| Exit | Meaning | Action |
|---|---|---|
2 | Missing binaries (ffmpeg / ffprobe / yt-dlp) | Run installer |
3 | No Whisper API key | Do NOT block. Local STT handles transcription without any key — proceed directly to Step 2. Only run the installer if the user explicitly wants Whisper API. |
4 | Both missing | Run installer for binaries, then proceed — no API key needed |
The installer is idempotent — safe to re-run:
python3 "${CLAUDE_SKILL_DIR}/scripts/setup.py"
On macOS with Homebrew, it auto-installs ffmpeg and yt-dlp. On Linux/Windows, it prints the exact install commands for the user to run. It scaffolds ~/.config/watch/.env with commented placeholders at 0600 perms, and writes SETUP_COMPLETE=true once deps + a key are in place so the next session knows this user has already been through the wizard.
If an API key is still missing after install: use AskUserQuestion to ask the user whether they have a Groq API key (preferred — cheaper, faster) or an OpenAI key. Then write it into ~/.config/watch/.env — set the matching GROQ_API_KEY=... or OPENAI_API_KEY=... line. If they don't want to set up Whisper, proceed with --no-whisper and tell them videos without native captions will come back frames-only.
Structured mode (optional): python3 "${CLAUDE_SKILL_DIR}/scripts/setup.py" --json emits {status, first_run, missing_binaries, whisper_backend, has_api_key, config_file, platform} where status is one of ready | needs_install | needs_key | needs_install_and_key. Use this when you need to branch on specifics (e.g. "is this the user's very first run?" → first_run: true).
Within a single session, you can skip Step 0 on follow-up /watch calls — once --check returned 0, nothing about the environment changes between turns.
When to use
- User pastes a video URL (YouTube, Vimeo, X, TikTok, Twitch clip, most yt-dlp-supported sites) and asks about it.
- User points at a local video file (
.mp4,.mov,.mkv,.webm, etc.) and asks about it. - User types
/watch <url-or-path> [question].
Recommended limits
- Best accuracy: videos under 10 minutes. Frame coverage scales inversely with duration.
- Hard caps: 100 frames total and 2 fps. Token cost grows with frame count, so the script targets a frame budget by duration (and never exceeds 2 fps even when the budget would imply more):
- ≤30s → ~1-2 fps (up to 30 frames)
- 30s-1min → ~40 frames
- 1-3min → ~60 frames
- 3-10min → ~80 frames
- >10min → 100 frames, sparsely spaced (warning printed)
- If the user hands you a long video, consider asking whether they want a specific section before burning tokens on a sparse scan.
How to invoke
CRITICAL: Always run watch.py — never build your own pipeline. Do not manually call yt-dlp, ffmpeg, ffprobe, or any whisper binary yourself. watch.py handles the full pipeline: download → frames → captions → local STT → Whisper API → report. Your job is to run the script and read its output.
Step 1 — parse the user input. Separate the video source (URL or path) from any question the user asked. Example: /voxpip https://youtu.be/abc what language is this in? → source = https://youtu.be/abc, question = what language is this in?.
Step 2 — run the watch script. Pass the source verbatim. Do not shell-escape it yourself beyond normal quoting:
python3 "${CLAUDE_SKILL_DIR}/scripts/watch.py" "<source>"
Optional flags:
--start T/--end T— focus on a section. AcceptsSS,MM:SS, orHH:MM:SS. When either is set, fps auto-scales denser (see "Focusing on a section" below).--max-frames N— lower the cap for tighter token budget (e.g.--max-frames 40)--resolution W— change frame width in px (default 512; bump to 1024 only if the user needs to read on-screen text)--fps F— override auto-fps (clamped to 2 fps max)--out-dir DIR— keep working files somewhere specific (default: an auto-generated tmp dir)--whisper groq|openai— force a specific Whisper backend (default: prefer Groq if both keys exist)--no-whisper— disable the Whisper fallback entirely (frames-only if no captions)
Focusing on a section (higher frame rate)
When the user asks about a specific moment — "what happens at the 2 minute mark?", "zoom into 0:45 to 1:00", "the first 10 seconds" — pass --start and/or --end. The script switches to focused-mode budgets, which are denser than full-video budgets (still capped at 2 fps):
- ≤5s → 2 fps (up to 10 frames)
- 5-15s → 2 fps (up to 30 frames)
- 15-30s → ~2 fps (up to 60 frames)
- 30-60s → ~1.3 fps (up to 80 frames)
- 60-180s → ~0.6 fps (100 frames, capped)
Focused mode is the right call for:
- Any moment/range the user names explicitly ("around 2:30", "the intro", "the last 30 seconds").
- Any video longer than ~10 minutes where the user's question is about a specific part — running focused on the relevant section is far more useful than a sparse scan of the whole thing.
- Re-runs after a full scan didn't have enough detail in some region.
Transcript is auto-filtered to the same range. Frame timestamps are absolute (real video timeline, not offset-from-start).
Examples:
# Last 10 seconds of a 1 minute video
python3 "${CLAUDE_SKILL_DIR}/scripts/watch.py" video.mp4 --start 50 --end 60
# Zoom into 2:15 → 2:45 at 3 fps (90 frames)
python3 "${CLAUDE_SKILL_DIR}/scripts/watch.py" "$URL" --start 2:15 --end 2:45 --fps 3
# From 1h12m to the end of the video
python3 "${CLAUDE_SKILL_DIR}/scripts/watch.py" "$URL" --start 1:12:00
Step 3 — Read every frame path the script lists. The Read tool renders JPEGs directly as images for you. Read all frames in a single message (parallel tool calls) so you see them together. The frames are in chronological order with a t=MM:SS timestamp so you can align them to the transcript.
Step 4 — answer the user. You now have two streams of evidence:
- Frames — what's on screen at each timestamp
- Transcript — what's said at each timestamp. The report's header shows the source (
captions= yt-dlp pulled native subs;whisper (groq)orwhisper (openai)= transcribed by API).
If the user asked a specific question, answer it directly citing timestamps. If they didn't ask anything, summarize what happens in the video — structure, key moments, notable visuals, spoken content.
Step 5 — clean up. The script prints a working directory at the end. If the user isn't going to ask follow-ups about this video, delete it with rm -rf <dir>. If they might, leave it in place.
Transcription
The script tries four tiers in order — it stops at the first that succeeds:
- Native captions (free, preferred). yt-dlp pulls manual or auto-generated subtitles from the source platform if available. 2