Watch a video
This skill prepares a video for analysis. It pulls a transcript and extracts a small number of frames for visual context. No remote transcription API is used. Captions come straight from the platform when available; videos without captions are transcribed locally with whisper.cpp (~140MB model, runs on CPU). If both fail, the script falls back to vision-only analysis from frames.
When to use
Trigger when the user shares a video URL or file path and asks anything about its contents. Examples:
- "Summarize this YouTube video: <url>"
- "What does the speaker say about <topic> in this video?"
- "At what timestamp does <X> happen?"
- "Describe what's on screen at 2:30"
- "Who's in this clip?"
Do NOT trigger for non-video URLs, image-only files, or audio files.
How to run
The script lives at scripts/watch.py inside this skill's directory. Resolve it to an absolute path (the directory containing this SKILL.md) before invoking. After running install.sh, the canonical path is ~/.claude/skills/watch/scripts/watch.py.
python3 <skill-dir>/scripts/watch.py "<source>" [options]
Options (all optional):
--mode {fast,balanced,accurate}— token vs. accuracy preset. Defaultbalanced. See the table below.--start <time>and--end <time>— trim to a range. Use when the user asks about a specific segment ("what happens in the last 2 minutes").--max-frames N— override the mode's frame cap.--resolution N— override the mode's frame width in pixels. Bump to 768+ ONLY when the user needs to read small on-screen text (slide content, code, UI labels).--no-transcribe— skip the local whisper.cpp fallback and go straight to vision-only when no platform captions exist. Use when the user is in a hurry and a transcript isn't critical.--whisper-model {tiny,tiny.en,base,base.en,small,small.en}— override the whisper model (defaultbase). Bump tosmallfor noisy or technical audio ifbaseproduces obvious errors.--keep— don't print the cleanup hint.
Mode selection
| Mode | With transcript | Vision only | When to use |
|---|---|---|---|
fast | 15 frames @ 320px | 30 frames @ 480px | Summaries, "what's this video about", or anything where the transcript clearly carries the answer. |
balanced | 25 frames @ 384px | 50 frames @ 512px | Default. Most general questions. |
accurate | 60 frames @ 512px | 100 frames @ 768px | Visual details that matter — tracking who appears when, reading dense slides, transcribing on-screen code, dense action choreography. Costs 3-4x more tokens. |
"Transcript" means either platform captions or local whisper.cpp output. "Vision only" is when both are unavailable (or --no-transcribe was passed).
Pick the lowest mode that can answer the user's question. If unsure, default to balanced. If the user explicitly asks for "thorough" / "detailed visual analysis" / "find every instance of X on screen", use accurate.
Time format: SS, MM:SS, or HH:MM:SS.
Workflow
- Run the script. The output is a markdown summary with metadata, transcript (if available), and a list of frame paths with timestamps.
- Read the transcript first. It's already in your context — that alone often answers the question. For 80%+ of questions about videos with a transcript (captions or whisper output), it's sufficient.
- Read frames selectively. The frame list is an index, not a directive to read them all. Use the Read tool only on the frames you actually need:
- "What is the speaker wearing?" → read 1-2 frames near the start
- "What does the slide at 3:42 say?" → read the single frame closest to 3:42
- "Describe the visuals throughout" → read every 4th frame or so
- Vision-only mode (no transcript) → start with ~6-8 evenly-spaced frames, then drill in
- Answer the user's question. Cite timestamps from the transcript or frames when relevant: "At [02:14], …".
- Clean up. When done with the conversation about this video, delete the work directory shown at the bottom of the summary:
rm -rf <work_dir>.
Token discipline
Reading a frame costs ~1k tokens (low-res mode for 384px) to ~1.5k+ for higher resolutions. Don't load all frames if the transcript answers the question.
If a video is over ~15 minutes, prefer narrowing with --start/--end before running the script, especially if the user asked about a specific segment.
Examples
User: "Summarize https://youtu.be/abc123"
python3 <skill-dir>/scripts/watch.py "https://youtu.be/abc123"
Read transcript, summarize. Don't load any frames unless the user asks about visuals.
User: "What's on the slide at 5:30 in this talk: <url>"
python3 <skill-dir>/scripts/watch.py "<url>" --start 5:00 --end 6:00 --resolution 768
Find the frame closest to 5:30 in the output, read it, transcribe slide contents.
User: "Describe what happens in the last 30s of /Users/me/clip.mp4"
python3 <skill-dir>/scripts/watch.py /Users/me/clip.mp4 --start <duration-30s>
(First run with no --start to see the duration if unknown, then re-run.)
Setup
watch.py auto-installs ffmpeg, yt-dlp, and whisper.cpp on first run via the bundled setup.sh (Homebrew on macOS, apt/dnf/pacman + source build on Linux). You don't need to run anything manually.
The first caption-less video also triggers a one-time ~140MB model download to ~/.cache/watch-yt/models/. Subsequent runs reuse it.
If auto-install fails (e.g., Homebrew missing on macOS, no build tools on Linux), the script prints the actual error — relay it to the user and tell them how to fix it (install Homebrew from https://brew.sh, or install git/cmake/g++ via their package manager). Don't try to bypass with --no-install unless the user asks. Transcription gracefully degrades: if whisper.cpp is unavailable, captioned videos still work and uncaptioned ones fall back to vision-only.