Watch Video
Claude can't stream video directly. This skill fakes it: a Python pipeline (vendored from bradautomates/claude-video under scripts/) downloads the video, extracts auto-scaled JPEG frames with ffmpeg, pulls a timestamped transcript (native captions first, Whisper API fallback), and prints a markdown report listing every frame path. Claude then Reads each frame, aligns it to the spoken text, and writes a structured notes file.
When to invoke
Slash-command only. Run this skill ONLY when the user literally types /watch-video. That is the sole trigger.
Do NOT invoke on:
- Casual phrases like "watch this video," "summarize this," "take notes on this YouTube video," "analyze this reel"
- A bare YouTube URL pasted with a question ("what useful tips are in this video?" + URL)
- Any natural-language ask about a video that lacks the explicit
/watch-videocommand
Default for YouTube questions without the slash command: pull the transcript the fastest way available — fetch the YouTube page / use yt-dlp captions / a transcript site — skim it, and answer from that. The frame-extraction pipeline is overkill unless the user explicitly asks for it via /watch-video.
If you'd rather have auto-trigger behavior, edit the
description:line above to match the keywords you want Claude to fire on (e.g. "Use when the user wants Claude to watch, analyze, or take notes on a video"). Slash-only is the default in this repo because explicit invocation prevents accidental token burn on long videos.
Dependencies
- ffmpeg + ffprobe on PATH — for frame and audio extraction
- yt-dlp on PATH — for downloading and caption fetching
- Python 3.9+ — the bundled scripts use
from __future__ import annotationsso 3.9 works - Optional: Whisper API key for videos without native captions. Set
GROQ_API_KEY(preferred — cheaper/faster, runswhisper-large-v3) orOPENAI_API_KEYin~/.config/watch/.env. Without one, captioned videos work fine; uncaptioned videos return frames-only.
Run python scripts/setup.py --check to verify dependencies, or python scripts/setup.py to scaffold the .env and check binaries. On macOS, the installer auto-installs missing binaries via Homebrew. On Linux/Windows, it prints exact install commands.
Pipeline
The work happens in scripts/watch.py. It downloads, extracts, transcribes, and prints a markdown report to stdout that lists every frame path. The pipeline auto-scales the frame budget by duration (hard cap 100 frames / 2 fps), so no manual interval tuning.
python scripts/watch.py "<youtube-url-or-local-path>" [flags]
Flags worth knowing:
--start T/--end T— focus on a section (SS,MM:SS, orHH:MM:SS). Auto-scales fps denser inside the range. Use this for any question about a specific moment, or for any video > 10 min where the user's question is about one part.--max-frames N— lower the cap for tighter token budget (default 80, hard max 100).--resolution W— frame width in px (default 512; bump to 1024 only if on-screen text is unreadable).--whisper groq|openai— force a specific Whisper backend (default: prefer Groq if both keys exist).--no-whisper— skip transcription entirely if no captions. Frames-only output.--out-dir DIR— keep working files somewhere specific (default: an auto-generated tmp dir).
Auto-fps budgets (full-video mode):
- ≤30s → up to 30 frames
- 30s–1min → ~40 frames
- 1–3min → ~60 frames
- 3–10min → ~80 frames
- >10min → 100 frames sparse (warning printed; consider
--start/--end)
Step-by-step workflow
1. Run the pipeline
Default invocation, no flags:
python scripts/watch.py "<source>"
For long videos where the user asked about a specific moment, pass --start/--end:
python scripts/watch.py "<source>" --start 2:15 --end 2:45
The script writes everything to a tmp working directory and prints a markdown report to stdout. Capture the stdout — it contains:
- Header (Title, Uploader, Duration, Transcript source:
captions/whisper (groq)/whisper (openai)/none available) ## Framessection with- \<absolute-path>` (t=MM:SS)` lines## Transcriptsection with[MM:SS] text...lines- Footer with
Work dir: <path>
2. Read every frame
Read all the listed frame paths in a single message (parallel Read tool calls). The Read tool renders JPEGs as images. Each frame's filename + t=MM:SS from the report tells you when it occurred — pair each frame with the matching transcript line at that timestamp.
For very long videos (>10 min, sparse mode): the budget already capped at 100 frames, so reading all of them is fine.
3. Write the summary
Output file: <work-dir>/<slug>-notes.md by default. If the user said "save notes somewhere permanent," ask where.
Structure the markdown like this:
# <Title>
**Source:** <URL or local path>
**Duration:** <mm:ss>
**Uploader:** <if from YouTube>
**Transcript source:** <captions / whisper (groq) / whisper (openai) / none>
## One-line summary
<≤20 words — the core claim or hook of the video>
## TL;DR
<3–5 bullet points capturing the main arguments, moments, or beats>
## Timeline
- **[00:00]** <what's happening visually + the key line being said>
- **[00:15]** ...
<one row per meaningful beat, not per frame>
## Key quotes
> "<verbatim quote>" — [mm:ss]
## Visual notes
<what the video shows that the transcript alone would miss — setting, B-roll, on-screen text, graphics, transitions, subject's emotion>
## Takeaways (optional)
<Only include if the content is directly relevant to the user's domain or goals. Omit otherwise.>
4. Clean up
After the .md is written, delete the work dir (it contains the full downloaded video + frames, which is large). The script prints Work dir: <path> in the footer of its report; pass that path to rm -rf.
If the user specified a non-tmp --out-dir, ask before deleting.
Common gotchas
- YouTube Shorts / age-gated / members-only — yt-dlp may fail. Surface its stderr verbatim; don't retry silently.
- No captions + no Whisper key — the report says
Transcript: none availableand points atsetup.py. Tell the user they can add a Groq key to~/.config/watch/.envfor Whisper, or use--no-whisperfor frames-only. - Local file with no audio track — Whisper extraction errors out cleanly. Use
--no-whisperfor frames-only. - Very long videos (>30 min) — confirm with the user before running. The pipeline caps at 100 frames so the budget is bounded, but a sparse 100-frame scan of a 60-min video isn't very useful. Almost always better to run focused on the specific section.
- Cloudflare 403 on Groq —
whisper.pyalready sets a custom User-Agent to clear Cloudflare's default-Python-UA block. If you ever see a 403, that's the failure mode. - No automatic Groq → OpenAI fallback — the script picks one Whisper backend at the start of each run (Groq if its key is set, else OpenAI) and stays on it. If Groq rate-limits, the script retries Groq twice then errors out. To use OpenAI instead, pass
--whisper openai.
What NOT to do
- Don't attempt to "watch" a video by inventing content based on title or thumbnail. If the pipeline fails, say so.
- Don't write the summary before actually reading frames. The transcript alone misses visual context (B-roll, on-screen text, emotion, graphics, transitions).
- Don't skip cleanup. Frame dumps + downloaded videos are large.
- Don't auto-fire on a video URL. Slash-command-only — see "When to invoke" above.
Engine attribution
The pipeline scripts under scripts/ are vendored from bradautomates/claude-video (MIT). See THIRD_PARTY_NOTICES.md for the full upstream license.