/scenelens — Claude watches a video, smarter
You don't have a video input; this skill gives you one. Compared to a fixed-fps frame grab, scenelens:
- Picks frames at scene changes — content-aware sampling instead of time-uniform sampling. Same frame budget, far better signal.
- Runs OCR on every frame — on-screen text (slides, code, terminals, dashboards) is extracted as text alongside the image, so you don't burn vision tokens reading static pixels.
- Auto-chunks long audio: files over Whisper's 25 MB upload cap are split and transcribed in pieces instead of failing outright.
A Python script does all of this and prints a markdown report. You then Read each frame path to see the images and combine them with OCR + transcript to answer the user.
Step 0 — Setup preflight (silent on success)
Python interpreter: every python3 ... command in this skill is for macOS/Linux. On Windows, substitute python — python3 on Windows is the Microsoft Store stub and won't run the script.
Before the first /scenelens call in a session, verify that dependencies and an API key are in place:
python3 "${CLAUDE_SKILL_DIR}/scripts/setup.py" --check
This is a <100 ms lookup. On exit 0, the script emits nothing — proceed to Step 1 silently. Do NOT announce "setup is complete" — that's spam.
On non-zero exit:
| Exit | Meaning | Action |
|---|---|---|
| 2 | Missing required binaries (ffmpeg / ffprobe / yt-dlp) | Run installer |
| 3 | No Whisper API key | Run installer to scaffold .env, then ask user for a key |
| 4 | Both missing | Run installer, then ask for a key |
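As a rough illustration, the table above could be handled along these lines. The helper names (`interpreter_for`, `preflight_argv`, `action_for_exit`) are hypothetical and not part of setup.py; only the exit codes, the actions, and the python3/python rule come from this document.

```python
from typing import List

# Exit codes and actions copied from the table above.
ACTIONS = {
    0: "proceed silently to Step 1",
    2: "run installer",
    3: "run installer to scaffold .env, then ask user for a key",
    4: "run installer, then ask for a key",
}


def interpreter_for(platform: str) -> str:
    """Return the interpreter name for a sys.platform-style string."""
    # python3 on Windows is the Microsoft Store stub, so use python there.
    return "python" if platform.startswith("win") else "python3"


def preflight_argv(skill_dir: str, platform: str) -> List[str]:
    """Build the argv for the --check preflight (hypothetical helper)."""
    return [interpreter_for(platform), f"{skill_dir}/scripts/setup.py", "--check"]


def action_for_exit(code: int) -> str:
    """Map a setup.py --check exit code to the next action."""
    # Assumption: codes not listed in the table are surfaced to the user.
    return ACTIONS.get(code, "unknown exit code: show the output to the user")
```

In practice the argv would go to something like `subprocess.run(...)` and the resulting `returncode` to `action_for_exit`.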
The installer is idempotent:
python3 "${CLAUDE_SKILL_DIR}/scripts/setup.py"
On macOS with Homebrew, it auto-installs ffmpeg, yt-dlp, and (optionally) tesseract. On Linux/Windows, it prints exact install commands.
Tesseract is optional. Without it, the OCR pass is silently skipped — frames are still extracted, transcript still pulled. The skill works; it just loses the OCR sidechannel. The installer prints the install command for tesseract on each platform.
If an API key is still missing after install: use AskUserQuestion to ask whether the user has a Groq API key (preferred: cheaper and faster) or an OpenAI key, then write it into ~/.config/scenelens/.env on the matching GROQ_API_KEY=... or OPENAI_API_KEY=... line. If they don't want Whisper, proceed with --no-whisper and tell them that videos without captions come back frames-only.
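Writing the key onto the matching line might look like the sketch below. It assumes the scaffolded .env holds plain `NAME=value` lines (the exact scaffold layout isn't shown here), and `set_env_key` is a hypothetical helper, not something the installer provides.

```python
def set_env_key(env_text: str, name: str, value: str) -> str:
    """Return env_text with the NAME=... line set to value.

    Replaces the first existing NAME= line; appends one if none exists.
    """
    lines = env_text.splitlines()
    prefix = name + "="
    for i, line in enumerate(lines):
        if line.strip().startswith(prefix):
            lines[i] = prefix + value
            break
    else:
        # No matching line was found, so add one at the end.
        lines.append(prefix + value)
    return "\n".join(lines) + "\n"
```

Working on the file's text rather than the file itself keeps the edit easy to check before anything is written back to ~/.config/scenelens/.env.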
Structured mode: python3 "${CLAUDE_SKILL_DIR}/scripts/setup.py" --json emits {status, first_run, missing_binaries, missing_optional, ocr_available, whisper_backend, has_api_key, config_file, platform}.
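If you prefer the structured output, consuming it could look roughly like this. The key names come from the list above; the value types and the sample values (including the `status` string) are assumptions for illustration only, and `summarize` is a hypothetical helper.

```python
import json

# Hypothetical payload: real values emitted by setup.py --json may differ.
SAMPLE = json.dumps({
    "status": "missing_key",
    "first_run": False,
    "missing_binaries": [],
    "missing_optional": ["tesseract"],
    "ocr_available": False,
    "whisper_backend": None,
    "has_api_key": False,
    "config_file": "~/.config/scenelens/.env",
    "platform": "linux",
})


def summarize(payload: str) -> list:
    """Turn the --json output into a list of follow-up steps."""
    info = json.loads(payload)
    steps = []
    if info.get("missing_binaries"):
        steps.append("run installer")
    if not info.get("has_api_key"):
        steps.append("ask user for a Whisper API key")
    if not info.get("ocr_available"):
        steps.append("note: OCR pass will be skipped")
    return steps
```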
Within a single session, skip Step 0 on follow-up calls — once --check returned 0, nothing has changed.
When to use
- User pastes a video URL (YouTube, Vimeo, X, TikTok, Twitch clip, anything yt-dlp supports) and asks about it.
- User points at a local video file (.mp4, .mov, .mkv, .webm, etc.) and asks about it.
- User types /scenelens <url-or-path> [question].
How to invoke
Step 1 — parse user input. Separate the video source from any question. /scenelens https://youtu.be/abc what hook did they open with? → source = https://youtu.be/abc, question = what hook did they open with?.
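The split in Step 1 can be sketched as below. `parse_invocation` is a hypothetical name, and the sketch assumes the source is the first whitespace-delimited token, so a local path containing spaces would need extra quoting logic.

```python
from typing import Tuple


def parse_invocation(text: str, command: str = "/scenelens") -> Tuple[str, str]:
    """Split '/scenelens <source> [question]' into (source, question)."""
    rest = text.strip()
    if rest.startswith(command):
        rest = rest[len(command):].strip()
    # Assumption: everything after the first token is the question.
    parts = rest.split(maxsplit=1)
    source = parts[0] if parts else ""
    question = parts[1].strip() if len(parts) > 1 else ""
    return source, question
```

An empty question means the user wants a general summary rather than a targeted answer.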
Step 2 — run the script. Pass the source verbatim:
python3 "${CLAUDE_SKILL_DIR}/scripts/scenelens.py" "<source>"
Optional flags:
- --mode auto|scene|fixed: frame selection strategy. Default auto: scene-aware first, fixed-fps fallback if scene changes are sparse. Force fixed for content with no hard cuts (e.g. a single-take talking head).
- --scene-threshold F: sensitivity (0-1, default 0.30). Lower = more frames captured. Drop to 0.20 for subtle visual changes.
- --start T / --end T: focus on a section (SS, MM:SS, HH:MM:SS).
- --max-frames N: lower the cap for a tighter token budget.
- --resolution W: frame width in px (default 512; bump to 1024 only when the user must read tiny on-screen text and OCR isn't catching it).
- --no-ocr: skip the OCR pass. Use for content with no on-screen text (podcasts, interviews) to save a few hundred ms.
- --ocr-lang CODE: Tesseract language (default eng).
- --fps F: only applies in fixed-fps mode. Capped at 2 fps.
- --whisper groq|openai: force a specific backend. Default: prefer Groq when both keys exist.
- --no-whisper: disable Whisper entirely; frames-only if no captions.
- --sub-langs L1,L2: caption languages in priority order (default en,en-US,en-GB,en-orig).
- --out-dir DIR: keep working files somewhere specific.
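If you assemble the command programmatically, a list-based argv avoids shell-quoting problems with URLs and paths. This is a sketch covering only a subset of the flags above; `build_argv` is a hypothetical helper, not part of scenelens.py.

```python
from typing import List, Optional


def build_argv(
    script: str,
    source: str,
    mode: Optional[str] = None,
    start: Optional[str] = None,
    end: Optional[str] = None,
    max_frames: Optional[int] = None,
    no_ocr: bool = False,
) -> List[str]:
    """Assemble an argv list for scenelens.py from a subset of its flags."""
    if mode is not None and mode not in ("auto", "scene", "fixed"):
        raise ValueError(f"unsupported mode: {mode}")
    argv = ["python3", script, source]  # the source is passed verbatim
    if mode is not None:
        argv += ["--mode", mode]
    if start is not None:
        argv += ["--start", start]
    if end is not None:
        argv += ["--end", end]
    if max_frames is not None:
        argv += ["--max-frames", str(max_frames)]
    if no_ocr:
        argv.append("--no-ocr")
    return argv
```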
Step 3 — Read every frame path the script lists. The Read tool renders JPEGs directly as images. Read all frames in a single message (parallel tool calls). Each frame has a t=MM:SS timestamp. When OCR text is present, the report shows it inline — use that text directly instead of trying to read pixels.
Step 4 — answer the user. You now have THREE streams of evidence:
- Frames — what's on screen (chosen at scene cuts when possible)
- OCR — on-screen text, already extracted
- Transcript — what was said, with timestamps
If the user asked something specific, answer with timestamp citations. Otherwise summarize: structure, key visuals, what was said.
Step 5 — clean up. The script prints a working directory at the end. If the user isn't asking follow-ups, delete it with rm -rf <dir>.
Frame selection — why scene-aware matters
A 10-minute video with one demo and nine minutes of talking head:
- Fixed fps: 80 frames evenly spaced — 8 of them on the demo, 72 on the head.
- Scene-aware: dense around scene cuts — the demo frames cluster on UI changes, the head frames spread sparsely.
Same token cost, dramatically better signal. The default mode is auto: scene detection first, with automatic fallback to fixed-fps when fewer than 8 scene changes are detected (single-take videos, screen recordings of static UI). Use --mode fixed to force the legacy behavior; use --mode scene to disable the fallback.
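The auto fallback described above can be pictured with the sketch below. It is not the script's actual implementation: `choose_timestamps` is a hypothetical name, and both the even thinning of surplus cuts and the slot-midpoint spacing are assumptions. Only the threshold of 8 scene changes comes from this document.

```python
from typing import List, Tuple


def choose_timestamps(
    scene_cuts: List[float],
    duration: float,
    max_frames: int,
    min_scenes: int = 8,
) -> Tuple[str, List[float]]:
    """Pick frame timestamps: scene cuts if there are enough, else even spacing."""
    if max_frames <= 0:
        raise ValueError("max_frames must be positive")
    if len(scene_cuts) >= min_scenes:
        # Enough cuts: keep them, thinned evenly if they exceed the budget.
        cuts = sorted(scene_cuts)
        if len(cuts) > max_frames:
            step = len(cuts) / max_frames
            cuts = [cuts[int(i * step)] for i in range(max_frames)]
        return "scene", cuts
    # Too few cuts: fall back to time-uniform sampling (midpoint of each slot).
    slot = duration / max_frames
    return "fixed", [round((i + 0.5) * slot, 3) for i in range(max_frames)]
```

With a 600-second video, an 80-frame budget, and only two detected cuts, this falls back to fixed spacing and lands 8 frames in any given minute, matching the 8-of-80 figure in the example above.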
Focusing on a section
When the user names a moment ("around 2:30", "the first 10 seconds", "the last 30 seconds"), pass --start / --end. Frame budget tightens around the range, transcript filters to the same window, frame timestamps stay absolute (real video timeline).
python3 "${CLAUDE_SKILL_DIR}/scripts/scenelens.py" video.mp4 --start 50 --end 60
python3 "${CLAUDE_SKILL_DIR}/scripts/scenelens.py" "$URL" --start 2:15 --end 2:45
python3 "${CLAUDE_SKILL_DIR}/scripts/scenelens.py" "$URL" --start 1:12:00
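To turn a moment the user names into a --start / --end value, or to compare it against frame timestamps, the three accepted formats reduce to seconds as sketched here. `parse_time` is a hypothetical helper; it assumes whole-number fields and does not handle fractional seconds.

```python
def parse_time(value: str) -> int:
    """Convert SS, MM:SS, or HH:MM:SS into whole seconds."""
    parts = value.strip().split(":")
    if not 1 <= len(parts) <= 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unrecognized time: {value!r}")
    seconds = 0
    for part in parts:
        # Each additional field shifts the running total up by a factor of 60.
        seconds = seconds * 60 + int(part)
    return seconds
```

For the three commands above, the start values come out as 50, 135, and 4320 seconds.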
Transcription
- Native captions (free, preferred). yt-dlp pulls manual or auto-generated subtitles when available.
- Whisper API fallback. If captions are missing, the script extracts mono 16 kHz mp3 audio (~480 kB/min) and uploads it to Groq's whisper-large-v3 (preferred) or OpenAI's whisper-1.
- Auto-chunking for long audio. Audio >24 MB is split into chunks under the 25 MB API cap, each transcribed separately, then merged with offset timestamps. A 4-hour podcast no longer fails; it just makes more API calls.
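The chunk-and-merge step can be sketched as follows. This is an illustration, not the script's code: it assumes equal-length chunks, binary megabytes for the 24 MB threshold, and segments shaped as dicts with `start`, `end`, and `text` keys. `chunk_count` and `merge_segments` are hypothetical names.

```python
import math
from typing import Dict, List

CHUNK_LIMIT_BYTES = 24 * 1024 * 1024  # split threshold, below the 25 MB cap


def chunk_count(audio_bytes: int, limit: int = CHUNK_LIMIT_BYTES) -> int:
    """Number of chunks needed so that each stays at or under the limit."""
    return max(1, math.ceil(audio_bytes / limit))


def merge_segments(chunks: List[List[Dict]], chunk_seconds: float) -> List[Dict]:
    """Shift each chunk's segment times by its offset and concatenate."""
    merged = []
    for index, segments in enumerate(chunks):
        offset = index * chunk_seconds  # where this chunk starts in the video
        for seg in segments:
            merged.append({
                "start": seg["start"] + offset,
                "end": seg["end"] + offset,
                "text": seg["text"],
            })
    return merged
```

At ~480 kB/min, four hours of audio is about 115 MB, which this sketch splits into 5 chunks.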
Both keys live in ~/.config/scenelens/.env. Unlike skills that fall back to a project-local .env, scenelens reads ONLY from ~/.config/scenelens/.env and process env — to avoid silently picking up keys from random project directories.
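A minimal sketch of that lookup, paired with the Groq-first preference from the --whisper flag, might look like this. Two behaviors are assumptions: that process env overrides the file when both define a key, and that a forced backend with no key yields no backend. `load_keys` and `pick_backend` are hypothetical names.

```python
from typing import Dict, Mapping, Optional

KEY_NAMES = ("GROQ_API_KEY", "OPENAI_API_KEY")


def load_keys(env_text: str, environ: Mapping[str, str]) -> Dict[str, str]:
    """Collect API keys from the config file text and the process environment."""
    keys: Dict[str, str] = {}
    for line in env_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        name = name.strip()
        value = value.strip().strip('"').strip("'")
        if name in KEY_NAMES and value:
            keys[name] = value
    # Assumption: process env overrides the file when both define a key.
    for name in KEY_NAMES:
        if environ.get(name):
            keys[name] = environ[name]
    return keys


def pick_backend(keys: Mapping[str, str], forced: Optional[str] = None) -> Optional[str]:
    """Choose 'groq' or 'openai'; Groq wins when both keys exist."""
    available = {"groq": "GROQ_API_KEY" in keys, "openai": "OPENAI_API_KEY" in keys}
    if forced is not None:
        return forced if available.get(forced) else None
    if available["groq"]:
        return "groq"
    if available["openai"]:
        return "openai"
    return None
```

No project-local .env is consulted anywhere in the sketch, which mirrors the rule stated above.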
Failure modes
- Setup preflight failed → run python3 "${CLAUDE_SKILL_DIR}/scripts/setup.py" (auto-installs ffmpeg/yt-dlp via brew on macOS, scaffolds .env). For an API key, ask the user via AskUserQuestion and write it to ~/.config/scenelens/.env.
- No transcript available → captions missing AND (no Whisper key OR Whisper API failed). Proceed frames-only and tell the user.
- --mode scene returned no frames → the video has no detectable scene changes. Re-run with --mode auto (default) or --mode fixed.
- OCR not available → tesseract not installed.