Video/Audio Processor
Process video and audio recordings into text transcripts and visual frame captures for analysis. Auto-detects your hardware and picks the optimal transcription model, device, and settings.
Phase 0: Detect Compute Environment
Run this at the start of every invocation. It determines everything downstream.
python3 -c "
import platform, shutil, subprocess, json
info = {'os': platform.system(), 'arch': platform.machine(), 'ram_gb': 0, 'gpu': 'none', 'vram_gb': 0, 'device': 'cpu', 'dtype': 'float32'}
# RAM
try:
if platform.system() == 'Darwin':
import os; info['ram_gb'] = round(os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES') / (1024**3))
elif platform.system() == 'Linux':
with open('/proc/meminfo') as f:
for line in f:
if line.startswith('MemTotal'):
info['ram_gb'] = round(int(line.split()[1]) / (1024**2))
break
else:
import ctypes
mem = ctypes.c_ulonglong(0)
ctypes.windll.kernel32.GetPhysicallyInstalledMemory(ctypes.byref(mem))
info['ram_gb'] = round(mem.value / (1024**2))
except: pass
# GPU detection
try:
import torch
if torch.cuda.is_available():
info['gpu'] = torch.cuda.get_device_name(0)
info['vram_gb'] = round(torch.cuda.get_device_properties(0).total_mem / (1024**3))
info['device'] = 'cuda'
info['dtype'] = 'float16'
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
info['gpu'] = 'Apple Silicon (MPS)'
info['vram_gb'] = info['ram_gb'] # unified memory
info['device'] = 'mps'
info['dtype'] = 'float16'
elif hasattr(torch, 'hip') or 'AMD' in str(getattr(torch, '_C', '')):
info['gpu'] = 'AMD (ROCm)'
info['device'] = 'cuda' # ROCm uses cuda API
info['dtype'] = 'float16'
except ImportError:
pass
# ffmpeg detection
info['ffmpeg'] = shutil.which('ffmpeg') is not None
info['ffprobe'] = shutil.which('ffprobe') is not None
print(json.dumps(info))
"
Parse the JSON output and store it internally as COMPUTE. Present the results to the user:
Detected hardware:
- OS: {os} ({arch})
- RAM: {ram_gb} GB
- GPU: {gpu} ({vram_gb} GB VRAM)
- Compute device: {device}
- ffmpeg: {installed/missing}
Whisper Engine Selection
Based on hardware, choose the transcription engine:
| Condition | Engine | Reason |
|---|---|---|
| CUDA GPU available | faster-whisper | GPU-accelerated CTranslate2, 4x faster than standard Whisper |
| MPS (Apple Silicon) | openai-whisper | Standard Whisper, uses MPS acceleration when available |
| CPU only | openai-whisper | Standard Whisper on CPU, reliable fallback |
Model Selection Matrix
Based on hardware and recording length, recommend a model:
| Condition | Recommended Model | Speed (1hr recording) | Accuracy |
|---|---|---|---|
| CUDA >= 8 GB VRAM | large-v3 | ~5 min | Best |
| CUDA 4-7 GB VRAM | medium | ~8 min | High |
| MPS (Apple Silicon, 16+ GB) | medium | ~15 min | High |
| MPS (Apple Silicon, 8 GB) | small | ~10 min | Good |
| CPU + RAM >= 16 GB | small | ~30 min | Good |
| CPU + RAM < 16 GB | base | ~15 min | Adequate |
Present the recommendation and let the user choose:
AskUserQuestion: "Which transcription model should I use?"
Options:
- [Recommended model] (Recommended) - [reason based on their hardware]
- large-v3 - Best accuracy, needs 8+ GB VRAM or lots of RAM
- medium - High accuracy, balanced speed
- small - Good accuracy, faster processing
- base - Adequate for most meetings, fastest
Store the chosen model as WHISPER_MODEL and engine as WHISPER_ENGINE.
Phase 1: Identify the Recording
Find the file. Zoom recordings typically follow the pattern:
GMT{date}-{time}_Recording_{resolution}.mp4(video)GMT{date}-{time}_Recording.m4a(audio only)GMT{date}-{time}_RecordingnewChat.txt(chat log)
Check for all three artifacts. Read the chat file first (it's small and gives immediate context).
Phase 2: Install Dependencies
Check and install what's needed based on platform.
ffmpeg
# Check if ffmpeg is available
ffmpeg -version 2>&1 | head -1
ffprobe -version 2>&1 | head -1
If ffmpeg is missing, install based on platform:
| Platform | Install command |
|---|---|
| macOS (Homebrew) | brew install ffmpeg |
| macOS (no Homebrew) | Guide user to install Homebrew first, then ffmpeg |
| Ubuntu/Debian | sudo apt update && sudo apt install -y ffmpeg |
| Fedora/RHEL | sudo dnf install -y ffmpeg |
| Arch Linux | sudo pacman -S ffmpeg |
| Windows (winget) | winget install Gyan.FFmpeg |
| Windows (choco) | choco install ffmpeg |
| Windows (scoop) | scoop install ffmpeg |
Python + Whisper
python3 -c "import torch; print(torch.__version__)" 2>&1
If torch is missing, install based on platform:
| Platform | Install command |
|---|---|
| macOS (MPS) | pip3 install torch torchvision torchaudio |
| Linux (CUDA) | pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 |
| Linux (ROCm) | pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0 |
| Linux/Windows (CPU) | pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu |
| Windows (CUDA) | pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 |
Then install the transcription engine:
# For CUDA systems (faster-whisper with CTranslate2 acceleration)
pip3 install faster-whisper
# For MPS / CPU systems (standard OpenAI Whisper)
pip3 install openai-whisper
Phase 3: Get Duration
Use ffprobe (cross-platform) to get the file duration:
ffprobe -v quiet -show_entries format=duration -of csv=p=0 "/path/to/file"
This returns duration in seconds. Convert to human-readable format and display to user. Use this to estimate transcription time based on the selected model and hardware.
Present the plan:
Processing plan:
- File: {filename} ({duration})
- Engine: {WHISPER_ENGINE} with {WHISPER_MODEL} model on {device}
- Estimated transcription time: {estimate}
- Frame extraction: 1 per minute ({frame_count} frames)
Want me to adjust anything before starting?
Time Estimates
| Device | base | small | medium | large-v3 |
|---|---|---|---|---|
| CUDA (RTX 3060+) | ~1 min/hr | ~2 min/hr | ~4 min/hr | ~5 min/hr |
| CUDA (faster-whisper) | ~30s/hr | ~1 min/hr | ~2 min/hr | ~3 min/hr |
| MPS (M1/M2/M3/M4) | ~3 min/hr | ~6 min/hr | ~15 min/hr | ~25 min/hr |
| CPU (16GB+ RAM) | ~8 min/hr | ~15 min/hr | ~30 min/hr | ~60 min/hr |
Phase 4: Extract Visual Frames
Extract one frame per minute for visual context. Use a cross-platform temp directory:
# Create temp directory (cross-platform)
python3 -c "import tempfile, os; d = os.path.join(tempfile.gettempdir(), 'av-frames'); os.makedirs(d, exist_ok=True); print(d)"
Then extract frames:
ffmpeg -i "/path/to/video.mp4" \
-vf "fps=1/60,scale=1280:-1" -q:v 3 {TEMP_DIR}/frame_%03d.jpg
Use the Read tool on each frame to view it. Frames show:
- Screen shares (code, slides, UIs)
- Participant names (bottom-left of each video tile)
- Timestamps (if visible in screen share)
Phase 5: Transcribe Audio
Run the transcription engine. Choose the correct code based on WHISPER_ENGINE:
faster-whisper (CUDA systems)
from faster_whisper import WhisperModel
import json
model = WhisperModel("{WHISPER_MODEL}", device="cuda", compute_type="float16")
segments, info = model.transcribe("/path/to/audio", beam_size=5, language="en")
results = []
for segment in segments:
results.append({
"start": segment.start,
"end": segment.end,
"text": segm