Faster Whisper
Local speech-to-text using faster-whisper — a CTranslate2 reimplementation of OpenAI's Whisper that runs 4-6x faster with identical accuracy. With GPU acceleration, expect ~20x realtime transcription (a 10-minute audio file in ~30 seconds).
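The throughput figure above is simple arithmetic: processing time is roughly the audio duration divided by the realtime factor. A minimal sketch follows. `estimate_transcription_seconds` is a hypothetical helper, not part of this skill, and the ~20x factor is only the GPU estimate quoted above; real speed depends on hardware, model size, and audio content.

```python
def estimate_transcription_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Rough processing time for audio transcribed at `realtime_factor` x realtime."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


# A 10-minute file at the ~20x GPU estimate quoted above.
print(estimate_transcription_seconds(10 * 60, 20))  # 30.0
```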
When to Use
Use this skill when you need to:
- Transcribe audio/video files: meetings, interviews, podcasts, lectures, YouTube videos
- Generate subtitles: SRT, VTT, ASS, LRC, or TTML broadcast-standard subtitles
- Identify speakers: diarization labels who said what (`--diarize`)
- Transcribe from URLs: YouTube links and direct audio URLs (auto-downloads via yt-dlp)
- Transcribe podcast feeds: `--rss <feed-url>` fetches and transcribes episodes
- Batch process files: glob patterns, directories, skip-existing support; ETA shown automatically
- Convert speech to text locally: no API costs, works offline (after model download)
- Translate to English: translate any language to English with `--translate`
- Do multilingual transcription: supports 99+ languages with auto-detection
- Transcribe a batch of files in different languages: `--language-map` assigns a different language per file
- Transcribe multilingual audio: `--multilingual` for mixed-language audio
- Transcribe audio with specific terms: use `--initial-prompt` for jargon-heavy content or any other terms to look out for
- Preprocess noisy audio: `--normalize` and `--denoise` clean up the audio before transcription
- Stream output: `--stream` shows segments as they're transcribed
- Clip time ranges: `--clip-timestamps` to transcribe specific sections
- Search the transcript: `--search "term"` finds all timestamps where a word/phrase appears
- Detect chapters: `--detect-chapters` finds section breaks from silence gaps
- Export speaker audio: `--export-speakers DIR` saves each speaker's turns as separate WAV files
- Spreadsheet output: `--format csv` produces a properly-quoted CSV with timestamps
Trigger phrases: "transcribe this audio", "convert speech to text", "what did they say", "make a transcript", "audio to text", "subtitle this video", "who's speaking", "translate this audio", "translate to English", "find where X is mentioned", "search transcript for", "when did they say", "at what timestamp", "add chapters", "detect chapters", "find breaks in the audio", "table of contents for this recording", "TTML subtitles", "DFXP subtitles", "broadcast format subtitles", "Netflix format", "ASS subtitles", "aegisub format", "advanced substation alpha", "mpv subtitles", "LRC subtitles", "timed lyrics", "karaoke subtitles", "music player lyrics", "HTML transcript", "confidence-colored transcript", "color-coded transcript", "separate audio per speaker", "export speaker audio", "split by speaker", "transcript as CSV", "spreadsheet output", "transcribe podcast", "podcast RSS feed", "different languages in batch", "per-file language", "transcribe in multiple formats", "srt and txt at the same time", "output both srt and text", "remove filler words", "clean up ums and uhs", "strip hesitation sounds", "remove you know and I mean", "transcribe left channel", "transcribe right channel", "stereo channel", "left track only", "wrap subtitle lines", "character limit per line", "max chars per subtitle", "detect paragraphs", "paragraph breaks", "group into paragraphs", "add paragraph spacing"
⚠️ Agent guidance — keep invocations minimal:
CORE RULE: the default command (`./scripts/transcribe audio.mp3`) is the fastest path. Add flags only when the user explicitly asks for that capability.
Transcription:
- Only add `--diarize` if the user asks "who said what" / "identify speakers" / "label speakers"
- Only add `--format srt/vtt/ass/lrc/ttml` if the user asks for subtitles/captions in that format
- Only add `--format csv` if the user asks for CSV or spreadsheet output
- Only add `--word-timestamps` if the user needs word-level timing
- Only add `--initial-prompt` if there's domain-specific jargon to prime
- Only add `--translate` if the user wants non-English audio translated to English
- Only add `--normalize`/`--denoise` if the user mentions bad audio quality or noise
- Only add `--stream` if the user wants live/progressive output for long files
- Only add `--clip-timestamps` if the user wants a specific time range
- Only add `--temperature 0.0` if the model is hallucinating on music/silence
- Only add `--vad-threshold` if VAD is aggressively cutting speech or including noise
- Only add `--min-speakers`/`--max-speakers` when you know the speaker count
- Only add `--hf-token` if the token is not cached at `~/.cache/huggingface/token`
- Only add `--max-words-per-line` for subtitle readability on long segments
- Only add `--filter-hallucinations` if the transcript contains obvious artifacts (music markers, duplicates)
- Only add `--merge-sentences` if the user asks for sentence-level subtitle cues
- Only add `--clean-filler` if the user asks to remove filler words (um, uh, you know, I mean, hesitation sounds)
- Only add `--channel left|right` if the user mentions stereo tracks, dual-channel recordings, or asks for a specific channel
- Only add `--max-chars-per-line N` when the user specifies a character limit per subtitle line (e.g., "Netflix format", "42 chars per line"); takes priority over `--max-words-per-line`
- Only add `--detect-paragraphs` if the user asks for paragraph breaks or structured text output; `--paragraph-gap` (default 3.0s) only if they want a custom gap
- Only add `--speaker-names "Alice,Bob"` when the user provides real names to replace SPEAKER_1/2; always requires `--diarize`
- Only add `--hotwords WORDS` when the user names specific rare terms not well served by `--initial-prompt`; prefer `--initial-prompt` for general domain jargon
- Only add `--prefix TEXT` when the user knows the exact words the audio starts with
- Only add `--detect-language-only` when the user only wants to identify the language, not transcribe
- Only add `--stats-file PATH` if the user asks for performance stats, RTF, or benchmark info
- Only add `--parallel N` for large CPU batch jobs; GPU handles one file efficiently on its own, so don't add it for single files or small batches
- Only add `--retries N` for unreliable inputs (URLs, network files) where transient failures are expected
- Only add `--burn-in OUTPUT` when the user explicitly asks to embed/burn subtitles into the video; requires ffmpeg and a video file input
- Only add `--keep-temp` when the user may re-process the same URL to avoid re-downloading
- Only add `--output-template` when the user specifies a custom naming pattern in batch mode
- Multi-format output (`--format srt,text`): only when the user explicitly wants multiple formats in one pass; always pair with `-o <dir>`
- Any word-level feature auto-runs wav2vec2 alignment (~5-10s overhead)
- `--diarize` adds ~20-30s on top of that
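The "add flags only when asked" rule can be sketched as a small command builder. This is an illustrative sketch, not part of the skill: `build_transcribe_command` and its parameters are hypothetical names, and only a handful of the flags listed above are covered. It starts from the bare default command and appends a flag only when the matching capability was requested.

```python
from typing import List, Optional


def build_transcribe_command(
    audio_path: str,
    *,
    diarize: bool = False,
    output_format: Optional[str] = None,
    translate: bool = False,
    speaker_names: Optional[List[str]] = None,
) -> List[str]:
    """Build an argv list, starting from the bare default command."""
    cmd = ["./scripts/transcribe", audio_path]
    # --speaker-names always requires --diarize, so enable it implicitly.
    if diarize or speaker_names:
        cmd.append("--diarize")
    if speaker_names:
        cmd += ["--speaker-names", ",".join(speaker_names)]
    if output_format:
        cmd += ["--format", output_format]
    if translate:
        cmd.append("--translate")
    return cmd


# No explicit request: the default command stays untouched.
print(build_transcribe_command("audio.mp3"))
# The user supplied real speaker names, so diarization is implied.
print(build_transcribe_command("meeting.mp3", speaker_names=["Alice", "Bob"]))
```

Returning an argv list rather than a single string means the result can be passed to `subprocess.run` without shell quoting problems.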
Search:
- Only add `--search "term"` when the user asks to find/locate/search for a specific word or phrase in audio
- `--search` replaces the normal transcript output; it prints only matching segments with timestamps
- Add `--search-fuzzy` only when the user mentions approximate/partial matching or typos
- To save search results to a file, use `-o results.txt`
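To make the difference between exact and fuzzy search concrete, here is a conceptual sketch over timestamped segments. It is not the skill's implementation: how `--search` and `--search-fuzzy` match internally is not documented here, so `search_segments`, the segment tuple layout, and the 0.8 similarity threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start_seconds, end_seconds, text)


def search_segments(segments: List[Segment], term: str,
                    fuzzy: bool = False, threshold: float = 0.8) -> List[Segment]:
    """Return segments containing `term`; with fuzzy=True, also near matches."""
    needle = term.lower()
    hits = []
    for start, end, text in segments:
        haystack = text.lower()
        if needle in haystack:
            # Exact (case-insensitive) substring match.
            hits.append((start, end, text))
        elif fuzzy and any(
            SequenceMatcher(None, needle, word.strip(".,!?")).ratio() >= threshold
            for word in haystack.split()
        ):
            # Approximate match against individual words, tolerating typos.
            hits.append((start, end, text))
    return hits


# Made-up sample data for illustration only.
SAMPLE_SEGMENTS: List[Segment] = [
    (0.0, 4.2, "Welcome to the quarterly review."),
    (4.2, 9.0, "Revenue grew in the third quarter."),
    (9.0, 12.5, "Thanks everyone."),
]

for start, end, text in search_segments(SAMPLE_SEGMENTS, "reveneu", fuzzy=True):
    print(f"[{start:.1f}s - {end:.1f}s] {text}")
```

Like `--search`, the sketch returns only the matching segments with their timestamps rather than the full transcript.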
Chapter detection:
- Only add `--detect-chapters` when the user asks for chapters, sections, a table of contents, or "where does the topic change"
- Default `--chapter-gap 8` (8-second silence = new chapter) works for most podcasts/lectures; tune down for dense content
- `--chapter-format youtube` (default) outputs YouTube-ready timestamps; use `json` for programmatic use
- Always use `--chapters-file PATH` when combining chapters with a transcript output; this avoids mixing chapter markers into the transcript text
- If the user only wants chapters (not the transcript), discard the transcript output with `-o /dev/null` and use `--chapters-