OpenAI Whisper Patterns

Quick Guide: Use client.audio.transcriptions.create() for speech-to-text and client.audio.translations.create() to translate non-English audio into English text. Choose gpt-4o-transcribe for highest accuracy, gpt-4o-mini-transcribe for cost-efficiency, whisper-1 for timestamps/SRT/VTT, or gpt-4o-transcribe-diarize for speaker identification. Files must be under 25 MB -- chunk larger files. Use prompt to guide vocabulary and style. Streaming is available via stream: true for progressive output on gpt-4o-transcribe and gpt-4o-mini-transcribe.

<critical_requirements>

CRITICAL: Before Using This Skill

All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)

(You MUST choose the correct model for the use case -- gpt-4o-transcribe for accuracy, whisper-1 for timestamps/SRT/VTT output, gpt-4o-transcribe-diarize for speaker labels)

(You MUST chunk audio files larger than 25 MB before sending to the API -- the API rejects files exceeding this limit)

(You MUST pass response_format: "verbose_json" when using timestamp_granularities -- timestamps only work with this format on whisper-1)

(You MUST set chunking_strategy: "auto" when using gpt-4o-transcribe-diarize with audio longer than 30 seconds -- the API requires it)

</critical_requirements>
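The three API-level rules above can be checked before a request is sent. The following is a minimal sketch of such a pre-flight check, not part of the OpenAI SDK: the type `TranscriptionRequestDraft`, the function `findRequestViolations`, and the fields `fileSizeBytes` and `durationSeconds` are hypothetical names introduced for illustration. The option names `response_format`, `timestamp_granularities`, and `chunking_strategy` are the ones this skill already uses.

```typescript
const MAX_FILE_SIZE_BYTES = 25 * 1024 * 1024; // 25 MB upload limit
const DIARIZE_CHUNKING_THRESHOLD_SECONDS = 30;

// Hypothetical description of a request that is about to be sent.
type TranscriptionRequestDraft = {
  model: string;
  fileSizeBytes: number;
  durationSeconds: number;
  response_format?: string;
  timestamp_granularities?: Array<"word" | "segment">;
  chunking_strategy?: "auto";
};

// Returns one message per violated rule; an empty array means the draft passes.
function findRequestViolations(draft: TranscriptionRequestDraft): string[] {
  const violations: string[] = [];

  if (draft.fileSizeBytes > MAX_FILE_SIZE_BYTES) {
    violations.push("file exceeds 25 MB: chunk it before uploading");
  }

  const wantsTimestamps = (draft.timestamp_granularities ?? []).length > 0;
  const timestampsAllowed =
    draft.model === "whisper-1" && draft.response_format === "verbose_json";
  if (wantsTimestamps && !timestampsAllowed) {
    violations.push("timestamp_granularities needs whisper-1 with response_format verbose_json");
  }

  const isLongDiarization =
    draft.model === "gpt-4o-transcribe-diarize" &&
    draft.durationSeconds > DIARIZE_CHUNKING_THRESHOLD_SECONDS;
  if (isLongDiarization && draft.chunking_strategy !== "auto") {
    violations.push('gpt-4o-transcribe-diarize needs chunking_strategy "auto" for audio over 30 seconds');
  }

  return violations;
}
```

Returning a list rather than throwing on the first problem lets a caller report every issue at once; whether that suits a given project is a design choice.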

Auto-detection: Whisper, whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-transcribe-diarize, audio.transcriptions, audio.translations, transcription, speech-to-text, diarization, diarized_json, timestamp_granularities, verbose_json

When to use:

Transcribing audio files (meetings, interviews, podcasts, voice notes) to text
Translating non-English audio to English text
Generating subtitles in SRT or VTT format from audio
Getting word-level or segment-level timestamps for video editing
Identifying speakers in multi-speaker audio (diarization)
Streaming transcription results progressively as the model processes audio

Key patterns covered:

Model selection (whisper-1 vs gpt-4o-transcribe vs gpt-4o-mini-transcribe vs gpt-4o-transcribe-diarize)
Response formats (json, text, srt, vtt, verbose_json, diarized_json)
Timestamps (word-level, segment-level) and subtitle generation
Prompting for vocabulary, acronyms, and style
Chunking large files (> 25 MB) with context preservation
Streaming transcription with stream: true
Translation to English via audio.translations.create()
Speaker diarization with speaker references

When NOT to use:

Text-to-speech (TTS) -- use the OpenAI TTS API (client.audio.speech.create())
Real-time bidirectional voice conversations -- use the OpenAI Realtime API
Transcription with non-OpenAI providers -- use a provider-agnostic speech SDK

Examples Index

Core: Transcription, Translation, Timestamps, Chunking, Streaming, Diarization -- All audio API patterns

<philosophy>

Philosophy

The OpenAI Audio API provides speech-to-text transcription and translation through multiple models optimized for different needs. The API is simple -- you send an audio file and get text back -- but choosing the right model, response format, and parameters is critical for quality results.

Core principles:

Model selection matters -- gpt-4o-transcribe produces the highest accuracy with lower hallucination rates. whisper-1 is the only model supporting SRT/VTT/verbose_json with timestamps. gpt-4o-transcribe-diarize adds speaker identification.
File size is the primary constraint -- the 25 MB limit means larger files must be chunked. Split at sentence boundaries to preserve context.
Prompting improves accuracy -- The prompt parameter guides vocabulary, acronyms, and formatting style. It does not give instructions -- it provides context the model matches against.
Response format determines available features -- Timestamps require verbose_json on whisper-1. Diarization requires diarized_json. SRT/VTT are only on whisper-1.

When to use the Audio API:

You need accurate transcription of recorded audio files
You need subtitles (SRT/VTT) from audio
You need to identify who is speaking in a conversation
You need to translate non-English speech to English text

When NOT to use:

Real-time voice chat -- use the Realtime API instead
Text-to-speech -- use client.audio.speech.create()
Translating speech into a language other than English -- audio.translations.create() only outputs English

</philosophy>

Core Patterns

Pattern 1: Basic Transcription

Send an audio file and receive text back. The model auto-detects the language.

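import { createReadStream } from "node:fs";
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment
const transcription = await client.audio.transcriptions.create({
  model: "gpt-4o-transcribe",
  file: createReadStream(audioPath), // audioPath: path to a local audio file
});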

Use gpt-4o-transcribe for highest accuracy. Do not use whisper-1 with verbose_json when you only need plain text -- it adds overhead and has higher hallucination rates. See core.md for full examples.

Pattern 2: Model Selection

Each model has distinct capabilities and tradeoffs.

What do you need?
+-- Highest accuracy, plain text -> gpt-4o-transcribe
+-- Cost-efficient, plain text -> gpt-4o-mini-transcribe
+-- Timestamps (word/segment) -> whisper-1 (verbose_json)
+-- SRT or VTT subtitles -> whisper-1 (srt/vtt format)
+-- Speaker identification -> gpt-4o-transcribe-diarize
+-- Streaming output -> gpt-4o-transcribe or gpt-4o-mini-transcribe
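The decision tree above can also be written as a small function. This is only a sketch: `TranscriptionNeeds` and `chooseModel` are hypothetical names, and the priority order (speaker labels first, then timestamps or subtitles, then cost) is an assumption about how to resolve overlapping needs.

```typescript
type TranscriptionModel =
  | "whisper-1"
  | "gpt-4o-transcribe"
  | "gpt-4o-mini-transcribe"
  | "gpt-4o-transcribe-diarize";

// Hypothetical description of what the caller needs from a transcription.
type TranscriptionNeeds = {
  speakerLabels?: boolean;
  timestamps?: boolean;
  subtitles?: boolean; // SRT or VTT output
  preferLowCost?: boolean;
};

function chooseModel(needs: TranscriptionNeeds): TranscriptionModel {
  const needsWhisper = Boolean(needs.timestamps || needs.subtitles);
  if (needs.speakerLabels && needsWhisper) {
    // The tree lists no single model covering both feature sets.
    throw new Error("no single model covers speaker labels and timestamps/subtitles");
  }
  if (needs.speakerLabels) return "gpt-4o-transcribe-diarize";
  if (needsWhisper) return "whisper-1";
  // Plain text: both remaining models also support streaming output.
  return needs.preferLowCost ? "gpt-4o-mini-transcribe" : "gpt-4o-transcribe";
}
```

For example, `chooseModel({ subtitles: true })` selects whisper-1, while an empty needs object falls through to gpt-4o-transcribe.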

Model Capabilities Matrix

Feature	whisper-1	gpt-4o-transcribe	gpt-4o-mini-transcribe	gpt-4o-transcribe-diarize
Response formats	json, text, srt, vtt, verbose_json	json, text	json, text	json, text, diarized_json
Timestamps	word + segment	No	No	No
Streaming	No	Yes	Yes	No
Prompt support	Yes (224 tokens)	Yes	Yes	No
Logprobs	No	Yes	Yes	No
Speaker labels	No	No	No	Yes
Language param	Yes	Yes	Yes	Yes
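One way to keep request options consistent with the matrix is to encode part of it as data and look combinations up before calling the API. The sketch below covers only the response format, streaming, and prompt rows; `MODEL_CAPABILITIES` and `checkCombination` are hypothetical names, and the values are copied from the matrix above rather than queried from the API.

```typescript
type ModelName =
  | "whisper-1"
  | "gpt-4o-transcribe"
  | "gpt-4o-mini-transcribe"
  | "gpt-4o-transcribe-diarize";

type ModelCapabilities = {
  responseFormats: readonly string[];
  streaming: boolean;
  prompt: boolean;
};

// Values copied from the capabilities matrix; update both together.
const MODEL_CAPABILITIES: Record<ModelName, ModelCapabilities> = {
  "whisper-1": {
    responseFormats: ["json", "text", "srt", "vtt", "verbose_json"],
    streaming: false,
    prompt: true,
  },
  "gpt-4o-transcribe": { responseFormats: ["json", "text"], streaming: true, prompt: true },
  "gpt-4o-mini-transcribe": { responseFormats: ["json", "text"], streaming: true, prompt: true },
  "gpt-4o-transcribe-diarize": {
    responseFormats: ["json", "text", "diarized_json"],
    streaming: false,
    prompt: false,
  },
};

// Returns a reason string when the combination is unsupported, otherwise null.
function checkCombination(
  model: ModelName,
  responseFormat: string,
  stream: boolean,
): string | null {
  const capabilities = MODEL_CAPABILITIES[model];
  if (capabilities.responseFormats.indexOf(responseFormat) === -1) {
    return `${model} does not support response_format "${responseFormat}"`;
  }
  if (stream && !capabilities.streaming) {
    return `${model} does not support stream: true`;
  }
  return null;
}
```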

Pattern 3: Prompting for Vocabulary and Style

The prompt parameter provides context, not instructions. It guides the spelling of names and acronyms and the formatting style of the output. Do not use it for commands like "please transcribe carefully" -- the model matches the prompt's vocabulary and style rather than following it.

const VOCABULARY_PROMPT = "Kubernetes, kubectl, etcd, NGINX, gRPC, PostgreSQL";

const transcription = await client.audio.transcriptions.create({
  model: "gpt-4o-transcribe",
  file: createReadStream(audioPath),
  prompt: VOCABULARY_PROMPT,
});

Use cases: Acronyms/proper nouns, preserving context across chunks (pass tail of previous transcript), maintaining filler words, writing style guidance. See core.md for detailed vocabulary examples.
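For the cross-chunk use case, the prompt for each chunk can combine the fixed vocabulary with the tail of the previous chunk's transcript. The sketch below is illustrative only: `tailAtWordBoundary` and `buildChunkPrompt` are hypothetical helpers, and `PROMPT_CHAR_BUDGET` is an assumed character budget standing in for a token limit, since counting tokens exactly would need a tokenizer.

```typescript
// Assumption: a rough character budget standing in for a token-based prompt limit.
const PROMPT_CHAR_BUDGET = 800;

// Returns the end of `text` that fits in `maxChars`, never starting mid-word.
function tailAtWordBoundary(text: string, maxChars: number): string {
  const trimmed = text.trim();
  if (trimmed.length <= maxChars) return trimmed;
  const tail = trimmed.slice(trimmed.length - maxChars);
  const firstSpace = tail.indexOf(" ");
  // Drop everything up to the first space; this may discard one whole word.
  return firstSpace === -1 ? tail : tail.slice(firstSpace + 1);
}

// Combines the fixed vocabulary with the end of the previous chunk's transcript.
function buildChunkPrompt(vocabulary: string, previousTranscript: string): string {
  const remaining = Math.max(0, PROMPT_CHAR_BUDGET - vocabulary.length - 1);
  const tail = tailAtWordBoundary(previousTranscript, remaining);
  return tail.length > 0 ? `${vocabulary} ${tail}` : vocabulary;
}
```

The first chunk would call `buildChunkPrompt(VOCABULARY_PROMPT, "")` and get the vocabulary back unchanged; later chunks pass the text returned for the chunk before them.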

Pattern 4: Chunking Large Files

Audio files exceeding 25 MB must be split before transcription. Split at sentence boundaries (e.g., via ffmpeg) to preserve context. Pass the tail of the previous transcript as prompt for continuity across chunks.

const MAX_FILE_SIZE_BYTES = 25 * 1024 * 1024; // 25 MB
// Split with ffmpeg: ffmpeg -i long.mp3 -f segment -segment_time 600
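A segment duration for the split can be estimated from the file's size and length. The sketch below assumes a roughly constant bitrate and is not a definitive implementation: `planChunks`, `ChunkPlan`, and `TARGET_CHUNK_BYTES` are hypothetical names, and the 1 MB of headroom is an arbitrary safety margin. Fixed-duration segments do not guarantee sentence boundaries, so the result is a starting point for a value such as ffmpeg's -segment_time.

```typescript
const MAX_FILE_SIZE_BYTES = 25 * 1024 * 1024; // 25 MB
// Assumption: aim 1 MB below the limit to leave headroom for uneven bitrate.
const TARGET_CHUNK_BYTES = MAX_FILE_SIZE_BYTES - 1024 * 1024;

type ChunkPlan = { segmentSeconds: number; chunkCount: number };

// Estimates a segment duration that keeps each chunk under the upload limit.
function planChunks(fileSizeBytes: number, durationSeconds: number): ChunkPlan {
  if (fileSizeBytes <= 0 || durationSeconds <= 0) {
    throw new Error("file size and duration must be positive");
  }
  if (fileSizeBytes <= MAX_FILE_SIZE_BYTES) {
    // Small enough to send as a single file.
    return { segmentSeconds: Math.ceil(durationSeconds), chunkCount: 1 };
  }
  const bytesPerSecond = fileSizeBytes / durationSeconds;
  const segmentSeconds = Math.floor(TARGET_CHUNK_BYTES / bytesPerSecond);
  if (segmentSeconds < 1) {
    throw new Error("bitrate too high to fit one second per chunk; re-encode first");
  }
  return {
    segmentSeconds,
    chunkCount: Math.ceil(durationSeconds / segmentSeconds),
  };
}
```

For a 96 MB file lasting 4096 seconds, this plan yields four segments of 1024 seconds each.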
