wjs-transcribing-audio
Spoken audio in → timestamped SRT in the same language out. This skill stops at the source-language SRT. Translation to another language is the next skill (/wjs-translating-subtitles).
When to use
- User provides a video or audio file and wants a transcript / SRT in the source language.
- User already has a translated SRT and the source SRT is missing.
- User asks "做 SRT" / "make subtitles" / "出逐字稿" with no translation step requested yet.
When NOT to use
- Source-language SRT already exists → skip straight to /wjs-translating-subtitles.
- User wants the transcript in a different language than spoken → run this skill first, then /wjs-translating-subtitles.
- User wants only the dub or burn-in → if SRT exists, skip; otherwise run this first.
Routing: which engine
| Source language | Default engine | Why |
|---|---|---|
| Chinese (zh-CN, zh-HK, zh-TW) | Volcano (豆包) ASR | Materially better accuracy than Whisper for Chinese — user's standing preference |
| Any other (es, en, pt, fr, it, ja, ko, …) | OpenAI Whisper API with word-level granularity | Whisper's multilingual accuracy is strong; word timestamps let us assemble cues ourselves |
| Offline / no API access | Local openai-whisper (medium) | Quality floor; same loop/blob failure modes apply |
For Chinese, do not default to Whisper unless the user explicitly asks for it or Volcano is unavailable. This is a deliberate routing decision — see user's memory on Chinese ASR priority.
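The routing table above can be sketched as a small function. This is a minimal illustration, not the skill's actual dispatcher: the function name `pick_engine`, its parameters, and the returned labels are all hypothetical, and it assumes language codes follow the `zh-XX` / two-letter pattern used in the table.

```python
def pick_engine(lang: str, *, api_access: bool = True,
                volcano_available: bool = True,
                user_wants_whisper: bool = False) -> str:
    """Return an illustrative label for the ASR engine to use.

    All names and labels here are assumptions made for this sketch.
    """
    # No API access at all: fall back to the local model (quality floor).
    if not api_access:
        return "local-whisper-medium"
    # Chinese goes to Volcano unless the user asked for Whisper
    # or Volcano is unavailable.
    is_chinese = lang.lower().startswith("zh")
    if is_chinese and volcano_available and not user_wants_whisper:
        return "volcano"
    # Everything else, plus the Chinese fallback, uses the Whisper API path.
    return "openai-whisper-api"
```

For example, `pick_engine("zh-TW", volcano_available=False)` falls through to the Whisper API path, matching the "Chinese fallback" case described in the next section.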
OpenAI Whisper API path (non-Chinese, and Chinese fallback)
The key principle: do not request response_format=srt. Whisper cue-segmentation fails on long monologues (30-second blob cues) and quiet stretches (loop hallucinations). Request word-level timestamps and assemble cues yourself — the post-processing is deterministic and free.
Why not response_format=srt
Two failure modes that wreck whisper-1 SRT output on long content:
- 30-second blob cues. In long monologues, whisper-1 with response_format=srt emits one cue covering the full 30s condition_on_previous_text window. Transcript is fine; timing is unusable for on-screen reading.
- Loop hallucination on quiet tails. Greedy temperature=0 on low-energy audio produces "你如果不把拥抱浪费写在这上面,你很难的" repeated 50 times.
Both stem from letting Whisper decide cue boundaries. Fix: word-level timestamps + your own punctuation-aware assembler.
Calling the API
# 1. Compress for upload — 64kbps mono MP3 is plenty for speech.
# OpenAI limit is 25MB per request; chunk into 10-min pieces
# (≈4.5MB at 64kbps) for resilience under flaky proxies.
ffmpeg -hide_banner -loglevel error -y \
  -ss <start> -t 600 -i input.mp4 \
  -vn -ac 1 -ar 16000 -c:a libmp3lame -b:a 64k chunk.mp3
# 2. Request word-level timestamps. Do NOT request response_format=srt.
import httpx, os

data = {
    "model": "whisper-1",
    "language": "es",                      # pin source language; never auto-detect
    "response_format": "verbose_json",
    "timestamp_granularities[]": "word",   # ← the critical flag
    "temperature": "0.2",                  # enable fallback chain (anti-loop)
}
with open("chunk.mp3", "rb") as f:
    r = httpx.post(
        "https://api.openai.com/v1/audio/transcriptions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        data=data,
        files={"file": ("chunk.mp3", f, "audio/mpeg")},
        timeout=600.0,
    )
r.raise_for_status()
j = r.json()
words = j["words"]        # [{"word": "hola", "start": 0.12, "end": 0.34}, ...]
segments = j["segments"]  # see surprise below
Surprise: words[] has no punctuation, segments[] is inconsistent
Whisper's words[] array typically has no punctuation in word["word"]: each entry is a bare token like "做", "个", "测", "试". Punctuation, when present, lives only in the text field of segments[].
Worse, segments[] text is inconsistently punctuated across chunks of the same file: chunk 0 of a 79-min podcast might emit 285 bare segments ("做个测试" "你在" "呵呵") at 1-2s each with no punctuation; chunk 7 might emit 34 segments at 14-30s each with punctuation. Both behaviors show up in responses for the same file.
So the right recipe combines both: use segments[] for natural pause boundaries (already aligned to breaths), but treat them as raw input to your own cue assembler, which uses word timestamps to split wherever a segment is too long.
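The word-timestamp fallback described above can be sketched as a recursive split at the largest silence between words. This is a minimal sketch under stated assumptions: the function name `split_at_largest_gap` and its parameters are hypothetical, the input is the words[] shape shown earlier (dicts with "word", "start", "end"), and `sep` is an assumed knob for joining tokens ("" for Chinese, " " for space-delimited languages).

```python
def split_at_largest_gap(words, max_dur=5.0, max_chars=18, sep=""):
    """Recursively split a run of word dicts at its largest inter-word gap
    until every piece fits within max_dur seconds and max_chars characters.

    words: [{"word": str, "start": float, "end": float}, ...] in time order.
    Returns a list of (start, end, text) tuples.
    """
    if not words:
        return []
    start, end = words[0]["start"], words[-1]["end"]
    text = sep.join(w["word"] for w in words)
    fits = (end - start) <= max_dur and len(text) <= max_chars
    # A single word cannot be split further, so emit it even if oversized.
    if fits or len(words) == 1:
        return [(start, end, text)]
    # gaps[i] is the silence between words[i] and words[i + 1].
    gaps = [words[i + 1]["start"] - words[i]["end"]
            for i in range(len(words) - 1)]
    # Cut right after the word that precedes the largest gap.
    cut = max(range(len(gaps)), key=gaps.__getitem__) + 1
    return (split_at_largest_gap(words[:cut], max_dur, max_chars, sep)
            + split_at_largest_gap(words[cut:], max_dur, max_chars, sep))
```

Because each cut leaves at least one word on both sides, the recursion always terminates. Ties between equal gaps resolve to the earliest one, which is an arbitrary choice in this sketch.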
Cue assembly recipe
TARGET_DUR  = 3.0   # try to make cues this long (seconds)
MAX_CUE_DUR = 5.0   # never exceed (seconds)
MAX_CHARS   = 18    # ~one line at Fontsize 14 on 1080-wide vertical
MAX_GAP     = 1.0   # silence threshold (seconds) → force cue boundary
MIN_PIECE   = 0.3   # below this (seconds), merge with neighbor
SPLIT_PUNCT = set(",。!?;,.;!?")  # full-width and ASCII punctuation
# Step A: merge short segments[] toward TARGET_DUR (use segments,
# not words — Whisper's segment boundaries are already
# pause-aligned).
def assemble(segments, offset):
    # offset: absolute start time of this chunk in the full file (seconds).
    # split_long_segment(s, offset) is a helper not shown in this section.
    cues, buf = [], []

    def flush():
        if buf:
            # NOTE: "".join suits CJK text; for space-delimited languages
            # join with " " instead, or words fuse across segments.
            cues.append((buf[0]["start"] + offset, buf[-1]["end"] + offset,
                         "".join(s["text"].strip() for s in buf)))
            buf.clear()

    for s in segments:
        dur = s["end"] - s["start"]
        # Long single segment WITH internal punct → split standalone
        if dur > MAX_CUE_DUR and any(c in s["text"] for c in SPLIT_PUNCT):
            flush()
            cues.extend(split_long_segment(s, offset))
            continue
        if not buf:
            buf.append(s)
            continue
        if (s["start"] - buf[-1]["end"]) >= MAX_GAP \
                or (buf[-1]["end"] - buf[0]["start"]) >= TARGET_DUR \
                or (s["end"] - buf[0]["start"]) > MAX_CUE_DUR:
            flush()
        buf.append(s)
    flush()
    return cues
# Step B: final pass — split every internal comma/period to its own cue
# (proportional timestamps by char position). Coalesce pieces
# shorter than MIN_PIECE forward.
# Step C: any cue still > MAX_CHARS gets split at the largest inter-word
# gap using words[] timestamps. Recursive until under cap.
Tune TARGET_DUR and MAX_CHARS to the platform's reading rhythm. The 18-char cap matters for burn-in on vertical 1080×1920 at Fontsize=14: longer text wraps onto multiple unreadable lines.
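Step B is only described in comments, so here is one way it might look. This is a sketch under assumptions, not the original implementation: the function name `split_at_punct` is hypothetical, cues are the (start, end, text) tuples produced by Step A, and timestamps are interpolated linearly by character position, so they are estimates rather than measured times. Note that a plain "." also matches decimals and abbreviations; this sketch does not guard against that.

```python
SPLIT_PUNCT = set(",。!?;,.;!?")  # full-width and ASCII punctuation

def split_at_punct(cue, min_piece=0.3):
    """Split one (start, end, text) cue after each punctuation mark.

    Each piece gets a time span proportional to its share of characters.
    Pieces shorter than min_piece seconds are merged into the next piece.
    """
    start, end, text = cue
    if not text:
        return []
    # Cut the text after every punctuation character.
    parts, cur = [], ""
    for ch in text:
        cur += ch
        if ch in SPLIT_PUNCT:
            parts.append(cur)
            cur = ""
    if cur:
        parts.append(cur)
    # Interpolate timestamps by character position.
    total, pos, pieces = len(text), 0, []
    for p in parts:
        t0 = start + (end - start) * pos / total
        pos += len(p)
        t1 = start + (end - start) * pos / total
        pieces.append((t0, t1, p))
    # Coalesce too-short pieces forward into the following piece.
    out, carry = [], None
    for t0, t1, p in pieces:
        if carry is not None:
            t0, p = carry[0], carry[2] + p
            carry = None
        if t1 - t0 < min_piece:
            carry = (t0, t1, p)
        else:
            out.append((t0, t1, p))
    # A short trailing piece has no successor: attach it to the previous cue.
    if carry is not None:
        if out:
            prev = out.pop()
            out.append((prev[0], carry[1], prev[2] + carry[2]))
        else:
            out.append(carry)
    return [(t0, t1, p.strip()) for t0, t1, p in out]
```

Proportional timing is a cheap approximation; where words[] timestamps are available for the same span, they give more accurate boundaries at the cost of matching tokens back to the punctuated text.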
Operational details
- Auth: credentials live in ~/code/.env. Load with "set -a; source ~/code/.env; set +a" before invoking.
- SOCKS proxy on this machine: httpx needs the socksio extra; use uvx --with httpx --with socksio python ... (without it you get ImportError: Using SOCKS proxy, but the 'socksio' package is not installed).
- Chunking: 10-min pieces at 64kbps mono MP3 (~4.5MB each) are the reliability sweet spot. 20-min chunks (~9MB) sometimes RST under flaky proxies. Concurrency max_workers=2 is more reliable than 4.
- Retry: every API call wrapped in 5× exponential backoff (time.sleep(min(2**n, 30))), since "RemoteProtocolError: Server disconnected" is common and transient.
- Offset stitching: each chunk's words come back with timestamps relative to that chunk. When merging, add the chunk's absolute start offset to every word's start/end before assembling cues.
- Loop guard (belt + suspenders): even with temperature=0.2, occasionally a sub-chunk still loops. After assembly, run a loop-detector on each cue's text; if any phrase of length 8–40 chars repeats 3+ times consecutively, drop the cue.
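The retry, offset-stitching, and loop-guard items above can be sketched as three small helpers. These are illustrative only: the names `with_retry`, `shift_words`, and `has_loop` are hypothetical, the retry catches a broad `Exception` for brevity (in real use it would be narrowed to transport errors), and the loop detector is a brute-force scan that assumes cue texts are short.

```python
import time

def with_retry(call, attempts=5, sleep=time.sleep):
    """Run call() up to `attempts` times with capped exponential backoff."""
    for n in range(attempts):
        try:
            return call()
        except Exception:  # assumption: narrow to transport errors in real use
            if n == attempts - 1:
                raise
            sleep(min(2 ** n, 30))

def shift_words(words, chunk_start):
    """Convert chunk-relative word times to absolute times in the full file."""
    return [{**w, "start": w["start"] + chunk_start,
             "end": w["end"] + chunk_start} for w in words]

def has_loop(text, min_len=8, max_len=40, repeats=3):
    """True if any phrase of min_len..max_len chars repeats `repeats` or
    more times consecutively anywhere in text."""
    for size in range(min_len, max_len + 1):
        for i in range(len(text) - size * repeats + 1):
            phrase = text[i:i + size]
            if text[i:i + size * repeats] == phrase * repeats:
                return True
    return False
```

A typical use would be `cues = [c for c in cues if not has_loop(c[2])]` after assembly, with `shift_words(words, chunk_index * 600)` applied per chunk before assembly when chunks are cut at 600-second intervals as in the ffmpeg command.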
Anti-patterns (do not do)
- ❌ Do not request response_format=srt for content longer than ~2 minutes.
- ❌ Do not "fix" bad cues with a second API call. If you got blob cues or loop hallucinations from your first call, redo with word-level granularity once; don't re-transcribe just the broken sub-range.
- ❌ Do not use temperature=0 on potentially-quiet audio (yoga, spiritual content, podcast outros). Greedy decoding loops. 0.2 enables the fallback chain.
- ❌ Do not skip language=.... Aut