wjs-transcribing-audio

Spoken audio in → timestamped SRT in the same language out. This skill stops at the source-language SRT. Translation to another language is the next skill (/wjs-translating-subtitles).

When to use

User provides a video or audio file and wants a transcript / SRT in the source language.
User already has a translated SRT and the source SRT is missing.
User asks "做 SRT" ("make an SRT") / "make subtitles" / "出逐字稿" ("produce a verbatim transcript") with no translation step requested yet.

When NOT to use

Source-language SRT already exists → skip straight to /wjs-translating-subtitles.
User wants the transcript in a different language than spoken → run this skill first, then /wjs-translating-subtitles.
User wants only the dub or burn-in → if SRT exists, skip; otherwise run this first.

Routing: which engine

Source language	Default engine	Why
Chinese (zh-CN, zh-HK, zh-TW)	Volcano (豆包) ASR	Materially better accuracy than Whisper for Chinese — user's standing preference
Any other (es, en, pt, fr, it, ja, ko, …)	OpenAI Whisper API with word-level granularity	Whisper's multilingual is strong; word timestamps let us assemble cues ourselves
Offline / no API access	Local `openai-whisper` (medium)	Quality floor; same loop/blob failure modes apply

For Chinese, do not default to Whisper unless the user explicitly asks for it or Volcano is unavailable. This is a deliberate routing decision — see user's memory on Chinese ASR priority.
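The routing table above can be read as a small decision function. The sketch below is illustrative only: `pick_engine`, its parameters, and the returned labels are hypothetical names, not identifiers from any engine's SDK.

```python
def pick_engine(lang: str, online: bool = True,
                volcano_available: bool = True,
                force_whisper: bool = False) -> str:
    """Return an illustrative engine label for a language code such as 'zh-TW'."""
    if not online:
        # Offline / no API access: local openai-whisper (medium) is the floor.
        return "local-whisper-medium"
    primary = lang.lower().replace("_", "-").split("-")[0]
    if primary == "zh" and volcano_available and not force_whisper:
        # Chinese defaults to Volcano unless the user opts out or it is down.
        return "volcano-asr"
    return "openai-whisper-api"


print(pick_engine("zh-HK"))                           # Chinese default
print(pick_engine("zh-CN", volcano_available=False))  # Chinese fallback
print(pick_engine("pt"))                              # any other language
```

Keeping the decision in one place makes the Chinese fallback explicit: Whisper is reached for Chinese only through `force_whisper` or an unavailable Volcano.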

OpenAI Whisper API path (non-Chinese, and Chinese fallback)

The key principle: do not request response_format=srt. Whisper cue-segmentation fails on long monologues (30-second blob cues) and quiet stretches (loop hallucinations). Request word-level timestamps and assemble cues yourself — the post-processing is deterministic and free.

Why not response_format=srt

Two failure modes that wreck whisper-1 SRT output on long content:

30-second blob cues. In long monologues, whisper-1 with response_format=srt emits one cue covering a full 30-second decoding window. The transcript is fine; the timing is unusable for on-screen reading.
Loop hallucination on quiet tails. Greedy temperature=0 on low-energy audio produces "你如果不把拥抱浪费写在这上面,你很难的" repeated 50 times.

Both stem from letting Whisper decide cue boundaries. Fix: word-level timestamps + your own punctuation-aware assembler.

Calling the API

# 1. Compress for upload — 64kbps mono MP3 is plenty for speech.
#    OpenAI limit is 25MB per request; chunk into 10-min pieces
#    (≈4.5MB at 64kbps) for resilience under flaky proxies.
ffmpeg -hide_banner -loglevel error -y \
  -ss <start> -t 600 -i input.mp4 \
  -vn -ac 1 -ar 16000 -c:a libmp3lame -b:a 64k chunk.mp3

# 2. Request word-level timestamps. Do NOT request response_format=srt.
import httpx, os
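data = {
    "model": "whisper-1",
    "language": "es",                        # pin source language; never auto-detect
    "response_format": "verbose_json",
    # ← the critical flag. "segment" is requested as well because
    #   segments[] is read below; httpx sends a list as repeated form fields.
    "timestamp_granularities[]": ["word", "segment"],
    "temperature": "0.2",                    # enable fallback chain (anti-loop)
}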
with open("chunk.mp3", "rb") as f:
    r = httpx.post(
        "https://api.openai.com/v1/audio/transcriptions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        data=data,
        files={"file": ("chunk.mp3", f, "audio/mpeg")},
        timeout=600.0,
    )
r.raise_for_status()
j = r.json()
words    = j["words"]      # [{"word": "hola", "start": 0.12, "end": 0.34}, ...]
segments = j["segments"]   # see surprise below
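To fill in the `<start>` placeholder from step 1, one option is to plan the chunk offsets up front. This is a minimal sketch: `plan_chunks` and `ffmpeg_args` are hypothetical helper names, the ffmpeg flags are copied from the command above, and the total duration is assumed to be known already (for example from ffprobe).

```python
CHUNK_SECONDS = 600  # 10-minute pieces, as in step 1


def plan_chunks(total_seconds, chunk_seconds=CHUNK_SECONDS):
    """Return [(start, length), ...] covering the whole file in order."""
    plan = []
    start = 0
    while start < total_seconds:
        plan.append((start, min(chunk_seconds, total_seconds - start)))
        start += chunk_seconds
    return plan


def ffmpeg_args(src, start, length, dst):
    """Build the argv for one chunk; run it with subprocess.run(args, check=True)."""
    return ["ffmpeg", "-hide_banner", "-loglevel", "error", "-y",
            "-ss", str(start), "-t", str(length), "-i", src,
            "-vn", "-ac", "1", "-ar", "16000",
            "-c:a", "libmp3lame", "-b:a", "64k", dst]


# A 79-minute file (4740 s) becomes eight chunks; the last one is shorter.
for i, (start, length) in enumerate(plan_chunks(4740)):
    print(i, start, length, f"chunk_{i:03d}.mp3")
```

The `start` value of each chunk is also the offset to add back to that chunk's timestamps when stitching results together.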

Surprise: words[] has no punctuation, segments[] is inconsistent

Whisper's words[] array typically has no punctuation in word["word"]: each entry is a bare token like "做", "个", "测", "试". Punctuation, when present, lives only in the text field of segments[].

Worse, segments[] text is inconsistently punctuated across chunks of the same file: chunk 0 of a 79-min podcast might emit 285 bare segments ("做个测试" "你在" "呵呵") at 1-2s each with no punctuation; chunk 7 might emit 34 segments at 14-30s each with punctuation. Both behaviors can occur within a single run over the same file.

So the right recipe combines both: use segments[] for natural pause boundaries (already aligned to breath), but treat them as raw input to your own cue assembler, which uses word timestamps to split anywhere the segments are too long.
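One way to see which of the two behaviors a given chunk exhibits is to summarize its segments[] before assembly. The helper below is a hypothetical diagnostic, not part of the Whisper API; it assumes each segment is a dict with start, end, and text, as in the verbose_json response, and reuses the punctuation set from the assembler in this document.

```python
PUNCT = set("，。！？；,.;!?")


def describe_segments(segments):
    """Summarize one chunk: segment count, mean duration, punctuated share."""
    n = len(segments)
    if n == 0:
        return {"count": 0, "mean_dur": 0.0, "punctuated": 0.0}
    mean_dur = sum(s["end"] - s["start"] for s in segments) / n
    with_punct = sum(1 for s in segments if any(c in PUNCT for c in s["text"]))
    return {"count": n, "mean_dur": mean_dur, "punctuated": with_punct / n}


sample = [
    {"start": 0.0, "end": 1.0, "text": "做个测试"},
    {"start": 1.0, "end": 3.0, "text": "你在"},
]
print(describe_segments(sample))
```

Many short segments with a punctuated share near zero suggest the merge path will do most of the work; few long punctuated segments suggest the split path will.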

Cue assembly recipe

TARGET_DUR = 3.0   # try to make cues this long
MAX_CUE_DUR = 5.0  # never exceed
MAX_CHARS = 18     # ~one line at Fontsize 14 on 1080-wide vertical
MAX_GAP = 1.0      # silence threshold → force cue boundary
MIN_PIECE = 0.3    # below this, merge with neighbor
SPLIT_PUNCT = set("，。！？；,.;!?")

# Step A: merge short segments[] toward TARGET_DUR (use segments,
#         not words — Whisper's segment boundaries are already
#         pause-aligned).
def assemble(segments, offset):
    cues, buf = [], []
    def flush():
        if buf:
            cues.append((buf[0]["start"]+offset, buf[-1]["end"]+offset,
                         "".join(s["text"].strip() for s in buf)))
            buf.clear()
    for s in segments:
        dur = s["end"] - s["start"]
        # Long single segment WITH internal punct → split standalone
        if dur > MAX_CUE_DUR and any(c in s["text"] for c in SPLIT_PUNCT):
            flush()
            cues.extend(split_long_segment(s, offset))
            continue
        # Close the current cue on a long silence, once it has reached
        # TARGET_DUR, or if adding s would push it past MAX_CUE_DUR.
        if buf and ((s["start"] - buf[-1]["end"]) >= MAX_GAP
                    or (buf[-1]["end"] - buf[0]["start"]) >= TARGET_DUR
                    or (s["end"] - buf[0]["start"]) > MAX_CUE_DUR):
            flush()
        buf.append(s)
    flush()
    return cues

# Step B: final pass — split every internal comma/period to its own cue
#         (proportional timestamps by char position). Coalesce pieces
#         shorter than MIN_PIECE forward.

# Step C: any cue still > MAX_CHARS gets split at the largest inter-word
#         gap using words[] timestamps. Recursive until under cap.
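The assembler calls split_long_segment, whose body is not shown in this document. The following is a minimal sketch, assuming the helper applies the Step B idea to a single segment: cut after each punctuation mark, assign timestamps in proportion to character position, and merge pieces shorter than MIN_PIECE forward. The body is an assumption, not the author's implementation, and it cuts at every "." or ",", so a decimal such as 3.5 would be split.

```python
SPLIT_PUNCT = set("，。！？；,.;!?")
MIN_PIECE = 0.3  # seconds; shorter pieces are merged into the next one


def split_long_segment(seg, offset=0.0):
    """Split one segment at punctuation into (start, end, text) cues."""
    text = seg["text"].strip()
    if not text:
        return []
    # Cut after each punctuation mark, keeping the mark with its clause.
    pieces, buf = [], ""
    for ch in text:
        buf += ch
        if ch in SPLIT_PUNCT:
            pieces.append(buf)
            buf = ""
    if buf:
        pieces.append(buf)

    start = seg["start"] + offset
    dur = seg["end"] - seg["start"]
    total = len(text)
    cues, pos, pending, pending_start = [], 0, "", None
    for piece in pieces:
        piece_start = start + dur * pos / total
        pos += len(piece)
        piece_end = start + dur * pos / total
        if pending_start is None:
            pending_start = piece_start
        pending += piece
        if piece_end - pending_start >= MIN_PIECE:
            cues.append((pending_start, piece_end, pending.strip()))
            pending, pending_start = "", None
    if pending:
        # A short trailing piece has nothing after it: attach it backward.
        if cues:
            s0, _, t0 = cues[-1]
            cues[-1] = (s0, start + dur, t0 + pending.rstrip())
        else:
            cues.append((pending_start, start + dur, pending.strip()))
    return cues


seg = {"start": 10.0, "end": 16.0, "text": "今天天气很好，我们出去走走。"}
for cue in split_long_segment(seg, offset=600.0):
    print(cue)
```

Proportional timing is only an estimate of where each clause falls; where word timestamps are available for the same span, they give tighter boundaries.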

Tweak TARGET_DUR and MAX_CHARS to platform reading rhythm. The 18-char cap matters for burn-in on vertical 1080×1920 at Fontsize=14 — longer wraps to multiple unreadable lines.
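The character cap is what Step C enforces. A minimal sketch of that step, under stated assumptions: `split_by_gap` is a hypothetical name, it takes the words[] entries that fall inside one cue (with absolute timestamps), and it rebuilds the cue text from those words, so punctuation from segments[] is not carried over. The `joiner` parameter is also an assumption: an empty string for Chinese, a space for space-delimited languages.

```python
MAX_CHARS = 18


def split_by_gap(words, max_chars=MAX_CHARS, joiner=""):
    """Recursively split a run of word dicts at the largest silence
    until every resulting cue text fits within max_chars."""
    if not words:
        return []
    text = joiner.join(w["word"].strip() for w in words)
    if len(text) <= max_chars or len(words) < 2:
        return [(words[0]["start"], words[-1]["end"], text)]
    # Index i of the largest gap between words[i] and words[i + 1].
    cut = max(range(len(words) - 1),
              key=lambda i: words[i + 1]["start"] - words[i]["end"])
    return (split_by_gap(words[:cut + 1], max_chars, joiner)
            + split_by_gap(words[cut + 1:], max_chars, joiner))


words = [
    {"word": "hola", "start": 0.0, "end": 0.3},
    {"word": "amigos", "start": 0.35, "end": 0.8},
    {"word": "bienvenidos", "start": 1.5, "end": 2.2},
    {"word": "todos", "start": 2.25, "end": 2.6},
]
for cue in split_by_gap(words, joiner=" "):
    print(cue)
```

Cutting at the largest gap tends to land on a natural pause, which is why it is preferred here over cutting at a fixed character count.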

Operational details

Auth: credentials live in ~/code/.env. Load with set -a; source ~/code/.env; set +a before invoking.
SOCKS proxy on this machine: httpx needs the socksio extra — use uvx --with httpx --with socksio python ... (without it you get ImportError: Using SOCKS proxy, but the 'socksio' package is not installed).
Chunking: 10-min pieces at 64kbps mono MP3 (~4.5MB each) are the reliability sweet spot. 20-min chunks (~9MB) are sometimes reset mid-upload (TCP RST) under flaky proxies. Concurrency max_workers=2 is more reliable than 4.
Retry: every API call wrapped in 5× exponential backoff (time.sleep(min(2**n, 30))) — RemoteProtocolError: Server disconnected is common and transient.
Offset stitching: each chunk's words come back with timestamps relative to that chunk. When merging, add the chunk's absolute start offset to every word's start/end before assembling cues.
Loop guard (belt + suspenders): even with temperature=0.2, occasionally a sub-chunk still loops. After assembly, run a loop-detector on each cue's text — if any phrase of length 8–40 chars repeats 3+ times consecutively, drop the cue.
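The retry, offset-stitching, and loop-guard notes above can be sketched together. All function names here are hypothetical. The retry wrapper catches any exception by default to stay self-contained; in practice you would likely narrow `retry_on` to your HTTP client's transport errors. The loop detector is one possible reading of "a phrase of 8 to 40 chars repeated 3+ times consecutively", with whitespace ignored.

```python
import re
import time


def with_retry(call, attempts=5, retry_on=(Exception,), sleep=time.sleep):
    """Run call(); on failure wait min(2**n, 30) seconds and try again."""
    for n in range(attempts):
        try:
            return call()
        except retry_on:
            if n == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(min(2 ** n, 30))


def shift_timestamps(items, chunk_start):
    """Return copies of word/segment dicts moved to absolute file time."""
    return [{**it, "start": it["start"] + chunk_start,
             "end": it["end"] + chunk_start} for it in items]


# A unit of 8-40 characters followed by at least two more copies of itself.
_LOOP = re.compile(r"(.{8,40}?)\1{2,}", re.DOTALL)


def is_looping(text):
    """True if the text contains a phrase repeated 3+ times back to back."""
    return _LOOP.search(re.sub(r"\s+", "", text)) is not None


def drop_looping_cues(cues):
    """Filter (start, end, text) cues, removing the ones that loop."""
    return [c for c in cues if not is_looping(c[2])]


cues = [
    (0.0, 2.0, "hola a todos"),
    (2.0, 9.0, "gracias por ver " * 4),
]
print(drop_looping_cues(cues))
```

Passing `sleep` as a parameter keeps the backoff testable without real waiting; the default remains `time.sleep`.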

Anti-patterns (do not do)

❌ Do not request response_format=srt for content longer than ~2 minutes.
❌ Do not patch bad cues with piecemeal follow-up API calls. If your first call produced blob cues or loop hallucinations, redo it once with word-level granularity; don't re-transcribe just the broken sub-range.
❌ Do not use temperature=0 on potentially-quiet audio (yoga, spiritual content, podcast outros). Greedy decoding loops. 0.2 enables the fallback chain.
❌ Do not skip language=.... Aut
