wjs-dubbing-video
Video + target-language SRT → *_<lang>_dub.mp4 with a time-aligned TTS voice. This skill stops at the dub track. Burn-in and audio-bed mixing belong to the next skill (/wjs-burning-subtitles/render.py composites everything in one final encode).
When to use
- User has a target-language SRT (e.g., entrevista.zh-CN.srt) and wants the video to speak that language.
- User says "中文配音 / 配音 / 帮我做配音 / dub it / voice over".
- User has multiple speakers on camera and wants different voices per speaker.
When NOT to use
- No SRT yet → run /wjs-transcribing-audio then /wjs-translating-subtitles first.
- Source-language-only TTS (rare; usually you translate first) → still use this skill, but pass the source SRT.
- Burn-in only, no audio change → skip to /wjs-burning-subtitles.
Number of speakers — default to one
Default: assume one speaker. Use a single voice for the entire dub. This is the right answer for monologues, vlogs, recorded talks, narrator-only clips, and the overwhelming majority of videos people ask about. Don't run diarization, don't tag the SRT with [A]/[B], don't bring up multi-speaker complexity.
Switch to multi-speaker only when the user explicitly says so — phrasings like "two people", "interview", "dialogue", "conversation between", "separate the speakers", "different voice for each", or a direct request to do diarization. When triggered, follow the "Multi-speaker dubbing" section below.
If you're unsure whether a video is one speaker or many, ship the single-voice version first. Adding speaker separation later is cheap (just regenerate the dub); shipping confused multi-speaker output by default wastes the user's time.
Engine routing — by voice ID
scripts/dub.py auto-routes by voice-ID prefix:
| Voice ID pattern | Engine | Auth |
|---|---|---|
| zh_..._bigtts | Volcano (字节跳动豆包) TTS | VOLC_TTS_APPID + VOLC_TTS_ACCESS_TOKEN |
| zh-CN-...Neural / en-US-...Neural / etc. | edge-tts (Microsoft Edge neural) | none (free) |
For Mandarin, Volcano is markedly more natural than edge-tts, especially for emotional/contemplative content. Use edge-tts when Volcano credentials aren't available or as a debugging fallback.
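The routing rule in the table can be sketched as a small function. This is a minimal illustration and not the actual code in scripts/dub.py: the function name pick_engine and the exact prefix/suffix checks are assumptions inferred from the two voice-ID patterns above.

```python
def pick_engine(voice_id: str) -> str:
    """Return "volcano" or "edge-tts" for a voice ID (hypothetical helper)."""
    # Volcano speakers in this skill look like zh_<...>_bigtts.
    if voice_id.startswith("zh_") and voice_id.endswith("_bigtts"):
        return "volcano"
    # edge-tts voices look like <locale>-<Name>Neural, e.g. zh-CN-...Neural.
    if "-" in voice_id and voice_id.endswith("Neural"):
        return "edge-tts"
    raise ValueError(f"unrecognized voice ID: {voice_id!r}")
```

Raising on an unknown ID, rather than silently falling back to edge-tts, is a design choice made here so that a typo in a voice name fails before any audio is rendered.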
Volcano TTS (Chinese only)
Endpoint: https://openspeech.bytedance.com/api/v3/tts/unidirectional (used for both TTS 1.0 and 2.0; the Resource-Id header picks the backend).
Headers:
X-Api-App-Id: (env: VOLC_TTS_APPID) # 10-digit speech App ID
X-Api-Access-Key: (env: VOLC_TTS_ACCESS_TOKEN) # 32-char token from speech console
X-Api-Resource-Id: volc.service_type.10029 # see resource ID note below
Content-Type: application/json
Loading credentials: most users keep them in ~/code/.env. Read them at the top of any session via:
set -a; source ~/code/.env; set +a
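Once the variables are exported, the request headers can be assembled in Python. The sketch below is one way to do it, not the bundled implementation: the helper name volc_headers is hypothetical, and it assumes the VOLC_TTS_RESOURCE override described in the resource ID note below, falling back to volc.service_type.10029.

```python
import os

DEFAULT_RESOURCE = "volc.service_type.10029"


def volc_headers(env=None):
    """Build Volcano TTS request headers from environment variables."""
    env = os.environ if env is None else env
    required = ("VOLC_TTS_APPID", "VOLC_TTS_ACCESS_TOKEN")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail early with a clear message instead of sending an unauthenticated request.
        raise RuntimeError("missing credentials: " + ", ".join(missing))
    return {
        "X-Api-App-Id": env["VOLC_TTS_APPID"],
        "X-Api-Access-Key": env["VOLC_TTS_ACCESS_TOKEN"],
        "X-Api-Resource-Id": env.get("VOLC_TTS_RESOURCE") or DEFAULT_RESOURCE,
        "Content-Type": "application/json",
    }
```

Passing a plain dict as env makes the helper easy to exercise without touching real credentials.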
Resource ID — important quirk
The doc lists seed-tts-2.0 as the "TTS 2.0 (recommended)" resource, but a typical TTS-SeedTTS2.0 console instance does not include the popular *_bigtts speaker catalog (爽快斯斯, 高冷御姐, 开朗姐姐, etc.). Trying those speakers against seed-tts-2.0 returns 200 code=55000000 "resource ID is mismatched with speaker related resource". The fix is to use volc.service_type.10029 (the TTS 1.0 V3 endpoint) — the audio quality of the bigtts speakers is identical, and they all work against this resource. The bundled dub.py defaults to volc.service_type.10029; override with VOLC_TTS_RESOURCE env if you have a different instance.
Other 401/403 errors:
- 401 code=45000010 "load grant: requested grant not found in SaaS storage": the App ID + key combo is valid against the gateway, but the user has not activated this resource. They must go to 火山引擎 → 语音技术 → 语音合成大模型 → 实例管理 (Volcano Engine → Speech Technology → Large-Model Speech Synthesis → Instance Management) and activate (开通) the service. No workaround.
- 403 code=45000030: the speaker isn't included in the user's instance bundle.
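The three error codes discussed in this section can be turned into actionable messages with a small lookup. This is an illustrative sketch only: the names VOLC_ERROR_HINTS and explain_volc_error are hypothetical, and the table covers just the codes mentioned here, not the API's full error list.

```python
# Hints for the error codes described in this section (not an exhaustive list).
VOLC_ERROR_HINTS = {
    55000000: "speaker does not match the resource ID; try volc.service_type.10029",
    45000010: "resource not activated for this App ID; enable it in the console",
    45000030: "speaker is not included in the instance bundle",
}


def explain_volc_error(code, message=""):
    """Format an API error code with a hint on what to do next."""
    hint = VOLC_ERROR_HINTS.get(code, "unrecognized code; check the API message")
    return f"code={code} {message}".strip() + f" ({hint})"
```

For example, explain_volc_error(45000030) produces a string ending in the instance-bundle hint, which is more useful in a log than the bare number.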
Response format
Despite the doc's casual language, the response is streaming NDJSON, not a single JSON object and not raw audio bytes. Each line is a separate JSON event with a base64-encoded MP3 chunk in data. The terminal event has code: 20000000 (which means OK in this API's success codes — different from code: 0). Concatenate the decoded chunks for the full MP3.
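import base64, json, requests

# url, h (request headers), and payload are assumed to be defined by the caller.
audio = b""
r = requests.post(url, headers=h, json=payload, timeout=60, stream=True)
for line in r.iter_lines():
    if not line:
        continue  # skip blank lines between events
    evt = json.loads(line)
    # 20000000 is the terminal OK event; 0 or a missing code is treated as a non-error.
    if evt.get("code") not in (0, None, 20000000):
        raise RuntimeError(f"code={evt.get('code')} {evt.get('message')}")
    if evt.get("data"):
        audio += base64.b64decode(evt["data"])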
Speaker catalog (verified working under volc.service_type.10029)
Full list at volcengine.com/docs/6561/1257544 — but availability depends on your instance bundle. Confirmed-working female voices for the typical SeedTTS-2.0 starter instance:
| Speaker ID | Chinese name | Feel |
|---|---|---|
| zh_female_gaolengyujie_moon_bigtts | 高冷御姐 | Best for contemplative/spiritual content. Mature, restrained, calm. |
| zh_female_kailangjiejie_moon_bigtts | 开朗姐姐 | Warm older-sister storytelling. |
| zh_female_shuangkuaisisi_moon_bigtts | 爽快斯斯 | Versatile, conversational baseline. |
| zh_female_linjianvhai_moon_bigtts | 邻家女孩 | Casual, lifestyle-vlog. |
| zh_female_yuanqinvyou_moon_bigtts | 元气女友 | Lively, upbeat. |
| zh_female_meilinvyou_moon_bigtts | 美丽女友 | Soft, intimate. |
| zh_female_shuangkuaisisi_emo_v2_mars_bigtts | 斯斯情感版 | Full emotional range; pair with explicit emotion + scale. |
These voices return 55000000 against the typical instance even though the doc lists them: vv_uranus_bigtts, wenroushunv_moon_bigtts, qingxin_moon_bigtts, yingmaoxiaoyuan_moon_bigtts, tianxinxiaoling_moon_bigtts, shaoergushi_moon_bigtts. Don't promise them without testing.
Audio params
speech_rate is Volcano's native scale [-50, +100] where the value is a percentage delta (so -8 means 8% slower). The script passes --rate -8% through as -8.
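The conversion from a CLI-style rate string to speech_rate can be sketched as follows. The function name to_speech_rate is hypothetical, and rejecting out-of-range values (instead of clamping them) is an assumption made for this example; dub.py may handle them differently.

```python
def to_speech_rate(rate: str) -> int:
    """Convert a rate such as "-8%" or "+10%" to Volcano's integer speech_rate."""
    value = int(rate.strip().rstrip("%"))  # "-8%" -> -8, "+10%" -> 10
    if not -50 <= value <= 100:
        raise ValueError(f"speech_rate {value} is outside [-50, 100]")
    return value
```

With this helper, "-8%" maps to -8 (8% slower) and "+10%" maps to 10 (10% faster).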
Useful emotion presets:
- emotion="calm", emotion_scale=4: contemplative, default for this skill's spiritual-content niche.
- emotion="gentle": softer / more intimate.
- emotion="neutral": flat / informational.
- emotion="sad": melancholic. Use sparingly.
Override dub.py defaults with VOLC_TTS_EMOTION and VOLC_TTS_EMOTION_SCALE env vars without editing code.
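A minimal sketch of how such environment overrides might be read is shown below. The helper name emotion_settings is hypothetical, and the fallback values ("calm", 4) are assumed from the preset described above as this skill's default; the real defaults live in dub.py.

```python
import os


def emotion_settings(env=None):
    """Return (emotion, emotion_scale), honoring the VOLC_TTS_* overrides."""
    env = os.environ if env is None else env
    # Empty or unset variables fall back to the assumed defaults.
    emotion = env.get("VOLC_TTS_EMOTION") or "calm"
    scale = int(env.get("VOLC_TTS_EMOTION_SCALE") or 4)
    return emotion, scale
```

Running with VOLC_TTS_EMOTION=gentle exported would then switch the preset for one session without touching the script.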
No English Volcano voices are wired up in this skill — for English use edge-tts (next section). Volcano does have English speakers (en_male_*_bigtts, en_female_*_bigtts) but they aren't typically included in TTS-SeedTTS-2.0 starter instances. Add them by extending the voice routing in dub.py once verified.
edge-tts (Microsoft Edge neural TTS)
Free, no API key, high-quality but less expressive than Volcano. Install it into a project venv; do not call it via uvx once per segment. Each uvx invocation spawns a fresh Python process, and the Bing endpoint will rate-limit or reset (RST) the connection after a handful of rapid hits, breaking the render midway.
uv venv .venv
uv pip install --python .venv/bin/python edge-tts
Then drive it from a single long-lived Python process using edge_tts.Communicate(...) directly, with retry-on-failure logic. The bundled scripts/dub.py does this.
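The single-process pattern with retries can be sketched as below. This is not the code from scripts/dub.py: the helper names, the retry count, the exponential backoff, and the seg_NNNN.mp3 file naming are all assumptions for illustration. Only edge_tts.Communicate(...) and its save() coroutine come from the edge-tts package.

```python
import asyncio


async def with_retries(make_coro, attempts=4, base_delay=1.0):
    """Await make_coro() up to `attempts` times, backing off exponentially."""
    for attempt in range(attempts):
        try:
            return await make_coro()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            await asyncio.sleep(base_delay * 2 ** attempt)


async def synth_segment(text, voice, out_path, rate="+0%"):
    """Synthesize one subtitle segment to an MP3 file via edge-tts."""
    import edge_tts  # imported lazily so with_retries is usable without it

    async def attempt():
        # Build a fresh Communicate per attempt; a failed stream is not reused.
        await edge_tts.Communicate(text, voice, rate=rate).save(out_path)

    await with_retries(attempt)


async def synth_all(segments, voice):
    """Render all segments sequentially in one process and one event loop."""
    for index, text in enumerate(segments):
        await synth_segment(text, voice, f"seg_{index:04d}.mp3")
```

A caller would run asyncio.run(synth_all(texts, voice)) once per dub, where texts is the list of subtitle lines and voice is an edge-tts voice ID. Rendering sequentially keeps the request rate low, which matters more here than raw throughput.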
Voice selection — match the original speaker
There is no perfect cross-language match — choose gender, age feel, and tone deliberately, then bend with rate/pitch.
Chinese voices (Volcano preferred, edge-tts fallback)
Volcano's zh_female_gaolengyujie_moon_bigtts (高冷御姐, calm, speech_rate=-8) is the validated baseline for mature contemplative female speakers — equivalent to or better than any edge-tts option for that profi