wjs-dubbing-video

Video + target-language SRT → *_<lang>_dub.mp4 with a time-aligned TTS voice. This skill stops at the dub track. Burn-in + audio bed mixing is the next skill (/wjs-burning-subtitles/render.py composites everything in one final encode).

When to use

User has a target-language SRT (e.g., entrevista.zh-CN.srt) and wants the video to speak that language.
User says "中文配音" (Chinese dub), "配音" (dub), "帮我做配音" (make a dub for me), "dub it", or "voice over".
User has multiple speakers on camera and wants different voices per speaker.

When NOT to use

No SRT yet → run /wjs-transcribing-audio then /wjs-translating-subtitles first.
Source-language-only TTS (rare; usually you translate first) → still use this skill, but pass the source SRT.
Burn-in only, no audio change → skip to /wjs-burning-subtitles.

Number of speakers — default to one

Default: assume one speaker. Use a single voice for the entire dub. This is the right answer for monologues, vlogs, recorded talks, narrator-only clips, and the overwhelming majority of videos people ask about. Don't run diarization, don't tag the SRT with [A]/[B], don't bring up multi-speaker complexity.

Switch to multi-speaker only when the user explicitly says so — phrasings like "two people", "interview", "dialogue", "conversation between", "separate the speakers", "different voice for each", or a direct request to do diarization. When triggered, follow the "Multi-speaker dubbing" section below.

If you're unsure whether a video is one speaker or many, ship the single-voice version first. Adding speaker separation later is cheap (just regenerate the dub); shipping confused multi-speaker output by default wastes the user's time.

Engine routing — by voice ID

scripts/dub.py auto-routes by voice-ID prefix:

Voice ID pattern	Engine	Auth
`zh_..._bigtts`	Volcano (字节跳动豆包) TTS	`VOLC_TTS_APPID` + `VOLC_TTS_ACCESS_TOKEN`
`zh-CN-...Neural` / `en-US-...Neural` / etc.	edge-tts (Microsoft Edge neural)	none (free)

For Mandarin, Volcano is markedly more natural than edge-tts, especially for emotional/contemplative content. Use edge-tts when Volcano credentials aren't available or as a debugging fallback.
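
The routing rule in the table can be sketched as a small prefix/suffix check. This is a hypothetical helper, not the code in scripts/dub.py; the function name `pick_engine` and the exact matching rules (starts with `zh_` and ends with `_bigtts`, or ends with `Neural`) are assumptions read off the two patterns above.

```python
def pick_engine(voice_id: str) -> str:
    """Return "volcano" or "edge-tts" for a voice ID (hypothetical helper)."""
    # Volcano speakers look like zh_female_..._bigtts
    if voice_id.startswith("zh_") and voice_id.endswith("_bigtts"):
        return "volcano"
    # edge-tts voices look like zh-CN-...Neural or en-US-...Neural
    if voice_id.endswith("Neural"):
        return "edge-tts"
    raise ValueError(f"unrecognized voice ID: {voice_id!r}")


print(pick_engine("zh_female_gaolengyujie_moon_bigtts"))  # volcano
print(pick_engine("zh-CN-XiaoxiaoNeural"))                # edge-tts
```

Failing loudly on an unknown ID is a design choice for the sketch: a silent fallback to the wrong engine is harder to notice than an exception.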

Volcano TTS (Chinese only)

Endpoint: https://openspeech.bytedance.com/api/v3/tts/unidirectional (used for both TTS 1.0 and 2.0; the Resource-Id header picks the backend).

Headers:

X-Api-App-Id:       (env: VOLC_TTS_APPID)         # 10-digit speech App ID
X-Api-Access-Key:   (env: VOLC_TTS_ACCESS_TOKEN)  # 32-char token from speech console
X-Api-Resource-Id:  volc.service_type.10029       # see resource ID note below
Content-Type:       application/json

Loading credentials: most users keep them in ~/code/.env. Read them at the top of any session via:

set -a; source ~/code/.env; set +a
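
Once the variables are exported, the four headers listed above can be assembled in one place. A minimal sketch, assuming a hypothetical helper named `volcano_headers`; the header names and env variable names come from this document, while the fail-fast check on missing credentials is an addition for illustration.

```python
import os


def volcano_headers(env=os.environ) -> dict:
    """Build the Volcano TTS request headers from environment variables."""
    required = ("VOLC_TTS_APPID", "VOLC_TTS_ACCESS_TOKEN")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Assumption: failing early is friendlier than a 401 from the gateway.
        raise RuntimeError("missing env vars: " + ", ".join(missing))
    return {
        "X-Api-App-Id": env["VOLC_TTS_APPID"],
        "X-Api-Access-Key": env["VOLC_TTS_ACCESS_TOKEN"],
        # VOLC_TTS_RESOURCE overrides the default resource ID when set.
        "X-Api-Resource-Id": env.get("VOLC_TTS_RESOURCE", "volc.service_type.10029"),
        "Content-Type": "application/json",
    }
```

Passing `env` as a parameter keeps the helper easy to exercise with a plain dict instead of the real process environment.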

Resource ID — important quirk

The doc lists seed-tts-2.0 as the "TTS 2.0 (recommended)" resource, but a typical TTS-SeedTTS2.0 console instance does not include the popular *_bigtts speaker catalog (爽快斯斯, 高冷御姐, 开朗姐姐, etc.). Trying those speakers against seed-tts-2.0 returns 200 code=55000000 "resource ID is mismatched with speaker related resource". The fix is to use volc.service_type.10029 (the TTS 1.0 V3 endpoint) — the audio quality of the bigtts speakers is identical, and they all work against this resource. The bundled dub.py defaults to volc.service_type.10029; override with VOLC_TTS_RESOURCE env if you have a different instance.

Other 401/403 errors:

401 code=45000010 "load grant: requested grant not found in SaaS storage": the App ID + key combo is valid against the gateway, but the user has not activated this resource. They must go to 火山引擎 (Volcano Engine) → 语音技术 (Speech Technology) → 语音合成大模型 (Speech Synthesis Large Model) → 实例管理 (Instance Management) and activate (开通) the service. No workaround.
403 code=45000030 — the speaker isn't included in the user's instance bundle.
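
The three error codes described in this section can be turned into actionable messages with a small lookup. This is a sketch only: `ERROR_HINTS` and `explain_error` are hypothetical names, and the table covers just the codes mentioned here, not the full Volcano error catalog.

```python
# Hints for the error codes discussed in this section (not an exhaustive list).
ERROR_HINTS = {
    55000000: "speaker/resource mismatch: try resource ID volc.service_type.10029",
    45000010: "resource not activated: enable the service in the Volcano Engine console",
    45000030: "speaker is not included in this instance bundle",
}


def explain_error(code: int, message: str = "") -> str:
    """Combine the raw API code and message with a human-readable hint."""
    hint = ERROR_HINTS.get(code, "unknown code, check the Volcano docs")
    return f"code={code} {message}".strip() + f" ({hint})"


print(explain_error(45000030))
# code=45000030 (speaker is not included in this instance bundle)
```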

Response format

Whatever the doc's loose wording suggests, the response is streaming NDJSON, not a single JSON object and not raw audio bytes. Each line is a separate JSON event carrying a base64-encoded MP3 chunk in data. The terminal event has code: 20000000, which is a success code in this API (it is not code: 0). Concatenate the decoded chunks to get the full MP3.

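import base64, json, requests

# url, h (the headers above), and payload are assumed to be defined by the caller
audio = b""
r = requests.post(url, headers=h, json=payload, timeout=60, stream=True)
for line in r.iter_lines():
    if not line:
        continue  # skip blank keep-alive lines
    evt = json.loads(line)
    # 0, a missing code, and 20000000 are treated as success; anything else is an error
    if evt.get("code") not in (0, None, 20000000):
        raise RuntimeError(f"code={evt.get('code')} {evt.get('message')}")
    if evt.get("data"):
        audio += base64.b64decode(evt["data"])  # each chunk is base64-encoded MP3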

Speaker catalog (verified working under `volc.service_type.10029`)

Full list at volcengine.com/docs/6561/1257544 — but availability depends on your instance bundle. Confirmed-working female voices for the typical SeedTTS-2.0 starter instance:

Speaker ID	Chinese name	Feel
`zh_female_gaolengyujie_moon_bigtts`	高冷御姐	Best for contemplative/spiritual content. Mature, restrained, calm.
`zh_female_kailangjiejie_moon_bigtts`	开朗姐姐	Warm older-sister storytelling.
`zh_female_shuangkuaisisi_moon_bigtts`	爽快斯斯	Versatile, conversational baseline.
`zh_female_linjianvhai_moon_bigtts`	邻家女孩	Casual, lifestyle-vlog.
`zh_female_yuanqinvyou_moon_bigtts`	元气女友	Lively, upbeat.
`zh_female_meilinvyou_moon_bigtts`	美丽女友	Soft, intimate.
`zh_female_shuangkuaisisi_emo_v2_mars_bigtts`	斯斯情感版	Full emotional range — pair with explicit emotion + scale.

These voices return 55000000 against the typical instance even though the doc lists them: vv_uranus_bigtts, wenroushunv_moon_bigtts, qingxin_moon_bigtts, yingmaoxiaoyuan_moon_bigtts, tianxinxiaoling_moon_bigtts, shaoergushi_moon_bigtts. Don't promise them without testing.

Audio params

speech_rate is Volcano's native scale [-50, +100] where the value is a percentage delta (so -8 means 8% slower). The script passes --rate -8% through as -8.
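
The conversion from a `--rate` string to `speech_rate` can be sketched as follows. This is not the code in dub.py: the helper name `to_speech_rate` is hypothetical, and clamping out-of-range values to [-50, +100] is an assumption (the real script may reject them instead).

```python
def to_speech_rate(rate: str) -> int:
    """Convert an edge-tts style rate such as "-8%" to Volcano's speech_rate."""
    value = int(rate.strip().rstrip("%"))  # "-8%" -> -8, "+10%" -> 10
    # Assumption: clamp to the documented native range rather than raising.
    return max(-50, min(100, value))


print(to_speech_rate("-8%"))    # -8
print(to_speech_rate("+150%"))  # 100
```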

Useful emotion presets:

emotion="calm", emotion_scale=4 — contemplative, default for this skill's spiritual-content niche.
emotion="gentle" — softer / more intimate.
emotion="neutral" — flat / informational.
emotion="sad" — melancholic. Use sparingly.

Override dub.py defaults with VOLC_TTS_EMOTION and VOLC_TTS_EMOTION_SCALE env vars without editing code.
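
One way such an override could work is to read the two variables with the calm preset as the fallback. A minimal sketch under stated assumptions: `emotion_settings` is a hypothetical helper, the fallback values mirror the `calm` / `4` preset above, and treating the scale as an integer is a guess about the API's accepted type.

```python
import os


def emotion_settings(env=os.environ) -> dict:
    """Resolve emotion settings, letting env vars override the calm preset."""
    return {
        "emotion": env.get("VOLC_TTS_EMOTION", "calm"),
        # Assumption: the scale is an integer; adjust if the API expects a float.
        "emotion_scale": int(env.get("VOLC_TTS_EMOTION_SCALE", "4")),
    }


print(emotion_settings({}))  # {'emotion': 'calm', 'emotion_scale': 4}
```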

No English Volcano voices are wired up in this skill — for English use edge-tts (next section). Volcano does have English speakers (en_male_*_bigtts, en_female_*_bigtts) but they aren't typically included in TTS-SeedTTS-2.0 starter instances. Add them by extending the voice routing in dub.py once verified.

edge-tts (Microsoft Edge neural TTS)

Free, no API key, high-quality but less expressive than Volcano. Install it into a project venv; do not call it via uvx once per segment. Each uvx invocation spawns a fresh Python process, and the Bing endpoint will rate-limit or reset the connection after a handful of rapid hits, breaking the render midway.

uv venv .venv
uv pip install --python .venv/bin/python edge-tts

Then drive it from a single long-lived Python process using edge_tts.Communicate(...) directly, with retry-on-failure logic. The bundled scripts/dub.py does this.
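
The single-process pattern with retries might look like the sketch below. It is not the bundled dub.py: `with_retry`, `synth_segment`, and `synth_all` are hypothetical names, and the attempt count, backoff delays, and `seg_NNNN.mp3` file naming are assumptions. The only edge-tts calls used are `edge_tts.Communicate(text, voice, rate=...)` and its `save()` coroutine.

```python
import asyncio


async def with_retry(make_coro, attempts=4, base_delay=1.0):
    """Await make_coro() up to `attempts` times with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await make_coro()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * 2 ** attempt)


async def synth_segment(text, voice, out_path, rate="+0%"):
    # Imported lazily so the retry helper can be used without edge-tts installed.
    import edge_tts

    # A fresh Communicate object is created on every attempt.
    await with_retry(lambda: edge_tts.Communicate(text, voice, rate=rate).save(out_path))


async def synth_all(segments, voice):
    # One process, one event loop; segments are synthesized sequentially.
    for i, text in enumerate(segments):
        await synth_segment(text, voice, f"seg_{i:04d}.mp3")


# Example (needs network access and the edge-tts package):
# asyncio.run(synth_all(["你好", "世界"], "zh-CN-XiaoxiaoNeural"))
```

Running segments sequentially is deliberate here: it keeps the request rate low, which is the point of avoiding per-segment uvx calls.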

Voice selection — match the original speaker

There is no perfect cross-language match — choose gender, age feel, and tone deliberately, then bend with rate/pitch.

Chinese voices (Volcano preferred, edge-tts fallback)

Volcano's zh_female_gaolengyujie_moon_bigtts (高冷御姐, calm, speech_rate=-8) is the validated baseline for mature contemplative female speakers — equivalent to or better than any edge-tts option for that profi
