AI video — generation, any kind
An end-to-end AI video skill: prompt craft → first-frame stills → generation across six models → voiceover → edit/stitch → upscale, with a quality-control loop that makes the skill improve over time. The ByteDance Seedance 2.0 schema is the canonical interchange format — jobs are authored against it and the runner translates to whichever model fits. Treat every clip as a miniature production brief, not an image caption.
Full pipeline
Not every job needs every stage — a single clip is just stages 2–4 — but this is the shape of a complete video:
1. Still (`scripts/imagegen.py`): generate a first-frame image or character reference plate (Flux / Imagen). Skip for pure text-to-video.
2. Prompt: build the request with the 5-part structure (Workflow below).
3. Validate + generate (`validate.py` → `generate.py`): run on the auto-routed model; local refs auto-upload.
4. QC + learn (`review.py` → `LESSONS.md`): watch the clip, score it, record a lesson. Mandatory.
5. Voiceover (`scripts/tts.py`): synthesize narration (e.g. the user's cloned voice) when a clip or cut needs spoken audio.
6. Assemble (`scripts/assemble.py`): stitch multiple clips into a finished video with cuts/crossfades and an audio bed.
7. Finish (`scripts/upscale.py`): upscale resolution + interpolate fps for delivery.
Workflow (a single generation)
1. Read `LESSONS.md` first. It is the skill's growing memory of what has and hasn't worked. Apply its lessons when building the prompt.

2. Pick a production mode (the choice drives everything else):

   | Mode | When | Reference inputs |
   |---|---|---|
   | Text-to-video | Tone piece, action, abstract; no source media | none |
   | Native dialogue | A character speaks; voice is generated | none (`generate_audio=true`) |
   | Native SFX | Sound-led showcase (ASMR, ambience) | none |
   | Image-to-video | Animate a still / first frame → last frame | `image` (+ `last_frame_image`) |
   | Character consistency | Same character/object across shots | `reference_images` ≤9 |
   | Motion transfer | Keep a video's motion, swap the subject | `reference_videos` + `reference_images` |
   | Lip-sync | Match a face to a real voice clip | `reference_audios` + `reference_images` |
   | Real-person episode | Founder/figure parody, stitched 30–80s | `reference_images` (6–9 face set) |

3. Build the prompt. Use the 5-part canonical structure, in this order (earlier tokens carry more weight):

   `Subject → Action → Camera → Style → Constraints`

   - Subject: the visible thing + 2–3 concrete traits.
   - Action: one visible verb, not plot.
   - Camera: shot size, angle, lens, movement (name them: "35mm handheld push-in", not "cinematic").
   - Style: lighting, palette, film stock, director, medium.
   - Constraints: production rules that pre-empt failures ("no subtitles", "single continuous take", "hands resting naturally").

   For anything longer than one simple shot, use time-coded blocks such as `[00:00-00:05]`. Seedance reads them as hard editorial cuts, and they re-anchor identity/wardrobe/palette every few seconds. A 15s clip is usually 3 shots, not 7. For the category-specific theory and paste-ready examples, read `reference/prompt-logic.md` (10 categories).

4. Assemble the request body per `reference/schema.md`. Respect the exclusivity rules in §"Hard rules" below.

5. Validate before spending a generation:

   `python scripts/validate.py request.json`

   Fix every ERROR; consider every WARNING.

6. Generate:

   `python scripts/generate.py job.json --out ./seedance_out [--fallback]`

   `job.json` is `{"id": "...", "model": "auto", "input": { ...schema... }}` or an array of such jobs. Local file paths in any reference field are auto-uploaded. The model is auto-routed by intent (see §"Models" below); set `"model"` or `--model` to override. `--fallback` retries classifier-flagged jobs on Kling.

7. Quality control + learn (mandatory: this is what makes the skill improve). See §"Quality control & self-improvement" below.

8. Iterate only if asked. Record the lesson regardless; do not regenerate unless the user asks for it. If they do, remove one demand before adding three. Most fixes are in `reference/failure-modes.md`.
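As a rough illustration of the prompt and job steps in the workflow above, here is a minimal Python sketch. It is not part of the skill's scripts: the helper names (`build_prompt`, `timecoded`, `make_job`) are hypothetical, and the `"prompt"` key inside `input` is an assumption to check against `reference/schema.md`. Only the job envelope (`id`, `model`, `input`) and the part order come from the text.

```python
import json

# Canonical order from the workflow: earlier parts carry more weight.
PART_ORDER = ("subject", "action", "camera", "style", "constraints")


def build_prompt(parts: dict[str, str]) -> str:
    """Join the five parts in canonical order, one sentence per part."""
    missing = [name for name in PART_ORDER if not parts.get(name, "").strip()]
    if missing:
        raise ValueError(f"missing prompt parts: {', '.join(missing)}")
    return " ".join(parts[name].strip().rstrip(".") + "." for name in PART_ORDER)


def timecoded(shots: list[tuple[int, int, str]]) -> str:
    """Render (start_s, end_s, text) shots as [MM:SS-MM:SS] blocks."""
    def stamp(seconds: int) -> str:
        return f"{seconds // 60:02d}:{seconds % 60:02d}"
    return "\n".join(f"[{stamp(a)}-{stamp(b)}] {text}" for a, b, text in shots)


def make_job(job_id: str, prompt: str, model: str = "auto", **refs) -> dict:
    """Wrap a prompt in the job envelope described in the Generate step."""
    # ASSUMPTION: the prompt lives under "prompt"; confirm in reference/schema.md.
    return {"id": job_id, "model": model, "input": {"prompt": prompt, **refs}}


if __name__ == "__main__":
    prompt = build_prompt({
        "subject": "A weathered fisherman, grey beard, yellow oilskin",
        "action": "hauls a net over the rail",
        "camera": "35mm handheld push-in, low angle",
        "style": "overcast dawn light, desaturated palette",
        "constraints": "no subtitles, single continuous take",
    })
    print(json.dumps(make_job("clip-001", prompt), indent=2))
```

Writing the printed JSON to `job.json` would give a file in the shape the Generate step expects; running `scripts/validate.py` on it first remains the real check.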
Models & auto-routing
Six runnable models, one schema. generate.py translates the canonical
Seedance body to whichever model is chosen. With "model": "auto" (default)
it routes by intent:
| Job signal | Routes to | Why |
|---|---|---|
| `reference_audios` set (lip-sync, in a scene) | `seedance-2.0` | audio-ref lip-sync inside a full generated scene |
| `reference_videos` set (motion transfer) | `seedance-2.0` | quad-modal references |
| ≥2 `reference_images` (character bible) | `seedance-2.0` | 9-image consistency |
| native dialogue / SFX, text-only | `seedance-2.0` | native synchronized audio |
| single-image image-to-video, no other refs | `wan-2.7-i2v` | open-weights, permissive, strong i2v |
| classifier-flagged on Seedance (`--fallback`) | `kling-3.0-omni` | different moderation gateway |
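The routing table can be read as a short decision function. The sketch below is an illustration only, not the logic inside `generate.py`: the function name `route_model` and its `classifier_flagged` / `fallback` parameters are hypothetical, and two choices are assumptions the table does not state (a request with `last_frame_image` is not treated as "single-image", and anything unmatched falls through to `seedance-2.0` as the generalist).

```python
def route_model(body: dict, classifier_flagged: bool = False, fallback: bool = False) -> str:
    """Pick a model name from a canonical request body, mirroring the table."""
    # Classifier-flagged on Seedance and --fallback requested: retry on Kling.
    if classifier_flagged and fallback:
        return "kling-3.0-omni"

    # Audio or video references need Seedance's quad-modal inputs.
    if body.get("reference_audios") or body.get("reference_videos"):
        return "seedance-2.0"

    # Two or more reference images form a character bible.
    ref_images = body.get("reference_images") or []
    if len(ref_images) >= 2:
        return "seedance-2.0"

    # Single first-frame image with no other references: image-to-video specialist.
    # ASSUMPTION: a last_frame_image counts as "another ref" and disqualifies this route.
    if body.get("image") and not ref_images and not body.get("last_frame_image"):
        return "wan-2.7-i2v"

    # Text-only jobs (including native dialogue / SFX).
    # ASSUMPTION: every remaining case also defaults to the generalist.
    return "seedance-2.0"
```

For example, `route_model({"prompt": "...", "image": "frame.png"})` returns `"wan-2.7-i2v"`, while adding a `reference_audios` entry to the same body returns `"seedance-2.0"`.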
The full model set, and when to `--model`-override to one:

- `seedance-2.0`: the generalist; quad-modal, native audio, scene-aware.
- `kling-3.0-omni`: t2v + i2v with a different moderation gateway; pick it directly for i2v when Seedance won't pass moderation.
- `wan-2.7-i2v`: open-weights, permissive image-to-video specialist.
- `veo-3.1`: highest-realism cinematic text-to-video, native audio; the free Gemini key is rate-limited (~5/day).
- `veed-fabric-1.0`: dedicated talking-video model (face image + speech audio → lip-synced clip). Best for a clean talking head when you do not need Seedance to generate a whole scene around it.
- `omnihuman-1.5`: realistic audio-driven avatar (face image + speech audio, optional prompt → full-body-aware talking human). The strongest pure lip-sync/gesture realism; use for founder/presenter clips.
Rule of thumb for lip-sync: if the shot is a scene with a speaking
character, keep `seedance-2.0`; if it is just a person talking from a still
+ a voice clip, override to `omnihuman-1.5` (or `veed-fabric-1.0`). Both need
a `reference_images` (or `image`) face and a `reference_audios` clip. When you
override, tell the user which model and why.
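The input requirement for the two talking-head models can be checked before a generation is spent. This is a small hypothetical pre-flight helper, not something `validate.py` is known to do; the name `check_talking_head` is made up for illustration, and it assumes the job envelope shape (`model`, `input`) described in the Workflow.

```python
TALKING_HEAD_MODELS = {"omnihuman-1.5", "veed-fabric-1.0"}


def check_talking_head(job: dict) -> list[str]:
    """Return a list of problems for a job overridden to a talking-head model."""
    problems: list[str] = []
    if job.get("model") not in TALKING_HEAD_MODELS:
        return problems  # rule only applies to the two talking-head overrides

    body = job.get("input", {})
    # Both models need a face image...
    if not (body.get("reference_images") or body.get("image")):
        problems.append("needs a face: set reference_images or image")
    # ...and a speech clip to drive the lips.
    if not body.get("reference_audios"):
        problems.append("needs a voice clip: set reference_audios")
    return problems
```

An empty list means the override has both inputs; any message in the list names the missing field.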
Personal profile — "a video of me"
When the request means the user themselves ("a video of me", "my avatar",
"in my voice", "me talking"), load profile/profile.json and use it. That
file is personal and git-ignored — if it is absent, the profile is not set
up: tell the user to copy profile/profile.example.json to profile.json
and fill in their avatar image, ElevenLabs voice id, and preferences.
- Avatar → use `avatar.image` as the `reference_images` / `image` face. Honour every rule in `avatar.rules` (identity 100% faithful; mic to the side, mouth unobstructed; stylized studio look).
- Voice → generate narration with `scripts/tts.py` (defaults to the profile's ElevenLabs voice id), then feed the MP3 as `reference_audios`.
- Model → default talking-avatar clips to `model_preferences.talking_avatar_default` (`wan-2.7-i2v`); use the alternate for gesture-rich wider frames. Never use `veed-fabric-1.0` for the user's avatar; see `model_preferences.avoid` and `LESSONS.md`.
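The profile rules above can be sketched as a loader plus a small job builder. This is a hedged illustration, not the skill's own code: `load_profile`, `avatar_job`, and `ProfileNotSetUp` are hypothetical names, and it assumes `model_preferences.avoid` is a list of model names. The keys `avatar.image`, `model_preferences.talking_avatar_default`, and `model_preferences.avoid` are the ones named in the text.

```python
import json
from pathlib import Path


class ProfileNotSetUp(Exception):
    """Raised when profile/profile.json has not been created yet."""


def load_profile(root: Path = Path("profile")) -> dict:
    """Load the git-ignored personal profile, or explain how to set it up."""
    path = root / "profile.json"
    if not path.exists():
        raise ProfileNotSetUp(
            "profile is not set up: copy profile/profile.example.json to "
            "profile.json and fill in the avatar image, ElevenLabs voice id, "
            "and preferences"
        )
    return json.loads(path.read_text(encoding="utf-8"))


def avatar_job(profile: dict, narration_mp3: str, job_id: str = "me-001") -> dict:
    """Build a talking-avatar job from the profile and a synthesized MP3."""
    prefs = profile["model_preferences"]
    model = prefs["talking_avatar_default"]
    # ASSUMPTION: "avoid" is a list of model names that must never be used.
    if model in prefs.get("avoid", []):
        raise ValueError(f"{model} is listed in model_preferences.avoid")
    return {
        "id": job_id,
        "model": model,
        "input": {
            "reference_images": [profile["avatar"]["image"]],
            "reference_audios": [narration_mp3],
        },
    }
```

The narration path passed in would be the MP3 produced by `scripts/tts.py`; the rules in `avatar.rules` still have to be honoured in the prompt itself.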
Quality control & self-improvement
Every generation ends with a QC pass — this is not optional, it is how the skill gets better:
1. Extract frames: `python scripts/review.py CLIP.mp4 --job job.json` writes a contact sheet + sampled frames + a probe to `seedance_out/review/<id>/`.
2. Watch the clip. Read the `contact_sheet.jpg` and the individual `frame_*.jpg` files. Check the probe (`duration`, `has_audio`,