
ai-video

DevOps and Infra

Generate AI video of any kind, end to end — write the prompt, create first-frame stills, run the generation across six video models (Seedance 2.0, Kling 3.0, Wan 2.7, Veo 3.1, OmniHuman 1.5, VEED Fabric), synthesize voiceover, stitch clips into a finished cut, and upscale for delivery. Use whenever the user wants to make or improve a video: cinematic shots, character dialogue, action, dance, nativ

3 stars · Author: 0xadvait · License: MIT

AI video — generation, any kind

An end-to-end AI video skill: prompt craft → first-frame stills → generation across six models → voiceover → edit/stitch → upscale, with a quality-control loop that makes the skill improve over time. The ByteDance Seedance 2.0 schema is the canonical interchange format — jobs are authored against it and the runner translates to whichever model fits. Treat every clip as a miniature production brief, not an image caption.

Full pipeline

Not every job needs every stage — a single clip is just stages 2–4 — but this is the shape of a complete video:

  1. Still (scripts/imagegen.py) — generate a first-frame image or character reference plate (Flux / Imagen). Skip for pure text-to-video.
  2. Prompt — build the request with the 5-part structure (Workflow below).
  3. Validate + generate (validate.py → generate.py) — run on the auto-routed model; local refs auto-upload.
  4. QC + learn (review.py → LESSONS.md) — watch the clip, score it, record a lesson. Mandatory.
  5. Voiceover (scripts/tts.py) — synthesize narration (e.g. the user's cloned voice) when a clip or cut needs spoken audio.
  6. Assemble (scripts/assemble.py) — stitch multiple clips into a finished video with cuts/crossfades and an audio bed.
  7. Finish (scripts/upscale.py) — upscale resolution + interpolate fps for delivery.

Workflow (a single generation)

  1. Read LESSONS.md first. It is the skill's growing memory of what has and hasn't worked. Apply its lessons when building the prompt.

  2. Pick a production mode (the choice drives everything else):

    | Mode | When | Reference inputs |
    | --- | --- | --- |
    | Text-to-video | Tone piece, action, abstract; no source media | none |
    | Native dialogue | A character speaks; voice is generated | none (generate_audio=true) |
    | Native SFX | Sound-led showcase (ASMR, ambience) | none |
    | Image-to-video | Animate a still / first frame → last frame | image (+ last_frame_image) |
    | Character consistency | Same character/object across shots | reference_images ≤9 |
    | Motion transfer | Keep a video's motion, swap the subject | reference_videos + reference_images |
    | Lip-sync | Match a face to a real voice clip | reference_audios + reference_images |
    | Real-person episode | Founder/figure parody, stitched 30–80s | reference_images (6–9 face set) |

  3. Build the prompt. Use the 5-part canonical structure, in this order (earlier tokens carry more weight):

    Subject → Action → Camera → Style → Constraints

    • Subject: the visible thing + 2–3 concrete traits.
    • Action: one visible verb, not plot.
    • Camera: shot size, angle, lens, movement (name them — "35mm handheld push-in", not "cinematic").
    • Style: lighting, palette, film stock, director, medium.
    • Constraints: production rules that pre-empt failures ("no subtitles", "single continuous take", "hands resting naturally").

    For anything longer than one simple shot, use time-coded blocks [00:00-00:05] — Seedance reads them as hard editorial cuts and they re-anchor identity/wardrobe/palette every few seconds. A 15s clip is usually 3 shots, not 7. For the category-specific theory and paste-ready examples, read reference/prompt-logic.md (10 categories).

  4. Assemble the request body per reference/schema.md. Respect the exclusivity rules in §"Hard rules" below.

  5. Validate before spending a generation:

    python scripts/validate.py request.json
    

    Fix every ERROR; consider every WARNING.

  6. Generate:

    python scripts/generate.py job.json --out ./seedance_out [--fallback]
    

    job.json is {"id": "...", "model": "auto", "input": { ...schema... }} or an array of such jobs. Local file paths in any reference field are auto-uploaded. The model is auto-routed by intent (see §"Models" below) — set "model" or --model to override. --fallback retries classifier-flagged jobs on Kling.

  7. Quality control + learn (mandatory — this is what makes the skill improve). See §"Quality control & self-improvement" below.

  8. Iterate only if asked. Record the lesson regardless; do not regenerate unless the user asks for it. If they do, remove one demand before adding three. Most fixes are in reference/failure-modes.md.
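The prompt and job steps above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's own code: the helper names (`build_prompt`, `build_timecoded_prompt`, `build_job`) are hypothetical, and the `"prompt"` key inside `"input"` is an assumption to be checked against reference/schema.md. Only the `id` / `model` / `input` envelope comes from the text above.

```python
import json

# Canonical order from the workflow: earlier parts carry more weight.
PART_ORDER = ("subject", "action", "camera", "style", "constraints")


def build_prompt(parts: dict) -> str:
    """Join the five parts in canonical order, refusing incomplete prompts."""
    missing = [name for name in PART_ORDER if not parts.get(name)]
    if missing:
        raise ValueError(f"missing prompt parts: {missing}")
    return " ".join(parts[name].strip() for name in PART_ORDER)


def timecode(seconds: int) -> str:
    """Format whole seconds as MM:SS."""
    return f"{seconds // 60:02d}:{seconds % 60:02d}"


def build_timecoded_prompt(shots: list) -> str:
    """Turn [(duration_seconds, parts), ...] into '[00:00-00:05] ...' blocks."""
    blocks, start = [], 0
    for duration, parts in shots:
        end = start + duration
        blocks.append(f"[{timecode(start)}-{timecode(end)}] {build_prompt(parts)}")
        start = end
    return "\n".join(blocks)


def build_job(job_id: str, prompt: str, model: str = "auto", **extra_input) -> dict:
    """Wrap a prompt in the job envelope; the 'prompt' field name is an assumption."""
    return {"id": job_id, "model": model, "input": {"prompt": prompt, **extra_input}}


shot = {
    "subject": "A red-haired courier in a yellow rain jacket.",
    "action": "She sprints across a wet rooftop.",
    "camera": "35mm handheld push-in, low angle.",
    "style": "Overcast dusk, teal and amber palette.",
    "constraints": "No subtitles, single continuous take.",
}

# A 15s clip as three 5s shots, each re-anchoring identity and palette.
job = build_job("rooftop-01", build_timecoded_prompt([(5, shot)] * 3))
print(json.dumps(job, indent=2))
```

Writing the printed JSON to job.json and passing it to scripts/validate.py before scripts/generate.py follows the order of steps 5 and 6.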

Models & auto-routing

Six runnable models, one schema. generate.py translates the canonical Seedance body to whichever model is chosen. With "model": "auto" (default) it routes by intent:

| Job signal | Routes to | Why |
| --- | --- | --- |
| reference_audios set (lip-sync, in a scene) | seedance-2.0 | audio-ref lip-sync inside a full generated scene |
| reference_videos set (motion transfer) | seedance-2.0 | quad-modal references |
| ≥2 reference_images (character bible) | seedance-2.0 | 9-image consistency |
| native dialogue / SFX, text-only | seedance-2.0 | native synchronized audio |
| single-image image-to-video, no other refs | wan-2.7-i2v | open-weights, permissive, strong i2v |
| classifier-flagged on Seedance (--fallback) | kling-3.0-omni | different moderation gateway |
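As a rough illustration of the routing rules listed above, the decision could be expressed as a short function. This is a sketch of the documented rules, not the actual logic in generate.py: `route_model` and its parameters are hypothetical names. It also assumes that plain text-to-video without references falls through to seedance-2.0, which the routing rules do not state explicitly.

```python
def route_model(job_input: dict, classifier_flagged: bool = False,
                fallback: bool = False) -> str:
    """Pick a model from the job's reference fields (hypothetical helper)."""
    # A job flagged by the Seedance classifier is retried on Kling, but only with --fallback.
    if classifier_flagged and fallback:
        return "kling-3.0-omni"
    # Audio or video references need Seedance's quad-modal inputs.
    if job_input.get("reference_audios") or job_input.get("reference_videos"):
        return "seedance-2.0"
    images = job_input.get("reference_images") or []
    # Two or more reference images form a character bible.
    if len(images) >= 2:
        return "seedance-2.0"
    # A single still with no other references goes to the i2v specialist.
    if job_input.get("image") or len(images) == 1:
        return "wan-2.7-i2v"
    # Text-only jobs (dialogue, SFX, and by assumption plain t2v) stay on Seedance.
    return "seedance-2.0"


print(route_model({"image": "still.png"}))
print(route_model({"reference_images": ["a.png", "b.png", "c.png"]}))
```

The checks run from most specific to least specific, so a job carrying both an audio reference and a single face image is still treated as in-scene lip-sync.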

The full model set, and when to override to one with --model:

  • seedance-2.0 — the generalist; quad-modal, native audio, scene-aware.
  • kling-3.0-omni — t2v + i2v with a different moderation gateway; pick it directly for i2v when Seedance won't pass moderation.
  • wan-2.7-i2v — open-weights, permissive image-to-video specialist.
  • veo-3.1 — highest-realism cinematic text-to-video, native audio; the free Gemini key is rate-limited (~5/day).
  • veed-fabric-1.0 — dedicated talking-video model (face image + speech audio → lip-synced clip). Best for a clean talking head when you do not need Seedance to generate a whole scene around it.
  • omnihuman-1.5 — realistic audio-driven avatar (face image + speech audio, optional prompt → full-body-aware talking human). The strongest pure lip-sync/gesture realism; use for founder/presenter clips.

Rule of thumb for lip-sync: if the shot is a scene with a speaking character, keep seedance-2.0; if it is just a person talking from a still + a voice clip, override to omnihuman-1.5 (or veed-fabric-1.0). Both need a reference_images (or image) face and a reference_audios clip. When you override, tell the user which model and why.
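The lip-sync rule of thumb can be written down as a small pre-flight check. The function names and the `prefer_veed` flag are hypothetical; only the model ids and the two required inputs come from the text above.

```python
def pick_lipsync_model(is_full_scene: bool, prefer_veed: bool = False) -> str:
    """Scene with a speaking character stays on Seedance; still + voice overrides."""
    if is_full_scene:
        return "seedance-2.0"
    return "veed-fabric-1.0" if prefer_veed else "omnihuman-1.5"


def check_lipsync_inputs(job_input: dict) -> list:
    """Return a list of problems; an empty list means both required inputs are present."""
    problems = []
    if not (job_input.get("reference_images") or job_input.get("image")):
        problems.append("missing face image (reference_images or image)")
    if not job_input.get("reference_audios"):
        problems.append("missing reference_audios clip")
    return problems


model = pick_lipsync_model(is_full_scene=False)
issues = check_lipsync_inputs({"image": "face.png"})
print(model, issues)
```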

Personal profile — "a video of me"

When the request means the user themselves ("a video of me", "my avatar", "in my voice", "me talking"), load profile/profile.json and use it. That file is personal and git-ignored — if it is absent, the profile is not set up: tell the user to copy profile/profile.example.json to profile.json and fill in their avatar image, ElevenLabs voice id, and preferences.

  • Avatar → use avatar.image as the reference_images/image face. Honour every rule in avatar.rules (identity 100% faithful; mic to the side, mouth unobstructed; stylized studio look).
  • Voice → generate narration with scripts/tts.py (defaults to the profile's ElevenLabs voice id), then feed the MP3 as reference_audios.
  • Model → default talking-avatar clips to model_preferences.talking_avatar_default (wan-2.7-i2v); use the alternate for gesture-rich wider frames. Never use veed-fabric-1.0 for the user's avatar — see model_preferences.avoid and LESSONS.md.
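A minimal sketch of the profile lookup described above. The key paths `avatar.image`, `model_preferences.talking_avatar_default` and `model_preferences.avoid` are taken from the text; everything else is an assumption, including the function names, the exception class, and the idea that `avoid` is a list of model ids.

```python
import json
from pathlib import Path


class ProfileNotSetUp(Exception):
    """Raised when the personal, git-ignored profile file is absent."""


def load_profile(root: str = "profile") -> dict:
    """Load profile.json from the profile directory, or explain how to create it."""
    path = Path(root) / "profile.json"
    if not path.exists():
        raise ProfileNotSetUp(
            "profile/profile.json not found: copy profile/profile.example.json "
            "to profile.json and fill in the avatar image, voice id and preferences"
        )
    return json.loads(path.read_text(encoding="utf-8"))


def avatar_job(profile: dict, narration_mp3: str) -> dict:
    """Build a talking-avatar job from the profile and a narration file."""
    prefs = profile.get("model_preferences", {})
    model = prefs.get("talking_avatar_default")
    # Assumption: 'avoid' is a list of model ids that must never be used for the avatar.
    if model in prefs.get("avoid", []):
        raise ValueError(f"default model {model!r} is listed in model_preferences.avoid")
    return {
        "id": "avatar-clip",
        "model": model,
        "input": {
            "reference_images": [profile["avatar"]["image"]],
            "reference_audios": [narration_mp3],
        },
    }
```

In use, the narration file would be the MP3 produced by scripts/tts.py, and a `ProfileNotSetUp` error is the cue to tell the user how to create the profile instead of continuing.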

Quality control & self-improvement

Every generation ends with a QC pass — this is not optional, it is how the skill gets better:

  1. Extract frames: python scripts/review.py CLIP.mp4 --job job.json — writes a contact sheet + sampled frames + a probe to seedance_out/review/<id>/.
  2. Watch the clip. Read the contact_sheet.jpg and the individual frame_*.jpg files. Check the probe (duration, has_audio,

How to add

/plugin marketplace add 0xadvait/ai-video-skill

The exact command may vary by repository. Check the README on GitHub.
