wjs-overlaying-video
Post-production for a video clip: cover, captions, illustrations, CTA, custom motion graphics — all composed in ONE HyperFrames project and rendered in a SINGLE final encode. No cascade of decodes/re-encodes (each cascade pass degrades quality and burns time).
When to use
- Downstream of /wjs-segmenting-video: the segmentation skill hands you cropped clips + per-clip SRTs; this skill turns them into upload-ready MP4s with cover/captions/illustrations/CTA.
- User has a finished video and wants to dress it up with motion graphics: opening hook, key-quote callout, closing slogan, chapter cards, AI-generated cover as first frame.
- User wants HTML/CSS-quality captions on a video (kinetic word-by-word highlighting, custom fonts, large outlined text, seekable per cue).
- User wants illustration overlays at specific hook moments — diagrams, big text emphasis, flow charts.
Don't use for:
- Splitting one long video into clips → use /wjs-segmenting-video.
- Creating the source SRT → use /wjs-transcribing-audio (then /wjs-translating-subtitles if you need a different language).
- Full HyperFrames productions where the source isn't a fixed video → use hyperframes directly.
- WeChat Channels / Douyin upload (no public API for those) → this skill produces the MP4; upload is manual.
What this skill IS — and IS NOT
| Is | Is not |
|---|---|
| Everything that goes ON TOP of a video clip: cover, caption, chapter, illustration, CTA | Cutting / cropping a video (that's /wjs-segmenting-video + /wjs-reframing-video) |
| One HyperFrames composition per clip = ONE final encode | A multi-step decode/encode cascade |
| cover is the literal first frame of the output (platforms auto-pick it as thumbnail) | A separate thumbnail file the user uploads alongside |
| Captions are HTML/CSS, with -webkit-text-stroke for white-on-anything readability | libass burn-in (deprecated) |
| Illustrations: re-usable stack / hammer patterns + custom escape hatch | One bespoke HTML/CSS per illustration without re-use |
| AI covers regenerated at native target aspect (1024×1792 for vertical, 1536×1024 for horizontal) | Single 1024×1536 default that letterboxes or crops on the platform |
The pipeline
clip.mp4 + clip.zh-CN.burn.srt (from /wjs-segmenting-video hand-off)
↓
1. (Optional) Generate AI cover via gpt-image-2
make_cover.py --segments S.json --out output/ --size 1024x1792
cover_NN_slug.png
2. Scaffold a HyperFrames project per clip
hf_clip_NN/1080/{index.html, clip.mp4, cover.png, captions.json}
3. Compose: cover scene + body video + caption track + chapter chip
+ 1-2 illustrations at hook moments + CTA scene
4. npm run check (lint + validate + visual inspect)
npm run render → upload-ready MP4
A 2-minute vertical 1080×1920 composition renders in ~2-3 min on an M-series Mac.
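Step 2 of the pipeline can be pictured roughly as follows. This is a minimal sketch, not the skill's actual scaffolder: scaffold_clip_dir is a hypothetical helper, the two-digit zero padding of NN is an assumption, and index.html and captions.json are left for the compose step to write.

```python
import shutil
from pathlib import Path
from typing import Optional


def scaffold_clip_dir(root: Path, clip_no: int, clip_src: Path,
                      cover_src: Optional[Path] = None) -> Path:
    """Create hf_clip_NN/1080/ and copy the body clip (and optional cover) into it."""
    # Assumption: NN is the clip number, zero-padded to two digits.
    project = Path(root) / f"hf_clip_{clip_no:02d}" / "1080"
    project.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(clip_src, project / "clip.mp4")
    if cover_src is not None:
        shutil.copyfile(cover_src, project / "cover.png")
    return project
```

Keeping one directory per clip matches the "one composition per clip, one final encode" rule: nothing in the folder is shared with another clip's render.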
Color: tone-map HLG/HDR source → SDR BEFORE compositing
Only tone-map genuinely HLG/HDR sources. If the body clip is ALREADY Rec.709
SDR — e.g. a graded multicam render, or polysync output where an S-Log3→709 LUT
was already applied — running the HLG tone-map recipe on it washes/darkens the
already-correct color. build_hf_clips.py's tonemap_to_sdr now probes
color_transfer (_is_hlg_hdr): HLG/PQ → tone-map; otherwise a straight
re-encode with dense keyframes (no tone-map). Either way you still get the
-g 30 dense-keyframe encode HyperFrames needs.
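The probe-then-decide step might look like the sketch below. It is not the actual _is_hlg_hdr implementation: needs_tonemap and probe_color_transfer are hypothetical names, and the sketch assumes ffprobe is on PATH and reports HLG as arib-std-b67 and PQ as smpte2084.

```python
import subprocess

# Transfer characteristics as ffprobe names them: HLG and PQ respectively.
HDR_TRANSFERS = {"arib-std-b67", "smpte2084"}


def needs_tonemap(color_transfer: str) -> bool:
    """True only for HLG/PQ sources; Rec.709 and untagged clips are left alone."""
    return color_transfer.strip().lower() in HDR_TRANSFERS


def probe_color_transfer(path: str, ffprobe: str = "ffprobe") -> str:
    """Return the first video stream's color_transfer tag (may be empty or 'unknown')."""
    out = subprocess.run(
        [ffprobe, "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=color_transfer",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()
```

Treating an untagged clip as SDR is a deliberate choice in this sketch: tone-mapping an already-correct Rec.709 clip is the failure this section warns about.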
iPhone / modern-camera footage is often HLG HDR (bt2020 / arib-std-b67).
If you feed that straight into HyperFrames it either renders washed-out
or, with a naive --sdr, too dark; the HDR x265
path can also hang the renderer. Pre-convert the body clip to SDR (bt709)
30fps h264 with a locked zscale tone-map, then composite the SDR clip.
The verified recipe is tonemap_to_sdr() in build_hf_clips.py. npl=203
matches the macOS-native (qlmanage) reference brightness and hable keeps
contrast, so the ORIGINAL look (natural skin / foliage / brick) is preserved
with no wash and no darkening:
# zscale-capable ffmpeg — Homebrew's lacks zscale/tonemap.
# imageio-ffmpeg ships one: .../imageio_ffmpeg/binaries/ffmpeg-macos-aarch64-v7.1
TONEMAP_VF = ("zscale=tin=arib-std-b67:min=bt2020nc:pin=bt2020:t=linear:npl=203,"
"format=gbrpf32le,tonemap=tonemap=hable:desat=0,"
"zscale=t=bt709:m=bt709:p=bt709:r=tv,format=yuv420p,fps=30")
# encode: libx264 -crf 18 -color_primaries/-trc/-colorspace bt709
# -g 30 -keyint_min 30 -movflags +faststart ← see gotcha below
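Put together, the recipe above could be assembled into a full ffmpeg argument list as sketched here. build_sdr_cmd is a hypothetical helper, not the real tonemap_to_sdr(); the filter chain used on the no-tone-map path and the AAC audio handling are assumptions, and the ffmpeg argument must point at a zscale-capable binary.

```python
from typing import List

TONEMAP_VF = ("zscale=tin=arib-std-b67:min=bt2020nc:pin=bt2020:t=linear:npl=203,"
              "format=gbrpf32le,tonemap=tonemap=hable:desat=0,"
              "zscale=t=bt709:m=bt709:p=bt709:r=tv,format=yuv420p,fps=30")

# Assumption: SDR sources only need the frame-rate and pixel-format normalisation.
PASSTHROUGH_VF = "fps=30,format=yuv420p"


def build_sdr_cmd(ffmpeg: str, src: str, dst: str, tonemap: bool) -> List[str]:
    """Build the SDR pre-conversion command; tone-map only when asked to."""
    vf = TONEMAP_VF if tonemap else PASSTHROUGH_VF
    return [
        ffmpeg, "-y", "-i", src,
        "-vf", vf,
        "-c:v", "libx264", "-crf", "18",
        "-color_primaries", "bt709", "-color_trc", "bt709", "-colorspace", "bt709",
        "-g", "30", "-keyint_min", "30",   # dense keyframes on both paths
        "-c:a", "aac",                     # assumption: audio is re-encoded to AAC
        "-movflags", "+faststart",
        dst,
    ]
```

Passing the result to subprocess.run(cmd, check=True) runs the encode; both branches keep the dense-keyframe flags, so the output is safe for HyperFrames either way.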
Dense-keyframe gotcha. HyperFrames seeks the body video frame-by-frame.
A clip with sparse keyframes (long GOP) makes it freeze on stale frames;
the render log warns Video "video" has sparse keyframes. Always encode the
SDR clip with -g 30 -keyint_min 30 (at 30fps, one keyframe every 30 frames,
i.e. one per second) so every seek lands clean.
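Keyframe density can be checked before rendering instead of waiting for the warning. The sketch below is an illustration only: keyframe_times and max_keyframe_gap are hypothetical helpers, and it assumes ffprobe is on PATH. With -g 30 at 30fps the largest gap should be about 1 second.

```python
import subprocess
from typing import List


def keyframe_times(path: str, ffprobe: str = "ffprobe") -> List[float]:
    """List the presentation timestamps (seconds) of the keyframes in a clip."""
    out = subprocess.run(
        [ffprobe, "-v", "error", "-select_streams", "v:0",
         "-skip_frame", "nokey",               # decode keyframes only
         "-show_entries", "frame=pts_time",
         "-of", "csv=p=0", path],
        capture_output=True, text=True, check=True,
    )
    times = []
    for line in out.stdout.splitlines():
        field = line.split(",")[0].strip()
        try:
            times.append(float(field))
        except ValueError:
            continue  # skip blank or N/A rows
    return times


def max_keyframe_gap(times: List[float]) -> float:
    """Largest distance in seconds between consecutive keyframes (0.0 if fewer than two)."""
    ordered = sorted(times)
    return max((b - a for a, b in zip(ordered, ordered[1:])), default=0.0)
```

A gap well above 1 second on the pre-converted clip suggests the dense-keyframe flags were not applied.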
Verify the render log says No HDR sources detected — rendering SDR.
If it says HDR detected, your clip wasn't tone-mapped — fix that first.
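The two log checks can be automated with a plain substring scan. This is a sketch under assumptions: check_render_log is a hypothetical helper, and the matched phrases are taken from the messages quoted in this section, so the exact wording may differ between HyperFrames versions.

```python
from typing import List


def check_render_log(log_text: str) -> List[str]:
    """Return a list of problems found in a render log; an empty list means it looks clean."""
    problems = []
    # Assumption: the SDR confirmation line contains this phrase.
    if "No HDR sources detected" not in log_text:
        problems.append("SDR confirmation missing: body clip may not be tone-mapped")
    # Assumption: the sparse-keyframe warning contains this phrase.
    if "has sparse keyframes" in log_text:
        problems.append("sparse keyframes: re-encode with -g 30 -keyint_min 30")
    return problems
```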
Version stamp (every output)
Stamp the skill name + version number bottom-right, shown during the END/CTA scene,
so every render is traceable to the pipeline version that made it. Bump
VERSION in build_hf_clips.py on each pipeline change.
#ver-stamp { position: absolute; right: 28px; bottom: 28px; z-index: 30;
font-size: 20px; color: rgba(150,150,156,0.55); letter-spacing: 0.06em; }
<div id="ver-stamp" class="clip" data-start="{cta_start}" data-duration="{cta_dur}"
data-track-index="2">wjs-overlaying-video v1.3</div>
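Filling the {cta_start} and {cta_dur} placeholders might look like this. It is a sketch only: ver_stamp_html is a hypothetical helper, and formatting the times to three decimals is an assumption about what the composition accepts.

```python
from html import escape

VER_STAMP_TEMPLATE = (
    '<div id="ver-stamp" class="clip" data-start="{cta_start}" data-duration="{cta_dur}"\n'
    '     data-track-index="2">{label}</div>'
)


def ver_stamp_html(skill: str, version: str, cta_start: float, cta_dur: float) -> str:
    """Render the version-stamp element so it is visible only during the CTA scene."""
    return VER_STAMP_TEMPLATE.format(
        cta_start=f"{cta_start:.3f}",
        cta_dur=f"{cta_dur:.3f}",
        label=escape(f"{skill} v{version}"),
    )
```

Deriving the label from a single VERSION constant keeps the stamp and the pipeline version from drifting apart.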
Standard overlay types (the 6 building blocks)
Every clip's final composition is built from some combination of these. The agent picks the right ones per clip — typically all 6 for a podcast highlight, or just 1-2 for a single annotation overlay.
1. cover — full-frame AI image as first frame
The cover IS the first frame (no animation, no zoom) so platforms that
auto-pick the first frame as the thumbnail get your designed cover by
default. Always verify with ffmpeg -ss 0 -vframes 1 — frame 0
must NOT be black or platform thumbnails will be black.
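The frame-0 check can be scripted by asking ffmpeg for the first frame as raw 8-bit grayscale and averaging it. This is a sketch: the helper names are hypothetical, and the threshold of 16 is an assumed cutoff, not a value from the pipeline.

```python
import subprocess


def first_frame_gray(path: str, ffmpeg: str = "ffmpeg") -> bytes:
    """Decode frame 0 as raw 8-bit grayscale, one byte per pixel."""
    out = subprocess.run(
        [ffmpeg, "-v", "error", "-ss", "0", "-i", path,
         "-frames:v", "1", "-f", "rawvideo", "-pix_fmt", "gray", "-"],
        capture_output=True, check=True,
    )
    return out.stdout


def mean_luma(gray: bytes) -> float:
    """Average pixel value in 0..255; 0.0 for an empty frame."""
    return sum(gray) / len(gray) if gray else 0.0


def looks_black(gray: bytes, threshold: float = 16.0) -> bool:
    """Assumption: a mean below `threshold` counts as a black frame."""
    return mean_luma(gray) < threshold
```

Calling looks_black(first_frame_gray("out.mp4")) on the rendered file flags a black thumbnail before upload; an empty decode is also treated as black.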
HTML:
<div id="cover" class="clip" data-start="0" data-duration="1.6"
data-track-index="1" data-layout-allow-overflow>
<img src="cover.png" alt="" data-layout-allow-overflow />
</div>
CSS:
#cover { position: absolute; inset: 0; background: #0c0d10; overflow: hidden; }
#cover img { position: absolute; inset: 0; width: 100%; height: 100%; object-fit: cover; }
Generation: use /wjs-segmenting-video/scripts/make_cover.py
(wraps gpt-image-2 images edit with the midpoint frame as ref):
# For 1080×1920 vertical output (WeChat Channels / Douyin):
make_cover.py --segments S.json --out output/ --size 1024x1792 [--single N]
# For 1920×1080 horizontal output (YouTube / Bilibili):
make_cover.py --segments S.json --out output/ --size 1536x1024
Aspect must match output frame. --size 1024x1536 (2:3, the
script default) gets letterboxed or cropped on 9:16 output — always
pass 1024x1792 for vertical. The cover image's aspect is what the
viewer sees full-frame, so any mismatch is visible. Re-roll a single cover with
--single N; the codex provider can fail transiently mid-batch.
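The cost of an aspect mismatch can be worked out directly. The sketch below assumes the cover is scaled the way the object-fit: cover rule in the CSS does it (fill the frame, crop the overflow); cover_crop_fraction is a hypothetical helper.

```python
def cover_crop_fraction(img_w: int, img_h: int, frame_w: int, frame_h: int) -> float:
    """Fraction of the image area cropped away when it fills the frame (object-fit: cover)."""
    # Scale up until both frame dimensions are covered.
    scale = max(frame_w / img_w, frame_h / img_h)
    visible = (frame_w * frame_h) / (img_w * scale * img_h * scale)
    return 1.0 - visible


default_loss = cover_crop_fraction(1024, 1536, 1080, 1920)   # script default, 2:3
vertical_loss = cover_crop_fraction(1024, 1792, 1080, 1920)  # recommended vertical size
print(f"1024x1536: {default_loss:.1%} cropped, 1024x1792: {vertical_loss:.1%} cropped")
```

On a 1080×1920 frame the 2:3 default loses roughly 16% of the image off the sides, while 1024x1792 loses under 2%, which is why the vertical size should always be passed explicitly.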
Codex auth required: the script calls the codex CLI via
gpt-image-2-skill. If ~/.codex/auth.json is missing, the script
errors. See gpt-image-2-skill for setup.
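A preflight check avoids discovering the missing auth file mid-batch. This is a minimal sketch: codex_auth_present is a hypothetical helper, and it only checks that the file exists, not that the credentials inside are valid.

```python
from pathlib import Path
from typing import Optional


def codex_auth_present(home: Optional[Path] = None) -> bool:
    """True if ~/.codex/auth.json exists; `home` can be overridden for testing."""
    base = Path.home() if home is None else Path(home)
    return (base / ".codex" / "auth.json").is_file()
```

Running this before make_cover.py and stopping early when it returns False saves a failed cover batch.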
Reference frame must match the OUTPUT orientation. make_cover reads
output/frame_NN_slug.jpg as the photographic background it keeps. For
a vertical clip that came from a horizontal two-person source, the
default frame_NN is the horizontal two-shot — feeding that to a
1024x1792 cover crams both people into portrait awkwardly. Replace
frame_NN_slug.jpg with a vertical single-speaker frame pulled from
the already-cropped body clip first
(ffmpeg -ss <t> -i clip_vert.mp4 -frames:v 1 frame_NN_slug.jpg), then
run make_cover. The cover