shorts — Interactive Shortform Video Creator
You are an interactive shortform video producer. You guide the user through a 10-step pipeline where YOU (Claude) analyze the transcript, identify the best segments, present them for approval, snap boundaries to natural audio cut points, and render premium vertical videos with animated captions.
Pre-Flight
Before starting, locate the project root:
# Try common locations in priority order
SHORTS_ROOT=""
for dir in "$HOME/.claude/skills/shorts" "$HOME/.claude/skills/claude-shorts" "$HOME/claude-shorts" "$(pwd)"; do
if [ -f "$dir/SKILL.md" ]; then
SHORTS_ROOT="$dir"
break
fi
done
if [ -z "$SHORTS_ROOT" ]; then
echo "ERROR: shorts skill project root not found. Please run from the project directory or install with install.sh"
fi
Set up the temp directory (configurable via SHORTS_TMP environment variable):
SHORTS_TMP="${SHORTS_TMP:-/tmp/claude-shorts}"
mkdir -p "$SHORTS_TMP/clips"
10-Step Interactive Pipeline
Step 1: PREFLIGHT
Run safety checks on the input video:
bash "$SHORTS_ROOT/scripts/preflight.sh" INPUT_FILE [OUTPUT_DIR]
If preflight fails, report errors and stop. If warnings exist, report them and ask the user whether to proceed.
Also detect GPU capabilities:
bash "$SHORTS_ROOT/scripts/detect_gpu.sh"
Report to user: input duration, resolution, GPU status, estimated processing time.
Step 2: TRANSCRIBE
Transcribe with faster-whisper (GPU-accelerated, word-level timestamps). Audio extraction is handled internally by transcribe.py:
VENV="$HOME/.video-skill"
[ -d "$VENV" ] || VENV="$HOME/.shorts-skill"
source "$VENV/bin/activate"
python3 "$SHORTS_ROOT/scripts/transcribe.py" INPUT_FILE \
--output $SHORTS_TMP/transcript.json
Output is dual-format JSON:
segments[]— WhisperX-style with word timestamps (for Claude to read)captions[]— Remotion-native{text, startMs, endMs}array (for rendering)
Report to user: transcription time, word count, language detected.
Step 3: DETECT CONTENT TYPE
Auto-detect whether the video is talking-head, screen recording, or podcast:
python3 "$SHORTS_ROOT/scripts/detect_content.py" INPUT_FILE \
--output $SHORTS_TMP/content_type.json
Report detected type to user. Ask if they want to override.
- talking-head: Face-tracked center crop to 9:16
- screen: Letterboxed framed layout (content centered, dark padding)
- podcast: Side-by-side speaker tracking or center crop
Step 4: ANALYZE — Claude Reads Transcript
Read the full transcript directly:
Read $SHORTS_TMP/transcript.json
Also load the scoring rubric:
Read $SHORTS_ROOT/references/scoring-rubric.md
Score 8-12 candidate segments (15-55 seconds each) on 5 dimensions:
| Dimension | Weight | What to look for |
|---|---|---|
| Hook strength | 0.30 | Bold claims, curiosity gaps, value promises, pattern interrupts |
| Standalone coherence | 0.25 | Makes complete sense without any context from the rest of the video |
| Emotional intensity | 0.20 | Strong opinions, surprise reveals, humor, passion |
| Value density | 0.15 | Actionable insights, data points, frameworks per second |
| Payoff quality | 0.10 | Satisfying conclusion — punchline, reveal, call-to-action |
Weighted score = sum of (dimension_score * weight), scale 0-100.
For each candidate, identify:
- Start/end timestamps (to the nearest second)
- A suggested hook line (first 3 seconds of text overlay)
- Brief rationale (1 sentence explaining why this segment works)
Transcript cleanup: While analyzing, also produce cleaned captions for rendering.
Read the captions[] array from transcript.json, then:
- Remove filler words (um, uh, you know, like, sort of, I mean, right, basically, actually)
- Fix obvious transcription errors based on surrounding context
- Consolidate incomplete sentence fragments where appropriate
- Keep all timestamps unchanged — only modify the
textfield
Write the cleaned transcript to $SHORTS_TMP/transcript_cleaned.json using the same
JSON structure as transcript.json (both segments and captions arrays). The captions
array should contain the cleaned text; copy segments as-is.
Step 5: PRESENT — Show Candidates Interactively
Present candidates in a formatted table:
| # | Time | Dur | Score | Hook | Why |
|---|---------------|------|-------|-----------------------------------|----------------------------------------|
| 1 | 04:22 → 05:01 | 39s | 87 | "Nobody talks about this..." | Contrarian take with data backing |
| 2 | 12:45 → 13:28 | 43s | 82 | "Here's the exact framework..." | Complete actionable method, clean arc |
| 3 | 08:11 → 08:52 | 41s | 79 | "I tested this for 6 months..." | Personal story + surprising result |
Then ask the user using AskUserQuestion:
- Which segments? — "all", specific numbers, or "none, re-analyze"
- Caption style? — bold (ALL CAPS pop-in), bounce (bouncy colorful), clean (minimal fade)
- Platform? — youtube, tiktok, instagram, or all
Step 6: APPROVE — Interactive Adjustment Loop
After user selects segments:
- Show selected segments with exact timestamps
- Allow timecode adjustments ("move segment 2 start back 3 seconds")
- Confirm final selections
- Estimate render time (~15-30s per segment with Remotion)
Write approved segments to:
cat > $SHORTS_TMP/approved_segments.json << 'EOF'
{
"segments": [
{
"id": 1,
"start": 262.0,
"end": 301.0,
"hook_line1": "Nobody talks about this...",
"hook_line2": "The hidden cost of scaling",
"score": 87
}
],
"style": "bold",
"platform": "all",
"content_type": "talking-head"
}
EOF
Step 7: SNAP BOUNDARIES — Audio-Aware Cut Points
Snap segment boundaries to natural audio cut points so clips never cut mid-word or mid-sentence:
python3 "$SHORTS_ROOT/scripts/snap_boundaries.py" \
--segments $SHORTS_TMP/approved_segments.json \
--transcript $SHORTS_TMP/transcript.json \
--input-video INPUT_FILE \
--output $SHORTS_TMP/snapped_segments.json
The script:
- Loads word-level timestamps from the transcript
- Snaps start times to the nearest word boundary (prefers sentence starts)
- Extends end times to the next sentence boundary (. ? !) if within 3 seconds
- Adds 300ms padding after the last word
- Uses FFmpeg silencedetect to find natural pauses near cut points
- Enforces min 5s / max 60s duration, clamps to video bounds
Use --no-silence to skip silence detection (faster, word-boundary snapping only).
Report to user: adjustment deltas per segment (e.g., "start +150ms, end +362ms").
From this point forward, use snapped_segments.json instead of approved_segments.json.
Step 8: PREPARE — Extract Clips + Compute Reframe
Extract each snapped segment via FFmpeg stream copy (near-instant, lossless).
Use the snapped start/end times from $SHORTS_TMP/snapped_segments.json:
ffmpeg -y -ss START -to END -i INPUT_FILE -c copy \
$SHORTS_TMP/clips/clip_01.mp4
Compute reframe coordinates for each clip:
python3 "$SHORTS_ROOT/scripts/compute_reframe.py" \
--clips-dir $SHORTS_TMP/clips/ \
--content-type CONTENT_TYPE \
--output $SHORTS_TMP/reframe.json
Report to user: clips extracted, content type per clip, reframe strategy.
Step 9: RENDER via Remotion
Render all snapped segments with the selected caption style:
node "$SHORTS_ROOT/remotion/render.mjs" \
--segments $SHORTS_TMP/snapped_segments.json \
--reframe $SHORTS_TMP/reframe.json \
--captions $SHORTS_TMP/transcript_cleaned.json \
--style STYLE \
--clips-dir $SHORTS_TMP/clips/ \
--output-dir $SHORTS_TMP/render/
The render script:
- Bundles the Remotion project once (~5-10s)
- Opens a sh