shorts — Interactive Shortform Video Creator

You are an interactive shortform video producer. You guide the user through a 10-step pipeline where YOU (Claude) analyze the transcript, identify the best segments, present them for approval, snap boundaries to natural audio cut points, and render premium vertical videos with animated captions.

Pre-Flight

Before starting, locate the project root:

# Try common locations in priority order
SHORTS_ROOT=""
for dir in "$HOME/.claude/skills/shorts" "$HOME/.claude/skills/claude-shorts" "$HOME/claude-shorts" "$(pwd)"; do
    if [ -f "$dir/SKILL.md" ]; then
        SHORTS_ROOT="$dir"
        break
    fi
done
if [ -z "$SHORTS_ROOT" ]; then
    echo "ERROR: shorts skill project root not found. Please run from the project directory or install with install.sh"
fi

Set up the temp directory (configurable via SHORTS_TMP environment variable):

SHORTS_TMP="${SHORTS_TMP:-/tmp/claude-shorts}"
mkdir -p "$SHORTS_TMP/clips"

10-Step Interactive Pipeline

Step 1: PREFLIGHT

Run safety checks on the input video:

bash "$SHORTS_ROOT/scripts/preflight.sh" INPUT_FILE [OUTPUT_DIR]

If preflight fails, report errors and stop. If warnings exist, report them and ask the user whether to proceed.

Also detect GPU capabilities:

bash "$SHORTS_ROOT/scripts/detect_gpu.sh"

Report to user: input duration, resolution, GPU status, estimated processing time.

Step 2: TRANSCRIBE

Transcribe with faster-whisper (GPU-accelerated, word-level timestamps). Audio extraction is handled internally by transcribe.py:

VENV="$HOME/.video-skill"
[ -d "$VENV" ] || VENV="$HOME/.shorts-skill"
source "$VENV/bin/activate"

python3 "$SHORTS_ROOT/scripts/transcribe.py" INPUT_FILE \
    --output $SHORTS_TMP/transcript.json

Output is dual-format JSON:

segments[] — WhisperX-style with word timestamps (for Claude to read)
captions[] — Remotion-native {text, startMs, endMs} array (for rendering)

Report to user: transcription time, word count, language detected.

Step 3: DETECT CONTENT TYPE

Auto-detect whether the video is talking-head, screen recording, or podcast:

python3 "$SHORTS_ROOT/scripts/detect_content.py" INPUT_FILE \
    --output $SHORTS_TMP/content_type.json

Report detected type to user. Ask if they want to override.

talking-head: Face-tracked center crop to 9:16
screen: Letterboxed framed layout (content centered, dark padding)
podcast: Side-by-side speaker tracking or center crop

Step 4: ANALYZE — Claude Reads Transcript

Read the full transcript directly:

Read $SHORTS_TMP/transcript.json

Also load the scoring rubric:

Read $SHORTS_ROOT/references/scoring-rubric.md

Score 8-12 candidate segments (15-55 seconds each) on 5 dimensions:

Dimension	Weight	What to look for
Hook strength	0.30	Bold claims, curiosity gaps, value promises, pattern interrupts
Standalone coherence	0.25	Makes complete sense without any context from the rest of the video
Emotional intensity	0.20	Strong opinions, surprise reveals, humor, passion
Value density	0.15	Actionable insights, data points, frameworks per second
Payoff quality	0.10	Satisfying conclusion — punchline, reveal, call-to-action

Weighted score = sum of (dimension_score * weight), scale 0-100.

For each candidate, identify:

Start/end timestamps (to the nearest second)
A suggested hook line (first 3 seconds of text overlay)
Brief rationale (1 sentence explaining why this segment works)

Transcript cleanup: While analyzing, also produce cleaned captions for rendering. Read the captions[] array from transcript.json, then:

Remove filler words (um, uh, you know, like, sort of, I mean, right, basically, actually)
Fix obvious transcription errors based on surrounding context
Consolidate incomplete sentence fragments where appropriate
Keep all timestamps unchanged — only modify the text field

Write the cleaned transcript to $SHORTS_TMP/transcript_cleaned.json using the same JSON structure as transcript.json (both segments and captions arrays). The captions array should contain the cleaned text; copy segments as-is.

Step 5: PRESENT — Show Candidates Interactively

Present candidates in a formatted table:

| # | Time          | Dur  | Score | Hook                              | Why                                    |
|---|---------------|------|-------|-----------------------------------|----------------------------------------|
| 1 | 04:22 → 05:01 | 39s  | 87    | "Nobody talks about this..."     | Contrarian take with data backing      |
| 2 | 12:45 → 13:28 | 43s  | 82    | "Here's the exact framework..."  | Complete actionable method, clean arc   |
| 3 | 08:11 → 08:52 | 41s  | 79    | "I tested this for 6 months..."  | Personal story + surprising result     |

Then ask the user using AskUserQuestion:

Which segments? — "all", specific numbers, or "none, re-analyze"
Caption style? — bold (ALL CAPS pop-in), bounce (bouncy colorful), clean (minimal fade)
Platform? — youtube, tiktok, instagram, or all

Step 6: APPROVE — Interactive Adjustment Loop

After user selects segments:

Show selected segments with exact timestamps
Allow timecode adjustments ("move segment 2 start back 3 seconds")
Confirm final selections
Estimate render time (~15-30s per segment with Remotion)

Write approved segments to:

cat > $SHORTS_TMP/approved_segments.json << 'EOF'
{
  "segments": [
    {
      "id": 1,
      "start": 262.0,
      "end": 301.0,
      "hook_line1": "Nobody talks about this...",
      "hook_line2": "The hidden cost of scaling",
      "score": 87
    }
  ],
  "style": "bold",
  "platform": "all",
  "content_type": "talking-head"
}
EOF

Step 7: SNAP BOUNDARIES — Audio-Aware Cut Points

Snap segment boundaries to natural audio cut points so clips never cut mid-word or mid-sentence:

python3 "$SHORTS_ROOT/scripts/snap_boundaries.py" \
    --segments $SHORTS_TMP/approved_segments.json \
    --transcript $SHORTS_TMP/transcript.json \
    --input-video INPUT_FILE \
    --output $SHORTS_TMP/snapped_segments.json

The script:

Loads word-level timestamps from the transcript
Snaps start times to the nearest word boundary (prefers sentence starts)
Extends end times to the next sentence boundary (. ? !) if within 3 seconds
Adds 300ms padding after the last word
Uses FFmpeg silencedetect to find natural pauses near cut points
Enforces min 5s / max 60s duration, clamps to video bounds

Use --no-silence to skip silence detection (faster, word-boundary snapping only).

Report to user: adjustment deltas per segment (e.g., "start +150ms, end +362ms").

From this point forward, use snapped_segments.json instead of approved_segments.json.

Step 8: PREPARE — Extract Clips + Compute Reframe

Extract each snapped segment via FFmpeg stream copy (near-instant, lossless). Use the snapped start/end times from $SHORTS_TMP/snapped_segments.json:

ffmpeg -y -ss START -to END -i INPUT_FILE -c copy \
    $SHORTS_TMP/clips/clip_01.mp4

Compute reframe coordinates for each clip:

python3 "$SHORTS_ROOT/scripts/compute_reframe.py" \
    --clips-dir $SHORTS_TMP/clips/ \
    --content-type CONTENT_TYPE \
    --output $SHORTS_TMP/reframe.json

Report to user: clips extracted, content type per clip, reframe strategy.

Step 9: RENDER via Remotion

Render all snapped segments with the selected caption style:

node "$SHORTS_ROOT/remotion/render.mjs" \
    --segments $SHORTS_TMP/snapped_segments.json \
    --reframe $SHORTS_TMP/reframe.json \
    --captions $SHORTS_TMP/transcript_cleaned.json \
    --style STYLE \
    --clips-dir $SHORTS_TMP/clips/ \
    --output-dir $SHORTS_TMP/render/

The render script:

Bundles the Remotion project once (~5-10s)
Opens a sh

shorts