Beat-Sync Reel Generator

Takes product images and a trending audio track, detects beats, and produces an Instagram Reel where every image cut lands on a detected beat. Everything runs locally with librosa and FFmpeg, so it is fast and needs no API credits.

Requirements

Python 3 with librosa and Pillow packages
FFmpeg installed
yt-dlp installed (for URL/search audio input)

Input

The user provides:

Audio (required) — one of three formats:
- Local file path — e.g. /path/to/trending-audio.mp3
- URL — Instagram Reel, TikTok, or YouTube link. Download with: yt-dlp -x --audio-format mp3 -o "audio.%(ext)s" "<URL>"
- Audio name — e.g. "Nashe Si Chadh Gayi". Web search for it, find a YouTube/SoundCloud source, download with yt-dlp.
Product images (required) — one of:
- List of image file paths — local JPG/PNG files
- Product page URL — scrape images using these methods in order until one works:
  1. Shopify JSON — append .json to the product URL and extract image URLs from the response
  2. HTML scraping with referrer — curl with -H "Referer: <site-domain>" and a browser user-agent, then parse <img> tags
  3. Chrome DevTools — navigate to the page, extract image URLs via JavaScript, download each
Audio segment (optional) — start and end timestamps in seconds to use a specific portion of the audio. Defaults to 0-15s.
Beat frequency (optional) — cut on every Nth beat. Defaults to 2 (every 2nd beat, ~1.3s per image at typical tempos). Use 1 for fast cuts, 4 for slower.
Product info (optional) — brand name, product name, price, CTA URL. Used for end card. If not provided, skip end card.
Style preset (optional) — for end card text. One of: minimal, luxury, bold, editorial, clean. Defaults to clean. See Style Presets table below for font details.
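
The Shopify JSON method listed above can be sketched as follows. This is a minimal sketch, assuming the response has the shape {"product": {"images": [{"src": ...}]}}. The helper names shopify_json_url and extract_image_urls are hypothetical, and fetching the URL is left to whatever HTTP client is in use. An empty result is the signal to move on to the next scraping method.

```python
import json
from urllib.parse import urlsplit, urlunsplit


def shopify_json_url(product_url: str) -> str:
    """Append .json to the product path, dropping any query string or fragment."""
    parts = urlsplit(product_url)
    path = parts.path.rstrip("/")
    if not path.endswith(".json"):
        path += ".json"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def extract_image_urls(payload: str) -> list:
    """Return image URLs from a product JSON body, or [] if the shape differs."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return []  # not JSON at all, e.g. an HTML error page
    if not isinstance(data, dict):
        return []
    product = data.get("product")
    if not isinstance(product, dict):
        return []
    images = product.get("images") or []
    return [img["src"] for img in images if isinstance(img, dict) and "src" in img]
```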

Pipeline

Step 1: Resolve Audio

Based on input type:

Local file:

# Just verify it exists and get duration
ffprobe -v quiet -print_format json -show_format "audio.mp3"

URL (Instagram/TikTok/YouTube):

yt-dlp -x --audio-format mp3 -o "<workdir>/audio.%(ext)s" "<URL>"

Audio name (search):

Web search for "<audio name>" site:youtube.com or "<audio name>" instagram audio
Take the first YouTube/SoundCloud result
Download: yt-dlp -x --audio-format mp3 -o "<workdir>/audio.%(ext)s" "<URL>"
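
Choosing between the three branches above can be automated. The sketch below is one possible approach, not part of the original skill: classify_audio_input is a hypothetical helper that treats an existing path as a local file, an http(s) address as a URL, and anything else as a track name to search for.

```python
import os
from urllib.parse import urlparse


def classify_audio_input(value: str) -> str:
    """Return 'file', 'url', or 'search' for a user-supplied audio argument."""
    if os.path.isfile(value):
        return "file"
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    # Anything else is assumed to be a track name to look up
    return "search"
```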

Step 2: Detect Beats

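import librosa

# sr=None keeps the file's native sample rate instead of resampling
y, sr = librosa.load("audio.mp3", sr=None)
# beat_track returns the estimated tempo (BPM) and the beat positions as frame indices
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
# Convert frame indices to timestamps in seconds, as plain Python floats
beat_times = librosa.frames_to_time(beat_frames, sr=sr)
beat_times = [float(t) for t in beat_times]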

Select cut points based on beat frequency:

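beat_freq = 2  # cut on every 2nd beat (1 = every beat, 4 = every 4th beat)
# Start with a cut at 0.0 so the first image is on screen from the first frame
cut_times = [0.0] + [beat_times[i] for i in range(beat_freq - 1, len(beat_times), beat_freq)]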

Trim to audio segment:

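start, end = 0.0, 15.0  # or user-provided
# Keep cuts inside the segment and shift them so the segment starts at t = 0
cut_times = [t - start for t in cut_times if start <= t < end]
# Guard against an empty list (no beats in the segment) before indexing
if not cut_times or cut_times[0] != 0.0:
    cut_times.insert(0, 0.0)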
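
Each pair of adjacent cut times defines one scene. The sketch below turns the cut times into per-scene frame counts, assuming the last scene runs to the end of the segment and the video is 25 fps. Note that it rounds each cut to the nearest frame boundary instead of truncating each duration separately, a swapped-in choice so that rounding error does not accumulate and later cuts stay on their beats. scene_durations is a hypothetical helper name.

```python
def scene_durations(cut_times, segment_length, fps=25):
    """Return (duration_seconds, frame_count) for each scene between cuts."""
    # Snap every boundary (cuts plus the segment end) to a frame index
    edges = [round(t * fps) for t in list(cut_times) + [segment_length]]
    scenes = []
    for begin, finish in zip(edges, edges[1:]):
        frames = finish - begin
        if frames > 0:  # skip scenes shorter than one frame
            scenes.append((frames / fps, frames))
    return scenes
```

Because the frame counts are differences of snapped boundaries, they always sum to the frame count of the whole segment.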

Typical results by tempo:

Tempo (BPM)	Beat interval	Every 2nd beat	Cuts in 15s
80	0.75s	1.5s	~10
100	0.60s	1.2s	~12
120	0.50s	1.0s	~15
140	0.43s	0.86s	~17
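
The table values follow from simple arithmetic: the beat interval is 60 / BPM, the cut interval is that times the beat frequency, and the cut count is the segment length divided by the cut interval, rounded down. A short sketch, with cuts_for_tempo as a hypothetical helper name:

```python
def cuts_for_tempo(bpm, beat_freq=2, segment_length=15.0):
    """Return (beat_interval, cut_interval, approximate_cut_count)."""
    beat_interval = 60.0 / bpm                 # seconds between beats
    cut_interval = beat_interval * beat_freq   # seconds each image stays on screen
    cuts = int(segment_length / cut_interval)  # whole cuts that fit in the segment
    return beat_interval, cut_interval, cuts
```

Real tracks drift from a constant tempo, so the detected beat times remain the source of truth; this only estimates how many images a segment will need.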

If there are more cuts than available images, cycle through the images again and give each repeat a different Ken Burns effect.
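
One way to cycle images while varying the effect is sketched below. The effect names are hypothetical labels for the six Ken Burns variants in Step 4, and assign_scenes is a hypothetical helper. The effect index is offset by the pass number, so an image never gets the same effect on two consecutive passes.

```python
EFFECTS = [
    "zoom_in", "zoom_out", "pan_left_to_right",
    "pan_right_to_left", "zoom_in_top", "pan_up",
]


def assign_scenes(images, n_cuts):
    """Pair each cut with an image and an effect, cycling through both lists."""
    if not images:
        raise ValueError("at least one image is required")
    scenes = []
    for i in range(n_cuts):
        pass_no, img_idx = divmod(i, len(images))
        # Shift the effect by one on every pass through the image list
        effect = EFFECTS[(pass_no + img_idx) % len(EFFECTS)]
        scenes.append((images[img_idx], effect))
    return scenes
```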

Step 3: Classify & Filter Images

If images were scraped from a product URL, filter out infographics and size charts:

Skip images with text overlays, size charts, comparison graphics (typically wider aspect ratios, or contain large text blocks)
Keep model photos, product-only photos, detail shots

Classification heuristic (by position on product page):

Position	Likely Type
Image 1 (first on page)	Hero / front-facing model
Image 2	Alternate angle (side/back)
Image 3-4	Close-up or detail
Last image	Size guide or back view

Model vs product-only detection: If image height > 1.5× width AND file size > 100KB → likely a model photo. Otherwise → product-only photo.
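
The detection rule above translates directly into code. This sketch takes the dimensions and file size as plain numbers; it assumes 1 KB = 1024 bytes, and classify_photo is a hypothetical helper name.

```python
def classify_photo(width, height, file_size_bytes):
    """Apply the aspect-ratio and file-size heuristic: 'model' or 'product'."""
    is_tall = height > 1.5 * width
    is_large = file_size_bytes > 100 * 1024  # assumption: 1 KB = 1024 bytes
    return "model" if is_tall and is_large else "product"
```

In practice the width and height can be read with Pillow's Image.open(path).size and the file size with os.path.getsize(path).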

Order images for visual variety: hero → detail → alternate angle → repeat.
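
The hero, detail, alternate-angle rotation can be sketched as a round-robin over three buckets. This assumes the images have already been sorted into those buckets; order_for_variety is a hypothetical helper name.

```python
from itertools import zip_longest


def order_for_variety(heroes, details, alternates):
    """Interleave the buckets: hero, detail, alternate angle, then repeat."""
    ordered = []
    for group in zip_longest(heroes, details, alternates):
        # Buckets may have different lengths; skip the gaps zip_longest fills in
        ordered.extend(img for img in group if img is not None)
    return ordered
```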

Step 4: Create Ken Burns Scenes

For each cut interval, create a Ken Burns clip from the assigned image. Alternate through these effects:

# Zoom in center
ffmpeg -y -loop 1 -i "image.jpg" \
  -vf "scale=2160:3840,zoompan=z='1+0.08*on/{frames}':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d={frames}:s=1080x1920:fps=25" \
  -t {duration} -c:v libx264 -pix_fmt yuv420p -r 25 scene.mp4

# Zoom out center
zoompan=z='1.15-0.08*on/{frames}':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d={frames}:s=1080x1920:fps=25

# Pan left to right
zoompan=z='1.08':x='(iw-iw/zoom)*on/{frames}':y='ih/2-(ih/zoom/2)':d={frames}:s=1080x1920:fps=25

# Pan right to left
zoompan=z='1.08':x='(iw-iw/zoom)*(1-on/{frames})':y='ih/2-(ih/zoom/2)':d={frames}:s=1080x1920:fps=25

# Zoom in top-center (for torso/face crops)
zoompan=z='1+0.08*on/{frames}':x='iw/2-(iw/zoom/2)':y='ih/4-(ih/zoom/4)':d={frames}:s=1080x1920:fps=25

# Pan up
zoompan=z='1.06':x='iw/2-(iw/zoom/2)':y='(ih-ih/zoom)*(1-on/{frames})':d={frames}:s=1080x1920:fps=25

Where {frames} = int(duration * 25) (25 fps). The expressions use on, zoompan's output frame counter: a looped still image produces d output frames per input frame, so the input counter in does not advance within a scene and would leave the image static.

Important: Always scale source image to at least 2160x3840 before zoompan so there's enough resolution for the zoom.
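
When the pipeline is driven from Python, the scene command can be assembled as an argument list and passed to subprocess.run. The sketch below covers only the center zoom-in effect; zoom_in_filter and scene_command are hypothetical helper names. It uses zoompan's on variable (output frame index), which advances on every output frame of a looped still.

```python
def zoom_in_filter(duration, fps=25, zoom_amount=0.08):
    """Build the scale + zoompan filter string for a center zoom-in."""
    frames = int(duration * fps)
    if frames < 1:
        raise ValueError("scene is shorter than one frame")
    return (
        "scale=2160:3840,"
        f"zoompan=z='1+{zoom_amount}*on/{frames}'"
        ":x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)'"
        f":d={frames}:s=1080x1920:fps={fps}"
    )


def scene_command(image_path, duration, output_path, fps=25):
    """Return the ffmpeg argument list for one Ken Burns scene."""
    return [
        "ffmpeg", "-y", "-loop", "1", "-i", image_path,
        "-vf", zoom_in_filter(duration, fps),
        "-t", f"{duration:.3f}",
        "-c:v", "libx264", "-pix_fmt", "yuv420p", "-r", str(fps),
        output_path,
    ]
```

Passing the list to subprocess.run(cmd, check=True) avoids shell quoting problems with the nested quotes in the filter string.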

Step 5: Create End Card (Optional)

If product info is provided, create a 2-second end card using Pillow:

from PIL import Image, ImageDraw, ImageFont

card = Image.new("RGBA", (1080, 1920), (20, 20, 20, 255))
draw = ImageDraw.Draw(card)
# Brand name (centered, y=750)
# Product name (centered, y=830)
# Price (centered, y=920, accent color)
# CTA (centered, y=1020, muted)
card.save("endcard.png")

Convert to video:

ffmpeg -y -loop 1 -i endcard.png -vf "scale=1080:1920" \
  -t 2 -c:v libx264 -pix_fmt yuv420p -r 25 endcard.mp4

Style Presets

Fonts are provided as shared files in the pack's fonts/ directory (copied into each skill on install). Fall back to system fonts if custom fonts are not found.

Preset	Title Font	Body Font	Text Color	Treatment
minimal	Montserrat-Light.ttf	Montserrat-Light.ttf	White (255,255,255)	No background, subtle shadow
luxury	System Didot (/System/Library/Fonts/Supplemental/Didot.ttc)	Cormorant-Regular.ttf	Cream (245,235,210)	Thin gold stroke
bold	System Futura (/System/Library/Fonts/Supplemental/Futura.ttc)	Montserrat-Bold.ttf	White	Dark backdrop bar, uppercase
editorial	Cormorant-Italic.ttf	Cormorant-Regular.ttf	White	Minimal, italic titles
clean	System Helvetica (/System/Library/Fonts/Helvetica.ttc)	System Helvetica	White	Simple shadow, professional
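
The preset table and the fallback rule can be expressed as a lookup. This is a minimal sketch: the font file names and system paths come from the table above, while the fonts/ location relative to the working directory and the helper names resolve_font and fonts_for_preset are assumptions.

```python
import os

# (title font, body font) per preset, taken from the table above
PRESET_FONTS = {
    "minimal": ("Montserrat-Light.ttf", "Montserrat-Light.ttf"),
    "luxury": ("/System/Library/Fonts/Supplemental/Didot.ttc", "Cormorant-Regular.ttf"),
    "bold": ("/System/Library/Fonts/Supplemental/Futura.ttc", "Montserrat-Bold.ttf"),
    "editorial": ("Cormorant-Italic.ttf", "Cormorant-Regular.ttf"),
    "clean": ("/System/Library/Fonts/Helvetica.ttc", "/System/Library/Fonts/Helvetica.ttc"),
}


def resolve_font(name, fonts_dir="fonts", fallback=None):
    """Return a usable font path, or the fallback if the file is missing."""
    path = name if os.path.isabs(name) else os.path.join(fonts_dir, name)
    return path if os.path.isfile(path) else fallback


def fonts_for_preset(preset="clean", fonts_dir="fonts", fallback=None):
    """Return (title_font, body_font); unknown presets fall back to clean."""
    title, body = PRESET_FONTS.get(preset, PRESET_FONTS["clean"])
    return resolve_font(title, fonts_dir, fallback), resolve_font(body, fonts_dir, fallback)
```

A None result signals that the font file was not found, in which case Pillow's ImageFont.load_default() or another system font can be used instead.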

Step 6: Concatenate Scenes

cat > concat.txt << EOF
file 'scene-00.mp4'
file 'scene-01.mp4'
...
file 'endcard.mp4'
EOF

ffmpeg -y -f concat -safe 0 -i concat.txt \
  -c:v libx264 -pix_fmt yuv420p -r 25 reel-silent.mp4
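
The concat list can also be generated from Python rather than a heredoc. A small sketch, with concat_list as a hypothetical helper name; single quotes in file names are escaped the way the concat demuxer expects inside a quoted string.

```python
def concat_list(scene_paths, endcard_path=None):
    """Return the text of a concat demuxer list file."""
    paths = list(scene_paths)
    if endcard_path:
        paths.append(endcard_path)
    lines = []
    for path in paths:
        # Close the quote, emit an escaped quote, reopen the quote
        escaped = path.replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"
```

Writing the result to concat.txt gives the same file as the heredoc above.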

Step 7: Add Audio

ffmpeg -y -i reel-silent.mp4 -i audio.mp3 \
  -filter_complex "[1:a]atrim={start}:{end},asetpts=PTS-STARTPTS,afade=t=in:st=0:d=0.5,afade=t=out:st={fade_start}:d=2,volume=0.8[aud]" \
  -map 0:v -map "[aud]" \
  -c:v copy -c:a aac -shortest output.mp4

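Where {start} and {end} are the audio segment timestamps, and {fade_start} = total_duration - 2.0, with total_duration being the length of the finished reel in seconds. Because of -shortest, the output ends with whichever stream is shorter, so if an end card is appended, extend {end} by its 2 seconds; otherwise the trimmed audio runs out first and the end card is cut off.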
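
Filling in the placeholders can be sketched as below. This assumes total_duration equals the trimmed audio length (end - start), which holds whenever the video is at least as long as the audio. audio_filter is a hypothetical helper name.

```python
def audio_filter(start, end, fade_in=0.5, fade_out=2.0, volume=0.8):
    """Build the filter_complex string that trims, fades, and attenuates the audio."""
    total_duration = end - start  # assumption: audio is the shorter stream
    # Clamp so very short segments do not produce a negative fade start
    fade_start = max(total_duration - fade_out, 0.0)
    return (
        f"[1:a]atrim={start}:{end},asetpts=PTS-STARTPTS,"
        f"afade=t=in:st=0:d={fade_in},"
        f"afade=t=out:st={fade_start}:d={fade_out},"
        f"volume={volume}[aud]"
    )
```

The fade times are relative to the trimmed audio because asetpts=PTS-STARTPTS resets its timestamps to start at zero.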

Output

Save the final reel to a user-specified directory (or the current working directory).

Output specs:

Format: MP4 (H.264)
Resolution: 1080x1920 (9:16 portrait)
Frame rate: 25fps
Duration: typically 10-20 seconds (depends on audio segment)
Audio: AAC

Known Limitations

No AI video generation — this skill only uses Ken Burns (zoom/pan on stil
