Beat-Sync Reel Generator
Takes product images and a trending audio track, detects beats, and produces an Instagram Reel where every image cut lands exactly on a beat. Fast, free (no API credits), and scalable.
Requirements
- Python 3 with the librosa and Pillow packages
- FFmpeg installed
- yt-dlp installed (for URL/search audio input)
Input
The user provides:
- Audio (required): one of three formats:
  - Local file path, e.g. /path/to/trending-audio.mp3
  - URL: Instagram Reel, TikTok, or YouTube link. Download with: yt-dlp -x --audio-format mp3 -o "audio.%(ext)s" "<URL>"
  - Audio name, e.g. "Nashe Si Chadh Gayi". Web search for it, find a YouTube/SoundCloud source, then download with yt-dlp.
- Product images (required): one of:
  - List of image file paths: local JPG/PNG files
  - Product page URL: scrape images using these methods in order until one works:
    - Shopify JSON: append .json to the product URL and extract image URLs from the response
    - HTML scraping with referrer: curl with -H "Referer: <site-domain>" and a browser user-agent, then parse <img> tags
    - Chrome DevTools: navigate to the page, extract image URLs via JavaScript, download each
- Audio segment (optional): start and end timestamps in seconds to use a specific portion of the audio. Defaults to 0-15s.
- Beat frequency (optional): cut on every Nth beat. Defaults to 2 (every 2nd beat, ~1.3s per image at typical tempos). Use 1 for fast cuts, 4 for slower ones.
- Product info (optional): brand name, product name, price, CTA URL. Used for the end card. If not provided, skip the end card.
- Style preset (optional): for end card text. One of: minimal, luxury, bold, editorial, clean. Defaults to clean. See the Style Presets table below for font details.
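The Shopify JSON method listed under product images can be sketched as below. This is a minimal sketch, assuming the storefront returns the commonly seen `{"product": {"images": [{"src": ...}]}}` shape; that shape is an assumption and is not confirmed by this document. `shopify_json_url` and `extract_image_urls` are hypothetical helper names, and the HTTP fetch itself is left out.

```python
import json
from urllib.parse import urlsplit, urlunsplit


def shopify_json_url(product_url: str) -> str:
    """Append .json to the path of a product URL, dropping query and fragment."""
    parts = urlsplit(product_url)
    path = parts.path.rstrip("/") + ".json"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def extract_image_urls(response_text: str) -> list[str]:
    """Return image URLs from a product JSON payload (assumed shape, see lead-in)."""
    data = json.loads(response_text)
    images = data.get("product", {}).get("images", [])
    # Keep only entries that carry a non-empty "src" field.
    return [img["src"] for img in images if isinstance(img, dict) and img.get("src")]
```

If the payload has a different shape, the function returns an empty list, which is the signal to fall through to the next scraping method.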
Pipeline
Step 1: Resolve Audio
Based on input type:
Local file:
# Just verify it exists and get duration
ffprobe -v quiet -print_format json -show_format "audio.mp3"
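To use the duration in later steps, the JSON printed by the ffprobe command above can be parsed from Python. This is a rough sketch: `parse_duration` and `probe_duration` are hypothetical helper names, and it assumes the `format.duration` field is present, which is the usual case for audio files.

```python
import json
import subprocess


def parse_duration(ffprobe_json: str) -> float:
    """Read format.duration (in seconds) from ffprobe's JSON output."""
    return float(json.loads(ffprobe_json)["format"]["duration"])


def probe_duration(path: str) -> float:
    """Run the ffprobe command shown above; raises CalledProcessError on failure."""
    result = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_format", path],
        capture_output=True, text=True, check=True,
    )
    return parse_duration(result.stdout)
```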
URL (Instagram/TikTok/YouTube):
yt-dlp -x --audio-format mp3 -o "<workdir>/audio.%(ext)s" "<URL>"
Audio name (search):
- Web search for "<audio name>" site:youtube.com or "<audio name>" instagram audio
- Take the first YouTube/SoundCloud result
- Download:
yt-dlp -x --audio-format mp3 -o "<workdir>/audio.%(ext)s" "<URL>"
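One possible way to choose among the three branches above is sketched here. The function names and the rule "anything that is neither an http(s) URL nor an existing file is a search query" are assumptions made for illustration.

```python
import os
from urllib.parse import urlsplit


def classify_audio_input(value: str) -> str:
    """Return 'url', 'file', or 'name' for a user-supplied audio argument."""
    parts = urlsplit(value)
    if parts.scheme in ("http", "https") and parts.netloc:
        return "url"
    if os.path.isfile(value):
        return "file"
    return "name"  # anything else is treated as a search query


def ytdlp_command(url: str, workdir: str) -> list[str]:
    """Build the yt-dlp invocation used in this step as an argument list."""
    out_template = os.path.join(workdir, "audio.%(ext)s")
    return ["yt-dlp", "-x", "--audio-format", "mp3", "-o", out_template, url]
```

Passing the list to `subprocess.run(..., check=True)` avoids shell quoting problems with URLs that contain `&` or `?`.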
Step 2: Detect Beats
import librosa
import numpy as np
y, sr = librosa.load("audio.mp3", sr=None)
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)
beat_times = [float(t) for t in beat_times]
Select cut points based on beat frequency:
# beat_freq = 2 means every 2nd beat
cut_times = [0.0] + [beat_times[i] for i in range(beat_freq - 1, len(beat_times), beat_freq)]
Trim to audio segment:
start, end = 0.0, 15.0 # or user-provided
cut_times = [t - start for t in cut_times if start <= t < end]
if not cut_times or cut_times[0] != 0.0:  # also covers a segment with no detected beats
    cut_times.insert(0, 0.0)
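The selection and trimming snippets above can be combined into one function, with a second helper that turns cut points into per-scene durations for Step 4. This is a sketch; `select_cut_times` and `scene_durations` are hypothetical names, and `beat_times` is assumed to be the sorted list of floats produced by the beat detection code.

```python
def select_cut_times(beat_times, beat_freq=2, start=0.0, end=15.0):
    """Pick every Nth beat inside [start, end) and shift so the segment starts at 0."""
    if beat_freq < 1:
        raise ValueError("beat_freq must be >= 1")
    picked = beat_times[beat_freq - 1::beat_freq]
    cuts = [t - start for t in picked if start <= t < end]
    if not cuts or cuts[0] != 0.0:
        cuts.insert(0, 0.0)  # the first scene always begins at the segment start
    return cuts


def scene_durations(cut_times, segment_length):
    """Length of each scene: gap to the next cut, or to the segment end for the last one."""
    edges = list(cut_times) + [segment_length]
    return [b - a for a, b in zip(edges, edges[1:])]
```

For example, beats at 0.5s intervals with `beat_freq=2` over a 3-second segment give cuts at 0, 1, and 2 seconds, so three scenes of one second each.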
Typical results by tempo:
| Tempo (BPM) | Beat interval | Every 2nd beat | Cuts in 15s |
|---|---|---|---|
| 80 | 0.75s | 1.5s | ~10 |
| 100 | 0.60s | 1.2s | ~12 |
| 120 | 0.50s | 1.0s | ~15 |
| 140 | 0.43s | 0.86s | ~17 |
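The table values follow from simple arithmetic: the beat interval is 60 / BPM, and the cut interval is that value times the beat frequency. A small sketch, assuming a steady tempo (real tracks drift, so detected counts can differ); `cuts_for_tempo` is a hypothetical helper name.

```python
def cuts_for_tempo(bpm: float, beat_freq: int = 2, segment_s: float = 15.0):
    """Return (beat interval, cut interval, approximate cut count) for a steady tempo."""
    beat_interval = 60.0 / bpm
    cut_interval = beat_interval * beat_freq
    return beat_interval, cut_interval, int(segment_s // cut_interval)
```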
If there are more cuts than available images, cycle through the images again, giving each repeat a different Ken Burns effect.
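One way to do the cycling is sketched below. The effect labels are hypothetical names for the six effects in Step 4, and the rule used here (skip ahead one effect whenever an image would repeat the effect from its previous appearance) is an assumption, not something this document specifies.

```python
EFFECTS = ["zoom_in", "zoom_out", "pan_lr", "pan_rl", "zoom_in_top", "pan_up"]


def assign_scenes(images, n_cuts):
    """Pair each cut with an image and an effect, cycling through both lists."""
    if not images:
        raise ValueError("at least one image is required")
    last_effect = {}  # effect each image used on its previous appearance
    scenes = []
    k = 0  # running index into EFFECTS
    for i in range(n_cuts):
        image = images[i % len(images)]
        if last_effect.get(image) == EFFECTS[k % len(EFFECTS)]:
            k += 1  # avoid repeating the same effect on the same image
        effect = EFFECTS[k % len(EFFECTS)]
        scenes.append((image, effect))
        last_effect[image] = effect
        k += 1
    return scenes
```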
Step 3: Classify & Filter Images
If images were scraped from a product URL, filter out infographics and size charts:
- Skip images with text overlays, size charts, comparison graphics (typically wider aspect ratios, or contain large text blocks)
- Keep model photos, product-only photos, detail shots
Classification heuristic (by position on product page):
| Position | Likely Type |
|---|---|
| Image 1 (first on page) | Hero / front-facing model |
| Image 2 | Alternate angle (side/back) |
| Image 3-4 | Close-up or detail |
| Last image | Size guide or back view |
Model vs product-only detection: If image height > 1.5× width AND file size > 100KB → likely a model photo. Otherwise → product-only photo.
Order images for visual variety: hero → detail → alternate angle → repeat.
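The model-photo rule and the ordering rule above could look like this in code. It is a sketch under two assumptions: "100KB" is read as 102,400 bytes, and the images have already been sorted into hero, detail, and alternate-angle groups. Both function names are hypothetical.

```python
def is_model_photo(width: int, height: int, size_bytes: int) -> bool:
    """Heuristic from the text: tall portrait image with a file size above 100 KB."""
    return height > 1.5 * width and size_bytes > 100 * 1024


def order_for_variety(hero, details, alternates):
    """Interleave hero -> detail -> alternate angle, repeating until all are used."""
    groups = [list(hero), list(details), list(alternates)]
    ordered = []
    while any(groups):
        for group in groups:
            if group:
                ordered.append(group.pop(0))
    return ordered
```

Width and height can be read with Pillow's `Image.open(path).size`, and the file size with `os.path.getsize(path)`.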
Step 4: Create Ken Burns Scenes
For each cut interval, create a Ken Burns clip from the assigned image. Alternate through these effects:
# Progress uses zoompan's output frame counter `on`. With -loop 1 and d={frames},
# the input frame counter `in` does not advance within a clip, so the motion would stall.
# Zoom in center
ffmpeg -y -loop 1 -i "image.jpg" \
-vf "scale=2160:3840,zoompan=z='1+0.08*on/{frames}':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d={frames}:s=1080x1920:fps=25" \
-t {duration} -c:v libx264 -pix_fmt yuv420p -r 25 scene.mp4
# Zoom out center
zoompan=z='1.15-0.08*on/{frames}':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d={frames}:s=1080x1920:fps=25
# Pan left to right
zoompan=z='1.08':x='(iw-iw/zoom)*on/{frames}':y='ih/2-(ih/zoom/2)':d={frames}:s=1080x1920:fps=25
# Pan right to left
zoompan=z='1.08':x='(iw-iw/zoom)*(1-on/{frames})':y='ih/2-(ih/zoom/2)':d={frames}:s=1080x1920:fps=25
# Zoom in top-center (for torso/face crops)
zoompan=z='1+0.08*on/{frames}':x='iw/2-(iw/zoom/2)':y='ih/4-(ih/zoom/4)':d={frames}:s=1080x1920:fps=25
# Pan up
zoompan=z='1.06':x='iw/2-(iw/zoom/2)':y='(ih-ih/zoom)*(1-on/{frames})':d={frames}:s=1080x1920:fps=25
Where {frames} = int(duration * 25) (25 fps).
Important: Always scale source image to at least 2160x3840 before zoompan so there's enough resolution for the zoom.
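Filling in `{frames}` and `{duration}` by hand is error-prone, so the filter strings can be generated. The sketch below is illustrative only: the effect keys are hypothetical labels, and it drives the motion with zoompan's output frame counter `on`, a deliberate choice because with `-loop 1` and `d={frames}` the input counter `in` does not advance within a clip.

```python
def zoompan_filter(effect: str, duration: float, fps: int = 25) -> str:
    """Build the -vf string for one scene; effect keys are hypothetical labels."""
    frames = max(1, int(duration * fps))
    cx, cy = "iw/2-(iw/zoom/2)", "ih/2-(ih/zoom/2)"
    p = f"on/{frames}"  # progress through the clip, from the output frame count
    presets = {
        "zoom_in": (f"1+0.08*{p}", cx, cy),
        "zoom_out": (f"1.15-0.08*{p}", cx, cy),
        "pan_lr": ("1.08", f"(iw-iw/zoom)*{p}", cy),
        "pan_rl": ("1.08", f"(iw-iw/zoom)*(1-{p})", cy),
        "zoom_in_top": (f"1+0.08*{p}", cx, "ih/4-(ih/zoom/4)"),
        "pan_up": ("1.06", cx, f"(ih-ih/zoom)*(1-{p})"),
    }
    z, x, y = presets[effect]
    return (
        "scale=2160:3840,"
        f"zoompan=z='{z}':x='{x}':y='{y}':d={frames}:s=1080x1920:fps={fps}"
    )


def scene_command(image: str, effect: str, duration: float, out: str) -> list[str]:
    """Full ffmpeg argument list for one Ken Burns clip at 25 fps."""
    return ["ffmpeg", "-y", "-loop", "1", "-i", image,
            "-vf", zoompan_filter(effect, duration),
            "-t", f"{duration:.3f}", "-c:v", "libx264",
            "-pix_fmt", "yuv420p", "-r", "25", out]
```

Each scene would then be rendered with `subprocess.run(scene_command(...), check=True)`, using the durations between consecutive cut times.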
Step 5: Create End Card (Optional)
If product info is provided, create a 2-second end card using Pillow:
from PIL import Image, ImageDraw, ImageFont
card = Image.new("RGBA", (1080, 1920), (20, 20, 20, 255))
draw = ImageDraw.Draw(card)
# Brand name (centered, y=750)
# Product name (centered, y=830)
# Price (centered, y=920, accent color)
# CTA (centered, y=1020, muted)
card.save("endcard.png")
Convert to video:
ffmpeg -y -loop 1 -i endcard.png -vf "scale=1080:1920" \
-t 2 -c:v libx264 -pix_fmt yuv420p -r 25 endcard.mp4
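The Pillow snippet above leaves the text drawing as comments. One way to fill it in is sketched here; the colors and font sizes are placeholder assumptions (the presets table does not give sizes), and `end_card_layout` and `render_end_card` are hypothetical names. Centering uses `ImageDraw.textlength`, which needs Pillow 8.0 or newer.

```python
def end_card_layout(brand, product, price, cta):
    """Rows for the card as (text, y, role), using the y positions listed above."""
    rows = [(brand, 750, "title"), (product, 830, "body"),
            (price, 920, "accent"), (cta, 1020, "muted")]
    return [row for row in rows if row[0]]  # skip fields the user did not supply


def render_end_card(rows, font_path=None, out_path="endcard.png"):
    """Draw each row horizontally centered on a 1080x1920 dark card."""
    from PIL import Image, ImageDraw, ImageFont  # imported lazily; requires Pillow

    # Placeholder colors and sizes (assumptions, not taken from the presets table).
    colors = {"title": (255, 255, 255), "body": (255, 255, 255),
              "accent": (230, 190, 90), "muted": (160, 160, 160)}
    sizes = {"title": 64, "body": 44, "accent": 56, "muted": 36}
    card = Image.new("RGBA", (1080, 1920), (20, 20, 20, 255))
    draw = ImageDraw.Draw(card)
    for text, y, role in rows:
        if font_path:
            font = ImageFont.truetype(font_path, sizes[role])
        else:
            font = ImageFont.load_default()  # fallback when no font file is given
        x = int((1080 - draw.textlength(text, font=font)) / 2)
        draw.text((x, y), text, font=font, fill=colors[role])
    card.save(out_path)
    return out_path
```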
Style Presets
Fonts are provided as shared files in the pack's fonts/ directory (copied into each skill on install). Fall back to system fonts if custom fonts are not found.
| Preset | Title Font | Body Font | Text Color | Treatment |
|---|---|---|---|---|
| minimal | Montserrat-Light.ttf | Montserrat-Light.ttf | White (255,255,255) | No background, subtle shadow |
| luxury | System Didot (/System/Library/Fonts/Supplemental/Didot.ttc) | Cormorant-Regular.ttf | Cream (245,235,210) | Thin gold stroke |
| bold | System Futura (/System/Library/Fonts/Supplemental/Futura.ttc) | Montserrat-Bold.ttf | White | Dark backdrop bar, uppercase |
| editorial | Cormorant-Italic.ttf | Cormorant-Regular.ttf | White | Minimal, italic titles |
| clean | System Helvetica (/System/Library/Fonts/Helvetica.ttc) | System Helvetica | White | Simple shadow, professional |
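The fallback rule described above (custom font first, system font otherwise) might be implemented along these lines. The paths come from the table; the `fonts` directory default and both function names are assumptions for illustration.

```python
import os


def title_font_candidates(preset: str, fonts_dir: str = "fonts") -> list[str]:
    """Title font paths for a preset, taken from the table; unknown presets use clean."""
    table = {
        "minimal": [os.path.join(fonts_dir, "Montserrat-Light.ttf")],
        "luxury": ["/System/Library/Fonts/Supplemental/Didot.ttc"],
        "bold": ["/System/Library/Fonts/Supplemental/Futura.ttc"],
        "editorial": [os.path.join(fonts_dir, "Cormorant-Italic.ttf")],
        "clean": ["/System/Library/Fonts/Helvetica.ttc"],
    }
    return table.get(preset, table["clean"])


def first_existing(paths, exists=os.path.exists):
    """Return the first path that exists, or None so the caller can use a default font."""
    return next((p for p in paths if exists(p)), None)
```

When `first_existing` returns `None`, the caller can fall back to `ImageFont.load_default()`.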
Step 6: Concatenate Scenes
cat > concat.txt << EOF
file 'scene-00.mp4'
file 'scene-01.mp4'
...
file 'endcard.mp4'
EOF
ffmpeg -y -f concat -safe 0 -i concat.txt \
-c:v libx264 -pix_fmt yuv420p -r 25 reel-silent.mp4
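When the number of scenes varies per reel, the list file is easier to generate than to write by hand. A minimal sketch follows; `write_concat_list` is a hypothetical helper name.

```python
def write_concat_list(scene_paths, list_path="concat.txt"):
    """Write one `file '<path>'` line per clip, in playback order."""
    with open(list_path, "w", encoding="utf-8") as fh:
        for path in scene_paths:
            # Inside single quotes, a literal quote is written as '\'' for ffmpeg.
            escaped = path.replace("'", "'\\''")
            fh.write(f"file '{escaped}'\n")
    return list_path
```

Relative paths in the list are resolved against the directory of the list file, so it is simplest to write it next to the scene clips.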
Step 7: Add Audio
ffmpeg -y -i reel-silent.mp4 -i audio.mp3 \
-filter_complex "[1:a]atrim={start}:{end},asetpts=PTS-STARTPTS,afade=t=in:st=0:d=0.5,afade=t=out:st={fade_start}:d=2,volume=0.8[aud]" \
-map 0:v -map "[aud]" \
-c:v copy -c:a aac -shortest output.mp4
Where {start} and {end} are the audio segment timestamps, and {fade_start} = total_duration - 2.0.
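The placeholder substitution for the audio filter can be sketched as follows. `audio_filter` is a hypothetical helper; the fade lengths and volume default to the values used in the command above, and clamping `fade_start` at zero for very short reels is an added assumption.

```python
def audio_filter(start: float, end: float, total_duration: float,
                 fade_in: float = 0.5, fade_out: float = 2.0,
                 volume: float = 0.8) -> str:
    """Build the -filter_complex string with the timestamps filled in."""
    fade_start = max(0.0, total_duration - fade_out)
    return (
        f"[1:a]atrim={start}:{end},asetpts=PTS-STARTPTS,"
        f"afade=t=in:st=0:d={fade_in},"
        f"afade=t=out:st={fade_start}:d={fade_out},"
        f"volume={volume}[aud]"
    )
```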
Output
Save the final reel to a user-specified directory (or the current working directory).
Output specs:
- Format: MP4 (H.264)
- Resolution: 1080x1920 (9:16 portrait)
- Frame rate: 25fps
- Duration: typically 10-20 seconds (depends on audio segment)
- Audio: AAC
Known Limitations
- No AI video generation — this skill only uses Ken Burns (zoom/pan on stil