Music Producer — ACE Step 1.5 XL, maximum fidelity
Generate standalone music tracks at the highest audio quality the system is capable of. For radio-drama music beds (fast, "good enough"), use the radio-drama-production skill instead; this one exists for tracks where the audio is the deliverable.
0. Target host + tool
- Host: ${SSH_USER}@127.0.0.1 (Workstation: RTX 5090, 64 GB RAM, Win 11 + OpenSSH)
- Tool: ${COMFYUI_ROOT}\music_tool\music_maker.py
- Templates: music_tool\templates\ace_step_music_apg_api.json (APG chain) + ace_step_music_simple_api.json (simple KSampler)
- ComfyUI endpoint on Workstation: http://127.0.0.1:8188
1. Why a dedicated tool
scene_production_tool/radio_drama.py uses a simple KSampler template tuned for turbo variants — fast, clean enough to sit under dialogue, but the ceiling is the xl_base_sft merged model at CFG 3. The APG-requiring base models (xl_base fp32, xl_sft bf16) distort audibly under that template because ACE Step's full base models need SamplerCustomAdvanced + APG + CFGGuider to avoid artifacts (per NerdyRodent's v35 reference workflow and Stability's training notes).
music_maker.py here uses the proper APG chain for xl_base / xl_sft, producing clean output at true base-model quality. It also defaults to lossless FLAC output (48 kHz stereo), unlike the radio-drama pipeline which writes MP3 V0.
2. Variants — pick by quality/speed tradeoff
| Variant | UNet | Chain | Steps | CFG | Time (per 90 s) | Best for |
|---|---|---|---|---|---|---|
| xl_base (default) | acestep_v1.5_xl_base.safetensors (19.95 GB fp32) | APG | 50 | 7.0 | ~21 s | Album masters, standalone songs, hero cues |
| xl_sft | acestep_v1.5_xl_sft_bf16.safetensors | APG | 45 | 6.0 | ~18 s | Near-base quality, faster, bf16 |
| xl_base_sft | acestep_v1.5_xl_merge_base_sft_ta_0.5.safetensors | simple KSampler | 35 | 3.0 | ~21 s | Balance (shared default with radio-drama) |
| xl_turbo | acestep_v1.5_xl_turbo_bf16.safetensors | simple KSampler | 10 | 1.0 | ~12 s | Preview iterations, fast A/B |
| base_turbo | acestep_v1.5_turbo.safetensors (4.8 GB) | simple KSampler | 8 | 1.0 | ~8 s | Smallest/fastest, lowest quality |
APG variants use SamplerCustomAdvanced with:
- APG(eta=0.7, norm_threshold=2.5, momentum=-0.75) (v35 params)
- CFGGuider(cfg=per-variant)
- KSamplerSelect("gradient_estimation")
- BasicScheduler("simple", steps, denoise=1.0)
- ModelSamplingAuraFlow(shift=3)
- RandomNoise(seed)
Simple variants use a straight KSampler with euler / simple — works because those models are distilled (turbo) or merged (base+SFT).
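The variant table and chain split above can be summarized as a lookup. The following is only an illustrative sketch: the values are copied from the table, but the names `VariantPreset`, `PRESETS`, `TEMPLATES`, and `resolve` are hypothetical and are not taken from music_maker.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantPreset:
    unet: str
    chain: str  # "apg" (SamplerCustomAdvanced + APG) or "simple" (KSampler)
    steps: int
    cfg: float


# Values copied from the variants table; the data structure is an assumption.
PRESETS = {
    "xl_base": VariantPreset("acestep_v1.5_xl_base.safetensors", "apg", 50, 7.0),
    "xl_sft": VariantPreset("acestep_v1.5_xl_sft_bf16.safetensors", "apg", 45, 6.0),
    "xl_base_sft": VariantPreset(
        "acestep_v1.5_xl_merge_base_sft_ta_0.5.safetensors", "simple", 35, 3.0
    ),
    "xl_turbo": VariantPreset("acestep_v1.5_xl_turbo_bf16.safetensors", "simple", 10, 1.0),
    "base_turbo": VariantPreset("acestep_v1.5_turbo.safetensors", "simple", 8, 1.0),
}

# Template file names as listed in section 0.
TEMPLATES = {
    "apg": "ace_step_music_apg_api.json",
    "simple": "ace_step_music_simple_api.json",
}


def resolve(variant="xl_base", steps=None, cfg=None):
    """Return the settings for a variant, applying optional step/CFG overrides."""
    preset = PRESETS[variant]
    return {
        "unet": preset.unet,
        "template": TEMPLATES[preset.chain],
        "steps": preset.steps if steps is None else steps,
        "cfg": preset.cfg if cfg is None else cfg,
    }
```

For example, `resolve("xl_turbo", steps=12)` keeps the turbo CFG of 1.0 and the simple template while overriding only the step count.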
3. Quick-start
Two ways to invoke from anywhere, plus how to fetch the result:
Direct SSH one-liner
ssh ${SSH_USER}@127.0.0.1 'cd ${COMFYUI_ROOT} && python music_tool\music_maker.py --prompt "lofi jazz, warm Rhodes, soft saxophone, brushed drums, vinyl crackle" --duration 180 --bpm 78 --key "A minor"'
From a sidecar script (recommended for longer tracks)
ssh ${SSH_USER}@127.0.0.1 'start /B python ${COMFYUI_ROOT}\music_tool\music_maker.py --prompt "..." --duration 240 --variant xl_base > ${USER_HOME}\music_maker_run.log 2>&1'
ssh ${SSH_USER}@127.0.0.1 'powershell -Command "Get-Content ${USER_HOME}\music_maker_run.log -Wait -Tail 10"'
Pull the result
scp ${SSH_USER}@127.0.0.1:${COMFYUI_ROOT}/output/music/lofi_jazz_*.flac .
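When the one-liner is assembled from a script, quoting the prompt correctly for the Windows host is the fiddly part. The sketch below builds the remote command string only (it does not run ssh). `build_remote_command` is a hypothetical helper, and it relies on `subprocess.list2cmdline`, which applies the Microsoft C runtime quoting rules; characters special to cmd.exe such as `&` or `%` are not escaped, so this is a starting point under that assumption.

```python
import subprocess


def build_remote_command(prompt, duration=120, bpm=75, key="A minor",
                         variant="xl_base",
                         script=r"music_tool\music_maker.py"):
    """Build the music_maker.py command line to run on the Windows host."""
    args = [
        "python", script,
        "--prompt", prompt,
        "--duration", str(duration),
        "--bpm", str(bpm),
        "--key", key,
        "--variant", variant,
    ]
    # list2cmdline wraps arguments containing spaces in double quotes.
    return subprocess.list2cmdline(args)


if __name__ == "__main__":
    print(build_remote_command("lofi jazz, warm Rhodes", duration=180, bpm=78))
```

The resulting string can be placed after the `cd ${COMFYUI_ROOT} &&` prefix shown in the one-liner above.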
4. Argument reference
python music_maker.py [options]
--prompt STR (required) comma-separated music descriptors
--duration FLOAT track length in seconds (default 120, max ~240)
--bpm INT tempo (default 75)
--key STR key/scale, e.g. "A minor", "C# major" (default "A minor")
--lyrics STR_OR_PATH literal lyrics OR path to .txt file (default empty = instrumental)
--variant {xl_base|xl_sft|xl_base_sft|xl_turbo|base_turbo} (default xl_base)
--steps INT override the variant's preset step count
--cfg FLOAT override the variant's preset CFG
--seed INT fixed seed for reproducibility
--output / -o PATH output file (.flac / .wav / .mp3); default is output/music/<slug>_<seed>.flac
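For readers scripting against the tool, the documented options map naturally onto `argparse`. This is a sketch of how such a parser might look, not the actual source of music_maker.py. The slug rule in `default_output` (first tag, lowercased, non-alphanumerics collapsed to underscores) is an assumption inferred from the `lofi_jazz_*.flac` name in the quick-start, and the ~240 s maximum is not enforced here.

```python
import argparse
import re

VARIANTS = ["xl_base", "xl_sft", "xl_base_sft", "xl_turbo", "base_turbo"]


def build_parser():
    """Parser mirroring the documented options and defaults."""
    p = argparse.ArgumentParser(prog="music_maker.py")
    p.add_argument("--prompt", required=True, help="comma-separated music descriptors")
    p.add_argument("--duration", type=float, default=120.0, help="track length in seconds")
    p.add_argument("--bpm", type=int, default=75)
    p.add_argument("--key", default="A minor")
    p.add_argument("--lyrics", default="", help="literal lyrics or path to a .txt file")
    p.add_argument("--variant", choices=VARIANTS, default="xl_base")
    p.add_argument("--steps", type=int, default=None, help="override preset step count")
    p.add_argument("--cfg", type=float, default=None, help="override preset CFG")
    p.add_argument("--seed", type=int, default=None)
    p.add_argument("--output", "-o", default=None)
    return p


def default_output(prompt, seed):
    """Hypothetical default path: output/music/<slug>_<seed>.flac."""
    first_tag = prompt.split(",")[0].lower()
    slug = re.sub(r"[^a-z0-9]+", "_", first_tag).strip("_")
    return f"output/music/{slug}_{seed}.flac"
```

Leaving `--steps` and `--cfg` as `None` lets downstream code tell "not given" apart from an explicit override of the variant preset.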
5. Writing good prompts
ACE Step understands music the way image models understand art: the prompt is a cloud of descriptors, not a sentence. Pile on comma-separated tags across five categories:
Genre + subgenre
lofi jazz / jazz fusion / bossa nova / swing / cool jazz / bebop
ambient drone / cinematic ambient / dark ambient / space music
lofi hiphop / boom bap / trip hop / chillhop / study beats
neo-soul / R&B / funk / gospel
classical / chamber / string quartet / solo piano / minimalist / romantic
cinematic orchestral / film score / epic trailer / horror score / ghibli-style
indie rock / shoegaze / post-rock / dream pop / synthwave / vaporwave
electronic / IDM / techno / house / drum and bass / ambient techno
world / flamenco / tango / celtic / middle eastern / afrobeat / reggae
Instruments (more specific = better)
warm Rhodes piano, muted saxophone, brushed jazz drums,
upright bass walking line, vibraphone, muted trumpet,
Fender Rhodes, clean Stratocaster, nylon-string guitar,
Moog bass, analog synth pad, mellotron strings,
violin section, cello, timpani, woodwinds,
hand drums, sitar, oud, didgeridoo, koto
Production / mix character
vinyl crackle, tape hiss, analog warmth, lo-fi compression,
big reverb, long delay, spring reverb, plate reverb,
close-mic'd, room ambience, field recording,
dry and intimate, lush and wide, spectral shimmer,
sidechained pump, pumping kick, saturated bass
Mood / setting
nocturnal, rainy window, coffee shop, late-night drive,
contemplative, melancholic, uplifting, triumphant, dark foreboding,
urgent, tense, calm and measured, reverent, sacred,
morning coffee, sunrise, sunset, winter, summer, desert, forest
Rhythm / groove cues (reinforces BPM)
relaxed 4/4 swing, boom-bap groove, head-nod groove,
samba syncopation, waltz 3/4, odd meter 7/8,
driving straight 8ths, laid back behind the beat
Full example prompt
lofi jazz, mellow hip hop beat, warm Rhodes piano, soft muted saxophone,
brushed jazz drums, upright bass walking line, vinyl crackle,
rainy window atmosphere, nocturnal, study beats, relaxed 4/4 swing
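A prompt like the one above can be assembled from per-category tag lists, which makes it easier to swap one category at a time between runs. The helper below is a minimal sketch; `compose_prompt` is a hypothetical name, and the case-insensitive de-duplication is a convenience choice, not documented ACE Step behavior.

```python
def compose_prompt(genre=(), instruments=(), production=(), mood=(), rhythm=()):
    """Join category tag lists into one comma-separated prompt, dropping duplicates."""
    seen = set()
    tags = []
    for group in (genre, instruments, production, mood, rhythm):
        for tag in group:
            cleaned = " ".join(tag.split())  # collapse stray whitespace
            key = cleaned.lower()
            if cleaned and key not in seen:
                seen.add(key)
                tags.append(cleaned)
    return ", ".join(tags)


if __name__ == "__main__":
    print(compose_prompt(
        genre=["lofi jazz", "study beats"],
        instruments=["warm Rhodes piano", "brushed jazz drums"],
        production=["vinyl crackle"],
        mood=["nocturnal"],
        rhythm=["relaxed 4/4 swing"],
    ))
```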
Anti-patterns
- ❌ Full sentences ("A beautiful jazz song with piano"): ACE expects tags, not prose
- ❌ Requesting specific artists ("in the style of Miles Davis"): may nudge the output, but not reliably
- ❌ Contradictory tags ("aggressive peaceful / loud quiet"): the model averages them to mush
- ❌ Song-structure prose ("verse 1 goes like..."): use the --lyrics arg for vocals
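Some of these anti-patterns can be caught with a rough pre-flight check before queueing a long render. The sketch below is purely heuristic: `lint_prompt`, the six-word threshold for "sentence-like" tags, and the two contradiction pairs are all assumptions chosen for illustration, not part of music_maker.py.

```python
import re

# Illustrative pairs only; extend to taste.
CONTRADICTIONS = [("aggressive", "peaceful"), ("loud", "quiet")]


def lint_prompt(prompt):
    """Return a list of warnings for common prompt anti-patterns."""
    warnings = []
    lowered = prompt.lower()
    tags = [t.strip() for t in lowered.split(",") if t.strip()]
    if "in the style of" in lowered:
        warnings.append("artist reference")
    # Heuristic: a single tag of six or more words is probably prose.
    if any(len(t.split()) > 5 for t in tags):
        warnings.append("sentence-like tag")
    words = set(re.findall(r"[a-z']+", lowered))
    for a, b in CONTRADICTIONS:
        if a in words and b in words:
            warnings.append(f"contradictory tags: {a} / {b}")
    return warnings
```

An empty list means none of the heuristics fired; it does not guarantee a good prompt.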
Writing for dynamics, feel, and punch
If your tracks sound flat / same-level / lifeless, the prompt is usually why. ACE Step mirrors the energy envelope of its tags. A "wall of sound" prompt produces a wall-of-sound track — no peaks, no valleys, no feel.
Words that CREATE dynamics (use these):
punchy, snappy, transient-rich, kick-forward, staccato, percussive,
breathy, restrained, sparse, minimal, space between notes,
quiet intro, slow build, drops to silence, sudden hit,
accent on the one, ghost note, syncopated, rhythmic tension,
call and response, rest, pause, breathing room,
rises and falls, crescendo, decrescendo, swell, taper,
loud-quiet-loud dynamics, cinematic dynamics,
sidechain pump, ducking, gated, stabbed, plucked, stabs,
muted, then big, whispered then roared
Words that KILL dynamics (avoid or use sparingly):
wall of sound, dense mix, thick, maximal, lush full arrangement,
constant energy, always moving, never stops, saturated everything,
massive, huge, overwhelming, pounding nonstop,
layered and layered, everything at once,
compressed to the max, radio-ready loud ← asks the model to pre-compress
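To audit an existing prompt against the list above, a simple substring scan over its tags is usually enough. This is a sketch with a hypothetical `find_dynamics_killers` helper; the phrase tuple is a subset of the list above and matching is deliberately naive.

```python
# Subset of the dynamics-killing phrases listed above.
DYNAMICS_KILLERS = (
    "wall of sound", "dense mix", "maximal", "constant energy",
    "never stops", "everything at once", "compressed to the max",
    "radio-ready loud",
)


def find_dynamics_killers(prompt):
    """Return the tags (lowercased) that contain a dynamics-killing phrase."""
    tags = [t.strip().lower() for t in prompt.split(",")]
    return [t for t in tags if any(k in t for k in DYNAMICS_KILLERS)]
```

Flagged tags are candidates for removal or for replacement with contrast cues such as "quiet intro, slow build".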
**Structural cu