Stable Diffusion Image Generation

Comprehensive guide to generating images with Stable Diffusion using the HuggingFace Diffusers library.

When to use Stable Diffusion

Use Stable Diffusion when:

Generating images from text descriptions
Performing image-to-image translation (style transfer, enhancement)
Inpainting (filling in masked regions)
Outpainting (extending images beyond boundaries)
Creating variations of existing images
Building custom image generation workflows

Key features:

Text-to-Image: Generate images from natural language prompts
Image-to-Image: Transform existing images with text guidance
Inpainting: Fill masked regions with context-aware content
ControlNet: Add spatial conditioning (edges, poses, depth)
LoRA Support: Efficient fine-tuning and style adaptation
Multiple Models: SD 1.5, SDXL, SD 3.0, Flux support

Use alternatives instead:

DALL-E 3: For API-based generation without GPU
Midjourney: For artistic, stylized outputs
Imagen: For Google Cloud integration
Leonardo.ai: For web-based creative workflows

Quick start

Installation

pip install diffusers transformers accelerate torch
pip install xformers  # Optional: memory-efficient attention

Basic text-to-image

from diffusers import DiffusionPipeline
import torch

# Load pipeline (auto-detects model type)
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
)
pipe.to("cuda")

# Generate image
image = pipe(
    "A serene mountain landscape at sunset, highly detailed",
    num_inference_steps=50,
    guidance_scale=7.5
).images[0]

image.save("output.png")

Using SDXL (higher quality)

from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16"
)
pipe.to("cuda")

# Enable memory optimization
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="A futuristic city with flying cars, cinematic lighting",
    height=1024,
    width=1024,
    num_inference_steps=30
).images[0]

Architecture overview

Three-pillar design

Diffusers is built around three core components:

Pipeline (orchestration)
├── Model (neural networks)
│   ├── UNet / Transformer (noise prediction)
│   ├── VAE (latent encoding/decoding)
│   └── Text Encoder (CLIP/T5)
└── Scheduler (denoising algorithm)

Pipeline inference flow

Text Prompt → Text Encoder → Text Embeddings
                                    ↓
Random Noise → [Denoising Loop] ← Scheduler
                      ↓
               Predicted Noise
                      ↓
              VAE Decoder → Final Image

Core concepts

Pipelines

Pipelines orchestrate complete workflows:

Pipeline	Purpose
`StableDiffusionPipeline`	Text-to-image (SD 1.x/2.x)
`StableDiffusionXLPipeline`	Text-to-image (SDXL)
`StableDiffusion3Pipeline`	Text-to-image (SD 3.0)
`FluxPipeline`	Text-to-image (Flux models)
`StableDiffusionImg2ImgPipeline`	Image-to-image
`StableDiffusionInpaintPipeline`	Inpainting

Schedulers

Schedulers control the denoising process:

Scheduler	Steps	Quality	Use Case
`EulerDiscreteScheduler`	20-50	Good	Default choice
`EulerAncestralDiscreteScheduler`	20-50	Good	More variation
`DPMSolverMultistepScheduler`	15-25	Excellent	Fast, high quality
`DDIMScheduler`	50-100	Good	Deterministic
`LCMScheduler`	4-8	Good	Very fast
`UniPCMultistepScheduler`	15-25	Excellent	Fast convergence

Swapping schedulers

from diffusers import DPMSolverMultistepScheduler

# Swap for faster generation
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config
)

# Now generate with fewer steps
image = pipe(prompt, num_inference_steps=20).images[0]

Generation parameters

Key parameters

Parameter	Default	Description
`prompt`	Required	Text description of desired image
`negative_prompt`	None	What to avoid in the image
`num_inference_steps`	50	Denoising steps (more = better quality)
`guidance_scale`	7.5	Prompt adherence (7-12 typical)
`height`, `width`	512/1024	Output dimensions (multiples of 8)
`generator`	None	Torch generator for reproducibility
`num_images_per_prompt`	1	Batch size

Reproducible generation

import torch

generator = torch.Generator(device="cuda").manual_seed(42)

image = pipe(
    prompt="A cat wearing a top hat",
    generator=generator,
    num_inference_steps=50
).images[0]

Negative prompts

image = pipe(
    prompt="Professional photo of a dog in a garden",
    negative_prompt="blurry, low quality, distorted, ugly, bad anatomy",
    guidance_scale=7.5
).images[0]

Image-to-image

Transform existing images with text guidance:

from diffusers import AutoPipelineForImage2Image
from PIL import Image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("input.jpg").resize((512, 512))

image = pipe(
    prompt="A watercolor painting of the scene",
    image=init_image,
    strength=0.75,  # How much to transform (0-1)
    num_inference_steps=50
).images[0]

Inpainting

Fill masked regions:

from diffusers import AutoPipelineForInpainting
from PIL import Image

pipe = AutoPipelineForInpainting.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16
).to("cuda")

image = Image.open("photo.jpg")
mask = Image.open("mask.png")  # White = inpaint region

result = pipe(
    prompt="A red car parked on the street",
    image=image,
    mask_image=mask,
    num_inference_steps=50
).images[0]

ControlNet

Add spatial conditioning for precise control:

from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
import torch

# Load ControlNet for edge conditioning
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny",
    torch_dtype=torch.float16
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

# Use Canny edge image as control
control_image = get_canny_image(input_image)

image = pipe(
    prompt="A beautiful house in the style of Van Gogh",
    image=control_image,
    num_inference_steps=30
).images[0]

Available ControlNets

ControlNet	Input Type	Use Case
`canny`	Edge maps	Preserve structure
`openpose`	Pose skeletons	Human poses
`depth`	Depth maps	3D-aware generation
`normal`	Normal maps	Surface details
`mlsd`	Line segments	Architectural lines
`scribble`	Rough sketches	Sketch-to-image

LoRA adapters

Load fine-tuned style adapters:

from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# Load LoRA weights
pipe.load_lora_weights("path/to/lora", weight_name="style.safetensors")

# Generate with LoRA style
image = pipe("A portrait in the trained style").images[0]

# Adjust LoRA strength
pipe.fuse_lora(lora_scale=0.8)

# Unload LoRA
pipe.unload_lora_weights()

Multiple LoRAs

# Load multiple LoRAs
pipe.load_lora_weights("lora1", adapter_name="style")
pipe.load_lora_weights("lora2", adapter_name="character")

# Set weights for each
pipe.set_adapters(["style", "character"], adapter_weights=[0.7, 0.5])

stable-diffusion-image-generation

Como adicionar

Cole no README do seu repo

Skills relacionadas

dev-browser

agent-browser

understand-chat

understand-dashboard

Receba novas skills de Pesquisa e Web toda segunda