Diffusion Models
How does a computer turn words into a picture? It starts with pure random noise and gradually removes it, guided by your prompt.
The core idea: reverse noise
During training, a diffusion model learns a peculiar skill: how to take a noisy image and make it slightly less noisy. To teach it, researchers take millions of real images and add random noise in many small steps until each image becomes pure static. The model learns to undo each step, typically by predicting the noise that was added.
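To make the noising half of this concrete, here is a minimal sketch of the forward process on a four-value "image". The linear schedule, its start and end values, and the names `make_alpha_bars` and `add_noise` are illustrative assumptions, not details taken from any particular model.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal remaining after each step.

    Assumes a linear noise schedule; the default values are illustrative.
    """
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(pixels, alpha_bar, rng):
    """Blend the clean values with Gaussian noise; return (noisy, noise)."""
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [
        math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
        for x, e in zip(pixels, noise)
    ]
    return noisy, noise

rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)
image = [0.5, -0.25, 1.0, -1.0]  # stand-in for real pixel values

slightly_noisy, _ = add_noise(image, alpha_bars[0], rng)
mostly_static, _ = add_noise(image, alpha_bars[-1], rng)
print("signal kept after first step:", alpha_bars[0])
print("signal kept after last step: ", alpha_bars[-1])
```

A training example is then a pair: the noisy values go in, and the `noise` list is what the network is asked to predict.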
At generation time, the process is run in reverse: start with pure noise, apply the model's "denoise by one step" skill repeatedly, and a coherent image emerges. Your text prompt guides every step.
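The generation loop can be sketched in a few lines. This is a toy under loud assumptions: `toy_predict_noise` is a hypothetical stand-in for the trained network (it simply knows the one target it should produce, which is the role the prompt plays here), the schedule is made up, and the update rule is a deterministic DDIM-style step rather than the sampler of any specific model.

```python
import math
import random

def toy_predict_noise(x, alpha_bar, target):
    # Hypothetical stand-in for the trained network: if the "dataset" held
    # only `target`, the ideal noise estimate has this closed form.
    scale = math.sqrt(1.0 - alpha_bar)
    return [(xi - math.sqrt(alpha_bar) * ti) / scale for xi, ti in zip(x, target)]

def sample(target, alpha_bars, rng):
    """Start from pure noise and denoise one step at a time."""
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for t in reversed(range(len(alpha_bars))):
        ab_t = alpha_bars[t]
        ab_prev = alpha_bars[t - 1] if t > 0 else 1.0  # 1.0 means fully clean
        eps = toy_predict_noise(x, ab_t, target)
        # Estimate the clean image implied by the current noise prediction.
        x0_est = [
            (xi - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t)
            for xi, e in zip(x, eps)
        ]
        # Re-noise that estimate to the slightly lower noise level of step t-1.
        x = [
            math.sqrt(ab_prev) * x0 + math.sqrt(1.0 - ab_prev) * e
            for x0, e in zip(x0_est, eps)
        ]
    return x

num_steps = 50
# Illustrative schedule: signal fraction falls steadily from 50/51 to 1/51.
alpha_bars = [1.0 - (t + 1) / (num_steps + 1) for t in range(num_steps)]
target = [0.5, -0.25, 1.0, -1.0]
result = sample(target, alpha_bars, random.Random(0))
print(result)
```

In a real model the network has no access to the answer; it only has its learned weights and the prompt, which is why real outputs vary with the starting noise.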
What is the "latent space"?
Most modern diffusion models don't work in full pixel space, because that would be very expensive computationally. Instead they operate in a compressed representation called the latent space. An encoder squishes a 512×512 image into a small 64×64 grid of abstract numbers, 8× smaller along each side. The diffusion happens there. At the end, a decoder expands the result back to full resolution. This is the "latent" in Latent Diffusion Models, the family that Stable Diffusion and Flux belong to.
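A quick bit of arithmetic shows how much smaller the latent grid is. The channel counts are assumptions for illustration: 3 channels for an RGB image and 4 per latent cell, as in early Stable Diffusion versions. Other models use different latent channel counts.

```python
def element_count(height, width, channels):
    """Number of values in a height x width grid with `channels` per cell."""
    return height * width * channels

pixel_values = element_count(512, 512, 3)   # RGB image
latent_values = element_count(64, 64, 4)    # assumed 4 latent channels
compression = pixel_values / latent_values

print(f"pixel space:  {pixel_values:,} values")
print(f"latent space: {latent_values:,} values")
print(f"reduction:    {compression:.0f}x fewer values per denoising step")
```

Under those assumptions, every denoising step touches about 48 times fewer values than it would in pixel space.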
How your text prompt guides the image
The prompt is processed by a text encoder (often a model like CLIP or T5) into a sequence of vectors. These vectors are fed into the denoising network at every step, classically via a mechanism called cross-attention; newer transformer-based models such as SD 3.5 and Flux instead attend jointly over text and image tokens. Either way, training teaches the model to move the noisy image toward representations that match the text.
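Here is a stripped-down sketch of cross-attention: each image position asks "which prompt tokens matter to me?" and pulls in a weighted mix of their values. The function names and the tiny hand-written vectors are hypothetical, and the learned projection matrices that real networks use to produce queries, keys, and values are left out for brevity.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(image_queries, text_keys, text_values):
    """For each image position, mix text values by query-key similarity."""
    dim = len(text_keys[0])
    value_dim = len(text_values[0])
    outputs = []
    for q in image_queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in text_keys
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * v[j] for w, v in zip(weights, text_values))
            for j in range(value_dim)
        ]
        outputs.append(mixed)
    return outputs

# Two image positions attending over two prompt tokens (made-up numbers).
image_queries = [[10.0, 0.0], [0.0, 10.0]]
text_keys = [[1.0, 0.0], [0.0, 1.0]]
text_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

attended = cross_attention(image_queries, text_keys, text_values)
print(attended)
```

The first image position lines up with the first token's key, so its output is dominated by that token's value, and likewise for the second.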
Popular diffusion models you can run locally
| Model | Params | Style / strength | Min VRAM |
|---|---|---|---|
| Stable Diffusion XL | 3.5B | Versatile, huge ecosystem of LoRAs | 6 GB |
| SD 3.5 Large | 8B | Excellent text rendering, photorealism | 10 GB |
| Flux.1 [dev] | 12B | Top-tier quality, prompt adherence | 12 GB |
| Flux.1 [schnell] | 12B | Near-[dev] quality in 1 to 4 steps (distilled) | 12 GB |
| HiDream | 17B | Highly detailed, cinematic | 16 GB |
NVIDIA Tensor Cores are purpose-built for the matrix math that every denoising step requires. An RTX 4060 (8 GB) runs SDXL comfortably. An RTX 4070 Super (12 GB) or RTX 4060 Ti (16 GB) handles Flux.1. An RTX 4090 or 5090 generates Flux images in seconds instead of tens of seconds.
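A rough way to sanity-check the VRAM column is to estimate how much memory the weights alone take at different precisions. This is a back-of-the-envelope sketch: the precision list and the helper name `weight_gigabytes` are assumptions, and real usage also depends on activations, the text encoder, the decoder, and whether parts of the model are offloaded to system RAM.

```python
# Bytes needed to store one parameter at each precision (illustrative set).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "4-bit": 0.5}

def weight_gigabytes(params_billions, precision):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]

for name, params in [("Stable Diffusion XL", 3.5), ("Flux.1 [dev]", 12.0)]:
    sizes = {p: weight_gigabytes(params, p) for p in BYTES_PER_PARAM}
    print(name, sizes)
```

By this estimate a 12B model needs about 24 GB for 16-bit weights, so fitting it on a 12 GB or 16 GB card implies quantized weights, offloading, or both.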
TensorRT can compile diffusion models into engines optimized for your specific GPU, which can yield a 2–4× throughput boost over unoptimized inference. (TensorRT-LLM, its sibling library, is aimed primarily at large language models.) The DGX Spark's 128 GB unified memory makes it practical to run multiple large diffusion models side by side.