The core idea: reverse noise

During training, a diffusion model learns a peculiar skill: how to take a noisy image and make it slightly less noisy. To train this, researchers took millions of real images and progressively added random noise until each image became pure static. The model learned to reverse each step.
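The noising half of training can be sketched in a few lines. This is a minimal illustration, not any particular model's recipe: it assumes a DDPM-style linear noise schedule (the article does not name one), and the names `alpha_bar_schedule` and `add_noise` are made up for this example. The "image" is a toy list of four pixel values.

```python
import math
import random

def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal left after each step (assumed linear beta schedule)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(pixels, t, alpha_bars, rng):
    """Jump straight to noise level t: x_t = sqrt(a) * x_0 + sqrt(1 - a) * noise."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(pixels, noise)]
    return noisy, noise  # the noise is what the model is trained to predict

rng = random.Random(0)
alpha_bars = alpha_bar_schedule()
image = [0.5, -0.25, 1.0, -1.0]  # toy "image": four pixel values in [-1, 1]
slightly_noisy, _ = add_noise(image, 10, alpha_bars, rng)    # still mostly image
almost_static, _ = add_noise(image, 999, alpha_bars, rng)    # almost pure static
```

During training, the model sees the noisy version and the step number, and is scored on how well it recovers the noise that was added.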

At generation time, the process is run in reverse: start with pure noise, apply the model's "denoise by one step" skill repeatedly, and a coherent image emerges. Your text prompt guides every step.
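The generation loop can be sketched the same way. The version below is a toy under loud assumptions: `toy_predict_noise` stands in for the trained neural network and simply "knows" the clean target (in a real model, the prompt steers that prediction), the noise schedule is invented for the example, and the update rule is a deterministic DDIM-style step, which is one common choice rather than the only one.

```python
import math
import random

def toy_predict_noise(x_t, a_t, target):
    """Stand-in for the trained network (assumption): it already knows the clean target."""
    return [(x - math.sqrt(a_t) * c) / math.sqrt(1.0 - a_t) for x, c in zip(x_t, target)]

def sample(target, num_steps=50, seed=0):
    rng = random.Random(seed)
    # Made-up schedule: signal fraction falls from about 1 (step 0) to about 0 (last step).
    alpha_bars = [1.0 - (t + 1) / (num_steps + 1) for t in range(num_steps)]
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from (nearly) pure noise
    for t in reversed(range(num_steps)):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[t - 1] if t > 0 else 1.0  # 1.0 means no noise left
        eps = toy_predict_noise(x, a_t, target)
        # Estimate the clean image, then re-noise it to the next, slightly lower level.
        x0_guess = [(xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t) for xi, e in zip(x, eps)]
        x = [math.sqrt(a_prev) * c + math.sqrt(1.0 - a_prev) * e for c, e in zip(x0_guess, eps)]
    return x

target = [0.5, -0.25, 1.0, -1.0]  # toy "image" the prompt is asking for
result = sample(target)
```

Because the stand-in predictor is perfect, the loop lands on the target. A real network only approximates the noise, which is why it takes many small steps instead of one big one.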

[Figure: denoising progression from Start through Step 1, Step 2, and Step 3 to the Final output image, guided by your text prompt at every step]

What is the "latent space"?

Most modern diffusion models don't work in full pixel space, because that would be far more expensive to compute. Instead they operate in a compressed representation called the latent space. During training, an encoder compresses a 512×512 image into a small 64×64 grid of abstract numbers, and the diffusion happens there. At the end of generation, a decoder expands the result back to full resolution. This is the "latent" in Latent Diffusion Models, the family that includes Stable Diffusion and Flux.
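A quick back-of-the-envelope calculation shows how much smaller the latent is. The channel counts are assumptions not stated above: 3 channels for an RGB image and 4 latent channels, as in the original Stable Diffusion autoencoder (some newer models use more).

```python
def element_count(height, width, channels):
    """Number of values the denoiser would have to process."""
    return height * width * channels

pixel_values = element_count(512, 512, 3)   # 786432 (RGB image, assumed 3 channels)
latent_values = element_count(64, 64, 4)    # 16384 (assumed 4 latent channels)
ratio = pixel_values / latent_values        # 48.0

print(f"The latent holds {ratio:.0f}x fewer values than the pixel image")
```

Every denoising step touches that smaller grid, so the saving is paid back at each of the dozens of steps in a generation.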

How your text prompt guides the image

The prompt is processed by a text encoder (often a model like CLIP or T5) into a sequence of vectors. These vectors are injected into the denoising network at every step via a mechanism called cross-attention. The model learns to move the noisy image toward representations that match the text.
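Cross-attention itself is a small piece of math: each image position scores every text token, turns the scores into weights, and pulls in a weighted mix of the token vectors. The sketch below is deliberately stripped down. It omits the learned projection matrices and multiple heads a real network has, and the two-dimensional vectors are invented for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(image_queries, text_keys, text_values):
    """Scaled dot-product attention: queries come from the image, keys/values from the text."""
    dim = len(text_keys[0])
    outputs, all_weights = [], []
    for q in image_queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in text_keys]
        weights = softmax(scores)  # how much this image position listens to each token
        mixed = [sum(w * v[j] for w, v in zip(weights, text_values))
                 for j in range(len(text_values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Two toy text tokens and three toy image positions (all values are made up).
text_keys = [[1.0, 0.0], [0.0, 1.0]]
text_values = [[10.0, 0.0], [0.0, 10.0]]
image_queries = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]

outputs, weights = cross_attention(image_queries, text_keys, text_values)
```

The first image position attends mostly to the first token, the second to the second, and the third splits its attention evenly between them.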

Popular diffusion models you can run locally

| Model | Params | Style / strength | Min VRAM |
| --- | --- | --- | --- |
| Stable Diffusion XL | 3.5B | Versatile, huge ecosystem of LoRAs | 6 GB |
| SD 3.5 Large | 8B | Excellent text rendering, photorealism | 10 GB |
| Flux.1 [dev] | 12B | Top-tier quality, prompt adherence | 12 GB |
| Flux.1 [schnell] | 12B | Same quality, 4× faster (distilled) | 12 GB |
| HiDream | 17B | Highly detailed, cinematic | 16 GB |

How this maps to your RTX / Spark

NVIDIA Tensor Cores are purpose-built for the matrix math that every denoising step requires. An RTX 4060 (8 GB) runs SDXL comfortably. An RTX 4070 Super (12 GB) or 4060 Ti 16GB handles Flux.1, typically with quantized weights. An RTX 4090 or 5090 generates Flux images in seconds instead of tens of seconds.
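A rough way to see which card fits which model is to estimate the memory the weights alone need: parameter count times bytes per parameter. The sketch below is only that estimate. The precision table and the helper name `weight_memory_gb` are assumptions for illustration, 1 GB is taken as 10^9 bytes, and the real footprint is higher once activations, the text encoder, and the decoder are loaded.

```python
# Bytes used to store one parameter at common precisions (assumed options).
BYTES_PER_PARAM = {"fp16/bf16": 2.0, "fp8/int8": 1.0, "4-bit": 0.5}

def weight_memory_gb(params_billions, precision):
    """Approximate GB needed for the weights only (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]

for precision in BYTES_PER_PARAM:
    print(f"12B model at {precision}: {weight_memory_gb(12, precision):.0f} GB")
# 12B model at fp16/bf16: 24 GB
# 12B model at fp8/int8: 12 GB
# 12B model at 4-bit: 6 GB
```

By this estimate, a 12B model at 16-bit precision does not fit in a 12 GB or 16 GB card, which is why lower-precision weights or offloading are the usual way to run the larger models on mid-range GPUs.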

TensorRT can compile diffusion models into engines tuned for your specific GPU, often yielding a 2–4× throughput boost over naive inference (its sibling TensorRT-LLM is aimed mainly at language models). The DGX Spark's 128 GB unified memory makes it practical to run multiple large diffusion models side by side.