Diffusion Models — Local AI Guide

The core idea: reverse noise

During training, a diffusion model learns a peculiar skill: how to take a noisy image and make it slightly less noisy. To train this, researchers took millions of real images and progressively added random noise until each image became pure static. The model learned to reverse each step.
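The noising half of training can be sketched in a few lines. This is a minimal illustration, not how any particular model was trained: it assumes the linear noise schedule from the original DDPM formulation (this guide doesn't name one), and the names `make_alpha_bars` and `add_noise`, plus the four-value toy "image", are hypothetical.

```python
import math
import random

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Share of the original signal left after each step (assumed linear beta schedule)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(pixels, t, alpha_bars, rng):
    """Jump straight to noise level t: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [math.sqrt(a) * p + math.sqrt(1.0 - a) * n for p, n in zip(pixels, noise)]
    return noisy, noise  # the noise is what the model is trained to predict

rng = random.Random(0)
alpha_bars = make_alpha_bars()
image = [0.8, -0.3, 0.5, 0.1]  # toy "image" with values in [-1, 1]
slightly_noisy, _ = add_noise(image, 10, alpha_bars, rng)   # still mostly image
pure_static, _ = add_noise(image, 999, alpha_bars, rng)     # almost entirely noise
```

At the last step almost none of the original signal survives, which is why generation can start from random noise alone.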

At generation time, the process is run in reverse: start with pure noise, apply the model's "denoise by one step" skill repeatedly, and a coherent image emerges. Your text prompt guides every step.
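The loop itself is short. The sketch below uses a deterministic, DDIM-style update, which is one common sampler and an assumption here, since this guide doesn't name one. The `toy_denoiser` is a hypothetical stand-in that cheats by knowing the target image; a real network has to infer the noise from the noisy input and the prompt.

```python
import math
import random

def toy_denoiser(x_t, a_t, target):
    """Stand-in for the trained network: returns the noise it believes is in x_t.
    It cheats by looking at `target`; a real model has no such access."""
    return [(x - math.sqrt(a_t) * c) / math.sqrt(1.0 - a_t) for x, c in zip(x_t, target)]

def sample(target, alpha_bars, rng):
    """Deterministic (DDIM-style) sampling loop from pure noise down to an image."""
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure static
    for i in reversed(range(len(alpha_bars))):
        a_t = alpha_bars[i]
        a_prev = alpha_bars[i - 1] if i > 0 else 1.0  # 1.0 means "no noise left"
        eps = toy_denoiser(x, a_t, target)
        # Estimate the clean image, then re-noise it to the next (lower) noise level.
        x0_est = [(xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t) for xi, e in zip(x, eps)]
        x = [math.sqrt(a_prev) * c + math.sqrt(1.0 - a_prev) * e for c, e in zip(x0_est, eps)]
    return x

# Twenty noise levels, from lightly noisy (index 0) to almost pure noise (index 19).
alpha_bars = [1.0 - (i + 1) / 21 for i in range(20)]
target = [0.8, -0.3, 0.5, 0.1]
result = sample(target, alpha_bars, random.Random(0))
```

Because the stand-in denoiser is perfect, `result` lands on `target`. With a real model each noise estimate is imperfect, and the repeated small steps are what let the image converge gradually.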

What is the "latent space"?

Most modern diffusion models don't work in full pixel space, because denoising every pixel at every step would be very expensive. Instead they operate in a compressed representation called the latent space. An encoder compresses a 512×512 image into a small 64×64 grid of abstract numbers, with a few channels per cell. The diffusion happens there. At the end, a decoder expands the result back to full resolution. This is the "latent" in Latent Diffusion Models, the family that Stable Diffusion and Flux belong to.
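A quick back-of-the-envelope calculation shows how much smaller the latent is. The figure of 4 latent channels is an assumption that matches the original Stable Diffusion autoencoder; newer models such as Flux use more channels, so their ratio is smaller. The helper `count_values` is hypothetical.

```python
def count_values(height, width, channels):
    """Number of individual numbers in a height x width x channels grid."""
    return height * width * channels

pixel_values = count_values(512, 512, 3)   # RGB image
latent_values = count_values(64, 64, 4)    # assumed 4 latent channels
ratio = pixel_values / latent_values

print(f"{pixel_values} pixel values -> {latent_values} latent values ({ratio:.0f}x fewer)")
# prints: 786432 pixel values -> 16384 latent values (48x fewer)
```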

How your text prompt guides the image

The prompt is processed by a text encoder (often a model like CLIP or T5) into a sequence of vectors. These vectors are injected into the denoising network at every step via a mechanism called cross-attention. The model learns to move the noisy image toward representations that match the text.
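Cross-attention can be illustrated with plain lists. This is a stripped-down, single-head sketch with hand-picked 2-D vectors. Real models add learned projection matrices, many heads, and far larger dimensions, and the function names here are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(image_queries, text_keys, text_values):
    """Each image position (query) takes a weighted average of the text values."""
    dim = len(text_keys[0])
    outputs, all_weights = [], []
    for q in image_queries:
        # Similarity between this image position and every prompt token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in text_keys]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, text_values))
                 for j in range(len(text_values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Two prompt tokens and two image positions.
text_keys = [[4.0, 0.0], [0.0, 4.0]]       # token 0, token 1
text_values = [[1.0, 0.0], [0.0, 1.0]]
image_queries = [[4.0, 0.0], [0.0, 4.0]]   # position 0 resembles token 0, position 1 token 1
outputs, weights = cross_attention(image_queries, text_keys, text_values)
```

Each image position ends up pulling almost all of its information from the prompt token it most resembles. That is the sense in which the text steers different regions of the image.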

Popular diffusion models you can run locally

Model	Params	Style / strength	Min VRAM
Stable Diffusion XL	3.5B	Versatile, huge ecosystem of LoRAs	6 GB
SD 3.5 Large	8B	Excellent text rendering, photorealism	10 GB
Flux.1 [dev]	12B	Top-tier quality, prompt adherence	12 GB
Flux.1 [schnell]	12B	Close to [dev] quality in 1–4 steps (distilled)	12 GB
HiDream	17B	Highly detailed, cinematic	16 GB
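To relate the Params column to the Min VRAM column, it helps to estimate how much memory the weights alone occupy at different precisions. This is a rough sketch under stated assumptions: it counts only the parameter figures from the table, treats 1 GB as 10^9 bytes, and ignores activations and any components not included in those counts. The helper `weight_gb` is hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "4-bit": 0.5}

def weight_gb(params_billion, precision):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]

for name, params in [("Stable Diffusion XL", 3.5), ("Flux.1 [dev]", 12.0), ("HiDream", 17.0)]:
    sizes = ", ".join(f"{p}: {weight_gb(params, p):.1f} GB" for p in BYTES_PER_PARAM)
    print(f"{name:<20} {sizes}")
```

Under these assumptions Flux.1 [dev] needs about 24 GB for fp16 weights alone, so a 12 GB minimum presumably relies on 8-bit or 4-bit weights, or on offloading part of the model to system RAM.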

How this maps to your RTX / Spark

NVIDIA Tensor Cores are purpose-built for the matrix math that every denoising step requires. An RTX 4060 (8 GB) runs SDXL comfortably. An RTX 4070 Super (12 GB) or RTX 4060 Ti 16 GB handles Flux.1, usually with quantized weights or offloading. An RTX 4090 or 5090 generates Flux images in seconds instead of tens of seconds.

TensorRT can compile a diffusion model into an engine optimized for your specific GPU, often yielding a 2–4× throughput boost over unoptimized inference (TensorRT-LLM is the sibling toolkit aimed at large language models). The DGX Spark's 128 GB of unified memory makes it practical to keep multiple large diffusion models loaded side by side.
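As a rough sanity check on the side-by-side claim, you can add up the weights of every model in the table above. This sketch assumes 16-bit weights (2 bytes per parameter) and counts only the parameter figures listed in the table; activations, other processes sharing the unified memory, and anything not included in those counts are left out.

```python
PARAMS_BILLION = {
    "Stable Diffusion XL": 3.5,
    "SD 3.5 Large": 8.0,
    "Flux.1 [dev]": 12.0,
    "Flux.1 [schnell]": 12.0,
    "HiDream": 17.0,
}
UNIFIED_MEMORY_GB = 128
BYTES_PER_PARAM = 2  # assumed 16-bit weights

# Billions of parameters times bytes per parameter gives GB directly.
total_gb = sum(PARAMS_BILLION.values()) * BYTES_PER_PARAM
headroom_gb = UNIFIED_MEMORY_GB - total_gb

print(f"All five models: {total_gb:.0f} GB of weights, {headroom_gb:.0f} GB left over")
# prints: All five models: 105 GB of weights, 23 GB left over
```

The leftover memory still has to cover activations and the rest of the system, so in practice you would likely load a subset at full precision or quantize some of the models.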