Local Image Generation — Local AI Guide

Why run image generation locally?

Cloud image generators typically charge per image, may log your prompts, and apply content policies you can't change. Running locally means unlimited generation at close to zero marginal cost (electricity aside) once you have the hardware, with complete privacy and full control over models and settings.

Local image quality is now comparable to that of cloud services, and in some cases better, thanks to open-weight models like Flux.1 and SD 3.5 Large.

The tools

ComfyUI

A node-based workflow editor that is extremely flexible and powerful. The professional's choice. Supports nearly every model and extension, at the cost of a steeper learning curve.

Platforms: Windows, Mac, Linux

Automatic1111

The original web UI for Stable Diffusion. Huge extension ecosystem, mature, well-documented. Slower to adopt new architectures.

Platforms: Windows, Linux

Forge

A fork of Automatic1111 optimised for VRAM efficiency. Runs larger models on less memory than the original. Good starting point for 8 GB cards.

Platforms: Windows, Linux

InvokeAI

A polished, modern UI focused on creative professionals. Clean interface, good canvas tools, active development.

Platforms: Windows, Mac, Linux

The models

Model	Params	Min VRAM	Strengths
Stable Diffusion XL	3.5B	6 GB	Fastest, largest LoRA library, great for artistic styles
SD 3.5 Large	8B	10 GB	Excellent text rendering, photorealism, prompt adherence
Flux.1 [schnell]	12B	12 GB	High quality in 4 steps — very fast, permissive license
Flux.1 [dev]	12B	12 GB	Best quality at 20–30 steps, non-commercial license
HiDream	17B	16 GB	Cinematic detail, strong photorealism
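
The gap between the Params and Min VRAM columns is easier to read with a rough weight-size estimate. The sketch below multiplies the parameter count by bytes per parameter. The precision table and the function name are illustrative assumptions, and the estimate covers the main model weights only: text encoders, the VAE, and activation memory all add to it.

```python
# Bytes per parameter for common weight precisions.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "4-bit": 0.5}


def weight_gb(params_billions: float, precision: str) -> float:
    """Approximate size of the model weights alone, in decimal GB."""
    return params_billions * BYTES_PER_PARAM[precision]


# Parameter counts taken from the table above.
TABLE_PARAMS = [
    ("Stable Diffusion XL", 3.5),
    ("SD 3.5 Large", 8.0),
    ("Flux.1", 12.0),
    ("HiDream", 17.0),
]

for name, params in TABLE_PARAMS:
    sizes = ", ".join(
        f"{precision}: {weight_gb(params, precision):.1f} GB"
        for precision in BYTES_PER_PARAM
    )
    print(f"{name:20s} {sizes}")
```

Under these assumptions a 12B model needs roughly 24 GB for weights at fp16 and roughly 12 GB at fp8, which suggests the minimums in the table presume quantised weights or offloading to system memory.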

What about add-ons?

ControlNet models let you control image composition using a reference image, pose skeleton, or edge map. LoRAs are small (50–300 MB) fine-tune files that add a specific style or subject to a base model, and you can mix multiple LoRAs with different weights. Both kinds of add-on run on top of your base model and consume additional VRAM.
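
To make "mixing LoRAs with different weights" concrete, here is a minimal sketch of the usual low-rank formulation, in which each LoRA contributes a weighted update B·A to a base weight matrix. It uses tiny hand-written matrices and pure Python. The function names are hypothetical, and real tools apply this per layer with an extra rank-dependent scaling factor that is omitted here.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_loras(base, loras):
    """Return base + sum(weight * B @ A) over all LoRAs.

    base  : d_out x d_in matrix
    loras : list of (weight, B, A) with B of shape d_out x r and A of shape r x d_in
    """
    merged = [[float(v) for v in row] for row in base]  # copy so base is untouched
    for weight, b, a in loras:
        delta = matmul(b, a)  # low-rank update, same shape as base
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += weight * value
    return merged


base = [[1, 0], [0, 1]]
style_lora = (0.5, [[1], [0]], [[0, 2]])     # rank-1 update applied at weight 0.5
subject_lora = (0.25, [[0], [1]], [[4, 0]])  # rank-1 update applied at weight 0.25
merged = apply_loras(base, [style_lora, subject_lora])
```

Lowering a LoRA's weight shrinks its contribution linearly, which is why several LoRAs can be blended on one base model.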

Speed expectations

GPU	Time per image, Flux.1 [schnell] (4 steps, 1024px)
RTX 4060 (8 GB)	~8 seconds (running SDXL instead of Flux.1)
RTX 4070 Super (12 GB)	~18–25 seconds
RTX 4090 (24 GB)	~6–8 seconds
RTX 5090 (32 GB)	~3–5 seconds
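
These per-image times translate into hourly throughput with simple arithmetic. The sketch below assumes back-to-back generation with no model loading, saving, or upscaling overhead, so treat the results as upper bounds. The function name is illustrative.

```python
SECONDS_PER_HOUR = 3600

# (fastest, slowest) seconds per image, copied from the table above.
TIMINGS = {
    "RTX 4070 Super (12 GB)": (18, 25),
    "RTX 4090 (24 GB)": (6, 8),
    "RTX 5090 (32 GB)": (3, 5),
}


def images_per_hour(fastest_s: float, slowest_s: float) -> tuple:
    """Return (low, high) whole images per hour for a range of per-image times."""
    low = int(SECONDS_PER_HOUR // slowest_s)
    high = int(SECONDS_PER_HOUR // fastest_s)
    return low, high


for gpu, (fastest, slowest) in TIMINGS.items():
    low, high = images_per_hour(fastest, slowest)
    print(f"{gpu}: roughly {low} to {high} images per hour")
```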

How this maps to your RTX / Spark

NVIDIA Tensor Cores are the engine behind every denoising step. An RTX 4060 (8 GB) runs SDXL comfortably and SD 3.5 Medium with Forge's memory optimisations. An RTX 4070 Super (12 GB) or RTX 4060 Ti (16 GB) opens up Flux.1 at full quality. An RTX 4090 or 5090 generates Flux images in seconds and supports batch generation for serious workflows.
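
As a quick way to check which base models a given card can load, the sketch below filters the models table by its Min VRAM column. The data is copied from that table. The class and function names are illustrative, and the minimums are a floor: ControlNets and LoRAs need headroom on top.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    params_b: float   # parameters, in billions
    min_vram_gb: int  # minimum VRAM from the models table


MODELS = [
    Model("Stable Diffusion XL", 3.5, 6),
    Model("SD 3.5 Large", 8.0, 10),
    Model("Flux.1 [schnell]", 12.0, 12),
    Model("Flux.1 [dev]", 12.0, 12),
    Model("HiDream", 17.0, 16),
]


def models_that_fit(vram_gb: int) -> list:
    """Names of models whose listed minimum VRAM is within the card's VRAM."""
    return [m.name for m in MODELS if m.min_vram_gb <= vram_gb]


for vram in (8, 12, 16, 24):
    print(f"{vram} GB: {', '.join(models_that_fit(vram))}")
```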

The DGX Spark's 128 GB of unified memory lets you keep multiple base models, ControlNets, and large LoRA stacks loaded simultaneously, eliminating the model-swapping delays that slow down creative workflows on consumer GPUs. TensorRT-optimised pipelines on NVIDIA hardware can deliver an additional 2–4× speed boost over standard PyTorch inference.
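
To get a feel for what a 2–4× speed-up would mean in practice, the sketch below applies that range to a baseline per-image time. The guide does not state which baseline timings the figure applies to, so pairing it with the numbers from the speed table is an assumption made purely for illustration. The function name is hypothetical.

```python
def accelerated_range(fastest_s, slowest_s, min_speedup=2.0, max_speedup=4.0):
    """Return (best, worst) seconds per image after a speed-up in the given range.

    Best case: the fastest baseline time with the largest speed-up.
    Worst case: the slowest baseline time with the smallest speed-up.
    """
    return fastest_s / max_speedup, slowest_s / min_speedup


# Baseline of 6 to 8 seconds, assumed here to be standard PyTorch inference.
best, worst = accelerated_range(6, 8)
print(f"Estimated {best:.1f} to {worst:.1f} seconds per image")
```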