VRAM Calculator
Estimate how much GPU memory you need to run AI locally. Two calculators β one for language models (text), one for diffusion models (images & video) β because they use memory in very different ways.
How the estimates work
Both calculators start from the same idea β every parameter takes up memory, and the quantization level decides how many bytes each one needs (16-bit = 2 bytes, 8-bit = 1 byte, 4-bit = 0.5 bytes). Where they differ is the second term.
Language models: weights + KV cache
An LLM also has to store a KV cache β the attention state for every token currently in the context window. This is what makes long context expensive, and it scales linearly with context length:
// 2 = Key + Value Β· KV stored in FP16 (2 bytes)
Llama 3.1 8B (32 layers, 8 KV heads via GQA, head dim 128):
at 4K context β β 0.5 GB
at 32K context β β 4.0 GB (8Γ the context = 8Γ the cache)
at 128K context β β 16 GB
The calculator assumes modern grouped-query attention (8 KV heads), which is why the cache is far smaller than older full multi-head-attention estimates. Double the context, double the cache β verify it yourself by dragging the slider.
Diffusion models: weights + activations
A diffusion model has no context window and no KV cache. Instead, its extra memory comes from activations β the intermediate image data held in memory while denoising. That scales with the number of pixels, so it grows with output resolution, not context:
// ~1.8 GB at 1024Γ1024, ~4 GB at 1536Β², ~7 GB at 2048Β²
Flux.1 [dev] (12B) at GGUF Q4, 1024Γ1024:
weights β 6.5 GB + activations β 1.8 GB β β 8.5 GB
Same model at FP16, 1024Γ1024 β β 26 GB
Video (HunyuanVideo, Wan) multiplies activation memory by the number of frames, which is why even short clips need 24 GB+. Treat the video preset as a rough single-frame floor β real clips demand considerably more.
Why real numbers vary
- Architecture: GQA, MoE sparsity, and extra text encoders (Flux ships a large T5 encoder) all shift the totals.
- Framework: llama.cpp, vLLM, ComfyUI and diffusers each allocate memory differently, and offer offloading that trades speed for lower VRAM.
- Quantization scheme: Q4_K_M is mixed-precision β it averages ~4.5 bits/weight, not exactly 4.
Use these for planning, then confirm with your tool's live VRAM display.