VRAM Calculator — Local AI Guide

Language Model (LLM)

Weights + KV cache. Memory grows with context length.

Diffusion Model (Image / Video)

Weights + activations. Memory grows with output resolution.

How the estimates work

Both calculators start from the same idea: every parameter takes up memory, and the quantization level decides how many bytes each one needs (16-bit = 2 bytes, 8-bit = 1 byte, 4-bit = 0.5 bytes). They differ in the second term: a KV cache for language models, activations for diffusion models.
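The weight term can be sketched in a few lines of Python. This is a minimal illustration rather than the calculator's actual code: the name `weight_gb` is hypothetical, and it assumes 1 GB = 10⁹ bytes and exactly 2, 1, or 0.5 bytes per parameter.

```python
# Bytes per parameter for each quantization level, as listed in the text.
BYTES_PER_PARAM = {16: 2.0, 8: 1.0, 4: 0.5}

def weight_gb(params_billion: float, bits: int) -> float:
    """Approximate weight memory in GB (assumes 1 GB = 1e9 bytes).

    One billion parameters at one byte each is one GB, so the
    result is simply parameters (in billions) x bytes per parameter.
    """
    return params_billion * BYTES_PER_PARAM[bits]

for bits in (16, 8, 4):
    print(f"8B model at {bits}-bit: ~{weight_gb(8, bits):.1f} GB")
```

For an 8B model this gives roughly 16 GB, 8 GB, and 4 GB of weights before the second term is added.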

Language models: weights + KV cache

An LLM also has to store a KV cache — the attention state for every token currently in the context window. This is what makes long context expensive, and it scales linearly with context length:

KV cache (GB) = 2 × layers × kv_heads × head_dim × context_tokens × 2 bytes ÷ 1024³
// 2 = Key + Value · KV stored in FP16 (2 bytes)

Llama 3.1 8B (32 layers, 8 KV heads via GQA, head dim 128):
  at 4K context → ≈ 0.5 GB
  at 32K context → ≈ 4.0 GB (8× the context = 8× the cache)
  at 128K context → ≈ 16 GB

The calculator assumes modern grouped-query attention (8 KV heads), which is why the cache is far smaller than older full multi-head-attention estimates. Llama 3.1 8B has 32 query heads, so a full multi-head cache would be 4× larger. Doubling the context doubles the cache; you can verify this by dragging the context slider.
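The KV cache formula above translates directly into Python. The sketch below reproduces the Llama 3.1 8B figures. The function name `kv_cache_gb` and its parameter names are illustrative, not taken from the calculator, and the cache is assumed to be stored in FP16.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GB, dividing by 1024**3 as in the formula above.

    The leading 2 counts the Key and Value tensors; bytes_per_value
    defaults to 2 because the cache is assumed to be FP16.
    """
    total_bytes = (2 * layers * kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 1024**3

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head dim 128.
for ctx in (4096, 32768, 131072):
    print(f"{ctx:>6} tokens: {kv_cache_gb(32, 8, 128, ctx):.1f} GB")

# Hypothetical comparison: the same model with full multi-head
# attention, i.e. 32 KV heads instead of 8, at 4K context.
print(f"MHA at 4K: {kv_cache_gb(32, 32, 128, 4096):.1f} GB")
```

The three context lengths give 0.5, 4.0, and 16.0 GB, matching the figures above. The hypothetical 32-head variant needs 2.0 GB at 4K, four times the GQA figure.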

Diffusion models: weights + activations

A diffusion model has no context window and no KV cache. Instead, its extra memory comes from activations — the intermediate image data held in memory while denoising. That scales with the number of pixels, so it grows with output resolution, not context:

Activations (GB) ≈ base × (width × height) ÷ (1024 × 1024)
// base ≈ 1.8 GB at 1024×1024, giving ~4 GB at 1536², ~7 GB at 2048²

Flux.1 [dev] (12B) at GGUF Q4, 1024×1024:
weights ≈ 6.5 GB + activations ≈ 1.8 GB → ≈ 8.3 GB
Same model at FP16, 1024×1024 → ≈ 26 GB
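The activation term can be sketched the same way. This is a rough illustration under stated assumptions: `base_gb = 1.8` is taken from the 1024×1024 figure quoted above, the function names are hypothetical, and real frameworks will deviate from this simple pixel scaling.

```python
def activation_gb(width: int, height: int, base_gb: float = 1.8) -> float:
    """Activation memory in GB, scaled by pixel count relative to 1024x1024.

    base_gb is the assumed activation cost at 1024x1024.
    """
    return base_gb * (width * height) / (1024 * 1024)

def diffusion_total_gb(weights_gb: float, width: int, height: int) -> float:
    """Total estimate: model weights plus resolution-dependent activations."""
    return weights_gb + activation_gb(width, height)

for side in (1024, 1536, 2048):
    print(f"{side}x{side}: ~{activation_gb(side, side):.2f} GB activations")

# A 12B model at FP16 holds about 12 x 2 = 24 GB of weights.
print(f"12B FP16 at 1024x1024: ~{diffusion_total_gb(24.0, 1024, 1024):.1f} GB")
```

Doubling each side of the image quadruples the pixel count, so the activation estimate at 2048×2048 is four times the 1024×1024 figure.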

Video (HunyuanVideo, Wan) multiplies activation memory by the number of frames, which is why even short clips need 24 GB+. Treat the video preset as a rough single-frame floor — real clips demand considerably more.

Why real numbers vary

Architecture: GQA, MoE sparsity, and extra text encoders (Flux ships a large T5 encoder) all shift the totals.
Framework: llama.cpp, vLLM, ComfyUI and diffusers each allocate memory differently, and offer offloading that trades speed for lower VRAM.
Quantization scheme: Q4_K_M is mixed-precision. It keeps some tensors at higher precision, so it averages roughly 4.5 to 5 bits/weight, not exactly 4.

Use these estimates for planning, then confirm with your tool's live VRAM display.