Language Model LLM

Weights + KV cache. Memory grows with context length.

    Diffusion Model Image / Video

    Weights + activations. Memory grows with output resolution.


      How the estimates work

      Both calculators start from the same idea: every parameter takes up memory, and the quantization level decides how many bytes each one needs (16-bit = 2 bytes, 8-bit = 1 byte, 4-bit = 0.5 bytes). Where they differ is the second term: the KV cache for language models, activations for diffusion models.
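The shared weight term can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual code: the name `weight_memory_gb` is hypothetical, and it assumes 1 GB = 10^9 bytes, so a model with N billion parameters needs N times the bytes-per-parameter figure in GB.

```python
# Bytes per parameter at each quantization level, as listed in the text.
BYTES_PER_PARAM = {16: 2.0, 8: 1.0, 4: 0.5}


def weight_memory_gb(params_billions: float, bits: int) -> float:
    """Approximate weight memory in GB.

    Assumption: 1 GB = 10**9 bytes, so billions of parameters multiplied
    by bytes per parameter gives GB directly.
    """
    return params_billions * BYTES_PER_PARAM[bits]


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"8B model at {bits}-bit: {weight_memory_gb(8, bits):.1f} GB")
```

Halving the bit width halves the weight term; the second term (KV cache or activations) is unaffected by weight quantization in this simple model.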

      Language models: weights + KV cache

      An LLM also has to store a KV cache: the attention state for every token currently in the context window. This is what makes long context expensive, and it scales linearly with context length:

      KV cache (GB) = 2 × layers × kv_heads × head_dim × context_tokens × 2 bytes ÷ 1024³
      // 2 = Key + Value  ·  KV stored in FP16 (2 bytes)

      Llama 3.1 8B (32 layers, 8 KV heads via GQA, head dim 128):
        at 4K context → ≈ 0.5 GB
        at 32K context → ≈ 4.0 GB  (8× the context = 8× the cache)
        at 128K context → ≈ 16 GB

      The calculator assumes modern grouped-query attention (8 KV heads), which is why the cache is far smaller than older full multi-head-attention estimates. Double the context, double the cache; verify it yourself by dragging the slider.
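The KV cache formula above translates directly into Python. This is a sketch rather than the calculator's implementation: `kv_cache_gb` is a hypothetical name, and the 32-head comparison at the end is an assumption added to illustrate a full multi-head layout (one KV head per query head on a 32-head model).

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GB (1 GB = 1024**3 bytes, matching the formula).

    The leading 2 accounts for storing both Key and Value tensors;
    bytes_per_value=2 corresponds to FP16 storage.
    """
    total_bytes = 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 1024**3


if __name__ == "__main__":
    # Llama 3.1 8B figures from the text: 32 layers, 8 KV heads, head dim 128.
    for ctx in (4096, 32768, 131072):
        print(f"{ctx:>7} tokens: {kv_cache_gb(32, 8, 128, ctx):.1f} GB")

    # Assumption for comparison only: 32 KV heads, i.e. no grouped-query sharing.
    print(f"MHA-style at 4K: {kv_cache_gb(32, 32, 128, 4096):.1f} GB")
```

With 8 KV heads instead of 32, the cache is a quarter of the multi-head figure at every context length, which is the saving the paragraph above attributes to GQA.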

      Diffusion models: weights + activations

      A diffusion model has no context window and no KV cache. Instead, its extra memory comes from activations: the intermediate image data held in memory while denoising. That scales with the number of pixels, so it grows with output resolution, not context:

      Activations (GB) ≈ base × (width × height) ÷ (1024 × 1024)
      // ~1.8 GB at 1024×1024, ~4 GB at 1536², ~7 GB at 2048²

      Flux.1 [dev] (12B) at GGUF Q4, 1024×1024:
        weights ≈ 6.5 GB + activations ≈ 1.8 GB → ≈ 8.3 GB
      Same model at FP16, 1024×1024 → ≈ 26 GB
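The activation rule and the FP16 total can be reproduced with a short Python sketch. The function names are hypothetical, and the base of 1.8 GB is taken from the 1024×1024 figure quoted above; the calculator's real constant may differ.

```python
# Assumption: base activation memory of ~1.8 GB at 1024x1024, as quoted in the text.
BASE_ACTIVATION_GB = 1.8


def activation_gb(width: int, height: int,
                  base_gb: float = BASE_ACTIVATION_GB) -> float:
    """Activation memory, scaled linearly with pixel count from 1024x1024."""
    return base_gb * (width * height) / (1024 * 1024)


def diffusion_total_gb(weights_gb: float, width: int, height: int) -> float:
    """Total estimate: weights plus resolution-dependent activations."""
    return weights_gb + activation_gb(width, height)


if __name__ == "__main__":
    for side in (1024, 1536, 2048):
        print(f"{side}x{side}: {activation_gb(side, side):.2f} GB activations")

    # 12B parameters at FP16 (2 bytes each) is about 24 GB of weights.
    print(f"FP16 total at 1024x1024: {diffusion_total_gb(24.0, 1024, 1024):.1f} GB")
```

Because the term is linear in pixel count, doubling both width and height quadruples activation memory while the weight term stays fixed.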

      Video (HunyuanVideo, Wan) multiplies activation memory by the number of frames, which is why even short clips need 24 GB+. Treat the video preset as a rough single-frame floor; real clips demand considerably more.
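Taking the sentence above literally, a per-frame multiplier can be sketched as follows. This is a rough illustration only: `video_activation_gb` is a hypothetical name, the 1.8 GB base is borrowed from the image figure, and real video pipelines may need noticeably more or less memory per frame than this linear model suggests.

```python
def video_activation_gb(width: int, height: int, frames: int,
                        base_gb: float = 1.8) -> float:
    """Naive video activation estimate: single-frame activations times frame count.

    Assumption: base_gb is the ~1.8 GB single-image figure at 1024x1024.
    """
    per_frame_gb = base_gb * (width * height) / (1024 * 1024)
    return per_frame_gb * frames


if __name__ == "__main__":
    for frames in (1, 8, 16):
        gb = video_activation_gb(1280, 720, frames)
        print(f"1280x720, {frames:>2} frames: {gb:.1f} GB activations")
```

Under these assumptions a 16-frame 1280×720 clip already lands at roughly 25 GB of activations before weights are counted, which is consistent with the 24 GB+ figure above.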

      Why real numbers vary

      • Architecture: GQA, MoE sparsity, and extra text encoders (Flux ships a large T5 encoder) all shift the totals.
      • Framework: llama.cpp, vLLM, ComfyUI and diffusers each allocate memory differently, and offer offloading that trades speed for lower VRAM.
      • Quantization scheme: Q4_K_M is mixed-precision; it averages ~4.5 bits/weight, not exactly 4.
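The quantization point is easy to quantify. The sketch below compares a nominal 4-bit estimate with the ~4.5 bits/weight average quoted for Q4_K_M; the helper name is hypothetical and 1 GB is taken as 10^9 bytes.

```python
def quantized_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weight memory in GB for an average bits-per-weight figure.

    Assumption: 1 GB = 10**9 bytes, so billions of parameters times
    bytes per weight (bits / 8) gives GB directly.
    """
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    nominal = quantized_weight_gb(12, 4.0)    # idealized 4-bit
    q4_k_m = quantized_weight_gb(12, 4.5)     # ~4.5 bits/weight average
    print(f"12B at exactly 4 bits: {nominal:.2f} GB")
    print(f"12B at ~4.5 bits:      {q4_k_m:.2f} GB")
    print(f"difference:            {q4_k_m - nominal:.2f} GB")
```

For a 12B model the half-bit difference is about 0.75 GB, enough to matter on a card where the estimate sits close to the VRAM limit.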

      Use these for planning, then confirm with your tool's live VRAM display.