VRAM & Model Size — Local AI Guide

What needs to fit in VRAM?

VRAM (video RAM) is the dedicated memory on your GPU. During inference, three things compete for that space:

Model weights — the bulk: all those billions of numbers that define the model.
Activations — intermediate values computed as tokens pass through each layer.
KV cache — grows with your context window length; stores attention keys and values for all past tokens.

All three must be resident in VRAM simultaneously. If they don't fit, either the model fails to load or part of it spills over to system RAM, which typically cuts throughput by 10–100×.
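To make the KV cache term concrete, here is a rough sketch of how its size scales with context length. The function name and parameters are illustrative, not taken from any particular library. The layer and head counts are the published Llama 3.1 8B configuration, and the cache is assumed to be stored at 2 bytes per value (FP16).

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Bytes needed to hold attention keys and values for one sequence."""
    # The leading 2 covers one key tensor plus one value tensor per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


# Assumed config for Llama 3.1 8B: 32 layers, 8 KV heads, head dimension 128.
for ctx in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
    print(f"{ctx:>7} tokens -> {gib:.1f} GiB")
```

Under these assumptions the cache costs 128 KiB per token, so 8K tokens take about 1 GiB and the full 128K context about 16 GiB, which is more than the Q4 weights themselves.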

The formula

        VRAM ≈ (params × bytes_per_param) × 1.2
        
        Example — Llama 3.1 8B at FP16:

        8,000,000,000 × 2 bytes × 1.2 = 19.2 GB
        
        Same model at Q4 (0.5 bytes/param):

        8,000,000,000 × 0.5 bytes × 1.2 = 4.8 GB

The 1.2× multiplier accounts for activations and base KV cache overhead. Longer context windows add more on top.
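The formula is simple enough to script. The sketch below is a minimal version of it; `estimate_vram_gb` and the `BYTES_PER_PARAM` table are names made up for this example, and the Q8 entry (1 byte per parameter) is an assumption added alongside the FP16 and Q4 values given above.

```python
# Bytes per parameter. FP16 and Q4 come from the examples above;
# Q8 = 1.0 is an assumed value for 8-bit quantization.
BYTES_PER_PARAM = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}


def estimate_vram_gb(params_billions, quant, overhead=1.2):
    """Rough VRAM estimate in GB, taking 1 GB as 1e9 bytes."""
    # Billions of parameters times bytes per parameter gives GB directly.
    return params_billions * BYTES_PER_PARAM[quant] * overhead


print(round(estimate_vram_gb(8, "FP16"), 1))  # 19.2
print(round(estimate_vram_gb(8, "Q4"), 1))    # 4.8
```

Real quantization formats store scales and other metadata alongside the weights, so actual file sizes usually come out somewhat above these round numbers.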

How this maps to your RTX / Spark

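Every GB of VRAM you add unlocks a bigger model or a longer context window. The RTX 5090's 32 GB is currently the consumer sweet spot: by the formula above it runs a 30B model at Q4 (about 18 GB) with plenty of room for context, and 13B at FP16 (about 31 GB) only just fits. A 70B model at Q4 does not fit, because the weights alone are about 35 GB, or about 42 GB with the 1.2× overhead, so part of the model has to be offloaded to system RAM. The RTX 4090's 24 GB handles 8B at FP16 (about 19 GB) or 30B at Q4 (about 18 GB), but not 13B at FP16 or 70B at Q4.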

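DGX Spark's 128 GB of unified memory changes the problem: for most models, quantization becomes a choice rather than a necessity. A 70B model at 8-bit quantization (about 84 GB at 1 byte per parameter, including the 1.2× overhead) fits with room left for a long context window. Fully unquantized 70B is still out of reach, since the FP16 weights alone take about 140 GB.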
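As a sanity check, the formula from earlier in this guide can be applied to the three memory sizes mentioned in this section. This is a sketch only: the function names are hypothetical, Q8 is assumed to be 1 byte per parameter, and the estimate ignores extra KV cache for long contexts.

```python
GPUS_GB = {"RTX 4090": 24, "RTX 5090": 32, "DGX Spark": 128}
# FP16 and Q4 as in the formula section; Q8 = 1.0 is an assumption.
BYTES = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}


def fits(params_billions, quant, vram_gb, overhead=1.2):
    """True if the estimated footprint is within the given memory size."""
    return params_billions * BYTES[quant] * overhead <= vram_gb


def largest_precision(params_billions, vram_gb):
    """Highest-precision format in BYTES that fits, or None if none does."""
    for quant in ("FP16", "Q8", "Q4"):
        if fits(params_billions, quant, vram_gb):
            return quant
    return None


for name, vram in GPUS_GB.items():
    for size in (8, 13, 30, 70):
        print(f"{name:>9} {size:>3}B -> {largest_precision(size, vram)}")
```

With these assumptions, 70B comes out as `None` on both RTX cards and as `Q8` on DGX Spark, while 30B lands at `Q4` on the RTX cards and `FP16` on DGX Spark.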