VRAM & Model Size
The single biggest factor in "can I run this?": all model weights, activations, and the KV cache must fit in your GPU's memory simultaneously.
What needs to fit in VRAM?
VRAM (video RAM) is the dedicated memory on your GPU. During inference, three things compete for that space:
- Model weights: the bulk of the footprint, all those billions of numbers that define the model.
- Activations: intermediate values computed as tokens pass through each layer.
- KV cache: grows with your context window length; stores attention keys and values for all past tokens.
All three must be resident in VRAM simultaneously. If you run out, the model either won't load or forces your GPU to swap to system RAM, which typically drops throughput by 10–100×.
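To make the "grows with your context window" point concrete, here is a minimal sketch of the usual KV cache size estimate: two tensors (keys and values) per layer, each holding one vector per KV head per past token. The function name and the example configuration (32 layers, 8 KV heads, head dimension 128, roughly what Llama 3.1 8B is commonly described as using) are assumptions for illustration, not values taken from this page.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens,
                   bytes_per_value=2):
    """Rough KV cache size for a single sequence.

    The leading 2 covers the two cached tensors per layer: keys and values.
    bytes_per_value=2 assumes the cache is stored in FP16.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * context_tokens * bytes_per_value)


if __name__ == "__main__":
    # Assumed config, for illustration only: 32 layers, 8 KV heads, head_dim 128.
    for tokens in (8_192, 32_768, 131_072):
        size_gb = kv_cache_bytes(32, 8, 128, tokens) / 1e9
        print(f"{tokens:>7} tokens: {size_gb:.2f} GB of KV cache")
```

Under these assumptions the cache scales linearly with context length, so a context 16× longer needs 16× the cache memory.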
The formula
VRAM (GB) ≈ parameters × bytes per parameter × 1.2 ÷ 10^9
Example: Llama 3.1 8B at FP16:
8,000,000,000 × 2 bytes × 1.2 = 19.2 GB
Same model at Q4 (0.5 bytes/param):
8,000,000,000 × 0.5 bytes × 1.2 = 4.8 GB
The 1.2× multiplier accounts for activations and base KV cache overhead. Longer context windows add more on top.
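The same arithmetic can be sketched as a small Python helper. The function name and the bytes-per-parameter table are illustrative assumptions; only the FP16 (2 bytes) and Q4 (0.5 bytes) figures and the 1.2 multiplier come from the examples above.

```python
# Bytes per parameter for the two precisions used in the examples above.
BYTES_PER_PARAM = {"FP16": 2.0, "Q4": 0.5}


def estimate_vram_gb(num_params, bytes_per_param, overhead=1.2):
    """Estimate VRAM in decimal GB: params x bytes/param x overhead.

    overhead=1.2 is the multiplier for activations and base KV cache;
    long context windows need extra memory on top of this estimate.
    """
    return num_params * bytes_per_param * overhead / 1e9


if __name__ == "__main__":
    for precision, bpp in BYTES_PER_PARAM.items():
        print(f"8B at {precision}: {estimate_vram_gb(8e9, bpp):.1f} GB")
```

Treat the result as a floor rather than an exact figure, since the multiplier covers only baseline overhead.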
Interactive VRAM Calculator
Use the calculator below to estimate how much VRAM your model needs, and see which GPUs can run it.
Every GB of VRAM you add unlocks a bigger model or a longer context window. The RTX 5090's 32 GB is currently the consumer sweet spot: a 30B model at Q5 or Q6 fits cleanly, while a 70B model at Q4 does not, since its weights alone take ≈35 GB (≈42 GB with the 1.2× multiplier). The RTX 4090's 24 GB handles 13B at 8-bit (≈15.6 GB); 13B at FP16 needs ≈26 GB for weights alone, and 70B at Q4 requires offloading part of the model to system RAM.
DGX Spark's 128 GB changes the problem: quantization becomes far less of a constraint. A 70B model at 8-bit (≈84 GB by the formula above) loads with headroom to spare, although 70B at FP16 still does not fit, since its weights alone take ≈140 GB.
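As a rough static stand-in for the fit check, the sketch below applies the 1.2× formula to the three memory capacities named in this section. The function names and the choice of test models are illustrative assumptions, and the check ignores extra KV cache for long contexts.

```python
# Memory capacities in GB, as quoted in this section.
GPU_MEMORY_GB = {"RTX 4090": 24, "RTX 5090": 32, "DGX Spark": 128}


def estimate_vram_gb(num_params, bytes_per_param, overhead=1.2):
    """params x bytes/param x overhead, in decimal GB."""
    return num_params * bytes_per_param * overhead / 1e9


def gpus_that_fit(required_gb, gpus=GPU_MEMORY_GB):
    """Return the names of devices whose memory covers the estimate."""
    return [name for name, capacity in gpus.items() if capacity >= required_gb]


if __name__ == "__main__":
    cases = [
        ("8B at FP16", 8e9, 2.0),
        ("13B at FP16", 13e9, 2.0),
        ("70B at Q4", 70e9, 0.5),
        ("70B at FP16", 70e9, 2.0),
    ]
    for label, params, bpp in cases:
        need = estimate_vram_gb(params, bpp)
        fits = ", ".join(gpus_that_fit(need)) or "none of the listed devices"
        print(f"{label}: ~{need:.1f} GB -> {fits}")
```

A model that lands within a gigabyte or two of a device's capacity should be treated as not fitting in practice, because the estimate leaves no room for a longer context.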