Quantization
Same model, smaller file, almost as smart, and it fits on your GPU. Quantization trades a tiny bit of accuracy for a massive reduction in memory.
The problem: model weights are big
Each parameter in a model is stored as a floating-point number. In the default format, FP16 (16-bit float), each parameter takes 2 bytes. A 7B model therefore needs 7 billion × 2 bytes = 14 GB for its weights alone. A 70B model needs 140 GB. Most people don't have that much VRAM.
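The arithmetic above is easy to reproduce. The sketch below is a minimal helper (the name `weight_memory_gb` is made up for illustration) that assumes 1 GB = 10^9 bytes and counts only the weights; runtime extras such as the KV cache and activations come on top.

```python
def weight_memory_gb(params_billions: float, bits_per_param: float = 16) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    # billions of parameters x bytes per parameter = gigabytes
    return params_billions * bytes_per_param


if __name__ == "__main__":
    for size in (7, 70):
        print(f"{size}B at FP16: {weight_memory_gb(size):.0f} GB")
```

Passing a different `bits_per_param` (8, 6, 5, 4) gives the footprint at lower precision.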
Quantization compresses the weights by storing them with less precision: fewer bits per number. The model gets smaller, but the core information survives, because most of the precision in a trained model is redundant anyway.
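To make "fewer bits per number" concrete, here is a deliberately simplified sketch of symmetric round-to-nearest quantization with a single scale factor. It is an illustration only: real formats typically use per-block scales and more elaborate schemes, and the function names and sample weights below are invented for the example.

```python
def quantize(weights, bits=4):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]


weights = [0.12, -0.53, 0.98, -0.05, 0.31]   # toy values, not from a real model
q, scale = quantize(weights, bits=4)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

if __name__ == "__main__":
    print("integers:", q)
    print("restored:", [round(r, 3) for r in restored])
    print("max error:", round(max_error, 4))
```

Each weight is now a small integer plus one shared scale, and the round-trip error is bounded by half the scale. That bounded error is the "tiny bit of accuracy" being traded away.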
Quantization formats explained
| Format | Bits / param | Bytes / param | VRAM vs FP16 | Quality impact |
|---|---|---|---|---|
| FP16 | 16 | 2.0 | 1× (baseline) | Reference: full quality |
| FP8 | 8 | 1.0 | ½× | Negligible; native on RTX 50 |
| INT8 / Q8 | 8 | 1.0 | ½× | Minimal, barely noticeable |
| Q6 | 6 | 0.75 | ⅜× | Very small |
| Q5 | 5 | 0.625 | ~⅓× | Small |
| Q4 | 4 | 0.5 | ¼× | Moderate, still very usable |
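One way to use the table is to pick the highest-precision format whose weights still fit a given VRAM budget. The sketch below does that with the bits-per-parameter values from the table. The helper name `best_fit` and the `overhead_gb` allowance for KV cache and runtime buffers are assumptions for illustration, not measured figures.

```python
# Bits per parameter from the table above, highest precision first.
FORMATS = [("FP16", 16), ("Q8", 8), ("Q6", 6), ("Q5", 5), ("Q4", 4)]


def best_fit(params_billions, vram_gb, overhead_gb=2.0):
    """Return (format, weight_size_gb) for the most precise format that fits.

    overhead_gb is a placeholder allowance for everything besides the weights;
    tune it for your context length and runtime. Returns None if nothing fits.
    """
    for name, bits in FORMATS:
        size_gb = params_billions * bits / 8
        if size_gb + overhead_gb <= vram_gb:
            return name, size_gb
    return None


if __name__ == "__main__":
    for params, vram in [(7, 24), (70, 48), (70, 24)]:
        print(f"{params}B on {vram} GB:", best_fit(params, vram))
```

Treat the result as a starting point: actual file sizes vary with the quantization variant, and long contexts need more headroom than the default allowance.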
Quantization file formats
GGUF is the most common format for local use: a single file containing the model and its metadata, supported by llama.cpp, Ollama, LM Studio, and most local AI tools. Filenames typically include the quant level, e.g. Llama-3.1-8B-Q4_K_M.gguf.
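Because the quant level usually sits at the end of the filename, it can be pulled out with a small pattern match. The sketch below is a hypothetical helper, not part of any GGUF tooling. It assumes the common but informal naming convention of a `-`, `_` or `.` separator followed by a tag such as `Q4_K_M`, `Q8_0` or `F16`, so files named differently will simply return `None`.

```python
import re

# Matches a trailing quant tag such as Q4_K_M, Q8_0, IQ4_XS, F16, BF16 or F32.
_QUANT_TAG = re.compile(
    r"[-_.]((?:I?Q\d+(?:_[A-Z0-9]+)*)|BF16|F16|F32)\.gguf$",
    re.IGNORECASE,
)


def quant_level(filename: str):
    """Return the quant tag embedded in a GGUF filename, or None if absent."""
    match = _QUANT_TAG.search(filename)
    return match.group(1).upper() if match else None


if __name__ == "__main__":
    print(quant_level("Llama-3.1-8B-Q4_K_M.gguf"))
```

The filename is only a hint; the file's own metadata is the authoritative record of how the weights are stored.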
Other formats: AWQ and GPTQ are GPU-optimised quantization methods used by frameworks like vLLM and HuggingFace Transformers. Safetensors is the preferred container for unquantized weights, and AWQ or GPTQ checkpoints are commonly distributed as safetensors files as well.
How much quality do you lose?
Q4 on a 70B model typically outperforms FP16 on a 7B model: the bigger architecture wins even at lower precision. This is the key insight: don't be afraid of quantization. A 70B Q4 is often smarter than a 13B FP16 for reasoning tasks, and at roughly 35 GB of weights versus 26 GB it sits in a similar VRAM class.
RTX 50-series GPUs (5090, 5080, 5070 Ti, 5070) run FP8 natively on the Tensor Cores of NVIDIA's Blackwell architecture. FP8 halves weight memory relative to FP16 and raises throughput, while accuracy stays close to FP16. The RTX 40 series (Ada Lovelace) also has FP8-capable Tensor Cores; what Blackwell adds in hardware is FP4, so FP8 gains on the older cards depend mainly on software support.
The DGX Spark's 128 GB of unified memory means you rarely need aggressive quantization. Running Llama 3.3 70B at Q8 (70 GB) leaves ample headroom. You can even experiment with multiple concurrent models that would be impossible on any single consumer GPU.