Quantization — Local AI Guide

The problem: model weights are big

Each parameter in a model is stored as a floating-point number. In the common default format, FP16 (16-bit float), each parameter takes 2 bytes. A 7B model therefore needs about 14 GB for its weights alone, and a 70B model about 140 GB. Most consumer GPUs have far less VRAM than that.
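The arithmetic here is parameter count times bytes per parameter. A minimal sketch, assuming decimal gigabytes (1 GB = 10^9 bytes) and counting the weights only, not the KV cache or activations; the function name weights_gb is hypothetical:

```python
def weights_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate size of the weights alone, in decimal GB."""
    bytes_per_param = bits_per_param / 8
    # billions of parameters * bytes per parameter = billions of bytes = GB
    return params_billion * bytes_per_param


print(weights_gb(7, 16))   # 7B model at FP16
print(weights_gb(70, 16))  # 70B model at FP16
print(weights_gb(70, 4))   # 70B model at 4 bits per parameter
```

Real memory use at inference time will be higher than this estimate, since context and runtime buffers also take space.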

Quantization compresses the weights by storing them with less precision — fewer bits per number. The model gets smaller, but the core information survives, because most of the precision in a trained model is redundant anyway.
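To make "fewer bits per number" concrete, here is a deliberately simplified sketch of symmetric absmax quantization in pure Python. It is an illustration only: real formats such as GGUF's K-quants work on blocks of weights with more elaborate schemes, and the weight values and function names below are made up for the example.

```python
def quantize(weights, bits):
    """Map floats to signed integers using one shared scale (absmax)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        scale = 1.0  # all-zero input: any scale works
    q = [round(w / scale) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]


weights = [0.12, -0.53, 0.98, -0.07, 0.31]  # made-up example values
for bits in (8, 4):
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit: ints={q} max error={max_err:.4f}")
```

The rounding error per weight is at most half the scale, so dropping from 8 to 4 bits makes the error larger but keeps it bounded, which is the trade the table below summarises.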

Quantization formats explained

Format	Bits / param	Bytes / param	VRAM vs FP16	Quality impact
FP16	16	2.0	1× (baseline)	Reference — full quality
FP8	8	1.0	½×	Negligible — native on RTX 50
INT8 / Q8	8	1.0	½×	Minimal — barely noticeable
Q6	6	0.75	⅜×	Very small
Q5	5	0.625	~⅓×	Small
Q4	4	0.5	¼×	Moderate — still very usable
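The bytes-per-parameter and VRAM columns follow directly from the bit width. A small sketch that recomputes them, treating each format's nominal bit width as exact (real quantized files also store per-block scale factors, so they come out somewhat larger); the names FORMAT_BITS and table_row are hypothetical:

```python
from fractions import Fraction

# Nominal bit widths taken from the table above.
FORMAT_BITS = {"FP16": 16, "FP8": 8, "INT8 / Q8": 8, "Q6": 6, "Q5": 5, "Q4": 4}


def table_row(bits: int):
    """Return (bytes per parameter, exact size ratio relative to FP16)."""
    bytes_per_param = bits / 8
    ratio = Fraction(bits, 16)
    return bytes_per_param, ratio


for name, bits in FORMAT_BITS.items():
    bytes_per_param, ratio = table_row(bits)
    print(f"{name:10s} {bits:2d} bits  {bytes_per_param:.3f} B/param  {ratio} of FP16")
```

Using exact fractions shows that Q5 is 5/16 of FP16 (about 0.31), which the table rounds to roughly one third.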

Quantization file formats

GGUF is the most common format for local use — it's a single file containing the model and its metadata, and is supported by llama.cpp, Ollama, LM Studio, and most local AI tools. Filenames typically include the quant level, e.g. Llama-3.1-8B-Q4_K_M.gguf.
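Because the quant level is usually part of the filename, it can be pulled out with a regular expression. This is a sketch that assumes the naming pattern of the example above (a quant tag such as Q4_K_M directly before the .gguf extension); naming is a convention, not a guarantee, and quant_level is a hypothetical helper:

```python
import re

# Matches tags like Q4_K_M, Q8_0, IQ4_XS, F16 or BF16 right before ".gguf".
_QUANT_TAG = re.compile(r"[-.](I?Q\d[A-Z0-9_]*|BF16|F16|F32)\.gguf$", re.IGNORECASE)


def quant_level(filename: str):
    """Return the quant tag from a GGUF filename, or None if none is found."""
    match = _QUANT_TAG.search(filename)
    return match.group(1) if match else None


print(quant_level("Llama-3.1-8B-Q4_K_M.gguf"))
print(quant_level("model.gguf"))
```

Files that do not follow the convention simply return None, so a caller can fall back to reading the metadata stored inside the file.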

Other formats: AWQ and GPTQ are GPU-optimised quantization methods used by frameworks like vLLM and HuggingFace Transformers. Safetensors is the preferred container for unquantized weights, and AWQ and GPTQ checkpoints are commonly distributed as safetensors files as well.

How much quality do you lose?

Q4 on a 70B model typically outperforms FP16 on a 7B model: the bigger architecture wins even at lower precision. This is the key insight: don't be afraid of quantization. A 70B model at Q4 is often smarter than a 13B model at FP16 for reasoning tasks, and its footprint is in the same range, roughly 35 GB of weights versus 26 GB.

How this maps to your RTX / Spark

RTX 50-series GPUs (5090, 5080, 5070 Ti, 5070) use NVIDIA's Blackwell architecture, whose Tensor Cores run FP8 natively and add hardware support for FP4. FP8 halves weight memory relative to FP16, usually with accuracy close to FP16. RTX 40-series (Ada Lovelace) Tensor Cores also support FP8 in hardware but lack FP4, so the gains from the lowest-precision formats are smaller there.

The DGX Spark's 128 GB of unified memory means you rarely need aggressive quantization. Running Llama 3.3 70B at Q8 takes about 70 GB for the weights, which leaves roughly 58 GB for context, other models, and the system itself. You can even experiment with multiple concurrent models that would be impossible on any single consumer GPU.
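A quick way to check whether a set of models fits is to add up their weight sizes against the memory budget. A rough sketch, assuming the whole 128 GB is available (in practice the operating system and KV cache take a share) and using nominal bit widths; the 8B second model and the helper names are hypothetical:

```python
UNIFIED_MEMORY_GB = 128  # DGX Spark figure from the text


def weights_gb(params_billion, bits_per_param):
    """Weights-only size in decimal GB."""
    return params_billion * bits_per_param / 8


def headroom_gb(models, budget_gb=UNIFIED_MEMORY_GB):
    """models: list of (params_billion, bits_per_param) tuples.

    Returns the memory left over; a negative value means the set does not fit.
    """
    used = sum(weights_gb(p, b) for p, b in models)
    return budget_gb - used


print(headroom_gb([(70, 8)]))          # 70B at Q8 alone
print(headroom_gb([(70, 8), (8, 8)]))  # plus a hypothetical 8B model at Q8
print(headroom_gb([(70, 16)]))         # 70B at FP16
```

A negative result for the FP16 case shows why even a 128 GB machine still benefits from mild quantization on the largest models.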