The problem: model weights are big

Each parameter in a model is stored as a floating-point number. In the default format, FP16 (16-bit float), each parameter takes 2 bytes. A 7B model is therefore 14 GB. A 70B model is 140 GB. Most people don't have that much VRAM.

Quantization compresses the weights by storing them with less precision: fewer bits per number. The model gets smaller, but the core information survives, because most of the precision in a trained model is redundant anyway.

FP16 (16-bit): millimeter precision, 65,536 distinct values (high precision, 2 bytes/weight)
INT8 (8-bit): centimeter precision, 256 distinct values (good enough, 1 byte/weight)
INT4 / Q4 (4-bit): rough ruler
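The ruler analogy above can be made concrete with a small round-trip experiment. The sketch below uses plain symmetric uniform quantization with a single scale for the whole list, which is an assumption chosen for illustration; real formats such as the GGUF K-quants use per-block scales and other refinements. The sample weights and the function names are hypothetical.

```python
def quantize_roundtrip(weights, bits):
    """Quantize floats to signed integers of the given width, then restore them.

    Illustrative symmetric scheme with one scale for the whole list;
    not the method used by any particular library.
    """
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    restored = []
    for w in weights:
        q = round(w / scale)                   # nearest integer step
        q = max(-qmax, min(qmax, q))           # clamp to the representable range
        restored.append(q * scale)             # back to float
    return restored


def max_error(weights, bits):
    """Largest absolute difference between original and restored weights."""
    restored = quantize_roundtrip(weights, bits)
    return max(abs(a - b) for a, b in zip(weights, restored))


# Made-up sample values standing in for a handful of model weights.
weights = [0.82, -0.41, 0.07, -0.93, 0.30, 0.005]

for bits in (8, 4):
    print(f"{bits}-bit: max round-trip error {max_error(weights, bits):.4f}")
```

The 4-bit error is noticeably larger than the 8-bit error, but each restored weight still lands within half a step of the original, which is the sense in which the core information survives.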

Quantization formats explained

| Format | Bits / param | Bytes / param | VRAM vs FP16 | Quality impact |
|---|---|---|---|---|
| FP16 | 16 | 2.0 | 1× (baseline) | Reference: full quality |
| FP8 | 8 | 1.0 | ½× | Negligible; native on RTX 50 |
| INT8 / Q8 | 8 | 1.0 | ½× | Minimal; barely noticeable |
| Q6 | 6 | 0.75 | ⅜× | Very small |
| Q5 | 5 | 0.625 | ~⅓× | Small |
| Q4 | 4 | 0.5 | ¼× | Moderate; still very usable |
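The bytes-per-parameter column turns directly into a size estimate: parameters times bytes per parameter. Here is a minimal sketch of that arithmetic, assuming 1 GB = 10^9 bytes and counting weights only (no KV cache, activations, or per-block scale overhead, so real files tend to be somewhat larger). The function and table names are hypothetical.

```python
# Nominal bits per parameter for each format in the table above.
BITS_PER_PARAM = {"FP16": 16, "FP8": 8, "Q8": 8, "Q6": 6, "Q5": 5, "Q4": 4}


def weight_size_gb(params_billions, fmt):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = BITS_PER_PARAM[fmt] / 8
    # billions of params * bytes per param = billions of bytes = GB
    return params_billions * bytes_per_param


for fmt in BITS_PER_PARAM:
    size_7b = weight_size_gb(7, fmt)
    size_70b = weight_size_gb(70, fmt)
    print(f"{fmt:>5}: 7B = {size_7b:6.2f} GB   70B = {size_70b:6.2f} GB")
```

Treat the results as a lower bound when deciding whether a model fits: context length and runtime buffers add to the total.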

Quantization file formats

GGUF is the most common format for local use: it's a single file containing the model and its metadata, and is supported by llama.cpp, Ollama, LM Studio, and most local AI tools. Filenames typically include the quant level, e.g. Llama-3.1-8B-Q4_K_M.gguf.
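Because the quant level usually sits in the filename, a script can read it off without opening the file. The sketch below assumes the tag is the last dash- or dot-separated token before the .gguf extension, which is a common naming habit rather than a rule of the format; the regular expression and the helper name are hypothetical, and a file that follows a different convention will simply return None.

```python
import re

# Assumed naming habit: "<model>-<QUANT>.gguf" or "<model>.<QUANT>.gguf".
# Matches tags such as Q4_K_M, Q8_0, IQ4_XS, F16, BF16, F32.
QUANT_TAG = re.compile(r"[-.]((?:I?Q\d\w*|BF16|F16|F32))\.gguf$", re.IGNORECASE)


def quant_tag(filename):
    """Return the quant tag embedded in a GGUF filename, or None if absent."""
    match = QUANT_TAG.search(filename)
    return match.group(1).upper() if match else None


for name in ["Llama-3.1-8B-Q4_K_M.gguf", "model.safetensors"]:
    print(name, "->", quant_tag(name))
```

The filename is only a hint. The authoritative record of how the weights are stored is the metadata inside the GGUF file itself.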

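Other formats: AWQ and GPTQ are GPU-oriented quantization methods whose checkpoints are used by frameworks like vLLM and HuggingFace Transformers. Safetensors is the usual file format for unquantized weights, and AWQ and GPTQ checkpoints are often distributed as safetensors files too.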

How much quality do you lose?

Q4 on a 70B model typically outperforms FP16 on a 7B model: a bigger architecture usually wins even at lower precision. This is the key insight: don't be afraid of quantization. A 70B Q4 is often smarter than a 13B FP16 for reasoning tasks, and at roughly 35 GB of weights versus 26 GB it lands in a comparable VRAM budget.

How this maps to your RTX / Spark

RTX 50-series GPUs (5090, 5080, 5070 Ti, 5070) support FP8 compute in hardware through NVIDIA's Blackwell architecture. FP8 halves weight memory relative to FP16 and raises Tensor Core throughput while keeping accuracy close to FP16. RTX 40-series (Ada Lovelace) Tensor Cores also support FP8, so the gap between the two generations depends largely on how well your inference software uses it.

The DGX Spark's 128 GB of unified memory means you rarely need aggressive quantization. Running Llama 3.3 70B at Q8 (70 GB) leaves ample headroom. You can even experiment with multiple concurrent models that would be impossible on any single consumer GPU.
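To check whether a set of models can sit in memory together, add up their weight sizes and compare against the total minus a safety reserve. The sketch below is a rough planner under stated assumptions: sizes are weights only at nominal bits per parameter, and the 16 GB reserve for the OS, KV cache, and activations is a hypothetical figure to tune for your workload. The second model in the list is a made-up example.

```python
def fits_in_memory(models, total_gb=128.0, reserve_gb=16.0):
    """Return (weights_gb, fits) for a list of (name, params_billions, bits).

    reserve_gb is an assumed allowance for everything that is not weights.
    """
    weights_gb = sum(params * bits / 8 for _, params, bits in models)
    return weights_gb, weights_gb <= total_gb - reserve_gb


# Llama 3.3 70B at Q8 is the example from the text; the 8B entry is hypothetical.
models = [
    ("Llama 3.3 70B @ Q8", 70, 8),
    ("8B assistant model @ Q4", 8, 4),
]

weights_gb, fits = fits_in_memory(models)
print(f"weights: {weights_gb:.1f} GB, fits with reserve: {fits}")
```

Passing a smaller total_gb (24 or 32, for example) shows how quickly the same pair stops fitting on a single consumer GPU.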