Parameters
Why does "70B" matter? Parameters are the knobs the model tuned during training — and their count determines both how smart the model is and how much memory it needs.
What is a parameter?
A neural network is a large graph of simple math operations. At each connection in that graph sits a number — a weight — that controls how strongly one node influences another. These weights are the parameters. During training, billions of examples are fed through the network, and the weights are adjusted slightly each time until the network's predictions match the desired outputs.
After training is done, the weights are frozen. The model file you download is essentially just a very large list of these frozen numbers.
Common model sizes and what they mean
| Size | Example models | FP16 VRAM needed | Use case |
|---|---|---|---|
| 1–4B | Phi-3 Mini 3.8B, Gemma 2 2B | ~3–8 GB | On-device, fast, light tasks |
| 7–8B | Llama 3.1 8B, Mistral 7B | ~14–16 GB | Great all-rounder for everyday use |
| 13B | Llama 2 13B, Vicuna 13B | ~26 GB | Better reasoning, still efficient |
| 27–34B | Gemma 2 27B, Qwen 2.5 32B | ~54–68 GB | Coding, long documents, nuanced tasks |
| 70B | Llama 3.3 70B | ~140 GB | Near-frontier quality locally |
| 405B | Llama 3.1 405B | ~810 GB | Multi-node or cloud territory |
The good news: quantization (the next topic) can cut these VRAM numbers by 4–8×, making even a 70B model runnable on a high-end consumer GPU.
More parameters ≠ always better
Architecture improvements matter enormously. A well-designed 8B model trained on high-quality data can outperform a poorly-trained 70B model on specific tasks. The parameter count is the starting point, not the whole story. Pay attention to benchmarks and community reports, not just the "B" number.
Parameter count maps almost directly to VRAM: each parameter stored at FP16 takes 2 bytes, so an 8B model needs roughly 16 GB at full precision. An RTX 4060 Ti 16GB can hold it exactly; quantized to 4-bit it drops to ~4 GB, freeing room for context and activations.
The DGX Spark's 128 GB unified memory holds a 70B model at full FP16 precision (140 GB is tight — quantized Q8 at ~70 GB fits easily) and makes 30B–70B models feel like running an 8B on a desktop GPU.