What is an LLM?
ChatGPT is one. So is Llama. So is Claude. Here's what they actually are and how they produce text.
The one-sentence version
A Large Language Model (LLM) is a neural network trained to predict the most likely next word (technically, token) given everything that came before it. That's it. Do that billions of times on massive amounts of text, and something that looks like reasoning emerges.
How text generation works
When you type a prompt, the model doesn't look up an answer in a database. It generates a response one token at a time, each token chosen based on all the previous ones.
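The loop can be sketched in a few lines of Python. This is a toy illustration, not how any real model is implemented: `TOY_MODEL`, `next_token_probs`, and `generate` are made-up names, the vocabulary has six tokens, and the toy only looks at the last token, whereas a real LLM conditions on the entire context.

```python
import random

# Hypothetical toy "model": maps the previous token to next-token probabilities.
# A real LLM computes these probabilities with a neural network over the
# whole context and a vocabulary of tens of thousands of tokens.
TOY_MODEL = {
    "<start>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.8, "sat": 0.2},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}


def next_token_probs(context):
    """Return next-token probabilities (toy: depends only on the last token)."""
    return TOY_MODEL[context[-1]]


def generate(prompt, max_tokens=10, greedy=True, seed=0):
    """Extend the prompt one token at a time until <end> or max_tokens."""
    rng = random.Random(seed)
    context = list(prompt)
    for _ in range(max_tokens):
        probs = next_token_probs(context)
        if greedy:
            # Always pick the single most likely token.
            token = max(probs, key=probs.get)
        else:
            # Sample in proportion to the probabilities.
            tokens, weights = zip(*probs.items())
            token = rng.choices(tokens, weights=weights, k=1)[0]
        if token == "<end>":
            break
        context.append(token)  # the new token becomes part of the context
    return context


print(generate(["<start>"]))  # ['<start>', 'the', 'cat', 'sat']
```

Passing `greedy=False` samples from the probabilities instead of always taking the top token, which is roughly why the same prompt can produce different answers.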
Each token is generated quickly: a capable GPU can produce 50–100 tokens per second on a 7B model. A typical 200-word response is about 260 tokens (at roughly 1.3 tokens per English word), so it takes around 3–5 seconds.
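As a back-of-the-envelope check, response time is just token count divided by generation speed. The sketch below assumes about 1.3 tokens per English word, a common rule of thumb rather than a figure from any specific tokenizer; `response_seconds` is a hypothetical helper name.

```python
def response_seconds(words, tokens_per_second, tokens_per_word=1.3):
    """Estimate generation time; tokens_per_word is an assumed average."""
    tokens = words * tokens_per_word
    return tokens / tokens_per_second


for speed in (50, 100):
    print(f"{speed} tok/s: {response_seconds(200, speed):.1f} s")
# 50 tok/s: 5.2 s
# 100 tok/s: 2.6 s
```

The real ratio depends on the tokenizer and the language, so treat the result as an order-of-magnitude estimate.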
What makes them "large"?
The "Large" in LLM refers to the number of parameters, the adjustable values the model learned during training. Modern models range from a few billion to over 400 billion parameters. More parameters generally mean more knowledge and better reasoning, but also more memory and compute to run.
Popular open-weight LLMs you can run locally
| Model | Size | Created by | Strengths |
|---|---|---|---|
| Llama 3.1 8B | 8B params | Meta | Fast, efficient, great all-round |
| Llama 3.3 70B | 70B params | Meta | Near frontier quality, needs high VRAM |
| Mistral 7B | 7B params | Mistral AI | Punches above its weight in speed |
| Qwen 2.5 32B | 32B params | Alibaba | Strong coding, long context |
| Phi-3 Mini | 3.8B params | Microsoft | Tiny but surprisingly capable |
| Gemma 2 27B | 27B params | Google | Efficient, open license |
Open-weight vs. open-source
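Models like Llama and Mistral are open-weight: the trained model file is freely downloadable, and you can run, modify, and build on it. That is not the same as open-source, which would also mean releasing the training code and data; most open-weight models do not, and some carry license restrictions. Both differ from closed models like GPT-4 or Claude, which only offer API access, so you can't download or run them locally.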
A 7–8B model fits comfortably in 8 GB VRAM (RTX 4060 and up) when quantized to 4-bit, delivering 40–80 tokens per second. At FP16 precision you need ~16 GB — an RTX 4060 Ti 16GB or similar.
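A 70B model needs ~140 GB at FP16 (70 billion parameters × 2 bytes), far beyond any single consumer GPU. Quantized to 4-bit it drops to ~35–40 GB, which still slightly exceeds the RTX 5090's 32 GB, so part of the model has to be offloaded to system RAM or quantized more aggressively. The DGX Spark, with 128 GB of unified memory, runs a quantized 70B model with room to spare, and can even run 200B+ models at 4-bit.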
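A rough rule for sizing is that weight memory equals parameter count times bytes per parameter. The sketch below covers weights only and uses nominal bit widths, both of which are simplifying assumptions: real quantized formats store a little more than 4 bits per weight, and the KV cache and runtime overhead add several more GB. `BITS_PER_PARAM` and `weight_memory_gb` are illustrative names.

```python
# Nominal bits per parameter for each precision (an idealized assumption).
BITS_PER_PARAM = {"fp16": 16, "int8": 8, "q4": 4}


def weight_memory_gb(params_billions, precision):
    """Weights-only memory in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * BITS_PER_PARAM[precision] / 8
    return bytes_total / 1e9


for size in (8, 70):
    for precision in ("fp16", "q4"):
        gb = weight_memory_gb(size, precision)
        print(f"{size}B @ {precision}: {gb:.0f} GB")
# 8B @ fp16: 16 GB
# 8B @ q4: 4 GB
# 70B @ fp16: 140 GB
# 70B @ q4: 35 GB
```

Comparing these numbers with a card's VRAM gives a quick first answer to "will it fit?", with a few GB of headroom left for context.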