The one-sentence version

A Large Language Model (LLM) is a neural network trained to predict the most likely next word (technically, token) given everything that came before it. That's it. Do that billions of times on massive amounts of text, and something that looks like reasoning emerges.

Analogy: Autocomplete on your phone has learned to predict your next word from your messages. An LLM is autocomplete trained on most of the internet — vastly larger, vastly more capable, but the same core idea.
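
To make the autocomplete idea concrete, here is a minimal sketch of a next-word predictor that simply counts which word follows which. The corpus and the function names are made up for illustration; a real LLM uses a neural network over tokens rather than a lookup of word counts, but the goal of predicting what comes next is the same.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; a real LLM trains on vastly more text.
corpus = "the sky is blue . the sky is clear . the sea is blue ."

def train_bigrams(text):
    """Count, for each word, which words follow it and how often."""
    follows = defaultdict(Counter)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    if word not in follows:
        return None
    return follows[word].most_common(1)[0][0]

model = train_bigrams(corpus)
print(predict_next(model, "is"))   # blue (seen twice, versus "clear" once)
print(predict_next(model, "sky"))  # is
```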

How text generation works

When you type a prompt, the model doesn't look up an answer in a database. It generates a response one token at a time, each token chosen based on all the previous ones.

[Figure: the prompt "The sky is" (3 tokens) goes into the LLM (billions of parameters, running on your GPU), which outputs "blue". That token feeds back in to produce "and", then "clear". One token is generated per pass, repeated until done.]
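
The token-by-token loop can be sketched in a few lines. The lookup table below is a hypothetical stand-in for the model, and the "<end>" stop token is an assumption for illustration. A real LLM computes each next token from the whole sequence using its parameters, but the feed-back loop has the same shape.

```python
# Hypothetical stand-in for a trained model: maps the most recent token
# to the next one. A real LLM conditions on the whole sequence.
NEXT_TOKEN = {"is": "blue", "blue": "and", "and": "clear", "clear": "<end>"}

def next_token(tokens):
    """One pass: pick the next token given the tokens so far."""
    return NEXT_TOKEN.get(tokens[-1], "<end>")

def generate(prompt_tokens, max_new_tokens=10):
    """Generate one token per pass until a stop token or the length cap."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        if token == "<end>":   # stop token: generation is done
            break
        tokens.append(token)   # the new token feeds back in
    return tokens

print(" ".join(generate(["The", "sky", "is"])))  # The sky is blue and clear
```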

Generation is fast: a capable GPU can produce 50–100 tokens per second on a 7B model, so a typical 200-word response (roughly 260 tokens) takes about 3–5 seconds.
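
As a rough check on generation time, the sketch below converts a word count into seconds. The ratio of 1.3 tokens per word is an assumed rule of thumb for English text, not a figure from this guide, and the function name is illustrative.

```python
TOKENS_PER_WORD = 1.3  # assumed rule of thumb for English text

def response_seconds(words, tokens_per_second, tokens_per_word=TOKENS_PER_WORD):
    """Estimate generation time for a response of `words` words."""
    return words * tokens_per_word / tokens_per_second

for speed in (50, 100):
    print(f"{speed} tok/s: {response_seconds(200, speed):.1f} s")
# 50 tok/s: 5.2 s
# 100 tok/s: 2.6 s
```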

What makes them "large"?

The first "L" in LLM ("Large") refers to the number of parameters: the adjustable values the model learned during training. Modern models range from 3 billion to over 400 billion parameters. More parameters generally mean more knowledge and better reasoning, but also more memory and compute to run.

Popular open-weight LLMs you can run locally

Model           Size          Created by   Strengths
Llama 3.1 8B    8B params     Meta         Fast, efficient, great all-round
Llama 3.3 70B   70B params    Meta         Near frontier quality, needs high VRAM
Mistral 7B      7B params     Mistral AI   Punches above its weight in speed
Qwen 2.5 32B    32B params    Alibaba      Strong coding, long context
Phi-3 Mini      3.8B params   Microsoft    Tiny but surprisingly capable
Gemma 2 27B     27B params    Google       Efficient, open license

Open-weight vs. open-source

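Models like Llama and Mistral are open-weight: the trained model file is freely downloadable, and you can run, modify, and build on it within the terms of its license. That is not the same as fully open-source, which would also mean releasing the training data and training code. Both differ from closed models like GPT-4 or Claude, which offer only API access: you can't download or run them locally.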

How this maps to your RTX / Spark

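A 7–8B model fits comfortably in 8 GB VRAM (RTX 4060 and up) when quantized to 4-bit, delivering 40–80 tokens per second. At FP16 precision the weights alone take ~14–16 GB, so a 16 GB card such as the RTX 4060 Ti 16GB is the practical minimum and leaves little headroom for long contexts.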

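A 70B model needs ~140 GB at FP16, far beyond any single consumer GPU. Quantized to 4-bit it drops to roughly 35–40 GB, which still exceeds the RTX 5090's 32 GB, so on that card it needs more aggressive quantization, partial offload to system RAM, or a second GPU. The DGX Spark, with 128 GB of unified memory, runs a 70B model at 4-bit or 8-bit with room to spare, and NVIDIA rates it for models of up to about 200B parameters at 4-bit precision.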
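
The memory figures in this section follow from simple arithmetic: parameters times bits per parameter, divided by 8 to get bytes. The sketch below is an estimate for the weights alone. The function name is illustrative, and real usage runs higher because the context cache and runtime overhead are not counted.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8

for name, params in (("8B", 8), ("70B", 70)):
    for label, bits in (("FP16", 16), ("8-bit", 8), ("4-bit", 4)):
        print(f"{name} at {label}: ~{weight_memory_gb(params, bits):.0f} GB")
# 8B at FP16: ~16 GB
# 8B at 8-bit: ~8 GB
# 8B at 4-bit: ~4 GB
# 70B at FP16: ~140 GB
# 70B at 8-bit: ~70 GB
# 70B at 4-bit: ~35 GB
```

Comparing each result against a card's VRAM gives a quick first answer to whether a model can fit at a given precision.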