What is an LLM? — Local AI Guide

The one-sentence version

A Large Language Model (LLM) is a neural network trained to predict the next most likely word (technically, token) given everything that came before it. That's it. Do that billions of times on massive amounts of text, and something that looks like reasoning emerges.

Analogy: Autocomplete on your phone has learned to predict your next word from your messages. An LLM is autocomplete trained on most of the internet — vastly larger, vastly more capable, but the same core idea.

How text generation works

When you type a prompt, the model doesn't look up an answer in a database. It generates a response one token at a time, each token chosen based on all the previous ones.
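The loop described above can be sketched with a toy stand-in for the model. Everything here is hypothetical and made up for illustration: the `NEXT_TOKEN_WEIGHTS` table, its probabilities, and the `generate` function. The toy also looks only at the most recent token, whereas a real LLM uses a neural network that conditions on the entire preceding sequence. What it does show faithfully is the shape of generation: pick one token, append it, repeat.

```python
import random

# Toy "model": for each token, the tokens that may follow it and their weights.
# These values are invented for illustration; a real LLM computes them with a
# neural network that sees every previous token, not just the last one.
NEXT_TOKEN_WEIGHTS = {
    "<start>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.5, "ran": 0.5},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}


def generate(max_tokens=10, seed=0):
    """Generate a short sequence one token at a time."""
    rng = random.Random(seed)
    tokens = ["<start>"]
    for _ in range(max_tokens):
        options = NEXT_TOKEN_WEIGHTS[tokens[-1]]
        # Sample the next token in proportion to its weight.
        next_token = rng.choices(list(options), weights=list(options.values()))[0]
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the <start> marker


print(" ".join(generate()))
```

Nothing is looked up in a stored list of answers: the output is assembled token by token, and a different seed can produce a different sentence from the same table.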

Generation is fast: a capable GPU can produce 50–100 tokens per second on a 7B model, so a typical 200-word response (roughly 260 tokens) takes about 3–5 seconds.
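The timing estimate is simple arithmetic, sketched below. The `TOKENS_PER_WORD` value of 1.3 is an assumption (a common rule of thumb for English text, not a measured figure), and `response_seconds` is a hypothetical helper name. Because a word is usually a little more than one token, the result lands in the same ballpark as the estimate above.

```python
TOKENS_PER_WORD = 1.3  # assumption: rough rule of thumb for English text


def response_seconds(words, tokens_per_second, tokens_per_word=TOKENS_PER_WORD):
    """Estimate how long it takes to generate a response of `words` words."""
    tokens = words * tokens_per_word
    return tokens / tokens_per_second


for speed in (50, 100):
    print(f"{speed} tok/s: {response_seconds(200, speed):.1f} s")
```

This ignores the time spent processing the prompt before the first token appears, which grows with prompt length.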

What makes them "large"?

The "L" in LLM refers to the number of parameters — the adjustable values the model learned during training. Modern models range from 3 billion to over 400 billion parameters. More parameters generally mean more knowledge and better reasoning, but also more memory and compute to run.

Popular open-weight LLMs you can run locally

Model	Size	Created by	Strengths
Llama 3.1 8B	8B params	Meta	Fast, efficient, great all-round
Llama 3.3 70B	70B params	Meta	Near frontier quality, needs high VRAM
Mistral 7B	7B params	Mistral AI	Punches above its weight in speed
Qwen 2.5 32B	32B params	Alibaba	Strong coding, long context
Phi-3 Mini	3.8B params	Microsoft	Tiny but surprisingly capable
Gemma 2 27B	27B params	Google	Efficient, open license

Open-weight vs. open-source

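Models like Llama and Mistral are open-weight: the trained model file is freely downloadable, so you can run, modify, and build on them within the terms of each model's license. That is not necessarily the same as open-source, which would also mean releasing the training code and data. Both differ from closed models like GPT-4 or Claude, which only offer API access; you can't download or run them locally.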

How this maps to your RTX / Spark

A 7–8B model fits comfortably in 8 GB VRAM (RTX 4060 and up) when quantized to 4-bit, delivering 40–80 tokens per second. At FP16 precision the weights alone take ~14–16 GB, so a 16 GB card such as the RTX 4060 Ti 16GB is the practical minimum.

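A 70B model needs ~140 GB at FP16 (70 billion parameters × 2 bytes each), far beyond any single consumer GPU. Quantized to 4-bit it drops to roughly 35–40 GB, which still exceeds the RTX 5090's 32 GB, so on that card it needs a more aggressive quantization or partial offloading to system RAM. The DGX Spark, with 128 GB of unified memory, cannot hold a 70B model at full FP16 but runs it comfortably at 8-bit or 4-bit, and can even run models of up to roughly 200B parameters at 4-bit.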
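Memory figures like these follow from a rule of thumb: parameter count times bits per parameter, divided by 8 to get bytes. The sketch below applies it; `weight_memory_gb` is a hypothetical helper, and it counts the weights only. Real usage is higher, since the context cache and runtime buffers need room too, and practical 4-bit formats store slightly more than 4 bits per weight.

```python
def weight_memory_gb(params_billions, bits_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


for name, params in (("8B", 8), ("70B", 70)):
    for label, bits in (("FP16", 16), ("8-bit", 8), ("4-bit", 4)):
        print(f"{name} at {label}: ~{weight_memory_gb(params, bits):.0f} GB")
```

Comparing the result against a card's VRAM, with a few GB of headroom, gives a quick first check of whether a model is likely to fit.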