Tokens & Context Windows
A large language model (LLM) doesn't read letter by letter or word by word; it reads in chunks called tokens. The context window is how many tokens it can hold in mind at once.
What is a token?
A token is the basic unit of text an LLM processes. It's roughly a word piece: common words are a single token, while long or rare words may be split across two or more. Punctuation marks are usually one token each, and long numbers are often split into several tokens.
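To make the "word piece" idea concrete, here is a minimal sketch of a greedy longest-match splitter. The vocabulary `VOCAB` and the function `tokenize_word` are hypothetical and exist only for illustration; real tokenizers learn vocabularies of tens of thousands of pieces from data and use more refined algorithms.

```python
# Toy vocabulary, invented for this example. Real vocabularies are learned.
VOCAB = {"the", "cat", "token", "ization", "un", "believ", "able"}

def tokenize_word(word, vocab=VOCAB):
    """Split one word into known pieces, always taking the longest match."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

print(tokenize_word("the"))           # ['the']
print(tokenize_word("tokenization"))  # ['token', 'ization']
print(tokenize_word("unbelievable"))  # ['un', 'believ', 'able']
```

A common word that is in the vocabulary stays whole, while a longer word is assembled from several smaller pieces.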
What is a context window?
The context window is the maximum number of tokens the model can process in a single pass — both the input prompt and the generated output combined. Think of it as the model's short-term working memory.
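Everything outside the window is invisible to the model. If you paste a 200-page document into a model with a 4K context window, only a few pages' worth of text fits. Depending on the tool, the rest is silently truncated (often keeping the most recent text) or the request is rejected outright.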
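As a rough illustration of what falling outside the window means, the sketch below keeps only the most recent tokens that fit. The function `fit_to_window` is hypothetical, and dropping the oldest tokens is an assumption: some tools trim differently or refuse over-long input.

```python
def fit_to_window(tokens, window_size, reserved_for_output=0):
    """Keep the most recent tokens that fit, leaving room for the reply."""
    # Input and output share the same window, so reserve space for output.
    budget = window_size - reserved_for_output
    if budget <= 0:
        raise ValueError("no room left for input tokens")
    if len(tokens) <= budget:
        return list(tokens)
    # Assumption: the oldest tokens are dropped first.
    return list(tokens[-budget:])

document = list(range(10))  # stand-in for 10 tokens
print(fit_to_window(document, window_size=4))                         # [6, 7, 8, 9]
print(fit_to_window(document, window_size=4, reserved_for_output=1))  # [7, 8, 9]
```

Everything before the kept slice never reaches the model, so it cannot be quoted, summarized, or reasoned about.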
Context window sizes in practice
| Context size | Approx words | What fits |
|---|---|---|
| 4K tokens | ~3,000 words | A few-page document, short conversation |
| 8K tokens | ~6,000 words | A short story, multi-turn conversation |
| 32K tokens | ~24,000 words | A small book, long codebase section |
| 128K tokens | ~96,000 words | A full novel, large codebase |
| 1M tokens | ~750,000 words | Multiple books, entire repos (frontier models) |
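The word counts in the table follow a rule of thumb of about 0.75 English words per token. A small sketch of that arithmetic is shown here; the helper names are made up for this example, and the ratio is only an approximation that varies by language and tokenizer.

```python
import math

# Rule of thumb implied by the table; an approximation, not a constant.
WORDS_PER_TOKEN = 0.75

def approx_words(context_tokens):
    """Roughly how many words fit in a given number of tokens."""
    return int(context_tokens * WORDS_PER_TOKEN)

def estimate_tokens(word_count):
    """Roughly how many tokens a text of word_count words will use."""
    return math.ceil(word_count / WORDS_PER_TOKEN)

def fits(word_count, context_tokens, reserved_for_output=0):
    """Check whether a text fits, leaving room for the model's reply."""
    return estimate_tokens(word_count) + reserved_for_output <= context_tokens

print(approx_words(4_000))     # 3000
print(estimate_tokens(6_000))  # 8000
print(fits(96_000, 128_000, reserved_for_output=1_000))  # False
```

The last call shows why an estimate that exactly fills the window is not enough: the reply needs tokens too.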
The hidden cost: KV cache
Every token in the context generates a set of key and value vectors (the KV cache) used by the attention mechanism. These are stored in VRAM. The more tokens in your context window, the bigger the KV cache.
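This is why a 128K context window costs significantly more VRAM than an 8K one, even with the same model. The overhead grows linearly with context length and depends on the architecture: on a 7B-class model with an FP16 cache and grouped-query attention, a full 128K context adds roughly 16 GB on top of the model weights, and models without grouped-query attention need several times that.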
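The size of the KV cache can be estimated with simple arithmetic: two vectors (key and value) per token, per layer, per KV head. The sketch below assumes a configuration typical of 7B-class models with grouped-query attention (32 layers, 8 KV heads, head dimension 128) and 2 bytes per value for FP16. These numbers are assumptions for illustration; check the actual model's config before relying on them.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV cache size in bytes for a single sequence."""
    # Factor of 2: one key vector and one value vector per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens

# Assumed 7B-class configuration (illustrative, not taken from a specific model).
ASSUMED_7B = dict(n_layers=32, n_kv_heads=8, head_dim=128)
GIB = 2**30

print(kv_cache_bytes(8 * 1024, **ASSUMED_7B) / GIB)    # 1.0
print(kv_cache_bytes(128 * 1024, **ASSUMED_7B) / GIB)  # 16.0
```

Under these assumptions an 8K context costs about 1 GiB and a 128K context about 16 GiB, the same ballpark as the figure above. Raising `n_kv_heads` to 32, as in models without grouped-query attention, quadruples both numbers.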
On an RTX 4060 (8 GB), you're typically running a quantized 7B model with 4–8K context. That's plenty for chat and document Q&A. Pushing to 32K requires more VRAM: an RTX 4090 (24 GB) or RTX 5090 (32 GB) handles it comfortably.
The DGX Spark's 128 GB of unified memory is where long context truly shines. Running Llama 3.3 70B at Q8 (~70 GB) leaves up to ~58 GB for the KV cache, less whatever the operating system and runtime need. That is enough for very long reasoning chains, large codebases, or multi-document analysis without truncation.
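To turn a memory budget like this into a context length, divide the free memory by the per-token cache cost. The sketch below assumes a 70B-class configuration of 80 layers, 8 KV heads, and head dimension 128 with an FP16 cache, and treats the full ~58 GB as available. All of these are assumptions for illustration, and the function name is hypothetical.

```python
def max_context_tokens(free_bytes, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Estimate how many tokens of KV cache fit in the given free memory."""
    # Per-token cost: key + value vectors across every layer and KV head.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return free_bytes // per_token

# Assumed 70B-class configuration; verify against the real model config.
ASSUMED_70B = dict(n_layers=80, n_kv_heads=8, head_dim=128)
free = 58 * 10**9  # ~58 GB, ignoring OS and runtime overhead

print(max_context_tokens(free, **ASSUMED_70B))  # 177001
```

Under these assumptions the budget covers roughly 177,000 tokens of cache. In practice the usable figure is lower, since the operating system, activations, and runtime buffers share the same memory.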