Tokens & Context Windows — Local AI Guide

What is a token?

A token is the basic unit of text an LLM processes. It is roughly a word piece: common words are a single token, while long or rare words may be split across two or more. Punctuation marks are usually one token each. Numbers vary by tokenizer, and long numbers are often split into several tokens.

Rough rule of thumb: 1 token ≈ ¾ of an English word. So 1,000 words ≈ 1,333 tokens. A full novel (80,000 words) ≈ 107,000 tokens.
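To make the arithmetic above concrete, here is a minimal sketch of the conversion in Python. The function names are made up for illustration, and the 0.75 ratio is only the rule of thumb from the text; real counts depend on the tokenizer and the language.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb from the text, not a property of any specific tokenizer


def estimate_tokens(word_count: int) -> int:
    """Estimate how many tokens a given number of English words will use."""
    return round(word_count / WORDS_PER_TOKEN)


def estimate_words(token_count: int) -> int:
    """Estimate how many English words fit in a given number of tokens."""
    return int(token_count * WORDS_PER_TOKEN)


print(estimate_tokens(1_000))   # 1333
print(estimate_tokens(80_000))  # 106667, i.e. roughly 107,000
print(estimate_words(4_096))    # 3072
```

For an exact count, run the text through the tokenizer that ships with the model you actually use.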

What is a context window?

The context window is the maximum number of tokens the model can process in a single pass — both the input prompt and the generated output combined. Think of it as the model's short-term working memory.
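Because prompt and output share the same window, a request only fits if the two add up to no more than the window size. A small sketch of that check, with hypothetical function names:

```python
def fits_in_context(context_window: int, prompt_tokens: int, max_new_tokens: int) -> bool:
    """True if the prompt plus the requested output fits in the window."""
    return prompt_tokens + max_new_tokens <= context_window


def remaining_output_budget(context_window: int, prompt_tokens: int) -> int:
    """Tokens left for generation once the prompt is loaded (never negative)."""
    return max(0, context_window - prompt_tokens)


print(fits_in_context(8_192, 6_000, 2_000))   # True
print(fits_in_context(8_192, 7_000, 2_000))   # False
print(remaining_output_budget(8_192, 7_000))  # 1192
```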

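Everything outside the window is invisible to the model. If you paste a 200-page document into a model with a 4K context window, the tool you are using will either reject the prompt or truncate it. After truncation the model sees only a few pages' worth of text, often the last few, depending on which end the tool cuts.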
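What truncation does can be sketched in a few lines. This is an illustration only, not how any particular runtime implements it: `truncate_to_window` and its parameters are hypothetical, and real tools differ in which end they cut and whether they preserve a system prompt.

```python
def truncate_to_window(
    token_ids: list[int],
    context_window: int,
    reserve_for_output: int = 0,
    keep: str = "tail",
) -> list[int]:
    """Drop tokens that do not fit, keeping either the end or the start of the input."""
    budget = context_window - reserve_for_output
    if budget <= 0:
        return []
    if keep == "tail":
        return token_ids[-budget:]  # keep the most recent tokens
    return token_ids[:budget]       # keep the beginning instead


tokens = list(range(10))  # stand-in for real token IDs
print(truncate_to_window(tokens, context_window=4))               # [6, 7, 8, 9]
print(truncate_to_window(tokens, context_window=4, keep="head"))  # [0, 1, 2, 3]
```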

Context window sizes in practice

Context size	Approx. words	What fits
4K tokens	~3,000 words	A few-page document, short conversation
8K tokens	~6,000 words	A short story, multi-turn conversation
32K tokens	~24,000 words	A small book, long codebase section
128K tokens	~96,000 words	A full novel, large codebase
1M tokens	~750,000 words	Multiple books, entire repos (frontier models)

The hidden cost: KV cache

For every token in the context, each transformer layer stores a key vector and a value vector for the attention mechanism. Together these form the KV cache. The cache lives in VRAM and grows linearly with the number of tokens in the context.

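This is why a 128K context window costs significantly more VRAM than an 8K one, even with the same model. On a 7B-class model with an FP16 cache, a full 128K context adds roughly 16 GB on top of the model weights when the model uses grouped-query attention (8 KV heads), and around four times that for older designs with full multi-head attention.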
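The size of the KV cache follows from a simple product: 2 (keys and values) × layers × KV heads × head dimension × bytes per value × tokens. A minimal sketch, where the model shapes are assumptions typical of 7B-class models (32 layers, head dimension 128) rather than figures from this guide; check your model's config for the real values:

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    n_tokens: int,
    bytes_per_value: int = 2,  # 2 bytes = FP16 cache
    batch_size: int = 1,
) -> int:
    """Approximate KV cache size in bytes for a given context length."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # keys + values
    return per_token * n_tokens * batch_size


GIB = 2**30

# Assumed shape: 32 layers, 8 KV heads (grouped-query attention), head_dim 128.
print(kv_cache_bytes(32, 8, 128, 8 * 1024) / GIB)     # 1.0
print(kv_cache_bytes(32, 8, 128, 128 * 1024) / GIB)   # 16.0

# Same shape with full multi-head attention (32 KV heads).
print(kv_cache_bytes(32, 32, 128, 128 * 1024) / GIB)  # 64.0
```

Quantizing the cache (for example to 8-bit, `bytes_per_value=1`) halves these numbers, if your runtime supports it.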

How this maps to your RTX / Spark

On an RTX 4060 (8 GB), you're typically running a quantized 7B model with 4–8K context. That's plenty for chat and document Q&A. Pushing to 32K requires more VRAM; an RTX 4090 (24 GB) or RTX 5090 (32 GB) handles it comfortably.

The DGX Spark's 128 GB unified memory is where long context truly shines. Running Llama 3.3 70B at Q8 (~70 GB) leaves at most ~58 GB for the KV cache, less whatever the OS and runtime take from the shared pool. That is still enough for very long reasoning chains, large codebases, or multi-document analysis without truncation.
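To turn leftover memory into a context length, divide it by the per-token KV cache cost. The sketch below assumes a 70B-class shape of 80 layers, 8 KV heads and head dimension 128 with an FP16 cache; these values are assumptions for illustration, so confirm them against the model's config. It also treats the full 58 GB as available, which is optimistic.

```python
def max_context_tokens(
    free_bytes: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # FP16 cache
) -> int:
    """How many tokens of KV cache fit in the given amount of free memory."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # keys + values
    return free_bytes // per_token


free = 58 * 10**9  # the ~58 GB left over in the example above
print(max_context_tokens(free, n_layers=80, n_kv_heads=8, head_dim=128))  # 177001
```

The result is an upper bound on memory grounds only. The model's own maximum supported context length still applies, and activations and runtime buffers need some space too.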