Context Window
How much text the AI can 'remember' and process at once — directly tied to VRAM through the KV cache.
Definition
The context window (or context length) is the maximum number of tokens a language model can process across a single prompt and response. Modern frontier models support 128K to 1M tokens. For local models, the practical context window is limited by the VRAM available for the KV cache, which grows linearly with the number of tokens in context.
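The linear relationship between context length and KV-cache VRAM can be estimated directly from a model's architecture. A minimal sketch, assuming the published Llama 3 8B configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and an fp16 cache:

```python
# Rough KV-cache size estimate for a Llama-3-8B-class model.
# The architecture constants below are assumptions taken from the
# published Llama 3 8B config; other models will differ.
N_LAYERS = 32        # transformer layers
N_KV_HEADS = 8       # KV heads (grouped-query attention)
HEAD_DIM = 128       # dimension per attention head
BYTES_PER_VALUE = 2  # fp16 / bf16

def kv_cache_bytes(context_tokens: int) -> int:
    """Bytes of KV cache needed to hold `context_tokens` tokens."""
    # 2x for the separate K and V tensors cached at every layer.
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * context_tokens

for ctx in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>7} tokens -> {gib:5.1f} GiB KV cache")
```

Under these assumptions the cache costs 128 KiB per token, so an 8K context needs about 1 GiB and a full 128K context about 16 GiB, on top of the model weights themselves. This is why quantized KV caches (e.g. 8-bit) are a common way to fit long contexts into limited VRAM.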
Why It Matters
High. Longer context enables document analysis, long-form coding, and multi-turn conversations without information loss. A 32K context window is sufficient for most tasks; 128K enables full-book analysis.
Real-World Example
At 128K context with Llama 3.1 8B, you could paste an entire 250-page technical manual (roughly 128K tokens) into the prompt and ask questions about it. Without enough VRAM for the KV cache, the runtime will truncate the input or crash.
History of Context Window
GPT-2 (2019) had a context window of 1,024 tokens; GPT-3 (2020) expanded it to 2,048. The race to longer contexts accelerated in 2023 with Anthropic's 100K-token Claude and, subsequently, long-context extensions of Meta's Llama 2. Today (2026), models like Llama 3.3 and Gemma 3 support 128K tokens natively.