Context Window
How much text the AI can 'remember' and process at once — directly tied to VRAM through the KV cache.
Definition
The context window (or context length) is the maximum number of tokens a language model can process across a single prompt and response. Modern frontier models support 128K to 1M tokens. For local models, the practical context window is limited by the VRAM available for the KV cache, which grows linearly with the number of tokens in context.
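The linear relationship between context length and KV-cache VRAM can be estimated directly from a model's architecture. A minimal sketch, assuming the published Llama 3 8B configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and an fp16 cache:

```python
# Rough KV-cache size estimate for a Llama-3-8B-class model.
# The architecture constants below are assumptions taken from the
# published Llama 3 8B config; other models will differ.
N_LAYERS = 32        # transformer layers
N_KV_HEADS = 8       # KV heads (grouped-query attention)
HEAD_DIM = 128       # dimension per attention head
BYTES_PER_VALUE = 2  # fp16 / bf16

def kv_cache_bytes(context_tokens: int) -> int:
    """Bytes of KV cache needed to hold `context_tokens` tokens."""
    # 2x for the separate K and V tensors cached at every layer.
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * context_tokens

for ctx in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>7} tokens -> {gib:5.1f} GiB KV cache")
```

Under these assumptions the cache costs 128 KiB per token, so an 8K context needs about 1 GiB and a full 128K context about 16 GiB, on top of the model weights themselves. This is why quantized KV caches (e.g. 8-bit) are a common way to fit long contexts into limited VRAM.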
Why It Matters
High. Longer context enables document analysis, long-form coding, and multi-turn conversations without information loss. A 32K context window is sufficient for most tasks; 128K enables full-book analysis.
Real-World Example
At 128K context with Llama 3.1 8B, you could paste an entire 250-page technical manual (roughly 128K tokens) into the prompt and ask questions about it. Without enough VRAM for the KV cache, the runtime will truncate the input or crash.
History of Context Window
GPT-2 (2019) had a context window of 1,024 tokens; GPT-3 (2020) expanded it to 2,048. The race to longer contexts accelerated in 2023 with Anthropic's 100K-token Claude and, subsequently, long-context extensions of Meta's Llama 2. Today (2026), models like Llama 3.3 and Gemma 3 support 128K tokens natively.