KV Cache
The memory that stores the AI's 'conversation history' during generation; it lives in VRAM.
Definition
The Key-Value (KV) Cache is a mechanism used by transformer-based language models to avoid recomputing the key and value projections for all previous tokens at every generation step. Instead, the keys and values computed for each token are stored in VRAM and reused, so each new token only attends over the cached entries. The KV cache size scales linearly with context length, batch size, and the model's layer count and hidden dimension.
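A minimal single-head sketch of the mechanism in NumPy (names and shapes here are illustrative, not any real model's API): each decode step projects keys and values only for the newest token, appends them to the cache, and lets the new token's query attend over everything cached so far.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class KVCache:
    """Grows by one (key, value) row per generated token."""
    def __init__(self, d_head):
        self.keys = np.empty((0, d_head))    # (seq_len, d_head)
        self.values = np.empty((0, d_head))  # (seq_len, d_head)

    def append(self, k, v):
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

def decode_step(q_new, k_new, v_new, cache):
    """One generation step: store the new token's K/V, then attend with
    the new token's query over the whole cache. Only one K/V projection
    is computed per step; everything earlier is reused from VRAM."""
    cache.append(k_new, v_new)
    scores = q_new @ cache.keys.T / np.sqrt(q_new.shape[-1])  # (1, seq_len)
    return softmax(scores) @ cache.values                     # (1, d_head)

# Toy usage: three decode steps with random "projections".
rng = np.random.default_rng(0)
d_head = 8
cache = KVCache(d_head)
for _ in range(3):
    q, k, v = (rng.standard_normal((1, d_head)) for _ in range(3))
    out = decode_step(q, k, v, cache)
print(cache.keys.shape)  # (3, 8): one cached key per token so far
```

The cache trades memory for compute: without it, every step would recompute K and V for the entire prefix, making generation quadratic in sequence length.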
Why It Matters
High. A 128K context window with a large model can consume more VRAM for the KV cache than the model weights themselves. This is why a 24GB GPU that comfortably runs Llama 3 8B may struggle at a 128K context length.
Real-World Example
Llama 3.1 8B with a 32K context window uses approximately 2-4GB of VRAM for the KV cache alone. At 128K context, this can exceed 16GB, meaning the KV cache can demand more VRAM than the model weights themselves.
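A back-of-the-envelope calculator that reproduces those figures (the architecture parameters below, 32 layers and 8 grouped-query KV heads of dimension 128 in FP16, come from Llama 3.1 8B's public config; treat this as a sketch, since serving frameworks add their own overhead):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) per layer, each of shape
    (batch, n_kv_heads, seq_len, head_dim)."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128, FP16.
for ctx in (32_768, 131_072):
    gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
    print(f"{ctx:>7} tokens -> {gib:.1f} GiB")
# 32768 tokens -> 4.0 GiB
# 131072 tokens -> 16.0 GiB
```

Note how grouped-query attention already helps here: with 32 full attention heads instead of 8 KV heads, these numbers would be four times larger.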
History of KV Cache
The KV cache grew out of the transformer architecture introduced in 'Attention Is All You Need' (Vaswani et al., 2017); the paper defined the attention mechanism, and caching keys and values during autoregressive decoding became a standard inference optimization soon after. PagedAttention, introduced by the vLLM team at UC Berkeley (2023), revolutionized KV cache management by treating the cache like virtual memory pages, allowing much more efficient VRAM utilization in production serving.
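A toy sketch of the paged idea (hypothetical names; vLLM's real implementation lives in optimized CUDA kernels and is far more involved): the cache is carved into fixed-size blocks, and each sequence keeps a "block table" mapping its logical token positions to physical blocks, so memory is claimed on demand rather than reserved for the full context up front.

```python
BLOCK_SIZE = 16  # tokens per physical KV block

class PagedKVAllocator:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id, position):
        """Reserve a physical block only when a sequence crosses into a
        new logical block, much like a page fault in virtual memory."""
        table = self.block_tables.setdefault(seq_id, [])
        if position // BLOCK_SIZE >= len(table):
            table.append(self.free_blocks.pop())
        return table[position // BLOCK_SIZE]

alloc = PagedKVAllocator(num_blocks=64)
for pos in range(40):  # 40 tokens -> ceil(40/16) = 3 blocks
    alloc.append_token("seq-0", pos)
print(alloc.block_tables["seq-0"])  # three physical block ids
```

Because unused blocks stay in the free pool, many sequences can share one GPU without each pre-allocating its worst-case context, which is where the serving-throughput gains come from.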
Frequently Asked Questions
How do I calculate KV Cache size?
Why does my local LLM crash halfway through a summary?
Can I compress the KV cache?
Related Concepts
VRAM
The on-GPU memory that stores model weights. Determines which AI models you can run.
Memory Bandwidth
How fast your GPU can read model weights from VRAM; the real determinant of inference speed.
Context Window
How much text the AI can 'remember' and process at once; directly tied to VRAM through the KV cache.