LOCAL AI // GLOSSARY

KV Cache

The memory that stores the AI's 'conversation history' during generation; it lives in VRAM.

Definition

The Key-Value (KV) Cache is a mechanism used by transformer-based language models to avoid recomputing the key and value projections for every previous token at each generation step. Instead, the keys and values computed for each token are stored in VRAM and reused. The KV cache grows linearly with context length, and its per-token size depends on the model architecture (number of layers, KV heads, and head dimension).
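
To make the mechanism concrete, here is a minimal, hypothetical sketch of a single attention head in PyTorch (toy weights, not any particular model's implementation): each new token's key and value are projected once, appended to the cache, and reused at every later step instead of being recomputed.

    import torch

    d = 64                      # head dimension (toy value)
    W_q = torch.randn(d, d)     # toy projection weights, stand-ins for trained ones
    W_k = torch.randn(d, d)
    W_v = torch.randn(d, d)

    def decode_step(x_new, k_cache, v_cache):
        # Process ONE new token, reusing cached keys/values for all earlier tokens.
        q = x_new @ W_q                                # query for the new token only
        k_cache = torch.cat([k_cache, x_new @ W_k])    # key/value computed once,
        v_cache = torch.cat([v_cache, x_new @ W_v])    # then appended to the cache
        scores = (q @ k_cache.T) / d ** 0.5            # attend over ALL tokens so far
        out = torch.softmax(scores, dim=-1) @ v_cache  # without re-projecting the past
        return out, k_cache, v_cache

    k_cache = torch.empty(0, d)
    v_cache = torch.empty(0, d)
    for _ in range(5):                                 # toy generation loop
        x_new = torch.randn(1, d)                      # stand-in for the newest hidden state
        out, k_cache, v_cache = decode_step(x_new, k_cache, v_cache)

The cache grows by one row of keys and one row of values per generated token (per layer and per KV head in a real model), which is exactly why its memory footprint grows linearly with context length.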

Why It Matters

High. A long context window (such as 128K) can consume more VRAM for the KV cache than the model weights themselves. This is why a 24GB GPU that runs Llama 3 8B easily at short contexts may struggle at a 128K context length.

Real-World Example

Llama 3.1 8B with a 32K context window uses approximately 2-4GB of VRAM for the KV cache alone. At 128K context, this can exceed 16GB, meaning the KV cache alone can require more VRAM than the model weights.
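
Those figures can be reproduced with a simple per-token formula: cache bytes = 2 (keys and values) x layers x KV heads x head dimension x bytes per element x tokens. A back-of-the-envelope sketch, assuming the published Llama 3.1 8B architecture (32 layers, 8 KV heads under grouped-query attention, head dimension 128) and an FP16 cache:

    def kv_cache_gib(context_len, n_layers=32, n_kv_heads=8, head_dim=128,
                     bytes_per_elem=2, batch_size=1):
        # 2x because both keys AND values are stored for every layer and KV head
        per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
        return per_token * context_len * batch_size / 1024 ** 3

    print(kv_cache_gib(32_768))    # ~4 GiB at 32K context
    print(kv_cache_gib(131_072))   # ~16 GiB at 128K context

Real servers also need VRAM for the weights, activations, and allocator overhead, so treat these numbers as a lower bound.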

History of KV Cache

KV caching follows directly from the attention mechanism introduced in the original 'Attention Is All You Need' paper (Vaswani et al., 2017): once a token's keys and values are computed, they never change, so they can be stored and reused during autoregressive decoding. PagedAttention, introduced by the vLLM team at UC Berkeley (2023), revolutionized KV cache management by treating the cache like virtual memory pages, allowing much more efficient VRAM utilization in production serving.

Frequently Asked Questions

How do I calculate KV Cache size?
The KV Cache scales linearly with context length. As a rough rule of thumb, an 8B model uses ~1.5GB of VRAM per 10,000 context tokens, while a 70B model requires far more. Using the 'Will It Run?' calculator is the most accurate way to project your constraints.
Why does my local LLM crash halfway through a summary?
This almost always indicates an Out of Memory (OOM) error caused by the KV Cache expanding. As the conversation gets longer, the cache demands more VRAM, eventually pushing the total memory required past your GPU's physical limit.
Can I compress the KV cache?
Yes. Modern inference servers like vLLM and llama.cpp support KV cache quantization (such as 8-bit or FP8 KV caching), which roughly halves the memory requirement versus an FP16 cache, letting you roughly double your effective context window on limited VRAM.
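
To see why halving the bytes per cached element roughly doubles the context that fits, the same back-of-the-envelope formula from above can be inverted for a fixed VRAM budget (again assuming Llama 3.1 8B-style dimensions; a rough sketch, not a guarantee):

    def max_context(vram_budget_gib, n_layers=32, n_kv_heads=8, head_dim=128,
                    bytes_per_elem=2):
        # Largest context whose KV cache fits in the given VRAM budget.
        per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
        return int(vram_budget_gib * 1024 ** 3 // per_token)

    print(max_context(8, bytes_per_elem=2))   # FP16 cache:  ~65K tokens fit in 8 GiB
    print(max_context(8, bytes_per_elem=1))   # 8-bit cache: ~131K tokens fit in 8 GiB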
