VRAM
The on-GPU memory that stores model weights. Determines which AI models you can run.
Definition
Video Random Access Memory (VRAM) is the high-speed, dedicated memory built directly onto a GPU. In local AI inference, VRAM holds the model's weights, activations, and the KV-cache that accumulates during token generation. Unlike system RAM, VRAM feeds the GPU's compute cores directly, whether through CUDA or ROCm, at bandwidths exceeding 1,000 GB/s on high-end cards.
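To make the definition concrete, here is a minimal Python sketch of how the KV-cache portion of that memory grows with context length. The layer count, KV-head count, and head dimension below are assumptions roughly matching Llama 3.1 8B; they are illustrative, not figures from this page.

    def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                       bytes_per_elem=2, batch_size=1):
        # Approximate KV-cache size for a decoder-only transformer.
        # The leading factor of 2 covers storing both keys and values at every
        # layer; bytes_per_elem=2 assumes an FP16 cache.
        return (2 * num_layers * num_kv_heads * head_dim
                * context_len * bytes_per_elem * batch_size)

    # Assumed Llama-3.1-8B-like shape: 32 layers, 8 KV heads (GQA), head_dim 128.
    cache = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                           context_len=128_000)
    print(f"KV cache at 128k context: ~{cache / 1e9:.1f} GB")  # ~16.8 GB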
Why It Matters
Critical. If a model's weights exceed VRAM capacity, the runtime spills to system RAM, slashing speed by 80-95%. A Llama 3.1 70B model at full FP16 precision requires ~140GB of VRAM; Q4 quantization brings this to ~40GB.
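Those figures come straight from parameter count times bits per weight. A quick sketch of the arithmetic, assuming ~4.5 effective bits per weight for a typical Q4-style quantization (an assumption; the exact figure varies by format):

    def weight_vram_gb(num_params, bits_per_weight):
        # Weights only; the KV cache, activations, and runtime buffers add more on top.
        return num_params * bits_per_weight / 8 / 1e9

    print(f"Llama 3.1 70B @ FP16 (16-bit): ~{weight_vram_gb(70e9, 16):.0f} GB")   # ~140 GB
    print(f"Llama 3.1 70B @ Q4 (~4.5-bit): ~{weight_vram_gb(70e9, 4.5):.0f} GB")  # ~39 GB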
Real-World Example
An NVIDIA RTX 5090 has 32GB of GDDR7 VRAM at 1,792 GB/s of bandwidth. That is enough to run Llama 3.1 8B at full FP16 with generous headroom for large context windows, but not enough to hold Llama 3.1 70B at Q4 (~40GB), which needs either a more aggressive quantization or partial offload to system RAM.
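Capacity decides what fits; bandwidth then caps how fast it runs, because each generated token requires streaming roughly the full set of weights out of VRAM. A back-of-the-envelope ceiling, assuming an ~16GB FP16 copy of Llama 3.1 8B and ignoring KV-cache reads and compute overhead:

    def max_tokens_per_second(bandwidth_gb_per_s, weight_gb):
        # Upper bound for memory-bandwidth-bound decoding: one full read of the
        # weights per generated token. Real-world throughput lands below this.
        return bandwidth_gb_per_s / weight_gb

    print(f"RTX 5090 ceiling for an 8B FP16 model: ~{max_tokens_per_second(1792, 16):.0f} tok/s")  # ~112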
History of VRAM
VRAM as a concept dates to the dedicated graphics memory on consumer 3D cards of the 1990s (SGRAM, followed by the first GDDR around 2000). The AI inflection point arrived with NVIDIA's Volta architecture (2017) and the V100, which introduced HBM2 memory and the Tensor Core, making it the first GPU truly designed around matrix math for deep learning. GDDR6X (Ampere, 2020) pushed consumer bandwidth further, HBM3 appeared on the datacenter H100 (2022), and consumer cards moved on to GDDR7 with the RTX 50-series.
Frequently Asked Questions
Why is VRAM the most important spec for local AI?
Because a model only runs at full speed when its weights fit entirely in VRAM; capacity, more than raw compute, determines which models you can run.
Does system RAM work as a substitute for VRAM?
Only as a slow fallback: offloading to system RAM typically cuts generation speed by 80-95%.
How much VRAM do I need for a 70B model?
Roughly 140GB at full FP16 precision, or around 40GB with Q4 quantization.
Related Concepts
Quantization
Compressing model weights from 16-bit to 4-bit precision, cutting VRAM usage by roughly 4x.
Tokens per Second (TPS)
The universal speed metric for LLMs: how many tokens (roughly word fragments) your GPU generates per second.
Memory Bandwidth
How fast your GPU can read model weights from VRAM; the real determinant of inference speed.