
VRAM

The on-GPU memory that stores model weights. Determines which AI models you can run.

Definition

Video Random Access Memory (VRAM) is the high-speed, dedicated memory built directly onto a GPU. In local AI inference, VRAM holds the model's weights, its activations, and the KV cache that accumulates during token generation. Unlike system RAM, VRAM feeds the GPU's compute cores directly (CUDA on NVIDIA, the ROCm stack on AMD) at bandwidths exceeding 1,000 GB/s on modern cards.
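
To see what your own card offers, you can query free and total VRAM at runtime. A minimal sketch assuming a CUDA build of PyTorch (ROCm builds of PyTorch expose the same torch.cuda API on AMD hardware):

    import torch  # assumes a CUDA (or ROCm) build of PyTorch

    if torch.cuda.is_available():
        # mem_get_info wraps cudaMemGetInfo: (free bytes, total bytes) for the device
        free_bytes, total_bytes = torch.cuda.mem_get_info(0)
        gib = 1024 ** 3
        print(f"GPU 0: {free_bytes / gib:.1f} GiB free of {total_bytes / gib:.1f} GiB VRAM")
    else:
        print("No compatible GPU visible to PyTorch")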

Why It Matters

Critical. If a model's weights exceed VRAM capacity, the runtime spills to system RAM, slashing speed by 80-95%. A Llama 3.1 70B model at full FP16 precision requires ~140GB VRAM; Q4 quantization brings this to ~40GB.
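
The arithmetic behind those figures is just parameter count times bytes per parameter. A rough Python sketch; the bit-widths below are nominal, and real model files carry extra overhead:

    # Nominal bytes per parameter; real files add overhead
    # (embedding tables, quantization scales, metadata), so treat
    # these as lower bounds rather than exact sizes.
    BYTES_PER_PARAM = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}

    def weight_gb(params_billions: float, precision: str) -> float:
        """Weight memory in decimal GB: parameters x bytes per parameter."""
        return params_billions * BYTES_PER_PARAM[precision]

    for precision in ("FP16", "Q8", "Q4"):
        print(f"Llama 3.1 70B at {precision}: ~{weight_gb(70, precision):.0f} GB")
    # FP16 -> ~140 GB, Q8 -> ~70 GB, Q4 -> ~35 GB (plus runtime overhead)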

Real-World Example

An NVIDIA RTX 5090 has 32GB of GDDR7 VRAM at 1,792 GB/s of bandwidth. That comfortably fits a ~30B-parameter model at Q4 quantization, or Llama 3.1 8B at full FP16 with headroom for large context windows; a 70B model at Q4 (~40GB) still overflows a single card.
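
How much headroom a long context actually consumes can be estimated from the KV-cache formula. A back-of-envelope sketch using the published Llama 3.1 8B attention configuration (treat the constants as assumptions if your build differs; quantized KV caches shrink these numbers):

    # Per-token KV-cache bytes: 2 (K and V) x layers x kv_heads x head_dim x bytes/value.
    # Values below are the Llama 3.1 8B configuration: 32 layers,
    # grouped-query attention with 8 KV heads, head dimension 128, FP16 cache.
    layers, kv_heads, head_dim, fp16_bytes = 32, 8, 128, 2
    per_token = 2 * layers * kv_heads * head_dim * fp16_bytes  # 131,072 B = 128 KiB/token
    context_tokens = 128_000
    kv_gib = per_token * context_tokens / 1024 ** 3
    print(f"KV cache at {context_tokens:,} tokens: ~{kv_gib:.1f} GiB")  # ~15.6 GiB

At full context, the ~16GB of FP16 weights plus ~16 GiB of KV cache is exactly why 32GB counts as "headroom" rather than excess.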

History of VRAM

Dedicated graphics memory dates to the consumer 3D cards of the 1990s (as SGRAM and early GDDR). The AI inflection point arrived with NVIDIA's Volta architecture (2017) and the V100, which introduced HBM2 memory and the Tensor Core, the first GPU hardware truly designed around matrix math for deep learning. GDDR6X (Ampere, 2020) became the consumer norm, HBM3 appeared on the H100 (2022), and GDDR7 made its consumer debut with the RTX 50-series.

Frequently Asked Questions

Why is VRAM the most important spec for local AI?
VRAM dictates the size of the neural network you can physically load onto your GPU. If a model's weights exceed your VRAM capacity, your system will either fail to load the model or spill the workload to much slower system RAM, making generation unusably slow.
Does system RAM work as a substitute for VRAM?
Not effectively. Tools like Ollama will gracefully spill to system RAM (CPU) when you exceed VRAM limits, but system RAM is drastically slower: typical dual-channel DDR5 provides ~70 GB/s of bandwidth, whereas modern GDDR6X/GDDR7 VRAM offers 1,000+ GB/s. That gap is why VRAM is necessary for fast token generation (a back-of-envelope sketch follows these questions).
How much VRAM do I need for a 70B model?
To run a 70-billion-parameter model (like Llama 3.1 70B) at standard Q4 quantization, you need approximately 40-44GB of VRAM. That typically means a dual-GPU setup (such as two RTX 4090s) or a Mac Studio with ample unified memory.
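
The speed gap in these answers follows from a simple memory-bound model of decoding: each generated token must read every weight once. A back-of-envelope Python sketch (an upper bound that ignores compute, caching, and batching):

    # Single-stream decoding reads every weight once per generated token,
    # so memory bandwidth sets a rough ceiling: tokens/s <= bandwidth / weight bytes.
    # Real throughput lands below this bound.
    def decode_ceiling(bandwidth_gb_s: float, weights_gb: float) -> float:
        return bandwidth_gb_s / weights_gb

    weights_gb = 40  # ~70B model at Q4
    for name, bw in (("dual-channel DDR5", 70), ("GDDR7 on an RTX 5090", 1792)):
        print(f"{name}: <= {decode_ceiling(bw, weights_gb):.1f} tokens/s")
    # DDR5: <= 1.8 tokens/s; GDDR7: <= 44.8 tokens/s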
