VRAM
The on-GPU memory that stores model weights. Determines which AI models you can run.
Definition
Video Random Access Memory (VRAM) is the high-speed, dedicated memory built directly onto a GPU. In local AI inference, VRAM holds the model's weights, activations, and the KV-cache that accumulates during token generation. Unlike system RAM, VRAM feeds the GPU's compute cores directly, whether through CUDA or ROCm, at bandwidths exceeding 1,000 GB/s on high-end cards.
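To make the definition concrete, here is a minimal Python sketch of how the KV-cache portion of that memory grows with context length. The layer count, KV-head count, and head dimension below are assumptions roughly matching Llama 3.1 8B; they are illustrative, not figures from this page.

    def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                       bytes_per_elem=2, batch_size=1):
        # Approximate KV-cache size for a decoder-only transformer.
        # The leading factor of 2 covers storing both keys and values at every
        # layer; bytes_per_elem=2 assumes an FP16 cache.
        return (2 * num_layers * num_kv_heads * head_dim
                * context_len * bytes_per_elem * batch_size)

    # Assumed Llama-3.1-8B-like shape: 32 layers, 8 KV heads (GQA), head_dim 128.
    cache = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                           context_len=128_000)
    print(f"KV cache at 128k context: ~{cache / 1e9:.1f} GB")  # ~16.8 GB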
Why It Matters
Critical. If a model's weights exceed VRAM capacity, the runtime spills to system RAM, slashing speed by 80-95%. A Llama 3.1 70B model at full FP16 precision requires ~140GB of VRAM; Q4 quantization brings this to ~40GB.
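Those figures come straight from parameter count times bits per weight. A quick sketch of the arithmetic, assuming ~4.5 effective bits per weight for a typical Q4-style quantization (an assumption; the exact figure varies by format):

    def weight_vram_gb(num_params, bits_per_weight):
        # Weights only; the KV cache, activations, and runtime buffers add more on top.
        return num_params * bits_per_weight / 8 / 1e9

    print(f"Llama 3.1 70B @ FP16 (16-bit): ~{weight_vram_gb(70e9, 16):.0f} GB")   # ~140 GB
    print(f"Llama 3.1 70B @ Q4 (~4.5-bit): ~{weight_vram_gb(70e9, 4.5):.0f} GB")  # ~39 GB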
Real-World Example
An NVIDIA RTX 5090 has 32GB of GDDR7 VRAM at 1,792 GB/s of bandwidth. That is enough to run Llama 3.1 8B at full FP16 with generous headroom for large context windows, but not enough to hold Llama 3.1 70B at Q4 (~40GB), which needs either a more aggressive quantization or partial offload to system RAM.
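Capacity decides what fits; bandwidth then caps how fast it runs, because each generated token requires streaming roughly the full set of weights out of VRAM. A back-of-the-envelope ceiling, assuming an ~16GB FP16 copy of Llama 3.1 8B and ignoring KV-cache reads and compute overhead:

    def max_tokens_per_second(bandwidth_gb_per_s, weight_gb):
        # Upper bound for memory-bandwidth-bound decoding: one full read of the
        # weights per generated token. Real-world throughput lands below this.
        return bandwidth_gb_per_s / weight_gb

    print(f"RTX 5090 ceiling for an 8B FP16 model: ~{max_tokens_per_second(1792, 16):.0f} tok/s")  # ~112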
History of VRAM
VRAM as a concept dates to the dedicated graphics memory on consumer 3D cards of the 1990s (SGRAM, followed by the first GDDR around 2000). The AI inflection point arrived with NVIDIA's Volta architecture (2017) and the V100, which introduced HBM2 memory and the Tensor Core, making it the first GPU truly designed around matrix math for deep learning. GDDR6X (Ampere, 2020) pushed consumer bandwidth further, HBM3 appeared on the datacenter H100 (2022), and consumer cards moved on to GDDR7 with the RTX 50-series.
Frequently Asked Questions
Why is VRAM the most important spec for local AI?
Because a model only runs at full speed when its weights fit entirely in VRAM; capacity, more than raw compute, determines which models you can run.
Does system RAM work as a substitute for VRAM?
Only as a slow fallback: offloading to system RAM typically cuts generation speed by 80-95%.
How much VRAM do I need for a 70B model?
Roughly 140GB at full FP16 precision, or around 40GB with Q4 quantization.
Related Concepts
Quantization
Compressing model weights from 16-bit to 4-bit precision, cutting VRAM usage by roughly 4x.
Tokens per Second (TPS)
The universal speed metric for LLMs: how many tokens (roughly word fragments) your GPU generates per second.
Memory Bandwidth
How fast your GPU can read model weights from VRAM; the real determinant of inference speed.