LOCAL AI // GLOSSARY

Memory Bandwidth

How fast your GPU can read model weights from VRAM: the real determinant of inference speed.

Definition

Memory bandwidth is the rate at which the GPU's compute cores can read and write data to and from VRAM, measured in gigabytes per second (GB/s). During the generation phase of LLM inference, the GPU must stream the model's weights from VRAM for every token it generates. This makes bandwidth, not raw CUDA core count, the primary bottleneck for local LLM throughput.

Why It Matters

Critical for inference speed. The RTX 4090's 1,008 GB/s of bandwidth generates tokens roughly 2x faster than a card with 504 GB/s, holding model size constant, and the RTX 5090 at 1,792 GB/s pushes this further. This is also why Apple Silicon, whose ultra-wide unified LPDDR memory reaches up to 800 GB/s on Ultra chips, often outperforms mid-range gaming GPUs in tokens per second.

Real-World Example

The NVIDIA RTX 5080 has 896 GB/s of bandwidth. Running Llama 3.1 8B Q4 (~4.6GB of weights loaded per token step), the theoretical ceiling is approximately 896 / 4.6 ≈ 194 tokens/second; real-world results of 120-150 TPS are typical.
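The ceiling above is a simple division, which can be sketched as a quick calculation (the `tps_ceiling` helper name is ours; the numbers come from this example):

```python
def tps_ceiling(bandwidth_gbs: float, weights_gb: float) -> float:
    """Upper bound on tokens/second when generating one token
    requires streaming the full weight set from VRAM once."""
    return bandwidth_gbs / weights_gb

# RTX 5080 running Llama 3.1 8B Q4, per the example above
ceiling = tps_ceiling(896, 4.6)
print(int(ceiling))  # -> 194
```

Real-world throughput lands below this ceiling because of KV-cache reads, kernel launch overhead, and imperfect memory utilization.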

History of Memory Bandwidth

The relationship between memory bandwidth and compute was formally documented in the 'Roofline Model', introduced by UC Berkeley researchers (Williams, Waterman, and Patterson) in 2008. The memory wall problem, where bandwidth cannot keep up with compute, has driven the design of modern AI accelerators, including the H100's 3.35 TB/s of HBM3 and the B200's 8 TB/s of HBM3e.

Frequently Asked Questions

Why is memory bandwidth more important than CUDA cores?
LLM generation is a 'memory-bound' operation. To predict a single token, the GPU must move the entire multi-gigabyte set of model weights from VRAM into the compute cores. Even with unlimited compute cores, they would sit idle waiting for the memory bus to deliver the data.
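One way to see the memory-bound claim is to compare the workload's arithmetic intensity (FLOPs performed per byte moved) against the hardware's balance point. A minimal sketch, assuming ~2 FLOPs per parameter per token and a hypothetical GPU with 100 TFLOPS of compute and 1,000 GB/s of bandwidth (illustrative figures, not measured specs):

```python
# Arithmetic intensity of single-token generation vs. hardware balance.
params = 8e9                    # 8B-parameter model
bytes_moved = 4.6e9             # Q4 weights streamed once per token
flops_per_token = 2 * params    # ~2 FLOPs per parameter (multiply + add)

intensity = flops_per_token / bytes_moved  # ~3.5 FLOP per byte
hw_balance = 100e12 / 1000e9               # 100 FLOP per byte available

# intensity is far below hw_balance, so compute cores starve waiting on VRAM
print(f"{intensity:.1f} FLOP/byte needed vs {hw_balance:.0f} FLOP/byte available")
```

Because the workload needs only a few FLOPs per byte while the GPU can supply far more, adding compute does nothing; only faster memory raises throughput.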
Can I overclock my GPU memory bandwidth?
Yes, many AI enthusiasts successfully overclock their GPU VRAM (for example, applying a +1000MHz memory offset in MSI Afterburner) to squeeze out an extra 5-10% token generation speed. However, this increases heat and the risk of system instability.
Why do Macs perform differently with bandwidth?
Apple Silicon (M-series Max and Ultra chips) uses a 'unified memory architecture': the CPU and GPU share the same ultra-wide memory bus, providing up to 800 GB/s of bandwidth. While slower than a flagship discrete PC GPU, this lets Macs load massive 100GB+ models that no single consumer NVIDIA card can physically hold.
