Memory Bandwidth
How fast your GPU can read model weights from VRAM, the real determinant of inference speed.
Definition
Memory bandwidth is the rate at which the GPU's compute cores can read and write data to and from VRAM, measured in gigabytes per second (GB/s). During LLM inference (the generation phase), the GPU must stream the model weights from VRAM for every token it generates. This makes bandwidth, not raw CUDA core count, the primary bottleneck for local LLM throughput.
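A minimal back-of-envelope sketch of that bottleneck, assuming decode is fully bandwidth-bound, that the full set of quantized weights is read once per generated token, and ignoring KV-cache and activation traffic (the function name is illustrative, not from any particular library):

```python
def estimate_decode_tps(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Upper bound on decode tokens/second when generation is limited
    purely by how fast the weights stream out of VRAM.

    Assumes one full pass over the quantized weights per token and
    ignores KV-cache reads, activations, and compute time.
    """
    seconds_per_token = weight_gb / bandwidth_gb_s
    return 1.0 / seconds_per_token
```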
Why It Matters
Critical for inference speed. An RTX 4090, with 1,008 GB/s of bandwidth, generates tokens roughly 2x faster than a card with 504 GB/s, model size held constant. The RTX 5090 at 1,792 GB/s pushes this further. This is also why Apple Silicon, with its high-bandwidth unified memory (around 800 GB/s on Ultra-class chips), can match or outperform gaming GPUs on TPS for models that do not fit in a discrete card's VRAM.
Real-World Example
The NVIDIA RTX 5080 has 896 GB/s of bandwidth. Running Llama 3.1 8B Q4 (~4.6 GB of weights streamed per token step), the theoretical ceiling is approximately 896 / 4.6 ≈ 195 tokens/second; real-world results of 120-150 TPS are typical.
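Plugging the numbers above into that same estimate reproduces both the ceiling and the observed range; the 0.65 efficiency factor is an assumed, illustrative value, not a measurement:

```python
bandwidth_gb_s = 896   # RTX 5080 memory bandwidth
weight_gb = 4.6        # Llama 3.1 8B at Q4, weights streamed per token

ceiling_tps = bandwidth_gb_s / weight_gb   # ~195 tokens/s, bandwidth-bound ceiling
efficiency = 0.65                          # assumed fraction of the ceiling reached in practice
print(f"ceiling ~{ceiling_tps:.0f} tok/s, realistic ~{ceiling_tps * efficiency:.0f} tok/s")
```

Real runs land below the ceiling in part because attention over the growing KV cache, kernel launch overhead, and sampling add per-token work that the simple weights-over-bandwidth model ignores.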
History of Memory Bandwidth
The relationship between memory bandwidth and achievable compute throughput was formalized in the Roofline model, introduced by researchers at UC Berkeley in 2008. The memory wall problem, where bandwidth cannot keep up with compute, has driven the design of modern AI accelerators, including the H100's 3.35 TB/s of HBM3 and the B200's 8 TB/s of HBM3e.
Frequently Asked Questions
Why is memory bandwidth more important than CUDA cores?
Can I overclock my GPU memory bandwidth?
Why do Macs perform differently with bandwidth?
Related Concepts
VRAM
The on-GPU memory that stores model weights. Determines which AI models you can run.
FP8 / FP4
Next-generation low-precision formats that accelerate inference on NVIDIA Ada (FP8) and Blackwell (FP4) GPUs.
Tokens per Second (TPS)
The universal speed metric for LLMs: how many tokens (roughly word-length pieces of text) your GPU generates per second.