How Much VRAM Do You Need to Run LLMs in 2026? The Complete Guide
VRAM is the single most important spec for running large language models locally. Get this wrong, and your model either refuses to load, crawls at 1-2 tokens per second while grinding system RAM, or crashes outright.
This guide gives you the exact VRAM numbers for every major model and quantization level in 2026. By the end, you'll know exactly what GPU you need for your specific models.
Quick start: Take the GPU Quiz | GPU Deals by VRAM Tier | DeepSeek R1 VRAM Requirements
At a Glance: VRAM Requirements by Use Case (2026)
- Casual use (7B/8B models, Q4): 6GB minimum, 8GB recommended
- Power user (13B/14B models, good quality): 12GB minimum, 16GB recommended
- Serious local AI (32B models, Q4): 24GB minimum
- Frontier local AI (70B models, Q4): 48GB or dual 24GB
- Image generation (Stable Diffusion XL): 8GB minimum, 12GB comfortable
- Multimodal (vision + text): Add 2-4GB on top of text model requirements
Why VRAM Is the Bottleneck for LLMs
Unlike traditional GPU workloads (gaming, rendering), LLMs don't need massive compute throughput; they need fast, large memory. The entire model must be loaded into VRAM before it can generate a single token. A 7B parameter model at 4-bit quantization occupies ~4.5GB of VRAM. A 70B model at the same quantization needs ~40GB.
If your model doesn't fit in VRAM:
- The excess spills to system RAM (called CPU offloading)
- CPU RAM is 10-50x slower than GPU VRAM
- Generation drops from 40-100+ tok/s to 2-8 tok/s
The rule: Every layer of the model that fits in VRAM is fast. Every layer that spills to RAM is dramatically slower. Full VRAM fit = fast. Partial fit = slow. No fit = unacceptably slow.
For a deeper dive on multi-GPU setups that eliminate this problem entirely, see our Llama 3.3 Hardware Requirements guide.
Quantization Explained: Q4, Q5, Q8, F16
Quantization is how you trade model quality for VRAM. Here's what the numbers mean:
| Format | Bits per weight | VRAM multiplier (vs. Q8_0) | Quality |
|---|---|---|---|
| F32 | 32-bit float | 4x | Reference quality (impractical) |
| F16 / BF16 | 16-bit float | 2x | Full quality, high VRAM |
| Q8_0 | 8-bit integer | 1x | Near-lossless, ~1% degradation |
| Q6_K | 6-bit | 0.75x | Excellent quality, ~1-2% loss |
| Q5_K_M | 5-bit | 0.625x | Good quality, ~2-3% loss |
| Q4_K_M | 4-bit | 0.5x | Good quality, ~3-5% loss |
| Q3_K_M | 3-bit | 0.375x | Acceptable, ~5-8% loss |
| Q2_K | 2-bit | 0.25x | Noticeable degradation |
Which quantization should you use?
- Q4_K_M is the standard recommendation for most users - half the VRAM of F16 with ~3-5% quality loss
- Q5_K_M is a good middle ground if you have the VRAM - noticeably better than Q4 with modest cost
- Q8_0 is essentially lossless - use it when VRAM allows, especially for coding tasks
- Q2 should only be used to fit a model that otherwise won't fit at all - quality degrades significantly
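In practice, the quantization choice is simply which file you download: GGUF repos typically ship one file per quant level. Here's a minimal sketch using the Hugging Face Hub client; the repo and filename are placeholders, not a specific recommended model.

```python
from huggingface_hub import hf_hub_download

# Each GGUF file in a repo is one quantization level (Q4_K_M, Q5_K_M, Q8_0, ...).
# Pick the file whose size fits your VRAM budget using the tables below.
path = hf_hub_download(
    repo_id="example-org/Example-8B-Instruct-GGUF",   # placeholder repo
    filename="example-8b-instruct.Q4_K_M.gguf",       # placeholder quant file
)
print(path)  # local path to the downloaded weights
```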
VRAM formula
VRAM needed = (Parameters in billions) x (bytes per weight) + ~10% overhead
- F16: ~2GB per billion parameters
- Q8: ~1GB per billion parameters
- Q4: ~0.5GB per billion parameters
- Q2: ~0.25GB per billion parameters
Example: Llama 3.1 70B at Q4 = 70 x 0.5 = 35GB + 10% overhead = ~38-40GB
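If you'd rather not do the arithmetic by hand, the formula above is easy to turn into a small helper. This is a rough estimator, not a measurement - real GGUF files vary slightly by quant mix, and overhead grows with context length.

```python
# Bytes per weight for common quant levels (matches the table above).
BYTES_PER_WEIGHT = {
    "F16": 2.0, "Q8": 1.0, "Q6": 0.75, "Q5": 0.625,
    "Q4": 0.5, "Q3": 0.375, "Q2": 0.25,
}

def estimate_vram_gb(params_billion: float, quant: str, overhead: float = 0.10) -> float:
    """Approximate VRAM footprint in GB: weights plus ~10% runtime/KV-cache overhead."""
    weights_gb = params_billion * BYTES_PER_WEIGHT[quant]
    return weights_gb * (1 + overhead)

print(round(estimate_vram_gb(70, "Q4"), 1))  # ~38.5 -> matches the ~38-40GB figure above
print(round(estimate_vram_gb(8, "Q8"), 1))   # ~8.8  -> why an 8B at Q8 wants a 12GB card
```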
Master VRAM Requirements Table
Llama 3.1 / 3.2 / 3.3 Models
| Model | Q2 | Q4 | Q5 | Q6 | Q8 | F16 |
|---|---|---|---|---|---|---|
| Llama 3.2 3B | ~1GB | ~2GB | ~2.4GB | ~2.8GB | ~3.5GB | ~6GB |
| Llama 3.1/3.2 8B | ~2.5GB | ~4.5GB | ~5.5GB | ~6.5GB | ~8.5GB | ~16GB |
| Llama 3.1 13B | ~4GB | ~7.5GB | ~9GB | ~11GB | ~14GB | ~26GB |
| Llama 3.1 70B | ~22GB | ~40GB | ~48GB | ~55GB | ~70GB | 140GB |
| Llama 3.1 405B | ~125GB | ~200GB+ | - | - | - | - |
Mistral Models
| Model | Q4 | Q5 | Q8 | F16 |
|---|---|---|---|---|
| Mistral 7B | ~4.5GB | ~5.5GB | ~8GB | ~14GB |
| Mistral 12B / Nemo | ~7GB | ~9GB | ~13GB | ~24GB |
| Mixtral 8x7B (MoE) | ~26GB | ~32GB | ~47GB | - |
| Mistral Large 2 (123B) | ~62GB | - | - | - |
DeepSeek R1 Models
| Model | Q2 | Q4 | Q5 | Q8 |
|---|---|---|---|---|
| DeepSeek R1 7B | ~2GB | ~4.5GB | ~5.5GB | ~8GB |
| DeepSeek R1 14B | ~4.5GB | ~9GB | ~11GB | ~15GB |
| DeepSeek R1 32B | ~10GB | ~20GB | ~24GB | ~35GB |
| DeepSeek R1 70B | ~22GB | ~40GB | ~48GB | ~70GB |
| DeepSeek R1 671B | - | ~420GB | - | - |
DeepSeek R1 671B is the full MoE model. Cloud-only for all practical purposes. See the Full DeepSeek R1 GPU Guide for hardware recommendations.
Qwen 2.5 Models
| Model | Q4 | Q5 | Q8 | F16 |
|---|---|---|---|---|
| Qwen 2.5 7B | ~4.5GB | ~5.5GB | ~8GB | ~14GB |
| Qwen 2.5 14B | ~9GB | ~11GB | ~15GB | ~28GB |
| Qwen 2.5 32B | ~20GB | ~24GB | ~35GB | - |
| Qwen 2.5 72B | ~40GB | ~48GB | ~72GB | - |
Google Gemma 3 Models
| Model | Q4 | Q5 | Q8 | F16 |
|---|---|---|---|---|
| Gemma 3 4B | ~2.5GB | ~3GB | ~4.5GB | ~8GB |
| Gemma 3 12B | ~7GB | ~9GB | ~13GB | ~24GB |
| Gemma 3 27B | ~17GB | ~21GB | ~29GB | ~54GB |
VRAM Tiers: What You Can Run at Each Level
8GB VRAM
The minimum viable tier for running local LLMs in 2026. You can run 7B/8B models comfortably at Q4-Q6, but nothing larger without significant quality compromise.
Can run:
- Llama 3.2 8B at Q4/Q5/Q6
- Mistral 7B at Q4/Q5/Q6
- DeepSeek R1 7B at Q4/Q8
- Gemma 3 4B at full Q8
- Qwen 2.5 7B at Q4/Q5
- Stable Diffusion 1.5
Cannot run:
- 13B/14B models at usable quality
- Any 32B+ model
- Stable Diffusion XL reliably
Best 8GB GPU: RTX 4060 8GB on Amazon (~$329)
12GB VRAM
The sweet spot for hobbyists. 12GB lets you run 7B models at near-lossless Q8 quality and opens 13B/14B models at Q4.
Can run:
- Llama 3.2 8B at Q8 (~8.5GB) - near-lossless quality
- Mistral 7B at full Q8
- DeepSeek R1 14B at Q4 (~9GB) - fits with headroom
- Llama 3.1 13B at Q4 (~7.5GB) - comfortable
- Qwen 2.5 14B at Q4 (~9GB)
- Gemma 3 12B at Q5
- Stable Diffusion XL
Cannot run:
- 32B+ models at usable quality
- 14B models at Q8 (needs 15GB)
Best 12GB GPUs:
- Intel Arc B580 12GB on Amazon (~$249) - best value
- RTX 3060 12GB on Amazon (~$299) - CUDA option
16GB VRAM
The recommended minimum for serious local AI work in 2026. 16GB runs 14B models at full Q8 quality and opens early access to 32B at heavy quantization.
Can run:
- Llama 3.1 8B at full F16 (~16GB) - maximum quality, though a tight fit once display overhead is counted
- DeepSeek R1 14B at Q8 (~15GB) - excellent quality
- Mistral 12B at Q8 (~13GB) - comfortable
- Qwen 2.5 14B at Q6/Q8
- Gemma 3 12B at Q8
- Stable Diffusion XL + ControlNet - no constraints
Cannot run:
- 32B models at Q4 (needs ~20GB)
- 70B models at any reasonable quality
Best 16GB GPUs:
- Intel Arc A770 16GB on Amazon (~$280) - best value
- Intel Arc B770 16GB on Amazon (~$349) - faster Battlemage
- RTX 4060 Ti 16GB on Amazon (~$479) - best CUDA option
24GB VRAM
The frontier for consumer AI hardware in 2026. 24GB unlocks 32B models at full Q4 quality and 70B models at Q2.
Can run:
- DeepSeek R1 32B at Q4 (~20GB) - flagship reasoning model
- Llama 3.1 70B at Q2 (~22GB) - impressive despite compression
- DeepSeek R1 70B at Q2 (~22GB)
- Qwen 2.5 72B at Q2 (~22GB)
- Gemma 3 27B at Q5 (~21GB)
- Everything at 16GB and below
Cannot run:
- 70B models at Q4 (needs ~40GB)
- 405B models
Best 24GB GPUs:
- Used RTX 3090 24GB on Amazon (~$475 used) - massive VRAM on a budget
- RTX 4090 24GB on Amazon (~$2,755) - fastest single consumer GPU
See the full comparison: RTX 4090 vs RTX 3090 for Local LLMs
32GB+ VRAM
At 32GB and above, you're in workstation or multi-GPU territory. The RTX 5090 (32GB) sits at the edge of this tier.
32GB (RTX 5090):
- Llama 3.1 70B at Q4 (~38-40GB) - doesn't fully fit; expect to offload roughly 15-20% of the layers to system RAM, with a speed penalty
- 70B at Q2/Q3 - comfortable
- Everything at 24GB tier
48GB (dual RTX 3090 or RTX 6000 Ada):
- Llama 3.1 70B at Q4 (~40GB) - fits comfortably, on a single 48GB card or split across dual 24GB cards
- DeepSeek R1 70B at Q5 (~48GB) - very tight; Q4 is the practical ceiling
- 70B at Q6/Q8 (~55-70GB) - exceeds 48GB; needs a larger or additional card
Best 32GB GPU: RTX 5090 32GB on Amazon (~$2,900-$3,600)
What Happens When You Run Out of VRAM?
When your model exceeds available VRAM, the inference runtime has three options:
1. Refuse to load
Many tools fail to start if the model doesn't fit. You'll see errors like CUDA out of memory. This is the safest outcome.
2. CPU offloading (partial fit)
The excess layers run on CPU. This is the most common scenario with llama.cpp and Ollama.
Speed impact of CPU offloading:
- Full GPU fit: 40-120+ tok/s
- 50% offloaded: 8-15 tok/s
- 90% offloaded: 1-3 tok/s
At 1-3 tokens per second, a 200-token response takes anywhere from one to over three minutes. You'd be better off using a cloud API.
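For reference, here's roughly what partial offloading looks like with llama-cpp-python (one of the runtimes built on llama.cpp). The model path is a placeholder and the right n_gpu_layers value depends on your card, so treat this as a sketch rather than a tuned config.

```python
from llama_cpp import Llama

# n_gpu_layers decides how many transformer layers are kept in VRAM.
# -1 = everything on the GPU (the 40-120+ tok/s case); a smaller number
# leaves the remaining layers on the CPU, which is where speed collapses.
llm = Llama(
    model_path="./model-q4_k_m.gguf",  # placeholder path to a GGUF file
    n_gpu_layers=40,                   # partial offload: tune to what fits your VRAM
    n_ctx=4096,
)

out = llm("Explain the KV cache in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```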
3. Reduce context length
A large portion of VRAM usage during inference is the KV cache - the memory used to store previous tokens. Reducing context from 8K to 2K tokens can save 1-3GB.
In Ollama, set the num_ctx parameter (for example, /set parameter num_ctx 2048 in the interactive CLI, or PARAMETER num_ctx 2048 in a Modelfile). In LM Studio, reduce "Context Length" in the model settings.
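To see why context length matters so much, here's a back-of-the-envelope KV-cache calculator. The architecture numbers in the example calls (32 layers, 8 KV heads, head dimension 128) are assumed values for a Llama-3.1-8B-style model; larger models have more layers and correspondingly bigger caches, which is where the multi-gigabyte savings come from.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size in GB for a given context length (fp16 by default)."""
    # Factor of 2: both keys and values are cached for every layer.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 1024**3

# Assumed Llama-3.1-8B-style config: 32 layers, 8 KV heads, head dim 128
print(kv_cache_gb(32, 8, 128, 8192))  # ~1.0 GB at 8K context
print(kv_cache_gb(32, 8, 128, 2048))  # ~0.25 GB at 2K context
```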
VRAM by Use Case
Coding assistant
Recommended: 12-16GB | Ideal models: DeepSeek R1 14B Q4, Qwen 2.5 14B Q4. 14B code models are the sweet spot - smart enough for complex functions, fast enough for real-time completion.
Chat / general Q&A
Recommended: 8-12GB | Ideal models: Llama 3.2 8B Q6, Mistral 7B Q8. For casual conversation, 8B models at Q6 are excellent.
Reasoning / chain-of-thought
Recommended: 24GB | Ideal models: DeepSeek R1 32B Q4, Llama 3.1 70B Q2. Reasoning models like DeepSeek R1 shine at 32B. See the Best GPU for DeepSeek R1 guide.
Image generation
Recommended: 8GB minimum, 12-16GB ideal. SD 1.5 runs on 6GB; SDXL needs 8GB minimum and 12GB for comfortable batch generation.
Fine-tuning / LoRA training
Recommended: 24GB minimum. LoRA fine-tuning of a 7B model needs ~16-20GB. See Full Fine-Tuning vs PEFT VRAM for the full breakdown.
Quick Reference: VRAM vs. GPU Recommendations
| VRAM | Best Value GPU | Best CUDA GPU | Price Range |
|---|---|---|---|
| 8GB | RX 7600 8GB | RTX 4060 8GB | $200-$329 |
| 12GB | Arc B580 12GB | RTX 3060 12GB | $249-$299 |
| 16GB | Arc A770 16GB | RTX 4060 Ti 16GB | $280-$479 |
| 24GB | Used RTX 3090 | RTX 4090 | $475-$2,755 |
| 32GB | - | RTX 5090 | $2,900-$3,600 |
| 48GB | Dual RTX 3090 | RTX 6000 Ada | $950-$4,000+ |
Not sure which fits your use case? Take the 2-minute GPU Quiz | Browse current GPU deals filtered by VRAM tier
FAQ
How much VRAM do I need to run Llama 3? For Llama 3.2 8B, you need at minimum 6GB VRAM for Q4 quantization, but 8GB is recommended. Llama 3.1 13B needs 8GB at Q4 and 16GB for Q8. Llama 3.1 70B needs 24GB at Q2 quantization or 40GB+ at Q4.
Can I run a 70B model on a consumer GPU? Yes, at Q2 quantization. The RTX 3090 (24GB) and RTX 4090 (24GB) can run Llama 3.1 70B and DeepSeek R1 70B at Q2 (~22GB). For Q4 quality on 70B, you need 40GB+ - either a single 48GB workstation card or dual 24GB cards.
Does system RAM help when VRAM runs out? Partially. When VRAM overflows, layers spill to system RAM - supported by llama.cpp, Ollama, and LM Studio. But system RAM bandwidth is 10-50x lower than GPU VRAM. Expect 90%+ speed drops for heavily offloaded models.
How much VRAM do I need for DeepSeek R1?
- DeepSeek R1 7B: 6GB (Q4), 8GB (Q8)
- DeepSeek R1 14B: 10GB (Q4), 16GB (Q8)
- DeepSeek R1 32B: 21GB (Q4), 36GB (Q8) - requires a 24GB card
- DeepSeek R1 70B: 22GB (Q2), 40GB (Q4)
- DeepSeek R1 671B: 400GB+ - cloud only
Full breakdown: Can You Run DeepSeek R1 on Your GPU?
Is 8GB VRAM enough for AI in 2026? For 7B/8B models, yes. But 8GB is increasingly the floor, not the ceiling. As 14B models become the mainstream recommendation, 8GB becomes a real limitation. Stretching to 12GB (like the Arc B580 at $249) gives much more headroom.
Does the GPU brand matter for LLM inference? Less than it used to. In 2026, Nvidia (CUDA), Intel Arc (SYCL/OpenCL), and AMD (ROCm) all work with Ollama, LM Studio, and llama.cpp. Nvidia still has the most consistent ecosystem, but Intel Arc offers compelling VRAM-per-dollar value.
What's the difference between model VRAM and total VRAM usage? Your GPU also uses VRAM for the OS display stack (~500MB-1GB), other applications, and the KV cache. Rule of thumb: subtract 1-2GB from your GPU's listed VRAM to get the safe working capacity. A 12GB card reliably has ~10-11GB available for model weights.
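One way to get the real number instead of guessing: ask the driver how much VRAM is actually free before you load anything. A minimal sketch, assuming an Nvidia GPU with PyTorch's CUDA build installed:

```python
import torch

# Free vs. total VRAM right now, after the OS display stack and any other
# running apps have taken their share.
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"Free: {free_bytes / 1024**3:.1f} GB of {total_bytes / 1024**3:.1f} GB")
```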
Summary: The VRAM Decision Tree
- Running 7B/8B models only? → 8GB is fine. 12GB is better.
- Want to run 13B/14B models at good quality? → 12GB minimum, 16GB recommended.
- Want to run 32B models (DeepSeek R1 32B, Qwen 32B)? → 24GB required.
- Want to run 70B models? → 24GB for Q2, 40GB+ for Q4.
- Running image generation alongside LLMs? → Add 2-4GB to your LLM requirement.
- CUDA required for your workflow? → Nvidia only. Otherwise, Intel Arc offers better VRAM-per-dollar.
Browse budget GPU options by VRAM tier | Take the GPU Quiz for a personalized recommendation
Prices accurate as of April 2026. VRAM requirements are approximate and include ~10% overhead for KV cache and runtime.
About the Author: Justin Murray
Founder of AI Computer Guide, Justin has over a decade of AI and computer hardware experience, from the cryptocurrency mining hardware rush to repairing personal and commercial computer hardware, and has always had a passion for sharing knowledge and the cutting edge.
