How Much VRAM Do You Need to Run LLMs in 2026? The Complete Guide
VRAM is the single most important spec for running large language models locally. Get this wrong, and your model either refuses to load, crawls at 1-2 tokens per second while grinding system RAM, or crashes outright.
This guide gives you the exact VRAM numbers for every major model and quantization level in 2026. By the end, you'll know exactly what GPU you need for your specific models.
Quick start: Take the GPU Quiz | GPU Deals by VRAM Tier | DeepSeek R1 VRAM Requirements
At a Glance: VRAM Requirements by Use Case (2026)
- Casual use (7B/8B models, Q4): 6GB minimum, 8GB recommended
- Power user (13B/14B models, good quality): 12GB minimum, 16GB recommended
- Serious local AI (32B models, Q4): 24GB minimum
- Frontier local AI (70B models, Q4): 48GB or dual 24GB
- Image generation (Stable Diffusion XL): 8GB minimum, 12GB comfortable
- Multimodal (vision + text): Add 2-4GB on top of text model requirements
Why VRAM Is the Bottleneck for LLMs
Unlike traditional GPU workloads (gaming, rendering), LLMs don't need massive compute throughput; they need fast, large memory. The entire model must be loaded into VRAM before it can generate a single token. A 7B parameter model at 4-bit quantization occupies ~4.5GB of VRAM. A 70B model at the same quantization needs ~40GB.
If your model doesn't fit in VRAM:
- The excess spills to system RAM (called CPU offloading)
- CPU RAM is 10-50x slower than GPU VRAM
- Generation drops from 40-100+ tok/s to 2-8 tok/s
The rule: Every layer of the model that fits in VRAM is fast. Every layer that spills to RAM is dramatically slower. Full VRAM fit = fast. Partial fit = slow. No fit = unacceptably slow.
For a deeper dive on multi-GPU setups that eliminate this problem entirely, see our Llama 3.3 Hardware Requirements guide.
Quantization Explained: Q4, Q5, Q8, F16
Quantization is how you trade model quality for VRAM. Here's what the numbers mean:
| Format | Bits per weight | VRAM multiplier (vs. Q8_0) | Quality |
|---|---|---|---|
| F32 | 32-bit float | 4x | Reference quality (impractical) |
| F16 / BF16 | 16-bit float | 2x | Full quality, high VRAM |
| Q8_0 | 8-bit integer | 1x | Near-lossless, ~1% degradation |
| Q6_K | 6-bit | 0.75x | Excellent quality, ~1-2% loss |
| Q5_K_M | 5-bit | 0.625x | Good quality, ~2-3% loss |
| Q4_K_M | 4-bit | 0.5x | Good quality, ~3-5% loss |
| Q3_K_M | 3-bit | 0.375x | Acceptable, ~5-8% loss |
| Q2_K | 2-bit | 0.25x | Noticeable degradation |
Which quantization should you use?
- Q4_K_M is the standard recommendation for most users - half the VRAM of F16 with ~3-5% quality loss
- Q5_K_M is a good middle ground if you have the VRAM - noticeably better than Q4 with modest cost
- Q8_0 is essentially lossless - use it when VRAM allows, especially for coding tasks
- Q2 should only be used to fit a model that otherwise won't fit at all - quality degrades significantly
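In practice, the quantization choice is simply which file you download: GGUF repos typically ship one file per quant level. Here's a minimal sketch using the Hugging Face Hub client; the repo and filename are placeholders, not a specific recommended model.

```python
from huggingface_hub import hf_hub_download

# Each GGUF file in a repo is one quantization level (Q4_K_M, Q5_K_M, Q8_0, ...).
# Pick the file whose size fits your VRAM budget using the tables below.
path = hf_hub_download(
    repo_id="example-org/Example-8B-Instruct-GGUF",   # placeholder repo
    filename="example-8b-instruct.Q4_K_M.gguf",       # placeholder quant file
)
print(path)  # local path to the downloaded weights
```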
VRAM formula
VRAM needed = (Parameters in billions) x (bytes per weight) + ~10% overhead
- F16: ~2GB per billion parameters
- Q8: ~1GB per billion parameters
- Q4: ~0.5GB per billion parameters
- Q2: ~0.25GB per billion parameters
Example: Llama 3.1 70B at Q4 = 70 x 0.5 = 35GB + 10% overhead = ~38-40GB
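If you'd rather not do the arithmetic by hand, the formula above is easy to turn into a small helper. This is a rough estimator, not a measurement - real GGUF files vary slightly by quant mix, and overhead grows with context length.

```python
# Bytes per weight for common quant levels (matches the table above).
BYTES_PER_WEIGHT = {
    "F16": 2.0, "Q8": 1.0, "Q6": 0.75, "Q5": 0.625,
    "Q4": 0.5, "Q3": 0.375, "Q2": 0.25,
}

def estimate_vram_gb(params_billion: float, quant: str, overhead: float = 0.10) -> float:
    """Approximate VRAM footprint in GB: weights plus ~10% runtime/KV-cache overhead."""
    weights_gb = params_billion * BYTES_PER_WEIGHT[quant]
    return weights_gb * (1 + overhead)

print(round(estimate_vram_gb(70, "Q4"), 1))  # ~38.5 -> matches the ~38-40GB figure above
print(round(estimate_vram_gb(8, "Q8"), 1))   # ~8.8  -> why an 8B at Q8 wants a 12GB card
```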
Master VRAM Requirements Table
Llama 3.1 / 3.2 / 3.3 Models
| Model | Q2 | Q4 | Q5 | Q6 | Q8 | F16 |
|---|---|---|---|---|---|---|
| Llama 3.2 3B | ~1GB | ~2GB | ~2.4GB | ~2.8GB | ~3.5GB | ~6GB |
| Llama 3.1/3.2 8B | ~2.5GB | ~4.5GB | ~5.5GB | ~6.5GB | ~8.5GB | ~16GB |
| Llama 3.1 13B | ~4GB | ~7.5GB | ~9GB | ~11GB | ~14GB | ~26GB |
| Llama 3.1 70B | ~22GB | ~40GB | ~48GB | ~55GB | ~70GB | 140GB |
| Llama 3.1 405B | ~125GB | ~200GB+ | - | - | - | - |
Mistral Models
| Model | Q4 | Q5 | Q8 | F16 |
|---|---|---|---|---|
| Mistral 7B | ~4.5GB | ~5.5GB | ~8GB | ~14GB |
| Mistral 12B / Nemo | ~7GB | ~9GB | ~13GB | ~24GB |
| Mixtral 8x7B (MoE) | ~26GB | ~32GB | ~47GB | - |
| Mistral Large 2 (123B) | ~62GB | - | - | - |
DeepSeek R1 Models
| Model | Q2 | Q4 | Q5 | Q8 |
|---|---|---|---|---|
| DeepSeek R1 7B | ~2GB | ~4.5GB | ~5.5GB | ~8GB |
| DeepSeek R1 14B | ~4.5GB | ~9GB | ~11GB | ~15GB |
| DeepSeek R1 32B | ~10GB | ~20GB | ~24GB | ~35GB |
| DeepSeek R1 70B | ~22GB | ~40GB | ~48GB | ~70GB |
| DeepSeek R1 671B | - | ~420GB | - | - |
DeepSeek R1 671B is the full MoE model. Cloud-only for all practical purposes. See the Full DeepSeek R1 GPU Guide for hardware recommendations.
Qwen 2.5 Models
| Model | Q4 | Q5 | Q8 | F16 |
|---|---|---|---|---|
| Qwen 2.5 7B | ~4.5GB | ~5.5GB | ~8GB | ~14GB |
| Qwen 2.5 14B | ~9GB | ~11GB | ~15GB | ~28GB |
| Qwen 2.5 32B | ~20GB | ~24GB | ~35GB | - |
| Qwen 2.5 72B | ~40GB | ~48GB | ~72GB | - |
Google Gemma 3 Models
| Model | Q4 | Q5 | Q8 | F16 |
|---|---|---|---|---|
| Gemma 3 4B | ~2.5GB | ~3GB | ~4.5GB | ~8GB |
| Gemma 3 12B | ~7GB | ~9GB | ~13GB | ~24GB |
| Gemma 3 27B | ~17GB | ~21GB | ~29GB | ~54GB |
VRAM Tiers: What You Can Run at Each Level
8GB VRAM
The minimum viable tier for running local LLMs in 2026. You can run 7B/8B models comfortably at Q4-Q6, but nothing larger without significant quality compromise.
Can run:
- Llama 3.2 8B at Q4/Q5/Q6
- Mistral 7B at Q4/Q5/Q6
- DeepSeek R1 7B at Q4/Q8
- Gemma 3 4B at full Q8
- Qwen 2.5 7B at Q4/Q5
- Stable Diffusion 1.5
Cannot run:
- 13B/14B models at usable quality
- Any 32B+ model
- Stable Diffusion XL reliably
Best 8GB GPU: RTX 4060 8GB on Amazon (~$329)
12GB VRAM
The sweet spot for hobbyists. 12GB lets you run 7B models at near-lossless Q8 quality and opens 13B/14B models at Q4.
Can run:
- Llama 3.2 8B at Q8 (~8.5GB) - near-lossless quality
- Mistral 7B at full Q8
- DeepSeek R1 14B at Q4 (~9GB) - fits with headroom
- Llama 3.1 13B at Q4 (~7.5GB) - comfortable
- Qwen 2.5 14B at Q4 (~9GB)
- Gemma 3 12B at Q5
- Stable Diffusion XL
Cannot run:
- 32B+ models at usable quality
- 14B models at Q8 (needs 15GB)
Best 12GB GPUs:
- Intel Arc B580 12GB on Amazon (~$249) - best value
- RTX 3060 12GB on Amazon (~$299) - CUDA option
16GB VRAM
The recommended minimum for serious local AI work in 2026. 16GB runs 14B models at full Q8 quality and opens early access to 32B at heavy quantization.
Can run:
- Llama 3.1 8B at full F16 (~16GB) - maximum quality, though a tight fit once display overhead is counted
- DeepSeek R1 14B at Q8 (~15GB) - excellent quality
- Mistral 12B at Q8 (~13GB) - comfortable
- Qwen 2.5 14B at Q6/Q8
- Gemma 3 12B at Q8
- Stable Diffusion XL + ControlNet - no constraints
Cannot run:
- 32B models at Q4 (needs ~20GB)
- 70B models at any reasonable quality
Best 16GB GPUs:
- Intel Arc A770 16GB on Amazon (~$280) - best value
- Intel Arc B770 16GB on Amazon (~$349) - faster Battlemage
- RTX 4060 Ti 16GB on Amazon (~$479) - best CUDA option
24GB VRAM
The frontier for consumer AI hardware in 2026. 24GB unlocks 32B models at full Q4 quality and 70B models at Q2.
Can run:
- DeepSeek R1 32B at Q4 (~20GB) - flagship reasoning model
- Llama 3.1 70B at Q2 (~22GB) - impressive despite compression
- DeepSeek R1 70B at Q2 (~22GB)
- Qwen 2.5 72B at Q2 (~22GB)
- Gemma 3 27B at Q5 (~21GB)
- Everything at 16GB and below
Cannot run:
- 70B models at Q4 (needs ~40GB)
- 405B models
Best 24GB GPUs:
- Used RTX 3090 24GB on Amazon (~$475 used) - massive VRAM on a budget
- RTX 4090 24GB on Amazon (~$2,755) - fastest single consumer GPU
See the full comparison: RTX 4090 vs RTX 3090 for Local LLMs
32GB+ VRAM
At 32GB and above, you're in workstation or multi-GPU territory. The RTX 5090 (32GB) sits at the edge of this tier.
32GB (RTX 5090):
- Llama 3.1 70B at Q4 (~38-40GB) - doesn't fully fit; expect to offload roughly 15-20% of the layers to system RAM, with a speed penalty
- 70B at Q2/Q3 - comfortable
- Everything at 24GB tier
48GB (dual RTX 3090 or RTX 6000 Ada):
- Llama 3.1 70B at Q4 (~40GB) - fits comfortably, on a single 48GB card or split across dual 24GB cards
- DeepSeek R1 70B at Q5 (~48GB) - very tight; Q4 is the practical ceiling
- 70B at Q6/Q8 (~55-70GB) - exceeds 48GB; needs a larger or additional card
Best 32GB GPU: RTX 5090 32GB on Amazon (~$2,900-$3,600)
What Happens When You Run Out of VRAM?
When your model exceeds available VRAM, the inference runtime has three options:
1. Refuse to load
Many tools fail to start if the model doesn't fit. You'll see errors like CUDA out of memory. This is the safest outcome.
2. CPU offloading (partial fit)
The excess layers run on CPU. This is the most common scenario with llama.cpp and Ollama.
Speed impact of CPU offloading:
- Full GPU fit: 40-120+ tok/s
- 50% offloaded: 8-15 tok/s
- 90% offloaded: 1-3 tok/s
At 1-3 tokens per second, a 200-token response takes anywhere from one to over three minutes. You'd be better off using a cloud API.
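For reference, here's roughly what partial offloading looks like with llama-cpp-python (one of the runtimes built on llama.cpp). The model path is a placeholder and the right n_gpu_layers value depends on your card, so treat this as a sketch rather than a tuned config.

```python
from llama_cpp import Llama

# n_gpu_layers decides how many transformer layers are kept in VRAM.
# -1 = everything on the GPU (the 40-120+ tok/s case); a smaller number
# leaves the remaining layers on the CPU, which is where speed collapses.
llm = Llama(
    model_path="./model-q4_k_m.gguf",  # placeholder path to a GGUF file
    n_gpu_layers=40,                   # partial offload: tune to what fits your VRAM
    n_ctx=4096,
)

out = llm("Explain the KV cache in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```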
3. Reduce context length
A large portion of VRAM usage during inference is the KV cache - the memory used to store previous tokens. Reducing context from 8K to 2K tokens can save 1-3GB.
In Ollama, set the num_ctx parameter (for example, /set parameter num_ctx 2048 in the interactive CLI, or PARAMETER num_ctx 2048 in a Modelfile). In LM Studio, reduce "Context Length" in the model settings.
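To see why context length matters so much, here's a back-of-the-envelope KV-cache calculator. The architecture numbers in the example calls (32 layers, 8 KV heads, head dimension 128) are assumed values for a Llama-3.1-8B-style model; larger models have more layers and correspondingly bigger caches, which is where the multi-gigabyte savings come from.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size in GB for a given context length (fp16 by default)."""
    # Factor of 2: both keys and values are cached for every layer.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 1024**3

# Assumed Llama-3.1-8B-style config: 32 layers, 8 KV heads, head dim 128
print(kv_cache_gb(32, 8, 128, 8192))  # ~1.0 GB at 8K context
print(kv_cache_gb(32, 8, 128, 2048))  # ~0.25 GB at 2K context
```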
VRAM by Use Case
Coding assistant
Recommended: 12-16GB | Ideal models: DeepSeek R1 14B Q4, Qwen 2.5 14B Q4. 14B code models are the sweet spot - smart enough for complex functions, fast enough for real-time completion.
Chat / general Q&A
Recommended: 8-12GB | Ideal models: Llama 3.2 8B Q6, Mistral 7B Q8. For casual conversation, 8B models at Q6 are excellent.
Reasoning / chain-of-thought
Recommended: 24GB | Ideal models: DeepSeek R1 32B Q4, Llama 3.1 70B Q2. Reasoning models like DeepSeek R1 shine at 32B. See the Best GPU for DeepSeek R1 guide.
Image generation
Recommended: 8GB minimum, 12-16GB ideal. SD 1.5 runs on 6GB; SDXL needs 8GB minimum and 12GB for comfortable batch generation.
Fine-tuning / LoRA training
Recommended: 24GB minimum. LoRA fine-tuning of a 7B model needs ~16-20GB. See Full Fine-Tuning vs PEFT VRAM for the full breakdown.
Quick Reference: VRAM vs. GPU Recommendations
| VRAM | Best Value GPU | Best CUDA GPU | Price Range |
|---|---|---|---|
| 8GB | RX 7600 8GB | RTX 4060 8GB | $200-$329 |
| 12GB | Arc B580 12GB | RTX 3060 12GB | $249-$299 |
| 16GB | Arc A770 16GB | RTX 4060 Ti 16GB | $280-$479 |
| 24GB | Used RTX 3090 | RTX 4090 | $475-$2,755 |
| 32GB | - | RTX 5090 | $2,900-$3,600 |
| 48GB | Dual RTX 3090 | RTX 6000 Ada | $950-$4,000+ |
Not sure which fits your use case? Take the 2-minute GPU Quiz | Browse current GPU deals filtered by VRAM tier
FAQ
How much VRAM do I need to run Llama 3? For Llama 3.2 8B, you need at minimum 6GB VRAM for Q4 quantization, but 8GB is recommended. Llama 3.1 13B needs 8GB at Q4 and 16GB for Q8. Llama 3.1 70B needs 24GB at Q2 quantization or 40GB+ at Q4.
Can I run a 70B model on a consumer GPU? Yes, at Q2 quantization. The RTX 3090 (24GB) and RTX 4090 (24GB) can run Llama 3.1 70B and DeepSeek R1 70B at Q2 (~22GB). For Q4 quality on 70B, you need 40GB+ - either a single 48GB workstation card or dual 24GB cards.
Does system RAM help when VRAM runs out? Partially. When VRAM overflows, layers spill to system RAM - supported by llama.cpp, Ollama, and LM Studio. But system RAM bandwidth is 10-50x lower than GPU VRAM. Expect 90%+ speed drops for heavily offloaded models.
How much VRAM do I need for DeepSeek R1?
- DeepSeek R1 7B: 6GB (Q4), 8GB (Q8)
- DeepSeek R1 14B: 10GB (Q4), 16GB (Q8)
- DeepSeek R1 32B: 21GB (Q4), 36GB (Q8) - requires a 24GB card
- DeepSeek R1 70B: 22GB (Q2), 40GB (Q4)
- DeepSeek R1 671B: 400GB+ - cloud only
Full breakdown: Can You Run DeepSeek R1 on Your GPU?
Is 8GB VRAM enough for AI in 2026? For 7B/8B models, yes. But 8GB is increasingly the floor, not the ceiling. As 14B models become the mainstream recommendation, 8GB becomes a real limitation. Stretching to 12GB (like the Arc B580 at $249) gives much more headroom.
Does the GPU brand matter for LLM inference? Less than it used to. In 2026, Nvidia (CUDA), Intel Arc (SYCL/OpenCL), and AMD (ROCm) all work with Ollama, LM Studio, and llama.cpp. Nvidia still has the most consistent ecosystem, but Intel Arc offers compelling VRAM-per-dollar value.
What's the difference between model VRAM and total VRAM usage? Your GPU also uses VRAM for the OS display stack (~500MB-1GB), other applications, and the KV cache. Rule of thumb: subtract 1-2GB from your GPU's listed VRAM to get the safe working capacity. A 12GB card reliably has ~10-11GB available for model weights.
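One way to get the real number instead of guessing: ask the driver how much VRAM is actually free before you load anything. A minimal sketch, assuming an Nvidia GPU with PyTorch's CUDA build installed:

```python
import torch

# Free vs. total VRAM right now, after the OS display stack and any other
# running apps have taken their share.
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"Free: {free_bytes / 1024**3:.1f} GB of {total_bytes / 1024**3:.1f} GB")
```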
Summary: The VRAM Decision Tree
- Running 7B/8B models only? → 8GB is fine. 12GB is better.
- Want to run 13B/14B models at good quality? → 12GB minimum, 16GB recommended.
- Want to run 32B models (DeepSeek R1 32B, Qwen 32B)? → 24GB required.
- Want to run 70B models? → 24GB for Q2, 40GB+ for Q4.
- Running image generation alongside LLMs? → Add 2-4GB to your LLM requirement.
- CUDA required for your workflow? → Nvidia only. Otherwise, Intel Arc offers better VRAM-per-dollar.
Browse budget GPU options by VRAM tier | Take the GPU Quiz for a personalized recommendation
Prices accurate as of April 2026. VRAM requirements are approximate and include ~10% overhead for KV cache and runtime.
About the Author: Justin Murray
Founder of AI Computer Guide, Justin has over a decade of AI and computer hardware experience, from the cryptocurrency mining hardware rush to repairing personal and commercial computer hardware, and has always had a passion for sharing knowledge and the cutting edge.
