How Much VRAM Do You Need to Run LLMs in 2026? The Complete Guide

By Justin Murray • Hardware Guide


VRAM is the single most important spec for running large language models locally. Get this wrong, and your model either refuses to load, crawls at 1-2 tokens per second while grinding system RAM, or crashes outright.

This guide gives you the exact VRAM numbers for every major model and quantization level in 2026. By the end, you'll know exactly what GPU you need for your specific models.

Quick start: Take the GPU Quiz | GPU Deals by VRAM Tier | DeepSeek R1 VRAM Requirements


At a Glance: VRAM Requirements by Use Case (2026)

  • Casual use (7B/8B models, Q4): 6GB minimum, 8GB recommended
  • Power user (13B/14B models, good quality): 12GB minimum, 16GB recommended
  • Serious local AI (32B models, Q4): 24GB minimum
  • Frontier local AI (70B models, Q4): 48GB or dual 24GB
  • Image generation (Stable Diffusion XL): 8GB minimum, 12GB comfortable
  • Multimodal (vision + text): Add 2-4GB on top of text model requirements

Why VRAM Is the Bottleneck for LLMs

Unlike traditional GPU workloads (gaming, rendering), LLMs don't need massive compute throughput; they need fast, large memory. The entire model must be loaded into VRAM before it can generate a single token. A 7B parameter model at 4-bit quantization occupies ~4.5GB of VRAM. A 70B model at the same quantization needs ~40GB.

If your model doesn't fit in VRAM:

  1. The excess spills to system RAM (called CPU offloading)
  2. CPU RAM is 10-50x slower than GPU VRAM
  3. Generation drops from 40-100+ tok/s to 2-8 tok/s

The rule: Every layer of the model that fits in VRAM is fast. Every layer that spills to RAM is dramatically slower. Full VRAM fit = fast. Partial fit = slow. No fit = unacceptably slow.
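The rule above can be sketched as a quick estimate. This is a simplification that assumes every layer takes a roughly equal share of the model (the same assumption behind llama.cpp's `--n-gpu-layers` offload setting); the function name and numbers are illustrative, not from any specific runtime:

```python
def layers_on_gpu(n_layers: int, model_gb: float, free_vram_gb: float) -> int:
    """Estimate how many transformer layers fit in free VRAM,
    assuming each layer takes an equal share of the model size."""
    gb_per_layer = model_gb / n_layers
    return min(n_layers, int(free_vram_gb // gb_per_layer))

# Llama 3.1 70B at Q4 (~40GB, 80 layers) on a card with 24GB free:
print(layers_on_gpu(80, 40.0, 24.0))  # -> 48 layers on GPU; the other 32 spill to RAM
```

Any layer count below the model's total means partial offload, and generation speed drops accordingly.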

For a deeper dive on multi-GPU setups that eliminate this problem entirely, see our Llama 3.3 Hardware Requirements guide.


Quantization Explained: Q4, Q5, Q8, F16

Quantization is how you trade model quality for VRAM. Here's what the numbers mean:

| Format | Bits per weight | VRAM multiplier | Quality |
|---|---|---|---|
| F32 | 32-bit float | 4x | Reference quality (impractical) |
| F16 / BF16 | 16-bit float | 2x | Full quality, high VRAM |
| Q8_0 | 8-bit integer | 1x | Near-lossless, ~1% degradation |
| Q6_K | 6-bit | 0.75x | Excellent quality, ~1-2% loss |
| Q5_K_M | 5-bit | 0.625x | Good quality, ~2-3% loss |
| Q4_K_M | 4-bit | 0.5x | Good quality, ~3-5% loss |
| Q3_K_M | 3-bit | 0.375x | Acceptable, ~5-8% loss |
| Q2_K | 2-bit | 0.25x | Noticeable degradation |

Which quantization should you use?

  • Q4_K_M is the standard recommendation for most users - half the VRAM of F16 with ~3-5% quality loss
  • Q5_K_M is a good middle ground if you have the VRAM - noticeably better than Q4 at a modest cost
  • Q8_0 is essentially lossless - use it when VRAM allows, especially for coding tasks
  • Q2 should only be used to fit a model that otherwise won't fit at all - quality degrades significantly

VRAM formula

VRAM needed = (Parameters in billions) x (bytes per weight) + ~10% overhead

  • F16: parameters x 2GB
  • Q8: parameters x 1GB
  • Q4: parameters x 0.5GB
  • Q2: parameters x 0.25GB

Example: Llama 3.1 70B at Q4 = 70 x 0.5 = 35GB + 10% overhead = ~38-40GB
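The formula can be wrapped in a small helper. This is a sketch, not a library function; the ~10% overhead default is the same rule of thumb used throughout this guide:

```python
def vram_gb(params_billion: float, bits_per_weight: float, overhead: float = 0.10) -> float:
    """Estimate VRAM needed for model weights plus ~10% runtime overhead."""
    bytes_per_weight = bits_per_weight / 8
    return params_billion * bytes_per_weight * (1 + overhead)

print(round(vram_gb(70, 4), 1))  # Llama 3.1 70B at Q4 -> 38.5 (GB)
print(round(vram_gb(8, 4), 1))   # Llama 3.1 8B at Q4  -> 4.4 (GB)
```

Real GGUF files vary a little from this estimate because the K-quant formats mix bit widths across tensors, but it lands within a gigabyte or two of the tables below.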


Master VRAM Requirements Table

Llama 3.1 / 3.2 / 3.3 Models

| Model | Q2 | Q4 | Q5 | Q6 | Q8 | F16 |
|---|---|---|---|---|---|---|
| Llama 3.2 3B | ~1GB | ~2GB | ~2.4GB | ~2.8GB | ~3.5GB | ~6GB |
| Llama 3.1/3.2 8B | ~2.5GB | ~4.5GB | ~5.5GB | ~6.5GB | ~8.5GB | ~16GB |
| Llama 3.1 13B | ~4GB | ~7.5GB | ~9GB | ~11GB | ~14GB | ~26GB |
| Llama 3.1 70B | ~22GB | ~40GB | ~48GB | ~55GB | ~70GB | ~140GB |
| Llama 3.1 405B | ~125GB | ~200GB+ | - | - | - | - |

Mistral Models

| Model | Q4 | Q5 | Q8 | F16 |
|---|---|---|---|---|
| Mistral 7B | ~4.5GB | ~5.5GB | ~8GB | ~14GB |
| Mistral 12B / Nemo | ~7GB | ~9GB | ~13GB | ~24GB |
| Mixtral 8x7B (MoE) | ~26GB | ~32GB | ~47GB | - |
| Mistral Large 2 (123B) | ~62GB | - | - | - |

DeepSeek R1 Models

| Model | Q2 | Q4 | Q5 | Q8 |
|---|---|---|---|---|
| DeepSeek R1 7B | ~2GB | ~4.5GB | ~5.5GB | ~8GB |
| DeepSeek R1 14B | ~4.5GB | ~9GB | ~11GB | ~15GB |
| DeepSeek R1 32B | ~10GB | ~20GB | ~24GB | ~35GB |
| DeepSeek R1 70B | ~22GB | ~40GB | ~48GB | ~70GB |
| DeepSeek R1 671B | - | ~420GB | - | - |

DeepSeek R1 671B is the full MoE model. Cloud-only for all practical purposes. See the Full DeepSeek R1 GPU Guide for hardware recommendations.

Qwen 2.5 Models

| Model | Q4 | Q5 | Q8 | F16 |
|---|---|---|---|---|
| Qwen 2.5 7B | ~4.5GB | ~5.5GB | ~8GB | ~14GB |
| Qwen 2.5 14B | ~9GB | ~11GB | ~15GB | ~28GB |
| Qwen 2.5 32B | ~20GB | ~24GB | ~35GB | - |
| Qwen 2.5 72B | ~40GB | ~48GB | ~72GB | - |

Google Gemma 3 Models

| Model | Q4 | Q5 | Q8 | F16 |
|---|---|---|---|---|
| Gemma 3 4B | ~2.5GB | ~3GB | ~4.5GB | ~8GB |
| Gemma 3 12B | ~7GB | ~9GB | ~13GB | ~24GB |
| Gemma 3 27B | ~17GB | ~21GB | ~29GB | ~54GB |

VRAM Tiers: What You Can Run at Each Level

8GB VRAM

The minimum viable tier for running local LLMs in 2026. You can run 7B/8B models comfortably at Q4-Q6, but nothing larger without significant quality compromise.

Can run:

  • Llama 3.2 8B at Q4/Q5/Q6
  • Mistral 7B at Q4/Q5/Q6
  • DeepSeek R1 7B at Q4/Q8
  • Gemma 3 4B at full Q8
  • Qwen 2.5 7B at Q4/Q5
  • Stable Diffusion 1.5

Cannot run:

  • 13B/14B models at usable quality
  • Any 32B+ model
  • Stable Diffusion XL reliably

Best 8GB GPU: RTX 4060 8GB on Amazon (~$329)


12GB VRAM

The sweet spot for hobbyists. 12GB lets you run 7B models at near-lossless Q8 quality and opens 13B/14B models at Q4.

Can run:

  • Llama 3.2 8B at Q8 (~8.5GB) - near-lossless quality
  • Mistral 7B at full Q8
  • DeepSeek R1 14B at Q4 (~9GB) - fits with headroom
  • Llama 3.1 13B at Q4 (~7.5GB) - comfortable
  • Qwen 2.5 14B at Q4 (~9GB)
  • Gemma 3 12B at Q5
  • Stable Diffusion XL

Cannot run:

  • 32B+ models at usable quality
  • 14B models at Q8 (needs 15GB)

Best 12GB GPUs:


16GB VRAM

The recommended minimum for serious local AI work in 2026. 16GB runs 14B models at full Q8 quality and opens early access to 32B at heavy quantization.

Can run:

  • Llama 3.1 8B at full F16 (~16GB) - maximum quality, though tight once display overhead is counted
  • DeepSeek R1 14B at Q8 (~15GB) - excellent quality
  • Mistral 12B at Q8 (~13GB) - comfortable
  • Qwen 2.5 14B at Q6/Q8
  • Gemma 3 12B at Q8
  • Stable Diffusion XL + ControlNet - no constraints

Cannot run:

  • 32B models at Q4 (needs ~20GB)
  • 70B models at any reasonable quality

Best 16GB GPUs:


24GB VRAM

The frontier for consumer AI hardware in 2026. 24GB unlocks 32B models at full Q4 quality and 70B models at Q2.

Can run:

  • DeepSeek R1 32B at Q4 (~20GB) - flagship reasoning model
  • Llama 3.1 70B at Q2 (~22GB) - impressive despite compression
  • DeepSeek R1 70B at Q2 (~22GB)
  • Qwen 2.5 72B at Q2 (~22GB)
  • Gemma 3 27B at Q5 (~21GB)
  • Everything at 16GB and below

Cannot run:

  • 70B models at Q4 (needs ~40GB)
  • 405B models

Best 24GB GPUs:

See the full comparison: RTX 4090 vs RTX 3090 for Local LLMs


32GB+ VRAM

Above 32GB, you're in workstation or multi-GPU territory. The RTX 5090 (32GB) sits at the edge of this tier.

32GB (RTX 5090):

  • Llama 3.1 70B at Q4 (~38-40GB) - does not fully fit; expect to offload several layers to system RAM
  • 70B at Q2/Q3 - comfortable
  • Everything at 24GB tier

48GB (dual RTX 3090 or RTX 6000 Ada):

  • Llama 3.1 70B at Q4 (~40GB) - fits on a single 48GB card with headroom
  • DeepSeek R1 70B at Q4/Q5 - Q5 (~48GB) is very tight; Q6/Q8 exceeds 48GB
  • Dual 24GB cards reach this tier by splitting layers across both GPUs

Best 32GB GPU: RTX 5090 32GB on Amazon (~$2,900-$3,600)


What Happens When You Run Out of VRAM?

When your model exceeds available VRAM, the inference runtime has three options:

1. Refuse to load

Many tools fail to start if the model doesn't fit. You'll see errors like CUDA out of memory. This is the safest outcome.

2. CPU offloading (partial fit)

The excess layers run on CPU. This is the most common scenario with llama.cpp and Ollama.

Speed impact of CPU offloading:

  • Full GPU fit: 40-120+ tok/s
  • 50% offloaded: 8-15 tok/s
  • 90% offloaded: 1-3 tok/s

At 1-3 tokens per second, a 200-token response takes over a minute. You'd be better off using a cloud API.
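The arithmetic behind that claim is simple enough to check directly (the function is an illustration, not part of any tool):

```python
def response_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Time to generate a response at a given throughput."""
    return n_tokens / tokens_per_second

print(response_seconds(200, 2))           # heavily offloaded: 100.0 s
print(round(response_seconds(200, 60), 1))  # full GPU fit: -> 3.3 s
```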

3. Reduce context length

A large portion of VRAM usage during inference is the KV cache - the memory used to store previous tokens. Reducing context from 8K to 2K tokens can save 1-3GB.

In Ollama: set OLLAMA_NUM_CTX=2048. In LM Studio: reduce "Context Length" in model settings.
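For a rough sense of scale, the KV cache grows linearly with context length. Here is a sketch using the standard transformer KV-cache formula and Llama 3.1 8B's published architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128, 16-bit cache) - larger models with more layers and heads account for the multi-gigabyte savings mentioned above:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 tensors (K and V) per layer, one entry per token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem
    return total_bytes / 1024**3

print(kv_cache_gib(32, 8, 128, 8192))  # 8K context -> 1.0 GiB
print(kv_cache_gib(32, 8, 128, 2048))  # 2K context -> 0.25 GiB
```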


VRAM by Use Case

Coding assistant

Recommended: 12-16GB | Ideal models: DeepSeek R1 14B Q4, Qwen 2.5 14B Q4. 14B code models are the sweet spot - smart enough for complex functions, fast enough for real-time completion.

Chat / general Q&A

Recommended: 8-12GB | Ideal models: Llama 3.2 8B Q6, Mistral 7B Q8. For casual conversation, 8B models at Q6 are excellent.

Reasoning / chain-of-thought

Recommended: 24GB | Ideal models: DeepSeek R1 32B Q4, Llama 3.1 70B Q2. Reasoning models like DeepSeek R1 shine at 32B. See the Best GPU for DeepSeek R1 guide.

Image generation

Recommended: 8GB minimum, 12-16GB ideal. SD 1.5 runs on 6GB. SDXL needs 8GB minimum, 12GB for comfortable batch generation.

Fine-tuning / LoRA training

Recommended: 24GB minimum. LoRA fine-tuning of a 7B model needs ~16-20GB. See Full Fine-Tuning vs PEFT VRAM for the full breakdown.


Quick Reference: VRAM vs. GPU Recommendations

| VRAM | Best Value GPU | Best CUDA GPU | Price Range |
|---|---|---|---|
| 8GB | RX 7600 8GB | RTX 4060 8GB | $200-$329 |
| 12GB | Arc B580 12GB | RTX 3060 12GB | $249-$299 |
| 16GB | Arc A770 16GB | RTX 4060 Ti 16GB | $280-$479 |
| 24GB | Used RTX 3090 | RTX 4090 | $475-$2,755 |
| 32GB | - | RTX 5090 | $2,900-$3,600 |
| 48GB | Dual RTX 3090 | RTX 6000 Ada | $950-$4,000+ |

Not sure which fits your use case? Take the 2-minute GPU Quiz, or browse current GPU deals filtered by VRAM tier.


FAQ

How much VRAM do I need to run Llama 3? For Llama 3.2 8B, you need at minimum 6GB VRAM for Q4 quantization, but 8GB is recommended. Llama 3.1 13B needs ~7.5GB at Q4 (an 8GB card is tight; 12GB is safer) and ~14GB at Q8. Llama 3.1 70B needs 24GB at Q2 quantization or 40GB+ at Q4.

Can I run a 70B model on a consumer GPU? Yes, at Q2 quantization. The RTX 3090 (24GB) and RTX 4090 (24GB) can run Llama 3.1 70B and DeepSeek R1 70B at Q2 (~22GB). For Q4 quality on 70B, you need 40GB+ - either a single 48GB workstation card or dual 24GB cards.

Does system RAM help when VRAM runs out? Partially. When VRAM overflows, layers spill to system RAM - supported by llama.cpp, Ollama, and LM Studio. But system RAM bandwidth is 10-50x lower than GPU VRAM. Expect 90%+ speed drops for heavily offloaded models.

How much VRAM do I need for DeepSeek R1?

  • DeepSeek R1 7B: 6GB (Q4), 8GB (Q8)
  • DeepSeek R1 14B: 10GB (Q4), 16GB (Q8)
  • DeepSeek R1 32B: 21GB (Q4), 36GB (Q8) - requires a 24GB card
  • DeepSeek R1 70B: 22GB (Q2), 40GB (Q4)
  • DeepSeek R1 671B: 400GB+ - cloud only

Full breakdown: Can You Run DeepSeek R1 on Your GPU?

Is 8GB VRAM enough for AI in 2026? For 7B/8B models, yes. But 8GB is increasingly the floor, not the ceiling. As 14B models become the mainstream recommendation, 8GB becomes a real limitation. Stretching to 12GB (like the Arc B580 at $249) gives much more headroom.

Does the GPU brand matter for LLM inference? Less than it used to. In 2026, Nvidia (CUDA), Intel Arc (SYCL/OpenCL), and AMD (ROCm) all work with Ollama, LM Studio, and llama.cpp. Nvidia still has the most consistent ecosystem, but Intel Arc offers compelling VRAM-per-dollar value.

What's the difference between model VRAM and total VRAM usage? Your GPU also uses VRAM for the OS display stack (~500MB-1GB), other applications, and the KV cache. Rule of thumb: subtract 1-2GB from your GPU's listed VRAM to get the safe working capacity. A 12GB card reliably has ~10-11GB available for model weights.
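That rule of thumb can be written as a quick sanity check. A minimal sketch - the 1.5GB reserve is simply an assumed midpoint of the 1-2GB range above, and the function names are made up for illustration:

```python
def model_fits(model_gb: float, card_gb: float, reserve_gb: float = 1.5) -> bool:
    """True if the model's weights fit after reserving VRAM for the
    display stack, other apps, and the KV cache."""
    return model_gb <= card_gb - reserve_gb

print(model_fits(9.0, 12.0))   # Qwen 2.5 14B Q4 (~9GB) on a 12GB card -> True
print(model_fits(11.0, 12.0))  # Qwen 2.5 14B Q5 (~11GB) on a 12GB card -> False
```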


Summary: The VRAM Decision Tree

  1. Running 7B/8B models only? → 8GB is fine. 12GB is better.
  2. Want to run 13B/14B models at good quality? → 12GB minimum, 16GB recommended.
  3. Want to run 32B models (DeepSeek R1 32B, Qwen 32B)? → 24GB required.
  4. Want to run 70B models? → 24GB for Q2, 40GB+ for Q4.
  5. Running image generation alongside LLMs? → Add 2-4GB to your LLM requirement.
  6. CUDA required for your workflow? → Nvidia only. Otherwise, Intel Arc offers better VRAM-per-dollar.

Browse budget GPU options by VRAM tier, or take the GPU Quiz for a personalized recommendation.

Prices accurate as of April 2026. VRAM requirements are approximate and include ~10% overhead for KV cache and runtime.

About the Author: Justin Murray

Founder of AI Computer Guide, Justin has over a decade of AI and computer hardware experience. From the cryptocurrency mining hardware rush to repairing personal and commercial computer hardware, he has always had a passion for sharing knowledge and the cutting edge.

Ready to Build? Use the AI Computer Builder

Configure a VRAM-optimised rig using the hardware mentioned in this guide.

Launch AI Computer Builder

Related Guides

As an Amazon Associate, I earn from qualifying purchases.