NVIDIA · Elite / Model Training Ready

What Can the NVIDIA GeForce RTX 5090 Run?

99 Compatible Models

32GB VRAM (GDDR7)

1792 GB/s Bandwidth

575W TDP
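The VRAM figures on the cards below roughly track quantized weight size. As a rule of thumb (an assumption, not the site's stated method): a Q4_K_M GGUF averages about 4.85 bits per weight, so the file size is roughly params × 4.85 / 8 bytes, with the KV cache and runtime overhead on top.

```python
def q4_k_m_size_gb(params_billion: float, bits_per_weight: float = 4.85) -> float:
    """Rough Q4_K_M GGUF weight size in GB.

    bits_per_weight is an average across the mixed quant types in a
    Q4_K_M file; treat the result as an estimate, not an exact size.
    """
    # params_billion * 1e9 params, bits/8 bytes each -> bytes / 1e9 = GB
    return params_billion * bits_per_weight / 8

# A 32B dense model lands near the 20GB figure quoted in the cards below.
print(round(q4_k_m_size_gb(32), 1))  # → 19.4
```

The gap between this estimate and the card's number is the runtime's context and overhead allowance.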

🏆 Top Pick

Mixtral 8x7B

Mistral AI · 46.7B (MoE) · Q4_K_M

💬 Chat

Mixture-of-Experts: 46.7B total params but only 12.9B active per token. Near-70B quality at lower inference cost.

24GB / 32GB VRAM

32,768 ctx · Apache 2.0
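The MoE advantage described above can be put into numbers: at batch size 1, decode speed is roughly bounded by memory bandwidth divided by the weight bytes read per token, and an MoE only reads its active experts. A back-of-the-envelope sketch (the 4.85 bits/weight Q4_K_M average is an assumption):

```python
BANDWIDTH_GBPS = 1792  # RTX 5090 memory bandwidth, from the spec box above

def decode_ceiling_tok_s(active_params_billion: float,
                         bits_per_weight: float = 4.85) -> float:
    """Bandwidth-bound upper limit on decode tokens/sec.

    Ceiling = bandwidth / weight bytes touched per generated token.
    Real throughput is lower (attention, KV cache, kernel overhead).
    """
    bytes_per_token_gb = active_params_billion * bits_per_weight / 8
    return BANDWIDTH_GBPS / bytes_per_token_gb

# Mixtral 8x7B reads ~12.9B active params per token, not all 46.7B.
print(round(decode_ceiling_tok_s(12.9)))  # MoE: active params only
print(round(decode_ceiling_tok_s(46.7)))  # hypothetical dense 46.7B model
```

The roughly 3.6× gap between the two ceilings is the "near-70B quality at lower inference cost" trade the card describes.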
💻 Code

Qwen3-Coder-Next

Alibaba · 80B (MoE) · Q4_K_M

💻 Code

An 80B MoE model with 3B active parameters, designed for coding agents and complex tool usage.

24GB / 32GB VRAM

262,144 ctx · Apache 2.0
🏆 Top Pick

Qwen3 Next

Alibaba · 80B (MoE) · Q4_K_M

💬 Chat

A high-sparsity Mixture-of-Experts 80B model utilizing a hybrid attention architecture.

24GB / 32GB VRAM

131,072 ctx · Apache 2.0
💻 Code

GLM-4.7-Flash

Zhipu AI · 355B (MoE) · Q4_K_M

💻 Code

Winner of the 24GB VRAM agentic coding challenge. A 355B MoE model that fits a single consumer GPU by activating only 30B parameters. LiveCodeBench 89% — the most capable coding model per dollar of VRAM.

24GB / 32GB VRAM

200,000 ctx · MIT
🏆 Top Pick

FLUX.2 Dev

Black Forest Labs · 32B diffusion · BF16

🎨 Image Gen

The best photorealism and text-in-image accuracy of any local model in 2026. Multi-reference image conditioning. Handles 1000-character prompts with full semantic fidelity — the definitive standard for professional AI photography.

24GB / 32GB VRAM

0 ctx · FLUX non-commercial
🏆 Top Pick

Wan 2.2

Alibaba · 14B video · BF16

🎬 Video Gen

Leading open video model for 2026. 720P with full camera motion controls and best-in-class semantic consistency across frames. The top choice for filmmakers and content creators running a single 24GB GPU workstation.

24GB / 32GB VRAM

0 ctx · Apache 2.0
👁️ Vision

LLaVA 34B

Haotian Liu et al. · 34B · Q4_K_M

👁️ Vision

High-quality vision model at 34B scale. Significantly better image analysis than the 7B version.

22GB / 32GB VRAM

4,096 ctx · Apache 2.0
🔬 Reasoning

DeepSeek R1 32B

DeepSeek · 32B · Q4_K_M

🔬 Reasoning

The sweet spot for local reasoning. Competitive with o1-mini on math and coding tasks at 32B scale.

20GB / 32GB VRAM

131,072 ctx · MIT
🔬 Reasoning

QwQ 32B

Alibaba · 32B · Q4_K_M

🔬 Reasoning

Alibaba's advanced reasoning model. Extended thinking and reflection enable o1-level problem solving.

20GB / 32GB VRAM

131,072 ctx · Apache 2.0
🏆 Top Pick

Qwen 2.5 32B

Alibaba · 32B · Q4_K_M

💬 Chat

Near-frontier performance from a local 32B model. Exceptional multilingual reasoning and instruction following.

20GB / 32GB VRAM

128,000 ctx · Apache 2.0
💻 Code

Qwen 2.5 Coder 32B

Alibaba · 32B · Q4_K_M

💻 Code

The best open-weight code model available. Matches GPT-4o on coding benchmarks, runs locally at 20GB VRAM.

20GB / 32GB VRAM

128,000 ctx · Apache 2.0
🔬 Reasoning

Magistral

Mistral AI · 32B · Q4_K_M

🔬 Reasoning

An open-weight reasoning model capable of long chains of thought before answering.

20GB / 32GB VRAM

131,072 ctx · Mistral Open

OLMo 2 32B

Allen AI · 32B dense · Q4_K_M

💬 Chat

The most transparent large language model ever released — weights, training data, code, evaluation harnesses, and training logs all fully public. Essential for research reproducibility and academic citation.

20GB / 32GB VRAM

4,096 ctx · Apache 2.0

Aya Expanse 32B

CohereForAI · 32B dense · Q4_K_M

💬 Chat

Native support for 23 languages. The single best open multilingual foundation model available for researchers building localized or translation-heavy pipelines.

20GB / 32GB VRAM

131,072 ctx · CC-BY-NC-4.0
💻 Code

Qwen 2.5 Coder 32B

Alibaba · 32B dense · Q4_K_M

💻 Code

HumanEval 92% — the best single-GPU coding model for RTX 5090 owners. Excels at multi-file edits, refactoring, and agentic tool use pipelines. The definitive open coding model at the 20GB VRAM tier.

20GB / 32GB VRAM

131,072 ctx · Apache 2.0
🏆 Top Pick

Gemma 3 27B

Google · 27B · Q4_K_M

💬 Chat

Google's largest open Gemma model. Competes with 70B-class models from previous generations.

17GB / 32GB VRAM

128,000 ctx · Gemma ToS
🏆 Top Pick

Qwen3.5-27B

Alibaba · 27B dense · Q4_K_M

💬 Chat

Community top pick for a single 24GB GPU. Supports 201 languages natively with vision-language capability built in. IFBench 76.5 and AIME 91.3 — outperforms GPT-5.2 on instruction following.

17GB / 32GB VRAM

262,144 ctx · Apache 2.0

Gemma 3 27B

Google · 27B dense · Q4_K_M

💬 Chat

Outstanding performance at the 24GB VRAM tier with a native Google vision-language encoder integrated directly into the weights. MMLU-Pro score of 67.5.

17GB / 32GB VRAM

131,072 ctx · Gemma Terms
⚡ Fast

CogVideoX-5B

THUDM · 5B video · BF16

🎬 Video Gen

6-second 720P clips on a single 16GB GPU — the easiest entry point into local video generation. Apache 2.0 commercial license with active ComfyUI integration. Best for social media automation and rapid prototyping.

16GB / 32GB VRAM

0 ctx · Apache 2.0

Open-Sora

HPC-AI Tech · 4B video · BF16

🎬 Video Gen

Full open-source Sora replication — weights, training code, and full data pipeline are all public. The most transparent video generation model. Essential for researchers studying text-to-video architectures and training dynamics.

16GB / 32GB VRAM

0 ctx · Apache 2.0
⚡ Fast

LFM2-24B-A2B

Liquid AI · 24B · Q4_K_M

🔬 Reasoning

A large hybrid model family designed specifically for efficient on-device deployment.

15GB / 32GB VRAM

131,072 ctx · Liquid Non-Commercial
🏆 Top Pick

Nemotron 3 22B

NVIDIA · 22B · Q4_K_M

🔬 Reasoning

A general-purpose reasoning and chat model trained from scratch by NVIDIA, featuring a low-latency MoE architecture.

14.5GB / 32GB VRAM

32,768 ctx · NVIDIA Open License
💻 Code

Codestral 22B

Mistral AI · 22B · Q4_K_M

💻 Code

Mistral's dedicated code model. Industry-leading performance on FIM (fill-in-middle) and complex code generation.

14GB / 32GB VRAM

32,768 ctx · Codestral License
🏆 Top Pick

Mistral Small 22B

Mistral AI · 22B · Q4_K_M

💬 Chat

Highly capable 22B model from Mistral AI. Excellent instruction following for enterprise chat applications.

14GB / 32GB VRAM

32,768 ctx · Apache 2.0

Ernie-4.5

Baidu · 32B (MoE) · Q4_K_M

💬 Chat

A medium-sized Mixture-of-Experts foundation model from Baidu.

14GB / 32GB VRAM

32,768 ctx · Apache 2.0
🏆 Top Pick

HunyuanVideo 1.5

Tencent · 13B video · BF16

🎬 Video Gen

Cinematic-quality video generation accessible at just 14GB VRAM with model offloading. The most democratized cinematic video tool available. Produces Hollywood-grade motion blur, depth of field, and lighting consistency.

14GB / 32GB VRAM

0 ctx · Tencent
⚡ Fast

FLUX.1 Schnell

Black Forest Labs · 12B diffusion · BF16

🎨 Image Gen

4-step generation with Apache 2.0 commercial license. The fastest high-quality local image model — produces studio-grade output in under 3 seconds on a 24GB GPU. The go-to for commercial product photography pipelines.

12GB / 32GB VRAM

0 ctx · Apache 2.0
🏆 Top Pick

Stable Diffusion 3.5 Large

Stability AI · 8B diffusion · BF16

🎨 Image Gen

Best text-in-image rendering and the largest open LoRA fine-tune ecosystem with 50,000+ community models. The creative industry's preferred foundation for style transfer, brand consistency, and character sheet generation.

12GB / 32GB VRAM

0 ctx · Stability AI

MusicGen Large

Meta AI · 4B · BF16

🎙️ Audio / TTS

Meta's flagship 4B music model with melody conditioning from reference audio clips. Best for cinematic scoring and mood-driven generation. CC-BY-NC-4.0. Industry standard for AI-assisted film and game soundtrack production.

12GB / 32GB VRAM

0 ctx · CC-BY-NC-4.0
🔬 Reasoning

gpt-oss

OpenAI · 16B · Q4_K_M

🔬 Reasoning

OpenAI's open-weight LLM, supporting configurable reasoning efforts (low, medium, high).

11GB / 32GB VRAM

131,072 ctx · MIT
🔬 Reasoning

seed-oss

ByteDance · 16B · Q4_K_M

🔬 Reasoning

An advanced reasoning model with flexible 'thinking budget' control and self-reflection capabilities.

10.5GB / 32GB VRAM

131,072 ctx · Open Use
💻 Code

StarCoder2 15B

BigCode · 15B · Q4_K_M

💻 Code

Specialized code completion model trained on 600+ programming languages. Top-tier for in-IDE completions.

10GB / 32GB VRAM

16,384 ctx · BigCode OpenRAIL-M
💻 Code

Devstral 2

Mistral AI · 14B · Q4_K_M

💻 Code

Second-generation Devstral for agentic coding, built for tool use, multi-file editing, and software engineering agents with vision support.

9.5GB / 32GB VRAM

131,072 ctx · Mistral Open
👁️ Vision

Qwen3-VL

Alibaba · 14B · Q4_K_M

👁️ Vision

A vision-language model featuring upgrades to visual perception, spatial reasoning, and image understanding.

9.5GB / 32GB VRAM

131,072 ctx · Apache 2.0
🔬 Reasoning

DeepSeek R1 14B

DeepSeek · 14B · Q4_K_M

🔬 Reasoning

Qwen-2.5 distilled reasoning model. Strong chain-of-thought and math at an accessible VRAM cost.

9GB / 32GB VRAM

131,072 ctx · MIT
🏆 Top Pick

Qwen 2.5 14B

Alibaba · 14B · Q4_K_M

💬 Chat

Exceptional 14B all-rounder. Competitive with many 30B+ models on reasoning and coding benchmarks.

9GB / 32GB VRAM

128,000 ctx · Apache 2.0
💻 Code

Qwen 2.5 Coder 14B

Alibaba · 14B · Q4_K_M

💻 Code

Top-tier code model for 12GB+ GPUs. Strong at agentic coding, multi-file edits, and complex refactors.

9GB / 32GB VRAM

128,000 ctx · Apache 2.0
🔬 Reasoning

Phi 4 14B

Microsoft · 14B · Q4_K_M

🔬 Reasoning

Microsoft's flagship small model. Trained on synthetic data with exceptional reasoning and STEM performance.

9GB / 32GB VRAM

16,384 ctx · MIT
🏆 Top Pick

Qwen 3.5 14B

Alibaba · 14B · Q4_K_M

💬 Chat

Integrates breakthroughs in multimodal learning, architectural efficiency, and reinforcement learning scale.

9GB / 32GB VRAM

131,072 ctx · Apache 2.0
🔬 Reasoning

phi-4-reasoning

Microsoft · 14B · Q4_K_M

🔬 Reasoning

A lightweight open model focused on high-quality, reasoning-dense synthetic data.

9GB / 32GB VRAM

131,072 ctx · MIT
👁️ Vision

Llama 3.2 Vision 11B

Meta AI · 11B · Q4_K_M

👁️ Vision

State-of-the-art open-weight vision model. Analyze charts, read documents, describe complex scenes.

8GB / 32GB VRAM

131,072 ctx · Llama 3 Community
🏆 Top Pick

Mistral NeMo 12B

Mistral AI / NVIDIA · 12B · Q4_K_M

💬 Chat

128K context window in a 12B model. Joint Mistral AI & NVIDIA collaboration — excellent for long-document tasks.

8GB / 32GB VRAM

128,000 ctx · Apache 2.0
🏆 Top Pick

Gemma 3 12B

Google · 12B · Q4_K_M

💬 Chat

Google's mid-tier Gemma 3. Multimodal capable, 128K context, strong multilingual reasoning.

8GB / 32GB VRAM

128,000 ctx · Gemma ToS
⚡ Fast

SDXL-Lightning

ByteDance · 3.5B diffusion · FP16

🎨 Image Gen

50-step SDXL quality in just 4 steps via adversarial diffusion distillation. Fully Apache 2.0 commercial. ComfyUI-native workflow. The fastest path from creative brief to production asset on 8GB VRAM hardware.

8GB / 32GB VRAM

0 ctx · Apache 2.0
🏆 Top Pick

Voxtral TTS

Mistral AI · 7B · Q4_K_M

🎙️ Audio / TTS

Matches or beats ElevenLabs Flash on prosody naturalness. 3-second voice cloning from a reference clip with no fine-tuning required. The top locally-run TTS for podcast production, audiobooks, and voice-over automation.

8GB / 32GB VRAM

0 ctx · Mistral
🏆 Top Pick

ACE-Step 1.5

ACE-Step · 2B · BF16

🎙️ Audio / TTS

Best local music model for 2026. Generates up to 10 minutes of audio with precise genre, instrument, tempo, and lyrics control. Apache 2.0 commercial license. The definitive tool for indie game composers and content creators.

8GB / 32GB VRAM

0 ctx · Apache 2.0
👁️ Vision

GLM-4.6V-Flash

Zhipu AI · 9B · Q4_K_M

👁️ Vision

A 9B vision-language model optimized for local deployment and low-latency applications.

7GB / 32GB VRAM

32,768 ctx · Apache 2.0
💻 Code

GLM-4.7

Zhipu AI · 9B · Q4_K_M

💬 Chat

An open-source model specializing in coding and tool calling, built on a new base model.

6.5GB / 32GB VRAM

128,000 ctx · Apache 2.0
💻 Code

FunctionGemma 9B

Google · 9B · Q4_K_M

💻 Code

A lightweight, open model built as a foundation for creating specialized function calling models.

6.5GB / 32GB VRAM

128,000 ctx · Gemma ToS
🔬 Reasoning

Olmo 3

Allen AI · 10B · Q4_K_M

🔬 Reasoning

A family of open language models designed to enable scientific research into language modeling.

6.5GB / 32GB VRAM

32,768 ctx · Apache 2.0
💻 Code

Devstral

Mistral AI · 8B · Q4_K_M

💻 Code

A coding model from Mistral AI designed for codebase exploration and engineering agents.

6GB / 32GB VRAM

32,768 ctx · Mistral Open

PixArt-Σ

PixArt-alpha · 600M diffusion · FP16

🎨 Image Gen

Tiny model with 4K native output capability — best image-per-VRAM-dollar ratio available. Apache 2.0. Runs on GTX 1070 8GB. The only model sub-1GB that produces print-resolution imagery with coherent composition.

6GB / 32GB VRAM

0 ctx · Apache 2.0

MuseTalk

Tencent · 500M · FP16

🎙️ Audio / TTS

Photorealistic lip sync at 30+ FPS in real time. The best model for live avatar streaming and talking-head video creation. MIT license with active Discord community. Integrates natively with OBS and streaming tools.

6GB / 32GB VRAM

0 ctx · MIT

SadTalker

OpenTalker · 300M · FP16

🎙️ Audio / TTS

One photo → talking head video with natural head movement, blinks, and mouth articulation. MIT license. No video sample required — just a single still image and an audio clip. The most accessible local avatar creation tool.

6GB / 32GB VRAM

0 ctx · MIT
👁️ Vision

LLaVA 7B

Haotian Liu et al. · 7B · Q4_K_M

👁️ Vision

The classic vision-language model. Describe images, answer visual questions locally. Proven and reliable.

5.5GB / 32GB VRAM

4,096 ctx · Apache 2.0
⚡ Fast

Rnj-1

Essential AI · 8B · Q4_K_M

💬 Chat

A family of open-weight, dense models trained from scratch by Essential AI.

5.5GB / 32GB VRAM

32,768 ctx · Apache 2.0
🏆 Top Pick

Ministral 3 8B

Mistral AI · 8B · Q4_K_M

💬 Chat

A highly cost-effective, high-performing 8B instruction-tuned model.

5.5GB / 32GB VRAM

131,072 ctx · Mistral Open
👁️ Vision

olmOCR 2

Allen AI · 7B · Q4_K_M

👁️ Vision

A specialized Vision Language Model (VLM) for optical character recognition tasks.

5.5GB / 32GB VRAM

32,768 ctx · Apache 2.0

gpt-oss-safeguard

OpenAI · 7B · Q4_K_M

💬 Chat

Open safety models built on the gpt-oss foundation to help classify and filter text content.

5.5GB / 32GB VRAM

32,768 ctx · MIT
💻 Code

Granite 4.0

IBM · 8B · Q4_K_M

💻 Code

Lightweight open models supporting multilingual tasks, RAG, coding, and tool use.

5.5GB / 32GB VRAM

131,072 ctx · Apache 2.0
👁️ Vision

MiniCPM-V 2.6

OpenBMB · 8B · Q4_K_M

👁️ Vision

Video + multi-image + text understanding at 8B parameters. The best vision model for 8GB VRAM setups — handles 40-frame video clips, multi-image comparison, and document understanding in a single context window.

5.5GB / 32GB VRAM

131,072 ctx · Apache 2.0
💻 Code

CodeGemma 7B

Google · 7B · Q4_K_M

💻 Code

Google's code-tuned Gemma variant. Excellent at code completion tasks inside IDEs.

5GB / 32GB VRAM

8,192 ctx · Gemma ToS
🏆 Top Pick

Llama 3.1 8B

Meta AI · 8B · Q4_K_M

💬 Chat

Meta's powerhouse 8B model with 128K context. Excellent all-rounder for chat, code, and reasoning.

5GB / 32GB VRAM

128,000 ctx · Llama 3 Community
🔬 Reasoning

DeepSeek R1 8B

DeepSeek · 8B · Q4_K_M

🔬 Reasoning

Llama-3 distilled reasoning model. Outperforms GPT-4o on several math benchmarks at 8B scale.

5GB / 32GB VRAM

131,072 ctx · MIT
💻 Code

OpenCoder-8B

Infly · 8B dense · Q4_K_M

💻 Code

The 'OLMo of coding' — fully transparent training pipeline with HumanEval 83.5%. Every component is open: weights, data, and methodology. The most trustworthy small coding model for compliance-sensitive teams.

5GB / 32GB VRAM

8,192 ctx · MIT
🔬 Reasoning

NVIDIA Nemotron Nano 8B

NVIDIA · 8B dense · Q4_K_M

🔬 Reasoning

Math Index 91.0 — the highest math score at the 8GB VRAM tier. NVIDIA's distilled Llama-3.1 with proprietary reward model training. Ideal for STEM tutoring and quantitative analysis on a single mid-range GPU.

5GB / 32GB VRAM

131,072 ctx · NVIDIA
🔗 Embed

Qwen3-Embedding-8B

Alibaba · 8B · Q4_K_M

🔗 Embedding

Top-ranked self-hosted embedding on MTEB English — outperforms all sub-72B models. 32K context window for ultra-long document encoding. The upgrade path from BGE-M3 for teams needing maximum retrieval precision.

5GB / 32GB VRAM

32,768 ctx · Apache 2.0
🏆 Top Pick

Mistral 7B

Mistral AI · 7B · Q4_K_M

💬 Chat

The model that proved smaller can beat bigger. Mistral 7B outperforms many 13B models with blazing fast speed.

4.5GB / 32GB VRAM

32,768 ctx · Apache 2.0
🔬 Reasoning

DeepSeek R1 7B

DeepSeek · 7B · Q4_K_M

🔬 Reasoning

Distilled reasoning power in a 7B package. Excels at math, logic, and step-by-step problem solving.

4.5GB / 32GB VRAM

131,072 ctx · MIT
🏆 Top Pick

Qwen 2.5 7B

Alibaba · 7B · Q4_K_M

💬 Chat

Highly competitive 7B model with long context and strong multilingual support. A top value pick.

4.5GB / 32GB VRAM

128,000 ctx · Apache 2.0
💻 Code

Qwen 2.5 Coder 7B

Alibaba · 7B · Q4_K_M

💻 Code

Best-in-class 7B code model. Excellent at multi-language completion, bug fixing, and code explanation.

4.5GB / 32GB VRAM

128,000 ctx · Apache 2.0
💻 Code

MagicCoder-S-DS-6.7B

UIUC · 6.7B dense · Q4_K_M

💻 Code

HumanEval 76.8% at just 6.7B — beats models 10× its size through OSS-Instruct training on real open-source code. The best option for code completion on 6GB VRAM with quality that defies the parameter count.

4GB / 32GB VRAM

16,384 ctx · MIT
🏆 Top Pick

WhisperX

OpenAI · 1.5B · FP16

🎙️ Audio / TTS

The de facto standard for local speech-to-text. Word-level timestamps, speaker diarization, and 99 language support. Essential for transcription pipelines, meeting summarization, and building voice-first AI interfaces.

4GB / 32GB VRAM

0 ctx · MIT
👁️ Vision

SAM 2

Meta AI · 300M · FP16

👁️ Vision

Click anywhere on an image or video → instant object segmentation. Apache 2.0. Universal segmentation model used in medical imaging, autonomous driving datasets, and content creation. Zero training required for any object class.

4GB / 32GB VRAM

0 ctx · Apache 2.0

Depth Pro

Apple · 300M · FP16

👁️ Vision

Single image → metrically accurate 3D depth map in under 0.3 seconds. Free for research use. Powers 3D scene reconstruction, bokeh simulation, and AR/VR depth estimation pipelines without any calibration data.

4GB / 32GB VRAM

0 ctx · Apple Research
🏆 Top Pick

Gemma 3 4B

Google · 4B · Q4_K_M

💬 Chat

Google's strong 4B model with multimodal capability and 128K context. One of the best small models.

3GB / 32GB VRAM

128,000 ctx · Gemma ToS
⚡ Fast

Phi 3.5 Mini

Microsoft · 3.8B · Q4_K_M

💬 Chat

Microsoft's multilingual tiny model with enormous 128K context window. Exceptional for document tasks.

2.5GB / 32GB VRAM

128,000 ctx · MIT
⚡ Fast

Phi-4-mini

Microsoft · 3.8B dense · Q4_K_M

💬 Chat

Microsoft's best edge release. Fits 8GB RAM and runs fast on M1 MacBook Air in airplane mode. Exceptional at structured reasoning for its size — the top choice for on-device personal assistants and document Q&A.

2.3GB / 32GB VRAM

131,072 ctx · MIT
💻 Code

Qwen 2.5 Coder 3B

Alibaba · 3B · Q4_K_M

💻 Code

Compact code-specialized model. Strong at code completion and debugging on very limited hardware.

2.2GB / 32GB VRAM

32,768 ctx · Apache 2.0
⚡ Fast

gemma-3n

Google · 3B · Q4_K_M

💬 Chat

A generative AI model optimized for use in everyday devices like laptops and phones.

2.2GB / 32GB VRAM

32,768 ctx · Gemma ToS
🏆 Top Pick

Llama 3.2 3B

Meta AI · 3B · Q4_K_M

💬 Chat

Best-in-class 3B model with 128K context. Outperforms many 7B models on common benchmarks.

2GB / 32GB VRAM

131,072 ctx · Llama 3 Community
⚡ Fast

Qwen 2.5 3B

Alibaba · 3B · Q4_K_M

💬 Chat

Compact and fast. Excellent multilingual and instruction-following performance at tiny VRAM cost.

2GB / 32GB VRAM

32,768 ctx · Apache 2.0
🏆 Top Pick

F5-TTS

SWivid · 300M · FP16

🎙️ Audio / TTS

Flow-matching TTS with no duration modeling — produces the most natural prosody and sentence rhythm of any local voice model. MIT license. The preferred choice for giving local AI agents a human-sounding voice interface.

2GB / 32GB VRAM

0 ctx · MIT

ColPali

Vidore · 3B · Q4_K_M

👁️ Vision

Encodes PDF pages as images — bypasses all broken PDF parsers for perfect scanned document retrieval. Apache 2.0. The breakthrough for RAG on government forms, research papers, and historical archives with graphical content.

2GB / 32GB VRAM

0 ctx · Apache 2.0
👁️ Vision

Moondream 2

vikhyatk · 1.8B · Q4_K_M

👁️ Vision

Tiny but capable vision language model. Describe images, read text, answer visual questions locally.

1.5GB / 32GB VRAM

2,048 ctx · Apache 2.0
⚡ Fast

Qwen3.5-2B

Alibaba · 2B dense · Q4_K_M

💬 Chat

Runs on iPhone in airplane mode. First sub-3B model with native multimodal support — a landmark for on-device AI. Perfect for privacy-preserving mobile apps that need real conversational capability without a server.

1.3GB / 32GB VRAM

32,768 ctx · Apache 2.0
🔬 Reasoning

DeepSeek R1 1.5B

DeepSeek · 1.5B · Q4_K_M

🔬 Reasoning

Smallest reasoning model you can run locally. Surprising chain-of-thought performance for its size.

1.1GB / 32GB VRAM

131,072 ctx · MIT
⚡ Fast

Qwen 2.5 1.5B

Alibaba · 1.5B · Q4_K_M

💬 Chat

Excellent multilingual capabilities for its size. Particularly strong in Chinese and coding tasks.

1.1GB / 32GB VRAM

32,768 ctx · Apache 2.0
⚡ Fast

SmolLM2-1.7B

HuggingFace · 1.7B dense · Q4_K_M

💬 Chat

Runs in-browser via WebGPU — no installation required. Best for Electron apps and Raspberry Pi deployments. HuggingFace's most downloaded edge model with an Apache 2.0 license and full community model ecosystem.

1.1GB / 32GB VRAM

8,192 ctx · Apache 2.0
⚡ Fast

Llama 3.2 1B

Meta AI · 1B · Q4_K_M

💬 Chat

Compact Llama 3.2 with impressively long 128K context window. Perfect for edge deployment.

1GB / 32GB VRAM

131,072 ctx · Llama 3 Community
⚡ Fast

Gemma 3 1B

Google · 1B · Q4_K_M

💬 Chat

Google's smallest Gemma 3. Runs on virtually any GPU or even CPU — great for on-device applications.

0.9GB / 32GB VRAM

32,768 ctx · Gemma ToS
🔗 Embed

mxbai-embed-large

MixedBread AI · 335M · FP32

🔗 Embedding

State-of-the-art embedding model for retrieval tasks. Ranks #1 on multiple MTEB categories.

0.7GB / 32GB VRAM

512 ctx · Apache 2.0
👁️ Vision

Florence-2

Microsoft · 770M · FP16

👁️ Vision

Captioning, object detection, grounding, OCR, and segmentation in one 770M model — MIT license. The Swiss Army knife of computer vision. Runs on almost any GPU and powers automated image tagging pipelines at scale.

0.5GB / 32GB VRAM

0 ctx · MIT
🔗 Embed

BGE-M3

BAAI · 568M · FP16

🔗 Embedding

The default local RAG embedding model for 2026. 100 languages, 8K context, and three retrieval modes (dense, sparse, multi-vector) in one model. MIT license. Used in production by thousands of enterprise RAG pipelines.

0.5GB / 32GB VRAM

8,192 ctx · MIT
⚡ Fast

Qwen 2.5 0.5B

Alibaba · 0.5B · Q4_K_M

💬 Chat

Smallest Qwen 2.5 — blazing fast on any hardware. Surprisingly capable for its size on simple tasks.

0.4GB / 32GB VRAM

32,768 ctx · Apache 2.0
🔗 Embed

Nomic Embed Text

Nomic AI · 137M · FP32

🔗 Embedding

High-quality embedding model with 8K context. Outperforms OpenAI text-embedding-ada-002 on MTEB benchmark.

0.3GB / 32GB VRAM

8,192 ctx · Apache 2.0
⚡ Fast

LFM2-350M

Liquid AI · 350M · FP16

💬 Chat

Non-Transformer architecture with linear context scaling — never degrades on long sequences. Achieves 40,400 tokens/sec on Apple Silicon. The fastest local model for structured extraction pipelines and IoT edge nodes.

0.3GB / 32GB VRAM

131,072 ctx · Liquid AI
🔗 Embed

all-MiniLM-L6

Sentence Transformers · 22M · FP32

🔗 Embedding

Ultra-compact sentence embedding model. Perfect for semantic search and RAG pipelines on any hardware.

0.1GB / 32GB VRAM

512 ctx · Apache 2.0
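Semantic search with any of the embedding models above boils down to a nearest-neighbour lookup over embedding vectors. A minimal cosine-similarity sketch (the toy 4-dimensional vectors stand in for real model outputs, which would come from e.g. all-MiniLM-L6 at 384 dimensions):

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 2):
    """Rank documents by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                       # cosine similarity per document
    order = np.argsort(-scores)[:k]      # best-scoring indices first
    return order.tolist(), scores[order].tolist()

# Toy "embeddings": docs 0 and 1 point in nearly the same direction,
# doc 2 is orthogonal to both.
docs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0]])
idx, scores = top_k(np.array([1.0, 0.05, 0.0, 0.0]), docs)
print(idx)  # → [0, 1]
```

The same loop scales to a real RAG pipeline by swapping the toy arrays for vectors produced by one of the embedding models listed here.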
⚡ Fast

Kokoro-82M

hexgrad · 82M · CPU

🎙️ Audio / TTS

CPU-only TTS that runs on Raspberry Pi. The best quality-per-watt ratio of any voice model — 82M parameters producing studio-quality speech synthesis. Apache 2.0 commercial license with no GPU requirement whatsoever.

0GB / 32GB VRAM

0 ctx · Apache 2.0

How to Run These Models

  1. Install Ollama: Download from ollama.com for Windows, macOS, or Linux. No CUDA setup required.
  2. Pick a model: Click any copy button above to get the run command.
  3. Run it: Paste the command into your terminal. Ollama downloads and launches the model automatically on your NVIDIA GeForce RTX 5090.
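The steps above look like this in a terminal. The model tag is an example (each card's copy button gives the exact tag for that model; tags are listed at ollama.com/library):

```shell
# 1. Verify Ollama is installed and on PATH
ollama --version

# 2. Pull and launch a model interactively (example tag; large download)
ollama run mixtral

# 3. Or query the local REST API that Ollama serves on port 11434
curl http://localhost:11434/api/generate \
  -d '{"model": "mixtral", "prompt": "Hello", "stream": false}'
```

The REST endpoint is what editor plugins and local RAG tools talk to, so step 3 is the bridge from "model runs in a terminal" to "model backs an application."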