Ollama
The easiest way to run open-source LLMs locally with one command, like Docker for AI models.
Definition
Ollama is an open-source tool that simplifies running local LLMs on consumer hardware. It wraps llama.cpp as its inference backend, provides a REST API compatible with OpenAI's spec, and handles model downloading, quantization selection, and GPU memory management automatically. A single command, 'ollama run llama3.3', pulls a model and drops you into a chat with it.
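Once the server is running, its native REST API is reachable on port 11434. As a minimal sketch of querying it from Python (assuming the third-party requests package is installed; /api/tags is Ollama's endpoint for listing downloaded models):

```python
import requests

# Ollama's API server listens on localhost:11434 by default.
# GET /api/tags returns the models already downloaded to this machine.
resp = requests.get("http://localhost:11434/api/tags")
resp.raise_for_status()

for model in resp.json()["models"]:
    print(model["name"], model["size"])  # e.g. 'llama3.3:latest', size in bytes
```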
Why It Matters
Ollama matters most when you are getting started: it abstracts away nearly all the complexity of local inference. Power users who want maximum control eventually move to llama.cpp directly. Ollama's OpenAI-compatible API also lets you point existing AI apps (like Continue.dev or Open WebUI) at your local GPU, as sketched below.
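Here is a minimal sketch of that compatibility, assuming the official openai Python package is installed and a model has already been pulled with 'ollama pull llama3.3':

```python
from openai import OpenAI

# Any OpenAI-client code can target Ollama's /v1 compatibility endpoint.
# The api_key is required by the client library but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

reply = client.chat.completions.create(
    model="llama3.3",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(reply.choices[0].message.content)
```

This is the same trick tools like Continue.dev rely on: they speak the OpenAI wire protocol, so swapping the base URL redirects them to your local GPU.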
Real-World Example
To run DeepSeek R1 8B locally, install Ollama and run 'ollama run deepseek-r1:8b'. Ollama downloads the GGUF file, loads it into VRAM, and starts a local API server at http://localhost:11434.
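From there you can exercise the model programmatically. A sketch against Ollama's native generate endpoint (again assuming the requests package; setting "stream" to false asks for a single JSON object instead of streamed chunks):

```python
import requests

# POST /api/generate runs a one-shot completion against a local model.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:8b",
        "prompt": "Explain quantization in one sentence.",
        "stream": False,  # return one JSON object rather than a stream
    },
)
resp.raise_for_status()
print(resp.json()["response"])
```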
History of Ollama
The Ollama project launched in 2023 and rapidly became the most popular local LLM inference tool. After llama.cpp made local inference possible in early 2023, Ollama made it accessible. It crossed 1M downloads in its first year and now supports NVIDIA, AMD, and Apple Silicon platforms natively.
Frequently Asked Questions
Does Ollama have a web UI?
Not built in: Ollama is primarily a CLI and a local API server. Community front-ends such as Open WebUI connect to its API to provide a browser-based chat interface.
Can Ollama use multiple GPUs?
Yes. Via llama.cpp, Ollama can split a model's layers across multiple GPUs, so models too large for a single card's VRAM can still run fully accelerated.
Is Ollama secure for company data?
Inference happens entirely on your machine, so prompts and outputs never leave it. The API server has no authentication and binds to localhost by default, so avoid exposing port 11434 to untrusted networks.
Related Concepts
VRAM
The on-GPU memory that stores model weights. Determines which AI models you can run.
GGUF
The universal file format for running quantized LLMs locally via llama.cpp and Ollama.
LM Studio
A polished desktop GUI for discovering, downloading, and chatting with local AI models.
llama.cpp
The open-source engine that made running 70B models on consumer hardware possible in 2023.