Ollama
The easiest way to run open-source LLMs locally with one command, like Docker for AI models.
Definition
Ollama is an open-source tool that simplifies running local LLMs on consumer hardware. It wraps llama.cpp as its inference backend, provides a REST API compatible with OpenAI's spec, and handles model downloading, quantization selection, and GPU memory management automatically. A single command, 'ollama run llama3.3', pulls a model and drops you into a chat with it.
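Once the server is running, its native REST API is reachable on port 11434. As a minimal sketch of querying it from Python (assuming the third-party requests package is installed; /api/tags is Ollama's endpoint for listing downloaded models):

```python
import requests

# Ollama's API server listens on localhost:11434 by default.
# GET /api/tags returns the models already downloaded to this machine.
resp = requests.get("http://localhost:11434/api/tags")
resp.raise_for_status()

for model in resp.json()["models"]:
    print(model["name"], model["size"])  # e.g. 'llama3.3:latest', size in bytes
```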
Why It Matters
Ollama matters most when you are getting started: it abstracts away nearly all the complexity of local inference. Power users who want maximum control eventually move to llama.cpp directly. Ollama's OpenAI-compatible API also lets you point existing AI apps (like Continue.dev or Open WebUI) at your local GPU, as sketched below.
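Here is a minimal sketch of that compatibility, assuming the official openai Python package is installed and a model has already been pulled with 'ollama pull llama3.3':

```python
from openai import OpenAI

# Any OpenAI-client code can target Ollama's /v1 compatibility endpoint.
# The api_key is required by the client library but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

reply = client.chat.completions.create(
    model="llama3.3",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(reply.choices[0].message.content)
```

This is the same trick tools like Continue.dev rely on: they speak the OpenAI wire protocol, so swapping the base URL redirects them to your local GPU.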
Real-World Example
To run DeepSeek R1 8B locally, install Ollama and run 'ollama run deepseek-r1:8b'. Ollama downloads the GGUF file, loads it into VRAM, and starts a local API server at http://localhost:11434.
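From there you can exercise the model programmatically. A sketch against Ollama's native generate endpoint (again assuming the requests package; setting "stream" to false asks for a single JSON object instead of streamed chunks):

```python
import requests

# POST /api/generate runs a one-shot completion against a local model.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:8b",
        "prompt": "Explain quantization in one sentence.",
        "stream": False,  # return one JSON object rather than a stream
    },
)
resp.raise_for_status()
print(resp.json()["response"])
```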
History of Ollama
The Ollama project launched in 2023 and rapidly became the most popular local LLM inference tool. After llama.cpp made local inference possible in early 2023, Ollama made it accessible. It crossed 1M downloads in its first year and now supports NVIDIA, AMD, and Apple Silicon platforms natively.
Frequently Asked Questions
Does Ollama have a web UI?
Not built in: Ollama is primarily a CLI and a local API server. Community front-ends such as Open WebUI connect to its API to provide a browser-based chat interface.
Can Ollama use multiple GPUs?
Yes. Via llama.cpp, Ollama can split a model's layers across multiple GPUs, so models too large for a single card's VRAM can still run fully accelerated.
Is Ollama secure for company data?
Inference happens entirely on your machine, so prompts and outputs never leave it. The API server has no authentication and binds to localhost by default, so avoid exposing port 11434 to untrusted networks.
Related Concepts
VRAM
The on-GPU memory that stores model weights. Determines which AI models you can run.
GGUF
The universal file format for running quantized LLMs locally via llama.cpp and Ollama.
LM Studio
A polished desktop GUI for discovering, downloading, and chatting with local AI models.
llama.cpp
The open-source engine that made running 70B models on consumer hardware possible in 2023.