Llama.cpp
The open-source engine that made running 70B models on consumer hardware possible in 2023.
Definition
llama.cpp is an open-source C++ inference engine for transformer-based LLMs, created by Georgi Gerganov. It pioneered quantized GGUF inference on consumer hardware, fast CPU-only execution, and multi-GPU offload for local LLMs, and it serves as the backend for Ollama, LM Studio, and Jan.ai.
Why It Matters
Foundational. Without llama.cpp, the local AI movement would not exist in its current form. It serves as the reference implementation for GGUF inference and continues to receive daily performance improvements from a large open-source community.
Real-World Example
Clone llama.cpp, download a GGUF model, then run: './llama-cli -m model.gguf -n 512 --n-gpu-layers 35 -p "Explain VRAM to me"'. The --n-gpu-layers flag controls how many transformer layers are offloaded to the GPU; the remainder run on the CPU.
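A minimal end-to-end sketch, under a few assumptions: a recent llama.cpp checkout built with CMake, an NVIDIA GPU (hence -DGGML_CUDA=ON; other backends use different build flags), and a placeholder file name model.gguf standing in for whatever GGUF you downloaded.

# fetch and build (omit -DGGML_CUDA=ON for a CPU-only build)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# generate 512 tokens, offloading 35 layers to the GPU
./build/bin/llama-cli -m model.gguf -n 512 --n-gpu-layers 35 -p "Explain VRAM to me"

If the model does not fit in VRAM, lower --n-gpu-layers until loading succeeds; any layers left on the CPU still run, just more slowly.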
History of Llama.cpp
Released on March 11, 2023, just one week after Meta's LLaMA-1 weights leaked online, llama.cpp became an overnight sensation on GitHub. Gerganov's breakthrough was adapting the model to run with 4-bit integer arithmetic, enabling 13B models to run on a 2019 MacBook Pro at a usable 6-8 tokens/second. It spawned an entire ecosystem.
Frequently Asked Questions
Does llama.cpp only run Llama models?
Why does my console print layer offloading numbers?
Do I need to compile llama.cpp from source?
Related Concepts
VRAM
The on-GPU memory that stores model weights. Determines which AI models you can run.
Quantization
Compressing model weights from 16-bit to 4-bit precision to massively reduce VRAM usage.
GGUF
The universal file format for running quantized LLMs locally via llama.cpp and Ollama.
Ollama
The easiest way to run open-source LLMs locally with one command, like Docker for AI models.