Llama.cpp
The open-source engine that made running 70B models on consumer hardware possible in 2023.
Definition
llama.cpp is an open-source C++ inference engine for transformer-based LLMs, created by Georgi Gerganov. It pioneered quantized GGUF inference on consumer hardware, fast CPU-only execution, and multi-GPU offload for local LLMs, and it serves as the backend for Ollama, LM Studio, and Jan.ai.
Why It Matters
Foundational. Without llama.cpp, the local AI movement would not exist in its current form. It serves as the reference implementation for GGUF inference and continues to receive daily performance improvements from a large open-source community.
Real-World Example
Clone llama.cpp, download a GGUF model, then run: './llama-cli -m model.gguf -n 512 --n-gpu-layers 35 -p "Explain VRAM to me"'. The --n-gpu-layers flag controls how many transformer layers are offloaded to the GPU; the remainder run on the CPU.
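A minimal end-to-end sketch, under a few assumptions: a recent llama.cpp checkout built with CMake, an NVIDIA GPU (hence -DGGML_CUDA=ON; other backends use different build flags), and a placeholder file name model.gguf standing in for whatever GGUF you downloaded.

# fetch and build (omit -DGGML_CUDA=ON for a CPU-only build)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# generate 512 tokens, offloading 35 layers to the GPU
./build/bin/llama-cli -m model.gguf -n 512 --n-gpu-layers 35 -p "Explain VRAM to me"

If the model does not fit in VRAM, lower --n-gpu-layers until loading succeeds; any layers left on the CPU still run, just more slowly.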
History of Llama.cpp
Released on March 11, 2023, just one week after Meta's LLaMA-1 weights leaked online, llama.cpp became an overnight sensation on GitHub. Gerganov's breakthrough was adapting the model to run with 4-bit integer arithmetic, enabling 13B models to run on a 2019 MacBook Pro at a usable 6-8 tokens/second. It spawned an entire ecosystem.
Frequently Asked Questions
Does llama.cpp only run Llama models?
Why does my console print layer offloading numbers?
Do I need to compile llama.cpp from source?
Related Concepts
VRAM
The on-GPU memory that stores model weights. Determines which AI models you can run.
Quantization
Compressing model weights from 16-bit to 4-bit precision to massively reduce VRAM usage.
GGUF
The universal file format for running quantized LLMs locally via llama.cpp and Ollama.
Ollama
The easiest way to run open-source LLMs locally with one command, like Docker for AI models.