GGUF
The universal file format for running quantized LLMs locally via llama.cpp and Ollama.
Definition
GGUF (GPT-Generated Unified Format) is a binary file format for storing quantized LLM weights and model metadata. It replaced the older GGML format and is the standard for llama.cpp-based inference. GGUF files encode the quantization scheme, tokenizer, and architecture parameters in a single portable file.
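The "single portable file" design starts with a small fixed header. As a sketch of what that looks like, here is a minimal parser based on my reading of the GGUF v3 layout (4-byte magic `GGUF`, then a little-endian uint32 version, uint64 tensor count, and uint64 metadata key-value count); the function name is my own, not part of any library:

```python
import struct

def read_gguf_header(path):
    """Parse the fixed-size GGUF header: magic, version,
    tensor count, and metadata key-value count (assumed v3 layout)."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        # Little-endian: uint32 version, uint64 tensor count, uint64 kv count
        version, tensor_count, metadata_kv_count = struct.unpack("<IQQ", f.read(20))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }
```

The quantization scheme, tokenizer, and architecture parameters mentioned above live in the metadata key-value section that follows this header, which is what lets a runtime load the model with no sidecar config files.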
Why It Matters
High. GGUF is the format powering Ollama, LM Studio, and most consumer-grade local AI apps. Understanding GGUF quantization suffixes (Q4_K_M, Q5_K_S, Q8_0) helps you choose the right speed-quality tradeoff.
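The speed-quality tradeoff between these suffixes is mostly a function of bits per weight: lower-bit quants produce smaller, faster files at some quality cost. A rough size estimator, using bits-per-weight figures that are my own approximations of llama.cpp's block layouts (e.g. Q8_0 stores 8-bit weights plus a per-block scale, giving roughly 8.5 bits per weight), not exact values:

```python
# Approximate bits-per-weight for common GGUF quant types.
# These numbers are assumptions/approximations, not spec constants.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_S": 5.5,
    "Q8_0": 8.5,
}

def estimate_gguf_size_gb(n_params, quant):
    """Rough file size in decimal GB for a model with n_params weights."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9
```

For an 8B-parameter model, `estimate_gguf_size_gb(8e9, "Q4_K_M")` gives about 4.85 GB, which lines up with the real-world file size in the example below (the quant also carries metadata and some higher-precision tensors, so actual files run slightly larger).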
Real-World Example
A 'Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf' file is approximately 4.9 GB. Download it, point llama.cpp or Ollama at it, and you have a private, offline AI assistant running at 80-120 tokens/second on a single RTX 4090.
History of GGUF
GGUF was introduced by Georgi Gerganov in August 2023 as a replacement for GGML, motivated by the need for a more extensible, forward-compatible format. Within weeks of release, prolific quantizers on Hugging Face were distributing GGUF conversions of popular models, and it became the de facto local-inference standard by Q4 2023.