LOCAL AI // GLOSSARY

GGUF

The universal file format for running quantized LLMs locally via llama.cpp and Ollama.

Definition

GGUF (GPT-Generated Unified Format) is a binary file format for storing quantized LLM weights and model metadata. It replaced the older GGML format and is the standard for llama.cpp-based inference. GGUF files encode the quantization scheme, tokenizer, and architecture parameters in a single portable file.

Why It Matters

High. GGUF is the format powering Ollama, LM Studio, and most consumer-grade local AI apps. Understanding GGUF quantization suffixes (Q4_K_M, Q5_K_S, Q8_0) helps you choose the right speed-quality tradeoff.
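As a rough illustration of the speed-quality tradeoff, the quantization suffix largely determines file size. The bits-per-weight figures below are approximate community estimates, not official constants:

```python
# Approximate bits-per-weight for common GGUF quantization suffixes.
# These are rough community figures; exact sizes vary by model architecture.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,  # 4-bit "K-quant", medium variant: the usual default
    "Q5_K_S": 5.54,  # 5-bit K-quant, small variant: higher quality, larger
    "Q8_0": 8.5,     # 8-bit: near-lossless, roughly double the Q4 size
}

def estimate_file_gb(n_params: float, suffix: str) -> float:
    """Estimate GGUF file size in GB from parameter count and quant suffix."""
    return n_params * BITS_PER_WEIGHT[suffix] / 8 / 1e9

# An 8B model at Q4_K_M lands near the ~4.9GB figure quoted below
# for Llama 3.1 8B Instruct.
print(round(estimate_file_gb(8.03e9, "Q4_K_M"), 1))  # ~4.9
```

Lower bits-per-weight means a smaller, faster model with more quality loss; Q4_K_M is widely treated as the sweet spot.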

Real-World Example

A 'Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf' file is approximately 4.9GB. Download it, point llama.cpp or Ollama at it, and you have a private, offline AI assistant running at 80-120 tokens/second on a single RTX 4090.

History of GGUF

GGUF was introduced by Georgi Gerganov in August 2023 as a replacement for GGML, motivated by the need for a more extensible, forward-compatible format. Within weeks of release, community quantizers were republishing popular Hugging Face models in GGUF, and it became the de facto local-inference standard by late 2023.

Frequently Asked Questions

How do I know which GGUF to download?
Check the model card on Hugging Face; the Q4_K_M version is usually the best balance of quality and size. Make sure the file is comfortably 2-3GB smaller than your GPU's VRAM to leave room for the KV cache and whatever else (compositor, other apps) is using graphics memory.
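The fit rule above can be sketched as a one-line check. The 2.5GB headroom default is an assumption for illustration; real KV-cache usage depends on context length and model architecture:

```python
def fits_in_vram(file_gb: float, vram_gb: float, headroom_gb: float = 2.5) -> bool:
    """True if the GGUF file plus KV-cache/desktop headroom fits in VRAM.

    headroom_gb is a rough placeholder; actual KV-cache size grows with
    context length and layer count.
    """
    return file_gb + headroom_gb <= vram_gb

print(fits_in_vram(4.9, 24.0))  # 24GB card (e.g. RTX 4090): fits
print(fits_in_vram(4.9, 6.0))   # 6GB card: spills layers to CPU RAM
```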
Can an NVIDIA GPU run GGUF?
Yes. GGUF and llama.cpp were originally optimized for CPU inference, but llama.cpp now ships CUDA support and can offload model layers to GPU VRAM (the --n-gpu-layers flag) for large speed gains. Ollama and LM Studio handle this offloading automatically.
How is GGUF different from SafeTensors?
SafeTensors stores raw, usually unquantized weights, often split across multiple files, and targets the Python/PyTorch ecosystem on Hugging Face. GGUF packages quantized weights, the tokenizer, and architecture metadata into a single file designed for C++ inference engines like llama.cpp.
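The two formats are easy to tell apart on disk. A GGUF file begins with the ASCII magic bytes "GGUF", while a safetensors file begins with a little-endian 64-bit length followed by a JSON header. A minimal sketch of a format sniffer, assuming local file paths:

```python
import json
import struct

def sniff_format(path: str) -> str:
    """Guess whether a file is GGUF or safetensors from its header bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head[:4] == b"GGUF":  # GGUF files open with this ASCII magic
        return "gguf"
    # safetensors: first 8 bytes are a little-endian u64 giving the
    # byte length of the JSON metadata header that follows
    (header_len,) = struct.unpack("<Q", head)
    with open(path, "rb") as f:
        f.seek(8)
        json.loads(f.read(header_len))  # raises if not valid JSON
    return "safetensors"
```

This only inspects headers; it does not validate the full file contents.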
