Quantization
Compressing model weights from 16-bit to 4-bit precision to massively reduce VRAM usage.
Definition
Quantization is the process of converting neural network weights from high-precision floating-point formats (FP32 or FP16) to lower-bit representations (INT8, INT4, or even INT2). In practice, each group of low-bit integers is paired with a scale factor (and often a zero point) so the original values can be approximately reconstructed at inference time. Methods and formats such as llama.cpp's GGUF, GPTQ, and AWQ implement this. A Q4_K_M quantization, for example, stores most weights at 4-bit precision while preserving the most sensitive tensors at higher precision.
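To make the mechanics concrete, here is a minimal sketch of block-wise 4-bit quantization in Python with NumPy. The block size of 32 and the single scale per block are simplifying assumptions for illustration; real formats such as GGUF's Q4_K_M add zero points, sub-block scales, and mixed precision, so this is not the actual llama.cpp code.

```python
import numpy as np

def quantize_block_q4(weights: np.ndarray):
    """Quantize one block of FP32 weights to 4-bit integers plus one FP16 scale.

    Simplified sketch of block-wise quantization, not a real GGUF kernel.
    """
    # One scale per block maps the block's values onto the 4-bit range [-8, 7].
    scale = np.max(np.abs(weights)) / 7.0
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, np.float16(scale)

def dequantize_block_q4(q: np.ndarray, scale: np.float16) -> np.ndarray:
    # Recover approximate FP32 values; the rounding error is the accuracy cost.
    return q.astype(np.float32) * np.float32(scale)

block = np.random.randn(32).astype(np.float32)   # assume a block of 32 weights
q, scale = quantize_block_q4(block)
restored = dequantize_block_q4(q, scale)
print("max abs error:", np.max(np.abs(block - restored)))
```

In a packed file the two 4-bit values per byte plus the per-block scale are what actually get stored, which is where the memory savings come from.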
Why It Matters
High. A Llama 3 70B model at FP16 needs roughly 140GB of VRAM for the weights alone, which is unusable on consumer hardware. Q4 quantization reduces this to approximately 38-42GB, making it runnable on a dual RTX 4090 or dual RTX 5090 setup, or a single 48GB workstation GPU.
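These figures follow from simple arithmetic: parameter count times bits per weight, plus runtime overhead for the KV cache and buffers. The sketch below is a back-of-envelope estimator, not a measurement; the roughly 4.5 bits per weight for Q4_K_M (the extra half bit covers the stored scales) and the flat 2GB overhead are assumptions for illustration.

```python
def estimate_vram_gb(n_params_billion: float, bits_per_weight: float,
                     overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate: weight storage plus a flat overhead for the
    KV cache and runtime buffers (real usage varies with context length)."""
    weight_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes / 1e9 + overhead_gb

print(estimate_vram_gb(70, 16))    # FP16: ~142 GB
print(estimate_vram_gb(70, 4.5))   # ~4.5 bits/weight assumed for Q4_K_M: ~41 GB
```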
Real-World Example
Instead of storing a weight such as 0.00234567890 in FP32 (32 bits), Q4 quantization might store the integer 3 in INT4 (4 bits), together with a scale factor shared across its block that is used to reconstruct an approximate value. That is roughly an 8x reduction relative to FP32, or 4x relative to FP16, and the model's total accuracy drop is often less than 2% on standard benchmarks.
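As a worked version of that example, the snippet below pairs the stored integer with a hypothetical per-block scale of 0.00078 (chosen only so the numbers round-trip nicely); the dequantized value lands within a fraction of a percent of the original.

```python
# Hypothetical per-block scale, assumed here purely for illustration.
scale = 0.00078
original = 0.0023456789
stored = round(original / scale)    # -> 3, the 4-bit integer actually kept
recovered = stored * scale          # -> 0.00234, close to the original
print(stored, recovered, abs(original - recovered) / abs(original))  # ~0.24% error
```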
History of Quantization
Quantization has existed in signal processing since the 1940s. Its application to neural networks began in earnest around 2017-2018 with Google's integer-quantization research for TensorFlow Lite. The LLM community popularized consumer-grade quantization when Georgi Gerganov released llama.cpp in 2023, first with the GGML file format and later with its successor GGUF, making 4-bit 65B models runnable on Apple Silicon Macs with enough unified memory.