Quantization
Compressing model weights from 16-bit to 4-bit precision to massively reduce VRAM usage.
Definition
Quantization is the process of converting neural network weights from high-precision floating-point formats (FP32 or FP16) to lower-bit representations (INT8, INT4, or even INT2). In practice, each group of low-bit integers is paired with a scale factor (and often a zero point) so the original values can be approximately reconstructed at inference time. Methods and formats such as llama.cpp's GGUF, GPTQ, and AWQ implement this. A Q4_K_M quantization, for example, stores most weights at 4-bit precision while preserving the most sensitive tensors at higher precision.
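To make the mechanics concrete, here is a minimal sketch of block-wise 4-bit quantization in Python with NumPy. The block size of 32 and the single scale per block are simplifying assumptions for illustration; real formats such as GGUF's Q4_K_M add zero points, sub-block scales, and mixed precision, so this is not the actual llama.cpp code.

```python
import numpy as np

def quantize_block_q4(weights: np.ndarray):
    """Quantize one block of FP32 weights to 4-bit integers plus one FP16 scale.

    Simplified sketch of block-wise quantization, not a real GGUF kernel.
    """
    # One scale per block maps the block's values onto the 4-bit range [-8, 7].
    scale = np.max(np.abs(weights)) / 7.0
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, np.float16(scale)

def dequantize_block_q4(q: np.ndarray, scale: np.float16) -> np.ndarray:
    # Recover approximate FP32 values; the rounding error is the accuracy cost.
    return q.astype(np.float32) * np.float32(scale)

block = np.random.randn(32).astype(np.float32)   # assume a block of 32 weights
q, scale = quantize_block_q4(block)
restored = dequantize_block_q4(q, scale)
print("max abs error:", np.max(np.abs(block - restored)))
```

In a packed file the two 4-bit values per byte plus the per-block scale are what actually get stored, which is where the memory savings come from.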
Why It Matters
High. A Llama 3 70B model at FP16 needs roughly 140GB of VRAM for the weights alone, which is unusable on consumer hardware. Q4 quantization reduces this to approximately 38-42GB, making it runnable on a dual RTX 4090 or dual RTX 5090 setup, or a single 48GB workstation GPU.
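These figures follow from simple arithmetic: parameter count times bits per weight, plus runtime overhead for the KV cache and buffers. The sketch below is a back-of-envelope estimator, not a measurement; the roughly 4.5 bits per weight for Q4_K_M (the extra half bit covers the stored scales) and the flat 2GB overhead are assumptions for illustration.

```python
def estimate_vram_gb(n_params_billion: float, bits_per_weight: float,
                     overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate: weight storage plus a flat overhead for the
    KV cache and runtime buffers (real usage varies with context length)."""
    weight_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes / 1e9 + overhead_gb

print(estimate_vram_gb(70, 16))    # FP16: ~142 GB
print(estimate_vram_gb(70, 4.5))   # ~4.5 bits/weight assumed for Q4_K_M: ~41 GB
```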
Real-World Example
Instead of storing a weight such as 0.00234567890 in FP32 (32 bits), Q4 quantization might store the integer 3 in INT4 (4 bits), together with a scale factor shared across its block that is used to reconstruct an approximate value. That is roughly an 8x reduction relative to FP32, or 4x relative to FP16, and the model's total accuracy drop is often less than 2% on standard benchmarks.
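As a worked version of that example, the snippet below pairs the stored integer with a hypothetical per-block scale of 0.00078 (chosen only so the numbers round-trip nicely); the dequantized value lands within a fraction of a percent of the original.

```python
# Hypothetical per-block scale, assumed here purely for illustration.
scale = 0.00078
original = 0.0023456789
stored = round(original / scale)    # -> 3, the 4-bit integer actually kept
recovered = stored * scale          # -> 0.00234, close to the original
print(stored, recovered, abs(original - recovered) / abs(original))  # ~0.24% error
```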
History of Quantization
Quantization has existed in signal processing since the 1940s. Its application to neural networks began in earnest around 2017-2018 with Google's integer-quantization research for TensorFlow Lite. The LLM community popularized consumer-grade quantization when Georgi Gerganov released llama.cpp in 2023, first with the GGML file format and later with its successor GGUF, making 4-bit 65B models runnable on Apple Silicon Macs with enough unified memory.