FP8 / FP4
Next-generation precision formats that dramatically accelerate inference on NVIDIA Blackwell and Ada GPUs.
Definition
FP8 (8-bit floating point) and FP4 (4-bit floating point) are reduced-precision numeric formats that retain a floating-point exponent, unlike INT8/INT4, which map values onto a fixed integer grid and lose dynamic range. NVIDIA's Blackwell architecture (RTX 50-series) natively supports FP4 Tensor Core operations, achieving up to 4x the throughput of FP16 at the hardware level.
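To make the "floating-point exponent" concrete, here is a minimal Python sketch (written for this article, not taken from any library) that decodes the E4M3 variant of FP8 by hand. The bias-7 exponent, single NaN encoding, and 448 maximum follow the published OCP FP8 convention used by NVIDIA Tensor Cores.

```python
def decode_fp8_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 value: 1 sign bit, 4 exponent bits, 3 mantissa bits.

    Uses the OCP FP8 convention: exponent bias 7, no infinities,
    and exponent=0b1111 with mantissa=0b111 reserved for NaN.
    """
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F
    man = byte & 0x07
    if exp == 0x0F and man == 0x07:
        return float("nan")                         # the one NaN encoding per sign
    if exp == 0:
        return sign * (man / 8) * 2.0 ** -6         # subnormal: no implicit leading 1
    return sign * (1 + man / 8) * 2.0 ** (exp - 7)  # normal: (1.mmm) * 2^(e-7)

# The exponent is what INT8 lacks: E4M3 reaches +/-448 yet still resolves
# values near zero, while INT8 is 256 evenly spaced steps on one fixed scale.
print(decode_fp8_e4m3(0x7E))  # 448.0   (largest finite E4M3 value)
print(decode_fp8_e4m3(0x01))  # ~0.00195 (smallest positive subnormal)
```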
Why It Matters
Medium-High. FP8 support in Ada and Blackwell GPUs means local inference of frontier models at higher precision is possible without sacrificing speed. Frameworks like TensorRT-LLM and vLLM have added FP8 support, and it is becoming a default inference format for production deployments.
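As a rough illustration, this is what FP8 inference can look like in vLLM, which supports quantizing weights to FP8 at load time via quantization="fp8". This is a sketch, not a tuned deployment: it assumes a recent vLLM build with FP8 support and an FP8-capable GPU, and the model name is only a placeholder.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    quantization="fp8",  # quantize weights to FP8 when the model loads
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain FP8 in one sentence."], params)
print(outputs[0].outputs[0].text)
```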
Real-World Example
Running DeepSeek-R1-Distill-Llama-70B at FP8 on an RTX 5080 can achieve approximately 15-20 tokens/second, compared to 8-12 tokens/second at Q4 GGUF via llama.cpp, with meaningfully better output quality.
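Tokens-per-second figures like these are simple wall-clock measurements: generated tokens divided by elapsed seconds. A minimal sketch of that arithmetic, where `generate` is a hypothetical stand-in for whichever runtime you use:

```python
import time

def tokens_per_second(generate, prompt: str, n_tokens: int) -> float:
    """Return decode throughput as generated tokens / wall-clock seconds.

    `generate` is a placeholder for your runtime's generation call
    (llama.cpp, vLLM, TensorRT-LLM all expose an equivalent); it should
    emit n_tokens tokens of output for the given prompt.
    """
    start = time.perf_counter()
    generate(prompt, max_tokens=n_tokens)
    return n_tokens / (time.perf_counter() - start)
```

Careful benchmarks usually time only the decode phase, since prompt processing (prefill) runs at a much higher and unrelated rate.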
History of FP8 / FP4
FP8 was first formalized by NVIDIA, Arm, and Intel researchers in a 2022 paper ('FP8 Formats for Deep Learning') and became commercially available with the NVIDIA H100 (Hopper architecture, 2022). FP4 was introduced with the Blackwell B100/B200 datacenter cards in 2024 and trickled down to consumer hardware with the RTX 50-series in 2025.
Frequently Asked Questions
What makes FP4 better than typical INT4 quantization?
Do I need a new GPU to use FP8 or FP4?
Will FP8 replace GGUF?
Related Concepts
VRAM
The on-GPU memory that stores model weights. Determines which AI models you can run.
Quantization
Compressing model weights from 16-bit to 4-bit precision to massively reduce VRAM usage.
Tokens per Second (TPS)
The universal speed metric for LLMs: how many tokens (roughly, word fragments) your GPU generates per second.