LOCAL AI // GLOSSARY

FP8 / FP4

Next-generation precision formats that dramatically accelerate inference on NVIDIA Blackwell and Ada GPUs.

Definition

FP8 (8-bit floating point) and FP4 (4-bit floating point) are reduced-precision numeric formats that maintain a floating-point exponent, unlike INT8/INT4, which lose dynamic range. NVIDIA's Blackwell architecture (RTX 50-series) natively supports FP4 Tensor Core operations, achieving up to 4x the throughput of FP16 at the hardware level.
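
To make the bit layout concrete, here is a minimal Python sketch that decodes the E4M3 variant of FP8 (1 sign bit, 4 exponent bits, 3 mantissa bits, exponent bias 7), the variant most inference stacks use for weights and activations. It is an illustration of the format itself, not any particular library's API.

    def decode_fp8_e4m3(byte: int) -> float:
        """Decode an FP8 E4M3 bit pattern: 1 sign, 4 exponent, 3 mantissa bits, bias 7.

        Note: in the common 'e4m3fn' variant, 0x7F and 0xFF encode NaN rather than
        the values this simple decoder would return for them.
        """
        sign = -1.0 if (byte >> 7) & 1 else 1.0
        exponent = (byte >> 3) & 0xF
        mantissa = byte & 0x7
        if exponent == 0:                                # subnormal: no implicit leading 1
            return sign * (mantissa / 8) * 2.0 ** (1 - 7)
        return sign * (1 + mantissa / 8) * 2.0 ** (exponent - 7)

    print(decode_fp8_e4m3(0x7E))   # 448.0   -> largest finite E4M3 value
    print(decode_fp8_e4m3(0x01))   # ~0.002  -> smallest positive subnormal

With a single scale factor, INT8 covers only 255 evenly spaced steps, while E4M3's exponent spans roughly 0.002 to 448, so very small and very large values can be represented side by side.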

Why It Matters

Medium-High. FP8 support in Blackwell GPUs means local inference of frontier models at higher precision is possible without sacrificing speed. Frameworks like TensorRT-LLM and vLLM have added FP8 support, and it is becoming the default inference format for production deployments.
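
As a concrete illustration, a minimal vLLM sketch using its dynamic FP8 quantization option might look like the following; the model name is only a placeholder, and it assumes a recent vLLM release and an FP8-capable GPU (Ada, Hopper, or Blackwell).

    from vllm import LLM, SamplingParams

    # Placeholder checkpoint; swap in whatever model you actually run locally.
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        quantization="fp8",   # weights are cast to FP8 on load (dynamic quantization)
    )

    params = SamplingParams(temperature=0.7, max_tokens=128)
    outputs = llm.generate(["Explain FP8 in one sentence."], params)
    print(outputs[0].outputs[0].text)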

Real-World Example

Running DeepSeek R1 Distill 70B at FP8 on an RTX 5080 can achieve approximately 15-20 tokens/second, compared to 8-12 tokens/second at Q4 GGUF via llama.cpp, with meaningfully better output quality.
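
Throughput figures like these are straightforward to sanity-check. The helper below is a backend-agnostic sketch: the generate callable is a hypothetical stand-in for whatever you run (llama.cpp, vLLM, TensorRT-LLM) and should return the number of tokens it actually produced.

    import time

    def tokens_per_second(generate, prompt: str, max_tokens: int) -> float:
        """Time one generation call and return a rough decode throughput."""
        start = time.perf_counter()
        produced = generate(prompt, max_tokens)   # your backend call goes here
        elapsed = time.perf_counter() - start
        return produced / elapsed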

History of FP8 / FP4

FP8 was first introduced conceptually by researchers from NVIDIA, Arm, and Intel in a 2022 paper ('FP8 Formats for Deep Learning'). It became commercially available with the NVIDIA H100 (Hopper architecture, 2022). FP4 was introduced with the Blackwell B100/B200 datacenter cards in 2024 and trickled down to consumer hardware with the RTX 50-series in 2025.

Frequently Asked Questions

What makes FP4 better than typical INT4 quantization?
While both use 4 bits per value, FP4 (Floating Point 4) keeps an exponent field in each number. That gives it far greater dynamic range than simple integer (INT4) scaling, so very small and very large values survive quantization together and output quality degrades less; see the worked comparison after these FAQs.
Do I need a new GPU to use FP8 or FP4?
Native FP8 hardware acceleration requires Ada Lovelace (RTX 40-series) or Blackwell (RTX 50-series) GPUs. Native FP4 acceleration is exclusive to the newest architectures such as NVIDIA Blackwell. Older GPUs can emulate the math, but much more slowly, which negates the hardware speed benefit.
Will FP8 replace GGUF?
FP8 is rapidly becoming the standard for raw model distribution and for server-side production serving frameworks like vLLM. However, heavily quantized GGUF formats (like Q4) will remain popular for extreme compression on VRAM-limited consumer hardware.
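
For the dynamic-range question above, a short sketch makes the difference visible. It enumerates every value an E2M1-style FP4 number can take (1 sign bit, 2 exponent bits, 1 mantissa bit, bias 1, the layout NVIDIA's FP4 formats are based on) alongside the 16 uniform INT4 steps; treat it as an illustration of the format, not a description of any specific kernel.

    def decode_fp4_e2m1(nibble: int) -> float:
        """Decode an FP4 E2M1 bit pattern: 1 sign, 2 exponent, 1 mantissa bit, bias 1."""
        sign = -1.0 if (nibble >> 3) & 1 else 1.0
        exponent = (nibble >> 1) & 0x3
        mantissa = nibble & 0x1
        if exponent == 0:                     # subnormal: 0 or +/-0.5
            return sign * (mantissa / 2)
        return sign * (1 + mantissa / 2) * 2.0 ** (exponent - 1)

    fp4_levels = sorted({decode_fp4_e2m1(i) for i in range(16)})
    int4_levels = list(range(-8, 8))

    print(fp4_levels)   # [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
    print(int4_levels)  # 16 evenly spaced integers from -8 to 7

FP4's levels cluster near zero, where most weight values live, while INT4's spacing is uniform, so small weights are rounded away more aggressively. In deployed FP4 schemes these 4-bit values are typically paired with a shared per-block scale factor that restores overall magnitude.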
