FP8 / FP4
Next-generation precision formats that dramatically accelerate inference on NVIDIA Blackwell and Ada GPUs.
Definition
FP8 (8-bit floating point) and FP4 (4-bit floating point) are reduced-precision numeric formats that retain a floating-point exponent, unlike INT8/INT4, which map values onto a fixed integer grid and lose dynamic range. NVIDIA's Blackwell architecture (RTX 50-series) natively supports FP4 Tensor Core operations, achieving up to 4x the throughput of FP16 at the hardware level.
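To make the "floating-point exponent" concrete, here is a minimal Python sketch (written for this article, not taken from any library) that decodes the E4M3 variant of FP8 by hand. The bias-7 exponent, single NaN encoding, and 448 maximum follow the published OCP FP8 convention used by NVIDIA Tensor Cores.

```python
def decode_fp8_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 value: 1 sign bit, 4 exponent bits, 3 mantissa bits.

    Uses the OCP FP8 convention: exponent bias 7, no infinities,
    and exponent=0b1111 with mantissa=0b111 reserved for NaN.
    """
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F
    man = byte & 0x07
    if exp == 0x0F and man == 0x07:
        return float("nan")                         # the one NaN encoding per sign
    if exp == 0:
        return sign * (man / 8) * 2.0 ** -6         # subnormal: no implicit leading 1
    return sign * (1 + man / 8) * 2.0 ** (exp - 7)  # normal: (1.mmm) * 2^(e-7)

# The exponent is what INT8 lacks: E4M3 reaches +/-448 yet still resolves
# values near zero, while INT8 is 256 evenly spaced steps on one fixed scale.
print(decode_fp8_e4m3(0x7E))  # 448.0   (largest finite E4M3 value)
print(decode_fp8_e4m3(0x01))  # ~0.00195 (smallest positive subnormal)
```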
Why It Matters
Medium-High. FP8 support in Ada and Blackwell GPUs means local inference of frontier models at higher precision is possible without sacrificing speed. Frameworks like TensorRT-LLM and vLLM have added FP8 support, and it is becoming a default inference format for production deployments.
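As a rough illustration, this is what FP8 inference can look like in vLLM, which supports quantizing weights to FP8 at load time via quantization="fp8". This is a sketch, not a tuned deployment: it assumes a recent vLLM build with FP8 support and an FP8-capable GPU, and the model name is only a placeholder.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    quantization="fp8",  # quantize weights to FP8 when the model loads
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain FP8 in one sentence."], params)
print(outputs[0].outputs[0].text)
```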
Real-World Example
Running DeepSeek-R1-Distill-Llama-70B at FP8 on an RTX 5080 can achieve approximately 15-20 tokens/second, compared to 8-12 tokens/second at Q4 GGUF via llama.cpp, with meaningfully better output quality.
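Tokens-per-second figures like these are simple wall-clock measurements: generated tokens divided by elapsed seconds. A minimal sketch of that arithmetic, where `generate` is a hypothetical stand-in for whichever runtime you use:

```python
import time

def tokens_per_second(generate, prompt: str, n_tokens: int) -> float:
    """Return decode throughput as generated tokens / wall-clock seconds.

    `generate` is a placeholder for your runtime's generation call
    (llama.cpp, vLLM, TensorRT-LLM all expose an equivalent); it should
    emit n_tokens tokens of output for the given prompt.
    """
    start = time.perf_counter()
    generate(prompt, max_tokens=n_tokens)
    return n_tokens / (time.perf_counter() - start)
```

Careful benchmarks usually time only the decode phase, since prompt processing (prefill) runs at a much higher and unrelated rate.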
History of FP8 / FP4
FP8 was first formalized by NVIDIA, Arm, and Intel researchers in a 2022 paper ('FP8 Formats for Deep Learning') and became commercially available with the NVIDIA H100 (Hopper architecture, 2022). FP4 was introduced with the Blackwell B100/B200 datacenter cards in 2024 and trickled down to consumer hardware with the RTX 50-series in 2025.
Frequently Asked Questions
What makes FP4 better than typical INT4 quantization?
Do I need a new GPU to use FP8 or FP4?
Will FP8 replace GGUF?
Related Concepts
VRAM
The on-GPU memory that stores model weights. Determines which AI models you can run.
Quantization
Compressing model weights from 16-bit to 4-bit precision to massively reduce VRAM usage.
Tokens per Second (TPS)
The universal speed metric for LLMs: how many tokens (roughly, word fragments) your GPU generates per second.