Tokens per Second (TPS)
The universal speed metric for local LLMs: how many tokens your GPU generates per second.
Definition
Tokens per Second (TPS) is the primary benchmark for local LLM performance. A 'token' is roughly equivalent to 0.75 English words. TPS measures the throughput of the inference engine during the generation (decode) phase, not the prompt-processing (prefill) phase. It depends on the model size, quantization level, GPU memory bandwidth, and batch size.
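The metric itself is simple arithmetic. A minimal sketch of how TPS is computed from a generation run, using the 0.75 words-per-token rule of thumb from the definition (the sample numbers are illustrative, not measurements):

```python
def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    # Count only generated tokens, not the prompt: prefill is timed separately.
    return n_tokens / elapsed_s

def approx_words(n_tokens: float) -> float:
    # Rule of thumb: 1 token is roughly 0.75 English words.
    return n_tokens * 0.75

tps = tokens_per_second(512, 5.2)  # e.g. 512 tokens generated in 5.2 s
print(f"{tps:.1f} TPS ~= {approx_words(tps):.0f} words/s")
```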
Why It Matters
High. For single-user chat, 10-20 TPS feels like real-time conversation. Below 5 TPS feels noticeably slow. For agentic pipelines that process thousands of tokens per task, TPS directly impacts workflow duration and cost.
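The impact on agentic workloads can be sketched with the same arithmetic; the 4,000-token task size below is an assumption for illustration, and the estimate ignores prefill and tool-call latency:

```python
def task_duration_s(tokens_per_task: int, tps: float) -> float:
    # Generation time only; prefill, network, and tool latency are extra.
    return tokens_per_task / tps

# The same 4,000-token agentic task at two throughput levels:
print(task_duration_s(4000, 100))  # 40.0 s
print(task_duration_s(4000, 15))   # ~267 s, over four minutes per task
```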
Real-World Example
An RTX 4090 running Llama 3.1 8B at Q4_K_M typically achieves 80-120 TPS. The same card running Llama 3.1 70B at Q4_K_M achieves approximately 15-25 TPS, and both are perfectly usable for chat.
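These numbers follow from memory bandwidth, mentioned in the definition above: during decode, each generated token requires reading roughly all model weights once, so bandwidth divided by model size gives a rough TPS ceiling. A sketch under that assumption (the bandwidth and model-size figures are approximate, and real TPS lands well below the ceiling):

```python
def tps_upper_bound(bandwidth_gb_s: float, model_size_gb: float) -> float:
    # Decode is typically memory-bandwidth bound: every token reads ~all weights.
    # Ignores KV-cache traffic and kernel overhead, so this is an optimistic cap.
    return bandwidth_gb_s / model_size_gb

# Illustrative: RTX 4090 ~1008 GB/s; an 8B model at Q4_K_M is roughly 5 GB.
print(f"{tps_upper_bound(1008, 5):.0f} TPS ceiling")  # ~202; measured 80-120 fits under it
```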
History of Tokens per Second (TPS)
TPS as a standardized metric emerged with llama.cpp's benchmarking tools in 2023. Before that, AI researchers typically measured model performance using perplexity and FLOPS. Georgi Gerganov's llama.cpp shipped a llama-bench tool that reported TPS, which quickly became the community standard on forums like r/LocalLLaMA (Reddit).