Tokens per Second (TPS)
The universal speed metric for local LLMs: how many tokens your GPU generates per second.
Definition
Tokens per Second (TPS) is the primary benchmark for local LLM performance. A 'token' is roughly equivalent to 0.75 English words. TPS measures the throughput of the inference engine during the generation (decode) phase, not the prompt-processing (prefill) phase. It depends on the model size, quantization level, GPU memory bandwidth, and batch size.
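The metric itself is simple arithmetic. A minimal sketch of how TPS is computed from a generation run, using the 0.75 words-per-token rule of thumb from the definition (the sample numbers are illustrative, not measurements):

```python
def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    # Count only generated tokens, not the prompt: prefill is timed separately.
    return n_tokens / elapsed_s

def approx_words(n_tokens: float) -> float:
    # Rule of thumb: 1 token is roughly 0.75 English words.
    return n_tokens * 0.75

tps = tokens_per_second(512, 5.2)  # e.g. 512 tokens generated in 5.2 s
print(f"{tps:.1f} TPS ~= {approx_words(tps):.0f} words/s")
```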
Why It Matters
High. For single-user chat, 10-20 TPS feels like real-time conversation. Below 5 TPS feels noticeably slow. For agentic pipelines that process thousands of tokens per task, TPS directly impacts workflow duration and cost.
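The impact on agentic workloads can be sketched with the same arithmetic; the 4,000-token task size below is an assumption for illustration, and the estimate ignores prefill and tool-call latency:

```python
def task_duration_s(tokens_per_task: int, tps: float) -> float:
    # Generation time only; prefill, network, and tool latency are extra.
    return tokens_per_task / tps

# The same 4,000-token agentic task at two throughput levels:
print(task_duration_s(4000, 100))  # 40.0 s
print(task_duration_s(4000, 15))   # ~267 s, over four minutes per task
```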
Real-World Example
An RTX 4090 running Llama 3.1 8B at Q4_K_M typically achieves 80-120 TPS. The same card running Llama 3.1 70B at Q4_K_M achieves approximately 15-25 TPS, and both are perfectly usable for chat.
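These numbers follow from memory bandwidth, mentioned in the definition above: during decode, each generated token requires reading roughly all model weights once, so bandwidth divided by model size gives a rough TPS ceiling. A sketch under that assumption (the bandwidth and model-size figures are approximate, and real TPS lands well below the ceiling):

```python
def tps_upper_bound(bandwidth_gb_s: float, model_size_gb: float) -> float:
    # Decode is typically memory-bandwidth bound: every token reads ~all weights.
    # Ignores KV-cache traffic and kernel overhead, so this is an optimistic cap.
    return bandwidth_gb_s / model_size_gb

# Illustrative: RTX 4090 ~1008 GB/s; an 8B model at Q4_K_M is roughly 5 GB.
print(f"{tps_upper_bound(1008, 5):.0f} TPS ceiling")  # ~202; measured 80-120 fits under it
```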
History of Tokens per Second (TPS)
TPS as a standardized metric emerged with llama.cpp's benchmarking tools in 2023. Before that, AI researchers typically measured model performance using perplexity and FLOPS. Georgi Gerganov's llama.cpp shipped a llama-bench tool that reported TPS, which quickly became the community standard on forums like r/LocalLLaMA (Reddit).