Inference
Running a trained AI model to generate outputs — what your local GPU does when you chat with an LLM.
Definition
Inference is the process of using a trained neural network to generate predictions or outputs from new input data, in contrast with training, which updates the model's weights. Local AI enthusiasts only ever run inference: the models they download are already pre-trained. Inference is computationally far cheaper than training, but on consumer GPUs it is still memory-bandwidth-bound.
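To make the contrast concrete, here is a minimal PyTorch sketch. The tiny linear model, input, and loss are purely illustrative: inference is just a forward pass with gradients disabled, while training adds a backward pass and a weight update.

```python
import torch
import torch.nn as nn

# A tiny stand-in model; any trained network works the same way.
model = nn.Linear(16, 4)
x = torch.randn(1, 16)          # new input data

# Inference: weights are frozen, we only read them to produce an output.
model.eval()
with torch.no_grad():           # no gradients tracked, so less memory and compute
    prediction = model(x)

# Training (for contrast): the same forward pass, plus a backward pass
# that computes gradients and an optimizer step that updates the weights.
model.train()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
target = torch.randn(1, 4)
loss = nn.functional.mse_loss(model(x), target)
loss.backward()                 # gradient computation (the expensive extra work)
optimizer.step()                # weight update; inference never does this
```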
Why It Matters
High. All local AI use cases are inference workloads. Understanding inference vs. training clarifies why a consumer GPU (high VRAM, high bandwidth) is the right tool vs. a training cluster (thousands of GPUs, optimized for gradient computation).
Real-World Example
Typing a message into Open WebUI connected to an Ollama server triggers an inference pass: your input tokens are fed through the model's transformer layers (32 of them in a typical 7B-8B model), each performing matrix multiplications against the stored weights, until a new token is sampled; this repeats until the response is complete.
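That token-by-token loop can be sketched roughly as follows. This assumes a Hugging Face-style causal language model and tokenizer, uses greedy sampling, and omits the KV cache and streaming; it is a simplified illustration, not how Ollama is implemented internally.

```python
import torch

def generate(model, tokenizer, prompt, max_new_tokens=256):
    """Schematic autoregressive decode loop (greedy sampling for simplicity)."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():                       # inference only: no gradients
        for _ in range(max_new_tokens):
            logits = model(ids).logits          # one pass through every layer
            next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=-1)
            if next_id.item() == tokenizer.eos_token_id:
                break                           # model signalled end of response
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```

Every iteration re-reads the model's weights from VRAM, which is why generation speed tracks memory bandwidth rather than raw compute.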
History of Inference
Inference has been performed since the first neural networks in the 1950s. The focus on optimizing consumer inference (rather than just training) exploded in 2023 when llama.cpp made it possible to run GPT-3-class models on a laptop. The field of 'inference optimization' now includes techniques like speculative decoding, continuous batching, and flash attention.
Frequently Asked Questions
What is the difference between training and inference?
Why does my GPU fan get so loud during inference?
Is batched inference faster?
Related Concepts
VRAM
The on-GPU memory that stores model weights. Determines which AI models you can run.
Tokens per Second (TPS)
The universal speed metric for LLMs — how many tokens (roughly word fragments) your GPU generates per second.
Memory Bandwidth
How fast your GPU can read model weights from VRAM — the real determinant of inference speed.
KV Cache
The memory that stores the AI's 'conversation history' during generation — it lives in VRAM.