Inference
Running a trained AI model to generate outputs — what your local GPU does when you chat with an LLM.
Definition
Inference is the process of using a trained neural network to generate predictions or outputs from new input data, in contrast with training, which updates the model's weights. Local AI enthusiasts only ever run inference: the models they download are already pre-trained. Inference is computationally far cheaper than training, but on consumer GPUs it is still memory-bandwidth-bound.
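To make the contrast concrete, here is a minimal PyTorch sketch. The tiny linear model, input, and loss are purely illustrative: inference is just a forward pass with gradients disabled, while training adds a backward pass and a weight update.

```python
import torch
import torch.nn as nn

# A tiny stand-in model; any trained network works the same way.
model = nn.Linear(16, 4)
x = torch.randn(1, 16)          # new input data

# Inference: weights are frozen, we only read them to produce an output.
model.eval()
with torch.no_grad():           # no gradients tracked, so less memory and compute
    prediction = model(x)

# Training (for contrast): the same forward pass, plus a backward pass
# that computes gradients and an optimizer step that updates the weights.
model.train()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
target = torch.randn(1, 4)
loss = nn.functional.mse_loss(model(x), target)
loss.backward()                 # gradient computation (the expensive extra work)
optimizer.step()                # weight update; inference never does this
```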
Why It Matters
High. All local AI use cases are inference workloads. Understanding inference vs. training clarifies why a consumer GPU (high VRAM, high bandwidth) is the right tool vs. a training cluster (thousands of GPUs, optimized for gradient computation).
Real-World Example
Typing a message into Open WebUI connected to an Ollama server triggers an inference pass: your input tokens are fed through the model's transformer layers (32 of them in a typical 7B-8B model), each performing matrix multiplications against the stored weights, until a new token is sampled; this repeats until the response is complete.
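That token-by-token loop can be sketched roughly as follows. This assumes a Hugging Face-style causal language model and tokenizer, uses greedy sampling, and omits the KV cache and streaming; it is a simplified illustration, not how Ollama is implemented internally.

```python
import torch

def generate(model, tokenizer, prompt, max_new_tokens=256):
    """Schematic autoregressive decode loop (greedy sampling for simplicity)."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():                       # inference only: no gradients
        for _ in range(max_new_tokens):
            logits = model(ids).logits          # one pass through every layer
            next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=-1)
            if next_id.item() == tokenizer.eos_token_id:
                break                           # model signalled end of response
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```

Every iteration re-reads the model's weights from VRAM, which is why generation speed tracks memory bandwidth rather than raw compute.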
History of Inference
Inference has been performed since the first neural networks in the 1950s. The focus on optimizing consumer inference (rather than just training) exploded in 2023 when llama.cpp made it possible to run GPT-3-class models on a laptop. The field of 'inference optimization' now includes techniques like speculative decoding, continuous batching, and flash attention.
Frequently Asked Questions
What is the difference between training and inference?
Why does my GPU fan get so loud during inference?
Is batched inference faster?
Related Concepts
VRAM
The on-GPU memory that stores model weights. Determines which AI models you can run.
Tokens per Second (TPS)
The universal speed metric for LLMs — how many tokens (roughly word fragments) your GPU generates per second.
Memory Bandwidth
How fast your GPU can read model weights from VRAM — the real determinant of inference speed.
KV Cache
The memory that stores the AI's 'conversation history' during generation — it lives in VRAM.