LLM (Large Language Model)
AI models like Llama, Mistral, and DeepSeek that generate human-like text: the software your GPU runs.
Definition
A Large Language Model (LLM) is a transformer-based neural network trained on massive text datasets to predict the next token in a sequence. Modern LLMs (GPT-4, Llama 3, DeepSeek R1) have billions of parameters and can perform reasoning, coding, translation, creative writing, and instruction-following.
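To make the "predict the next token" idea concrete, here is a minimal sketch using the Hugging Face transformers library with gpt2 as a small stand-in checkpoint (any local causal LM such as a Llama or Mistral variant works the same way); the printed continuation is illustrative, not guaranteed:

```python
# Minimal next-token prediction sketch; gpt2 is a small stand-in model.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "gpt2"  # swap in any local Llama/Mistral/DeepSeek checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Large language models work by predicting the next"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Greedy choice: the single most likely next token given the prompt.
next_token_id = int(logits[0, -1].argmax())
print(tokenizer.decode(next_token_id))
```

Text generation is just this step repeated: the chosen token is appended to the input and the model is run again, until it emits a stop token or hits a length limit.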
Why It Matters
Foundational. Every other concept in this glossary exists to help you run LLMs more effectively on local hardware. Model size (measured in billions of parameters) directly correlates with VRAM requirements.
Real-World Example
Llama 3.3 70B is an LLM with 70 billion parameters. At Q4 quantization each parameter occupies a little over half a byte (4 bits plus quantization scaling metadata), so the weights alone come to roughly 40GB of VRAM. That fits on a dual RTX 4090 setup (2 × 24GB); a single RTX 5090 (32GB) can only run it with some layers offloaded to system RAM.
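A back-of-the-envelope version of that arithmetic (a sketch only: the 4.5 bits-per-parameter figure is an assumed average for a Q4-style quant including its scaling metadata, and real deployments add KV cache and runtime overhead on top):

```python
# Rough VRAM needed just for model weights; excludes KV cache and overhead.
def weight_vram_gb(params_billions: float, bits_per_param: float) -> float:
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9  # decimal gigabytes

print(weight_vram_gb(70, 16))   # full FP16 weights: ~140 GB
print(weight_vram_gb(70, 4.5))  # Q4-style quant:    ~39 GB
```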
History of LLM (Large Language Model)
The transformer architecture was introduced in 'Attention Is All You Need' (Google, 2017). GPT-1 (OpenAI, 2018) demonstrated unsupervised text generation. GPT-3 (2020) shocked the world with its capabilities at 175B parameters. Meta's LLaMA (2023) democratized the field by releasing competitive open weights, spawning the entire local AI ecosystem.
Frequently Asked Questions
Are local LLMs as smart as GPT-4?
What happens if I turn off my wifi?
Why do some LLMs refuse my prompts locally?
Related Concepts
VRAM
The on-GPU memory that stores model weights. Determines which AI models you can run.
Quantization
Compressing model weights from 16-bit to 4-bit precision to massively reduce VRAM usage.
Tokens per Second (TPS)
The universal speed metric for LLMs: how many tokens (roughly word-sized chunks of text) your GPU generates per second.
Context Window
How much text the AI can 'remember' and process at once; tied directly to VRAM through the KV cache, as the sketch below illustrates.
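To show how the context window feeds back into VRAM, here is a rough KV-cache estimate assuming a Llama-3-style 70B architecture (80 layers, 8 grouped-query KV heads, head dimension 128, FP16 cache entries); the exact numbers vary by model and runtime:

```python
# Rough KV-cache size per context length; architecture values are assumptions
# for a Llama-3-style 70B model and differ between models and runtimes.
def kv_cache_gb(context_tokens: int,
                layers: int = 80,         # transformer blocks
                kv_heads: int = 8,        # grouped-query attention KV heads
                head_dim: int = 128,      # dimension per attention head
                bytes_per_value: int = 2  # FP16 cache entries
                ) -> float:
    # Factor of 2: both a key and a value vector are cached per layer per token.
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 1e9

print(kv_cache_gb(8_192))    # ~2.7 GB on top of the weights
print(kv_cache_gb(131_072))  # ~43 GB at a full 128K context
```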