The Foundations of AI :: Inside the Loop #01
An introductory vocabulary for generative AI and deep learning: the essential terms before the technical conversations begin
A pattern recurs in conversations with people who work in and around this field (customers, colleagues, practitioners with years of experience) that is easy to overlook until it becomes impossible to ignore. The assumed basics are often not as shared as everyone assumes. People who have shipped AI products describe models as “open-source” when they mean something considerably more restricted. People who have read dozens of papers use “parameter” and “weight” as synonyms, as though the distinction does not matter. People who have sat through alignment presentations still reach for “hallucination” as though it were a bug that could be fixed rather than a structural property of how these systems work. Nobody admits the gap. The conversation proceeds as though it does not exist. And then something important gets lost in the imprecision.
This piece is the foundation for everything that follows in the Inside the Loop series. Inside the Loop is the technical deep-dive track of Mind in the Loop: each issue walks through a specific AI topic with enough mathematical and technical grounding to make the subject genuinely legible, not just familiar. The goal is not to train engineers. It is to make sure that anyone thinking seriously about this field understands what is actually happening inside the systems they are discussing, building or deploying. It does not assume a background in machine learning. It does assume an interest in understanding the actual mechanics of the systems reshaping how knowledge is produced, stored and retrieved, as well as a tolerance for precision over comfort.
Each term is defined once, as tightly as the concept permits. Where a common misconception exists, it is noted. Where a term carries different meanings in different contexts, the relevant distinction is drawn. The vocabulary is organized by conceptual layer, beginning with the mathematical building blocks and moving outward toward the systems, deployment practices and alignment techniques that determine how these models behave in the world.
A reader who works through this glossary will not become a machine learning practitioner. They will, however, be equipped to follow a technical argument without losing the thread when an author drops a term without definition (which is most of the time).
A note on this document: this glossary is a living reference. As the Inside the Loop series covers new architectures, techniques and concepts, the relevant terms will be added here and linked back from the articles where they appear. The definitions below provide a working foundation and each reader is encouraged to treat them as a starting point for their own study, not a substitute for it. The field rewards the curious who go further.
The building blocks
Every neural network, regardless of its size or purpose, reduces to a small set of mathematical primitives. These are the terms that appear in every paper, every benchmark comparison and every discussion of what makes one model different from another.
Neural network
A computational system composed of layers of mathematical operations, loosely modeled on the structure of biological neurons. Data flows forward through the layers, gets transformed at each step and produces an output. The network learns by adjusting its internal numbers (its weights) until its outputs match a desired target. Every system discussed in this series, from the earliest image classifiers to the latest frontier language models, is a neural network at its core.
Parameter
Any learnable number inside a model. Parameters are the superset: they include weights, biases, the scale and shift values inside normalization layers and the vectors inside embedding tables. When a model is described as having 70 billion parameters, that is the total count of all such numbers combined. More parameters mean more capacity to store learned patterns; they also mean more memory and more compute required at every step. Parameter count is the most commonly cited measure of model size and the least informative measure of what a model actually costs to run.
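The arithmetic is worth seeing once. A minimal sketch, with invented layer sizes, of how a parameter count accumulates from weights and biases:

```python
# Parameter count for one fully connected layer mapping
# d_in inputs to d_out outputs: a weight matrix plus one bias per output.
def dense_layer_params(d_in: int, d_out: int) -> int:
    weights = d_in * d_out   # one weight per input-output connection
    biases = d_out           # one bias per output neuron
    return weights + biases

# A tiny three-layer network (sizes are illustrative, not from any real model):
sizes = [512, 2048, 2048, 512]
total = sum(dense_layer_params(a, b) for a, b in zip(sizes, sizes[1:]))
print(total)  # -> 6296064
```

For a real 70-billion-parameter model the same accounting runs over attention projections, feed-forward layers, normalization parameters and embedding tables.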
Weight
The specific type of parameter that controls how strongly one node in a network influences the next. A weight near zero means the connection carries almost no signal; a large positive or negative value means it is highly influential. Weights are what most people mean when they say “the knowledge in a model lives in its parameters”. Strictly speaking, it lives predominantly in its weights. The two terms are used interchangeably throughout the field, including in most research papers, which is why the conflation is so persistent. The distinction that matters in practice: weight decay (a regularization technique) specifically targets weights, not all parameters equally.
Bias
A second type of parameter, distinct from weights. While a weight scales an input, a bias shifts the output by a fixed amount, allowing a neuron to fire even when its inputs are zero. Every neuron in a standard network has one bias value. Biases give layers the flexibility to represent patterns that weights alone cannot capture.
Layer
A discrete transformation step in a neural network. Each layer takes a set of numbers, applies an operation and passes the result to the next layer. Early layers tend to capture low-level patterns; later layers capture higher-level abstractions. The number of stacked layers is what researchers mean when they describe a model as “deep”.
Tensor
The fundamental data structure of deep learning. A tensor is a multi-dimensional array of numbers. A single number is a zero-dimensional tensor; a list of numbers is a one-dimensional tensor (a vector); a matrix is two-dimensional. Most data flowing through a neural network (token embeddings, attention matrices and batch inputs) is a three- or four-dimensional tensor.
Logits
The raw, unnormalized scores produced by the final layer of a model before any probability conversion. For a language model predicting the next token, logits are a vector with one number per item in the vocabulary. A high logit signals that the model considers that token likely. Logits are passed through the softmax function to produce probabilities during generation.
Dense architecture
A neural network in which every parameter activates for every input, every time. The full computational weight of the system engages regardless of whether the task requires it. Dense models are the historical baseline against which sparse architectures are measured; their inference cost scales linearly with parameter count.
The AI stack
The vocabulary of systems, deployment practices and interaction paradigms that govern how language models are used in production represents a second layer of terminology, distinct from model architecture but equally important for following contemporary AI coverage.
Large Language Model (LLM)
A model trained on a large text corpus to predict the next token. Models above roughly 7 to 10 billion parameters begin exhibiting qualitative capability jumps, including instruction following, multi-step reasoning and in-context learning, absent in smaller models. GPT-4, Claude, Gemini, Llama, DeepSeek and Mistral are all large language models. The term is increasingly imprecise as models become multimodal.
Prompt / prompt engineering
A prompt is the text input given to a language model. Prompt engineering is the practice of designing inputs that reliably produce desired outputs, specifying format, persona, examples or reasoning instructions. The discipline emerged because model behavior is highly sensitive to phrasing in ways that are not always predictable. As frontier models grow more robust to phrasing, its importance is declining.
System prompt
A privileged prompt, typically invisible to the end user, that defines a model’s behavior, persona or constraints for a given deployment. Processed before the user’s message. Operators use it to configure assistant behavior, restricting topics, setting tone and providing product context. The system prompt cannot be fully protected from extraction by a sufficiently motivated user in most current implementations.
In-context learning (ICL)
The ability of a language model to learn a new task from examples provided in the prompt, without any weight updates. Show the model several input-output pairs and it generalizes to new cases without training. This property emerges at scale and is one of the more surprising capabilities of large Transformers. It is what makes few-shot prompting effective.
Few-shot / zero-shot
Zero-shot describes asking a model to perform a task with no examples, relying on pre-training knowledge alone. Few-shot provides a small number of examples (typically two to ten) before the actual query. Both are forms of in-context learning. Zero-shot performance on complex tasks is used as a benchmark of frontier model quality.
Chain-of-thought (CoT)
A prompting technique in which the model produces step-by-step reasoning before its final answer. Introduced by Wei et al. at Google in 2022. Chain-of-thought substantially improves performance on multi-step reasoning tasks by externalizing intermediate steps into the context window rather than compressing them into a single output.
Reasoning model
A class of large language model trained or prompted to spend additional compute on a problem before producing a final answer, typically through extended chain-of-thought, search over possible solution paths or process reward models. OpenAI o1 and o3, DeepSeek-R1 and Claude 3.7 Sonnet are examples. Reasoning models trade inference latency for accuracy on hard tasks. The distinction between reasoning and non-reasoning models is a 2024-2025 commercial framing rather than a sharp architectural boundary.
RAG (Retrieval-Augmented Generation)
An architecture that augments generation by first retrieving relevant documents from an external knowledge store and including them in the prompt as context. RAG addresses two limitations of pure language models: outdated knowledge and hallucination. Retrieval is typically done via embedding similarity search over a vector database.
Vector database
A database optimized for storing and querying high-dimensional embedding vectors via similarity search. Given a query embedding, a vector database returns the most semantically similar stored vectors. Used as the retrieval backend in RAG systems. Standard SQL databases are not efficient for high-dimensional vector similarity at scale.
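A brute-force sketch of the core operation, with invented three-dimensional embeddings standing in for the thousands of dimensions real systems use. Production vector databases replace the linear scan with approximate indexes such as HNSW:

```python
import math

# Toy similarity search: given a query embedding, return stored documents
# ranked by cosine similarity.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

store = {
    "doc_cats":  [0.9, 0.1, 0.0],
    "doc_dogs":  [0.8, 0.3, 0.1],
    "doc_stock": [0.0, 0.1, 0.9],
}

def search(query, k=2):
    ranked = sorted(store, key=lambda doc: cosine(query, store[doc]), reverse=True)
    return ranked[:k]

print(search([1.0, 0.0, 0.0]))  # -> ['doc_cats', 'doc_dogs']
```

The two "animal" documents rank first because their vectors point in nearly the same direction as the query; the geometry, not any keyword match, does the retrieval.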
Agent / agentic AI
A system in which a language model is given tools (search, code execution, file access) and operates in a loop: observe the environment, reason, act, observe the outcome and repeat, rather than producing a single response. Agents pursue multi-step goals with varying degrees of autonomy. The dominant challenge is reliability: errors compound over long action sequences. 2024 and 2025 marked the first period in which agentic systems were deployed at scale in production.
Tool use / function calling
The ability of a language model to invoke external functions or APIs during generation, including search, code execution and database queries. The model outputs a structured call specifying which tool to invoke and with what arguments; the result is returned as additional context. Tool use transforms a language model from a text predictor into a system capable of interacting with external services.
Hallucination
The phenomenon in which a model generates fluent, confident text that is factually wrong or entirely fabricated. Hallucination is structural: language models predict probable token sequences, not verified facts. High-confidence hallucinations (those where the model does not signal uncertainty) are the most consequential for production deployments. RAG, RLHF and fine-tuning reduce but cannot eliminate the problem.
Multimodal
A model that processes more than one type of data. Text plus image is the most common combination; text plus audio, image plus video and unified all-modality models are active research areas. The core engineering challenge is representing different data types in a shared token space. Multimodality is now a baseline expectation for frontier models.
Open-weight model
A model whose weights are publicly released, allowing download, local deployment and fine-tuning, but whose training code, data and full methodology may not be disclosed. This is the correct term for what media coverage routinely describes as “open-source.” Llama 3, Mixtral, DeepSeek-V3 and Kimi K2 are open-weight. The release of open-weight models compresses the commercial advantage of proprietary labs and accelerates independent research.
Benchmark
A standardized test used to measure and compare model capabilities. Common examples include MMLU (academic knowledge across subjects), HumanEval (code generation), MATH (mathematical reasoning), GPQA (graduate-level science questions) and the LMSYS Arena (human preference). Benchmarks are the common currency of capability claims in research papers and are regularly criticized as gameable, saturated or poorly representative of real-world usefulness. A model that tops published benchmarks is not necessarily the most useful in deployment.
How models learn
The mechanics of training (how a model’s parameters are adjusted from random initialization toward a useful configuration) are governed by a small set of algorithms and hyper-parameters. These terms appear in every discussion of model behavior, capability and failure mode.
Training
The process of adjusting a model’s parameters so that its outputs better match a target signal. The model processes a batch of data, produces predictions, measures its error via a loss function and uses back-propagation to adjust weights in the direction that reduces that error. This cycle repeats billions of times across a large dataset.
Inference
Using a trained model to produce outputs. No learning happens during inference because weights are frozen. Inference cost is what a deployment pays per user query; this is precisely why architectural choices that reduce per-token computation, like Mixture of Experts, carry such significant commercial stakes.
Forward pass
One complete run of an input through all model layers to produce an output. During inference, a forward pass produces a next-token prediction. During training, it produces a loss value used to run back-propagation. Per-token computational cost is determined by how many parameters activate during each forward pass.
Loss function
A measurement of how wrong the model’s current predictions are. The training process is an effort to minimize this number. For language models, the standard loss is cross-entropy over next-token predictions: the negative log-probability assigned to the correct next token. Lower loss means the model assigns higher probability to the right answer.
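The computation itself is one line. A sketch with an invented three-token vocabulary and made-up probabilities:

```python
import math

# Cross-entropy for next-token prediction: the negative log-probability
# the model assigned to the correct token.
def cross_entropy(probs, correct_index):
    return -math.log(probs[correct_index])

probs = [0.7, 0.2, 0.1]          # model's distribution over a 3-token vocabulary
print(cross_entropy(probs, 0))   # confident and right: low loss (~0.36)
print(cross_entropy(probs, 2))   # correct token got only 10%: high loss (~2.30)
```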
Back-propagation
The algorithm that computes how much each weight contributed to the current loss and in which direction it should be adjusted. It propagates the error signal backward through layers using the chain rule of calculus. Introduced to neural networks by Rumelhart, Hinton and Williams in 1986. Without it, training deep networks at scale would be computationally intractable.
Gradient descent
The optimization algorithm that uses back-propagation results to update weights. The gradient points in the direction of steepest increase in loss; gradient descent moves weights in the opposite direction by a step size controlled by the learning rate. Modern variants (Adam, AdamW) maintain per-parameter adaptive step sizes and are almost universally used in large language model training.
Learning rate
A scalar that controls how large each weight update step is. Too large: the model overshoots good solutions and training diverges. Too small: convergence is extremely slow. In practice, a learning rate schedule is used, warming up from a small value, peaking, then decaying. Choosing the learning rate correctly is one of the most consequential decisions in a training run.
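Both ideas fit in a one-parameter sketch, using an invented loss function with a known minimum:

```python
# Gradient descent on loss(w) = (w - 3)^2, whose minimum is at w = 3.
# The gradient of that loss is 2 * (w - 3).
def gradient(w):
    return 2 * (w - 3)

w = 0.0
learning_rate = 0.1      # larger than 1.0 would diverge; much smaller would crawl
for _ in range(100):
    w -= learning_rate * gradient(w)   # step against the gradient

print(round(w, 4))  # -> 3.0
```

Each step shrinks the error by a constant factor here; real loss surfaces are vastly higher-dimensional and non-convex, which is why schedules and adaptive optimizers exist.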
Optimizer
The algorithm that translates gradients into weight updates. SGD (stochastic gradient descent) is the simplest. Adam maintains running estimates of gradient mean and variance per parameter, producing adaptive per-weight step sizes. AdamW adds weight decay. Virtually all frontier language models are trained with Adam or AdamW variants.
Batch / mini-batch
The number of training examples processed together before a single weight update is applied. Larger batches produce more stable gradient estimates but require more memory and reduce the number of updates per epoch. Smaller batches introduce noise that can help escape local optima. Batch size has significant effects on training dynamics and final model quality.
Epoch
One complete pass through the entire training dataset. Modern language models trained on trillions of tokens are often trained for less than one epoch: the dataset is so large that a full pass is never completed. Fine-tuning runs typically use multiple epochs on smaller curated datasets.
Overfitting
A failure mode in which a model learns the training data too precisely, including its noise, and performs poorly on new data. The model has memorized rather than generalized. Very large models can overfit even on trillion-token datasets if trained too long. Mitigations include dropout, regularization and early stopping.
Underfitting
A failure mode in which a model is too simple or undertrained to capture the patterns in the data, performing poorly on both training and test sets. In the large language model era, underfitting is typically a capacity problem (too few parameters) or a training budget problem (too few steps), not an algorithmic one.
Vanishing gradient
A training failure in which gradients become exponentially smaller as they propagate backward through many layers, making weights in early layers update negligibly or not at all. The central obstacle to training deep networks before ReLU activations and residual connections. Understanding it explains why both innovations mattered so much when they arrived.
Regularization
Any technique that reduces overfitting by discouraging the model from fitting too tightly to training data. Includes dropout, weight decay (penalizing large weight values) and data augmentation. In large language model training, weight decay implemented via AdamW is the dominant regularization mechanism.
Pre-training
The large-scale initial training phase in which a model learns from an enormous corpus of raw text without task-specific supervision. Pre-training teaches general language understanding, world knowledge and reasoning patterns. It requires the most compute. The result is a base model: capable but not yet aligned to follow instructions or produce consistently useful outputs.
Fine-tuning
A subsequent, smaller-scale training phase in which a pre-trained model is trained further on curated task-specific data. Fine-tuning adjusts behavior while preserving most pre-training knowledge. Instruction fine-tuning teaches the model to follow directions; RLHF and DPO are fine-tuning stages that incorporate human preference signals; LoRA and other parameter-efficient methods allow fine-tuning with minimal additional compute.
Language and representation
Before a neural network can process text, that text has to become numbers. The way that conversion happens and the vocabulary that describes it are shared across virtually every language model in existence today. The Transformer architecture, introduced by Vaswani et al. in 2017, is the structural frame within which these concepts live for most modern systems. But the terms in this section (tokens, embeddings, context windows, training objectives) belong to the broader problem of how language gets represented inside a machine, not to any single architecture.
Transformer
A neural network architecture that processes all tokens in a sequence simultaneously via attention, rather than sequentially as earlier recurrent networks did. A Transformer is a stack of identical blocks, each containing a self-attention mechanism and a feed-forward network. Its parallelism made it far more efficient to train on modern GPU hardware than its predecessors. Every language model named in this series is a Transformer.
Self-attention
The mechanism that lets each token in a sequence attend to every other token and gather relevant information from it. For each token, the mechanism computes three vectors: a Query (what am I looking for?), a Key (what do I offer?) and a Value (what will I contribute?). Attention scores are computed as the dot product of queries and keys, normalized by softmax. Self-attention layers are dense: every token attends to every other. Mixture of Experts does not touch these layers.
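Scaled dot-product attention reduces to a few lines. A sketch for a toy three-token sequence with two-dimensional vectors (all numbers invented; real heads use hundreds of dimensions and learned projection matrices to produce Q, K and V):

```python
import math

def softmax(xs):
    m = max(xs)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    d = len(K[0])
    out = []
    for q in Q:
        # score this query against every key, scaled by sqrt(dimension)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # output is the attention-weighted sum of the value vectors
        out.append([sum(w * v[i] for w, v in zip(weights, V))
                    for i in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = attention(Q, K, V)
```

Each output row is a blend of all three value vectors, weighted by how well that token's query matched each key; that blending is the "gathering of relevant information" the definition describes.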
Feed-forward network (FFN)
The second major component of each Transformer block, applied after attention. A two-layer neural network applied independently to each token: expand to a wider hidden dimension, apply a non-linear activation and project back down. FFN layers store factual knowledge acquired during pre-training. In large models, FFN parameters account for the majority of total parameter count.
Token
The atomic unit of text that a language model processes. Tokens are not words: they are subword fragments produced by a tokenizer. Common words are typically one token; rare or long words split into several; punctuation and whitespace add more. As a rough approximation, 750 words in English equal roughly 1,000 tokens. The model never sees raw text; it processes only sequences of integer token IDs, each mapped to an embedding vector.
Tokenization
The process of converting raw text into a sequence of integer token IDs. Modern tokenizers use subword segmentation (breaking text at statistically meaningful boundaries). The tokenizer is fixed after its own training and is not updated during model training. Tokenizers trained on English-heavy data are less efficient for other scripts, requiring more tokens to represent the same content.
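A toy greedy longest-match tokenizer over an invented four-entry vocabulary illustrates the word-to-fragment split. Real tokenizers (BPE, Unigram) learn their vocabularies and merge rules from data rather than matching greedily:

```python
# Invented vocabulary: subword fragments mapped to integer token IDs.
vocab = {"un": 0, "break": 1, "able": 2, "the": 3}

def tokenize(text):
    ids = []
    i = 0
    while i < len(text):
        # try the longest vocabulary entry that matches at position i
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return ids

print(tokenize("unbreakable"))  # -> [0, 1, 2]: "un" + "break" + "able"
print(tokenize("the"))          # -> [3]: a common word is a single token
```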
Vocabulary
The complete set of tokens a model recognizes, fixed at tokenizer training time. GPT-4 uses a vocabulary of roughly 100,000 tokens; Llama 3 uses approximately 128,000. Each token maps to a unique integer ID and a learned embedding vector. Vocabulary size determines the dimension of the logit vector the model produces at each generation step.
Context window
The maximum number of tokens a model can process in a single forward pass, input and output combined. Tokens outside the window are invisible to the model. A 128,000-token context window holds roughly 300 pages of text. Extending context windows without proportionate growth in compute cost is an active area of engineering, addressed by techniques like RoPE and sliding window attention.
Embedding
A dense numerical vector representing a token in high-dimensional space. Before the Transformer processes tokens, each integer ID is converted to an embedding vector, typically several thousand numbers long. The geometry of the embedding space carries meaning: semantically similar tokens cluster nearby. Embeddings are learned during training. The term “embedding model” (a model trained specifically to produce useful embeddings) is a distinct usage.
Causal language modeling
The training objective used by GPT-style decoder-only models. The model is trained to predict the next token given all previous tokens, but cannot see future tokens. This “causal masking” ensures no cheating during training. At inference, the model generates autoregressively: each new token is appended to the context and the next prediction follows.
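The causal mask is just a lower-triangular boolean matrix. A sketch for a four-token sequence:

```python
# Entry mask[i][j] is True when token i may attend to token j.
# Each token sees itself and the past, never the future.
n = 4
mask = [[j <= i for j in range(n)] for i in range(n)]
for row in mask:
    print(row)
# Row 0 sees only token 0; row 3, the latest token, sees all four.
```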
Masked language modeling (MLM)
The training objective used by BERT-style encoder-only models. A random subset of tokens is replaced with a mask token and the model predicts the original values. Unlike causal language modeling, MLM lets the model attend to tokens on both sides of the masked position, making it better suited to understanding and classification tasks. BERT models cannot generate text in the autoregressive sense; GPT models cannot use MLM.
Activation functions
Every feed-forward network inside a Transformer contains an activation function between its two linear layers. That placement is not incidental: without it, the entire FFN collapses to a single linear transformation regardless of how many layers it contains. Activation functions are what give that component its expressive power, and the same principle applies to every other layer in any deep network.
Activation function
Any mathematical function applied after a linear transformation to introduce non-linearity. Without activation functions, stacking layers provides no additional expressive power. The choice of activation function has measurable effects on training speed and model performance and has shifted significantly across the history of the field.
ReLU (Rectified Linear Unit)
The most historically prevalent activation function: f(x) = max(0, x). ReLU sets any negative input to zero and passes positive values unchanged. Its computational simplicity made it dominant from roughly 2012 to 2020, and it largely resolved the vanishing gradient problem that had plagued earlier activations. Most large language models have since moved to smoother variants.
GELU (Gaussian Error Linear Unit)
A smooth approximation of ReLU that weights inputs by the Gaussian cumulative distribution function rather than hard-clipping at zero. GELU allows small negative values to pass through with diminished magnitude. It was used in BERT, GPT-2 and GPT-3; its smooth gradient is generally understood to aid training stability in deep networks.
SwiGLU
An activation function introduced by Noam Shazeer in 2020 that combines the Swish activation with a learned gating mechanism. The input vector is split in two, one half is activated and the result is multiplied element-wise by the other half. SwiGLU has become the dominant choice in modern large language model feed-forward layers, appearing in LLaMA, PaLM, Gemini and DeepSeek.
Sigmoid
An activation function that maps any input to a value between 0 and 1, following an S-shaped curve: f(x) = 1/(1+e⁻ˣ). Historically used in binary classification and early recurrent networks, sigmoid was largely replaced in deep networks by ReLU and its variants because it saturates near its extremes, producing near-zero gradients that prevent effective learning in early layers.
Softmax
A function that converts a vector of arbitrary numbers into a probability distribution, with all values between 0 and 1 summing to 1. It amplifies differences between inputs: a score of 4.7 against a competing score of 0.3 becomes near-certainty against near-zero after softmax. Used in Transformer attention to normalize attention scores and in language model generation to convert logits into token probabilities.
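The 4.7-versus-0.3 example from the definition, computed directly:

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([4.7, 0.3])
print([round(p, 3) for p in probs])  # -> [0.988, 0.012]
```

A gap of 4.4 in the raw scores becomes a 99-to-1 split in probability, which is the amplification the definition describes.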
Architecture patterns
The Transformer’s design draws on a set of structural solutions that predate it and extend well beyond it. Some, like residual connections, normalization and dropout, are general-purpose techniques applicable to any deep network. Others, like multi-head attention, cross-attention, positional encoding and the KV cache, are specific mechanisms that make the Transformer work at scale. Both sets appear constantly in architecture papers and are worth holding as distinct concepts.
Residual connection (skip connection)
A shortcut that adds a layer’s input directly to its output, bypassing the layer’s own transformation: output = layer(x) + x. Introduced in ResNet by He et al. in 2015, residual connections solved the vanishing gradient problem in very deep networks by providing gradients a direct path backward through the architecture. Every Transformer block uses residual connections around both the attention layer and the feed-forward layer.
Layer normalization (LayerNorm)
A normalization operation that rescales a layer’s activations to have zero mean and unit variance, applied independently to each input in a batch. It stabilizes training in deep networks by preventing activations from growing or shrinking uncontrollably as they pass through many successive layers. LayerNorm is not specific to Transformers; it appears in recurrent networks, graph neural networks and other deep architectures wherever training instability from activation scale is a concern. In Transformers specifically, modern variants apply LayerNorm before each sub-layer rather than after, a placement that improves stability at scale.
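The operation itself is short. A sketch on a single activation vector; real LayerNorm additionally applies the learned scale and shift parameters mentioned under Parameter:

```python
import math

# Rescale a vector of activations to zero mean and unit variance.
# eps guards against division by zero when the variance is tiny.
def layer_norm(xs, eps=1e-5):
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

normed = layer_norm([2.0, 4.0, 6.0, 8.0])
print([round(x, 3) for x in normed])
```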
Batch normalization (BatchNorm)
A normalization technique that standardizes activations across the batch dimension rather than across the layer. The dominant normalization approach for convolutional networks and early deep architectures. Less suitable for language models with variable sequence lengths, which is why Transformers use layer normalization instead.
Dropout
A regularization technique in which, during training, each neuron’s output is randomly set to zero with a chosen probability (typically between 0.1 and 0.5). This forces the network to learn redundant representations, preventing any single neuron from becoming critical and reducing overfitting. Dropout is disabled at inference time; very large language models often use little or none.
Encoder
A network component that converts an input (text, image or audio) into a dense internal representation: a compressed, information-rich vector. The encoder does not produce a prediction; it produces a representation that other components can use. BERT is an encoder-only model, capable of representing text but not generating it.
Decoder
A network component that generates output one element at a time, conditioned on a representation. GPT-style models are decoder-only: the decoder reads all previous tokens and predicts the next one, autoregressively. In encoder-decoder architectures (the original Transformer for translation), the decoder also attends to the encoder’s output via cross-attention.
Multi-head attention (MHA)
An extension of self-attention that runs multiple attention computations in parallel, each with its own learned weight matrices. Each “head” attends to the input from a different learned perspective. Their outputs are concatenated and projected into a single representation. GPT-3 used 96 attention heads; typical large models use between 32 and 64.
Cross-attention
An attention mechanism in encoder-decoder models in which the decoder attends to the encoder’s representations rather than only to its own previous outputs. The query vectors come from the decoder; the key and value vectors come from the encoder. This is the mechanism through which translation models “read” the source sentence while generating the target language.
Positional encoding
A mechanism that injects sequence-order information into token embeddings before they enter the Transformer. The attention mechanism is position-agnostic by design, treating tokens as a set rather than a sequence; positional encodings restore the order. The original Transformer used fixed sinusoidal encodings. Modern models typically use relative positional encodings, particularly RoPE (Rotary Position Embedding), which generalize better to sequences longer than those seen during training.
KV cache (Key-Value cache)
An inference optimization that stores the computed key and value vectors for all previous tokens so they do not need to be recomputed when generating each new token. Without a KV cache, generating a 1,000-token response would require 1,000 full forward passes through the model. With it, each new token requires only one. KV cache size scales with context length and is a primary memory bottleneck at inference time.
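A sketch of the caching idea, with a stand-in projection function in place of the model’s real key/value computations:

```python
# During generation, keys and values for past tokens are computed once,
# appended to the cache and never recomputed. project_kv is a fake
# stand-in for the model's learned key/value projections.
def project_kv(token_id):
    return (token_id * 0.1, token_id * 0.2)   # invented (key, value) pair

cache = {"keys": [], "values": []}

def generate_step(new_token_id):
    k, v = project_kv(new_token_id)           # one projection per new token
    cache["keys"].append(k)
    cache["values"].append(v)
    # attention now reads the whole cache; its size grows with context length
    return len(cache["keys"])

for token in [7, 3, 9]:
    generate_step(token)
print(len(cache["keys"]))  # -> 3 cached key vectors after 3 generated tokens
```

The memory cost the definition mentions is visible here: the cache grows by one key and one value per token per layer per head, which is why long contexts are expensive to serve.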
Generative architectures
Large language models are not the only family of generative AI systems. Three other major architectures (GANs, VAEs and diffusion models) shaped the field’s development and remain central to image, video and audio generation. They share underlying concepts while diverging sharply in their mechanics.
Autoregressive model
A model that generates output one element at a time, each conditioned on all previously generated elements. GPT-style models are autoregressive over tokens. The mechanism is serial: predict token N+1 from tokens 1 through N, append it, then predict N+2. Autoregressive generation cannot be parallelized across the output, which is a fundamental inference speed constraint that techniques like speculative decoding address.
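The serial loop itself is simple; everything interesting lives inside the model call. In this sketch the `model` callable and the four-token vocabulary are stand-ins, not a real language model:

```python
def generate(model, prompt_tokens, n_new):
    """Greedy autoregressive generation: predict, append, repeat."""
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        logits = model(tokens)  # condition on everything generated so far
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens

# Toy "model" over a 4-token vocabulary: always favors (last token + 1) mod 4.
toy = lambda toks: [1.0 if i == (toks[-1] + 1) % 4 else 0.0 for i in range(4)]
print(generate(toy, [0], 5))  # [0, 1, 2, 3, 0, 1]
```

The loop body depends on its own previous output, which is exactly what prevents parallelizing across the generated sequence.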
GAN (Generative Adversarial Network)
An architecture introduced by Goodfellow et al. in 2014 in which two networks train against each other. A generator learns to produce realistic outputs; a discriminator learns to distinguish real from generated. The generator improves by fooling the discriminator; the discriminator improves by catching the generator. GANs dominated image synthesis from 2014 to roughly 2021 before diffusion models surpassed them in quality and training stability. GAN training is notoriously prone to mode collapse, where the generator learns to produce only a narrow range of outputs.
VAE (Variational Autoencoder)
A generative model that learns a compressed latent representation of data by training an encoder and decoder simultaneously, with a regularization term constraining the latent space to follow a known probability distribution. This allows generating new samples by sampling from the distribution. VAEs underpin the latent diffusion approach used in Stable Diffusion.
Diffusion model
A generative architecture that learns to reverse a gradual noising process. Training: incrementally add Gaussian noise to data until it becomes pure noise, and train a neural network to predict and remove that noise. Inference: start from pure noise and iteratively denoise. Diffusion models achieved state-of-the-art image quality in DALL-E 3, Midjourney and Stable Diffusion and have since been applied to audio, video and molecular design.
Latent diffusion
A variant of diffusion models that operates in the compressed latent space of a VAE rather than in pixel space. The insight is that diffusing and denoising a low-dimensional latent vector is far cheaper than operating on a full-resolution image, with minimal quality loss. Stable Diffusion is a latent diffusion model. The technique has become standard in image and video generation systems.
Latent space
The internal numerical space in which a model encodes compressed representations of inputs. A point in latent space corresponds to a combination of learned features. The geometry carries meaning: nearby points decode to similar outputs; arithmetic in the space can have semantic significance, a property first systematically documented in Word2Vec representations. Controlling and navigating the latent space is a central design challenge in generative model development.
CLIP (Contrastive Language-Image Pretraining)
A model developed by OpenAI in 2021 that aligns visual and language representations in a shared embedding space. Trained on image-text pairs, CLIP pulls matching pairs together and pushes mismatched pairs apart in the shared space. It became a foundational component in text-to-image systems, used to guide diffusion models toward a target text prompt. CLIP-style contrastive training is now widespread in multimodal model design.
Language generation
Once a language model exists, a set of decisions governs how it produces text. These are not architectural choices; they are operational ones, made at inference time and having significant effects on output quality, diversity and reliability.
Temperature
A scalar applied to logits before softmax during generation that controls randomness. Temperature below 1 sharpens the distribution, making the most likely tokens even more probable and the output more deterministic. Temperature above 1 flattens it, giving less probable tokens more chance and the output more variety. Temperature of 0 is the limiting case, equivalent to greedy decoding; temperature of 1 samples from the raw model distribution.
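In code, temperature is a single division applied to the logits before softmax (a minimal sketch):

```python
import math

def apply_temperature(logits, T):
    """Divide logits by T, then softmax. T < 1 sharpens; T > 1 flattens."""
    scaled = [l / T for l in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = apply_temperature(logits, 0.5)   # top token's probability rises
plain = apply_temperature(logits, 1.0)   # the raw model distribution
flat = apply_temperature(logits, 2.0)    # probability mass spreads out
```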
Top-p sampling (nucleus sampling)
A sampling strategy that, at each generation step, restricts choices to the smallest set of tokens whose cumulative probability exceeds a threshold p (typically 0.9 or 0.95) and samples from that set. Unlike top-k, the set size adapts to the distribution’s shape: a confident distribution produces a small nucleus; a flat distribution produces a large one. Top-p is the most widely used sampling method in deployed language models.
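Nucleus filtering translates directly into code when the distribution is given as an explicit list (a sketch; production implementations work on sorted tensors):

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of most-probable tokens whose cumulative
    probability reaches p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    z = sum(probs[i] for i in kept)
    return {i: probs[i] / z for i in kept}

confident = top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.9)  # small nucleus: 3 tokens
uniform = top_p_filter([0.25, 0.25, 0.25, 0.25], p=0.9)  # large nucleus: all 4
```

The two calls show the adaptive behavior described above: the peaked distribution keeps three tokens, the flat one keeps everything.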
Top-k sampling
A sampling strategy that restricts generation to the k most probable tokens at each step, zeroing out all others before sampling. Simpler than top-p and widely used in combination with it. Setting k too low risks repetition; too high reintroduces improbable tokens. Most deployed systems use top-p and temperature as primary controls, with top-k as an optional additional constraint.
Beam search
A decoding strategy that maintains the k most probable partial sequences at each step, expands all of them, and keeps only the best k. It is deterministic and tends toward high-probability but generic output; it dominated text generation until sampling strategies became preferred for open-ended tasks. It remains widely used in translation and summarization.
Greedy decoding
The simplest generation strategy: select the highest-probability token at each step. Fast and deterministic, but produces repetitive and often degenerate output for open-ended generation. Equivalent to temperature of 0. Used in constrained settings where determinism matters more than creativity.
Perplexity
A metric for evaluating language model quality that measures how surprised the model is by a held-out text. Formally, it is the exponentiated average negative log-probability per token. Lower is better. A perplexity of 10 means the model is, on average, as uncertain as if it had to choose uniformly among 10 equally probable tokens at each step. Perplexity comparisons are only meaningful between models evaluated on the same text with the same tokenizer; cross-distribution comparisons are unreliable.
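The definition translates into two lines of code:

```python
import math

def perplexity(token_logprobs):
    """Exponentiated average negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that assigns every token probability 1/10 has perplexity exactly 10.
ppl = perplexity([math.log(1 / 10)] * 50)
print(round(ppl, 6))  # 10.0
```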
Repetition penalty
A generation parameter that reduces the probability of tokens that have already appeared in the output. In the common CTRL-style formulation, repeated tokens’ positive logits are divided by a penalty factor, and their negative logits multiplied by it, before sampling. Without this mechanism, autoregressive models are prone to degenerate repetition loops, particularly at low temperature settings.
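The CTRL-style adjustment looks like this (the penalty value of 1.2 is a common choice, not a universal default):

```python
def repetition_penalty(logits, generated, penalty=1.2):
    """Shrink already-generated tokens' positive logits (divide) and push
    their negative logits further down (multiply)."""
    out = list(logits)
    for t in set(generated):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

penalized = repetition_penalty([2.0, -1.0, 0.5], generated=[0, 1])
# Tokens 0 and 1 become less likely; token 2 is untouched.
```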
Efficiency and compression
The cost of training and running large models has driven a substantial engineering literature focused on doing more with less. These techniques are not theoretical; they are production-critical and appear in the technical reports of every major model release.
Quantization
The process of representing model weights at reduced numerical precision to shrink memory footprint and accelerate computation. Full precision uses 32-bit floats (FP32); common alternatives include FP16, BF16 (a 16-bit format that matches FP32 dynamic range), INT8 and INT4. Going from FP32 to FP16 halves memory; going to INT8 cuts it roughly fourfold, while increasing inference throughput. BF16 has become the default training and inference precision for most frontier models.
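A minimal symmetric INT8 quantizer illustrates the core trade: store small integers plus one floating-point scale. Real systems usually quantize per channel or per block rather than per tensor, as this sketch does:

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map the largest magnitude to 127."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # guard against all-zero input
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.12, -0.5, 0.31, 0.0]
q, s = quantize_int8(w)
restored = dequantize(q, s)
# Each restored weight lies within one quantization step of the original.
```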
Pruning
The process of zeroing out or removing weights that contribute minimally to model outputs. A pruned model has fewer effective parameters, reducing memory and compute. Structured pruning removes entire components (attention heads, layers); unstructured pruning zeros individual weights. Pruning is harder to apply to large language models without significant quality loss and is less widely deployed than quantization.
Knowledge distillation
A training technique in which a smaller student model learns to mimic the output distributions of a larger teacher model rather than learning directly from ground-truth labels. The teacher’s soft probability distributions carry richer information than hard labels. Distillation is how organizations deploy capable small models without replicating the full training cost of frontier systems.
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning technique that freezes the original model weights and adds small, trainable low-rank matrices to each layer. Rather than updating all parameters, LoRA trains only a fraction of the total, often below 1%, reducing memory requirements for fine-tuning by a factor of three to ten. The LoRA matrices can be merged back into the original weights at inference time with no additional cost. Introduced by Hu et al. in 2021.
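The arithmetic is easy to verify in NumPy. The dimensions and rank below are illustrative, and B is initialized to zero so the adapted layer starts out identical to the frozen one, as in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                         # layer width and LoRA rank (illustrative)
W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                 # trainable; zero init makes the update a no-op

def lora_forward(x):
    """Frozen path plus low-rank correction; B @ A can be merged into W later."""
    return x @ W.T + x @ (B @ A).T

x = rng.normal(size=(3, d))
trainable_fraction = (2 * d * r) / (d * d)   # LoRA parameters vs. the full layer
print(f"{trainable_fraction:.1%}")  # 12.5%
```

At a realistic width (d in the thousands) with a small rank, the same ratio drops well below 1%.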
PEFT (Parameter-Efficient Fine-Tuning)
A category of fine-tuning techniques that modify only a small fraction of a model’s parameters. LoRA is the most widely used; others include prefix tuning, prompt tuning and adapter layers. PEFT makes fine-tuning tractable on consumer hardware by avoiding the memory requirements of full fine-tuning runs.
Mixed precision training
A training technique that uses lower-precision formats (FP16 or BF16) for most computations while maintaining a full-precision master copy of weights for the weight-update step. Lower precision means faster matrix multiplications and less memory; the full-precision copy prevents small gradient updates from being lost to rounding. Mixed precision is standard in all frontier model training.
Flash Attention
A hardware-aware exact attention algorithm developed by Dao et al. in 2022 that computes standard self-attention with significantly less memory and faster wall-clock time. It achieves this by tiling the computation to fit within GPU SRAM rather than reading repeatedly from slower high-bandwidth memory. Flash Attention produces identical outputs to standard attention: it is an implementation optimization, not an approximation. Flash Attention 2 and 3 are now universal in frontier model training and inference.
Speculative decoding
An inference technique that uses a small draft model to generate multiple candidate tokens in parallel, then verifies them in a single forward pass of the larger target model. Where the large model accepts the drafts, multiple tokens are produced for roughly the cost of one. Where it rejects them, the system falls back to standard generation. Speculative decoding achieves two- to three-fold inference speedups with no change to output quality.
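A simplified greedy variant conveys the draft-then-verify structure. Real systems verify all drafts in a single batched forward pass and use a probabilistic acceptance rule; the toy models here are stand-ins that call the target once per token:

```python
def speculative_step(target, draft, tokens, k=4):
    """One step of greedy speculative decoding: the draft proposes k tokens,
    the target keeps the longest prefix matching its own greedy choices."""
    proposal = list(tokens)
    for _ in range(k):
        proposal.append(draft(proposal))
    accepted = list(tokens)
    for tok in proposal[len(tokens):]:
        if target(accepted) == tok:
            accepted.append(tok)               # draft agreed with the target
        else:
            accepted.append(target(accepted))  # reject: take the target's token
            break
    else:
        accepted.append(target(accepted))      # all drafts accepted: one bonus token
    return accepted

# Toy models over a 4-token vocabulary; both count upward, so every draft is accepted.
count_up = lambda toks: (toks[-1] + 1) % 4
print(speculative_step(count_up, count_up, [0], k=3))  # [0, 1, 2, 3, 0]
```

When the draft and target agree, four tokens come out of a step that needed the target only for verification; when they disagree, the output is exactly what the target alone would have produced.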
Scale and compute
The hardware, measurement units and distribution strategies that govern how large models are trained and deployed carry their own vocabulary. These terms appear constantly in research papers and hardware announcements.
Scaling laws
Empirical relationships between model performance and three variables: parameters, training data and compute. Characterized rigorously by Kaplan et al. at OpenAI in 2020 and revised by Hoffmann et al. at DeepMind in 2022 (the Chinchilla paper). Scaling laws predict that loss decreases smoothly and predictably as any of the three variables increases. The Chinchilla result demonstrated that most pre-2022 models were undertrained relative to their size: compute was being spent on more parameters rather than more data.
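The Chinchilla result is often reduced to a rule of thumb of roughly 20 training tokens per parameter at the compute-optimal point. It is an approximation, but a useful sanity check:

```python
def chinchilla_optimal_tokens(n_params):
    """Rough compute-optimal data budget: ~20 tokens per parameter
    (a commonly cited approximation of the Chinchilla result)."""
    return 20 * n_params

# A 70B-parameter model is compute-optimal at roughly 1.4 trillion tokens,
# matching Chinchilla's own 70B-parameter, 1.4T-token configuration.
print(chinchilla_optimal_tokens(70e9) / 1e12, "trillion tokens")
```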
GPU (Graphics Processing Unit)
The hardware that executes AI training and inference. GPUs were designed for the parallel floating-point operations of graphics rendering; that same parallelism makes them ideal for the matrix multiplications that dominate neural network computation. NVIDIA’s H100 and H800 dominated frontier model training through 2024. The GB200 NVL72 rack system, which began large-scale deployment in 2025, was architected for trillion-parameter workloads, including Mixture of Experts models.
TPU (Tensor Processing Unit)
Google’s custom AI accelerator, designed specifically for the matrix multiplications in neural network training. TPUs are faster and more power-efficient than GPUs for the specific operations that dominate Transformer training. Google trains most of its frontier models on TPU pods. Unlike GPUs, TPUs are largely unavailable outside Google Cloud infrastructure.
CUDA
NVIDIA’s parallel computing platform and programming model, released in 2007, that made GPU hardware accessible for general-purpose scientific computation. Before CUDA, using GPU hardware required reformulating problems as graphics operations. CUDA unlocked GPUs for deep learning. Most major deep learning frameworks, including PyTorch and JAX, are built on top of it. NVIDIA’s sustained dominance in AI hardware is inseparable from CUDA’s ecosystem.
FLOP / FLOPs (Floating-Point Operation)
A single arithmetic computation on a floating-point number. FLOPs count the total operations in a forward or training pass, serving as the standard unit for comparing computational cost across models independent of hardware. Training GPT-3 required roughly 3.1×10²³ FLOPs. Frontier models in 2024 and 2025 require between 10²⁴ and 10²⁵. FLOPs measure computational demand; FLOP/s measures how fast a piece of hardware delivers it.
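A standard back-of-envelope rule connects these quantities: training cost is roughly 6 FLOPs per parameter per token (2 in the forward pass, 4 in the backward). Applying it to GPT-3’s published figures reproduces the number above:

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute: ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

# GPT-3: ~175B parameters trained on ~300B tokens.
print(f"{training_flops(175e9, 300e9):.2e}")  # 3.15e+23
```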
Model parallelism
A distributed training strategy that splits the model’s layers or components across multiple GPUs when the full model does not fit on a single device. Tensor parallelism splits individual weight matrices across devices; pipeline parallelism assigns groups of layers to different devices. Training frontier models requires combining multiple parallelism strategies simultaneously.
Data parallelism
A distributed training strategy that replicates the entire model on multiple GPUs, each processing a different subset of the training batch, with gradients averaged across devices before each weight update. The simplest form of distributed training. When the model does not fit on one device, data parallelism alone is insufficient.
Expert parallelism
A distributed training and inference strategy specific to Mixture of Experts models. Expert weights are distributed across GPUs rather than replicated: each device hosts a subset of experts. When a token is routed to an expert on a different device, its activations are transmitted there. NVIDIA’s Dynamo inference framework treats expert parallelism as a first-class workload on current-generation hardware.
Matrix multiplication
The fundamental mathematical operation of deep learning. Linear layers, attention mechanisms and feed-forward networks are, at their mathematical core, sequences of matrix multiplications. GPUs and TPUs are engineered specifically for this operation. The cost of a forward pass is largely determined by the size and number of matrix multiplications it requires.
Alignment and training methods
Turning a pre-trained base model into a system that reliably follows instructions, declines harmful requests and communicates honestly requires a distinct set of training techniques. These are not capability methods; they are behavioral ones.
RLHF (Reinforcement Learning from Human Feedback)
A fine-tuning technique that incorporates human preference signals into model training. Human raters compare pairs of model outputs and indicate which is better. A reward model learns to predict those preferences. The language model is then trained via reinforcement learning to maximize the reward. RLHF is responsible for the transformation from capable but raw base models to instruction-following assistants. It is used by OpenAI, Anthropic, Google and most major labs.
DPO (Direct Preference Optimization)
A fine-tuning algorithm introduced by Rafailov et al. in 2023 that achieves RLHF-equivalent results without a separate reward model or a reinforcement learning loop. DPO optimizes the language model directly on preference pairs (chosen versus rejected responses) using a classification-style objective. It is simpler, more stable and more memory-efficient than full RLHF. DPO and its variants have largely displaced full RLHF pipelines at many organizations.
Alignment
The problem of ensuring that AI systems behave in accordance with human values and intentions. In practice, alignment work covers instruction-following without deception, honesty about uncertainty, refusal of harmful requests and avoidance of unintended goal pursuit. RLHF, DPO and Constitutional AI are alignment techniques. The distinction between capability research and alignment research is one of the organizing tensions in the field.
Constitutional AI (CAI)
An alignment technique developed by Anthropic in which a model critiques and revises its own outputs according to a written set of principles (a “constitution”) rather than relying solely on human feedback labels. The model generates a response, identifies ways it violates the constitution and revises it. This reduces the need for human labeling on harmful outputs. Constitutional AI is the primary alignment method behind Claude.
Red-teaming
The practice of deliberately attempting to cause a model to produce harmful, dishonest or undesirable outputs, in order to identify failure modes before deployment. Red-teamers probe for jailbreaks, prompt injections, dangerous knowledge leakage and systematic bias. Internal red-teaming is standard in frontier model development; external programs invite the broader research community to participate.
Guardrails
Constraints applied to model inputs or outputs (either within training or as post-hoc filters) to prevent harmful output. Input guardrails block dangerous prompts before they reach the model; output guardrails filter or flag model responses. In deployed systems, guardrails are the operational layer of alignment: the gap between what a model is capable of producing and what it is permitted to produce in production.


