    June 1, 2026

    LLM Inference Explained

    How large language models generate responses — from tokenisation to transformer attention — and what this means for building production AI systems.


    What you'll learn: By the end of this guide you will understand how tokens are generated, why the KV cache matters, how to tune temperature and sampling parameters for production, and how to reduce inference cost and latency without changing models.

    When you send a message to ChatGPT or Claude, what actually happens? Understanding the inference process — the computation that generates a response — makes you a better AI system designer. It explains why latency behaves the way it does, why longer prompts cost more, and why you can't just "make the model faster" without trade-offs.

    What Inference Is

    Training is when a model learns from data — a compute-intensive, one-time process that produces the model's weights. Inference is what happens at runtime: given an input, use those weights to produce an output. You use inference every time you call an LLM API.

    LLM inference is autoregressive: the model generates one token at a time, and each new token is conditioned on all previous tokens. This is fundamentally different from, say, running an image classifier that produces a single output in one forward pass.
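    The autoregressive loop can be sketched in a few lines. This is a minimal illustration, not a real model: toy_model, the 5-token vocabulary, and the EOS id are all assumptions made up for the example, and the "forward pass" is a stand-in that simply favours the next integer.

```python
EOS = 0  # hypothetical end-of-sequence token id

def toy_model(tokens: list[int]) -> list[float]:
    """Stand-in for a forward pass: returns logits over a 5-token vocabulary.
    It favours (last token + 1), which wraps around to EOS after token 4."""
    vocab_size = 5
    target = (tokens[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]

def generate(prompt: list[int], max_tokens: int) -> list[int]:
    """Generate one token at a time, each conditioned on everything so far."""
    tokens = list(prompt)
    generated = []
    for _ in range(max_tokens):
        logits = toy_model(tokens)
        # Greedy choice: take the highest-scoring token.
        next_token = max(range(len(logits)), key=lambda i: logits[i])
        if next_token == EOS:
            break
        tokens.append(next_token)   # the new token becomes part of the input
        generated.append(next_token)
    return generated

print(generate([1], max_tokens=10))
```

    The loop stops either when the stand-in model emits EOS or when max_tokens is reached, which mirrors the two stopping conditions of a real LLM call.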

    The Inference Pipeline

    flowchart LR
        IN([Input Text]) --> TOK[Tokeniser\nBPE encoding]
        TOK --> EMB[Embedding\nLayer]
        EMB --> ATT[Transformer\nAttention × N layers]
        ATT --> LOGIT[Logit Head\nvocab distribution]
        LOGIT --> SAMPLE[Sampling\ntemp · top-p · top-k]
        SAMPLE --> TOKEN([Next Token])
        TOKEN -->|append & repeat| ATT
        style IN fill:#0d2d3a,stroke:#00e5ff,color:#fff
        style ATT fill:#1a1a3a,stroke:#9b59b6,color:#fff
        style SAMPLE fill:#0d2d3a,stroke:#00c4ff,color:#fff
        style TOKEN fill:#1a2d0d,stroke:#2ecc71,color:#fff

    Figure 1: Autoregressive token generation — each token is fed back in as input

    Step 1: Tokenisation

    Your text input is not fed to the model as characters or words — it's split into tokens, which are byte-pair encoded chunks of text. Common words are often single tokens; rare words may be split into multiple tokens.

    "Hello, world!" → ["Hello", ",", " world", "!"]  (4 tokens)
    "unbelievable" → ["un", "believ", "able"]         (3 tokens)
    "GPT-4" → ["G", "PT", "-", "4"]                  (4 tokens)
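    To make the idea concrete, here is a minimal sketch of subword splitting. It uses greedy longest-match over a tiny made-up vocabulary (TOY_VOCAB is an assumption for illustration), which is simpler than the merge-based procedure real BPE tokenisers use. It does show the key property: the tokens always concatenate back to the original text. Exact splits depend on the tokeniser, so treat the examples above as illustrative too.

```python
# Toy vocabulary invented for this example; real vocabularies are far larger.
TOY_VOCAB = {"Hello", ",", " world", "!", "un", "believ", "able"}

def toy_tokenise(text: str) -> list[str]:
    """Greedy longest-match split; falls back to single characters."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocabulary hit.
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing matched: emit one character as its own token.
            tokens.append(text[i])
            i += 1
    return tokens

print(toy_tokenise("Hello, world!"))
print(toy_tokenise("unbelievable"))
```

    Text that is not covered by the vocabulary (here, "GPT-4") falls apart into single characters, which is the toy version of why rare strings cost more tokens.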

    Why this matters for engineers:

    • Cost is almost always measured in tokens, not characters or words
    • Context window limits are in tokens (e.g., 128K tokens for GPT-4o)
    • Non-English text often uses more tokens per word than English
    • Code and prose tokenise at different densities: symbols, identifiers, and indentation can each split into several tokens, so measure rather than assume

    GPT-4o's tokeniser (o200k_base) averages roughly 0.75 words per token, or about 4 characters per token, for English text.
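    When an exact count is not needed, the 4-characters-per-token rule of thumb is enough for a quick budget check. The sketch below is an assumption-laden estimate, not a tokeniser: the function names, the 128,000-token default, and the output reserve are illustrative, and the heuristic only holds for English prose.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb for English text; not exact

def estimate_tokens(text: str) -> int:
    """Rough token estimate. Use the provider's tokeniser for billing-grade counts."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def fits_context(text: str, context_limit: int = 128_000,
                 reserve_for_output: int = 4_000) -> bool:
    """Check whether a prompt leaves room for the response in the context window."""
    return estimate_tokens(text) + reserve_for_output <= context_limit

prompt = "Summarise the attached incident report in three bullet points."
print(estimate_tokens(prompt), fits_context(prompt))
```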

    Step 2: Embedding Generation

    Each token is converted into a high-dimensional vector (an embedding) that represents the token's meaning in a continuous vector space. GPT-3 (175B), for example, uses 12,288-dimensional embeddings; OpenAI has not published the equivalent figure for GPT-4.

    These initial embeddings are context-independent: the same token always maps to the same vector. They are only the starting point. The transformer layers refine them based on context.

    Step 3: Transformer Attention Layers

    This is the computational core. Each transformer layer has two sub-components:

    Multi-head self-attention allows every token to "look at" every other token in the sequence and decide how much to attend to each. This is how "it" in "The cat sat on the mat because it was tired" gets associated with "cat" rather than "mat."

    The attention mechanism computes:

    Attention(Q, K, V) = softmax(QK^T / √d_k) × V

    Where Q (queries), K (keys), and V (values) are linear projections of the token embeddings. This computation is O(n²) in sequence length — a major reason why inference gets expensive with long contexts.
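    The formula above can be written out directly. This is a minimal single-head sketch in plain Python, assuming Q, K, and V are already-projected lists of vectors. The causal option (each token attends only to itself and earlier tokens) is not part of the formula as written, but it is how decoder-style LLMs apply it during generation.

```python
import math

def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal: bool = True):
    """Scaled dot-product attention over n token vectors (single head)."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        # With causal masking, token i only sees positions 0..i.
        limit = i + 1 if causal else len(K)
        # One score per visible key: n queries x n keys is the O(n^2) cost.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in K[:limit]]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

x = [[1.0, 0.0], [0.0, 1.0]]
print(attention(x, x, x))
```

    The nested loop over queries and keys is where the quadratic cost in sequence length comes from.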

    Feed-forward network: after attention, each token's representation passes through a position-wise feed-forward network that adds non-linearity and capacity.

    GPT-3 (175B) stacks 96 of these transformer layers; GPT-4's layer count is not public. Each layer refines the token representations, building increasingly abstract understanding of the input.

    Step 4: Token Prediction

    After the final transformer layer, the model produces a probability distribution over its entire vocabulary (typically ~50,000–200,000 tokens) for the next token. The model has not decided what to say; it has computed "given everything so far, here's how likely each possible next token is."

    Sampling parameters control what happens next:

    • Temperature (0.0–2.0): scales the logits before they are turned into probabilities. Temperature 0 = always pick the most likely token (effectively deterministic, but prone to repetition). Temperature 1 = sample from the distribution as-is. Temperature > 1 = flatten the distribution (more varied, less coherent); values below 1 sharpen it.

    • Top-P (nucleus sampling): sample only from the smallest set of tokens whose cumulative probability exceeds P. Top-P=0.9 means consider only the tokens that together account for 90% of probability mass.

    • Top-K: only consider the K most likely tokens. Top-K=50 is common for creative tasks.

    For production classification tasks, use temperature 0 for determinism. For creative generation, use a temperature of 0.7–1.0.
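    The three sampling parameters compose into one short function. This is a sketch of one common way to combine them (temperature first, then top-k, then top-p); real inference stacks differ in ordering and edge-case handling, and the function name and signature here are assumptions.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Return the index of the sampled token."""
    rng = rng or random.Random()
    if temperature == 0:
        # Greedy decoding: always the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature divides the logits before the softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank tokens from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        ranked = kept
    # random.choices renormalises the surviving weights.
    return rng.choices(ranked, weights=[probs[i] for i in ranked], k=1)[0]

print(sample_next_token([2.0, 1.0, 0.1], temperature=0.8, top_p=0.9))
```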

    Step 5: Autoregressive Generation

    The selected token is appended to the sequence, and the forward pass repeats with one more token in the context. (With the KV cache described below, only the new token has to be computed at each step.) This continues until the model generates an end-of-sequence token or hits the max_tokens limit.

    This is why:

    • First-token latency (time-to-first-token, TTFT) is usually much higher than the time between subsequent tokens, because the model processes your entire prompt before generating anything
    • Tokens per second (TPS) is the steady-state generation rate
    • Total latency = TTFT + (output_tokens ÷ TPS)
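    The latency formula is easy to turn into a planning helper. The numbers in the example call are hypothetical, chosen only to show the arithmetic.

```python
def total_latency_s(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Total latency = TTFT + (output_tokens / TPS)."""
    return ttft_s + output_tokens / tokens_per_s

# Hypothetical endpoint: 0.5 s to first token, 50 tokens/s, 500-token answer.
print(total_latency_s(0.5, 500, 50))  # 10.5
```

    Because the second term usually dominates, capping output length is often a more effective latency lever than shaving the prompt.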

    The KV Cache

    The Key and Value tensors that attention computes for each token depend only on that token and the ones before it, so they don't need to be recomputed every time a new token is generated. The KV cache stores these tensors in GPU memory, for the prompt and for each generated token, and reuses them. Without the KV cache, generating a 1,000-token response would mean re-running the entire, growing context through the model 1,000 times, which is completely impractical.
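    A rough way to see the saving is to count how many token positions pass through the model over a whole generation. This is a simplified counting model (it ignores that attention itself still reads the cached keys and values), and the function name is an assumption.

```python
def positions_processed(prompt_tokens: int, output_tokens: int,
                        use_kv_cache: bool) -> int:
    """Count token positions run through the model across one generation."""
    if output_tokens <= 0:
        return 0
    if use_kv_cache:
        # The prompt is processed once, then one new position per later step.
        return prompt_tokens + (output_tokens - 1)
    # Without a cache, step i re-processes the prompt plus the i tokens so far.
    return sum(prompt_tokens + i for i in range(output_tokens))

print(positions_processed(2_000, 1_000, use_kv_cache=False))  # 2499500
print(positions_processed(2_000, 1_000, use_kv_cache=True))   # 2999
```

    For a 2,000-token prompt and a 1,000-token response, that is roughly an 800-fold difference in positions processed.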

    The KV cache is why:

    • GPU memory is the primary bottleneck for running large models
    • Longer contexts require proportionally more GPU memory
    • Prompt caching (caching KV tensors across API calls) can dramatically reduce latency for repeated prompts
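    The memory cost of the cache follows from its shape: one key and one value vector per token, per layer, per KV head. The sketch below assumes 16-bit values and an illustrative configuration (80 layers, 8 KV heads, head dimension 128) that is roughly the shape of a 70B-class model with grouped-query attention; substitute your model's real figures.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2,
                   batch_size: int = 1) -> int:
    """Size of the KV cache: 2 tensors (K and V) per layer, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size

# Illustrative configuration, not an official spec.
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=8_192)
print(size / 2**30, "GiB")  # 2.5 GiB
```

    The result scales linearly with both sequence length and batch size, which is why long contexts and high concurrency compete for the same GPU memory.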

    Quantisation: Running Models on Less Hardware

    Full-precision LLMs store each weight as a 32-bit float. A 70B parameter model would require 280GB of GPU memory — more than most hardware setups can provide.

    Quantisation reduces precision to 8-bit (INT8), 4-bit (INT4), or even fewer bits per weight:

    Precision    Memory (70B model)    Quality loss
    FP32         ~280 GB               None
    FP16         ~140 GB               Negligible
    INT8         ~70 GB                Very small
    INT4         ~35 GB                Small for most tasks
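    The memory column is straightforward arithmetic: parameters times bits per weight, divided by eight. A minimal sketch, counting weights only (the KV cache, activations, and any quantisation metadata come on top):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(name, weight_memory_gb(70, bits), "GB")
```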

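    For enterprise deployment, 4-bit quantised models often retain most of full-precision quality (figures of 95%+ are commonly cited, though this varies by model and task) at a quarter of the FP16 memory footprint, or an eighth of FP32. This makes running a 70B model such as Llama 3 70B feasible on a single high-end GPU server.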
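    Where the quality loss comes from is easiest to see in the simplest scheme, symmetric absmax quantisation to INT8: every weight is rounded to one of 255 evenly spaced levels. This is a toy sketch over a flat list of weights; production methods typically quantise per channel or per group and are considerably more sophisticated.

```python
def quantise_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantise(q: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)
print(q)
print(max(abs(a - b) for a, b in zip(weights, restored)))
```

    The round-trip error per weight is at most half the scale, so a single outlier weight that inflates the scale degrades precision for every other weight that shares it.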

    Prompt Caching

    Most enterprise AI use cases involve a large system prompt that doesn't change between requests (company context, instructions, knowledge base snippets). Re-processing this prompt for every request wastes compute.

    Prompt caching keeps the KV cache for your system prompt in memory, so only the user's message needs to be processed fresh. Anthropic reports that its prompt caching can reduce latency by up to 85% and cost by up to 90% for cache hits on long system prompts.

    Design your prompts with the static content first (system context, instructions) and the dynamic content last (user query). This maximises cache hit rate.
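    Because prefix caches match from the first token onward, ordering decides how much can be reused. The sketch below compares the shared prefix of two requests under both layouts; the token lists and helper names are hypothetical, and real providers add their own rules (minimum cacheable length, expiry) on top.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading tokens two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Hypothetical static system prompt, already split into tokens.
SYSTEM = ["You", " are", " a", " support", " agent", "."]

def static_first(user: list[str]) -> list[str]:
    return SYSTEM + user

def dynamic_first(user: list[str]) -> list[str]:
    return user + SYSTEM

q1, q2 = ["Reset", " password"], ["Cancel", " order"]
print(shared_prefix_len(static_first(q1), static_first(q2)))    # 6
print(shared_prefix_len(dynamic_first(q1), dynamic_first(q2)))  # 0
```

    With the static content first, the whole system prompt is a reusable prefix; with the user query first, nothing is.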

    What This Means for Your System Design

    Budget your context window. Each token in your prompt is compute. A bloated 10,000-token system prompt when 2,000 tokens would do costs five times as much in input tokens on every request.
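    Putting numbers on this takes one multiplication. The price below is a hypothetical placeholder, not a quote from any provider's price list, and the calculation ignores output tokens and caching discounts.

```python
def monthly_input_cost(prompt_tokens: int, requests_per_day: int,
                       price_per_million_tokens: float, days: int = 30) -> float:
    """Input-token spend per month for a fixed-size prompt."""
    total_tokens = prompt_tokens * requests_per_day * days
    return total_tokens * price_per_million_tokens / 1_000_000

PRICE = 2.50  # hypothetical USD per million input tokens

print(monthly_input_cost(10_000, 10_000, PRICE))  # 7500.0
print(monthly_input_cost(2_000, 10_000, PRICE))   # 1500.0
```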

    Prefer smaller models when they're sufficient. GPT-4o-mini has been priced at well over an order of magnitude less per token than GPT-4o (roughly 15–30x, depending on the price list in effect) and handles the majority of classification, extraction, and summarisation tasks. Use the big model for complex reasoning; use the small model everywhere else.
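    A router can start as a plain heuristic before graduating to a learned classifier. Everything in this sketch is an assumption for illustration: the placeholder model names, the keyword list, and the word-count threshold would all need tuning against your own traffic.

```python
SMALL_MODEL = "small-model"  # placeholder names, not real model ids
LARGE_MODEL = "large-model"

# Hypothetical signals that a query needs multi-step reasoning.
COMPLEX_HINTS = ("why", "explain", "compare", "design", "trade-off", "prove")

def choose_model(query: str, max_simple_words: int = 40) -> str:
    """Route long or reasoning-heavy queries to the large model."""
    words = [w.strip(".,?!:;\"'") for w in query.lower().split()]
    looks_complex = (len(words) > max_simple_words
                     or any(w in COMPLEX_HINTS for w in words))
    return LARGE_MODEL if looks_complex else SMALL_MODEL

print(choose_model("Classify this ticket as billing or technical"))
print(choose_model("Explain the trade-off between latency and cost"))
```

    Logging which route each request took, alongside a quality signal, is what lets you tighten or loosen the rules later.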

    Batch similar requests. Many inference providers offer batch processing, where you submit many inputs as one asynchronous job. For offline tasks (document processing, bulk classification), batch pricing is often substantially lower per unit (OpenAI and Anthropic have offered around 50% off) in exchange for delayed results.

    Cache aggressively. At temperature 0, LLM responses to identical inputs are effectively deterministic (hosted APIs can still show small run-to-run differences), so cache them. For temperature > 0, cache at the semantic level using a vector similarity check before hitting the API.
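    The exact-match case needs nothing more than a hash of the request. This is an in-memory sketch: call_llm is a hypothetical callable standing in for your API client, and a production version would use a shared store with expiry rather than a dict.

```python
import hashlib
import json

class ResponseCache:
    """Exact-match cache keyed on model, prompt, and temperature."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(model: str, prompt: str, temperature: float) -> str:
        payload = json.dumps(
            {"model": model, "prompt": prompt, "temperature": temperature},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, prompt, temperature, call_llm):
        """Return a cached response, calling the model only on a miss."""
        key = self._key(model, prompt, temperature)
        if key not in self._store:
            self._store[key] = call_llm(model, prompt, temperature)
        return self._store[key]

# Usage with a stand-in for the real API call.
cache = ResponseCache()
fake_llm = lambda model, prompt, temperature: "answer to " + prompt
print(cache.get_or_call("small-model", "hi", 0, fake_llm))
```

    Including the model and temperature in the key prevents a response generated under one configuration from being served for another.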

    Stream responses. For user-facing AI, stream tokens as they're generated rather than waiting for the full response. This improves perceived responsiveness dramatically — the user starts reading while the model is still generating.
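    On the consuming side, streaming is just iterating over chunks and rendering each one as it arrives. The sketch uses a fake generator in place of a real streaming API response; fake_token_stream and its delay are assumptions for illustration.

```python
import time
from typing import Iterable, Iterator

def fake_token_stream(tokens: list[str], delay_s: float = 0.0) -> Iterator[str]:
    """Stand-in for a streaming API response: yields tokens one at a time."""
    for token in tokens:
        time.sleep(delay_s)  # simulates per-token generation time
        yield token

def render_stream(stream: Iterable[str]) -> str:
    """Show each token as soon as it arrives and return the full text."""
    parts = []
    for token in stream:
        print(token, end="", flush=True)  # flush so the user sees it immediately
        parts.append(token)
    print()
    return "".join(parts)

text = render_stream(fake_token_stream(["Stream", "ing", " works", "."], delay_s=0.05))
```

    The total time is unchanged; what improves is that the first words appear after one token's delay instead of after the whole response.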

    Understanding inference mechanics transforms you from a prompt writer to an AI systems engineer. Every architectural decision — model choice, context length, caching strategy, batching — flows from understanding what's actually happening in those transformer layers.


    Key Takeaways

    • LLM inference is autoregressive — one token at a time. This fundamental property explains every latency and cost characteristic
    • Tokenisation is your cost unit. Non-English text and code have different token densities — always measure, never assume
    • Temperature 0 = effectively deterministic, reproducible outputs. Use it for classification and extraction. Use 0.7–1.0 for creative generation
    • The KV cache is why inference is fast after the first token. Long contexts consume proportionally more GPU memory
    • Prompt caching (Anthropic/OpenAI) stores the KV cache of repeated prefixes, cutting the cost of cached input tokens by up to 90% for high-volume endpoints with static system prompts
    • Model routing: route simple tasks to small models (GPT-4o-mini, Claude Haiku) and complex reasoning to large models. Savings depend on your traffic mix and can reach 70% or more when most requests are simple
    • Always stream responses for user-facing AI. Perceived latency matters as much as actual latency

    Practice Exercises

    Exercise 1 — Starter (20 min): Use the OpenAI Tokeniser tool to count tokens in 5 different system prompts of varying complexity. Calculate the monthly cost at 10,000 requests/day at GPT-4o pricing. Then measure how much prompt caching would save.

    Exercise 2 — Intermediate (1 hour): Build a simple model router. Write a classifier (using GPT-4o-mini itself) that categorises incoming questions as "simple" or "complex". Route simple questions to GPT-4o-mini and complex ones to GPT-4o. Measure the cost reduction on 100 sample queries.

    Exercise 3 — Advanced (2 hours): Implement a semantic response cache. Before every LLM call, embed the user query and check a Redis cache using cosine similarity (threshold 0.95). On a cache miss, store the response. Run 500 queries through a typical support bot and measure cache hit rate and cost reduction.
