Teams obsess over prompt wording, then wonder why their LLM feature is flaky. The uncomfortable truth: for production systems, what you put in the context window matters far more than how you phrase the instruction. That discipline has a name now — context engineering — and it's the skill that separates demos from dependable products.
What context engineering actually is
Prompt engineering is about phrasing one instruction well. Context engineering is the broader job of deciding everything the model sees on each call — and assembling it reliably, every time, within a fixed budget.
A request's context is usually a mix of:
- System prompt — role, rules, output contract
- Few-shot examples — demonstrations of the desired behaviour
- Retrieved knowledge — RAG chunks, docs, records
- Tool definitions & results — what the model can call, and what came back
- Conversation history — prior turns
- Task state — scratchpad, plan, intermediate results
Your job is to get the right subset of that into the window, in the right order, on every call.
The context budget
The window is finite and every token competes. Two failure modes bracket the problem:
- Too little → the model lacks what it needs and hallucinates or guesses.
- Too much → "context rot": relevant signal gets buried, the model is distracted by noise, latency and cost climb, and accuracy drops even though you added more.
More context is not better. Relevant, well-ordered context is better. Treat the window like a performance budget, not a junk drawer.
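To make the budget idea concrete, here is a minimal sketch of greedy packing: keep the highest-scoring retrieved chunks that still fit under a token cap. The names `Chunk`, `estimate_tokens`, and `pack_within_budget` are hypothetical, and the four-characters-per-token estimate is only a rough assumption; a real system would count with the model's own tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # relevance score from your retriever (higher is better)


def estimate_tokens(text: str) -> int:
    # Rough assumption: about 4 characters per token.
    # Replace with the model's real tokenizer for accurate counts.
    return max(1, len(text) // 4)


def pack_within_budget(chunks: list[Chunk], budget: int) -> list[Chunk]:
    """Greedily keep the highest-scoring chunks that fit in the token budget."""
    selected: list[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = estimate_tokens(chunk.text)
        if used + cost <= budget:
            selected.append(chunk)
            used += cost
    return selected


candidates = [
    Chunk("a" * 40, 0.9),   # ~10 tokens
    Chunk("b" * 400, 0.8),  # ~100 tokens, too large for the budget below
    Chunk("c" * 40, 0.5),   # ~10 tokens
]
kept = pack_within_budget(candidates, budget=25)
```

Greedy packing is not optimal in general, but it is simple to reason about and makes the trade-off explicit: a large, moderately relevant chunk can lose its place to smaller ones that fit.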
Techniques that move the needle
1. Retrieve, don't dump. Don't paste a whole document — retrieve the few chunks that answer the query (RAG). Quality of retrieval caps quality of output.
2. Compress history. Long conversations blow the budget. Summarise older turns, keep recent ones verbatim, and externalise durable facts to a store you re-inject on demand.
3. Order for salience. Models tend to use information at the start and end of a long context more reliably than material in the middle. Put the stable, important material (system rules, key facts) up front; put the immediate task last.
4. Structure it. Delimit sections clearly (headers, tags, JSON). A model parses a prompt split into "### Retrieved context" and "### Task" sections far more reliably than a wall of text.
5. Make static content cacheable. Put large, unchanging context (system prompt, long instructions) first so that providers supporting prompt caching can reuse it, cutting latency and cost on every request that hits the cache.
6. Isolate per agent. In multi-agent systems, give each agent only the context and tools it needs. Small, scoped context = better tool selection and fewer distractions.
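Several of these techniques can be combined in one small assembly step. The sketch below compresses older history into a summary, then builds the context with the static system prompt first, clearly delimited sections, and the task last. `compress_history` and `build_context` are hypothetical helpers, and `summarise` stands in for whatever summariser you use (often another model call); none of this is a specific library's API.

```python
from typing import Callable


def compress_history(
    turns: list[str],
    keep_recent: int,
    summarise: Callable[[list[str]], str],
) -> list[str]:
    """Keep the last `keep_recent` turns verbatim; fold older ones into one summary."""
    if len(turns) <= keep_recent:
        return list(turns)
    split = len(turns) - keep_recent
    older, recent = turns[:split], turns[split:]
    return [f"Summary of earlier turns: {summarise(older)}"] + recent


def build_context(
    system_prompt: str,
    chunks: list[str],
    history: list[str],
    task: str,
) -> str:
    """Assemble delimited sections: static prefix first, immediate task last."""
    sections = [
        ("System rules", system_prompt),             # stable, cacheable prefix
        ("Retrieved context", "\n\n".join(chunks)),  # changes per query
        ("Conversation history", "\n".join(history)),
        ("Task", task),                              # immediate task goes last
    ]
    # Skip empty sections so the model never sees a header with no body.
    return "\n\n".join(f"### {title}\n{body}" for title, body in sections if body)


turns = ["u1", "a1", "u2", "a2"]
history = compress_history(turns, keep_recent=2, summarise=lambda older: f"{len(older)} turns")
context = build_context("Be brief.", ["chunk A"], history, "Answer the question.")
```

Keeping the system prompt byte-for-byte identical across calls is what lets a provider-side prompt cache reuse it; anything that varies per request belongs after that prefix.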
Failure modes to watch for
- Context rot — accuracy degrades as irrelevant tokens accumulate. Trim aggressively.
- Lost in the middle — facts buried mid-context get ignored. Reposition the important ones.
- Conflicting sources — retrieved chunks disagree; surface the conflict and keep provenance instead of letting the model silently pick.
- Stale state — history that no longer reflects reality. Prune or refresh it.
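Surfacing conflicts with provenance can be sketched as follows. This assumes you have already extracted simple key/value facts from the retrieved chunks, which is the hard part and is not shown; the `Fact` type, `find_conflicts`, and `render_conflicts` are hypothetical names for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    key: str     # what the fact is about, e.g. "refund_window_days"
    value: str   # the claimed value
    source: str  # provenance: document id, URL, or record id


def find_conflicts(facts: list[Fact]) -> dict[str, list[Fact]]:
    """Return only the keys for which sources disagree on the value."""
    by_key: dict[str, list[Fact]] = defaultdict(list)
    for fact in facts:
        by_key[fact.key].append(fact)
    return {
        key: group
        for key, group in by_key.items()
        if len({f.value for f in group}) > 1
    }


def render_conflicts(conflicts: dict[str, list[Fact]]) -> str:
    """Format conflicts so the model (or a human) sees every value with its source."""
    lines = []
    for key, group in conflicts.items():
        options = "; ".join(f"{f.value!r} (source: {f.source})" for f in group)
        lines.append(f"Conflict on {key}: {options}")
    return "\n".join(lines)


facts = [
    Fact("refund_window_days", "30", "policy_v2.md"),
    Fact("refund_window_days", "14", "faq.md"),
    Fact("currency", "EUR", "policy_v2.md"),
]
conflicts = find_conflicts(facts)
report = render_conflicts(conflicts)
```

Injecting the rendered report into the context, rather than the raw disagreeing chunks alone, gives the model an explicit instruction surface: it can be told to prefer a source, ask the user, or state the disagreement.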
A practical checklist
Before shipping an LLM feature, ask:
- Is every token in this context earning its place?
- Is retrieval returning the right chunks (measure recall on a labelled set of queries)?
- Is the static prefix first, so it's cacheable?
- Are sections clearly delimited?
- What happens when history grows 10×? Do I summarise/prune?
- For agents: does each one see only what it needs?
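The retrieval question on this checklist is the easiest to turn into a number. One common way is recall@k: of the chunks a human marked as relevant for a query, what fraction appear in the top k results? The function below is a minimal sketch; the document ids and the labelled relevant set are made-up examples.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant ids that appear in the top-k retrieved results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


# Hypothetical ranking from a retriever, and the labelled relevant documents.
ranking = ["d3", "d1", "d9", "d2"]
relevant = {"d1", "d2"}

recall_top3 = recall_at_k(ranking, relevant, k=3)  # only d1 is in the top 3
recall_top4 = recall_at_k(ranking, relevant, k=4)  # both d1 and d2 are found
```

Averaging this over a labelled query set gives a single figure to track when you change chunking, embeddings, or k.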
Wrap-up
Prompt phrasing is the last 10%. The reliability of an LLM app is decided by context engineering — retrieval quality, budget discipline, ordering, structure, and state management. Get the context right and mediocre prompts work fine; get it wrong and no amount of clever wording saves you.
Related reading
- RAG Systems Explained — the retrieval layer that feeds good context.
- Prompt Caching: Cut Your LLM Costs by 80% — why a cacheable static prefix matters.
- Building Enterprise AI Agents — per-agent context isolation in practice.