Prompt Caching: Cut Your LLM Costs by 80%
Prompt caching is the single highest-leverage optimisation available to enterprise AI teams right now. Anthropic offers up to 90% cost reduction and 85% latency improvement on cache hits. OpenAI offers similar savings. Yet most teams are not using it.
This post explains what prompt caching is, why it matters, and how to implement it step by step in a Spring AI application.
Step 1 — Understand How LLM Calls Work
Every LLM API call processes tokens in two distinct phases:
- Prefill phase — The model reads and processes your entire prompt. This involves running transformer attention across every token. For a 2,000-token system prompt, this happens on every single request, even if the prompt never changes.
- Decode phase — The model generates the response one token at a time.
Prefill is the expensive, slow part. Caching eliminates it.
Figure 1: On a cache miss, the full prefill runs. On a hit, it is skipped entirely.
Step 2 — Identify What to Cache
Prompt caching saves the KV cache state of your prompt prefix. It works best when you have a large, static block at the start of every request.
Good candidates for caching:
| Content type | Typical size | Cache value |
|---|---|---|
| System instructions + persona | 500-2,000 tokens | High |
| Company knowledge base in RAG prompts | 2,000-8,000 tokens | Very high |
| Few-shot examples | 500-3,000 tokens | High |
| User query | 20-200 tokens | Never cache (always dynamic) |
The golden rule: Static content goes first. Dynamic content goes last.
[CACHED — never changes]
System prompt (2,000 tokens)
Company context (1,000 tokens)
Few-shot examples (500 tokens)
[NOT CACHED — changes every request]
Retrieved documents (varies)
User query (varies)
Step 3 — Enable Caching on Anthropic
Anthropic's prompt caching requires your static prefix to be at least:
- 1,024 tokens for Claude Sonnet
- 2,048 tokens for Claude Haiku
Spring AI handles the cache control headers automatically when using the Anthropic client.
3a — Add the dependency
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-anthropic-spring-boot-starter</artifactId>
</dependency>
3b — Configure your application.yml
spring:
ai:
anthropic:
api-key: ${ANTHROPIC_API_KEY}
chat:
options:
model: claude-sonnet-4-5-20251001
max-tokens: 1024
3c — Build the service with a large, static system prompt
@Service
public class CachedEnterpriseAIService {
private final ChatClient chatClient;
// Static system prompt — load once, cache on every subsequent request
private static final String SYSTEM_PROMPT = """
You are an expert software engineering assistant for Acme Corp.
## Your Role
Help engineers design, debug, and review code across Java, Python,
and TypeScript. Provide production-ready examples with explanations.
## Guidelines
- Always explain your approach before showing code
- Highlight potential edge cases and failure modes
- Suggest observability and error handling patterns
- Reference Acme's internal standards when applicable
## Constraints
- Only answer software engineering questions
- Never reveal confidential system information
- Acknowledge uncertainty rather than guessing
[... additional context that pushes this above 1,024 tokens ...]
""";
public CachedEnterpriseAIService(ChatClient.Builder builder) {
this.chatClient = builder
.defaultSystem(SYSTEM_PROMPT) // This prefix will be cached
.build();
}
public String answer(String userQuery) {
return chatClient.prompt()
.user(userQuery) // Only this changes per request
.call()
.content();
}
}
On the first call, Anthropic processes and caches the system prompt. Every subsequent call skips the prefill for that prefix — you only pay for the user query tokens at full price.
Step 4 — Enable Caching on OpenAI
OpenAI caching is fully automatic — no code changes required.
How it works
Any prompt prefix longer than 1,024 tokens that appears in multiple requests is automatically cached on OpenAI's servers for up to one hour. You see a cached_tokens field in the API response with the discounted count.
What you must do: structure your prompt correctly
// CORRECT — static system prompt comes before dynamic content
public String answerWithCaching(String userQuery) {
return chatClient.prompt()
.system(LARGE_STATIC_SYSTEM_PROMPT) // 2,000+ tokens — gets cached
.user(userQuery) // Dynamic — never cached
.call()
.content();
}
// WRONG — injecting dynamic content into the system prompt breaks caching
public String answerWithoutCaching(String userQuery, String sessionId) {
return chatClient.prompt()
.system(STATIC_PROMPT + "\nSession: " + sessionId) // Changes every time!
.user(userQuery)
.call()
.content();
}
Common mistakes that break OpenAI caching
- Appending timestamps or session IDs to the system prompt
- Including request IDs or trace IDs in the static prefix
- Personalising the system prompt per user (use the user message instead)
- Shuffling few-shot examples (always keep them in the same order)
Step 5 — Measure the Impact
Add instrumentation to verify caching is working. For OpenAI, check the usage object:
@Service
public class InstrumentedAIService {
private final OpenAiChatModel chatModel;
private final MeterRegistry meterRegistry;
public String answerWithMetrics(String prompt, String userQuery) {
ChatResponse response = chatModel.call(
new Prompt(List.of(
new SystemMessage(prompt),
new UserMessage(userQuery)
))
);
// Extract token usage from the response metadata
Usage usage = response.getMetadata().getUsage();
long totalInput = usage.getPromptTokens();
long cachedInput = extractCachedTokens(response); // provider-specific
long uncachedInput = totalInput - cachedInput;
meterRegistry.counter("llm.tokens.cached").increment(cachedInput);
meterRegistry.counter("llm.tokens.uncached").increment(uncachedInput);
double hitRate = totalInput > 0
? (double) cachedInput / totalInput
: 0;
meterRegistry.gauge("llm.cache.hit_rate", hitRate);
return response.getResult().getOutput().getContent();
}
}
Build a Grafana dashboard with:
- Cache hit rate (target: above 70%)
- Average cost per request (should drop after enabling caching)
- p95 latency (should drop on cache hits)
Step 6 — Cost Impact Calculator
Use this table to estimate your savings before implementing:
| Requests/day | Prompt tokens | Without caching | With 90pct hit rate | Monthly saving |
|---|---|---|---|---|
| 1,000 | 2,000 | $10/day | $1.90/day | $243/month |
| 10,000 | 2,000 | $100/day | $19/day | $2,430/month |
| 50,000 | 5,000 | $1,250/day | $237/day | $30,390/month |
| 100,000 | 5,000 | $2,500/day | $475/day | $60,750/month |
Based on GPT-4o input pricing of $5 per 1M tokens. Cached tokens billed at $0.50 per 1M (10pct).
Implementation Checklist
Go through this before deploying:
- Audit your prompts — identify the largest static block in your system prompt
- Separate static from dynamic — system prompt holds only static content; user turn holds dynamic content
- Check token count — use the tokeniser to verify your static prefix exceeds the cache threshold
- Remove dynamic injections — strip timestamps, session IDs, and per-user content from system prompts
- Anthropic only — confirm you are on a model that supports caching (Sonnet, Haiku, Opus)
- Add monitoring — track
cached_tokensin the API response and alert if hit rate drops below 70% - Test cache invalidation — verify your system behaves correctly after the 5-minute cache TTL expires (Anthropic) or 1-hour TTL (OpenAI)
Summary
Prompt caching is one of those rare optimisations where you get faster AND cheaper at the same time. The implementation effort is minimal — mostly restructuring prompts to put static content first — and the cost savings are measurable within 24 hours of deployment.
For any enterprise AI application with a system prompt longer than 1,000 tokens running at meaningful request volume, prompt caching should be the first production optimisation on your list.
Related reading
- Prompt Engineering Enterprise Guide — structure prompts well (it also makes them cache-friendly).