Prompt Caching: Cut Your LLM Costs by 80%

Prompt caching is the single highest-leverage optimisation available to enterprise AI teams right now. Anthropic advertises up to 90% lower input cost and up to 85% lower latency on cache hits for long prompts. OpenAI also discounts cached input tokens, although the size of the discount depends on the model. Yet most teams are not using it.

This post explains what prompt caching is, why it matters, and how to implement it step by step in a Spring AI application.

Step 1 — Understand How LLM Calls Work

Every LLM API call processes tokens in two distinct phases:

Prefill phase — The model reads and processes your entire prompt. This involves running transformer attention across every token. For a 2,000-token system prompt, this happens on every single request, even if the prompt never changes.
Decode phase — The model generates the response one token at a time.

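Prefill is the expensive, slow part of handling a long prompt. Caching lets the provider skip it for the cached prefix, so only the tokens that follow the prefix still need to be prefilled.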

sequenceDiagram
    participant App as Your App
    participant Cache as Provider Cache
    participant LLM as LLM Engine
    Note over App,LLM: First request - cache MISS
    App->>Cache: System prompt plus User query
    Cache->>LLM: Full prefill of all tokens
    LLM-->>App: Response and KV cache stored
    Note over App,LLM: Subsequent requests - cache HIT
    App->>Cache: Same system prompt plus new query
    Cache-->>LLM: Prefill skipped, load KV from cache
    LLM-->>App: Response 85pct faster, 90pct cheaper

Figure 1: On a cache miss, the full prefill runs. On a hit, prefill is skipped for the cached prefix and only the new tokens are processed.
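The hit/miss flow in Figure 1 can be mimicked with a toy in-memory model. This is only a sketch of the idea: the class name ToyPrefixCache and its method are hypothetical, characters stand in for tokens, and a real provider caches KV state keyed on token prefixes rather than strings in a set.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of a provider-side prefix cache (illustrative only). */
final class ToyPrefixCache {

    // Prefixes the simulated provider has already prefilled.
    private final Set<String> prefilledPrefixes = new HashSet<>();

    /**
     * Simulates one request and returns how many characters had to be
     * prefilled. Characters stand in for tokens to keep the sketch simple.
     */
    int charsToPrefill(String staticPrefix, String userQuery) {
        // Set.add returns true only the first time a prefix is seen.
        boolean miss = prefilledPrefixes.add(staticPrefix);
        return miss
            ? staticPrefix.length() + userQuery.length() // MISS: full prefill
            : userQuery.length();                        // HIT: only the new suffix
    }
}
```

The first request with a given prefix pays for the whole prompt; later requests with the same prefix pay only for the query.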

Step 2 — Identify What to Cache

Prompt caching saves the KV cache state of your prompt prefix. It works best when you have a large, static block at the start of every request.

Good candidates for caching:

Content type	Typical size	Cache value
System instructions + persona	500-2,000 tokens	High
Company knowledge base in RAG prompts	2,000-8,000 tokens	Very high
Few-shot examples	500-3,000 tokens	High
User query	20-200 tokens	Never cache (always dynamic)

The golden rule: Static content goes first. Dynamic content goes last.

[CACHED — never changes]
  System prompt (2,000 tokens)
  Company context (1,000 tokens)
  Few-shot examples (500 tokens)

[NOT CACHED — changes every request]
  Retrieved documents (varies)
  User query (varies)
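The layout above can be sketched in code. The helper below is hypothetical (PromptLayout and its methods are not part of any library) and treats each prompt segment as one unit: it assembles static segments first, then counts how many leading segments two requests share, which is the part a prefix cache could reuse.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: assemble prompt segments static-first and measure the shared prefix. */
final class PromptLayout {

    /** Static segments go first, dynamic segments go last. */
    static List<String> assemble(List<String> staticSegments, List<String> dynamicSegments) {
        List<String> prompt = new ArrayList<>(staticSegments);
        prompt.addAll(dynamicSegments);
        return prompt;
    }

    /** Number of leading segments two prompts have in common (the cacheable prefix). */
    static int sharedPrefixSegments(List<String> a, List<String> b) {
        int limit = Math.min(a.size(), b.size());
        int i = 0;
        while (i < limit && a.get(i).equals(b.get(i))) {
            i++;
        }
        return i;
    }
}
```

With the static block first, two requests with different queries still share all three static segments. If the dynamic content were placed first, the shared prefix would drop to zero and nothing could be reused.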

Step 3 — Enable Caching on Anthropic

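Anthropic's prompt caching requires your static prefix to be at least:

1,024 tokens for Claude Sonnet
2,048 tokens for Claude Haiku

These minimums vary by model version, and some newer models require more, so confirm the figure for your model in Anthropic's current documentation. A prefix shorter than the minimum is processed normally but not cached.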
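A quick pre-flight check can flag prompts that are obviously too short to be cached. The sketch below is an assumption-heavy estimate, not a tokenizer: it uses a rule of thumb of roughly 4 characters per token for English text, and the class CacheEligibility is hypothetical. For a real check, count tokens with the provider's own tooling.

```java
/** Rough pre-flight check that a static prefix is long enough to be cached. */
final class CacheEligibility {

    // Assumption: about 4 characters per token for English text.
    // This is a rule of thumb, not a substitute for a real tokenizer.
    private static final double CHARS_PER_TOKEN = 4.0;

    /** Estimated token count, rounded up. */
    static int estimateTokens(String text) {
        return (int) Math.ceil(text.length() / CHARS_PER_TOKEN);
    }

    /** True if the estimate reaches the provider's minimum cacheable length. */
    static boolean likelyCacheable(String staticPrefix, int minimumTokens) {
        return estimateTokens(staticPrefix) >= minimumTokens;
    }
}
```

Because the estimate is coarse, treat a result close to the threshold as "unknown" and leave a comfortable margin above the minimum.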

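Unlike OpenAI, Anthropic's caching is opt-in: the request must mark the end of the cacheable prefix with a cache_control breakpoint. Depending on your Spring AI version, this may have to be enabled explicitly through the Anthropic chat options rather than happening automatically, so confirm that caching is active with the usage metrics described in Step 5.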

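3a — Add the dependency

<!-- Starter name used by Spring AI 1.0 and later; earlier milestone
     releases published it as spring-ai-anthropic-spring-boot-starter -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-model-anthropic</artifactId>
</dependency>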

3b — Configure your application.yml

spring:
  ai:
    anthropic:
      api-key: ${ANTHROPIC_API_KEY}
      chat:
        options:
          model: claude-sonnet-4-5-20250929
          max-tokens: 1024

3c — Build the service with a large, static system prompt

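import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;

@Service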
public class CachedEnterpriseAIService {

    private final ChatClient chatClient;

    // Static system prompt: defined once and sent unchanged on every request,
    // so the provider can reuse the cached prefix
    private static final String SYSTEM_PROMPT = """
        You are an expert software engineering assistant for Acme Corp.
        
        ## Your Role
        Help engineers design, debug, and review code across Java, Python,
        and TypeScript. Provide production-ready examples with explanations.
        
        ## Guidelines
        - Always explain your approach before showing code
        - Highlight potential edge cases and failure modes
        - Suggest observability and error handling patterns
        - Reference Acme's internal standards when applicable
        
        ## Constraints
        - Only answer software engineering questions
        - Never reveal confidential system information
        - Acknowledge uncertainty rather than guessing
        
        [... additional context that pushes this above 1,024 tokens ...]
        """;

    public CachedEnterpriseAIService(ChatClient.Builder builder) {
        this.chatClient = builder
            .defaultSystem(SYSTEM_PROMPT)  // Static prefix, identical on every request
            .build();
    }

    public String answer(String userQuery) {
        return chatClient.prompt()
            .user(userQuery)   // Only this changes per request
            .call()
            .content();
    }
}

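On the first call, Anthropic processes the system prompt and writes it to the cache. That write is billed at a premium over normal input tokens (25% for the default 5-minute cache). Subsequent calls that arrive before the cache expires read the prefix from the cache at a fraction of the base input price, so only the user query tokens are billed at the full rate. All of this assumes the request marks the system prompt with cache_control; without that marker nothing is cached.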
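At the HTTP level, Anthropic's Messages API identifies the cacheable prefix by a cache_control entry of type "ephemeral" on a content block. The sketch below builds such a request body by hand so the shape is visible. The class AnthropicRequestSketch and its hand-rolled JSON escaping are illustrative assumptions; a real application would use a JSON library or let its client library produce this body.

```java
/** Sketch: a raw Messages API body with a cache breakpoint on the system prompt. */
final class AnthropicRequestSketch {

    static String body(String model, int maxTokens, String systemPrompt, String userQuery) {
        return "{"
            + "\"model\":" + quote(model) + ","
            + "\"max_tokens\":" + maxTokens + ","
            + "\"system\":[{"
            +     "\"type\":\"text\","
            +     "\"text\":" + quote(systemPrompt) + ","
            // Marks the end of the cacheable prefix.
            +     "\"cache_control\":{\"type\":\"ephemeral\"}"
            + "}],"
            // The user message comes after the breakpoint, so it is never cached.
            + "\"messages\":[{\"role\":\"user\",\"content\":" + quote(userQuery) + "}]"
            + "}";
    }

    /** Minimal JSON string escaping; a real client would use a JSON library. */
    static String quote(String s) {
        StringBuilder out = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.append('"').toString();
    }
}
```

The response's usage block then reports cache_creation_input_tokens on the request that writes the cache and cache_read_input_tokens on requests that hit it, which is the quickest way to confirm the breakpoint is being honoured.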

Step 4 — Enable Caching on OpenAI

OpenAI caching is fully automatic — no code changes required.

How it works

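Any prompt of 1,024 tokens or more is eligible. When the start of a request matches a prefix that OpenAI has recently processed, that prefix is served from cache. Cached prefixes typically stay live for 5 to 10 minutes of inactivity and can persist for up to one hour during off-peak periods. The API response reports the number of discounted tokens in the cached_tokens field, under usage.prompt_tokens_details.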
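To sanity-check the cached_tokens value you see in responses, it helps to know roughly what to expect. The sketch below assumes OpenAI's documented behaviour that cache hits start at 1,024 tokens and grow in 128-token increments. The class OpenAiCacheEstimate is hypothetical, and the result is an upper bound rather than a guarantee, since a hit also depends on the prefix still being in cache.

```java
/** Sketch: upper bound on cached_tokens for a request, under stated assumptions. */
final class OpenAiCacheEstimate {

    // Assumptions taken from OpenAI's prompt caching guide.
    static final int MIN_PROMPT_TOKENS = 1024;
    static final int INCREMENT = 128;

    /**
     * @param promptTokens       total input tokens in this request
     * @param sharedPrefixTokens leading tokens identical to a recent request
     * @return the most tokens that could be reported as cached
     */
    static int maxCachedTokens(int promptTokens, int sharedPrefixTokens) {
        if (promptTokens < MIN_PROMPT_TOKENS || sharedPrefixTokens < MIN_PROMPT_TOKENS) {
            return 0; // too short to be cached at all
        }
        // Round the shared prefix down to a whole number of increments.
        return (sharedPrefixTokens / INCREMENT) * INCREMENT;
    }
}
```

If the measured cached_tokens is consistently far below this bound, the usual cause is something dynamic creeping into the start of the prompt.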

What you must do: structure your prompt correctly

// CORRECT — static system prompt comes before dynamic content
public String answerWithCaching(String userQuery) {
    return chatClient.prompt()
        .system(LARGE_STATIC_SYSTEM_PROMPT)  // 2,000+ tokens — gets cached
        .user(userQuery)                       // Dynamic — never cached
        .call()
        .content();
}

// WRONG — injecting dynamic content into the system prompt breaks caching
public String answerWithoutCaching(String userQuery, String sessionId) {
    return chatClient.prompt()
        .system(STATIC_PROMPT + "\nSession: " + sessionId)  // Changes every time!
        .user(userQuery)
        .call()
        .content();
}

Common mistakes that break OpenAI caching

Appending timestamps or session IDs to the system prompt
Including request IDs or trace IDs in the static prefix
Personalising the system prompt per user (use the user message instead)
Shuffling few-shot examples (always keep them in the same order)
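All four mistakes have the same symptom: the bytes at the start of the prompt differ between requests. One way to catch this early is to fingerprint the static prefix and log or compare the result. The sketch below uses SHA-256 from the JDK; the class name PrefixFingerprint is hypothetical, and where you log or compare the value is left to your application.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Sketch: fingerprint a prompt prefix so accidental changes are easy to spot. */
final class PrefixFingerprint {

    /** Lower-case hex SHA-256 of the text's UTF-8 bytes. */
    static String sha256Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

If the fingerprint of the system prompt changes from one request to the next, something dynamic has leaked into the prefix and cache hits will stop.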

Step 5 — Measure the Impact

Add instrumentation to verify caching is working. For OpenAI, check the usage object:

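import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.ai.chat.messages.SystemMessage;
import org.springframework.ai.chat.messages.UserMessage;
import org.springframework.ai.chat.metadata.Usage;
import org.springframework.ai.chat.model.ChatResponse;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.ai.openai.OpenAiChatModel;
import org.springframework.stereotype.Service;

@Service
public class InstrumentedAIService {

    private final OpenAiChatModel chatModel;
    private final MeterRegistry meterRegistry;

    // Kept in a field so the gauge always has a live object to read. Passing a
    // boxed double straight to gauge() leaves nothing holding on to the value.
    private final AtomicReference<Double> lastHitRate = new AtomicReference<>(0.0);

    public InstrumentedAIService(OpenAiChatModel chatModel, MeterRegistry meterRegistry) {
        this.chatModel = chatModel;
        this.meterRegistry = meterRegistry;
        Gauge.builder("llm.cache.hit_rate", lastHitRate, AtomicReference::get)
            .register(meterRegistry);
    }

    public String answerWithMetrics(String prompt, String userQuery) {
        ChatResponse response = chatModel.call(
            new Prompt(List.of(
                new SystemMessage(prompt),
                new UserMessage(userQuery)
            ))
        );

        // Extract token usage from the response metadata
        Usage usage = response.getMetadata().getUsage();
        long totalInput     = usage.getPromptTokens();
        // Provider-specific helper, not shown: for OpenAI it reads
        // prompt_tokens_details.cached_tokens from the native usage data
        long cachedInput    = extractCachedTokens(response);
        long uncachedInput  = totalInput - cachedInput;

        meterRegistry.counter("llm.tokens.cached").increment(cachedInput);
        meterRegistry.counter("llm.tokens.uncached").increment(uncachedInput);

        // Hit rate of the most recent request; derive the aggregate from the counters
        lastHitRate.set(totalInput > 0 ? (double) cachedInput / totalInput : 0.0);

        // getText() in Spring AI 1.0+; earlier milestones used getContent()
        return response.getResult().getOutput().getText();
    }
}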
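A per-request hit rate is noisy, so dashboards usually want the rate over all tokens in a window. The sketch below shows that aggregation with plain JDK classes. CacheHitTracker is a hypothetical helper, not part of Micrometer or Spring AI; in production the same ratio can be computed in the dashboard from the two counters.

```java
import java.util.concurrent.atomic.LongAdder;

/** Sketch: token-weighted cache hit rate accumulated across requests. */
final class CacheHitTracker {

    private final LongAdder cachedTokens = new LongAdder();
    private final LongAdder inputTokens = new LongAdder();

    /** Records one request's input tokens and how many of them were cached. */
    void record(long totalInput, long cachedInput) {
        if (cachedInput < 0 || cachedInput > totalInput) {
            throw new IllegalArgumentException("cachedInput must be between 0 and totalInput");
        }
        inputTokens.add(totalInput);
        cachedTokens.add(cachedInput);
    }

    /** Fraction of all recorded input tokens that were served from cache. */
    double hitRate() {
        long total = inputTokens.sum();
        return total == 0 ? 0.0 : (double) cachedTokens.sum() / total;
    }
}
```

Weighting by tokens rather than by request count matters because a hit on a 5,000-token prompt saves far more than a hit on a 1,100-token one.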

Build a Grafana dashboard with:

Cache hit rate (target: above 70%)
Average cost per request (should drop after enabling caching)
p95 latency (should drop on cache hits)

Step 6 — Cost Impact Calculator

Use this table to estimate your savings before implementing:

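Requests/day	Prompt tokens	Without caching	With 90% hit rate	Monthly saving
1,000	2,000	$10/day	$1.90/day	$243/month
10,000	2,000	$100/day	$19/day	$2,430/month
50,000	5,000	$1,250/day	$237.50/day	$30,375/month
100,000	5,000	$2,500/day	$475/day	$60,750/month

Illustrative figures assuming a 30-day month, an input price of $5 per 1M tokens, and cached tokens billed at 10% of that ($0.50 per 1M). Actual prices and cache discounts vary by provider and model: Anthropic bills cache reads at 10% of the base input price, while OpenAI's discount for GPT-4o is 50%. Substitute current rates before relying on these numbers. Output tokens and Anthropic's cache-write premium are not included.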
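The arithmetic behind the table is simple enough to run against your own volumes and prices. The sketch below is a hypothetical helper (CachingCostEstimator is not from any library). It treats the hit rate as the fraction of input tokens served from cache and ignores output tokens and cache-write surcharges.

```java
/** Sketch: estimate daily input cost and monthly saving from prompt caching. */
final class CachingCostEstimator {

    /**
     * @param hitRate               fraction of input tokens served from cache (0.0 to 1.0)
     * @param pricePerMillion       price of 1M uncached input tokens
     * @param cachedPricePerMillion price of 1M cached input tokens
     * @return input cost per day in the same currency as the prices
     */
    static double dailyCost(long requestsPerDay, long promptTokens, double hitRate,
                            double pricePerMillion, double cachedPricePerMillion) {
        double tokens = (double) requestsPerDay * promptTokens;
        double cached = tokens * hitRate;
        double uncached = tokens - cached;
        return (uncached * pricePerMillion + cached * cachedPricePerMillion) / 1_000_000.0;
    }

    /** Saving over the given number of days compared with no caching at all. */
    static double saving(long requestsPerDay, long promptTokens, double hitRate,
                         double pricePerMillion, double cachedPricePerMillion, int days) {
        double without = dailyCost(requestsPerDay, promptTokens, 0.0,
                                   pricePerMillion, cachedPricePerMillion);
        double with = dailyCost(requestsPerDay, promptTokens, hitRate,
                                pricePerMillion, cachedPricePerMillion);
        return (without - with) * days;
    }
}
```

For example, 1,000 requests a day with a 2,000-token prompt, a 90% hit rate, and prices of $5 and $0.50 per 1M tokens gives $10 per day without caching, $1.90 with it, and a saving of $243 over 30 days.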

Implementation Checklist

Go through this before deploying:

Audit your prompts — identify the largest static block in your system prompt
Separate static from dynamic — system prompt holds only static content; user turn holds dynamic content
Check token count — use the tokeniser to verify your static prefix exceeds the cache threshold
Remove dynamic injections — strip timestamps, session IDs, and per-user content from system prompts
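Anthropic only — confirm you are on a model that supports caching (Sonnet, Haiku, Opus) and that your requests actually carry the cache_control marker
Add monitoring — track cached tokens in the API response (cached_tokens for OpenAI; cache_read_input_tokens and cache_creation_input_tokens for Anthropic) and alert if hit rate drops below 70%
Test cache invalidation — verify your system behaves correctly after the cache expires: Anthropic's default TTL is 5 minutes and is refreshed each time the cached prefix is read, while OpenAI typically evicts a prefix after 5 to 10 minutes of inactivity and keeps it for at most about an hour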
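The expiry item in the checklist is easier to reason about with a small model. The sketch below simulates a cache whose TTL is refreshed on every use, which is how Anthropic documents its default cache. The class TtlPrefixCache is hypothetical and takes the current time as a parameter so that expiry can be tested without sleeping.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: a prefix cache whose entries expire after a TTL that resets on use. */
final class TtlPrefixCache {

    private final long ttlMillis;
    private final Map<String, Long> expiryByPrefix = new HashMap<>();

    TtlPrefixCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /**
     * Returns true on a hit. On both hit and miss the entry's expiry is set to
     * nowMillis + TTL: a miss writes the entry, a hit refreshes it.
     */
    boolean lookup(String prefix, long nowMillis) {
        Long expiry = expiryByPrefix.get(prefix);
        boolean hit = expiry != null && nowMillis < expiry;
        expiryByPrefix.put(prefix, nowMillis + ttlMillis);
        return hit;
    }
}
```

The practical consequence is that steady traffic keeps the cache warm indefinitely, while a gap longer than the TTL means the next request pays for a full prefill again. Low-volume endpoints may therefore see a much lower hit rate than the 70% target.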

Summary

Prompt caching is one of those rare optimisations where you get faster AND cheaper at the same time. The implementation effort is minimal — mostly restructuring prompts to put static content first — and the cost savings are measurable within 24 hours of deployment.

For any enterprise AI application with a system prompt longer than 1,000 tokens running at meaningful request volume, prompt caching should be the first production optimisation on your list.

Prompt Engineering Enterprise Guide — structure prompts well (it also makes them cache-friendly).
