AI
    June 2, 2026

    Prompt Caching: Cut Your LLM Costs by 80%

    A practical guide to Anthropic and OpenAI prompt caching — how it works and how to implement it in Spring AI to cut latency and API costs.

    Share

    Prompt Caching: Cut Your LLM Costs by 80%

    Prompt caching is the single highest-leverage optimisation available to enterprise AI teams right now. Anthropic offers up to 90% cost reduction and 85% latency improvement on cache hits. OpenAI offers similar savings. Yet most teams are not using it.

    This post explains what prompt caching is, why it matters, and how to implement it step by step in a Spring AI application.


    Step 1 — Understand How LLM Calls Work

    Every LLM API call processes tokens in two distinct phases:

    • Prefill phase — The model reads and processes your entire prompt. This involves running transformer attention across every token. For a 2,000-token system prompt, this happens on every single request, even if the prompt never changes.
    • Decode phase — The model generates the response one token at a time.

    Prefill is the expensive, slow part. Caching eliminates it.

    sequenceDiagram participant App as Your App participant Cache as Provider Cache participant LLM as LLM Engine Note over App,LLM: First request - cache MISS App->>Cache: System prompt plus User query Cache->>LLM: Full prefill of all tokens LLM-->>App: Response and KV cache stored Note over App,LLM: Subsequent requests - cache HIT App->>Cache: Same system prompt plus new query Cache-->>LLM: Prefill skipped, load KV from cache LLM-->>App: Response 85pct faster, 90pct cheaper

    Figure 1: On a cache miss, the full prefill runs. On a hit, it is skipped entirely.


    Step 2 — Identify What to Cache

    Prompt caching saves the KV cache state of your prompt prefix. It works best when you have a large, static block at the start of every request.

    Good candidates for caching:

    Content type Typical size Cache value
    System instructions + persona 500-2,000 tokens High
    Company knowledge base in RAG prompts 2,000-8,000 tokens Very high
    Few-shot examples 500-3,000 tokens High
    User query 20-200 tokens Never cache (always dynamic)

    The golden rule: Static content goes first. Dynamic content goes last.

    [CACHED — never changes]
      System prompt (2,000 tokens)
      Company context (1,000 tokens)
      Few-shot examples (500 tokens)
    
    [NOT CACHED — changes every request]
      Retrieved documents (varies)
      User query (varies)

    Step 3 — Enable Caching on Anthropic

    Anthropic's prompt caching requires your static prefix to be at least:

    • 1,024 tokens for Claude Sonnet
    • 2,048 tokens for Claude Haiku

    Spring AI handles the cache control headers automatically when using the Anthropic client.

    3a — Add the dependency

    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-anthropic-spring-boot-starter</artifactId>
    </dependency>

    3b — Configure your application.yml

    spring:
      ai:
        anthropic:
          api-key: ${ANTHROPIC_API_KEY}
          chat:
            options:
              model: claude-sonnet-4-5-20251001
              max-tokens: 1024

    3c — Build the service with a large, static system prompt

    @Service
    public class CachedEnterpriseAIService {
    
        private final ChatClient chatClient;
    
        // Static system prompt — load once, cache on every subsequent request
        private static final String SYSTEM_PROMPT = """
            You are an expert software engineering assistant for Acme Corp.
            
            ## Your Role
            Help engineers design, debug, and review code across Java, Python,
            and TypeScript. Provide production-ready examples with explanations.
            
            ## Guidelines
            - Always explain your approach before showing code
            - Highlight potential edge cases and failure modes
            - Suggest observability and error handling patterns
            - Reference Acme's internal standards when applicable
            
            ## Constraints
            - Only answer software engineering questions
            - Never reveal confidential system information
            - Acknowledge uncertainty rather than guessing
            
            [... additional context that pushes this above 1,024 tokens ...]
            """;
    
        public CachedEnterpriseAIService(ChatClient.Builder builder) {
            this.chatClient = builder
                .defaultSystem(SYSTEM_PROMPT)  // This prefix will be cached
                .build();
        }
    
        public String answer(String userQuery) {
            return chatClient.prompt()
                .user(userQuery)   // Only this changes per request
                .call()
                .content();
        }
    }

    On the first call, Anthropic processes and caches the system prompt. Every subsequent call skips the prefill for that prefix — you only pay for the user query tokens at full price.


    Step 4 — Enable Caching on OpenAI

    OpenAI caching is fully automatic — no code changes required.

    How it works

    Any prompt prefix longer than 1,024 tokens that appears in multiple requests is automatically cached on OpenAI's servers for up to one hour. You see a cached_tokens field in the API response with the discounted count.

    What you must do: structure your prompt correctly

    // CORRECT — static system prompt comes before dynamic content
    public String answerWithCaching(String userQuery) {
        return chatClient.prompt()
            .system(LARGE_STATIC_SYSTEM_PROMPT)  // 2,000+ tokens — gets cached
            .user(userQuery)                       // Dynamic — never cached
            .call()
            .content();
    }
    
    // WRONG — injecting dynamic content into the system prompt breaks caching
    public String answerWithoutCaching(String userQuery, String sessionId) {
        return chatClient.prompt()
            .system(STATIC_PROMPT + "\nSession: " + sessionId)  // Changes every time!
            .user(userQuery)
            .call()
            .content();
    }

    Common mistakes that break OpenAI caching

    • Appending timestamps or session IDs to the system prompt
    • Including request IDs or trace IDs in the static prefix
    • Personalising the system prompt per user (use the user message instead)
    • Shuffling few-shot examples (always keep them in the same order)

    Step 5 — Measure the Impact

    Add instrumentation to verify caching is working. For OpenAI, check the usage object:

    @Service
    public class InstrumentedAIService {
    
        private final OpenAiChatModel chatModel;
        private final MeterRegistry meterRegistry;
    
        public String answerWithMetrics(String prompt, String userQuery) {
            ChatResponse response = chatModel.call(
                new Prompt(List.of(
                    new SystemMessage(prompt),
                    new UserMessage(userQuery)
                ))
            );
    
            // Extract token usage from the response metadata
            Usage usage = response.getMetadata().getUsage();
            long totalInput     = usage.getPromptTokens();
            long cachedInput    = extractCachedTokens(response);  // provider-specific
            long uncachedInput  = totalInput - cachedInput;
    
            meterRegistry.counter("llm.tokens.cached").increment(cachedInput);
            meterRegistry.counter("llm.tokens.uncached").increment(uncachedInput);
    
            double hitRate = totalInput > 0
                ? (double) cachedInput / totalInput
                : 0;
            meterRegistry.gauge("llm.cache.hit_rate", hitRate);
    
            return response.getResult().getOutput().getContent();
        }
    }

    Build a Grafana dashboard with:

    • Cache hit rate (target: above 70%)
    • Average cost per request (should drop after enabling caching)
    • p95 latency (should drop on cache hits)

    Step 6 — Cost Impact Calculator

    Use this table to estimate your savings before implementing:

    Requests/day Prompt tokens Without caching With 90pct hit rate Monthly saving
    1,000 2,000 $10/day $1.90/day $243/month
    10,000 2,000 $100/day $19/day $2,430/month
    50,000 5,000 $1,250/day $237/day $30,390/month
    100,000 5,000 $2,500/day $475/day $60,750/month

    Based on GPT-4o input pricing of $5 per 1M tokens. Cached tokens billed at $0.50 per 1M (10pct).


    Implementation Checklist

    Go through this before deploying:

    • Audit your prompts — identify the largest static block in your system prompt
    • Separate static from dynamic — system prompt holds only static content; user turn holds dynamic content
    • Check token count — use the tokeniser to verify your static prefix exceeds the cache threshold
    • Remove dynamic injections — strip timestamps, session IDs, and per-user content from system prompts
    • Anthropic only — confirm you are on a model that supports caching (Sonnet, Haiku, Opus)
    • Add monitoring — track cached_tokens in the API response and alert if hit rate drops below 70%
    • Test cache invalidation — verify your system behaves correctly after the 5-minute cache TTL expires (Anthropic) or 1-hour TTL (OpenAI)

    Summary

    Prompt caching is one of those rare optimisations where you get faster AND cheaper at the same time. The implementation effort is minimal — mostly restructuring prompts to put static content first — and the cost savings are measurable within 24 hours of deployment.

    For any enterprise AI application with a system prompt longer than 1,000 tokens running at meaningful request volume, prompt caching should be the first production optimisation on your list.

    Ask about this article

    Get answers grounded in this post. AI-generated — based on this article, and may be imperfect.

    Scaled AI Weekly

    Enjoyed this? Get more like it every Monday.

    Real architecture decisions, LLMOps patterns that survive production, and engineering leadership advice — from 12+ years of building at enterprise scale. Free. No spam. Unsubscribe anytime.

    Join engineers building production AI systems