Spring AI
    June 10, 2026

    Spring AI — Production Architecture & Best Practices

    Take Spring AI to production: a reference architecture plus observability, caching, resilience, evals, security, and the anti-patterns to avoid.

    A working prototype and a production system are different things. This final page covers the reference architecture and the cross-cutting concerns (observability, caching, resilience, evals, and security) that make a Spring AI app dependable.

    Reference architecture

    flowchart LR
      API@{ icon: "logos:spring", form: "square", label: "Spring Boot API" }
      VS@{ icon: "logos:postgresql", form: "square", label: "Vector store" }
      U[Clients]:::ext --> API
      API --> ADV[ChatClient + Advisors]:::svc
      ADV --> RAG[RAG retriever]:::svc --> VS
      ADV --> TOOLS["@Tool / MCP"]:::svc
      ADV --> GW[AI gateway / router]:::svc
      GW --> P[LLM providers]:::ai
      ADV -. traces/metrics .-> OBS[Observability]:::data
      GW -. cost/limits .-> OBS
      class API,VS logo
      classDef ai fill:#241844,stroke:#a855f7,color:#fff
      classDef svc fill:#06303a,stroke:#22d3ee,color:#fff
      classDef data fill:#08331f,stroke:#34d399,color:#fff
      classDef ext fill:#222838,stroke:#94a3b8,color:#fff
      classDef logo fill:#0b1220,stroke:#475569,color:#e2e8f0

    Keep the AI concerns in a dedicated layer (advisors, tools, retrieval) behind your normal service boundaries. Route model traffic through a gateway/router so cost, fallback, and limits live in one place.
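
To make the "one place for fallback" idea concrete, here is a minimal plain-Java sketch of a gateway that tries providers in priority order. It is not a Spring AI API: `ModelGateway`, `Provider`, and `Result` are hypothetical names, and each provider is modeled as a simple function from prompt to answer standing in for a real model client.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical gateway: tries providers in priority order, returns the first success. */
final class ModelGateway {

    /** A named provider; `call` stands in for a real model client (assumption). */
    record Provider(String name, Function<String, String> call) {}

    /** The answer plus the provider that produced it, so cost can be attributed. */
    record Result(String provider, String answer) {}

    private final List<Provider> providers;

    ModelGateway(List<Provider> providers) {
        if (providers.isEmpty()) {
            throw new IllegalArgumentException("no providers configured");
        }
        this.providers = List.copyOf(providers);
    }

    Result complete(String prompt) {
        RuntimeException last = null;
        for (Provider p : providers) {
            try {
                return new Result(p.name(), p.call().apply(prompt));
            } catch (RuntimeException e) {
                last = e; // remember the failure and try the next provider
            }
        }
        throw new IllegalStateException("all providers failed", last);
    }
}
```

Because callers only see `complete`, swapping the provider order or adding per-provider limits would not touch feature code, which is the point of routing through one component.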

    Observability

    Spring AI integrates with Micrometer — you get traces and metrics for model calls (latency, token usage, tool calls) out of the box. Ship them to your existing stack (Prometheus/Grafana, OTel). You can't run AI reliably without seeing tokens, latency, and failure rates per route.

    Caching

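    • Prompt caching: put large static prefixes (system prompt, instructions) first and per-request content last. Provider-side caches typically match on the prompt prefix, so a stable prefix cuts latency and cost on repeat calls.
    • Response/semantic caching: cache answers at the gateway, keyed on identical requests or on similar ones (semantic match), to avoid repeat model calls.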
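
As a rough illustration of response caching, the sketch below caches answers for prompts that are identical after whitespace and case normalization, with least-recently-used eviction. It covers only the exact-match case; a semantic cache would compare embeddings instead. `ResponseCache` is a hypothetical class, the model is a stand-in function, and lowercasing the key is an assumption that would be wrong for case-sensitive prompts.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical exact-match response cache with LRU eviction. */
final class ResponseCache {

    private final Function<String, String> model; // stands in for the real model call
    private final Map<String, String> cache;
    private int modelCalls = 0;

    ResponseCache(int maxEntries, Function<String, String> model) {
        this.model = model;
        // accessOrder=true keeps entries ordered from least to most recently used
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    synchronized String ask(String prompt) {
        String key = normalize(prompt);
        String hit = cache.get(key);
        if (hit != null) {
            return hit; // cache hit: no model call, no tokens spent
        }
        modelCalls++;
        String answer = model.apply(prompt);
        cache.put(key, answer);
        return answer;
    }

    /** Collapses whitespace and lowercases so trivially different prompts share a key. */
    static String normalize(String prompt) {
        return prompt.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    synchronized int modelCalls() {
        return modelCalls;
    }
}
```

In a multi-tenant system the cache key would also need the tenant or user scope, so one tenant's answer is never served to another.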

    Resilience

    LLM calls are network calls to a flaky dependency. Apply standard patterns (Resilience4j):

    • Timeouts on every call.
    • Retries with exponential backoff for transient/rate-limit errors.
    • Circuit breaker + fallback (cached answer, smaller model, or graceful "try later").
    • Bulkheads/budgets so one feature can't exhaust capacity or spend.
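
The first three patterns can be sketched with the JDK alone. In a real service Resilience4j would supply them, plus the circuit breaker and bulkhead that this sketch leaves out. `ResilientCall` and its parameters are hypothetical, and the caller decides which errors count as transient.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical helper: per-attempt timeout, exponential backoff, then a fallback. */
final class ResilientCall {

    static String call(Supplier<String> modelCall, int maxAttempts, Duration timeout,
                       Duration initialBackoff, Predicate<Throwable> isTransient,
                       String fallback) {
        long backoffMs = initialBackoff.toMillis();
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            CompletableFuture<String> future = CompletableFuture.supplyAsync(modelCall);
            try {
                return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // abandon this attempt (the underlying call is not interrupted);
                // a timeout is treated as transient and retried
                future.cancel(true);
            } catch (ExecutionException e) {
                if (!isTransient.test(e.getCause())) {
                    return fallback; // permanent error: retrying will not help
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return fallback;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoffMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return fallback;
                }
                backoffMs *= 2; // exponential backoff between attempts
            }
        }
        return fallback; // all attempts used up
    }
}
```

The fallback here is a fixed string; it could equally be a cached answer or a call to a smaller model, as long as it cannot itself hang.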

    Evaluation

    Treat quality as testable. Build a golden set and run evals (Spring AI has evaluator support; RAGAS-style metrics for RAG) in CI — fail the build if faithfulness/relevance/recall regress. Don't ship prompt or retrieval changes on vibes.
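
A CI gate of this kind can be very small. The sketch below does not use Spring AI's evaluator support or RAGAS; it assumes a deliberately crude stand-in metric (the answer must contain every required phrase) just to show the golden-set-plus-threshold shape. `GoldenSetEval`, `GoldenCase`, and the threshold are hypothetical.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

/** Hypothetical golden-set check using a simple "must contain" metric. */
final class GoldenSetEval {

    /** One golden example: a question and phrases a correct answer must include. */
    record GoldenCase(String question, List<String> mustContain) {}

    /** Fraction of cases whose answer contains every required phrase, ignoring case. */
    static double passRate(List<GoldenCase> goldenSet, Function<String, String> app) {
        if (goldenSet.isEmpty()) {
            throw new IllegalArgumentException("golden set is empty");
        }
        long passed = goldenSet.stream().filter(c -> {
            String answer = app.apply(c.question()).toLowerCase(Locale.ROOT);
            return c.mustContain().stream()
                    .allMatch(phrase -> answer.contains(phrase.toLowerCase(Locale.ROOT)));
        }).count();
        return (double) passed / goldenSet.size();
    }

    /** Fails the run (and so the build) when quality drops below the agreed threshold. */
    static void assertNoRegression(double passRate, double threshold) {
        if (passRate < threshold) {
            throw new AssertionError(
                    "eval pass rate " + passRate + " is below threshold " + threshold);
        }
    }
}
```

Called from an ordinary test, `assertNoRegression` turns a prompt or retrieval change that lowers the pass rate into a red build rather than a production surprise.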

    Security

    • Treat user and retrieved text as untrusted (prompt-injection aware).
    • Enforce tool permissions and authorization in code, not the prompt.
    • Validate/scan structured output before use; scope context per user (no cross-tenant leakage).
    • Keep keys server-side; never expose provider keys to the browser.
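
"Authorization in code, not the prompt" can look like the following check, run before any model-requested tool call executes. It is a sketch under assumptions: `ToolAuthorizer`, `Caller`, and `ToolCall` are hypothetical types, and the caller's identity is assumed to come from the authenticated request, never from model output.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical guard that runs before any model-requested tool call executes. */
final class ToolAuthorizer {

    /** Identity taken from the authenticated request, never from model output. */
    record Caller(String userId, String tenantId, Set<String> roles) {}

    /** What the model asked for: a tool name and the tenant whose data it targets. */
    record ToolCall(String toolName, String targetTenantId) {}

    private final Map<String, String> requiredRoleByTool;

    ToolAuthorizer(Map<String, String> requiredRoleByTool) {
        this.requiredRoleByTool = Map.copyOf(requiredRoleByTool);
    }

    /** Throws SecurityException unless the call is allow-listed, permitted, and in-tenant. */
    void authorize(Caller caller, ToolCall call) {
        String requiredRole = requiredRoleByTool.get(call.toolName());
        if (requiredRole == null) {
            // deny by default: tools missing from the map are never callable
            throw new SecurityException("tool not allow-listed: " + call.toolName());
        }
        if (!caller.roles().contains(requiredRole)) {
            throw new SecurityException(
                    caller.userId() + " lacks role " + requiredRole + " for " + call.toolName());
        }
        if (!caller.tenantId().equals(call.targetTenantId())) {
            throw new SecurityException("cross-tenant access denied");
        }
    }
}
```

Because the check ignores whatever the prompt or retrieved text says, an injected instruction can at most request a tool call; it cannot grant itself permission.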

    Anti-patterns to avoid

    • ❌ Calling providers directly from many services (no cost/limit/fallback control) — route through a gateway.
    • ❌ No evals — silent quality regressions.
    • ❌ Unbounded agent loops / no budgets — runaway cost.
    • ❌ Trusting model output as safe (HTML injection, unsafe tool args).
    • ❌ Hardcoding one provider — no routing or failover.
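
For the unbounded-loop anti-pattern, the fix is a hard cap on both steps and tokens. A minimal sketch, assuming the agent step is a plain function that reports its own token usage; `BoundedAgentLoop`, `Step`, and `Outcome` are hypothetical names, not Spring AI types.

```java
import java.util.function.Function;

/** Hypothetical agent loop that stops on completion, step limit, or token budget. */
final class BoundedAgentLoop {

    /** Result of one agent step: new state, tokens it consumed, and whether it finished. */
    record Step(String output, int tokensUsed, boolean done) {}

    /** Final state plus why the loop stopped, so callers can log or alert on it. */
    record Outcome(String output, int steps, int tokensSpent, String stopReason) {}

    static Outcome run(String task, Function<String, Step> agentStep,
                       int maxSteps, int tokenBudget) {
        String state = task;
        int tokens = 0;
        for (int step = 1; step <= maxSteps; step++) {
            Step s = agentStep.apply(state);
            tokens += s.tokensUsed();
            state = s.output();
            if (s.done()) {
                return new Outcome(state, step, tokens, "done");
            }
            if (tokens >= tokenBudget) {
                return new Outcome(state, step, tokens, "token budget exhausted");
            }
        }
        return new Outcome(state, maxSteps, tokens, "max steps reached");
    }
}
```

Returning the stop reason instead of throwing lets the caller decide whether a partial result is still useful.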

    You've completed the course 🎉

    You can now design and ship enterprise Spring AI applications: from ChatClient basics through structured output, tool calling, RAG, MCP/agents, model selection, and a production-grade architecture.
