A working prototype and a production system are different things. This final page is the reference architecture and the cross-cutting concerns — observability, caching, resilience, evals, and security — that make a Spring AI app dependable.
Reference architecture
Keep the AI concerns in a dedicated layer (advisors, tools, retrieval) behind your normal service boundaries. Route model traffic through a gateway/router so cost, fallback, and limits live in one place.
Observability
Spring AI integrates with Micrometer — you get traces and metrics for model calls (latency, token usage, tool calls) out of the box. Ship them to your existing stack (Prometheus/Grafana, OTel). You can't run AI reliably without seeing tokens, latency, and failure rates per route.
Caching
- Prompt caching: keep large static prefixes (system prompt, instructions) first so providers can cache them — big latency/cost wins at volume.
- Response/semantic caching: cache answers to identical/similar requests at a gateway to avoid repeat calls.
Resilience
LLM calls are network calls to a flaky dependency. Apply standard patterns (Resilience4j):
- Timeouts on every call.
- Retries with exponential backoff for transient/rate-limit errors.
- Circuit breaker + fallback (cached answer, smaller model, or graceful "try later").
- Bulkheads/budgets so one feature can't exhaust capacity or spend.
Evaluation
Treat quality as testable. Build a golden set and run evals (Spring AI has evaluator support; RAGAS-style metrics for RAG) in CI — fail the build if faithfulness/relevance/recall regress. Don't ship prompt or retrieval changes on vibes.
Security
- Treat user and retrieved text as untrusted (prompt-injection aware).
- Enforce tool permissions and authorization in code, not the prompt.
- Validate/scan structured output before use; scope context per user (no cross-tenant leakage).
- Keep keys server-side; never expose provider keys to the browser.
Anti-patterns to avoid
- ❌ Calling providers directly from many services (no cost/limit/fallback control) — route through a gateway.
- ❌ No evals — silent quality regressions.
- ❌ Unbounded agent loops / no budgets — runaway cost.
- ❌ Trusting model output as safe (HTML injection, unsafe tool args).
- ❌ Hardcoding one provider — no routing or failover.
You've completed the course 🎉
You can now design and ship enterprise Spring AI: from ChatClient basics through structured output, tool calling, RAG, MCP/agents, model selection, and a production-grade architecture.
Test your knowledge with the practice quiz below, and explore the related deep-dives.