Spring AI
    June 10, 2026

    Spring AI — Choosing & Integrating LLMs

    Which LLM for which job, and how Spring AI integrates any provider behind one interface — Anthropic, OpenAI/Azure, Google, and local models.

    "Which LLM is best?" has no single answer — it depends on the task, cost, latency, and data-residency constraints. Spring AI's value is that you don't have to bet the codebase on one: you integrate behind a common interface and route per task.

    How Spring AI integrates "all of them"

    Each provider ships as a starter whose auto-configuration exposes the same ChatModel interface, plus EmbeddingModel where the provider offers embeddings:

    Provider                  Starter (model)
    Anthropic Claude          spring-ai-starter-model-anthropic
    OpenAI                    spring-ai-starter-model-openai
    Azure OpenAI              spring-ai-starter-model-azure-openai
    Google Vertex (Gemini)    spring-ai-starter-model-vertex-ai-gemini
    Amazon Bedrock            spring-ai-starter-model-bedrock-*
    Ollama (local)            spring-ai-starter-model-ollama

    Your code calls ChatClient; the starter + config decide the concrete model. Swapping providers is a dependency/config change, not a rewrite.
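
    The "config change, not a rewrite" idea can be sketched in plain Java. This is only an illustration of the pattern, not Spring AI's actual wiring: TextModel, ProviderRegistry, SummaryService, and the app.llm.provider key are hypothetical stand-ins for ChatModel, the starters' auto-configuration, and your own settings, and the providers are stubs that make no network calls.

```java
import java.util.Map;

// Hypothetical provider-neutral interface (a stand-in, not Spring AI's ChatModel).
interface TextModel {
    String complete(String prompt);
}

// Hypothetical registry: resolves a config value to a stubbed provider.
final class ProviderRegistry {
    private static final Map<String, TextModel> PROVIDERS = Map.of(
        "anthropic", prompt -> "[anthropic stub] " + prompt,
        "openai",    prompt -> "[openai stub] " + prompt,
        "ollama",    prompt -> "[ollama stub] " + prompt
    );

    // "app.llm.provider" is an assumed config key; defaults to the local stub.
    static TextModel fromConfig(Map<String, String> config) {
        String name = config.getOrDefault("app.llm.provider", "ollama");
        TextModel model = PROVIDERS.get(name);
        if (model == null) {
            throw new IllegalArgumentException("Unknown provider: " + name);
        }
        return model;
    }
}

// Application code depends only on the interface, never on a provider SDK.
final class SummaryService {
    private final TextModel model;

    SummaryService(TextModel model) {
        this.model = model;
    }

    String summarize(String text) {
        return model.complete("Summarize: " + text);
    }
}
```

    Changing the value of the assumed app.llm.provider key swaps the provider while SummaryService stays untouched, which is the property the starters give you in a real Spring AI application.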

    flowchart LR
      C@{ icon: "logos:anthropic", form: "square", label: "Claude" }
      O@{ icon: "logos:openai", form: "square", label: "OpenAI / Azure" }
      G@{ icon: "logos:google-cloud", form: "square", label: "Gemini / Vertex" }
      B@{ icon: "logos:aws", form: "square", label: "Bedrock" }
      APP[Your code: ChatClient]:::svc --> ABS[Spring AI abstraction]:::svc
      ABS --> C
      ABS --> O
      ABS --> G
      ABS --> B
      ABS --> L[Ollama local]:::ai
      class C,O,G,B logo
      classDef svc fill:#06303a,stroke:#22d3ee,color:#fff
      classDef ai fill:#241844,stroke:#a855f7,color:#fff
      classDef logo fill:#0b1220,stroke:#475569,color:#e2e8f0

    Which model for which job

    There's no universal "best", but sensible defaults:

    • Complex reasoning, agents, architecture, coding → a frontier model (e.g. Claude Sonnet/Opus class). Worth the cost where quality and tool-use reliability matter.
    • High-volume classification / extraction / routing → a small, fast model. Cheap and low-latency; pair with structured output.
    • Private / regulated data, offline → a local model via Ollama or a VPC-hosted provider for data residency.
    • Embeddings (RAG) → a dedicated embedding model; keep the same model for ingestion and query.
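
    The last point, keeping one embedding model for ingestion and query, matters because vectors from different models are not comparable. One way to enforce it is to record the model id alongside the index and reject mismatched queries. The sketch below is a toy in-memory index under that assumption; VectorIndex and the model ids passed to it are hypothetical, and a real system would use a vector store.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory index that remembers which embedding model built it.
final class VectorIndex {
    record Entry(String text, double[] vector) {}

    private final String embeddingModelId;
    private final List<Entry> entries = new ArrayList<>();

    VectorIndex(String embeddingModelId) {
        this.embeddingModelId = embeddingModelId;
    }

    void add(String text, double[] vector) {
        entries.add(new Entry(text, vector));
    }

    // Returns the stored text most similar to the query vector, or null if empty.
    // Refuses queries embedded with a different model than the one used at ingestion.
    String nearest(String queryModelId, double[] query) {
        if (!embeddingModelId.equals(queryModelId)) {
            throw new IllegalStateException(
                "Index built with " + embeddingModelId + " but queried with " + queryModelId);
        }
        Entry best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Entry e : entries) {
            double score = cosine(e.vector(), query);
            if (score > bestScore) {
                bestScore = score;
                best = e;
            }
        }
        return best == null ? null : best.text();
    }

    // Cosine similarity; yields NaN for zero vectors, which nearest() then skips.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

    Failing loudly on a model mismatch is a design choice: a silent mismatch still returns results, just poorly ranked ones, which is harder to notice.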

    Decision factors: capability (can it do the task reliably?), cost per token, latency, context window, data residency/compliance, and provider reliability.
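
    These factors can be treated as hard constraints plus a cost tiebreak: filter out models that fail the task's requirements, then take the cheapest survivor. The sketch below assumes you maintain your own profile per model; ModelProfile, TaskRequirements, ModelSelector, and every field are hypothetical, and the numbers you feed in should come from your own evals and provider pricing. Provider reliability is left out for brevity.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical per-model profile; values come from your own measurements.
record ModelProfile(
    String name,
    double evalScore,               // score on your own eval set, 0.0 to 1.0
    double costPerMillionTokensUsd,
    int p95LatencyMs,
    int contextWindowTokens,
    boolean selfHosted              // true if data never leaves your infrastructure
) {}

// Hypothetical per-task requirements.
record TaskRequirements(
    double minEvalScore,
    int maxLatencyMs,
    int minContextTokens,
    boolean requiresSelfHosted
) {}

final class ModelSelector {
    // Cheapest model that satisfies every hard constraint, if any.
    static Optional<ModelProfile> cheapestFit(List<ModelProfile> candidates,
                                              TaskRequirements req) {
        return candidates.stream()
            .filter(m -> m.evalScore() >= req.minEvalScore())
            .filter(m -> m.p95LatencyMs() <= req.maxLatencyMs())
            .filter(m -> m.contextWindowTokens() >= req.minContextTokens())
            .filter(m -> !req.requiresSelfHosted() || m.selfHosted())
            .min(Comparator.comparingDouble(ModelProfile::costPerMillionTokensUsd));
    }
}
```

    An empty result is useful information in itself: it means no candidate meets the task's constraints and one of them has to be relaxed.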

    Routing: use more than one

    A mature system routes by task — a cheap model for the easy 80%, a frontier model for the hard 20%:

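    // complexity is your own task classifier (not part of Spring AI);
    // anthropicChatModel and smallChatModel are the injected ChatModel beans.
    ChatClient client = complexity.isHigh(task)
        ? frontierClient    // e.g. ChatClient.builder(anthropicChatModel).build()
        : fastClient;       // e.g. ChatClient.builder(smallChatModel).build()
    String out = client.prompt().user(task).call().content();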

    Centralise this behind a service (or an AI gateway) so apps don't each reinvent routing, fallback, and budgets.
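
    A centralised service of that kind might look roughly like the following. It is a minimal single-threaded sketch, not a gateway implementation: LlmCall and RoutingService are hypothetical names, the budget is a simple count of frontier calls rather than a token or spend limit, and in a Spring AI application each LlmCall would wrap a ChatClient.

```java
import java.util.function.Predicate;

// Hypothetical stand-in for "send this prompt to one model".
interface LlmCall {
    String complete(String prompt);
}

// Routes by task difficulty, enforces a frontier-call budget, and falls back on failure.
final class RoutingService {
    private final LlmCall frontier;
    private final LlmCall fast;
    private final LlmCall fallback;
    private final Predicate<String> isHard;
    private int frontierCallsLeft; // not thread-safe; use AtomicInteger under concurrency

    RoutingService(LlmCall frontier, LlmCall fast, LlmCall fallback,
                   Predicate<String> isHard, int frontierBudget) {
        this.frontier = frontier;
        this.fast = fast;
        this.fallback = fallback;
        this.isHard = isHard;
        this.frontierCallsLeft = frontierBudget;
    }

    String complete(String task) {
        // Hard tasks go to the frontier model only while budget remains.
        boolean useFrontier = isHard.test(task) && frontierCallsLeft > 0;
        if (useFrontier) {
            frontierCallsLeft--;
        }
        LlmCall primary = useFrontier ? frontier : fast;
        try {
            return primary.complete(task);
        } catch (RuntimeException primaryFailure) {
            // Primary provider failed (outage, rate limit): try the fallback once.
            try {
                return fallback.complete(task);
            } catch (RuntimeException fallbackFailure) {
                throw new IllegalStateException("All providers failed", fallbackFailure);
            }
        }
    }
}
```

    Keeping routing, budget, and fallback in one class means a policy change, such as a stricter budget, happens in one place instead of in every calling application.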

    Best practices & anti-patterns

    • ✅ Abstract on ChatClient; keep provider specifics in config so you can switch/route.
    • ✅ Benchmark candidate models on your eval set, not leaderboards.
    • ✅ Have a fallback provider for outages.
    • ❌ Hardcoding one provider's SDK throughout your app (lock-in, no routing).
    • ❌ Using a frontier model for trivial classification (cost) — or a tiny model for hard reasoning (quality).
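
    Benchmarking on your own eval set does not need heavy tooling to start. The sketch below scores candidate models by exact-match accuracy on labelled cases, which suits classification and routing tasks; EvalCase and EvalHarness are hypothetical names, each candidate is modelled as a plain function from prompt to answer, and open-ended tasks would need a richer scoring method than exact match.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// One labelled example from your own eval set.
record EvalCase(String input, String expected) {}

final class EvalHarness {
    // Fraction of cases where the model's trimmed answer matches the label (case-insensitive).
    static double accuracy(Function<String, String> model, List<EvalCase> cases) {
        if (cases.isEmpty()) {
            throw new IllegalArgumentException("Eval set is empty");
        }
        long correct = cases.stream()
            .filter(c -> c.expected().equalsIgnoreCase(model.apply(c.input()).trim()))
            .count();
        return (double) correct / cases.size();
    }

    // Scores every candidate on the same cases, preserving the candidates' iteration order.
    static Map<String, Double> compare(Map<String, Function<String, String>> candidates,
                                       List<EvalCase> cases) {
        Map<String, Double> scores = new LinkedHashMap<>();
        candidates.forEach((name, model) -> scores.put(name, accuracy(model, cases)));
        return scores;
    }
}
```

    Running every candidate against the same fixed cases is what makes the scores comparable, and the resulting numbers are the kind of input a routing or selection policy can act on.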

    Next: make it production-grade in Production Architecture.
