Spring AI
    June 10, 2026

    Spring AI — RAG & Vector Stores

    Ground LLM answers in your own data with Spring AI: embeddings, vector stores, the ingestion and retrieval pipeline, and enterprise RAG patterns.

    LLMs don't know your internal docs, policies, or last week's data, and they hallucinate to fill the gaps. Retrieval-Augmented Generation (RAG) addresses this by fetching the relevant content at query time and placing it in the model's context. Spring AI gives you the building blocks.

    The two phases

    flowchart LR
      subgraph Ingestion
        D[Documents] --> S[Split into chunks]
        S --> E1[EmbeddingModel]
        E1 --> VS[(Vector store)]
      end
      subgraph Query
        Q[User question] --> E2[EmbeddingModel]
        E2 --> VS
        VS --> C[Top-k chunks]
        C --> P[Prompt + context]
        P --> M[ChatClient]
        M --> A[Grounded answer]
      end

    Ingestion (offline): documents → chunks → embeddings → vector store. Query (online): question → embedding → similarity search → relevant chunks → prompt → answer.
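
To make the two phases concrete, here is a minimal in-memory sketch in plain Java. It does not use Spring AI: the class name `TwoPhaseSketch` is hypothetical, and `embed` is a toy bag-of-words stand-in for a real embedding model (an assumption for illustration only). What carries over is the shape of the flow: embed and store at ingestion time, then embed the question and rank by similarity at query time.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy, in-memory illustration of the two RAG phases; not Spring AI code. */
class TwoPhaseSketch {

    /** A stored chunk together with its (sparse) vector. */
    record Entry(String text, Map<String, Integer> vector) {}

    private final List<Entry> store = new ArrayList<>();

    /** Hypothetical stand-in for an embedding model: lower-cased word counts. */
    static Map<String, Integer> embed(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    private static double norm(Map<String, Integer> v) {
        double sum = 0;
        for (int c : v.values()) {
            sum += c * c;
        }
        return Math.sqrt(sum);
    }

    /** Cosine similarity; returns 0 when either vector is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0 || nb == 0) ? 0 : dot / (na * nb);
    }

    /** Ingestion phase: embed each chunk and keep it with its vector. */
    void ingest(List<String> chunks) {
        for (String chunk : chunks) {
            store.add(new Entry(chunk, embed(chunk)));
        }
    }

    /** Query phase: embed the question and return the k most similar chunks. */
    List<String> topK(String question, int k) {
        Map<String, Integer> q = embed(question);
        return store.stream()
                .sorted(Comparator.comparingDouble((Entry e) -> cosine(q, e.vector())).reversed())
                .limit(k)
                .map(Entry::text)
                .toList();
    }

    public static void main(String[] args) {
        TwoPhaseSketch rag = new TwoPhaseSketch();
        rag.ingest(List.of(
                "Refunds are processed within 14 days.",
                "The office is closed on public holidays.",
                "Passwords must be rotated every 90 days."));
        System.out.println(rag.topK("How long do refunds take?", 1));
    }
}
```

A real embedding model captures meaning rather than exact word overlap, so it can match "reimbursement" to "refund"; the toy version here cannot.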

    Embeddings & vector store

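    Add an embedding model starter and a vector store starter. Spring AI supports pgvector, Redis, Qdrant, Pinecone, Chroma, and more behind one VectorStore interface. The configured dimensions must match the embedding model's output size (1536 for text-embedding-3-small):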

    spring.ai.openai.embedding.options.model=text-embedding-3-small
    spring.ai.vectorstore.pgvector.dimensions=1536

    Ingestion

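    import java.util.List;
    import org.springframework.ai.document.Document;
    import org.springframework.ai.reader.tika.TikaDocumentReader;
    import org.springframework.ai.transformer.splitter.TokenTextSplitter;
    var reader = new TikaDocumentReader(resource);     // resource: a Spring Resource (PDF, docx, html…)
    var splitter = new TokenTextSplitter();            // token-based chunking with default settings
    List<Document> chunks = splitter.apply(reader.get());
    vectorStore.add(chunks);                           // vectorStore: a VectorStore bean; embeds + stores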

    Attach metadata (source, title, section) to each Document — you'll use it for filtering and provenance.
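
As a rough illustration of what chunking with overlap and per-chunk metadata looks like, here is a plain-Java sketch. It is not how TokenTextSplitter works internally: the `ChunkingSketch` class, the `Chunk` record, and the `chunkIndex` key are hypothetical, and whitespace-separated words stand in for tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Sliding-window chunker that tags each chunk with provenance metadata. */
class ChunkingSketch {

    /** A chunk of text plus the metadata that travels with it. */
    record Chunk(String text, Map<String, Object> metadata) {}

    /**
     * Splits text into windows of `size` words; each window shares
     * `overlap` words with the previous one. Words approximate tokens here.
     */
    static List<Chunk> split(String text, String source, int size, int overlap) {
        if (size <= 0 || overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("need size > 0 and 0 <= overlap < size");
        }
        List<Chunk> chunks = new ArrayList<>();
        String[] words = text.trim().split("\\s+");
        if (words.length == 1 && words[0].isEmpty()) {
            return chunks; // blank input produces no chunks
        }
        int step = size - overlap;
        for (int start = 0; start < words.length; start += step) {
            int end = Math.min(start + size, words.length);
            String body = String.join(" ", Arrays.copyOfRange(words, start, end));
            // Metadata travels with the chunk so it can be filtered and cited later.
            chunks.add(new Chunk(body, Map.of("source", source, "chunkIndex", chunks.size())));
            if (end == words.length) {
                break; // the last window reached the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        for (Chunk c : split("a b c d e f g h i j", "handbook.pdf", 4, 1)) {
            System.out.println(c.metadata() + " -> " + c.text());
        }
    }
}
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of storing some words twice.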

    Retrieval + generation

    The simplest route is the QuestionAnswerAdvisor, which retrieves matching chunks and injects them into the prompt automatically:

    String answer = chat.prompt()
        .advisors(new QuestionAnswerAdvisor(vectorStore))
        .user(question)
        .call()
        .content();

    For more control, do it manually: call vectorStore.similaritySearch(...), build the prompt from the returned chunks, then call the model.
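
The prompt-building step of the manual route might look like the sketch below. It is library-agnostic plain Java, not Spring AI code: `PromptSketch`, the `Retrieved` record, and the instruction wording are all assumptions for illustration. It puts the fixed instructions first and numbers each chunk with its source so the answer can cite it.

```java
import java.util.List;

/** Assembles a grounded prompt from retrieved chunks; illustrative only. */
class PromptSketch {

    /** A retrieved chunk with the source it came from (for provenance). */
    record Retrieved(String source, String text) {}

    /** Static instructions first, then numbered context, then the question. */
    static String buildPrompt(List<Retrieved> chunks, String question) {
        StringBuilder sb = new StringBuilder();
        sb.append("Answer using only the context below. ")
          .append("If the context does not contain the answer, say so. ")
          .append("Cite sources by their bracketed number.\n\n")
          .append("Context:\n");
        for (int i = 0; i < chunks.size(); i++) {
            Retrieved c = chunks.get(i);
            sb.append('[').append(i + 1).append("] (").append(c.source()).append(") ")
              .append(c.text()).append('\n');
        }
        sb.append("\nQuestion: ").append(question);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildPrompt(
                List.of(new Retrieved("policy.pdf", "Refunds take 14 days.")),
                "How long do refunds take?"));
    }
}
```

In a Spring AI application the resulting string would be passed as the user or system text of a ChatClient call, and the chunks would come from the similarity search rather than being hard-coded.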

    What actually determines quality

    RAG quality is capped by retrieval, and retrieval is capped by chunking:

    • Chunk on structure (headings/sections) where possible; ~300–800 tokens with 10–20% overlap as a starting point.
    • Filter by metadata before vector search (e.g. by product, recency) to cut noise.
    • Tune top-k — too many chunks cause "context rot"; too few miss the answer.
    • Measure retrieval recall with a golden set before touching the prompt.
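
Measuring retrieval recall against a golden set can be sketched as follows. The names (`RecallSketch`, `GoldenCase`) and the retriever signature are hypothetical; the retriever is passed in as a function so the same harness could wrap a real vector store search. Recall@k here means the fraction of known-relevant chunk ids that appear in the top k results, averaged over all questions.

```java
import java.util.List;
import java.util.Set;
import java.util.function.BiFunction;

/** Computes mean recall@k over a hand-labelled golden set; illustrative only. */
class RecallSketch {

    /** One golden-set entry: a question and the ids of the chunks that answer it. */
    record GoldenCase(String question, Set<String> relevantIds) {}

    /**
     * For each case, the fraction of relevant ids found in the top k retrieved ids,
     * averaged over all cases. The retriever maps (question, k) to ranked chunk ids.
     */
    static double recallAtK(List<GoldenCase> golden,
                            BiFunction<String, Integer, List<String>> retriever,
                            int k) {
        if (golden.isEmpty()) {
            return 0.0;
        }
        double total = 0;
        for (GoldenCase c : golden) {
            List<String> retrieved = retriever.apply(c.question(), k);
            long hits = retrieved.stream()
                    .limit(k)
                    .distinct()
                    .filter(c.relevantIds()::contains)
                    .count();
            total += c.relevantIds().isEmpty() ? 0.0 : (double) hits / c.relevantIds().size();
        }
        return total / golden.size();
    }

    public static void main(String[] args) {
        List<GoldenCase> golden = List.of(
                new GoldenCase("refund window?", Set.of("policy-3", "policy-4")),
                new GoldenCase("password rotation?", Set.of("security-1")));
        // Stub retriever standing in for a real similarity search.
        BiFunction<String, Integer, List<String>> stub = (q, k) ->
                q.startsWith("refund") ? List.of("policy-3", "faq-9") : List.of("security-1", "faq-2");
        System.out.println(recallAtK(golden, stub, 2));
    }
}
```

Running this before and after a chunking or top-k change gives a number to compare, which is more reliable than judging a handful of generated answers by eye.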

    Best practices & anti-patterns

    • ✅ Add metadata + provenance so answers are traceable and conflicts surface.
    • ✅ Keep static instructions first in the prompt (cache-friendly).
    • ❌ Don't dump whole documents into the prompt — retrieve.
    • ❌ Don't fine-tune facts in; they go stale. Retrieve fresh data instead.

    Next: standardise tool access and build agents → MCP & Agents
