Spring AI — RAG & Vector Stores — Avaneesh Yadav

LLMs don't know your internal docs, policies, or last week's data, and they hallucinate around the gaps. Retrieval-Augmented Generation (RAG) addresses this by fetching the relevant content at query time and placing it in the model's context. Spring AI gives you the building blocks.

The two phases

flowchart LR
  subgraph Ingestion
    D[Documents] --> S[Split into chunks]
    S --> E1[EmbeddingModel]
    E1 --> VS[(Vector store)]
  end
  subgraph Query
    Q[User question] --> E2[EmbeddingModel]
    E2 --> VS
    VS --> C[Top-k chunks]
    C --> P[Prompt + context]
    P --> M[ChatClient]
    M --> A[Grounded answer]
  end

Ingestion (offline): documents → chunks → embeddings → vector store. Query (online): question → embedding → similarity search → relevant chunks → prompt → answer.
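To make the query phase concrete, here is a minimal sketch of what a similarity search does conceptually: score every stored vector against the query vector and keep the best k. It is plain Java with hand-written two-dimensional vectors, not Spring AI's VectorStore implementation; the names TopKSearch, Entry, cosine and topK are made up for illustration, and a real store would use an index rather than a full scan.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class TopKSearch {
    /** A stored chunk: its text plus the embedding computed at ingestion time. */
    record Entry(String text, double[] vector) {}

    /** Cosine similarity between two non-zero vectors of equal length. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the texts of the k entries most similar to the query, best first. */
    static List<String> topK(List<Entry> store, double[] query, int k) {
        return store.stream()
                .sorted(Comparator.comparingDouble((Entry e) -> cosine(e.vector(), query)).reversed())
                .limit(k)
                .map(Entry::text)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Toy vectors stand in for real embeddings (which have hundreds of dimensions).
        List<Entry> store = List.of(
                new Entry("Refunds are issued within 14 days.", new double[] {1.0, 0.0}),
                new Entry("Returns need a receipt.", new double[] {0.9, 0.1}),
                new Entry("Office hours are 9 to 5.", new double[] {0.0, 1.0}));
        System.out.println(topK(store, new double[] {1.0, 0.05}, 2));
    }
}
```

The same EmbeddingModel has to produce both the stored vectors and the query vector; otherwise the similarity scores are meaningless.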

Embeddings & vector store

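Add an embedding model starter and a vector store starter. Spring AI supports pgvector, Redis, Qdrant, Pinecone, Chroma, and more behind one VectorStore interface. The store's configured dimension must match the embedding model's output size (1536 for text-embedding-3-small):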

spring.ai.openai.embedding.options.model=text-embedding-3-small
spring.ai.vectorstore.pgvector.dimensions=1536

Ingestion

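import java.util.List;
import org.springframework.ai.document.Document;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
var reader = new TikaDocumentReader(resource);     // resource: a Spring Resource (PDF, docx, html…)
var splitter = new TokenTextSplitter();            // token-based chunking with default settings
List<Document> chunks = splitter.apply(reader.get());
vectorStore.add(chunks);                           // injected VectorStore: embeds + stores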

Attach metadata (source, title, section) to each Document — you'll use it for filtering and provenance.
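As a rough illustration of chunking with metadata attached, the sketch below splits text into overlapping windows and tags each chunk with its source and position. It is dependency-free Java, not Spring AI's TokenTextSplitter: words stand in for tokens, and the names OverlapChunker, Chunk, split and the metadata keys "source" and "chunkIndex" are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class OverlapChunker {
    /** A chunk of text plus the metadata used later for filtering and provenance. */
    record Chunk(String text, Map<String, Object> metadata) {}

    /**
     * Splits text into windows of `size` words; consecutive windows share `overlap` words.
     * Words approximate tokens here to keep the sketch free of tokenizer dependencies.
     */
    static List<Chunk> split(String text, String source, int size, int overlap) {
        if (size <= 0 || overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("need 0 <= overlap < size");
        }
        List<Chunk> chunks = new ArrayList<>();
        if (text.isBlank()) {
            return chunks; // nothing to chunk
        }
        String[] words = text.trim().split("\\s+");
        int step = size - overlap; // how far each window advances
        for (int start = 0; start < words.length; start += step) {
            int end = Math.min(start + size, words.length);
            String body = String.join(" ", Arrays.copyOfRange(words, start, end));
            chunks.add(new Chunk(body, Map.of("source", source, "chunkIndex", chunks.size())));
            if (end == words.length) {
                break; // the last window already reached the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // 10 words, windows of 4 with 1 word of overlap.
        for (Chunk c : split("a b c d e f g h i j", "handbook.pdf", 4, 1)) {
            System.out.println(c.metadata().get("chunkIndex") + ": " + c.text());
        }
    }
}
```

With real documents you would typically split on headings first and apply a window like this only inside sections that are still too long.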

Retrieval + generation

The clean way is the RAG advisor, which retrieves and injects context automatically:

String answer = chat.prompt()
    .advisors(new QuestionAnswerAdvisor(vectorStore))
    .user(question)
    .call()
    .content();

Or do it manually for control: vectorStore.similaritySearch(...) → build the prompt with the chunks → call.
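For the manual route, the prompt-building step might look something like the sketch below. It covers only the string assembly, with the retrieval and the model call left out; RagPromptBuilder, Retrieved and the instruction wording are hypothetical and not part of Spring AI. Static instructions come first, and each chunk carries its source so the answer can be traced back.

```java
import java.util.List;

class RagPromptBuilder {
    /** A retrieved chunk with the source it came from (hypothetical type for this sketch). */
    record Retrieved(String source, String text) {}

    // Static instructions go first so the prompt prefix is identical across requests.
    static final String INSTRUCTIONS =
            "Answer using only the context below. If the context does not contain the answer, say so.";

    /** Builds: instructions, then numbered chunks with their sources, then the question. */
    static String build(List<Retrieved> chunks, String question) {
        StringBuilder sb = new StringBuilder(INSTRUCTIONS).append("\n\nContext:\n");
        for (int i = 0; i < chunks.size(); i++) {
            Retrieved c = chunks.get(i);
            sb.append("[").append(i + 1).append("] (").append(c.source()).append(") ")
              .append(c.text()).append("\n");
        }
        return sb.append("\nQuestion: ").append(question).toString();
    }

    public static void main(String[] args) {
        List<Retrieved> chunks = List.of(
                new Retrieved("policy.pdf", "Refunds are issued within 14 days."),
                new Retrieved("faq.html", "Returns need a receipt."));
        System.out.println(build(chunks, "How long do refunds take?"));
    }
}
```

The resulting string would then be passed as the user message, in place of the bare question.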

What actually determines quality

RAG quality is capped by retrieval, and retrieval is capped by chunking:

Chunk on structure (headings/sections) where possible; ~300–800 tokens with 10–20% overlap as a starting point.
Filter by metadata before vector search (e.g. by product, recency) to cut noise.
Tune top-k — too many chunks cause "context rot"; too few miss the answer.
Measure retrieval recall with a golden set before touching the prompt.
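One way to measure retrieval against a golden set is recall@k: for each question, the fraction of known-relevant chunks that appear in the top k results, averaged over all questions. The sketch below assumes chunks have string ids and that every golden question lists at least one relevant chunk; RetrievalRecall and recallAtK are illustrative names, not a Spring AI API.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

class RetrievalRecall {
    /**
     * Recall@k averaged over a non-empty golden set.
     * golden: question -> ids of chunks that should be retrieved (non-empty per question).
     * retrieved: question -> ranked chunk ids the retriever actually returned.
     */
    static double recallAtK(Map<String, Set<String>> golden,
                            Map<String, List<String>> retrieved, int k) {
        double sum = 0;
        for (Map.Entry<String, Set<String>> e : golden.entrySet()) {
            Set<String> relevant = e.getValue();
            List<String> ranked = retrieved.getOrDefault(e.getKey(), List.of());
            List<String> topK = ranked.subList(0, Math.min(k, ranked.size()));
            long hits = relevant.stream().filter(topK::contains).count();
            sum += (double) hits / relevant.size();
        }
        return sum / golden.size();
    }

    public static void main(String[] args) {
        Map<String, Set<String>> golden = Map.of(
                "refund window?", Set.of("c1", "c2"),
                "office hours?", Set.of("c7"));
        Map<String, List<String>> retrieved = Map.of(
                "refund window?", List.of("c1", "c9", "c2"),
                "office hours?", List.of("c4", "c5", "c7"));
        System.out.println("recall@2 = " + recallAtK(golden, retrieved, 2));
        System.out.println("recall@3 = " + recallAtK(golden, retrieved, 3));
    }
}
```

Running this for several values of k shows where recall flattens out, which gives a data-driven starting point for top-k.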

Best practices & anti-patterns

✅ Add metadata + provenance so answers are traceable and conflicts surface.
✅ Keep static instructions first in the prompt (cache-friendly).
❌ Don't dump whole documents into the prompt — retrieve.
❌ Don't fine-tune facts in; they go stale. Retrieve fresh data instead.

Next: standardise tool access and build agents in MCP & Agents.
