RAG Systems Explained
What you'll learn: By the end of this guide you will be able to design a production-grade RAG pipeline, choose the right chunking strategy, implement hybrid search, evaluate retrieval quality, and debug the most common RAG failure modes.
Large language models have a fixed knowledge cutoff and know nothing about your company's internal data. Retrieval-Augmented Generation (RAG) addresses this by giving the LLM access to an external knowledge store at inference time, combining the reasoning ability of an LLM with the factual grounding of a search system.
The result is an AI that can answer questions about your specific domain without retraining, that resists hallucination because its answers are grounded in real documents, and that stays as current as the knowledge store, which updates independently of the model.
The Core Idea
Without RAG:
User: "What is our refund policy for Enterprise customers?"
LLM: [Guesses based on training data — likely wrong or hallucinated]
With RAG:
User: "What is our refund policy for Enterprise customers?"
Retriever: [Fetches relevant policy documents from your knowledge base]
LLM: [Reads the actual documents and answers accurately]
The LLM is no longer relying on memory — it's reading the right documents at the moment of the question.
RAG Pipeline Overview
Figure 1: Full RAG pipeline — ingestion (left) and query (right)
The Five Components of a RAG System
1. Document Ingestion Pipeline
Before you can retrieve documents, you need to process them into a format suitable for semantic search. The ingestion pipeline:
Load documents from your sources (PDFs, Confluence, SharePoint, databases, S3 objects).
Split documents into chunks. This is more nuanced than it sounds:
- Chunks too large → retrieval returns irrelevant sections alongside relevant ones, confusing the LLM
- Chunks too small → chunks lose context (a sentence without its surrounding paragraph)
- Optimal chunk size depends on your content type: 256–512 tokens for dense technical docs, 512–1024 tokens for narrative content
Use recursive character splitting with overlap:
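from langchain.text_splitter import RecursiveCharacterTextSplitter

# Note: chunk_size and chunk_overlap are measured in characters by default
# (length_function=len). To size chunks in tokens, build the splitter with
# RecursiveCharacterTextSplitter.from_tiktoken_encoder(...) instead.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,  # Overlap prevents losing context at boundaries
    separators=["\n\n", "\n", ".", " "],
)
chunks = splitter.split_text(document_content)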
Enrich each chunk with metadata: source document, section heading, creation date, author, document type. This metadata enables filtered retrieval ("only search policy documents, not engineering specs").
Embed each chunk using an embedding model. The embedding converts text into a high-dimensional vector that encodes semantic meaning.
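To make the split and enrich steps concrete, here is a minimal, dependency-free sketch. It assumes whitespace-separated words as a stand-in for tokens, and the names `Chunk`, `chunk_with_overlap`, and `ingest` are illustrative only, not part of any library. The sample sentence is made up.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_with_overlap(words, chunk_size, overlap):
    """Split a list of words into windows of chunk_size that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the document
    return windows


def ingest(text, source, doc_type, chunk_size=8, overlap=2):
    """Chunk a document and attach source metadata to every chunk."""
    windows = chunk_with_overlap(text.split(), chunk_size, overlap)
    return [
        Chunk(" ".join(w), {"source": source, "document_type": doc_type, "chunk_index": i})
        for i, w in enumerate(windows)
    ]


# Hypothetical 20-word document; sizes are tiny so the overlap is easy to see.
chunks = ingest(
    "Enterprise customers may request a refund within thirty days of the invoice date "
    "provided the subscription has not been renewed",
    source="enterprise-policy-2024-q4.pdf",
    doc_type="policy",
)
```

Each chunk begins with the last two words of the previous one, which is the boundary protection that overlap is meant to provide. In a real pipeline the window would be measured with the embedding model's tokenizer rather than in words.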
2. Embedding Models
An embedding model converts text into a vector representation. Semantically similar texts produce vectors that are close together in vector space — this is what enables semantic search.
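"Close together" is usually measured with cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embedding models emit hundreds or thousands of dimensions, and the variable names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Made-up "embeddings" for three texts (assumed values, not model output).
refund_policy = [0.9, 0.1, 0.0]
money_back_terms = [0.8, 0.2, 0.1]
kubernetes_upgrade = [0.0, 0.2, 0.9]

similar = cosine_similarity(refund_policy, money_back_terms)
unrelated = cosine_similarity(refund_policy, kubernetes_upgrade)
```

With these toy values the two refund-related vectors score far higher than the unrelated pair, which is the property semantic search relies on.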
Common embedding models:
| Model | Dimensions | Strengths |
|---|---|---|
| OpenAI text-embedding-3-small | 1536 | Best quality/cost for English |
| OpenAI text-embedding-3-large | 3072 | Highest quality, several times the cost of -small (about 6.5x at launch list prices) |
| Cohere embed-v3 | 1024 | Multilingual, good for mixed-language corpora |
| sentence-transformers/all-MiniLM | 384 | Free, fast, good for on-premise |
| nomic-embed-text | 768 | Open source, strong performance |
Critical rule: Use the same embedding model for ingestion and retrieval. If you embed documents with OpenAI and query with Cohere, the vector spaces don't align — retrieval will be meaningless.
3. Vector Database
The vector database stores your chunk embeddings and enables fast approximate nearest-neighbour (ANN) search — finding the chunks most semantically similar to a query.
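It helps to see what ANN search approximates. The sketch below is the exact, brute-force version: it scores every stored vector against the query. The `index` layout and the `exact_top_k` name are assumptions for illustration; a vector database replaces this linear scan with an index structure (pgvector, for instance, offers HNSW and IVFFlat indexes) that trades a little accuracy for much lower latency.

```python
import heapq


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def exact_top_k(query_vec, index, k):
    """Score every stored vector against the query and return the k best ids.

    `index` maps chunk id -> unit-length vector, so the dot product equals
    cosine similarity.
    """
    scored = ((dot(query_vec, vec), chunk_id) for chunk_id, vec in index.items())
    return [chunk_id for _, chunk_id in heapq.nlargest(k, scored)]


# Toy two-dimensional index with made-up unit vectors.
index = {
    "refund-policy": [1.0, 0.0],
    "billing-faq": [0.6, 0.8],
    "k8s-runbook": [0.0, 1.0],
}
top = exact_top_k([0.8, 0.6], index, k=2)
```

The scan costs one dot product per stored chunk, which is fine for thousands of vectors and too slow for many millions. That cost is the reason ANN indexes exist.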
Popular options and when to use them:
pgvector (PostgreSQL extension) — Use when you're already on PostgreSQL and don't want additional infrastructure. Excellent for < 10M vectors. Free, managed by your existing DBA team.
Pinecone — Fully managed, scales to billions of vectors, strong filtering support. Best for large-scale production deployments where operational simplicity is worth the cost.
Chroma — Open source, runs embedded in your application process. Perfect for development and small-scale deployments.
Weaviate — Open source with a managed cloud option. Strong multi-modal support (text + images).
Qdrant — Rust-based, very fast, supports rich payload filtering. Good for high-throughput use cases.
For most enterprise teams starting with RAG: use pgvector. It runs on infrastructure you already manage, you understand how to back it up, and it scales to millions of documents without issue.
4. Retrieval Layer
When a query arrives, the retriever finds the most relevant chunks. The naive approach — semantic similarity only — leaves performance on the table.
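Hybrid search combines semantic search (vector similarity) with keyword search (BM25/full-text) and merges the two result lists using reciprocal rank fusion (RRF). RRF scores each document by summing 1/(k + rank) across the lists it appears in, so it needs only ranks, not comparable scores. Hybrid search tends to improve recall for queries containing specific product names, IDs, or technical terms, which embedding-based search often handles poorly.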
# embed, vector_store, full_text_search, and reciprocal_rank_fusion are
# application-specific helpers that you supply.
def hybrid_search(query: str, top_k: int = 10) -> list[Document]:
    # Semantic search
    query_embedding = embed(query)
    semantic_results = vector_store.similarity_search(
        query_embedding, top_k=top_k
    )
    # Keyword search
    keyword_results = full_text_search(query, top_k=top_k)
    # Reciprocal rank fusion
    return reciprocal_rank_fusion(semantic_results, keyword_results)
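The fusion step itself is short. The following is one possible sketch of reciprocal rank fusion, assuming each result list is already reduced to document ids in rank order (the snippet above passes documents, so you would map them to ids first). The constant `k=60` is a commonly used default, and the alphabetical tie-break is an arbitrary choice made here for determinism.

```python
from collections import defaultdict


def reciprocal_rank_fusion(*ranked_lists, k=60, top_n=None):
    """Fuse several ranked lists of document ids.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is stable.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused[:top_n] if top_n is not None else fused


semantic = ["doc-a", "doc-b", "doc-c"]
keyword = ["doc-c", "doc-a", "doc-d"]
fused = reciprocal_rank_fusion(semantic, keyword)
```

Documents that appear in both lists (`doc-a`, `doc-c`) rise above documents found by only one retriever, which is the behaviour hybrid search depends on.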
Metadata filtering narrows the candidate set, ideally before semantic similarity is computed (pre-filtering; some vector stores instead apply the filter after the ANN search):
results = vector_store.similarity_search(
    query_embedding,
    filter={"document_type": "policy", "department": "enterprise-sales"},
    top_k=5,
)
Without filtering, a question about Enterprise refund policies might retrieve Consumer product documentation — technically relevant content that confuses the LLM.
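As an illustration of pre-filtering, the sketch below drops non-matching chunks before any similarity is computed. The chunk layout (`id`, `vector`, `metadata`) and the function names are assumptions made for this example; a real vector store pushes the filter into its own query engine.

```python
def matches(metadata, required):
    """True when every key in `required` has the same value in `metadata`."""
    return all(metadata.get(key) == value for key, value in required.items())


def filtered_search(query_vec, chunks, required, top_k=5):
    """Drop chunks that fail the metadata filter, then rank the rest by dot product."""
    candidates = [c for c in chunks if matches(c["metadata"], required)]
    candidates.sort(
        key=lambda c: sum(q * v for q, v in zip(query_vec, c["vector"])),
        reverse=True,
    )
    return candidates[:top_k]


# Made-up chunks: the consumer document is the closest vector, but the wrong audience.
chunks = [
    {"id": "ent-refunds", "vector": [0.9, 0.1],
     "metadata": {"document_type": "policy", "department": "enterprise-sales"}},
    {"id": "consumer-refunds", "vector": [1.0, 0.0],
     "metadata": {"document_type": "policy", "department": "consumer"}},
    {"id": "ent-api-spec", "vector": [0.2, 0.8],
     "metadata": {"document_type": "spec", "department": "enterprise-sales"}},
]
hits = filtered_search(
    [1.0, 0.0], chunks,
    required={"document_type": "policy", "department": "enterprise-sales"},
)
```

Without the filter, `consumer-refunds` would rank first on similarity alone. With it, only the enterprise policy chunk is returned.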
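Re-ranking is a post-retrieval step that uses a cross-encoder model to re-score the top-K retrieved chunks against the query. Cross-encoders are slower than bi-encoders but usually much more accurate because they process the query and document jointly rather than embedding them separately. Use a re-ranker (a hosted service such as Cohere Rerank, or a local cross-encoder) when precision matters. ColBERT is a related option, though it is a late-interaction model rather than a true cross-encoder.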
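The shape of the re-ranking step can be sketched without a model. In the example below, `rerank` accepts any function that scores a query and a document together; `token_overlap` is a deliberately crude stand-in so the example runs offline, and it is not a cross-encoder. In practice you would pass a function that calls your re-ranking model instead.

```python
def rerank(query, candidates, score_pair, top_n=3):
    """Re-score each (query, candidate) pair jointly and keep the best top_n."""
    scored = [(score_pair(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def token_overlap(query, text):
    """Toy scorer (assumption): fraction of query words that occur in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


# Made-up first-stage results, in the order the retriever returned them.
retrieved = [
    "Consumer refund requests are handled by the support portal",
    "Kubernetes upgrade checklist for the platform team",
    "Enterprise refund policy for annual subscriptions",
]
best = rerank("enterprise refund policy", retrieved, token_overlap, top_n=2)
```

The design point is that the scorer sees the query and the candidate together, which is what distinguishes re-ranking from the first-stage vector lookup.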
5. Generation Layer
With retrieved chunks in hand, the LLM generates a grounded response. The prompt structure:
SYSTEM_PROMPT = """
You are a knowledge assistant for Acme Corp. Answer questions
ONLY based on the provided context documents.
If the answer is not clearly supported by the context, say:
"I don't have sufficient information to answer this accurately."
Never supplement context with your training data. Cite the source
document for each factual claim you make.
"""
def generate_answer(query: str, context_chunks: list[Document]) -> str:
    context = "\n\n---\n\n".join([
        f"Source: {chunk.metadata['source']}\n{chunk.content}"
        for chunk in context_chunks
    ])
    return llm.complete(
        system=SYSTEM_PROMPT,
        user=f"Context:\n{context}\n\nQuestion: {query}"
    )
Key principles:
- Ground the model in context first. Put retrieved documents before the question.
- Explicit fallback instruction. Tell the model exactly what to say when it doesn't know.
- Request citations. Makes hallucinations visible and responses verifiable.
Advanced Patterns
Contextual Chunk Headers
Before embedding each chunk, prepend a generated summary of its position in the document:
Section: Enterprise Refund Policy > Eligibility > SaaS Subscriptions
Document: enterprise-policy-2024-q4.pdf
[original chunk content]
This improves retrieval accuracy for hierarchical documents (policy manuals, technical specifications) because the chunk carries context about where it sits in the document structure.
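A minimal sketch of building such a header, assuming you already know each chunk's section path from parsing the document. The function name and the sample chunk text are hypothetical.

```python
def add_context_header(chunk_text, section_path, document_name):
    """Prepend the section breadcrumb and document name to a chunk before embedding."""
    breadcrumb = " > ".join(section_path)
    return f"Section: {breadcrumb}\nDocument: {document_name}\n{chunk_text}"


text_to_embed = add_context_header(
    "Refunds are prorated from the cancellation date.",  # made-up chunk content
    ["Enterprise Refund Policy", "Eligibility", "SaaS Subscriptions"],
    "enterprise-policy-2024-q4.pdf",
)
```

One design choice to consider: embed the header together with the chunk, but keep the original chunk text stored separately so the header does not consume prompt space at generation time.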
Query Rewriting
User queries are often underspecified. "What about refunds?" is worse for retrieval than "What is the enterprise refund policy for annual SaaS subscriptions?" Use the LLM to rewrite the query before retrieval:
def rewrite_query(user_query: str, conversation_history: list) -> str:
    return llm.complete(
        system="Rewrite the user's query to be self-contained and specific, "
               "incorporating relevant context from the conversation history.",
        user=f"History: {conversation_history}\nQuery: {user_query}"
    )
HyDE (Hypothetical Document Embeddings)
Instead of embedding the user's query directly, generate a hypothetical document that would answer the query and embed that:
def hyde_retrieval(query: str) -> list[Document]:
    # Generate a hypothetical answer document
    hypothetical_doc = llm.complete(
        system="Write a detailed, factual paragraph that would answer this question.",
        user=query
    )
    # Embed the hypothetical document instead of the query
    embedding = embed(hypothetical_doc)
    return vector_store.similarity_search(embedding, top_k=5)
HyDE often improves recall because the hypothetical document is in the same distribution as your actual documents — closer in vector space to relevant chunks than the query itself.
Evaluating RAG Quality
RAG systems require specific evaluation metrics:
Retrieval metrics:
- Recall@K: what fraction of the relevant chunks appear in the top-K retrieved chunks?
- Mean Reciprocal Rank (MRR): averaged over all test questions, how high does the first relevant chunk rank? Each question scores 1/rank of its first relevant result.
- Context precision: are the retrieved chunks actually relevant to the question?
Generation metrics:
- Faithfulness: does the answer contain only claims supported by the retrieved context?
- Answer relevance: does the answer actually address the question?
Use frameworks like RAGAS or TruLens to automate these evaluations. Run them on a held-out set of 100–200 question-answer pairs and treat quality regressions as failing tests.
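The two retrieval metrics are simple enough to compute by hand, which is useful for sanity-checking a framework's output. The sketch below assumes each test question comes with a list of retrieved chunk ids and a list of ids labeled relevant; the ids and function names are made up for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the first k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per question."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["c1", "c7", "c3"], ["c3", "c9"]),  # first relevant hit at rank 3
    (["c4", "c2", "c8"], ["c4"]),        # first relevant hit at rank 1
]
recall = recall_at_k(*runs[0], k=3)  # one of two relevant chunks found
mrr = mean_reciprocal_rank(runs)     # mean of 1/3 and 1/1
```

Faithfulness and answer relevance need a judge (human or LLM) and are better left to an evaluation framework.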
Common Failure Modes
Retrieval failure. The right document exists but isn't retrieved. Causes: wrong chunk size, poor embedding model, missing metadata filters, insufficient top-K. Fix: audit low-quality responses to identify patterns; improve chunking or add hybrid search.
Context overflow. Too many retrieved chunks exceed the LLM's context window or dilute the signal. Fix: reduce top-K, use a re-ranker, or implement a summarisation step over retrieved chunks.
Semantic drift. Retrieved chunks are tangentially related but don't answer the question. Fix: add query rewriting, tighter metadata filtering, or a re-ranking step.
Answer fabrication. LLM supplements retrieved context with hallucinated facts. Fix: strengthen the system prompt's "answer only from context" instruction; add a post-generation verification step that checks each claim against the retrieved chunks.
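For the context overflow case, one simple guard is a token budget applied to the ranked chunks before prompt assembly. The sketch below counts whitespace-separated words as a rough stand-in for tokens (an assumption; substitute your model's tokenizer), and `fit_to_budget` is a hypothetical helper name.

```python
def fit_to_budget(ranked_chunks, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep chunks in rank order until adding the next one would exceed max_tokens."""
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop at the first overflow so rank order is preserved
        kept.append(chunk)
        used += cost
    return kept


# Made-up chunks, best first; each is 8 words long.
ranked = [
    "Enterprise refunds are prorated from the cancellation date",
    "Refund requests must be filed within thirty days",
    "Consumer plans follow a separate policy described elsewhere",
]
context = fit_to_budget(ranked, max_tokens=20)
```

Because the chunks arrive best first, the budget cuts the weakest candidates. This works best when it runs after the re-ranking step.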
A Production-Ready Stack
For most enterprise RAG deployments:
- Embedding: OpenAI text-embedding-3-small (quality/cost balance)
- Vector store: pgvector on PostgreSQL (operational simplicity)
- Search: Hybrid (pgvector ANN + PostgreSQL full-text search)
- Re-ranking: Cohere Rerank or a local cross-encoder
- Generation: GPT-4o or Claude Sonnet for complex queries; GPT-4o-mini for simple lookups
- Framework: Spring AI (Java) or LangChain/LlamaIndex (Python)
- Evaluation: RAGAS with a weekly eval run against a golden dataset
Start simple. Get basic RAG working with pgvector and evaluate it. Add hybrid search when keyword queries fail. Add re-ranking when precision suffers. Add query rewriting when conversation context is lost. Each step adds complexity — add it only when measurement shows it's needed.
Key Takeaways
- RAG = retrieval + generation. The LLM only sees what the retriever gives it — quality of retrieval determines quality of answer
- Chunking strategy can matter more than model choice. A reasonable starting point is roughly 512-token chunks with about 64 tokens of overlap, split on natural boundaries (paragraphs, then sentences) rather than at fixed offsets; tune from there against your own evaluation set
- Hybrid search (vector + BM25 fusion) can give 15–25% better recall than pure semantic search for enterprise documents with specific terminology; verify the gain on your own corpus
- pgvector on PostgreSQL handles millions of vectors with no additional infrastructure if you already run PostgreSQL, so don't reach for dedicated vector DBs prematurely
- Re-ranking is the highest-leverage improvement after basic RAG is working
- Evaluate with RAGAS: track Faithfulness, Answer Relevance, and Context Precision — if you can't measure it, you can't improve it
- The most common failure is answer fabrication — always ground the LLM in retrieved context and explicitly forbid supplementing from training data
Practice Exercises
Exercise 1 — Starter (30 min): Set up a basic RAG pipeline using Spring AI or LangChain. Ingest 10 PDF pages into pgvector. Ask 5 questions and measure retrieval precision manually (did the retrieved chunks contain the answer?). Target: >80% precision.
Exercise 2 — Intermediate (2 hours): Implement hybrid search by adding PostgreSQL full-text search alongside your vector search. Use Reciprocal Rank Fusion to merge results. Compare retrieval recall on 20 test questions with and without hybrid search. Document the improvement.
Exercise 3 — Advanced (half day): Build a RAGAS evaluation pipeline. Create a golden dataset of 50 question-answer pairs from your documents. Write an automated eval that scores Faithfulness and Answer Relevance. Set up a CI step that fails if either score drops below 0.8. Run it against a prompt change and observe the delta.
Related reading
- LLM Inference Explained — what happens under the hood when your RAG prompt hits the model.
- Building Enterprise AI Agents — combine retrieval with tool use for agentic workflows.