    June 1, 2026

    RAG Systems Explained

    A complete guide to Retrieval Augmented Generation — how it works, why each component matters, and how to build production-grade RAG pipelines.

    What you'll learn: By the end of this guide you will be able to design a production-grade RAG pipeline, choose the right chunking strategy, implement hybrid search, evaluate retrieval quality, and debug the most common RAG failure modes.

    Large language models have a fixed knowledge cutoff and don't know anything about your company's internal data. Retrieval Augmented Generation (RAG) solves this by giving the LLM access to an external knowledge store at inference time — combining the reasoning ability of an LLM with the factual grounding of a search system.

    The result: an AI that can answer questions about your specific domain without retraining, that resists hallucination because it is grounded in real documents, and that stays current because the knowledge store updates independently of the model.

    The Core Idea

    Without RAG:

    User: "What is our refund policy for Enterprise customers?"
    LLM: [Guesses based on training data — likely wrong or hallucinated]

    With RAG:

    User: "What is our refund policy for Enterprise customers?"
    Retriever: [Fetches relevant policy documents from your knowledge base]
    LLM: [Reads the actual documents and answers accurately]

    The LLM is no longer relying on memory — it's reading the right documents at the moment of the question.

    RAG Pipeline Overview

    flowchart TD
        subgraph INGEST[Ingestion Pipeline]
            DOC[Source Documents\nPDF · Confluence · DB] --> SPLIT[Text Splitter\n512 token chunks]
            SPLIT --> EMBED[Embedding Model\ntext-embedding-3-small]
            EMBED --> VS[(Vector Store\npgvector / Pinecone)]
        end
        subgraph QUERY[Query Pipeline]
            Q([User Question]) --> QE[Query Embedding]
            QE --> SEARCH[Similarity Search\nTop-K chunks]
            VS --> SEARCH
            SEARCH --> RERANK[Re-Ranker\nOptional]
            RERANK --> LLM[LLM Generation\nGPT-4o / Claude]
            LLM --> ANS([Grounded Answer])
        end
        style INGEST fill:#0d1a2d,stroke:#00e5ff,color:#fff
        style QUERY fill:#1a0d2d,stroke:#9b59b6,color:#fff
        style VS fill:#0d2d1a,stroke:#00e5ff,color:#fff
        style ANS fill:#1a2d0d,stroke:#2ecc71,color:#fff

    Figure 1: Full RAG pipeline — ingestion (left) and query (right)

    The Five Components of a RAG System

    1. Document Ingestion Pipeline

    Before you can retrieve documents, you need to process them into a format suitable for semantic search. The ingestion pipeline:

    Load documents from your sources (PDFs, Confluence, SharePoint, databases, S3 objects).

    Split documents into chunks. This is more nuanced than it sounds:

    • Chunks too large → retrieval returns irrelevant sections alongside relevant ones, confusing the LLM
    • Chunks too small → chunks lose context (a sentence without its surrounding paragraph)
    • Optimal chunk size depends on your content type: 256–512 tokens for dense technical docs, 512–1024 tokens for narrative content

    Use recursive character splitting with overlap:

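    # Recent LangChain releases ship splitters in the langchain-text-splitters package;
    # older releases expose the same class as langchain.text_splitter.
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # chunk_size is measured in characters by default (length_function=len).
    # from_tiktoken_encoder (requires tiktoken) makes both sizes count tokens instead.
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base",
        chunk_size=512,
        chunk_overlap=64,  # Overlap prevents losing context at boundaries
        separators=["\n\n", "\n", ".", " "],
    )

    # document_content is the raw text of one loaded document
    chunks = splitter.split_text(document_content)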

    Enrich each chunk with metadata: source document, section heading, creation date, author, document type. This metadata enables filtered retrieval ("only search policy documents, not engineering specs").
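A minimal sketch of the enrichment step, assuming chunks are plain strings and that one set of document-level metadata applies to every chunk. The `Chunk` class, the `enrich_chunks` helper, and all sample values are made up for illustration; they are not part of any library.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Chunk:
    """A text chunk plus the metadata used later for filtered retrieval."""
    text: str
    metadata: dict = field(default_factory=dict)


def enrich_chunks(texts, source, doc_type, section, created, author):
    """Attach document-level metadata to every chunk, plus its position."""
    return [
        Chunk(
            text=text,
            metadata={
                "source": source,
                "document_type": doc_type,
                "section": section,
                "created": created.isoformat(),
                "author": author,
                "chunk_index": i,
            },
        )
        for i, text in enumerate(texts)
    ]


# Hypothetical sample data
chunks = enrich_chunks(
    ["Refund terms for annual plans.", "Refund terms for monthly plans."],
    source="enterprise-policy-2024-q4.pdf",
    doc_type="policy",
    section="Eligibility",
    created=date(2024, 10, 1),
    author="legal-team",
)

# Metadata makes "only search policy documents" a simple predicate
policy_chunks = [c for c in chunks if c.metadata["document_type"] == "policy"]
```

Storing the chunk index alongside the source also makes it possible to fetch neighbouring chunks later if a retrieved chunk turns out to need more surrounding context.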

    Embed each chunk using an embedding model. The embedding converts text into a high-dimensional vector that encodes semantic meaning.

    2. Embedding Models

    An embedding model converts text into a vector representation. Semantically similar texts produce vectors that are close together in vector space — this is what enables semantic search.
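To make "close together in vector space" concrete, here is a small sketch of cosine similarity, the measure most vector stores offer for comparing embeddings. The three-dimensional vectors are invented toy values; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings", invented for illustration only
refund_policy = [0.9, 0.1, 0.0]
money_back = [0.8, 0.2, 0.1]
deploy_guide = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(refund_policy, money_back)
sim_unrelated = cosine_similarity(refund_policy, deploy_guide)
```

With these toy values the two refund-related vectors score far higher against each other than either does against the deployment vector, which is the property semantic search relies on.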

    Common embedding models:

    Model                              Dimensions   Strengths
    OpenAI text-embedding-3-small      1536         Best quality/cost for English
    OpenAI text-embedding-3-large      3072         Highest quality, 5x cost
    Cohere embed-v3                    1024         Multilingual, good for mixed-language corpora
    sentence-transformers/all-MiniLM   384          Free, fast, good for on-premise
    nomic-embed-text                   768          Open source, strong performance

    Critical rule: Use the same embedding model for ingestion and retrieval. If you embed documents with OpenAI and query with Cohere, the vector spaces don't align — retrieval will be meaningless.
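One way to enforce this rule is to record the embedding model next to the index and check it on every query. The sketch below assumes you control both sides of the pipeline; `IndexInfo`, `EmbeddingMismatchError`, and `check_query_embedding` are hypothetical names, not part of any vector store's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexInfo:
    """What the index was built with; persist this next to the vectors."""
    embedding_model: str
    dimensions: int


class EmbeddingMismatchError(ValueError):
    """Raised when a query embedding cannot be compared with the index."""


def check_query_embedding(index, query_model, query_vector):
    """Fail fast instead of silently returning meaningless neighbours."""
    if query_model != index.embedding_model:
        raise EmbeddingMismatchError(
            f"index built with {index.embedding_model!r}, query uses {query_model!r}"
        )
    if len(query_vector) != index.dimensions:
        raise EmbeddingMismatchError(
            f"expected {index.dimensions} dimensions, got {len(query_vector)}"
        )


index = IndexInfo(embedding_model="text-embedding-3-small", dimensions=1536)
check_query_embedding(index, "text-embedding-3-small", [0.0] * 1536)  # passes
```

The dimension check alone is not enough, since two different models can share a dimension count while producing unrelated vector spaces. That is why the model name is compared as well.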

    3. Vector Database

    The vector database stores your chunk embeddings and enables fast approximate nearest-neighbour (ANN) search — finding the chunks most semantically similar to a query.

    Popular options and when to use them:

    pgvector (PostgreSQL extension) — Use when you're already on PostgreSQL and don't want additional infrastructure. Excellent for < 10M vectors. Free, managed by your existing DBA team.

    Pinecone — Fully managed, scales to billions of vectors, strong filtering support. Best for large-scale production deployments where operational simplicity is worth the cost.

    Chroma — Open source, runs embedded in your application process. Perfect for development and small-scale deployments.

    Weaviate — Open source with a managed cloud option. Strong multi-modal support (text + images).

    Qdrant — Rust-based, very fast, supports rich payload filtering. Good for high-throughput use cases.

    For most enterprise teams starting with RAG: use pgvector. It runs on infrastructure you already manage, you understand how to back it up, and it scales to millions of documents without issue.

    4. Retrieval Layer

    When a query arrives, the retriever finds the most relevant chunks. The naive approach — semantic similarity only — leaves performance on the table.

    Hybrid search combines semantic search (vector similarity) with keyword search (BM25/full-text) and merges results using reciprocal rank fusion (RRF). This dramatically improves recall for queries with specific product names, IDs, or technical terms that semantic search handles poorly.

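    def hybrid_search(query: str, top_k: int = 10) -> list[Document]:
        # embed, vector_store, full_text_search and reciprocal_rank_fusion are
        # placeholders for your own embedding client, stores and fusion helper.

        # Semantic search
        query_embedding = embed(query)
        semantic_results = vector_store.similarity_search(
            query_embedding, top_k=top_k
        )

        # Keyword search
        keyword_results = full_text_search(query, top_k=top_k)

        # Reciprocal rank fusion; the merged list can hold up to 2 * top_k
        # items, so trim it back to top_k
        return reciprocal_rank_fusion(semantic_results, keyword_results)[:top_k]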
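The fusion step is small enough to write by hand. The sketch below is one possible implementation, assuming each retriever returns an ordered list of document IDs (rather than full document objects) and using the commonly cited constant k = 60. Each document's fused score is the sum of 1 / (k + rank) over every list it appears in.

```python
from collections import defaultdict


def reciprocal_rank_fusion(*ranked_lists, k=60, top_k=None):
    """Fuse ranked lists of document IDs into one ranking.

    A document's score is the sum of 1 / (k + rank) across the lists that
    contain it, with rank starting at 1. Ties are broken by ID for stability.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused if top_k is None else fused[:top_k]


# Hypothetical result lists from the two retrievers
semantic = ["doc_a", "doc_b", "doc_c"]
keyword = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion(semantic, keyword)
```

Because only ranks are used, the raw scores from the vector store and from BM25 never need to be put on a common scale, which is the main reason RRF is a popular default for hybrid search.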

    Metadata filtering narrows search before semantic similarity is computed:

    results = vector_store.similarity_search(
        query_embedding,
        filter={"document_type": "policy", "department": "enterprise-sales"},
        top_k=5
    )

    Without filtering, a question about Enterprise refund policies might retrieve Consumer product documentation — technically relevant content that confuses the LLM.
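The Enterprise versus Consumer case can be reproduced with an in-memory sketch. It assumes exact-match filters and uses a dot product as a stand-in for the vector store's similarity function. The records, vectors, and helper names are invented for illustration.

```python
def matches(metadata, required):
    """True when every key in `required` has the same value in `metadata`."""
    return all(metadata.get(key) == value for key, value in required.items())


def filtered_search(query_vec, records, required, top_k=5):
    """Pre-filter on metadata, then rank only the survivors by dot product."""
    candidates = [r for r in records if matches(r["metadata"], required)]
    ranked = sorted(
        candidates,
        key=lambda r: sum(q * x for q, x in zip(query_vec, r["vector"])),
        reverse=True,
    )
    return ranked[:top_k]


# Toy records: the consumer document is slightly closer to the query vector
records = [
    {"id": "ent-refund", "vector": [0.9, 0.1],
     "metadata": {"document_type": "policy", "department": "enterprise-sales"}},
    {"id": "consumer-refund", "vector": [0.95, 0.05],
     "metadata": {"document_type": "policy", "department": "consumer"}},
    {"id": "ent-spec", "vector": [0.2, 0.8],
     "metadata": {"document_type": "spec", "department": "enterprise-sales"}},
]
query_vec = [1.0, 0.0]

unfiltered = filtered_search(query_vec, records, {})
filtered = filtered_search(
    query_vec, records,
    {"document_type": "policy", "department": "enterprise-sales"},
)
```

Without a filter the consumer document ranks first; with the filter it never enters the candidate set.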

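    Re-ranking is a post-retrieval step that uses a cross-encoder model to re-score the top-K retrieved chunks against the query. Cross-encoders are slower than bi-encoders but usually more accurate, because they read the query and the document together in one pass instead of comparing two independently computed vectors. Use a re-ranker (Cohere Rerank, or an open cross-encoder such as the MS MARCO models in sentence-transformers) when precision matters. ColBERT, often mentioned alongside these, is a late-interaction model rather than a cross-encoder, but it can fill the same slot.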
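Structurally, re-ranking is just "score every (query, chunk) pair, sort, truncate". The sketch below keeps the scorer pluggable. The `term_overlap` function is a deliberately crude stand-in so the example runs without a model. In a real pipeline `score_fn` would wrap a cross-encoder or a hosted rerank API.

```python
def rerank(query, chunks, score_fn, top_n=3):
    """Re-score each (query, chunk) pair and keep the top_n highest."""
    scored = [(score_fn(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable for ties
    return [chunk for _, chunk in scored[:top_n]]


def term_overlap(query, chunk):
    """Toy scorer: fraction of query terms that appear in the chunk."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0


# Hypothetical first-stage results, in the order the retriever returned them
candidates = [
    "Consumer returns are handled in store",
    "The enterprise refund policy covers annual plans",
    "Refund requests need a ticket",
]

best = rerank("enterprise refund policy", candidates, term_overlap, top_n=2)
```

Keeping the scorer as a parameter means the first-stage retriever can stay cheap and broad (for example top-50) while the expensive model only ever sees that short list.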

    5. Generation Layer

    With retrieved chunks in hand, the LLM generates a grounded response. The prompt structure:

    SYSTEM_PROMPT = """
    You are a knowledge assistant for Acme Corp. Answer questions 
    ONLY based on the provided context documents. 
    
    If the answer is not clearly supported by the context, say:
    "I don't have sufficient information to answer this accurately."
    
    Never supplement context with your training data. Cite the source 
    document for each factual claim you make.
    """
    
    def generate_answer(query: str, context_chunks: list[Document]) -> str:
        context = "\n\n---\n\n".join([
            f"Source: {chunk.metadata['source']}\n{chunk.content}"
            for chunk in context_chunks
        ])
        
        return llm.complete(
            system=SYSTEM_PROMPT,
            user=f"Context:\n{context}\n\nQuestion: {query}"
        )

    Key principles:

    • Ground the model in context first. Put retrieved documents before the question.
    • Explicit fallback instruction. Tell the model exactly what to say when it doesn't know.
    • Request citations. Makes hallucinations visible and responses verifiable.

    Advanced Patterns

    Contextual Chunk Headers

    Before embedding each chunk, prepend a generated summary of its position in the document:

    Section: Enterprise Refund Policy > Eligibility > SaaS Subscriptions
    Document: enterprise-policy-2024-q4.pdf
    
    [original chunk content]

    This improves retrieval accuracy for hierarchical documents (policy manuals, technical specifications) because the chunk carries context about where it sits in the document structure.
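A minimal sketch of building such a header, assuming the heading path is already known from the document parser rather than generated by an LLM. The function name and arguments are illustrative.

```python
def add_context_header(chunk_text, document_name, heading_path):
    """Prepend the chunk's position in the document before embedding it."""
    header = f"Section: {' > '.join(heading_path)}\nDocument: {document_name}"
    return f"{header}\n\n{chunk_text}"


text_to_embed = add_context_header(
    "[original chunk content]",
    document_name="enterprise-policy-2024-q4.pdf",
    heading_path=["Enterprise Refund Policy", "Eligibility", "SaaS Subscriptions"],
)
```

One design choice to consider: embed the text with the header, but store the original chunk separately so the header is not repeated for every chunk in the prompt.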

    Query Rewriting

    User queries are often underspecified. "What about refunds?" is worse for retrieval than "What is the enterprise refund policy for annual SaaS subscriptions?" Use the LLM to rewrite the query before retrieval:

    def rewrite_query(user_query: str, conversation_history: list) -> str:
        return llm.complete(
            system="Rewrite the user's query to be self-contained and specific, "
                   "incorporating relevant context from the conversation history.",
            user=f"History: {conversation_history}\nQuery: {user_query}"
        )

    HyDE (Hypothetical Document Embeddings)

    Instead of embedding the user's query directly, generate a hypothetical document that would answer the query and embed that:

    def hyde_retrieval(query: str) -> list[Document]:
        # Generate a hypothetical answer document
        hypothetical_doc = llm.complete(
            system="Write a detailed, factual paragraph that would answer this question.",
            user=query
        )
        
        # Embed the hypothetical document instead of the query
        embedding = embed(hypothetical_doc)
        return vector_store.similarity_search(embedding, top_k=5)

    HyDE often improves recall because the hypothetical document is in the same distribution as your actual documents — closer in vector space to relevant chunks than the query itself.

    Evaluating RAG Quality

    RAG systems require specific evaluation metrics:

    Retrieval metrics:

    • Recall@K: what fraction of relevant documents appear in the top-K retrieved chunks?
    • Mean Reciprocal Rank (MRR): how high does the first relevant chunk rank, averaged over all test queries?
    • Context precision: are the retrieved chunks actually relevant to the question?

    Generation metrics:

    • Faithfulness: does the answer only contain claims supported by the retrieved context?
    • Answer relevance: does the answer actually address the question?

    Use frameworks like RAGAS or TruLens to automate these evaluations. Run them on a held-out set of 100–200 question-answer pairs and treat quality regressions as failing tests.
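Recall@K and MRR are simple enough to compute without a framework, which is useful for a first sanity check. The sketch below assumes each test case is a pair of (retrieved chunk IDs in ranked order, set of relevant chunk IDs). The IDs are invented.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant IDs that appear in the first k results."""
    if not relevant:
        return 0.0
    hits = set(retrieved[:k]) & set(relevant)
    return len(hits) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant_set = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant_set:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(cases):
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    if not cases:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in cases) / len(cases)


# Hypothetical evaluation cases
cases = [
    (["c1", "c2", "c3"], {"c2", "c9"}),  # first relevant hit at rank 2
    (["c4", "c5", "c6"], {"c4"}),        # first relevant hit at rank 1
]

recall_first = recall_at_k(cases[0][0], cases[0][1], k=3)
mrr = mean_reciprocal_rank(cases)
```

Here the first case retrieves one of its two relevant chunks (recall 0.5), and the MRR over both cases is (1/2 + 1) / 2 = 0.75.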

    Common Failure Modes

    Retrieval failure. The right document exists but isn't retrieved. Causes: wrong chunk size, poor embedding model, missing metadata filters, insufficient top-K. Fix: audit low-quality responses to identify patterns; improve chunking or add hybrid search.

    Context overflow. Too many retrieved chunks exceed the LLM's context window or dilute the signal. Fix: reduce top-K, use a re-ranker, or implement a summarisation step over retrieved chunks.
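One simple guard against context overflow is a token budget applied to the ranked chunks before prompt assembly. The sketch below counts whitespace-separated words as a rough stand-in for tokens. That is an assumption for the sake of a runnable example, and a real pipeline would pass the model's tokenizer as `count_tokens`.

```python
def pack_context(chunks, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep chunks in ranked order until the next one would exceed the budget."""
    selected, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop at the first overflow so the ranked prefix stays intact
        selected.append(chunk)
        used += cost
    return selected


# Toy chunks, best-ranked first
ranked_chunks = ["a b c", "d e", "f g h i"]

packed = pack_context(ranked_chunks, max_tokens=5)
```

Stopping at the first chunk that does not fit, rather than skipping it and trying smaller ones, keeps the highest-ranked material together. Whether that trade-off is right depends on how much you trust the ranking.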

    Semantic drift. Retrieved chunks are tangentially related but don't answer the question. Fix: add query rewriting, tighter metadata filtering, or a re-ranking step.

    Answer fabrication. LLM supplements retrieved context with hallucinated facts. Fix: strengthen the system prompt's "answer only from context" instruction; add a post-generation verification step that checks each claim against the retrieved chunks.

    A Production-Ready Stack

    For most enterprise RAG deployments:

    • Embedding: OpenAI text-embedding-3-small (quality/cost balance)
    • Vector store: pgvector on PostgreSQL (operational simplicity)
    • Search: Hybrid (pgvector ANN + PostgreSQL full-text search)
    • Re-ranking: Cohere Rerank or a local cross-encoder
    • Generation: GPT-4o or Claude Sonnet for complex queries; GPT-4o-mini for simple lookups
    • Framework: Spring AI (Java) or LangChain/LlamaIndex (Python)
    • Evaluation: RAGAS with a weekly eval run against a golden dataset

    Start simple. Get basic RAG working with pgvector and evaluate it. Add hybrid search when keyword queries fail. Add re-ranking when precision suffers. Add query rewriting when conversation context is lost. Each step adds complexity — add it only when measurement shows it's needed.


    Key Takeaways

    • RAG = retrieval + generation. The LLM only sees what the retriever gives it — quality of retrieval determines quality of answer
    • Chunking strategy often matters as much as model choice. Recursive splitting into roughly 512-token chunks with a 64-token overlap is a solid default; confirm it against your own evaluation set before tuning anything else
    • Hybrid search (vector + BM25 fusion) typically gives better recall than pure semantic search for enterprise documents with specific terminology; measure the gain on your own corpus
    • pgvector on PostgreSQL handles millions of vectors with zero additional infrastructure — don't reach for dedicated vector DBs prematurely
    • Re-ranking is the highest-leverage improvement after basic RAG is working
    • Evaluate with RAGAS: track Faithfulness, Answer Relevance, and Context Precision — if you can't measure it, you can't improve it
    • The most common failure is answer fabrication — always ground the LLM in retrieved context and explicitly forbid supplementing from training data

    Practice Exercises

    Exercise 1 — Starter (30 min): Set up a basic RAG pipeline using Spring AI or LangChain. Ingest 10 PDF pages into pgvector. Ask 5 questions and measure retrieval precision manually (did the retrieved chunks contain the answer?). Target: >80% precision.

    Exercise 2 — Intermediate (2 hours): Implement hybrid search by adding PostgreSQL full-text search alongside your vector search. Use Reciprocal Rank Fusion to merge results. Compare retrieval recall on 20 test questions with and without hybrid search. Document the improvement.

    Exercise 3 — Advanced (half day): Build a RAGAS evaluation pipeline. Create a golden dataset of 50 question-answer pairs from your documents. Write an automated eval that scores Faithfulness and Answer Relevance. Set up a CI step that fails if either score drops below 0.8. Run it against a prompt change and observe the delta.
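As a starting point for the CI step in Exercise 3, the gate itself can be a small pure function that compares scores with minimums. The sketch assumes the evaluation run has already produced a dictionary of metric scores. The metric key names and the sample scores are placeholders, not output from a real run.

```python
def check_thresholds(scores, minimums):
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    for metric, minimum in minimums.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: missing")
        elif value < minimum:
            failures.append(f"{metric}: {value:.2f} < {minimum:.2f}")
    return failures


# Placeholder scores standing in for the output of an evaluation run
scores = {"faithfulness": 0.91, "answer_relevancy": 0.76}
minimums = {"faithfulness": 0.8, "answer_relevancy": 0.8}

failures = check_thresholds(scores, minimums)
```

In a CI script, a non-empty `failures` list would be printed and followed by a non-zero exit code so the pipeline fails.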
