    May 31, 2026

    RAG Chunking Strategies That Actually Improve Retrieval

    Your RAG quality is capped by how you chunk. A practical comparison of fixed, recursive, semantic, and structural chunking, with sizing and overlap tips.


    Most RAG systems that "don't work" don't have a model problem — they have a chunking problem. If retrieval returns the wrong passages, no prompt or model can save the answer. Chunking is the quiet decision that caps the quality of everything downstream, and it rarely gets the attention it deserves.

    Why chunking decides RAG quality

    Retrieval matches a query against chunks. If a chunk is too big, the relevant sentence is diluted by surrounding noise and the embedding averages over several topics, so it matches no single query strongly. If it is too small, it loses the context needed to be meaningful. Split mid-thought, and the answer is scattered across chunks that are rarely retrieved together. The chunk is the unit of retrieval: get it wrong and recall collapses.

    The strategies, worst to best (usually)

    1. Fixed-size (by characters/tokens). Split every N characters or tokens. Dead simple, but it slices through sentences, tables, and code. Fine as a baseline; rarely best.

    2. Recursive / separator-aware. Split on a hierarchy of separators (paragraphs → sentences → words) so chunks break at natural boundaries. A big step up from fixed-size and a sensible default for prose.

    3. Structural / document-aware. Respect the document's structure — Markdown headings, HTML sections, code blocks, table rows. Each chunk is a coherent unit (a section, a function, a row group). Best when your content has structure (docs, wikis, code).

    4. Semantic. Use embeddings to detect topic shifts and split where meaning changes. Usually the most expensive option to build and run, since the text has to be embedded before it can be split, but it tends to produce the most coherent chunks for unstructured text.
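
    To make the separator hierarchy in strategy 2 concrete, here is a minimal sketch of a recursive splitter. The function name `recursive_split`, the separator list, and the character-based `max_chars` limit are assumptions for illustration, not any particular library's API.

```python
from typing import List, Sequence

# Separator hierarchy, coarsest first: paragraphs, lines, sentences, words.
SEPARATORS = ("\n\n", "\n", ". ", " ")

def recursive_split(text: str, max_chars: int = 200,
                    separators: Sequence[str] = SEPARATORS) -> List[str]:
    """Pack pieces into chunks <= max_chars, breaking at the coarsest boundary that fits."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No natural boundary left: fall back to a fixed-size cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece.strip():
            continue                      # skip empty pieces from repeated separators
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate           # keep packing pieces into this chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # This piece alone is too big: recurse with finer separators.
            chunks.extend(recursive_split(piece, max_chars, finer))
    if current:
        chunks.append(current)
    return chunks
```

    Note that splitting on ". " drops the period at a chunk's trailing edge; a production splitter would likely keep the separator attached. With no separators left, the sketch degrades to the fixed-size baseline from strategy 1.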

    There's no universal winner — match the strategy to your content. Structured docs → structural; flowing prose → recursive or semantic.
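
    For structured docs, a structural splitter can be quite small. The sketch below assumes Markdown input with `#`-style headings and treats each heading-delimited section as one chunk; the function name and the output shape (a dict with `heading` and `text`) are hypothetical choices for this example.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3   # Markdown code-fence marker

def split_markdown_by_heading(markdown: str) -> List[Dict[str, str]]:
    """Return one chunk per heading-delimited section, keeping the heading trail."""
    chunks: List[Dict[str, str]] = []
    path: List[str] = []      # current heading trail, e.g. ["Guide", "Install"]
    body: List[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"heading": " > ".join(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence       # never split inside a code block
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()                       # close the previous section
            level = len(match.group(1))
            del path[level - 1:]          # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

    Sections that come out larger than your size budget can then be passed through a recursive or fixed-size splitter as a second step.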

    Sizing and overlap

    • Size: start around 300–800 tokens per chunk. Smaller = sharper retrieval but more fragments; larger = more context but noisier embeddings. Tune against your data.
    • Overlap: add 10–20% overlap (50–100 tokens on a 500-token chunk) so a sentence split across a boundary still appears whole in one chunk. Too much overlap inflates the index and fills the top results with near-duplicate passages.
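
    The size and overlap arithmetic can be sketched as a sliding window. This example assumes the text is already tokenized into a list (whitespace-separated words are a rough stand-in for real tokenizer output); `chunk_with_overlap` and its parameter names are illustrative.

```python
from typing import List

def chunk_with_overlap(tokens: List[str], size: int = 500,
                       overlap_ratio: float = 0.15) -> List[List[str]]:
    """Slide a window of `size` tokens, stepping by size minus the overlap."""
    if size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("size must be positive and overlap_ratio in [0, 1)")
    overlap = min(round(size * overlap_ratio), size - 1)
    step = size - overlap                 # always >= 1, so the loop advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break                         # this window already reached the end
    return chunks

# Tiny demo: 10 tokens, windows of 4, 25% overlap -> 1 shared token per boundary.
demo = chunk_with_overlap("a b c d e f g h i j".split(), size=4, overlap_ratio=0.25)
```

    With the defaults, the overlap is 75 tokens and the window advances 425 tokens at a time, so a 2,000-token document yields five chunks.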

    Don't forget metadata

    Attach metadata to every chunk — source, title, section heading, date, URL. It pays off three ways:

    • Filtering: restrict retrieval by source/recency before the vector search.
    • Provenance: cite where an answer came from (and surface conflicts).
    • Context: prepend the section heading to the chunk so an isolated passage stays meaningful.
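
    One way to carry that metadata is a small record type per chunk. The sketch below is an assumption about shape, not a required schema: the `Chunk` fields mirror the list above, and `prefilter`, `embedding_text`, and `citation` are hypothetical helpers covering filtering, context, and provenance.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class Chunk:
    text: str
    source: str
    title: str
    section: str
    published: date
    url: str

    def embedding_text(self) -> str:
        """Context: prepend title and section so an isolated passage stays meaningful."""
        return f"{self.title} > {self.section}\n\n{self.text}"

    def citation(self) -> str:
        """Provenance: a human-readable pointer back to where the passage came from."""
        return f"{self.title}, {self.section} ({self.published.isoformat()}): {self.url}"

def prefilter(chunks: List[Chunk], source: Optional[str] = None,
              published_since: Optional[date] = None) -> List[Chunk]:
    """Filtering: narrow the candidate set by metadata before any vector search runs."""
    return [
        c for c in chunks
        if (source is None or c.source == source)
        and (published_since is None or c.published >= published_since)
    ]
```

    In this sketch you would embed `embedding_text()` rather than the raw `text`, and return `citation()` alongside the answer. Most vector stores offer their own metadata filters, which would replace the in-memory `prefilter` here.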

    Measure, don't guess

    Chunking is empirical. Build a small golden set of question → expected-source pairs and measure retrieval recall (did the right chunk make the top-k?) as you change strategy, size, and overlap. Optimize retrieval before you touch the prompt — it's where the leverage is.
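
    A recall check along these lines fits in a few lines of code. In the sketch below, `recall_at_k` takes the golden set and any retrieval function; the toy `CHUNKS`, `GOLDEN`, and `keyword_retrieve` are made-up stand-ins so the example runs on its own, and a real setup would call your vector search instead.

```python
from typing import Callable, Dict, List

def recall_at_k(golden: Dict[str, str],
                retrieve: Callable[[str, int], List[str]], k: int = 5) -> float:
    """Fraction of questions whose expected chunk id appears in the top-k results."""
    if not golden:
        raise ValueError("golden set is empty")
    hits = sum(1 for question, expected in golden.items()
               if expected in retrieve(question, k))
    return hits / len(golden)

# Hypothetical toy index: chunk id -> chunk text.
CHUNKS = {
    "faq#refunds": "refunds are issued within 14 days",
    "faq#shipping": "shipping takes 3 to 5 days",
    "faq#returns": "returns need a receipt",
}

# Hypothetical golden set: question -> id of the chunk that should be retrieved.
GOLDEN = {
    "how long do refunds take": "faq#refunds",
    "when does shipping arrive": "faq#shipping",
    "can i send an item back": "faq#returns",
}

def keyword_retrieve(question: str, k: int) -> List[str]:
    """Stand-in retriever: rank chunks by number of words shared with the question."""
    words = set(question.lower().split())
    ranked = sorted(CHUNKS, key=lambda cid: len(words & set(CHUNKS[cid].split())),
                    reverse=True)
    return ranked[:k]

score = recall_at_k(GOLDEN, keyword_retrieve, k=1)
```

    Rerun the same golden set after each change to strategy, size, or overlap, and keep the configuration with the best recall at the k you actually pass to the model.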

    Wrap-up

    Pick a chunking strategy that matches your content's structure, size chunks to your data with modest overlap, enrich them with metadata, and validate with a recall metric. Do that and your RAG system will usually improve more than prompt tweaks alone could deliver.
