Most RAG systems that "don't work" don't have a model problem — they have a chunking problem. If retrieval returns the wrong passages, no prompt or model can save the answer. Chunking is the quiet decision that caps the quality of everything downstream, and it rarely gets the attention it deserves.
Why chunking decides RAG quality
Retrieval matches a query against chunks. If a chunk is too big, it dilutes the relevant sentence with noise and the embedding becomes "average." Too small, and it loses the context needed to be meaningful. Split mid-thought, and the answer is scattered across chunks that never co-retrieve. The chunk is the unit of retrieval — get it wrong and recall collapses.
The strategies, worst to best (usually)
1. Fixed-size (by characters/tokens). Split every N characters. Dead simple, but it slices through sentences, tables, and code. Fine as a baseline; rarely best.
2. Recursive / separator-aware. Split on a hierarchy of separators (paragraphs → sentences → words) so chunks break at natural boundaries. A big step up from fixed-size and a sensible default for prose.
3. Structural / document-aware. Respect the document's structure — Markdown headings, HTML sections, code blocks, table rows. Each chunk is a coherent unit (a section, a function, a row group). Best when your content has structure (docs, wikis, code).
4. Semantic. Use embeddings to detect topic shifts and split where meaning changes. Most expensive to build, but produces the most coherent chunks for unstructured text.
There's no universal winner — match the strategy to your content. Structured docs → structural; flowing prose → recursive or semantic.
Sizing and overlap
- Size: start around 300–800 tokens per chunk. Smaller = sharper retrieval but more fragments; larger = more context but noisier embeddings. Tune against your data.
- Overlap: add 10–20% overlap so a sentence split across a boundary still appears whole in one chunk. Too much overlap wastes the index and double-counts content.
Don't forget metadata
Attach metadata to every chunk — source, title, section heading, date, URL. It pays off three ways:
- Filtering: restrict retrieval by source/recency before the vector search.
- Provenance: cite where an answer came from (and surface conflicts).
- Context: prepend the section heading to the chunk so an isolated passage stays meaningful.
Measure, don't guess
Chunking is empirical. Build a small golden set of question → expected-source pairs and measure retrieval recall (did the right chunk make the top-k?) as you change strategy, size, and overlap. Optimize retrieval before you touch the prompt — it's where the leverage is.
Wrap-up
Pick a chunking strategy that matches your content's structure, size chunks to your data with modest overlap, enrich them with metadata, and validate with a recall metric. Do that and your RAG system improves more than any prompt tweak could deliver.
Related reading
- RAG Systems Explained — the full retrieval pipeline this fits into.
- Context Engineering: The Real Skill Behind Reliable LLM Apps — chunks are what you feed the context window.
- LLM Observability — track retrieval quality in production.