RAG Done Right: Semantic Chunking Strategies
RAG accuracy is determined before the model sees a single token. It is determined at indexing time, by how you chunk your documents. A model given perfectly relevant context will produce accurate outputs. A model given noisy, fragmented, or misaligned context will hallucinate — confidently and at scale. Chunking strategy is the part of the retrieval pipeline that most enterprise RAG implementations get wrong, and then spend months trying to compensate for with prompt engineering.
This post covers four chunking approaches, when each wins, and the benchmark reality that should temper both optimism and over-engineering.
Fixed-Size Chunking: Fast, Simple, Wrong
The default chunking strategy across most RAG frameworks — LangChain, LlamaIndex, and their derivatives — is fixed-size chunking with overlap. Typically: 512-token windows, 50-token overlap, split on whitespace or sentence boundaries where possible. It is the default because it is fast, predictable, and dependency-free. You can index a 10,000-document corpus overnight without any specialized infrastructure.
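As a minimal sketch, fixed-size chunking with overlap reduces to a sliding window. Whitespace tokens stand in for model tokens here; a real pipeline would count tokens with the embedding model's tokenizer rather than `str.split`:

```python
def fixed_size_chunks(text, chunk_size=512, overlap=50):
    """Split text into overlapping windows of roughly chunk_size tokens.

    Whitespace words approximate tokens for illustration; production code
    should use the embedding model's tokenizer.
    """
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap  # each window starts 462 tokens after the last
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break  # final window reached the end of the document
    return chunks
```

Note that nothing in this loop knows where a clause or paragraph ends, which is precisely the failure mode described next.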
The problem is that semantic boundaries don't align with token counts. A 512-token window ends where it ends — not where the document's logic ends. In a legal contract, a critical exception clause might start at token 480 of one chunk and continue into the next. Neither chunk contains the complete clause. A retrieval query for "termination notice requirements" might match both chunks weakly, retrieve neither as the top result, and miss the answer entirely. Or it retrieves both, and the model sees two incomplete fragments that contradict each other when read without context.
The 50-token overlap is designed to mitigate this. It doesn't. Fifty tokens of repeated content blunts the worst-case boundary splits but doesn't solve the fundamental problem: the chunk boundary was drawn by token count, not by meaning. For structured, homogeneous documents — short FAQ entries, product spec sheets with consistent formatting — fixed-size chunking is often adequate. For anything more complex, it is a source of retrieval errors that compound silently.
Semantic Chunking: What It Is and When It Wins
Semantic chunking replaces fixed token counts with embedding-based topic detection. The algorithm embeds each sentence (or small paragraph) as a vector, then computes similarity between adjacent sentences. When similarity drops below a threshold — indicating a topic transition — a chunk boundary is inserted. The result is chunks that correspond to discrete topics or arguments rather than arbitrary text windows.
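The boundary-detection loop can be sketched as follows. Here `embed_fn` is a placeholder for a sentence-embedding model, and the default threshold of 0.7 is illustrative — real deployments tune it per corpus:

```python
import numpy as np

def semantic_chunks(sentences, embed_fn, threshold=0.7):
    """Group sentences into chunks, breaking where adjacent-sentence
    cosine similarity drops below threshold.

    embed_fn maps a list of sentences to an (n, d) array of embeddings;
    in practice this is a sentence-embedding model's encode call.
    """
    vecs = np.asarray(embed_fn(sentences), dtype=float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize rows
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = float(vecs[i - 1] @ vecs[i])  # cosine similarity of neighbors
        if sim < threshold:                 # similarity drop = topic shift
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

The per-sentence embedding pass inside `embed_fn` is where the 3-5x indexing slowdown mentioned below comes from.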
The implementation requires a sentence embedding model running at index time. Options range from lightweight models optimised for speed to larger multilingual models optimised for accuracy. Indexing is 3-5x slower than fixed-size chunking because every sentence requires an embedding inference pass. For a large corpus, this is a meaningful cost — both in compute and in indexing latency for new documents.
The accuracy gains are real, but they are not uniform. Semantic chunking wins most decisively on heterogeneous, long-form documents: legal contracts, financial reports, clinical guidelines, policy documentation. These documents contain dense, varied content where a single topic may span a few paragraphs and then abruptly shift. Fixed-size chunking fractures these topics arbitrarily. Semantic chunking preserves them as discrete retrievable units.
A concrete example: a 40-page master services agreement contains 22 distinct legal provisions. Fixed-size chunking at 512 tokens produces roughly 80 chunks, many of which span two provisions. Semantic chunking produces approximately 30 chunks, each corresponding to one or two related provisions. Retrieval queries for specific provisions hit the right chunk first in semantic chunking 70-80% of the time, compared to 40-55% for fixed-size chunking — a meaningful gap when every retrieval miss is a potential hallucination.
Parent-Child Retrieval: Best of Both
Parent-child retrieval is a hybrid strategy that addresses a genuine tension in chunking design: small chunks retrieve precisely, large chunks provide coherent context. You want both. You can have both.
The approach: index fine-grained child chunks (128-256 tokens) for retrieval. Store their parent chunks (512-1024 tokens) for context delivery. At query time, retrieval runs against the child chunk index — small chunks, precise matching. When a child chunk is retrieved, the system returns its parent chunk as the context passed to the model.
A concrete example: an employment policy document has a section on leave entitlements that spans six paragraphs. Fixed-size chunking creates two 512-token chunks, both containing parts of the policy. Parent-child indexing creates six 128-token child chunks from that section — one per paragraph — and a single parent chunk containing all six paragraphs.
A query about "parental leave carry-over rules" retrieves the specific 128-token child chunk that discusses carry-over. But the context passed to the model is the parent chunk — the full leave entitlements section, with the carry-over clause in its proper context. The model generates from complete context. The user gets an answer that's both precise and coherent.
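The mechanism above reduces to a child index plus a child-to-parent mapping. In this sketch, word windows stand in for a tokenizer-based splitter, and the `score_fn` parameter stands in for vector search against the child embedding index:

```python
def build_parent_child_index(parents, child_size=128):
    """Split each parent chunk into child chunks, recording which
    parent each child came from.
    """
    children, parent_of = [], []
    for pid, parent in enumerate(parents):
        words = parent.split()
        for start in range(0, len(words), child_size):
            children.append(" ".join(words[start:start + child_size]))
            parent_of.append(pid)  # child index -> parent index
    return children, parent_of

def retrieve(query, children, parent_of, parents, score_fn):
    """Match against small children for precision, then return the
    large parent as the context passed to the model."""
    scores = [score_fn(query, child) for child in children]
    best = max(range(len(children)), key=scores.__getitem__)
    return parents[parent_of[best]]
```

The two-index structure is visible directly: `children` is what gets embedded and searched, `parents` is what gets delivered, and `parent_of` is the mapping that ties them together.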
The indexing overhead is roughly 2x fixed-size chunking: you maintain two indexes, run embedding inference at two granularities, and need a mapping between child and parent chunks. For enterprise knowledge bases where retrieval accuracy has measurable business impact, this overhead is almost always justified.
2026 Benchmark Reality Check
The honest take, which the RAG tooling ecosystem has been slow to acknowledge: for homogeneous document collections, semantic chunking's advantage is smaller than its proponents claim.
Recent benchmarks on BEIR — the standard information retrieval evaluation suite — and emerging RAG-specific benchmarks show that for structured, predictable document types (product FAQs, technical wikis, structured knowledge bases with consistent formatting), fixed-size chunking with 50-token overlap performs within a few percentage points of semantic chunking on standard retrieval metrics (NDCG@10, MRR). The gap is real but often not large enough to justify the indexing overhead.
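For readers less familiar with the metrics: MRR is the simpler of the two — the mean, over queries, of the reciprocal rank of the first relevant result. A minimal implementation:

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """MRR: average over queries of 1 / rank of the first relevant hit.

    ranked_results[q] is the ranked list of doc ids returned for query q;
    relevant[q] is the set of relevant doc ids for that query. A query
    with no relevant hit in its results contributes 0.
    """
    total = 0.0
    for results, rel in zip(ranked_results, relevant):
        for rank, doc in enumerate(results, start=1):
            if doc in rel:
                total += 1.0 / rank
                break
    return total / len(ranked_results)
```

A chunking strategy that pushes the right chunk from rank 2 to rank 1 moves a query's contribution from 0.5 to 1.0, which is why "a few percentage points" on these metrics can still matter at the margin.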
Semantic chunking's gains concentrate on heterogeneous, long-form documents — the legal, financial, and policy document types described above. If your knowledge base consists primarily of short, structured documents with consistent formatting, fixed-size chunking is a defensible choice. If your knowledge base contains a mix of long-form, structurally varied content, semantic chunking or parent-child retrieval will meaningfully improve accuracy.
The practical implication: don't choose a chunking strategy based on what sounds most sophisticated. Profile your document corpus first. If 80% of your documents are structured and homogeneous, fixed-size chunking with overlap is probably fine. If 40% are long-form heterogeneous documents that contain most of the business-critical knowledge, semantic chunking will pay for itself.
Domain-Aware Chunking
The limitation of all single-strategy approaches is that real enterprise knowledge bases are not homogeneous. They contain legal contracts, product documentation, HR policies, financial reports, engineering runbooks, and customer communications — each of which has different structural properties and different optimal chunking strategies.
Domain-aware chunking applies document classification before chunking. A classifier identifies the document type — legal, technical, policy, financial — and selects the appropriate chunking strategy for that type. Legal contracts get clause-level semantic chunking because the clause is the atomic unit of legal meaning. Product documentation gets parent-child retrieval with feature-level child chunks because users query specific features but need section-level context. HR policies get fixed-size chunking with semantic overlap detection because they are generally well-structured and short.
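The dispatch pattern can be sketched as follows. The keyword rules and stand-in chunkers here are deliberately toy illustrations, not any product's actual classification logic:

```python
def chunk_fixed(doc, size=64):
    """Stand-in for fixed-size chunking (characters, for brevity)."""
    return [doc[i:i + size] for i in range(0, len(doc), size)]

def chunk_semantic(doc):
    """Stand-in: blank-line splits as a crude topic-boundary proxy."""
    return [p for p in doc.split("\n\n") if p.strip()]

def chunk_parent_child(doc):
    """Stand-in: paragraph children, each paired with the full doc as parent."""
    return [(p, doc) for p in doc.split("\n\n") if p.strip()]

def classify(doc):
    """Toy document-type classifier. The keyword rules are illustrative
    assumptions; a real classifier combines metadata (file name, source
    directory, format) with a content sample."""
    text = doc.lower()
    if "whereas" in text or "clause" in text:
        return "legal"
    if "endpoint" in text or "install" in text:
        return "technical"
    return "policy"

STRATEGIES = {
    "legal": chunk_semantic,          # clause-level semantic chunking
    "technical": chunk_parent_child,  # feature-level children, section parents
    "policy": chunk_fixed,            # short, well-structured documents
}

def chunk_document(doc):
    """Route each document to the chunker for its classified type."""
    return STRATEGIES[classify(doc)](doc)
```

The important property is the table itself: adding a new document type means adding one classifier rule and one entry, without touching the existing chunkers.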
Scabera applies document classification at index time to assign each document to its appropriate chunking strategy. The classifier runs on document metadata (file name, source directory, document format) combined with a lightweight content sample. Classification adds roughly 50ms per document at index time — negligible overhead for the accuracy improvement it enables.
The result is a knowledge base where each document is chunked optimally for its type rather than forced into a one-size-fits-all strategy. Enterprise RAG systems that handle genuinely diverse knowledge bases — and most enterprise knowledge bases are genuinely diverse — need this kind of adaptive approach. Fixed-size chunking applied uniformly across a mixed corpus is one of the most common sources of unexplained retrieval underperformance in enterprise RAG deployments. It is also one of the easiest to fix once the root cause is identified.