
Reranking: The Layer Most RAG Systems Skip (And Why It Destroys Accuracy)

Scabera Team
5 min read
2026-02-21

Most enterprise RAG implementations stop at cosine similarity. Embed the query, embed the documents, find the nearest neighbors, pass the top-k to the model. It's fast, it's simple, and it consistently underperforms what's possible. The missing layer is reranking — and the difference is not marginal.

How Basic RAG Works (And Where It Breaks)

Standard RAG retrieval is a two-step process: embed everything into a vector space, then retrieve by proximity. At query time, the user's question is embedded using a bi-encoder model. The system finds the document chunks whose vectors are closest to the query vector — typically measured by cosine similarity — and returns the top-k results as context for generation.

Bi-encoders are fast because query and document embeddings are computed independently. A pre-built index of millions of chunks can be searched in milliseconds. The problem is what "nearest" actually means in embedding space.
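The first stage is easy to picture in code. Here is a minimal sketch of cosine-similarity top-k retrieval with numpy; the toy 3-dimensional vectors stand in for real bi-encoder embeddings, and a production system would search a pre-built ANN index rather than scoring every document:

```python
import numpy as np

def top_k_by_cosine(query_vec, doc_vecs, k=5):
    """Return indices of the k document vectors nearest the query.

    Cosine similarity reduces to a dot product of L2-normalized vectors.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                      # one similarity score per document
    return np.argsort(-sims)[:k]

# Toy "embeddings" standing in for real bi-encoder output.
doc_vecs = np.array([[0.9, 0.1, 0.0],
                     [0.1, 0.9, 0.0],
                     [0.8, 0.2, 0.1]])
nearest = top_k_by_cosine(np.array([1.0, 0.0, 0.0]), doc_vecs, k=2)
```

Note that nothing in this scoring step ever looks at the query and a document together; each side is embedded in isolation, which is exactly where the relevance problem below comes from.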

Embeddings encode semantic similarity — topics, concepts, vocabulary. A query about "contract termination clauses" will retrieve documents about contract law, legal agreements, and termination procedures. All of those are semantically related. Not all of them answer the question. The document chunking strategy affects which passages are even available for retrieval — but once indexed, the ranking of results is determined entirely by cosine similarity, which is a blunt instrument for relevance.

Why Cosine Similarity Retrieves Related, Not Relevant

Semantic similarity and relevance are not the same thing. A document about contract formation is semantically similar to a query about contract termination. They share concepts, vocabulary, and domain. A cosine similarity score won't reliably distinguish between them.

This matters more in enterprise knowledge bases than it might seem. Real document collections contain near-duplicates, version histories, general overviews, and specific technical references, all with similar embeddings. A question about a specific policy revision will often retrieve the general policy overview instead, because the overview's embedding sits closer to the query than that of the specific amendment that actually answers the question.

The result: the model generates an answer from documents that are related to the question but don't actually address it. The answer sounds plausible. It cites real documents. It's wrong.

What Reranking Does: Cross-Encoder Scoring

Reranking is a second-pass scoring step that runs after initial retrieval. Instead of computing query and document embeddings independently, a cross-encoder model takes the query and a candidate document as a joint input and scores their relevance together.

This joint encoding is what makes cross-encoders more accurate. The model can attend to the relationship between specific query terms and specific document passages — not just whether they live near each other in embedding space. A cross-encoder can distinguish between "this document is about contract law" (high cosine similarity) and "this document answers the question about termination notice periods" (high cross-encoder relevance score).

The pipeline becomes: embed → retrieve top-50 by cosine similarity → rerank top-50 → pass top-5 to the model. The retrieval step casts a wide net. The reranking step selects for actual relevance. The model generates from genuinely relevant context.
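The two-stage shape can be sketched in a few lines. The function names here are illustrative, and the token-overlap scorer is a deliberately crude stand-in for a real cross-encoder model, which would jointly attend over the (query, document) pair instead:

```python
import numpy as np

def overlap_score(query, doc):
    # Stand-in scorer: token overlap with the query. A real cross-encoder
    # replaces this with a model scoring the joint (query, document) input.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

def retrieve_then_rerank(query, query_vec, chunks, chunk_vecs,
                         score_fn, retrieve_n=50, final_k=5):
    """Stage 1: wide, cheap cosine retrieval. Stage 2: costly joint scoring."""
    q = query_vec / np.linalg.norm(query_vec)
    d = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    candidates = np.argsort(-(d @ q))[:retrieve_n]   # wide net by similarity
    reranked = sorted(candidates,
                      key=lambda i: score_fn(query, chunks[i]),
                      reverse=True)                  # rerank candidates only
    return [chunks[i] for i in reranked[:final_k]]

chunks = ["contract formation overview",
          "termination notice period is 30 days",
          "contract law glossary"]
chunk_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]])  # toy embeddings
top = retrieve_then_rerank("termination notice period", np.array([1.0, 0.0]),
                           chunks, chunk_vecs, overlap_score, final_k=2)
```

In this toy example the chunk nearest in embedding space is the formation overview, but the reranking stage promotes the chunk that actually answers the question. The structural point carries over to real systems: the expensive scorer only ever sees the candidate set, which is what keeps per-query latency bounded.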

Studies on standard information retrieval benchmarks consistently show reranking improving NDCG (Normalized Discounted Cumulative Gain) and MRR (Mean Reciprocal Rank) by 10–30% over bi-encoder retrieval alone. In enterprise RAG systems with heterogeneous knowledge bases, the improvement is often larger because the initial retrieval noise is higher.
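For readers less familiar with these metrics, MRR is simple enough to compute by hand, and it makes the effect of reranking tangible: promoting the right document from rank 3 to rank 1 lifts that query's reciprocal rank from 1/3 to 1. A minimal sketch (NDCG works similarly but additionally discounts gains logarithmically by rank):

```python
def mean_reciprocal_rank(rankings, relevant):
    """MRR: average over queries of 1 / rank of the first relevant result."""
    total = 0.0
    for ranking, rel in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)

# Query 1 finds its answer at rank 3; query 2 at rank 1.
mrr = mean_reciprocal_rank([["a", "b", "c"], ["x", "y", "z"]],
                           [{"c"}, {"x"}])
```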

The Reranker Landscape: API vs. Local

Several reranking solutions have emerged as production options. Cohere Rerank is the most widely cited API-based reranker — it's accurate, well-benchmarked, and easy to integrate. Voyage AI and Jina AI offer similar API-based reranking services. For teams already using cloud infrastructure, these add meaningful accuracy with minimal integration complexity.

The problem for air-gap deployments is obvious: API-based rerankers require an outbound call. Every document chunk being reranked leaves your network boundary. For regulated industries, this defeats the purpose of local deployment. Private RAG requires the entire pipeline — including reranking — to run within your infrastructure.

The solution is a local cross-encoder reranker running entirely on-premise, with no external API calls. Latency is higher than a bi-encoder alone but acceptable for enterprise retrieval pipelines. Accuracy improvements over pure cosine similarity are comparable to cloud-based rerankers for most use cases.

Scabera's Approach: Local Reranking as a First-Class Component

Scabera uses local cross-encoder reranking as a core component of the retrieval pipeline — not as an optional add-on. Every query goes through initial vector retrieval followed by cross-encoder reranking before the top results are passed to the model.

This matters for accuracy. It also matters for compliance. Reranking happens on-premise, within your infrastructure, with no external network dependency. The documents being scored never leave your environment. For enterprise RAG deployments in regulated industries, this is the only architecture that keeps the full retrieval pipeline under your control.

Basic RAG systems that skip reranking aren't just leaving accuracy on the table. They're building citation pipelines on a foundation that retrieves confidently and retrieves wrong. In an enterprise context where the outputs drive decisions, that's not a minor inefficiency. It's a reliability problem.

See Scabera in action

Book a demo to see how Scabera keeps your enterprise knowledge synchronized and your AI trustworthy.