Definition

Reranking is the process of reordering a candidate set of retrieved documents using a more precise relevance model after the initial retrieval stage.

Initial retrieval prioritizes recall (not missing relevant documents). Reranking prioritizes precision (ensuring the final results are the most relevant).

In production Retrieval-Augmented Generation (RAG) systems, reranking is typically the final retrieval step before context is sent to the LLM.


Multi-Stage Retrieval Architecture

Standard architecture:

User Query
   ↓
Retriever (BM25 / Vector / Hybrid)
   ↓
Top-N Candidates (typically 20–200)
   ↓
Reranker (Cross-Encoder)
   ↓
Top-K Final Results (typically 3–10)
   ↓
LLM Context

The first stage must not miss relevant documents; the second stage must remove weak matches.
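The pipeline above can be sketched in a few lines. This is a toy illustration, not a production implementation: `retrieve` uses lexical overlap as a stand-in for BM25/vector search, and `rerank_score` is a placeholder for a real cross-encoder call.

```python
def retrieve(query: str, corpus: list[str], top_n: int = 50) -> list[str]:
    """Stage 1 (recall): cheap scoring over the whole corpus."""
    # Toy lexical-overlap score; a real system would use BM25 or ANN search.
    q_terms = set(query.lower().split())
    scored = [(len(q_terms & set(doc.lower().split())), doc) for doc in corpus]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

def rerank_score(query: str, doc: str) -> float:
    # Placeholder for a cross-encoder forward pass; here: exact-phrase bonus.
    return 1.0 if query.lower() in doc.lower() else 0.0

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Stage 2 (precision): expensive per-pair scoring over candidates only."""
    scored = [(rerank_score(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```

Note the asymmetry: stage 1 touches every document, so it must be cheap; stage 2 touches only Top-N candidates, so it can afford a heavier model.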


Why Retrieval Alone Is Not Enough

Embedding retrieval (Bi-Encoders) uses independent representations: the query and document are encoded separately, and similarity is computed by vector distance. This is efficient but lossy, because it compares compressed semantic representations instead of modeling direct query–document relevance.

Bi-Encoder vs. Cross-Encoder

Feature       Bi-Encoder (Retrieval)           Cross-Encoder (Reranking)
Process       Query -> Vector, Doc -> Vector   (Query + Doc) -> Single Model
Evaluation    Vector proximity                 Direct relevance scoring
Scalability   High (supports ANN indexing)     Low (must run per candidate)
Precision     Semantic "relatedness"           Exact "answer" matching
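The structural difference in the table can be shown with toy code. The bag-of-words "embedder" below is a hypothetical stand-in for a neural encoder; the point is the computation pattern, not the scoring quality: document vectors are built once offline for the bi-encoder, while the cross-encoder needs one joint pass per (query, doc) pair.

```python
import math

def embed(text: str) -> dict[str, float]:
    # Toy term-frequency "embedding"; real systems use neural bi-encoders.
    vec: dict[str, float] = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

corpus = {"d1": "stripe webhook retry reset", "d2": "stripe payments overview"}

# Bi-encoder pattern: documents embedded ONCE, offline; queries compared
# to the precomputed index with cheap vector math.
doc_vectors = {doc_id: embed(text) for doc_id, text in corpus.items()}

def bi_encoder_rank(query: str) -> list[str]:
    q = embed(query)  # one encoder call per query
    return sorted(doc_vectors, key=lambda d: cosine(q, doc_vectors[d]), reverse=True)

# Cross-encoder pattern: one joint pass PER (query, doc) pair; nothing
# can be precomputed. The stub score is meaningless -- it only shows that
# the model sees both texts together.
def cross_encoder_rank(query: str, candidates: list[str]) -> list[str]:
    def joint_score(doc_id: str) -> float:
        return cosine(embed(query + " " + corpus[doc_id]), embed(corpus[doc_id]))
    return sorted(candidates, key=joint_score, reverse=True)
```

This is why bi-encoders scale to millions of documents (the index is precomputed) while cross-encoders are only run on a small candidate set.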

Example

Query: "How do I reset a failed Stripe webhook retry?"

  • Bi-Encoder: May return general Stripe webhook documentation or event delivery docs.
  • Cross-Encoder: Correctly ranks the specific "webhook retry" instructions first by evaluating the exact relationship between query and content.

Reranking Strategy

Reranking is computationally expensive and is never performed on the full corpus.

Top-N and Top-K Selection

Most systems start with a Top 50 → Top 5 strategy:

  1. Retrieve the top 50 candidates (Recall).
  2. Pass these 50 pairs (query, chunk) to the reranker.
  3. Keep the top 5 results based on the reranker's score (Precision).
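Steps 2 and 3 above can be sketched as a small selection function (step 1, retrieving the 50 candidates, is assumed done upstream). `score_fn` is a hypothetical stand-in for whatever cross-encoder call scores one (query, chunk) pair.

```python
def rerank_select(query: str, candidates: list[str], score_fn, top_k: int = 5) -> list[str]:
    # Step 2: score each (query, chunk) pair with the cross-encoder.
    scores = [score_fn(query, chunk) for chunk in candidates]
    # Step 3: keep only the top_k chunks by reranker score.
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
```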

Token Limits and Chunks

Reranking is a chunk-level operation. Cross-encoders have context limits (often 512 tokens). Best Practice: Rerank retrieval-sized chunks (300–800 tokens), not entire documents or PDFs, to avoid truncation and excessive latency.
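A minimal guard against silent truncation might look like the sketch below. The 512-token limit mirrors the common cross-encoder window mentioned above; the 4-characters-per-token ratio is a rough English-text heuristic, not a real tokenizer, and `reserve` is an assumed allowance for special tokens.

```python
MAX_TOKENS = 512  # typical cross-encoder context window

def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def fits_reranker(query: str, chunk: str, reserve: int = 16) -> bool:
    # Query and chunk share the model's window; `reserve` leaves room
    # for special tokens ([CLS]/[SEP] etc.).
    return approx_tokens(query) + approx_tokens(chunk) + reserve <= MAX_TOKENS

def clip_chunk(query: str, chunk: str, reserve: int = 16) -> str:
    # Last resort: truncate deterministically rather than letting the
    # model truncate silently. Prefer fixing chunk sizes at indexing time.
    budget_chars = (MAX_TOKENS - approx_tokens(query) - reserve) * 4
    return chunk[:budget_chars]
```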


Implementation Ownership

Reranking requires coordination between the scoring model and the application layer.

Component             Ownership   Responsibility
Scoring Intelligence  System      The Cross-Encoder model (e.g., Vertex AI Ranking API, Cohere) performs the comparison.
Candidate Selection   Developer   Decide how many candidates (N) to send to the reranker.
Orchestration         Developer   Pair the query with each chunk and call the reranking endpoint.
Final Filtering       Developer   Re-sort and truncate the list based on the returned scores.

Common Reranker Models

API-Based (Managed)

  • Cohere Rerank: Widely used, strong quality, simple integration.
  • Google Vertex AI Ranking API: Native fit for GCP; ideal when embeddings and retrieval already run on Vertex.

Local Models (Self-Hosted)

  • BAAI BGE Reranker: Strong open-source baseline (e.g., bge-reranker-large).
  • Jina AI Reranker: High-performance retrieval-focused models.

Best Practice

Production baseline for reliable enterprise retrieval: Hybrid Retrieval + RRF Fusion + Cross-Encoder Reranking + Metadata Filtering.
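The RRF fusion step in this baseline is simple enough to show directly. Reciprocal Rank Fusion merges ranked lists from multiple retrievers (e.g., BM25 and vector search) by summing 1/(k + rank) per document; k = 60 is the constant commonly used in the RRF literature.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked doc-ID lists with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document gains score from every list it appears in;
            # higher positions (smaller rank) contribute more.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The fused list is what gets passed to the cross-encoder reranker; RRF needs only ranks, not comparable scores, which is why it pairs well with hybrid retrieval.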

The LLM should never receive raw retriever output without ranking refinement unless the corpus is very small. The reranker is the precision layer; without it, the LLM receives semantically related noise instead of directly relevant evidence.