Definition
Reranking is the process of reordering a candidate set of retrieved documents using a more precise relevance model after the initial retrieval stage.
Initial retrieval prioritizes recall (not missing relevant documents). Reranking prioritizes precision (ensuring the final results are the most relevant).
In production Retrieval-Augmented Generation (RAG) systems, reranking is typically the final retrieval step before context is sent to the LLM.
Multi-Stage Retrieval Architecture
Standard architecture:
User Query
↓
Retriever (BM25 / Vector / Hybrid)
↓
Top-N Candidates (typically 20–200)
↓
Reranker (Cross-Encoder)
↓
Top-K Final Results (typically 3–10)
↓
LLM Context
The first stage must not miss relevant documents; the second stage must remove weak matches.
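A structure-only sketch of this flow in Python; hybrid_retrieve, cross_encoder_rerank, and llm_generate are hypothetical stand-ins for the three stages, not real library calls:

```python
def answer(query: str) -> str:
    # Stage 1: recall-oriented retrieval (BM25 / vector / hybrid).
    candidates = hybrid_retrieve(query, n=50)                # Top-N: typically 20-200
    # Stage 2: precision-oriented reranking with a cross-encoder.
    context = cross_encoder_rerank(query, candidates, k=5)   # Top-K: typically 3-10
    # Stage 3: generation grounded in the refined context.
    return llm_generate(query, context)
```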
Why Retrieval Alone Is Not Enough
Embedding retrieval (Bi-Encoders) encodes the query and each document independently and computes similarity by vector distance. This is efficient but lossy: it compares compressed semantic representations rather than assessing relevance directly.
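A minimal sketch of bi-encoder scoring, assuming the sentence-transformers library and a small open model; the query and documents here are invented for illustration:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# The query and each document are encoded independently of one another.
query_vec = model.encode("How do I rotate an expired API key?")
doc_vecs = model.encode([
    "API keys authenticate requests to the service.",
    "To rotate a key, generate a replacement, then revoke the old one.",
])

# Relevance is approximated by cosine similarity over compressed
# representations: fast and index-friendly, but lossy.
print(util.cos_sim(query_vec, doc_vecs))
```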
Bi-Encoder vs. Cross-Encoder
| Feature | Bi-Encoder (Retrieval) | Cross-Encoder (Reranking) |
|---|---|---|
| Process | Query -> Vector, Doc -> Vector | (Query + Doc) -> Single Model |
| Evaluation | Vector proximity | Direct relevance scoring |
| Scalability | High (supports ANN indexing) | Low (must run per candidate) |
| Precision | Semantic "relatedness" | Exact "answer" matching |
Example
Query: "How do I reset a failed Stripe webhook retry?"
- Bi-Encoder: May return general Stripe webhook documentation or event delivery docs.
- Cross-Encoder: Correctly ranks the specific "webhook retry" instructions first by evaluating the exact relationship between query and content.
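A sketch of the same query scored by a cross-encoder, assuming sentence-transformers and a public MS MARCO checkpoint; the chunk texts are invented, and actual scores depend on the model:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do I reset a failed Stripe webhook retry?"
chunks = [
    "Stripe webhooks notify your endpoint when account events occur.",
    "Failed webhook deliveries can be resent from the event's delivery log.",
]

# Each (query, chunk) pair passes through the model together, so the
# score reflects their direct relationship rather than vector proximity.
scores = reranker.predict([(query, chunk) for chunk in chunks])
print(sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True))
```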
Reranking Strategy
Reranking is computationally expensive and is never performed on the full corpus.
Top-N and Top-K Selection
Most systems start with a Top 50 → Top 5 strategy:
- Retrieve the top 50 candidates (Recall).
- Pass the 50 (query, chunk) pairs to the reranker.
- Keep the top 5 results based on the reranker's score (Precision).
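A sketch of this Top 50 → Top 5 orchestration, again assuming sentence-transformers; retrieve_top_n is a hypothetical placeholder for the first-stage retriever:

```python
import numpy as np
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_top_k(query: str, candidates: list[str], k: int = 5) -> list[tuple[str, float]]:
    # Score every (query, chunk) pair jointly, then keep the k best.
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    order = np.argsort(scores)[::-1][:k]
    return [(candidates[i], float(scores[i])) for i in order]

# retrieve_top_n is a placeholder for BM25 / vector / hybrid retrieval.
query = "How do I reset a failed Stripe webhook retry?"
top_chunks = rerank_top_k(query, retrieve_top_n(query, n=50), k=5)
```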
Token Limits and Chunks
Reranking is a chunk-level operation. Cross-encoders have context limits (often 512 tokens). Best Practice: Rerank retrieval-sized chunks (300–800 tokens), not entire documents or PDFs, to avoid truncation and excessive latency.
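A sketch of a pre-flight length check, assuming the Hugging Face transformers tokenizer for the same cross-encoder checkpoint; many reranker clients truncate oversized inputs silently, so flagging long pairs up front is cheap insurance:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")

def fits_context(query: str, chunk: str, max_tokens: int = 512) -> bool:
    # Tokenize the pair exactly as the cross-encoder will see it.
    encoded = tokenizer(query, chunk)
    return len(encoded["input_ids"]) <= max_tokens
```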
Implementation Ownership
Reranking requires coordination between the scoring model and the application layer.
| Component | Ownership | Responsibility |
|---|---|---|
| Scoring Intelligence | System | The Cross-Encoder model (e.g., Vertex AI Ranking API, Cohere) performs the comparison. |
| Candidate Selection | Developer | The developer must decide how many candidates (N) to send to the reranker. |
| Orchestration | Developer | The developer writes the code to pair the query with each chunk and call the reranking endpoint. |
| Final Filtering | Developer | The developer re-sorts and truncates the list based on the returned scores. |
Common Reranker Models
API-Based (Managed)
- Cohere Rerank: Widely used, strong quality, simple integration.
- Google Vertex AI Ranking API: Native fit for GCP; ideal when embeddings and retrieval already run on Vertex.
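A sketch of the managed-API pattern using the Cohere Python SDK; parameter and model names change between SDK versions, so treat this as illustrative and check the current Rerank documentation:

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

# candidate_chunks is the Top-N list of strings from the first-stage retriever.
response = co.rerank(
    model="rerank-english-v3.0",
    query="How do I reset a failed Stripe webhook retry?",
    documents=candidate_chunks,
    top_n=5,
)

for hit in response.results:
    print(hit.index, hit.relevance_score)
```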
Local Models (Self-Hosted)
- BAAI BGE Reranker: Strong open-source baseline (e.g., bge-reranker-large).
- Jina AI Reranker: High-performance retrieval-focused models.
Best Practice
Production baseline for reliable enterprise retrieval: Hybrid Retrieval + RRF Fusion + Cross-Encoder Reranking + Metadata Filtering.
The LLM should never receive raw retriever output without ranking refinement unless the corpus is very small. The reranker is the precision layer; without it, the LLM receives semantically related noise instead of directly relevant evidence.
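As a sketch of the fusion step in that baseline: Reciprocal Rank Fusion (RRF) scores each document as the sum of 1 / (k + rank) across the rankings it appears in, with k = 60 as the conventional constant; bm25_ids and vector_ids below are hypothetical ranked ID lists:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    # score(doc) = sum over rankings of 1 / (k + rank_of_doc_in_that_ranking)
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse lexical and vector rankings before the cross-encoder stage.
fused_candidates = rrf_fuse([bm25_ids, vector_ids])
```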