Reranking: Retrieve-then-Rerank with Cross-Encoders

The highest-ROI quality lever in RAG after chunking. A bi-encoder retrieves fast but ranks coarsely; a cross-encoder reads the query and each candidate together and scores them far more accurately. Learn the retrieve-then-rerank pattern, its latency budget, and how to measure the lift.

Why Top-K Retrieval Isn't Enough

In the beginner course you built a RAG pipeline that embeds the question, runs a top-K vector search, and pastes the results into the prompt. It works — until you measure it carefully. Then you discover an uncomfortable gap.

The Recall–Precision Gap

Run a real evaluation on a vector-only retriever and you'll typically see something like this:

Cutoff    Contains the right chunk
Top-50    ~95%
Top-20    ~88%
Top-5     ~70%

The right answer is almost always somewhere in the top-50, but it frequently ranks 8th, or 17th, or 34th. Your prompt only has room for a handful of chunks, so you take the top-5, and roughly 30% of the time the chunk you needed never makes it in. The LLM is then forced to answer from the wrong evidence or refuse.
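To see how a table like the one above is produced, here is a minimal sketch of a hit-rate-at-K calculation. It assumes each evaluation query has exactly one known-relevant chunk ID; the function name `hit_rate_at_k` and the toy data are made up for illustration and are not part of the course pipeline.

```python
def hit_rate_at_k(ranked_ids_per_query, relevant_id_per_query, k):
    """Fraction of queries whose relevant chunk appears in the top-k results."""
    hits = 0
    for ranked_ids, relevant_id in zip(ranked_ids_per_query, relevant_id_per_query):
        if relevant_id in ranked_ids[:k]:
            hits += 1
    return hits / len(relevant_id_per_query)


# Toy data (invented): four queries, each with its retriever ranking
# and the ID of the one chunk that actually answers it.
ranked = [
    ["c1", "c7", "c3", "c9", "c2", "c8"],  # relevant chunk at rank 1
    ["c4", "c5", "c6", "c2", "c1", "c3"],  # relevant chunk at rank 6
    ["c9", "c8", "c7", "c6", "c5", "c4"],  # relevant chunk never retrieved
    ["c2", "c3", "c1", "c4", "c5", "c6"],  # relevant chunk at rank 5
]
relevant = ["c1", "c3", "c0", "c5"]

for k in (2, 5, 6):
    print(f"Top-{k}: {hit_rate_at_k(ranked, relevant, k):.2f}")
```

Running the same function at several cutoffs on your own labeled queries shows how much recall is lost between the wide candidate set and the handful of chunks that reach the prompt.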

This is the single most common reason a RAG system "feels dumb" even though the data is right there in the index.

Why the Ordering Is Coarse

The vector search ranks by cosine similarity between two pre-computed embeddings: one for the query, one for the chunk. Each embedding is a lossy summary — a few hundred or thousand floats standing in for a whole passage. Two things follow:

  1. The query and the document never "meet." Their vectors are computed completely independently, before either knows the other exists. Similarity is measured after the fact, in vector space.
  2. Fine distinctions get averaged away. "The API key rotates every 90 days" and "the API key cannot be rotated" share a topic and most of their vocabulary, so their embeddings can land very close together despite the opposite meaning. Cosine similarity can't tell them apart reliably.
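Both points can be illustrated with a deliberately crude sketch. The `embed` function below is a toy word-count vector standing in for a learned embedding model, not a real encoder, and the example sentences are invented. What it shares with a real bi-encoder is the structure: chunk vectors are built without seeing the query, and comparison happens only afterwards in vector space.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a word-count vector (NOT a real encoder)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v)


# Index time: chunk vectors are computed once, with no knowledge of any future query.
chunks = ["the key is rotated", "the key is not rotated"]
chunk_vectors = [embed(c) for c in chunks]

# Query time: the query is embedded on its own, then compared in vector space.
query_vector = embed("is the key rotated")
scores = [cosine(query_vector, cv) for cv in chunk_vectors]

for chunk, score in zip(chunks, scores):
    print(f"{score:.3f}  {chunk}")
```

In this toy setup the two chunks score close to each other even though one negates the other. Learned embeddings are far better than word counts, but the same independent-encoding structure is what limits how finely they can order near-duplicates.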

Over-Fetching Doesn't Fix It

You might think: just put 50 chunks in the prompt. Three problems:

  • Cost scales with input tokens. 50 chunks is 10× the tokens of 5.
  • Lost in the middle — modern LLMs pay less attention to the middle of a long context, so burying the right chunk at position 30 of 50 can be worse than not including it.
  • Noise — 45 irrelevant chunks give the model 45 chances to anchor on something wrong.

What you actually want is to retrieve broadly but generate narrowly: pull a wide candidate set with the fast retriever, then use a more accurate (and more expensive) scorer to pick the true best 5. That second scorer is a reranker, and it's the subject of this lesson.
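The retrieve-broadly, generate-narrowly idea can be sketched as a small function. This is an outline under stated assumptions, not a specific library's API: `retrieve` and `score_pair` are hypothetical callables you would supply (a vector search and a cross-encoder in a real system), and the toy versions below exist only so the sketch runs on its own.

```python
from typing import Callable, List, Tuple


def retrieve_then_rerank(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    score_pair: Callable[[str, str], float],
    fetch_k: int = 50,
    final_k: int = 5,
) -> List[Tuple[float, str]]:
    """Fetch a wide candidate set, rescore each (query, chunk) pair, keep the best."""
    candidates = retrieve(query, fetch_k)  # fast, coarse ordering
    scored = [(score_pair(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # accurate ordering
    return scored[:final_k]


# Toy stand-ins (hypothetical): a fixed corpus and a word-overlap scorer.
CORPUS = [
    "billing faq",
    "api key rotation policy",
    "how to reset a password",
    "api rate limits",
    "rotating the api key every 90 days",
]


def toy_retrieve(query: str, k: int) -> List[str]:
    """Pretend vector search: returns the first k chunks in a coarse order."""
    return CORPUS[:k]


def toy_score(query: str, chunk: str) -> float:
    """Pretend reranker: counts query words that also appear in the chunk."""
    return float(len(set(query.split()) & set(chunk.split())))


top = retrieve_then_rerank("api key rotation", toy_retrieve, toy_score, fetch_k=5, final_k=2)
for score, chunk in top:
    print(score, chunk)
```

The important design point is that `score_pair` sees the query and the chunk together, which the first-stage retriever never does. Because it runs once per candidate, `fetch_k` is what sets the reranking cost.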

Key Takeaways
  • Vector search has high recall at large K but coarse ordering — the right chunk is often retrieved but ranked too low to reach the prompt
  • Query and document embeddings are computed independently and are lossy, so fine semantic distinctions get blurred
  • Stuffing more chunks into the prompt costs more, triggers 'lost in the middle', and adds noise — the fix is a better ranker, not a bigger prompt

Course Stats

Estimated time: 50 min
Lessons: 5 sections