Reranking: Retrieve-then-Rerank with Cross-Encoders

Why Top-K Retrieval Isn't Enough

In the beginner course you built a RAG pipeline that embeds the question, runs a top-K vector search, and pastes the results into the prompt. It works — until you measure it carefully. Then you discover an uncomfortable gap.

The Recall–Precision Gap

Run a real evaluation on a vector-only retriever and you'll typically see something like this:

Cutoff	Queries where the right chunk is retrieved
Top-50	~95%
Top-20	~88%
Top-5	~70%

The right answer is almost always somewhere in the top-50, but it frequently ranks 8th, or 17th, or 34th. Your prompt only has room for a handful of chunks, so you take the top-5, and roughly 30% of the time the chunk you needed never makes it in. The LLM is then forced to answer from the wrong evidence or refuse.
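
A table like the one above can be produced with a small hit-rate calculation. The following is a minimal sketch, assuming you have already recorded the rank at which the correct chunk appeared for each evaluation query; the function name `hit_rate_at_k` and the ranks are made up for illustration and are not results from a real system.

```python
def hit_rate_at_k(ranks, k):
    """Fraction of queries whose relevant chunk appears in the top k.

    ranks: 1-based rank of the relevant chunk for each query,
    or None if the chunk was not retrieved at all.
    """
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


# Hypothetical ranks for 10 evaluation queries (illustrative only).
ranks = [1, 3, 8, 2, 17, 34, 1, 4, None, 6]

for k in (50, 20, 5):
    print(f"Top-{k}: {hit_rate_at_k(ranks, k):.0%}")
```

The gap between the top-50 and top-5 numbers is the share of queries where the retriever found the chunk but ranked it too low to reach the prompt.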

This is the single most common reason a RAG system "feels dumb" even though the data is right there in the index.

Why the Ordering Is Coarse

The vector search ranks by cosine similarity between two pre-computed embeddings: one for the query, one for the chunk. Each embedding is a lossy summary — a few hundred or thousand floats standing in for a whole passage. Two things follow:

The query and the document never "meet." Their vectors are computed completely independently, before either knows the other exists. Similarity is measured after the fact, in vector space.
Fine distinctions get averaged away. "The API key rotates every 90 days" and "the API key cannot be rotated" can produce very similar embeddings: nearly the same words, opposite meaning. Cosine similarity can't tell them apart reliably.
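
To make the two points above concrete, here is a minimal sketch of how a vector search scores chunks. The 4-dimensional vectors are invented for illustration (real embeddings have hundreds or thousands of dimensions and come from a model). The only place the query and a chunk interact is the final `cosine` call.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings, each computed without any knowledge of the others.
query_vec = [0.9, 0.1, 0.3, 0.0]
chunk_vecs = {
    "key rotates every 90 days": [0.8, 0.2, 0.4, 0.1],
    "key cannot be rotated": [0.8, 0.2, 0.3, 0.2],
    "billing FAQ": [0.0, 0.9, 0.1, 0.4],
}

# Score every chunk against the query, then sort best-first.
scores = {name: cosine(query_vec, vec) for name, vec in chunk_vecs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for name in ranked:
    print(f"{scores[name]:.3f}  {name}")
```

In this toy setup the off-topic chunk is clearly separated, while the two contradictory chunks score within a few hundredths of each other, which is the coarse ordering described above.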

Over-Fetching Doesn't Fix It

You might think: just put 50 chunks in the prompt. Three problems:

Cost scales with input tokens. 50 chunks is 10× the tokens of 5.
Lost in the middle: LLMs tend to pay less attention to the middle of a long context, so a right chunk buried at position 30 of 50 can be almost as bad as leaving it out.
Noise — 45 irrelevant chunks give the model 45 chances to anchor on something wrong.
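
The cost point is simple arithmetic. The sketch below assumes an average chunk size of 300 tokens and a hypothetical input price of $3.00 per million tokens; both numbers are placeholders, so substitute your own chunk size and your provider's pricing.

```python
TOKENS_PER_CHUNK = 300  # assumed average chunk size
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # hypothetical price in USD


def context_cost(num_chunks):
    """Return (input tokens, cost in USD) for the retrieved context only."""
    tokens = num_chunks * TOKENS_PER_CHUNK
    cost = tokens * PRICE_PER_MILLION_INPUT_TOKENS / 1_000_000
    return tokens, cost


for n in (5, 50):
    tokens, cost = context_cost(n)
    print(f"{n} chunks: {tokens} tokens, ${cost:.4f} per query")
```

Whatever the actual price, the ratio stays the same: ten times the chunks means ten times the context tokens on every query.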

What you actually want is to retrieve broadly but generate narrowly: pull a wide candidate set with the fast retriever, then use a more accurate (and more expensive) scorer to pick the true best 5. That second scorer is a reranker, and it's the subject of this lesson.
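
The shape of that two-stage pipeline can be sketched in a few lines. Everything here is a stand-in: `toy_retrieve` pretends to be the vector search and `toy_score` is a crude word-overlap count in place of a real reranker model, so only the control flow (fetch wide, rescore each query-chunk pair, keep the best few) should be taken from this example.

```python
def retrieve_then_rerank(query, retrieve, score, fetch_k=50, final_k=5):
    """Fetch a wide candidate set, rescore each candidate, keep the best."""
    candidates = retrieve(query, fetch_k)
    scored = [(score(query, chunk), chunk) for chunk in candidates]
    # Stable sort: ties keep the retriever's original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:final_k]]


corpus = [
    "The API key rotates every 90 days.",
    "Billing runs monthly.",
    "The API key cannot be rotated manually.",
    "Support hours are 9 to 5.",
]


def toy_retrieve(query, k):
    """Stand-in for the fast retriever: returns the first k chunks."""
    return corpus[:k]


def toy_score(query, chunk):
    """Stand-in for the reranker: counts shared lowercase words."""
    def words(text):
        return set(text.lower().replace(".", "").split())
    return len(words(query) & words(chunk))


top = retrieve_then_rerank(
    "how many days until the API key rotates",
    toy_retrieve,
    toy_score,
    fetch_k=4,
    final_k=2,
)
print(top)
```

Note that `score` sees the query and the chunk together, unlike the retriever, which compares two vectors computed separately. Keeping `retrieve` and `score` as plain function arguments also makes it easy to swap in a real retriever and reranker later without changing the pipeline.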

Key Takeaways

Vector search has high recall at large K but coarse ordering — the right chunk is often retrieved but ranked too low to reach the prompt
Query and document embeddings are computed independently and are lossy, so fine semantic distinctions get blurred
Stuffing more chunks into the prompt costs more, triggers 'lost in the middle', and adds noise — the fix is a better ranker, not a bigger prompt
