Reranking is the highest-ROI quality lever in RAG after chunking. A bi-encoder retrieves fast but ranks coarsely; a cross-encoder reads the query and each candidate together and scores them far more accurately. Learn the retrieve-then-rerank pattern, its latency budget, and how to measure the lift.
In the beginner course you built a RAG pipeline that embeds the question, runs a top-K vector search, and pastes the results into the prompt. It works — until you measure it carefully. Then you discover an uncomfortable gap.
Run a real evaluation on a vector-only retriever and you'll typically see something like this:
| Cutoff (K) | Queries with the right chunk in the top-K |
|---|---|
| Top-50 | ~95% |
| Top-20 | ~88% |
| Top-5 | ~70% |
The right answer is almost always somewhere in the top-50, but it frequently ranks 8th, or 17th, or 34th. Your prompt only has room for a handful of chunks, so you take the top-5, and roughly 30% of the time the chunk you needed never makes it in. The LLM is then forced to answer from the wrong evidence or refuse.
This is the single most common reason a RAG system "feels dumb" even though the data is right there in the index.
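A table like the one above comes from a recall-at-K measurement: for each evaluation query, check whether the known-correct chunk appears within the first K results. The sketch below is one minimal way to compute it, assuming you already have a ranked list of chunk IDs per query and a single labeled "gold" chunk per query. The function name `recall_at_k` and the toy IDs are illustrative, not part of any library.

```python
from typing import Dict, List, Sequence


def recall_at_k(
    ranked_ids: Dict[str, List[str]],
    gold_id: Dict[str, str],
    cutoffs: Sequence[int] = (5, 20, 50),
) -> Dict[int, float]:
    """Fraction of queries whose gold chunk appears within the top-k results.

    ranked_ids maps a query ID to the retriever's ranked chunk IDs (best first).
    gold_id maps the same query ID to the one chunk that answers it
    (an assumption: real eval sets may label several relevant chunks).
    """
    if not ranked_ids:
        raise ValueError("need at least one evaluated query")
    hits = {k: 0 for k in cutoffs}
    for query_id, ranking in ranked_ids.items():
        gold = gold_id[query_id]
        for k in cutoffs:
            if gold in ranking[:k]:
                hits[k] += 1
    total = len(ranked_ids)
    return {k: hits[k] / total for k in cutoffs}


# Toy data: two queries, three retrieved chunks each.
rankings = {
    "q1": ["c3", "c7", "c1"],  # gold chunk ranked 3rd
    "q2": ["c9", "c2", "c4"],  # gold chunk ranked 1st
}
gold = {"q1": "c1", "q2": "c9"}

print(recall_at_k(rankings, gold, cutoffs=(1, 3)))  # {1: 0.5, 3: 1.0}
```

Running the same function on the retriever's output before and after any change gives you a like-for-like number to compare.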
The vector search ranks by cosine similarity between two pre-computed embeddings: one for the query, one for the chunk. Each embedding is a lossy summary — a few hundred or thousand floats standing in for a whole passage. Two things follow:
You might think: just put 50 chunks in the prompt. Three problems:
What you actually want is to retrieve broadly but generate narrowly: pull a wide candidate set with the fast retriever, then use a more accurate (and more expensive) scorer to pick the true best 5. That second scorer is a reranker, and it's the subject of this lesson.
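The two-stage idea can be sketched without committing to any particular model. In the example below, `cheap_score` stands in for the fast vector similarity and `accurate_score` stands in for the reranker; both toy scorers (`word_overlap`, `phrase_match`) and the parameter names `wide_k` and `final_k` are assumptions made for illustration, not real retrieval models or a library API.

```python
from typing import Callable, List, Sequence, Tuple

# (query, chunk) -> relevance score, higher means more relevant
Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    chunks: Sequence[str],
    cheap_score: Scorer,
    accurate_score: Scorer,
    wide_k: int = 50,
    final_k: int = 5,
) -> List[Tuple[str, float]]:
    """Retrieve broadly with a cheap scorer, then keep the best few by an accurate one."""
    # Stage 1: rank every chunk with the cheap scorer and keep a wide candidate set.
    candidates = sorted(chunks, key=lambda c: cheap_score(query, c), reverse=True)[:wide_k]
    # Stage 2: run the expensive scorer only on those candidates.
    rescored = [(c, accurate_score(query, c)) for c in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:final_k]


def word_overlap(query: str, chunk: str) -> float:
    """Toy cheap scorer: number of distinct words shared by query and chunk."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


def phrase_match(query: str, chunk: str) -> float:
    """Toy accurate scorer: 1.0 if the chunk contains the exact query phrase."""
    return 1.0 if query.lower() in chunk.lower() else 0.0


chunks = [
    "password policy and reset schedule for admins",
    "how to reset password from the login page",
    "router reset instructions",
    "billing and invoices",
]
query = "reset password"

# The cheap scorer ties the first two chunks, so the chunk listed first wins.
# The second stage breaks the tie in favor of the chunk that actually answers the query.
top = retrieve_then_rerank(query, chunks, word_overlap, phrase_match, wide_k=3, final_k=1)
print(top)
```

The expensive scorer runs `wide_k` times per query rather than once per chunk in the index, which is what keeps the second stage affordable.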