
Retrieval Evaluation: The Measurement Loop

The measurement infrastructure that ties the whole course together. Build a gold set, pick the right metric (recall@K / MRR / NDCG) for your workload, navigate the offline-vs-online chasm, and run the full tuning loop in the right cost order. Intermediate capstone.


The Measurement Loop You Don't Have

Every previous day in this course has said some variant of "measure this." Day 1: "measure recall before tuning HNSW." Day 2: "build a gold set before changing chunking." Day 3: "test hybrid against your eval set, not your intuition." Day 4: "test on representative data, not benchmarks." Today is about actually doing it — the retrieval evaluation infrastructure without which everything else is guesswork.

Why Most Teams Don't Have One

Building an evaluation set is real upfront work. Nobody is asking for it. The benefits are invisible — you don't ship anything new the day you finish your gold set. The classic dev-team prioritization treats evaluation as Phase 2 work that gets perpetually deprioritized.

The reasons this is a mistake:

Without a gold set, every retrieval change is a guess. You change the chunk size, the embedding model, the hybrid weights — and you have no way to know if it made things better. You convince yourself things look better, but "looks better" turns out to mean "looks better on the three queries I happened to test."

Production traffic IS your eval, but only after the fact. A retrieval regression takes days or weeks to surface through user behavior, and by then the original change is buried in fifty others. Without a gold set you can evaluate against offline, you cannot iterate quickly.

User-facing degradation is invisible. Bad retrieval doesn't crash. It returns slightly worse answers. Users don't complain — they just churn quietly. The team that lacks evaluation infrastructure also lacks the signal to notice.

The Shape of the Loop

What you're building:

  1. Gold set: a collection of representative queries, each paired with the doc IDs that should be returned for it
  2. Evaluation function: takes your retriever + the gold set, computes a quality score
  3. The tuning loop: change one variable (chunk size, model, hybrid weights, k₁) → re-evaluate → keep what helps, revert what doesn't

Once you have this, every tuning decision becomes a data-supported choice. Before this, it's vibes.
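The three pieces above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the `GoldSet` and `Retriever` types, the function names, and the toy doc IDs are all assumptions made for the example, and recall@K stands in for whichever metric fits your workload.

```python
from typing import Callable, Dict, List, Set, Tuple

# Assumed shapes (hypothetical, for illustration only):
# a gold set maps each query to the doc IDs that should be returned for it,
# and a retriever takes (query, k) and returns ranked doc IDs.
GoldSet = Dict[str, Set[str]]
Retriever = Callable[[str, int], List[str]]


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant doc IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def evaluate(retriever: Retriever, gold: GoldSet, k: int = 5) -> float:
    """Evaluation function: mean recall@k over every query in the gold set."""
    scores = [recall_at_k(retriever(q, k), rel, k) for q, rel in gold.items()]
    return sum(scores) / len(scores)


def tuning_step(
    current: Retriever, candidate: Retriever, gold: GoldSet, k: int = 5
) -> Tuple[Retriever, float]:
    """One pass of the loop: keep the candidate only if it scores higher."""
    current_score = evaluate(current, gold, k)
    candidate_score = evaluate(candidate, gold, k)
    if candidate_score > current_score:
        return candidate, candidate_score
    return current, current_score


# Toy data standing in for a real gold set and two retriever configurations.
toy_gold: GoldSet = {
    "reset password": {"doc_12"},
    "refund policy": {"doc_7", "doc_9"},
}
toy_baseline_ranks = {
    "reset password": ["doc_3", "doc_12"],
    "refund policy": ["doc_7", "doc_1"],
}
toy_candidate_ranks = {
    "reset password": ["doc_12", "doc_3"],
    "refund policy": ["doc_7", "doc_9"],
}


def toy_baseline(query: str, k: int) -> List[str]:
    return toy_baseline_ranks[query][:k]


def toy_candidate(query: str, k: int) -> List[str]:
    return toy_candidate_ranks[query][:k]


winner, score = tuning_step(toy_baseline, toy_candidate, toy_gold, k=2)
print(f"kept: {winner.__name__}, recall@2 = {score:.2f}")
```

In a real setup the two retrievers would differ in exactly one variable, so that any change in the score can be attributed to it.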

What "Measure" Actually Looks Like in Numbers

A simple eval run produces a table like:

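| Configuration | recall@5 | recall@10 | MRR | Notes (deltas are recall@5 vs. baseline) |
| --- | --- | --- | --- | --- |
| Baseline (dense only, chunk=500) | 0.71 | 0.83 | 0.62 | Current production |
| chunk=300 with overlap=50 | 0.74 | 0.85 | 0.65 | +3 pts; worth shipping |
| chunk=300 + hybrid (RRF) | 0.79 | 0.90 | 0.71 | +8 pts; ship if infra OK |
| chunk=300 + hybrid + rerank | 0.84 | 0.93 | 0.76 | +13 pts; latency cost matters |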

This table doesn't exist if you don't have a gold set. Once you have one, every internal debate ("should we add a reranker?") gets resolved by re-running the script.
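A script that produces one row of such a table might look like the sketch below. The helper names (`eval_row`, `format_table`) and the toy gold set are hypothetical, and the numbers it prints come from the toy data, not from any real system. MRR here is computed over the top 10 results only, which is an assumption you may want to change.

```python
from typing import Callable, Dict, List, Set

GoldSet = Dict[str, Set[str]]               # query -> relevant doc IDs (assumed shape)
Retriever = Callable[[str, int], List[str]]  # (query, k) -> ranked doc IDs


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Share of relevant docs found in the top k (assumes relevant is non-empty)."""
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def eval_row(name: str, retriever: Retriever, gold: GoldSet) -> Dict[str, object]:
    """Run every gold query once at k=10 and average the three metrics."""
    runs = {q: retriever(q, 10) for q in gold}
    n = len(gold)
    return {
        "config": name,
        "recall@5": sum(recall_at_k(runs[q], rel, 5) for q, rel in gold.items()) / n,
        "recall@10": sum(recall_at_k(runs[q], rel, 10) for q, rel in gold.items()) / n,
        "mrr": sum(reciprocal_rank(runs[q], rel) for q, rel in gold.items()) / n,
    }


def format_table(rows: List[Dict[str, object]]) -> str:
    """Render rows as fixed-width text, one configuration per line."""
    lines = [f"{'Configuration':<30}{'recall@5':>10}{'recall@10':>11}{'MRR':>7}"]
    for r in rows:
        lines.append(
            f"{r['config']:<30}{r['recall@5']:>10.2f}"
            f"{r['recall@10']:>11.2f}{r['mrr']:>7.2f}"
        )
    return "\n".join(lines)


# Toy gold set and retriever, purely for demonstration.
toy_gold: GoldSet = {"q1": {"a"}, "q2": {"b", "c"}}
toy_rankings = {"q1": ["x", "a"], "q2": ["b", "y", "z", "w", "v", "c"]}


def toy_retriever(query: str, k: int) -> List[str]:
    return toy_rankings[query][:k]


row = eval_row("toy config", toy_retriever, toy_gold)
print(format_table([row]))
```

Calling `eval_row` once per configuration and passing the list to `format_table` gives the side-by-side comparison; adding a new configuration is one more call.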

The Half-Done Trap

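A common failure mode: teams build a tiny gold set (5-10 queries) and call it done. They ship retrieval changes, and the gold set reports that everything is fine. Meanwhile real recall has dropped 15 points. The eval misses it because on 10 queries a single query accounts for 10 points of the average, so a handful of queries can neither cover the traffic where the drop happened nor separate a real change from noise.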

This day will give you the rules of thumb (50-200 queries minimum, periodic refresh, distribution-matching to production) that make the gold set actually informative.
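A rough way to see why set size matters is to estimate the margin of error on a measured recall. The sketch below is a back-of-the-envelope approximation, not a rigorous test: it assumes each query is an independent hit or miss and uses the normal approximation to the binomial, which is itself shaky at very small sizes. The function name `recall_margin` is made up for this example.

```python
import math


def recall_margin(recall: float, n_queries: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a hit-or-miss recall on n queries.

    Assumes independent queries and a normal approximation to the binomial.
    """
    return z * math.sqrt(recall * (1.0 - recall) / n_queries)


# Compare a measured recall of 0.80 across gold-set sizes.
for n in (10, 50, 200):
    print(f"n={n:>3}: 0.80 +/- {recall_margin(0.80, n):.2f}")
```

Under these assumptions, a measured recall of 0.80 carries a margin of roughly 25 points at 10 queries and under 6 points at 200, so a 15-point drop sits inside the noise of the small set but well outside the noise of the larger one.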

Why This Is the Most Important Day in the Course

Every other Intermediate-level technique — chunking strategy, hybrid search, multi-modal, ANN tuning — is just one variable in the tuning loop. The loop itself is what makes any of them productive to deploy. A team without evaluation infrastructure tunes one variable at random per quarter, gets a 1-point improvement they can't really verify, and convinces themselves they're making progress.

A team with evaluation infrastructure runs experiments weekly, finds an 8-point win in week two, and ships it Monday morning. Same engineers, different velocity.

Key Takeaways
  • Without a gold set, every retrieval tuning decision is guesswork — production traffic is too slow a feedback loop to iterate against
  • The full loop is: gold set → eval function → change one variable → re-eval → keep what helps. Once it exists, decisions become data-supported.
  • A half-built gold set (5-10 queries) is worse than nothing — it lies confidently. 50-200 representative queries is the realistic floor.

