The measurement infrastructure that ties the whole course together. Build a gold set, pick the right metric (recall@K / MRR / NDCG) for your workload, navigate the offline-vs-online chasm, and run the full tuning loop in the right cost order. Intermediate capstone.
Every previous day in this course has said some variant of "measure this." Day 1: "measure recall before tuning HNSW." Day 2: "build a gold set before changing chunking." Day 3: "test hybrid against your eval set, not your intuition." Day 4: "test on representative data, not benchmarks." Today is about actually doing it — the retrieval evaluation infrastructure without which everything else is guesswork.
Building an evaluation set is real upfront work. Nobody is asking for it. The benefits are invisible — you don't ship anything new the day you finish your gold set. The classic dev-team prioritization treats evaluation as Phase 2 work that gets perpetually deprioritized.
The reasons this is a mistake:
Without a gold set, every retrieval change is a guess. You change the chunk size, the embedding model, the hybrid weights — and you have no way to know if it made things better. You convince yourself things look better, but "looks better" turns out to mean "looks better on the three queries I happened to test."
Production traffic IS your eval — but only after the fact. A retrieval regression takes days or weeks to surface through user behavior, and by then the original change is buried in fifty others. Without a held-out set, you cannot rapidly iterate.
User-facing degradation is invisible. Bad retrieval doesn't crash. It returns slightly worse answers. Users don't complain — they just churn quietly. The team that lacks evaluation infrastructure also lacks the signal to notice.
What you're building:
Once you have this, every tuning decision becomes a data-supported choice. Before this, it's vibes.
A simple eval run produces a table like:
| Configuration | recall@5 | recall@10 | MRR | Notes |
|---|---|---|---|---|
| Baseline (dense only, chunk=500) | 0.71 | 0.83 | 0.62 | Current production |
| chunk=300 with overlap=50 | 0.74 | 0.85 | 0.65 | +3 pts recall@5 vs. baseline; worth shipping |
| chunk=300 + hybrid (RRF) | 0.79 | 0.90 | 0.71 | +8 pts recall@5 vs. baseline; ship if infra OK |
| chunk=300 + hybrid + rerank | 0.84 | 0.93 | 0.76 | +13 pts recall@5 vs. baseline; latency cost matters |
This table doesn't exist if you don't have a gold set. Once you have one, every internal debate ("should we add a reranker?") gets resolved by re-running the script.
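As a rough illustration, the script behind one row of such a table might look like the sketch below. Everything in it is an assumption for illustration: the gold set is a plain dict from query text to a set of relevant document IDs, `retrieve` is a hypothetical callable returning ranked IDs, and recall@K is taken to mean "fraction of the relevant documents found in the top K" (some teams instead use a hit rate, counting a query as a success if any relevant document appears).

```python
from typing import Callable, Dict, List, Sequence, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    retrieve: Callable[[str], List[str]],  # hypothetical retriever: query -> ranked doc IDs
    gold: Dict[str, Set[str]],             # assumed gold-set shape: query -> relevant doc IDs
    ks: Sequence[int] = (5, 10),
) -> Dict[str, float]:
    """Average recall@k and MRR over every query in the gold set."""
    totals = {f"recall@{k}": 0.0 for k in ks}
    totals["MRR"] = 0.0
    for query, relevant in gold.items():
        results = retrieve(query)
        for k in ks:
            totals[f"recall@{k}"] += recall_at_k(results, relevant, k)
        totals["MRR"] += reciprocal_rank(results, relevant)
    return {name: total / len(gold) for name, total in totals.items()}


# Toy data standing in for a real gold set and a real retriever.
toy_gold = {"q1": {"d1"}, "q2": {"d2", "d3"}}
toy_rankings = {"q1": ["d9", "d1", "d4"], "q2": ["d2", "d7", "d8", "d3"]}

scores = evaluate(lambda q: toy_rankings[q], toy_gold, ks=(1, 3))
print(scores)  # {'recall@1': 0.25, 'recall@3': 0.75, 'MRR': 0.75}
```

Running `evaluate` once per configuration against the same gold set, and collecting the returned dicts, would produce the rows of a comparison table like the one above.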
A common failure mode: teams build a tiny gold set (5-10 queries) and call it done. They ship retrieval changes, and the gold set reports that everything is fine. Real recall has dropped 15 points, but the eval cannot see it: with 10 queries, a single query flipping from hit to miss moves the score by up to 10 points, so a 15-point drop is indistinguishable from noise.
This day will give you the rules of thumb (50-200 queries minimum, periodic refresh, distribution-matching to production) that make the gold set actually informative.
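To see why gold-set size matters, a back-of-the-envelope calculation helps. The sketch below is a simplification under stated assumptions: each query is treated as an independent hit-or-miss trial, and the uncertainty is estimated with the normal approximation to the binomial, which is itself crude at very small sample sizes. The function name `recall_margin` is made up for this example.

```python
import math


def recall_margin(recall: float, n_queries: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a measured hit rate.

    Assumes independent hit-or-miss queries and a normal approximation,
    so treat the result as a rough guide rather than an exact interval.
    """
    return z * math.sqrt(recall * (1.0 - recall) / n_queries)


for n in (10, 50, 200):
    margin = recall_margin(0.80, n)
    print(f"n={n:>3}: measured recall 0.80 is uncertain by about +/- {margin:.2f}")
```

Under these assumptions, a 10-query set measuring 0.80 carries an uncertainty of roughly 25 points in either direction, so a 15-point drop sits inside the noise. At 200 queries the margin shrinks to under 6 points, which is small enough to detect a regression of that size.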
Every other Intermediate-level technique — chunking strategy, hybrid search, multi-modal, ANN tuning — is just one variable in the tuning loop. The loop itself is what makes any of them productive to deploy. A team without evaluation infrastructure tunes one variable at random per quarter, gets a 1-point improvement they can't really verify, and convinces themselves they're making progress.
A team with evaluation infrastructure runs experiments weekly, finds an 8-point chunk-size win in week two, and ships it the following Monday morning. Same engineers, different velocity.