Production Evaluation Systems

This day covers the operational scaffolding that keeps production retrieval good over years: continuous evaluation infrastructure, per-segment gold sets, online metric pipelines, A/B testing with shadow traffic, and drift detection. It is the discipline that distinguishes teams that ship retrieval successfully from teams that watch quality silently degrade.

From Gold Set to Continuous Evaluation

Intermediate Day 5 introduced the concept: build a gold set, measure recall@K, iterate. That's the right starting point — but it's a one-shot discipline. Production teams need continuous evaluation infrastructure that runs on every retrieval change AND every corpus change, automatically, with alerts.

This day is about the infrastructure that takes evaluation from "a script someone runs occasionally" to "a system that catches regressions before they ship."
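
As a reference point for everything that follows, here is a minimal sketch of the one-shot measurement: recall@K averaged over a gold set. The data layout (a dict of query to relevant document ids, and a dict of query to ranked results) and all the ids are assumptions made for illustration, not a format prescribed by the course.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("gold entry has no relevant documents")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(
    gold: Dict[str, Set[str]],
    results: Dict[str, List[str]],
    k: int,
) -> float:
    """Average recall@k across every query in the gold set.

    A query with no recorded results counts as recall 0 instead of being skipped.
    """
    scores = [recall_at_k(results.get(q, []), rel, k) for q, rel in gold.items()]
    return sum(scores) / len(scores)


# Hypothetical gold set and retrieval output, for illustration only.
gold = {
    "reset password": {"doc-12", "doc-40"},
    "refund policy": {"doc-7"},
}
results = {
    "reset password": ["doc-12", "doc-3", "doc-99"],
    "refund policy": ["doc-7", "doc-8", "doc-9"],
}
print(mean_recall_at_k(gold, results, k=3))  # 0.75: (1/2 + 1/1) / 2
```

Continuous evaluation is largely a matter of running this same function automatically, on every change, and keeping the numbers it produces.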

The Five Stages of Eval Maturity

Where most teams sit:

Stage 0: No eval. Retrieval changes ship based on developer intuition. Regressions hit production and live there for weeks. Engineering knows quality is "fine" because no one is loudly complaining.

Stage 1: Occasional manual eval. Someone runs an eval script before big changes. Catches catastrophic regressions but misses gradual ones. Numbers exist but aren't tracked over time.

Stage 2: CI-gated eval. Every PR touching retrieval runs the gold set evaluation. A recall drop of more than X% blocks the merge, so regressions get caught before merge instead of after.

Stage 3: Online metrics tracked continuously. Production query logs feed into dashboards: CTR on top results, dwell time, reformulation rate. Drift visible in real time.

Stage 4: Shadow traffic + A/B framework. New retrieval variants run in parallel with production. Statistical significance testing on every shipped change. Automated rollback if regression detected.

Most teams that take retrieval seriously land at Stage 2-3. Stage 4 is for large-scale operations where each regression costs serious money.
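
The Stage 2 gate can be sketched in a few lines. This is one possible shape, not a standard tool: the function names, the 0.02 threshold, and the choice to measure the drop in absolute recall points (rather than relative percent) are all assumptions to be replaced with whatever your team agrees on for "X".

```python
def recall_gate(baseline: float, candidate: float, max_drop: float) -> bool:
    """Return True when the candidate stays within the allowed absolute recall drop."""
    return (baseline - candidate) <= max_drop


def run_gate(baseline: float, candidate: float, max_drop: float = 0.02) -> int:
    """Print a one-line verdict and return a process exit code (0 = pass, 1 = fail)."""
    delta = candidate - baseline
    verdict = "PASS" if recall_gate(baseline, candidate, max_drop) else "FAIL"
    print(
        f"{verdict}: recall@10 {baseline:.3f} -> {candidate:.3f} "
        f"({delta:+.3f}, allowed drop {max_drop:.3f})"
    )
    return 0 if verdict == "PASS" else 1


# Hypothetical numbers: baseline from the main branch, candidate from the PR.
exit_code = run_gate(baseline=0.81, candidate=0.78)
# In a CI job the script would end with sys.exit(exit_code),
# so that a non-zero code fails the check and blocks the merge.
```

Storing the baseline alongside the gold set version also gives you the over-time tracking that Stage 1 lacks.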

Why Teams Underinvest

The work doesn't ship features. A month spent building eval infrastructure produces zero new user-visible capability. Compare that to "I added hybrid search" or "I shipped the new embedding model" — visible accomplishments.

The catch: without the infrastructure, those shipped changes also ship the regressions they caused. Six months later the system is mysteriously worse than it was, and nobody knows which change did it.

The investment math is brutal: a single uncaught retrieval regression that costs ~0.5% conversion rate over a quarter can be worth more than every quarter of eval-infrastructure work combined. Teams that ship retrieval at scale and don't have eval infrastructure eventually pay this cost. Then they invest. Then they wonder why they didn't invest earlier.

What "Continuous Evaluation" Actually Looks Like

A mature setup has four pieces:

  1. Versioned gold sets (covered in Section 2): not a static file but an evolving artifact with per-segment coverage
  2. Offline regression CI running on every change: gold set + eval script + alerting on drops
  3. Online metrics pipeline (Section 3): production queries → logs → warehouse → dashboards
  4. A/B testing framework (Section 4): consistent bucketing, exposure logging, statistical testing

Each piece is real engineering. Together they make retrieval iteration boring — you can change anything because the system tells you what broke.
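
To make "consistent bucketing" and "exposure logging" concrete, here is a minimal sketch, assuming a string user id and a two-variant experiment. The function names, the record fields, and the 10,000-slot granularity are illustrative choices, not part of any particular A/B framework.

```python
import hashlib
from typing import Dict

SLOTS = 10_000  # assumed granularity: shares are resolved to 0.01%


def assign_bucket(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'.

    Hashing the experiment name together with the user id keeps the assignment
    stable across requests and uncorrelated between experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    slot = int.from_bytes(digest[:8], "big") % SLOTS
    return "treatment" if slot < int(treatment_share * SLOTS) else "control"


def exposure_record(user_id: str, experiment: str, query_id: str) -> Dict[str, str]:
    """Build the row to log whenever a user is actually served a variant."""
    return {
        "experiment": experiment,
        "user_id": user_id,
        "query_id": query_id,
        "variant": assign_bucket(user_id, experiment),
    }


# Hypothetical ids, for illustration only.
print(exposure_record("user-123", "chunking-v2", "q-0001"))
```

Because the assignment is a pure function of the ids, any service can recompute it without shared state, and the exposure log records which users actually saw which variant.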

A Concrete Maturity Audit

Ask your team:

  • Could you tell me, right now, your recall@10 on a representative query set?
  • Could you tell me if it's gone up or down in the last month?
  • If a PR landed yesterday that dropped recall by 8 points, would you know?
  • If you A/B tested a new chunking strategy for two weeks, could you tell me the statistical significance of the result?

A "no" on any of these is a gap. The more "no" answers you have, the more trouble retrieval is likely to cause you in the next 6 months.

This day is about getting to yes on all four.
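
For the last audit question, one common way to put a number on significance is a two-proportion z-test on a binary outcome such as "the query got a click". The sketch below assumes that setup and uses made-up counts. It also treats queries as independent, which is a simplification: queries from the same user are usually correlated, so a real pipeline may need to aggregate per user first.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    successes_a: int, n_a: int, successes_b: int, n_b: int
) -> Tuple[float, float]:
    """Two-sided z-test for the difference between two proportions.

    Returns (z, p_value), where z > 0 means variant B has the higher rate.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variance: every outcome is identical")
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: clicked queries out of total queries per variant.
z, p = two_proportion_z_test(successes_a=4100, n_a=10_000, successes_b=4250, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Deciding the metric, the sample size, and the significance threshold before the test starts matters at least as much as the arithmetic.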

Key Takeaways
  • Production retrieval needs continuous evaluation infrastructure (offline CI + online metrics + A/B + drift detection), not a one-shot manual gold set
  • Most teams skip this investment because it doesn't ship features; the cost of skipping it (silent regressions hitting production) is real but accumulates slowly
  • A maturity audit (can you state recall@10, its trend over the last month, whether yesterday's PR regressed it, and the significance of an A/B result?) is a quick gap check: the more 'no' answers, the more trouble retrieval is likely to cause in the next 6 months
