The operational scaffolding that makes production retrieval stay good over years. Continuous evaluation infrastructure, per-segment gold sets, online metric pipelines, A/B testing with shadow traffic, and drift detection — the discipline that distinguishes teams that ship retrieval successfully from teams that watch quality silently degrade.
Intermediate Day 5 introduced the concept: build a gold set, measure recall@K, iterate. That's the right starting point — but it's a one-shot discipline. Production teams need continuous evaluation infrastructure that runs on every retrieval change AND every corpus change, automatically, with alerts.
This day is about the infrastructure that takes evaluation from "a script someone runs occasionally" to "a system that catches regressions before they ship."
Where most teams sit:
Stage 0: No eval. Retrieval changes ship based on developer intuition. Regressions hit production and live there for weeks. Engineering knows quality is "fine" because no one is loudly complaining.
Stage 1: Occasional manual eval. Someone runs an eval script before big changes. Catches catastrophic regressions but misses gradual ones. Numbers exist but aren't tracked over time.
Stage 2: CI-gated eval. Every PR touching retrieval runs the gold set evaluation. Recall drops by more than X% block the merge. Regressions get caught before merge instead of after.
Stage 3: Online metrics tracked continuously. Production query logs feed into dashboards: CTR on top results, dwell time, reformulation rate. Drift visible in real time.
Stage 4: Shadow traffic + A/B framework. New retrieval variants run in parallel with production. Statistical significance testing on every shipped change. Automated rollback if regression detected.
Most teams that take retrieval seriously land at Stage 2-3. Stage 4 is for large-scale operations where each regression costs serious money.
The work doesn't ship features. A month spent building eval infrastructure produces zero new user-visible capability. Compare that to "I added hybrid search" or "I shipped the new embedding model" — visible accomplishments.
The catch: without the infrastructure, those shipped changes also ship the regressions they caused. Six months later the system is mysteriously worse than it was, and nobody knows which change did it.
The investment math is brutal: a single un-caught retrieval regression that costs ~0.5% conversion rate over a quarter is worth more than every quarter of eval-infrastructure work combined. Teams that ship retrieval at scale and don't have eval infrastructure eventually pay this cost. Then they invest. Then they wonder why they didn't invest earlier.
A mature setup has four pieces:
Each piece is real engineering. Together they make retrieval iteration boring — you can change anything because the system tells you what broke.
Ask your team:
A "no" on any of these is a gap. The number of yes answers correlates strongly with how much trouble retrieval will cause you in the next 6 months.
This day is about getting to yes on all four.