This is the intermediate capstone. The last four days gave you the building blocks — hybrid search, metadata filtering, a production RAG schema, and quantization. Today you learn the discipline that ties them together: measuring retrieval quality. You will build a gold set, compute recall@k and MRR with SQL, read EXPLAIN ANALYZE output, and run the end-to-end tuning loop that turns guesses about ef_search and probes into defensible, measured decisions.
For four days you built retrieval machinery on Postgres: hybrid search fusing pgvector and tsvector, metadata filtering with partial and composite indexes, a production RAG schema, and quantization to make embeddings fit in RAM. Every one of those days ended with a tuning knob — RRF weights, ef_search, probes, rerank depth. This capstone is about the only honest way to turn those knobs: measurement.
The most common way teams "evaluate" retrieval is to type a handful of queries into a demo, skim the results, and declare victory. This feels like testing. It is not. You have no number, so you cannot tell whether a config change made things better or worse, you cannot detect a regression when you bump pgvector, and you cannot defend a latency-for-recall trade to your team.
Retrieval quality is a measurable property. The moment you have a number, tuning stops being folklore and becomes engineering.
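Retrieval is the first stage of RAG. If the right chunk never makes it into the top-k, no amount of LLM cleverness downstream can recover it. So the questions to measure are whether the right chunk appears in the top-k at all (recall@k) and how close to the top it ranks (MRR).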
These are retrieval metrics, not generation metrics. They isolate the part of the pipeline you control with Postgres tuning. Generation-quality metrics (faithfulness, answer correctness) sit downstream and are noisier; fix retrieval first.
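To pin down what recall@k and MRR mean before computing them in SQL, here is a minimal Python sketch. It assumes each gold-set query maps to a set of relevant chunk ids and that the retriever returns an ordered list of chunk ids. The names `gold`, `results`, and the chunk ids are made up for illustration. Recall@k is taken here as the fraction of relevant chunks found in the top-k; some teams instead use a hit rate (1 if any relevant chunk appears), so check which definition your harness uses.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunk ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    results: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int
) -> Dict[str, float]:
    """Average recall@k and reciprocal rank over every query in the gold set."""
    recalls = []
    rranks = []
    for query, relevant in gold.items():
        top_k = results.get(query, [])[:k]
        recalls.append(recall_at_k(top_k, relevant, k))
        rranks.append(reciprocal_rank(top_k, relevant))
    n = len(gold)
    return {"recall_at_k": sum(recalls) / n, "mrr": sum(rranks) / n}


# Hypothetical gold set: query id -> ids of the chunks that answer it.
gold = {"q1": {"c1"}, "q2": {"c7", "c9"}}
# Hypothetical retriever output: query id -> ranked chunk ids.
results = {"q1": ["c3", "c1", "c4"], "q2": ["c9", "c2", "c5"]}

metrics = evaluate(results, gold, k=3)
print(metrics)
```

For `q1` the only relevant chunk is found at rank 2 (recall 1.0, reciprocal rank 0.5). For `q2` one of two relevant chunks is found at rank 1 (recall 0.5, reciprocal rank 1.0). Both averages therefore come out to 0.75.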
By the end you will have an evaluation harness that lives next to your data — a gold set of query/answer pairs in a table, SQL that computes recall@k and MRR directly against your live index, and a tuning loop that sweeps ef_search / probes and reads the recall-vs-latency curve to pick a setting. Because the harness runs in Postgres against the real index, the numbers it produces are the numbers your application will actually see.
You are turning "the search feels pretty good" into "recall@10 is 0.94 at p95 = 38 ms with ef_search = 100, and raising it to 200 buys 1.5 points of recall for an extra 22 ms, which is not worth it."
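One way to turn that kind of statement into a repeatable decision is to apply a simple rule to the sweep results. The sketch below is an assumption, not the lesson's own method: it walks the sweep from the cheapest setting upward and stops when the next step buys less recall per added millisecond than a threshold you choose. The `SweepPoint` type, `pick_setting` function, and threshold value are hypothetical. The rows for ef_search = 100 and 200 reuse the illustrative figures from the sentence above, and the ef_search = 40 row is invented to show an accepted step.

```python
from typing import List, NamedTuple


class SweepPoint(NamedTuple):
    ef_search: int
    recall: float   # recall@10 measured against the gold set
    p95_ms: float   # p95 query latency in milliseconds


def pick_setting(points: List[SweepPoint], min_gain_per_ms: float) -> SweepPoint:
    """Return the last setting whose extra recall justified its extra latency."""
    ordered = sorted(points, key=lambda p: p.ef_search)
    chosen = ordered[0]
    for candidate in ordered[1:]:
        gain = candidate.recall - chosen.recall
        added_ms = candidate.p95_ms - chosen.p95_ms
        if gain <= 0:
            break  # no recall gained, so stop climbing
        if added_ms > 0 and gain / added_ms < min_gain_per_ms:
            break  # recall gained too slowly for the latency it costs
        chosen = candidate
    return chosen


sweep = [
    SweepPoint(ef_search=40, recall=0.88, p95_ms=25.0),    # invented row
    SweepPoint(ef_search=100, recall=0.94, p95_ms=38.0),   # figures from the text
    SweepPoint(ef_search=200, recall=0.955, p95_ms=60.0),  # +1.5 points, +22 ms
]

# Hypothetical threshold: require at least 0.1 recall points per added ms.
best = pick_setting(sweep, min_gain_per_ms=0.001)
print(best.ef_search)
```

With these numbers the step from 40 to 100 clears the threshold and the step from 100 to 200 does not, so the rule settles on ef_search = 100. The threshold itself is a product decision; the point of the harness is that you argue about that one number instead of about impressions.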