Tuning & Evaluating Postgres Retrieval

This is the intermediate capstone. The last four days gave you the building blocks — hybrid search, metadata filtering, a production RAG schema, and quantization. Today you learn the discipline that ties them together: measuring retrieval quality. You will build a gold set, compute recall@k and MRR with SQL, read EXPLAIN ANALYZE output, and run the end-to-end tuning loop that turns guesses about ef_search and probes into defensible, measured decisions.

Why You Cannot Tune What You Cannot Measure

For four days you built retrieval machinery on Postgres: hybrid search fusing pgvector and tsvector, metadata filtering with partial and composite indexes, a production RAG schema, and quantization to make embeddings fit in RAM. Every one of those days ended with a tuning knob — RRF weights, ef_search, probes, rerank depth. This capstone is about the only honest way to turn those knobs: measurement.

The Trap of Eyeballing

The most common way teams "evaluate" retrieval is to type a handful of queries into a demo, skim the results, and declare victory. This feels like testing. It is not. You have no number, so you cannot tell whether a config change made things better or worse, you cannot detect a regression when you bump pgvector, and you cannot defend a latency-for-recall trade to your team.

Retrieval quality is a measurable property. The moment you have a number, tuning stops being folklore and becomes engineering.

Two Numbers That Matter

Retrieval is the first stage of RAG. If the right chunk never makes it into the top-k, no amount of LLM cleverness downstream can recover it. So the questions to measure are:

  1. Did we retrieve the relevant chunk at all, within k results? → answered by recall@k
  2. How high up the list did it appear? → answered by MRR (Mean Reciprocal Rank)
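Both metrics can be sketched in a few lines of Python before moving them into SQL. This is a minimal illustration, not the course's harness: it assumes each gold query has exactly one relevant chunk id, that retrieval returns a ranked list of chunk ids, and that a miss contributes 0 to MRR. The function names and the sample ids are hypothetical.

```python
from typing import Sequence


def recall_at_k(ranked: Sequence[Sequence[int]], relevant: Sequence[int], k: int) -> float:
    """Fraction of queries whose relevant chunk appears in the top-k results."""
    hits = sum(1 for results, rel in zip(ranked, relevant) if rel in results[:k])
    return hits / len(relevant)


def mean_reciprocal_rank(ranked: Sequence[Sequence[int]], relevant: Sequence[int]) -> float:
    """Average of 1/rank of the relevant chunk; a query that misses adds 0."""
    total = 0.0
    for results, rel in zip(ranked, relevant):
        if rel in results:
            total += 1.0 / (results.index(rel) + 1)  # ranks are 1-based
    return total / len(relevant)


# Hypothetical gold set: one ranked result list and one relevant id per query.
ranked_results = [
    [7, 3, 9],     # relevant chunk 7 is at rank 1
    [4, 8, 2],     # relevant chunk 2 is at rank 3
    [5, 6, 1],     # relevant chunk 10 was not retrieved
    [11, 12, 13],  # relevant chunk 12 is at rank 2
]
relevant_ids = [7, 2, 10, 12]

recall_at_2 = recall_at_k(ranked_results, relevant_ids, k=2)
recall_at_3 = recall_at_k(ranked_results, relevant_ids, k=3)
mrr = mean_reciprocal_rank(ranked_results, relevant_ids)
print(recall_at_2, recall_at_3, round(mrr, 4))
```

Widening k from 2 to 3 picks up the second query, so recall rises while MRR stays the same: recall@k only asks whether the chunk made the cut, MRR rewards placing it near the top.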

These are retrieval metrics, not generation metrics. They isolate the part of the pipeline you control with Postgres tuning. Generation-quality metrics (faithfulness, answer correctness) sit downstream and are noisier; fix retrieval first.

What This Day Builds Toward

By the end you will have an evaluation harness that lives next to your data — a gold set of query/answer pairs in a table, SQL that computes recall@k and MRR directly against your live index, and a tuning loop that sweeps ef_search / probes and reads the recall-vs-latency curve to pick a setting. Because the harness runs in Postgres against the real index, the numbers it produces are the numbers your application will actually see.

The One-Sentence Summary

You are turning "the search feels pretty good" into "recall@10 is 0.94 at p95 = 38 ms with ef_search = 100, and raising it to 200 buys 1.5 points of recall for an extra 22 ms of p95 latency, which is not worth it."
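The trade-off in that sentence can be written down as arithmetic. The sketch below uses only the figures from the summary, read under two assumptions: "1.5 points" means 1.5 percentage points of recall@10 (0.94 to 0.955), and 22 ms is latency added on top of the 38 ms baseline. The `Setting` class, the `marginal_cost` helper, and the 5 ms-per-point budget are hypothetical; a real budget is a team decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setting:
    ef_search: int
    recall_at_10: float
    p95_ms: float


def marginal_cost(base: Setting, candidate: Setting) -> float:
    """Extra p95 milliseconds paid per percentage point of recall gained."""
    gained_points = (candidate.recall_at_10 - base.recall_at_10) * 100
    if gained_points <= 0:
        return float("inf")  # no recall gained, so any extra latency is wasted
    return (candidate.p95_ms - base.p95_ms) / gained_points


# Figures from the summary sentence, under the assumptions stated above.
base = Setting(ef_search=100, recall_at_10=0.94, p95_ms=38.0)
candidate = Setting(ef_search=200, recall_at_10=0.955, p95_ms=60.0)

MAX_MS_PER_POINT = 5.0  # hypothetical latency budget per recall point

cost = marginal_cost(base, candidate)
accept = cost <= MAX_MS_PER_POINT
print(f"{cost:.1f} ms per recall point, accept={accept}")
```

At roughly 14.7 ms per recall point the candidate exceeds the assumed budget, which is the "not worth it" verdict expressed as a comparison that can be rerun after every config change.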

Key Takeaways
  • Eyeballing a few demo queries is not evaluation — without a number you cannot detect regressions or defend trade-offs
  • Retrieval is RAG's first stage: if the right chunk misses the top-k, the LLM cannot recover it, so measure retrieval before generation
  • The two retrieval metrics that matter are recall@k (did we find it?) and MRR (how high did it rank?)

Course Stats

Estimated Time: 50 min
Lessons: 5 sections