Tuning & Evaluating Postgres Retrieval

Why You Cannot Tune What You Cannot Measure

For four days you built retrieval machinery on Postgres: hybrid search fusing pgvector and tsvector, metadata filtering with partial and composite indexes, a production RAG schema, and quantization to make embeddings fit in RAM. Every one of those days ended with a tuning knob — RRF weights, ef_search, probes, rerank depth. This capstone is about the only honest way to turn those knobs: measurement.

The Trap of Eyeballing

The most common way teams "evaluate" retrieval is to type a handful of queries into a demo, skim the results, and declare victory. This feels like testing. It is not. You have no number, so you cannot tell whether a config change made things better or worse, you cannot detect a regression when you bump pgvector, and you cannot defend a latency-for-recall trade to your team.

Retrieval quality is a measurable property. The moment you have a number, tuning stops being folklore and becomes engineering.

Two Numbers That Matter

Retrieval is the first stage of RAG. If the right chunk never makes it into the top-k, no amount of LLM cleverness downstream can recover it. So the questions to measure are:

Did we retrieve the relevant chunk at all, within k results? → answered by recall@k
How high up the list did it appear? → answered by MRR (Mean Reciprocal Rank)

These are retrieval metrics, not generation metrics. They isolate the part of the pipeline you control with Postgres tuning. Generation-quality metrics (faithfulness, answer correctness) sit downstream and are noisier; fix retrieval first.
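To make the two metrics concrete, here is a minimal sketch in Python. The function names and the sample chunk ids are made up for illustration; nothing here is tied to a particular schema. It assumes each gold query comes with a set of relevant chunk ids. With exactly one relevant chunk per query, which is how the questions above are framed, recall@k for a single query is simply 0 or 1. Reciprocal rank is computed over the full returned list here; truncating it at k is an equally common convention.

```python
from typing import Hashable, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of the relevant chunk ids that appear in the top-k results."""
    if not relevant:
        raise ValueError("each gold query needs at least one relevant id")
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """1 / position of the first relevant result, or 0.0 if none was returned."""
    for position, chunk_id in enumerate(ranked, start=1):
        if chunk_id in relevant:
            return 1.0 / position
    return 0.0


def evaluate(runs: Sequence[Tuple[Sequence[Hashable], Set[Hashable]]], k: int) -> Tuple[float, float]:
    """Average recall@k and MRR over (ranked_ids, relevant_ids) pairs."""
    n = len(runs)
    mean_recall = sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / n
    mrr = sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / n
    return mean_recall, mrr


# Hypothetical results for three gold queries (ids are illustrative only).
RUNS = [
    ([7, 3, 9], {3}),   # relevant chunk at rank 2 -> reciprocal rank 0.5
    ([4, 1, 8], {4}),   # relevant chunk at rank 1 -> reciprocal rank 1.0
    ([5, 6, 2], {10}),  # relevant chunk never retrieved -> 0.0
]

if __name__ == "__main__":
    recall, mrr = evaluate(RUNS, k=3)
    print(f"recall@3 = {recall:.3f}, MRR = {mrr:.3f}")
```

Two of the three queries find their chunk within the top 3, so recall@3 is 2/3. The reciprocal ranks average to (0.5 + 1.0 + 0.0) / 3 = 0.5, which is the MRR.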

What This Day Builds Toward

By the end you will have an evaluation harness that lives next to your data — a gold set of query/answer pairs in a table, SQL that computes recall@k and MRR directly against your live index, and a tuning loop that sweeps ef_search / probes and reads the recall-vs-latency curve to pick a setting. Because the harness runs in Postgres against the real index, the numbers it produces are the numbers your application will actually see.
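The last step of that loop, reading the curve and picking a setting, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the harness itself: `SweepPoint`, `pick_setting`, and `marginal_cost` are hypothetical names, and the sweep rows are hard-coded. On a real system each row would come from re-running the gold set after changing pgvector's `hnsw.ef_search` (or `ivfflat.probes` for an IVFFlat index) and timing the queries.

```python
from typing import List, NamedTuple, Optional, Sequence, Tuple


class SweepPoint(NamedTuple):
    ef_search: int
    recall_at_10: float
    p95_ms: float


def pick_setting(
    points: Sequence[SweepPoint], min_recall: float, max_p95_ms: float
) -> Optional[SweepPoint]:
    """Lowest-latency point that meets the recall floor within the latency budget."""
    ok = [p for p in points if p.recall_at_10 >= min_recall and p.p95_ms <= max_p95_ms]
    return min(ok, key=lambda p: p.p95_ms) if ok else None


def marginal_cost(points: Sequence[SweepPoint]) -> List[Tuple[int, float, float]]:
    """For each step up the sweep: (ef_search, recall points gained, extra ms)."""
    ordered = sorted(points, key=lambda p: p.ef_search)
    return [
        (
            b.ef_search,
            round((b.recall_at_10 - a.recall_at_10) * 100, 2),
            round(b.p95_ms - a.p95_ms, 2),
        )
        for a, b in zip(ordered, ordered[1:])
    ]


# Illustrative measurements only; a real sweep would fill these in from the index.
SWEEP = [
    SweepPoint(ef_search=40, recall_at_10=0.88, p95_ms=21.0),
    SweepPoint(ef_search=100, recall_at_10=0.94, p95_ms=38.0),
    SweepPoint(ef_search=200, recall_at_10=0.955, p95_ms=60.0),
]

if __name__ == "__main__":
    print(pick_setting(SWEEP, min_recall=0.93, max_p95_ms=50.0))
    for ef, gained, extra_ms in marginal_cost(SWEEP):
        print(f"ef_search={ef}: +{gained} recall points for +{extra_ms} ms")
```

The recall floor and the latency budget are the two inputs you have to decide on before tuning. Once they are explicit, choosing a setting is a lookup rather than a debate, and the marginal-cost view shows what each further step up the sweep would buy.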

The One-Sentence Summary

You are turning "the search feels pretty good" into "recall@10 is 0.94 at p95 = 38 ms with ef_search = 100, and raising it to 200 buys 1.5 points of recall at the cost of an extra 22 ms, which is not worth it."

Key Takeaways

Eyeballing a few demo queries is not evaluation — without a number you cannot detect regressions or defend trade-offs
Retrieval is RAG's first stage: if the right chunk misses the top-k, the LLM cannot recover it, so measure retrieval before generation
The two retrieval metrics that matter are recall@k (did we find it?) and MRR (how high did it rank?)
