This is the capstone of the Intermediate level. Across this track you built an extraction pipeline, designed an ontology, wired up hybrid graph + vector retrieval, and resolved duplicate entities. Today you learn to prove any of it works: how to measure graph and GraphRAG retrieval quality with gold subgraphs, answer sets, and metrics like subgraph recall, path correctness, hop-precision, and answer faithfulness — then close the tuning loop that ties extraction, resolution, and retrieval together into a system you can actually improve.
Days 1–4 built a GraphRAG system: you extracted entities and relations from text, designed an ontology to give them shape, combined graph traversal with vectors in hybrid retrieval, and collapsed duplicates with entity resolution. Each of those days improved the system in some way. The uncomfortable question is: how do you know it improved? Today answers that.
The single most common failure mode in GraphRAG projects is tuning by vibes. You add a hop to your traversal, ask three questions, the answers look nicer, and you ship it. Three weeks later a different class of question regresses and nobody notices because there's no measurement. Every change you made across this track — a new relation type, a stricter entity-resolution threshold, a re-weighted hybrid score — is a hypothesis. Evaluation is how you test the hypothesis instead of believing it.
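One way to make "a different class of question regresses and nobody notices" concrete is to score a fixed question set before and after each change and compare the averages per question category. The sketch below is a minimal illustration under stated assumptions: the category names, the scores, the `tolerance` value, and the helper names `mean_by_category` and `find_regressions` are all hypothetical, and the score could be any per-question metric you already compute.

```python
from collections import defaultdict
from statistics import mean


def mean_by_category(scores):
    """Average a metric per question category.

    `scores` is a list of (category, score) pairs, one per evaluation question.
    """
    buckets = defaultdict(list)
    for category, score in scores:
        buckets[category].append(score)
    return {category: mean(values) for category, values in buckets.items()}


def find_regressions(baseline, candidate, tolerance=0.02):
    """Return categories whose mean score dropped by more than `tolerance`."""
    before = mean_by_category(baseline)
    after = mean_by_category(candidate)
    return {
        category: (before[category], after[category])
        for category in before
        if category in after and before[category] - after[category] > tolerance
    }


# Made-up scores for illustration only: the same four questions,
# scored before and after a hypothetical "add one hop" change.
baseline = [("single_hop", 0.90), ("single_hop", 0.80),
            ("multi_hop", 0.50), ("multi_hop", 0.60)]
candidate = [("single_hop", 0.70), ("single_hop", 0.70),
             ("multi_hop", 0.80), ("multi_hop", 0.90)]

regressions = find_regressions(baseline, candidate)
print(regressions)  # only the single_hop category is flagged
```

In this toy data the overall average goes up, so three spot-check questions could easily look better, while the per-category view shows that single-hop questions got worse.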
In the Vector DB track, retrieval quality reduced to one clean idea: did you return the true top-K nearest neighbors? Recall@K against a brute-force ground truth told the whole story. Graph retrieval is messier for three reasons:
Retrieving the nodes Acme, Beth, and Widget is useless if you missed the edges that say Beth WORKS_AT Acme and Acme MAKES Widget. The connection is the answer.
Throughout today, keep two distinct targets separate:
A system can have great retrieval and a hallucinating generator, or a faithful generator starved of the right context. Measuring them separately tells you which half to fix.
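To illustrate scoring the two halves separately, here is a minimal sketch using the Acme, Beth, and Widget example. It rests on several assumptions: the gold subgraph is hand-labeled, the question is hypothetical, the generated answer has already been parsed into (subject, relation, object) triples, and `supported_fraction` is only a crude stand-in for a real faithfulness metric.

```python
def recall(retrieved, gold):
    """Fraction of gold items that appear in the retrieved set."""
    if not gold:
        return 1.0
    return len(retrieved & gold) / len(gold)


def supported_fraction(answer_claims, retrieved_edges):
    """Crude faithfulness proxy: share of answer claims backed by a retrieved edge."""
    if not answer_claims:
        return 1.0
    return len(answer_claims & retrieved_edges) / len(answer_claims)


# Hand-labeled gold subgraph for a hypothetical question:
# "Who works at the company that makes Widget?"
gold_nodes = {"Acme", "Beth", "Widget"}
gold_edges = {("Beth", "WORKS_AT", "Acme"), ("Acme", "MAKES", "Widget")}

# The retriever returned every node but only one of the connecting edges.
retrieved_nodes = {"Acme", "Beth", "Widget"}
retrieved_edges = {("Acme", "MAKES", "Widget")}

# Claims in the generated answer, assumed to be already parsed into triples.
answer_claims = {("Beth", "WORKS_AT", "Acme")}

node_recall = recall(retrieved_nodes, gold_nodes)
edge_recall = recall(retrieved_edges, gold_edges)
faithfulness = supported_fraction(answer_claims, retrieved_edges)

print(f"node recall:  {node_recall:.2f}")   # 1.00
print(f"edge recall:  {edge_recall:.2f}")   # 0.50
print(f"faithfulness: {faithfulness:.2f}")  # 0.00
```

Node recall alone looks perfect here, while edge recall exposes the missing connection. The answer happens to be true, yet it scores zero on the faithfulness proxy because nothing in the retrieved context supports it, which points at the retrieval half as the thing to fix.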