Evaluating Graph Retrieval

This is the capstone of the Intermediate level. Across this track you built an extraction pipeline, designed an ontology, wired up hybrid graph + vector retrieval, and resolved duplicate entities. Today you learn to prove any of it works: how to measure graph and GraphRAG retrieval quality with gold subgraphs, answer sets, and metrics like subgraph recall, path correctness, hop-precision, and answer faithfulness — then close the tuning loop that ties extraction, resolution, and retrieval together into a system you can actually improve.

Why Graph Retrieval Needs Its Own Evaluation

Days 1–4 built a GraphRAG system: you extracted entities and relations from text, designed an ontology to give them shape, combined graph traversal with vectors in hybrid retrieval, and collapsed duplicates with entity resolution. Each of those days was meant to improve the system in some way. The uncomfortable question is: how do you know it improved? Today answers that.

The "It Feels Better" Trap

The single most common failure mode in GraphRAG projects is tuning by vibes. You add a hop to your traversal, ask three questions, the answers look nicer, and you ship it. Three weeks later a different class of question regresses and nobody notices because there's no measurement. Every change you made across this track — a new relation type, a stricter entity-resolution threshold, a re-weighted hybrid score — is a hypothesis. Evaluation is how you test the hypothesis instead of believing it.

Why You Can't Just Reuse Vector-Search Metrics

In the Vector DB track, retrieval quality reduced to one clean idea: did you return the true top-K nearest neighbors? Recall@K against a brute-force ground truth told the whole story. Graph retrieval is messier for three reasons:

  • The unit of retrieval isn't a flat list. A graph retriever returns a subgraph — a set of nodes and edges — or a path connecting entities. "Top-K" doesn't capture whether the right edges came back.
  • Structure carries meaning. Returning the nodes Acme, Beth, and Widget is useless if you missed the edges that say Beth WORKS_AT Acme and Acme MAKES Widget. The connection is the answer.
  • The pipeline is multi-stage. A wrong answer might come from bad extraction (the fact was never in the graph), bad resolution (two nodes that should be one stayed split), or bad retrieval (the fact was there but you didn't traverse to it). A single end-to-end number can't tell these apart.
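To make the second point concrete, here is a minimal sketch that scores the Acme, Beth, and Widget example twice: once over nodes and once over edges. The gold subgraph, the triple representation, and the `recall` helper are all assumptions made for illustration; they are not part of any particular graph library.

```python
# Hypothetical gold subgraph for a question such as
# "Who works at the company that makes Widget?"
gold_nodes = {"Acme", "Beth", "Widget"}
gold_edges = {
    ("Beth", "WORKS_AT", "Acme"),
    ("Acme", "MAKES", "Widget"),
}

# A retriever that found every gold node but none of the connecting edges.
retrieved_nodes = {"Acme", "Beth", "Widget"}
retrieved_edges = set()


def recall(retrieved: set, gold: set) -> float:
    """Fraction of gold items that appear in the retrieved set."""
    if not gold:
        return 1.0  # assumption: nothing to find counts as full recall
    return len(retrieved & gold) / len(gold)


node_recall = recall(retrieved_nodes, gold_nodes)
edge_recall = recall(retrieved_edges, gold_edges)

print(f"node recall: {node_recall:.2f}")  # 1.00
print(f"edge recall: {edge_recall:.2f}")  # 0.00
```

A node-only metric calls this retrieval perfect, while the edge-level score shows that nothing needed to answer the question actually came back.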

Two Things Worth Measuring

Throughout today, keep two distinct targets separate:

  1. Retrieval quality — given a question, did the retriever surface the right nodes, edges, and paths? This is judged against a gold subgraph.
  2. Answer quality — given what was retrieved, did the final generated answer state the correct facts without inventing any? This is judged against a gold answer set and includes faithfulness (no claims unsupported by the retrieved context).

A system can have great retrieval and a hallucinating generator, or a faithful generator starved of the right context. Measuring them separately tells you which half to fix.
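The two failure cases above can be sketched by scoring the same question on both axes. This is a simplified illustration, not a full evaluation harness: it assumes answer claims have already been extracted as (subject, relation, object) triples, which in practice takes a model or human labeling, and it leaves out the comparison against a gold answer set. The names `EvalResult` and `evaluate` are hypothetical.

```python
from dataclasses import dataclass

Triple = tuple[str, str, str]


@dataclass
class EvalResult:
    subgraph_recall: float  # retrieval quality, judged against the gold subgraph
    faithfulness: float     # share of answer claims backed by retrieved context


def evaluate(gold: set[Triple], retrieved: set[Triple],
             claims: set[Triple]) -> EvalResult:
    # Retrieval: how much of the gold subgraph did the retriever surface?
    recall = len(retrieved & gold) / len(gold) if gold else 1.0
    # Faithfulness: how many claims are supported by what was retrieved?
    # Assumption: an answer that makes no claims is treated as fully faithful.
    faithful = len(claims & retrieved) / len(claims) if claims else 1.0
    return EvalResult(recall, faithful)


gold = {("Beth", "WORKS_AT", "Acme"), ("Acme", "MAKES", "Widget")}

# Case A: retrieval is complete, but the generator invents a fact.
case_a = evaluate(
    gold,
    retrieved=set(gold),
    claims={("Beth", "WORKS_AT", "Acme"), ("Acme", "MAKES", "Gadget")},
)

# Case B: the generator stays faithful, but retrieval missed an edge.
case_b = evaluate(
    gold,
    retrieved={("Beth", "WORKS_AT", "Acme")},
    claims={("Beth", "WORKS_AT", "Acme")},
)

print(case_a)  # subgraph_recall=1.0, faithfulness=0.5
print(case_b)  # subgraph_recall=0.5, faithfulness=1.0
```

Averaging the two numbers would give both cases the same 0.75, which hides the fact that Case A needs a generator fix and Case B needs a retrieval fix.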

Key Takeaways
  • Tuning GraphRAG by vibes is the default failure mode: every change you made across this track was a hypothesis that needs a measurement to confirm it
  • Vector recall@K isn't enough: graph retrieval returns subgraphs and paths, where the edges and connections carry the meaning
  • Separate retrieval quality (gold subgraph) from answer quality (gold answer set + faithfulness) so you know which half to fix

Course Stats

Estimated Time
50 min
Lessons
5 sections