Evaluating Graph Retrieval

Why Graph Retrieval Needs Its Own Evaluation

This is the capstone of the Intermediate level. Days 1–4 built a GraphRAG system: you extracted entities and relations from text, designed an ontology to give them shape, combined graph traversal with vectors in hybrid retrieval, and collapsed duplicates with entity resolution. Each of those days was meant to improve the system in some way. The uncomfortable question is: how do you know it actually improved? Today answers that.

The "It Feels Better" Trap

The single most common failure mode in GraphRAG projects is tuning by vibes. You add a hop to your traversal, ask three questions, the answers look nicer, and you ship it. Three weeks later a different class of question regresses and nobody notices because there's no measurement. Every change you made across this track — a new relation type, a stricter entity-resolution threshold, a re-weighted hybrid score — is a hypothesis. Evaluation is how you test the hypothesis instead of believing it.
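One way to make "a different class of question regresses and nobody notices" concrete is to compare a baseline run and a candidate run per question class rather than by a single average. The sketch below is illustrative only: the class labels, the scores, and the function names `mean_by_class` and `regressions` are assumptions, and the per-question score could be any metric you choose.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# Each record is (question_class, score). Class labels are hypothetical.
Scores = List[Tuple[str, float]]


def mean_by_class(scores: Scores) -> Dict[str, float]:
    """Average the per-question scores within each question class."""
    grouped = defaultdict(list)
    for question_class, score in scores:
        grouped[question_class].append(score)
    return {cls: mean(vals) for cls, vals in grouped.items()}


def regressions(baseline: Scores, candidate: Scores,
                tolerance: float = 0.0) -> Dict[str, float]:
    """Return classes whose mean score dropped by more than `tolerance`.

    Values are the deltas (candidate minus baseline), so they are negative.
    """
    base = mean_by_class(baseline)
    cand = mean_by_class(candidate)
    return {
        cls: cand[cls] - base[cls]
        for cls in base
        if cls in cand and cand[cls] - base[cls] < -tolerance
    }


# Made-up scores: the candidate looks better overall but hurts one class.
baseline = [("one_hop", 0.75), ("one_hop", 0.75),
            ("multi_hop", 0.5), ("multi_hop", 0.5)]
candidate = [("one_hop", 0.5), ("one_hop", 0.5),
             ("multi_hop", 0.75), ("multi_hop", 1.0)]

overall_before = mean(score for _, score in baseline)
overall_after = mean(score for _, score in candidate)
regressed = regressions(baseline, candidate)
```

With these made-up numbers the overall mean goes up while the `one_hop` class gets worse, which is exactly the kind of change that three spot-checked questions would miss.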

Why You Can't Just Reuse Vector-Search Metrics

In the Vector DB track, retrieval quality reduced to one clean idea: did you return the true top-K nearest neighbors? Recall@K against a brute-force ground truth told the whole story. Graph retrieval is messier for three reasons:

The unit of retrieval isn't a flat list. A graph retriever returns a subgraph — a set of nodes and edges — or a path connecting entities. "Top-K" doesn't capture whether the right edges came back.
Structure carries meaning. Returning the nodes Acme, Beth, and Widget is useless if you missed the edges that say Beth WORKS_AT Acme and Acme MAKES Widget. The connection is the answer.
The pipeline is multi-stage. A wrong answer might come from bad extraction (the fact was never in the graph), bad resolution (two nodes that should be one stayed split), or bad retrieval (the fact was there but you didn't traverse to it). A single end-to-end number can't tell these apart.
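The three points above can be sketched in a few lines. This is a minimal illustration, assuming edges are stored as `(head, relation, tail)` tuples and that a hand-built `aliases` mapping from duplicate node names to canonical names is available; the `recall` and `diagnose` helpers and the `"Acme Corp"` duplicate are hypothetical, not part of any library or of the system built on earlier days.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str, str]  # (head, relation, tail)


def recall(gold: Set, retrieved: Set) -> float:
    """Fraction of gold items that were retrieved (1.0 when gold is empty)."""
    if not gold:
        return 1.0
    return len(gold & retrieved) / len(gold)


def diagnose(fact: Edge, graph_edges: Set[Edge], retrieved_edges: Set[Edge],
             aliases: Dict[str, str]) -> str:
    """Guess which stage lost a gold fact.

    `aliases` maps a duplicate node name to its canonical name
    (an assumed, hand-built mapping used only for this diagnosis).
    """
    def canon(edge: Edge) -> Edge:
        head, relation, tail = edge
        return (aliases.get(head, head), relation, aliases.get(tail, tail))

    if fact in retrieved_edges:
        return "ok"
    if fact in graph_edges:
        return "retrieval"   # in the graph, but traversal never reached it
    if canon(fact) in {canon(e) for e in graph_edges}:
        return "resolution"  # present only under an unmerged duplicate node
    return "extraction"      # the fact never made it into the graph


gold_nodes = {"Acme", "Beth", "Widget"}
gold_edges = {("Beth", "WORKS_AT", "Acme"), ("Acme", "MAKES", "Widget")}

# The retriever returned every gold node but only one of the two gold edges.
retrieved_nodes = {"Acme", "Beth", "Widget"}
retrieved_edges = {("Beth", "WORKS_AT", "Acme")}

node_recall = recall(gold_nodes, retrieved_nodes)
edge_recall = recall(gold_edges, retrieved_edges)

# The graph holds the MAKES fact only under the unmerged name "Acme Corp".
graph_edges = {("Beth", "WORKS_AT", "Acme"), ("Acme Corp", "MAKES", "Widget")}
stage = diagnose(("Acme", "MAKES", "Widget"), graph_edges, retrieved_edges,
                 aliases={"Acme Corp": "Acme"})
```

Here node recall is perfect while edge recall is only half, and the missing edge is attributed to resolution rather than retrieval. A node-only metric, or a single end-to-end score, would hide both facts.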

Two Things Worth Measuring

Throughout today, keep two distinct targets separate:

Retrieval quality — given a question, did the retriever surface the right nodes, edges, and paths? This is judged against a gold subgraph.
Answer quality — given what was retrieved, did the final generated answer state the correct facts without inventing any? This is judged against a gold answer set and includes faithfulness (no claims unsupported by the retrieved context).

A system can have great retrieval and a hallucinating generator, or a faithful generator starved of the right context. Measuring them separately tells you which half to fix.
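A rough sketch of keeping the two targets apart follows. It assumes the generated answer has already been broken into claims expressed as `(head, relation, tail)` triples by some separate step that is not shown, and that the gold answer set uses the same form. The `EvalResult` and `evaluate` names and the three ratios are illustrative choices, not a standard API.

```python
from dataclasses import dataclass
from typing import Set, Tuple

Edge = Tuple[str, str, str]  # (head, relation, tail)


@dataclass
class EvalResult:
    retrieval_recall: float  # share of gold subgraph edges that were retrieved
    answer_recall: float     # share of gold answer facts the answer states
    faithfulness: float      # share of answer claims found in retrieved context


def _ratio(hits: int, total: int) -> float:
    """hits / total, treating an empty denominator as a perfect score."""
    return 1.0 if total == 0 else hits / total


def evaluate(gold_subgraph: Set[Edge], retrieved: Set[Edge],
             gold_answers: Set[Edge], answer_claims: Set[Edge]) -> EvalResult:
    """Score retrieval and generation separately for one question."""
    return EvalResult(
        retrieval_recall=_ratio(len(gold_subgraph & retrieved), len(gold_subgraph)),
        answer_recall=_ratio(len(gold_answers & answer_claims), len(gold_answers)),
        faithfulness=_ratio(len(answer_claims & retrieved), len(answer_claims)),
    )


works_at = ("Beth", "WORKS_AT", "Acme")
makes = ("Acme", "MAKES", "Widget")
invented = ("Acme", "MAKES", "Gadget")  # a claim with no support in the context

# Case 1: great retrieval, hallucinating generator.
hallucinating = evaluate(
    gold_subgraph={works_at, makes}, retrieved={works_at, makes},
    gold_answers={makes}, answer_claims={makes, invented},
)

# Case 2: faithful generator starved of the right context.
starved = evaluate(
    gold_subgraph={works_at, makes}, retrieved={works_at},
    gold_answers={makes}, answer_claims={works_at},
)
```

In the first case retrieval recall is 1.0 but faithfulness drops to 0.5, pointing at the generator. In the second, faithfulness is 1.0 but retrieval recall is 0.5 and answer recall is 0.0, pointing at the retriever.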

Key Takeaways

Tuning GraphRAG by vibes is the default failure mode: every change you made across this track is a hypothesis that needs a measurement to confirm it
Vector recall@K isn't enough: graph retrieval returns subgraphs and paths, where the edges and connections carry the meaning
Separate retrieval quality (gold subgraph) from answer quality (gold answer set + faithfulness) so you know which half to fix
