Entity Resolution & Deduplication

Extraction never produces a clean graph on the first pass. “IBM”, “I.B.M.”, and “International Business Machines” arrive as three separate nodes; the same person appears under a nickname in one document and a full legal name in another. Today you'll learn entity resolution, the record-linkage discipline of deciding when two nodes are the same real-world thing, and how to block, score, cluster, and safely merge duplicates without corrupting the edges that depend on them.

Why Duplicate Entities Wreck a Graph

In the Beginner course you built graphs by hand, so every node was unique by construction. The moment you start extracting a graph from real documents — the Day 1–3 pipeline of this course — that guarantee disappears. The same real-world entity shows up under many surface forms, and each one becomes its own node.

The Problem in One Picture

A relation-extraction pass over a few news articles might produce:

  • (Apple)-[:FOUNDED_BY]->(Steve Jobs)
  • (Apple Inc.)-[:HEADQUARTERED_IN]->(Cupertino)
  • (AAPL)-[:HAS_CEO]->(Tim Cook)

To a human these are obviously one company. To the graph they are three separate nodes, each holding a fragment of the truth. The graph "knows" who founded Apple and who runs AAPL, but it can't connect those facts because they live on different nodes.
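The fragmentation is easy to see in a minimal Python sketch. It assumes the extractor's output is held as plain (subject, relation, object) tuples, a representation chosen here only for illustration; the helper name facts_by_node is hypothetical.

```python
from collections import defaultdict

# The three triples from the example above, as (subject, relation, object).
triples = [
    ("Apple", "FOUNDED_BY", "Steve Jobs"),
    ("Apple Inc.", "HEADQUARTERED_IN", "Cupertino"),
    ("AAPL", "HAS_CEO", "Tim Cook"),
]


def facts_by_node(triples):
    """Group outgoing (relation, object) facts under each subject node."""
    facts = defaultdict(list)
    for subj, rel, obj in triples:
        facts[subj].append((rel, obj))
    return dict(facts)


facts = facts_by_node(triples)
for node, node_facts in facts.items():
    print(f"{node!r}: {node_facts}")

# Three distinct subject nodes, each carrying exactly one fact.
print(len(facts))  # 3
```

No single key in facts holds both the founder and the CEO, which is exactly the situation the graph is in.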

Why This Is Worse Than a Duplicate Row

In a relational table, a duplicate row is wasteful but mostly harmless — you can SELECT DISTINCT it away. In a graph, duplicates are actively destructive because the value of a graph is its connectivity:

  • Multi-hop queries break. "Who runs the company that Steve Jobs founded?" returns nothing, because FOUNDED_BY hangs off Apple while HAS_CEO hangs off AAPL, and a traversal cannot get from one node to the other.
  • Centrality and community detection lie. A heavily discussed entity split across five nodes looks like five minor nodes instead of one hub, so PageRank, betweenness, and clustering are all computed on the wrong topology and their results cannot be trusted.
  • GraphRAG retrieval misses context. If a question matches the Apple Inc. node, the retriever never pulls the founding fact that's stranded on the Apple node.
  • Aggregations undercount. "How many products does Apple make?" returns whatever fraction happens to hang off the node you matched.
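To make the first two bullets concrete, here is a small sketch that runs a two-hop lookup and a degree count before and after resolution. The alias map canonical is an assumption standing in for the output of a resolution step, and the function names are hypothetical; out-degree is used as the simplest possible stand-in for a centrality measure.

```python
from collections import Counter

triples = [
    ("Apple", "FOUNDED_BY", "Steve Jobs"),
    ("Apple Inc.", "HEADQUARTERED_IN", "Cupertino"),
    ("AAPL", "HAS_CEO", "Tim Cook"),
]

# Assumed result of entity resolution: every alias maps to one canonical name.
canonical = {"Apple Inc.": "Apple", "AAPL": "Apple"}


def resolve(triples, canonical):
    """Rewrite subjects and objects to their canonical node names."""
    return [(canonical.get(s, s), r, canonical.get(o, o)) for s, r, o in triples]


def ceo_of_company_founded_by(triples, founder):
    """Two-hop lookup: founder <-FOUNDED_BY- company -HAS_CEO-> ceo."""
    companies = {s for s, r, o in triples if r == "FOUNDED_BY" and o == founder}
    return sorted(o for s, r, o in triples if r == "HAS_CEO" and s in companies)


def out_degree(triples):
    """Count outgoing edges per subject node."""
    return Counter(s for s, _, _ in triples)


resolved = resolve(triples, canonical)

print(ceo_of_company_founded_by(triples, "Steve Jobs"))   # []
print(ceo_of_company_founded_by(resolved, "Steve Jobs"))  # ['Tim Cook']
print(out_degree(triples)["Apple"], out_degree(resolved)["Apple"])  # 1 3
```

The query logic is identical in both calls; only the node identities changed, which is why duplicates cannot be patched at query time in general.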

The Discipline Has a Name

Resolving these duplicates is the decades-old field of record linkage (a.k.a. entity resolution, deduplication, or "merge/purge"). It predates knowledge graphs — it grew out of joining census and medical records that lacked shared keys — but the graph setting adds a twist: after you decide two nodes are the same, you have to merge them without breaking the edges that point at either one.
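One way to picture the graph-specific twist is a merge that redirects every edge, incoming or outgoing, from the dropped node to the surviving one. The sketch below works on a plain edge list and is only an illustration of the idea, not a specific graph database's merge procedure; merge_nodes is a hypothetical helper, and two of the edges are invented for the example and marked as such.

```python
def merge_nodes(edges, keep, drop):
    """Return a new edge list with every reference to `drop` redirected to `keep`.

    Edges that become identical after the redirect are collapsed into one.
    """
    merged, seen = [], set()
    for src, rel, dst in edges:
        src = keep if src == drop else src
        dst = keep if dst == drop else dst
        edge = (src, rel, dst)
        if edge not in seen:
            seen.add(edge)
            merged.append(edge)
    return merged


edges = [
    ("Apple", "FOUNDED_BY", "Steve Jobs"),
    ("AAPL", "HAS_CEO", "Tim Cook"),
    ("Tim Cook", "WORKS_AT", "AAPL"),      # hypothetical incoming edge
    ("AAPL", "FOUNDED_BY", "Steve Jobs"),  # hypothetical duplicate fact
]

merged = merge_nodes(edges, keep="Apple", drop="AAPL")
for edge in merged:
    print(edge)

# No edge mentions the dropped node any more.
print(any("AAPL" in (src, dst) for src, _, dst in merged))  # False
```

Note that an edge running directly between the two merged nodes would turn into a self-loop under this scheme; whether to keep or discard such edges is a policy decision the sketch leaves open.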

Two Flavors

  • Deduplication — collapsing duplicates within a single dataset (the Apple/AAPL case above).
  • Record linkage / entity linking — matching new records against an existing canonical set, e.g. linking an extracted Apple mention to the Wikidata entity Q312.

The machinery is the same: decide which records refer to the same thing, then unify them. The rest of this day walks the pipeline end to end: block → score → cluster → merge.
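As a preview, the four stages can be sketched end to end with the standard library alone. Everything specific here is an assumption made for illustration: the blocking key (first character of the normalized name), the scorer (difflib's SequenceMatcher ratio), the 0.7 threshold, and the rule of picking the longest surface form as canonical. Real pipelines typically use richer keys and scorers.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def block(names):
    """Block: only names sharing a cheap key are ever compared."""
    blocks = defaultdict(list)
    for name in names:
        blocks[normalize(name)[:1]].append(name)
    return blocks


def score(a, b):
    """Score: string similarity in [0, 1] on the normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def cluster(names, threshold=0.7):
    """Cluster: union-find over pairs whose score clears the threshold."""
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for members in block(names).values():
        for a, b in combinations(members, 2):
            if score(a, b) >= threshold:
                parent[find(b)] = find(a)

    groups = defaultdict(list)
    for n in names:
        groups[find(n)].append(n)
    return list(groups.values())


def canonical_map(clusters):
    """Merge (first half): pick one surviving name per cluster."""
    return {m: max(members, key=len) for members in clusters for m in members}


names = ["IBM", "I.B.M.", "International Business Machines",
         "Apple", "Apple Inc.", "Cupertino"]
clusters = cluster(names)
mapping = canonical_map(clusters)
print(clusters)
print(mapping["Apple"])  # Apple Inc.
```

With these assumptions, "IBM" and "I.B.M." end up together and so do "Apple" and "Apple Inc.", but "International Business Machines" stays on its own: plain string similarity cannot tell that an acronym and its expansion match. The canonical map is only the first half of the merge; the edges still have to be rewired onto the surviving nodes.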

Key Takeaways
  • Extraction produces many nodes per real-world entity ('Apple', 'Apple Inc.', 'AAPL') — duplicates are the default, not the exception
  • Graph duplicates are worse than table duplicates: they sever connectivity, so multi-hop queries, centrality, and GraphRAG retrieval all degrade
  • Resolving them is the classic discipline of record linkage; the graph twist is merging nodes without breaking their edges

Course Stats

Estimated Time: 50 min
Lessons: 5 sections