Name: Entity Resolution & Deduplication

Why Duplicate Entities Wreck a Graph

In the Beginner course you built graphs by hand, so every node was unique by construction. The moment you start extracting a graph from real documents — the Day 1–3 pipeline of this course — that guarantee disappears. The same real-world entity shows up under many surface forms, and each one becomes its own node.

The Problem in One Picture

A relation-extraction pass over a few news articles might produce:

(Apple)-[:FOUNDED_BY]->(Steve Jobs)
(Apple Inc.)-[:HEADQUARTERED_IN]->(Cupertino)
(AAPL)-[:HAS_CEO]->(Tim Cook)

To a human these are obviously one company. To the graph they are three separate nodes, each holding a fragment of the truth. The graph "knows" who founded Apple and who runs AAPL, but it can't connect those facts because they live on different nodes.
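The fragmentation is easy to reproduce with a small in-memory sketch. The triple list mirrors the extraction output above; the tuple layout and the two helper functions are assumptions made for illustration, not part of the course pipeline.

```python
# Extracted triples, stored as (subject, relation, object).
triples = [
    ("Apple", "FOUNDED_BY", "Steve Jobs"),
    ("Apple Inc.", "HEADQUARTERED_IN", "Cupertino"),
    ("AAPL", "HAS_CEO", "Tim Cook"),
]

def neighbors(node, relation):
    """Return every object reachable from `node` over `relation` (hypothetical helper)."""
    return [o for s, r, o in triples if s == node and r == relation]

def ceo_of_company_founded_by(founder):
    """Two-hop question: founder -> company -> CEO."""
    companies = [s for s, r, o in triples if r == "FOUNDED_BY" and o == founder]
    return [ceo for c in companies for ceo in neighbors(c, "HAS_CEO")]

founder_answer = ceo_of_company_founded_by("Steve Jobs")
# Empty: the CEO edge hangs off "AAPL", not off "Apple".
print(founder_answer)
```

Both facts are present in the data, yet the two-hop question comes back empty because each fact is attached to a different surface form.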

Why This Is Worse Than a Duplicate Row

In a relational table, a duplicate row is wasteful but mostly harmless — you can SELECT DISTINCT it away. In a graph, duplicates are actively destructive because the value of a graph is its connectivity:

Multi-hop queries break. "Who is the CEO of the company Steve Jobs founded?" has to pass through a single company node, but the founding edge hangs off Apple and the CEO edge hangs off AAPL, so the traversal dead-ends.
Centrality and community detection lie. A heavily discussed entity split across five nodes looks like five minor nodes instead of one hub, so PageRank, betweenness, and clustering all report misleading scores.
GraphRAG retrieval misses context. If a question matches the Apple Inc. node, the retriever never pulls the founding fact that's stranded on the Apple node.
Aggregations undercount. "How many products does Apple make?" returns whatever fraction happens to hang off the node you matched.
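A minimal sketch of the centrality and aggregation effects, using plain out-degree as a stand-in for heavier measures such as PageRank. The edge list, the MAKES relation, and the alias table are invented for illustration.

```python
from collections import Counter

# Hypothetical edges; three surface forms of one company.
edges = [
    ("Apple", "MAKES", "iPhone"),
    ("Apple", "FOUNDED_BY", "Steve Jobs"),
    ("Apple Inc.", "MAKES", "iPad"),
    ("Apple Inc.", "HEADQUARTERED_IN", "Cupertino"),
    ("AAPL", "MAKES", "Mac"),
    ("AAPL", "HAS_CEO", "Tim Cook"),
]

# Assumed alias table: surface form -> canonical name.
canonical = {"Apple": "Apple Inc.", "AAPL": "Apple Inc."}

def out_degree(edge_list):
    """Count outgoing edges per subject node."""
    return Counter(s for s, _, _ in edge_list)

def product_count(edge_list, company):
    """Count MAKES edges leaving `company`."""
    return sum(1 for s, r, _ in edge_list if s == company and r == "MAKES")

# Rewrite subjects through the alias table to simulate a resolved graph.
resolved = [(canonical.get(s, s), r, o) for s, r, o in edges]

split_degree = out_degree(edges)
merged_degree = out_degree(resolved)
print(split_degree)
print(merged_degree)
print(product_count(edges, "Apple Inc."), product_count(resolved, "Apple Inc."))
```

In the split graph each surface form carries two edges and the product count for Apple Inc. is 1; after resolution the single node carries all six edges and the count is 3.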

The Discipline Has a Name

Resolving these duplicates is the decades-old field of record linkage (the names overlap in practice: entity resolution, deduplication, and "merge/purge" all describe closely related work). It predates knowledge graphs, having grown out of joining census and medical records that lacked shared keys, but the graph setting adds a twist: after you decide two nodes are the same, you have to merge them without breaking the edges attached to either one, incoming or outgoing.
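The merge step can be sketched on a plain edge list. This is an illustrative toy, not a graph database's merge procedure: `merge_nodes` is a hypothetical helper, and the WORKS_AT edge is invented so that there is an incoming edge to redirect.

```python
def merge_nodes(edges, keep, drop):
    """Redirect every edge endpoint from `drop` to `keep`, skipping exact duplicates."""
    merged = []
    seen = set()
    for s, r, o in edges:
        s = keep if s == drop else s   # outgoing edges of the dropped node
        o = keep if o == drop else o   # incoming edges of the dropped node
        if (s, r, o) not in seen:
            seen.add((s, r, o))
            merged.append((s, r, o))
    return merged

edges = [
    ("Apple", "FOUNDED_BY", "Steve Jobs"),
    ("AAPL", "HAS_CEO", "Tim Cook"),
    ("Tim Cook", "WORKS_AT", "AAPL"),   # incoming edge: must survive the merge
    ("Apple", "HAS_CEO", "Tim Cook"),   # collapses into a duplicate after the merge
]

merged = merge_nodes(edges, keep="Apple", drop="AAPL")
for edge in merged:
    print(edge)
```

Both edge directions are rewritten, and the two HAS_CEO edges collapse into one, so the merged graph neither loses facts nor double-counts them.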

Two Flavors

Deduplication — collapsing duplicates within a single dataset (the Apple/AAPL case above).
Record linkage / entity linking — matching new records against an existing canonical set, e.g. linking an extracted Apple mention to the Wikidata entity Q312.

The machinery is the same: decide which records refer to the same thing, then unify them. The rest of this day walks the pipeline end to end: block → score → cluster → merge.
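As a preview, the four stages can be sketched in a few lines of standard-library Python. Everything here is a placeholder assumption chosen to keep the example short: the first-letter blocking key, the character-level similarity from `difflib`, the 0.6 threshold, and the union-find clustering are not recommendations.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

names = ["Apple", "Apple Inc.", "Cupertino", "Tim Cook", "Steve Jobs"]

def block_key(name):
    # Block: only names sharing a first letter are ever compared (toy key).
    return name[0].lower()

def score(a, b):
    # Score: character-level similarity in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def resolve(names, threshold=0.6):
    blocks = defaultdict(list)
    for n in names:
        blocks[block_key(n)].append(n)

    # Cluster: union-find over the pairs that clear the threshold.
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for members in blocks.values():
        for a, b in combinations(members, 2):
            if score(a, b) >= threshold:
                parent[find(b)] = find(a)

    # Map every name to its cluster representative.
    return {n: find(n) for n in names}

mapping = resolve(names)

# Merge: rewrite both endpoints of each triple through the mapping.
triples = [
    ("Apple", "FOUNDED_BY", "Steve Jobs"),
    ("Apple Inc.", "HEADQUARTERED_IN", "Cupertino"),
]
resolved = [(mapping[s], r, mapping[o]) for s, r, o in triples]
print(resolved)
```

With this input only one pair is ever scored, because blocking puts "Apple" and "Apple Inc." in the same bucket and every other name in a bucket of its own.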

Key Takeaways

Extraction produces many nodes per real-world entity ('Apple', 'Apple Inc.', 'AAPL') — duplicates are the default, not the exception
Graph duplicates are worse than table duplicates: they sever connectivity, so multi-hop queries, centrality, and GraphRAG retrieval all degrade
Resolving them is the classic discipline of record linkage; the graph twist is merging nodes without breaking their edges
