Extraction never produces a clean graph on the first pass. “IBM”, “I.B.M.”, and “International Business Machines” arrive as three separate nodes; the same person appears under a nickname in one document and a full legal name in another. Today you'll learn entity resolution (the record-linkage discipline of deciding when two nodes are the same real-world thing) and how to block, score, cluster, and safely merge duplicates without corrupting the edges that depend on them.
In the Beginner course you built graphs by hand, so every node was unique by construction. The moment you start extracting a graph from real documents — the Day 1–3 pipeline of this course — that guarantee disappears. The same real-world entity shows up under many surface forms, and each one becomes its own node.
A relation-extraction pass over a few news articles might produce:
(Apple)-[:FOUNDED_BY]->(Steve Jobs)
(Apple Inc.)-[:HEADQUARTERED_IN]->(Cupertino)
(AAPL)-[:HAS_CEO]->(Tim Cook)
To a human these are obviously one company. To the graph they are three separate nodes, each holding a fragment of the truth. The graph "knows" who founded Apple and who runs AAPL, but it can't connect those facts because they live on different nodes.
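To make the fragmentation concrete, here is a minimal sketch that loads the three triples above into a plain adjacency list keyed by surface string. The `out_edges` structure and the `facts_about` helper are assumptions made for illustration, not part of the course pipeline.

```python
from collections import defaultdict

# The three extracted triples from the example above.
triples = [
    ("Apple", "FOUNDED_BY", "Steve Jobs"),
    ("Apple Inc.", "HEADQUARTERED_IN", "Cupertino"),
    ("AAPL", "HAS_CEO", "Tim Cook"),
]

# Adjacency list keyed by the exact surface string,
# so every distinct spelling becomes its own node.
out_edges = defaultdict(list)
for subject, relation, obj in triples:
    out_edges[subject].append((relation, obj))


def facts_about(node):
    """Return the outgoing (relation, object) pairs stored on one node."""
    return out_edges.get(node, [])


print(len(out_edges))        # three company nodes, not one
print(facts_about("AAPL"))   # only the CEO fact; the founder lives elsewhere
print(facts_about("Apple"))  # only the founder fact
```

Asking the `AAPL` node who founded the company returns nothing, even though the graph as a whole contains the answer.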
In a relational table, a duplicate row is wasteful but mostly harmless — you can SELECT DISTINCT it away. In a graph, duplicates are actively destructive because the value of a graph is its connectivity:
- A traversal can't get from Apple to AAPL because there's no edge between them.
- If a query lands on the Apple Inc. node, the retriever never pulls the founding fact that's stranded on the Apple node.
Resolving these duplicates is the decades-old field of record linkage (a.k.a. entity resolution, deduplication, or "merge/purge"). It predates knowledge graphs, having grown out of joining census and medical records that lacked shared keys, but the graph setting adds a twist: after you decide two nodes are the same, you have to merge them without breaking the edges that point at either one.
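The edge-preserving merge can be sketched in a few lines. This is a simplified illustration over an in-memory edge list, not a production merge: `merge_nodes` is a hypothetical helper, the choice of `Apple Inc.` as the canonical node is an assumption, and the `WORKS_AT` edge is invented here only to show that incoming edges need rewriting too.

```python
def merge_nodes(edges, duplicates, canonical):
    """Re-point every edge endpoint listed in `duplicates` at `canonical`.

    edges: iterable of (source, relation, target) tuples.
    Returns a new list; edges that become identical after rewriting
    are collapsed into one, and the original order is preserved.
    """
    dupes = set(duplicates)
    seen = set()
    merged = []
    for src, rel, dst in edges:
        # Rewrite both ends: duplicates can be sources or targets.
        src = canonical if src in dupes else src
        dst = canonical if dst in dupes else dst
        edge = (src, rel, dst)
        if edge not in seen:
            seen.add(edge)
            merged.append(edge)
    return merged


edges = [
    ("Apple", "FOUNDED_BY", "Steve Jobs"),
    ("Apple Inc.", "HEADQUARTERED_IN", "Cupertino"),
    ("AAPL", "HAS_CEO", "Tim Cook"),
    ("Tim Cook", "WORKS_AT", "Apple"),  # hypothetical incoming edge
]

merged = merge_nodes(edges, duplicates=["Apple", "AAPL"], canonical="Apple Inc.")
for edge in merged:
    print(edge)
```

After the merge, every fact hangs off a single `Apple Inc.` node and no edge is left pointing at a name that no longer exists. A real merge would also have to reconcile node properties and decide what to do with edges that ran between the duplicates themselves, which this sketch would turn into self-loops.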
A closely related task is entity linking: mapping, for example, an Apple mention to the Wikidata entity Q312.
The machinery is the same: decide which records refer to the same thing, then unify them. The rest of this day walks the pipeline end to end: block → score → cluster → merge.
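As a preview of those four stages, here is a deliberately small sketch in plain Python. Everything in it is an assumption chosen for brevity rather than the method the later sections use: the blocking key is the first token, the score is Jaccard overlap of token sets, clustering is union-find over pairs at or above a 0.5 threshold, and the canonical name is simply the longest one in each cluster.

```python
import re
from collections import defaultdict
from itertools import combinations

names = ["Apple", "Apple Inc.", "Apple Incorporated", "Alphabet", "Alphabet Inc."]


def token_list(name):
    """Lowercase a name and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", name.lower())


def block_key(name):
    """Block: a crude key (the first token) that limits which pairs get compared."""
    found = token_list(name)
    return found[0] if found else ""


def score(a, b):
    """Score: Jaccard overlap of the two token sets, in [0, 1]."""
    ta, tb = set(token_list(a)), set(token_list(b))
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def resolve(names, threshold=0.5):
    """Return a mapping from every name to its cluster's canonical name."""
    # Block: only names sharing a key are ever compared.
    blocks = defaultdict(list)
    for name in names:
        blocks[block_key(name)].append(name)

    # Cluster: union-find over pairs scoring at or above the threshold.
    parent = {name: name for name in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for members in blocks.values():
        for a, b in combinations(members, 2):
            if score(a, b) >= threshold:
                parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for name in names:
        clusters[find(name)].append(name)

    # Merge: pick one canonical name per cluster (longest, by assumption).
    canonical = {}
    for members in clusters.values():
        chosen = max(members, key=len)
        for member in members:
            canonical[member] = chosen
    return canonical


for name, target in resolve(names).items():
    print(f"{name!r} -> {target!r}")
```

Note that `Apple Inc.` and `Apple Incorporated` score only 1/3 against each other, yet they land in the same cluster because each matches `Apple`. That transitive chaining is what the clustering step buys you, and also what makes a badly chosen threshold dangerous.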