A knowledge graph is only as good as the facts you put into it — and most of those facts start life as messy, unstructured text. This day is about the pipeline that turns a corpus of documents into a clean graph: spotting the entities, pulling out the relationships between them, emitting them as (subject, predicate, object) triples, and merging duplicates so the same real-world thing becomes a single node.
In the Beginner course you built graphs by hand: you knew the nodes and edges up front and typed them in. Real knowledge graphs are rarely built that way. The facts you need are buried in documents: contracts, research papers, support tickets, news articles, wikis. The goal here is to get those facts out of the text and into a graph automatically.
Almost every extraction pipeline aims at the same output: a list of triples. A triple is a single fact in (subject, predicate, object) form:
(Marie Curie, discovered, Polonium)
(Marie Curie, born_in, Warsaw)
(Polonium, is_a, Chemical Element)
Each subject and object becomes a node, and each predicate becomes a labeled edge between them. Collect enough triples from enough documents and you have a graph. This is the RDF mental model from the Beginner course, but now the triples come from a machine reading text rather than from you typing them.
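To make the triple-to-graph mapping concrete, here is a minimal sketch in plain Python. The `build_graph` helper and its adjacency-dictionary layout are illustrative assumptions, not a prescribed storage format; a real project would more likely load the triples into a graph database or an RDF library.

```python
from collections import defaultdict

# The three example triples from the text above.
triples = [
    ("Marie Curie", "discovered", "Polonium"),
    ("Marie Curie", "born_in", "Warsaw"),
    ("Polonium", "is_a", "Chemical Element"),
]

def build_graph(triples):
    """Return (nodes, edges): a set of node names and a dict
    mapping each subject to its list of (predicate, object) edges."""
    nodes = set()
    edges = defaultdict(list)
    for subject, predicate, obj in triples:
        # Every subject and object becomes a node.
        nodes.update((subject, obj))
        # Every predicate becomes a labeled, directed edge.
        edges[subject].append((predicate, obj))
    return nodes, dict(edges)

nodes, edges = build_graph(triples)
print(sorted(nodes))
print(edges["Marie Curie"])
```

Note that "Polonium" appears as an object in one triple and as a subject in another, yet ends up as a single node. That sharing is what turns a flat list of facts into a connected graph.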
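A typical extraction pipeline has four stages, and the rest of this day walks through each:
1. Spot the entities mentioned in the text.
2. Pull out the relationships between those entities.
3. Emit each relationship as a (subject, predicate, object) triple.
4. Merge duplicates so the same real-world thing becomes a single node.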
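There are broadly two ways to implement these stages: with dedicated NLP models, such as a named-entity recognition (NER) model, or by prompting an LLM to do the extraction.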
Modern pipelines often combine them: a fast NER model to find candidate entities, an LLM to extract the nuanced relations. We'll look at both.
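The sketch below shows the shape of such a combined pipeline: first find candidate entities, then look for relations only between those candidates. Everything in it is a stand-in chosen so the example runs without external dependencies. The `KNOWN_ENTITIES` list plays the role of an NER model and the `RELATION_PATTERNS` regexes play the role of an LLM call; neither is how a production system would do it.

```python
import re
from itertools import permutations

# Hypothetical gazetteer standing in for a trained NER model.
KNOWN_ENTITIES = ["Marie Curie", "Polonium", "Warsaw"]

# Hypothetical regex templates standing in for an LLM relation extractor.
RELATION_PATTERNS = {
    "discovered": "{s} discovered {o}",
    "born_in": "{s} was born in {o}",
}

def find_entities(text):
    """Entity stage stand-in: return the known names that occur in the text."""
    return [name for name in KNOWN_ENTITIES if name in text]

def extract_relation(text, subject, obj):
    """Relation stage stand-in: return a predicate linking subject to obj, or None."""
    for predicate, template in RELATION_PATTERNS.items():
        pattern = template.format(s=re.escape(subject), o=re.escape(obj))
        if re.search(pattern, text):
            return predicate
    return None

def extract_triples(text):
    """Find candidate entities, then test each ordered pair for a relation."""
    entities = find_entities(text)
    triples = []
    for subject, obj in permutations(entities, 2):
        predicate = extract_relation(text, subject, obj)
        if predicate is not None:
            triples.append((subject, predicate, obj))
    return triples

text = "Marie Curie discovered Polonium. Marie Curie was born in Warsaw."
result = extract_triples(text)
print(result)
```

The design point is the division of labor: the cheap entity step narrows the search to a handful of candidate pairs, so the expensive relation step (here a regex, in practice possibly an LLM) is only asked about pairs that could plausibly be related.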
If extraction were perfect, knowledge-graph construction would be a solved problem. It isn't, because language is ambiguous: the same entity is written many ways ("IBM", "I.B.M.", "International Business Machines"), the same surface form means different things ("Apple" the company vs the fruit vs Apple Records), and relationships are often implied rather than stated. Every stage below exists to fight one of these ambiguities.
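As a small illustration of the first ambiguity (one entity, many spellings), the sketch below merges the IBM variants into a single node. The `ALIASES` table, the `normalize` rules, and the sample triples are all hypothetical and hand-written for this example; a real pipeline would need a far larger alias source or a learned matcher.

```python
import re

# Hypothetical alias table: normalized surface form -> canonical node name.
ALIASES = {
    "ibm": "IBM",
    "international business machines": "IBM",
}

def normalize(surface):
    """Lowercase, drop periods, and collapse whitespace, so 'I.B.M.' becomes 'ibm'."""
    cleaned = surface.lower().replace(".", "")
    return re.sub(r"\s+", " ", cleaned).strip()

def canonical(surface):
    """Map a surface form to its canonical name; unknown forms pass through unchanged."""
    return ALIASES.get(normalize(surface), surface)

def merge_triples(triples):
    """Rewrite subjects and objects to canonical names and drop exact duplicates,
    keeping the original order."""
    seen = set()
    merged = []
    for subject, predicate, obj in triples:
        triple = (canonical(subject), predicate, canonical(obj))
        if triple not in seen:
            seen.add(triple)
            merged.append(triple)
    return merged

# Illustrative input: three spellings of the same company.
raw = [
    ("IBM", "headquartered_in", "Armonk"),
    ("I.B.M.", "headquartered_in", "Armonk"),
    ("International Business Machines", "founded_in", "1911"),
]
merged = merge_triples(raw)
print(merged)
```

String normalization like this only handles the first kind of ambiguity. It cannot tell Apple the company from Apple Records or the fruit, because those share a surface form; separating them requires looking at the surrounding context.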