Entity Extraction & Graph Construction

A knowledge graph is only as good as the facts you put into it — and most of those facts start life as messy, unstructured text. This day is about the pipeline that turns a corpus of documents into a clean graph: spotting the entities, pulling out the relationships between them, emitting them as (subject, predicate, object) triples, and merging duplicates so the same real-world thing becomes a single node.

From Text to Triples: The Extraction Pipeline

In the Beginner course you built graphs by hand: you knew the nodes and edges up front and typed them in. Real knowledge graphs are rarely built that way. The facts you need are buried in documents (contracts, research papers, support tickets, news articles, wikis), and an extraction pipeline has to pull them out automatically.

The Target Representation

Almost every extraction pipeline aims at the same output: a list of triples. A triple is a single fact in (subject, predicate, object) form:

  • (Marie Curie, discovered, Polonium)
  • (Marie Curie, born_in, Warsaw)
  • (Polonium, is_a, Chemical Element)

Each distinct subject and object becomes a node, and each predicate becomes a labeled, directed edge from subject to object. Collect enough triples from enough documents and you have a graph. This is the RDF mental model from the Beginner course, but now the triples come from a machine reading text rather than from you typing them.
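To make the node-and-edge mapping concrete, here is a minimal sketch that turns the three example triples into an in-memory graph. The `build_graph` helper and the adjacency-map layout are illustrative assumptions, not a prescribed storage format; a real project would more likely load the triples into a graph database or an RDF store.

```python
from collections import defaultdict

# The three example triples from the list above.
triples = [
    ("Marie Curie", "discovered", "Polonium"),
    ("Marie Curie", "born_in", "Warsaw"),
    ("Polonium", "is_a", "Chemical Element"),
]

def build_graph(triples):
    """Return (nodes, edges): the set of node names and an adjacency map
    from each subject to its outgoing (predicate, object) edges."""
    nodes = set()
    edges = defaultdict(list)
    for subject, predicate, obj in triples:
        # Subjects and objects both become nodes; the set removes repeats.
        nodes.update((subject, obj))
        # The predicate becomes a labeled edge from subject to object.
        edges[subject].append((predicate, obj))
    return nodes, dict(edges)

nodes, edges = build_graph(triples)
print(sorted(nodes))
print(edges["Marie Curie"])
```

Note that "Marie Curie" and "Polonium" each appear in two triples but produce only one node apiece, which is what lets separate facts connect into a graph.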

The Four Stages

A typical extraction pipeline has four stages, and the rest of this day walks through each:

  1. Named-Entity Recognition (NER) — find the spans of text that name real-world things (people, organizations, places, dates) and classify them by type.
  2. Coreference resolution — figure out that "she", "the company", and "Curie" all refer to the same entity, so you don't create duplicate nodes.
  3. Relation extraction — determine how the entities relate to each other, producing the predicates that connect them.
  4. Construction & deduplication — assemble the triples into a graph, merging mentions of the same entity into a single canonical node.
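The data flow through the four stages can be sketched end to end with toy stand-ins. Everything here is a hypothetical simplification: the `GAZETTEER` dictionary replaces a trained NER model, the pronoun rule replaces a coreference model, the `PATTERNS` list replaces a relation extractor, and a Python set handles only the simplest form of deduplication (identical triples). The sample text is also invented for illustration.

```python
import re

TEXT = "Marie Curie was born in Warsaw. She discovered Polonium."

# Hypothetical gazetteer standing in for a trained NER model.
GAZETTEER = {"Marie Curie": "PERSON", "Warsaw": "PLACE", "Polonium": "ELEMENT"}

# Hypothetical surface patterns standing in for a relation-extraction model.
PATTERNS = [("was born in", "born_in"), ("discovered", "discovered")]

def recognize_entities(sentence):
    """Stage 1: return (start, text, type) for every gazetteer hit, in text order."""
    hits = []
    for name, etype in GAZETTEER.items():
        for match in re.finditer(re.escape(name), sentence):
            hits.append((match.start(), name, etype))
    return sorted(hits)

def resolve_pronouns(sentence, last_person):
    """Stage 2: replace 'She'/'He' with the most recently seen PERSON entity."""
    if last_person is None:
        return sentence
    return re.sub(r"\b(She|He)\b", last_person, sentence)

def extract_relations(sentence, entities):
    """Stage 3: emit a triple when a known phrase sits between adjacent entities."""
    triples = []
    for (s_start, subj, _), (o_start, obj, _) in zip(entities, entities[1:]):
        between = sentence[s_start + len(subj):o_start].strip()
        for phrase, predicate in PATTERNS:
            if between == phrase:
                triples.append((subj, predicate, obj))
    return triples

def run_pipeline(text):
    """Stage 4: collect triples into a set so identical facts collapse."""
    graph = set()
    last_person = None
    for sentence in re.split(r"(?<=\.)\s+", text):
        # Coreference relies on PERSON entities found in earlier sentences.
        sentence = resolve_pronouns(sentence, last_person)
        entities = recognize_entities(sentence)
        for _, name, etype in entities:
            if etype == "PERSON":
                last_person = name
        graph.update(extract_relations(sentence, entities))
    return graph

triples = run_pipeline(TEXT)
print(sorted(triples))
```

Without the pronoun step, the second sentence would yield no triple at all, because "She" is not in the gazetteer. That is the practical reason coreference sits between entity recognition and relation extraction.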

Two Schools of Extraction

There are broadly two ways to implement these stages:

  • Classical NLP models — purpose-built models (spaCy, Stanford CoreNLP, fine-tuned BERT taggers) that are fast, cheap, and run locally, but are limited to the entity and relation types they were trained on.
  • LLM-based extraction — prompt a large language model to read the text and emit structured triples directly. Far more flexible (you can ask for any schema in plain English), but slower, more expensive, and prone to hallucinating facts that aren't in the text.

Modern pipelines often combine them: a fast NER model finds candidate entities, and an LLM extracts the more nuanced relations between them. We'll look at both.
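Because LLM output can include facts that aren't in the text, a common safeguard is to check each returned triple against the source before it reaches the graph. The sketch below is one deliberately simple version of that idea, not a standard algorithm: it keeps a triple only if both its subject and its object appear literally in the source text. The function names and the sample `llm_triples` list are hypothetical, and no real model is called.

```python
def is_grounded(triple, source_text):
    """True if both the subject and the object literally appear in the source."""
    subject, _predicate, obj = triple
    text = source_text.casefold()
    return subject.casefold() in text and obj.casefold() in text

def filter_grounded(triples, source_text):
    """Split triples into (kept, rejected) using the grounding check."""
    kept, rejected = [], []
    for triple in triples:
        (kept if is_grounded(triple, source_text) else rejected).append(triple)
    return kept, rejected

source = "Marie Curie discovered polonium."

# Hypothetical model output. The second triple may be true in the world,
# but nothing in the source text supports it.
llm_triples = [
    ("Marie Curie", "discovered", "Polonium"),
    ("Marie Curie", "born_in", "Warsaw"),
]

kept, rejected = filter_grounded(llm_triples, source)
print("kept:", kept)
print("rejected:", rejected)
```

This check only catches entities that are absent from the text. It does not verify that the predicate is actually stated, and it will wrongly reject paraphrased or abbreviated names, so treat it as a first filter rather than full verification.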

Why This Is Hard

If extraction were perfect, knowledge-graph construction would be a solved problem. It isn't, because language is ambiguous: the same entity is written many ways ("IBM", "I.B.M.", "International Business Machines"), the same surface form means different things ("Apple" the company vs the fruit vs Apple Records), and relationships are often implied rather than stated. Every stage below exists to fight one of these ambiguities.
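The first kind of ambiguity, one entity written many ways, can be illustrated with a small alias lookup. This is a minimal sketch under stated assumptions: the `ALIASES` table is hand-written and hypothetical, and the normalization rules (case-folding, dropping periods, collapsing whitespace) are just one plausible choice.

```python
import re

# Hypothetical alias table: normalized surface form -> canonical node name.
ALIASES = {
    "ibm": "IBM",
    "international business machines": "IBM",
}

def normalize(mention):
    """Case-fold, drop periods, and collapse runs of whitespace."""
    text = mention.casefold().replace(".", "")
    return re.sub(r"\s+", " ", text).strip()

def canonicalize(mention):
    """Map a mention to its canonical name, or return it unchanged if unknown."""
    return ALIASES.get(normalize(mention), mention)

mentions = ["IBM", "I.B.M.", "International Business Machines"]
canonical = {canonicalize(m) for m in mentions}
print(canonical)
```

A lookup like this does nothing for the second kind of ambiguity. The string "Apple" normalizes the same way whether the sentence is about the company, the fruit, or the record label, so telling them apart requires the surrounding context, not just the surface form.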

Key Takeaways
  • Extraction pipelines turn unstructured text into (subject, predicate, object) triples — subjects/objects become nodes, predicates become edges
  • The four stages are NER, coreference resolution, relation extraction, and construction/dedup
  • Classical NLP models are fast and cheap but fixed-schema; LLMs are flexible but slower, costlier, and can hallucinate

Course Stats

Estimated Time: 50 min
Lessons: 5 sections