Entity Extraction & Graph Construction

From Text to Triples: The Extraction Pipeline

In the Beginner course you built graphs by hand — you knew the nodes and edges up front and typed them in. Real knowledge graphs are rarely built that way. The facts you need are buried in documents: contracts, research papers, support tickets, news articles, wikis. This day is about the pipeline that turns that unstructured text into a graph automatically.

The Target Representation

Almost every extraction pipeline aims at the same output: a list of triples. A triple is a single fact in (subject, predicate, object) form:

(Marie Curie, discovered, Polonium)
(Marie Curie, born_in, Warsaw)
(Polonium, is_a, Chemical Element)

Each subject and object becomes a node, and each predicate becomes a labeled edge between them. Collect enough triples from enough documents and you have a graph. This is the RDF mental model from the Beginner course, but now the triples come from a machine reading text rather than from you typing them.
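As a minimal sketch of that mapping, the three triples above can be held as plain Python tuples and folded into a node set plus an adjacency map. The `build_graph` helper and the dictionary-of-lists layout are illustrative assumptions, not the API of any particular graph library.

```python
from collections import defaultdict

# The example triples from above, as (subject, predicate, object) tuples.
triples = [
    ("Marie Curie", "discovered", "Polonium"),
    ("Marie Curie", "born_in", "Warsaw"),
    ("Polonium", "is_a", "Chemical Element"),
]


def build_graph(triples):
    """Return (nodes, edges).

    nodes: set of every subject and object seen.
    edges: map from each subject to its outgoing (predicate, object) pairs.
    """
    nodes = set()
    edges = defaultdict(list)
    for subject, predicate, obj in triples:
        nodes.update((subject, obj))              # both ends become nodes
        edges[subject].append((predicate, obj))   # predicate labels the edge
    return nodes, dict(edges)


nodes, edges = build_graph(triples)
print(sorted(nodes))
print(edges["Marie Curie"])
```

Note that "Polonium" appears once as an object and once as a subject but ends up as a single node, which is what lets separate facts link together into a graph.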

The Four Stages

A typical extraction pipeline has four stages, and the rest of this day walks through each:

Named-Entity Recognition (NER) — find the spans of text that name real-world things (people, organizations, places, dates) and classify them by type.
Coreference resolution — figure out that "she", "the company", and "Curie" all refer to the same entity, so you don't create duplicate nodes.
Relation extraction — determine how the entities relate to each other, producing the predicates that connect them.
Construction & deduplication — assemble the triples into a graph, merging mentions of the same entity into a single canonical node.
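To make the data flow between the four stages concrete, here is a deliberately toy sketch. Every stage is a rule-based stand-in (a hard-coded name list, a leading-pronoun rule, two regular expressions), and all names in it (`GAZETTEER`, `PATTERNS`, `build_triples`, and so on) are hypothetical. A real pipeline would swap trained models in for each function.

```python
import re

# Stage 1 stand-in: a tiny lookup table instead of a trained NER model.
GAZETTEER = {"Marie Curie": "PERSON", "Warsaw": "PLACE", "Polonium": "ELEMENT"}

# Stage 3 stand-in: fixed sentence patterns instead of a relation model.
PATTERNS = [
    (re.compile(r"(.+?) was born in (.+)"), "born_in"),
    (re.compile(r"(.+?) discovered (.+)"), "discovered"),
]


def recognize_entities(sentence):
    """NER: return (name, type) for each known name found in the sentence."""
    return [(name, label) for name, label in GAZETTEER.items() if name in sentence]


def resolve_coreference(sentences):
    """Coreference: replace a leading 'She'/'He' with the last PERSON seen."""
    resolved, last_person = [], None
    for sentence in sentences:
        if last_person is not None:
            sentence = re.sub(r"^(She|He)\b", last_person, sentence)
        for name, label in recognize_entities(sentence):
            if label == "PERSON":
                last_person = name
        resolved.append(sentence)
    return resolved


def extract_relations(sentence):
    """Relation extraction: emit a triple when a pattern links two known entities."""
    found = []
    for pattern, predicate in PATTERNS:
        match = pattern.fullmatch(sentence)
        if match and match.group(1) in GAZETTEER and match.group(2) in GAZETTEER:
            found.append((match.group(1), predicate, match.group(2)))
    return found


def build_triples(text):
    """Run all stages; construction here only drops exact duplicate triples."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    triples = []
    for sentence in resolve_coreference(sentences):
        for triple in extract_relations(sentence):
            if triple not in triples:
                triples.append(triple)
    return triples


text = "Marie Curie was born in Warsaw. She discovered Polonium. Marie Curie discovered Polonium."
print(build_triples(text))
```

Without the coreference step, the second sentence would yield no triple at all, because "She" is not a known entity. The third sentence repeats a fact already extracted, so the final stage discards it.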

Two Schools of Extraction

There are broadly two ways to implement these stages:

Classical NLP models — purpose-built models (spaCy, Stanford CoreNLP, fine-tuned BERT taggers) that are fast, cheap, and run locally, but are limited to the entity and relation types they were trained on.
LLM-based extraction — prompt a large language model to read the text and emit structured triples directly. Far more flexible (you can ask for any schema in plain English), but slower, more expensive, and prone to hallucinating facts that aren't in the text.

Modern pipelines often combine the two: a fast NER model finds candidate entities, and an LLM extracts the more nuanced relations. We'll look at both.
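One common guard against hallucinated triples is to check each one against the source text before it reaches the graph. The sketch below assumes the LLM was asked to return a JSON list of objects with `subject`, `predicate`, and `object` keys. That schema, the hard-coded `llm_json` response, and the `grounded_triples` helper are all assumptions for illustration, and no real model is called.

```python
import json


def grounded_triples(text, llm_json):
    """Keep only triples whose subject and object both occur verbatim in the text."""
    lowered = text.lower()
    kept = []
    for item in json.loads(llm_json):
        subject, predicate, obj = item["subject"], item["predicate"], item["object"]
        # Case-insensitive substring check; the predicate is not validated here.
        if subject.lower() in lowered and obj.lower() in lowered:
            kept.append((subject, predicate, obj))
    return kept


text = "Marie Curie discovered Polonium."

# Hypothetical model response: the second triple is not supported by the text.
llm_json = json.dumps([
    {"subject": "Marie Curie", "predicate": "discovered", "object": "Polonium"},
    {"subject": "Marie Curie", "predicate": "born_in", "object": "Warsaw"},
])

print(grounded_triples(text, llm_json))
```

The check tests whether a triple is grounded in this text, not whether it is true in the world. It is also crude: it would reject an entity the text only mentions by pronoun or paraphrase, and it says nothing about whether the predicate is right.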

Why This Is Hard

If extraction were perfect, knowledge-graph construction would be a solved problem. It isn't, because language is ambiguous: the same entity is written many ways ("IBM", "I.B.M.", "International Business Machines"), the same surface form means different things ("Apple" the company vs the fruit vs Apple Records), and relationships are often implied rather than stated. Every stage below exists to fight one of these ambiguities.
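The first kind of ambiguity, one entity written many ways, is often attacked with a normalization step plus an alias table. A minimal sketch, assuming a hand-written table (the `ALIASES` contents and the `canonical_name` helper are hypothetical):

```python
# Hypothetical alias table: normalized surface form -> canonical node name.
ALIASES = {
    "ibm": "IBM",
    "international business machines": "IBM",
}


def normalize(mention):
    """Drop periods, lowercase, and collapse runs of whitespace."""
    return " ".join(mention.replace(".", "").lower().split())


def canonical_name(mention):
    """Map a mention to its canonical name, or return it unchanged if unknown."""
    return ALIASES.get(normalize(mention), mention)


for mention in ["IBM", "I.B.M.", "International Business Machines"]:
    print(mention, "->", canonical_name(mention))
```

A lookup like this only handles the many-names-for-one-entity case. The reverse problem ("Apple" the company vs the fruit) cannot be solved from the string alone and needs the surrounding context.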

Key Takeaways

Extraction pipelines turn unstructured text into (subject, predicate, object) triples — subjects/objects become nodes, predicates become edges
The four stages are NER, coreference resolution, relation extraction, and construction & deduplication
Classical NLP models are fast and cheap but fixed-schema; LLMs are flexible but slower, costlier, and can hallucinate
