Schema Design for RAG

Why Two Tables: Documents and Chunks

A RAG pipeline retrieves chunks, but it answers questions about documents. Those are two different grains of data, and conflating them into a single table is a common schema mistake in RAG systems built on Postgres.

The Grain Mismatch

A "document" is a source artifact: a PDF, a Confluence page, a support-ticket thread, a product manual. A "chunk" is a slice of that document — a few hundred tokens — small enough to embed meaningfully and fit inside an LLM context window.

One document produces many chunks. That is a textbook one-to-many relationship, and relational databases model it with two tables joined by a foreign key:

documents — one row per source artifact. Holds document-level facts: title, source URI, author, MIME type, the content hash, ingestion timestamp.
chunks — one row per slice. Holds the chunk text, the embedding vector, positional metadata, and a foreign key back to its parent document.
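
The two tables described above can be sketched as follows. This is a minimal illustration, not a production schema: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for Postgres so that it runs anywhere, and the column names chunk_index, content, mime_type, and ingested_at are assumptions chosen for the example. The embedding is a BLOB placeholder here; in Postgres it would be a vector column (for example, the type provided by the pgvector extension).

```python
import sqlite3

# Hypothetical column names; adapt them to your own ingestion pipeline.
SCHEMA = """
CREATE TABLE documents (
    id           INTEGER PRIMARY KEY,
    title        TEXT NOT NULL,
    source_uri   TEXT NOT NULL,
    author       TEXT,
    mime_type    TEXT,
    content_hash TEXT NOT NULL,
    ingested_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE chunks (
    id          INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id),
    chunk_index INTEGER NOT NULL,   -- position of the slice within the document
    content     TEXT NOT NULL,      -- the chunk text
    embedding   BLOB,               -- placeholder for a real vector type
    UNIQUE (document_id, chunk_index)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only; Postgres always enforces FKs
conn.executescript(SCHEMA)

# One document row ...
cur = conn.execute(
    "INSERT INTO documents (title, source_uri, content_hash) VALUES (?, ?, ?)",
    ("Product Manual", "file:///manual.pdf", "abc123"),
)
doc_id = cur.lastrowid

# ... and several chunk rows that point back at it.
conn.executemany(
    "INSERT INTO chunks (document_id, chunk_index, content) VALUES (?, ?, ?)",
    [(doc_id, i, f"chunk text {i}") for i in range(3)],
)

chunk_count = conn.execute(
    "SELECT COUNT(*) FROM chunks WHERE document_id = ?", (doc_id,)
).fetchone()[0]
print(chunk_count)
```

The document-level facts are written once, and each chunk carries only its own text, position, and the document_id that links it to its parent.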

Why Not One Big Table?

If you flatten everything into a single chunks table with the document title copied onto every row, you've denormalized prematurely. Three problems follow:

Update anomalies. Rename a document and you must update every chunk row, or they drift out of sync.
Wasted storage. The title, URI, and author repeat on every chunk. A 200-chunk PDF stores its metadata 200 times.
No place for document-level state. Where do you record "this document was re-ingested at 14:03" without touching 200 rows?

The two-table design fixes all three: document facts live once, chunk facts live per chunk, and the foreign key keeps them consistent.
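
To make the update-anomaly point concrete, here is a small sketch that renames a document with 200 chunks. It again uses an in-memory SQLite database as a stand-in for Postgres, and the trimmed-down tables and sample values are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE chunks (
    id          INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id),
    content     TEXT NOT NULL
);
""")

# One document with 200 chunks; the title is stored exactly once.
conn.execute("INSERT INTO documents (id, title) VALUES (1, 'Old Title')")
conn.executemany(
    "INSERT INTO chunks (document_id, content) VALUES (1, ?)",
    [(f"chunk {i}",) for i in range(200)],
)

# Renaming the document is a single-row update on the documents table.
cur = conn.execute("UPDATE documents SET title = 'New Title' WHERE id = 1")
rows_touched = cur.rowcount

# Every chunk sees the new title through the join; nothing can drift.
titles = {
    row[0]
    for row in conn.execute(
        "SELECT d.title FROM chunks c JOIN documents d ON d.id = c.document_id"
    )
}
print(rows_touched, titles)
```

In a flattened single-table design the same rename would have to rewrite all 200 chunk rows, and any row missed by the update would keep the stale title.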

The Shape of the Relationship

documents (1) ───< chunks (many)
   id                document_id  ──┐
   title                            │ FK references documents(id)
   source_uri                       │
   content_hash                     │

Every chunk row carries a document_id that points at exactly one document. Postgres enforces this with a REFERENCES constraint, so you can never insert a chunk whose document_id names a document that doesn't exist (referential integrity). Declare the column NOT NULL as well, because a foreign key on its own still permits NULL. With ON DELETE CASCADE on the constraint, deleting a document automatically removes its chunks.
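
Both behaviors can be observed in a short sketch. As a hedge: this runs against in-memory SQLite rather than Postgres, so foreign-key enforcement has to be switched on with a PRAGMA, and the error surfaces as sqlite3.IntegrityError rather than a Postgres foreign-key violation. The table contents are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only; Postgres always enforces FKs
conn.executescript("""
CREATE TABLE documents (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE chunks (
    id          INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL
                REFERENCES documents(id) ON DELETE CASCADE,
    content     TEXT NOT NULL
);
""")

conn.execute("INSERT INTO documents (id, title) VALUES (1, 'Manual')")
conn.executemany(
    "INSERT INTO chunks (document_id, content) VALUES (1, ?)",
    [("first slice",), ("second slice",)],
)

# Referential integrity: a chunk pointing at a missing document is rejected.
try:
    conn.execute(
        "INSERT INTO chunks (document_id, content) VALUES (999, 'orphan')"
    )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Cascade: deleting the parent document removes its chunks as well.
conn.execute("DELETE FROM documents WHERE id = 1")
remaining_chunks = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
print(orphan_rejected, remaining_chunks)
```

The application code never deletes chunks explicitly; the constraint does the cleanup, which is what keeps the two tables consistent.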

Key Takeaways

Documents and chunks are two different grains — model them as a one-to-many relationship, not one flat table
A single table forces update anomalies, wastes storage on repeated metadata, and leaves no home for document-level state
A foreign key with ON DELETE CASCADE keeps chunks consistent and cleans them up when their parent is deleted
