Schema Design for RAG

A RAG pipeline is only as good as the tables underneath it. Design a clean two-table schema — documents and chunks — wire them together with foreign keys, store the embedding alongside the chunk text and metadata, and make re-embedding idempotent so you can swap models without orphaning data.

Why Two Tables: Documents and Chunks

A RAG pipeline retrieves chunks, but it answers questions about documents. Those are two different grains of data, and conflating them into a single table is a common schema mistake in RAG systems built on Postgres.

The Grain Mismatch

A "document" is a source artifact: a PDF, a Confluence page, a support-ticket thread, a product manual. A "chunk" is a slice of that document — a few hundred tokens — small enough to embed meaningfully and fit inside an LLM context window.

One document produces many chunks. That is a textbook one-to-many relationship, and relational databases model it with two tables joined by a foreign key:

  • documents — one row per source artifact. Holds document-level facts: title, source URI, author, MIME type, the content hash, ingestion timestamp.
  • chunks — one row per slice. Holds the chunk text, the embedding vector, positional metadata, and a foreign key back to its parent document.
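The two tables described above might be sketched as follows. This is a minimal sketch, not a definitive schema: it runs against Python's built-in sqlite3 module so it executes without a database server, with SQLite standing in for Postgres. The column names chunk_index, chunk_text, mime_type, and ingested_at are illustrative assumptions, and the embedding is stored as JSON text here; in Postgres it would typically be a vector-typed column, for example the type provided by the pgvector extension.

```python
import json
import sqlite3

# Assumption: SQLite stands in for Postgres so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.executescript("""
CREATE TABLE documents (
    id           INTEGER PRIMARY KEY,
    title        TEXT NOT NULL,
    source_uri   TEXT NOT NULL,
    author       TEXT,
    mime_type    TEXT,
    content_hash TEXT NOT NULL,
    ingested_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE chunks (
    id          INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    chunk_index INTEGER NOT NULL,  -- position within the document (hypothetical name)
    chunk_text  TEXT NOT NULL,
    embedding   TEXT NOT NULL,     -- JSON array here; a vector type in Postgres
    UNIQUE (document_id, chunk_index)
);
""")

# One document row, stored once.
doc_id = conn.execute(
    "INSERT INTO documents (title, source_uri, content_hash) VALUES (?, ?, ?)",
    ("Product Manual", "file:///manual.pdf", "abc123"),
).lastrowid

# Many chunk rows, each pointing back at the parent document.
for i, text in enumerate(["Intro text", "Setup steps"]):
    conn.execute(
        "INSERT INTO chunks (document_id, chunk_index, chunk_text, embedding) "
        "VALUES (?, ?, ?, ?)",
        (doc_id, i, text, json.dumps([0.1, 0.2, 0.3])),  # placeholder embedding
    )

# Retrieval works at the chunk grain; the join recovers document-level facts.
rows = conn.execute(
    "SELECT d.title, c.chunk_index, c.chunk_text "
    "FROM chunks c JOIN documents d ON d.id = c.document_id "
    "ORDER BY c.chunk_index"
).fetchall()
```

The UNIQUE (document_id, chunk_index) constraint is a design choice of this sketch: it gives each chunk a stable position within its document, which can help when chunks are rewritten later.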

Why Not One Big Table?

If you flatten everything into a single chunks table with the document title copied onto every row, you've denormalized prematurely. Three problems follow:

  1. Update anomalies. Rename a document and you must update every chunk row, or they drift out of sync.
  2. Wasted storage. The title, URI, and author repeat on every chunk. A 200-chunk PDF stores its metadata 200 times.
  3. No place for document-level state. Where do you record "this document was re-ingested at 14:03" without touching 200 rows?

The two-table design fixes all three: document facts live once, chunk facts live per chunk, and the foreign key keeps them consistent.
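To make the three fixes concrete, here is a small sketch under stated assumptions: SQLite (via Python's sqlite3) stands in for Postgres, the reingested_at column is a hypothetical name for document-level state, and the timestamp value is purely illustrative. It renames a 200-chunk document and records a re-ingestion with a single one-row UPDATE.

```python
import sqlite3

# Assumption: SQLite stands in for Postgres so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (
    id            INTEGER PRIMARY KEY,
    title         TEXT NOT NULL,
    reingested_at TEXT              -- document-level state (hypothetical column)
);
CREATE TABLE chunks (
    id          INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id),
    chunk_text  TEXT NOT NULL
);
INSERT INTO documents (id, title) VALUES (1, 'Old Title');
""")

# A 200-chunk document: the title is stored once, not 200 times.
conn.executemany(
    "INSERT INTO chunks (document_id, chunk_text) VALUES (1, ?)",
    [(f"chunk {i}",) for i in range(200)],
)

# Rename the document and record the re-ingestion in one row.
cur = conn.execute(
    "UPDATE documents SET title = ?, reingested_at = ? WHERE id = 1",
    ("New Title", "2024-01-01T14:03:00"),  # illustrative timestamp
)
rows_touched = cur.rowcount

# Every chunk sees the new title through the join; none can drift out of sync.
titles_seen = conn.execute(
    "SELECT DISTINCT d.title FROM chunks c "
    "JOIN documents d ON d.id = c.document_id"
).fetchall()
chunk_count = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
```

Here rows_touched counts only the documents row; the chunk rows are never written.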

The Shape of the Relationship

documents (1) ───< chunks (many)
   id                document_id  ──┐
   title                            │ FK references documents(id)
   source_uri                       │
   content_hash                     │

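Every chunk row carries a document_id that points at exactly one document. Postgres enforces this with a REFERENCES constraint, so you cannot insert a chunk whose document_id names a document that doesn't exist (referential integrity). Declare the column NOT NULL as well, because a foreign key alone still accepts a NULL document_id. With ON DELETE CASCADE, deleting a document automatically removes its chunks; without it, the default behavior is to reject the delete while chunks still reference the document.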
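Both behaviors of the foreign key can be observed in a short sketch. As an assumption for runnability, SQLite (via Python's sqlite3) stands in for Postgres; note that SQLite enforces foreign keys only after PRAGMA foreign_keys = ON, whereas Postgres enforces them without any such switch. The table contents are illustrative.

```python
import sqlite3

# Assumption: SQLite stands in for Postgres so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required in SQLite, not in Postgres
conn.executescript("""
CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE chunks (
    id          INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    chunk_text  TEXT NOT NULL
);
INSERT INTO documents (id, title) VALUES (1, 'Manual');
INSERT INTO chunks (document_id, chunk_text) VALUES (1, 'a'), (1, 'b');
""")

# Referential integrity: a chunk pointing at a missing document is rejected.
try:
    conn.execute(
        "INSERT INTO chunks (document_id, chunk_text) VALUES (999, 'orphan')"
    )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# ON DELETE CASCADE: deleting the parent removes its chunks as well.
conn.execute("DELETE FROM documents WHERE id = 1")
remaining_chunks = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
```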

Key Takeaways
  • Documents and chunks are two different grains — model them as a one-to-many relationship, not one flat table
  • A single table forces update anomalies, wastes storage on repeated metadata, and leaves no home for document-level state
  • A foreign key with ON DELETE CASCADE keeps chunks consistent and cleans them up when their parent is deleted

Course Stats

  • Estimated time: 50 min
  • Lessons: 5 sections