Capstone: Postgres RAG Platform Design

Two Profiles, One Design Process

This is the final capstone of the PostgreSQL for AI track. Across the previous four days you learned the pieces in isolation: replication (Day 1), Citus sharding (Day 2), embedding pipelines (Day 3), and observability (Day 4). Today you assemble them into a complete, production-grade, Postgres-backed RAG platform, and you do it twice, against two deliberately different profiles, so you can feel where each decision flips.

The Two Case Studies

Profile A — Internal Knowledge Assistant (single-tenant). A 2,000-person company wants a "chat with our docs" assistant over Confluence, Google Drive, and a Slack export. Roughly 2 million chunks total, growing ~5% a month. Peak load is ~20 QPS during business hours, near zero overnight. One trust boundary (everyone is an employee), modest latency expectations (~2s end-to-end is fine), and a small platform team.

Profile B — Multi-Tenant SaaS RAG (large). A document-intelligence SaaS serves 8,000 customer tenants, each with their own corpus. Aggregate 120 million chunks and climbing 10M/month. Peak 1,500 QPS, strict tenant isolation (a query must never leak across tenants), p95 latency SLO of 400ms for retrieval, and a 99.9% availability target.

The two profiles differ in more than size: they also differ in isolation model, growth rate, SLO, and team capacity. Those four axes drive almost every decision below.
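To make the growth-rate axis concrete, the sketch below projects each profile's chunk count forward. It assumes Profile A's ~5% monthly growth compounds and Profile B's ~10M/month is a fixed linear increment, which is one reading of the profile descriptions. The function names and the 12-month horizon are illustrative choices, not part of the case studies.

```python
def project_compound(start_chunks: int, monthly_rate: float, months: int) -> float:
    """Chunk count after `months` of compounding percentage growth."""
    return start_chunks * (1 + monthly_rate) ** months


def project_linear(start_chunks: int, added_per_month: int, months: int) -> int:
    """Chunk count after `months` of fixed absolute growth."""
    return start_chunks + added_per_month * months


HORIZON_MONTHS = 12  # assumed planning horizon, not from the case studies

# Profile A: 2M chunks, ~5% a month (assumed to compound).
profile_a = project_compound(2_000_000, 0.05, HORIZON_MONTHS)
# Profile B: 120M chunks, ~10M added per month (assumed linear).
profile_b = project_linear(120_000_000, 10_000_000, HORIZON_MONTHS)

print(f"Profile A after {HORIZON_MONTHS} months: {profile_a / 1e6:.1f}M chunks")
print(f"Profile B after {HORIZON_MONTHS} months: {profile_b / 1e6:.0f}M chunks")
```

Under these assumptions Profile A stays below 4M chunks after a year, while Profile B doubles to 240M, so the gap between the two profiles widens over time.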

The Design Process (the spine of this day)

For each profile we walk the same seven-step spine:

Schema — tables, the vector column, metadata columns, and the chunk/document relationship.
pgvector indexing — HNSW vs IVFFlat, dimensionality, quantization, and maintenance_work_mem.
Hybrid search — combining dense vector similarity with full-text (tsvector/BM25-style) ranking.
Metadata filtering — pre- vs post-filter, partial indexes, and the tenant predicate.
Scaling — read replicas vs Citus distribution, and when each is warranted.
Incremental embedding — keeping the index fresh without full re-embeds.
Observability — what to measure and the alerts that catch silent recall loss.

The Single Most Important Up-Front Number

Before any of that: estimate storage, because it sets everything downstream (index type, RAM, whether you shard at all). The formula you'll reuse all day:

bytes_per_chunk ≈ (dims × 4) + raw_text + metadata + index_overhead

For 1536-dim OpenAI embeddings: the vector alone is 1536 × 4 = 6,144 bytes (~6 KB). HNSW adds roughly 2–4 KB/chunk of graph. Add ~1–2 KB for text + metadata + row overhead. Call it ~10 KB/chunk all-in as a planning rule.

Profile A: 2M × 10 KB ≈ 20 GB, which fits in RAM on a single mid-size instance. You almost certainly do not need to shard.
Profile B: 120M × 10 KB ≈ 1.2 TB, far more than a single node can comfortably hold in RAM. You will shard (Citus) and/or partition by tenant.

That one calculation already tells you Profile A is a single-node-plus-replicas story and Profile B is a distributed story. Everything else is detail.
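The storage arithmetic above can be reproduced with a short script. This is a minimal sketch that uses decimal units (1 KB = 1,000 bytes) and the rough component ranges quoted in this section. The function and constant names are hypothetical, chosen only for illustration.

```python
BYTES_PER_FLOAT32 = 4
KB, GB, TB = 1_000, 1_000**3, 1_000**4  # decimal units are fine for planning


def bytes_per_chunk(dims: int, index_overhead: int, text_and_metadata: int) -> int:
    """All-in bytes per chunk: float32 vector + index graph + text/metadata/row overhead."""
    return dims * BYTES_PER_FLOAT32 + index_overhead + text_and_metadata


# Range implied by the figures above: HNSW 2-4 KB, text + metadata + row overhead 1-2 KB.
low = bytes_per_chunk(1536, 2 * KB, 1 * KB)
high = bytes_per_chunk(1536, 4 * KB, 2 * KB)
print(f"per chunk: {low:,} to {high:,} bytes")

PLANNING_BYTES_PER_CHUNK = 10 * KB  # the ~10 KB/chunk planning rule

profile_a_bytes = 2_000_000 * PLANNING_BYTES_PER_CHUNK
profile_b_bytes = 120_000_000 * PLANNING_BYTES_PER_CHUNK
print(f"Profile A: {profile_a_bytes / GB:.0f} GB")
print(f"Profile B: {profile_b_bytes / TB:.1f} TB")
```

The per-chunk range works out to roughly 9 to 12 KB, which is why ~10 KB is a reasonable planning figure. Swapping in a different embedding dimension or index overhead shows quickly whether a profile crosses from the single-node case into the distributed one.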

Key Takeaways

Four axes drive almost every design decision: isolation model, growth rate, SLO, and team capacity; size alone is not enough
Always estimate all-in storage first (~10 KB/chunk for 1536-dim + HNSW) — it determines index type, RAM, and whether you shard at all
Profile A (~20 GB) is a single-node-plus-replicas story; Profile B (~1.2 TB) is inherently a distributed (Citus) story
