A single Postgres primary can index your vectors, but it can't serve the read traffic of a busy RAG application alone. This day is about scaling reads: streaming replication and read replicas, routing search queries to followers without serving stale answers, taming replication lag and read-your-writes, sizing a PgBouncer pool so a vector workload doesn't exhaust connections, and surviving a failover.
The Intermediate course got pgvector working on a single Postgres instance: an HNSW index, <=> cosine queries, decent recall. The Advanced course starts where that breaks. A single primary is simultaneously taking writes (embedding upserts) and serving reads (similarity search) — and a busy RAG application is overwhelmingly read-heavy.
A typical RAG workload looks like this:
ORDER BY embedding <=> $1 LIMIT k query, often several (multi-query retrieval, re-ranking candidates). Thousands of these per second at peak.
On one machine those reads and writes fight over the same CPU, the same buffer cache, and the same I/O. An HNSW similarity scan is CPU- and memory-bandwidth-hungry; a vacuum or a big INSERT ... ON CONFLICT batch evicts your hot index pages from shared_buffers right when you need them.
The first lever is always vertical: more RAM so the HNSW index stays resident, more cores for parallel scans, faster NVMe. A single r6i.8xlarge (256 GB RAM, 32 vCPU) holds ~50M 768-dim vectors with HNSW comfortably and serves a few thousand QPS. Do this first — it's one machine, no routing, no consistency questions.
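As a rough sanity check on that sizing claim, the sketch below estimates how much memory the index needs. It assumes float32 storage (4 bytes per dimension) and a hypothetical 256 bytes per vector for neighbor links and tuple headers. Neither figure is measured, and the function name is illustrative only.

```python
def hnsw_footprint_gib(num_vectors: int, dims: int,
                       bytes_per_dim: int = 4,
                       graph_overhead_bytes: int = 256) -> float:
    """Rough in-memory size of an HNSW index, in GiB.

    Assumptions (not measured): vectors are stored as float32, and each
    vector carries about `graph_overhead_bytes` of links and headers.
    """
    per_vector = dims * bytes_per_dim + graph_overhead_bytes
    return num_vectors * per_vector / 2**30


RAM_GIB = 256  # r6i.8xlarge, as quoted in the text
estimate = hnsw_footprint_gib(50_000_000, 768)
print(f"estimated index size: {estimate:.0f} GiB of {RAM_GIB} GiB")
```

Under these assumptions the index comes to roughly 155 GiB, which fits in RAM with room to spare. The estimate covers the index only, not the table's own storage, so treat it as a lower bound and measure the real index size before committing to an instance type.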
You scale out — to read replicas — when one of these is true:
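Two words get confused, and the confusion costs weeks: replication and sharding. Replication puts a full copy of the data on each additional machine, so it adds read capacity; sharding splits the data across machines, so it adds storage capacity.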
This day is entirely about replication. If your index fits on one machine but you can't serve the read load, replication is the answer. If the index itself no longer fits, you need sharding — a different and harder problem.
Add replicas when you can name the bottleneck with a number: "the primary is at 85% CPU and read latency p99 crossed 200ms" is a real signal. "We might get popular" is not. Every replica you add is another machine to monitor, another source of replication lag, and another way to accidentally serve stale results.
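One way to hold yourself to that rule is to write the signal down as an explicit check. The sketch below is a hypothetical helper, not part of any monitoring tool: the 85% CPU and 200 ms p99 thresholds are the ones from the example above, the percentile uses the simple nearest-rank method, and the inputs are assumed to come from whatever metrics you already collect.

```python
from typing import List, Optional


def p99(latencies_ms: List[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n), 1-based
    return ordered[rank - 1]


def replica_signal(cpu_pct: float, latencies_ms: List[float],
                   cpu_limit: float = 85.0,
                   p99_limit_ms: float = 200.0) -> Optional[str]:
    """Return a one-line justification if both thresholds are crossed.

    Returns None when the numbers do not yet make the case for a replica.
    """
    observed = p99(latencies_ms)
    if cpu_pct >= cpu_limit and observed > p99_limit_ms:
        return f"primary at {cpu_pct:.0f}% CPU, read p99 {observed:.0f} ms"
    return None


# Illustrative data: 98 fast reads and two slow ones.
samples = [50.0] * 98 + [250.0, 300.0]
print(replica_signal(90.0, samples))
print(replica_signal(60.0, samples))
```

The first call returns a sentence you could put in a capacity ticket; the second returns None because CPU is below the limit. Requiring both conditions is a design choice taken from the example, and you may prefer to alert on either one alone.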