Replication & Read Scaling for pgvector

A single Postgres primary can index your vectors, but it can't serve the read traffic of a busy RAG application alone. This day is about scaling reads: streaming replication and read replicas, routing search queries to followers without serving stale answers, taming replication lag and read-your-writes, sizing a PgBouncer pool so a vector workload doesn't exhaust connections, and surviving a failover.


Why One Postgres Isn't Enough for RAG

The Intermediate course got pgvector working on a single Postgres instance: an HNSW index, <=> cosine queries, decent recall. The Advanced course starts where that breaks. A single primary is simultaneously taking writes (embedding upserts) and serving reads (similarity search) — and a busy RAG application is overwhelmingly read-heavy.

The Read/Write Asymmetry

A typical RAG workload looks like this:

  • Writes are bursty and batched: you re-embed a corpus, or ingest a day's documents. Tens to hundreds of upserts per second during ingestion, near-zero otherwise.
  • Reads are continuous and latency-sensitive: every user question fires at least one ORDER BY embedding <=> $1 LIMIT k query, often several (multi-query retrieval, re-ranking candidates). Thousands of these per second at peak.

On one machine those reads and writes fight over the same CPU, the same buffer cache, and the same I/O. An HNSW similarity scan is CPU- and memory-bandwidth-hungry; a vacuum or a big INSERT ... ON CONFLICT batch can evict your hot index pages from shared_buffers and the OS page cache right when you need them.

Scale Up Before You Scale Out

The first lever is always vertical: more RAM so the HNSW index stays resident, more cores to serve more concurrent queries (each HNSW index scan runs in a single backend process, so extra cores add throughput rather than speeding up one query), faster NVMe. A single r6i.8xlarge (256 GB RAM, 32 vCPU) holds ~50M 768-dim vectors with HNSW comfortably and serves a few thousand QPS. Do this first: it's one machine, no routing, no consistency questions.
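
To sanity-check a sizing claim like the one above, a back-of-the-envelope estimate helps. The sketch below is a rough lower bound, not pgvector's actual on-disk accounting. The function name is hypothetical, and the link model (about 2*m neighbor links per vector at 6 bytes each, with m = 16) is an assumption for illustration. It ignores page overhead, upper graph layers, and the table heap, which holds its own copy of the vectors.

```python
def hnsw_memory_estimate_gb(n_vectors: int, dims: int, m: int = 16) -> dict:
    """Rough lower-bound size of an HNSW index, in decimal gigabytes."""
    bytes_per_float = 4  # pgvector's vector type stores 4-byte floats
    vector_bytes = n_vectors * dims * bytes_per_float

    # Assumption: ~2*m neighbor links per vector on the base layer,
    # 6 bytes per link. Upper layers and page overhead are ignored.
    link_bytes = n_vectors * 2 * m * 6

    gb = 1e9
    return {
        "vectors_gb": vector_bytes / gb,
        "links_gb": link_bytes / gb,
        "total_gb": (vector_bytes + link_bytes) / gb,
    }


# The example from the text: 50M vectors, 768 dimensions.
est = hnsw_memory_estimate_gb(50_000_000, 768)
for name, value in est.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions the index needs on the order of 160 GB, which leaves headroom on a 256 GB machine for the index but not for the index plus a fully cached table. Measure the real index with pg_relation_size before trusting any estimate.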

You scale out — to read replicas — when one of these is true:

  • Read QPS exceeds what one machine can serve even when fully tuned
  • You need to isolate analytics / batch jobs from the live serving path
  • You need availability: the primary cannot be a single point of failure

Replication vs Sharding (Different Problems)

Two words get confused and the confusion costs weeks:

  • Replication keeps full copies of the same database on other machines. Every replica has all your vectors. It scales reads, adds availability, and lowers geographic latency. It does not let you store more data than fits on one machine.
  • Sharding splits the data across machines so each holds a fraction. It scales write throughput and total capacity. (That's Day 2, with Citus.)

This day is entirely about replication. If your index fits on one machine but you can't serve the read load, replication is the answer. If the index itself no longer fits, you need sharding — a different and harder problem.
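
The distinction can be written down as a small decision rule. This is a minimal sketch with hypothetical names and inputs: node_capacity_gb stands for whatever "fits on one machine" means for your hardware, and node_max_qps is the read throughput you measured on a tuned single node.

```python
def scaling_strategy(index_gb: float, node_capacity_gb: float,
                     read_qps: float, node_max_qps: float) -> str:
    """Pick a strategy following the replication-vs-sharding rule in the text."""
    # The index no longer fits on one machine: replicas cannot help,
    # because every replica holds a full copy of the data.
    if index_gb > node_capacity_gb:
        return "shard"
    # The index fits, but one machine cannot serve the read load.
    if read_qps > node_max_qps:
        return "replicate"
    # Neither limit is hit: stay on a single, well-tuned node.
    return "single node"


print(scaling_strategy(160, 256, 5000, 3000))  # fits, too many reads
print(scaling_strategy(400, 256, 1000, 3000))  # does not fit
print(scaling_strategy(160, 256, 1000, 3000))  # fits, load is fine
```

The capacity check comes first on purpose: if the data does not fit, adding replicas only multiplies the problem.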

The Honest Default

Add replicas when you can name the bottleneck with a number: "the primary is at 85% CPU and read latency p99 crossed 200ms" is a real signal. "We might get popular" is not. Every replica you add is another machine to monitor, another source of replication lag, and another way to accidentally serve stale results.
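
"Name the bottleneck with a number" can be made literal. The sketch below is illustrative only: the function name is hypothetical, and the default thresholds are simply the 85% CPU and 200 ms p99 figures from the example above, not recommended values. Feed it numbers from your own monitoring.

```python
def replica_justified(cpu_pct: float, read_p99_ms: float,
                      cpu_limit_pct: float = 85.0,
                      p99_limit_ms: float = 200.0) -> bool:
    """True only when both measured signals show the primary is saturated."""
    cpu_saturated = cpu_pct >= cpu_limit_pct      # primary at or above the CPU limit
    latency_breached = read_p99_ms > p99_limit_ms  # read p99 has crossed the limit
    return cpu_saturated and latency_breached


print(replica_justified(cpu_pct=85, read_p99_ms=210))  # measured bottleneck
print(replica_justified(cpu_pct=60, read_p99_ms=210))  # slow, but not CPU-bound
print(replica_justified(cpu_pct=90, read_p99_ms=120))  # busy, but still fast
```

Requiring both signals is a deliberate choice: high latency with idle CPU usually points at something a replica will not fix, such as an index that no longer fits in memory.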

Key Takeaways
  • RAG is read-heavy: continuous latency-sensitive similarity searches contend with bursty embedding writes on a single primary
  • Scale vertically first (more RAM to keep the HNSW index resident); add read replicas only when a tuned single node is still the bottleneck
  • Replication copies the whole database to scale reads and availability; sharding splits data to scale capacity — this day is about replication

