Replication & Read Scaling for pgvector

Why One Postgres Isn't Enough for RAG

The Intermediate course got pgvector working on a single Postgres instance: an HNSW index, <=> cosine queries, decent recall. The Advanced course starts where that breaks. A single primary is simultaneously taking writes (embedding upserts) and serving reads (similarity search) — and a busy RAG application is overwhelmingly read-heavy.

The Read/Write Asymmetry

A typical RAG workload looks like this:

Writes are bursty and batched: you re-embed a corpus, or ingest a day's documents. Tens to hundreds of upserts per second during ingestion, near-zero otherwise.
Reads are continuous and latency-sensitive: every user question fires at least one ORDER BY embedding <=> $1 LIMIT k query, often several (multi-query retrieval, re-ranking candidates). Thousands of these per second at peak.
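
To make the read side concrete, here is a minimal sketch of what one ORDER BY embedding <=> $1 LIMIT k query computes, written as an exact brute-force scan in plain Python. The helper names and the toy corpus are made up for illustration. A real HNSW index returns approximate neighbors and avoids scanning every row, so treat this as a statement of the semantics, not of how Postgres executes the query.

```python
import math

def cosine_distance(a, b):
    """Cosine distance, the quantity the <=> operator orders by: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query, rows, k):
    """Exact equivalent of ORDER BY embedding <=> query LIMIT k.

    rows is a list of (id, embedding) pairs.
    """
    ranked = sorted(rows, key=lambda row: cosine_distance(query, row[1]))
    return [row_id for row_id, _ in ranked[:k]]

# Toy 2-dimensional corpus; ids and vectors are hypothetical.
corpus = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
nearest = top_k([1.0, 0.1], corpus, k=2)
```

Every one of these lookups costs a distance computation per candidate vector, which is why thousands of them per second dominate CPU on the serving path.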

On one machine those reads and writes fight over the same CPU, the same buffer cache, and the same I/O. An HNSW similarity scan is hungry for CPU and memory bandwidth, and a big INSERT ... ON CONFLICT batch or a vacuum running at peak can push hot index pages out of cache and saturate I/O right when you need them.

Scale Up Before You Scale Out

The first lever is always vertical: more RAM so the HNSW index stays resident, more cores to serve concurrent queries, faster NVMe. A single r6i.8xlarge (256 GB RAM, 32 vCPU) holds ~50M 768-dim vectors with HNSW comfortably and can serve a few thousand QPS, depending on your recall target. Do this first: it's one machine, no routing, no consistency questions.
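
A rough way to sanity-check a sizing claim like this is to compute the raw vector payload. The sketch below assumes pgvector's documented storage cost of 4 bytes per dimension plus 8 bytes per vector, and treats the instance's 256 GB as decimal gigabytes. It is a lower bound only: the HNSW graph adds neighbor links on top, and the table holds its own copy of the vectors, neither of which is modeled here. The function names are illustrative.

```python
def vector_bytes(dims):
    """Storage for one pgvector value: 4 bytes per float32 dimension plus an 8-byte header."""
    return 4 * dims + 8

def raw_vector_gb(num_vectors, dims):
    """Total vector payload in decimal GB, excluding graph links, row headers, and other columns."""
    return num_vectors * vector_bytes(dims) / 1e9

# Figures from the text: 50M vectors, 768 dimensions, 256 GB of RAM.
payload_gb = raw_vector_gb(50_000_000, 768)
ram_gb = 256.0  # assumption: treated as decimal GB
headroom_gb = ram_gb - payload_gb
```

With these inputs the payload comes to about 154 GB, leaving roughly 100 GB for graph links, shared_buffers overhead, and everything else on the box. If your own numbers leave little or no headroom, the index will not stay resident and vertical scaling has run out.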

You scale out — to read replicas — when one of these is true:

Read QPS exceeds what one machine can serve even when fully tuned
You need to isolate analytics / batch jobs from the live serving path
You need availability: the primary cannot be a single point of failure

Replication vs Sharding (Different Problems)

Two words get confused and the confusion costs weeks:

Replication keeps full copies of the same database on other machines. Every replica has all your vectors. It scales reads, adds availability, and lowers geographic latency. It does not let you store more data than fits on one machine.
Sharding splits the data across machines so each holds a fraction. It scales write throughput and total capacity. (That's Day 2, with Citus.)

This day is entirely about replication. If your index fits on one machine but you can't serve the read load, replication is the answer. If the index itself no longer fits, you need sharding — a different and harder problem.
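
The difference shows up in how a client picks a node. The sketch below is a minimal illustration, not a production router: the node names are hypothetical, and real deployments usually delegate this to a pooler or driver. With replication, any replica can answer any read, so the choice ignores the data. With sharding, the key decides which node holds the row.

```python
import itertools
import zlib

class ReplicaRouter:
    """Send writes to the primary and spread reads across replicas round-robin."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # With no replicas configured, reads fall back to the primary.
        self._reads = itertools.cycle(replicas or [primary])

    def for_write(self):
        return self.primary

    def for_read(self):
        return next(self._reads)

def shard_for(doc_id, num_shards):
    """Sharding, by contrast: a stable hash of the key selects the shard."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

# Hypothetical node names for illustration.
router = ReplicaRouter("primary", ["replica-1", "replica-2"])
read_targets = [router.for_read() for _ in range(4)]
```

Round-robin is only one possible policy; it is used here because it makes the point that every replica is interchangeable for reads.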

The Honest Default

Add replicas when you can name the bottleneck with a number: "the primary is at 85% CPU and read latency p99 crossed 200ms" is a real signal. "We might get popular" is not. Every replica you add is another machine to monitor, another source of replication lag, and another way to accidentally serve stale results.
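
One way to hold yourself to that rule is to encode the signal as a check over measured data. The sketch below uses the example thresholds from the paragraph above (85% CPU, 200 ms p99); they are illustrations, not universal limits, and the function names and sample data are hypothetical.

```python
def p99(samples_ms):
    """Nearest-rank 99th percentile of latency samples in milliseconds."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer arithmetic, 1-based
    return ordered[rank - 1]

def should_add_replica(cpu_utilization, latencies_ms, cpu_limit=0.85, p99_limit_ms=200.0):
    """True only when CPU is at or past its limit and p99 latency has crossed its limit."""
    return cpu_utilization >= cpu_limit and p99(latencies_ms) > p99_limit_ms

# Synthetic samples: 97 fast queries and 3 slow ones.
samples = [40.0] * 97 + [250.0] * 3
decision = should_add_replica(0.87, samples)
```

Requiring both conditions mirrors the example signal in the text; a team might reasonably alert on either one alone, but the point is that the decision is driven by measurements rather than anticipation.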

Key Takeaways

RAG is read-heavy: continuous latency-sensitive similarity searches contend with bursty embedding writes on a single primary
Scale vertically first (more RAM to keep the HNSW index resident); add read replicas only when a tuned single node is still the bottleneck
Replication copies the whole database to scale reads and availability; sharding splits data to scale capacity — this day is about replication
