One Postgres node eventually runs out of RAM for HNSW. This day takes pgvector horizontal: distributing tables across a Citus cluster, choosing a distribution key that keeps co-located joins cheap, fanning a vector search across worker shards, and merging per-shard top-K into a correct global top-K — including the recall bug that silently ships when you ask each shard for too few rows.
By now you can run pgvector on a single Postgres instance: a vector column, an HNSW index, and ORDER BY embedding <=> $1 LIMIT 10. That setup is genuinely good — it ships, it has transactions, and it fits the vast majority of products. This day is about the next problem: the one Postgres node that can no longer hold the index.
HNSW is an in-memory graph. For acceptable query latency, the index has to live in RAM (or at least in the OS page cache). A 1536-dim float32 vector is ~6 KB; the HNSW graph adds roughly m × 8 bytes per vector for neighbor links (~128 bytes at m=16), plus per-tuple Postgres overhead. Round to ~7 KB of resident memory per vector.
You hit the RAM wall long before you hit a disk or CPU wall. Once the working set spills out of memory, HNSW traversal turns random reads into page faults and p99 latency falls off a cliff.
Before reaching for a cluster, exhaust the single node — it is dramatically simpler to operate:
halfvec (float16) halves index memory at a small recall cost; binary/scalar quantization can cut 4–32×. This often buys you another order of magnitude on one node.The honest rule of thumb: shard pgvector only when a single, well-quantized, well-provisioned node can't hold your working set or serve your write/query load — and you can say which of those is true, with numbers.
Citus is a Postgres extension (now Microsoft-owned, also the engine behind Azure Cosmos DB for PostgreSQL) that turns one Postgres into a distributed cluster. It is still Postgres — same SQL, same drivers, same pgvector. The architecture:
You keep writing SELECT ... ORDER BY embedding <=> $1; Citus rewrites it into per-shard queries, runs them on the workers in parallel, and combines the results. The rest of this day is about making that rewrite correct and fast for vector workloads, where the naive combine step has a recall trap.