Sharding with Citus for Billion-Scale Vectors

One Postgres node eventually runs out of RAM for HNSW. This day takes pgvector horizontal: distributing tables across a Citus cluster, choosing a distribution key that keeps co-located joins cheap, fanning a vector search across worker shards, and merging per-shard top-K into a correct global top-K — including the recall bug that silently ships when you ask each shard for too few rows.

Day 2 Progress0%

When pgvector Outgrows One Node

By now you can run pgvector on a single Postgres instance: a vector column, an HNSW index, and ORDER BY embedding <=> $1 LIMIT 10. That setup is genuinely good — it ships, it has transactions, and it fits the vast majority of products. This day is about the next problem: the one Postgres node that can no longer hold the index.

The Wall Is RAM, Not Disk

HNSW is an in-memory graph. For acceptable query latency, the index has to live in RAM (or at least in the OS page cache). A 1536-dim float32 vector is ~6 KB; the HNSW graph adds roughly m × 8 bytes per vector for neighbor links (~128 bytes at m=16), plus per-tuple Postgres overhead. Round to ~7 KB of resident memory per vector.

  • 10M vectors → ~70 GB — comfortable on one big node
  • 100M vectors → ~700 GB — you're now buying the largest instances
  • 1B vectors → ~7 TB — past what a single machine holds

You hit the RAM wall long before you hit a disk or CPU wall. Once the working set spills out of memory, HNSW traversal turns random reads into page faults and p99 latency falls off a cliff.

Vertical First, Always

Before reaching for a cluster, exhaust the single node — it is dramatically simpler to operate:

  • Bigger instance. A 1–2 TB-RAM machine holds ~150–250M vectors. That covers most teams forever.
  • Quantization. pgvector's halfvec (float16) halves index memory at a small recall cost; binary/scalar quantization can cut 4–32×. This often buys you another order of magnitude on one node.
  • Lower dimensions. Matryoshka embeddings or PCA truncation shrink each vector linearly.
  • Partial / filtered indexes. If most queries hit recent data, index only the hot partition.

The honest rule of thumb: shard pgvector only when a single, well-quantized, well-provisioned node can't hold your working set or serve your write/query load — and you can say which of those is true, with numbers.

Enter Citus

Citus is a Postgres extension (now Microsoft-owned, also the engine behind Azure Cosmos DB for PostgreSQL) that turns one Postgres into a distributed cluster. It is still Postgres — same SQL, same drivers, same pgvector. The architecture:

  • A coordinator node holds metadata and plans queries. Clients connect here.
  • Worker nodes hold the actual data, split into shards (each shard is a regular Postgres table).
  • A distributed table is one logical table whose rows are spread across worker shards by a distribution column.

You keep writing SELECT ... ORDER BY embedding <=> $1; Citus rewrites it into per-shard queries, runs them on the workers in parallel, and combines the results. The rest of this day is about making that rewrite correct and fast for vector workloads, where the naive combine step has a recall trap.

Key Takeaways
  • Single-node pgvector hits a RAM wall (HNSW must be resident) around hundreds of millions of vectors — not a disk or CPU wall
  • Exhaust vertical options first: bigger instance, halfvec/quantization, lower dims, partial indexes — they are far simpler than a cluster
  • Citus turns Postgres into a coordinator + workers cluster of shards; it stays Postgres, so pgvector and your SQL keep working

AI Learning Assistant

Powered by advanced LLM

Get personalized help with concepts, code examples, and explanations tailored to your learning pace.

Course Stats

Estimated Time
50 min
Lessons
5 sections