Distributed Vector Search at Scale

What you build when one machine isn't enough. Sharding strategies, cross-shard top-K merging (and the silent recall bug that ships in every junior implementation), replication, consistency models, and how Pinecone, Weaviate, Qdrant, and Milvus actually compose these primitives.

Day 1

When a Single Node Breaks

The Intermediate course taught you how a single-node index works — HNSW graphs, IVF clusters, quantization. The Advanced course starts from a harsh reality: at scale, one node is never enough. This day is about what you build when you outgrow it.

The Three Walls

Single-node vector search hits three walls, usually in this order:

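Wall 1: RAM ceiling. HNSW keeps the entire graph in memory. A 768-dim float32 vector takes 3 KB (768 × 4 bytes). On top of that, an HNSW graph with M=16 stores up to 2M = 32 neighbor links per vector at the base layer, which at 4 bytes per link is about 128 bytes, plus a little more for upper layers and bookkeeping in a typical implementation. So each vector costs a bit over 3 KB of RAM, and 1B vectors need more than 3 TB before you leave any headroom for the OS, index builds, or growth. That already crowds a 4 TB machine such as AWS's x2iedn.32xlarge, and the next doubling of the corpus does not fit. Quantization can knock 4–32x off the vector storage, but at 100B vectors you're back to TB-scale.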
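The sizing arithmetic above is easy to parameterize. The sketch below is a back-of-the-envelope estimator, not a measurement: the function name and its parameters are made up for illustration, and graph overhead is left as an input because it varies by implementation and by M.

```python
TB = 1e12  # decimal terabytes, to keep the arithmetic readable


def index_ram_bytes(num_vectors, dim, bytes_per_component=4.0,
                    graph_bytes_per_vector=0.0):
    """Rough RAM estimate: raw vector storage plus per-vector graph overhead.

    graph_bytes_per_vector is an assumption the caller supplies; measure it
    on your own index rather than trusting a rule of thumb.
    """
    vector_bytes = dim * bytes_per_component
    return num_vectors * (vector_bytes + graph_bytes_per_vector)


# 1B vectors, 768-dim float32, vector storage only
raw_1b = index_ram_bytes(1_000_000_000, 768)

# 100B vectors with 32x quantization: 4 bytes -> 0.125 bytes per component
quantized_100b = index_ram_bytes(100_000_000_000, 768,
                                 bytes_per_component=4 / 32)

print(f"{raw_1b / TB:.2f} TB")          # 3.07 TB
print(f"{quantized_100b / TB:.2f} TB")  # 9.60 TB
```

Even with the most aggressive quantization in the 4–32x range, the 100B-vector case lands at roughly 10 TB of vector storage alone, which is the point of this wall.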

Wall 2: Query throughput. HNSW queries are fast (single-digit milliseconds at 1M vectors), but every query burns CPU time and memory bandwidth, so one machine tops out at a finite rate, typically a few thousand queries per second at production index sizes. A consumer-facing application doing 50,000 QPS at peak needs more than one machine, regardless of how much memory it has.
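As a rough sizing aid, the number of machines needed for throughput alone is the peak load divided by what one machine can sustain. The helper below is a hypothetical sketch; the 5,000 QPS per-node figure is borrowed from the single-machine example later in this section, and the utilization target is an assumption you would tune.

```python
import math


def nodes_for_throughput(peak_qps, per_node_qps, target_utilization=1.0):
    """Machines needed to serve peak_qps, each run at target_utilization."""
    usable_qps = per_node_qps * target_utilization
    return math.ceil(peak_qps / usable_qps)


print(nodes_for_throughput(50_000, 5_000))       # 10
print(nodes_for_throughput(50_000, 5_000, 0.7))  # 15
```

Running nodes below full utilization leaves room for traffic spikes and for absorbing the load of a failed machine.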

Wall 3: Blast radius. If your single node goes down, your entire search system goes down. For anything serving live user traffic, this wall can force a second machine on you before memory or throughput ever does.

What "Distributed" Actually Means

Two distinct concepts get bundled under the word "distributed," and confusing them costs people weeks:

  • Sharding splits the data across machines. Each shard holds a fraction of the vectors. A query may need to visit multiple shards. This addresses RAM and (incidentally) throughput.
  • Replication keeps multiple copies of the same data across machines. Each replica holds the same vectors. Queries can go to any replica. This addresses throughput, availability, and read latency — not RAM.
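The difference between the two is easiest to see in a toy cluster. The sketch below is an in-memory illustration under stated assumptions: brute-force distance stands in for a real per-shard index such as HNSW, modulo placement stands in for a real sharding function, and all names are made up for this example.

```python
import heapq
import random


def l2_sq(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_cluster(vectors, num_shards, num_replicas):
    """Return cluster[shard][replica] -> list of (id, vector)."""
    shards = [[] for _ in range(num_shards)]
    for vid, vec in vectors.items():
        # Sharding: each vector lives in exactly one shard
        shards[vid % num_shards].append((vid, vec))
    # Replication: every replica of a shard holds the same vectors
    return [[list(shard) for _ in range(num_replicas)] for shard in shards]


def search(cluster, query, k, rng=random):
    """Fan out to every shard, ask one replica of each, merge the results."""
    partials = []
    for replicas in cluster:
        replica = rng.choice(replicas)  # any replica can answer for its shard
        scored = ((l2_sq(query, vec), vid) for vid, vec in replica)
        partials.extend(heapq.nsmallest(k, scored))  # shard-local top-k
    return [vid for _, vid in heapq.nsmallest(k, partials)]


vectors = {i: [float(i)] for i in range(20)}
cluster = build_cluster(vectors, num_shards=4, num_replicas=3)
print(search(cluster, [7.2], k=3))  # [7, 8, 6]
```

Note where each concept shows up: a query touches every shard (so sharding adds fan-out), but only one replica per shard (so replication adds capacity and redundancy without adding work per query).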

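A real system uses both, for a total of shards × replicas machines. For 100M vectors in a 4-shard × 3-replica deployment, that's 12 machines, each holding 25M vectors, with 3 copies of each shard. You can lose any 2 of those 12 and still serve every query: even if both failures hit the same shard, one replica of it survives. A third failure is only safe if it does not take out the last copy of a shard.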
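The failure-tolerance claim can be checked by brute force. This sketch assumes the only requirement is that every shard keeps at least one live replica; it says nothing about whether the surviving machines can absorb the full query load. Function names are made up for illustration.

```python
from itertools import combinations


def survives(failed, num_shards, num_replicas):
    """True if every shard still has at least one live replica."""
    return all(
        any((s, r) not in failed for r in range(num_replicas))
        for s in range(num_shards)
    )


def max_guaranteed_failures(num_shards, num_replicas):
    """Largest f such that ANY f simultaneous machine failures are survivable."""
    machines = [(s, r) for s in range(num_shards) for r in range(num_replicas)]
    tolerated = 0
    for f in range(1, len(machines) + 1):
        if all(survives(set(combo), num_shards, num_replicas)
               for combo in combinations(machines, f)):
            tolerated = f
        else:
            break
    return tolerated


print(max_guaranteed_failures(4, 3))  # 2
```

The result depends only on the replica count: the guaranteed tolerance is replicas − 1, no matter how many shards there are.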

When to Distribute

The honest answer most engineers won't say out loud: as late as possible. Distribution adds:

  • Operational complexity (deploying, monitoring, debugging across machines)
  • Code complexity (routing, merging, retries, fallback)
  • Latency overhead (network hops, tail-latency amplification — covered in Section 3)
  • Cost (more machines, more sophisticated infrastructure)

A single r6i.16xlarge machine (512 GB RAM) can hold ~100M float32 vectors with HNSW, serve ~5,000 QPS, and costs about $4/hour on demand. For most products that's enough, and the distributed alternative is a $40/hour, 10x-complexity multi-node deployment for marginal gain.

The signal to distribute: when you can articulate which of the three walls you're actually hitting, with numbers. "Our index is 600 GB and growing 50 GB a month" is a real signal. "We might need to scale someday" is not.
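Putting a number on the RAM wall can be as simple as projecting growth against capacity. The helper below is a hypothetical sketch that assumes linear growth; the 512 GB figure matches the r6i.16xlarge example above, and the 1,024 GB figure is an arbitrary larger machine chosen for illustration.

```python
def months_of_headroom(current_gb, growth_gb_per_month, capacity_gb):
    """Months until the index outgrows capacity_gb, assuming linear growth."""
    remaining_gb = capacity_gb - current_gb
    if remaining_gb <= 0:
        return 0.0  # already past this machine's ceiling
    return remaining_gb / growth_gb_per_month


print(months_of_headroom(600, 50, 512))            # 0.0
print(round(months_of_headroom(600, 50, 1024), 1)) # 8.5
```

A 600 GB index has already outgrown a 512 GB machine, and moving to a machine twice the size buys well under a year at 50 GB a month.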

Key Takeaways
  • Single-node vector search hits three walls: RAM ceiling, query throughput, and blast radius — usually in that order
  • Sharding splits data across machines; replication copies the same data — they solve different problems and a real system needs both
  • Distribute as late as possible — the complexity cost is real and most products fit on one well-provisioned machine

