What you build when one machine isn't enough. Sharding strategies, cross-shard top-K merging (and the silent recall bug that ships in every junior implementation), replication, consistency models, and how Pinecone, Weaviate, Qdrant, and Milvus actually compose these primitives.
The Intermediate course taught you how a single-node index works — HNSW graphs, IVF clusters, quantization. The Advanced course starts from a harsh reality: at scale, one node is never enough. This day is about what you build when you outgrow it.
Single-node vector search hits three walls, usually in this order:
Wall 1: RAM ceiling. HNSW keeps the entire graph in memory. A 768-dim float32 vector takes 3 KB; the HNSW graph with M=16 adds roughly 2 KB per vector for pointers. So each vector costs ~5 KB of RAM. For 1B vectors, that's 5 TB of RAM. Even the largest single-machine instances on AWS (x2iedn.32xlarge, ~4 TB RAM) can't hold it. Quantization can knock 4–32x off, but at 100B vectors you're back to TB-scale.
Wall 2: Query throughput. HNSW queries are fast — single-digit milliseconds at 1M vectors — but a single machine can serve at most a few thousand queries per second per CPU core. A consumer-facing application doing 50,000 QPS at peak needs more than one machine, regardless of how much memory it has.
Wall 3: Blast radius. If your single node goes down, your entire search system goes down. For anything serving live user traffic, this is the wall that gets crossed first, before memory or throughput.
Two distinct concepts get bundled under the word "distributed," and confusing them costs people weeks:
A real system uses both: shards × replicas machines. For 100M vectors with 4-shard × 3-replica deployment, that's 12 machines, each holding 25M vectors, with 3 copies of each shard. The math says you can survive any 2 of those 12 going down and still serve queries.
The honest answer most engineers won't say out loud: as late as possible. Distribution adds:
A single r6i.16xlarge machine (512 GB RAM) can hold ~100M float32 vectors with HNSW, serve ~5,000 QPS, and costs about $4/hour on demand. For most products that's enough, and the distributed alternative is a $40/hour, 10x-complexity multi-node deployment for marginal gain.
The signal to distribute: when you can articulate which of the three walls you're actually hitting, with numbers. "Our index is 600 GB and growing 50 GB a month" is a real signal. "We might need to scale someday" is not.
Powered by advanced LLM
Get personalized help with concepts, code examples, and explanations tailored to your learning pace.
Embedding Fine-Tuning