Scaling Embeddings: Partitioning & Quantization

Your RAG schema works beautifully at ten thousand chunks. At fifty million, the embedding table no longer fits in RAM and your HNSW index won't build. Day 4 covers the two levers that keep Postgres + pgvector fast and affordable at scale. Partitioning splits the table so each index stays manageable. Quantization stores vectors as halfvec, bit, or scalar-compressed codes to shrink memory and speed up scans, then reranks with full precision to claw recall back.

The Storage Wall

Your Day 3 RAG schema is correct and it is fast — at the scale you tested it. The trouble with embeddings is that they are large, and the cost of storing and indexing them grows linearly with your corpus while your hardware does not.

Doing the Arithmetic

A pgvector vector(D) column stores each value as a 4-byte float32, plus a small per-row header. So one OpenAI text-embedding-3-large vector (D=3072) is:

3072 dims × 4 bytes = 12,288 bytes ≈ 12 KB per row (just the vector)

Multiply that across a corpus:

Chunks        Raw vector data (D=3072)    Raw vector data (D=1536)
100,000       ~1.2 GB                     ~0.6 GB
1,000,000     ~12 GB                      ~6 GB
10,000,000    ~120 GB                     ~60 GB
50,000,000    ~600 GB                     ~300 GB
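
The arithmetic behind the table can be reproduced with a few lines of Python. This is a minimal sketch that counts only the float32 values (D × 4 bytes) and ignores per-row headers, heap data, and indexes; the function names are illustrative. Exact results come out slightly higher than the table (for example about 614 GB rather than ~600 GB at 50M chunks and D=3072) because the table rounds 12,288 bytes down to 12 KB.

```python
# Raw storage for float32 embeddings: D dimensions x 4 bytes each.
# Per-row headers, the heap's text/metadata, and any index are ignored here.
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(num_chunks: int, dims: int) -> int:
    """Bytes needed for the vector values alone."""
    return num_chunks * dims * BYTES_PER_FLOAT32


def to_gb(num_bytes: int) -> float:
    """Decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 1e9


if __name__ == "__main__":
    for chunks in (100_000, 1_000_000, 10_000_000, 50_000_000):
        large = to_gb(raw_vector_bytes(chunks, 3072))
        small = to_gb(raw_vector_bytes(chunks, 1536))
        print(f"{chunks:>11,}  D=3072: {large:7.1f} GB  D=1536: {small:7.1f} GB")
```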

And that is before the index. An HNSW index in pgvector stores the graph in addition to the vectors, typically adding roughly the same order of magnitude again. The heap table also carries your text, metadata JSONB, and tuple overhead on top.

Why "Fits On Disk" Isn't Good Enough

Postgres will happily store 600 GB on a disk. The problem is RAM, not disk. ANN index lookups are random-access: to walk an HNSW graph, Postgres jumps around the index pages following edges. If the index doesn't fit in shared_buffers plus OS page cache, every hop that misses the cache becomes a disk read, and query latency climbs from single-digit milliseconds to hundreds of milliseconds.
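
To see why cache misses dominate, here is a toy model of a graph walk, sketched under loudly stated assumptions: a query touches a fixed number of index pages, and each page is served either from memory or from disk. The per-page latencies (1 µs for memory, 200 µs for disk) and the hop count are made-up illustrative values, not measurements of Postgres or pgvector.

```python
# Toy model: a query visits `hops` index pages; each page is served from
# memory with probability `hit_rate`, otherwise from disk.
# All latency figures here are illustrative assumptions, not measurements.
def expected_query_ms(hops: int, hit_rate: float,
                      mem_us: float = 1.0, disk_us: float = 200.0) -> float:
    """Expected query latency in milliseconds under the toy model."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    per_hop_us = hit_rate * mem_us + (1.0 - hit_rate) * disk_us
    return hops * per_hop_us / 1000.0


if __name__ == "__main__":
    for hit_rate in (1.0, 0.99, 0.9, 0.5, 0.0):
        print(f"hit rate {hit_rate:4.2f}: {expected_query_ms(500, hit_rate):6.2f} ms")
```

Even in this simplified model, a modest drop in cache hit rate multiplies latency, because each disk read costs far more than a memory access.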

The binding constraint for fast vector search is: does the working set of the index fit in memory? Once it doesn't, you have three options:

  1. Buy more RAM — works, but expensive and eventually hits a ceiling.
  2. Partition the table so each child's index is small enough to stay hot (Section 2).
  3. Quantize the vectors so they take less memory in the first place (Sections 3–4).

In practice you combine partitioning and quantization, and this day teaches both.
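
The trade-off between options 2 and 3 can be sketched as a sizing calculation. The helper below is hypothetical and rests on several assumptions: halfvec is taken as 2 bytes per dimension and bit as 1 bit per dimension, the index is assumed to roughly double the footprint (index_overhead=2.0), and queries are assumed to touch one partition at a time so that only one child's index has to stay hot.

```python
import math

# Bytes per dimension for each storage format. 4 bytes for float32 comes from
# the arithmetic above; the halfvec and bit figures are assumptions
# (16-bit floats and 1 bit per dimension respectively).
BYTES_PER_DIM = {"vector": 4.0, "halfvec": 2.0, "bit": 1.0 / 8.0}


def vector_bytes(num_chunks: int, dims: int, fmt: str = "vector") -> float:
    """Raw bytes for the vector values in the given format."""
    return num_chunks * dims * BYTES_PER_DIM[fmt]


def partitions_needed(num_chunks: int, dims: int, ram_budget_bytes: float,
                      fmt: str = "vector", index_overhead: float = 2.0) -> int:
    """Smallest partition count so one partition's vectors plus index fit the budget.

    index_overhead=2.0 is a rough assumption: the index adds about as much
    again as the raw vectors.
    """
    total = vector_bytes(num_chunks, dims, fmt) * index_overhead
    return max(1, math.ceil(total / ram_budget_bytes))


if __name__ == "__main__":
    budget = 64e9  # hypothetical 64 GB of memory available for caching
    for fmt in ("vector", "halfvec", "bit"):
        n = partitions_needed(50_000_000, 1536, budget, fmt)
        print(f"{fmt:8s} -> {n} partition(s)")
```

Under these assumptions, shrinking the bytes per vector directly reduces how finely the table has to be split, which is why the two levers are usually combined.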

A Note On What Slows Down First

It's usually not the disk capacity. The failure modes you actually hit, in rough order:

  • Index build time and memory. Building HNSW over tens of millions of vectors needs a large maintenance_work_mem; if the graph outgrows that memory during the build, construction slows sharply and can take many hours.
  • Cache pressure. As the index grows past RAM, p99 latency climbs even though p50 looks fine.
  • Write amplification. HNSW inserts get more expensive as the graph grows, slowing ingestion.

Partitioning attacks the first two by keeping each index small. Quantization attacks all three by shrinking the bytes.

Key Takeaways
  • A float32 vector costs D × 4 bytes; at D=1536 that is ~6 KB/row, so 50M chunks is ~300 GB of vectors before any index
  • The binding constraint is RAM, not disk: once the ANN index stops fitting in memory, random-access hops become disk reads and latency climbs from milliseconds to hundreds of milliseconds
  • The two scaling levers are partitioning (keep each index small enough to stay hot) and quantization (shrink the bytes per vector)

Course Stats

Estimated time: 50 min
Lessons: 5 sections