Scaling Embeddings: Partitioning & Quantization

The Storage Wall

Your Day 3 RAG schema is correct and it is fast — at the scale you tested it. The trouble with embeddings is that they are large, and the cost of storing and indexing them grows linearly with your corpus while your hardware does not.

Doing the Arithmetic

A pgvector vector(D) column stores each value as a 4-byte float32, plus a small per-row header. So one OpenAI text-embedding-3-large vector (D=3072) is:

3072 dims × 4 bytes = 12,288 bytes ≈ 12 KB per row (just the vector)

Multiply that across a corpus:

Chunks	Raw vector data (D=3072)	Raw vector data (D=1536)
100,000	~1.2 GB	~0.6 GB
1,000,000	~12 GB	~6 GB
10,000,000	~120 GB	~60 GB
50,000,000	~600 GB	~300 GB
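
The table can be reproduced with a few lines of Python. This is a minimal sketch that counts only the 4 bytes per dimension from the arithmetic above; the function names are ours, and per-row headers, text, metadata, and the index are deliberately left out.

```python
def raw_vector_bytes(num_chunks: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw vector payload in bytes: one float32 (4 bytes) per dimension per row.

    Ignores per-row headers, text, metadata, and any index.
    """
    return num_chunks * dims * bytes_per_value


def to_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 1e9


if __name__ == "__main__":
    # Same corpus sizes as the table above.
    for chunks in (100_000, 1_000_000, 10_000_000, 50_000_000):
        large = to_gb(raw_vector_bytes(chunks, 3072))
        small = to_gb(raw_vector_bytes(chunks, 1536))
        print(f"{chunks:>12,}  D=3072: {large:7.1f} GB   D=1536: {small:7.1f} GB")
```

Swapping in your own chunk count and embedding dimension gives a quick lower bound on storage before you load a single row.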

And that is before the index. An HNSW index in pgvector stores the graph links alongside the vectors, so it typically adds an amount on the same order of magnitude as the raw vector data again. The heap table also carries your text, metadata JSONB, and tuple overhead on top.

Why "Fits On Disk" Isn't Good Enough

Postgres will happily store 600 GB on disk. The problem is RAM, not disk. ANN index lookups are random-access: to walk an HNSW graph, Postgres jumps around the index pages following edges. If the index doesn't fit in shared_buffers plus the OS page cache, every hop can become a disk read, and query latency degrades from single-digit milliseconds to hundreds of milliseconds.
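
A back-of-the-envelope model shows why the cliff is so steep. The sketch below assumes a query touches a fixed number of index pages and that each touch either hits memory or misses to disk. The page count and both per-access latencies are illustrative assumptions, not measurements from pgvector, and the function name is hypothetical.

```python
def expected_query_ms(pages_touched: int, cache_hit_rate: float,
                      mem_access_us: float = 1.0,
                      disk_read_us: float = 200.0) -> float:
    """Expected query latency in milliseconds under a simple hit/miss model.

    pages_touched : index pages visited while walking the graph (assumed)
    cache_hit_rate: fraction of page visits served from memory, in [0, 1]
    mem_access_us : assumed cost of a cached page visit, in microseconds
    disk_read_us  : assumed cost of a random disk read, in microseconds
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    # Average cost of one page visit, weighted by hit and miss probability.
    per_page_us = (cache_hit_rate * mem_access_us
                   + (1.0 - cache_hit_rate) * disk_read_us)
    return pages_touched * per_page_us / 1000.0


if __name__ == "__main__":
    for hit_rate in (1.0, 0.9, 0.5):
        print(f"hit rate {hit_rate:.0%}: {expected_query_ms(2000, hit_rate):.1f} ms")
```

Under these assumed numbers, a fully cached index answers in about 2 ms, while a 50% hit rate costs about 201 ms. The exact figures will differ on your hardware; the point is that even a modest miss rate dominates the total.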

The binding constraint for fast vector search is: does the working set of the index fit in memory? Once it doesn't, you have three options:

Buy more RAM — works, but expensive and eventually hits a ceiling.
Partition the table so each child's index is small enough to stay hot (Section 2).
Quantize the vectors so they take less memory in the first place (Sections 3–4).

In practice you combine partitioning and quantization, and today's lesson covers both.
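
To see how the two levers interact, here is a rough planning sketch. It assumes the index is about as large as the raw vector data (index_factor=1.0, a placeholder for "the same order of magnitude"), that partitions are equal in size, and that only one partition's index needs to be hot at a time. The function names and the 64 GB budget are hypothetical.

```python
import math


def index_bytes(num_vectors: int, dims: int, bytes_per_value: float = 4.0,
                index_factor: float = 1.0) -> float:
    """Rough index size: raw vector bytes times an assumed multiplier."""
    return num_vectors * dims * bytes_per_value * index_factor


def partitions_needed(total_index_bytes: float, memory_budget_bytes: float) -> int:
    """Smallest number of equal partitions whose individual index fits the budget."""
    if memory_budget_bytes <= 0:
        raise ValueError("memory_budget_bytes must be positive")
    return max(1, math.ceil(total_index_bytes / memory_budget_bytes))


if __name__ == "__main__":
    budget = 64e9  # hypothetical: 64 GB of memory available for index pages
    full = index_bytes(50_000_000, 1536)        # 4 bytes per value (float32)
    half = index_bytes(50_000_000, 1536, 2.0)   # 2 bytes per value
    print("float32 partitions:", partitions_needed(full, budget))
    print("2-byte partitions: ", partitions_needed(half, budget))
```

Halving the bytes per value roughly halves the number of partitions you need for the same memory budget, which is why the two techniques are usually applied together.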

A Note On What Slows Down First

It's usually not the disk capacity. The failure modes you actually hit, in rough order:

Index build time and memory. Building HNSW over tens of millions of vectors needs a large maintenance_work_mem; if the graph outgrows it during the build, the build slows sharply and can take many hours.
Cache pressure. As the index grows past RAM, p99 latency climbs even though p50 looks fine.
Write amplification. HNSW inserts get more expensive as the graph grows, slowing ingestion.

Partitioning attacks the first two by keeping each index small. Quantization attacks all three by shrinking the bytes.

Key Takeaways

A float32 vector costs D × 4 bytes; at D=1536 that is ~6 KB/row, so 50M chunks is ~300 GB of vectors before any index
The binding constraint is RAM, not disk: once the ANN index stops fitting in memory, random-access hops become disk reads and latency jumps from milliseconds to hundreds of milliseconds
The two scaling levers are partitioning (keep each index small enough to stay hot) and quantization (shrink the bytes per vector)
