Your RAG schema works beautifully at ten thousand chunks. At fifty million, the embedding table no longer fits in RAM and your HNSW index won't build. This day is about the two levers that keep Postgres + pgvector fast and affordable at scale: partitioning the table so each index stays manageable, and quantization — storing vectors as halfvec, bit, or scalar-compressed codes to shrink memory and speed up scans, then reranking with full precision to claw recall back.
Your Day 3 RAG schema is correct and it is fast — at the scale you tested it. The trouble with embeddings is that they are large, and the cost of storing and indexing them grows linearly with your corpus while your hardware does not.
A pgvector vector(D) column stores each dimension as a 4-byte float32, plus a small fixed header (8 bytes) per vector. So one OpenAI text-embedding-3-large vector (D=3072) is:
3072 dims × 4 bytes = 12,288 bytes ≈ 12 KB per row (just the vector)
Multiply that across a corpus:
| Chunks | Raw vector data (D=3072) | Raw vector data (D=1536) |
|---|---|---|
| 100,000 | ~1.2 GB | ~0.6 GB |
| 1,000,000 | ~12 GB | ~6 GB |
| 10,000,000 | ~120 GB | ~60 GB |
| 50,000,000 | ~600 GB | ~300 GB |
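The table can be reproduced with a few lines of arithmetic. This is a minimal sketch, not a sizing tool: the function names are illustrative, it counts only the 4 bytes per dimension described above, it ignores per-row headers, and it uses decimal gigabytes (10^9 bytes).

```python
BYTES_PER_FLOAT32 = 4  # pgvector's vector type stores each dimension as float32


def raw_vector_bytes(chunks: int, dims: int) -> int:
    """Raw float32 payload for `chunks` embeddings of `dims` dimensions."""
    return chunks * dims * BYTES_PER_FLOAT32


def to_gb(n_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return n_bytes / 1e9


if __name__ == "__main__":
    for chunks in (100_000, 1_000_000, 10_000_000, 50_000_000):
        large = to_gb(raw_vector_bytes(chunks, 3072))
        small = to_gb(raw_vector_bytes(chunks, 1536))
        print(f"{chunks:>11,} chunks: D=3072 {large:8.1f} GB | D=1536 {small:8.1f} GB")
```

The exact figures come out slightly above the rounded values in the table (for example about 614 GB rather than ~600 GB at fifty million chunks), which does not change the conclusion.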
And that is before the index. An HNSW index in pgvector stores the graph in addition to the vectors, typically adding roughly the same order of magnitude again. The heap table also carries your text, metadata JSONB, and tuple overhead on top.
Postgres will happily store 600 GB on a disk. The problem is RAM, not disk. ANN index lookups are random-access: to walk an HNSW graph, Postgres jumps around the index pages following edges. If the index doesn't fit in shared_buffers plus OS page cache, every hop becomes a disk read, and query latency climbs from single-digit milliseconds to hundreds of milliseconds.
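As a rough illustration of that memory check, the sketch below compares an index size against the memory available for caching it. The function name and the example machine (16 GiB of shared_buffers, 40 GiB of OS page cache) are assumptions made for illustration. The check is also optimistic: the same page can sit in both shared_buffers and the page cache, and the heap competes for the same memory.

```python
GIB = 1024 ** 3


def fits_in_memory(index_bytes: int, shared_buffers_bytes: int, page_cache_bytes: int) -> bool:
    """Optimistic check: can the whole index be cached in RAM?

    Treats shared_buffers and the OS page cache as one pool, which
    overstates capacity when pages are cached in both places.
    """
    return index_bytes <= shared_buffers_bytes + page_cache_bytes


if __name__ == "__main__":
    # Hypothetical machine: 16 GiB shared_buffers, 40 GiB left for page cache.
    shared_buffers = 16 * GIB
    page_cache = 40 * GIB

    for label, size in (("~12 GB index", 12 * 10**9), ("~120 GB index", 120 * 10**9)):
        verdict = "fits" if fits_in_memory(size, shared_buffers, page_cache) else "does not fit"
        print(f"{label}: {verdict}")
```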
The binding constraint for fast vector search is: does the working set of the index fit in memory? Once it doesn't, you have three options:
In practice you combine partitioning and quantization, and this day teaches both.
It's usually not the disk capacity. The failure modes you actually hit, in rough order:
maintenance_work_mem; if the build spills, it can take many hours.

Partitioning attacks the first two by keeping each index small. Quantization attacks all three by shrinking the bytes.
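To make "shrinking the bytes" concrete, the sketch below compares per-vector storage for the pgvector types mentioned at the start of this day. The formulas (4, 2, and 1/8 bytes per dimension, each plus an 8-byte header) follow the pgvector documentation as best I recall it, so treat them as an assumption to check against your installed version. The function name is illustrative.

```python
import math


def bytes_per_vector(dims: int, kind: str) -> int:
    """Approximate storage for one value of a pgvector type.

    Assumed formulas (from the pgvector docs, to be verified):
      vector  -> 4 bytes per dimension + 8
      halfvec -> 2 bytes per dimension + 8
      bit     -> 1 bit per dimension (rounded up to whole bytes) + 8
    """
    if kind == "vector":
        return 4 * dims + 8
    if kind == "halfvec":
        return 2 * dims + 8
    if kind == "bit":
        return math.ceil(dims / 8) + 8
    raise ValueError(f"unknown type: {kind!r}")


if __name__ == "__main__":
    dims = 3072
    baseline = bytes_per_vector(dims, "vector")
    for kind in ("vector", "halfvec", "bit"):
        size = bytes_per_vector(dims, kind)
        print(f"{kind:>8}: {size:>6} bytes per row ({baseline / size:.1f}x smaller than vector)")
```

At D=3072 this gives roughly a 2x reduction for halfvec and about 31x for bit, which is why the coarser encodings are paired with a full-precision rerank to recover recall.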