Chunking is the single highest-leverage decision in RAG retrieval quality. This section covers fixed-size vs token-aware chunking, recursive splitting, overlap and the boundary problem, parent-document retrieval, and layout-aware chunking for markdown, HTML, PDF, code, and tables.
Most beginners treat chunking as a configuration value — "use 500, that's what the tutorial said." The Intermediate course position is sharper: chunking is the single biggest decision controlling retrieval quality, and it's mostly invisible until you measure.
Three properties pull against each other: specificity, context, and discriminability.
You cannot maximize all three. The right answer depends on what your users will ask.
| Chunk size (tokens) | Specificity | Context | Discriminability |
|---|---|---|---|
| Tiny (50–150) | High | Low | High |
| Small (200–400) | High | Medium | High |
| Medium (500–800) | Medium | Medium | Medium |
| Large (1000–2000) | Low | High | Low |
| Huge (>2000) | Very low | Very high | Very low |
Chunking happens at index time. To change your chunk size, you re-chunk every document, re-embed every chunk, and re-upsert every vector. For 10M docs at OpenAI's embedding pricing, that's a few hundred dollars and a few hours: not catastrophic, but not a switch you flip lightly.
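The re-chunking cost is simple arithmetic, and it is worth running with your own numbers. The following is a rough sketch, assuming an average of 1,000 tokens per document and a placeholder price of $0.02 per million tokens. Both values are illustrative assumptions, not figures from this course or a quoted price, so check your provider's current pricing. The function name `estimate_reembedding_cost` is hypothetical.

```python
import math


def estimate_reembedding_cost(
    num_docs: int,
    avg_tokens_per_doc: int,
    chunk_size: int,
    overlap: int,
    usd_per_million_tokens: float,
) -> dict:
    """Rough cost of re-chunking and re-embedding a whole corpus.

    Assumes every document has the same (average) length and is cut into
    fixed-size chunks that share `overlap` tokens with their neighbour.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    stride = chunk_size - overlap  # new tokens contributed by each later chunk
    extra = max(0, avg_tokens_per_doc - chunk_size)
    chunks_per_doc = 1 + math.ceil(extra / stride)

    # Overlapping tokens are embedded twice, once in each neighbouring chunk.
    embedded_tokens_per_doc = avg_tokens_per_doc + (chunks_per_doc - 1) * overlap

    total_chunks = num_docs * chunks_per_doc
    total_tokens = num_docs * embedded_tokens_per_doc
    return {
        "chunks": total_chunks,            # vectors to re-upsert
        "embedded_tokens": total_tokens,   # tokens billed by the embedding API
        "usd": total_tokens / 1_000_000 * usd_per_million_tokens,
    }


if __name__ == "__main__":
    # 10M docs, 1,000 tokens each (assumed), 500-token chunks, 50-token overlap,
    # placeholder price of $0.02 per million tokens (assumed).
    print(estimate_reembedding_cost(10_000_000, 1_000, 500, 50, 0.02))
```

Under those assumptions the estimate comes to roughly $220 and 30 million vectors, which is consistent with "a few hundred dollars". Note that overlap raises the bill slightly, because the shared tokens are embedded more than once.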
This is why "measure before you ship" matters. Calibrate on a representative gold set (covered in Day 5 of this course) before committing to a chunk size in production. The cost of re-chunking is much higher than the cost of measuring well the first time.
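One way to make "measure before you ship" concrete is a hit rate over a gold set. The sketch below assumes a gold set of (question, expected document ID) pairs and a hypothetical `retrieve(question, k)` callable that returns the source document IDs of the top-k chunks. Neither format comes from this course, and Day 5 may define them differently.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical types: a gold example pairs a question with the ID of the
# document that should be retrieved for it.
GoldExample = Tuple[str, str]
Retriever = Callable[[str, int], List[str]]


def hit_rate_at_k(gold: Sequence[GoldExample], retrieve: Retriever, k: int = 5) -> float:
    """Fraction of gold questions whose expected document appears in the top-k results."""
    if not gold:
        raise ValueError("gold set is empty")
    hits = sum(
        1
        for question, expected_doc_id in gold
        if expected_doc_id in retrieve(question, k)
    )
    return hits / len(gold)


def compare_chunk_sizes(
    gold: Sequence[GoldExample],
    retrievers_by_chunk_size: dict,
    k: int = 5,
) -> dict:
    """Score one retriever per candidate chunk size on the same gold set."""
    return {
        size: hit_rate_at_k(gold, retrieve, k)
        for size, retrieve in retrievers_by_chunk_size.items()
    }
```

Each retriever here would be backed by an index built with a different chunk size, so the comparison isolates the chunking decision while the gold set and `k` stay fixed.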
Three things make chunking a high-leverage / hard-to-improve decision:
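Some embedding models are trained on short inputs (all-MiniLM-L6-v2 on 256-token max), others on longer ones (text-embedding-3-large on 8191). Chunks much shorter or longer than the model's training distribution embed less well.

If you have no data yet: start with 500 tokens, a 50-token overlap, and a recursive splitter. LangChain's `RecursiveCharacterTextSplitter` can be configured this way; set the values explicitly, because its built-in defaults differ and count characters rather than tokens. This is not optimal for any specific case, but it is a defensible starting point you can measure from.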
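To show what "recursive splitter with overlap" means mechanically, here is a minimal dependency-free sketch. It is not LangChain's implementation. It assumes whitespace-separated words as a stand-in for model tokens (a real pipeline would count with the embedding model's tokenizer), and the names `split_to_pieces`, `pack_with_overlap`, and `chunk_text` are hypothetical.

```python
import re
from typing import List, Sequence

# Coarsest to finest: blank-line paragraphs, single lines, sentence ends.
SEPARATORS = (r"\n\s*\n", r"\n", r"(?<=[.!?])\s+")


def split_to_pieces(text: str, limit: int, separators: Sequence[str] = SEPARATORS) -> List[str]:
    """Recursively split text until every piece has at most `limit` words."""
    words = text.split()
    if not words:
        return []
    if len(words) <= limit:
        return [" ".join(words)]  # small enough: keep this unit whole
    if not separators:
        # No natural boundary left, so fall back to a hard cut every `limit` words.
        return [" ".join(words[i:i + limit]) for i in range(0, len(words), limit)]
    pieces: List[str] = []
    for part in re.split(separators[0], text):
        pieces.extend(split_to_pieces(part, limit, separators[1:]))
    return pieces


def pack_with_overlap(pieces: Sequence[str], max_tokens: int, overlap: int) -> List[str]:
    """Greedily merge adjacent pieces; each new chunk starts with the previous chunk's tail."""
    chunks: List[str] = []
    current: List[str] = []
    for piece in pieces:
        words = piece.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


def chunk_text(text: str, max_tokens: int = 500, overlap: int = 50) -> List[str]:
    """Recursive split plus overlap. Whitespace is normalized to single spaces."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    # Pieces are capped at max_tokens - overlap so a chunk never exceeds
    # max_tokens once the carried-over tail is prepended.
    pieces = split_to_pieces(text, max_tokens - overlap)
    return pack_with_overlap(pieces, max_tokens, overlap)
```

The design choice worth noticing is the order of operations: the text is split only as finely as needed to fit the limit, preferring paragraph boundaries over line and sentence boundaries, and the overlap is added afterwards when pieces are packed into chunks.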