The generation call is where RAG spends most of its latency and cost budget. This lesson covers model selection and build-vs-buy, self-hosted serving with vLLM / TGI / Ollama, the cache layers (KV, prompt, and semantic), and quantization and streaming: the levers that cut tail latency and cost without touching retrieval quality.
By now you've optimized retrieval to death — better chunking, reranking, hybrid search. But pull up a trace of a real RAG request (you built that instrumentation on Day 1) and look at where the milliseconds actually go:
| Stage | Typical latency |
|---|---|
| Embed the query | ~20–50 ms |
| Vector search + rerank | ~70–190 ms |
| LLM generation | ~800–3000 ms |
The generation call is an order of magnitude more expensive than everything else combined — in both latency and dollars. This lesson is about that one call.
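To make the table concrete, here is a small sketch that computes what fraction of end-to-end latency the generation stage accounts for. The ranges are copied from the table; the stage keys and the `generation_share` helper are illustrative names, and pairing the low ends together and the high ends together is a simplifying assumption.

```python
# Latency ranges in milliseconds, copied from the table above.
STAGES_MS = {
    "embed_query": (20, 50),
    "search_rerank": (70, 190),
    "llm_generation": (800, 3000),
}

def generation_share(stages):
    """Return (low, high) fraction of total latency spent in generation.

    Assumption: the low ends of every range occur together, and likewise
    the high ends. Real traces will mix them.
    """
    shares = []
    for i in (0, 1):
        total = sum(bounds[i] for bounds in stages.values())
        shares.append(stages["llm_generation"][i] / total)
    return tuple(shares)

low, high = generation_share(STAGES_MS)
print(f"generation share: {low:.0%} to {high:.0%}")
# prints "generation share: 90% to 93%"
```

With these numbers, shaving 50 ms off retrieval moves total latency by a few percent at most, while a 20% cut in generation time moves it by roughly 18%.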
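Engineers conflate two very different latency metrics: time to first token (TTFT), which is dominated by prefill over the input prompt, and per-token throughput, which determines how long the rest of the answer takes to generate once the first token has arrived.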
A long RAG prompt (lots of retrieved context) inflates TTFT because prefill has to read every input token. A long answer is bounded by per-token throughput. You optimize them with different levers.
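A rough back-of-the-envelope model shows why the two respond to different levers. In the sketch below, `estimate_latency_ms` and both default rates are hypothetical placeholders, not measurements of any particular model or GPU; the model also ignores queueing time and network overhead, which add to TTFT in practice.

```python
def estimate_latency_ms(prompt_tokens, output_tokens,
                        prefill_tokens_per_s=5000.0,
                        decode_tokens_per_s=50.0):
    """Split one request's latency into TTFT and decode time.

    Assumptions (illustrative only): prefill cost grows linearly with
    prompt length, decode cost grows linearly with answer length, and
    both rates are made-up placeholder values.
    """
    ttft_ms = prompt_tokens / prefill_tokens_per_s * 1000
    decode_ms = output_tokens / decode_tokens_per_s * 1000
    return ttft_ms, decode_ms

# Same 100-token answer, two different amounts of retrieved context.
for prompt_tokens in (500, 4000):
    ttft_ms, decode_ms = estimate_latency_ms(prompt_tokens, output_tokens=100)
    print(f"prompt={prompt_tokens:>4} tokens  "
          f"TTFT~{ttft_ms:.0f} ms  decode~{decode_ms:.0f} ms")
```

Under these assumed rates, growing the prompt from 500 to 4000 tokens raises TTFT from about 100 ms to about 800 ms while decode time stays near 2000 ms. Trimming retrieved context attacks the first number; shortening answers or speeding up decoding attacks the second.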
A GPU running one request at a time is mostly idle: token-by-token generation is memory-bandwidth bound, not compute bound, because each decode step has to stream the model weights from GPU memory to produce a single token. Serving many requests together amortizes that weight-loading cost across all of them. This is why a naive "one request, one forward pass" server gets maybe 5% of the throughput a batching server gets on the same hardware. Everything in Section 3 is about batching well.
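The amortization argument can be sketched with a simplified roofline-style model. Everything here is an assumption for illustration: the function name, a 7B-parameter model in 16-bit weights, 1 TB/s of memory bandwidth, 100 TFLOP/s of compute, and roughly 2 FLOPs per parameter per generated token. It also ignores KV-cache reads and scheduling overhead, so treat the numbers as orders of magnitude, not predictions.

```python
def decode_tokens_per_s(batch_size, params=7e9, bytes_per_param=2,
                        mem_bandwidth=1e12, peak_flops=1e14):
    """Toy estimate of aggregate decode throughput for a batch.

    Each decode step reads all weights once (shared by the whole batch)
    and does about 2 * params FLOPs per sequence. The step takes as long
    as the slower of the two. All hardware numbers are hypothetical.
    """
    weight_read_s = params * bytes_per_param / mem_bandwidth
    compute_s = batch_size * 2 * params / peak_flops
    step_s = max(weight_read_s, compute_s)
    return batch_size / step_s  # one token per sequence per step

for batch_size in (1, 20, 100, 200):
    print(f"batch={batch_size:>3}  ~{decode_tokens_per_s(batch_size):,.0f} tokens/s")
```

In this toy model the step time is fixed by the weight read until the batch is large enough to become compute bound, so throughput scales almost linearly with batch size at first and then flattens. A single request gets about one twentieth of the throughput of a batch of 20, which is the same order as the rough 5% figure above.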
Treat generation as a queue of token-streams sharing a GPU. Your job is to keep the GPU busy (high throughput → low cost) while not making any single user wait too long in the queue (low TTFT). Caching removes work entirely; batching shares it; quantization makes each unit of work cheaper. The rest of the lesson is those three levers.