Inference & Serving Optimization

The generation call is where RAG spends most of its latency and cost budget. This lesson covers model selection and build-vs-buy, self-hosted serving with vLLM / TGI / Ollama, the cache layers (KV, prompt, and semantic), and quantization and streaming: the levers that cut tail latency and cost without touching retrieval quality.

The Generation Call Dominates

By now you've optimized retrieval to death — better chunking, reranking, hybrid search. But pull up a trace of a real RAG request (you built that instrumentation on Day 1) and look at where the milliseconds actually go:

Stage | Typical latency
Embed the query | ~20–50 ms
Vector search + rerank | ~70–190 ms
LLM generation | ~800–3000 ms

The generation call is an order of magnitude more expensive than everything else combined — in both latency and dollars. This lesson is about that one call.
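As a quick sanity check on the "order of magnitude" claim, the table's own ranges can be plugged into a few lines of arithmetic. This is only a sketch: the dictionary keys and the `generation_share` helper are illustrative names, and the numbers are the typical ranges from the table, not measurements of any particular system.

```python
# Latency ranges in milliseconds, copied from the table above: (low, high).
STAGES_MS = {
    "embed_query": (20, 50),
    "search_and_rerank": (70, 190),
    "llm_generation": (800, 3000),
}


def generation_share(stages: dict, bound: str) -> tuple:
    """Return (generation / total, generation / all other stages) at one end of the ranges."""
    idx = 0 if bound == "low" else 1
    gen = stages["llm_generation"][idx]
    rest = sum(v[idx] for name, v in stages.items() if name != "llm_generation")
    return gen / (gen + rest), gen / rest


for bound in ("low", "high"):
    share, ratio = generation_share(STAGES_MS, bound)
    print(f"{bound}: generation is {share:.0%} of the request, "
          f"{ratio:.1f}x the retrieval stages")
# low: generation is 90% of the request, 8.9x the retrieval stages
# high: generation is 93% of the request, 12.5x the retrieval stages
```

At both ends of the ranges, generation is roughly nine to twelve times the combined retrieval stages. Your own traces are the numbers that matter.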

Two Numbers That Aren't the Same

Engineers conflate two very different latency metrics:

  • Time to first token (TTFT): how long until the user sees anything. Dominated by the prefill phase (processing the prompt) plus queue time. This is what makes a UI feel responsive.
  • Throughput (tokens/sec): covers two related figures. Per-request decode speed is how fast tokens stream once generation starts. Aggregate throughput is how many total tokens/sec your server pushes across all concurrent requests, and it is the figure that determines your cost-per-request at scale.

A long RAG prompt (lots of retrieved context) inflates TTFT because prefill has to read every input token. A long answer is bounded by per-token throughput. You optimize them with different levers.
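One way to keep the two metrics separate is to measure them separately on the token stream itself. The sketch below is a minimal illustration, not any serving library's API: `measure_stream`, `SimClock`, and `fake_stream` are hypothetical helpers, and the prefill and decode rates are made-up round numbers. The fake stream models prefill as a delay proportional to prompt length and decode as a fixed per-token delay, which is a simplification of real servers.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a token stream; report time to first token and decode rate."""
    start = clock()
    first = last = None
    count = 0
    for _ in tokens:
        last = clock()
        if first is None:
            first = last
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_s = last - first
    # Tokens after the first one, divided by the time spent producing them.
    rate = (count - 1) / decode_s if decode_s > 0 else float("nan")
    return {"ttft_s": first - start, "tokens": count, "decode_tokens_per_s": rate}


class SimClock:
    """Simulated clock so the demo runs instantly and deterministically."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


def fake_stream(prompt_tokens: int, output_tokens: int,
                prefill_tok_per_s: float, decode_tok_per_s: float,
                sleep: Callable[[float], None]) -> Iterator[str]:
    """Hypothetical stand-in for an LLM stream: prefill delay, then steady decode."""
    sleep(prompt_tokens / prefill_tok_per_s)   # prefill reads every input token
    for i in range(output_tokens):
        if i > 0:
            sleep(1.0 / decode_tok_per_s)      # one decode step per later token
        yield f"tok{i}"


for prompt_tokens in (2000, 4000):             # e.g. doubling the retrieved context
    clock = SimClock()
    stream = fake_stream(prompt_tokens, 101, 4000.0, 50.0, sleep=clock.sleep)
    stats = measure_stream(stream, clock=clock)
    print(prompt_tokens, f"{stats['ttft_s']:.2f}s",
          f"{stats['decode_tokens_per_s']:.1f} tok/s")
```

Under these assumed rates, doubling the prompt doubles TTFT while the decode rate stays the same. Against a real endpoint, pass the streaming response iterator to `measure_stream` and leave `clock` at its default.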

Why Batching Is the Whole Game

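A GPU serving one request at a time is mostly idle during generation: the decode phase is memory-bandwidth bound, not compute bound, because each decode step has to read the model weights from GPU memory to produce a single token. Serving many requests together amortizes that weight-loading cost across all of them. This is why a naive "one request, one forward pass" server can deliver as little as roughly 5% of the throughput a batching server gets on the same hardware; the exact gap depends on the model, the GPU, and the batch size. Everything in Section 3 is about batching well.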
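The amortization argument can be made concrete with a back-of-the-envelope roofline model. Everything numeric here is an assumption chosen for round figures: a 7B-parameter model in FP16 (14 GB of weights), 2 TB/s of memory bandwidth, about 2 FLOPs per parameter per token, and 300 TFLOP/s of peak compute. The model also ignores KV-cache reads, which grow with batch size and context length, so treat it as a sketch of the shape of the curve rather than a prediction.

```python
def decode_step_seconds(batch_size: int,
                        weight_bytes: float = 14e9,      # assumed: 7B params x 2 bytes
                        mem_bandwidth: float = 2e12,     # assumed: bytes/s
                        flops_per_token: float = 14e9,   # assumed: ~2 FLOPs per param
                        peak_flops: float = 300e12) -> float:
    """Time for one decode step that emits one token per request in the batch."""
    load_s = weight_bytes / mem_bandwidth                 # paid once, shared by the batch
    compute_s = batch_size * flops_per_token / peak_flops # grows with the batch
    return max(load_s, compute_s)                         # whichever resource is the bottleneck


def tokens_per_second(batch_size: int) -> float:
    """Aggregate tokens/sec across all requests in the batch."""
    return batch_size / decode_step_seconds(batch_size)


for batch_size in (1, 4, 20, 150, 300):
    print(f"batch={batch_size:>3}  {tokens_per_second(batch_size):>8.0f} tok/s")
```

With these assumed numbers, a batch of 1 delivers 5% of the aggregate throughput of a batch of 20, because both pay the same weight-loading time per step. Throughput stops scaling around a batch of 150, where the step becomes compute bound instead.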

The Mental Model

Treat generation as a queue of token-streams sharing a GPU. Your job is to keep the GPU busy (high throughput → low cost) while not making any single user wait too long in the queue (low TTFT). Caching removes work entirely; batching shares it; quantization makes each unit of work cheaper. The rest of the lesson is those three levers.
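The three levers can be written as three factors in a toy cost model. This is a rough sketch under stated assumptions, not a capacity-planning formula: it assumes memory-bound decode, ignores prefill and queueing, treats a cache hit as skipping generation entirely, and reuses hypothetical round hardware numbers (7B parameters, 2 TB/s). The function name and parameters are illustrative.

```python
def gpu_seconds_per_request(output_tokens: int,
                            batch_size: int,
                            bytes_per_param: float,
                            cache_hit_rate: float,
                            params: float = 7e9,           # assumed model size
                            mem_bandwidth: float = 2e12    # assumed bytes/s
                            ) -> float:
    """Average GPU time attributed to one request in a memory-bound decode model."""
    step_s = params * bytes_per_param / mem_bandwidth   # quantization: fewer bytes per step
    shared_s = output_tokens * step_s / batch_size      # batching: each step is shared
    return (1.0 - cache_hit_rate) * shared_s            # caching: hits skip the work


baseline = gpu_seconds_per_request(300, batch_size=1, bytes_per_param=2.0, cache_hit_rate=0.0)
batched = gpu_seconds_per_request(300, batch_size=16, bytes_per_param=2.0, cache_hit_rate=0.0)
quantized = gpu_seconds_per_request(300, batch_size=16, bytes_per_param=0.5, cache_hit_rate=0.0)
cached = gpu_seconds_per_request(300, batch_size=16, bytes_per_param=0.5, cache_hit_rate=0.3)

for name, value in [("baseline", baseline), ("+ batching", batched),
                    ("+ 4-bit weights", quantized), ("+ 30% cache hits", cached)]:
    print(f"{name:<18} {value:.4f} GPU-seconds per request")
```

The levers multiply in this model, which is why they are worth stacking. In practice the factors interact, for example larger batches add KV-cache traffic and queueing delay, so measure before trusting the product.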

Key Takeaways
  • The LLM generation call dominates RAG latency and cost by ~10x over retrieval — optimize it before micro-tuning search
  • Time to first token (prefill-bound, hurt by long prompts) and throughput (tokens/sec, set by batching) are distinct metrics with distinct levers
  • Generation is memory-bandwidth bound, so batching many requests on one GPU is the single biggest throughput win
