Name: Inference & Serving Optimization

The Generation Call Dominates

By now you've optimized retrieval to death — better chunking, reranking, hybrid search. But pull up a trace of a real RAG request (you built that instrumentation on Day 1) and look at where the milliseconds actually go:

Stage	Typical latency
Embed the query	~20–50 ms
Vector search + rerank	~70–190 ms
LLM generation	~800–3000 ms

The generation call is an order of magnitude more expensive than everything else combined — in both latency and dollars. This lesson is about that one call.
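
As a quick sanity check on the "order of magnitude" claim, the arithmetic below takes the low and high end of each range in the table and compares generation to the other two stages combined. The dictionary keys and the helper name are illustrative only; the numbers are the ones from the table.

```python
# Latency ranges in milliseconds, copied from the table: (low, high) per stage.
STAGES_MS = {
    "embed_query": (20, 50),
    "search_and_rerank": (70, 190),
    "llm_generation": (800, 3000),
}


def generation_ratio(stages, bound):
    """Ratio of generation latency to all other stages combined.

    bound is 0 for the low end of each range, 1 for the high end.
    """
    gen = stages["llm_generation"][bound]
    rest = sum(v[bound] for k, v in stages.items() if k != "llm_generation")
    return gen / rest


low_ratio = generation_ratio(STAGES_MS, 0)   # 800 / (20 + 70)
high_ratio = generation_ratio(STAGES_MS, 1)  # 3000 / (50 + 190)
print(f"generation vs. everything else: {low_ratio:.1f}x to {high_ratio:.1f}x")
```

Both ends of the range land close to 10x, which is where the rule of thumb comes from.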

Two Numbers That Aren't the Same

Engineers conflate two very different latency metrics:

Time to first token (TTFT): how long until the user sees anything. It is dominated by the prefill phase (processing the prompt) plus any time the request spends waiting in the queue. This is what makes a UI feel responsive.
Throughput (tokens/sec): how fast tokens stream to one user once generation starts and, in aggregate, how many tokens/sec your server pushes across all concurrent requests. The aggregate figure is what determines your cost per request at scale.

A long RAG prompt (lots of retrieved context) inflates TTFT because prefill has to read every input token. A long answer is bounded by per-token throughput. You optimize them with different levers.
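
To keep the two numbers separate in your own traces, you can time them independently while consuming a token stream. The sketch below is one minimal way to do that. `measure_stream` works on any iterable that yields tokens as they arrive, and `fake_stream` is a hypothetical stand-in for a streaming LLM client (not a real API) that sleeps to imitate prefill and per-token decode time.

```python
import time
from typing import Iterable, Iterator


def measure_stream(tokens: Iterable[str], clock=time.perf_counter) -> dict:
    """Consume a token stream and report TTFT and per-stream decode speed."""
    start = clock()
    first_at = None
    last_at = None
    count = 0
    for _ in tokens:
        last_at = clock()
        if first_at is None:
            first_at = last_at  # arrival time of the first token
        count += 1
    if first_at is None:
        return {"ttft_s": None, "tokens": 0, "decode_tokens_per_s": None}
    decode_time = last_at - first_at
    # Tokens after the first one, divided by the time spent after the first one.
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else None
    return {"ttft_s": first_at - start, "tokens": count, "decode_tokens_per_s": rate}


def fake_stream(prefill_s: float, per_token_s: float, n: int) -> Iterator[str]:
    """Hypothetical stand-in for a streaming client; sleeps instead of calling a model."""
    time.sleep(prefill_s)  # prompt processing before the first token
    for i in range(n):
        if i > 0:
            time.sleep(per_token_s)  # gap between subsequent tokens
        yield f"tok{i}"


stats = measure_stream(fake_stream(prefill_s=0.05, per_token_s=0.005, n=20))
print(stats)
```

Raising `prefill_s` (a longer prompt) moves only `ttft_s`; raising `n` or `per_token_s` (a longer or slower answer) moves only the decode side. That is the sense in which the two metrics have different levers.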

Why Batching Is the Whole Game

A GPU running one request at a time is mostly idle, because token-by-token generation is memory-bandwidth bound, not compute bound: every decode step has to read the model weights from GPU memory to produce a single token. Serving many requests together amortizes that weight-loading cost across all of them. This is why a naive "one request, one forward pass" server may get only around 5% of the throughput a batching server gets on the same hardware. Everything in Section 3 is about batching well.
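
A back-of-envelope model shows why. It assumes each decode step reads all model weights from GPU memory once, regardless of batch size, and produces one token per request in the batch. The model size and bandwidth figures are hypothetical placeholders. The model also ignores KV-cache reads and the point at which large batches become compute bound, so read the result as upper-bound intuition rather than a benchmark.

```python
def decode_tokens_per_s(weight_bytes: float,
                        mem_bandwidth_bytes_per_s: float,
                        batch_size: int) -> float:
    """Upper bound on aggregate decode throughput when weight loading dominates."""
    # Time for one pass over the weights; assumed independent of batch size.
    step_time_s = weight_bytes / mem_bandwidth_bytes_per_s
    # Each step yields one new token for every request in the batch.
    return batch_size / step_time_s


# Hypothetical numbers: a 7B-parameter model with 16-bit weights (~14 GB)
# on a GPU with ~1 TB/s of memory bandwidth.
WEIGHT_BYTES = 7e9 * 2
BANDWIDTH = 1e12

single = decode_tokens_per_s(WEIGHT_BYTES, BANDWIDTH, batch_size=1)
batched = decode_tokens_per_s(WEIGHT_BYTES, BANDWIDTH, batch_size=20)
print(f"batch=1: {single:.0f} tok/s, batch=20: {batched:.0f} tok/s")
print(f"unbatched share of batched throughput: {single / batched:.0%}")
```

In this toy model the unbatched server gets 1/20 of the batched throughput, which is the kind of gap described above. Real servers fall short of linear scaling, but the direction holds.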

The Mental Model

Treat generation as a queue of token-streams sharing a GPU. Your job is to keep the GPU busy (high throughput → low cost) while not making any single user wait too long in the queue (low TTFT). Caching removes work entirely; batching shares it; quantization makes each unit of work cheaper. The rest of the lesson is those three levers.
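
One rough way to see how the three levers compose is a toy cost model. This is a sketch with idealized assumptions, not a measured formula: cache hits cost nothing, batching shares a forward pass perfectly, and quantization gives a uniform speedup. The function name, parameters, and example values are all hypothetical.

```python
def relative_gpu_cost(cache_hit_rate: float, batch_size: int, quant_speedup: float) -> float:
    """Toy GPU cost per request, relative to an unoptimized baseline of 1.0.

    cache_hit_rate: fraction of requests answered from cache (work removed).
    batch_size:     requests sharing each forward pass (work shared).
    quant_speedup:  how much faster each pass runs after quantization (cheaper work).
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    if batch_size < 1 or quant_speedup <= 0:
        raise ValueError("batch_size must be >= 1 and quant_speedup must be > 0")
    return (1.0 - cache_hit_rate) / (batch_size * quant_speedup)


baseline = relative_gpu_cost(cache_hit_rate=0.0, batch_size=1, quant_speedup=1.0)
# Made-up settings: 30% cache hits, batches of 8, 1.5x faster passes.
tuned = relative_gpu_cost(cache_hit_rate=0.3, batch_size=8, quant_speedup=1.5)
print(f"baseline: {baseline:.2f}, tuned: {tuned:.3f}")
```

The model deliberately leaves out queue time: larger batches lower cost per request but can make individual users wait longer, which is the TTFT side of the trade-off.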

Key Takeaways

The LLM generation call dominates RAG latency and cost by ~10x over retrieval — optimize it before micro-tuning search
Time to first token (prefill-bound, hurt by long prompts) and throughput (tokens/sec, set by batching) are distinct metrics with distinct levers
Generation is memory-bandwidth bound, so batching many requests on one GPU is the single biggest throughput win
