Capstone: A Production RAG Service

The intermediate capstone. Assemble the work of the previous four days into one production RAG service: the end-to-end request path from conversational memory → retrieve → rerank → generate, wrapped in the reliability, observability, and evaluation discipline that turns a working pipeline into a service you can ship and operate.

The Architecture of a Production RAG Service

For four days you built techniques in isolation: reranking (Day 1), agentic tool use (Day 2), conversational memory (Day 3), and evaluation (Day 4). A production service is what you get when you wire them into one system that takes a user's message and returns a grounded, cited answer — reliably, observably, and within a latency and cost budget.

The Two Halves

Every RAG service splits into an offline half and an online half, exactly as in the beginner course — but each stage is now the production-grade version you spent the week learning.

  • Offline (indexing): ingest documents, chunk them, embed the chunks, and write vectors + text + metadata to the store. Runs on a schedule or via change-data-capture, not per request.
  • Online (serving): the request path that runs every time a user sends a message — and the part that must be fast, reliable, and instrumented.
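
To make the offline half concrete, here is a minimal sketch of an indexing job. Everything in it is an assumption made for illustration: `InMemoryStore` stands in for a real vector store, `toy_embed` is a deterministic placeholder rather than an embedding model, and `chunk` uses fixed-width character windows where a real chunker would split on document structure.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Record:
    vector: list[float]
    text: str
    metadata: dict


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for a real vector store."""
    records: list[Record] = field(default_factory=list)

    def write(self, record: Record) -> None:
        self.records.append(record)


def chunk(text: str, size: int = 40) -> list[str]:
    """Split text into fixed-width character windows (illustrative only)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Deterministic placeholder vector; NOT a real embedding model."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def index_document(doc_id: str, text: str, store: InMemoryStore) -> int:
    """Offline path: chunk, embed, then write vector + text + metadata."""
    pieces = chunk(text)
    for position, piece in enumerate(pieces):
        store.write(Record(
            vector=toy_embed(piece),
            text=piece,
            metadata={"doc_id": doc_id, "chunk": position},
        ))
    return len(pieces)


store = InMemoryStore()
written = index_document("faq-1", "x" * 100, store)  # 100 chars -> 3 chunks
```

A scheduler or a change-data-capture consumer would call something like `index_document` per changed document; nothing here runs on the request path.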

The Component Diagram

                      ┌─────────────── OFFLINE ───────────────┐
   sources ─► ingest ─► chunk ─► embed ─► vector store + metadata
                      └────────────────────────────────────────┘

                      ┌─────────────── ONLINE ────────────────┐
   user msg ─► contextualize (memory, Day 3)
            └─► retrieve top-N (bi-encoder)
                  └─► rerank to top-K (cross-encoder, Day 1)
                        └─► [agentic loop? tool use, Day 2]
                              └─► assemble prompt + cite
                                    └─► generate (LLM)
                                          └─► answer + sources
                      └────────────────────────────────────────┘
                                    │
   evaluation (Day 4) ◄── sample traffic, gold sets, online metrics
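
The evaluation arrow sits outside the online box, which suggests sampling traffic instead of scoring every request inline. One possible sketch of that sidecar pattern follows. The names `should_sample`, `record_for_eval`, `drain`, and `answer_cites_context`, the queue size, and the default rate are all hypothetical, and the scoring function is a toy check, not a real evaluation metric.

```python
import hashlib
import queue

# Bounded buffer between the request path and the evaluator (size is an assumption).
eval_queue: queue.Queue = queue.Queue(maxsize=1000)


def should_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests by hashing the id."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big")  # integer in 0 .. 2**32 - 1
    return bucket < rate * 2**32


def record_for_eval(request_id: str, query: str, context: list[str],
                    answer: str, rate: float = 0.05) -> bool:
    """Called on the request path: enqueue a sample without ever blocking."""
    if not should_sample(request_id, rate):
        return False
    try:
        eval_queue.put_nowait(
            {"id": request_id, "query": query, "context": context, "answer": answer}
        )
    except queue.Full:
        return False  # drop the sample rather than slow the request
    return True


def drain(evaluator) -> list[float]:
    """Called off the request path (worker or batch job): score queued samples."""
    scores = []
    while True:
        try:
            item = eval_queue.get_nowait()
        except queue.Empty:
            return scores
        scores.append(evaluator(item))


def answer_cites_context(item: dict) -> float:
    """Toy score: 1.0 if any context passage appears verbatim in the answer."""
    return 1.0 if any(c in item["answer"] for c in item["context"]) else 0.0


sampled = record_for_eval("req-1", "refund window?",
                          ["Refunds take 14 days."], "Refunds take 14 days.",
                          rate=1.0)
scores = drain(answer_cites_context)
```

Hashing the request id keeps the sampling decision reproducible, so the same request is either always or never sampled at a given rate.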

The Stages as Components

Think of each online stage, along with the evaluator that runs beside the request path, as a replaceable component with a typed contract:

Component              In → Out                                   From
Contextualizer         (message, history) → standalone query      Day 3
Retriever              query → top-N candidates                   Beginner + bi-encoder
Reranker               (query, candidates) → top-K + scores       Day 1
Agent loop (optional)  query → tool calls → observations          Day 2
Generator              (prompt, context) → answer + citations     Beginner
Evaluator              (query, context, answer) → scores          Day 4

Designing the service as components — not one monolithic function — is what lets you swap a reranker, add an agent step, or A/B a new prompt without rewriting the whole path. That decomposition is the single most important architectural decision in this lesson.
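
As a rough illustration of what "typed contracts" could look like in Python, the sketch below gives each component a `typing.Protocol` and wires them into a small `RagService`. The signatures, the `Candidate` type, the `top_n`/`top_k` defaults, and all toy implementations are assumptions made for this example; the optional agent loop is left out to keep it short.

```python
from dataclasses import dataclass, replace
from typing import Protocol


@dataclass(frozen=True)
class Candidate:
    doc_id: str
    text: str
    score: float = 0.0


# One Protocol per component (signatures are assumptions, not a fixed API).
class Contextualizer(Protocol):
    def __call__(self, message: str, history: list[str]) -> str: ...


class Retriever(Protocol):
    def __call__(self, query: str, top_n: int) -> list[Candidate]: ...


class Reranker(Protocol):
    def __call__(self, query: str, candidates: list[Candidate],
                 top_k: int) -> list[Candidate]: ...


class Generator(Protocol):
    def __call__(self, query: str,
                 context: list[Candidate]) -> tuple[str, list[str]]: ...


@dataclass
class RagService:
    contextualize: Contextualizer
    retrieve: Retriever
    rerank: Reranker
    generate: Generator
    top_n: int = 20
    top_k: int = 3

    def answer(self, message: str, history: list[str]) -> tuple[str, list[str]]:
        query = self.contextualize(message, history)
        candidates = self.retrieve(query, self.top_n)
        context = self.rerank(query, candidates, self.top_k)
        return self.generate(query, context)  # (answer text, cited doc ids)


# Toy implementations so the sketch runs end to end.
CORPUS = [
    Candidate("a", "Refunds are issued within 14 days."),
    Candidate("b", "Shipping takes 3 to 5 business days."),
]


def _words(text: str) -> set[str]:
    return set(text.lower().split())


def passthrough_contextualizer(message: str, history: list[str]) -> str:
    return message  # placeholder: ignores history


def keyword_retriever(query: str, top_n: int) -> list[Candidate]:
    return [c for c in CORPUS if _words(query) & _words(c.text)][:top_n]


def overlap_reranker(query: str, candidates: list[Candidate],
                     top_k: int) -> list[Candidate]:
    scored = [Candidate(c.doc_id, c.text, float(len(_words(query) & _words(c.text))))
              for c in candidates]
    return sorted(scored, key=lambda c: c.score, reverse=True)[:top_k]


def template_generator(query: str,
                       context: list[Candidate]) -> tuple[str, list[str]]:
    return " ".join(c.text for c in context), [c.doc_id for c in context]


service = RagService(passthrough_contextualizer, keyword_retriever,
                     overlap_reranker, template_generator)
text, sources = service.answer("how long do refunds take", [])

# Swapping one component leaves the rest of the path untouched.
swapped = replace(service, rerank=lambda query, candidates, top_k: candidates[:top_k])
```

Because `RagService.answer` depends only on the contracts, a new reranker or prompt variant can be tested by building a second service with `replace` and routing a share of traffic to it.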

Key Takeaways
  • A production RAG service splits into an offline indexing half and an online serving half — the techniques from Days 1–4 are the production-grade versions of the online stages
  • Design the request path as independently swappable components with typed contracts, not one monolithic function
  • Evaluation (Day 4) is a sidecar: it samples live traffic rather than sitting on the request's critical path
