Capstone: A Production RAG Service

The Architecture of a Production RAG Service

For four days you built techniques in isolation: reranking (Day 1), agentic tool use (Day 2), conversational memory (Day 3), and evaluation (Day 4). A production service is what you get when you wire them into one system that takes a user's message and returns a grounded, cited answer — reliably, observably, and within a latency and cost budget.

The Two Halves

Every RAG service splits into an offline half and an online half, exactly as in the beginner course — but each stage is now the production-grade version you spent the week learning.

Offline (indexing): ingest documents, chunk them, embed the chunks, and write vectors + text + metadata to the store. Runs on a schedule or via change-data-capture, not per request.
Online (serving): the request path, run every time a user sends a message. This is the half that must be fast, reliable, and instrumented.
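
To make the offline half concrete, here is a minimal sketch of an indexing job. Everything in it is an assumption for illustration: the `Record` shape, the fixed-size `chunk` strategy, the hash-based `toy_embed` (a stand-in for a real embedding model), and the plain dict used as the vector store.

```python
import hashlib
import math
from dataclasses import dataclass


@dataclass
class Record:
    """One stored chunk: vector + text + metadata, as the offline half describes."""
    chunk_id: str
    text: str
    vector: list[float]
    metadata: dict[str, str]


def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap (placeholder strategy)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    if not text:
        return []
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def index_documents(docs: dict[str, str], store: dict[str, Record]) -> int:
    """Ingest -> chunk -> embed -> write. Returns the number of records written."""
    written = 0
    for doc_id, text in docs.items():
        for n, piece in enumerate(chunk(text)):
            chunk_id = f"{doc_id}#{n}"  # deterministic id, so a re-run overwrites in place
            store[chunk_id] = Record(chunk_id, piece, toy_embed(piece), {"doc_id": doc_id})
            written += 1
    return written
```

Because chunk ids are derived from the document id and position, running the job again on a schedule rewrites the same keys instead of duplicating them. That is one simple design choice, not the only one.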

The Component Diagram

                      ┌─────────────── OFFLINE ───────────────┐
   sources ─► ingest ─► chunk ─► embed ─► vector store + metadata
                      └────────────────────────────────────────┘

                      ┌─────────────── ONLINE ────────────────┐
   user msg ─► contextualize (memory, Day 3)
            └─► retrieve top-N (bi-encoder)
                  └─► rerank to top-K (cross-encoder, Day 1)
                        └─► [agentic loop? tool use, Day 2]
                              └─► assemble prompt + cite
                                    └─► generate (LLM)
                                          └─► answer + sources
                      └────────────────────────────────────────┘
                                    │
   evaluation (Day 4) ◄── sample traffic, gold sets, online metrics
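
The evaluation arrow sits below the online box, off the request path. One way to get that separation is sketched here, under stated assumptions: the sample rate, the queue size, and the names `is_sampled`, `record_trace`, and `drain_and_score` are all hypothetical. The request handler only enqueues a sampled trace without blocking, and a separate worker scores it later.

```python
import hashlib
import queue
from typing import Callable

SAMPLE_RATE = 0.1  # assumed: score roughly 10% of traffic
trace_queue: "queue.Queue[dict]" = queue.Queue(maxsize=1000)  # bounded on purpose


def is_sampled(request_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministic sampling: the same request id always gets the same decision."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") < rate * 2**64


def record_trace(request_id: str, query: str, context: list[str], answer: str,
                 rate: float = SAMPLE_RATE) -> bool:
    """Called on the request path. Never blocks; drops the trace if the queue is full."""
    if not is_sampled(request_id, rate):
        return False
    trace = {"request_id": request_id, "query": query, "context": context, "answer": answer}
    try:
        trace_queue.put_nowait(trace)
        return True
    except queue.Full:
        return False  # losing a sample is cheaper than slowing down a user


def drain_and_score(score_fn: Callable[[dict], dict[str, float]],
                    max_items: int = 100) -> list[dict]:
    """Called by a background worker, off the request path."""
    results = []
    for _ in range(max_items):
        try:
            trace = trace_queue.get_nowait()
        except queue.Empty:
            break
        results.append({"request_id": trace["request_id"], **score_fn(trace)})
    return results
```

The request path pays only for a hash and a non-blocking enqueue; whatever `score_fn` costs is paid by the worker.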

The Stages as Components

Think of each online stage as a replaceable component with a typed contract:

Component	In → Out	From
Contextualizer	(message, history) → standalone query	Day 3
Retriever	query → top-N candidates	Beginner + bi-encoder
Reranker	(query, candidates) → top-K + scores	Day 1
Agent loop (optional)	query → tool calls → observations	Day 2
Generator	(prompt, context) → answer + citations	Beginner
Evaluator	(query, context, answer) → scores	Day 4
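
The contracts in the table can be written down as types. The sketch below uses Python's `typing.Protocol`; the class names follow the table, but the field names, parameter names, and the `keyword_reranker` stand-in are assumptions for illustration, and the optional agent loop is left out for brevity.

```python
from dataclasses import dataclass, replace
from typing import Protocol, Sequence


@dataclass(frozen=True)
class Candidate:
    chunk_id: str
    text: str
    score: float = 0.0


@dataclass(frozen=True)
class Answer:
    text: str
    citations: tuple[str, ...]


class Contextualizer(Protocol):
    def __call__(self, message: str, history: Sequence[str]) -> str: ...


class Retriever(Protocol):
    def __call__(self, query: str, top_n: int) -> list[Candidate]: ...


class Reranker(Protocol):
    def __call__(self, query: str, candidates: Sequence[Candidate],
                 top_k: int) -> list[Candidate]: ...


class Generator(Protocol):
    def __call__(self, prompt: str, context: Sequence[Candidate]) -> Answer: ...


class Evaluator(Protocol):
    def __call__(self, query: str, context: Sequence[Candidate],
                 answer: Answer) -> dict[str, float]: ...


def keyword_reranker(query: str, candidates: Sequence[Candidate],
                     top_k: int) -> list[Candidate]:
    """Toy stand-in for a cross-encoder: score by shared words, keep the top_k."""
    terms = set(query.lower().split())
    scored = [replace(c, score=float(len(terms & set(c.text.lower().split()))))
              for c in candidates]
    return sorted(scored, key=lambda c: c.score, reverse=True)[:top_k]


# A static type checker accepts this assignment because the signature matches.
reranker: Reranker = keyword_reranker
```

Any function or object with a matching call signature satisfies the contract, so a real cross-encoder can replace `keyword_reranker` without touching its callers.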

Designing the service as components — not one monolithic function — is what lets you swap a reranker, add an agent step, or A/B a new prompt without rewriting the whole path. That decomposition is the single most important architectural decision in this lesson.
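
A small sketch of what that swap looks like in practice. `RagPipeline` and every stage function here are hypothetical toys (plain strings stand in for candidates, and the "generator" just joins its context); the point is only that changing one stage is a one-line change.

```python
from dataclasses import dataclass, replace
from typing import Callable

Contextualize = Callable[[str, list[str]], str]
Retrieve = Callable[[str], list[str]]
Rerank = Callable[[str, list[str]], list[str]]
Generate = Callable[[str, list[str]], str]


@dataclass(frozen=True)
class RagPipeline:
    """The online path as four injected stages instead of one monolithic function."""
    contextualize: Contextualize
    retrieve: Retrieve
    rerank: Rerank
    generate: Generate

    def answer(self, message: str, history: list[str]) -> str:
        query = self.contextualize(message, history)
        candidates = self.retrieve(query)
        context = self.rerank(query, candidates)
        return self.generate(query, context)


DOCS = ["refunds take 5 days", "shipping is free over $50", "refunds need a receipt"]


def passthrough(message: str, history: list[str]) -> str:
    return message  # no memory: the message is already the query


def retrieve_all(query: str) -> list[str]:
    return list(DOCS)


def first_two(query: str, candidates: list[str]) -> list[str]:
    return candidates[:2]  # baseline: no real reranking


def overlap_rerank(query: str, candidates: list[str]) -> list[str]:
    terms = set(query.lower().split())
    ranked = sorted(candidates, key=lambda c: len(terms & set(c.lower().split())),
                    reverse=True)
    return ranked[:2]


def echo_generate(query: str, context: list[str]) -> str:
    return " | ".join(context)  # stand-in for an LLM call


baseline = RagPipeline(passthrough, retrieve_all, first_two, echo_generate)
variant = replace(baseline, rerank=overlap_rerank)  # swap one stage, keep the rest
```

With these toys, `baseline.answer("refunds receipt", [])` returns the first two documents in storage order, while `variant.answer("refunds receipt", [])` returns the two refund documents with the receipt one first. Serving both pipelines side by side is the shape of an A/B test on a single component.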

Key Takeaways

A production RAG service splits into an offline indexing half and an online serving half — the techniques from Days 1–4 are the production-grade versions of the online stages
Design the request path as independently swappable components with typed contracts, not one monolithic function
Evaluation (Day 4) is a sidecar: it samples live traffic rather than sitting on the request's critical path
