Capstone: Scaling & Operating a RAG Platform

The Platform Architecture

Across this track you built the pieces. Beginner: the five-stage RAG pipeline and a Q&A bot. Intermediate: reranking, agentic tool use, conversational memory, and evaluation. Advanced: observability, fine-tuning, and optimized serving. The capstone assembles them into one system you could actually run for a regulated-industry customer.

The Whole System on One Page

A production RAG platform is two planes that share a store.

                       ┌──────────────  CONTROL / OFFLINE  ──────────────┐
  sources ─► ingest ─► chunk ─► embed ─► [vector store + metadata]       │
                                              ▲        ▲                  │
                                              │        │   eval harness ◄─┤ (Int Day 4)
                                              │        │   fine-tune    ◄─┤ (Adv Day 2)
                       └──────────────────────┼────────┼─────────────────┘
                                              │        │
  user ─► gateway ─► contextualize ─► retrieve ─► rerank ─► [agent?] ─► generate ─► answer+cites
            (auth,      (Int Day 3      (top-N)    (top-K     (Int       (served on
            tenancy,     memory)                    Int        Day 2      vLLM/TGI,
            limits)                                 Day 1)     tools)     Adv Day 3)
                                  └──────── traced end to end (Adv Day 1) ────────┘

The request path (bottom) is the hot path — every user query flows through it and it must be fast and reliable. The control plane (top) runs offline: ingestion, fine-tuning, and the evaluation harness that samples real traffic and tells you whether the hot path is still good.
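As a minimal sketch of the hot path described above, the request can be modeled as four stage functions called in order, each wrapped in a timing span. Everything here is illustrative: `Trace`, `answer`, and the stub stages are hypothetical names, not part of any particular library, and a real deployment would use a proper tracing backend instead of an in-memory list.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Trace:
    """In-memory stand-in for a tracing backend (assumption for illustration)."""
    spans: list = field(default_factory=list)  # (stage name, elapsed seconds)

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the stage raises, so failures stay visible.
            self.spans.append((name, time.perf_counter() - start))


def answer(query, history, stages, trace):
    """Run one request through the hot path, one span per stage."""
    with trace.span("contextualize"):
        standalone = stages["contextualize"](query, history)
    with trace.span("retrieve"):
        candidates = stages["retrieve"](standalone)
    with trace.span("rerank"):
        context = stages["rerank"](standalone, candidates)
    with trace.span("generate"):
        return stages["generate"](standalone, context)


# Stub stages so the sketch runs without a model or a vector store.
stub_stages = {
    "contextualize": lambda q, h: " ".join(h + [q]),
    "retrieve": lambda q: ["doc-a", "doc-b", "doc-c"],
    "rerank": lambda q, docs: docs[:2],
    "generate": lambda q, ctx: f"answer to {q!r} citing {ctx}",
}

trace = Trace()
result = answer("what about refunds?", ["tell me about billing"], stub_stages, trace)
stage_names = [name for name, _ in trace.spans]
```

Passing the stages in as plain callables keeps each box swappable, which is the property the control plane relies on when it evaluates or replaces a component offline.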

The Seams That Matter

A platform is defined less by its boxes than by the contracts between them:

Retrieve → Rerank: retrieval over-fetches top-N for recall; rerank narrows to top-K for precision. The seam is the candidate set size N.
Rerank → Generate: the reranker's calibrated score gates the prompt — below threshold, refuse rather than feed weak context.
Memory ↔ Retrieve: conversational state contextualizes the query before retrieval, so follow-ups retrieve the right thing.
Everything → Observability: every stage emits a span. Without this seam you are flying blind in production.
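The first two seams can be sketched as a single function: over-fetch N candidates, rerank, keep the top K, and refuse when nothing clears the score threshold. This is a sketch under stated assumptions: `Scored`, `select_context`, the default values for `n`, `k`, and `min_score`, and the fake retriever and reranker are all hypothetical, and the reranker score is assumed to be calibrated to the range 0 to 1.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Scored:
    doc_id: str
    score: float  # assumed calibrated to [0, 1]


def select_context(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    rerank: Callable[[str, List[str]], List[Scored]],
    n: int = 50,             # candidate set size: favors recall
    k: int = 5,              # prompt budget: favors precision
    min_score: float = 0.5,  # gate: below this, context is too weak
) -> Optional[List[Scored]]:
    """Return up to k passages that clear the gate, or None to signal a refusal."""
    candidates = retrieve(query, n)
    ranked = sorted(rerank(query, candidates), key=lambda s: s.score, reverse=True)
    kept = [s for s in ranked[:k] if s.score >= min_score]
    return kept or None


# Fake components with fixed scores, for illustration only.
FAKE_SCORES = {"doc-0": 0.2, "doc-1": 0.9, "doc-2": 0.7, "doc-3": 0.4}


def fake_retrieve(query: str, n: int) -> List[str]:
    return list(FAKE_SCORES)[:n]


def fake_rerank(query: str, docs: List[str]) -> List[Scored]:
    return [Scored(d, FAKE_SCORES[d]) for d in docs]


strong = select_context("refund policy", fake_retrieve, fake_rerank, n=4, k=3)
weak = select_context("refund policy", fake_retrieve, fake_rerank, n=4, k=3, min_score=0.95)
```

Returning `None` rather than an empty list makes the refusal an explicit branch for the caller, so the generate stage is never invoked with weak context by accident.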

Designing Backwards from Requirements

Senior engineers design the platform backwards from three numbers: peak QPS, latency SLO, and the cost ceiling. Those three constrain every box: how many serving replicas you need, whether you can afford a reranker on every call, and how aggressively you cache. The rest of this lesson walks through those decisions in order.
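A back-of-envelope sketch shows how the three numbers turn into a replica count. It uses Little's law (requests in flight = arrival rate × time in system) and treats the latency SLO as a worst-case time in system. The `Requirements` and `size_serving` names, the headroom factor, and every figure in the example are assumptions chosen for illustration, not measurements.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    peak_qps: float               # requests per second at peak
    latency_slo_s: float          # end-to-end latency target, in seconds
    cost_ceiling_per_hour: float  # serving budget, in currency units per hour


def size_serving(req, per_replica_concurrency, replica_cost_per_hour, headroom=0.25):
    """Estimate replicas needed at peak and whether they fit the budget."""
    # Little's law: L = lambda * W, with W bounded by the latency SLO.
    in_flight = req.peak_qps * req.latency_slo_s
    replicas = math.ceil(in_flight * (1 + headroom) / per_replica_concurrency)
    hourly_cost = replicas * replica_cost_per_hour
    return replicas, hourly_cost, hourly_cost <= req.cost_ceiling_per_hour


# Hypothetical workload: 40 QPS peak, 2 s SLO, budget of 25 per hour.
req = Requirements(peak_qps=40, latency_slo_s=2.0, cost_ceiling_per_hour=25.0)
replicas, hourly_cost, within_budget = size_serving(
    req, per_replica_concurrency=16, replica_cost_per_hour=3.0
)
print(replicas, hourly_cost, within_budget)  # 7 21.0 True
```

If `within_budget` comes back false, the same function makes the trade-offs concrete: cut time in system (for example, skip the reranker on cacheable queries), raise per-replica concurrency, or renegotiate one of the three numbers.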

Key Takeaways

A RAG platform is a hot request path (contextualize → retrieve → rerank → generate) plus an offline control plane (ingest, fine-tune, evaluate) sharing the store
The contracts between stages — over-fetch N, rerank-score gate, memory-before-retrieval — matter more than the boxes themselves
Design backwards from three numbers: peak QPS, latency SLO, and cost ceiling; they constrain every component choice
