System Design Overview

This is the capstone. Days 1-4 gave you the building blocks: the RAG pipeline, prompt discipline, retrieval chains, and vector store integration. Day 5 puts them together into a complete production Q&A bot, the kind of system that lives in a real product, serves real users, and survives real outages.

**The bot has four phases:**

- **Ingestion (offline)** turns source documents into searchable vectors. Runs on a schedule or whenever sources change. Slow operations belong here: chunking decisions, embedding API calls, batched upserts.
- **Query (online)** turns a user's question into context-grounded retrieval. The latency-sensitive path. Cache aggressively, time out fast, fall back gracefully.
- **Generation (online)** turns retrieved chunks into a final answer. Streams to the user. Manages citations, refusal, length.
- **Operation (continuous)** measures, monitors, and improves. Eval against a gold set. Watch cost per query. Detect drift. Roll out changes carefully.

**The component diagram of a production Q&A bot:**

```
sources ──→ parser ──→ chunker ──→ embedder ──→ vector store
                                                    ↑
  (ingestion runs offline, on a schedule or trigger)
                                                    │
                                                    │
user query ─→ classifier ─→ rewrite ─→ search ──────┘
                  │                      │
                  └─→ direct (cheap)     └─→ rerank
                                               │
                                               ↓
                                   generator with citations ──→ user
                                               │
                                               ↓
                                   observability (logs, costs, eval)
```

**Decision matrix for which Day 1-4 features apply:**

- **Day 1 (RAG anatomy)**: every system starts here. Single-pass is the baseline; layer the rest on top.
- **Day 2 (prompts)**: every system needs version-controlled prompts and structured outputs. Non-negotiable.
- **Day 3 (chains)**: add rewriting when you see conversational queries fail; add HyDE when the corpus is heavily declarative; add multi-step only for genuine multi-hop needs.
- **Day 4 (vector store)**: pick a store based on team familiarity first, performance second. Thin internal interface. Tenant-isolated filter that's impossible to skip.

**The capstone discipline:** start with the smallest thing that works. Single-pass RAG, simple prompt, pgvector or whatever's nearest, in-process embedding cache. Get it serving real users. *Then* measure where it fails, and add the chain step or operational hygiene that fixes the failure you actually have. Adding all of Day 3's chain steps at once costs you debugging surface, latency, and money for problems you may not have. A working simple system always beats a half-built sophisticated one.
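
The query-path advice (cache aggressively, time out fast, fall back gracefully) and the in-process embedding cache can be sketched in a few small helpers. This is a minimal illustration, not the course's implementation: `withEmbeddingCache`, `withTimeout`, and `withFallback` are hypothetical names, and the cache evicts in simple first-in-first-out order rather than true LRU.

```javascript
// Hypothetical helpers; none of these names come from the course material.

// Wrap an embedding function with an in-process cache keyed by the exact text.
// Eviction is FIFO: when the cache is full, the oldest inserted entry is dropped.
function withEmbeddingCache(embedFn, maxEntries = 1000) {
  const cache = new Map();
  return async function cachedEmbed(text) {
    if (cache.has(text)) return cache.get(text);
    const vec = await embedFn(text);
    // Map preserves insertion order, so the first key is the oldest entry.
    if (cache.size >= maxEntries) cache.delete(cache.keys().next().value);
    cache.set(text, vec);
    return vec;
  };
}

// Reject if the promise has not settled within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Run the primary path under a deadline; on timeout or error, return the fallback.
async function withFallback(primary, ms, fallbackValue) {
  try {
    return await withTimeout(primary(), ms);
  } catch (err) {
    return fallbackValue;
  }
}
```

One way to use these, assuming an `embed` function like the one in the example that follows: build `const cachedEmbed = withEmbeddingCache(embed)` once at startup, then wrap the retrieval call as `withFallback(() => ctx.store.search(vec, 5, filter), 2000, [])` so a slow store yields an empty context (and therefore a refusal) instead of a hung request. The 2000 ms budget is an arbitrary placeholder.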

```javascript
// The minimum viable Q&A bot: Days 1-4 collapsed
async function answerQuestion(query, ctx) {
  // 1. Tenant-isolated retrieval (Day 4)
  const queryVec = await embed(query);
  const chunks = await ctx.store.search(queryVec, 5, {
    tenant_id: ctx.tenantId, // never bypassed
  });

  // 2. Build grounded prompt (Day 2)
  const context = chunks.map((c, i) =>
    `[${i + 1}] ${c.text}`
  ).join("\n\n");

  // 3. Generate with citations (Day 2 + 5)
  const answer = await ctx.llm.generate({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content:
        "Answer using ONLY the provided context. " +
        "Cite sources as [N]. If context doesn't contain the answer, say " +
        '"I don\'t have enough information to answer that."'
      },
      { role: "user", content:
        `Context:\n${context}\n\nQuestion: ${query}`
      },
    ],
  });

  return { answer, sources: chunks.map(c => c.id) };
}

// This is ~30 lines and works. Day 3's chain steps come later
// when you measure a problem they'd solve.
```
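
In the function above, tenant isolation depends on every caller remembering to pass `tenant_id`. One possible way to get the "impossible to skip" filter behind a thin internal interface is to bind the tenant once and hand out only the scoped object. The sketch below is an assumption, not the course's code: `createTenantStore` is a hypothetical name, and only the `store.search(vector, k, filter)` call shape mirrors the example above.

```javascript
// Hypothetical wrapper: binds a tenant ID to a raw store so that every
// search carries the tenant filter, whatever the caller passes in.
function createTenantStore(store, tenantId) {
  if (typeof tenantId !== "string" || tenantId.length === 0) {
    throw new Error("tenantId is required");
  }
  return Object.freeze({
    // Extra filters are allowed, but tenant_id is spread last so a
    // caller-supplied tenant_id can never override the bound one.
    search(vector, k, extraFilter = {}) {
      return store.search(vector, k, { ...extraFilter, tenant_id: tenantId });
    },
  });
}
```

With this in place, request handling code would receive `createTenantStore(rawStore, tenantId)` as `ctx.store` and never see the raw store, so a forgotten filter becomes a missing capability rather than a silent cross-tenant leak.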

Key Takeaways

- Four phases: ingestion (offline), query (online), generation (online), operation (continuous)
- Start with single-pass + clean prompt; layer chain steps when failures justify them
- Tenant isolation, stable IDs, and observability are non-negotiable from day one
- A 30-line minimum viable Q&A bot beats a half-built sophisticated one
- Measure failures in production before adding sophistication
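
To make the last takeaway concrete, measuring can start with something as small as a retrieval hit rate over a gold set. The harness below is a rough sketch under stated assumptions: `evaluateGoldSet` and the gold-set item shape (`question`, `expectedSources`) are hypothetical, and `answerFn` is assumed to resolve to `{ answer, sources }` like `answerQuestion`. It checks only whether an expected source was retrieved, not whether the generated answer is correct.

```javascript
// Hypothetical gold-set harness. Each gold item is assumed to look like
// { question: string, expectedSources: string[] }.
async function evaluateGoldSet(goldSet, answerFn) {
  const failures = [];
  let hits = 0;
  for (const item of goldSet) {
    const { sources } = await answerFn(item.question);
    // A hit means at least one expected source ID was among those returned.
    const hit = item.expectedSources.some((id) => sources.includes(id));
    if (hit) hits += 1;
    else failures.push(item.question);
  }
  // Report 0 for an empty gold set rather than dividing by zero.
  const hitRate = goldSet.length === 0 ? 0 : hits / goldSet.length;
  return { hitRate, failures };
}
```

The `failures` list is the useful part: reading the questions that missed tells you whether the next step is query rewriting, a chunking change, or nothing at all.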
