
Conversational RAG: Multi-Turn Memory & State

Real users don't ask one isolated question; they ask follow-ups. "What about the enterprise tier?" means nothing on its own. A conversational RAG system has to remember the thread and rewrite each turn into a self-contained query before it retrieves, all while keeping the conversation inside the model's context window.


Why Multi-Turn Breaks Naive RAG

The Q&A bot you built in the beginner course answers one question at a time. Each request is independent: embed the question, retrieve, generate, done. That model quietly assumes every question is self-contained — and the moment a real user has a conversation, that assumption shatters.

The Follow-Up Problem

Watch what happens on the second turn:

User: What does the Pro plan include?
Bot: The Pro plan includes unlimited projects, priority support, and SSO.
User: What about the enterprise tier?

Your naive pipeline takes "What about the enterprise tier?" and embeds that string. But notice what's missing: the follow-up never mentions "include," "plan," "pricing," or "features." Its standalone meaning is "tell me about the enterprise tier, in the same respect we were just discussing." The pipeline has no idea what that respect was.

Pronouns Make It Worse

The most common — and most broken — case is the pronoun:

User: How do I rotate an API key?
Bot: Go to Settings → Security → Rotate.
User: How long is it valid after I do that?

Embed "How long is it valid after I do that?" and the retriever sees nothing that names the topic: "it" and "that" carry the entire meaning, and a vector similarity search cannot recover what they refer to from the sentence alone. You'll retrieve generic chunks about validity or time, or nothing relevant at all, and the model will either guess or refuse.
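To make the failure concrete, here is a minimal sketch that uses keyword overlap as a crude stand-in for retrieval. It is not how an embedding model scores text, but it shows the same effect. The two chunks, the stop-word list, and the rewritten question are all made up for illustration.

```python
import re

# Hypothetical stop-word list: function words and pronouns carry no topic.
STOP_WORDS = {"how", "do", "i", "is", "it", "that", "after",
              "a", "an", "the", "for", "you", "are"}

def content_terms(text: str) -> set[str]:
    """Lowercase the text, split it into words, and drop stop words."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS}

def overlap(query: str, chunk: str) -> int:
    """Count the content words that the query and the chunk share."""
    return len(content_terms(query) & content_terms(chunk))

# Made-up knowledge-base chunks (assumptions, not real product docs).
key_chunk = "After you rotate an API key, the old key remains valid for a grace period."
coupon_chunk = "Coupon codes are valid for a limited time."

follow_up = "How long is it valid after I do that?"
rewritten = "How long is an API key valid after I rotate it?"

for label, query in (("follow-up", follow_up), ("rewritten", rewritten)):
    print(label, overlap(query, key_chunk), overlap(query, coupon_chunk))
```

The raw follow-up shares only "valid" with each chunk, so it cannot tell the API-key chunk from the coupon chunk. The rewritten question also matches "api," "key," and "rotate," which is the signal the retriever needs.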

Why Concatenating History Isn't the Fix

The tempting quick fix is to glue the whole conversation together and embed that:

embed("What does the Pro plan include? ... What about the enterprise tier?")

This is better than nothing, but it degrades fast. The query vector now averages two different topics, so retrieval gets fuzzy. Ten turns in, the "query" is a paragraph spanning five subjects, and similarity search returns a muddle. History helps the meaning, but dumping raw history into the retriever hurts the signal.
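The dilution can be illustrated with a toy model. Assume each turn is about a different topic, represent each topic as a one-hot vector, and treat the concatenated query as the mean of the turn vectors. Real embedding models do not literally average like this, so read it as a sketch of the trend rather than a measurement.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def one_hot(index, dim):
    """Toy 'topic embedding': 1.0 in one dimension, 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(dim)]

def mean_vector(vectors):
    """Element-wise mean, standing in for 'embed the whole history'."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def similarity_to_latest(num_topics):
    """Similarity between the blended history and the latest turn's topic."""
    turns = [one_hot(i, num_topics) for i in range(num_topics)]
    blended = mean_vector(turns)
    return cosine(blended, turns[-1])

for n in (1, 2, 5, 10):
    print(n, round(similarity_to_latest(n), 3))
# 1 1.0
# 2 0.707
# 5 0.447
# 10 0.316
```

In this toy setup the similarity to the topic the user is actually asking about falls as 1/sqrt(n), where n is the number of distinct topics blended into the query.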

Two Distinct Jobs

The fix is to separate two jobs the naive pipeline conflated:

  1. Understand the question in context — resolve "it," "that," and the implied topic using the conversation history.
  2. Retrieve and generate — do this with a clean, self-contained query, exactly like single-turn RAG.
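The two jobs can be sketched as a small pipeline. Everything here is hypothetical scaffolding: `rewrite`, `retrieve`, and `generate` are placeholder callables you would back with your own LLM and vector store, and the prompt wording is only one possible phrasing.

```python
from typing import Callable, Sequence

Turn = tuple[str, str]  # (speaker, text)

def build_rewrite_prompt(history: Sequence[Turn], question: str) -> str:
    """Build the prompt that asks an LLM for a standalone question."""
    transcript = "\n".join(f"{speaker}: {text}" for speaker, text in history)
    return (
        "Rewrite the final user question so it can be understood without "
        "the conversation. Do not answer it.\n\n"
        f"Conversation:\n{transcript}\n\n"
        f"Final question: {question}\n"
        "Standalone question:"
    )

def answer_turn(
    history: Sequence[Turn],
    question: str,
    rewrite: Callable[[str], str],              # assumed: prompt -> LLM text
    retrieve: Callable[[str], list[str]],       # assumed: query -> chunks
    generate: Callable[[str, list[str]], str],  # assumed: query + chunks -> answer
) -> str:
    # Job 1: understand the question in context (history is used only here).
    standalone = rewrite(build_rewrite_prompt(history, question)).strip()
    # Job 2: ordinary single-turn RAG on the clean query.
    chunks = retrieve(standalone)
    return generate(standalone, chunks)

# Stand-in components so the sketch runs without any external service.
seen_queries: list[str] = []

def fake_rewrite(prompt: str) -> str:
    return "What does the Enterprise tier include?"

def fake_retrieve(query: str) -> list[str]:
    seen_queries.append(query)
    return ["<enterprise plan chunk>"]

def fake_generate(query: str, chunks: list[str]) -> str:
    return f"Answer to {query!r} using {len(chunks)} chunk(s)"

history = [
    ("User", "What does the Pro plan include?"),
    ("Bot", "The Pro plan includes unlimited projects, priority support, and SSO."),
]
reply = answer_turn(history, "What about the enterprise tier?",
                    fake_rewrite, fake_retrieve, fake_generate)
print(reply)
```

The design point is that the retriever only ever receives the rewritten question; the raw history never reaches the retrieval vector.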

The rest of this lesson is about doing job #1 well (contextualization and memory) so that job #2 — the RAG you already know — keeps working turn after turn.

Key Takeaways
  • Naive RAG treats every question as self-contained; real conversations are full of follow-ups and pronouns that only mean something in context
  • Embedding a context-free follow-up ("How long is it valid after I do that?") retrieves the wrong chunks because the meaning lives in words that aren't there
  • Concatenating raw history into the query mixes topics and degrades retrieval — context belongs in a separate contextualization step, not the retrieval vector


Course Stats

Estimated time: 50 min
Lessons: 5 sections