Name: Conversational RAG: Multi-Turn Memory & State

Why Multi-Turn Breaks Naive RAG

The Q&A bot you built in the beginner course answers one question at a time. Each request is independent: embed the question, retrieve, generate, done. That model quietly assumes every question is self-contained — and the moment a real user has a conversation, that assumption shatters.

The Follow-Up Problem

Watch what happens on the second turn:

User: What does the Pro plan include?
Bot: The Pro plan includes unlimited projects, priority support, and SSO.
User: What about the enterprise tier?

Your naive pipeline takes "What about the enterprise tier?" and embeds that string. But notice what's missing: the follow-up contains none of the words "include," "plan," "pricing," or "features." Its standalone meaning is "tell me about the enterprise tier, in the same respect we were just discussing." The pipeline has no idea what that respect was.

Pronouns Make It Worse

The most common — and most broken — case is the pronoun:

User: How do I rotate an API key?
Bot: Go to Settings → Security → Rotate.
User: How long is it valid after I do that?

Embed "How long is it valid after I do that?" and the retriever sees no key terms at all: "it" and "that" carry the entire meaning, and an embedding of this sentence alone encodes nothing about API keys or rotation. You'll retrieve generic chunks about validity or time, or nothing relevant at all, and the model will either guess or refuse.
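To make the gap concrete, here is a rough sketch that strips stop words from both user turns and compares what is left. It is only a lexical proxy, not how an embedding model works, and the stop-word list and the `content_terms` helper are made up for this illustration.

```python
import re

# Hypothetical, deliberately tiny stop-word list for this illustration only.
STOP_WORDS = {
    "how", "do", "i", "is", "it", "that", "after",
    "an", "a", "the", "to", "what", "about",
}

def content_terms(text: str) -> set[str]:
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}

first_turn = "How do I rotate an API key?"
follow_up = "How long is it valid after I do that?"

first_terms = content_terms(first_turn)    # the topic words of turn one
follow_terms = content_terms(follow_up)    # what the retriever gets on turn two
shared = first_terms & follow_terms        # topic words carried into the follow-up

print(sorted(first_terms))
print(sorted(follow_terms))
print(sorted(shared))
```

The follow-up keeps "long" and "valid" but none of the words that named the topic, so nothing in the string points the retriever at API keys.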

Why Concatenating History Isn't the Fix

The tempting quick fix is to glue the whole conversation together and embed that:

embed("What does the Pro plan include? ... What about the enterprise tier?")

This is better than nothing, but it degrades fast. The query vector now averages two different topics, so retrieval gets fuzzy. Ten turns in, the "query" is a paragraph spanning five subjects, and similarity search returns a muddle. History helps the meaning, but dumping raw history into the retriever hurts the signal.
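The averaging effect can be sketched with toy numbers. The example below assumes hand-made two-dimensional "topic" vectors and plain mean pooling, which is a simplification of what a real embedding model does; the names `pro_topic`, `enterprise_topic`, and `mean_vector` are invented for the illustration.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean, standing in for 'embed the concatenated text'."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Toy assumption: axis 0 means "Pro plan", axis 1 means "enterprise tier".
pro_topic = [1.0, 0.0]
enterprise_topic = [0.0, 1.0]

# A clean, self-contained query about the enterprise tier.
focused_query = enterprise_topic
# The glued-together conversation, modeled as the average of both topics.
concatenated_query = mean_vector([pro_topic, enterprise_topic])

focused_score = cosine(focused_query, enterprise_topic)
mixed_enterprise = cosine(concatenated_query, enterprise_topic)
mixed_pro = cosine(concatenated_query, pro_topic)

print(round(focused_score, 3), round(mixed_enterprise, 3), round(mixed_pro, 3))
```

Under these assumptions the concatenated query is less similar to the enterprise chunk than the focused query is, and it scores the Pro chunk exactly as high, so the ranking can no longer tell which topic the user is asking about.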

Two Distinct Jobs

The fix is to separate two jobs the naive pipeline conflated:

1. Understand the question in context: resolve "it," "that," and the implied topic using the conversation history.
2. Retrieve and generate: do this with a clean, self-contained query, exactly like single-turn RAG.

The rest of this lesson is about doing job #1 well (contextualization and memory) so that job #2 — the RAG you already know — keeps working turn after turn.
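As a minimal sketch of that separation, the two jobs can be written as two functions. Everything here is hypothetical: `contextualize` uses a hard-coded lookup where a real system would typically call a language model with the history, `retrieve` uses simple word overlap instead of vector search, and the two chunks are invented sample text.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase alphabetic tokens of a string."""
    return set(re.findall(r"[a-z]+", text.lower()))

def contextualize(history: list[str], question: str) -> str:
    """Job 1 (stub): turn a follow-up into a standalone question.

    A real implementation would use `history`; this placeholder only
    knows one canned rewrite so the example stays self-contained.
    """
    canned = {
        "How long is it valid after I do that?":
            "How long is an API key valid after it is rotated?",
    }
    return canned.get(question, question)

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Job 2 (toy): rank chunks by word overlap with the standalone query."""
    query_terms = tokens(query)
    ranked = sorted(chunks, key=lambda c: len(query_terms & tokens(c)), reverse=True)
    return ranked[:k]

# Invented sample chunks, not real documentation.
chunks = [
    "Invoices are valid for tax purposes once issued.",
    "Rotated API keys: the old key stays valid for a grace period after rotation.",
]
history = [
    "User: How do I rotate an API key?",
    "Bot: Go to Settings → Security → Rotate.",
]

standalone = contextualize(history, "How long is it valid after I do that?")
top_chunks = retrieve(standalone, chunks)
print(standalone)
print(top_chunks[0])
```

The retriever never sees the history. It only receives the standalone question, which is why the single-turn retrieval code can stay unchanged.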

Key Takeaways

Naive RAG treats every question as self-contained; real conversations are full of follow-ups and pronouns that only mean something in context
Embedding a context-free follow-up ('how long is it valid?') retrieves the wrong chunks because the meaning lives in words that aren't there
Concatenating raw history into the query mixes topics and degrades retrieval — context belongs in a separate contextualization step, not the retrieval vector
