Retrieval Chains

Day 1's pipeline did one round of retrieval: embed the query, search the vector DB, hand the top-k chunks to the LLM. That gets you 70-80% of what production RAG needs. The remaining 20-30% is where teams get stuck, and the fix is almost always **more steps**, not bigger embeddings.

**Where single-pass RAG breaks:**

**The user's query is a bad search query.** Real users type things like "fix this", "when?", or "thoughts on the new version?" These are perfectly natural conversational utterances and almost completely useless as embedding inputs. There's little semantic content for the embedding model to latch onto, so retrieval returns whatever chunks happen to sit nearest a vague query vector, and those are often unrelated.

**The query requires multi-hop reasoning.** "Which of our customers in healthcare are using the new API?" needs you to find healthcare customers, then check which use the new API, then intersect. One round of retrieval can't do that intersection; you need separate searches and a combine step.

**The query language doesn't match the document language.** Users ask "how do I get my money back" but the docs say "refund processing procedures." Single-pass embedding search handles this badly: questions and statements tend to land in different regions of the embedding space. Even good embeddings have a "question-statement gap" that one round of search struggles to bridge.

**The answer requires aggregating across documents.** "Summarize all the bugs reported in Q3" needs you to retrieve everything tagged Q3, not just the top-5 most similar. The top-k pattern can't do that: it's structurally a "find the most relevant" tool, not a "find all matching" tool.

**The shape of the fix:** A retrieval chain adds steps **before**, **around**, or **after** the search. Before: rewrite the query. Around: search several times from different angles and combine the results. After: examine the retrieved chunks and decide whether to search again. Each step is independent and small; the chain is a pipeline of these steps.
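The "after" step is the least obvious of the three, so here is a minimal sketch of what it might look like. Everything in it is an assumption for illustration: `retrieveWithRetry` is a hypothetical name, `embed`, `vectorDB.search`, and `llm` are passed in as plain functions with the same shapes this lesson uses, chunks are assumed to carry `id` and `text` fields, and the `SUFFICIENT` sentinel is just one possible convention for the LLM's verdict.

```javascript
// Sketch of an "after" step: retrieve, ask the LLM whether the context is
// enough, and search again with a new query if it is not.
// Assumed shapes (hypothetical, not a real library API):
//   embed(text) -> vector, vectorDB.search(vector, k) -> [{ id, text }],
//   llm(prompt) -> string
async function retrieveWithRetry(query, { embed, vectorDB, llm, maxRounds = 3, k = 5 }) {
  let searchQuery = query;
  const seen = new Map(); // chunk id -> chunk, deduped across rounds

  for (let round = 1; round <= maxRounds; round++) {
    const results = await vectorDB.search(await embed(searchQuery), k);
    for (const chunk of results) seen.set(chunk.id, chunk);

    const context = [...seen.values()].map(c => c.text).join("\n\n");
    const verdict = (await llm(
      `Context:\n${context}\n\nQuestion: ${query}\n` +
      `If the context is enough to answer, reply exactly SUFFICIENT. ` +
      `Otherwise reply with one better search query.`
    )).trim();

    if (verdict === "SUFFICIENT") break; // stop early, context looks complete
    searchQuery = verdict;               // otherwise search from a new angle
  }
  return [...seen.values()];
}
```

The `maxRounds` cap matters: without it, a model that never says `SUFFICIENT` would keep searching, and every extra round costs one embed, one search, and one LLM call.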
The principle: when you're tempted to "make retrieval better," try adding a step instead. A 200ms LLM call that rewrites the query usually beats spending a week tuning embeddings.
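Applied to the multi-hop example ("Which of our customers in healthcare are using the new API?"), adding steps might mean one search per sub-question followed by an intersection. The sketch below is illustrative only: `multiHopIntersect` is a hypothetical helper, it assumes each chunk carries a `customerId` metadata field, and it takes `embed` and `vectorDB.search` as injected functions. Because each search is still top-k, a customer ranked below `k` is silently missed, so this is a starting point and not an exhaustive query.

```javascript
// Sketch of a multi-hop "separate searches, then combine" step.
// Assumed shapes (hypothetical): embed(text) -> vector,
// vectorDB.search(vector, k) -> [{ customerId, text }]
async function multiHopIntersect(subQueries, { embed, vectorDB, k = 50 }) {
  if (subQueries.length === 0) return [];

  const idSets = [];
  for (const subQuery of subQueries) {
    const results = await vectorDB.search(await embed(subQuery), k);
    idSets.push(new Set(results.map(chunk => chunk.customerId)));
  }

  // Keep only the customers that appear in every sub-query's results.
  const [first, ...rest] = idSets;
  return [...first].filter(id => rest.every(set => set.has(id)));
}

// Example call (the sub-queries could also come from an LLM decomposition step):
// const ids = await multiHopIntersect(
//   ["customers in the healthcare industry", "customers using the new API"],
//   { embed, vectorDB }
// );
```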

// Single-pass RAG (Day 1)
// `embed` is assumed to be defined elsewhere and to return an embedding vector.
async function singlePassRAG(query, vectorDB, llm) {
  const queryVec = await embed(query);
  const chunks = await vectorDB.search(queryVec, 5); // top-5 nearest chunks
  const context = chunks.map(c => c.text).join("\n\n");
  return llm(`Context: ${context}\nQuestion: ${query}`);
}

// Same problem with a chain (Day 3)
// `rrfMerge` is assumed to be defined elsewhere (merge ranked lists, keep top N).
async function chainedRAG(query, vectorDB, llm) {
  // Step 1: rewrite query into search-friendly form
  const searchQuery = await llm(`Rewrite this as a search query: ${query}`);

  // Step 2: also generate a hypothetical answer for HyDE
  // (Hypothetical Document Embeddings: search with an answer-shaped text)
  const hypoAnswer = await llm(`Write a 1-sentence answer to: ${query}`);

  // Step 3: search with BOTH the rewritten query and the hypothetical answer
  const r1 = await vectorDB.search(await embed(searchQuery), 5);
  const r2 = await vectorDB.search(await embed(hypoAnswer), 5);

  // Step 4: combine results, dedupe, keep the top 5
  const combined = rrfMerge([r1, r2], 5);

  // Step 5: generate final answer with combined context
  const context = combined.map(c => c.text).join("\n\n");
  return llm(`Context: ${context}\nQuestion: ${query}`);
}

// Cost: 3 LLM calls + 2 embeds + 2 searches, instead of 1 + 1 + 1 for single-pass
// Quality: typically 15-30% better recall on hard queries
// Worth it when hard queries are more than a small fraction of your traffic
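The `rrfMerge` call in Step 4 is left undefined in the example. One plausible reading is reciprocal rank fusion (RRF), where each item scores `1 / (c + rank)` in every list it appears in and the scores are summed. The sketch below is an assumption about what that helper could look like, not the course's own implementation: it assumes every result has a unique `id` field, and it uses `c = 60`, a commonly used default.

```javascript
// Sketch of reciprocal rank fusion over several ranked result lists.
// Assumption: each result object has a unique `id`; adapt to your vector DB.
function rrfMerge(resultLists, topK = 5, rankConstant = 60) {
  const scores = new Map(); // id -> { item, score }

  for (const results of resultLists) {
    results.forEach((item, index) => {
      const rank = index + 1; // ranks are 1-based
      const entry = scores.get(item.id) ?? { item, score: 0 };
      entry.score += 1 / (rankConstant + rank);
      scores.set(item.id, entry); // same id in two lists is merged, not duplicated
    });
  }

  return [...scores.values()]
    .sort((a, b) => b.score - a.score) // highest fused score first
    .slice(0, topK)
    .map(entry => entry.item);
}

// Usage: "b" appears in both lists, so it outranks "a" even though "a" was first in listA.
const listA = [{ id: "a" }, { id: "b" }, { id: "c" }];
const listB = [{ id: "b" }, { id: "d" }];
console.log(rrfMerge([listA, listB], 3).map(r => r.id)); // [ 'b', 'a', 'd' ]
```

RRF only looks at ranks, not raw similarity scores, which is convenient here because the rewritten-query search and the HyDE search produce scores that are not directly comparable.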

Key Takeaways

Single-pass RAG covers roughly 70-80% of what production RAG needs; the rest needs chains
User queries are often bad search queries: too short, too ambiguous, wrong vocabulary
Multi-hop, aggregation, and question-statement gap queries all fail single-pass
Chains add steps before/around/after retrieval rather than tuning the retriever harder
A 200ms query-rewrite LLM call usually beats a week of embedding tuning
