
Hybrid Search: Dense + Sparse

Where dense retrieval fails, BM25 wins, and vice versa. This lesson covers the fundamentals of keyword search, Reciprocal Rank Fusion (the production standard for combining the two), when hybrid actually beats pure dense, and how Pinecone / Weaviate / Qdrant / Elasticsearch each implement it.


Why Vector Search Alone Isn't Enough

Pure dense retrieval — the kind covered in the Beginner course — is incredible at meaning matching. Ask "how do I make my database faster" and it finds docs about query tuning, indexing, and caching even when none of them use the literal phrase "make my database faster."

But it has a blind spot, and the blind spot is exactly where keyword search excels.

Where Dense Fails

Dense embedding models compress text into a fixed-size vector that captures semantic meaning. The mechanism that makes them work for paraphrase — collapsing surface variation into the same conceptual region — also makes them lose information about literal tokens.

Three classes of queries where this hurts:

Identifiers and codes. A user types "error E_AUTH_4012". The vector for that string captures something like "this looks like an authentication-related error message." It does NOT carry the literal token E_AUTH_4012 as a strong feature. A doc that literally contains E_AUTH_4012 won't necessarily rank above a doc that talks about authentication errors in general.

Person and product names. "When did John Smith join the company?" — the embedding represents "someone with a first and last name joined." It's weakly anchored to "John Smith" specifically. If your corpus has hundreds of name-mention chunks, vector search returns name-mention-like results, not the right person's biographical chunk.

Out-of-distribution terminology. Domain-specific jargon that wasn't in the embedding model's training data (chemical names, gene symbols, protocol identifiers, your company's internal codenames) is typically broken into subword pieces the model has rarely seen together. The resulting vector lands near whatever superficially similar text the model does know, which is often a different concept entirely.

Where BM25 Excels

Keyword search, specifically the BM25 algorithm that is the default ranking function in Elasticsearch, OpenSearch, and Solr, does the opposite. It scores documents by how often the literal query words appear, with adjustments for word rarity (rare terms count for more) and document length (long documents are not rewarded just for containing more words).

That's terrible at paraphrase ("database speed" doesn't match "make my DB faster") but excellent at exact-term match. E_AUTH_4012 either appears or doesn't; BM25 nails it instantly.
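The scoring described above can be sketched in a few lines. This is a minimal illustration, not how Elasticsearch or any other engine implements it: the whitespace tokenizer, the three sample documents, and the parameter values k1=1.5 and b=0.75 are assumptions chosen for the example, and the IDF term uses one common smoothed variant.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split. Real engines use analyzers
    # with stemming, stop words, and punctuation handling.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # a term that never appears contributes nothing
            # Word rarity: rare terms get a larger weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Document length: longer-than-average docs are penalized.
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical sample corpus.
docs = [
    "error e_auth_4012 raised when the token expires",
    "general guide to authentication errors and login failures",
    "tuning query performance with indexes and caching",
]

exact = bm25_scores("error E_AUTH_4012", docs)
paraphrase = bm25_scores("make my database faster", docs)
```

With this corpus, only the first document gets a non-zero score for the identifier query, while the paraphrase query scores zero everywhere because none of its words appear literally, even though the third document is the relevant one.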

The Insight: They Fail in Opposite Directions

This is the whole motivation for hybrid search. Dense and sparse retrieval miss different things:

Query type | Dense | BM25
"How do I make my DB faster" | ✓ finds rephrasing | ✗ misses if exact words absent
"error E_AUTH_4012" | ✗ doesn't anchor on token | ✓ exact match wins
"Who is John Smith" | ✗ name not strongly represented | ✓ matches "John Smith" literally
"best practices for caching" | ✓ semantic match | ✓ literal word overlap

The two methods agree on the easy queries and disagree on the hard ones. Combining them captures both kinds of relevance — and you keep the wins from each.

What "Hybrid" Actually Means

Hybrid search runs both retrievals (dense ANN search and BM25), gets a ranked list from each, and fuses the two lists into a single ranked list before sending it to the LLM (or to the user).

The fusion strategy matters; the next three sections cover the math and the trade-offs. But the architecture is universal: index your docs twice (once with embeddings, once with an inverted index), query twice in parallel, merge the results.
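The query-twice-and-merge shape can be sketched as follows, using Reciprocal Rank Fusion as the merge step. Everything here is illustrative: `dense_search` and `keyword_search` are hypothetical stand-ins that return hard-coded rankings instead of calling a real vector index or inverted index, and k=60 is a commonly used default for the RRF constant, not a value taken from this course.

```python
from concurrent.futures import ThreadPoolExecutor


def dense_search(query):
    # Hypothetical stand-in for an ANN query against an embedding index.
    return ["doc_b", "doc_c", "doc_a"]


def keyword_search(query):
    # Hypothetical stand-in for a BM25 query against an inverted index.
    return ["doc_a", "doc_b", "doc_d"]


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of doc ids: each list adds 1 / (k + rank) per doc."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_search(query, k=60):
    # Run both retrievals in parallel, then merge the two rankings.
    with ThreadPoolExecutor(max_workers=2) as pool:
        dense_future = pool.submit(dense_search, query)
        sparse_future = pool.submit(keyword_search, query)
        return reciprocal_rank_fusion(
            [dense_future.result(), sparse_future.result()], k=k
        )


fused = hybrid_search("error E_AUTH_4012")
print(fused)  # ['doc_b', 'doc_a', 'doc_c', 'doc_d']
```

Documents that both retrievers return (doc_a and doc_b) rise above documents that only one of them found. Because the fusion uses ranks and not raw scores, the dense similarity values and BM25 scores never have to be put on a common scale.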

How Production Has Quietly Settled This

Look at the search systems and engines that see the heaviest production use:

  • Google's ranking has been hybrid for years (dense neural models + traditional keyword signals)
  • Algolia offers hybrid keyword-plus-vector search
  • Elasticsearch added native dense vector support, which enables hybrid search alongside its long-standing BM25 ranking
  • Weaviate, Qdrant, Vespa all ship hybrid as a first-class feature
  • Pinecone added sparse vector support in 2023 to enable hybrid

Dense-only retrieval is the beginner default. Large-scale production systems are almost all hybrid. This day is about closing that gap.

Key Takeaways
  • Dense retrieval excels at paraphrase but loses information about literal tokens — identifiers, names, and rare terminology fall through the cracks
  • BM25 keyword search excels at exact-term match but fails on paraphrase — it's the opposite failure mode
  • Almost every large-scale production search system is hybrid; pure dense is the beginner default, not the production norm

