Hybrid Search: tsvector + pgvector

Pure vector search misses exact terms — product codes, error strings, rare names. This day combines Postgres full-text search (tsvector) with semantic search (pgvector) and fuses the two rankings with Reciprocal Rank Fusion, entirely in SQL.

Why Pure Vector Search Isn't Enough

If you finished the beginner track you can already store embeddings in a vector column and run ORDER BY embedding <=> query to get semantic neighbors. That's powerful — but on its own it quietly fails on a whole category of queries.

Where Embeddings Fall Down

Vector search retrieves by meaning, which is exactly wrong when the user wants an exact token:

  • Identifiers and codes: SKU-4471, ORA-00942, CVE-2024-3094. Embedding models tokenize these oddly and can place near-identical codes far apart.
  • Rare proper nouns: a surname or product name the embedding model never saw in training collapses toward generic neighbors.
  • Negation and exact phrasing: "invoices not yet paid" embeds close to "paid invoices."
  • Out-of-domain jargon: internal acronyms that simply aren't in the model's vocabulary.

Keyword search has the opposite failure mode: it nails exact terms but misses synonyms and paraphrases ("car" vs "automobile", "how do I cancel" vs "termination process").
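
To make the keyword failure mode concrete, here is a toy sketch in Python. It is not how Postgres full-text search works (there is no stemming, no tsvector, no ranking), and the tokenize and keyword_match helpers and the sample documents are invented for illustration. It only shows that literal token matching finds an exact code while missing a synonym.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and extract word tokens, keeping hyphenated codes whole."""
    return set(re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower()))


def keyword_match(query: str, docs: dict[int, str]) -> list[int]:
    """Return ids of documents that contain every query token (hypothetical toy retriever)."""
    query_tokens = tokenize(query)
    return [doc_id for doc_id, text in docs.items() if query_tokens <= tokenize(text)]


# Hypothetical sample corpus.
docs = {
    1: "Fix for ORA-00942: table or view does not exist",
    2: "Automobile insurance renewal guide",
    3: "How to service your car",
}

exact_hits = keyword_match("ORA-00942", docs)  # the exact code is found
synonym_hits = keyword_match("car", docs)      # document 2 ("Automobile") is missed

print(exact_hits, synonym_hits)
```

A semantic retriever would be expected to surface document 2 for "car", which is the gap the hybrid approach is meant to close.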

The Hybrid Idea

Hybrid search runs both retrievers and fuses their results:

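  1. A lexical retriever: Postgres full-text search over a tsvector, ranked by ts_rank or ts_rank_cd. These score by term frequency and, for ts_rank_cd, term proximity; unlike BM25, neither uses corpus-wide statistics such as inverse document frequency.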
  2. A semantic retriever — pgvector nearest-neighbor by cosine or L2 distance.

Each retriever produces a ranked list. A fusion step merges the two lists into one final ranking. On mixed real-world queries the fused ranking tends to beat either retriever alone, because each covers the other's blind spots.
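
As a rough illustration of the fusion step, the sketch below applies Reciprocal Rank Fusion to two ranked lists in Python. The course implements this in SQL; this version only shows the arithmetic. The rrf_fuse function name, the document ids, and the constant k = 60 (a commonly used default) are assumptions made for the example.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks start at 1
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical outputs of the two retrievers, best match first.
lexical = ["sku_page", "faq", "changelog"]
semantic = ["faq", "pricing", "sku_page"]

fused = rrf_fuse([lexical, semantic])
fused_ids = [doc_id for doc_id, _ in fused]
print(fused_ids)
```

Documents that appear in both lists (here faq and sku_page) accumulate two contributions and rise above documents that only one retriever returned. Only rank positions enter the formula, so the raw scores of the two retrievers never need to be on the same scale.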

One Database, Two Indexes

The nice part for a Postgres shop: you don't need a separate search cluster. A single table can carry both a tsvector column (with a GIN index) and a vector column (with an HNSW or IVFFlat index). Both retrievers, the fusion, and your business filters all live in one SQL query against one transactional store — no sync pipeline, no dual-write consistency problem.

What This Day Covers

  • Section 2: Postgres full-text search internals — tsvector, tsquery, ts_rank, and the GIN index.
  • Section 3: pgvector recap and why its scores aren't directly comparable to ts_rank.
  • Section 4: Reciprocal Rank Fusion (RRF) and why fusing ranks beats fusing scores.
  • Section 5: Implementing the full hybrid query in SQL with CTEs, plus when hybrid is and isn't worth it.

Key Takeaways

  • Vector search retrieves by meaning and misses exact tokens like codes, IDs, and rare names
  • Keyword (full-text) search nails exact terms but misses synonyms and paraphrases
  • Hybrid search runs both retrievers and fuses the rankings — covering each other's blind spots, all inside one Postgres table

Course Stats

Estimated time: 50 min
Lessons: 5 sections