Multi-Vector Retrieval: ColBERT and Late Interaction

One vector per document is a compromise. In multi-vector retrieval, each token in your doc gets its own vector, and the MaxSim operator scores documents by per-token matching. This day covers ColBERT, PLAID compression, production deployment patterns, and ColPali's extension to document images.

The Single-Vector Bottleneck

Every retrieval technique you've learned so far has assumed one fundamental constraint: one vector per document. The whole document — title, body, conclusion, code samples, every sentence — gets compressed into a single 768-dim or 1536-dim vector.

This sounds reasonable until you think about what it actually means.

The Compression Problem

Imagine a 1,500-token document about authentication. It mentions JWT tokens, OAuth flows, session storage, security best practices, and a worked example. The embedding model reads all 1,500 tokens and produces one vector — a 768-dim summary of the whole thing.

The information that gets averaged away:

  • The specific phrase "JWT token rotation" no longer has a strong signal
  • The relationship between "OAuth" and "session storage" is collapsed
  • Fine-grained matching against a query like "how do I refresh JWT tokens after expiry" is approximate

The single vector behaves much like a weighted average of the document's content (with mean pooling, it is literally an average of the token vectors). Like any average, it blurs what distinguishes one part of the document from another.

Why This Matters in Practice

When a query lands in your retrieval system, it's also compressed into a single vector. Two single vectors are compared. The query "JWT rotation" is matched against the document's averaged representation, which encodes "general authentication topics" rather than "JWT rotation specifically."

Result: documents that contain the precise concept the user asked about may not score noticeably higher than documents that broadly cover the topic area. The cosine similarity isn't sensitive enough to distinguish "contains the exact answer" from "discusses the same area."

Three Concrete Failure Modes

Dilution by length. A short doc that says "JWT tokens are refreshed via the /auth/refresh endpoint" embeds as roughly that meaning. A 2,000-token doc that mentions the same fact in one sentence among many others embeds as the average of many topics — the JWT refresh signal is diluted.

Phrase-level information loss. "Customer churn rate" and "rate of customer departures" mean the same thing. But a query for the literal phrase "customer churn rate" might match a doc containing "rate", "churn", and "customer" in unrelated contexts ahead of a doc containing the exact phrase. A single pooled vector only weakly preserves which words appear next to which.

Multi-faceted queries. A query like "what error codes does the auth service return and how do I handle them" has multiple parts (which error codes exist, and how to handle them). A single-vector embedding represents only the gist of the query. A document that answers each part precisely does not necessarily score higher than one that loosely matches the overall gist.
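The first failure mode, dilution by length, can be illustrated with a toy model. This is a minimal sketch under loud assumptions: the document vector is produced by mean pooling (the lesson does not say which pooling the embedding model uses), and token vectors are axes of an identity matrix so that unrelated tokens are exactly orthogonal. The function name `pooled_cosine` is hypothetical. Real embeddings are far less tidy, but the trend is the same.

```python
import numpy as np

def pooled_cosine(num_tokens: int, dim: int = 256) -> float:
    """Cosine between a one-topic query and a mean-pooled document.

    Toy assumption: every token vector is a distinct axis of the identity
    matrix, so unrelated tokens are exactly orthogonal. Token 0 is the only
    token that matches the query. Requires num_tokens <= dim.
    """
    token_vectors = np.eye(dim)[:num_tokens]   # shape: (num_tokens, dim)
    query = token_vectors[0]                   # stands in for the "JWT refresh" signal
    doc = token_vectors.mean(axis=0)           # mean pooling -> a single vector
    return float(query @ doc / (np.linalg.norm(query) * np.linalg.norm(doc)))

# The matching token is always present; only the amount of other content grows.
for n in (1, 4, 16, 64, 256):
    print(f"{n:>3} tokens -> cosine {pooled_cosine(n):.3f}")
```

In this idealized setup the cosine falls as 1/sqrt(n), where n is the number of tokens: the fact is still in the document, but its share of the pooled vector shrinks as the document grows.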

The Alternative

What if, instead of one vector per document, you had one vector per token? Each token in the document (a word or subword piece) gets its own contextualized embedding. At query time, each query token is matched against all document tokens individually.

The query "JWT rotation" now consists of two query tokens (ignoring subword splits and any special tokens the model adds). Each finds its best match anywhere in the document. The document scores high if it has strong matches for both query tokens, even if those matches are at different positions.
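Written as a formula, this per-token matching is usually expressed as a sum of maxima. The notation is an assumption for illustration: q_i and d_j denote the contextualized vectors of query token i and document token j (normalized to unit length, so the dot product is a cosine similarity), and |q| and |d| are the token counts.

```latex
S(q, d) = \sum_{i=1}^{|q|} \max_{1 \le j \le |d|} \; q_i \cdot d_j
```

Each query token contributes only its single best match, so a document scores high when every query token finds a strong counterpart somewhere in it.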

This is multi-vector retrieval, also called late interaction because query and document tokens interact late in the pipeline: only after each side has been encoded independently, rather than inside the transformer. It's the central idea behind ColBERT (Contextualized Late Interaction over BERT, Stanford 2020), the canonical model in this space.
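A minimal sketch of late-interaction scoring in NumPy follows. The function name `maxsim_score` and the three-dimensional token vectors are hand-made stand-ins chosen for illustration; they are not outputs of a real ColBERT model, which would produce one contextualized vector per token.

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """For each query token, take its best cosine match among the document
    tokens, then sum those maxima over all query tokens."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T                       # shape: (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

# Hypothetical token vectors: axis 0 ~ "jwt", axis 1 ~ "rotation", axis 2 ~ filler.
query = np.array([[1.0, 0.0, 0.0],      # "jwt"
                  [0.0, 1.0, 0.0]])     # "rotation"

doc_specific = np.array([[1.0, 0.0, 0.0],   # mentions "jwt"
                         [0.0, 0.0, 1.0],   # unrelated token in between
                         [0.0, 1.0, 0.0]])  # mentions "rotation" elsewhere

doc_general = np.array([[1.0, 0.0, 0.0],    # mentions "jwt"
                        [0.0, 0.0, 1.0],    # unrelated
                        [0.0, 0.0, 1.0]])   # unrelated

score_specific = maxsim_score(query, doc_specific)
score_general = maxsim_score(query, doc_general)
print(score_specific, score_general)
```

The document that matches both query tokens scores higher than the one matching only "jwt", even though the two matches sit at different positions.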

Why "Late Interaction" Matters

Compare three approaches:

Cross-encoder (early interaction): runs the query and document through the same transformer together, producing one score per (query, doc) pair. Most accurate. Most expensive: it needs a full forward pass for every candidate at query time, so in practice it is limited to reranking a small candidate set (on the order of ~100).

Bi-encoder (no interaction): independently encodes query and document into single vectors. Cheap and scalable. Less accurate — no per-token matching.

Late interaction (ColBERT): independently encodes query and document into multiple vectors (one per token). At query time, computes per-token similarities and aggregates them with MaxSim. Most of the compute cost is at indexing (encoding all token vectors); query-time MaxSim is fast and parallelizable.

ColBERT splits the difference: more accurate than bi-encoders because of token-level matching, much faster than cross-encoders because doc encoding is precomputed.
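The trade-off between the three approaches can be made concrete with a back-of-envelope count of query-time work. This is a rough sketch, not a benchmark: the names `QueryCost` and `query_time_cost` are hypothetical, the token counts are illustrative defaults, and the model ignores index lookups, compression, and the work done inside each transformer pass.

```python
from dataclasses import dataclass

@dataclass
class QueryCost:
    transformer_passes: int   # forward passes required at query time
    dot_products: int         # vector-to-vector similarities computed

def query_time_cost(approach: str, num_candidates: int,
                    query_tokens: int = 32, doc_tokens: int = 180) -> QueryCost:
    """Rough query-time work for scoring num_candidates documents.
    Document encoding for bi-encoders and late interaction happens at
    indexing time, so it is not counted here."""
    if approach == "cross-encoder":
        # One joint (query, doc) forward pass per candidate.
        return QueryCost(num_candidates, 0)
    if approach == "bi-encoder":
        # Encode the query once, then one dot product per candidate.
        return QueryCost(1, num_candidates)
    if approach == "late-interaction":
        # Encode the query once, then a full token-by-token similarity
        # matrix per candidate for MaxSim.
        return QueryCost(1, num_candidates * query_tokens * doc_tokens)
    raise ValueError(f"unknown approach: {approach}")

for name in ("cross-encoder", "bi-encoder", "late-interaction"):
    print(name, query_time_cost(name, num_candidates=1000))
```

Late interaction does far more dot products than a bi-encoder, but dot products are cheap and batch well, while the cross-encoder's per-candidate transformer passes dominate its cost.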

This day is about the mechanics, the trade-offs, and when this approach is worth the operational complexity.

Key Takeaways
  • Single-vector retrieval compresses an entire document into one vector — the average loses phrase-level information, fine-grained matching, and multi-faceted query handling
  • Multi-vector / late interaction approaches (ColBERT) keep one vector per token, allowing per-token query-document matching at search time
  • Late interaction sits between bi-encoders (cheap, less accurate) and cross-encoders (accurate, prohibitively expensive) — better accuracy than bi-encoders without re-running the full model per candidate

Course Stats

Estimated Time: 50 min
Lessons: 5 sections