Multi-Modal Embeddings: CLIP and Beyond

Text and images in the same vector space. How CLIP works (contrastive training on 400M image-caption pairs collected from the internet), when to reach for SigLIP / ImageBind / managed APIs, where it breaks (compositional reasoning, OCR), and the production patterns for unified multi-modal indexes.

What Multi-Modal Embeddings Are (And Why)

So far this course has treated embeddings as a way to put text into a vector space. Multi-modal embeddings extend that idea: put text, images, audio, video — and sometimes more exotic things like depth maps or motion sensor data — into the same vector space.

That means a photo of a cat and the literal word "cat" end up near each other in 512-dimensional space. You can search for images by typing text. You can find similar text by uploading an image. The boundary between modalities, from the database's perspective, disappears.

What "Same Space" Actually Means

Concretely: a multi-modal model has one encoder per modality, typically two. The text encoder takes a string and produces a vector (512 dimensions in CLIP's ViT-B/32 variant; other models use other sizes). The image encoder takes a pixel grid and produces a vector of the same size. The two vectors live in the same coordinate system, so cosine similarity between them is meaningful.
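To make "cosine similarity between them is meaningful" concrete, here is a minimal sketch in plain Python. The three vectors are hand-written 4-dimensional stand-ins, not real encoder outputs; they are assumptions chosen only to show the comparison you would run on real 512-dim vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical stand-ins for encoder outputs (real ones would be 512-dim).
text_cat = [0.9, 0.1, 0.0, 0.2]    # pretend: text encoder applied to "cat"
image_cat = [0.8, 0.2, 0.1, 0.3]   # pretend: image encoder applied to a cat photo
image_car = [0.0, 0.9, 0.7, 0.1]   # pretend: image encoder applied to a car photo

sim_cat = cosine_similarity(text_cat, image_cat)
sim_car = cosine_similarity(text_cat, image_car)

# In an aligned space the matching image scores higher than the unrelated one.
print(f"text 'cat' vs cat photo: {sim_cat:.3f}")
print(f"text 'cat' vs car photo: {sim_car:.3f}")
```

With a real model, the only change is where the vectors come from; the comparison itself stays the same.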

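Compare that to a single-modal setup, where a text model such as text-embedding-3 produces one vector space and a separate vision model produces another. The two spaces are not aligned (they may not even have the same number of dimensions), so a similarity score between a text vector and an image vector carries no meaning. Cross-modal queries do not work in that world.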

The Use Cases

A few that justify the complexity:

  • Image search by text query. "Find product photos of red dresses" → encode the text, search image vectors. A common pattern in e-commerce search.
  • Reverse image search. Upload a photo, find similar ones. The same image-to-image lookup also supports content moderation and deduplication.
  • Cross-modal recommendation. "Users who looked at this image also read these articles" — same embedding space lets you compute similarity across content types.
  • Zero-shot image classification. Without training a classifier, you can categorize images by the text labels you supply at query time. Useful when you don't have labeled training data.
  • Content moderation. Flag images matching forbidden text descriptions ("violence," "graphic content") without maintaining a labeled image dataset.
  • Multi-modal RAG. Retrieve relevant images alongside text chunks to give a vision-capable LLM (GPT-4V, Claude, Gemini) richer context.
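The zero-shot classification use case above can be sketched in a few lines: embed each candidate label as text, embed the image, and turn the similarities into probabilities with a softmax. This is a sketch under stated assumptions, not a definitive implementation. The vectors are hand-written stand-ins for encoder outputs, the "a photo of a ..." prompts are illustrative, and the temperature value is an assumption you would tune or take from your model.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_vec, label_vecs, temperature=0.01):
    """Return {label: probability} for one image against text-label vectors.

    temperature is an illustrative assumption; smaller values sharpen the
    distribution around the best-matching label.
    """
    labels = list(label_vecs)
    logits = [cosine_similarity(image_vec, label_vecs[name]) / temperature
              for name in labels]
    return dict(zip(labels, softmax(logits)))


# Hypothetical text-encoder outputs for labels supplied at query time.
label_vecs = {
    "a photo of a cat": [0.9, 0.1, 0.0, 0.2],
    "a photo of a dog": [0.6, 0.6, 0.1, 0.1],
    "a photo of a car": [0.0, 0.9, 0.7, 0.1],
}
# Hypothetical image-encoder output for the image being classified.
image_vec = [0.8, 0.2, 0.1, 0.3]

probs = zero_shot_classify(image_vec, label_vecs)
best_label = max(probs, key=probs.get)
print(best_label)
```

Changing the label set requires no retraining: you only edit the dictionary of label strings and re-encode them.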

What This Day Will and Won't Cover

This day teaches you the architecture and production patterns. It won't teach you to train a CLIP model from scratch — that's a separate ML course requiring serious GPU resources. What you'll be doing in production is:

  • Choosing a pre-trained multi-modal embedding model
  • Encoding your text and images with it (one API call each)
  • Storing the resulting vectors in your existing vector DB
  • Querying across modalities

The model is upstream; the vector DB and your application code don't change much from the text-only case.
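The four steps above can be sketched with a toy in-memory index standing in for your vector DB. Everything here is hypothetical: `TinyIndex`, the item IDs, and the hand-written vectors are illustrations, and a real system would call a pre-trained encoder and a real vector database instead. The point is that text and image vectors sit in one index, with modality stored as ordinary metadata.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    item_id: str
    modality: str          # "text" or "image", stored as plain metadata
    vector: List[float]    # unit-length embedding


def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


class TinyIndex:
    """Brute-force stand-in for a vector DB (hypothetical, for illustration)."""

    def __init__(self):
        self.items = []

    def add(self, item_id, modality, vector):
        self.items.append(Item(item_id, modality, normalize(vector)))

    def search(self, query_vector, top_k=3, modality=None):
        q = normalize(query_vector)
        scored = []
        for item in self.items:
            if modality is not None and item.modality != modality:
                continue  # optional metadata filter
            score = sum(a * b for a, b in zip(q, item.vector))
            scored.append((score, item))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(item.item_id, item.modality, score)
                for score, item in scored[:top_k]]


index = TinyIndex()
# Vectors below are stand-ins for what a multi-modal encoder would return.
index.add("img-cat", "image", [0.8, 0.2, 0.1, 0.3])
index.add("img-car", "image", [0.0, 0.9, 0.7, 0.1])
index.add("doc-cat-care", "text", [0.85, 0.15, 0.05, 0.25])

query = [0.9, 0.1, 0.0, 0.2]  # pretend: text encoder applied to "cat"
image_hits = index.search(query, top_k=2, modality="image")  # text -> image
all_hits = index.search(query, top_k=3)                      # across modalities
```

Filtering by modality is a design choice, not a requirement. Without the filter, one query returns text and images ranked together, which is what a unified multi-modal index gives you.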

A Quick Sanity Check

If multi-modal embeddings sound magical, that is a reasonable reaction the first time you try them. A model takes a JPEG of a cat and the string "cat" and produces two 512-dim vectors that are close to each other, without anyone ever hand-labeling that image as "cat". It is one of the more remarkable results in recent ML.

That behavior comes from contrastive training on roughly 400 million (image, caption) pairs, which the next section explains.

Key Takeaways
  • Multi-modal embeddings put text, images, and sometimes audio/video into the same vector space — cosine similarity between modalities is meaningful
  • The architecture: separate encoders per modality, jointly trained so their outputs align in the same coordinate system
  • In production, the model is upstream — your vector DB and application code stay almost identical to the text-only case
