Text and images in the same vector space. How CLIP works (contrastive training on 400M internet pairs), when to reach for SigLIP / ImageBind / managed APIs, where it breaks (compositional reasoning, OCR), and the production patterns for unified multi-modal indexes.
So far this course has treated embeddings as a way to put text into a vector space. Multi-modal embeddings extend that idea: put text, images, audio, video — and sometimes more exotic things like depth maps or motion sensor data — into the same vector space.
That means a photo of a cat and the literal word "cat" end up near each other in 512-dimensional space. You can search for images by typing text. You can find similar text by uploading an image. The boundary between modalities, from the database's perspective, disappears.
Concretely: a multi-modal model has two encoders (or more, one per modality). The text encoder takes a string and produces a 512-dim vector. The image encoder takes a pixel grid and also produces a 512-dim vector. These two vectors live in the same coordinate system — cosine similarity between them is meaningful.
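To make "same coordinate system" concrete, here is a minimal sketch of comparing a text vector against image vectors with cosine similarity. The vectors are hand-written 4-dimensional stand-ins, not real encoder outputs, and the variable names are hypothetical. A real model would produce 512 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for encoder outputs (assumption: invented values,
# 4 dims instead of 512, chosen only to illustrate the comparison).
text_cat = [0.9, 0.1, 0.0, 0.2]    # pretend output of the text encoder for "cat"
image_cat = [0.8, 0.2, 0.1, 0.3]   # pretend output of the image encoder for a cat photo
image_car = [0.1, 0.9, 0.3, 0.0]   # pretend output of the image encoder for a car photo

sim_cat = cosine_similarity(text_cat, image_cat)
sim_car = cosine_similarity(text_cat, image_car)

# The text vector scores higher against the matching image than the unrelated one.
print(f"text 'cat' vs cat photo: {sim_cat:.3f}")
print(f"text 'cat' vs car photo: {sim_car:.3f}")
```

The comparison only makes sense because both vectors are assumed to come from encoders trained into one shared space. Applying the same function to vectors from two unrelated models would still return a number, but that number would carry no meaning.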
Compare that with a single-modal setup, where a text model such as text-embedding-3 produces one vector space and a separate vision model produces another, with no relationship between them. Cross-modal queries are impossible in that world.
A few that justify the complexity:
This day teaches you the architecture and production patterns. It won't teach you to train a CLIP model from scratch — that's a separate ML course requiring serious GPU resources. What you'll be doing in production is:
The model is upstream; the vector DB and your application code don't change much from the text-only case.
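As a rough illustration of why the storage side barely changes, here is a sketch of a tiny in-memory index that holds text and image vectors side by side. `UnifiedIndex`, `Item`, and the toy vectors are hypothetical names and values invented for this example. A production system would use a real vector database and real encoder outputs.

```python
import math
from dataclasses import dataclass

@dataclass
class Item:
    item_id: str
    modality: str   # "text" or "image"; metadata only, never used for scoring
    vector: list

def _normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

class UnifiedIndex:
    """Brute-force index: one vector space, many modalities."""

    def __init__(self, dim):
        self.dim = dim
        self.items = []

    def add(self, item_id, modality, vector):
        if len(vector) != self.dim:
            raise ValueError("vector has the wrong dimensionality")
        self.items.append(Item(item_id, modality, _normalize(vector)))

    def search(self, query_vector, k=3, modality=None):
        q = _normalize(query_vector)
        scored = [
            (sum(a * b for a, b in zip(q, item.vector)), item)
            for item in self.items
            if modality is None or item.modality == modality
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(item.item_id, item.modality, round(score, 3))
                for score, item in scored[:k]]

# Toy 4-dim vectors standing in for upstream encoder outputs (invented values).
index = UnifiedIndex(dim=4)
index.add("doc-1", "text", [0.9, 0.1, 0.0, 0.2])
index.add("img-1", "image", [0.8, 0.2, 0.1, 0.3])
index.add("img-2", "image", [0.1, 0.9, 0.3, 0.0])

query = [0.85, 0.15, 0.05, 0.25]   # could come from either encoder
results = index.search(query, k=2)
image_only = index.search(query, k=1, modality="image")
print(results)
print(image_only)
```

The index never inspects which encoder produced a vector. Modality is just a metadata field you can filter on, which is the sense in which the application code looks much like the text-only case.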
If multi-modal embeddings sound magical, that is the right reaction the first time you try them. A model takes a JPEG of a cat and the literal string "cat" and produces two 512-dim vectors that are similar to each other, without anyone ever telling it "this image IS this word". That is one of the more remarkable results in recent ML.
That magic is the result of contrastive training on roughly 400 million (image, caption) pairs, which the next section explains.