When off-the-shelf embedders leave double-digit recall on the table for your domain. Contrastive learning fundamentals, hard negative mining (the most important step), LLM-generated training data at scale, and the deployment discipline that prevents catastrophic forgetting.
The Intermediate course showed you how to wring every percentage point of recall out of off-the-shelf embedding models — better chunking, hybrid search, multi-modal extension, reranking. Eventually you hit a ceiling that no amount of retrieval engineering can clear. The remaining gap is in the embedding space itself: your domain uses concepts and distinctions that the pre-trained model doesn't represent well.
This is when you fine-tune.
Pre-trained embedding models (OpenAI's text-embedding-3, Cohere Embed v3, BGE, E5) are trained largely on public text: web crawls, Wikipedia, public papers, GitHub. They develop excellent representations for general concepts in mainstream languages.
Your domain is probably not what they were trained on:
When the pre-trained model encounters these, it does its best — usually mapping them to the closest generic concept it does know. "MERN stack" gets embedded near "stack" generally. "tendon" (in construction, meaning post-tensioning steel cable) gets embedded near "tendon" (the anatomical structure). Subtle but consequential.
Three concrete failure modes:
Lookalike collapse. Domain-specific terms with distinct meanings end up at similar coordinates. A retrieval system can't distinguish "rebar" from "tendon" from "post-tensioned cable" because the embedder maps all three to roughly the same "metal-bar in construction context" region. Queries about one return docs about the others.
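One way to check whether lookalike collapse affects your corpus is to embed a handful of domain terms that should be distinct and look for pairs whose cosine similarity is suspiciously high. The sketch below is a minimal illustration, not a definitive diagnostic. The function name `collapsed_pairs`, the 0.95 threshold, and the toy 3-dimensional vectors are all assumptions made for the example. In practice the vectors would come from your embedder, and the threshold would need calibrating against that model's typical similarity range.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def collapsed_pairs(term_vectors, threshold=0.95):
    """Return (term_a, term_b, similarity) for every pair at or above threshold.

    `threshold` is an assumed cutoff, not a recommended value.
    """
    flagged = []
    for (term_a, vec_a), (term_b, vec_b) in combinations(term_vectors.items(), 2):
        sim = cosine(vec_a, vec_b)
        if sim >= threshold:
            flagged.append((term_a, term_b, sim))
    return flagged


# Toy vectors standing in for real embedder output (made up for illustration).
toy_vectors = {
    "rebar": [0.90, 0.10, 0.05],
    "tendon": [0.88, 0.12, 0.06],
    "invoice": [0.05, 0.20, 0.95],
}

flagged = collapsed_pairs(toy_vectors)
for term_a, term_b, sim in flagged:
    print(f"{term_a} / {term_b}: cosine {sim:.3f}")
```

With these toy vectors only the "rebar" / "tendon" pair is flagged. On real embeddings, a long list of flagged pairs among terms your domain experts consider distinct is a hint that the embedding space is the bottleneck.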
Query-document mismatch. Users phrase queries colloquially ("how do I get reimbursed"); docs use formal language ("Procedure for Submission of Expense Claims"). Off-the-shelf models trained mostly on declarative, encyclopedia-style text often lack enough exposure to question-style queries to bridge this gap. The query and the document end up in different neighborhoods even when the meaning matches.
Out-of-distribution catastrophe. Internal company codenames, abbreviations specific to your industry, and jargon from a paper published after the training cutoff get embedded as essentially arbitrary vectors. The model has no signal to place them anywhere meaningful.
Three preconditions, all of which need to be true:
The same three preconditions, inverted:
When fine-tuning IS warranted, typical recall@10 improvements:
| Domain match | Lift from fine-tuning |
|---|---|
| Generic content (close to web crawl) | 0–3 pts (often not worth it) |
| Moderate specialization (typical SaaS, e-commerce) | 3–8 pts |
| Heavy specialization (legal, medical, scientific) | 10–25 pts |
| Exotic / internal jargon (codenames, abbreviations) | 15–30 pts |
The expensive operational machinery — training data pipeline, fine-tuning compute, re-embedding the entire corpus, version management — pays back when the lift is in the double digits. For 3-point lifts, the ongoing cost usually isn't worth the gain.
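To see where your own system falls in the table above, measure recall@10 on a held-out set of labeled queries with both the baseline and the candidate model. The sketch below is a minimal version under stated assumptions. It defines recall@k as the per-query fraction of labeled relevant documents that appear in the top k results, averaged over queries (some teams use hit rate instead). The function name `recall_at_k`, the query and document IDs, and the result lists are all invented for illustration.

```python
def recall_at_k(retrieved, relevant, k=10):
    """Mean over queries of |top-k results ∩ relevant| / |relevant|.

    retrieved: dict mapping query id -> ranked list of doc ids
    relevant:  dict mapping query id -> set of relevant doc ids
    """
    scores = []
    for query_id, relevant_docs in relevant.items():
        if not relevant_docs:
            continue  # skip queries with no labeled relevant docs
        top_k = set(retrieved.get(query_id, [])[:k])
        scores.append(len(top_k & relevant_docs) / len(relevant_docs))
    return sum(scores) / len(scores) if scores else 0.0


# Toy evaluation data (hypothetical ids, not real results).
relevant = {"q1": {"d1", "d2"}, "q2": {"d7"}}
baseline_results = {"q1": ["d1", "d9", "d4"], "q2": ["d3", "d5"]}
finetuned_results = {"q1": ["d1", "d2", "d4"], "q2": ["d7", "d5"]}

baseline_recall = recall_at_k(baseline_results, relevant)
finetuned_recall = recall_at_k(finetuned_results, relevant)
lift_pts = 100 * (finetuned_recall - baseline_recall)
print(f"baseline={baseline_recall:.2f} finetuned={finetuned_recall:.2f} lift={lift_pts:.0f} pts")
```

The numbers produced by this toy data are not representative of real lifts. The point is the procedure: same queries, same labels, same k, and only the embedder changes between the two runs.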
The rest of this day covers, in order: the training mechanics (contrastive/triplet losses), the single most important step (hard negative mining), how to generate training data at scale without humans (LLM-synthesized pairs), and the deployment discipline that makes fine-tuned models actually shippable in production.