Incremental Embedding Pipelines

Embeddings rot the moment their source rows change. Today you build the machinery that keeps them fresh: stale flags driven by triggers, a LISTEN/NOTIFY wake-up for a worker, a durable job queue dequeued with SELECT … FOR UPDATE SKIP LOCKED, logical replication (CDC) as the heavier alternative, and the idempotency, backfill, and batching discipline that keeps the whole thing correct and cheap.


Why Embeddings Go Stale

An embedding is a derived value. It's a function of source content at a moment in time: embedding = embed(model_version, text). The instant any input to that function changes — the text, the model, the chunking strategy — the stored embedding is wrong. Not "slightly off." Wrong, in the sense that semantic search will rank it against a query as if it still describes content that no longer exists.

The Three Ways an Embedding Becomes Stale

1. The source row changed. A user edits a document, a product description is updated, a support ticket gets a new comment. The text column changes; the embedding column still reflects the old text. This is the common case and the focus of today.

2. The model changed. You upgrade from text-embedding-3-small to text-embedding-3-large, or from one provider to another. Now every row is stale — but only relative to the new model. Embeddings from different models live in different vector spaces and must never be compared. A mixed-model index silently returns garbage.

3. The chunking or preprocessing changed. You changed how documents are split, how you strip boilerplate, or how you prepend titles. The text that was embedded no longer matches the text you'd produce today.
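The three cases above can be told apart mechanically if some metadata is stored next to each vector. The following is a minimal sketch, not the lesson's own schema: the `StoredEmbedding` fields, the `preprocess_version` label, and the `stale_reason` helper are all hypothetical names chosen for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def fingerprint(text: str) -> str:
    """Stable hash of the exact text that was sent to the embedding model."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class StoredEmbedding:
    # Hypothetical metadata kept alongside the vector (names are assumptions).
    model: str
    preprocess_version: str
    text_hash: str


def stale_reason(stored: StoredEmbedding, current_text: str,
                 current_model: str, current_preprocess: str) -> Optional[str]:
    """Return why the stored embedding is stale, or None if it is still fresh."""
    if stored.model != current_model:
        return "model changed"
    if stored.preprocess_version != current_preprocess:
        return "preprocessing changed"
    if stored.text_hash != fingerprint(current_text):
        return "source text changed"
    return None


stored = StoredEmbedding("text-embedding-3-small", "chunker-v1", fingerprint("old text"))
print(stale_reason(stored, "old text", "text-embedding-3-small", "chunker-v1"))  # None
print(stale_reason(stored, "new text", "text-embedding-3-small", "chunker-v1"))  # source text changed
print(stale_reason(stored, "old text", "text-embedding-3-large", "chunker-v1"))  # model changed
```

Checking the model first is a deliberate choice in this sketch: a model mismatch makes the vector incomparable regardless of whether the text still matches.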

The Naive Approach and Why It Fails

The first thing everyone tries: re-embed everything on a cron. "Every night, loop over all rows, call the embedding API, write back." This works at 10,000 rows. It collapses at scale:

  • Cost. Re-embedding 50M rows nightly when only 40,000 changed means paying the embedding provider roughly 1,250x what you need to.
  • Latency. A document edited at 9:01am keeps being searched by its old meaning until the 2am batch runs. For many products that 17-hour staleness window is unacceptable.
  • API limits. Embedding providers rate-limit. A full re-embed of a large corpus can take many hours and burn your entire quota, starving live ingestion.
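The arithmetic behind those three bullets is worth working through once. The row counts and clock times below come from the text; the throughput figure `ASSUMED_ROWS_PER_HOUR` is a made-up number for illustration only, since real limits vary by provider and plan.

```python
from datetime import datetime

TOTAL_ROWS = 50_000_000
CHANGED_ROWS = 40_000

# Cost: how many times more rows a full nightly re-embed touches.
overspend = TOTAL_ROWS / CHANGED_ROWS
print(overspend)  # 1250.0

# Latency: an edit at 9:01am waits for the 2am batch (dates are arbitrary).
edited = datetime(2024, 1, 1, 9, 1)
next_batch = datetime(2024, 1, 2, 2, 0)
print(next_batch - edited)  # 16:59:00

# API limits: ASSUMED throughput, purely hypothetical.
ASSUMED_ROWS_PER_HOUR = 2_000_000
full_hours = TOTAL_ROWS / ASSUMED_ROWS_PER_HOUR
incremental_hours = CHANGED_ROWS / ASSUMED_ROWS_PER_HOUR
print(full_hours, incremental_hours)  # 25.0 0.02
```

Under that assumed throughput a full re-embed would not even finish within a day, while the incremental workload takes about a minute.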

The Goal: Incremental, Idempotent, Cheap

What you actually want is a pipeline with three properties:

  • Incremental — only re-embed what changed, ideally within seconds of the change.
  • Idempotent — running it twice on the same row produces the same result and costs no extra API calls. Crashes, retries, and at-least-once delivery are facts of life; the pipeline must tolerate them.
  • Cheap — batch calls to the embedding API, never re-embed unchanged content, and never re-embed text whose embedding you already computed for that exact (text, model) pair.
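The idempotent and cheap properties can be sketched together as a small in-memory wrapper. This is an illustration under stated assumptions, not a production design: `CachedEmbedder` and `fake_embed_batch` are hypothetical names, the cache is a plain dict rather than a table, and the fake embedder stands in for a real provider call.

```python
import hashlib
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
EmbedFn = Callable[[Sequence[str]], List[Vector]]


class CachedEmbedder:
    """Batches calls and never re-embeds a (model, text) pair it has already seen."""

    def __init__(self, model: str, embed_batch: EmbedFn, batch_size: int = 64):
        self.model = model
        self.embed_batch = embed_batch
        self.batch_size = batch_size
        self.cache: Dict[Tuple[str, str], Vector] = {}
        self.api_calls = 0  # counts batch requests, for demonstration

    def _key(self, text: str) -> Tuple[str, str]:
        return (self.model, hashlib.sha256(text.encode("utf-8")).hexdigest())

    def embed(self, texts: Sequence[str]) -> List[Vector]:
        # Deduplicate (order preserved) and drop anything already cached.
        missing = list(dict.fromkeys(t for t in texts if self._key(t) not in self.cache))
        for start in range(0, len(missing), self.batch_size):
            chunk = missing[start:start + self.batch_size]
            vectors = self.embed_batch(chunk)  # one API call per chunk
            self.api_calls += 1
            for text, vec in zip(chunk, vectors):
                self.cache[self._key(text)] = vec
        return [self.cache[self._key(t)] for t in texts]


def fake_embed_batch(texts: Sequence[str]) -> List[Vector]:
    # Stand-in for a real embedding API: one-dimensional "vector" = text length.
    return [[float(len(t))] for t in texts]


embedder = CachedEmbedder("demo-model", fake_embed_batch, batch_size=2)
first = embedder.embed(["a", "bb", "a", "ccc"])
second = embedder.embed(["a", "bb", "a", "ccc"])  # retry: served from cache
print(first == second, embedder.api_calls)  # True 2
```

The second call returns the same vectors without any new API calls, which is the behaviour a retry after a crash should have.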

The rest of this lesson builds exactly this: a column that marks rows stale, a trigger that sets it automatically, a job queue that lets many workers drain the backlog without stepping on each other, and the correctness rules that keep it all honest.

The Mental Model

Think of it as a three-stage flow:

  1. Detect — something changed; mark the row (or enqueue a job).
  2. Schedule — a worker is told there's work (or polls for it) and claims a batch without contention.
  3. Apply — call the embedding API in batches, write embeddings back, and mark the work done — atomically and idempotently.

The first two stages map onto Postgres-native tools: triggers for detect, and for schedule, LISTEN/NOTIFY to wake a worker plus SELECT … FOR UPDATE SKIP LOCKED to claim a batch. We'll take them in order.
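Before reaching for the Postgres tools, it can help to see the detect, schedule, apply flow as a toy in-memory simulation. Everything here is a stand-in and an assumption: `update_text` plays the role a trigger would, `claim` imitates the effect of skipping rows another worker already holds, and `fake_embed_batch` replaces a real embedding API. It runs single-threaded and makes no claim about how the real queue is implemented.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class Row:
    text: str
    embedding: Optional[List[float]] = None
    stale: bool = True
    claimed_by: Optional[str] = None


def fake_embed_batch(texts: Sequence[str]) -> List[List[float]]:
    # Stand-in for a real embedding API call.
    return [[float(len(t))] for t in texts]


class Pipeline:
    def __init__(self) -> None:
        self.rows: Dict[int, Row] = {}

    def update_text(self, row_id: int, text: str) -> None:
        """Detect: any write to the text marks the row stale."""
        row = self.rows.setdefault(row_id, Row(text))
        row.text = text
        row.stale = True

    def claim(self, worker: str, limit: int) -> List[int]:
        """Schedule: take up to `limit` stale rows nobody else has claimed."""
        claimed: List[int] = []
        for row_id, row in self.rows.items():
            if row.stale and row.claimed_by is None:
                row.claimed_by = worker
                claimed.append(row_id)
                if len(claimed) == limit:
                    break
        return claimed

    def apply(self, worker: str, row_ids: List[int]) -> None:
        """Apply: embed the batch in one call, write back, mark done, release."""
        texts = [self.rows[i].text for i in row_ids]
        for row_id, vec in zip(row_ids, fake_embed_batch(texts)):
            row = self.rows[row_id]
            assert row.claimed_by == worker
            row.embedding = vec
            row.stale = False
            row.claimed_by = None


p = Pipeline()
for i, t in enumerate(["alpha", "beta", "gamma"]):
    p.update_text(i, t)
batch_a = p.claim("worker-a", limit=2)
batch_b = p.claim("worker-b", limit=2)
print(batch_a, batch_b)  # [0, 1] [2]
p.apply("worker-a", batch_a)
p.apply("worker-b", batch_b)
print(any(r.stale for r in p.rows.values()))  # False
```

The two workers end up with disjoint batches, which is the property the claim step exists to guarantee.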

Key Takeaways
  • An embedding is derived state — it goes stale when the source text, the model, or the preprocessing changes, and a mixed-model index silently returns garbage
  • Re-embedding everything on a cron is correct but unaffordable and slow at scale; you want incremental, idempotent, and cheap
  • Store the model name and an input fingerprint next to every embedding so the pipeline can prove a given embedding is still fresh
