Embeddings rot the moment their source rows change. Today you build the machinery that keeps them fresh: stale flags driven by triggers, a LISTEN/NOTIFY wake-up for a worker, a durable job queue dequeued with SELECT … FOR UPDATE SKIP LOCKED, logical replication (CDC) as the heavier alternative, and the idempotency, backfill, and batching discipline that keeps the whole thing correct and cheap.
An embedding is a derived value. It's a function of source content at a moment in time: embedding = embed(model_version, text), where text is whatever your chunking and preprocessing produced from the source row. The instant any input to that function changes (the text, the model, the chunking strategy), the stored embedding is wrong. Not "slightly off." Wrong, in the sense that semantic search will keep ranking it against queries as if it described content that no longer exists.
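One way to make "a function of its inputs" concrete is to store a fingerprint of those inputs next to each embedding and compare it later. The sketch below is a minimal illustration, not something the lesson prescribes: the function names and the `chunker_version` parameter are assumptions introduced here.

```python
import hashlib


def embedding_fingerprint(model_version: str, chunker_version: str, text: str) -> str:
    """Hash every input that determines the embedding (all names are hypothetical)."""
    h = hashlib.sha256()
    for part in (model_version, chunker_version, text):
        data = part.encode("utf-8")
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


def is_stale(stored_fingerprint: str, model_version: str,
             chunker_version: str, text: str) -> bool:
    """True if the stored embedding was computed from different inputs."""
    current = embedding_fingerprint(model_version, chunker_version, text)
    return stored_fingerprint != current
```

If the fingerprint stored with a row differs from the one computed from today's inputs, the embedding describes something that no longer exists, whichever input moved.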
Embeddings go stale in three ways:
1. The source row changed. A user edits a document, a product description is updated, a support ticket gets a new comment. The text column changes; the embedding column still reflects the old text. This is the common case and the focus of today.
2. The model changed. You upgrade from text-embedding-3-small to text-embedding-3-large, or from one provider to another. Now every row is stale — but only relative to the new model. Embeddings from different models live in different vector spaces and must never be compared. A mixed-model index silently returns garbage.
3. The chunking or preprocessing changed. You changed how documents are split, how you strip boilerplate, or how you prepend titles. The text that was embedded no longer matches the text you'd produce today.
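The second case, mixed models, is the one that fails silently, so it can help to guard against it at query time. The following is a rough in-memory sketch, assuming each stored vector carries the model version that produced it; `StoredEmbedding`, `cosine`, and `search` are hypothetical names, and a real system would more likely apply the same filter inside the database query.

```python
import math
from dataclasses import dataclass


@dataclass
class StoredEmbedding:
    row_id: int
    model_version: str  # which model produced this vector
    vector: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_vector: list[float], query_model: str,
           rows: list[StoredEmbedding], k: int = 3) -> list[int]:
    """Rank only rows embedded with the same model as the query."""
    comparable = [
        r for r in rows
        if r.model_version == query_model and len(r.vector) == len(query_vector)
    ]
    ranked = sorted(comparable, key=lambda r: cosine(query_vector, r.vector),
                    reverse=True)
    return [r.row_id for r in ranked[:k]]
```

Rows produced by another model are excluded rather than ranked, so during a model migration they drop out of results until they are re-embedded instead of polluting them.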
The first thing everyone tries: re-embed everything on a cron. "Every night, loop over all rows, call the embedding API, write back." This works at 10,000 rows. It collapses at scale:
What you actually want is a pipeline with three properties:
The rest of this lesson builds exactly this: a column that marks rows stale, a trigger that sets it automatically, a job queue that lets many workers drain the backlog without stepping on each other, and the correctness rules that keep it all honest.
Think of it as a three-stage flow: detect, schedule, claim.
Each stage has a Postgres-native tool: triggers for detect, LISTEN/NOTIFY for schedule, SELECT … FOR UPDATE SKIP LOCKED for claim. We'll take them in order.
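Before getting to the Postgres tools, it may help to see the shape of the flow in a toy in-process model. This is deliberately not Postgres: a `threading.Lock` stands in for row locking, a `threading.Event` stands in for the NOTIFY wake-up, and the class and method names are invented for illustration. It only approximates what SKIP LOCKED provides, namely that two workers never receive the same row.

```python
import threading


class ToyPipeline:
    """In-memory stand-in for the detect / schedule / claim flow (hypothetical)."""

    def __init__(self):
        self.rows = {}                   # row_id -> {"text", "stale", "embedding"}
        self.claimed = set()             # row_ids currently held by some worker
        self.lock = threading.Lock()
        self.wakeup = threading.Event()  # stand-in for a NOTIFY wake-up

    def update_text(self, row_id, text):
        """Detect: mark the row stale only if its text really changed."""
        with self.lock:
            row = self.rows.get(row_id)
            if row is not None and row["text"] == text:
                return  # no real change, nothing to re-embed
            old_embedding = row["embedding"] if row else None
            self.rows[row_id] = {"text": text, "stale": True,
                                 "embedding": old_embedding}
        # Schedule: wake any waiting worker.
        self.wakeup.set()

    def claim(self, limit):
        """Claim: take up to `limit` stale rows that no other worker holds."""
        with self.lock:
            batch = [rid for rid, row in self.rows.items()
                     if row["stale"] and rid not in self.claimed][:limit]
            self.claimed.update(batch)
            return [(rid, self.rows[rid]["text"]) for rid in batch]

    def complete(self, row_id, embedded_text, embedding):
        """Write back, clearing the flag only if the text is unchanged since claim."""
        with self.lock:
            self.claimed.discard(row_id)
            row = self.rows[row_id]
            if row["text"] == embedded_text:
                row["embedding"] = embedding
                row["stale"] = False
            # Otherwise the row was edited mid-flight: leave it stale for a retry.


pipeline = ToyPipeline()
for i in range(5):
    pipeline.update_text(i, f"doc {i}")

first = pipeline.claim(3)   # one worker's batch
second = pipeline.claim(3)  # another worker gets only what is left
```

The two batches are disjoint because `claim` skips anything already held. The check in `complete` covers the case where a row is edited while a worker is still embedding its old text: the result is discarded and the row stays stale.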