PII & PHI De-Identification in the RAG Path

Why De-Identify in the RAG Path

In a regulated industry — healthcare, finance, legal — your source documents are full of personal data: patient names, medical record numbers, social security numbers, account numbers, addresses. The moment you point a RAG pipeline at that corpus, that personal data starts flowing through stages you don't fully control.

Every Stage Is a Disclosure Vector

Trace one patient record through a naive RAG system and count the copies:

Chunks — the raw text, PHI and all, is split and stored in your document store.
Embeddings — the chunk is sent to an embedding model (often a third-party API) and the resulting vector is stored. Embeddings are not anonymous: they can be inverted to approximate the source text.
The prompt — at query time, the retrieved chunk is pasted into a prompt and sent to the LLM (frequently a third-party API outside your trust boundary).
The completion — the model may echo PHI back into the answer, and from there into the UI and downstream systems.
The logs — and this is the one teams forget: prompts, completions, and traces get logged for debugging and observability, scattering PHI across log aggregators, error trackers, and analytics.

Each of these stages is a place where personal data can leave your control. Under HIPAA (US health data) and GDPR (EU personal data), an unauthorized exposure at any one of them can be a reportable disclosure.
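The log stage is usually the cheapest one to address in code. The following is a minimal sketch, not a complete control: it assumes Python's standard `logging` module, and the two regex patterns, the MRN format, and the names `scrub`, `RedactingFilter`, and `rag.pipeline` are all hypothetical, chosen for illustration.

```python
import logging
import re

# Illustrative patterns only. The MRN format here is an assumption;
# real identifiers vary by organization and need broader detection.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
MRN_RE = re.compile(r"\bMRN[:\s#]*\d{6,10}\b", re.IGNORECASE)


def scrub(text: str) -> str:
    """Replace recognizable identifiers with typed placeholders."""
    text = SSN_RE.sub("<SSN>", text)
    return MRN_RE.sub("<MRN>", text)


class RedactingFilter(logging.Filter):
    """Scrub a record's final message before any handler formats it."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Merge the format args first so identifiers passed as args are seen.
        record.msg = scrub(record.getMessage())
        record.args = None  # args are already merged into msg
        return True  # keep the (now scrubbed) record


logger = logging.getLogger("rag.pipeline")  # hypothetical logger name
logger.addFilter(RedactingFilter())
```

A filter attached to a logger only sees records logged directly through that logger. To cover records that propagate up from child loggers or third-party libraries, attach the same filter to the handlers instead.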

Why "the LLM vendor is compliant" Isn't Enough

A signed business associate agreement (BAA) with your LLM provider covers the API call to that provider. It does not cover the embedding vendor, your log pipeline, your error tracker, or the engineer who copies a failing prompt into a ticket. De-identification is a defense-in-depth control: minimize the personal data that enters the pipeline at all, so a leak anywhere downstream discloses less.

The Goal: Minimization, Not Perfection

You will not catch 100% of PHI — no detector does. The goal is data minimization: strip what you can detect, before it spreads, and pair that with the data-layer controls (access control, encryption) you'll see in Day 4. De-identification reduces the blast radius; it is not the only line of defense.
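"Strip what you can detect, before it spreads" suggests placing redaction ahead of chunking, so every later stage inherits the reduced text. A rough sketch under stated assumptions: the three regex patterns are deliberately narrow stand-ins for a real detector, and `deidentify`, `chunk`, and `ingest` are hypothetical helper names, not part of any library.

```python
import re

# Hypothetical, deliberately narrow patterns. They catch only
# well-structured identifiers; free-text PHI such as names is missed.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def deidentify(text):
    """Return (redacted_text, counts) with typed placeholders substituted."""
    counts = {}
    for label, pattern in PATTERNS.items():
        text, n = pattern.subn(f"<{label}>", text)
        if n:
            counts[label] = n
    return text, counts


def chunk(text, size=200):
    """Naive fixed-width chunking, standing in for a real splitter."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def ingest(raw_docs, size=200):
    """Redact first, then chunk, so no raw identifier reaches the store."""
    chunks = []
    for doc in raw_docs:
        clean, _ = deidentify(doc)
        chunks.extend(chunk(clean, size))
    return chunks


doc = "Jane Doe, SSN 123-45-6789, phone 555-867-5309, jane.doe@example.org."
clean, counts = deidentify(doc)
print(clean)   # the name "Jane Doe" survives: regexes alone are not enough
print(counts)
```

Redacting before chunking also avoids the case where a chunk boundary splits an identifier in two and neither half matches a pattern. The surviving name in the output is the point made above: detection is partial, so this step reduces exposure rather than eliminating it.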

Key Takeaways

PHI/PII in a RAG pipeline leaks through chunks, embeddings, prompts, completions, and your logs; every stage is a disclosure vector
A BAA with the LLM vendor covers one hop; de-identification is defense-in-depth that minimizes personal data across all of them
The realistic goal is data minimization, not perfect detection — pair de-id with the access/encryption controls from Day 4
