PII & PHI De-Identification in the RAG Path

In a regulated-industry RAG system, personal and health data leaks through every stage — chunks, embeddings, prompts, the third-party LLM, and your logs. Learn to detect PII/PHI with Microsoft Presidio, choose a redaction strategy that survives retrieval, and place de-identification at the right point in the pipeline.

Why De-Identify in the RAG Path

In a regulated industry — healthcare, finance, legal — your source documents are full of personal data: patient names, medical record numbers, social security numbers, account numbers, addresses. The moment you point a RAG pipeline at that corpus, that personal data starts flowing through stages you don't fully control.

Every Stage Is a Disclosure Vector

Trace one patient record through a naive RAG system and count the copies:

  1. Chunks — the raw text, PHI and all, is split and stored in your document store.
  2. Embeddings — the chunk is sent to an embedding model (often a third-party API) and the resulting vector is stored. Embeddings are not anonymous: they can be inverted to approximate the source text.
  3. The prompt — at query time, the retrieved chunk is pasted into a prompt and sent to the LLM (frequently a third-party API outside your trust boundary).
  4. The completion — the model may echo PHI back into the answer, and from there into the UI and downstream systems.
  5. The logs — and this is the one teams forget: prompts, completions, and traces get logged for debugging and observability, scattering PHI across log aggregators, error trackers, and analytics.

Each of these stages is a place where personal data can leave your control. Under HIPAA (US health data) and GDPR (EU personal data), every one of them is a potentially reportable disclosure.
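To make the count concrete, here is a minimal sketch that simulates the five stages for a single record. Everything in it is an assumption made for illustration: the function names, the fake patient record, the simulated completion, and the SSN-only regex that stands in for a real detector.

```python
import re

# Hypothetical stand-in for a real detector: matches only US SSN-shaped strings.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def naive_rag_trace(record: str, question: str) -> dict[str, str]:
    """Return the payload each stage of a naive pipeline would hold."""
    chunk = record                       # 1. stored verbatim in the document store
    embedding_request = chunk            # 2. raw text sent to the embedding API
    prompt = f"Context:\n{chunk}\n\nQuestion: {question}"   # 3. sent to the LLM
    completion = f"According to the record: {chunk}"        # 4. simulated echo
    log_line = f"prompt={prompt!r} completion={completion!r}"  # 5. observability
    return {
        "document_store": chunk,
        "embedding_api": embedding_request,
        "llm_prompt": prompt,
        "completion": completion,
        "logs": log_line,
    }


def stages_holding_phi(trace: dict[str, str]) -> list[str]:
    """List the stages whose payload still contains an SSN-shaped string."""
    return [stage for stage, payload in trace.items() if SSN.search(payload)]


record = "Patient Jane Doe, SSN 123-45-6789, presents with hypertension."
trace = naive_rag_trace(record, "What is the diagnosis?")
exposed = stages_holding_phi(trace)
print(len(exposed), exposed)
```

In this simulation all five stages end up holding the identifier, which is the point of the exercise: one source record becomes five copies, several of them outside your trust boundary.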

Why "the LLM vendor is compliant" Isn't Enough

A signed business associate agreement (BAA) with your LLM provider covers the API call to that provider. It does not cover the embedding vendor, your log pipeline, your error tracker, or the engineer who copies a failing prompt into a ticket. De-identification is a defense-in-depth control: minimize the personal data that enters the pipeline at all, so a leak anywhere downstream discloses less.
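As one example of a control the BAA does not give you, the sketch below scrubs log records before any handler writes them. It uses Python's standard `logging.Filter`; the `RedactingFilter` class, the logger name, and the SSN-only regex are hypothetical and stand in for whatever detector you actually deploy.

```python
import io
import logging
import re

# Hypothetical stand-in for a real detector: matches only US SSN-shaped strings.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


class RedactingFilter(logging.Filter):
    """Scrub SSN-shaped strings from a record before the handler formats it."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()            # merge msg % args first
        record.msg = SSN.sub("<US_SSN>", message)
        record.args = ()                         # args are already merged into msg
        return True                              # keep the record, now scrubbed


# Write to an in-memory buffer so the result can be inspected.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.addFilter(RedactingFilter())

logger = logging.getLogger("rag.trace")
logger.setLevel(logging.INFO)
logger.propagate = False                         # avoid unfiltered root handlers
logger.addHandler(handler)

logger.info("LLM call failed, prompt=%s",
            "Patient SSN 123-45-6789 reports chest pain")
logged = buffer.getvalue()
print(logged)
```

Attaching the filter to the handler means every record that handler emits is scrubbed, whichever logger produced it. Note that this sketch only rewrites the message: exception tracebacks and fields added by other tooling would need their own handling.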

The Goal: Minimization, Not Perfection

You will not catch 100% of PHI — no detector does. The goal is data minimization: strip what you can detect, before it spreads, and pair that with the data-layer controls (access control, encryption) you'll see in Day 4. De-identification reduces the blast radius; it is not the only line of defense.

Key Takeaways
  • PHI/PII in a RAG pipeline leaks through chunks, embeddings, prompts, the LLM vendor, the answer, AND your logs — every stage is a disclosure vector
  • A BAA with the LLM vendor covers one hop; de-identification is defense-in-depth that minimizes personal data across all of them
  • The realistic goal is data minimization, not perfect detection — pair de-id with the access/encryption controls from Day 4

Course Stats

Estimated time: 55 min
Lessons: 5 sections