In a regulated-industry RAG system, personal and health data leaks through every stage — chunks, embeddings, prompts, the third-party LLM, and your logs. Learn to detect PII/PHI with Microsoft Presidio, choose a redaction strategy that survives retrieval, and place de-identification at the right point in the pipeline.
In a regulated industry — healthcare, finance, legal — your source documents are full of personal data: patient names, medical record numbers, social security numbers, account numbers, addresses. The moment you point a RAG pipeline at that corpus, that personal data starts flowing through stages you don't fully control.
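Trace one patient record through a naive RAG system and count the copies: the stored chunk, the text sent for embedding, the assembled prompt, the request to the third-party LLM, and your logs.
Each of those hops is a place personal data leaves your control. Under HIPAA (US health data) and GDPR (EU personal data), every one of them is a potential reportable disclosure.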
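To make the count concrete, here is a toy sketch of that trace. Everything in it is hypothetical: the record is synthetic, the stage names (`chunk_store`, `embedding_request`, and so on) are illustrative labels rather than any real framework's API, and nothing is actually sent anywhere. It only records what each stage would store or transmit, then checks which of those payloads still contain the raw SSN.

```python
# Synthetic record; the name, MRN, and SSN are made up for illustration.
RECORD = "Patient Jane Doe, MRN 00123456, SSN 123-45-6789, presented with chest pain."
SSN = "123-45-6789"


def naive_rag(record: str, question: str) -> dict[str, str]:
    """Return what each stage of a naive pipeline would store or transmit.

    All stage names are hypothetical; no network call is made.
    """
    sinks: dict[str, str] = {}

    chunk = record                            # no de-identification before chunking
    sinks["chunk_store"] = chunk              # raw text persisted next to the vector
    sinks["embedding_request"] = chunk        # raw text sent to the embedding service

    prompt = f"Context:\n{chunk}\n\nQuestion: {question}"
    sinks["prompt"] = prompt                  # retrieved chunk pasted into the prompt
    sinks["llm_request"] = prompt             # same text sent to the third-party LLM
    sinks["log"] = f"DEBUG prompt={prompt!r}" # debug logging of the full prompt

    return sinks


sinks = naive_rag(RECORD, "What were the presenting symptoms?")
leaks = [name for name, payload in sinks.items() if SSN in payload]
print(f"{len(leaks)} of {len(sinks)} stages hold the raw SSN: {leaks}")
```

In this sketch every stage ends up holding the identifier, because nothing between ingestion and logging ever removes it.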
A signed business associate agreement (BAA) with your LLM provider covers the API call. It does not cover the embedding vendor, your log pipeline, your error tracker, or the engineer who copies a failing prompt into a ticket. De-identification is a defense-in-depth control: minimize the personal data that enters the pipeline at all, so a leak anywhere downstream discloses less.
You will not catch 100% of PHI — no detector does. The goal is data minimization: strip what you can detect, before it spreads, and pair that with the data-layer controls (access control, encryption) you'll see in Day 4. De-identification reduces the blast radius; it is not the only line of defense.
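A minimal sketch of "strip what you can detect, before it spreads" might look like the following. It is a standard-library stand-in, not Presidio: the two regular expressions, the `<SSN>` and `<MRN>` placeholder format, and the `redact` helper are all assumptions made for illustration, and a real detector would replace the pattern table. The point is the placement: redaction runs on the raw document before chunking, so every later stage only ever sees the redacted text.

```python
import re

# Hypothetical patterns for illustration; a production detector would cover far more.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b"),
}


def redact(text: str) -> str:
    """Replace every detected span with a typed placeholder such as <SSN>."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


# Synthetic record; redact first, then hand the result to chunking and embedding.
record = "Patient Jane Doe, MRN 00123456, SSN 123-45-6789, presented with chest pain."
clean = redact(record)
print(clean)
```

Note what this sketch misses: the patient's name passes straight through, because no pattern here detects names. That gap is exactly why de-identification is paired with access control and encryption rather than relied on alone.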