Been working on this for a while and just pushed a big update. Wanted to share it here because I think it’s actually useful and I’d love to hear if people run into issues with it.
The core problem I kept running into: every PHI benchmark I found treats each clinical document as independent. You mask it, you score it. But that’s not how re-identification actually works. The threat is cumulative. A name in a note, the same name in an ASR transcript 10 minutes later, a matching date in imaging metadata after that. Each one looks clean. Together they’re a problem.
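To make the compounding concrete, here's a toy illustration (my numbers and my linkage model, not the benchmark's): treat each event's re-identification risk as an independent linkage probability, so the combined risk is the chance that at least one linkage succeeds.

```python
# Toy events, each individually below a hypothetical 0.2 per-event threshold.
events = [
    {"modality": "text",  "risk": 0.15},  # name in a clinical note
    {"modality": "asr",   "risk": 0.15},  # same name in a transcript
    {"modality": "image", "risk": 0.10},  # matching date in imaging metadata
]

def cumulative_risk(events):
    """P(at least one linkage succeeds) = 1 - prod(1 - p_i)."""
    survival = 1.0
    for e in events:
        survival *= 1.0 - e["risk"]
    return 1.0 - survival
```

Each event clears the per-event threshold on its own, but the stream as a whole is roughly 0.35 — which is the gap a per-document benchmark never sees.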
So I built a benchmark around that.
What it does
The dataset simulates a multimodal clinical stream across text, ASR, image, waveform, and audio proxy events. An RL controller (PPO) watches cumulative re-identification risk build across events and selects a masking policy from five tiers: raw, weak, synthetic, pseudo, redact. The policy escalates as risk crosses configurable thresholds.
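The escalation logic can be sketched with a static threshold map (the actual controller is a learned PPO policy, and the threshold values below are illustrative, not the benchmark's):

```python
# Five masking tiers from the post, least to most aggressive.
TIERS = ["raw", "weak", "synthetic", "pseudo", "redact"]

# Hypothetical threshold values; the benchmark's are configurable.
THRESHOLDS = [0.2, 0.4, 0.6, 0.8]

def select_tier(cumulative_risk: float) -> str:
    """Escalate the masking tier as cumulative risk crosses each threshold."""
    for tier, threshold in zip(TIERS, THRESHOLDS):
        if cumulative_risk < threshold:
            return tier
    return TIERS[-1]  # above the top threshold: redact
```

The PPO controller replaces this fixed map with a policy that also conditions on the per-event risk breakdown, which is what lets it stay lenient on low-risk events instead of ratcheting monotonically.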
The result on the bursty workload: Privacy@HighRisk 0.9907 and Utility@LowRisk 0.8466, at the same time. No static policy achieves both: Always-Redact matches the privacy score but drives utility to zero, and Always-Pseudo comes close on privacy but drops utility to 0.44. Re-identifier AUROC falls by 0.9167 across 10 runs (std 0.0).
What’s in this dataset
Three HuggingFace configs:
- `default` – 34 live adaptive masking events, each with full risk breakdown, policy decision, consent status, and CRDT risk
- `signed` – the same events with ECDSA signatures and a Merkle chain, FHIR-exportable
- `crossmodal` – 260 rows across 5 scenarios testing cross-modal PHI linkage. Scenario E is an adversarial attacker that stays below individual risk thresholds while accumulating cross-modal links across modalities
Supplementary files:
- leakage breakdown by entity type (MRN, date, name, facility)
- full risk component trace per event (units factor, recency, link bonus)
- threshold sensitivity sweep across 8 values and 3 workloads
- baseline comparison: 6 policies across 3 workloads, every number needed to reproduce the Pareto frontier
What’s new in this version
Previously the dataset had placeholder charts and incomplete supplementary files. This version has everything regenerated from the actual run: risk trace, leakage breakdown, threshold sensitivity, and all four inline charts in the README are plotted from real event data.
Also added the adversarial crossmodal scenario, the FHIR audit trail, and the signed Merkle audit log for anyone working on compliance tooling.
Also just published: DAG Remediation Traces
The streaming benchmark tells you when to escalate masking policy. This companion dataset is about what to actually do once you know risk is high: how to plan a remediation sequence across modalities under a cost budget.
vkatg/dag_remediation_traces : 8,500 records pairing a patient risk profile with a full DAG planning trace: which actions were selected, which dependency injections fired, the topological execution order, residual risk, and total cost. Covers text, image, audio, and EHR modalities with 14 actions and a dependency graph that forces realistic ordering constraints (you can’t run cross_modal_unlink without first redacting faces, stripping voice, and masking direct IDs).
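The ordering constraint can be sketched with Python's standard-library `graphlib` (the action names are from the post; the exact edge set in the dataset's dependency graph is richer than this toy):

```python
from graphlib import TopologicalSorter

# cross_modal_unlink may only run after its three prerequisites.
deps = {
    "cross_modal_unlink": {"redact_faces", "strip_voice", "mask_direct_ids"},
    "redact_faces": set(),
    "strip_voice": set(),
    "mask_direct_ids": set(),
}

# static_order() yields a valid topological execution order.
order = list(TopologicalSorter(deps).static_order())
```

Any valid plan in the traces respects this kind of ordering, which is what makes naive greedy action selection fail on the dependency-heavy records.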
There’s also a hard config: 1,000 records with input risk fixed between 0.75 and 0.99 and budget capped at 0.60. Every record requires planning, the injected dependency rate jumps from 33% to 68%, and average residual risk is higher. It’s there specifically to stress-test planners rather than just evaluate on easy cases.
The two datasets connect directly: the risk_score and retok_prob inputs in the DAG traces are the kind of outputs the DCPG encoder in the streaming benchmark produces. If you’re building end-to-end, one feeds the other.
```python
from datasets import load_dataset

# default split
ds = load_dataset("vkatg/dag_remediation_traces")

# hard benchmark split
hard = load_dataset("vkatg/dag_remediation_traces", "hard")
```
MIT licensed, no DUA, fully synthetic
i2b2 and PhysioNet both require data use agreements. That makes sense given they have real patient data. It also means you can’t just clone and run without going through an approval process.
This has no real patient data. You can load it right now:
```python
from datasets import load_dataset

ds = load_dataset("vkatg/streaming-phi-deidentification-benchmark")
cm = load_dataset("vkatg/streaming-phi-deidentification-benchmark", "crossmodal")
```
Full code is at phi-exposure-guard if you want to run the controller yourself or extend it.
If you’re building a de-identification system, doing privacy research, or working on clinical NLP, what’s the hardest part you keep running into? Is it the risk scoring, the action selection, or something earlier in the pipeline like PHI detection itself? Curious where people are actually getting stuck.