LatencyKills 10 hours ago

Couldn't this be used to locate private data in unstructured text without having to rely on other means of PII detection?

1. Pass the raw text through the filter to obtain the spans.

2. Map all the spans back to the original text.

Now you have all the PII.
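The two steps above can be sketched roughly as follows. This is only an illustration and does not call the actual filter: it assumes the model's output has already been converted into character-offset spans of the form (start, end, label). The `Span` type, the `extract_pii` helper, the sample text, and the `PRIVATE_EMAIL` label are all hypothetical.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical span shape: character offsets plus a type label."""
    start: int
    end: int
    label: str


def extract_pii(text: str, spans: list[Span]) -> list[tuple[str, str]]:
    """Map each span back to the original text, in document order."""
    return [(s.label, text[s.start:s.end]) for s in sorted(spans)]


# Made-up input; in practice the spans would come from the filter's output.
text = "Contact Jane Doe at jane@example.com."
spans = [
    Span(20, 36, "PRIVATE_EMAIL"),  # assumed label, for illustration only
    Span(8, 16, "PRIVATE_NAME"),
]

for label, value in extract_pii(text, spans):
    print(f"{label}: {value}")
```

Nothing here is specific to redaction: once the offsets exist, recovering the underlying values is a plain string slice.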

Everdred2dx 9 hours ago | parent | next [-]

Yep, and it has already been done.

https://github.com/chiefautism/privacy-parser

yjftsjthsd-h 8 hours ago | parent | prev [-]

If you have the redacted and unredacted versions, then you can diff them; that seems unsurprising? Unless I'm really misunderstanding "spans"?

LatencyKills an hour ago | parent [-]

> If you have the redacted and unredacted versions, then you can diff them; that seems unsurprising?

I'm suggesting that a model designed for high-accuracy redaction can also be used to find all PII in unredacted text. For example, if I don't already know how to find PII myself (e.g., with regex or NLP), I can use OpenAI's Privacy Filter model to do the work for me.

And because each span has a type (PRIVATE_NAME, etc.), I don't even need to do any extra work to pull out only the specific information I am looking for, which is something that simple diffing wouldn't give me.
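To make the point about typed spans concrete, here is a minimal sketch that assumes spans arrive as (start, end, label) tuples. Only `PRIVATE_NAME` is taken from the discussion; the `group_by_label` helper, the sample text, and the `PRIVATE_EMAIL` label are hypothetical.

```python
from collections import defaultdict


def group_by_label(text: str, spans: list[tuple[int, int, str]]) -> dict[str, list[str]]:
    """Group the text covered by each span under that span's type label."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for start, end, label in sorted(spans):
        grouped[label].append(text[start:end])
    return dict(grouped)


# Made-up input for illustration; labels other than PRIVATE_NAME are assumed.
sample = "Jane Doe emailed John Roe from jane@example.com."
sample_spans = [
    (0, 8, "PRIVATE_NAME"),
    (17, 25, "PRIVATE_NAME"),
    (31, 47, "PRIVATE_EMAIL"),
]

by_label = group_by_label(sample, sample_spans)
names_only = by_label.get("PRIVATE_NAME", [])
print(names_only)
```

A diff of the redacted and unredacted text would yield the same substrings, but without labels there would be no direct way to select only the names.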

I'm not saying it's an issue, I just think it is interesting that a tool designed to protect PII can also be used to find it with minimal effort. And it looks like someone already implemented it: https://github.com/chiefautism/privacy-parser.