Top Secret: Automatically filter sensitive information (thoughtbot.com)
125 points by thunderbong 4 days ago | 15 comments
fine_tune 3 days ago
I'm no Ruby expert, so forgive my ignorance, but it looks like a small NER model packaged behind a string convenience wrapper named `filter` that tries to filter "sensitive info" out of input strings. I assume the NER model is small enough to run on CPU at under ~1s per pass, at the cost of storage per instance (1s is fast enough in dev; in prod with long conversations, that's a lot of inference time). Generally a neat idea, though. A couple of questions:
- NER often doesn't perform well across different domains; how accurate is the model?
- How do you actually allocate compute/storage for inference on the NER model?
- Are you batching these `filter` calls, or are they just sequential, one-by-one calls?
neilv 3 days ago
When I had to implement "deidentification" for a kind of sensitive safety reporting, an LLM would've been a good way to augment the approaches I used. Today, if I had to do it, I'd probably throw multiple computer approaches at it, including an LLM-based one, take the union of those as the computer result, and check it against a human result. (If computer and human agree, that's a good sign; if they disagree, see why before the document goes where it needs to be deidentified.)
(In some kinds of flight safety reporting, any kind of personnel can submit a report about any observation related to safety. It gets handled and analyzed very seriously. There are also multiple ways in which the reporting parties are protected. There are situations in which some artifacts need to have identifying information redacted.)
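A minimal Python sketch of the "union of detectors, then check against a human" idea described above. Everything here is hypothetical and for illustration only: the two regex detectors stand in for whatever approaches you actually use (an NER model or an LLM-based detector would plug in as just another function returning spans), and the patterns are deliberately simplistic.

```python
import re
from typing import Callable, Dict, Iterable, List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive
Detector = Callable[[str], Set[Span]]


def email_detector(text: str) -> Set[Span]:
    # Hypothetical stand-in detector: a deliberately simple email pattern.
    return {m.span() for m in re.finditer(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)}


def phone_detector(text: str) -> Set[Span]:
    # Hypothetical stand-in detector: only matches the NNN-NNN-NNNN format.
    return {m.span() for m in re.finditer(r"\b\d{3}-\d{3}-\d{4}\b", text)}


def union_spans(text: str, detectors: Iterable[Detector]) -> List[Span]:
    """Take the union of all detectors' spans, merging overlapping ones."""
    spans = sorted(set().union(*(d(text) for d in detectors)))
    merged: List[Span] = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def compare(machine: Iterable[Span], human: Iterable[Span]) -> Dict[str, Set[Span]]:
    """Split spans into agreed / machine-only / human-only for review."""
    m, h = set(machine), set(human)
    return {"agreed": m & h, "machine_only": m - h, "human_only": h - m}


def redact(text: str, spans: Iterable[Span]) -> str:
    # Replace from the end so earlier offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + "[REDACTED]" + text[end:]
    return text


if __name__ == "__main__":
    text = "Contact jane@example.com or 555-123-4567."
    machine = union_spans(text, [email_detector, phone_detector])
    human = [(8, 24)]  # pretend the human reviewer only marked the email
    report = compare(machine, human)
    if report["machine_only"] or report["human_only"]:
        print("Disagreement, review before release:", report)
    print(redact(text, machine))
```

Exact span equality is a strict way to compare; in practice you would probably want overlap-based matching so that a human marking a slightly wider or narrower span does not count as a disagreement.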
sbpayne 3 days ago
This is great, but it does not "prevent"; it only reduces the chances of sensitive information getting through. NER is not 100% accurate. It is very good in many cases, but use with caution!
dwa3592 3 days ago
Oh hey! Good to see this. I built something similar in Python a while ago. Check it out: https://github.com/deepanwadhwa/zink The shield functionality fits directly in your LLM workflow.
wombatpm 3 days ago
There is an extension for PostgreSQL, https://postgresql-anonymizer.readthedocs.io, that allows you to mask data by user or group at the schema level, with options to return a full mask, a partial mask, or dummy data.
thinkingemote 3 days ago
Can filters of this type be used when doing a screenshare or streaming? E.g., for any application that is on the screen. It would be good not to have to worry about leaking a lot of stuff to the world.
jgalt212 3 days ago
How can one use this for logging? Won't the LLM performance lag grind your application to a halt?
keepamovin 2 days ago
US Marshals told me they need tech like this: automatic redaction.