Remix.run Logo
timbilt 4 days ago

I think one of the biggest advantages is the security/privacy benefit — you can see in the demo that the model can mask entities instead of tagging. This means that instead of transcribing and then scrubbing sensitive info, you can prevent the sensitive info from ever being transcribed. Another potential benefit is in lower latency. The paper doesn't specifically mention latency but it seems to be on par with normal Whisper, so you save all of the time it would normally take to do entity tagging — big deal for real-time applications

Tsarp 3 days ago | parent [-]

Ive worked on some enterprise NER systems (specifically privacy/redaction), and in almost all cases the cost of missing out masking was significantly higher than latency (ofc in an ideal world youd have both).

And in all the research we did, the best solutions ended up passing through a workflow of 1.NN based NER, 2.Regex and 3.Dictionary look ups to really clean information. Using a single method worked well in customer demos but always ended up in what we thought were edge cases in prod.

That being said, latency stuff makes sense. This might work great in conversational use cases. Picking out intent and responding. Every millisecond helps in making things sound natural.