Tsarp 4 days ago
Wouldn't it be better to run normal Whisper and NER on top of the transcription before streaming a response or writing anything to disk? What advantage does this offer?
timbilt 4 days ago
I think one of the biggest advantages is the security/privacy benefit: you can see in the demo that the model can mask entities instead of tagging them. This means that instead of transcribing and then scrubbing sensitive info, you can prevent the sensitive info from ever being transcribed. Another potential benefit is lower latency. The paper doesn't specifically mention latency, but it seems to be on par with normal Whisper, so you save all of the time it would normally take to do entity tagging, which is a big deal for real-time applications.
PeterisP 3 days ago
The general principle is that "pipelines" impose a restriction: the errors of the first step get baked in, and the pipeline can't effectively use the knowledge of the following step(s) to fix them. So if the first step isn't near-perfect (which ASR isn't), and if there is some information or "world knowledge" in the later step(s) that is helpful in deciding between the first step's candidate outputs (which is true of knowledge about named entities and ASR), then you can get better accuracy by having an end-to-end system where you don't attempt to pick just one best option at the system boundary. Also, joint training can be helpful, but that IMHO is less important.
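To make the "one best option at the system boundary" point concrete, here is a toy sketch. It is not how the model in the paper works: n-best rescoring stands in for a real end-to-end system, and the hypotheses, probabilities, name list, and bonus value are all made up for illustration.

```python
import math

# Hypothetical n-best list from an ASR step: (transcript, log-probability).
# The transcripts and probabilities are invented for this example.
N_BEST = [
    ("call peter is at noon", math.log(0.40)),
    ("call peteris at noon", math.log(0.35)),
    ("call pete iris at noon", math.log(0.25)),
]

# Hypothetical "world knowledge" that only the later (entity) step has.
KNOWN_NAMES = {"peteris"}


def entity_bonus(transcript, bonus=1.0):
    """Add a fixed log-score bonus for each word that is a known entity."""
    return sum(bonus for word in transcript.split() if word in KNOWN_NAMES)


def pipeline(n_best):
    """Commit to the ASR 1-best; later steps never see the alternatives."""
    best, _ = max(n_best, key=lambda hyp: hyp[1])
    return best


def joint(n_best):
    """Score ASR confidence and entity knowledge together before choosing."""
    best, _ = max(n_best, key=lambda hyp: hyp[1] + entity_bonus(hyp[0]))
    return best


print(pipeline(N_BEST))  # call peter is at noon
print(joint(N_BEST))     # call peteris at noon
```

The pipeline throws away the second hypothesis before the entity knowledge gets a chance to prefer it, and that is the baked-in error.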
its_down_again 3 days ago
From my experience, ASR-to-NER pipelines don't perform adequately out of the box. Even though SOTA ASR systems claim 85% word accuracy, the distribution of errors is worth looking into. Critical entities like credit card numbers or addresses are particularly error-prone, and even a small mistake renders the result useless. These ASR errors cascade into the NER step, further degrading recall and precision. Combining ASR and NER into a joint model or integrated approach can reduce these issues in theory; it's just more complex to implement and less commonly used.
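A rough illustration of the cascade, under simplifying assumptions: the "NER" step here is just a regex for a 16-digit card number, the word accuracy is a naive position-wise match, and the transcripts are invented (the number is a well-known test card number, not a real one).

```python
import re

# Stand-in for an NER step: four space-separated groups of four digits.
CARD_PATTERN = re.compile(r"\b(?:\d{4} ){3}\d{4}\b")


def find_cards(text):
    """Return every substring that looks like a 16-digit card number."""
    return CARD_PATTERN.findall(text)


def word_accuracy(reference, hypothesis):
    """Fraction of positions where the words match.

    Naive: assumes both transcripts have the same number of words.
    """
    ref, hyp = reference.split(), hypothesis.split()
    return sum(r == h for r, h in zip(ref, hyp)) / len(ref)


reference = "my card number is 4111 1111 1111 1111 thanks"
# Hypothetical ASR output that drops a single digit.
hypothesis = "my card number is 4111 1111 111 1111 thanks"

print(word_accuracy(reference, hypothesis))  # 8 of 9 words match
print(find_cards(reference))   # ['4111 1111 1111 1111']
print(find_cards(hypothesis))  # []
```

One wrong word out of nine still looks fine on a word-accuracy metric, but the entity is gone entirely by the time the second step runs.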
conradev 4 days ago
Yeah, I’m also curious about that. Does combining ASR and NER into one model improve performance for either?
|