Wouldn't it be better to run normal Whisper and NER on top of the transcription before streaming a response or writing anything to disk?

What advantage does this offer?


timbilt 8 months ago | parent | next [-]

I think one of the biggest advantages is the security/privacy benefit: you can see in the demo that the model can mask entities instead of tagging them. This means that instead of transcribing and then scrubbing sensitive info, you can prevent the sensitive info from ever being transcribed. Another potential benefit is lower latency. The paper doesn't specifically mention latency, but it seems to be on par with normal Whisper, so you save all of the time it would normally take to do entity tagging, which is a big deal for real-time applications.

	Tsarp 8 months ago | parent [-]
		I've worked on some enterprise NER systems (specifically privacy/redaction), and in almost all cases the cost of a missed mask was significantly higher than the cost of latency (of course, in an ideal world you'd have both). In all the research we did, the best solutions ended up passing through a workflow of (1) NN-based NER, (2) regex, and (3) dictionary lookups to really clean the information. Using a single method worked well in customer demos but always failed in prod on what we had thought were edge cases. That being said, the latency argument makes sense. This might work great in conversational use cases: picking out intent and responding. Every millisecond helps in making things sound natural.
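
A minimal sketch of the layered workflow described above, assuming each layer reports character spans that are merged before masking. The pattern, the denylist, and the `ner` callback are hypothetical placeholders; a real system would plug an actual NER model into `ner` and use far broader patterns and dictionaries.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int]

# Hypothetical pattern: 13 to 16 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")
# Hypothetical dictionary of known sensitive terms.
DENYLIST = ["Alice Johnson"]


def regex_spans(text: str) -> List[Span]:
    """Layer 2: pattern-based detection."""
    return [m.span() for m in CARD_RE.finditer(text)]


def dictionary_spans(text: str, terms: List[str] = DENYLIST) -> List[Span]:
    """Layer 3: case-insensitive lookup of known terms."""
    spans = []
    for term in terms:
        for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            spans.append(m.span())
    return spans


def no_ner(text: str) -> List[Span]:
    """Layer 1 placeholder: stands in for a neural NER model (assumption)."""
    return []


def redact(text: str,
           ner: Callable[[str], List[Span]] = no_ner,
           mask: str = "[REDACTED]") -> str:
    """Mask the union of the spans found by all three layers."""
    spans = sorted(list(ner(text)) + regex_spans(text) + dictionary_spans(text))
    merged: List[Span] = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlapping or touching spans collapse into a single mask.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    out, pos = [], 0
    for start, end in merged:
        out.append(text[pos:start])
        out.append(mask)
        pos = end
    out.append(text[pos:])
    return "".join(out)
```

Taking the union of spans means a miss by one layer can still be caught by another, which matches the point that a missed mask costs more than extra processing.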


PeterisP 8 months ago | parent | prev | next [-]

The general principle is that "pipelines" impose a restriction: the errors of the first step get baked in, and the system can't effectively use the knowledge of the following step(s) to fix them.

So if the first step isn't near-perfect (which ASR isn't), and if the later step(s) hold some information or "world knowledge" that helps resolve the first step's ambiguities (which is true of named-entity knowledge and ASR), then you can get better accuracy from an end-to-end system where you don't commit to just one best option at the system boundary. Joint training can also help, but IMHO that is less important.
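
A toy illustration of that boundary effect, assuming the ASR step exposes an n-best list with log-probability scores. This is rescoring, not a true end-to-end model, and the gazetteer, scores, and `entity_bonus` weight are all made up for the example.

```python
from typing import List, Tuple

Hypothesis = Tuple[str, float]  # (transcript, ASR log-probability)

# Hypothetical "world knowledge" held by the later step.
GAZETTEER = {"new york", "san jose"}


def find_entities(text: str) -> List[str]:
    """Stand-in for NER: substring lookup against the gazetteer."""
    return [e for e in sorted(GAZETTEER) if e in text]


def pipeline(nbest: List[Hypothesis]) -> Tuple[str, List[str]]:
    """Commit to the 1-best transcript first, then tag entities."""
    best_text, _ = max(nbest, key=lambda h: h[1])
    return best_text, find_entities(best_text)


def joint(nbest: List[Hypothesis],
          entity_bonus: float = 0.5) -> Tuple[str, List[str]]:
    """Let entity knowledge influence which transcript is chosen."""
    def score(h: Hypothesis) -> float:
        text, asr_logp = h
        return asr_logp + entity_bonus * len(find_entities(text))

    best_text, _ = max(nbest, key=score)
    return best_text, find_entities(best_text)


nbest = [("i live in new work", -1.0), ("i live in new york", -1.2)]
print(pipeline(nbest))  # keeps the ASR error, so no entity is found
print(joint(nbest))     # the entity evidence flips the choice
```

The pipeline discards the second hypothesis before the entity step ever sees it, which is the "baked in" error; the joint scorer keeps both options alive until all the evidence is combined.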


its_down_again 8 months ago | parent | prev | next [-]

From my experience, ASR-to-NER pipelines don't perform adequately out of the box. Even though SOTA ASR systems claim 85% word accuracy, the distribution of errors is worth looking into. Critical entities like credit card numbers or addresses are particularly prone to errors, and even a small mistake renders the result useless.

These ASR errors cascade into the NER step, further degrading recall and precision. Combining ASR and NER into a joint model or integrated approach can reduce these issues in theory; it's just more complex to implement and less commonly used.
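
A rough back-of-the-envelope for why word-level accuracy understates the problem for entities, under the simplifying assumption that token errors are independent (the comment above suggests errors actually cluster around entities, so this is if anything optimistic). The function name and the 16-token example are illustrative only.

```python
def entity_exact_rate(token_accuracy: float, n_tokens: int) -> float:
    """Probability that every token of an n-token entity is transcribed
    correctly, assuming independent per-token errors (an assumption)."""
    if not 0.0 <= token_accuracy <= 1.0:
        raise ValueError("token_accuracy must be between 0 and 1")
    if n_tokens < 0:
        raise ValueError("n_tokens must be non-negative")
    return token_accuracy ** n_tokens


# A 16-digit card number spoken digit by digit, at 85% per-token accuracy.
rate = entity_exact_rate(0.85, 16)
print(f"{rate:.1%} of such entities would come out fully correct")
```

Because one wrong digit makes the whole entity useless, the entity-level rate falls off exponentially with entity length even when per-word accuracy looks reasonable.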


conradev 8 months ago | parent | prev [-]

Yeah, I’m also curious about that. Does combining ASR and NER into one model improve performance for either?

	anewhnaccount2 8 months ago | parent [-]
		Almost definitely. You can think of there being a type of triangle inequality for cascading different systems, where manually combined systems almost always perform worse given comparable data and model capacity. Put another way, you have tied the model's hands by forcing it to bottleneck through a representation you chose.