alsetmusic 7 hours ago

Seems like one of the places where LLMs make a lot of sense. I see some boneheaded transcriptions in videos pretty regularly. Comparing them against "more-likely" words or phrases seems like an ideal use case.
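A minimal sketch of what that correction pass could look like, assuming a generic chat-style LLM behind a caller-supplied call_llm function (the prompt and names here are illustrative, not any particular vendor's API):

    from typing import Callable

    PROMPT = (
        "You are correcting an automatic speech-recognition transcript. "
        "Fix only words that are clearly mis-heard (homophones, garbled names). "
        "Do not rephrase, summarize, or 'improve' the speaker's wording.\n\n"
        "Transcript:\n{transcript}\n\nCorrected transcript:"
    )

    def correct_transcript(transcript: str, call_llm: Callable[[str], str]) -> str:
        """Ask the LLM to repair likely mis-transcriptions and nothing more."""
        return call_llm(PROMPT.format(transcript=transcript))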

leetharris 6 hours ago | parent | next [-]

A few problems with this approach:

1. It pulls everything back toward the "average," so outliers get discarded. For example, a circus performer mentions playing fetch with their frog; an LLM would treat that as an obvious error and correct it to "dog."

2. LLMs want to format everything as internet text, which does not align well with natural human speech.

3. Hallucinations still happen at scale, regardless of model quality.

We've done a lot of experiments on this at Rev and it's still useful for the right scenario, but not as reliable as you may think.

falcor84 4 hours ago | parent [-]

Regarding the frog, I would assume that the way to address this would be to feed the LLM screenshots from the video, if the budget allows.
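A rough sketch of that idea, assuming word-level timestamps are available from the ASR output: pull one frame per low-confidence word with OpenCV so the frames can be attached to a multimodal correction prompt. The pairing-with-timestamps scheme is my assumption, not a description of any production pipeline.

    import cv2  # pip install opencv-python

    def frames_for_words(video_path: str, word_times_s: list[float]) -> list:
        """Grab one decoded frame (BGR ndarray) near each word timestamp."""
        cap = cv2.VideoCapture(video_path)
        frames = []
        for t in word_times_s:
            cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000.0)  # seek to the word's time
            ok, frame = cap.read()
            if ok:
                frames.append(frame)
        cap.release()
        return frames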

leetharris 4 hours ago | parent [-]

Generally yes. That being said, sometimes multimodal LLMs show decreased performance with extra modalities.

The extra dimensions of analysis cause increased hallucination at times. So maybe it solves the frog problem, but now it's hallucinating in another section because it got confused by another frame's tokens.

One thing we've wanted to explore lately is video-based diarization. If I have a video to accompany some audio, can I help with crosstalk and sound separation by matching lips to the audio and assigning the correct speaker more accurately? There's likely something there.
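A toy version of the lip-matching idea, just to make it concrete: given a voice-activity signal for a segment and a mouth-openness signal per tracked face (both assumed precomputed and sampled on the same time grid), assign the segment to the face whose mouth movement correlates best with the speech. Real systems learn this jointly rather than via simple correlation.

    import numpy as np

    def assign_speaker(voice_activity: np.ndarray,
                       mouth_openness_per_face: dict[str, np.ndarray]) -> str:
        """Return the face ID whose mouth movement best tracks the audio."""
        best_face, best_corr = None, -np.inf
        for face_id, mouth in mouth_openness_per_face.items():
            corr = np.corrcoef(voice_activity, mouth)[0, 1]
            if corr > best_corr:
                best_face, best_corr = face_id, corr
        return best_face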

orion138 3 hours ago | parent [-]

Google published Looking to Listen a while back.

https://research.google/blog/looking-to-listen-audio-visual-...

dylan604 4 hours ago | parent | prev | next [-]

What about the cases where the speaker is actually using nonsense words during a meandering, off-topic bit of "weaving"? Replacing those nonsense words would be a disservice, as it would totally change the tone of the speech.

devmor 6 hours ago | parent | prev | next [-]

Those transcriptions are already done by LLMs in the first place - in fact, audio transcription was one of the very first large-scale commercial uses of the technology in its current iteration.

This is just playing a game of Markov telephone, where the step in OP's solution likely costs more compute than the step YT uses, because YT is interested in minimizing costs.

albertzeyer 3 hours ago | parent [-]

Probably just "regular" LMs, not large LMs. I'd assume something with 10-100M params or so, which is cheap to run (and very standard for ASR).
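That's the classic n-best rescoring setup: the ASR decoder emits several scored hypotheses and a modest LM re-ranks them. A minimal sketch, with lm_logprob standing in for whatever small LM scorer is used (the names and the 0.5 weight are illustrative):

    from typing import Callable

    def rescore(nbest: list[tuple[str, float]],          # (hypothesis, ASR score)
                lm_logprob: Callable[[str], float],      # small LM log-probability
                lm_weight: float = 0.5) -> str:
        """Pick the hypothesis with the best combined ASR + LM score."""
        return max(nbest, key=lambda h: h[1] + lm_weight * lm_logprob(h[0]))[0]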

petesergeant 6 hours ago | parent | prev [-]

Also useful, I think, for checking human-entered transcriptions, which, even on expensively produced shows, can often be garbage or just wrong. One human + two separate LLMs, plus something to tie-break, and we could possibly finally get decent subtitles for stuff.
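One simple way to do the tie-break, sketched per subtitle segment: keep whichever of the three versions (human + two LLMs) agrees most, on average, with the other two. Purely illustrative; the per-segment framing and the similarity measure are assumptions.

    from difflib import SequenceMatcher

    def pick_segment(versions: list[str]) -> str:
        """Return the candidate most similar, on average, to the others."""
        def avg_sim(i: int) -> float:
            others = [v for j, v in enumerate(versions) if j != i]
            return sum(SequenceMatcher(None, versions[i], o).ratio()
                       for o in others) / len(others)
        return versions[max(range(len(versions)), key=avg_sim)]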