leetharris 5 hours ago

A few problems with this approach:

1. It brings everything back to the "average." Any outliers get discarded. For example, someone who is a circus performer plays fetch with their frog. An LLM would think this is an obvious error and correct it to "dog."

2. LLMs want to format everything as internet text, which does not align well with natural human speech.

3. Hallucinations still happen at scale, regardless of model quality.

We've done a lot of experiments on this at Rev and it's still useful for the right scenario, but not as reliable as you may think.

falcor84 3 hours ago

Regarding the frog, I would assume that the way to address this would be to feed the LLM screenshots from the video, if the budget allows.

leetharris 3 hours ago

Generally yes. That being said, sometimes multimodal LLMs show decreased performance with extra modalities.

The extra dimensions of analysis cause increased hallucination at times. So maybe it solves the frog problem, but now it's hallucinating in another section because it got confused by another frame's tokens.

One thing we've wanted to explore lately is video-based diarization. If I have a video to accompany some audio, can I help with crosstalk and sound separation by matching lips with audio, and assign the correct speaker more accurately? There's likely something there.

orion138 3 hours ago

Google published "Looking to Listen" a while back.

https://research.google/blog/looking-to-listen-audio-visual-...