Regarding the frog, I'd assume the fix is to feed the LLM screenshots from the video, if the budget allows.

leetharris 8 months ago

Generally yes. That being said, sometimes multimodal LLMs show decreased performance with extra modalities.

The extra dimensions of analysis cause increased hallucination at times. So maybe it solves the frog problem, but now it's hallucinating in another section because it got confused by another frame's tokens.

One thing we've wanted to explore lately is video-based diarization. If I have video to accompany some audio, can I handle crosstalk and sound separation better by matching lip movement to the audio, and so assign the correct speaker more accurately? There's likely something there.

	orion138 8 months ago
		Google published Looking to Listen a while back. https://research.google/blog/looking-to-listen-audio-visual-...