leetharris | 3 hours ago
Generally yes. That said, multimodal LLMs sometimes show decreased performance when extra modalities are added: the extra dimensions of analysis can increase hallucination. So maybe it solves the frog problem, but now it hallucinates in another section because it got confused by another frame's tokens. One thing we've wanted to explore lately is video-based diarization: if I have video to accompany the audio, can I handle cross-talk and sound separation by matching lips to the audio and assigning the correct speaker more accurately? There's likely something there.
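A toy sketch of that lip-matching idea (everything here is hypothetical and simplified, not any real diarization API): correlate the audio energy envelope of a speech segment with each visible face's mouth-motion signal, and assign the segment to the face whose lips move most in sync with the audio.

```python
# Hypothetical sketch: assign a speech segment to a speaker by correlating
# the per-frame audio energy with per-frame lip-motion magnitude per face.
from statistics import mean


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den if den else 0.0


def assign_speaker(audio_envelope, lip_motion_by_face):
    """Return the face id whose lip-motion signal best tracks the audio.

    audio_envelope: per-frame audio energy for one speech segment.
    lip_motion_by_face: dict mapping face id -> per-frame mouth-motion magnitude.
    """
    return max(
        lip_motion_by_face,
        key=lambda fid: pearson(audio_envelope, lip_motion_by_face[fid]),
    )


# Example with made-up signals: face "B" moves in sync with the audio,
# face "A" barely moves, so the segment is attributed to "B".
audio = [0.1, 0.9, 0.8, 0.2, 0.9, 0.7]
lips = {
    "A": [0.5, 0.4, 0.5, 0.6, 0.4, 0.5],
    "B": [0.2, 0.8, 0.9, 0.1, 0.8, 0.6],
}
print(assign_speaker(audio, lips))  # prints "B"
```

A real system would of course use learned audio-visual embeddings rather than raw correlation, but the assignment step has the same shape: score each face track against the audio and pick the best match.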
orion138 | 3 hours ago | parent
Google published Looking to Listen a while back. https://research.google/blog/looking-to-listen-audio-visual-...