yjftsjthsd-h 14 hours ago
> Visual prompting: Click on the person or object in the video that’s making a sound to isolate their audio.

How does that work? Correlating sound with movement?
janalsncm 12 hours ago | parent
If it’s anything like the original SAM, thousands of hours of annotator time. If I had to do it synthetically: take single subjects each making a single sound, mix them together, then train a model to separate them again (sketched below).
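A minimal sketch of that mix-then-separate recipe, assuming nothing beyond NumPy. The waveforms here are random stand-ins for real recordings, and a real pipeline would also pair each clip with its video frames and the clicked region:

    import numpy as np

    def make_training_example(sources):
        """Mix isolated single-subject clips into one waveform.

        Returns (mixture, targets): the model hears `mixture` and is
        trained to reconstruct `targets`, the clean per-subject tracks.
        """
        n = min(len(s) for s in sources)              # truncate to shortest clip
        targets = np.stack([s[:n] for s in sources])  # supervision signal
        mixture = targets.sum(axis=0)                 # naive additive mix
        mixture /= max(np.abs(mixture).max(), 1e-8)   # normalize to avoid clipping
        return mixture, targets

    # Random noise as stand-ins for real recordings (e.g. a drum, a voice).
    rng = np.random.default_rng(0)
    mix, targets = make_training_example([rng.standard_normal(16_000),
                                          rng.standard_normal(16_000)])

Because you built the mixture yourself, you get ground-truth separated tracks for free, which is exactly the kind of supervision that's expensive to get from annotators.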
yodon 14 hours ago | parent
Think about it conceptually: Could you watch a music video and say "that's the snare drum, that's the lead singer, keyboard, bass, that's the truck that's making the engine noise, that's the crowd that's cheering, oh and that's a jackhammer in the background"? So can AI. Could you point out who is lead guitar and who is rhythm guitar? So can AI.