yjftsjthsd-h 14 hours ago
> Visual prompting: Click on the person or object in the video that’s making a sound to isolate their audio.

How does that work? Correlating sound with movement?
janalsncm 12 hours ago | parent
If it’s anything like the original SAM, thousands of hours of annotator time. If I had to do it synthetically: take single subjects each making a single sound, mix them together, then train a model to separate them again (sketched below).
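A minimal sketch of that mix-then-separate recipe, assuming nothing beyond NumPy. The waveforms here are random stand-ins for real recordings, and a real pipeline would also pair each clip with its video frames and the clicked region:

    import numpy as np

    def make_training_example(sources):
        """Mix isolated single-subject clips into one waveform.

        Returns (mixture, targets): the model hears `mixture` and is
        trained to reconstruct `targets`, the clean per-subject tracks.
        """
        n = min(len(s) for s in sources)              # truncate to shortest clip
        targets = np.stack([s[:n] for s in sources])  # supervision signal
        mixture = targets.sum(axis=0)                 # naive additive mix
        mixture /= max(np.abs(mixture).max(), 1e-8)   # normalize to avoid clipping
        return mixture, targets

    # Random noise as stand-ins for real recordings (e.g. a drum, a voice).
    rng = np.random.default_rng(0)
    mix, targets = make_training_example([rng.standard_normal(16_000),
                                          rng.standard_normal(16_000)])

Because you built the mixture yourself, you get ground-truth separated tracks for free, which is exactly the kind of supervision that's expensive to get from annotators.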
yodon 14 hours ago | parent
Think about it conceptually: Could you watch a music video and say "that's the snare drum, that's the lead singer, keyboard, bass, that's the truck that's making the engine noise, that's the crowd that's cheering, oh and that's a jackhammer in the background"? So can AI. Could you point out who is lead guitar and who is rhythm guitar? So can AI.