Kalabint 5 hours ago
> Can someone smarter than me explain what this is about?

I think you can find the answer under point 3:

> In this work, our primary goal is to show that pretrained text-to-image diffusion models can be repurposed as object trackers without task-specific finetuning.

Meaning that you can track objects in videos without using specialised ML models for video object tracking.
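To make the idea concrete, here is a toy sketch (not the paper's method) of how tracking can fall out of rich features: if a pretrained model's intermediate feature maps are semantically meaningful, you can follow a point across frames by nearest-neighbor matching of its feature vector. The random tensors below are stand-ins for real diffusion U-Net features; the function name and shapes are illustrative assumptions.

```python
import torch

def track_point(feat_a, feat_b, yx):
    """Find the pixel in feat_b whose feature best matches feat_a at yx."""
    c, h, w = feat_a.shape
    query = feat_a[:, yx[0], yx[1]]                 # (C,) feature at query point
    sims = torch.nn.functional.cosine_similarity(
        feat_b.reshape(c, -1), query[:, None], dim=0
    )                                               # (H*W,) similarity map
    idx = int(sims.argmax())
    return idx // w, idx % w                        # best-matching (y, x)

torch.manual_seed(0)
feat_a = torch.randn(64, 32, 32)                   # stand-in for frame-1 features
feat_b = feat_a.roll(shifts=(2, 3), dims=(1, 2))   # frame 2: "object" moved by (2, 3)
print(track_point(feat_a, feat_b, (10, 10)))       # → (12, 13)
```

With real diffusion features the matching is noisier than this exact-copy toy, but the principle is the same: no tracking-specific training, just feature correspondence.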
echelon 4 hours ago | parent
All of these emergent properties of image and video models lead me to believe that the evolution of animal intelligence around motility and visually understanding the physical environment might be "easy" relative to other "hard steps". The more complex an eye gets, the more the brain evolves not just the physics and chemistry of optics, but also rich feature sets for predator/prey labeling, tracking, movement, self-localization, distance, etc. These might not be separate things. They might just come "for free".