> Can someone smarter than me explain what this is about?

I think you can find the answer under point 3:

> In this work, our primary goal is to show that pretrained text-to-image diffusion models can be repurposed as object trackers without task-specific finetuning.

That means you can track objects in videos without using specialised ML models built for video object tracking.

echelon 4 hours ago | parent [-]

All of these emergent properties of image and video models lead me to believe that the evolution of animal intelligence around motility and visually understanding the physical environment might be "easy" relative to other "hard steps".

The more complex an eye gets, the more the brain evolves to handle not just the physics and chemistry of optics, but also rich feature sets for predator/prey labels, tracking, movement, self-localization, distance, etc.

These might not be separate things. These things might just come "for free".

jacquesm 3 hours ago | parent | next [-]

There is a massive amount of pre-processing already done in the retina itself and in the LGN:

https://en.wikipedia.org/wiki/Lateral_geniculate_nucleus

So the brain does not necessarily receive 'raw' images to process to begin with. A lot of high-level data, such as optical flow for detecting moving objects, has already been extracted by that point.
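
For anyone unfamiliar with the term, here is a rough sketch of what "optical flow" means computationally. It is only an illustration of the brightness-constancy idea (a single global Lucas-Kanade-style least-squares fit), not a model of what the retina or LGN actually do. The names `gaussian_blob` and `estimate_flow` and the synthetic test frames are made up for this example.

```python
import math

def gaussian_blob(size, cx, cy, sigma=4.0):
    """Synthetic grayscale frame: one smooth blob centred at (cx, cy)."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(size)] for y in range(size)]

def estimate_flow(frame0, frame1):
    """Estimate one global motion vector (u, v) in pixels per frame.

    Assumes brightness constancy, Ix*u + Iy*v = -It, and solves it in the
    least-squares sense over all interior pixels via the 2x2 normal equations.
    """
    h, w = len(frame0), len(frame0[0])
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Central-difference gradients of the average of the two frames.
            ix = 0.25 * ((frame0[y][x + 1] - frame0[y][x - 1])
                         + (frame1[y][x + 1] - frame1[y][x - 1]))
            iy = 0.25 * ((frame0[y + 1][x] - frame0[y - 1][x])
                         + (frame1[y + 1][x] - frame1[y - 1][x]))
            it = frame1[y][x] - frame0[y][x]  # temporal difference
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    det = sxx * syy - sxy * sxy
    u = (-syy * sxt + sxy * syt) / det
    v = (-sxx * syt + sxy * sxt) / det
    return u, v

# The blob moves half a pixel to the right between the two frames,
# so the estimate should come out close to (0.5, 0.0).
frame0 = gaussian_blob(32, 15.0, 16.0)
frame1 = gaussian_blob(32, 15.5, 16.0)
print(estimate_flow(frame0, frame1))
```

Real optical-flow methods estimate a vector per pixel or per patch rather than one vector for the whole frame, but the underlying constraint is the same.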

DrierCycle 2 hours ago | parent | next [-]

And the occipital is developed around extraordinary levels of image separation, broken down into tiny areas of the input, scattered and woven for details of motion, gradient, contrast, etc.

Mkengin 2 hours ago | parent | prev [-]

Interesting. So similar to the vision encoder + projector in VLMs?

fxtentacle 3 hours ago | parent | prev [-]

I wouldn't call these properties "emergent".

If you train a system to memorize A-B pairs and then normally use it to find B when given A, it's not surprising that finding A when given B also works: you trained it in an almost symmetrical fashion on A-B pairs, which are, obviously, also B-A pairs.
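
A toy way to see the symmetry argument is a bidirectional associative memory: one weight matrix is built from the A-B pairs, and the same matrix answers lookups in both directions. This is a deliberately tiny stand-in, not how a diffusion model works. The function names and the hand-picked, mutually orthogonal ±1 patterns are assumptions chosen so that recall is exact.

```python
def train(pairs):
    """Hebbian outer-product rule: W[i][j] = sum over pairs of a[i] * b[j]."""
    n, m = len(pairs[0][0]), len(pairs[0][1])
    W = [[0] * m for _ in range(n)]
    for a, b in pairs:
        for i in range(n):
            for j in range(m):
                W[i][j] += a[i] * b[j]
    return W

def sign(x):
    return 1 if x >= 0 else -1

def recall_b(W, a):
    """Forward lookup, A -> B: project a through W."""
    return [sign(sum(a[i] * W[i][j] for i in range(len(a))))
            for j in range(len(W[0]))]

def recall_a(W, b):
    """Reverse lookup, B -> A: same W, used transposed."""
    return [sign(sum(W[i][j] * b[j] for j in range(len(b))))
            for i in range(len(W))]

# Hypothetical patterns; each set is mutually orthogonal so recall is exact.
A = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1]]
B = [[1, -1, -1, 1], [1, 1, -1, -1], [1, -1, 1, -1]]
W = train(list(zip(A, B)))

for a, b in zip(A, B):
    print(recall_b(W, a) == b, recall_a(W, b) == a)
```

Nothing was trained specifically for the B-to-A direction; it falls out of storing the pairs in a shared structure, which is the point being made above.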