magicalhippo 4 hours ago

Skimming through the paper, here's my take.

Someone previously found that the cross-attention layers in text-to-image diffusion models capture the correlation between the input text tokens and the corresponding image regions, so one can use this to segment the image, for example picking out the pixels containing "cat". However, this segmentation was rather coarse. The authors of this paper found that also using the self-attention layers leads to a much more detailed segmentation.
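
As a very rough sketch of the idea (my own guess at the mechanics, not code from the paper): grab the cross-attention maps (image patches over text tokens) and the self-attention maps (image patches over image patches) from the UNet, take the column for the token you care about as a coarse mask, then refine it by propagating through the self-attention. The function name and tensor shapes here are assumptions:

    import torch

    def segment_token(cross_attn, self_attn, token_idx, n_refine=4):
        # cross_attn: (HW, n_tokens) -- how much each image patch attends to
        #             each text token, averaged over heads/layers (assumed).
        # self_attn:  (HW, HW)       -- patch-to-patch attention, rows sum to 1.
        # token_idx:  index of the word of interest, e.g. "cat".

        # Coarse mask: attention of every patch on the chosen token.
        mask = cross_attn[:, token_idx]            # (HW,)

        # Refinement: propagate through self-attention so patches that
        # belong to the same object end up with similar scores.
        for _ in range(n_refine):
            mask = self_attn @ mask
            mask = mask / (mask.max() + 1e-8)      # keep values in [0, 1]

        return mask                                # reshape to (H, W) and threshold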

They then extend this to video by using the self-attention between two consecutive frames to determine how the segmentation changes from one frame to the next.
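
Something like this, assuming you can build an attention/affinity matrix between the patches of frame t+1 and frame t from the model's features (the exact construction in the paper may differ):

    import torch

    def propagate(frame_affinities, first_mask):
        # frame_affinities: list of (HW, HW) matrices, one per frame transition;
        #                   row i = attention of patch i in frame t+1 over the
        #                   patches of frame t (rows normalized to sum to 1).
        # first_mask:       (HW,) soft mask for the first frame.
        masks = [first_mask]
        for affinity in frame_affinities:
            # Each patch in the new frame inherits the mask values of the
            # old-frame patches it attends to.
            masks.append(affinity @ masks[-1])
        return masks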

Now, text-to-image diffusion models require a text input to generate the image to begin with. From what I can gather, they limit themselves to semi-supervised video segmentation, where the first frame is already segmented by, say, a human or some other process.

They then run an "inversion" procedure which tries to generate text that causes the text-to-image diffusion model to segment the first frame as closely as possible to the provided segmentation.
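
In spirit this is like textual inversion: keep the diffusion model frozen and optimize a text embedding so that the attention-based segmentation of the first frame matches the given mask. A hypothetical sketch, where attention_segmentation(latent, text_emb) stands in for a readout like the one above and the embedding shape is just a CLIP-sized guess:

    import torch

    def invert_mask_to_text(first_frame_latent, target_mask, attention_segmentation,
                            n_steps=200, lr=1e-2):
        # The model stays frozen; only the pseudo-text embedding is optimized.
        text_emb = torch.randn(1, 77, 768, requires_grad=True)  # CLIP-sized guess
        opt = torch.optim.Adam([text_emb], lr=lr)

        for _ in range(n_steps):
            pred = attention_segmentation(first_frame_latent, text_emb)  # (HW,) in [0, 1]
            loss = torch.nn.functional.binary_cross_entropy(pred, target_mask)
            opt.zero_grad()
            loss.backward()
            opt.step()

        return text_emb.detach()  # this "text" is what gets tracked through the video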

With the text in hand, they can then run the earlier segmentation propagation steps to track the segmented object throughout the video.

The key here is that the text-to-image diffusion model is pretrained, and not fine-tuned for this task.

That said, I'm no expert.

jacquesm 3 hours ago

For a 'not an expert' explanation you did a better job than the original paper.

nicolailolansen an hour ago

Bravo!