It takes a text prompt along with the image input; dancing is presumably just what they used for the examples.