stefl14 10 hours ago

First model I've seen that was consistently compositional, easily handling requests like

“Generate an image of an african elephant painted in the New England flag, doing a backflip in front of the russian federal assembly.”

OpenAI made the biggest step change towards compositionality in image generation when they started directly generating image tokens for decoders from foundation LLMs, and it worked very well. OpenAI's images were better in this regard than Nano Banana 1, but struggled with some OOD images like elephants doing backflips. Nano Banana 2 nails this stuff in a way I haven't seen anywhere else.

If video follows the same trends as images in terms of prompt adherence, that will be very valuable... and interesting.