vunderba | 7 days ago
I've updated the GenAI Image comparison site (which focuses heavily on strict text-to-image prompt adherence) to reflect the new Google Gemini 2.5 Flash model (aka nano-banana). https://genai-showdown.specr.net

This model gets 8 of the 12 prompts correct, comes within striking distance of the best-in-class models Imagen and gpt-image-1, and is a significant upgrade over the old Gemini Flash 2.0 model. The reigning champ, gpt-image-1, only manages to edge out Flash 2.5 on the maze and the 9-pointed star.

What's honestly most astonishing to me is how long gpt-image-1 has remained at the top of the class - closing in on half a year, which is basically a lifetime in this field.

Though fair warning: gpt-image-1 is borderline useless as an "editor," since it almost always changes the whole image instead of doing localized inpainting-style edits like Kontext, Qwen, or Nano-Banana.

Comparison of gpt-image-1, flash, and imagen: https://genai-showdown.specr.net?models=OPENAI_4O%2CIMAGEN_4...
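A quick way to sanity-check that "changes the whole image" claim yourself is to diff an edited output against its source and see what fraction of pixels actually moved. A minimal sketch with Pillow and NumPy (file names are placeholders, and the two images need matching dimensions); a true inpainting-style edit should leave most of the frame untouched:

    import numpy as np
    from PIL import Image

    # Placeholder file names: the source image and the model's edited output.
    before = np.asarray(Image.open("original.png").convert("RGB"), dtype=np.int16)
    after = np.asarray(Image.open("edited.png").convert("RGB"), dtype=np.int16)

    # A pixel counts as "changed" if any channel moves by more than a small
    # threshold (to ignore re-encoding noise).
    changed = np.abs(after - before).max(axis=-1) > 8
    print(f"{changed.mean():.1%} of pixels changed")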
bla3 | 7 days ago
Why do Hunyuan, OpenAI 4o, and Qwen get a pass on the octopus test? They don't cover "each tentacle," just some. And Midjourney covers 9 of 8 arms with sock puppets.
bn-l | 7 days ago
You need a separate benchmark for editing, of course.
cubefox | 7 days ago
What's interesting is that Imagen 4 and Gemini 2.5 Flash Image look suspiciously similar in several of these test cases. Maybe Gemini 2.5 Flash first calls Imagen in the background to get a detailed baseline image (diffusion models are good at this), and then Gemini edits the resulting image for better prompt adherence.
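If that were the case, the pipeline would look roughly like the sketch below. To be clear, this is pure speculation about what might happen internally, not a documented architecture; it assumes the google-genai Python SDK, the model IDs are best guesses, and the prompt is made up:

    from google import genai
    from google.genai import types

    client = genai.Client()  # picks up the API key from the environment

    prompt = "a nine-pointed star drawn in chalk on a blackboard"

    # Stage 1 (hypothetical): a diffusion model (Imagen) produces a detailed
    # baseline image.
    base = client.models.generate_images(
        model="imagen-4.0-generate-001",  # model ID is a guess
        prompt=prompt,
        config=types.GenerateImagesConfig(number_of_images=1),
    ).generated_images[0].image

    # Stage 2 (hypothetical): Gemini edits that baseline for better adherence.
    edited = client.models.generate_content(
        model="gemini-2.5-flash-image-preview",  # model ID is a guess
        contents=[
            f"Edit this image so it exactly matches the prompt: {prompt}",
            types.Part.from_bytes(data=base.image_bytes, mime_type="image/png"),
        ],
    )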
MrOrelliOReilly | 6 days ago
This is incredibly useful! I was manually generating my own model comparisons last night, so it's great to see this :)

I will note that, personally, while adherence is a useful measure, it does miss some of the qualitative differences between models. For your "spheron" test, for example, you note that "4o absolutely dominated this test," but the image exhibits all the hallmarks of a ChatGPT-generated image that I personally dislike (yellow, with veiny, almost impasto brush strokes). I have stopped using ChatGPT for image generation altogether because I find the style so awful.

I wonder what objective measures one could track for "style"? It reminds me a bit of ChatGPT vs Claude for software development... regardless of how each scores on benchmarks, Claude has been a clear winner in terms of actual results.
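One crude objective proxy for the yellow-cast complaint (my own suggestion, not something the site tracks): compare the mean red and green channels against blue, since a warm tint shows up as R and G running well above B. A minimal sketch with Pillow and NumPy, using a placeholder file name:

    import numpy as np
    from PIL import Image

    img = np.asarray(Image.open("generation.png").convert("RGB"), dtype=np.float64)
    r, g, b = img[..., 0].mean(), img[..., 1].mean(), img[..., 2].mean()

    # Warm/yellow casts push red and green above blue; a neutral image
    # keeps this ratio near 1.0.
    warmth = (r + g) / (2 * b + 1e-9)
    print(f"warmth ratio: {warmth:.2f}")

It obviously says nothing about the veiny brush-stroke texture, but it would flag the consistent yellow tint across a batch of generations.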
gundmc | 7 days ago
> Though fair warning, gpt-image-1 is borderline useless as an "editor" since it almost always changes the whole image instead of doing localized inpainting-style edits like Kontext, Qwen, or Nano-Banana.

Came into this thread looking for this post. It's a great way to compare prompt adherence across models. Have you considered adding editing capabilities in a similar way, given the recent trend of inpainting-style prompting?
jay_kyburz | 7 days ago
I really like your site. Do you know of any similar sites that compare how well the various models can adhere to a style guide? Perhaps you could add this? I.e., provide the model with a collection of drawings in a single style, then follow prompts and generate images in that same style. For example, if you wanted to illustrate a book and have all the illustrations look like they were from the same artist.
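Scoring something like that automatically is tricky, but one rough approach (my assumption, not an existing feature of the site) is to embed the reference drawings and each generated illustration with a CLIP model and check how close they sit in embedding space. A minimal sketch with sentence-transformers, using placeholder file names:

    from PIL import Image
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("clip-ViT-B-32")

    # Placeholder paths: a few reference drawings in the target style,
    # plus one candidate generation to score.
    refs = [Image.open(p) for p in ["ref1.png", "ref2.png", "ref3.png"]]
    candidate = Image.open("generated.png")

    ref_embs = model.encode(refs, convert_to_tensor=True)
    cand_emb = model.encode(candidate, convert_to_tensor=True)

    # Average cosine similarity to the reference set as a crude
    # style-consistency score (higher is more consistent).
    score = util.cos_sim(cand_emb, ref_embs).mean().item()
    print(f"style consistency: {score:.3f}")

CLIP embeddings conflate subject matter with style, so this is only a rough proxy; Gram-matrix or dedicated style-embedding approaches would separate the two better.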
mrcwinn | 6 days ago
I really enjoyed reviewing this! Good work. |