vunderba | 7 days ago
I've updated the GenAI Image comparison site (which focuses heavily on strict text-to-image prompt adherence) to reflect the new Google Gemini 2.5 Flash model (aka nano-banana). https://genai-showdown.specr.net

This model gets 8 of the 12 prompts correct, comes within striking distance of the best-in-class models Imagen and gpt-image-1, and is a significant upgrade over the old Gemini Flash 2.0 model. The reigning champ, gpt-image-1, only manages to edge out Flash 2.5 on the maze and the 9-pointed star.

What's honestly most astonishing to me is how long gpt-image-1 has remained at the top of the class - closing in on half a year, which is basically a lifetime in this field.

Though fair warning: gpt-image-1 is borderline useless as an "editor," since it almost always changes the whole image instead of doing localized inpainting-style edits like Kontext, Qwen, or Nano-Banana.

Comparison of gpt-image-1, flash, and imagen: https://genai-showdown.specr.net?models=OPENAI_4O%2CIMAGEN_4...
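A quick way to sanity-check that "changes the whole image" claim yourself is to diff an edited output against its source and see what fraction of pixels actually moved. A minimal sketch with Pillow and NumPy (file names are placeholders, and the two images need matching dimensions); a true inpainting-style edit should leave most of the frame untouched:

    import numpy as np
    from PIL import Image

    # Placeholder file names: the source image and the model's edited output.
    before = np.asarray(Image.open("original.png").convert("RGB"), dtype=np.int16)
    after = np.asarray(Image.open("edited.png").convert("RGB"), dtype=np.int16)

    # A pixel counts as "changed" if any channel moves by more than a small
    # threshold (to ignore re-encoding noise).
    changed = np.abs(after - before).max(axis=-1) > 8
    print(f"{changed.mean():.1%} of pixels changed")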
bla3 | 7 days ago
Why do Hunyuan, OpenAI 4o, and Qwen get a pass on the octopus test? They don't cover "each tentacle," just some. And Midjourney covers 9 of 8 arms with sock puppets.
bn-l | 7 days ago
You need a separate benchmark for editing, of course.
cubefox | 7 days ago
What's interesting is that Imagen 4 and Gemini 2.5 Flash Image look suspiciously similar in several of these test cases. Maybe Gemini 2.5 Flash first calls Imagen in the background to get a detailed baseline image (diffusion models are good at this), and then Gemini edits the resulting image for better prompt adherence.
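If that were the case, the pipeline would look roughly like the sketch below. To be clear, this is pure speculation about what might happen internally, not a documented architecture; it assumes the google-genai Python SDK, the model IDs are best guesses, and the prompt is made up:

    from google import genai
    from google.genai import types

    client = genai.Client()  # picks up the API key from the environment

    prompt = "a nine-pointed star drawn in chalk on a blackboard"

    # Stage 1 (hypothetical): a diffusion model (Imagen) produces a detailed
    # baseline image.
    base = client.models.generate_images(
        model="imagen-4.0-generate-001",  # model ID is a guess
        prompt=prompt,
        config=types.GenerateImagesConfig(number_of_images=1),
    ).generated_images[0].image

    # Stage 2 (hypothetical): Gemini edits that baseline for better adherence.
    edited = client.models.generate_content(
        model="gemini-2.5-flash-image-preview",  # model ID is a guess
        contents=[
            f"Edit this image so it exactly matches the prompt: {prompt}",
            types.Part.from_bytes(data=base.image_bytes, mime_type="image/png"),
        ],
    )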
MrOrelliOReilly | 6 days ago
This is incredibly useful! I was manually generating my own model comparisons last night, so it's great to see this :)

I will note that, personally, while adherence is a useful measure, it does miss some of the qualitative differences between models. For your "spheron" test, for example, you note that "4o absolutely dominated this test," but the image exhibits all the hallmarks of a ChatGPT-generated image that I personally dislike (yellow, with veiny, almost impasto brush strokes). I have stopped using ChatGPT for image generation altogether because I find the style so awful.

I wonder what objective measures one could track for "style"? It reminds me a bit of ChatGPT vs Claude for software development... regardless of how each scores on benchmarks, Claude has been a clear winner in terms of actual results.
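One crude objective proxy for the yellow-cast complaint (my own suggestion, not something the site tracks): compare the mean red and green channels against blue, since a warm tint shows up as R and G running well above B. A minimal sketch with Pillow and NumPy, using a placeholder file name:

    import numpy as np
    from PIL import Image

    img = np.asarray(Image.open("generation.png").convert("RGB"), dtype=np.float64)
    r, g, b = img[..., 0].mean(), img[..., 1].mean(), img[..., 2].mean()

    # Warm/yellow casts push red and green above blue; a neutral image
    # keeps this ratio near 1.0.
    warmth = (r + g) / (2 * b + 1e-9)
    print(f"warmth ratio: {warmth:.2f}")

It obviously says nothing about the veiny brush-stroke texture, but it would flag the consistent yellow tint across a batch of generations.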
gundmc | 7 days ago
> Though fair warning, gpt-image-1 is borderline useless as an "editor" since it almost always changes the whole image instead of doing localized inpainting-style edits like Kontext, Qwen, or Nano-Banana.

Came into this thread looking for this post. It's a great way to compare prompt adherence across models. Have you considered adding editing capabilities in a similar way, given the recent trend of inpainting-style prompting?
jay_kyburz | 7 days ago
I really like your site. Do you know of any similar sites that compare how well the various models can adhere to a style guide? Perhaps you could add this? I.e., provide the model with a collection of drawings in a single style, then follow prompts and generate images in that same style. For example, if you wanted to illustrate a book and have all the illustrations look like they were from the same artist.
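Scoring something like that automatically is tricky, but one rough approach (my assumption, not an existing feature of the site) is to embed the reference drawings and each generated illustration with a CLIP model and check how close they sit in embedding space. A minimal sketch with sentence-transformers, using placeholder file names:

    from PIL import Image
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("clip-ViT-B-32")

    # Placeholder paths: a few reference drawings in the target style,
    # plus one candidate generation to score.
    refs = [Image.open(p) for p in ["ref1.png", "ref2.png", "ref3.png"]]
    candidate = Image.open("generated.png")

    ref_embs = model.encode(refs, convert_to_tensor=True)
    cand_emb = model.encode(candidate, convert_to_tensor=True)

    # Average cosine similarity to the reference set as a crude
    # style-consistency score (higher is more consistent).
    score = util.cos_sim(cand_emb, ref_embs).mean().item()
    print(f"style consistency: {score:.3f}")

CLIP embeddings conflate subject matter with style, so this is only a rough proxy; Gram-matrix or dedicated style-embedding approaches would separate the two better.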
mrcwinn | 6 days ago
I really enjoyed reviewing this! Good work. |