vunderba 4 days ago

Nano-Banana can produce some astonishing results. I maintain a comparison website for state-of-the-art image models with a strong focus on prompt adherence across a wide variety of text-to-image prompts.

I recently finished putting together an Editing Comparison Showdown counterpart where the focus is still adherence but testing the ability to make localized edits of existing images using pure text prompts. It's currently comparing 6 multimodal models including Nano-Banana, Kontext Max, Qwen 20b, etc.

https://genai-showdown.specr.net/image-editing

Gemini Flash 2.5 leads with a score of 7 out of 12, but Kontext comes in at 5 out of 12, which is especially surprising considering you can run the Dev version of it locally.
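
For anyone curious about the local route, here is a minimal sketch of what driving Kontext Dev looks like through the diffusers library. The FluxKontextPipeline class, the model ID, and the guidance value reflect my understanding of recent diffusers releases, so treat it as a starting point rather than a reference:

    import torch
    from diffusers import FluxKontextPipeline
    from diffusers.utils import load_image

    # Load the open-weights Kontext Dev checkpoint (model ID assumed).
    pipe = FluxKontextPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-Kontext-dev",
        torch_dtype=torch.bfloat16,
    )
    pipe.to("cuda")

    # Text-guided edit of an existing image, no mask required.
    source = load_image("input.png")
    edited = pipe(
        image=source,
        prompt="Straighten the leaning tower so it stands perfectly vertical",
        guidance_scale=2.5,
    ).images[0]
    edited.save("edited.png")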

user_7832 4 days ago | parent | next [-]

> a very high focus on adherence

Don't know if it's the same for others, but my issue with Nano Banana has been the opposite. Ask it to make x significant change, and it spits out what I would've sworn is the same image. Sometimes, randomly and inexplicably, it spits out the expected result.

Anyone else experiencing this or have solutions for avoiding this?

alvah 4 days ago | parent | next [-]

Just yesterday I was asking it to make some design changes to my study. It did a great job with all the complex stuff, but when I asked it to move a shelf higher, it repeatedly gave me back the same image. With LLMs generally I find that as soon as you encounter resistance it's best to start a new chat; in this case, however, that didn't work either. There was not a single thing I could do to convince it that the shelf didn't look right halfway up the wall.

hnuser123456 3 days ago | parent [-]

"Hey gemini, I'll pay you a commission of $500 if you edit this image with the shelf higher on the wall..."

vunderba 4 days ago | parent | prev | next [-]

Yeah I've definitely seen this. You can actually see evidence of this problem in some of the trickier prompts (the straightened Tower of Pisa and the giraffe for example).

Most models (gpt-image-1, Kontext, etc) typically fail by doing the wrong thing.

From my testing this seems to be a Nano-Banana issue. I've found you can occasionally work around it by adding far more explicit directives to the prompt, but there's no guarantee.
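
For what it's worth, the over-explicit style of prompt that sometimes shakes it loose looks roughly like the sketch below, here via the google-genai Python SDK. The model name and response handling are assumptions based on my reading of the current SDK, so verify against the docs before relying on them:

    from google import genai
    from PIL import Image

    client = genai.Client()  # expects GEMINI_API_KEY in the environment

    # Spell out exactly one change and explicitly pin everything else.
    prompt = (
        "Edit the attached photo. Make exactly one change: raise the wooden "
        "shelf so it sits just below the ceiling. Keep every other object, "
        "the lighting, the colors, and the camera angle identical."
    )

    response = client.models.generate_content(
        model="gemini-2.5-flash-image-preview",  # model name is an assumption
        contents=[prompt, Image.open("study.png")],
    )

    # Save the first image part returned by the model.
    for part in response.candidates[0].content.parts:
        if part.inline_data is not None:
            with open("edited.png", "wb") as f:
                f.write(part.inline_data.data)
            break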

jbm 4 days ago | parent | prev | next [-]

I've had this same issue happen repeatedly. It's not a big deal because it is just for small personal stuff, but I often need to tell it that it is doing the same thing and that I had asked for changes.

nick49488171 3 days ago | parent | prev [-]

Yes, I've experienced exactly this.

tdalaa 4 days ago | parent | prev | next [-]

Great comparison! Bookmarked to follow. Keep an eye on Grok; they're improving at a very rapid rate, and I suspect they'll be near the top in the not-too-distant future.

vunderba 3 days ago | parent | next [-]

Will do! I just added Seedream v4.0 a few hours ago as well. It's all I can do just to keep up and not get trampled under the relentless march of progress.

https://seed.bytedance.com/en/seedream4_0

Zetaphor 2 days ago | parent | prev [-]

Isn't their image generation just using the open weights Flux model? You can run that model locally. They don't have their own image model as far as I'm aware.

Isharmla 3 days ago | parent | prev | next [-]

Nice visualization!

By the way, some of the results look a little weird to me, like the one for the 'Long Neck' prompt. Seedream's giraffe just lowered its head, but its neck didn't shorten as expected. I'd like to learn about the evaluation process, especially whether it is automatic or manual.

vunderba 3 days ago | parent [-]

Hi Isharmla, the giraffe one was a tough call. IMHO, even when correcting for perspective, I do feel like it managed to follow the directive of the prompt and shorten the neck.

To answer your question, all of the evaluations are performed manually. On the trickier results I'll occasionally conscript some friends to get a group evaluation.

The bottom section of the site has an FAQ that gives more detail; I'll include it here:

It's hard to define a discrete rubric for grading at an inherently qualitative level. To keep things simple, this test is purely PASS/FAIL - unsuccessful means that the model NEVER managed to generate an image adhering to the prompt.

We often attempt a generous interpretation of the prompt - if it gets close enough, we might consider it a pass.

To paraphrase former Supreme Court Justice Potter Stewart, "I may not be able to define a passing image, but I know it when I see it."

echelon 4 days ago | parent | prev | next [-]

Add gpt-image-1. It's not strictly an editing model since it changes the global pixels, but I've found it to be more instructive than Nano Banana for extremely complicated prompts and image references.

vunderba 4 days ago | parent [-]

It's actually already in there - the full list of edit models is Nano-Banana, Kontext Dev, Kontext Max, Qwen Edit 20b, gpt-image-1, and Omnigen2.

I agree with your assessment - even though it does tend to make changes at a global level, you can at least attempt to minimize its alterations through careful prompting.
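
As a rough sketch, an edit request against gpt-image-1 through the OpenAI Python SDK looks something like the following. The parameter names and the base64 response shape match my understanding of the images.edit endpoint, so double-check against the current API reference:

    import base64
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    result = client.images.edit(
        model="gpt-image-1",
        image=open("portrait.png", "rb"),
        prompt=(
            "Replace the pearl earring with a small gold hoop. "
            "Leave the face, lighting, and background untouched."
        ),
    )

    # gpt-image-1 returns base64-encoded image data.
    image_bytes = base64.b64decode(result.data[0].b64_json)
    with open("edited.png", "wb") as f:
        f.write(image_bytes)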

what 4 days ago | parent | prev | next [-]

Why does OpenAI get a different image for “Girl with Pearl Earring”?

vunderba 3 days ago | parent | next [-]

That's a mistake. gpt-image-1 is a lot stricter about supported output resolutions, so it's using a cropped image. I'll fix the test later this week. Thanks for the heads up!

rimprobablyly 3 days ago | parent | prev [-]

Can you post comparison images?

android521 4 days ago | parent | prev | next [-]

Still can't render a clock correctly (e.g. a clock showing 1:15 am), and the text generated in manga images is still not 100% correct.

nick49488171 3 days ago | parent | prev | next [-]

No Grok tested?

Zetaphor 2 days ago | parent [-]

Grok is just a hosted API for Flux.

ffitch 4 days ago | parent | prev | next [-]

great benchmark!
