MrOrelliOReilly 6 days ago

This is incredibly useful! I was manually generating my own model comparisons last night, so great to see this :)

I will note that, personally, while adherence is a useful measure, it does miss some of the qualitative differences between models. For your "spheron" test for example, you note that "4o absolutely dominated this test," but the image exhibits all the hallmarks of a ChatGPT-generated image that I personally dislike (yellow, with veiny, almost impasto brush strokes). I have stopped using ChatGPT for image generation altogether because I find the style so awful. I wonder what objective measures one could track for "style"?
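One crude candidate for an objective "style" number, purely as a sketch: measure how far an image leans yellow by comparing the red and green channels against blue. The function name `yellow_cast_score` and the formula are my own assumptions, not an established metric, and this captures only the color cast, not the brush-stroke texture.

```python
def yellow_cast_score(pixels):
    """Hypothetical metric: mean of ((R + G) / 2 - B) / 255 over all pixels.

    `pixels` is an iterable of (R, G, B) tuples with values in 0..255.
    The result is in [-1, 1]: 0 for neutral gray, positive when the image
    leans yellow, negative when it leans blue.
    """
    pixels = list(pixels)
    if not pixels:
        raise ValueError("need at least one pixel")
    total = sum((r + g) / 2 - b for r, g, b in pixels)
    return total / (255 * len(pixels))
```

Tracking this score across the same prompt for each model would at least show whether one model is consistently warmer than the others.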

It reminds me a bit of ChatGPT vs Claude for software development... Regardless of how each scores on benchmarks, Claude has been a clear winner in terms of actual results.

vunderba 6 days ago

Yeah - unfortunately the ubiquitous "piss filter" strikes again. You pretty much have to pass GPT-image-1 through a tone map, LUT, etc. in something like Krita or Photoshop to try to mitigate this. I'm honestly a bit surprised that they haven't built this in already given how obvious the color shift is.
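For anyone who wants to script the correction instead of doing it by hand, here is a minimal sketch using gray-world white balance. To be clear, this is a different and much simpler technique than a tone map or LUT in Krita or Photoshop, and `gray_world_balance` is a name made up for illustration. It assumes the image's average color should be neutral gray, which will not hold for every image.

```python
def gray_world_balance(pixels):
    """Gray-world white balance over a list of (R, G, B) tuples (0..255).

    Each channel is scaled so its mean matches the mean of all three
    channel means, which pulls a uniform color cast back toward neutral.
    """
    pixels = list(pixels)
    if not pixels:
        return []
    n = len(pixels)
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    gray = sum(means) / 3
    # Per-channel gain; a channel that is entirely zero is left untouched.
    gains = [gray / m if m > 0 else 1.0 for m in means]
    return [
        tuple(min(255, max(0, round(p[c] * gains[c]))) for c in range(3))
        for p in pixels
    ]
```

A yellowish pixel such as `(200, 180, 100)` on its own comes out as `(160, 160, 160)`, while an already neutral image passes through unchanged.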