minimaxir 20 hours ago

Pricing-wise, this API is going to be hard to justify unless you can really get value out of providing references. A generated `medium` 1024x1024 is $0.04/image, which is in the same cost class as Imagen 3 and Flux 1.1 Pro. Testing from their new playground (https://platform.openai.com/playground/images), the medium images are indeed lower quality than either of those two competitor models and still take 15+ seconds to generate: https://x.com/minimaxir/status/1915114021466017830
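
For reference, the medium-quality call looks roughly like this with the Python SDK (a minimal sketch; gpt-image-1 returns base64 rather than URLs, per the current docs):

    import base64
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    result = client.images.generate(
        model="gpt-image-1",
        prompt="a pelican riding a bicycle, studio lighting",
        size="1024x1024",
        quality="medium",  # ~$0.04/image at launch pricing
    )

    # gpt-image-1 returns the image as base64-encoded JSON
    with open("out.png", "wb") as f:
        f.write(base64.b64decode(result.data[0].b64_json))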

Prompting the model is also substantially different from, and more difficult than, prompting traditional models, which is unsurprising given how the model works. The traditional image-prompting tricks don't work out of the box, and I'm struggling to get something that works without significant prompt augmentation (which is what I suspect was used for the ChatGPT image generations).

raincole 19 hours ago | parent | next [-]

ChatGPT's prompt adherence is light years ahead of all the others. I wouldn't even call Flux/Midjourney its competitors. ChatGPT image gen is practically a one-of-a-kind product on the market: the only usable AI image editor for people without image editing experience.

I think in terms of image generation, ChatGPT is the biggest leap since Stable Diffusion's release. LoRA/ControlNet/Flux are forgettable in comparison.

thegeomaster 18 hours ago | parent | next [-]

Well, there's also gemini-2.0-flash-exp-image-generation. Also autoregressive/transfusion based.

thefourthchime 18 hours ago | parent | next [-]

Such a good name....

Yiling-J 14 hours ago | parent | prev | next [-]

gemini-2.0-flash-exp-image-generation doesn’t perform as well as GPT-4o's image generation, as mentioned in section 5.1 of this paper: https://arxiv.org/pdf/2504.02782. However, based on my tests, for certain types of images, such as realistic recipe images, the results are quite good. You can see some examples here: https://github.com/Yiling-J/tablepilot/tree/main/examples/10...

raincole 11 hours ago | parent | prev | next [-]

It's quite bad now, but I have no doubt that Google will catch up.

The AI field looks awfully like {OpenAI, Google, The Irrelevant}.

yousif_123123 17 hours ago | parent | prev | next [-]

It's also good, but clearly still not close. Maybe Gemini 2.5 or 3 will have better image gen.

swyx 13 hours ago | parent | prev [-]

> transfusion based.

what is that?

echelon 14 hours ago | parent | prev | next [-]

I'd go out on a limb and say that even your praise of gpt-image-1 is underselling its true potential. This model is as remarkable as when ChatGPT first entered the market. People are sleeping on its capabilities. It's a replacement for ComfyUI and potentially most of Adobe in time.

Now for the bad part: I don't think Black Forest Labs, StabilityAI, MidJourney, or any of the others can compete with this. They probably don't have the money to train something this large and sophisticated. We might be stuck with OpenAI and Google (soon) for providing advanced multimodal image models.

Maybe we'll get lucky and one of the large Chinese tech companies will drop a model with this power. But I doubt it.

This might be the first OpenAI product with an extreme moat.

raincole 12 hours ago | parent [-]

> Now for the bad part: I don't think Black Forest Labs, StabilityAI, MidJourney, or any of the others can compete with this.

Yeah. I'm a tad sad about it. I once thought the SD ecosystem proved that open source had won when it comes to image gen (a naive idea, I know). It turns out big corps won hard in this regard.

soared 19 hours ago | parent | prev [-]

This is a take so incredible it doesn’t seem credible.

stavros 18 hours ago | parent | next [-]

I can confirm, ChatGPT's prompt adherence is so incredibly good, it gets even really small details right, to a level that diffusion-based generators couldn't even dream of.

mediaman 18 hours ago | parent | prev | next [-]

It is correct; the shift from diffusion to transformers is a very, very big difference.

abhpro 15 hours ago | parent | prev | next [-]

Also chiming in to say you're wrong, I mean they're correct

tacoooooooo 19 hours ago | parent | prev [-]

it's 100% the correct take

fkyoureadthedoc 19 hours ago | parent [-]

yeah this is my personal experience. The new image generation is the only reason I keep an OpenAI subscription rather than switching to Google.

adamhowell 19 hours ago | parent | prev | next [-]

So, I've long dreamed of building an AI-powered https://iconfinder.com.

I started Accomplice v1 back in 2021 with this goal in mind and raised some VC money but it was too early.

Now, with these latest imagen-3.0-generate-002 (Gemini) and gpt-image-1 (OpenAI) models – especially this API release from OpenAI – I've been able to resurrect Accomplice as a little side project.

Accomplice v2 (https://accomplice.ai) is just getting started back up again (I honestly decided to rebuild it only a couple weeks ago, in preparation for today, once I saw ChatGPT's new image model), but so far there are thousands of PNGs free to download, and any SVGs that have already been vectorized are free too (vectorizing costs a credit).

I generate new icons every few minutes from a huge list of "useful icons" I've built. It will be 100% pay-as-you-go. And for a credit, paid users can vectorize any PNGs they like, tweak them using AI, upload their own images to vectorize and download, or create their own icons (with my prompt injections baked in to get you good icon results).

Do multi-modal models make something like this obsolete? I honestly am not sure. In my experience with Accomplice v1, a lot of users didn't know what to do with a blank textarea, so the thinking here is there's value in doing some of the work for them upfront with a large searchable archive. Would love to hear others' thoughts.

But I'm having fun again either way.

stavros 18 hours ago | parent | next [-]

That looks interesting, but I don't know how useful single icons can be. For me, the really useful part would be to get a suite of icons that all have a consistent visual style. Bonus points if I can prompt the model to generate more icons with that same style.

throwup238 18 hours ago | parent [-]

Recraft has a style feature where you give it some images. I wonder if that would work for icons. You could also try giving an image of a bunch of icons to ChatGPT and having it generate more, then vectorizing them.
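
With the new API, that second approach would look something like this (a sketch, assuming gpt-image-1's edit endpoint takes a reference image the way the docs describe; filenames and prompt are made up):

    import base64
    from openai import OpenAI

    client = OpenAI()

    # pass an existing sheet of icons as the reference image
    result = client.images.edit(
        model="gpt-image-1",
        image=open("icon_sheet.png", "rb"),
        prompt="Generate four more icons in exactly this visual style: "
               "calendar, paperclip, bell, and magnifying glass.",
    )

    with open("more_icons.png", "wb") as f:
        f.write(base64.b64decode(result.data[0].b64_json))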

vunderba 16 hours ago | parent | next [-]

Recraft's icon generator lets you do this.

https://imgur.com/a/BTzbsfh

It definitely captures the style - but any reasonably complicated prompt was beyond it.

stavros 18 hours ago | parent | prev [-]

I think the latter approach is the best bet right now, agree.

egypturnash 15 hours ago | parent | prev [-]

[flagged]

tough 20 hours ago | parent | prev | next [-]

It seems to me like this is a new hybrid product for -vibe coders- because otherwise the -wrapping- of prompting/improving a prompt with an LLM before hitting the text2image model can certainly, as you say, be done cheaper if you just run it yourself.

Maybe OpenAI thinks the model business is over and they need to start sherlocking all the way from the top down to final apps (thus their interest in buying out Cursor, finally ending up with Windsurf).

Idk, this feels like a new offering between a full raw API and a final product, where you abstract some of it for a few cents, and they're basically bundling their SOTA LLM models with their image models for extra margin.
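
The DIY version of that wrapping is just a couple of calls (a rough sketch; the model choice and prompt wording here are arbitrary):

    from openai import OpenAI

    client = OpenAI()

    def augment(prompt: str) -> str:
        # use a cheap LLM to expand a terse prompt before
        # sending it to whatever text2image model you prefer
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"Expand this into a detailed image generation prompt: {prompt}",
            }],
        )
        return resp.choices[0].message.content

    detailed = augment("a cozy reading nook")
    # ...then pass `detailed` to any cheaper image model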

vineyardmike 20 hours ago | parent [-]

> It seems to me like this is a new hybrid product for -vibe coders- because otherwise the -wrapping- of prompting/improving a prompt with an LLM before hitting the text2image model can certainly, as you say, be done cheaper if you just run it yourself.

In case you didn’t know, it’s not just wrapping in an LLM. The image model they’re referencing is a model that’s directly integrated into the LLM for functionality. It’s not possible to extract, because the LLM outputs tokens which are part of the image itself.

That said, they’re definitely trying to focus on building products over raw models now. They want to be a consumer subscription instead of commodity model provider.

tough 20 hours ago | parent | next [-]

Right! I forgot the new model is a multi-modal one, generating image outputs from both image and text inputs. I guess this is good, and the price will come down eventually.

Waiting for some FOSS multi-modal model to come out eventually too.

Great to see OpenAI expanding into making actual usable products, I guess.

spilldahill 19 hours ago | parent | prev [-]

yeah, the integration is the real shift here. by embedding image generation into the LLM’s token stream, it’s no longer a pipeline of separate systems but a single unified model interface. that unlocks new use cases where you can reason, plan, and render all in one flow. it’s not just about replacing diffusion models, it’s about making generation part of a broader agentic loop. pricing will drop over time, but the shift in how you build with this is the more interesting part.
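
to make that concrete, here's a toy illustration of the idea (everything here is hypothetical, nothing is OpenAI's actual API or architecture): one autoregressive decoder emits text tokens and image tokens in the same stream, so the "plan" directly conditions the "render".

    import random
    from dataclasses import dataclass

    # Toy sketch only: a fake "model" showing how a single
    # autoregressive decoder can interleave text and image tokens.

    @dataclass
    class Token:
        kind: str   # "text", "image", or "end"
        value: int

    class ToyUnifiedModel:
        def next_token(self, context):
            # a real model would run a transformer over `context`;
            # we just emit 3 text tokens, 4 image tokens, then stop
            n = len(context)
            if n < 3:
                return Token("text", random.randrange(50_000))
            if n < 7:
                return Token("image", random.randrange(8_192))  # e.g. a VQ codebook index
            return Token("end", 0)

    model = ToyUnifiedModel()
    stream = []
    while not stream or stream[-1].kind != "end":
        stream.append(model.next_token(stream))

    # text and image tokens share one sequence, so earlier "reasoning"
    # tokens directly condition the image patches that follow
    print([t.kind for t in stream])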

furyofantares 19 hours ago | parent | prev | next [-]

I find prompting the model substantially easier than traditional models, is it really more difficult or are you just used to traditional models?

I suspect what I'll do with the API is iterate at medium quality and then generate a high quality image when I'm done.
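
Something like this, I imagine (a minimal sketch with the Python SDK, assuming quality stays a simple parameter):

    import base64
    from openai import OpenAI

    client = OpenAI()

    def render(prompt: str, quality: str) -> bytes:
        result = client.images.generate(
            model="gpt-image-1",
            prompt=prompt,
            size="1024x1024",
            quality=quality,  # "low", "medium", or "high"
        )
        return base64.b64decode(result.data[0].b64_json)

    # iterate cheaply at medium quality...
    draft = render("isometric pixel-art coffee shop at night", "medium")

    # ...then pay for high quality only once the prompt is dialed in
    final = render("isometric pixel-art coffee shop at night", "high")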

vunderba 18 hours ago | parent | prev | next [-]

> Prompting the model is also substantially more different and difficult than traditional models

Can you elaborate? This was not my experience - retesting the prompts that I used for my GenAI image shootout against gpt-image-1 API proved largely similar.

https://genai-showdown.specr.net

simonw 19 hours ago | parent | prev | next [-]

It may lose against other models on prompt-to-image, but I'd be very excited to see another model that's as good as this one at image+prompt-to-image. Editing photos with ChatGPT over the past few weeks has been SO much fun.

Here's my dog in a pelican costume: https://bsky.app/profile/simonwillison.net/post/3lneuquczzs2...

steve_adams_86 19 hours ago | parent [-]

The dog ChatGPT generated doesn't actually look like your dog. The eyes are so different. Really cute image, though.

thot_experiment 19 hours ago | parent | prev | next [-]

Similarly to how 90% of my LLM needs are met by Mistral 3.1, there's no reason to use 4o for most t2i or i2i. However, there's a definite set of tasks that are not possible with diffusion models, or that require a giant ball of node spaghetti in ComfyUI to achieve. The price is high, but the likelihood of getting the right answer on the first try is absolutely worth the cost imo.

varenc 19 hours ago | parent | prev | next [-]

pretty amazing that in ~two years we've reached the point where a 15-second-latency AI image generation API that costs 4 cents per image counts as lagging behind competitors.

echelon 14 hours ago | parent [-]

This product does not lag behind competitors. Once you take the time to understand how it works, it's clear that this is an order of magnitude more powerful than anything else on the market.

While there's a market need for fast diffusion, that's already been filled and is now a race to the bottom. There's nobody else that can do what OpenAI does with gpt-image-1. This model is a truly programmable graphics workflow engine. And this type of model has so much more value than mere "image generation".

gpt-image-1 replaces ComfyUI, inpainting/outpainting, and LoRAs, and in time one could imagine it replacing Adobe Photoshop and nearly all the things people use it for. It's an image manipulation engine, not just a diffusion model. It understands what you want on the first try, and it does a remarkably good job at it.

gpt-image-1 is a graphics design department in a box.

Please don't think of this as a model where you prompt things like "a dog and a cat hugging". This is so much more than that.
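
For example, classic inpainting is just an API call now (a sketch; the transparent-mask convention is from OpenAI's docs, but double-check the current parameters before relying on it):

    import base64
    from openai import OpenAI

    client = OpenAI()

    # replace only the masked region of an existing photo
    result = client.images.edit(
        model="gpt-image-1",
        image=open("photo.png", "rb"),
        mask=open("mask.png", "rb"),   # transparent pixels mark the editable area
        prompt="replace the cloudy sky with a dramatic sunset",
    )

    with open("edited.png", "wb") as f:
        f.write(base64.b64decode(result.data[0].b64_json))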

Sohcahtoa82 19 hours ago | parent | prev | next [-]

> A generated `medium` 1024x1024 is $0.04/image

It's actually more than that. It's about 16.7 cents per image.

$0.04/image is the pricing for DALL-E 3.

mkl 18 hours ago | parent | next [-]

16.7 cents is the high quality cost, and medium is 4.2 cents: https://platform.openai.com/docs/pricing#:~:text=1M%20charac...
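
And the token math on that page checks out (a quick sketch using the launch figures; the per-tier token counts could change):

    # launch pricing: $40 per 1M image output tokens
    price_per_token = 40.00 / 1_000_000

    # approximate output tokens per 1024x1024 image, by quality tier
    medium_tokens = 1056
    high_tokens = 4160

    print(f"medium: ${medium_tokens * price_per_token:.3f}/image")  # ~$0.042
    print(f"high:   ${high_tokens * price_per_token:.3f}/image")    # ~$0.166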

Sohcahtoa82 17 hours ago | parent [-]

Ah, they changed that page since I saw it yesterday.

They didn't show low/med/high quality; they just said an image was a certain number of tokens, with a price per token that led to $0.16/image.

weird-eye-issue 19 hours ago | parent | prev [-]

No, it's not

doctorpangloss 20 hours ago | parent | prev | next [-]

It's far and away the most powerful image model right now. $0.04/image is a decent price!

arevno 19 hours ago | parent [-]

This is extremely domain-specific. Diffusion models work much better for certain things.

thot_experiment 19 hours ago | parent | next [-]

Can you cite an example? I'm really curious where that set of usecases lies.

koakuma-chan 19 hours ago | parent | next [-]

Explicit adult content.

thot_experiment 19 hours ago | parent [-]

False. That has nothing to do with the model architecture and everything to do with cloud inference providers wanting to avoid regulatory scrutiny.

echelon 19 hours ago | parent | prev [-]

I work in the space. There are a lot of use cases that get censored by OpenAI, Kling, Runway, and various other providers for a wide variety of reasons:

- OpenAI is notorious for blocking copyrighted characters. They do prompt keyword scanning, but also run a VLM on the results so you can't "trick" the model.

- Lots of providers block public figures and celebrities.

- Various providers block LGBT imagery, even safe for work prompts. Kling is notorious for this.

- I was on a sales call with someone today who runs a father's advocacy group. I don't know what system he was using, but he said he found it impossible to generate an adult male with a child. In a totally safe for work context.

- Some systems block "PG-13" images of characters that are in bathing suits or scantily clad.

None of this is porn, mind you.

thot_experiment 19 hours ago | parent | next [-]

Sure but that has nothing to do with the model architecture and everything to do with the cloud inference providers wanting to cover their asses.

throwaway314155 19 hours ago | parent | prev [-]

What does any of that have to do with the distinction between diffusion vs. autoregressive models?

echelon 19 hours ago | parent | prev [-]

I don't think so. This model kills the need for Flux, ComfyUI, LoRAs, fine tuning, and pretty much everything that's come before it.

This is the god model in images right now.

I don't think open source diffusion models can catch up with this. From what I've heard, this model took a huge amount of money to train, more than even Black Forest Labs has access to.

thot_experiment 19 hours ago | parent | next [-]

ComfyUI supports 4o natively, so you get the best of both worlds. There is still so much you can't do with 4o, because there's a fundamental limit on the level of control you can have over image generation when your conditioning is just tokens in an autoregressive model. There's plenty of reason to use Comfy even if 4o is part of your workflow.

As for LoRAs, fine-tuning, and open source in general: if you've ever been to civit.ai, it should be immediately obvious why those things aren't going away.

18 hours ago | parent [-]
[deleted]
AuryGlenz 10 hours ago | parent | prev [-]

95% of what I do with image models is train LoRAs/finetunes of family and friends and create images of them.

Sure, I can ghiblify specific images of them on this model, but anything approaching realistic changes their looks. I've also done specific LoRAs for things that may or may not be in their training data, such as specific movies.

Wowfunhappy 16 hours ago | parent | prev [-]

Huh? For me the quality of the API seems to be identical to what I'm getting in ChatGPT.