yousif_123123 17 hours ago

It feels like a mix of both to me, from what I've been testing. For example, you can't get it to make a clock showing a custom time like 3:30, or someone writing with their left hand. And it can't follow many instructions at once or carry them out very precisely. But it shows that this kind of architecture will most likely be capable of that if scaled up.

jumploops 15 hours ago | parent [-]

These are great tests, thanks for sharing!

And you seem to be right, though the only reference I can find is in one of the example images of a whiteboard posted on the announcement[0].

It shows: tokens -> [transformer] -> [diffusion] pixels

hjups22 on Reddit[1] describes it as:

> It's a hybrid model. The AR component generates control embeddings that then get decoded by a diffusion model. But the control embeddings are accurate enough to edit and reconstruct the images surprisingly well.
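
To make that data flow concrete, here is a toy sketch of the two stages. It is only an illustration of "AR stage produces control embeddings, iterative decoder turns them into pixels", not how the real model is implemented. Every name here (`autoregressive_controls`, `diffusion_decode`) is hypothetical, and the "denoiser" is a hand-written stand-in for what would be a learned network.

```python
import random


def autoregressive_controls(prompt_tokens, n_controls=4, dim=3, seed=0):
    """Toy AR stage: emit control embeddings one at a time.

    Each new embedding depends on the prompt and on everything emitted
    so far (assumption: a running mean stands in for a transformer).
    """
    rng = random.Random(seed)
    context = [float(t) for t in prompt_tokens]
    controls = []
    for _ in range(n_controls):
        mean = sum(context) / len(context)
        emb = [mean + rng.gauss(0, 0.1) for _ in range(dim)]
        controls.append(emb)
        # Feed a summary of the new embedding back into the context.
        context.append(sum(emb) / dim)
    return controls


def diffusion_decode(controls, steps=50, step_size=0.2, seed=0):
    """Toy decoder: start from pure noise and refine it step by step.

    A real diffusion model would use a learned denoiser; here the
    stand-in simply pulls each value toward the conditioning signal.
    """
    rng = random.Random(seed)
    target = [v for emb in controls for v in emb]
    pixels = [rng.gauss(0, 1) for _ in target]
    for _ in range(steps):
        pixels = [p + step_size * (t - p) for p, t in zip(pixels, target)]
    return pixels


if __name__ == "__main__":
    controls = autoregressive_controls([1, 2, 3])
    pixels = diffusion_decode(controls)
    print(len(controls), len(pixels))
```

The point of the split is that the first stage decides what goes in the image sequentially, while the second stage only has to render that decision, which would fit with the control embeddings being "accurate enough to edit and reconstruct" images.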

[0] https://openai.com/index/introducing-4o-image-generation/

[1] https://www.reddit.com/r/MachineLearning/comments/1jkt42w/co...