jumploops · 8 months ago
These are great tests, thanks for sharing! And you seem to be right, though the only reference I can find is in one of the example images of a whiteboard posted with the announcement [0]. It shows:

  tokens -> [transformer] -> [diffusion] -> pixels

hjups22 on Reddit [1] describes it as:

> It's a hybrid model. The AR component generates control embeddings that then get decoded by a diffusion model. But the control embeddings are accurate enough to edit and reconstruct the images surprisingly well.

[0] https://openai.com/index/introducing-4o-image-generation/
[1] https://www.reddit.com/r/MachineLearning/comments/1jkt42w/co...
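To make that pipeline concrete, here is a toy PyTorch sketch of what such a hybrid could look like. To be clear, every module name, shape, and layer choice below is my own guess; OpenAI hasn't published the actual architecture, and the learned-query readout is just one plausible way an AR model could emit a fixed set of control embeddings.

  import torch
  import torch.nn as nn

  class ARControlModel(nn.Module):
      """Text tokens -> a fixed set of 'control embeddings' (all guesswork)."""
      def __init__(self, vocab=50_000, d_model=512, n_ctrl=64):
          super().__init__()
          self.embed = nn.Embedding(vocab, d_model)
          layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
          self.encoder = nn.TransformerEncoder(layer, num_layers=4)
          # Learned queries read out n_ctrl control vectors via attention.
          self.queries = nn.Parameter(torch.randn(n_ctrl, d_model))
          self.readout = nn.MultiheadAttention(d_model, 8, batch_first=True)

      def forward(self, tokens):                        # tokens: (B, T)
          # Causal mask keeps the trunk decoder-only / AR in style.
          mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
          h = self.encoder(self.embed(tokens), mask=mask)   # (B, T, D)
          q = self.queries.expand(tokens.size(0), -1, -1)
          ctrl, _ = self.readout(q, h, h)                   # (B, n_ctrl, D)
          return ctrl

  class DiffusionDecoder(nn.Module):
      """Predicts noise for an image, conditioned on the control embeddings."""
      def __init__(self, d_model=512, img_ch=3, cond_ch=64):
          super().__init__()
          self.cond = nn.Linear(d_model, cond_ch)
          self.net = nn.Sequential(
              nn.Conv2d(img_ch + cond_ch, 64, 3, padding=1), nn.SiLU(),
              nn.Conv2d(64, img_ch, 3, padding=1),
          )

      def forward(self, noisy, ctrl):                   # (B,3,H,W), (B,n,D)
          c = self.cond(ctrl.mean(dim=1))               # pool -> (B, cond_ch)
          c = c[:, :, None, None].expand(-1, -1, *noisy.shape[2:])
          return self.net(torch.cat([noisy, c], dim=1))

  # tokens -> [transformer] -> control embeddings -> [diffusion] -> pixels
  ar, decoder = ARControlModel(), DiffusionDecoder()
  ctrl = ar(torch.randint(0, 50_000, (1, 16)))
  x = torch.randn(1, 3, 64, 64)                         # start from pure noise
  for _ in range(4):                                    # toy denoising loop
      x = x - 0.1 * decoder(x, ctrl)

The interesting property the Reddit quote points at is that the control embeddings alone carry enough structure for edits: you can keep `ctrl` mostly fixed, perturb part of it, and re-run only the diffusion stage.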
yousif_123123 · 8 months ago
Yes. Also, when testing low vs. high quality, the difference seems to be mainly in the diffusion part: the structure of the image and the instruction-following ability are usually the same. Still, this is very exciting, now and for the future. It's still pretty expensive and slow, but it's moving in the right direction.
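For anyone who wants to reproduce the low-vs-high comparison, something like this against the gpt-image-1 endpoint (the API-side counterpart of 4o image generation, which may or may not be exactly what was tested here) should work; the prompt and filenames are arbitrary:

  import base64
  from openai import OpenAI

  client = OpenAI()  # reads OPENAI_API_KEY from the environment

  for quality in ("low", "high"):
      result = client.images.generate(
          model="gpt-image-1",
          prompt="a red bicycle leaning against a brick wall",
          quality=quality,
          size="1024x1024",
      )
      # gpt-image-1 returns base64-encoded image data
      with open(f"bicycle_{quality}.png", "wb") as f:
          f.write(base64.b64decode(result.data[0].b64_json))

Diffing the two outputs side by side is a quick way to check whether composition stays fixed while only the rendering fidelity changes.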